Separate character strings into rows and columns using tidyr
functions
separate_*
family of functions
Background
TIL, as of tidyr
1.3.0, there’s a new family of string separation functions:
separate_wider_delim()
separate_wider_position()
separate_wider_regex()
separate_longer_delim()
separate_longer_position()
These functions generally do two things:
- Separate strings into individual rows or columns.
- Separate strings by some character delimiter, position, or regular expression.
If you’ve used tidyr
in the past, you’re likely familiar with the separate()
function. This function was useful in cases where character strings needed to be separated into different columns based on a pattern. While re-reading the 2nd edition of the R for Data Science book for the Online Data Science Learning Community (check us out here), I was reminded separate()
was superseded by this family of functions. For myself, I decided writing a post was needed to better understand how to use these functions. In this post I’ll describe each function and build on what’s discussed in the book by sharing some examples I might use these functions.
separate_longer_delim()
Let’s start with separate_longer_delim()
. This function takes a character string and turns it into additional rows using a delimiter. We’ll need some example data.
I’m a college sports fan, so let’s create a dataset containing Big Ten and Big 12 conference teams. Here’s the code to create some example data:
big_ten <- str_c(
"Illinois",
"Indiana",
"Iowa",
"Maryland",
"Michigan",
"Michigan State",
"Minnesota",
"Nebraska",
"Northwestern",
"Ohio State",
"Oregon",
"Penn State",
"Purdue",
"Rutgers",
"UCLA",
"USC",
"Washington",
"Wisconsin",
sep = ", "
)
big_12 <- str_c(
"Arizona State",
"Iowa State",
"BYU",
"Colorado",
"Baylor",
"TCU",
"Texas Tech",
"Kansas State",
"West Virginia",
"Kansas",
"Cincinnati",
"Houston",
"Utah",
"Arizona",
"UCF",
"Oklahoma State",
sep = ", "
)
data_teams <- tibble(
conferene = c("Big Ten", "Big 12"),
teams = c(big_ten, big_12)
)
data_teams
# A tibble: 2 × 2
conferene teams
<chr> <chr>
1 Big Ten Illinois, Indiana, Iowa, Maryland, Michigan, Michigan State, Minnesota, Nebraska, North…
2 Big 12 Arizona State, Iowa State, BYU, Colorado, Baylor, TCU, Texas Tech, Kansas State, West V…
In practice, this dataset isn’t really useful: having a list of teams in a character string doesn’t afford us any ability to apply other operations to the data. It would be useful if the data were in long format, where each team is on its own row, and the conference is repeated. It’s also important to recognize each team in the character string are delimited by a comma. Let’s use the separate_longer_delim()
function to make it more useful. Here’s what the code looks like:
data_teams |>
separate_longer_delim(
cols = teams,
delim = ", "
)
# A tibble: 34 × 2
conferene teams
<chr> <chr>
1 Big Ten Illinois
2 Big Ten Indiana
3 Big Ten Iowa
4 Big Ten Maryland
5 Big Ten Michigan
6 Big Ten Michigan State
7 Big Ten Minnesota
8 Big Ten Nebraska
9 Big Ten Northwestern
10 Big Ten Ohio State
# ℹ 24 more rows
separate_wider_position()
separate_longer_position()
separates character strings into additional rows using a position index, rather than some type of character delimiter. The book mentions the use of separate_longer_position()
is quite rare. In fact, I struggled thinking of a practical example of where this might be applied. As such, this example is a little contrived. How about days it rained in Lincoln, NE for the months of September and October, 2024?
rain_sept <- str_c(
"N", "N", "N", "N", "N",
"Y", "N", "N", "N", "Y",
"N", "N", "N", "N", "N",
"N", "Y", "Y", "Y", "N",
"Y", "Y", "Y", "Y", "N",
"N", "N", "N", "N", "N"
)
rain_oct <- str_c(
"N", "N", "Y", "N", "N",
"N", "N", "N", "N", "N",
"N", "N", "N", "N", "N",
"N", "N", "N", "N", "N",
"Y", "N", "Y", "N", "N",
"N", "N", "N", "N", "Y",
"N"
)
data_lnk_rain <- tibble(
year = 2024,
month = c("September", "October"),
rain = c(rain_sept, rain_oct)
)
data_lnk_rain
# A tibble: 2 × 3
year month rain
<dbl> <chr> <chr>
1 2024 September NNNNNYNNNYNNNNNNYYYNYYYYNNNNNN
2 2024 October NNYNNNNNNNNNNNNNNNNNYNYNNNNNNYN
Again, what if we want longer data, where each day’s observation gets its own row? The character strings in the rain
variable don’t really have a point of reference where to separate the string. The only point of reference we have is positionally using some type of index. In this case, we have a character string of length 1 representing whether precipitation was observed that day: “Y” = rain; “N” = no rain. While we’re at it, let’s also mutate a day
column for each observation. The following code chunk demonstrates how to do this:
data_lnk_rain |>
separate_longer_position(rain, width = 1) |>
group_by(year, month) |>
mutate(day = 1:n(), .before = rain)
# A tibble: 61 × 4
# Groups: year, month [2]
year month day rain
<dbl> <chr> <int> <chr>
1 2024 September 1 N
2 2024 September 2 N
3 2024 September 3 N
4 2024 September 4 N
5 2024 September 5 N
6 2024 September 6 Y
7 2024 September 7 N
8 2024 September 8 N
9 2024 September 9 N
10 2024 September 10 Y
# ℹ 51 more rows
separate_wider_delim()
Separating strings into columns is a common operation. At times, extra variables are crammed into one variable, where it would be useful if the data was separated into additional variables. Let’s go back to our college teams example data.
This data will combine team names with conferences. However, it’s often useful to have a separate variable only containing the conference (i.e., we may want to filter out specific conferences for an analysis). At times, our data will look like this:
data_team_conference <- tibble(
team = c(
"Nebraska (Big Ten)",
"Texas (Big 12)",
"Pittsburgh (ACC)",
"Creighton (Big East)",
"Texas A&M (SEC)"
)
)
data_team_conference
# A tibble: 5 × 1
team
<chr>
1 Nebraska (Big Ten)
2 Texas (Big 12)
3 Pittsburgh (ACC)
4 Creighton (Big East)
5 Texas A&M (SEC)
We can now separate this column of character strings into multiple columns using separate_wider_delim()
. In the end, we want two columns: team
and conference
. Column names are passed as a character string of names to the names
argument. The (
character will be used as the delimiter to separate into the two columns, passed to the delim
argument. The last step is to use mutate()
and str_remove()
to remove the extra )
at the end of the character string. Here is the code to perform all these steps:
data_team_conference |>
separate_wider_delim(
team,
delim = " (",
names = c("team", "conference")
) |>
mutate(
conference = str_remove(conference, "\\)")
)
# A tibble: 5 × 2
team conference
<chr> <chr>
1 Nebraska Big Ten
2 Texas Big 12
3 Pittsburgh ACC
4 Creighton Big East
5 Texas A&M SEC
Too much or Too little data
One common problem arises when you separate strings, though: sometimes you end up with too much or too little data. Moreover, at times, you need more information to debug problems with the separation step. separate_wider_delim()
provides several arguments when confronted with these issues. These arguments provide additional functionality, including creating additional columns, dropping extra data, merging, or outputting debugging information.
The too_few
and too_many
arguments are available to help in these situations. I highlight their utility with a couple of examples to follow. However, I suggest reading and reviewing the examples in the docs to gain a complete understanding of the utility of these arguments.
Take for example file path parsing. Let’s start with some example data.
data_file_paths <- tibble(
path = c(
"2024_10_01-1925-f-monthly_users",
"2024_11_01-1925-f-monthly_users",
"2024_12_01-1925-f-monthly_users",
"2024_10_01-1925-m-monthly_users",
"2024_11_01-1925-m-monthly_users",
"2024_12_01-1925-m-monthly_users",
"2024_10_01-2635-f-monthly_users",
"2024_11_01-2635-f-monthly_users",
"2024_12_01-2635-f-monthly_users",
"2024_10_01-2635-m-monthly_users",
"2024_11_01-2635-m-monthly_users",
"2024_12_01-2635-m-monthly_users"
)
)
data_file_paths
# A tibble: 12 × 1
path
<chr>
1 2024_10_01-1925-f-monthly_users
2 2024_11_01-1925-f-monthly_users
3 2024_12_01-1925-f-monthly_users
4 2024_10_01-1925-m-monthly_users
5 2024_11_01-1925-m-monthly_users
6 2024_12_01-1925-m-monthly_users
7 2024_10_01-2635-f-monthly_users
8 2024_11_01-2635-f-monthly_users
9 2024_12_01-2635-f-monthly_users
10 2024_10_01-2635-m-monthly_users
11 2024_11_01-2635-m-monthly_users
12 2024_12_01-2635-m-monthly_users
Column separation can be performed either by specifying a delimiter or by position. Let’s start off by separating by delimiter by using the separate_wider_delim()
function. We specify the following arguments:
-
data
- The data containing the column(s) we want to separate. -
cols
- The column(s) we want to separate. -
delim
- The character delimiter we want to use to separate our information into newly mutated columns. -
names
- The names of the columns the separated column will be separated into.
In this example we’ll separate path
into four columns: date
, age_grp
, gender
, and report
. Here’s what the code looks like to separate by a character delimiter:
separate_wider_delim(
data = data_file_paths,
cols = path,
delim = "-",
names = c("date", "age_grp", "gender", "report")
)
# A tibble: 12 × 4
date age_grp gender report
<chr> <chr> <chr> <chr>
1 2024_10_01 1925 f monthly_users
2 2024_11_01 1925 f monthly_users
3 2024_12_01 1925 f monthly_users
4 2024_10_01 1925 m monthly_users
5 2024_11_01 1925 m monthly_users
6 2024_12_01 1925 m monthly_users
7 2024_10_01 2635 f monthly_users
8 2024_11_01 2635 f monthly_users
9 2024_12_01 2635 f monthly_users
10 2024_10_01 2635 m monthly_users
11 2024_11_01 2635 m monthly_users
12 2024_12_01 2635 m monthly_users
The same outcome can be achieved by separating positionally as well. Here we just use some index values passed via a semi-named vector to the widths
argument. The unnamed portions are the empty, unneeded portions of the string we want to separate by. The named numeric values are the character lengths of each piece of information we want to capture into our newly created columns.
separate_wider_position(
data = data_file_paths,
cols = path,
widths = c(date = 10, 1, age_grp = 4, 1, gender = 1, 1, report = 13)
)
# A tibble: 12 × 4
date age_grp gender report
<chr> <chr> <chr> <chr>
1 2024_10_01 1925 f monthly_users
2 2024_11_01 1925 f monthly_users
3 2024_12_01 1925 f monthly_users
4 2024_10_01 1925 m monthly_users
5 2024_11_01 1925 m monthly_users
6 2024_12_01 1925 m monthly_users
7 2024_10_01 2635 f monthly_users
8 2024_11_01 2635 f monthly_users
9 2024_12_01 2635 f monthly_users
10 2024_10_01 2635 m monthly_users
11 2024_11_01 2635 m monthly_users
12 2024_12_01 2635 m monthly_users
too_many
Say we have a case where we don’t need the end of the path, specifically the -monthly_users
portion. For example:
separate_wider_delim(
data = data_file_paths,
cols = path,
delim = "-",
names = c("date", "age_grp", "gender")
)
Error in `separate_wider_delim()`:
! Expected 3 pieces in each element of `path`.
! 12 values were too long.
ℹ Use `too_many = "debug"` to diagnose the problem.
ℹ Use `too_many = "drop"/"merge"` to silence this message.
We get an error. The error is the result of having more data then there are columns to separate into. To address this, we need to pass different options to the too_many
argument of the function. Let’s use too_many = "debug"
to receive additional information on what needs to be fixed. Although the problem is pretty straightforward here, I wanted to show this option in case you’re confronted with a situation with a more complex separation.
separate_wider_delim(
data = data_file_paths,
cols = path,
delim = "-",
names = c("date", "age_grp", "gender"),
too_many = "debug"
)
Warning: Debug mode activated: adding variables `path_ok`, `path_pieces`, and `path_remainder`.
# A tibble: 12 × 7
date age_grp gender path path_ok path_pieces path_remainder
<chr> <chr> <chr> <chr> <lgl> <int> <chr>
1 2024_10_01 1925 f 2024_10_01-1925-f-monthly_users FALSE 4 -monthly_users
2 2024_11_01 1925 f 2024_11_01-1925-f-monthly_users FALSE 4 -monthly_users
3 2024_12_01 1925 f 2024_12_01-1925-f-monthly_users FALSE 4 -monthly_users
4 2024_10_01 1925 m 2024_10_01-1925-m-monthly_users FALSE 4 -monthly_users
5 2024_11_01 1925 m 2024_11_01-1925-m-monthly_users FALSE 4 -monthly_users
6 2024_12_01 1925 m 2024_12_01-1925-m-monthly_users FALSE 4 -monthly_users
7 2024_10_01 2635 f 2024_10_01-2635-f-monthly_users FALSE 4 -monthly_users
8 2024_11_01 2635 f 2024_11_01-2635-f-monthly_users FALSE 4 -monthly_users
9 2024_12_01 2635 f 2024_12_01-2635-f-monthly_users FALSE 4 -monthly_users
10 2024_10_01 2635 m 2024_10_01-2635-m-monthly_users FALSE 4 -monthly_users
11 2024_11_01 2635 m 2024_11_01-2635-m-monthly_users FALSE 4 -monthly_users
12 2024_12_01 2635 m 2024_12_01-2635-m-monthly_users FALSE 4 -monthly_users
The too_many = "debug"
outputs a tibble with some additional columns ( *_ok
, *_pieces
, *_remainder
) of information. The *
being the name of the column to be separated. These columns contain information to help us quickly diagnose the problem. Using our example data, we get the following:
This column is useful for identifying the presence of any variable length strings.
path_ok
provides a boolean to quickly identify cases where the separation failed.path_pieces
represents the number of pieces resulting from separating the string.path_remainder
shows what’s left after the separation is performed. This is useful for identifying if there’s any additional information you want to retain in additional columns.
Although not applicable here, a neat trick to quickly identify the columns that didn’t separate as we expected is to use the following:
debug_path |> filter(!x_ok)
Now that we have additional information to help us figure out what’s going on, we can choose another option for the too_many
argument to handle our specific case. We have two additional options beyond error
and debug
:
-
drop
will drop the additional information that doesn’t fit into our newly specified columns. -
merge
will keep the additional information, but it will merge it with the data in the final column.
Let’s observe both options:
separate_wider_delim(
data = data_file_paths,
cols = path,
delim = "-",
names = c("date", "age_grp", "gender"),
too_many = "drop"
)
# A tibble: 12 × 3
date age_grp gender
<chr> <chr> <chr>
1 2024_10_01 1925 f
2 2024_11_01 1925 f
3 2024_12_01 1925 f
4 2024_10_01 1925 m
5 2024_11_01 1925 m
6 2024_12_01 1925 m
7 2024_10_01 2635 f
8 2024_11_01 2635 f
9 2024_12_01 2635 f
10 2024_10_01 2635 m
11 2024_11_01 2635 m
12 2024_12_01 2635 m
separate_wider_delim(
data = data_file_paths,
cols = path,
delim = "-",
names = c("date", "age_grp", "gender"),
too_many = "merge"
)
# A tibble: 12 × 3
date age_grp gender
<chr> <chr> <chr>
1 2024_10_01 1925 f-monthly_users
2 2024_11_01 1925 f-monthly_users
3 2024_12_01 1925 f-monthly_users
4 2024_10_01 1925 m-monthly_users
5 2024_11_01 1925 m-monthly_users
6 2024_12_01 1925 m-monthly_users
7 2024_10_01 2635 f-monthly_users
8 2024_11_01 2635 f-monthly_users
9 2024_12_01 2635 f-monthly_users
10 2024_10_01 2635 m-monthly_users
11 2024_11_01 2635 m-monthly_users
12 2024_12_01 2635 m-monthly_users
Either operation is pretty straightforward: drop the additional information or merge what’s left in the newly created column. Nonetheless, it’s likely best to debug first. Knowing what’s going on with your separation before applying a fix can be useful, and it will help you avoid parsing mistakes.
too_few
Next, I want to highlight options for when you have too few data. Let’s go back to some college sports examples, specifically college basketball game log data. Such data might look like this. Take note of the W/L
column. Not only are wins and losses denoted, but the variable may also contain info if the win occurred during an overtime period. Let’s mimic this structure in some example data.
data_bball_wl <- tibble(
game = c(1:6),
team = c(
"Nebraska",
"Nebraska",
"Nebraska",
"Nebraska",
"Nebraska",
"Nebraska"
),
w_l = c(
"W",
"W (1 OT)",
"L",
"W (3 OT)",
"L (2 OT)",
"W"
)
)
data_bball_wl
# A tibble: 6 × 3
game team w_l
<int> <chr> <chr>
1 1 Nebraska W
2 2 Nebraska W (1 OT)
3 3 Nebraska L
4 4 Nebraska W (3 OT)
5 5 Nebraska L (2 OT)
6 6 Nebraska W
separate_wider_delim(
data = data_bball_wl,
cols = "w_l",
delim = " (",
names = c("w_l", "ots")
)
Error in `separate_wider_delim()`:
! Expected 2 pieces in each element of `w_l`.
! 3 values were too short.
ℹ Use `too_few = "debug"` to diagnose the problem.
ℹ Use `too_few = "align_start"/"align_end"` to silence this message.
Error, so let’s debug what’s happening with our separation.
separate_wider_delim(
data = data_bball_wl,
cols = "w_l",
delim = " (",
names = c("w_l", "ots"),
too_few = "debug"
)
Warning: Debug mode activated: adding variables `w_l_ok`, `w_l_pieces`, and `w_l_remainder`.
# A tibble: 6 × 7
game team w_l ots w_l_ok w_l_pieces w_l_remainder
<int> <chr> <chr> <chr> <lgl> <int> <chr>
1 1 Nebraska W <NA> FALSE 1 ""
2 2 Nebraska W (1 OT) 1 OT) TRUE 2 ""
3 3 Nebraska L <NA> FALSE 1 ""
4 4 Nebraska W (3 OT) 3 OT) TRUE 2 ""
5 5 Nebraska L (2 OT) 2 OT) TRUE 2 ""
6 6 Nebraska W <NA> FALSE 1 ""
Just what we thought, rows 1, 3, 5, and 6 don’t contain enough information to complete our operation of filling the ots
variable. The separate_wider_delim()
function has two options for the too_many
argument to address this issue:
-
align_end
addsNA
at the start of short matches to pad to the correct length. -
align_start
addsNA
at the end of the short matches to pad to the correct length.
I’ll start with align_end
first, just to demonstrate what it does, though this operation isn’t what we’re looking to do here. Then, I’ll show you align_start
, the operation needed to complete our separation successfully.
separate_wider_delim(
data = data_bball_wl,
cols = "w_l",
delim = " (",
names = c("w_l", "ot"),
too_few = "align_end"
)
# A tibble: 6 × 4
game team w_l ot
<int> <chr> <chr> <chr>
1 1 Nebraska <NA> W
2 2 Nebraska W 1 OT)
3 3 Nebraska <NA> L
4 4 Nebraska W 3 OT)
5 5 Nebraska L 2 OT)
6 6 Nebraska <NA> W
separate_wider_delim(
data = data_bball_wl,
cols = "w_l",
delim = " (",
names = c("w_l", "ot"),
too_few = "align_start"
)
# A tibble: 6 × 4
game team w_l ot
<int> <chr> <chr> <chr>
1 1 Nebraska W <NA>
2 2 Nebraska W 1 OT)
3 3 Nebraska L <NA>
4 4 Nebraska W 3 OT)
5 5 Nebraska L 2 OT)
6 6 Nebraska W <NA>
In short, all the align selection does is modify where the NA
will be placed, essentially modifying the padding to create a correct length for the separation to be valid.
Great, now that we’ve identified where to separate the columns, we just need to do some additional string manipulation to finish the wrangling of this data. Below are the wrangling steps I applied:
data_bball_wl |>
separate_wider_delim(
cols = "w_l",
delim = " (",
names = c("w_l", "ot"),
too_few = "align_start"
) |>
mutate(
n_ot = str_remove(ot, " OT\\)"),
ot = ifelse(is.na(ot), FALSE, TRUE),
)
# A tibble: 6 × 5
game team w_l ot n_ot
<int> <chr> <chr> <lgl> <chr>
1 1 Nebraska W FALSE <NA>
2 2 Nebraska W TRUE 1
3 3 Nebraska L FALSE <NA>
4 4 Nebraska W TRUE 3
5 5 Nebraska L TRUE 2
6 6 Nebraska W FALSE <NA>
separate_wider_regex()
In this last section, we’ll cover a slightly more advanced topic: using regular expressions to separate into different columns. The separate_wider_regex()
performs this operation. To highlight how to do this, let’s go back to our file path example data, data_file_paths
.
Say, for some reason, after we separated the different fields embedded in the file path, we want to separate the data further into year and month columns, dropping the day information, since it’s redundant in this case (i.e., these files are always the first of the month). Here’s the additional code we would need:
data_file_paths |>
separate_wider_delim(
cols = path,
delim = "-",
names = c(
"date",
"age_grp",
"gender",
"report"
)
) |>
separate_wider_regex(
cols = date,
patterns = c(
year = "\\d{4}",
"_",
month = "\\d{2}"
),
too_few = "align_start"
)
# A tibble: 12 × 5
year month age_grp gender report
<chr> <chr> <chr> <chr> <chr>
1 2024 10 1925 f monthly_users
2 2024 11 1925 f monthly_users
3 2024 12 1925 f monthly_users
4 2024 10 1925 m monthly_users
5 2024 11 1925 m monthly_users
6 2024 12 1925 m monthly_users
7 2024 10 2635 f monthly_users
8 2024 11 2635 f monthly_users
9 2024 12 2635 f monthly_users
10 2024 10 2635 m monthly_users
11 2024 11 2635 m monthly_users
12 2024 12 2635 m monthly_users
Indeed, it’s not lost on me that this operation could be successfully completed using the following approach as well:
data_file_paths |>
separate_wider_delim(
cols = path,
delim = "-",
names = c(
"date",
"age_grp",
"gender",
"report"
)
) |>
separate_wider_delim(
cols = date,
delim = "_",
names = c("year", "month"),
too_many = "drop"
)
# A tibble: 12 × 5
year month age_grp gender report
<chr> <chr> <chr> <chr> <chr>
1 2024 10 1925 f monthly_users
2 2024 11 1925 f monthly_users
3 2024 12 1925 f monthly_users
4 2024 10 1925 m monthly_users
5 2024 11 1925 m monthly_users
6 2024 12 1925 m monthly_users
7 2024 10 2635 f monthly_users
8 2024 11 2635 f monthly_users
9 2024 12 2635 f monthly_users
10 2024 10 2635 m monthly_users
11 2024 11 2635 m monthly_users
12 2024 12 2635 m monthly_users
However, using a regular expression affords a little more accuracy. We’re able to specifically specify the pattern we’re looking to capture for our column.
Wrap up
The takeaway from this TIL post is simple: use tidyr
’s family of separate_wider_*
and separate_longer_*
functions when you need to separate a string into additional columns or rows.
This post wasn’t aiming to be a fully comprehensive guide on how to use this family of functions. As such, I highly suggest reviewing the following to learn more and view additional example use cases:
Chapter 14: Strings from the R for Data Science (2e) book.
tidyr
’sseparate_wider_*
andseparate_longer_*
function documentation (or run?separate_wider_delim
/?separate_longer_delim
in your console).tidyr
’s 1.3.0 release notes.
Until next time, keep having fun with your data wrangling 📈.
Reuse
Citation
@misc{berke2024,
author = {Berke, Collin K},
title = {Separate Character Strings into Rows and Columns Using
`Tidyr` Functions},
date = {2024-12-27},
langid = {en}
}