I will often come across a coded, or categorized, dataset where a particular categories values are all stuffed into a single column. Although there may be reasons for this type of data structure, it does not lend itself very nicely for use in R.

Here is a simplified example, tracking the sport interests of four individuals:

# data: 
ID  | Name  | Interests
-----------------------
1   | Jon   | Football
2   | Sue   | Soccer,Basketball
3   | Beth  | Football,Baseball,Soccer
4   | Steve | Soccer

As you can see, this format can quickly pose a problem. What if we wanted to know how many people liked Soccer?

We could string match on the column values:

soccer = grep(data$interests, pattern="Soccer") # c(1,2,3)

However, depending on the search string, this can break very easily. For example, imagine if instead this was our original data:

# data: 
ID  | Name  | Interests
-----------------------
1   | Jon   | American Football
2   | Sue   | Football,Basketball
3   | Beth  | American Football,Baseball,Football
4   | Steve | Football

So, if we were still interested in the number of people interested in Soccer, or now Football, this would give us:

football = grep(data$interests, pattern="Football") # c(1,2,3,4)

Notice, we are now getting a positive match on the first row. Since Jon is interested in American Football, our regular expression matches on both sports.

So we could use some more sophistacation in our pattern, but, across larger datasets, you won’t always know what problems to look for. This makes using these regular expressions difficult and time-consuming.

Instead, what if we just created individual columns for the unique values from Interests column? The result data would look like this:

# data:
ID  | Name  | American Football | Football  | Basketball    | Baseball
-----------------------------------------------------------------------
1   | Jon   | 1                 | 0         | 0             | 0
2   | Sue   | 0                 | 1         | 1             | 0
3   | Beth  | 1                 | 1         | 0             | 1
4   | Steve | 0                 | 1         | 0             | 0

This format will make things much easier. We can find all Soccer (or Football) fans very easily (1), or quickly find the sport with the most interest (2):

# 1: Find soccer fans
football = data[data$Football == 1,]

# 2:
sports = c("American Football","Football","Basketball","Baseball") 
sport_sums = colSums(data[,sports])
sports[which.max(sport_sums)] # Football: which.max(sport_sums) == 1

Now that we see some the benefits of translating the initial Interests column, let’s see how to do it.


Step 1: Unique Values

To get the values to be used for the new columns we will first split all of the values in the Interests column, and then reduce that the unique values.

interests = as.character(data$Interests)
split_interests = sapply(interests, strsplit, ",", USE.NAMES=F); # USE.NAMES isn't required, but makes looking at the output easier
unique_interests = unique(unlist(split_interests)) # c("American Football", "Football", "Basketball", "Baseball")

Step 2: Create the Columns

So, unique_interests contains the names for our new columns, so we can append those to our data.frame, setting all initial values to 0

Note: This is not required, but I’m doing it here mostly for illustrative purposes following Method 1 in the Step 3

data[,unique_interests] = 0

# data, Interests column truncated for better formatting
ID  Name Interests American Football Football Basketball Baseball
 1   Jon       ...                 0        0          0        0
 2   Sue       ...                 0        0          0        0
 3  Beth       ...                 0        0          0        0
 4 Steve       ...                 0        0          0        0

Step 3: Populate Each Row

So now it’s a matter of populating each row according to their respective values in the Interests column. If we take a look at split_interests from above, we have a list of vectors representing the per-row interests:

# split_interests
[[1]]   # Jon
[1] "American Football"

[[2]]   # Sue
[1] "Football"   "Basketball"

[[3]]   # Beth
[1] "American Football" "Baseball"          "Football"         

[[4]]   # Steve
[1] "Football"

There are many methods that can be used here, a couple of which are to 1.) iterate split_interests and find the new columns in our data and set them to 1, or 2.) convert our list to a matrix and append it to data

Method 1

Using a simple loop, we can iterate the interests and set each value to 1.

Warning: This is NOT the preferred approach. I’m more so including this method as means to illustrate the process, rather than an example to implement, and that will become clear after we examine both methods.

for(row in 1:length(split_interests)) {
    data[row, split_interests[[row]]] = 1
}
Method 2

Create a (N, M) matrix, where N == nrow(data) and M == length(unique_interests). We will then use this matrix to append on to our original data.

An easy way to do this is to use sapply and give it our split_interests list, and a function where current_interests will represent the interest values for the given row (or entry in the list).

interests_per_row = sapply(split_interests, function(current_interests) { 
    # Using the full vector of new columns, unique_interests, 
    # we see which are in the current row's interest vector. 
    # This will return a logical vector with a length equal 
    # to length(unique_interests)
    current = (unique_interests %in% current_interests)

    # multiply the logical vector by 1 and return.  The 
    # multiplication will convert to a numeric vector, 
    # with FALSE == 0, and TRUE == 1
    return(1 * current)
})

# interests_per_row
     [,1] [,2] [,3] [,4]
[1,]    1    0    1    0
[2,]    0    1    1    1
[3,]    0    1    0    0
[4,]    0    0    1    0

However, we have encountered a catch, and can’t append this to the data quite yet: it’s actually in the wrong format. The rows from our original data is actually represented here as the columns and the interests are the rows. There an easy fix though:.

# transpose the matrix using t()
interests_per_row = t(interests_per_row)

     [,1] [,2] [,3] [,4]
[1,]    1    0    0    0
[2,]    0    1    1    0
[3,]    1    1    0    1
[4,]    0    1    0    0

Now we can append this to our original data

Note: When following this method, this is what makes the Step 2 unnecessary

data[,unique_interests] = interests_per_row

# data, Interests column truncated for better formatting
ID  Name Interests American Football Football Basketball Baseball
 1   Jon       ...                 1        0          1        0
 2   Sue       ...                 0        1          1        1
 3  Beth       ...                 0        1          0        0
 4 Steve       ...                 0        0          1        0

Step 5: Why Method 2

Although the first method, depending on your background, seems much more straightforward, it is, in fact, much slower. As the dataset gets larger, the problem will really show itself, but even a dataset of only four rows, its noticeably slower.

> microbenchmark::microbenchmark(method1(), method2(), times=1000)
Unit: microseconds
      expr     min      lq    mean  median     uq     max neval cld
 method1() 184.157 198.042 220.934 205.239 220.66 1608.19  1000   b
 method2() 149.452 157.264 179.171 164.378 175.62 1568.67  1000  a