Creating “Tidy” Web Traffic Data in R

There are many tutorials for transforming data using R, but I wanted to demonstrate how you might create dummy web traffic data and then make it tidy ( have one row for every observation) so that it can be further processed using statistical modeling.

The sample data I created looks like the table below. So there is a separate line for each ID. This might be a visitor ID or a session ID. The ISP, week, and country are the same for every row of the ID. But the page ID is different. If the user visited five different pages they would have five unique rows. Two pages, two rows. And so on. You can imagine many different scenarios where you might have a dataset structured in this way. The column “Page_Val” is a custom column that must be added to the dataset by the user. When we pivot the data each page ID will become it’s own column. If a user visited a particular page Page-Val will be used to add a 1 to that page ID column for that user. Otherwise an NA will appear. We can then go back and replace all the NAs with 0s. I used two lines of code to perform the transformation and replacing of NAs with 0s, but using the fill parameter in the dcast function in “reshape2” it can actually be done in a single line. That’s why it pays to read the documentation very closely, which I didn’t do the first time around. Shame on me.

Screen Shot 2015-03-23 at 6.56.45 PM

Because the sample data is so small the data can easily be printed out to the screen before and after the transformation.

Screen Shot 2015-03-23 at 7.11.06 PM

But the data doesn’t have to be small. In fact, I wrote a custom R function that generates this dummy data. It can easily produce large amounts of dummy data so more robust excercises can be performed. For instance, I produced 1.5 million rows of data in just a couple of seconds and then ran speed tests using the standard reshape function in one trial and the dcast function from “reshape2” in another trial. The dcast function was about 15 times faster in transforming the data. All of the R code is below.

###############################################################################################
# James McCammon
# Wed, 18 March 2015
#
# File Description:
# This file demonstrates how to transform dummy website data from long to wide format.
# The dummy data is strucutred so that there are mulitple rows for each visit, with 
# each visit row containing a unique page id representing a different webpage viewed
# in that visit. The goal is to transform the data so that each visit is a single row,
# with webpage ids transformed into columns names with 1s and 0s as values representing 
# whether  the page was viewed in that particular visit. This file demonstrates two different
# methods to accomplish this task. It also performs a speed test to determine which
# of the two methods is faster.
###############################################################################################
 
# Install packages if necessary
needed_packages = c("reshape2", "norm")
new_packages = needed_packages[!(needed_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new.packages)
 
###############################################################################################
# Create small amount of raw dummy data
###############################################################################################
# Make a function to create dummy data
get_raw_data = function(num_visitors, n, ids, isps, weeks, countries, pages) {
  raw_data = cbind.data.frame(
    'Id' = do.call(rep, list(x=ids, times=n)),
    'ISP' = isps[do.call(rep, list(x=sample(1:length(isps), size=num_visitors, replace=TRUE), times=n))],
    'Week' = weeks[do.call(rep, list(x=sample(1:length(weeks), size=num_visitors, replace=TRUE), times=n))],
    'Country' = countries[do.call(rep, list(x=sample(1:length(countries), size=num_visitors, replace=TRUE), times=n))],
    'Page' = unlist(sapply(n, FUN=function(i) sample(pages, i))),
    'Page_Val' = rep(1, times=sum(n)))
  raw_data
}
 
# Set sample parameters to draw from when creating dummy raw data
num_visitors = 5
max_row_size = 7
ids = 1:num_visitors
isps = c('Digiserve', 'Combad', 'AT&P', 'Verison', 'Broadserv', 'fastconnect')
weeks = c('Week1')
countries = c('U.S.', 'Brazil', 'China', 'Canada', 'Mexico', 'Chile', 'Russia')
pages = as.character(sample(100000:200000, size=max_row_size, replace=FALSE))
n = sample(1:max_row_size, size=num_visitors, replace=TRUE)
# Create a small amount of dummy raw data
raw_data_s = get_raw_data(num_visitors, n, ids, isps, weeks, countries, pages)
 
###############################################################################################
# Transform raw data: Method 1
###############################################################################################
# Reshape data
transformed_data_s = reshape(raw_data_s,
                timevar = 'Page',
                idvar = c('Id', 'ISP', 'Week', 'Country'),
                direction='wide')
# Replace NAs with 0
transformed_data_s[is.na(transformed_data_s)] = 0
 
# View the raw and transformed versions of the data
raw_data_s
transformed_data_s
 
###############################################################################################
# Transform raw data: Method 2
###############################################################################################
# Load libraries
require(reshape2)
require(norm)
 
# Transform the data using dcast from the reshape2 package
transformed_data_s = dcast(raw_data_s, Id + ISP + Week + Country ~ Page, value.var='Page_Val')
# Replace NAs using a coding function from the norm package
transformed_data_s = .na.to.snglcode(transformed_data_s, 0)
# An even simpler way of filling in missing data with 0s is to
# specify fill=0 as an argument to dcast.
 
# View the raw and transformed versions of the data
raw_data_s
transformed_data_s
 
###############################################################################################
# Run speed tests on larger data
###############################################################################################
# Caution: This will create a large data object and run several functions that may take a
# while to run on your machine. As a cheat sheet the results of the test are as follows:
# The transformation method that used the standard "reshape" command took 28.3 seconds
# The transformation method that used "dcast" from the "reshape2" package took 1.8 seconds
 
# Set sample parameters to draw from when creating dummy raw data
num_visitors = 100000
max_row_size = 30
ids = 1:num_visitors
pages = as.character(sample(100000:200000, size=max_row_size, replace=FALSE))
n = sample(1:max_row_size, size=num_visitors, replace=TRUE)
# Create a large amount of raw dummy data
raw_data_l = get_raw_data(num_visitors, n, ids, isps, weeks, countries, pages)
 
s1 = system.time({
  transformed_data_l = reshape(raw_data_l,
                               timevar = 'Page',
                               idvar = c('Id', 'ISP', 'Week', 'Country'),
                               direction='wide')
  transformed_data_l[is.na(transformed_data_l)] = 0 
})
 
s2 = system.time({
  transformed_data_l = dcast(raw_data_l, Id + ISP + Week + Country ~ Page, value.var='Page_Val')
  transformed_data_l = .na.to.snglcode(transformed_data_l, 0)   
})
 
s1[3]
s2[3]

Created by Pretty R at inside-R.org

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s