Will a 16 seed ever beat a 1 seed?

That question arose during a recent dinner with my friend Graham at a popular pizza restaurant in Seattle. Tony Kornheiser of PTI said last week that it will never happen. Graham agrees. I think it will happen in our lifetime. It has almost happened a number of times already.

It seemed to me from casual observation that 16 seeds are getting closer to winning on average, so I decided to check by plotting the point differential of the higher-seeded team during the first round of the NCAA tournament (the first round has four 1-vs-16 games). Indeed, compared with most other matchups, the 1-vs-16 matchup has shifted considerably over time.

[Figure: point differential of higher seeds in first-round games over time, by matchup]

The margin of victory is still substantial, between 10 and 15 points so far this decade, but I remain confident that it will happen sometime in, say, the next 40 years. There have already been eight 15-seed victories over 2 seeds and twenty-one 14-seed wins over 3 seeds. The situation looks even better when you consider the closest 1-vs-16 game in each tournament (since we only need one 16 seed to win).

[Figure: closest 1-vs-16 point margin in each tournament]

It seems that about two or three times a decade there are relatively close 1-vs-16 games, and about once a decade there is an extremely close one (decided by just a couple of points). The 16 seeds did not fare well in the 2000s, though.

I think the outcomes of close games are more stochastic than most people appreciate. Leadership attribution bias seems to turn these stochastic events into narratives of late-game heroics, and we're prone to say that the 1 seed is more poised and resistant to pressure than players at smaller schools. Of course, the history of the tournament has shown us many, many exceptions to this rule (if it's a rule at all). In tension with this narrative is the story of the underdog that is just happy to be at the tournament and has nothing to lose, playing loose, having fun, and playing "to win" while the nervous favorite plays "not to lose." So to me many of these games are closer to a coin flip than we like to think, and given enough coin flips a tails is bound to come up eventually.

Also, think about it this way: how much difference is there between a 1 seed and a 2 seed, or between a 15 seed and a 16 seed? As you know if you've ever watched the tournament seeding show, there isn't much of a difference. The lowest-ranked 1 seed isn't much better – and could be worse – than some of the 2 seeds, and a 2 seed has lost in the first round eight times. You could argue the 2 seeds that lost should really have been 3 seeds, although this year Michigan State lost as a 2 seed and many considered them a favorite to win the entire tournament (by some measures it was the biggest tournament upset ever). The larger point is that seeding is itself somewhat stochastic, and the question "can a 16 seed beat a 1 seed" is really the question of whether an overmatched team can pull off a one-time victory over an opponent that is much more dominant on average. We already know the answer to that question is "yes."

To give the other side its due: since seeding began in 1979, 22 of the 37 NCAA tournament winners have been 1 seeds, so at least some of the top seeds are properly ranked, and 16 seeds will have a tough time beating them when the rankings are accurate.

There is also the question of why the decrease in point margin has occurred. Keep in mind the plot above is just a second-order trend line, although if you plot the underlying year-by-year margins they do follow the trend on average (of course). It's interesting that the late 1980s and early 1990s were a period of lower point differentials in 1-vs-16 games, and that this period also coincided with three first-round wins by 15 seeds over 2 seeds (1991, 1993, 1997). Likewise, the past few years have seen another decrease in 1-vs-16 point differentials and another string of 15-seed wins: two in 2012 and one each in 2013 and 2016. Similarly, between 1986 and 1999 a 14 seed beat a 3 seed thirteen times, and since 2010 this has happened another six times (the intervening years saw only two instances, in 2005 and 2006). These two periods also correspond to the highly touted recruiting classes of Michigan in 1991 (famously nicknamed the "Fab Five") and Kentucky's 2013 "mega class." It may be that there are episodic shifts in recruiting that systematically leave certain types of talent on the table for smaller schools to pick up and develop. (Of course, it may be that I'm just seeing patterns where none exist.)

My recent impression is that although a few top schools still get the best players, smaller schools have realized that if they recruit good players (particularly good shooters or traditional big men) and play as a team, they have a chance at beating anyone in the early rounds and perhaps going deep into the tournament. This increases their confidence and performance. That is another reason I think a 16 seed will eventually beat a 1: the recent years of the tournament have expanded our imagination of what is possible. Think about the well-known phenomenon of a 12 seed beating a 5 seed almost every year. How nervous do you think 5 seeds are every year? How much of the consistency of a 12 beating a 5 is self-reinforcing, due to the 5 seeds having "the jitters" and the 12 seeds being relatively overconfident?

Creating “Tidy” Web Traffic Data in R

There are many tutorials on transforming data with R, but I wanted to demonstrate how you might create dummy web traffic data and then make it tidy (one row for every observation) so that it can be further processed with statistical modeling.

The sample data I created looks like the table below. There is a separate row for each ID, which might be a visitor ID or a session ID. The ISP, week, and country are the same for every row of a given ID, but the page ID is different: if a user visited five different pages they have five rows; two pages, two rows; and so on. You can imagine many scenarios where you might have a dataset structured this way. The column "Page_Val" is a custom column that must be added to the dataset by the user. When we pivot the data, each page ID becomes its own column. If a user visited a particular page, Page_Val is used to put a 1 in that page ID column for that user; otherwise an NA appears. We can then go back and replace all the NAs with 0s. I used two lines of code to perform the transformation and replace the NAs with 0s, but using the fill parameter of the dcast function in "reshape2" it can actually be done in a single line. That's why it pays to read the documentation very closely, which I didn't do the first time around. Shame on me.

[Figure: sample of the raw dummy web traffic data]

Because the sample data is so small, it can easily be printed to the screen before and after the transformation.

[Figure: the small dataset before and after the transformation]

But the data doesn't have to be small. In fact, I wrote a custom R function that generates this dummy data and can easily produce large amounts of it so that more robust exercises can be performed. For instance, I produced 1.5 million rows of data in just a couple of seconds and then ran speed tests using the standard reshape function in one trial and the dcast function from "reshape2" in another. The dcast function was about 15 times faster at transforming the data. All of the R code is below.

###############################################################################################
# James McCammon
# Wed, 18 March 2015
#
# File Description:
# This file demonstrates how to transform dummy website data from long to wide format.
# The dummy data is structured so that there are multiple rows for each visit, with 
# each visit row containing a unique page id representing a different webpage viewed
# in that visit. The goal is to transform the data so that each visit is a single row,
# with webpage ids transformed into column names with 1s and 0s as values representing 
# whether the page was viewed in that particular visit. This file demonstrates two different
# methods to accomplish this task. It also performs a speed test to determine which
# of the two methods is faster.
###############################################################################################
 
# Install packages if necessary
needed_packages = c("reshape2", "norm")
new_packages = needed_packages[!(needed_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
 
###############################################################################################
# Create small amount of raw dummy data
###############################################################################################
# Make a function to create dummy data
get_raw_data = function(num_visitors, n, ids, isps, weeks, countries, pages) {
  raw_data = cbind.data.frame(
    'Id' = do.call(rep, list(x=ids, times=n)),
    'ISP' = isps[do.call(rep, list(x=sample(1:length(isps), size=num_visitors, replace=TRUE), times=n))],
    'Week' = weeks[do.call(rep, list(x=sample(1:length(weeks), size=num_visitors, replace=TRUE), times=n))],
    'Country' = countries[do.call(rep, list(x=sample(1:length(countries), size=num_visitors, replace=TRUE), times=n))],
    'Page' = unlist(sapply(n, FUN=function(i) sample(pages, i))),
    'Page_Val' = rep(1, times=sum(n)))
  raw_data
}
 
# Set sample parameters to draw from when creating dummy raw data
num_visitors = 5
max_row_size = 7
ids = 1:num_visitors
isps = c('Digiserve', 'Combad', 'AT&P', 'Verison', 'Broadserv', 'fastconnect')
weeks = c('Week1')
countries = c('U.S.', 'Brazil', 'China', 'Canada', 'Mexico', 'Chile', 'Russia')
pages = as.character(sample(100000:200000, size=max_row_size, replace=FALSE))
n = sample(1:max_row_size, size=num_visitors, replace=TRUE)
# Create a small amount of dummy raw data
raw_data_s = get_raw_data(num_visitors, n, ids, isps, weeks, countries, pages)
 
###############################################################################################
# Transform raw data: Method 1
###############################################################################################
# Reshape data
transformed_data_s = reshape(raw_data_s,
                timevar = 'Page',
                idvar = c('Id', 'ISP', 'Week', 'Country'),
                direction='wide')
# Replace NAs with 0
transformed_data_s[is.na(transformed_data_s)] = 0
 
# View the raw and transformed versions of the data
raw_data_s
transformed_data_s
 
###############################################################################################
# Transform raw data: Method 2
###############################################################################################
# Load libraries
require(reshape2)
require(norm)
 
# Transform the data using dcast from the reshape2 package
transformed_data_s = dcast(raw_data_s, Id + ISP + Week + Country ~ Page, value.var='Page_Val')
# Replace NAs using a coding function from the norm package
transformed_data_s = .na.to.snglcode(transformed_data_s, 0)
# An even simpler way of filling in missing data with 0s is to
# specify fill=0 as an argument to dcast.
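# The same result in a single step, letting dcast fill missing combinations
# with 0 instead of NA (equivalent to the two lines above):
transformed_data_s = dcast(raw_data_s, Id + ISP + Week + Country ~ Page,
                           value.var='Page_Val', fill=0)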
 
# View the raw and transformed versions of the data
raw_data_s
transformed_data_s
 
###############################################################################################
# Run speed tests on larger data
###############################################################################################
# Caution: This will create a large data object and run several functions that may take a
# while to run on your machine. As a cheat sheet the results of the test are as follows:
# The transformation method that used the standard "reshape" command took 28.3 seconds
# The transformation method that used "dcast" from the "reshape2" package took 1.8 seconds
 
# Set sample parameters to draw from when creating dummy raw data
num_visitors = 100000
max_row_size = 30
ids = 1:num_visitors
pages = as.character(sample(100000:200000, size=max_row_size, replace=FALSE))
n = sample(1:max_row_size, size=num_visitors, replace=TRUE)
# Create a large amount of raw dummy data
raw_data_l = get_raw_data(num_visitors, n, ids, isps, weeks, countries, pages)
 
s1 = system.time({
  transformed_data_l = reshape(raw_data_l,
                               timevar = 'Page',
                               idvar = c('Id', 'ISP', 'Week', 'Country'),
                               direction='wide')
  transformed_data_l[is.na(transformed_data_l)] = 0 
})
 
s2 = system.time({
  transformed_data_l = dcast(raw_data_l, Id + ISP + Week + Country ~ Page, value.var='Page_Val')
  transformed_data_l = .na.to.snglcode(transformed_data_l, 0)   
})
 
s1[3]
s2[3]


R and C++ Selection Sort

Casually following along with Princeton's Coursera course on algorithms, which is taught in Java. I'm trying to implement the five different sorting algorithms without looking at the Java implementations, and at the same time trying to learn a bit of C++ and use R's Rcpp package.

I implemented selection sort in both R and C++ and boy is the C++ version faster — 1000+ times faster!!! Below you can see a comparison sorting a 50,000 element numeric vector. The C++ version took less than a second, while the R version took 14 minutes. Crazy!!

> unsorted = sample(1:50000)
> R_time = system.time(selection_sort_R(unsorted))
> Cpp_time = system.time(selection_sort_Cpp(unsorted))
> R_time['elapsed']/Cpp_time['elapsed']
 elapsed 
1040.395
> Cpp_time
   user  system elapsed 
   0.82    0.00    0.81 
> R_time
   user  system elapsed 
 838.79    0.09  842.72


Below are the implementations in R and C++.

###################################################
# R Selection Sort
###################################################
# Define R selection Sort
selection_sort_R = function(v) {
  N = length(v)
 
  for(i in 1:N) {
    min = i
 
    for(j in i:N) {
      if(v[j] < v[min]) {
        min = j
      }    
    }
    temp = v[i]
    v[i] = v[min]
    v[min] = temp
  }
  return(v)
}


###################################################
# C++ Selection Sort
###################################################
# Define C++ selection Sort
require(Rcpp)
cppFunction('NumericVector selection_sort_Cpp(NumericVector v) {
  int N = v.size();
  int min;
  double temp;  // use double: NumericVector elements are doubles
 
  for(int i = 0; i < N; i++) {
    min = i;
 
    for(int j = i; j < N; j++) {
      if(v[j] < v[min]) {
        min = j;
      }    
    }
    temp = v[i];
    v[i] = v[min];
    v[min] = temp;
  }
  return v;
}')


Percolation Algorithm In R

I’m casually following along with a few of the lessons in Princeton’s Algorithms, Part 1 on Coursera. The course itself is taught in Java, but I thought completing some of the lessons in R would improve my R skills. The first activity I’ve completed is a percolation simulation using weighted quick-union with path compression. I’m not actually sure if this was one of the assignments, but it seemed cool so I thought I’d give it a try.

Percolation is meant to model physical or social systems where some type of blockage occurs. The algorithm taught in the class is to begin with full blockage and to randomly unblock one element until there is a path from one side of the system to another. The simplest method of doing this is to imagine a blocked grid. The initial blocked system is shown below.

percolation start

My algorithm removes one blockage every half second, shown by turning the corresponding black square white. A fully percolated system is shown below. Notice there is now a path through the blockage: starting at the bottom, follow the open (white) squares up and to the left and then diagonally up and to the right.

percolation end

At the moment percolation is completed a message is printed to the console.

percolation output

The code to initialize the board and percolated system and to run the clearing algorithm is shown below.

####################################################################
# James McCammon
# 21 Feb 2015
# Percolation Algorithm Functions Version 1
####################################################################
# Load libraries
library(shape)
source("perc_functions.R")
 
# Define variables
nrows = 10
ncols = 10
nsites = nrows*ncols
closedSites = 1:nsites
numneighbors = 4
display = NULL
#display = matrix(data=0, nrow=nrows, ncol=ncols)
VTN = nsites + 1
VBN = nsites + 2
 
# Initialize board
board = initBoard()
initDisplay()
initVTN()
initVBN()
 
# Run main
main()


Most of the work is done in the set of functions below. This can definitely be improved, for example by incorporating an arbitrary list of neighbors into each site object rather than assuming the system is a square grid and determining neighbors by looking in the four quadrants around each site (a rough sketch of that idea follows the code). I could also think harder about memory usage. Right now I just threw everything into the global environment and didn't think too hard about memory usage or efficiency. Using base R's tracemem() function shows that the board object is copied a few times. I'll probably play around with things and continue to improve them.

####################################################################
# James McCammon
# 21 Feb 2015
# Percolation Algorithm Main Version 1
####################################################################
# Define main function
main = function() {
  sleeptime = 0.5
 
  while(!connected(VTN, VBN)) {
    randSite = getRandSite()
    board[[randSite]]$open = TRUE
    #location = trans2display(randSite)
    #write2display(location)
    filledrectangle(mid=c(display[[randSite]]$xmid, 
                          display[[randSite]]$ymid), 
                    display[[randSite]]$wx, 
                    display[[randSite]]$wy, 
                    col='white', 
                    lwd=1, 
                    lcol='white')
 
    neighbors = getNeighbors(randSite)
 
    for(i in 1:numneighbors) {
      if(is.null(neighbors[[i]])) next
      if(!isOpen(neighbors[[i]])) next
      join(randSite, neighbors[[i]])
    } 
    Sys.sleep(sleeptime)
  }
  print("Percolation completed")
  p = 1 - (length(closedSites)/nsites)  
  return()
}
 
 
# Initialize the display
initDisplay = function() {
  display <<- as.vector(1:nsites, mode='list')
  windows(5, 5)
  plot.new()
  xmid = 0
  ymid = 1
  wx = .1
  wy = .1
  index = 1
 
  for(row in 1:nrows) {
    for(col in 1:ncols) {
      filledrectangle(mid=c(xmid, ymid), wx=wx, wy=wy, col='black', lwd=1, lcol='white')
      display[[index]] <<- list('xmid'=xmid, 'ymid'=ymid, 'wx'=wx, 'wy'=wy)
      index = index + 1
      xmid = xmid + .1
    }
    xmid = 0
    ymid = ymid - .1
  }  
}
 
 
# Gets root 
getRoot = function(node) {
  while(node != board[[node]]$root) {
    board[[node]]$root <<- board[[board[[node]]$root]]$root
    node = board[[node]]$root
  }
  return(node)
}
 
 
# Checks if two nodes are connected
connected = function(node1, node2) {
  return(getRoot(node1) == getRoot(node2))
}
 
 
# Initialize board
initBoard = function() {
  board = as.vector(1:nsites, mode='list')
  board = lapply(board, function(i) list("index"=i, "size"=1, "root"=i, "open"=FALSE))
  return(board)
}
 
 
# Initialize virtual top node
initVTN = function() {
  board[[VTN]] <<- list("index"=VTN, "size"=(ncols + 1), "root"=VTN, "open"=FALSE)
  lastrow = nsites - ncols + 1
  for(i in 1:ncols)  {
    board[[i]]$root <<- board[[VTN]]$root
  }  
}
 
 
# Initialize virtual bottom node
initVBN = function() {
  board[[VBN]] <<- list("index"=VBN, "size"=(ncols + 1), "root"=VBN, "open"=FALSE)
  lastrow = nsites - ncols + 1
  for(i in lastrow:nsites) {
    board[[i]]$root <<- board[[VBN]]$root
  }  
}
 
 
# Choose random site
getRandSite = function() {
  randSite = sample(closedSites, size=1)
  closedSites <<- closedSites[-which(closedSites==randSite)]
  return(randSite)
}
 
 
# Checks if site is open
isOpen = function(site) {
  return(!(site%in%closedSites))
}
 
 
# Get neighbors on a square board
getNeighbors = function(index) {
  top = index - ncols
  if(top < 1) top = NULL
 
  bottom = index + ncols
  if(bottom > nsites) bottom = NULL
 
  left = index - 1
  if(!(left%%ncols)) left = NULL
 
  right = index + 1
  if(right%%ncols == 1) right = NULL
 
  return(list("tn"=top, "rn"=right, "bn"=bottom, "ln"=left))
}
 
 
# Join two nodes
join = function(node1, node2) {
 
  r1 = getRoot(node1)
  r2 = getRoot(node2)
  s1 = board[[r1]]$size
  s2 = board[[r2]]$size
 
  if(s1 > s2) {
    board[[r2]]$root <<- board[[r1]]$root
    board[[r1]]$size <<- s1 + s2
  }
 
  else {
    board[[r1]]$root <<- board[[r2]]$root
    board[[r2]]$size <<- s2 + s1 
  } 
}
 
 
####################################################################
# Functions for visualizing output with matrix
####################################################################
# Translate location of site to matrix
trans2display = function(index) {
  col = index%%ncols
  if(!col) col=ncols
  row = ceiling(index/nrows)
  return(list("row"=row, "col"=col))  
}
 
 
# Update the matrix display
write2display = function(transObject) {
  display[transObject$row, transObject$col] <<- 1  
}
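
As a rough sketch of the first improvement mentioned above (none of this is part of the current implementation, and the neighborsFor argument is a hypothetical helper), each site could carry its own neighbor list when the board is built, so the grid assumption lives in one place:

# Sketch: build each site with an explicit neighbor list. neighborsFor is any
# function that returns the valid neighbor indices for site i; a square-grid
# version could simply reuse the logic in getNeighbors above.
initBoardWithNeighbors = function(nsites, neighborsFor) {
  lapply(1:nsites, function(i) {
    list("index" = i, "size" = 1, "root" = i, "open" = FALSE,
         "neighbors" = neighborsFor(i))
  })
}
# main() would then read board[[randSite]]$neighbors instead of calling
# getNeighbors(randSite)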


Shiny Web App

I've created my first Shiny web app! The web app takes in a .csv or Excel file of short URLs and returns the full URLs. This can be a necessary step in some web analyses, and it can take many hours to do by hand if there are thousands of links (as there often are). A screenshot of the web app is below, and you can use the web app by going here:

https://jamesmccammon.shinyapps.io/Forward_Links

Shiny App FwdLinks Screenshot

The web app allows input in .csv and Excel formats and also allows the user to export the full URLs in .csv and Excel formats. If an Excel file is used as input a drop down menu automatically appears so the user can select the appropriate worksheet. There is also a progress bar and error handling in case something goes wrong.

A Shiny web app is composed of two files: a user interface file – ui.R – and a server file – server.R. Shiny was recently updated to allow a single file to contain both functions, but I prefer keeping the files separate for clarity. There is a separate function (not shown here) I wrote in R to get the forward URLs.
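
For reference, a minimal single-file app (unrelated to the app described here, just an illustrative sketch) puts everything in one app.R:

# app.R -- minimal single-file Shiny app (illustrative sketch only)
library(shiny)

ui <- fluidPage(
  numericInput("n", "Number of points", value = 50),
  plotOutput("scatter")
)

server <- function(input, output) {
  output$scatter <- renderPlot({
    plot(rnorm(input$n), rnorm(input$n))
  })
}

shinyApp(ui = ui, server = server)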

ui.R is below.

require(shiny)
 
shinyUI(fluidPage(
  titlePanel("Get Full URLs"),
 
  tags$head(
    tags$style(HTML("   
      .errortext {
        font-size: 1em;
        color: red;
        padding-top: 1em;
        padding-bottom: 1em;
      }
    "))
  ),
 
  sidebarLayout(
    sidebarPanel(
      p('In web analytics it is sometimes necessary to transform short URLs into the full
        URL path. For instance, Twitter users often post URLs that are shortened with 
        Bitly, Ow.ly, Tr.im, or other URL shortening web apps. When doing analysis, however,
        you may want to determine the full URL.'), br(),
 
 
      h5('Step 1: Upload the file with short URLs'),
      radioButtons(inputId = 'inputformat',
                   label = 'What type of file are you uploading?',
                   choices = c('Excel' = 'excel', 'CSV' = 'csv')),         
      numericInput(inputId = 'datacol',
                   label = 'Which column are the short URLs in?',
                   value = 1,
                   min = 1,
                   max = 10,
                   step = 1), br(), br(),      
      helpText('Does the short URL column have a header row?'),
      checkboxInput(inputId = 'dataheader',label = 'Yes',value = TRUE), br(),
      fileInput(inputId = 'datafile', label = ''),
      uiOutput('worksheets'), 
 
      h5('Step 2: Get the full URLs'),
      actionButton(inputId = "getlinks", label = "Get Full URLs!", icon = icon("mail-forward")), br(), br(),
 
 
      h5('Step 3: Download the data'),
      radioButtons(inputId = 'outputformat',
                   label = 'In what format would you like to download the data?',
                   choices = c('Excel' = 'excel', 'CSV' = 'csv')),
      downloadButton('downloadlinks','Download Full URLs')      
      ),
 
    mainPanel(
      #textOutput('testing'),  
      h4('How to use this web app'),
        p(strong('Step 1:'), 'Upload a list of short URLs in .csv, .xls, or .xlsx format.
          The data should be in "long format" (sequential rows of a single column).
          If an Excel file is uploaded a dropdown menu will appear so that you can 
          select the appropriate worksheet.'),
        p('There are three tabs to view data. The first shows a summary of the 
          uploaded data file. The second shows the first 10 rows of the shortened URLs. 
          The third will be blank until the forward URLs are processed.'),
        p(strong('Step 2:'), 'Click "Get Forward URLs" to get the full URLs from their shortened version. 
          This may take several minutes depending on the number of URLs to process. A progress 
          bar will appear along the top of the page that shows the percentage of URLs processed.'),
        p(strong('Step 3:'), 'After the forward URLs are processed the file can be downloaded in .csv or .xlsx 
          format by clicking "Download."'),
        a('Click here to download test files', 
          href= 'https://www.dropbox.com/sh/tq2amcgfm7o99qi/AABrJVU4mNquOw-zmmRbRcaMa?dl=0',
          alt = 'Link to public Dropbox account with test files',
          target = '_blank'), br(), br(),
 
      uiOutput('errortext'),
 
      tabsetPanel(id = "datatabs",
                  tabPanel(title = "Data Summary", value = 'datasumtab', tableOutput('inputsum')),          
                  tabPanel(title = "Input Data", value = 'inputdatatab', tableOutput('inputdata')),
                  tabPanel(title = "Output Data", value = 'outputdatatab', tableOutput('outputdata'))
      )
    )  
  )
))


And here is server.R.

######################################################
# James McCammon
# 28 December 2014
# Get Full URLs function
# Version 4
######################################################
 
# Load libraries and source functions
require(shiny)
require(RCurl)
require(XML)
require(XLConnect)
source("functions.R")
 
# Define server instance
shinyServer(function(input, output, session) {
 
  # Initialize reactive values 
  values = reactiveValues()
 
  # Function to get file path of uploaded file
  filepath = reactive({
    file = input$datafile
    if(is.null(file)) return()
    return(file$datapath)
  })
 
  # If the uploaded file is Excel create a dropdown menu so the user can select the
  # worksheet with the relevant data.
  observe({
    if(is.null(input$dynamicinput)) {
      if(input$inputformat == 'excel' & !is.null(filepath())) {
        possibleerror = try(values$workbook <- loadWorkbook(filepath()), silent = TRUE)
        if(class(possibleerror) == 'try-error') { seterror('excelloaderror'); return() }
        sheetNames = getSheets(values$workbook)
        output$worksheets = renderUI({
          selectInput(inputId = "dynamicinput", 
                      label = "Select Worksheet",
                      choices = c(sheetNames,'Select' = 'select'),
                      selected = 'select')    
        })
      }
    }
  })
 
  # Create a table with summary data of the file
  output$inputsum = renderTable({
    if(is.null(input$datafile)) return()
    return(input$datafile)
  })
 
  # Create a table with the first 10 rows of the input data
  output$inputdata = renderTable({
    if(is.null(input$datafile) | is.na(input$datacol)) return()
      clearerror()
      # Load the relevant data depending on the file type. If the specified file type doesn't
      # match the loaded file throw an error.
      inputdata = switch(input$inputformat,
                   'excel' = {
                     if(is.null(input$dynamicinput)) return()
                     if(input$inputformat == 'excel' & input$dynamicinput == 'select') return()
                     tryCatch({ readWorksheet(values$workbook, sheet = input$dynamicinput, header = input$dataheader)
                     }, error = function(e) { seterror('excelreaderror'); return() })
                   },
                   'csv' = {
                     tryCatch({ read.csv(file = filepath(), header = input$dataheader, stringsAsFactors = FALSE)
                     }, error = function(e) { seterror('csvloaderror'); return() })
                   })
      # Take the data and get out the first 10 rows. If there is an error it's likely because
      # the specified worksheet or data column has no data. Tell the user this if an error occurs.
      values$inputdata = inputdata
      possibleerror = try(inputdata <- inputdata[[input$datacol]], silent = TRUE)
      if(class(possibleerror) == 'try-error') { seterror('subscripterror'); return() }
      inputdata = as.data.frame(inputdata[1:10])
      names(inputdata)[1] = "short_url_preview" 
      return(inputdata)   
  })
 
  # When the users pushes the "Get Full URLs" button get the URLs by calling the getFWLinks function
  # found in functions.R. If there is no input data let the user know they forgot to load it.
  observe({
    input$getlinks
    if(input$getlinks == 0) return()
    else {
      updateTabsetPanel(session, inputId = "datatabs", selected = "outputdatatab")
      output$outputdata = renderTable({
      possibleerror = try(values$output <- isolate(getFwLinks(as.data.frame(values$inputdata), input$datacol)), silent = TRUE)
      if(class(possibleerror) == 'try-error') { seterror('nodataerror'); return() }  
      return(as.data.frame(values$output[1:10,]))  
      })  
    }
  })
 
  # When the user selects "Download Full URLs" download them in the specified format.
  # Note the file.rename function is used to handle the temporary filepath created by Shiny.
  output$downloadlinks = downloadHandler(
    filename = function() {
      filename = switch(input$outputformat,
                        'excel' = 'Full_URLs.xlsx',
                        'csv' = 'Full_URLs.csv'
                        )
    },
    content = function(file) {
      if(input$outputformat == 'csv') {
        write.csv(values$output, 'temp.csv', row.names = FALSE)
        file.rename('temp.csv', file)    
      } 
      else {
        outputdata = loadWorkbook('temp.xlsx', create = TRUE)
        createSheet(object = outputdata, name = 'Full_URLs')
        writeWorksheet(outputdata, data = values$output, sheet = 'Full_URLs')
        saveWorkbook(outputdata, 'temp.xlsx')
        file.rename('temp.xlsx', file)    
      }
    }
  )
 
  # Create a function to output various error messages
  seterror = function(error) {
    errormessage = switch(error,
                    'excelloaderror' =  'Error: There was an error loading the file. 
                                         Are you sure it is an Excel file? Try changing
                                         your selection to CSV.',
                    'excelreaderror' =  'Error: The workbook loaded, but there was
                                         an error reading the specified worksheet',
                    'csvloaderror'   =  'Error: There was an error loading the file. 
                                         Are you sure it is a csv file?',
                    'fullurlserror'  =  'Error: There was a problem getting the full URLs.
                                         Are you sure you selected the correct data column?',
                    'subscripterror' =  'Error: There does not seem to be any data there.',
                    'nodataerror'    =  'Error: Did you forget to upload data?')
 
    output$errortext = renderUI({
      tags$div(class = "errortext", checked = NA, 
               tags$p(errormessage))
    })
  }
 
  # Define a function to clear error messages.
  clearerror = function() {
    output$errortext = renderUI({
      p('')
    })
  }  
})


Twitter Data!!!

I wanted to experiment with getting some Twitter data over the Thanksgiving break for one of my clients that does a fair amount of retail sales. I connected to the Twitter public stream using the "streamR" package and filtered on around 30 hashtags. The idea was – well, is; Cyber Monday is only just ending – to import the data into CartoDB. I tried this with some sample Twitter data without worrying about filtering by hashtag and it looked great (CartoDB seems to be a pretty great tool). This "analysis" is more of what I like to call a "bonus treat" than actionable data, but why not have some fun? And who knows, maybe in analyzing the data some interesting trends will pop out. The final batch of data is just finishing up (I collected 8 .json files to keep the file sizes manageable since this is an experiment).
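
For context, the collection step looked roughly like the sketch below. The hashtags shown and the my_oauth credential object are placeholders rather than the actual values I used; filterStream() is the streamR function that writes the raw stream to a .json file.

# Sketch of the collection step (placeholder hashtags and credentials)
library(streamR)
filterStream(file.name = "tweets_batch_01.json",      # one of the 8 .json files
             track = c("#BlackFriday", "#CyberMonday"),
             timeout = 3600,                           # collect for an hour, then stop
             oauth = my_oauth)                         # OAuth token created beforehand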

I wrote a quick little function that reads in all of the .json files, parses them, and does an initial reduction to the main columns I'm interested in (the original .json object creates an R dataframe with 43 columns). For this function I played around with the "foreach" package, which I really like and which also lets you run functions on multiple processors. The documentation says it may not work correctly for connections – and opening a file is a connection – but I used R to time the two methods – one with a single processor and one using two processors – and the loading function was about 30% faster using two.

source(file = "TwitterAnalysisFunctions.R")
require(doParallel)
 
cl = makeCluster(2)
registerDoParallel(cl)
fileList = list.files(getwd())
TwitterFiles = grep(pattern = ".json", x = fileList)
TweetDF = foreach(file=1:length(TwitterFiles), .combine = rbind, .packages = 'streamR') %dopar% { 
                      loadTweets(parseTweets(fileList[TwitterFiles[file]],verbose = FALSE)) 
}
stopCluster(cl)


Here are the helper functions I wrote to perform various common tasks. I used the "dplyr" package as much as possible because of its speed and adeptness at handling large dataframes.

#################################################################################################
# A function to load Tweets
#################################################################################################
loadTweets = function(df) {
  require(dplyr)
  Tweets = 
    df %>% 
    # Select needed columns
    select(Tweet = text,
           Retweets = retweet_count,
           Time = created_at,
           Language = lang,
           Location = location,
           Time.Zone = time_zone,
           Lat = lat,
           Long = lon) %>%
    mutate(Implied.Location = ifelse(!is.na(Lat), "Lat/Long Available", Location)) 
  return(Tweets)
}
 
 
#################################################################################################
# A function to remove spurious characters from a string
#################################################################################################
cleanText = function(df, colName) {  
  # Create list of strings
  stringsList = strsplit(as.vector(df[[colName]], mode = "character"), split = " ")
  # Clean every entry
  cleanTextList = sapply(stringsList, FUN = function(text) {
    nonLatinStrings = grep("s2Rz", iconv(x = text, from = "latin1", to = "ASCII", sub = "s2Rz"))
    if(length(nonLatinStrings) > 0) strings2keep = text[-nonLatinStrings]
    else strings2keep = text
    cleanText = paste(strings2keep, collapse = " ")
    return(cleanText)
  })
  df[[colName]] = cleanTextList
  return(df)
}
 
 
#################################################################################################
# A function to extract Lat/Longs
#################################################################################################
extractLatLong = function(df, searchCol, latCol, longCol) {
  require(stringr)
  latlongFinder = "(\\-?[0-9]{1,3}(\\.[0-9]+)?),\\s*(\\-?[0-9]{1,3}(\\.[0-9]+)?)"
  latlongs = strsplit(x = str_extract(df[[searchCol]], pattern = latlongFinder), split = ",")
  rowIndex = which(!is.na(latlongs))
  df[rowIndex , latCol] = unlist(latlongs[rowIndex])[seq(from = 1, to = length(unlist(latlongs[rowIndex])), by = 2)]
  df[rowIndex , longCol] = unlist(latlongs[rowIndex])[seq(from = 2, to = length(unlist(latlongs[rowIndex])), by = 2)]
  df[rowIndex, searchCol] = "Lat/Long Available"
  return(df)
}
 
#################################################################################################
# A function to prepare Implied.Location for geocoding
#################################################################################################
prepareLocForGeoCodeing = function(df) {
  excludePattern = paste0("\\.com|\\.net|\\.org|planet|world|xbox|earth|my |world|home|sofa|couch|",
                          "globe|global|here|facebook|internet|!|/|Lat/Long Available")
  cleanLocations = df %>%
    select(Implied.Location) %>% 
    filter(!grepl(x = df[["Implied.Location"]], pattern = excludePattern, ignore.case = TRUE)) %>%
    do(cleanText(.,"Implied.Location")) %>%
    distinct()
  return(cleanLocations)
}
 
#################################################################################################
# A function to remove blank entries
#################################################################################################
filterBlanks = function(df, colName) {
  require(dplyr)
  return(filter(df, df[[colName]] != ""))
}
 
#################################################################################################
# A function to get unique Tweets dataframe
#################################################################################################
getUniqueTweetsDF = function(df, colName) {
  require(dplyr)
  df[[colName]] = gsub(pattern = "RT ", replacement = "", x = df[[colName]])
  Tweets = df %>% 
    filter(Language == "en") %>%
    distinct(.) 
  return(Tweets)
}
 
#################################################################################################
# A function to get unique Tweets vector
#################################################################################################
getUniqueTweetsVector = function(df, colName) {
  return(getUniqueTweetsDF(df, colName)[[colName]])
}
 
#################################################################################################
# A function to get unique entries and return a dataframe
#################################################################################################
getUniqueEntriesDF = function(df, colName) {
  require(dplyr)
  return(distinct(df, df[[colName]]))
}
 
#################################################################################################
# A function to get unique entries and return a vector
#################################################################################################
getUniqueEntriesVector = function(df, colName) {
  require(dplyr)
  return(distinct(df, df[[colName]])[colName])
}
 
 
#################################################################################################
# A function to search for Tweets containing certain words
#################################################################################################
searchForTweets = function(df, colName, wordList) {
  require(dplyr)
  wordList = paste(as.vector(wordList, mode = "character"), collapse = "|")
  df = getUniqueTweetsDF(df, colName)
  return(filter(.data = df, grepl(x = df[[colName]], pattern = wordList)))
}
 
#################################################################################################
# A function to make a word cloud
#################################################################################################
makeWordCloud = function(df, colName) {
  require(tm)
  require(wordcloud)
 
  # Clean data using the tm package and the custom getUniqueTweetsVector helper
  textCorpus = Corpus(VectorSource(getUniqueTweetsVector(df, colName)))
  textCorpus <- tm_map(textCorpus, content_transformer(tolower))
  textCorpus <- tm_map(textCorpus, removePunctuation)
  textCorpus <- tm_map(textCorpus, function(x) removeWords(x,stopwords("english")))
 
  # Make word cloud
  wordcloud(textCorpus, min.freq = 10, max.words = 300, random.order = FALSE, colors = brewer.pal(8,"Dark2"))
}


R Functions to Download and Load Data

I've been writing a lot of R code recently, and since I'm always downloading and loading files I decided to write a couple of helper functions. The first function, "getData", can download and unzip (if necessary) files from the internet, making minor adjustments to the download method to account for differences in operating system. It's certainly not as robust as the "downloader" package, but it's useful for most simple operations.

The next function, "loadData", loads a set of files and assigns each one a name in R that matches its file name. I added a simple progress bar to this function since it can sometimes take a while to load several large files. You can see how to use the two functions here:

# Specify directories and file names
fileURL = 'A file URL'
fileName = 'AZippedFile.zip'
subDir = 'aSubDirectory'
# Download data and put in directory using custom function
getData(fileURL, fileName, subDir, compressed = TRUE)
 
# Load required files into R 
# Get file names
fileNames = list.files(subDir, full.names = TRUE, recursive = TRUE)
# Specify required files
requiredFiles = c("file1.txt", "file2.txt", "file3.txt")
# Get required files from the list of files using custom function
loadData(fileNames, requiredFiles, envir = .GlobalEnv)


Here are the two custom functions:

# Get data function
getData = function(fileURL, fileName, mainDir = getwd(), subDir = ".", compressed = FALSE, OS = "Unspecified") {
  # If the data doesn't exist download it and setup a folder to store it
  if (!file.exists(file.path(mainDir, subDir, fileName))) {
    print("Downloading data and creating file directory...")
    method = switch(tolower(OS),
                    "windows" = "internal",
                    "mac" = "curl",
                    "linux" = "wget",
                    "auto")
    dir.create(file.path(mainDir, subDir), showWarnings = FALSE)
    filePath = file.path(mainDir, subDir, fileName)
    download.file(fileURL, destfile = filePath, method)
    if (compressed) unzip(filePath, exdir = subDir)
  }
  else {
    print("Files already downloaded. Ready to read into R memory.")
  }
}
 
# Load data function
loadData = function(fileNames, requiredFiles, envir = environment()) {
  numFiles = length(requiredFiles)
  pb = txtProgressBar(min = 0, max = numFiles, style = 3)
  # Load each file into R
  print("Loading files into memory. Please wait.")
  for (file in 1:numFiles) {
    elem = requiredFiles[file]
    filePath = paste0("/",elem)
    R_Object = gsub(".txt", "", elem)
    assign(R_Object, read.table(fileNames[grep(filePath, fileNames)], header=FALSE), envir)
    setTxtProgressBar(pb, file)
  }
  close(pb)
  print("Files loaded.")
}


Splines!

Splines are a popular way to fit data when the data has some sort of global curvature (see the graph below for an example). A cubic spline is a popular example. It fits a cubic polynomial within each neighborhood between so-called "knots." So for the graph below you might have a knot at age 0, another knot at age 50, yet another at age 100, and so on. Within each segment of the data defined by these knots you fit a cubic polynomial. You then require that, at the knot locations where the different polynomials meet, they join together smoothly (this is done by enforcing that the first and second derivatives be equal at the knot).
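
Concretely, if f_i and f_{i+1} are the cubic pieces that meet at a knot k, the smoothness conditions at that knot are:

\mathbf{f_i(k) = f_{i+1}(k), \quad f_i'(k) = f_{i+1}'(k), \quad f_i''(k) = f_{i+1}''(k)}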

Below a spline with 20 evenly spaced knot locations is shown for some dummy data. I’ve also graphed a 95% confidence interval of the spline predictions in dotted red.

cubic spline

Here’s the code I used to produce it.

# Load libraries
library(reshape)
library(mgcv)
library(ggplot2)
library(data.table)
 
# Load data
setwd("Masked")
data <- read.csv(file = "Masked", header = TRUE)
attach(data)
 
# Create 20 evenly spaced knots
splitDistance <- (max(age) - min(age))/19
knots <- seq(min(age), max(age), by = splitDistance)
 
# Fit a penalized cubic regression spline with 20 evenly spaced 
# knots using cross validation
model <- gam(strontium.ratio ~ s(age, k=20, fx=F, bs="cr", m=2), 
data=data, family=gaussian, sp = .45, knots = list(age = knots))
# Predict points to plot data 
plotPoints <- seq(min(age), max(age), by = .05)
yHat <- predict.gam(model, list(age = plotPoints), se.fit = TRUE)
 
# Create 95% confidence interval
upper95 <- yHat[[1]] + 1.96*yHat[[2]]
lower95 <- yHat[[1]] - 1.96*yHat[[2]]
 
# Prepare data for plotting
spline <- as.data.frame(cbind(plotPoints,yHat[[1]]))
setnames(spline,"V2","value")
CI <- as.data.frame(cbind(plotPoints,upper95,lower95))
CI <- melt(CI, id = "plotPoints", 
variable_name = "Confidence.Interval")
 
# Plot data
ggplot() +
geom_point(data = data, aes(x = age, y = strontium.ratio)) +
geom_line(data=spline, aes(x = plotPoints, y = value), 
colour = '#3399ff') +
geom_line(data=CI, aes(x = plotPoints, y = value, 
group = Confidence.Interval),linetype="dotted",colour = '#ff00ff') +
ggtitle("Cubic Spline") + 
xlab("Age") + 
ylab("Strontium Ratio")


To show the effect of knot placement we can generate 20 random knot locations for each of several models and compare them. Since these are so-called natural splines, linearity is enforced beyond the boundary knots. Here, Model 3 had its first knot generated quite far to the right on the Age axis, so it is linear to the left of that point.

10 Splines

Again, here’s the code.

# Create 10 models with random knot placement
set.seed(35)
models <- as.data.frame(lapply(1:10,
                        FUN = function(i) {
                        knots <- sort(runif(20, min(age), 
                        max(age)))
                        model <- gam(strontium.ratio ~ 
                        s(age, k=20, fx=F, bs="cr", m=2), 
                        data=data, family=gaussian, 
                        method="GCV.Cp", 
                        knots = list(age = knots))
                        return(predict(model,
                        list(age = plotPoints)))
                        }))
 
colnames(models) <- lapply(1:10,
                    FUN = function(i) {
                    return(paste("Model",i))
                    })
 
# Plot splines
models <- cbind(plotPoints, models)
models <- melt(models, id = "plotPoints", variable_name = "Splines")
 
ggplot() +
geom_point(data = data, aes(x = age, y = strontium.ratio)) +
geom_line(data = models, aes(x = plotPoints, y = value, 
group = Splines, colour = Splines)) +
ggtitle("A Plot of 10 Splines") + 
xlab("Age") + 
ylab("Strontium Ratio")


Lasso and Ridge Regression in R

Lasso and ridge regression are two alternatives – or perhaps complements – to ordinary least squares (OLS). Both start with the standard OLS objective and add a penalty for model complexity. The only difference between the two methods is the form of the penalty term: ridge regression uses the \mathbf{\mathit{l}_{2}}-norm while lasso regression uses the \mathbf{\mathit{l}_{1}}-norm. Specifically, the forms are shown below.

Ridge Regression: \mathbf{\hat{\beta}^{ridge} = \displaystyle arg\min_{\beta}\sum^{n}_{i=1}(y_i-(\beta_0 + \beta^Tx_i))^2 + \lambda\|\beta\|_{2}^{2}}

Lasso Regression: \mathbf{\hat{\beta}^{lasso} = \displaystyle arg\min_{\beta}\sum^{n}_{i=1}(y_i-(\beta_0 + \beta^Tx_i))^2 + \lambda\|\beta\|_{1}}

Ridge regression has a closed-form solution that is quite similar to that of OLS: \mathbf{\hat{\beta}^{ridge} = (X^TX + \lambda I)^{-1}X^Ty}. The only difference is the \mathbf{\lambda I} term. I read (or heard) somewhere that ridge regression was originally designed to deal with invertibility issues in OLS: adding the small \mathbf{\lambda I} term moves the inner product matrix away from possible singularity.

In ridge regression, when \mathbf{\lambda} is close to zero the ridge solution approaches the least squares solution. As \mathbf{\lambda} increases the coefficients are forced toward zero. You can see this behavior graphed below for a sample dataset.

CoefvLambda

The code to produce this graph is quite simple.

# Load libraries
library(MASS)
library(reshape)
library(psych)
library(ggplot2)
 
# Get data
data <- read.csv(file = "Masked")
# Prepare data
X <- data[,-10]
X <- scale(X, center = TRUE, scale = TRUE)
response <- data[,10]
response <- scale(response, center = TRUE, scale = FALSE)
 
# Get ridge solution for every lambda
lambda <- 1:100
betas <- sapply(1:length(lambda),
               FUN = function(i) {
               return(solve(t(X)%*%X + diag(x = lambda[i],
               ncol(X)) )%*%t(X)%*%response)
               })
rownames(betas) <- colnames(X)
 
# Prepare data for plot
betas <- data.frame(t(betas))
betas <- cbind(lambda, betas)
betas <- melt(betas, id = "lambda", variable_name = "covariate")
 
# Plot
ggplot(betas, aes(lambda, value)) + 
geom_line(aes(colour = covariate)) + 
ggtitle("A Plot of Coefficients vs. Lambda for Shellfish Data") + 
xlab("Lambda") + 
ylab("Coefficient Value") + 
theme(legend.position = "none")

Lasso, on the other hand, does not have a closed-form solution, so numerical techniques are required. The benefit of lasso, however, is that coefficients are driven to exactly zero; in this way lasso acts as a sort of model selection process. The reason ridge solutions do not hit exactly zero but lasso solutions do lies in the differing geometry of the \mathbf{\mathit{l}_{1}}-norm and \mathbf{\mathit{l}_{2}}-norm. In two dimensions the difference can be seen in the picture below. The lasso constraint is a diamond aligned with the \mathbf{\beta_{2}} axis (in this case). The contours of the least squares objective (the red circles) can easily intersect the diamond at its tip and force \mathbf{\beta_{1}} to zero. This is much less likely in the ridge case on the right, where the \mathbf{\mathit{l}_{2}}-norm constraint inscribes a circle (in fact there is zero probability that \mathbf{\beta_{1}} will be exactly zero).
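
The picture below corresponds to writing each penalized problem in its equivalent constrained form:

Lasso: \mathbf{\displaystyle \min_{\beta}\sum^{n}_{i=1}(y_i-(\beta_0 + \beta^Tx_i))^2 \;\; \textrm{subject to} \;\; \|\beta\|_{1} \le t}

Ridge: \mathbf{\displaystyle \min_{\beta}\sum^{n}_{i=1}(y_i-(\beta_0 + \beta^Tx_i))^2 \;\; \textrm{subject to} \;\; \|\beta\|_{2}^{2} \le t}

where the budget t shrinks as \mathbf{\lambda} grows; the diamond and the circle in the figure are the boundaries of these two constraint regions.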

[Figure: lasso (l1) and ridge (l2) constraint regions in two dimensions]

For lasso regression, then, solutions end up looking like the graph below. As \mathbf{\lambda} increases each coefficient is eventually driven to zero.

betavloglambda

All of this raises the question: what value of lambda is ideal? To determine this, cross validation (CV) is often used. For each \mathbf{\lambda} it is possible to use CV to obtain a good estimate of the model error, and you repeat this process for all \mathbf{\lambda} within some range. The plot ends up looking like the graph below, where the green line is the result of generalized cross validation and the red line uses leave-one-out cross validation. It's interesting to note that while CV sounds like an iterative process (and in concept it is), there are fairly simple closed-form solutions for computing the CV error. I was very surprised when I first learned that! You can see below that our model should set lambda to around 4 or 5 to minimize the error (the precise value can be found by examining the stored output).
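
For reference, with hat matrix \mathbf{H(\lambda) = X(X^TX + \lambda I)^{-1}X^T}, the leave-one-out CV error used in the code below is

\mathbf{\mathrm{LOOCV}(\lambda) = \frac{1}{n}\displaystyle\sum^{n}_{i=1}\left(\frac{y_i - \hat{y}_i}{1 - H_{ii}}\right)^{2}}

and the GCV score simply replaces each \mathbf{H_{ii}} with the average \mathbf{\mathrm{tr}(H)/n}.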

CV Scores

The code to produce this graph is shown below.

# Get data
data <- read.csv(file = "Masked")
# Prepare data
X <- data[,-10]
X <- scale(X, center = TRUE, scale = TRUE)
response <- data[,10]
response <- scale(response, center = TRUE, scale = FALSE)
 
lambdaSeq <- 1:100
LOOCVscore <- rep(0, length(lambdaSeq))
GCVscore <- rep(0, length(lambdaSeq))
 
# Get leave-one-out CV score
LOOCVscore <- sapply(lambdaSeq,
                     FUN = function(i) {
                     lambda <- lambdaSeq[i]
                     hat <- X %*% (solve(t(X) %*% X + 
                     lambda*diag(ncol(X))) %*% t(X))
                     prediction <- hat %*% response
                     Lii <- diag(hat)
                     return(mean(((response - prediction)/
                     (1 - Lii))^2)) 
                     })
 
# Get GCV score
GCVscore <- sapply(lambdaSeq,
                   FUN = function(i) {
                   lambda <- lambdaSeq[i]
                   hat <- X %*% (solve(t(X) %*% X + 
                   lambda*diag(ncol(X))) %*% t(X))
                   prediction <- hat %*% response
                   Lii <- tr(hat)/nrow(X)
                   return(mean(((response - prediction)/
                   (1 - Lii))^2)) 
                   })
 
CV <- as.data.frame(cbind(lambdaSeq, LOOCVscore, GCVscore))
CV <- melt(CV, id = "lambdaSeq", variable_name = "CVscores")
ggplot(CV, aes(lambdaSeq, value)) + 
geom_line(aes(colour = CVscores)) +
ggtitle("A Plot of Cross Validation Scores vs. Lambda") + 
xlab("Lambda") + 
ylab("Cross Validation Score")


Another common statistic often presented when doing lasso regression is the shrinkage factor, which is just the sum of the absolute values of the lasso coefficients divided by the same measure for the OLS solution. Remember that small \mathbf{\lambda} forces the lasso toward the OLS solution, which implies a shrinkage factor near one should also correspond to the lasso solution being close to OLS. You can see that pattern holds below. (Code to produce this graph is shown below.)
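
In symbols, the shrinkage factor at a given \mathbf{\lambda} is

\mathbf{s(\lambda) = \displaystyle\frac{\|\hat{\beta}^{lasso}(\lambda)\|_{1}}{\|\hat{\beta}^{OLS}\|_{1}}}

which runs from 0 (everything shrunk to zero) to 1 (no shrinkage).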

Lassobetas shrinkage

As I mentioned, there is no closed-form solution for the lasso coefficients, but there are a number of methods to search for these betas. Luckily, minimizing the lasso objective is a convex optimization problem, so given enough iterations we can converge very close to the global solution. The method presented below is my implementation of the shooting algorithm. The shooting algorithm is stochastic: it picks a random beta, removes it from the model, calculates the effect of this deletion, and assigns the left-out beta a value proportional to this effect, based on a formula I won't present here. We were meant to determine our own criteria for stopping the loop (you could theoretically repeat this process forever). I chose a method that makes sure all betas have been updated and then calculates the MSE of the current model. If the change is within some tolerance – after playing around, the suggested .001 worked pretty well – the loop stops and we move on to the next lambda value. Otherwise we keep iterating until all betas have been updated once more and the tolerance check repeats.

We were told to plot our iterations as a function of the shrinkage factor. The performance using my convergence criterion was much better than with the 1,000 iterations we were told to start with as a test case. The median number of iterations for my solution to converge quite close to the 1,000-iteration result was 103, meaning my algorithm improved performance by about 10 times (at least as measured by iterations).

iterationsvsConv

# Read in data
data <- read.csv(file = "Masked")
 
# Prepare data
X <- as.matrix(data[,-10])
X <- scale(X, center = TRUE, scale = TRUE)
response <- data[,10]
response <- scale(response, center = TRUE, scale = FALSE)
 
# Initialize variables
set.seed(35)
lambdaSeq <- seq(from = 1, to = 501, by = 10)
betaVectorConv <- matrix(NA, ncol = 7, nrow = length(lambdaSeq))
iterationsVector <- rep(NA, length(lambdaSeq))
aj <- 2*colSums(X^2)[1]
 
# Get lasso solution using shooting algorithm for each lambda
for(i in 1:length(lambdaSeq)) {
  lambda <- lambdaSeq[i]
  # Start from the least squares solution
  betas <- solve(t(X) %*% X) %*% t(X) %*% response
  betaUpdate <- rep(NA, 7)
  exit <- FALSE
  iterations <- 0
  MSECurrent <- 0
  MSEPast <- 0
 
  while(exit == FALSE) {
    # Pick a random coordinate and mark it as updated
    j <- sample(1:7, 1)
    betaUpdate[j] <- 1
    iterations <- iterations + 1
    xMinJ <- X[,-j]
    betaMinJ <- betas[-j]
    xj <- X[,j]
    cj <- 2*(t(xj) %*% (response - xMinJ %*% betaMinJ))
 
    # Soft-threshold update for the selected coordinate
    if(cj < -lambda)      { betas[j] <- (cj + lambda)/aj } 
    else if(cj > lambda)  { betas[j] <- (cj - lambda)/aj } 
    else                  { betas[j] <- 0                }
 
    # Once every beta has been updated at least once, check convergence
    if(sum(is.na(betaUpdate)) == 0) {
      yHat <- X %*% betas
      MSECurrent <- sum((response - yHat)^2)
      if(abs(MSECurrent - MSEPast) < .001) {
        iterationsVector[i] <- iterations
        exit <- TRUE
      }
      MSEPast <- MSECurrent
      betaUpdate <- rep(NA, 7)
    }
  }
  betaVectorConv[i,] <- betas
}
 
# Calculate shrinkage
colnames(betaVectorConv) <- colnames(X)
betaLS <- solve(t(X) %*% X) %*% t(X) %*% response
betaLS <- sum(abs(betaLS))
betaLasso <- abs(betaVectorConv)
betaLasso <- apply(betaLasso, 1, sum)
betaShrink <- betaLasso/betaLS
 
# Prepare shrinkage data for plot
betaShrinkPlot <- as.data.frame(cbind(betaShrink, betaVectorConv))
betaShrinkPlot <- melt(betaShrinkPlot, id = "betaShrink",
variable_name = "betas")
 
# Plot shrinkage
ggplot(betaShrinkPlot, aes(betaShrink, value)) + 
geom_line(aes(colour = betas)) +
ggtitle("Plot of Beta Values vs. Shrinkage Factor with Conv. Crit.") + 
xlab("Shrinkage Factor") + 
ylab("Beta Value")
 
# Prepare and plot iterations
betaIterPlot <- as.data.frame(cbind(betaShrink, iterationsVector))
ggplot(betaIterPlot, aes(betaShrink, iterationsVector)) + 
geom_point(aes(colour = iterationsVector)) +
ggtitle("Plot of Iterations vs. Shrinkage Factor with Conv. Crit.") + 
xlab("Shrinkage Factor") + 
ylab("Iterations")


Markov Chain Monte Carlo Model Composition

One important element of prediction and classification using regression methods is model selection. A first principle of most modeling is parsimony: we want to include as few predictors (explanatory variables) as necessary to do the best we can at predicting the outcome. Often not every predictor is necessary, so there is some combination that strikes a balance between parsimony and accuracy.

There are simple measures to help find a model with this balance. One such measure is the Akaike Information Criterion (AIC), named after the Japanese statistician Hirotugu Akaike (who was quite avuncular, if his Wikipedia profile picture is any indication of his temperament). In math speak, the AIC is \mathbf{2k - 2\ln(\hat{L})}, where \mathbf{\hat{L}} is the maximized value of the likelihood function and \mathbf{k} is the number of parameters: the likelihood term rewards fit while the penalty term grows as more predictors are added to the model. Lower AIC is better.
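
As a quick, self-contained illustration (not from the assignment; the data here are simulated and the variable names are made up), base R will compute the AIC of a fitted logistic regression directly:

# Hypothetical example: a pure-noise predictor usually raises the AIC
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x3 <- rnorm(n)   # x3 is unrelated to y
y  <- rbinom(n, 1, plogis(0.8*x1 - 0.5*x2))

AIC(glm(y ~ x1 + x2,      family = binomial))    # the "true" model
AIC(glm(y ~ x1 + x2 + x3, family = binomial))    # typically a higher (worse) AIC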

There are simple ways to select a model. One easy way is to try every possible combination of predictors and see which one has the best AIC. At some point adding additional predictors fails to produce much improvement in predicting the outcome and we know we can stop. This is pretty easy if you have five or ten predictors, but the number of models to test grows exponentially.
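
Here is a minimal sketch of that brute-force idea (my own illustration, not assignment code), assuming a small data frame df whose binary response sits in a column named y and whose remaining columns are the candidate predictors:

# Hypothetical brute force: fit every subset of predictors and keep
# the formula with the lowest AIC (only feasible when p is small)
bestSubsetAIC <- function(df, response = "y") {
  predictors <- setdiff(names(df), response)
  # Start from the null (intercept-only) model
  bestFormula <- reformulate("1", response)
  bestAIC <- AIC(glm(bestFormula, family = binomial, data = df))
  for (k in seq_along(predictors)) {
    for (vars in combn(predictors, k, simplify = FALSE)) {
      f <- reformulate(vars, response)
      candidateAIC <- AIC(glm(f, family = binomial, data = df))
      if (candidateAIC < bestAIC) { bestAIC <- candidateAIC; bestFormula <- f }
    }
  }
  list(formula = bestFormula, AIC = bestAIC)
}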

For example, a recent dataset we were given for homework in a graduate statistics course I’m taking included 60 predictors. Each predictor can be either added to the model or left out, resulting in \mathbf{2^{60}} (about \mathbf{10^{18}}) possible models (including the null model with only the intercept). To give a sense of how large that is, it’s beginning to approach the number of stars in the observable universe (roughly \mathbf{10^{24}}). Even modern computers would take too long to test every possible model.

And some models have even more variables. Many more. High-dimensional models with thousands of predictors are not uncommon. One popular example in machine learning uses MRI images where each pixel (called a voxel in this setting because it represents a 3-D slice of the brain) is a predictor. Here there are 20,000 possible predictors: each voxel of the image can be in or out of the model, representing whether that part of the brain is important in predicting some neurologic outcome. (In practice – as common sense would imply – there is some spatial dependency among neighbouring voxels, which makes the task slightly different.)

One alternative to testing every possible combination of predictors is Markov Chain Monte Carlo Model Composition (MC3). It’s just a fancy way of saying that we choose a model at random and then move randomly (but intelligently) toward models that have a better AIC score.

We were asked to implement such an algorithm recently for my class. It was a pretty fun project that took some creative thinking to make efficient. The MC3 algorithm is pretty simple. We pick a random model with a random number of predictors. Then we look at all of the valid models that leave one of these predictors out, as well as the valid models that add one predictor. This set is called the “neighborhood” of the model. We choose one of these new models in the neighborhood at random and compute a modified version of the AIC (in my code, a model’s score is its negative AIC minus the log of the size of its neighborhood). If it’s better than our current model’s score we move to that model and repeat the process. Otherwise we randomly select another candidate model, compute its score, compare it to the current model, and repeat until we move to a new model.
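
To make the “neighborhood” idea concrete, here is a toy example using the helper functions defined in the code further down. Suppose there are four possible predictors and the current model contains predictors 1 and 3:

# Neighborhood of the model {1, 3} when predictors 1 through 4 exist
findNeighboursBelow(c(1, 3))            # list(3, 1): drop predictor 1 or predictor 3
findNeighboursAbove(c(1, 3), last = 4)  # list(c(1,3,2), c(1,3,4)): add predictor 2 or 4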

We were given several starter functions by our professor. All three are pretty trivial to implement oneself aside from isValidLogisticRCDD(), which implements a model validity test from a statistics paper using deep theoretical concepts far beyond my mathematical knowledge.

My results are shown below. We were meant to run the algorithm for 10 instances of 25 iterations each. I store all of the relevant information in a list, so when the algorithm is done running I get a nice little output. I just realized the “Iteration [number]” at the top of each column should really say “Instance”. Oops! I went back and fixed that in my code.

Screen Shot 2014-04-18 at 7.22.21 PM

The R code is shown below. It looks a little funky because we are transitioning to writing in C and our professor wants us to start getting into the C mindset. I’m also trying to write more loops using the apply() family of functions instead of traditional for loops. There are all kinds of debates online about when apply() is faster than a for loop and whether that’s even still true in general. At any rate, it’s good to familiarize myself with different techniques and look at problems in different ways. I wrote several helper functions to find a model’s neighborhood and test the valid members in it. The MC3 function appears near the bottom, along with a main() function.
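
As a trivial (made-up) illustration of what I mean by that style shift, here is the same computation written both ways:

# A for loop and lapply() computing the same squares
squaresFor <- rep(NA, 5)
for (i in 1:5) { squaresFor[i] <- i^2 }
squaresApply <- unlist(lapply(1:5, function(i) i^2))
identical(squaresFor, squaresApply)  # TRUE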

#######################################################
# James McCammon 
# Description:
# MC3 uses several custom helper functions to determine
# the best AIC of a model using a markov chain process.
#######################################################
 
 
#######################################################
# Given Functions
#######################################################
# Signatures of the three functions provided by the professor (bodies omitted):
#   getLogisticAIC(response, explanatory, data)      - AIC of a logistic model
#   isValidLogistic(response, explanatory, data)     - quick empirical validity check
#   isValidLogisticRCDD(response, explanatory, data) - more robust theoretical validity check
 
 
 
##########################################################
# Custom Functions
##########################################################
 
############### checkModelValidity Function ###############
# Checks if a particular model is valid by first doing
# preliminary empirical test and then moving on to a more
# robust theoretical test
checkModelValidity <- function(response,explanatory,data) {
  prelimCheck <- isValidLogistic(response,explanatory,data)
  if(!prelimCheck) { return(FALSE) }
  robustCheck <- isValidLogisticRCDD(response,explanatory,data)
  if(!robustCheck) { return(FALSE) }
  return(TRUE) 
}
 
############### findNeighboursBelow Function ###############
# Given an arbitrary vector this function finds the set of
# all vectors generated by leaving one element out.
findNeighboursBelow  <- function(elements) {
  if(is.null(elements)) { return(NULL) }
  length = length(elements);
  neighbours = vector("list",length);
  neighbours = lapply(1:length,
                      FUN = function(i) {
                      return(elements[-i])
                      })
  return(neighbours);
}
 
############### findNeighboursAbove Function ###############
# Given an arbitrary vector this function finds the set of
# all vectors generated by adding an element not already in
# the set given an upper bound of possible elements to add.
findNeighboursAbove <- function(elements, last) {
  fullModel = seq(1,last);
  notInModel = setdiff(fullModel,elements);
  length = length(notInModel);
  neighbours = vector("list",length);
  neighbours = lapply(1:length,
                      FUN = function(i) {
                      return(union(elements, notInModel[i]))
                      })
  return(neighbours);
}
 
############### findValidNeighboursAbove Function ###############
# This function finds all of the valid neighbouring models produced
# by using the findNeighboursAbove function.
findValidNeighboursAbove <- function(response, explanatory, last, data) {
  neighbours = findNeighboursAbove(explanatory,last);
  length = length(neighbours);
  keep = rep(NA, length);
  keep = lapply(1:length,
                FUN = function(i) {
                return(checkModelValidity(response, 
                neighbours[[i]], data))
                })
  validNeighboursAbove = neighbours[as.logical(keep)];
  return(validNeighboursAbove);
}
 
##########################################################
# MC3 Function
##########################################################
 
MC3 <- function(response, explanatory, iterations, instances, 
lastPredictor) {
  require(rcdd);
  finalResults = NULL
 
  for(k in 1:instances) {
    # Make sure you start with a valid model
    initialModelValid = FALSE;
    while(!initialModelValid) {
      modelSize = sample(explanatory,1);
      initialModel = sample(explanatory, modelSize);
      initialModelValid = checkModelValidity(response, initialModel, data);
    }
 
    # Get measures for initial model
    currentModel = initialModel;
    currentAIC = getLogisticAIC(response,currentModel,data);
 
    # Check to see if we should start at the null model
    nullModelAIC = getLogisticAIC(response = response, explanatory = NULL, data = data);
    if(nullModelAIC < currentAIC) {
      currentModel = NULL;
      currentAIC = nullModelAIC;
      initialModel = NULL;
    }
 
    # Store initial AIC for later output.
    initialModel = currentModel;
    initialAIC = currentAIC;
 
    # Get valid neighbors of current model
    currValidNeighAbove = findValidNeighboursAbove(response, 
    currentModel, lastPredictor, data);
    currValidNeighBelow = findNeighboursBelow(currentModel);
    currAllNeigh = c(currValidNeighAbove, currValidNeighBelow);
    # Compute score
    currentScore = -currentAIC - log(length(currAllNeigh));
 
    # Iterate for the number of times specified.
    for(i in 1:iterations) {   
      # Compute measures for the candidate model
      candidateIndex = sample(1:length(currAllNeigh),1);
      candidateModel = currAllNeigh[[candidateIndex]];
      candValidNeighAbove = findValidNeighboursAbove(response, 
      candidateModel, lastPredictor, data);
      candValidNeighBelow = findNeighboursBelow(candidateModel);
      candAllNeigh = c(candValidNeighAbove, candValidNeighBelow);
 
      candidateAIC = getLogisticAIC(response, candidateModel, data);
      candidateScore = -candidateAIC - log(length(candAllNeigh));
 
      # If necessary move to the candidate model.
      if(candidateScore > currentScore) {
        currentModel = candidateModel;
        currAllNeigh = candAllNeigh;
        currentAIC = candidateAIC;
        currentScore = candidateScore;
      }
      # Print results as we iterate.
      print(paste('AIC on iteration', i ,'of instance', k , 'is', 
      round(currentAIC,2)));
    }
 
    # Produce outputs
    instanceResults = (list(
                       "Instance"            = k,
                       "Starting Model Size" = modelSize,
                       "Initial Model"       = sort(initialModel),
                       "Initial AIC"         = initialAIC,
                       "Final Model"         = sort(currentModel),
                       "Final AIC"           = currentAIC));
    finalResults = cbind(finalResults, instanceResults);
    colnames(finalResults)[k] = paste("Instance",k)
  }
  return(finalResults);
}
 
##########################################################
# Main Function
##########################################################
 
main <- function(data, iterations, instances) {
  response = ncol(data);
  lastPredictor = ncol(data)-1;
  explanatory = 1:lastPredictor;
  results = MC3(response, explanatory, iterations, instances, 
  lastPredictor);
  return(results);
}
 
##########################################################
# Run Main
##########################################################
setwd("~/Desktop/Classes/Stat 534/Assignment 3")
data = as.matrix(read.table(file = "534binarydata.txt", header = 
FALSE));
main(data, iterations = 25, instances = 10);


My best run was instance 8. I print out the AIC at every iteration to help keep track of things and give me that nice reassurance that R isn’t exploding at the sight of my code, which is always a fear when nothing has been output for more than five minutes.

[1] "AIC on iteration 1 of instance 8 is 123.21"
[1] "AIC on iteration 2 of instance 8 is 104.17"
[1] "AIC on iteration 3 of instance 8 is 104.17"
[1] "AIC on iteration 4 of instance 8 is 97.31"
[1] "AIC on iteration 5 of instance 8 is 88.52"
[1] "AIC on iteration 6 of instance 8 is 88.25"
[1] "AIC on iteration 7 of instance 8 is 88.25"
[1] "AIC on iteration 8 of instance 8 is 86.64"
[1] "AIC on iteration 9 of instance 8 is 86.64"
[1] "AIC on iteration 10 of instance 8 is 83.67"
[1] "AIC on iteration 11 of instance 8 is 83.67"
[1] "AIC on iteration 12 of instance 8 is 83.67"
[1] "AIC on iteration 13 of instance 8 is 81.77"
[1] "AIC on iteration 14 of instance 8 is 81.77"
[1] "AIC on iteration 15 of instance 8 is 56.96"
[1] "AIC on iteration 16 of instance 8 is 56.96"
[1] "AIC on iteration 17 of instance 8 is 54.45"
[1] "AIC on iteration 18 of instance 8 is 54.45"
[1] "AIC on iteration 19 of instance 8 is 52.69"
[1] "AIC on iteration 20 of instance 8 is 52.69"
[1] "AIC on iteration 21 of instance 8 is 52.69"
[1] "AIC on iteration 22 of instance 8 is 50.71"
[1] "AIC on iteration 23 of instance 8 is 50.71"
[1] "AIC on iteration 24 of instance 8 is 50.71"
[1] "AIC on iteration 25 of instance 8 is 50.71"
