Kelsey Plum’s Chase for #1

Thanks to Graham for this question on Whale. Graham asked whether Kelsey Plum will break the record.

“What record?” you might ask. Plum is very close to becoming the all-time NCAA women’s basketball scoring leader. That’s a really big deal and probably one of the most underreported stories in basketball right now. NCAA women’s basketball began in 1981, so that’s 35 years’ worth of basketball. Plum will have scored more points than greats like Brittney Griner, Chamique Holdsclaw, and Cheryl Miller.

In December of last year Plum became the all-time Pac-12 scoring leader with 44 points in a win against Boise State. She had 44 points again on Sunday in a 72-68 loss to Stanford.

Plum is now averaging 31.4 points per game (to go along with 5 rebounds and 5 assists) and now sits third all-time, just 255 points away from Jackie Stiles, who set the record 15 years ago with 3,393. The UW women have 8 games to go in the regular season, and if Plum keeps up her average she’ll fall just shy of the record by season’s end with roughly 3,388 points. Luckily, the UW women are all but guaranteed at least two post-season games: one in the Pac-12 tournament and another in the NCAA tournament, which they’ll likely make even if they fall in the first round of the Pac-12. They could play up to nine games if they make the final in both, but will likely end up playing somewhere around six or seven. Still, this gives Plum plenty of time to break the record, and I predict she’ll surpass it by 100 to 150 points.
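As a sanity check, the projection arithmetic above can be reproduced in a few lines of R (all inputs come from the text; the result lands within a point or so of the 3,388 figure, depending on rounding):

```r
# Project Plum's regular-season point total from her current pace
record  <- 3393          # Jackie Stiles' record
current <- record - 255  # Plum trails the record by 255 points
ppg     <- 31.4          # points per game
games   <- 8             # regular-season games remaining

projected <- current + ppg * games
projected < record       # TRUE: just shy of the record in the regular season
```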

I plotted the graph below using R to show Plum’s chase for the record.


Where to Rank the UConn Women in Terms of Dominance

Unless a miracle occurs the UConn Women’s Basketball team will soon win their fourth championship in a row in an undefeated season when they beat Syracuse on Tuesday night. This will be their sixth championship since 2009. If they lose it will be one of the greatest upsets in the history of team sports.

Where does their recent dominance rank in the all-time history of sports? I put together this short survey.

The UNC women’s soccer team is, as far as I know, the most dominant team in the history of sports, collegiate or professional (at least in the U.S.), Harlem Globetrotters aside. They’ve been consistently dominant now for three decades and have won 22 of the 36 NCAA national championships. Of course, U.S. women’s professional soccer has also been dominant over the past 15 years, with numerous World Cup and Olympic gold medal wins as well as a continuous No. 1 ranking from March 2008 to December 2014. And in Spain, Barcelona has built a dominant European fútbol club.

The UCLA men’s basketball team of the 1960s and 1970s won seven straight national titles under the famous John Wooden. The Iowa Hawkeyes men’s wrestling team also had an amazing run of dominance, especially throughout the 1990s. My alma mater, the University of Washington, has won five consecutive national crew championships in the men’s varsity eight. Jointly, the University of Minnesota and the University of Minnesota Duluth have dominated women’s ice hockey since 2001, winning a combined 10 national championships.

The University of Arkansas won eight consecutive national Track & Field Championships on the men’s side throughout the 1990s, while the LSU women won 11 championships in a row in the ’80s and ’90s (wow!). Swimming and diving national championships seem to come in bundles. Since 1937 only 13 different men’s teams have won national championships and many were back to back or three-peats. The women’s side is equally streaky. By the way, there are quite a few schools with swimming and diving programs.

Of course Alabama’s football team has been quite successful over the past seven years, winning four FBS championships in a rather competitive sport that has recently instituted a playoff system (Alabama has won one out of two of those).

What I’ve listed so far have been Division I-A programs only. Certainly some smaller college programs have seen dominance, and of course there are dominant high school teams as well. St. Anthony’s in New Jersey has won 27 boys’ basketball state titles in the past 39 years, for instance. Maryville, Tennessee’s high school football team has gone 145-5 and won seven state titles in recent memory. Cheryl Miller, perhaps the greatest female basketball player of all time (and yes, sister of Reggie Miller), led her high school team to a record of 132-4 from ’78 to ’82 and along the way scored 105 points in a single game. Reggie Miller often recalls the night he found out about his sister’s scoring outburst: he had just scored 39 points and was pretty proud of himself until his sister reported back that she had more than doubled that total. I recall hearing about several boys’ wrestling champions with perfect high school careers. Here is one example.

A number of professional teams have had long periods of dominance. Chinese women’s diving has been extremely dominant recently. The New York Yankees have won 27 World Championships and 40 American League pennants over the past 100 years, with many of these coming in the 45-year period between 1920 and 1965. I’m aware that the Russian hockey and gymnastics teams were quite great in their prime, perhaps still so.

The Boston Celtics won eight straight World Championships throughout the 1960s, helping Bill Russell win a total of 11 championship rings during his career. Indeed, Bill Russell is sometimes considered the greatest winner in the history of team sports and as such when LeBron James left Russell off of his theoretical “Mount Rushmore of the NBA” Russell was able to respond with this amazing quote regarding his own athletic success:

Hey, thank you for leaving me off your Mount Rushmore. I’m glad you did. Basketball is a team game, it’s not for individual honors. I won back-to-back state championships in high school, back-to-back NCAA championships in college. I won an NBA championship my first year in the league, an NBA championship in my last year, and nine in between. That, Mr. James, is etched in stone.

Individual athletics has also seen sportsmen and women who have been consistently dominant. Tiger Woods, Usain Bolt, Shaun White, and Michael Phelps have all had multi-year stretches of dominance in recent memory. At least one of them was just featured in an inspiring commercial. Of course there were many dominant athletes in each of these sports before the current incarnations (Jack Nicklaus, Carl Lewis, Mark Spitz). Eric Heiden won five speed skating medals in the Lake Placid Olympics, starting with the 500 meters and ending with the 10,000 meters. I once watched a documentary in which this feat was compared with a single athlete winning both the 100 meter dash and the mile. Tony Hawk helped usher in skateboarding as a professional sport and was dominant while doing so. Chris Sharma did the same with rock climbing. Rich Froning Jr. has had early success in the burgeoning activity of CrossFit as a sport, winning the title of “Fittest Man on Earth” four times since 2011.

Anderson Silva had a long run of dominance in MMA and you’ve certainly heard of some of boxing’s all-time greats. Ronda Rousey garnered fame for her win streak until she was beaten just this year; she also appeared in the horrible movie version of HBO’s Entourage, though I liked her performance. If you’ve ever been to the ballet you know it can be extremely athletic. How’s that for an inspirational commercial? Perhaps it’s time we consider ballet a sport?

And then there is this horse.

Tennis has a history of dominant players, including two current ones: Novak Djokovic and Serena Williams. Serena is already generally considered the best female player of all time, and Novak may end up the greatest men’s player before his career is over. Previous generations included Steffi Graf, Martina Navratilova, Roger Federer, Pete Sampras, and many others, each extremely dominant in their prime. For example, during his prime Roger Federer held the No. 1 ranking for 302 consecutive weeks, reached 23 consecutive Grand Slam semifinals, and won five consecutive titles at both Wimbledon and the US Open plus three out of four at the Australian Open.

Perhaps a dark-horse contender for most dominant athlete is Kelly Slater, the American professional surfer who won five consecutive titles from ’94 to ’98. There are a number of articles suggesting he may be the greatest male athlete of all time. He won his first title at age 20 and his last at age 39 (and he’s still surfing competitively!). Talk about longevity. Imagine if Kobe Bryant were leading the Lakers to a title this year, or if Peyton Manning had been truly great in the Broncos’ Super Bowl win, and you have some idea of what Kelly Slater has accomplished. (Yes, I realize surfing is a non-contact sport. Or is it?)

What have I forgotten? Surely there must be a lot. Certainly, this list is too U.S. centric.

But back to the question at hand. There have been many conversations about whether the UConn women’s dominance is bad for women’s college basketball. Some have suggested that this is a sexist argument, but I disagree. If Kentucky’s men’s basketball team were on the verge of winning its fourth straight NCAA tournament there would certainly be discussion about their dominance, perhaps around allegations of illegal recruiting or steroid use, or at the very least a discussion about reforming the current one-and-done system.

And the question of whether a team can be too dominant is not new. Indeed, many professional sports are structured specifically to provide — or at least attempt to provide — equity among smaller and larger markets. Think of the draft or salary caps. Of course, in individual sports we fear dominance less because we know natural aging will create a new wave of competition in a few years’ time, or, in individual college sports, the athlete will simply graduate.

But we also understand that long-term equilibrium can occur where success begets success. College players, shamefully, are not paid in dollars so the next best thing is to be paid in wins. UConn seems to be the central bank in that category.

On the other hand, and as the list above alludes to, dominance is not unique to the UConn women. In fact, in the grand scheme of things they aren’t so dominant after all. But in some sports we’re used to seeing new champions more often than in others, if only because most people in the U.S. only follow the big four. We’re used to seeing new men’s champions every year, to be sure, even if they’re generally drawn from the same group of ten or twenty teams year after year. So it really stands out when the same women’s team wins repeatedly, regardless of where they stand in the broader historical spectrum.

The best thing, it seems, would be for Syracuse to beat UConn and put the whole matter to rest.

Will a 16 seed ever beat a 1 seed?

That question arose during a recent dinner with my friend Graham at a popular pizza restaurant in Seattle. Tony Kornheiser of PTI said last week that it will never happen. Graham agrees. I think it will happen in our lifetime. It has almost happened a number of times already.

It seemed to me from casual observation like 16 seeds are getting closer to winning on average, but I decided to check this by plotting the point differential of higher-seeded teams during the first round of the NCAA tournament (in the first round there are four 1-vs-16 games). Indeed, compared to many other matchups the 1-vs-16 matchup has shifted greatly over time.

[Plot: average first-round margin of victory by seed matchup, over time]

The margin of victory is still substantial, between 10 and 15 points so far this decade, but I remain confident that, let’s say, sometime in the next 40 years it will happen.  There have already been eight 15 seed victories over number 2 seeds and twenty-one 14 seed wins over 3 seeds. The situation looks even better when you consider the closest 1-vs-16 game during each tournament (since we only need one 16 seed to win).

[Plot: closest 1-vs-16 margin in each tournament]

It seems that about two to three times every decade there are relatively close 1-vs-16 games and about once a decade there is an extremely close game (decided by just a couple of points). The 2000s did not fare well for 16 seeds.

I think the outcomes of close games are more stochastic than most people assume. Leadership attribution bias seems to turn these stochastic events into narratives of late-game heroics, and we’re prone to say that the 1 seed is more poised and resistant to pressure than players at smaller schools. Of course, the history of the tournament has shown us many, many exceptions to this rule (if it’s a rule at all). At tension with this narrative is the story of the underdog that is just happy to be at the tournament and has nothing to lose, playing loose, having fun, and playing “to win” while the nervous champion is playing “not to lose.” So to me many of these games are closer to a coin flip than we like to think, and given enough coin flips a tail is bound to come up eventually.
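To put the coin-flip intuition in rough numbers: with four 1-vs-16 games every year, even a small per-game upset probability compounds quickly. A minimal sketch (the 2% win probability is my assumption, not an estimate from the data):

```r
# Chance that at least one 16 seed beats a 1 seed within `years` years,
# assuming each of the four annual 1-vs-16 games is an independent
# trial with the same (hypothetical) upset probability p
p_upset_by <- function(p, years) {
  games <- 4 * years
  1 - (1 - p)^games
}

p_upset_by(0.02, 40)  # with p = 2%, an upset within 40 years is very likely
```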

Also, think about it this way: how much difference is there between a 1 seed and a 2 seed, or between a 15 seed and a 16 seed? As you know if you ever watch the tournament seeding show, there isn’t much of a difference. The lowest-ranked 1 seed isn’t much better than some of the 2 seeds (and could be worse), and a 2 seed has lost in the first round eight times. Of course you could argue the 2 seeds that lost should have actually been 3 seeds, although this year Michigan State lost as a 2 seed and many considered them a favorite to win the entire tournament (by some measures this is the biggest tournament upset ever). The larger point is that seeding is also somewhat stochastic, and the question “can a 16 seed beat a 1 seed” is really the question of whether an overmatched team can manage a one-time victory over an opponent that is much more dominant on average. And we already know the answer to that question is “yes.”

To give the other side its due: since 1979, when seeding began, 22 of the 37 NCAA Tournament winners have been 1 seeds, so at least some of the top seeds are properly ranked, and 16 seeds will have a tough time beating them when the ranking is accurate.

There is also the question of why the decrease in point margin has occurred. Keep in mind the plot above is just a second-order trend line, although if you plot the underlying year-by-year margin it does follow the trend on average (of course). It’s interesting that the late 1980s and early 1990s were a time of lower point differential for 16-vs-1 games and that this period also coincided with three first-round losses by 2 seeds (1991, 1993, 1997). Likewise, the past few years have seen another decrease in 16-vs-1 point differential and another string of wins over 2 seeds: two in 2012 and one each in 2013 and 2016. Similarly, between 1986 and 1999 there were thirteen times that a 14 beat a 3, and since 2010 this has occurred another six times (the intervening years saw only two instances, in 2005 and 2006). These two periods also correspond to the highly touted recruiting classes of Michigan in 1991 (famously nicknamed the “Fab Five”) and Kentucky’s 2013 “mega class.” It may be that there are episodic shifts in recruiting that systematically leave certain types of talent on the table for smaller schools to cull and develop. (Of course, it may be I’m just seeing patterns where none exist.)

My recent memory is that although a few good schools still get the best players, smaller schools have seen that if they recruit good players (particularly good shooters or traditional big men) and play as a team they have a chance at beating anyone in the early rounds and perhaps going deep into the tournament. This increases their confidence and performance. This is another reason I think a 16 seed will eventually beat a 1: the recent years of the tournament have expanded our imagination of what is possible. Think about the well-known phenomenon of 12 seeds upsetting 5 seeds. How nervous do you think 5 seeds are every year? How much of the consistency of a 12 beating a 5 is self-reinforcing, due to the 5 seeds having “the jitters” and the 12 seeds being relatively overconfident?

Basketball Project Part 4

After researching online basketball data in more depth I found that RealGM had so-called “split” data for college players. Player statistics are sliced in various ways, such as performance against Top 25 teams.

My original collection process involved scraping statistics from every college player, which was quite inefficient. It involved approximately 20,000 player-seasons’ worth of data and caused problems during the merge since so many players shared names. It also didn’t allow collection of the “split” data, since that data is housed on each player’s individual page instead of on the “All College Player Stats” page.

It was quite challenging to figure out how to scrape the RealGM site. The page structure was predictable aside from a unique id number for every player, which I assume comes from some sort of internal database on the RealGM site. These numbers range in length from two to five digits and I could find no way to predict them. For instance, Carmelo Anthony’s player page link is below. His player id is 452.

After a fair bit of thrashing about I finally came up with the solution: write an R script that would google the first portion of the player’s page link, read the Google page source, search for the player’s page address using regular expressions, and then append the id to the rest of the structured web address.

For Carmelo, the script would use the following Google search link:

The specificity of the search ensures that the RealGM link appears on the first page of search results (it was the first result in every test scenario I tried). The script then searches the Google results page source with a regular expression of the form (Summary|News|\u2026)/[0-9]+ appended to the player’s name.

A player’s main page link is always the player’s name followed by “/Summary/id”, but “/News/id” and “/…/id” also appeared. After the script locates and reads this link it’s easy enough to strip out the player id and insert it into the player page link that holds the advanced college data I was looking for.
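As a toy illustration of the id extraction (the page-source snippet below is invented; only the regex logic mirrors the approach described above):

```r
# Hypothetical fragment of a search-results page source
src <- 'href="https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452"'

# Match "Summary/<id>" (or "News/<id>") and strip out the numeric id
m   <- regexpr("(Summary|News)/[0-9]+", src)
hit <- substr(src, m, m + attr(m, "match.length") - 1)
id  <- gsub("[^0-9]+", "", hit)
id  # "452"
```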

# Requires RCurl for getURLContent() and data.table for setnames()
library(RCurl)
library(data.table)
# Read in players and convert names to proper format 
players.DF <- read.csv(file="~/.../Combined Data/Combined Data 1.csv")
players <- as.character(players.DF$Player)
players <- gsub("\\.","",players)
players <- gsub(" ","-",players)
# Initialize dataframes and vectors 
missedPlayers <- NULL
playerLinks <- rep(NA, length(players))
playerLinks <- data.frame(players.DF$Player, playerLinks)
# Create link for each player 
for(i in 1:length(players)) {
  # Base Google search URL omitted here, as in the original post
  url <- paste0('',players[i])
  result <- try(content <- getURLContent(url))
  if(class(result) == "try-error") { next; }
  # Locate "<player>/(Summary|News|...)/<id>" in the page source
  id <- regexpr(paste0("", players[i], "/(Summary|News|\u2026)/[0-9]+"), content)
  id <- substr(content, id, id + attr(id,"match.length"))
  id <- gsub("[^0-9]+","",id)
  id <- paste0('', players[i], '/NCAA/', id,
               '/2014/By_Split/Advanced_Stats/Quality_Of_Opp')
  playerLinks[i,2] <- id
}
setnames(playerLinks, c("players.DF.Player","playerLinks"), c("Players","Links"))

Some sites have started to detect and try to prevent web scraping. On iteration 967 Google began blocking my search requests. However, I simply reran the script the next morning from iteration 967 onward to pick up the missing players.

I then used the fact that a missing id results in a page link containing “NCAA//” to search for players that were still missing their ids.

> pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]

After examining the players I noticed many of these had apostrophes in their names, which I had forgotten to account for in my original name formatting.

[Screenshot: players still missing ids, many with apostrophes in their names]

I adjusted my procedure and reran the script to get the pickups.

pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]
pickups <- pickups[[1]]
pickups <- gsub("'","",pickups)
pickups <- gsub(" ","-",pickups)
pickupNums <- grep("NCAA//",playerLinks[[2]])
for(i in 1:length(pickupNums)) {
  j <- pickupNums[i]
  # Base Google search URL omitted here, as in the original post
  url <- paste0('',pickups[i])
  result <- try(content <- getURLContent(url))
  if(class(result) == "try-error") { next; }
  id <- regexpr(paste0("", pickups[i], "/(Summary|News|\u2026)/[0-9]+"), content)
  id <- substr(content, id, id + attr(id,"match.length"))
  id <- gsub("[^0-9]+","",id)
  id <- paste0('', pickups[i], '/NCAA/', id,
               '/2014/By_Split/Advanced_Stats/Quality_Of_Opp')
  playerLinks[[j,2]] <- id
}

After rerunning the script three players were still missing ids, so I entered these manually.

playerLinks[[370,2]]  <- ""
playerLinks[[884,2]] <- ""
playerLinks[[1010,2]] <- ""

I also needed to manually check the three duplicate players and adjust their ids accordingly.

The final result looks like this:

[Screenshot: final table of player names and their RealGM links]

The next step will be to cycle through the links and use readHTMLTable() to get the advanced statistics.

R highlighting created by Pretty R.

Basketball Project Part 3

While I was looking around at basketball data during the course of the project I saw that one site had a few pieces of data I wanted to pick up: a player’s shooting arm (right or left) and his high school ranking. The site is also packed with a ton of other data I may use in the future, such as a player’s shooting percentage from different distances from the basket. So I thought it would be good to create a procedure to scrape it.

The site uses a particular address structure that makes it easy to scrape: the base address + the first letter of the player’s last name + the first 5 letters of the player’s last name (unless the last name has fewer than 5 letters, in which case the whole name is used) + the first two letters of the first name + a page number (usually a 1, but sometimes a 2 if more than one player shares a name). For instance,
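That naming scheme can be sketched as a small helper (the function and its output path are mine, built only to illustrate the pattern; the base address is omitted as in the original):

```r
# Build the structured page path for a player name, per the scheme above
playerPath <- function(name, page = 1) {
  name  <- tolower(gsub("[[:punct:]]", "", name))
  first <- substring(sub(" .*$", "", name), 1, 2)   # first 2 letters of first name
  last  <- substring(sub("^.* ", "", name), 1, 5)   # first 5 letters of last name
  paste0(substring(last, 1, 1), "/", last, first, sprintf("%02d", page), ".html")
}

playerPath("Carmelo Anthony")  # "a/anthoca01.html"
```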

R reads the page source and again the site uses a structured page profile:

Screen Shot 2014-03-26 at 6.59.33 PM

I first used grep to locate the lines of the page source that contained “Shoots:” and “Recruiting Rank:”, and then used regular expressions to strip the information out. Not all players have both (or either) piece of information, so I used a try() wrapper so the code could push past errors resulting from no match to the regular expressions.
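In isolation, the try() pattern works like this (toy error, not the scraping code):

```r
# try() returns an object of class "try-error" instead of halting the loop
result <- try(log("not a number"), silent = TRUE)
inherits(result, "try-error")  # TRUE, so the loop body can simply skip ahead
```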

# Requires stringr for string helpers and data.table for setnames()
library(stringr)
library(data.table)
# Read in master player list
players.DF <- read.csv(file="~/.../All Drafted Players 2013-2003.csv")
allPlayers <- players.DF[,3]
# Convert names to proper format
allPlayers <- str_replace_all(allPlayers, "[[:punct:]]", "")
allPlayers <- tolower(allPlayers)
first <- str_extract(allPlayers,"^[^ ]+")
first <- substring(first,1,2)
last <- str_extract(allPlayers,"[^ ]+$")
last <- substring(last,1,5)
letter <- substring(last,1,1)
shootsVector <- rep(NA,length(allPlayers))
recruitVector <- rep(NA,length(allPlayers))
# Scrape the site and record shooting arm and HS ranking
for(i in 1:20) {
  # Page address built per the pattern described above (base address omitted)
  page <- read.csv(paste0('', letter[i], '/', last[i], first[i], '01.html'))
  line <- grep("[Ss]hoots:(.*)Right|Left", page[,], value = FALSE, perl = TRUE)
  index <- regexpr("[Rr]ight|[Ll]eft",page[line,])
  shoots <- substr(page[line,], index, index + attr(index,"match.length") - 1)
  result <- try(shootsVector[i] <- shoots)
  if(class(result) == "try-error") { next; }
  line <- grep("Recruiting Rank:(.*)([0-9]+)", page[,], value = FALSE, perl = TRUE)
  index <- regexpr("\\([0-9]+\\)$",page[line,])
  recruit <- substr(page[line,], index + 1, index + attr(index,"match.length") - 2)
  result <- try(recruitVector[i] <- recruit)
  if(class(result) == "try-error") { next; }
}
# Combine information
players.DF <- cbind(players.DF, shootsVector,recruitVector)
setnames(players.DF,c("shootsVector","recruitVector"),c("Shooting Arm","HS Ranking"))
write.csv(players.DF,file="~/...Combined Data/Combined Data 1.csv")

The procedure is vulnerable to duplicate names. There are ways to deal with this in code; one would be to also read the college from the page source and use it to pick out the right player. In this case, however, a duplicates report found only 3 duplicated names.

> which(duplicated(allPlayers))
[1]  715  732 1118
> allPlayers[715]
[1] "tony mitchell"
> allPlayers[732]
[1] "chris wright"
> allPlayers[1118]
[1] "jamar smith"

For that reason, it was much easier to just do a manual search on the 6 players and update their data, which I chose to do in Excel. Using the highlight-duplicates feature, I could easily scroll down, find the 3 duplicate players, and change their shooting arm and HS ranking as necessary.

[Screenshot: duplicate players highlighted in Excel]


Basketball Project Part 2

One piece of data I wanted for my statistical analysis was the quality of the college a player attended. I chose to measure college quality by the number of weeks a team was in the Associated Press (AP) Top 25 college basketball rankings. Note that I used only regular-season rankings, not pre- or post-season rankings, which are not available for all years. Historic rankings dating back to the 2002-2003 season are available on the ESPN website. However, when scraping ESPN’s webpage I found the data was semi-structured.

[Screenshot: semi-structured rankings data on ESPN’s page]

The code to read in the college name must be robust enough to ignore all the possible characters following the college name, but flexible enough to detect “exotic” college names like “Texas A&M” and “St. John’s.” The code first reads in each week’s rankings and strips out the college names, then binds the weeks together. If a season has fewer than 18 weeks, NAs are introduced so that every season is the same length and can be bound together. The college quality is then calculated for each season. Finally, the weekly rankings for every season are bound into a single table and saved, as is the college quality for every season. The code is shown below.

# Requires XML for readHTMLTable() and data.table for setnames()
library(XML)
library(data.table)
# Initialize variables
seasons <- seq(2013,2003,by=-1)
allSeasonRankings <- NULL
allSeasonTable <- NULL
missedPages <- NULL   # log of (season index, week) pages that failed to load
# Web scrape
# Iterate over each week in each season
for(j in 1:length(seasons)) {
  numWeeks <- 0
  seasonRanking <- NULL
  week <- NULL
  for (i in 2:19) {
    # Base ESPN rankings URL omitted here, as in the original post
    result <- try(week <- readHTMLTable(paste0('',
      seasons[j], '/week/', i ,'/seasontype/2'),skip.rows=c(1,2))[[1]][,2])
    if(class(result) == "try-error") { missedPages <- rbind(missedPages, c(j,i)); next; }
    numWeeks <- numWeeks + 1
    # Strip out the college names (BegString() is defined below)
    week <- data.frame(BegString(as.character(week)))
    seasonRanking <- cbind(seasonRanking,week[[1]])
    colnames(seasonRanking)[numWeeks] <- paste("Week",numWeeks)
  }
  # Ensure that all seasons have 18 weeks 
  # (the maximum number of weeks in a season since 2003)
  # so that all seasons have the same length and can easily be bound together
  while(numWeeks < 18) {
    numWeeks <- numWeeks + 1
    extra <- rep(NA,25)
    seasonRanking <- cbind(seasonRanking,extra)
    colnames(seasonRanking)[numWeeks]  <- paste("Week",numWeeks)
  }
  # Bind seasons together
  allSeasonRankings <- rbind(allSeasonRankings, seasonRanking)
  # Calculate the percentage of weeks each school was in the AP Top 25
  seasonTable <- data.frame(table(as.vector(seasonRanking)))
  percentages <- round((seasonTable[2]/numWeeks)*100,2)
  # Change column name to "Top 25 %" immediately. Otherwise percentages will 
  # inherit the name "Freq" from the table function and not allow use of setnames() 
  # since 2 columns have the same name
  colnames(percentages)[1] <- "Top 25 %" 
  seasonTable <- cbind(seasonTable, percentages)
  seasonTable <- cbind(seasonTable, rep(seasons[j],nrow(seasonTable)))
  allSeasonTable <- rbind(allSeasonTable,seasonTable)
}
# Clean up names
setnames(allSeasonTable,c("Var1", "rep(seasons[j], nrow(seasonTable))"),
c("Team", "Season"))
# Add column with season
rankingYear <- rep(seasons, each=25)
# Combine data and clean up names
allSeasonRankings <- cbind(rankingYear,allSeasonRankings)
allSeasonRankings <- data.frame(allSeasonRankings)
setnames(allSeasonRankings,"rankingYear", "Season")
# Save files
write.csv(allSeasonRankings,file="~/.../College Quality/Season Rankings.csv")
write.csv(allSeasonTable,file="~/.../College Quality/Percent Time in Top 25.csv")

The above code uses two custom functions to strip out the college name. One strips out the college name itself and the second removes the trailing whitespace that sometimes remains. There are a lot of different ways to do this. The most efficient is probably to use the functionality of the stringr package (such as str_extract()), but I wrote these functions when I was less aware of all of stringr’s functionality.

# Returns the leading substring containing only letters, spaces, periods,
# and the ' and & symbols
BegString <- function(x) {
  exp <- regexpr("^[a-zA-Z .'&]+",x)
  stringList <- substr(x,1,attr(exp,"match.length"))
  stringList <- removeTrailSpace(stringList)
  return(stringList)
}
# Removes trailing whitespace from each string
removeTrailSpace <- function(stringList) {
  whiteSpaceIndex <- regexpr(" +$",stringList)
  whiteSpaceSize <- attr(whiteSpaceIndex,"match.length")
  for(k in 1:length(stringList)) {
    if(whiteSpaceSize[k] > 0) {
      stringList[k] <- substr(stringList[k],1,whiteSpaceIndex[k]-1)
    }
  }
  return(stringList)
}
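The two helpers above boil down to two substitutions; as a standalone check (the team strings are invented to mimic the rankings format):

```r
# BegString/removeTrailSpace equivalent in two base-R substitutions
cleanName <- function(x) {
  x <- sub("^([a-zA-Z .'&]+).*$", "\\1", x)  # keep leading name characters
  sub(" +$", "", x)                          # drop trailing whitespace
}

cleanName("Texas A&M (3-0)")  # "Texas A&M"
cleanName("St. John's 12")    # "St. John's"
```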

The weekly ranking table ends up looking like this:

[Screenshot: weekly ranking table]

This table is saved purely for reference since all of the meat is in the college quality calculation. College quality is shown below. Again, I kept the “Freq” in for reference so that I could randomly verify the results of a few observations to make sure the code worked properly. As you can see, 43 different teams spent at least one week in the AP Top 25 rankings during 2013.
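The Freq-to-percentage step is easy to spot-check on toy data (team set invented, poll shrunk to three slots and two weeks):

```r
# Two weeks of a miniature "Top 3" poll
seasonRanking <- cbind(c("Duke","Kansas","Kentucky"),
                       c("Duke","Kentucky","Arizona"))
numWeeks <- ncol(seasonRanking)

# Tabulate appearances, then convert to percent of weeks ranked
seasonTable <- data.frame(table(as.vector(seasonRanking)))
seasonTable$"Top 25 %" <- round(seasonTable$Freq / numWeeks * 100, 2)
seasonTable[seasonTable$Var1 == "Duke", ]  # Duke appears both weeks: 100%
```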

[Screenshot: college quality table for 2013]

Now that I have this data I can merge it with the master list of players using the school name and season as keys.
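A minimal sketch of that merge on invented rows, with merge() matching on both keys at once:

```r
# Master player list and college-quality table share Team and Season keys
players <- data.frame(Player = c("Player A","Player B"),
                      Team   = c("Kentucky","Kansas"),
                      Season = c(2013, 2012))
quality <- data.frame(Team       = c("Kentucky","Kansas"),
                      Season     = c(2013, 2012),
                      `Top 25 %` = c(100, 88.89),
                      check.names = FALSE)

merged <- merge(players, quality, by = c("Team","Season"))
merged  # each player picks up his school's Top 25 % for that season
```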



Basketball Project

As part of a graduate applied regression course I took, we were required to create and present a research question. Each question in the top third was assigned three students, and these groups worked on the project for the last seven weeks of class. I proposed examining the relationship between early career NBA performance and a variety of pre-NBA player attributes.

NBA performance was to be measured using the so-called “Player Efficiency Rating” created by John Hollinger (usually denoted simply “PER”). The PER attempts to combine all of a player’s on-court statistics into a single number, with the NBA average set to 15 every season. The pre-NBA player profile included a variety of advanced statistics measuring shooting, rebounding, steals, assists, and blocks. For some players NBA combine data was also available. The combine data consisted of a variety of body measurements and results from athletic skills tests (such as standing vertical leap).

My team and I worked throughout the quarter and presented our results last week at the class poster presentation. However, I wanted to redo the project on my own time with better data and full control over the data analysis (rather than having to split up the work among three people).

Since this is the second time around I’m much smarter about how to cull, clean, and merge the data efficiently. The first step is to get a master list of players. I’m choosing to use RealGM Basketball’s draft data. It includes both drafted and undrafted players that played in the NBA (or D-league) dating back to 1978. The procedure I used (shown below) works for the modern two-round draft, which started in 1989. However, since college data is only available from the 2002-2003 season, I only went as far back as the 2003 NBA draft.

This dataset includes draft age, an obvious proxy for the age at which a player began his on-court NBA career, something missing from our original dataset. It includes country of birth as well, which would allow a test of the common assertion that foreign players are better shooters. Importantly, this dataset also includes a player’s college name in a format that matches the Associated Press (AP) Top 25 rankings available on ESPN’s website. For instance, depending on the data source the University of Kentucky is sometimes written as “University of Kentucky”, elsewhere simply as “Kentucky”, and occasionally as “UK” (ESPN’s site uses the variant “Kentucky”). I’ve learned that thinking carefully beforehand about how to merge data saves a lot of pain later.

Controlling for the quality of a player’s college basketball program was an unfortunate omission from the original analysis. Because it embodies both the quality of coaching a player received and the toughness of the competition he faced, leaving it out may have introduced omitted variable bias. For this measure I’ve decided to use the percentage of the season a team spent in the AP Top 25 rankings.
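A minimal sketch of how such a measure could be computed, assuming a made-up dataframe of weekly AP poll appearances (one row per team per poll week in which it was ranked; the teams, weeks, and season length are all illustrative):

```r
# Hypothetical weekly AP poll data: one row per (team, week) when ranked
apPolls <- data.frame(
  Team = c("Kentucky", "Kentucky", "Kentucky", "Duke"),
  Week = c(1, 2, 3, 2),
  stringsAsFactors = FALSE)

totalWeeks <- 18  # assumed number of poll weeks in the season

# Fraction of poll weeks each team spent in the Top 25
pctRanked <- table(apPolls$Team) / totalWeeks
pctRanked["Kentucky"]  # 3 of 18 poll weeks
```

The resulting fraction can then be merged onto the player list by school name, just like any other covariate.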

To get this master player list I used R’s XML package to scrape the RealGM site. I used try() in conjunction with readHTMLTable() since otherwise my intermittent internet connection (or other unexpected problems) would cause the for() loop to stop completely. If try() encounters an error I log the page so I can examine it later and pick up any missing data.

After the scrape I examined the data and did some simple cleaning. Drafted and undrafted players have slightly different data available, so I had to introduce some NAs for the undrafted players in order to combine the dataframes. I also had to convert the columns from factors to characters or numerics depending on their values. Height, which arrives in feet-inches format (e.g., 6-10), needed to be converted to a pure numeric value (I used height in inches). And a few columns had extra characters that needed to be removed.

To convert height I wrote a custom function (shown below). I could have used the stringr package’s str_extract() instead of regexpr() and substr(), but for variety (and practice) I went with the less efficient two-line approach. In general, the length of my code could be substantially reduced, but at the cost of readability for others (as well as for myself when I revisit the code in the future).

convertHeight <- function(x) {
  feet <- substr(x, 1, 1)
  inches <- regexpr("[0-9]+$", x)
  inches <- substr(x, inches, inches + attr(inches, "match.length") - 1)
  as.numeric(feet) * 12 + as.numeric(inches)
}
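For reference, the one-line-per-piece stringr alternative mentioned above might look something like this (a sketch; it assumes the same feet-inches input format, and convertHeight2 is my name for the variant, not from the original code):

```r
library(stringr)

# Same conversion using str_extract() instead of regexpr()/substr()
convertHeight2 <- function(x) {
  feet   <- as.numeric(str_extract(x, "^[0-9]+"))
  inches <- as.numeric(str_extract(x, "[0-9]+$"))
  feet * 12 + inches
}

convertHeight2("6-10")  # 82 inches
```

A side benefit of the regex anchors is that this version also handles hypothetical double-digit feet values, whereas substr(x, 1, 1) assumes a single digit.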

Everything went smoothly aside from a warning that NAs were introduced by coercion when converting “Weight” to numeric. A quick search showed this was a problem for only a single player, number 1073.

> which([, 5]) == TRUE)
[1] 1073

Player 1073 turns out to be Donell Williams from Fayetteville State who went undrafted in 2005 and later played a season in the D-league. I went back to RealGM’s site and confirmed that his weight was indeed marked as “N/A” in the source data.

The next steps will be to merge in the college quality data (from ESPN), a few additional pieces of data I scraped from Basketball-Reference (such as the shooting hand a player uses), all of the NBA combine data (from DraftExpress), and the players’ college and NBA statistics (from RealGM and Basketball-Reference). Each piece of data requires its own web scraping and cleaning, which I’ll take up in future posts.

# Load necessary libraries
library(XML)        # readHTMLTable()
library(stringr)    # str_extract()
library(data.table) # setnames()
# Initialize variables
round1 <- NULL
round2 <- NULL
drafted <- NULL
undrafted <- NULL
allDraftedPlayers <- NULL
allUndraftedPlayers <- NULL
missedPages <- NULL
seasons <- seq(2013,2003,by=-1)
# Get draft info for drafted and undrafted players
for(i in 1:length(seasons)) {
    result <- try(page <- readHTMLTable(paste0(
        '', seasons[i])))
    if(class(result) == "try-error") { missedPages <- rbind(missedPages, seasons[i]); next }
    round1 <- page[[3]]
    round2 <- page[[4]]
    drafted <- rbind(round1, round2)
    undrafted <- page[[5]]
    # Print data for monitoring
    print(paste0('', seasons[i]))
    # Add draft year and combine data
    draftYear <- rep(seasons[i], dim(drafted)[1])
    drafted <- cbind(drafted, draftYear)
    allDraftedPlayers <- rbind(allDraftedPlayers, drafted)
    draftYear <- rep(seasons[i], dim(undrafted)[1])
    undrafted <- cbind(undrafted, draftYear)
    allUndraftedPlayers <- rbind(allUndraftedPlayers, undrafted)
}
# Drop unused columns
allDraftedPlayers <- allDraftedPlayers[,-c(9,11:12)]
allUndraftedPlayers <- allUndraftedPlayers[,-c(8:9)]
# Add NAs to undrafted players as necessary
length <- length(allUndraftedPlayers[[1]])
allUndraftedPlayers <- cbind(rep(NA, length), allUndraftedPlayers[, c(1:7)],
    rep(NA, length), allUndraftedPlayers[, 8, drop = FALSE])
# Unify names so rbind can combine datasets
colnames(allUndraftedPlayers)[1] <- "Pick"
colnames(allUndraftedPlayers)[9] <- "Draft RightsTrades"
allPlayers <- rbind(allDraftedPlayers,allUndraftedPlayers)
# Cleanup column names
setnames(allPlayers,c("DraftAge","Draft RightsTrades","draftYear"),
c("Draft Age","Draft Rights Traded","Draft Year"))
# Convert columns from factors to character and numeric as necessary
allPlayers[,-3] <- data.frame(lapply(allPlayers[,-3], as.character),
    stringsAsFactors = FALSE)
allPlayers[,c(5,8)] <- data.frame(lapply(allPlayers[,c(5,8)], as.numeric),
    stringsAsFactors = FALSE)
# Add dummy if player was traded on draft day
traded <- allPlayers[[9]]
allPlayers[which(regexpr("[a-zA-Z]+",traded) != -1), 9] <- 1
allPlayers[which(allPlayers[[9]] != 1), 9] <- 0
# Get rid of extra characters in class (mostly asterisks)
allPlayers[[7]] <- str_extract(allPlayers[[7]], "[a-zA-Z]+")
allPlayers[[7]][allPlayers[[7]] == "DOB"] <- NA
# Function for converting height from feet-inches format to inches
convertHeight <- function(x) {
  feet <- substr(x, 1, 1)
  inches <- regexpr("[0-9]+$", x)
  inches <- substr(x, inches, inches + attr(inches, "match.length") - 1)
  as.numeric(feet) * 12 + as.numeric(inches)
}
# Convert height to inches
allPlayers[[4]] <- convertHeight(allPlayers[[4]])
# Save the cleaned dataset
write.csv(allPlayers, file = "~/.../Draft Info/All Drafted Players 2013-2003.csv")


The result is to take this:

[Screen Shot 2014-03-26 at 1.11.00 AM]

And transform it into this:

[Screen Shot 2014-03-26 at 1.11.52 AM]