Basketball Project Part 2

One piece of data I wanted to have for my statistical analysis was the quality of college a player attended. I chose to measure college quality by the number of weeks a team was in the Associated Press (AP) Top 25 college basketball rankings. Note, that I only used regular season rankings not pre- or post-season rankings, which are not available for all years. Historic rankings dating back to the 2002-2003 season are available on the ESPN website. However, when scraping ESPN’s webpage I found the data was semi-structured.

Screen Shot 2014-03-26 at 10.12.56 AM

The code to read in the college name must be robust enough to ignore all the possible characters following the college name, but flexible enough to detect “exotic” college names like “Texas A&M” and “St. John’s.” The code first reads in each week’s rankings and strips out the college name. It then binds the weeks together. If the season has less than 18 weeks NAs are introduced to ensure every season is the same length and can be bound together. The college quality is then calculated for each season. Finally, the weekly rankings for every season are bound together into a single table and saved as is the college quality for every season. The code is shown below.

library(XML)
library(data.table)
 
# Initialize variables
seasons <- seq(2013,2003,by=-1)
allSeasonRankings <- NULL
allSeasonTable <- NULL
missedPages <- matrix(ncol=2,nrow=1)
colnames(missedPages) <- c("Season","Week")
k <- 1
 
# Web scrape
# Iterate over each week in each season
for(j in 1:length(seasons)) {
numWeeks <- 0
seasonRanking <- NULL
week <- NULL
 
  for (i in 2:19)
  {
    result <- try(week <- readHTMLTable(paste0(
    'http://espn.go.com/mens-college-basketball/rankings/_/poll/1/year/',
    seasons[j], '/week/', i ,'/seasontype/2'),skip.rows=c(1,2))[[1]][,2])
 
    if(class(result) == "try-error") { missedPages[k,] <- c(j,i); k <- k + 1; next; }
    print(paste0('http://espn.go.com/mens-college-basketball/rankings/_/poll/1/year/', 
    seasons[j], '/week/', i ,'/seasontype/2'))
 
    numWeeks <- numWeeks + 1
    week <- as.data.frame(array(BegString(week)))
    seasonRanking <- cbind(seasonRanking,week[[1]])
    colnames(seasonRanking)[numWeeks] <- paste("Week",numWeeks)   
  }
    # Ensure that all seasons have 18 weeks 
    # (the maximum number of weeks in a season since 2003)
    # so that all seasons have the same length and can easily be bound together
    while(numWeeks < 18) {
      numWeeks <- numWeeks + 1
      extra <- rep(NA,25)
      seasonRanking <- cbind(seasonRanking,extra)
      colnames(seasonRanking)[numWeeks]  <- paste("Week",numWeeks)  
    }
 
# Bind seasons together
allSeasonRankings <- rbind(allSeasonRankings, seasonRanking)
 
# Calculate the percentage of weeks each school was in the AP Top 25
seasonTable <- as.data.frame(table(unlist(seasonRanking)))
percentages <- round((seasonTable[2]/numWeeks)*100,2)
 
# Change column name to "Top 25 %" immediately. Otherwise percentages will 
# inherit the name "Freq" from the table function and not allow use of setnames() 
# since 2 columns have the same name
colnames(percentages)[1] <- "Top 25 %" 
seasonTable <- cbind(seasonTable, percentages)
seasonTable <- cbind(seasonTable, rep(seasons[j],length(seasonTable[1])))
allSeasonTable <- rbind(allSeasonTable,seasonTable)
}
 
# Clean up names
setnames(allSeasonTable,c("Var1", "rep(seasons[j], length(seasonTable[1]))"),
c("Team", "Season"))
 
# Add column with season
rankingYear <- rep(seasons, each=25)
 
# Combine data and cleanup names
allSeasonRankings <- cbind(rankingYear,allSeasonRankings)
allSeasonRankings <- as.data.frame(allSeasonRankings)
setnames(allSeasonRankings,"rankingYear", "Season")
 
# Save files
write.csv(allSeasonRankings,file="~/.../College Quality/Season Rankings.csv")
write.csv(allSeasonTable,file="~/.../College Quality/Percent Time in Top 25.csv")

The above code uses two custom functions to strip out the college name. One, strips out the college name and the second removes the trailing whitespace that sometimes occurs. There are a lot of different ways to do this. The most efficient is probably to use the functionality of the stringr package (such as string_extract()), but I wrote these functions when I was less aware of all of stringr’s functionality.

# Returns first string containing only letters, spaces, and the ' and & symbols
BegString <- function(x) {
  exp <- regexpr("^[a-zA-Z| |.|'|&]+",x)
  stringList <- substr(x,1,attr(exp,"match.length"))
  stringList <- removeTrailSpace(stringList)
  return(stringList)
}
# Removes trailing whitespace of a string
removeTrailSpace <- function(stringList) {
 
  whiteSpaceIndex <- regexpr(" +$",stringList)
  whiteSpaceSize <- attr(whiteSpaceIndex,"match.length")
 
  for(k in 1:length(stringList)) {
    if(whiteSpaceSize[k] > 0) {
      stringList[k] <- substr(stringList[k],1,whiteSpaceIndex[k]-1)
    }
  }
  stringList
}

The weekly ranking table ends up looking like this:

Screen Shot 2014-03-26 at 10.35.11 AM

This table is saved purely for reference since all of the meat is in the college quality calculation. College quality is shown below. Again, I kept the “Freq” in for reference so that I could randomly verify the results of a few observations to make sure the code worked properly. As you can see, 43 different teams spent at least one week in the AP Top 25 rankings during 2013.

Screen Shot 2014-03-26 at 10.36.46 AM

Now that I have this data I can merge it with the master list of players using the school name and season as keys.

R highlighting created by Pretty R at inside-R.org

 

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s