Basketball Project Part 4

After researching online basketball data in more depth, I found that RealGM has so-called “split” data for college players. Player statistics are sliced in various ways, such as performance against Top 25 teams.

My original collection process involved scraping statistics for every college player, which was quite inefficient. It covered approximately 20,000 player-seasons worth of data and caused problems during the merge since so many players share names. It also didn’t allow collection of the “split” data, since that data is housed on each player’s individual page instead of on the “All College Player Stats” page.

It was quite challenging to figure out how to scrape the RealGM site. The page structure is predictable aside from a unique id number for every player, which I assume comes from some sort of internal database on the RealGM site. These numbers range in length from two to five digits, and I could find no way to predict them. For instance, Carmelo Anthony’s player page link is below; his player id is 452.

http://basketball.realgm.com/player/Carmelo-Anthony/NCAA/452/2014/By_Split/Advanced_Stats/Quality_Of_Opp

After a fair bit of thrashing about, I finally came up with a solution: write an R script that googles the first portion of the player’s page link, reads the Google results page source, finds the player’s site address using regular expressions, and then inserts the extracted id into the rest of the structured web address.

For Carmelo, the script would use the following Google search link:

https://www.google.com/search?q=realgm.com/player/Carmelo-Anthony

The specificity of the search ensures that the RealGM link appears on the first page of search results (it was the first result in every test scenario I tried). The script then uses the following regular expression when searching the Google results page source:

realgm.com/player/Carmelo-Anthony/(Summary|News|\u2026)/[0-9]+

A player’s main page link is always the player’s name followed by “/Summary/id”, but “/News/id” and “/…/id” also appeared. After the script locates and reads this link, it’s easy enough to strip out the player id and insert it into the link to the player’s advanced college data.

library(XML)
library(RCurl)
library(data.table)
 
# Read in players and convert names to the format used in RealGM links 
players.DF <- read.csv(file="~/.../Combined Data/Combined Data 1.csv")
players <- as.character(players.DF$Player)
players <- gsub("\\.","",players)
players <- gsub(" ","-",players)
 
# Initialize the dataframe that will hold each player's link 
playerLinks <- rep(NA, length(players))
playerLinks <- data.frame(players.DF$Player, playerLinks)
 
# Create link for each player 
for(i in 1:length(players)) {
  url <- paste0('https://www.google.com/search?q=realgm.com/player/',players[i])
  result <- try(content <- getURLContent(url))
  if(inherits(result, "try-error")) { next }
 
  # Find the "/Summary/id" (or "/News/id" or "/…/id") link in the page source
  id <- regexpr(paste0("realgm.com/player/", players[i],
  "/(Summary|News|\u2026)","/[0-9]+"),content)
 
  # Strip the id out of the match and build the player's split-data link
  id <- substr(content, id, id + attr(id,"match.length") - 1)
  id <- gsub("[^0-9]+","",id)
  id <- paste0('http://basketball.realgm.com/player/', players[i], '/NCAA/', 
  id,'/2014/By_Split/Advanced_Stats/Quality_Of_Opp')
  playerLinks[i,2] <- id
}
 
setnames(playerLinks, c("players.DF.Player","playerLinks"), c("Players","Links"))

Some sites have started to detect and try to prevent web scraping. On iteration 967, Google began blocking my search requests. However, I simply reran the script the next morning from iteration 967 onward to pick up the missing players.
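The rerun was simply the same loop restarted partway through; a minimal sketch, reusing the players and playerLinks objects from the script above:

# Restart the scraping loop at the iteration where Google began blocking;
# players and playerLinks are assumed to still exist from the run above
for(i in 967:length(players)) {
  url <- paste0('https://www.google.com/search?q=realgm.com/player/', players[i])
  result <- try(content <- getURLContent(url))
  if(inherits(result, "try-error")) { next }
  id <- regexpr(paste0("realgm.com/player/", players[i],
                "/(Summary|News|\u2026)/[0-9]+"), content)
  id <- gsub("[^0-9]+", "", substr(content, id, id + attr(id,"match.length") - 1))
  playerLinks[i,2] <- paste0('http://basketball.realgm.com/player/', players[i],
                      '/NCAA/', id, '/2014/By_Split/Advanced_Stats/Quality_Of_Opp')
}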

I then used the fact that a missing id results in a page link with “NCAA//” to search for players that were still missing their ids.

> pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]

After examining these players, I noticed that many had apostrophes in their names, which I had forgotten to account for in my original name formatting.

[Screenshot: players still missing ids, many with apostrophes in their names]

I adjusted my procedure and reran the script to get the pickups.

pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]
pickups <- pickups[[1]]
pickups <- gsub("'","",pickups)
pickups <- gsub("\\.","",pickups)
pickups <- gsub(" ","-",pickups)
pickupNums <- grep("NCAA//",playerLinks[[2]])
 
for(i in 1:length(pickupNums)) {
  j <- pickupNums[i]
  url <- paste0('https://www.google.com/search?q=realgm.com/player/',pickups[i])
  result <- try(content <- getURLContent(url))
  if(inherits(result, "try-error")) { next }
  id <- regexpr(paste0("realgm.com/player/", pickups[i],
  "/(Summary|News|\u2026)","/[0-9]+"),content)
 
  id <- substr(content, id, id + attr(id,"match.length") - 1)
  id <- gsub("[^0-9]+","",id)
  id <- paste0('http://basketball.realgm.com/player/', pickups[i], 
  '/NCAA/', id,'/2014/By_Split/Advanced_Stats/Quality_Of_Opp')
 
  playerLinks[[j,2]] <- id
}

After rerunning the script, three players were still missing ids, so I entered these manually.

playerLinks[[370,2]]  <- "http://basketball.realgm.com/player/Eric-Gordon/NCAA/762/2014/By_Split/Advanced_Stats/Quality_Of_Opp"
playerLinks[[884,2]] <- "http://basketball.realgm.com/player/Randolph-Morris/NCAA/166/2014/By_Split/Advanced_Stats/Quality_Of_Opp"
playerLinks[[1010,2]] <- "http://basketball.realgm.com/player/Slavko-Vranes/NCAA/472/2014/By_Split/Advanced_Stats/Quality_Of_Opp"

I also needed to manually check the three duplicate players and adjust their ids accordingly.

The final result looks like this:

[Screenshot: the completed table of players and their links]

The next step will be to cycle through the links and use readHTMLTable() to get the advanced statistics.
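As a preview, here is a minimal sketch of that step. It assumes the playerLinks table built above, and the table position ([[1]]) is a guess until I inspect an actual page:

library(XML)
 
# Loop over the links built above and read the split tables from each
# player's page; which table holds the advanced split stats ([[1]]) is
# an assumption that will need to be verified
splitStats <- vector("list", nrow(playerLinks))
for(i in 1:nrow(playerLinks)) {
  result <- try(tables <- readHTMLTable(as.character(playerLinks[i, "Links"])))
  if(inherits(result, "try-error")) { next }
  splitStats[[i]] <- tables[[1]]
}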


Basketball Project Part 3

While I was looking around at basketball data during the course of the project, I saw that Basketball-Reference.com had a few pieces of data I wanted to pick up: a player’s shooting arm (right or left) and their high school ranking. The site is also packed with a ton of other data I may use in the future, such as a player’s shooting percentage from different distances from the basket. So I thought it would be good to create a procedure to scrape it.

The site uses a particular address structure that makes it easy to scrape: http://www.basketball-reference.com/players + the first letter of the player’s last name + the first 5 letters of the player’s last name (or the whole last name if it has fewer than 5 letters) + the first two letters of their first name + a page number (usually 01, but 02 if more than one player shares a name). For instance, http://www.basketball-reference.com/players/a/anthoca01.html.

R reads in the page source, and again the site uses a structured page layout:

[Screenshot: Basketball-Reference page source showing the “Shoots:” and “Recruiting Rank:” lines]

I first used grep() to locate the lines of the page source that contain “Shoots:” and “Recruiting Rank:”, and then used regular expressions to strip the information out. Not all players have both (or either) piece of information, so I used a try() wrapper so the code could proceed past the errors that result when the regular expressions find no match.

library(stringr)
library(data.table)
 
# Read in master player list
players.DF <- read.csv(file="~/.../All Drafted Players 2013-2003.csv")
allPlayers <- players.DF[,3]
 
# Convert names to the format used in Basketball-Reference addresses
allPlayers <- str_replace_all(allPlayers, "[[:punct:]]", "")
allPlayers <- tolower(allPlayers)
first <- str_extract(allPlayers,"^[^ ]+")
first <- substring(first,1,2)
last <- str_extract(allPlayers,"[^ ]+$")
last <- substring(last,1,5)
letter <- substring(last,1,1)
 
shootsVector <- rep(NA,length(allPlayers))
recruitVector <- rep(NA,length(allPlayers))
 
# Scrape the site and record shooting arm and HS ranking
for(i in 1:length(allPlayers)) {
  page <- try(readLines(paste0(
  'http://www.basketball-reference.com/players/',letter[i],'/',last[i],first[i],'01.html')))
  if(inherits(page, "try-error")) { next }
 
  # Locate the line containing the shooting arm and strip out "Right" or "Left"
  line <- grep("[Ss]hoots:(.*)(Right|Left)", page, value = FALSE, perl = TRUE)
  index <- regexpr("[Rr]ight|[Ll]eft", page[line])
  shoots <- substr(page[line], index, index + attr(index,"match.length") - 1)
  result <- try(shootsVector[i] <- shoots)
  if(inherits(result, "try-error")) { next }
 
  # Locate the line containing the recruiting rank and strip out the number
  line <- grep("Recruiting Rank:(.*)([0-9]+)", page, value = FALSE, perl = TRUE)
  index <- regexpr("\\([0-9]+\\)$", page[line])
  recruit <- substr(page[line], index + 1, index + attr(index,"match.length") - 2)
  result <- try(recruitVector[i] <- recruit)
  if(inherits(result, "try-error")) { next }
 
  print(shoots)
  print(recruit)
}
 
# Combine information
players.DF <- cbind(players.DF, shootsVector,recruitVector)
setnames(players.DF,c("shootsVector","recruitVector"),c("Shooting Arm","HS Ranking"))
write.csv(players.DF,file="~/.../Combined Data/Combined Data 1.csv")

The procedure is vulnerable to duplicate names. There are ways to deal with this in code; one way, sketched after the duplicates report below, would be to also read the college from the page source and use it to pick out the correct player. In this case, however, a duplicates report turned up only 3 duplicates.

> which(duplicated(allPlayers))
[1]  715  732 1118
> allPlayers[715]
[1] "tony mitchell"
> allPlayers[732]
[1] "chris wright"
> allPlayers[1118]
[1] "jamar smith"

With only 3 duplicated names, it was much easier to just do a manual search on the 6 players and update their data. I chose to do this in Excel. Using the highlight-duplicates feature, I could easily scroll down, find the 3 duplicate players, and change their shooting arm and HS ranking as necessary.

[Screenshot: Excel sheet with the duplicate players highlighted]


Basketball Project Part 2

One piece of data I wanted to have for my statistical analysis was the quality of college a player attended. I chose to measure college quality by the number of weeks a team was in the Associated Press (AP) Top 25 college basketball rankings. Note that I only used regular-season rankings, not pre- or post-season rankings, which are not available for all years. Historic rankings dating back to the 2002-2003 season are available on the ESPN website. However, when scraping ESPN’s webpage I found the data was semi-structured.

[Screenshot: ESPN historic AP Top 25 rankings page]

The code to read in the college name must be robust enough to ignore all the possible characters following the college name, but flexible enough to detect “exotic” college names like “Texas A&M” and “St. John’s.” The code first reads in each week’s rankings and strips out the college names, then binds the weeks together. If a season has fewer than 18 weeks, NAs are introduced so that every season is the same length and the seasons can be bound together. College quality is then calculated for each season. Finally, the weekly rankings for every season are bound into a single table and saved, as is the college quality for every season. The code is shown below.

library(XML)
library(data.table)
 
# Initialize variables
seasons <- seq(2013,2003,by=-1)
allSeasonRankings <- NULL
allSeasonTable <- NULL
missedPages <- NULL
 
# Web scrape
# Iterate over each week in each season
for(j in 1:length(seasons)) {
  numWeeks <- 0
  seasonRanking <- NULL
  week <- NULL
 
  for (i in 2:19)
  {
    result <- try(week <- readHTMLTable(paste0(
    'http://espn.go.com/mens-college-basketball/rankings/_/poll/1/year/',
    seasons[j], '/week/', i ,'/seasontype/2'),skip.rows=c(1,2))[[1]][,2])
 
    # Record any page that could not be read, then move on
    if(inherits(result, "try-error")) {
      missedPages <- rbind(missedPages, c(seasons[j], i))
      next
    }
    print(paste0('http://espn.go.com/mens-college-basketball/rankings/_/poll/1/year/', 
    seasons[j], '/week/', i ,'/seasontype/2'))
 
    numWeeks <- numWeeks + 1
    week <- as.data.frame(array(BegString(week)))
    seasonRanking <- cbind(seasonRanking,week[[1]])
    colnames(seasonRanking)[numWeeks] <- paste("Week",numWeeks)
  }
 
  # Record the true number of ranking weeks before padding so the
  # percentage calculation below uses the correct denominator
  actualWeeks <- numWeeks
 
  # Ensure that all seasons have 18 weeks
  # (the maximum number of weeks in a season since 2003)
  # so that all seasons have the same length and can easily be bound together
  while(numWeeks < 18) {
    numWeeks <- numWeeks + 1
    extra <- rep(NA,25)
    seasonRanking <- cbind(seasonRanking,extra)
    colnames(seasonRanking)[numWeeks] <- paste("Week",numWeeks)
  }
 
  # Bind seasons together
  allSeasonRankings <- rbind(allSeasonRankings, seasonRanking)
 
  # Calculate the percentage of weeks each school was in the AP Top 25
  seasonTable <- as.data.frame(table(unlist(seasonRanking)))
  percentages <- round((seasonTable[2]/actualWeeks)*100,2)
 
  # Change the column name to "Top 25 %" immediately. Otherwise percentages will
  # inherit the name "Freq" from the table function and not allow use of setnames()
  # since 2 columns would have the same name
  colnames(percentages)[1] <- "Top 25 %"
  seasonTable <- cbind(seasonTable, percentages)
  seasonTable <- cbind(seasonTable, Season = seasons[j])
  allSeasonTable <- rbind(allSeasonTable,seasonTable)
}
 
if(!is.null(missedPages)) { colnames(missedPages) <- c("Season","Week") }
 
# Clean up names
setnames(allSeasonTable, "Var1", "Team")
 
# Add column with season
rankingYear <- rep(seasons, each=25)
 
# Combine data and clean up names
allSeasonRankings <- cbind(rankingYear,allSeasonRankings)
allSeasonRankings <- as.data.frame(allSeasonRankings)
setnames(allSeasonRankings,"rankingYear", "Season")
 
# Save files
write.csv(allSeasonRankings,file="~/.../College Quality/Season Rankings.csv")
write.csv(allSeasonTable,file="~/.../College Quality/Percent Time in Top 25.csv")

The above code uses two custom functions to strip out the college name: one strips out the college name itself, and the second removes the trailing whitespace that sometimes occurs. There are a lot of different ways to do this. The most efficient is probably to use the functionality of the stringr package (such as str_extract()), but I wrote these functions when I was less aware of all of stringr’s functionality (a sketch of a stringr equivalent follows the functions below).

# Returns the first string containing only letters, spaces, periods, 
# and the ' and & symbols
BegString <- function(x) {
  exp <- regexpr("^[a-zA-Z .'&]+",x)
  stringList <- substr(x,1,attr(exp,"match.length"))
  stringList <- removeTrailSpace(stringList)
  return(stringList)
}
# Removes trailing whitespace from each string in a vector
removeTrailSpace <- function(stringList) {
 
  whiteSpaceIndex <- regexpr(" +$",stringList)
  whiteSpaceSize <- attr(whiteSpaceIndex,"match.length")
 
  for(k in 1:length(stringList)) {
    if(whiteSpaceSize[k] > 0) {
      stringList[k] <- substr(stringList[k],1,whiteSpaceIndex[k]-1)
    }
  }
  stringList
}
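For reference, a sketch of what a stringr equivalent might look like (str_trim() takes the place of removeTrailSpace()):

library(stringr)
 
# One-line equivalent of BegString(): extract the leading run of letters,
# spaces, periods, apostrophes, and ampersands, then trim trailing whitespace
BegStringStringr <- function(x) {
  str_trim(str_extract(x, "^[a-zA-Z .'&]+"), side = "right")
}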

The weekly ranking table ends up looking like this:

[Screenshot: the weekly ranking table]

This table is saved purely for reference since all of the meat is in the college quality calculation, which is shown below. Again, I kept the “Freq” column in for reference so that I could randomly verify the results of a few observations and make sure the code worked properly. As you can see, 43 different teams spent at least one week in the AP Top 25 rankings during the 2013 season.

[Screenshot: the college quality table for 2013]
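A quick way to do that kind of spot check in R (the team name here is just an example):

# Count the weeks an example team appears in the 2013 weekly rankings and
# compare against its "Freq" and "Top 25 %" entries in allSeasonTable
weeks2013 <- allSeasonRankings[allSeasonRankings$Season == "2013", -1]
sum(weeks2013 == "Duke", na.rm = TRUE)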

Now that I have this data I can merge it with the master list of players using the school name and season as keys.
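A minimal sketch of that merge, assuming the master player list is a data frame players.DF with “School” and “Season” columns (those column names are assumptions):

# Merge college quality onto the master player list; the by.x column
# names are assumed, while by.y matches the allSeasonTable names above
merged <- merge(players.DF, allSeasonTable,
                by.x = c("School", "Season"), by.y = c("Team", "Season"),
                all.x = TRUE)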


 

Tanzania Seed Map

[Image: Tanzania seed map]

This is a map I created for my thesis on agriculture in Tanzania. The data shows the district variation in the area planted to modern seed varieties. The data is from a 2007-2008 agriculture survey conducted by the Ministry of Agriculture. I used an average of the area planted during the long and short rainy season weighted to account for the fact that most agricultural activity takes place during the long season.

Tanzania Seed Chain

[Image: diagram of the formal, informal, and QDS seed chains]

This is a chart I made for my thesis on agriculture in Tanzania describing the relationship between Tanzania’s three seed chains. The chart was adapted from several sources and, although it looks simple, it took quite a while to come up with a shape that would depict the relationship between the three different seed systems.

Major Seed Importing Countries

[Image: map of the major countries exporting seed to Tanzania]

This is a map I created showing the main countries exporting seed to Tanzania. I wanted to give the map some depth, so I used the 3-D features in Illustrator to give a sort of bird’s-eye view of the region. I then placed the dotted lines and arrows individually. The subtle shadow effect was made by copying the lines and arrows and using a Gaussian blur. Illustrator has a built-in shadow effect, but it looks very wonky if you try to use it with curved lines. Making the shadows by hand gives much more control and the effect is much more polished.

Modern Seed Yields

[Image: MV yield improvement chart]

This is a chart I made using CGIAR data, which shows the improvement in yield for a variety of crops using so-called “modern seeds” (sometimes called “improved seeds”). These are non-GMO seed varieties cultivated by agronomists using traditional breeding techniques, but in a highly controlled environment. I created this chart as part of my thesis on agriculture in Tanzania.

 

Metro-(gnome)

[Image: metro-(gnome) icon]

This is the first installment in my collection (o)-bjects. The original idea was to practice drawing somewhat intricate icons. Later, I came up with the idea to contrast the simplicity of the line drawings with whimsical title styling. Probably unnecessary, but I went with the idea nonetheless.

Nothing fancy about the metro-(gnome) body. The scale was created using a brush. I made a small brush and then applied the pattern to a 1-point line. I made the textured knob by modifying the procedure I found in a Lynda.com tutorial, which showed how to make a gear. I brought the gnome into Illustrator and then used Live Trace to get the painted look. It took a fair amount of cleanup to smooth out the edges and delete the background.

The original inspiration for this icon (and the series) was this ruler I found and pinned onto one of my Pinterest boards.

[Image: the ruler pin]

 

Image Source: The wonderful gnome picture came from Wikipedia.