Basketball Project Part 4

After researching online basketball data in more depth I found that RealGM had so-called “split” data for college players. Players statistics are sliced in various ways such as performance against Top 25 teams.

n my original collection process involved scraping statistics from every college player, which was quite inefficient. It involved approximately 20,000 player-seasons worth of data and caused problems during the merge since so many players shared names. It also didn’t allow collection of the “split” data since these is housed on each player’s individual page instead of on the “All College Player Stats” page.

It was quite challenging figuring out how to scrape the RealGM site. The page structure was predictable aside from a unique id number for every player, which I assume comes from some sort of internal database on the RealGM site. These numbers range in length from two to five numerals and there is no way I could find to predict these numbers. For instance, Carmelo Anthony’s player page link is below. His player id is 452.

After a fair bit of thrashing about I finally came up with the solution to write an R script that would google the first portion of the player’s page link, read the Google page source, search for player’s site address using regular expressions, and then append their id to the rest of the structured web address.

For Carmelo, the script would use the following google search link:

The specificity of the search ensures that the RealGM link appears on the first page of search results (it was the first result in every test scenario I tried). The script then uses the following regular expression when search the Google search results page source:|News|\u2026)/[0-9]+

A player’s main page is always preceded by the player’s name and then “/Summary/id”, but “/News/id” and “/…/id” also appeared.  After it locates and reads this link it’s easy enough to strip out the player id and insert it into the player’s page that links to the advanced college data I was looking for.

# Read in players and convert names to proper format 
players.DF <- read.csv(file="~/.../Combined Data/Combined Data 1.csv")
players <- as.character(players.DF$Player)
players <- gsub("\\.","",players)
players <- gsub(" ","-",players)
# Initialize dataframes and vectors 
missedPlayers <- NULL
playerLinks <- rep(NA, length(players))
playerLinks <- data.frame(players.DF$Player, playerLinks)
# Create link for each player 
for(i in 1:length(players)) {
  url <- paste0('',players[i])
  result <- try(content <- getURLContent(url))
  if(class(result) == "try-error") { next; }
  id <- regexpr(paste0("", players[i],
  id <- substr(content, id, id + attr(id,"match.length"))
  id <- gsub("[^0-9]+","",id)
  id <- paste0('', players[i], '/NCAA/', 
  playerLinks[i,2] <- id
setnames(playerLinks, c("players.DF.Player","playerLinks"), c("Players","Links"))

Some sites have started to detect and try to prevent web scraping. On iteration 967 Google began blocking my search requests. However, I simply reran the script the next morning from iteration 967 onward to pickup the missing players.

I then used the fact that a missing id results in a page link with “NCAA//” to search for players that were still missing their ids.

> pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]

After examining the players I noticed many of these had apostrophes in their name, which I had forgotten to account for in my original name formatting.

Screen Shot 2014-03-27 at 3.13.47 PM

I adjusted my procedure and reran the script to get the pickups.

pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]
pickups <- pickups[[1]]
pickups <- gsub("'","",pickups)
pickups <- gsub(" ","-",pickups)
pickupNums <- grep("NCAA//",playerLinks[[2]])
for(i in 1:length(pickupNums)) {
  j <- pickupNums[i]
  url <- paste0('',pickups[i])
  result <- try(content <- getURLContent(url))
  if(class(result) == "try-error") { next; }
  id <- regexpr(paste0("", pickups[i],
  id <- substr(content, id, id + attr(id,"match.length"))
  id <- gsub("[^0-9]+","",id)
  id <- paste0('', pickups[i], 
  '/NCAA/', id,'/2014/By_Split/Advanced_Stats/Quality_Of_Opp')
  playerLinks[[j,2]] <- id

After rerunning the script three players were still missing ids, so I entered these manually.

playerLinks[[370,2]]  <- ""
playerLinks[[884,2]] <- ""
playerLinks[[1010,2]] <- ""

I also needed to manually check the three duplicate players and adjust their ids accordingly.

The final result looks like this:Screen Shot 2014-03-28 at 3.06.18 PM

The next step will be to cycle through the links and use readHTMLTable() to get the advanced statistics.

R Highlighting created by Pretty R at


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s