Basketball Project Part 3

While looking around at basketball data during the course of the project, I found a site that had a few pieces of data I wanted to pick up: a player’s shooting arm (right or left) and their high school ranking. The site is also packed with other data I may use in the future, such as a player’s shooting percentage from different distances from the basket. So I thought it would be good to create a procedure to scrape it.

The site uses an address structure that makes it easy to scrape: the base address + the first letter of the player’s last name + the first 5 letters of the player’s last name (or the whole last name if it has fewer than 5 letters) + the first two letters of their first name + a page number (usually a 1, but sometimes a 2 if more than one player shares a name). For instance, for Tony Mitchell the name pieces are “m”, “mitch” and “to”.
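As a rough illustration of that naming scheme, the sketch below splits a name into the three pieces. The helper name `player_key` is hypothetical, it uses base R only, and it deliberately stops short of assembling a full address, since the base address and any separators are not given here.

```r
# Hypothetical helper: split a player name into the pieces the address uses.
player_key <- function(name) {
  # Drop punctuation and lower-case the name, e.g. "O'Neal" becomes "oneal"
  clean <- tolower(gsub("[[:punct:]]", "", name))
  parts <- strsplit(clean, " +")[[1]]
  first <- substr(parts[1], 1, 2)              # first two letters of first name
  last  <- substr(parts[length(parts)], 1, 5)  # up to five letters of last name
  list(letter = substr(last, 1, 1), last = last, first = first)
}

key <- player_key("Tony Mitchell")
# key$letter is "m", key$last is "mitch", key$first is "to"
```

Because `substr()` simply returns the whole string when it is shorter than the requested length, last names with fewer than 5 letters need no special handling.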

R reads the page source and again the site uses a structured page profile:

[Screenshot: page source of a player profile]

I first used grep to locate the lines of the page source that contained “Shoots:” and “Recruiting Rank:”, and then used regular expressions to strip the information out. Not all players have both (or either) pieces of information, so I used a try() wrapper so the code could proceed through errors resulting from no match to the regular expressions.
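To make the locate-then-extract step concrete, here is a minimal sketch run against a few invented lines standing in for the page source (the real markup may well differ). Instead of relying on try(), it checks explicitly for a missing match and returns NA; that is an alternative technique, not the approach used in this post. The function name `extract_field` is hypothetical.

```r
# Invented stand-in for the page source; the real markup may differ.
page <- c("Position: Guard",
          "Shoots: Right",
          "Recruiting Rank: 2008 (12)")

# Find the first line matching line.pattern, then pull out value.pattern.
# Returns NA when either the line or the value is missing.
extract_field <- function(lines, line.pattern, value.pattern) {
  hit <- grep(line.pattern, lines, value = TRUE, perl = TRUE)
  if (length(hit) == 0) return(NA_character_)
  m <- regmatches(hit[1], regexpr(value.pattern, hit[1], perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

shoots  <- extract_field(page, "[Ss]hoots:", "[Rr]ight|[Ll]eft")
recruit <- extract_field(page, "Recruiting Rank:", "\\([0-9]+\\)$")
recruit <- gsub("[()]", "", recruit)  # strip the surrounding parentheses

# A page without the field yields NA rather than an error
missing <- extract_field(page[1], "[Ss]hoots:", "[Rr]ight|[Ll]eft")
```

With the sample lines above, `shoots` is "Right", `recruit` is "12", and `missing` is NA.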

library(stringr)     # str_replace_all(), str_extract()
library(data.table)  # setnames()
# Read in master player list
players.DF <- read.csv(file="~/.../All Drafted Players 2013-2003.csv")
allPlayers <- players.DF[,3]
# Convert names to proper format: no punctuation, lower case
allPlayers <- str_replace_all(allPlayers, "[[:punct:]]", "")
allPlayers <- tolower(allPlayers)
# First two letters of the first name
first <- str_extract(allPlayers,"^[^ ]+")
first <- substring(first,1,2)
# Up to five letters of the last name, plus its initial
last <- str_extract(allPlayers,"[^ ]+$")
last <- substring(last,1,5)
letter <- substring(last,1,1)
shootsVector <- rep(NA,length(allPlayers))
recruitVector <- rep(NA,length(allPlayers))
# Scrape the site and record shooting arm and HS ranking
for(i in seq_along(allPlayers)) {
  # NOTE: the paste0() arguments (site address, letter[i], last[i], first[i],
  # page number) were cut off in the original post and are left elided here
  page <- read.csv(paste0(...))
  # Shooting arm: find the "Shoots:" line, then pull out Right or Left
  line <- grep("[Ss]hoots:(.*)(Right|Left)", page[,], value = FALSE, perl = TRUE)
  index <- regexpr("[Rr]ight|[Ll]eft",page[line,])
  shoots <- substr(page[line,], index, index + attr(index,"match.length") - 1)
  # With no match the assignment errors; try() keeps the NA and moves on
  try(shootsVector[i] <- shoots, silent = TRUE)
  # HS ranking: the number in parentheses at the end of the line
  line <- grep("Recruiting Rank:(.*)([0-9]+)", page[,], value = FALSE, perl = TRUE)
  index <- regexpr("\\([0-9]+\\)$",page[line,])
  # Offsets of +1 and -2 drop the surrounding parentheses
  recruit <- substr(page[line,], index + 1, index + attr(index,"match.length") - 2)
  try(recruitVector[i] <- recruit, silent = TRUE)
}
# Combine information
players.DF <- cbind(players.DF, shootsVector,recruitVector)
setnames(players.DF,c("shootsVector","recruitVector"),c("Shooting Arm","HS Ranking"))
write.csv(players.DF,file="~/...Combined Data/Combined Data 1.csv")

The procedure is vulnerable to duplicate names, since players who share a name differ only in the page number. There are ways to deal with this in code. One way would be to also read the college from the page source and use that to pick out the right player. In this case, however, a duplicates report found only 3 duplicates.

> which(duplicated(allPlayers))
[1]  715  732 1118
> allPlayers[715]
[1] "tony mitchell"
> allPlayers[732]
[1] "chris wright"
> allPlayers[1118]
[1] "jamar smith"
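Note that duplicated() flags only the second and later occurrences of a name, so each flagged row has at least one earlier partner. As a small sketch on a toy vector (the vector below is made up for illustration and is not the real player list), every row involved in a duplicate can be listed like this:

```r
# Toy vector for illustration only; not the real player list
players <- c("tony mitchell", "chris wright", "jamar smith",
             "tony mitchell", "chris wright")

flagged   <- which(duplicated(players))            # later occurrences only
dup.names <- unique(players[duplicated(players)])  # names that repeat
dup.rows  <- which(players %in% dup.names)         # every row sharing a name
```

Here `flagged` is 4 and 5, while `dup.rows` is 1, 2, 4 and 5: two flagged names correspond to four rows to check.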

With only 3 duplicated names, it was much easier to just do a manual search on the 6 players and update their data. I chose to do this in Excel. Using the highlight duplicates feature, I could easily scroll down, find the duplicate players and change their shooting arm and HS ranking as necessary.

[Screenshot: duplicate players highlighted in Excel]
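For a larger list, the college check described earlier might be worth automating. The sketch below is only one possible shape for it, under heavy assumptions: the candidate pages are assumed to be already read into character vectors, the page content and college names are invented placeholders, and `pick_page` is a hypothetical helper rather than anything used in this post.

```r
# Hypothetical helper: given the page sources for players who share a name,
# return the index of the one page that mentions the expected college.
pick_page <- function(pages, college) {
  hits <- vapply(pages,
                 function(p) any(grepl(college, p, fixed = TRUE)),
                 logical(1))
  # Only accept an unambiguous answer; otherwise flag for manual review
  if (sum(hits) == 1) which(hits) else NA_integer_
}

# Invented page content for two players with the same name
pages <- list(
  c("Name: Example Player", "College: College A"),
  c("Name: Example Player", "College: College B")
)

match.b <- pick_page(pages, "College B")
match.none <- pick_page(pages, "College C")
```

With these placeholder pages, `match.b` is 2 and `match.none` is NA, so cases with no match or more than one match still fall back to a manual check.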
