Distribution Mean Convergence

Suppose we want to simulate $\frac{1}{n}\sum_{i=1}^{n} X_i$ for $X_1, X_2, \dots , X_n \sim N(0, 1)$, $n = 1, \dots , 10,000$. Suppose we want to do the same for the Cauchy distribution.

In other words, we want to draw several random variables from a normal distribution and then take the average. As $n$ increases we should get closer to the mean of the distribution we’re drawing from, 0 in this case.

The R code below will do this. It produces this graph:

Notice that while the Normal distribution converges quickly, the Cauchy never does. This is because the Cauchy distribution has fat tails, so extreme observations are common and the running average never settles down.
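In fact, the Cauchy distribution has no mean at all: the defining integral diverges, so the law of large numbers does not apply. Using the standard Cauchy density $f(x) = \frac{1}{\pi(1+x^2)}$:

$\mathbb{E}|X| = \int_{-\infty}^{\infty} \frac{|x|}{\pi(1+x^2)}\,dx = \frac{2}{\pi}\int_0^{\infty}\frac{x}{1+x^2}\,dx = \frac{1}{\pi}\Big[\ln(1+x^2)\Big]_0^{\infty} = \infty$

Indeed, the sample mean of $n$ standard Cauchy draws is itself standard Cauchy, so averaging more observations doesn't help at all.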

################################################################
# R Simulation
# James McCammon
# 2/20/2017
################################################################
# This script goes through the simulation of plotting both normal
# and Cauchy means for random vectors of size 1 to 10,000. It
# also demonstrates function creation and plotting.
# Highlight desired section and click "Run."

# Set working directory as needed
setwd("~/R Projects/All of Statistics")

###
# Calculate means and plot using base R
###

# Set seed for reproducibility
set.seed(271)

#
# Version 1: Simple
#
n = seq(from=1, to=10000, by=1)
y=sapply(n, FUN=function(x) sum(rnorm(x))/x)
plot(n, y, type="l")

#
# Version 2: Define a function
#
sim = function(x, FUN) {
  sapply(x, FUN=function(x) sum(FUN(x))/x)
}

# Use function to plot normal means
x = seq(from=1, to=10000, by=1)
y1 = sim(x, rnorm)
plot(x, y1, type="l")

# Use function to plot Cauchy means
y2 = sim(x, rcauchy)
plot(x, y2, type="l")

#
# Version 3: More complex function
#

# This function has:
# (1) error checking
# (2) extra argument options
# (3) the ability to input any distribution R supports
sim = function(x, FUN, ...) {
  if(!is.character(FUN)) stop('Please enter distribution as string.')
  dists = c('rnorm', 'rbeta', 'rbinom', 'rcauchy', 'rchisq', 'rexp',
            'rf', 'rgamma', 'rgeom', 'rhyper', 'rlnorm', 'rmultinom',
            'rnbinom', 'rpois', 'rt', 'runif', 'rweibull')
  if(is.na(match(FUN, dists)))
    stop(paste('Please enter a valid distribution from one of:',
               paste(dists, collapse=', ')))
  FUN = get(FUN)
  sapply(x, FUN=function(x) sum(FUN(x, ...))/x)
}

# We have to define our function in string form.
# This will throw error 1.
test1 = sim(x, rnorm)

# We have to input a distribution R supports.
# This will throw error 2.
test2 = sim(x, 'my_cool_function')

# We can input additional arguments like the
# mean, standard deviation, or other shape parameters.
test3 = sim(x, 'rnorm', mean=10, sd=2)

###
# Using ggplot2 to make pretty graph
###

library(ggplot2)
library(ggthemes)
library(gridExtra)

png(filename='Ch3-Pr9.png', width=1200, height=600)

df1 = cbind.data.frame(x, y1)
p1 = ggplot(df1, aes(x=x, y=y1)) +
  geom_line(size = 1, color='#937EBF') +
  theme_fivethirtyeight() +
  ggtitle("Normal Means")

df2 = cbind.data.frame(x, y2)
p2 = ggplot(df2, aes(x=x, y=y2)) +
  geom_line(size = 1, color='#EF4664') +
  theme_fivethirtyeight() +
  ggtitle("Cauchy Means")

# Save charts
grid.arrange(p1,p2,nrow=2,ncol=1)

dev.off()
# Print charts to screen
grid.arrange(p1,p2,nrow=2,ncol=1)


Simple Random Walk – Method 1

Suppose we consider a simple random walk. A particle starts at initial position $z_i$ and moves one unit to the left with probability $p$ and moves one unit to the right with probability $1-p$. What is the expected position $\mathbb{E}[X_n]$ of the particle after $n$ steps?

I will calculate the expected value using two different methods. The second method is simpler, but I'll start with an iteration method.

Our PMF is:

$f_X(x) = \begin{cases} p & x = -1 \\ 1-p & x = 1 \end{cases}$

Let’s set our initial position as:
$n=0: \quad \mathbb{E}[X_0] = z_i$

After one step our expected position is then:
$n=1: \quad \mathbb{E}[X_1] = (z_i - 1)p + (z_i + 1)(1 - p)$
$n=1: \quad \mathbb{E}[X_1] = z_{i}p - p + z_i + 1 - z_{i}p - p$
$n=1: \quad \mathbb{E}[X_1] = z_i + 1 - 2p$

Great, let’s try iterating one more to see what we get. Note that at $n=2$ our position is now the result from $n=1$, $z_i + 1 - 2p$.
$n=2: \quad \mathbb{E}[X_2] = (z_i + 1 - 2p - 1)p + (z_i + 1 - 2p + 1)(1 - p)$
$n=2: \quad \mathbb{E}[X_2] = z_{i}p - 2p^2 + z_i - 2p + 2 - z_{i}p + 2p^2 - 2p$
$n=2: \quad \mathbb{E}[X_2] = z_i + 2(1 - 2p)$

If we keep iterating we will see that $\mathbb{E}[X_n] = z_i + n(1 - 2p)$. But we can prove this formally through induction. We’ve already done our base case, so let’s now do the induction step. We will assume that $\mathbb{E}[X_n] = z_i + n(1 - 2p)$ is true and show that it is also true for $n + 1$.

$\mathbb{E}[X_{n+1}] = (z_i + n(1 - 2p) - 1)p + (z_i + n(1 - 2p) + 1)(1 - p)$
$\mathbb{E}[X_{n+1}] = (z_i + n - 2pn - 1)p + (z_i + n - 2pn + 1)(1 - p)$
$\mathbb{E}[X_{n+1}] = z_{i}p + pn - 2p^{2}n - p + z_i + n - 2pn + 1 -z_{i}p - pn + 2p^{2}n - p$
$\mathbb{E}[X_{n+1}] = - p + z_i + n - 2pn + 1 - p$
$\mathbb{E}[X_{n+1}] = z_i + (n + 1)(1 - 2p)$

Thus our induction step holds and we have shown that $\mathbb{E}[X_n] = z_i + n(1 - 2p)$.

Because we chose our initial starting position $z_i$ to be arbitrary, we might as well set it to 0 to obtain a final result of $\mathbb{E}[X_n] = n(1 - 2p)$.

Let’s take a moment to think about this result and make sure it seems reasonable. Suppose $p = 0.5$, meaning we have an equal chance of moving left or right. Over the long run we would expect our final position to be exactly where we started, and plugging $p = 0.5$ into our equation yields $n(1 - 2 \cdot 0.5) = n(1 - 1) = 0$. Just as we expected! What if $p = 1$? This means we only move to the left. Plugging $p = 1$ into our equation yields $n(1 - 2 \cdot 1) = -n$. This makes sense! If we can only move left, then after $n$ steps we would expect to be $n$ steps left of our starting position (the origin as we chose it), the negative direction in our problem setup. We could also choose $p = 0$, meaning we only move to the right, and we would get $n$, again just what we would expect!

We can also run a simulation in R to verify our results:

################################################################
# R Simulation
################################################################
# Generate random walk
rand_walk = function(n, p, z) {
  walk = sample(c(-1,1), size=n, replace=TRUE, prob=c(p,1-p))
  for (i in 1:n) {
    z = z + walk[i]
  }
  return(z)
}

n = 1000 # Walk n steps
p = .3 # Probability of moving left
z = 0 # Set initial position to 0
trials = 10000 # Num times to repeat sim
# Run simulation
X = replicate(trials, rand_walk(n,p,z))

# Calculate empirical and theoretical results
empirical = mean(X)
theoretical = n*(1-2*p)
percent_diff = abs((empirical-theoretical)/empirical)*100

# print to console
empirical
theoretical
percent_diff


Printing to the console we see that after 10,000 trials of 1,000 steps each our empirical and theoretical results differ by just 0.046%.

> empirical
[1] 400.1842
> theoretical
[1] 400
> percent_diff
[1] 0.0460288
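The for loop inside rand_walk accumulates the steps one at a time; since the final position is just the start plus the sum of the steps, the same walk can be written as a one-line vectorized function. A sketch (rand_walk_vec is my own name for it):

```r
# Vectorized random walk: the final position is the starting position
# plus the sum of the +/-1 steps
rand_walk_vec = function(n, p, z) {
  z + sum(sample(c(-1, 1), size = n, replace = TRUE, prob = c(p, 1 - p)))
}
```

Because sample() is called with the same arguments, this draws the identical step sequence as rand_walk for a given seed, while avoiding the loop.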


Finding the Expected Value of the Maximum of n Random Variables

My friend Ryan, who is also a math tutor at UW, and I are working our way through several math resources, including Larry Wasserman’s famous All of Statistics. Here is one problem we worked through:

Suppose we have $n$ random variables $X_1, \dots, X_n$, each distributed uniformly: $X_i \sim Uniform(0,1)$. We want to find $\mathbb{E}[Y_n]$ where $Y_n = \max\{X_1, \dots, X_n\}$.

First, we need to find the Probability Density Function (PDF) $f_Y(y)$ and we do so in the usual way, by first finding the Cumulative Distribution Function (CDF) and taking the derivative:

$F_Y(y) = P(Y < y)$
$F_Y(y) = P(\max\{X_1, ..., X_n\} < y)$
$F_Y(y) = P(X_1 < y, \dots, X_n < y)$

We want to be able to take this step:
$F_Y(y) = P(X_1 < y)P(X_2 < y) \cdots P(X_n < y)$

But that requires independence, and we are not given that our $X_i$‘s are in fact independent. Thanks to Ryan for helping me see that, by definition, with $A = \{(x_1, \dots, x_n) : x_1 < y, \dots, x_n < y\}$:

$F_Y(y) = \underset{A}{\idotsint} f_{X_1, \dots, X_n}(x_1, \dots, x_n) \,dx_1 \dots dx_n$

However, note that in this case the joint density is constant at $1$ on the unit $n$-cube, i.e. $f_{X_1, \dots, X_n}(x_1, \dots, x_n) = 1$. Our equation then simplifies:

$F_Y(y) = \idotsint 1 \,dx_1 \dots dx_n$
$F_Y(y) = \int_0^y dx_1 \cdots \int_0^y dx_n = [F_X(y)]^n$

where $X$ is a generic random variable, by symmetry (all $X_i$‘s are identically distributed). This is the same answer we would have gotten if we had made the iid assumption earlier and written $F_Y(y) = P(X_1 < y)P(X_2 < y) \cdots P(X_n < y)$. Originally I had made this assumption by way of wishful thinking (and a bit of intuition: it does seem that $n$ uniformly distributed random variables would be independent), but Ryan corrected my mistake.

Now that we have $F_Y(y)$ we can find $f_Y(y)$ the PDF.

$f_Y(y) = \frac{d}{dy}F_Y(y) = \frac{d}{dy}[F_X(y)]^n$
$f_Y(y) = n[F_X(y)]^{n-1}f_X(y)$ by the chain rule.

Recall that the PDF $f_X(x)$ of $X \sim Uniform(0,1)$ is $\frac{1}{b-a} = \frac{1}{1-0} = 1$ for $x \in [0,1]$. And by extension the CDF $F_X(x)$ of $X \sim Uniform(0,1)$ is:
$\int_a^x f(t)dt = \int_a^x \frac{1}{b-a}dt = t\frac{1}{b-a} \bigm|_a^x = \frac{x}{b-a} - \frac{a}{b-a} = \frac{x-a}{b-a} = \frac{x-0}{1-0} = x$.

Plugging these values into our equation above (and noting that we have $F_X(y)$ rather than $F_X(x)$, so we simply replace the $x$ we just derived with $y$) we have:

$f_Y(y) = ny^{n-1} \cdot 1$

Finally, we are ready to take our expectation:

$\mathbb{E}[Y] = \int_{y\in A}yf_Y(y)dy = \int_0^1 yny^{n-1}dy = n\int_0^1 y^{n}dy = n\bigg[\frac{1}{n+1}y^{n+1}\bigg]_0^1 = \frac{n}{n+1}$

Let’s take a moment and make sure this answer seems reasonable. First, note that in the trivial case of $Y = \max\{X_1\}$ (which is simply $Y = X_1$; $n = 1$ in this case) we get $\frac{1}{1+1} = \frac{1}{2}$. This makes sense! If $Y = X_1$ then $Y$ is just a uniform random variable on the interval $0$ to $1$, and the expected value of that random variable is $\frac{1}{2}$, which is exactly what we got.

Also notice that $\lim_{n\to\infty} \frac{n}{n+1} = 1$. This also makes sense! If we take the maximum of 1 or 2 or 3 $X_i$‘s, each randomly drawn from the interval 0 to 1, we would expect the largest of them to be a bit above $\frac{1}{2}$, the expected value for a single uniform random variable, but we wouldn’t expect values extremely close to 1 like 0.9. However, if we took the maximum of, say, 100 $X_i$‘s, we would expect at least one of them to be pretty close to 1 (and since we’re choosing the maximum, that’s the one we would select). This doesn’t guarantee our math is correct (although it is), but it does give a gut check that what we derived is reasonable.

We can further verify our answer by simulation in R, for example by choosing $n = 5$ (thanks to the fantastic Markup.su syntax highlighter):

################################################################
# R Simulation
################################################################
X = 5
Y = replicate(100000, max(runif(X)))
empirical = mean(Y)
theoretical = (X/(X+1)) # 5/6 = 0.833 in this case
percent_diff = abs((empirical-theoretical)/empirical)*100

# print to console
empirical
theoretical
percent_diff


We can see from our results that our theoretical and empirical results differ by just 0.05% after 100,000 runs of our simulation.

> empirical
[1] 0.8337853
> theoretical
[1] 0.8333333
> percent_diff
[1] 0.0542087


Kelsey Plum’s Chase for #1

Thanks to Graham for this question on Whale. Graham asked about Kelsey Plum and whether she will break the record.

“What record?” you might ask. Plum is very close to becoming the all-time NCAA women’s basketball scoring leader. That’s a really big deal and probably one of the most underreported stories in basketball right now. NCAA women’s basketball started in 1981; that’s 35 years’ worth of basketball. Plum will have scored more points than greats like Brittney Griner, Chamique Holdsclaw, and Cheryl Miller.

In December of last year Plum became the all-time Pac-12 scoring leader with 44 points in a win against Boise State. She had 44 points again on Sunday in a 72-68 loss to Stanford.

Plum is now averaging 31.4 points per game (to go along with 5 rebounds and 5 assists) and now sits third all time, just 255 points away from Jackie Stiles who set the record 15 years ago with 3,393. The UW Women have 8 games to go in the regular season and if Plum keeps up her average she’ll fall just shy of the record by season’s end with 3,388 points. Luckily, the UW Women are all but guaranteed at least two post-season games, one in the Pac-12 tournament and another in the NCAA tournament, which they’ll likely make even if they fall in the first round of the Pac-12. They could play up to nine games if they make it to the final in both, but will likely end up playing somewhere around six or seven games. Still, this gives Plum plenty of time to break the record and I predict she’ll surpass it by 100 or 150 points.

I plotted the graph below using R to show Plum’s chase for the record.

Testing Tony Kornheiser’s Football (Soccer) Population Theory

Fans of the daily ESPN show Pardon the Interruption (PTI) will be familiar with co-host Tony Kornheiser’s frequent “Population Theory.” The theory has a few formulations: sometimes it is asserted that when two countries compete in international football the country with the larger population will win, while at other times it’s stated that the more populous country should win.

The “Population Theory” sometimes also incorporates the resources of the country. So, for example, Kornheiser recently stated that the United States should be performing better in international football both because the country has a large population, but also because it has spent a large sum of money on its football infrastructure.

I decided to test this theory by creating a dataset that combines football scores from SoccerLotto.com with population and per capita GDP data from various sources. Because of the SoccerLotto.com formatting, the page wasn’t easily scraped by R or copied and pasted into Excel, so a fair amount of manual work was involved. Here’s a picture of me doing that manual work to break up this text 🙂

The dataset included 537 international football games that took place between 30 June 2015 and 27 June 2016; the most recent game in the dataset was the shocking Iceland upset over England. The population and per capita GDP figures came from whatever source was available, and because official government statistics are not collected annually the exact year differs by country. I’ve uploaded the data into a public Dropbox folder here. Feel free to use it. R code is provided below.

Per capita GDP is perhaps the most readily available proxy for national football resources, though admittedly it’s imperfect. Football is immensely popular globally and so many poor countries may spend disproportionately large sums on developing their football programs. A more useful statistic might be average age of first football participation, but as of yet I don’t have access to this type of data.

Results

So how does Kornheiser’s theory hold up to the data? Well, Kornheiser is right…but just barely. Over the past year the more populous country has won 51.6% of the time. So if you have to guess the outcome of an international football match and all you’re given is the population of the two countries involved then you should indeed bet on the more populous country.
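A quick way to gauge whether that 51.6% is distinguishable from coin-flipping is a two-sided binomial test. The win count of 277 below is backed out from the reported 51.6% of 537 games, so treat it as approximate:

```r
# Two-sided binomial test of the observed win rate against 50%
# 277 wins out of 537 games is inferred from the reported 51.6%
res <- binom.test(x = 277, n = 537, p = 0.5)
res$p.value
```

The p-value comes out far above conventional significance thresholds, so a single year of games cannot rule out pure chance.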

Of the 537 games, 81 occurred on a neutral field. More populous countries fared poorly on neutral fields, winning only 43.2% of the time, while at home the more populous country won 53.1% of its matches.

Richer countries fared even worse, losing more than half their games (53.8%). Both at home and at neutral fields they also fared poorly (winning only 45.8% and 48.1% of their matches respectively).

The best predictor of international football matches (at least in the data I had available) was whether the team was playing at home: home teams won 60.1% of the time.

To look more closely at population and winning, I plotted teams that had played three or more international matches in the past year against their population. There were 410 total games that met this criterion. I also plotted a linear trend line in red which, as the figures above suggest, slopes upward ever so slightly.

Although 537 games is a lot, it’s only a single year’s worth of data. This year may have been an anomaly, so I’m working on collecting a larger set of data. As the chart above suggests, many countries have a population of around 100 million or less, so it would perhaps be surprising if countries with a few million more or fewer people had significantly different outcomes in their matches. But we can test this too…

When two countries whose population difference is less than 1 million play one another, the more populous country actually loses 55.9% of the time. When two countries are separated by fewer than 5 million people, the more populous country wins slightly more often than random chance, with a winning percentage of 52.1%. But large population differences (greater than 50 million inhabitants) do not translate into more victories: the more populous country wins just 51.2% of the time. So, perhaps surprisingly, the small sample of data I have suggests that population differences matter more when the differences are smaller (though of course this could be spurious).

This can be further seen below in a slightly different view of the chart above that exchanges the axes and limits teams to those countries with less than 100 million people.

R code provided below:

###################################################################################################
# James McCammon
# International Football and Population Analysis
# 7/1/2016
# Version 1.0
###################################################################################################

# Import Data
setwd("~/Soccer Data")
# soccer_data and population_data are assumed to be read in here
# (e.g. via read.csv); the original filenames are not shown

################################################################################################
# Calculate summary data
################################################################################################
# Subset home field and neutral field games
neutral_field = subset(soccer_data, Neutral=='Yes')
home_field = subset(soccer_data, Neutral=='No')

# Calculate % that larger country won
(sum(soccer_data[['Bigger.Country.Won']])/nrow(soccer_data)) * 100
# What about at a neutral field?
(sum(neutral_field[['Bigger.Country.Won']])/nrow(neutral_field)) * 100
# What about at a home field?
(sum(home_field[['Bigger.Country.Won']])/nrow(home_field)) * 100

# Calculate % that richer country won
(sum(soccer_data[['Richer.Country.Won']])/nrow(soccer_data)) * 100
# What about at a neutral field?
(sum(neutral_field[['Richer.Country.Won']])/nrow(neutral_field)) * 100
# What about at a home field?
(sum(home_field[['Richer.Country.Won']])/nrow(home_field)) * 100

# Calculate % that the home team won
home_field_winner = subset(home_field, !is.na(Winner))
(sum(home_field_winner[['Home.Team']] == home_field_winner[['Winner']])/nrow(home_field_winner)) * 100

# Calculate % that larger country won when pop diff is less than 1 million
ultra_small_pop_diff_matches = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 1000000)
(sum(ultra_small_pop_diff_matches[['Bigger.Country.Won']])/nrow(ultra_small_pop_diff_matches)) * 100
# Calculate % that larger country won when pop diff is less than 5 million
small_pop_diff_matches = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 5000000)
(sum(small_pop_diff_matches[['Bigger.Country.Won']])/nrow(small_pop_diff_matches)) * 100
# Calculate % that larger country won when pop diff is larger than 50 million
big_pop_diff_matches = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) > 50000000)
(sum(big_pop_diff_matches[['Bigger.Country.Won']])/nrow(big_pop_diff_matches)) * 100

################################################################################################
# Chart winning percentage vs. population
################################################################################################
library(dplyr)
library(reshape2)

base_data =
  soccer_data %>%
  filter(!is.na(Winner)) %>%
  select(Home.Team, Away.Team, Winner) %>%
  melt(id.vars = c('Winner'), value.name='Team')

games_played =
  base_data %>%
  group_by(Team) %>%
  summarize(Games.Played = n())

games_won =
  base_data %>%
  mutate(Result = ifelse(Team == Winner, 1, 0)) %>%
  group_by(Team) %>%
  summarise(Games.Won = sum(Result))

team_results =
  merge(games_won, games_played, by='Team') %>%
  filter(Games.Played > 2) %>%
  mutate(Win.Perct = Games.Won/Games.Played)

team_results = merge(team_results, population_data, by='Team')

# Plot all countries
library(ggplot2)
library(ggthemes)
ggplot(team_results, aes(x=Win.Perct, y=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_y_continuous(labels = scales::comma) +
  xlab('Winning Percentage') +
  ylab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015',
    atop(italic('Teams With Three or More Games Played (410 Total Games)'), ""))))
ggsave('population_vs_winning_perct.png')

# Plot countries smaller than 100 million
ggplot(subset(team_results, Population<100000000), aes(y=Win.Perct, x=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_x_continuous(labels = scales::comma) +
  ylab('Winning Percentage') +
  xlab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015',
    atop(italic('Excluding Countries with a Population Greater than 100 Million'), ""))))
ggsave('population_vs_winning_perct_smaller.png')


Splines!

Splines are a popular way to fit data, allowing a good fit when the data has some sort of global curvature (see the graph below for an example). A common choice is the cubic spline, which fits a cubic polynomial within each segment between so-called “knots.” So for the graph below you might have a knot at age 0, another at age 50, yet another at age 100, and so on. Within each segment of the data defined by these knots you fit a cubic polynomial, and at the knot locations where the different polynomials meet you require them to join together smoothly (this is done by enforcing that the first and second derivatives be equal at the knot).
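Concretely, if $s_i$ and $s_{i+1}$ are the cubic pieces that meet at knot $t_i$, the smoothness requirement amounts to matching values and the first two derivatives:

$s_i(t_i) = s_{i+1}(t_i), \quad s_i'(t_i) = s_{i+1}'(t_i), \quad s_i''(t_i) = s_{i+1}''(t_i)$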

Below a spline with 20 evenly spaced knot locations is shown for some dummy data. I’ve also graphed a 95% confidence interval of the spline predictions in dotted red.

Here’s the code I used to produce it.

# Load libraries
library(reshape)
library(mgcv)
library(ggplot2)
library(data.table)

# The example data (with columns age and strontium.ratio) is assumed
# to be loaded into a data frame called `data` before this point
attach(data)

# Create 20 evenly spaced knots
splitDistance <- (max(age) - min(age))/19
knots <- seq(min(age), max(age), by = splitDistance)

# Fit a penalized cubic regression spline with 20 evenly spaced
# knots using cross validation
model <- gam(strontium.ratio ~ s(age, k=20, fx=F, bs="cr", m=2),
             data=data, family=gaussian, sp=.45, knots=list(age=knots))
# Predict points to plot data
plotPoints <- seq(min(age), max(age), by = .05)
yHat <- predict.gam(model, list(age = plotPoints), se.fit = TRUE)

# Create 95% confidence interval
upper95 <- yHat[[1]] + 1.96*yHat[[2]]
lower95 <- yHat[[1]] - 1.96*yHat[[2]]

# Prepare data for plotting
spline <- as.data.frame(cbind(plotPoints,yHat[[1]]))
setnames(spline,"V2","value")
CI <- as.data.frame(cbind(plotPoints,upper95,lower95))
CI <- melt(CI, id = "plotPoints",
variable_name = "Confidence.Interval")

# Plot data
ggplot() +
  geom_point(data = data, aes(x = age, y = strontium.ratio)) +
  geom_line(data = spline, aes(x = plotPoints, y = value),
            colour = '#3399ff') +
  geom_line(data = CI, aes(x = plotPoints, y = value,
                           group = Confidence.Interval),
            linetype = "dotted", colour = '#ff00ff') +
  ggtitle("Cubic Spline") +
  xlab("Age") +
  ylab("Strontium Ratio")


To show the effect of knot placement we can generate 20 random knot locations for each of several models and compare them. Since these are so-called natural splines, linearity is enforced beyond the boundary knots. Here, Model 3 had its first knot generated quite far to the right on the Age axis, so its fit is linear to the left of that point.

Again, here’s the code.

# Create 10 models with random knot placement
set.seed(35)
models <- as.data.frame(lapply(1:10, FUN = function(i) {
  knots <- sort(runif(20, min(age), max(age)))
  model <- gam(strontium.ratio ~ s(age, k=20, fx=F, bs="cr", m=2),
               data=data, family=gaussian, method="GCV.Cp",
               knots=list(age=knots))
  return(predict(model, list(age = plotPoints)))
}))

colnames(models) <- paste("Model", 1:10)

# Plot splines
models <- cbind(plotPoints, models)
models <- melt(models, id = "plotPoints", variable_name = "Splines")

ggplot() +
  geom_point(data = data, aes(x = age, y = strontium.ratio)) +
  geom_line(data = models, aes(x = plotPoints, y = value,
                               group = Splines, colour = Splines)) +
  ggtitle("A Plot of 10 Splines") +
  xlab("Age") +
  ylab("Strontium Ratio")


After researching online basketball data in more depth I found that RealGM had so-called “split” data for college players. Players statistics are sliced in various ways such as performance against Top 25 teams.

My original collection process scraped statistics from every college player, which was quite inefficient: it covered approximately 20,000 player-seasons worth of data and caused problems during the merge since so many players shared names. It also didn’t allow collection of the “split” data, since that is housed on each player’s individual page instead of on the “All College Player Stats” page.

It was quite challenging to figure out how to scrape the RealGM site. The page structure was predictable aside from a unique id number for every player, which I assume comes from some sort of internal database on the RealGM site. These ids range in length from two to five digits, and I could find no way to predict them. For instance, Carmelo Anthony’s player page link is below; his player id is 452.

http://basketball.realgm.com/player/Carmelo-Anthony/NCAA/452/2014/By_Split/
Advanced_Stats/Quality_Of_Opp

After a fair bit of thrashing about I finally hit on a solution: write an R script that googles the first portion of the player’s page link, reads the Google page source, finds the player’s site address using regular expressions, and then appends their id to the rest of the structured web address.

https://www.google.com/search?q=realgm.com/player/Carmelo-Anthony

The specificity of the search ensures that the RealGM link appears on the first page of search results (it was the first result in every test scenario I tried). The script then uses the following regular expression when searching the Google results page source:

realgm.com/player/Carmelo-Anthony/(Summary|News|\u2026)/[0-9]+

The link to a player’s main page always contains the player’s name followed by “/Summary/id”, but “/News/id” and “/…/id” also appeared. After the script locates and reads this link it’s easy enough to strip out the player id and insert it into the player’s page link that leads to the advanced college data I was looking for.
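The extraction step itself is only a few lines. Here is a self-contained sketch on a made-up page-source fragment (the href string below is illustrative, not a real Google result):

```r
# Hypothetical snippet of Google page source containing the RealGM link
content <- 'href="http://basketball.realgm.com/player/Carmelo-Anthony/Summary/452"'

# Locate the structured link, pull out the match, then keep only the digits
m  <- regexpr("realgm.com/player/Carmelo-Anthony/(Summary|News)/[0-9]+", content)
id <- substr(content, m, m + attr(m, "match.length") - 1)
id <- gsub("[^0-9]+", "", id)
id  # "452"
```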

library(XML)
library(RCurl)
library(data.table)

# Read in players and convert names to proper format
players.DF <- read.csv(file="~/.../Combined Data/Combined Data 1.csv")
players <- as.character(players.DF$Player)
players <- gsub("\\.", "", players)
players <- gsub(" ", "-", players)

# Initialize dataframes and vectors
missedPlayers <- NULL
playerLinks <- rep(NA, length(players))
playerLinks <- data.frame(players.DF$Player, playerLinks)

# Create link for each player
for(i in 1:length(players)) {
  # Google the structured portion of the player's RealGM link
  url <- paste0("https://www.google.com/search?q=realgm.com/player/", players[i])
  result <- try(content <- getURLContent(url))
  if(class(result) == "try-error") { next; }
  id <- regexpr(paste0("realgm.com/player/", players[i],
                       "/(Summary|News|\u2026)", "/[0-9]+"), content)
  id <- substr(content, id, id + attr(id, "match.length"))
  id <- gsub("[^0-9]+", "", id)
  # Insert the id into the structured link to the advanced stats page
  playerLinks[i, 2] <- paste0("http://basketball.realgm.com/player/", players[i],
                              "/NCAA/", id, "/2014/By_Split/Advanced_Stats/Quality_Of_Opp")
}

setnames(playerLinks, c("players.DF.Player","playerLinks"), c("Players","Links"))

Some sites have started to detect and try to prevent web scraping; on iteration 967 Google began blocking my search requests. However, I simply reran the script the next morning from iteration 967 onward to pick up the missing players.

I then used the fact that a missing id results in a page link with “NCAA//” to search for players that were still missing their ids.

> pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]

After examining the players I noticed many of these had apostrophes in their name, which I had forgotten to account for in my original name formatting.

I adjusted my procedure and reran the script to get the pickups.

pickups <- playerLinks[which(grepl("NCAA//",playerLinks[[2]])),]
pickups <- pickups[[1]]
pickups <- gsub("'","",pickups)
pickups <- gsub(" ","-",pickups)

# pickupNums is assumed to hold the row indices of the missed players
for(i in 1:length(pickupNums)) {
  j <- pickupNums[i]
  url <- paste0("https://www.google.com/search?q=realgm.com/player/", pickups[i])
  result <- try(content <- getURLContent(url))
  if(class(result) == "try-error") { next; }
  id <- regexpr(paste0("realgm.com/player/", pickups[i],
                       "/(Summary|News|\u2026)", "/[0-9]+"), content)
  id <- substr(content, id, id + attr(id, "match.length"))
  id <- gsub("[^0-9]+", "", id)
  playerLinks[j, 2] <- paste0("http://basketball.realgm.com/player/", pickups[i],
                              "/NCAA/", id, "/2014/By_Split/Advanced_Stats/Quality_Of_Opp")
}

After rerunning the script three players were still missing ids, so I entered these manually.

playerLinks[[370,2]]  <- "http://basketball.realgm.com/player/Eric-Gordon/NCAA/762/2014/By_Split/Advanced_Stats/Quality_Of_Opp"
playerLinks[[1010,2]] <- "http://basketball.realgm.com/player/Slavko-Vranes/NCAA/472/2014/By_Split/Advanced_Stats/Quality_Of_Opp"

I also needed to manually check the three duplicate players and adjust their ids accordingly.

The final result looks like this:

The next step will be to cycle through the links and use readHTMLTable() to get the advanced statistics.


While I was looking around at basketball data during the course of the project I saw that Basketball-Reference.com had a few pieces of data I wanted to pick up: a player’s shooting arm (right or left) and their high school ranking. The site is also packed with a ton of other data I may use in the future such as a player’s shooting percentage from different distances from the basket. So I thought it would be good to create a procedure to scrape it.

The site uses a structured website address that makes it easy to scrape: http://www.basketball-reference.com/players/ + the first letter of the player’s last name + the first five letters of the last name (or the whole last name if it is shorter than five letters) + the first two letters of the first name + a page number (usually 1, but sometimes 2 when more than one player shares a name). For instance, http://www.basketball-reference.com/players/a/anthoca01.html.
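That naming rule is easy to encode. Here is a sketch of a helper that builds the page-1 link from a “First Last” name (bref_url is my own hypothetical name; it assumes page number 01 and does not strip punctuation such as apostrophes):

```r
# Build a Basketball-Reference player URL from a "First Last" name.
# bref_url() is a hypothetical helper illustrating the naming rule above.
bref_url <- function(name) {
  parts <- strsplit(tolower(name), " ")[[1]]
  first <- substr(parts[1], 1, 2)              # first two letters of first name
  last  <- substr(parts[length(parts)], 1, 5)  # first five letters of last name
  paste0("http://www.basketball-reference.com/players/",
         substr(last, 1, 1), "/", last, first, "01.html")
}

bref_url("Carmelo Anthony")
# "http://www.basketball-reference.com/players/a/anthoca01.html"
```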

R reads the page source and again the site uses a structured page profile:

I first used grep to locate the lines of the page source that contained “Shoots:” and “Recruiting Rank:”, and then used regular expressions to strip the information out. Not all players have both (or either) piece of information, so I used a try() wrapper so the code could proceed past errors resulting from no match to the regular expressions.

library(stringr)

# Read in master player list
players.DF <- read.csv(file="~/.../All Drafted Players 2013-2003.csv")
allPlayers <- players.DF[,3]

# Convert names to proper format
allPlayers <- str_replace_all(allPlayers, "[[:punct:]]", "")
allPlayers <- tolower(allPlayers)
first <- str_extract(allPlayers,"^[^ ]+")
first <- substring(first,1,2)
last <- str_extract(allPlayers,"[^ ]+$")
last <- substring(last,1,5)
letter <- substring(last,1,1)

shootsVector <- rep(NA,length(allPlayers))
recruitVector <- rep(NA,length(allPlayers))

# Scrape the site and record shooting arm and HS ranking
for(i in 1:20) {
page <- read.csv(paste0(
'http://www.basketball-reference.com/players/',letter[i],'/',last[i],first[i],'01.html'))

line <- grep("[Ss]hoots:(.*)Right|Left", page[,], value = FALSE, perl = TRUE)
index <- regexpr("[Rr]ight|[Ll]eft",page[line,])
shoots <- substr(page[line,], index, index + attr(index,"match.length") - 1)
result <- try(shootsVector[i] <- shoots)
if(class(result) == "try-error") { next; }

line <- grep("Recruiting Rank:(.*)([0-9]+)", page[,], value = FALSE, perl = TRUE)
index <- regexpr("\\([0-9]+\\)",page[line,])
recruit <- substr(page[line,], index + 1, index + attr(index,"match.length") - 2)
result <- try(recruitVector[i] <- recruit)
if(class(result) == "try-error") { next; }

print(shoots)
print(recruit)
}

# Combine information
players.DF <- cbind(players.DF, shootsVector,recruitVector)
setnames(players.DF,c("shootsVector","recruitVector"),c("Shooting Arm","HS Ranking"))
write.csv(players.DF,file="~/...Combined Data/Combined Data 1.csv")

The procedure is vulnerable to duplicate names. There are ways to deal with that in code; one would be to also read the college from the page source and use it to pick out the right player. In this case, however, a duplicates report turned up only 3 duplicates.

> which(duplicated(allPlayers))
[1]  715  732 1118
> allPlayers[715]
[1] "tony mitchell"
> allPlayers[732]
[1] "chris wright"
> allPlayers[1118]
[1] "jamar smith"

For that reason, it was much easier to just do a manual search on the 6 players involved and update their data. I chose to do this in Excel: using the highlight-duplicates feature, I could easily scroll down, find the 3 duplicate players, and change their shooting arm and HS ranking as necessary.


One piece of data I wanted for my statistical analysis was the quality of the college a player attended. I chose to measure college quality by the number of weeks a team was in the Associated Press (AP) Top 25 college basketball rankings. Note that I used only regular-season rankings, not pre- or post-season rankings, which are not available for all years. Historic rankings dating back to the 2002-2003 season are available on the ESPN website. However, when scraping ESPN's webpage I found the data was only semi-structured.

The code to read in the college name must be robust enough to ignore all the possible characters following the college name, but flexible enough to detect "exotic" college names like "Texas A&M" and "St. John's." The code first reads in each week's rankings and strips out the college name, then binds the weeks together. If a season has fewer than 18 weeks, NAs are introduced so that every season is the same length and can be bound together. The college quality is then calculated for each season. Finally, the weekly rankings for every season are bound into a single table and saved, as is the college quality for every season. The code is shown below.
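In miniature, the padding and percentage logic works like this (a toy example with made-up teams and a 4-week cap standing in for the real 18):

```r
# Toy season: two weeks of (tiny) rankings
week1 <- c("Kentucky", "Duke")
week2 <- c("Kentucky", "Louisville")
season <- cbind(week1, week2)

# Pad short seasons with NA columns out to a fixed width so every
# season can be bound together later
maxWeeks <- 4
while(ncol(season) < maxWeeks) {
  season <- cbind(season, rep(NA, nrow(season)))
}

# Share of weeks each team appears; table() drops the NA padding
counts <- table(unlist(season))
percentages <- round(counts / 2 * 100, 2)  # 2 real weeks this season
```

Here Kentucky ends up at 100 and the other two teams at 50, mirroring the "Top 25 %" calculation in the full script.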

library(XML)
library(data.table)

# Initialize variables
seasons <- seq(2013,2003,by=-1)
allSeasonRankings <- NULL
allSeasonTable <- NULL
missedPages <- matrix(ncol=2,nrow=1)
colnames(missedPages) <- c("Season","Week")
k <- 1

# Web scrape
# Iterate over each week in each season
for(j in 1:length(seasons)) {
numWeeks <- 0
seasonRanking <- NULL
week <- NULL

for (i in 2:19)
{
# espnURL is a placeholder for the base of ESPN's weekly AP rankings
# address; the season, week, and season type are appended to it
result <- try(week <- readHTMLTable(paste0(espnURL,
seasons[j], '/week/', i ,'/seasontype/2'),skip.rows=c(1,2))[[1]][,2])

if(class(result) == "try-error") { missedPages <- rbind(missedPages, c(j,i)); next; }

# Print the page being scraped for monitoring
print(paste0(espnURL,
seasons[j], '/week/', i ,'/seasontype/2'))

numWeeks <- numWeeks + 1
week <- as.data.frame(array(BegString(week)))
seasonRanking <- cbind(seasonRanking,week[[1]])
colnames(seasonRanking)[numWeeks] <- paste("Week",numWeeks)
}
# Ensure that all seasons have 18 weeks
# (the maximum number of weeks in a season since 2003)
# so that all seasons have the same length and can easily be bound together
while(numWeeks < 18) {
numWeeks <- numWeeks + 1
extra <- rep(NA,25)
seasonRanking <- cbind(seasonRanking,extra)
colnames(seasonRanking)[numWeeks]  <- paste("Week",numWeeks)
}

# Bind seasons together
allSeasonRankings <- rbind(allSeasonRankings, seasonRanking)

# Calculate the percentage of weeks each school was in the AP Top 25
seasonTable <- as.data.frame(table(unlist(seasonRanking)))
percentages <- round((seasonTable[2]/numWeeks)*100,2)

# Change column name to "Top 25 %" immediately. Otherwise percentages will
# inherit the name "Freq" from the table function and not allow use of setnames()
# since 2 columns have the same name
colnames(percentages)[1] <- "Top 25 %"
seasonTable <- cbind(seasonTable, percentages)
seasonTable <- cbind(seasonTable, rep(seasons[j],length(seasonTable[1])))
allSeasonTable <- rbind(allSeasonTable,seasonTable)
}

# Clean up names
setnames(allSeasonTable,c("Var1", "rep(seasons[j], length(seasonTable[1]))"),
c("Team", "Season"))

rankingYear <- rep(seasons, each=25)

# Combine data and cleanup names
allSeasonRankings <- cbind(rankingYear,allSeasonRankings)
allSeasonRankings <- as.data.frame(allSeasonRankings)
setnames(allSeasonRankings,"rankingYear", "Season")

# Save files
write.csv(allSeasonRankings,file="~/.../College Quality/Season Rankings.csv")
write.csv(allSeasonTable,file="~/.../College Quality/Percent Time in Top 25.csv")

The above code uses two custom functions to strip out the college name. One strips out the college name, and the second removes the trailing whitespace that sometimes remains. There are a lot of different ways to do this. The most efficient is probably to use the functionality of the stringr package (such as str_extract()), but I wrote these functions when I was less aware of all of stringr's functionality.
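For comparison, the whole two-function pipeline collapses to a one-liner; here is a sketch using base R's regmatches() and trimws() (stringr's str_extract() plus str_trim() would read almost identically; begString2 is a hypothetical name):

```r
# Sketch: extract the leading run of letters, spaces, periods,
# apostrophes, and ampersands, then trim any trailing spaces
begString2 <- function(x) {
  trimws(regmatches(x, regexpr("^[a-zA-Z .'&]+", x)), which = "right")
}

begString2("Texas A&M (12-0)")
# "Texas A&M"
```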

# Returns first string containing only letters, spaces, and the ' and & symbols
BegString <- function(x) {
exp <- regexpr("^[a-zA-Z| |.|'|&]+",x)
stringList <- substr(x,1,attr(exp,"match.length"))
stringList <- removeTrailSpace(stringList)
return(stringList)
}
# Removes trailing whitespace of a string
removeTrailSpace <- function(stringList) {

whiteSpaceIndex <- regexpr(" +$",stringList)
whiteSpaceSize <- attr(whiteSpaceIndex,"match.length")
for(k in 1:length(stringList)) {
if(whiteSpaceSize[k] > 0) {
stringList[k] <- substr(stringList[k],1,whiteSpaceIndex[k]-1)
}
}
stringList
}

The weekly ranking table ends up looking like this:

This table is saved purely for reference since all of the meat is in the college quality calculation. College quality is shown below. Again, I kept the "Freq" column in for reference so that I could randomly verify a few observations and make sure the code worked properly. As you can see, 43 different teams spent at least one week in the AP Top 25 rankings during 2013.

Now that I have this data I can merge it with the master list of players using the school name and season as keys.

Basketball Project

As part of a graduate applied regression course I took, we were required to create and present a research question. The top third of the questions were each assigned three students, and these groups worked on the project for the last seven weeks of class. I proposed examining the relationship between early-career NBA performance and a variety of pre-NBA player attributes.

NBA performance was to be measured using the so-called Player Efficiency Rating created by John Hollinger (usually denoted simply "PER"). The PER attempts to combine all of a player's on-court statistics into a single number, with the NBA average set to 15 every season. The pre-NBA player profile included a variety of advanced statistics measuring shooting, rebounding, steals, assists, and blocks. For some players NBA combine data was also available; it consisted of a variety of body measurements and results from athletic skills tests (such as the standing vertical leap). My team and I worked throughout the quarter and presented our results last week at the class poster presentation.
However, I wanted to redo the project on my own time, with better data and full control over the data analysis (rather than having to split up the work among three people). Since this is the second time around, I'm much smarter about how to cull, clean, and merge the data efficiently.

The first step is to get a master list of players. I'm choosing to use RealGM Basketball's draft data. It includes both drafted and undrafted players who played in the NBA (or D-League) dating back to 1978. The procedure I used (shown below) works for the modern two-round draft, which started in 1989. However, since college data is only available from the 2002-2003 season, I only went as far back as the 2003 NBA draft.

This dataset includes draft age, an obvious proxy for the age at which a player began his on-court NBA career, something missing from our original dataset. It includes country of birth as well, which would allow a test of the common assertion that foreign players are better shooters. Importantly, this dataset also includes a player's college name in a format that matches the Associated Press (AP) Top 25 rankings available on ESPN's website. For instance, depending on the data source, the University of Kentucky is sometimes written as "University of Kentucky", elsewhere simply as "Kentucky", and occasionally as "UK" (ESPN's site uses the variant "Kentucky"). I've learned that thinking carefully beforehand about how to merge data saves a lot of pain later.

Controlling for the quality of a player's college basketball program was an unfortunate omission from the original analysis. Because it embodies both the quality of coaching a player received and the toughness of the competition he faced, its omission may have been a cause of omitted variable bias. For this measure I've decided on the percentage of the season a team was in the AP Top 25 rankings.

To get this master player list I used R's XML package to scrape the RealGM site.
I used try() in conjunction with readHTMLTable(), since otherwise my intermittent internet connection (or other unexpected problems) would cause the for() loop to stop completely. If try() encounters an error I log the page so I can examine it later and pick up any missing data.

After the scrape I examined the data and had to do some simple cleaning. Drafted and undrafted players have slightly different data available, so I had to introduce some NAs for the undrafted players before I could combine the data frames. I also had to convert the columns from factors to characters or numerics, depending on their values. Height, which arrives in feet-inches format (e.g. 6-10), needed to be converted to a pure numeric value (I used height in inches). And a few columns had extra characters that needed to be removed.

To convert height I wrote a custom function (shown below). I could have used the stringr function str_extract() instead of regexpr() and substr(), but for variety (and practice) I went with the less efficient two-line approach. In general, the length of my code could be substantially reduced, but at the cost of readability for others (as well as for myself when I revisit the code in the future).

convertHeight <- function(x) {
feet <- substr(x,1,1)
inches <- regexpr("[0-9]+$",x)
inches <- substr(x, inches, inches + attr(inches,"match.length"))
height <- as.numeric(feet)*12 + as.numeric(inches)
return(height)
}
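The shorter alternative mentioned above amounts to a couple of one-liners; a sketch with base R's sub() (convertHeight2 is a hypothetical name, not the project's function):

```r
# Sketch: pull feet and inches out of a "feet-inches" string in one
# substitution each, then convert to total inches
convertHeight2 <- function(x) {
  feet   <- as.numeric(sub("-.*$", "", x))  # digits before the dash
  inches <- as.numeric(sub("^.*-", "", x))  # digits after the dash
  feet * 12 + inches
}

convertHeight2("6-10")
# 82
```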

Everything went smoothly aside from a warning that NA’s were introduced by coercion when converting “Weight” to numeric. After a quick search it turns out this was only a problem for a single player, number 1073.

> which(is.na(allPlayers[,5]) == TRUE)
[1] 1073

Player 1073 turns out to be Donell Williams from Fayetteville State who went undrafted in 2005 and later played a season in the D-league. I went back to RealGM’s site and confirmed that his weight was indeed marked as “N/A” in the source data.

The next steps will be to merge in the college quality data (from ESPN), a few additional pieces of data I scraped from Basketball-Reference (such as the shooting hand a player uses), all of the NBA combine data (from DraftExpress), and the players' college and NBA statistics (from RealGM and Basketball-Reference). Each piece of data requires its own web scraping and cleaning, which I'll take up in future posts.
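The eventual merge on school name and season can be sketched with toy data (the frames and column values here are made up; merge() with all.x = TRUE keeps players whose school never appears in the quality table):

```r
# Toy stand-ins for the master player list and the college-quality table
players <- data.frame(Player = c("A", "B"),
                      School = c("Kentucky", "Duke"),
                      Season = c(2013, 2012))
quality <- data.frame(School = c("Kentucky", "Duke"),
                      Season = c(2013, 2012),
                      `Top 25 %` = c(100, 88.89),
                      check.names = FALSE)

# Join on the shared keys; all.x = TRUE would leave NA quality for any
# player whose school never cracked the Top 25
merged <- merge(players, quality, by = c("School", "Season"), all.x = TRUE)
```

This is why having the college name in a format that matches ESPN's matters: merge() matches key values exactly, so "Kentucky" and "UK" would not join.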

# Load necessary libraries
library(XML)
library(data.table)
library(stringr)

# Initialize variables
round1 <- NULL
round2 <- NULL
drafted <- NULL
undrafted <- NULL
allDraftedPlayers <- NULL
allUndraftedPlayers <- NULL
missedPages <- NULL
seasons <- seq(2013,2003,by=-1)

# Get draft info for drafted and undrafted players
for(i in 1:length(seasons))
{
# realgmURL is a placeholder for RealGM's past-drafts address, to
# which the draft season is appended
result <- try(page <- readHTMLTable(paste0(realgmURL, seasons[i])))
if(class(result) == "try-error") { missedPages <- rbind(missedPages,seasons[i]); next; }

round1 <- page[[3]]
round2 <- page[[4]]
drafted <- rbind(round1,round2)
undrafted <- page[[5]]

# Print data for monitoring

# Add draft year and combine data
draftYear <- rep(seasons[i], dim(drafted)[1])
drafted <- cbind(drafted,draftYear)
allDraftedPlayers <- rbind(allDraftedPlayers,drafted)
draftYear <- rep(seasons[i], dim(undrafted)[1])
undrafted <- cbind(undrafted,draftYear)
allUndraftedPlayers <- rbind(allUndraftedPlayers, undrafted)
}

# Drop unused columns
allDraftedPlayers <- allDraftedPlayers[,-c(9,11:12)]
allUndraftedPlayers <- allUndraftedPlayers[,-c(8:9)]

# Add NAs to undrafted players as necessary
length <- length(allUndraftedPlayers[[1]])
allUndraftedPlayers <- cbind(rep(NA, length),allUndraftedPlayers[,c(1:7)],rep(NA,length),
allUndraftedPlayers[,c(8:9)])

# Unify names so rbind can combine datasets
colnames(allUndraftedPlayers)[1] <- "Pick"
allPlayers <- rbind(allDraftedPlayers,allUndraftedPlayers)

# Cleanup column names

# Convert columns from factors to character and numeric as necessary
allPlayers[,-3] <- data.frame(lapply(allPlayers[,-3], as.character),
stringsAsFactors=FALSE)
allPlayers[,c(5,8)] <- data.frame(lapply(allPlayers[,c(5,8)], as.numeric),
stringsAsFactors=FALSE)

# Flag traded picks: 1 if the trade column (column 9) contains any
# text, 0 otherwise
traded <- allPlayers[[9]]
allPlayers[which(regexpr("[a-zA-Z]+",traded) != -1), 9] <- 1
allPlayers[which(allPlayers[[9]] != 1), 9] <- 0

# Get rid of extra characters in class (mostly asterisks)
allPlayers[[7]] <- str_extract(allPlayers[[7]],"[a-zA-Z]+")
allPlayers[[7]] <- gsub("DOB",NA,allPlayers[[7]])

# Convert height to inches from feet-inches format
allPlayers[[4]] <- convertHeight(allPlayers[[4]])

# Function for converting height
convertHeight <- function(x) {
feet <- substr(x,1,1)
inches <- regexpr("[0-9]+$",x)
inches <- substr(x, inches, inches + attr(inches,"match.length"))
height <- as.numeric(feet)*12 + as.numeric(inches)
return(height)
}

write.csv(allPlayers,file="~/.../Draft Info/All Drafted Players 2013-2003.csv")


The result is to take this:

And transform it into this: