Testing Tony Kornheiser’s Football (Soccer) Population Theory

Fans of the daily ESPN show Pardon the Interruption (PTI) will be familiar with the co-host’s frequent “Population Theory.” The theory has a few formulations; it is sometimes asserted that when two countries compete in international football the country with the larger population will win, while at other times it’s stated that the more populous country should win.

The “Population Theory” sometimes also incorporates the resources of the country. So, for example, Kornheiser recently stated that the United States should be performing better in international football both because the country has a large population, but also because it has spent a large sum of money on its football infrastructure.

I decided to test this theory by creating a dataset that combines football scores from SoccerLotto.com with population and per capita GDP data from various sources. Because of the SoccerLott.com formatting the page wasn’t easily scraped by R or copied and pasted into Excel, so a fair amount of manual work was involved. Here’s a picture of me doing that manual work to breakup this text 🙂

IMG_4265

The dataset included 537 international football games that took place between 30 June 2015 and 27 June 2016. The most recent game in the dataset was the shocking Iceland upset over England. The population and per capita GDP data used whatever source was available. Because official government statistics are not collected annually the exact year differs. I’ve uploaded the data into a public Dropbox folder here. Feel free to use it. R code is provided below.

Per capita GDP is perhaps the most readily available proxy for national football resources, though admittedly it’s imperfect. Football is immensely popular globally and so many poor countries may spend disproportionately large sums on developing their football programs. A more useful statistic might be average age of first football participation, but as of yet I don’t have access to this type of data.

Results

So how does Kornheiser’s theory hold up to the data? Well, Kornheiser is right…but just barely. Over the past year the more populous country has won 51.6% of the time. So if you have to guess the outcome of an international football match and all you’re given is the population of the two countries involved then you should indeed bet on the more populous country.

Of the 537 games, 81 occurred on a neutral field. More populous countries fared poorly on neutral fields, winning only 43.2% of the time. While at home the more populous country won 53.1% of their matches.

Richer countries fared even worse, losing more than half their games (53.8%). Both at home and at neutral fields they also fared poorly (winning only 45.8% and 48.1% of their matches respectively).

The best predictor of international football matches (at least in the data I had available) was whether the team was playing at home: home teams won 60.1% of the time.

To look more closely at population and winning I plotted teams that had played more than three international matches in the past year against their population. There were 410 total games that met this criteria. I also plotted a linear trend line in red, which as the figures above suggest, slopes upward ever so slightly.

population_vs_winning_perct.png

Although 527 games is a lot, it’s only a single year’s worth of data. It may be possible that this year was an anomaly and I’m working on collecting a larger set of data. As the chart above suggests many countries have a population around 100 million or less and so it would perhaps be surprising if countries with a few million more or fewer people had significantly different outcomes in their matches. But we can test this too…

When two countries whose population difference is less than 1 million play against one another the more populous country actually losses 55.9% of the time. When two countries are separated by less than 5 million people the more populous country wins slightly more than random chance with a winning percentage of 52.1%. But large population differences (greater than 50 million inhabitants) does not translate into more victories. They win just 51.2% of the time. So perhaps surprisingly the small sample of data I have suggests that population differences matter more when the differences are smaller (of course this could be spurious).

This can be further seen below in a slightly different view of the chart above that exchanges the axes and limits teams to those countries with less than 100 million people.

population_vs_winning_perct_smaller.png

R code provided below:

###################################################################################################
# James McCammon
# International Football and Population Analysis
# 7/1/2016
# Version 1.0
###################################################################################################
 
# Import Data
setwd("~/Soccer Data")
soccer_data = read.csv('soccer_data.csv', header=TRUE, stringsAsFactors=FALSE)
population_data = read.csv('population.csv', header=TRUE, stringsAsFactors=FALSE)
 
 
################################################################################################
# Calculate summary data
################################################################################################
# Subset home field and neutral field games
nuetral_field = subset(soccer_data, Neutral=='Yes')
home_field = subset(soccer_data, Neutral=='No')
 
# Calculate % that larger country won
(sum(soccer_data[['Bigger.Country.Won']])/nrow(soccer_data)) * 100
# What about at neutral field?
(sum(nuetral_field[['Bigger.Country.Won']])/nrow(nuetral_field)) * 100
# What about at a home field?
(sum(home_field[['Bigger.Country.Won']])/nrow(home_field)) * 100
 
# Calculate % that richer country won
(sum(soccer_data[['Richer.Country.Won']])/nrow(soccer_data)) * 100
# What about at neutral field?
(sum(nuetral_field[['Richer.Country.Won']])/nrow(nuetral_field)) * 100
# What about at a home field?
(sum(home_field[['Richer.Country.Won']])/nrow(home_field)) * 100
 
# Calculate home field advantage
home_field_winner = subset(home_field, !is.na(Winner))
(sum(home_field_winner[['Home.Team']] == home_field_winner[['Winner']])/nrow(home_field_winner)) * 100
 
# Calculate % that larger country won when pop diff is less than 1 million
ulatra_small_pop_diff_mathes = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 1000000)
(sum(ulatra_small_pop_diff_mathes[['Bigger.Country.Won']])/nrow(ulatra_small_pop_diff_mathes)) * 100
#Calculate % that larger country won when pop diff is less than 5 million
small_pop_diff_mathes = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 5000000)
(sum(small_pop_diff_mathes[['Bigger.Country.Won']])/nrow(small_pop_diff_mathes)) * 100
#Calculate % that larger country won when pop diff is larger than 50 million
big_pop_diff_mathes = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) > 50000000)
(sum(big_pop_diff_mathes[['Bigger.Country.Won']])/nrow(big_pop_diff_mathes)) * 100
 
 
################################################################################################
# Chart winning percentage vs. population
################################################################################################
library(dplyr)
library(reshape2)
 
base_data = 
  soccer_data %>%
  filter(!is.na(Winner)) %>%
  select(Home.Team, Away.Team, Winner) %>%
  melt(id.vars = c('Winner'), value.name='Team')
 
games_played = 
  base_data %>%
  group_by(Team) %>%
  summarize(Games.Played = n())
 
games_won = 
  base_data %>%
  mutate(Result = ifelse(Team == Winner,1,0)) %>%
  group_by(Team) %>%
  summarise(Games.Won = sum(Result))
 
team_results = 
  merge(games_won, games_played, by='Team') %>%
  filter(Games.Played > 2) %>%
  mutate(Win.Perct = Games.Won/Games.Played)
 
team_results = merge(team_results, population_data, by='Team')
 
# Plot all countries
library(ggplot2)
library(ggthemes)
ggplot(team_results, aes(x=Win.Perct, y=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_y_continuous(labels = scales::comma) +
  xlab('Winning Percentage') +
  ylab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015', 
                     atop(italic('Teams With Three or More Games Played (410 Total Games)'), ""))))
ggsave('population_vs_winning_perct.png')
 
# Plot countries smaller than 100 million
ggplot(subset(team_results,Population<100000000), aes(y=Win.Perct, x=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_x_continuous(labels = scales::comma) +
  ylab('Winning Percentage') +
  xlab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015', 
                          atop(italic('Excluding Countries with a Population Greater than 100 Million'), ""))))
ggsave('population_vs_winning_perct_smaller.png')

Created by Pretty R at inside-R.org

Splines!

Splines are a popular way to fit data that allow for a good fit when the data has some sort of global curvature (see the graph below for an example). A cubic spline is a popular example. It allows for a cubic polynomial to be fit within some neighborhood between so-called “knots.” So for the graph below you might have a knot at age 0 and another knot at age 50, yet another knot at age 100, and so on. Within each segment of the data defined by these knots you fit a cubic polynomial. You then require that at the knot locations where the different polynomials meet they must join together in a smooth way (this is done by enforcing the first and second derivatives be equal at the knot).

Below a spline with 20 evenly spaced knot locations is shown for some dummy data. I’ve also graphed a 95% confidence interval of the spline predictions in dotted red.

cubic spline

Here’s the code I used to produce it.

# Load libraries
library(reshape)
library(mgcv)
library(ggplot2)
library(data.table)
 
# Load data
setwd("Masked")
data <- read.csv(file = "Masked", header = TRUE)
attach(data)
 
# Create 20 evenly spaced knots
splitDistance <- (max(age) - min(age))/19
knots <- seq(min(age), max(age), by = splitDistance)
 
# Fit a penalized cubic regression spline with 20 evenly spaced 
# knots using cross validation
model <- gam(strontium.ratio ~ s(age, k=20, fx=F, bs="cr", m=2), 
data=data, family=gaussian, sp = .45, knots = list(age = knots))
# Predict points to plot data 
plotPoints <- seq(min(age), max(age), by = .05)
yHat <- predict.gam(model, list(age = plotPoints), se.fit = TRUE)
 
# Create 95% confidence interval
upper95 <- yHat[[1]] + 1.96*yHat[[2]]
lower95 <- yHat[[1]] - 1.96*yHat[[2]]
 
# Prepare data for plotting
spline <- as.data.frame(cbind(plotPoints,yHat[[1]]))
setnames(spline,"V2","value")
CI <- as.data.frame(cbind(plotPoints,upper95,lower95))
CI <- melt(CI, id = "plotPoints", 
variable_name = "Confidence.Interval")
 
# Plot data
ggplot() +
geom_point(data = data, aes(x = age, y = strontium.ratio)) +
geom_line(data=spline, aes(x = plotPoints, y = value), 
colour = '#3399ff') +
geom_line(data=CI, aes(x = plotPoints, y = value, 
group = Confidence.Interval),linetype="dotted",colour = '#ff00ff') +
ggtitle("Cubic Spline") + 
xlab("Age") + 
ylab("Strontium Ratio")

Created by Pretty R at inside-R.org

To show the effect of knot placement we can generate 20 random knot locations for multiple models and compare them. Since these are so-called natural splines linearity is enforced after the last knot. Here, Model 3 had it’s first knot generated quite far on the righthand side of the Age axis and so it’s linear to the left of that point.

10 Splines

Again, here’s the code.

# Create 10 models with random knot placement
set.seed(35)
models <- as.data.frame(lapply(1:10,
                        FUN = function(i) {
                        knots <- sort(runif(20, min(age), 
                        max(age)))
                        model <- gam(strontium.ratio ~ 
                        s(age, k=20, fx=F, bs="cr", m=2), 
                        data=data, family=gaussian, 
                        method="GCV.Cp", 
                        knots = list(age = knots))
                        return(predict(model,
                        list(age = plotPoints)))
                        }))
 
colnames(models) <- lapply(1:10,
                    FUN = function(i) {
                    return(paste("Model",i))
                    })
 
# Plot splines
models <- cbind(plotPoints, models)
models <- melt(models, id = "plotPoints", variable_name = "Splines")
 
ggplot() +
geom_point(data = data, aes(x = age, y = strontium.ratio)) +
geom_line(data = models, aes(x = plotPoints, y = value, 
group = Splines, colour = Splines)) +
ggtitle("A Plot of 10 Splines") + 
xlab("Age") + 
ylab("Strontium Ratio")

Created by Pretty R at inside-R.org