Testing Tony Kornheiser’s Football (Soccer) Population Theory

Fans of the daily ESPN show Pardon the Interruption (PTI) will be familiar with the co-host’s frequent “Population Theory.” The theory has a few formulations; it is sometimes asserted that when two countries compete in international football the country with the larger population will win, while at other times it’s stated that the more populous country should win.

The “Population Theory” sometimes also incorporates the resources of the country. So, for example, Kornheiser recently stated that the United States should be performing better in international football both because the country has a large population, but also because it has spent a large sum of money on its football infrastructure.

I decided to test this theory by creating a dataset that combines football scores from SoccerLotto.com with population and per capita GDP data from various sources. Because of the SoccerLott.com formatting the page wasn’t easily scraped by R or copied and pasted into Excel, so a fair amount of manual work was involved. Here’s a picture of me doing that manual work to breakup this text 🙂

IMG_4265

The dataset included 537 international football games that took place between 30 June 2015 and 27 June 2016. The most recent game in the dataset was the shocking Iceland upset over England. The population and per capita GDP data used whatever source was available. Because official government statistics are not collected annually the exact year differs. I’ve uploaded the data into a public Dropbox folder here. Feel free to use it. R code is provided below.

Per capita GDP is perhaps the most readily available proxy for national football resources, though admittedly it’s imperfect. Football is immensely popular globally and so many poor countries may spend disproportionately large sums on developing their football programs. A more useful statistic might be average age of first football participation, but as of yet I don’t have access to this type of data.

Results

So how does Kornheiser’s theory hold up to the data? Well, Kornheiser is right…but just barely. Over the past year the more populous country has won 51.6% of the time. So if you have to guess the outcome of an international football match and all you’re given is the population of the two countries involved then you should indeed bet on the more populous country.

Of the 537 games, 81 occurred on a neutral field. More populous countries fared poorly on neutral fields, winning only 43.2% of the time. While at home the more populous country won 53.1% of their matches.

Richer countries fared even worse, losing more than half their games (53.8%). Both at home and at neutral fields they also fared poorly (winning only 45.8% and 48.1% of their matches respectively).

The best predictor of international football matches (at least in the data I had available) was whether the team was playing at home: home teams won 60.1% of the time.

To look more closely at population and winning I plotted teams that had played more than three international matches in the past year against their population. There were 410 total games that met this criteria. I also plotted a linear trend line in red, which as the figures above suggest, slopes upward ever so slightly.

population_vs_winning_perct.png

Although 527 games is a lot, it’s only a single year’s worth of data. It may be possible that this year was an anomaly and I’m working on collecting a larger set of data. As the chart above suggests many countries have a population around 100 million or less and so it would perhaps be surprising if countries with a few million more or fewer people had significantly different outcomes in their matches. But we can test this too…

When two countries whose population difference is less than 1 million play against one another the more populous country actually losses 55.9% of the time. When two countries are separated by less than 5 million people the more populous country wins slightly more than random chance with a winning percentage of 52.1%. But large population differences (greater than 50 million inhabitants) does not translate into more victories. They win just 51.2% of the time. So perhaps surprisingly the small sample of data I have suggests that population differences matter more when the differences are smaller (of course this could be spurious).

This can be further seen below in a slightly different view of the chart above that exchanges the axes and limits teams to those countries with less than 100 million people.

population_vs_winning_perct_smaller.png

R code provided below:

###################################################################################################
# James McCammon
# International Football and Population Analysis
# 7/1/2016
# Version 1.0
###################################################################################################
 
# Import Data
setwd("~/Soccer Data")
soccer_data = read.csv('soccer_data.csv', header=TRUE, stringsAsFactors=FALSE)
population_data = read.csv('population.csv', header=TRUE, stringsAsFactors=FALSE)
 
 
################################################################################################
# Calculate summary data
################################################################################################
# Subset home field and neutral field games
nuetral_field = subset(soccer_data, Neutral=='Yes')
home_field = subset(soccer_data, Neutral=='No')
 
# Calculate % that larger country won
(sum(soccer_data[['Bigger.Country.Won']])/nrow(soccer_data)) * 100
# What about at neutral field?
(sum(nuetral_field[['Bigger.Country.Won']])/nrow(nuetral_field)) * 100
# What about at a home field?
(sum(home_field[['Bigger.Country.Won']])/nrow(home_field)) * 100
 
# Calculate % that richer country won
(sum(soccer_data[['Richer.Country.Won']])/nrow(soccer_data)) * 100
# What about at neutral field?
(sum(nuetral_field[['Richer.Country.Won']])/nrow(nuetral_field)) * 100
# What about at a home field?
(sum(home_field[['Richer.Country.Won']])/nrow(home_field)) * 100
 
# Calculate home field advantage
home_field_winner = subset(home_field, !is.na(Winner))
(sum(home_field_winner[['Home.Team']] == home_field_winner[['Winner']])/nrow(home_field_winner)) * 100
 
# Calculate % that larger country won when pop diff is less than 1 million
ulatra_small_pop_diff_mathes = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 1000000)
(sum(ulatra_small_pop_diff_mathes[['Bigger.Country.Won']])/nrow(ulatra_small_pop_diff_mathes)) * 100
#Calculate % that larger country won when pop diff is less than 5 million
small_pop_diff_mathes = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 5000000)
(sum(small_pop_diff_mathes[['Bigger.Country.Won']])/nrow(small_pop_diff_mathes)) * 100
#Calculate % that larger country won when pop diff is larger than 50 million
big_pop_diff_mathes = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) > 50000000)
(sum(big_pop_diff_mathes[['Bigger.Country.Won']])/nrow(big_pop_diff_mathes)) * 100
 
 
################################################################################################
# Chart winning percentage vs. population
################################################################################################
library(dplyr)
library(reshape2)
 
base_data = 
  soccer_data %>%
  filter(!is.na(Winner)) %>%
  select(Home.Team, Away.Team, Winner) %>%
  melt(id.vars = c('Winner'), value.name='Team')
 
games_played = 
  base_data %>%
  group_by(Team) %>%
  summarize(Games.Played = n())
 
games_won = 
  base_data %>%
  mutate(Result = ifelse(Team == Winner,1,0)) %>%
  group_by(Team) %>%
  summarise(Games.Won = sum(Result))
 
team_results = 
  merge(games_won, games_played, by='Team') %>%
  filter(Games.Played > 2) %>%
  mutate(Win.Perct = Games.Won/Games.Played)
 
team_results = merge(team_results, population_data, by='Team')
 
# Plot all countries
library(ggplot2)
library(ggthemes)
ggplot(team_results, aes(x=Win.Perct, y=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_y_continuous(labels = scales::comma) +
  xlab('Winning Percentage') +
  ylab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015', 
                     atop(italic('Teams With Three or More Games Played (410 Total Games)'), ""))))
ggsave('population_vs_winning_perct.png')
 
# Plot countries smaller than 100 million
ggplot(subset(team_results,Population<100000000), aes(y=Win.Perct, x=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_x_continuous(labels = scales::comma) +
  ylab('Winning Percentage') +
  xlab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015', 
                          atop(italic('Excluding Countries with a Population Greater than 100 Million'), ""))))
ggsave('population_vs_winning_perct_smaller.png')

Created by Pretty R at inside-R.org

Thoughts on Equal Pay Day

A friend posted this on Facebook and one of her friends wondered about the figures.

Screen Shot 2016-05-03 at 6.40.43 PM.png

I replied with this lengthy comment (edited to provide clarity), which is not meant to be controversial, but may induce controversy:

These are the gross numbers if you just take averages across populations. All of the econometric work I’ve seen suggests that if you adjust for race, education, work experience, and IQ the gap narrows to somewhere between 90 and 96 cents (this econometric work is conducted by both women and men.). That’s good new right? The gap is less than we traditionally believe and that means progress is greater than we traditionally believe, even if there is work left to be done. But when I see this pointed out in various articles, such as a Slate article I read a few months back, the response is, “there’s still a gap.” Well, yes the measured gap still exists (and for the sake of argument let’s assume things are measured correctly), but income differences are still substantially better than you thought. Can’t we celebrate progress and acknowledge ongoing challenges simultaneously? For whatever reason it seems the conversation has gotten so bad (vitriolic? woman-hating? epistemically biased?) that celebrating progress is seen by women as a threat that progress will cease rather than acknowledgment that strides are being made.

In some fields, especially medical fields, men are paid less than women even on average. Probably because women are better people than men, or at least have a reputation for being more empathetic, which lends itself to certain professions. (Some economists have suggested that women will be the beneficiaries of the coming automation revolution as people skills will become ever more scarce and valuable).

You might be tempted to say that men are genetically better positioned for “traditional” white collar jobs because of their aggressiveness and competitiveness, but at least some of this difference is socialized; when researchers have played competitiveness games in matriarchal societies women have scored significantly higher than the men. There is perhaps no easy solution to this problem. It’s unclear that women should be socialized for competitiveness simply so they can be better at traditional white collar employment, which emerged largely at the hands of white men. On the other hand, making the workplace less competitive — even if it is normatively the “correct” approach — is a difficult task. It seems to me competitive and cutthroat environments can be good for productivity and innovation even if they aren’t for everyone (man or woman).

Of course the perfect test of discrimination would be to compare the same people randomizing their sex, unfortunately (fortunately?) that isn’t possible. However, there have been studies where researchers have sent out the same resume with traditionally male and female names and men are more likely to receive an interview. Perhaps surprisingly this holds when women are the hiring manager so the discrimination is not always so easy to identify. The same penalty occurs with traditionally black names on resumes. To me this is much stronger evidence of discrimination than gross differences in pay.

The big thing that hurts women is childbirth. Women have a hard time fighting back and income gaps seem to persist for long after the children are born. And you can imagine that at least as traditional household matters break down, women that are the same age as their male counterparts tend to have less work experience after the age of, say 30 or 35, because they spend more time taking care of children. I would like to see the income argument shift from these raw numbers which I think oversimplify the problem and aren’t so useful at any rate, and toward more targeted discussion of specific policies related to helping women better “withstand” childbirth and its ongoing responsibilities career wise. Of course maternity leave gets talked about a lot, but we probably need to have a deeper, more robust conversation policy wise (and culturally) about how to help. A first step might be helping men to understand the tremendous monetary burden women endure when taking care of children. It is not simply that they forgo income for a year or two when the child is young and they stay home. On average that lost income never comes back. If childbirth accounts for 80% (or whatever) of the true income gap after sensible adjustments to education, experience, and aptitude, it seems a good place to start a policy and cultural conversation rather than pointing to a misleading figure about income differences; the latter is less actionable.

There are also probably other areas where gaps can be narrowed beginning even earlier, such as during education. Roland Fryer has done interesting work here. Even simple things like priming minority students before tests with encouraging words can help them perform better so there is possibly at least some low hanging fruit as grades often translate into opportunities later on. Of course there is lots of work in early childhood education. Differences in education, especially between whites and low-income minorities, certainly play some part in income differences later in life so closing the education gap could be beneficial.

At any rate I don’t think it’s correct to say that the current state of gendered income disparity is “old-fashioned sexism.” It may be new-fashioned sexism, but that probably requires a different conversation and response.