Testing Tony Kornheiser’s Football (Soccer) Population Theory

Fans of the daily ESPN show Pardon the Interruption (PTI) will be familiar with co-host Tony Kornheiser’s frequently invoked “Population Theory.” The theory has a few formulations: sometimes it is asserted that when two countries compete in international football the country with the larger population will win, while at other times it’s stated that the more populous country should win.

The “Population Theory” sometimes also incorporates the resources of the country. So, for example, Kornheiser recently stated that the United States should be performing better in international football both because the country has a large population and because it has spent a large sum of money on its football infrastructure.

I decided to test this theory by creating a dataset that combines football scores from SoccerLotto.com with population and per capita GDP data from various sources. Because of the SoccerLotto.com formatting, the page wasn’t easily scraped with R or copied and pasted into Excel, so a fair amount of manual work was involved. Here’s a picture of me doing that manual work to break up this text 🙂

IMG_4265

The dataset included 537 international football games that took place between 30 June 2015 and 27 June 2016. The most recent game in the dataset was the shocking Iceland upset over England. The population and per capita GDP data used whatever source was available. Because official government statistics are not collected annually the exact year differs. I’ve uploaded the data into a public Dropbox folder here. Feel free to use it. R code is provided below.

Per capita GDP is perhaps the most readily available proxy for national football resources, though admittedly it’s imperfect. Football is immensely popular globally and so many poor countries may spend disproportionately large sums on developing their football programs. A more useful statistic might be average age of first football participation, but as of yet I don’t have access to this type of data.

Results

So how does Kornheiser’s theory hold up to the data? Well, Kornheiser is right…but just barely. Over the past year the more populous country has won 51.6% of the time. So if you have to guess the outcome of an international football match and all you’re given is the population of the two countries involved then you should indeed bet on the more populous country.
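One rough way to gauge how thin that margin is: treat each game as a coin flip and ask whether 51.6% of 537 is distinguishable from 50%. The sketch below uses base R's `binom.test`. The win count of 277 is back-calculated from the reported percentage (0.516 × 537 ≈ 277) rather than read from the dataset, and comparing against 50% assumes every game ends in a win or a loss for the more populous country, which may not hold if draws are counted.

```r
# Assumption: 277 wins is reconstructed from the reported 51.6% of 537 games.
games <- 537
wins  <- round(0.516 * games)

# Exact two-sided binomial test against a 50% win rate.
# Assumption: each game is a win or a loss for the more populous country (no draws).
result <- binom.test(wins, games, p = 0.5)

result$p.value    # a large p-value means the data are consistent with a coin flip
result$conf.int   # 95% confidence interval for the underlying win rate
```

With these inputs the 95% interval spans 50%, which fits the "just barely" reading: one year of games cannot clearly separate a small population edge from chance.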

Of the 537 games, 81 occurred on a neutral field. More populous countries fared poorly on neutral fields, winning only 43.2% of the time. In games played on one team’s home field, the more populous country won 53.1% of the time.

Richer countries fared even worse, losing more than half their games (53.8%). They fared poorly both in home-field games and on neutral fields, winning only 45.8% and 48.1% of those matches respectively.

The best predictor of international football matches (at least in the data I had available) was whether the team was playing at home: home teams won 60.1% of the time.

To look more closely at population and winning, I plotted the winning percentage of each team that had played three or more international matches in the past year against its population. There were 410 total games that met this criterion. I also plotted a linear trend line in red which, as the figures above suggest, slopes upward ever so slightly.

population_vs_winning_perct.png

Although 537 games is a lot, it’s only a single year’s worth of data. It may be that this year was an anomaly, and I’m working on collecting a larger set of data. As the chart above suggests, many countries have a population of around 100 million or less, so it would perhaps be surprising if countries with a few million more or fewer people had significantly different outcomes in their matches. But we can test this too…

When two countries whose population difference is less than 1 million play one another, the more populous country actually loses 55.9% of the time. When two countries are separated by fewer than 5 million people, the more populous country does slightly better than random chance, with a winning percentage of 52.1%. But large population differences (greater than 50 million inhabitants) do not translate into more victories: the more populous country wins just 51.2% of those games. So, perhaps surprisingly, the small sample of data I have suggests that population differences matter more when the differences are smaller (of course this could be spurious).
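The thresholds above overlap (every game with a gap under 1 million also has a gap under 5 million). A variation is to cut the population gap into non-overlapping bins and report the win rate and game count in each. The sketch below does this in base R; the rows in `toy_games` are hypothetical and only the column names are borrowed from `soccer_data`, and the bin edges are one illustrative choice.

```r
# Hypothetical toy rows; the column names match soccer_data, the values do not.
toy_games <- data.frame(
  Home.Team.Population = c(5.0e6, 3.2e6, 8.0e7, 3.0e8, 1.1e7, 6.0e5),
  Away.Team.Population = c(5.4e6, 6.0e6, 6.0e7, 1.0e7, 1.0e7, 3.0e5),
  Bigger.Country.Won   = c(0, 1, 1, 0, 1, 0)
)

# Absolute population gap, cut into non-overlapping, left-closed bins.
pop_diff <- abs(toy_games$Home.Team.Population - toy_games$Away.Team.Population)
toy_games$Diff.Bucket <- cut(
  pop_diff,
  breaks = c(0, 1e6, 5e6, 5e7, Inf),
  labels = c('<1M', '1M-5M', '5M-50M', '50M+'),
  right  = FALSE
)

# Win percentage of the more populous country within each bin.
bucket_summary <- aggregate(Bigger.Country.Won ~ Diff.Bucket, data = toy_games,
                            FUN = function(x) mean(x) * 100)

# Number of games per bin, to show how thin each slice is.
bucket_counts <- table(toy_games$Diff.Bucket)

bucket_summary
bucket_counts
```

Printing the counts next to the percentages makes it easier to judge whether a bin's win rate rests on enough games to mean anything.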

This can be seen further below in a slightly different view of the chart above, one that exchanges the axes and limits teams to countries with fewer than 100 million people.

population_vs_winning_perct_smaller.png

R code provided below:

###################################################################################################
# James McCammon
# International Football and Population Analysis
# 7/1/2016
# Version 1.0
###################################################################################################
 
# Import Data
setwd("~/Soccer Data")
soccer_data = read.csv('soccer_data.csv', header=TRUE, stringsAsFactors=FALSE)
population_data = read.csv('population.csv', header=TRUE, stringsAsFactors=FALSE)
 
 
################################################################################################
# Calculate summary data
################################################################################################
# Subset home field and neutral field games
neutral_field = subset(soccer_data, Neutral=='Yes')
home_field = subset(soccer_data, Neutral=='No')

# Calculate % that larger country won
(sum(soccer_data[['Bigger.Country.Won']])/nrow(soccer_data)) * 100
# What about at neutral field?
(sum(neutral_field[['Bigger.Country.Won']])/nrow(neutral_field)) * 100
# What about at a home field?
(sum(home_field[['Bigger.Country.Won']])/nrow(home_field)) * 100

# Calculate % that richer country won
(sum(soccer_data[['Richer.Country.Won']])/nrow(soccer_data)) * 100
# What about at neutral field?
(sum(neutral_field[['Richer.Country.Won']])/nrow(neutral_field)) * 100
# What about at a home field?
(sum(home_field[['Richer.Country.Won']])/nrow(home_field)) * 100
 
# Calculate home field advantage
home_field_winner = subset(home_field, !is.na(Winner))
(sum(home_field_winner[['Home.Team']] == home_field_winner[['Winner']])/nrow(home_field_winner)) * 100
 
# Calculate % that larger country won when pop diff is less than 1 million
ultra_small_pop_diff_matches = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 1000000)
(sum(ultra_small_pop_diff_matches[['Bigger.Country.Won']])/nrow(ultra_small_pop_diff_matches)) * 100
# Calculate % that larger country won when pop diff is less than 5 million
small_pop_diff_matches = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) < 5000000)
(sum(small_pop_diff_matches[['Bigger.Country.Won']])/nrow(small_pop_diff_matches)) * 100
# Calculate % that larger country won when pop diff is larger than 50 million
big_pop_diff_matches = subset(soccer_data, abs(Home.Team.Population - Away.Team.Population) > 50000000)
(sum(big_pop_diff_matches[['Bigger.Country.Won']])/nrow(big_pop_diff_matches)) * 100
 
 
################################################################################################
# Chart winning percentage vs. population
################################################################################################
library(dplyr)
library(reshape2)
 
base_data = 
  soccer_data %>%
  filter(!is.na(Winner)) %>%
  select(Home.Team, Away.Team, Winner) %>%
  melt(id.vars = c('Winner'), value.name='Team')
 
games_played = 
  base_data %>%
  group_by(Team) %>%
  summarize(Games.Played = n())
 
games_won = 
  base_data %>%
  mutate(Result = ifelse(Team == Winner,1,0)) %>%
  group_by(Team) %>%
  summarise(Games.Won = sum(Result))
 
team_results = 
  merge(games_won, games_played, by='Team') %>%
  filter(Games.Played > 2) %>%
  mutate(Win.Perct = Games.Won/Games.Played)
 
team_results = merge(team_results, population_data, by='Team')
 
# Plot all countries
library(ggplot2)
library(ggthemes)
ggplot(team_results, aes(x=Win.Perct, y=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_y_continuous(labels = scales::comma) +
  xlab('Winning Percentage') +
  ylab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015', 
                     atop(italic('Teams With Three or More Games Played (410 Total Games)'), ""))))
ggsave('population_vs_winning_perct.png')
 
# Plot countries smaller than 100 million
ggplot(subset(team_results,Population<100000000), aes(y=Win.Perct, x=Population)) +
  geom_point(size=3, color='#4EB7CD') +
  geom_smooth(method='lm', se=FALSE, color='#FF6B6B', size=.75, alpha=.7) +
  theme_fivethirtyeight() +
  theme(axis.title=element_text(size=14)) +
  scale_x_continuous(labels = scales::comma) +
  ylab('Winning Percentage') +
  xlab('Population') +
  ggtitle(expression(atop('International Soccer Results Since June 2015', 
                          atop(italic('Excluding Countries with a Population Greater than 100 Million'), ""))))
ggsave('population_vs_winning_perct_smaller.png')


Where to Rank the UConn Women in Terms of Dominance

Unless a miracle occurs the UConn Women’s Basketball team will soon win their fourth championship in a row in an undefeated season when they beat Syracuse on Tuesday night. This will be their sixth championship since 2009. If they lose it will be one of the greatest upsets in the history of team sports.

Where does their recent dominance rank in the all-time history of sports? I put together this short survey.

The UNC women’s soccer team is, as far as I know, the most dominant team in the history of sports, collegiate or professional (at least in the U.S.), Harlem Globetrotters aside. They’ve been consistently dominant for three decades and have won 22 of the 36 NCAA National Championships. Of course the U.S. women’s national soccer team has also been dominant over the past 15 years, with numerous World Cup and Olympic gold medal wins as well as a continuous No. 1 ranking from March 2008 to December 2014. En España, Barcelona has created a dominant European futbol team.

The UCLA men’s basketball team of the 1960s and 1970s won seven straight national titles under the famous John Wooden. The Iowa Hawkeyes men’s wrestling team also had an amazing run of dominance, especially throughout the 1990s. My alma mater, the University of Washington, has won five consecutive national crew championships in the men’s varsity eight. Jointly, the University of Minnesota and University of Minnesota Duluth have been dominating women’s ice hockey since 2001, winning a combined 10 National Championships.

The University of Arkansas won eight consecutive national Track & Field Championships on the men’s side throughout the 1990s, while the LSU women won 11 championships in a row in the ’80s and ’90s (wow!). Swimming and diving national championships seem to come in bundles. Since 1937 only 13 different men’s teams have won national championships and many were back to back or three-peats. The women’s side is equally streaky. By the way, there are quite a few schools with swimming and diving programs.

Of course Alabama’s football team has been quite successful over the past seven years, winning four FBS championships in a rather competitive sport that has recently instituted a playoff system (Alabama has won one out of two of those).

What I’ve listed so far have been Division I-A programs only. Certainly some smaller college programs have seen dominance. And of course there are dominant high school teams as well. St. Anthony’s in New Jersey has won 27 boys’ basketball state titles in the past 39 years, for instance. Maryville Tennessee’s high school football team has gone 145-5 and won seven state titles in recent memory. Cheryl Miller, perhaps the greatest female basketball player of all time (and yes, sister of Reggie Miller), led her high school team to a record of 132-4 from ’78-’82 and along the way scored 105 points in a single game. Reggie Miller often recalls the night he found out about his sister’s scoring outburst. He had just scored 39 points and was pretty proud of himself until his sister reported back that she had more than doubled that total. I recall hearing about several boys’ wrestling champions with perfect high school careers. Here is one example.

A number of professional teams have had long periods of dominance. Chinese women’s diving has been extremely dominant recently. The New York Yankees have won 27 World Championships and 40 American League pennants over the past 100 years, with many of these coming over the 45-year period between 1920 and 1965. I’m aware that Russian hockey and gymnastics teams were quite great in their prime, perhaps still so.

The Boston Celtics won eight straight World Championships throughout the 1960s, helping Bill Russell win a total of 11 championship rings during his career. Indeed, Bill Russell is sometimes considered the greatest winner in the history of team sports and as such when LeBron James left Russell off of his theoretical “Mount Rushmore of the NBA” Russell was able to respond with this amazing quote regarding his own athletic success:

Hey, thank you for leaving me off your Mount Rushmore. I’m glad you did. Basketball is a team game, it’s not for individual honors. I won back-to-back state championships in high school, back-to-back NCAA championships in college. I won an NBA championship my first year in the league, an NBA championship in my last year, and nine in between. That, Mr. James, is etched in stone.

Individual athletics has also seen sportsmen and women who have been consistently dominant. Tiger Woods, Usain Bolt, Shaun White, and Michael Phelps have all had multi-year stretches of dominance in recent memory. At least one of them was just featured in an inspiring commercial. Of course there were many dominant athletes in each of these sports before the current incarnations (Jack Nicklaus, Carl Lewis, Mark Spitz). Eric Heiden won five speed skating gold medals at the 1980 Lake Placid Olympics, starting with the 500 meters and ending with the 10,000 meters. I once watched a documentary in which this feat was compared with a single athlete winning both the 100 meter dash and the mile. Tony Hawk helped usher in skateboarding as a professional sport and was dominant while doing so. Chris Sharma did the same with rock climbing. Rich Froning Jr. has had early success in the burgeoning activity of CrossFit as a sport, winning the title of “Fittest Man on Earth” four times since 2011.

Anderson Silva had a long run of dominance in MMA and you’ve certainly heard of some of boxing’s all-time greats. Ronda Rousey garnered fame for her win streak until she was beaten just this year; she also appeared in the horrible movie version of HBO’s Entourage, though I liked her performance. If you’ve ever been to the ballet you know it can be extremely athletic. How’s that for an inspirational commercial? Perhaps it’s time we consider ballet a sport?

And then there is this horse.

Tennis has a history of dominant players including two current players: Novak Djokovic and Serena Williams. Serena is already generally considered the best female player of all time and Novak may end up the greatest men’s player before his career is over. Previous generations included Steffi Graf, Martina Navratilova, Roger Federer, Pete Sampras, and many others. Each was extremely dominant during their prime. For example, during his prime Roger Federer held the No. 1 ranking for 237 consecutive weeks (302 weeks in total), reached 23 consecutive Grand Slam semifinals, and won five consecutive titles at both Wimbledon and the US Open, along with three out of four at the Australian Open.

Perhaps a dark-horse contender for most dominant athlete is Kelly Slater, the American professional surfer who won five consecutive titles from ’94 to ’98. There are a number of articles suggesting he may be the greatest male athlete of all time. He won his first title at age 20 and his last at age 39 (and he’s still surfing competitively!). Talk about longevity. Imagine if Kobe Bryant were leading the Lakers to a title this year or if Peyton Manning had been truly great in the Broncos’ Super Bowl win and you have some idea of what Kelly Slater has accomplished. (Yes, I realize surfing is a non-contact sport. Or is it?)

What have I forgotten? Surely there must be a lot. Certainly, this list is too U.S.-centric.

But back to the question at hand. There have been many conversations about whether the UConn women’s dominance is bad for women’s college basketball. It has been suggested by some that this is a sexist argument, but I disagree. If Kentucky’s men’s basketball team was on the verge of winning its fourth straight NCAA tournament there would certainly be discussion about their dominance, perhaps around allegations of illegal recruiting or steroid use or at the very least a discussion about reforming the current one-and-done system.

And the question of whether a team can be too dominant is not new. Indeed, many professional sports are structured specifically to provide (or at least attempt to provide) equity among smaller and larger markets. Think of the draft or salary caps. Of course, in individual sports we fear dominance less because we know natural aging will create a new wave of competition in a few years’ time, or if we’re talking about individual college sports the athlete will simply graduate.

But we also understand that long-term equilibrium can occur where success begets success. College players, shamefully, are not paid in dollars so the next best thing is to be paid in wins. UConn seems to be the central bank in that category.

On the other hand, and as the list above alludes to, dominance is not unique to the UConn women. In fact, in the grand scheme of things they aren’t so dominant after all. But in some sports we’re used to seeing new champions more often than in others, if only because most people in the U.S. only follow the big four. We’re used to seeing new men’s champions every year to be sure, even if they’re all generally from the same group of ten or twenty teams year after year. So it really stands out when the same women’s team wins repeatedly, regardless of where they stand in the broader historical spectrum.

The best thing, it seems, would be for Syracuse to beat UConn and put the whole matter to rest.