Many bad shots or one good shot?

October 19, 2017

Who would win in a fight: one horse-sized duck or a hundred duck-sized horses?

Mark Taylor says the former. OK, that’s not quite what he says. Rather, he says that if two teams generate the same total expected goals, the team that generates very few shots, each with a much higher scoring percentage, does better than the team that generates loads of little shots.

His simulations are mathematically correct, but his conclusion is wrong. That’s because the simulations are answering the wrong question. They are answering the question of whether it is to a team’s advantage to spread their expected goals among many shots conditionally on a fixed number of shots being taken. But in order to answer the unconditional question, we also need to simulate the variability in the number of shots per match. This seems minor, but it actually turns out to be pretty important, as we will see.

First, let’s go over Mark’s simulations. In his example, one team generates two shots, each with a 60% probability of being scored, and the other generates 20 shots, each with a 6% probability of being scored. Both teams therefore have 1.2 expected goals. This is easy to simulate.

library(tidyverse)

simulations <-
  # 10,000,000 simulations
  tibble(
    # Team low variance takes 2 shots every game, 60% probability
    team_low_var = rbinom(1e7, 2, .6),
    # Team high variance takes 20 shots every game, 6% probability
    team_high_var = rbinom(1e7, 20, .06),
    result = case_when(
      team_low_var > team_high_var ~ "Low Variance wins",
      team_low_var < team_high_var ~ "High Variance wins",
      team_low_var == team_high_var ~ "Draw"
    )
  ) %>% 
  count(result) %>% 
  mutate(n = n / sum(n))

result                  n
Draw                0.305
High Variance wins  0.318
Low Variance wins   0.377
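
As a cross-check on the simulated frequencies above, the same three probabilities can be computed exactly from the two binomial distributions, with no random sampling. This is only a sketch in base R; the names `p_low`, `p_high`, `joint` and `exact_fixed` are introduced here for illustration and are not part of Mark's analysis.

```r
# Exact outcome probabilities when the shot counts are fixed
# (2 shots at 60% versus 20 shots at 6%), using base R only.
goals  <- 0:20                                   # every goal tally either team can reach
p_low  <- dbinom(goals, size = 2,  prob = 0.6)   # zero for tallies above 2
p_high <- dbinom(goals, size = 20, prob = 0.06)

# joint[i, j] = P(low-variance team scores goals[i]) * P(high-variance team scores goals[j])
joint <- outer(p_low, p_high)

exact_fixed <- c(
  "Draw"               = sum(diag(joint)),            # equal tallies
  "High Variance wins" = sum(joint[upper.tri(joint)]), # row tally < column tally
  "Low Variance wins"  = sum(joint[lower.tri(joint)])  # row tally > column tally
)
round(exact_fixed, 3)
```

The rounded values should agree with the simulated table to three decimal places.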

His simulations were, as you can see, spot on. What we need to do now, to make the simulations complete and relevant to real-world situations, is let the number of shots differ from game to game. A team that averages two shots a game will sometimes take four shots and sometimes none. It varies, and it is important:

complete_simulations <- 
  tibble(
    # Team low variance averages 2 shots a game, but there's variation
    team_low_shots = rpois(1e7, 2),
    # Team high variance averages 20 shots a game, but there's variation
    team_high_shots = rpois(1e7, 20),
    # The shots are scored with the same probability as before
    # but there's a different number of shots in every match
    team_low_var = rbinom(1e7, team_low_shots, .6),
    team_high_var = rbinom(1e7, team_high_shots, .06),
    result = case_when(
      team_low_var > team_high_var ~ "Low Variance wins",
      team_low_var < team_high_var ~ "High Variance wins",
      team_low_var == team_high_var ~ "Draw"
    )
  ) %>% 
  count(result) %>% 
  mutate(n = n / sum(n))

result                  n
Draw                0.277
High Variance wins  0.362
Low Variance wins   0.362
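
The simulated frequencies above can also be checked exactly. Here is a minimal sketch that mixes the binomial goal distribution over a Poisson number of shots. The helper `goal_pmf` and its truncation limits `max_goals` and `max_shots` are assumptions made for this example; they are chosen so that the probability mass left out is negligible for these parameters.

```r
# Exact goal distribution when the shot count is itself random:
# P(goals = g) = sum over n of P(n shots) * P(g goals | n shots).
goal_pmf <- function(mean_shots, p_score, max_goals = 15, max_shots = 200) {
  shots <- 0:max_shots
  sapply(0:max_goals, function(g) {
    # dbinom() returns 0 whenever g exceeds the number of shots
    sum(dpois(shots, mean_shots) * dbinom(g, size = shots, prob = p_score))
  })
}

p_low  <- goal_pmf(mean_shots = 2,  p_score = 0.6)
p_high <- goal_pmf(mean_shots = 20, p_score = 0.06)

# joint[i, j] = P(low-variance team scores i - 1) * P(high-variance team scores j - 1)
joint <- outer(p_low, p_high)

exact_poisson <- c(
  "Draw"               = sum(diag(joint)),
  "High Variance wins" = sum(joint[upper.tri(joint)]),
  "Low Variance wins"  = sum(joint[lower.tri(joint)])
)
round(exact_poisson, 3)

# Largest gap between the two teams' goal distributions
max(abs(p_low - p_high))
```

The last line reports how far apart the two teams' goal distributions are, which is a more direct comparison than the win probabilities alone.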

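They both now have the same probability of winning! That happens, of course, because football operates roughly as a Poisson process: when the number of shots is Poisson-distributed and each shot is scored independently, the number of goals is Poisson-distributed too, with mean equal to the expected goals. Both teams have 1.2 expected goals (2 × 0.6 and 20 × 0.06), so their goal distributions are identical. Hopefully I will write more on that soon.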
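
The equality can be worked out by hand under the same assumptions the simulation makes. The notation is introduced here for illustration: a team takes N shots with N ~ Poisson(λ), each shot is scored independently with probability p, and G is the number of goals.

```latex
\begin{aligned}
P(G = g)
  &= \sum_{n \ge g} \frac{e^{-\lambda}\lambda^{n}}{n!} \binom{n}{g} p^{g} (1-p)^{n-g} \\
  &= \frac{e^{-\lambda} (\lambda p)^{g}}{g!}
     \sum_{n \ge g} \frac{\bigl(\lambda (1-p)\bigr)^{n-g}}{(n-g)!} \\
  &= \frac{e^{-\lambda} (\lambda p)^{g}}{g!} \, e^{\lambda (1-p)}
   = \frac{e^{-\lambda p} (\lambda p)^{g}}{g!}
\end{aligned}
```

So G ~ Poisson(λp). With λ = 2, p = 0.6 and with λ = 20, p = 0.06, the product λp is 1.2 in both cases, which means the two goal distributions coincide and neither team has an edge.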

It’s good that Mark and other people are investigating these fundamental questions. It’s just a shame that the expected goals paradigm has become so entrenched that people neglect to model the variability in the number of shots teams take. Expected goals, as it is usually done, operates conditionally on the exact same shots being taken. But there’s a lot of variation in how many shots are taken, and in what kinds! Analyses of football matches can’t just take the shots as given; they need to encompass everything.