Simpsons Paradox

Statistics
2016
Author

Cliff Weaver

Published

July 3, 2016

Statistics are used much like a drunk uses a lampost: for support, not illumination.

Simpsons Paradox or Yule-Simpson Effect is something many people are not familiar with. So I thought I would provide a brief example to help shed some light on it. If you are not aware of the problem of confounding or “lurking” variables in your analysis, you can box yourself into an analytic corner.

In short what is happening is that a trend or association that appears between two different variables reverses when a third variable is included. You will stumble into or at least need to be on the lookout for this effect of spurious correlation when you have unbalanced group sizes, like you often have using observational data. This is what happened in my recent discovery.

In the example below, let’s have a look at the UC Berkeley data, or at least the portion of it by Department, as provided in a Wikipedia article. https://en.wikipedia.org/wiki/Simpson%27s_paradox#cite_note-Bickel-11

What the data we explore contains is the number of applications and admissions by gender to six different graduate schools. Again, this is just a portion of the data, focusing in on the largest departments.

dpt <- c("A", "B", "C", "D", "E", "F", "A", "B", "C", "D", "E", "F")
app <- c(825,560,325,417,191,272,108,25,593,375,393,341)
adm <- c(512,353,120,138,53,16,89,17,202,131,94,24)
gen <- c("m","m","m","m","m","m","f","f","f","f","f","f")
df <- cbind.data.frame(dpt,app,adm,gen)
str(df)
'data.frame':   12 obs. of  4 variables:
 $ dpt: chr  "A" "B" "C" "D" ...
 $ app: num  825 560 325 417 191 272 108 25 593 375 ...
 $ adm: num  512 353 120 138 53 16 89 17 202 131 ...
 $ gen: chr  "m" "m" "m" "m" ...

There are a number of ways to demonstrate the effect, but I thought I would give it a go using dplyr. First, let’s group the data by gender (gen) then look at their overall admission rate.

library(dplyr)
Warning: package 'dplyr' was built under R version 4.2.2
by_gen = group_by(df, gen) 
summarize(by_gen, adm_rate=sum(adm)/sum(app))
# A tibble: 2 × 2
  gen   adm_rate
  <chr>    <dbl>
1 f        0.304
2 m        0.460

Clearly there is a huge gender bias problem in graduate school admissions at UC Berkeley and it is time to round up a herd of lawyers. On a side note, what is the best way to save a drowning lawyer? It’s easy, just take your foot off their head.

We can even provide a statistical test to strengthen the assertion of bias. In R, a proportions test can be done via the prop.test() function, inputting the number of “hits” and the number of “trials”.

summarize(by_gen, sum(adm))
# A tibble: 2 × 2
  gen   `sum(adm)`
  <chr>      <dbl>
1 f            557
2 m           1192
summarize(by_gen, sum(app))
# A tibble: 2 × 2
  gen   `sum(app)`
  <chr>      <dbl>
1 f           1835
2 m           2590
prop.test(x=c(557,1192), n=c(1835,2590))

    2-sample test for equality of proportions with continuity correction

data:  c(557, 1192) out of c(1835, 2590)
X-squared = 109.67, df = 1, p-value < 2.2e-16
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.1856333 -0.1277456
sample estimates:
   prop 1    prop 2 
0.3035422 0.4602317 

We see in the output a high level of statistical significance and a confidence interval with a difference of roughly 13 to 19 percent. However, beware of the analyst proclaiming truths using a 2×2 table with observational data! In this instance, the confounding variable is the department (dpt).

In order to have a look at the data by gender and by department, we will once again use the group_by() function, then calculate the admission rates and finally turn it from “long” to “wide” format.

by_dpt = group_by(df, dpt, gen)
df2 = as.data.frame(summarize(by_dpt, adm_rate=sum(adm)/sum(app)))
`summarise()` has grouped output by 'dpt'. You can override using the `.groups`
argument.
df2
   dpt gen   adm_rate
1    A   f 0.82407407
2    A   m 0.62060606
3    B   f 0.68000000
4    B   m 0.63035714
5    C   f 0.34064081
6    C   m 0.36923077
7    D   f 0.34933333
8    D   m 0.33093525
9    E   f 0.23918575
10   E   m 0.27748691
11   F   f 0.07038123
12   F   m 0.05882353
library(reshape2)
df2_wide = dcast(df2, dpt ~ gen, value.var = "adm_rate")
df2_wide
  dpt          f          m
1   A 0.82407407 0.62060606
2   B 0.68000000 0.63035714
3   C 0.34064081 0.36923077
4   D 0.34933333 0.33093525
5   E 0.23918575 0.27748691
6   F 0.07038123 0.05882353

With the inclusion of department we now see that females actually have higher admission rates in four of the six departments (A,B,D,F). How can this be? It had to do with rates and the volume of admission to the various departments. Again, the groups were highly unbalanced.