Preparing Data to Answer Causal Questions

Overview

Data preparation for causal inference looks familiar on the surface: you clean things up, do some EDA, maybe merge a couple of datasets. What makes it different is that every decision you make during prep needs to connect back to the causal question and the target trial you are trying to emulate. This chapter walks through that process using wait time data from Disney World.

The chapter covers seven areas: getting to know the data, framing the causal question, linking protocol steps to tidyverse code, joining multiple sources, building a descriptive table, flagging missing data, and checking the positivity assumption.


8.1 The Data

The touringplans package contains attraction wait time data from the Walt Disney World theme parks. There are 14 attractions in the package.

Show code
attractions_metadata |>
  kbl(
    col.names = c(
      "Dataset", "Name", "Short Name", "Park",
      "Land", "Opened", "Duration (min)", "Avg Wait / 100"
    ),
    digits = 1
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.1: Attractions available in the touringplans package.
Dataset Name Short Name Park Land Opened Duration (min) Avg Wait / 100
alien_saucers Alien Swirling Saucers Alien Saucers Disney's Hollywood Studios Toy Story Land 2018-06-30 2.5 10.0
dinosaur DINOSAUR DINOSAUR Disney's Animal Kingdom DinoLand U.S.A. 1998-04-22 3.5 3.0
expedition_everest Expedition Everest - Legend of the Forbidden Mountain Expedition Everest Disney's Animal Kingdom Asia 2006-04-07 4.0 4.0
flight_of_passage Avatar Flight of Passage Flight of Passage Disney's Animal Kingdom Pandora - The World of Avatar 2017-05-27 6.0 4.0
kilimanjaro_safaris Kilimanjaro Safaris Kilimanjaro Safaris Disney's Animal Kingdom Africa 1998-04-22 20.0 4.0
navi_river Na'vi River Journey Na'vi River Disney's Animal Kingdom Pandora - The World of Avatar 2017-05-27 5.0 5.0
pirates_of_caribbean Pirates of the Caribbean Pirates of Caribbean Magic Kingdom Adventureland 1973-12-17 7.5 1.5
rock_n_rollercoaster Rock 'n' Roller Coaster Starring Aerosmith Rock Coaster Disney's Hollywood Studios Sunset Boulevard 1999-07-29 1.5 2.5
seven_dwarfs_train Seven Dwarfs Mine Train 7 Dwarfs Train Magic Kingdom Fantasyland 2014-05-28 3.0 5.0
slinky_dog Slinky Dog Dash Slinky Dog Disney's Hollywood Studios Toy Story Land 2018-06-30 3.0 5.0
soarin Soarin' Around the World Soarin' Epcot World Nature 2005-05-05 8.0 3.0
spaceship_earth Spaceship Earth Spaceship Earth Epcot World Celebration 1982-10-01 16.0 3.0
splash_mountain Splash Mountain Splash Mountain Magic Kingdom Frontierland 1992-07-17 18.0 3.5
toy_story_mania Toy Story Mania! Toy Story Mania! Disney's Hollywood Studios Toy Story Land 2008-05-31 6.5 4.5

Alongside the attraction data, parks_metadata_raw has 181 columns of daily park-level information: ticket season, historic temperature, special events, and more.

Show code
parks_metadata_raw |>
  count(year, mkemhmorn) |>
  pivot_wider(names_from = mkemhmorn, values_from = n, names_prefix = "emh_") |>
  rename(Year = year, `No EMH` = emh_0, `EMH Morning` = emh_1) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.2: Days with and without Extra Magic Hours in the morning (mkemhmorn), by year.
Year No EMH EMH Morning
2015 290 75
2016 300 66
2017 293 72
2018 304 61
2019 250 115
2020 135 14
2021 104 NA

The 2020 and 2021 data are truncated due to COVID-19 park closures; no Extra Magic Morning days appear at all in the 2021 records.

Show code
parks_metadata_raw |>
  count(year, name = "days") |>
  kbl(col.names = c("Year", "Days Observed")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.3: Total observation days per year in the parks metadata.
Year Days Observed
2015 365
2016 366
2017 365
2018 365
2019 365
2020 149
2021 104

The Seven Dwarfs Mine Train data has over 320,000 rows, with one row per wait time record (either posted or actual, with the other as NA).
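
That row structure can be spot-checked directly: each row should have at most one of the two wait columns populated (rows with both missing also occur). A base-R sketch on a toy frame mirroring the real column names:

```r
# Toy frame mirroring the real columns: each row carries a posted wait,
# an actual wait, or neither -- never both.
toy <- data.frame(
  wait_minutes_actual = c(12, NA, NA),
  wait_minutes_posted = c(NA, 45, NA)
)

# Count how many of the two wait columns are populated in each row.
n_filled <- rowSums(!is.na(toy[c("wait_minutes_actual", "wait_minutes_posted")]))
stopifnot(all(n_filled <= 1))  # at most one wait type per row
```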

Show code
seven_dwarfs_train |>
  reframe(
    across(
      c(park_date, starts_with("wait_minutes")),
      \(.x) range(.x, na.rm = TRUE)
    )
  ) |>
  mutate(bound = c("Min", "Max")) |>
  relocate(bound) |>
  kbl(
    col.names = c("Bound", "Park Date", "Actual Wait (min)", "Posted Wait (min)")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.4: Range of dates and wait times in the Seven Dwarfs Mine Train dataset.
Bound Park Date Actual Wait (min) Posted Wait (min)
Min 2015-01-01 -92918 0
Max 2021-12-28 217 300

The minimum actual wait time of -92,918 minutes is an obvious data error. Strip it before doing anything else.

Show code
seven_dwarfs_train <- seven_dwarfs_train |>
  filter(wait_minutes_actual >= 0 | is.na(wait_minutes_actual))

Posted wait times are always multiples of 5, which is a useful sanity check:

Show code
seven_dwarfs_train |>
  pull(wait_minutes_posted) |>
  unique() |>
  sort() |>
  matrix(nrow = 5) |>
  as.data.frame() |>
  kbl(col.names = NULL) |>
  kable_styling(
    bootstrap_options = c("condensed"),
    full_width = FALSE
  )
Table 8.5: Unique posted wait time values (minutes). All are multiples of 5; the final 0 in the bottom-right corner is padding recycled by matrix().
0 25 50 75 100 125 150 175 200 225 260
5 30 55 80 105 130 155 180 205 230 270
10 35 60 85 110 135 160 185 210 235 280
15 40 65 90 115 140 165 190 215 240 300
20 45 70 95 120 145 170 195 220 250 0
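
The check can also be made programmatic rather than visual; a base-R sketch with a toy vector standing in for wait_minutes_posted:

```r
# Assert that every non-missing posted wait is a multiple of 5.
posted <- c(0, 5, 30, 115, NA, 300)  # toy stand-in for wait_minutes_posted
stopifnot(all(posted %% 5 == 0, na.rm = TRUE))
```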

The two wait time distributions look quite different. Actual times are shorter on average and roughly right-skewed; posted times are discretized to 5-minute increments and slightly heavier in the right tail.

Show code
seven_dwarfs_train |>
  pivot_longer(
    starts_with("wait_minutes"),
    names_to = "wait_type",
    values_to = "wait_minutes"
  ) |>
  mutate(
    wait_type = case_when(
      wait_type == "wait_minutes_actual" ~ "Actual",
      wait_type == "wait_minutes_posted" ~ "Posted"
    )
  ) |>
  ggplot(aes(wait_minutes, fill = wait_type)) +
  geom_density(color = NA, alpha = 0.85) +
  facet_wrap(~wait_type, scales = "free_y") +
  scale_fill_okabe_ito() +
  labs(
    x = "Wait time (minutes)",
    y = "Density",
    fill = "Wait type",
    title = "Posted vs. Actual Wait Times",
    subtitle = "Seven Dwarfs Mine Train, all years"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
Two density plots side by side showing posted and actual wait time distributions.
Figure 8.1: Distribution of posted and actual wait times for the Seven Dwarfs Mine Train. Posted times are rounded to 5-minute increments, while actual times are shorter on average and less discretized.

8.2 The Causal Question

The causal question driving the next several chapters is:

Does having Extra Magic Hours in the morning at Magic Kingdom causally increase the average posted wait time for the Seven Dwarfs Mine Train between 9 AM and 10 AM on the same day in 2018?

The exposure is a binary day-level indicator: did Magic Kingdom run Extra Magic Hours in the morning (yes or no)? The outcome is the average posted wait time during the 9-10 AM window.

DAG

The proposed DAG encodes the belief that Extra Magic Morning status and average wait times share three common causes: ticket season, park close time, and historic high temperature. Each of these opens a backdoor path that needs to be blocked.

Show code
coord_dag <- list(
  x = c(Season = 0, close = 0, weather = -1, x = 1, y = 2),
  y = c(Season = -1, close = 1, weather = 0, x = 0, y = 0)
)

labels <- c(
  x = "Extra Magic\nMorning",
  y = "Average wait",
  Season = "Ticket Season",
  weather = "Historic high\ntemperature",
  close = "Time park\nclosed"
)

dagify(
  y ~ x + close + Season + weather,
  x ~ weather + close + Season,
  coords = coord_dag,
  labels = labels,
  exposure = "x",
  outcome = "y"
) |>
  tidy_dagitty() |>
  node_status() |>
  ggplot(aes(x, y, xend = xend, yend = yend, color = status)) +
  geom_dag_edges_arc(curvature = c(rep(0, 5), 0.3)) +
  geom_dag_point() +
  geom_dag_label_repel(aes(label = label), seed = 1630) +
  scale_color_okabe_ito(na.value = "grey90") +
  theme_dag() +
  theme(legend.position = "none") +
  coord_cartesian(clip = "off")
DAG with five nodes: Extra Magic Morning, Average wait, Ticket Season, Time park closed, and Historic high temperature.
Figure 8.2: Proposed DAG for the effect of Extra Magic Hours in the morning on average Seven Dwarfs Mine Train wait times between 9 and 10 AM. Ticket season, park close time, and historic high temperature are confounders.
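
The adjustment set implied by the DAG can also be computed directly with the dagitty package (a sketch, assuming dagitty is installed; node names match the code above):

```r
library(dagitty)

# Same structure as the DAG above: three confounders, each a common
# cause of exposure (x) and outcome (y).
dag <- dagitty("dag {
  Season -> x  Season -> y
  close -> x   close -> y
  weather -> x weather -> y
  x -> y
}")

# The single minimal adjustment set contains all three confounders.
adjustmentSets(dag, exposure = "x", outcome = "y")
```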

8.3 From Protocol to Code

The target trial protocol maps cleanly onto tidyverse operations. The table below shows that correspondence.

Show code
tibble(
  `Protocol Element` = c(
    "Eligibility criteria",
    "Exposure definition",
    "Assignment procedures",
    "Follow-up period",
    "Outcome definition",
    "Analysis plan"
  ),
  `tidyverse Function(s)` = c(
    "`filter()`",
    "`mutate()`",
    "`mutate()`, `select()`",
    "`mutate()`, `pivot_longer()`, `pivot_wider()`",
    "`mutate()`",
    "`select()`, `mutate()`, `*_join()`"
  )
) |>
  kbl(escape = FALSE) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.6: Mapping target trial protocol elements to tidyverse functions.
Protocol Element tidyverse Function(s)
Eligibility criteria `filter()`
Exposure definition `mutate()`
Assignment procedures `mutate()`, `select()`
Follow-up period `mutate()`, `pivot_longer()`, `pivot_wider()`
Outcome definition `mutate()`
Analysis plan `select()`, `mutate()`, `*_join()`

The full target trial protocol for this study looks like this:

Show code
tibble(
  `Protocol Step` = c(
    "Eligibility criteria",
    "Exposure definition",
    "Assignment procedures",
    "Follow-up period",
    "Outcome definition",
    "Causal contrast",
    "Analysis plan"
  ),
  `Target Trial` = c(
    "Days from 2018.",
    "Exposed: Magic Kingdom had Extra Magic Hours in the morning.",
    "Random assignment, 50% probability, non-blinded.",
    "Park open until 10 AM on the same day.",
    "Average posted wait time for Seven Dwarfs Mine Train, 9-10 AM.",
    "Average Treatment Effect (ATE).",
    "ATE via IPW, weighted for temperature, ticket season, close time."
  ),
  `Observational Emulation` = c(
    "Same.",
    "Same.",
    "Days assigned per observed data. Randomization emulated via confounding adjustment.",
    "Same.",
    "Same.",
    "Same.",
    "Same. Confounders from DAG in Figure 8.2."
  )
) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE
  ) |>
  column_spec(1, bold = TRUE, width = "12em") |>
  column_spec(2:3, width = "18em")
Table 8.7: Target trial protocol for the effect of Extra Magic Morning on Seven Dwarfs Mine Train wait times, and the corresponding observational emulation.
Protocol Step Target Trial Observational Emulation
Eligibility criteria Days from 2018. Same.
Exposure definition Exposed: Magic Kingdom had Extra Magic Hours in the morning. Same.
Assignment procedures Random assignment, 50% probability, non-blinded. Days assigned per observed data. Randomization emulated via confounding adjustment.
Follow-up period Park open until 10 AM on the same day. Same.
Outcome definition Average posted wait time for Seven Dwarfs Mine Train, 9-10 AM. Same.
Causal contrast Average Treatment Effect (ATE). Same.
Analysis plan ATE via IPW, weighted for temperature, ticket season, close time. Same. Confounders from DAG in Figure 8.2.

Building the outcome dataset

The outcome is the average posted wait time in the 9 AM hour, filtered to 2018.

Show code
seven_dwarfs_9 <- seven_dwarfs_train |>
  filter(year(park_date) == 2018) |>
  mutate(hour = hour(wait_datetime)) |>
  summarize(
    across(
      c(wait_minutes_posted, wait_minutes_actual),
      \(.x) mean(.x, na.rm = TRUE),
      .names = "{.col}_avg"
    ),
    .by = c(park_date, hour)
  ) |>
  mutate(
    across(
      c(wait_minutes_posted_avg, wait_minutes_actual_avg),
      \(.x) if_else(is.nan(.x), NA, .x)
    )
  ) |>
  filter(hour == 9)
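
The NaN-to-NA step above exists because of a base-R quirk: taking the mean of an all-NA group with na.rm = TRUE means averaging an empty vector, which returns NaN rather than NA:

```r
# mean() of an empty vector (all values removed as NA) returns NaN.
mean(c(NA_real_, NA_real_), na.rm = TRUE)
#> NaN
```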
Show code
seven_dwarfs_9 |>
  head(10) |>
  kbl(
    col.names = c("Date", "Hour", "Avg Posted Wait (min)", "Avg Actual Wait (min)"),
    digits = 1
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.8: First 10 rows of the outcome dataset: average posted and actual wait times for the 9 AM hour in 2018.
Date Hour Avg Posted Wait (min) Avg Actual Wait (min)
2018-01-01 9 60.0 NA
2018-01-02 9 60.0 NA
2018-01-03 9 60.0 16.0
2018-01-04 9 68.9 NA
2018-01-05 9 70.6 33.0
2018-01-06 9 33.3 25.3
2018-01-07 9 46.4 32.0
2018-01-08 9 69.5 8.0
2018-01-09 9 64.3 47.0
2018-01-10 9 74.3 NA

8.4 Joining Multiple Data Sources

The exposure and confounders are in parks_metadata_raw. Clean and filter to 2018 first, then join.

Show code
parks_metadata <- parks_metadata_raw |>
  select(
    park_date = date,
    park_extra_magic_morning = mkemhmorn,
    park_ticket_season = wdw_ticket_season,
    park_temperature_high = wdwmaxtemp,
    park_close = mkclose
  ) |>
  filter(year(park_date) == 2018)

Variable names follow the “Column Names as Contracts” convention (Riederer 2020): variables from park-level metadata are prefixed with park_, wait time variables with wait_.
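
The prefixing can also be applied mechanically rather than column by column; a dplyr sketch on a hypothetical two-column metadata frame:

```r
library(dplyr)

meta <- tibble(ticket_season = "peak", close = "22:00:00")

# Prepend park_ to every column name at once.
meta |> rename_with(\(nm) paste0("park_", nm))
# columns become park_ticket_season and park_close
```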

Extra Magic Mornings by month

About 12 to 16% of days had Extra Magic Mornings in most months. December stands out at 42%.

Show code
parks_metadata |>
  mutate(month_num = month(park_date)) |>
  summarize(
    prop_emh = sum(park_extra_magic_morning) / n(),
    .by = month_num
  ) |>
  arrange(month_num) |>
  mutate(month_label = month(month_num, label = TRUE, abbr = FALSE)) |>
  select(Month = month_label, `Proportion EMH` = prop_emh) |>
  kbl(digits = 3) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.9: Proportion of 2018 days with Extra Magic Hours in the morning, by calendar month.
Month Proportion EMH
January 0.129
February 0.143
March 0.161
April 0.133
May 0.129
June 0.167
July 0.129
August 0.161
September 0.133
October 0.161
November 0.133
December 0.419

Confounders

Not every ticket season type appears in every month:

Show code
count_by_month <- function(data, .var) {
  data |>
    mutate(month = month(park_date, label = TRUE, abbr = TRUE)) |>
    count(month, {{ .var }}) |>
    complete(month, {{ .var }}, fill = list(n = 0))
}

ticket_season_by_month <- parks_metadata |>
  count_by_month(park_ticket_season)
Show code
ticket_season_by_month |>
  filter(n <= 4) |>
  arrange(n, park_ticket_season) |>
  kbl(col.names = c("Month", "Ticket Season", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.10: Months with zero or very few days of a particular ticket season. Peak tickets were absent in August and September; value tickets were absent in June, July, and December.
Month Ticket Season Days
Aug peak 0
Sep peak 0
Jun value 0
Jul value 0
Dec value 0
Jan peak 3
May value 3
Feb peak 4
Jul peak 4
Oct peak 4
Show code
ticket_season_by_month |>
  ggplot(aes(month, n, fill = park_ticket_season)) +
  geom_col(position = "fill", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    y = "Proportion of days",
    x = NULL,
    fill = "Ticket season",
    title = "Ticket Season by Month (2018)"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Stacked bar chart showing ticket season proportions by month.
Figure 8.3: Proportion of days by ticket season across 2018. Regular season dominates the summer months; peak season clusters around school holidays.

Close times show more variety than you might expect. Values of 24:00 and 25:00 encode closings at midnight and 1 AM the following day:

Show code
parks_metadata |>
  count(park_close, sort = TRUE) |>
  kbl(col.names = c("Close Time", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.11: Count of 2018 days by Magic Kingdom park close time.
Close Time Days
22:00:00 105
23:00:00 93
24:00:00 58
18:00:00 57
21:00:00 28
20:00:00 21
25:00:00 2
16:30:00 1
Show code
parks_metadata |>
  count_by_month(park_close) |>
  ggplot(aes(month, n, fill = ordered(park_close))) +
  geom_col(position = "fill", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    y = "Proportion of days",
    x = NULL,
    fill = "Close time",
    title = "Park Close Time by Month (2018)"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Stacked bar chart of park close time by month.
Figure 8.4: Proportion of days by close time, across months. Earlier close times concentrate in late fall and winter.
Show code
parks_metadata |>
  mutate(month = month(park_date, label = TRUE, abbr = TRUE)) |>
  ggplot(aes(month, park_temperature_high)) +
  geom_jitter(height = 0, width = 0.15, alpha = 0.5, color = "#0072B2") +
  labs(
    y = "Historic high temperature (F)",
    x = NULL,
    title = "Daily High Temperature by Month (2018)"
  ) +
  theme_minimal(base_size = 13)
Jitter plot of historic high temperature by month.
Figure 8.5: Historic high temperature at Walt Disney World by month. The park is in Florida, so temperatures stay relatively warm year-round, but summers get properly hot.

Joining the datasets

Three May dates are missing from seven_dwarfs_9 (no 9 AM records in the raw data):

Show code
parks_metadata |>
  anti_join(seven_dwarfs_9, by = "park_date") |>
  mutate(park_close = as.character(park_close)) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.12: Days in parks metadata with no matching 9 AM wait time records. Three consecutive days in May 2018 are absent.
park_date park_extra_magic_morning park_ticket_season park_temperature_high park_close
2018-05-10 0 regular 85.97 22:00:00
2018-05-11 1 regular 91.73 23:00:00
2018-05-12 0 peak 78.56 24:00:00

Left-join on park_date to build the analytic dataset:

Show code
seven_dwarfs_9 <- seven_dwarfs_9 |>
  left_join(parks_metadata, by = "park_date")
Show code
seven_dwarfs_9 |>
  head(10) |>
  mutate(park_close = as.character(park_close)) |>
  kbl(digits = 1) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  ) |>
  scroll_box(width = "100%")
Table 8.13: First 10 rows of the analytic dataset after joining wait times with park metadata.
park_date hour wait_minutes_posted_avg wait_minutes_actual_avg park_extra_magic_morning park_ticket_season park_temperature_high park_close
2018-01-01 9 60.0 NA 0 peak 58.6 23:00:00
2018-01-02 9 60.0 NA 0 peak 53.6 24:00:00
2018-01-03 9 60.0 16.0 0 peak 51.1 24:00:00
2018-01-04 9 68.9 NA 0 regular 52.7 24:00:00
2018-01-05 9 70.6 33.0 1 regular 54.3 24:00:00
2018-01-06 9 33.3 25.3 0 regular 56.2 23:00:00
2018-01-07 9 46.4 32.0 0 regular 65.2 21:00:00
2018-01-08 9 69.5 8.0 0 value 70.8 20:00:00
2018-01-09 9 64.3 47.0 0 value 75.2 20:00:00
2018-01-10 9 74.3 NA 0 value 74.2 20:00:00

8.5 Descriptive Table

The table below summarizes the three confounders stratified by exposure status. More days fell into the unexposed group (no Extra Magic Mornings). The regular ticket season was the most common in both groups. Temperatures look roughly similar across groups, though Extra Magic Morning days were noticeably more likely to close early at 18:00 (30% vs. 13% of days) and were slightly cooler overall.

Because gtsummary output does not convert cleanly to kable in all contexts, here is an equivalent kable version built manually:

Show code
desc <- seven_dwarfs_9 |>
  mutate(
    emh = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH"),
    park_close_chr = as.character(park_close)
  )

ticket_tbl <- desc |>
  count(emh, park_ticket_season) |>
  pivot_wider(names_from = emh, values_from = n, values_fill = 0) |>
  rename(Characteristic = park_ticket_season)

close_tbl <- desc |>
  count(emh, park_close_chr) |>
  pivot_wider(names_from = emh, values_from = n, values_fill = 0) |>
  rename(Characteristic = park_close_chr)

temp_tbl <- desc |>
  summarize(
    `EMH Morning`  = round(median(park_temperature_high[emh == "EMH Morning"],  na.rm = TRUE), 1),
    `No EMH`       = round(median(park_temperature_high[emh == "No EMH"], na.rm = TRUE), 1)
  ) |>
  mutate(Characteristic = "Median historic high temp (F)") |>
  relocate(Characteristic)

bind_rows(
  tibble(Characteristic = "Ticket season", `EMH Morning` = NA_integer_, `No EMH` = NA_integer_),
  ticket_tbl,
  tibble(Characteristic = "Close time", `EMH Morning` = NA_integer_, `No EMH` = NA_integer_),
  close_tbl,
  tibble(Characteristic = "Temperature", `EMH Morning` = NA_integer_, `No EMH` = NA_integer_),
  temp_tbl
) |>
  mutate(across(c(`EMH Morning`, `No EMH`), \(x) replace_na(as.character(x), ""))) |>
  kbl(col.names = c("Characteristic", "EMH Morning (N=60)", "No EMH (N=302)")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  ) |>
  row_spec(c(1, 5, 14), bold = TRUE, background = "#f0f0f0")
Table 8.14: Descriptive summary of confounders by Extra Magic Morning status in the 2018 analytic dataset.
Characteristic EMH Morning (N=60) No EMH (N=302)
Ticket season
peak 18 63
regular 35 161
value 7 78
Close time
18:00:00 18 39
20:00:00 2 19
22:00:00 11 93
23:00:00 11 81
24:00:00 17 40
25:00:00 1 1
16:30:00 0 1
21:00:00 0 28
Temperature
Median historic high temp (F) 82.8 84.1

8.6 Missing Data

Missingness in the analytic dataset is concentrated in wait_minutes_actual_avg. Posted wait times are missing for about 3% of the year (8 additional days beyond the 3 with no records at all), which is unlikely to meaningfully distort results. Actual wait times are far more incomplete, but since the outcome is posted wait time, that is a separate problem for Chapter 15.

Show code
seven_dwarfs_9 |>
  summarize(across(everything(), \(x) mean(is.na(x)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Proportion Missing") |>
  filter(`Proportion Missing` > 0) |>
  arrange(desc(`Proportion Missing`)) |>
  mutate(`Proportion Missing` = round(`Proportion Missing`, 3)) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.15: Share of missing values by variable in the analytic dataset.
Variable Proportion Missing
wait_minutes_actual_avg 0.594
wait_minutes_posted_avg 0.022
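
The same pattern can be inspected visually with visdat::vis_miss() (a sketch, assuming the visdat package is installed; in the chapter's context the real seven_dwarfs_9 would replace the toy frame):

```r
library(visdat)

# Toy frame with the same missingness shape as the analytic data:
# actual waits mostly missing, posted waits nearly complete.
toy <- data.frame(
  wait_minutes_actual_avg = c(NA, 16, NA, NA),
  wait_minutes_posted_avg = c(60, 60, NA, 70)
)
vis_miss(toy)  # heatmap of missing vs. present cells, by variable
```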

On January 24th, for example, nine records all showed NA for both wait time types in the 9 AM window:

Show code
seven_dwarfs_train |>
  filter(park_date == "2018-01-24", hour(wait_datetime) == 9) |>
  select(starts_with("wait_minutes")) |>
  kbl(col.names = c("Actual Wait (min)", "Posted Wait (min)")) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
Table 8.16: Nine records from January 24, 2018 during the 9 AM hour: all wait times are missing.
Actual Wait (min) Posted Wait (min)
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA

8.7 Checking Causal Assumptions

Data cannot prove the assumptions underlying causal inference, but it can reveal cracks in them worth worrying about.

Exchangeability gets its own treatment in Chapter 9 (propensity score evaluation), so we skip it here.

Consistency means the exposure must be well-defined. If the question had been about “Extra Magic Hours” in general rather than specifically in the morning, there would be a consistency problem: morning and evening Extra Magic Hours might have different effects. Specifying “morning” in the question sidesteps that issue.

Positivity requires that every combination of confounder values contains both exposed and unexposed days. This is where the data can say something concrete.

Single-variable positivity checks

Show code
seven_dwarfs_9 |>
  mutate(
    park_close_chr = as.character(park_close),
    emh_label = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")
  ) |>
  ggplot(aes(
    x = park_close_chr,
    fill = emh_label
  )) +
  geom_bar(position = "fill", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    x = "Close time",
    y = "Proportion of days",
    fill = "Extra Magic Morning",
    title = "Park Close Time by Exposure Status"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Stacked bar chart showing proportion of EMH vs no-EMH by close time.
Figure 8.6: Distribution of park close time by Extra Magic Morning status. Close times of 16:30 and 21:00 had zero exposed days.

Close times of 16:30 and 21:00 had zero exposed days:

Show code
seven_dwarfs_9 |>
  count(park_close, park_extra_magic_morning) |>
  complete(park_close, park_extra_magic_morning, fill = list(n = 0)) |>
  filter(park_close %in% parse_hm(c("16:30", "21:00"))) |>
  mutate(
    park_close = as.character(park_close),
    park_extra_magic_morning = if_else(park_extra_magic_morning == 1, "EMH", "No EMH")
  ) |>
  kbl(col.names = c("Close Time", "Exposure", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
Table 8.17: Day counts by close time and Extra Magic Morning status. Close times 16:30 and 21:00 have no exposed days.
Close Time Exposure Days
16:30:00 No EMH 1
16:30:00 EMH 0
21:00:00 No EMH 28
21:00:00 EMH 0
Show code
seven_dwarfs_9 |>
  mutate(emh_label = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")) |>
  ggplot(aes(
    x = park_temperature_high,
    group = emh_label,
    fill = emh_label
  )) +
  geom_mirror_histogram(bins = 20, alpha = 0.85) +
  scale_y_continuous(labels = abs) +
  scale_fill_okabe_ito() +
  labs(
    x = "Historic high temperature (F)",
    y = "Count",
    fill = "Exposure",
    title = "Temperature Distribution by Extra Magic Morning Status"
  ) +
  theme_minimal(base_size = 13)
Mirrored histogram showing temperature distribution for exposed vs unexposed days.
Figure 8.7: Mirrored histogram of historic high temperature by Extra Magic Morning status. Very few exposed days fall below 60 degrees F.

Only one exposed day has a historic high below 60 degrees:

Show code
seven_dwarfs_9 |>
  filter(park_temperature_high < 60) |>
  count(park_extra_magic_morning) |>
  mutate(
    park_extra_magic_morning = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")
  ) |>
  kbl(col.names = c("Exposure", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
Table 8.18: Number of days below 60 F by Extra Magic Morning status. Only one exposed day falls in this range.
Exposure Days
No EMH 9
EMH Morning 1
Show code
seven_dwarfs_9 |>
  mutate(emh_label = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")) |>
  ggplot(aes(x = park_ticket_season, fill = emh_label)) +
  geom_bar(position = "dodge", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    x = "Ticket season",
    y = "Days",
    fill = "Exposure",
    title = "Ticket Season by Extra Magic Morning Status"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Dodged bar chart showing ticket season counts by exposure status.
Figure 8.8: Distribution of ticket season by Extra Magic Morning status. All three ticket season types appear in both exposure groups, so no positivity violations are visible here.

Ticket season shows no positivity problems: all three levels appear in both groups.

Multi-variable positivity check

With only three confounders, we can check all combinations directly. Temperature is cut into three equal-width bins and close time is binned into three categories.

Show code
prop_exposed <- seven_dwarfs_9 |>
  mutate(
    park_temperature_high_bin = cut(park_temperature_high, breaks = 3),
    park_close_bin = case_when(
      hour(park_close) < 19 & hour(park_close) > 12 ~ "(1) early",
      hour(park_close) >= 19 & hour(park_close) < 24 ~ "(2) standard",
      hour(park_close) >= 24 | hour(park_close) < 12 ~ "(3) late"
    )
  ) |>
  summarize(
    prop_exposed = mean(park_extra_magic_morning),
    .by = c(park_close_bin, park_temperature_high_bin, park_ticket_season)
  ) |>
  complete(
    park_close_bin,
    park_temperature_high_bin,
    park_ticket_season,
    fill = list(prop_exposed = 0)
  )

prop_exposed |>
  ggplot(aes(
    x = park_close_bin,
    y = park_temperature_high_bin,
    fill = prop_exposed
  )) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(begin = 0.1, end = 0.9, option = "D") +
  facet_wrap(~park_ticket_season) +
  labs(
    x = "Park close time",
    y = "Historic high temperature (F)",
    fill = "Proportion\nexposed",
    title = "Positivity Check Across Three Confounders"
  ) +
  theme_minimal(base_size = 12) +
  theme(panel.grid = element_blank())
Tile heatmap faceted by ticket season, showing proportion exposed.
Figure 8.9: Proportion of days with Extra Magic Mornings across bins of temperature, close time, and ticket season. Dark cells are combinations with no exposed days; yellow cells are fully exposed.

The fully exposed cell in the peak / early-close / cold-temperature corner almost certainly corresponds to the Christmas and New Year period: cold Florida days during peak holiday season that also close early. That combination appears to have always had Extra Magic Mornings, which points to a structural positivity violation rather than a chance one.

Show code
prop_exposed |>
  filter(prop_exposed %in% c(0, 1)) |>
  arrange(desc(prop_exposed), park_close_bin, park_temperature_high_bin) |>
  mutate(prop_exposed = if_else(prop_exposed == 1, "1 (always exposed)", "0 (never exposed)")) |>
  kbl(
    col.names = c("Close Time", "Temperature Bin", "Ticket Season", "Proportion Exposed")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.19: Covariate combinations where the proportion exposed is either 0 or 1. These represent potential positivity violations.
Close Time Temperature Bin Ticket Season Proportion Exposed
(1) early (51.1,65.2] peak 1 (always exposed)
(1) early (51.1,65.2] value 0 (never exposed)
(1) early (65.2,79.3] peak 0 (never exposed)
(1) early (65.2,79.3] value 0 (never exposed)
(1) early (79.3,93.4] peak 0 (never exposed)
(2) standard (51.1,65.2] peak 0 (never exposed)
(2) standard (51.1,65.2] regular 0 (never exposed)
(3) late (51.1,65.2] value 0 (never exposed)
(3) late (65.2,79.3] value 0 (never exposed)
(3) late (79.3,93.4] value 0 (never exposed)

Ten covariate combinations are either always exposed or never exposed. Whether these are structural or stochastic violations is a question for domain knowledge, not the data.
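
The joint check generalizes to a small helper: for any set of already-discretized confounders, list the cells where the exposure proportion is exactly 0 or 1. A dplyr sketch (flag_positivity is a hypothetical name; note it skips the complete() step, so combinations with no observed days at all are not listed):

```r
library(dplyr)

# Hypothetical helper: flag covariate cells that are never or always
# exposed. `exposure` must be a 0/1 column; `...` are already-discretized
# confounders to cross.
flag_positivity <- function(data, exposure, ...) {
  data |>
    group_by(...) |>
    summarize(
      n_days = n(),
      prop_exposed = mean({{ exposure }}),
      .groups = "drop"
    ) |>
    filter(prop_exposed %in% c(0, 1))
}

# Toy usage: the "peak" cell is always exposed.
toy <- tibble(
  season = c("peak", "peak", "value", "value"),
  emh = c(1, 1, 0, 1)
)
flag_positivity(toy, emh, season)
```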


Summary

This chapter sets up the analytic dataset that carries through the next several chapters. A few things are worth holding onto:

  • Causal data prep ties each wrangling step to a protocol element. filter() handles eligibility; mutate() handles exposure and outcome definitions; joins handle multi-source data.
  • The “Column Names as Contracts” convention keeps variable provenance legible.
  • Missing data in posted wait times is small (about 3%) and unlikely to drive results. Missing actual wait times are a bigger issue addressed later.
  • Positivity deserves careful attention. Close times of 16:30 and 21:00 have no exposed days; the Christmas period cluster looks like a structural violation for the cold/peak/early-close combination.
  • These observations do not kill the analysis, but they shape how cautiously we should interpret results at the boundary of the covariate space.