Preparing Data to Answer Causal Questions

Overview

Data preparation for causal inference looks familiar on the surface: you clean things up, do some EDA, maybe merge a couple of datasets. What makes it different is that every decision you make during prep needs to connect back to the causal question and the target trial you are trying to emulate. This chapter walks through that process using wait time data from Disney World.

The chapter covers seven areas: getting to know the data, framing the causal question, linking protocol steps to tidyverse code, joining multiple sources, building a descriptive table, flagging missing data, and checking the positivity assumption.


8.1 The Data

The touringplans package contains attraction wait time data from the Walt Disney World theme parks. There are 14 attractions in the package.

Show code
attractions_metadata |>
  kbl(
    col.names = c(
      "Dataset", "Name", "Short Name", "Park",
      "Land", "Opened", "Duration (min)", "Avg Wait / 100"
    ),
    digits = 1
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.1: Attractions available in the touringplans package.
Dataset Name Short Name Park Land Opened Duration (min) Avg Wait / 100
alien_saucers Alien Swirling Saucers Alien Saucers Disney's Hollywood Studios Toy Story Land 2018-06-30 2.5 10.0
dinosaur DINOSAUR DINOSAUR Disney's Animal Kingdom DinoLand U.S.A. 1998-04-22 3.5 3.0
expedition_everest Expedition Everest - Legend of the Forbidden Mountain Expedition Everest Disney's Animal Kingdom Asia 2006-04-07 4.0 4.0
flight_of_passage Avatar Flight of Passage Flight of Passage Disney's Animal Kingdom Pandora - The World of Avatar 2017-05-27 6.0 4.0
kilimanjaro_safaris Kilimanjaro Safaris Kilimanjaro Safaris Disney's Animal Kingdom Africa 1998-04-22 20.0 4.0
navi_river Na'vi River Journey Na'vi River Disney's Animal Kingdom Pandora - The World of Avatar 2017-05-27 5.0 5.0
pirates_of_caribbean Pirates of the Caribbean Pirates of Caribbean Magic Kingdom Adventureland 1973-12-17 7.5 1.5
rock_n_rollercoaster Rock 'n' Roller Coaster Starring Aerosmith Rock Coaster Disney's Hollywood Studios Sunset Boulevard 1999-07-29 1.5 2.5
seven_dwarfs_train Seven Dwarfs Mine Train 7 Dwarfs Train Magic Kingdom Fantasyland 2014-05-28 3.0 5.0
slinky_dog Slinky Dog Dash Slinky Dog Disney's Hollywood Studios Toy Story Land 2018-06-30 3.0 5.0
soarin Soarin' Around the World Soarin' Epcot World Nature 2005-05-05 8.0 3.0
spaceship_earth Spaceship Earth Spaceship Earth Epcot World Celebration 1982-10-01 16.0 3.0
splash_mountain Splash Mountain Splash Mountain Magic Kingdom Frontierland 1992-07-17 18.0 3.5
toy_story_mania Toy Story Mania! Toy Story Mania! Disney's Hollywood Studios Toy Story Land 2008-05-31 6.5 4.5

Alongside the attraction data, parks_metadata_raw has 181 columns of daily park-level information: ticket season, historic temperature, special events, and more.

Show code
parks_metadata_raw |>
  count(year, mkemhmorn) |>
  pivot_wider(names_from = mkemhmorn, values_from = n, names_prefix = "emh_") |>
  rename(Year = year, `No EMH` = emh_0, `EMH Morning` = emh_1) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.2: Days with and without Extra Magic Hours in the morning (mkemhmorn), by year.
Year No EMH EMH Morning
2015 290 75
2016 300 66
2017 293 72
2018 304 61
2019 250 115
2020 135 14
2021 104 NA

The 2020 and 2021 data are truncated due to COVID-19 park closures; no Extra Magic Morning days appear at all in the 2021 records.

Show code
parks_metadata_raw |>
  count(year, name = "days") |>
  kbl(col.names = c("Year", "Days Observed")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.3: Total observation days per year in the parks metadata.
Year Days Observed
2015 365
2016 366
2017 365
2018 365
2019 365
2020 149
2021 104

The Seven Dwarfs Mine Train data has over 320,000 rows, with one row per wait time record (either posted or actual, with the other as NA).
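
That row structure can be spot-checked directly: each row should have at most one of the two wait columns populated (rows with both missing also occur). A base-R sketch on a toy frame mirroring the real column names:

```r
# Toy frame mirroring the real columns: each row carries a posted wait,
# an actual wait, or neither -- never both.
toy <- data.frame(
  wait_minutes_actual = c(12, NA, NA),
  wait_minutes_posted = c(NA, 45, NA)
)

# Count how many of the two wait columns are populated in each row.
n_filled <- rowSums(!is.na(toy[c("wait_minutes_actual", "wait_minutes_posted")]))
stopifnot(all(n_filled <= 1))  # at most one wait type per row
```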

Show code
seven_dwarfs_train |>
  reframe(
    across(
      c(park_date, starts_with("wait_minutes")),
      \(.x) range(.x, na.rm = TRUE)
    )
  ) |>
  mutate(bound = c("Min", "Max")) |>
  relocate(bound) |>
  kbl(
    col.names = c("Bound", "Park Date", "Actual Wait (min)", "Posted Wait (min)")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.4: Range of dates and wait times in the Seven Dwarfs Mine Train dataset.
Bound Park Date Actual Wait (min) Posted Wait (min)
Min 2015-01-01 -92918 0
Max 2021-12-28 217 300

The minimum actual wait time of -92,918 minutes is an obvious data error. Strip it before doing anything else.

Show code
seven_dwarfs_train <- seven_dwarfs_train |>
  filter(wait_minutes_actual >= 0 | is.na(wait_minutes_actual))

Posted wait times are always multiples of 5, which is a useful sanity check:

Show code
seven_dwarfs_train |>
  pull(wait_minutes_posted) |>
  unique() |>
  sort() |>
  matrix(nrow = 5) |>
  as.data.frame() |>
  kbl(col.names = NULL) |>
  kable_styling(
    bootstrap_options = c("condensed"),
    full_width = FALSE
  )
Table 8.5: Unique posted wait time values (minutes). All are multiples of 5; the final 0 in the bottom-right corner is padding recycled by matrix().
0 25 50 75 100 125 150 175 200 225 260
5 30 55 80 105 130 155 180 205 230 270
10 35 60 85 110 135 160 185 210 235 280
15 40 65 90 115 140 165 190 215 240 300
20 45 70 95 120 145 170 195 220 250 0
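
The check can also be made programmatic rather than visual; a base-R sketch with a toy vector standing in for wait_minutes_posted:

```r
# Assert that every non-missing posted wait is a multiple of 5.
posted <- c(0, 5, 30, 115, NA, 300)  # toy stand-in for wait_minutes_posted
stopifnot(all(posted %% 5 == 0, na.rm = TRUE))
```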

The two wait time distributions look quite different. Actual times are shorter on average and roughly right-skewed; posted times are discretized to 5-minute increments and slightly heavier in the right tail.

Show code
seven_dwarfs_train |>
  pivot_longer(
    starts_with("wait_minutes"),
    names_to = "wait_type",
    values_to = "wait_minutes"
  ) |>
  mutate(
    wait_type = case_when(
      wait_type == "wait_minutes_actual" ~ "Actual",
      wait_type == "wait_minutes_posted" ~ "Posted"
    )
  ) |>
  ggplot(aes(wait_minutes, fill = wait_type)) +
  geom_density(color = NA, alpha = 0.85) +
  facet_wrap(~wait_type, scales = "free_y") +
  scale_fill_okabe_ito() +
  labs(
    x = "Wait time (minutes)",
    y = "Density",
    fill = "Wait type",
    title = "Posted vs. Actual Wait Times",
    subtitle = "Seven Dwarfs Mine Train, all years"
  ) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")
Two density plots side by side showing posted and actual wait time distributions.
Figure 8.1: Distribution of posted and actual wait times for the Seven Dwarfs Mine Train. Posted times are rounded to 5-minute increments, while actual times are shorter on average and less discretized.

8.2 The Causal Question

The causal question driving the next several chapters is:

Does having Extra Magic Hours in the morning at Magic Kingdom causally increase the average posted wait time for the Seven Dwarfs Mine Train between 9 AM and 10 AM on the same day in 2018?

The exposure is a binary day-level indicator: did Magic Kingdom run Extra Magic Hours in the morning (yes or no)? The outcome is the average posted wait time during the 9-10 AM window.

DAG

The proposed DAG encodes the belief that Extra Magic Morning status and average wait times share three common causes: ticket season, park close time, and historic high temperature. Each of these opens a backdoor path that needs to be blocked.

Show code
coord_dag <- list(
  x = c(Season = 0, close = 0, weather = -1, x = 1, y = 2),
  y = c(Season = -1, close = 1, weather = 0, x = 0, y = 0)
)

labels <- c(
  x = "Extra Magic\nMorning",
  y = "Average wait",
  Season = "Ticket Season",
  weather = "Historic high\ntemperature",
  close = "Time park\nclosed"
)

dagify(
  y ~ x + close + Season + weather,
  x ~ weather + close + Season,
  coords = coord_dag,
  labels = labels,
  exposure = "x",
  outcome = "y"
) |>
  tidy_dagitty() |>
  node_status() |>
  ggplot(aes(x, y, xend = xend, yend = yend, color = status)) +
  geom_dag_edges_arc(curvature = c(rep(0, 5), 0.3)) +
  geom_dag_point() +
  geom_dag_label_repel(aes(label = label), seed = 1630) +
  scale_color_okabe_ito(na.value = "grey90") +
  theme_dag() +
  theme(legend.position = "none") +
  coord_cartesian(clip = "off")
DAG with five nodes: Extra Magic Morning, Average wait, Ticket Season, Time park closed, and Historic high temperature.
Figure 8.2: Proposed DAG for the effect of Extra Magic Hours in the morning on average Seven Dwarfs Mine Train wait times between 9 and 10 AM. Ticket season, park close time, and historic high temperature are confounders.
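
The adjustment set implied by the DAG can also be computed directly with the dagitty package (a sketch, assuming dagitty is installed; node names match the code above):

```r
library(dagitty)

# Same structure as the DAG above: three confounders, each a common
# cause of exposure (x) and outcome (y).
dag <- dagitty("dag {
  Season -> x  Season -> y
  close -> x   close -> y
  weather -> x weather -> y
  x -> y
}")

# The single minimal adjustment set contains all three confounders.
adjustmentSets(dag, exposure = "x", outcome = "y")
```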

8.3 From Protocol to Code

The target trial protocol maps cleanly onto tidyverse operations. The table below shows that correspondence.

Show code
tibble(
  `Protocol Element` = c(
    "Eligibility criteria",
    "Exposure definition",
    "Assignment procedures",
    "Follow-up period",
    "Outcome definition",
    "Analysis plan"
  ),
  `tidyverse Function(s)` = c(
    "`filter()`",
    "`mutate()`",
    "`mutate()`, `select()`",
    "`mutate()`, `pivot_longer()`, `pivot_wider()`",
    "`mutate()`",
    "`select()`, `mutate()`, `*_join()`"
  )
) |>
  kbl(escape = FALSE) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.6: Mapping target trial protocol elements to tidyverse functions.
Protocol Element tidyverse Function(s)
Eligibility criteria `filter()`
Exposure definition `mutate()`
Assignment procedures `mutate()`, `select()`
Follow-up period `mutate()`, `pivot_longer()`, `pivot_wider()`
Outcome definition `mutate()`
Analysis plan `select()`, `mutate()`, `*_join()`

The full target trial protocol for this study looks like this:

Show code
tibble(
  `Protocol Step` = c(
    "Eligibility criteria",
    "Exposure definition",
    "Assignment procedures",
    "Follow-up period",
    "Outcome definition",
    "Causal contrast",
    "Analysis plan"
  ),
  `Target Trial` = c(
    "Days from 2018.",
    "Exposed: Magic Kingdom had Extra Magic Hours in the morning.",
    "Random assignment, 50% probability, non-blinded.",
    "Park open until 10 AM on the same day.",
    "Average posted wait time for Seven Dwarfs Mine Train, 9-10 AM.",
    "Average Treatment Effect (ATE).",
    "ATE via IPW, weighted for temperature, ticket season, close time."
  ),
  `Observational Emulation` = c(
    "Same.",
    "Same.",
    "Days assigned per observed data. Randomization emulated via confounding adjustment.",
    "Same.",
    "Same.",
    "Same.",
    "Same. Confounders from DAG in Figure 8.2."
  )
) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = TRUE
  ) |>
  column_spec(1, bold = TRUE, width = "12em") |>
  column_spec(2:3, width = "18em")
Table 8.7: Target trial protocol for the effect of Extra Magic Morning on Seven Dwarfs Mine Train wait times, and the corresponding observational emulation.
Protocol Step Target Trial Observational Emulation
Eligibility criteria Days from 2018. Same.
Exposure definition Exposed: Magic Kingdom had Extra Magic Hours in the morning. Same.
Assignment procedures Random assignment, 50% probability, non-blinded. Days assigned per observed data. Randomization emulated via confounding adjustment.
Follow-up period Park open until 10 AM on the same day. Same.
Outcome definition Average posted wait time for Seven Dwarfs Mine Train, 9-10 AM. Same.
Causal contrast Average Treatment Effect (ATE). Same.
Analysis plan ATE via IPW, weighted for temperature, ticket season, close time. Same. Confounders from DAG in Figure 8.2.

Building the outcome dataset

The outcome is the average posted wait time in the 9 AM hour, filtered to 2018.

Show code
seven_dwarfs_9 <- seven_dwarfs_train |>
  filter(year(park_date) == 2018) |>
  mutate(hour = hour(wait_datetime)) |>
  summarize(
    across(
      c(wait_minutes_posted, wait_minutes_actual),
      \(.x) mean(.x, na.rm = TRUE),
      .names = "{.col}_avg"
    ),
    .by = c(park_date, hour)
  ) |>
  mutate(
    across(
      c(wait_minutes_posted_avg, wait_minutes_actual_avg),
      \(.x) if_else(is.nan(.x), NA, .x)
    )
  ) |>
  filter(hour == 9)
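
The NaN-to-NA step above exists because of a base-R quirk: taking the mean of an all-NA group with na.rm = TRUE means averaging an empty vector, which returns NaN rather than NA:

```r
# mean() of an empty vector (all values removed as NA) returns NaN.
mean(c(NA_real_, NA_real_), na.rm = TRUE)
#> NaN
```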
Show code
seven_dwarfs_9 |>
  head(10) |>
  kbl(
    col.names = c("Date", "Hour", "Avg Posted Wait (min)", "Avg Actual Wait (min)"),
    digits = 1
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.8: First 10 rows of the outcome dataset: average posted and actual wait times for the 9 AM hour in 2018.
Date Hour Avg Posted Wait (min) Avg Actual Wait (min)
2018-01-01 9 60.0 NA
2018-01-02 9 60.0 NA
2018-01-03 9 60.0 16.0
2018-01-04 9 68.9 NA
2018-01-05 9 70.6 33.0
2018-01-06 9 33.3 25.3
2018-01-07 9 46.4 32.0
2018-01-08 9 69.5 8.0
2018-01-09 9 64.3 47.0
2018-01-10 9 74.3 NA

8.4 Joining Multiple Data Sources

The exposure and confounders are in parks_metadata_raw. Clean and filter to 2018 first, then join.

Show code
parks_metadata <- parks_metadata_raw |>
  select(
    park_date = date,
    park_extra_magic_morning = mkemhmorn,
    park_ticket_season = wdw_ticket_season,
    park_temperature_high = wdwmaxtemp,
    park_close = mkclose
  ) |>
  filter(year(park_date) == 2018)

Variable names follow the “Column Names as Contracts” convention (Riederer 2020): variables from park-level metadata are prefixed with park_, wait time variables with wait_.
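
The prefixing can also be applied mechanically rather than column by column; a dplyr sketch on a hypothetical two-column metadata frame:

```r
library(dplyr)

meta <- tibble(ticket_season = "peak", close = "22:00:00")

# Prepend park_ to every column name at once.
meta |> rename_with(\(nm) paste0("park_", nm))
# columns become park_ticket_season and park_close
```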

Extra Magic Mornings by month

About 12 to 16% of days had Extra Magic Mornings in most months. December stands out at 42%.

Show code
parks_metadata |>
  mutate(month_num = month(park_date)) |>
  summarize(
    prop_emh = sum(park_extra_magic_morning) / n(),
    .by = month_num
  ) |>
  arrange(month_num) |>
  mutate(month_label = month(month_num, label = TRUE, abbr = FALSE)) |>
  select(Month = month_label, `Proportion EMH` = prop_emh) |>
  kbl(digits = 3) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.9: Proportion of 2018 days with Extra Magic Hours in the morning, by calendar month.
Month Proportion EMH
January 0.129
February 0.143
March 0.161
April 0.133
May 0.129
June 0.167
July 0.129
August 0.161
September 0.133
October 0.161
November 0.133
December 0.419

Confounders

Not every ticket season type appears in every month:

Show code
count_by_month <- function(data, .var) {
  data |>
    mutate(month = month(park_date, label = TRUE, abbr = TRUE)) |>
    count(month, {{ .var }}) |>
    complete(month, {{ .var }}, fill = list(n = 0))
}

ticket_season_by_month <- parks_metadata |>
  count_by_month(park_ticket_season)
Show code
ticket_season_by_month |>
  filter(n <= 4) |>
  arrange(n, park_ticket_season) |>
  kbl(col.names = c("Month", "Ticket Season", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.10: Months with zero or very few days of a particular ticket season. Peak tickets were absent in August and September; value tickets were absent in June, July, and December.
Month Ticket Season Days
Aug peak 0
Sep peak 0
Jun value 0
Jul value 0
Dec value 0
Jan peak 3
May value 3
Feb peak 4
Jul peak 4
Oct peak 4
Show code
ticket_season_by_month |>
  ggplot(aes(month, n, fill = park_ticket_season)) +
  geom_col(position = "fill", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    y = "Proportion of days",
    x = NULL,
    fill = "Ticket season",
    title = "Ticket Season by Month (2018)"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Stacked bar chart showing ticket season proportions by month.
Figure 8.3: Proportion of days by ticket season across 2018. Regular season dominates the summer months; peak season clusters around school holidays.

Close times show more variety than you might expect. Values of 24:00 and 25:00 encode closings at midnight and 1 AM the following day:

Show code
parks_metadata |>
  count(park_close, sort = TRUE) |>
  kbl(col.names = c("Close Time", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.11: Count of 2018 days by Magic Kingdom park close time.
Close Time Days
22:00:00 105
23:00:00 93
24:00:00 58
18:00:00 57
21:00:00 28
20:00:00 21
25:00:00 2
16:30:00 1
Show code
parks_metadata |>
  count_by_month(park_close) |>
  ggplot(aes(month, n, fill = ordered(park_close))) +
  geom_col(position = "fill", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    y = "Proportion of days",
    x = NULL,
    fill = "Close time",
    title = "Park Close Time by Month (2018)"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Stacked bar chart of park close time by month.
Figure 8.4: Proportion of days by close time, across months. Earlier close times concentrate in late fall and winter.
Show code
parks_metadata |>
  mutate(month = month(park_date, label = TRUE, abbr = TRUE)) |>
  ggplot(aes(month, park_temperature_high)) +
  geom_jitter(height = 0, width = 0.15, alpha = 0.5, color = "#0072B2") +
  labs(
    y = "Historic high temperature (F)",
    x = NULL,
    title = "Daily High Temperature by Month (2018)"
  ) +
  theme_minimal(base_size = 13)
Jitter plot of historic high temperature by month.
Figure 8.5: Historic high temperature at Walt Disney World by month. The park is in Florida, so temperatures stay relatively warm year-round, but summers get properly hot.

Joining the datasets

Three May dates are missing from seven_dwarfs_9 (no 9 AM records in the raw data):

Show code
parks_metadata |>
  anti_join(seven_dwarfs_9, by = "park_date") |>
  mutate(park_close = as.character(park_close)) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.12: Days in parks metadata with no matching 9 AM wait time records. Three consecutive days in May 2018 are absent.
park_date park_extra_magic_morning park_ticket_season park_temperature_high park_close
2018-05-10 0 regular 85.97 22:00:00
2018-05-11 1 regular 91.73 23:00:00
2018-05-12 0 peak 78.56 24:00:00

Left-join on park_date to build the analytic dataset:

Show code
seven_dwarfs_9 <- seven_dwarfs_9 |>
  left_join(parks_metadata, by = "park_date")
Show code
seven_dwarfs_9 |>
  head(10) |>
  mutate(park_close = as.character(park_close)) |>
  kbl(digits = 1) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  ) |>
  scroll_box(width = "100%")
Table 8.13: First 10 rows of the analytic dataset after joining wait times with park metadata.
park_date hour wait_minutes_posted_avg wait_minutes_actual_avg park_extra_magic_morning park_ticket_season park_temperature_high park_close
2018-01-01 9 60.0 NA 0 peak 58.6 23:00:00
2018-01-02 9 60.0 NA 0 peak 53.6 24:00:00
2018-01-03 9 60.0 16.0 0 peak 51.1 24:00:00
2018-01-04 9 68.9 NA 0 regular 52.7 24:00:00
2018-01-05 9 70.6 33.0 1 regular 54.3 24:00:00
2018-01-06 9 33.3 25.3 0 regular 56.2 23:00:00
2018-01-07 9 46.4 32.0 0 regular 65.2 21:00:00
2018-01-08 9 69.5 8.0 0 value 70.8 20:00:00
2018-01-09 9 64.3 47.0 0 value 75.2 20:00:00
2018-01-10 9 74.3 NA 0 value 74.2 20:00:00

8.5 Descriptive Table

The table below summarizes the three confounders stratified by exposure status. More days fell into the unexposed group (no Extra Magic Mornings). The regular ticket season was the most common in both groups. Temperatures look roughly similar across groups, though Extra Magic Morning days were noticeably more likely to close early at 18:00 (30% vs. 13% of days) and were slightly cooler overall.

Because gtsummary output does not convert cleanly to kable in all contexts, here is an equivalent kable version built manually:

Show code
desc <- seven_dwarfs_9 |>
  mutate(
    emh = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH"),
    park_close_chr = as.character(park_close)
  )

ticket_tbl <- desc |>
  count(emh, park_ticket_season) |>
  pivot_wider(names_from = emh, values_from = n, values_fill = 0) |>
  rename(Characteristic = park_ticket_season)

close_tbl <- desc |>
  count(emh, park_close_chr) |>
  pivot_wider(names_from = emh, values_from = n, values_fill = 0) |>
  rename(Characteristic = park_close_chr)

temp_tbl <- desc |>
  summarize(
    `EMH Morning`  = round(median(park_temperature_high[emh == "EMH Morning"],  na.rm = TRUE), 1),
    `No EMH`       = round(median(park_temperature_high[emh == "No EMH"], na.rm = TRUE), 1)
  ) |>
  mutate(Characteristic = "Median historic high temp (F)") |>
  relocate(Characteristic)

bind_rows(
  tibble(Characteristic = "Ticket season", `EMH Morning` = NA_integer_, `No EMH` = NA_integer_),
  ticket_tbl,
  tibble(Characteristic = "Close time", `EMH Morning` = NA_integer_, `No EMH` = NA_integer_),
  close_tbl,
  tibble(Characteristic = "Temperature", `EMH Morning` = NA_integer_, `No EMH` = NA_integer_),
  temp_tbl
) |>
  mutate(across(c(`EMH Morning`, `No EMH`), \(x) replace_na(as.character(x), ""))) |>
  kbl(col.names = c("Characteristic", "EMH Morning (N=60)", "No EMH (N=302)")) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  ) |>
  row_spec(c(1, 5, 14), bold = TRUE, background = "#f0f0f0")
Table 8.14: Descriptive summary of confounders by Extra Magic Morning status in the 2018 analytic dataset.
Characteristic EMH Morning (N=60) No EMH (N=302)
Ticket season
peak 18 63
regular 35 161
value 7 78
Close time
18:00:00 18 39
20:00:00 2 19
22:00:00 11 93
23:00:00 11 81
24:00:00 17 40
25:00:00 1 1
16:30:00 0 1
21:00:00 0 28
Temperature
Median historic high temp (F) 82.8 84.1

8.6 Missing Data

Missingness in the analytic dataset is concentrated in wait_minutes_actual_avg. Posted wait times are missing for about 3% of the year (8 additional days beyond the 3 with no records at all), which is unlikely to meaningfully distort results. Actual wait times are far more incomplete, but since the outcome is posted wait time, that is a separate problem for Chapter 15.

Show code
seven_dwarfs_9 |>
  summarize(across(everything(), \(x) mean(is.na(x)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Proportion Missing") |>
  filter(`Proportion Missing` > 0) |>
  arrange(desc(`Proportion Missing`)) |>
  mutate(`Proportion Missing` = round(`Proportion Missing`, 3)) |>
  kbl() |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.15: Share of missing values by variable in the analytic dataset.
Variable Proportion Missing
wait_minutes_actual_avg 0.594
wait_minutes_posted_avg 0.022
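
The same pattern can be inspected visually with visdat::vis_miss() (a sketch, assuming the visdat package is installed; in the chapter's context the real seven_dwarfs_9 would replace the toy frame):

```r
library(visdat)

# Toy frame with the same missingness shape as the analytic data:
# actual waits mostly missing, posted waits nearly complete.
toy <- data.frame(
  wait_minutes_actual_avg = c(NA, 16, NA, NA),
  wait_minutes_posted_avg = c(60, 60, NA, 70)
)
vis_miss(toy)  # heatmap of missing vs. present cells, by variable
```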

On January 24th, for example, nine records all showed NA for both wait time types in the 9 AM window:

Show code
seven_dwarfs_train |>
  filter(park_date == "2018-01-24", hour(wait_datetime) == 9) |>
  select(starts_with("wait_minutes")) |>
  kbl(col.names = c("Actual Wait (min)", "Posted Wait (min)")) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
Table 8.16: Nine records from January 24, 2018 during the 9 AM hour: all wait times are missing.
Actual Wait (min) Posted Wait (min)
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA
NA NA

8.7 Checking Causal Assumptions

Data cannot prove the assumptions underlying causal inference, but it can reveal cracks in them worth worrying about.

Exchangeability gets its own treatment in Chapter 9 (propensity score evaluation), so we skip it here.

Consistency means the exposure must be well-defined. If the question had been about “Extra Magic Hours” in general rather than specifically in the morning, there would be a consistency problem: morning and evening Extra Magic Hours might have different effects. Specifying “morning” in the question sidesteps that issue.

Positivity requires that every combination of confounder values contains both exposed and unexposed days. This is where the data can say something concrete.

Single-variable positivity checks

Show code
seven_dwarfs_9 |>
  mutate(
    park_close_chr = as.character(park_close),
    emh_label = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")
  ) |>
  ggplot(aes(
    x = park_close_chr,
    fill = emh_label
  )) +
  geom_bar(position = "fill", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    x = "Close time",
    y = "Proportion of days",
    fill = "Extra Magic Morning",
    title = "Park Close Time by Exposure Status"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Stacked bar chart showing proportion of EMH vs no-EMH by close time.
Figure 8.6: Distribution of park close time by Extra Magic Morning status. Close times of 16:30 and 21:00 had zero exposed days.

Close times of 16:30 and 21:00 had zero exposed days:

Show code
seven_dwarfs_9 |>
  count(park_close, park_extra_magic_morning) |>
  complete(park_close, park_extra_magic_morning, fill = list(n = 0)) |>
  filter(park_close %in% parse_hm(c("16:30", "21:00"))) |>
  mutate(
    park_close = as.character(park_close),
    park_extra_magic_morning = if_else(park_extra_magic_morning == 1, "EMH", "No EMH")
  ) |>
  kbl(col.names = c("Close Time", "Exposure", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
Table 8.17: Day counts by close time and Extra Magic Morning status. Close times 16:30 and 21:00 have no exposed days.
Close Time Exposure Days
16:30:00 No EMH 1
16:30:00 EMH 0
21:00:00 No EMH 28
21:00:00 EMH 0
Show code
seven_dwarfs_9 |>
  mutate(emh_label = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")) |>
  ggplot(aes(
    x = park_temperature_high,
    group = emh_label,
    fill = emh_label
  )) +
  geom_mirror_histogram(bins = 20, alpha = 0.85) +
  scale_y_continuous(labels = abs) +
  scale_fill_okabe_ito() +
  labs(
    x = "Historic high temperature (F)",
    y = "Count",
    fill = "Exposure",
    title = "Temperature Distribution by Extra Magic Morning Status"
  ) +
  theme_minimal(base_size = 13)
Mirrored histogram showing temperature distribution for exposed vs unexposed days.
Figure 8.7: Mirrored histogram of historic high temperature by Extra Magic Morning status. Very few exposed days fall below 60 degrees F.

Only one exposed day has a historic high below 60 degrees:

Show code
seven_dwarfs_9 |>
  filter(park_temperature_high < 60) |>
  count(park_extra_magic_morning) |>
  mutate(
    park_extra_magic_morning = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")
  ) |>
  kbl(col.names = c("Exposure", "Days")) |>
  kable_styling(
    bootstrap_options = c("striped", "condensed"),
    full_width = FALSE
  )
Table 8.18: Number of days below 60 F by Extra Magic Morning status. Only one exposed day falls in this range.
Exposure Days
No EMH 9
EMH Morning 1
Show code
seven_dwarfs_9 |>
  mutate(emh_label = if_else(park_extra_magic_morning == 1, "EMH Morning", "No EMH")) |>
  ggplot(aes(x = park_ticket_season, fill = emh_label)) +
  geom_bar(position = "dodge", alpha = 0.85) +
  scale_fill_okabe_ito() +
  labs(
    x = "Ticket season",
    y = "Days",
    fill = "Exposure",
    title = "Ticket Season by Extra Magic Morning Status"
  ) +
  theme_minimal(base_size = 13) +
  theme(panel.grid.major.x = element_blank())
Dodged bar chart showing ticket season counts by exposure status.
Figure 8.8: Distribution of ticket season by Extra Magic Morning status. All three ticket season types appear in both exposure groups, so no positivity violations are visible here.

Ticket season shows no positivity problems: all three levels appear in both groups.

Multi-variable positivity check

With only three confounders, we can check all combinations directly. Temperature is cut into three equal-width bins and close time is binned into three categories.

Show code
prop_exposed <- seven_dwarfs_9 |>
  mutate(
    park_temperature_high_bin = cut(park_temperature_high, breaks = 3),
    park_close_bin = case_when(
      hour(park_close) < 19 & hour(park_close) > 12 ~ "(1) early",
      hour(park_close) >= 19 & hour(park_close) < 24 ~ "(2) standard",
      hour(park_close) >= 24 | hour(park_close) < 12 ~ "(3) late"
    )
  ) |>
  summarize(
    prop_exposed = mean(park_extra_magic_morning),
    .by = c(park_close_bin, park_temperature_high_bin, park_ticket_season)
  ) |>
  complete(
    park_close_bin,
    park_temperature_high_bin,
    park_ticket_season,
    fill = list(prop_exposed = 0)
  )

prop_exposed |>
  ggplot(aes(
    x = park_close_bin,
    y = park_temperature_high_bin,
    fill = prop_exposed
  )) +
  geom_tile(color = "white") +
  scale_fill_viridis_c(begin = 0.1, end = 0.9, option = "D") +
  facet_wrap(~park_ticket_season) +
  labs(
    x = "Park close time",
    y = "Historic high temperature (F)",
    fill = "Proportion\nexposed",
    title = "Positivity Check Across Three Confounders"
  ) +
  theme_minimal(base_size = 12) +
  theme(panel.grid = element_blank())
Tile heatmap faceted by ticket season, showing proportion exposed.
Figure 8.9: Proportion of days with Extra Magic Mornings across bins of temperature, close time, and ticket season. Dark cells are combinations with no exposed days; yellow cells are fully exposed.

The fully exposed cell in the peak / early-close / cold-temperature corner almost certainly corresponds to the Christmas and New Year period: cold Florida days during peak holiday season that also close early. That combination appears to have always had Extra Magic Mornings, which points to a structural positivity violation rather than a chance one.

Show code
prop_exposed |>
  filter(prop_exposed %in% c(0, 1)) |>
  arrange(desc(prop_exposed), park_close_bin, park_temperature_high_bin) |>
  mutate(prop_exposed = if_else(prop_exposed == 1, "1 (always exposed)", "0 (never exposed)")) |>
  kbl(
    col.names = c("Close Time", "Temperature Bin", "Ticket Season", "Proportion Exposed")
  ) |>
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed"),
    full_width = FALSE
  )
Table 8.19: Covariate combinations where the proportion exposed is either 0 or 1. These represent potential positivity violations.
Close Time Temperature Bin Ticket Season Proportion Exposed
(1) early (51.1,65.2] peak 1 (always exposed)
(1) early (51.1,65.2] value 0 (never exposed)
(1) early (65.2,79.3] peak 0 (never exposed)
(1) early (65.2,79.3] value 0 (never exposed)
(1) early (79.3,93.4] peak 0 (never exposed)
(2) standard (51.1,65.2] peak 0 (never exposed)
(2) standard (51.1,65.2] regular 0 (never exposed)
(3) late (51.1,65.2] value 0 (never exposed)
(3) late (65.2,79.3] value 0 (never exposed)
(3) late (79.3,93.4] value 0 (never exposed)

Ten covariate combinations are either always exposed or never exposed. Whether these are structural or stochastic violations is a question for domain knowledge, not the data.
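
The joint check generalizes to a small helper: for any set of already-discretized confounders, list the cells where the exposure proportion is exactly 0 or 1. A dplyr sketch (flag_positivity is a hypothetical name; note it skips the complete() step, so combinations with no observed days at all are not listed):

```r
library(dplyr)

# Hypothetical helper: flag covariate cells that are never or always
# exposed. `exposure` must be a 0/1 column; `...` are already-discretized
# confounders to cross.
flag_positivity <- function(data, exposure, ...) {
  data |>
    group_by(...) |>
    summarize(
      n_days = n(),
      prop_exposed = mean({{ exposure }}),
      .groups = "drop"
    ) |>
    filter(prop_exposed %in% c(0, 1))
}

# Toy usage: the "peak" cell is always exposed.
toy <- tibble(
  season = c("peak", "peak", "value", "value"),
  emh = c(1, 1, 0, 1)
)
flag_positivity(toy, emh, season)
```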


Summary

This chapter sets up the analytic dataset that carries through the next several chapters. A few things are worth holding onto:

  • Causal data prep ties each wrangling step to a protocol element. filter() handles eligibility; mutate() handles exposure and outcome definitions; joins handle multi-source data.
  • The “Column Names as Contracts” convention keeps variable provenance legible.
  • Missing data in posted wait times is small (about 3%) and unlikely to drive results. Missing actual wait times are a bigger issue addressed later.
  • Positivity deserves careful attention. Close times of 16:30 and 21:00 have no exposed days; the Christmas period cluster looks like a structural violation for the cold/peak/early-close combination.
  • These observations do not kill the analysis, but they shape how cautiously we should interpret results at the boundary of the covariate space.