DAGs

Overview

DAGs (Directed Acyclic Graphs) let researchers make their causal assumptions explicit and queryable. The chapter covers three things: what DAGs are and how their path types behave statistically, how to build and query them in R using ggdag, and practical advice on building defensible DAGs in applied work.


4.1 Visualizing Causal Assumptions

A DAG is a graph where nodes are variables and edges are directed arrows representing causal relationships. The direction of an arrow encodes the direction of causation: x -> y means x causes y. Crucially, the graph must be acyclic: no variable can be its own ancestor.

DAGs encode a researcher’s assumptions about causal structure, not parameter estimates. An arrow says “I believe this cause exists for at least one unit”; it says nothing about the magnitude or functional form of the effect. This non-parametric character distinguishes causal DAGs from structural equation models (SEMs), which require parametric assumptions and estimate the entire graph.

The three elemental path types are forks, chains, and colliders. Every path in a DAG, including every backdoor path, is built from these three structures.

Show code
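# Setup (assumed; the original setup chunk is not shown in this excerpt)
library(tidyverse)
library(ggdag)
library(patchwork)  # combines plots with `+`
library(broom)      # tidy() for model output
# data_theme is defined outside this excerpt; a plain stand-in is assumed here
if (!exists("data_theme")) data_theme <- theme_minimal()

coords <- list(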
  x = c(x = 0, y = 2, q = 1),
  y = c(x = 0, y = 0, q = 1)
)

fork     <- dagify(x ~ q, y ~ q, exposure = "x", outcome = "y", coords = coords)
chain    <- dagify(q ~ x, y ~ q, exposure = "x", outcome = "y", coords = coords)
collider <- dagify(q ~ x + y, exposure = "x", outcome = "y", coords = coords)

dag_flows <- map(
  list(fork = fork, chain = chain, collider = collider),
  tidy_dagitty
) |>
  map("data") |>
  list_rbind(names_to = "dag") |>
  mutate(dag = factor(dag, levels = c("fork", "chain", "collider")))

dag_flows |>
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_edges(edge_width = 1) +
  geom_dag_point(colour = "#2c7bb6", size = 14) +
  geom_dag_text(colour = "white", size = 4) +
  facet_wrap(~dag) +
  expand_plot(
    expand_x = expansion(c(0.2, 0.2)),
    expand_y = expansion(c(0.2, 0.2))
  ) +
  labs(title = "Elemental path structures in a DAG")
Figure 5.1: Three elemental causal path structures. In a fork, x and y share a common cause; in a chain, the effect of x passes through a mediator; in a collider, x and y share a common effect.

Path types and what they imply

Table 5.1: The three path structures, their statistical properties, and when to adjust.
Path type | Direction of q | Path status | x-y correlated? | Adjust for q?
Fork (confounder) | q -> x and q -> y | Open | Yes (spurious) | Yes, blocks confounding
Chain (mediator) | x -> q -> y | Open | Yes (via q) | Depends on question (blocks indirect effect)
Collider | x -> q <- y | Closed | No | No, opens a biasing path

Backdoor paths

A backdoor path is any open, non-causal path between the exposure and the outcome. Forks are the classic example. A path that runs through a collider is closed by default, but conditioning on the collider opens it, creating a biasing path where none was open before.
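
The path rules in Table 5.1 can be checked mechanically. The sketch below assumes the dagitty package (which ggdag wraps) is installed. It rebuilds the three elemental structures without coordinates and asks whether x and y are d-separated, first with no adjustment and then given q. The helper `is_blocked()` is a name introduced here for illustration.

```r
library(ggdag)  # for dagify()

# The three elemental structures from Figure 5.1 (coordinates omitted)
fork     <- dagify(x ~ q, y ~ q)
chain    <- dagify(q ~ x, y ~ q)
collider <- dagify(q ~ x + y)

# TRUE means there is no open path between x and y given the conditioning set
is_blocked <- function(dag, given = list()) {
  dagitty::dseparated(dag, "x", "y", given)
}

results <- data.frame(
  structure    = c("fork", "chain", "collider"),
  unadjusted   = c(is_blocked(fork), is_blocked(chain), is_blocked(collider)),
  adjust_for_q = c(is_blocked(fork, "q"), is_blocked(chain, "q"), is_blocked(collider, "q"))
)
print(results)
```

Forks and chains should come back open until q is adjusted for, while the collider should behave the other way round.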


Confounding in practice

Show code
set.seed(123)
n <- 1000
q_conf <- rbinom(n, size = 1, prob = 0.35)
x_conf <- 2 * q_conf + rnorm(n)
y_conf <- -3 * q_conf + rnorm(n)
confounder_data <- tibble(x = x_conf, y = y_conf, q = as.factor(q_conf))

p_conf1 <- confounder_data |>
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.15, colour = "#636363") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "#d7191c", linewidth = 1) +
  facet_wrap(~"Not adjusting for q\n(biased)") +
  data_theme

p_conf2 <- confounder_data |>
  ggplot(aes(x, y, colour = q)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 1) +
  scale_colour_manual(values = c("0" = "#2c7bb6", "1" = "#fdae61")) +
  facet_wrap(~"Adjusting for q\n(unbiased)") +
  data_theme

p_conf1 + p_conf2 +
  plot_annotation(
    title    = "Fork: spurious association removed by conditioning on the common cause",
    theme    = theme(plot.title = element_text(size = 13, face = "bold"))
  )
Figure 5.2: A fork: q causes both x and y. Unadjusted, x and y appear correlated. Conditioning on q reveals the null relationship.
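
The same point can be read off regression coefficients instead of a plot. This base-R sketch reuses the fork data-generating process above (same seed and coefficients) and compares the slope on x with and without q in the model. It treats a linear model as adequate here, which is an assumption.

```r
set.seed(123)
n <- 1000

# q causes both x and y; x has no effect on y
q <- rbinom(n, size = 1, prob = 0.35)
x <- 2 * q + rnorm(n)
y <- -3 * q + rnorm(n)

unadjusted <- coef(lm(y ~ x))[["x"]]      # confounded by q
adjusted   <- coef(lm(y ~ x + q))[["x"]]  # backdoor path through q blocked

round(c(unadjusted = unadjusted, adjusted = adjusted), 2)
```

The unadjusted slope should be clearly negative, while the adjusted slope should sit near zero.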

Mediation in practice

Show code
set.seed(123)
x_med <- rnorm(n)
lp    <- 2 * x_med + rnorm(n)
q_med <- rbinom(n, size = 1, prob = 1 / (1 + exp(-lp)))
y_med <- 2 * q_med + rnorm(n)
mediator_data <- tibble(x = x_med, y = y_med, q = as.factor(q_med))

p_med1 <- mediator_data |>
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.15, colour = "#636363") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "#1a9641", linewidth = 1) +
  facet_wrap(~"Not adjusting for q\n(total effect)") +
  data_theme

p_med2 <- mediator_data |>
  ggplot(aes(x, y, colour = q)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 1) +
  scale_colour_manual(values = c("0" = "#2c7bb6", "1" = "#fdae61")) +
  facet_wrap(~"Adjusting for q\n(direct effect)") +
  data_theme

p_med1 + p_med2 +
  plot_annotation(
    title = "Chain: adjustment for mediator isolates the direct effect",
    theme = theme(plot.title = element_text(size = 13, face = "bold"))
  )
Figure 5.3: A chain: adjusting for the mediator q removes the indirect effect and leaves only the direct (null) effect.
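
A numeric version of the chain example, as a rough sketch: the total effect is the slope on x with the mediator left out, and the direct effect is the slope once q is held fixed. The data-generating process mirrors the simulation above; summarising it with a linear model is an assumption.

```r
set.seed(123)
n <- 1000

# x affects y only through the binary mediator q
x  <- rnorm(n)
lp <- 2 * x + rnorm(n)
q  <- rbinom(n, size = 1, prob = 1 / (1 + exp(-lp)))
y  <- 2 * q + rnorm(n)

total_effect  <- coef(lm(y ~ x))[["x"]]      # includes the path through q
direct_effect <- coef(lm(y ~ x + q))[["x"]]  # path through q blocked

round(c(total = total_effect, direct = direct_effect), 2)
```

Whether the total or the direct slope is the right answer depends on the question being asked, not on the data.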

Collider bias in practice

Show code
set.seed(123)
x_col   <- rnorm(n)
y_col   <- rnorm(n)
lp_col  <- 2 * x_col + 3 * y_col + rnorm(n)
q_col   <- rbinom(n, size = 1, prob = 1 / (1 + exp(-lp_col)))
collider_data <- tibble(x = x_col, y = y_col, q = as.factor(q_col))

p_col1 <- collider_data |>
  ggplot(aes(x, y)) +
  geom_point(alpha = 0.15, colour = "#636363") +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, colour = "#1a9641", linewidth = 1) +
  facet_wrap(~"Not adjusting for q\n(unbiased)") +
  data_theme

p_col2 <- collider_data |>
  ggplot(aes(x, y, colour = q)) +
  geom_point(alpha = 0.15) +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, linewidth = 1) +
  scale_colour_manual(values = c("0" = "#2c7bb6", "1" = "#d7191c")) +
  facet_wrap(~"Adjusting for q\n(biased, collider opened)") +
  data_theme

p_col1 + p_col2 +
  plot_annotation(
    title = "Collider: conditioning opens a bias path that wasn't there",
    theme = theme(plot.title = element_text(size = 13, face = "bold"))
  )
Figure 5.4: A collider: x and y are independent, but conditioning on q induces a spurious association.
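
The collider case reverses the pattern seen with confounding. In this base-R sketch, which follows the simulation above, x and y are generated independently, so any non-zero slope after adding q to the model is induced purely by conditioning on the collider.

```r
set.seed(123)
n <- 1000

# x and y are independent; both raise the probability that q = 1
x  <- rnorm(n)
y  <- rnorm(n)
lp <- 2 * x + 3 * y + rnorm(n)
q  <- rbinom(n, size = 1, prob = 1 / (1 + exp(-lp)))

unadjusted <- coef(lm(y ~ x))[["x"]]      # path through q is closed
adjusted   <- coef(lm(y ~ x + q))[["x"]]  # conditioning on q opens it

round(c(unadjusted = unadjusted, adjusted = adjusted), 2)
```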

4.2 DAGs in R with ggdag

The ggdag package wraps dagitty with a tidy, ggplot2-compatible interface.

Table 5.2: Key arguments to dagify().
Argument | Type | Purpose
... (formulas) | formula | Specify causal relationships (effect ~ cause1 + cause2)
exposure | character | Exposure variable for path queries
outcome | character | Outcome variable for path queries
latent | character | Mark unmeasured variables; excluded from adjustment sets
coords | list/data.frame | Node positions; use time_ordered_coords() for temporal ordering
labels | named character | Human-readable labels for display
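
As an illustration of the latent argument, the sketch below builds a small hypothetical DAG (the variables u and z are not part of the chapter's examples). In it, u is an unmeasured common cause and z is a measured variable on the same backdoor path. The adjustment sets are queried with dagitty::adjustmentSets(), which works directly on the object that dagify() returns.

```r
library(ggdag)

# Hypothetical DAG with backdoor path x <- z <- u -> y
dag_latent <- dagify(
  y ~ x + u,
  x ~ z,
  z ~ u,
  exposure = "x",
  outcome  = "y",
  latent   = "u"
)

# u would also close the path, but it is latent, so only z is offered
sets <- dagitty::adjustmentSets(dag_latent)
print(sets)
```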

Podcast exam running example

The running example asks: does listening to a comedy podcast the morning before an exam improve graduate students’ test scores?

Show code
podcast_dag <- dagify(
  podcast ~ mood + humor + prepared,
  exam    ~ mood + prepared,
  coords  = time_ordered_coords(
    list(
      c("prepared", "humor", "mood"),
      "podcast",
      "exam"
    )
  ),
  exposure = "podcast",
  outcome  = "exam",
  labels   = c(
    podcast  = "podcast",
    exam     = "exam score",
    mood     = "mood",
    humor    = "humor",
    prepared = "prepared"
  )
)

ggdag(podcast_dag, use_labels = "label", text = FALSE) +
  labs(title = "Podcast exam DAG")
Figure 5.5: Proposed DAG for the podcast exam question. No direct arrow from podcast to exam score is assumed.

Open paths and adjustment sets

Show code
ggdag_paths(podcast_dag, shadow = TRUE, text = FALSE, use_labels = "label") +
  labs(title = "Open paths in podcast_dag")
Figure 5.6: Two open backdoor paths in podcast_dag: one through mood, one through prepared.
Show code
ggdag_adjustment_set(podcast_dag, text = FALSE, use_labels = "label") +
  labs(title = "Minimal adjustment set")
Figure 5.7: Minimal adjustment set for podcast_dag. Both mood and prepared must be adjusted to block all backdoor paths.
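
The same queries can be run without plotting. This sketch rebuilds podcast_dag without coordinates or labels and calls dagitty directly; it is one possible way to get the paths and adjustment sets as plain R objects, not necessarily how the figures above were produced.

```r
library(ggdag)

podcast_dag <- dagify(
  podcast ~ mood + humor + prepared,
  exam    ~ mood + prepared,
  exposure = "podcast",
  outcome  = "exam"
)

# All paths between exposure and outcome, with their open/closed status
p <- dagitty::paths(podcast_dag, from = "podcast", to = "exam")
print(data.frame(path = p$paths, open = p$open))

# Minimal sets that close every open backdoor path
sets <- dagitty::adjustmentSets(podcast_dag, type = "minimal")
print(sets)
```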

Simulated demonstration

Show code
set.seed(10)
sim_data <- simulate_data(podcast_dag)

unadjusted_model <- lm(exam ~ podcast, sim_data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "podcast") |>
  mutate(formula = "unadjusted")

adjusted_model <- lm(exam ~ podcast + mood + prepared, sim_data) |>
  tidy(conf.int = TRUE) |>
  filter(term == "podcast") |>
  mutate(formula = "mood + prepared")

bind_rows(unadjusted_model, adjusted_model) |>
  ggplot(aes(x = estimate, y = formula, xmin = conf.low, xmax = conf.high)) +
  geom_vline(xintercept = 0, linewidth = 1, colour = "grey75") +
  geom_pointrange(linewidth = 1, size = 0.6, colour = "#2c7bb6") +
  labs(
    x       = "Estimated effect of podcast on exam",
    y       = NULL,
    caption = "True effect = 0",
    title   = "Correct adjustment set recovers the null"
  ) +
  data_theme
Figure 5.8: Adjusting for the correct set (mood + prepared) recovers the null effect. The unadjusted estimate is spurious.

Wrong DAG, wrong answer

If the DAG is mis-specified, for instance by omitting mood as a confounder, then even after adjustment the estimate remains biased. DAG correctness is a prerequisite; the algebra cannot fix a wrong causal story.
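
A small simulation makes the cost of a wrong DAG concrete. The linear coefficients below are hypothetical choices, not taken from simulate_data(); the true effect of podcast on exam is set to zero. An analyst whose DAG omits mood would adjust for prepared only.

```r
set.seed(2024)
n <- 2000

# Hypothetical linear version of podcast_dag; true podcast effect is 0
mood     <- rnorm(n)
prepared <- rnorm(n)
podcast  <- 0.5 * mood + 0.5 * prepared + rnorm(n)
exam     <- 0.8 * mood + 0.8 * prepared + rnorm(n)

# Adjustment set implied by a DAG that omits mood
misspecified <- coef(lm(exam ~ podcast + prepared))[["podcast"]]
# Adjustment set implied by the correct DAG
correct      <- coef(lm(exam ~ podcast + mood + prepared))[["podcast"]]

round(c(misspecified = misspecified, correct = correct), 2)
```

Both models are fitted correctly; only the causal story behind the first one is wrong.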


4.3 Structures of Causality

4.3.1 Advanced confounding

Backdoor paths need not pass through a single common cause. In a more complex version of the podcast DAG, adding alertness (caused by mood) and skills_course (which frees up time for podcasts and drives preparedness) creates three backdoor paths and four valid minimal adjustment sets.

Show code
podcast_dag2 <- dagify(
  podcast      ~ mood + humor + skills_course,
  alertness    ~ mood,
  mood         ~ humor,
  prepared     ~ skills_course,
  exam         ~ alertness + prepared,
  coords       = time_ordered_coords(),
  exposure     = "podcast",
  outcome      = "exam",
  labels       = c(
    podcast      = "podcast",
    exam         = "exam score",
    mood         = "mood",
    alertness    = "alertness",
    skills_course = "college\nskills course",
    humor        = "humor",
    prepared     = "prepared"
  )
)

ggdag(podcast_dag2, use_labels = "label", text = FALSE) +
  labs(title = "Expanded podcast DAG")
Figure 5.9: Expanded podcast DAG with alertness and skills_course. Three backdoor paths must be closed.
Show code
ggdag_adjustment_set(podcast_dag2, use_labels = "label", text = FALSE) +
  labs(title = "Minimal adjustment sets, expanded DAG")
Figure 5.10: Four minimal adjustment sets for podcast_dag2. Each set closes all three backdoor paths.
Table 5.3: Valid minimal adjustment sets for the expanded podcast DAG and considerations for choosing between them.
Adjustment set | Backdoor paths closed | Practical considerations
alertness + prepared | All three | Alertness may be hard to measure accurately
alertness + skills_course | All three | skills_course is objectively verifiable; preferred if well measured
mood + prepared | All three | mood + prepared are familiar confounders with moderate measurement quality
mood + skills_course | All three | skills_course is cleanly measured; mood may vary by self-report quality
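
The sets in Table 5.3 can be listed programmatically. The sketch below rebuilds podcast_dag2 without coordinates or labels and prints each minimal adjustment set on one line; the formatting step is just a convenience.

```r
library(ggdag)

podcast_dag2 <- dagify(
  podcast   ~ mood + humor + skills_course,
  alertness ~ mood,
  mood      ~ humor,
  prepared  ~ skills_course,
  exam      ~ alertness + prepared,
  exposure  = "podcast",
  outcome   = "exam"
)

sets <- dagitty::adjustmentSets(podcast_dag2, type = "minimal")

# One string per set, variables sorted for readability
set_labels <- vapply(sets, function(s) paste(sort(s), collapse = " + "), character(1))
print(unname(set_labels))
```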

4.3.2 Selection bias and mediation

Selection bias is collider-stratification bias induced by the study design itself, through the act of selecting units into the sample. If showing up to the exam (showed_up) is caused by podcast as well as by mood and prepared, then observing only those who showed up inherently conditions on showed_up and opens a collider path.

Show code
podcast_dag3 <- dagify(
  podcast    ~ mood + humor + prepared,
  exam       ~ mood + prepared + showed_up,
  showed_up  ~ podcast + mood + prepared,
  coords     = time_ordered_coords(
    list(
      c("prepared", "humor", "mood"),
      "podcast",
      "showed_up",
      "exam"
    )
  ),
  exposure   = "podcast",
  outcome    = "exam",
  labels     = c(
    podcast   = "podcast",
    exam      = "exam score",
    mood      = "mood",
    humor     = "humor",
    prepared  = "prepared",
    showed_up = "showed up"
  )
)

ggdag(podcast_dag3, use_labels = "label", text = FALSE) +
  labs(title = "Selection into sample as a collider-mediator")
Figure 5.11: podcast_dag3: showed_up is both a collider and a mediator. Inherent stratification on this variable limits what causal effect we can estimate.
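
To see selection acting as a collider, here is a deliberately simplified simulation. It is not podcast_dag3 itself: prepared and the showed_up -> exam arrow are left out, podcast is independent of mood, and podcast has no effect on exam. All of these are assumptions made to isolate the selection mechanism.

```r
set.seed(42)
n <- 10000

# Showing up depends on both podcast and mood; exam depends on mood only
mood      <- rnorm(n)
podcast   <- rnorm(n)
exam      <- mood + rnorm(n)
showed_up <- (podcast + mood + rnorm(n)) > 0

everyone  <- coef(lm(exam ~ podcast))[["podcast"]]
attendees <- coef(lm(exam ~ podcast, subset = showed_up))[["podcast"]]

round(c(everyone = everyone, attendees_only = attendees), 2)
```

Restricting to attendees is the only difference between the two fits, yet it should produce a negative association that is absent in the full sample.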

Because the sample is inherently stratified on showed_up, the indirect path podcast -> showed_up -> exam is blocked and the total effect cannot be recovered. The direct effect can sometimes still be estimated.

Show code
podcast_dag3 |>
  adjust_for("showed_up") |>
  ggdag_adjustment_set(effect = "direct", text = FALSE, use_labels = "label") +
  labs(title = "Direct effect adjustment set after inherent conditioning on showed_up")
Figure 5.12: Switching the estimand from total to direct effect yields a valid adjustment set even after conditioning on showed_up.

M-bias and butterfly bias

M-bias occurs when a pre-exposure collider, m, is caused by two unmeasured variables a and b that separately cause the exposure and outcome. The path through m is closed by default; conditioning on m opens it.

Show code
m_bias() |>
  ggdag() +
  labs(title = "M-bias structure")
Figure 5.13: M-bias: m is a collider that predates exposure and outcome. The backdoor path x <- a -> m <- b -> y is closed; conditioning on m opens it.
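
A minimal simulation of the M structure, with unit coefficients chosen arbitrarily for illustration: x has no effect on y, and a and b play the unmeasured causes. Adding m to the model is the only thing that changes between the two fits.

```r
set.seed(7)
n <- 5000

a <- rnorm(n)            # unmeasured cause of x and m
b <- rnorm(n)            # unmeasured cause of y and m
m <- a + b + rnorm(n)    # pre-exposure collider
x <- a + rnorm(n)
y <- b + rnorm(n)        # x has no effect on y

unadjusted <- coef(lm(y ~ x))[["x"]]      # path through m is closed
adjusted_m <- coef(lm(y ~ x + m))[["x"]]  # conditioning on m opens it

round(c(unadjusted = unadjusted, adjusted_for_m = adjusted_m), 2)
```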

Butterfly (bowtie) bias combines M-bias with confounding: mood is both a confounder (so it must be adjusted for) and a collider (so adjusting for it opens a path). When the unmeasured causes u1 and u2 are absent from the data, there is no clean solution. As a rule of thumb, confounding bias tends to be larger than the collider bias from M-structures, so adjusting for mood is usually the lesser evil.

Show code
butterfly_bias(x = "podcast", y = "exam", m = "mood", a = "u1", b = "u2") |>
  ggdag(text = FALSE, use_labels = "label") +
  labs(title = "Butterfly (bowtie) bias")
Figure 5.14: Butterfly bias: mood is simultaneously a confounder and a collider. Controlling for mood blocks confounding but opens the collider path via u1 and u2.
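
The trade-off can be explored with a toy simulation. The unit coefficients below are arbitrary assumptions, and the relative size of the two biases will change with other choices, so this is an illustration of the rule of thumb rather than evidence for it. The true effect of podcast on exam is zero.

```r
set.seed(314)
n <- 5000

u1      <- rnorm(n)               # unmeasured
u2      <- rnorm(n)               # unmeasured
mood    <- u1 + u2 + rnorm(n)     # collider of u1 and u2, and a confounder
podcast <- u1 + mood + rnorm(n)
exam    <- u2 + mood + rnorm(n)   # no podcast effect

unadjusted    <- coef(lm(exam ~ podcast))[["podcast"]]         # confounded by mood
adjusted_mood <- coef(lm(exam ~ podcast + mood))[["podcast"]]  # collider path opened

round(c(unadjusted = unadjusted, adjusted_for_mood = adjusted_mood), 2)
```

Under these particular coefficients, both estimates are biased, but the one adjusted for mood should be closer to zero.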

4.3.3 Instrumental variables and precision variables

Table 5.4: Comparison of instrumental variables and precision variables.
Property | Instrumental variable (e.g., humor) | Precision variable (e.g., grader_mood)
Causes exposure? | Yes | No
Causes outcome? | No (by assumption) | Yes
On a backdoor path? | No | No
Include in adjustment model? | Avoid unless suspected confounder | Yes, always include when available
Effect on point estimate | Usually negligible bias if wrongly added | None
Effect on variance | May slightly inflate SE | Reduces variance, narrower CIs
Alternative use | IV methods (Chapter 22) | n/a
Show code
podcast_dag5 <- dagify(
  podcast     ~ mood + humor + prepared,
  exam        ~ mood + prepared + grader_mood,
  coords      = time_ordered_coords(
    list(
      c("prepared", "humor", "mood"),
      c("podcast", "grader_mood"),
      "exam"
    )
  ),
  exposure    = "podcast",
  outcome     = "exam",
  labels      = c(
    podcast     = "podcast",
    exam        = "exam score",
    mood        = "student\nmood",
    humor       = "humor",
    prepared    = "prepared",
    grader_mood = "grader\nmood"
  )
)

ggdag(podcast_dag5, use_labels = "label", text = FALSE) +
  labs(title = "IV (humor) and precision variable (grader_mood) in podcast DAG")
Figure 5.15: humor is an IV (causes podcast, not exam); grader_mood is a precision variable (causes exam, not podcast).
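
The variance claim for precision variables is easy to check by simulation. In this sketch, which uses hypothetical coefficients, podcast has no effect on exam and grader_mood affects exam only. The helper `se_podcast()` is introduced here for illustration.

```r
set.seed(99)
n <- 1000

podcast     <- rnorm(n)
grader_mood <- rnorm(n)                     # causes exam, not podcast
exam        <- 2 * grader_mood + rnorm(n)   # no podcast effect

# Standard error of the podcast coefficient in a fitted model
se_podcast <- function(fit) summary(fit)$coefficients["podcast", "Std. Error"]

se_without <- se_podcast(lm(exam ~ podcast))
se_with    <- se_podcast(lm(exam ~ podcast + grader_mood))

round(c(without_grader_mood = se_without, with_grader_mood = se_with), 3)
```

Including grader_mood should leave the point estimate near zero either way while shrinking the standard error, because it absorbs outcome variance that would otherwise count as noise.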

4.3.4 Measurement error and missingness

Separating the true value of a variable from its observed value in the DAG exposes how measurement bias enters the causal estimate. In recall bias, the outcome influences the exposure measurement error, introducing a non-causal association between observed exposure and outcome.

Show code
error_dag <- dagify(
  exposure_observed ~ exposure_real + exposure_error,
  outcome_observed  ~ outcome_real + outcome_error,
  outcome_real      ~ exposure_real,
  exposure_error    ~ outcome_real,
  labels            = c(
    exposure_real     = "Exposure\n(truth)",
    exposure_error    = "Measurement\nerror (exposure)",
    exposure_observed = "Exposure\n(observed)",
    outcome_real      = "Outcome\n(truth)",
    outcome_error     = "Measurement\nerror (outcome)",
    outcome_observed  = "Outcome\n(observed)"
  ),
  exposure = "exposure_real",
  outcome  = "outcome_real",
  coords   = time_ordered_coords()
)

ggdag(error_dag, text = FALSE, use_labels = "label") +
  labs(title = "Recall bias: outcome influences memory of exposure")
Figure 5.16: Measurement error DAG: outcome_real causes exposure_error, creating recall bias. The observed exposure is a biased proxy for the true exposure.
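
A rough simulation of recall bias under a null scenario: unlike error_dag, the true exposure is given no effect on the outcome here, so any association with the observed exposure comes from the error mechanism alone. The coefficient 0.5 and the error scale are arbitrary assumptions.

```r
set.seed(11)
n <- 5000

exposure_real <- rnorm(n)
outcome_real  <- rnorm(n)   # independent of the true exposure

# Recall bias: the outcome shifts how the exposure is remembered
exposure_error    <- 0.5 * outcome_real + rnorm(n, sd = 0.5)
exposure_observed <- exposure_real + exposure_error

truth    <- coef(lm(outcome_real ~ exposure_real))[["exposure_real"]]
observed <- coef(lm(outcome_real ~ exposure_observed))[["exposure_observed"]]

round(c(true_exposure = truth, observed_exposure = observed), 2)
```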

4.4 Recommendations for Building DAGs

Summary table

Table 5.5: Ten recommendations for building defensible DAGs.
# | Recommendation | Detail
1 | Iterate early and often | Build the DAG before analysis, ideally before data collection. Share with domain experts and iterate.
2 | Consider your question | The causal structure varies by population and time. What is a confounder in one setting may be irrelevant in another.
3 | Order nodes by time | Time ordering clarifies assumptions (cause precedes effect) and makes complex DAGs easier to read.
4 | Consider the whole data collection process | Found data inherits conditioning on the collection mechanism. Understand selection into your sample.
5 | Include variables you don't have | Mark unmeasured variables as latent; ggdag returns only adjustment sets feasible with your data.
6 | Saturate your DAG, then prune | Default to including arrows; omitting an arrow is the stronger assumption. Prune only implausible arrows.
7 | Include instruments and precision variables | Neither type changes the adjustment set, but both inform modeling decisions and can flag IV opportunities.
8 | Focus on causal structure first, then measurement bias | Start with a perfectly measured DAG; then overlay measurement error as a sensitivity exercise.
9 | Pick adjustment sets most likely to succeed | When multiple valid sets exist, prefer the one with better-measured, better-modeled variables.
10 | Use robustness checks | Negative controls, DAG-data consistency checks, and alternate adjustment sets all stress-test your assumptions.
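
For the DAG-data consistency checks in recommendation 10, one possible starting point is to list the conditional independencies that a DAG implies and then test them against data. The sketch below only does the listing step, using dagitty on a copy of podcast_dag without coordinates or labels.

```r
library(ggdag)

podcast_dag <- dagify(
  podcast ~ mood + humor + prepared,
  exam    ~ mood + prepared
)

# Each statement is a testable implication: if the DAG is right, the two
# variables should be independent given the listed conditioning set
implications <- dagitty::impliedConditionalIndependencies(podcast_dag)
print(implications)
```

An implication that clearly fails in the data suggests a missing arrow or node, although passing every check does not prove the DAG is correct.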

Saturate then prune (worked example)

Show code
podcast_dag_sat <- dagify(
  podcast  ~ mood + humor + prepared,
  exam     ~ mood + prepared + humor,
  prepared ~ humor,
  mood     ~ humor,
  coords   = time_ordered_coords(
    list("humor", c("prepared", "mood"), "podcast", "exam")
  ),
  exposure = "podcast",
  outcome  = "exam",
  labels   = c(
    podcast  = "podcast",
    exam     = "exam score",
    mood     = "mood",
    humor    = "humor",
    prepared = "prepared"
  )
)

curvatures      <- rep(0, 8)
curvatures[1]   <- 0.25

podcast_dag_sat |>
  tidy_dagitty() |>
  ggplot(aes(x, y, xend = xend, yend = yend, label = label)) +
  geom_dag_point(colour = "#2c7bb6", size = 14) +
  geom_dag_edges_arc(curvature = curvatures) +
  geom_dag_label_repel(aes(label = label), show.legend = FALSE) +
  labs(title = "Saturated podcast DAG")
Figure 5.17: Saturated version of podcast_dag: each variable has arrows to all later variables, except that the podcast -> exam arrow is still omitted.
Show code
podcast_dag_pruned <- dagify(
  podcast  ~ mood + humor + prepared,
  exam     ~ mood + prepared,
  mood     ~ humor,
  coords   = time_ordered_coords(
    list("humor", c("prepared", "mood"), "podcast", "exam")
  ),
  exposure = "podcast",
  outcome  = "exam",
  labels   = c(
    podcast  = "podcast",
    exam     = "exam score",
    mood     = "mood",
    humor    = "humor",
    prepared = "prepared"
  )
)

ggdag(podcast_dag_pruned, text = FALSE, use_labels = "label") +
  labs(title = "Pruned podcast DAG, implausible arrows removed")
Figure 5.18: Pruned DAG: implausible arrows (humor -> prepared and humor -> exam, since grading is blinded) removed.

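The two DAGs do not share a minimal adjustment set. In the saturated DAG, humor causes both podcast and exam, so it is a confounder and the minimal set is humor + mood + prepared. Pruning humor -> exam leaves humor connected to exam only through mood, so the minimal set shrinks to mood + prepared. Pruning decisions can therefore change what must be adjusted.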
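
One way to check how pruning affects the adjustment set is to query both graphs directly. The sketch below rebuilds the saturated and pruned DAGs without coordinates or labels and prints the minimal adjustment sets that dagitty reports for each.

```r
library(ggdag)

podcast_dag_sat <- dagify(
  podcast  ~ mood + humor + prepared,
  exam     ~ mood + prepared + humor,
  prepared ~ humor,
  mood     ~ humor,
  exposure = "podcast",
  outcome  = "exam"
)

podcast_dag_pruned <- dagify(
  podcast  ~ mood + humor + prepared,
  exam     ~ mood + prepared,
  mood     ~ humor,
  exposure = "podcast",
  outcome  = "exam"
)

sets_sat    <- dagitty::adjustmentSets(podcast_dag_sat, type = "minimal")
sets_pruned <- dagitty::adjustmentSets(podcast_dag_pruned, type = "minimal")

print(sets_sat)
print(sets_pruned)
```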

Feedback loops are feedforward loops

Apparent feedback loops are really variables co-evolving through time. Unrolling the loop into a time-indexed DAG, with one node per variable per time point, expresses this without violating acyclicity.

Show code
dagify(
  global_temp_2000 ~ ac_use_1990 + global_temp_1990,
  ac_use_2000      ~ ac_use_1990 + global_temp_1990,
  global_temp_2010 ~ ac_use_2000 + global_temp_2000,
  ac_use_2010      ~ ac_use_2000 + global_temp_2000,
  global_temp_2020 ~ ac_use_2010 + global_temp_2010,
  ac_use_2020      ~ ac_use_2010 + global_temp_2010,
  coords           = time_ordered_coords(),
  labels           = c(
    ac_use_1990      = "A/C use\n(1990)",
    global_temp_1990 = "Temp\n(1990)",
    ac_use_2000      = "A/C use\n(2000)",
    global_temp_2000 = "Temp\n(2000)",
    ac_use_2010      = "A/C use\n(2010)",
    global_temp_2010 = "Temp\n(2010)",
    ac_use_2020      = "A/C use\n(2020)",
    global_temp_2020 = "Temp\n(2020)"
  )
) |>
  ggdag(text = FALSE, use_labels = "label") +
  labs(title = "Feedback loop unrolled as a time-ordered feedforward DAG")
Figure 5.19: A/C use and global temperature co-evolve over time. The cyclic mental model unfolds into a feedforward DAG.
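
The difference between the cyclic mental model and its unrolled form can be checked with dagitty::isAcyclic(). In this sketch the node `emissions` is a hypothetical intermediate added only to draw the loop, and the unrolled graph covers a single decade step to keep it short.

```r
library(dagitty)

# Cyclic mental model; "emissions" is a hypothetical intermediate node
loop <- dagitty("dag { ac_use -> emissions -> global_temp -> ac_use }")

# Time-unrolled version: every arrow points from 1990 to 2000
unrolled <- dagitty("dag {
  ac_use_1990 -> ac_use_2000
  ac_use_1990 -> global_temp_2000
  global_temp_1990 -> ac_use_2000
  global_temp_1990 -> global_temp_2000
}")

loop_ok     <- isAcyclic(loop)
unrolled_ok <- isAcyclic(unrolled)
print(c(loop = loop_ok, unrolled = unrolled_ok))
```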

DAG properties in applied health research

The table below summarises empirical properties of DAGs used in 144 applied health research papers (Tennant et al. 2020).

Table 5.6: Empirical properties of 144 DAGs in applied health research (Tennant et al. 2020). The median DAG was only 46% saturated, and fewer than a third of papers reported their estimand.
Characteristic | N = 144
DAG properties |
Number of nodes | 12 (IQR: 9-16)
Number of arcs | 29 (IQR: 19-42)
Arc-to-node ratio | 2.30 (IQR: 1.75-3.00)
Saturation proportion | 0.46 (IQR: 0.31-0.67)
Fully saturated, Yes | 4 (3%)
Fully saturated, No | 140 (97%)
Reporting |
Reported estimand, Yes | 40 (28%)
Reported estimand, No | 104 (72%)
Reported adjustment set, Yes | 80 (56%)
Reported adjustment set, No | 64 (44%)
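
The saturation figures are easier to interpret with the arithmetic spelled out. The sketch below assumes saturation is defined as the number of arcs divided by the maximum possible number of arcs, where a DAG on k nodes can hold at most k(k - 1)/2 arcs. The function names are introduced here for illustration.

```r
# Maximum arcs in a DAG with k nodes: one arc per unordered pair of nodes
max_arcs <- function(k) k * (k - 1) / 2

saturation <- function(nodes, arcs) arcs / max_arcs(nodes)

# Median node and arc counts from Table 5.6. This is a rough check only:
# the ratio of medians need not equal the reported median of ratios.
typical <- saturation(nodes = 12, arcs = 29)

# podcast_dag_sat: 5 nodes, 8 arcs
podcast_sat <- saturation(nodes = 5, arcs = 8)

round(c(typical = typical, podcast_dag_sat = podcast_sat), 2)
```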

Key concepts glossary

Table 5.7: Key terms from Chapter 4.
Term | Definition
DAG | Directed Acyclic Graph: a causal diagram with directed arrows and no cycles
SCM | Structural Causal Model: the class of non-parametric models that DAGs belong to
Node / Edge | Node: a variable. Edge: a directed arrow encoding a causal relationship
Fork | q -> x and q -> y: shared common cause; open path; induces confounding
Chain | x -> q -> y: effect passes through a mediator; open path; adjusting for q blocks the indirect effect
Collider | x -> q <- y: shared common effect; closed path; conditioning on q opens a biasing path
Open path | A path that transmits statistical association between exposure and outcome
Closed path | A path that does not transmit association (e.g., one blocked at a collider)
Backdoor path | Any open, non-causal path from exposure to outcome; must be blocked to identify the causal effect
Adjustment set | A set of variables that, when adjusted for, blocks all backdoor paths
Collider-stratification bias | Bias introduced by conditioning on a collider; selection bias is the case where sample selection does the conditioning
Instrumental variable | A cause of the exposure only, not the outcome; useful for IV estimation methods
Precision variable | A cause of the outcome only, not the exposure; reduces variance when included in the model
Sharp null | No unit has an individual causal effect; justifies removing an arrow
Proxy confounder | A measured variable correlated with an unmeasured confounder; reduces but rarely eliminates confounding

Chapter connections
  • Chapter 10 covers estimands in depth (what are you actually trying to estimate?).
  • Chapter 14 covers interaction and effect modification.
  • Chapter 15 covers measurement error and missingness.
  • Chapter 16 covers sensitivity analysis, including robustness checks.
  • Chapter 17 covers causal mediation analysis.
  • Chapter 18 covers time-varying confounding and feedforward relationships.
  • Chapter 22 covers instrumental variable methods.