| Term | Explanation |
|---|---|
| Casual Inference | Drawing conclusions without explicitly defining causal questions or addressing the assumptions necessary to answer them. |
| Schrödinger's Causal Inference | A pervasive phenomenon where researchers avoid stating an interest in estimating causal effects but embed causal intent, language, and recommendations throughout their studies. |
| Descriptive Analysis | The statistical summarization of variable distributions, often stratified by key characteristics to describe populations. |
| Prediction | The use of data to estimate the values of variables as accurately as possible for new, unseen observations. |
| Causal Inference | The process of determining the specific impact that one variable has on another. |
| Exposure | The variable hypothesized to act as the cause. |
| Outcome | The variable hypothesized to act as the effect. |
| Eligibility Criteria | The specific conditions defining the subjects for whom the causal claim applies. |
| Target Population | The larger population to which the estimated causal effect is intended to apply. |
| Time Zero | The point in time when participant follow-up begins in a study. |
| Follow Up Period | The specific duration over which the outcome is observed and measured. |
| Table Two Fallacy | The erroneous practice of presenting confounders alongside exposure variables in regression tables and interpreting their coefficients as independent causal effects. |
From casual to causal
Overview
Data science primarily involves three fundamental tasks: description, prediction, and causal inference. Researchers frequently conflate these tasks, utilizing ambiguous terminology that masks causal intent while avoiding the rigorous assumptions required for true causal analysis. Clearly defining the specific goal of an inquiry is a prerequisite for selecting the correct methodology, as models optimized for prediction or description generally fail to produce valid causal inferences.
Core Concepts
Methods/Estimators
| Method | Objective | Typical Workflow | Validity Metric | Confounding |
|---|---|---|---|---|
| Description | Detail variable distributions by person, place, and time | EDA; measures of spread and central tendency | Minimization of measurement and sampling errors | Undefined — descriptions capture data as-is |
| Prediction | Maximize estimation accuracy on novel data | Train/test splits; managing the bias-variance trade-off | RMSE, MAE, AUC | Undefined — confounded variables used if predictive |
| Causal Inference | Calculate an unbiased effect of an exposure on an outcome | Randomized trials or statistical adjustment for confounders | Dependent on untestable structural assumptions | Central focus; must be isolated to identify the true effect |
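
The contrast in the table can be made concrete with a small simulation. The sketch below is illustrative only: the data-generating process, the coefficient values, and the helper names `fit_ols` and `rmse` are assumptions, not part of the source material. It runs all three tasks on the same simulated data, where a single confounder drives both exposure and outcome and the true effect of the exposure is set to 1.0.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000

# Hypothetical data-generating process; every number here is an assumption.
confounder = rng.normal(size=n)                   # common cause of exposure and outcome
exposure = 0.8 * confounder + rng.normal(size=n)  # exposure depends on the confounder
TRUE_EFFECT = 1.0
outcome = TRUE_EFFECT * exposure + 2.0 * confounder + rng.normal(size=n)


def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


def rmse(coefs, features, target):
    """Root mean squared error of a fitted linear model on the given data."""
    design = np.column_stack([np.ones(len(target)), features])
    return float(np.sqrt(np.mean((target - design @ coefs) ** 2)))


# Description: summarize the outcome as observed; no model of causes is involved.
outcome_mean, outcome_sd = float(outcome.mean()), float(outcome.std())

# Prediction: fit on one half of the data, score on the held-out half.
train, test = slice(0, n // 2), slice(n // 2, n)
pred_coefs = fit_ols(exposure[train], outcome[train])
test_rmse = rmse(pred_coefs, exposure[test], outcome[test])
baseline_rmse = float(np.sqrt(np.mean((outcome[test] - outcome[train].mean()) ** 2)))

# Causal inference: adjust for the confounder, which is known here by construction.
naive_effect = float(fit_ols(exposure, outcome)[1])
adjusted_effect = float(fit_ols(np.column_stack([exposure, confounder]), outcome)[1])

print(f"description: mean={outcome_mean:.2f}, sd={outcome_sd:.2f}")
print(f"prediction:  test RMSE={test_rmse:.2f}, mean-only baseline={baseline_rmse:.2f}")
print(f"causal:      naive={naive_effect:.2f}, adjusted={adjusted_effect:.2f}, "
      f"simulated truth={TRUE_EFFECT}")
```

Under these assumptions the exposure-only model predicts held-out outcomes better than a mean-only baseline, yet its slope lands near 2 rather than 1 because it absorbs the confounder's contribution. Only the adjusted fit recovers a value close to the simulated truth, and it can do so here only because the confounder is known and measured by construction.
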
Assumptions
| Category | Key Assumptions |
|---|---|
| Descriptive Validity | Sampled data properly represents the target population; minimal measurement error. |
| Predictive Validity | Hold-out performance proxies generalization to future data; missingness and measurement errors can be leveraged as informative features. |
| Causal Validity | The underlying structural model is correct. Most causal assumptions are fundamentally unverifiable. |
| Variable Selection | Variables are selected from domain background knowledge, not from predictive feature selection metrics or associational statistics. |
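
One way to see why predictive metrics are a poor guide to variable selection is a simulation in which the most predictive feature is a consequence of the outcome. The following is a minimal sketch under stated assumptions: the variable `consequence`, the effect size of 1.0, and the helpers `fit_ols` and `rmse` are all hypothetical. The exposure is simulated as if randomized, so the exposure-only model is the causally appropriate one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20_000

# Hypothetical setup; all names and numbers are assumptions for illustration.
exposure = rng.normal(size=n)  # simulated as if randomized, so there is no confounding
TRUE_EFFECT = 1.0
outcome = TRUE_EFFECT * exposure + rng.normal(size=n)
# A variable measured after the outcome and caused by it.
consequence = outcome + 0.5 * rng.normal(size=n)


def fit_ols(features, target):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(target)), features])
    coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coefs


def rmse(coefs, features, target):
    """Root mean squared error of a fitted linear model on the given data."""
    design = np.column_stack([np.ones(len(target)), features])
    return float(np.sqrt(np.mean((target - design @ coefs) ** 2)))


train, test = slice(0, n // 2), slice(n // 2, n)
candidates = {
    "exposure only": exposure.reshape(-1, 1),
    "exposure + consequence": np.column_stack([exposure, consequence]),
}

results = {}
for name, features in candidates.items():
    coefs = fit_ols(features[train], outcome[train])
    results[name] = {
        "test_rmse": rmse(coefs, features[test], outcome[test]),
        "exposure_coef": float(coefs[1]),  # index 0 is the intercept
    }
    print(f"{name:24s} test RMSE={results[name]['test_rmse']:.2f}  "
          f"exposure coefficient={results[name]['exposure_coef']:.2f}")
```

In this setup, choosing the feature set by hold-out RMSE would pick the model that includes `consequence`, even though its exposure coefficient falls far below the simulated effect. What rules the variable out is domain knowledge about timing (it is measured after the outcome), not any associational statistic.
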
Key Takeaways
Analytical tasks must be strictly categorized into description, prediction, or causal inference to ensure appropriate methodological choices.
The ability of a model to accurately predict an outcome does not indicate that the model’s coefficients represent valid causal effects.
Highly predictive models often optimize accuracy by exploiting proxy variables correlated with unmeasured confounders, rendering their internal weights causally biased.
Deliberately introducing bias (e.g., through penalized regression) often reduces variance and improves accuracy in prediction models, but it undermines the primary goal of causal inference: an unbiased estimate of the effect.
Confounding is an exclusively causal concept and does not exist within purely descriptive or predictive frameworks.
Valid causal analysis begins by clearly diagramming the claim to map the exposure, outcome, target population, eligibility criteria, time zero, and follow-up period.
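
The components of a causal claim can be written down as a small record before any modeling starts. This is a sketch only: the class `CausalClaim`, its methods, and the tutoring example are hypothetical and are not taken from the source. The point is that a claim with any blank component is not yet a well-defined causal question.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class CausalClaim:
    """Hypothetical record of the components that define a causal claim."""

    exposure: str
    outcome: str
    target_population: str
    eligibility_criteria: str
    time_zero: str
    follow_up_period: str

    def missing_components(self) -> list[str]:
        """Return the names of components that are still blank."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]

    def as_question(self) -> str:
        """Render the claim as one sentence; raises if any component is undefined."""
        missing = self.missing_components()
        if missing:
            raise ValueError(f"claim is underspecified, missing: {', '.join(missing)}")
        return (
            f"Among {self.eligibility_criteria} (target population: "
            f"{self.target_population}), what is the effect of {self.exposure} "
            f"on {self.outcome}, following participants from {self.time_zero} "
            f"for {self.follow_up_period}?"
        )


# Invented example values, used only to show the shape of a fully specified claim.
claim = CausalClaim(
    exposure="enrolling in a tutoring program",
    outcome="end-of-year math score",
    target_population="grade 8 students in the district",
    eligibility_criteria="students enrolled in grade 8 with no prior tutoring",
    time_zero="the first day of the school year",
    follow_up_period="one school year",
)
print(claim.as_question())

# A draft that names only exposure and outcome is flagged as incomplete.
draft = CausalClaim("tutoring", "math score", "", "", "", "")
print(draft.missing_components())
```

Making the record immutable and refusing to render an incomplete claim is one possible design choice. It forces the population, eligibility, and timing decisions to be stated up front instead of being left implicit in the analysis code.
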