From casual to causal

Overview

Data science primarily involves three fundamental tasks: description, prediction, and causal inference. Researchers frequently conflate these tasks, utilizing ambiguous terminology that masks causal intent while avoiding the rigorous assumptions required for true causal analysis. Clearly defining the specific goal of an inquiry is a prerequisite for selecting the correct methodology, as models optimized for prediction or description generally fail to produce valid causal inferences.

Core Concepts

Term	Explanation
Casual Inference	Drawing conclusions without explicitly defining the causal question or addressing the assumptions necessary to answer it.
Schrödinger's Causal Inference	A pervasive phenomenon where researchers avoid stating an interest in estimating causal effects but embed causal intent, language, and recommendations throughout their studies.
Descriptive Analysis	The statistical summarization of variable distributions, often stratified by key characteristics to describe populations.
Prediction	The use of data to produce accurate estimates of a variable's values for new, unseen observations.
Causal Inference	The process of determining the specific impact that one variable has on another.
Exposure	The variable hypothesized to act as the cause.
Outcome	The variable hypothesized to act as the effect.
Eligibility Criteria	The specific conditions defining the subjects for whom the causal claim applies.
Target Population	The broader population to whom the estimated causal effect is meant to apply.
Time Zero	The point in time when participant follow-up begins in a study.
Follow Up Period	The specific duration over which the outcome is observed and measured.
Table Two Fallacy	The erroneous practice of presenting confounders alongside exposure variables in regression tables and interpreting their coefficients as independent causal effects.
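
The components defined in the table above (exposure, outcome, eligibility criteria, target population, time zero, follow-up period) can be written down explicitly before any modeling starts. The following is a minimal sketch of one way to do that in Python. The class name `CausalQuestion`, its method, and the example values are all hypothetical and chosen only for illustration; they do not come from the source or from any real study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CausalQuestion:
    """Hypothetical container for the components of a causal claim."""

    exposure: str
    outcome: str
    eligibility_criteria: str
    target_population: str
    time_zero: str
    follow_up_period: str

    def as_sentence(self) -> str:
        # Assemble the components into a single, explicit causal question.
        return (
            f"Among {self.target_population} meeting the criteria "
            f"'{self.eligibility_criteria}', what is the effect of "
            f"{self.exposure} on {self.outcome}, measured from "
            f"{self.time_zero} over {self.follow_up_period}?"
        )


# Illustrative values only; not taken from any real study.
question = CausalQuestion(
    exposure="smoking cessation",
    outcome="weight change",
    eligibility_criteria="adults who smoked at the baseline visit",
    target_population="adult smokers",
    time_zero="the baseline visit",
    follow_up_period="10 years",
)
sentence = question.as_sentence()
print(sentence)
```

Because every field is required, leaving out any component (for example, time zero) raises an error at construction time, which mirrors the idea that an under-specified question is not yet a causal question.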

Methods/Estimators

Method	Objective	Typical Workflow	Validity Metric	Confounding
Description	Detail variable distributions by person, place, and time	EDA; measures of spread and central tendency	Minimization of measurement and sampling errors	Undefined — descriptions capture data as-is
Prediction	Maximize estimation accuracy on novel data	Train/test splits; managing the bias-variance trade-off	RMSE, MAE, AUC	Undefined — confounded variables used if predictive
Causal Inference	Estimate the unbiased effect of an exposure on an outcome	Randomized trials or statistical adjustment for confounders	Dependent on untestable structural assumptions	Central focus; must be controlled for to identify the true effect
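
The difference in how confounding is treated can be illustrated with a small simulation. The sketch below assumes a made-up data-generating process (one confounder, a linear exposure effect of 1.0, Gaussian noise) and uses only NumPy; the variable names, coefficients, and the helper `ols` are assumptions for illustration, not part of the source.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
true_effect = 1.0  # assumed causal effect of exposure on outcome

# Assumed structure: the confounder drives both exposure and outcome.
confounder = rng.normal(size=n)
exposure = 0.8 * confounder + rng.normal(size=n)
outcome = true_effect * exposure + 1.5 * confounder + rng.normal(size=n)


def ols(columns, y):
    """Return least-squares coefficients, intercept first."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


# Slope of outcome on exposure alone: a valid association, a biased effect.
unadjusted = ols([exposure], outcome)[1]

# Slope of outcome on exposure after adjusting for the confounder.
adjusted = ols([exposure, confounder], outcome)[1]

print(f"true effect: {true_effect:.2f}")
print(f"unadjusted:  {unadjusted:.2f}")
print(f"adjusted:    {adjusted:.2f}")
```

Under these simulation settings the unadjusted slope converges to roughly 1.73 rather than 1.0. That number is a perfectly accurate description of the association, and it is useful for prediction, but it is not the causal effect. Adjustment recovers the effect here only because the simulated structure is known; with real data that structure is an assumption.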

Assumptions

Category	Key Assumptions
Descriptive Validity	Sampled data properly represents the target population; minimal measurement error.
Predictive Validity	Hold-out performance proxies generalization to future data; missingness and measurement errors can be leveraged as informative features.
Causal Validity	The underlying structural model is correct. Most causal assumptions are fundamentally unverifiable.
Variable Selection	Variables are selected from domain background knowledge, not from predictive feature selection metrics or associational statistics.

Key Takeaways

Analytical tasks must be strictly categorized into description, prediction or causal inference to ensure appropriate methodological choices.
The ability of a model to accurately predict an outcome does not indicate that the model’s coefficients represent valid causal effects.
Highly predictive models often optimize accuracy by exploiting proxy variables correlated with unmeasured confounders, rendering their internal weights causally biased.
Introducing intentional bias (e.g., through penalized regression) often reduces variance in prediction models but undermines the unbiased effect estimation that causal inference requires.
Confounding is an exclusively causal concept and does not exist within purely descriptive or predictive frameworks.
Valid causal analysis begins by clearly diagramming the claim to map the exposure, outcome, target population, eligibility criteria, time zero and follow-up period.
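
The takeaway about penalized regression can be made concrete with a small sketch. It assumes a simulated dataset with a known exposure effect of 2.0 and compares ordinary least squares with a closed-form ridge estimate in NumPy. The helper `ridge`, the penalty value of 500, and the data-generating process are illustrative assumptions; the penalty is deliberately large so that the shrinkage is easy to see, and the example shows only the bias in the coefficient, not the predictive benefit.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
true_effect = 2.0  # assumed causal effect of exposure on outcome

# Assumed structure: a measured confounder affects exposure and outcome.
confounder = rng.normal(size=n)
exposure = 0.5 * confounder + rng.normal(size=n)
outcome = true_effect * exposure + confounder + rng.normal(size=n)

# Center the data so that no intercept term is needed.
X = np.column_stack([exposure, confounder])
X = X - X.mean(axis=0)
y = outcome - outcome.mean()


def ridge(X, y, penalty):
    """Closed-form ridge estimate; penalty=0 gives ordinary least squares."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + penalty * np.eye(n_features), X.T @ y)


ols_coef = ridge(X, y, penalty=0.0)
ridge_coef = ridge(X, y, penalty=500.0)

print(f"true effect:          {true_effect:.2f}")
print(f"OLS exposure coef:    {ols_coef[0]:.2f}")
print(f"ridge exposure coef:  {ridge_coef[0]:.2f}")
```

Both models adjust for the same confounder, yet the penalized exposure coefficient is pulled toward zero and away from the simulated effect. A penalty tuned for predictive accuracy has no reason to preserve the coefficient that a causal question cares about.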