Predicting Australian Rainfall: A Hierarchical Zero-Inflated Gamma Framework

Overview

Australian daily rainfall violates the assumptions of standard regression in three simultaneous and non-ignorable ways: 64.05% of observations are exactly zero, the positive tail is severely skewed, and consecutive days are strongly correlated through synoptic weather persistence. This project implements a Mixed-Effects Zero-Inflated Gamma (ZIG) model that addresses all three, treating occurrence and intensity as structurally separate processes with independent linear predictors and location-specific random effects across 49 weather stations. The statistical case for this framework is developed in Chapter 1.

Mathematical Framework

The model specifies that daily rainfall $y_i$ follows a mixture of a point mass at zero and a Gamma-distributed positive component. For observation $i$:

\[ P(Y_i = 0) = \pi_i + (1 - \pi_i) \cdot f_\Gamma(0) \]

\[ P(Y_i = y) = (1 - \pi_i) \cdot f_\Gamma(y \mid \mu_i,\, \phi), \quad y > 0 \]

The zero-inflation probability $\pi_i$ and the conditional intensity mean $\mu_i$ have separate linear predictors:

\[ \text{logit}(\pi_i) = \mathbf{W}_i \boldsymbol{\gamma} \]

\[ \log(\mu_i) = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{Z}_i \mathbf{b} \]

where $\mathbf{W}_i$ contains the predictors of occurrence, $\mathbf{X}_i$ contains the fixed-effects predictors of intensity, and $\mathbf{Z}_i \mathbf{b}$ encodes the location-specific random effects. The predictor matrices $\mathbf{W}_i$ and $\mathbf{X}_i$ are not constrained to be identical, which is the structural distinction between the ZIG framework and a Tweedie model.

Technical Highlights

Two-stage imputation for high-dimensional missingness. The two most predictive meteorological variables, sunshine and evaporation, are missing in 47.69% and 42.54% of records respectively due to uneven instrument deployment across the station network. Listwise deletion would eliminate nearly half the dataset and introduce geographic bias toward well-resourced stations. A two-stage pipeline applies linear temporal interpolation for short gaps and multivariate imputation via predictive mean matching and Random Forest otherwise, using physically motivated predictor sets for each variable. Ghost sensor station-variable pairs exceeding 90% missingness across the full observation window are identified before imputation and carried forward with binary provenance flags rather than populated with extrapolations that have no empirical anchor.

Circular wind vector decomposition. Wind direction is a circular variable: a bearing of 359 degrees and a bearing of 1 degree are separated by 2 degrees, not 358. Naive numeric encoding treats them as maximally distant and destroys the directional information. Each compass bearing is resolved into orthogonal North-South ($V$) and East-West ($U$) components via trigonometric projection, converting a discontinuous circular variable into two continuous quantities that a linear predictor can interpret correctly.

Temporal signal extraction. Seven-day trailing moving averages reduce the standard deviation of daily rainfall by 56%, from 13.1 mm to 5.8 mm, capturing a weekly wetness regime feature that reflects the accumulated atmospheric state preceding each observation. All moving averages are lagged by one additional day before use as predictors to prevent data leakage.

Interaction term construction. The bivariate density analysis identified a Rain Corner: rainfall concentrates almost exclusively when afternoon humidity exceeds approximately 60% and daily sunshine falls below approximately five hours simultaneously. A centred multiplicative interaction term formalises this threshold. Centring reduces the variance inflation factor of the interaction term to 1.174, confirming that the artificial collinearity introduced by uncentred interaction terms has been eliminated.

Two-component mixture model. The ZIG model separates the zero mass from the positive continuous component through a logistic hurdle and a Gamma intensity submodel with independent linear predictors. This separation is empirically motivated: rain_yesterday is the dominant predictor of occurrence, while pressure change and wind vectors are the primary drivers of conditional intensity.

Hierarchical random effects structure. Location-specific random effects are specified as uncorrelated random intercepts and slopes via a diagonal covariance structure. This allows the baseline rainfall level, the humidity-intensity relationship, and the persistence effect to vary across the 49 weather stations in the dataset while sharing a common global mean, without the overparameterisation of an unstructured random covariance matrix.

Non-linear dry spell dynamics. The probability of rain returning after a dry spell decays in a rapid-then-gradual pattern: it falls steeply over the first ten days and then stabilises. A natural spline with four degrees of freedom, $\text{ns}(\text{days\_since\_rain},\, df = 4)$, captures this non-linear structure in the zero-inflation component.

Estimation and diagnostics. All models are fitted via maximum likelihood. Residual adequacy is verified using DHARMa simulation-based diagnostics, which test scaled residuals against the theoretical uniform distribution regardless of model family. The final model passes the dispersion test ($p = 0.152$), zero-inflation calibration (ratio = 1.00, $p = 0.512$), and temporal autocorrelation checks on most of the locations.

Thermodynamic interaction threshold. Rainfall occurrence is gated by the simultaneous occurrence of high humidity and low sunshine in the Rain Corner of the bivariate feature space, quantified by a highly significant interaction term ($p < 0.001$, $F(2,\ 24.1) = 49.375$) that is stable across all model specifications.

Markovian persistence. Having rained the previous day is the single strongest predictor in the zero-inflation component. Dry states are self-reinforcing with an 85% dry-to-dry transition probability; wet states are transient with a 47% wet-to-wet probability.

Directional wind dynamics. Southerly and westerly morning wind vectors are the primary directional drivers of rainfall intensity, reflecting the influence of Southern Ocean frontal systems and the mid-latitude westerlies. Morning wind direction is three to four times more informative than peak gust direction.

Spatial heterogeneity. The random effects analysis establishes that tropical Top End stations produce roughly 75 to 77 percent more times the rainfall of a station with average geographic characteristics at identical atmospheric conditions, while arid interior stations produce approximately half as much. The addition of the mixed-effects structure yields $\Delta\text{AIC} = -6{,}915.32$ over the fixed-effects model.

Classification performance. The zero-inflation component achieves an AUC of 0.813 at the Youden-optimal threshold of 0.6246, with the model generating false alarms of 22,764.

--- pagetitle: "Australian Rainfall Dynamics" date: today about: template: solana links: - icon: github text: Source Code href: https://github.com/Chrisolande/australian-rainfall-dynamics - icon: linkedin text: LinkedIn href: https://www.linkedin.com/in/chris-olande-6557a3238/ --- ```{r} #| label: setup-index #| include: false ``` # Predicting Australian Rainfall: A Hierarchical Zero-Inflated Gamma Framework {.unnumbered} --- ## Overview {.unnumbered} Australian daily rainfall violates the assumptions of standard regression in three simultaneous and non-ignorable ways: 64.05% of observations are exactly zero, the positive tail is severely skewed, and consecutive days are strongly correlated through synoptic weather persistence. This project implements a Mixed-Effects Zero-Inflated Gamma (ZIG) model that addresses all three, treating occurrence and intensity as structurally separate processes with independent linear predictors and location-specific random effects across 49 weather stations. The statistical case for this framework is developed in [Chapter 1](chapters/01-intro.html). --- ## Mathematical Framework {.unnumbered} The model specifies that daily rainfall $y_i$ follows a mixture of a point mass at zero and a Gamma-distributed positive component. For observation $i$: $$ P(Y_i = 0) = \pi_i + (1 - \pi_i) \cdot f_\Gamma(0) $$ $$ P(Y_i = y) = (1 - \pi_i) \cdot f_\Gamma(y \mid \mu_i,\, \phi), \quad y > 0 $$ The zero-inflation probability $\pi_i$ and the conditional intensity mean $\mu_i$ have separate linear predictors: $$ \text{logit}(\pi_i) = \mathbf{W}_i \boldsymbol{\gamma} $$ $$ \log(\mu_i) = \mathbf{X}_i \boldsymbol{\beta} + \mathbf{Z}_i \mathbf{b} $$ where $\mathbf{W}_i$ contains the predictors of occurrence, $\mathbf{X}_i$ contains the fixed-effects predictors of intensity, and $\mathbf{Z}_i \mathbf{b}$ encodes the location-specific random effects. The predictor matrices $\mathbf{W}_i$ and $\mathbf{X}_i$ are not constrained to be identical, which is the structural distinction between the ZIG framework and a Tweedie model. --- ## Technical Highlights {.unnumbered} ::: {.panel-tabset} ### Data Engineering **Two-stage imputation for high-dimensional missingness.** The two most predictive meteorological variables, `sunshine` and `evaporation`, are missing in 47.69% and 42.54% of records respectively due to uneven instrument deployment across the station network. Listwise deletion would eliminate nearly half the dataset and introduce geographic bias toward well-resourced stations. A two-stage pipeline applies linear temporal interpolation for short gaps and multivariate imputation via predictive mean matching and Random Forest otherwise, using physically motivated predictor sets for each variable. Ghost sensor station-variable pairs exceeding 90% missingness across the full observation window are identified before imputation and carried forward with binary provenance flags rather than populated with extrapolations that have no empirical anchor. **Circular wind vector decomposition.** Wind direction is a circular variable: a bearing of 359 degrees and a bearing of 1 degree are separated by 2 degrees, not 358. Naive numeric encoding treats them as maximally distant and destroys the directional information. Each compass bearing is resolved into orthogonal North-South ($V$) and East-West ($U$) components via trigonometric projection, converting a discontinuous circular variable into two continuous quantities that a linear predictor can interpret correctly. **Temporal signal extraction.** Seven-day trailing moving averages reduce the standard deviation of daily rainfall by 56%, from 13.1 mm to 5.8 mm, capturing a weekly wetness regime feature that reflects the accumulated atmospheric state preceding each observation. All moving averages are lagged by one additional day before use as predictors to prevent data leakage. **Interaction term construction.** The bivariate density analysis identified a Rain Corner: rainfall concentrates almost exclusively when afternoon humidity exceeds approximately 60% and daily sunshine falls below approximately five hours simultaneously. A centred multiplicative interaction term formalises this threshold. Centring reduces the variance inflation factor of the interaction term to 1.174, confirming that the artificial collinearity introduced by uncentred interaction terms has been eliminated. ### Statistical Framework **Two-component mixture model.** The ZIG model separates the zero mass from the positive continuous component through a logistic hurdle and a Gamma intensity submodel with independent linear predictors. This separation is empirically motivated: `rain_yesterday` is the dominant predictor of occurrence, while pressure change and wind vectors are the primary drivers of conditional intensity. **Hierarchical random effects structure.** Location-specific random effects are specified as uncorrelated random intercepts and slopes via a diagonal covariance structure. This allows the baseline rainfall level, the humidity-intensity relationship, and the persistence effect to vary across the 49 weather stations in the dataset while sharing a common global mean, without the overparameterisation of an unstructured random covariance matrix. **Non-linear dry spell dynamics.** The probability of rain returning after a dry spell decays in a rapid-then-gradual pattern: it falls steeply over the first ten days and then stabilises. A natural spline with four degrees of freedom, $\text{ns}(\text{days\_since\_rain},\, df = 4)$, captures this non-linear structure in the zero-inflation component. **Estimation and diagnostics.** All models are fitted via maximum likelihood. Residual adequacy is verified using DHARMa simulation-based diagnostics, which test scaled residuals against the theoretical uniform distribution regardless of model family. The final model passes the dispersion test ($p = 0.152$), zero-inflation calibration (ratio = 1.00, $p = 0.512$), and temporal autocorrelation checks on most of the locations. ### Key Results **Thermodynamic interaction threshold.** Rainfall occurrence is gated by the simultaneous occurrence of high humidity and low sunshine in the Rain Corner of the bivariate feature space, quantified by a highly significant interaction term ($p < 0.001$, $F(2,\ 24.1) = 49.375$) that is stable across all model specifications. **Markovian persistence.** Having rained the previous day is the single strongest predictor in the zero-inflation component. Dry states are self-reinforcing with an 85% dry-to-dry transition probability; wet states are transient with a 47% wet-to-wet probability. **Directional wind dynamics.** Southerly and westerly morning wind vectors are the primary directional drivers of rainfall intensity, reflecting the influence of Southern Ocean frontal systems and the mid-latitude westerlies. Morning wind direction is three to four times more informative than peak gust direction. **Spatial heterogeneity.** The random effects analysis establishes that tropical Top End stations produce roughly 75 to 77 percent more times the rainfall of a station with average geographic characteristics at identical atmospheric conditions, while arid interior stations produce approximately half as much. The addition of the mixed-effects structure yields $\Delta\text{AIC} = -6{,}915.32$ over the fixed-effects model. **Classification performance.** The zero-inflation component achieves an AUC of 0.813 at the Youden-optimal threshold of 0.6246, with the model generating false alarms of 22,764. :::