9 Conclusion

This analysis demonstrates that Australian daily rainfall is a structured physical phenomenon generated by distinct, identifiable processes operating at different temporal and spatial scales. The final Mixed-Effects Zero-Inflated Gamma (ZIG) model succeeds because its architecture was designed to reflect that structure.

The core argument, supported across seven analytical chapters, is that rainfall requires two qualitatively separate submodels. Occurrence and intensity are governed by different atmospheric mechanisms, respond to different covariates, and show different temporal and spatial patterns. Conflating them in a single linear predictor, as a standard Gaussian linear model or a Tweedie model does, produces predictions that are physically incoherent, statistically misspecified, and inferior in predictive accuracy. The ZIG framework assigns each process its own linear predictor and parameter vector, making the structural separation explicit.

9.1 Summary of Principal Findings

Distributional properties of the target variable. The 142,199 daily observations show a zero-inflation rate of 64.05%, a skewness of 9.836, and a kurtosis of 181.146. The maximum recorded value is 371 mm, and 151 events exceed 100 mm. These properties collectively rule out a single-component Gaussian model: any such model would assign positive probability to negative rainfall values and systematically underestimate the probability of the zero outcome. The ZIG family, identified as appropriate by the EDA in Chapter 4, directly addresses both properties through a mixture of a point mass at zero and a Gamma-distributed positive component.
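For concreteness, the mixture described above can be written out. This is a sketch in notation assumed here rather than taken from the report: $\pi_i$ denotes the probability of a dry day, $\mu_i$ the mean rainfall on a wet day, and $\phi$ the Gamma dispersion.

$$
f(y_i) =
\begin{cases}
\pi_i, & y_i = 0, \\
(1 - \pi_i)\,\text{Gamma}(y_i \mid \mu_i, \phi), & y_i > 0,
\end{cases}
\qquad
\mathbb{E}[y_i] = (1 - \pi_i)\,\mu_i .
$$

Because the Gamma component places no mass at zero, every observed zero is attributed to the point mass, and the expected rainfall is the wet-day mean scaled by the wet-day probability.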

The Rain Corner interaction. Rainfall occurrence is not a linear function of humidity or sunshine individually. It concentrates in a specific region of the bivariate feature space where two conditions hold at once: afternoon humidity exceeds approximately 60% and daily sunshine falls below approximately five hours. The Sun-Humidity interaction term included in M4 is jointly significant at $F(2,\ 24.1) = 49.375$, $p < 0.001$ (see Table 8.6), and the interaction is physically interpretable: the conjunction of a saturated atmosphere and suppressed solar radiation creates the thermodynamic conditions necessary for precipitation. An additive model cannot represent this conditional structure.
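Why an additive specification falls short can be seen from the linear predictor. The sketch below uses assumed symbols ($H$ for afternoon humidity, $S$ for daily sunshine hours, $\eta$ for the linear predictor of either submodel) and is not the fitted M4 specification.

$$
\eta = \beta_0 + \beta_H H + \beta_S S + \beta_{HS}\,H S,
\qquad
\frac{\partial \eta}{\partial H} = \beta_H + \beta_{HS}\,S .
$$

With $\beta_{HS} = 0$ the effect of humidity is the same at every level of sunshine. Only a non-zero interaction coefficient lets the humidity effect change as sunshine falls, which is the pattern the Rain Corner describes.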

Markovian persistence. Having rained the previous day is the single strongest predictor in the zero-inflation submodel across the entire model sequence. The Markov Chain analysis in Section 4.4.2 established that dry states have an 85% probability of remaining dry from one day to the next, while wet states have only a 47% probability of continued rain. Encoding this first-order dependency through rain_yesterday and days_since_rain was essential: the temporal features block (Day Cos, Day Sin, rain_yesterday, Cloud Development) produced the largest pooled $D_1$ $F$-statistic in the model sequence at $F(4,\ 71.3) = 802.158$, $p < 0.001$ (Table 8.4).
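The two persistence probabilities quoted above fully determine a two-state chain, so its long-run behaviour can be worked out directly. The following is a minimal base-R sketch; the object names are illustrative, and the off-diagonal transition probabilities are obtained by complement rather than taken from Section 4.4.2.

```r
# Two-state first-order Markov chain built from the two persistence
# probabilities quoted above; the off-diagonal entries follow by complement.
P <- matrix(c(0.85, 0.15,   # dry -> dry, dry -> wet
              0.53, 0.47),  # wet -> dry, wet -> wet
            nrow = 2, byrow = TRUE,
            dimnames = list(c("dry", "wet"), c("dry", "wet")))

# Stationary distribution of a two-state chain in closed form:
# pi_dry = P(wet -> dry) / (P(dry -> wet) + P(wet -> dry)).
pi_dry <- P["wet", "dry"] / (P["dry", "wet"] + P["wet", "dry"])
stationary <- c(dry = pi_dry, wet = 1 - pi_dry)

# Expected run length of a state with self-transition probability p
# is geometric: 1 / (1 - p).
mean_spell <- 1 / (1 - diag(P))

print(round(stationary, 3))
print(round(mean_spell, 2))
```

Under these two probabilities alone, the chain spends about 78% of days in the dry state, with dry spells averaging roughly 6.7 days and wet spells roughly 1.9 days. These are implications of the quoted transition probabilities, not additional results from the report; whether the long-run dry share matches the 64.05% zero rate depends on the wet-day threshold used in Section 4.4.2, which is not restated here.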

Wind vector dynamics. Wind direction is a circular variable whose information is destroyed by naive linear encoding. Decomposing compass bearings into orthogonal east-west and north-south components revealed that southerly and westerly airflows are the dominant directional drivers of rainfall intensity, consistent with the influence of Southern Ocean frontal systems. The four wind vector parameters are jointly significant at $F(4,\ 272.8) = 133.422$, $p < 0.001$ (Table 8.7), with a relative increase in variance (RIV) of 0.497, the lowest in the sequence, indicating stable identification of wind direction signal across all imputed datasets.
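The decomposition can be illustrated in a few lines of base R. This is a sketch under stated assumptions: the 16-point compass labels, the helper name `wind_components`, and the column names are hypothetical and may differ from the feature engineering actually used.

```r
# 16-point compass labels mapped to bearings in degrees clockwise from north.
# The label set is an assumption about how direction is recorded.
compass <- c("N", "NNE", "NE", "ENE", "E", "ESE", "SE", "SSE",
             "S", "SSW", "SW", "WSW", "W", "WNW", "NW", "NNW")
bearing_deg <- setNames(seq(0, 337.5, by = 22.5), compass)

# Hypothetical helper: split a compass label into orthogonal unit components
# of the recorded bearing.
wind_components <- function(direction) {
  theta <- bearing_deg[direction] * pi / 180
  data.frame(
    direction   = direction,
    east_west   = unname(sin(theta)),  # +1 = due east, -1 = due west
    north_south = unname(cos(theta))   # +1 = due north, -1 = due south
  )
}

wc <- wind_components(c("N", "NNW", "S", "W"))
print(wc)

# N and NNW are neighbours on the circle: close in component space,
# but 337.5 degrees apart under a naive numeric encoding.
dist_components <- sqrt(sum((unlist(wc[1, 2:3]) - unlist(wc[2, 2:3]))^2))
dist_naive <- abs(bearing_deg[["N"]] - bearing_deg[["NNW"]])
print(c(components = dist_components, naive = dist_naive))
```

The component encoding keeps adjacent bearings adjacent, whereas treating degrees as a linear covariate places N and NNW at opposite ends of the scale.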

Spatial heterogeneity and random effects. The random intercepts from the final mixed-effects model (M6) reveal location-specific baseline deviations in rainfall intensity that persist after all dynamic weather predictors are controlled for. Tropical Top End stations produce roughly 1.7 to 1.8 times more rainfall than the national average under identical atmospheric conditions, while arid interior stations produce approximately half as much. The addition of splines and random effects over M5 yields $F(10,\ 138.9) = 75.283$, $p < 0.001$, with $\Delta\text{AIC} = -6{,}915.32$ (Table 8.8). A pooled fixed-effects model would suppress this spatial variation by averaging across all stations, systematically over-predicting rainfall at inland stations and under-predicting it at tropical ones.
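Assuming the intensity submodel uses a log link (an assumption, though one consistent with station effects being reported as multiples of the national average), a station-level random intercept $b_j$ acts multiplicatively on expected wet-day rainfall. The symbols below are illustrative.

$$
\log \mu_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_j
\;\Longrightarrow\;
\frac{\mu_{ij}}{\mu_{ij}\big|_{b_j = 0}} = e^{b_j},
\qquad
e^{b_j} = 1.7 \Rightarrow b_j \approx 0.53,
\qquad
e^{b_j} = 0.5 \Rightarrow b_j \approx -0.69 .
$$

On this reading, the Top End multipliers correspond to random intercepts a little above $+0.5$ on the log scale, and the arid interior to intercepts near $-0.7$.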

Temporal autocorrelation elimination. The Durbin-Watson statistic on the residuals of the final model is close to 2.0 at most locations, the value expected when successive residuals are uncorrelated. The temporal feature engineering documented in Chapter 5, comprising rain_yesterday, rainfall_ma7, humidity_ma7, and the natural spline of days_since_rain, successfully absorbed the temporal dependence present in the raw data. What remains in the residuals is not the signature of an unmodelled temporal component but genuine unpredictability from the available predictors.
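To make the benchmark of 2.0 concrete, the statistic can be computed by hand. The sketch below uses simulated residuals, not the model's, and the function name `durbin_watson` is illustrative.

```r
# Durbin-Watson statistic for a residual series ordered in time:
# DW = sum((e_t - e_{t-1})^2) / sum(e_t^2), approximately 2 * (1 - rho_1),
# where rho_1 is the lag-1 autocorrelation of the residuals.
durbin_watson <- function(e) {
  sum(diff(e)^2) / sum(e^2)
}

set.seed(1)
n <- 10000

# Uncorrelated residuals: DW should sit near 2.
white <- rnorm(n)

# Positively autocorrelated AR(1) residuals with rho = 0.5: DW near 1.
ar1 <- as.numeric(arima.sim(model = list(ar = 0.5), n = n))

dw_white <- durbin_watson(white)
dw_ar1   <- durbin_watson(ar1)
print(c(white = dw_white, ar1 = dw_ar1))
```

In practice the statistic would be computed separately for each location on residuals sorted by date, since pooling stations would mix unrelated series.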

9.2 Model Performance

The final model achieves the performance metrics in Table 9.1, validated against the full dataset with DHARMa simulation-based diagnostics confirming residual adequacy.

Table 9.1: Summary of performance metrics for the final Mixed-Effects ZIG model (M6), validated on the full dataset.

Metric	Value
AUC (occurrence submodel)	0.813
Youden-optimal threshold	0.6246
Brier Score	0.1654
Brier Skill Score	0.2819
Mean absolute error (all observations)	2.760 mm
Mean absolute error (rain-days only)	5.607 mm
RMSE (all observations)	7.567 mm
RMSE (rain-days only)	12.263 mm
Zero-inflation calibration ratio	1.00 (p = 0.512)
Dispersion test (p-value)	0.152

The AUC of 0.813 means that, for a randomly drawn pair of one dry day and one wet day, the occurrence submodel ranks the two correctly in 81.3% of cases; it is a threshold-free measure of discriminative performance that does not depend on any particular operating point. The Youden-optimal threshold of 0.6246 is the operating point to use when a binary dry/wet classification is required. The Brier Score of 0.1654 and Brier Skill Score of 0.2819 are computed from the predicted probabilities themselves, independently of that threshold, and quantify the probabilistic accuracy of the occurrence submodel relative to a naive climatological baseline. The MAE of 2.760 mm represents the average absolute deviation between the model’s point prediction and the recorded observation across all days; the higher rain-day MAE of 5.607 mm reflects the harder prediction problem on positive observations once dry days are excluded. The zero-inflation calibration ratio of 1.00 ($p = 0.512$) indicates that the zero-inflation component reproduces the 64.05% dry-day rate: the observed count of zeros is consistent with counts simulated from the fitted model at the observed covariate values, not merely at the intercept. The dispersion test yields $p = 0.152$, giving no evidence that the residual variance deviates from what the fitted ZIG family implies.
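The occurrence metrics can be reproduced from predicted probabilities with a few lines of base R. This is a sketch on toy data; the function names are illustrative, and the final check assumes that the climatological reference forecasts the 64.05% base rate on every day.

```r
# Rank-based AUC (Mann-Whitney form): the share of positive/negative pairs
# in which the positive case receives the higher score; ties count half.
auc <- function(y, score) {
  r  <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Brier score: mean squared error of the predicted probabilities.
brier <- function(y, p) mean((p - y)^2)

# Skill relative to a constant forecast at the observed base rate.
brier_skill <- function(y, p) {
  1 - brier(y, p) / brier(y, rep(mean(y), length(y)))
}

# Toy data, purely illustrative.
y <- c(1, 1, 0, 0)
p <- c(0.9, 0.6, 0.7, 0.2)
print(c(auc = auc(y, p), brier = brier(y, p), bss = brier_skill(y, p)))

# Consistency check on the reported figures, assuming the reference forecast
# is the 64.05% base rate: reference Brier = 0.6405 * (1 - 0.6405).
bss_from_report <- 1 - 0.1654 / (0.6405 * (1 - 0.6405))
print(round(bss_from_report, 3))
```

The last line gives roughly 0.282, in line with the reported 0.2819; the small difference in the fourth decimal is consistent with the Brier Score having been rounded to four places. Neither score involves the classification threshold.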

9.3 Limitations

Tail underestimation. The Gamma distribution accommodates right skew but has exponentially declining tails rather than power-law tails. Rare events exceeding approximately 50 mm per day are systematically under-predicted. The model correctly identifies conditions conducive to extreme rainfall but underestimates the magnitude of such events. This is an inherent property of the Gamma family, not a failure of specification within that family.

Stationarity. All parameters are estimated from a fixed historical record and are assumed stable over time. The model does not encode non-stationarity arising from shifts in the Southern Oscillation Index, the Indian Ocean Dipole, or long-term anthropogenic forcing. Predictions made under climate trajectories outside the range of conditions in the historical record are extrapolations beyond the scope of the model.

Observational confounding. Coefficient estimates represent conditional associations, not causal effects. Unobserved variables including soil moisture feedback, aerosol loading, and land surface characteristics may confound the estimated meteorological relationships. The model is appropriate for prediction under conditions similar to the training data; causal attribution of rainfall to specific drivers requires experimental or quasi-experimental designs.

9.4 Recommended Extensions

Extreme value tail augmentation. A natural extension is a composite model that applies the ZIG framework to the body of the distribution and grafts a Generalised Pareto Distribution (GPD) to exceedances above the 95th percentile. This would preserve predictive accuracy on typical days while correcting the systematic tail underestimation that limits the model’s utility for flood risk applications.
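How the proposed splice would assign tail probabilities can be sketched as follows. Everything here is hypothetical: the function names, the threshold, and the GPD parameters are illustrative values rather than estimates, and the 95th percentile is taken to refer to the wet-day distribution.

```r
# Survival function of a Generalised Pareto Distribution for exceedances
# x = y - u >= 0, with scale sigma > 0 and shape xi (xi = 0 is the
# exponential limit).
gpd_survival <- function(x, sigma, xi) {
  if (abs(xi) < 1e-12) return(exp(-x / sigma))
  pmax(1 + xi * x / sigma, 0)^(-1 / xi)
}

# Hypothetical spliced tail for wet-day rainfall: below the threshold u the
# body model applies (not shown); above it, the exceedance probability is
# the probability of exceeding u times the GPD survival of the excess.
spliced_tail_prob <- function(y, u, p_exceed_u, sigma, xi) {
  stopifnot(all(y >= u))
  p_exceed_u * gpd_survival(y - u, sigma, xi)
}

# Illustrative values only; none are estimates from the report.
u <- 20; sigma <- 10; xi <- 0.2
p_u <- 0.05  # a 95th-percentile threshold leaves 5% of wet days above u

tail_probs <- spliced_tail_prob(c(20, 30, 50, 100), u, p_u, sigma, xi)
print(signif(tail_probs, 3))
```

A positive shape parameter gives the polynomially decaying tail that the Gamma family lacks, while the body of the distribution is left to the existing model.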

Large-scale climate indices. The Southern Oscillation Index and the Indian Ocean Dipole modulate Australian rainfall at interannual timescales and are absent from the current feature set. Including them as slowly varying covariates in either submodel would allow the model to shift its baseline predictions between El Niño and La Niña years, addressing a known gap in predictive accuracy for anomalous climate regimes.

Dynamic updating. The current model is static, estimated once from the full historical record. A natural operational extension would implement sequential re-estimation of the random effects as new observations accumulate, allowing location-specific baselines to track gradual shifts in local climate without re-fitting the full fixed-effects structure.

9.5 Closing Statement

The methodological progression across this report reflects a single guiding principle: statistical modelling should be constrained by the properties of the system under study. Australian rainfall is zero-inflated, spatially heterogeneous, temporally autocorrelated, and driven by non-linear thermodynamic interactions. The Mixed-Effects ZIG model is not a complex model arbitrarily applied to a difficult problem. It is the appropriately structured model for a phenomenon whose properties are well-characterised. Its performance advantage over a linear baseline is a consequence of better alignment between model structure and physical reality, not of added parameters alone. The AIC reduction of 61,518.51 between the null baseline and the final model, accumulated across six theoretically motivated extensions, each of which is individually significant at $p < 0.001$, quantifies how much that alignment is worth.
