1 Introduction

Rainfall in Australia is a problem of statistical extremes. The continent experiences pronounced hydrological variability: years of drought give way to sudden widespread flooding, driven by large-scale climate oscillations including the El Niño-Southern Oscillation and the Indian Ocean Dipole. For any predictive model, this variability is not background noise to be absorbed but a central structural feature to be represented.

The core challenge is that precipitation data violates the assumptions of standard regression in three simultaneous ways, each of which requires a distinct methodological response. This report develops a framework that addresses all three, building from empirical analysis through feature design to a validated hierarchical model.

1.1 Why Standard Models Fail

Zero-inflation. Of the 142,199 daily observations, 64.05% record exactly zero rainfall, so the median of the response variable is zero. A Gaussian linear model fitted to these data is forced to place probability mass on negative values in order to fit the bulk of the distribution, producing predictions that are physically impossible. This is not a problem that a transformation resolves; it is a consequence of the data-generating process having two qualitatively different states.
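The negative-mass argument can be illustrated with a small simulation. The sketch below uses the 64.05% zero rate quoted above, but the Gamma shape and rate are arbitrary assumptions chosen for illustration, not estimates from the rainfall data.

```r
# Illustrative simulation only: the Gamma shape and rate are assumed values,
# not estimates from the rainfall data.
set.seed(42)
n <- 10000
is_wet <- rbinom(n, size = 1, prob = 1 - 0.6405)    # about 64% of days are dry
rain <- is_wet * rgamma(n, shape = 0.7, rate = 0.1) # positive amounts on wet days only

# An intercept-only Gaussian model amounts to fitting N(mean, sd) to the response.
mu_hat <- mean(rain)
sd_hat <- sd(rain)

# Probability mass the fitted Gaussian places on physically impossible values.
p_negative <- pnorm(0, mean = mu_hat, sd = sd_hat)

cat("Observed share of zeros:", round(mean(rain == 0), 3), "\n")
cat("Gaussian P(rain < 0):   ", round(p_negative, 3), "\n")
```

Because the fitted mean sits well within one standard deviation of zero, a large share of the Gaussian's mass falls below zero regardless of the particular Gamma parameters assumed.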

Heavy right tail. When rain does fall, the distribution of daily amounts is severely right-skewed with a skewness of 9.836 and a kurtosis of 181.146. The maximum recorded value is 371 mm and 151 events exceed 100 mm. Standard models that assume a symmetric or light-tailed error distribution will systematically underestimate both the probability and the magnitude of extreme events.
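For reference, moment-based skewness and kurtosis can be computed as in the sketch below. The helper names and the toy vector are hypothetical, and since the text does not state whether the reported kurtosis is raw or excess, both are returned.

```r
# Moment-based sample skewness and kurtosis (helper names are hypothetical).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

sample_kurtosis <- function(x) {
  d <- x - mean(x)
  raw <- mean(d^4) / mean(d^2)^2
  c(raw = raw, excess = raw - 3)   # excess kurtosis is zero for a Gaussian
}

# Toy wet-day amounts in mm: a single large event dominates both statistics.
wet_amounts <- c(0.2, 0.4, 1.0, 1.6, 2.2, 3.0, 5.4, 8.8, 14.2, 120.0)

sample_skewness(wet_amounts)
sample_kurtosis(wet_amounts)
```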

Temporal dependence. Daily observations are not independent. The Markov chain analysis in Section 4.4.2 establishes that the transition probability from a wet day to a subsequent wet day is 47%, compared to only 15% from a dry day to a wet day. The dry state therefore has an 85% probability of persisting to the following day. Any model that treats consecutive observations as exchangeable ignores a substantial and recoverable source of predictive information.
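A first-order, two-state transition matrix of this kind can be estimated by counting day-to-day transitions. The following is a minimal sketch on a toy occurrence sequence; the vector `wet` is invented for illustration and is not drawn from the dataset.

```r
# Estimate a first-order dry/wet transition matrix from a daily occurrence
# sequence. The sequence is a toy example (1 = wet day, 0 = dry day).
wet <- c(0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0)

# Pair each day with the following day.
state_today    <- factor(head(wet, -1), levels = c(0, 1), labels = c("dry", "wet"))
state_tomorrow <- factor(tail(wet, -1), levels = c(0, 1), labels = c("dry", "wet"))

# Count transitions, then normalise each row to obtain P(tomorrow | today).
counts <- table(today = state_today, tomorrow = state_tomorrow)
transition <- prop.table(counts, margin = 1)
transition
```

With the probabilities reported above, the corresponding matrix would have rows (0.85, 0.15) for dry and (0.53, 0.47) for wet, the 0.53 following from each row summing to one. On multi-station data the pairing would need to be done within each station so that transitions are not counted across station boundaries.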

Each of these violations reflects a physical property of the atmosphere: the discrete threshold between no-precipitation and precipitation conditions, the stochastic intensity of convective and frontal systems once that threshold is crossed, and the temporal persistence of synoptic weather regimes. A model that ignores these properties is not a simplification of the problem; it is a misspecification of it.

1.2 The Modelling Framework

This analysis uses a Zero-Inflated Gamma (ZIG) hierarchical model. The core structure is to treat rainfall occurrence and rainfall intensity as two distinct sub-problems with separate linear predictors, linked through a shared observation mechanism. The formal mathematical specification is given on the project index.
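As a reading aid, one common way to write a zero-inflated Gamma density with separate linear predictors is sketched below. This is a generic textbook form, not the report's own specification: the symbols $\pi_i$ (probability of a zero), $\mu_i$ (mean of positive amounts), $\alpha$ (Gamma shape), the covariate vectors $\mathbf{z}_i$ and $\mathbf{x}_i$, and the logit and log links are all assumptions made for illustration.

$$
f(y_i) =
\begin{cases}
\pi_i, & y_i = 0,\\[4pt]
(1-\pi_i)\,\dfrac{1}{\Gamma(\alpha)}\left(\dfrac{\alpha}{\mu_i}\right)^{\alpha} y_i^{\alpha-1}\exp\!\left(-\dfrac{\alpha\, y_i}{\mu_i}\right), & y_i > 0,
\end{cases}
\qquad
\operatorname{logit}(\pi_i) = \mathbf{z}_i^{\top}\boldsymbol{\gamma},
\quad
\log(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}.
$$

The two coefficient vectors $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$ are estimated separately, which is what allows a predictor to matter for occurrence without being forced to matter for intensity.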

Why not Tweedie? The Tweedie distribution is a natural alternative for zero-inflated positive continuous data, using a compound Poisson-Gamma structure to accommodate zero values alongside a continuous positive component. It was evaluated and rejected on empirical grounds. The EDA in Chapter 4 demonstrates that the predictors of occurrence and the predictors of intensity are not the same: rain_yesterday operates primarily through the hurdle component, determining whether rain occurs at all, while pressure change and wind vectors operate primarily through the conditional intensity component. A Tweedie model constrains both processes to share a single linear predictor, obscuring this physical distinction and reducing interpretability without improving fit, as confirmed by the distributional comparison in Section 8.3.
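The coupling can be seen in the standard compound Poisson-Gamma parameterisation of the Tweedie family. The symbols below (mean $\mu_i$, dispersion $\phi$, power parameter $p$) are generic and are not taken from the report's specification.

$$
P(Y_i = 0) = \exp\!\left(-\frac{\mu_i^{\,2-p}}{\phi\,(2-p)}\right),
\qquad 1 < p < 2,
\qquad \log(\mu_i) = \mathbf{x}_i^{\top}\boldsymbol{\beta}.
$$

With $\phi$ and $p$ held constant, the probability of a dry day is a monotone function of $\mu_i$ alone, so any covariate that raises expected rainfall must also lower the probability of zero rainfall by a determined amount. The model has no way to let a predictor act on occurrence and intensity independently.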

1.3 Report Structure

This document is organised as a progressive validation of modelling choices, beginning with data quality and proceeding through feature design, model construction, diagnostic testing, and formal selection. Each chapter is motivated by empirical findings from the preceding one.

Table 1.1: Chapter structure and content.

Chapter	Content
2: Data Preparation	Two-stage hybrid imputation for 42-48% missingness in key meteorological variables.
3: Imputation Sensitivity	Convergence diagnostics, distributional fidelity, and Rubin's Rules variance decomposition.
4: Exploratory Data Analysis	Quantification of zero-inflation, distributional properties, temporal autocorrelation, and feature interactions.
5: Feature Engineering	Construction of wind vectors, cyclical time encoding, interaction terms, and temporal lag features.
6: Modelling	Progressive ZIG model construction from null baseline through spatial mixed effects.
7: Model Evaluation	Validation via ROC analysis, spatial random effects, DHARMa diagnostics, and autocorrelation testing.
8: Model Selection	Distributional family comparison, pooled D1 Wald tests, and AIC model sequence.
9: Conclusion	Principal findings, limitations, and recommended extensions.

1.4 Objectives

How can high-dimensional missingness be recovered without distorting the predictive signal? Instrumentation gaps affect 47.69% of observations for sunshine and 42.54% for evaporation, two of the most predictive meteorological variables. Listwise deletion is not viable: it would eliminate nearly half the dataset and introduce geographic bias toward well-resourced stations. The imputation pipeline must recover these variables in a way that preserves their distributional properties; Section 3.3 confirms this with Kolmogorov-Smirnov statistics below 0.12 for all four imputed variables.
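A distributional fidelity check of this kind can be sketched with the two-sample Kolmogorov-Smirnov statistic from base R. Both vectors below are simulated stand-ins for observed and imputed values; the Gamma parameters and the added noise are assumptions for illustration only.

```r
# Toy fidelity check: compare observed values of a variable with imputed
# values using the two-sample Kolmogorov-Smirnov statistic.
# Both vectors are simulated stand-ins, not real observed or imputed data.
set.seed(1)
observed <- rgamma(2000, shape = 4, rate = 0.5)
imputed  <- rgamma(2000, shape = 4, rate = 0.5) + rnorm(2000, sd = 0.3)

ks <- ks.test(observed, imputed)
d_stat <- unname(ks$statistic)   # D = largest gap between the two empirical CDFs

cat("KS statistic D:", round(d_stat, 3), "\n")
cat("Below the 0.12 bound reported above:", d_stat < 0.12, "\n")
```

The statistic is the largest vertical gap between the two empirical distribution functions, so a small value indicates that imputation has not shifted or reshaped the variable's distribution appreciably.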

How should structurally non-linear and circular variables be represented in a linear predictor? Wind direction is circular, the annual rainfall cycle is periodic, and the joint suppression of rainfall by high sunshine and low humidity is a threshold effect, not an additive one. Standard numeric encodings discard these structural properties and feed the model misrepresented information. The feature engineering in Section 4.7 develops representations that preserve the geometry: orthogonal vector decomposition for wind direction, sine-cosine encoding for the day-of-year cycle, and a mean-centred multiplicative interaction for the humidity-sunshine threshold.
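The three encodings can be sketched in a few lines of base R. The data frame and its column names are hypothetical, and the sign convention for the wind components (direction measured clockwise from north) is an assumption rather than the report's stated choice.

```r
# Illustrative encodings; the toy data frame and column names are hypothetical.
obs <- data.frame(
  wind_dir_deg = c(0, 90, 180, 270, 350),   # degrees clockwise from north (assumed)
  wind_speed   = c(10, 20, 15, 5, 30),
  day_of_year  = c(1, 91, 182, 274, 365),
  humidity     = c(80, 45, 60, 30, 90),
  sunshine     = c(2, 10, 7, 12, 1)
)

# 1. Wind: decompose speed and direction into orthogonal components, so that
#    350 degrees and 10 degrees map to nearby points instead of opposite ends.
theta <- obs$wind_dir_deg * pi / 180
obs$wind_ns <- obs$wind_speed * cos(theta)   # north-south component
obs$wind_ew <- obs$wind_speed * sin(theta)   # east-west component

# 2. Seasonality: a sine-cosine pair places 31 December next to 1 January.
phase <- 2 * pi * obs$day_of_year / 365.25
obs$doy_sin <- sin(phase)
obs$doy_cos <- cos(phase)

# 3. Interaction: centre both variables before multiplying, which reduces
#    collinearity between the product term and its main effects.
obs$humid_sun <- (obs$humidity - mean(obs$humidity)) *
  (obs$sunshine - mean(obs$sunshine))

round(obs[, c("wind_ns", "wind_ew", "doy_sin", "doy_cos", "humid_sun")], 2)
```

A raw numeric encoding would place a wind direction of 350 degrees far from 0 degrees, and day 365 far from day 1; the trigonometric forms remove that artificial discontinuity.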

How can a single model account for the climatological diversity of a continent? The relationship between atmospheric conditions and rainfall is not uniform across tropical, temperate, and semi-arid zones. A pooled fixed-effects model imposes a single set of coefficients on stations whose rainfall regimes differ by an order of magnitude. Location-specific random effects allow baseline levels and key slopes to vary by station without requiring a separate model for each location, as evaluated in Section 7.2.
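In generic notation, a station-level random intercept and random slope enter a linear predictor as sketched below. The indices ($i$ for day, $j$ for station), the covariate $w_{ij}$ given a station-specific slope, and the covariance matrix $\boldsymbol{\Sigma}$ are illustrative assumptions; which slopes vary, and whether the random effects enter the occurrence component, the intensity component, or both, is determined in the modelling chapters rather than here.

$$
\eta_{ij} = \mathbf{x}_{ij}^{\top}\boldsymbol{\beta} + b_{0j} + b_{1j}\,w_{ij},
\qquad
\begin{pmatrix} b_{0j} \\ b_{1j} \end{pmatrix} \sim \mathcal{N}\!\left(\mathbf{0},\, \boldsymbol{\Sigma}\right).
$$

The fixed coefficients $\boldsymbol{\beta}$ describe the continental average relationship, while $b_{0j}$ and $b_{1j}$ are station-specific departures that are shrunk toward zero according to $\boldsymbol{\Sigma}$.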

How can model adequacy be verified beyond the conditional mean? Standard accuracy metrics measure proximity to observed values but cannot detect whether the model correctly replicates the distributional structure of the data: its zero-inflation rate, tail behaviour, and residual independence. Simulation-based diagnostics via DHARMa test the full empirical residual distribution against the theoretical expectation, providing a distributional adequacy check that likelihood-based fit statistics alone cannot supply.
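The idea behind simulation-based residuals can be illustrated without any package. The sketch below is a hand-rolled randomised rank residual under an assumed zero-inflated Gamma model; it mimics the general principle and does not use or reproduce the DHARMa API. The parameter values and the helper `simulate_zig` are hypothetical.

```r
# Hand-rolled illustration of simulation-based residuals.
# This mimics the general idea only; it does not use the DHARMa package.
set.seed(7)
n <- 500
n_sims <- 250
p_zero <- 0.64; shape <- 0.8; rate <- 0.1   # assumed "fitted" parameters

# Simulator for the assumed model: zero with prob p_zero, else a Gamma draw.
simulate_zig <- function(n) {
  ifelse(runif(n) < p_zero, 0, rgamma(n, shape = shape, rate = rate))
}

observed <- simulate_zig(n)                  # stand-in for the observed response
sims <- replicate(n_sims, simulate_zig(n))   # n x n_sims matrix of simulations

# Residual = randomised rank of each observation among its own simulations.
# Ties (here, the zeros) are spread uniformly so the zero mass causes no spike.
n_below <- rowSums(sims < observed)
n_tied  <- rowSums(sims == observed)
resid <- (n_below + runif(n) * (n_tied + 1)) / (n_sims + 1)

# Under a correctly specified model the residuals are Uniform(0, 1).
uniformity <- ks.test(resid, "punif")
cat("KS p-value for uniformity:", round(uniformity$p.value, 3), "\n")
```

If the assumed zero probability or tail shape were wrong, the residuals would pile up near 0 or 1 or show a step at the zero mass, which is the kind of misfit that a mean-based accuracy metric cannot reveal.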
