4  Exploratory Data Analysis


Chapter Context. This chapter investigates the distributional and structural properties of the daily rainfall data, using findings from the data preparation chapter (Chapter 2) as its starting point rather than repeating the missingness diagnostics performed there. The analyses proceed in a deliberate sequence: target variable distribution, bivariate predictor correlations, temporal dynamics, pressure analysis, seasonal patterns, and feature interactions. Each section produces a specific finding that motivates a corresponding decision in the feature engineering or model specification stages.



4.1 Data Quality Summary

The missingness diagnostics and imputation pipeline are documented in full in Chapter 2. Briefly: sunshine and evaporation are missing in 47.7% and 42.5% of records respectively, their missingness is station-conditional rather than outcome-conditional, and both are recovered via predictive mean matching across ten imputed datasets. The imputed dataset df_final is used throughout the remainder of this chapter; df_clean is used only in this opening section to avoid circularity between imputed features and the response distribution.

Modelling implication. Discarding sunshine and evaporation would sacrifice two of the strongest predictors of rainfall occurrence (see Section 4.3); listwise deletion would eliminate nearly half the dataset and introduce geographic bias toward well-instrumented stations.

Show the code
total_na <- sum(is.na(df_clean))
print(paste("Total missing values:", total_na))
#> [1] "Total missing values: 314146"
Show the code
missing_val(df_clean)

4.2 Distributional Properties of the Target Variable

Understanding the marginal distribution of the response variable is a prerequisite for selecting an appropriate model family. A Gaussian assumption carries substantive claims about the data-generating process that can be directly tested.

Show the code
source(here::here("chapter4", "target.R"))
render_descriptive_stats()
Table 4.1: Descriptive statistics of daily rainfall (mm).
Descriptive Statistics: Daily Rainfall (mm)
Statistic Value
n 1.42e+05
mean 2.361
median 0
sd 8.478
min 0
max 371
q25 0
q75 0.8
iqr 0.8
n_zeros 9.11e+04
pct_zeros 64.051
n_large 151
pct_large 0.106
skewness 9.836
kurtosis 181.146
Show the code
render_prevalence_table()
Table 4.2: Prevalence of zero-inflation (dry days) in the rainfall record.
Prevalence of Zero-Inflation (Dry Days)
Total Days Dry Days (0 mm) Rainy Days (>0 mm) Zero Inflation (%)
142199 91080 51119 64.05108

The descriptive statistics reveal a distribution fundamentally incompatible with Gaussian modelling assumptions.

Zero-inflation. 64.05% of the 142,199 recorded observations are dry days (rainfall = 0 mm). The median is zero. The data-generating mechanism produces two qualitatively different outcomes: no rain at all versus some positive amount. Any model treating the response as a single continuous variable will be forced to place probability mass on negative values and will systematically misestimate the probability of the zero outcome.

Heavy tails. Among non-zero observations, the distribution is severely right-skewed (skewness = 9.836). The standard deviation (8.478 mm) is nearly four times the mean (2.361 mm). Kurtosis of 181.146 relative to 3 for a normal distribution confirms that extreme events occur far more frequently than a Gaussian model would predict. The maximum recorded value is 371 mm, and 151 events exceed 100 mm.

Modelling implication. The conjunction of zero-inflation and extreme positive skew means a single-component model is insufficient. The data implicitly poses two separate questions: does rain occur, and given that it does, how much falls? This motivates the Zero-Inflated Gamma framework adopted in subsequent analysis.


4.3 Bivariate Correlation Structure

Because rainfall is heavily skewed and the relationships are unlikely to be linear, Spearman rank correlation is used throughout, a non-parametric measure that captures monotonic association without requiring linearity or normality.

Show the code
source(here::here("chapter4", "correlation.R"))
render_correlation_table()
Table 4.3: Spearman rank correlation with rainfall, ranked by absolute strength.
Spearman Correlation with Rainfall (Ranked by Strength)
Predictor Correlation (r) P-Value Strength
humidity9am 0.440 0 Moderate
humidity3pm 0.440 0 Moderate
sunshine -0.400 0 Moderate
cloud9am 0.370 0 Moderate
cloud3pm 0.320 0 Moderate
evaporation -0.310 0 Moderate
temp3pm -0.310 0 Moderate
max_temp -0.300 0 Moderate
pressure9am -0.150 0 Small
temp9am -0.150 0 Small
wind_gust_speed 0.130 0 Small
wind_speed9am 0.083 0 Negligible
wind_speed3pm 0.068 0 Negligible
pressure3pm -0.063 0 Negligible
min_temp 0.022 0 Negligible
Show the code
plot_correlation_matrix()
Figure 4.1: Spearman correlation matrix of meteorological features. Red indicates negative correlation; blue indicates positive correlation. Coefficients are shown only for pairs where |r| > 0.3 to reduce visual noise.
Show the code
run_cocor_test()
#> $pearson1898
#> 
#>  Pearson and Filon's z (1898)
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $hotelling1940
#> 
#>  Hotelling's t (1940)
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $williams1959
#> 
#>  Williams' t (1959)
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $hendrickson1970
#> 
#>  Hendrickson, Stanley, and Hills' (1970) modification of Williams' t
#>  (1959)
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $olkin1967
#> 
#>  Olkin's z (1967)
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $dunn1969
#> 
#>  Dunn and Clark's z (1969)
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $steiger1980
#> 
#>  Steiger's (1980) modification of Dunn and Clark's z (1969) using
#>  average correlations
#> 
#> data:  
#> z = 188.91, p-value < 2.2e-16
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $meng1992
#> 
#>  Meng, Rosenthal, and Rubin's z (1992)
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $hittner2003
#> 
#>  Hittner, May, and Silver's (2003) modification of Dunn and Clark's z
#>  (1969) using a backtransformed average Fisher's (1921) Z procedure
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376 
#> 
#> 
#> $zou2007
#> 
#>  Zou's (2007) confidence interval
#> 
#> data:  
#> 
#> alternative hypothesis: true difference in correlations is not equal to 0
#> sample estimates:
#>   r.jk.rho   r.jh.rho       r.kh 
#>  0.4436352 -0.4013063 -0.6206376

Moisture indicators (positive association). Humidity3pm (\(r = 0.44\)) and cloud cover (\(r \approx 0.37\)) show the strongest positive associations. High afternoon humidity indicates that moisture has accumulated in the lower atmosphere over the course of the day. Cloud cover is both a physical precondition for rain and a consequence of the same atmospheric dynamics that produce it.

Radiation and evaporation indicators (negative association). Sunshine (\(r = -0.40\)) and evaporation (\(r = -0.31\)) show the strongest negative associations. Long sunshine hours proxy clear-sky, high-pressure conditions. High evaporation signals warm, dry, low-humidity surface conditions.

Multicollinearity. The correlation heatmap (Figure 4.1) reveals substantial redundancy among predictors. pressure9am and pressure3pm share \(r = 0.96\), and the two temperature readings are similarly collinear. This directly motivates VIF-based feature selection in the feature engineering chapter.

Statistical validation. Steiger’s Z-test comparing the two strongest opposing predictors yields \(z \approx 188.9\), \(p < 2.2 \times 10^{-16}\). With \(N > 140{,}000\), the p-value alone is uninformative; the Z-statistic magnitude confirms that the differential predictive strength is not a sampling artefact. Humidity and sunshine represent genuinely distinct physical forces operating in opposing directions.


4.4 Temporal Structure of Rainfall

4.4.1 Weekly and Seasonal Frequency

Show the code
source(here::here("chapter4", "temporal.R"))
render_weekly_frequency_table()
Table 4.4: Frequency of rainfall days by day of the week.
Frequency of Rainfall Days by Day of the Week
Day Count (n) Percentage
Tue 7508 14.7%
Mon 7480 14.6%
Fri 7378 14.4%
Wed 7342 14.4%
Thu 7314 14.3%
Sat 7057 13.8%
Sun 7040 13.8%
Show the code
render_monthly_frequency_table()
Table 4.5: Frequency of rainfall days by month.
Frequency of Rainfall Days by Month
Month Count (n) Percentage
6 5448 10.7%
7 5250 10.3%
5 4937 9.7%
8 4704 9.2%
3 4444 8.7%
9 4234 8.3%
4 4001 7.8%
10 3770 7.4%
11 3760 7.4%
1 3702 7.2%
12 3562 7.0%
2 3307 6.5%
Show the code
render_month_day_crosstab()
Table 4.6: Cross-tabulation of rainfall frequency by month and day of week.
Cross-tabulation of Rainfall Frequency: Month vs. Day
month Sun Mon Tue Wed Thu Fri Sat Total
1 536 570 514 482 493 567 540 3702
2 480 530 469 466 443 471 448 3307
3 636 645 639 592 662 644 626 4444
4 578 597 566 604 579 515 562 4001
5 710 722 765 714 678 681 667 4937
6 766 801 804 797 760 753 767 5448
7 694 717 728 796 773 818 724 5250
8 612 702 694 679 689 687 641 4704
9 516 571 616 639 648 659 585 4234
10 507 606 579 550 517 456 555 3770
11 556 526 554 480 538 575 531 3760
12 449 493 580 543 534 552 411 3562
Total 7040 7480 7508 7342 7314 7378 7057 51119

Weekly cycle. The distribution of wet days across days of the week is approximately uniform, ranging from 13.8% to 14.7% (Table 4.4). Atmospheric processes operate independently of the social calendar, and the slight variation is consistent with sampling noise. Day carries little predictive information.

Annual cycle. June (10.7%) and July (10.3%) record the highest frequency of wet days (Table 4.5), consistent with Southern Hemisphere winter frontal systems originating from the Southern Ocean. February (6.5%) and December (7.0%) record the lowest frequencies. Table 4.6 confirms this seasonal signal is not an artefact of any particular day of the week. Month is therefore a legitimate predictor warranting explicit model inclusion.

4.4.2 Day-to-Day Persistence: A Markov Chain Analysis

Show the code
markov_tbl <- build_markov_table()
cont_table <- build_cont_table(markov_tbl)
print(cont_table)
#>        No   Yes
#> No  93231 17043
#> Yes 17047 14829
Show the code
markov_stats <- run_markov_stats(cont_table)
#> 
#> Effect Size Interpretation
#> V = 0.3107: Moderate Association
Show the code
plot_markov_chain(markov_stats = markov_stats)
Figure 4.2: Markov chain transition matrix. The asymmetry between dry-to-dry (85%) and wet-to-wet (47%) probabilities is the central finding: dry states are self-reinforcing while wet states are more transient.

The Chi-squared test yields \(\chi^2 \approx 13{,}718\), \(p < 0.001\), rejecting daily independence by a wide margin. Cramer’s \(V \approx 0.31\) confirms a moderate practical effect size.

The transition matrix (Figure 4.2) reveals an important asymmetry. When the previous day was dry, there is an 85% probability of remaining dry: high-pressure systems are persistent and self-reinforcing. When the previous day was wet, there is only a 47% probability of continued rain, meaning wet events are considerably more transient. This asymmetry has a direct atmospheric interpretation: anticyclonic systems can persist for days to weeks, while frontal systems typically pass through more quickly.

rain_today (lagged one day) carries meaningful predictive signal, yet its modest effect size also demonstrates that autocorrelation alone is insufficient. The wet state is too transient for a persistence-only rule, and other meteorological covariates remain necessary.

4.4.3 Dry Spell Dynamics and Temporal Decay

Show the code
dry_spell_dat <- build_dry_spell_data()
Show the code
dry_spell_models <- run_dry_spell_models(dry_spell_dat)
#> Wald test:
#> ----------
#> 
#> Chi-squared test:
#> X2 = 8099.7, df = 1, P(> X2) = 0.0
#> 
#> For each additional day without rain, odds of rainfall decrease by 16.5%
#> 95% CI: [0.831, 0.838]
#> Analysis of Deviance Table
#> 
#> Model 1: rain_binary ~ days_since_rain
#> Model 2: rain_binary ~ splines::ns(days_since_rain, df = 4)
#>   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
#> 1    138444     169706                          
#> 2    138441     162027  3   7678.7 < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Show the code
plot_dry_spell(dry_spell_dat, dry_spell_models)
Figure 4.3: Probability of rainfall by days since last rain. Points are empirical probabilities (plus or minus 95% CI); the dashed line and shaded band show the linear logistic model fit and its uncertainty. The steep initial decline followed by a plateau motivates a spline parameterisation rather than a linear logistic term.

A logistic regression of daily rain occurrence on days_since_rain yields \(OR = 0.835\) (Wald \(\chi^2 \approx 8{,}099\), \(p < 0.001\)): each additional dry day reduces the odds of rainfall by approximately 16.5%. This is consistent with the progressive establishment of stable high-pressure ridges documented in the Markov analysis above.

The linear model is an approximation. A likelihood ratio test against a four-knot natural spline is highly significant (\(\chi^2 \approx 7{,}678\), \(p < 0.001\)). The empirical pattern (Figure 4.3) shows rain probability falling sharply from approximately 48% on Day 1 to around 18% by Day 10, then plateauing in the 12 to 16% range through Days 15 to 30. The linear model underestimates the initial steepness and overestimates the long-drought decline rate. This rapid-then-gradual decay motivates a spline parameterisation of days_since_rain rather than a simple linear term.


4.5 Atmospheric Pressure Dynamics

Show the code
source(here::here("chapter4", "pressure.R"))
pressure_dat <- build_pressure_data()
plot_pressure_qq(pressure_dat)

Show the code
render_pressure_ttests(pressure_dat)
Table 4.7: Welch two-sample t-test results for pressure metrics by rainfall state (Bonferroni-Holm corrected).
Welch Two-Sample t-test Results (Bonferroni-Holm Corrected)
Metric Group 1 Group 2 t-statistic df Adj. P-Value Significance
pressure3pm No Yes 34.656 47862.40 0 ****
pressure9am No Yes 63.848 47259.23 0 ****
pressure_change No Yes -73.395 46002.79 0 ****
Show the code
render_pressure_cohens_d(pressure_dat)
Table 4.8: Cohen’s d effect size analysis for pressure metrics by rainfall state.
Cohen's d Effect Size Analysis
Metric Effect Size (d) Interpretation
pressure3pm 0.227 Small
pressure9am 0.419 Small
pressure_change -0.487 Small
Show the code
plot_pressure_violin(pressure_dat)
Figure 4.4: Violin plots of atmospheric pressure variables by rainfall state. The pressure_change panel shows the most pronounced separation between dry and rainy days.
Show the code
plot_pressure_means()
Figure 4.5: Mean pressure levels and diurnal drop by rainfall state. The suppressed diurnal pressure drop on rainy days relative to dry days is a stronger discriminating signal than the absolute pressure baseline.

Normality and Test Validity. Q-Q plots reveal modest tail deviations across all three pressure metrics, but with \(N > 140{,}000\) the Central Limit Theorem ensures that sample means are asymptotically normal regardless of the underlying marginal distribution. Welch’s t-test is used throughout to avoid the equal-variance assumption, and Bonferroni-Holm correction is applied across the three comparisons.

Baseline Pressure. Rainy days show significantly lower mean atmospheric pressure than dry days at both observation times (Table 4.7). Cohen’s \(d\) for the baseline measures ranges from 0.227 to 0.419, a small effect (Table 4.8). Lower absolute pressure is a necessary but not sufficient condition for rainfall: the overlap between the two distributions is substantial enough that pressure level alone cannot discriminate reliably between states.

Diurnal Pressure Change. The stronger discriminating signal lies not in the baseline but in how pressure evolves across the day (Figure 4.4, Figure 4.5). Cohen’s \(d = -0.487\) for pressure_change exceeds the effect size of either absolute pressure reading, and the Welch t-statistic of \(-73.395\) is the largest in magnitude among all three metrics (Table 4.7, Table 4.8). The rate of pressure change across the day is therefore a more discriminating indicator of rainfall state than the morning or afternoon level in isolation, and it is entered as a primary derived feature in the engineering pipeline.


4.6 Seasonal Rainfall Intensity

4.6.1 Cyclical Patterns

Show the code
source(here::here("chapter4", "seasonality.R"))
seasonal_dat <- build_seasonal_data()
monthly_stats <- build_monthly_stats()
Show the code
plot_monthly_ridgeline()
Figure 4.6: Ridgeline plot of monthly log-rainfall distributions. Shifting peaks illustrate how rainfall intensity varies cyclically relative to the global median (dashed line). Solid vertical lines within each ridge mark the monthly median.
Show the code
plot_seasonal_facet(seasonal_dat)
Figure 4.7: Seasonal rainfall patterns by meteorological season. Summer shows high variance and a pronounced right tail; winter shows narrower, more predictable distributions. Quartile lines mark the 25th, 50th, and 75th percentiles within each month.
Show the code
plot_mean_rain_bar()
Figure 4.8: Mean rainfall intensity on non-zero days by month. February records the highest mean intensity at 10.1 mm, nearly double July’s 4.9 mm, illustrating that frequency and intensity vary in opposite directions across the seasonal cycle.

The frequency analysis in Section 4.4.1 established when rain tends to occur. This section investigates how much falls when it does.

January and February distributions are shifted systematically right of the global median: summer storms are considerably more intense when they arrive, even though they occur less frequently. June through August cluster left, representing lower but more consistent rainfall. February records the highest mean intensity per wet day at 10.1 mm, nearly double July’s 4.9 mm (Figure 4.8).

Summer rainfall has a wide interquartile range and a pronounced right tail beyond 100 mm per day, reflecting episodic convective storms (Figure 4.7). Winter shows a narrower, more peaked distribution. Both the frequency variation (more rain in winter) and the intensity variation (heavier rain in summer) carry independent information, and a complete model must account for both dimensions.

Encoding implication. Because the seasonal cycle is continuous, the transition from December to January is climatologically smooth, and treating Month as an unordered factor discards that proximity information. Cyclical encoding via sine and cosine transformations of the month number preserves the circular geometry of the annual cycle.

4.6.2 Statistical Validation

Show the code
render_seasonal_desc_stats(seasonal_dat)
Table 4.9: Descriptive statistics of rainfall intensity by meteorological season.
Descriptive Statistics of Rainfall Intensity by Season
Season Variable N (Events) Mean (mm) SD (mm)
Summer rainfall 10571 9.070 18.195
Autumn rainfall 13382 6.667 13.495
Winter rainfall 15402 5.463 9.809
Spring rainfall 11764 5.651 10.495
Show the code
seasonal_tests <- run_seasonal_tests(seasonal_dat)
render_kruskal_table(seasonal_tests)
Table 4.10: Kruskal-Wallis test of seasonal differences in rainfall intensity.
Statistical Significance of Seasonal Differences (Non-Parametric)
Test Chi-squared df P-value Effect Size Magnitude
Kruskal-Wallis Rank Sum Test 230.44 3 <0.001 0.0044 small
* Effect size: Epsilon-squared.
Alpha = 0.05
Show the code
render_dunns_table(seasonal_tests)
Table 4.11: Dunn’s pairwise comparison test with Bonferroni correction.
Dunn's Pairwise Comparison Test (Bonferroni Corrected)
Group 1 Group 2 Z-Statistic Adj. P-Value Significance
Summer Autumn -11.468 0.000 ****
Summer Winter -14.483 0.000 ****
Summer Spring -11.049 0.000 ****
Autumn Winter -2.851 0.026 *
Autumn Spring 0.091 1.000 ns
Winter Spring 2.846 0.027 *
Show the code
plot_seasonal_stat_groups(seasonal_dat, seasonal_tests)
Figure 4.9: Mean seasonal rainfall with statistical groupings. Shared letters above bars indicate seasons not significantly different from one another at the Bonferroni-corrected threshold (Dunn’s test, p < 0.05).

The Kruskal-Wallis test yields \(\chi^2 = 230.44\), \(p < 0.001\), strongly rejecting the null of equal seasonal distributions (Table 4.10). The epsilon-squared effect size (\(\eta^2 \approx 0.0044\)) is small: the seasonal signal is real but accounts for only a minor fraction of total variance in rainfall intensity, reinforcing the necessity of a multivariate approach.

Post-hoc Dunn’s tests identify three distinct statistical groups (Figure 4.9, Table 4.11). Summer stands alone as the most intense season (\(\bar{x} = 9.070\) mm, \(p < 0.001\) vs. all others). Autumn and Spring are statistically indistinguishable (\(p_\text{adj} = 1.000\)). Winter registers the lowest mean intensity (\(\bar{x} = 5.463\) mm) and is statistically distinct from Spring (\(p_\text{adj} = 0.027\)).


4.7 Feature Interactions: The “Rain Corner”

Show the code
source(here::here("chapter4", "interactions.R"))
plot_rain_corner()
Figure 4.10: Bivariate density of Humidity3pm versus Sunshine, faceted by rainfall occurrence. Rain events concentrate tightly in the upper-left corner (high humidity, low sunshine), while dry days are dispersed broadly across the feature space. This structural asymmetry justifies a multiplicative interaction term in the model specification.

While individual correlation analysis in Section 4.3 identifies humidity and sunshine as primary predictors, a standard additive model assumes their effects on the log-odds of rainfall are independent. The bivariate density plots (Figure 4.10) test this assumption directly by visualising the joint distribution of these two features conditioned on the rainfall outcome.

Dry Day Structure. The feature space for dry days shows a concentrated region in the lower-right of the humidity-sunshine plane, where sunshine is moderate-to-high and afternoon humidity is low. This reflects the high-pressure suppression regimes documented in Section 4.4.2, where clear skies drive both elevated sunshine duration and suppressed humidity. Neither variable in isolation is a reliable discriminator between outcomes; the specific combination carries the predictive signal.

The Rain Corner. Precipitation events concentrate tightly in the upper-left quadrant of the feature space, where afternoon humidity is high and sunshine hours are low simultaneously. This cluster is not a feature of either variable’s marginal distribution: high humidity alone and low sunshine alone each occur on dry days with regularity. It is the joint occurrence that distinguishes the rainfall regime. The density mass that is dispersed broadly across the dry-day panel collapses into this single region when rain is present, a pattern that an additive model would be structurally unable to capture.

Modelling Implication. The asymmetry between the two panels confirms a genuine statistical interaction: the effect of humidity on rainfall probability is conditional on the level of sunshine, and vice versa. This justifies including a multiplicative interaction term (\(\text{Humidity3pm} \times \text{Sunshine}\)) in the model specification, in addition to both main effects.


4.8 Summary and Modelling Implications

The preceding analyses characterise the dataset along six interconnected dimensions, each producing a specific modelling requirement.

The distributional structure of the target variable; 64.05% zeros, extreme positive skew, and kurtosis of 181.146 rules out any single-component Gaussian model. A two-part framework separating occurrence from intensity is the appropriate response.

The missingness analysis (documented in Chapter 2, summarised in Section 4.1) establishes that imputation is necessary to preserve both sample size and the predictive signal in sunshine and evaporation, and that the predictive mean matching procedure does not introduce outcome-related bias. Downstream models incorporating these predictors must be fitted across all imputed completions and pooled via Rubin’s combining rules.

The correlation structure identifies humidity, cloud cover, sunshine, and evaporation as the strongest individual predictors, and flags severe multicollinearity among morning-afternoon pairs. VIF-based feature selection is required.

The temporal analyses establish that both the day-to-day Markov state and the cumulative dry spell duration carry predictive signal. Month should be cyclically encoded. The dry spell decay is non-linear and warrants a natural spline parameterisation of days_since_rain.

The pressure analysis identifies the diurnal pressure change as the more discriminating pressure-derived feature, with Cohen’s \(d = -0.487\) separating rainy from dry days more effectively than absolute pressure level.

The interaction analysis provides empirical justification for a \(\text{Humidity} \times \text{Sunshine}\) interaction term, reflecting the physical reality that precipitation concentrates where high moisture and low solar radiation coincide in the Rain Corner of the feature space.