Electronic Thesis and Dissertation Repository

Predicting Rare Events from Large Spatiotemporal Data: Application to Wildland Fires and Species Occupancy

Johanna L. de Haan-Ward, Western University

Abstract

Subsampling of large data is commonly employed in statistical modelling with the goal of efficiency. When the event being modelled is rare, the data is imbalanced, so sampling methods preferentially retain the observations that represent event occurrences. This thesis extends methodology for subsampling large data when modelling rare events, motivated by applications in environmetrics and ecology.

The first two projects present extensions to response-based sampling. The response-based sampling approach takes independent samples of event occurrences and non-occurrences, often sampling all occurrences and a small proportion of the non-occurrences. I propose a stratified sampling approach, which defines strata based on a key variable. Independent samples of occurrences and non-occurrences are then drawn from each stratum. The bias induced by this sampling design must be accounted for in the logistic regression model. The first project employs sampling weights in the logistic regression model to account for this bias. The second project instead uses stratum-specific offsets to the same end, which allows the model to include multiple predictors. These approaches are validated using simulation, where they are compared to existing approaches for sampling imbalanced data. I apply these methods to fine-scale human-caused fire occurrence prediction in a region of Ontario, Canada, where stratifying on a measure of fire weather and sampling more extreme observations leads to more locally precise estimates of fire occurrence.
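As a rough illustration of the offset idea, the sketch below simulates a rare binary event, keeps every occurrence, subsamples non-occurrences at a stratum-specific rate, and then fits a logistic regression with a stratum-specific offset. It is a minimal sketch under stated assumptions, not the thesis's implementation: the offset form log(p1/p0) is the standard case-control correction, and the strata cut-point, sampling fractions, coefficients, and all function and variable names are hypothetical choices made only for this example.

```python
import numpy as np


def neg_loglik(beta, X, y, offset):
    """Negative Bernoulli log-likelihood with a fixed offset in the linear predictor."""
    eta = X @ beta + offset
    return np.sum(np.logaddexp(0.0, eta) - y * eta)


def fit_logistic_offset(X, y, offset, n_iter=50, tol=1e-10):
    """Newton-Raphson with step-halving for logistic regression with an offset."""
    beta = np.zeros(X.shape[1])
    current = neg_loglik(beta, X, y, offset)
    for _ in range(n_iter):
        eta = X @ beta + offset
        mu = np.exp(-np.logaddexp(0.0, -eta))  # numerically stable inverse logit
        grad = X.T @ (y - mu)
        hess = X.T @ (X * (mu * (1.0 - mu))[:, None])
        step = np.linalg.solve(hess, grad)
        t = 1.0
        # Halve the step until the objective no longer increases.
        while neg_loglik(beta + t * step, X, y, offset) > current and t > 1e-8:
            t *= 0.5
        beta = beta + t * step
        new = neg_loglik(beta, X, y, offset)
        if current - new < tol:
            break
        current = new
    return beta


rng = np.random.default_rng(0)

# Simulate a large dataset with a rare event (hypothetical coefficients).
n = 200_000
x = rng.normal(size=n)
true_beta = np.array([-6.0, 1.0])
prob = 1.0 / (1.0 + np.exp(-(true_beta[0] + true_beta[1] * x)))
y = (rng.random(n) < prob).astype(float)

# Strata defined on the key variable: stratum 1 holds the more extreme values.
stratum = (x >= 1.0).astype(int)

# Assumed design: keep all occurrences (p1 = 1) and a stratum-specific
# fraction p0 of non-occurrences, sampling more in the extreme stratum.
p0 = np.array([0.02, 0.20])
keep = (y == 1) | (rng.random(n) < p0[stratum])

X_sub = np.column_stack([np.ones(keep.sum()), x[keep]])
y_sub = y[keep]

# Stratum-specific offset log(p1 / p0) with p1 = 1 (assumed correction form).
offset = np.log(1.0 / p0[stratum[keep]])

beta_hat = fit_logistic_offset(X_sub, y_sub, offset)
naive = fit_logistic_offset(X_sub, y_sub, np.zeros_like(offset))

print("rows kept:", int(keep.sum()), "of", n)
print("offset-corrected estimate:", beta_hat)
print("uncorrected estimate:", naive)
```

In this sketch the offset-corrected fit should land near the coefficients used to simulate the full data, while the uncorrected fit on the same subsample does not, which is the bias the offset is meant to remove.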

The third project presents a novel method for subsampling species detection data to fit occupancy models. When a species is rarely detected, detections are far outnumbered by non-detections. I propose a response-based sampling method for species detection data, which allows preferential sampling of the rarer detection observations. I present a method for estimating occupancy and detection probabilities from the subsampled data, as the assumptions of traditional occupancy models no longer hold. I apply this method to detection data of Canada Warbler (Cardellina canadensis) from the Breeding Bird Survey, where the occupancy and detection parameters, including the effects of a habitat-related covariate, can be accurately estimated using just 10% of the original dataset.