Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article

Degree

Doctor of Philosophy

Program

Statistics and Actuarial Sciences

Supervisor

Bonner, Simon J.

2nd Supervisor

Woolford, Douglas G.

Abstract

Subsampling of large datasets is commonly employed in statistical modelling to improve computational efficiency. When the event being modelled is rare, the data are imbalanced, so sampling methods preferentially retain the observations that represent occurrences of the rare event. This thesis extends methodology for the subsampling of large data when modelling rare events, motivated by applications in environmetrics and ecology.

The first two projects present extensions to response-based sampling. The response-based approach takes independent samples of event occurrences and non-occurrences, often retaining all occurrences and a small proportion of the non-occurrences. I propose a stratified sampling approach, which defines strata based on a key variable; independent samples of occurrences and non-occurrences are then drawn from each stratum. The bias induced by this sampling design must be accounted for in the logistic regression model. The first project employs sampling weights in the logistic regression model to account for this bias. The second project instead uses stratum-specific offsets to the same end, which allows the model to include multiple predictors. These approaches are validated using simulation, where they are compared to existing approaches for sampling imbalanced data. I apply these methods to fine-scale human-caused fire occurrence prediction in a region of Ontario, Canada, where stratifying on a measure of fire weather and sampling more extreme observations leads to more locally precise estimates of fire occurrence.
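As an illustration only, the stratified response-based design described above might be sketched as follows. This is a minimal sketch, not the thesis's implementation: the function name `stratified_response_sample`, the `zero_rates` mapping, and the toy data are all hypothetical. It assumes every occurrence is retained, so the inverse-probability weight of a sampled non-occurrence in stratum s is 1/p0_s and the stratum-specific offset is log(1/p0_s), where p0_s is the sampling rate for non-occurrences in that stratum.

```python
import math
import random


def stratified_response_sample(y, stratum, zero_rates, seed=0):
    """Keep every occurrence (y == 1) and a stratum-specific share of the 0s.

    y          : list of 0/1 responses
    stratum    : list of stratum labels, one per observation
    zero_rates : dict mapping stratum label -> sampling rate for 0s (hypothetical name)

    Returns a list of (index, weight, offset) tuples, where
      weight = 1 / (probability the observation was sampled), and
      offset = log(p1 / p0) for the observation's stratum, with p1 = 1
               because all occurrences are assumed to be retained.
    """
    rng = random.Random(seed)
    sample = []
    for i, (yi, s) in enumerate(zip(y, stratum)):
        p0 = zero_rates[s]
        offset = -math.log(p0)  # log(1 / p0)
        if yi == 1:
            # Occurrences are always sampled, so their weight is 1.
            sample.append((i, 1.0, offset))
        elif rng.random() < p0:
            # Non-occurrences are sampled with probability p0.
            sample.append((i, 1.0 / p0, offset))
    return sample


# Toy data: sample non-occurrences more heavily in the "extreme" stratum.
y = [1, 0, 0, 0, 1, 0, 0, 0]
stratum = ["extreme", "extreme", "mild", "mild", "mild", "mild", "extreme", "mild"]
rates = {"extreme": 0.5, "mild": 0.1}
sample = stratified_response_sample(y, stratum, rates, seed=1)
```

The weights would be supplied to a weighted logistic regression, whereas the offsets would enter the linear predictor as a fixed term. Either column can be passed to standard fitting software; the choice between them corresponds to the two projects described above.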

The third project presents a novel method for subsampling species detection data to fit occupancy models. When a species is rarely detected, the detections will be far outnumbered by the non-detections. I propose a response-based sampling method for species detection data, which allows preferential sampling of the rarer detection observations. I present a method for estimating occupancy and detection probabilities from the subsampled data, as the assumptions of traditional occupancy models no longer hold. I apply this method to detection data for Canada Warbler (Cardellina canadensis) from the Breeding Bird Survey, where we can accurately estimate the occupancy and detection parameters using just 10% of the original dataset, including the effect of a habitat-related covariate.
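For background, the site-level likelihood of a traditional single-season occupancy model can be sketched as below. This is the standard model with constant occupancy probability `psi` and detection probability `p`, not the adjusted method developed in the thesis; the function name `site_log_likelihood` is hypothetical. It shows why an all-zero detection history is ambiguous: the site may be unoccupied, or occupied with the species missed on every visit.

```python
import math


def site_log_likelihood(detections, psi, p):
    """Log-likelihood of one site's detection history (list of 0/1 per visit).

    Assumes the standard occupancy model: the site is occupied with
    probability psi, and on each visit to an occupied site the species is
    detected independently with probability p.
    """
    k = len(detections)  # number of visits
    d = sum(detections)  # number of visits with a detection
    occupied = psi * p**d * (1 - p) ** (k - d)
    if d > 0:
        # At least one detection: the site must be occupied.
        return math.log(occupied)
    # No detections: occupied but never detected, or not occupied at all.
    return math.log(occupied + (1 - psi))


def total_log_likelihood(histories, psi, p):
    """Sum the site-level log-likelihoods over independent sites."""
    return sum(site_log_likelihood(h, psi, p) for h in histories)
```

Preferentially sampling sites with detections changes the mix of detection histories in the subsample, which is one way to see why this unadjusted likelihood would no longer apply to subsampled data.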

Summary for Lay Audience

This thesis develops new methodology for studying rare events using large datasets. While large datasets contain valuable information, common analyses, such as fitting a logistic regression model, can take prohibitively long because of the number of observations. A common approach to mitigating this issue is to conduct the analysis on a small portion of the data, known as subsampling. When analyzing the occurrence of a rare event, these large datasets consist of many non-occurrence observations (or 0s) and very few occurrence observations (or 1s). Subsampling methods for such data typically focus on sampling more of the 1s and a small portion of the 0s.

This thesis expands on current methodology for selecting that portion of 0s, and on the methods for adjusting the analyses to reflect this sampling design. Commonly, a random sample of 0s is taken into the subsample, which may miss some of the crucial information in the full data simply by chance. I use a stratified sampling approach to intentionally sample 0s that occur under conditions of interest. For example, these may be 0s that occur under conditions conducive to the event occurring. I then develop two approaches to adjusting the logistic regression model to reflect this subsampling.

I apply this sampling and modelling method to human-caused wildland fire occurrence, where I demonstrate that intentionally sampling more 0s (i.e., non-fires) that occur under hot and dry conditions improves the precision of the estimate of fire occurrence.

The final part of this thesis focuses on a method for modelling the occupancy of rare species using large data. Due to the rarity of the species, repeated surveys of the same sites produce many more 0s (undetected) than 1s (detected). I propose a method that subsamples more of the sites with at least one detection, together with a modelling framework that can accurately model occupancy while adjusting for this subsampling process. I apply this method to a large citizen science dataset, where a small proportion of the data is enough to obtain reliable inferences about Canada Warbler habitat preferences.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
