Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Doctor of Philosophy

Program

Statistics and Actuarial Sciences

Supervisor

Yi, Grace Y.

2nd Supervisor

Diedrichsen, Jörn

Co-Supervisor

Abstract

Owing to advances in information technology, multi-dimensional data are now commonly encountered in many fields, including finance, economics, genomics, and neuroscience. Many regression-based methods have been developed to describe and analyze multivariate data, such as multivariate ridge and lasso regression, reduced-rank regression, and partial least squares regression. The use of these methods hinges on important yet tacit conditions about the quality of the data, which are often violated in applications. Real-world data often come with measurement error for various reasons, including imperfections in measuring procedures and, most importantly, the nature of the variables themselves. Ignoring measurement error and naively applying available methods may lead to seriously biased results. Moreover, complications arise when error-contaminated data also contain missing values and/or irrelevant variables. In this thesis, we explore inference issues concerning such data under multivariate regression models and develop valid methods under different settings.

The first project studies the multivariate regression model with covariates subject to measurement error. We investigate the impact of measurement error and quantify the asymptotic bias induced by the naive method, which ignores measurement error in the estimation procedure. We propose three methods to account for measurement error effects under different scenarios. We show that the proposed estimators are consistent and derive their asymptotic properties. The first project is presented in Chapters 2 and 3.
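As a rough illustration of the naive bias described above, the following sketch simulates a multivariate linear model with classical additive measurement error in the covariates. It is a textbook-style example under stated assumptions, not one of the three methods proposed in the thesis: the error covariance `sigma_u` is assumed known, the true covariates have identity covariance, and the function name `simulate` and all numeric settings are hypothetical.

```python
import numpy as np


def simulate(n=20000, seed=0):
    """Compare naive and moment-corrected least squares under additive error."""
    rng = np.random.default_rng(seed)
    p, q = 3, 2
    # True p x q coefficient matrix (illustrative values only).
    B = np.array([[1.0, -0.5], [0.0, 2.0], [0.5, 0.0]])
    # Assumption: measurement error covariance is known.
    sigma_u = 0.5 * np.eye(p)

    X = rng.normal(size=(n, p))  # true covariates with identity covariance
    U = rng.multivariate_normal(np.zeros(p), sigma_u, size=n)
    W = X + U  # observed, error-contaminated covariates
    Y = X @ B + 0.3 * rng.normal(size=(n, q))  # multivariate responses

    S_ww = W.T @ W / n
    S_wy = W.T @ Y / n
    # Naive estimator: treats W as if it were X.
    B_naive = np.linalg.solve(S_ww, S_wy)
    # Moment correction: subtract the known error covariance from S_ww.
    B_corrected = np.linalg.solve(S_ww - sigma_u, S_wy)
    return B, B_naive, B_corrected


if __name__ == "__main__":
    B, B_naive, B_corrected = simulate()
    print("max abs error, naive:    ", np.abs(B_naive - B).max())
    print("max abs error, corrected:", np.abs(B_corrected - B).max())
```

Under these assumptions the naive estimator converges to (Sigma_x + Sigma_u)^{-1} Sigma_x B, which here equals B / 1.5, so every coefficient is shrunk toward zero. The corrected estimator removes this bias only because `sigma_u` is taken as known; in practice it would have to be estimated, for example from replicate or validation data.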

Building on the first project, the second project further accommodates sparsity in the covariates, where only a small subset of the variables is important for explaining the multivariate outcome variables. This project draws on regularized, or penalized, methods available in the literature and develops a new penalized method to handle variable selection for error-contaminated data. The consistency and oracle properties of the proposed estimator are rigorously established. The developed method adds a new dimension to the literature on variable selection and enhances our ability to deal with error-corrupted data of high dimensionality. This research is reported in Chapter 4.

An extension of the second project, included in Chapter 5, deals with the case where the dimensions of the covariates and responses both diverge as the sample size approaches infinity. This extension is motivated by the increasing interest in diverging dimensions in the literature, yet theoretical results are more challenging to derive here than in the case with fixed dimensions. After carefully identifying regularity conditions additional to those in Chapter 4, we establish the theoretical results for the setting with diverging dimensions of the responses and covariates, devising new techniques to prove them. Simulation studies are conducted to assess the finite sample performance of the proposed method.

The third project examines data with missing values in addition to measurement error and sparsity in the covariates. Handling such data involves more complex modeling and inferential procedures. We develop penalized estimating function approaches for two settings, in which the nuisance parameters associated with the missing data and measurement error models are either known or unknown. Asymptotic results are established for the proposed estimators. Simulation studies are conducted to show that the proposed method performs well. This research is presented in Chapter 6.

Our studies investigate in depth the effects of measurement error on the multivariate regression model under different scenarios, including data that additionally contain missing values and/or variables irrelevant to explaining the outcome variables. We develop a number of new inference methods to handle such data. Our methods allow estimation and variable selection to be conducted simultaneously while dealing with the complex interplay between measurement error and missingness. This research provides important insights into the impact of accounting for data quality in inferential procedures under multivariate regression models and offers a new addition to the literature.

Summary for Lay Audience

Owing to advances in information technology, multi-dimensional data have become commonly encountered in many fields, including finance, economics, genomics, and neuroscience. Many regression-based methods have been developed to describe and analyze multivariate data. The use of these methods hinges on important yet tacit conditions about the quality of the data, which are often violated in applications. Real-world data often come with measurement error for various reasons, including imperfections in measuring procedures and, most importantly, the nature of the variables themselves. Ignoring measurement error and naively applying available methods may lead to seriously biased results. Moreover, complications arise when error-contaminated data also contain missing values and/or irrelevant variables.

In this thesis, we explore inference issues concerning such data under multivariate regression models and develop valid methods under different settings. This research provides important insights into the impact of accounting for data quality in inferential procedures under multivariate regression models and offers a new addition to the literature.

Available for download on Thursday, December 11, 2025
