Doctor of Philosophy
Statistics and Actuarial Sciences
In this thesis, we propose new methodologies for three areas: high-dimensional variable screening, influence measures, and post-selection inference. We propose a new estimator of the correlation between the response and high-dimensional predictor variables and, based on this estimator, develop a new screening technique termed Dynamic Tilted Current Correlation Screening (DTCCS). DTCCS is capable of picking up the relevant predictor variables within a finite number of steps. It includes the widely used sure independence screening (SIS) method and the high-dimensional ordinary least squares projection (HOLP) approach as special cases.
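As a point of reference for the two special cases named above, the following is a minimal sketch of the standard SIS and HOLP screening statistics, not of DTCCS itself, whose estimator is not specified in this abstract. The function names, the simulated data, and the cutoff of 10 variables are illustrative assumptions.

```python
import numpy as np

def sis_scores(X, y):
    """SIS statistic: absolute marginal correlation of each predictor with y."""
    Xc = X - X.mean(axis=0)
    Xs = Xc / np.linalg.norm(Xc, axis=0)   # centered, unit-norm columns
    yc = y - y.mean()
    return np.abs(Xs.T @ yc) / np.linalg.norm(yc)

def holp_scores(X, y):
    """HOLP statistic: absolute entries of X^T (X X^T)^{-1} y, intended for p > n."""
    coef = X.T @ np.linalg.solve(X @ X.T, y)
    return np.abs(coef)

def top_k(scores, k):
    """Indices of the k largest scores, largest first."""
    return np.argsort(scores)[::-1][:k]

# Simulated data (assumption): n = 50 observations, p = 200 predictors,
# of which only the first three have nonzero coefficients.
rng = np.random.default_rng(0)
n, p = 50, 200
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [3.0, -2.0, 1.5]
y = X @ beta + 0.5 * rng.standard_normal(n)

sis_top = top_k(sis_scores(X, y), 10)
holp_top = top_k(holp_scores(X, y), 10)
print("SIS keeps: ", sorted(sis_top.tolist()))
print("HOLP keeps:", sorted(holp_top.tolist()))
```

Both methods rank predictors by a single score and keep the top few; they differ only in whether the score ignores (SIS) or adjusts for (HOLP) the other predictors.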
Two methods for measuring influence in high-dimensional data are also explored, one from the perspective of the extreme value distribution (EVD) and the other from the robustness of design. For the first method, EVD-type statistics are shown, both theoretically and numerically, to be powerful in measuring high-dimensional influence. For the second method, we propose the Hellinger distance for high-dimensional influence measure (HD-HIM). The inner product of two transformed influence functions is used to measure the Hellinger distance between the two discrete distribution functions obtained from the whole and the deleted datasets. This construction provides the power to detect and flag influential observations.
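To illustrate why an inner product can express a Hellinger distance, the sketch below computes the standard Hellinger distance between two discrete distributions on a common support, using the identity H^2 = 1 - sum_i sqrt(p_i q_i). It does not reproduce the HD-HIM statistic or its transformed influence functions, which are not given in this abstract; the function name and the example probabilities are assumptions.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions on the same support.

    The sum below is the inner product of the square-root-transformed
    probability vectors (the Bhattacharyya coefficient).
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    inner = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, 1.0 - inner))

# Hypothetical distributions estimated from the whole data and from the
# data with one observation removed.
p_full = [0.2, 0.3, 0.5]
p_deleted = [0.1, 0.4, 0.5]
print(hellinger(p_full, p_deleted))
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, so a large value after deleting an observation suggests that the observation is influential.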
Lastly, we propose a new, numerically feasible post-selection inference method, termed Cosine PoSI, for the high-dimensional framework. Cosine PoSI focuses on the geometric aspect of Least Angle Regression (LARS). LARS efficiently provides a solution path along which the entered predictors always have the same absolute correlation with the current residual. At each step of the LARS algorithm, the proposed Cosine PoSI method obtains an angle from the correlation between the entering variable and the current residual and treats this angle as a random variable from the cosine distribution. The post-selection inference is then conducted based on the order statistics of this cosine distribution. Given the collection of possible angles, hypothesis tests are performed on the limiting distribution of the maximum angle. We conduct simulation studies and a real-life data analysis to illustrate the effectiveness of the proposed method.
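The geometric link between correlation and angle can be sketched as follows: the correlation between a centered predictor and the centered residual is the cosine of the angle between them. The example covers only the first LARS step, where the current residual is the centered response and the entering variable is the one with the largest absolute correlation. How Cosine PoSI turns these angles into a test is not described in this abstract, so the function names and the toy data are assumptions.

```python
import math

def centered(v):
    """Subtract the mean from every entry of v."""
    m = sum(v) / len(v)
    return [a - m for a in v]

def cosine(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def angles_to_residual(columns, residual):
    """Angle in radians between each centered predictor and the centered residual."""
    r = centered(residual)
    cosines = [cosine(centered(x), r) for x in columns]
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return [math.acos(max(-1.0, min(1.0, c))) for c in cosines]

# Toy data (assumption): three predictors stored as columns, four observations.
y = [1.0, 2.0, 3.0, 4.0]
columns = [
    [1.0, 2.0, 3.0, 5.0],
    [1.0, -1.0, -1.0, 1.0],
    [2.0, 1.0, 4.0, 3.0],
]

angles = angles_to_residual(columns, y)
# First LARS step: the entering variable has the largest absolute correlation,
# that is, the largest |cos(angle)|.
entering = max(range(len(columns)), key=lambda j: abs(math.cos(angles[j])))
print(angles, entering)
```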
Zhao, Bangxin, "Analysis Challenges for High Dimensional Data" (2018). Electronic Thesis and Dissertation Repository. 5370.
Applied Statistics Commons, Biostatistics Commons, Microarrays Commons, Multivariate Analysis Commons, Numerical Analysis and Scientific Computing Commons, Statistical Methodology Commons, Theory and Algorithms Commons