Statistical Learning Methods for Challenges Arising from Self-Reported Data
Abstract
This thesis focuses on developing advanced clustering methods and analyzing data arising from chronic pain (CP) studies, with particular emphasis on the unique challenges posed by self-reported (SR) data. Latent class analysis (LCA) is explored in the early stages of this work to cluster patients, and the resulting clusters are compared to identify features that differ significantly among them. While LCA is effective for categorical variables, it does not address the mixed data types and subjective biases inherent in SR data. To overcome these limitations, we propose a novel distance metric tailored specifically to SR questionnaire data. This distance combines the correlation distance with other elementary distances for clustering mixed-type data, and it outperforms existing metrics in handling mixed data when SR variables are present. Additionally, interpretable clustering techniques are used to generate simple, actionable rules that can be applied in clinical practice. To integrate the domain knowledge of CP experts into the clustering process, a semi-supervised clustering algorithm is introduced, allowing the distance metric to be adjusted using pairwise constraints provided by CP experts. We develop a two-step active learning query strategy to identify and query the most informative patient cases, improving query efficiency and minimizing the number of interactions required between experts and the algorithm.
In addition to clustering, we analyze data arising from CP studies and explore predictive modeling. Canonical correlation analysis (CCA) is applied to investigate relationships among CP measurements, revealing important connections between pain characteristics and psychological factors. Furthermore, multiple classification models are used to predict nociplastic pain, and the best cutoff for each predictor is investigated using the prediction model.
Overall, this work contributes to the field of CP studies by introducing novel methods for clustering CP patients and analyzing complex data relationships. The proposed approaches emphasize clinical applicability, interpretability, and the integration of domain knowledge, offering practical solutions for real-world challenges in CP management. These advancements provide a foundation for further exploration of personalized treatment strategies and an improved understanding of chronic pain mechanisms.