Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Doctor of Philosophy

Program

Electrical and Computer Engineering

Supervisor

Parsa, Vijay

Abstract

Speech quality estimation for pathological voices is becoming an increasingly important research topic. Assessing the quality and the degree of severity of disordered speech is important to the clinical treatment and rehabilitation of patients. In particular, patients who have undergone total laryngectomy (larynx removal) produce tracheoesophageal (TE) speech. In this thesis, we study the problem of TE speech quality estimation using advanced signal processing approaches. Since no reference (clean) signal exists for a given disordered TE speech signal, we focus on non-intrusive techniques (also called single-ended or blind approaches), which do not require a reference signal to deduce the speech quality level. First, we develop a novel TE speech quality estimation method based on existing double-ended (intrusive) speech quality evaluation techniques such as the Perceptual Evaluation of Speech Quality (PESQ) and the Hearing Aid Speech Quality Index (HASQI). The matching pursuit algorithm (MPA) is used to generate a quasi-clean speech signal from a given disordered TE speech signal. Then, by adequately choosing the parameters of the MPA (atoms, number of iterations, etc.) and using the resulting signal as the reference signal in the intrusive algorithm, we show that the resulting scores correlate well with the subjective scores of two TE speech databases. Second, we investigate the extraction of low-complexity auditory features for the evaluation of speech quality. An 18th-order linear prediction (LP) analysis is performed on each voiced frame of the speech signal. Two evaluation features are extracted, corresponding to higher-order statistics of the LP coefficients and the vocal tract model parameters (cross-sectional tube areas).
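The LP analysis step mentioned above can be illustrated with a minimal sketch. It assumes the autocorrelation method with a Hamming window and the Levinson-Durbin recursion, and it maps reflection coefficients to relative tube areas of a lossless-tube vocal tract model. The function names, the window choice, and the sign convention of the area ratio are assumptions made for illustration and are not taken from the thesis.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the LP normal equations from autocorrelation values r[0..order].

    Returns the polynomial a (with a[0] = 1, A(z) = sum a[i] z^-i),
    the reflection coefficients k, and the final prediction error.
    """
    a = np.zeros(order + 1)
    a[0] = 1.0
    k = np.zeros(order)
    err = r[0]
    for i in range(1, order + 1):
        # Correlation between the current predictor and lag i.
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k_i = -acc / err
        a_prev = a.copy()
        a[1:i] = a_prev[1:i] + k_i * a_prev[i - 1:0:-1]
        a[i] = k_i
        err *= 1.0 - k_i ** 2
        k[i - 1] = k_i
    return a, k, err

def lp_tube_areas(frame, order=18):
    """LP analysis of one frame plus relative tube areas (hypothetical helper)."""
    windowed = frame * np.hamming(len(frame))
    n = len(windowed)
    # Autocorrelation at lags 0..order.
    r = np.correlate(windowed, windowed, mode="full")[n - 1:n + order]
    a, k, _ = levinson_durbin(r, order)
    # Assumed convention: area ratio of adjacent sections is (1 - k) / (1 + k),
    # with the first section normalised to 1. Other texts use the inverse ratio.
    areas = np.ones(order + 1)
    for i in range(order):
        areas[i + 1] = areas[i] * (1.0 - k[i]) / (1.0 + k[i])
    return a, k, areas

# Example on a synthetic frame standing in for one voiced speech frame.
rng = np.random.default_rng(0)
a, k, areas = lp_tube_areas(rng.normal(size=480))
```

In this sketch, statistics of `a` and of `areas` computed over all voiced frames would play the role of the two feature families; which statistics the thesis actually uses is not stated in the abstract.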
Using a set of 35 TE speech samples, we perform forward stepwise regression as well as K-fold cross-validation to select the best sets of features for each of the regression models. The selected features are then fitted to different support vector regression models, yielding high correlations with subjective scores. Third, we investigate a new approach for estimating the quality of TE speech using deep neural networks (DNNs). A synthetic dataset of 2173 samples is used to train a DNN model to predict TE voice quality. The synthetic dataset was formed by mixing 53 normal speech samples with modulated noise signals that had a similar envelope to the speech samples, at different speech-to-modulation-noise ratios. A validated instrumental speech quality predictor was used to quantify the perceived quality of the speech samples in this dataset, and these objective quality scores were used for training the DNN model. The DNN model comprises an input layer that accepts sixty relevant features extracted through filterbank and linear prediction analyses of the input speech signal, two hidden layers with 15 neurons each, and an output layer that produces the predicted speech quality score. The DNN trained on the synthetic dataset was subsequently applied to four different databases containing speech samples collected from TE speakers. The DNN-estimated quality scores exhibited a strong correlation with the subjective ratings of the TE samples in all four databases, indicating strong robustness relative to the other speech quality metrics developed in this thesis and those from the literature.
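The network shape described above (sixty input features, two hidden layers of 15 neurons, one output score) can be sketched as a plain forward pass. The tanh activation, the random initialisation, and all function names are assumptions for illustration; the abstract does not specify the activation function or the training procedure, and the weights here are untrained.

```python
import numpy as np

def init_params(rng, sizes=(60, 15, 15, 1)):
    """Create (weights, bias) pairs for each layer; sizes follow the abstract."""
    params = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        w = rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))
        b = np.zeros(n_out)
        params.append((w, b))
    return params

def predict_quality(params, features):
    """Forward pass: tanh hidden layers (assumed), linear output score."""
    h = np.atleast_2d(features)
    for w, b in params[:-1]:
        h = np.tanh(h @ w + b)
    w_out, b_out = params[-1]
    return (h @ w_out + b_out).ravel()

def count_params(params):
    """Total number of trainable weights and biases."""
    return sum(w.size + b.size for w, b in params)

rng = np.random.default_rng(0)
params = init_params(rng)
batch = rng.normal(size=(4, 60))   # four hypothetical 60-dimensional feature vectors
scores = predict_quality(params, batch)
```

With these layer sizes the model has 1171 trainable parameters (915 + 240 + 16), which is small relative to the 2173 synthetic training samples mentioned above.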

Summary for Lay Audience

Speech quality is a multi-dimensional perceptual phenomenon that encompasses attributes such as clarity, pleasantness, and naturalness of speech. Estimating the quality of pathological voices is clinically important, especially for people who have undergone total laryngectomy (larynx removal). The speech produced by people who have undergone this surgery is called tracheoesophageal (TE) speech. This thesis aims to assess the quality of TE speech using speech quality metrics that incorporate digital signal processing and machine learning algorithms.

In the first contribution of this study, a novel TE speech quality estimation metric is developed using intrusive techniques. Intrusive techniques are methods that need a clean reference audio signal to measure the quality of the signal being evaluated. The resulting automated metric is found to agree closely with subjective human evaluations of the speech signals.

In the second part of this study, the use of non-intrusive metrics, which do not need a clean reference signal to evaluate the quality of speech signals, is investigated. Linear prediction features from the speech signal are fed to a stepwise regression model to predict the quality of the speech recordings. In addition, another machine learning algorithm, support vector regression (SVR), is used to derive quality evaluation metrics from these linear prediction coefficients. The resulting quality metrics are found to be highly correlated with individual subjective scores.

Finally, the third part of this study investigates the use of deep neural networks (DNNs), a state-of-the-art machine learning technique, in predicting the quality of TE speech recordings. The DNN-estimated quality scores exhibited a strong correlation with the subjective ratings of the TE samples in all four TE speech databases tested, indicating strong robustness relative to the other speech quality metrics developed in this thesis and those from the literature.

Creative Commons License

This work is licensed under a Creative Commons Attribution 4.0 License.
