Objective Estimation of Tracheoesophageal Speech Quality

Yousef S Ettomi Ali, The University of Western Ontario

Abstract

Speech quality estimation for pathological voices is becoming an increasingly important research topic. Assessing the quality and degree of severity of disordered speech is important to the clinical treatment and rehabilitation of patients. In particular, patients who have undergone total laryngectomy (larynx removal) produce tracheoesophageal (TE) speech. In this thesis, we study the problem of TE speech quality estimation using advanced signal processing approaches. Since no reference (clean) signal is available for a given disordered TE speech signal, we focus on non-intrusive techniques (also called single-ended or blind approaches), which do not require a reference signal to estimate the speech quality level. First, we develop a novel TE speech quality estimation method based on existing double-ended (intrusive) speech quality evaluation techniques, such as the Perceptual Evaluation of Speech Quality (PESQ) and the Hearing Aid Speech Quality Index (HASQI). The matching pursuit algorithm (MPA) was used to generate a quasi-clean speech signal from a given disordered TE speech signal. Then, by appropriately choosing the parameters of the MPA (atoms, number of iterations, etc.) and using the resulting signal as the reference signal in the intrusive algorithm, we show that the resulting quality estimates correlate well with the subjective scores of two TE speech databases. Second, we investigate the extraction of low-complexity auditory features for the evaluation of speech quality. An 18th-order linear prediction (LP) analysis is performed on each voiced frame of the speech signal. Two evaluation features are extracted, corresponding to higher-order statistics of the LP coefficients and the vocal tract model parameters (cross-sectional tube areas).
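As a rough illustration of the LP step described above, the following sketch runs an 18th-order LP analysis on a single frame using the autocorrelation method and the Levinson-Durbin recursion, then derives tube areas from the reflection coefficients and a kurtosis value from the LP coefficients. The abstract does not specify the window, the area recursion and its sign convention, or which higher-order statistic is used, so the Hamming window, the `tube_areas` formula, the `kurtosis` feature, and the `synthetic_frame` test signal are all assumptions made for illustration only.

```python
import math
import random

LP_ORDER = 18  # order stated in the abstract


def hamming(n):
    """Hamming window of length n (assumed; the abstract names no window)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def autocorrelation(frame, max_lag):
    """Biased autocorrelation estimates r[0..max_lag] of one frame."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LP normal equations.

    Returns (a, k, err): predictor polynomial a[0..order] with a[0] = 1,
    reflection coefficients k[0..order-1], and the final prediction error.
    """
    a = [1.0] + [0.0] * order
    k = []
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("frame has no energy left to predict")
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        ki = -acc / err
        k.append(ki)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + ki * a[i - j]
        new_a[i] = ki
        a = new_a
        err *= 1.0 - ki * ki
    return a, k, err


def tube_areas(reflection, first_area=1.0):
    """Cross-sectional areas of a lossless tube model (assumed sign convention)."""
    areas = [first_area]
    for ki in reflection:
        areas.append(areas[-1] * (1.0 - ki) / (1.0 + ki))
    return areas


def kurtosis(values):
    """Fourth standardized moment, one possible higher-order statistic."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    m4 = sum((v - m) ** 4 for v in values) / len(values)
    return m4 / (var * var)


def analyze_frame(frame, order=LP_ORDER):
    """Window one voiced frame and return LP-based features."""
    windowed = [x * w for x, w in zip(frame, hamming(len(frame)))]
    r = autocorrelation(windowed, order)
    a, k, _ = levinson_durbin(r, order)
    return {"lp": a, "reflection": k, "areas": tube_areas(k),
            "lp_kurtosis": kurtosis(a[1:])}


def synthetic_frame(n=400, seed=0):
    """Hypothetical stand-in for a voiced frame: a stable AR(2) process."""
    rng = random.Random(seed)
    x = [0.0, 0.0]
    for _ in range(n):
        x.append(1.3 * x[-1] - 0.6 * x[-2] + rng.gauss(0.0, 1.0))
    return x[2:]
```

Calling `analyze_frame(synthetic_frame())` returns 19 predictor coefficients, 18 reflection coefficients, and 19 tube areas. In practice the frame would come from a voiced segment of a TE speech recording rather than a synthetic process.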
Using a set of 35 TE speech samples, we perform forward stepwise regression as well as K-fold cross-validation to select the best sets of features for each of the regression models. The selected features are then fitted to different support vector regression models, yielding high correlations with subjective scores. Finally, we investigate a new approach for estimating the quality of TE speech using deep neural networks (DNNs). A synthetic dataset consisting of 2173 samples was used to train a DNN model that was shown to predict TE voice quality. The synthetic dataset was formed by mixing 53 normal speech samples with modulated noise signals that had a similar envelope to the speech samples, at different speech-to-modulation-noise ratios. A validated instrumental speech quality predictor was used to quantify the perceived quality of the speech samples in this dataset, and these objective quality scores were used for training the DNN model. The DNN model comprised an input layer that accepted sixty relevant features extracted through filterbank and linear prediction analyses of the input speech signal, two hidden layers with 15 neurons each, and an output layer that produced the predicted speech quality score. The DNN trained on the synthetic dataset was subsequently applied to four different databases that contained speech samples collected from TE speakers. The DNN-estimated quality scores exhibited a strong correlation with the subjective ratings of the TE samples in all four databases, showing strong robustness compared with the other speech quality metrics developed in this thesis and those from the literature.
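To make the network shape described above concrete, the following sketch builds a feedforward network with 60 inputs, two hidden layers of 15 neurons each, and a single output, and runs one forward pass. Only the layer sizes come from the abstract. The tanh hidden activation, the linear output, the weight initialization, and the names `init_network`, `forward`, and `count_parameters` are assumptions for illustration, and the weights here are random rather than trained.

```python
import math
import random

# Layer sizes from the abstract: 60 features, two hidden layers of 15, one score.
LAYER_SIZES = [60, 15, 15, 1]


def init_network(sizes, seed=0):
    """Create untrained layers as (weights, biases) pairs (assumed init scheme)."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = 1.0 / math.sqrt(n_in)
        weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
                   for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers


def forward(layers, features):
    """Map one feature vector to a scalar quality score."""
    expected = len(layers[0][0][0])
    if len(features) != expected:
        raise ValueError(f"expected {expected} features, got {len(features)}")
    activations = list(features)
    for index, (weights, biases) in enumerate(layers):
        pre = [sum(w * x for w, x in zip(row, activations)) + b
               for row, b in zip(weights, biases)]
        is_output = index == len(layers) - 1
        # tanh in the hidden layers, linear output (both assumed, not from the source)
        activations = pre if is_output else [math.tanh(v) for v in pre]
    return activations[0]


def count_parameters(layers):
    """Total number of weights and biases in the network."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)
```

With these layer sizes the network has 60*15 + 15 + 15*15 + 15 + 15 + 1 = 1171 trainable parameters, which is small relative to the 2173 synthetic training samples mentioned in the abstract.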

This item has been relocated to Western University’s Open Repository
