Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science

Collaborative Specialization

Artificial Intelligence


Mohsenzadeh, Yalda


Contrastive self-supervised learning has been extremely successful at learning rich visual representations. However, it remains an open question whether a similar approach can learn more efficient auditory and audio-visual representations. In this thesis, we expand on prior self-supervised methods to learn better auditory and audio-visual representations. We introduce various data augmentations suitable for auditory and audio-visual data and evaluate their impact on predictive performance. We also demonstrate that training with supervised and contrastive losses simultaneously improves the learned representations compared to self-supervised pre-training followed by supervised fine-tuning. By combining these methods, our framework achieves a significant improvement in prediction performance over the supervised approach while using substantially less labeled data. Moreover, compared to the self-supervised approach, our framework converges faster and learns significantly better representations.

Summary for Lay Audience

Audio recognition is a fundamental challenge in achieving automated perception. Although humans are proficient at perceiving and understanding sounds, making computers do the same is difficult because of the wide range of acoustic variation in sounds. Currently, there is an overabundance of unlabelled (that is, not annotated) auditory data. Moreover, these data are available in a wide range of formats and from various sources, and they are often not stored in a form that is ready to feed into a machine learning pipeline. As a result, curating and annotating a dataset is expensive and time-consuming. Advances in this area are therefore held back by the lack of adequate unsupervised learning algorithms that can make use of these unlabelled data. This thesis directly addresses this shortcoming by developing an unsupervised approach to training algorithms to carry out tasks such as classification and detection. The main goal of this thesis is to investigate the components that have a direct effect on the performance of these algorithms.

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License