Electronic Thesis and Dissertation Repository

Thesis Format

Monograph

Degree

Master of Science

Program

Computer Science

Collaborative Specialization

Artificial Intelligence

Supervisor

Mohsenzadeh, Yalda

Abstract

Generative adversarial networks (GANs) synthesize realistic samples (images, audio, video, etc.) from a random latent vector. While many studies have explored various training configurations and architectures for GANs, the problem of inverting a generative model to extract the latent vector of a given input image or audio clip has been inadequately investigated. Although there is exactly one generated output per given random vector, the mapping from an image or audio clip to its recovered latent vector can have more than one solution. We train a deep residual network (ResNet18) to recover, for a given target, a latent vector that can be used to generate a face image or spoken-digit audio nearly identical to that target. Here we focus on precise latent-vector recovery for human faces and voices. We use a perceptual loss to embed texture details in the recovered latent vector while maintaining quality with a reconstruction loss. The vast majority of studies on latent-vector recovery perform well only on synthesized examples; we argue that our method can determine a mapping between real human faces and latent-space vectors that captures most of the important face-style details. In addition, our proposed method projects generated faces to their latent space with high fidelity and speed. Applying a few further gradient-descent steps to the predicted latent vector of a face improves performance further; however, this hybrid technique does not help audio inverse mapping. Our audio inverse mapper reconstructs both synthesized and real spoken digits with high quantitative and qualitative accuracy. Finally, we demonstrate the performance of our approach on both real and generated examples.
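
To make the approach concrete, below is a minimal PyTorch sketch of the kind of inversion setup the abstract describes. The frozen generator G, the latent dimension, the VGG16 feature layer used for the perceptual loss, and the loss weighting are all illustrative assumptions, not the exact configuration used in the thesis.

import torch
import torch.nn as nn
import torchvision.models as models

LATENT_DIM = 512          # assumed latent-vector size
PERCEPTUAL_WEIGHT = 1.0   # assumed weighting between the two losses

# Encoder: a ResNet18 whose final layer regresses the latent vector.
encoder = models.resnet18(weights=None)
encoder.fc = nn.Linear(encoder.fc.in_features, LATENT_DIM)

# Perceptual loss: compare feature activations of a frozen VGG16.
vgg_features = models.vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def inversion_loss(x_target, x_recon):
    # Pixel-level reconstruction loss plus feature-level perceptual loss.
    recon = nn.functional.mse_loss(x_recon, x_target)
    perceptual = nn.functional.mse_loss(vgg_features(x_recon),
                                        vgg_features(x_target))
    return recon + PERCEPTUAL_WEIGHT * perceptual

def train_step(G, x_target, optimizer):
    # One training step for the encoder; G is a pretrained generator
    # (z -> image) whose parameters are assumed frozen.
    optimizer.zero_grad()
    z_pred = encoder(x_target)   # image -> latent vector
    x_recon = G(z_pred)          # latent vector -> reconstructed image
    loss = inversion_loss(x_target, x_recon)
    loss.backward()
    optimizer.step()
    return loss.item()

def refine(G, x_target, z_init, steps=50, lr=0.01):
    # Hybrid step: a few gradient-descent updates on the predicted latent.
    z = z_init.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = inversion_loss(x_target, G(z))
        loss.backward()
        opt.step()
    return z.detach()

In this sketch, the encoder gives the fast one-shot projection, while refine corresponds to the hybrid gradient-descent refinement that, per the abstract, helps face inversion but not audio.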

Summary for Lay Audience

Recent generative models can synthesize a brand-new image or audio clip given a random vector of numbers. The generated output is realistic enough to be indistinguishable from real images or audio. Advanced face generators can produce very high-quality faces that look as natural as actual human faces. Researchers are improving the architectures and training configurations of these generators every day. Nevertheless, the task of converting an image or audio clip into a vector of numbers that can reconstruct the target when given to a generator has been much less investigated. In other words, in order to generate an image, the generator requires a set of input numbers; given a specific set of numbers, the output of the generator is always the same image. Our goal is to perform the reverse process: we aim to map an image or audio clip to its corresponding vector of numbers. Using this predicted vector, one can reconstruct the target image or audio with a generative model.

The world of computer science is built on numbers, more specifically 0s and 1s. Finding numerical representations for non-numerical formats such as images and audio is a common task tackled by many different approaches. Machine learning algorithms are trained and evaluated on numerical representations of their data sets. Our work is a mapping function that converts human face images and human voices into numerical representations that are particularly useful for generative adversarial networks (GANs).

Creative Commons License

This work is licensed under a Creative Commons Attribution-Noncommercial 4.0 License.

Audio_Examples.zip (3560 kB)
