Electronic Thesis and Dissertation Repository

Thesis Format



Master of Science


Computer Science

Collaborative Specialization

Artificial Intelligence


Bauer, Michael A.


In this thesis, we examine the performance of Vision Transformers in the context of current Advanced Driving Assistance Systems (ADAS). We explore the Vision Transformer model and its variants on vehicle computer vision problems. Vision Transformers achieve performance competitive with convolutional neural networks (CNNs) but require much more training data. They are also more robust to image permutations than CNNs. Additionally, Vision Transformers have a lower pre-training compute cost but can overfit on smaller datasets more easily than CNNs. We therefore apply this knowledge to tune Vision Transformers on ADAS image datasets covering general traffic objects, vehicles, traffic lights, and traffic signs. We compare the performance of Vision Transformers on this problem with existing CNN approaches to determine the viability of using Vision Transformers.

Summary for Lay Audience

Advanced Driving Assistance Systems (ADAS), one component of autonomous driving, are vehicle systems designed to improve driving ability and road safety. These technologies include anti-lock braking systems and lane departure warning systems. Such systems often have to collect information about the traffic environment, including the presence of traffic objects such as vehicles, pedestrians, traffic lights, and traffic signs. This collection is often done through computer vision, which gathers visual information from cameras attached to the vehicle. A common way of parsing this visual information to detect and classify traffic objects is through machine learning models. Machine learning models differ from traditional computer algorithms in that they do not need to be explicitly programmed. Instead, they learn from the data given to them at training time and make decisions based on that information. In this case, machine learning models can learn from traffic image data to make predictions about the presence and class of traffic objects.

In this thesis, we evaluate a set of pre-trained Vision Transformer models made by Google. Vision Transformers are a new, popular type of machine learning model that applies a mechanism called self-attention. Self-attention mechanisms can learn from and form relations between any pair of points in a data sequence. Vision Transformers do not compare every pixel; instead, they split images into patches to keep computing power and memory costs realistic. These patches are arranged into a linear sequence of vectors, and the Vision Transformer trains by finding relations among these patches of pixels. Our research shows that Vision Transformers are competitive with existing Convolutional Neural Network models when first pre-trained on a large dataset of images and then adjusted to train on a smaller, domain-specific dataset. We apply this concept to an image dataset of vehicles, pedestrians, traffic lights, and traffic signs. We find that classification accuracy when predicting the traffic object in unseen images is higher than the classification accuracy reported in prior research applying Convolutional Neural Networks to the same datasets.
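
The patch-splitting step described in the summary can be illustrated with a minimal sketch. It shows only how an image grid might be cut into non-overlapping patches and flattened into a sequence of vectors. The function name `image_to_patches`, the toy 4x4 image, and the patch size of 2 are assumptions made for illustration and are not taken from the thesis. Real Vision Transformer models additionally apply a learned projection and position embeddings to each patch.

```python
def image_to_patches(image, patch_size):
    """Split a 2D grid of pixel values into flattened, non-overlapping patches.

    `image` is a list of rows (each a list of numbers). Both dimensions are
    assumed to be divisible by `patch_size` (an illustrative simplification).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")

    patches = []
    # Walk the image patch by patch: left to right, then top to bottom.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten the patch row by row into a single vector.
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Hypothetical toy input: a 4x4 grayscale "image" with pixel values 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patch_sequence = image_to_patches(toy_image, patch_size=2)
```

Here `patch_sequence` holds four vectors of four values each, one per 2x2 patch. A self-attention layer would then relate these four vectors to one another rather than relating all sixteen pixels individually.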

Creative Commons License

This work is licensed under a Creative Commons Attribution-Share Alike 4.0 License.