Electronic Thesis and Dissertation Repository

Thesis Format

Integrated Article


Master of Science


Computer Science

Collaborative Specialization

Artificial Intelligence


Mur, Marieke

2nd Supervisor

Goodale, Melvyn A.


3rd Supervisor

Maryam Vaziri-Pashkam


National Institute of Mental Health

Joint Supervisor


Recent advances in computer vision have enabled machines to label objects in natural scenes with high accuracy. However, object labeling constitutes only a small fraction of daily human activities. To move towards building machines that can function in natural environments, the usefulness of these models should be evaluated on a broad range of tasks beyond perception. Towards this goal, this thesis evaluates how well the internal representations of state-of-the-art deep convolutional neural networks (DNNs) predict a perception-based and an action-based behavior: object similarity judgment and visually guided grasping. To do so, a dataset of everyday objects was collected and used to obtain these two behaviors on the same set of stimuli. For the grasping task, participants’ finger positions were recorded at the end of the grasping movement. For the similarity judgment task, an odd-one-out experiment was conducted to build a dissimilarity matrix from participants’ similarity judgments. A comparison of the two behaviors suggests that distinct object features are used for performing each task. I next explored whether the features extracted in different layers of the DNNs could be useful in deriving both outputs. The prediction accuracy for the similarity judgment behavior increased from lower to higher layers of the networks, while that for the grasping behavior increased from lower to middle layers and decreased drastically further along the hierarchy. These results suggest that, to build a system that can perform both tasks, the processing hierarchy may need to be split starting at the middle layers. Overall, the results of this thesis could inform future models that perform a broader set of tasks on natural images.
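As an illustration of how odd-one-out responses can be turned into a dissimilarity matrix, the following is a minimal sketch and not necessarily the procedure used in the thesis. It assumes each trial is stored as a tuple of object indices `(a, b, odd)`, where `odd` is the object the participant picked as the odd one out, and it uses one common convention: the dissimilarity of a pair is one minus the proportion of shared trials in which neither member of the pair was chosen as the odd one. The function name and the data layout are hypothetical.

```python
import numpy as np

def dissimilarity_from_triplets(triplets, n_objects):
    """Estimate a dissimilarity matrix from odd-one-out trials.

    triplets: iterable of (a, b, odd) object indices (assumed layout),
              where `odd` was chosen as the odd one out, so a and b
              were implicitly judged the most similar pair.
    Returns an (n_objects, n_objects) symmetric matrix; pairs that
    never appeared in the same trial are left as NaN.
    """
    together = np.zeros((n_objects, n_objects))  # trials containing both objects
    kept = np.zeros((n_objects, n_objects))      # trials where the pair was kept together

    for a, b, odd in triplets:
        # Every pair within the triplet was shown together once.
        for x, y in ((a, b), (a, odd), (b, odd)):
            together[x, y] += 1
            together[y, x] += 1
        # Only the two remaining objects count as the "similar" pair.
        kept[a, b] += 1
        kept[b, a] += 1

    # 0/0 for unseen pairs yields NaN, which is kept on purpose.
    with np.errstate(invalid="ignore", divide="ignore"):
        similarity = kept / together

    dissimilarity = 1.0 - similarity
    np.fill_diagonal(dissimilarity, 0.0)
    return dissimilarity

# Three hypothetical trials over objects 0, 1 and 2.
trials = [(0, 1, 2), (0, 1, 2), (0, 2, 1)]
print(dissimilarity_from_triplets(trials, 3))
```

In this toy example, objects 0 and 1 were kept together in two of three trials, so their estimated dissimilarity is lower than that of objects 1 and 2, which were never kept together.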

Summary for Lay Audience

Our visual system enables us to recognize objects and people around us. It also enables us to move and interact with the world. Advances in the field of computer vision have given rise to models that can perform similarly to humans in recognizing objects and people. In this thesis, we study whether the same models can also help us judge the similarity of objects or grasp them. Both of these tasks require visual processing, but we show that they rely on different features of objects. Our results suggest that simple models optimized for object recognition are not suitable for producing both similarity judgments and grasping behaviors, and that their architecture may need to be modified to allow for the human-like production of both behaviors. The results of this thesis shed light on the object properties that are relevant for perception and action, and emphasize the importance of studying vision in the context of both.