
The Two Visual Processing Streams Through the Lens of Deep Neural Networks
Abstract
Recent advances in computer vision have enabled machines to achieve high performance at labeling objects in natural scenes. However, object labeling constitutes only a small fraction of daily human activities. To move towards building machines that can function in natural environments, the usefulness of these models should be evaluated on a broad range of tasks beyond perception. As a step towards this goal, this thesis evaluates how well the internal representations of state-of-the-art deep convolutional neural networks (DNNs) predict a perception-based and an action-based behavior: object similarity judgment and visually guided grasping. To this end, a dataset of everyday objects was collected and used to measure both behaviors on the same set of stimuli. For the grasping task, participants’ finger positions were recorded at the end of the grasping movement. For the similarity judgment task, an odd-one-out experiment was conducted to build a dissimilarity matrix from participants’ judgments. A comparison of the two behaviors suggests that distinct object features are used to perform each task. I next explored whether the features extracted in different layers of DNNs could be used to predict both behaviors. The prediction accuracy for the similarity judgment behavior increased from lower to higher layers of the networks, whereas the accuracy for the grasping behavior increased up to the middle layers and then decreased sharply further along the hierarchy. These results suggest that a system performing both tasks may need a processing hierarchy that splits at the middle layers. Overall, the results of this thesis could inform future models that perform a broader set of tasks on natural images.
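
To make the layer-wise analysis concrete, the sketch below illustrates one common way such a comparison can be set up: extract activations from a given layer of a pretrained convolutional network and correlate the resulting model dissimilarity structure with a behavioral dissimilarity matrix. The model choice (AlexNet), the layer indexing, and the Spearman-based comparison are illustrative assumptions, not necessarily the exact pipeline used in the thesis.

```python
# Illustrative sketch (assumed setup, not the thesis's exact pipeline):
# correlate a DNN layer's dissimilarity structure with a behavioral
# dissimilarity matrix (e.g., one built from odd-one-out judgments).
import numpy as np
import torch
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from torchvision import models

model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()

def layer_features(images, layer_idx):
    """Flattened activations of one layer in model.features for a batch of images."""
    feats = {}
    def hook(_, __, output):
        feats["a"] = output.flatten(start_dim=1)
    handle = model.features[layer_idx].register_forward_hook(hook)
    with torch.no_grad():
        model(images)
    handle.remove()
    return feats["a"].numpy()

def layer_behavior_correlation(images, behavioral_rdm, layer_idx):
    """Spearman correlation between a layer's dissimilarity matrix and the
    behavioral dissimilarity matrix (upper triangles only)."""
    acts = layer_features(images, layer_idx)
    model_rdm = pdist(acts, metric="correlation")  # condensed model RDM
    behav = behavioral_rdm[np.triu_indices_from(behavioral_rdm, k=1)]
    rho, _ = spearmanr(model_rdm, behav)
    return rho

# Usage sketch: `images` is an (N, 3, 224, 224) tensor of preprocessed object
# photos and `behavioral_rdm` an (N, N) matrix of behavioral dissimilarities.
# for i, layer in enumerate(model.features):
#     if isinstance(layer, torch.nn.Conv2d):
#         print(i, layer_behavior_correlation(images, behavioral_rdm, i))
```

Repeating such a comparison across layers, and against both the similarity-judgment and grasping data, yields the kind of layer-by-layer prediction profile summarized in the abstract.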