Unsupervised object learning explains face but not animate category structure in human visual cortex
Document Type
Article
Publication Date
9-27-2021
Journal
Journal of Vision
Volume
21
Issue
9
URL with Digital Object Identifier
10.1167/jov.21.9.2501
Abstract
Deep convolutional neural networks (DCNNs) are currently the best computational models of human vision. However, DCNNs cannot fully explain the representation of natural object categories in high-level human visual cortex. DCNNs are classically trained to recognize objects using supervised learning, while humans rely heavily on unsupervised learning. Here, we test whether unsupervised learning yields an object representation that more strongly emphasizes natural categories and better explains human brain activity than supervised learning. We trained ResNet50 on the ImageNet database, using both supervised and contrastive unsupervised learning. For both types of learning, we characterized the network’s internal representation of 96 real-world object images over the course of 200 training epochs. We fitted a category model to the resulting learning trajectories using ordinary least squares to measure the strength of category clustering for faces, animate objects, and inanimate objects. We then compared the networks’ learning trajectories and clustering strengths with the object representation in high-level visual cortex, measured with fMRI in adult human observers. We focused our analysis on the deepest convolutional layer and used bootstrap resampling for statistical inference. We found that the unsupervised network explains the human object representation better than the supervised network (FDR-corrected p < 0.05 for 80 percent of epochs). This difference emerges relatively early in training and increases as learning progresses. The better performance of the unsupervised network is partly driven by its ability to discover natural face category structure in the input images. Importantly, both supervised and unsupervised models fall short of predicting the category clusters of animate and inanimate objects in the human brain data (FDR-corrected p < 0.05 for all epochs), suggesting that these categories are difficult to learn from static images alone.
Our findings suggest that the natural category structure in high-level human visual cortex may arise from unsupervised learning during development.
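
The category-model step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact design of the model, so everything below is an assumption made for illustration: dissimilarity is taken as correlation distance between activation patterns, the model has one within-category predictor per category plus an intercept, and the function names (`rdm`, `category_predictors`, `clustering_strength`) are hypothetical. The weights are fitted with ordinary least squares, and a category's clustering strength is reported as the sign-flipped weight, so that larger values mean tighter clustering.

```python
import numpy as np

def rdm(features):
    """Representational dissimilarity matrix: 1 - Pearson r between item patterns.

    features: array of shape (n_items, n_units), e.g. layer activations per image.
    (Correlation distance is an assumption; the abstract does not name the measure.)
    """
    return 1.0 - np.corrcoef(features)

def category_predictors(labels):
    """Build a design matrix over item pairs (upper triangle of the RDM).

    Column 0 is an intercept (baseline dissimilarity of between-category pairs);
    each further column is 1 where both items of a pair belong to that category.
    """
    labels = np.asarray(labels)
    iu = np.triu_indices(len(labels), k=1)
    cats = sorted(set(labels.tolist()))
    cols = [np.ones(len(iu[0]))]
    for c in cats:
        member = labels == c
        cols.append(np.outer(member, member)[iu].astype(float))
    return cats, np.column_stack(cols), iu

def clustering_strength(features, labels):
    """Fit the category model with ordinary least squares.

    Returns {category: strength}; strength is minus the fitted weight, i.e. how
    much lower within-category dissimilarity is than the between-category baseline.
    """
    cats, X, iu = category_predictors(labels)
    y = rdm(features)[iu]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return {c: float(-b) for c, b in zip(cats, beta[1:])}

# Toy demonstration on synthetic data (not the study's data): "face" items share
# a prototype pattern, the other two categories are unstructured noise.
rng = np.random.default_rng(0)
proto = rng.normal(size=50)
toy_features = np.vstack([
    proto + 0.1 * rng.normal(size=(4, 50)),  # tightly clustered category
    rng.normal(size=(8, 50)),                # two unclustered categories
])
toy_labels = ["face"] * 4 + ["animate"] * 4 + ["inanimate"] * 4
print(clustering_strength(toy_features, toy_labels))
```

In a setting like the one described, such a fit would be repeated for each training epoch and for the brain data, giving one clustering-strength trajectory per category that can then be compared across networks and against cortex.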