ImageNet-21K Pretraining for the Masses

Published 22 Apr 2021 in cs.CV and cs.LG | (2104.10972v4)

Abstract: ImageNet-1K serves as the primary dataset for pretraining deep learning models for computer vision tasks. ImageNet-21K dataset, which is bigger and more diverse, is used less frequently for pretraining, mainly due to its complexity, low accessibility, and underestimation of its added value. This paper aims to close this gap, and make high-quality efficient pretraining on ImageNet-21K available for everyone. Via a dedicated preprocessing stage, utilization of WordNet hierarchical structure, and a novel training scheme called semantic softmax, we show that various models significantly benefit from ImageNet-21K pretraining on numerous datasets and tasks, including small mobile-oriented models. We also show that we outperform previous ImageNet-21K pretraining schemes for prominent new models like ViT and Mixer. Our proposed pretraining pipeline is efficient, accessible, and leads to SoTA reproducible results, from a publicly available dataset. The training code and pretrained models are available at: https://github.com/Alibaba-MIIL/ImageNet21K

Authors (4)

Citations (607)

Summary

The paper introduces a novel semantic softmax scheme that leverages hierarchical label structures to improve pretraining efficiency.
The authors design a comprehensive pipeline that cleans, standardizes, and optimizes the large ImageNet-21K dataset for broad accessibility.
Experimental results show that the proposed approach outperforms traditional ImageNet-1K pretraining, benefiting both large-scale and mobile-oriented models.

Analyzing ImageNet-21K Pretraining for Broad Accessibility

The paper "ImageNet-21K Pretraining for the Masses" addresses a gap in how often, and how easily, the ImageNet-21K dataset is used to pretrain computer vision models. ImageNet-1K has traditionally been the default pretraining dataset for deep learning models because of its manageable size, simplicity, and standardized structure. ImageNet-21K, however, offers a much larger and more diverse set of classes, which can improve model performance across a range of downstream tasks.

Key Contributions

The authors introduce a comprehensive and efficient pipeline for pretraining on the ImageNet-21K dataset, aiming to make this resource more accessible to researchers and practitioners. The pipeline involves:

Dataset Preparation: The preprocessing includes cleaning invalid classes, forming a standardized train-validation split, and resizing images to reduce the dataset's memory footprint.
Utilizing Semantic Structures: By leveraging the WordNet semantic tree, the authors transform ImageNet-21K into a multi-label dataset in which each image carries its original class together with that class's ancestors in the tree. However, they observe that straightforward multi-label training does not outperform single-label training, owing to optimization issues such as extreme class imbalance.
Semantic Softmax Training: The authors introduce a novel "semantic softmax" scheme that exploits the hierarchical label structure. It applies a separate softmax to the classes at each level of the label hierarchy, which avoids the extreme multi-tasking that hampers regular multi-label training.
Semantic Knowledge Distillation: To further improve pretraining quality, the paper integrates semantic softmax with a knowledge distillation framework. Taking hierarchical consistency into account allows non-conventional labels to be predicted more accurately.
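
The multi-label conversion and the per-level softmax described above can be illustrated with a minimal sketch. It is not the authors' implementation: the toy hierarchy, the names PARENT, LEVELS, ancestry and semantic_softmax_loss, and the uniform averaging across levels are all assumptions made for illustration, and the paper's own weighting of hierarchy levels is not reproduced here.

```python
import math

# Hypothetical toy hierarchy (child -> parent); None marks a root class.
# The real WordNet tree behind ImageNet-21K is far larger.
PARENT = {
    "animal": None,
    "plant": None,
    "dog": "animal",
    "cat": "animal",
    "tree": "plant",
    "beagle": "dog",
}


def ancestry(label):
    """Return the root-to-label path, i.e. the multi-label set for `label`."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path[::-1]


# Group classes by depth so that each hierarchy level gets its own softmax.
_by_depth = {}
for _name in PARENT:
    _by_depth.setdefault(len(ancestry(_name)) - 1, []).append(_name)
LEVELS = [sorted(_by_depth[d]) for d in sorted(_by_depth)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def semantic_softmax_loss(logits_by_class, label):
    """Average of per-level cross-entropies along the label's ancestry.

    Levels deeper than the label itself carry no ground truth and are skipped.
    Uniform averaging is an assumption of this sketch.
    """
    path = ancestry(label)
    total = 0.0
    for level, target in enumerate(path):
        classes = LEVELS[level]
        log_probs = log_softmax([logits_by_class[c] for c in classes])
        total -= log_probs[classes.index(target)]
    return total / len(path)


# Usage: with all-zero logits every level predicts a uniform distribution,
# so the loss for "beagle" is (ln 2 + ln 3 + ln 1) / 3.
uniform_logits = {name: 0.0 for name in PARENT}
example_loss = semantic_softmax_loss(uniform_logits, "beagle")
```

Because each softmax only competes among classes at the same depth, a coarse class such as "animal" never competes directly with a fine-grained one such as "beagle", which is the intuition behind avoiding a single flat multi-label objective.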

Experimental Study

The authors provide extensive empirical validation, showing that semantic softmax pretraining consistently outperforms standard ImageNet-1K pretraining across a wide range of downstream tasks, including image classification, multi-label classification, and video recognition. The study also demonstrates the scalability and efficiency of their pipeline by successfully pretraining both large models such as TResNet-L and mobile-oriented models like MobileNetV3, suggesting widespread applicability.

Implications and Future Directions

The research has several practical implications:

Enhanced Model Performance: The use of ImageNet-21K with the proposed pipeline significantly boosts performance across various computer vision models and tasks, even benefiting smaller, mobile-optimized models.
Accessible Pretraining: By offering a streamlined and efficient method for using the ImageNet-21K dataset, the paper democratizes access to rich pretraining resources that previously required substantial computational resources.
Framework Generalizability: While this work focuses on ImageNet-21K, the principles and methodologies could be extrapolated to other large-scale datasets, fostering enhanced model pretraining strategies across different domains.

For future work, the integration of semantic approaches and hierarchical structures into model training remains a rich area for exploration. Further research could investigate how best to combine these strategies with other advanced training techniques to maximize efficiency and accuracy.

In conclusion, this paper provides a substantial contribution to the understanding and application of large-scale datasets in neural network pretraining. By effectively harnessing the complex structures within ImageNet-21K, the work opens up new possibilities for efficient, high-quality model development in the field of computer vision.