EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

18 Jun 2023

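A study that scales masked image modeling (MIM) to a billion-parameter-class model and on the order of ten million images. It shows that discrete tokenization of the prediction target is unnecessary: it is enough to predict the CLIP features of the masked-out patches from the visible patches.

Basic information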

@InProceedings{Fang_2023_CVPR,
    author    = {Fang, Yuxin and Wang, Wen and Xie, Binhui and Sun, Quan and Wu, Ledell and Wang, Xinggang and Huang, Tiejun and Wang, Xinlong and Cao, Yue},
    title     = {EVA: Exploring the Limits of Masked Visual Representation Learning at Scale},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {19358-19369}
}

Paper links

CVPR / arXiv / GitHub

Authors and affiliations

Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao (Beijing Academy of Artificial Intelligence, Huazhong University of Science and Technology, Zhejiang University, Beijing Institute of Technology)

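Novelty

The paper examines the design choices needed to scale masked image modeling to a billion-parameter-class model. It finds that the discrete tokenization of prediction targets used in the BEiT series is unnecessary, and that the approach adopted in MVP and MILAN, which directly predicts the CLIP features corresponding to the masked patches, scales best.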
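To make "directly predicting the CLIP features of the masked patches" concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. Everything in it is an assumption made for illustration: the function names (`cosine_similarity`, `masked_feature_loss`, `random_mask`), the use of negative cosine similarity as the regression loss, the uniform random masking, and the toy sizes. The only idea taken from the note itself is that the loss compares predicted features with target (CLIP) features at masked positions only.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors (lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def masked_feature_loss(predicted, target, mask):
    """Mean negative cosine similarity, computed over masked patches only.

    predicted, target: one feature vector per patch.
    mask: one boolean per patch, True where the patch was hidden.
    """
    masked = [i for i, hidden in enumerate(mask) if hidden]
    if not masked:
        raise ValueError("mask selects no patches")
    total = sum(cosine_similarity(predicted[i], target[i]) for i in masked)
    return -total / len(masked)


def random_mask(num_patches, ratio, rng):
    """Hide round(ratio * num_patches) patches chosen uniformly (an assumption)."""
    k = round(ratio * num_patches)
    chosen = set(rng.sample(range(num_patches), k))
    return [i in chosen for i in range(num_patches)]


# Toy stand-ins: random vectors play the role of the frozen teacher's features.
rng = random.Random(0)
num_patches, dim = 16, 8
target = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_patches)]
mask = random_mask(num_patches, 0.4, rng)

# A student that reproduces the targets exactly reaches the minimum loss.
perfect_loss = masked_feature_loss(target, target, mask)
# A student that predicts the opposite direction gets the maximum loss.
flipped = [[-x for x in vec] for vec in target]
worst_loss = masked_feature_loss(flipped, target, mask)
```

In a real setup the targets would come from a frozen CLIP vision encoder applied to the full image, and the predictions from the model being pre-trained, which sees only the visible patches plus mask tokens. No discrete tokenizer or codebook appears anywhere in this objective, which is the point the paper makes against the BEiT-style setup.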

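Method

Results

Discussion and comments

EVA apparently stands for "Explore the limits of Visual representation at scAle".
The experimental details are well documented, so the work looks easy to reproduce (although, given the scale, few people will actually be able to reproduce it).
- However, the paper does not explicitly state what predicting CLIP features in MIM concretely involves.
  - Presumably the reader is expected to consult the MVP and MILAN papers.
EVA outperforms prior work on a range of tasks, but the margins are quite thin.
- If scaling the model up this much yields only this much gain, the approach seems hard to justify in practice.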

Memoranda: machine learning, computer vision, and occasionally physics
