1. Introduction
In recent years, there has been significant progress in Language Models (LMs) [7, 11, 47, 52] and Vision Language Models (VLMs) [1, 2, 6, 8, 10, 15-18, 21, 23, 25-30, 39-41, 44, 45, 49, 50, 56, 57, 60-63, 65-75, 77-80], which exhibit strong zero-shot generalization and adaptability to a wide range of tasks. Though they may differ in architecture, data, and task formulation, such foundation models predominantly rely on large-scale pre-training on massive corpora of web-scraped data, such as C4 [51], The Pile [20], and LAION-5B [54], which serve as the source of their generalization capability.