Bo He - IEEE Xplore Author Profile

Showing 1-4 of 4 results

With the success of large language models (LLMs), integrating vision models into LLMs to build vision-language foundation models has recently attracted growing interest. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective...
Diversified image color editing is typically modeled as a multimodal image-to-image translation (MMI2IT) problem with an impact on multiple applications such as photo enhancement and retouching. Although previous GAN-based algorithms successfully generate diverse edits with clear control, we observe two issues remaining: Firstly, they tend to apply the same color style to all kinds of input images...
The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities an...
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training. Without the boundary information of action segments, existing methods mostly rely on multiple instance learning (MIL), where the predictions of unlabeled instances (i.e., video snippets) are supervised by classifying labeled bags (i.e....