
ITCONTRAST: Contrastive Learning with Hard Negative Synthesis for Image-Text Matching

EasyChair Preprint 9930, version 1

Versions: 1, 2
9 pages
Date: April 6, 2023

Abstract

Image-text matching aims to bridge vision and language by matching an instance of one modality with an instance of the other. Recent years have seen considerable progress in this area through the exploration of local alignment between image regions and sentence words. However, how to learn modality-invariant feature embeddings, and how to make use of the hard negatives in the training set to infer more accurate matching scores, are still open questions. In this paper, we attempt to address these problems by introducing a new Image-Text Modality Contrastive Learning approach (abbreviated as ITContrast) for image-text matching. Specifically, a pre-trained vision-language model, OSCAR, is first fine-tuned to obtain the visual and textual features. A hard negative synthesis module is then introduced to leverage the hardness of negative samples: it profiles the negative samples in a mini-batch and generates representatives that reflect their hardness relations to the anchor. A novel cost function is designed to combine the knowledge of positives, negatives, and synthesized hard negatives. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that our approach is effective for image-text matching.
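
The abstract does not spell out the synthesis rule or the cost function, so the following is only a generic sketch of the idea of combining in-batch hard negatives with synthesized ones, not the ITContrast method itself. It assumes cosine similarity, a hinge loss with a hypothetical `margin`, the image-to-text direction only, and a hypothetical synthesis step that interpolates (with weight `mix`) between the positive caption and the hardest in-batch negative caption. All function and parameter names are illustrative.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def loss_with_synthetic_negatives(img, txt, margin=0.2, mix=0.5):
    """Hinge loss over hardest and synthesized negatives (image-to-text only).

    img, txt: arrays of shape (B, D) with B >= 2; row i of img is assumed
    to match row i of txt. `margin` and `mix` are illustrative
    hyperparameters, not values taken from the paper.
    """
    img = l2_normalize(img)
    txt = l2_normalize(txt)

    sim = img @ txt.T                  # (B, B) cosine similarities
    pos = np.diag(sim)                 # similarity of each matched pair

    # Exclude the matched pair, then pick the hardest in-batch negative
    # caption for each image (the one with the highest similarity).
    neg_sim = sim.copy()
    np.fill_diagonal(neg_sim, -np.inf)
    hardest_idx = neg_sim.argmax(axis=1)
    hardest_sim = neg_sim.max(axis=1)

    # ASSUMPTION: synthesize a harder negative by interpolating between
    # the positive caption and the hardest negative caption.
    synth = l2_normalize(mix * txt + (1.0 - mix) * txt[hardest_idx])
    synth_sim = np.sum(img * synth, axis=1)

    loss_hard = np.maximum(0.0, margin + hardest_sim - pos)
    loss_synth = np.maximum(0.0, margin + synth_sim - pos)
    return float(np.mean(loss_hard + loss_synth))
```

In this sketch, a larger `mix` moves the synthesized caption closer to the positive one, which makes the negative harder and the loss term more likely to be active. A symmetric text-to-image term could be added in the same way.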

Keyphrases: Contrastive Learning, Hard Negative Synthesis, Image-Text Matching, Multimodal Deep Learning

BibTeX entry
BibTeX does not have the right entry for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:9930,
  author    = {Fangyu Wu and Qiufeng Wang and Qi Chen and Yushi Li and Bailing Zhang and Eng Gee Lim},
  title     = {ITCONTRAST: Contrastive Learning with Hard Negative Synthesis for Image-Text Matching},
  howpublished = {EasyChair Preprint 9930},
  year      = {EasyChair, 2023}}