ITCONTRAST: Contrastive Learning with Hard Negative Synthesis for Image-Text Matching

EasyChair Preprint 9930, version 1
9 pages • Date: April 6, 2023

Abstract

Image-text matching aims to bridge vision and language by matching an instance of one modality with an instance of the other. Recent years have seen considerable progress in this research area through exploring local alignment between image regions and sentence words. However, how to learn modality-invariant feature embeddings and how to exploit the hard negatives in the training set to infer more accurate matching scores remain open questions. In this paper, we attempt to solve these problems by introducing a new Image-Text Modality Contrastive Learning (abbreviated as ITContrast) approach for image-text matching. Specifically, a pre-trained vision-language model, OSCAR, is first fine-tuned to obtain visual and textual features, and a hard negative synthesis module is then introduced to leverage the hardness of negative samples: it profiles the negative samples in a mini-batch and generates representatives that reflect their hardness relative to the anchor. A novel cost function is designed to comprehensively combine the knowledge of positives, negatives, and synthesized hard negatives. Extensive experiments on the MS-COCO and Flickr30K datasets demonstrate that our approach is effective for image-text matching.

Keyphrases: Contrastive Learning, Hard Negative Synthesis, Multimodal Deep Learning, image-text matching
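The abstract leaves the synthesis rule and the cost function to the full paper. As a rough illustration of the idea it describes, the sketch below combines positives, in-batch negatives, and synthetic hard negatives in an InfoNCE-style objective. The similarity-weighted mixing rule, the function names (synthesize_hard_negatives, itcontrast_loss), and all hyperparameters are assumptions for illustration only, not the authors' formulation.

```python
# Illustrative sketch only: the synthesis rule (similarity-weighted mixing of
# in-batch negatives) and the InfoNCE-style loss are assumptions, not the
# paper's exact method.
import torch
import torch.nn.functional as F


def synthesize_hard_negatives(anchor, negatives, num_synth=4):
    """Profile in-batch negatives by similarity to the anchor and mix them,
    biased toward the hardest ones, into synthetic representatives."""
    sims = F.cosine_similarity(anchor.unsqueeze(0), negatives, dim=-1)   # (N,)
    weights = torch.softmax(sims, dim=0)                                 # harder -> larger weight
    # Each synthetic negative is a random convex combination of the negatives.
    mix = torch.distributions.Dirichlet(weights * negatives.size(0) + 1e-4).sample((num_synth,))
    return F.normalize(mix @ negatives, dim=-1)                          # (num_synth, D)


def itcontrast_loss(img_emb, txt_emb, temperature=0.07):
    """Contrastive loss over the positive pair, in-batch negatives,
    and synthesized hard negatives (batch size must be > 1)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    losses = []
    for i in range(img_emb.size(0)):
        anchor, positive = img_emb[i], txt_emb[i]
        negatives = torch.cat([txt_emb[:i], txt_emb[i + 1:]], dim=0)
        hard_negs = synthesize_hard_negatives(anchor, negatives)
        all_negs = torch.cat([negatives, hard_negs], dim=0)
        # Logit 0 is the positive; the rest are (real + synthetic) negatives.
        logits = torch.cat([(anchor * positive).sum().view(1),
                            all_negs @ anchor]) / temperature
        losses.append(F.cross_entropy(logits.unsqueeze(0),
                                      torch.zeros(1, dtype=torch.long)))
    return torch.stack(losses).mean()
```

In practice such an objective would typically be computed symmetrically in both retrieval directions (image-to-text and text-to-image) over features produced by the fine-tuned OSCAR encoder.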