
Multimodal Neural Machine Translation Using CNN and Transformer Encoder

EasyChair Preprint 873

11 pages. Date: April 2, 2019

Abstract

Multimodal machine translation uses images related to source language sentences as additional inputs to improve translation quality. Previous multimodal Neural Machine Translation (NMT) models, which incorporate the visual features of each image region into the encoder for source language sentences or into an attention mechanism between the encoder and the decoder, cannot capture the relations between visual features of different image regions. This paper proposes a new multimodal NMT model, which encodes an input image using a Convolutional Neural Network (CNN) and a Transformer encoder. In particular, the proposed image encoder first extracts visual features from each image region using a CNN and then encodes the input image on the basis of the extracted visual features using a Transformer encoder, where the relations between the visual features of different image regions are captured by the self-attention mechanism of the Transformer encoder. Experiments on the English-German translation task using the Multi30k data set show that the proposed model achieves an improvement of 0.96 BLEU points over a baseline Transformer NMT model without image inputs and 0.47 BLEU points over a baseline multimodal Transformer NMT model without a Transformer encoder for images.
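The image encoder described in the abstract can be sketched roughly as follows. This is a minimal, hypothetical PyTorch sketch (not the authors' code), assuming a ResNet-style CNN backbone whose spatial feature grid serves as the set of region features, which are then passed through a standard Transformer encoder so that self-attention relates the regions to one another. The class name `ImageTransformerEncoder` and all hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of a CNN + Transformer image encoder:
# a CNN extracts per-region visual features, then a Transformer encoder
# applies self-attention over those regions to capture inter-region relations.
import torch
import torch.nn as nn
import torchvision.models as models


class ImageTransformerEncoder(nn.Module):  # illustrative name, not from the paper
    def __init__(self, d_model=512, nhead=8, num_layers=4):
        super().__init__()
        # CNN backbone: ResNet-50 with its pooling/classification head removed,
        # so the output is a spatial grid of region features (assumption).
        resnet = models.resnet50(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-2])
        self.proj = nn.Linear(2048, d_model)  # project CNN channels to d_model
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, images):                      # images: (batch, 3, H, W)
        feats = self.cnn(images)                    # (batch, 2048, h, w)
        b, c, h, w = feats.shape
        regions = feats.flatten(2).transpose(1, 2)  # (batch, h*w, 2048)
        regions = self.proj(regions)                # (batch, h*w, d_model)
        # Self-attention over the h*w regions captures relations between them.
        return self.encoder(regions)                # (batch, h*w, d_model)


# Usage sketch: the encoded image regions would then be attended to by the
# NMT decoder alongside the encoded source sentence.
# enc = ImageTransformerEncoder()
# image_memory = enc(torch.randn(2, 3, 224, 224))
```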

Keyphrases: CNN, machine translation, multimodal learning, Transformer

BibTeX entry
BibTeX does not have the right entry type for preprints. This is a hack for producing the correct reference:
@booklet{EasyChair:873,
  author    = {Hiroki Takushima and Akihiro Tamura and Takashi Ninomiya and Hideki Nakayama},
  title     = {Multimodal Neural Machine Translation Using CNN and Transformer Encoder},
  doi       = {10.29007/hxhn},
  howpublished = {EasyChair Preprint 873},
  year      = {EasyChair, 2019}}