Combining Textual and Visual Information for Machine Translation

 

 

Lauren Tran

 

Problem Statement

In this multidisciplinary project, which spans the fields of computer vision, machine learning, and natural language processing (NLP), we propose to combine the visual similarity of images with corpus-based distributional semantics for NLP tasks.  By exploiting both textual and visual information, we will pursue the task of machine translation, which, on a larger scale, extends to the broader problem of semantic similarity.

 

To approach this machine translation task, we will use multilingual image sets to determine the visual similarity of images across languages.  The multilingual labels will allow us to derive translations once we find matches that exceed a specified similarity threshold.  We will expand on previous work by introducing multimodal distributional models, which allow us to model the meaning of words with both visual and textual information.  By generating a feature vector for each word, we can evaluate the distance between vectors and subsequently generate bilingual translations.

 

Methods

Beginning with the bag-of-visual-words model, we will extract salient features (SIFT descriptors [4]) from the images and use k-means clustering to calculate cluster centers.  The value we choose for k is the number of visual words that we define in our vocabulary.  Because the vocabulary size can significantly impact our results, we plan to experiment with a wide range of values for k.  With a visual vocabulary defined, we can then generate histograms that represent the frequency of visual-word occurrences in each image [1].
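As a rough sketch of this pipeline (assuming OpenCV for SIFT extraction and scikit-learn for k-means; the specific libraries and k = 500 are illustrative choices on our part, not fixed by prior work):

import cv2
import numpy as np
from sklearn.cluster import KMeans

def extract_sift_descriptors(image_paths):
    """Collect SIFT descriptors for every image in the set."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descs = sift.detectAndCompute(img, None)
        per_image.append(descs if descs is not None else np.empty((0, 128), np.float32))
    return per_image

def build_visual_vocabulary(per_image_descs, k=500):
    """Cluster all descriptors; the k cluster centers are the visual words."""
    all_descs = np.vstack(per_image_descs)
    return KMeans(n_clusters=k, n_init=10).fit(all_descs)

def visual_word_histogram(descs, vocabulary):
    """Frequency histogram of visual-word occurrences for one image."""
    k = vocabulary.n_clusters
    if len(descs) == 0:
        return np.zeros(k)
    words = vocabulary.predict(descs)
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()

These normalized histograms are the per-image representations we carry into the classification step described below.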

 

In related work, Bruni et al. [2] and Bergsma and Van Durme [3] use a relatively unsophisticated method of exploiting visual information.  They simply count visual word occurrences to build feature vectors, which they compare to determine image similarity.  To expand upon these works, we propose to introduce classifiers such as support vector machines (SVMs) to learn our model.  By applying classifiers, we can generate a confusion matrix that will provide a more sophisticated means of determining visual similarity.  Each row of the confusion matrix will represent an individual word, and thus we can use this output to construct the visual portion of our feature vectors.
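One possible shape for this step, sketched with scikit-learn's linear SVM (the cross-validation setup and function names here are our own assumptions, not a fixed design):

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_predict

def visual_confusion_features(histograms, labels):
    """Train a multi-class SVM on visual-word histograms and return each
    class's row of the row-normalized confusion matrix as its visual
    feature vector."""
    clf = SVC(kernel="linear")
    # Cross-validated predictions, so the confusion matrix is not computed
    # on the data the classifier was trained on.
    predictions = cross_val_predict(clf, histograms, labels, cv=5)
    cm = confusion_matrix(labels, predictions).astype(float)
    cm /= cm.sum(axis=1, keepdims=True)  # each row sums to 1
    return cm  # row i is the visual feature vector for word/class i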

 

To generate the textual models, we will extract textual distributional information from large text corpora.  Our feature vectors will contain occurrence and co-occurrence statistics, as well as dependency relations between words.  To build their multimodal models, Bruni et al. use the following equation to combine the text feature vector Ft and the visual feature vector Fv:

 

F = α × Ft ⊕ (1 − α) × Fv   [2]

 

They simply concatenate the two vectors after normalizing each vector separately and applying a manually tuned weight, α.  We plan to derive a more sophisticated means of combining the feature vectors.  Here, the clear starting point is devising an intelligent method of learning α as opposed to manual tuning.  Once we have built our feature vectors, we can evaluate the similarity between vectors in order to find accurate translations.  Bergsma and Van Durme use cosine similarity to determine the distance between vectors [3].  While this is a good starting point, we can experiment with different distance functions to potentially improve our results.
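The sketch below makes the combination and comparison steps concrete; the grid search over α against a small seed lexicon is our proposed alternative to manual tuning, and the data layout of the seed lexicon is purely illustrative:

import numpy as np

def combine(f_text, f_vis, alpha):
    """F = alpha * Ft concatenated with (1 - alpha) * Fv, after
    normalizing each modality separately, following [2]."""
    f_text = f_text / (np.linalg.norm(f_text) + 1e-12)
    f_vis = f_vis / (np.linalg.norm(f_vis) + 1e-12)
    return np.concatenate([alpha * f_text, (1 - alpha) * f_vis])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def tune_alpha(seed_lexicon, candidates=np.linspace(0.0, 1.0, 21)):
    """Choose the alpha that maximizes translation accuracy on a small seed
    lexicon.  Each entry is ((src_text, src_vis), target_vectors, gold_idx),
    where target_vectors is a list of (text, vis) pairs for candidate
    translations and gold_idx marks the correct one."""
    best_alpha, best_acc = 0.5, -1.0
    for alpha in candidates:
        correct = 0
        for (src_text, src_vis), target_vectors, gold_idx in seed_lexicon:
            src = combine(src_text, src_vis, alpha)
            scores = [cosine(src, combine(t, v, alpha)) for t, v in target_vectors]
            correct += int(np.argmax(scores) == gold_idx)
        accuracy = correct / len(seed_lexicon)
        if accuracy > best_acc:
            best_alpha, best_acc = alpha, accuracy
    return best_alpha

Swapping cosine similarity for another distance function (for example, Euclidean distance) only requires changing the scoring line inside the loop.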

 

Data Sets

As a starting point, we will use the image sets from Bergsma and Van Durme's work, which are publicly available online [3].  There are two different sets of images, one containing 500 classes across 6 languages and the other containing 20,000 classes across 3 languages.  Each class contains at most 20 images.  Figure 1 illustrates the idea that Google Image Search results for the same word in two different languages will exhibit very similar visual features [3].  This will allow us to build meaningful visual feature vectors to aid with our translations.




Figure 1: Training images for the words “candle” and “vela” which exhibit similar visual features.


These data sets, however, include near-duplicate images across languages, which makes their work far less compelling with respect to the vision component.  We can take advantage of our extracted SIFT features to find and eliminate near duplicates [4].  The new data set will set the stage for more meaningful results.  To extract textual information, we will use large text corpora such as Wikipedia and possibly the freely available Distributional Memory (DM) model.
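A hedged sketch of the near-duplicate filter mentioned above, using OpenCV's brute-force matcher with Lowe's ratio test [4]; the 0.75 ratio and the minimum match count are illustrative values we would need to tune on our data:

import cv2

def is_near_duplicate(descs_a, descs_b, ratio=0.75, min_matches=50):
    """Flag two images as near-duplicates when many of their SIFT
    descriptors match under Lowe's ratio test."""
    if descs_a is None or descs_b is None or len(descs_a) < 2 or len(descs_b) < 2:
        return False
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn_pairs = matcher.knnMatch(descs_a, descs_b, k=2)
    good = [pair[0] for pair in knn_pairs
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good) >= min_matches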

 

Milestone

By the milestone, we plan to replicate the Bergsma and Van Durme experiments, which will serve as our baseline.  We also expect to apply classifiers and introduce textual information as valuable extensions of this previous work.  Once we have a working implementation using classifiers and multimodal distributional models by the milestone, we can experiment with other improvements and fine-tune our method for the final project.


References

[1] Richard Szeliski, Computer Vision: Algorithms and Applications, Springer, 2011.

 

[2] Elia Bruni, Giang Binh Tran, and Marco Baroni, Distributional semantics from text and images. Proceedings of the EMNLP Geometrical Models for Natural Language Semantics Workshop, 2011.

 

[3] Shane Bergsma and Benjamin Van Durme, Learning Bilingual Lexicons using the Visual Similarity of Labeled Web Images. Proceedings of IJCAI, 2011.

 

[4] David G. Lowe, Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, pages 91-110, January 2004.