Lauren Tran
This interdisciplinary project lies at the crossroads of three major fields: computer vision, machine learning, and natural language processing (NLP). The fundamental idea behind this task is that visual information is universal, which allows us to exploit image similarity across labeled multilingual image sets. By extracting visual information from images and combining it with corpus-based distributional semantics, we obtain a multimodal representation of the words in our lexicon. With these feature vectors, we then measure the distance between words in order to generate bilingual translations.
We approach this task with the bag-of-words model to transform our data into feature vectors [1]. Thus far, we have focused on the visual component of our multimodal models, and our preliminary results cover English-to-Spanish translations within the 500-word lexicon from Bergsma and Van Durme [2]. Our pipeline is as follows:
To gather visual information from our images so that we can conduct meaningful comparisons, we first extract salient features from all images across every language using Lowe's Scale-Invariant Feature Transform (SIFT) [3].
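As a rough sketch of this step, assuming OpenCV's SIFT implementation (the directory layout and function name below are illustrative, not our actual scripts):

```python
import glob

import cv2

def extract_sift_descriptors(image_paths):
    """Return a list of per-image SIFT descriptor arrays (one 128-D row per keypoint)."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue  # skip unreadable files
        _, descriptors = sift.detectAndCompute(img, None)
        if descriptors is not None:
            per_image.append(descriptors)
    return per_image

# e.g., descriptors for every English image (hypothetical directory layout)
english_descriptors = extract_sift_descriptors(glob.glob("images/english/*/*.jpg"))
```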
We then build a vocabulary of visual words by performing k-means clustering to locate a set of k cluster centers, each of which represents a visual word. Thus far, we have defined vocabularies using k = 1,000, k = 10,000, and k = 20,000. While Bergsma and Van Durme only used features from the English set, we sample features over all languages for a more balanced approach.
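A minimal sketch of the clustering step, given per-language lists of per-image descriptors from the previous step and assuming scikit-learn's MiniBatchKMeans (the uniform subsampling and sample size here are simplifying assumptions, not our exact sampling scheme):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_visual_vocabulary(descriptor_lists, k, sample_size=500_000, seed=0):
    """Cluster a random sample of SIFT descriptors from all languages into k visual words."""
    descriptors = np.vstack([d for lang in descriptor_lists for d in lang])
    rng = np.random.default_rng(seed)
    if len(descriptors) > sample_size:
        keep = rng.choice(len(descriptors), size=sample_size, replace=False)
        descriptors = descriptors[keep]
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=seed)
    kmeans.fit(descriptors)
    return kmeans  # kmeans.cluster_centers_ holds the k visual words
```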
With our vocabulary, we next represent each image as a histogram of visual word frequencies. For every image in our data set, we count the occurrences of each visual word and store the counts in a vector. We have quantized the English and Spanish images for all vocabulary sizes, as well as the French images for k = 1,000.
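Continuing the sketch above, quantizing a single image against a vocabulary of size k might look like the following (illustrative rather than our exact code; the vocabulary passed in must be the one built with that same k):

```python
import numpy as np

def quantize_image(descriptors, kmeans, k):
    """Represent one image as a histogram of visual-word counts."""
    words = kmeans.predict(descriptors)     # nearest cluster center per descriptor
    return np.bincount(words, minlength=k)  # one count per visual word

# each image becomes a length-k feature vector
english_histograms = [quantize_image(d, kmeans, k=1000) for d in english_descriptors]
```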
In order to generate our translations, we compute the cosine similarity between feature vectors. We use two methods, presented in [2], to calculate the similarity between an English word E and a Spanish word F, shown in Equations 1 and 2. We also experimented with variations, including taking the average over the top-k scores, using the median score, and averaging only the scores that fall within one standard deviation of the mean. We have thus far computed the similarity scores between the English and Spanish sets for k = 1,000 and k = 10,000.
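For reference, the two measures, as we understand them from [2], can be written as follows, where \cos(e, f) denotes the cosine similarity between the histogram of an image e of word E and that of an image f of word F (this is our reading of the paper, not a verbatim reproduction):

  \mathrm{MaxMax}(E, F) = \max_{e \in E} \, \max_{f \in F} \cos(e, f)   (1)

  \mathrm{AvgMax}(E, F) = \frac{1}{|E|} \sum_{e \in E} \max_{f \in F} \cos(e, f)   (2)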
In addition to the standard pipeline outlined above, we also leverage our SIFT features to detect near-duplicate images that appear across languages. While using duplicates makes sense from a pure translation and NLP standpoint, eliminating the duplicates makes for a more compelling problem from a machine learning and vision perspective. Figure 1 shows a snapshot of near-duplicate statistics across all languages. We show the top 5 and bottom 5 classes that contain duplicates.
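A sketch of how such near-duplicates can be flagged using the SIFT descriptors we already extract (the ratio test and the thresholds below are illustrative assumptions, not our tuned values or exact procedure):

```python
import cv2

def is_near_duplicate(desc_a, desc_b, ratio=0.75, min_matches=50):
    """Flag two images as near-duplicates if enough SIFT descriptors match well."""
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(desc_a, desc_b, k=2)
    # Lowe's ratio test: keep matches whose best distance clearly beats the second best
    good = []
    for pair in knn:
        if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance:
            good.append(pair[0])
    return len(good) >= min_matches
```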
To measure the accuracy of our translations, we determine the top-1, top-5, and top-20 accuracy, which measure how often the ground-truth translation falls within the top 1, 5, and 20 highest-ranked matches. We show our accuracy results in Figure 2, using all images in the original data set as well as a reduced set with all near-duplicates removed. When removing images from our data, we run into the issue that some classes are primarily made up of duplicates. To alleviate this problem, we decided to eliminate entirely the classes that have 11 or fewer images remaining.
This threshold was chosen by referring to results from [2], depicted in Figure 3. The graph seems to indicate that accuracy begins to suffer dramatically when using fewer than 12 images per class. Thus, we remove 9 classes in total from the Spanish set. The accuracy for the reduced data set decreases by less than 2% for both vocabulary sizes, despite our inclusion of 29 classes that have been stripped of their duplicates. These results seem to suggest that we do not need parallel data to achieve comparable accuracy; removing near-duplicates is not detrimental, which is very promising. Figure 2 also shows the increased accuracy from using k = 10,000 in our visual vocabulary. The final column of Figure 2 shows the accuracy results that [2] achieved.
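As a concrete sketch of how these top-n accuracies can be computed (assuming a precomputed English-by-Spanish similarity matrix and an array of ground-truth translation indices; both names are hypothetical):

```python
import numpy as np

def top_n_accuracy(similarity, ground_truth, n):
    """similarity[i, j]: score of Spanish word j for English word i;
    ground_truth[i]: index of the correct Spanish translation of English word i."""
    hits = 0
    for i, truth in enumerate(ground_truth):
        ranked = np.argsort(-similarity[i])  # candidate translations, best first
        if truth in ranked[:n]:
            hits += 1
    return hits / len(ground_truth)

# top-1, top-5, and top-20 accuracy over the lexicon
# accuracies = [top_n_accuracy(sim_matrix, truth, n) for n in (1, 5, 20)]
```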
Figure 4 illustrates the benefit of using the AvgMax method rather than the MaxMax method when computing the similarity between words. As previously mentioned, we experimented with several variations of these measures. In preliminary testing on small image sets, the variations yielded no improvement; in fact, they performed worse than the plain AvgMax method. We found that accuracy actually decreases as we use fewer scores in the comparison, which suggests that the AvgMax method is already robust to outliers in the data set.
Thus far, we have focused on visual information, and the next step is to add the textual component to our representation. We will experiment with different methods of combining our text and image feature vectors to yield the best results. Additionally, we propose to apply classifiers in order to generate a more sophisticated representation of our visual information, instead of simply comparing image similarity with our distance function.
[1] R. Szeliski. Computer Vision: Algorithms and Applications. Springer, 2011.
[2] S. Bergsma and B. Van Durme. Learning bilingual lexicons using the visual similarity of labeled web images. In Proc. IJCAI, 2011.
[3] D. G. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.