Spoken Language Identification with Neural Networks

Jing Wei Pan, Chuanqi Sun

 

Problem

 

This project aims to solve the problem of spoken language identification: the machine identifies the language spoken in an audio clip by choosing from a set of languages it has been trained to "understand". Automated language identification, though rarely implemented, is an essential preprocessing step for many applications in which the identity of the language must be known before further processing. Natural language processing interfaces such as Siri and online multilingual voice-based translators such as Google Translate both require a manual selection of the input language. Automated language identification can also assist human operators; for instance, CIA agents who monitor telephone conversations need to identify the language spoken by terrorists.

 

Method

 

The solution consists of three stages: preprocessing, training, and application. During preprocessing, the raw audio file is converted into a sequence of either spectrogram frames or Mel-frequency cepstral coefficient (MFCC) feature vectors. The two representations are handled identically during the training stage, but both will be explored for comparison. Given this input, the neural network minimizes an error function via backpropagation. For multi-class classification, both one-vs-all classifiers and a single network with multiple output neurons will be tested. Finally, at application time the neural network takes an audio clip as input and reports on its output layer the most likely identity of the language. Under the one-vs-all method, the language is identified by comparing scores across the multiple binary networks.
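The preprocessing and decision steps above can be sketched as follows. This is a minimal NumPy illustration, not the project's actual pipeline: the frame length (25 ms), hop (10 ms), Hann window, and the `identify` helper are all assumptions made for the example, and the one-second 440 Hz tone stands in for a real speech clip.

```python
import numpy as np

def log_spectrogram(signal, sample_rate, frame_ms=25, hop_ms=10):
    """Frame the signal, apply a Hann window, and take the magnitude FFT
    of each frame; the log of the result is the spectrogram fed to the net."""
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    frames = np.stack([signal[i * hop_len : i * hop_len + frame_len] * window
                       for i in range(n_frames)])
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(magnitudes + 1e-10)          # log compression; avoid log(0)

def identify(scores, languages):
    """One-vs-all decision (hypothetical helper): each language has its own
    binary network, and the clip is assigned to the highest-scoring one."""
    return languages[int(np.argmax(scores))]

# One second of synthetic audio in place of a real clip.
sr = 16000
t = np.arange(sr) / sr
clip = np.sin(2 * np.pi * 440 * t)
spec = log_spectrogram(clip, sr)
print(spec.shape)                              # (time frames, frequency bins)
print(identify([0.2, 0.7, 0.1], ["English", "Spanish", "French"]))
```

MFCC vectors would be obtained from the same frames by applying a Mel filterbank and a discrete cosine transform, so either representation plugs into the training stage unchanged.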

 

Dataset

 

Most audio files will be obtained from VoxForge.org. Three languages were selected to allow comparison both within and across language families: Spanish and French from the Romance family, and English from the Germanic family. Each language will have 600 training samples, each a few seconds long, in standard .wav format.
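Loading the .wav clips needs nothing beyond the Python standard library. The sketch below is illustrative: the `read_wav` helper is hypothetical, it assumes mono 16-bit PCM files (the common VoxForge format), and the in-memory silent clip merely stands in for a real file on disk.

```python
import io
import struct
import wave

def read_wav(fileobj):
    """Read a mono 16-bit PCM .wav into a tuple (samples, sample_rate)."""
    with wave.open(fileobj, "rb") as wav:
        rate = wav.getframerate()
        n = wav.getnframes()
        raw = wav.readframes(n)
        samples = struct.unpack("<%dh" % n, raw)  # little-endian 16-bit ints
    return samples, rate

# Build a one-second silent 16 kHz clip in memory to demonstrate the reader;
# real training clips would be the VoxForge .wav files on disk.
buf = io.BytesIO()
with wave.open(buf, "wb") as wav:
    wav.setnchannels(1)
    wav.setsampwidth(2)        # 2 bytes per sample = 16-bit PCM
    wav.setframerate(16000)
    wav.writeframes(b"\x00\x00" * 16000)
buf.seek(0)

samples, rate = read_wav(buf)
print(len(samples) / rate)     # clip duration in seconds
```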

 

Milestone

 

By the milestone, we aim to have a working algorithm that distinguishes among the three languages, achieving an average accuracy of at least 65 percent.

 

References

 

Zissman, Marc A. "Comparison of four approaches to automatic language identification of telephone speech." IEEE Transactions on Speech and Audio Processing 4.1 (1996): 31-44.

 

Hosford, Alexander W. "Automatic Language Identification (LiD) Through Machine Learning." 2011.

 

Montavon, Gregoire. "Deep learning for spoken language identification." NIPS Workshop on Deep Learning for Speech Recognition and Related Applications. 2009.