Spoken Language Identification with Neural Networks
Jing Wei Pan, Chuanqi Sun
Problem
This project aims to solve the problem of spoken language identification. The
machine identifies the language spoken in an audio clip by choosing from a set
of languages that it has been trained to "understand".
Automated language identification, though rarely implemented, is an essential
preprocessing step for many applications in which the language must be known
before any further processing. Natural language processing interfaces such as
Siri and online multilingual voice-based translators such as Google Translate
both require manual selection of the input language. Automated language
identification can also assist human operators. For instance, CIA agents who
monitor telephone conversations need to identify the language spoken by
terrorists.
Method
The solution consists of three stages: preprocessing, training, and application.
During preprocessing, the raw audio signal is converted into a sequence of
either spectrogram frames or Mel-frequency cepstral coefficient (MFCC) feature
vectors. The spectrograms and feature vectors are handled similarly during the
training stage, and both approaches will be explored for comparison. During
training, the neural network takes these features as input and minimizes an
error function with backpropagation. Two schemes for multi-class classification
will be tested: one-vs-all, and a single network with multiple output neurons.
Finally, at application time the neural network will take an audio clip as input
and indicate on its output layer the most likely language. If the one-vs-all
method is adopted, the language will be identified by comparing the outputs of
multiple neural networks.
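As a rough illustration of the first and last stages, the following is a
minimal sketch rather than the project's actual implementation. The function
names, the frame length, the hop size, and the example scores are all
assumptions made for illustration. The naive DFT stands in for the FFT routine
a real pipeline would use, and the MFCC step and the network training itself
are not shown.

```python
import cmath
import math


def log_spectrogram(samples, frame_len=64, hop=32):
    """Split a signal into overlapping frames and return log-magnitude spectra.

    Uses a naive O(n^2) DFT so that the sketch needs only the standard
    library; frame_len and hop are illustrative values, not tuned settings.
    """
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        spectrum = []
        for k in range(frame_len // 2 + 1):  # non-negative frequencies only
            coeff = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                        for n in range(frame_len))
            spectrum.append(math.log(abs(coeff) + 1e-10))
        frames.append(spectrum)
    return frames


def one_vs_all_decision(scores):
    """Return the language whose binary network produced the highest score.

    `scores` maps a language name to the output of that language's network.
    """
    return max(scores, key=scores.get)


# Hypothetical outputs of three one-vs-all networks for a single clip.
print(one_vs_all_decision({"English": 0.2, "French": 0.7, "Spanish": 0.4}))
```

Each inner list returned by log_spectrogram is one time step of the
"progression" described above. With a single network and multiple output
neurons, the same arg-max would be taken over that network's output layer
instead of across separate networks.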
Dataset
Most audio files will be obtained from VoxForge.org. Three languages were
selected to allow comparison both within and across language families: Spanish
and French from the Romance family, and English from the Germanic family. Each
language will have 600 training samples, each a few seconds long, in standard
.wav format.
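Loading one such clip could look like the sketch below. It assumes 16-bit mono
PCM files, which is our assumption about the data and not something we have
verified for every VoxForge recording. The function name read_wav_mono16 is
hypothetical.

```python
import struct
import wave


def read_wav_mono16(path):
    """Read a 16-bit mono PCM .wav file.

    Returns (sample_rate, samples) with samples scaled to the range [-1, 1).
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("this sketch handles 16-bit mono PCM only")
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    count = len(raw) // 2  # two bytes per sample
    ints = struct.unpack("<%dh" % count, raw)  # little-endian signed 16-bit
    return rate, [value / 32768.0 for value in ints]
```

Files with other sample widths or more than one channel would need to be
converted first; the sketch raises an error instead of guessing.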
Milestone
By the milestone, we aim to have a working algorithm that distinguishes among
the three languages with an average accuracy of at least 65 percent.
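One way to compute this figure is sketched below, under the assumption that
"average accuracy" means the mean of the per-language accuracies, so that each
language counts equally. The function name and the short language codes are
illustrative only.

```python
def average_accuracy(true_labels, predicted_labels):
    """Mean of per-language accuracies, with each language weighted equally."""
    per_language = {}  # language -> (number correct, number of clips)
    for truth, guess in zip(true_labels, predicted_labels):
        correct, total = per_language.get(truth, (0, 0))
        per_language[truth] = (correct + (truth == guess), total + 1)
    return sum(c / t for c, t in per_language.values()) / len(per_language)


# Hypothetical labels for six test clips, two per language.
truth = ["en", "en", "fr", "fr", "es", "es"]
guess = ["en", "fr", "fr", "fr", "es", "en"]
print(average_accuracy(truth, guess) >= 0.65)
```

With 600 samples per language the classes are balanced, so this value
coincides with plain overall accuracy when the evaluation set is balanced too.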
References
Zissman, Marc A. "Comparison of four approaches to automatic
language identification of telephone speech." IEEE Transactions on Speech
and Audio Processing 4.1 (1996): 31-44.
Hosford, Alexander W. "Automatic Language Identification (LiD) Through Machine Learning." (2011).
Montavon, Gregoire. "Deep learning for spoken language identification." NIPS Workshop on Deep Learning for Speech Recognition and Related Applications. 2009.