ICASSP 2006 - May 15-19, 2006 - Toulouse, France

Technical Program

Paper Detail

Paper: SLP-P18.5
Session: LVCSR Systems
Time: Friday, May 19, 10:00 - 12:00
Presentation: Poster
Topic: Speech and Spoken Language Processing: Miscellaneous Topics
Title: Arabic Broadcast News Transcription using a One Million Word Vocalized Vocabulary
Authors: Abdel Messaoudi, Jean-Luc Gauvain, Lori Lamel; LIMSI-CNRS, France
Abstract: Recently it has been shown that modeling short vowels in Arabic can significantly improve performance even when producing a non-vocalized transcript. Since Arabic texts and audio transcripts are almost exclusively non-vowelized, the training methods have to overcome this missing data problem. For the acoustic models, the procedure was bootstrapped with manually vowelized data and extended with semi-automatically vowelized data. In order to also capture the vowel information in the language model, a vocalized 4-gram language model trained on the audio transcripts was interpolated with the original 4-gram model trained on the (non-vowelized) written texts. Another challenge of the Arabic language is its large lexical variety. The out-of-vocabulary rate with a 65k word vocabulary is in the range of 4-8% (compared to under 1% for English). To address this problem, a vocalized vocabulary containing over 1 million vocalized words, grouped into 200k word classes, is used. This reduces the out-of-vocabulary rate to about 2%. The extended vocabulary and vocalized language model trained on the manually annotated data give a 1.2% absolute word error reduction on the DARPA RT04 development data. However, including the automatically vowelized transcripts in the language model reduces performance, indicating that automatic vocalization needs to be improved.
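
The abstract refers to two quantities: the out-of-vocabulary (OOV) rate and an interpolation of two language models. The sketch below shows one common way to compute each. It is a generic illustration, not the authors' implementation: the function names, the toy tokens, and the interpolation weight are all hypothetical. The abstract also does not say how vocalized and non-vocalized word forms are mapped onto each other before interpolation, so the sketch assumes both models already assign a probability to the same event.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of running tokens that are not in the vocabulary."""
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


def interpolate(p_vocalized, p_written, weight):
    """Linearly interpolate two probabilities assigned to the same event.

    `weight` is the share given to the first model; it is a hypothetical
    parameter, since the abstract does not report the value that was used.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_vocalized + (1.0 - weight) * p_written


if __name__ == "__main__":
    # Toy data: placeholder tokens, not taken from any real corpus.
    toy_vocabulary = {"w1", "w2", "w3"}
    toy_tokens = ["w1", "w2", "w3", "w9"]
    print(oov_rate(toy_tokens, toy_vocabulary))  # 1 of 4 tokens is unknown: 0.25

    # Equal weighting of two made-up model probabilities.
    print(interpolate(0.5, 0.25, 0.5))  # 0.375
```

In practice the interpolation weight is usually tuned on held-out data, for example by minimizing perplexity on a development set.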



IEEE Signal Processing Society
