| Paper: | SLP-P18.5 |
| Session: | LVCSR Systems |
| Time: | Friday, May 19, 10:00 - 12:00 |
| Presentation: | Poster |
| Topic: | Speech and Spoken Language Processing: Miscellaneous Topics |
| Title: | Arabic Broadcast News Transcription using a One Million Word Vocalized Vocabulary |
| Authors: | Abdel. Messaoudi, Jean-Luc Gauvain, Lori Lamel, LIMSI-CNRS, France |
| Abstract: | Recently it has been shown that modeling short vowels in Arabic can significantly improve performance even when producing a non-vocalized transcript. Since Arabic texts and audio transcripts are almost exclusively non-vowelized, the training methods have to overcome this missing-data problem. For the acoustic models, the procedure was bootstrapped with manually vowelized data and extended with semi-automatically vowelized data. In order to also capture the vowel information in the language model, a vocalized 4-gram language model trained on the audio transcripts was interpolated with the original 4-gram model trained on the (non-vowelized) written texts. Another challenge of the Arabic language is its large lexical variety. The out-of-vocabulary rate with a 65k-word vocabulary is in the range of 4-8% (compared to under 1% for English). To address this problem, a vocalized vocabulary containing over 1 million vocalized words, grouped into 200k word classes, is used. This reduces the out-of-vocabulary rate to about 2%. The extended vocabulary and vocalized language model trained on the manually annotated data give a 1.2% absolute word error reduction on the DARPA RT04 development data. However, including the automatically vowelized transcripts in the language model reduces performance, indicating that automatic vocalization needs to be improved. |
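The language-model combination described in the abstract, a vocalized 4-gram trained on audio transcripts interpolated with the original 4-gram trained on written texts, amounts to a linear mixture of the two components' probabilities. The sketch below illustrates that mixture only; the function names, the dictionary-based model representation, and the interpolation weight are illustrative assumptions, not the authors' implementation, which would use proper back-off smoothing and a weight tuned on held-out data.

```python
from typing import Dict, Tuple

NGram = Tuple[str, ...]   # e.g. a 4-gram (w1, w2, w3, w4)
LM = Dict[NGram, float]   # maps an n-gram to its conditional probability

def interpolate(p_vocalized: float, p_written: float, lam: float = 0.5) -> float:
    """Linear interpolation of two component LM probabilities.

    p_vocalized: probability from the vocalized 4-gram LM (audio transcripts)
    p_written:   probability from the non-vowelized written-text 4-gram LM
    lam:         interpolation weight (assumed value; normally tuned on held-out data)
    """
    return lam * p_vocalized + (1.0 - lam) * p_written

def interpolated_prob(ngram: NGram, lm_voc: LM, lm_txt: LM, lam: float = 0.5) -> float:
    # Fall back to a small floor when an n-gram is unseen in a component model;
    # a real system would instead back off to lower-order n-grams.
    floor = 1e-10
    return interpolate(lm_voc.get(ngram, floor), lm_txt.get(ngram, floor), lam)
```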