Technical Program

Paper Detail

Paper:	SLP-P17.7
Session:	Spoken Language Modeling, Identification and Characterization
Time:	Thursday, May 18, 16:30 - 18:30
Presentation:	Poster
Topic:	Speech and Spoken Language Processing: Language modeling and Adaptation
Title:	Morpheme-Based Language Modeling for Arabic LVCSR
Authors:	Ghinwa Choueiter, Massachusetts Institute of Technology, United States; Daniel Povey, Stanley Chen, Geoffrey Zweig, IBM T. J. Watson Research Center, United States
Abstract:	This paper addresses Arabic speech recognition. Taking advantage of the language's rich morphological structure, we use morpheme-based language modeling to reduce the word error rate. We propose a simple constraining method that removes illegal morpheme sequences from the decoding output. We report results for word and morpheme language models (LMs) using medium (<64kw) and large (~800kw) vocabularies; the morpheme LM obtains an absolute improvement of 2.4% for the former and only 0.2% for the latter. The 2.4% gain surpasses previously reported gains for morpheme-based LMs for Arabic, and the large-vocabulary runs represent the first comparative results for vocabularies of this size for any language. Finally, we analyze the performance of the morpheme LM on word OOVs.
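
Illustrative note: the abstract does not say how illegal morpheme sequences are defined or removed, so the following is only a minimal sketch of what such a legality check might look like, not the authors' method. It assumes a hypothetical marking convention in which prefixes end with "+" and suffixes begin with "+", and it assumes a legal word has the form prefix* stem suffix*. The function names and the transliterated morphemes are invented for illustration.

```python
from typing import List


def kind(morpheme: str) -> str:
    """Classify a morpheme using the assumed '+' affix markers."""
    if len(morpheme) > 1 and morpheme.endswith("+"):
        return "prefix"
    if len(morpheme) > 1 and morpheme.startswith("+"):
        return "suffix"
    return "stem"


def is_legal(morphemes: List[str]) -> bool:
    """Return True if the sequence splits into words of the form prefix* stem suffix*."""
    state = "start"  # "start": no open word; "prefix": waiting for a stem; "stem": stem seen
    for m in morphemes:
        k = kind(m)
        if k == "suffix":
            # A suffix may only attach to a word that already has its stem.
            if state != "stem":
                return False
        elif k == "prefix":
            # A prefix opens a new word or extends a run of prefixes.
            state = "prefix"
        else:
            # A stem completes the pending prefixes or starts a new word.
            state = "stem"
    # A trailing prefix with no stem is illegal.
    return state != "prefix"


def join_words(morphemes: List[str]) -> List[str]:
    """Rejoin a legal morpheme sequence into words, dropping the '+' markers."""
    words: List[str] = []
    current: List[str] = []
    has_stem = False
    for m in morphemes:
        k = kind(m)
        if k in ("prefix", "stem") and has_stem:
            # The previous word is complete, so flush it before starting a new one.
            words.append("".join(current))
            current, has_stem = [], False
        if k == "prefix":
            current.append(m[:-1])
        elif k == "suffix":
            current.append(m[1:])
        else:
            current.append(m)
            has_stem = True
    if current:
        words.append("".join(current))
    return words
```

Under these assumptions, a decoder hypothesis such as ["+h", "ktb"] would be rejected because the suffix has no preceding stem, while ["w+", "ktb", "+h"] would be accepted and rejoined into a single word for word error rate scoring.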