Technical Program

Paper Detail

Paper:	SLP-P17.3
Session:	Spoken Language Modeling, Identification and Characterization
Time:	Thursday, May 18, 16:30 - 18:30
Presentation:	Poster
Topic:	Speech and Spoken Language Processing: Language modeling and Adaptation
Title:	UNSUPERVISED ADAPTATION OF A STOCHASTIC LANGUAGE MODEL USING A JAPANESE RAW CORPUS
Authors:	Gakuto Kurata, Shinsuke Mori, Masafumi Nishimura, IBM Japan, Ltd., Japan
Abstract:	Large Vocabulary Continuous Speech Recognition (LVCSR) systems are being applied to a growing range of domains. Building a good LVCSR system specialized for a target domain takes a long time because experts must manually segment the target-domain corpus into words, which is labor-intensive. In this paper, we propose a new method for adapting an LVCSR system to a new domain. The method stochastically segments a Japanese raw corpus of the target domain and then builds a domain-specific Language Model (LM) from the segmented corpus. All of the domain-specific words can be added to the LVCSR lexicon. Most importantly, the proposed method is fully automatic, so it drastically reduces the time needed to introduce an LVCSR system. In addition, the proposed method yielded performance comparable or even superior to that obtained with expensive manual segmentation.
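
Note:	The abstract does not say how the stochastic segmentation is turned into LM statistics. One common way to picture it is to attach a word-boundary probability to every gap between adjacent characters and to accumulate expected word counts over all possible segmentations. The following is a minimal sketch under that assumption, not the authors' implementation; the function name `expected_word_counts`, the `max_len` cap, and the boundary probabilities in the usage example are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def expected_word_counts(text, boundary_probs, max_len=4):
    """Expected unigram counts of candidate words in an unsegmented sentence.

    boundary_probs[i] is the (assumed) probability that a word boundary
    lies between text[i] and text[i + 1]. The sentence start and end are
    treated as certain boundaries.
    """
    n = len(text)
    if len(boundary_probs) != n - 1:
        raise ValueError("need exactly len(text) - 1 boundary probabilities")

    # p[i] is the probability of a boundary immediately before text[i];
    # p[n] is the boundary after the last character.
    p = [1.0] + list(boundary_probs) + [1.0]

    counts = defaultdict(float)
    for start in range(n):
        inside = 1.0  # probability that no boundary falls inside the candidate
        for end in range(start + 1, min(start + max_len, n) + 1):
            if end - start > 1:
                # Extending the candidate by one character adds one interior gap.
                inside *= 1.0 - p[end - 1]
            # Boundary on the left, none inside, boundary on the right.
            weight = p[start] * inside * p[end]
            if weight > 0.0:
                counts[text[start:end]] += weight
    return dict(counts)


if __name__ == "__main__":
    # Hypothetical probabilities: a boundary is likely between 音声 and 認識.
    for word, count in sorted(expected_word_counts("音声認識", [0.1, 0.9, 0.1]).items()):
        print(word, round(count, 4))
```

Fractional counts of this kind could then stand in for ordinary word frequencies when estimating a domain-specific n-gram LM, and any candidate string with a sufficiently large expected count becomes a candidate lexicon entry. The threshold for "sufficiently large" is a design choice that the abstract does not specify.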