Paper: | SLP-P17.10 |
Session: | Spoken Language Modeling, Identification and Characterization |
Time: | Thursday, May 18, 16:30 - 18:30 |
Presentation: |
Poster |
Topic: |
Speech and Spoken Language Processing: Language modeling and Adaptation |
Title: |
BOOTSTRAPPING LANGUAGE MODELS FOR SPOKEN DIALOG SYSTEMS FROM THE WORLD WIDE WEB |
Authors: |
Dilek Hakkani-Tür, Mazin Gilbert, AT&T Labs – Research, United States |
Abstract: |
In this paper, we describe our approach for bootstrapping statistical language models for spoken dialog systems using in-domain web data and utterances collected from previous applications. The approach stitches conversational templates together with the predicates and arguments extracted from web pages using semantic role labeling, to generate conversational-style utterances. The conversational templates represent the task-independent portions of user utterances and can be built by hand or learned from utterances collected from applications in other domains. Experiments show that stitching with both types of conversational templates results in significantly better ASR word accuracy. Furthermore, the new language model bootstrapping approach can be combined with unsupervised and active learning to improve word accuracy even with very little in-domain transcribed data. |