| Paper: | SLP-P17.11 |
| Session: | Spoken Language Modeling, Identification and Characterization |
| Time: | Thursday, May 18, 16:30 - 18:30 |
| Presentation: | Poster |
| Topic: | Speech and Spoken Language Processing: Language modeling and Adaptation |
| Title: | Strategies for Language Model Web-data Collection |
| Authors: | Vincent Wan, Thomas Hain, University of Sheffield, United Kingdom |
| Abstract: | This paper analyses the use of text collected from the internet via a search engine for building domain-specific language models. A framework is developed to analyse how search query formulation affects the performance of the resulting web-data language model in an evaluation. The framework gives rise to improved methods of selecting n-gram search engine queries, which return documents that make better domain-specific language models. |