Paper: | SLP-P17.11 |
Session: | Spoken Language Modeling, Identification and Characterization |
Time: | Thursday, May 18, 16:30 - 18:30 |
Presentation: | Poster |
Topic: |
Speech and Spoken Language Processing: Language modeling and Adaptation |
Title: |
Strategies for Language Model Web-data Collection |
Authors: |
Vincent Wan, Thomas Hain, University of Sheffield, United Kingdom |
Abstract: |
This paper presents an analysis of the use of textual information collected from the internet via a search engine for the purpose of building domain-specific language models. A framework is developed to analyse how search query formulation affects the performance of the resulting web-data language model in an evaluation. The framework gives rise to improved methods of selecting n-gram search engine queries, which return documents that make better domain-specific language models. |