Paper: | SLP-P15.9 |
Session: | Spoken Document Search, Navigation and Summarization |
Time: | Thursday, May 18, 14:00 - 16:00 |
Presentation: | Poster |
Topic: |
Speech and Spoken Language Processing: Speech data mining and document retrieval |
Title: |
An Extremely Large Vocabulary Approach to Named Entity Extraction from Speech |
Authors: |
Takaaki Hori, Atsushi Nakamura, NTT Corporation, Japan |
Abstract: |
This paper describes an approach to Named Entity (NE) extraction from speech data in which Automatic Speech Recognition (ASR) uses an extremely large vocabulary lexicon that includes all NEs occurring in a large text corpus. As a result, NEs appear directly in the recognition results. Our approach is implemented in three steps: (1) run an NE tagger over the whole text corpus to produce an NE-tagged corpus in which each NE is annotated with its category, (2) construct a lexicon and a language model for ASR from the tagged corpus, treating each NE as a regular word, and (3) run the speech recognizer in one pass. Although a very large vocabulary is necessary to ensure high coverage of NEs, this is no longer a serious obstacle, since we recently achieved real-time extremely large vocabulary ASR using weighted finite-state transducers (WFSTs). In experiments on NE extraction from spoken queries for an open-domain question-answering system, our approach yielded higher F-measure values than a conventional approach. |
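
The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the inline tag format (`<CATEGORY>text</CATEGORY>`), the `surface/CATEGORY` token convention, the category names, and all function names are assumptions made for illustration, and a plain bigram count table stands in for the real language model and WFST decoder. The sketch shows how each tagged NE can be collapsed into a single lexicon entry, how NEs can then be read straight off a recognized token sequence, and how an F-measure over the extracted NEs could be computed.

```python
import re
from collections import Counter

# Assumed category inventory; the listing does not name the categories used.
CATEGORIES = {"PERSON", "LOCATION", "ORGANIZATION"}

# Assumed inline tag format produced by the NE tagger in step (1).
NE_PATTERN = re.compile(r"<(?P<cat>[A-Z]+)>(?P<text>.+?)</(?P=cat)>")


def ne_to_token(match):
    """Collapse one tagged NE into a single token of the form surface/CATEGORY."""
    surface = match.group("text").strip().replace(" ", "_")
    return f"{surface}/{match.group('cat')}"


def tokenize_tagged(line):
    """Turn a tagged sentence into tokens where each NE is one regular word."""
    return NE_PATTERN.sub(ne_to_token, line).split()


def build_lexicon_and_bigrams(tagged_lines):
    """Step (2): collect the vocabulary and bigram counts from the tagged corpus."""
    unigrams, bigrams = Counter(), Counter()
    for line in tagged_lines:
        tokens = ["<s>"] + tokenize_tagged(line) + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams


def extract_nes(recognized_tokens):
    """After step (3): read NEs directly from the recognizer's output tokens."""
    nes = []
    for token in recognized_tokens:
        surface, sep, cat = token.rpartition("/")
        if sep and cat in CATEGORIES:
            nes.append((surface.replace("_", " "), cat))
    return nes


def f_measure(reference, hypothesis):
    """Harmonic mean of precision and recall over (surface, category) pairs."""
    ref, hyp = Counter(reference), Counter(hypothesis)
    correct = sum((ref & hyp).values())
    if correct == 0:
        return 0.0
    precision = correct / sum(hyp.values())
    recall = correct / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `tokenize_tagged("who founded <ORGANIZATION>NTT Corporation</ORGANIZATION> in <LOCATION>Japan</LOCATION>")` yields five tokens, two of which are NE entries carrying their category. Because every distinct NE becomes its own lexicon entry, the vocabulary grows with the number of NEs in the corpus, which is why the approach depends on a recognizer that can handle an extremely large vocabulary.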