Augmenting automatic speech recognition and search models for spoken content retrieval

Moriya, Yasufumi (2022) Augmenting automatic speech recognition and search models for spoken content retrieval. PhD thesis, Dublin City University.

Abstract

Spoken content retrieval (SCR) is the process of providing a user with spoken documents that the user is potentially interested in. Unlike text, speech cannot be searched directly because of the way it is represented. Spoken content such as user-generated videos and podcast episodes is therefore generally transcribed with automatic speech recognition (ASR) before search operations are performed. Despite recent improvements in ASR, automatic transcripts can still contain transcription errors. This is particularly the case when ASR is applied to out-of-domain data or to speech with background noise. This thesis explores the improvement of ASR systems and search models for enhanced SCR on user-generated spoken content.

Three topics are explored in this thesis. Firstly, the use of multimodal signals for ASR is investigated, with the aim of integrating the background context of spoken content into ASR. Integrating visual signals and document metadata into ASR is hypothesised to produce transcripts that are better aligned with the background context of the speech. Secondly, semi-supervised training and content genre information from metadata are exploited for ASR, with the aim of mitigating transcription errors caused by the recognition of out-of-domain speech. Thirdly, neural ranking models and their extension to N-best ASR transcripts are investigated. N-best transcripts are used in search models instead of the 1-best transcript because "key terms" missed in the 1-best transcript can be present in the N-best transcripts.

A series of experiments is conducted to examine these approaches to improving ASR systems and search models. The findings suggest that semi-supervised training brings practical improvements to ASR systems for SCR, and that neural ranking models, in particular when used with N-best transcripts, improve known-item search results over the baseline BM25 model.
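
The abstract motivates N-best indexing with a BM25 baseline. The sketch below is illustrative only and is not the thesis's implementation. It assumes each spoken document is indexed as the plain concatenation of its N-best ASR hypotheses. The toy transcripts, the helper names `bm25_scores` and `nbest_document`, and the parameter values k1=1.2 and b=0.75 are all hypothetical. It shows how a query term missing from every 1-best transcript can still match once the lower-ranked hypotheses are indexed.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each tokenised document against a tokenised query with Okapi BM25."""
    n_docs = len(docs)
    avgdl = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue  # term absent from this document contributes nothing
            # Non-negative IDF variant (the form used by Lucene's BM25).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def nbest_document(hypotheses):
    """Hypothetical representation: concatenate hypotheses into one token list."""
    return [tok for hyp in hypotheses for tok in hyp.lower().split()]


# Hypothetical N-best lists (best hypothesis first) for two spoken documents.
corpus = {
    "episode_a": ["we talked about the new sink",
                  "we talked about the new synth",
                  "we talk about the new sync"],
    "episode_b": ["cooking pasta at home tonight",
                  "cooking pasta at home to night"],
}

one_best = [nbest_document(h[:1]) for h in corpus.values()]
n_best = [nbest_document(h) for h in corpus.values()]
query = ["synth"]

print(bm25_scores(query, one_best))  # [0.0, 0.0]: key term absent from every 1-best
print(bm25_scores(query, n_best))    # episode_a now receives a positive score
```

Plain concatenation is the simplest choice. It also repeats terms shared by many hypotheses, which inflates their term frequency. A hypothesis-weighted representation (for example, weighting each term by its hypothesis rank or posterior) is a natural alternative, but that choice is an assumption here and is not taken from the thesis.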

Metadata

Item Type:	Thesis (PhD)
Date of Award:	November 2022
Refereed:	No
Supervisor(s):	Jones, Gareth
Uncontrolled Keywords:	spoken content retrieval (SCR); multimodal automatic speech recognition; automatic speech recognition for spoken content retrieval; neural ranking for spoken content retrieval
Subjects:	Computer Science > Information retrieval; Computer Science > Multimedia systems; Computer Science > Information storage and retrieval systems; Engineering > Signal processing
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing; Research Institutes and Centres > ADAPT
Funders:	Science Foundation Ireland
ID Code:	27672
Deposited On:	10 Nov 2022 14:29 by Gareth Jones. Last Modified 10 Nov 2022 14:29

Documents

Full text available as:

PDF (YasufumiMoriyaPhD_final.pdf, 4MB)
Creative Commons: Attribution-Noncommercial-No Derivative Works 4.0

DORAS | DCU Research Repository