Spectral discontinuity in concatenative speech synthesis – perception, join costs and feature transformations

Kirkpatrick, Barry (2010) Spectral discontinuity in concatenative speech synthesis – perception, join costs and feature transformations. PhD thesis, Dublin City University.

Abstract
Metadata
Downloads
Documents

[+][-]

Abstract

This thesis explores the problem of determining an objective measure to represent human perception of spectral discontinuity in concatenative speech synthesis. Such measures are used as join costs to quantify the compatibility of speech units for concatenation in unit selection synthesis. No previous study has reported a spectral measure that satisfactorily correlates with human perception of discontinuity. An analysis of the limitations of existing measures and our understanding of the human auditory system were used to guide the strategies adopted to advance a solution to this problem. A listening experiment was conducted using a database of concatenated speech with results indicating the perceived continuity of each concatenation. The results of this experiment were used to correlate proposed measures of spectral continuity with the perceptual results. A number of standard speech parametrisations and distance measures were tested as measures of spectral continuity and analysed to identify their limitations. Time-frequency resolution was found to limit the performance of standard speech parametrisations.As a solution to this problem, measures of continuity based on the wavelet transform were proposed and tested, as wavelets offer superior time-frequency resolution to standard spectral measures. A further limitation of standard speech parametrisations is that they are typically computed from the magnitude spectrum. However, the auditory system combines information relating to the magnitude spectrum, phase spectrum and spectral dynamics. The potential of phase and spectral dynamics as measures of spectral continuity were investigated. One widely adopted approach to detecting discontinuities is to compute the Euclidean distance between feature vectors about the join in concatenated speech. The detection of an auditory event, such as the detection of a discontinuity, involves processing high up the auditory pathway in the central auditory system. The basic Euclidean distance cannot model such behaviour. A study was conducted to investigate feature transformations with sufficient processing complexity to mimic high level auditory processing. Neural networks and principal component analysis were investigated as feature transformations. Wavelet based measures were found to outperform all measures of continuity based on standard speech parametrisations. Phase and spectral dynamics based measures were found to correlate with human perception of discontinuity in the test database, although neither measure was found to contribute a significant increase in performance when combined with standard measures of continuity. Neural network feature transformations were found to significantly outperform all other measures tested in this study, producing correlations with perceptual results in excess of 90%.

Metadata

Item Type:	Thesis (PhD)
Date of Award:	22 February 2010
Refereed:	No
Supervisor(s):	O'Brien, Darragh and Scaife, Ronan
Subjects:	Computer Science > Machine learning Engineering > Signal processing
DCU Faculties and Centres:	DCU Faculties and Schools > Faculty of Engineering and Computing > School of Computing
Use License:	This item is licensed under a Creative Commons Attribution-NonCommercial-No Derivative Works 3.0 License. View License
Funders:	Science Foundation Ireland
ID Code:	15753
Deposited On:	04 Apr 2011 15:24 by Darragh O'brien . Last Modified 19 Jul 2018 14:51

Documents

Full text available as:

[thumbnail of HardboundThesis_9Octo2010.pdf]

Preview

PDF - Requires a PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader
1MB

Downloads

Downloads per month over past year

Archive Staff Only: edit this record

Altmetric