The application of manifold based visual speech units for visual speech recognition
Yu, Dahai (2008) The application of manifold based visual speech units for visual speech recognition. PhD thesis, Dublin City University.
Full text available as:
This dissertation presents a new learning-based representation that is referred to as a Visual
Speech Unit for visual speech recognition (VSR). The automated recognition of human speech using only features from the visual domain has become a significant research topic that plays an essential role in the development of many multimedia systems such as audio visual speech recognition(AVSR), mobile phone applications, human-computer interaction (HCI) and sign language recognition. The inclusion of the lip visual information is opportune since it can improve the overall accuracy of audio or hand recognition algorithms especially when such systems are operated in environments characterized by a high level of acoustic noise.
The main contribution of the work presented in this thesis is located in the development of a new learning-based representation that is referred to as Visual Speech
Unit for Visual Speech Recognition (VSR). The main components of the developed Visual Speech Recognition system are applied to: (a) segment the mouth region of
interest, (b) extract the visual features from the real time input video image and (c) to identify the visual speech units. The major difficulty associated with the VSR systems resides in the identification of the smallest elements contained in the image sequences that represent the lip movements in the visual domain.
The Visual Speech Unit concept as proposed represents an extension of the standard viseme model that is currently applied for VSR. The VSU model augments the standard viseme approach by including in this new representation not only the data associated with the articulation of the visemes but also the transitory information between consecutive
visemes. A large section of this thesis has been dedicated to analysis the performance of the new visual speech unit model when compared with that attained for standard (MPEG-
4) viseme models. Two experimental results indicate that:
1. The developed VSR system achieved 80-90% correct recognition when the system has been applied to the identification of 60 classes of VSUs, while the
recognition rate for the standard set of MPEG-4 visemes was only 62-72%.
2. 15 words are identified when VSU and viseme are employed as the visual speech element. The accuracy rate for word recognition based on VSUs is 7%-12% higher than the accuracy rate based on visemes.
Archive Staff Only: edit this record