Approaching Multilingual Emotion Recognition from Speech - On Language Dependency of Acoustic/Prosodic Features for Anger Detection


Tim Polzehl, Deutsche Telekom Laboratories / Quality and Usability Lab, Technische Universität Berlin
Alexander Schmitt, Dialogue Systems Group, Institute of Information Technology, University of Ulm
Florian Metze, Language Technologies Institute, Carnegie Mellon University, Pittsburgh

This paper reports on the mono- and multi-lingual performance of different acoustic and prosodic features for automatic emotion recognition. We analyze different methods to obtain optimal feature sets, i.e., sets that can handle multi-lingual speech input. When an emotion recognition system is confronted with a language it has not been trained on, we normally observe severe system degradation. Analyzing this loss, we report on strategies for combining the feature spaces, both with and without combining and retraining the mono-lingual systems. We estimate the relative importance of our features on an American English and a German database. Both databases contain speech from real-life users calling into interactive voice response (IVR) platforms. We compare feature distributions, classification scores, and feature ranks in terms of information gain ratio. After final system integration, we obtain a single unified bi-lingual anger recognition system that performs as well as two separate mono-lingual systems on the test data.
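
To make the feature-ranking criterion concrete, the following Python sketch computes a gain-ratio score for a continuous feature against class labels. This is not the paper's implementation; the equal-width binning, bin count, and all function names are illustrative assumptions introduced here, chosen only to show how information gain can be normalized by the entropy of the discretized feature.

```python
import numpy as np

def entropy(values):
    """Shannon entropy H(X) of a discrete array, in bits."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain_ratio(feature, labels, n_bins=10):
    """Gain ratio of one continuous feature w.r.t. class labels.

    The feature is discretized into equal-width bins (an assumption;
    the paper does not specify the binning). The information gain
    H(Y) - H(Y|X) is divided by the entropy of the binned feature
    itself, i.e. its split information.
    """
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    binned = np.digitize(feature, edges)

    # Conditional entropy H(labels | binned feature)
    h_cond = 0.0
    for b in np.unique(binned):
        mask = binned == b
        h_cond += mask.mean() * entropy(labels[mask])

    gain = entropy(labels) - h_cond
    split_info = entropy(binned)
    return gain / split_info if split_info > 0 else 0.0

def rank_features(X, y):
    """Rank feature columns of X by gain ratio, best first."""
    scores = [information_gain_ratio(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[::-1], scores
```

Comparing such rankings across the English and German corpora is one way to judge which acoustic/prosodic features keep their discriminative value across languages.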