Phone Duration Modeling of Affective Speech Using Support Vector Regression


Alexandros Lazaridis 1,2,*, Iosif Mporas 1,3, Todor Ganchev 1

1. Artificial Intelligence Group, Wire Communications Laboratory, Dept. of Electrical and Computer Engineering, University of Patras, Rion-Patras 26500, Greece

2. Dept. of Engineering Informatics & Telecommunications, University of Western Macedonia, Kozani, 50100, Greece

3. Dept. of Informatics and Mass Media, Technological Educational Institute of Patras, Greece

* Corresponding author.


Received: 16 Sep. 2011 / Revised: 25 Jan. 2012 / Accepted: 13 Apr. 2012 / Published: 8 Jul. 2012

Index Terms

Phone Duration Modeling, Statistical Modeling, Support Vector Regression, Emotional Speech, Text-to-speech Synthesis


In speech synthesis, accurate modeling of prosody is important for producing high-quality synthetic speech. One of the main aspects of prosody is phone duration, and robust phone duration modeling is a prerequisite for synthesizing natural-sounding emotional speech. In this work, ten phone duration models are evaluated. These models belong to well-known and widely used categories of algorithms: decision trees, linear regression, lazy-learning algorithms, and meta-learning algorithms. Furthermore, we investigate the effectiveness of Support Vector Regression (SVR) for phone duration modeling in the context of emotional speech. The eleven models are evaluated on a Modern Greek emotional speech database consisting of four categories of emotional speech (anger, fear, joy, sadness) plus neutral speech. The experimental results demonstrate that the SVR-based model outperforms the other ten models across all four emotion categories. Specifically, the SVR model achieved an average relative reduction of 8% in root mean square error (RMSE) across all emotional categories.
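
The RMSE figure quoted above can be made concrete with a minimal sketch of how RMSE and a relative RMSE reduction are typically computed. The duration values, the variable names, and the choice of comparison model below are illustrative assumptions, not data or definitions from the paper; the abstract does not state which model the 8% reduction is measured against.

```python
import math


def rmse(predicted, actual):
    """Root mean square error between predicted and reference durations."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_reduction(baseline_rmse, model_rmse):
    """Fractional RMSE reduction of a model relative to a comparison model."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Hypothetical phone durations in milliseconds (illustrative only).
reference = [80.0, 120.0, 60.0, 100.0]
baseline_pred = [90.0, 110.0, 70.0, 90.0]  # each prediction is off by 10 ms
svr_pred = [88.0, 112.0, 68.0, 92.0]       # each prediction is off by 8 ms

baseline_rmse = rmse(baseline_pred, reference)
svr_rmse = rmse(svr_pred, reference)
reduction = relative_reduction(baseline_rmse, svr_rmse)

print(f"comparison RMSE = {baseline_rmse:.1f} ms, SVR RMSE = {svr_rmse:.1f} ms")
print(f"relative reduction = {reduction:.1%}")
```

An average over emotional categories, as reported in the abstract, would presumably repeat this computation per category and take the mean of the resulting reductions; that averaging step is an assumption here.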

Cite This Paper

Alexandros Lazaridis, Iosif Mporas, Todor Ganchev, "Phone Duration Modeling of Affective Speech Using Support Vector Regression", International Journal of Intelligent Systems and Applications (IJISA), vol.4, no.8, pp.1-9, 2012. DOI:10.5815/ijisa.2012.08.01

