An Overview of Automatic Audio Segmentation



Theodoros Theodorou 1,*, Iosif Mporas 1,2, Nikos Fakotakis 1

1. Artificial Intelligence Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Patras 26500, Greece

2. Computer Engineering and Informatics Department, Technological Educational Institute of Western Greece, Antirion 30300, Greece

* Corresponding author.


Received: 19 Jan. 2014 / Revised: 21 May 2014 / Accepted: 12 Aug. 2014 / Published: 8 Oct. 2014

Index Terms

Audio Segmentation, Sound Classification, Machine Learning, Mathematical Functions, Hybrid Architecture of Unsupervised and Data-Driven Algorithms


In this report we present an overview of the approaches and techniques used for automatic audio segmentation. Audio segmentation aims to find change points in the audio content of an audio stream. We first present the basic steps of an automatic audio segmentation procedure. We then present the basic categories of segmentation algorithms, specifically the unsupervised, the data-driven and the mixed algorithms. For each category, the segmentation analysis is followed by details about the proposed architectural parameters, such as the audio descriptor set, the mathematical functions used in unsupervised algorithms and the machine learning algorithms used in data-driven modules. Finally, we review architectures proposed in the automatic audio segmentation literature, giving for each the experimental audio environment (the name of the database and the list of audio events of interest), the basic modules of the procedure (algorithm category, audio descriptor set, architectural parameters and any optional modules) and the maximum accuracy achieved.
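As an illustration of what "finding change points" with an unsupervised mathematical function can look like, the following is a minimal sketch, not the method of this paper. It assumes a Bayesian Information Criterion (BIC) style score computed over a sequence of frame-level audio descriptors, with each segment modelled as a single full-covariance Gaussian. The abstract does not name a specific function, so the choice of BIC, the function names `delta_bic` and `find_change_point`, and the parameters `penalty_weight` and `margin` are all assumptions made for the example.

```python
import numpy as np


def delta_bic(window, split, penalty_weight=1.0):
    """Delta-BIC for splitting `window` (frames x dims) at index `split`.

    Assumption: each segment is modelled by one full-covariance Gaussian.
    Positive values favour a change point at `split`.
    """
    n, d = window.shape
    left, right = window[:split], window[split:]

    def logdet_cov(x):
        # A small ridge keeps the covariance matrix well conditioned.
        cov = np.cov(x, rowvar=False).reshape(d, d) + 1e-6 * np.eye(d)
        return np.linalg.slogdet(cov)[1]

    # Likelihood gain of two Gaussians over one.
    gain = 0.5 * (n * logdet_cov(window)
                  - len(left) * logdet_cov(left)
                  - len(right) * logdet_cov(right))
    # Complexity penalty for the extra mean vector and covariance matrix.
    penalty = 0.5 * penalty_weight * (d + 0.5 * d * (d + 1)) * np.log(n)
    return gain - penalty


def find_change_point(features, margin=20):
    """Return (frame_index, score) of the best split, or None if no split
    has a positive Delta-BIC. `margin` (hypothetical parameter) keeps both
    sides long enough to estimate a covariance matrix."""
    n = len(features)
    if n <= 2 * margin:
        return None
    candidates = list(range(margin, n - margin))
    scores = [delta_bic(features, t) for t in candidates]
    best = int(np.argmax(scores))
    if scores[best] > 0:
        return candidates[best], float(scores[best])
    return None


# Synthetic descriptors standing in for real audio features:
# 200 frames from one distribution followed by 200 from another.
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0.0, 1.0, (200, 2)),
                      rng.normal(4.0, 1.0, (200, 2))])
print(find_change_point(features))
```

In a real system the synthetic array would be replaced by descriptors extracted from the audio stream, and the search would typically run over a sliding or growing window so that several change points can be found.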

Cite This Paper

Theodoros Theodorou, Iosif Mporas, Nikos Fakotakis, "An Overview of Automatic Audio Segmentation", International Journal of Information Technology and Computer Science (IJITCS), vol. 6, no. 11, pp. 1-9, 2014. DOI: 10.5815/ijitcs.2014.11.01

