Audio Segmentation using Supervised & Unsupervised Algorithms in Python - Part 2 | iNNovationMerge

What? Why?

Software Requirements

  • Python 3.6

Network Requirements

  • Internet access to download packages




Install pyAudioAnalysis

git clone
pip install -e .

Troubleshooting

  • Error: ImportError: failed to find libmagic. Check your installation
  • Solution: pip install python-magic-bin==0.4.14

Unsupervised - Silence removal

  • pyAudioAnalysis provides a semi-supervised silence-removal function: it takes an uninterrupted audio recording and returns the endpoints of the segments that correspond to individual audio events, discarding the silent regions in between.
  • This is achieved in the following steps:
    • An SVM model is trained to distinguish between high-energy and low-energy short-term frames (the 10% highest-energy and 10% lowest-energy frames of the recording serve as training examples).
    • The whole recording is then passed to the SVM classifier, which outputs a sequence of probabilities, one per short-term frame, expressing the confidence that the frame belongs to an audio event.
    • Dynamic thresholding on this (smoothed) probability sequence detects the active segments.
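```python
from pyAudioAnalysis import audioBasicIO as aIO
from pyAudioAnalysis import audioSegmentation as aS

# Read the recording: Fs is the sampling rate, x holds the signal samples
[Fs, x] = aIO.read_audio_file("data/recording1.wav")
# 20 ms short-term window and step, 1 s smoothing window, weight 0.3 (higher is stricter)
segments = aS.silence_removal(x, Fs, 0.020, 0.020, smooth_window=1.0, weight=0.3, plot=True)
print(segments)  # list of [start, end] pairs in seconds
```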
Figure: Audio Segmentation Silence Removal (Source: iNNovationMerge)
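To make the steps above concrete, here is a minimal NumPy-only sketch of the same pipeline. It is not pyAudioAnalysis's code: it uses short-term log-energy as the only feature and swaps the SVM for a one-dimensional logistic regression fitted by gradient descent, so it runs without scikit-learn. The function names `frame_energy` and `remove_silence` are hypothetical, and the threshold is assumed to be a weighted average of the mean lowest and mean highest 10% of the smoothed probabilities, with `weight` playing a role similar to the `weight` argument of `silence_removal`.

```python
import numpy as np

def frame_energy(x, fs, win, step):
    """Mean energy of each short-term frame (win and step in seconds)."""
    w, s = int(round(win * fs)), int(round(step * fs))
    n = 1 + (len(x) - w) // s
    return np.array([np.mean(x[i * s:i * s + w] ** 2) for i in range(n)])

def remove_silence(x, fs, win=0.02, step=0.02, smooth_window=0.5, weight=0.5):
    """Return [start, end] segments (seconds) of non-silent audio (hypothetical helper)."""
    feat = np.log10(frame_energy(x, fs, win, step) + 1e-12)  # log-energy per frame

    # Step 1: use the 10% lowest/highest-energy frames as training examples
    order = np.argsort(feat)
    k = max(1, len(feat) // 10)
    X = np.concatenate([feat[order[:k]], feat[order[-k:]]])
    y = np.concatenate([np.zeros(k), np.ones(k)])
    mu, sd = X.mean(), X.std() + 1e-12

    # Step 2: 1-D logistic regression (stand-in for the SVM), plain gradient descent
    a = b = 0.0
    xn = (X - mu) / sd
    for _ in range(500):
        p = 1.0 / (1.0 + np.exp(-(a * xn + b)))
        a -= 0.5 * np.mean((p - y) * xn)
        b -= 0.5 * np.mean(p - y)
    prob = 1.0 / (1.0 + np.exp(-(a * (feat - mu) / sd + b)))  # P(event) per frame

    # Step 3: smooth the probability sequence with a moving average
    n_smooth = min(len(prob), max(1, int(round(smooth_window / step))))
    prob = np.convolve(prob, np.ones(n_smooth) / n_smooth, mode="same")

    # Step 4: dynamic threshold between the lowest and highest 10% of probabilities
    srt = np.sort(prob)
    nt = max(1, len(srt) // 10)
    thr = (1 - weight) * srt[:nt].mean() + weight * srt[-nt:].mean()
    active = prob > thr

    # Step 5: merge runs of active frames into [start, end] segments in seconds
    segments, start = [], None
    for i, on in enumerate(active):
        if on and start is None:
            start = i
        elif not on and start is not None:
            segments.append([start * step, i * step])
            start = None
    if start is not None:
        segments.append([start * step, len(active) * step])
    return segments
```

Raising `weight` moves the threshold toward the high-probability frames, so fewer frames count as active; lowering it makes the detector more permissive.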

Unsupervised - Speaker Diarization

  • Speaker diarization is the process of automatically identifying “who spoke when” in an audio recording.

  • Diarization requires both segmentation and clustering: speech segments are grouped into speaker-specific clusters.

  • The main algorithmic steps of the diarization implementation are:

    • Feature extraction (short-term and mid-term)
    • FLsD (Fisher Linear Semi-Discriminant analysis) step (optional)
    • K-means clustering
    • Smoothing
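```python
from pyAudioAnalysis import audioSegmentation

num_speakers = 4  # number of speakers (clusters) expected in the recording
# plot_res=True plots the resulting speaker segmentation
audioSegmentation.speaker_diarization("data/diarizationExample.wav", num_speakers, plot_res=True)
```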
Figure: Audio Segmentation Speaker Diarization (Source: iNNovationMerge)
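The clustering and smoothing stages listed above can be illustrated with a small NumPy sketch. It assumes the mid-term feature vectors have already been extracted (one row per mid-term window), skips the optional FLsD projection, runs plain k-means (Lloyd's algorithm) and then applies a majority filter to the cluster labels. The functions `kmeans`, `majority_smooth` and `diarize` are hypothetical illustrations, and the majority filter is only one possible smoothing step; pyAudioAnalysis's own smoothing may differ.

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assign every mid-term vector to its nearest centroid
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each centroid to the mean of its members (skip empty clusters)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels

def majority_smooth(labels, width=5):
    """Replace each label by the most frequent label in a centered window."""
    half = width // 2
    out = labels.copy()
    for i in range(len(labels)):
        window = labels[max(0, i - half):i + half + 1]
        out[i] = np.bincount(window).argmax()
    return out

def diarize(mid_feats, n_speakers, smooth_width=5):
    """mid_feats: (n_windows, n_dims) mid-term features -> speaker label per window."""
    # Z-normalize each feature dimension so no single feature dominates distances
    X = (mid_feats - mid_feats.mean(axis=0)) / (mid_feats.std(axis=0) + 1e-12)
    labels = kmeans(X, n_speakers)
    return majority_smooth(labels, smooth_width)
```

The smoothing step matters because a single mid-term window can look like the other speaker (a cough, a short overlap); the majority filter absorbs such isolated labels into the surrounding speaker turn.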

Unsupervised - Audio thumbnailing

  • Audio thumbnailing refers to the unsupervised extraction of the most representative part of a music recording, such as the chorus.

  • In the pyAudioAnalysis library, this is implemented in the music_thumbnailing function of the audioSegmentation module (musicThumbnailing in older releases).

  • In the resulting self-similarity matrix, the automatically annotated diagonal segment marks the area where self-similarity is maximized; it defines the “most common segments” of the audio stream.

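```python
from pyAudioAnalysis import audioBasicIO, audioSegmentation as aS
thumbSize = 25  # desired thumbnail length in seconds
[Fs, x] = audioBasicIO.read_audio_file("data/song.wav")  # hypothetical path: use your own music file
A1, A2, B1, B2, S = aS.music_thumbnailing(x, Fs, 1.0, 0.5, thumbSize)  # thumbnails A1-A2, B1-B2 (s); S: similarity matrix
```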
Figure: Audio Segmentation Thumbnail (Source: iNNovationMerge)
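The self-similarity idea behind thumbnailing can be sketched in a few lines of NumPy. This assumes frame-level feature vectors are already available (one row per short-term window of `step` seconds) and uses cosine similarity between z-normalized frames; averaging the matrix along diagonals whose length matches `thumb_size` rewards pairs of segments that repeat frame by frame. The helper names `self_similarity` and `thumbnail_pair` are hypothetical, and the sketch omits any post-processing the library may apply.

```python
import numpy as np

def self_similarity(feats):
    """Cosine similarity between every pair of z-normalized frame vectors."""
    z = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-12)
    unit = z / (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    return unit @ unit.T

def thumbnail_pair(feats, step, thumb_size):
    """Return two (start, end) segments in seconds that repeat each other best."""
    S = self_similarity(feats)
    m = int(round(thumb_size / step))  # thumbnail length in frames
    L = S.shape[0] - m + 1
    # Average S along every length-m diagonal: F[i, j] = mean_k S[i + k, j + k]
    F = sum(S[k:k + L, k:k + L] for k in range(m)) / m
    # Discard pairs that overlap or are mirrored (keep j >= i + m only)
    i_idx, j_idx = np.indices(F.shape)
    F[j_idx - i_idx < m] = -np.inf
    i, j = np.unravel_index(np.argmax(F), F.shape)
    return (float(i * step), float((i + m) * step)), (float(j * step), float((j + m) * step))
```

The diagonal average is what the annotated segment in the plot corresponds to: a high value at (i, j) means the m frames starting at i closely match the m frames starting at j.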

Clone and Run the project