Automatically place markers in a song (workarounds?)

Discussion in 'Mixing and Mastering' started by MHEO, Jun 10, 2025 at 3:27 AM.

  1. MHEO

    MHEO Ultrasonic

    Joined:
    Apr 18, 2017
    Messages:
    52
    Likes Received:
    20
    Looking for software that automatically analyzes a song and places markers (or creates regions/segments) at the start and end of sung passages. It doesn't exist. Easiest workarounds?
     
  3. clone

    clone Audiosexual

    Joined:
    Feb 5, 2021
    Messages:
    8,623
    Likes Received:
    3,761
  4. MHEO

    MHEO Ultrasonic

    Joined:
    Apr 18, 2017
    Messages:
    52
    Likes Received:
    20
    Thanks, but it's not what I am looking for. I need software with a kind of "VAD" (Voice Activity Detection) for vocals that keeps the parts with singing voice and removes the rest. Not a voice isolator or a song structure analyzer. I know it does not exist, I just asked for the simplest workarounds. Of course I can already do it manually, adding markers and exporting regions. I needed something faster, like an AI app that does it for me. ChatGPT wasn't exactly my friend: I would have to be an expert programmer to code it in C++ and Python. I am just a musician. The process should be like the usual automatic "trim/crop" or "remove pauses and keep the rest", but aimed at "remove parts with no singing and keep the rest". Someone suggested playing with a gate and frequencies, but the result is a mess.
     
    Last edited: Jun 10, 2025 at 4:46 PM
  5. PulseWave

    PulseWave Platinum Record

    Joined:
    May 4, 2025
    Messages:
    533
    Likes Received:
    235
    You are correct that most existing Voice Activity Detection (VAD) tools are designed to detect voice in general, spoken or sung, rather than to isolate singing specifically and remove everything else. Most VAD systems, such as Silero VAD, Picovoice Cobra, or pyannote.audio, are optimized for speech detection and can quickly trim audio to keep only sections with human voice, but they do not distinguish singing from other kinds of speech, and they can be unreliable over background music.

    Workarounds and Practical Options
    1. Using VAD as a First Step

    • You can use a high-quality VAD tool (e.g., Silero VAD, Picovoice Cobra, or pyannote.audio) to automatically remove silence and non-voice sections from your audio. This will keep all segments with any human voice (including singing), but it will also keep spoken parts if present.

    • This approach is fast and does not require programming skills if you use available GUI tools or web-based solutions, but it won't distinguish between singing and speaking.
    2. Source Separation + VAD

    • Some musicians use source separation tools (like Spleeter or Demucs) to extract the vocal stem from a song. You could then run a VAD tool on the isolated vocal track to remove silent or non-vocal parts. This method still doesn’t distinguish between singing and talking, but it can help if your main concern is to get only the vocal presence, regardless of type.
    3. No-Code or Low-Code Tools

    • Some audio editors and plugins offer "auto-trim" or "remove silence" features, such as Audacity's "Truncate Silence" effect (which works on signal level, not on true voice detection) or, reportedly, a voice activity detector in iZotope RX. These are easy to use but, again, do not specifically target singing voice.
    4. Advanced (But Not Plug-and-Play) Solutions

    • Research in sung speech recognition (lyrics transcription) is ongoing and some deep learning models are being developed to distinguish sung speech from spoken speech. However, these are not yet available as user-friendly apps and typically require programming and machine learning expertise to deploy.
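
    Option 2 above (run activity detection on an already isolated vocal stem) can be sketched roughly as follows. This is a minimal sketch, not a trained VAD such as Silero: it uses plain RMS level thresholding, which is only plausible because the stem is assumed to contain little besides the voice. The function names, the 20 ms frame, the 0.02 threshold and the 0.3 s gap are all illustrative assumptions, and samples are assumed to be mono floats in the range -1.0 to 1.0.

```python
import math

def frame_rms(samples, frame_len):
    """Return the RMS level of each non-overlapping frame."""
    levels = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        levels.append(math.sqrt(sum(x * x for x in frame) / len(frame)))
    return levels

def active_segments(samples, sample_rate, frame_ms=20, threshold=0.02, min_gap_s=0.3):
    """Return (start_s, end_s) spans where the stem is at or above the threshold.

    Gaps shorter than min_gap_s are bridged so a short breath does not
    split one sung phrase into two segments.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    frame_s = frame_len / sample_rate
    segments = []
    for i, level in enumerate(frame_rms(samples, frame_len)):
        if level < threshold:
            continue  # quiet frame: not part of any segment
        start, end = i * frame_s, (i + 1) * frame_s
        if segments and start - segments[-1][1] < min_gap_s:
            segments[-1][1] = end          # extend the previous segment
        else:
            segments.append([start, end])  # open a new segment
    total_s = len(samples) / sample_rate
    return [(round(s, 3), round(min(e, total_s), 3)) for s, e in segments]
```

    Calling active_segments(stem_samples, 44100) on a decoded vocal stem would return a list of time spans in seconds. The threshold usually needs tuning per song, because separation tools leave some instrumental bleed in the stem.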
    Why Gates and Frequency Tricks Don’t Work Well
    • Using gates and frequency filtering is unreliable for this task because the frequency range of singing overlaps heavily with both spoken voice and some instruments, leading to a "messy" result as you described.
    Summary Table
    Method                        | Keeps Singing Only   | Removes Spoken Parts | Easy/No Code | Fast/Automatic
    Basic VAD (Silero, Cobra)     | No (keeps all voice) | No                   | Yes          | Yes
    Source Separation + VAD       | No (keeps all voice) | No                   | Somewhat     | Somewhat
    AI DAW Plugins (RX, Audacity) | No (keeps all voice) | No                   | Yes          | Yes
    Custom ML Models              | Potentially          | Potentially          | No           | No
    Conclusion
    Currently, there is no out-of-the-box tool that automatically detects and keeps only the singing voice while removing everything else, without also keeping spoken voice. The most practical workaround is to use a VAD tool to quickly trim non-voice sections, which at least speeds up the manual process. For a more precise solution (singing-only), you would need a custom-trained AI model, which is not yet available as a simple app for musicians.
     
  6. MHEO

    MHEO Ultrasonic

    Joined:
    Apr 18, 2017
    Messages:
    52
    Likes Received:
    20
    Too complex. Furthermore, the online apps ask for microphone access and can't load WAV or MP3 files. I guess I'll keep doing it manually, thanks.
     
  7. 1_i_Pi

    1_i_Pi Member

    Joined:
    Mar 2, 2024
    Messages:
    28
    Likes Received:
    12
    Work takes time. That's why you get paid. Even for the monotonous tedious tasks (I absolutely hate comping a lot of the time). This incessant need to "optimize" absolutely every part of every process is/will be the undoing. You may say "this is such a small thing who cares" but we're at the point where all that counts. They're already beginning to try to make composition obsolete, audio engineering is right around the corner.

    It literally has to be an all-or-nothing approach if any of us want jobs in 5 years. Yeah, AI is great for a lot of things, but I truly believe this is unfortunately an all-or-nothing kind of thing.
     
  8. MHEO

    MHEO Ultrasonic

    Joined:
    Apr 18, 2017
    Messages:
    52
    Likes Received:
    20
    You are right, work takes time. Anyway, consider that I have to break up about two thousand songs, each of which contains about 50 or 60 sung phrases. That is why I asked if there was anything that could speed up the work. Peace.

    P.S. Meanwhile I found a way, for those interested:
    (1) Isolate the vocals via stem separation;
    (2) Automarker on the isolated vocals (the markers are placed according to the gaps of silence);
    (3) Import the CSV markers and apply them to the original audio;
    (4) Autotrim and export the marked regions as audio.

    This way you'll get more or less what you need.
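
    For anyone who wants to script steps (3) and (4), here is a rough sketch of the idea. It assumes the phrase times were already found on the vocal stem, and relies on the stem and the original song sharing the same timeline, so the same times can be used to cut the original. Everything named here is hypothetical: the function names, the 0.1 s padding in the usage note, and especially the CSV columns, which do not match any particular DAW's marker import format and would need adapting.

```python
import csv
import io

def pad_and_merge(segments, pad_s, total_s):
    """Widen each (start, end) span by pad_s, clamp to the file, merge overlaps."""
    merged = []
    for start, end in sorted(segments):
        start, end = max(0.0, start - pad_s), min(total_s, end + pad_s)
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)  # overlaps previous region
        else:
            merged.append([start, end])
    return [(round(s, 3), round(e, 3)) for s, e in merged]

def markers_csv(regions, prefix="phrase"):
    """Render regions as CSV text (hypothetical columns, not a DAW format)."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["name", "start_s", "end_s"])
    for n, (start, end) in enumerate(regions, 1):
        writer.writerow([f"{prefix}_{n:03d}", f"{start:.3f}", f"{end:.3f}"])
    return out.getvalue()

def cut_regions(samples, sample_rate, regions):
    """Slice the ORIGINAL audio at the region times found on the vocal stem."""
    return [samples[round(s * sample_rate):round(e * sample_rate)]
            for s, e in regions]
```

    A typical call chain would be regions = pad_and_merge(found_segments, 0.1, song_length_s), then markers_csv(regions) for the marker file and cut_regions(original_samples, sample_rate, regions) for the audio. The padding keeps the first consonant and the tail of each phrase from being clipped.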
     
    Last edited: Jun 11, 2025 at 9:55 PM