UVR5 the Best AI stem separation algo?

Discussion in 'Software' started by curtified, Feb 27, 2023.

  1. Martel

    Martel Platinum Record

    Joined:
    Jan 8, 2023
    Messages:
    398
    Likes Received:
    176
    Thanks a lot for trying to help, but I don't need a 2-hour process to split acapellas from instrumentals for a 0.8 MVSEP quality improvement.

    I will wait for the developer to fix it on his own (if he wants to fix it).

    I'm already very grateful to this community and to the guy behind MDX for trying to offer it to the public.

    UVR is already doing a very good job here.

    Soon we will surely reach excellence.
     
  2. jarredou

    jarredou Guest

  3. Dyslexicon

    Dyslexicon Noisemaker

    Joined:
    Mar 19, 2023
    Messages:
    23
    Likes Received:
    4
    Well - this model is quite an improvement over htdemucs_ft, with one big problem - the vocal stem sucks!

    The drum stem makes htdemucs_ft sound lossy in comparison; absolutely beautiful.
    Bass is significantly more accurate: it identifies and retains actual bass guitar frequencies with clarity.
    "Other" is an equally impressive improvement over htdemucs_ft, with much more clarity in guitars.

    Surprisingly, the vocal stem has problems. It's less accurate than htdemucs_ft, and if you look at the vocal stem file in Spek (https://www.spek.cc/), there is a wide band of misidentified frequencies from 15-18 kHz and no response whatsoever above about 18 kHz except for a few random spikes. Compared to htdemucs_ft and even Demucs v3 mdx_extra, the vocal stem on MVSep-MDX23 is crap.
    Troubling, because I thought this model was trained on super-duper high quality vocal samples?? Maybe the problem was that the vocals were trained on MDX, which is an algorithm designed for lossy audio? Maybe someone can clarify if that's correct or not. If anyone can corroborate the results I'm getting in the vocal stem, please advise. I may post screenshots from Spek after doing a few more songs.
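
    For anyone who wants a number rather than a Spek screenshot, the band check described above can be sketched roughly as follows. This is only a minimal sketch using NumPy; the function name `band_energy_fraction` and the synthetic test tones are made up for illustration, and a real stem would first need to be loaded into a mono sample array by whatever audio reader you prefer.

```python
import numpy as np

def band_energy_fraction(samples, sample_rate, low_hz, high_hz):
    """Return the fraction of total spectral energy in [low_hz, high_hz)."""
    samples = np.asarray(samples, dtype=np.float64)
    spectrum = np.abs(np.fft.rfft(samples)) ** 2          # power per frequency bin
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    total = spectrum.sum()
    if total == 0.0:
        return 0.0                                        # silent input
    band = (freqs >= low_hz) & (freqs < high_hz)
    return float(spectrum[band].sum() / total)

# Synthetic stand-ins for a stem (assumption: 1 second of mono audio at 44.1 kHz).
sr = 44100
t = np.arange(sr) / sr
fullband = np.sin(2 * np.pi * 1000 * t) + 0.1 * np.sin(2 * np.pi * 19000 * t)
lowpassed = np.sin(2 * np.pi * 1000 * t)                  # nothing above 18 kHz

frac_full = band_energy_fraction(fullband, sr, 18000, sr / 2)
frac_cut = band_energy_fraction(lowpassed, sr, 18000, sr / 2)
```

    A stem with essentially no content above 18 kHz should give a fraction near zero for that band, while a fullband stem of the same song should not.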

    One other impressive quality of this model: if you mix the vocal stem with the instrumental stem, then invert the original file and paste it into that vocal+instrumental mixdown, you get absolute dead silence. That means the outputs are clean, with no added or lost frequencies. With other models, the stems pasted back together tend to show minor holes or slight degradation in the frequency spectrum.
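
    The null test described here can also be done numerically instead of in an editor. The following is a rough sketch, assuming the stems and the original are already loaded as equal-length sample arrays; `null_test_residual_db` is a hypothetical helper name, not part of UVR or MVSep.

```python
import numpy as np

def null_test_residual_db(original, stems):
    """Peak level (dBFS) of (sum of stems) minus original; -inf is a perfect null."""
    original = np.asarray(original, dtype=np.float64)
    mixdown = np.sum([np.asarray(s, dtype=np.float64) for s in stems], axis=0)
    residual = mixdown - original      # same as adding the polarity-inverted original
    peak = np.max(np.abs(residual))
    return float("-inf") if peak == 0.0 else float(20 * np.log10(peak))

# Synthetic example (assumption: random noise stands in for real stems).
rng = np.random.default_rng(0)
vocals = rng.uniform(-0.5, 0.5, 44100)
instrumental = rng.uniform(-0.5, 0.5, 44100)
original = vocals + instrumental

clean_null = null_test_residual_db(original, [vocals, instrumental])
leaky_null = null_test_residual_db(original, [vocals, 0.99 * instrumental])
```

    A result far below audibility means the stems add back up to the original; anything noticeably higher means something was added or lost during separation.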

    Try not to laugh too hard, but I was able to run this on a decade-old quad-core i7 with 16GB RAM. :rofl:
    A 4-minute song took 4 hours to render, but it didn't crash after I clicked on the options tab and selected CPU and single ONNX.
    I tried a few renders without those options selected and it crashed at 20%, like others have mentioned.
     
    Last edited: May 13, 2023
  4. jarredou

    jarredou Guest

    The high-frequency problem may be because the "Kim Vocal 1" model is not fullband (I don't know about the second Kim model used). I haven't investigated this much; I'm not using it for vocal stems but for drum and bass stems, which, as you said, are better than pure demucs_ft.
     
    Last edited by a moderator: May 13, 2023
  5. Dyslexicon

    Dyslexicon Noisemaker

    Is "Kim Vocal 2" trained on lossless, and is it a user-selectable/modifiable option?
     
  6. Martel

    Martel Platinum Record

    Yeah, it's unusable right now. I believe you word for word about the better quality, but there's a limit when it comes to usability.

    I'm also feeding it lossless, and the vocals are the main thing for me. The drums are just a reference for BPM in my use case.

    With a bit of EQing, I was able to achieve incredible results earlier this year with UVR.

    To me, all of this just underlines one thing. This technology is really about to get to a level where anything will be available to anyone willing to spend a little time to understand it. It really just needs to be fine-tuned. But so far the quality is extremely impressive.

    It just sucks that MDX was not able to make it available to the general public as a first release.

    Yet, I'm 100% confident they will find a way to make it work on a normal PC.

    This whole AI thing is so stimulating. I can't even remember when a technology last got me so hyped up.
     
  7. jarredou

    jarredou Guest

    No, you can't really change it afterwards... From what I understand, some models are trained with a cutoff because it's less resource-heavy and the major part of sound/music is way below these frequencies. It's a trade-off. But some of the recent UVR-MDX Instrumental models are really fullband.

    I don't know how algos and models react to a frequency spectrum larger than the one they were trained on, whether it messes up the audio or just lets it pass through unprocessed...
     
  8. jarredou

    jarredou Guest

    You can try the Colab version I've posted above; it takes a few minutes to process a track. ;)
     
  9. Dyslexicon

    Dyslexicon Noisemaker

    Weird, because the rest of the stems are gorgeous, the best I've ever heard and seen in spectral view, so the fact that the vocal stem alone is based on lossy data seems like a major fuckup. It's a fatal error that renders this model effectively useless for my purposes.

    So this model only got third place? Looking forward to trying 2nd and 1st place entries!!
     
  10. julianbre

    julianbre Producer

    Joined:
    Jul 15, 2015
    Messages:
    217
    Likes Received:
    126
  11. Legotron

    Legotron Audiosexual

    Joined:
    Apr 24, 2017
    Messages:
    2,181
    Likes Received:
    2,113
    Location:
    Hyperborea
    This is OT, but I didn't want to create a new thread just to ask a question.
    Is there an A.I. video creator that makes videos from clips you feed it? I'm not talking about editing, i.e. cutting.
    I'm looking for some really weird results, like feeding it clips of fractals, cartoons, nature, porn and anything you can imagine, the weirder the better :rofl:
     
  12. jarredou

    jarredou Guest

    https://research.runwayml.com/gen2
     
  13. Dyslexicon

    Dyslexicon Noisemaker

    sami-bytedance-v.0.1.1 - has passed MVSep-2023 but no "Other" stem. Annoying!
    Any idea how to run this? I might try phase-inverting the resultant stems, pasting them into the original file, and seeing if that leaves a decent-sounding guitar stem.
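
    The subtraction idea above (original minus the stems a model does provide) could look roughly like this. It is a sketch under the assumption that all arrays are time-aligned and the same length; `residual_stem` is a hypothetical name, and how musical the leftover sounds depends entirely on how clean the provided stems are.

```python
import numpy as np

def residual_stem(mix, stems):
    """Subtract the given stems from the full mix; what remains is the 'Other' estimate."""
    mix = np.asarray(mix, dtype=np.float64)
    separated = np.sum([np.asarray(s, dtype=np.float64) for s in stems], axis=0)
    return mix - separated             # equivalent to mixing in phase-inverted stems

# Synthetic example (assumption: noise arrays stand in for real audio).
rng = np.random.default_rng(1)
vocals, drums, bass, other = (rng.uniform(-0.2, 0.2, 1000) for _ in range(4))
mix = vocals + drums + bass + other

other_estimate = residual_stem(mix, [vocals, drums, bass])
```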
     
  14. Dyslexicon

    Dyslexicon Noisemaker

    @jarredou
    Re: Vocal stem quality issues caused by lossy training data, you said over on Git:

    This is also happening with the MDX-Colab from Audio Separation discord. But not in UVR!
    There must be some lowpass filtering going on somewhere.

    Are you saying MVSep2023 can be run in UVR?
    And that UVR prevents this vocal stem filtering issue that causes the garbage frequency band from 15-18 kHz?

    I don't understand how a model trained on lossy data, yielding a vocal stem with these demonstrable spectral issues, scores so highly on the leaderboard, above models not trained on lossy data and with squeaky-clean extraction results (like htdemucs_ft).
    It's not only frequency-truncated, it's introducing garbage frequencies that can be clearly seen in the individual stem.

    Am I missing something here? I did have to run the model on CPU with single ONNX, which I know gives poorer results.
    Does rendering with a beast of a GPU or using Colab not cause this issue?
     
  15. Martel

    Martel Platinum Record

    I gave up on MVSEP for now. It finally worked on ONNX, but it took over 45 minutes for one 4-minute song. The results were not noticeably better than UVR.

    I'm back to ensemble. It is still yielding very decent results.

    Someone in the UVR issue report on github mentioned that main inst is the 496 in the leaderboard ensemble result. Can someone confirm that?

    https://github.com/Anjok07/ultimatevocalremovergui/issues/541#issuecomment-1556026356
     
  16. kevdel10

    kevdel10 Noisemaker

    Joined:
    May 14, 2022
    Messages:
    4
    Likes Received:
    4
    Thanks for the share! I had trouble figuring out how to run this, but I think I've figured it out. I get the following message when I run it though: "UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:245.) mix_waves = torch.tensor(mix_waves, dtype=torch.float32).to(device)". I have no idea what any of this means :dunno::rofl:
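
    The warning quoted above is about building a tensor from a Python list of NumPy arrays, and its own text suggests converting the list to a single array first. A minimal sketch of that conversion follows; the chunk shapes are invented for illustration, `stack_chunks` is a hypothetical helper, and the PyTorch call is left as a comment so the snippet runs with NumPy alone.

```python
import numpy as np

def stack_chunks(chunks):
    """Combine a list of equally shaped arrays into one contiguous float32 array."""
    return np.ascontiguousarray(np.stack(chunks), dtype=np.float32)

# Hypothetical stand-in for the list the script calls `mix_waves`.
mix_waves = [np.zeros((2, 1024)), np.ones((2, 1024))]
batch = stack_chunks(mix_waves)

# With PyTorch installed, the tensor can then be created from the single array,
# which avoids the slow list-of-arrays path the warning complains about:
#   torch.from_numpy(batch).to(device)
```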
     
  17. Dyslexicon

    Dyslexicon Noisemaker

  18. jarredou

    jarredou Guest

    First place in MDX23 went to the SAMI research lab from ByteDance (TikTok's owner). They said they can't open-source their code and models at the moment, and that they weren't sure whether they will or can in the end, because their legal team must review everything before it's published... it's a big corpo, it's Chinese... so it's also political.

    The 2nd-place winner also said he will not open-source his stuff... :(
     
  19. jarredou

    jarredou Guest

    This is just a warning, so it's working; it only says that the array conversion could be a little faster. That's not the reason MVSep is slow, though. It's slow because the audio is processed by htdemucs_ft, htdemucs_mmi and htdemucs_6s, then one MDX model, then another one, and doing all of that simply takes time. And if you choose high values for "overlap", it will be much slower (but with slightly better separation quality).
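
    To give a feel for why a higher "overlap" costs time, here is a generic sketch of how the number of processing windows grows when consecutive windows share part of their length. It is not MVSep's actual code: the 10-second window and the way overlap is interpreted as a fraction are assumptions made for illustration.

```python
import math

def count_chunks(total_samples, chunk_samples, overlap):
    """Windows needed to cover a track when neighbours share `overlap` of their length."""
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if total_samples <= chunk_samples:
        return 1
    hop = max(1, int(chunk_samples * (1.0 - overlap)))    # step between window starts
    return 1 + math.ceil((total_samples - chunk_samples) / hop)

# Assumed numbers: a 4-minute track at 44.1 kHz, hypothetical 10-second windows.
track = 240 * 44100
window = 10 * 44100
chunks_by_overlap = {ov: count_chunks(track, window, ov) for ov in (0.0, 0.5, 0.75)}
```

    Under these assumptions the window count roughly doubles each time the hop is halved, and that cost is paid again for every model in the chain.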
     
    Last edited by a moderator: May 22, 2023
  20. jarredou

    jarredou Guest
