have been exploring retrieval-based voice conversion (RVC) today. i would like to train a model on my own voice: while it's fun to transform things into other voices (i'm more interested in the timbral adjustment than just using another voice), this would be a great tool for generating vocal pad backing tracks, harmonies, or even lead vocals when i'm unable to provide them (tts -> tune/time in melodyne -> RVC model of my voice).
update: flawless victory. input is some saw waves through formant filters. model is just some quick recordings of me reading the harvard sentences lol
this shit owns. very excited to train it on a more expansive set since this one was pretty small. still ends up really expressive when you get some filter envelopes and whatnot on the synths.
@msx these are next level, superb, so impressive
@msx what if you made a model of the google translate voice, so you could keep using it in the future if they ever change their voice engine/whatever
@eightone you are a genius
@msx it seemed like a bad idea at first because I forgot that the model can do TTS too....
@eightone this model is actually sound-to-sound - it maps the trained voice onto whatever audio you feed it - which actually makes for some interesting effects when the training data is a very obviously fake TTS voice
@msx i'm really curious to hear how it will sound. i might make my own model too. but I'm honestly not sure what random text I should put in.
@eightone https://en.wikipedia.org/wiki/Harvard_sentences i used these since it covers most of the phonetic information necessary. it winds up being a small set but gets good results around 30 epochs.
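for anyone who wants to record their own set, here's a minimal sketch of a prompt script. the sentences are the first Harvard list; the filename scheme and the `prompts` helper are just assumptions for illustration, not anything RVC requires:

```python
# minimal recording-prompt helper: prints each sentence with a suggested
# take filename, so the recorded clips line up with the text you read.
HARVARD_LIST_1 = [
    "The birch canoe slid on the smooth planks.",
    "Glue the sheet to the dark blue background.",
    "It's easy to tell the depth of a well.",
    "These days a chicken leg is a rare dish.",
    "Rice is often served in round bowls.",
    "The juice of lemons makes fine punch.",
    "The box was thrown beside the parked truck.",
    "The hogs were fed chopped corn and garbage.",
    "Four hours of steady work faced us.",
    "Large size in stockings is hard to sell.",
]

def prompts(sentences, prefix="take"):
    """Pair each sentence with a zero-padded filename for its recording."""
    return [(f"{prefix}_{i:02d}.wav", s) for i, s in enumerate(sentences, 1)]

for fname, text in prompts(HARVARD_LIST_1):
    print(f"{fname}: {text}")
```

record one clip per prompt, keep the names matched up, and you have a phonetically broad little dataset to feed the trainer.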

@msx oh this slaps‼️‼️‼️‼️
@msx Whoa, this is really cool! I need to look into this myself.
@msx i wonder how it'd sound if the model was trained on drum sounds instead of regular speech (or non-human sounds, for that matter)
@Gumball2415 that's the next stop. the inverse (applying voices to non-voice sounds) is usually hilarious, and i'm very curious about what happens if the set is non-voice, maybe then applied to voice? i know it does some ML extraction of vocal "components" which may make it super interesting
@msx pull off a project with "ML/NN as a tool" using home grown data sets, and I'll feel a bit more at ease with the idea; lately I've just been jaded by the negative aspects
@hyenatown been using ML-generated audio in my projects since More Adventures and it is just about the best thing ever :)
@msx then hell yeah go donk go

@msx Honestly, this is a really good use of an RVC model and why have I not thought of that?

Still, I'm not sure about the requirements for training an RVC model. I know that most AI models right now need pretty beefy PCs. Then again, optimizations have been coming in for a long while now.

@wishdream i can train a simple model on my 2060 within half an hour, it's very reasonable. i was even able to do it while streaming provided i didn't have anything else in my OBS profile :)
@msx oh neat! I've got to try it sometime. I've got a 3060 Ti so it'll probably be a bit faster.
Also streaming while training a model sounds like it's gonna burn your PC XD
@wishdream amazingly it just crashed obs at first because the training tried to allocate all my vram LOL. even with 6gb of vram, though, i was able to train + encode video with a simple layout :D it's amazing how optimized this stuff is getting
@msx Can you limit the amount of VRAM it uses then? If not, I am surprised that you're able to encode video with it XD I wonder if I could do a stream with training in the background with my layout, gosh it's already VRAM intensive hahaha
@wishdream you can limit the batch size it trains with (how many clips per step), and i just knocked that down to one at a time. it uses remarkably little GPU processing power, it's all about the memory
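to spell out the knob being turned here (the clip counts below are made up for illustration, not measured from RVC): peak VRAM tracks how many clips sit in memory per training step, while epochs just repeat the same passes over the data, so shrinking the batch trades training time for memory. a rough sketch in plain Python:

```python
import math

def steps_per_epoch(n_clips, batch_size):
    """One epoch is one pass over every clip; batch size sets how many
    clips share a single training step (and so occupy VRAM together)."""
    return math.ceil(n_clips / batch_size)

# illustrative: a small Harvard-sentences set of 72 clips
print(steps_per_epoch(72, 8))  # 9 steps per epoch, 8 clips resident at once
print(steps_per_epoch(72, 1))  # 72 steps per epoch, 1 clip resident at once
```

same total work either way; batch size 1 just spreads it across more, smaller steps, which is why training can coexist with something hungry like OBS.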
@msx oh, then that shouldn't be a problem. should probably give it a try with that. never really thought of bumping down the batch size, so that changes things.