have been exploring retrieval-based voice conversion (RVC) today. i would like to train a model on my own voice: while it's fun to transform things into other voices (i'm more interested in the timbral adjustment than just using another voice), this would be a great tool for generating vocal pad backing tracks, harmonies, or even lead vocals when i'm unable to provide them (tts -> tune/time in melodyne -> RVC model of my voice).
update: flawless victory. input is some saw waves through formant filters. model is just some quick recordings of me reading the harvard sentences lol
this shit owns. very excited to train it on a more expansive set since this one was pretty small. still ends up really expressive when you get some filter envelopes and whatnot on the synths.
@msx these are next level, superb, so impressive
@msx what if you made a model of the google translate voice, so you could keep using it in the future if they ever change their voice engine/whatever
@eightone you are a genius
@msx it seemed like a bad idea at first because I forgot that the model can do TTS too....
@eightone this model is actually sound-to-sound - it maps the trained voice onto whatever audio you feed it - which actually makes for some interesting effects when the training data is a very obviously fake TTS voice
@msx i'm really curious to hear how it will sound. i might make my own model too. but I'm honestly not sure what random text I should put in.
@eightone https://en.wikipedia.org/wiki/Harvard_sentences i used these since it covers most of the phonetic information necessary. it winds up being a small set but gets good results around 30 epochs.
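for anyone who wants to record their own set, here's a minimal sketch of a prompt script. the sentences are the first Harvard list; the filename scheme and the `prompts` helper are just assumptions for illustration, not anything RVC requires:

```python
# minimal recording-prompt helper: prints each sentence with a suggested
# take filename, so the recorded clips line up with the text you read.
HARVARD_LIST_1 = [
    "The birch canoe slid on the smooth planks.",
    "Glue the sheet to the dark blue background.",
    "It's easy to tell the depth of a well.",
    "These days a chicken leg is a rare dish.",
    "Rice is often served in round bowls.",
    "The juice of lemons makes fine punch.",
    "The box was thrown beside the parked truck.",
    "The hogs were fed chopped corn and garbage.",
    "Four hours of steady work faced us.",
    "Large size in stockings is hard to sell.",
]

def prompts(sentences, prefix="take"):
    """Pair each sentence with a zero-padded filename for its recording."""
    return [(f"{prefix}_{i:02d}.wav", s) for i, s in enumerate(sentences, 1)]

for fname, text in prompts(HARVARD_LIST_1):
    print(f"{fname}: {text}")
```

record one clip per prompt, keep the names matched up, and you have a phonetically broad little dataset to feed the trainer.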

@msx oh this slaps‼️‼️‼️‼️
@msx Whoa, this is really cool! I need to look into this myself.
@msx i wonder how it'd sound if the model was trained on drum sounds instead of regular speech (or non-human sounds, for that matter)
@Gumball2415 that's the next stop. the inverse (applying voices to non-voice sounds) is usually hilarious, and i'm very curious about what happens if the set is non-voice, maybe then applied to voice? i know it does some ML extraction of vocal "components" which may make it super interesting
@msx pull off a project with "ML/NN as a tool" using home grown data sets, and I'll feel a bit more at ease with the idea; lately I've just been jaded by the negative aspects
@hyenatown been using ML-generated audio in my projects since More Adventures and it is just about the best thing ever :)
@msx then hell yeah go donk go

@msx Honestly, this is a really good use of an RVC model and why have I not thought of that?

Still, I'm not sure about the requirements for training an RVC model. I know that most AI models right now need pretty beefy PCs. Then again, optimizations have been coming in for a long while now.

@wishdream i can train a simple model on my 2060 within half an hour, it's very reasonable. i was even able to do it while streaming provided i didn't have anything else in my OBS profile :)
@msx oh neat! I've got to try it sometime. I've got a 3060 Ti so it'll probably be a bit faster.
Also streaming while training a model sounds like it's gonna burn your PC XD
@wishdream amazingly it just crashed obs at first because the training tried to allocate all my vram LOL. even with 6gb of vram, though, i was able to train + encode video with a simple layout :D it's amazing how optimized this stuff is getting
@msx Can you limit the amount of VRAM it uses then? If not, I am surprised that you're able to encode video with it XD I wonder if I could do a stream with training in the background with my layout, gosh it's already VRAM intensive hahaha
@wishdream you can limit the batch size it trains with (how many clips per step), and i just knocked that down to one at a time. it uses remarkably little GPU processing power, it's all about the memory
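to spell out the knob being turned here (the clip counts below are made up for illustration, not measured from RVC): peak VRAM tracks how many clips sit in memory per training step, while epochs just repeat the same passes over the data, so shrinking the batch trades training time for memory. a rough sketch in plain Python:

```python
import math

def steps_per_epoch(n_clips, batch_size):
    """One epoch is one pass over every clip; batch size sets how many
    clips share a single training step (and so occupy VRAM together)."""
    return math.ceil(n_clips / batch_size)

# illustrative: a small Harvard-sentences set of 72 clips
print(steps_per_epoch(72, 8))  # 9 steps per epoch, 8 clips resident at once
print(steps_per_epoch(72, 1))  # 72 steps per epoch, 1 clip resident at once
```

same total work either way; batch size 1 just spreads it across more, smaller steps, which is why training can coexist with something hungry like OBS.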
@msx oh, then that shouldn't be a problem. should probably give it a try with that. never really thought of bumping down the batch size, so that changes things.