Building a TTS model for the first time? Good guides?

Hey all, brand new to this community, excited to be here!

I’ve stumbled my way through SD, and I currently have text-generation-webui and now SillyTavern up and running. Having lots of fun with all of this, learning how each piece works and how they all fit together!

I’ve made a few models elsewhere, but for some reason I’m having trouble wrapping my head around TTS models. I have a voice I want to make a model for, and I currently have some videos to work from. I’m very familiar with editing audio and video, but stripping out the voice second by second sounds exhausting tbh.

I was wondering if anyone has a good guide to their process for making a TTS model? Are there steps that can be automated while still producing decent results? How much audio of the person speaking do I need? Are there specific tools I should run to clean up the audio? I’m completely new, so any and all advice would be great.

I want to run it locally and “plug it in” to the cluster I already have, so I’ll also need the model to work with a tool that’s compatible with the programs above (and I’ll take advice there too if you have it!)

Thanks!
