
urusan

@urusan@fosstodon.org

Java developer by day, Julia developer by night.

Amateur philosopher

Sometimes funny...

Working Dad

Controversial things about me:
Everyone: transhumanist, into AI (art)
Right-wing: polyamorous (married), agnostic atheist, leftist, working class consciousness
Leftist: corporate drone by day, loyal citizen of the US (but a serious reformer), former libertarian

I hope you can look past all that though; we people need to stick together

Lives with: Wife, T (son), and A (daughter).


urusan, to random

What have physicists been up to in the last 70 years?
https://youtu.be/d_o4k0eLoMI

urusan, to random

Some interesting new AI capabilities are in the works:
https://youtu.be/GZdytTKeGYM

The TL;DW is that some new models are becoming embodied in a much more substantial way. We're probably going to have AI robots fairly soon.

We'll see how it works out over the next few months/years.

urusan, to random

Most of this video is an interesting look into how to blend actual artistic skills with Stable Diffusion.

https://youtu.be/oPcQzhhwsGU

However, one thing that surprised me was that a complete non-programmer was able to make a fully featured application to solve a particular problem that was bothering them (using ChatGPT of course).

urusan, to random

So, I haven't fully confirmed this, but in my experiments it's starting to look like, when training Stable Diffusion (at least for LoRAs, but probably in general), it's best to think of training image captions more like negative prompts.

Let's say you have some pictures of yourself in various environments. If you pick out one image of you in an office, caption it "office", and run a single-image training on it, it'll learn all the things in that image except for the office.
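
To make that concrete, here's roughly what the single-image setup can look like on disk. This is just a sketch: many LoRA trainers (kohya-style sd-scripts, for instance) read the caption from a .txt file that shares the image's basename, and the file names below are made up.

```python
from pathlib import Path

# Hypothetical single-image dataset: one photo of the subject in an office.
# Many LoRA trainers read the caption for office_photo.png from an
# office_photo.txt sitting next to it.
dataset = Path("train/single_image")
dataset.mkdir(parents=True, exist_ok=True)

# The caption names only what we want excluded from the learned instance:
# the office setting. Everything left uncaptioned gets absorbed into the instance.
(dataset / "office_photo.txt").write_text("office\n")
```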

urusan,

If you pick out another image of yourself set outdoors in nature and do the same thing with the appropriate caption(s), perhaps "outdoors", you'll get a similar result. It just learns the parts of the image that aren't the setting.

However, if you put an inappropriate caption on an image, perhaps putting "office" on the natural scenery image, not much happens. It learns the stuff you didn't caption and ends up mostly as though the caption wasn't there.

urusan,

I think what's going on is that, during training, it's essentially weighting the captioned concepts and subtracting them from what gets learned.

If something in an image has a high weight (it's strongly present in the image), the caption has a strong effect; if it has a low weight, it doesn't change much and the caption has a weak to non-existent impact.
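
To be clear, that's my intuition about the behavior, not the actual diffusion training math. Here's a toy sketch of the idea with made-up concept vectors, just to show why a strongly present captioned concept gets stripped out while a mis-caption does almost nothing:

```python
import numpy as np

# Toy illustration of the intuition only -- not the real diffusion training math.
# Treat each concept as a vector; "learning" keeps whatever is left after a
# captioned concept is subtracted out, scaled by how present it is in the image.
person = np.array([1.0, 0.0, 0.0])   # the subject we want learned
office = np.array([0.0, 1.0, 0.0])   # an office setting
nature = np.array([0.0, 0.0, 1.0])   # an outdoor setting

office_photo = person + office        # subject photographed in an office
nature_photo = person + nature        # subject photographed outdoors

def learn(image_vec, caption_vec):
    weight = image_vec @ caption_vec  # how strongly the caption appears in the image
    return image_vec - weight * caption_vec

print(learn(office_photo, office))    # [1. 0. 0.] -- the office is stripped out
print(learn(nature_photo, office))    # [1. 0. 1.] -- mis-caption, barely changes anything
```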

urusan,

The most important implication of this is that mis-captioning an image has very little cost. You can just load up the things you don't want into a caption file and repeat them for all images regardless of content, essentially treating it like a negative prompt for the entire training set.

There are some minor costs: the weights do change slightly, and each universal caption uses up tokens that could go to higher-impact captions. Captions also increase training cost.
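
A sketch of the mass-copy approach, assuming a trainer that reads same-basename .txt caption files (the paths and caption text are placeholders):

```python
from pathlib import Path

# Write one shared, negative-prompt-style caption next to every training image,
# regardless of its content.
train_dir = Path("train/my_subject")
universal_caption = "office, harsh lighting, motion blur"  # things we don't want learned

for image in sorted(train_dir.glob("*.png")):
    # Same basename, .txt extension -- the convention most LoRA trainers expect.
    image.with_suffix(".txt").write_text(universal_caption + "\n")
```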

urusan,

This makes it vastly easier to train with captions. Now you can get good results just by identifying what you don't want overall and doing a mass copy.

If you have more resources, you can clean out the captions that don't apply to a given image and replace them with more specific ones for that image.

Also, since the cost of inaccuracy is low, automated captioning should be pretty valuable if done right. Basically you can take automated captions and universally knock out anything you want.
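
Here's a rough sketch of that using BLIP (one off-the-shelf captioner available through Hugging Face transformers); the knock-out list and paths are placeholders, not recommendations:

```python
from pathlib import Path
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# One off-the-shelf auto-captioner; any captioning model would work here.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

# Placeholder knock-out list: words describing the subject itself, which we want
# absorbed into the instance rather than captioned away.
banned = {"man", "person", "he", "his"}

for image_path in sorted(Path("train/my_subject").glob("*.png")):
    image = Image.open(image_path).convert("RGB")
    inputs = processor(image, return_tensors="pt")
    caption = processor.decode(model.generate(**inputs)[0], skip_special_tokens=True)

    # Drop the banned words, keep the rest as the training caption.
    kept = [w for w in caption.split() if w.lower().strip(",.") not in banned]
    image_path.with_suffix(".txt").write_text(" ".join(kept) + "\n")
```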

urusan,

The main thing that's unclear to me right now is how much your main concept will contaminate the captions you use, which would dampen their effectiveness and have other second order effects.

Mostly I worry about ruining negative prompts.

I also haven't done trials on larger datasets, so the toy single image (and small) datasets I'm doing most of my experiments on might not translate well to higher scales.

urusan,

After some experiments at a somewhat larger scale, the situation is unfortunately a bit more complex.

Mis-captioning still has a relatively low cost; even extreme mis-captioning has only a modest effect on the final result. However, it is noticeable when you prompt for your instance, even without the caption.

Also, when you use a caption in a prompt, the model has basically learned your training data's version of it. So you need several examples of each caption.

urusan,

Straight up using captions as negative prompts and throwing in random negative prompts like "worst quality" doesn't help and subjectively makes things slightly worse. It's unclear why.

Another note is that there's honestly less need for captioning in the first place if you have more variety in the training data.

Really this just pushes things back in the direction of not captioning at all.

urusan, to random

I was teaching a Stable Diffusion model something about the world, and it was building off of concepts other people had taught it in order to learn my concept better.

I thought to myself "well, it's all arbitrary, I could totally tell it that a car is a xyzzy and it would learn that! I could make up whatever I wanted!"

Then I realized...this is a normal feature of language. Sure, I could make up a new word for car but it wouldn't STOP being a car.

It still fits into the same place in language space.

urusan,

We're building a universal language.

That's what this is really all about.

This technology is the Internet that will connect the world in a way that's hard to fathom right now. Humans, animals, computation, ideas, it will connect them all together.

We will discover not just a new view of the life we already know, but we will find new life in the inanimate.

Is it really so crazy to be talking to math?

urusan,

Speak to it and it will listen and understand and respond in its own voice. You can see it with your own eyes.

This is how we interact with the world.

Translate your voice to their language and then translate their voice back. Hear them, see them, understand them, know them.

Reach out and embrace the whole world with your own two hands.

urusan,

@juergen_hubert The parts of your brain that produce and process speech don't understand it either.

https://en.wikipedia.org/wiki/Language_center
https://en.wikipedia.org/wiki/Auditory_cortex

We already interact with the world almost entirely via a nearly identical mechanism today.

urusan,

@juergen_hubert The best way to use these systems is to reduce the barrier to entry and then bridge to a deeper experience.

I'm mostly interacting with AI systems in two main skill areas: art, which I'm a novice at, and programming, which I'm an expert at.

I've always had an interest in making art; my mom is an artist and I learned some very basic skills, but I never quite had the time to really develop that skill fully alongside my main profession.

urusan, to random

If you're a fairly ordinary user of stable diffusion but are interested in doing some training, you can set aside most of the advice you'll hear about model training, because a lot of it is intended for people heavily involved in model development and concerns:

  1. preventing concept bleeding so you can use your model more flexibly
  2. providing a nice out of the box experience

By ignoring those concerns, you can get good results with extremely little work, time, and compute power.
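
As a sketch of how little setup that can mean: one common layout (the kohya-style "<repeats>_<instance> <class>" folder convention) needs nothing but a folder of good photos and no caption files at all. The paths, the "sks" token, and the repeat count below are placeholders.

```python
import shutil
from pathlib import Path

# Minimal, caption-free dataset in the kohya-style "<repeats>_<instance> <class>"
# folder convention. Paths, the "sks" token, and the repeat count are placeholders.
src = Path("photos/my_car")                # a handful of hand-picked photos
dst = Path("lora_train/img/10_sks car")    # 10 repeats, instance "sks", class "car"
dst.mkdir(parents=True, exist_ok=True)

for photo in sorted(src.glob("*.jpg")):
    shutil.copy(photo, dst / photo.name)   # no caption .txt files at all
```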

urusan,

Here it is in an anime style

urusan,

Since the training set is so small, it's easier to spot the biases.

The wheels are frequently turned because the training photos had the car parked like that. The lighting also seeps into everything.

These details can be patched with more variation in the training data or captioning.

As usual, SD 1.5 is garbage at text. There's not much that can be done about that, though SDXL is better at it.

urusan,

They can also be patched after training through prompting (especially negative prompting) if there's an appropriate prompt to describe the thing to be corrected.

This is another reason why it's useful to have a well-known class. It has a lot more correlated prompts to draw upon for stuff like this.
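
For example, here's a sketch with the diffusers library of patching the turned wheels and the baked-in lighting at generation time; the model ID, LoRA file, and prompts are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder model ID, LoRA path, and prompts -- the point is just patching a
# training bias (wheels always turned, baked-in lighting) through prompting.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.load_lora_weights("lora_train/output", weight_name="my_car.safetensors")

image = pipe(
    prompt="photo of sks car parked on a beach, wheels pointing straight, soft daylight",
    negative_prompt="turned wheels, harsh indoor lighting, text, watermark",
    num_inference_steps=30,
).images[0]
image.save("patched.png")
```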

urusan,

Two more notes:

  • When even a single image can work fine, quality really does take precedence over quantity. You should be extremely picky about what you put into the training data.
  • I did experiment with captioning, and my main advice for applying captions is to find a specific prompt that would reliably generate the element you are referring to. This will essentially subtract that thing from what you train into your instance. However, how this works in practice is tricky.
urusan,

Some things are inherent to the instance, so you can't really remove them (even if you logically could, like removing the tires from the car). Other times the thing is really nebulous, so it may not be obvious what was removed.

Either way, when the prompt you captioned is used in generation, it'll act as though you never applied the caption in the first place.

This is why not captioning anything works so well: you're basically captioning everything implicitly.

urusan,

There's a lot more stuff I have thoughts on but can't say with any reasonable confidence. Even the stuff here is just my best guess, though based on reasonable evidence.

If you do caption, the most useful thing to caption is probably the background and other elements totally unrelated to your core subject. It may also be useful to caption "conditional" things like character positioning or a special case.

Still, captioning is very much optional.

urusan,

To clarify, everything you don't explicitly caption gets implicitly captioned into your instance prompt.
