gimulnautti,
@gimulnautti@mastodon.green

#mathematics people:

I feel there has to be a way of training neural networks to recognise the influence of their training data on the output.

This would probably mean training a complementary indexing network plus a database, which could then, via ”reverse-training”, resolve and offer at some predetermined accuracy the #copyright-viable sources for each generated #aiart
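A minimal sketch of what such an index could look like, assuming hypothetical embedding vectors for the training items (the sizes, item IDs and the cosine-similarity lookup are all illustrative assumptions, not an actual implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the ”indexing network” + database idea: every training
# item is stored as an embedding vector; a generated output is embedded the
# same way, and the index returns the most similar training items as
# candidate sources.  The embeddings here are random placeholders.
train_embeddings = rng.normal(size=(1000, 64))        # 1000 hypothetical items
train_embeddings /= np.linalg.norm(train_embeddings, axis=1, keepdims=True)

def likely_sources(output_embedding, k=5):
    """Indices and similarities of the k training items closest to the output."""
    q = output_embedding / np.linalg.norm(output_embedding)
    sims = train_embeddings @ q                       # cosine similarity
    top = np.argsort(sims)[::-1][:k]
    return top, sims[top]

# An output that closely resembles training item 42 should surface item 42.
output = train_embeddings[42] + 0.05 * rng.normal(size=64)
top, sims = likely_sources(output)
print(top[0])   # -> 42
```

A real system would need embeddings that capture influence on the output, not mere similarity to it, which is the harder problem discussed below in the thread.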

I need some help though. A proof would show that the companies know it can be done; they just don’t want to do it.

#neuralnetworks

gimulnautti,
@gimulnautti@mastodon.green

As to my motivation: I’m looking forward to people making MORE contributions to AI training, instead of hiding their work away to protect it from being stolen.

I am not looking for copyright violations, but a new paradigm, where generative models feed back into the creative economy.

For this, a rewriting of the copyright laws will be necessary. I wrote at length about this for the Finnish Pirate Party more than a year ago:

https://gimulnaut.wordpress.com/2023/01/13/copyright-wars-pt-2-ai-vs-the-public/

gimulnautti,
@gimulnautti@mastodon.green

In other words: Getting paid a reasonable amount by the model providers would incentivise people to be part of the training data.

Getting paid for using the models creatively also incentivises their use, and this is where copyright reform is most needed.

Instead of this remaining a grey area forever, which the industry would surely be happy about, we need a legal category for the remix as a legitimate piece of art, not a bastardisation.

#ai #aiart #remix #legal #copyrightreform

https://gimulnaut.wordpress.com/2023/04/20/ai-art-is-a-remix-the-djs-of-pictures/

gimulnautti,
@gimulnautti@mastodon.green

As to additional benefits of training data indexing models: they could be used to pinpoint ”needles in haystacks” and clear out malicious training data.

I’m sure this kind of malicious data is being inserted into the open internet right now by malicious political actors, to steer future models into having certain properties. Properties that serve those actors themselves, of course.

gimulnautti,
@gimulnautti@mastodon.green

The day when everybody gets their answers about the world through a language model might not be too far away.

What kind of expressions do these models and their training prefer? How can they be hacked?

The same will happen as with social media networks: Those who can reverse-engineer the model most efficiently will prosper.

In the words of Walter Benjamin:

”Technology is the mastery not of nature, but of the relation between nature and man.”

This time nature is human nature.

jaifroid,

@gimulnautti Isn't there an obvious solution: open-source all model weights on the basis that so much of the training material is open-source or public domain? I've never seen the sense in the idea that LLM providers should pay a licensing fee for every generation based on public data or copyrighted data acquired legally. I shouldn't have to pay continued royalties or licence fees to every provider of information that has been absorbed by my brain every time I speak. That's a dangerous path.

gimulnautti,
@gimulnautti@mastodon.green

@jaifroid No, you shouldn’t.

Comparing people’s brains to data centers isn’t an apples-to-apples comparison. Your brain’s computation has value because you are human. Your knowledge contributes to society whether you like it or not.

A data center is not human; its calculations do not have human rights.

In business, there is fair use. But what determines fair use is the use case.

Copyright law should serve humans. What’s really dangerous is writing it to serve machines.

sofia,
@sofia@chaos.social

@gimulnautti it's really not clear what you mean here. the first paragraph sounds like you want a system that can be fed media and identifies likely copyright violations?

i don't know what an indexing network is (google didn't help), and i think by reverse training you mean training a network to be less likely to produce the given output?

sounds like you wanna unleash some new generation of copyright hell on people? i hope i misunderstand, but i can't think of much else…


gimulnautti,
@gimulnautti@mastodon.green

@sofia I understand your confusion.

As I argue in the articles above, we lack the proper terminology and concepts in art and in legal practice to approach the problem pragmatically.

Developing these frameworks is paramount to getting out of the deadlock. Pointing fingers is not going to work if all it leads to is more of the same.

gimulnautti, (edited)
@gimulnautti@mastodon.green

@sofia A good example is news:

Where’s the revenue for news providers if everyone gets their answers through models? This will happen sooner rather than later.

These are fundamental topics for our society, and the laws should be written based on how this technology changes us, not the other way around.

gigapixel,
@gigapixel@mastodontti.fi

@gimulnautti this sounds like a very interesting mathematical problem. I’ve got a bit of an idea forming on the basic structures that would be needed mathematically for this. I work in NLP and the ideas are for generative text models but I think one could generalise.

gigapixel,
@gigapixel@mastodontti.fi

@gimulnautti So we need to consider the set of all token sequences of a given length, and then the idealised model is just a probability density on the set of all tokens for each possible sequence. We make this space a bit more continuous by considering this input to be a sequence of probability densities on sequences (this should map better to the underlying token embeddings). Then we can define training as a path through this space.
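One way to formalise the setup above (the notation here is introduced for illustration, not taken from the thread):

```latex
% An idealised model maps a length-n sequence of token distributions
% to a distribution over the next token:
\[
  f_\theta \colon \Delta(V)^{n} \longrightarrow \Delta(V),
\]
% where V is the token vocabulary and \Delta(V) the probability
% simplex over it.  Training for T steps is then a path through
% parameter space, each step t driven by one batch z_t of samples:
\[
  \theta_0 \xrightarrow{\;z_1\;} \theta_1 \xrightarrow{\;z_2\;} \cdots \xrightarrow{\;z_T\;} \theta_T .
\]
```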

gigapixel,
@gigapixel@mastodontti.fi avatar

@gimulnautti when we run the model forward we generate a new sequence (or rather a sequence of sequences). For each one we query along each step of the path how much each training sample contributed to the reduction of loss for the predicted output.
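Querying each step of the path for a training sample's contribution to the loss reduction can be sketched concretely; this is the idea behind gradient-based influence scores such as TracIn. The toy below (a least-squares model trained by SGD, with orthogonal toy samples so the attribution is exact; everything here is an illustrative assumption) logs a checkpoint at every training step and then sums each sample's gradient dot product with the query's gradient along the path:

```python
import numpy as np

# Toy stand-in for ”querying along the training path”: a least-squares model
# y = w.x trained by SGD, with a weight checkpoint logged at every step.
# Afterwards we sum, over the path, each training sample's gradient dot
# product with the query's gradient (the TracIn idea).
X = np.eye(5)                                  # 5 orthogonal training samples
y = np.array([2.0, -1.0, 0.5, 3.0, -2.0])      # their targets
lr = 0.1

def grad(w, x, target):
    return (w @ x - target) * x                # d/dw of 0.5 * (w.x - target)^2

w = np.zeros(5)
checkpoints = []
for t in range(100):
    i = t % len(X)
    checkpoints.append((w.copy(), i))
    w -= lr * grad(w, X[i], y[i])

def influence(x_query, y_query):
    """How much each training sample contributed to reducing the query loss."""
    scores = np.zeros(len(X))
    for w_t, i in checkpoints:
        scores[i] += lr * grad(w_t, X[i], y[i]) @ grad(w_t, x_query, y_query)
    return scores

scores = influence(X[3], y[3])
print(scores.argmax())   # -> 3: the prediction traces back to training sample 3
```

For a real generative model the sum runs over billions of parameters and steps, which is why it only works as a toy here.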

gigapixel,
@gigapixel@mastodontti.fi

@gimulnautti now the problem is that this is probably prohibitively expensive to calculate. But following your suggestion, we can probably train a model that tries to predict it instead.

gimulnautti, (edited)
@gimulnautti@mastodon.green

@gigapixel Yes. I was thinking a companion model would need to be trained to accompany the generative model. It could never be 100% accurate, but accurate enough would be sufficient. Say, 98%.

Predicting the neuron activation patterns without access to the weights could yield a model of each training item’s influence on the network.

A large compression ratio would need to be applied for this to be practical. Arriving at a proof of accuracy relative to compression ratio would be the target.
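The accuracy-versus-compression trade-off can be probed with a toy experiment. Below, random vectors stand in for per-item activation signatures (everything is a hypothetical placeholder): each signature is stored only after a random projection to a much smaller dimension, and we measure how often a noisy query still retrieves the correct item from the compressed store.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical per-item activation ”signatures”, stored only in compressed
# form via a random projection; we then check how often a noisy query still
# retrieves the correct item from the compressed store.
d_full, n_items = 1024, 500
signatures = rng.normal(size=(n_items, d_full))

def retrieval_accuracy(d_compressed, queries=100, noise=0.5):
    P = rng.normal(size=(d_full, d_compressed)) / np.sqrt(d_compressed)
    store = signatures @ P                  # the compressed ”database”
    hits = 0
    for _ in range(queries):
        i = rng.integers(n_items)
        q = (signatures[i] + noise * rng.normal(size=d_full)) @ P
        hits += int(np.linalg.norm(store - q, axis=1).argmin() == i)
    return hits / queries

acc_mild = retrieval_accuracy(64)    # 16x compression
acc_hard = retrieval_accuracy(8)     # 128x compression
print(acc_mild, acc_hard)
```

The point is that any accuracy guarantee would have to be stated relative to the chosen compression ratio, as suggested above.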

shram86,
@shram86@mastodon.gamedev.place

@gimulnautti considering all of the metadata that goes into every single pixel, this isn't beyond the realm of possibility at all, but it will never happen even if you perfect it. Nobody is going to willingly prove that they plagiarized all of their art.

gimulnautti,
@gimulnautti@mastodon.green

@shram86 Yes.

Only political action can make it happen: convincing enough people to pressure enough legislators with the aforementioned mathematical proof, which in turn happens by running enough grassroots campaigns with enough artists who have a stake in the matter.

When enough people demand it, and the mathematics corners the companies so that ”it’s impossible” no longer sounds credible, then it can happen.

It’s a long road, but we should start walking.

gimulnautti,
@gimulnautti@mastodon.green

@shram86 When we have the mathematical proof, we should start a non-profit and get some donations to campaign.

shram86,
@shram86@mastodon.gamedev.place

@gimulnautti I'm all in.

gimulnautti,
@gimulnautti@mastodon.green

@shram86 A Google search for ”providing training material indexing capability to language models” gives me nothing.

Not a single citation.

This alone is good enough reason to start, just out of pure scientific interest.

However, I think the topic is dismissed at the top, right out of hand, and there are strong incentives for the CEOs to do that. Their whole careers depend on it.

We know how human reasoning works. It’s not just logic, but a mix of self-interest and logic.
