lauren,
@lauren@mastodon.laurenweinstein.org

Lauren's Blog: In Support of Google’s Progress On AI Content Choice and Control

https://lauren.vortex.com/2023/10/26/in-support-of-googles-progress-on-ai-content-choice-and-control

Last February, in:

Giving Creators and Websites Control Over Generative AI

https://lauren.vortex.com/2023/02/14/giving-creators-and-websites-control-over-generative-ai

I suggested expanding the existing Robots Exclusion Protocol (e.g. "robots.txt") as a path toward giving websites and creators control over how their content is used by generative AI systems.
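For reference, the existing Robots Exclusion Protocol is just a plain text file served from a site's root, and the AI-specific opt-outs that have appeared so far reuse its ordinary syntax. A minimal sketch follows; the Google-Extended and GPTBot user-agent tokens shown are ones Google and OpenAI have published for AI-related use of content, but check current documentation before relying on the exact names:

```
# robots.txt -- served from https://example.com/robots.txt

# Ordinary search crawling: allow everything
User-agent: *
Disallow:

# Opt out of content use for Google's generative AI systems (published token)
User-agent: Google-Extended
Disallow: /

# Opt out of OpenAI's crawler used to gather training data (published token)
User-agent: GPTBot
Disallow: /
```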

Shortly thereafter, Google publicly announced their own support for the robots.txt methodology as a useful mechanism in these contexts.

While it's true that adherence to robots.txt (or the related webpage meta tags -- also part of the Robots Exclusion Protocol) is voluntary, my view is that most large firms do honor its directives, and if a move toward a regulatory approach were ultimately deemed genuinely necessary, a more formal mechanism would remain a possible option.
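For completeness, the page-level form of those directives is the standard robots meta tag. The first line below is the long-established syntax; the second is illustrative only, since AI-specific values like "noai" have been floated by some parties but are not a settled standard:

```
<!-- Standard page-level robots directives -->
<meta name="robots" content="noindex, nofollow">

<!-- Illustrative only: "noai" has been proposed but is not a settled standard -->
<meta name="robots" content="noai">
```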

This morning Google ran a livestream discussing their progress in this entire area, emphasizing that we're only at the beginning of a long road, and asking for a wide range of stakeholder inputs.

Of particular importance, I believe, is Google's desire for these content control systems to be as technologically straightforward as possible (so building on the existing Robots Exclusion Protocol is clearly preferable to creating something entirely new), and for the effort to be industry-wide, not restricted to or controlled by only a few firms.

Also of note is Google's endorsement of the excellent "AI taxonomy" concept for consideration in these regards. Essentially, the idea is that AI Web crawling exclusions could be specified by the type of use involved, rather than by which entity is doing the crawling. So, a set of directives could be defined that applies to all AI-related crawlers, irrespective of who operates them -- permitting (for example) crawlers gathering content for public interest AI research to proceed, while directing that content not be taken or used for commercial generative AI chatbot systems.
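To make the taxonomy idea concrete, here is a purely hypothetical sketch of what purpose-based exclusions might look like if grafted onto robots.txt-style syntax. None of these tokens (User-purpose, ai-research, generative-chatbot) exist in any published standard; they are invented here only to illustrate specifying exclusions by type of use rather than by crawler identity:

```
# Hypothetical purpose-based exclusions (illustration only -- not a real syntax)

# Allow crawling whose purpose is public interest AI research, whoever operates the crawler
User-purpose: ai-research
Allow: /

# Disallow taking or using content for commercial generative AI chatbot systems
User-purpose: generative-chatbot
Disallow: /
```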

Again, these are of course only the first few steps toward scalable solutions in this area, but this is all incredibly important, and I definitely support Google's continuing progress in these regards.

--Lauren--

kevindalley,
@kevindalley@sfba.social

@lauren
When a new protocol for stopping AI crawling is decided on, will Google and others remove that data from existing systems?

lauren,
@lauren@mastodon.laurenweinstein.org

@kevindalley That's a tricky question, both policy-wise and technically. I'm doubtful that existing data can be "unwound" that way with that kind of specificity. However, I'd assume the more likely outcome would be applying the protocol only to new data used for newer model versions, though the details could be complicated.

cazabon,

@lauren

Another organization proposed a separate "ai.txt" initiative. They claimed some AI crawlers were already respecting it, but I started watching the counts: each day there are a few hits to robots.txt on my server, but not a one for ai.txt yet.

https://spawning.substack.com/p/aitxt-a-new-way-for-websites-to-set
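(For anyone who wants to run the same kind of informal check, here's a minimal Python sketch. It assumes a common/combined-format web server access log; the default path is just an example.)

```
# count_crawler_hits.py -- rough tally of robots.txt vs ai.txt fetches.
# Assumes a common/combined-format access log; pass a path or edit the default.

import sys

LOG_PATH = sys.argv[1] if len(sys.argv) > 1 else "/var/log/nginx/access.log"

robots_hits = 0
ai_hits = 0

with open(LOG_PATH, errors="replace") as log:
    for line in log:
        if "GET /robots.txt" in line:
            robots_hits += 1
        elif "GET /ai.txt" in line:
            ai_hits += 1

print(f"robots.txt fetches: {robots_hits}")
print(f"ai.txt fetches:     {ai_hits}")
```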

lauren,
@lauren@mastodon.laurenweinstein.org

@cazabon I don't see any good reasons not to expand on robots.txt -- N.I.H. doesn't help the situation.

Mojeek,
@Mojeek@mastodon.social

@lauren @cazabon we put some alongside something we think is a better way: https://noml.info/

cazabon,

@Mojeek @lauren

I'm not sure adding yet another "standard" (see XKCD ...) does much to advance the cause.

I'm not a huge fan of the "declare it in the document content" approach, as it means you have to modify every piece of content on your site(s), and also include it in all future content.

I also think a "do not use for ML" declaration should not need to exist. All content should be "no inclusion in models" by default, and ML use should be opted in with an explicit declaration.

lauren,
@lauren@mastodon.laurenweinstein.org

@cazabon @Mojeek I strongly support just sticking with extensions to robots.txt directives. I suggested this even before Google endorsed the approach, and so far I am satisfied with how Google is proceeding regarding use of the Robots Exclusion Protocol. I'm increasingly unconvinced that a very broad default prohibition against use of data by all AI is practical or desirable, though the devil is of course in the details, and there is a continuum of cases to consider.
