Taffer, to llm
@Taffer@mastodon.gamedev.place avatar

I was going to ask if there’s some robots.txt magic that’ll keep LLM scrapers out.

Then I thought of a better idea.

Is there a source of text/images that I can toss on there that’ll poison “AI” scrapers?

dansup, to random
@dansup@mastodon.social avatar

I updated https://fedidb.org to return the proper latest software version

ex: https://fedidb.org/software/mastodon/versions

Next is finishing robots.txt parsing support so we honor instance admins' privacy.

Be warned: this will drop stats across the board, but I think it's better to respect admins and provide less accurate overall stats.
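
Honoring robots.txt before collecting stats fits in a few lines of standard-library Python. A minimal sketch, assuming an illustrative "FediDB" user agent string and example paths (not FediDB's actual implementation):

```python
# Sketch: check robots.txt rules before crawling an instance's stats.
# Uses only the standard library; agent name and paths are illustrative.
from urllib.robotparser import RobotFileParser

def may_crawl(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if `user_agent` is allowed to fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)

rules = """\
User-agent: FediDB
Disallow: /api/
"""

# An instance that disallows the crawler is skipped entirely.
assert may_crawl(rules, "FediDB", "/api/v1/instance") is False
assert may_crawl(rules, "SomeOtherBot", "/api/v1/instance") is True
```

A crawler would fetch each instance's `/robots.txt` first and simply skip instances whose rules disallow it, which is exactly what makes the overall stats less complete but more respectful.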

AdamBishop, to ArtificialIntelligence
@AdamBishop@floss.social avatar

The robots.txt file I add to sites to warn off scraping by AI data-collecting bots, while still allowing search indexing, is getting longer and longer ... 😳

Can we do more than hope and pray it gets respected?

Legislation?
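
Since robots.txt is only a request, one option beyond hoping (or legislating) is server-side enforcement: refuse the requests outright. A minimal nginx sketch, where the user-agent list is illustrative rather than exhaustive, and the `map` belongs at the `http` level:

```nginx
# Match known AI crawler user agents (list is illustrative, not exhaustive).
map $http_user_agent $is_ai_bot {
    default 0;
    ~*(GPTBot|CCBot|Google-Extended|Bytespider|Omgilibot) 1;
}

server {
    listen 80;
    server_name example.com;

    # Hard-block instead of hoping robots.txt is honored.
    if ($is_ai_bot) {
        return 403;
    }
}
```

This only works against crawlers that send an honest User-Agent header, of course; a bot that lies about its identity sails past both robots.txt and this kind of rule.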

dwsmart, to SEO
@dwsmart@seocommunity.social avatar

So the old robots.txt tool from Google is truly gone; it now redirects to the new report. Quick reminder that if you're missing the ability to test rules & URLs, @jammer_volts & I made this tool https://tamethebots.com/tools/robotstxt-checker that uses the official parser and lets you test one or many URLs, and export the results.

A more familiar alternative to the deprecated tool is the excellent https://www.realrobotstxt.com/ from Will Critchlow.

Merkle's https://technicalseo.com/tools/robots-txt/ allows you to test a URL & its resources

elias, to SEO
@elias@seocommunity.social avatar

Even before Google removed the robots tester tool, it was much better to have a bulk* robots tester.

bulk: test a bunch of URLs across a bunch of user-agents (all combinations) in one go.

It's still here, it's also open-source:
https://lnkd.in/d9yjtud

How to use:
https://www.youtube.com/watch?v=s0Dx0QV9iEs
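
The "bulk" idea described above, every path tested against every user agent in one go, can be sketched with the standard library alone; the rules, agents, and paths below are just examples:

```python
# Bulk robots.txt check: all (user agent, path) combinations in one pass.
from urllib.robotparser import RobotFileParser

ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: GPTBot
Disallow: /
"""

def bulk_check(robots_txt, user_agents, paths):
    """Map every (agent, path) pair to True (allowed) or False (blocked)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {(ua, p): parser.can_fetch(ua, p)
            for ua in user_agents for p in paths}

results = bulk_check(ROBOTS, ["GPTBot", "Googlebot"], ["/", "/private/page"])
for (ua, path), allowed in sorted(results.items()):
    print(f"{ua:10} {path:14} {'allowed' if allowed else 'blocked'}")
```

Note this uses Python's parser, not Google's open-source C++ parser that the tools above wrap, so edge-case behavior can differ slightly.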

tdp_org, to SEO
@tdp_org@mastodon.social avatar

We recently noticed a fair bit of traffic on www.bbc.co.uk & www.bbc.com from a User Agent which identifies itself as "ByteSpider" (& has a @bytedance.com email address).

Lots of docs on the web state it doesn't obey robots.txt but ByteDance have told us it does:

> ...in the robots.txt files
> user-agent:Bytespider
> Disallow:/

Thought that might be worth documenting, as it might be a recent change & several of us searched but found zero docs from ByteDance.

j9t, to web
@j9t@mas.to avatar

robots.txt 1994–2023:

User-agent: *
Disallow:

robots.txt 2023–?:

User-agent: CCBot
User-agent: ChatGPT-User
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: Omgilibot
Disallow: /

travissouthard, to webdev
@travissouthard@jawns.club avatar

Not sure if I totally trust that it will be effective, but I did add a robots.txt file and a robots meta tag to my website to tryyyyy to block bots from crawling it to feed AI models.

The blog I based my changes on: https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
The changes I made: https://github.com/travissouthard/travissouthard.github.io/pull/27
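
For reference, the meta-tag side of such changes might look like the snippet below. `noindex` is the standard directive (which also removes you from search); `noai` and `noimageai` are nonstandard hints popularized by Neil Clarke's post that only some scrapers recognize, so a site that still wants search indexing would use only the latter:

```html
<!-- Standard: ask compliant crawlers not to index this page at all -->
<meta name="robots" content="noindex">

<!-- Nonstandard hints aimed at AI scrapers; honored only by some tools -->
<meta name="robots" content="noai, noimageai">
```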

clarkesworld, to random
@clarkesworld@mastodon.online avatar

If you've been blocking AI from scraping your website, there's another one to add. This time it's Google.

I've updated my post on the subject.

https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

marcel, to random German
@marcel@waldvogel.family avatar

The New York Times has been negotiating with OpenAI over copyright fees for some time. According to reports, OpenAI used New York Times material without authorization to train ChatGPT. That could get expensive in several respects.

A 🧵

https://marcel-waldvogel.ch/2023/08/20/todesstoss-fuer-chatgpt/

marcel,
@marcel@waldvogel.family avatar

Addendum:

If you want to take action yourself, you can already stop a few AI crawlers from ingesting any further content from your own websites. Many questions remain open, though.

https://netfuture.ch/2023/07/blocking-ai-crawlers-robots-txt-chatgpt/

juliemoynat, to ChatGPT French
@juliemoynat@eldritch.cafe avatar

Blocking ChatGPT so it doesn't use the content of your websites is easy. In the robots.txt file at the site root, add:

User-agent: GPTBot
Disallow: /

And voilà. Bye bye, disgusting AI. 👋

dirkhaun, to ChatGPT German

PSA: If you're running @writefreely, make sure your server is set up to serve a robots.txt so that you can block bots you don't want gobbling up the contents of your website (looking at you).

Something like

location /robots.txt {
    alias /complete/path/to/your/robots.txt;
}

in your configuration.

A wordier version of this can be found here: https://blog.tinycities.net/dirkhaun/robots-txt-chatgpt-and-writefreely 🙈

BenjaminHCCarr, to OpenAI
@BenjaminHCCarr@hachyderm.io avatar

Now you can block OpenAI's web crawler
OpenAI now lets you block its web crawler from scraping your site to help train models. OpenAI said website operators can specifically disallow its crawler in their site's robots.txt file or block its IP address.
https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai

phillycodehound, to random
@phillycodehound@masto.ai avatar

Google wants to change Robots.txt. Oh come on. They just want access to everything to scrape.

#GoogleIsEvil

phillycodehound,
@phillycodehound@masto.ai avatar

@rustybrick do you think Google will just ignore robots.txt? I mean they'd love to be able to train on everything.

Though I would love more controls for stopping AI from scraping without blocking my sites from Search

atomicpoet, to fediverse

Google just announced their next social media venture: Perspectives.

Perspectives is a tab that will showcase social media posts in their search results. This will give -- you guessed it -- "perspectives" on current events and other matters as well.

While Google is positioning this as generating results from all social media, it also has massive implications for the Fediverse.

I've been saying for a long time that if we Fediverse developers don't nail search soon, Google will eat our lunch.

Well, it looks like they've just sat down at a table and are studying the menu right now -- because I fully expect the Fediverse to be present on that Perspectives tab, especially since the Fediverse is now generating 1 billion+ posts each month.

What is the next logical step for Google?

If I were putting on my Google product development hat, I'd push for full text search with near-instant results. This is very easy for Google to do. Their engineers could probably build it fast.

Meanwhile, the Fediverse is practically giving away Fediverse search to Google -- Google is what most people use to find posts on the Fediverse right now.

Are we just going to allow Google to extend their search monopoly into the Fediverse, and without a fight too?

https://www.engadget.com/google-searchs-new-perspectives-tab-will-highlight-forum-and-social-media-posts-175209372.html

@socialmedianews @fediversenews

chris,

@atomicpoet @fediversenews @socialmedianews

It's time for the Fediverse to learn about robots.txt files and add server-level (and perhaps user-level) features that instruct search engines like Google about which profiles are welcome for external indexing, and which are not.

https://www.robotstxt.org/robotstxt.html
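
A server-level version of that idea could be sketched as follows; the user structure, agent list, and `/@username` URL scheme here are hypothetical, not any actual Fediverse project's API:

```python
# Hypothetical sketch: render a robots.txt from per-user indexing
# preferences, so opted-out profiles get their own Disallow lines.
def render_robots(users, blocked_agents=("GPTBot", "CCBot")):
    lines = []
    # Block AI training crawlers outright.
    for agent in blocked_agents:
        lines.append(f"User-agent: {agent}")
    lines.append("Disallow: /")
    lines.append("")
    # For everyone else, hide only the profiles that opted out.
    lines.append("User-agent: *")
    for user in users:
        if not user["indexable"]:
            lines.append(f"Disallow: /@{user['name']}")
    return "\n".join(lines) + "\n"

print(render_robots([
    {"name": "alice", "indexable": False},
    {"name": "bob", "indexable": True},
]))
```

The server would serve this generated text at `/robots.txt` and regenerate it whenever a user toggles their indexing preference.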
