TheDJ, to foss
@TheDJ@mastodon.social avatar

When you see some of what WMF has to deal with in terms of massive scraping of not only Wikipedia, but also the gigantic codebase and ticketing system and many of the other services, you start to understand why GitHub killed off unauthenticated search.

Commercial and inconsiderate activity really is grinding the open Internet to a standstill at times.

mima, to fediverse

Hmm, I probably have the most ridiculous #robotstxt for a #Misskey instance right now, lol. I just want to let #Mojeek and #Marginalia crawl #Makai while keeping out #Google and the AI scrapers... :satrithink:

If there are other user agents of independent #searchengines I should allow in https://makai.chaotic.ninja/robots.txt, please let me know! I'm actually looking for the #useragent strings of #SauceNAO, #TinEye, and #IQDB so I can let them fetch our media for their reverse image search.

User-Agent: MojeekBot
User-Agent: FeedFetcher-Mojeek
User-Agent: search.marginalia.nu
Allow: /
Allow: /notes
Disallow: /admin
Disallow: /settings
Disallow: /my/

User-Agent: *
User-Agent: Googlebot
User-Agent: Google-Extended
User-Agent: GoogleOther
User-Agent: AdsBot-Google
User-Agent: AdsBot-Google-Mobile
User-Agent: Mediapartners-Google
User-Agent: CCBot
User-Agent: ChatGPT-User
User-Agent: GPTBot
User-Agent: Omgilibot
User-Agent: omgili
User-Agent: FacebookBot
User-Agent: Twitterbot
User-Agent: cohere-ai
User-Agent: anthropic-ai
User-Agent: Bytespider
User-Agent: Amazonbot
User-Agent: Applebot
User-Agent: PerplexityBot
User-Agent: YouBot
User-Agent: AwarioRssBot
User-Agent: AwarioSmartBot
User-Agent: ClaudeBot
User-Agent: Claude-Web
User-Agent: DataForSeoBot
User-Agent: FriendlyCrawler
User-Agent: ImagesiftBot
User-Agent: magpie-crawler
User-Agent: Meltwater
User-Agent: peer39_crawler
User-Agent: PiplBot
User-Agent: Seekr
Disallow: /

# todo: sitemap

#sysadmin #fediadmin
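For anyone wanting to sanity-check rules like these locally, Python's standard library ships a robots.txt parser. A minimal sketch using a condensed version of the rules above; note that `urllib.robotparser` applies rules first-match within a group, so the specific `Disallow` lines are placed before `Allow: /` here:

```python
# Check a robots.txt policy locally with Python's stdlib parser.
# The rules are a condensed version of the file in the post above.
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: MojeekBot
Disallow: /admin
Allow: /

User-Agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# MojeekBot may crawl notes but not admin pages; GPTBot gets nothing.
print(rp.can_fetch("MojeekBot", "https://makai.chaotic.ninja/notes"))  # True
print(rp.can_fetch("MojeekBot", "https://makai.chaotic.ninja/admin"))  # False
print(rp.can_fetch("GPTBot", "https://makai.chaotic.ninja/notes"))     # False
```

Bear in mind the stdlib parser's first-match behaviour differs from Google's longest-match rule, so edge cases can disagree between the two.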

dwsmart, to SEO
@dwsmart@seocommunity.social avatar

@methode and co added a new library to the awesome open-source Google robots.txt parser. We've added it to @jammer_volts' and my robots.txt tool here: https://tamethebots.com/tools/robotstxt-checker

The last column now shows the robots.txt status, lines parsed, how many valid directives the robots.txt file has, and any that are not understood, for each URL tested.
#seo #robotstxt

Taffer, to llm
@Taffer@mastodon.gamedev.place avatar

I was going to ask if there’s some robots.txt magic that’ll keep LLM scrapers out.

Then I thought of a better idea.

Is there a source of text/images that I can toss on there that’ll poison “AI” scrapers?

dansup, to random
@dansup@mastodon.social avatar

I updated https://fedidb.org to return the proper latest software version

ex: https://fedidb.org/software/mastodon/versions

Next is finishing robots.txt parsing support so we honor instance admins' privacy.

Be warned: this will drop stats across the board, but I think it's better to respect admins and provide less accurate overall stats.

AdamBishop, to ArtificialIntelligence
@AdamBishop@floss.social avatar

The robots.txt file I add to sites to ward off scraping by AI data-collecting bots, while still allowing search indexing, is getting longer and longer ... 😳

Can we do more than hope and pray it gets respected?

Legislation?

#consentFirst #robots #robotsTxt #ai #GDPR #consent #dataPrivacy #piracy #copyright
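One way to tame an ever-growing blocklist is to generate it from a plain list of agent names. A minimal sketch; the agent names are examples drawn from this thread, not a complete or authoritative list:

```python
# Render a robots.txt that blocks the listed AI crawlers while
# leaving the site open to every other agent (i.e. search indexing).
# Agent names are examples from this thread; extend as new bots appear.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "CCBot", "Google-Extended",
    "FacebookBot", "Omgilibot", "Bytespider", "ClaudeBot",
]

def render_robots(blocked):
    # One group sharing a single "Disallow: /" for all blocked agents,
    # followed by a wildcard group with an empty (allow-all) Disallow.
    lines = [f"User-agent: {bot}" for bot in blocked]
    lines += ["Disallow: /", "", "User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"

print(render_robots(AI_BOTS))
```

Regenerating the file whenever a new bot name surfaces keeps the group structure consistent instead of hand-editing an ever-longer file.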

dwsmart, to SEO
@dwsmart@seocommunity.social avatar

So the old robots.txt tool from Google is truly gone; it now redirects to the new report. Quick reminder that if you're missing the ability to test rules & URLs, @jammer_volts & I made this tool https://tamethebots.com/tools/robotstxt-checker that uses the official parser and lets you test one or many URLs and export the results.

A tool more familiar to users of the deprecated one is the excellent https://www.realrobotstxt.com/ from Will Crichlow.

Merkle's https://technicalseo.com/tools/robots-txt/ lets you test a URL & its resources.

elias, to SEO

Even before Google removed the robots tester tool, it was much better to have a bulk* robots tester.

bulk: test a bunch of URLs across a bunch of user-agents (all combinations) in one go.

It's still here, it's also open-source:
https://lnkd.in/d9yjtud

How to use:
https://www.youtube.com/watch?v=s0Dx0QV9iEs
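The bulk idea is straightforward to sketch with the standard library: parse the file once, then check every (user-agent, URL) combination. The rules and names below are illustrative, not the tool's actual code:

```python
# A minimal bulk robots.txt tester: every URL is checked against every
# user-agent (all combinations), as described in the post above.
from itertools import product
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

urls = ["https://example.com/", "https://example.com/private/x"]
agents = ["GPTBot", "Googlebot"]

# Cartesian product of agents and URLs, one verdict per pair.
results = {
    (agent, url): rp.can_fetch(agent, url)
    for agent, url in product(agents, urls)
}
for (agent, url), ok in results.items():
    print(f"{agent:10} {url:35} {'allowed' if ok else 'blocked'}")
```

In a real tool you would fetch each site's live robots.txt rather than parse a hardcoded string, but the matching logic is the same.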

tdp_org, to SEO
@tdp_org@mastodon.social avatar

We recently noticed a fair bit of traffic on www.bbc.co.uk & www.bbc.com from a User Agent which identifies itself as "ByteSpider" (& has a @bytedance.com email address).

Lots of docs on the web state that it doesn't obey robots.txt, but ByteDance have told us it does:

> ...in the robots.txt files
> user-agent:Bytespider
> Disallow:/

Thought that might be worth documenting, as it might be a recent change & several of us searched but found zero docs from ByteDance.

j9t, to web
@j9t@mas.to avatar

robots.txt 1994–2023:

User-agent: *
Disallow:

robots.txt 2023–?:

User-agent: CCBot
User-agent: ChatGPT-User
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: Omgilibot
Disallow: /

travissouthard, to webdev
@travissouthard@jawns.club avatar

Not sure I totally trust that it will be effective, but I did add a robots.txt file and a robots meta tag to my website to tryyyyy to block bots crawling my website to feed AI models.

The blog I based my changes on: https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
The changes I made: https://github.com/travissouthard/travissouthard.github.io/pull/27

clarkesworld, to random
@clarkesworld@mastodon.online avatar

If you've been blocking AI from scraping your website, there's another one to add. This time it's Google.

I've updated my post on the subject.

https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

marcel, to random German
@marcel@waldvogel.family avatar

The New York Times has been negotiating with OpenAI for some time over copyright fees. According to reports, OpenAI used New York Times material without authorization to train ChatGPT. That could get expensive in more ways than one.

A 🧵

https://marcel-waldvogel.ch/2023/08/20/todesstoss-fuer-chatgpt/

marcel,
@marcel@waldvogel.family avatar

Addendum:

If you want to take action yourself, you can already stop a few AI crawlers from ingesting the content of your own websites going forward. Many questions remain open, though.

https://netfuture.ch/2023/07/blocking-ai-crawlers-robots-txt-chatgpt/

juliemoynat, to ChatGPT French
@juliemoynat@eldritch.cafe avatar

Blocking ChatGPT from using the content of your websites is easy. In the robots.txt file at the site root, add:

User-agent: GPTBot
Disallow: /

And voilà. Bye bye, gross AI. 👋

dirkhaun, to ChatGPT German

PSA: If you're running @writefreely, make sure your server is set up to serve a robots.txt so that you can block bots you don't want gobbling up the contents of your website (looking at you, ).

Something like

location /robots.txt {
alias /complete/path/to/your/robots.txt;
}

in your configuration.

A wordier version of this can be found here: https://blog.tinycities.net/dirkhaun/robots-txt-chatgpt-and-writefreely 🙈

BenjaminHCCarr, to OpenAI
@BenjaminHCCarr@hachyderm.io avatar

Now you can block OpenAI's web crawler.
OpenAI now lets you block its web crawler from scraping your site to help train its models. OpenAI said website operators can specifically disallow its crawler in their site's robots.txt file or block its IP address.
https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai

phillycodehound, to random
@phillycodehound@masto.ai avatar

Google wants to change Robots.txt. Oh come on. They just want access to everything to scrape.

#GoogleIsEvil

phillycodehound,
@phillycodehound@masto.ai avatar

@rustybrick do you think they'll just ignore robots.txt? I mean, they'd love to be able to train on everything.

Though I would love more controls for stopping AI scraping without blocking my sites from Search.

atomicpoet, to fediverse

Google just announced their next social media venture: Perspectives.

Perspectives is a tab that will showcase social media posts in search results. This will give -- you guessed it -- "perspectives" on current events and other matters.

While Google is positioning this as a way of surfacing results from all social media, it also has massive implications for the Fediverse.

I've been saying for a long time that if we Fediverse developers don't nail search soon, Google will eat our lunch.

Well, it looks like they've just sat down at a table and are studying the menu right now -- because I fully expect that the Fediverse will be present on that Perspectives tab, especially since the Fediverse is now generating 1 billion+ posts each month.

What is the next logical step for Google?

If I were putting on my Google product development hat, I'd push for full text search with near-instant results. This is very easy for Google to do. Their engineers could probably build it fast.

Meanwhile, the Fediverse is practically giving away Fediverse search to Google -- Google is what most people use to find posts on the Fediverse right now.

Are we just going to allow Google to extend their search monopoly into the Fediverse, and without a fight too?

https://www.engadget.com/google-searchs-new-perspectives-tab-will-highlight-forum-and-social-media-posts-175209372.html

@socialmedianews @fediversenews

chris,

@atomicpoet @fediversenews @socialmedianews

It's time for the Fediverse to learn about robots.txt files and add server-level (and perhaps user-level) features that instruct search engines like Google about which profiles are welcome for external indexing and which are not.

https://www.robotstxt.org/robotstxt.html
