TheDJ, to foss
@TheDJ@mastodon.social avatar

When you see some of what WMF has to deal with in terms of massive scraping of not only Wikipedia, but also the gigantic codebase and ticketing system and many of the other services, you start to understand why GitHub killed off unauthenticated search.

Commercial and inconsiderate activity really is grinding the open Internet to a standstill at times.

mima, to fediverse

Hmm, I probably have the most ridiculous #robotstxt for a #Misskey instance right now, lol. I just want to let #Mojeek and #Marginalia crawl #Makai while keeping out #Google and the AI scrapers... :satrithink:

If there are other user agents of independent #searchengines I should allow in https://makai.chaotic.ninja/robots.txt, please let me know! I'm actually looking for the #useragent strings of #SauceNAO, #TinEye, and #IQDB so I can let them fetch our media for their reverse image search.

User-Agent: MojeekBot
User-Agent: FeedFetcher-Mojeek
User-Agent: search.marginalia.nu
Allow: /
Allow: /notes
Disallow: /admin
Disallow: /settings
Disallow: /my/

User-Agent: *
User-Agent: Googlebot
User-Agent: Google-Extended
User-Agent: GoogleOther
User-Agent: AdsBot-Google
User-Agent: AdsBot-Google-Mobile
User-Agent: Mediapartners-Google
User-Agent: CCBot
User-Agent: ChatGPT-User
User-Agent: GPTBot
User-Agent: Omgilibot
User-Agent: omgili
User-Agent: FacebookBot
User-Agent: Twitterbot
User-Agent: cohere-ai
User-Agent: anthropic-ai
User-Agent: Bytespider
User-Agent: Amazonbot
User-Agent: Applebot
User-Agent: PerplexityBot
User-Agent: YouBot
User-Agent: AwarioRssBot
User-Agent: AwarioSmartBot
User-Agent: ClaudeBot
User-Agent: Claude-Web
User-Agent: DataForSeoBot
User-Agent: FriendlyCrawler
User-Agent: ImagesiftBot
User-Agent: magpie-crawler
User-Agent: Meltwater
User-Agent: peer39_crawler
User-Agent: PiplBot
User-Agent: Seekr
Disallow: /

# todo: sitemap

#sysadmin #fediadmin
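For anyone wanting to sanity-check rules like these locally, Python's standard library ships a robots.txt parser. A minimal sketch using a condensed version of the rules above; note that `urllib.robotparser` applies rules first-match within a group, so the specific `Disallow` lines are placed before `Allow: /` here:

```python
# Check a robots.txt policy locally with Python's stdlib parser.
# The rules are a condensed version of the file in the post above.
from urllib.robotparser import RobotFileParser

rules = """\
User-Agent: MojeekBot
Disallow: /admin
Allow: /

User-Agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# MojeekBot may crawl notes but not admin pages; GPTBot gets nothing.
print(rp.can_fetch("MojeekBot", "https://makai.chaotic.ninja/notes"))  # True
print(rp.can_fetch("MojeekBot", "https://makai.chaotic.ninja/admin"))  # False
print(rp.can_fetch("GPTBot", "https://makai.chaotic.ninja/notes"))     # False
```

Bear in mind the stdlib parser's first-match behaviour differs from Google's longest-match rule, so edge cases can disagree between the two.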

dwsmart, to SEO
@dwsmart@seocommunity.social avatar

@methode and co added a new library to the awesome open-source Google robots.txt parser. We've added it to @jammer_volts' and my robots.txt tool here: https://tamethebots.com/tools/robotstxt-checker

The last column now shows the robots.txt status, lines parsed, how many valid directives the robots.txt file has, and any that are not understood, for each URL tested.
#seo #robotstxt

Taffer, to llm
@Taffer@mastodon.gamedev.place avatar

I was going to ask if there’s some robots.txt magic that’ll keep LLM scrapers out.

Then I thought of a better idea.

Is there a source of text/images that I can toss on there that’ll poison “AI” scrapers?

dansup, to random
@dansup@mastodon.social avatar

I updated https://fedidb.org to return the proper latest software version

ex: https://fedidb.org/software/mastodon/versions

Next is finishing robots.txt parsing support so we honor instance admins' privacy.

Be warned: this will drop stats across the board, but I think it's better to respect admins and provide less accurate overall stats.

AdamBishop, to ArtificialIntelligence
@AdamBishop@floss.social avatar

The robots.txt file I add to sites to ward off scraping by AI data-collecting bots, while still allowing search indexing, is getting longer and longer ... 😳

Can we do more than hope and pray it gets respected?

Legislation?

#consentFirst #robots #robotsTxt #ai #GDPR #consent #dataPrivacy #piracy #copyright
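One way to tame an ever-growing blocklist is to generate it from a plain list of agent names. A minimal sketch; the agent names are examples drawn from this thread, not a complete or authoritative list:

```python
# Render a robots.txt that blocks the listed AI crawlers while
# leaving the site open to every other agent (i.e. search indexing).
# Agent names are examples from this thread; extend as new bots appear.
AI_BOTS = [
    "GPTBot", "ChatGPT-User", "CCBot", "Google-Extended",
    "FacebookBot", "Omgilibot", "Bytespider", "ClaudeBot",
]

def render_robots(blocked):
    # One group sharing a single "Disallow: /" for all blocked agents,
    # followed by a wildcard group with an empty (allow-all) Disallow.
    lines = [f"User-agent: {bot}" for bot in blocked]
    lines += ["Disallow: /", "", "User-agent: *", "Disallow:"]
    return "\n".join(lines) + "\n"

print(render_robots(AI_BOTS))
```

Regenerating the file whenever a new bot name surfaces keeps the group structure consistent instead of hand-editing an ever-longer file.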

dwsmart, to SEO
@dwsmart@seocommunity.social avatar

So the old robots.txt tool from Google is truly gone; it now redirects to the new report. Quick reminder that if you're missing the ability to test rules & URLs, @jammer_volts & I made this tool https://tamethebots.com/tools/robotstxt-checker that uses the official parser and lets you test one or many URLs and export the results.

A tool more familiar to users of the deprecated one is the excellent https://www.realrobotstxt.com/ from Will Crichlow.

Merkle's https://technicalseo.com/tools/robots-txt/ lets you test a URL & its resources.

elias, to SEO

Even before Google removed the robots tester tool, it was much better to have a bulk* robots tester.

bulk: test a bunch of URLs across a bunch of user-agents (all combinations) in one go.

It's still here, it's also open-source:
https://lnkd.in/d9yjtud

How to use:
https://www.youtube.com/watch?v=s0Dx0QV9iEs
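The bulk idea is straightforward to sketch with the standard library: parse the file once, then check every (user-agent, URL) combination. The rules and names below are illustrative, not the tool's actual code:

```python
# A minimal bulk robots.txt tester: every URL is checked against every
# user-agent (all combinations), as described in the post above.
from itertools import product
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

urls = ["https://example.com/", "https://example.com/private/x"]
agents = ["GPTBot", "Googlebot"]

# Cartesian product of agents and URLs, one verdict per pair.
results = {
    (agent, url): rp.can_fetch(agent, url)
    for agent, url in product(agents, urls)
}
for (agent, url), ok in results.items():
    print(f"{agent:10} {url:35} {'allowed' if ok else 'blocked'}")
```

In a real tool you would fetch each site's live robots.txt rather than parse a hardcoded string, but the matching logic is the same.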

tdp_org, to SEO
@tdp_org@mastodon.social avatar

We recently noticed a fair bit of traffic on www.bbc.co.uk & www.bbc.com from a User Agent which identifies itself as "ByteSpider" (& has a @bytedance.com email address).

Lots of docs on the web state that it doesn't obey robots.txt, but ByteDance have told us it does:

> ...in the robots.txt files
> user-agent:Bytespider
> Disallow:/

Thought that might be worth documenting, as it might be a recent change & several of us searched but found zero docs from ByteDance.

j9t, to web
@j9t@mas.to avatar

robots.txt 1994–2023:

User-agent: *
Disallow:

robots.txt 2023–?:

User-agent: CCBot
User-agent: ChatGPT-User
User-agent: FacebookBot
User-agent: Google-Extended
User-agent: GPTBot
User-agent: Omgilibot
Disallow: /

travissouthard, to webdev
@travissouthard@jawns.club avatar

Not sure I totally trust that it will be effective, but I did add a robots.txt file and a robots meta tag to my website to tryyyyy to block bots crawling my website to feed AI models.

The blog I based my changes on: https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/
The changes I made: https://github.com/travissouthard/travissouthard.github.io/pull/27

clarkesworld, to random
@clarkesworld@mastodon.online avatar

If you've been blocking AI from scraping your website, there's another one to add. This time it's Google.

I've updated my post on the subject.

https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

marcel, to random German
@marcel@waldvogel.family avatar

The New York Times has been negotiating with OpenAI for some time over copyright fees. According to reports, OpenAI used New York Times material without authorization to train ChatGPT. That could get expensive in more ways than one.

A 🧵

https://marcel-waldvogel.ch/2023/08/20/todesstoss-fuer-chatgpt/

marcel,
@marcel@waldvogel.family avatar

Addendum:

If you want to take action yourself, you can already stop a few AI crawlers from ingesting the content of your own websites going forward. Many questions remain open, though.

https://netfuture.ch/2023/07/blocking-ai-crawlers-robots-txt-chatgpt/

juliemoynat, to ChatGPT French
@juliemoynat@eldritch.cafe avatar

Blocking ChatGPT from using the content of your websites is easy. In the robots.txt file at the site root, add:

User-agent: GPTBot
Disallow: /

And voilà. Bye bye, gross AI. 👋

dirkhaun, to ChatGPT German

PSA: If you're running @writefreely, make sure your server is set up to serve a robots.txt so that you can block bots you don't want gobbling up the contents of your website (looking at you, ).

Something like

location /robots.txt {
alias /complete/path/to/your/robots.txt;
}

in your configuration.

A wordier version of this can be found here: https://blog.tinycities.net/dirkhaun/robots-txt-chatgpt-and-writefreely 🙈

BenjaminHCCarr, to OpenAI
@BenjaminHCCarr@hachyderm.io avatar

Now you can block OpenAI's web crawler.
OpenAI now lets you block its web crawler from scraping your site to help train its models. OpenAI said website operators can specifically disallow its crawler in their site's robots.txt file or block its IP address.
https://www.theverge.com/2023/8/7/23823046/openai-data-scrape-block-ai

phillycodehound, to random
@phillycodehound@masto.ai avatar

Google wants to change Robots.txt. Oh come on. They just want access to everything to scrape.

#GoogleIsEvil

phillycodehound,
@phillycodehound@masto.ai avatar

@rustybrick do you think they'll just ignore robots.txt? I mean, they'd love to be able to train on everything.

Though I would love more controls for stopping AI scraping without blocking my sites from Search.

atomicpoet, to fediverse

Google just announced their next social media venture: Perspectives.

Perspectives is a tab that will showcase social media posts in search results. This will give -- you guessed it -- "perspectives" on current events and other matters.

While Google is positioning this as a way of surfacing results from all social media, it also has massive implications for the Fediverse.

I've been saying for a long time that if we Fediverse developers don't nail search soon, Google will eat our lunch.

Well, it looks like they've just sat down at a table and are studying the menu right now -- because I fully expect that the Fediverse will be present on that Perspectives tab, especially since the Fediverse is now generating 1 billion+ posts each month.

What is the next logical step for Google?

If I were putting on my Google product development hat, I'd push for full text search with near-instant results. This is very easy for Google to do. Their engineers could probably build it fast.

Meanwhile, the Fediverse is practically giving away Fediverse search to Google -- Google is what most people use to find posts on the Fediverse right now.

Are we just going to allow Google to extend their search monopoly into the Fediverse, and without a fight too?

https://www.engadget.com/google-searchs-new-perspectives-tab-will-highlight-forum-and-social-media-posts-175209372.html

@socialmedianews @fediversenews

chris,

@atomicpoet @fediversenews @socialmedianews

It's time for the Fediverse to learn about robots.txt files and add server-level (and perhaps user-level) features that instruct search engines like Google about which profiles are welcome for external indexing and which are not.

https://www.robotstxt.org/robotstxt.html
