cory,
@cory@social.lol avatar
roelant,
@roelant@eu.mastodon.green avatar

@cory I hear you! I did the same last year:

https://roelant.net/en/2023/im-blocking-ai-crawlers/

For this I used the information from the blog below, which misses some of yours, but has some others. And some other suggestions too, such as going beyond robots.txt and using the newly minted ai.txt.

https://neil-clarke.com/block-the-bots-that-feed-ai-models-by-scraping-your-website/

cory,
@cory@social.lol avatar

@roelant Fantastic! I’ll take a look and update things.

roelant,
@roelant@eu.mastodon.green avatar

@cory I’ll be doing the same with the info from your post, thanks for sharing. 😊

cory,
@cory@social.lol avatar

@roelant Absolutely! Thank you! The more everyone finds and shares the easier it is to crowdsource.

cory,
@cory@social.lol avatar

@roelant Updated and added links to your post and Neil’s (once my site rebuilds itself)

koolinus,
@koolinus@toot.community avatar

@cory I have blocked also those in my .htaccess file

cory,
@cory@social.lol avatar

@koolinus that's the most effective approach I think, provided they identify themselves (short of that we're looking at IP addresses).

Mojeek,
@Mojeek@mastodon.social avatar

@cory can't help but think that even more would if it was a meta tag like https://noml.info/ and not a long list on robots.txt

cory,
@cory@social.lol avatar

@Mojeek Oh I’m all for both 😅

Mojeek,
@Mojeek@mastodon.social avatar

@cory Block Ahoy!

jake4480,
@jake4480@c.im avatar

@cory another great piece! I especially like what you said about Copilot, "a mild improvement over traditional autocomplete" which is an interesting comparison. We've had autocomplete for so long, but it ALWAYS needs to be checked, just like Copilot. I never/very rarely even use autocomplete. I have to edit and reread all my own writing anyway, WHY would I add another layer?! 😂

cory,
@cory@social.lol avatar

@jake4480 Thanks bud! And that’s with me using just the LSP in Sublime, not the whole VS Code integration (it really bothers me how much of the dev tool chain MSFT has scooped up). ChatGPT is even worse and totally untrustworthy imho.

jake4480,
@jake4480@c.im avatar

@cory it's insane. And like in another piece I saw, it feels like EVERY DAY we're seeing lists of companies adding it. I hope/feel like it's a bubble/all hype. Like many do. My question is, do companies honestly think AI is the future, or are they begrudgingly adding it in, like well, I guess we gotta add in this shit now to keep up with all the other companies.... 🤣

cory,
@cory@social.lol avatar

@jake4480 Oh I think it’s corporate FOMO — all about pleasing speculative investors and not users. Maybe I’m the outlier, but I don’t want it at all.

jake4480,
@jake4480@c.im avatar

@cory absolutely. Investors. It's gonna mess stuff up and/or potentially evolve into something even more stupid. I can't imagine it. But if there's anything I've learned, it's that there's ALWAYS something more stupid right around the bend. 🤣

cory,
@cory@social.lol avatar

@jake4480 We’ve gone through social media, IoT, gig work, crypto and now AI and I can’t say any of it’s been of tremendous benefit.

jake4480,
@jake4480@c.im avatar

@cory oh man. You put them all in one sentence like that, and wow. No. Yuck. 🤣 Here is great if it counts as social media, I guess it does. But all the big crappy companies that do it, nah. It's questionable benefit over even such a short amount of time. Sometimes I think about that. The thousands of years of humanity (and even LESS with tech) in the scheme of the millions of this planet, it's wild. 🤣 such a tiny blip.

cory,
@cory@social.lol avatar

@jake4480 Hahaha — we never should’ve privatized the public discourse (and so many other things but 🇺🇸). I saw someone else make a comment the other day along the lines of 200 years ago the average lifespan was 39 years old and now look at what LeBron James is doing at that age and I keep spinning that around in my head. 😅

jake4480,
@jake4480@c.im avatar

@cory ah, privatization. The bane of everything decent and helpful. 🤣

analog_cafe,
@analog_cafe@mas.to avatar

@cory the list has grown in the past few months! I wonder if there’s a repo or even a package that can help keep this mushrooming madness under control.

cory,
@cory@social.lol avatar

@analog_cafe https://darkvisitors.com is an excellent resource — an API would be awesome though so that any framework could reasonably consume it.

analog_cafe,
@analog_cafe@mas.to avatar

@cory any thoughts on Common Crawl bot? I understand it’s been used for AI training but if I remember correctly, it’s also used by indie search engines 🤔

cory,
@cory@social.lol avatar

@analog_cafe I’ve got it blocked myself — I respect that indie search engines use it but if it’s also leveraged by a significant number of LLMs I’d come down on the side of blocking it.

barrysampson,
@barrysampson@social.lol avatar

@cory Really useful. Thanks!

cory,
@cory@social.lol avatar

@barrysampson Sure thing!

paul,
@paul@fedi.nlpagan.net avatar

@cory Thank you so much.

cory,
@cory@social.lol avatar

@paul Absolutely!

paul,
@paul@fedi.nlpagan.net avatar

@cory Immediately replaced the less elaborate version on 5 sites. ☺️

cory,
@cory@social.lol avatar

@paul Perfect! The more folks that adopt it the better imho. 🙌🏻

paul,
@paul@fedi.nlpagan.net avatar

@cory I agree entirely. Also mailed it to 2 friends. The more the merrier. If I run into more code, I'll let you know.

cory,
@cory@social.lol avatar

@paul Excellent, thank you!

cliophate,
@cliophate@overkill.social avatar

@cory @mako main problem right now is that I doubt these companies even respect robots.txt files.

cory,
@cory@social.lol avatar

@cliophate @mako That’s very much a concern — and makes their behavior and claims to any sort of legitimacy all the more dubious. I’d imagine you could redirect crawlers based on user agents at the server/request level as well, but that assumes they’re setting anything in good faith.
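For anyone wanting to try the request-level idea, here is a minimal sketch in Python using a plain WSGI middleware. It is only an illustration under assumptions: the token list is a small example rather than a complete or authoritative one, the names `BLOCKED_AGENT_TOKENS`, `is_blocked` and `block_ai_crawlers` are made up for this example, and it only catches crawlers that send an honest User-Agent header.

```python
# Illustrative tokens only; a real list would be longer and needs upkeep.
BLOCKED_AGENT_TOKENS = ("gptbot", "ccbot")


def is_blocked(user_agent):
    """Return True if the User-Agent contains a blocked token (case-insensitive)."""
    ua = (user_agent or "").lower()
    return any(token in ua for token in BLOCKED_AGENT_TOKENS)


def block_ai_crawlers(app):
    """Wrap a WSGI app and answer 403 to requests from blocked user agents."""
    def middleware(environ, start_response):
        if is_blocked(environ.get("HTTP_USER_AGENT")):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden\n"]
        # Everything else is passed through to the wrapped application.
        return app(environ, start_response)
    return middleware
```

Returning a 403 instead of a redirect is a choice made for this sketch; the same check could sit in front of a redirect. A crawler that spoofs a browser User-Agent passes straight through, which is where IP-based blocking would come in.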

sangster,
@sangster@macaw.social avatar

@cory possibly a dumb question, but: what’s the process for adding a robots.txt on a Netlify Eleventy deploy?

sangster,
@sangster@macaw.social avatar

@cory Ah, found something. Simpler than I expected: https://mikefallows.com/posts/adding-robots-txt-to-eleventy-site/

cory,
@cory@social.lol avatar

@sangster Ah yep! I’ve got mine in a static array that I update and expose via a data file, plus a template that loops through that to spit the final file out.
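The array-plus-loop approach is not tied to Eleventy. As a rough sketch of the same idea in Python (not the actual data file or template from the post): the agent names below are a short illustrative sample, and `AI_CRAWLERS` and `build_robots_txt` are hypothetical names chosen for this example.

```python
# Illustrative sample of crawler names; not a complete list.
AI_CRAWLERS = ["GPTBot", "CCBot", "Google-Extended"]


def build_robots_txt(agents):
    """Render one robots.txt group that disallows every listed agent."""
    if not agents:
        # No agents means nothing to block, so emit an empty file.
        return ""
    lines = [f"User-agent: {agent}" for agent in agents]
    # A single Disallow rule applies to all User-agent lines in the group.
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the generated file; a build step could write it to robots.txt.
    print(build_robots_txt(AI_CRAWLERS), end="")
```

Keeping the names in one list means adding a newly spotted crawler is a one-line change, and the generated file stays consistent.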

sangster,
@sangster@macaw.social avatar

@cory Nice! This is a good reminder that I could probably do with making a sitemap and have never bothered.

cory,
@cory@social.lol avatar

@sangster I just fixed mine recently! The date format was wrong and the final output had p tags in it because it was being generated from a markdown rather than liquid file 🤦🏼‍♂️

ghalldev,
@ghalldev@mastodon.social avatar

@cory Thanks for sharing your robots.txt I was missing a bunch 🙂

cory,
@cory@social.lol avatar

@ghalldev Happy to do it! Feels like something worth putting a concerted distributed effort into.
