Protecting content from LLMs, AI, etc.?

Naturally I’ve been following all the “AI” developments lately, trying things out, and listening to and thinking about the implications. If you already follow me, or saw the April Fools’ Day “AI” we launched, you’ll have a pretty good idea of my general thoughts on it.

This recent Washington Post article brought the whole extraction side to light again, showing how these tools “train” on all the lovely language we publish on this writing platform of ours, among other places across the web. Here’s 170,000 “tokens” from Write.as making it into Google’s dataset alone:

[Screenshot: Write.as ranked #121,142 out of all websites in Google's C4 dataset]

Feedback wanted

This raises the question: how do you feel about your written and visual content feeding these models / "AI"s? As a platform, should we do anything to proactively limit what these companies are building on top of your intellectual property, or does it not matter? (Whatever opinions I may have on the whole thing, any decision around this is totally up to the community – that’s why I’m opening up the discussion.)

Of course, this also applies across all the products in our suite. Large language models (LLMs) like ChatGPT crawl your Write.as posts as well as your social interactions on Remark.as and Writing Exchange. And image synthesis models like DALL-E and Midjourney could “train” on your Snap.as photos and visual creations. So I’d like to think about how this affects all our tools, and if our approach might differ between them or not.

Please feel free to discuss and let me know what you think!


Yes, I think Write.as should proactively limit the scraping for AI training purposes.

To elaborate on this: I can sort of understand using public source code with permissive licences such as MIT or BSD as training data. But the text of most blogs and public websites almost never comes with such licensing options or implicit permissions.


I would appreciate a way to block them should I so choose. Choice is good. That said, I’m not sure of the cost of implementing it, and wouldn’t want it to get in the way of comments, for example.
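For what it’s worth, the mechanics of offering that choice needn’t be heavy. Here’s a rough sketch in Go (the language WriteFreely, the engine behind Write.as, is written in) of a per-blog setting driving what gets served at /robots.txt. The `Blog` struct, the `BlockAICrawlers` field, and the crawler list are all hypothetical stand-ins for illustration, not anything from the actual codebase:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// Blog is a hypothetical stand-in for a blog's stored preferences;
// the real data model would differ.
type Blog struct {
	BlockAICrawlers bool
}

// aiCrawlers lists user agents of known AI training crawlers.
// Illustrative only; it would need to be kept up to date.
var aiCrawlers = []string{"GPTBot", "Google-Extended", "CCBot"}

// robotsHandler serves a robots.txt whose contents depend on the
// blog owner's choice.
func robotsHandler(blog *Blog) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain")
		if blog.BlockAICrawlers {
			for _, ua := range aiCrawlers {
				fmt.Fprintf(w, "User-agent: %s\nDisallow: /\n\n", ua)
			}
		}
		// Search engines and other well-behaved bots stay welcome.
		fmt.Fprint(w, "User-agent: *\nDisallow:\n")
	}
}

func main() {
	blog := &Blog{BlockAICrawlers: true}
	http.HandleFunc("/robots.txt", robotsHandler(blog))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The obvious limitation is that this only deters crawlers that read and honour robots.txt in the first place.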

I’m fine if they want to crawl and use the data, but only if they attribute it to the original source. I’m guessing this attribution piece is not going to happen, so I would say we should block it as much as possible. Let them crawl somewhere else.

Whatever man can think of can, in time, happen.
Our ancestors knew our planet to be spherical. Then someone decided a flat world served a purpose, and most were once convinced our world was flat.
Dream of making a movie, and people ran out of the cinema scared to death when the train on screen came at them.
Dream of flying, and we fly.
Dream of going to the moon: been there, done that.
Think of controlling people’s opinions: it’s happening right under our noses, every day.

Is it all wrong? Not necessarily. This AI stuff is no different.

I do think we’re going to end up living lives in which we can’t trust anything anymore, not even ourselves.

Did I ever write this, or that? AI will tell you yes when in reality it’s no, or vice versa. Your brain can’t remember it all, and its plasticity makes it prone to reshaping, and so changing, reality.

So I am totally against this AI crap. Tomorrow it’s going to tell us something like “Saddam resurrected Jesus,” and we’re all going to be gullible enough to believe it.

So I am in favour of some sort of solution, but at a loss as to which.

Thumbs up, but I do wonder about the practical feasibility of making sure crawlers and AI companies don’t come to steal, to sell, and worse. None of them respect Do Not Track browser requests; it’s a joke. And I can just see silly me taking them to court over it, as if that were remotely feasible.

After millennia of war, violence, and abuse, we’re still totally incapable of preventing them from happening again. I think it will be the same with AI.

Sorry, no solution lads, but I know I do not want AI anywhere near my scribbles, god forbid anywhere near my thoughts.

It would be good if I could choose.
Frankly, I want to block them.

I want to think about it more carefully, and make a decision based on how this environment is changing and how people’s attitudes and thinking change.

My concern is similar to this: Will A.I. Become the New McKinsey? - Ted Chiang

I don’t see how it will ever be technically feasible to completely block crawling/scraping. If people are going to see content on their screens, then crawlers will find a way of seeing it too. You could make it harder for machines to read, like serving your entire page as an image with added noise to defeat OCR, but at that point you’re making it difficult for people too, not to mention blocking potentially useful services like search engines.

I’d have thought the best we could hope for would be a combination of (i) robots.txt-style do-not-crawl instructions, which responsible LLM builders can follow, and (ii) some kind of attribution when generated content has more than a certain percentage match with content in the training set. The latter won’t be trivial to implement, so I don’t see anyone doing it of their own volition, but they may be forced to if/when there are court cases, e.g. over “fair use”.
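On (i), concretely: a robots.txt along these lines asks the AI training crawlers with documented user-agent strings to stay out, while leaving ordinary search crawlers untouched. The user agents below are the ones OpenAI, Google, and Common Crawl have published for this purpose; the list is a moving target, and compliance is entirely voluntary on the crawler’s part:

```
# Ask known AI training crawlers (those that honour robots.txt) to stay out
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else (e.g. search engines) may crawl as usual
User-agent: *
Disallow:
```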