Protecting content from LLMs, AI, etc.?

Naturally I’ve been following all the “AI” developments lately, trying things out, and listening to and thinking about the implications. If you already follow me, or saw the April Fools’ Day “AI” we launched, you’ll have a pretty good idea of my general thoughts on it.

This recent Washington Post article brought the whole extraction side to light again, showing how these tools “train” on all the lovely language we publish on this writing platform of ours, among other places across the web. Here’s 170,000 “tokens” from Write.as making it into Google’s dataset alone:

[Screenshot: Write.as ranked #121,142 out of all websites in Google's C4 dataset]

Feedback wanted

This raises the question: how do you feel about your written and visual content feeding these models / "AI"s? As a platform, should we do anything to proactively limit what these companies are building on top of your intellectual property, or does it not matter? (Whatever opinions I may have on the whole thing, any decision around this is totally up to the community – that’s why I’m opening up the discussion.)

Of course, this also applies across all the products in our suite. Large language models (LLMs) like ChatGPT crawl your Write.as posts as well as your social interactions on Remark.as and Writing Exchange. And image synthesis models like DALL-E and Midjourney could “train” on your Snap.as photos and visual creations. So I’d like to think about how this affects all our tools, and if our approach might differ between them or not.

Please feel free to discuss and let me know what you think!


Yes, I think Write.as should proactively limit the scraping for AI training purposes.

To elaborate on this: I can sort of understand using public source code with permissive licences such as MIT or BSD as training data. But the text of most blogs and public websites almost never comes with such licensing options or implicit permissions.


I would appreciate a way to block them should I so choose. Choice is good. That said, I’m not sure of the cost of implementing it, and wouldn’t want it to get in the way of comments, for example.
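For what it’s worth, the mechanics of offering that choice needn’t be heavy. Here’s a rough sketch in Go (the language WriteFreely, the engine behind Write.as, is written in) of a per-blog setting driving what gets served at /robots.txt. The `Blog` struct, the `BlockAICrawlers` field, and the crawler list are all hypothetical stand-ins for illustration, not anything from the actual codebase:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// Blog is a hypothetical stand-in for a blog's stored preferences;
// the real data model would differ.
type Blog struct {
	BlockAICrawlers bool
}

// aiCrawlers lists user agents of known AI training crawlers.
// Illustrative only; it would need to be kept up to date.
var aiCrawlers = []string{"GPTBot", "Google-Extended", "CCBot"}

// robotsHandler serves a robots.txt whose contents depend on the
// blog owner's choice.
func robotsHandler(blog *Blog) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain")
		if blog.BlockAICrawlers {
			for _, ua := range aiCrawlers {
				fmt.Fprintf(w, "User-agent: %s\nDisallow: /\n\n", ua)
			}
		}
		// Search engines and other well-behaved bots stay welcome.
		fmt.Fprint(w, "User-agent: *\nDisallow:\n")
	}
}

func main() {
	blog := &Blog{BlockAICrawlers: true}
	http.HandleFunc("/robots.txt", robotsHandler(blog))
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

The obvious limitation is that this only deters crawlers that read and honour robots.txt in the first place.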

I’m fine if they want to crawl and use the data, but only if they attribute it to the original source. I’m guessing this attribution piece is not going to happen, so I would say we should block it as much as possible. Let them crawl somewhere else.

Whatever man can think of can, in time, happen.
Our ancestors knew our planet to be spherical. Then someone decided a flat world served a purpose, and most were once convinced our world was flat.
Dream of making a movie, and people ran out of the cinema scared to death when the train on screen came at them.
Dream of flying, and we fly.
Dream of going to the moon: been there, done that.
Think of controlling people’s opinions: it’s happening right under our noses, every day.

Is it all wrong? Not necessarily. This AI stuff is no different.

I do think we’re going to end up living lives in which we can’t trust anything anymore, not even ourselves.

Did I ever write this, or that? AI will tell you yes when in reality it’s no, or vice versa. Your brain can’t remember it all, and its plasticity makes it prone to reshaping, and so changing, reality.

So I am totally against this AI crap. Tomorrow it’s going to tell us something like “Saddam resurrected Jesus,” and we’re all going to be gullible enough to believe it.

So I am in favour of some sort of solution, but at a loss as to which.

Thumbs up, but I do wonder about the practical feasibility of making sure crawlers and AI companies don’t come to steal, to sell, and worse. None of them respect Do Not Track browser requests; it’s a joke. And I can just see silly me taking them to court over it, as if that were remotely feasible.

After millennia of war, violence, and abuse, we’re still totally incapable of preventing them from happening again. I think it will be the same with AI.

Sorry, no solution lads, but I know I do not want AI anywhere near my scribbles, god forbid anywhere near my thoughts.

It would be good if I could choose.
Frankly, I want to block them.

I want to think about it more carefully, and make a decision based on how this environment is changing and how people’s attitudes and thinking change.

My concern is similar to this: Will A.I. Become the New McKinsey? - Ted Chiang

I don’t see how it will ever be technically feasible to completely block crawling/scraping. If people are going to see content on their screens, then crawlers will find a way of seeing it too. You could make it harder for machines to read, like serving your entire page as an image with added noise to defeat OCR, but at that point you’re making it difficult for people too, not to mention blocking potentially useful services like search engines.

I’d have thought the best we could hope for would be a combination of (i) robots.txt-style do-not-crawl instructions, which responsible LLM builders can follow, and (ii) some kind of attribution when generated content has more than a certain percentage match with content in the training set. The latter won’t be trivial to implement, so I don’t see anyone doing it of their own volition, but they may be forced to if/when there are court cases, e.g. over “fair use”.
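On (i), concretely: a robots.txt along these lines asks the AI training crawlers with documented user-agent strings to stay out, while leaving ordinary search crawlers untouched. The user agents below are the ones OpenAI, Google, and Common Crawl have published for this purpose; the list is a moving target, and compliance is entirely voluntary on the crawler’s part:

```
# Ask known AI training crawlers (those that honour robots.txt) to stay out
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else (e.g. search engines) may crawl as usual
User-agent: *
Disallow:
```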