Exclude site from ChatGPT scraping

Could someone more technically adept than I am (e.g. @manton or @sod or… well, pretty much anyone here, I’m guessing) glance at this and explain the necessary steps? My understanding is that it should be possible to add something to my site’s robots.txt that will exclude it from future scraping by OpenAI / ChatGPT. Is it as simple as adding the following somewhere?

User-agent: ChatGPT-User
Disallow: /

My robots.txt already reads as follows. Does this cover me?

User-agent: *
Disallow: /

Thanks.

1 Like

I don’t know the answer to this but I’ve updated my robots.txt to include both:

User-agent: *
Disallow: /

User-agent: GPTBot
Disallow: /

Your second example prevents all user-agents (crawlers) from crawling your site. So, in theory, that should be enough, as long as OpenAI respects the robots.txt standard. It’s not clear to me whether they do, so to be on the safe side, you might want to add ChatGPT-User and GPTBot explicitly:

User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
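
If you want to double-check how a standards-compliant parser reads those rules, here’s a small sketch using Python’s standard-library urllib.robotparser. The example.com URL is just a placeholder; only the rules themselves come from this thread:

from urllib.robotparser import RobotFileParser

# Parse the combined rules locally; no network access needed.
rules = """
User-agent: ChatGPT-User
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# can_fetch() answers the original question: is this user-agent allowed in?
for agent in ("ChatGPT-User", "GPTBot", "SomeOtherBot"):
    print(agent, "allowed:", rp.can_fetch(agent, "https://example.com/"))

# All three print False: the explicit groups and the * fallback
# both disallow everything.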

The robots.txt file can be added via a Custom Theme, sparsely documented here.

4 Likes

Thanks both!

Shameless self-promotion: my new Micro.blog plug-in, Custom Robots, makes editing the robots.txt file a little easier.

5 Likes

Excellent, thanks @sod!

I mean, if you hadn’t self-promoted, I never would have found this! Thank you!!

Just adding to the conversation: you can also block the Google Bard and Common Crawl bots:

# Block OpenAI
User-agent: GPTBot
Disallow: /
User-agent: ChatGPT-User
Disallow: /

# Block Google Bard AI
User-agent: Google-Extended
Disallow: /

# Block Common Crawl
User-agent: CCBot
Disallow: /

Edit: Worth noting that Common Crawl’s data is used by OpenAI, Google, and others when training their LLMs.
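
Once the file is deployed, you can verify that all four bots are actually blocked. A quick sketch, again with Python’s standard-library urllib.robotparser; swap in your own domain for the placeholder example.com:

from urllib.robotparser import RobotFileParser

# Placeholder URL; point this at your own site's robots.txt.
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()  # fetch and parse the live file

# The four AI-related crawlers mentioned in this thread.
for agent in ("GPTBot", "ChatGPT-User", "Google-Extended", "CCBot"):
    verdict = "blocked" if not rp.can_fetch(agent, "https://example.com/") else "still allowed"
    print(f"{agent}: {verdict}")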