Gotta block ’em all

Robb Knight, a software developer who found that Perplexity was circumventing robots.txt to scrape websites it wasn’t supposed to, told 404 Media there are many cases where it’s hard to tell what a user agent does or who operates it. “What’s happening to people, including me, is copy-pasting lists of agents without verifying every agent is a real one,” he said. Knight added that the Wall Street Journal and many News Corp-owned websites are currently blocking a bot called “Perplexity-ai,” which may or may not even exist (Perplexity’s crawler is called “PerplexityBot.”)

Source: Websites are Blocking the Wrong AI Scrapers (Because AI Companies Keep Making New Ones)

The solution we used on this here blargh is simple: we blocked everyone in robots.txt:


User-agent: *
Disallow: /
Crawl-delay: 360
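Of course, robots.txt only stops crawlers that choose to honor it; the badly behaved ones have to be refused at the server itself. A minimal sketch of one common way to do that in nginx (assuming nginx here; the bot names come from this post, and `$block_bot` is a made-up variable name, not anything built in):

```nginx
# Map known-bad user agents to a flag (~* is a case-insensitive regex match).
# This map block belongs in the http context of nginx.conf.
map $http_user_agent $block_bot {
    default           0;
    ~*AhrefsBot       1;
    ~*PerplexityBot   1;
}

server {
    listen 80;
    server_name example.com;

    # Refuse flagged bots before serving anything.
    if ($block_bot) {
        return 403;
    }
}
```

Unlike robots.txt, this works whether or not the crawler feels like cooperating.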

We also like to keep a terminal window open to watch what's currently hitting the server, and we block bots liberally. We've already blocked ahrefs.com and PerplexityBot for being assholes without rate limits. Does this mean this here blargh will be that much harder to find? Yeah, but we don't particularly care about that.
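If you want the same view without eyeballing raw log lines, a one-liner over the access log does the job. A minimal sketch, assuming a combined-format log (the standard nginx/Apache default); the log path in the comment is an assumption, so point it at whatever your server actually writes:

```shell
# Tally user agents seen in a combined-format access log, busiest first.
tally_agents() {
    # In the combined log format, splitting each line on double quotes
    # puts the User-Agent string in field 6.
    awk -F'"' '{ print $6 }' "$1" | sort | uniq -c | sort -rn
}

# Example (path is an assumption):
#   tally_agents /var/log/nginx/access.log | head -20
```

Anything near the top of that list with no Crawl-delay manners is a candidate for the block list.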
