AI startup Perplexity is facing accusations from Cloudflare that it scrapes content from websites that have explicitly indicated they do not want their content copied.
According to research published by Cloudflare, Perplexity ignored blocking instructions and concealed its crawling and scraping activity by changing its user agent string and the autonomous system numbers (ASNs) its requests came from, in order to get around restrictions set in robots.txt, the standard file that tells crawlers which pages they may or may not access.
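To illustrate why a changed user agent matters, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `example.com` URL, and the browser-style user agent string are made up for illustration, and `PerplexityBot` is used only as an example of a declared crawler name; none of this reproduces Cloudflare's actual test setup.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one declared crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: PerplexityBot
Disallow: /

User-agent: *
Allow: /
"""

# Illustrative browser-style user agent (assumed, not taken from Cloudflare's report).
BROWSER_UA = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without making a network request."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/article"

# The rules are matched against whatever name the client reports.
declared_allowed = parser.can_fetch("PerplexityBot", url)
browser_allowed = parser.can_fetch(BROWSER_UA, url)

print("declared crawler allowed:", declared_allowed)
print("browser-style agent allowed:", browser_allowed)
```

In this sketch the rule that blocks the declared crawler does not apply to the browser-style string, because robots.txt rules are keyed on the self-reported user agent and compliance is voluntary. That is why a site that wants to enforce a block has to look at signals other than the user agent.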
Cloudflare stated that it detected this activity across tens of thousands of domains and millions of requests per day, and that it identified the bots through a combination of machine learning and network signal analysis. According to Cloudflare, Perplexity does not rely only on its declared user agent: when its known bots are blocked, it sometimes presents itself as Google Chrome on macOS.
For its part, Perplexity denied the allegations, with a spokesperson calling Cloudflare's post a "promotion" and emphasizing that the bot in question "is not affiliated with us."
Cloudflare confirmed that it had removed Perplexity from its list of verified bots and added new blocking mechanisms. It said it began the investigation after customers complained that Perplexity continued to crawl their sites despite explicit blocking instructions.
This is not the first time Perplexity has faced similar accusations. Last year, media outlets such as Wired criticized it for allegedly copying their content, and the company's CEO avoided giving a clear definition of the term "plagiarism" during an interview at TechCrunch Disrupt 2024.
Do you consider this an invasion of privacy? Let us know what you think.