Reddit Escalates Legal Battle Over AI Data Scraping in New Lawsuit Targeting Perplexity and Data Brokers

Reddit Takes Legal Action Against Alleged Data Scraping Operations

Reddit has filed a lawsuit against four companies it accuses of systematically harvesting its content without a license. The suit targets SerpApi, Oxylabs, AWMProxy, and AI startup Perplexity for allegedly scraping Reddit data from search results and using it without payment or permission. The action is the latest step in Reddit’s increasingly aggressive strategy to monetize and protect its vast repository of user-generated content.

The Business of Data Scraping

According to court documents, the defendants operate sophisticated data collection businesses that circumvent Reddit’s licensing requirements. SerpApi, Oxylabs, and AWMProxy specialize in extracting data from search results for commercial purposes, while Perplexity is an AI company that allegedly uses scraped content to train its models and power its answer engine. The lawsuit claims these operations undermine Reddit’s data licensing program, which the company launched in 2023 to generate revenue from AI training data.

Reddit’s Data Monetization Strategy

Reddit has been actively pursuing data licensing agreements with major technology companies, having already secured deals with Google and OpenAI. The platform has even developed its own AI answer system to leverage the knowledge contained within user posts. This legal action follows a similar lawsuit against AI startup Anthropic, which Reddit accused of using its content to train the Claude chatbot without proper authorization.

The Perplexity Investigation

Reddit’s complaint reveals a detailed investigation into Perplexity’s data collection practices. After sending a cease-and-desist letter that Perplexity allegedly ignored, Reddit created a carefully designed test post that was only accessible through Google’s search engine and not available elsewhere online. Within hours, queries to Perplexity’s answer engine were reproducing the test content, providing what Reddit claims is definitive evidence of unauthorized data scraping.

“The only way that Perplexity could have obtained that Reddit content and then used it in its ‘answer engine’ is if it and/or its co-defendants scraped Google search results for that Reddit content,” the lawsuit states, highlighting what Reddit describes as a clear violation of its terms and licensing requirements.

Technical Violations and Industry Standards

The legal complaint alleges that Perplexity and other defendants have disregarded fundamental web protocols, including:

Ignoring robots.txt directives that communicate scraping preferences
Bypassing technical measures designed to limit automated access
Operating without proper licensing despite Reddit’s clear terms

Reddit has been actively working to establish new industry standards through initiatives like the Really Simple Licensing framework, which aims to add licensing terms to traditional robots.txt files.
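
For readers unfamiliar with how robots.txt communicates scraping preferences, the following is a minimal sketch using Python’s standard-library parser. The robots.txt content, the bot names (ExampleSearchBot, ExampleScraper), and the example.com URL are all hypothetical; this is not Reddit’s actual file and does not describe any defendant’s software.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere,
# every other user agent is asked not to crawl anything.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_fetch(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
url = "https://example.com/r/some_thread"  # placeholder URL
allowed_search = may_fetch(parser, "ExampleSearchBot", url)
allowed_other = may_fetch(parser, "ExampleScraper", url)
print(allowed_search, allowed_other)
```

A well-behaved crawler runs a check like this before each request and skips disallowed URLs. The file is purely advisory: nothing in the protocol technically stops a client that chooses to ignore it, which is why platforms also rely on technical limits and licensing terms.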

Broader Implications for AI Development

This lawsuit occurs against the backdrop of increasing tension between content platforms and AI companies seeking training data. As Reddit and other platforms seek to monetize their user-generated content, AI developers face growing legal and financial barriers to accessing the data needed to train sophisticated models. The outcome of this case could establish important precedents for:

Data ownership rights for user-generated content platforms
Legal boundaries of web scraping for AI training
Enforcement mechanisms for data licensing programs
Industry standards for ethical data collection

Reddit’s Multi-Pronged Defense Strategy

Beyond legal action, Reddit has implemented several technical measures to protect its data, including rate-limiting unknown bots and web crawlers in 2024 and a restriction on the Internet Archive’s Wayback Machine access scheduled for August 2025. Together with its standards work, these efforts form a data protection approach that combines legal, technical, and standards-based measures.

The company is seeking financial damages and a permanent injunction that would prevent the defendants from selling or using previously scraped Reddit material. As the case progresses, it will likely shed light on the evolving relationship between content platforms, data brokers, and AI companies in an increasingly data-driven digital economy.