The Battle for the Content Rights Layer
The End of the Link Economy
The internet’s “link economy” – where search engine optimization (SEO) and page links drove traffic and ad revenue – is rapidly giving way to a new paradigm. For two decades, companies competed to rank atop Google’s blue-link results, hoping for clicks and pageviews. Today, generative AI platforms like ChatGPT, Google’s Gemini, and Perplexity have emerged as the new gatekeepers. In fact, a recent survey found that 58% of consumers (versus only 25% in 2023) now use AI tools for product and service recommendations. Shoppers and researchers are “migrating en masse from traditional search engines to Gen AI platforms”. In other words, the core function of search has shifted from document finder to answer generator. Advanced language models “synthesize direct answers via AI overviews and conversational interfaces,” eroding the old click-driven model.
Zero-Click Answers: AI summaries now satisfy user intent on the spot. Large language models pull information from multiple sources and present a single, synthesized answer at the top of the results page. As a result, users often never click through to websites, creating a pervasive “zero-click” environment.
Citation over Clicks: Success is no longer measured by pageviews but by being cited as the definitive source. SEO analysts note that winning in an AI-first economy means “being cited as the authoritative source in AI overviews”. Marketers now focus on earning mentions in AI-generated answers rather than driving traffic.
Conversational Intent: Search queries have become long-tail, conversational questions. Users ask full, nuanced questions (e.g. “What is the best CRM for a B2B startup?”), and generative models aim to answer them directly. Content must now be structured to address these complex prompts head-on.
This fundamental shift – which has given rise to the practice of Generative Engine Optimization (GEO) – means that the old link-based traffic model is collapsing. As one analysis puts it, “for two decades our goal was rank high, secure the click, and drive traffic. That model is collapsing. The new gatekeepers are LLMs… which synthesize direct answers”. In short, the link economy is ending and SEO is evolving into a new discipline focused on AI-driven answers.
The Mechanics of Synthesis: How RAG Replaces Traditional Search
Under the hood, generative search engines use Retrieval-Augmented Generation (RAG) to produce answers. Traditional search engines simply index web pages and rank them; in contrast, RAG-based systems actively fetch relevant documents and then generate a single, coherent response. In RAG, the AI model first retrieves pieces of text from an external knowledge base (such as web articles or a database of trusted sources) and then augments its language model prompt with those documents. Only after this retrieval phase does the model generate an answer that integrates the new information.
Document Retrieval: The system converts content into vector embeddings and performs a similarity search to find the most relevant passages. In practice, this means the AI “does web search for most questions” instead of relying on static training data. The retrieval step prioritizes relevance, recency, and trustworthiness, ensuring up-to-date and authoritative content is used.
Answer Generation: The retrieved passages are prepended to the prompt (sometimes called “prompt stuffing”), so the language model can generate a response grounded in real data. This blending of search and generative AI effectively makes the model “stick to the facts” by synthesizing text from trusted sources.
Citations and Transparency: Because the output is based on specific source documents, the engine can include attributions or footnotes to those sources. RAG thus helps reduce hallucinations and increases transparency. In fact, LLMs with RAG can cite their sources in answers, allowing users to verify claims.
The upshot is that a user’s query no longer yields a ranked list of links, but rather an immediate, paragraph-long answer (often with bulleted “key points” or an infobox) plus a handful of cited sources. As one industry guide notes, RAG-powered engines “pull data from various trusted sources, instantly synthesize it into a single, comprehensive answer,” placing it at the top of the search page. This fundamentally alters the user experience: knowledge is delivered directly, and clicks are optional. For brands and publishers, it means the traditional metric of search-driven traffic is being replaced by visibility within AI answers.
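To make this retrieve-then-generate flow concrete, the sketch below outlines a minimal RAG loop in Python. The embed() and generate() callables stand in for a real embedding model and LLM API, and the corpus, field names, and prompt wording are illustrative assumptions rather than any specific engine's implementation.

```python
# Minimal RAG sketch: retrieve the top-k passages by vector similarity,
# then ground the generation prompt in those passages and return citations.
from dataclasses import dataclass
import math

@dataclass
class Passage:
    source_url: str
    text: str
    vector: list          # embedding computed offline with embed()

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_vector, index, k=3):
    # Similarity search: relevance is approximated by cosine similarity
    # between the query embedding and each passage embedding.
    ranked = sorted(index, key=lambda p: cosine(query_vector, p.vector), reverse=True)
    return ranked[:k]

def answer(question, index, embed, generate):
    passages = retrieve(embed(question), index)
    # "Prompt stuffing": prepend the retrieved text so the model grounds its
    # answer in real documents instead of relying only on training data.
    context = "\n\n".join(f"[{i + 1}] ({p.source_url}) {p.text}"
                          for i, p in enumerate(passages))
    prompt = ("Answer the question using only the numbered sources below, "
              f"and cite them by number.\n\nSources:\n{context}\n\n"
              f"Question: {question}")
    return generate(prompt), [p.source_url for p in passages]   # answer + citations
```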
Defining Generative Engine Optimization (GEO): Credibility, Authority, and Structural Clarity
Generative Engine Optimization (GEO) is the practice of optimizing content for AI-driven answer engines. Whereas SEO once prioritized backlinks, keywords, and SERP rank, GEO emphasizes three new pillars:
Credibility (Trust & E‑E‑A‑T): Content must demonstrate expertise and trustworthiness. The old SEO metric of E‑E‑A‑T (Experience, Expertise, Authoritativeness, Trustworthiness) remains critical in a generative context. For example, pages with clear author bios, reputable citations, or official endorsements tend to be favored. As one guide observes, “content with transparent author bios, reputable citations, and consistent updates often outranks shallow material” in AI answers. In practice, brands are encouraged to publish original research and thought leadership that earn external citations – “earning citations from respected domains increases your trustworthiness in the eyes of AI engines,” notes a strategy playbook.
Authority (Citation Signals): Rather than counting the number of backlinks, GEO values how often and how authoritatively a brand is cited. AI overviews cite only a handful of sources (typically 2–7 domains per answer), so being one of those cited sources is the goal. Brands build citation authority by creating unique, data-driven content that others (including AI engines) reference. In short, mentions in trusted outlets – or being directly quoted by an AI – serve as the new “upvotes” for relevance. As Francesca Tabor notes, “mentions in licensed, high-trust outlets are now as critical as backlinks once were”.
Structural Clarity (Machine-Readable Formatting): AI models parse content more easily when it is cleanly structured. This means using schema markup (e.g. FAQ, HowTo), logical heading hierarchies (H1/H2/H3), bullet lists, tables, and TL;DR summaries. Content should be concise and well-organized so that the model can extract answers. Indeed, SEO practitioners advise adding “concise, well-organized summaries” and bullet lists to make pages “AI-friendly,” increasing the likelihood of being cited. In short, clarity and brevity help the AI incorporate your content into its answer. As one guide puts it, “structured content increases the likelihood of being cited in AI answers”. (An illustrative markup sketch appears below.)
Under GEO, traditional on-page factors (like crawlability and speed) are still needed, but the emphasis is on content and context. Marketers must craft content that shows expertise, earns authoritative citations, and is formatted for machine consumption. Only by aligning with these criteria can brands hope to be represented accurately in the AI-first search landscape.
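To illustrate the structural-clarity pillar, the snippet below shows what FAQPage markup for a single question might look like, using standard schema.org types; the question and answer text are placeholders, not a recommendation from any particular engine.

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [{
    "@type": "Question",
    "name": "What is Generative Engine Optimization (GEO)?",
    "acceptedAnswer": {
      "@type": "Answer",
      "text": "GEO is the practice of structuring credible, well-sourced content so that AI answer engines can parse, cite, and attribute it."
    }
  }]
}
</script>
```

Markup like this does not guarantee a citation, but it gives a model an unambiguous question-and-answer pair to extract.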
Case Study: The Media Industry’s Shift from Ads to Licensing
The news media sector vividly illustrates this transformation. Legacy publishers once monetized content through display ads and subscriptions, driving traffic via SEO and social links. But as AI-driven answers replace clicks, publishers face declining referral traffic. Investigations in late 2022 found that ChatGPT could closely mimic major news articles (e.g. from NYT, Reuters, BBC), prompting publishers to realize their archives had become “central training fuel for LLMs”. In other words, news content became the backbone of AI knowledge, yet that value was not being captured under the old model.
In response, many media companies are forging content licensing deals with AI platforms, effectively shifting from an ad-driven economy to a usage-based licensing economy:
Subscription and Revenue-Share Models: Startups like ProRata (operator of the Gist.ai search engine) have launched programs where hundreds of publishers license their content. ProRata’s Gist.ai only uses licensed articles to answer queries, and it shares revenue directly with content partners. For example, in 2025 the Boston Globe, Vox Media, Mansueto Ventures (Fast Company/Inc), and dozens of others joined Gist.ai’s program. Under this model, publishers earn roughly 50% of ad revenue whenever their content is cited in an AI answer (Perplexity, by comparison, pays at most 25%). Executives praise the transparency: “we appreciate the transparency, publisher credit and focus on monetization [ProRata is] bringing to the marketplace,” says a media VP. Crucially, publishers’ content is accessed via RAG (not ingested into the AI’s training set), so they retain control and are compensated for each use.
Major Licensing Deals: Established news organizations have also inked direct licensing contracts. In late 2023, for example, Axel Springer (publisher of Business Insider, Politico, Bild, etc.) signed a deal to license its archives to OpenAI. The Associated Press entered into a similar agreement allowing its newswire content for ChatGPT. Analysts predict the emergence of a formal “Content Rights Layer” – a clearinghouse to manage AI content licensing like music and stock-photo licensing.
Usage-Based (“Pay-Per-Crawl”) Models: Some publishers are experimenting with usage-based payment structures. As eMarketer reports, rather than one-off fees, AI companies may soon pay on a per-use basis. Through Cloudflare’s pay-per-crawl service and publisher networks such as Raptive, publishers (e.g. Condé Nast, Time, AP) can require AI bots to pay a fee each time they crawl content. This “pay-per-crawl” approach recognizes that AI firms, which earn huge revenues, should compensate content creators proportionally. In practice, this could limit truly free scraping and tie publisher revenue directly to AI usage.
Together, these models represent a stark shift from the old ad-revenue paradigm. Instead of giving content away for traffic (and hoping to monetize clicks), publishers are demanding direct payment for AI access. The industry sees this as necessary: as one analysis notes, as AI firms’ revenues “skyrocket… one-time deals may no longer seem adequate”. In effect, media companies are transforming themselves into suppliers of licensed knowledge. They are betting that this content-rights economy – where visibility in an AI answer translates to a revenue share or licensing fee – can sustain journalism in the AI era.
LLMs and the Content Supply Crisis
The Great Data Scrape: Economic Incentives and Lack of Content Governance
Large language models (LLMs) are built on vast, scraped troves of online content. In the early phase of AI development, companies had a strong incentive to harvest any freely available text or code at scale. Public datasets like Common Crawl, Wikipedia, public forums, news sites and open-source codebases provided an effectively zero-cost “training database.” Because no consent or licensing process governed this mass ingestion, LLMs have often consumed copyrighted novels, articles, images and other creative works without permission. This “great data scrape” dramatically fueled LLM capabilities but left content creators unpaid and unaware. Indeed, one recent analysis warns that major AI firms have “undermined authors’ proprietary control over their works by using these works as training data, without consent and often through opaque processes”. Research shows that today’s foundation models have already “ingested much of the public internet,” even though that accounts for <0.01% of all data globally. As free web data proved insufficient for continued gains, labs turned to paid content – striking multiyear licensing deals worth tens or hundreds of millions of dollars to access previously locked-up archives. In short, the first wave of model training exploited cheap, unregulated datasets; only later did economic incentives push companies to pay for premium content.
Huge, free data pools: Early LLM training scraped massive web datasets. Projects like Common Crawl, along with public archives of books, code, and social media, fed trillions of tokens into models.
Low cost, weak governance: Harvesting public content avoided licensing fees. With no oversight or consent requirements, copyrighted materials were copied en masse, often with identifying information stripped away.
Creators sidelined: Content owners received no compensation. As one author observes, big tech firms effectively “overrun copyright protections” by absorbing creative works into their training data without authorization. Content generated by LLMs later makes heavy use of these works – but the original authors are not paid.
Shift to paid sources: When open data sources were exhausted, companies began negotiating licenses for specialized content. Recent reporting finds deals in the hundreds of millions of dollars, for example News Corp’s archive-to-OpenAI contract reportedly exceeds $250 million over five years. These new arrangements highlight that once “free” data ran out, firms turned to expensive paid data, albeit under very little external regulation.
The Fair Use/Fair Dealing Doctrine in Mass Training
AI developers often invoke copyright law to defend their mass copying of content. In the U.S., companies argue that training an AI is a “transformative” fair use – analogous to a student learning from books or an artist using references. Some courts have tentatively agreed. For example, in Bartz v. Anthropic (2025), a federal judge held that copying thousands of lawfully acquired books into a training dataset was “reasonably necessary” to achieve the model’s capabilities, and found the use likely fair. Crucially, the judge noted the model’s outputs did not directly replace the market for the original books – a market factor that heavily favored fair use. OpenAI and others have publicly asserted that ingesting publicly available web materials is fair use, even citing library-industry position papers that broadly endorse AI training as a research use.
However, this expansive view of fair use is contested. The U.S. Copyright Office warns that analogizing AI training to human learning can be “mistaken.” Unlike humans (who imperfectly paraphrase), AI can produce near-perfect copies; allowing unlimited scraping without permission risks “undoing the balance” that fair use seeks to maintain. The Office suggests that if AI outputs begin to compete with originals or flood markets with near-duplicates, the fair-use defense may fail. Beyond the U.S., doctrines differ: in countries with fair-dealing rules (e.g. Canada, UK), only specific purposes (education, research, parody, etc.) qualify, and “transformative use” is not an explicit factor. A legal expert notes that what U.S. courts regard as transformative fair use could easily be infringement elsewhere. Indeed, ongoing litigation underscores the uncertainty: for instance, a recent complaint alleges ChatGPT produced a summary of a copyrighted novel so similar to the original that it should be considered infringing.
Industry stance: AI companies contend that training on public Internet content is non-infringing. OpenAI and other firms cite precedents and position papers (e.g. Library Copyright Alliance) to argue ingestion of copyrighted works is “generally fair use” when used for building a new model.
U.S. case law: In Bartz v. Anthropic, the court found copying whole books into the training data was likely fair use. The judge emphasized the outputs were different in form and not sold as book substitutes, so the “effect on the market” favored fair use. (Notably, the ruling rested on the specifics – such as the model having no infringing outputs – and may not broadly immunize all training.)
Regulatory caution: The U.S. Copyright Office (in reports and testimony) explicitly rejects an unqualified fair-use claim. It cautions that AI does not “escape copyright” simply because it “learns” like a person. The Office suggests that using vast troves of copyrighted works to generate market-competitive content could “go beyond established fair use boundaries”.
Fair dealing abroad: Canada, the UK and others have closed lists of exceptions. Those laws lack a free-floating “transformative” doctrine. As one analysis notes, copying books to train an AI might fail to qualify as fair dealing in Canada, where only specified purposes (research, private study, etc.) are allowed. In short, U.S. fair use may not protect AI training in all jurisdictions.
Emerging infringement claims: These legal debates are already playing out in court. Authors and publishers have sued AI companies, claiming outputs too closely mirror protected works. For example, a court recently allowed a case to proceed where ChatGPT’s summary of Game of Thrones was deemed potentially “substantially similar” to the copyrighted text. This illustrates the risk that broad fair-use defenses could fail if models reproduce copyrighted content.
The Economic Value of Content: Marginal and Aggregate Worth
Training data has immense economic value, both in total and per unit. Industry reports reveal eye-opening price tags: licensing deals for AI training run into the tens or hundreds of millions of dollars. For instance, a recent OpenAI-News Corp agreement was reported to exceed $250 million over five years. Google’s deal to license Reddit’s user content is worth roughly $60 million per year. Such multi-million-dollar contracts show the huge aggregate value companies place on exclusive data access.
On a per-item basis, analysts estimate rough rates: a typical book might command around $5,000 in a model-training license (often split 50/50 with the author), a music track about €0.30–€2.00, and video roughly $1–$4 per minute. These unit prices reflect the marginal value of individual works under current deals. Importantly, content value is not uniform: AI research has found that different data sources contribute unevenly to model performance. Some data (like high-quality technical documents) may dramatically reduce model error, while other data adds little. As one study notes, “not all data contributes equally — heterogeneity creates variation in marginal value”. In practice, most AI companies buy data in bulk (paying flat fees for entire archives) rather than per-token. Converting algorithmic improvements into exact dollar values remains an unresolved challenge in this emerging economy.
Crucially, much of the revenue from data still bypasses the original creators. A survey of recent AI-data contracts found that only 7 out of 24 deals actually paid the original content creators (e.g. journalists, photographers, academics); the remaining 17 deals paid only publishers or platforms. In other words, even though firms are spending millions for licensed data, authors and artists seldom see those dollars directly. The rest flows to media companies or intermediaries. In sum, content owners face a double squeeze: on one hand, their works fuel valuable AI models; on the other hand, they often do not capture the market price that models place on that data.
Aggregate deals: Licensed data can be worth hundreds of millions in total. Examples include News Corp’s 5-year deal ($250M+), Google’s Reddit content deal (~$60M/year), and Dotdash Meredith’s reported ~$16M/year contract with OpenAI. These figures illustrate how large content libraries can command very large aggregate fees in AI training.
Per-unit pricing: When content is traded in identifiable units, prices are high. One economic analysis suggests roughly $5,000 for a book, €0.30–€2 for a music track, and $1–$4 per minute of video. Such numbers are examples of the marginal value under simple licensing terms. They reflect the scarcity and demand for high-quality training examples. (A back-of-the-envelope calculation applying these rates appears after this list.)
Marginal heterogeneity: Research shows diminishing returns on performance as more data are added. Not all data sources equally improve an LLM: some bits of content yield large accuracy gains, others negligible ones. Quantifying each source’s economic worth is complicated by these diminishing returns. In practice, companies are currently paying for “chunks” of content (entire archives) rather than pricing data token-by-token.
Creator compensation: Most of the economic value is captured by intermediaries, not creators. In one study of 24 training-data deals, only 7 deals paid the original authors or creators; the other 17 deals remunerated only publishers, platforms or aggregators. Thus, even as LLM firms pay hundreds of millions for data, much of that wealth does not reach the writers, journalists, or artists whose works were used.
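As a back-of-the-envelope illustration, the sketch below applies the per-unit estimates quoted above to a hypothetical archive. The archive sizes and the 50/50 author split are assumptions for illustration only; real deals are typically negotiated as flat fees for whole archives.

```python
# Back-of-the-envelope licensing value using the per-unit estimates quoted
# above (~$5,000 per book, ~€0.30-€2.00 per track, ~$1-$4 per video minute).
# The archive sizes are hypothetical; actual deals are usually flat fees.
BOOK_RATE_USD = 5_000
TRACK_RATE_EUR = (0.30, 2.00)
VIDEO_RATE_USD_PER_MIN = (1, 4)

books = 20_000           # hypothetical publisher backlist
tracks = 1_000_000       # hypothetical music catalogue
video_minutes = 500_000  # hypothetical footage library

book_value = books * BOOK_RATE_USD
track_low, track_high = (tracks * rate for rate in TRACK_RATE_EUR)
video_low, video_high = (video_minutes * rate for rate in VIDEO_RATE_USD_PER_MIN)

print(f"Books: ${book_value:,.0f} gross; ${book_value * 0.5:,.0f} to authors at a 50/50 split")
print(f"Music: €{track_low:,.0f} to €{track_high:,.0f}")
print(f"Video: ${video_low:,.0f} to ${video_high:,.0f}")
```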
Magdalena’s Lens: Competition Economics and Data Concentration
Viewed through a competition-economics lens (as emphasized by economist Magdalena Kuyterink and others), the concentration of training data poses distinct harms. Data by itself is nonrival, but control over it can lock out competitors. If one firm aggregates a massive proprietary dataset, rival firms cannot catch up without that data. The critical point is that incumbents can choose not to share. As one survey of AI economics explains, the incentive of big data holders to withhold data – even for sale – creates high barriers to entry. In essence, data becomes a de facto bottleneck: a newcomer might have the computing resources and talent, but without access to large, high-quality corpora, their models lag behind.
Moreover, data often creates network effects in AI. A larger dataset typically yields a better model, which then attracts more users or usage, generating even more data – a positive feedback loop. A classic example is how Amazon’s early recommendation engine improved as more customers used it, providing more “feedback” data to refine the system. In AI, this means leaders can entrench their advantage: better models attract more real-world usage, fueling even larger datasets and further improvements (increasing returns to scale). Over time, the market “tips” toward those initial leaders.
These dynamics are already of regulatory concern. Analysts note that if only a few “foundation models” exist, the companies deploying them could wield outsized market power. In 2024, antitrust authorities explicitly warned about this risk: a joint U.S./EU statement flagged the ability of large incumbent tech firms to “entrench or extend power in AI-related markets,” and cautioned that exclusive deals among big players could “steer market outcomes” in their favor. In short, from a competition perspective, the massive data hoards controlled by the biggest AI companies create entry barriers and network effects that could limit future competition and innovation.
Barriers to entry: Incumbent AI firms have far greater access to training data, which newcomers lack. Because data is nonrivalrous, the key barrier is strategic: dominant firms can simply refuse to share their datasets, effectively freezing out smaller rivals. This raises the cost and difficulty for new entrants who would otherwise need comparable data to build competitive models.
Feedback loops: More data begets better AI products, which attract more users and thus more data. This virtuous cycle (increasing returns to scale) can “tip” the market. For example, studies of online platforms note that early success attracts user activity that feeds the algorithm, further improving the product. In LLM markets, this suggests a few large models could become dominant simply by virtue of their larger training sets. (A toy simulation of this dynamic appears after this list.)
Limited foundation models: With only a handful of general-purpose LLMs, the companies controlling them gain leverage. Congressional analysts warn that such firms “might have market power and significant influence” over the AI ecosystem, since it is hard for others to enter without their own massive models. In practice, most downstream AI services are built on these few proprietary models.
Regulatory scrutiny: Competition authorities have taken note. In mid-2024, U.S. (DOJ/FTC) and European regulators jointly highlighted concerns that Big Tech could lock in its AI dominance through concentrated data access. They are examining whether exclusive content deals, partnerships, or bundled cloud services give incumbents an unfair edge. The core worry is that without intervention, data concentration will reinforce a winner-take-all market structure in AI.
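To make the feedback loop concrete, the toy simulation below assumes deliberately simple functional forms: model quality as a concave function of training data, winner-take-most adoption of new users, and usage that generates more data. The parameters are illustrative assumptions, not an empirical model, but they show how a head start in data compounds.

```python
# Toy model of the data feedback loop: the better model attracts most new
# users, usage generates more data, and the gap widens. All functional
# forms and parameters are illustrative assumptions.
import math

data = {"incumbent": 1_000_000.0, "entrant": 200_000.0}   # stylized training-data stock
users = {"incumbent": 0.0, "entrant": 0.0}

def quality(tokens):
    return math.log(tokens)          # diminishing returns to raw data

for year in range(1, 6):
    leader = max(data, key=lambda firm: quality(data[firm]))
    for firm in data:
        # Winner-take-most adoption: the currently better model captures
        # 90% of the 10,000 users entering the market each year.
        users[firm] += 10_000 * (0.9 if firm == leader else 0.1)
        data[firm] += users[firm] * 50    # each active user contributes new data
    print(f"year {year}: leader = {leader}, "
          f"data advantage = {data['incumbent'] / data['entrant']:.1f}x")
```

Because the leader's data stock grows faster every period, the quality ranking never flips and the simulated market tips toward the incumbent.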
The Need for an Interoperable Rights System
The Flaws of Current Protocols: robots.txt and CC Licenses
Today’s web standards are ill-suited to policing AI data use. The robots.txt protocol – originally designed to guide search engines – is voluntary and easily bypassed by aggressive crawlers. Notably, investigations found AI services ignoring robots.txt “do not crawl” rules, effectively undermining publishers’ opt-outs. Worse, strict use of robots.txt to block AI would also block indexing by search engines, hurting visibility. In short, robots.txt was not built for AI and offers publishers no reliable protection or selective control.
Generic Creative Commons (CC) licenses likewise fall short in practice. CC licenses give broad reuse permissions (e.g. CC BY) but contain no mechanisms to enforce attribution or block AI training. In fact, CC explicitly cannot override existing copyright exceptions, so “licensors cannot use [CC] to prohibit a use if it is otherwise permitted”. As a result, many AI developers proceed as though CC content is free material, often ignoring attribution requirements. The fundamental CC ethos of reciprocal sharing (“give credit back to the original creator”) is violated when AI crawlers “strip out any reference to the original creator”. In practice, open-licensed content is swept up at scale without links, credit or compensation. Thus neither robots.txt nor generic licenses can practically manage modern AI usage.
The Conceptual Content Rights Layer
What’s needed instead is a rights layer: a standardized, machine-readable metadata layer that travels with content on the open web. Such a layer would embed key information (ownership, license terms, attribution rules, and even pricing) directly into web content or crawl protocols. Technically, this requires adding structured metadata (e.g. in HTML or a linked XML/JSON file) that tells any automated agent exactly how it may use the content – for free, with attribution, or only under paid license. In essence, publishers would publish explicit terms for AI usage rather than hoping robots.txt or legalese suffice.
Key capabilities of this layer would include:
Machine-readable licenses: A clear standardized format (beyond a simple “yes/no”) so AI systems automatically parse whether they may train on or quote the content.
Attribution metadata: Embedded pointers (URLs, author names, timestamps) that require AI outputs to cite original sources. This could help enforce the moral right of attribution.
Payment signaling: Options for specifying per-use fees or subscriptions. Rather than “all-you-can-eat” scraping, the layer could define a micropayment model (e.g. pay-per-query or pay-per-crawl) so creators get compensated.
Content labeling: Tags for sensitive content (legal status, privacy), so AI systems can respect laws beyond copyright (as CC noted, licenses don’t cover privacy).
Without such a layer, publishers have “no effective and standardized licensing system” for online content, especially as AI transforms consumption. By contrast, a rights layer creates a permanent data record of consent, mirroring how CC licenses added a machine-readable RDF layer in the past. In short, an interoperable rights layer would make content self-describing for AI: who authored it, how to attribute it, and what (if any) payment is due.
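A record in such a rights layer might look something like the JSON below, embedded in the page or served alongside it. The field names, values, and fee are purely illustrative (no such standard exists yet), but they show how ownership, attribution, and payment terms could travel with the content.

```json
{
  "content_id": "https://example.com/articles/ai-licensing-explainer",
  "owner": "Example Media Ltd",
  "author": "Jane Doe",
  "published": "2025-03-14",
  "license": {
    "human_reading": "free",
    "search_indexing": "allowed",
    "ai_training": "prohibited-without-license",
    "ai_quotation": {
      "allowed": true,
      "attribution_required": true,
      "fee_per_use_usd": 0.002
    }
  },
  "attribution": {
    "cite_url": "https://example.com/articles/ai-licensing-explainer",
    "display_name": "Example Media"
  },
  "licensing_contact": "rights@example.com"
}
```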
The Technical Standards: Emerging Protocols
Recent initiatives exemplify how a rights layer might work. The Really Simple Licensing (RSL) standard – backed by Reddit, Yahoo, Medium and others – enriches robots.txt with detailed licensing terms. In RSL, a site’s robots.txt includes a License: directive pointing to a machine-readable license file. That file specifies granular usage rules (e.g. “free with credit”, “pay $0.01 per crawl”, “subscribe for full access”). This allows publishers to define multiple models (free, attribution-only, paid per-crawl/inference, etc.) in one place. For example, the RSL XML can demand a fee each time a model queries or cites the content. Embedding a license as metadata turns a passive website into an active negotiator with AI crawlers.
Figure: Illustration of the Really Simple Licensing (RSL) standard, which adds machine-readable license and payment terms to content. RSL is essentially robots.txt ++ – publishers can choose a “license model” for each site or page. Early adopters even formed an RSL Collective to broker payments, much like ASCAP for music. Still, RSL is voluntary: its power depends on AI companies honoring the directives.
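In outline, the mechanism looks something like the snippet below: a directive in robots.txt points crawlers to a license document that encodes per-path terms. The element names and structure are a simplified illustration of the idea rather than the published RSL schema.

```text
# robots.txt
User-agent: *
Allow: /
License: https://example.com/license.xml

# license.xml (simplified illustration, not the official RSL schema)
<license>
  <content path="/blog/*">
    <permits usage="ai-summarization" terms="attribution"/>
  </content>
  <content path="/premium/*">
    <permits usage="ai-training" terms="payment"/>
    <payment type="per-crawl" amount="0.01" currency="USD"/>
  </content>
</license>
```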
Beyond RSL, other technical approaches are emerging. Cloudflare’s “Pay-per-Crawl” (launched July 2025) uses standard HTTP responses. If an AI bot requests a page, the server can reply with HTTP 402 “Payment Required” and pricing info. The crawler then either pays (to get a 200 OK) or is denied (403). This gives publishers a domain-wide price per visit and the options to allow, charge, or block any bot. Similarly, the new x402 protocol (spearheaded by Coinbase and Cloudflare) revives HTTP 402 as an open micropayment layer for AI. In x402, servers reply with 402 and a blockchain-based payment request; AI clients automatically handle the payment handshake and then receive the content. In all these systems, the common idea is: turn data access into an explicit transaction.
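From a crawler's perspective, these schemes reduce to handling the 402 status code before fetching content. The Python sketch below illustrates the pattern; the header names and the pay() helper are hypothetical placeholders, not the exact wire format used by Cloudflare's pay-per-crawl or x402.

```python
# Sketch of a crawler that honors HTTP 402 "Payment Required".
# Header names and the pay() callable are hypothetical placeholders; real
# schemes (Cloudflare pay-per-crawl, x402) define their own wire formats.
import requests

def fetch_with_payment(url, pay, budget_usd=0.05):
    headers = {"User-Agent": "ExampleAIBot/1.0"}
    response = requests.get(url, headers=headers)
    if response.status_code == 402:
        price = float(response.headers.get("X-Crawl-Price-USD", "inf"))
        if price > budget_usd:
            return None                      # too expensive: skip this source
        receipt = pay(url, price)            # settle the charge out of band
        response = requests.get(url, headers={**headers, "X-Payment-Receipt": receipt})
    if response.status_code == 200:
        return response.text                 # access granted (free or after payment)
    return None                              # 403 or other refusal: publisher declined
```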
Thus, whether via RSL metadata or HTTP payment codes, the goal is standardizing how content owners state their AI-usage terms. These standards are still in infancy and rely on voluntary compliance, but they illustrate the technical requirements of a rights layer – precise, machine-readable licenses; built-in payment protocols; and integration with web infrastructure. Over time these building blocks could coalesce into a robust Rights Layer that aligns with both technology and emerging law.
Legal Standards and Regulatory Alignment
Technical protocols will gain force when backed by law. The EU’s AI Act (effective 2024‑25) is explicitly designed to do just that. It requires large AI models to document “training data provenance” and respect copyright norms. For example, LLM providers must publish summaries of their data sources and explain how copyright is (or isn’t) respected. The Act essentially treats data rights as legal obligations: “consent, transparency, and provenance are no longer just best practices – they’re obligations,” noted one commentator. In practice, this gives courts and regulators a foundation to enforce publisher preferences: a publisher’s opt-out or license notice can be treated as a binding signal.
The Act also encourages publishers to use tools like robots.txt and the new llms.txt (see below) as formal markers of intent. As one analysis advises, “Revisit your robots.txt and llms.txt files: these files are no longer symbolic. They’re technical expressions of legal boundaries and AI companies are now expected to respect them”. In short, the legal trend is to reward transparency and punish unconsented data use. For copyright owners, this means a standardized rights layer aligns neatly with new regulations. Publishers with clear machine-readable licensing terms will have far stronger legal leverage under the AI Act and related laws than those relying on vague contracts or tech workarounds.
(Meanwhile in the U.S., courts and the Copyright Office are grappling with these issues through cases and inquiries. Several class actions have already been filed against AI developers for using copyrighted text or images without permission – a clear sign that the legal pendulum is swinging toward requiring some form of data licensing. But the EU AI Act is the first global framework formally linking rights metadata with accountability.)
Defensive and Offensive Content Strategies
Defensive Measures
Some creators are taking a more adversarial stance to protect their work. One provocative approach is data poisoning: deliberately seeding content with subtle “poison” that confuses AI. For images, tools like Nightshade let artists change pixels in ways imperceptible to humans but catastrophic to models. For instance, a poisoned image of a dog can teach a model that trains on it to see a cat instead. Studies and demos show such poisoning can scramble generative outputs, causing AI tools to produce wildly inaccurate images from poisoned sources. This tactic essentially “spoils” the data stream, in the hope that if AI models train on poisoned content their outputs will degrade – incentivizing developers to avoid such data or pay for clean sources.
Other adversarial defenses are emerging. Infrastructure-level blocks have appeared: notably, Cloudflare now blocks unidentified AI crawlers by default, effectively enforcing robots.txt site-wide without action by individual publishers. Traditional bot defenses – CAPTCHAs, login walls or dynamic content – can also bar rudimentary scrapers. Some sites even insert hidden “honeypot” text or links to detect bots. In practice, defensive tactics range from innocuous (blocking known AI user-agents) to extreme (embedding dummy data). Each has trade-offs: for example, poisoning could erode model performance (a win for creators) but might also backfire if it inadvertently poisons the very data ecosystem. In any case, these methods are stopgap measures until rights-based solutions mature.
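At the application level, the most basic defensive control is refusing requests from self-identified AI crawlers. The WSGI middleware below is a minimal sketch; the user-agent strings are examples of publicly documented AI crawler identifiers, and production deployments typically enforce this at the CDN or web-server layer instead.

```python
# Minimal WSGI middleware that refuses requests from known AI crawler
# user-agents. The blocklist entries are examples; self-identification is
# voluntary, so this stops only well-behaved crawlers.
BLOCKED_AI_AGENTS = ("GPTBot", "CCBot", "Google-Extended", "ClaudeBot")

def block_ai_crawlers(app):
    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "")
        if any(bot.lower() in user_agent.lower() for bot in BLOCKED_AI_AGENTS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"AI crawling is not permitted without a license."]
        return app(environ, start_response)
    return middleware
```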
Offensive GEO Strategies
“Offense” here means optimizing content so that, if AI consumes it, the site benefits rather than loses. This Generative Engine Optimization (GEO) is like SEO for an AI world. Key tactics include:
Structured Data (Schema.org, JSON-LD, etc.): Rich metadata helps LLMs digest a site’s facts accurately. Recent research shows that schema markups (as collected in Common Crawl) get transformed into plain “verbalized facts” in LLM training. In effect, well-maintained structured data acts as a direct input to the AI’s knowledge graph. By annotating articles with clear entity and fact tags (author, title, date, product specs, FAQs, etc.), publishers ensure that accurate, verifiable information about their content flows into AI models. As one analysis puts it, structured data is the “foundation of machine-readable brand management” – it anchors facts about a company or topic in the AI’s world model.
Semantic Linking: How pages link to each other and to external sites informs an LLM’s understanding. In traditional SEO, links transfer authority; in GEO, links create meaning. The Web Data Commons project shows that hyperlinks form a semantic map of the web. Frequently co-linked topics reinforce each other in the model’s “language space.” For example, linking a new recipe post to a well-known cooking site teaches the AI that those concepts are related. In practice, publishers can bolster their authority by smart cross-linking (e.g. linking each blog post to a canonical knowledge hub or citing authoritative sources), because every link in context becomes a data point for the AI’s knowledge graph.
Answer-First Content: A more subtle change in writing style is emerging. Whereas old web writing often delays the answer, modern AI-driven content is being restructured so the main point comes immediately. In “answer-first” writing, the first sentence or paragraph delivers the concise response to the user’s question. Subsequent text simply elaborates. This mirrors how AI assistants parse content: they scour pages for the quickest snippable answer. Indeed, SEO experts note that placing the answer upfront “makes your content a prime candidate” for an AI-generated snippet. In short, writing in a direct Q&A format – with the direct answer at the top – signals to generative engines that your page is highly relevant, increasing the chance your content will be cited in an AI response.
Other GEO tactics include optimizing page load (so that crawlers easily fetch content), using clear headings/questions (so AI can identify topics), and even republishing key insights as FAQ or data tables. The overall goal is to make your content the source of truth for queries: well-structured data and writing that anticipates AI prompts. When done ethically, GEO can help publishers capture traffic (and brand visibility) even as AI disintermediates search.
The llms.txt Standard
As generative crawlers proliferate, a new protocol called llms.txt has been proposed to give publishers fine control over AI indexing. Conceptually akin to robots.txt, an llms.txt file (placed at a site’s root) would explicitly tell AI bots what they may or may not do. For example, a publisher could whitelist certain pages for summarization and blacklist others entirely. The idea is to provide a “curated roadmap” for AI: the file lists your preferred policies and priorities for machine ingestion. Importantly, llms.txt is about guidance, not enforcement: the standard is voluntary. Early drafts note that some AI crawlers will honor llms.txt and others may ignore it, much like early days of robots.txt.
Nevertheless, llms.txt is gaining traction as part of a larger strategy. By tagging sections of content (e.g. marking premium articles as off-limits to AI), publishers create an “AI-safe” content layer. Over time, llms.txt could interface with pay-per-crawl systems: for instance, llms.txt might list only the pages available for paid ingestion. As one analysis explains, llms.txt is “the velvet rope between your content and AI crawlers” – potentially the price of admission for companies that want to control how AI scrapes, summarizes, or pays for content. In practice, it may evolve into a standard part of every publisher’s toolkit: signaling preferences to AI at scale, analogous to how we use sitemaps and schema today.
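Because the proposal is young, there is no single settled syntax; early drafts use a markdown-style file at the site root. The example below is an illustrative sketch of the kind of guidance such a file might carry, with hypothetical sections and URLs.

```text
# Example Media

> Independent reporting on technology and policy. AI systems may summarize
> pages listed under "Open" with attribution; pages under "Restricted"
> require a license (see /ai-licensing).

## Open
- [Explainers](https://example.com/explainers/): evergreen reference articles
- [Public data](https://example.com/data/): datasets released under CC BY

## Restricted
- [Premium investigations](https://example.com/premium/): licensed access only
```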
Ethical and Unethical GEO
Optimizing for AI raises ethical questions. Ethical GEO means structuring and presenting content that genuinely helps users and builds trust. It involves providing accurate facts, citations, and useful summaries – even if that makes your content easier for AI to parse and quote. In this sense, GEO can be seen as modern “answer engine optimization” that rewards transparency and credibility.
By contrast, unethical GEO includes tactics that game the system without adding value. Generating superficial or misleading content purely to trigger AI algorithms falls into this trap. For instance, deliberately stuffing pages with AI-relevant keywords or clickbait questions (while delivering low-quality answers) would mislead both users and models. As one SEO analysis warns, using AI to create content solely “aimed at being optimized for search engines instead of being helpful” is unethical and violates guidelines. In other words, if an AI had a human editor, would that editor reward the tactic? If not, it’s likely unethical.
Search platforms themselves are also aligning with ethical principles. Google, for example, will permit AI-generated content only if it is high-quality, accurate, and transparent about sources. It forbids content created just to manipulate rankings. Thus, publishers should focus on trust – e.g. by using structured data and clear language so that when AI quotes the site, it includes correct attribution and context. One practical ethos is to publish first-class, well-sourced content and let good quality be “rewarded” by AI (in featured summaries), rather than resort to secret tricks that might backfire.
In summary, the offensive use of AI-aware techniques can be done ethically – by enhancing clarity, verifiability and user value – or unethically – by churning out misleading or coercive content. Ethical GEO emphasizes accuracy and transparency; unethical GEO tries to trick or “spam” AI systems. Publishers and content creators must tread carefully, ensuring that in seeking AI visibility, they still serve the user’s best interests and intellectual honesty.
Market Definition and Dominance in the AI Ecosystem
Defining the relevant market in AI is complex. Economists consider multiple layers: upstream inputs (e.g. high-quality training data, cloud compute, AI talent) and downstream outputs (e.g. foundation models or generative search services). In practice, this might involve a separate market for generative search (AI chat interfaces providing direct answers) distinct from traditional search. Likewise, markets for “foundation model services” (the supply of large pre-trained models via API or downloads) and for content licensing (rights to text, images, code) all warrant consideration. For example, regulators might examine if Google’s generative AI integrates search, models, and data in one platform, suggesting a single “AI platform” market.
Market power in AI often stems from data-driven network effects. Large LLMs improve as more users interact with them: each query can generate training data or feedback that the firm can use to refine the model. This creates a feedback loop: more users → more data → better performance → even more users. Yale scholars call this a “significant data network effect” that gives first-mover advantages and tends toward monopoly in generative AI. In economic terms, like a phone or social network, the value of an AI model grows with its scale of use, making big models ever more dominant. Such network effects also compound other scale economies (e.g. vast compute infrastructure), reinforcing a virtuous cycle of incumbent dominance.
“The Gatekeeper” problem highlights how big tech firms can leverage this dominance. Under the EU’s Digital Markets Act (DMA), companies may be designated as gatekeepers if they control core services (e.g. search engines, operating systems). The DMA already covers Google, Apple, Microsoft, Meta, etc., and the EU is exploring whether large AI offerings should count as core platform services. In practice, a gatekeeper in AI can use its platform power to entrench model dominance (e.g. pre-installing an LLM on devices or bundling AI into an OS). As one expert notes, DMA enforcement is focusing on issues like cross-service data use, but so far has not fully tackled “the massive data advantages held by gatekeepers” in AI. Notably, in the U.S. Google was allowed to keep its Android and Chrome bundling despite antitrust concerns – the judge instead only required Google to share some data with rivals, implicitly trusting AI competition to emerge.
Assessing market power in these data-driven markets is a new challenge. Traditional measures (market share of revenue or unit sales) may not capture a model’s influence. Competition authorities are developing forward-looking methodologies: sector inquiries can map the whole AI “stack” and identify bottlenecks (e.g. rare data, specialized chips), and economic models can simulate how feedback loops amplify dominance. For instance, the European Commission’s AI sector brief suggests considering “unconventional” metrics – number of model downloads, activity volumes, compute capacity – to gauge an AI platform’s reach. In short, researchers look for signs of “data network effects,” “gatekeeping,” and concentration at bottleneck layers to flag potential abuse.
Exclusionary Practices and Barriers to Entry
The data barrier is among the highest hurdles for new AI entrants. High-quality training content (books, news, code, images) is costly to obtain and often locked behind copyright licenses. Major AI firms have started signing deals with publishers (e.g. news media, code repositories), giving them preferential access to curated data. As economist Martin Peukert argues, these agreements effectively raise fixed costs that only incumbents can bear: “only a few large AI corporations are likely able to purchase access to this quality data”. Exclusive licenses (or informal preferential access) can “tie up” essential content, leaving scrappy startups with scraped or low-quality data. In effect, data becomes a utility that newcomers cannot tap at the same scale, so the market tips further in favor of entrenched models.
Another key concern is tying and bundling. Big tech can bundle generative AI into operating systems, browsers, or software suites, making them hard to avoid. For example, Microsoft’s Copilot is integrated across Windows and Office, Google is embedding Gemini (formerly Bard) into Chrome and Android, and Apple is adding generative features into iOS. These practices can foreclose independent AI apps by pre-installing an incumbent’s service as the default assistant. US antitrust cases show this plays out: in the Google search trial, the judge let Google keep bundling search into Android/Chrome, despite the government’s complaints, largely because he believed new AI search tools would offset it. Critics argue this ignores how integration actually raises switching costs: an AI assistant baked into your OS is much stickier than a separate app, potentially tipping market power.
Pricing and compensation models are also under scrutiny. Subscription fees (e.g. ChatGPT Plus) give predictable revenue but concentrate the benefits among heavy users and may exclude lower-income consumers. Usage-based pricing (pay-per-query or per-token) aligns cost with use but can deter experimentation or real-time use. A new hybrid has emerged: pay-per-use for content owners. For instance, some publishers and AI platforms are experimenting with “pay-per-crawl” or micro-royalties: whenever an AI bot uses a publisher’s content in a generated answer, the publisher gets a small fee. Early pilots like Perplexity (in partnership with Cloudflare) charge AI companies based on how often their crawler hits a publisher’s site. This shifts value back to creators and could reshape content strategy – AI engines might favor content they can afford, or share ad revenues with creators. But adoption is voluntary so far; without regulatory or industry pressure, firms could stick to one-time flat fees (or nothing) if unchecked.
The threat of vertical integration looms large. If an AI provider also owns content pipelines, it can favor its own data and squeeze rivals out. For example, imagine a tech giant that both produces streaming content and offers a chatbot: it could train its model preferentially on its own shows. Similarly, major cloud providers (AWS, GCP, Azure) are rolling out foundation models on their platforms; this could skew the market if competitors’ models are harder to access. Legal scholars warn that integration across the AI stack (from chips to models to apps) “restricts the number of providers at downstream layers,” reducing innovation. A merged AI-content firm could “self-preference” its own training material or lock customers into using its ecosystem only. In sum, dual ownership of data and model layers risks foreclosure and calls for close antitrust oversight.
Policy Solutions and Regulatory Intervention
Proposed remedies span from interoperability mandates to antitrust divestitures. Some experts urge mandating interoperability and open interfaces in generative AI. Analogous to telecom or internet standards, a “Content Rights Layer” could ensure that an AI platform recognizes the licensing status of each work and pays when it uses it. For instance, if an AI model must check a registry before quoting an article, it could trigger a micropayment. Policymakers might require platforms to adopt such protocols (or use data tagging standards), so content creators can enforce their rights in AI outputs. This idea parallels EU’s Digital Markets Act, which compels gatekeepers to open up key services; it could be extended to make AI services honor licensing (i.e. not scrape blocked sites) and interoperate with content-blocking technologies.
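In code, the obligation described above reduces to a lookup-then-settle step before a platform quotes a work. The sketch below is purely hypothetical: the registry, its fields, and the payment interface stand in for whatever interoperable rights layer regulators might eventually mandate.

```python
# Hypothetical compliance step an AI platform could run before quoting a
# work: consult a rights registry, honor a prohibition, and trigger a
# micropayment when the recorded terms require one. The registry and
# payment objects are illustrative placeholders, not an existing API.
def quote_with_rights_check(content_id, excerpt, registry, payments):
    terms = registry.lookup(content_id)            # machine-readable license terms
    if terms is None or not terms.get("quotation_allowed", False):
        return None                                # no recorded consent: do not quote
    fee = terms.get("fee_per_quotation_usd", 0.0)
    if fee > 0:
        payments.settle(payee=terms["owner"], amount_usd=fee, reference=content_id)
    citation = terms.get("cite_url", content_id)
    return f'"{excerpt}" (source: {citation})'     # quotation with attribution attached
```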
There is also debate over data access rules. Some proposals call for data-sharing mandates in the vein of the EU’s new Data Act or California’s emerging AI transparency laws. For example, California’s Generative AI law (AB-2013, effective 2026) requires providers to disclose sources of training data, which could pressure firms to license more content. More radically, governments might create public data repositories or research “common data pools” to lower barriers (the U.S. NAIRR initiative is a step in this direction). However, critics caution that forcing data sharing could backfire if not done carefully: overregulating access might stifle investment in data gathering or run afoul of copyright law. Nevertheless, many agree on some form of “fair access” regime – e.g., standardized API licensing with nondiscriminatory terms – to prevent gatekeeping of raw training inputs.
Antitrust remedies are a final line of defense. Structural remedies (breakups or spin-offs) might include forcing divestiture of key assets (like breaking off a search engine from an AI unit). In the DOJ’s case against Google, the government originally sought Google’s divestment of Chrome and forced dissolution of certain contracts, but the final ruling imposed only behavioral fixes. Future cases could pursue more aggressive splits: academic proposals suggest, for instance, barring dominant search firms from owning stakes in large AI labs. Behavioral remedies could forbid exclusivity (no more pay-to-play deals) or mandate open APIs for models. The EU’s DMA is itself a prophylactic behavioral regime for “core platform” problems; a similar ex-ante framework might be needed specifically for AI platforms (e.g. requiring model access for certified rivals). Importantly, any remedy should consider the uniqueness of data: classic antitrust looks at price-effects, but here regulators must consider the “thickets” of data and code ownership that underlie market power.
Finally, regulators worry about the future of the creator economy. Human authors, artists, and developers supply the content that AI consumes. To sustain creative industries, policies may guarantee equitable compensation. This could mean statutory licensing schemes or collective management organizations for AI use (some analogize it to how music royalties are handled). Voluntary deals are one path (e.g. the high-profile $250M news licensing agreements), but competition advocates argue for rules: for example, requiring AI firms to pay a set royalty rate per use of copyrighted content. According to a CISAC study, up to €22 billion of music and film creator revenues could be at risk by 2028 if AI grows unchecked. Usage-based payments (per-crawl or per-inference fees) are seen by many as a fairer model: under such systems, the “digital divide” of AI’s value creation would be partially narrowed. Policymakers might also support “AI safety nets” like public funds or tax incentives for creators who license content to AI. In all cases, the goal is to ensure that AI-driven platforms do not strip all value from human creativity, but instead channel some of it back to those who produced the original content.