LLMs, Media Licensing, and the Rise of Generative Engine Optimisation (GEO): The Battle for Trusted Content
Introduction
For two decades, digital visibility was defined by one acronym: SEO. Brands fought for Google rankings through backlinks, keywords, and content strategies designed to please algorithms. But the rise of generative AI has changed the game. Today, when someone asks ChatGPT, Gemini, Claude, or Perplexity a question, they don’t see ten blue links—they see one authoritative answer.
That answer is not simply scraped from the web. Increasingly, it is shaped by media outlets and data providers that have signed licensing agreements with LLM vendors. These publishers are becoming the new gatekeepers of visibility. If your brand isn’t mentioned in their pages, it risks invisibility in the AI-first internet.
This shift signals the dawn of Generative Engine Optimisation (GEO). Unlike SEO, which was about gaming crawlers, GEO is about being present in the knowledge pipelines that feed LLMs. Mentions in licensed, high-trust outlets are now as critical as backlinks once were.
To help brands, PR agencies, and policymakers navigate this new terrain, we introduce the Generative AI Licensing Agreement Tracker—a live intelligence system mapping:
Which publishers have licensing deals with AI vendors,
Which LLMs use those licensed sources,
How trust and authority signals flow into AI-generated answers.
Just as tools like Moz and SEMrush became indispensable for SEO, this Tracker provides the competitive intelligence layer for the GEO era—revealing where influence flows, how brands can secure AI visibility, and why PR strategy must adapt.
Phase 1: The Wild West (Pre–2022)
1. Scraping at Scale — “The Bigger, The Better”
Early LLM builders (OpenAI, Google, Meta, EleutherAI, Hugging Face, etc.) operated with the mindset that more data = better performance.
Training corpora like Common Crawl (billions of web pages), Wikipedia, Reddit conversations, GitHub code repositories, arXiv preprints, and digitized books were ingested wholesale.
The goal wasn’t selectivity, but sheer size — a race to build the largest “general-purpose” text prediction models possible.
No consent, no compensation: content from journalists, authors, and developers was absorbed without acknowledgment or royalties.
2. “Fair Use” Assumption — A Legal Grey Zone
Researchers leaned heavily on the US legal doctrine of “fair use” — assuming training on publicly accessible text was permissible since the output wasn’t a verbatim copy.
Courts hadn’t ruled on AI training yet, so there was little legal precedent.
Developers positioned themselves under research and non-commercial exemptions (even when projects were quietly commercial in intent).
This ambiguity created a legal buffer that allowed rapid development — but planted the seeds for future lawsuits.
3. No Transparency — Data Laundering by Obscurity
Instead of disclosing sources, researchers used vague descriptions like “internet-scale corpora” or “large, diverse datasets.”
Model cards (technical documentation) often listed categories of data but not actual sources or proportions.
This lack of transparency served two purposes:
Avoid scrutiny from rights holders.
Preserve competitive advantage by not revealing data recipes.
Example: GPT-3’s training data was described only in broad categories — filtered web text, unidentified books corpora, Wikipedia — without revealing the actual sources behind them.
4. Incentive Structure — Scale Above All Else
LLM performance was measured by perplexity (how well a model predicts the next token) and benchmark scores, both of which improved with more data.
Researchers believed data quantity > data quality, so duplication, bias, and misinformation were tolerated if they helped scale.
The arms race was fueled by:
OpenAI releasing GPT models in quick succession (GPT-2 in 2019, GPT-3 in 2020).
Google releasing T5 and PaLM.
EleutherAI building GPT-Neo and GPT-J using open data.
This created a “scrape now, ask questions later” culture — prioritize dominance in benchmarks, worry about legality later.
5. Early Warning Signs (Ignored at the Time)
Reddit: Moderators began questioning whether communities wanted their content used for training.
GitHub: Developers complained Copilot was regurgitating code snippets verbatim.
Artists & Authors: Creators sounded early alarm bells after noticing AI outputs that mirrored their work.
Despite warnings, AI labs pressed ahead, betting that regulators and lawsuits would move too slowly to stop them.
Phase 2: Cracks Emerge (2022–2023)
1. Media Realizes Its Value — AI Needs Journalism
As ChatGPT (Nov 2022) exploded into public consciousness, journalists and publishers tested it with prompts — and found their own work echoed back.
Investigations showed ChatGPT and other LLMs could summarize, paraphrase, or even closely mimic articles from outlets like The New York Times, Reuters, BBC, and Bloomberg.
Publishers recognized a new reality: their archives were central training fuel for LLMs — without them, answers would be shallow, less credible, and less trusted.
Suddenly, journalism wasn’t just news — it was the backbone of AI knowledge.
2. Early Legal Pushback — First Test Cases
2022: GitHub Copilot lawsuit — Developers filed a class action against GitHub, Microsoft, and OpenAI, alleging Copilot reproduced code snippets verbatim, bypassing open-source licenses.
This was the first high-profile case to test whether AI training amounts to copyright infringement.
2023: Authors vs. OpenAI & Meta — Groups of authors (e.g., Sarah Silverman, Paul Tremblay) sued, claiming LLMs ingested and regurgitated their books.
2023: Visual artists lawsuits — Stability AI, Midjourney, and DeviantArt were sued for scraping images without consent, raising parallels for text/media.
2023: New York Times sues OpenAI & Microsoft — alleging GPT-4 outputs contained near-verbatim excerpts of NYT reporting, positioning the paper as the flagship test case for journalism’s value in AI.
3. Narrative Shift — From Innovation to Accountability
Phase 1 narrative: “LLMs are a groundbreaking technology that democratizes knowledge.”
Phase 2 narrative: “Whose knowledge is being democratized — and who profits?”
Intellectual property, consent, and compensation entered the spotlight.
Policymakers, academics, and creators reframed the debate around rights, fairness, and sustainability of the media ecosystem.
4. Public Debate — IP Meets Ethics
New societal questions gained traction:
“If LLMs depend on news, books, and journals — shouldn’t publishers, authors, and artists get paid?”
“Is training AI on copyrighted material without consent a form of plagiarism at scale?”
“If AI replaces the need to visit news sites, are we undermining the financial survival of journalism?”
Ethical critiques deepened:
Plagiarism: LLMs could echo creative works without attribution.
Disinformation: AI-generated articles blended truth with errors, blurring credibility.
Hallucinations: Fabricated quotes, fake references, and invented facts risked polluting the information space.
5. Tensions Peak — Industry, Legal & Cultural Flashpoints
Media vs. Tech giants: Media outlets began exploring collective bargaining, echoing the music industry’s shift from fighting Napster to licensing through services like Spotify.
Academia & science publishers: Elsevier, Springer Nature, and IEEE started tightening API and access rules, wary of being scraped.
Governments & regulators: Early calls for AI copyright frameworks began in the EU and US.
Public awareness: Users began asking whether ChatGPT’s outputs were original thought or recycled journalism.
Phase 3: Legal & Regulatory Pressure (2023–2024)
1. Lawsuits Mount — Copyright Goes Mainstream
New York Times vs. OpenAI & Microsoft (2023)
The flagship lawsuit — alleging GPT-4 reproduced verbatim excerpts of NYT reporting.
Framed as a “Napster moment” for AI: journalism as the test case for whether mass scraping is theft or fair use.
Getty Images vs. Stability AI (2023)
Claimed Stability AI used millions of copyrighted images without consent — a visual-media test case whose arguments bled directly into the debate over text and other media.
Authors Guild vs. OpenAI & others (2023)
Writers, novelists, and screenwriters joined forces to demand recognition and royalties.
Music industry enters (2023–2024): Labels and artists push back against AI-generated music trained on copyrighted catalogs.
Result: Content creators across text, image, and audio industries united in a single narrative:
“LLMs are trained on our work. We deserve compensation.”
2. Regulatory Scrutiny Intensifies
European Union – AI Act (draft 2023, finalized 2024):
Required developers of general-purpose AI models to publish summaries of the content used for training.
Proposed rules on copyright compliance and opt-outs for publishers.
Raised the bar globally — compliance in the EU effectively forced transparency everywhere.
United States:
Lawmakers began questioning whether “fair use” covers wholesale ingestion of copyrighted works.
FTC & DOJ raised antitrust and consumer protection angles: if AI replaces publishers, are we killing journalism?
United Kingdom:
Post-Brexit, the UK sought to position itself as an AI-friendly hub, but faced backlash from publishers demanding stronger protections.
Other Regions:
Australia and Canada (both of which had already battled Facebook and Google over news payments) and India started circling AI in their policy frameworks.
3. Risk Perception Changes Inside AI Vendors
Legal risk = existential risk.
If lawsuits succeed, damages could reach billions in retroactive liability.
Example: the NYT complaint alone sought billions of dollars in damages tied to lost traffic and subscription value.
Reputational risk.
AI companies risked being seen as pirates exploiting creators.
The “innovation” narrative began collapsing under the weight of IP-theft claims.
Commercial risk.
Enterprise clients (banks, governments, pharma) demanded legally safe models.
Vendors realized they couldn’t sell to big clients without clean data provenance.
4. Market Gap Identified — Birth of AI Licensing Ecosystem
The industry began recognizing the need for structured licensing deals, mirroring:
Music licensing (Napster → iTunes → Spotify).
Stock photography (Getty, Shutterstock).
During 2023, the first deals were struck:
AP & OpenAI (July 2023): The first newswire licensing deal, giving OpenAI access to AP’s text archive for model training.
Axel Springer & OpenAI (Dec 2023): Business Insider, Politico, and Bild content licensed for use in ChatGPT.
Analysts began predicting the rise of a “Content Rights Layer” for AI — essentially a clearinghouse of licensed, traceable data for LLM training and outputs.
Content Rights Layer
Large Language Models (LLMs) and generative AI systems have entered mainstream use, transforming how information is created, consumed, and distributed. These systems were initially trained on vast amounts of publicly available data scraped from the web, often without consent. As lawsuits from publishers, authors, and rights holders mount, the AI industry faces a critical inflection point: continue in a legally precarious “Wild West,” or establish a structured framework for content licensing.
The emerging solution is the Content Rights Layer (CRL) — an infrastructural standard for sourcing, licensing, and tracking data used in AI model training and outputs. Much like protocols for digital rights management (DRM) in media or clearinghouses in financial markets, the Content Rights Layer provides the mechanisms for accountability, monetization, and compliance in the AI era.
Why the Content Rights Layer is Needed
Legal Pressure
Ongoing litigation, such as New York Times v. OpenAI and Getty Images v. Stability AI, has made it clear that unlicensed scraping is legally vulnerable.
Retroactive liability could cost AI vendors billions, threatening the commercial viability of foundation models.
Commercial Demand
Enterprises and governments require data provenance for compliance and risk management.
“Legally clean” models will become a prerequisite for adoption in regulated industries like finance, healthcare, and defense.
Publisher Incentives
Media outlets, academic publishers, and rights holders see their content as core intellectual property.
Without a rights layer, AI threatens their economic sustainability by substituting for their work rather than supporting it.
Consumer Trust
Users demand higher accuracy and less hallucination. Transparent sourcing and citation improve credibility.
Core Functions of the Content Rights Layer
The Content Rights Layer can be conceptualized as a middleware stack between content owners and AI vendors. Its functions include:
Content Ingestion and Tagging
Structured ingestion of publisher datasets, tagged with ownership metadata, usage rights, and licensing terms.
Support for multiple formats: text, audio, video, images, and code.
Rights Management
Encodes the legal permissions associated with each dataset.
Defines terms such as: training-only, inference-only, citation-required, attribution-style, commercial vs. non-commercial.
Payment and Royalty Infrastructure
Microtransaction or subscription models to compensate rights holders per usage.
Similar to how Spotify pays artists per stream, LLMs could pay publishers per token trained or per inference that cites their work.
Auditability and Traceability
Cryptographic proof of content ingestion and usage.
Watermarking and hashing to verify provenance.
Audit logs accessible to both vendors and regulators.
Interoperability Layer
APIs enabling LLMs to query licensed content in real time.
Standardized schemas for metadata exchange between publishers, clearinghouses, and AI vendors.
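To make these functions concrete, the sketch below models how a single licensed asset might be represented inside a Content Rights Layer: ownership metadata, permitted uses, and royalty terms. Every name and figure here — LicensedAsset, per_token_rate, the rates themselves — is an illustrative assumption, not an existing standard.

```python
from dataclasses import dataclass
from enum import Enum

class Usage(Enum):
    TRAINING = "training-only"
    INFERENCE = "inference-only"
    CITATION_REQUIRED = "citation-required"

@dataclass
class LicenseTerms:
    usages: set               # which Usage values the rights holder permits
    commercial: bool          # commercial vs. non-commercial use
    attribution_style: str    # e.g. "outlet name + article title"
    per_token_rate: float     # royalty per token used in training (illustrative)
    per_citation_rate: float  # royalty per inference output citing the asset

@dataclass
class LicensedAsset:
    asset_id: str      # stable identifier assigned at ingestion
    publisher: str     # rights holder
    media_type: str    # text, audio, video, image, or code
    content_hash: str  # hash of the ingested content, for provenance checks
    terms: LicenseTerms

# Example: a news article licensed for training and citation-required inference
article = LicensedAsset(
    asset_id="asset-000123",
    publisher="Example Newswire",   # hypothetical publisher
    media_type="text",
    content_hash="sha256:...",      # computed at ingestion time
    terms=LicenseTerms(
        usages={Usage.TRAINING, Usage.CITATION_REQUIRED},
        commercial=True,
        attribution_style="outlet name + article title",
        per_token_rate=0.000002,    # purely illustrative figures
        per_citation_rate=0.001,
    ),
)
```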
Architectural Model
The Content Rights Layer could follow a federated architecture:
Publisher Nodes
Individual media companies, publishers, and dataset owners.
Maintain content repositories with licensing metadata.
Clearinghouse / Exchange
Acts as a marketplace and settlement layer.
Aggregates content, handles licensing transactions, and distributes royalties.
AI Vendor Integrations
Model builders integrate APIs to access licensed data.
Access permissions are enforced at the middleware level, not left to vendor discretion.
Regulatory Interfaces
Provides transparent audit trails for compliance.
Can plug into national or supranational AI regulators (e.g., EU AI Office).
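As a rough illustration of how these federated pieces might interact, the sketch below reuses the royalty terms from the previous example: a publisher node registers an asset with the clearinghouse, an AI vendor reports token usage, and the clearinghouse settles royalties while keeping a hash-chained audit log a regulator could verify. The class names, settlement formula, and hashing scheme are assumptions for illustration, not a description of any deployed system.

```python
import hashlib
import json
from collections import defaultdict

class Clearinghouse:
    """Toy marketplace and settlement layer between publisher nodes and AI vendors."""

    def __init__(self):
        self.registry = {}                  # asset_id -> (publisher, per_token_rate)
        self.balances = defaultdict(float)  # publisher -> royalties owed
        self.audit_log = []                 # hash-chained usage records

    def register(self, asset_id, publisher, per_token_rate):
        """A publisher node registers a licensed asset and its royalty terms."""
        self.registry[asset_id] = (publisher, per_token_rate)

    def report_usage(self, vendor, asset_id, tokens_used):
        """An AI vendor reports how many tokens of an asset it used in training."""
        publisher, rate = self.registry[asset_id]
        royalty = tokens_used * rate
        self.balances[publisher] += royalty
        record = {
            "vendor": vendor, "asset_id": asset_id,
            "tokens": tokens_used, "royalty": royalty,
            "prev_hash": self.audit_log[-1]["hash"] if self.audit_log else None,
        }
        # Chain each record to the previous one so an auditor can verify the log
        record["hash"] = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()).hexdigest()
        self.audit_log.append(record)
        return royalty

# Illustrative flow with made-up figures
ch = Clearinghouse()
ch.register("asset-000123", "Example Newswire", per_token_rate=0.000002)
ch.report_usage("example-llm-vendor", "asset-000123", tokens_used=1_500)
print(round(ch.balances["Example Newswire"], 6))  # 0.003
```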
Technical Challenges
Granularity of Licensing
Should payments occur at the article level, paragraph level, or token level?
How to handle derivative works and paraphrasing?
Latency vs. Accuracy
Real-time API calls to rights-managed data sources may increase inference latency.
Pretraining with licensed datasets offers speed but reduces granular reporting.
Cross-Jurisdictional Standards
Licensing laws differ across the US, EU, and Asia.
A global standard is needed to avoid fragmented ecosystems.
Watermarking and Provenance
Current watermarking is imperfect and easily circumvented.
Advances in cryptographic watermarking or differential privacy may be required.
Incentive Alignment
Publishers want fair compensation.
AI vendors want scalable, affordable licensing terms.
A viable CRL must balance these incentives.
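To see why the granularity question above matters commercially, here is a back-of-the-envelope comparison of article-level versus token-level payment for the same hypothetical archive. All figures are invented purely for illustration; real rates are negotiated deal by deal.

```python
# Back-of-the-envelope comparison of licensing granularities.
# Every figure is invented for illustration; real rates are negotiated per deal.

articles = 100_000            # size of a hypothetical licensed archive
avg_tokens_per_article = 900  # rough length assumption

flat_per_article_fee = 0.50   # pay once per article ingested
per_token_rate = 0.000002     # pay per token actually used in training

article_level_cost = articles * flat_per_article_fee
token_level_cost = articles * avg_tokens_per_article * per_token_rate

# The same archive can imply very different economics depending on the unit
# and the rate attached to it — and per-token schemes still need separate
# rules for paraphrases and derivative works.
print(f"article-level: ${article_level_cost:,.0f}")  # $50,000
print(f"token-level:   ${token_level_cost:,.0f}")    # $180
```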
Future Outlook
Short Term (2024–2025): Bilateral licensing deals dominate (e.g., OpenAI with Axel Springer, AP). These are proprietary and non-standardized.
Medium Term (2025–2027): Industry consortia emerge, establishing APIs and metadata schemas for standardized licensing. Comparable to early financial settlement systems.
Long Term (2027+): A mature Content Rights Layer operates as critical infrastructure — an invisible backbone akin to DNS for the internet or SWIFT for banking.
AI Visibility & Generative Engine Optimisation
For over two decades, digital marketing revolved around Search Engine Optimisation (SEO)—the art and science of ensuring brands appeared at the top of Google results. But with the rise of Generative AI, a new paradigm has emerged: Generative Engine Optimisation (GEO).
In this new landscape, people are increasingly asking AI assistants like ChatGPT, Claude, Perplexity, or Gemini instead of typing queries into search bars. These systems don’t serve “10 blue links.” They generate direct answers. And increasingly, those answers are shaped by licensed media content and structured knowledge sources.
The implication is profound: visibility is no longer about ranking in Google, but about being present in the training, fine-tuning, and licensing pipelines of LLMs.
From SEO to GEO
For two decades, Search Engine Optimisation (SEO) defined how companies achieved digital visibility. The playbook was well understood:
Keywords: Crafting content around high-intent terms to match search queries.
Backlinks: Earning inbound links from authoritative sites to signal trustworthiness.
Content structures: Optimising metadata, titles, and schema markup so Google’s crawlers could parse relevance.
Authority in the SEO era meant ranking high on Google’s SERP. The brands that invested heavily in SEO software, content marketing, and backlink strategies dominated visibility and traffic.
The GEO Era (2023 onwards)
With the rise of generative AI, visibility no longer stops at search engines. We have entered the Generative Engine Optimisation (GEO) era.
Here, the question is no longer “How do I rank on Google?” but “How do I get mentioned by ChatGPT, Gemini, Claude, or Perplexity when a customer asks about my category?”
The rules have shifted:
Authority signals are no longer just backlinks—they are mentions in the media outlets and data sources that feed LLMs.
Crawlers have been replaced by licensing deals. Generative engines increasingly ingest content through structured licensing agreements with publishers, newswires, journals, and data providers.
Ranking is replaced by retrieval. Instead of ten blue links, users now see a single AI-generated answer, often with fewer citations.
Why This Matters
In SEO, visibility was a matter of technical optimisation.
In GEO, visibility is a matter of trust and licensing.
If your brand is absent from the licensed, high-trust media ecosystem that LLMs use, you risk invisibility—even if you’ve spent years building SEO authority.
The Strategic Shift
SEO playbook: Optimise for crawlers.
GEO playbook: Optimise for licensing pathways and trusted media mentions.
Put simply: in SEO, Google’s algorithms decided who was visible. In GEO, LLM licensing networks decide who is credible enough to be surfaced.
Generative AI Licensing Agreement Tracker
Introducing the Generative AI Licensing Agreement Tracker
The Generative AI Licensing Agreement Tracker is the first intelligence system purpose-built for the Generative Engine Optimisation (GEO) era.
For the first time, brands, agencies, and regulators can see clearly:
Which media outlets and data providers have signed licensing deals with LLM vendors.
Example: Associated Press → OpenAI, Axel Springer (Politico, Business Insider, Bild) → OpenAI, Financial Times → OpenAI, Thomson Reuters → Microsoft.
This data reveals which publishers now act as privileged knowledge feeders to the most influential AI systems.
Which LLMs are using which licensed sources.
The Tracker cross-references licensing agreements across ChatGPT (OpenAI), Gemini (Google), Claude (Anthropic), Perplexity, Mistral, and others.
This allows you to answer a simple but critical question: “If I get mentioned in X outlet, will that show up in ChatGPT—or only in Gemini?”
Trustworthiness and authority rankings from an AI perspective.
Not all outlets are equal in the eyes of LLMs.
The Tracker scores outlets based on factors such as inclusion in licensing deals, historical citation weight in AI outputs, and industry relevance.
This produces a “Generative Trust Index”—a visibility metric for modern PR strategy.
Dates and details of licensing agreements.
The Tracker provides transparency on when deals were struck, what content types were included (news, archives, multimedia), and exclusivity clauses.
This matters because licensing pipelines evolve: being featured in a source pre-deal vs. post-deal can change your visibility footprint dramatically.
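To show how the cross-referencing and trust scoring described above might work in practice, here is a minimal sketch of the Tracker’s underlying data model: a table of outlet-to-vendor deals, a lookup answering “which assistants license outlet X”, and a toy Generative Trust Index combining the three scoring factors. The deals listed are the publicly reported examples cited in this article (dates approximate); the weights and formula are illustrative assumptions, not the Tracker’s actual methodology.

```python
# Minimal sketch of the Tracker's data model. The deals listed are the examples
# cited in this article; the weights and scoring formula are illustrative only.

deals = [
    {"outlet": "Associated Press", "vendor": "OpenAI", "signed": "2023-07"},
    {"outlet": "Axel Springer",    "vendor": "OpenAI", "signed": "2023-12"},
    {"outlet": "Financial Times",  "vendor": "OpenAI", "signed": "2024-04"},
]

def vendors_licensing(outlet):
    """Which LLM vendors have a licensing deal covering this outlet?"""
    return sorted({d["vendor"] for d in deals if d["outlet"] == outlet})

def generative_trust_index(deal_count, citation_weight, industry_relevance):
    """Toy trust score on a 0-100 scale.

    deal_count:         number of AI licensing deals the outlet is part of
    citation_weight:    0-1, how often the outlet is cited in AI answers
    industry_relevance: 0-1, fit with the category being optimised for
    """
    # Weights are assumptions, not the Tracker's actual methodology
    return round(100 * (0.4 * min(deal_count, 3) / 3
                        + 0.4 * citation_weight
                        + 0.2 * industry_relevance), 1)

print(vendors_licensing("Associated Press"))  # ['OpenAI']
print(generative_trust_index(1, citation_weight=0.7, industry_relevance=0.9))  # 59.3
```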
Why the Tracker Matters
For CMOs
Move beyond SEO. Allocate PR spend where it actually impacts AI-generated recommendations.
Example: Don’t just target Forbes for brand authority; target Reuters or AP if your priority is being quoted in ChatGPT.
For PR Agencies
Smarter media targeting. Instead of pitching everywhere, guide clients toward the AI-visible tier of media.
Offer a new KPI: AI Mention Share (the % of generative answers where the client’s brand appears).
For Media Outlets
Competitive positioning. Know how your publication stacks up against rivals in the AI licensing economy.
Understand whether being part of a deal with OpenAI or Google boosts long-term relevance—and how to price your content accordingly.
For Regulators & Analysts
Transparency. The Tracker reveals which knowledge pipelines shape public AI answers.
Supports policy discussions around bias, misinformation, and fair compensation in generative ecosystems.
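Returning to the AI Mention Share KPI proposed for PR agencies above, here is a minimal sketch of how it could be computed from a panel of logged assistant answers. The substring matching and sample data are deliberately simplistic assumptions; a production version would need alias handling, entity resolution, and a representative prompt panel.

```python
def ai_mention_share(brand, answers):
    """Share (%) of generative answers in which the brand appears.

    brand:   brand name to look for (simple case-insensitive substring match;
             a real system would need alias handling and entity resolution)
    answers: answer texts collected by prompting assistants with a fixed
             panel of category questions
    """
    if not answers:
        return 0.0
    hits = sum(1 for text in answers if brand.lower() in text.lower())
    return round(100 * hits / len(answers), 1)

# Illustrative usage with made-up answers
sample_answers = [
    "Top options in this category include Acme Corp and two regional players.",
    "Most analysts point to larger incumbents rather than niche vendors.",
    "Acme Corp is frequently cited for reliability.",
]
print(ai_mention_share("Acme Corp", sample_answers))  # 66.7
```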
Conclusion
The rise of generative AI marks the most significant shift in digital visibility since the dawn of search engines. Where SEO once determined which brands appeared on Google, today GEO determines which brands appear in the knowledge streams of ChatGPT, Gemini, Claude, and Perplexity.
The rules have changed. Visibility no longer depends on backlinks and metadata—it depends on whether your brand is cited in the licensed, high-trust outlets feeding generative models. For CMOs and PR leaders, this requires a fundamental strategic pivot: from optimising for crawlers to optimising for licensing pipelines.
The Generative AI Licensing Agreement Tracker was designed to make this shift actionable. By mapping which media outlets and LLM vendors are connected, it helps brands and agencies understand where to invest PR efforts, how to secure AI visibility, and how to stay competitive in an AI-mediated world.
Just as companies that mastered SEO became leaders in the search era, those that embrace GEO will become leaders in the generative era. The future of influence belongs to those who recognise that visibility is no longer a matter of search rankings, but of generative presence.
Conclusion: From Press Clippings to AI Citations
The Generative AI Licensing Agreement Tracker isn’t just a research tool — it’s a PR strategy compass. As AI assistants replace search engines for millions of users, the publishers they license from will define what gets seen, cited, and trusted.
CMOs and PR agencies who shift early will gain a first-mover advantage in AI visibility, ensuring their brands don’t just appear in headlines but in the answers people now trust most.