Multi-Agent System for AI Visibility Engineering
Overview
AI visibility engineering is an emerging practice focused on ensuring a brand’s presence in AI-generated content and search results. It measures how often and how prominently a brand is mentioned or recommended by large language models (LLMs) like ChatGPT, Claude, Perplexity, and Bing Chat. As consumers increasingly rely on AI chatbots for recommendations, brands need to monitor and optimize their LLM visibility (also called LLM Share of Voice). Manual checks (e.g. asking ChatGPT or Claude about your product) can give a quick pulse, but they don’t scale. Indeed, specialized tools have emerged (e.g. AI Search Grader, SpyGPT) to track if and how a brand appears in AI responses.
In this document, we design a client-deployable multi-agent system to automate AI visibility tracking and optimization. The system consists of four modular agents, each with a specific role:
Web Crawler Agent – Scans public web content for brand or product mentions using targeted search queries and crawling, gathering context from relevant pages.
LLM Query Agent – Uses LLM APIs (OpenAI GPT-4, Anthropic Claude, Bing Chat, Perplexity, etc.) to simulate user queries and captures the AI responses for analysis.
Data Analysis Agent – Computes metrics like an AI visibility score, brand mention frequency, token overlaps between official content and AI answers, embedding-based content similarity, and benchmark comparisons (e.g. vs. competitors or past performance).
Optimization Agent – Applies rule-based checks and ML heuristics to suggest content improvements: e.g. rewriting or tagging content for better AI visibility, adding missing metadata (schema, alt text), and structural edits to enhance AI discoverability.
This system is built with an emphasis on modularity, scheduling, data storage, and ease of deployment in a client’s environment. The following sections detail the overall architecture, technology stack, agent interactions, component interfaces, data models, deployment approach, and security/configuration considerations.
System Architecture and Agent Roles
Figure: Multi-Agent System Architecture for AI Visibility Engineering.
Each agent is a specialized module in a pipeline orchestrated by a scheduler. The Web Crawler finds brand mentions across the web (and gathers the brand’s own content), the LLM Query agent collects responses from various AI assistants, the Analysis agent computes visibility metrics and comparisons, and the Optimization agent produces content improvement suggestions. Data from each stage is stored for the next stage and for reporting.
At a high level, the system operates as a pipeline of agents executed in sequence (with some parallel sub-tasks where appropriate) under a central orchestrator. The figure above illustrates the architecture:
A Scheduler/Orchestrator (e.g. an Airflow DAG or similar) initiates the workflow on a schedule (for example, nightly or weekly). It triggers each agent in order and handles dependencies (ensuring, for instance, that crawling completes before LLM queries start). This ensures the process is automated and repeatable, with the ability to monitor success/failure of each step.
Web Crawler Agent: First, the orchestrator invokes the crawler to gather data from the web. It uses search queries and web crawling to find where the brand or product is mentioned publicly. The crawler also can fetch the brand’s own website content for analysis. It writes the results into a storage layer (e.g. a database or JSON files) for later use.
LLM Query Agent: Next, the orchestrator triggers the LLM querying component. This agent loads a set of user-like prompts (e.g. “What are the best [product category] tools?”) and queries multiple LLM-based services (OpenAI, Claude, Bing, Perplexity, etc.) with these prompts. The responses from each LLM are captured and stored. This reveals if and how the brand is mentioned by AI when answering relevant questions.
Data Analysis Agent: After collecting web data and AI responses, the analysis module runs. It reads the stored data (crawler outputs and LLM answers) and computes key metrics: the AI visibility score (a composite indicator of how visible the brand is across queries and platforms), counts of brand mentions per platform, overlap between AI answers and official content, embedding similarities, and benchmark comparisons (e.g. comparing the brand’s visibility to competitors or to prior periods).
Optimization Agent: Finally, the orchestrator calls the optimization module. This agent uses the analysis findings to pinpoint content improvements. It scans the brand’s content (from the crawl) for issues (missing metadata, lack of coverage on certain topics) and uses both rule-based logic and ML (e.g. language model suggestions) to propose optimizations. The suggestions might include rewritten copy, additional sections or tags (like adding Schema.org markup), and structural changes to help the brand’s content rank better in AI responses. The outputs are stored and also compiled into a human-readable report.
Each agent is modular and encapsulated – they can be developed, tested, and run independently, which aligns with good multi-agent design practices (each agent has focused responsibilities and can be improved without affecting others). The agents communicate through well-defined data interfaces: typically writing to and reading from a shared data store or passing in-memory data via the orchestrator. This design allows scaling or replacing components (for example, swapping out the web crawler for a different implementation) without impacting the overall system, as long as the interface contracts are maintained.
Below we detail each agent’s role, implementation approach, and interface in the system.
Web Crawler Agent
Role: The Web Crawler Agent discovers and collects references to the brand or product across the public web. Its goal is to map out where the brand is mentioned (e.g. news articles, blogs, forums) and to retrieve the content context of those mentions. It also can crawl the brand’s own websites or knowledge bases to have a reference for official content. By compiling these sources, the system can later analyze how widespread the brand’s web presence is and compare it to how AI models answer questions.
Crawling strategy: The agent uses targeted query operators and search APIs to find relevant pages. Rather than blindly crawling the entire web, it formulates search queries like `"<BrandName>" <industry keywords>` or uses Google/Bing advanced operators (e.g. `site:example.com <BrandName>` for specific domains, or `intext:` queries) to pinpoint likely mentions. This focused approach yields a list of URLs that are likely to contain the brand name or product references. The agent then fetches those pages for analysis. It respects `robots.txt` and uses rate limiting to avoid overloading any site.
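As a minimal sketch of this query formulation (the helper name and operators shown are illustrative, and the exact syntax depends on the search API used):

```python
def build_search_queries(brand: str, keywords: list[str], domains: list[str] | None = None) -> list[str]:
    """Compose targeted search queries combining the brand with industry keywords
    and optional site: restrictions. Purely illustrative; adjust operators to the
    search API in use (Google Programmable Search, Bing Web Search, etc.)."""
    queries = [f'"{brand}" {kw}' for kw in keywords]
    if domains:
        queries += [f'site:{d} "{brand}"' for d in domains]
    return queries

# Example:
# build_search_queries("AcmeCloud", ["cloud backup", "backup software"], ["reddit.com"])
```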
Technology: The crawler is implemented in Python using robust web scraping libraries. For example, Scrapy can be used as the crawling framework, providing efficient handling of HTTP requests, parsing, and concurrency. Scrapy’s built-in features (HTTP client, HTML parsing, auto-throttling, output pipelines, etc.) make it suitable for scalable crawling. Alternatively, a simpler approach with `requests` + `BeautifulSoup` can be used for a smaller scope crawl. If JavaScript-heavy pages need scanning (less likely for textual brand mentions), a headless browser (e.g. Selenium or Playwright) could be integrated, but generally most content can be retrieved via normal HTTP requests.
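For the simpler approach, here is a minimal sketch of a fetch-and-extract step using `requests` and `BeautifulSoup`, with a `robots.txt` check via the standard library. The user-agent string, snippet window, and record fields are illustrative assumptions rather than a fixed interface.

```python
import datetime
from urllib import robotparser
from urllib.parse import urlparse

import requests
from bs4 import BeautifulSoup

def allowed_by_robots(url: str, user_agent: str = "VisibilityBot") -> bool:
    """Check robots.txt before fetching (best effort; failures default to allow)."""
    parsed = urlparse(url)
    rp = robotparser.RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    try:
        rp.read()
        return rp.can_fetch(user_agent, url)
    except Exception:
        return True

def extract_mention(url: str, brand: str, window: int = 200) -> dict | None:
    """Fetch a page and return a mention record, or None if the brand is absent."""
    if not allowed_by_robots(url):
        return None
    resp = requests.get(url, timeout=15, headers={"User-Agent": "VisibilityBot/0.1"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text(" ", strip=True)
    idx = text.lower().find(brand.lower())
    if idx == -1:
        return None
    return {
        "url": url,
        "title": soup.title.string.strip() if soup.title and soup.title.string else "",
        "snippet": text[max(0, idx - window): idx + len(brand) + window],
        "mention": brand,
        "mention_count": text.lower().count(brand.lower()),
        "source": urlparse(url).netloc,
        "crawl_time": datetime.datetime.utcnow().isoformat() + "Z",
    }
```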
Data storage: The crawler outputs structured data on each page found. Key fields include the page URL, page title, the snippet of text around the brand mention, and metadata like the date or source domain. This can be stored as records in a database (e.g. a table for “web_mentions”) or as JSON lines in a file for simplicity. Storing the full text of the page can be optional (for in-depth analysis, the full content might be saved, or at least a generous context around the mention). The agent may also tag each result with attributes like the relevance or confidence (for instance, if the brand name appears only once vs. many times). Internally, it might use a text search library to highlight the brand terms and extract surrounding sentences.
Interface & sample output: The Web Crawler Agent can be invoked with input parameters like the brand name, a list of product keywords, and perhaps a limit on number of pages per domain. It returns or saves a collection of mention records. For example, an output JSON record might look like:
```json
{
  "url": "https://technews.example.com/cloud/12345",
  "title": "AcmeCloud recognized as a top cloud backup solution",
  "snippet": "...According to the report, **AcmeCloud** offers unprecedented backup speeds...",
  "mention": "AcmeCloud",
  "source": "technews.example.com",
  "crawl_time": "2025-07-13T10:55:00Z"
}
```
Each record captures where and how AcmeCloud (the brand) was mentioned. These records are stored in a Crawled Data Store (e.g. a SQLite or PostgreSQL database table, or a JSON/CSV file). Downstream agents will query this store to analyze the brand’s web mentions.
LLM Query Agent
Role: The LLM Query Agent assesses the brand’s visibility in AI-generated responses by programmatically querying multiple large language model services. Instead of waiting for users to ask AI about the brand, this agent simulates typical user questions and records what the AI says. This reveals whether the brand is being mentioned, how it’s described, and where it stands relative to competitors in AI outputs. This directly addresses the core of AI visibility: does the AI know/talk about our brand, and in what context?
Query design: The agent takes a set of prompts relevant to the brand’s domain. For example, if the brand is AcmeCloud (a cloud backup product), prompts might include: “What are the best cloud backup solutions in 2025?”, “Pros and cons of using AcmeCloud?”, or broader industry questions that should ideally mention the brand. These prompts can be manually curated or generated from a template plus the brand/product keywords. They mimic what a potential customer might ask an AI assistant. Using a consistent set of prompts allows tracking presence over time and comparing with competitors.
Multi-LLM interfacing: The agent interfaces with various AI systems via their APIs:
OpenAI (for ChatGPT/GPT-4): via OpenAI’s API (e.g. the `gpt-4` or `gpt-3.5-turbo` model endpoints).
Anthropic Claude: via its API for Claude 2, etc.
Bing Chat: via the Bing Search API or an Edge browser automation if needed. (Bing’s new AI may not have a straightforward API for chat as of writing, but the agent can use the Bing Web Search API to get AI-generated answers in Bing’s answer format or utilize an unofficial approach to query Bing Chat.)
Perplexity.ai: if an API is available, or by using their web search tool programmatically. Perplexity provides cited answers by combining search and LLM; these can be retrieved via its API or scraped from the returned HTML.
Others: The framework is extensible; e.g., Google Bard or emerging tools can be added if needed, given appropriate API or web interface integration.
Using a library like LangChain can simplify this multi-LLM orchestration. LangChain provides standard interfaces to call different LLM providers and can manage prompt formatting and API calls uniformly. It also supports asynchronous calls, which the agent can leverage to query multiple LLMs in parallel (reducing overall latency). For example, the agent might concurrently send the same prompt to OpenAI, Anthropic, and Perplexity, then gather all results.
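A minimal sketch of this fan-out pattern is shown below. It is deliberately not tied to any specific SDK: the `query_openai`, `query_claude`, and `query_perplexity` helpers are hypothetical placeholders for the real provider calls (made via LangChain or the vendors’ SDKs).

```python
import asyncio

# Hypothetical provider wrappers; in practice each would call the vendor's SDK
# (OpenAI, Anthropic, etc.) and return the raw answer text.
async def query_openai(prompt: str) -> str: ...
async def query_claude(prompt: str) -> str: ...
async def query_perplexity(prompt: str) -> str: ...

PROVIDERS = {
    "OpenAI_GPT-4": query_openai,
    "Anthropic_Claude": query_claude,
    "Perplexity": query_perplexity,
}

async def query_all(prompt: str, timeout: float = 60.0) -> dict[str, str | None]:
    """Send one prompt to every provider concurrently; a failed provider yields None."""
    async def safe_call(name, fn):
        try:
            return name, await asyncio.wait_for(fn(prompt), timeout)
        except Exception:
            return name, None  # mark this service as unavailable, keep the others
    results = await asyncio.gather(*(safe_call(n, f) for n, f in PROVIDERS.items()))
    return dict(results)

# Example:
# responses = asyncio.run(query_all("What are the best cloud backup solutions in 2025?"))
```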
Data capture: For each prompt, the agent records the responses from each LLM. It captures the raw answer text and possibly metadata like whether the brand was mentioned and any sources cited. It may also log token usage or confidence scores if provided by the API (OpenAI returns usage tokens; Perplexity might return a relevance score for sources, etc.). The output is structured for analysis. One approach is to create a JSON object for each prompt with sub-fields for each LLM’s answer. For example:
```json
{
  "prompt": "What are the best cloud backup solutions in 2025?",
  "responses": {
    "OpenAI_GPT-4": {
      "text": "... I would recommend AcmeCloud, Backblaze, and XYZ as top options ...",
      "mentions": ["AcmeCloud", "Backblaze", "XYZ"],
      "brand_mentioned": true
    },
    "Anthropic_Claude": {
      "text": "... Top cloud backup services include Backblaze and XYZ. ...",
      "mentions": ["Backblaze", "XYZ"],
      "brand_mentioned": false
    },
    "Bing_Chat": {
      "text": "\"According to experts, AcmeCloud is among the leading solutions...\" (source: TechCrunch)",
      "mentions": ["AcmeCloud"],
      "brand_mentioned": true,
      "cited_sources": ["TechCrunch"]
    },
    "Perplexity": {
      "text": "AcmeCloud is listed as a top solution in multiple reviews [1]. ...",
      "brand_mentioned": true,
      "cited_sources": ["[1] link to review site ..."]
    }
  },
  "query_time": "2025-07-13T11:00:00Z"
}
```
In this example, the prompt about “best cloud backup solutions” elicited different answers: GPT-4 and Bing Chat mentioned AcmeCloud (the brand) while Claude did not. This structured result is saved in an LLM Responses Store (e.g. another database table or JSON file). The Data Analysis agent will later traverse this data to quantify how often the brand appears and where.
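A minimal sketch of how the `mentions` and `brand_mentioned` fields could be derived from a raw answer, assuming a configured list of tracked names with the brand listed first (a real implementation might also handle aliases and name variants):

```python
import re

def extract_mentions(answer_text: str, tracked_names: list[str]) -> dict:
    """Return which tracked brand/competitor names appear in an LLM answer."""
    found = [
        name for name in tracked_names
        if re.search(rf"\b{re.escape(name)}\b", answer_text, flags=re.IGNORECASE)
    ]
    return {
        "text": answer_text,
        "mentions": found,
        # Assumes index 0 of tracked_names is "our" brand.
        "brand_mentioned": tracked_names[0] in found,
    }

# Example:
# extract_mentions("Top picks include AcmeCloud and Backblaze.", ["AcmeCloud", "Backblaze", "XYZ"])
```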
Implementation details: The agent is essentially a wrapper that calls external APIs, so careful attention is paid to API integration and reliability:
API keys for each service are stored securely in configuration.
The agent respects rate limits (e.g. OpenAI might limit requests per minute; the agent can insert delays or use async with semaphore limits).
Error handling is in place: if an API call fails or times out, the agent logs it and possibly retries a limited number of times. If one LLM service is down, the system should continue with others and mark that service’s result as unavailable in the output.
To keep cost predictable (especially for the OpenAI API, which bills per token), the prompts are designed to be concise, and the agent might enforce a max-token limit on the response. Only limited information is needed (whether the brand is mentioned and the general context), so the agent can post-process or truncate the answers to what's relevant for analysis (e.g. store just a boolean or a snippet containing the brand).
The LangChain framework can be used to manage prompts and aggregate responses. However, since we want to capture exact responses, the agent likely will not transform or chain these calls (no need to have the LLM think beyond answering). It simply collects outputs verbatim.
The LLM Query Agent essentially automates what human marketers might do manually: “prompt AI platforms and see if our brand comes up”. Automating this across many queries and platforms gives a much richer dataset for analysis.
Data Analysis Agent
Role: The Data Analysis Agent takes the raw data from the crawler and LLM query stages and computes actionable visibility metrics. It transforms raw text and counts into quantitative scores and insights. This agent provides the visibility analytics that show where the brand stands and identifies gaps. Its output feeds both the final report and the Optimization agent (to target improvements).
Inputs: The agent reads from the Crawled Data Store (web mentions data) and the LLM Responses Store (LLM answers). It may also use the brand’s official content (from the crawl of the brand site) as input for comparisons. Additionally, if the scope includes competitors, their names or content could be included to benchmark the brand’s performance. (For instance, the system could be configured with a list of competitor brand names to look for in the LLM responses as well, enabling a share-of-voice computation.)
Key computations:
Visibility Score: The agent calculates an AI Visibility Score for the brand. This could be a composite index (e.g. 0–100) reflecting how visible the brand is across the queries and platforms tested. One method is to take the percentage of prompts in which the brand was mentioned by at least one LLM, weighted by the importance of the prompt or the number of platforms. For example, if out of 10 key user questions, AcmeCloud was mentioned in answers to 6 of them, we might start with 60%. We can add weight if it’s mentioned by multiple LLMs or appears first in an answer. The exact formula can be tailored, but the idea is to condense multiple data points into a single score for easy tracking. (Semrush’s AI toolkit uses a similar metric to summarize AI mentions.) A sketch of these computations follows this list.
Mention counts and coverage: The agent tallies how many times and where the brand was mentioned. For each LLM platform, it can compute how often the brand appears. E.g., “ChatGPT mentioned AcmeCloud in 3 out of 5 queries (60%)”. It also can measure if the brand is consistently mentioned alongside certain competitors or if some prompts always yield a competitor name instead. If competitor tracking is enabled, the agent will similarly count competitor mentions. This provides a share of voice comparison: e.g., “Out of 5 AI queries, AcmeCloud was mentioned 3 times, CompetitorA 4 times, CompetitorB 1 time” – showing CompetitorA leads in AI mention frequency, etc.
Token/Text Overlap: To gauge how much of the brand’s own content is reflected in AI responses, the agent computes overlap metrics. One simple approach is token overlap: e.g., take the text of an LLM’s answer (especially any part discussing the brand) and the text from the brand’s relevant page (like the product page or description), then calculate the fraction of common tokens or common key phrases. A high overlap might indicate the AI is pulling phrasing directly from the brand (or a quote), whereas low overlap might mean the AI is paraphrasing or using other sources. This can be measured using techniques like Jaccard similarity or n-gram overlap after removing stopwords. The agent might output a percentage overlap for each answer.
Embedding Similarity: A more robust measure of semantic similarity uses vector embeddings. The agent can generate embeddings for texts (for example, the brand’s official product description vs. the LLM’s description of the product in an answer) and then compute cosine similarity. Using a model like OpenAI’s `text-embedding-ada-002` or a local transformer model (via HuggingFace) yields high-dimensional vectors representing meaning. The analysis agent can embed (a) key content from the brand (official descriptions, FAQ answers, etc.) and (b) the content of AI responses about the brand, then compute similarity scores. A high similarity (>0.8 cosine, for instance) would mean the AI’s answer is semantically very close to the brand’s own messaging (which might be good for accuracy but could also mean the AI is heavily relying on the brand’s copy). A low similarity might reveal that the AI has a different take or possibly misinformation. This helps identify misalignment between what the brand says and what the AI says.
Benchmark Comparisons: The agent prepares comparisons either against competitors or against the brand’s past performance:
Competitor benchmarks: If competitor names are provided, the agent will compute their visibility scores in the same way as for the brand (how often did each competitor appear in the AI responses?). It can then produce a rank or table of brand vs competitors. For example, AcmeCloud: 75, CompetitorA: 60, CompetitorB: 50 (higher is better). This shows relative AI presence. It can also identify in which queries competitors were mentioned but the brand was not (potential content gap opportunities).
Time benchmarks: Since the system can be run on a schedule, the agent can compare the current results to previous runs (data stored from prior dates). It could track the visibility score trend over time (e.g. improved from 60 to 75 after last optimizations) or note if a new competitor has started appearing.
Additional analyses: The agent might perform sentiment analysis on the AI responses about the brand (e.g., are mentions positive, neutral, or negative?). This is optional but can be insightful – if the brand is mentioned, is it in a recommending tone or a warning tone? This could be done by feeding responses into a sentiment classifier (or using an LLM to judge sentiment). Another analysis could be checking if the AI responses cite the brand’s content (e.g., did Perplexity or Bing cite the brand’s blog?). If the crawler collected the brand’s domain pages, the agent can see if any of those URLs appear in the LLM citations. If not, it means the brand’s content isn’t being directly used as a source.
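As referenced above, the following is a minimal sketch of the simpler computations (prompt coverage, per-platform mention counts, and a token-level Jaccard overlap), assuming the LLM responses have already been loaded into the structure shown earlier; the stopword list and scoring weights are illustrative.

```python
import re

STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "with"}

def tokens(text: str) -> set[str]:
    """Lowercase word tokens with a small stopword filter (illustrative, not exhaustive)."""
    return {t for t in re.findall(r"[a-z0-9']+", text.lower()) if t not in STOPWORDS}

def jaccard_overlap(official_text: str, answer_text: str) -> float:
    """Fraction of shared tokens between official copy and an AI answer."""
    a, b = tokens(official_text), tokens(answer_text)
    return len(a & b) / len(a | b) if a | b else 0.0

def visibility_metrics(results: list[dict], brand: str) -> dict:
    """results: one dict per prompt, shaped like the LLM response JSON example above."""
    per_platform: dict[str, int] = {}
    prompts_with_brand = 0
    for item in results:
        mentioned_somewhere = False
        for platform, resp in item["responses"].items():
            if resp and resp.get("brand_mentioned"):
                per_platform[platform] = per_platform.get(platform, 0) + 1
                mentioned_somewhere = True
        prompts_with_brand += mentioned_somewhere
    total = len(results) or 1
    return {
        "brand": brand,
        "visibility_score": round(100 * prompts_with_brand / total),
        "mention_coverage": per_platform,
        "total_prompts": len(results),
    }
```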
Tools and libraries: This agent heavily uses Python data libraries:
Pandas for aggregating counts and calculating percentages.
NumPy/Scipy for vector operations (cosine similarity).
NLTK / spaCy for tokenization if doing overlap metrics, and possibly for basic sentiment word analysis.
HuggingFace Transformers or OpenAI Embedding API for generating text embeddings for similarity measures (see the sketch after this list).
Potentially scikit-learn for any classification or regression (for example, if a custom model is used to combine metrics into a single visibility score, or to predict something like an SEO score).
The agent might also use visualization libraries (Matplotlib/Seaborn) to create charts (like a bar chart of brand vs competitor scores) that can be embedded in the final report.
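For the embedding-based similarity, here is a minimal local sketch using the sentence-transformers package with the all-MiniLM model mentioned above (the OpenAI embedding API could be substituted):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Small local model; weights are downloaded on first use.
_model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(official_text: str, ai_answer: str) -> float:
    """Cosine similarity between the brand's official copy and an AI answer."""
    vecs = _model.encode([official_text, ai_answer])
    a, b = np.asarray(vecs[0]), np.asarray(vecs[1])
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example:
# semantic_similarity("AcmeCloud provides automatic cloud backups...",
#                     "AcmeCloud is a backup service that runs automatically...")
```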
Output: The Data Analysis Agent produces both structured data outputs (metrics, scores) and intermediate artifacts for reporting. For instance, it may prepare a JSON or Python dict like:
```json
{
  "brand": "AcmeCloud",
  "visibility_score": 78,
  "mention_coverage": { "ChatGPT": 3, "Claude": 2, "Bing": 3, "Perplexity": 3 },
  "total_prompts": 5,
  "competitor_scores": { "Backblaze": 65, "XYZBackup": 50 },
  "similarity_to_official": 0.83,
  "sentiment": "neutral"
}
```
In this hypothetical summary, out of 5 prompts, ChatGPT mentioned AcmeCloud in 3, etc., yielding an overall score of 78. Competitor benchmarks are included, and an average semantic similarity (0.83 cosine) suggests the AI answers are fairly close to official content in meaning. The sentiment is neutral overall (meaning AI responses talk about AcmeCloud in a factual or mixed way).
These results are stored in an Analysis Results database or file. Additionally, this agent could generate human-readable snippets, such as:
A markdown table of results for inclusion in the report.
Highlight lists like “Queries where AcmeCloud was absent but CompetitorX appeared” or “Top brand-related facts AI mentioned (and their source)”.
Crucially, the Data Analysis Agent bridges the gap between raw data and actionable insight, which guides the next optimization steps.
Optimization Agent
Role: The Optimization Agent uses the insights from the analysis to recommend improvements to the brand’s content and web presence, aiming to boost future AI visibility. This is the prescriptive part of the system – not only do we see what the situation is, we also get guidance on how to improve it. The agent produces content suggestions, metadata additions, and structural edits that can be deployed to increase the likelihood of the brand being mentioned and cited by LLMs (essentially an LLM-focused SEO or AI visibility optimization task).
The optimization operates on two fronts:
Improving on-site content (the brand’s own websites or pages), so that AI models crawling or training on this content will find clear, structured, authoritative information about the brand.
Addressing gaps where the brand is missing from AI answers – which might involve creating new content or improving external signals.
Inputs: The agent takes as input:
The analysis results (visibility score, where the brand was or wasn’t mentioned, comparisons).
The raw data related to those results: specifically, the brand’s own content (from crawl) and possibly competitor content or public info for reference.
Configuration or best-practice rules for content optimization (a knowledge base of SEO/LLM optimization guidelines).
Using this, the agent generates recommendations. The approach is hybrid: a combination of rule-based heuristics and ML/NLP-based suggestions.
Rule-based heuristics: These are deterministic checks based on known best practices:
Metadata and Schema: Check if the brand’s pages have proper metadata that aids AI. For example, does the homepage or product page have a descriptive `<title>` and meta description containing the brand and product keywords? Is there Schema.org structured data (FAQ schema, Product schema, Organization schema, etc.) present? Structured data can help LLMs interpret content, as Microsoft confirmed using schema markup to help its Bing chatbot understand content. If such metadata is missing, the agent flags it. For instance: “Add Organization schema to your About page to ensure LLMs recognize factual info about your company” or “Missing meta description on page X – consider adding one that includes keyword Y.” (A sketch of these checks follows this list.)
Content Coverage Gaps: Using the analysis findings, the agent identifies topics or queries where the brand should be mentioned but wasn’t. For example, if the prompt “best cloud backup for small business” did not yield the brand in any AI response, the agent checks if the brand’s site has a page addressing that exact topic. If not, it recommends creating content (e.g. a blog post or guide) targeting that query. If the content exists but is not performing effectively, it suggests improving it (perhaps the content is buried or not indexed). Rule example: “Competitor A is mentioned for ‘affordable backup solutions’, but your content doesn’t explicitly address that angle. Consider adding a section about affordability/pricing on your product page or a blog comparing backup solution costs.”
Technical SEO for AI: Ensure the site is crawlable by AI. E.g., the agent might parse `robots.txt` to ensure it’s not disallowing known AI crawlers, or verify that site content isn’t behind logins. While this is more general SEO, it has AI implications (if an AI can’t crawl it, it can’t know about it).
Consistency and Clarity: Check if the brand name and product names are used consistently and prominently in content. If the brand has a very generic name or multiple variants, the agent might suggest standardizing references to help NLU (for instance, always use “AcmeCloud Backup” instead of sometimes just “Acme”, which an AI might not link to the product).
Backlink and External Signals: While primarily on-site, the agent might note if the crawl found very few external mentions (e.g., “Brand is rarely mentioned on external authoritative sites”). While the system itself can’t force that, it can recommend PR or outreach as a strategy (e.g. “Consider publishing a research piece or getting listed on Wikipedia to increase authoritative mentions, as brand mentions across the web strongly correlate with AI visibility.”). This crosses into marketing strategy, but it’s a valid insight derived from the analysis.
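As referenced in the metadata item above, here is a minimal sketch of the rule-based metadata and schema checks using BeautifulSoup; the specific rules and required schema types are illustrative.

```python
import json

from bs4 import BeautifulSoup

def check_page_metadata(html: str, brand: str) -> list[str]:
    """Return a list of human-readable issues found on a single page."""
    soup = BeautifulSoup(html, "html.parser")
    issues = []

    title = soup.title.string.strip() if soup.title and soup.title.string else ""
    if not title:
        issues.append("Missing <title> tag.")
    elif brand.lower() not in title.lower():
        issues.append("Title does not mention the brand.")

    meta_desc = soup.find("meta", attrs={"name": "description"})
    if not meta_desc or not meta_desc.get("content", "").strip():
        issues.append("Missing meta description.")

    # Collect Schema.org types declared in JSON-LD blocks.
    schema_types = set()
    for tag in soup.find_all("script", type="application/ld+json"):
        try:
            data = json.loads(tag.string or "")
        except json.JSONDecodeError:
            continue
        items = data if isinstance(data, list) else [data]
        for item in items:
            if isinstance(item, dict):
                schema_types.add(item.get("@type"))
    for wanted in ("Organization", "FAQPage"):
        if wanted not in schema_types:
            issues.append(f"No {wanted} schema markup found.")

    if any(not img.get("alt") for img in soup.find_all("img")):
        issues.append("Some images are missing alt text.")

    return issues
```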
ML-driven suggestions: These involve using language models or ML algorithms to generate or refine content:
Content Rewriting: For key pages, the agent can utilize an LLM (like GPT-4) to rewrite certain sections in a more “AI-friendly” way. For example, if the brand’s product description is very marketing-jargon-heavy, an LLM could be prompted (with a system instruction) to rewrite it in a more factual, concise manner, since LLMs prefer content that reads like Wikipedia or an unbiased source. The agent could supply the current text and ask for a rewrite emphasizing certain keywords or clarity. The output would be a draft the content team can consider. (This is done carefully: we use the LLM as an assistant, but a human should review before publishing changes. A prompt sketch for this follows this list.)
Generating FAQs: If the site lacks an FAQ and the analysis shows certain questions being asked to AI (the prompts) where the brand could answer, the agent might suggest FAQ questions and even draft answers. It can do this by taking the prompt and generating an answer based on the brand’s info. Structured Q&A content can then be added (and marked up with FAQ schema), which is known to help with AI and featured snippets.
Meta text generation: The agent can generate meta descriptions or alt text for images using ML. For instance, if images on the site are missing alt text (which also can be a factor in how AI perceives content), an image captioning model or prompt to GPT could create descriptive alt texts.
Tone and Coverage Analysis: Using ML to analyze if the content meets E-E-A-T (Experience, Expertise, Authoritativeness, Trustworthiness) criteria. Some SEO tools use ML to score content quality. Our agent might not have a full model for this, but it could use a prompt like “score this content on a scale for expertise” or use existing open-source classifiers to detect if content reads like authoritative text. Based on that, it might suggest adding statistics or quotes (since adding well-sourced stats can boost AI visibility).
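As referenced in the content-rewriting item above, here is a minimal sketch of the rewrite call, assuming the OpenAI Python client (v1+); the model name, system prompt, and token limit are illustrative, and any provider could be swapped in.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

REWRITE_SYSTEM_PROMPT = (
    "You rewrite marketing copy into concise, factual, neutral prose "
    "(similar in tone to an encyclopedia entry), preserving all product facts."
)

def suggest_rewrite(section_text: str, keywords: list[str], model: str = "gpt-4") -> str:
    """Return a draft rewrite for human review; never published automatically."""
    user_prompt = (
        f"Rewrite the following section. Keep it under 150 words and naturally "
        f"include these keywords where accurate: {', '.join(keywords)}.\n\n{section_text}"
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": REWRITE_SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        max_tokens=400,
    )
    return resp.choices[0].message.content
```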
Output: The Optimization Agent outputs a list of recommendations and actionable items, often tied to specific pages or content pieces. This can be formatted as a structured list (JSON or YAML) and also a nicely formatted section in the Markdown report.
For example, a JSON output of one recommendation item might be:
```json
{
  "page": "https://acmecloud.com/product",
  "issue": "No FAQ section; missing schema",
  "recommendations": [
    "Add a FAQ section addressing common questions (e.g., pricing, security).",
    "Include a <script type=\"application/ld+json\"> FAQPage schema</script> for the new FAQ content.",
    "Meta description is generic; rewrite to include 'cloud backup' keyword and highlight AcmeCloud's key value."
  ],
  "suggested_edits": {
    "meta_description": "AcmeCloud – Fast, affordable cloud backup solutions for small businesses. Secure your data with easy, automatic backups.",
    "faq_draft": [
      {
        "question": "Is AcmeCloud suitable for small businesses?",
        "answer": "Yes – AcmeCloud offers flexible plans designed for small businesses, with easy setup and automatic backups..."
      }
    ]
  }
}
```
This example indicates for the product page, the agent found no FAQ and missing schema. It recommends adding one, suggests adding FAQPage JSON-LD, and even provides a draft meta description and a sample Q&A pair for the FAQ.
The agent might produce multiple such items – e.g. one per important page (Home, Product, About, Blog) – or category-level suggestions. It can also provide general recommendations like “increase content updates frequency” if content freshness is an issue.
All these suggestions are compiled for the final report. The idea is that an engineer or content strategist at the client side can take these recommendations and implement changes (or even feed some suggestions directly to a CMS). Over time, after implementing these, the system can be run again to see if the visibility scores improve, creating a feedback loop for continuous optimization.
Agent Communication and Orchestration Strategy
To coordinate these four agents, the system implements a master orchestration workflow. The communication and scheduling strategy ensures each component runs in the correct order, passes necessary data, and operates on a schedule without manual intervention. The design choices here prioritize reliability (no step should be missed or executed out of sequence) and modularity (the orchestrator can trigger or skip agents as needed).
Orchestration via Airflow: We utilize Apache Airflow (or a similar workflow orchestrator) to define the pipeline as a Directed Acyclic Graph (DAG). Airflow is a proven scheduler for batch workflows, allowing us to set up dependencies and timing easily. In our Airflow DAG:
crawl_task triggers the Web Crawler Agent. Once it succeeds (i.e. all targeted pages crawled and data stored), it yields to the next task.
llm_query_task runs the LLM Query Agent to call the APIs and store results. It depends on crawl_task (though in theory, LLM queries could run in parallel to crawling since they don’t strictly depend on crawler output – if desired, we could run them concurrently; but often it’s fine sequentially).
analysis_task runs the Data Analysis Agent, which reads both crawler and LLM data and then writes analysis results.
optimize_task runs the Optimization Agent, which reads analysis (and possibly raw data) and produces the suggestions output.
Optionally, a report_task could run at the end to compile the outputs (from analysis and optimize) into a report or dashboard update.
These tasks are set with dependencies (`crawl_task >> llm_query_task >> analysis_task >> optimize_task >> report_task` in Airflow syntax). Airflow ensures each completes successfully before the next starts, or marks the DAG as failed and alerts if something goes wrong. We configure scheduling (e.g. run the DAG every Monday at 9am, or daily at midnight) easily via Airflow’s cron-like settings. Airflow’s monitoring UI also allows engineers to see run logs, retries, and durations for each agent – making it easier to manage in production.
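A minimal DAG sketch for Airflow 2.x is shown below; it assumes each agent exposes a `run()` entry point importable by the scheduler, and the module names, schedule, and retry settings are illustrative.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical agent entry points; each reads from and writes to the shared data store.
from agents import crawler, llm_query, analysis, optimization, reporting

default_args = {"retries": 2, "retry_delay": timedelta(minutes=10)}

with DAG(
    dag_id="ai_visibility_pipeline",
    start_date=datetime(2025, 1, 1),
    schedule_interval="0 9 * * 1",  # every Monday at 9am
    catchup=False,
    default_args=default_args,
) as dag:
    crawl_task = PythonOperator(task_id="crawl_task", python_callable=crawler.run)
    llm_query_task = PythonOperator(task_id="llm_query_task", python_callable=llm_query.run)
    analysis_task = PythonOperator(task_id="analysis_task", python_callable=analysis.run)
    optimize_task = PythonOperator(task_id="optimize_task", python_callable=optimization.run)
    report_task = PythonOperator(task_id="report_task", python_callable=reporting.run)

    crawl_task >> llm_query_task >> analysis_task >> optimize_task >> report_task
```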
Data passing: Rather than tightly coupling agents, we use the storage layer as the interface:
The Crawler writes to the `mentions` table; the LLM agent reads from config (prompts) rather than crawler output, so they’re mostly independent.
The Analysis agent queries the `mentions` table and the `llm_responses` table to do its work.
The Optimization agent reads from analysis outputs and the `brand_content` data (which could be a subset of `mentions` specifically for the brand’s own site pages).
Since the data is persisted, each agent can run as a separate process or even on different machines/containers, as long as they have access to the data store. In Airflow, we could also pass data via XCom (Airflow’s in-memory data passing), but for potentially large data (like lists of pages or lengthy text), a persistent store is more robust.
Agent interfaces: Each agent is implemented such that it can be invoked in a stand-alone manner. For example:
The Web Crawler Agent could be a Python function or an executable that accepts parameters (brand name, list of search queries, etc.) and internally handles crawling, then exits.
The LLM Query Agent might take a list of prompts (could be loaded from a file or table) and produce outputs to the DB.
In Airflow, these could be triggered via PythonOperator (calling a function in code) or via a DockerOperator (running a container that executes the agent code). We choose an approach depending on deployment preferences (see Deployment section). If using containers, each agent can be a separate container image that the orchestrator spins up and passes config to.
Parallelism: Within certain agents, parallel processing improves efficiency:
The Web Crawler can spawn multiple threads or asynchronous requests to fetch pages concurrently, since the search results yield many URLs. Scrapy by default handles parallel requests asynchronously, which is efficient.
The LLM Query Agent can parallelize API calls for different prompts or to different LLMs using Python’s `asyncio` or multi-threading (ensuring not to overshoot rate limits). For example, it could dispatch all prompts to ChatGPT one by one (since ChatGPT might have a rate limit) but do so concurrently with dispatching prompts to Claude, etc. Another strategy is to have separate Airflow tasks per LLM (e.g. `query_chatgpt_task`, `query_claude_task` in parallel) and then join results; however, that complicates merging results, so it is simpler to have a single task handle multiple APIs internally.
The Analysis Agent mostly does CPU-bound computations (similarity calc, etc.), which are usually fast for the volume of data (a few dozen responses and pages). If it were heavy (like thousands of embeddings), one could use batch processing or even offload to a GPU for embeddings. But typically this will run fine on CPU in a timely manner.
The Optimization Agent might call LLMs for rewriting suggestions. If it’s doing a lot (like rewriting 20 pages), it could do those in parallel or sequentially. Since this is often an offline process, a bit of extra time here is usually acceptable, but it can also be parallelized by content section.
Airflow can manage parallel tasks if we decided to split tasks further (e.g., crawl external vs crawl brand site as two tasks in parallel, or query each LLM in parallel). The design is flexible on this; one must just ensure data consistency (i.e. all needed data is ready when analysis runs).
Communication between agents: Since each agent runs in order, direct inter-agent communication is minimal. They communicate via data. This has security and reliability advantages (no complex RPC calls or live data passing; everything is written and read in a controlled way). If we needed more interactive communication (for example, an advanced scenario: an agent asking another agent a question in loop), we could use a message queue or an in-memory blackboard. But our use case doesn’t demand dynamic back-and-forth, it’s a straight pipeline.
Error handling and retries: The orchestrator handles failures by capturing non-zero exits or exceptions in tasks. For instance, if the Web Crawler fails (maybe a search API quota issue), Airflow can retry it a few times (we can configure retries) and, if it still fails, skip subsequent tasks or mark them failed. This prevents incomplete or inconsistent data from propagating. Each agent is idempotent or resumable to an extent; e.g. if the crawler partially completed before failing, it can pick up where it left off (or we clear its data and re-run from scratch). Logging is important: each agent logs its progress (like “Fetched 50/100 pages…” or “Query 3 of 5 to ChatGPT done”) to aid debugging.
Flexible orchestration: While Airflow is our main suggestion (for its robust scheduling, DAG visualization, and ease of integration in many environments), the system can also run using simpler schedulers (like cron jobs that trigger a pipeline script) or alternative orchestrators (like Prefect or Luigi if the client already uses those). The key is the same: a top-level control that knows the order. In a simpler cron scenario, one might have a master Python script that calls each agent in turn (checking for success codes). However, Airflow’s reliability, monitoring and ability to scale out (distribute tasks) make it a strong choice for enterprise deployment.
In summary, the communication strategy uses a pull-based integration (agents pull what they need from storage) orchestrated on a schedule. This decouples agents and makes the workflow robust to individual component changes. Next, we discuss the technology stack that enables this system.
Technology Stack Choices
Designing this system requires choosing technologies that support modularity, scalability, and ease of deployment. Below is the tech stack selection for each part and the rationale:
Programming Language: Python is chosen as the primary language for all agents. Python offers a rich ecosystem for web crawling, data analysis, and ML integration, ensuring we can implement each agent with appropriate libraries. Its ease of use and readability suits a multi-component project, and most API SDKs (OpenAI, etc.) are readily available in Python.
Web Crawling: The Web Crawler Agent is built with Python’s Scrapy framework for robust scraping. Scrapy handles parallel requests, parsing, and pipelines to storage elegantly. It also allows custom middleware (e.g. for rotating proxies or handling bot defenses) if needed. If the crawl scope is simple, we could use `requests` + `BeautifulSoup4`, but Scrapy’s scalability and features (auto-throttle, retries, export pipelines) make it optimal for a client-deployable tool that might need to crawl many pages reliably. We also include Python’s asyncio capabilities if writing a custom crawler (for non-Scrapy async fetching).
Search APIs: To find pages to crawl, the system can integrate with search engine APIs. For example, the Google Programmable Search API or Bing Web Search API can be used to perform the targeted queries. This avoids scraping Google results HTML (which is against TOS and less stable). These APIs return JSON results of web pages for a query, which our crawler can parse to get URLs. Using official APIs where possible makes the solution more compliant and stable. The tech stack includes libraries or modules to call these search APIs (or an SDK if provided).
LLM APIs and SDKs: For the LLM Query Agent, the stack includes:
OpenAI Python SDK (
openai
library) for ChatGPT/GPT-4 calls.Anthropic SDK or HTTP client (Anthropic provides a Python client for Claude).
Bing: either Microsoft’s
azure-ai
search client if using Azure OpenAI/Bing, orrequests
to the Bing Web Search endpoint.LangChain library to abstract these calls under a unified interface (LangChain provides wrappers like
OpenAI()
andAnthropic()
classes, and even tools for Bing search). It also offers utilities to format prompts and handle streaming if needed.Async libraries:
httpx
or Python’s built-inasyncio
for parallel API calls, if not using LangChain’s async support.
These ensure that connecting to each AI service is straightforward. By 2025, many LLM services have matured APIs, so the stack is flexible to include new ones (e.g., OpenAI’s function-calling responses, etc., can be utilized if needed for more advanced query parsing).
Data Storage: For data persistence, a lightweight yet scalable solution is used:
PostgreSQL or MySQL as a relational database to store crawl results, LLM responses, and analysis metrics. This is client-deployable (can run in a Docker container or on the client’s DB server) and handles structured data well. We prefer Postgres for its JSON support (if we want to store raw JSON of LLM responses) and reliability.
Alternatively, for simplicity or smaller deployments, SQLite can be used (file-based DB, no server needed). Each agent can write to the same SQLite file if on the same machine. This trades off concurrency (only one writer at a time) but since our tasks are sequential, it’s workable. For enterprise, Postgres is recommended.
For large text storage (like storing full page content or all LLM answer text), a combination of the DB (for meta and references) and flat files (for lengthy content) can be used. E.g., store page text in files named by a hash, and reference the filename in the DB. But given modern DBs can handle text, storing text in a TEXT column is fine if not huge volumes.
The analysis results and recommendations can also be stored in the DB, or simply generated as files (CSV/JSON) that are included in reports.
If vector embedding storage becomes necessary (say we want to quickly query for similar content), we could integrate a vector database like FAISS (in-memory) or Milvus or Pinecone. An on-prem alternative is to use PostgreSQL with the pgvector extension. For now, the agent can compute similarities on the fly without a persistent vector index, since the number of vectors (pages and responses) is manageable.
Data Analysis & ML Libraries:
Pandas and NumPy for manipulation and math.
scikit-learn for any classical ML (perhaps clustering similar content or regression if creating a scoring model).
NLTK/spaCy for NLP tasks like tokenization, if needed for overlap calcs.
Transformers (HuggingFace) for embedding generation or using pre-trained sentiment models. For example, use `SentenceTransformer` with a model like all-MiniLM for local embeddings, or call OpenAI’s embedding API.
Matplotlib/Seaborn for creating charts (if we include graphs in reports).
Optionally, NetworkX or Graphviz if we wanted to visualize relationships (not required, but could map which sources cite the brand).
Agent Orchestration: Apache Airflow is a key part of the stack for scheduling and orchestration. We use Airflow 2.x, which can run standalone or in a small cluster. In deployment, Airflow itself can run in a container. The Python operators will execute our agent code. Airflow’s reliability and scheduling features are a major reason for choosing it. It also allows integration with notification systems (so if a run fails, it can email an alert, etc.). If Airflow is too heavy for the client’s context, cron jobs or Prefect (a more lightweight orchestrator in Python) could be alternatives – but given Airflow’s prevalence in data engineering, it’s a solid choice.
Containerization: We use Docker to containerize the system for easy deployment. Each agent could be in a separate Docker image (to keep environments isolated), or we could combine some. At minimum, we will have:
A Docker image for Airflow scheduler & web server (Airflow provides an official image we can extend to include our DAG and any needed Python packages).
A Docker image for the agents (if we choose one image that has all the code and libraries for crawler, LLM, analysis, optimization). This image would be used by Airflow’s tasks or via Docker Compose. Alternatively, separate images: e.g. one for crawler (with Scrapy installed), one for LLM (with openai, etc.), etc. For simplicity, one image with all dependencies might be easier to maintain, as the environment is all Python.
The container approach ensures all dependencies (Python libs, system libs like `libxml` for scraping, etc.) are bundled and consistent across client deployments. It also simplifies running on different OSes (just need a Docker runtime).
APIs/Integration: The system might expose no external API (since it runs internally and outputs reports), but we could consider:
A simple web dashboard or API that the client can call to trigger a run or fetch the latest results. This could be a small Flask app served alongside Airflow or integrated into Airflow’s web UI (Airflow allows custom plugins). This is optional; often, scheduling and email reports suffice.
If integration with other tools is needed, we ensure outputs are in standard formats (CSV, JSON, Markdown) so they can be consumed by other analytics or reporting tools.
All selected technologies are cloud-neutral and open source (except the LLM APIs, which are external services by nature). The stack avoids proprietary cloud services for core functionality, meaning it can run on-premises or on any cloud VM or Kubernetes cluster the client uses. For example, if deploying on AWS, we might use an EC2 or ECS with the same Docker images; on Azure or GCP similarly – no dependency on AWS-specific services like Lambda or Google-specific tools, etc., aside from optional use of their search API which is a trivial swap if needed.
Using Python across the board also means a unified codebase and easier logging and error handling (same logging framework can be used in all agents, outputting to stdout for Airflow to catch or to a file).
The tech stack emphasizes maintainability and extensibility. For instance, if a new LLM API comes out, we can add its SDK to the image and update the LLM Query Agent code to include it. If the client wants to incorporate their own machine learning model for content scoring, we can integrate it via Python. If scheduling needs change, Airflow’s DAG can be updated without code changes to agents.
In summary, the combination of Python + LangChain + Scrapy + Airflow + Docker + data libraries provides a powerful yet flexible foundation. Each choice is proven in industry (Scrapy for crawling, LangChain for multi-LLM, Airflow for scheduling), ensuring the system can be built rapidly and run reliably in a client’s environment.
Deployment Architecture (On-Premise or Cloud-Neutral)
Figure: Deployment architecture for the multi-agent system.
All components are containerized for portability. In this example, each agent runs as a service (in separate containers) orchestrated by an Airflow scheduler. The system can be deployed on a single on-prem server or on any cloud VM or Kubernetes cluster. Data is stored in a local database or volume. Only outbound connections are to public web resources (for crawling) and to LLM provider APIs. This design avoids any dependency on proprietary cloud services, making it cloud-neutral and client-controlled.
The system is designed for easy deployment in a client environment, whether on-premises or in the client’s cloud account. The key deployment considerations are isolation, configurability, and minimal external dependencies beyond the AI APIs.
Containerization and Services: Each agent and the orchestrator are containerized (using Docker). We can deliver a `docker-compose.yaml` or Kubernetes manifests to the client to spin up the needed services. In a simple setup:
An Airflow container runs the scheduler and web UI. Our DAG and Python code are included (baked into the image or mounted as a volume).
A Crawler container runs the Web Crawler Agent. (This could be the same image as Airflow if using local executors, or a separate image if we use KubernetesExecutor/CeleryExecutor for Airflow, allowing it to spawn containers on demand).
Similarly, LLM Query, Analysis, and Optimization containers exist for their respective agents. These can all be instances of one image that contains all code, with an environment variable or entrypoint argument telling it which agent to execute. For example, the image could be built with all Python modules, and we have entry scripts like `run_crawler.sh`, `run_analysis.sh` that call the appropriate module. Airflow can then trigger the specific container and command for each step.
A Database container (if using Postgres/MySQL) can be part of the docker-compose. Alternatively, the client might provide a database or the system could use a host-mounted volume for SQLite. In either case, data persists between runs. If using Postgres, we ensure the data volume is persisted (so a Docker volume or a bind mount on the host).
Using Docker Compose, all these containers can communicate on a private network. The Airflow container (scheduler) orchestrates by either invoking the others directly (if using DockerOperator, it contacts the Docker daemon to start containers for each agent) or by Airflow running tasks in-proc (PythonOperator). In our architecture diagram, we illustrated them as separate services for clarity.
Resource allocation: Each agent container can be given resource limits. For instance, crawling might need more CPU if many pages, LLM querying might benefit from some concurrency but is mostly network-bound, analysis is light, and optimization might use some CPU for ML. These can all run on a single modern VM (e.g., 4 vCPU, 16GB RAM) without issues, as the workload is periodic and not huge. If the client demands, it can scale: e.g., if crawling thousands of pages, we can allocate more CPU or split workload.
On-Premise Deployment: The client can run the Docker stack on a VM or a bare-metal server in their data center. They would need internet egress for:
Crawling the web (outbound HTTP to various websites).
Calling LLM APIs (HTTPS to OpenAI/Anthropic/Bing endpoints).
If the environment is restricted, they might use a proxy server; the system can be configured to respect proxy settings for outbound calls. No inbound internet connections are needed (unless the client wants to access the Airflow web UI remotely, which can be secured or tunneled).
Cloud Deployment: If deployed in the cloud, the architecture remains the same. For example:
On AWS: Use an EC2 instance (or ECS service) to run the Docker containers. Or use EKS (Kubernetes) with our pods. The database could be an RDS instance if desired. But to keep it cloud-neutral, using an included Postgres container is simplest.
On Azure/GCP: similarly, a VM or container service can be used. No cloud-specific service is inherently required, which avoids lock-in.
Security in deployment: All containers communicate internally – for example, the crawler writes to the DB container over the Docker network. That network is isolated from the outside. We should ensure the database is not exposed outside the Docker network (no open port on host unless needed for debugging). The Airflow UI can be exposed on a specific port (with optional basic authentication enabled). If the client prefers, the Airflow UI can be kept only accessible internally and not opened externally, running jobs and sending reports via email or storing them in a shared folder.
Configuration management: The system uses environment variables and configuration files for all sensitive and variable data (a loading sketch follows this list):
API keys for OpenAI, etc., are supplied via environment variables (Airflow has a mechanism to store connections and variables securely). We instruct the client to add their keys in an `.env` file or in Airflow’s connections vault.
The list of prompts or competitor names might reside in a config JSON/YAML file which is volume-mounted so the client can edit it easily without rebuilding images.
Scheduling interval can be adjusted in the Airflow DAG code or via Airflow UI (if a cron schedule needs change).
Logging level for each agent can be set via env (e.g., set `LOG_LEVEL=DEBUG` to troubleshoot).
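As referenced above, here is a minimal sketch of how an agent could load this configuration at startup, with secrets coming from environment variables and non-sensitive options from a mounted YAML file (the file path and key names are illustrative):

```python
import os
from dataclasses import dataclass, field

import yaml  # PyYAML

@dataclass
class Settings:
    openai_api_key: str
    anthropic_api_key: str
    brand: str
    prompts: list[str] = field(default_factory=list)
    competitors: list[str] = field(default_factory=list)
    log_level: str = "INFO"

def load_settings(config_path: str = "/app/config/visibility.yaml") -> Settings:
    """Secrets come from the environment; everything else from a mounted YAML file."""
    with open(config_path) as f:
        cfg = yaml.safe_load(f) or {}
    return Settings(
        openai_api_key=os.environ["OPENAI_API_KEY"],
        anthropic_api_key=os.environ["ANTHROPIC_API_KEY"],
        brand=cfg.get("brand", ""),
        prompts=cfg.get("prompts", []),
        competitors=cfg.get("competitors", []),
        log_level=os.environ.get("LOG_LEVEL", "INFO"),
    )
```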
Scalability: For a single brand deployment, one instance is enough. If the solution needs to handle multiple brands or business units, we can parameterize the DAG to run per brand (or have separate DAGs). The system could scale horizontally by running multiple crawler instances for different brands concurrently. The modular design on Kubernetes could even allocate separate pods per brand. But in a simpler client scenario, it might just run for their brand and maybe a set of competitors all in one go.
Monitoring and maintenance: Airflow provides a UI to monitor scheduled runs, retry failures, and see logs for each task. This is very useful for the client’s engineers. We would set up log persistence (Airflow by default can store logs on the local file system or remote like S3; we can keep them local or in a volume). Container logs can also be collected if using a logging driver or something like ELK stack, but that may be overkill.
Backup and data retention: The data collected (pages, responses, analyses) might accumulate. We consider how to rotate or archive it:
The database can be configured with retention policies (e.g., keep last 6 months of data if runs are daily).
Alternatively, each run’s core outputs (like the visibility score summary) is appended to a history table for trend analysis, and old raw data can be cleaned if space is a concern.
If using a volume for SQLite, ensure it's backed up or at least the important results are exported (in reports or exported CSV).
Client access to results: Deployment includes delivering the final outputs to the client in a convenient way:
The Markdown report can be saved to a volume that is perhaps mapped to the host so the client can pick it up, or automatically converted to PDF and emailed.
We can integrate an email notification via Airflow (using an SMTP hook) to send the report to certain emails on each run.
Alternatively, host the report on a simple internal web server or push it to a knowledge base.
The cloud-neutral aspect means no part of this system inherently ties to a specific cloud vendor’s proprietary tech. For example, we don’t rely on AWS Lambda or Google Cloud Functions; instead we use our own Airflow. We don’t use a cloud-specific database; Postgres is portable. The only external calls are to third-party APIs (OpenAI, etc.), which are the same anywhere. This allows deployment on AWS, Azure, GCP, or completely offline (if the client has their own local instance of an LLM or chooses not to use external APIs, they could point the LLM Query agent to a local model endpoint – this is configurable).
We also consider air-gapped scenarios: If a client cannot expose data to external AI APIs due to policy, the system could be configured to skip the LLM Query agent or use an on-prem LLM (if available). The architecture is modular enough that, for example, the LLM Query agent could be swapped with calls to a locally hosted Llama2 model for a rough idea, though results may vary. In standard deployments, we assume using the public APIs is allowed for the public content queries.
Deployment Procedure: A likely deployment workflow for the client:
Provide the Docker Compose file (and Dockerfiles if they want to build images themselves for security review).
Client sets environment variables (API keys, config options) in a `.env` file.
They run `docker-compose up` (or we assist setting up in their orchestrator). Airflow and the agents come up.
The client accesses the Airflow UI at `http://localhost:8080` (for example) to trigger the first run or just wait for the schedule.
Verify that data is coming in (they can see logs, or we print a summary in the logs).
The report is generated and delivered (maybe placed in a mounted folder or auto-emailed).
The system continues to run per schedule. The client can pause the schedule or manually trigger on demand.
This straightforward deployment means the client retains control: they can stop the containers anytime (all data is stored on the mounted volume/DB), they can update config and restart. Updates to the system (e.g. improved agent code) can be delivered as updated Docker images.
Finally, security on deployment:
Ensure the Docker images are built from minimal base images (e.g., Python slim) and that only necessary ports are exposed (the Airflow web UI on 8080; nothing else needs to be exposed).
Use network segregation if needed (the containers can reside in an internal network zone).
Secrets (API keys) should not be baked into images; we use environment injection.
If the client has strict monitoring, they can monitor the outbound calls (the domains contacted will be known: OpenAI, Anthropic, Bing, plus various crawled domains).
We also plan for future-proofing: if the client later wants to move to a different orchestrator or cloud, the containers abstract most of the application, so redeploying them elsewhere is low-friction.
In summary, the deployment architecture leverages containerization to remain flexible and portable. It supports on-prem deployments by not requiring any cloud-specific services, and equally can be hosted in the client’s preferred cloud environment. All components run within the client's control, with only necessary external communication for the core function of monitoring web and AI content.
Data Models and Output Formats
Throughout the system, data is exchanged and stored in structured formats to facilitate analysis and integration. We design simple, clear data models for each stage, and produce outputs in formats useful to engineers and stakeholders (JSON/CSV for data interchange, Markdown/HTML for reports). Below we outline the key data models and output artifacts:
Web Crawl Data Model: Each record of a brand mention found on the web is stored with fields capturing what, where, and when:
url – the URL of the page containing the mention.
domain – the domain or site name (could be parsed from the URL for quick grouping, e.g., technews.example.com).
title – the page title (from the <title> tag or meta og:title).
snippet – a text snippet around the brand mention (e.g., 200 characters surrounding the first occurrence of the brand name, with highlighting or context markers). This gives a quick idea of how the brand was mentioned.
mention – the exact brand/product name variant found (in case the crawler was searching multiple terms).
mention_count – how many times the brand was mentioned on that page (could indicate depth of discussion).
crawl_time – timestamp when this page was fetched.
status (optional) – whether the page was accessible or had an error (pages that failed to fetch could be logged with an error status and no snippet).
This is stored in a table or JSONL file. If using SQL, the table might be web_mentions with columns as above. We ensure indices on domain and possibly mention for fast querying (e.g., to count mentions per domain).
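One possible SQLite schema for the web_mentions table described above; the column types and index names are illustrative rather than a fixed contract.

```python
# Create the web_mentions table and the indices mentioned above (SQLite syntax).
import sqlite3

DDL = """
CREATE TABLE IF NOT EXISTS web_mentions (
    id            INTEGER PRIMARY KEY AUTOINCREMENT,
    url           TEXT NOT NULL,
    domain        TEXT NOT NULL,
    title         TEXT,
    snippet       TEXT,
    mention       TEXT,          -- exact brand/product variant matched
    mention_count INTEGER DEFAULT 0,
    crawl_time    TEXT NOT NULL, -- ISO-8601 timestamp
    status        TEXT           -- e.g. 'ok' or an error code
);
CREATE INDEX IF NOT EXISTS idx_web_mentions_domain  ON web_mentions (domain);
CREATE INDEX IF NOT EXISTS idx_web_mentions_mention ON web_mentions (mention);
"""

with sqlite3.connect("data/visibility.db") as conn:
    conn.executescript(DDL)
```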
LLM Query Results Model: The outputs from querying each LLM for each prompt. We have a few ways to model this:
A Prompt table: each entry has prompt_id, prompt_text, and maybe a category/tag (like "pricing question" vs. "best-of question").
A Response table: with prompt_id, llm_name (e.g. GPT-4, Claude 2, Bing), response_text, and some flags: brand_mentioned (boolean), mention_count (how many times the brand name appeared in the response: 0, 1, 2, ...), other_brands (a list of other brand names found, e.g. competitors, which could be extracted), cited_sources (if the LLM provided citations/links, stored as a list or a concatenated string), and timestamp.
You might also store prompt_embedding and response_embedding if doing vector analysis later, but those can also be computed on the fly during analysis to avoid storing high-dimensional vectors in the DB.
Alternatively, one could store the entire response JSON structure (like the nested JSON shown earlier) in a single column, but having a table with one row per (prompt, LLM) is convenient for SQL queries like “which prompts did not mention brand on any LLM” etc.
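In code, one row of the Response table might look roughly like the following; the dataclass fields mirror the model above, and the annotate helper is a simplified sketch of filling the brand-mention flags via case-insensitive substring matching (a real implementation would likely be more careful about word boundaries and aliases).

```python
# Illustrative shape of one (prompt, LLM) response record plus a simple annotator.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class LLMResponse:
    prompt_id: int
    llm_name: str                 # e.g. "gpt-4", "claude-2", "bing"
    response_text: str
    brand_mentioned: bool = False
    mention_count: int = 0
    other_brands: list[str] = field(default_factory=list)
    cited_sources: list[str] = field(default_factory=list)
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def annotate(response: LLMResponse, brand: str, competitors: list[str]) -> LLMResponse:
    # Naive case-insensitive matching; refine for word boundaries/aliases in practice.
    text = response.response_text.lower()
    response.mention_count = text.count(brand.lower())
    response.brand_mentioned = response.mention_count > 0
    response.other_brands = [c for c in competitors if c.lower() in text]
    return response
```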
Analysis Results Model: After the Data Analysis agent runs, it can output:
A Visibility summary table: one row per run (date) with columns: visibility_score, and perhaps breakdowns like mentions_chatgpt, mentions_claude, ... (counts or percentages), and competitor_x_score, competitor_y_score, ... (if there is a fixed set of competitors). This is effectively a KPI table that can be easily queried or plotted over time.
A Detailed metrics table: potentially one row per prompt or per prompt category with details. For example, for each prompt we might store whether the brand was mentioned by each LLM (binary flags) and maybe the correctness of the information. This could be used for deeper analysis but might not be needed in final outputs.
Overlap/similarity data: If we compute token overlap or embedding similarity for specific pairs (e.g., each response vs official content), those can be stored in a separate structure or just used to generate recommendations. We might not need a table for every individual similarity score; instead, the Analysis agent could highlight top insights (like “lowest similarity was for question X, indicating a potential mismatch in content”).
If sentiment analysis was done, a summary like overall_sentiment=Neutral and maybe sentiment_score=+0.1 (where positive, negative, and neutral are encoded numerically).
Much of the analysis output is also directly turned into human-readable text or visuals, so not all intermediate metrics need their own long-term storage. The critical ones are the scores and mention counts that feed trend analysis.
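To make the summary table concrete, here is a simplified sketch that builds the per-run KPI row from the LLMResponse records sketched earlier. The visibility score here is just the share of (prompt, LLM) pairs that mentioned the brand, scaled to 100; the Analysis agent's actual score may weight platforms, prominence, or competitors differently.

```python
# Build a per-run KPI row from annotated LLM responses (simplified scoring).
from collections import defaultdict

def summarize(responses) -> dict:
    # responses: iterable of LLMResponse-like records (see the earlier sketch).
    per_llm = defaultdict(lambda: {"mentioned": 0, "total": 0})
    for r in responses:
        per_llm[r.llm_name]["total"] += 1
        per_llm[r.llm_name]["mentioned"] += int(r.brand_mentioned)

    total = sum(v["total"] for v in per_llm.values())
    mentioned = sum(v["mentioned"] for v in per_llm.values())
    return {
        "visibility_score": round(100 * mentioned / total) if total else 0,
        **{f"mentions_{name}": stats["mentioned"] for name, stats in per_llm.items()},
    }
```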
Optimization Recommendations Model: This could be semi-structured because recommendations often are textual. We can model it as:
A Recommendations list, where each item has:
page or content_id – the content it refers to (could be a URL or a page identifier like "Product Page" or "Homepage").
issue or finding – a short description of what's wrong or could be improved (e.g., "Missing FAQ", "Low mention in AI responses for topic X", "No schema markup", "Outdated info on page (last update 2019)").
recommendation – the suggestion text, possibly multi-line. This could also be broken into multiple recommendations linked to the same issue.
priority (optional) – high/medium/low impact of the suggestion.
ref_metric (optional) – a reference to the supporting evidence (like "brand not mentioned in 2 key AI queries" or "competitor schema present but yours is not").
suggested_content (optional) – if the agent provides a draft (like a rewritten meta description or a sample Q&A), include it here or in a linked structure.
If using a database, we could store these in a recommendations table, or simply output them as a JSON or Markdown block for the report. Since these are ultimately for human consumption, many implementations just format them into the Markdown/PDF report directly rather than storing them in a DB. However, storing them can be useful for tracking whether recommendations were implemented (the next run could then check if an issue persists).
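One possible in-code shape for a recommendation record, mirroring the optional fields above; the names are illustrative.

```python
# A single recommendation item as produced by the Optimization Agent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Recommendation:
    page: str                        # URL or identifier like "Product Page"
    issue: str                       # short finding, e.g. "No schema markup"
    recommendation: str              # suggestion text, possibly multi-line
    priority: Optional[str] = None   # "high" / "medium" / "low"
    ref_metric: Optional[str] = None # e.g. "brand not mentioned in 2 key AI queries"
    suggested_content: Optional[str] = None  # optional draft text
```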
Output Formats:
JSON: Almost every intermediate and final data structure can be output in JSON for maximum interoperability. For instance, after a full run, we can output a JSON file containing:
visibility_score and its breakdown,
an array of mentions (the crawl data),
responses (the LLM outputs),
analysis (metrics),
recommendations (the suggestions).
This comprehensive JSON could be ingested by the client into their own analytics systems or used for auditing. We make sure to preserve citation references or keys where needed. Because JSON can be nested, we might nest some of this (e.g., a field analysis.visibility_score, etc.).
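A minimal sketch of assembling that comprehensive JSON file, assuming the dataclass records sketched above; the nesting and file path are illustrative.

```python
# Assemble the per-run JSON artifact from the stage outputs.
import json
from dataclasses import asdict

def write_run_json(path, mentions, responses, analysis, recommendations):
    payload = {
        "analysis": analysis,                         # includes analysis["visibility_score"]
        "mentions": mentions,                         # list of dicts from the crawl stage
        "responses": [asdict(r) for r in responses],          # LLMResponse records
        "recommendations": [asdict(r) for r in recommendations],  # Recommendation records
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(payload, f, indent=2, ensure_ascii=False)
```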
CSV: Certain results lend themselves to CSV/tabular format:
The list of web mentions can be output as a CSV (columns: URL, Title, Domain, etc.), which is easy to open in Excel if needed.
The summary metrics over time could be appended to a CSV log (for quick charting).
If the client wants to compare competitor scores, a CSV with columns date, brand_score, compA_score, compB_score, ... can be produced.
CSV is useful for numeric data and simple records; it’s less ideal for long text (like LLM answers or recommendations), so we use it selectively.
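For the trend log mentioned above, a small sketch that appends one summary row per run; the column names follow the example and are assumptions.

```python
# Append the per-run KPI row to a CSV trend log, writing the header on first use.
import csv
import os

def append_summary_csv(path: str, row: dict) -> None:
    fieldnames = ["date", "brand_score", "compA_score", "compB_score"]
    new_file = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        if new_file:
            writer.writeheader()
        writer.writerow({k: row.get(k, "") for k in fieldnames})
```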
Markdown Report: We generate a detailed Markdown report as the primary deliverable for human readers (e.g., marketing managers, content writers, executives). Markdown is chosen because it is easily convertible to PDF and viewable on many platforms. The report is structured with headings, bullet points, and tables for readability.
A typical report structure might be:
Overview: Summary of the run date and key findings (e.g., “AcmeCloud AI Visibility Score: 78 (up from 70 last month). Mentioned in 60% of test queries. CompetitorA scored 65.”).
Detailed Findings:
A table listing each prompt and whether AcmeCloud was mentioned by each LLM (✓ or ✕ marks), possibly highlighting where it failed to appear.
Examples of what was said: e.g., quoting a sentence from ChatGPT’s answer that mentioned AcmeCloud, and perhaps one that missed it.
Web mention summary: e.g., “Found 25 recent web pages mentioning AcmeCloud. Majority from tech blogs; none from Wikipedia.” This frames how broad the web coverage is.
Analysis Metrics: Could include a small table:
| Metric | Value |
| --- | --- |
| AI Visibility Score (AcmeCloud) | 78/100 |
| Mention Coverage (ChatGPT/Claude/Bing/Perplexity) | 60% / 40% / 60% / 60% |
| CompetitorA Visibility Score | 65/100 |
| CompetitorB Visibility Score | 50/100 |
| Top Missing Topic | "affordable backup" (AcmeCloud not mentioned) |
| Avg. Content Similarity | 0.83 (High) |
Recommendations: A section with an ordered or bulleted list of recommendations. Each might be a few sentences, for example:
Add schema markup (Product and FAQ schema) to the AcmeCloud product page. This will give AI models more structured information to draw from, as Microsoft’s Bing has confirmed using schema to understand content.
Publish a blog post focusing on "affordable cloud backup for small businesses". Our analysis shows AcmeCloud wasn’t mentioned in AI answers to cost-focused queries, whereas CompetitorA was. Filling this content gap could improve visibility.
Update the homepage meta description. The current meta description doesn’t mention “cloud backup” explicitly. A revised description including that keyword can improve how AI summarizes your site.
Engage in Q&A on forums (e.g., Reddit). We found few external authoritative mentions of AcmeCloud; increasing genuine mentions on high-authority forums can improve AI recognition.
(Optional) Fine-tune a custom model or use retrieval augmentation for your content. (If the client is sophisticated, we might suggest providing a feed of your content to AI services or using tools to influence AI results.)
Each recommendation can reference the earlier findings (and we keep them concise and actionable). We maintain a logical flow so that the reader sees the problem in findings and the solution in recommendations.
Visuals: The Markdown can embed simple visuals (if the environment allows viewing images). For instance, we might include a bar chart of brand vs. competitor scores, or a trend line of visibility over time. We rely on simple locally generated charts rather than external images: the code can generate a matplotlib chart and save it as a PNG, which we then reference in the Markdown. This adds intuitive understanding, e.g., a bar chart showing at a glance that the brand scores above CompetitorB but below CompetitorA.
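A minimal sketch of generating such a chart with matplotlib; the scores and output path are illustrative and would come from the analysis results.

```python
# Render a brand-vs-competitor bar chart to a PNG for embedding in the report.
from pathlib import Path

import matplotlib
matplotlib.use("Agg")  # headless rendering inside the container
import matplotlib.pyplot as plt

scores = {"AcmeCloud": 78, "CompetitorA": 65, "CompetitorB": 50}

Path("reports").mkdir(exist_ok=True)
fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(list(scores.keys()), list(scores.values()))
ax.set_ylabel("AI Visibility Score")
ax.set_ylim(0, 100)
fig.tight_layout()
fig.savefig("reports/visibility_scores.png", dpi=150)
```

The Markdown would then reference the image, e.g. ![Visibility scores](visibility_scores.png).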
Interactive Output (optional): If required, we could produce an interactive HTML report or a Jupyter Notebook that contains the analysis. However, for client deployment, a static Markdown/PDF is often preferred for ease of distribution (no dependencies to view).
API/JSON Output: In addition to files, we could expose an API endpoint (if we set up a small web server) that returns the JSON results. For example, a GET request to /api/visibility_score could return the latest scores. This would allow integration with dashboards (e.g., if the client wants to pull the data into Power BI or Grafana). This is an optional enhancement; the core requirement is satisfied with files and reports, but the architecture allows adding a presentation layer if needed.
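If this optional layer were added, a minimal sketch using FastAPI might look like the following; the visibility_summary table and run_date column are assumptions based on the summary model above.

```python
# Optional read-only API over the stored summary table.
import sqlite3

from fastapi import FastAPI

app = FastAPI()

@app.get("/api/visibility_score")
def latest_visibility_score():
    with sqlite3.connect("data/visibility.db") as conn:
        row = conn.execute(
            "SELECT run_date, visibility_score FROM visibility_summary "
            "ORDER BY run_date DESC LIMIT 1"
        ).fetchone()
    if row is None:
        return {"visibility_score": None}
    return {"date": row[0], "visibility_score": row[1]}
```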
Data Model Extensibility: The models are designed to be extensible. For instance, if later we want to track specific quotes or sources LLMs use when mentioning the brand, we could augment the LLM response model with those details. If we want to store feedback (like if a user manually flags a suggestion as implemented), we could add a field to recommendations. We keep the schema flexible (using JSON fields for some parts, or easy to alter tables) to accommodate future needs.
Data Privacy Considerations: All stored data is either public (web pages, LLM output about public info) or the brand’s own content. We ensure no personal user data is involved. The JSON outputs with LLM responses should be treated as ephemeral or for internal use, since LLM content can sometimes contain inaccuracies; we mark it with context (like what prompt generated it) in the data so it’s not misattributed.
In conclusion, the system’s data models capture essential information at each stage in a structured way. Outputs are provided in multiple formats:
JSON for programmatic access,
CSV for tabular data analysis,
and a comprehensive Markdown report for easy reading by stakeholders.
The Markdown report, combined with any charts or tables, will allow even non-technical team members to understand the brand’s AI visibility status and the recommended steps to improve it, which is a key goal of this system.
Security and Configuration Considerations
Building a client-deployable system requires careful attention to security and the ability to configure the system to different client needs. Below, we outline the main considerations:
1. Data Security and Privacy:
Public Data Focus: The system primarily handles public data – public web content and AI-generated answers about a brand. This means we avoid storing sensitive personal data. The brand’s own content is presumably not confidential if it’s on their website. However, the results of analysis (e.g., the fact that an AI didn’t mention the brand) might be sensitive to the client competitively, so we treat all outputs as internal to the client.
Storage Security: If using a database, secure it with credentials and network rules. For example, in Docker deployment, the Postgres DB should be on an internal network and use a strong password for the user. If the client already has a secure database service, we integrate with that (Airflow can manage connections securely).
Access Control: The Airflow UI and any generated reports should be protected. Airflow can be configured with user authentication. If the UI is not needed for daily use, it can even be shut down or only run on demand, with the DAG triggered via the CLI (airflow dags trigger) by an authorized user. Reports containing analysis can be distributed to specific people rather than broadly, depending on client preference.
LLM API Data: Content sent to external LLM APIs (the prompts and any brand info included) is minimal and generic (asking about public facts). We avoid sending any truly sensitive info to these APIs. Still, some organizations have policies against sending even prompts externally. Configuration can allow the LLM Query Agent to use sanitized prompts (e.g., no internal code or URLs, only general questions). If needed, one can skip queries that might reveal internal plans.
OpenAI/Anthropic data handling: These providers have their own data usage policies (OpenAI, for instance, states that API prompts and data are not used for model training by default). We should ensure the client is aware and has appropriate agreements in place if needed. Alternatively, use Azure OpenAI, which may offer different data guarantees, if the client has that.
Crawl Etiquette and Legal: The crawler respects robots.txt to avoid unauthorized scraping. We might also implement domain whitelists/blacklists if the client wants (e.g., avoid scraping certain sites). The targeted search approach inherently limits us to likely relevant sites. We also include a proper user agent string identifying the crawler (e.g., "AcmeVisibilityBot/1.0 (+email)") so site owners know it's a legitimate tool, which is good practice.
SSL and Requests: All API calls and web requests use HTTPS where available. The system should validate SSL certificates. With Python requests or Scrapy this is the default; we ensure SSL verification is never disabled.
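As a sketch, the crawl-etiquette points above map naturally onto Scrapy's settings.py; the bot name and contact address are placeholders.

```python
# Crawler politeness settings (Scrapy settings.py style).
ROBOTSTXT_OBEY = True
USER_AGENT = "AcmeVisibilityBot/1.0 (+mailto:webmaster@acmecloud.example)"
DOWNLOAD_DELAY = 1.0                  # polite delay between requests to the same site
CONCURRENT_REQUESTS_PER_DOMAIN = 2    # keep per-domain load low
# SSL verification is on by default in Scrapy and requests; do not disable it.
```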
2. API Key Management:
API keys for OpenAI, etc., are highly sensitive. We handle them as configuration secrets:
They are never hard-coded in code or images. Instead, we fetch them from environment variables or Airflow Connections.
Provide a template .env.example for Docker where the client can input their keys. This file is then used at runtime, not baked into any distributed artifact.
In Airflow, one can set these in the UI or through its secrets backend (Airflow can use HashiCorp Vault, AWS Secrets Manager, etc., if the client prefers).
We avoid logging the keys. Logs should not print the API key or full request that includes it.
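A minimal sketch of how an agent might resolve keys at runtime from the injected environment, failing fast without echoing the secret; the key names are illustrative, and inside Airflow tasks the same values could instead come from Airflow Connections or Variables.

```python
# Resolve API keys from environment variables populated via the .env file.
import os

def get_api_key(name: str) -> str:
    value = os.environ.get(name)
    if not value:
        # Fail fast with a message that never echoes secret material.
        raise RuntimeError(f"Missing required API key in environment: {name}")
    return value

openai_key = get_api_key("OPENAI_API_KEY")
anthropic_key = get_api_key("ANTHROPIC_API_KEY")
```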
3. Configurability:
Each client might have different needs. We design the system to be configurable via external files or environment settings:
Brand and Competitors: These are specified in a config (could be a YAML like the following) which the agents read:

```yaml
brand: "AcmeCloud"
competitors: ["Backblaze", "XYZBackup"]
industry_terms: ["cloud backup", "disaster recovery"]
```

The prompt generator will use these to form questions. The crawler will use brand and possibly competitors (if we also want to track competitor mentions on the web).
Prompts list: We can allow the client to edit or provide the list of user prompts they care about, e.g., a text file with one question per line. This way, marketing can tailor which scenarios to test.
Scheduling frequency: Controlled by Airflow schedule config. Can be easily changed to daily/weekly. The client might start with on-demand runs and then schedule monthly, etc.
Feature toggles: If, for example, a client doesn’t want to use the Optimization Agent’s ML rewriting (maybe they only want rule-based suggestions), we can toggle that via a config flag. Similarly, if sentiment analysis is not needed, disable that to save time.
Output options: Config can specify where to put the report (local folder, email, etc.). Email settings (SMTP) can be configured if they want auto-emailing of reports (Airflow can handle email on failure or on success callbacks).
Logging verbosity: A debug-mode config setting makes agents log more detail (which pages are being crawled, which prompt is being sent, etc.), which is useful during initial deployment and testing. In production runs, we switch to INFO level to log only high-level information and warnings.
Resource config: If running in a constrained environment, we might set crawler options like max_pages_per_run or request_delay to control load, or a limit on concurrent LLM calls. These can live in the config file.
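Tying these options together, a sketch of loading the YAML config shown earlier; the keys beyond brand, competitors, and industry_terms (such as max_pages_per_run, request_delay, and the feature-toggle names) follow the examples in this list, and the defaults are illustrative.

```python
# Load the client-editable YAML config and merge it over illustrative defaults.
import yaml  # PyYAML

DEFAULTS = {
    "max_pages_per_run": 200,
    "request_delay": 1.0,
    "enable_optimization_agent": True,
    "enable_sentiment": False,
}

def load_config(path: str = "config.yaml") -> dict:
    with open(path, "r", encoding="utf-8") as f:
        cfg = yaml.safe_load(f) or {}
    return {**DEFAULTS, **cfg}

config = load_config()
brand = config["brand"]
competitors = config.get("competitors", [])
```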
4. Performance and Rate Limits:
We ensure the system operates within limits:
The crawler has delays and concurrency limits to avoid IP blocking. If necessary, using a proxy or rotating proxies can be configured (some clients might have their own proxy).
The LLM queries adhere to rate limits by default, but if a client has a stricter limit (say only 60 queries/min allowed), we can configure a pause between calls or chunk the queries; a simple throttle sketch follows this list.
We also guard against generating excessive load on the client's own site when crawling it (the same politeness settings apply there, although that site is under the client's control anyway).
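A minimal client-side throttle sketch for the LLM calls mentioned above; a token-bucket scheme or the provider SDK's built-in retry/backoff could be used instead, and the 60-calls-per-minute budget is illustrative.

```python
# Simple per-process throttle that spaces out LLM API calls.
import time

class RateLimiter:
    def __init__(self, max_calls_per_minute: int):
        self.min_interval = 60.0 / max_calls_per_minute
        self._last_call = 0.0

    def wait(self) -> None:
        # Sleep just long enough to respect the configured call rate.
        elapsed = time.monotonic() - self._last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last_call = time.monotonic()

limiter = RateLimiter(max_calls_per_minute=60)
# Call limiter.wait() immediately before each LLM API request.
```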
5. Error Handling and Recovery:
From a security standpoint, robust error handling also prevents things like data corruption:
If an agent fails mid-run, the orchestrator will stop the pipeline, ensuring we don’t carry on with partial data.
We can implement idempotency where possible: e.g., if the crawler ran earlier today and data is still in DB, a rerun could either clear and recrawl or skip if unchanged. Perhaps simpler: treat each run separately (append data with timestamps, or wipe previous data each time since it’s ephemeral). The strategy can be configured (some may want to archive each run’s data; others only care about latest).
Logging of errors is thorough but avoid exposing secrets. For example, log “API call failed with 401 – check API key” rather than logging the key.
6. Compliance and Privacy:
Though dealing with public info, if the client is in a regulated industry, we ensure:
All open-source libraries used are properly licensed for commercial use (e.g., Scrapy is BSD-licensed and Airflow is Apache 2.0 – both fine for commercial deployment).
If the client requires code review, we provide source for transparency (especially for security).
The system does not store any personal data, so GDPR and similar regulations are typically not a concern. The one edge case is that if the brand name is also a person's name, the crawl might incidentally pick up personal data; this is unlikely, since we mostly track company and product names.
7. Network and Access:
The machine running this should be secured. Only necessary ports (Airflow UI, maybe SSH for maintenance) open. Outbound internet is required as mentioned.
If deploying in the client's cloud, follow their security group rules for egress. Egress can be restricted to the known API domains (OpenAI, Anthropic, Bing), but general outbound HTTPS (port 443) must remain open because the crawler also accesses arbitrary websites.
Internally, ensure the containers can’t be accessed from outside. Docker by default can isolate, but e.g., don’t map the database port to host unless needed.
8. Secrets in Code and Images:
We maintain separation of config from code. For instance, if we ship a Docker image with our Python code, we do not include any client-specific config. The image is generic. The client then mounts a config file or passes env variables to customize. This way, the same image could even be reused for multiple clients by just changing configs.
Airflow DAG might read from a configuration file for things like prompts and brand, instead of hardcoding them. Alternatively, use Airflow Variables (which can be set via UI or CLI per deployment).
9. Testing and Sandbox:
We likely include a dry-run mode (especially for the LLM Query Agent) that can be used in testing without consuming too much API quota. For example, a config flag USE_MOCK_LLM_RESPONSES could tell the agent to either use stored example responses or a cheaper model for trials. This helps in testing the pipeline end-to-end in a sandbox environment. Once confirmed, it can be switched to real mode. This prevents accidental huge API bills during initial setup.
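A minimal sketch of how the USE_MOCK_LLM_RESPONSES flag might be honored inside the LLM Query Agent; the fixture path and JSON layout are assumptions.

```python
# Dry-run toggle: return canned responses instead of calling paid LLM APIs.
import json
import os

USE_MOCK = os.environ.get("USE_MOCK_LLM_RESPONSES", "false").lower() == "true"

def query_llm(prompt: str, llm_name: str) -> str:
    if USE_MOCK:
        with open("tests/fixtures/mock_responses.json", encoding="utf-8") as f:
            canned = json.load(f)  # assumed layout: {llm_name: {prompt: response}}
        return canned.get(llm_name, {}).get(prompt, "[no mock response recorded]")
    # In real mode, the LLM Query Agent makes the actual API call here.
    raise NotImplementedError("Real LLM calls are handled by the LLM Query Agent")
```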
10. Audit and Logging:
Security also includes auditing actions:
We log all external interactions (which URLs crawled, which API endpoints called) so if something goes awry, one can trace it. For instance, log “Crawled 10 pages from domain X” or “Queried OpenAI for prompt Y”. This helps ensure the system is doing what is expected and nothing malicious.
If required, these logs can be reviewed by the client’s security team, since everything the system does externally is on behalf of the client.
11. Upgrades and Maintenance:
From a security perspective, we keep dependencies updated. For example, using latest patches for Airflow (since it’s a web app, any known vulnerabilities should be patched), updating the base OS images regularly, etc. We might schedule maintenance windows to update the Docker images every few months or as needed. The client’s IT should also do routine updates if this runs long-term.
12. Fail-safe Defaults:
If, for example, the LLM API keys are not configured, the system should not run that part or should clearly error out rather than trying to call with no key (which might result in too many failed calls). The DAG could be configured to skip the LLM step if no key provided, and still do crawling and maybe just output that LLM data is missing. This kind of flexibility could be helpful if, say, a client initially doesn’t have an API key and only wants web data first.
In summary, the system is built with a security-first mindset:
Minimal exposure of services.
Proper handling of secrets and data.
Respect for external resources (ethical crawling, API terms).
Configurability to comply with client’s environment (like using proxies, keys, toggling features).
Regular security hygiene (updating dependencies, etc.).
By following these practices, the multi-agent system can be safely deployed in a client’s infrastructure, providing valuable AI visibility insights without introducing undue risk or complexity. All configurations can be adjusted without modifying the core code, making the system adaptable to different client requirements and policies.