Wiki-as-a-Data Technical Architecture
Here’s a clean, scalable reference architecture for a “Wiki-as-a-Data-Platform” that powers high-quality RAG apps and can license datasets for LLM fine-tuning. It’s modular so you can swap pieces as you grow.
Goals
Author once → serve everywhere: human-friendly wiki, API, search, RAG, and licensable datasets.
Trust & provenance: versioning, citations, audit trails, PII controls.
Retrieval excellence: hybrid lexical+dense+graph retrieval with reranking.
Data products: curated JSONL/Parquet/Delta exports with metering and SLAs.
High-Level Architecture (layers)
1) Ingest & Authoring → 2) Normalization & Governance → 3) Storage (object + relational + vector + graph) →
4) Indexing & Embeddings → 5) Serving APIs (Content, Search, RAG, Export/Licensing) →
6) Observability & Evaluation → 7) Finetuning Dataset Factory
Core Components (what/why + examples)
Data & Indexing Pipeline (key details)
Chunking: semantic chunking by headings/sentences; target ~200–400 tokens; store overlaps and hierarchy breadcrumbs (site→section→page→chunk) to preserve context.
Embeddings: store (a) chunk vector, (b) title vector, (c) entity vector; log embedding model/version for reproducibility.
Hybrid retrieval: BM25 (keyword/filters) ∪ vector KNN ∪ optional graph expansion (neighbors of matched entities). Merge with learned weights; then cross-encoder rerank top 100→20 (a fusion sketch follows this list).
Citations & provenance: each chunk keeps source_url, page_version, section_anchor, hash. Responses always return these.
Incremental updates: evented indexing: on page publish, push to queue → re-chunk → re-embed → upsert search/vector indices and neighbor edges.
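A minimal sketch of the weighted hybrid merge and cross-encoder rerank, assuming each retriever returns (chunk_id, score) pairs with scores already normalized; the weights and the get_text / cross_encoder_score names are illustrative, not a fixed API:

```python
from collections import defaultdict

def hybrid_merge(bm25_hits, knn_hits, graph_hits=(),
                 w_bm25=0.4, w_knn=0.5, w_graph=0.1):
    """Fuse per-retriever results into one ranked candidate list.

    Each *_hits argument is an iterable of (chunk_id, score) pairs with scores
    normalized to [0, 1]; the weights are placeholders for values learned
    offline from evaluation data.
    """
    fused = defaultdict(float)
    for hits, weight in ((bm25_hits, w_bm25), (knn_hits, w_knn), (graph_hits, w_graph)):
        for chunk_id, score in hits:
            fused[chunk_id] += weight * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def rerank(query, candidates, get_text, cross_encoder_score, keep=20):
    """Cross-encoder rerank of the fused top-K (e.g., 100 -> 20).

    get_text maps chunk_id -> chunk text; cross_encoder_score(query, text)
    wraps whichever reranker is deployed (e.g., a bge-reranker).
    """
    scored = [(chunk_id, cross_encoder_score(query, get_text(chunk_id)))
              for chunk_id, _ in candidates]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:keep]
```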
Serving APIs (contract-first)
Content API: GET /pages/:id?version=... returns structured JSON with sections, anchors, entities, and license tags.
Search API: lexical + faceted; supports filters: date, entity, topic, license.
RAG API: POST /rag/query with query + constraints → returns answers, citations, used_chunks, retrieval_stats (contract sketched below).
Entities API: CRUD for ontology; GET /entities/:id/neighbors.
Export API (Licensing): create datasets by scope (topics, date ranges, entity sets), format (JSONL/Parquet/Delta), and schema profiles (RAG, pretraining, instruction). Async job + webhooks + pre-signed download URLs.
Usage/Metering API: report retrievals/exports for billing & compliance.
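A contract-first sketch of the RAG endpoint using FastAPI/Pydantic; the field names mirror the list above, and the retriever/generator wiring is a placeholder rather than a fixed implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RagQuery(BaseModel):
    query: str
    filters: dict = {}            # e.g. {"topic": "...", "license": "CC-BY-4.0"}
    max_chunks: int = 20

class Citation(BaseModel):
    chunk_id: str
    source_url: str
    page_version: str
    section_anchor: str

class RagAnswer(BaseModel):
    answers: list[str]
    citations: list[Citation]
    used_chunks: list[str]
    retrieval_stats: dict         # latency, candidate counts, fusion weights, ...

@app.post("/rag/query", response_model=RagAnswer)
def rag_query(body: RagQuery) -> RagAnswer:
    # Placeholder wiring: swap in the real hybrid retriever + generator here.
    chunks: list[dict] = []       # retriever.search(body.query, filters=body.filters, k=body.max_chunks)
    return RagAnswer(
        answers=["<generated answer>"],
        citations=[Citation(**c["provenance"]) for c in chunks],
        used_chunks=[c["chunk_id"] for c in chunks],
        retrieval_stats={"candidates": len(chunks)},
    )
```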
Governance, Security & Legal
Versioning & lineage: every page and generated artifact has immutable version IDs.
PII & compliance: policy-based redaction/retention; region pinning; differential access by license.
License tagging: per-page and per-chunk license/rights; export filters enforce inclusion rules (see the filter sketch after this list).
Watermarking/fingerprints: embed invisible IDs in exported text blocks to trace leaks.
Audit & consent: complete audit logs; consent registry for contributed content.
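One way the export-time license filter could look; the license identifiers and chunk fields are assumptions, not a fixed schema:

```python
# Licenses a given export contract is allowed to include (illustrative IDs).
ALLOWED_FOR_EXPORT = {"CC0-1.0", "CC-BY-4.0", "internal-licensed"}

def exportable(chunk: dict, allowed: set = ALLOWED_FOR_EXPORT) -> bool:
    """Inclusion rule: both the chunk and its parent page must carry an allowed
    license tag, and policy redaction must already have run on the chunk."""
    return (
        chunk.get("chunk_license") in allowed
        and chunk.get("page_license") in allowed
        and chunk.get("pii_redacted", False)
    )
```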
Finetuning Dataset Factory (instruction & pretraining)
Curated splits: train/val/test with temporal splits to avoid leakage (a small writer sketch follows this list).
Formats:
Pretraining: {"text": "...", "source": "...", "license": "...", "version": "..."}
SFT/Instruction: {"input": "...", "output": "...", "context": [...citations...]}
Synthesis: generate Q/A, summaries, flashcards, and chain-of-thought-free rationales from source chunks; always store citation IDs.
Quality gates: toxicity/PII checks, duplication, perplexity outlier filters, heuristic/LLM rubric scoring.
Manifests: dataset cards (license, composition, model compatibility, eval metrics).
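A small sketch of the temporal split and SFT writer under the schema above; the published_at and citation_ids fields and the cutoff dates are assumptions, and the exact-duplicate check stands in for the fuller quality gates:

```python
import hashlib
import json

def temporal_split(records, val_cutoff="2024-01-01", test_cutoff="2024-07-01"):
    """Split on the ISO published_at timestamp so newer content never leaks
    into training data; the cutoff dates are illustrative."""
    splits = {"train": [], "val": [], "test": []}
    for r in records:
        if r["published_at"] < val_cutoff:
            splits["train"].append(r)
        elif r["published_at"] < test_cutoff:
            splits["val"].append(r)
        else:
            splits["test"].append(r)
    return splits

def write_sft_jsonl(path, examples):
    """Emit SFT records in the schema above; exact-duplicate outputs are
    dropped here as a stand-in for the fuller dedup/PII/toxicity gates."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            digest = hashlib.sha256(ex["output"].encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            f.write(json.dumps({
                "input": ex["input"],
                "output": ex["output"],
                "context": ex["citation_ids"],   # chunk IDs for provenance
            }, ensure_ascii=False) + "\n")
```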
RAG Reference Flow (online serving)
Query understanding: spell-fix, entity detect, intent (ask vs browse).
Retriever: hybrid search with structured filters; optional graph hop.
Rerank: cross-encoder on top-K.
Grounding: assemble context windows with diverse sources and section titles; de-dup near-similar chunks (see the sketch after this list).
Generator: call LLM with strict citation constraint; enable tool mode for follow-ups.
Answer policy checks: fact-checking heuristics; block if low grounding score.
Telemetry: store retrieval/latency/answer faithfulness; capture user feedback for re-training.
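A compact sketch of the grounding and policy-gate steps; the token-Jaccard de-dup and the rerank_score-average grounding heuristic are stand-ins for whatever scorers you actually deploy:

```python
def near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Cheap token-Jaccard check; a production system might compare embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1) >= threshold

def assemble_context(chunks, max_chunks=8):
    """Keep diverse chunks (chunks arrive rerank-ordered) and drop near-duplicates."""
    kept = []
    for c in chunks:
        if any(near_duplicate(c["text"], k["text"]) for k in kept):
            continue
        kept.append(c)
        if len(kept) == max_chunks:
            break
    return kept

def answer_or_block(generate, query, chunks, min_grounding=0.6):
    """Policy gate: refuse rather than answer when grounding looks weak.
    generate(query, context) and the grounding heuristic are placeholders."""
    context = assemble_context(chunks)
    grounding = sum(c["rerank_score"] for c in context) / max(len(context), 1)
    if grounding < min_grounding:
        return {"blocked": True, "reason": "low grounding score", "citations": []}
    return {"blocked": False,
            "answers": [generate(query, context)],
            "citations": [c["chunk_id"] for c in context]}
```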
Minimal Viable Stack (fast to ship)
CMS: Headless (Strapi) + Markdown.
Storage: Postgres (metadata + pgvector; upsert sketch after this list), S3 (blobs), OpenSearch (text).
Pipelines: Dagster + Airbyte; FastAPI microservices; Redis cache.
Embeddings/Reranker: open models (e.g., E5/BGE + bge-reranker) or hosted equivalents.
RAG Service: FastAPI with a thin orchestration layer (LangChain optional).
Exports: Parquet on S3 with signed URLs; dataset manifests in Postgres.
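A minimal pgvector sketch for the chunk store, assuming a psycopg2 connection and the illustrative columns below (the embedding dimension must match your model):

```python
import psycopg2

# conn = psycopg2.connect("dbname=wiki user=app")

def ensure_schema(conn) -> None:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                chunk_id       text PRIMARY KEY,
                page_id        text NOT NULL,
                page_version   text NOT NULL,
                section_anchor text,
                license        text,
                body           text,
                embedding      vector(768)   -- match your embedding model's dimension
            )
        """)
    conn.commit()

def upsert_chunk(conn, chunk: dict, embedding: list) -> None:
    """Idempotent upsert keyed on chunk_id; the vector is passed as a
    '[...]' literal so no extra client-side adapter is needed."""
    vec = "[" + ",".join(str(x) for x in embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO chunks (chunk_id, page_id, page_version, section_anchor,
                                license, body, embedding)
            VALUES (%s, %s, %s, %s, %s, %s, %s::vector)
            ON CONFLICT (chunk_id) DO UPDATE
               SET page_version = EXCLUDED.page_version,
                   body = EXCLUDED.body,
                   embedding = EXCLUDED.embedding
            """,
            (chunk["chunk_id"], chunk["page_id"], chunk["page_version"],
             chunk.get("section_anchor"), chunk.get("license"), chunk["body"], vec),
        )
    conn.commit()
```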
Scale & Enterprise Readiness
Sharding & tenants: tenant_id on all tables and indices; per-tenant encryption keys.
Zero-downtime reindex: dual-write to new index aliases; flip on completion (alias-flip sketch after this list).
Cold→warm tiers: S3 (cold) + vector/search (warm) populated on demand via event workers.
Cost controls: adaptive chunking, popularity-based cache, and eviction policies.
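An alias-flip sketch with opensearch-py; the index and alias names are illustrative, and writers are assumed to dual-write to both indices while it runs:

```python
from opensearchpy import OpenSearch

def zero_downtime_reindex(client: OpenSearch, alias: str,
                          old_index: str, new_index: str) -> None:
    """Build the new index, copy documents server-side, then atomically repoint
    the read alias so readers never see an empty or half-built index."""
    # 1) New index with updated settings/mappings (mapping body omitted here).
    client.indices.create(index=new_index, body={"settings": {"number_of_shards": 3}})

    # 2) Server-side copy of existing documents; dual-write covers new updates.
    client.reindex(body={"source": {"index": old_index},
                         "dest": {"index": new_index}},
                   wait_for_completion=True)

    # 3) Atomic alias flip.
    client.indices.update_aliases(body={"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]})
```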
Schematic (text)
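A compact text rendering of the flow described above (illustrative, not a fixed topology):

```
Authoring / CMS
      │  publish events
      ▼
Ingest & Normalization  (governance, PII redaction, license tags)
      │
      ▼
Storage:  object store (blobs, exports) · relational (pages, versions, licenses)
          · lexical index · vector index · entity graph
      │
      ▼
Indexing & Embeddings workers  (evented, incremental re-chunk / re-embed)
      │
      ▼
Serving APIs:  Content · Search · RAG · Export/Licensing · Usage/Metering
      │
      ▼
Observability & Evaluation ──► Finetuning Dataset Factory ──► licensed datasets
```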
Practical tips that move the needle
Chunk IDs everywhere: make them first-class for citations, exports, and fingerprinting.
Hybrid first: lexical often beats vectors on proper nouns and rare terms; fuse signals.
Entity tables: maintain a clean entity store with aliases/synonyms; improves both search and graph hops.
Eval loops: automate weekly retrieval/faithfulness benchmarks; gate model/index upgrades on scores (a minimal recall@k check is sketched after these tips).
Content ops: enforce editorial checklists (title quality, abstracts, tags) to boost retrieval quality before ML.
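For the eval loop, a minimal retrieval metric that a weekly benchmark could gate upgrades on; the query IDs and gold labels are whatever your eval set provides:

```python
def recall_at_k(results: dict[str, list[str]], gold: dict[str, set[str]], k: int = 20) -> float:
    """Share of queries whose top-k retrieved chunk IDs hit at least one gold
    (relevant) chunk; the kind of score to trend weekly and gate releases on."""
    hits = sum(1 for qid, retrieved in results.items()
               if set(retrieved[:k]) & gold.get(qid, set()))
    return hits / max(len(results), 1)
```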
If you want, I can tailor this to your exact stack (cloud, team size, budget) and draft the initial Postgres schema (pages, chunks, entities, licenses, usage) plus the RAG API contracts.