Wiki-as-a-Data Technical Architecture
Here’s a clean, scalable reference architecture for a “Wiki-as-a-Data-Platform” that powers high-quality RAG apps and can license datasets for LLM fine-tuning. It’s modular so you can swap pieces as you grow.
Goals
Author once → serve everywhere: human-friendly wiki, API, search, RAG, and licensable datasets.
Trust & provenance: versioning, citations, audit trails, PII controls.
Retrieval excellence: hybrid lexical+dense+graph retrieval with reranking.
Data products: curated JSONL/Parquet/Delta exports with metering and SLAs.
High-Level Architecture (layers)
1) Ingest & Authoring → 2) Normalization & Governance → 3) Storage (object + relational + vector + graph) →
4) Indexing & Embeddings → 5) Serving APIs (Content, Search, RAG, Export/Licensing) →
6) Observability & Evaluation → 7) Finetuning Dataset Factory
Core Components (what/why + examples)
Data & Indexing Pipeline (key details)
Chunking: semantic chunking by headings/sentences; target ~200–400 tokens; store overlaps and hierarchy breadcrumbs (site→section→page→chunk) to preserve context.
Embeddings: store (a) chunk vector, (b) title vector, (c) entity vector; log embedding model/version for reproducibility.
Hybrid retrieval: BM25 (keyword/filters) ∪ vector KNN ∪ optional graph expansion (neighbors of matched entities). Merge with learned weights; then cross-encoder rerank top 100→20 (a fusion sketch follows this list).
Citations & provenance: each chunk keeps source_url, page_version, section_anchor, hash. Responses always return these.
Incremental updates: evented indexing: on page publish, push to queue → re-chunk → re-embed → upsert search/vector indices and neighbor edges.
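A minimal sketch of the weighted hybrid merge and cross-encoder rerank, assuming each retriever returns (chunk_id, score) pairs with scores already normalized; the weights and the get_text / cross_encoder_score names are illustrative, not a fixed API:

```python
from collections import defaultdict

def hybrid_merge(bm25_hits, knn_hits, graph_hits=(),
                 w_bm25=0.4, w_knn=0.5, w_graph=0.1):
    """Fuse per-retriever results into one ranked candidate list.

    Each *_hits argument is an iterable of (chunk_id, score) pairs with scores
    normalized to [0, 1]; the weights are placeholders for values learned
    offline from evaluation data.
    """
    fused = defaultdict(float)
    for hits, weight in ((bm25_hits, w_bm25), (knn_hits, w_knn), (graph_hits, w_graph)):
        for chunk_id, score in hits:
            fused[chunk_id] += weight * score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def rerank(query, candidates, get_text, cross_encoder_score, keep=20):
    """Cross-encoder rerank of the fused top-K (e.g., 100 -> 20).

    get_text maps chunk_id -> chunk text; cross_encoder_score(query, text)
    wraps whichever reranker is deployed (e.g., a bge-reranker).
    """
    scored = [(chunk_id, cross_encoder_score(query, get_text(chunk_id)))
              for chunk_id, _ in candidates]
    return sorted(scored, key=lambda kv: kv[1], reverse=True)[:keep]
```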
Serving APIs (contract-first)
Content API: GET /pages/:id?version=... returns structured JSON with sections, anchors, entities, and license tags.
Search API: lexical + faceted; supports filters: date, entity, topic, license.
RAG API: POST /rag/query with query + constraints → returns answers, citations, used_chunks, retrieval_stats (contract sketched below).
Entities API: CRUD for ontology; GET /entities/:id/neighbors.
Export API (Licensing): create datasets by scope (topics, date ranges, entity sets), format (JSONL/Parquet/Delta), and schema profiles (RAG, pretraining, instruction). Async job + webhooks + pre-signed download URLs.
Usage/Metering API: report retrievals/exports for billing & compliance.
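A contract-first sketch of the RAG endpoint using FastAPI/Pydantic; the field names mirror the list above, and the retriever/generator wiring is a placeholder rather than a fixed implementation:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RagQuery(BaseModel):
    query: str
    filters: dict = {}            # e.g. {"topic": "...", "license": "CC-BY-4.0"}
    max_chunks: int = 20

class Citation(BaseModel):
    chunk_id: str
    source_url: str
    page_version: str
    section_anchor: str

class RagAnswer(BaseModel):
    answers: list[str]
    citations: list[Citation]
    used_chunks: list[str]
    retrieval_stats: dict         # latency, candidate counts, fusion weights, ...

@app.post("/rag/query", response_model=RagAnswer)
def rag_query(body: RagQuery) -> RagAnswer:
    # Placeholder wiring: swap in the real hybrid retriever + generator here.
    chunks: list[dict] = []       # retriever.search(body.query, filters=body.filters, k=body.max_chunks)
    return RagAnswer(
        answers=["<generated answer>"],
        citations=[Citation(**c["provenance"]) for c in chunks],
        used_chunks=[c["chunk_id"] for c in chunks],
        retrieval_stats={"candidates": len(chunks)},
    )
```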
Governance, Security & Legal
Versioning & lineage: every page and generated artifact has immutable version IDs.
PII & compliance: policy-based redaction/retention; region pinning; differential access by license.
License tagging: per-page and per-chunk license/rights; export filters enforce inclusion rules (see the filter sketch after this list).
Watermarking/fingerprints: embed invisible IDs in exported text blocks to trace leaks.
Audit & consent: complete audit logs; consent registry for contributed content.
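One way the export-time license filter could look; the license identifiers and chunk fields are assumptions, not a fixed schema:

```python
# Licenses a given export contract is allowed to include (illustrative IDs).
ALLOWED_FOR_EXPORT = {"CC0-1.0", "CC-BY-4.0", "internal-licensed"}

def exportable(chunk: dict, allowed: set = ALLOWED_FOR_EXPORT) -> bool:
    """Inclusion rule: both the chunk and its parent page must carry an allowed
    license tag, and policy redaction must already have run on the chunk."""
    return (
        chunk.get("chunk_license") in allowed
        and chunk.get("page_license") in allowed
        and chunk.get("pii_redacted", False)
    )
```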
Finetuning Dataset Factory (instruction & pretraining)
Curated splits: train/val/test with temporal splits to avoid leakage (a small writer sketch follows this list).
Formats:
Pretraining: {"text": "...", "source": "...", "license": "...", "version": "..."}
SFT/Instruction: {"input": "...", "output": "...", "context": [...citations...]}
Synthesis: generate Q/A, summaries, flashcards, and chain-of-thought-free rationales from source chunks; always store citation IDs.
Quality gates: toxicity/PII checks, duplication, perplexity outlier filters, heuristic/LLM rubric scoring.
Manifests: dataset cards (license, composition, model compatibility, eval metrics).
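A small sketch of the temporal split and SFT writer under the schema above; the published_at and citation_ids fields and the cutoff dates are assumptions, and the exact-duplicate check stands in for the fuller quality gates:

```python
import hashlib
import json

def temporal_split(records, val_cutoff="2024-01-01", test_cutoff="2024-07-01"):
    """Split on the ISO published_at timestamp so newer content never leaks
    into training data; the cutoff dates are illustrative."""
    splits = {"train": [], "val": [], "test": []}
    for r in records:
        if r["published_at"] < val_cutoff:
            splits["train"].append(r)
        elif r["published_at"] < test_cutoff:
            splits["val"].append(r)
        else:
            splits["test"].append(r)
    return splits

def write_sft_jsonl(path, examples):
    """Emit SFT records in the schema above; exact-duplicate outputs are
    dropped here as a stand-in for the fuller dedup/PII/toxicity gates."""
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for ex in examples:
            digest = hashlib.sha256(ex["output"].encode("utf-8")).hexdigest()
            if digest in seen:
                continue
            seen.add(digest)
            f.write(json.dumps({
                "input": ex["input"],
                "output": ex["output"],
                "context": ex["citation_ids"],   # chunk IDs for provenance
            }, ensure_ascii=False) + "\n")
```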
RAG Reference Flow (online serving)
Query understanding: spell-fix, entity detect, intent (ask vs browse).
Retriever: hybrid search with structured filters; optional graph hop.
Rerank: cross-encoder on top-K.
Grounding: assemble context windows with diverse sources and section titles; de-dup near-similar chunks (see the sketch after this list).
Generator: call LLM with strict citation constraint; enable tool mode for follow-ups.
Answer policy checks: fact-checking heuristics; block if low grounding score.
Telemetry: store retrieval/latency/answer faithfulness; capture user feedback for re-training.
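A compact sketch of the grounding and policy-gate steps; the token-Jaccard de-dup and the rerank_score-average grounding heuristic are stand-ins for whatever scorers you actually deploy:

```python
def near_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Cheap token-Jaccard check; a production system might compare embeddings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1) >= threshold

def assemble_context(chunks, max_chunks=8):
    """Keep diverse chunks (chunks arrive rerank-ordered) and drop near-duplicates."""
    kept = []
    for c in chunks:
        if any(near_duplicate(c["text"], k["text"]) for k in kept):
            continue
        kept.append(c)
        if len(kept) == max_chunks:
            break
    return kept

def answer_or_block(generate, query, chunks, min_grounding=0.6):
    """Policy gate: refuse rather than answer when grounding looks weak.
    generate(query, context) and the grounding heuristic are placeholders."""
    context = assemble_context(chunks)
    grounding = sum(c["rerank_score"] for c in context) / max(len(context), 1)
    if grounding < min_grounding:
        return {"blocked": True, "reason": "low grounding score", "citations": []}
    return {"blocked": False,
            "answers": [generate(query, context)],
            "citations": [c["chunk_id"] for c in context]}
```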
Minimal Viable Stack (fast to ship)
CMS: Headless (Strapi) + Markdown.
Storage: Postgres (metadata + pgvector; upsert sketch after this list), S3 (blobs), OpenSearch (text).
Pipelines: Dagster + Airbyte; FastAPI microservices; Redis cache.
Embeddings/Reranker: open models (e.g., E5/BGE + bge-reranker) or hosted equivalents.
RAG Service: FastAPI with a thin orchestration layer (LangChain optional).
Exports: Parquet on S3 with signed URLs; dataset manifests in Postgres.
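A minimal pgvector sketch for the chunk store, assuming a psycopg2 connection and the illustrative columns below (the embedding dimension must match your model):

```python
import psycopg2

# conn = psycopg2.connect("dbname=wiki user=app")

def ensure_schema(conn) -> None:
    with conn.cursor() as cur:
        cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
        cur.execute("""
            CREATE TABLE IF NOT EXISTS chunks (
                chunk_id       text PRIMARY KEY,
                page_id        text NOT NULL,
                page_version   text NOT NULL,
                section_anchor text,
                license        text,
                body           text,
                embedding      vector(768)   -- match your embedding model's dimension
            )
        """)
    conn.commit()

def upsert_chunk(conn, chunk: dict, embedding: list) -> None:
    """Idempotent upsert keyed on chunk_id; the vector is passed as a
    '[...]' literal so no extra client-side adapter is needed."""
    vec = "[" + ",".join(str(x) for x in embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO chunks (chunk_id, page_id, page_version, section_anchor,
                                license, body, embedding)
            VALUES (%s, %s, %s, %s, %s, %s, %s::vector)
            ON CONFLICT (chunk_id) DO UPDATE
               SET page_version = EXCLUDED.page_version,
                   body = EXCLUDED.body,
                   embedding = EXCLUDED.embedding
            """,
            (chunk["chunk_id"], chunk["page_id"], chunk["page_version"],
             chunk.get("section_anchor"), chunk.get("license"), chunk["body"], vec),
        )
    conn.commit()
```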
Scale & Enterprise Readiness
Sharding & tenants: tenant_id on all tables and indices; per-tenant encryption keys.
Zero-downtime reindex: dual-write to new index aliases; flip on completion (alias-flip sketch after this list).
Cold→warm tiers: S3 (cold) + vector/search (warm) populated on demand via event workers.
Cost controls: adaptive chunking, popularity-based cache, and eviction policies.
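An alias-flip sketch with opensearch-py; the index and alias names are illustrative, and writers are assumed to dual-write to both indices while it runs:

```python
from opensearchpy import OpenSearch

def zero_downtime_reindex(client: OpenSearch, alias: str,
                          old_index: str, new_index: str) -> None:
    """Build the new index, copy documents server-side, then atomically repoint
    the read alias so readers never see an empty or half-built index."""
    # 1) New index with updated settings/mappings (mapping body omitted here).
    client.indices.create(index=new_index, body={"settings": {"number_of_shards": 3}})

    # 2) Server-side copy of existing documents; dual-write covers new updates.
    client.reindex(body={"source": {"index": old_index},
                         "dest": {"index": new_index}},
                   wait_for_completion=True)

    # 3) Atomic alias flip.
    client.indices.update_aliases(body={"actions": [
        {"remove": {"index": old_index, "alias": alias}},
        {"add": {"index": new_index, "alias": alias}},
    ]})
```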
Schematic (text)
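A compact text rendering of the flow described above (illustrative, not a fixed topology):

```
Authoring / CMS
      │  publish events
      ▼
Ingest & Normalization  (governance, PII redaction, license tags)
      │
      ▼
Storage:  object store (blobs, exports) · relational (pages, versions, licenses)
          · lexical index · vector index · entity graph
      │
      ▼
Indexing & Embeddings workers  (evented, incremental re-chunk / re-embed)
      │
      ▼
Serving APIs:  Content · Search · RAG · Export/Licensing · Usage/Metering
      │
      ▼
Observability & Evaluation ──► Finetuning Dataset Factory ──► licensed datasets
```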
Practical tips that move the needle
Chunk IDs everywhere: make them first-class for citations, exports, and fingerprinting.
Hybrid first: lexical often beats vectors on proper nouns and rare terms; fuse signals.
Entity tables: maintain a clean entity store with aliases/synonyms; improves both search and graph hops.
Eval loops: automate weekly retrieval/faithfulness benchmarks; gate model/index upgrades on scores (a minimal recall@k check is sketched after these tips).
Content ops: enforce editorial checklists (title quality, abstracts, tags) to boost retrieval quality before ML.
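For the eval loop, a minimal retrieval metric that a weekly benchmark could gate upgrades on; the query IDs and gold labels are whatever your eval set provides:

```python
def recall_at_k(results: dict[str, list[str]], gold: dict[str, set[str]], k: int = 20) -> float:
    """Share of queries whose top-k retrieved chunk IDs hit at least one gold
    (relevant) chunk; the kind of score to trend weekly and gate releases on."""
    hits = sum(1 for qid, retrieved in results.items()
               if set(retrieved[:k]) & gold.get(qid, set()))
    return hits / max(len(results), 1)
```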
If you want, I can tailor this to your exact stack (cloud, team size, budget) and draft the initial Postgres schema (pages, chunks, entities, licenses, usage) plus the RAG API contracts.