Wiki-as-a-Data-Platform Technical Architecture

Here’s a clean, scalable reference architecture for a “Wiki-as-a-Data-Platform” that powers high-quality RAG apps and can license datasets for LLM fine-tuning. It’s modular so you can swap pieces as you grow.

Goals

  • Author once → serve everywhere: human-friendly wiki, API, search, RAG, and licensable datasets.

  • Trust & provenance: versioning, citations, audit trails, PII controls.

  • Retrieval excellence: hybrid lexical+dense+graph retrieval with reranking.

  • Data products: curated JSONL/Parquet/Delta exports with metering and SLAs.

High-Level Architecture (layers)

  1) Ingest & Authoring → 2) Normalization & Governance → 3) Storage (object + relational + vector + graph) →

  4) Indexing & Embeddings → 5) Serving APIs (Content, Search, RAG, Export/Licensing) →

  6) Observability & Evaluation → 7) Finetuning Dataset Factory

Core Components (what/why + examples)

Data & Indexing Pipeline (key details)

  • Chunking: semantic chunking by headings/sentences; target ~200–400 tokens; store overlaps and hierarchy breadcrumbs (site→section→page→chunk) to preserve context.

  • Embeddings: store (a) chunk vector, (b) title vector, (c) entity vector; log embedding model/version for reproducibility.

  • Hybrid retrieval: BM25 (keyword/filters) ∪ vector KNN ∪ optional graph expansion (neighbors of matched entities). Merge with learned weights; then cross-encoder rerank top 100→20 (see the fusion sketch after this list).

  • Citations & provenance: each chunk keeps source_url, page_version, section_anchor, hash. Responses always return these.

  • Incremental updates: evented indexing—on page publish, push to queue → re-chunk → re-embed → upsert search/vector indices and neighbor edges.
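
A minimal sketch of the fuse-then-rerank step from the hybrid retrieval bullet above, assuming BM25 and KNN hits arrive as (chunk_id, score) pairs with scores already normalized; the weights and the bge-reranker model name are illustrative choices, not part of the design.

```python
# Hedged sketch: fuse normalized BM25 and vector-KNN hits with learned weights,
# then cross-encoder rerank the merged top 100 down to 20.
# The weights and the reranker model name are illustrative placeholders.
from sentence_transformers import CrossEncoder

def fuse(bm25_hits, knn_hits, w_lex=0.4, w_vec=0.6):
    """bm25_hits / knn_hits: list of (chunk_id, score), scores pre-normalized to [0, 1]."""
    merged = {}
    for cid, s in bm25_hits:
        merged[cid] = merged.get(cid, 0.0) + w_lex * s
    for cid, s in knn_hits:
        merged[cid] = merged.get(cid, 0.0) + w_vec * s
    return sorted(merged.items(), key=lambda x: x[1], reverse=True)

def rerank(query, fused, chunk_texts, top_k=20):
    """fused: output of fuse(); chunk_texts: mapping chunk_id -> chunk body."""
    reranker = CrossEncoder("BAAI/bge-reranker-base")  # example open reranker
    top100 = [cid for cid, _ in fused[:100]]
    scores = reranker.predict([(query, chunk_texts[cid]) for cid in top100])
    ranked = sorted(zip(top100, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]
```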

Serving APIs (contract-first)

  • Content API: GET /pages/:id?version=... returns structured JSON with sections, anchors, entities, and license tags.

  • Search API: lexical+faceted; support filters: date, entity, topic, license.

  • RAG API: POST /rag/query with query + constraints → returns answers, citations, used_chunks, retrieval_stats (contract sketched after this list).

  • Entities API: CRUD for ontology; GET /entities/:id/neighbors.

  • Export API (Licensing): create datasets by scope (topics, date ranges, entity sets), format (JSONL/Parquet/Delta), and schema profiles (RAG, pretraining, instruction). Async job + webhooks + pre-signed download URLs.

  • Usage/Metering API: report retrievals/exports for billing & compliance.
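
A hedged FastAPI sketch of the POST /rag/query contract above. The field names mirror the bullet (answer, citations, used_chunks, retrieval_stats); retrieve() and generate() are stand-in stubs for the hybrid retriever and the grounded generator, not real components.

```python
# Hedged sketch of the POST /rag/query contract; retrieve() and generate()
# are placeholder stubs for the hybrid retriever and the grounded LLM call.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class RagQuery(BaseModel):
    query: str
    filters: dict = {}        # e.g. {"topic": "...", "license": "CC-BY-4.0"}
    max_chunks: int = 20

class Citation(BaseModel):
    chunk_id: str
    source_url: str
    page_version: str
    section_anchor: str

class RagAnswer(BaseModel):
    answer: str
    citations: list[Citation]
    used_chunks: list[str]
    retrieval_stats: dict     # e.g. {"bm25_hits": 87, "knn_hits": 100, "latency_ms": 142}

def retrieve(query: str, filters: dict, k: int):
    # placeholder for the hybrid retriever (BM25 ∪ KNN ∪ optional graph hop)
    return [], {"bm25_hits": 0, "knn_hits": 0, "latency_ms": 0}

def generate(query: str, chunks):
    # placeholder for the grounded LLM call with a strict citation constraint
    return "", []

@app.post("/rag/query", response_model=RagAnswer)
def rag_query(req: RagQuery) -> RagAnswer:
    chunks, stats = retrieve(req.query, req.filters, req.max_chunks)
    answer, cited = generate(req.query, chunks)
    return RagAnswer(
        answer=answer,
        citations=cited,
        used_chunks=[c.chunk_id for c in cited],
        retrieval_stats=stats,
    )
```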

Governance, Security & Legal

  • Versioning & lineage: every page and generated artifact has immutable version IDs.

  • PII & compliance: policy-based redaction/retention; region pinning; differential access by license.

  • License tagging: per-page and per-chunk license/rights; export filters enforce inclusion rules.

  • Watermarking/fingerprints: embed invisible IDs in exported text blocks to trace leaks.

  • Audit & consent: complete audit logs; consent registry for contributed content.
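
As a small illustration of the license-tagging and PII bullets, a sketch of an export-time inclusion filter; the record fields (license, pii_flag) and the allow-list are assumptions, not a fixed schema.

```python
# Hedged sketch: enforce per-chunk license inclusion rules at export time.
# The record fields (license, pii_flag) and the allow-list are illustrative.
ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0"}

def exportable(chunk: dict) -> bool:
    return chunk.get("license") in ALLOWED_LICENSES and not chunk.get("pii_flag", False)

chunks = [
    {"chunk_id": "c1", "license": "CC-BY-4.0", "pii_flag": False},
    {"chunk_id": "c2", "license": "all-rights-reserved", "pii_flag": False},
]
print([c["chunk_id"] for c in chunks if exportable(c)])  # -> ['c1']
```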

Finetuning Dataset Factory (instruction & pretraining)

  • Curated splits: train/val/test with temporal splits to avoid leakage.

  • Formats:

    • Pretraining: {"text": "...", "source": "...", "license": "...", "version": "..."}

    • SFT/Instruction: {"input": "...", "output": "...", "context": [...citations...]}

  • Synthesis: generate Q/A pairs, summaries, flashcards, and short rationales (without chain-of-thought) from source chunks; always store citation IDs.

  • Quality gates: toxicity/PII checks, deduplication, perplexity outlier filters, heuristic/LLM rubric scoring.

  • Manifests: dataset cards (license, composition, model compatibility, eval metrics).
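
A minimal sketch of emitting SFT/instruction records in the format above with a trivial exact-duplicate gate; the SHA-256 keying and the citation_ids field name are assumptions.

```python
# Hedged sketch: write SFT/instruction JSONL records that keep citation chunk IDs,
# with a trivial exact-duplicate gate. A real pipeline would add the PII/toxicity,
# perplexity, and rubric checks listed above.
import hashlib
import json

def write_sft(records, path="sft_train.jsonl"):
    seen = set()
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            key = hashlib.sha256((rec["input"] + rec["output"]).encode()).hexdigest()
            if key in seen:               # drop exact duplicates
                continue
            seen.add(key)
            f.write(json.dumps({
                "input": rec["input"],
                "output": rec["output"],
                "context": rec["citation_ids"],   # chunk IDs backing the answer
            }, ensure_ascii=False) + "\n")

write_sft([{
    "input": "What is hybrid retrieval?",
    "output": "Combining BM25 with vector KNN and merging the ranked results.",
    "citation_ids": ["chunk:123@v4"],
}])
```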

RAG Reference Flow (online serving)

  1. Query understanding: spell-fix, entity detection, intent classification (ask vs. browse).

  2. Retriever: hybrid search with structured filters; optional graph hop.

  3. Rerank: cross-encoder on top-K.

  4. Grounding: assemble context windows with diverse sources and section titles; de-dup near-duplicate chunks.

  5. Generator: call LLM with strict citation constraint; enable tool mode for follow-ups.

  6. Answer policy checks: fact-checking heuristics; block if low grounding score.

  7. Telemetry: store retrieval/latency/answer faithfulness; capture user feedback for re-training.
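
A compressed sketch of the seven steps above as one handler; every stub stands in for a real component, and the 0.5 grounding threshold is an arbitrary example.

```python
# Hedged sketch of the online flow; each stub stands in for a real component,
# and 0.5 is an arbitrary grounding-score threshold.
def understand(q):          return q.strip()                      # 1. query understanding
def hybrid_retrieve(q, f):  return [("chunk:1", 0.9)]             # 2. BM25 ∪ KNN ∪ graph hop
def rerank(q, cands):       return cands                          # 3. cross-encoder rerank
def assemble(chunks):       return "context"                      # 4. grounding / de-dup
def generate(q, ctx):       return ("answer", ["chunk:1"], 0.8)   # 5. LLM with citations
def log_telemetry(**kw):    pass                                  # 7. telemetry & feedback

def answer_query(raw_query, filters=None):
    q = understand(raw_query)
    top = rerank(q, hybrid_retrieve(q, filters or {}))[:20]
    answer, citations, grounding = generate(q, assemble(top))
    if grounding < 0.5:                        # 6. policy check: block weak grounding
        return {"answer": None, "reason": "low_grounding_score", "citations": []}
    log_telemetry(query=q, citations=citations, grounding=grounding)
    return {"answer": answer, "citations": citations}

print(answer_query("What is hybrid retrieval?"))
```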

Minimal Viable Stack (fast to ship)

  • CMS: Headless (Strapi) + Markdown.

  • Storage: Postgres (metadata + pgvector), S3 (blobs), OpenSearch (text).

  • Pipelines: Dagster + Airbyte; FastAPI microservices; Redis cache.

  • Embeddings/Reranker: open models (e.g., E5/BGE + bge-reranker) or hosted equivalents.

  • RAG Service: FastAPI with a thin orchestration layer (LangChain optional).

  • Exports: Parquet on S3 with signed URLs; dataset manifests in Postgres.
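
For the Postgres + pgvector piece of this stack, a sketch of a minimal chunks table and a KNN lookup; the column set, the 768-dimension embedding, and the connection string are assumptions.

```python
# Hedged sketch: minimal chunks table on Postgres + pgvector and a KNN lookup.
# Column set, embedding dimension (768), and connection string are illustrative.
import psycopg

with psycopg.connect("dbname=wiki user=wiki") as conn:
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS chunks (
            chunk_id       text PRIMARY KEY,
            page_id        text NOT NULL,
            page_version   text NOT NULL,
            section_anchor text,
            license        text,
            body           text NOT NULL,
            embedding      vector(768)
        )
    """)
    # nearest-neighbour lookup for an already-computed query embedding
    qvec = "[" + ",".join(["0.0"] * 768) + "]"
    rows = conn.execute(
        "SELECT chunk_id, page_version FROM chunks "
        "ORDER BY embedding <-> %s::vector LIMIT 10",
        (qvec,),
    ).fetchall()
```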

Scale & Enterprise Readiness

  • Sharding & tenants: tenant_id on all tables and indices; per-tenant encryption keys.

  • Zero-downtime reindex: dual-write to a new index behind an alias; flip the alias on completion (see the sketch after this list).

  • Cold→warm tiers: S3 (cold) + vector/search (warm) populated on demand via event workers.

  • Cost controls: adaptive chunking, popularity-based cache, and eviction policies.
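
A sketch of the zero-downtime alias flip with opensearch-py, assuming versioned index names behind a stable alias; index names and host are placeholders.

```python
# Hedged sketch: atomic alias flip once the dual-written index has caught up.
# Index names, alias, and host are illustrative placeholders.
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

client.indices.update_aliases(body={
    "actions": [
        {"remove": {"index": "chunks_v1", "alias": "chunks"}},  # retire the old index
        {"add":    {"index": "chunks_v2", "alias": "chunks"}},  # promote the new index
    ]
})
```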

Schematic (text)

  Sources/Authors → Ingest & Authoring → Normalization & Governance → Storage (S3 + Postgres + vector + graph)
    → Indexing & Embeddings → Serving APIs (Content | Search | RAG | Entities | Export/Licensing)
    → Observability & Evaluation → Finetuning Dataset Factory → licensed datasets (JSONL/Parquet/Delta)

Practical tips that move the needle

  • Chunk IDs everywhere: make them first-class for citations, exports, and fingerprinting.

  • Hybrid first: lexical often beats vectors on proper nouns and rare terms; fuse signals.

  • Entity tables: maintain a clean entity store with aliases/synonyms; improves both search and graph hops.

  • Eval loops: automate weekly retrieval/faithfulness benchmarks; gate model/index upgrades on scores.

  • Content ops: enforce editorial checklists (title quality, abstracts, tags) to boost retrieval quality before ML.

If you want, I can tailor this to your exact stack (cloud, team size, budget) and draft the initial Postgres schema (pages, chunks, entities, licenses, usage) plus the RAG API contracts.