A schema is a contract. It says: every piece of knowledge stored in this system will have these exact fields, in these exact formats, every single time.
Think of it like a hospital form. Every patient record has the same fields: name, date of birth, blood type, allergies, medications. The hospital doesn't let one doctor scrawl "Bob, he's 40ish, allergic to something" while another fills out the structured form. The form IS the schema. It enforces consistency so that any doctor, any nurse, any system that reads the record knows exactly where to find what they need.
Why does this matter for AI knowledge systems?
A sovereign AI stores thousands of pieces of knowledge — principles, tactics, lessons, frameworks, errors, insights. If each one is stored differently — some with a confidence score, some without, some with a source name, some with a hex ID, some with mechanism explanations, some without — you can never reliably search, compare, or transfer knowledge between systems.
Say you extract a principle from a marketing course:
"Lead with outcomes, not mechanisms, when selling to skeptical men."
Stored as NODE_SCHEMA (14 fields):
```
id:               "a7f3b2c1-..."                     # unique identifier, permanent
vector:           [0.23, -0.15, 0.87, ...]           # 1024-dim embedding (meaning as numbers)
text:             "Lead with outcomes, not           # the actual principle in full
                   mechanisms, when selling
                   to skeptical men."
title:            "Outcomes before mechanisms"       # short human-readable label
node_type:        "principle"                        # what kind of knowledge this is
source_id:        "Anatomy of Ads 2.0"               # where it came from (human-readable)
confidence_score: 0.92                               # how validated it is (0.0 to 1.0)
tags:             "cold_traffic,identity,masculine"  # CSV string for filtering
mechanism:        "Skeptical men evaluate outcome    # WHY this works — the causal chain
                   identity before caring about
                   the how-to"
situation:        "Cold traffic ads for              # WHEN to apply this
                   identity-based offers"
when_not:         "Warm retargeting where            # WHEN NOT to apply this
                   credibility is established"
context:          "From module 3 of AoA 2.0"         # additional context
timestamp:        "2026-03-15"                       # when it was stored
version:          1                                  # revision number
```
Every single principle in the system has these same 14 fields. You can search by confidence. You can filter by tags. You can retrieve by situation. You can compare mechanisms. You can track where it came from. The schema makes the knowledge machine-readable, not just human-readable.
Rob's current GHOSTNET system stores knowledge as "holons" — semi-structured blobs where each holon can have different fields depending on what the extraction model decided to include. Some have a mechanism explanation, some don't. Some have confidence scores, some don't. Some have tags, some have categories, some have neither.
This means you can't write a query that says "give me all principles with confidence above 0.85 tagged cold_traffic" — because not every holon has those fields. You'd need to check for field existence first, handle missing values, deal with different naming conventions. Scale that to 16,717 holons across a collective network and it becomes unworkable.
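The difference can be shown in a few lines of plain Python. This is a sketch with hypothetical toy records, not Meridian's actual query code: with an enforced schema, a filter can assume every field exists; with semi-structured holons, every field access needs defensive checks.

```python
# Hypothetical toy records in NODE_SCHEMA shape (vectors omitted for brevity).
nodes = [
    {"title": "Outcomes before mechanisms", "node_type": "principle",
     "confidence_score": 0.92, "tags": "cold_traffic,identity,masculine"},
    {"title": "Open with proof", "node_type": "principle",
     "confidence_score": 0.81, "tags": "cold_traffic,credibility"},
]

def query(nodes, min_conf, tag):
    """Enforced schema: every field is guaranteed present, so filter directly."""
    return [n for n in nodes
            if n["node_type"] == "principle"
            and n["confidence_score"] > min_conf
            and tag in n["tags"].split(",")]

def query_holons(holons, min_conf, tag):
    """Semi-structured blobs: every field needs an existence check and
    naming-drift fallbacks before you can filter at all."""
    out = []
    for h in holons:
        conf = h.get("confidence_score") or h.get("confidence")   # naming drift
        labels = h.get("tags") or h.get("categories") or ""       # or neither
        if conf is not None and conf > min_conf and tag in str(labels):
            out.append(h)
    return out

print([n["title"] for n in query(nodes, 0.85, "cold_traffic")])
# ['Outcomes before mechanisms']
```

The defensive version works until a holon spells a field differently, at which point it silently drops records; the schema-enforced version cannot fail that way.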
NODE_SCHEMA is the primary schema. Every piece of atomic knowledge — principles, tactics, call examples, book excerpts, meta-observations, raw chunks, and call transcripts — is stored with these fields.
| Field | Type | Why It Exists |
|---|---|---|
| id | string | Permanent unique identifier. Survives re-embedding, migration, codex transfer. |
| vector | float[1024] | The embedding — meaning as numbers. This is what makes semantic search work. |
| text | string | The actual knowledge content. Human-readable. Source material for re-embedding. |
| title | string | Short label for display and quick scanning. |
| node_type | string | Classification: principle, tactic, error, call_example, book_excerpt, meta, raw_chunk. Enables type-filtered search. |
| source_id | string | Human-readable origin: "Anatomy of Ads 2.0", "NHB Call 47". Never a hex hash — must be traceable. |
| confidence_score | float | 0.0–1.0. How validated this knowledge is. Principles proven across multiple sources score higher. Enables quality-filtered retrieval. |
| tags | string (CSV) | Comma-separated labels: "cold_traffic,identity,masculine". Enables faceted filtering without complex joins. |
| mechanism | string | WHY this works — the causal chain. Critical for distinguishing principles that sound similar but work differently. |
| situation | string | WHEN to apply this. Context-dependent retrieval: "show me principles for cold traffic to skeptical men." |
| when_not | string | WHEN NOT to apply this. Prevents misapplication. The most undervalued field in any knowledge system. |
| context | string | Additional context, notes, surrounding information from the source. |
| timestamp | string | When the node was created. Enables time-based analysis of knowledge growth. |
| version | int | Revision tracking. When a principle gets refined, version increments. Original preserved. |
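The table above can be sketched as a Python dataclass that enforces the contract at write time. The field names and types come straight from the table; the specific validation rules in `__post_init__` are illustrative assumptions, not Meridian's actual code.

```python
from dataclasses import dataclass

EXPECTED_DIM = 1024  # BGE-M3 embedding dimension

@dataclass
class Node:
    id: str
    vector: list            # float[1024]
    text: str
    title: str
    node_type: str          # principle, tactic, error, call_example, ...
    source_id: str          # human-readable origin, never a hex hash
    confidence_score: float # 0.0-1.0
    tags: str               # CSV string
    mechanism: str          # WHY this works
    situation: str          # WHEN to apply
    when_not: str           # WHEN NOT to apply
    context: str
    timestamp: str
    version: int = 1

    def __post_init__(self):
        # Reject malformed records at ingestion, not at query time.
        if len(self.vector) != EXPECTED_DIM:
            raise ValueError(f"vector must be {EXPECTED_DIM}-dim")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be in [0.0, 1.0]")
        if not self.source_id:
            raise ValueError("source_id must be human-readable and non-empty")
```

Rejecting bad records at the door is what makes the downstream queries (confidence filters, tag facets, situation retrieval) safe to write without existence checks.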
The legacy schema was created earlier in development for os_context, reference_sites, os_site_current, and raw_transcripts. It uses different field names and a simpler structure.
| Field | Type | Difference from NODE_SCHEMA |
|---|---|---|
| id | string | Same concept, same purpose. |
| vector | float[1024] | Same embedding dimension. |
| text | string | Same concept. |
| source | string | Called source instead of source_id. Different name, same idea. |
| topic_tags | list | Called topic_tags (a list) instead of tags (CSV string). Incompatible format. |
| source_count | int | No equivalent in NODE_SCHEMA. Counts how many sources mention this. |
| timestamp | string | Same concept. |
The problem: legacy collections are missing mechanism, situation, when_not, confidence_score, node_type, and version. They can't participate in codex exchange because they lack the fields that make knowledge actionable. A principle without mechanism or situation is just a quote.
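A legacy record can still be mapped onto NODE_SCHEMA mechanically where the fields line up. Here's a minimal sketch of that mapping; the default values chosen for the missing fields (neutral confidence, `raw_chunk` type, empty mechanism) are illustrative assumptions, not project policy.

```python
def migrate_legacy(record: dict) -> dict:
    """Map a legacy record onto NODE_SCHEMA, defaulting what can't be recovered."""
    return {
        "id": record["id"],
        "vector": record["vector"],              # same 1024-dim embedding
        "text": record["text"],
        "title": record["text"][:60],            # no legacy title field
        "node_type": "raw_chunk",                # safest default classification
        "source_id": record["source"],           # source -> source_id rename
        "confidence_score": 0.5,                 # unknown: neutral default
        "tags": ",".join(record.get("topic_tags", [])),  # list -> CSV string
        "mechanism": "",                         # missing: cannot be recovered
        "situation": "",
        "when_not": "",
        "context": f"migrated; source_count={record.get('source_count', 1)}",
        "timestamp": record["timestamp"],
        "version": 1,
    }

legacy = {"id": "x1", "vector": [0.0] * 1024, "text": "Some excerpt",
          "source": "reference_sites", "topic_tags": ["ads", "copy"],
          "source_count": 3, "timestamp": "2025-01-02"}
node = migrate_legacy(legacy)
print(node["tags"])  # "ads,copy"
```

Note what the migration cannot do: mechanism, situation, and when_not stay empty. That's the "a principle without mechanism is just a quote" problem in code form — the structure migrates, the actionable knowledge doesn't.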
The conversations collection stores session history — what Q asked, what the AI answered, what context was used. This is operational memory, not transferable knowledge. Not part of codex exchange.
The evergreen collection stores synthesis output — the trunk/leaf/longform/thread content generated from clusters of related nodes. These are derivative works, not atomic knowledge: created by the synthesis pipeline, not by ingestion. Not part of codex exchange.
Rob's GHOSTNET is a different system entirely. His holons live in LanceDB too, but in the semi-structured, variable-field format described above — which is exactly what the schema standardization has to reconcile.
Computers can't understand meaning. To a computer, the string "the dog sat on the mat" and the string "the canine rested on the rug" are completely different sequences of characters. They share almost no characters in common. A keyword search for "dog" would find the first sentence but miss the second — even though they mean the same thing.
This is the fundamental limitation of all keyword-based search. Google spent two decades trying to work around it with increasingly complex heuristics. It never fully worked.
An embedding model takes a piece of text and converts it into a list of numbers — called a vector — where similar meanings produce similar numbers.
"Lead with outcomes, not mechanisms"
→ [0.234, -0.152, 0.871, 0.043, -0.567, 0.298, ...] (1024 numbers total)

"Show results before explaining how it works"
→ [0.219, -0.141, 0.853, 0.051, -0.549, 0.287, ...] (nearly identical numbers)

"How to change a tire on a Honda Civic"
→ [-0.672, 0.334, -0.118, 0.891, 0.056, -0.443, ...] (completely different numbers)
The first two sentences mean roughly the same thing — lead with the result, not the process. Their vectors are nearly identical. The third sentence is about something completely unrelated. Its vector is wildly different.
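"Nearly identical" and "wildly different" can be measured with cosine similarity — the standard metric for comparing embeddings. This sketch uses the 6-number prefixes shown above (the real vectors are 1024-dim, but the prefixes are enough to show the effect):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction (same meaning),
    0 or below = unrelated or opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

outcomes = [0.234, -0.152, 0.871, 0.043, -0.567, 0.298]   # "Lead with outcomes..."
results  = [0.219, -0.141, 0.853, 0.051, -0.549, 0.287]   # "Show results before..."
tire     = [-0.672, 0.334, -0.118, 0.891, 0.056, -0.443]  # "How to change a tire..."

print(cosine(outcomes, results))  # very close to 1.0: same meaning
print(cosine(outcomes, tire))     # negative: unrelated
```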
The number of numbers in the vector is called the dimension. More dimensions means more nuance in distinguishing meaning.
When you ask your AI "what principles apply to writing cold traffic ads for skeptical men?", this is what happens: the question is converted into a vector by the same embedding model, that vector is compared against every stored vector, and the nodes whose vectors sit closest to the question's vector come back as results.
This is semantic search — search by meaning, not by keywords. The question doesn't need to contain the word "outcomes" to find the principle about leading with outcomes. It just needs to mean something similar.
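The whole retrieval step fits in a few lines. This is a toy sketch: the 3-dim vectors and titles are made up, and in the real system the query vector would come from the embedding model (BGE-M3 in Q's setup), not be hand-written.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy store of (title, vector) pairs. Real vectors are 1024-dim.
store = [
    ("Outcomes before mechanisms", [0.9, 0.1, 0.0]),
    ("Open with proof",            [0.8, 0.3, 0.1]),
    ("Change a tire",              [-0.1, 0.0, 0.95]),
]

def search(query_vec, k=2):
    """Rank every stored node by similarity to the query vector, keep top k."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

# Pretend the question was embedded to this vector by the model:
query_vec = [0.85, 0.2, 0.05]
print(search(query_vec))
# ['Outcomes before mechanisms', 'Open with proof']
```

The tire node never matches, even though no keyword filtering happened anywhere — proximity in vector space did all the work.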
Q's system uses BGE-M3, which produces 1024 numbers per piece of text.
Rob's system uses nomic-embed-text, which produces 768 numbers per piece of text.
These are not compatible. You cannot compare a list of 1024 numbers to a list of 768 numbers. It's like comparing a 3D object to a 2D photograph — they exist in different mathematical spaces. There is no "conversion" between them. The numbers mean fundamentally different things because they were produced by different models with different training.
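The incompatibility isn't a policy choice; the math itself breaks. A dot product over vectors of different lengths is undefined, so any sane implementation must refuse (rather than silustrate silently truncating, which Python's `zip()` would otherwise do). A minimal illustration:

```python
def dot(a, b):
    """Dot product with an explicit dimension guard. Without the guard,
    zip() would silently truncate to the shorter vector and return garbage."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))

bge_vec   = [0.1] * 1024   # BGE-M3 output
nomic_vec = [0.1] * 768    # nomic-embed-text output

try:
    dot(bge_vec, nomic_vec)
except ValueError as e:
    print(e)  # dimension mismatch: 1024 vs 768
```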
Every embedding model that's worth considering for Meridian, with the factors that actually matter:
| Model | Dim | Size | Maker | Quality | CPU Speed | Notes |
|---|---|---|---|---|---|---|
| BAAI/bge-m3 | 1024 | 1.3 GB | Beijing Academy of AI | Very high (top 5 MTEB) | ~0.5s/text | Q's current. Multilingual (100+ languages). Dense + sparse + multi-vector retrieval. Production-proven at 6,797 nodes. |
| nomic-embed-text | 768 | 274 MB | Nomic AI | Good (comparable to ada-002) | ~0.2s/text | Rob's current. Smaller, faster on CPU. English-only. No sparse retrieval. 16,717 holons embedded. |
| BAAI/bge-large-en-v1.5 | 1024 | 1.2 GB | BAAI | High | ~0.4s/text | English-only predecessor to BGE-M3. No multilingual. Strictly worse than M3. |
| all-MiniLM-L6-v2 | 384 | 80 MB | Sentence Transformers | Moderate | ~0.05s/text | Very fast, very small. Not enough nuance for Meridian's knowledge density. Fine for simple FAQ bots. |
| Cohere embed-v3 | 1024 | API only | Cohere | Very high | N/A (cloud) | API-only — breaks sovereignty. Every embedding request goes to Cohere's servers. Not viable. |
| OpenAI text-embedding-3-large | 3072 | API only | OpenAI | Excellent | N/A (cloud) | API-only, closed source, highest dimension. Non-starter for sovereign infrastructure. |
| Snowflake arctic-embed-l | 1024 | 1.1 GB | Snowflake | High | ~0.4s/text | Open source, 1024-dim, worth benchmarking against BGE-M3. Less production validation. |
| Alibaba gte-Qwen2-7B | 3584 | 14 GB | Alibaba | Near-best | Impractical on CPU | Requires dedicated GPU and massive resources. Near state-of-the-art quality. Not practical for client builds unless they have enterprise hardware. |
The viable options for a sovereign, local-first system are BGE-M3 and nomic-embed-text. Everything else is either API-dependent (breaks sovereignty), too small (insufficient quality), or too large (impractical hardware requirements).
The speed difference between BGE-M3 and nomic-embed-text only matters in a specific hardware context. Understanding that context is critical to making the right decision.
Rob runs embeddings through Ollama. Ollama auto-manages GPU allocation. On his Mac (Apple Silicon with unified memory), the embedding model runs on the GPU alongside everything else. It's fast. He never has to think about CPU vs GPU — the system handles it.
Q runs embeddings through sentence-transformers on CPU. Why? Because Q's RTX 2080 (8 GB VRAM) is already occupied by the inference model — Qwen 30B or Hermes 36B. There's no room on the GPU for the embedding model. So embeddings run on CPU, where BGE-M3 at 1.3 GB is noticeably slower than nomic at 274 MB.
The speed concern is Q's hardware limitation, not a universal truth about these models.
| Scenario | CPU Speed Matters? | Details |
|---|---|---|
| Q's desktop (RTX 2080, 8 GB VRAM shared with LLM) | Yes | Embedding competes for resources. BGE-M3 ~0.5s vs nomic ~0.2s per text on CPU. Noticeable during bulk ingestion. |
| Rob's Mac (Apple Silicon, unified memory) | No | GPU handles both LLM and embedding. Both models run fast. Speed difference negligible. |
| Meridian client build ($100K commission, dedicated hardware, 24 GB+ GPU) | No | Plenty of GPU memory. Both models fit alongside any LLM with room to spare. Speed argument disappears entirely. |
| Raspberry Pi / edge deployment (no GPU) | Yes | No GPU at all. nomic wins on CPU speed and smaller memory footprint. 274 MB vs 1.3 GB matters on 4 GB RAM. |
| Bulk ingestion (10,000 nodes, one-time batch) | Somewhat | ~80 min (BGE-M3) vs ~33 min (nomic) on CPU. But this is a one-time job, not a daily operation. |
Q currently runs BGE-M3 through sentence-transformers (Python library, CPU-bound on Q's setup). Rob runs nomic through Ollama (which auto-manages GPU). There's no reason Q couldn't run BGE-M3 through Ollama as well — getting BGE-M3's quality with Ollama's deployment simplicity and GPU management. Best of both worlds.
If Meridian ever offers a stripped-down deployment on minimal hardware — Raspberry Pi, old laptop, $500 mini PC — nomic genuinely wins on size and speed in that context. But that's the outer tier of the offering, not the sovereign commission. The base model should optimize for the primary use case, not the edge case.
Every embedding model decision is reversible in theory. The question is how expensive the reversal is.
| Migration Path | Difficulty | Risk | Details |
|---|---|---|---|
| 1024 → 1024 (same dim, new model) | Low | Low | Cleanest upgrade. Re-embed everything with the new model. Text is preserved, only vectors change. All collections stay the same dimension. Codex packs just need re-embedding. |
| 1024 → 3072 (upgrade dimension) | High | Medium | Re-embed everything + 3x storage per vector + ALL builds across the collective must upgrade together or codex exchange breaks. Coordinated migration across every client. |
| 1024 → 768 (downgrade) | Medium | High | Technically possible but you're deliberately making search worse. Less nuance, less accuracy. Never do this. |
| Mixed (some 1024, some 768) | N/A | Fatal | Vectors are mathematically incomparable. Codex exchange breaks. Collective synthesis breaks. Semantic search returns garbage when comparing across dimensions. This is the one state you must never reach. |
The industry is moving UP, not down. OpenAI's latest embeddings are at 3072 dimensions. Google is moving up. New research consistently pushes toward 1024+. Starting at 768 means almost certainly migrating within 2 years as better 1024-dim models emerge and become the expected standard. Starting at 1024 means you might never need to migrate at all.
The schema stores raw text alongside the vector. This is critical. Re-embedding is always possible because you have the original source material. You're never locked in permanently — you're just choosing how expensive the next migration will be. Starting at 1024 minimizes that future cost. Starting at 768 maximizes it.
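Because raw text travels with every vector, a migration is just a loop: re-embed each node's `text` with the new model and overwrite only its `vector`. A minimal sketch, with `new_model_embed()` as a stand-in for whatever future model replaces the current one:

```python
def new_model_embed(text: str) -> list:
    # Placeholder for a real embedding model (e.g. a future 1024-dim
    # successor to BGE-M3). Returns a fixed-length zero vector here
    # purely to keep the sketch self-contained and runnable.
    return [0.0] * 1024

def reembed(nodes: list) -> list:
    """Text is the source of truth; vectors are always derivable from it."""
    for node in nodes:
        node["vector"] = new_model_embed(node["text"])  # only the vector changes
    return nodes

nodes = [{"id": "a7f3b2c1", "text": "Lead with outcomes, not mechanisms",
          "vector": [0.23] * 1024}]
migrated = reembed(nodes)
print(len(migrated[0]["vector"]))  # 1024
```

Every other field — id, title, tags, mechanism, situation — passes through untouched, which is why migration cost is dominated by embedding time, not data surgery.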
The recommendation is BGE-M3 (1024-dim) for Meridian. Here's the factor-by-factor breakdown:
| Factor | Winner | Details |
|---|---|---|
| Quality | BGE-M3 | Higher MTEB benchmark scores. More accurate semantic search, especially on nuanced, closely related knowledge. |
| Multilingual | BGE-M3 | 100+ languages. Will speaks 5 languages. Clients may have multi-language knowledge bases, international teams, non-English source material. nomic is English-only. |
| Scalability | BGE-M3 | 1024-dim is becoming the industry standard. Starting at 768 means migrating later. Starting at 1024 aligns with where the industry is heading. |
| Production validation | BGE-M3 | 6,797 nodes proven in Q's VOHU MANAH system. Ingestion pipeline, retrieval, synthesis, hardening — all battle-tested. |
| Hardware reality | Irrelevant | At Meridian's price point ($100K+ commissions), clients get proper GPUs. The CPU speed difference doesn't apply. |
| Speed (CPU) | nomic | ~2.5x faster on CPU. Matters on Q's current dev hardware. Does not matter on client hardware or Rob's Mac. |
| Size | nomic | 274 MB vs 1.3 GB. Meaningful on 4 GB devices. Meaningless on machines with 64 GB RAM. |
BGE-M3 wins on every factor that matters at Meridian's scale. nomic wins on two factors that only matter in edge cases Meridian isn't optimizing for.
A side-by-side comparison of the three systems:
| Aspect | Rob (GHOSTNET) | Q (VOHU MANAH) | Meridian Base Model |
|---|---|---|---|
| Embedding model | nomic-embed-text (768-dim) | BGE-M3 (1024-dim) | BGE-M3 (1024-dim) |
| Embedding via | Ollama | sentence-transformers (CPU) | Ollama or sentence-transformers |
| Storage | LanceDB | LanceDB | LanceDB |
| Schema | Semi-structured holons (varies) | NODE_SCHEMA (14 fields, enforced) | NODE_SCHEMA (14 fields, enforced) |
| Collections | Custom (holons, dreams, errors) | 7 node + 4 legacy + 1 conv + 1 evergreen | 7 node + errors + dreams (standard set) |
| Knowledge graph | Custom graph layer | kg_edges table (semantic + part_of) | kg_edges (standardized) |
| Interface | Custom agent swarm | Metatekt agent + Alen-Chan + TAO dashboard | Standardized agent layer + client UI |
Rob's security layer, dream engine, and agent swarm architecture sit above the schema layer. They consume knowledge from NODE_SCHEMA collections but don't define the storage format. These are differentiation features — they make Rob's builds unique. The schema standardization happens beneath them, not instead of them.
TIER 1: NODE_SCHEMA (the protocol — exchangeable)
Every principle, tactic, error, dream_insight.
14 fields, strictly enforced.
BGE-M3 1024-dim embeddings.
Codex packs contain this tier.
Collective emissions synthesize from this tier.
Any Meridian build can receive any codex pack
because the schema + embedding dimension match.
Collections: principles, tactics, call_examples, book_excerpts,
meta, raw_chunks, calls, errors, dreams
—————————————————————————————————
TIER 2: SYSTEM SCHEMAS (internal — never exchanged)
Conversations, evergreen synthesis pages, snapshots,
agent activity logs, operational state.
Each client's system tables are their own business.
No standardization required.
No codex compatibility needed.
—————————————————————————————————
TIER 3: GRAPH SCHEMA (relationship layer — exchangeable)
kg_edges between NODE_SCHEMA nodes.
edge_id, from_id, to_id, rel_type, weight, notes, created_at
Edges ARE part of codex packs.
They are the knowledge structure — the connections
between principles that make a knowledge bank
more than a list of quotes.
—————————————————————————————————
LEGACY: RETIRE
os_context, reference_sites → migrate to NODE_SCHEMA
or mark as system-only (not codex-compatible).
These collections predate the NODE_SCHEMA standard.
They served their purpose. They don't serve Meridian.
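The Tier 3 edge record is small enough to sketch directly. The fields come from the kg_edges line above and the rel_type vocabulary ("semantic", "part_of") from the comparison table; the validation rule and the target node id are illustrative, not taken from the actual codebase.

```python
import time
import uuid

ALLOWED_REL_TYPES = {"semantic", "part_of"}  # from the kg_edges description

def make_edge(from_id: str, to_id: str, rel_type: str,
              weight: float = 1.0, notes: str = "") -> dict:
    """Build one kg_edges record connecting two NODE_SCHEMA nodes."""
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"unknown rel_type: {rel_type}")
    return {
        "edge_id": str(uuid.uuid4()),
        "from_id": from_id,
        "to_id": to_id,
        "rel_type": rel_type,
        "weight": weight,
        "notes": notes,
        "created_at": time.strftime("%Y-%m-%d"),
    }

# Hypothetical: link the outcomes principle to a related node.
edge = make_edge("a7f3b2c1", "b9e2d4f0", "semantic",
                 weight=0.8, notes="both lead with outcomes")
```

Because edges reference nodes only by id, a codex pack can ship edges alongside nodes and the receiving build can rebuild the graph without any schema translation.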