A schema is a contract. It says: every piece of knowledge stored in this system will have these exact fields, in these exact formats, every single time.
Think of it like a hospital form. Every patient record has the same fields: name, date of birth, blood type, allergies, medications. The hospital doesn't let one doctor scrawl "Bob, he's 40ish, allergic to something" while another fills out the structured form. The form IS the schema. It enforces consistency so that any doctor, any nurse, any system that reads the record knows exactly where to find what they need.
Why does this matter for AI knowledge systems?
A sovereign AI stores thousands of pieces of knowledge — principles, tactics, lessons, frameworks, errors, insights. If each one is stored differently — some with a confidence score, some without, some with a source name, some with a hex ID, some with mechanism explanations, some without — you can never reliably search, compare, or transfer knowledge between systems.
Say you extract a principle from a marketing course:
"Lead with outcomes, not mechanisms, when selling to skeptical men."
Stored as NODE_SCHEMA (14 fields):
```
id:               "a7f3b2c1-..."                     # unique identifier, permanent
vector:           [0.23, -0.15, 0.87, ...]           # 1024-dim embedding (meaning as numbers)
text:             "Lead with outcomes, not           # the actual principle in full
                   mechanisms, when selling
                   to skeptical men."
title:            "Outcomes before mechanisms"       # short human-readable label
node_type:        "principle"                        # what kind of knowledge this is
source_id:        "Anatomy of Ads 2.0"               # where it came from (human-readable)
confidence_score: 0.92                               # how validated it is (0.0 to 1.0)
tags:             "cold_traffic,identity,masculine"  # CSV string for filtering
mechanism:        "Skeptical men evaluate outcome    # WHY this works — the causal chain
                   identity before caring about
                   the how-to"
situation:        "Cold traffic ads for              # WHEN to apply this
                   identity-based offers"
when_not:         "Warm retargeting where            # WHEN NOT to apply this
                   credibility is established"
context:          "From module 3 of AoA 2.0"         # additional context
timestamp:        "2026-03-15"                       # when it was stored
version:          1                                  # revision number
```
Every single principle in the system has these same 14 fields. You can search by confidence. You can filter by tags. You can retrieve by situation. You can compare mechanisms. You can track where it came from. The schema makes the knowledge machine-readable, not just human-readable.
Rob's current GHOSTNET system stores knowledge as "holons" — semi-structured blobs where each holon can have different fields depending on what the extraction model decided to include. Some have a mechanism explanation, some don't. Some have confidence scores, some don't. Some have tags, some have categories, some have neither.
This means you can't write a query that says "give me all principles with confidence above 0.85 tagged cold_traffic" — because not every holon has those fields. You'd need to check for field existence first, handle missing values, deal with different naming conventions. Scale that to 16,717 holons across a collective network and it becomes unworkable.
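The difference can be shown in a few lines of plain Python. This is a sketch with hypothetical toy records, not Meridian's actual query code: with an enforced schema, a filter can assume every field exists; with semi-structured holons, every field access needs defensive checks.

```python
# Hypothetical toy records in NODE_SCHEMA shape (vectors omitted for brevity).
nodes = [
    {"title": "Outcomes before mechanisms", "node_type": "principle",
     "confidence_score": 0.92, "tags": "cold_traffic,identity,masculine"},
    {"title": "Open with proof", "node_type": "principle",
     "confidence_score": 0.81, "tags": "cold_traffic,credibility"},
]

def query(nodes, min_conf, tag):
    """Enforced schema: every field is guaranteed present, so filter directly."""
    return [n for n in nodes
            if n["node_type"] == "principle"
            and n["confidence_score"] > min_conf
            and tag in n["tags"].split(",")]

def query_holons(holons, min_conf, tag):
    """Semi-structured blobs: every field needs an existence check and
    naming-drift fallbacks before you can filter at all."""
    out = []
    for h in holons:
        conf = h.get("confidence_score") or h.get("confidence")   # naming drift
        labels = h.get("tags") or h.get("categories") or ""       # or neither
        if conf is not None and conf > min_conf and tag in str(labels):
            out.append(h)
    return out

print([n["title"] for n in query(nodes, 0.85, "cold_traffic")])
# ['Outcomes before mechanisms']
```

The defensive version works until a holon spells a field differently, at which point it silently drops records; the schema-enforced version cannot fail that way.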
NODE_SCHEMA is the primary schema. Every piece of atomic knowledge — principles, tactics, call examples, book excerpts, meta-observations, raw chunks, and call transcripts — is stored with these fields.
| Field | Type | Why It Exists |
|---|---|---|
| id | string | Permanent unique identifier. Survives re-embedding, migration, codex transfer. |
| vector | float[1024] | The embedding — meaning as numbers. This is what makes semantic search work. |
| text | string | The actual knowledge content. Human-readable. Source material for re-embedding. |
| title | string | Short label for display and quick scanning. |
| node_type | string | Classification: principle, tactic, error, call_example, book_excerpt, meta, raw_chunk. Enables type-filtered search. |
| source_id | string | Human-readable origin: "Anatomy of Ads 2.0", "NHB Call 47". Never a hex hash — must be traceable. |
| confidence_score | float | 0.0–1.0. How validated this knowledge is. Principles proven across multiple sources score higher. Enables quality-filtered retrieval. |
| tags | string (CSV) | Comma-separated labels: "cold_traffic,identity,masculine". Enables faceted filtering without complex joins. |
| mechanism | string | WHY this works — the causal chain. Critical for distinguishing principles that sound similar but work differently. |
| situation | string | WHEN to apply this. Context-dependent retrieval: "show me principles for cold traffic to skeptical men." |
| when_not | string | WHEN NOT to apply this. Prevents misapplication. The most undervalued field in any knowledge system. |
| context | string | Additional context, notes, surrounding information from the source. |
| timestamp | string | When the node was created. Enables time-based analysis of knowledge growth. |
| version | int | Revision tracking. When a principle gets refined, version increments. Original preserved. |
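The table above can be sketched as a Python dataclass that enforces the contract at write time. The field names and types come straight from the table; the specific validation rules in `__post_init__` are illustrative assumptions, not Meridian's actual code.

```python
from dataclasses import dataclass

EXPECTED_DIM = 1024  # BGE-M3 embedding dimension

@dataclass
class Node:
    id: str
    vector: list            # float[1024]
    text: str
    title: str
    node_type: str          # principle, tactic, error, call_example, ...
    source_id: str          # human-readable origin, never a hex hash
    confidence_score: float # 0.0-1.0
    tags: str               # CSV string
    mechanism: str          # WHY this works
    situation: str          # WHEN to apply
    when_not: str           # WHEN NOT to apply
    context: str
    timestamp: str
    version: int = 1

    def __post_init__(self):
        # Reject malformed records at ingestion, not at query time.
        if len(self.vector) != EXPECTED_DIM:
            raise ValueError(f"vector must be {EXPECTED_DIM}-dim")
        if not 0.0 <= self.confidence_score <= 1.0:
            raise ValueError("confidence_score must be in [0.0, 1.0]")
        if not self.source_id:
            raise ValueError("source_id must be human-readable and non-empty")
```

Rejecting bad records at the door is what makes the downstream queries (confidence filters, tag facets, situation retrieval) safe to write without existence checks.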
The legacy schema was created earlier in development for os_context, reference_sites, os_site_current, and raw_transcripts. It uses different field names and a simpler structure.
| Field | Type | Difference from NODE_SCHEMA |
|---|---|---|
| id | string | Same concept, same purpose. |
| vector | float[1024] | Same embedding dimension. |
| text | string | Same concept. |
| source | string | Called source instead of source_id. Different name, same idea. |
| topic_tags | list | Called topic_tags (a list) instead of tags (CSV string). Incompatible format. |
| source_count | int | No equivalent in NODE_SCHEMA. Counts how many sources mention this. |
| timestamp | string | Same concept. |
The problem: legacy collections are missing mechanism, situation, when_not, confidence_score, node_type, and version. They can't participate in codex exchange because they lack the fields that make knowledge actionable. A principle without mechanism or situation is just a quote.
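A legacy record can still be mapped onto NODE_SCHEMA mechanically where the fields line up. Here's a minimal sketch of that mapping; the default values chosen for the missing fields (neutral confidence, `raw_chunk` type, empty mechanism) are illustrative assumptions, not project policy.

```python
def migrate_legacy(record: dict) -> dict:
    """Map a legacy record onto NODE_SCHEMA, defaulting what can't be recovered."""
    return {
        "id": record["id"],
        "vector": record["vector"],              # same 1024-dim embedding
        "text": record["text"],
        "title": record["text"][:60],            # no legacy title field
        "node_type": "raw_chunk",                # safest default classification
        "source_id": record["source"],           # source -> source_id rename
        "confidence_score": 0.5,                 # unknown: neutral default
        "tags": ",".join(record.get("topic_tags", [])),  # list -> CSV string
        "mechanism": "",                         # missing: cannot be recovered
        "situation": "",
        "when_not": "",
        "context": f"migrated; source_count={record.get('source_count', 1)}",
        "timestamp": record["timestamp"],
        "version": 1,
    }

legacy = {"id": "x1", "vector": [0.0] * 1024, "text": "Some excerpt",
          "source": "reference_sites", "topic_tags": ["ads", "copy"],
          "source_count": 3, "timestamp": "2025-01-02"}
node = migrate_legacy(legacy)
print(node["tags"])  # "ads,copy"
```

Note what the migration cannot do: mechanism, situation, and when_not stay empty. That's the "a principle without mechanism is just a quote" problem in code form — the structure migrates, the actionable knowledge doesn't.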
The conversations collection stores session history — what Q asked, what the AI answered, what context was used. This is operational memory, not transferable knowledge. Not part of codex exchange.
The evergreen collection stores synthesis output — the trunk/leaf/longform/thread content generated from clusters of related nodes. These are derivative works, not atomic knowledge: created by the synthesis pipeline, not by ingestion. Not part of codex exchange.
Rob's GHOSTNET is a different system entirely. His holons live in LanceDB too, but in the semi-structured, variable-field format described above — which is exactly what the schema standardization has to reconcile.
Computers can't understand meaning. To a computer, the string "the dog sat on the mat" and the string "the canine rested on the rug" are completely different sequences of characters. They share almost no characters in common. A keyword search for "dog" would find the first sentence but miss the second — even though they mean the same thing.
This is the fundamental limitation of all keyword-based search. Google spent two decades trying to work around it with increasingly complex heuristics. It never fully worked.
An embedding model takes a piece of text and converts it into a list of numbers — called a vector — where similar meanings produce similar numbers.
"Lead with outcomes, not mechanisms"
→ [0.234, -0.152, 0.871, 0.043, -0.567, 0.298, ...] (1024 numbers total)

"Show results before explaining how it works"
→ [0.219, -0.141, 0.853, 0.051, -0.549, 0.287, ...] (nearly identical numbers)

"How to change a tire on a Honda Civic"
→ [-0.672, 0.334, -0.118, 0.891, 0.056, -0.443, ...] (completely different numbers)
The first two sentences mean roughly the same thing — lead with the result, not the process. Their vectors are nearly identical. The third sentence is about something completely unrelated. Its vector is wildly different.
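"Nearly identical" and "wildly different" can be measured with cosine similarity — the standard metric for comparing embeddings. This sketch uses the 6-number prefixes shown above (the real vectors are 1024-dim, but the prefixes are enough to show the effect):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 = same direction (same meaning),
    0 or below = unrelated or opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

outcomes = [0.234, -0.152, 0.871, 0.043, -0.567, 0.298]   # "Lead with outcomes..."
results  = [0.219, -0.141, 0.853, 0.051, -0.549, 0.287]   # "Show results before..."
tire     = [-0.672, 0.334, -0.118, 0.891, 0.056, -0.443]  # "How to change a tire..."

print(cosine(outcomes, results))  # very close to 1.0: same meaning
print(cosine(outcomes, tire))     # negative: unrelated
```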
The number of numbers in the vector is called the dimension. More dimensions means more nuance in distinguishing meaning.
When you ask your AI "what principles apply to writing cold traffic ads for skeptical men?", this is what happens: the question is converted into a vector by the same embedding model, that vector is compared against every stored vector, and the nodes whose vectors sit closest to the question's vector come back as results.
This is semantic search — search by meaning, not by keywords. The question doesn't need to contain the word "outcomes" to find the principle about leading with outcomes. It just needs to mean something similar.
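The whole retrieval step fits in a few lines. This is a toy sketch: the 3-dim vectors and titles are made up, and in the real system the query vector would come from the embedding model (BGE-M3 in Q's setup), not be hand-written.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy store of (title, vector) pairs. Real vectors are 1024-dim.
store = [
    ("Outcomes before mechanisms", [0.9, 0.1, 0.0]),
    ("Open with proof",            [0.8, 0.3, 0.1]),
    ("Change a tire",              [-0.1, 0.0, 0.95]),
]

def search(query_vec, k=2):
    """Rank every stored node by similarity to the query vector, keep top k."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [title for title, _ in ranked[:k]]

# Pretend the question was embedded to this vector by the model:
query_vec = [0.85, 0.2, 0.05]
print(search(query_vec))
# ['Outcomes before mechanisms', 'Open with proof']
```

The tire node never matches, even though no keyword filtering happened anywhere — proximity in vector space did all the work.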
Q's system uses BGE-M3, which produces 1024 numbers per piece of text.
Rob's system uses nomic-embed-text, which produces 768 numbers per piece of text.
These are not compatible. You cannot compare a list of 1024 numbers to a list of 768 numbers. It's like comparing a 3D object to a 2D photograph — they exist in different mathematical spaces. There is no "conversion" between them. The numbers mean fundamentally different things because they were produced by different models with different training.
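The incompatibility isn't a policy choice; the math itself breaks. A dot product over vectors of different lengths is undefined, so any sane implementation must refuse (rather than silustrate silently truncating, which Python's `zip()` would otherwise do). A minimal illustration:

```python
def dot(a, b):
    """Dot product with an explicit dimension guard. Without the guard,
    zip() would silently truncate to the shorter vector and return garbage."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))

bge_vec   = [0.1] * 1024   # BGE-M3 output
nomic_vec = [0.1] * 768    # nomic-embed-text output

try:
    dot(bge_vec, nomic_vec)
except ValueError as e:
    print(e)  # dimension mismatch: 1024 vs 768
```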
Every embedding model that's worth considering for Meridian, with the factors that actually matter:
| Model | Dim | Size | Maker | Quality | CPU Speed | Notes |
|---|---|---|---|---|---|---|
| BAAI/bge-m3 | 1024 | 1.3 GB | Beijing Academy of AI | Very high (top 5 MTEB) | ~0.5s/text | Q's current. Multilingual (100+ languages). Dense + sparse + multi-vector retrieval. Production-proven at 6,797 nodes. |
| nomic-embed-text | 768 | 274 MB | Nomic AI | Good (comparable to ada-002) | ~0.2s/text | Rob's current. Smaller, faster on CPU. English-only. No sparse retrieval. 16,717 holons embedded. |
| BAAI/bge-large-en-v1.5 | 1024 | 1.2 GB | BAAI | High | ~0.4s/text | English-only predecessor to BGE-M3. No multilingual. Strictly worse than M3. |
| all-MiniLM-L6-v2 | 384 | 80 MB | Sentence Transformers | Moderate | ~0.05s/text | Very fast, very small. Not enough nuance for Meridian's knowledge density. Fine for simple FAQ bots. |
| Cohere embed-v3 | 1024 | API only | Cohere | Very high | N/A (cloud) | API-only — breaks sovereignty. Every embedding request goes to Cohere's servers. Not viable. |
| OpenAI text-embedding-3-large | 3072 | API only | OpenAI | Excellent | N/A (cloud) | API-only, closed source, highest dimension. Non-starter for sovereign infrastructure. |
| Snowflake arctic-embed-l | 1024 | 1.1 GB | Snowflake | High | ~0.4s/text | Open source, 1024-dim, worth benchmarking against BGE-M3. Less production validation. |
| Alibaba gte-Qwen2-7B | 3584 | 14 GB | Alibaba | Near-best | Impractical on CPU | Requires dedicated GPU and massive resources. Near state-of-the-art quality. Not practical for client builds unless they have enterprise hardware. |
The viable options for a sovereign, local-first system are BGE-M3 and nomic-embed-text. Everything else is either API-dependent (breaks sovereignty), too small (insufficient quality), or too large (impractical hardware requirements).
The speed difference between BGE-M3 and nomic-embed-text only matters in a specific hardware context. Understanding that context is critical to making the right decision.
Rob runs embeddings through Ollama. Ollama auto-manages GPU allocation. On his Mac (Apple Silicon with unified memory), the embedding model runs on the GPU alongside everything else. It's fast. He never has to think about CPU vs GPU — the system handles it.
Q runs embeddings through sentence-transformers on CPU. Why? Because Q's RTX 2080 (8 GB VRAM) is already occupied by the inference model — Qwen 30B or Hermes 36B. There's no room on the GPU for the embedding model. So embeddings run on CPU, where BGE-M3 at 1.3 GB is noticeably slower than nomic at 274 MB.
The speed concern is Q's hardware limitation, not a universal truth about these models.
| Scenario | CPU Speed Matters? | Details |
|---|---|---|
| Q's desktop (RTX 2080, 8 GB VRAM shared with LLM) | Yes | Embedding competes for resources. BGE-M3 ~0.5s vs nomic ~0.2s per text on CPU. Noticeable during bulk ingestion. |
| Rob's Mac (Apple Silicon, unified memory) | No | GPU handles both LLM and embedding. Both models run fast. Speed difference negligible. |
| Meridian client build ($100K commission, dedicated hardware, 24 GB+ GPU) | No | Plenty of GPU memory. Both models fit alongside any LLM with room to spare. Speed argument disappears entirely. |
| Raspberry Pi / edge deployment (no GPU) | Yes | No GPU at all. nomic wins on CPU speed and smaller memory footprint. 274 MB vs 1.3 GB matters on 4 GB RAM. |
| Bulk ingestion (10,000 nodes, one-time batch) | Somewhat | ~80 min (BGE-M3) vs ~33 min (nomic) on CPU. But this is a one-time job, not a daily operation. |
Q currently runs BGE-M3 through sentence-transformers (Python library, CPU-bound on Q's setup). Rob runs nomic through Ollama (which auto-manages GPU). There's no reason Q couldn't run BGE-M3 through Ollama as well — getting BGE-M3's quality with Ollama's deployment simplicity and GPU management. Best of both worlds.
If Meridian ever offers a stripped-down deployment on minimal hardware — Raspberry Pi, old laptop, $500 mini PC — nomic genuinely wins on size and speed in that context. But that's the outer tier of the offering, not the sovereign commission. The base model should optimize for the primary use case, not the edge case.
Every embedding model decision is reversible in theory. The question is how expensive the reversal is.
| Migration Path | Difficulty | Risk | Details |
|---|---|---|---|
| 1024 → 1024 (same dim, new model) | Low | Low | Cleanest upgrade. Re-embed everything with the new model. Text is preserved, only vectors change. All collections stay the same dimension. Codex packs just need re-embedding. |
| 1024 → 3072 (upgrade dimension) | High | Medium | Re-embed everything + 3x storage per vector + ALL builds across the collective must upgrade together or codex exchange breaks. Coordinated migration across every client. |
| 1024 → 768 (downgrade) | Medium | High | Technically possible but you're deliberately making search worse. Less nuance, less accuracy. Never do this. |
| Mixed (some 1024, some 768) | N/A | Fatal | Vectors are mathematically incomparable. Codex exchange breaks. Collective synthesis breaks. Semantic search returns garbage when comparing across dimensions. This is the one state you must never reach. |
The industry is moving UP, not down. OpenAI's latest embeddings are at 3072 dimensions. Google is moving up. New research consistently pushes toward 1024+. Starting at 768 means almost certainly migrating within 2 years as better 1024-dim models emerge and become the expected standard. Starting at 1024 means you might never need to migrate at all.
The schema stores raw text alongside the vector. This is critical. Re-embedding is always possible because you have the original source material. You're never locked in permanently — you're just choosing how expensive the next migration will be. Starting at 1024 minimizes that future cost. Starting at 768 maximizes it.
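Because raw text travels with every vector, a migration is just a loop: re-embed each node's `text` with the new model and overwrite only its `vector`. A minimal sketch, with `new_model_embed()` as a stand-in for whatever future model replaces the current one:

```python
def new_model_embed(text: str) -> list:
    # Placeholder for a real embedding model (e.g. a future 1024-dim
    # successor to BGE-M3). Returns a fixed-length zero vector here
    # purely to keep the sketch self-contained and runnable.
    return [0.0] * 1024

def reembed(nodes: list) -> list:
    """Text is the source of truth; vectors are always derivable from it."""
    for node in nodes:
        node["vector"] = new_model_embed(node["text"])  # only the vector changes
    return nodes

nodes = [{"id": "a7f3b2c1", "text": "Lead with outcomes, not mechanisms",
          "vector": [0.23] * 1024}]
migrated = reembed(nodes)
print(len(migrated[0]["vector"]))  # 1024
```

Every other field — id, title, tags, mechanism, situation — passes through untouched, which is why migration cost is dominated by embedding time, not data surgery.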
The recommendation is BGE-M3 (1024-dim) for Meridian. Here's the factor-by-factor breakdown:
| Factor | Winner | Details |
|---|---|---|
| Quality | BGE-M3 | Higher MTEB benchmark scores. More accurate semantic search, especially on nuanced, closely related knowledge. |
| Multilingual | BGE-M3 | 100+ languages. Will speaks 5 languages. Clients may have multi-language knowledge bases, international teams, non-English source material. nomic is English-only. |
| Scalability | BGE-M3 | 1024-dim is becoming the industry standard. Starting at 768 means migrating later. Starting at 1024 aligns with where the industry is heading. |
| Production validation | BGE-M3 | 6,797 nodes proven in Q's VOHU MANAH system. Ingestion pipeline, retrieval, synthesis, hardening — all battle-tested. |
| Hardware reality | Irrelevant | At Meridian's price point ($100K+ commissions), clients get proper GPUs. The CPU speed difference doesn't apply. |
| Speed (CPU) | nomic | ~2.5x faster on CPU. Matters on Q's current dev hardware. Does not matter on client hardware or Rob's Mac. |
| Size | nomic | 274 MB vs 1.3 GB. Meaningful on 4 GB devices. Meaningless on machines with 64 GB RAM. |
BGE-M3 wins on every factor that matters at Meridian's scale. nomic wins on two factors that only matter in edge cases Meridian isn't optimizing for.
A side-by-side comparison of the three systems:
| Aspect | Rob (GHOSTNET) | Q (VOHU MANAH) | Meridian Base Model |
|---|---|---|---|
| Embedding model | nomic-embed-text (768-dim) | BGE-M3 (1024-dim) | BGE-M3 (1024-dim) |
| Embedding via | Ollama | sentence-transformers (CPU) | Ollama or sentence-transformers |
| Storage | LanceDB | LanceDB | LanceDB |
| Schema | Semi-structured holons (varies) | NODE_SCHEMA (14 fields, enforced) | NODE_SCHEMA (14 fields, enforced) |
| Collections | Custom (holons, dreams, errors) | 7 node + 4 legacy + 1 conv + 1 evergreen | 7 node + errors + dreams (standard set) |
| Knowledge graph | Custom graph layer | kg_edges table (semantic + part_of) | kg_edges (standardized) |
| Interface | Custom agent swarm | Metatekt agent + Alen-Chan + TAO dashboard | Standardized agent layer + client UI |
Rob's security layer, dream engine, and agent swarm architecture sit above the schema layer. They consume knowledge from NODE_SCHEMA collections but don't define the storage format. These are differentiation features — they make Rob's builds unique. The schema standardization happens beneath them, not instead of them.
TIER 1: NODE_SCHEMA (the protocol — exchangeable)
Every principle, tactic, error, dream_insight.
14 fields, strictly enforced.
BGE-M3 1024-dim embeddings.
Codex packs contain this tier.
Collective emissions synthesize from this tier.
Any Meridian build can receive any codex pack
because the schema + embedding dimension match.
Collections: principles, tactics, call_examples, book_excerpts,
meta, raw_chunks, calls, errors, dreams
—————————————————————————————————
TIER 2: SYSTEM SCHEMAS (internal — never exchanged)
Conversations, evergreen synthesis pages, snapshots,
agent activity logs, operational state.
Each client's system tables are their own business.
No standardization required.
No codex compatibility needed.
—————————————————————————————————
TIER 3: GRAPH SCHEMA (relationship layer — exchangeable)
kg_edges between NODE_SCHEMA nodes.
edge_id, from_id, to_id, rel_type, weight, notes, created_at
Edges ARE part of codex packs.
They are the knowledge structure — the connections
between principles that make a knowledge bank
more than a list of quotes.
—————————————————————————————————
LEGACY: RETIRE
os_context, reference_sites → migrate to NODE_SCHEMA
or mark as system-only (not codex-compatible).
These collections predate the NODE_SCHEMA standard.
They served their purpose. They don't serve Meridian.
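The Tier 3 edge record is small enough to sketch directly. The fields come from the kg_edges line above and the rel_type vocabulary ("semantic", "part_of") from the comparison table; the validation rule and the target node id are illustrative, not taken from the actual codebase.

```python
import time
import uuid

ALLOWED_REL_TYPES = {"semantic", "part_of"}  # from the kg_edges description

def make_edge(from_id: str, to_id: str, rel_type: str,
              weight: float = 1.0, notes: str = "") -> dict:
    """Build one kg_edges record connecting two NODE_SCHEMA nodes."""
    if rel_type not in ALLOWED_REL_TYPES:
        raise ValueError(f"unknown rel_type: {rel_type}")
    return {
        "edge_id": str(uuid.uuid4()),
        "from_id": from_id,
        "to_id": to_id,
        "rel_type": rel_type,
        "weight": weight,
        "notes": notes,
        "created_at": time.strftime("%Y-%m-%d"),
    }

# Hypothetical: link the outcomes principle to a related node.
edge = make_edge("a7f3b2c1", "b9e2d4f0", "semantic",
                 weight=0.8, notes="both lead with outcomes")
```

Because edges reference nodes only by id, a codex pack can ship edges alongside nodes and the receiving build can rebuild the graph without any schema translation.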