Coframe Core Manual¶
A grammar-layer framework for analytical data, designed for AI-native analytics.
v1.0.
Changes from v0.7.7 (v1.0 publication pass): Trailing version stamp at the document's end corrected (was "draft v0.7.6"). Cross-reference sweep verified all internal §X.Y refs resolve correctly against the Tier-1-renumbered chapter structure (Chapter 1 absorbs old Chapters 1–2; old Chapters 3–12 are now 2–11; Pro-feature deferrals consolidated in §1.5). No content changes. Older changelog entries pruned and preserved in git history.
Preface¶
This is the reference manual for Coframe Core, the open-source edition of Coframe.
Coframe is a framework for grammar-level reasoning over analytical data. It separates the structural facts about how data is organized — the framework's grammar layer — from the analytical interpretations engineers attach to it — the framework's semantic layer. The grammar layer's correctness machinery (integrity conditions, query resolution, cross-schema reasoning) operates on declared structures and verified data. The semantic layer is the engineer's domain: what metrics mean, what conventions a team uses, what story the data tells.
Coframe Core focuses on querying existing analytical content with lower friction to author and adopt. It preserves the grammar-layer thesis in its full structural form while omitting capabilities (custom operators, multi-backend support, Slowly Changing Attributes, configurable strictness, persistent re-ingestion, recursive hierarchies, the generalized functional grammar layer) that Coframe Pro provides on top of Core's foundation.
This manual specifies Coframe Core in detail: every primitive, every integrity condition, every Frame-QL construct, every operator, and every concern an engineer building or consuming Coframe Core ACs needs to know. It is the reference specification — comprehensive within Coframe Core's scope. Coframe Pro-only capabilities are referenced but not specified; engineers interested in those should consult the Coframe Pro Manual.
For the relationship between Coframe Core and Coframe Pro, see §1.5.
How to read this manual¶
Audience. This manual is for:

- engineers building ACs (data engineers, analytics engineers, architects authoring schema.init and iterating on the DQ feedback loop);
- engineers consuming ACs (analysts, data scientists, BI developers writing Frame-QL queries);
- implementers (developers building backends that implement the data-API protocol, or integrating Coframe Core into their analytical workflow);
- AI/ML engineers (developers building agents that query ACs through the MCP server, or building agent-mediated analytics infrastructure);
- decision-makers evaluating Coframe (technical leaders comparing Coframe Core and Coframe Pro against their analytical-tools landscape).

The manual assumes familiarity with relational data, basic SQL concepts, and analytical workflows. No prior experience with Coframe is assumed.
For practitioners who prefer a narrative introduction before reading the specification, the companion article Coframe: A Grammar-Layer Substrate for AI-Native Analytics covers the framework's thesis, the family vocabulary, and the AI-agent story in less specification-dense prose.
Reading paths. The manual is structured as a reference specification, not a tutorial. For a first-time reader, the recommended path is:
- Chapter 1 (Introduction) — what Coframe Core is, the framework's thesis, where Coframe Core fits in the analytical-tools landscape, and the Coframe Core / Coframe Pro boundary.
- Chapter 2 (Foundations) — the framework's structural vocabulary in a single chapter. This is the chapter that establishes everything else; reading it carefully pays off in subsequent chapters.
- Subsequent chapters by need: ColumnSpec (Chapter 3) for the per-column specification, AC authoring (Chapter 4) for the workflow, Frame-QL (Chapter 8) for query syntax, and so on.
Engineers building ACs typically read Chapters 3–7 in sequence. Engineers querying ACs typically read Chapters 8–9. AI/ML engineers building agent integrations typically read Chapters 8–9 and then Chapter 11. The Operator Catalog (Chapter 10) is a reference; engineers consult it for specific operators rather than reading it sequentially.
Conventions. Italicized definitions like Analytics Collection (AC) introduce framework concepts on first use. Code identifiers are backtick-formatted (column_name, SELECT, revenue, family-root). Section references use chapter-and-section numbers where unambiguous (e.g., "see §2.4"); chapter titles or numbers are used for cross-chapter references (e.g., "see Chapter 3" or "see ColumnSpec §3.7"). Family-name and family-root are hyphenated; siblings and cousins are not. Operations between metrics use the notation m_pred --op--> m, with m_pred = (name_pred, E_pred) for the predecessor and m = (name, E) for the successor.
Naming conventions. Coframe in prose names the framework as a whole — the grammar-layer thesis, the structural commitments, the verification regime — independent of edition. Coframe Core names the open-source edition specified by this manual; the corresponding package name in code voice is coframe-core (lowercase, hyphenated, per Python conventions). Coframe Pro names the commercial edition (specified in the Coframe Pro Manual). Lowercase coframe appears in code/identifier contexts (Python package names, file paths, configuration keys) and as a generic noun where edition is irrelevant.
Running example. This manual uses a retail analytics example throughout. The retail AC has:
- Schemas: customers, stores, transactions, store_monthly_summary.
- AC-dimensions: customer, store, region, country, product, category, date, week, month, quarter, year, transaction.
- AC-attributes: customer_name, customer_segment, store_name, store_address, product_name, product_description.
- AC-metrics: revenue (rooted at transaction grain, with siblings at store-month grain via SUM-aggregation), units_sold (similar), peak_revenue (rooted at MAX-derivation from revenue at finer grain).
The example illustrates structural concepts; specific data is illustrative.
Table of Contents¶
Part I: Overview
- Chapter 1: Introduction

Part II: Foundations
- Chapter 2: Foundations

Part III: AC and Authoring
- Chapter 3: ColumnSpec and Naming Machinery
- Chapter 4: AC Authoring Workflow
- Chapter 5: schema.init Format
- Chapter 6: Data-API Protocol
- Chapter 7: Data Quality and Structural Verification

Part IV: Query
- Chapter 8: Frame-QL
- Chapter 9: Query Resolution

Part V: Reference
- Chapter 10: Operator Catalog

Part VI: MCP
- Chapter 11: The MCP Server

Appendices
- Appendix A: BNF Grammar for Frame-QL
- Appendix B: Glossary
- Appendix C: Worked Example — The Retail AC
- Appendix D: Performance and Scaling Guidance
Part I: Overview¶
Chapter 1: Introduction¶
1.1 What Coframe Core is¶
Coframe Core is a framework that sits between your analytical data and the engineers, BI tools, and AI agents that consume it. It provides:
- Structural rigor: the framework's integrity conditions catch data-quality and modeling issues before they affect analytical results. Errors are surfaced at AC-load time or at query-resolution time, not at result time.
- Cross-schema reasoning: queries automatically draw on multiple schemas without engineers writing JOIN clauses; the four-rule filter handles schema selection per the AC's structural commitments.
- Cross-grain navigation with verified coherence: metrics at one anchoring (e.g., revenue at transaction grain) are queryable at coarser grains (e.g., revenue by region) via the family's identity-preserving reducer; the Multi-Table Invariance theorem guarantees correctness, with cross-schema metric coherence verified per attestable DNA edge by default during DQ.
- Principled missing-value treatment: deterministic handling of MCAR, MAR, and MNAR signatures per a closed operator catalog with explicit annotations on results.
- Declarative query language: Frame-QL expresses queries at the grammar level — referencing AC family-names rather than physical column names — without exposing physical schemas or join logic.
- AI-native query surface: the family vocabulary, the structural-rigor commitment, and the dubious-query mechanism together make Coframe Core a substrate where AI agents can construct queries that are either correct or explicitly disambiguated, with no silent third option.
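The cross-grain coherence commitment above can be illustrated with a small sketch. Everything here is invented for illustration (the row data, the dict shapes, the `rollup` helper); the framework's actual machinery operates over declared ColumnSpecs against a backend, not hand-built dicts. The shape of the check is what matters: revenue at transaction grain, rolled up to (store, month) via the family's SUM ip_reducer, must agree with a pre-aggregated sibling wherever both are defined.

```python
from collections import defaultdict

# Hypothetical transaction-grain facts: (store, month, revenue).
transactions = [
    ("s1", "2024-01", 100.0),
    ("s1", "2024-01", 50.0),
    ("s2", "2024-01", 80.0),
]

# Hypothetical pre-aggregated sibling at (store, month) grain, as a
# store_monthly_summary schema might hold it.
store_monthly_summary = {
    ("s1", "2024-01"): 150.0,
    ("s2", "2024-01"): 80.0,
}

def rollup(rows):
    """Aggregate transaction-grain revenue to (store, month) via SUM,
    the family's identity-preserving reducer in this example."""
    out = defaultdict(float)
    for store, month, revenue in rows:
        out[(store, month)] += revenue
    return dict(out)

# Cross-schema metric coherence: the rolled-up fine-grain metric must
# agree with the coarse-grain sibling wherever both are defined.
assert rollup(transactions) == store_monthly_summary
```

When the assertion fails, the two schemas disagree about the same revenue universe, which is exactly the condition the per-DNA-edge coherence check in DQ is designed to surface.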
Coframe Core is the open-source edition of the broader Coframe framework. It targets teams whose analytical needs fit within its scope: query workloads against existing analytical content, single-backend deployments, deterministic missing-value handling without strictness override, no custom operators.
1.2 The framework's core thesis¶
Coframe rests on a thesis: the structural facts about how analytical data is organized — the grammar layer — can be separated from the analytical interpretations engineers attach to it — the semantic layer. The framework's correctness machinery operates at the grammar layer; the semantic layer is the engineer's domain.
This separation is structural, not stylistic. The grammar layer is what the framework's integrity conditions, FD-DAG, four-rule filter, metric genealogy, and operator catalog reason about — column-level structural commitments expressed through ColumnSpec, DNA, and the family vocabulary. The semantic layer is what the engineer brings: which families correspond to which business concepts, what conventions a team uses, what story the data tells.
The grammar layer thesis does real work. Articulating constraints that have been latent in analytical practice — and giving them precise structural form — is the framework's contribution. Many of Coframe's structural conditions are not Coframe-specific; they are conditions on data and metadata that must hold for any analytical reasoning to be sound. The framework makes them first-class checks with specific diagnostics, rather than hoping engineers catch them in dashboard reviews.
Much of Coframe's structural-verification work is more like a discovery than an invention. The framework names what was already there.
1.2.1 Function-derived structure and cross-grain navigation¶
A specific structural property worth naming. Coframe's FD-DAG and family genealogy admit function-derived edges and metrics as first-class participants alongside data-attested ones. A column declared with op: MONTH_OF and dna: month <- day produces an FD-edge day → month that participates in cross-grain navigation identically to a data-attested FD-edge. A Frame-QL inline expression SUM(revenue) - SUM(cost) produces a derived metric whose family-genealogy relationship is established through the operator catalog. Both are verified by construction — the function's deterministic semantics is the verification — rather than by data attestation.
What this means concretely for query authoring: cross-grain navigation extends to function-derived groupings. A query like SUM(revenue) BY MONTH_OF(day) resolves through the framework's structural reasoning even when no month column is materialized in any schema, because the FD-edge day → month is established by the operator catalog. Similarly, dimension transformations like BUCKET(price, 10) or SUBSTR(product_code, 0, 2) produce groupings that participate in the framework's anchor-reach reasoning. The reasoning surface scales with what the data admits given the operator catalog, not just with what's been pre-materialized as columns.
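A minimal sketch of the function-derived grouping described above, with `month_of` standing in for the catalog's MONTH_OF operator and the row data invented for illustration. The point is that the month grouping exists at query time because the function is deterministic, even though no month column is materialized anywhere.

```python
from collections import defaultdict
from datetime import date

# Hypothetical day-grain revenue rows; no month column is materialized.
rows = [
    (date(2024, 1, 5), 10.0),
    (date(2024, 1, 20), 5.0),
    (date(2024, 2, 1), 7.0),
]

def month_of(d: date) -> str:
    """Deterministic stand-in for the catalog's MONTH_OF operator.
    Its determinism is what establishes the FD-edge day -> month."""
    return f"{d.year:04d}-{d.month:02d}"

# The shape of SUM(revenue) BY MONTH_OF(day): group by the
# function-derived dimension, then reduce with SUM.
by_month = defaultdict(float)
for day, revenue in rows:
    by_month[month_of(day)] += revenue

assert dict(by_month) == {"2024-01": 15.0, "2024-02": 7.0}
```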
This matters for AI-agent consumers. An agent constructing a query against a Coframe AC has access to whatever the data admits given the verified commitments and the operator catalog. The agent's analytical capability is bounded by the data's structural richness and the catalog's expressiveness, not by enumeration of pre-defined query patterns.
For specific function-derived FD-edge mechanics, see §2.8.5 (in Foundations, Chapter 2). For the operator catalog's full scope, see Chapter 10. The full architectural generalization of the data-borne/function-borne duality — admitting user-defined deterministic functions as first-class structural objects under explicit empirical and deductive verification regimes — is Coframe Pro territory (see §1.5).
1.3 Coframe Core in the analytical-tools landscape¶
Coframe Core is one of many tools in the analytical-data landscape. Practitioners evaluating it will reasonably ask: how does this fit alongside dbt, Cube, Looker, MetricFlow, the warehouse itself? Where does it overlap with what I already have? When does it earn its place in my stack?
This section orients Coframe Core relative to existing tools. It's positioning, not advocacy: Coframe Core fits some teams well and others not at all, and being clear about which is more useful than being broadly enthusiastic.
The landscape¶
Most organizations doing analytics today have a stack roughly like this:
- Warehouse (Snowflake, BigQuery, Postgres, DuckDB) holding the data.
- Transformation pipeline (dbt, custom SQL, Spark) producing analytical tables from raw sources.
- Semantic layer or metric definitions (Cube, MetricFlow, LookML, AtScale) defining named metrics.
- BI tool (Tableau, Looker, Power BI, Superset) providing dashboards and ad-hoc query interfaces.
- AI agents (Claude, GPT, internal tools) increasingly being asked to answer analytical questions in natural language, often via text-to-SQL.
Coframe Core sits between the transformation pipeline and the analytical surface. It does not replace any of these. It adds a query layer that didn't exist before — direct querying by analysts and AI agents, against a structurally-governed AC, with cross-schema reach and cross-grain navigation handled by the framework rather than by per-metric configuration.
vs. semantic layers (dbt MetricFlow, Cube, LookML)¶
Both Coframe Core and semantic layers govern queries against multiple tables, exposing a logical metric surface to consumers. The architectural difference: semantic layers encode named metrics with operational logic, attaching each metric to a physical model. Coframe Core's grammar layer encodes structural metadata (family-names, DNA, FD-DAG, ColumnSpecs) without bundling business logic into metric definitions.
The practical difference: semantic-layer metrics are defined per metric per logical model. Coframe Core's structural declarations are defined per column once, with cross-grain navigation, cross-schema substitutability, and integrity checking falling out of the structure rather than requiring per-metric configuration. Cousins are surfaced as dubious queries — same family-name, different family-roots — which semantic layers typically don't catch.
Where they're similar: both let business users (and now agents) query without knowing physical schemas. Where they differ: semantic layers expose curated metric menus through BI tools; Coframe Core exposes a query language directly to analysts and agents, with the family vocabulary as the unit of analytical thought.
vs. BI tools (Tableau, Looker, Power BI)¶
BI tools visualize analytical content; they don't substitute for it. Coframe Core sits behind a BI tool, providing the query layer over the AC. The BI tool reads from Coframe Core (via Frame-QL or a BI-tool connector). The BI tool visualizes, the framework reasons.
vs. text-to-SQL approaches¶
Text-to-SQL hands the LLM a database and asks it to construct queries from natural language. Modern LLMs are dramatically better at this than they were two years ago — accuracy can be high against well-modeled schemas and semantic layers — but the failure modes remain: queries the LLM gets syntactically right but semantically wrong, joins constructed on guesses about cardinality, aggregations that look reasonable but produce silently incorrect results.
Coframe Core's contribution to AI-mediated analytics is structural: queries are constructed against the AC's family vocabulary, with the framework's machinery handling joins, navigation, and aggregation. The agent's role is to express analytical intent (which families, at what grain, with what filters); the framework's role is to verify resolution and execute correctly. Errors are caught at parse or resolution time as explicit diagnostics, not at result time as silently wrong numbers.
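The resolve-or-refuse posture can be sketched in a few lines. The structures here are invented for illustration (the real resolver works over the AC's full genealogy, and the `margin` family with two roots is a hypothetical, not part of the running example); what the sketch shows is the contract: a query either resolves or is refused with a structured diagnostic, never silently answered wrong.

```python
class DubiousQueryError(Exception):
    """Structured refusal: the query is ambiguous, not silently wrong."""
    def __init__(self, family, roots):
        self.family, self.roots = family, roots
        super().__init__(
            f"family {family!r} has multiple family-roots {sorted(roots)}; "
            "disambiguate before resolution"
        )

# family-name -> set of family-roots declared across the AC's schemas.
genealogy = {
    "revenue": {"transaction"},            # siblings: a single root
    "margin": {"transaction", "order"},    # cousins: two distinct roots
}

def resolve(family: str) -> str:
    """Return the unique family-root, or refuse with a diagnostic."""
    roots = genealogy[family]
    if len(roots) > 1:
        raise DubiousQueryError(family, roots)
    return next(iter(roots))

assert resolve("revenue") == "transaction"
try:
    resolve("margin")
except DubiousQueryError as e:
    assert e.roots == {"transaction", "order"}
```

An agent receiving the exception gets the family-name and the competing roots as data, so it can re-ask the user or rewrite the query instead of guessing.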
1.4 What Coframe Core enables¶
For engineers building ACs:

- A clear authoring workflow (schema.init → DQ feedback → refined AC).
- Curatorial authority over what to expose: ACs are deliberate selections of backend columns, named in the AC author's vocabulary, with structural commitments declared per ColumnSpec.
- Structural-verification machinery that catches modeling errors early.
- AI-assisted authoring that proposes ColumnSpec fields, FD-DAG edges, and family genealogy from data inspection.

For engineers querying ACs:

- Declarative queries via Frame-QL, without join logic or physical-schema exposure.
- Cross-schema reach handled by the four-rule filter; engineers don't pick schemas.
- Cross-grain navigation handled automatically per family ip_reducers.
- Principled missing-value treatment per the operator catalog, with annotations on results.

For AI agents consuming AC content:

- Structural metadata exposed through MCP for reasoning at the family / sibling / cousin level.
- Annotations on results documenting treatment, missing fractions, biases.
- Dubious-query mechanism that refuses ambiguous queries with structured diagnostics, eliminating silent-incorrectness as a failure mode.

For implementers:

- A defined data-API protocol backends implement.
- A clear separation between framework and backend.
- Reference implementations (Polars, DuckDB) to build against.
- An MCP server exposing the AC's structural surface to LLM clients.
1.5 Coframe Core and Coframe Pro¶
Coframe ships in two editions:
Coframe Core is the open-source edition specified in this manual. It targets practitioners with bounded analytical needs — single backend, query workloads against existing analytical content, no custom operators, deterministic missing-value handling.
Coframe Pro is the commercial edition. It extends Coframe Core with capabilities that engineers building substantial analytical infrastructure need: custom operators, multi-backend support, Slowly Changing Attributes, configurable strictness, persistent re-ingestion, sensitivity analysis, recursive hierarchies, the generalized functional grammar layer, and richer authoring tooling.
Both editions share the framework's core: the two principles, the AC and AC scope, the (E, M) paired declaration, the column trichotomy, the FD-DAG, the metric genealogy with DNA / family-roots / siblings / cousins, the four-rule filter, MTI, the dubious-query mechanism, the structural-rigor posture. Coframe Core is a strict subset of Coframe Pro's surface area; engineers using Coframe Core can move to Coframe Pro without re-authoring their ACs.
What Coframe Pro adds (overview)¶
The detailed Coframe Pro surface is in the Coframe Pro Manual; this section gives a high-level summary so practitioners can assess whether Coframe Core fits their needs:
- Custom operator registration. Engineers declare operators with their own semantics, partition_invariance properties, identity-preservation flags, and missing-value treatment.
- Slowly Changing Attributes (SCA). Time-varying attribute values modeled as a structural concern via multi-entity anchoring with a slow-time-grain component, rather than handled through ETL flattening.
- Generalized functional grammar layer. The data-borne / function-borne duality lifted from a special case (in Core) to the framework's primary architectural framing, with explicit empirical and deductive verification regimes.
- Recursive hierarchies. First-class support for self-referential FD-edge patterns (employee-manager hierarchies, parent-part bills-of-materials, message-thread reply structures) with recursive query primitives in Frame-QL.
- Cross-AC federation. Queries spanning multiple ACs with explicit reconciliation rules.
- Multi-backend support. Schemas in an AC can source from different engines.
- Configurable strictness. AC-level strictness default with query-level override.
- Sensitivity analysis machinery. Bounded estimates rather than point estimates for analytically-questionable queries.
- Persistent re-ingestion of Frame-QL outputs. Query results become AC schemas for subsequent queries.
- Per-DNA-edge attestation extensions. Federated-edge attestation, attestation-driven sensitivity analysis, incremental attestation.
- Sophisticated AI-assisted authoring. Advanced multi-pass refinement, schema-evolution detection, complex AC-construction workflows.
Choosing between the editions¶
Coframe Core is the right choice when:
- The team's analytical work is primarily querying existing AC content.
- The data lives in a single backend.
- Standard operators (SUM, AVG, MAX, MIN, COUNT, MEDIAN, MODE, etc.) cover the analytical operations needed.
- Default missing-value behavior is acceptable.
- Time-varying attributes — when they exist — are handled via event modeling (events anchored at event-time) rather than as time-varying attribute structure. Equivalently: the team is comfortable modeling "the customer's segment changed in March" as a segment-change event anchored at the change date, not as a segment attribute anchored at (customer, month).
- The team is starting with Coframe and wants to evaluate before commercial commitment.
Coframe Pro is the right choice when:
- The team builds rich ACs through Frame-QL derivation and re-ingestion.
- Data spans multiple backends or multiple ACs.
- Custom operators (HLL sketches, domain-specific statistics) are needed.
- Engineer-controlled missing-value strictness is needed for specific queries.
- Slowly Changing Attributes (SCAs) require structural support — what was the customer's segment in Q3 last year? is a first-class query rather than something requiring event-replay reconstruction.
- Self-referential hierarchies (organizational, BOM, etc.) are central to the analytical workload.
- Sensitivity analysis on questionable queries is operationally valuable.
- The team has established Coframe expertise and wants the full surface.
The upgrade path is additive. Coframe Core ACs are a strict subset of Coframe Pro ACs. Teams starting with Coframe Core and outgrowing it can move to Coframe Pro without re-authoring; the structural commitments made in Coframe Core remain valid in Coframe Pro.
1.6 What this manual covers¶
This manual specifies Coframe Core. It references Coframe Pro only for the upgrade path and for capabilities Coframe Core doesn't provide.
The manual proceeds in five substantive parts:
Part II (Foundations) establishes the framework's vocabulary in a single chapter (Chapter 2): the two principles, the AC and AC scope, the (E, M) paired declaration, the column trichotomy, operations and the predecessor/successor relationship, DNA / family / metric genealogy / structural relations, the FD-DAG, schemas, the structural rules, the integrity conditions, and the framework's overall posture.
Part III (AC and Authoring) specifies the authoring surface and the per-column declaration: the ColumnSpec specification (Chapter 3), the AC authoring workflow (Chapter 4), the schema.init format (Chapter 5), the data-API protocol (Chapter 6), and the data-quality / structural-verification process (Chapter 7).
Part IV (Query) specifies the query language and resolution: Frame-QL grammar and semantics (Chapter 8), and query resolution including the four-rule filter, the Multi-Table Invariance theorem, and the dubious-query mechanism (Chapter 9).
Part V (Reference) is the operator catalog (Chapter 10) with type, partition_invariance, identity-preservation, default naming-function entries, and missing-value treatment per (operator, signature).
Part VI (MCP) specifies the MCP server (Chapter 11) — the framework's interface to LLM clients.
After reading the manual, an engineer should be able to:
- Author a schema.init for their warehouse, choosing what columns to expose, what to name them, and what structural commitments to declare.
- Run DQ and respond to its feedback iteratively.
- Write Frame-QL queries against the resulting AC.
- Understand the framework's diagnostics when something goes wrong.
- Deploy an MCP server exposing the AC to LLM clients and reason about the agent-mediated query patterns the framework supports.
The manual is not a tutorial. For tutorial-style introductions, see the Coframe Core documentation site (coframe.tech) or the practitioner article positioning Coframe in the analytical-tools landscape.
Part II: Foundations¶
Chapter 2: Foundations¶
The principles, primitives, structural rules, and integrity conditions on which Coframe Core's grammar-layer reasoning rests.
2.1 Overview¶
This chapter establishes the foundations of Coframe Core. Subsequent chapters build on the structural commitments introduced here: the ColumnSpec specification (Chapter 3), the AC authoring workflow (Chapter 4), the schema.init format (Chapter 5), the data-API protocol (Chapter 6), the DQ process (Chapter 7), the Frame-QL query language (Chapter 8), query resolution (Chapter 9), and the operator catalog (Chapter 10).
The chapter is organized in eleven sections:
- §2.2 introduces the framework's two principles and the (entity, family, operator) triple — the conceptual foundation of Coframe's grammar layer.
- §2.3 introduces the Analytics Collection (AC), schemas, and ColumnSpec.
- §2.4 introduces the (E, M) paired declaration — the column's observational commitment.
- §2.5 introduces the column trichotomy: AC-dimension, AC-attribute, AC-metric.
- §2.6 specifies operations: how a predecessor metric and a successor metric are linked through an operator, and the well-formedness conditions on the relationship.
- §2.7 introduces DNA, family-name, family-root, ancestry tree, and metric genealogy — the structural representation of operational lineage, including the structural relations (identical, sibling, cousin) that emerge from the genealogy.
- §2.8 introduces the FD-DAG: functional-dependency structure over AC-dimensions.
- §2.9 specifies schemas in detail: schema types, grain, declared scope.
- §2.10 states the structural rules and integrity conditions Coframe Core enforces.
- §2.11 states the framework's overall posture: structural rigor, grammar/semantics separation, name-agnosticism.
The vocabulary used throughout this chapter is specified in the Coframe Vocabulary Spine. Definitions there are precise; this chapter motivates and contextualizes the vocabulary, with examples and integrative discussion.
This chapter uses the retail analytics running example introduced in the manual's preface. The retail AC has schemas including customers, stores, transactions, and store_monthly_summary. AC-dimensions include customer, store, region, country, product, category, date, week, month, quarter, year, transaction. AC-metrics include revenue, units_sold, and various derived quantities. The example illustrates structural concepts; specific data is illustrative.
2.2 Two principles¶
The framework rests on two principles that engineers commit to when authoring an Analytics Collection (AC). The framework's correctness guarantees are conditional on these principles holding.
2.2.1 Principle 1: Column-borne information¶
Every column c in every schema S is a property of entities, with the entities declared via the column's anchoring E(c, S).
Each column observes a value about some specific entities. The entities — the things the column's value depends on — are what the column is anchored to. The anchoring is declared explicitly in the column's E.
Examples:
- A customer_name column is a property of customers; E = {customer}.
- A revenue column in a transactions schema is a property of transactions; E = {transaction}.
- A region_quarterly_total column is a property of (region, quarter) pairs; E = {region, quarter}.
The principle requires that every column's values be explicable in terms of the entities the column is anchored to. A column whose value depends on entities not declared in E violates the principle.
The principle also implies that columns whose values cannot be coherently anchored — values that are arbitrary, computed without reference to entities, or dependent on hidden context — should not appear in Coframe ACs. The framework's reasoning relies on E being complete and accurate.
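One checkable consequence of Principle 1 can be sketched directly (rows, names, and the `anchoring_violations` helper are all invented for illustration; DQ's actual checks run against the backend): if a column is anchored at E, any two rows that agree on every entity in E must agree on the column's value. Disagreement means the value depends on something outside the declared anchoring.

```python
def anchoring_violations(rows, E, column):
    """Map each conflicting E-projection to a pair of disagreeing values."""
    seen, violations = {}, {}
    for row in rows:
        key = tuple(row[e] for e in E)
        value = row[column]
        if key in seen and seen[key] != value:
            violations[key] = (seen[key], value)
        seen.setdefault(key, value)
    return violations

rows = [
    {"customer": "c1", "customer_name": "Ada"},
    {"customer": "c1", "customer_name": "Ada"},      # consistent repeat
    {"customer": "c2", "customer_name": "Grace"},
]
# customer_name is fully determined by its declared anchoring.
assert anchoring_violations(rows, E=["customer"], column="customer_name") == {}

# A row where the same customer carries a different name: the value
# depends on hidden context, so the declared E is incomplete.
bad = rows + [{"customer": "c1", "customer_name": "Alan"}]
assert ("c1",) in anchoring_violations(bad, E=["customer"], column="customer_name")
```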
2.2.2 Principle 2: Same universe of observation¶
All schemas in an AC observe the same universe of entities.
The schemas in a single AC are different views of the same underlying reality. Schemas may observe different aspects (different entity sets, different grains, different scopes), but they observe the same entities. Cross-schema reasoning depends on this.
When a customer dimension appears in multiple schemas, the customer values across schemas refer to the same actual customers. When revenue appears across schemas at different grains, the revenue values are observations of the same revenue universe at different aggregation levels.
The principle does not require every schema to observe every entity. A schema may be degenerate on a dimension d — observing only a subset of d's universe-wide values, declared explicitly. The principle requires that the entity universes be consistent where they overlap.
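A sketch of the overlap requirement, with invented data: where two schemas both observe an entity, a property anchored at that entity must agree across them, while a degenerate schema observing only a subset of the universe is fine.

```python
def overlap_conflicts(schema_a, schema_b):
    """Entities observed by both schemas whose property values differ."""
    return {
        key for key in schema_a.keys() & schema_b.keys()
        if schema_a[key] != schema_b[key]
    }

# customer -> customer_segment, as two schemas observe it.
customers = {"c1": "retail", "c2": "wholesale", "c3": "retail"}
transactions_view = {"c1": "retail", "c3": "retail"}   # degenerate subset: fine

# Consistent where they overlap: the principle holds.
assert overlap_conflicts(customers, transactions_view) == set()

# Same entity, different value across schemas: a universe-consistency
# violation that the framework's integrity conditions would flag.
assert overlap_conflicts(customers, {"c2": "retail"}) == {"c2"}
```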
2.2.3 What the principles enable¶
The two principles together enable the framework's grammar-level reasoning:
- Cross-schema queries (resolving a query over multiple schemas) depend on Principle 2's commitment that schemas observe the same universe.
- Cross-grain navigation (moving from one anchoring to another via aggregation) depends on Principle 1's commitment that every column's values are entity-determined.
- Integrity conditions (consistency checks across schemas) depend on Principle 2's universe-consistency.
Per-column anchoring (Principle 1) gives the framework the structural information it needs to reason about each column's role. Universe-consistency (Principle 2) gives the framework a basis for reasoning across schemas.
The framework's verification machinery (Chapter 7) checks specific consequences of the principles against data. Many consequences are verified directly; the cross-schema metric coherence consequence is verified per attestable DNA edge by default, with engineer opt-out available; a few residual facts (catalog properties, the principles themselves, naming consistency in the declined-naming-function case) remain asserted-not-verified. The boundary is documented in §7.7.
2.2.4 Three primitives: entity, family, operator¶
The two principles introduced above generate a particular conceptual structure. Coframe's grammar-layer reasoning rests on three universal primitives, each managing one orthogonal aspect of structured analytical observation:
- Entity. The key space — what an observation is about. The thing being identified or anchored to. In Coframe, entities are named via E(c, S) declarations on each ColumnSpec; they participate in the FD-DAG; they carry universe-consistency commitments across schemas (Principle 2). AC-dimensions are columns appearing in entity-anchoring (grain) role.
- Family. The value space — what is observed about entities. The conceptual quantity a column carries. In Coframe, families are named via the name field on each ColumnSpec; columns sharing a family-name participate in the metric genealogy (Foundations §2.7); siblings within a family represent the same conceptual quantity at different anchors. AC-metrics and AC-attributes are family members.
- Operator. The operational linkage — how observations transform into other observations. The relationships that compose entities and families into derivation chains. In Coframe, operators are catalog-defined (Chapter 10), participate in DNA records on each ColumnSpec, and carry structural properties (partition_invariant, identity_preserving, type signatures) that govern well-formed transformation.
These three primitives — entity, family, operator — are the conceptual foundation of Coframe's grammar layer. Every structural rule, every integrity condition, every query-resolution decision can be expressed in terms of how entities, families, and operators relate.
The triple is complete in a precise sense: every act of structured analytical observation requires a what-is-this-about (entity), a what-am-I-recording (family), and — when observations compose or derive — a how-are-these-related (operator). Coframe's commitment is that these three primitives, declared and verified through the framework's machinery, are sufficient for grammar-layer reasoning over analytical data. The vocabulary that subsequent sections develop — DNA, family-root, ip_reducer, sibling, cousin, FD-DAG, metric genealogy — emerges from the relationships among these three.
2.3 The Analytics Collection¶
An Analytics Collection (AC) is a Coframe artifact capturing a coherent AC scope (per §2.3.4) over a backend's data.
2.3.1 AC structure¶
An AC consists of:
- A collection of schemas, each binding to backend tables through ColumnSpecs.
- A backend connection. Coframe Core has exactly one backend per AC; multi-backend ACs are Coframe Pro territory.
- AC-level annotations: descriptions, scope declarations, naming function declaration if any, etc.
- Verified integrity status produced by the DQ process.
The AC is the unit of Coframe Core deployment. Engineers author one AC for a specific AC scope; queries are evaluated against an AC.
2.3.2 AC plurality¶
The framework supports multiple ACs over the same backend data. Different ACs may have different AC scopes: one AC for finance reporting, another for operations dashboards, another for marketing analytics. Each AC carries its own scope, commitments, schema declarations, and analytical posture.
This plurality is structurally meaningful. Different teams have different conventions about how to anchor metrics, name dimensions, and structure schemas. An AC encodes one team's analytical perspective on the data; multiple ACs accommodate multiple perspectives without forcing a single canonical view.
2.3.3 AC as semantic closure¶
An AC is not a translation surface from physical data to a "semantic layer" in the conventional sense. It is a complete semantic closure: within the AC, the engineer has committed to a structural perspective — what counts as which dimension, what's anchored where, what cross-schema relationships hold — and the framework reasons over that closure.
Queries against an AC produce results whose meaning is fully determined by the AC's commitments. Two ACs over the same data may produce different results for the "same" query because they encode different structural commitments. This is correct behavior; the framework preserves analytical pluralism by letting different ACs coexist.
2.3.4 The AC scope¶
The AC scope is what the AC author chooses to expose for analytical purpose. The AC scope is composed of three deliberate authoring choices:
- Selection: which columns from the backend the AC includes via ColumnSpec declarations. Backend columns not declared as ColumnSpecs are outside the AC scope.
- Naming: what the included columns are called in the AC's vocabulary. The AC author chooses freely; the framework treats names as opaque labels (per §2.11.3).
- Structural commitments: the per-ColumnSpec declarations of E, M, op, and dna (per Chapter 3), defining how the included columns behave structurally.
Together these three choices constitute the AC scope. The framework operates within the AC scope: queries reach the columns the AC exposes, navigate via the structural commitments declared, and reference columns by the AC author's chosen names.
Backend columns outside the AC scope are not visible to queries against the AC. They are not errors; they are simply outside the AC's analytical surface. A backend table with hundreds of columns may produce an AC exposing only a few — the AC author's selection determines what is in scope.
The same backend data may support multiple ACs with different scopes. A finance AC and a marketing AC over the same transactions table may include different columns, name them differently, and commit to different structural relationships. Each AC's scope is its own; the framework reasons within each independently.
The AC scope is what the AC author commits to. The framework's verification is over the commitments within scope; columns outside scope are outside the framework's reasoning.
2.3.5 Schemas¶
A schema is a structural object within an AC binding to a single backend source (a physical table or materialized view) and declaring ColumnSpecs for each of its analytically-relevant columns.
Schemas have:
- A name — the schema's local label within the AC.
- A source — the backend table or view it binds to.
- A grain — the entity-set anchoring the schema's rows.
- A list of ColumnSpecs — one per analytically-relevant column.
Schema types and detailed specifications appear in §2.9.
2.3.6 ColumnSpec¶
A ColumnSpec is the AC's declaration of a single column in a schema. It is the unit of structural commitment at the column level.
A ColumnSpec is structurally divided into four parts:
- Backend-facing: src_name, data_type — what the backend exposes.
- Entity-facing: E, M — what the column observes and how it can be missing.
- Operator/operation-facing: op, dna — how the column was produced.
- Cross-schema linkage: name — the family-name through which the column participates in the AC's vocabulary.
The four parts have distinct structural roles:
- Backend-facing fields bind the ColumnSpec to physical data.
- Entity-facing fields capture the column's observational content.
- Operator/operation-facing fields capture the column's operational lineage.
- Cross-schema linkage allows the column to be referenced across schemas in the AC's vocabulary.
The framework reasons over each part separately. The four parts are independent declarations that, taken together, constitute the column's structural commitment.
The (E, M) paired declaration is detailed in §2.4. The (op, dna) declaration is detailed in §2.7. The name field's role as a family-identifier is detailed in §2.7.
2.4 The (E, M) paired declaration¶
Every ColumnSpec carries E and M as a paired value-determining declaration. Together they capture the column's structural commitment to how its values arise.
2.4.1 E(c, S): entity-set declaration¶
E(c, S) declares the entities the column's value depends on. Per Principle 1, every column's value is determined by the entities it's anchored to; E names these entities.
For AC-dimensions and AC-attributes, |E| = 1 (per §2.10). For AC-metrics, E may contain one or more elements.
2.4.2 M(c, S): missingness signature¶
M(c, S) declares the column's missingness mechanism: how missing values arise in this column. M is one of three categories:
- MCAR (M = ∅): missingness is independent of any determinant. Missing values are randomly distributed; their occurrence doesn't depend on the column's value or other columns.
- MAR (M ⊊ E ∪ {self}, c ∉ M): missingness depends on observed determinants from E. Missing values' occurrence depends on dimensions in the column's anchoring, but not on the column's own value.
- MNAR (c ∈ M): missingness depends on the column's own value. Missing values' occurrence is correlated with what the value would have been.
The constraint M ⊆ E ∪ {self} reflects the column's structural scope. The column is a property of E (per Principle 1), with self as a special case for MNAR; the missingness mechanism for c can only depend on these.
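The three categories can be sketched as a small pure classifier over a declared (E, M) pair. The set-based representation and the literal "self" token below are illustrative stand-ins, not Coframe's API:

```python
def classify_missingness(E: set, M: set, self_token: str = "self") -> str:
    """Classify a declared missingness signature per the M ⊆ E ∪ {self} scope.

    E is the column's entity-set; M is the declared missingness set.
    The 'self' token stands for the column's own value (the MNAR case).
    """
    allowed = E | {self_token}
    if not M <= allowed:
        raise ValueError("M must satisfy M ⊆ E ∪ {self}")
    if not M:
        return "MCAR"   # M = ∅: missingness independent of any determinant
    if self_token in M:
        return "MNAR"   # c ∈ M: depends on the column's own value
    return "MAR"        # M ⊆ E, c ∉ M: depends on observed determinants

# Illustrative: revenue anchored at {transaction, store}, missing
# depending on which store the row belongs to.
print(classify_missingness({"transaction", "store"}, {"store"}))  # MAR
```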
2.4.3 (E, M) as paired declaration¶
E and M are not separate optional declarations. Together they form the column's value-determining declaration: E determines how values arise (anchoring); M determines when values are absent (missingness mechanism).
A column without M is incomplete in the same way a column without E is incomplete. The framework requires both to be present at AC validation time.
2.4.4 Auto-derivation for grain-role columns¶
For columns where E(c, S) = {c} (grain-role columns), the framework auto-derives:
- M = {c} (signature MNAR; the missingness mechanism is intrinsic to the column).
- Admissibility = forbidden by Principle 1; missing values are integrity violations.
The engineer declares E = {c}; the framework derives M without separate declaration. Missing values appearing in grain-role columns are hard violations regardless of M.
2.4.5 What (E, M) enables¶
The (E, M) paired declaration is used by the framework throughout:
- Operator missing-value treatment (per the operator catalog) derives behavior from (operator, M).
- Cross-schema consistency checks reference E and M to determine equivalent columns.
- Query resolution uses E for anchoring; the operator catalog uses M for treatment derivation.
2.5 Column trichotomy¶
Every column in an AC is classified into exactly one of three categories at AC level. The classification is metadata-derivable from declared E patterns across schemas.
2.5.1 AC-dimension¶
A column c is an AC-dimension iff there exists a schema S in the AC where E(c, S) = {c}.
In words: c is an AC-dimension if it serves as the grain in at least one schema. AC-dimensions are entity-identifiers; their values individuate entities.
Examples: customer_id, store_id, product_id, transaction_id, date. Each is the grain of some schema (e.g., customer_id is the grain of customers).
An AC-dimension may appear in non-grain role in other schemas. customer_id is in grain role in customers (E = {customer_id}) and in non-grain role in transactions (E = {transaction_id}). The AC-level classification depends on grain role somewhere; the per-schema role can vary.
2.5.2 AC-attribute¶
A column c is an AC-attribute iff c is not an AC-dimension and, for every schema S in which c appears, E(c, S) is identical.
In words: c is an AC-attribute if it never serves as grain and its anchoring is constant across schemas where it appears.
Examples: customer_name (always anchored at customer; never the grain), store_address (always anchored at store), product_description (always anchored at product).
2.5.3 AC-metric¶
A column c is an AC-metric iff c is not an AC-dimension and there exist schemas in which E(c, S) varies.
In words: c is an AC-metric if its anchoring varies across schemas. The varying anchoring requires reducer-mediated reasoning to relate values across schemas.
Examples: revenue (anchored at transaction in transactions, at (store, week) in weekly_summary), units_sold (anchored variably), customer_count (anchored at different aggregation levels).
2.5.4 Properties of the trichotomy¶
The trichotomy is exhaustive: every column is exactly one of AC-dimension, AC-attribute, or AC-metric.
The trichotomy is metadata-derivable: the framework computes it from E declarations across all ColumnSpecs without needing data.
The trichotomy has consequences:
- AC-dimensions and AC-attributes have well-defined column-level mapping functions. The data-attested mapping from E-tuple-values to c-values is verifiable; cross-schema consistency of these mappings is a structural integrity condition.
- AC-metrics' cross-schema relationships are reducer-mediated. The framework's reasoning about AC-metrics across schemas operates through operations (§2.6) and the metric genealogy (§2.7).
- The |E| = 1 rule (§2.10) applies to AC-dimensions and AC-attributes, not to AC-metrics.
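Because the trichotomy is metadata-derivable, it can be sketched as a pure function of declared E patterns, with no data access. The dict-of-frozensets representation of per-schema E declarations is a hypothetical stand-in for ColumnSpec metadata:

```python
def classify_column(col: str, e_by_schema: dict) -> str:
    """Classify col as AC-dimension, AC-attribute, or AC-metric from
    its declared E(c, S) across schemas (metadata only, no data)."""
    e_sets = list(e_by_schema.values())
    # AC-dimension: grain role (E = {c}) in at least one schema
    if any(e == frozenset({col}) for e in e_sets):
        return "AC-dimension"
    # AC-attribute: anchoring identical everywhere the column appears
    if len(set(e_sets)) == 1:
        return "AC-attribute"
    # AC-metric: anchoring varies across schemas
    return "AC-metric"

print(classify_column("customer_id", {
    "customers": frozenset({"customer_id"}),
    "transactions": frozenset({"transaction_id"}),
}))  # AC-dimension
print(classify_column("revenue", {
    "transactions": frozenset({"transaction_id"}),
    "weekly_summary": frozenset({"store", "week"}),
}))  # AC-metric
```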
2.6 Operations: predecessor, successor, and their relationship¶
A metric is a pair (name, E) — an identifier (the column's name) and an entity-set anchoring. An operation takes a predecessor metric and produces a successor metric via an operator:
m_pred --op--> m

where m_pred = (name_pred, E_pred) is the predecessor and m = (name, E) is the successor produced by applying op.
Given that m_pred and m are linked by an operation, the natural question is: how are they related in name and in E? The framework's answer to this question — what relationships an operation enforces on input and output, and how the operator's catalog properties constrain those relationships — is what this section specifies.
2.6.1 Operator types and the E-relation¶
Each operator in the catalog has a type. Coframe Core recognizes two operator types:
- Reducer: aggregates over rows, collapsing entities. For a reducer operation, E_pred ⊇ E under FD-DAG navigation. The output anchor must be reachable from the input anchor by collapsing entities through the FD-DAG.
- Function: transforms values row-wise without aggregating. For a function operation, E_pred = E. The input and output share an anchor.
The well-formedness of an operation requires the appropriate E-relation between predecessor and successor. A reducer with E_pred = E (no entities collapsed) is the trivial case — applying a reducer where no aggregation occurs. A reducer with E_pred ⊊ E is structurally malformed; the operation cannot be a reduction. A function with E_pred ≠ E is structurally malformed; functions are anchor-preserving.
Coframe Pro recognizes a third operator type, broadcast, with E_pred ⊆ E (replicating a coarser-grain attribute to a finer-grain anchor). Broadcast is not a Coframe Core operator type; broadcasting in Coframe Core is handled by Frame-QL's Rung 2 mechanism at query time, not as a per-column derivation.
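The E-relation check can be sketched as follows. Plain subset containment stands in for full FD-DAG navigation here (the real resolver would treat, say, month as reachable from date):

```python
def e_relation_ok(op_type: str, e_pred: frozenset, e_succ: frozenset) -> bool:
    """Operator-type-appropriate E-relation for one operation.

    Subset containment approximates FD-DAG navigation for this sketch.
    """
    if op_type == "reducer":
        return e_succ <= e_pred   # entities may only be collapsed
    if op_type == "function":
        return e_succ == e_pred   # functions are anchor-preserving
    raise ValueError(f"unknown Core operator type: {op_type}")

# SUM collapsing transaction-level revenue to store level: well-formed
print(e_relation_ok("reducer", frozenset({"store", "transaction"}),
                    frozenset({"store"})))  # True
# a "function" that changes anchor is structurally malformed
print(e_relation_ok("function", frozenset({"transaction"}),
                    frozenset({"store"})))  # False
```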
2.6.2 Identity-preservation¶
An operator op is identity-preserving for predecessor (name_pred, E_pred) iff applying op produces a successor whose name equals the predecessor's name:

name = name_pred for the operation m_pred --op--> m
For reducers, an operator is identity-preserving for a predecessor iff it equals the predecessor's family ip_reducer (§2.7). For functions, identity-preservation is declared as a flag in the operator-catalog entry per the operator's intrinsic semantics.
Identity-preserving operations preserve name across the operation. Non-identity-preserving operations produce a different name.
2.6.3 The naming relationship¶
For each operation, the relationship between name_pred and name falls into one of two cases:
Case 1: Identity-preservation. If op is identity-preserving for (name_pred, E_pred), then name = name_pred. The successor inherits the predecessor's name.
Case 2: Non-identity-preservation. If op is not identity-preserving for (name_pred, E_pred), then name ≠ name_pred. The successor's name differs from the predecessor's.
Why does this distinction matter? Identity-preservation is precisely the structural property that the predecessor and successor are aggregation-consistent — the successor is the same conceptual quantity as the predecessor, just observed at a different anchor (for reducers) or in a value-equivalent form (for functions). The ip_reducer is the operator that preserves this consistency: applying it to m_pred produces a successor m whose values are coherent with m_pred's values under aggregation. Two metrics linked by an identity-preserving operation are the same metric at different anchorings, which is exactly the structural condition for sharing a family-name (§2.7).
When the operation is not identity-preserving, the successor is observationally distinct from the predecessor. SUM(revenue) at coarser grain remains revenue; MAX(revenue) at the same grain produces something different — a peak, not a sum. The two metrics measure different things; they should not share a name. The framework's structural commitment is: name-sharing reflects aggregation-consistency, name-difference reflects observational distinctness. Names track this structural fact rather than being free annotations.
2.6.4 The well-formedness conditions, summarized¶
For each operation m_pred --op--> m in the AC, two well-formedness conditions hold:
- Operator-type-appropriate E-relation: E_pred ⊇ E for reducers (under FD-DAG), E_pred = E for functions.
- Name-relationship consistency: name = name_pred if op is identity-preserving for (name_pred, E_pred); name ≠ name_pred otherwise.
The framework verifies both conditions at AC validation. Violations are integrity errors that prevent the AC from loading.
The mechanism by which the framework verifies the name-relationship — what specific name a non-identity-preserving operation should produce, how the AC author declares this, and how the framework checks it — is specified in Chapter 3 (ColumnSpec and Naming Machinery). Foundations establishes the structural commitment; Chapter 3 specifies its operational verification.
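The name-relationship condition can be sketched in isolation, assuming the identity-preservation decision (family ip_reducer match for reducers, catalog flag for functions) has already been made:

```python
def name_relation_ok(name_pred: str, name_succ: str,
                     op_is_identity_preserving: bool) -> bool:
    """Name-relationship consistency for one operation m_pred --op--> m:
    identity-preserving ops keep the name; all others must change it."""
    if op_is_identity_preserving:
        return name_succ == name_pred
    return name_succ != name_pred

# SUM as revenue's ip_reducer: the successor stays 'revenue'
print(name_relation_ok("revenue", "revenue", True))        # True
# MAX is not identity-preserving for revenue: the name must differ
print(name_relation_ok("revenue", "peak_revenue", False))  # True
print(name_relation_ok("revenue", "revenue", False))       # False: violation
```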
2.7 DNA, family, and metric genealogy¶
The framework's structural representation of operational lineage operates at three levels: per-column DNA, per-family ancestry, and AC-wide metric genealogy.
2.7.1 DNA¶
The DNA of a column records the column's predecessor metric. DNA is a triple (name_pred, E_pred, op_pred) capturing the predecessor's family-name, anchor, and the operator that produced the predecessor.
Note: op_pred is the operator that produced the DNA's metric, not the operator that produced the current column. Each DNA entry is a snapshot of the predecessor column in the same form the predecessor's own ColumnSpec uses. This makes DNA structurally uniform: walking DNA chains backward, each step records the same information about the column at that point.
For a root column, DNA is self-referential:

dna(c) = (name(c), E(c), op(c))

Walking DNA from a root yields the root itself — a structural fixed point. For a non-root column, DNA points to a strictly different predecessor metric in the AC.
DNA chains terminate. Walking DNA from any column backward through predecessors eventually reaches a column whose DNA is self-referential. That column is a root.
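The termination walk can be sketched with columns represented as (name, E, op) snapshots and DNA as a mapping from each column to its predecessor snapshot. The OBSERVE root operator and the mapping below are invented for illustration:

```python
def walk_to_root(col, dna):
    """Walk DNA backward from col until the self-referential fixed point."""
    seen = set()
    while True:
        pred = dna[col]
        if pred == col:    # self-referential entry: col is a root
            return col
        if col in seen:    # defensive: well-formed DNA chains never cycle
            raise ValueError("DNA chain does not terminate")
        seen.add(col)
        col = pred

# Columns as (name, E, op) snapshots; OBSERVE is an invented root operator.
DNA = {
    ("revenue", ("transaction",), "OBSERVE"):
        ("revenue", ("transaction",), "OBSERVE"),
    ("revenue", ("store", "week"), "SUM"):
        ("revenue", ("transaction",), "OBSERVE"),
}
print(walk_to_root(("revenue", ("store", "week"), "SUM"), DNA))  # the root
```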
2.7.2 Ancestry tree¶
The ancestry tree of a column is the chain of predecessor metrics recoverable by walking DNA from the column backward to a root.
In Coframe Core, ancestry trees are linear for unary-operator columns (one predecessor per step) and branching for multi-input columns (singletons; multiple predecessors at the singleton's DNA — see §2.7.7).
The ancestry tree captures the column's operational history within the AC. It is the structural representation of how the column came to be, traced back to observed roots.
2.7.3 Family-name and family¶
A family-name is a name value appearing in one or more ColumnSpecs in the AC. Two columns with the same name belong to the same family.
The framework determines family membership by string equality on declared names. The name revenue denotes a single family in the AC; columns named revenue are members. Columns named peak_revenue are members of a different family.
The AC's metric columns are partitioned by family-name. Every metric column belongs to exactly one family.
2.7.4 Family-root¶
The family-root of a column is the earliest ancestor in the column's ancestry tree that shares the column's family-name.
To find a column's family-root, walk DNA backward as long as name_pred equals name_self (string equality). The family-root is the last column reached while names matched. If the column's DNA is self-referential (the column is a root), the family-root is the column itself.
The family-root is a derived property. The AC author declares ColumnSpecs (with their name, E, op, and dna); the framework computes the family-root via DNA-walk.
Two columns share a family-root iff their DNA chains, walked through same-named ancestors, terminate at the same column.
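The DNA-walk for family-roots can be sketched with columns simplified to (name, E) pairs, omitting the operator component; the example DNA store is hypothetical:

```python
def family_root(col, dna):
    """Walk DNA backward while the predecessor shares col's family-name;
    return the last same-named column reached (col itself for a root)."""
    name = col[0]
    while True:
        pred = dna[col]
        if pred == col or pred[0] != name:   # root reached, or name changes
            return col
        col = pred

# Columns simplified to (name, E); DNA maps each column to its predecessor.
DNA = {
    ("revenue", ("transaction",)): ("revenue", ("transaction",)),   # root
    ("revenue", ("store", "week")): ("revenue", ("transaction",)),  # SUM sibling
    ("peak_revenue", ("store",)): ("revenue", ("transaction",)),    # MAX: new family
}
print(family_root(("revenue", ("store", "week")), DNA))  # the transaction-level root
print(family_root(("peak_revenue", ("store",)), DNA))    # itself: first of its family
```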
2.7.5 Metric genealogy and the structural relations among columns¶
The AC's metric genealogy is the structure of all metric columns organized by family and by family-root.
The metric genealogy partitions metric columns at two levels:
- Family level: partition by family-name. Each family is the set of all columns with that name.
- Within-family level: partition each family by family-root. Each within-family partition is a set of columns sharing both a family-name and a family-root.
This partition structure gives every pair of columns in the AC a definite structural relation:
- Identical: same (name, E), same family-root. Two identical columns are interchangeable for query purposes; when they appear in different schemas, the framework can serve a query from either, with MTI (Chapter 9) guaranteeing equivalent results.
- Siblings: same name, different E, same family-root. Siblings represent the same conceptual metric observed at different anchors. Cross-anchor navigation between siblings is well-defined under the family's ip_reducer (when the family has one): applying the ip_reducer to a sibling at a finer anchor produces the sibling at a coarser anchor, with values agreeing. The four-rule filter (Chapter 9) selects siblings as substitutable schemas for a query; MTI's domain is precisely the siblings.
- Cousins: same name, different family-root. Cousins share a family-name but are observationally independent — they trace to different roots in the AC's metric genealogy. Cousins are not interchangeable: applying the family's ip_reducer to two cousins at a target anchor produces different results because the underlying observations differ. When a query references a family-name that resolves to multiple cousins, the framework refuses the query as dubious and requires disambiguation (qualified reference, explicit FROM clause, or BY-clause grain anchor — see Chapter 9).
- Different families: different name. Two columns with different names belong to different families and share no structural relation under the framework's grammar-layer reasoning. The AC author may know two differently-named families are conceptually related (e.g., gross_revenue and net_revenue); the framework does not infer this. Any structural relationship between differently-named families must be encoded explicitly via DNA — one family's columns derive from the other's through declared operations.
The metric genealogy is the framework's primary structural object for reasoning about AC-metrics across schemas. The DQ process verifies its well-formedness.
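The four relations can be sketched as a pairwise classifier, given a family-root lookup (here a hypothetical precomputed dict standing in for the DNA-walk of §2.7.4; column names and roots are illustrative):

```python
def relation(c1, c2, root_of):
    """Structural relation between two metric columns; root_of maps a
    (name, E) column to its family-root."""
    (name1, e1), (name2, e2) = c1, c2
    if name1 != name2:
        return "different families"
    if root_of(c1) != root_of(c2):
        return "cousins"      # same name, observationally independent roots
    if e1 == e2:
        return "identical"    # interchangeable; MTI guarantees equivalence
    return "siblings"         # same conceptual metric at different anchors

ROOTS = {  # hypothetical precomputed family-roots
    ("revenue", ("transaction",)): ("revenue", ("transaction",)),
    ("revenue", ("store", "week")): ("revenue", ("transaction",)),
    ("revenue", ("region",)): ("revenue", ("forecast_line",)),  # other root
}
print(relation(("revenue", ("transaction",)),
               ("revenue", ("store", "week")), ROOTS.get))  # siblings
print(relation(("revenue", ("store", "week")),
               ("revenue", ("region",)), ROOTS.get))        # cousins
```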
2.7.6 The ip_reducer and the family¶
The ip_reducer of a family is the operator under which the family's columns are interchangeable across anchors via partition-invariant aggregation.
A family has an ip_reducer iff its family-root's op has partition_invariant: true in the operator catalog (§2.6, Chapter 10). In that case, the ip_reducer is the family-root's op.
A family whose family-root has a non-partition-invariant op (AVG, MEDIAN, COUNT_DISTINCT, etc.) has no ip_reducer. Such a family is anchor-locked: its columns exist at specific anchors but cannot be derived to other anchors via name-preserving aggregation.
The ip_reducer is a property of the family, not of individual columns. Columns within a family share the family's ip_reducer (or share its absence). The ip_reducer is derived; the AC author does not declare it directly.
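The derivation can be sketched against a hypothetical operator-catalog fragment; the partition_invariant flags below are illustrative, not the normative Chapter 10 catalog:

```python
# Hypothetical catalog fragment: operator name -> partition_invariant flag.
CATALOG = {"SUM": True, "COUNT": True, "MIN": True, "MAX": True,
           "AVG": False, "MEDIAN": False, "COUNT_DISTINCT": False}

def ip_reducer(family_root_op: str):
    """A family's ip_reducer is its family-root's op iff that op is
    partition-invariant; otherwise the family is anchor-locked (None)."""
    return family_root_op if CATALOG.get(family_root_op, False) else None

print(ip_reducer("SUM"))  # SUM
print(ip_reducer("AVG"))  # None: anchor-locked family
```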
2.7.7 Multi-input operations and singletons¶
Some operations take multiple input metrics. A column produced by such an operation (e.g., a ratio revenue / units_sold or a binary mapper MAP(DIV, c1, c2)) has DNA representing the multiple predecessors.
In Coframe Core, multi-input operations produce singleton columns, not new families. A singleton column has a name (the AC author's choice), an anchor, an operator that combines the multiple inputs, and DNA recording the input metrics. The singleton is structurally a leaf in the AC's metric genealogy: other columns do not derive from it through DNA; it stands on its own.
The DNA representation for singletons is a tuple of predecessor (name, E, op) snapshots, one per input. The operator for the singleton is the multi-input function.
Singletons are useful for registered ratios, computed columns, and ad-hoc derivations the AC author wants to expose by name. They do not participate in family-genealogy reasoning beyond their own definition. Other columns cannot derive from a singleton through DNA in Coframe Core.
2.7.8 The family-DAG¶
The family-DAG is the AC-wide structure of derivation relationships among families.
Each family has a family-root. If the family-root is a root column (DNA self-referential), the family is primitive — it does not derive from other families.
If the family-root is a non-root column whose DNA points to a column in a different family, the family is derived — its family-root inherits structure from the predecessor family, and an edge in the family-DAG records the derivation.
The family-DAG captures the AC's metric-derivation structure abstracted from anchors. Primitive families are roots of the family-DAG; derived families have predecessors. The family-DAG is acyclic (DNA chains terminate at column-roots; family-roots inherit this property).
2.7.9 What the metric genealogy enables¶
The metric genealogy is the framework's substrate for:
- Query resolution: identifying which schemas contain siblings of a queried column at the right anchor (§2.7.5 and Chapter 9).
- Multi-Table Invariance (MTI): the structural guarantee that siblings produce equivalent results under the family's ip_reducer.
- Dubious-query detection: identifying when a queried family-name resolves to multiple cousins requiring disambiguation.
- AC-author tooling: AI-assisted authoring proposes families and ancestry structure; the metric genealogy is the target structure such tooling produces.
- MCP exposure: the metric genealogy is exposed to LLM clients via the MCP server; LLMs reason about analytical questions in terms of families and ancestry.
The framework's grammar-layer reasoning operates over the metric genealogy. Without it, cross-schema reasoning about AC-metrics would require per-column ad-hoc rules; with it, the framework has a uniform structural representation.
2.8 The FD-DAG¶
The FD-DAG is the framework's structural representation of functional-dependency relationships among AC-dimensions.
2.8.1 What the FD-DAG captures¶
For a pair of AC-dimensions (a, c), an FD-edge a → c declares that a functionally determines c: each a-value maps to at most one c-value.
Examples in the retail running example:
- transaction → store (each transaction occurs at one store).
- store → region (each store is in one region).
- region → country (each region is in one country).
- date → week, date → month, date → quarter, date → year (each date falls in one week, one month, one quarter, one year).
- month → quarter, quarter → year (calendar hierarchy).
FD-edges are directional. a → c does not imply c → a; the reverse may or may not hold (and typically doesn't — a region has multiple stores, but each store has one region).
2.8.2 Candidate FD-DAG vs. data-driven FD-DAG¶
The framework distinguishes:
- The candidate (logical) FD-DAG: edges declared by the engineer in schema.init (or auto-derived from grain-role columns referencing other AC-dimensions). These are claims about the data's structure.
- The data-driven FD-DAG: edges attested by data via DQ Phase 3 verification.
Logical FD-edges declared by the engineer must be attested in the data-driven FD-DAG (the Logical ⊆ Data-driven condition). Logical edges not data-attested are integrity violations.
Data-driven edges not declared logically are not violations; they are advisories for engineer consideration. Some data-driven edges may be incidental or coincidental; the engineer decides whether to formalize them.
2.8.3 FD-DAG acyclicity¶
The candidate FD-DAG is required to be acyclic. Cycles among AC-dimensions are rejected at AC validation.
Acyclicity is a structural rule: a cycle a → b → ... → a would imply a functionally determines itself through other dimensions, which is structurally degenerate. The framework rejects such configurations.
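The acyclicity check is a standard topological sort; a sketch using Kahn's algorithm over the declared edge list (the candidate FD-DAG passes iff every dimension can be ordered):

```python
from collections import defaultdict, deque

def fd_dag_acyclic(edges) -> bool:
    """Kahn's algorithm: acyclic iff a full topological order exists."""
    adj, indeg, nodes = defaultdict(list), defaultdict(int), set()
    for a, c in edges:
        adj[a].append(c)
        indeg[c] += 1
        nodes |= {a, c}
    queue = deque(n for n in nodes if indeg[n] == 0)
    ordered = 0
    while queue:
        n = queue.popleft()
        ordered += 1
        for m in adj[n]:
            indeg[m] -= 1
            if indeg[m] == 0:
                queue.append(m)
    return ordered == len(nodes)   # leftover nodes imply a cycle

print(fd_dag_acyclic([("transaction", "store"), ("store", "region"),
                      ("region", "country")]))  # True
print(fd_dag_acyclic([("month", "quarter"), ("quarter", "month")]))  # False
```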
2.8.4 What the FD-DAG enables¶
The FD-DAG is essential for:
- Cross-grain navigation: when a query asks for a metric at a coarser anchor than where it's observed, the framework navigates the FD-DAG to find a path from the observation anchor to the query anchor, applying the metric's family ip_reducer along the way.
- The four-rule filter's Rule 2 (entity-set capability): a schema's column can serve a query iff the schema's anchor reaches the query's target anchor under FD-DAG navigation.
- Dimension hierarchies in queries: queries can group by any AC-dimension reachable from the data's grain via the FD-DAG.
The FD-DAG operates orthogonally to the metric genealogy. The FD-DAG governs dimension relationships; the metric genealogy governs metric ancestry. Both are structural objects derived from declared ColumnSpecs and verified by DQ.
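Cross-grain navigation reduces to directed reachability over FD-edges; a sketch using the running retail edges (the edge list is taken from the examples above):

```python
from collections import defaultdict, deque

def reachable(start: str, target: str, fd_edges) -> bool:
    """Can the query anchor `target` be reached from the observation
    anchor `start` by following FD-edges? (BFS over the FD-DAG.)"""
    adj = defaultdict(list)
    for a, c in fd_edges:
        adj[a].append(c)
    queue, seen = deque([start]), {start}
    while queue:
        n = queue.popleft()
        if n == target:
            return True
        for m in adj[n]:
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return False

EDGES = [("transaction", "store"), ("store", "region"), ("region", "country"),
         ("date", "month"), ("month", "quarter"), ("quarter", "year")]
print(reachable("transaction", "country", EDGES))  # True: via store, region
print(reachable("region", "store", EDGES))         # False: edges are directional
```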
2.8.5 Function-derived FD-edges¶
FD-edges in Coframe come from two sources, and both populate the FD-DAG identically.
Data-attested FD-edges. The mapping is stored in a referential table or a fact-table column: dim_date contains (day, month) rows; the (store_id, region) mapping appears in dim_store. The framework verifies these during DQ Phase 3 by examining the data — the Logical FD-DAG ⊆ Data-driven FD-DAG attestation (§7.6.3). Every declared FD-edge of this kind must hold against actual data tuples to pass verification. This is the empirical verification regime: the FD-edge is true because the data has been examined and confirms it.
Function-derived FD-edges. The mapping is computed by a deterministic unary function from the operator catalog: month = MONTH_OF(day), quarter = QUARTER_OF(day), price_tier = BUCKET(price, 10), product_prefix = SUBSTR(product_code, 0, 2). A ColumnSpec declared with op and dna referencing such a function produces a derived dimensional column whose FD-edge is established by the function's mathematical determinism: given day, the function MONTH_OF produces a unique month. No data attestation is needed because there's nothing to attest against — the FD-edge is true by construction, given the operator catalog's declaration of the function's determinism. This is the deductive verification regime: the FD-edge is true because the function's semantics make it so, combined with the framework's trust in the data engine to evaluate the function correctly.
Both regimes populate the FD-DAG identically. The four-rule filter doesn't distinguish them when navigating cross-grain queries; the resolved query plan is the same. MTI applies uniformly. The AC's structural commitments are identical regardless of which regime grounds each individual FD-edge. An engineer authoring an AC over a transactions table can choose, per dimension, whether to materialize derivable mappings (storing month as a column, populated during ETL by MONTH_OF(day)) or to compute them on demand (declaring month as a function-derived ColumnSpec with op: MONTH_OF). The AC's analytical contract is unchanged.
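The empirical regime can be sketched as a one-pass check over tuples (rows as dicts are an illustrative stand-in for backend data); the deductive regime needs no such pass:

```python
def fd_edge_attested(rows, a: str, c: str) -> bool:
    """Empirical regime: a declared FD-edge a -> c holds iff every
    a-value maps to at most one c-value in the data."""
    mapping = {}
    for row in rows:
        av, cv = row[a], row[c]
        if av in mapping and mapping[av] != cv:
            return False   # one a-value with two c-values refutes the edge
        mapping[av] = cv
    return True

dim_store = [  # illustrative rows standing in for backend tuples
    {"store_id": 1, "region": "west"},
    {"store_id": 2, "region": "west"},
    {"store_id": 1, "region": "west"},   # duplicate tuples are fine
]
print(fd_edge_attested(dim_store, "store_id", "region"))  # True

# Deductive regime: month = MONTH_OF(day) needs no such pass — a
# deterministic function yields at most one output per input by construction.
```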
This duality has an analogue on the metric side. Metric values can be data-stored (revenue is a column in the transactions table; pre-aggregated siblings live in their own schemas) and verified empirically through per-DNA-edge value attestation (§7.6.8). Metric values can also be function-derived through Frame-QL inline expressions: profit = SUM(revenue) - SUM(cost), unit_price = revenue / quantity, gross_margin_pct = (revenue - cost) / revenue * 100. Function-derived metrics are verified by construction — the operator catalog's declared semantics combined with the engine's correct evaluation produces a verified output without needing data attestation. As with FD-edges, both regimes contribute uniformly to the family genealogy and to the family's structural reasoning.
Verification level computation honors both regimes. A function-derived FD-edge contributes to the AC's structural integrity at Level AA without requiring data attestation: the FD-edge is grounded by construction (operator catalog semantics) rather than by data attestation. The level definitions in §7.13 are formulated in terms of "grounded" structural commitments, where grounding admits both empirical (data-attested) and deductive (verified-by-construction) sources. Both regimes are legitimate; the level reflects what's verified, not which mechanism verified it. §7.13.4 specifies the grounding rules precisely.
Coframe Core uses this duality for catalog-defined operators and for Frame-QL inline expressions. Custom user-defined functions, with their own deterministic semantics declared by AC authors, are Coframe Pro territory — see §1.5's "Generalized functional grammar layer." Pro lifts the duality from a special case to the framework's primary architectural framing, with explicit naming of the empirical and deductive verification regimes throughout the specification, expanded operator-catalog mechanisms for user-defined functions, and richer reporting in the verification status that distinguishes regime-by-regime.
The intellectual point: Coframe's grammar layer is storage-strategy-agnostic for derived structural objects. An engineer can slide along the function-vs-data spectrum based on performance, storage cost, or convenience, and the AC's analytical contract is stable. The framework reasons about the structural commitment; how the commitment is grounded — empirically through data, or deductively through function semantics — is an implementation choice, not a structural one.
2.9 Schemas¶
2.9.1 Virtual tables and physical tables¶
A schema in an AC binds to a virtual table: a logical view over backend data. The virtual table may map directly to a physical table (one-to-one) or to a view, a query, or a more complex backend artifact. The framework consumes the virtual table through the data-API protocol (Chapter 6).
2.9.2 Schema grain¶
The grain of a schema is the entity-set anchoring its rows. Operationally: each row of the schema corresponds to one combination of values in the grain columns.
For the retail example:
- `transactions` schema: grain = `[transaction]`.
- `customers` schema: grain = `[customer]`.
- `stores` schema: grain = `[store]`.
- `store_monthly_summary` schema: grain = `[store, month]`.
Schema grain is determined by which columns have E(c, S) = {c} — the grain-role columns of the schema. The schema's grain is the set of grain-role columns.
2.9.3 Schema-type taxonomy¶
Schemas in an AC are classified into types based on their column composition. The classification is metadata-derivable from declared ColumnSpecs.
- Reference schema: contains only AC-dimensions and AC-attributes for one entity. Example: the `customers` schema, with `customer_id` as grain and customer-anchored attributes.
- Fact schema: contains AC-metrics anchored at one or more grain dimensions. Example: the `transactions` schema, with `transaction_id` as grain and `revenue`, `units_sold` as transaction-anchored metrics.
- Composite-grain fact schema: contains AC-metrics at a composite grain. Example: the `store_monthly_summary` schema, with `[store, month]` as grain and store-month-anchored metrics.
- Other schemas: schemas not fitting cleanly into the above types are classified per the engineer's intent and the structural facts.
The schema-type classification is informational; it doesn't drive different framework behavior. It helps engineers reason about their AC structure.
2.9.4 Declared scope and degeneracy¶
For each AC-dimension d appearing in a schema S, the schema is either:
- Non-degenerate on `d`: the schema is committed to observing all of `d`'s universe-wide values.
- Degenerate on `d`: the schema observes only a declared subset of `d`'s values, with the subset explicitly stated.
Degeneracy declarations allow schemas to scope themselves: a west_region_transactions schema might be declared degenerate on region with the value-set {west}. The framework's coverage analysis (Chapter 7, Chapter 9) honors declared degeneracy.
A schema that is non-degenerate on d but observes only a subset of d's universe-wide values is an integrity violation; either the data is missing values, or the schema should be redeclared as degenerate.
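For instance, the scoping just described might be declared along these lines (a hypothetical schema.init fragment; the field names are illustrative, and Chapter 5 specifies the actual syntax):

```yaml
# Hypothetical sketch: a schema declared degenerate on region
- schema: west_region_transactions
  degenerate_on:
    region: [west]    # the declared value-subset
```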
2.10 Structural rules and integrity conditions¶
The framework enforces structural rules at AC validation. Rules are hard constraints; violations result in rejection at validation.
2.10.1 Per-column rules¶
Rule: |E| = 1 for AC-dimensions and AC-attributes. Per the column trichotomy, AC-dimensions and AC-attributes have |E| = 1. AC-metrics may have larger E. Violation indicates a misclassification or a malformed declaration.
Rule: (E, M) paired declaration. Every ColumnSpec has both E and M declared (with M auto-derived for grain-role columns).
Rule: Operator-type-appropriate E-relation. For each non-root ColumnSpec, the E-relation between the column and its DNA predecessor matches the operator's type: E_pred ⊇ E_self for reducer ops, E_pred = E_self for function ops.
Rule: Naming consistency (when a naming function is declared). For each non-root ColumnSpec, the column's declared name equals the AC's naming function called with the column's DNA predecessor and operator (or name = name_pred if op is identity-preserving for the predecessor).
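The operator-type E-relation rule can be sketched as a predicate (a minimal illustration; the operator-type strings and the list representation of anchors are assumptions, not the framework's internal form):

```python
# Sketch of the operator-type E-relation rule (§2.10.1).
# Operator-type labels and anchor representation are illustrative.
def e_relation_ok(op_type: str, E_pred: list, E_self: list) -> bool:
    """Check the E-relation between a column and its DNA predecessor."""
    if op_type == "reducer":
        # Reducers coarsen the anchor: E_pred ⊇ E_self
        return set(E_pred) >= set(E_self)
    if op_type == "function":
        # Functions preserve the anchor: E_pred = E_self
        return set(E_pred) == set(E_self)
    raise ValueError(f"unknown operator type: {op_type}")
```

For example, SUM-reducing revenue from `[store, week]` down to `[store]` passes the reducer check, while a function op declared across those two anchors would fail.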
2.10.2 Per-schema rules¶
Rule: No-all-dimensions. Each schema must have at least one non-grain-role column. A schema consisting solely of grain-role columns is structurally degenerate.
Rule: Type consistency. Same-named columns within a schema have the same data type. Same-named columns across schemas should have compatible types (per the AC's type-equivalence rules).
Rule: Schema well-formedness. Each column's anchoring E(c, S) is reachable from the schema's grain via the FD-DAG. Columns anchored to entities not reachable from the schema's grain are structurally malformed.
Rule: Same-name uniqueness within schema. Two ColumnSpecs in the same schema do not share a name.
2.10.3 AC-level rules¶
Rule: Candidate FD-DAG acyclicity. The candidate FD-DAG declared in schema.init has no cycles among AC-dimensions.
Rule: Family-root uniqueness within (name, E). Two ColumnSpecs in the AC with the same (name, E) walk DNA to the same family-root. Violation indicates two non-equivalent metrics share a (name, E) identity claim — a structural inconsistency.
2.10.4 Data-attested rules (integrity conditions)¶
These rules cannot be checked structurally; they are checkable only by attestation against data, via the DQ process (Chapter 7).
Logical FD-DAG ⊆ Data-driven FD-DAG. Every logical FD-edge declared by the engineer is data-attested.
Schema scope honoring. Every schema's declared scope (non-degenerate or declared-degenerate) matches its observed value-sets per quasi-metadata.
Grain combo-key uniqueness. For each schema, the grain-role columns' value tuples are unique per row.
Cross-schema value-mapping consistency for AC-dimensions and AC-attributes. When the same (c1-value, c2-value) mapping appears in multiple schemas, the mappings agree.
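The grain combo-key uniqueness condition amounts to a duplicate scan over grain tuples. A minimal sketch (the in-memory row layout is hypothetical; the actual check runs through the data-API):

```python
# Sketch of the grain combo-key uniqueness check (§2.10.4): within a schema,
# each grain-role value tuple must identify exactly one row.
def grain_duplicates(rows, grain_cols):
    """Return the set of grain tuples that appear on more than one row."""
    seen, dups = set(), set()
    for row in rows:
        key = tuple(row[c] for c in grain_cols)
        if key in seen:
            dups.add(key)
        seen.add(key)
    return dups

# store_monthly_summary-style rows with one duplicated grain tuple
rows = [
    {"store": "s1", "month": "2024-01", "revenue": 150},
    {"store": "s1", "month": "2024-01", "revenue": 150},  # violates uniqueness
    {"store": "s2", "month": "2024-01", "revenue": 60},
]
dups = grain_duplicates(rows, ["store", "month"])
```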
2.10.5 Lemmas: facts asserted, with selective verification¶
Some structural facts are asserted by the AC's structure. The framework's posture distinguishes facts asserted-and-verified, asserted-and-not-verified, and asserted-and-verified-by-default-with-opt-out.
The principal cross-schema lemma in Coframe Core:
Cross-schema metric coherence. Across schemas containing siblings of the same family-root, the metric values at common coarsenings agree (i.e., applying the family's ip_reducer to the finer-grained sibling at the common coarsening produces values matching the coarser-grained sibling).
The framework asserts this from Principle 2 plus the ip_reducer's partition-invariance, and verifies it per attestable DNA edge during DQ Phase 3 in the default Coframe Core configuration (Chapter 7 §7.6.8). Verification compares the predecessor's data, aggregated via the family's ip_reducer at the successor's anchor, against the successor's observed values; deltas surface as integrity violations or advisories per the AC's failure-mode setting.
Engineers may opt out of attestation per AC (attestation: enabled: false in the AC catalog). Opted-out ACs treat cross-schema metric coherence as an asserted-not-verified lemma; the choice is visible in the AC's verification status and propagates to query-result annotations and MCP responses.
When the lemma fails (default config: at AC validation time; opted-out: at query-result time, observable only when engineers cross-check): ETL mismatches, stale pre-aggregations, definitional drift between schemas, or coverage gaps not accounted for in declared scope are typical causes. The Multi-Table Invariance theorem (Chapter 9 §9.6) is an unconditional guarantee in attestation-enabled configurations and a conditional guarantee in opted-out configurations.
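Conceptually, the per-edge verification compares a rolled-up fine-grained sibling against the coarse sibling. The following is a hedged sketch under stated assumptions (the in-memory row layout, function names, and use of `sum` as the ip_reducer are illustrative, not the DQ engine's actual mechanism):

```python
# Sketch of the cross-schema metric coherence check (§2.10.5): re-aggregate
# the finer-grained sibling with the family's ip_reducer at the coarser
# sibling's anchor and compare value-by-value.
from collections import defaultdict

def coherence_deltas(fine_rows, coarse_rows, anchor_cols, metric, ip_reducer=sum):
    """Return anchor keys where the re-aggregated fine data disagrees with
    the coarser sibling's stored values."""
    rolled = defaultdict(list)
    for row in fine_rows:
        key = tuple(row[c] for c in anchor_cols)
        rolled[key].append(row[metric])
    deltas = set()
    for row in coarse_rows:
        key = tuple(row[c] for c in anchor_cols)
        if ip_reducer(rolled.get(key, [])) != row[metric]:
            deltas.add(key)
    return deltas

# transactions rolled up against a store_monthly_summary-style sibling
fine = [
    {"store": "s1", "month": "2024-01", "revenue": 100},
    {"store": "s1", "month": "2024-01", "revenue": 50},
    {"store": "s2", "month": "2024-01", "revenue": 70},
]
coarse = [
    {"store": "s1", "month": "2024-01", "revenue": 150},
    {"store": "s2", "month": "2024-01", "revenue": 60},  # stale pre-aggregation
]
violations = coherence_deltas(fine, coarse, ["store", "month"], "revenue")
```

A non-empty result corresponds to the deltas that surface as integrity violations or advisories per the AC's failure-mode setting.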
A second class of facts remains asserted-not-verified in Coframe Core:
- Catalog-declared partition-invariance. The framework trusts the operator catalog's `partition_invariant` flag for each reducer. The flag's correctness is a property of the catalog's design, not of the AC's data; verification is structural, not data-attested.
- The engineer's principle commitments. Principle 1 (column-borne information) and Principle 2 (same universe of observation) are commitments the engineer makes by authoring an AC. The framework verifies many consequences but does not verify the principles themselves; an engineer asserting Principle 2 over schemas observing genuinely different universes is making an unverifiable commitment.
(For attestation operational details — edge selection, failure modes, scope-aware verification, missing-value handling, sampling — see Chapter 7 §7.6.8.)
2.11 The framework's posture¶
Several notes on the framework's overall posture, which guides the rest of the manual.
2.11.1 Structural rigor as binary¶
The framework's correctness commitment is binary: either an AC is principle-honoring (with verifiable integrity conditions) or it's not. The framework rejects ACs and queries that violate principles or integrity conditions; engineers either work within the framework's discipline or step outside it.
There's no "mostly correct" mode, no "permissive" alternative, no engineer-controlled relaxation of structural rules. The framework's integrity conditions are non-negotiable.
This posture is preserved in Coframe Core. (Coframe — the full version — provides a strictness mechanism for engineer-controlled deviation under declared-assumption discipline; Coframe Core does not.)
2.11.2 Grammar and semantics¶
The framework distinguishes the grammar layer (structural facts about how data is organized) from the semantic layer (analytical interpretations engineers attach to it). The framework's correctness machinery operates at the grammar layer; the semantic layer is the engineer's domain.
This separation enables the framework to verify structural correctness without engaging the engineer's domain knowledge. Two ACs over the same data may differ at the semantic layer (different interpretations); the framework verifies structural correctness for each independently.
2.11.3 Names as opaque labels¶
The framework treats column names as opaque string labels. Its only operations on names are:
- Equality comparison: do two declared names match? This determines family membership.
- Naming-function output verification: does the AC's declared naming function (if declared), called with a column's DNA predecessor and operator, produce a string equal to the column's declared name?
The framework does not parse names, decompose them, extract substrings, recognize prefixes or suffixes, or interpret any structural content from name strings. Whatever structure the AC author may encode in a name (e.g., a peak_ prefix indicating MAX-derivation) is the AC author's convention and is verified, if at all, by the AC's declared naming function — not by any framework-level name parsing.
This name-agnostic posture is structural: the framework's reasoning depends on the relationships among columns (family memberships via name equality, ancestry via DNA), not on what the names "say." Names are equality-tokens for the framework, nothing more.
2.11.4 Naming practice as foundation¶
The AC author's naming practice — the choice of names plus the declared naming function (if any) — is the foundational structural commitment from which the AC's metric genealogy emerges.
The AC author chooses freely:
- Names of columns: any string satisfying syntactic requirements for Frame-QL parsing. English, domain-specific terminology, internal codenames, abstract identifiers, names in any language — the framework does not prefer one over another.
- Naming function: adopt the operator catalog's default, override per operator, declare a fully custom function, or decline structured naming altogether.
- Business meaning attached to families: the AC author's domain.
- Analytical scope of the AC: which schemas to include, which families to define, what queries the AC supports.
The framework verifies the AC's declared naming practice for internal consistency. It imposes no naming aesthetic.
This separation has practical consequences:
- ACs in any natural language work.
- ACs with domain-specific naming traditions work.
- ACs with internal codenames or abstract identifiers work.
- ACs migrating from existing systems work — existing column names from a warehouse can be adopted directly; no rename is forced.
2.11.5 Discovery rather than invention¶
Many of the framework's structural conditions are not Coframe-specific. They are conditions on data and metadata that have to hold regardless of what analytical tool is being used.
If a declared FD fails against data, that breaks any reasoning that assumes the dependency. If cross-schema integrity is violated, queries on either schema inherit the inconsistency. If two non-equivalent metrics share a name, any tool relying on that name will produce ambiguous results.
The framework's contribution is to articulate these conditions as first-class checks, with specific diagnostics, in a phase where engineers can address them deliberately. In this sense, much of Coframe's structural-verification work is more like discovery than invention.
The grammar-layer thesis is doing real work — articulating constraints latent in analytical practice and giving them precise structural form.
2.11.6 Adoption posture¶
Coframe Core's design prioritizes adoption. The simplifications relative to Coframe Pro are deliberate: lower friction for engineers new to the framework; smaller surface area to learn; faster path to a working AC.
Engineers who outgrow Coframe Core or need the capabilities it omits upgrade to Coframe Pro. The principles, the foundational vocabulary, and the structural rigor are preserved across versions. Coframe Core's omissions are convenience-oriented (no broadcast operator type, no Slowly Changing Attributes support, no custom operators, no strictness, no multi-backend) rather than rigor-oriented. The framework's correctness guarantees are uniform.
2.12 Where to go next¶
After reading this Foundations chapter, the natural next chapters are:
- Chapter 3: ColumnSpec and Naming Machinery — the field-by-field specification of ColumnSpec plus the naming function machinery.
- Chapter 4: AC Authoring Workflow — how engineers go from a warehouse to a working AC.
- Chapter 5: schema.init Format — the engineer's input artifact.
- Chapter 6: Data-API Protocol — the protocol DQ uses to call backends.
- Chapter 7: Data Quality and Structural Verification — the DQ process specification.
- Chapter 8: Frame-QL — the query language.
- Chapter 9: Query Resolution — how queries resolve against an AC, including the four-rule filter and MTI.
- Chapter 10: Operator Catalog — Coframe Core's operator catalog with type, partition_invariant, identity-preservation, naming function, and missing-value treatment per (operator, signature).
The chapters can be read in this order or in another order, as the reader's needs dictate.
Part III: AC and Authoring¶
Chapter 3: ColumnSpec and Naming Machinery¶
The formal specification of ColumnSpec, the naming function, and the framework's verification of name-vs-operational-lineage consistency.
3.1 Overview¶
This chapter specifies in detail the ColumnSpec: the AC's declaration of a single column. The Foundations chapter (Chapter 2) introduced ColumnSpec at a structural level; this chapter provides the field-by-field specification, the rules governing each field, and the framework's verification of declared values.
The chapter also specifies the naming function: the AC-level declaration that maps (name_pred, E_pred, op) to name_self for non-identity-preserving operations. The naming function is the bridge between the AC author's naming choices and the framework's structural verification.
The chapter is organized in eight sections:
- §3.2 specifies the ColumnSpec's structural division into four parts.
- §3.3 specifies the backend-facing fields: `src_name`, `data_type`.
- §3.4 specifies the entity-facing fields: `E`, `M`.
- §3.5 specifies the operator/operation-facing fields: `op`, `dna`.
- §3.6 specifies the cross-schema linkage field: `name`.
- §3.7 specifies the naming function: declaration, semantics, framework verification.
- §3.8 specifies derived properties the framework computes from ColumnSpec.
- §3.9 specifies the AC-level integrity conditions involving ColumnSpec.
The chapter assumes familiarity with the Foundations chapter and the Coframe Vocabulary Spine.
3.2 ColumnSpec structure¶
A ColumnSpec is the AC's declaration of a single column in a schema. Each ColumnSpec represents one of the AC author's selection choices — the AC author has chosen this backend column for inclusion in the AC scope (per Foundations §2.3.4). Backend columns without ColumnSpec declarations are outside the AC scope and not visible to queries against the AC.
ColumnSpecs are listed within their containing schema's declaration in schema.init (Chapter 5).
A ColumnSpec is structurally divided into four parts:
| Part | Fields | Role |
|---|---|---|
| Backend-facing | `src_name`, `data_type` | Bind to physical data |
| Entity-facing | `E`, `M` | Observational commitment |
| Operator/operation-facing | `op`, `dna` | Operational lineage |
| Cross-schema linkage | `name` | Family identifier |
The four parts have distinct structural roles and are independently declared. The framework reasons over each part separately:
- Backend-facing fields are consumed when the framework binds to data via the data-API.
- Entity-facing fields participate in the column trichotomy, integrity conditions, and operator missing-value treatment.
- Operator/operation-facing fields participate in metric genealogy reasoning and the four-rule filter.
- The cross-schema linkage field participates in family-membership determination.
The four parts together constitute the column's complete structural commitment.
3.2.1 Required vs. derivable¶
Most ColumnSpec fields are declared by the engineer. Some are auto-derivable in specific cases:
- `M` is auto-derived for grain-role columns (`E = {c}`).
- `op` is auto-derived for grain-role columns (set to a designated grain-role operator; see §3.5).
- `dna` is set to self-referential for root columns and may be auto-detected from naming-function consistency in some cases (see §3.5.5).
The framework requires all four parts to be present (declared or derived) at AC validation time. Missing fields are integrity violations.
3.3 Backend-facing fields¶
3.3.1 src_name¶
The physical column name in the backend table the schema binds to. Backend-facing; not used in queries.
The framework consumes src_name to bind the ColumnSpec to backend data via the data-API protocol (Chapter 6). Operations like get_distinct_values, get_pair_mapping, and verify_grain_integrity reference src_name when communicating with the backend.
src_name is a string. The framework places no constraints on its content beyond what the backend itself accepts.
If the AC's name field equals src_name for a column, the AC author may omit src_name and let the framework default it from name. This convenience does not change semantics; the framework treats src_name as the canonical reference for backend binding.
3.3.2 data_type¶
The column's data type as exposed by the backend. The framework recognizes the following data types in Coframe Core:
- `numeric`: integer or floating-point numbers.
- `integer`: integer-valued numbers.
- `string`: character data.
- `boolean`: TRUE/FALSE values.
- `date`: calendar dates.
- `timestamp`: dates with time-of-day.
Backend-specific types (e.g., decimal precision, varchar length, geographic types) are mapped to the framework's recognized types per the backend's data-API. The data-API protocol specifies the mapping (Chapter 6).
data_type participates in:
- Type-consistency rules across same-named columns (§3.9).
- Operator-applicability checks: each operator declares the data types its inputs accept.
- Frame-QL expression type-checking.
The framework rejects ColumnSpecs whose declared data_type is incompatible with the column's op (e.g., SUM on a string column).
3.4 Entity-facing fields¶
3.4.1 E: entity-set anchoring¶
E(c, S) declares the entities the column's value depends on. Per Principle 1 (§2.2), every column's value is determined by the entities it's anchored to; E names these entities.
E is a set of AC-dimensions. The set's cardinality is constrained by the column's trichotomy classification (Foundations §2.5):
- For AC-dimensions and AC-attributes, `|E| = 1`.
- For AC-metrics, `|E| ≥ 1` (one or more elements).
- For grain-role columns, `E = {c}` (the column itself).
E is declared as a list of AC-dimension names (canonically the dimension's name). Examples:
- `E: [customer]` — anchored at customer.
- `E: [transaction]` — anchored at transaction.
- `E: [store, week]` — anchored at (store, week).
The framework verifies that every dimension named in E is itself an AC-dimension (has a ColumnSpec somewhere in the AC where that dimension is in grain role).
3.4.2 M: missingness signature¶
M(c, S) declares the column's missingness mechanism. Per Foundations §2.4.2, M is one of three categories:
- MCAR: `M.signature = "MCAR"`, `M.determinants = []`.
- MAR: `M.signature = "MAR"`, `M.determinants ⊊ E ∪ {self}`, `c ∉ M.determinants`.
- MNAR: `M.signature = "MNAR"`, `c ∈ M.determinants`.
The constraint M.determinants ⊆ E ∪ {self} is a structural rule. Determinants outside this set are integrity violations.
Examples (illustrative):
- MCAR: `signature: MCAR`, `determinants: []` (the value is missing for reasons unrelated to anything declared).
- MAR: for a column with `E = [store, week]`, `signature: MAR`, `determinants: [store]` (missingness driven by the store, not by the column's own value).
- MNAR: `signature: MNAR`, `determinants: [self]` (missingness depends on the column's own value).
3.4.3 (E, M) auto-derivation for grain-role columns¶
For columns where E(c, S) = {c} (grain-role columns), the framework auto-derives:
- `M.signature = MNAR`.
- `M.determinants = [c]`.
- Admissibility: forbidden by Principle 1; missing values are integrity violations.
The engineer declares E = {c}; the framework derives M. If the engineer also declares M, it must match the derived value; mismatches are integrity violations.
3.4.4 What the framework does with (E, M)¶
The (E, M) declaration is consulted by the framework throughout:
- Operator missing-value treatment (Chapter 10): the operator catalog specifies behavior per `(operator, M_eff)`, where `M_eff` is the effective signature for the operation; `M` is the input.
- Cross-schema consistency checks (Chapter 7): same-named columns across schemas are checked for consistent `E` (for AC-dimensions/attributes) and consistent `M` (for the column trichotomy's purposes).
- Query resolution (Chapter 9): the four-rule filter's Rule 2 (entity-set capability) navigates from `E` to the query's target anchor via the FD-DAG.
3.5 Operator/operation-facing fields¶
The op and dna fields together capture the column's operational lineage: how this column was produced, traceable back through DNA chains to a root.
3.5.1 op: the column's operator¶
op is the operator that produced this column. For a non-root column, op is the operator-catalog entry that, applied to the DNA predecessor's metric (name_pred, E_pred), produced this column.
For a root column (DNA self-referential), op is the column's root operator — the operator under which the column is observationally rooted. For roots in families with an ip_reducer (per §2.7.6), op is the family's ip_reducer.
op is declared as a string identifier referencing an operator-catalog entry (Chapter 10). Examples: SUM, MAX, MIN, AVG, COUNT_DISTINCT, MAP_DIV.
For grain-role columns (E = {c}), op is set to the OBSERVED operator (per Chapter 10 §10.6). The framework auto-derives this if not explicitly declared.
3.5.2 dna: the column's predecessor snapshot¶
dna records the column's predecessor metric: a snapshot of the predecessor's (name, E, op) triple.
For a non-root column, dna is:
The predecessor must itself be a ColumnSpec elsewhere in the AC (in any schema, including the same schema as the current column). The framework verifies that a ColumnSpec matching (name_pred, E_pred, op_pred) exists in the AC.
For a root column, dna is self-referential:
The framework recognizes a column as a root when its DNA equals its own (name, E, op).
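Schematically, the two declarations described above mirror the `dna` layout of the worked example in §3.5.3 (angle-bracket placeholders stand for declared values):

```yaml
# Non-root column: dna snapshots the predecessor metric
dna:
  name: <name_pred>
  E: [<E_pred>]
  op: <op_pred>

# Root column: dna repeats the column's own (name, E, op)
dna:
  name: <own name>
  E: [<own E>]
  op: <own op>
```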
3.5.3 The op-dna relationship¶
The column's op is the operator that took the DNA predecessor metric forward to produce this column. The DNA's op is the operator that produced the DNA predecessor itself (one step further back in the chain).
For example, a column representing peak revenue derived by MAX over weekly revenue, where weekly revenue itself is SUM-aggregated from transaction-level revenue, might be declared:
# A column at (region, year): peak weekly revenue, observed in this AC
- column_spec:
name: peak_revenue
E: [region, year]
M:
signature: MCAR
determinants: []
op: MAX
dna:
name: revenue
E: [store, week]
op: SUM
This column's op is MAX (the operator that produced this column from its predecessor). The DNA's op is SUM (the operator that produced the predecessor itself, one step further back in the chain). Walking backward, the predecessor's own DNA points to the family root: revenue @ {transaction} with op = SUM.
3.5.4 Walking DNA¶
To walk DNA from a column, the framework:
1. Reads the column's `dna`: the predecessor's `(name, E, op)` triple.
2. Locates the ColumnSpec in the AC matching that triple. (Multiple schemas may have such a ColumnSpec; the framework treats all of them as equally valid representatives of the predecessor metric.)
3. Reads that ColumnSpec's own `dna`. If it's self-referential, the walk terminates at this root; otherwise, recurse to step 2.
The walk is deterministic given a starting column and the AC's declared ColumnSpecs. It produces an ordered sequence of (name, E, op) triples — the column's ancestry chain — terminating at a root.
The framework caches the result of DNA walks at AC-load time. The cached structure is the metric genealogy.
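The walk can be sketched in a few lines. This is a minimal illustration: storing ColumnSpecs as a dict keyed by (name, E, op) triples is an assumption for the sketch, as is the three-column example chain.

```python
# Sketch of the DNA walk (§3.5.4). A root's dna equals its own triple, so the
# walk terminates when the snapshot is self-referential.
def walk_dna(specs, start):
    """Return the ancestry from `start` back to its family root.

    `specs` maps each column's (name, E, op) triple to its declared DNA
    snapshot.
    """
    chain = [start]
    current = start
    while True:
        pred = specs[current]      # read the column's dna snapshot
        if pred == current:        # self-referential: root reached
            return chain
        chain.append(pred)         # locate the predecessor and continue
        current = pred

# A hypothetical three-column chain: transaction revenue, weekly revenue,
# and a MAX-derived peak at (region, year).
specs = {
    ("revenue", ("transaction",), "SUM"): ("revenue", ("transaction",), "SUM"),
    ("revenue", ("store", "week"), "SUM"): ("revenue", ("transaction",), "SUM"),
    ("peak_revenue", ("region", "year"), "MAX"): ("revenue", ("store", "week"), "SUM"),
}
ancestry = walk_dna(specs, ("peak_revenue", ("region", "year"), "MAX"))
```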
3.5.5 Auto-detection of DNA for derived columns¶
In some cases, the AC author may declare a column without explicit DNA, and the framework can infer DNA from the column's other fields plus the AC's naming function.
Specifically: if the AC declares a naming function and the column's name and op together uniquely identify a possible predecessor (e.g., name = peak_revenue, op = MAX, and the naming function says f_MAX(revenue, *) produces peak_revenue), the framework can propose dna = (revenue, ?, ?) for the column and ask the engineer to confirm or specify the predecessor's anchor and operator.
Auto-detection is an AC-authoring convenience; the AC validation requires explicit DNA in the final ColumnSpec.
3.5.6 Multi-input DNA for singletons¶
Per Foundations §2.7.7, multi-input operations produce singleton columns. A singleton's DNA is a tuple of predecessor (name, E, op) snapshots — one per input.
# A singleton ratio column
- column_spec:
name: revenue_per_unit
E: [transaction]
M:
signature: MCAR
determinants: []
op: MAP_DIV
dna:
- name: revenue
E: [transaction]
op: SUM
- name: units_sold
E: [transaction]
op: SUM
The singleton's op is the multi-input operator (MAP_DIV in this example). The singleton's DNA lists each input metric's snapshot in order.
Singletons do not participate as DNA predecessors for other columns in Coframe Core. Other columns cannot declare DNA pointing to a singleton.
3.6 Cross-schema linkage: name¶
The name field is the column's family-name in the AC. It is the cross-schema linkage by which queries reference columns and by which the framework determines family membership.
3.6.1 What name is¶
name is a string. The AC author chooses freely.
Within a schema, distinct ColumnSpecs have distinct names. Two ColumnSpecs in the same schema cannot share a name.
Across schemas, ColumnSpecs may share a name. Same-named columns across schemas belong to the same family per Foundations §2.7.3. They may be identical, siblings, or cousins, depending on E and DNA-walk results.
3.6.2 What name is not¶
The framework treats name as an opaque string label. Per Foundations §2.11.3:
- The framework does not parse `name`.
- The framework does not decompose `name` into substrings, prefixes, or suffixes.
- The framework does not interpret structural content from `name`.
The framework's only operations on name are:
- Equality comparison: do two declared names match? Used for family membership and family-root walking.
- Naming-function output verification (when a naming function is declared): does the AC's declared naming function, called with the column's DNA predecessor and operator, produce a string equal to this column's declared `name`?
Whatever structure the AC author may encode in a name (e.g., a peak_ prefix indicating MAX-derivation) is the AC author's convention and is verified, if at all, by the AC's declared naming function — not by any framework-level name parsing.
3.6.3 Syntactic constraints¶
The framework imposes minimal syntactic constraints on name:
- The string must parse as a Frame-QL identifier or qualified identifier (Chapter 8). Specifically, identifiers consist of letters, digits, and underscores, starting with a letter or underscore, and may include dots for qualified references.
- Reserved Frame-QL keywords (e.g., `SELECT`, `FROM`, `WHERE`, `BY`) cannot be used as bare identifiers.
Beyond these syntactic constraints, the AC author chooses freely. Names in any natural language, domain-specific vocabulary, internal codenames, and abstract identifiers are equally valid.
3.6.4 Same-named columns across schemas¶
When the same name appears in multiple schemas, those columns belong to the same family. The framework determines whether they are identical, siblings, or cousins:
- Identical: same `(name, E)`, same family-root.
- Siblings: same `name`, different `E`, same family-root.
- Cousins: same `name`, different family-roots.
The family-root determination uses DNA walks (§3.5.4). Two columns sharing a name are siblings iff their DNA-walk through same-named ancestors terminates at the same column.
The framework computes these relations at AC-load time and stores them in the metric genealogy.
3.7 The naming function¶
The naming function is an AC-level declaration that maps (name_pred, E_pred, op) to name_self. The framework uses the naming function to verify that ColumnSpecs declare names consistent with their operational lineage.
3.7.1 What the naming function is¶
A naming function is a function from (name_pred: string, E_pred: list-of-dimensions, op: operator-name) to name_self: string.
The framework calls the function during AC validation. For each non-root ColumnSpec where op is not identity-preserving, the framework computes the naming function's output and verifies that the column's declared name equals the output via string equality.
For identity-preserving operations (where op is the family ip_reducer for reducers, or flagged identity-preserving for functions), the framework verifies name = name_pred directly without calling the naming function.
3.7.2 The naming function is a black box¶
The framework treats the naming function as a black box:
- The framework does not constrain the function's internal logic.
- The framework does not require the function to follow any specific naming convention.
- The framework does not parse or interpret the function's output beyond comparing it to declared names via string equality.
The function's internal logic may be string concatenation, table lookup, arbitrary computation, or any other approach the AC author chooses.
The framework's role is limited to: call the function with declared inputs; compare the function's output to declared names; flag inconsistencies as integrity violations.
3.7.3 Declaration options¶
The AC author chooses one of four options for the naming function:
Option 1: Adopt the operator catalog's default. The operator catalog (Chapter 10) provides default naming function entries for common operators. For example, a default catalog might define:
- `f_MAX(revenue, *) → peak_revenue`
- `f_MIN(revenue, *) → trough_revenue`
- `f_AVG(revenue, *) → mean_revenue`
- `f_COUNT_DISTINCT(customer, *) → customer_count`
The default is a starting point, not a structural commitment by the framework. The AC opts in by declaring naming_function: catalog_default.
Option 2: Override per operator. The AC adopts catalog defaults but overrides specific operators:
Override format is the AC author's choice; the framework calls the AC's declared override mechanism without interpreting it.
Option 3: Declare a fully custom naming function. The AC declares a function from scratch:
The implementation may be in any form the AC's runtime can call — a registered Python function, a declarative table, an external service call. The framework calls it as a black box.
Option 4: Decline structured naming. The AC declares no naming function:
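An illustrative declaration (the exact spelling of the opt-out is an assumption):

```yaml
naming_function: none   # names are free labels; only DNA-based verification runs
```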
In this case, the framework verifies family-genealogy consistency through DNA only. Names are free labels; no name-vs-operator consistency is checked. The AC author is responsible for ensuring family membership claims via name correspond to genuine ancestry via DNA.
3.7.4 What the framework verifies¶
For each non-root ColumnSpec in the AC, the framework's verification proceeds as follows:
- Determine whether op is identity-preserving for the DNA predecessor (name_pred, E_pred):
  - For reducer op: identity-preserving iff op equals the predecessor's family ip_reducer (which is the predecessor's family-root's op if partition_invariant).
  - For function op: identity-preserving iff the operator-catalog entry flags it as identity-preserving for the predecessor's data type.
- If identity-preserving: verify name = name_pred (string equality). Inconsistency is an integrity violation.
- If not identity-preserving and the AC has declared a naming function: call the naming function with (name_pred, E_pred, op) and verify name = naming_function_output (string equality). Inconsistency is an integrity violation.
- If not identity-preserving and the AC declines structured naming: skip name-vs-operator verification. The AC is verified through DNA only.
3.7.5 What the framework does not verify¶
Without a declared naming function, the framework does not verify:
- That names are "good" or follow any convention.
- That same-named columns are conceptually related.
- That different-named columns are conceptually distinct.
- That the AC author's name choices reflect any particular vocabulary or aesthetic.
These are the AC author's responsibility. The framework's structural reasoning operates on declared structure (DNA, E, M, op), not on names beyond their role as equality-tokens.
3.8 Derived properties¶
The framework derives several properties from declared ColumnSpec fields. Engineers do not declare these directly.
3.8.1 O(c, S): natural anchoring¶
For root columns, O(c, S) is derived from name plus E. For derived columns, O(c, S) is derived via DNA-walk plus op composition. The natural anchoring captures where the column "is" in the AC's anchor space.
3.8.2 Identity¶
The column's identity is the structural fingerprint determining substitutability for query purposes. Identity is captured by (name, E, family-root):
- Same (name, E, family-root) ⟹ identical columns.
- Same (name, family-root), different E ⟹ siblings.
- Same name, different family-root ⟹ cousins.
- Different name ⟹ different families.
Only (name, E) is declared by the AC author. The family-root is derived via DNA-walk; the AC author does not declare it.
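The four rules amount to a small decision procedure. In the sketch below, family-roots are passed in as opaque identifiers for illustration; in the framework itself they are derived via DNA-walk:

```python
def relation(a, b):
    """Classify two ColumnSpecs given (name, E, family_root) triples.
    E is a frozenset of anchor dimensions; family_root is an opaque id
    standing in for the derived family-root (illustrative representation)."""
    (name_a, e_a, root_a), (name_b, e_b, root_b) = a, b
    if name_a != name_b:
        return "different families"
    if root_a != root_b:
        return "cousins"        # same name, observationally independent ancestry
    if e_a == e_b:
        return "identical"      # substitutable for query purposes
    return "siblings"           # same family-root, different anchoring
```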
The four-rule filter (Chapter 9) uses identity equivalence to determine which schemas can serve a query.
3.8.3 E*(c, S): FD-DAG-extended entity set¶
E*(c, S) is the closure of E(c, S) under FD-DAG reachability — both upward (coarser ancestors via FD-edges) and downward (finer descendants).
The framework uses E* in the four-rule filter's Rule 2 to determine whether a schema's column can serve a query at a target anchor: the schema's column can serve the query iff the query's target anchor is in E*(c, S).
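A sketch of the closure computation. The reading of "both upward and downward" as ancestors plus descendants per dimension, and the encoding of FD-edges as (finer, coarser) pairs, are interpretive assumptions here:

```python
def _reach(start, adj):
    """All nodes reachable from start via adj (including start)."""
    seen, frontier = {start}, [start]
    while frontier:
        d = frontier.pop()
        for nxt in adj.get(d, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen

def e_star(E, fd_edges):
    """Closure of anchor set E under FD-DAG reachability: for each
    dimension, its coarser ancestors plus its finer descendants."""
    up, down = {}, {}
    for finer, coarser in fd_edges:
        up.setdefault(finer, set()).add(coarser)
        down.setdefault(coarser, set()).add(finer)
    out = set()
    for d in E:
        out |= _reach(d, up) | _reach(d, down)
    return out
```

With the closure in hand, Rule 2's test reduces to checking the query's target anchor against the computed set.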
3.8.4 AC-level category¶
The column's AC-level category (AC-dimension, AC-attribute, or AC-metric) is derived from the trichotomy in Foundations §2.5. The derivation is from E patterns across schemas:
- AC-dimension: there exists a schema where E(c, S) = {c}.
- AC-attribute: not an AC-dimension; E is identical across all schemas where c appears.
- AC-metric: not an AC-dimension; E varies across schemas.
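The trichotomy reduces to a short classification over a column's anchorings across schemas (an illustrative sketch, not the framework's code):

```python
def ac_category(name, anchorings):
    """anchorings: the E(c, S) values (as frozensets) for every schema
    where a column named `name` appears. Order of checks matters: the
    AC-dimension test takes precedence."""
    if any(E == frozenset({name}) for E in anchorings):
        return "AC-dimension"   # some schema anchors the column on itself
    if len(set(anchorings)) == 1:
        return "AC-attribute"   # E identical everywhere c appears
    return "AC-metric"          # E varies across schemas
```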
3.8.5 Family membership and family-root¶
For each ColumnSpec, the framework derives:
- The family it belongs to (the set of all ColumnSpecs sharing its name).
- The family-root it walks DNA to.
- The structural relations to other same-named columns (identical, sibling, cousin).
These derivations are described in Foundations §2.7.
3.8.6 ip_reducer (per family)¶
For each family in the AC, the framework derives:
- The family-root operator: the op at the family-root.
- The ip_reducer: equals the family-root operator if it has partition_invariant: true; otherwise the family has no ip_reducer.
The ip_reducer is a property of the family. All columns in the family share the family's ip_reducer (or share its absence). This derivation is described in Foundations §2.7.6.
3.9 Integrity conditions involving ColumnSpec¶
The framework enforces integrity conditions involving ColumnSpec at AC validation. Conditions checkable from declared metadata alone are checked at DQ Phase 1; conditions requiring data attestation are checked at DQ Phase 2/3.
3.9.1 Phase 1 conditions (metadata-only)¶
Required fields present. Every ColumnSpec has all four parts declared (with auto-derivation for grain-role columns).
E references valid AC-dimensions. Every dimension named in E is itself an AC-dimension somewhere in the AC.
M constraint. M.determinants ⊆ E ∪ {self}. Determinants outside this set are integrity violations.
|E| = 1 for AC-dimensions and AC-attributes. Per the column trichotomy.
Operator-type-appropriate E-relation. For each non-root ColumnSpec, the E-relation between the column and its DNA predecessor matches the operator's type:
- For reducer op: E_pred ⊇ E_self under FD-DAG navigation.
- For function op: E_pred = E_self.
Naming consistency (when a naming function is declared). For each non-root ColumnSpec, the column's name equals the naming function called with its DNA predecessor and operator (or name = name_pred if op is identity-preserving).
DNA references valid columns. Every non-root ColumnSpec's DNA points to a (name, E, op) triple that matches some ColumnSpec in the AC.
Same-name uniqueness within schema. Two ColumnSpecs in the same schema do not share a name.
Type consistency for same-named columns within a schema. Same-named columns within a schema have the same data_type (this rule is trivial since same-name uniqueness within schema forbids collisions; the rule applies meaningfully across schemas — see §3.9.3).
3.9.2 Phase 2/3 conditions (data-attested)¶
Grain combo-key uniqueness. For each schema, the grain-role columns' value tuples are unique per row. Verified via verify_grain_integrity (Chapter 6).
Cross-schema value-mapping consistency. Same-named AC-dimensions and AC-attributes across schemas have consistent value mappings where their declared scopes overlap.
Logical FD-DAG ⊆ Data-driven FD-DAG. Every logical FD-edge declared in schema.init is data-attested per Phase 3.
Schema scope honoring. Each schema's declared scope (degenerate or non-degenerate on each AC-dimension) matches its observed value-sets per quasi-metadata.
3.9.3 Cross-schema conditions¶
Family-root uniqueness within (name, E). Two ColumnSpecs in the AC with the same (name, E) walk DNA to the same family-root. Violation indicates two non-equivalent metrics share a (name, E) identity claim — a structural inconsistency.
Type compatibility for same-named columns across schemas. Same-named columns across schemas have compatible data types (per the AC's type-equivalence rules).
3.9.4 Cross-schema coherence verification¶
The cross-schema metric coherence statement (Foundations §2.10.5) — that siblings of the same family-root produce coherent values across schemas — is verified per attestable DNA edge during DQ Phase 3 by default in Coframe Core (Chapter 7 §7.6.8). Verification compares the predecessor's data, aggregated via the family's ip_reducer at the successor's anchor (honoring operator-catalog missing-value semantics), against the successor's observed values, scoped to the intersection of the two schemas' declared scopes.
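In pure-Python terms the per-edge comparison looks roughly like this. It is a stand-in for the backend-side computation (row representation and names are assumptions, and operator-catalog missing-value semantics are not modeled):

```python
def attest_edge(pred_rows, succ_rows, anchor_keys, value_col, ip_reduce):
    """Compare the predecessor's values, aggregated to the successor's
    anchor via the family's ip_reducer, against the successor's observed
    values, scoped to anchors present on both sides (the scope
    intersection). Returns {anchor: (expected, observed)} for mismatches."""
    rolled = {}
    for row in pred_rows:
        key = tuple(row[k] for k in anchor_keys)
        rolled.setdefault(key, []).append(row[value_col])
    observed = {tuple(r[k] for k in anchor_keys): r[value_col] for r in succ_rows}
    deltas = {}
    for key in rolled.keys() & observed.keys():   # scope intersection
        expected = ip_reduce(rolled[key])
        if expected != observed[key]:
            deltas[key] = (expected, observed[key])
    return deltas   # empty dict => the DNA edge passes attestation
```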
Engineers may opt out of attestation per AC by setting attestation.enabled: false in the AC catalog. Opted-out ACs treat coherence as asserted-not-verified; the opt-out is recorded in the verification status and propagates to query-result annotations.
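The opt-out declaration might look like this in the AC catalog (the key path follows the text above; surrounding structure is illustrative):

```yaml
attestation:
  enabled: false   # coherence becomes asserted-not-verified; recorded in the
                   # verification status and propagated to query annotations
```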
The framework's MTI guarantee (Chapter 9 §9.6) is unconditional in default configurations and conditional on the engineer's commitment to ETL-side coherence in opted-out configurations.
3.10 Example: the retail AC ColumnSpecs¶
Below is a sketch of ColumnSpecs in the retail AC's running example, illustrating the four parts.
A grain-role column in the customers schema:
```yaml
- column_spec:
    src_name: customer_id
    name: customer
    data_type: integer
    E: [customer]
    # M auto-derived: MNAR with determinants = [customer]
    op: OBSERVED
    dna:
      name: customer
      E: [customer]
      op: OBSERVED
```
An AC-attribute in the customers schema:
```yaml
- column_spec:
    src_name: customer_name
    name: customer_name
    data_type: string
    E: [customer]
    M:
      signature: MCAR
      determinants: []
    op: OBSERVED
    dna:
      name: customer_name
      E: [customer]
      op: OBSERVED
```
An AC-metric (root) in the transactions schema:
```yaml
- column_spec:
    src_name: amount
    name: revenue
    data_type: numeric
    E: [transaction]
    M:
      signature: MCAR
      determinants: []
    op: SUM
    dna:
      name: revenue
      E: [transaction]
      op: SUM
```
An AC-metric (sibling of revenue) in the store_monthly_summary schema:
```yaml
- column_spec:
    src_name: total_revenue
    name: revenue
    data_type: numeric
    E: [store, month]
    M:
      signature: MCAR
      determinants: []
    op: SUM
    dna:
      name: revenue
      E: [transaction]
      op: SUM
```
This sibling shares the family-name revenue with the root and walks DNA back to the root. The framework recognizes them as siblings; cross-anchor navigation between them is well-defined.
A derived AC-metric (peak revenue, in a different family) in the same schema:
```yaml
- column_spec:
    src_name: peak_daily_revenue
    name: peak_revenue
    data_type: numeric
    E: [store, month]
    M:
      signature: MCAR
      determinants: []
    op: MAX
    dna:
      name: revenue
      E: [store, day]
      op: SUM
```
This column is in the peak_revenue family, derived via MAX from a revenue @ {store, day} predecessor (which would itself be a sibling of revenue @ {transaction} in the AC).
If the AC has declared a naming function with f_MAX(revenue, *) → peak_revenue, the framework verifies that this column's name (peak_revenue) matches the naming function's output for its DNA predecessor and op. If consistent, the ColumnSpec passes Phase 1 verification.
A singleton (registered ratio) at transaction grain:
```yaml
- column_spec:
    src_name: NULL   # computed, no backend column
    name: revenue_per_unit
    data_type: numeric
    E: [transaction]
    M:
      signature: MCAR
      determinants: []
    op: MAP_DIV
    dna:
      - name: revenue
        E: [transaction]
        op: SUM
      - name: units_sold
        E: [transaction]
        op: SUM
```
This singleton stands alone in the metric genealogy; other columns do not derive from it through DNA.
3.11 Where to go next¶
After reading this chapter, the natural next chapters are:
- Chapter 5: schema.init Format — the complete specification of the schema.init artifact, including how ColumnSpecs are declared.
- Chapter 7: Data Quality and Structural Verification — how the integrity conditions in §3.9 are verified.
- Chapter 9: Query Resolution — how the four-rule filter uses the structural relations among columns.
- Chapter 10: Operator Catalog — the operator catalog with type, partition_invariant, identity-preserving flag, and naming function entries.
For the framework's overall posture and the broader structural picture, see the Foundations chapter (Chapter 2).
Chapter 4: AC Authoring Workflow¶
The engineer's process for taking a backend warehouse and producing a working Coframe Core Analytics Collection.
4.1 Overview¶
This chapter specifies the workflow by which engineers author a Coframe Core Analytics Collection (AC). The chapter is operational: it describes the activities, artifacts, and iteration cycle that take a backend's data and produce a working AC ready for query workloads.
The chapter assumes familiarity with the Foundations chapter (Chapter 2) and the ColumnSpec chapter (Chapter 3). It refers forward to the schema.init format (Chapter 5), the data-API protocol (Chapter 6), and the data quality verification process (Chapter 7) for technical specifications of the artifacts and processes referenced here.
The chapter is organized in nine sections:
- §4.2 frames the authoring problem.
- §4.3 specifies the four phases of the authoring workflow.
- §4.4 lists the artifacts the workflow consumes and produces.
- §4.5 describes the iteration cycle.
- §4.6 details the engineer's role across the workflow.
- §4.7 describes AI-assisted authoring.
- §4.8 covers AC lifecycle and schema evolution.
- §4.9 enumerates common pitfalls.
- §4.10 lists what subsequent chapters cover.
4.2 The authoring problem¶
Authoring a Coframe Core AC means:
- Identifying what the AC should contain: which physical tables in the backend, which columns, what analytical scope.
- Declaring structural commitments: for each column, what its anchoring E is, what its missingness mechanism M is, what its operational lineage (op and dna) is, and what its family-name is.
- Declaring AC-level commitments: the naming function (if any), the candidate FD-DAG, schema scope declarations.
- Verifying the commitments against data: the DQ process attests structural facts and surfaces violations or advisories.
- Iterating until convergence: engineers refine the schema.init based on DQ feedback; the cycle continues until DQ produces a clean structural-verification deliverable.
- Validating the AC: after DQ converges, AC validation confirms all integrity conditions hold; the AC is ready for query workloads.
The workflow is iterative. Engineers don't get the AC right on the first attempt; they iterate based on data attestation. The framework's role is to surface specific structural concerns and guide remediation; the engineer's role is to commit to structural facts and respond to attestation.
4.3 The four phases¶
The AC authoring workflow has four phases.
4.3.1 Phase 1: Discovery¶
In Phase 1, engineers define the AC scope (per Foundations §2.3.4) — the curatorial commitment that determines what the AC exposes. The scope-defining activities:
- Selecting backend tables: which physical tables (or views) the AC should source data from.
- Selecting columns within each table: which columns are analytically relevant to the AC's purpose. A backend table may have hundreds of columns; the AC may include only a handful. The selection is deliberate: each included column becomes a ColumnSpec; columns not selected are outside the AC scope.
- Choosing names: what to call the selected columns in the AC's vocabulary. The framework treats names as opaque labels; the AC author's choice can preserve backend names, adopt domain-specific terminology, use abstract identifiers, or anything else.
- Identifying candidate AC-dimensions: which selected columns will serve as grain in some schema.
- Identifying candidate AC-metrics and AC-attributes: which selected columns observe values about entities.
- Identifying candidate FD-DAG edges: structural relationships among the candidate AC-dimensions.
Engineers draft an initial schema.init based on these choices. The schema.init may be incomplete: not all ColumnSpec fields need to be filled in yet. The framework's Phase 1 metadata-only verification (Chapter 7 §7.3) catches structural inconsistencies in the draft.
Phase 1 typically uses AI-assisted tooling (§4.7) to propose ColumnSpec field values from data inspection. The engineer reviews and confirms — particularly the curatorial choices, where domain knowledge about which columns matter for analytical purpose is the engineer's contribution.
4.3.2 Phase 2: Verification¶
In Phase 2, the framework calls the backend's data-API to fetch quasi-metadata, then verifies declared facts against attested facts. The DQ process (Chapter 7) produces:
- Hard violations: integrity conditions that fail. Engineers must address before AC validation passes.
- Advisories: soft concerns the framework surfaces for engineer consideration. Engineers either confirm intentional or address.
- Refinements: ColumnSpec fields the framework can infer from data attestation (e.g., metric anchorings inferred from data; missingness signatures observed from missing-value counts).
- The structural-verification deliverable: a record of what was verified (including per-DNA-edge attestation results when enabled), what was identified as asserted-not-verified (the genuine lemmas — see §7.7.3), and the AC's coherence posture summary.
Phase 2 is the framework's substantive engagement with the backend's data. Engineers respond to the DQ output by refining schema.init.
4.3.3 Phase 3: AC development plan¶
In Phase 3, engineers consider how the verified structural facts inform the AC's intended use. Decisions made in Phase 3:
- Family-name finalization: which conceptual quantities form which families. This may involve consolidating columns under shared family-names where structural ancestry justifies it, or distinguishing columns that look similar but are observationally independent (cousins).
- Naming function declaration: whether to adopt the operator catalog's default, override per operator, declare a custom function, or decline structured naming altogether (per §3.7.3).
- Logical FD-DAG finalization: which FD-edges to declare, given what the data attests.
- Operational lineage decisions: declaring op and dna for each ColumnSpec, with attention to which columns are roots (observationally given) vs. derived (with explicit DNA predecessors).
- Analytical posture: what queries the AC should support; what cross-schema relationships matter; which families are primitive vs. derived.
Phase 3 straddles Coframe Core and Coframe Pro territory: because Core omits Pro's advanced capabilities, its Phase 3 decisions are fewer, and the phase is typically lighter for Core than for Pro.
4.3.4 Phase 4: AC creation¶
In Phase 4, the verified, decided AC is committed:
- ColumnSpecs are finalized with all fields declared (or auto-derived).
- Integrity conditions pass AC validation.
- The AC is ready for query workloads.
Phase 4 is a transition: engineers move from authoring to operating. Subsequent maintenance involves either (a) re-running DQ when backend data updates significantly, or (b) re-entering the workflow when the AC's scope expands or analytical requirements change.
4.4 The artifacts¶
The workflow produces and consumes several artifacts.
4.4.1 schema.init¶
The engineer's input to the DQ process. A YAML artifact specifying:
- The AC's identification (name, description, scope).
- The naming function declaration (if any).
- The collection of virtual tables (one per schema), each with ColumnSpecs (initially possibly incomplete).
- Logical FD-edges declared by the engineer.
- Instructions for DQ (trust-declared-FD directives, etc.).
Engineers iterate on schema.init based on DQ feedback. The schema.init format is specified in Chapter 5.
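A skeletal sketch of the artifact's shape, reflecting the list above. Key names here are illustrative assumptions; Chapter 5 is normative:

```yaml
ac:
  name: retail
  description: Retail sales analytics
  coframe_version: "1.0"
naming_function: catalog_default
schemas:
  - name: transactions
    source_table: warehouse.fact_transactions   # backend binding (illustrative)
    columns:
      - column_spec:
          src_name: amount
          name: revenue
          data_type: numeric
          E: [transaction]
          op: SUM
fd_edges:
  - [store, region]          # logical FD-edge: store -> region
dq:
  trust_declared_fd: []      # DQ directives (see Chapter 7)
```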
4.4.2 Backend data-API¶
The backend's interface that DQ calls for introspection and verification. Backends implement this protocol (Chapter 6); engineers don't write data-API calls directly.
The protocol covers operations like:
- Listing tables and getting table schemas.
- Computing distinct value sets per column.
- Computing pair value-mappings.
- Computing missing counts.
- Testing functional dependencies.
- Testing metric anchorings.
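The protocol surface implied by that list can be sketched as a structural interface. Method names and signatures below are assumptions for illustration; Chapter 6 specifies the actual protocol:

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class DataAPI(Protocol):
    """Sketch of a backend's data-API surface (names are illustrative)."""
    def list_tables(self) -> list[str]: ...
    def table_schema(self, table: str) -> dict[str, str]: ...
    def distinct_values(self, table: str, column: str) -> set: ...
    def pair_value_mapping(self, table: str, a: str, b: str) -> dict: ...
    def missing_count(self, table: str, column: str) -> int: ...
    def fd_holds(self, table: str, determinant: str, dependent: str) -> bool: ...
    def metric_anchoring(self, table: str, metric: str,
                         candidate_anchor: list[str]) -> bool: ...
```

A backend satisfies the protocol structurally, by implementing the methods, rather than by inheriting from it.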
4.4.3 DQ deliverable¶
The output of the DQ process. A structural-verification artifact including:
- Coverage maps: per-AC-dimension per-schema, the value-set observed and its relationship to the universe-wide value set.
- FD-DAG attestation: the data-driven FD-DAG with its relationship to the logical FD-DAG.
- Metric anchoring inferences: per-AC-metric, the smallest data-attested anchoring.
- Per-DNA-edge attestation results: in default configurations, per-edge passed/failed/unattestable/sampled status with deltas; in opted-out configurations, an explicit record of the opt-out.
- Violations: integrity conditions that failed (including I10 attestation failures under failure_mode: hard).
- Advisories: soft concerns the framework surfaces for engineer review (including attestation deltas under failure_mode: soft).
- AC-level integrity status: pass/fail and what's pending.
- Asserted-not-verified facts listed: the genuine lemmas of the framework, distinguished from verified-with-opt-out facts (per §7.7.3).
- Coherence posture summary: explicit statement of the AC's attestation configuration and resulting MTI status (unconditional / conditional within scope).
4.4.4 Working AC¶
The final artifact engineers query against. Produced when DQ converges and AC validation passes. The AC consists of:
- Verified schemas with complete ColumnSpecs.
- The verified FD-DAG.
- The verified metric genealogy.
- The verified coverage maps.
- The naming function (if any).
Queries (Chapters 9 and 10) execute against the AC.
4.5 The iteration cycle¶
The authoring cycle:
- Engineer authors initial schema.init (Phase 1: Discovery).
- Framework runs DQ (Phase 1 → 2 → 3 of DQ process; per Chapter 7).
- Framework returns the deliverable: violations, advisories, refined schema.init proposals.
- Engineer reviews:
- Addresses violations: modifies schema.init, fixes data, or declares synthetic-unknown values per the DQ chapter.
- Considers advisories: confirms intentional or addresses.
- Adds instructions to schema.init for re-run: e.g., trust-declared-FD directives that override default DQ behavior.
- Re-runs DQ.
- Iterates until violations are zero and engineer is satisfied with advisories.
- AC validation runs on the converged schema.init; AC is ready.
Each iteration produces a refined schema.init. The framework supports iteration via:
- Caching of quasi-metadata between runs: refreshed only when data changes, not per iteration.
- Differential re-verification: re-checking only changed schemas/columns when possible.
- Advisory acknowledgments persisting across runs: engineer says "I accept this" once.
The iteration converges when the engineer is satisfied with the AC's structural state. There is no fixed iteration count; the framework supports arbitrary iteration.
4.6 Engineer's role across the workflow¶
The engineer's role spans:
4.6.1 Domain knowledge¶
The engineer brings:
- Knowledge of which physical tables to include in the AC.
- Knowledge of what each column means in the business — the semantic content the AC will encode.
- Knowledge of how dimensions hierarchically relate (the basis for the candidate FD-DAG).
- Knowledge of what analytical questions the AC should support.
The framework does not infer domain knowledge. AI-assisted tooling (§4.7) can propose structural commitments based on data inspection, but the engineer reviews and confirms.
4.6.2 Naming practice¶
The engineer chooses:
- Names of columns: any string the AC author finds appropriate. Per Foundations §2.11.4, names are the AC author's foundational choice; the framework imposes no naming aesthetic.
- Whether to declare a naming function: catalog default, per-operator overrides, fully custom, or no naming function at all.
- Family-naming consistency: ensuring the engineer's name choices reflect intended ancestry (via DNA) and that same-named columns are intentionally siblings or intentionally cousins.
If the engineer declines structured naming (no naming function), the framework verifies family-genealogy consistency through DNA only; the engineer carries full responsibility for ensuring name choices correspond to genuine ancestry.
4.6.3 Structural commitments¶
For each ColumnSpec, the engineer declares:
- E: what entities the column observes.
- M: how the column can be missing (or accepts the auto-derived value for grain-role columns).
- op and dna: the column's operational lineage. For roots: op: OBSERVED (or the family ip_reducer) with self-referential DNA. For derived columns: the operator that produced this column from its DNA predecessor.
- name: the column's family-name.
The framework verifies the declarations for consistency with the AC's principles, the operator catalog's properties, and (when declared) the naming function. Violations are surfaced as DQ output.
4.6.4 DQ response¶
The engineer responds to DQ output:
- Hard violations: must be addressed. Options include modifying schema.init, fixing data in the backend, declaring synthetic-unknown values per the DQ chapter, or reclassifying schemas (e.g., re-declaring a schema as degenerate on a dimension).
- Advisories: optional. The engineer chooses whether to address each advisory or accept it as-is.
The framework's discipline is that violations block AC validation; advisories don't. The engineer decides which advisories to address based on analytical purpose.
4.6.5 Analytical posture¶
The engineer commits to:
- The AC scope (per Foundations §2.3.4): which columns to include, what to call them, what structural commitments they bear. This is the AC author's curatorial authority over what the AC exposes.
- Which families to define: which conceptual quantities are first-class in the AC's vocabulary, which are derived.
- Cross-schema integration choices: when multiple schemas observe overlapping data, how the AC reconciles them (typically via the FD-DAG and the framework's automatic reasoning, but with engineer-level decisions about scope).
The framework supports these decisions but does not make them. Engineers retain authorship of analytical purpose; the framework verifies structural correctness.
4.7 AI-assisted authoring¶
The authoring workflow can be substantially AI-assisted. AI agents — operating against the framework's structural metadata via the MCP server (Chapter 11) — can:
4.7.1 What AI agents can do¶
- Draft initial schema.init from warehouse exploration (Phase 1). AI agents inspect backend tables, propose ColumnSpec field values for each column, and produce an initial schema.init for engineer review.
- Propose ColumnSpec field values: based on data attestation and naming patterns, AI agents propose E, M, op, and dna for columns. The framework's structural rigor gives AI proposals a precise target to optimize against.
- Identify candidate roots and derived columns: AI agents can examine pre-aggregated tables, propose which columns are observationally rooted vs. derived from finer-grained sources, and construct candidate DNA chains.
- Propose the FD-DAG: AI agents can examine column relationships and propose candidate FD-edges, which the framework then attests at DQ Phase 3.
- Respond to DQ advisories with proposed remediation: when an advisory surfaces, AI agents can propose schema.init modifications to address it.
- Draft Frame-QL queries from natural-language analytical questions: engineers (or business analysts using the AC) describe analytical needs in natural language; AI agents construct Frame-QL queries against the AC's vocabulary.
4.7.2 The engineer's role with AI assistance¶
The engineer's role with AI-assisted authoring is review and confirmation:
- AI agents propose; engineers approve.
- The engineer brings domain knowledge AI cannot infer (semantic content, organizational priorities, analytical purpose).
- The engineer carries final responsibility for the AC's commitments — the framework verifies, but the engineer commits.
This is increasingly the practical authoring pattern. Engineers do not typically author every ColumnSpec by hand for a large warehouse; they review and refine AI-proposed drafts. The structural rigor of the framework is what makes AI-assisted authoring feasible: AI proposals can be checked against the framework's integrity conditions, and unproductive proposals are surfaced as violations or advisories.
4.7.3 The MCP server's role¶
The MCP server (Chapter 11) exposes the AC's structural metadata to LLM clients:
- The metric genealogy.
- The operator catalog.
- The FD-DAG.
- Coverage maps and quasi-metadata.
LLM clients consume this metadata to reason about analytical questions and propose well-formed Frame-QL queries. The MCP exposure is what enables AI-assisted authoring tooling to operate against the framework's structural target.
4.8 AC lifecycle and schema evolution¶
An AC is not authored once and frozen. The data the AC reads evolves: new columns appear in backend tables, columns get renamed, FDs that held last quarter no longer hold this quarter, new schemas appear, old schemas are deprecated. The AC needs to evolve in lockstep. This section specifies how.
4.8.1 The lifecycle stages¶
An AC moves through four stages during its operational life:
- Initial authoring — the four-phase workflow specified in §4.3 produces a v1.0 AC.
- Steady-state operation — the AC is loaded, queries run against it, DQ runs periodically (on a schedule or on demand). The schema.init and the DQ deliverable are stable artifacts version-controlled in source.
- Schema evolution — upstream data changes; the AC needs to be updated. The AC author runs an evolution pass: adjust the schema.init, re-run DQ, address new violations or advisories, commit a new AC version.
- Deprecation — when an AC is being phased out (replaced by a different AC, or the underlying analytical purpose has changed), there's a deliberate sunset workflow.
The framework supports stages 1, 2, and 3 directly. Deprecation is an organizational concern; the framework provides no special mechanism beyond the AC author's own version control.
4.8.2 Common evolution scenarios¶
The following scenarios cover most operational changes practitioners will encounter.
Adding a new column to a backend table. The most common case: an upstream table gains a column, and the AC author wants to expose it. The AC author edits schema.init to add a ColumnSpec for the new column, re-runs DQ, addresses any new violations, commits. No special framework support is needed; the change is additive and integrity-preserving by construction (existing ColumnSpecs are unchanged).
Adding a new schema to the AC. A new pre-aggregated summary table gets added to the warehouse and the AC author wants to include it. The AC author adds the schema declaration to schema.init (per §5.4), declares its ColumnSpecs (binding existing family-names where the new schema's columns share families with existing ones), re-runs DQ. Phase 3 attestation will verify the new schema's metric coherence against existing siblings; the AC author addresses any drift.
Renaming a backend column. An upstream column was renamed from customer_id to customer_uuid. The AC author updates src_name in the affected ColumnSpec(s) and re-runs DQ. The framework's structural reasoning operates on the AC-level name field, not on src_name; renaming src_name while keeping name stable means consumers of the AC see no change. The DQ deliverable's caching invalidates appropriately.
Removing a backend column the AC depends on. If an upstream column is dropped, DQ Phase 1 fails: the schema.init declares a column whose src_name doesn't exist in the backend. The AC author either restores the upstream column, removes the dependent ColumnSpec from schema.init (and addresses downstream genealogy implications), or declares a substitute backend source. There's no "graceful degradation" mode — the framework's correctness depends on declared columns existing.
An FD that used to hold no longer holds. The FD store → region held when stores were geographically partitioned, but a recent reorg made some stores serve multiple regions. DQ Phase 3 (data attestation) flags this as an integrity violation. The AC author has three responses: (a) fix the data (perhaps the violation is itself a data-quality bug); (b) add the violating cases to tolerated_edges with rationale (the FD holds for 99.5% of the data, the residual is acceptable); (c) restructure the AC to model the new reality (perhaps region becomes time-varying — which in Coframe Core means modeling region-change as event-time-anchored events; in Coframe Pro it would mean an SCA).
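Response (b) might look like this in schema.init. The shape of the tolerated_edges entry is an assumption for illustration; Chapter 5 specifies the actual format:

```yaml
fd_edges:
  - determinant: store
    dependent: region
    tolerated_edges:
      - store: "S-1042"          # store observed serving two regions post-reorg
        rationale: "accepted residual after the 2024 reorg; < 0.5% of rows"
```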
A backend type change. A column was INTEGER; it's now BIGINT. The AC author updates data_type in the affected ColumnSpec, re-runs DQ. The framework's reasoning is type-aware (operator-catalog signatures are typed), so a type change can cascade — for example, if a derived column's expression was typed against the predecessor's type, the derivation may need updating. DQ surfaces these.
AC catalog format version changes across Coframe Core releases. The AC catalog format is versioned (coframe_version: "1.0"). Future minor versions of Coframe Core may add optional fields without breaking older catalogs. Future major versions may require migration; the project will provide migration guidance when major versions ship. For v1.0, the format is the format; AC catalogs authored against v1.0 remain valid against all v1.x releases.
4.8.3 The evolution workflow¶
For most evolution scenarios, the workflow is simple:
- Edit schema.init to reflect the change.
- Re-run DQ. The framework runs Phases 1, 2, and 3 against the new schema.init plus the (possibly changed) backend data.
- Address violations. New errors require resolution before the AC is loadable; advisories are informational.
- Address per-DNA-edge attestation results. If new schemas were added or existing schemas changed, attestation may newly fail or newly pass; the AC's verification level may move (up or down).
- Commit the new AC version to source control. The schema.init plus the DQ deliverable together constitute the AC's source-controlled state.
- Communicate the change to consumers if it's substantive. The verification status's level field is observable to MCP clients and to BI tools that branch on it; if the level changed, downstream consumers may need to react.
The DQ feedback loop (per §4.5) is the same loop used for initial authoring. Evolution is structurally the same activity as authoring, performed against an existing artifact rather than a blank slate.
4.8.4 Caching and incremental DQ¶
Coframe Core re-runs DQ in full when schema.init changes; there's no incremental DQ in v1.0. Engineering teams managing many ACs or very large fact tables will want to be deliberate about when to re-run (on schema.init commit, on a schedule, on data-pipeline completion).
The DQ deliverable itself is cached. When schema.init hasn't changed and the underlying backend data hasn't changed, re-loading the AC consumes the cached DQ deliverable rather than re-running. The cache is invalidated by changes to either schema.init or to backend table metadata.
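The invalidation rule can be sketched in plain Python (illustrative only — this is not the framework's API; the function name and metadata shape are ours): derive a cache key from the two inputs that invalidate the deliverable, and re-run DQ whenever the key changes.

```python
import hashlib
import json

def dq_cache_key(schema_init_text: str, table_metadata: dict) -> str:
    """Cache key for the DQ deliverable, derived from the two inputs
    that invalidate it: the schema.init contents and backend table
    metadata (row counts, last-modified timestamps, etc.)."""
    payload = json.dumps(
        {"schema_init": schema_init_text, "backend": table_metadata},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

meta_v1 = {"transactions": {"rows": 1000, "modified": "2026-01-01"}}
meta_v2 = {"transactions": {"rows": 1001, "modified": "2026-01-02"}}

# Unchanged inputs reuse the cached deliverable; a metadata change
# (or any schema.init edit) produces a new key and forces a re-run.
assert dq_cache_key("schema_init: ...", meta_v1) == dq_cache_key("schema_init: ...", meta_v1)
assert dq_cache_key("schema_init: ...", meta_v1) != dq_cache_key("schema_init: ...", meta_v2)
```

Note that this metadata-driven scheme is exactly why §4.9.12's pitfall exists: a reload the backend doesn't reflect in metadata leaves the key, and therefore the cache, unchanged.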
Coframe Pro's incremental attestation (per §1.5) re-runs only DNA edges affected by source changes, which is operationally important for very large ACs. Coframe Core users with this need either run scheduled full attestation during off-peak hours or migrate to Pro.
4.8.5 Versioning ACs in source control¶
ACs are deliberately structured to be source-controlled artifacts. The recommended pattern:
- schema.init in version control. This is the AC's source of truth.
- DQ deliverable as build artifact. Generated from schema.init plus backend introspection; not committed to source control. (Each fresh DQ run regenerates it.)
- Co-locate with related infrastructure. ACs over a dbt-managed warehouse typically live alongside the dbt project, ideally in the same monorepo with shared CI.
- Tag AC versions. Use git tags or release versions to mark AC milestones (retail-ac-v1.0, retail-ac-v1.1). Consumers can pin to specific versions.
- Code review the schema.init. Structural commitments warrant the same review discipline as application code.
When the schema.init format itself evolves (rare, but possible across Coframe Core minor versions), use the AC catalog's coframe_version field to assert compatibility. Migration tooling, when needed, will operate on schema.init artifacts in version control.
4.8.6 Recovery from severe schema-evolution events¶
Some upstream changes are large enough that incremental evolution isn't practical: a backend table gets dropped and replaced with a structurally different table; an FD-DAG edge that was central to many derivations no longer holds. In these cases, the recovery pattern is to author a v2.0 AC alongside the existing v1.0, run them in parallel during a transition period, and deprecate v1.0 when consumers have migrated. Both ACs can coexist in the system; consumers select which AC to query.
The framework has no built-in mechanism for AC-version migration of in-flight queries; the engineering team coordinates the cutover. This is intentional — the structural commitments that make Coframe Core rigorous would be undermined by silent rewriting of consumer queries to a different AC.
4.9 Common pitfalls¶
The Manual specifies what's correct. This section enumerates what's commonly wrong. The pitfalls below are predictable mistakes practitioners encounter during initial authoring and during evolution. Each is followed by what the mistake looks like in DQ output and how to fix it.
4.9.1 Misclassifying AC-attribute vs. AC-metric¶
The mistake. Declaring customer_segment as an AC-metric (or as an AC-attribute when it actually varies over time and should be event-modeled). Same data, wrong column type.
What it looks like. If declared as AC-metric when it's properly an AC-attribute, DQ Phase 3 flags the column's op as inappropriate (no meaningful aggregation reducer; OBSERVED should be the operator). If declared as AC-attribute when the value varies over time per (customer, day), DQ flags the AC-attribute's |E| = 1 integrity check as failing for cases where the same customer has different segments at different dates.
Fix. Decide whether the column is current-state (AC-attribute, |E| = 1) or event-derived (AC-metric anchored at an event-time). For "the customer's segment changed in March," model the change as a segment_change_event AC-metric anchored at the change date, not as a time-varying customer_segment attribute. (Coframe Pro's SCA generalizes this.)
4.9.2 Declaring an FD-DAG edge that holds for almost-but-not-all data¶
The mistake. Declaring store → region as an FD-edge when the data shows it holds for 99.5% of the data but not 100% — e.g., 12 stores across 2,400 are in transitional states with no region assigned, or have multiple region assignments due to a recent reorg.
What it looks like. DQ Phase 3 flags the FD as violated, naming the offending tuples (the framework caches the violation set up to a configured limit). The AC won't load until the violation is addressed.
Fix. Three options. (a) If the violation is a data-quality bug — fix the data. (b) If the violation represents legitimate exceptions you can transparently document — declare the affected edges in tolerated_edges with rationale; the AC loads at AA-with-tolerance posture and the level reporting reflects the tolerance. (c) If the violation represents a structural reality — the data really doesn't have a clean store → region FD — restructure the AC so the FD is at a different anchoring (perhaps (store, day) → region if region is per-day-stable) or move to event-modeling for region changes.
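The check DQ performs here is, at its core, a scan for source values that map to more than one target value. A minimal illustration in plain Python (the row-dict representation and function name are ours, not Coframe APIs):

```python
from collections import defaultdict

def check_fd(rows, source, target):
    """Check a candidate FD source -> target over row dicts.
    Returns (violating source values, fraction of rows that
    respect the FD)."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[source]].add(row[target])
    violators = {s for s, targets in seen.items() if len(targets) > 1}
    ok_rows = sum(1 for row in rows if row[source] not in violators)
    return violators, ok_rows / len(rows)

rows = [
    {"store": 1, "region": "west"},
    {"store": 1, "region": "west"},
    {"store": 2, "region": "west"},
    {"store": 3, "region": "west"},
    {"store": 3, "region": "southwest"},  # reorg: store 3 in two regions
]
violators, fraction = check_fd(rows, "store", "region")
# violators == {3}; fraction == 0.6
```

A near-100% fraction with a small, explainable violator set is the signature of case (b): declare the exceptions in tolerated_edges with rationale.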
4.9.3 Choosing an entity-set that's too coarse or too fine¶
The mistake. Declaring revenue at E = {transaction} when it's actually observed at E = {transaction, line_item} (too coarse — loses the line-item structure), or declaring revenue at E = {customer, store, product, day, hour, minute} when the actual observation is at the daily grain (too fine — creates a false claim of finer-grain anchoring than the data supports).
What it looks like. Too-coarse: DQ Phase 3 flags integrity violations because aggregation over the declared E doesn't recover values consistent with finer-grain siblings (the line-item structure isn't reflected). Too-fine: DQ may not flag this directly; the cousin/sibling reasoning will treat the column as a new family with no siblings, surprising the AC author who expected sibling reasoning to work.
Fix. Re-examine the data: at what tuple of identifiers is each observation actually unique? Match the declared E to that. If the data is genuinely observed at a finer grain than other schemas in the AC, declare it correctly and let the framework handle cross-grain navigation via the family ip_reducer.
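Re-examining the data amounts to asking: does the declared tuple of identifiers appear at most once? A sketch (plain Python, illustrative only):

```python
from collections import Counter

def grain_violations(rows, declared_E):
    """Return key tuples that occur more than once — evidence the
    declared entity-set E is too coarse for this data."""
    counts = Counter(tuple(row[d] for d in declared_E) for row in rows)
    return {key for key, n in counts.items() if n > 1}

line_items = [
    {"transaction": 10, "line_item": 1, "revenue": 5.0},
    {"transaction": 10, "line_item": 2, "revenue": 7.5},
    {"transaction": 11, "line_item": 1, "revenue": 3.0},
]
# E = {transaction} is too coarse: transaction 10 has two observations.
assert grain_violations(line_items, ["transaction"]) == {(10,)}
# E = {transaction, line_item} matches the data's actual grain.
assert grain_violations(line_items, ["transaction", "line_item"]) == set()
```

The too-fine direction has no such mechanical test — a declared E finer than the data's grain is still unique per tuple — which is why §4.9.3 notes DQ may not flag it directly.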
4.9.4 Inconsistent column naming creating accidental cousins¶
The mistake. Two schemas have a revenue column, but one is sourced from transactions.amount (gross revenue including tax) and the other from daily_summary.net_rev (net revenue, tax-excluded). Both ColumnSpecs are declared with name: revenue. Same family-name, but they're not siblings — they observe different conceptual quantities.
What it looks like. The framework computes family-roots via DNA-walk and discovers different family-roots for the two columns. They're cousins. DQ Phase 1 flags this. Queries that reach both are refused as dubious.
Fix. Decide whether they're truly the same conceptual quantity or different ones. If different (gross vs. net), give them different name fields (revenue_gross and revenue_net). If conceptually the same — and they should agree — investigate the discrepancy; the data may have a coherence problem.
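The DNA-walk that distinguishes siblings from cousins can be sketched as follows (illustrative only — the dict-based ColumnSpec encoding and index structure are ours; the framework's actual representation is specified in §5.5):

```python
def family_root(column, columns_by_key):
    """Walk a column's DNA chain to its root (name, E, op).
    A root's DNA is self-referential."""
    seen = set()
    node = column
    while True:
        key = (node["name"], tuple(node["E"]), node["op"])
        dna = node["dna"]
        dna_key = (dna["name"], tuple(dna["E"]), dna["op"])
        if dna_key == key:  # self-referential DNA: this is the root
            return key
        if dna_key in seen:
            raise ValueError("cycle in DNA chain")
        seen.add(dna_key)
        node = columns_by_key[dna_key]

root_gross = {"name": "revenue", "E": ["transaction"], "op": "SUM",
              "dna": {"name": "revenue", "E": ["transaction"], "op": "SUM"}}
root_net = {"name": "revenue", "E": ["order"], "op": "SUM",
            "dna": {"name": "revenue", "E": ["order"], "op": "SUM"}}
monthly = {"name": "revenue", "E": ["store", "month"], "op": "SUM",
           "dna": {"name": "revenue", "E": ["transaction"], "op": "SUM"}}
index = {(c["name"], tuple(c["E"]), c["op"]): c
         for c in (root_gross, root_net, monthly)}

# monthly and root_gross share a family-root: siblings.
assert family_root(monthly, index) == family_root(root_gross, index)
# root_net shares the name but not the root: a cousin.
assert family_root(root_net, index) != family_root(monthly, index)
```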
4.9.5 Missingness signature mismatch¶
The mistake. Declaring M = "complete" (no missingness allowed) for a column that has 0.3% NULL values in the data, or declaring M = "MAR" when the missingness is actually MNAR (the missingness is correlated with the column's own value, not just with other observed values).
What it looks like. DQ Phase 3 attestation for "complete" columns finds NULLs and flags integrity violation. For mis-declared MAR/MNAR, DQ won't catch the mistake (the framework can't infer MNAR vs. MAR from data alone) but downstream queries propagate misleading missingness annotations to consumers.
Fix. For "complete": fix the data, or declare the actual missingness signature. For MAR vs. MNAR: this is the AC author's analytical judgment; the Manual's missing-value treatment guidance (Chapter 7) is reference material. When uncertain, declaring MNAR is the conservative choice — it prompts the framework's bias-aware annotations on results.
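The asymmetry described above — NULLs are mechanically checkable against a "complete" declaration, but MAR vs. MNAR is not inferable from data alone — can be made concrete (plain Python sketch; names are ours):

```python
def completeness_check(values, declared_signature):
    """A 'complete' declaration is falsified by any NULL; for
    MCAR/MAR/MNAR declarations, NULLs are expected and become
    annotations, not violations."""
    null_count = sum(1 for v in values if v is None)
    violation = declared_signature == "complete" and null_count > 0
    return {"violation": violation, "null_count": null_count}

# 0.3% NULLs against a "complete" declaration: hard violation.
assert completeness_check([1, 2, None], "complete")["violation"]
# The same data under MNAR: no violation, just an annotated signature.
assert not completeness_check([1, 2, None], "MNAR")["violation"]
```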
4.9.6 Forgetting to declare a schema's coverage filter¶
The mistake. A schema's underlying table has 5 years of data but the AC's analytical scope is the most recent 2 years. The AC author doesn't declare a coverage filter; queries return 5 years of data, including stale historical periods the AC author didn't intend to expose.
What it looks like. Queries succeed; results include unexpected historical periods. No DQ violation surfaces because the framework can't infer the AC author's coverage intent.
Fix. Declare the schema's coverage filter in schema.init (per §5.4). The framework respects it during query resolution; results are bounded to the declared scope.
4.9.7 Choosing the wrong op from the catalog¶
The mistake. Declaring peak_revenue as a derivation from revenue with op: SUM when it should be op: MAX. The framework treats the column as a SUM-aggregation of revenue (a sibling at a coarser grain) when the analytical intent is a MAX-aggregation (a derived family with a different family-root).
What it looks like. DQ Phase 3 attestation for peak_revenue against finer-grain revenue siblings will fail (SUM-aggregation produces totals, not peaks; the cached values don't match). The AC author re-examines the column's intent.
Fix. Match op to the column's analytical intent. The operator catalog (Chapter 10) is the reference. AI-assisted authoring (per §4.7) substantially helps here by inferring op from data inspection.
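What Phase 3 attestation checks in this pitfall can be sketched as re-reducing the finer-grain sibling with the declared op and comparing against the cached coarse values (illustrative Python; the row-dict shape and function names are ours, not the framework's):

```python
from collections import defaultdict

REDUCERS = {"SUM": sum, "MAX": max}

def attest_edge(fine_rows, coarse_rows, group_by, value, op):
    """Recompute the coarse column from finer-grain sibling values
    with the declared op; return mismatches against cached values."""
    recomputed = defaultdict(list)
    for row in fine_rows:
        recomputed[tuple(row[d] for d in group_by)].append(row[value])
    failures = []
    for row in coarse_rows:
        key = tuple(row[d] for d in group_by)
        expected = REDUCERS[op](recomputed[key])
        if expected != row[value]:
            failures.append((key, expected, row[value]))
    return failures

daily = [{"store": 1, "month": "2026-01", "revenue": 100},
         {"store": 1, "month": "2026-01", "revenue": 250}]
cached = [{"store": 1, "month": "2026-01", "revenue": 250}]  # a peak, not a total

# Declared MAX attests; declared SUM fails (350 != 250), surfacing the wrong op.
assert attest_edge(daily, cached, ["store", "month"], "revenue", "MAX") == []
assert attest_edge(daily, cached, ["store", "month"], "revenue", "SUM") != []
```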
4.9.8 Using a WITH-block when a simpler query would work¶
The mistake. A user query is structured as a multi-stage WITH chain when a single-stage Frame-QL query would produce the same result with cleaner semantics.
What it looks like. The query works but is harder to read, harder to reason about, and slower. There's no DQ-level signal because the query is well-formed.
Fix. When writing Frame-QL, prefer simpler structures. WITH-blocks (per §8.7) are for genuinely multi-stage computations: stages where an intermediate result is consumed by a subsequent stage with different aggregation grain. If the query can be expressed as a single SELECT with no intermediate stages, it should be.
4.9.9 Treating Frame-QL as SQL with different syntax¶
The mistake. A user trained on SQL writes Frame-QL queries that try to specify which table to query, what JOIN to use, what GROUP BY to compute. Frame-QL doesn't have these constructs; the framework infers them from the AC's structural commitments.
What it looks like. Queries reference physical schema names instead of family-names; the framework either rejects the query as syntactically invalid or surprises the user with cross-schema reach.
Fix. Re-read §8.2 (What Frame-QL is). The mental shift: in SQL you specify the structural plan; in Frame-QL you specify the analytical intent and the framework derives the structural plan. Family-names are the unit of reference, not table names.
4.9.10 Conflating cousin and sibling cases¶
The mistake. Encountering a "dubious query: cousins detected" diagnostic and assuming it's a framework bug rather than a real warning that the query reaches conceptually different metrics with the same name.
What it looks like. The diagnostic names the cousin pair and the schemas they live in. The user dismisses the diagnostic, force-qualifies a reference, gets a result, doesn't realize the result is from one cousin and not the other.
Fix. Cousins are the framework's most valuable error class. When the diagnostic surfaces, examine why the family-roots differ. The AC may have a naming bug (cousins should be siblings, with the same family-root); or the AC correctly identifies that two different conceptual metrics share a name and need to be renamed; or the user should genuinely query one cousin (with explicit qualification and awareness). Don't reflex-qualify; investigate.
4.9.11 Misunderstanding what AC scope governs¶
The mistake. Treating the AC scope as documentation rather than enforcement, and assuming consumers can reach beyond it (via direct backend access, by guessing Frame-QL queries that traverse to non-AC columns, etc.).
What it looks like. No symptom in DQ; the misunderstanding manifests as planning and architecture decisions that assume the AC isn't a real boundary.
Fix. Re-read §11.6. The AC scope is a structural commitment; the framework refuses queries that reach outside it. This is part of why the AC works as a deliberate exposure boundary for AI agents and for cross-team data sharing. Trust the boundary.
4.9.12 Not running DQ after backend changes¶
The mistake. The backend warehouse changed (schemas evolved, data was reloaded, ETL pipeline updated) but the AC hasn't been re-loaded. The cached DQ deliverable is stale.
What it looks like. Queries return values inconsistent with the current backend state. The verification status reports stale information. Consumers may notice "the numbers changed but the AC didn't re-verify."
Fix. Re-run DQ after substantive backend changes. The DQ deliverable's cache invalidation is metadata-driven (per §4.8.4); changes the framework doesn't see (a manual table reload, an ETL bug fix that didn't update metadata) won't trigger automatic re-verification. Establish operational discipline: re-run DQ on a schedule appropriate to the warehouse's update cadence.
4.10 What subsequent chapters cover¶
The remaining chapters of Part III specify the technical artifacts and processes:
- Chapter 5: schema.init Format — the input artifact's structure, including how ColumnSpecs are declared, how the naming function is declared, how instructions modify DQ behavior.
- Chapter 6: Data-API Protocol — the backend interface DQ uses to fetch quasi-metadata.
- Chapter 7: Data Quality and Structural Verification — the DQ process specification, including the three phases of DQ and the integrity conditions verified at each.
After reading Part III, an engineer should understand the workflow, the artifacts, and how to apply the framework to a backend warehouse. Part IV (Frame-QL and Query Resolution) and Part V (Operator Catalog reference) cover the AC's operational use.
Chapter 5: schema.init Format¶
The YAML format specification for the AC author's input to the DQ process.
5.1 Overview¶
This chapter specifies the schema.init artifact: the YAML file the engineer authors as input to the DQ process. schema.init declares the AC's structure — schemas, ColumnSpecs, the naming function declaration, the candidate FD-DAG, and instructions modifying DQ behavior.
schema.init is the engineer's authoring surface. The framework's tooling consumes schema.init, calls the data-API to fetch quasi-metadata, runs DQ, and produces the verified AC.
The chapter is organized as follows:
- §5.2 specifies the top-level structure of schema.init.
- §5.3 specifies the AC-level declarations.
- §5.4 specifies virtual table declarations (one per schema).
- §5.5 specifies ColumnSpec declarations within virtual tables.
- §5.6 specifies the naming function declaration.
- §5.7 specifies the FD-DAG declaration.
- §5.8 specifies the instructions section.
- §5.9 specifies what's required at schema.init time vs. what's verified at AC validation time.
- §5.10 provides a complete schema.init example.
- §5.11 specifies what schema.init is not.
5.2 Top-level structure¶
The schema.init file has the following top-level structure:
schema_init:
ac_name: <string>
ac_description: <string>
naming_function:
<naming function declaration; see §5.6>
collection:
- virtual_table:
<virtual table 1 declaration>
- virtual_table:
<virtual table 2 declaration>
# ... additional virtual tables
fd_dag:
- <FD edge 1 declaration>
- <FD edge 2 declaration>
# ... additional FD edges
attestation:
<attestation configuration; see §5.3.4 and Chapter 7 §7.6.8.8>
instructions:
- <instruction 1>
- <instruction 2>
# ... additional instructions
The order of top-level sections is conventional. The framework parses each independently. The attestation block is optional; defaults apply when absent (per §5.3.4).
5.3 AC-level declarations¶
5.3.1 ac_name¶
The AC's identifier. A string. Conventionally lowercase with underscores (e.g., retail_analytics_v1).
The framework uses ac_name as the AC's primary identifier in tooling, MCP exposure, and persistence.
5.3.2 ac_description¶
A free-form string describing the AC's purpose and analytical scope. Engineer-facing; not used by the framework's reasoning.
5.3.3 ac_metadata¶
Optional. Additional structured metadata about the AC: author, date, version, organizational scope, etc. The framework preserves but does not interpret these fields.
5.3.4 attestation¶
Optional. Configures per-DNA-edge value attestation behavior for the AC (specified operationally in Chapter 7 §7.6.8). Defaults apply when absent.
attestation:
enabled: true # default; set to false to opt out
failure_mode: soft # soft (default) | hard | tolerated
tolerated_edges: [] # required when failure_mode == tolerated
epsilon_relative: 1.0e-9
epsilon_absolute: 0
strict_row_sets: false
sampling_threshold_rows: 100_000_000
sampling_fraction: 0.01
sampling_min_rows_per_stratum: 10_000
sampling_confidence_target: 0.99
force_full: false
Field semantics are specified in Chapter 7 §7.6.8.8. The framework treats the absence of an attestation block as equivalent to the default values; opting out of attestation requires setting attestation: enabled: false explicitly. This makes opt-out a deliberate, visible choice in the AC catalog.
5.4 Virtual table declarations¶
Each entry in the collection list is a virtual table declaration:
- virtual_table:
schema_name: <string>
source:
backend: <backend identifier>
physical_table: <table name>
# or, for view/query-based sources:
sql_definition: <SQL string>
declared_scope: <optional scope declaration; see §5.4.4>
column_specs:
- column_spec:
<ColumnSpec 1>
- column_spec:
<ColumnSpec 2>
# ... additional ColumnSpecs
5.4.1 schema_name¶
The schema's local label within the AC. Two virtual tables in the same AC cannot share a schema_name.
The framework uses schema_name for cross-references (e.g., qualified column references in queries: <schema_name>.<column_name>).
5.4.2 source¶
The source declaration tells the framework where this virtual table's data comes from in the backend.
For a direct table binding:
source:
backend: <backend identifier matching the AC's backend connection>
physical_table: <table name in the backend>
For a view-based or query-based source:
source:
  backend: <backend identifier matching the AC's backend connection>
  sql_definition: <SQL string>
The framework consumes the source declaration when binding ColumnSpecs to backend data via the data-API protocol.
5.4.3 declared_scope¶
Optional. Declares which AC-dimensions this schema is degenerate on, with explicit value-sets.
declared_scope:
degenerate_on:
- dimension: region
values: [west, southwest]
- dimension: year
values: [2025, 2026]
Schemas without a declared_scope are non-degenerate on all dimensions they observe. Schemas with declared degeneracy on a dimension d are committed to observing only the listed values for d.
The framework verifies declared scope at DQ Phase 3 against quasi-metadata. Mismatches are integrity violations.
5.4.4 Virtual splitting¶
In some cases, a single physical table may be most cleanly represented as multiple virtual tables in the AC. For example, a transactions table containing both completed and abandoned transactions may be virtual-split:
- virtual_table:
schema_name: completed_transactions
source:
backend: warehouse_main
physical_table: transactions
filter: "status = 'completed'"
column_specs: [...]
- virtual_table:
schema_name: abandoned_transactions
source:
backend: warehouse_main
physical_table: transactions
filter: "status = 'abandoned'"
column_specs: [...]
The filter clause is applied at data-API query time. Each virtual table is a logically distinct schema in the AC, with its own ColumnSpecs and declared scope.
Virtual splitting is an authoring convenience; the framework treats virtual-split tables as independent schemas.
5.4.5 column_specs¶
The list of ColumnSpecs in this virtual table. Specified in §5.5.
The set of ColumnSpecs declared in the virtual table is deliberate: it represents the AC author's selection of which backend columns to include in the AC scope (per Foundations §2.3.4). Backend columns without ColumnSpec declarations are outside the AC scope and not visible to queries against the AC. This is by design — a backend table with hundreds of columns may produce a virtual table with only a few ColumnSpecs, exposing exactly what the AC author chooses for analytical purpose.
5.5 ColumnSpec declarations¶
Each ColumnSpec entry has the following structure:
- column_spec:
src_name: <string or null>
name: <string>
data_type: <type identifier>
E: [<dimension 1>, <dimension 2>, ...]
M:
signature: <MCAR | MAR | MNAR>
determinants: [<determinant 1>, ...]
op: <operator catalog identifier>
dna:
<DNA declaration; see below>
5.5.1 Required and optional fields¶
For most ColumnSpecs, all fields above are declared. Some are auto-derivable:
- src_name: defaults to name if not declared. Set to null for computed singletons with no backend column.
- M: auto-derived for grain-role columns where E = {c} (set to MNAR with determinants = [c]).
- op: for grain-role columns, set to OBSERVED if not declared.
- dna: for root columns (where DNA is self-referential), the framework can recognize the root and auto-derive DNA from (name, E, op) if not explicitly declared.
Engineers may declare auto-derivable fields explicitly for clarity; the framework treats explicit and auto-derived values identically as long as they agree.
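The defaulting rules above can be sketched as follows (plain Python, illustrative only — the framework's actual derivation is specified here in prose; the grain-role test E == [name] is our simplification):

```python
def apply_defaults(spec):
    """Fill auto-derivable ColumnSpec fields, leaving explicit
    declarations untouched."""
    spec = dict(spec)
    spec.setdefault("src_name", spec["name"])
    if spec["E"] == [spec["name"]]:  # grain-role column: E = {c}
        spec.setdefault("M", {"signature": "MNAR",
                              "determinants": [spec["name"]]})
        spec.setdefault("op", "OBSERVED")
    # Root columns get self-referential DNA from (name, E, op).
    spec.setdefault("dna", {"name": spec["name"], "E": spec["E"],
                            "op": spec["op"]})
    return spec

grain = apply_defaults({"name": "customer", "data_type": "integer",
                        "E": ["customer"]})
assert grain["src_name"] == "customer"
assert grain["op"] == "OBSERVED"
assert grain["M"] == {"signature": "MNAR", "determinants": ["customer"]}
assert grain["dna"] == {"name": "customer", "E": ["customer"],
                        "op": "OBSERVED"}
```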
5.5.2 The dna field¶
For a non-root column, dna declares the predecessor metric:
dna:
  name: <predecessor family-name>
  E: [<predecessor E>]
  op: <predecessor op>
For a root column, dna is self-referential:
dna:
name: <name_self> # same as the column's own name
E: [<E_self>] # same as the column's own E
op: <op_self> # same as the column's own op
For multi-input operations producing singletons (per §3.5.6), dna is a list of predecessor declarations:
dna:
  - name: <predecessor 1 family-name>
    E: [<predecessor 1 E>]
    op: <predecessor 1 op>
  - name: <predecessor 2 family-name>
    E: [<predecessor 2 E>]
    op: <predecessor 2 op>
5.5.3 Examples¶
A grain-role column:
- column_spec:
src_name: customer_id
name: customer
data_type: integer
E: [customer]
# M auto-derived: MNAR with determinants = [customer]
# op auto-derived: OBSERVED
# dna auto-derived: self-referential
An AC-attribute:
- column_spec:
src_name: customer_name
name: customer_name
data_type: string
E: [customer]
M:
signature: MCAR
determinants: []
op: OBSERVED
# dna auto-derived: self-referential (root)
An AC-metric root with explicit ip_reducer commitment:
- column_spec:
src_name: amount
name: revenue
data_type: numeric
E: [transaction]
M:
signature: MCAR
determinants: []
op: SUM
dna:
name: revenue
E: [transaction]
op: SUM
A sibling of revenue at coarser grain:
- column_spec:
src_name: total_revenue
name: revenue
data_type: numeric
E: [store, month]
M:
signature: MCAR
determinants: []
op: SUM
dna:
name: revenue
E: [transaction]
op: SUM
A derived column in a different family (peak_revenue):
- column_spec:
src_name: peak_daily_revenue
name: peak_revenue
data_type: numeric
E: [store, month]
M:
signature: MCAR
determinants: []
op: MAX
dna:
name: revenue
E: [store, day]
op: SUM
A singleton ratio:
- column_spec:
src_name: null
name: revenue_per_unit
data_type: numeric
E: [transaction]
M:
signature: MCAR
determinants: []
op: MAP_DIV
dna:
- name: revenue
E: [transaction]
op: SUM
- name: units_sold
E: [transaction]
op: SUM
5.6 The naming function declaration¶
The naming function declaration is at the AC level, in the top-level naming_function section.
The AC author chooses one of four options.
5.6.1 Option 1: Adopt the operator catalog's default¶
The AC adopts whatever default naming function entries the operator catalog provides (Chapter 10 §10.4 and §10.5 list the catalog's default entries). ColumnSpec names are verified against the catalog defaults.
5.6.2 Option 2: Override per operator¶
The AC starts from catalog defaults and overrides specific entries. The format of the overrides is the AC author's choice; the framework calls the AC's declared naming function as a black box (per §3.7.2).
5.6.3 Option 3: Custom naming function¶
The AC declares a function from scratch. The implementation may be a registered function, a declarative table, an external service, or anything else the AC's runtime can call. The framework calls it as a black box.
5.6.4 Option 4: Decline structured naming¶
The AC declines to have a naming function. The framework verifies family-genealogy consistency through DNA only; names are free labels with no enforced relationship to operators.
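Because the framework calls the declared naming function as a black box (per §3.7.2), any callable from (op, base family-name) to an expected derived name works. A minimal sketch of options 1–3 (illustrative only; the catalog-default entries shown are assumptions consistent with the manual's peak_revenue example, not the catalog's actual table):

```python
# Assumed catalog defaults, for illustration: MAX derives "peak_<base>".
CATALOG_DEFAULTS = {"SUM": "{base}", "MAX": "peak_{base}"}

def make_naming_function(overrides=None):
    """Option 1 when called with no overrides; option 2 with a partial
    override table; option 3 would replace the table entirely."""
    table = {**CATALOG_DEFAULTS, **(overrides or {})}
    def naming(op, base):
        return table[op].format(base=base)
    return naming

default = make_naming_function()
assert default("MAX", "revenue") == "peak_revenue"   # matches §5.5.3

custom = make_naming_function({"MAX": "max_{base}"})  # option 2 override
assert custom("MAX", "revenue") == "max_revenue"
```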
5.7 The FD-DAG declaration¶
The fd_dag section declares the candidate FD-DAG (Foundations §2.8.2):
fd_dag:
- source: <AC-dimension>
target: <AC-dimension>
channel: <declaration channel>
[additional fields per channel]
5.7.1 FD edge declaration channels¶
Each FD-edge has a channel field indicating how the edge is supported:
- declared: the engineer asserts the FD as part of the schema's structure. The data-driven FD-DAG must attest it.
- reference_table: the FD is sourced from a specified reference table.
- computed: the FD is computable from a function over the source value (e.g., date → year via the YEAR_OF function).
5.7.2 Declared FDs¶
A declared FD-edge:
- source: store
  target: region
  channel: declared
The framework attests this edge at DQ Phase 3 by checking the data-driven FD-DAG. If the data does not attest the edge, the integrity condition Logical FD-DAG ⊆ Data-driven FD-DAG fails.
5.7.3 Reference-table FDs¶
A reference-table FD-edge:
- source: store
  target: region
  channel: reference_table
  table: stores
The FD is sourced from a specific reference table (the stores schema, in this example). The framework verifies the FD against the reference table's data.
5.7.4 Computed FDs¶
A computed FD-edge:
- source: date
  target: year
  channel: computed
  mapping: YEAR_OF(date)
The FD is computable from a function over the source value. The framework verifies that the function consistently produces the target value across observed source values.
5.7.5 Multi-step FDs¶
The FD-DAG is closed under transitivity at the framework level — the framework infers transitive FDs from declared edges. Engineers do not need to declare every transitive FD explicitly; the framework computes the closure.
For example, with declared edges store → region and region → country, the framework infers store → country without needing it to be declared.
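The closure computation is ordinary transitive closure over the edge set; a sketch in plain Python (illustrative only, not the framework's implementation):

```python
def fd_closure(edges):
    """Transitive closure of FD edges given as (source, target) pairs:
    if a -> b and b -> c are present, infer a -> c, to a fixpoint."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

assert ("store", "country") in fd_closure(
    {("store", "region"), ("region", "country")})

# The example AC's time chain (§5.10) yields date -> year without
# that edge ever being declared.
time_edges = {("date", "month"), ("month", "quarter"), ("quarter", "year")}
assert ("date", "year") in fd_closure(time_edges)
```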
5.8 The instructions section¶
The instructions section contains directives that modify DQ's default behavior.
5.8.1 trust_declared_FD¶
Asserts that specific declared FD-edges are trusted, bypassing data attestation:
instructions:
- directive: trust_declared_FD
edges:
- source: store_id
target: region_id
rationale: "Maintained by upstream ETL; data attestation may show transient violations during loads."
The framework records the trust declaration but does not bypass the data-driven FD-DAG attestation: it still computes the data-driven FD-DAG and surfaces violations as advisories (rather than hard violations) for trusted edges. Engineers can review the advisories.
5.8.2 Other instructions¶
Additional instructions may be defined per the framework's evolving capabilities. The framework treats unknown instructions as advisory (logged but not acted upon) and does not fail AC validation due to unknown instructions.
5.9 What's required at schema.init time vs. AC validation time¶
Some declarations are required at schema.init time (i.e., when the engineer first submits to DQ); others are required at AC validation time (i.e., when DQ has converged and the AC is ready for query workloads).
5.9.1 Required at schema.init time¶
- ac_name (the AC's identifier).
- The collection of virtual tables with schema_name and source for each.
- For each ColumnSpec: at least src_name (or default to name), name, data_type, and E. The framework can begin Phase 1 verification with this minimum.
5.9.2 Required at AC validation time¶
- All fields specified in §5.5 for every ColumnSpec (with auto-derivations as appropriate).
- The naming function declaration (one of the four options in §5.6).
- All FD-edges the engineer intends to assert.
- All instructions the engineer intends to apply.
The framework allows iteration: an engineer can submit an incomplete schema.init for initial DQ feedback, then refine based on DQ output. AC validation only succeeds when all required fields are present and all integrity conditions hold.
5.10 A complete schema.init example¶
Below is a complete schema.init for the retail running example.
schema_init:
ac_name: retail_analytics_v1
ac_description: |
Retail analytics AC covering transactions, stores, customers, and
pre-aggregated monthly summaries.
naming_function: catalog_default
collection:
- virtual_table:
schema_name: customers
source:
        backend: warehouse_main
        physical_table: customers
      column_specs:
        - column_spec:
            src_name: customer_id
            name: customer
            data_type: integer
            E: [customer]
        - column_spec:
            src_name: customer_name
            name: customer_name
            data_type: string
            E: [customer]
            M:
              signature: MCAR
              determinants: []
            op: OBSERVED
        - column_spec:
            src_name: customer_segment
            name: customer_segment
            data_type: string
            E: [customer]
            M:
              signature: MCAR
              determinants: []
            op: OBSERVED
  - virtual_table:
      schema_name: stores
      source:
        backend: warehouse_main
        physical_table: stores
      column_specs:
        - column_spec:
            src_name: store_id
            name: store
            data_type: integer
            E: [store]
        - column_spec:
            src_name: region_id
            name: region
            data_type: integer
            E: [store]
            M:
              signature: MCAR
              determinants: []
            op: OBSERVED
  - virtual_table:
      schema_name: transactions
      source:
        backend: warehouse_main
        physical_table: transactions
      column_specs:
        - column_spec:
            src_name: transaction_id
            name: transaction
            data_type: integer
            E: [transaction]
        - column_spec:
            src_name: customer_id
            name: customer
            data_type: integer
            E: [transaction]
            M:
              signature: MAR
              determinants: [transaction]
            op: OBSERVED
        - column_spec:
            src_name: store_id
            name: store
            data_type: integer
            E: [transaction]
            M:
              signature: MCAR
              determinants: []
            op: OBSERVED
        - column_spec:
            src_name: amount
            name: revenue
            data_type: numeric
            E: [transaction]
            M:
              signature: MCAR
              determinants: []
            op: SUM
            dna:
              name: revenue
              E: [transaction]
              op: SUM
  - virtual_table:
      schema_name: store_monthly_summary
      source:
        backend: warehouse_main
        physical_table: store_revenue_monthly
      column_specs:
        - column_spec:
            src_name: store_id
            name: store
            data_type: integer
            E: [store]
        - column_spec:
            src_name: month
            name: month
            data_type: date
            E: [month]
        - column_spec:
            src_name: total_revenue
            name: revenue
            data_type: numeric
            E: [store, month]
            M:
              signature: MCAR
              determinants: []
            op: SUM
            dna:
              name: revenue
              E: [transaction]
              op: SUM
        - column_spec:
            src_name: peak_daily_revenue
            name: peak_revenue
            data_type: numeric
            E: [store, month]
            M:
              signature: MCAR
              determinants: []
            op: MAX
            dna:
              name: revenue
              E: [store, day]
              op: SUM
fd_dag:
  - source: store
    target: region
    channel: reference_table
    table: stores
  - source: date
    target: month
    channel: computed
    mapping: MONTH_OF(date)
  - source: month
    target: quarter
    channel: computed
    mapping: QUARTER_OF_MONTH(month)
  - source: quarter
    target: year
    channel: computed
    mapping: YEAR_OF_QUARTER(quarter)
attestation:
  enabled: true
  failure_mode: soft
instructions:
  - directive: trust_declared_FD
    edges:
      - source: store
        target: region
    rationale: |
      Maintained by upstream ETL; transient violations during
      store-region reorganizations.
This schema.init describes a four-schema AC. The customers and stores schemas are reference tables; transactions is a fact schema at transaction grain; store_monthly_summary is a pre-aggregated fact schema with two metric columns at (store, month) grain.
The AC adopts the operator catalog's default naming function. The candidate FD-DAG declares store → region (sourced from the stores reference table), and date → month → quarter → year (computed from date functions). The instruction trusts the store → region declaration despite potential ETL-related transient violations.
The attestation block uses defaults explicitly for clarity (the same behavior would result from omitting the block entirely): per-DNA-edge attestation is enabled, with soft failure mode — coherence violations between revenue at transaction grain and revenue at (store, month) grain become advisories with query-result annotations rather than hard validation failures.
DQ processes this schema.init, verifies the declarations against data via the backend data-API, and produces refinements. Specifically, DQ verifies:
- The schemas' grain integrity.
- The FD-DAG attestation against data.
- The metric anchorings against data (for revenue at transaction grain, for revenue and peak_revenue at (store, month) grain).
- The cross-schema value-mapping consistency for AC-dimensions and AC-attributes.
- The naming-function consistency for derived columns (peak_revenue's name verified against f_MAX(revenue, *) per the catalog default).
- Cross-schema metric coherence per attestable DNA edge: revenue at transaction grain summed to (store, month) is compared against revenue at (store, month) in store_monthly_summary; peak_revenue at (store, month) is compared against the MAX, over days, of revenue at (store, day) grain, per its DNA edge. Disagreements surface as advisories per the soft failure mode.
5.11 What schema.init is not¶
schema.init is the input to DQ, not the final AC. The final AC includes verification results, the data-driven FD-DAG with attestation status, coverage maps, and other DQ-produced artifacts. The final AC's representation may be a separate file format or an enriched schema.init; this document does not specify the final AC's format.
schema.init is also not a query specification. It declares the AC's structural commitments. Queries against the AC (Frame-QL, Chapter 8) reference the verified AC, not schema.init.
schema.init is also not a backend-specific format. The framework's tooling consumes schema.init independent of backend; the data-API protocol (Chapter 6) handles backend-specific concerns.
5.12 Where to go next¶
After reading this chapter, the natural next chapters are:
- Chapter 6: Data-API Protocol — the backend interface DQ uses to fetch quasi-metadata referenced in this chapter.
- Chapter 7: Data Quality and Structural Verification — the DQ process specification, including how the framework consumes schema.init and produces the verified AC.
- Chapter 3: ColumnSpec and Naming Machinery — the structural details of ColumnSpec referenced throughout this chapter.
Chapter 6: Data-API Protocol¶
The protocol the framework uses to call backends for introspection and verification.
6.1 Overview¶
This chapter specifies the data-API protocol: the interface backends implement to support Coframe Core's DQ process and query execution. The data-API is consumed by the framework's tooling; engineers do not write data-API calls directly.
The chapter is organized as follows:
- §6.2 specifies connection and identification.
- §6.3 specifies introspection operations.
- §6.4 specifies verification operations used during DQ.
- §6.5 specifies filter and projection support.
- §6.6 specifies aggregation operations used during query execution.
- §6.7 specifies error handling and diagnostics.
- §6.8 specifies backend-specific extensions.
- §6.9 describes the reference backend implementations.
- §6.10 lists what the data-API doesn't include.
The chapter assumes familiarity with the Foundations chapter and the schema.init format chapter.
6.1.1 Input-shape flexibility¶
The data-API protocol is the framework's input-shape abstraction. A backend qualifies to host a Coframe Core AC when it can:
- Expose data series with (name, entity) declarations — equivalently, when its underlying storage can be presented as a collection of ColumnSpecs with declared anchorings.
- Respond to operator applications (the verification and aggregation operations specified in §6.4 and §6.6).
- Honor the framework's missing-value treatment per the predecessor's signature when computing aggregations.
Backends meeting these requirements can host an AC regardless of whether their underlying storage is relational (rows and columns), columnar (Parquet, Arrow), key-value, document-oriented, or other structured form. The data-API speaks in terms of data series and operator responses, not in terms of tables-and-rows; relational engines are the most common implementation today, but they are one implementation, not the framework's requirement.
This is a deliberate framing. Coframe Core's commitment is to tabular-output queries over input-flexible backends: Frame-QL produces tabular results (frames are rectangular row-sets), but the data those results are derived from need not be rectangular at the source. Any backend that can present its data through the column-spec / data-series-spec abstraction can participate.
In v1.0, reference implementations exist for Polars and DuckDB (both relational engines, both excellent for tabular analytics on file-backed sources). The data-API protocol is specified to admit other backend types — including non-relational engines that wrap their underlying data in the column-spec abstraction — though such backends are not v1.0 deliverables.
6.2 Connection and identification¶
6.2.1 Backend identifier¶
Each Coframe Core AC has a backend identifier. The identifier is a string (e.g., warehouse_main). The data-API uses the identifier to route calls to the appropriate backend.
In Coframe Core, an AC has exactly one backend identifier. Multi-backend ACs are Coframe Pro territory.
6.2.2 Connection establishment¶
The framework establishes a connection to the backend at AC-load time. Connection parameters (host, credentials, database, schema namespace, etc.) are configured at framework deployment, not in schema.init.
The backend must support the data-API protocol's required operations (§6.3, §6.4, §6.5). Optional operations (§6.6, §6.8) may or may not be supported per backend.
6.2.3 Backend capabilities¶
The framework queries the backend at connection time for capability advertisement:
- Supported data types and their mappings to the framework's recognized types (per ColumnSpec chapter §3.3.2).
- Supported operations (introspection, verification, aggregation).
- Backend-specific extensions.
Backend capabilities are advertised; the framework's tooling adapts to what each backend supports. Backends that don't support specific operations may fall back to alternative implementations (e.g., a verification operation implemented by sampling rather than full enumeration).
6.3 Introspection operations¶
These operations let the framework discover the backend's structure.
6.3.1 List tables¶
Returns the list of tables in the backend's namespace.
Operation: list_tables(namespace?: string) → list of table_name
The framework uses list_tables during AC authoring (Phase 1, Chapter 4) to surface candidate tables for inclusion in the AC.
6.3.2 Get table schema¶
Returns the column-level schema of a table.
Operation: get_table_schema(table: string) → list of column descriptors
Each column descriptor includes:
- The column's physical name (src_name).
- The column's backend data type, mapped to the framework's recognized types.
- Nullability indicator (whether the backend allows NULL in this column).
- Backend-specific column-level metadata (precision, length, etc.).
The framework uses get_table_schema to validate that ColumnSpec declarations match the backend's actual columns and to detect discrepancies (e.g., a ColumnSpec declares data_type: numeric but the backend's column is varchar).
6.3.3 Verify column existence¶
Returns whether a specific column exists in a table.
Operation: verify_column_existence(table: string, column: string) → boolean
Used as a sanity check before issuing operations on columns.
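The three introspection operations are small enough to sketch end-to-end. The following illustration uses Python's stdlib sqlite3 engine purely as a stand-in backend (the shipped reference backends are Polars and DuckDB); method names follow this chapter, and the descriptor fields are a simplified subset:

```python
import sqlite3


class SQLiteIntrospection:
    """Illustrative stand-in implementing the §6.3 operations over sqlite3."""

    def __init__(self, con: sqlite3.Connection):
        self.con = con

    def list_tables(self) -> list[str]:
        rows = self.con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
        return [r[0] for r in rows]

    def get_table_schema(self, table: str) -> list[dict]:
        # PRAGMA table_info rows: (cid, name, type, notnull, default, pk)
        rows = self.con.execute(f"PRAGMA table_info({table})").fetchall()
        return [{"src_name": r[1], "data_type": r[2], "nullable": not r[3]}
                for r in rows]

    def verify_column_existence(self, table: str, column: str) -> bool:
        return any(c["src_name"] == column
                   for c in self.get_table_schema(table))
```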
6.4 Verification operations¶
These operations support DQ Phase 2 (quasi-metadata fetch) and Phase 3 (quasi-metadata-derived verification).
6.4.1 Distinct values¶
Returns the distinct values observed in a column.
Operation: get_distinct_values(table: string, column: string, filter?: string) → set of values
The optional filter restricts the operation to rows matching a filter expression (per §6.5).
The framework uses get_distinct_values to compute per-AC-dimension per-schema observed value sets and per-AC-attribute per-schema observed value sets — essential quasi-metadata for DQ Phase 3.
For large value sets, the framework may use sampling-based variants (§6.4.9).
6.4.2 Pair value-mapping¶
Returns the (c1-value, c2-value) pairs observed in a table.
Operation: get_pair_mapping(table: string, c1: string, c2: string, filter?: string) → set of (value, value) pairs
The framework uses pair value-mapping to test functional dependencies (does each c1-value map to one c2-value?) and to verify cross-schema value-mapping consistency.
6.4.3 Multi-column pair mapping¶
Generalization of pair value-mapping to multiple columns.
Operation: get_multi_column_pair_mapping(table: string, columns: list of string, filter?: string) → set of tuples
The framework uses this for grain integrity verification (the grain-role columns' value tuples are unique per row) and for FD verification involving composite sources.
6.4.4 Missing counts¶
Returns the count of rows where a specific column has missing values.
Operation: get_missing_counts(table: string, column: string, filter?: string) → integer
The framework uses missing counts to compute per-column missingness rates and to verify that grain-role columns have no missing values (missing values in grain-role columns are a hard integrity violation).
6.4.5 Grain integrity¶
Verifies that the grain-role columns' value tuples are unique per row.
Operation: verify_grain_integrity(table: string, grain_columns: list of string) → result with violations or empty
If violations exist, the operation returns the violating tuples (or a sample of them) for engineer review.
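On a SQL-capable backend, verify_grain_integrity reduces to a GROUP BY ... HAVING query over the grain-role columns. A sketch against the stdlib sqlite3 engine (illustrative only; a real backend would return a structured result and cap the violation sample):

```python
import sqlite3


def verify_grain_integrity(con: sqlite3.Connection, table: str,
                           grain_columns: list[str]) -> list[tuple]:
    """Return the duplicated grain tuples with their row counts.

    An empty list means the grain holds (each tuple is unique per row).
    """
    cols = ", ".join(grain_columns)
    return con.execute(
        f"SELECT {cols}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1").fetchall()
```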
6.4.6 Functional-dependency test¶
Tests whether one column functionally determines another in the data.
Operation: test_functional_dependency(table: string, source: string, target: string, filter?: string) → result
The result indicates:
- Whether the FD holds (each source-value maps to at most one target-value).
- If not: the violating source-values and the multiple target-values they map to.
The framework uses this to attest declared FDs (Logical FD-DAG ⊆ Data-driven FD-DAG verification at DQ Phase 3).
6.4.7 Metric anchoring inference¶
Tests whether a metric column is functionally determined by a candidate set of dimensions.
Operation: test_metric_anchoring(table: string, metric: string, candidate_anchor: list of string, filter?: string) → result
The result indicates whether the metric is a function of the candidate anchor set in the data — i.e., whether each combination of values in candidate_anchor determines a unique metric value.
The framework uses this for metric anchoring inference: testing candidate anchorings against data to find the smallest data-attested anchoring for each AC-metric.
6.4.8 Cross-schema value-set check¶
Verifies that the value-set of a column in one schema is consistent with the universe-wide value-set or with another schema's value-set.
Operation: cross_schema_value_set_check(table_a: string, column_a: string, table_b: string, column_b: string) → result
The result reports whether the value-sets agree, what's missing in each, etc. Used for cross-schema consistency verification.
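Once the two value-sets are fetched (e.g., via get_distinct_values), the comparison itself is pure set algebra. A minimal sketch, with illustrative result-field names:

```python
def cross_schema_value_set_check(values_a, values_b) -> dict:
    """Compare two already-fetched observed value sets.

    Field names are hypothetical; the real operation returns a richer result.
    """
    a, b = set(values_a), set(values_b)
    return {
        "agree": a == b,
        "missing_in_a": b - a,   # values observed in B but not in A
        "missing_in_b": a - b,   # values observed in A but not in B
    }
```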
6.4.9 Sampling-based variants¶
For large tables, the framework supports sampling-based variants of the verification operations:
Operation: get_distinct_values_sampled(table: string, column: string, sample_fraction: float) → set of values
These return approximate results based on sampled rows. The framework uses them when full enumeration would be prohibitively expensive. Results are annotated as approximate.
6.4.10 DNA edge attestation¶
Verifies that the predecessor's data, aggregated via the column's op at the successor's anchor, agrees with the successor's observed values. Used by DQ Phase 3 for per-DNA-edge value attestation (Chapter 7 §7.6.8).
Operation: attest_dna_edge(predecessor_spec, successor_spec, op, missing_treatment, scope, epsilon, sampling) → result
Parameters:
- predecessor_spec: the predecessor schema, physical column name, and anchor (E_pred).
- successor_spec: the successor schema, physical column name, and anchor (E_self).
- op: the family's ip_reducer (e.g., SUM, MAX).
- missing_treatment: the operator-and-signature-derived missing-value treatment (per the operator catalog and the predecessor's M). Backends compute aggregation honoring this treatment, not naive SQL semantics.
- scope: the intersection of the two schemas' declared scopes (filter expressions per §6.5.1).
- epsilon: numeric tolerance — {relative: float, absolute: float} per §7.6.8.4.
- sampling: optional {fraction: float, min_rows_per_stratum: int, confidence_target: float}; absent for full attestation.
Result fields:
- status: passed | failed_with_deltas | unattestable | sampled_passed | sampled_failed.
- n_keys_compared: count of E_self keys present in both schemas' projections.
- n_keys_disagreement: count of keys where derived and asserted values differ beyond epsilon.
- n_rows_predecessor_only: count of E_self keys present in the predecessor projection but not the successor.
- n_rows_successor_only: count of E_self keys present in the successor but not the predecessor.
- top_disagreements: a sample (top-N by |delta|, default N=10) of disagreeing keys with their derived value, asserted value, and delta.
- derived_annotations: any annotations attached to the derived aggregation (partial-coverage, bias-warning, substitution-applied).
- sampling_metadata: when sampled, the actual sample size per stratum and computed confidence.
- runtime_ms: backend-reported runtime.
Backends are expected to:
- Compute aggregation honoring the operator catalog's missing-value semantics for the operator-and-signature combination.
- Apply scope filters before aggregation.
- Implement sampling via stratified sampling when sampling is present, with strata defined by E_self keys in the successor's projection.
- Return unattestable status when the predecessor's data is not accessible (rather than raising an error).
The framework's DQ machinery composes per-edge results into the AC's verification status (§7.6.8.9). Backends are not responsible for the AC-level failure-mode logic; they return per-edge facts.
For backends that cannot efficiently compute scope-aware aggregation honoring the operator catalog's missing-value semantics, a backend-capability flag (per §6.2.3) declares partial attestation support; the framework falls back to a less-rigorous verification mode and records the limitation in the verification status.
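The core of attest_dna_edge for a SUM edge is: aggregate the predecessor to E_self, join to the successor on the anchor keys, and count disagreements beyond epsilon. A deliberately simplified sketch over the stdlib sqlite3 engine (no scope filters, missing-value treatment, or sampling; the function name and result fields follow the subset shown above, but this is not the shipped backend code):

```python
import sqlite3


def attest_sum_edge(con: sqlite3.Connection, pred_table: str, succ_table: str,
                    key: str, pred_col: str, succ_col: str,
                    eps_rel: float = 1e-9, eps_abs: float = 0.0) -> dict:
    """Compare SUM of the predecessor at the successor's anchor vs asserted values."""
    rows = con.execute(
        f"WITH derived AS (SELECT {key} AS k, SUM({pred_col}) AS v "
        f"FROM {pred_table} GROUP BY {key}) "
        f"SELECT d.k, d.v, s.{succ_col} FROM derived d "
        f"JOIN {succ_table} s ON d.k = s.{key}").fetchall()
    # A key disagrees when |derived - asserted| exceeds the epsilon envelope.
    deltas = [(k, dv, av) for k, dv, av in rows
              if abs(dv - av) > max(eps_abs, eps_rel * abs(av))]
    return {
        "status": "passed" if not deltas else "failed_with_deltas",
        "n_keys_compared": len(rows),
        "n_keys_disagreement": len(deltas),
        "top_disagreements": sorted(deltas, key=lambda t: abs(t[1] - t[2]),
                                    reverse=True)[:10],
    }
```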
6.5 Filter and projection support¶
The data-API operations accept optional filter and projection parameters.
6.5.1 Filter expressions¶
Filter expressions restrict operations to rows matching a predicate.
Format: SQL-style boolean expression over backend column names.
Examples:
- status = 'completed'
- region = 'west' AND year = 2026
- date >= '2026-01-01' AND date < '2026-04-01'
Backends translate filter expressions to backend-native query syntax. The framework's tooling assembles filter expressions from declared scope, virtual splitting, and DQ-specific conditions.
6.5.2 Projection specification¶
For operations returning row data, the framework can specify which columns to project. This is used for efficiency: backends compute only the needed columns rather than full rows.
6.6 Aggregation operations (for query execution)¶
These operations support query execution. They are consumed by the query resolver (Chapter 9) when executing Frame-QL queries.
6.6.1 GROUP BY aggregation¶
Groups rows by specified columns and applies aggregation functions.
Operation: aggregate(table: string, group_by: list of string, aggregations: list of (operator, source_column, alias), filter?: string) → result rows
Example:
aggregate(
table='transactions',
group_by=['region', 'year'],
aggregations=[
(SUM, 'amount', 'total_revenue'),
(COUNT_DISTINCT, 'customer_id', 'customer_count')
],
filter="year = 2026"
)
6.6.2 Reducer support¶
The data-API supports the reducers in Coframe Core's operator catalog (Chapter 10 §10.4): SUM, AVG, MAX, MIN, COUNT(*), COUNT(c), COUNT_DISTINCT, MEDIAN, MODE, FIRST/LAST, STDEV, VARIANCE, CONCAT_AGG, ARRAY_AGG, BOOL_AND/OR, BIT_AND/OR/XOR, APPROX_DISTINCT, APPROX_PERCENTILE.
Backends translate framework-named reducers to backend-native equivalents. Backends that don't support specific reducers indicate so via the capability advertisement (§6.2.3); the framework's resolver may then exclude operations requiring those reducers, or fall back to alternative implementations.
6.6.3 Mapper support¶
The data-API supports the mappers in Coframe Core's function catalog (Chapter 10 §10.5): arithmetic, comparison, logical, string, date/time, type conversion, missing-value handling, conditional.
6.6.4 Multi-table operations¶
For queries spanning multiple schemas (Rung 7 of Frame-QL), the framework's resolver assembles operations involving multiple tables. The data-API supports:
Operation: aggregate_with_join(tables: list of (table, join_condition), group_by: list of string, aggregations: list of ..., filter?: string) → result rows
Backends translate this to backend-native join queries.
6.7 Error handling and diagnostics¶
6.7.1 Operation results¶
Each data-API operation returns a result that includes:
- Status: success, error, partial.
- Result data (if success or partial): the operation's output.
- Error information (if error or partial): the error type, message, and any diagnostic information.
6.7.2 Common error conditions¶
- Table not found: the requested table doesn't exist in the backend's namespace.
- Column not found: the requested column doesn't exist in the table.
- Type mismatch: the operation isn't applicable to the column's data type.
- Backend error: the backend itself reported an error (e.g., timeout, permission denied, syntax error).
- Capability not supported: the backend doesn't support the requested operation.
6.7.3 Backend-error pass-through¶
When the backend itself reports an error, the data-API passes the backend's error message through to the framework. The framework includes this in DQ output and error diagnostics so engineers can address backend-side issues.
6.8 Backend-specific extensions¶
Backends may support operations beyond the protocol's required set. For example:
- Statistics-based estimation: backends with cost-based optimizers may expose estimated cardinalities, histograms, etc., for the framework to use in query planning.
- Native sketch types: backends supporting HyperLogLog, t-digest, or similar sketches may expose sketch-aware operations the framework can use for approximate distinct/quantile computation.
- Materialized view awareness: backends with materialized view machinery may inform the framework about views available for query rewriting.
Backend-specific extensions are exposed via the capability advertisement (§6.2.3) and consumed by the framework's tooling when relevant.
6.9 Backend implementations¶
The Coframe Core project ships with two reference backend implementations:
- coframe-polars: a Polars-based backend operating on local Parquet/CSV files or Polars DataFrames.
- coframe-duckdb: a DuckDB-based backend operating on DuckDB tables.
Both implementations support the full data-API protocol's required operations. They differ in performance characteristics, connection model, and supported backend-specific extensions.
Engineers may implement additional backends (e.g., for Snowflake, BigQuery, or Postgres) by implementing the data-API protocol. The protocol specification (this chapter) is the contract.
6.10 What the data-API doesn't include¶
The data-API is scoped specifically for DQ verification and query execution. It does not include:
- Data transformation operations: ETL, schema migration, data loading. These are upstream of Coframe Core; the framework consumes data after these are done.
- Backend-specific authentication or access control: these are configured at framework deployment, outside the data-API protocol.
- Cross-backend operations: in Coframe Core, each AC has one backend; cross-backend operations are Coframe Pro territory.
- Backend-specific query optimization hints: the framework's resolver makes optimization choices independent of backend-specific hint syntax.
6.11 Coframe Core vs. Coframe Pro¶
The data-API protocol is largely the same in Coframe Core and Coframe Pro. Coframe Pro extends the protocol with:
- Multi-backend operations: handling queries spanning schemas across multiple backends.
- Sketch-typed columns: native support for HLL, t-digest, and similar sketches as first-class column types.
- Custom operator semantics: backends supporting custom operators (registered by AC authors per Coframe Pro) expose the operator's semantics via the data-API.
For Coframe Core, the protocol is specified in this chapter and is sufficient for Coframe Core's scope.
6.12 Where to go next¶
After reading this chapter, the natural next chapters are:
- Chapter 7: Data Quality and Structural Verification — the DQ process that uses the data-API to verify schema.init declarations against data.
- Chapter 9: Query Resolution — how Frame-QL queries route to data-API aggregation operations.
- Chapter 5: schema.init Format — the engineer's authoring artifact that drives DQ's data-API calls.
Chapter 7: Data Quality and Structural Verification¶
The framework's process for verifying schema.init declarations against backend data and producing a verified Analytics Collection.
7.1 Overview¶
This chapter specifies the Data Quality and Structural Verification (DQ) process. DQ is the framework's procedure for taking the engineer's schema.init and producing a verified AC plus a structural-verification deliverable.
The chapter is operational: it describes what DQ does, in what order, with what inputs and outputs, and how engineers respond to its findings.
The chapter assumes familiarity with the Foundations chapter (Chapter 2), the ColumnSpec chapter (Chapter 3), the schema.init Format chapter (Chapter 5), and the Data-API Protocol chapter (Chapter 6).
The chapter is organized in thirteen sections:
- §7.2 frames DQ as a process.
- §7.3 specifies the three phases of DQ.
- §7.4 specifies Phase 1: metadata-only verification.
- §7.5 specifies Phase 2: quasi-metadata fetch.
- §7.6 specifies Phase 3: quasi-metadata-derived verification.
- §7.7 specifies what's verified, what's verified-with-opt-out, and what remains genuinely asserted.
- §7.8 specifies handling of missing data on AC-dimensions.
- §7.9 specifies the DQ deliverable.
- §7.10 specifies the DQ iteration cycle.
- §7.11 specifies the three lenses on schema validity.
- §7.12 frames DQ's broader positioning.
- §7.13 specifies AC Verification Levels (A / AA / AAA) — the AC's verification status as an ordinal level, composed from the integrity conditions. The level definitions admit both empirical (data-attested) and deductive (verified-by-construction) grounding sources for structural commitments.
7.2 DQ as process¶
DQ has:
- One input: schema.init (per Chapter 5). The engineer's commitments about how the AC should be structured.
- One operational dependency: the backend's data-API (per Chapter 6). DQ calls the data-API to fetch the data needed for verification. The data-API is not a framework input; it is the framework's mechanism for accessing what the backend exposes.
- Multiple outputs: a verified schema (refined schema.init with verification status), a structural-verification deliverable, AC-level integrity status, and any advisories or violations surfaced.
DQ's role: reconcile what the engineer commits to (schema.init) with what the data attests (via data-API), producing a coherent AC.
The consistency constraints DQ checks are not artifacts of the Coframe Core framework. They are constraints on data and metadata that must hold regardless of what analytical tool is being used. If a declared functional dependency fails against data, that breaks any analytical reasoning that assumes the dependency. If cross-schema integrity is violated, queries on either schema inherit the inconsistency. The framework's contribution is to articulate these constraints as first-class checks, with specific diagnostics, in a phase where engineers can address them deliberately.
For this reason, DQ work has analytical value beyond AC creation. A clean structural-verification deliverable over a backend warehouse is informative in its own right — it tells the engineer what their data attests, regardless of whether they ultimately deploy the AC for query workloads.
7.3 The three phases of DQ¶
DQ proceeds through three phases:
- Phase 1 — Metadata-only verification: checks that schema.init's declarations are internally consistent. Operates on schema.init alone, without calling the data-API. Produces a set of structural-rule violations.
- Phase 2 — Quasi-metadata fetch: calls the data-API to fetch the minimum data information needed for principle-verification. Produces quasi-metadata (per §7.5) that supports subsequent verification.
- Phase 3 — Quasi-metadata-derived verification: uses the fetched quasi-metadata to verify integrity conditions, infer derivable facts (data-driven FD-DAG, metric anchorings), produce the coverage map, and run per-DNA-edge value attestation for cross-schema metric coherence (§7.6.8). Produces violations, advisories, and the structural-verification deliverable.
The framework runs Phase 1 first; if Phase 1 fails (hard structural violations in schema.init), Phase 2 doesn't run. If Phase 1 passes, Phase 2 fetches quasi-metadata; Phase 3 then uses it for verification, including attestation.
Per-DNA-edge value attestation is enabled by default in Coframe Core; engineers may opt out per AC. The opt-out is recorded and propagated to query-result annotations (per §7.6.8.11). MTI's status as theorem (default config) or conditional guarantee (opted-out config) follows from this configuration.
The phases produce a unified output: the structural-verification deliverable plus violations, advisories, and AC-level status.
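The phase sequencing can be summarized as a small driver: Phase 1 gates the rest, and quasi-metadata flows from Phase 2 into Phase 3. The phase functions here are stand-ins passed as parameters (hypothetical names, not the framework's API); only the control flow mirrors this section:

```python
def run_dq(schema_init, data_api, phase1, phase2, phase3):
    """Sketch of DQ sequencing: Phase 1 failures block Phases 2 and 3."""
    violations = phase1(schema_init)          # metadata-only checks
    if violations:
        return {"status": "phase1_failed", "violations": violations}
    quasi_metadata = phase2(schema_init, data_api)   # data-API fetch
    return phase3(schema_init, quasi_metadata)       # verification + attestation
```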
7.4 Phase 1: Metadata-only verification¶
This phase verifies what is checkable from schema.init alone, without data.
7.4.1 Structural rules verified at Phase 1¶
Per the Foundations chapter §2.10.1 and the ColumnSpec chapter §3.9.1:
-
Required-fields rule: every ColumnSpec has all four parts declared (with auto-derivation for grain-role columns). Missing required fields are violations.
-
- E references valid AC-dimensions: every dimension named in any ColumnSpec's E is itself an AC-dimension somewhere in the AC.
- M constraint: M.determinants ⊆ E ∪ {self-token}. Determinants outside this set are integrity violations.
- |E| = 1 rule: AC-dimensions and AC-attributes have |E| = 1 per the column trichotomy. Violations indicate misclassification or malformed declaration.
-
No-all-dimensions rule: each schema must have at least one non-grain-role column. A schema consisting solely of grain-role columns is structurally degenerate.
-
Type consistency for same-named columns: within a schema this is trivially satisfied (by the same-name-uniqueness rule); across schemas, same-named columns' declared types are checked for consistency at Phase 1 too.
-
Schema well-formedness: each column's
Eis reachable from the schema's grain via the FD-DAG. Columns anchored to entities not reachable from the schema's grain are structurally malformed. -
Candidate FD-DAG acyclicity: the candidate FD-DAG declared in schema.init has no cycles among AC-dimensions.
-
Same-name uniqueness within schema: two ColumnSpecs in the same schema do not share a
name. -
Operator-type-appropriate E-relation: for each non-root ColumnSpec, the E-relation between the column and its DNA predecessor matches the operator's type —
E_pred ⊇ E_selffor reducer ops,E_pred = E_selffor function ops. -
DNA references valid columns: every non-root ColumnSpec's DNA points to an
(name, E, op)triple that matches some ColumnSpec in the AC. -
Naming consistency (when a naming function is declared): for each non-root ColumnSpec, the column's
nameequals the AC's naming function called with its DNA predecessor and operator (orname = name_predifopis identity-preserving). -
Family-root uniqueness within
(name, E): across all ColumnSpecs in the AC, two with the same(name, E)walk DNA to the same family-root.
7.4.2 Structures derived at Phase 1¶
Phase 1 derives:
-
Column trichotomy classification: each column classified as AC-dimension, AC-attribute, or AC-metric per the trichotomy in Foundations §2.5. Derived from
Epatterns across schemas. -
Schema-type classification: each schema classified as reference, fact, composite-grain fact, etc., per Foundations §2.9.3.
-
Candidate FD-DAG: per Foundations §2.8, edges added per AC-dimension columns with
Ereferencing other AC-dimensions, plus FD-edges declared explicitly in schema.init'sfd_dagsection. -
Schema grain:
grain(S) = {grain-role columns of S}per Foundations §2.9.2. -
Metric genealogy structure (preliminary): family partition by name, family-root identification via DNA-walk, and structural relations (identical, sibling, cousin) computed from declared ColumnSpecs.
7.4.3 Phase 1 outputs¶
If Phase 1 passes:
- All declarations are internally consistent.
- Trichotomy, FD-DAG candidates, schema grains, and metric genealogy structure are computed.
- Phase 2 proceeds.
If Phase 1 fails:
- Specific structural violations are surfaced with diagnostic information.
- Phase 2 does not run; the engineer must address Phase 1 violations first.
7.4.4 Phase 1 advisories¶
Beyond hard rules, Phase 1 surfaces advisories for engineer review:
-
Sparse-grain advisory (strong): when a schema's grain is a strict superset of the union of its non-grain-role columns'
Evalues. The schema may be under-utilized at its declared grain. -
Hidden-dimension advisory: when a column's
Ereferences a dimension not present in the schema (typically reachable via FD-DAG navigation, but worth noting). -
Pre-aggregation candidate (informational): when multiple fact schemas carry the same AC-metric at FD-related
Evalues. Suggests the schemas may be sibling representations of the same metric at different grains. -
Temporal-partitioning candidate (informational): when multiple schemas have isomorphic ColumnSpec structures. Suggests they may be temporal partitions of the same logical schema.
-
Missing-naming-function advisory: if the AC declines structured naming and the framework detects ColumnSpecs whose names suggest operator derivation (e.g., a column named
peak_revenuewithop: MAXand DNA pointing to arevenuecolumn), the advisory notes that name-vs-operator consistency is not verified.
Advisories don't block Phase 2; engineers review them at their discretion.
7.5 Phase 2: Quasi-metadata fetch¶
This phase calls the data-API to fetch the minimum data information needed for principle-verification.
7.5.1 What quasi-metadata covers¶
Quasi-metadata, narrowly:
- Per-AC-dimension-per-schema observed value sets: for each AC-dimension `d` and each schema `S` where `d` appears, the set of `d`-values observed in `S`.
- Per-AC-attribute-per-schema observed value sets: similarly for AC-attributes.
- Per-pair value-mappings: for each pair `(c1, c2)` where both have AC-dimension or AC-attribute roles and both appear in the same schema `S`, the `(c1-value, c2-value)` pairs observed in `S`.
- Per-schema grain integrity: confirmation that `grain(S)` is unique per row.
- Per-column missing counts: for non-grain-role columns, the count of rows with missing values; for grain-role columns, confirmation that no row is missing.
- Metric data for anchoring inference: per-metric-per-schema, the data needed to test candidate anchorings (per §7.6.5).
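The per-schema quasi-metadata items above can be sketched in memory. This is a hypothetical illustration, not the Chapter 6 data-API: the function name, the row/dict shapes, and the column-role arguments are all assumptions.

```python
from collections import defaultdict

def quasi_metadata(rows, dim_cols, grain_cols):
    """Compute the narrow quasi-metadata for one schema from in-memory rows.

    rows: list of dicts mapping column name -> value (None means missing).
    dim_cols: columns carrying AC-dimension/AC-attribute roles.
    grain_cols: the grain-role subset of dim_cols.
    """
    value_sets = defaultdict(set)   # per-column observed value sets
    pair_maps = defaultdict(set)    # per-pair observed (c1-value, c2-value) pairs
    missing = defaultdict(int)      # per-column missing counts
    grain_keys = []                 # grain combo-keys, for the uniqueness check

    for row in rows:
        for c in dim_cols:
            v = row.get(c)
            if v is None:
                missing[c] += 1
            else:
                value_sets[c].add(v)
        for i, c1 in enumerate(dim_cols):
            for c2 in dim_cols[i + 1:]:
                if row.get(c1) is not None and row.get(c2) is not None:
                    pair_maps[(c1, c2)].add((row[c1], row[c2]))
        grain_keys.append(tuple(row.get(c) for c in grain_cols))

    # Grain integrity: unique per row, and no missing grain-role value.
    grain_ok = (len(grain_keys) == len(set(grain_keys))
                and all(None not in k for k in grain_keys))
    return dict(value_sets), dict(pair_maps), dict(missing), grain_ok
```

A production implementation pushes all of this down to the backend via the data-API; the sketch only shows what the phase collects.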
7.5.2 What quasi-metadata is not¶
Quasi-metadata does not include:
- Distribution shapes or value frequencies beyond what's needed for principle-verification.
- Operator-semantic information (combination law candidates, missing-value mechanism inference, etc.).
(Cross-schema verification of metric values at common coarsenings is performed by per-DNA-edge value attestation, §7.6.8 — distinct from quasi-metadata, which the attestation pass consumes alongside other DQ verification.)
The narrow scope is deliberate: quasi-metadata is the minimum data information the framework requires to verify what's verifiable.
7.5.3 Data-API calls for quasi-metadata¶
The framework calls the data-API per Chapter 6:
- `get_distinct_values` per dimension per schema.
- `get_distinct_values` per attribute per schema.
- `get_pair_mapping` per pair per schema.
- `verify_grain_integrity` per schema.
- `get_missing_counts` per schema for non-grain columns.
- `test_metric_anchoring` per metric per schema for candidate anchorings.
The framework batches calls where possible to minimize backend load.
7.5.4 Quasi-metadata refresh¶
Quasi-metadata can become stale as backend data is updated. The framework supports:
- Initial computation at AC-load time.
- On-demand refresh for specific schemas or items.
- Periodic refresh on a configurable schedule.
Stale quasi-metadata can cause Phase 3 verification to be based on outdated facts. Production deployments should refresh quasi-metadata after data updates that affect the verified facts.
7.6 Phase 3: Quasi-metadata-derived verification¶
This phase uses the fetched quasi-metadata to verify integrity conditions, derive structures, surface violations or advisories, and verify cross-schema metric coherence per attestable DNA edge (§7.6.8).
7.6.1 Universe-wide value sets¶
For each AC-dimension d, the framework computes V(d) — the union of d's observed value sets across all non-degenerate schemas where d appears in grain role.
For ACs declaring a reference table R_d as authoritative for d, V(d) is taken from R_d; other schemas' values must be subsets of V(d).
7.6.2 Coverage map¶
For each AC-dimension d, the framework records per-schema coverage:
- Fully covered: observed value set equals `V(d)`, missing-value count is zero.
- Coverage-restricted (Kind 1): observed value set is a strict subset of `V(d)`, missing-value count is zero.
- Attribution-incomplete (Kind 2): missing-value count is non-zero.
The coverage map is descriptive; it reports the AC's structural state. All cases are surfaced (no threshold-based suppression). Engineers consult the map to understand schema coverage.
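The §7.6.1 union and the three-way classification above can be sketched as follows. The inputs are assumed to be plain dicts (per-schema observed value sets and missing counts for one dimension); this is not the framework's API.

```python
def universe_and_coverage(observed, missing_counts):
    """observed: per-schema observed value sets for one AC-dimension d
    (grain role, non-degenerate schemas).
    missing_counts: per-schema missing-value counts for d.
    Returns V(d) and the per-schema coverage classification."""
    # §7.6.1: V(d) is the union across all schemas where d appears in grain role.
    V_d = set().union(*observed.values())
    coverage = {}
    for schema, values in observed.items():
        if missing_counts.get(schema, 0) > 0:
            coverage[schema] = "attribution-incomplete"   # Kind 2
        elif values == V_d:
            coverage[schema] = "fully-covered"
        else:
            coverage[schema] = "coverage-restricted"      # Kind 1: strict subset
    return V_d, coverage
```

When the AC declares an authoritative reference table for `d`, `V_d` would instead be read from that table, with the subset condition checked per schema.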
7.6.3 Data-driven FD-DAG¶
For each pair of AC-dimensions (a, c) appearing together in some schema, the framework checks:
- Existence: in each schema, does the `(a-value, c-value)` mapping have the function property?
- Value-mapping consistency: across schemas with overlapping declared scopes, do the mappings agree?
If both checks pass, the AC-level FD-edge a → c is data-attested.
The framework verifies the relationship to the logical FD-DAG (specified by the engineer):
- Logical FD-DAG ⊆ Data-driven FD-DAG. Logical edges not attested are hard violations.
- Data-driven edges not declared logically: not violations; an advisory may surface for engineer consideration.
For FD-edges declared with `trust_declared_FD` instructions (§5.8.1), violations are downgraded to advisories.
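The two data-attestation checks (per-schema function property, cross-schema agreement) can be sketched over observed pair mappings. The input shape is an assumption for illustration:

```python
def fd_attested(pair_maps):
    """pair_maps: {schema: set of (a_value, c_value) pairs observed in that
    schema}. The edge a -> c is data-attested iff, within each schema, the
    mapping is a function (existence) and the per-schema mappings agree on
    shared a-values (value-mapping consistency)."""
    merged = {}
    for schema, pairs in pair_maps.items():
        local = {}
        for a, c in pairs:
            if local.setdefault(a, c) != c:
                return False   # two c-values for one a-value: not a function
        for a, c in local.items():
            if merged.setdefault(a, c) != c:
                return False   # schemas disagree on a shared a-value
    return True
```

A real implementation would also report which schema and which `a`-value broke the check, since that diagnostic is what the engineer acts on.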
7.6.4 Cross-schema value-mapping consistency for AC-attributes¶
The same machinery applies to AC-attributes. For each AC-attribute c with E = {anchor}, value-mapping consistency is checked across schemas where c appears with overlapping declared scopes.
AC-attributes don't contribute FD-edges (the FD-DAG's nodes are AC-dimensions). They contribute to the framework's column-mapping space.
7.6.5 Metric-anchoring inference¶
For each AC-metric in each schema, the framework infers the smallest data-attested anchoring:
- Compute the candidate set: subsets of `grain(S)` plus FD-DAG-reachable dimensions.
- For each candidate `D`, test whether the metric is a function of `D` in the data via the data-API (`test_metric_anchoring` per Chapter 6 §6.4.7).
- Among candidates where the metric is a function, find the smallest by subset relation.
- The smallest subset is the inferred `E(c, S)`.
The inference serves two purposes:
- Drafting: when a ColumnSpec lacks an `E` declaration (or for AC-spec drafting from data), the framework proposes the inferred smallest anchoring.
- Verification: when `E` is declared, the framework compares declared vs. inferred. If declared `E` is finer than inferred, a mis-anchored advisory surfaces.
When multiple anchorings are minimal (neither subset of the other), the framework presents all to the engineer for selection. Coverage-restriction in candidate dimensions is surfaced.
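The inference loop, including the multiple-minima case, can be sketched with exhaustive subset enumeration. A real implementation would push the function test to the data-API and prune the lattice rather than enumerate it; the names here are illustrative.

```python
from itertools import combinations

def infer_anchorings(rows, candidate_dims, metric):
    """Return every minimal subset D of candidate_dims such that the metric
    is a function of D in the data (the data-attested anchorings)."""
    def is_function(D):
        seen = {}
        for row in rows:
            key = tuple(row[d] for d in D)
            if seen.setdefault(key, row[metric]) != row[metric]:
                return False   # same D-key, different metric value
        return True

    functional = [set(D)
                  for r in range(len(candidate_dims) + 1)
                  for D in combinations(candidate_dims, r)
                  if is_function(D)]
    # Minimal by the subset relation: no strictly smaller functional subset.
    return [D for D in functional
            if not any(E < D for E in functional)]
```

When the returned list has more than one element, the minima are incomparable and the engineer selects among them, as described above.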
7.6.6 Structural enforcement at Phase 3¶
7.6.6.1 Redundant-grain rule¶
In any schema S, if column c is in grain role (E = {c}), and the FD-DAG has d → ... → c where d is any ancestor of c, and d is also in grain(S), then c's grain-role declaration is structurally redundant.
Remediation: re-declare c with E = {a}, where a is the closest in-schema-grain ancestor of c. After re-declaration, c becomes a non-grain reference.
The rule applies iteratively. The post-remediation invariant: every schema's grain consists only of dimensions with no in-schema-grain ancestor in the FD-DAG.
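The post-remediation invariant can be computed directly: a grain column survives iff it has no FD-DAG ancestor inside the grain. A minimal sketch under assumed inputs (a grain set and explicit FD-edge pairs; the function name is hypothetical, and the ColumnSpec re-declaration itself is not modeled):

```python
def prune_redundant_grain(grain, fd_edges):
    """Apply the redundant-grain rule: drop from the grain every dimension
    with an FD-DAG ancestor that is also in the grain.
    fd_edges: list of (d, c) pairs meaning d -> c."""
    def ancestors(c):
        # Transitive closure walking FD edges backwards from c.
        found, frontier = set(), {c}
        while frontier:
            nxt = {d for d, x in fd_edges if x in frontier} - found
            found |= nxt
            frontier = nxt
        return found

    # Keep only grain columns with no in-grain ancestor; this is the fixed
    # point of the iterative rule in one pass.
    return {c for c in grain if not (ancestors(c) & grain)}
```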
7.6.6.2 Schema scope honoring¶
For each schema declared as non-degenerate on AC-dimension d, the schema's observed value set must equal V(d). If the observed set is a strict subset of V(d), the schema must either:
- Be re-declared as degenerate on `d` with the actual value set.
- Have the missing values added to its data.
- Be reclassified to admit Kind 1 coverage-restriction explicitly.
7.6.6.3 Logical-FD-DAG-vs-data-driven conflict¶
Logical FD-edges not attested by the data-driven FD-DAG are hard violations (downgraded to advisories under trust_declared_FD). The logical declaration must be removed, the data corrected, or the trust-instruction added before AC validation passes.
7.6.7 Phase 3 advisories¶
7.6.7.1 Mis-anchored AC-dimension column advisory¶
When a column c with AC-dimension or AC-attribute classification is declared with E = {a}, but the data-driven FD-DAG has a longer path a → b → ... → c, the advisory surfaces. Declared anchor is finer than necessary; engineer can re-declare to a coarser anchor.
7.6.7.2 Mis-anchored metric advisory¶
When declared E for a metric is finer than the data-attested inferred anchoring, the advisory surfaces. Engineer can re-declare to inferred anchoring or confirm finer anchoring is intentional.
7.6.7.3 Attribution-incomplete column advisory¶
For each column with non-zero missing count in the coverage map, the advisory surfaces with schema, column, missing fraction, and remediation options (data fix, synthetic-unknown declaration, declared-degeneracy promotion).
7.6.7.4 Coverage-restricted column advisory¶
For each column with Kind 1 coverage-restriction, descriptive advisory specifying observed value set, V(d), and the gap.
7.6.7.5 Family-DAG inconsistency advisory¶
If the framework detects that two columns share a family-name but walk DNA to different family-roots (i.e., they are cousins), the advisory surfaces. The engineer confirms intentional cousins (and prepares to handle dubious queries) or addresses by renaming or restructuring DNA.
7.6.8 Per-DNA-edge value attestation¶
Per-DNA-edge value attestation verifies the cross-schema metric coherence lemma (per Foundations §2.10.5) per attestable DNA edge during DQ Phase 3. Attestation is enabled by default in Coframe Core; engineers may opt out per AC for cost-bounded deployments.
7.6.8.1 What attestation does¶
For each attestable DNA edge in the AC's metric genealogy, the framework:
- Identifies the predecessor column (anchored at `E_pred`) and the successor column (anchored at `E_self`, with `E_pred ⊇ E_self` under FD-DAG navigation).
- Computes the predecessor's data, aggregated via the column's `op` from `E_pred` to `E_self`, honoring the operator catalog's missing-value treatment for the column's declared signature.
- Compares the computed values against the successor's observed values at `E_self`, scoped to the intersection of the two schemas' declared scopes.
- Reports any deltas exceeding a configured tolerance, distinguishing value-disagreement on shared keys from row-set differences (rows in the predecessor not in the successor, or vice versa).
When the attestation passes (no value-disagreements within tolerance on shared keys), the lemma is verified for the edge. When it fails, the deltas are surfaced as integrity violations (default failure mode: see §7.6.8.5) or advisories (failure mode soft).
7.6.8.2 Edge selection¶
A DNA edge predecessor → successor is attestable iff:
- Both columns are physically present in the AC (the predecessor's data is reachable through the data-API).
- The column's `op` is a reducer with `partition_invariant: true`. (Function-typed edges and non-partition-invariant reducers are not attestable in Coframe Core; the framework records them as unattestable, distinct from attestation-disabled.)
- The predecessor's anchor `E_pred` reaches the successor's anchor `E_self` via the FD-DAG (this follows from ColumnSpec well-formedness; trivially satisfied for valid ACs).
- The intersection of the two schemas' declared scopes is non-empty.
Edges failing the first criterion (predecessor not in AC) are recorded as unattestable: predecessor not present. Edges failing the second are recorded as unattestable: operator not partition-invariant. The verification status reports both unattestable categories distinctly from attested-passed and attested-failed.
Singletons (multi-input columns produced by ratio operators or multi-input mappers) are not attested in Coframe Core — they are leaves in the metric genealogy with multi-input DNA, and Coframe Core does not specify multi-input attestation. Coframe Pro extends attestation to multi-input edges.
7.6.8.3 Attestation query shape¶
The framework constructs attestation queries through the data-API's verification operations (Chapter 6 §6.4.10). Conceptually, for an edge predecessor (in schema S_pred at E_pred) → successor (in schema S_self at E_self) with op being the family's ip_reducer:
```
derived = compute_aggregation(
    schema=S_pred,
    metric=predecessor.physical_name,
    op=op,
    group_by=projection(E_pred → E_self via FD-DAG),
    missing_treatment=per operator catalog and predecessor's M signature,
    scope=intersection(S_pred.scope, S_self.scope)
)
asserted = read_columns(
    schema=S_self,
    columns=[E_self ∪ {successor.physical_name}],
    scope=intersection(S_pred.scope, S_self.scope)
)
deltas = compare(derived, asserted, on=E_self, epsilon=ac.attestation.epsilon)
```
The framework reports per-edge: passed | failed_with_deltas | unattestable | sampled along with the count of disagreeing keys, a sample of disagreements (top-N by |delta|), the missing-from-predecessor and missing-from-successor counts, and the timestamp of attestation.
7.6.8.4 Numeric tolerance and missing-value semantics¶
Epsilon. Attestation uses a relative tolerance ε_rel and an absolute tolerance ε_abs. A delta |d| between derived value v_d and asserted value v_a passes iff |d| ≤ max(ε_abs, ε_rel · max(|v_d|, |v_a|)). Defaults: ε_rel = 1e-9 for numeric/float, ε_abs = 0 for integer and decimal. Engineers configure per AC.
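The tolerance test is a one-liner; a sketch with the documented defaults (the function name is illustrative):

```python
def within_tolerance(v_d, v_a, eps_rel=1e-9, eps_abs=0.0):
    """Attestation tolerance test: the delta between derived value v_d and
    asserted value v_a passes iff
        |v_d - v_a| <= max(eps_abs, eps_rel * max(|v_d|, |v_a|))."""
    delta = abs(v_d - v_a)
    return delta <= max(eps_abs, eps_rel * max(abs(v_d), abs(v_a)))
```

Note that with the integer/decimal defaults (`eps_rel` irrelevant, `eps_abs = 0`), only exact equality passes, which is the intended behavior for exact types.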
Missing-value computation. The aggregation that produces derived honors the operator catalog's missing-value treatment per the predecessor's declared signature (Chapter 10 §10.4). For SUM under MCAR-effective, this means mean-substitute-then-sum, not naive SQL SUM. Naive SQL aggregation as the attestation reference would produce false positives whenever the framework's defined operator semantics differ from SQL defaults — which is the entire point of the framework's missing-value regime.
Annotated outputs. When derived has its own annotations (partial-coverage, bias-warning, substitution-applied), the attestation result records that the comparison is between annotated values. A passing attestation under bias-warning is a weaker guarantee than a passing attestation under no annotations; the verification status preserves this distinction.
7.6.8.5 Failure modes¶
The AC's attestation.failure_mode controls what happens when an edge attestation fails:
- `hard` (strict): any failed edge causes AC validation to fail; the AC cannot be loaded for queries until the data is corrected, the edge declaration is changed, or attestation is opted out for that edge or AC.
- `soft` (default): failed edges are recorded as coherence advisories in the AC's verification status. The AC validates and is queryable, but every query result that draws on a schema participating in a failed edge carries a `coherence-warning` annotation specifying the failed edges and the magnitude of the disagreements. MCP responses propagate the annotation to LLM clients.
- `tolerated`: failed edges are recorded but produce neither validation failure nor query-result annotations. This mode exists for explicit acknowledgment of long-standing pre-aggregation drift the engineer accepts; using it requires also setting an `attestation.tolerated_edges` list per AC, naming each edge explicitly. Anonymous tolerance (turning off failure-mode checking globally) is not supported — tolerance is per-edge and visible.
The default soft is calibrated for first-time adoption: engineers can validate ACs against existing warehouses without immediately remediating all coherence issues, while still seeing what's broken and propagating the warning to consumers.
7.6.8.6 Late-arriving data and row-set differences¶
A common cause of attestation deltas: the predecessor schema has rows the successor schema's pre-aggregation hasn't incorporated yet (or vice versa). The framework distinguishes:
- Value-disagreement on shared keys: both schemas have a row at this key, but their values differ (after aggregation and missing-value handling). This is a coherence violation.
- Row-set difference: a key appears in one schema's projection to `E_self` and not the other's. This is a coverage concern, surfaced as a coverage-difference advisory rather than a coherence violation.
By default, only value-disagreement on shared keys triggers the failed_with_deltas status. Row-set differences are recorded separately. Engineers can tighten this with attestation.strict_row_sets: true, which treats any row-set difference as a coherence failure.
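The split between the two delta kinds can be sketched over key-to-value maps. The function name and return shape are assumptions; only the classification logic tracks the text.

```python
def compare(derived, asserted, passes):
    """Split attestation deltas into value-disagreements on shared keys
    vs. row-set differences.
    derived, asserted: dicts mapping E_self keys to metric values.
    passes: the tolerance test, e.g. within_tolerance."""
    shared = derived.keys() & asserted.keys()
    disagreements = {k: (derived[k], asserted[k])
                     for k in shared if not passes(derived[k], asserted[k])}
    return {
        # Coherence violations: both sides have the key, values differ.
        "disagreements": disagreements,
        # Coverage concerns: keys on one side only (e.g. late-arriving data).
        "missing_from_successor": derived.keys() - asserted.keys(),
        "missing_from_predecessor": asserted.keys() - derived.keys(),
    }
```

Under the default posture only a non-empty `disagreements` would trigger `failed_with_deltas`; with `strict_row_sets: true` the other two sets would fail the edge as well.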
7.6.8.7 Sampling fallback for large fact tables¶
For DNA edges whose predecessor schema exceeds the configured row budget (attestation.sampling_threshold_rows, default 1e8), the framework falls back to stratified sampling:
- Identify the strata: the values of `E_self` in the successor's data (the keys to verify).
- Sample a configured fraction of predecessor rows per stratum (default 1% with a minimum of 10000 rows per stratum, capped at 100% for small strata).
- Compute attestation on the sample.
- Report the result with `sampled` status and a confidence level (default target `0.99` per `attestation.sampling_confidence_target`).
Sampled attestation is weaker than full attestation — it confirms that disagreements, if any, are below the tolerance with the stated confidence rather than confirming exact agreement. The verification status records sampled: true, the sample size per stratum, and the confidence level; query-result annotations distinguish sampled-attested from fully-attested edges.
Engineers can force full attestation regardless of size with attestation.force_full: true, accepting the runtime cost. Coframe Pro provides additional sampling strategies (importance sampling, stratification by anchor distribution).
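The per-stratum sample-size rule (fraction, floor, cap) can be sketched as follows; the function name and input shape are hypothetical, and a real implementation would sample inside the backend rather than in Python.

```python
import random

def stratum_sample(rows_by_stratum, fraction=0.01, min_rows=10_000, seed=0):
    """Stratified sampling fallback: per stratum, take
    min(stratum size, max(fraction * n, min_rows)) rows,
    i.e. the configured fraction with a floor, capped at 100% for small
    strata. A fixed seed keeps reruns reproducible."""
    rng = random.Random(seed)
    sample = {}
    for stratum, rows in rows_by_stratum.items():
        k = min(len(rows), max(int(fraction * len(rows)), min_rows))
        sample[stratum] = rng.sample(rows, k)
    return sample
```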
7.6.8.8 Configuration¶
Attestation configuration lives at the AC level in the AC catalog YAML:
```yaml
attestation:
  enabled: true                       # default; set to false to opt out
  failure_mode: soft                  # soft (default) | hard | tolerated
  tolerated_edges: []                 # required when failure_mode == tolerated
  epsilon_relative: 1.0e-9
  epsilon_absolute: 0
  strict_row_sets: false
  sampling_threshold_rows: 100_000_000
  sampling_fraction: 0.01
  sampling_min_rows_per_stratum: 10_000
  sampling_confidence_target: 0.99
  force_full: false
```
When the `attestation:` block is absent from the AC YAML, defaults apply (attestation enabled, failure mode soft, etc.). Disabling attestation requires the AC YAML to set `enabled: false` under `attestation:` explicitly — opting out is a deliberate, visible choice. Implicit opt-out (silently omitting the configuration) is not how attestation gets disabled.
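The resolution rule (absent block means defaults; the `tolerated` mode requires named edges) can be sketched as a per-key merge. The helper name `resolve_attestation` is hypothetical:

```python
ATTESTATION_DEFAULTS = {
    "enabled": True,
    "failure_mode": "soft",
    "tolerated_edges": [],
    "epsilon_relative": 1.0e-9,
    "epsilon_absolute": 0,
    "strict_row_sets": False,
    "sampling_threshold_rows": 100_000_000,
    "sampling_fraction": 0.01,
    "sampling_min_rows_per_stratum": 10_000,
    "sampling_confidence_target": 0.99,
    "force_full": False,
}

def resolve_attestation(ac_yaml):
    """Merge the AC's attestation block over the defaults, per key.
    An absent block yields the defaults; declared keys override."""
    cfg = {**ATTESTATION_DEFAULTS, **(ac_yaml.get("attestation") or {})}
    if cfg["failure_mode"] == "tolerated" and not cfg["tolerated_edges"]:
        # Anonymous tolerance is unsupported: every tolerated edge is named.
        raise ValueError("failure_mode: tolerated requires tolerated_edges")
    return cfg
```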
7.6.8.9 Verification status reporting¶
The DQ deliverable (per §7.9) reports per AC:
- The count of attestable, attested-passed, attested-failed, unattestable, and sampled edges.
- Per failed edge: predecessor and successor (logical names), failure-mode applied, count of disagreeing keys, top-N disagreements with their deltas, missing-from-predecessor and missing-from-successor counts.
- Per unattestable edge: predecessor, successor, reason for unattestability.
- The configuration used (failure mode, epsilon, sampling parameters, force_full setting).
- Total attestation runtime.
This reporting is part of the AC's stable verification artifact. Engineers consult it to debug failures and to decide whether the AC's coherence posture is acceptable. AI agents querying via MCP can read it through the validate_ac capability (Chapter 11).
7.6.8.10 Cost considerations¶
Attestation adds time to AC validation. Empirically, for an AC with N attestable edges over fact tables of M rows each, the cost is approximately O(N · M · log M) for full attestation in DuckDB (sort + group-by aggregation per edge), reduced to O(N · sample_fraction · M · log(sample_fraction · M)) under sampling. For a typical retail-scale AC (5–10 metric families, 2–4 siblings per family, fact table of ~50M rows), full attestation in the polars or duckdb backend completes in single-digit minutes.
This is build-time cost paid at AC validation, not query-time cost. Validation is not a hot path; the cost is one-time per AC version. Engineers preferring faster iteration during AC authoring can enable sampling; engineers preferring stronger guarantees in production can disable sampling.
The framework's posture: attestation is a one-time verification cost paid in exchange for an unconditional MTI guarantee at every subsequent query. Default-on reflects this trade.
7.6.8.11 Opting out¶
Engineers may opt out of attestation per AC by setting attestation.enabled: false in the AC catalog. The verification status records the opt-out and propagates a global coherence-asserted-not-verified annotation to every query result that depends on cross-schema reach. MCP responses surface this annotation to LLM clients.
Opt-out is a legitimate choice for some deployments — proof-of-concept work, small-scale prototypes, ACs whose coherence is verified by upstream pipelines and re-attestation is redundant. The framework does not pass judgment on the choice; it records and reports it.
The Coframe Pro version's attestation is similarly default-on with a richer feature surface; the opt-out semantics are identical.
7.7 Asserted facts: what's verified, what isn't¶
The framework distinguishes facts that DQ verifies, facts DQ verifies by default but engineers can opt out of, and facts the framework relies on but does not data-attest.
7.7.1 The boundary¶
Verified by default, no opt-out: the structural rules in Foundations §2.10.1–§2.10.3 (per-column, per-schema, AC-level), and the data-attested integrity conditions in §2.10.4 (Logical FD-DAG ⊆ Data-driven FD-DAG, scope honoring, grain combo-key uniqueness, cross-schema value-mapping consistency for AC-dimensions and AC-attributes). These run unconditionally during DQ Phase 1 and Phase 3.
Verified by default, with explicit opt-out: cross-schema metric coherence per attestable DNA edge (§7.6.8). Default-on; engineers may set attestation.enabled: false per AC. The opt-out is recorded and propagated to query-result annotations.
Asserted, not verified: facts the framework relies on but cannot directly attest from data. These are the genuine lemmas of the framework's grammar layer.
7.7.2 Cross-schema metric coherence (verified by default)¶
The cross-schema metric coherence statement — that across schemas containing siblings of the same family-root, metric values at common coarsenings agree — was formerly a lemma asserted from Principle 2 plus partition-invariance. As of Coframe Core v0.7, cross-schema metric coherence is verified per attestable DNA edge during DQ Phase 3 by default (§7.6.8).
The verified condition: for each attestable edge, the predecessor's data aggregated via the family's ip_reducer at the successor's anchor agrees (within tolerance, on shared keys) with the successor's observed values.
When the verification fails, the framework's response is governed by the AC's attestation.failure_mode setting (§7.6.8.5): hard validation failure, soft advisories propagated to query results, or per-edge tolerance with explicit declaration.
When the engineer opts out of attestation, this fact reverts to asserted-not-verified status. The Multi-Table Invariance theorem (Chapter 9 §9.6) is correspondingly conditional in opted-out configurations and unconditional in default configurations.
7.7.3 Genuinely asserted facts in Coframe Core¶
The following facts the framework relies on but cannot data-attest:
Catalog-declared partition-invariance. Each reducer in the operator catalog has a partition_invariant flag (Chapter 10). The flag's correctness is a property of the catalog's design (a mathematical claim about the operator) and is not separately verified per AC. The framework trusts the catalog.
Catalog-declared identity-preservation for functions. Each function in the operator catalog has an identity_preserving flag. Same status: a property of the catalog, not of the AC.
Engineer's principle commitments. Principle 1 (column-borne information) and Principle 2 (same universe of observation) are commitments the engineer makes by authoring an AC. The framework verifies many consequences of the principles (FD-DAG attestation, cross-schema value-mapping consistency, attestation per §7.6.8) but does not verify the principles themselves. An engineer asserting Principle 2 over schemas observing genuinely different universes is making an unverifiable commitment that produces structurally valid integrity conditions while still being analytically wrong.
The naming function (when not declared). When an AC declines structured naming (per Chapter 3 §3.7.3 Option 4), name-based family membership claims are asserted by the engineer's naming choices and not verified by a declared naming function. The verification status records this case as naming-consistency: asserted (vs. verified when a naming function is declared and checked).
7.7.4 The framework's commitment with respect to asserted facts¶
The framework's commitment is honest articulation of what's verified and what's asserted. Engineers operating ACs should understand the distinction:
- Verified-no-opt-out facts are unconditional within the AC's verification artifact.
- Verified-with-opt-out facts are conditional on the engineer's configuration choice; the choice is visible.
- Asserted-not-verified facts are conditional on the engineer's discipline; the framework cannot detect violations.
The structural-rigor-as-binary posture (Foundations §2.11.1) applies to verified facts. Asserted facts are outside the framework's verification scope; the engineer's commitment is what the framework operates on. Making this distinction explicit, on every AC's verification status and on every query result that depends on it, is the framework's contribution.
7.8 Missing data on AC-dimensions¶
This section specifies how DQ handles missing values on AC-dimensions.
7.8.1 Missing on grain-role columns¶
Per Foundations §2.4.4, grain-role columns have auto-derived M = {c} (MNAR) and forbidden admissibility: missing values are integrity violations.
If quasi-metadata reports any missing values on grain-role columns, DQ Phase 3 produces a hard violation. The engineer must either fix the data or remove the affected rows.
7.8.2 Synthetic-unknown values and the FD-DAG¶
In some cases, the engineer may want to admit "unknown" as a valid value for an AC-dimension in non-grain-role positions. This is supported via synthetic-unknown declarations in schema.init:
```yaml
- column_spec:
    src_name: customer_id
    name: customer
    data_type: integer
    E: [transaction]
    M:
      signature: MAR
      determinants: [transaction]
    op: OBSERVED
    synthetic_unknown:
      value: -1
      meaning: "Customer not identified at transaction time"
```
The framework treats customer_id = -1 rows as having a synthetic-known value (-1) representing "unknown customer." The synthetic value participates in the data-driven FD-DAG, the coverage map, and queries — it is a normal value with a special meaning declared by the engineer.
7.8.3 Missing on reference-role AC-dimension columns¶
For AC-dimension columns appearing in reference role (non-grain) in fact schemas, missing values may be valid (declared via M). DQ Phase 3 reports the missing fraction; engineers respond per §7.8.2 (synthetic-unknown), data fix, or scope re-declaration.
7.9 The DQ deliverable¶
The output of the DQ process is the DQ deliverable: a structural-verification artifact summarizing what was attested, what wasn't, and what the engineer needs to address.
7.9.1 Deliverable contents¶
The deliverable includes:
- Coverage maps (§7.6.2): per AC-dimension per schema.
- Data-driven FD-DAG (§7.6.3): the FD edges attested by data, with attestation status for declared edges.
- Metric anchoring inferences (§7.6.5): per-metric per-schema, the smallest data-attested anchoring.
- Per-DNA-edge attestation results (§7.6.8): in default configurations, the per-edge passed/failed/unattestable/sampled status with deltas and configuration used. In opted-out configurations, an explicit record of the opt-out and a `coherence-asserted-not-verified` annotation propagated globally.
- Violations: integrity conditions that failed at Phase 1, 2, or 3 (including attestation failures under `failure_mode: hard`).
- Advisories: soft concerns the framework surfaces for engineer review (including attestation deltas under `failure_mode: soft`).
- AC-level integrity status: pass/fail and what's pending.
- AC Verification Level (§7.13): the AC's level — A, AA, or AAA — computed deterministically from the grounding status of the AC's structural commitments. The deliverable enumerates which commitments are empirically grounded (data-attested), which are deductively grounded (verified-by-construction through operator catalog semantics), which are mixed-and-cross-checked, and which are tolerated. Reported alongside any caveats (`naming_consistency: asserted`, `tolerated_edges: [...]`, `attestation: disabled`). Informational in v1.0; stable surface in v1.x.
- Asserted-not-verified facts listed: the genuinely unverifiable lemmas (§7.7.3), distinguished from verified-with-opt-out facts (§7.7.2).
- Coherence posture summary: explicit statement of the AC's attestation configuration (enabled/disabled, failure mode, sampling, force_full) and the resulting MTI status (unconditional / conditional within scope).
7.9.2 Deliverable consumption¶
The deliverable is consumed by:
- The engineer: reviews violations and advisories; iterates schema.init based on findings.
- AI-assisted authoring tooling: proposes remediation for advisories; refines schema.init.
- The MCP server: exposes the deliverable's structural facts to LLM clients querying the AC's metadata.
- The framework's AC validation: when DQ converges, AC validation uses the deliverable to confirm all integrity conditions hold.
7.10 The DQ iteration cycle¶
The iteration cycle:
- Engineer authors initial schema.init.
- Framework runs DQ (Phase 1 → 2 → 3, where Phase 3 includes per-DNA-edge attestation in default configuration).
- Framework returns the deliverable: violations, advisories, attestation results, refined schema.init proposals.
- Engineer reviews:
- Addresses violations: modifies schema.init, fixes data, declares synthetic-unknowns, or applies trust-instructions.
- Addresses attestation failures (under `failure_mode: hard`): investigates pre-aggregation drift, fixes ETL, re-runs the offending pre-aggregations, or downgrades the failure mode for the affected AC after explicitly accepting the deltas.
- Considers advisories (including attestation deltas under `failure_mode: soft`): confirms intentional or addresses.
- Re-runs DQ.
- Iterates until violations are zero and engineer is satisfied with advisories.
- AC validation runs on the converged schema.init; AC is ready for query workloads.
The framework supports iteration via:
- Caching of quasi-metadata between runs: refreshed only when data changes.
- Differential re-verification: re-checks only changed schemas/columns when possible.
- Differential re-attestation: re-runs attestation only for edges whose predecessor or successor schemas have changed since the prior run; cached edges retain their prior status until invalidated.
- Advisory acknowledgments persisting across runs: engineer says "I accept this" once.
For initial AC authoring against a warehouse with long-standing pre-aggregation drift, the recommended path is: keep failure_mode: soft (the default) during the initial iteration cycle; address the largest deltas first; promote to hard once the AC is in a coherent state. This avoids the situation where attestation prevents the engineer from getting any AC validated until every edge passes.
7.11 The three lenses on schema validity¶
DQ checks schema validity through three lenses. Each lens corresponds to a distinct notion of "valid schema."
7.11.1 Principle-consistency¶
The schema honors the framework's principles (Foundations §2.2): every column is anchored to declared entities (Principle 1); schemas observe a consistent universe (Principle 2). Phase 1 and Phase 3 verify principle-consistency.
7.11.2 Structural well-formedness¶
The schema satisfies the structural rules (Foundations §2.10): |E| = 1 for AC-dimensions/attributes, no-all-dimensions, type consistency, schema well-formedness, FD-DAG acyclicity, etc. Phase 1 verifies structural well-formedness.
7.11.3 Operational utility¶
The schema, as authored, supports the analytical purpose the engineer intends. This is partly the engineer's domain (does the AC's vocabulary match how analysts think?) and partly verifiable (do the family-roots correspond to genuine observed metrics? are the FD-edges meaningful?).
DQ surfaces operational-utility concerns as advisories (mis-anchored advisories, sparse-grain advisories, etc.). Engineers address them based on analytical purpose.
7.11.4 Failures on each lens¶
- Principle-consistency failure: hard violation. The AC cannot proceed.
- Structural well-formedness failure: hard violation. The AC cannot proceed.
- Operational utility failure: advisory. The AC can proceed, but the engineer should consider whether the AC's commitments serve their intended use.
7.12 DQ's positioning¶
DQ is the framework's structural-verification process. It is not:
- A general data quality tool. Tools like Great Expectations, dbt tests, and custom monitoring address data quality at value level (data is correct, complete, fresh, etc.). DQ addresses structural correctness (declarations match data structure).
- A query-time verification mechanism. DQ runs at AC load time and on demand; it does not verify integrity at every query. Once the AC is verified, queries proceed under the integrity conditions DQ established.
- A backend-management tool. DQ assumes the backend's data is what the engineer intends; remediation of data issues happens at the backend level (ETL fixes, data cleaning), which DQ does not perform.
DQ's value: surfacing structural concerns to the engineer in a phase where they can address them deliberately, before query workloads depend on the AC's commitments. A clean DQ deliverable over a backend warehouse is informative regardless of whether the AC is ultimately deployed for queries.
7.13 AC Verification Levels¶
Coframe Core characterizes an AC's verification status with three ordinal levels — A, AA, and AAA — composed from the integrity conditions specified in this chapter and Chapter 2 §2.10. The levels are ordinal and monotonic: AA implies A; AAA implies AA. Each level represents a meaningful jump in the trust an AC's consumer can place in its results.
The levels measure the strength of the AC's verified structural commitments, not the specific mechanism by which each commitment was verified. Per §2.8.5, Coframe operates in two parallel verification regimes — empirical (data-attested through DQ) and deductive (verified-by-construction through operator catalog semantics). The level definitions admit both regimes as legitimate sources of verification: an AC reaches a level when each of its commitments at that level is grounded by at least one verification mechanism. §7.13.4 specifies what grounding means precisely.
The levels are informational in v1.0 and become stable surface in v1.x. v1.0 deployments report and propagate levels (in the DQ deliverable per §7.9.1, in MCP query results per §11.7.3); v1.x will lock the level definitions as semver-protected commitments after field experience informs any calibration.
7.13.1 Level A — Structural well-formedness¶
What Level A means. The AC's metadata is internally consistent. Phase 1 of DQ passes; integrity conditions I0 through I9 hold. Specifically:
- Per-column rules (§2.10.1): |E| = 1 for AC-dimensions and AC-attributes; (E, M) paired declaration; operator-type-appropriate E-relation between column and DNA predecessor; naming consistency when a naming function is declared.
- Per-schema rules (§2.10.2): no-all-dimensions; type consistency within and across schemas.
- AC-level rules (§2.10.3): candidate FD-DAG acyclicity; family-root uniqueness within (name, E).
- Catalog and customization integrity (I7): the AC's effective operator registry is well-formed.
- Name-identity uniqueness (I8): same-named columns refer to compatible selves.
- name_map consistency (I9): every logical name has an entry; the map is injective; customizations reference registered logical operators.
What Level A does not mean. No data has been examined and no function evaluation has been required. Level A is a metadata-coherence commitment — the declarations are mutually consistent at the structural level, regardless of how each declaration is grounded.
Who benefits from Level A. Anyone consuming the AC's metadata: analysts reading the AC's structure, AI agents browsing the family vocabulary via MCP, tooling that introspects the AC. They get a guarantee that the AC's declarations are mutually consistent and that the AC will load cleanly into a Coframe Core backend.
Cost to author. Essentially free. Phase 1 of DQ runs without data access; any AC that loads cleanly is automatically Level A.
7.13.2 Level AA — Verified structural integrity¶
What Level AA means. Level A plus every dimensional structural commitment is grounded. Specifically:
- FD-DAG completeness. Every FD-edge in the AC's FD-DAG is grounded by at least one of three mechanisms:
  - Data-attested grounding: the FD-edge passes I3 attestation per §7.6.3 (Logical FD-DAG ⊆ Data-driven FD-DAG). The data examined contains tuples consistent with the declared FD.
  - Construction grounding: the FD-edge is established by deterministic operator catalog semantics for a function-derived dimensional column (e.g., month = MONTH_OF(day)). The function's declared determinism and type signature combined with the catalog's correctness commitment make the FD true by construction. No data attestation is needed because nothing data-side could falsify a deductive consequence of the function's definition.
  - Mixed grounding (cross-checked): the FD-edge is materialized data-side (a stored month column populated by ETL) and also derivable function-side (MONTH_OF(day) from the operator catalog). In this case, the framework requires cross-check: I3 attestation verifies that the materialized values agree with the function output. Disagreement here is a meaningful integrity violation indicating ETL drift.
- Schema scope honoring (§7.6.6.2): every schema declared as non-degenerate on a dimension actually covers that dimension's universe; declared-degenerate schemas honor their declared value-sets. This is empirically grounded — schemas reflect what their data contains.
- Grain combo-key uniqueness (§2.10.4): each schema's grain-role columns produce unique value tuples per row. This is empirically grounded — grain uniqueness is a property of the actual rows.
- Cross-schema value-mapping consistency for AC-dimensions and AC-attributes (§7.6.4): when the same (c1-value, c2-value) mapping appears in multiple schemas with overlapping declared scopes, the mappings agree. This is empirically grounded for data-stored mappings; for function-derived dimensional values consistent across schemas (the same MONTH_OF(day) applied identically), consistency is by-construction.
What Level AA does not mean. Cross-schema metric coherence is not yet grounded at AA. Pre-aggregation drift between metric siblings (e.g., revenue at transaction grain disagreeing with revenue at (store, month) grain in a pre-aggregated summary) is not caught at AA. The Multi-Table Invariance theorem (§9.6) is conditionally trustworthy — conditional on the cross-schema metric coherence lemma which remains ungrounded at AA.
Who benefits from Level AA. Analytical consumers running cross-schema queries that involve dimensional navigation. They get a guarantee that the AC's dimensional structure is verified — that joins across dimension hierarchies produce consistent values, that declared FDs hold, that schema scopes match observed data. Most existing semantic-layer products effectively claim AA when they verify FK relationships and value mappings, though typically without the explicit articulation of grounding regimes.
Cost to author. Moderate. Phase 3 of DQ runs against data via the data-API for the data-attested portion; function-derived FD-edges are validated at metadata time (operator catalog correctness) without per-edge data work. For an AC whose dimensional structure is mostly function-derived (e.g., a single transactions schema with month, quarter, year derived via catalog functions), AA is achievable with minimal data-attestation cost. For an AC with rich data-attested dimensional structure (referential tables, pre-aggregated summaries with materialized hierarchy columns), AA is where engineers do the work of bringing declarations and data into alignment.
7.13.3 Level AAA — Verified cross-schema metric coherence¶
What Level AAA means. Level AA plus every metric coherence commitment is grounded. Specifically:
- Metric DNA completeness. Every DNA edge in the AC's metric genealogy is grounded by one of four mechanisms:
  - Data-attested grounding: the edge passes I10 per-DNA-edge value attestation per §7.6.8. The predecessor's data, aggregated via the family's ip_reducer at the successor's anchor, agrees with the successor's observed values within tolerance.
  - Construction grounding: the edge is established by deterministic operator catalog semantics for a function-derived metric (e.g., profit = SUM(revenue) - SUM(cost) is a metric whose value is deductively determined by the SUM operator's catalog declaration plus arithmetic; unit_price = revenue / quantity is similarly verified-by-construction). No data attestation is needed; the metric's value is a deductive consequence of the operator catalog and the engine's correct evaluation of arithmetic.
  - Mixed grounding (cross-checked): the metric edge connects a data-stored sibling to a function-derivable one — for example, a stored monthly_revenue column in a monthly_summary schema whose values should equal SUM(revenue) BY month from the transactions schema. I10 attestation verifies that the stored sibling matches the function output applied to the predecessor data. Disagreement is the canonical pre-aggregation-drift case.
  - Tolerated edges: the edge is declared in attestation.tolerated_edges with explicit rationale (per §7.13.5), accepting the disagreement transparently rather than verifying it.
- Attestation enabled where data-attestation is the grounding mechanism. The AC's attestation.enabled is true (the default) when any metric edge requires data-attestation grounding. Opt-out (attestation.enabled: false) hard-caps at AA — see §7.13.6.
- Unattestable edges that would require data-attestation are explicitly enumerated. Edges that are not function-derived and whose predecessor is not in the AC, or whose operator is not partition-invariant, are recorded as unattestable. Their existence does not block AAA if no metric coherence commitment depends on them being grounded; if they represent ungrounded commitments, the AC stays at AA.
What Level AAA gives consumers. MTI is an unconditional guarantee within scope: every dependency the theorem rests on (FD-DAG completeness, value-mapping consistency, coverage map honoring, metric coherence) is grounded by at least one verification mechanism. Pre-aggregation drift is verified absent on data-attested edges; verified-by-construction on function-derived metrics; transparently accepted on tolerated edges. The AC's cross-schema query results are theorem-quality.
Who benefits from Level AAA. Anyone whose decisions depend on cross-schema metric values agreeing — finance reporting where dashboard sums must match detail-report sums; AI agents whose reasoning chains compose results from multiple schemas; regulated reporting where verification matters legally; analytical workflows where silent pre-aggregation drift would be costly.
Cost to author. Variable, depending on the grounding mix. An AC composed primarily of function-derived metrics over a single fact schema (transactions, with derived metrics like profit, unit_price, gross_margin_pct) reaches AAA at essentially the same cost as AA — function-derived metrics are deductively grounded with no per-edge attestation runtime. An AC with pre-aggregated sibling schemas requires the attestation runtime for I10 (single-digit minutes for a typical retail-scale AC, per the platform-design's quantitative targets) plus remediation of any failed edges. For ACs where pre-aggregation drift exists, AAA forces the conversation about whether to fix the drift, accept it via tolerated_edges with declared rationale, or stay at AA.
7.13.4 Grounding sources¶
The level definitions in §7.13.2 and §7.13.3 use the term "grounded" to describe a structural commitment whose truth is verified by at least one mechanism. This subsection makes the verification regimes explicit.
Empirical grounding (data attestation through DQ). A structural commitment is empirically grounded when DQ has examined the actual data and found it consistent with the declared structure. FD-edges from referential tables are empirically grounded by I3; cross-schema value mappings are empirically grounded by I4; pre-aggregated metric siblings are empirically grounded by I10. Empirical grounding is the rigor mechanism for data-borne structural commitments — facts that exist as patterns in stored data, requiring inspection to verify.
Deductive grounding (verification-by-construction through operator catalog semantics). A structural commitment is deductively grounded when its truth follows necessarily from the operator catalog's declarations combined with the data engine's correct evaluation. An FD-edge day → month derived through MONTH_OF is deductively grounded by the catalog's declaration that MONTH_OF is a deterministic unary function with the appropriate type signature. A metric value profit = SUM(revenue) - SUM(cost) is deductively grounded by the catalog's declarations of SUM and arithmetic operators. Deductive grounding is the rigor mechanism for function-borne structural commitments — facts that hold by construction, requiring no data inspection.
Both grounding sources are legitimate. A function-derived structural commitment is not less verified than a data-attested one; it is verified through different epistemic foundations. Verification-by-construction is arguably stronger in a technical sense: it holds for every possible input the function could receive, not just for the specific inputs DQ happened to examine. The level definitions accept both sources because both produce verified truth, just by different means.
Grounding is a property of the commitment, not the mechanism. When the verification status reports an AC at Level AAA, what's being claimed is that every metric coherence commitment in the AC is grounded — by data-attestation, by construction, or by transparent toleration. The mechanism mix is reported informationally (per §7.13.8) but does not alter the level. An AC composed entirely of function-derived metrics over a single source schema reaches AAA cleanly when all its commitments are deductively grounded; the absence of per-edge data-attestation is not a deficiency, because there are no data-borne metric siblings whose coherence requires data verification.
The framework's correctness story is uniform across regimes. Coframe's grammar layer reasons about structural commitments. The four-rule filter, MTI, query resolution, and integrity conditions all operate at the level of commitments — what the AC declares, what the catalog admits, what the verification regime grounds. How each individual commitment is grounded is an implementation detail that affects cost (data-attestation runtime vs. deductive correctness) and applicability (data-attestation requires data; deductive grounding requires catalog-defined functions) but does not affect the framework's structural guarantees.
This is what makes Coframe's grammar layer storage-strategy-agnostic (§2.8.5): an engineer can slide along the function-vs-data spectrum based on performance, storage cost, or convenience, and the AC's verification level — its structural rigor commitment — is stable across the spectrum.
7.13.5 Tolerated edges and AAA¶
An AC achieves AAA with attestation.tolerated_edges declarations provided that:
- Each tolerated edge is named explicitly (no global tolerance).
- Each tolerated edge has an engineer-supplied rationale recorded in the AC catalog.
- The verification status enumerates the tolerated edges and their tolerance rationale.
- MCP query results include the tolerated edges in the coherence_posture field for any query depending on them.
This is consistent with how WCAG AAA accessibility allows specific exemptions when justified: the level is achieved with explicit, documented exceptions, not through silent acceptance. Anonymous tolerance (turning off attestation globally) is attestation.enabled: false and hard-caps at AA, not at AAA.
Tolerated edges apply only to data-attested or mixed-grounded commitments where the engineer chooses to accept disagreement transparently. Function-derived commitments cannot meaningfully be "tolerated" — their grounding is deductive; toleration of a deductive grounding would be either redundant (the commitment is true by construction; nothing to tolerate) or incoherent (rejecting the catalog's correctness claim).
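As a sketch, a toleration declaration satisfying these conditions might look like the following AC catalog fragment. The attestation.enabled, failure_mode, and attestation.tolerated_edges keys appear elsewhere in this chapter; the per-edge edge and rationale keys, and the edge-naming notation, are illustrative assumptions rather than a normative serialization:

```
attestation:
  enabled: true
  failure_mode: soft
  tolerated_edges:
    # each entry names one edge explicitly; no global tolerance
    - edge: transactions.revenue -> monthly_summary.monthly_revenue   # hypothetical notation
      rationale: >
        Legacy ETL rounds to whole cents before monthly aggregation;
        the resulting deltas are bounded and accepted by finance.
```

Under such a declaration, the verification status would enumerate this edge with its rationale, and MCP results for queries depending on it would carry it in coherence_posture.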
7.13.6 Opt-out and the AA cap¶
An AC with attestation.enabled: false cannot achieve Level AAA when its metric coherence commitments require data-attestation grounding. The opt-out is the explicit signal that data-attested cross-schema metric coherence is asserted-not-verified, which is precisely the AA-vs-AAA distinction for the data-borne portion of the AC's commitments.
A subtle case worth naming: an AC composed entirely of function-derived metrics — no pre-aggregated sibling schemas, all derived metrics computed via Frame-QL inline expressions or operator catalog functions — has no metric coherence commitments requiring data-attestation. Such an AC reaches AAA via deductive grounding regardless of whether attestation.enabled is set. The opt-out flag is irrelevant when there's no data-attestation to opt out of. The verification status reports level: AAA, attestation: enabled-but-unused (or equivalent) to document the situation.
Most ACs in practice have a mix of grounding mechanisms and so the opt-out cap applies to the data-attested portion. The cap is structural, not punitive. Opt-out is a legitimate choice for some deployments (proof-of-concept work, small-scale prototypes, ACs whose coherence is verified by upstream pipelines and re-attestation is redundant). Such deployments achieve AA on the data-attested portion (with deductive grounding still contributing where applicable) and have the option to re-enable attestation when production rigor is needed.
7.13.7 Naming-function declined and the levels¶
When an AC declines structured naming (Chapter 3 §3.7.3 Option 4), name-based family membership claims are asserted by the engineer's naming choices and not verified by a declared naming function. This does not directly affect the AC's verification level — naming consistency falls under per-column rules verified at Phase 1 (Level A) — but it does mean the verification status records naming-consistency: asserted rather than verified. This distinction is reported alongside the level in the DQ deliverable and propagated to MCP coherence_posture.
An AC that declines structured naming can still achieve any level (A, AA, or AAA) within its other commitments. Consumers reading the verification status see both the level and the naming-consistency status; they choose whether the asserted-naming posture is acceptable for their use.
7.13.8 Reporting and propagation¶
Each AC's verification status reports its level explicitly, with optional grounding-source breakdown for transparency:
verification_status:
level: AAA
grounding_summary:
fd_edges:
data_attested: 12 # passed I3 attestation
verified_by_construction: 5 # function-derived (e.g., MONTH_OF, BUCKET)
mixed: 2 # both materialized and function-derivable; cross-checked
metric_coherence:
data_attested: 7 # passed I10 attestation
verified_by_construction: 3 # function-derived metrics (Frame-QL expressions)
tolerated: 0 # explicit toleration
unattestable: 1 # predecessor not in AC; recorded but not blocking
# (only blocks AAA if it represents an ungrounded commitment)
attestation:
enabled: true
failure_mode: soft
edges_passed: 7
edges_failed: 0
naming_consistency: verified
coherence_posture: unconditional_within_scope
The level field is the headline. The grounding_summary field is informational — same level either way — but gives consumers visibility into how the level was achieved. AI agents reasoning about result trust can branch on this: a query depending on a function-derived FD-edge is well-grounded (deductive); a query depending on a tolerated metric edge is transparently disagreed-with-rationale; a query depending on a data-attested edge that passed I10 is empirically grounded.
MCP query results propagate the level in the coherence_posture field (§11.7.3). Consumers can branch on the level when assessing result trust without making a separate validate_ac call. The optional grounding-summary is also propagated when consumers request it (per the MCP capability specification in §11.7.3); default propagation surfaces only the level for compactness.
7.13.9 Coframe Pro and levels¶
Coframe Pro preserves the level taxonomy and lifts the empirical/deductive duality to the framework's primary architectural framing (per §1.5 "Generalized functional grammar layer"). Pro extensions that interact with the level definitions:
- User-defined operators participate in deductive grounding when their catalog entries declare partition-invariance, identity-preservation, type signatures, and missing-value treatment correctly. A function-derived metric using a user-defined operator is deductively grounded under the same regime as catalog-built-in operators.
- Federated-edge attestation extends I10 across multi-backend ACs; this contributes to AAA when metric coherence commitments span backends.
- Sensitivity analysis provides bounded-estimate annotations on results from ACs at AA with ungrounded commitments — letting consumers reason about what coherence-uncertainty means quantitatively, not just qualitatively.
- I3-by-attestation vs I3-by-construction reporting is fully formalized in Pro — the verification status surface gives finer-grained breakdowns than Core's grounding summary.
The level definitions are designed to be stable across both Coframe Core and Coframe Pro. An AC at Level AAA in Core remains at AAA when migrated to Pro; Pro's extensions add verification capabilities without changing what the levels mean.
7.14 Where to go next¶
After reading this chapter, the natural next chapters are:
- Chapters 8 and 9: Frame-QL and Query Resolution — how queries execute against the verified AC, including how the framework reasons over the metric genealogy and FD-DAG that DQ produced.
- Chapter 3: ColumnSpec and Naming Machinery — the ColumnSpec specification whose integrity conditions DQ verifies.
- Chapter 5: schema.init Format — the engineer's input artifact whose declarations DQ attests.
- Chapter 6: Data-API Protocol — the backend interface DQ calls.
For the framework's overall posture and the broader structural picture, see the Foundations chapter (Chapter 2).
Part IV: Query¶
Chapter 8: Frame-QL¶
The query language for Coframe Core.
8.1 Overview¶
This chapter specifies Frame-QL: the query language for Coframe Core. Queries in Frame-QL are expressed at the grammar level — referencing columns by their conceptual roles in the AC's vocabulary — rather than at the physical level (referencing tables and joins).
The chapter is organized as follows:
- §8.2 introduces Frame-QL and its design principles.
- §8.3 specifies lexical structure: tokens, comments, case sensitivity, literals.
- §8.4 specifies top-level structure: query forms, top-level grammar.
- §8.5 specifies frame clauses: SELECT, FROM, WHERE, BY, HAVING, ORDER BY, LIMIT.
- §8.6 specifies expressions: reducer expressions, mapper expressions, composite expressions, registered ratio operators.
- §8.7 specifies WITH-blocks.
- §8.8 provides rung-by-rung examples.
- §8.9 specifies disambiguation, including handling of cousin queries.
- §8.10 specifies error messages and diagnostics.
- §8.11 compares Frame-QL with SQL.
Query resolution — how the framework routes Frame-QL queries to backend data — is specified separately in Chapter 9.
8.2 What Frame-QL is¶
Frame-QL is a declarative query language. Queries describe the desired result; the framework determines how to produce it. Queries reference columns by their AC-registered family-names, not by physical column names in backend tables.
A Frame-QL query specifies:
- The columns the result should contain.
- The grain at which the result is anchored (the BY clause).
- Optional filters, ordering, and limits.
The framework resolves the query against the AC, applying the integrity-condition machinery, the four-rule filter, and the operator catalog. The result is a Frame: a collection of column values at the specified grain, with annotations.
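Concretely, a minimal query exhibiting all three parts might read as follows (the family-names revenue, region, and year are illustrative, not drawn from any particular AC):

```
SELECT region, SUM(revenue) AS total_revenue   -- columns the result contains
WHERE year = 2026                              -- optional filter
BY region                                      -- the grain the result is anchored at
ORDER BY total_revenue DESC                    -- optional ordering
```

The framework resolves region, revenue, and year against the AC's registered family-names and routes the query to backend data per Chapter 9.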
Frame-QL is designed to be both human-authorable and machine-emittable. Many users do not write Frame-QL directly — LLMs and authoring tools emit it on behalf of users expressing intent in natural language. Frame-QL is also human-readable for engineers who want direct authoring.
8.2.1 Design principles¶
- Grammar minimalism: Frame-QL provides the constructs needed to express the supported query rungs. No SQL features beyond what the framework supports.
- Declarative: queries describe what the result should be; the framework determines how to produce it.
- Predictable: identical queries produce identical results across runs, subject to data updates.
- Disambiguation explicit: where ambiguity is possible, the syntax provides explicit disambiguation (qualified names, BY-clause grain anchors).
- AC-vocabulary-faithful: queries reference AC family-names, not raw column names from underlying schemas.
8.2.2 What Frame-QL is not¶
Frame-QL is not SQL. Some differences:
- No JOIN clause. Cross-schema reach is automatic via the four-rule filter (Chapter 9).
- No GROUP BY clause. The BY clause specifies the output grain.
- No subqueries (except WITH-chained frames).
- No window functions in Coframe Core (Coframe Pro territory).
Frame-QL is also not a programming language. It's a declarative specification of what the result should be.
8.2.3 Coframe Core vs. Coframe Frame-QL¶
Coframe Core supports a defined subset of Frame-QL:
- Rungs 0, 1, 2, 6, 7, 9 (described in §8.8).
- Closed operator catalog (Chapter 10).
- Single-backend ACs.
- Frame-QL outputs are session-local (not re-ingested as AC content).
Coframe Pro additionally supports broadcast as a first-class operator type (Rung 2 extensions), holistic-within-self reductions (Rung 4), type-changing reductions (Rung 5), epoch transitions (Rung 3 with full machinery), custom operators, multi-backend queries, persistent re-ingestion, and Frame-as-query (where a Frame specification is itself a query). These are outside Coframe Core's scope.
8.3 Lexical structure¶
8.3.1 Tokens¶
Frame-QL source consists of tokens separated by whitespace. The token classes are:
- Identifiers: alphanumeric sequences starting with a letter or underscore, optionally containing dots for qualified references. Examples: revenue, peak_revenue, transactions.amount, revenue_per_customer.
- Keywords: reserved words including SELECT, FROM, WHERE, BY, WITH, AS, AND, OR, NOT, IF, THEN, ELSE, CASE, WHEN, IS, NULL, MISSING, TRUE, FALSE, DISTINCT, IN, BETWEEN, LIKE.
- Operators: arithmetic (+, -, *, /, %), comparison (<, <=, =, <>, >=, >), logical (handled via the keywords AND, OR, NOT).
- Literals: numeric (42, 3.14), string (single-quoted: 'west', '2026-01-01'), boolean (TRUE, FALSE), missing (NULL or MISSING — equivalent in Frame-QL).
- Punctuation: parentheses ( and ), brackets [ and ], comma (,), semicolon (;).
8.3.2 Comments¶
Single-line comments start with -- and extend to end of line. Multi-line comments are delimited by /* and */.
8.3.3 Case sensitivity¶
Keywords and built-in function names are case-insensitive (SELECT and select are equivalent). Identifiers are case-sensitive by default (matching the AC's registered names exactly), with implementation-specific configuration permitting case-insensitive identifier matching.
8.3.4 String literals¶
Single-quoted strings use SQL-style escape: doubled single-quote within a string represents a literal single-quote ('it''s'). String literals support no other escape sequences in Coframe Core; backends handle Unicode and special characters per their native conventions.
8.3.5 Date and timestamp literals¶
Frame-QL parses date/timestamp literals via explicit cast or function-call syntax: CAST('2026-01-01' AS DATE), PARSE_DATE('2026-01-01'), or DATE '2026-01-01' (SQL-standard syntax). String literals are not implicitly converted; explicit cast is required.
8.3.6 Identifier syntax for AC family-names¶
The AC's family-names appear as identifiers in queries. Per Foundations §2.11.3, the framework treats names as opaque labels — Frame-QL parsers consume them as identifier strings and pass them to the resolver for equality comparison against AC declarations.
The parser places minimal constraints on identifier content (alphanumeric, underscores, optionally dots for qualification). Names that include characters outside this set must be quoted; the framework supports backtick-quoted identifiers (`unusual name`) when backends support them.
8.4 Top-level structure¶
8.4.1 Query forms¶
A Frame-QL query is one of:
- A Frame (a SELECT-form or sugar-form query, with mandatory BY on outer Frames).
- A WITH-block (one or more inner Frames followed by an outer Frame).
A Frame has two equivalent forms:
- Explicit form: SELECT select_item_list [FROM ...] [WHERE ...] BY ... [HAVING ...] [ORDER BY ...] [LIMIT ...].
- Sugar form: the SELECT keyword is omitted; the query begins directly with the select_item list.
Outer Frames must have an explicit BY clause. Inner Frames in WITH-blocks may omit BY (inheriting from the outer Frame, see §8.7).
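For illustration, assuming family-names revenue and region, the two forms denote the same Frame:

```
-- Explicit form
SELECT SUM(revenue) AS total_revenue BY region

-- Sugar form: SELECT omitted, query begins with the select_item list
SUM(revenue) AS total_revenue BY region
```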
8.4.2 Top-level grammar¶
query := frame | with_block
frame := select_clause [from_clause] [where_clause] by_clause [having_clause] [order_by_clause] [limit_clause]
select_clause := [SELECT] select_item_list
with_block := WITH inner_frame_list outer_frame
The grammar above is abbreviated; the full BNF specification is published separately as part of the Coframe Core distribution.
8.5 Frame clauses¶
8.5.1 SELECT clause¶
The SELECT clause specifies the columns to include in the result. Each select-item is one of:
- A bare column reference: revenue, region, customer_name. References an AC-registered family-name.
- A qualified reference: transactions.revenue, stores.store_name. Qualifies which schema's appearance of the family-name to use; useful for disambiguation.
- A reducer expression: SUM(revenue), MAX(peak_revenue), COUNT(*), COUNT_DISTINCT(customer).
- A mapper expression: revenue / units_sold, UPPER(customer_name), revenue + tax.
- A composed expression: combinations of mappers and reducers, with parentheses for grouping.
- A literal: 42, 'west', TRUE.
- An aliased item: expression AS name, e.g., SUM(revenue) AS total_revenue.
Items are separated by commas.
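A single SELECT clause may mix these item kinds; all family-names here are illustrative:

```
SELECT
  region,                                       -- bare column reference
  transactions.revenue,                         -- qualified reference
  COUNT_DISTINCT(customer) AS customers,        -- reducer expression, aliased
  SUM(revenue) / SUM(units_sold) AS avg_price,  -- composed expression, aliased
  'retail' AS channel                           -- literal, aliased
BY region
```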
8.5.2 FROM clause¶
The FROM clause is optional in Frame-QL. When present, it lists schemas that contribute to the query.
Use cases for FROM:
- Disambiguation when the framework's automatic schema selection is ambiguous (multiple schemas could serve the query).
- Restricting the framework to specific schemas (e.g., for performance reasons).
- Cousin disambiguation: when a family-name resolves to multiple cousins, the FROM clause restricts to the schemas containing the intended sibling group.
When FROM is omitted, the framework selects schemas via the four-rule filter (Chapter 9).
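For example, if a family-name such as revenue appears in both a transaction-grain schema and a pre-aggregated summary schema (illustrative names), FROM pins resolution to the intended one:

```
SELECT SUM(revenue) AS total_revenue
FROM transactions        -- restrict resolution to this schema
BY region
```

Without the FROM clause, the four-rule filter is free to serve the query from whichever qualifying schema it selects.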
8.5.3 WHERE clause¶
The WHERE clause filters rows before aggregation. The expression must evaluate to a boolean value at the input grain.
WHERE expressions follow standard SQL three-valued logic (TRUE / FALSE / NULL). Rows where the expression evaluates to NULL or FALSE are excluded.
8.5.4 BY clause¶
The BY clause specifies the output grain — the entities each row of the result represents.
The BY clause is mandatory on outer Frames. It specifies the grain explicitly so the framework's resolution machinery knows what aggregation level the result is anchored at.
The BY clause can reference:
- A single AC-dimension (e.g., BY region).
- A tuple of AC-dimensions (e.g., BY (region, year)).
- The grain of a specific schema (e.g., BY transaction means "at transaction grain").
The framework navigates from the input grain to the output grain via the FD-DAG, applying the appropriate ip_reducer per metric column (see Chapter 9).
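The three BY forms, in sugar form with illustrative names:

```
SUM(revenue) BY region;           -- single AC-dimension
SUM(revenue) BY (region, year);   -- tuple of AC-dimensions
revenue, tax BY transaction;      -- grain of a specific schema
```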
8.5.5 HAVING clause¶
The HAVING clause filters output rows after aggregation. The expression evaluates at the output grain.
HAVING differs from WHERE: WHERE filters input rows; HAVING filters output rows. HAVING expressions can reference aggregated values (the result of reducers in SELECT).
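The division of labor between the two clauses, with illustrative names (WHERE prunes rows before aggregation; HAVING prunes rows after it):

```
SELECT region, SUM(revenue) AS total_revenue
WHERE year = 2026                 -- input-grain filter: only 2026 rows are aggregated
BY region
HAVING SUM(revenue) > 1000000     -- output-grain filter: keeps regions above the threshold
```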
8.5.6 ORDER BY clause¶
The ORDER BY clause specifies the result ordering.
Order specifications: ASC (ascending, default) or DESC (descending). Multiple specifications are applied lexicographically.
8.5.7 LIMIT clause¶
The LIMIT clause restricts the result to the first N rows after ordering.
LIMIT applies after ORDER BY; without ORDER BY, LIMIT's selection is implementation-defined.
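Together, ORDER BY and LIMIT express top-N frames; an illustrative top-five-regions query:

```
SELECT region, SUM(revenue) AS total
BY region
ORDER BY total DESC
LIMIT 5
```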
8.6 Expressions¶
8.6.1 Reducer expressions¶
Reducer expressions aggregate over rows.
Identity-preserving reducers (Rung 1):
- SUM(c), AVG(c), MAX(c), MIN(c) — standard reducers per the operator catalog.
- COUNT(*), COUNT(c) — count operations.
- COUNT_DISTINCT(c) — distinct count.
- MEDIAN(c), MODE(c) — distribution-summary operators.
- FIRST(c), LAST(c) — order-based selection.
- STDEV(c), VARIANCE(c) — variance-based statistics.
- CONCAT_AGG(c, separator), STRING_AGG(c, separator), GROUP_CONCAT(c, separator) — string concatenation.
- ARRAY_AGG(c) — array aggregation.
- BOOL_AND(c), BOOL_OR(c) — boolean reducers.
- BIT_AND(c), BIT_OR(c), BIT_XOR(c) — bitwise reducers.
- APPROX_DISTINCT(c) — approximate distinct count.
- APPROX_PERCENTILE(c, p), APPROX_QUANTILE(c, p) — approximate quantile operators.
For each reducer's missing-value treatment, see Chapter 10.
8.6.2 Mapper expressions¶
Mappers operate row-wise.
- Arithmetic: +, -, *, /, %, ^. Standard precedence rules.
- Comparison: =, <> (or !=), <, <=, >, >=. Three-valued logic.
- Logical: AND, OR, NOT. Three-valued logic per Chapter 10.
- String functions: UPPER(s), LOWER(s), TRIM(s), SUBSTRING(s, start, length), LENGTH(s), CONCAT(s1, s2, ...), etc.
- Date/time functions: DATE_ADD(d, interval), DATE_DIFF(d1, d2, unit), EXTRACT(field FROM d), etc.
- Type conversion: CAST(expr AS type), TO_INT(s), TO_STRING(n), etc.
- Missing-value handling: COALESCE(c1, c2, ..., default), IFNULL(c, replacement), NULLIF(c, value).
- Conditional: CASE WHEN cond1 THEN val1 [WHEN cond2 THEN val2 ...] [ELSE valN] END, IF(cond, true_val, false_val).
For each mapper's missing-value treatment, see Chapter 10.
8.6.3 Composite expressions¶
Mappers and reducers can compose:
- revenue / units_sold — mapper composing two columns.
- SUM(revenue) / SUM(units_sold) — reducers composed via division.
- 100 * SUM(revenue) / SUM(total_market_revenue) — percentage operation.
- RATIO_OF(revenue, units_sold) — registered ratio operator (see §8.6.4).
The framework handles operator precedence and missing-value propagation per Chapter 10.
8.6.4 Registered ratio operators¶
For commonly-needed ratios, Coframe Core supports lightweight registered ratio operators:
- RATIO_OF(numerator, denominator) — computes the ratio of two reducers (typically SUMs) at the output grain. Handles missing-value treatment per the catalog.
- COUNT_OF(filter_expression) — counts rows where the filter is TRUE. Useful for conditional counting.
Registered ratios are a convenience layer; their behavior is fully specified in Chapter 10.
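An illustrative registered-ratio frame, equivalent in intent to SUM(revenue) / SUM(units_sold) at the output grain (the alias is illustrative):

```
SELECT region, RATIO_OF(revenue, units_sold) AS avg_unit_price
BY region
```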
8.6.5 Qualified references and disambiguation¶
When the same family-name appears in multiple schemas with different family-roots (i.e., cousins in the AC's metric genealogy, per Foundations §2.7.5), the engineer can qualify references:
- transactions.revenue — the revenue column from the transactions schema.
- monthly_summary.revenue — the revenue column from the monthly_summary schema.
Qualified references override the framework's automatic schema selection for that specific column. The framework's four-rule filter still applies at the query level; qualified references constrain which schemas can serve specific columns.
For the cousin disambiguation use case, see §8.9.
8.7 WITH-blocks¶
8.7.1 WITH-block structure¶
WITH-blocks let engineers define intermediate frames (Rung 9). Each inner frame produces a result that subsequent frames can reference.
```
WITH
  inner_frame_1 AS (
    SELECT ... BY ...
  ),
  inner_frame_2 AS (
    SELECT ... FROM inner_frame_1 ... BY ...
  )
outer_frame_query
```
The outer frame is the query's actual output. Inner frames are intermediate; their content is available within the WITH-block.
8.7.2 Session-local intermediates¶
In Coframe Core, WITH-frame outputs are session-local:
- They exist for the duration of the query session.
- They are not registered as persistent AC schemas.
- They are not visible across sessions.
- They are not re-queryable through future sessions' four-rule filter.
This is a Coframe Core-specific simplification. Coframe Pro supports persistent Frame-QL outputs through the re-ingestion workflow; Coframe Core does not.
8.7.3 Inner frame semantics¶
Inner frames behave like regular frames: they can have SELECT, FROM, WHERE, BY, HAVING, ORDER BY, LIMIT. Inner frames may omit BY if they are conceptually at the same grain as the outer frame (the framework infers the grain).
Inner frames can reference earlier inner frames in the same WITH-block, building up complex queries step by step.
8.7.4 Example¶
```
WITH
  region_revenue AS (
    SELECT region, SUM(revenue) AS total
    BY region
  ),
  region_customer_count AS (
    SELECT region, COUNT_DISTINCT(customer) AS customers
    BY region
  )
SELECT region, total / customers AS revenue_per_customer
FROM region_revenue, region_customer_count
BY region
ORDER BY revenue_per_customer DESC
```
The inner frames compute regional totals and customer counts; the outer frame combines them into a per-region ratio.
8.8 Rung-by-rung examples¶
This section illustrates each Coframe Core-supported Frame-QL rung with examples.
8.8.1 Rung 0: Read¶
Reading column values directly.
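A read of this shape (illustrative; column names follow the manual's running example):

```
SELECT customer, name
BY customer
```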
Result: each customer with their name.
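A filtered read at transaction grain (illustrative):

```
SELECT transaction, revenue, store
WHERE year = 2026
BY transaction
```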
Result: each 2026 transaction with its revenue and store.
8.8.2 Rung 2: Broadcast¶
Broadcast attribute or dimension values across rows. Broadcast in Coframe Core is handled at query time; the framework's resolver applies it automatically when an attribute from a coarser-grain schema is requested at a finer-grain anchor.
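For example (illustrative; region_name is assumed to live in a regions reference table, per the result description):

```
SELECT transaction, revenue, region_name
BY transaction
```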
Result: each transaction with its revenue and the region_name of the transaction's region. The framework resolves region_name via FD-DAG navigation (transaction → store → region) and broadcasts from the regions reference table.
8.8.3 Rung 1: Identity-preserving reduction¶
Aggregate via the column's family ip_reducer.
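An aggregation of this shape (illustrative):

```
SELECT region, SUM(revenue) AS revenue
BY region
```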
Result: total revenue per region. The framework navigates from revenue's source anchor (transaction grain in the transactions schema, or coarser if a sibling exists) to (region) grain via the FD-DAG, summing along the way.
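The same reduction at a composite grain (illustrative):

```
SELECT region, year, SUM(revenue) AS revenue
BY (region, year)
```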
Result: total revenue per (region, year). Same identity-preserving reduction at a composite grain.
8.8.4 Rung 6: Multi-input expressions¶
Compose mappers across multiple columns within a frame.
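A single-expression example (illustrative):

```
SELECT region, SUM(revenue) / SUM(units_sold) AS revenue_per_unit
BY region
```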
Result: per region, the ratio of total revenue to total units sold.
```
SELECT region, year,
       SUM(revenue) AS revenue,
       COUNT_DISTINCT(customer) AS customers,
       SUM(revenue) / COUNT_DISTINCT(customer) AS revenue_per_customer
BY (region, year)
```
Result: per (region, year), revenue, customer count, and revenue per customer.
8.8.5 Rung 7: Cross-schema reach¶
Queries that draw on multiple schemas, with the four-rule filter determining schema reachability.
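A query of this shape (illustrative):

```
SELECT region, SUM(revenue) AS revenue, COUNT_DISTINCT(customer) AS customers
BY region
```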
The framework picks schemas for revenue and customer separately. Revenue may come from transactions (transaction-grain) or store_monthly_summary (store-month grain), navigated to (region). The customer column may come from transactions, navigated similarly. Both selections produce equivalent results (per the Multi-Table Invariance theorem, Chapter 9).
8.8.6 Rung 9: WITH-chained frames¶
Session-local intermediate frames that subsequent queries reference.
(See §8.7.4 for a worked example.)
8.8.7 What Coframe Core does not support¶
Frame-QL Rungs 3, 4, and 5 are simplified or Coframe Pro-only in Coframe Core:
- Rung 3 (epoch transitions): simplified. In Coframe Core, derived columns are declared via DNA in schema.init (per Chapters 3 and 4), not constructed through query-time epoch transitions. Cross-grain navigation via the family ip_reducer works (Rung 1); ad-hoc query-time transitions to new families are Coframe Pro territory.
- Rung 4 (holistic-within-self reductions): Coframe Pro-only. Coframe Core does not provide the identity-tracking machinery these require.
- Rung 5 (type-changing reductions): simplified. COUNT and COUNT_DISTINCT are supported as defined operators (Rung 1 with type-change in the catalog); richer count-of machinery is Coframe Pro territory.
8.9 Disambiguation¶
8.9.1 When disambiguation is needed¶
A Frame-QL query is ambiguous when:
- A bare family-name reference resolves to multiple cousins in the AC. (Same family-name, different family-roots — per Foundations §2.7.5.)
- The four-rule filter produces multiple incompatible survivors that cannot be reconciled by the framework.
- A column appears at multiple anchorings in different schemas, and the BY clause is reachable from multiple paths via the FD-DAG.
When ambiguous, the framework refuses the query with a dubious-query diagnostic (per §8.11.4 and Chapter 9).
8.9.2 Disambiguation mechanisms¶
Engineers disambiguate via:
- Qualified references: transactions.revenue constrains the family-name to the column in the transactions schema, isolating one cousin from others.
- Explicit FROM clause: FROM transactions, stores restricts the query to specific schemas. Cousins outside the FROM list are excluded.
- BY-clause grain anchors: BY transaction (rather than BY (region)) specifies the grain explicitly so the framework's resolution path is unambiguous.
After disambiguation, the framework re-resolves the query; if the disambiguation suffices, the query proceeds.
8.9.3 Cousin disambiguation example¶
Consider an AC where peak_concurrent_users appears as two cousins:
- In a system_metrics_hourly schema, anchored at (server, hour).
- In a product_analytics_daily schema, anchored at (region, day), computed from session logs.
Both share the family-name peak_concurrent_users but trace to different family-roots in the AC's metric genealogy. They are cousins.
A query referencing the bare name:
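For instance (illustrative shape):

```
SELECT day, peak_concurrent_users
BY day
```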
The framework refuses this query as dubious: peak_concurrent_users resolves to two cousins. The diagnostic surfaces:
```
DUBIOUS: query references 'peak_concurrent_users' which appears in this AC
with two distinct family-roots:
  - family-root in system_metrics_hourly (E = {server, hour})
  - family-root in product_analytics_daily (E = {region, day})
Specify with qualified reference (e.g., system_metrics_hourly.peak_concurrent_users)
or explicit FROM clause.
```
The engineer disambiguates:
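For example, with a qualified reference (illustrative):

```
SELECT day, product_analytics_daily.peak_concurrent_users
BY day
```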
The query now references one specific cousin's family; the framework resolves it.
8.10 Frame-QL semantics summary¶
Operationally, a Frame-QL query proceeds through:
- Parse: the query's text is parsed into an AST per the grammar (§8.4).
- Resolve names: every column reference is bound to AC family-names (per Foundations §2.7.3).
- Type-check: every operator and expression is type-validated per the operator catalog (Chapter 10).
- Schema selection: the four-rule filter (Chapter 9) selects schemas to serve each column term.
- Plan execution: the framework constructs a query execution plan over the selected schemas.
- Execute: the data-API performs the query; results are returned.
Steps 4 and 5 are specified in Chapter 9.
8.11 Error messages and diagnostics¶
8.11.1 Parse errors¶
Errors in lexical structure or grammar produce parse errors. The framework reports:
- The character position where the error was detected.
- The expected tokens at that position.
- A suggested correction when possible.
Example:
```
Parse error at character 47 in query:
SELECT region, SUM(revenue BY region
                           ^
Expected ')' to close SUM(revenue, but found 'BY'.
```
8.11.2 Binding errors¶
Errors in resolving family-names against the AC produce binding errors:
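An illustrative diagnostic (wording assumed; the format follows the conventions of §8.11.8):

```
Binding error: 'revnue' does not bind to any family-name in this AC.
Did you mean 'revenue'?
```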
8.11.3 Resolution errors¶
Errors during the four-rule filter or schema selection (per Chapter 9) produce resolution errors:
```
Resolution error: no schema can serve 'revenue' at grain (country) —
schemas containing revenue do not reach country anchor via FD-DAG.
```
8.11.4 Dubious-query errors¶
When a query is dubious per §8.9.1, the framework refuses it with a dubious-query diagnostic:
```
DUBIOUS: query references 'peak_revenue' which has multiple resolutions:
  - peak_revenue in store_monthly_summary (family-root in transactions schema)
  - peak_revenue in regional_summary (family-root in regional_summary schema)
These are cousins (same family-name, different family-roots) and produce
different results. Disambiguate via qualified reference or explicit FROM.
```
8.11.5 Integrity-condition errors¶
If a query's resolution would depend on integrity conditions that have failed (per Chapter 7), the framework refuses:
```
Integrity error: this query would depend on the FD-edge 'store → region',
which the data does not attest. Re-run DQ or remove the dependency.
```
8.11.6 Operator-rule errors¶
Type-checking errors during expression evaluation:
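An illustrative diagnostic (wording assumed):

```
Operator error: SUM requires a numeric argument; 'region' is an
AC-dimension with string values.
```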
8.11.7 Backend errors¶
Errors reported by the backend during query execution are passed through:
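An illustrative pass-through (wording assumed; the underlying message is backend-specific):

```
Backend error: division by zero in expression 'total / customers'.
```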
8.11.8 Diagnostic conventions¶
All diagnostics include:
- The error category (parse, binding, resolution, dubious, integrity, operator, backend).
- The query position or column where the error was detected.
- A descriptive message.
- A suggested remediation when possible.
8.12 Comparison with SQL¶
For engineers familiar with SQL, Frame-QL is similar in spirit but different in mechanics. The principal differences:
- No JOIN. Cross-schema reach is automatic. The framework's four-rule filter (Chapter 9) handles what JOIN handles in SQL, but without engineer-authored join conditions.
- No GROUP BY. The BY clause specifies the output grain. Aggregation is automatic per the grain.
- AC family-names instead of physical column names. Queries reference the AC's vocabulary; the framework maps to backend tables.
- Dubious-query refusal. SQL silently produces results from joins of unrelated tables; Frame-QL refuses such queries until disambiguation.
- Built-in integrity checking. The framework checks integrity conditions before query execution. Queries depending on broken integrity fail fast.
- No window functions (in Coframe Core; available in Coframe Pro).
- Built-in handling of ratios, percentages, multi-grain navigation. Operations that require careful SQL authoring are direct Frame-QL constructs.
Frame-QL queries are typically shorter than equivalent SQL because the framework handles structural complexity automatically. SQL queries make all structural decisions explicit; Frame-QL queries make analytical intent explicit and let the framework handle the structure.
8.13 Summary¶
Frame-QL is the declarative query language for Coframe Core. Queries reference AC family-names, specify output grain via BY clauses, and let the framework handle schema selection, FD-DAG navigation, and aggregation per the AC's structural commitments.
Coframe Core supports Rungs 0, 1, 2, 6, 7, 9 of the full Frame-QL surface. Within this scope, queries can express most analytical needs.
Disambiguation through qualified references and explicit FROM clauses handles cousin cases and other ambiguity. The framework's dubious-query mechanism refuses ambiguous queries rather than silently producing one of multiple possible interpretations.
For how queries are resolved (the four-rule filter, the Multi-Table Invariance theorem, schema selection logic), see Chapter 9.
Chapter 9: Query Resolution¶
How Frame-QL queries are resolved against an Analytics Collection: schema selection, the four-rule filter, the Multi-Table Invariance theorem, and the dubious-query mechanism.
9.1 Overview¶
This chapter specifies the framework's query resolution process: how a parsed Frame-QL query is mapped to backend data via the AC's structural metadata. Resolution is the process between Frame-QL parsing (Chapter 8) and backend query execution.
The chapter is organized as follows:
- §9.2 frames the query resolution problem.
- §9.3 specifies the resolution pipeline.
- §9.4 specifies single-schema resolution.
- §9.5 specifies cross-schema resolution and the four-rule filter.
- §9.6 specifies the Multi-Table Invariance theorem.
- §9.7 specifies the dubious-query mechanism.
- §9.8 specifies resolution errors and diagnostics.
The chapter assumes familiarity with the Foundations chapter (Chapter 2) — particularly §2.6 (operations linking predecessor and successor metrics), §2.7 (DNA, family, metric genealogy, and the structural relations identical/sibling/cousin), and §2.8 (FD-DAG) — and Chapter 8 (Frame-QL).
9.2 The query resolution problem¶
Given a parsed Frame-QL query Q (per Chapter 8) and an AC that has passed DQ verification (per Chapter 7), the framework must produce a query plan that:
- Selects which schemas in the AC contribute data to which column terms in Q.
- Determines the navigation paths from the schemas' anchors to Q's target anchor (via FD-DAG).
- Plans the aggregation operations applying along each path.
- Resolves any disambiguation Q requires.
- Produces a backend-executable plan that, when run, produces correct results.
The resolution process is structurally rigorous: given the AC's principles and integrity conditions, resolution either succeeds with a specific plan, or fails with a specific diagnostic. There is no "best-effort" resolution; queries either resolve cleanly or surface errors.
9.3 The resolution pipeline¶
A Frame-QL query Q proceeds through the following resolution pipeline:
- Parse: the query text is parsed into an AST per the grammar (Chapter 8).
- Bind names: each column reference is bound to the AC's family-name. Unrecognized names produce binding errors.
- Type-check: each operator and expression is type-validated against the operator catalog (Chapter 10).
- Identify column terms: the resolver enumerates the column terms in Q — the (family-name, expression-context) pairs where the resolver must select a schema source.
- Apply the four-rule filter to each column term: identify the schemas in the AC that can serve this term.
- Detect dubious cases: if any column term resolves to multiple cousins, refuse with a dubious-query diagnostic.
- Select among surviving schemas: for each column term, the resolver picks one schema among the survivors per cost-based heuristics.
- Construct execution plan: the framework assembles backend operations (per the data-API protocol, Chapter 6) implementing the query.
- Execute: the backend runs the plan; results are returned to the caller with annotations.
Resolution is largely automatic; the engineer's role is to author Frame-QL queries that the resolver can handle cleanly. Disambiguation mechanisms (Chapter 8 §8.9) let the engineer override the resolver's defaults when needed.
9.4 Single-schema resolution¶
When a query's column terms can all be served from a single schema, resolution is straightforward.
9.4.1 Same-grain: Rung 0¶
A query reading columns at the schema's grain involves no aggregation. The resolver:
- Identifies the schema containing all referenced columns.
- Constructs a backend SELECT-with-WHERE-and-projection query.
- Executes via the data-API.
Example (Rung 0):
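A query of this shape (illustrative; the third column name, segment, is assumed):

```
SELECT customer, name, segment
WHERE segment = 'enterprise'
BY customer
```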
The customers schema provides all three columns at customer grain. The resolver selects this schema, applies the WHERE filter, and projects the requested columns.
9.4.2 Coarser grain via FD-DAG: Rung 1¶
A query asking for an AC-metric at a coarser anchor than its source requires identity-preserving reduction (the family ip_reducer applied at the target grain).
Example (Rung 1):
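A query of this shape (illustrative):

```
SELECT region, SUM(revenue) AS revenue
BY region
```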
The transactions schema has revenue at transaction grain. The framework navigates the FD-DAG: transaction → store → region. Applying the revenue family's ip_reducer (SUM) at (region) grain produces total revenue per region.
The resolver:
- Identifies the schema where revenue is available (transactions).
- Determines the FD-DAG path from the schema's grain (transaction) to the target grain (region).
- Constructs a backend GROUP BY operation: group transactions by store, navigate to region, sum revenue.
- Executes.
9.4.3 Composite grain navigation¶
A query at a composite grain combines navigation paths.
```
SELECT region, year, SUM(revenue) AS revenue, COUNT_DISTINCT(customer) AS customers
BY (region, year)
```
The resolver navigates transaction → store → region for the spatial dimension and date → year for the temporal dimension. The composite grain (region, year) is the target; the framework groups transactions by (region, year) and applies the appropriate reducers.
9.4.4 Broadcast: Rung 2¶
When a query references attributes from a coarser-grain schema at a finer-grain anchor, the framework broadcasts.
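For example (illustrative):

```
SELECT transaction, store_name
BY transaction
```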
store_name is in the stores schema at store grain. transaction is at transaction grain. The framework:
- Identifies the FD-DAG path: transaction → store.
- Joins the transactions schema with the stores schema on store_id.
- Projects transaction and store_name.
Broadcast is handled at query time via this FD-DAG-mediated join, not via a per-column derivation.
9.4.5 Multi-input expressions: Rung 6¶
Multi-input expressions compute new values from multiple columns at the same grain.
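For example (illustrative):

```
SELECT region, SUM(revenue) / SUM(units_sold) AS revenue_per_unit
BY region
```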
The resolver computes SUM(revenue) and SUM(units_sold) at region grain, then divides. The division happens after aggregation at the output grain.
9.4.6 What single-schema resolution doesn't include in Coframe Core¶
Single-schema resolution does not handle:
- Cross-schema reach (Rung 7 — see §9.5).
- Epoch transitions or holistic-within-self reductions (Rungs 3, 4 — Coframe Pro territory).
9.5 Cross-schema resolution¶
Cross-schema resolution applies when a query could be served from multiple schemas, or when a single column's resolution requires combining schemas.
9.5.1 The four-rule filter¶
The four-rule filter determines which schemas can serve a given column-and-grain requirement. For each column term in Q, each candidate schema S in the AC is tested against four rules. Schemas failing any rule are dropped.
Given:
- A column term referencing family-name c.
- A required output entity set E_out (from the BY clause).
- A query potentially filtering dimensions to restricted value sets.
Rule 1: Family membership¶
Drop S if S has no column with family-name c.
The framework tests by checking each ColumnSpec in S: does its name equal c (string equality)?
In the redesigned framework, family membership is by name equality on the name field. Two columns with the same name belong to the same family.
Rule 2: Entity-set capability¶
Drop S based on the column term's shape, using the FD-DAG-extended entity set E*(c, S):
- For reducer(c): drop S if E*(c, S) ⊉ E_out. The schema must be able to reach the output's grain via FD-DAG paths from its actual E.
- For bare c: drop S if E*(c, S) ⊉ E_out (the output grain must be reachable so that broadcast is well-defined).
E*(c, S) is the closure of E(c, S) under FD-DAG reachability — both upward (coarser ancestors) and downward (finer descendants).
Rule 3: Coverage consistency¶
Drop S if there exists a dimension d ∈ E*(c, S) such that S's value-set for d (per quasi-metadata) is a strict subset of the query's required value-set for d.
Coverage requires that the schema can produce results for every value the query asks about. If the query asks for revenue across all 2026 dates and S only has data for January 2026, S fails Rule 3.
Declared degeneracy modifies coverage analysis: a schema declared degenerate on dimension d with explicit value-set V is treated as covering exactly V; queries requiring values outside V against this schema fail Rule 3.
Rule 4: Family-root agreement (sibling check)¶
Drop S if S's column with family-name c has a different family-root than the query's intended family-root.
The framework computes the family-root for each S's column via DNA-walk (per Foundations §2.7.4). The query's intended family-root is determined by the query's context — typically the most common family-root among schemas passing Rules 1, 2, 3, with cousin disambiguation triggered if multiple roots are present.
In the redesigned framework, Rule 4 is the sibling check: among schemas with the queried family-name and reachable anchor with adequate coverage, are they siblings (same family-root) or cousins (different family-root)?
- Siblings: structurally interchangeable; MTI applies.
- Cousins: not interchangeable; the framework requires disambiguation.
9.5.2 Surviving schemas¶
After applying all four rules:
Surviving schemas = {S in AC : S satisfies Rules 1-4 for the given column term and grain}.
Four cases:
- No surviving schemas: the query cannot be served. Resolution error: "no schema can serve c at grain E_out."
- Exactly one surviving schema: the framework uses it.
- Multiple surviving schemas with the same family-root (siblings): see §9.5.3.
- Multiple surviving schemas with different family-roots (cousins): see §9.7 (the dubious-query mechanism).
9.5.3 Multiple surviving schemas¶
When multiple schemas survive the four-rule filter and share the same family-root (siblings), the framework's reasoning is:
The Multi-Table Invariance (MTI) theorem (§9.6) guarantees that surviving schemas produce equivalent results. The framework can pick any surviving schema; the choice is operationally free.
Selection criteria (operationally):
- Cost minimization: pick the schema with the lowest expected backend cost (smaller table, pre-aggregated data, etc.).
- Cache locality: prefer schemas whose data is recently computed.
- Engineer constraint: if the query's FROM clause restricts to specific schemas, only those are considered.
The selection doesn't affect correctness; it affects performance. The framework uses heuristics; engineers don't typically intervene.
9.5.4 Mixed schemas for different column terms¶
A query may need different schemas for different column terms. Example:
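A query of this shape (illustrative):

```
SELECT region, SUM(revenue) AS revenue, COUNT_DISTINCT(customer) AS customers
BY region
```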
The framework resolves each column term independently:
- SUM(revenue): four-rule filter on schemas with the revenue family-name; survivors include transactions, store_monthly_summary, etc.
- COUNT_DISTINCT(customer): four-rule filter on schemas with the customer family-name; survivors include the customers reference table, transactions, etc.
The framework picks one schema per column term (subject to MTI guarantees) and composes the results at the BY-clause grain via FD-DAG navigation.
9.5.5 Broadcast and reference table reach¶
A query referencing attributes from a reference table broadcasts via the FD-DAG.
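For example (illustrative):

```
SELECT transaction, store_name
BY transaction
```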
Resolution:
- Bind transaction to the transactions schema.
- Bind store_name to the stores schema (the only schema with this family-name).
- Apply Rule 2 for store_name: E(store_name, stores) = {store}; via the FD-DAG, store is reachable from transaction (transaction → store), so E*(store_name, stores) ⊇ {transaction}. Rule 2 passes.
- Plan: read stores.store_name and broadcast to transactions via the transaction → store mapping.
Cross-schema reach via reference tables is handled uniformly through the four-rule filter.
9.6 The Multi-Table Invariance theorem¶
9.6.1 Statement¶
For a Frame-QL query Q in Coframe Core's supported rungs (0, 1, 2, 6, 7, 9), and an AC satisfying the integrity conditions (Chapter 7):
MTI: any two schemas surviving the four-rule filter for Q with the same family-root produce equivalent results when used to resolve Q.
In other words, among schemas that pass Rules 1-4 with the same family-root (siblings), the framework's choice is operationally free; correctness is preserved across the choice.
9.6.2 What MTI rests on¶
MTI rests on:
- Principle 2 (Foundations §2.2.2): schemas observe the same universe of entities. Cross-schema agreement is structural.
- The integrity conditions verified by DQ (Chapter 7): cross-schema value-mapping consistency for AC-dimensions and AC-attributes, FD-DAG attestation, coverage map honoring.
- Cross-schema metric coherence: in the default Coframe Core configuration (attestation.enabled: true), this is verified per attestable DNA edge during DQ Phase 3 (Chapter 7 §7.6.8). In opted-out configurations (attestation.enabled: false), it is asserted from Principle 2 plus the ip_reducer's partition-invariance, but not directly attested.
In the default configuration, MTI is an unconditional guarantee within Coframe Core's scope when DQ has passed: every dependency is verified, including cross-schema metric coherence per attestable edge. Edges marked unattestable (predecessor not present, operator not partition-invariant) are explicitly recorded in the verification status; MTI applies among schemas whose attestable edges have all passed.
In opted-out configurations, MTI is a conditional guarantee: it depends on the engineer's commitment to ETL coherence between predecessor and successor schemas. The opted-out posture is recorded in the AC's verification status and propagated as coherence-asserted-not-verified annotations on query results.
9.6.3 Where MTI applies in Coframe Core¶
MTI applies for:
- Rungs 0, 1, 2: read, identity-preserving reduction, broadcast — covered by dimension/attribute consistency and FD-DAG reach.
- Rung 6: multi-input expressions — covered by per-column reasoning composing MTI across each input.
- Rung 7: cross-schema reach — the four-rule filter selects MTI-equivalent (sibling) schemas.
- Rung 9: WITH-chained frames — each frame is independently MTI-protected; chaining preserves the property within the session.
MTI does not apply (in Coframe Core) to:
- Rungs 3, 4, 5: simplified or not supported.
9.6.4 What MTI requires structurally¶
MTI's structural requirements:
- The schemas under consideration share the same family-root for the queried column.
- The schemas pass the four-rule filter (Rules 1-4).
- The integrity conditions hold (per DQ).
- The family's ip_reducer is partition-invariant (i.e., the family has an ip_reducer; the family is not anchor-locked).
For families without an ip_reducer (those rooted at non-partition-invariant operators per Chapter 10 §10.4), MTI does not apply because cross-anchor navigation is not available. Such families' columns are accessible only at their declared anchors; queries at other anchors against these families fail Rule 2 of the four-rule filter.
9.6.5 Why MTI matters for Coframe Core¶
MTI's practical value:
- Engineers don't need to specify which schema to use. The framework selects; correctness is preserved among siblings.
- Performance optimization is principled. Backends can choose pre-aggregated schemas without worrying about correctness drift.
- AC pluralism is preserved. Multiple ACs over the same data, each with different schema structures, all produce correct results for queries within their scope.
MTI is the structural guarantee that makes Coframe Core's automated resolution possible. Without it, the framework would have to involve engineers in schema selection or risk incorrect results.
9.6.6 What can violate MTI¶
MTI rests on the integrity conditions and the cross-schema metric coherence verification (or, in opted-out configurations, the metric coherence lemma). If those fail, MTI may fail.
In the default configuration (attestation enabled), the following are detected at AC validation time:
- Cross-schema value-mapping inconsistency: detected by the FD-DAG attestation and value-mapping consistency checks (§7.6.3, §7.6.4). Hard violations unless trust_declared_FD is set.
- Coverage gaps: detected by the coverage-map analysis (§7.6.2, §7.6.6.2). Hard violations when a schema declared non-degenerate fails to cover its dimension's universe.
- Cross-schema metric coherence violations: detected by per-DNA-edge attestation (§7.6.8). Failure-mode-driven response: hard validation failure (strict), soft advisory with query-result annotations (default), or per-edge tolerance (explicit).
In default configurations, MTI is an unconditional guarantee for queries that resolve via attested-passed edges. Queries that draw on edges with unresolved coherence advisories carry the warning into the result.
In opted-out configurations (attestation.enabled: false), cross-schema metric coherence is not detected at validation time. Lemma violations may surface only when the engineer cross-checks query results across schemas. The optional approach in opted-out mode is to enable attestation selectively per-edge (Coframe Pro feature) or to re-enable global attestation when production rigor is needed.
The verification status reports the AC's coherence posture explicitly; MCP responses propagate this to LLM clients so AI-agent-mediated queries inherit visibility into the rigor configuration of the AC they're querying.
9.7 The dubious-query mechanism¶
9.7.1 When queries are dubious¶
A query is dubious when its resolution would produce ambiguous results that the framework cannot disambiguate without engineer input. The principal cases:
- The query's family-name resolves to multiple cousins (different family-roots) in the AC, and the FROM clause and qualified references don't isolate to a single root.
- The four-rule filter produces survivors that include multiple incompatible family-roots.
- The BY clause is reachable from multiple paths via the FD-DAG, and the paths produce different results (rare; typically caught at AC validation).
9.7.2 The framework's response¶
The framework refuses dubious queries. It surfaces a diagnostic enumerating the possible interpretations and asks the engineer to disambiguate.
Example:
DUBIOUS: query references 'revenue' which appears in this AC with two
distinct family-roots:
- revenue family-root in transactions schema
- revenue family-root in monthly_summary_v2 schema (independent observation)
These are cousins, not siblings: they trace to different family-roots in
the metric genealogy and produce different results under the same query.
Disambiguate via:
- Qualified reference: transactions.revenue or monthly_summary_v2.revenue
- Explicit FROM clause: FROM transactions, ... or FROM monthly_summary_v2, ...
The framework does not pick a default. Ambiguity is a structural concern engineers must resolve; the framework will not resolve it by silent choice.
9.7.3 Disambiguation mechanisms¶
Per Chapter 8 §8.9, engineers disambiguate via:
- Qualified references (transactions.revenue).
- Explicit FROM clause restricting candidate schemas.
- BY-clause grain anchors specifying the navigation path explicitly.
After disambiguation, the framework re-resolves the query. If the disambiguation suffices, the query proceeds.
9.7.4 The framework's posture¶
The framework's binary correctness commitment is preserved by the dubious-query mechanism. Either a query has a unique interpretation (proceeds) or it has multiple (refuses). There's no "pick one and warn" mode.
This is consistent with the framework's posture throughout: structural facts decide outcomes; engineers don't override the structural reasoning to opt into permissiveness.
9.7.5 The cousin case as the canonical dubious case¶
In the redesigned framework, the canonical dubious-query case is the cousin case: a family-name resolves to multiple cousins, indicating that the AC has two non-equivalent metrics sharing a name. The dubious-query mechanism handles this directly:
- The framework computes family-roots via DNA-walk on each candidate schema.
- If multiple distinct family-roots survive, cousins are present.
- The framework refuses with a diagnostic naming the cousins and their respective family-roots.
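The steps above can be sketched as follows. This is a minimal illustration, not Coframe Core's implementation: the data model (a mapping from surviving schema to the family-root its DNA-walk produced) and the helper name `resolve_family` are hypothetical.

```python
# Hypothetical sketch of cousin detection: `survivors` maps each surviving
# schema name to the family-root its DNA-walk produced for the queried term.

def resolve_family(term, survivors):
    """Group surviving schemas by family-root; refuse if cousins remain."""
    by_root = {}
    for schema, family_root in survivors.items():
        by_root.setdefault(family_root, []).append(schema)
    if len(by_root) > 1:
        roots = ", ".join(sorted(by_root))
        raise ValueError(f"DUBIOUS: '{term}' resolves to multiple family-roots: {roots}")
    # Siblings: a single family-root; MTI says any survivor is equivalent.
    ((root, schemas),) = by_root.items()
    return root, sorted(schemas)

# Siblings: both schemas trace 'revenue' to the same family-root.
root, schemas = resolve_family("revenue", {
    "transactions": "transactions.revenue",
    "store_daily_summary": "transactions.revenue",
})
```

In the sibling case the call returns normally; with two distinct roots it raises, mirroring the refusal diagnostic in §9.7.2.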
This is sharper than the original framework's dubious-query mechanism, which handled multiple sources of ambiguity. The redesigned version centers on cousin disambiguation as the primary case, with other ambiguity sources caught at AC validation (per Chapter 7) or at Rule 4 of the four-rule filter.
9.8 Resolution errors and diagnostics¶
9.8.1 Categories¶
Resolution errors fall into the following categories:
- No-survivor errors: no schema in the AC can serve a column term at the requested grain.
- Coverage-gap errors: the schema's coverage doesn't include the query's required values.
- Anchor-unreachable errors: the schema's anchor doesn't reach the query's target grain via FD-DAG.
- Dubious-query errors: cousins surface; engineer disambiguation required.
- Integrity-condition errors: a query depends on an integrity condition that has failed.
9.8.2 Resolution error specifics¶
No-survivor errors¶
Resolution error: no schema in AC 'retail_analytics_v1' contains the family
'monthly_active_users'. Family must be declared in at least one schema.
Coverage-gap errors¶
Resolution error: the query asks for revenue across all of 2026, but the
schemas containing revenue cover:
- transactions: dates from 2025-06-01 onward
- store_monthly_summary: months from 2026-01 to 2026-09
Coverage gaps in either schema prevent serving the full year.
Revise the WHERE clause or wait for coverage updates.
Anchor-unreachable errors¶
Resolution error: cannot serve 'revenue' at grain (continent).
The transactions schema (E={transaction}) reaches at most country via
the FD-DAG. There is no path in the AC's FD-DAG reaching (continent).
Dubious-query errors¶
(See §9.7.2 for example.)
Integrity-condition errors¶
Integrity error: this query would depend on the FD-edge 'store → region',
which the data does not attest. The latest DQ run shows 12 stores with
multiple region mappings. Re-run DQ to update or address the data
violations before proceeding.
9.8.3 Resolution advisories¶
Some resolution conditions are reported as advisories rather than errors:
- Multiple surviving schemas with cost differences: the framework selected schema X over schema Y; cost differential noted.
- Stale quasi-metadata: the quasi-metadata informing Rule 3 is older than a configurable freshness threshold. Resolution proceeded with the cached data.
- Approximate operator usage: the query uses APPROX_DISTINCT or APPROX_PERCENTILE; results are approximate within the operator's error bound.
Advisories don't block query execution; they are surfaced alongside results for engineer awareness.
9.9 Summary¶
Query resolution is the framework's process for taking a parsed Frame-QL query and producing a backend-executable plan. The principal mechanism is the four-rule filter: for each column term in the query, the framework selects schemas that can serve the term by checking family membership (Rule 1), entity-set capability via FD-DAG (Rule 2), coverage consistency (Rule 3), and family-root agreement (Rule 4).
When multiple schemas survive with the same family-root (siblings), the Multi-Table Invariance theorem guarantees they produce equivalent results; the framework selects per cost-based heuristics. When schemas survive with different family-roots (cousins), the framework refuses the query as dubious and requires engineer disambiguation.
The integrity conditions verified by DQ (Chapter 7) are what make resolution rigorous: without them, the framework's structural guarantees do not hold. With them, queries either resolve correctly or surface specific diagnostics — there is no silent incorrectness.
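The four-rule filter summarized above can be sketched in a few lines. The schema records and field names below are hypothetical stand-ins for Coframe Core's internal model; only the rule order follows the text.

```python
# Illustrative four-rule filter over toy schema records (hypothetical fields).

def four_rule_filter(term, grain, needed_coverage, schemas):
    """Return names of schemas able to serve `term` at `grain`; refuse cousins."""
    survivors = [
        s for s in schemas
        if term in s["families"]                  # Rule 1: family membership
        and grain in s["reachable_grains"]        # Rule 2: FD-DAG capability
        and needed_coverage <= s["coverage"]      # Rule 3: coverage consistency
    ]
    roots = {s["family_roots"][term] for s in survivors}
    if len(roots) > 1:                            # Rule 4: family-root agreement
        raise ValueError(f"DUBIOUS: cousins for '{term}': {sorted(roots)}")
    return [s["name"] for s in survivors]

schemas = [
    {"name": "transactions",
     "families": {"revenue"}, "reachable_grains": {"store", "country"},
     "coverage": {2025, 2026}, "family_roots": {"revenue": "transactions.revenue"}},
    {"name": "store_monthly_summary",
     "families": {"revenue"}, "reachable_grains": {"store"},
     "coverage": {2026}, "family_roots": {"revenue": "transactions.revenue"}},
]
```

With both toy schemas sharing a family-root, multiple survivors are siblings and cost-based selection (not shown) would pick between them.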
For the operator catalog used during resolution, see Chapter 10. For Frame-QL syntax, see Chapter 8. For DQ's role in establishing the conditions resolution depends on, see Chapter 7.
Part V: Reference¶
Chapter 10: Operator Catalog¶
The framework's specification of operators: type, structural properties, naming function entries, and per-(operator, signature) missing-value treatment.
10.1 Overview¶
This chapter specifies Coframe Core's operator catalog: the closed set of operators the framework supports, with each operator's structural properties and per-signature missing-value treatment.
The catalog is closed in Coframe Core: engineers cannot register new operators. Coframe Pro supports custom operator registration; that is outside Coframe Core's scope.
The chapter is organized as follows:
- §10.2 specifies the structure of an operator-catalog entry: the fields each entry declares.
- §10.3 specifies conventions and reading guides used throughout the catalog.
- §10.4 enumerates the reducer catalog: each reducer's type, partition_invariance, missing-value treatment, and (where applicable) default naming-function entry.
- §10.5 enumerates the function catalog: each function's identity-preservation flag, applicability, and missing-value behavior.
- §10.6 specifies the OBSERVED operator: the framework's grain-role operator.
- §10.7 specifies cross-cutting notes: annotation propagation, multi-input behavior, MCP exposure.
The chapter assumes familiarity with the Foundations chapter (Chapter 2), the ColumnSpec chapter (Chapter 3), and the (E, M) paired declaration introduced in Foundations §2.4.
10.2 Operator-catalog entry structure¶
Each operator in Coframe Core's catalog has an entry with the following fields:
| Field | Description |
|---|---|
| name | The operator's catalog identifier (e.g., SUM, MAX, MAP_DIV). |
| type | reducer or function. |
| partition_invariant | (Reducers only) Boolean. Whether the operator distributes over partitions of input rows. |
| identity_preserving | (Functions only) Boolean. Whether the operator preserves family-name on its input. |
| arity | Number of input columns (1 for unary, 2+ for multi-input). |
| input_types | The data types the operator accepts as input. |
| output_type | The operator's output data type, possibly as a function of input type. |
| default_naming | (Optional) The catalog-default naming function entry: how the operator transforms name_pred into name_self when not identity-preserving. |
| missing_value_behavior | A specification per (operator, M_eff) of how the operator handles missing values. |
The framework consults these fields throughout: operator-type for the well-formedness E-relation (§2.6.1), partition_invariance for ip_reducer determination (§2.7.6), identity-preservation for the name-relationship between predecessor and successor (§2.6.3), default naming function for AC-level convenience (§3.7.3), and missing-value behavior for query execution (§10.3 and per-operator entries below).
10.3 Conventions and reading guide¶
10.3.1 Effective signature M_eff¶
For reducers, the relevant signature for missing-value treatment is the effective signature M_eff:
- M_eff = M(c, S) ∩ C_collapse, where C_collapse = col_in \ col_out (the entities being collapsed by the reduction).
- If c ∈ M(c, S) (the column itself is among its determinants — MNAR): the operation is governed by MNAR rules regardless of collapse structure.
Three categories per reducer cell:
- MCAR-effective: M_eff = ∅ and c ∉ M. Effectively MCAR for this operation.
- MAR-effective: M_eff ≠ ∅ and c ∉ M. Effectively MAR, with collapsed determinants creating bias.
- MNAR: c ∈ M(c, S). No principled treatment.
For functions, M_eff is not computed (no aggregation); the column's raw signature M is referenced directly.
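The classification can be sketched with sets of column names standing in for M, col_in, and col_out. This is illustrative only; the function name and representation are not Coframe Core API.

```python
# Illustrative M_eff classification for one reducer call (§10.3.1).

def classify(M, c, col_in, col_out):
    """Return the missing-value category for reducing column c."""
    if c in M:                        # MNAR: the column is its own determinant
        return "MNAR"
    c_collapse = col_in - col_out     # entities collapsed by the reduction
    m_eff = M & c_collapse            # M_eff = M(c, S) ∩ C_collapse
    return "MAR-effective" if m_eff else "MCAR-effective"
```

Note the third case below: a determinant that survives into col_out is not collapsed, so it does not make the call MAR-effective.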
10.3.2 Per-call behavior¶
Reducer entries specify per-call behavior:
- For each output group (each combination of values in col_out), the reducer is called with the inputs from rows belonging to that group.
- The cell specifies what the reducer does when its inputs contain missing values.
- Calls with no missing inputs are unaffected by the cell — they succeed normally.
The output column has a mix of computed values (where inputs had no missing values) and behavior per the cell (where inputs had missing values).
10.3.3 Internal mechanisms¶
The framework uses these internal mechanisms (not engineer-facing):
- skip: drop missing inputs from the operation; compute over non-missing only.
- propagate: any missing input → output is missing for this call.
- mean-substitute: replace missing inputs with the mean of non-missing inputs in the same call, then compute. Produces an unbiased estimate of the universe value under MCAR; a biased estimate under MAR (conditional imputation, which would correct this, is Coframe Pro-only).
- define(v): replace missing with the explicit value v before the operation. Only for operators that explicitly define replacement (COALESCE, IFNULL).
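The first three mechanisms can be sketched over a single reducer call, with None standing in for a missing value. These are illustrative one-liners, not Coframe internals.

```python
# Illustrative internal mechanisms over one reducer call (None = missing).

def skip(values):
    """Drop missing inputs; compute over non-missing only."""
    return [v for v in values if v is not None]

def propagate(values):
    """Any missing input makes the whole call's output missing."""
    return None if any(v is None for v in values) else values

def mean_substitute(values):
    """Replace missing inputs with the mean of the non-missing ones."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]
```

A SUM cell that says "mean-substitute, then sum" would, under this sketch, compute `sum(mean_substitute(call_inputs))`; an AVG cell that says "skip" computes over `skip(call_inputs)`.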
10.3.4 Annotations¶
Annotations travel with all results:
- missing-fraction: per-call and aggregated, what fraction of inputs were missing.
- partial-coverage: result reflects only observed values.
- lower-bound / upper-bound: result is a structural bound on the universe value (specific to MAX/MIN).
- under-count: result undercounts the universe distinct count (specific to COUNT_DISTINCT).
- bias-warning: result is systematically biased; quantification included where computable.
- substitution-applied: framework imputed values; substitution method and fraction included.
- propagate-reason: this output is missing because inputs contained missing values, no principled treatment available.
10.3.5 Refused (used sparingly)¶
"Refused" means the framework declines to execute the query. Reserved for structural malformedness:
- Dubious queries (multiple cousin matches; framework cannot pick uniquely).
- Queries referencing non-existent columns or names.
- Queries violating four-rule filter at structural level.
Missing-value handling does not refuse; it produces results with missing values per the cell rules.
10.3.6 Default naming function entries¶
For non-identity-preserving operators, the catalog provides default naming function entries — sample mappings from (name_pred, E_pred, op) to name_self that AC authors may adopt directly, override, or replace.
The default entries are illustrative starting points. The AC's declared naming function is authoritative; the catalog defaults have effect only if the AC opts in. The framework treats the naming function as a black box (per §3.7.2); the catalog's defaults are listed below as reference suggestions, not framework commitments.
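As a minimal sketch, the catalog defaults listed in §10.4 can be modeled as op-to-template mappings. The dictionary and helper below are hypothetical; a real naming function may also consult name_pred and E_pred, and the AC's own declaration always wins.

```python
# Hypothetical model of catalog-default naming entries (templates mirror
# the per-operator entries in §10.4; not Coframe Core API).

DEFAULT_NAMING = {
    "AVG": "mean_{name}",
    "MAX": "peak_{name}",
    "MIN": "trough_{name}",
    "COUNT": "{name}_count",
    "COUNT_DISTINCT": "distinct_{name}_count",
}

def default_name(op, name_pred):
    """Apply a catalog default, or return None when no default exists."""
    template = DEFAULT_NAMING.get(op)
    return template.format(name=name_pred) if template else None
```

Operators without a default entry (e.g., multi-input functions producing singletons) return None here, leaving naming entirely to the AC author.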
10.4 Reducer catalog¶
Reducers aggregate over rows, collapsing entities. For a reducer operation op(m_pred) → m, E_pred ⊇ E under FD-DAG navigation.
10.4.1 SUM(c)¶
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | true |
| Arity | 1 |
| Input types | numeric, integer |
| Output type | same as input |
| Default naming | f_SUM(name_pred, *) = name_pred (identity-preserving when op is the predecessor's family ip_reducer) |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | mean-substitute, then sum (substitution-applied) |
| MAR-effective | mean-substitute, then sum (substitution-applied + bias-warning) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: under MCAR, mean-substitution gives an unbiased estimate of the universe sum (the missing values' expected value equals the observed mean). Under MAR-effective, mean-substitution still serves the operator's purpose (universe-sum estimation) but produces a biased estimate; conditional mean substitution would be needed for an unbiased estimate, but Coframe Core doesn't provide it. The bias-warning annotation surfaces the analytical concern. Under MNAR, the missing values' expected value diverges systematically from the observed mean; mean-substitution is inappropriate. Propagate ensures no biased number masquerades as a result.
10.4.2 AVG(c)¶
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 |
| Input types | numeric, integer |
| Output type | numeric |
| Default naming | f_AVG(name_pred, *) → mean_<name_pred> (e.g., mean_revenue) |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage) |
| MAR-effective | skip (partial-coverage + bias-warning) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: for AVG, skip and mean-substitute produce identical results (substituting missing with the mean doesn't change the mean). Skip is operationally simpler. Result is the unbiased estimate of universe mean under MCAR; biased under MAR. Under MNAR, the observed mean diverges from the universe mean; propagate is the principled response.
Because partition_invariant: false, AVG cannot serve as a family ip_reducer. AVG-rooted families are anchor-locked.
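The distinction is easy to demonstrate: a two-phase computation (reduce each partition, then reduce the partials) matches the global result for SUM but not for AVG. The numbers below are illustrative.

```python
# Why partition_invariance matters: SUM distributes over partitions, AVG does not.

def two_phase(reducer, partitions):
    """Reduce each partition, then reduce the partial results."""
    return reducer([reducer(p) for p in partitions])

def avg(xs):
    return sum(xs) / len(xs)

partitions = [[1, 2, 3], [10]]
flat = [1, 2, 3, 10]

sum_matches = two_phase(sum, partitions) == sum(flat)   # 16 == 16
avg_matches = two_phase(avg, partitions) == avg(flat)   # 6.0 != 4.0
```

This is why a partition_invariant: false reducer cannot serve as a family ip_reducer: partial aggregates cannot be safely re-aggregated.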
10.4.3 MAX(c) / MIN(c)¶
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | true |
| Arity | 1 |
| Input types | numeric, integer, date, timestamp, string |
| Output type | same as input |
| Default naming | f_MAX(name_pred, *) → peak_<name_pred> (e.g., peak_revenue); f_MIN(name_pred, *) → trough_<name_pred> |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage + lower-bound for MAX, upper-bound for MIN) |
| MAR-effective | skip (partial-coverage + bias-warning + lower-bound/upper-bound) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: mean-substitution doesn't help for MAX/MIN (the substituted values aren't extreme). Skip gives the observed extreme. The lower-bound/upper-bound annotation acknowledges that the universe extreme could differ; missing values could have been more extreme than any observed. Under MAR-effective, the bias-warning indicates that determinant-correlation may further skew the observed extreme. Under MNAR, propagate.
10.4.4 COUNT(*)¶
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | true |
| Arity | 0 (counts rows, not column values) |
| Input types | (no input columns) |
| Output type | integer |
| Default naming | f_COUNT_STAR(*, *) → row_count |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| All cases (signature irrelevant) | succeeds (counts rows) |
Reasoning: COUNT(*) doesn't depend on column values. Always succeeds; no signature interaction.
10.4.5 COUNT(c)¶
Counts non-missing values of c.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | true |
| Arity | 1 |
| Input types | any |
| Output type | integer |
| Default naming | f_COUNT(name_pred, *) → <name_pred>_count (e.g., revenue_count) |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| All cases | succeeds (missing-fraction annotation) |
Reasoning: counting non-missing values is well-defined regardless of signature. The missing-fraction annotation tells engineers what was excluded.
10.4.6 COUNT_DISTINCT(c)¶
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 |
| Input types | any |
| Output type | integer |
| Default naming | f_COUNT_DISTINCT(name_pred, *) → distinct_<name_pred>_count |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (under-count) |
| MAR-effective | skip (bias-warning + under-count) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: distinct values among missing data may include values not observed elsewhere; observed distinct count is always a lower bound on universe distinct count (under-count annotation). No simple substitution helps. MAR-effective adds bias from determinant-correlation; MNAR adds value-correlation bias making distinct-counting structurally unprincipled.
Because partition_invariant: false, COUNT_DISTINCT cannot serve as a family ip_reducer. COUNT_DISTINCT-rooted families are anchor-locked.
10.4.7 MEDIAN(c) and quantile operators¶
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 |
| Input types | numeric, integer |
| Output type | same as input |
| Default naming | f_MEDIAN(name_pred, *) → median_<name_pred> |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage) |
| MAR-effective | propagate (output missing, propagate-reason) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: distribution-shape-sensitive. Mean-substitution distorts the observed distribution by adding mass at the mean. Skip preserves observed distribution shape but undercounts. Under MCAR, the observed distribution is an unbiased sample of the universe distribution; quantiles are well-estimated. Under MAR-effective and MNAR, MEDIAN propagates rather than producing a number. Distribution-shape sensitivity makes the result structurally biased; the relationship between missing-fraction and quantile-bias isn't simple, so the bias cannot be meaningfully quantified for engineer interpretation.
This is stricter than SUM/AVG: where SUM under MAR is principled (mean-substitute estimates universe sum even with bias) and the bias is quantifiable per missing-fraction, MEDIAN under MAR is not principled.
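The distortion is easy to see on one small illustrative call: mean-substitution piles mass at the mean and drags the median toward it, while skip reports the observed median.

```python
import statistics

# One reducer call: three observed inputs, two missing (illustrative numbers).
observed = [1, 2, 100]
n_missing = 2

skip_median = statistics.median(observed)            # 2: observed shape kept
m = statistics.mean(observed)                        # 103/3, about 34.33
substituted = observed + [m] * n_missing
substituted_median = statistics.median(substituted)  # the substituted mean itself
```

With two substituted values at the mean, the median of the five-element call is the mean, not any observed value, which is exactly the distribution distortion the catalog avoids.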
10.4.8 MODE(c)¶
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 |
| Input types | any |
| Output type | same as input |
| Default naming | f_MODE(name_pred, *) → mode_<name_pred> |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage) |
| MAR-effective | propagate (output missing, propagate-reason) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: same posture as MEDIAN — distribution-sensitive.
10.4.9 FIRST(c) / LAST(c)¶
These operators take a value from one specific row (the first or last under some ordering).
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 |
| Input types | any |
| Output type | same as input |
| Default naming | f_FIRST(name_pred, *) → first_<name_pred>; f_LAST(name_pred, *) → last_<name_pred> |
Missing-value treatment:
| Behavior |
|---|
| If the chosen row's c-value is missing: propagate (output missing). Else: succeeds. |
Reasoning: these aren't aggregating across rows; they're picking one row. Signature affects interpretation but not mechanical behavior; the framework's machinery propagates the chosen row's c-value (whether missing or not).
partition_invariant: false because the chosen row depends on global ordering, which doesn't compose under partition.
10.4.10 STDEV(c) / VARIANCE(c)¶
Distribution-sensitive operators measuring spread.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 |
| Input types | numeric, integer |
| Output type | numeric |
| Default naming | f_STDEV(name_pred, *) → stdev_<name_pred>; f_VARIANCE(name_pred, *) → variance_<name_pred> |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage) |
| MAR-effective | propagate (output missing, propagate-reason) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: variance-based statistics are distribution-sensitive. Treatment parallels MEDIAN/MODE. Mean-substitution would suppress variance (substituted values equal mean, contributing zero deviation), severely biasing toward smaller variance. Skip preserves shape but partial-coverage. MAR/MNAR propagate.
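The suppression effect can be shown on one illustrative call: substituted values sit exactly at the mean, contribute zero deviation, and shrink the computed variance.

```python
import statistics

# Illustrative call: two observed inputs, two missing inputs mean-substituted.
observed = [1.0, 3.0]                       # mean 2.0
substituted = observed + [2.0, 2.0]         # missing inputs replaced by the mean

var_skip = statistics.pvariance(observed)         # 1.0
var_substituted = statistics.pvariance(substituted)  # 0.5: suppressed
```

Halving the variance from a 50% missing-fraction is exactly the systematic bias toward smaller spread described above.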
10.4.11 CONCAT_AGG / STRING_AGG(c)¶
Aggregates string values into a delimited concatenation.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 (plus separator parameter) |
| Input types | string |
| Output type | string |
| Default naming | f_CONCAT_AGG(name_pred, *) → concat_<name_pred> |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage) |
| MAR-effective | skip (bias-warning) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: concatenation skips missing values by default; result is the concatenation of non-missing values. The result is descriptive (a list of observed values) rather than estimative (a claim about the universe). MAR/MCAR with bias-warning is acceptable because the engineer interprets the result as "the values we observed, with X% missing" — there's no false estimation claim. Under MNAR, the value-driven missingness makes the concatenation systematically misrepresentative; propagate.
partition_invariant: false because concatenation order is sensitive to the underlying row ordering, which doesn't compose cleanly under partition.
GROUP_CONCAT is a SQL-standard alias for CONCAT_AGG with the same behavior.
10.4.12 ARRAY_AGG(c)¶
Aggregates values into an array.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 |
| Input types | any |
| Output type | array of input type |
| Default naming | f_ARRAY_AGG(name_pred, *) → <name_pred>_array |
Missing-value treatment: same posture as CONCAT_AGG.
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage) |
| MAR-effective | skip (bias-warning) |
| MNAR | propagate (output missing, propagate-reason) |
10.4.13 BOOL_AND(c) / BOOL_OR(c)¶
Aggregate boolean values via AND / OR.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | true |
| Arity | 1 |
| Input types | boolean |
| Output type | boolean |
| Default naming | f_BOOL_AND(name_pred, *) → all_<name_pred>; f_BOOL_OR(name_pred, *) → any_<name_pred> |
Missing-value treatment:
| Operator | Behavior |
|---|---|
| BOOL_AND | If any input is FALSE: result is FALSE (regardless of missing inputs). If no FALSE and any input is missing: propagate (output missing). If all inputs are TRUE: result is TRUE. |
| BOOL_OR | If any input is TRUE: result is TRUE (regardless of missing inputs). If no TRUE and any input is missing: propagate (output missing). If all inputs are FALSE: result is FALSE. |
Reasoning: boolean reducers follow three-valued logic at the aggregation level. BOOL_AND can short-circuit on FALSE (NULL ∧ FALSE = FALSE in three-valued logic). Symmetric for BOOL_OR with TRUE. When the determinative value isn't present and there's missingness, the result is missing.
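The aggregation-level rules can be sketched directly, with None standing in for a missing boolean (an illustrative sketch, not Coframe internals).

```python
# Three-valued boolean reducers over one call (None = missing).

def bool_and(values):
    """FALSE short-circuits; otherwise any missing input propagates."""
    if any(v is False for v in values):
        return False
    if any(v is None for v in values):
        return None
    return True

def bool_or(values):
    """TRUE short-circuits; otherwise any missing input propagates."""
    if any(v is True for v in values):
        return True
    if any(v is None for v in values):
        return None
    return False
```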
10.4.14 BIT_AND(c) / BIT_OR(c) / BIT_XOR(c)¶
Bitwise reducers.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | true |
| Arity | 1 |
| Input types | integer |
| Output type | integer |
| Default naming | f_BIT_AND(name_pred, *) → bit_and_<name_pred>; analogous for OR, XOR |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| All cases | propagate (output missing, propagate-reason) |
Reasoning: bitwise reducers operate on the bit patterns of all inputs. A missing input means the bit pattern is unknown; the reduction can't be computed without it. Unlike SUM (where mean-substitution gives unbiased estimate) or BOOL_AND (which can short-circuit on definitive values), bit operations have no principled way to handle missing inputs. Propagate uniformly.
10.4.15 APPROX_DISTINCT(c)¶
Approximate distinct count, typically via HyperLogLog or similar sketch.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false (in the count-returning form supported by Coframe Core) |
| Arity | 1 |
| Input types | any |
| Output type | integer |
| Default naming | f_APPROX_DISTINCT(name_pred, *) → approx_distinct_<name_pred>_count |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (under-count + approximation-error) |
| MAR-effective | skip (bias-warning + under-count + approximation-error) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: same posture as COUNT_DISTINCT, with additional approximation-error annotation reflecting the sketch's inherent error bound.
In Coframe Core, APPROX_DISTINCT returns counts (not sketches), so partition_invariant: false. In Coframe Pro, sketch-typed columns with merge-supporting custom operators may be partition-invariant; this is outside Coframe Core.
10.4.16 APPROX_PERCENTILE(c, p) / APPROX_QUANTILE(c, p)¶
Approximate quantile operators, typically via t-digest or similar sketch.
| Field | Value |
|---|---|
| Type | reducer |
| partition_invariant | false |
| Arity | 1 (plus percentile parameter) |
| Input types | numeric, integer |
| Output type | same as input |
| Default naming | f_APPROX_PERCENTILE(name_pred, *) → approx_percentile_<name_pred> |
Missing-value treatment:
| M_eff | Behavior |
|---|---|
| MCAR-effective | skip (partial-coverage + approximation-error) |
| MAR-effective | propagate (output missing, propagate-reason) |
| MNAR | propagate (output missing, propagate-reason) |
Reasoning: same posture as MEDIAN/quantile, with additional approximation-error annotation.
10.5 Function catalog¶
Functions transform values row-wise without aggregating. For a function operation, E_pred = E.
Functions are organized by category: arithmetic, comparison, logical, string, date/time, type conversion, missing-value handling, conditional. Each category specifies the operators it includes and their structural properties.
10.5.1 Arithmetic operators¶
+, -, *, /, %, ^. Standard precedence rules apply.
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | false (most cases); special cases noted below |
| Arity | 2 |
| Input types | numeric, integer |
| Output type | numeric or integer (per standard arithmetic typing) |
| Default naming | None (results are typically singletons; AC author chooses naming) |
Special identity-preserving cases:
- c + 0 and c - 0: identity-preserving on c.
- c * 1 and c / 1: identity-preserving on c.
These special cases are recognized only if explicitly declared by the AC's naming function or detected by AC-validation logic. The default catalog flags arithmetic operators as not identity-preserving.
Missing-value treatment for arithmetic on inputs with missing values: any missing input → output is missing (propagate).
10.5.2 Comparison operators¶
=, <> (or !=), <, <=, >, >=. Three-valued logic.
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | false |
| Arity | 2 |
| Input types | comparable types (numeric, string, date, etc.) |
| Output type | boolean |
| Default naming | None (results are typically used inline in expressions) |
Missing-value treatment: any missing input → output is missing (NULL in three-valued logic).
10.5.3 Logical operators¶
AND, OR, NOT. Three-valued logic per the standard SQL semantics.
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | false |
| Arity | 2 (AND, OR) or 1 (NOT) |
| Input types | boolean |
| Output type | boolean |
| Default naming | None |
Missing-value treatment per three-valued logic:
- TRUE AND NULL = NULL; FALSE AND NULL = FALSE; NULL AND NULL = NULL.
- TRUE OR NULL = TRUE; FALSE OR NULL = NULL; NULL OR NULL = NULL.
- NOT NULL = NULL.
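These truth tables can be sketched with None standing in for NULL (standard SQL three-valued semantics; the helper names are illustrative).

```python
# SQL three-valued logic with None as NULL.

def tv_and(a, b):
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return a and b

def tv_or(a, b):
    if a is True or b is True:
        return True
    if a is None or b is None:
        return None
    return a or b

def tv_not(a):
    return None if a is None else (not a)
```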
10.5.4 String functions¶
UPPER(s), LOWER(s), TRIM(s), SUBSTRING(s, start, length), LENGTH(s), CONCAT(s1, s2, ...).
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | false (most cases) |
| Arity | 1 (UPPER, LOWER, TRIM, LENGTH) or 2+ (SUBSTRING, CONCAT) |
| Input types | string |
| Output type | string (most) or integer (LENGTH) |
| Default naming | None |
Missing-value treatment: any missing input → output is missing.
10.5.5 Date/time functions¶
DATE_ADD(d, interval), DATE_DIFF(d1, d2, unit), EXTRACT(field FROM d), etc.
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | false |
| Arity | varies |
| Input types | date, timestamp |
| Output type | varies (date, integer per the function) |
| Default naming | None |
Missing-value treatment: any missing input → output is missing.
10.5.6 Type conversion functions¶
CAST(expr AS type), TO_INT(s), TO_STRING(n), etc.
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | true for type-widening casts; false otherwise |
| Arity | 1 |
| Input types | varies (per the cast) |
| Output type | the target type |
| Default naming | None |
Special identity-preserving cases:
- Type-widening casts (e.g., INT → DECIMAL with greater precision) preserve identity on the input metric.
- Type-narrowing or type-changing casts do not preserve identity.
The framework recognizes specific casts as identity-preserving via the catalog's per-cast declarations.
Missing-value treatment: any missing input → output is missing.
10.5.7 Missing-value handling functions¶
COALESCE(c1, c2, ..., default), IFNULL(c, replacement), NULLIF(c, value).
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | conditional (see below) |
| Arity | varies |
| Input types | any |
| Output type | same as inputs |
| Default naming | None |
COALESCE(c, default) and IFNULL(c, replacement): identity-preserving on c if the AC author treats "the metric with explicit handling of missing values" as the same metric as c. The default catalog flags these as identity-preserving for the engineer's first input (c) when the second input is a literal default.
NULLIF(c, value): not identity-preserving (introduces missingness rather than handling it).
Missing-value treatment: per the semantics of each function. COALESCE returns the first non-missing input; IFNULL substitutes the replacement for missing values; NULLIF returns missing when the input equals the comparison value.
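The three semantics can be sketched with None as the missing value (standard SQL behavior; the Python helpers are illustrative).

```python
# COALESCE / IFNULL / NULLIF semantics with None as the missing value.

def coalesce(*args):
    """Return the first non-missing argument (missing if all are missing)."""
    for a in args:
        if a is not None:
            return a
    return None

def ifnull(c, replacement):
    """Substitute `replacement` when c is missing."""
    return replacement if c is None else c

def nullif(c, value):
    """Introduce missingness: missing when c equals `value`."""
    return None if c == value else c
```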
10.5.8 Conditional expressions¶
CASE WHEN cond1 THEN val1 [WHEN cond2 THEN val2 ...] [ELSE valN] END, IF(cond, true_val, false_val).
| Field | Value |
|---|---|
| Type | function |
| identity_preserving | false |
| Arity | varies |
| Input types | varies |
| Output type | the type of the THEN/ELSE values |
| Default naming | None |
Missing-value treatment: per the standard CASE/IF semantics. If a condition is missing, the framework follows three-valued logic; if the selected value is missing, the output is missing.
10.5.9 Multi-input function operators¶
MAP_DIV(c1, c2), MAP_MULT(c1, c2), etc.: multi-input functions producing singleton columns (per Foundations §2.7.7).
| Field | Value |
|---|---|
| Type | function (multi-input) |
| identity_preserving | false |
| Arity | 2+ |
| Input types | per the function |
| Output type | per the function |
| Default naming | None (singletons; AC author names) |
Missing-value treatment: any missing input → output is missing (propagate).
When used as a singleton's op (per ColumnSpec specification §3.5.6), the framework consults this entry for the operator's semantics.
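The propagate-on-missing rule for multi-input functions can be sketched as follows. A minimal illustration with `None` standing in for a missing value; the function names mirror the catalog entries, but the list-of-values signature is an assumption for demonstration:

```python
# Sketch of multi-input function operators with propagate-on-missing
# semantics: any missing input makes the corresponding output missing.

def map_div(c1, c2):
    """Row-wise division; missing propagates from either input."""
    return [None if a is None or b is None else a / b
            for a, b in zip(c1, c2)]

def map_mult(c1, c2):
    """Row-wise multiplication with the same propagation rule."""
    return [None if a is None or b is None else a * b
            for a, b in zip(c1, c2)]
```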
10.6 The OBSERVED operator¶
The OBSERVED operator is the framework's grain-role operator, used as the op field for grain-role columns where E = {c}.
10.6.1 Specification¶
| Field | Value |
|---|---|
| name | OBSERVED |
| Type | (special — not reducer or function) |
| partition_invariant | n/a |
| identity_preserving | n/a |
| Arity | 0 (the column is observationally rooted) |
| Input types | n/a |
| Output type | the column's data_type |
| Default naming | n/a |
10.6.2 Semantics¶
A ColumnSpec with op: OBSERVED represents a column whose values are observed directly from the backend, not derived through any operator within the AC. Grain-role columns are typically declared with op: OBSERVED; their DNA is self-referential.
OBSERVED is not a reducer and not a function. It is the framework's marker for "this column's values come from the data, not from an operation in the AC."
For a column with op: OBSERVED:
- The column's family is rooted at this column.
- The family has an ip_reducer iff the column is conceptually associated with a reducer (typically a non-grain-role column observationally rooted at some anchor, like a `revenue` column directly observed at transaction grain).
- For genuine grain-role columns, the family-root concept applies trivially (the column anchors a dimension).
10.6.3 Use cases¶
OBSERVED is used for:
- Grain-role columns in reference and fact schemas (e.g., `customer_id`, `transaction_id`, `date`).
- AC-attribute columns observed directly from reference tables (e.g., `customer_name`, `store_address`).
- AC-metric columns whose values are observed directly without AC-internal derivation, even if the upstream pipeline did derive them. For example, a `revenue` column in a transactions table is observationally rooted as far as the AC is concerned, even if the ETL pipeline computed it from an order amount minus discounts. The AC takes the column as observed; the ETL's logic is outside the AC's structural commitment.
10.6.4 Distinction from reducer roots¶
A column with op: OBSERVED and a column with op: SUM (where op: SUM is the family ip_reducer) both serve as roots, but they have different structural meanings:
- `op: OBSERVED`: the values come from outside the AC. The framework treats them as given. The column's family-root is itself.
- `op: SUM` at a root: the values are observed, and the AC is also committed to the family's algebraic structure (SUM is the family's ip_reducer; cross-anchor navigation via SUM produces siblings).
In practice, AC-metric roots with a partition-invariant ip_reducer are commonly declared with op equal to the ip_reducer rather than OBSERVED. This makes the family's algebraic commitment explicit at the root. The framework accepts both forms; OBSERVED is the option for columns whose family does not have a meaningful ip_reducer (e.g., a directly-observed mean that's anchor-locked).
10.7 Cross-cutting notes¶
10.7.1 When multiple inputs have different signatures¶
For multi-input operators, when the inputs have different M signatures, the framework applies the most-restrictive treatment:
- If any input is MNAR: propagate.
- Else if any input is MAR-effective: apply MAR-effective treatment per the operator's catalog.
- Else (all inputs are MCAR-effective): apply MCAR-effective treatment.
The operator's per-(operator, M_eff) entry specifies what each treatment is.
10.7.2 Annotations always travel with results¶
Every result produced by a Coframe Core query carries annotations from the operators applied. Annotations propagate compositionally:
- `missing-fraction`: aggregated across all calls and reducers in the query.
- `lower-bound` / `upper-bound`: propagate from MAX/MIN through subsequent operations as best the framework can compute.
- `bias-warning`: propagates if any reducer in the query produced a bias-warning annotation.
- `under-count`: propagates from COUNT_DISTINCT-like operators.
The framework provides these annotations alongside query results so engineers (and LLM clients via MCP) understand the reliability of the result.
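Compositional propagation can be sketched as a merge over per-call annotation records. This is an illustrative stand-in — the dict shape is hypothetical, and the simple mean used for `missing-fraction` is a placeholder for the framework's per-operator aggregation:

```python
# Sketch of merging per-operator annotations into result-level annotations:
# missing-fraction is aggregated (mean here, as a stand-in), bias-warning
# is OR-ed across reducers.

def merge_annotations(per_call_annotations):
    merged = {"missing-fraction": 0.0, "bias-warning": False}
    n = 0
    for ann in per_call_annotations:
        if "missing-fraction" in ann:
            merged["missing-fraction"] += ann["missing-fraction"]
            n += 1
        merged["bias-warning"] |= ann.get("bias-warning", False)
    if n:
        merged["missing-fraction"] /= n  # placeholder aggregate
    return merged
```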
10.7.3 MCP exposure¶
The operator catalog is exposed to MCP clients via the MCP server (Chapter 11). LLM clients can query the catalog to:
- Enumerate available operators with their structural properties.
- Look up an operator's missing-value behavior under specific signatures.
- Determine which operators are partition-invariant (and thus eligible as ip_reducers).
This enables LLM clients to reason about the AC's available structural operations without external documentation.
10.7.4 What this catalog is not¶
This catalog is the framework's reference of supported operators. It is not:
- A SQL function library. Some operators have SQL-standard names and semantics; others do not. Coframe Core's operator semantics are specified by this catalog, not by SQL standards.
- A guide to operator usage. Frame-QL syntax (Chapter 8) governs how operators appear in queries; this catalog governs what each operator does.
- An exhaustive list. Coframe Pro supports custom operator registration; AC authors there may add operators with their own catalog entries. Coframe Core is closed.
10.8 Where to go next¶
After reading this chapter, the natural next chapters are:
- Chapter 8: Frame-QL — the query language using these operators.
- Chapter 9: Query Resolution — how queries route to schemas and how operators apply during resolution.
- Chapter 3: ColumnSpec and Naming Machinery — the per-column declaration referencing operator-catalog entries.
For the framework's overall posture and the broader structural picture, see the Foundations chapter (Chapter 2).
Part VI: MCP¶
Chapter 11: The MCP Server¶
The framework's interface to LLM clients. How AI agents reason about ACs, construct Frame-QL queries, and consume results.
11.1 Overview¶
This chapter specifies the coframe-mcp server: the framework's interface for LLM clients to reason about ACs, construct queries, and consume results. MCP — the Model Context Protocol — is the emerging standard for letting LLMs interact with structured tools and data sources. The Coframe Core MCP server is the bridge between LLM clients and Coframe Core ACs.
The chapter is organized as follows:
- §11.2 frames why an MCP server matters for Coframe Core.
- §11.3 specifies the server's exposed capabilities.
- §11.4 specifies the two operating modes (direct and dialogue).
- §11.5 specifies the family-vocabulary exposure model.
- §11.6 specifies the AC scope as the agent's surface.
- §11.7 specifies request/response formats.
- §11.8 specifies error handling and diagnostics.
- §11.9 specifies deployment patterns.
- §11.10 lists what the MCP server is not.
The chapter assumes familiarity with the Foundations chapter (Chapter 2), Frame-QL (Chapter 8), Query Resolution (Chapter 9), and the Operator Catalog (Chapter 10).
11.2 Why an MCP server¶
The framework's structural commitments make Coframe Core a natural substrate for AI agents. Frame-QL queries express analytical intent without making structural decisions; the framework's resolver verifies the query is well-formed and either resolves it correctly or returns a structured diagnostic. Errors are caught at parse or resolution time, not at result time.
For this to work in practice, LLM clients need access to the AC's structural metadata — the family vocabulary, the metric genealogy, the FD-DAG, the operator catalog, the dimension value-sets. They need this metadata in a form they can consume programmatically, not as PDF documentation.
The MCP server is the framework's answer. It exposes the AC's structural surface in MCP's request/response format. LLM clients (Claude, GPT, custom agents) connect to the server, query for the metadata they need, construct Frame-QL queries, submit them for resolution, and consume results — all through a stable protocol.
The server is what makes Coframe Core an AI-native query layer rather than just a query layer that AI agents can use through text-to-SQL adapters.
11.3 Exposed capabilities¶
The server exposes the following capabilities. Each is an MCP tool that LLM clients can invoke.
11.3.1 AC discovery¶
- `list_acs()` — enumerate the ACs the server exposes. Each AC has an identifier and a brief description.
- `describe_ac(ac_name)` — return the AC's name, description, scope summary (number of schemas, families, dimensions), and naming function status (catalog default / custom / none).
11.3.2 Family-level reasoning¶
- `list_families(ac_name)` — enumerate all families in the AC with their family-name and a brief root description.
- `describe_family(ac_name, family_name)` — return details on a specific family: family-root description, family-root's `op` and partition_invariance, ip_reducer (if the family has one), the anchors at which the family's columns are observable, the family's structural relations summary (number of siblings, presence of cousins).
11.3.3 Genealogy navigation¶
- `list_genealogy(ac_name)` — return the AC-wide metric genealogy. For each family, the family-root, the siblings, and any cousins.
- `describe_column(ac_name, schema_name, column_name)` — return ColumnSpec details for a specific column: the four-part declaration (`(src_name, data_type)`, `(E, M)`, `(op, dna)`, `name`), derived properties (family-root, structural relations to other columns), and any verification status from DQ.
11.3.4 FD-DAG navigation¶
- `list_fd_edges(ac_name)` — enumerate the AC's FD-DAG edges. Each edge has a source dimension, target dimension, channel, and attestation status.
- `describe_fd_edge(ac_name, source, target)` — details on a specific FD-edge.
11.3.5 Dimension value reasoning¶
- `get_dimension_values(ac_name, dimension)` — enumerate the universe-wide value-set for an AC-dimension.
- `get_dimension_coverage(ac_name, dimension, schema_name)` — return a specific schema's coverage map for a specific dimension.
11.3.6 Operator catalog¶
- `list_operators()` — enumerate the operators in Coframe Core's catalog.
- `describe_operator(op_name)` — return the operator-catalog entry: type, partition_invariance, identity_preserving flag, default naming, missing-value behavior per signature.
11.3.7 Query operations¶
- `resolve_query(ac_name, frame_ql)` — parse and resolve a Frame-QL query against an AC without executing. Returns the resolution plan (which schemas serve which column terms, navigation paths, expected output shape) or a structured diagnostic if resolution fails.
- `execute_query(ac_name, frame_ql)` — parse, resolve, and execute. Returns the result data plus annotations.
- `nl_query(ac_name, natural_language_utterance)` — submit a natural-language query for execution. The server translates to Frame-QL via a server-side dialogue layer, then executes. Available when the server is configured with an LLM for the dialogue.
11.3.8 DQ inspection¶
- `get_dq_status(ac_name)` — return the AC's DQ status: violations, advisories, integrity status.
- `get_dq_advisories(ac_name)` — return current advisories the engineer hasn't addressed.
11.4 Two operating modes¶
The MCP server supports two modes for query execution. Deployments choose per architecture.
11.4.1 Direct mode¶
The LLM client constructs Frame-QL itself, using the AC metadata exposed by the server. The client:
- Calls `list_families`, `describe_family`, `get_dimension_values`, etc., to understand the AC's structural surface.
- Constructs a Frame-QL query expressing the analytical intent.
- Calls `resolve_query` to verify the query resolves cleanly (and to refine if not).
- Calls `execute_query` to get results.
Direct mode requires the LLM to have Frame-QL knowledge. This is reasonable for any modern LLM: Frame-QL is a small declarative language, and the syntax is documentable in a few hundred tokens.
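The direct-mode workflow can be sketched from the client's side. The `mcp_client` object and its `call` method are hypothetical stand-ins for whatever MCP client library the agent uses; the tool names are the ones §11.3 specifies:

```python
# Sketch of a direct-mode client loop: inspect the AC, dry-run resolve,
# then execute. The client object shape is an illustrative assumption.

def run_direct_mode(mcp_client, ac_name, frame_ql):
    # 1. Understand the AC's structural surface (the LLM consumes this).
    families = mcp_client.call("list_families", {"ac_name": ac_name})
    # 2-3. The LLM constructs frame_ql, then dry-run resolves it.
    plan = mcp_client.call("resolve_query",
                           {"ac_name": ac_name, "frame_ql": frame_ql})
    if "error" in plan:
        return plan  # structured diagnostic: refine the query and retry
    # 4. Execute and return result data plus annotations.
    return mcp_client.call("execute_query",
                           {"ac_name": ac_name, "frame_ql": frame_ql})
```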
11.4.2 Dialogue mode¶
The LLM client submits a natural-language utterance; the server translates to Frame-QL via a server-side dialogue layer before executing. The client calls nl_query("what was peak weekly revenue last quarter by region?") and receives results.
Dialogue mode places translation responsibility on the server. The server must be configured with an LLM endpoint (OpenAI, Anthropic, or local model) that performs the natural-language-to-Frame-QL translation. The server presents the translated Frame-QL to the calling client alongside the results, so the client can verify the translation reflects the intent.
Dialogue mode is useful when:
- The calling client is a thin agent (e.g., a Slack bot, a UI form) without LLM capabilities.
- The deployment wants centralized control over the translation policy.
- The deployment wants to log natural-language utterances and their translations for auditing.
Direct mode is useful when:
- The calling client is itself an LLM with sufficient capability.
- The deployment wants the calling LLM to reason explicitly about query construction.
Both modes can be enabled simultaneously; clients choose per call.
11.5 The family-vocabulary exposure model¶
The MCP server's exposure is organized around the family vocabulary as the top-level concept.
When an LLM client constructs a query, it doesn't navigate a flat list of "metric definitions." It looks up families. A family is a conceptual unit: a name, a root, a set of anchors at which the family's columns are observable, an ip_reducer (or its absence). The agent reasons about families as conceptual quantities.
Within a family, the agent sees siblings: same family, different anchors. The agent picks the appropriate sibling for the query's grain. Cousins surface as warnings: same family-name, different family-roots, requiring disambiguation.
This is meaningfully different from semantic-layer exposures. A semantic layer presents a flat list of named metrics; the agent has to reconstruct conceptual relationships from documentation. Coframe Core's MCP server presents a structured genealogy where the relationships are first-class.
For LLMs, this matters concretely. The family vocabulary is a small number of holdable concepts (revenue, customer_count, peak_revenue, etc.) rather than a sprawling list of metric definitions across multiple grains. The agent can ask "what's our revenue this quarter?" — look up the revenue family, identify the right sibling for "this quarter," construct the query — without the cognitive load of comparing flat metric definitions.
11.6 The AC scope as the agent's surface¶
The MCP server exposes exactly what the AC author chose to expose — the AC scope (per Foundations §2.3.4). Backend columns not declared as ColumnSpecs are not visible through the MCP server.
This has two practical implications.
Curated agent surface. The AC author's selection is a deliberate exposure boundary for AI agents. A backend table with hundreds of columns may produce an AC with a handful exposed via MCP. The agent navigates the curated surface, not the full backend inventory.
Structurally-enforced exposure boundary. Queries through the MCP server cannot reach undeclared columns. The framework's query-resolution machinery operates within the AC scope; columns outside scope are not queryable. Different teams' ACs over the same backend can have different scopes (a marketing AC, a finance AC, an operations AC), each presenting its scope to its agents.
For governance, this matters. ACs serve as deliberate exposure boundaries for AI agents — the AC author decides what's reachable through the analytical surface. PII, sensitive operational data, and internal bookkeeping columns can be excluded from the AC scope; agents querying through the MCP server cannot reach them.
This is not the same as access control, and the MCP server does not implement authorization beyond what the deployment's MCP authentication layer provides. But the AC scope's structural enforcement is a meaningful first line: agents are working within bounds the AC author chose, not the full backend's exposure.
11.7 Request/response formats¶
MCP defines the wire format. Coframe Core's MCP server uses standard MCP request/response shapes; this section specifies the content within them.
11.7.1 Request format¶
Each capability invocation is an MCP tool call with named arguments. For example:
```json
{
  "tool": "describe_family",
  "arguments": {
    "ac_name": "retail_analytics_v1",
    "family_name": "revenue"
  }
}
```
11.7.2 Response format for structural metadata¶
Structural metadata responses are JSON objects with the family / column / FD-edge / operator's fields. For example, describe_family returns:
```json
{
  "family_name": "revenue",
  "ac_name": "retail_analytics_v1",
  "family_root": {
    "schema": "transactions",
    "name": "revenue",
    "E": ["transaction"],
    "op": "SUM",
    "M": {"signature": "MCAR", "determinants": []}
  },
  "ip_reducer": "SUM",
  "partition_invariant": true,
  "anchors_observed": [
    {"E": ["transaction"], "schemas": ["transactions"]},
    {"E": ["store", "month"], "schemas": ["store_monthly_summary"]}
  ],
  "siblings": [...],
  "cousins": [],
  "annotations": []
}
```
11.7.3 Response format for query results¶
Query result responses include the result data plus annotations:
```json
{
  "frame_ql_query": "SELECT region, SUM(revenue) AS total BY region",
  "resolution_plan": {
    "schemas_used": ["transactions"],
    "navigation_paths": ["transaction → store → region"],
    "attested_edges_consulted": [
      {"predecessor": "revenue@[transaction]", "successor": "revenue@[store,month]", "status": "passed"}
    ]
  },
  "result_columns": ["region", "total"],
  "result_data": [
    {"region": "west", "total": 12450000},
    {"region": "east", "total": 15300000}
  ],
  "annotations": [
    {"type": "missing-fraction", "value": 0.02, "scope": "result"},
    {"type": "partial-coverage", "scope": "stores: 92% of universe"}
  ],
  "coherence_posture": {
    "level": "AAA",
    "grounding_summary": {
      "metric_coherence": {
        "data_attested": 4,
        "verified_by_construction": 3,
        "tolerated": 0,
        "unattestable": 1
      }
    },
    "attestation_enabled": true,
    "edges_passed": 4,
    "edges_failed": 0,
    "edges_tolerated": 0,
    "naming_consistency": "verified"
  },
  "execution_metadata": {
    "duration_ms": 145,
    "rows_scanned": 4500000
  }
}
```
The coherence_posture field is propagated on every query result that depends on cross-schema reach. It reports:
- The AC's verification level (`A`, `AA`, or `AAA`) per §7.13. Informational in v1.0; stable surface in v1.x.
- The optional `grounding_summary` field showing how commitments consulted by this query were grounded — empirically (data-attested through DQ), deductively (verified-by-construction through operator catalog semantics), tolerated, or unattestable. Per §7.13.4 and §7.13.8, the level is the headline; the grounding mix is informational and lets sophisticated consumers reason about what kind of verification underlies the result. AI agents whose reasoning chains compose results from multiple commitments can branch on the grounding mix when assessing per-commitment confidence.
- Whether attestation is enabled for the AC (default: `true`).
- The count of attested edges in the AC, broken down by status, that participate in the query's resolution.
- The naming-consistency status (`verified` if a naming function is declared and checked; `asserted` if naming is declined per Chapter 3 §3.7.3 Option 4).
- When attestation is disabled and the AC has any data-attested commitments, an additional annotation `{"type": "coherence-asserted-not-verified", "scope": "ac"}` accompanies the result; the level is correspondingly capped at `AA` for the data-attested portion.
- When any edge consulted has status `failed_with_deltas` (under `failure_mode: soft`), an additional annotation `{"type": "coherence-warning", "edges": [...]}` accompanies the result with the failed edges and their disagreement magnitudes.
- When tolerated edges are involved (`edges_tolerated > 0`), the tolerated-edge identifiers and rationales are surfaced via a `tolerated_edges` annotation.
This propagation gives AI agents and other MCP clients direct visibility into the rigor configuration of the AC they're querying. An agent reasoning about result trust can branch on coherence_posture.level, grounding_summary (when surfaced), attestation_enabled, and edges_failed without making a separate validate_ac call. By default, MCP query results surface only the level field for compactness; clients requesting include_grounding: true get the full grounding summary.
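An agent's trust branching on this field can be sketched as follows. The thresholds and trust labels here are illustrative policy choices, not part of the framework; only the field names come from the response format above:

```python
# Sketch of an agent assessing result trust from coherence_posture
# without a separate validate_ac call. Labels are an example policy.

def assess_trust(result):
    posture = result.get("coherence_posture", {})
    if posture.get("edges_failed", 0) > 0:
        return "distrust"          # a consulted edge failed attestation
    if not posture.get("attestation_enabled", True):
        return "asserted-only"     # coherence asserted, not verified
    if posture.get("level") == "AAA" and posture.get("edges_tolerated", 0) == 0:
        return "trust"
    return "trust-with-caveats"
```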
11.7.4 Error response format¶
Error responses use a structured format with category, message, and remediation guidance:
```json
{
  "error": {
    "category": "dubious_query",
    "message": "Family 'peak_concurrent_users' resolves to multiple cousins.",
    "details": {
      "cousins": [
        {"schema": "system_metrics_hourly", "family_root_E": ["server", "hour"]},
        {"schema": "product_analytics_daily", "family_root_E": ["region", "day"]}
      ]
    },
    "remediation": "Disambiguate via qualified reference (e.g., 'system_metrics_hourly.peak_concurrent_users') or explicit FROM clause."
  }
}
```
The error categories include: parse, binding, resolution, dubious_query, integrity, operator, backend. Categories and details follow the conventions in Frame-QL §8.11.
11.8 Error handling and diagnostics¶
The server's error handling treats LLM clients as first-class consumers: every failure returns a structured, actionable diagnostic.
Categorized errors. Every error has a category (per §11.7.4) so the client can branch on category rather than parsing message text.
Actionable details. Errors include enough information for the client to diagnose and potentially auto-correct. A dubious-query error names the cousins and suggests qualified references; a binding error suggests similar family-names.
Remediation guidance. Errors include human-readable remediation suggestions when possible. The client (or its underlying LLM) can incorporate these into a refined query.
Idempotent retries. The server's operations are idempotent. Clients can retry operations without side effects beyond logging.
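These four properties suggest a simple client-side handler: branch on category, feed details and remediation back to the LLM, and retry freely since operations are idempotent. A sketch under those assumptions; the dispatch policy and return shape are illustrative, not framework API:

```python
# Sketch of a client handling the structured error format of §11.7.4.

def handle_error(response):
    err = response.get("error")
    if err is None:
        return ("ok", response)
    category = err["category"]
    if category == "dubious_query":
        # e.g., cousins needing disambiguation: feed back to the LLM
        return ("refine", err.get("details"), err.get("remediation"))
    if category in ("parse", "binding"):
        return ("rewrite", err.get("remediation"))
    if category == "backend":
        return ("retry", None)  # operations are idempotent; safe to retry
    return ("abort", err.get("message"))
```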
11.9 Deployment patterns¶
The MCP server can be deployed in several patterns.
Local development. A single-machine deployment for engineer-side query authoring and testing. The server runs alongside the AC, the data, and the engineer's IDE/agent.
Team-shared analytics endpoint. A team-internal MCP server exposing one or more ACs to the team's analytics tooling — Claude clients, internal LLM-driven dashboards, custom agents. Authentication via the deployment's MCP authentication layer.
Multi-tenant analytics platform. A central MCP server exposing different ACs to different tenants/teams. Each AC's scope acts as its tenant's analytical surface; the server enforces tenant-to-AC access via authentication.
Federation across ACs. Multiple MCP servers exposing different ACs, with a federation layer routing queries to the appropriate server. The federation layer handles authentication and routing; individual servers handle their ACs.
The framework provides reference deployments via the coframe-mcp Python package. Production deployments customize per organizational needs.
11.10 What the MCP server is not¶
The MCP server is not:
A general LLM gateway. The server exposes Coframe Core ACs to LLM clients. It does not proxy general LLM traffic, perform retrieval-augmented generation outside the AC, or serve as a generic chatbot infrastructure.
A SQL-translation layer. The server exposes Frame-QL, not SQL. Translation between Frame-QL and SQL happens within the framework when Frame-QL is executed; the server's API is at the Frame-QL level.
An authorization service. Authorization is the deployment's concern, implemented at the MCP authentication layer (per MCP's spec). The server's AC-scope enforcement is structural — agents cannot reach columns outside the AC scope — but is not a substitute for proper authentication.
A data egress mechanism. The server returns query results, with annotations describing missing-value treatment and biases. It is not a bulk-export tool; for bulk data export, deployments use other mechanisms (the data-API directly, ETL pipelines).
A persistence layer. The server is stateless beyond AC metadata caches; query results are not persisted server-side. Clients that need persistent results store them themselves.
11.11 Where to go next¶
After reading this chapter, the natural next chapters are:
- Chapter 8: Frame-QL — the query language the MCP server exposes.
- Chapter 9: Query Resolution — how queries route through the four-rule filter and MTI.
- Chapter 10: Operator Catalog — what the `list_operators` and `describe_operator` capabilities expose.
- Chapter 2: Foundations — the structural commitments the MCP server's metadata exposure reflects.
For deployment-specific guidance (configuration, authentication, monitoring), see the coframe-mcp package documentation.
Appendices¶
Appendix A: BNF Grammar for Frame-QL¶
The complete formal grammar for Coframe Core's Frame-QL query language.
A.1 Overview¶
This appendix specifies the complete BNF grammar for Frame-QL as supported by Coframe Core. The grammar formalizes what Chapter 8 specifies in prose. Parser implementations should treat this appendix as authoritative for syntactic concerns; semantic concerns (query resolution, type-checking, integrity-condition validation) are specified in Chapters 9 and 10.
The grammar uses standard BNF conventions:
<nonterminal>denotes a grammar production."keyword"and'symbol'denote literal tokens.|separates alternatives.[ ... ]denotes optional content (zero or one).{ ... }denotes repetition (zero or more).( ... )groups productions.
Whitespace between tokens is ignored except where it disambiguates identifiers from keywords.
A.2 Top-level grammar¶
```bnf
<query> ::= <frame> | <with_block>

<with_block> ::= "WITH" <inner_frame_list> <frame>

<inner_frame_list> ::= <inner_frame> { "," <inner_frame> }

<inner_frame> ::= <identifier> "AS" "(" <frame> ")"

<frame> ::= <select_clause>
            [ <from_clause> ]
            [ <where_clause> ]
            <by_clause>
            [ <having_clause> ]
            [ <order_by_clause> ]
            [ <limit_clause> ]
          | <select_item_list>          -- sugar form (no SELECT keyword)
            [ <from_clause> ]
            [ <where_clause> ]
            <by_clause>
            [ <having_clause> ]
            [ <order_by_clause> ]
            [ <limit_clause> ]

<select_clause> ::= "SELECT" <select_item_list>
```
Outer Frames (the frame outside a WITH-block, or any standalone frame) require a <by_clause>. Inner Frames within a WITH-block may omit <by_clause> to inherit the outer Frame's grain.
A.3 Frame clauses¶
A.3.1 SELECT items¶
```bnf
<select_item_list> ::= <select_item> { "," <select_item> }

<select_item> ::= <expression> [ "AS" <identifier> ]
                | <column_ref>
                | <literal>

<column_ref> ::= <identifier>
               | <identifier> "." <identifier>   -- qualified
```
A qualified column reference (schema_name.column_name) constrains the framework's automatic schema selection for that specific column.
A.3.2 FROM clause¶
```bnf
<from_clause> ::= "FROM" <from_item_list>

<from_item_list> ::= <from_item> { "," <from_item> }

<from_item> ::= <identifier>   -- schema name from the AC
              | <identifier>   -- WITH-block frame name
```
The FROM clause's identifiers are resolved against the AC's schemas and the current WITH-block's inner frames. The framework distinguishes them based on the surrounding scope.
A.3.3 WHERE clause¶
WHERE expressions filter rows before aggregation. The expression must evaluate to a boolean per three-valued logic.
A.3.4 BY clause¶
```bnf
<by_clause> ::= "BY" <by_target>

<by_target> ::= <identifier>               -- single dimension
              | "(" <identifier_list> ")"  -- composite grain
              | <identifier>               -- schema grain (e.g., BY transaction)

<identifier_list> ::= <identifier> { "," <identifier> }
```
The BY clause specifies the output grain of the frame. The framework navigates from input grains to the output grain via the FD-DAG.
A.3.5 HAVING clause¶
HAVING expressions filter the result rows after aggregation. They may reference aggregated values from the SELECT clause.
A.3.6 ORDER BY clause¶
```bnf
<order_by_clause> ::= "ORDER" "BY" <order_spec_list>

<order_spec_list> ::= <order_spec> { "," <order_spec> }

<order_spec> ::= <expression> [ "ASC" | "DESC" ]
```
A.3.7 LIMIT clause¶
A.4 Expressions¶
A.4.1 Expression hierarchy¶
Frame-QL expressions follow standard SQL-like precedence. The grammar below presents the precedence levels from lowest (OR) to highest (atomic).
```bnf
<expression> ::= <or_expression>

<or_expression> ::= <and_expression> { "OR" <and_expression> }

<and_expression> ::= <not_expression> { "AND" <not_expression> }

<not_expression> ::= [ "NOT" ] <comparison>

<comparison> ::= <addition> [ <comparison_op> <addition> ]
               | <addition> "IS" [ "NOT" ] ( "NULL" | "MISSING" | "TRUE" | "FALSE" )
               | <addition> [ "NOT" ] "IN" "(" <expression_list> ")"
               | <addition> [ "NOT" ] "BETWEEN" <addition> "AND" <addition>
               | <addition> [ "NOT" ] "LIKE" <string_literal>

<comparison_op> ::= "=" | "<>" | "!=" | "<" | "<=" | ">" | ">="

<addition> ::= <multiplication> { ( "+" | "-" ) <multiplication> }

<multiplication> ::= <power> { ( "*" | "/" | "%" ) <power> }

<power> ::= <unary> { "^" <unary> }

<unary> ::= [ "+" | "-" ] <primary>

<primary> ::= <literal>
            | <column_ref>
            | <function_call>
            | <reducer_call>
            | <case_expression>
            | <cast_expression>
            | "(" <expression> ")"
```
A.4.2 Function and reducer calls¶
```bnf
<function_call> ::= <identifier> "(" [ <expression_list> ] ")"
                  | <identifier> "(" <distinct_modifier> <expression> ")"
                  | <identifier> "(" "*" ")"   -- COUNT(*) form

<reducer_call> ::= <identifier> "(" <expression> ")"
                 | <identifier> "(" "DISTINCT" <expression> ")"
                 | <identifier> "(" <expression> "," <literal> ")"   -- e.g., APPROX_PERCENTILE(c, 0.95)
                 | "COUNT" "(" "*" ")"

<distinct_modifier> ::= "DISTINCT"

<expression_list> ::= <expression> { "," <expression> }
```
The grammar does not distinguish syntactically between function and reducer calls. The framework determines the call's type at name-binding time per the operator catalog (Chapter 10).
A.4.3 CASE expressions¶
```bnf
<case_expression> ::= "CASE" <when_clause_list> [ "ELSE" <expression> ] "END"
                    | "CASE" <expression> <when_value_list> [ "ELSE" <expression> ] "END"

<when_clause_list> ::= <when_clause> { <when_clause> }

<when_clause> ::= "WHEN" <bool_expression> "THEN" <expression>

<when_value_list> ::= <when_value> { <when_value> }

<when_value> ::= "WHEN" <expression> "THEN" <expression>
```
Two CASE forms: searched (with <bool_expression> per WHEN) and simple (one initial expression compared via equality with each WHEN value).
A.4.4 IF expressions¶
IF(condition, true_value, false_value) is shorthand for the searched CASE form.
A.4.5 Type casts¶
```bnf
<cast_expression> ::= "CAST" "(" <expression> "AS" <type_name> ")"

<type_name> ::= "NUMERIC" | "INTEGER" | "STRING" | "BOOLEAN" | "DATE" | "TIMESTAMP"
```
A.4.6 Special date/time literals¶
A.5 Lexical tokens¶
A.5.1 Identifiers¶
```bnf
<identifier> ::= <unquoted_identifier> | <quoted_identifier>

<unquoted_identifier> ::= <letter_or_underscore> { <letter_or_underscore_or_digit> } [ "." <unquoted_identifier> ]

<quoted_identifier> ::= "`" { <any_char_except_backtick> } "`"

<letter_or_underscore> ::= "a" .. "z" | "A" .. "Z" | "_"

<letter_or_underscore_or_digit> ::= <letter_or_underscore> | "0" .. "9"
```
Unquoted identifiers consist of letters, digits, and underscores, starting with a letter or underscore. They may include dots for qualified references (e.g., transactions.revenue).
Quoted identifiers (backtick-delimited) allow any characters except the backtick itself. The framework treats names as opaque labels (per Foundations §2.11.3); quoted identifiers let AC authors use names with arbitrary content.
A.5.2 Literals¶
```bnf
<literal> ::= <numeric_literal>
            | <string_literal>
            | <boolean_literal>
            | <missing_literal>
            | <date_literal>
            | <timestamp_literal>

<numeric_literal> ::= <integer_literal> | <decimal_literal> | <scientific_literal>

<integer_literal> ::= [ "-" ] <digit> { <digit> }

<decimal_literal> ::= [ "-" ] <digit> { <digit> } "." { <digit> }

<scientific_literal> ::= <decimal_literal> ( "e" | "E" ) [ "+" | "-" ] <digit> { <digit> }

<string_literal> ::= "'" { <string_char> | <escaped_quote> } "'"

<escaped_quote> ::= "''"   -- doubled single-quote within string

<boolean_literal> ::= "TRUE" | "FALSE"

<missing_literal> ::= "NULL" | "MISSING"

<digit> ::= "0" | "1" | ... | "9"
```
NULL and MISSING are equivalent in Frame-QL. Both denote the absence of a value.
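The numeric-literal productions translate directly to a regular expression. The sketch below (not the reference lexer) also surfaces a consequence of the grammar: scientific notation is derived from <decimal_literal>, so a bare 1e5 with no decimal point is not a numeric literal.

```python
import re

INTEGER = r"-?[0-9]+"
DECIMAL = r"-?[0-9]+\.[0-9]*"            # fraction digits optional, per A.5.2
SCIENTIFIC = rf"{DECIMAL}[eE][+-]?[0-9]+"
NUMERIC = re.compile(rf"(?:{SCIENTIFIC}|{DECIMAL}|{INTEGER})")

def is_numeric_literal(s: str) -> bool:
    return NUMERIC.fullmatch(s) is not None

print(is_numeric_literal("3.14"))     # True
print(is_numeric_literal("-2.5e10"))  # True
print(is_numeric_literal("1e5"))      # False: no decimal point before the exponent
```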
A.5.3 Reserved keywords¶
The following identifiers are reserved and cannot be used as bare unquoted identifiers:
SELECT, FROM, WHERE, BY, HAVING, ORDER, LIMIT, AS, ASC, DESC,
WITH, AND, OR, NOT, IS, NULL, MISSING, TRUE, FALSE,
DISTINCT, IN, BETWEEN, LIKE,
CASE, WHEN, THEN, ELSE, END, IF,
CAST, NUMERIC, INTEGER, STRING, BOOLEAN, DATE, TIMESTAMP
Keywords are case-insensitive (SELECT and select are equivalent).
A.5.4 Comments¶
<comment> ::= <line_comment> | <block_comment>
<line_comment> ::= "--" { <any_char_except_newline> } ( <newline> | <end_of_input> )
<block_comment> ::= "/*" { <any_char_except_block_end> } "*/"
Block comments do not nest.
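A minimal comment-stripping sketch under these productions (illustrative only; it does not protect string literals, which a real lexer must):

```python
import re

LINE_COMMENT = r"--[^\n]*"
BLOCK_COMMENT = r"/\*.*?\*/"  # non-greedy: block comments do not nest

def strip_comments(query: str) -> str:
    # Replace each comment with a space so adjacent tokens stay separated.
    return re.sub(rf"{BLOCK_COMMENT}|{LINE_COMMENT}", " ", query, flags=re.DOTALL)

q = "SELECT region /* grain */, revenue -- metric\nBY region"
print(strip_comments(q))
```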
A.6 Whitespace¶
Whitespace between tokens is required where adjacent characters would otherwise form a single token (e.g., between two unquoted identifiers, between a keyword and an identifier). It is otherwise optional.
Whitespace characters include the space, tab, carriage return, and newline.
A.7 Conformance notes¶
A.7.1 Coframe Core conformance¶
A Frame-QL parser conforms to Coframe Core's specification iff it accepts the language defined by this grammar and rejects strings outside the language as syntax errors.
The grammar defines syntactic acceptance only. Semantic concerns — name binding, type checking, four-rule filter resolution, operator semantics, integrity-condition validation — are specified in Chapters 8, 9, and 10 and are not part of this grammar.
A.7.2 Reserved-keyword extensibility¶
This grammar's reserved-keyword list reflects Coframe Core's current scope. Coframe Pro may extend the keyword list to support additional syntactic constructs. Coframe Core-conformant parsers should treat unknown Coframe Pro keywords as binding errors at semantic time, not as syntax errors at parse time.
A.7.3 Optional extensions¶
Implementations may extend the grammar with operations Coframe Core does not require, provided the extensions do not break Coframe Core-valid queries. Specifically:
- Backend-specific functions in <function_call> (e.g., backend-specific date/time helpers) are permitted.
- Backend-specific data types in <type_name> (e.g., backend-specific decimal precision) are permitted.
Such extensions are deployment-specific. The grammar above is the framework-required minimum.
A.8 Worked grammar examples¶
The following Frame-QL queries demonstrate the grammar's structure.
Simple read:
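A minimal illustrative query (the family-names transaction and revenue are hypothetical):

```
SELECT transaction, revenue
BY transaction
```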
Aggregation with filter:
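An illustrative query (family-names hypothetical):

```
SELECT region, SUM(revenue) AS total_revenue
WHERE status = 'completed'
BY region
```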
Multi-input expression:
SELECT region,
SUM(revenue) / COUNT_DISTINCT(customer) AS revenue_per_customer
BY region
HAVING revenue_per_customer > 1000
WITH-block:
WITH
region_revenue AS (
SELECT region, SUM(revenue) AS total
BY region
),
region_customers AS (
SELECT region, COUNT_DISTINCT(customer) AS customers
BY region
)
SELECT region, total, customers, total / customers AS arpu
FROM region_revenue, region_customers
BY region
ORDER BY arpu DESC
LIMIT 10
Qualified reference for cousin disambiguation:
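An illustrative query (names hypothetical); the transactions. qualifier picks one of two cousins sharing the family-name revenue:

```
SELECT region, SUM(transactions.revenue) AS revenue
BY region
```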
Conditional expression:
SELECT region,
SUM(CASE WHEN amount > 0 THEN amount ELSE 0 END) AS positive_revenue,
COUNT(CASE WHEN status = 'returned' THEN 1 END) AS returns
BY region
These examples are syntactically valid per the grammar; their semantic resolution depends on the AC against which they're executed.
A.9 Where to go next¶
For semantic concerns, consult:
- Chapter 8 (Frame-QL): prose specification of the language's semantics, missing-value handling, and rung classification.
- Chapter 9 (Query Resolution): how queries route to backend operations, including the four-rule filter and MTI.
- Chapter 10 (Operator Catalog): the operators supported in expression contexts and their per-operator semantics.
For the parser implementation, consult the Coframe Core distribution's coframe-core package, which includes a reference parser implementing this grammar.
Appendix B: Glossary¶
Alphabetical reference of Coframe Core terms with brief definitions and pointers to the chapters where they are specified in detail.
How to use this glossary¶
This glossary is a quick-reference index. Each entry gives a brief definition and a chapter/section pointer where the term is fully specified. For more on the terminology's structure, see the Coframe Vocabulary Spine (a separate document).
The glossary is alphabetical. Compound terms are listed in their canonical hyphenated form (e.g., family-root, not family root). Mathematical notation entries appear at the start.
Notation¶
(E, M) paired declaration — A column's joint commitment to its entity-set anchoring E and missingness signature M. Both are declared together because each column is a property of its declared entities (Principle 1) and the missingness mechanism is bounded by those entities (per the structural rule M.determinants ⊆ E ∪ {self-token}). See §2.4.
E(c, S) — The entity-set anchoring of column c in schema S. The set of AC-dimensions whose values determine c's value. See §2.4.1.
E*(c, S) — The FD-DAG-extended entity set: the closure of E(c, S) under FD-DAG reachability (upward to coarser ancestors and downward to finer descendants). Used by the four-rule filter's Rule 2 to determine schema reachability for queries. See §3.8.3.
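The closure can be sketched as a breadth-first walk over FD-DAG edges in both directions (a plain-Python illustration with hypothetical dimension names, not framework code):

```python
from collections import defaultdict, deque

def e_star(entity_set, fd_edges):
    """Close entity_set under FD-DAG reachability, upward and downward.

    fd_edges: (finer, coarser) AC-dimension pairs, e.g. ("day", "month").
    """
    up, down = defaultdict(set), defaultdict(set)
    for finer, coarser in fd_edges:
        up[finer].add(coarser)
        down[coarser].add(finer)
    seen, todo = set(entity_set), deque(entity_set)
    while todo:
        d = todo.popleft()
        for nxt in up[d] | down[d]:
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return seen

edges = [("day", "month"), ("month", "year"), ("transaction", "day")]
print(sorted(e_star({"day"}, edges)))  # ['day', 'month', 'transaction', 'year']
```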
M(c, S) — The missingness signature of column c in schema S. One of MCAR, MAR, or MNAR with declared determinants. See §2.4.2.
m_pred --op--> m — Notation for an operation linking a predecessor metric m_pred = (name_pred, E_pred) to a successor metric m = (name, E) via operator op. See §2.6.
A¶
AC — See Analytics Collection.
AC-attribute — A column trichotomy classification: a non-grain-role column whose anchoring E has cardinality 1, observing a property of its anchor entity. Examples: customer_name, store_address. See §2.5.2.
AC-dimension — A column trichotomy classification: a column appearing in grain role (E = {c}) in some schema. AC-dimensions are the entities the AC observes. See §2.5.1.
AC-metric — A column trichotomy classification: a non-grain-role column whose anchoring E may have cardinality ≥ 1 and varies across schemas. AC-metrics are the analytical quantities the AC measures. See §2.5.3.
AC scope — What the AC author chooses to expose. Three components: selection (which columns to include), naming (what to call them), and structural commitments (how each ColumnSpec is declared). Backend columns without ColumnSpec declarations are outside the AC scope. See §2.3.4.
AC Verification Level — An ordinal characterization of an AC's verification status: Level A (structural well-formedness), Level AA (verified structural integrity — every dimensional structural commitment is grounded), or Level AAA (verified cross-schema metric coherence — every metric coherence commitment is grounded). The levels are monotonic: AA implies A; AAA implies AA. Grounding admits two sources: empirical (data-attested through DQ) and deductive (verified-by-construction through operator catalog semantics). Computed deterministically from the AC's integrity-condition results and the grounding status of its commitments. Informational in v1.0; stable surface in v1.x. See §7.13, §7.13.4.
Analytics Collection (AC) — A Coframe artifact capturing a coherent AC scope over a backend's data. Consists of schemas, ColumnSpecs, AC-level annotations, and verified integrity status. See §2.3.
Ancestry tree — The chain of predecessor metrics recoverable by walking DNA from a column backward to a root. Linear for unary-operator columns; branching for multi-input columns. See §2.7.2.
Anchor-independence of names — A naming rule: the successor's name does not encode the successor's anchor. Two operations with the same (name_pred, op) but different output anchors produce successors with the same name. See §2.6.
Anchor-locked family — A family whose family-root has a non-partition-invariant operator (e.g., AVG, MEDIAN, COUNT_DISTINCT). Anchor-locked families have no ip_reducer; their columns exist at specific anchors but cannot be derived to other anchors via name-preserving aggregation. See §2.7.6.
Annotations — Metadata accompanying query results, documenting missing-value treatment, biases, partial coverage, and similar concerns. Annotations propagate compositionally through query operations. See §10.3.4 and §10.7.2.
Attestation, per-DNA-edge — Verification that a successor metric's values agree with the predecessor's values, aggregated via the family's ip_reducer at the successor's anchor, within tolerance, on shared keys, scoped to the intersection of the two schemas' declared scopes. Performed during DQ Phase 3 by default in Coframe Core; engineers may opt out per AC. The mechanism that makes the cross-schema metric coherence statement a verified condition rather than an unverified lemma. See §7.6.8.
B¶
Backend — The data engine (DuckDB, Polars, Snowflake, etc.) holding the physical data referenced by an AC. Coframe Core ACs reference one backend; Coframe Pro supports multi-backend ACs. See §1.1, §1.5.
Broadcast — An operator type with E_pred ⊆ E (replicating a coarser-grain attribute to a finer-grain anchor). Broadcast is not a Coframe Core operator type; broadcasting in Coframe Core is handled by Frame-QL's Rung 2 mechanism at query time. Coframe Pro supports broadcast as a first-class operator type. See §2.6.1, §1.5.
BY clause — A Frame-QL clause specifying the output grain of a frame. Mandatory on outer frames. The framework navigates from input grains to the BY-clause grain via the FD-DAG. See §8.5.4.
C¶
Candidate FD-DAG — The engineer's declared FD-DAG, which DQ verifies against the data-driven FD-DAG. The candidate is the input; the data-driven version is the verified output. See §2.8.2.
Catalog default naming function — Default mappings from (name_pred, E_pred, op) to name_self for non-identity-preserving operators, provided as illustrative starting points in the operator catalog. AC authors may adopt them via naming_function: catalog_default. See §10.3.6, §3.7.3.
Coframe — The framework as a whole. The grammar-layer thesis, the structural commitments, the verification regime. Edition-independent. Coframe ships in two editions: Coframe Core (open-source) and Coframe Pro (commercial). See Chapter 2.
coframe (lowercase) — The form used in code/identifier contexts (Python package names, file paths, configuration keys) and as a generic noun where edition is irrelevant.
Coframe Core — The open-source edition of Coframe, specified by this manual. Single-backend, closed operator catalog, deterministic missing-value handling, query-focused. AC-attribute model assumes stable attribute values per entity (|E| = 1 for AC-attributes). See §1.1, §2.5.
Coframe Pro — The commercial edition extending Coframe Core. Adds custom operators, multi-backend support, Slowly Changing Attributes (SCA), configurable strictness, persistent re-ingestion, sensitivity analysis, and richer authoring tooling. See §1.5.
ColumnSpec — The AC's declaration of a single column. Structurally divided into four parts: backend-facing (src_name, data_type), entity-facing (E, M), operator/operation-facing (op, dna), and cross-schema linkage (name). See §2.3.6 and Chapter 3.
Combination law — Legacy term replaced in the redesigned framework by per-operator partition_invariance. See §2.6.1.
Composite-grain fact schema — A schema-type classification: a schema with multiple AC-dimensions in grain role. See §2.9.3.
Cousins — A structural relation among columns: same family-name, different family-roots. Cousins represent observationally independent metrics that share a name. The framework refuses queries that resolve to multiple cousins as dubious. See §2.7.5.
Coverage map — Per AC-dimension per schema, the value-set observed in the schema relative to the universe-wide value set, with classification (fully covered, coverage-restricted, attribution-incomplete). Produced by DQ Phase 3. See §7.6.2.
Cross-schema metric coherence — The structural fact that siblings of the same family-root produce coherent values across schemas under partition-invariant operators. Verified by per-DNA-edge value attestation during DQ Phase 3 in default Coframe Core configurations; asserted-not-verified in opted-out configurations. See §2.10.5, §7.6.8, §7.7.2.
D¶
Data-API — The protocol Coframe Core uses to communicate with backends. Backends implement the protocol; the framework consumes it for introspection, verification, and query execution. See Chapter 6.
Data-driven FD-DAG — The FD-DAG attested by the data via DQ Phase 3, contrasted with the candidate FD-DAG declared by the engineer. The integrity condition Logical FD-DAG ⊆ Data-driven FD-DAG requires that declared edges be data-attested. See §2.8.2.
Data-quality (DQ) — The framework's structural-verification process. Three phases: metadata-only verification (Phase 1), quasi-metadata fetch (Phase 2), and quasi-metadata-derived verification (Phase 3). Produces integrity violations, advisories, and the structural-verification deliverable. See Chapter 7.
Declared scope — A schema's commitment to be degenerate on specific AC-dimensions with specific value-sets. The framework verifies declared scope against quasi-metadata. See §2.9.4.
Deductive verification regime — One of two parallel verification regimes Coframe operates in. The deductive regime grounds structural commitments in function semantics plus type-checking plus engine correctness: a function-derived FD-edge or function-derived metric is verified by construction, with no data attestation needed. Contrasted with the empirical verification regime (data-attested through DQ). Both regimes contribute uniformly to the FD-DAG, family genealogy, and AC Verification Levels. Coframe Core uses the deductive regime for catalog-defined operators and Frame-QL inline expressions; Coframe Pro lifts the duality to the framework's primary architectural framing. See §2.8.5, §1.5.
Default naming function — See Catalog default naming function.
Different families — A structural relation: columns with different name values belong to different families and share no structural relation under the framework's grammar-layer reasoning. See §2.7.5.
DNA — A column's structural representation of its operational lineage: a snapshot (name_pred, E_pred, op_pred) capturing the predecessor metric. Self-referential for root columns. Walking DNA backward yields the column's ancestry tree. See §2.7.1.
Dubious query — A query whose resolution would produce ambiguous results because a referenced family-name resolves to multiple cousins (or another structural source of ambiguity). The framework refuses dubious queries with a structured diagnostic and asks for engineer disambiguation. See §9.7.
E¶
Effective signature M_eff — For reducer operations, the relevant missingness signature for cell-level treatment. Computed as the input column's M restricted to entities being collapsed by the reduction. See §10.3.1.
Empirical verification regime — One of two parallel verification regimes Coframe operates in. The empirical regime grounds structural commitments in data attestation through DQ: declared FD-edges are verified against actual data tuples; cross-schema metric coherence is verified per attestable DNA edge during DQ Phase 3; pre-aggregated metrics are checked against finer-grained sources. Contrasted with the deductive verification regime (verified-by-construction through function semantics). Both regimes contribute uniformly to the FD-DAG, family genealogy, and AC Verification Levels. See §2.8.5, §1.5.
Entity — In Coframe's conceptual foundation, the key space primitive: what an observation is about. AC-dimensions are columns appearing in entity-anchoring (grain) role; entity-set declarations on each ColumnSpec specify what each column observes. See §2.2.4.
Entity, Family, Operator (the triple) — Coframe's three universal conceptual primitives. Entity manages the key space (what's observed); Family manages the value space (what's recorded); Operator manages the operational linkage (how observations transform). Every structural rule and integrity condition in the framework can be expressed in terms of how these three relate. See §2.2.4.
Entity-set anchoring — See E(c, S).
F¶
Fact schema — A schema-type classification: a schema with non-grain-role columns observing AC-metrics. Contrasted with reference schemas. See §2.9.3.
Family — A set of columns sharing a family-name. Every metric column belongs to exactly one family. See §2.7.3.
Family-DAG — The AC-wide structure of derivation relationships among families. Primitive families are roots; derived families have predecessor families. See §2.7.8.
Family-name — A column's name field. The framework treats family-names as opaque labels; family membership is determined by string equality on names. See §2.7.3.
Family-root — The earliest ancestor in a column's ancestry tree that shares the column's family-name. Found by walking DNA backward as long as name_pred equals the column's name. The framework computes family-roots; AC authors do not declare them. See §2.7.4.
FD-DAG — The framework's structural representation of functional-dependency relationships among AC-dimensions. Acyclic. See §2.8.
FD edge — An edge in the FD-DAG: source AC-dimension functionally determines target AC-dimension. Declared with one of three channels: declared, reference_table, or computed. See §5.7.
Four-rule filter — The framework's mechanism for selecting schemas to serve a query column term. Four rules: family membership (Rule 1), entity-set capability (Rule 2), coverage consistency (Rule 3), family-root agreement / sibling check (Rule 4). See §9.5.1.
Frame — A Frame-QL query unit. Outer Frames have a mandatory BY clause; inner Frames within WITH-blocks may inherit the outer's grain. See §8.4.
Frame-QL — The declarative query language for Coframe Core. Queries reference AC family-names rather than physical column names; the framework handles structural resolution. See Chapter 8.
Function — An operator type that transforms values row-wise without aggregating. For function operations, E_pred = E. See §2.6.1.
Function-derived FD-edge — An FD-edge established by a deterministic unary function from the operator catalog (e.g., month = MONTH_OF(day)). Verified by construction (function semantics) rather than by data attestation. Populates the FD-DAG identically to a data-attested FD-edge. The Coframe Core regime for catalog-defined operators; the duality is generalized in Coframe Pro. See §2.8.5, §1.5.
Function-derived metric — A metric established by a Frame-QL inline expression composed of catalog operators (e.g., profit = SUM(revenue) - SUM(cost), unit_price = revenue / quantity). Verified by construction rather than by data attestation. Participates in the family genealogy on equal structural footing with data-stored metrics. See §2.8.5, §1.5.
G¶
Grain — A schema's grain is the set of grain-role columns: grain(S) = {c : c is grain-role in S}. See §2.9.2.
Grain integrity — A structural commitment that a schema's grain-role columns' value tuples are unique per row. Verified at DQ Phase 2 via the data-API. See §3.9.2.
Grain-role column — A column where E(c, S) = {c}. The column anchors itself in this schema. AC-dimensions appear in grain role in some schema. See §2.5.1.
Grammar layer — The framework's structural-reasoning surface: the structural metadata about how analytical data is organized (anchors, derivations, family relationships, FD-DAG, integrity conditions). Distinguished from the semantic layer. See §1.2, §2.11.2.
Grounded — A structural commitment is grounded when its truth is verified by at least one mechanism. The AC Verification Levels (§7.13) admit two grounding sources: empirical (data-attested through DQ; the structure is found in the data) and deductive (verified-by-construction through operator catalog semantics; the structure is true by the function's definition). Both grounding sources are legitimate; the level definitions reflect what's verified, not which mechanism verified it. Mixed-grounded commitments (a function-derivable structural object also materialized as data) require cross-check verification — that the materialized values agree with the function output. See §7.13.4.
I¶
Identical — A structural relation: two columns with the same (name, E) and the same family-root. Identical columns are interchangeable for query purposes. See §2.7.5.
Identity-preservation — A property of an operator with respect to a predecessor: the operator produces a successor whose name equals the predecessor's. For reducers, identity-preservation requires the operator equals the predecessor's family ip_reducer. For functions, identity-preservation is declared as a flag in the operator-catalog entry. See §2.6.2.
Identity-preserving reducer (ip_reducer) — See ip_reducer.
Integrity condition — A structural fact the framework verifies. Includes well-formedness rules verified at AC validation, and data-attested rules verified by DQ. See §2.10.
ip_reducer — A property of a family: the operator under which the family's columns are interchangeable across anchors via partition-invariant aggregation. A family has an ip_reducer iff its family-root's op has partition_invariant: true in the operator catalog. See §2.7.6.
L¶
Lemma — In the framework's grammar layer, a structural fact the framework relies on. The framework distinguishes lemmas verified-by-default-with-opt-out (currently: cross-schema metric coherence, verified per DNA edge during DQ Phase 3) from lemmas asserted-not-verified (currently: catalog-declared partition-invariance and identity-preservation; the engineer's principle commitments; naming consistency when no naming function is declared). See §7.7.
Level A — The first AC Verification Level: structural well-formedness. The AC's metadata is internally consistent (integrity conditions I0–I9 hold). No data has been examined and no function evaluation has been required. See §7.13.1.
Level AA — The second AC Verification Level: verified structural integrity. Level A plus every dimensional structural commitment is grounded — by data-attestation (passing I3–I6 against actual data), by verification-by-construction (function-derived FD-edges grounded by operator catalog semantics), or by a mix of the two with cross-checking. Cross-schema metric coherence is not yet grounded at AA. See §7.13.2, §7.13.4.
Level AAA — The third AC Verification Level: verified cross-schema metric coherence. Level AA plus every metric coherence commitment is grounded — by data-attestation (passing I10 per-DNA-edge value attestation), by verification-by-construction (function-derived metrics grounded by operator catalog semantics), by mixed-and-cross-checked grounding, or by transparent toleration with rationale. The Multi-Table Invariance theorem is an unconditional guarantee within scope at AAA. See §7.13.3, §7.13.4.
Logical FD-DAG — See Candidate FD-DAG.
M¶
MAR — Missing At Random. A missingness signature where the missingness mechanism depends on declared determinants other than the column itself. Determinants must be in E ∪ {self-token} and the column itself must not be a determinant. See §2.4.2.
MCAR — Missing Completely At Random. A missingness signature where missingness is independent of determinants. M.signature = "MCAR", M.determinants = []. See §2.4.2.
MCP — Model Context Protocol. The standard for letting LLMs interact with structured tools and data sources. Coframe Core's MCP server exposes ACs to LLM clients. See Chapter 11.
Metric genealogy — The AC-wide structure of all metric columns organized by family and by family-root. The framework's primary structural object for reasoning about AC-metrics across schemas. See §2.7.5.
Missingness signature — See M(c, S).
MNAR — Missing Not At Random. A missingness signature where the column itself is among the determinants of its missingness. See §2.4.2.
Multi-Table Invariance (MTI) — The structural guarantee that schemas surviving the four-rule filter for a query with the same family-root produce equivalent results. The theorem makes the framework's automatic schema selection structurally trustworthy. See §9.6.
N¶
Naming function — An AC-level declaration mapping (name_pred, E_pred, op) to name for non-identity-preserving operations. Four declaration options: catalog default, override, custom, or none. The framework treats the function as a black box. See §3.7.
Naming relationship — The structural commitment between predecessor and successor names: identical when the operation is identity-preserving; different when not. Reflects the structural fact that aggregation-consistency between predecessor and successor justifies sharing a name. See §2.6.3.
O¶
OBSERVED operator — A special operator denoting that the column's values come from outside the AC (observed directly from the backend), not derived through any operator within the AC. Used as the op field for grain-role columns and for other observationally-rooted columns. See §10.6.
op — A ColumnSpec field: the operator that produced this column. For root columns, the operator under which the column is observationally rooted (typically the family ip_reducer or OBSERVED). See §3.5.1.
Open operational space — Informal phrase for the structural property described in §1.2.1: cross-grain navigation extends to function-derived groupings, not just data-attested ones. The FD-DAG and family genealogy admit function-derived edges and metrics — produced by deterministic operator catalog functions like MONTH_OF, BUCKET, SUBSTR — as first-class participants alongside data-attested ones. The reasoning surface scales with what the data admits given the operator catalog, not just with what's been pre-materialized as columns. The full architectural generalization of the data-borne / function-borne duality (admitting user-defined deterministic functions as first-class structural objects under explicit empirical and deductive verification regimes) is Coframe Pro's generalized functional grammar layer.
Operation — A predecessor-to-successor link in the metric genealogy: m_pred --op--> m. Operations are governed by well-formedness conditions on the E-relation and the naming relationship. See §2.6.
Operator catalog — The framework's specification of supported operators with type, partition_invariance, identity-preservation, default naming, and missing-value treatment per (operator, signature). See Chapter 10.
Operator type — One of reducer (aggregates over rows, collapsing entities) or function (transforms values row-wise). Coframe Pro additionally recognizes broadcast. See §2.6.1.
P¶
Partition-invariant — A reducer property: the reducer distributes over partitions of input rows. Partition-invariant reducers can serve as family ip_reducers. SUM, MAX, MIN, COUNT, BOOL_AND, BOOL_OR, BIT_AND/OR/XOR are partition-invariant; AVG, MEDIAN, COUNT_DISTINCT, STDEV, etc. are not. See §9.4.
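The property can be illustrated in plain Python (numbers hypothetical): a partition-invariant reducer gives the same answer whether applied to all rows at once or to per-partition results, while AVG does not.

```python
rows = [10, 20, 30, 40]
partitions = [[10], [20, 30, 40]]  # an uneven partition of the same rows

# SUM distributes over partitions: SUM(rows) == SUM of per-partition SUMs.
print(sum(rows) == sum(sum(p) for p in partitions))  # True

# AVG does not: the AVG of per-partition AVGs weights partitions equally.
avg = sum(rows) / len(rows)                                        # 25.0
avg_of_avgs = sum(sum(p) / len(p) for p in partitions) / len(partitions)
print(avg_of_avgs)         # 20.0, not 25.0
print(avg_of_avgs == avg)  # False
```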
Per-DNA-edge value attestation — See Attestation, per-DNA-edge. Verification of individual DNA edges by computing aggregations over predecessor data and comparing to the successor's observed values. Performed during DQ Phase 3 by default in Coframe Core, with per-AC opt-out; the mechanism that grounds cross-schema metric coherence. See §7.6.8.
Predecessor — The input metric of an operation. In m_pred --op--> m, m_pred is the predecessor. See §2.6.
Primitive family — A family whose family-root is a root column (DNA self-referential). Contrasted with derived families. See §2.7.8.
Principle 1: Column-borne information — Every column's value is determined by its declared entities. Formally: E(c, S) is the set of entities such that two rows in S with the same values in E(c, S) have the same value in c. See §2.2.1.
Principle 2: Same universe of observation — Schemas in an AC observe the same universe of entities. Same-named AC-dimensions and AC-attributes have consistent value mappings across schemas. See §2.2.2.
Q¶
Quasi-metadata — The data information DQ fetches via the data-API to support principle-verification. Per-AC-dimension-per-schema observed value sets, per-pair value-mappings, per-column missing counts, etc. See §7.5.
Query resolution — The framework's process for taking a parsed Frame-QL query and producing a backend-executable plan. Includes the four-rule filter, schema selection, dubious-query detection, and execution planning. See Chapter 9.
R¶
Recursive hierarchy — A self-referential hierarchical pattern where each member of an entity-set has a parent in the same entity-set: employee-manager organizational hierarchies, parent-part bills-of-materials, message-thread reply structures. Not in Coframe Core's scope; supported as a first-class concept in Coframe Pro with recursive query primitives in Frame-QL. Engineers using Core can pre-flatten such hierarchies during ETL or use closure-table modeling. See §1.5.
Reducer — An operator type that aggregates over rows, collapsing entities. For reducer operations, E_pred ⊇ E under FD-DAG navigation. See §2.6.1.
Redundant-grain rule — A structural rule: in any schema, grain-role columns must have no FD-DAG ancestor that is also in the schema's grain. The framework remediates by re-declaring redundant grain-role columns as non-grain references. See §7.6.6.1.
Reference schema — A schema-type classification: a schema with one AC-dimension in grain role and other columns observing AC-attributes of the grain dimension. See §2.9.3.
Root column — A column whose DNA is self-referential. Roots are observationally rooted in the AC; they have no further predecessor within the AC's structural reasoning. See §2.7.1.
Rung (Frame-QL) — A Frame-QL operation classification. Coframe Core supports Rungs 0 (read), 1 (identity-preserving reduction), 2 (broadcast), 6 (multi-input expressions), 7 (cross-schema reach), 9 (WITH-chained frames). Other rungs are simplified or Coframe Pro-only. See §8.8.
S¶
SCA — Slowly Changing Attribute. The structural pattern where an attribute attached to a stable entity has time-varying values: a customer's segment that changes over months, a product's category that changes over years, a store's region that changes when stores are reassigned. In Coframe Pro, an SCA is expressed as E(a, S) = {d, t} — an attribute anchored at (entity, slow-time-grain) rather than at (entity) alone. The traditional data-warehousing term is "Slowly Changing Dimension" (SCD), but Coframe's vocabulary distinguishes the entity (which is identity-stable) from the attribute (which is time-varying); SCA is the precise term. Coframe Core handles attribute time-variance via event modeling rather than as time-varying attribute structure; SCA as a structural concern is a Pro feature. See §1.5.
SCD — See SCA. "Slowly Changing Dimension" is the data-warehousing tradition's term for what Coframe more precisely calls a Slowly Changing Attribute.
SCH — Slowly Changing Hierarchy. A generalization of SCA where FD-DAG edges themselves vary over time (e.g., the category → department mapping changes when departments are reorganized). Coframe Pro feature; not in Core.
Schema — A virtual table in an AC: a logical view over backend data with declared ColumnSpecs. Coframe Core ACs may have multiple schemas; queries automatically draw on multiple schemas via the four-rule filter. See §2.9.
schema.init — The YAML format engineers author as input to DQ. Specifies the AC's structure: schemas, ColumnSpecs, naming function declaration, candidate FD-DAG, instructions. See Chapter 5.
Schema scope — See Declared scope.
Schema type — Classification of a schema as reference, fact, composite-grain fact, etc., based on its grain and column structure. See §2.9.3.
Self-referential DNA — A DNA that points to its own column. Identifies the column as a root. See §2.7.1.
Semantic layer — The interpretive surface of analytical data: what metrics mean, what conventions a team uses, what story the data tells. The engineer's domain. Distinguished from the grammar layer. See §1.2, §2.11.2.
Siblings — A structural relation: same family-name, different E, same family-root. Siblings represent the same conceptual metric observed at different anchors. The four-rule filter selects siblings as substitutable schemas; MTI's domain is precisely the siblings. See §2.7.5.
Singleton — A column produced by a multi-input operation (e.g., a registered ratio). Singletons stand alone in the metric genealogy; other columns do not derive from them through DNA. See §2.7.7.
Structural relation — A definite pair-wise relation between columns in the AC: identical, sibling, cousin, or different families. See §2.7.5.
Structural rigor as binary — The framework's posture: integrity conditions are non-negotiable. Either an AC honors them or it doesn't. See §2.11.1.
Successor — The output metric of an operation. In m_pred --op--> m, m is the successor. See §2.6.
T¶
Three-valued logic — The logical system Frame-QL uses for boolean operations: TRUE, FALSE, and NULL/MISSING. Standard SQL three-valued logic semantics. See §10.5.3.
Trichotomy — See Column trichotomy (alternate term: AC-dimension / AC-attribute / AC-metric classification). See §2.5.
U¶
Universe of observation — The set of entities the AC's schemas collectively observe. Per Principle 2, all schemas in an AC observe the same universe. See §2.2.2.
Universe-wide value set — For an AC-dimension, the union of its observed value sets across all non-degenerate schemas. Used for coverage analysis. See §7.6.1.
V¶
Virtual splitting — An AC authoring pattern where a single physical table is represented as multiple virtual tables in the AC, each with a filter clause defining which rows it contains. See §5.4.4.
Virtual table — A schema in an AC, conceptually distinct from any single physical backend table. May map directly to a physical table or use a query/filter for a derived view. See §2.9.1.
W¶
Well-formedness conditions — The structural rules an operation must satisfy: operator-type-appropriate E-relation, name-relationship consistency. Violations are integrity errors. See §2.6.4.
WITH-block — A Frame-QL construct defining inner frames whose results are referenceable by subsequent frames in the same query. Inner frames are session-local in Coframe Core. See §8.7.
Where to go next¶
For terms not in this glossary, consult the relevant chapter directly. The chapters define their own technical vocabulary in context.
For the structural relationships among the framework's terms (rather than alphabetical lookup), consult the Coframe Vocabulary Spine.
Appendix C: Worked Example — The Retail AC¶
A complete walkthrough from a backend warehouse to a working Coframe Core AC, with example queries and expected behaviors.
C.1 Overview¶
This appendix walks through the retail AC end-to-end. The example is the same retail scenario referenced throughout the manual; this appendix shows how the pieces fit together as a complete authoring exercise.
The walkthrough covers:
- The backend warehouse's tables and their content (§C.2).
- Phase 1 (Discovery): selecting what to expose and drafting an initial schema.init (§C.3).
- The complete schema.init (§C.4).
- Phase 2-3 (DQ): what verification produces (§C.5).
- Sample queries against the verified AC (§C.6).
- A cousin disambiguation example (§C.7).
- An AI-assisted query through the MCP server (§C.8).
The example uses simplified data for clarity. Real-world ACs are larger but follow the same structure.
C.2 The backend warehouse¶
The retail organization's warehouse has these tables in DuckDB:
customers (1.2M rows)
- customer_id (integer, primary key)
- customer_name (varchar)
- customer_email (varchar)
- customer_phone (varchar)
- customer_segment (varchar)
- customer_signup_date (date)
- customer_address_line1 (varchar)
- customer_address_city (varchar)
- customer_zip (varchar)
- customer_marketing_consent (boolean)
- created_at (timestamp)
- updated_at (timestamp)
- etl_batch_id (integer)
stores (250 rows)
- store_id (integer, primary key)
- store_name (varchar)
- store_address (varchar)
- store_city (varchar)
- store_state (varchar)
- region_id (integer, FK to regions)
- country_id (integer, FK to countries)
- store_open_date (date)
- store_status (varchar)
- etl_batch_id (integer)
transactions (~50M rows, ~1.5M new per day)
- transaction_id (integer, primary key)
- customer_id (integer, FK)
- store_id (integer, FK)
- product_id (integer, FK)
- transaction_date (date)
- transaction_timestamp (timestamp)
- amount (decimal)
- units_sold (integer)
- discount_amount (decimal)
- tax_amount (decimal)
- payment_method (varchar)
- is_returned (boolean)
- etl_batch_id (integer)
store_revenue_monthly (15K rows, populated by upstream ETL)
- store_id (integer)
- month (date)
- total_revenue (decimal)
- peak_daily_revenue (decimal)
- transaction_count (integer)
- unique_customer_count (integer)
- etl_batch_id (integer)
The warehouse has additional tables (regions, products, returns_log, etc.) not shown for brevity. The retail AC will reference some of these.
C.3 Phase 1: Discovery¶
The engineer reviews the warehouse and decides what to include in the AC scope.
C.3.1 Selection decisions¶
For the analytics use case (revenue analysis by region, customer segmentation, peak performance tracking), the engineer selects:
From customers: customer_id, customer_name, customer_segment. Excluded: PII fields (email, phone, address, zip), the marketing-consent flag, ETL bookkeeping (created_at, updated_at, etl_batch_id), and signup_date (not needed for this AC's purpose).
From stores: store_id, store_name, region_id, country_id. Excluded: address details, opening date, status, ETL bookkeeping.
From transactions: transaction_id, customer_id, store_id, product_id, transaction_date, amount, units_sold, is_returned. Excluded: timestamps below day grain, payment_method, tax/discount details (separate analytical purpose), ETL bookkeeping.
From store_revenue_monthly: store_id, month, total_revenue, peak_daily_revenue, unique_customer_count. Excluded: transaction_count (could be derived), ETL bookkeeping.
This selection produces an AC with 20 ColumnSpecs across 4 schemas — a focused analytical surface, not a full backend mirror.
C.3.2 Naming decisions¶
The engineer chooses to:
- Adopt the operator catalog's default naming function (`naming_function: catalog_default`).
- Name AC-dimensions in their natural-language form: `customer`, `store`, `region`, `country`, `product`, `transaction`, `date`, `month`, `quarter`, `year`.
- Name AC-attributes after their concept: `customer_name`, `customer_segment`, `store_name`.
- Name AC-metrics in line with the catalog defaults: `revenue`, `units_sold`, `peak_revenue`, `customer_count`.
C.3.3 Candidate FD-DAG¶
Based on the warehouse structure, the engineer identifies the following FD-edges:
- `store → region` (from `stores.region_id`)
- `store → country` (from `stores.country_id`)
- `region → country` (declared; trust-instruction added)
- `transaction → date` (from `transactions.transaction_date`)
- `date → month` (computed: `MONTH_OF(date)`)
- `month → quarter` (computed: `QUARTER_OF_MONTH(month)`)
- `quarter → year` (computed: `YEAR_OF_QUARTER(quarter)`)
The framework infers transitive edges (store → country, date → year, etc.) automatically.
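The transitive-closure step is mechanical. A minimal sketch in plain Python (illustrative, not the framework's implementation) that derives the implied edges from the seven declared ones:

```python
def transitive_closure(edges):
    """Derive all implied FD-edges from the declared direct edges."""
    closure = set(edges)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

declared = {
    ("store", "region"), ("store", "country"), ("region", "country"),
    ("transaction", "date"), ("date", "month"),
    ("month", "quarter"), ("quarter", "year"),
}

# The edges the framework would add: store -> country is already
# declared, but transaction -> month/quarter/year, date -> quarter/year,
# and month -> year all emerge transitively.
inferred = transitive_closure(declared) - declared
```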
C.4 The complete schema.init¶
schema_init:
ac_name: retail_analytics_v1
ac_description: |
Retail analytics AC covering transactions, stores, customers,
and pre-aggregated monthly summaries. Scope: revenue and
customer analysis at flexible grains.
naming_function: catalog_default
collection:
- virtual_table:
schema_name: customers
source:
backend: warehouse_main
physical_table: customers
column_specs:
- column_spec:
src_name: customer_id
name: customer
data_type: integer
E: [customer]
- column_spec:
src_name: customer_name
name: customer_name
data_type: string
E: [customer]
M:
signature: MCAR
determinants: []
op: OBSERVED
- column_spec:
src_name: customer_segment
name: customer_segment
data_type: string
E: [customer]
M:
signature: MCAR
determinants: []
op: OBSERVED
- virtual_table:
schema_name: stores
source:
backend: warehouse_main
physical_table: stores
column_specs:
- column_spec:
src_name: store_id
name: store
data_type: integer
E: [store]
- column_spec:
src_name: store_name
name: store_name
data_type: string
E: [store]
M:
signature: MCAR
determinants: []
op: OBSERVED
- column_spec:
src_name: region_id
name: region
data_type: integer
E: [store]
M:
signature: MCAR
determinants: []
op: OBSERVED
- column_spec:
src_name: country_id
name: country
data_type: integer
E: [store]
M:
signature: MCAR
determinants: []
op: OBSERVED
- virtual_table:
schema_name: transactions
source:
backend: warehouse_main
physical_table: transactions
column_specs:
- column_spec:
src_name: transaction_id
name: transaction
data_type: integer
E: [transaction]
- column_spec:
src_name: customer_id
name: customer
data_type: integer
E: [transaction]
M:
signature: MAR
determinants: [transaction]
op: OBSERVED
- column_spec:
src_name: store_id
name: store
data_type: integer
E: [transaction]
M:
signature: MCAR
determinants: []
op: OBSERVED
- column_spec:
src_name: product_id
name: product
data_type: integer
E: [transaction]
M:
signature: MCAR
determinants: []
op: OBSERVED
- column_spec:
src_name: transaction_date
name: date
data_type: date
E: [transaction]
M:
signature: MCAR
determinants: []
op: OBSERVED
- column_spec:
src_name: amount
name: revenue
data_type: numeric
E: [transaction]
M:
signature: MCAR
determinants: []
op: SUM
dna:
name: revenue
E: [transaction]
op: SUM
- column_spec:
src_name: units_sold
name: units_sold
data_type: integer
E: [transaction]
M:
signature: MCAR
determinants: []
op: SUM
dna:
name: units_sold
E: [transaction]
op: SUM
- column_spec:
src_name: is_returned
name: is_returned
data_type: boolean
E: [transaction]
M:
signature: MCAR
determinants: []
op: OBSERVED
- virtual_table:
schema_name: store_monthly_summary
source:
backend: warehouse_main
physical_table: store_revenue_monthly
column_specs:
- column_spec:
src_name: store_id
name: store
data_type: integer
E: [store]
- column_spec:
src_name: month
name: month
data_type: date
E: [month]
- column_spec:
src_name: total_revenue
name: revenue
data_type: numeric
E: [store, month]
M:
signature: MCAR
determinants: []
op: SUM
dna:
name: revenue
E: [transaction]
op: SUM
- column_spec:
src_name: peak_daily_revenue
name: peak_revenue
data_type: numeric
E: [store, month]
M:
signature: MCAR
determinants: []
op: MAX
dna:
name: revenue
E: [store, day]
op: SUM
- column_spec:
src_name: unique_customer_count
name: customer_count
data_type: integer
E: [store, month]
M:
signature: MCAR
determinants: []
op: COUNT_DISTINCT
dna:
name: customer
E: [transaction]
op: OBSERVED
fd_dag:
- source: store
target: region
channel: reference_table
table: stores
- source: store
target: country
channel: reference_table
table: stores
- source: region
target: country
channel: declared
- source: transaction
target: date
channel: reference_table
table: transactions
- source: date
target: month
channel: computed
mapping: MONTH_OF(date)
- source: month
target: quarter
channel: computed
mapping: QUARTER_OF_MONTH(month)
- source: quarter
target: year
channel: computed
mapping: YEAR_OF_QUARTER(quarter)
instructions:
- directive: trust_declared_FD
edges:
- source: region
target: country
rationale: |
Maintained by upstream reference data; transient violations
possible during region reorganizations.
This schema.init has 4 schemas, 20 ColumnSpecs total, and explicit FD-DAG declaration with 7 direct edges (transitive closure adds more).
C.5 Phase 2-3: DQ output¶
The framework runs DQ against this schema.init. Plausible findings:
C.5.1 Successful verification¶
- All 20 ColumnSpecs pass Phase 1 metadata-only verification.
- The candidate FD-DAG is fully attested by the data-driven FD-DAG.
- Cross-schema value-mapping consistency holds for `customer`, `store`, `region`, `country`, `month`.
- Grain integrity holds in all four schemas.
- The metric genealogy is well-formed:
  - `revenue` family has root in `transactions`, with sibling at `(store, month)` in `store_monthly_summary`.
  - `peak_revenue` family has its own root at `MAX` of revenue at `(store, day)` grain.
  - `customer_count` family is rooted at `COUNT_DISTINCT` of `customer` at transaction grain (anchor-locked, no ip_reducer).
C.5.2 Advisories surfaced¶
The framework surfaces some advisories:
- Attribution-incomplete advisory: the `customer` column in `transactions` is declared MAR with `transaction` as determinant, but ~3.2% of transactions have `customer_id = NULL`. The advisory notes this and suggests either declaring synthetic-unknown or accepting the incomplete attribution.
- Coverage advisory: the `store_monthly_summary` schema covers months January 2025 through April 2026 (fully covered for those months), but doesn't cover transactions earlier than January 2025. Engineers querying revenue-by-region for 2024 will be served from `transactions` only; queries for 2025-2026 may use either schema.
The engineer reviews the advisories and accepts both as intentional. The AC validates and is ready for query workloads.
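The check behind the attribution-incomplete advisory is easy to reproduce by hand. A hedged sketch in plain Python over an illustrative row sample (the function and threshold-free logic here are an assumption for exposition, not the framework's exact rule):

```python
def attribution_incompleteness(rows, determinant_col, attribute_col):
    """Fraction of rows where the determinant is present but the
    MAR-declared attribute is missing (illustrative sketch)."""
    observed = [r for r in rows if r.get(determinant_col) is not None]
    missing = [r for r in observed if r.get(attribute_col) is None]
    return len(missing) / len(observed) if observed else 0.0

# Illustrative transactions sample: 1 of 4 rows lacks customer_id.
sample = [
    {"transaction_id": 1, "customer_id": 17},
    {"transaction_id": 2, "customer_id": None},
    {"transaction_id": 3, "customer_id": 42},
    {"transaction_id": 4, "customer_id": 17},
]
rate = attribution_incompleteness(sample, "transaction_id", "customer_id")
# rate == 0.25 for this toy sample; the AC in the walkthrough observed ~3.2%.
```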
C.6 Sample queries¶
C.6.1 Simple read at customer grain¶
SELECT customer, customer_name, customer_segment
WHERE customer_segment = 'enterprise'
BY customer
ORDER BY customer_name
LIMIT 100
The framework selects the customers schema (only schema with all three columns). Direct read with filter and projection.
C.6.2 Total revenue by region for last quarter¶
SELECT region, SUM(revenue) AS total_revenue
WHERE quarter = 'Q1-2026'
BY region
ORDER BY total_revenue DESC
The framework's resolution: revenue family has siblings in transactions (transaction grain) and store_monthly_summary (store-month grain). Both pass the four-rule filter; both are siblings (same family-root). MTI applies. The framework picks store_monthly_summary for cost reasons (smaller table, pre-aggregated). Navigation: (store, month) → (region, quarter) via FD-DAG. Result: total revenue per region for Q1-2026.
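The re-aggregation in this resolution can be pictured concretely. A minimal sketch in plain Python (illustrative data and hand-built FD mappings, not Frame-QL internals) rolling `(store, month)` revenue up to `(region, quarter)`:

```python
from collections import defaultdict

# Illustrative FD mappings (store -> region, month -> quarter).
store_region = {1: "West", 2: "West", 3: "East"}
month_quarter = {"2026-01": "Q1-2026", "2026-02": "Q1-2026", "2026-03": "Q1-2026"}

# Pre-aggregated store_monthly_summary rows: (store, month, revenue).
rows = [
    (1, "2026-01", 100.0), (1, "2026-02", 110.0),
    (2, "2026-01", 90.0),  (3, "2026-03", 50.0),
]

# SUM is partition-invariant: summing the pre-aggregated partitions
# equals summing the underlying transactions, so navigating the
# FD-DAG from (store, month) to (region, quarter) is safe.
totals = defaultdict(float)
for store, month, revenue in rows:
    totals[(store_region[store], month_quarter[month])] += revenue
```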
C.6.3 Peak revenue per region per quarter¶
SELECT region, quarter, MAX(peak_revenue) AS peak
WHERE year = 2026
BY (region, quarter)
ORDER BY peak DESC
The framework's resolution: the peak_revenue family exists in store_monthly_summary only. The MAX aggregation navigates from (store, month) to (region, quarter) via the FD-DAG. MAX is partition-invariant; the family has an ip_reducer (also MAX); cross-anchor navigation works.
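Why MAX survives this navigation: the MAX of per-partition MAXes equals the MAX over the unpartitioned rows. A quick plain-Python illustration (made-up numbers):

```python
from collections import defaultdict

# Underlying daily peaks: (store, month, day) -> revenue (illustrative).
daily = {
    ("s1", "2026-01", 3): 120.0, ("s1", "2026-01", 9): 80.0,
    ("s1", "2026-02", 1): 140.0,
    ("s2", "2026-01", 7): 200.0, ("s2", "2026-03", 2): 90.0,
}

# Pre-aggregate: MAX per (store, month) partition, as the summary table does.
per_partition = defaultdict(float)
for (store, month, _day), v in daily.items():
    per_partition[(store, month)] = max(per_partition[(store, month)], v)

# MAX of the partition MAXes equals MAX over all underlying rows.
# This equality is what partition-invariance means, and it is what
# lets the framework re-aggregate pre-aggregated MAXes across anchors.
assert max(per_partition.values()) == max(daily.values())
```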
C.6.4 Multi-input ratio¶
SELECT region,
SUM(revenue) / SUM(units_sold) AS revenue_per_unit
WHERE year = 2026
BY region
HAVING revenue_per_unit > 50
The framework: SUM(revenue) and SUM(units_sold) computed at region grain, then divided. Both metrics have transaction-grain roots; navigation to region works. HAVING applied after aggregation.
C.6.5 Customer count and average¶
SELECT region,
SUM(revenue) AS total,
customer_count AS customers,
total / customers AS avg_per_customer
FROM transactions, stores
WHERE year = 2026
BY region
This query has an issue: customer_count has op: COUNT_DISTINCT, which is not partition-invariant. The family is anchor-locked at (store, month). The framework cannot navigate it to region grain.
The four-rule filter rejects customer_count for the (region) anchor. Resolution error:
Resolution error: cannot serve 'customer_count' at grain (region).
The family is anchor-locked (rooted at non-partition-invariant
operator COUNT_DISTINCT); cross-anchor navigation is not available.
To compute distinct customers per region, use COUNT_DISTINCT(customer)
directly: SUM(revenue) / COUNT_DISTINCT(customer) AS avg_per_customer.
The engineer rewrites:
SELECT region,
SUM(revenue) / COUNT_DISTINCT(customer) AS avg_per_customer
WHERE year = 2026
BY region
This works: COUNT_DISTINCT(customer) computed directly at the query's region grain, drawing customer values from transactions. The AC's customer_count metric is anchor-locked but the underlying ad-hoc COUNT_DISTINCT is fine.
This pattern — anchor-locked families surface limits clearly; ad-hoc reducers in the query express what the engineer wants — is intentional. The AC's first-class metrics are anchor-stable; cross-anchor distinct-counting requires an explicit ad-hoc formulation in each query.
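The reason COUNT_DISTINCT is anchor-locked is visible in three lines of arithmetic. A plain-Python illustration (made-up customer sets) of why per-store distinct counts cannot be re-aggregated to a region total:

```python
# Customers seen at two stores in the same region (illustrative).
store_a_customers = {101, 102, 103}
store_b_customers = {102, 103, 104}

# Summing per-store distinct counts double-counts shared customers...
summed = len(store_a_customers) + len(store_b_customers)

# ...while the true region-level distinct count needs the raw customer
# values. That is why the query must use an ad-hoc COUNT_DISTINCT(customer)
# at the target grain instead of the anchor-locked customer_count metric.
true_distinct = len(store_a_customers | store_b_customers)

assert summed != true_distinct
```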
C.7 A cousin disambiguation example¶
Suppose the retail organization has another summary table — historical_quarterly_summary — populated by a separate ETL process from a different transactional source (a legacy system being phased out). The engineer adds it to the AC for backward-compatibility queries:
- virtual_table:
schema_name: historical_quarterly_summary
source:
backend: warehouse_main
physical_table: historical_quarterly
column_specs:
- column_spec:
src_name: store_id
name: store
data_type: integer
E: [store]
- column_spec:
src_name: quarter
name: quarter
data_type: string
E: [quarter]
- column_spec:
src_name: revenue_legacy
name: revenue
data_type: numeric
E: [store, quarter]
M:
signature: MCAR
determinants: []
op: SUM
dna:
name: revenue
E: [store, quarter]
op: SUM
The legacy source's revenue is observationally rooted at (store, quarter) — it's not derived from the transactions table; it comes from a separate observation. Its DNA is self-referential (root).
After AC-load, the framework's metric genealogy shows:
- Family `revenue` has two family-roots:
  - Root at `transactions` schema, anchor `[transaction]`, `op: SUM`.
  - Root at `historical_quarterly_summary`, anchor `[store, quarter]`, `op: SUM`.
The two revenue columns are cousins: same family-name, different family-roots.
A query referencing bare revenue (for example, SELECT region, SUM(revenue) AS total BY region) is refused with a dubious-query diagnostic:
DUBIOUS: query references 'revenue' which has multiple resolutions in
this AC:
- revenue family-root in transactions schema (E=[transaction], op=SUM)
- revenue family-root in historical_quarterly_summary schema
(E=[store, quarter], op=SUM)
These are cousins — same family-name, different family-roots — and
produce different results because the underlying observations differ.
Disambiguate via:
- Qualified reference: transactions.revenue or
historical_quarterly_summary.revenue
- Explicit FROM clause: FROM transactions, ... or
FROM historical_quarterly_summary, ...
The engineer disambiguates with a qualified reference (for example, SELECT region, SUM(transactions.revenue) AS total BY region), and the query resolves cleanly. The framework selects schemas serving the transactions-rooted revenue family (siblings in transactions and store_monthly_summary).
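The dubious-query check in this section reduces to grouping columns by family-name and counting distinct family-roots. A hedged sketch in plain Python over a hand-built catalog (the field names and `resolutions` helper are illustrative, not the AC catalog's real structure):

```python
# Minimal stand-in for the AC catalog's metric genealogy.
columns = [
    {"name": "revenue", "schema": "transactions",
     "root": ("transactions", ("transaction",), "SUM")},
    {"name": "revenue", "schema": "store_monthly_summary",
     "root": ("transactions", ("transaction",), "SUM")},  # sibling: same root
    {"name": "revenue", "schema": "historical_quarterly_summary",
     "root": ("historical_quarterly_summary", ("store", "quarter"), "SUM")},
]

def resolutions(columns, family_name):
    """Distinct family-roots carrying this family-name."""
    return {c["root"] for c in columns if c["name"] == family_name}

# Two distinct roots -> cousins -> bare 'revenue' is DUBIOUS and the
# engineer must qualify the reference or constrain the FROM clause.
roots = resolutions(columns, "revenue")
ambiguous = len(roots) > 1
```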
C.8 An AI-assisted query through MCP¶
The retail organization has Claude integrated with the Coframe Core MCP server. An analyst asks Claude:
"What was peak weekly revenue in the West region last quarter?"
Claude's reasoning, against the MCP server:
- Look up the family vocabulary. Claude calls `list_families(retail_analytics_v1)` and sees `revenue`, `peak_revenue`, `units_sold`, `customer_count`.
- Identify the right family. "Peak revenue" maps to the `peak_revenue` family. Claude calls `describe_family` to confirm: family-root in `store_monthly_summary` at `(store, month)`, op MAX, partition-invariant true.
- Identify the right grain. "Last quarter" needs `quarter`. "West region" needs `region`. The query grain is `(region, quarter)`. The framework's FD-DAG reaches both: `store → region`, `month → quarter`. `peak_revenue`'s anchor `(store, month)` → `(region, quarter)` is reachable.
- Construct the query. "Peak weekly revenue" is ambiguous: there's no `peak_weekly_revenue` in the AC. The AC has `peak_revenue` (max of daily revenue). Claude picks the closest match and notes the assumption in its response.
- Construct Frame-QL:

  SELECT region, MAX(peak_revenue) AS peak_in_quarter
  WHERE region = 'West' AND quarter = 'Q1-2026'
  BY region

- Resolve and execute. Claude calls `resolve_query` first (verifies clean resolution), then `execute_query`. Result: `[{"region": "West", "peak_in_quarter": 285000.00}]`.
- Present to analyst. Claude responds:

"Peak revenue in the West region for Q1-2026 was $285,000. Note: this is the maximum daily revenue at any single store in the region during the quarter. The AC tracks peak daily revenue, not peak weekly — if you want peak weekly specifically, that would require a different aggregation we'd need to construct."
The exchange demonstrates the framework's family vocabulary as an LLM-friendly substrate: Claude reasons about families and anchors, constructs valid Frame-QL, and the framework verifies resolution. The analyst gets a correct answer with an appropriate caveat about peak_revenue's actual semantics.
C.9 Where to go next¶
For more on the workflow this example illustrates:
- Chapter 4 (AC Authoring Workflow) — the four-phase workflow with iteration cycle.
- Chapter 5 (schema.init Format) — the format used in §C.4.
- Chapter 7 (Data Quality and Structural Verification) — the DQ process producing §C.5's output.
- Chapter 8 (Frame-QL) — the query language used in §C.6.
- Chapter 9 (Query Resolution) — how the four-rule filter selects schemas.
- Chapter 11 (The MCP Server) — the agent integration shown in §C.8.
For a practitioner-oriented narrative covering similar territory, see the companion article Coframe: A Grammar-Layer Substrate for AI-Native Analytics.
Appendix D: Performance and Scaling Guidance¶
Expected performance characteristics for Coframe Core deployments at varying data scales, configuration knobs that affect performance, and operational guidance.
This appendix translates the Platform Design's quantitative performance targets into practitioner-oriented expectations. It addresses common questions: how large can my AC be? how long will DQ take? what should query latency look like? what happens at scale?
D.1 Scope and assumptions¶
The performance characteristics below assume:
- Reference hardware: a modern multi-core machine (16+ cores, 64+ GB RAM, NVMe SSD storage). Cloud-equivalent: AWS m6i.4xlarge, GCP n2-standard-16, or similar.
- Reference data: the Manual's running example (the retail AC) at varying scales — 10K, 1M, 50M, 500M, 5B rows in the transactions fact table; 2-10 schemas; a single-digit number of AC-dimensions and AC-attributes per schema.
- Reference backend: DuckDB or Polars on local files. Performance characteristics for other backends depend on the backend's own performance properties; the framework's overhead is roughly the same.
These are expected characteristics, not guaranteed SLAs. Real-world deployments will vary based on data shape, query mix, hardware, and backend specifics.
D.2 AC loading¶
AC loading consists of three activities: parsing schema.init, building the AC catalog, and loading the cached DQ deliverable. Loading does not re-run DQ — that's a separate activity.
| AC size | Expected loading time |
|---|---|
| Small (1-3 schemas, 20-50 columns total) | Under 100 ms |
| Medium (5-10 schemas, 100-200 columns) | Under 500 ms |
| Large (20+ schemas, 500+ columns) | Under 2 seconds |
The dominant cost in loading is parsing the AC catalog and constructing the FD-DAG and family genealogy. Both are O(columns) operations with a small constant. Loading is fast enough to be invisible in interactive usage.
D.3 DQ runtime¶
DQ Phase 1 (metadata-only verification) is fast: under 1 second for any AC size, because it operates on the parsed schema.init plus minimal backend introspection.
DQ Phase 2 (quasi-metadata fetch) runtime is dominated by backend I/O: how long it takes to count rows, fetch distinct values, fetch column statistics. For a backend with table statistics already cached (DuckDB after a SUMMARIZE, Polars after a lazy frame is materialized), Phase 2 completes in seconds. For a cold backend, Phase 2 may take 10-60 seconds depending on table sizes.
DQ Phase 3 (quasi-metadata-derived verification) runtime depends on what's being verified:
- FD attestation is O(rows × FD-edges). For the retail AC with 50M rows and ~20 declared FD-edges, expect 30-90 seconds end-to-end on the reference backend.
- Per-DNA-edge value attestation is O(rows × attestable-edges). For 50M rows with ~8 attestable edges, expect 60-180 seconds. The Platform Design targets full attestation under 5 minutes for 50M-row AC; sampled attestation under 30 seconds.
- Coverage map generation is O(rows). Bounded by table scan time.
For very large fact tables (500M+ rows), full attestation becomes operationally expensive. Coframe Core supports stratified sampling (attestation.sampling_threshold_rows, default 1e8); above the threshold, attestation samples the predecessor table rather than reading it in full. Sampled attestation produces an attestation result with a confidence annotation.
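The value-comparison step of attestation is straightforward to picture. A hedged sketch in plain Python (the function names and sampling strategy are illustrative, not the framework's internals) of checking a summary value against a recomputed value under the default relative tolerance, with the sampling fallback above the row threshold:

```python
import random

def attests(declared_value, recomputed_value, tolerance=0.001):
    """Relative-tolerance comparison per DNA-edge (illustrative sketch)."""
    if recomputed_value == 0:
        return declared_value == 0
    return abs(declared_value - recomputed_value) / abs(recomputed_value) <= tolerance

def recompute_sum(rows, sampling_threshold=100_000_000, seed=0):
    """Full recomputation below the threshold; scaled sample estimate above."""
    if len(rows) <= sampling_threshold:
        return sum(rows), "full"
    rng = random.Random(seed)
    sample = rng.sample(rows, sampling_threshold)
    return sum(sample) * len(rows) / len(sample), "sampled"

# Toy predecessor table: 1,000 rows of 10.0 -> true SUM is 10,000.
rows = [10.0] * 1_000
value, mode = recompute_sum(rows)
ok = attests(10_000.0, value)  # within the default 0.1% tolerance
```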
For 500M+ row scenarios where full attestation matters analytically, schedule it during off-peak hours (overnight) and use sampled attestation on demand. Coframe Pro's incremental attestation re-runs only edges affected by source changes — substantially better for very large ACs but a Pro-only feature in v1.0.
D.4 Query resolution and execution¶
Query resolution (parse + four-rule filter + plan, no execution) is fast: under 50 ms P50 for the retail AC reference suite, regardless of underlying data size. Resolution operates on the AC catalog (an in-memory structure) plus the query AST.
Query execution time is dominated by backend execution. The framework's overhead — the resolution cost above plus result annotation propagation — is typically under 100 ms. For interactive queries against in-cache data, end-to-end latency (parse → resolve → execute → return) is under 200 ms P50.
For ad-hoc queries against large fact tables, end-to-end latency is dominated by the backend's scan or aggregation cost. The framework does not introduce material overhead beyond the backend's own query cost.
D.5 MCP server performance¶
The MCP server is stateless beyond AC metadata caches. Per-request overhead is small:
- AC metadata operations (`list_families`, `describe_family`, `get_dimension_values`): under 50 ms; the AC catalog is in memory.
- Query resolution operations (`resolve_query`): the same as the underlying resolution cost (under 50 ms).
- Query execution operations (`execute_query`, `nl_query`): dominated by the underlying query execution time; the MCP layer adds under 100 ms.
For concurrent load, the server scales by adding instances behind a load balancer. Each instance maintains its own AC catalog cache; the catalogs are read-only at runtime. The bottleneck under high concurrent load is the backend, not the MCP server itself.
D.6 Memory footprint¶
Memory consumption has three components:
- AC catalog: O(columns × constant). For a large AC with 500 columns, the catalog occupies tens of MB. Negligible compared to backend memory.
- DQ deliverable cache: bounded by the size of fetched quasi-metadata. For a 50M-row AC with 20 dimensions and 5 metrics, expect 10-100 MB cached.
- Backend execution memory: depends on the backend. DuckDB and Polars use their own memory budgets; Coframe Core does not allocate substantial memory beyond passing data through.
For most deployments, the AC catalog and DQ cache are not the constraining resource; the backend is. Plan memory based on the backend's own memory model.
D.7 Scaling characteristics¶
Coframe Core scales with three dimensions:
AC size (number of schemas, columns, FD-edges). Linear in catalog construction; linear in DQ Phase 1 verification. A 10x larger AC takes roughly 10x longer to load and verify metadata. Both remain operationally fast.
Data size (row counts in fact tables). Linear in DQ Phase 3 attestation; linear in query execution at the backend. A 10x larger fact table takes roughly 10x longer to attest and 10x longer to query (modulo backend optimizations like indexing or partitioning).
Concurrent query load. The framework itself is stateless; concurrent load scales horizontally with MCP server instances. The backend determines the practical concurrency ceiling.
D.8 Configuration knobs¶
Operationally important configuration knobs in Coframe Core:
- `attestation.enabled` (default `true`): whether per-DNA-edge value attestation runs during DQ Phase 3. Disabling reduces DQ runtime but caps the AC at AA verification level.
- `attestation.sampling_threshold_rows` (default `1e8`): row count above which attestation falls back to stratified sampling.
- `attestation.tolerance` (default `0.001`): relative tolerance for value comparison during attestation. Tighter tolerances catch smaller drift but may surface noise from floating-point operations.
- `attestation.fail_fast` (default `false`): whether to halt DQ on first attestation failure or collect all failures.
- `coverage.precompute` (default `true`): whether coverage maps are precomputed during DQ Phase 2 or computed on demand.
- `mcp.cache_ttl` (default 60 seconds): how long the MCP server caches AC catalog metadata before re-fetching.
The full set of configuration options is in the coframe-core package documentation.
D.9 Operational guidance¶
For interactive query workloads (analysts and AI agents querying ACs in real time): keep AC sizes moderate (< 500 columns), use cached AC loading, expect sub-second query latency. Run DQ on a daily or per-data-pipeline cadence.
For periodic analytical workloads (scheduled reports, batch agent queries): AC size can be larger (1000+ columns); query latency is less sensitive. DQ on a per-pipeline-completion cadence is appropriate.
For very large data (500M+ row fact tables): use sampled attestation on demand, full attestation overnight. Consider Coframe Pro for incremental attestation if very-large ACs are central to the workload.
For multi-tenant deployments (multiple ACs per server): scale MCP server instances horizontally; the framework's per-instance memory is small, so one instance can serve many ACs concurrently.
For development workflows: schema.init and the DQ deliverable both belong in source control. Run DQ in CI on schema.init changes; cache DQ deliverables across CI runs to keep CI fast.
D.10 What this appendix is not¶
- A benchmark report. The numbers above are expected ranges, not measured benchmarks. Phase 8 of the Platform Design includes a benchmarking commitment; published benchmarks will accompany the v1.0 release.
- A configuration reference. The full configuration surface lives in the `coframe-core` package documentation.
- A performance-troubleshooting guide. When DQ or queries are slower than expected, see the package documentation's troubleshooting section.
End of Coframe Core Manual, v1.0.