DataLex Architecture
1. System overview
DataLex is a Git-native data modeling platform with three runtime surfaces:
- CLI (datalex): validation, dbt sync, DDL emission, diff, package resolution, layout migration.
- Core engine (packages/core_engine): deterministic loader, dialect plugins, dbt integration, cross-repo packages.
- Web UI (packages/web-app): visual studio for editing the same YAML the CLI reads.
The authoritative source of truth is a DataLex project tree — one YAML
file per object, dispatched by kind:. See
datalex-layout.md for the reference.
2. Core engine modules (datalex_core)
2.1 DataLex loader (datalex_core/datalex/)
- loader.py: streaming, kind:-dispatched walker. Reads one file at a time; does not materialize the whole project in memory. Source-located errors (file, line, column, suggested_fix).
- project.py: DataLexProject dataclass: entities, sources, models, terms, domains, policies, snippets, imports. Resolves snippets at load time.
- parse_cache.py: content-addressed on-disk cache (build/.cache/*.json or ~/.datalex/cache/), keyed by sha256(content) + schema_hash. Warm loads skip re-parsing unchanged files.
- migrate_layout.py: one-shot migrator from legacy *.model.yaml to the DataLex tree. Invoked via datalex datalex migrate to-datalex-layout.
- diff.py: semantic diff with explicit previous_name: rename detection; breaking-change classification.
- errors.py: source-positioned diagnostics with to_dict() for --output-json.
- types.py: type palette + composite type parser (array<T>, map<K,V>, struct<...>).
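The content-addressed keying described for parse_cache.py can be illustrated with a minimal sketch. It assumes the key is the SHA-256 of the file bytes joined with the schema hash and that each entry is a JSON file in a cache directory. The names cache_key and load_with_cache, and the parse callback, are hypothetical and not the module's actual API.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable


def cache_key(content: bytes, schema_hash: str) -> str:
    """Key = sha256(content) + schema_hash, so a schema change invalidates old entries."""
    return f"{hashlib.sha256(content).hexdigest()}-{schema_hash}"


def load_with_cache(
    path: Path,
    schema_hash: str,
    cache_dir: Path,
    parse: Callable[[bytes], dict],
) -> dict:
    """Return the parsed document, re-parsing only when content or schema changed."""
    content = path.read_bytes()
    entry = cache_dir / f"{cache_key(content, schema_hash)}.json"
    if entry.exists():
        return json.loads(entry.read_text())  # warm load: skip the parser
    parsed = parse(content)
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry.write_text(json.dumps(parsed))
    return parsed
```

Because the key depends only on file content and schema version, renaming or touching a file does not invalidate its entry, while any edit to the bytes does.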
2.2 Dialect registry (datalex_core/dialects/)
- base.py: DialectPlugin protocol (render_type, render_entity, …).
- registry.py: register() / get_dialect() / known_dialects().
- postgres.py, snowflake.py: shipped today; plugin shape means new dialects are a self-contained module, not an edit to a monolith.
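To show why a new dialect can be a self-contained module, here is a rough sketch of a protocol-plus-registry arrangement. The function names register, get_dialect, and known_dialects come from the list above, but their signatures, the name attribute, and the type mapping are assumptions made for illustration.

```python
from typing import Dict, List, Protocol


class DialectPlugin(Protocol):
    """Assumed shape; the real protocol also declares render_entity and more."""

    name: str

    def render_type(self, logical_type: str) -> str: ...


_REGISTRY: Dict[str, DialectPlugin] = {}


def register(plugin: DialectPlugin) -> None:
    """Make a dialect available under its name."""
    _REGISTRY[plugin.name] = plugin


def known_dialects() -> List[str]:
    return sorted(_REGISTRY)


def get_dialect(name: str) -> DialectPlugin:
    try:
        return _REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown dialect {name!r}; known: {known_dialects()}") from None


class PostgresDialect:
    """Illustrative plugin; the type mapping here is made up for the example."""

    name = "postgres"
    _TYPES = {"string": "text", "integer": "bigint", "boolean": "boolean"}

    def render_type(self, logical_type: str) -> str:
        return self._TYPES.get(logical_type, logical_type)


register(PostgresDialect())
print(get_dialect("postgres").render_type("string"))
```

Adding another dialect in this arrangement means writing one more class and one more register() call, with no change to the lookup code.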
2.3 dbt integration (datalex_core/dbt/)
- manifest.py: imports target/manifest.json into DataLex sources / models. Idempotent via meta.datalex.dbt.unique_id; user-authored fields merged, not overwritten.
- profiles.py: parses profiles.yml (with dbt's precedence: --profiles-dir → $DBT_PROFILES_DIR → <project>/profiles.yml → ~/.dbt/profiles.yml). Resolves relative DuckDB paths against the dbt project dir.
- warehouse.py: narrow per-table introspection (not full schema discovery). Supports duckdb and postgres today; other dialects fall back to the full connector in §2.5.
- sync.py: orchestrator behind datalex datalex dbt sync. Merge policy: warehouse owns type + nullable; manifest/user own everything else.
- emit.py: emits sources.yml + models/_schema.yml with contract.enforced: true and data_type: on every column.
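The profiles.yml lookup order listed for profiles.py could be sketched as follows. The function name find_profiles_yml and its parameters are hypothetical, and falling through to the next location when a file is missing is an assumption of this sketch rather than documented behavior.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def find_profiles_yml(
    project_dir: Path,
    profiles_dir: Optional[Path] = None,
    env: Mapping[str, str] = os.environ,
    home: Optional[Path] = None,
) -> Optional[Path]:
    """Return the first existing profiles.yml in precedence order, or None."""
    candidates: List[Path] = []
    if profiles_dir is not None:                        # 1. --profiles-dir
        candidates.append(Path(profiles_dir))
    if env.get("DBT_PROFILES_DIR"):                     # 2. $DBT_PROFILES_DIR
        candidates.append(Path(env["DBT_PROFILES_DIR"]))
    candidates.append(Path(project_dir))                # 3. <project>/profiles.yml
    candidates.append((home or Path.home()) / ".dbt")   # 4. ~/.dbt/profiles.yml
    for directory in candidates:
        candidate = directory / "profiles.yml"
        if candidate.is_file():
            return candidate
    return None
```

Passing env and home as parameters keeps the lookup testable without touching the real environment or home directory.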
2.4 Cross-repo packages (datalex_core/packages.py)
- ImportSpec.from_dict: parses imports: entries (org/name@version, git: + ref:, or path:).
- resolve_imports: fetches each package (shallow git clone or local copy), hashes contents, writes .datalex/lock.yaml.
- load_imports_for: consumes the lockfile; errors on content_hash drift. Imported entities namespaced under @alias.entity_name.
- Cache root: ~/.datalex/packages/ (override via --cache-root or DATALEX_CACHE_ROOT).
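The drift check can be pictured with a small sketch. It assumes the content hash is a SHA-256 over every file's relative path and bytes in sorted order; the actual hashing scheme is not specified here, and content_hash, verify_locked, and LockDriftError are hypothetical names.

```python
import hashlib
from pathlib import Path


def content_hash(package_dir: Path) -> str:
    """Hash relative paths and file bytes in sorted order so the result is stable."""
    digest = hashlib.sha256()
    for path in sorted(p for p in package_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(package_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


class LockDriftError(Exception):
    """Raised when a package no longer matches the hash pinned in the lockfile."""


def verify_locked(package_dir: Path, locked_hash: str) -> None:
    """Fail hard on drift instead of warning."""
    actual = content_hash(package_dir)
    if actual != locked_hash:
        raise LockDriftError(
            f"{package_dir}: content_hash drift "
            f"(locked {locked_hash[:12]}, found {actual[:12]})"
        )
```

Hashing the relative path along with the bytes means a renamed file counts as drift even when its content is unchanged.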
2.5 Database connectors (datalex_core/connectors/)
Full-schema introspection for reverse engineering (distinct from the
narrow dbt/warehouse.py):
- PostgreSQL, MySQL, Snowflake, BigQuery, Databricks, SQL Server, Azure SQL, Redshift.
- BaseConnector ABC, ConnectorConfig dataclass, ConnectorResult with driver check + include/exclude filters.
- Used by legacy datalex pull <connector> and by dbt sync as a fallback when the narrow path doesn't support a dialect.
2.6 Legacy importers and emitters (datalex_core/)
These predate DataLex but remain wired in for reverse-engineering tasks:
- importers.py: SQL DDL, DBML, JSON Schema / OpenAPI, Spark schema, dbt manifest (the legacy path; datalex_core/dbt/manifest.py is the current one).
- generators.py: DDL emission; migrated into the dialect registry for Postgres and Snowflake, retained for other dialects during rollout.
- docs_generator.py: HTML / Markdown data dictionary.
- policy.py: policy rule evaluator (10 rule types: naming conventions, required fields, SLA, deprecation checks, custom expressions).
3. CLI surface (packages/cli)
- datalex_cli.py: registers the DataLex-spec subcommand tree under datalex datalex …: migrate, validate, info, emit ddl, diff, expand, dbt sync, dbt emit, dbt import, packages resolve, packages list. (The doubled word is transitional; the group name will be flattened in a follow-up.)
- Legacy flat commands (datalex validate, datalex pull, datalex generate sql, datalex doctor, datalex watch, datalex apply, datalex migrate) still exist; see archive/yaml-spec-v2.md for their semantics if you're on a legacy project.
See cli.md for the current cheat sheet.
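The doubled group name is easier to picture as a nested subcommand tree. The sketch below uses argparse purely for illustration; the framework the real CLI uses is not stated here, and only a few of the commands are wired up.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="datalex")
    top = parser.add_subparsers(dest="group", required=True)

    # Transitional group: commands are reached as `datalex datalex <command>`.
    spec = top.add_parser("datalex")
    commands = spec.add_subparsers(dest="command", required=True)
    commands.add_parser("validate")

    # Nested group: `datalex datalex dbt sync|emit|import`.
    dbt = commands.add_parser("dbt")
    dbt_commands = dbt.add_subparsers(dest="dbt_command", required=True)
    for name in ("sync", "emit", "import"):
        dbt_commands.add_parser(name)
    return parser


# Equivalent to running: datalex datalex dbt sync
args = build_parser().parse_args(["datalex", "dbt", "sync"])
print(args.group, args.command, args.dbt_command)
```

Flattening the group later would amount to attaching the inner subparsers directly to the top-level parser.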
4. Web UI (packages/web-app + packages/api-server)
- React + React Flow studio reading/writing the DataLex tree through the Node API server.
- Features: subject-area grouping, dark mode, schema-aware YAML autocomplete with inline lint, virtualized rendering for 1000+ entities, diagram export (PNG/SVG), global search, keyboard shortcuts.
- The UI has no database of its own — everything is filesystem + Git.
5. End-to-end flow: dbt sync path
┌──────────────────┐   1. manifest.json     ┌──────────────────┐
│   dbt project    │ ─────────────────────▶ │ datalex_core.dbt │
│   target/        │   2. profiles.yml      │   .manifest      │
│   dbt_project.yml│ ─────────────────────▶ │   .profiles      │
│   profiles.yml   │                        │   .warehouse     │
└────────┬─────────┘                        │   .sync          │
         │                                  └────────┬─────────┘
         │ 3. information_schema query               │
         │    (per table, per profile target)        │
         ▼                                           │
┌──────────────────┐                                 │
│   Warehouse      │ ◀───────────────────────────────┘
│ (duckdb/postgres)│
└──────────────────┘                                 ▼
                                          ┌──────────────────────┐
                                          │   DataLex YAML tree  │
                                          │   sources/*.yaml     │
                                          │   models/dbt/*.yaml  │
                                          │   (unique_id stamped)│
                                          └──────────┬───────────┘
                                                     │ 4. datalex datalex dbt emit
                                                     ▼
                                          ┌──────────────────────┐
                                          │   dbt YAML out       │
                                          │   sources/*.yml      │
                                          │   models/_schema.yml │
                                          │   (contracts on)     │
                                          └──────────────────────┘
Full walkthrough: tutorial-dbt-sync.md.
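The merge step in this flow follows the policy from §2.3: the warehouse owns type and nullable, while the manifest and the user own everything else. A minimal sketch, assuming columns are plain dictionaries; merge_column and the sample values are hypothetical.

```python
from typing import Any, Dict

# Keys the warehouse is allowed to overwrite on every sync.
WAREHOUSE_OWNED = ("type", "nullable")


def merge_column(existing: Dict[str, Any], warehouse: Dict[str, Any]) -> Dict[str, Any]:
    """Overlay warehouse-owned keys onto a column; keep everything else as authored."""
    merged = dict(existing)
    for key in WAREHOUSE_OWNED:
        if key in warehouse:
            merged[key] = warehouse[key]
    return merged


column = {"name": "order_id", "type": "string", "description": "Primary key", "tags": ["pii"]}
introspected = {"name": "order_id", "type": "bigint", "nullable": False}
print(merge_column(column, introspected))
```

Running the merge a second time with the same warehouse data changes nothing, which is the property a repeatable sync depends on.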
6. Repository layout
DataLex/
  packages/
    core_engine/src/datalex_core/
      datalex/              # loader, project, migrator, diff, parse cache
      dialects/             # dialect plugin registry (postgres, snowflake, …)
      dbt/                  # manifest, profiles, warehouse, sync, emit
      connectors/           # full-schema introspection per warehouse
      _schemas/datalex/     # JSON Schema per kind: (bundled with the package)
      …                     # legacy importers/emitters/policy kept in parallel
    cli/src/datalex_cli/
      datalex_cli.py        # datalex datalex … subcommand tree
      main.py               # legacy flat commands
    api-server/             # Node.js: UI backend
    web-app/                # React Flow studio
  examples/jaffle_shop_demo # dbt sync demo (DuckDB, zero setup)
  model-examples/           # legacy scenario projects
  docs/                     # current docs (this file + tutorials/reference)
  docs/archive/             # pre-DataLex specs (kept for reference)
  tests/datalex/            # unittest suite for the DataLex surface
7. Design choices worth knowing
- File-per-entity, kind:-dispatched. Diffs stay small; the parser can stream; concurrent edits don't collide on a single 10K-line file.
- meta.datalex.* is emitter-owned. Anything else under meta: is yours and survives round-trip. This is the contract that makes dbt sync safe to re-run.
- Warehouse introspection is narrow on purpose. dbt sync only needs columns for named tables, not the full schema, so dbt/warehouse.py is a tight information_schema.columns query, not the heavier connectors/ path.
- Cross-repo packages are content-hashed. Lockfile drift is an error, not a warning. If your CI runs load_imports_for it will catch silent upstream changes.
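The meta ownership contract can be sketched in a few lines. This assumes meta is a plain dictionary and that the emitter rewrites only the datalex key; stamp_meta and the sample values are hypothetical.

```python
from typing import Any, Dict


def stamp_meta(user_meta: Dict[str, Any], emitter_meta: Dict[str, Any]) -> Dict[str, Any]:
    """Replace only meta['datalex']; every other key under meta passes through."""
    merged = {key: value for key, value in user_meta.items() if key != "datalex"}
    merged["datalex"] = emitter_meta
    return merged


meta = {"owner": "analytics", "datalex": {"dbt": {"unique_id": "stale"}}}
stamped = stamp_meta(meta, {"dbt": {"unique_id": "model.jaffle_shop.orders"}})
print(stamped)
```

User-authored keys such as owner survive every run, and re-stamping with the same emitter data yields the same result.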
8. Non-goals (for now)
- Multi-tenant / hosted SaaS. Everything is local filesystem + Git.
- SSO / OIDC / SAML / RBAC.
- Write-path to live warehouses (no datalex apply auto-run in prod).
These remain options for a future enterprise phase; the current tool is shaped for individual dbt users and teams who want their models in Git.