Getting started with DataLex¶
Pick the path that matches what you have in hand. Every path finishes with a reviewable YAML tree on disk and a live ER diagram in the browser. The normal path needs no Docker, no second terminal, and no config files to hand-edit; Docker is available as an isolated fallback.
DataLex 1.4 highlights: doc-block round-trip, custom policy packs, snapshots/exposures/unit-tests panels, contract enforcement, Atlan/DataHub/OpenMetadata catalog export, the datalex readiness-gate CI gate, and two AI agents that propose entities + canonical layers from your staging models. See CHANGELOG.md for the full list.
60-second install¶
pip install 'datalex-cli[serve]' # CLI + bundled Node runtime
datalex serve # opens http://localhost:3030
That's it for most machines. The [serve] extra pulls a portable Node
runtime so you do not need to install Node separately. If you already
have Node 20+ on your PATH, plain pip install datalex-cli works too.
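If you are unsure which install variant applies to your machine, a small preflight check can decide it. The sketch below assumes that node --version prints a string such as v20.11.0 (the format Node itself uses); the helper names parse_node_major and node_is_new_enough are illustrative and not part of DataLex.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_node_major(version_text: str) -> Optional[int]:
    """Return the major version from output such as 'v20.11.0', or None."""
    match = re.match(r"v?(\d+)\.", version_text.strip())
    return int(match.group(1)) if match else None


def node_is_new_enough(minimum_major: int = 20) -> bool:
    """True when a node binary on PATH reports at least minimum_major."""
    node_path = shutil.which("node")
    if node_path is None:
        return False  # no system Node: the [serve] extra bundles one
    result = subprocess.run(
        [node_path, "--version"], capture_output=True, text=True, check=False
    )
    major = parse_node_major(result.stdout)
    return major is not None and major >= minimum_major


if __name__ == "__main__":
    if node_is_new_enough():
        print("Node 20+ found: plain 'pip install datalex-cli' should work")
    else:
        print("No suitable Node: use pip install 'datalex-cli[serve]'")
```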
Want your own warehouse drivers? Add a connector extra:
pip install 'datalex-cli[serve,postgres]' # or snowflake, bigquery, databricks, …
pip install 'datalex-cli[serve,all]' # every driver + Node
Verify the installed package before opening a real repo:
datalex --version # 1.4.1+
If startup fails with
ERR_MODULE_NOT_FOUND ... datalex_core/_server/ai/providerMeta.js,
upgrade to datalex-cli 1.4.0 or newer.
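The same version check can be scripted, for example in a bootstrap script. This is a minimal sketch that reads the installed distribution version through the standard library rather than parsing CLI output; the distribution name datalex-cli comes from the install command above, while parse_release and installed_meets are hypothetical helper names. Pre-release suffixes such as rc1 are ignored, which is a simplification.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def parse_release(version_text: str) -> tuple:
    """Turn '1.4.1' into (1, 4, 1); stops at the first non-numeric piece."""
    parts = []
    for piece in version_text.strip().split("."):
        leading = re.match(r"\d+", piece)
        if leading is None:
            break
        parts.append(int(leading.group()))
    return tuple(parts)


def installed_meets(minimum: str = "1.4.1", package: str = "datalex-cli") -> bool:
    """True when the installed package is at least the minimum version."""
    try:
        installed = version(package)
    except PackageNotFoundError:
        return False  # not installed in this environment
    return parse_release(installed) >= parse_release(minimum)


if __name__ == "__main__":
    print("datalex-cli >= 1.4.1:", installed_meets())
```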
What you'll see when the app opens (1.4.1)¶
A six-step Onboarding Journey panel slides in from the right:
- Welcome to DataLex: two-line value prop · click Let's go
- Connect your project: opens the Import dialog. Use a Git URL (e.g. https://github.com/duckcode-ai/jaffle-shop-DataLex for the demo) or paste an absolute path to a dbt folder.
- See what's missing: opens the Validation drawer; click any red file to view readiness gaps (these mirror what CI scores).
- Design your first business domain: + opens the New Logical Entity dialog. Start with one concept like Customer or Order.
- Add your AI provider: opens Settings → AI; paste an OpenAI / Anthropic key (or pick the local provider, which needs no key).
- Ask AI to draw a diagram: one click runs the Conceptualizer against your staging models and proposes entities + relationships.
Each step auto-completes when its underlying action succeeds. Close the panel anytime — the floating "Onboarding · n/6" pill resumes you where you left off. Replay from Settings → Replay onboarding; the full 13-step spotlight tour is still available under Settings → Deep feature tour.
Docker fallback¶
Docker is optional. Use it when you want a fully isolated install path or your local Python/Node versions are getting in the way.
git clone https://github.com/duckcode-ai/DataLex.git
cd DataLex
docker build -t datalex:local .
docker run --rm -p 3030:3001 datalex:local
Open http://localhost:3030.
For an existing dbt repo:
cd ~/path/to/your-dbt-project
docker run --rm -p 3030:3001 \
-v "$PWD":/workspace \
-e REPO_ROOT=/workspace \
-e DM_CLI=/app/datalex \
datalex:local
In the UI, use /workspace as the dbt repository path.
Pick your path¶
| You have... | Start here | Time |
|---|---|---|
| Nothing — want to try with a canonical dbt repo | Scenario 1 — clone jaffle-shop | 5 min |
| An existing dbt project on disk | Scenario 2 — your local dbt repo | 5 min |
| A dbt repo on GitHub you want to try | Scenario 3 — a git URL | 4 min |
| A live warehouse, no dbt yet | Scenario 4 — warehouse pull | 7 min |
| A dbt repo + GitHub Actions you want gated | Scenario 5 — wire up CI | 5 min |
| CLI only, no UI | CLI dbt-sync tutorial | 5 min |
Scenario 1 — Clone jaffle-shop DataLex¶
The fastest way to see the full DataLex workflow is the dedicated
duckcode-ai/jaffle-shop-DataLex
repo. It extends jaffle-shop with DuckDB seeds, dbt staging and marts,
semantic models, DataLex conceptual/logical/physical diagrams, generated
SQL, Interface metadata, project-local modeling skills, and every 1.4
moat feature wired up:
- Doc-block references in stg_customers.yml + fct_orders.yml
- A custom policy pack at .datalex/policies/jaffle.policy.yaml
- Snapshot, exposure, and unit-test fixtures
- Glossary bindings ready for datalex emit catalog
- A GitHub Actions workflow that runs actions/datalex-gate
pip install 'datalex-cli[serve,duckdb]'
git clone https://github.com/duckcode-ai/jaffle-shop-DataLex ~/src/jaffle-shop-DataLex
cd ~/src/jaffle-shop-DataLex
make setup # creates .venv, installs dbt + datalex-cli >= 1.4.0
make seed # dbt seed
make build # dbt build → jaffle_shop.duckdb
make serve # datalex serve --project-dir .
Use Python 3.11 or 3.12 for this dbt example. Python 3.13+ currently
breaks in dbt's serializer stack; use the Docker fallback (make
docker-up) if you do not want to manage Python versions locally.
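A setup script can guard against an unsupported interpreter before make setup runs. The sketch below encodes only what the note above states (3.11 and 3.12 are recommended, 3.13+ breaks); treating older versions such as 3.10 as unsupported is an assumption, and python_is_supported is a hypothetical name.

```python
import sys

# Versions recommended for the jaffle-shop dbt example.
SUPPORTED_MINORS = {(3, 11), (3, 12)}


def python_is_supported(version_info=sys.version_info) -> bool:
    """True when the (major, minor) pair is one of the recommended versions."""
    return (version_info[0], version_info[1]) in SUPPORTED_MINORS


if __name__ == "__main__":
    if not python_is_supported():
        current = f"{sys.version_info[0]}.{sys.version_info[1]}"
        print(f"Python {current} is outside 3.11/3.12; consider make docker-up")
```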
Open the project in the UI and start with these files / panels:
- DataLex/commerce/Conceptual/commerce_concepts.diagram.yaml
- DataLex/commerce/Logical/commerce_logical.diagram.yaml
- DataLex/commerce/Physical/duckdb/commerce_physical.diagram.yaml
- models/marts/core/dim_customers.yml
- models/marts/core/fct_orders.yml
- Bottom drawer → Snapshots / Exposures / Unit Tests / Policy Packs tabs (new in 1.4)
Every UI edit lands in the clone, so git diff shows normal dbt and
DataLex YAML changes.
📖 Full walkthrough: tutorials/jaffle-shop-walkthrough.md
Scenario 2 — Your local dbt repo¶
This is the main event. You point DataLex at your existing dbt folder;
every UI edit round-trips back to the original .yml files on disk,
so your git history sees real diffs.
cd ~/path/to/your-dbt-project # folder containing dbt_project.yml
datalex serve --project-dir .
What you'll see in the startup log:
[datalex] registered project: your-dbt-project → /Users/…/your-dbt-project
[datalex] Starting DataLex server on http://localhost:3030
The browser opens with your folder already registered as the active project — no "Import" click needed to see the tree.
Next, import your dbt models once:
- Top bar → Import dbt repo
- Pick the Local folder tab
- Select your project root
- Leave ☑ Edit in place checked (default ON)
- Click Import
The importer shells out to dm dbt import in the background. For
projects with 200+ models, expect a few seconds. When it's done, the
Explorer shows every model file at its real dbt path.
Then run a readiness review — new in 1.4:
- Top bar → Run readiness review (or right-click any folder → Run dbt readiness review).
- Each YAML file gets a red / yellow / green badge in the Explorer.
- Click any badge → the Validation drawer shows the findings, rationale, suggested fix, and an Ask AI handoff.
DataLex also rebuilds the local AI modeling index automatically using
your dbt YAML, SQL files, target/manifest.json, target/catalog.json,
semantic manifest, validation findings, doc blocks, and DataLex files.
This is what lets Ask AI answer repo-wide questions instead of only
reading the open diagram. Doc-block bound descriptions
(description_ref: { doc: <name> }) are expanded in the AI index so
prompts that match the doc-block prose retrieve the bound columns.
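To make the doc-block expansion concrete, here is a rough sketch of the idea, not DataLex's actual indexer. It parses dbt's {% docs name %} … {% enddocs %} blocks from a Markdown string and substitutes the block body for any column that carries description_ref: { doc: <name> }. The function names and the dict shape used for a column are assumptions for illustration.

```python
import re

# Matches dbt doc blocks: {% docs name %} ... {% enddocs %}
DOCS_BLOCK = re.compile(
    r"\{%-?\s*docs\s+(\w+)\s*-?%\}(.*?)\{%-?\s*enddocs\s*-?%\}", re.DOTALL
)


def load_doc_blocks(markdown_text: str) -> dict:
    """Map each doc-block name to its stripped prose."""
    return {name: body.strip() for name, body in DOCS_BLOCK.findall(markdown_text)}


def expand_description(column: dict, doc_blocks: dict) -> str:
    """Prefer the bound doc-block prose; fall back to the inline description."""
    ref = column.get("description_ref")
    if isinstance(ref, dict) and ref.get("doc") in doc_blocks:
        return doc_blocks[ref["doc"]]
    return column.get("description", "")


if __name__ == "__main__":
    docs_md = "{% docs customer_id %}\nUnique id for a customer.\n{% enddocs %}"
    column = {"name": "customer_id", "description_ref": {"doc": "customer_id"}}
    print(expand_description(column, load_doc_blocks(docs_md)))
```

Indexing the expanded prose rather than the bare reference is what allows a prompt phrased like the doc block to match the bound column.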
Then build your first ER diagram:
- Create a diagram. Two paths:
  - Explorer toolbar → New Diagram (Layers icon). A new file appears at datalex/diagrams/untitled.diagram.yaml.
  - Right-click any folder in the Explorer → New diagram here… to land it next to the models it describes. Rename it to something meaningful like customer_360.diagram.yaml.
- Populate the canvas. Two paths, same result:
  - Canvas toolbar → Add Entities (or pane right-click → Add entities to diagram…). A picker opens with search, domain filter, and multi-select over every entity resolved from the model graph.
  - Drag a schema.yml or .model.yaml from the Explorer onto the canvas. Each referenced model renders as an entity.
- Foreign keys from dbt tests: - relationships: {to: "ref('…')"} become dashed edges automatically.
- Save All writes positions into the .diagram.yaml file, so git commit captures your layout.
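The way relationships tests turn into edges can be illustrated with a short sketch. It walks an already-parsed schema.yml (shown here as a plain dict so the example needs no YAML library) and collects one edge per relationships test. The function name and the tuple layout are assumptions; DataLex's own resolver may differ.

```python
import re

# Extracts the model name from a dbt ref('model') expression.
REF_PATTERN = re.compile(r"ref\(\s*['\"]([^'\"]+)['\"]\s*\)")


def relationship_edges(schema: dict) -> list:
    """Return (from_model, from_column, to_model, to_field) per relationships test."""
    edges = []
    for model in schema.get("models", []):
        for column in model.get("columns", []):
            for test in column.get("tests", []):
                # Plain string tests such as "not_null" carry no relationship.
                if not isinstance(test, dict) or "relationships" not in test:
                    continue
                spec = test["relationships"]
                match = REF_PATTERN.search(str(spec.get("to", "")))
                if match:
                    edges.append(
                        (model["name"], column["name"], match.group(1), spec.get("field", ""))
                    )
    return edges


if __name__ == "__main__":
    parsed = {
        "models": [{
            "name": "fct_orders",
            "columns": [{
                "name": "customer_id",
                "tests": [
                    "not_null",
                    {"relationships": {"to": "ref('dim_customers')", "field": "customer_id"}},
                ],
            }],
        }]
    }
    print(relationship_edges(parsed))
```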
Ask the AI to model for you¶
In the entity inspector empty state (no entity selected), two new 1.4 buttons surface deterministic agents:
- Conceptualize from staging clusters every staging-layer model into business entities + relationships and proposes a conceptual diagram. Domains are inferred from common nouns (customer → crm, order → sales, …).
- Canonicalize from staging detects columns that recur across staging models (same name, similar description) and lifts them into a logical canonical entity with shared {% docs %} blocks.
Both agents are deterministic — no API key required. They produce proposals through the existing review-and-apply flow, so nothing is written until you accept it.
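As a rough intuition for what such deterministic agents do, the sketch below implements two simplified heuristics: a noun-to-domain lookup and a detector for column names shared by several staging models. It is illustrative only. The two domain hints are the ones named above (the rest of the mapping is not documented here), the "similar description" comparison is omitted, and all function names are hypothetical.

```python
from collections import defaultdict

# Only the two hints named in the text; the full mapping is not shown there.
DOMAIN_HINTS = {"customer": "crm", "order": "sales"}


def infer_domain(model_name: str, default: str = "unassigned") -> str:
    """Pick a domain from the first hint noun found in the model name."""
    lowered = model_name.lower()
    for noun, domain in DOMAIN_HINTS.items():
        if noun in lowered:
            return domain
    return default


def recurring_columns(staging_models: dict, min_models: int = 2) -> dict:
    """Map each column name seen in at least min_models models to those models."""
    seen = defaultdict(set)
    for model_name, columns in staging_models.items():
        for column in columns:
            seen[column.lower()].add(model_name)
    return {
        column: sorted(models)
        for column, models in sorted(seen.items())
        if len(models) >= min_models
    }


if __name__ == "__main__":
    staging = {
        "stg_customers": ["customer_id", "first_name"],
        "stg_orders": ["order_id", "customer_id"],
    }
    print(recurring_columns(staging))
    print({name: infer_domain(name) for name in staging})
```

Because both functions are pure lookups over names, the same input always yields the same proposal, which is the property that lets the agents run without a model provider.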
Editing rules¶
- Every UI edit writes back to the original file. Rename a column, add a test, drag to create a foreign key: they all patch the .yml at its original path.
- Doc-block round-trip is preserved. When a column's description resolves to {{ doc("name") }}, DataLex stores description_ref: { doc: "name" } next to the rendered text. On re-emit the YAML keeps the Jinja reference, not the rendered string. AI proposals that try to overwrite a doc-block-bound description in YAML are rejected with DOC_BLOCK_OVERWRITE; propose a change to the .md file instead.
- No duplicate folders. DataLex doesn't create a shadow tree. Your ~/your-dbt-project/models/staging/stg_customers.yml is the one true source; we just read and patch it.
- Save All flushes every dirty buffer to disk. Writes return a structured { code, message, details? } envelope and a 207 Multi-Status response when some files fail.
- Rename / delete previews. Renaming or deleting a folder or file from the Explorer shows an impact preview first.
- Dangling relationships. Open the Validation panel: any relationships: entry pointing at a missing entity or column gets a red banner with a one-click Remove dangling action.
- Ask AI with reviewable YAML proposals. Use Ask AI from the right panel, canvas, Explorer context menu, or selected text. The agent retrieves doc-blocks, BM25 lexical context, validation findings, and skills before proposing changes. Click Review plan to inspect the proposal in the center editor before you apply.
- Team skills live in Git. The Skills tab writes Markdown skill files under DataLex/Skills/*.md.
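The doc-block rule in the list above can be pictured as a small guard that runs before a proposal is applied. The sketch below is an assumption about how such a check might look, not DataLex's implementation: the error code DOC_BLOCK_OVERWRITE and the { code, message, details? } envelope come from the rules above, while the function name, the column dict shape, and the message wording are invented for illustration.

```python
from typing import Optional


def check_description_change(column: dict, proposed_description: str) -> Optional[dict]:
    """Return an error envelope if the change would overwrite a bound description."""
    ref = column.get("description_ref")
    is_bound = isinstance(ref, dict) and "doc" in ref
    if is_bound and proposed_description != column.get("description"):
        return {
            "code": "DOC_BLOCK_OVERWRITE",
            "message": (
                f"Column '{column['name']}' is bound to doc block "
                f"'{ref['doc']}'; edit the .md file instead."
            ),
            "details": {"column": column["name"], "doc": ref["doc"]},
        }
    return None  # unbound column or unchanged text: safe to apply


if __name__ == "__main__":
    bound = {
        "name": "customer_id",
        "description": "Unique id for a customer.",
        "description_ref": {"doc": "customer_id"},
    }
    print(check_description_change(bound, "Primary key."))
```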
📖 Full walkthrough: tutorials/import-existing-dbt.md
Scenario 3 — A git URL¶
"Try this dbt repo before I clone it." Works for any public URL; use Scenario 2 for local round-trip.
datalex serve
In the UI: Import dbt repo → Git URL tab → paste
https://github.com/<org>/<repo> (optional ref: branch/tag/SHA) →
Import.
The api-server clones to $TMPDIR/datalex-dbt-<uuid>/, runs the
importer, and hands the tree to the workspace store. You can poke at
the model, but saves land in the tmpdir and get cleaned up on next
boot. For real round-trip, clone locally and go back to Scenario 2.
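For orientation, the clone step can be approximated as follows. This sketch only builds the scratch path and the git commands; it does not run them. The directory naming follows the $TMPDIR/datalex-dbt-<uuid>/ pattern described above, whereas the function names and the choice of a full clone followed by git checkout (so that a branch, tag, or SHA all work as the ref) are assumptions.

```python
import tempfile
import uuid
from pathlib import Path
from typing import Optional


def scratch_clone_dir(base: Optional[str] = None) -> Path:
    """Build a unique datalex-dbt-<uuid> path under the temp directory."""
    root = Path(base) if base else Path(tempfile.gettempdir())
    return root / f"datalex-dbt-{uuid.uuid4()}"


def clone_commands(repo_url: str, target: Path, ref: Optional[str] = None) -> list:
    """Return the git invocations needed to clone and optionally pin a ref."""
    commands = [["git", "clone", repo_url, str(target)]]
    if ref:
        # checkout accepts a branch, tag, or commit SHA after a full clone
        commands.append(["git", "-C", str(target), "checkout", ref])
    return commands


if __name__ == "__main__":
    target = scratch_clone_dir()
    for command in clone_commands("https://github.com/<org>/<repo>", target, "main"):
        print(" ".join(command))
```

Since the target lives under the temp directory, anything saved there should be treated as disposable, which matches the cleanup behaviour described above.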
Scenario 4 — Live warehouse pull¶
Your warehouse exists, dbt doesn't (yet). DataLex introspects the database, lets you pick tables, and writes a DataLex tree.
cd ~/path/to/new-or-existing-project
datalex serve --project-dir .
Then in the UI:
- Left panel → Connectors → New connection
- Pick your dialect: postgres, mysql, snowflake, bigquery, databricks, sqlserver, azure_sql, azure_fabric, redshift
- Fill credentials → Test. You should see a pill like pingMs: 12 · PostgreSQL 16.2
- Pull → the warehouse table picker opens
- Tick the schemas/tables you want; toggle "Row counts" for a SELECT COUNT(*) per table; preview inferred PKs + FKs
- Commit → SSE log streams [pull] customers: 100 rows, etc.
Output layout adapts to the project. If the folder contains
dbt_project.yml, pulls land at sources/<db>__<schema>.yaml +
models/staging/stg_<schema>__<table>.yml. Otherwise the layout is flat.
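The layout rule can be written down as a small path-planning function. The dbt-aware branch follows the two patterns given above. The flat branch is a placeholder: the actual flat file naming is not specified here, so the <schema>__<table>.yaml form used below is purely hypothetical, as are the function names.

```python
from pathlib import Path


def is_dbt_project(project_dir: Path) -> bool:
    """A folder counts as a dbt project when it contains dbt_project.yml."""
    return (project_dir / "dbt_project.yml").is_file()


def planned_paths(dbt_layout: bool, db: str, schema: str, tables: list) -> list:
    """Return relative output paths (POSIX style) for a warehouse pull."""
    if dbt_layout:
        paths = [f"sources/{db}__{schema}.yaml"]
        paths += [f"models/staging/stg_{schema}__{table}.yml" for table in tables]
        return paths
    # Hypothetical flat naming; the real flat layout is not documented here.
    return [f"{schema}__{table}.yaml" for table in tables]


if __name__ == "__main__":
    layout = is_dbt_project(Path("."))
    for path in planned_paths(layout, "analytics", "public", ["customers", "orders"]):
        print(path)
```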
📖 Full walkthrough: tutorials/warehouse-pull.md
Scenario 5 — Wire up CI¶
DataLex 1.4 ships a GitHub Action that runs the same readiness review shown in the UI on every PR. It posts a sticky comment with the red/yellow/green file counts, uploads SARIF to the Security tab, and fails the build when the project score drops below your threshold.
Drop this into .github/workflows/datalex-readiness.yml:
name: DataLex readiness
on:
  pull_request:
permissions:
  contents: read
  issues: write # sticky PR comments
  pull-requests: write
  security-events: write # SARIF upload
jobs:
  readiness:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with: { fetch-depth: 0 }
      - uses: duckcode-ai/DataLex/actions/datalex-gate@main
        with:
          project-path: .
          min-score: 80
          changed-only: true
          base-ref: origin/${{ github.base_ref }}
Run the same gate locally:
pip install 'datalex-cli'
datalex readiness-gate --project . --min-score 80 \
--sarif datalex-readiness.sarif --pr-comment datalex-readiness.md
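The pass/fail decision behind the gate reduces to a threshold comparison plus a per-colour tally. The sketch below shows that logic under stated assumptions: how DataLex computes the project score is not covered here, a score equal to the threshold is taken to pass (the build fails only when the score drops below it), and the exit-code convention and function names are illustrative.

```python
from collections import Counter


def gate_exit_code(project_score: float, min_score: float = 80) -> int:
    """Return 0 when the score meets the threshold, 1 when it falls below."""
    return 0 if project_score >= min_score else 1


def badge_counts(file_badges: dict) -> dict:
    """Tally red / yellow / green badges, always reporting all three colours."""
    counts = Counter(file_badges.values())
    return {colour: counts.get(colour, 0) for colour in ("red", "yellow", "green")}


if __name__ == "__main__":
    badges = {
        "models/staging/stg_customers.yml": "green",
        "models/staging/stg_orders.yml": "yellow",
        "models/marts/core/fct_orders.yml": "red",
    }
    print(badge_counts(badges), "exit code:", gate_exit_code(76.5))
```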
📖 Full walkthrough: tutorials/ci-readiness-gate.md
What stays in your project, what doesn't¶
DataLex writes these local runtime files into --project-dir:
| File / folder | What it is | Commit it? |
|---|---|---|
| .dm-projects.json | Projects list the UI sees | Optional (safe to commit or gitignore) |
| .dm-credentials.json | Warehouse credentials | Never (already in our gitignore template) |
| .datalex/agent/ | Local AI index, chat history, memory, runtime cache | No (local runtime state) |
| .datalex/policies/*.yaml | Custom policy packs (1.4) | Yes (checked into git so CI uses the same rules) |
| dm | Auto-written CLI shim for subprocess calls | Gitignored |
Your DataLex modeling artifacts live under DataLex/ and are meant to
be reviewable YAML. Commit model/diagram YAML and DataLex/Skills/*.md
when they represent team standards. git status shows real diffs on
every UI edit.
Troubleshooting install¶
| Symptom | Fix |
|---|---|
| datalex: command not found | Your pip bin dir isn't on PATH. python -m datalex_cli serve works too. |
| ERROR: 'node' was not found on PATH | pip install "datalex-cli[serve]" or install Node 20+. |
| Port 3030 already in use | Prior server still running. lsof -ti:3030 \| xargs kill, or --port 4040. |
| ModuleNotFoundError: No module named 'datalex_cli' in API logs | Upgrade: pip install -U datalex-cli. |
| UI stuck on "model-examples" instead of my folder | Delete stale projects file: rm .dm-projects.json && datalex serve --project-dir . |
| Web bundle auto-build fails | cd packages/web-app && npm install && npm run build (only matters for source checkouts) |
| Blank page after refresh | Hard-refresh (⌘⇧R / Ctrl+F5); an old bundle is cached in the browser. |
| DOC_BLOCK_OVERWRITE when applying an AI proposal | Doc-block-bound descriptions live in .md files. Edit the {% docs %} block instead of the YAML description, or remove description_ref first if you really mean to break the binding. |
| CONTRACT_PREFLIGHT on dbt-sync forward | A contract-enforced model has columns with type: unknown. Run dbt compile to populate types or set data_type explicitly. |
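For the port conflict in particular, a portable alternative to lsof is to probe the port from Python before starting the server. This is a minimal sketch: the candidate ports 3030 and 4040 are the ones mentioned in the table, and port_in_use and first_free_port are hypothetical helper names.

```python
import socket
from typing import Optional


def port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """True when something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        return probe.connect_ex((host, port)) == 0


def first_free_port(candidates=(3030, 4040)) -> Optional[int]:
    """Return the first candidate port nobody is listening on, if any."""
    for port in candidates:
        if not port_in_use(port):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("3030 and 4040 are both busy; stop the old server first")
    else:
        print(f"datalex serve --port {port}")
```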
Mental model in 30 seconds¶
warehouse <──pull──> DataLex YAML tree <──sync──> dbt project
  (live)               (git-tracked)             (models/*.yml)
                             │
                             ├── readiness-gate ──▶ GitHub PR
                             ├── emit catalog ──▶ Atlan / DataHub / OpenMetadata
                             └── conceptualize / canonicalize ──▶ AI proposals
- Pull introspects a live database → writes a DataLex model tree.
- Import dbt reads your dbt manifest.json → populates the same tree with columns/types/tests; preserves {% docs %} references.
- Sync merges DataLex metadata back into dbt's schema.yml files non-destructively. Runs a contract pre-flight in 1.4.
- Emit writes dbt-parseable YAML from scratch (greenfield).
- Readiness gate scores the project red/yellow/green and fails CI on regressions.
- Emit catalog ships glossary + bindings to Atlan, DataHub, or OpenMetadata.
- Conceptualize / Canonicalize propose entities and a logical layer from staging models.
All are available from the CLI (datalex --help) and from the UI
toolbar.
Where to go next¶
- 📘 docs/tutorials/ — end-to-end walkthroughs:
- Jaffle-shop walkthrough
- Import an existing dbt repo
- Live warehouse pull
- CI readiness gate (1.4)
- Custom policy packs (1.4)
- 📗 docs/cli.md — every CLI subcommand and flag
- 🧠 docs/ai-agentic-modeling.md — Ask AI, doc-block-aware retrieval, conceptualizer + canonicalizer
- 🌐 docs/mesh-interfaces.md — shared model contracts + catalog export (1.4)
- 📙 docs/architecture.md — how DataLex is wired
- 📕 docs/api-contracts.md — HTTP API for integrators
- 📓 docs/datalex-layout.md — on-disk YAML spec
Once you have a DataLex tree on disk, everything else is plain git:
git init && git add . && git commit -m "chore(model): baseline import"
From there, PRs review like any code change and these CLI commands give you CI hooks:
- datalex readiness-gate --project . # red/yellow/green PR gate (1.4)
- datalex policy-check models/.../stg_customers.model.yaml --policy ... # org rules
- datalex validate models/.../stg_customers.model.yaml # schema check
- datalex lint models/.../stg_customers.model.yaml # semantic rules
- datalex gate old.yml new.yml # fail PRs on breaking changes
- datalex emit catalog --target atlan|datahub|openmetadata --model ... # catalog export (1.4)