METRO: Practice Engineering and Data Governance Tooling

Executive Summary

METRO is a practice engineering initiative that codifies data engineering methodology, AI-assisted development frameworks, and web design standards into executable tools and auditable documentation. Its most significant technical artifact is md-util, a Go CLI that transforms raw data assets from disparate sources into SHACL 1.2 ontology representations and emits dialect-specific DDL and schema artifacts for Postgres, BigQuery, Databricks, Parquet/Arrow, and Vortex — solving the schema synchronization problem across heterogeneous data environments with a single canonical model. Together, METRO and md-util demonstrate end-to-end system design from strategic framing through specification to working software.


Technical Overview

METRO encompasses two threads of work: a practice documentation system that formalizes how data engineering and AI-assisted development should be conducted, and md-util, a production tool that puts those principles into practice.

Practice Framework: Left-of-Do

The intellectual foundation is the Left-of-Do framework, a closed-loop methodology linking business intent to behavioral proof through three stages. Hypothesis-Driven Development (HDD) defines strategic intent through testable benefit hypotheses — answering “why are we building this?” across problem reality, agent hypothesis, and assumption inventory phases. Specification-Driven Development (SDD) translates that intent into precise, structured specifications that AI agents can reliably build from. Behavior-Driven Development (BDD) proves the specification was met through executable Gherkin scenarios covering happy-path, edge-case, adversarial, and failure-mode conditions. Both HDD and BDD are implemented as interactive Go CLI tools that walk teams through structured question sets, producing auditable artifacts at each stage.
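To make the BDD stage concrete, here is a sketch of what one such Gherkin scenario pair might look like. The feature, step wording, and scenario names are illustrative assumptions, not excerpts from the actual 26-scenario specification:

```gherkin
Feature: CSV ingestion produces a SHACL shape

  Scenario: Happy path - well-formed CSV yields a NodeShape
    Given a CSV file with a header row and typed columns
    When the user runs the ingest stage against the file
    Then a NodeShape is produced with one PropertyShape per column

  Scenario: Failure mode - unreadable input is rejected cleanly
    Given a path that does not exist
    When the user runs the ingest stage against the path
    Then the tool exits with a non-zero status and a descriptive error
```

Pairing a happy-path scenario with a failure-mode scenario in the same feature is what turns the specification into a behavioral boundary rather than a demo script.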

md-util: SHACL-First Data Governance

md-util is a Go CLI implementing a three-stage pipeline — ingest, review, emit — with W3C SHACL 1.2 Turtle as the canonical interchange format.

Five ingestors handle CSV (with statistical type inference over 10,000-row samples and nullable column detection), JSON/JSONL, SQL DDL (parsed without external dependencies), DuckDB (via subprocess delegation, avoiding CGO), and PostgreSQL (via environment-variable configuration). Each ingestor produces a uniform internal model of NodeShape and PropertyShape structs, eliminating cross-boundary format leakage.
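A minimal sketch of what such a uniform internal model could look like in Go. The field names and layout here are assumptions for illustration, not md-util's actual definitions:

```go
package main

import "fmt"

// NodeShape sketches a SHACL node shape: one per table or entity.
// Every ingestor, whatever its source format, targets this struct.
type NodeShape struct {
	Name       string          // target class / table name
	Properties []PropertyShape // one per column or field
}

// PropertyShape captures the per-column facts every ingestor can supply.
type PropertyShape struct {
	Path     string // column name
	Datatype string // XSD datatype, e.g. "xsd:integer"
	Nullable bool   // inferred from sampling (CSV) or catalog metadata (SQL)
}

func main() {
	shape := NodeShape{
		Name: "orders",
		Properties: []PropertyShape{
			{Path: "id", Datatype: "xsd:integer", Nullable: false},
			{Path: "placed_at", Datatype: "xsd:dateTime", Nullable: true},
		},
	}
	fmt.Printf("%s has %d properties\n", shape.Name, len(shape.Properties))
}
```

Because every downstream stage consumes only these structs, an emitter never needs to know whether a shape originated from a CSV sample or a Postgres catalog query.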

A centralized type mapping layer provides the single source of truth for XSD-to-SQL conversions across Postgres, BigQuery, and Databricks dialects. No emitter contains inline type logic — all five emitters (Postgres DDL, BigQuery DDL, Databricks DDL, Parquet/Arrow schema JSON, and Vortex DuckDB COPY commands) project SHACL shapes through this shared mapping. The result: one ontology in, multiple governed schemas out, with no synchronization drift.
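A centralized mapping of this kind can be as simple as a two-level lookup table with an explicit fallback. The sketch below shows the shape of the idea; the specific type choices are plausible defaults, not md-util's actual table:

```go
package main

import "fmt"

// Dialect identifies a target SQL dialect.
type Dialect string

const (
	Postgres   Dialect = "postgres"
	BigQuery   Dialect = "bigquery"
	Databricks Dialect = "databricks"
)

// xsdToSQL is the single source of truth: XSD datatype -> per-dialect SQL type.
var xsdToSQL = map[string]map[Dialect]string{
	"xsd:integer":  {Postgres: "BIGINT", BigQuery: "INT64", Databricks: "BIGINT"},
	"xsd:string":   {Postgres: "TEXT", BigQuery: "STRING", Databricks: "STRING"},
	"xsd:dateTime": {Postgres: "TIMESTAMPTZ", BigQuery: "TIMESTAMP", Databricks: "TIMESTAMP"},
}

// SQLType resolves an XSD datatype for one dialect, falling back to a
// text type with an explicit annotation when no mapping exists.
func SQLType(xsd string, d Dialect) string {
	if byDialect, ok := xsdToSQL[xsd]; ok {
		if t, ok := byDialect[d]; ok {
			return t
		}
	}
	return "TEXT /* fallback: unmapped " + xsd + " */"
}

func main() {
	fmt.Println(SQLType("xsd:integer", BigQuery)) // INT64
	fmt.Println(SQLType("xsd:duration", Postgres))
}
```

Keeping the table in one package means a new dialect is one column added in one place, which is precisely how synchronization drift is avoided.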

The turtle serialization package implements bidirectional SHACL Turtle parsing and emission without external RDF libraries, using a state-machine scanner that tracks bracket depth and string literals. Round-trip testing ensures parse-serialize-parse fidelity.
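The core trick of such a scanner is to treat a statement terminator as structural only when it appears at bracket depth zero and outside a string literal. A simplified sketch of that idea (it ignores escapes and triple-quoted literals, which a real Turtle parser must handle):

```go
package main

import "fmt"

// splitStatements splits Turtle text into top-level statements on '.',
// ignoring dots inside [...] / (...) blocks or quoted string literals.
// A simplification of the state-machine idea, not md-util's actual parser.
func splitStatements(src string) []string {
	var out []string
	depth, start := 0, 0
	inString := false
	for i := 0; i < len(src); i++ {
		switch c := src[i]; {
		case c == '"':
			inString = !inString
		case inString:
			// structural characters inside literals are plain text
		case c == '[' || c == '(':
			depth++
		case c == ']' || c == ')':
			depth--
		case c == '.' && depth == 0:
			out = append(out, src[start:i+1])
			start = i + 1
		}
	}
	return out
}

func main() {
	ttl := `ex:Order a sh:NodeShape ; sh:property [ sh:path ex:id ] . ex:note ex:text "v1.2" .`
	for _, s := range splitStatements(ttl) {
		fmt.Println(s)
	}
}
```

Note how the dot in the literal `"v1.2"` does not terminate a statement — that is the case a naive split-on-period parser gets wrong, and the one round-trip tests exist to catch.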

AI Integration as Engineering Discipline

md-util is itself a demonstration of the Left-of-Do framework in practice. The .context/ directory contains the full audit trail: an HDD document framing the problem hypothesis, an SDD specifying intent and architectural constraints (including the critical “never reverse-engineer .ttl from DDL” principle), and a 26-scenario BDD Gherkin specification defining system boundaries and edge cases — timeout handling, path traversal guards, format auto-detection, and type ambiguity fallbacks.

Development proceeded through deliberate git-flow feature branches (feature/init-command, feature/lakes, worktree-feature+emit) with session-based status reports tracking version progression from v0.2 through v0.5.0. The current release carries 35+ passing tests spanning SHACL model validation, type mapping, Turtle round-trip serialization, and emit dialect handling.

Supporting Standards

METRO also codifies a Node-free web design standard (Caddy, HTMX, Alpine.js, Tailwind, Go templates, Hugo) and two AI prompt kit frameworks: Proper Skills for building production-ready AI agent capabilities, and Dark Code for auditing hidden complexity and risk in AI-generated code. These sit alongside formalized practice analyses covering data transformation (SQL-first with DuckDB/Polars), producer-consumer pipeline architecture, and organizational strategy positioning.

Architecture and Design Choices

Go is the implementation language throughout, chosen for deployment simplicity and runtime reliability. md-util avoids CGO entirely — DuckDB integration uses subprocess delegation, keeping the binary self-contained. Production safety guards include path traversal prevention, collision warnings on .ttl overwrites, and type ambiguity detection with explicit fallback annotations. Schema migrations, environment configuration via direnv, and secrets management through HashiCorp Vault follow the same operational patterns used across the broader project portfolio.
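A path traversal guard of the kind described can be sketched in a few lines of standard-library Go. This is an illustrative pattern under the assumption that user paths are resolved against a known root, not md-util's actual code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// withinRoot resolves a user-supplied path against a root directory and
// rejects anything that escapes the root via ".." segments.
func withinRoot(root, userPath string) (string, error) {
	abs := filepath.Join(root, filepath.Clean(userPath))
	rel, err := filepath.Rel(root, abs)
	if err != nil || rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("path %q escapes root %q", userPath, root)
	}
	return abs, nil
}

func main() {
	if p, err := withinRoot("/data", "shapes/model.ttl"); err == nil {
		fmt.Println("ok:", p)
	}
	if _, err := withinRoot("/data", "../etc/passwd"); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Checking the relative path after joining, rather than string-matching on the raw input, is what makes the guard robust to inputs like `a/../../etc`.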

The through-line across METRO is that methodology and tooling are not separate concerns. The frameworks that define how work should be done are implemented as the same Go CLIs and structured artifacts that do the work — closing the loop between theory and practice.