Data Engineering & Architecture

Scalable Multi-Engine Data Profiler

Producing deterministic, highly structured database intelligence to give AI agents the deep context they need to write perfect SQL.

Data Profiler
Supported Engines
4 (incl. Snowflake)
Execution Model
Concurrent
Test Coverage
77% (49 tests)
Tooling
uv, pydantic, ruff

Why ?

While building Aiden (an AI Data Engineer), we realized LLMs hallucinate SQL when they only see column names. They need to know that a status column only contains 'OPEN' and 'CLOSED', or that a revenue column has no negative values.

Every database speaks a different dialect. Snowflake uses NUMBER(10,2), DuckDB uses HUGEINT. Fetching this at scale without crashing the warehouse is a massive engineering challenge. This profiler acts as the universal translator and orchestrator to pull this deep context reliably.

Problem Statement
"How do we extract statistical metadata (histograms, cardinality, null ratios) across completely different data warehouses without dealing with mismatched data types or timeout crashes on massive schemas?"

Architecture & The Adapter Pattern

If tomorrow we need to support PostgreSQL or BigQuery, the core orchestrator doesn't change-we simply write a new Adapter. The Adapter Factory dynamically dispatches queries to the underlying engine while pushing heavy statistical aggregations directly down to the warehouse.

Loading diagram...
src/adapters/base.pypython
from abc import ABC, abstractmethod
from pydantic import BaseModel
class DatabaseAdapter(ABC):
@abstractmethod
def get_tables(self, schema: str) -> list[str]:
pass
@abstractmethod
def profile_column(self, table: str, col: str) -> dict:
pass
profiler_output.jsonjson
{
"table": "users",
"column": "status",
"harmonized_type": "STRING",
"metrics": {
"null_ratio": 0.0,
"approx_distinct_count": 2,
"histogram": [
{"bucket": "OPEN", "frequency": 8450},
{"bucket": "CLOSED", "frequency": 1550}
]
}
}

Resilience & Context Harmonization

Profiling 10,000 tables takes time. If the network drops at table 9,999, you shouldn't have to start over.

The engine runs concurrent threads but strictly updates a checkpoint.json file, allowing it to seamlessly resume exactly where it left off.

Notice the harmonized_type in the output artifact. It maps proprietary database types (like Snowflake's NUMBER) into a universal taxonomy, standardizing context for the downstream LLM agents.

Production-Grade DevEx

Tooling choices were made prioritizing speed, strict validation, and maintainability:

  • uv: Lightning-fast dependency management & virtual environments.
  • Pydantic v2: Strict data validation ensuring the output JSON always conforms to the expected schema.
  • Ruff & Pyright: Blazing fast linting and strict static typing.
uv-profiler-run