Running Great Expectations 1.9+ in a production AWS Glue environment requires solving three non-obvious problems: context initialization without a filesystem, expectation suite management at scale, and performance under high data volumes. This guide covers all three, based on production deployments processing 2.4 billion events per day.
The core architectural decision is whether to use an Ephemeral Data Context (no persistent state) or a File Data Context (reads from a mounted path). In Glue, the filesystem is read-only except for /tmp, and Glue workers don't share state between runs. The right choice is almost always an Ephemeral context with expectations loaded from S3 at job startup.
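A minimal sketch of this pattern, assuming suites are serialized as JSON under a hypothetical `gx/suites/{layer}/{dataset}/v{version}.json` prefix. The helper names (`fetch_suite_dict`, `build_context`) are illustrative, and the GX calls follow the 1.x API (`gx.get_context(mode="ephemeral")`, `context.suites.add`); expectation deserialization details are elided.

```python
import json


def fetch_suite_dict(s3_client, bucket: str, layer: str, dataset: str, version: int) -> dict:
    """Fetch a serialized expectation suite from S3. The key layout is an assumption."""
    key = f"gx/suites/{layer}/{dataset}/v{version}.json"
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"].read()
    return json.loads(body)


def build_context(suite_dicts):
    """Create an ephemeral GX context and register each suite (GX 1.x sketch)."""
    # Imported here so environments without GX can still use the S3 helper above.
    import great_expectations as gx

    context = gx.get_context(mode="ephemeral")
    for d in suite_dicts:
        suite = gx.ExpectationSuite(name=d["name"])
        # In production, rebuild each expectation from d["expectations"] here.
        context.suites.add(suite)
    return context
```

Because the context is ephemeral, nothing is written back to the read-only Glue filesystem; S3 remains the single source of truth for suites.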
Organizing expectation suites at scale requires a naming convention that maps to your data architecture. We use a three-level hierarchy: {layer}.{dataset}.{version}. A Bronze suite for the orders dataset at version 3 is bronze.orders.v3. This makes suite management scriptable, enables version-controlled expectations alongside your pipeline code, and allows automatic suite selection based on Glue job arguments.
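The convention is simple enough to encode as a pair of helpers, which is what makes it scriptable. These functions are illustrative, not part of the GX API:

```python
def suite_name(layer: str, dataset: str, version: int) -> str:
    """Build a suite name like 'bronze.orders.v3' from its parts."""
    return f"{layer}.{dataset}.v{version}"


def parse_suite_name(name: str) -> tuple[str, str, int]:
    """Split 'bronze.orders.v3' back into (layer, dataset, version)."""
    layer, dataset, version = name.split(".")
    return layer, dataset, int(version.lstrip("v"))
```

A Glue job can then resolve its suite from job arguments, e.g. `suite_name(args["layer"], args["dataset"], int(args["suite_version"]))`.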
Loading suites from S3 at startup adds 2-4 seconds of latency per dataset. At scale with 25+ datasets, this matters. The pattern is to load all relevant suites during the Glue job initialization phase, before the Spark DAG is built, so suite loading overlaps with Spark context startup instead of adding to it.
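One way to keep 25+ suite loads from stacking serially is a small thread pool during initialization; since the work is S3 I/O, threads are enough. A sketch, where `loader` stands in for whatever S3 fetch function you use:

```python
from concurrent.futures import ThreadPoolExecutor


def load_all_suites(loader, suite_names, max_workers=8):
    """Load many suites concurrently so per-suite S3 latency overlaps
    instead of adding up. `loader` is any callable: suite name -> suite dict."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so names and results stay aligned.
        return dict(zip(suite_names, pool.map(loader, suite_names)))
```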
The Medallion architecture introduces layer-specific validation requirements that map directly to GX expectation suite design. Bronze layer validations focus on structural integrity: schema conformity, null constraints on primary keys, format validation on critical fields. The success threshold is 50% — Bronze ingests raw data and some quality issues are expected. Silver validations add business rule expectations: referential integrity, range constraints, cross-column consistency. The threshold rises to 75%. Gold validations add statistical expectations: distribution bounds, outlier limits, completeness requirements for analytics. Threshold: 90%.
The threshold difference matters for pipeline orchestration. A Bronze job failing at 45% is expected behavior — quarantine the bad records, log the issue, continue. A Gold job failing at 85% is a P1 incident — the analytics layer is serving degraded data.
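The routing logic can be expressed as a small pure function the orchestrator calls on every validation result. The action labels and the Silver behavior below are illustrative assumptions; the source only specifies the Bronze (quarantine and continue) and Gold (P1) cases:

```python
# Layer thresholds from the Medallion design above.
LAYER_THRESHOLDS = {"bronze": 0.50, "silver": 0.75, "gold": 0.90}


def route_validation_result(layer: str, success_rate: float) -> str:
    """Map a layer's validation success rate to a pipeline action.
    Action names are illustrative labels, not GX concepts."""
    if success_rate >= LAYER_THRESHOLDS[layer]:
        return "continue"
    if layer == "bronze":
        # Expected behavior: quarantine bad records, log, keep going.
        return "quarantine-and-continue"
    if layer == "silver":
        # Assumed behavior for the middle layer: stop and notify.
        return "halt-and-alert"
    # Gold below threshold means analytics is serving degraded data.
    return "page-p1"
```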
Step Functions orchestration changes the failure handling pattern. With a direct Glue job, you handle failures in the job itself. With Step Functions, you get distributed compensation: a Gold failure triggers a Lambda that checks whether Bronze and Silver are healthy, determines the blast radius, and routes the appropriate alert. This separation of validation logic from orchestration logic is the key architectural advantage at scale.
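The compensation Lambda's core decision can be sketched as a pure function: given health flags for the upstream layers, which layers are inside the blast radius? This is an illustrative simplification of the logic described above, not the production handler:

```python
def blast_radius(layer_health: dict[str, bool], failed_layer: str) -> list[str]:
    """Return the layers implicated by a validation failure, ordered
    bronze -> gold. `layer_health` maps upstream layer names to booleans."""
    order = ["bronze", "silver", "gold"]
    upstream = order[: order.index(failed_layer)]
    # Unhealthy upstream layers widen the blast radius: the failure
    # likely originated earlier in the pipeline.
    unhealthy = [layer for layer in upstream if not layer_health.get(layer, False)]
    return unhealthy + [failed_layer]
```

A Gold failure with healthy Bronze and Silver is scoped to Gold alone; an unhealthy Silver pulls Silver into the alert as the probable origin.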
Performance optimization at scale comes from three levers: batch validation (validate all columns in a single pass rather than per-expectation passes), Spark-native expectations (use GX's Spark BatchRequest rather than converting to Pandas), and selective profiling (run the expensive profiling expectations only on changed partitions, not on full dataset scans).
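The third lever, selective profiling, reduces to a partition diff: profile only partitions modified since the last profiling run. A minimal sketch, assuming partitions are `(name, last_modified_timestamp)` pairs — the shape of that metadata is an assumption:

```python
def partitions_to_profile(all_partitions, last_profiled):
    """Select partitions changed since the last profiling run, so the
    expensive profiling expectations skip untouched data.
    `all_partitions`: iterable of (name, last_modified_ts) pairs.
    `last_profiled`: dict of name -> ts of the previous profiling pass."""
    return [name for name, ts in all_partitions
            if ts > last_profiled.get(name, 0)]
```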
Practical deployment checklist:
- Pin GX to an exact version in requirements.txt (minor versions introduce breaking changes to suite formats).
- Pre-warm the GX context via Glue's --additional-python-modules to avoid cold start penalties.
- Use S3 for validation result storage with a partition scheme that enables efficient time-range queries.
- Configure CloudWatch metrics emission from every validation run so your reliability dashboard has real-time signal without querying DynamoDB on every dashboard load.
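For the result-storage point, a Hive-style key layout lets Athena or S3 Select prune prefixes on time-range queries. The layout below is illustrative, reusing the `{layer}.{dataset}.{version}` suite naming convention:

```python
from datetime import datetime


def validation_result_key(suite_name: str, run_time: datetime) -> str:
    """Build an S3 key for a validation result, partitioned by date so
    time-range queries scan only the matching prefixes. Layout is an
    illustrative assumption, not a GX default."""
    layer, dataset, version = suite_name.split(".")
    return (f"validations/layer={layer}/dataset={dataset}/"
            f"date={run_time:%Y-%m-%d}/{version}_{run_time:%H%M%S}.json")
```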
Key Takeaways
- Use Ephemeral Data Context in Glue — load suites from S3 at startup, not filesystem paths
- Name suites with {layer}.{dataset}.{version} convention for scriptable, version-controlled management
- Layer-specific thresholds: Bronze 50%, Silver 75%, Gold 90% — failure handling differs by layer
- Step Functions orchestration enables distributed compensation and blast radius analysis on validation failures
- Pin GX to exact version, use Spark-native BatchRequests, and emit CloudWatch metrics for real-time reliability signal