ML Reliability · 10 min read

SENTINEL™: How Autonomous Root Cause Analysis Works

ZEVORIX Engineering

Data Reliability Team

SENTINEL™ v3.0 is an 8-step autonomous reliability pipeline that operates continuously across your data and AI infrastructure. Understanding how it works, not just what it does, is essential for tuning it to your environment and for building confidence that its autonomous decisions are trustworthy. This is a technical deep-dive into each step.
Step 1 is signal ingestion. SENTINEL consumes reliability signals from three sources: NEXUS™ validation results (quality scores, expectation failures, drift alerts), pipeline execution metadata (job duration, record counts, error rates), and ML model health metrics (MRS scores, feature importance changes, prediction confidence distributions). All signals are normalized to a common reliability event format and timestamped with microsecond precision.
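The normalization in step 1 can be pictured with a small sketch. The article does not publish SENTINEL's event schema, so the `ReliabilityEvent` fields, the source labels, and the raw payload keys (`asset`, `metric`, `value`, `ts_us`) below are all hypothetical, chosen only to show three kinds of signal landing in one shape with a microsecond timestamp.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Any

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
KNOWN_SOURCES = {"nexus", "pipeline", "ml_health"}  # hypothetical labels


@dataclass(frozen=True)
class ReliabilityEvent:
    """Hypothetical common event format; field names are illustrative."""
    source: str            # which of the three signal sources produced it
    asset: str             # dataset, pipeline, or model the signal is about
    metric: str            # e.g. "quality_score", "record_count"
    value: float
    observed_at: datetime  # timezone-aware, microsecond resolution


def normalize(source: str, raw: dict[str, Any]) -> ReliabilityEvent:
    """Map a raw payload from one source onto the common format.

    Assumed payload shape: `ts_us` is an integer count of microseconds
    since the Unix epoch. Integer arithmetic via timedelta avoids the
    rounding that float seconds would introduce.
    """
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown signal source: {source!r}")
    return ReliabilityEvent(
        source=source,
        asset=raw["asset"],
        metric=raw["metric"],
        value=float(raw["value"]),
        observed_at=EPOCH + timedelta(microseconds=raw["ts_us"]),
    )


event = normalize(
    "pipeline",
    {"asset": "bronze.orders", "metric": "record_count",
     "value": 1200, "ts_us": 1_700_000_000_123_456},
)
```

Keeping every signal in one frozen, typed record is what lets the later steps treat a drift alert and a job-duration spike uniformly.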
Steps 2 and 3 are anomaly detection and correlation. Each incoming signal is evaluated against rolling baselines using an ensemble of detectors: statistical z-score for continuous metrics, categorical change detection for schema signals, and temporal pattern detection for freshness anomalies. Correlated signals within a configurable time window (default: 15 minutes) are grouped into incident candidates. The correlation step is critical: a single schema change in a Bronze table may generate 12 separate anomaly signals across downstream Silver and Gold tables, and without correlation these look like 12 separate incidents instead of one root cause.
Step 4 is blast radius calculation. Before analyzing root cause, SENTINEL computes the blast radius: the set of downstream datasets, pipelines, and ML models that depend on the affected asset. This computation traverses the OpenLineage graph stored in S3. The blast radius determines incident severity: a Bronze anomaly affecting 3 Gold tables and 2 ML models is severity P1; the same anomaly affecting 0 downstream assets is severity P3.
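Correlation and blast radius are both easy to sketch. The code below is not SENTINEL's implementation and rests on three assumptions. First, grouping is purely time-based, anchored at the first signal of each candidate. Second, the lineage graph has already been loaded into a plain adjacency dict rather than read from OpenLineage. Third, the severity cut-offs are invented: the article fixes only the two endpoints (P1 and P3), so the P2 band and the threshold of five dependents are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def correlate(timestamps: list[datetime],
              window: timedelta = timedelta(minutes=15)) -> list[list[datetime]]:
    """Group anomaly timestamps into incident candidates.

    Assumption: a candidate absorbs every signal that arrives within
    `window` of its first signal; anything later starts a new candidate.
    """
    groups: list[list[datetime]] = []
    for ts in sorted(timestamps):
        if groups and ts - groups[-1][0] <= window:
            groups[-1].append(ts)
        else:
            groups.append([ts])
    return groups


def blast_radius(lineage: dict[str, list[str]], asset: str) -> set[str]:
    """Breadth-first walk over downstream edges, excluding the asset itself."""
    seen: set[str] = set()
    queue = deque([asset])
    while queue:
        for child in lineage.get(queue.popleft(), []):
            if child != asset and child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def severity(downstream: set[str]) -> str:
    """Illustrative mapping only; the P2 band and the cut-off are assumptions."""
    if not downstream:
        return "P3"
    return "P1" if len(downstream) >= 5 else "P2"


# Hypothetical lineage: one Bronze table feeding Silver, Gold, and two models.
lineage = {
    "bronze.orders": ["silver.orders"],
    "silver.orders": ["gold.revenue", "gold.churn", "gold.ltv"],
    "gold.revenue": ["model.forecast"],
    "gold.churn": ["model.churn"],
}
affected = blast_radius(lineage, "bronze.orders")
print(severity(affected), sorted(affected))
```

The `seen` set doubles as cycle protection, which matters because real lineage graphs occasionally contain loops.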
Step 5 is root cause analysis. The RCA engine applies a library of 50+ pattern-matching rules to the correlated signals and lineage graph. Rules are organized into categories: schema changes, volume anomalies, freshness violations, distribution drift, and resource constraints. Each rule produces a confidence score and a hypothesis. The highest-confidence hypothesis becomes the working RCA, surfaced in ORBIT™ with the supporting evidence and the lineage path that led to the conclusion.
For complex incidents where rule-based RCA produces only low-confidence hypotheses (confidence below 0.7), SENTINEL escalates to an LLM-assisted RCA step. The LLM receives the reliability event timeline, the relevant subgraph of the lineage graph, historical incident descriptions, and the rule-based hypotheses. It produces a structured RCA report with a ranked list of probable causes and specific investigation steps.
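The selection logic in step 5 reduces to "run every rule, keep the most confident hypothesis, escalate if it falls below 0.7". A minimal sketch follows. The two rules, the signal dict shape (`kind`, `asset`), and the fixed confidence values are hypothetical stand-ins for the real rule library. Escalating when no rule fires at all is also an assumption, since the article does not cover that case.

```python
from dataclasses import dataclass
from typing import Callable, Optional

LLM_ESCALATION_THRESHOLD = 0.7  # threshold stated in the article


@dataclass(frozen=True)
class Hypothesis:
    rule: str
    category: str
    confidence: float  # 0.0 to 1.0
    summary: str


# A rule inspects the correlated signals and returns a hypothesis or None.
Rule = Callable[[list[dict]], Optional[Hypothesis]]


def schema_change_rule(signals: list[dict]) -> Optional[Hypothesis]:
    """Hypothetical rule: any schema-change signal is a strong root-cause hint."""
    hits = [s for s in signals if s.get("kind") == "schema_change"]
    if not hits:
        return None
    return Hypothesis("schema_change_upstream", "schema changes", 0.9,
                      f"schema changed on {hits[0]['asset']}")


def volume_drop_rule(signals: list[dict]) -> Optional[Hypothesis]:
    """Hypothetical rule: a volume drop alone is weak evidence."""
    hits = [s for s in signals if s.get("kind") == "volume_drop"]
    if not hits:
        return None
    return Hypothesis("volume_drop", "volume anomalies", 0.5,
                      f"record count dropped on {hits[0]['asset']}")


def run_rca(signals: list[dict],
            rules: list[Rule]) -> tuple[Optional[Hypothesis], bool]:
    """Return (working hypothesis, escalate_to_llm)."""
    candidates = []
    for rule in rules:
        hypothesis = rule(signals)
        if hypothesis is not None:
            candidates.append(hypothesis)
    best = max(candidates, key=lambda h: h.confidence, default=None)
    # Assumption: no hypothesis at all is treated like a low-confidence one.
    escalate = best is None or best.confidence < LLM_ESCALATION_THRESHOLD
    return best, escalate


RULES: list[Rule] = [schema_change_rule, volume_drop_rule]
```

Returning the escalation flag alongside the hypothesis, rather than raising or branching inside `run_rca`, keeps the rule-based result available as input to whatever handles the escalation.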
Steps 6 and 7 are action recommendation and execution. For each incident, SENTINEL selects an action from the registry based on the RCA and the configured automation policy:
  • Conservative (default): recommend actions to data engineers and require approval for any automated remediation.
  • Moderate: automatically execute low-impact actions (pipeline reruns, cache invalidation, alert escalation) and require approval for high-impact actions (data quarantine, model rollback).
  • Aggressive: execute all actions autonomously, log everything, and notify asynchronously.
The two-factor approval workflow gates high-impact actions: the data owner receives a Slack message with full incident context, RCA, and the proposed action; they must confirm via button click within 30 minutes or SENTINEL escalates to the secondary approver. This ensures human oversight without blocking time-sensitive remediation.
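The three policies and the approval timeout amount to a small decision table, sketched below. The enum names and function signatures are hypothetical. Only the policy behaviour and the 30-minute timeout come from the text, and treating exactly 30 minutes as already escalated is an assumption.

```python
from datetime import timedelta
from enum import Enum


class Policy(Enum):
    CONSERVATIVE = "conservative"  # the default per the article
    MODERATE = "moderate"
    AGGRESSIVE = "aggressive"


class Impact(Enum):
    LOW = "low"    # pipeline reruns, cache invalidation, alert escalation
    HIGH = "high"  # data quarantine, model rollback


APPROVAL_TIMEOUT = timedelta(minutes=30)


def needs_approval(policy: Policy, impact: Impact) -> bool:
    """True if a human must confirm before the action runs."""
    if policy is Policy.AGGRESSIVE:
        return False
    if policy is Policy.MODERATE:
        return impact is Impact.HIGH
    return True  # conservative: every automated remediation is gated


def current_approver(waited: timedelta, owner: str, secondary: str) -> str:
    """Who should be holding the approval request after `waited` has elapsed."""
    return owner if waited < APPROVAL_TIMEOUT else secondary


for policy in Policy:
    gated = [impact.value for impact in Impact if needs_approval(policy, impact)]
    print(f"{policy.value}: approval required for {gated or 'nothing'}")
```

Separating "does this need approval" from "who approves right now" keeps the policy table testable without simulating Slack or a clock.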
Step 8 is learning and feedback. Every resolved incident updates SENTINEL's pattern library. If a rule-based RCA hypothesis matched the human-confirmed root cause, that rule's confidence weight increases; if it did not, the weight decreases. Over time, SENTINEL's RCA accuracy improves as it adapts to the patterns of your specific data infrastructure: incidents that required LLM assistance in month 1 are handled by pattern-matching alone by month 3. The predictive DRS (Data Reliability Score) adds a forecasting dimension: SENTINEL models 48-hour reliability trajectories for each dataset, flagging assets that are trending toward threshold violations before they fail.
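The feedback loop in step 8 can be illustrated with a deliberately simple update rule. The article says only that a rule's weight rises on a confirmed match and falls otherwise. The fixed additive step, the clamping bounds, and the function names below are assumptions made for illustration, not SENTINEL's actual update.

```python
def update_weight(weight: float, matched: bool, step: float = 0.05,
                  floor: float = 0.05, ceiling: float = 1.0) -> float:
    """Nudge a rule's confidence weight up on a match, down on a miss.

    Assumptions: a fixed additive step, clamped to [floor, ceiling] so
    that no rule is ever silenced completely or pushed past certainty.
    """
    adjusted = weight + step if matched else weight - step
    return min(ceiling, max(floor, adjusted))


def apply_feedback(weights: dict[str, float],
                   resolved: list[tuple[str, bool]]) -> dict[str, float]:
    """Replay resolved incidents as (rule name, matched human-confirmed cause)."""
    updated = dict(weights)  # leave the caller's dict untouched
    for rule, matched in resolved:
        updated[rule] = update_weight(updated[rule], matched)
    return updated


weights = {"schema_change_upstream": 0.60, "volume_drop": 0.60}
history = [
    ("schema_change_upstream", True),
    ("schema_change_upstream", True),
    ("volume_drop", False),
]
print(apply_feedback(weights, history))
```

A non-zero floor is one way to let a rule that performed badly early on recover if the environment later changes. Whether SENTINEL does anything similar is not stated in the article.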

Key Takeaways

  • SENTINEL v3.0 is an 8-step pipeline: signal ingestion → anomaly detection → correlation → blast radius → RCA → action recommendation → execution → learning
  • Signal correlation groups related anomalies into single incidents — a Bronze schema change generating 12 signals is one incident, not twelve
  • Blast radius traverses the OpenLineage graph to determine severity: more downstream dependents = higher severity
  • Two-factor approval workflow gates high-impact autonomous actions with configurable timeout escalation
  • The learning loop improves RCA accuracy over time — LLM-assisted incidents in month 1 become pattern-matched in month 3
