SelfHealOps
An autonomous, self-healing DevOps agent designed to automatically classify, analyze, and remediate CI/CD pipeline failures and infrastructure issues using a hierarchical multi-agent system.
Architecture: State Machine (Directed Cyclic Graph)
AI Engine: NIM (NVIDIA Inference)
Metrics: Grafana (Real-time Observability)
Core Logic: Python (FastAPI Backend)
The Problem & Solution
When CI/CD pipelines fail or infrastructure issues arise, engineering teams lose valuable hours diagnosing stack traces, reviewing commits, and applying manual fixes. This disrupts the deployment lifecycle and reduces developer productivity.
SelfHealOps operates as an autonomous resolution agent. It uses a LangGraph-powered state machine and specialized LLM agents served through NVIDIA NIM to process incident data, determine root causes, generate concrete remediation plans, and execute fixes only after they pass a strict policy engine.
Key Features
- ✓ LangGraph State Machine Orchestration
- ✓ Automated Root Cause Analysis
- ✓ Hierarchical AI Agent Delegation
- ✓ Production-grade Observability (Prometheus/Grafana)
- ✓ Integration with K8s & PyGithub
Directed Cyclic Graph Architecture
The system runs its agents as nodes in a cyclic graph managed by a LangGraph orchestrator. Every proposed fix is validated before it is executed automatically, and the graph loops back to diagnosis until the failure is resolved.
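The loop described above can be sketched without any framework. The following is a minimal, dependency-free illustration of a cyclic state machine, not the project's orchestrator and not the LangGraph API: the node names, the `State` dictionary, the step cap, and the rule that the fix succeeds on the second attempt are all assumptions made for the example.

```python
from typing import Callable, Dict, List

State = dict  # shared state handed from node to node (illustrative)


def diagnose(state: State) -> str:
    # Stand-in for the Classifier and Analyst: record a root cause.
    state["root_cause"] = f"cause-of-{state['failure']}"
    return "plan"


def plan(state: State) -> str:
    # Stand-in for the Strategist: propose a new fix on every pass.
    state["attempts"] += 1
    state["plan"] = [f"fix attempt {state['attempts']}"]
    return "audit"


def audit(state: State) -> str:
    # Stand-in for the Auditor: this sketch approves every plan.
    state["approved"] = True
    return "execute" if state["approved"] else "plan"


def execute(state: State) -> str:
    # Assumption for the demo: the fix only works on the second attempt,
    # so the first pass cycles back to diagnosis.
    state["resolved"] = state["attempts"] >= 2
    return "done" if state["resolved"] else "diagnose"


NODES: Dict[str, Callable[[State], str]] = {
    "diagnose": diagnose,
    "plan": plan,
    "audit": audit,
    "execute": execute,
}


def run(state: State, max_steps: int = 20) -> State:
    """Walk the graph until a node returns 'done' or the step cap is hit."""
    node = "diagnose"
    trace: List[str] = []
    for _ in range(max_steps):
        if node == "done":
            break
        trace.append(node)
        node = NODES[node](state)
    state["trace"] = trace
    return state


result = run({"failure": "build", "attempts": 0})
print(result["trace"])
```

The step cap is the important design choice here: a cyclic graph needs an explicit bound so that an unfixable failure ends the run instead of looping forever.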
Self-Healing Pipeline Flow
Agentic state machine coordinating diagnosis and remediation actions.
Specialized Worker Agents
The Classifier
Analyzes incoming CI/CD logs and pipeline context to categorize the exact failure domain (e.g., DEPENDENCY_FAILURE, INFRASTRUCTURE_FAILURE).
The Analyst
Performs deep analysis of historical commits and error tracebacks to determine the true technical root cause.
The Strategist
Translates the root cause into a sequential list of deterministic actions required to fix the system.
The Auditor
Evaluates the proposed action plan against rigid policy guardrails to prevent destructive commands.
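As a rough illustration of a guardrail check, the sketch below screens each command in a plan against a deny-list of regular expressions. The patterns and the `audit_plan` helper are hypothetical; the project's actual rules are described as YAML files under policies/, and their contents are not shown in this document.

```python
import re
from typing import List, Tuple

# Hypothetical deny-list, hardcoded for the example rather than loaded from YAML.
DENY_PATTERNS = [
    r"\brm\s+-rf\b",                         # recursive force delete
    r"\bgit\s+push\b.*--force",              # history rewrite on a remote
    r"\bkubectl\s+delete\s+(namespace|ns)\b",  # wiping a whole namespace
    r"\bdrop\s+(table|database)\b",          # destructive SQL
]


def audit_plan(plan: List[str]) -> Tuple[bool, List[str]]:
    """Return (approved, violations) for a list of shell commands."""
    violations = [
        cmd
        for cmd in plan
        if any(re.search(pattern, cmd, re.IGNORECASE) for pattern in DENY_PATTERNS)
    ]
    return (not violations, violations)


safe_plan = [
    "git checkout -b selfhealops/fix-deps",
    "npm install date-fns@^2.30.0",
    "git commit -m 'fix: rollback date-fns'",
    "git push origin selfhealops/fix-deps",
]
approved, violations = audit_plan(safe_plan)
print("APPROVED" if approved else f"REJECTED: {violations}")
```

A deny-list is only one possible policy shape. An allow-list of permitted commands is stricter and may suit an executor that runs unattended.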
The Scholar
Extracts successful remediation patterns and stores them semantically, enabling future incidents to be resolved instantly via memory recall.
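The recall idea can be shown with a toy in-memory store. This sketch swaps in bag-of-words cosine similarity from the standard library in place of real embeddings and a vector database, so it is only an approximation of semantic search. The `PatternMemory` class, its method names, and the 0.5 threshold are assumptions for illustration.

```python
import math
from collections import Counter
from typing import List, Optional, Tuple


def embed(text: str) -> Counter:
    # Toy "embedding": lowercase token counts (a stand-in for a real model).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class PatternMemory:
    """Stores (incident description, remediation plan) pairs for later recall."""

    def __init__(self) -> None:
        self._items: List[Tuple[Counter, List[str]]] = []

    def store(self, description: str, plan: List[str]) -> None:
        self._items.append((embed(description), plan))

    def recall(self, description: str, threshold: float = 0.5) -> Optional[List[str]]:
        # Return the plan of the most similar stored incident, if similar enough.
        query = embed(description)
        best_score, best_plan = 0.0, None
        for vector, plan in self._items:
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_plan = score, plan
        return best_plan if best_score >= threshold else None


memory = PatternMemory()
memory.store(
    "date-fns v3 breaking change in format imports",
    ["npm install date-fns@^2.30.0"],
)
print(memory.recall("date-fns v3 breaking format imports"))
```

When recall returns a plan, the orchestrator could skip fresh analysis and send the remembered plan straight to the Auditor; when it returns None, the normal diagnosis path applies.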
SelfHealOps Control Panel
Visualizing the autonomous agent task traces, global JSON context variables, and repository blueprints.
# Starting SelfHealOps Orchestrator (LangGraph + NVIDIA NIM)...
[TRIGGER] GitHub Webhook: Pipeline #4492 FAILED (exit code 1).
[CLASSIFIER] Analyzing 500 lines of logs. Category identified: DEPENDENCY_CONFLICT
[ANALYST] Fetching recent commits... Found package.json update in commit 8a93b21
[ANALYST] Root cause: 'date-fns' upgraded to v3.0, breaking 'format' imports in utils.ts
[STRATEGIST] Generating deterministic fix: Revert 'date-fns' to ^2.30.0.
[AUDITOR] Evaluating proposed git commit and npm install against safety policies...
[AUDITOR] Policy PASSED: Safe to modify package.json and commit.
[EXECUTOR] Applying fix... Pushing hotfix branch 'selfhealops/fix-deps'
[SCHOLAR] Storing pattern in ChromaDB (Similarity Key: DEPENDENCY_CONFLICT_DATEFNS)
[SUCCESS] Pipeline re-triggered. Status: GREEN.
{
  "incident_id": "pl_4492_deps",
  "trigger": "github_actions",
  "category": "DEPENDENCY_CONFLICT",
  "root_cause": {
    "file": "utils.ts",
    "culprit_commit": "8a93b21",
    "description": "date-fns v3 breaking change"
  },
  "remediation_plan": [
    "git checkout -b selfhealops/fix-deps",
    "npm install date-fns@^2.30.0",
    "git commit -m 'fix: rollback date-fns'",
    "git push origin selfhealops/fix-deps"
  ],
  "auditor_status": "APPROVED",
  "scholar_embedded": true
}
LangGraph Management
Managing cyclic execution in LangGraph requires robust state tracking. Passing an `AgentState` typed dictionary through the graph gives every node a predictable input and output shape during automated remediation.
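A possible shape for such a state is sketched below. The field names mirror the incident JSON shown earlier, but the project's real `AgentState` definition is not given in this document, so treat the fields, the `RootCause` helper type, and `classifier_node` as assumptions. Note that `TypedDict` is checked by static type checkers only; it does not validate values at runtime.

```python
from typing import List, TypedDict


class RootCause(TypedDict):
    file: str
    culprit_commit: str
    description: str


class AgentState(TypedDict, total=False):
    # total=False: nodes fill fields in gradually as the run progresses.
    incident_id: str
    trigger: str
    category: str
    root_cause: RootCause
    remediation_plan: List[str]
    auditor_status: str
    scholar_embedded: bool


def classifier_node(state: AgentState) -> AgentState:
    # Hypothetical node: returns only the keys it wants to update.
    return {"category": "DEPENDENCY_CONFLICT"}


state: AgentState = {"incident_id": "pl_4492_deps", "trigger": "github_actions"}
update = classifier_node(state)
# Plain dict merge as a stand-in for the orchestrator applying the update.
merged: AgentState = {**state, **update}
print(merged)
```

Having each node return a partial update, rather than mutating shared state, keeps it easy to see which node wrote which field when a loop runs several times.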
NVIDIA NIM Optimization
To prevent CI/CD timeouts, the orchestrator routes smaller tasks (Classifier, Auditor) to ultra-fast local Llama-3-8B NIM endpoints, reserving 70B models strictly for deep analysis.
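The routing rule can be expressed as a small lookup table. In this sketch the model labels, base URLs, and the `pick_endpoint` helper are placeholders, not the project's configuration. Only the three agents whose tier the text implies are mapped; where the remaining agents run is not stated, so unknown names raise an error instead of guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    model: str
    base_url: str


# Assumed values: labels and URLs are illustrative placeholders.
FAST = Endpoint(model="llama-3-8b", base_url="http://localhost:8000/v1")
DEEP = Endpoint(model="llama-3-70b", base_url="http://localhost:8001/v1")

# Small, latency-sensitive tasks go to the 8B tier; deep analysis to 70B.
ROUTES = {
    "classifier": FAST,
    "auditor": FAST,
    "analyst": DEEP,
}


def pick_endpoint(agent: str) -> Endpoint:
    """Return the endpoint configured for an agent, or fail loudly."""
    key = agent.strip().lower()
    if key not in ROUTES:
        raise KeyError(f"no route configured for agent {agent!r}")
    return ROUTES[key]


for name in ("Classifier", "Auditor", "Analyst"):
    print(name, "->", pick_endpoint(name).model)
```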
├── backend/
│   ├── orchestrator.py   # LangGraph Core
│   ├── agents/           # Specialized Personas
│   │   ├── classifier.py
│   │   ├── auditor.py
│   │   └── strategist.py
│   ├── tools/            # k8s, bash execution
│   └── vector_store/     # ChromaDB
├── policies/             # Safety guardrail YAMLs
└── ui/                   # Real-time dashboard