We cut EKS costs 44% in one weekend. Here's exactly how.
Every FinOps tool stops at detection. We built a multi-agent workflow that closes the loop — AI agents that write the Helm and Terraform PRs, not just surface the findings.
The result
44% EKS cost reduction on a production cluster. Applied through real Helm and Terraform PRs. Zero performance regressions measured over 48 hours. Time from start to applied changes: one weekend.
Every other "AI cost tool" would have given us a dashboard. Ours gave us merged PRs.
The workflow
Four agents. One loop. Findings to merged PRs.
Each layer passes context to the next. No copy-paste between tools. No findings that sit in a dashboard for six months.
Investigate with AWS DevOps Agent
Pointed AWS DevOps Agent at the cluster with a single instruction: audit everything — CloudWatch history, node utilization, pod allocation, autoscaling behavior, HPA configs. The agent surfaced 20+ concrete findings ranked by impact. What used to be a two-month consulting engagement became a 90-minute read.
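At its core, the ranking step is impact-sorted triage. A minimal sketch in Python — the `Finding` shape, `rank_findings` helper, and savings figures are illustrative, not the agent's actual output schema:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    resource: str               # e.g. a node group or Deployment (hypothetical names)
    category: str               # e.g. "overprovisioned-requests"
    est_monthly_savings: float  # USD, estimated from usage history

def rank_findings(findings: list[Finding]) -> list[Finding]:
    """Order findings so the highest-impact fixes get read first."""
    return sorted(findings, key=lambda f: f.est_monthly_savings, reverse=True)

findings = [
    Finding("api-deployment", "overprovisioned-requests", 1800.0),
    Finding("batch-nodegroup", "idle-nodes", 4200.0),
    Finding("web-hpa", "loose-hpa-bounds", 950.0),
]
ranked = rank_findings(findings)
```

Impact-sorting is what turns 20+ findings into a 90-minute read: start at the top, stop when the numbers get small.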
Verify with Claude Code
Fed the findings into Claude Code, which had full context on the Helm charts and Terraform modules. Claude Code queried the live cluster to validate each finding — catching false positives before any change was proposed. The investigation layer and the remediation layer shared context, not just output.
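The verification gate boils down to one question: does live data agree with the claim? A hedged sketch — `verify_finding` and its tolerance are assumptions for illustration, with `live_p95` standing in for a metric pulled from the cluster:

```python
def verify_finding(claimed_usage: float, live_p95: float, rel_tol: float = 0.25) -> bool:
    """Confirm a finding only if live p95 usage agrees with the audit's
    claimed usage within rel_tol; otherwise treat it as a false positive."""
    if claimed_usage <= 0:
        return False
    return abs(live_p95 - claimed_usage) / claimed_usage <= rel_tol

# Audit claims the pod uses ~300m CPU; live data either confirms or refutes it.
confirmed = verify_finding(claimed_usage=300.0, live_p95=320.0)
false_positive = verify_finding(claimed_usage=300.0, live_p95=900.0)
```

Only confirmed findings move on to the remediation stage; the rest die here instead of in review.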
Remediate through IaC
Claude Code wrote the actual changes. Helm values updated. Terraform modules updated. Node pool sizes adjusted. HPA configs tightened. Resource requests and limits right-sized. Every change landed in infrastructure code, reviewable and revertible — not as a runtime hack that disappears on the next deploy.
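Right-sizing follows a simple rule of thumb: observed p95 usage plus headroom, rounded up to a scheduling-friendly step. A sketch — `right_size_cpu`, the 30% headroom, and the 50m step are illustrative assumptions, not the exact policy:

```python
import math

def right_size_cpu(observed_p95_millicores: float,
                   headroom: float = 0.3, step: int = 50) -> int:
    """New CPU request in millicores: p95 usage plus headroom,
    rounded up to the nearest step, never below one step."""
    target = observed_p95_millicores * (1 + headroom)
    return max(step, math.ceil(target / step) * step)

# Pod requests 2000m but its p95 usage is ~310m.
new_request = right_size_cpu(310)
```

The output lands in Helm values as `resources.requests.cpu`, which is why the change survives the next deploy instead of evaporating like a runtime patch would.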
Review, apply, verify
Changes shipped as normal pull requests. Reviewed by engineers. Applied with rollback ready. Metrics monitored for 48 hours across performance, latency, and availability. No regressions. Savings locked in.
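The 48-hour gate reduces to comparing post-change metrics against a pre-change baseline. A minimal sketch — `should_roll_back`, the metric names, and the slack thresholds are illustrative, not the production values:

```python
def should_roll_back(baseline: dict, current: dict,
                     latency_slack: float = 0.10, error_slack: float = 0.05) -> bool:
    """Trip the rollback if p95 latency or error rate regresses
    beyond the allowed slack during the watch window."""
    latency_regressed = (
        current["p95_latency_ms"] > baseline["p95_latency_ms"] * (1 + latency_slack)
    )
    errors_regressed = (
        current["error_rate"] > baseline["error_rate"] + error_slack
    )
    return latency_regressed or errors_regressed

baseline = {"p95_latency_ms": 120.0, "error_rate": 0.010}
healthy  = {"p95_latency_ms": 125.0, "error_rate": 0.011}   # within slack
degraded = {"p95_latency_ms": 200.0, "error_rate": 0.012}   # latency regression
```

If the gate stays quiet for the full window, the savings are considered locked in.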
Context flow
Findings → verification → IaC changes → review and apply. Each stage inherits the full context of the stage before it.
Where the waste hides
Typical waste distribution
Across the EKS clusters we've run this workflow on, waste concentrates in the same five buckets. Your numbers will vary — these are the categories to expect.
Percentages reflect share of waste, not share of total spend. Actual breakdown varies by workload shape and autoscaling maturity.
The uncomfortable part
Why not run this yourself?
The workflow sounds straightforward. The production-grade version isn't. Six things separate a rough prototype from something you can trust on real infra.
Blast-radius analysis on every proposed change
Before a change becomes a PR, we check what it touches, what depends on it, and what would break if it rolled out wrong.
Approval gates so senior engineers review before apply
Nothing applies to infrastructure without an explicit review step. Every change has a named approver.
Automated rollback if metrics regress after apply
Latency, error rate, and availability are watched for 48 hours. A real regression triggers automatic rollback.
Audit trail mapped to SOC 2, HIPAA, PCI controls
Every change is logged with the finding, the proposer, the approver, and the rollback path. Ready for audit evidence export.
Deep integration with Security Hub, Config, CI/CD, and Slack
Findings pull from the tools you already run. PRs land in your existing repos. Approvals happen where your team already works.
Continuous monitoring so savings do not regress six months later
We keep watching after the fix lands. If drift creeps back in, we catch it early instead of at the next finance review.
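Of the six, blast-radius analysis is the most mechanical: walk the reverse dependency graph from the changed resource. A sketch with a hypothetical, hand-written dependency map — in a real setup the edges would be derived from sources like Terraform state and Helm release metadata:

```python
from collections import deque

def blast_radius(changed: str, depends_on: dict[str, list[str]]) -> set[str]:
    """Everything that transitively depends on the changed resource —
    the set to evaluate before the change becomes a PR."""
    # Invert the map: for each resource, who depends on it?
    dependents: dict[str, list[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(node)
    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for dependent in dependents.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

deps = {
    "web-deployment": ["app-nodegroup"],
    "worker-deployment": ["app-nodegroup"],
    "web-hpa": ["web-deployment"],
}
impact = blast_radius("app-nodegroup", deps)
```

Resizing the node group here touches both Deployments and, through one of them, the HPA — exactly the set a reviewer needs in front of them.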
We've spent a year building the safety layer around the workflow. You get to use it day one.
Frequently asked
Questions engineers actually ask
How long does it take? 30 minutes on the call, then 2–3 days for the full findings report delivered back to you.
Run this on your cluster.
Free assessment. 30 minutes on the call. 2–3 days for the full findings report. No commitment to remediate anything you don't approve.