Case study — EKS remediation workflow

We cut EKS costs 44% in one weekend. Here's exactly how.

Every FinOps tool stops at detection. We built a multi-agent workflow that closes the loop — AI agents that write the Helm and Terraform PRs, not just surface the findings.

44% EKS cost cut

The result

44% EKS cost reduction on a production cluster. Applied through real Helm and Terraform PRs. Zero performance regressions measured over 48 hours. Time from start to applied changes: one weekend.

Every other "AI cost tool" would have given us a dashboard. Ours gave us merged PRs.

The workflow

Four layers. One loop. Findings to merged PRs.

Each layer passes context to the next. No copy-paste between tools. No findings that sit in a dashboard for six months.

Step 01: Investigation layer

Investigate with AWS DevOps Agent

We pointed AWS DevOps Agent at the cluster with a single instruction: audit everything, including CloudWatch history, node utilization, pod allocation, autoscaling behavior, and HPA configs. The agent surfaced 20+ concrete findings ranked by impact. What used to be a two-month consulting engagement became a 90-minute read.

Step 02: Validation layer

Verify with Claude Code

We fed the findings into Claude Code, which had full context over the Helm charts and Terraform modules. Claude Code then queried the live cluster to validate each finding, catching false positives before any change was proposed. The investigation layer and the remediation layer shared context, not just output.
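One way to picture "catching false positives" is to re-check each finding against fresh utilization samples before it moves on. The following is a minimal sketch under stated assumptions: the `Finding` fields, the `tolerance` value, and the sample data are all hypothetical, and the checks the workflow actually ran are not described in this case study.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Finding:
    """Hypothetical shape of one audit finding (not a real agent output format)."""
    resource: str
    claim: str               # e.g. "cpu_underutilized"
    claimed_avg_util: float  # utilization the audit reported, 0.0 to 1.0


def is_confirmed(finding: Finding, live_samples: list[float],
                 tolerance: float = 0.10) -> bool:
    """Confirm a finding only if live data agrees with the audit within `tolerance`."""
    if not live_samples:
        return False  # no evidence: treat as unconfirmed rather than guess
    observed = mean(live_samples)
    return abs(observed - finding.claimed_avg_util) <= tolerance


findings = [
    Finding("deploy/api", "cpu_underutilized", 0.15),
    Finding("deploy/batch", "cpu_underutilized", 0.10),
]
# Illustrative live samples; in practice these would come from cluster metrics.
live = {
    "deploy/api": [0.12, 0.18, 0.14],
    "deploy/batch": [0.55, 0.70, 0.62],  # busy right now: likely a false positive
}
confirmed = [f.resource for f in findings
             if is_confirmed(f, live.get(f.resource, []))]
print(confirmed)
```

Only confirmed findings would move on to remediation. Treating "no live data" as unconfirmed is a deliberately conservative choice for this sketch.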

Step 03: Remediation layer

Remediate through IaC

Claude Code wrote the actual changes. Helm values updated. Terraform modules updated. Node pool sizes adjusted. HPA configs tightened. Resource requests and limits right-sized. Every change landed in infrastructure code, reviewable and revertible — not as a runtime hack that disappears on the next deploy.
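To make "right-sized" concrete, here is a minimal sketch of one common heuristic: set the CPU request to a high percentile of observed usage plus headroom. The 95th percentile, the 20% headroom, the 10m rounding, and the usage samples are assumptions for illustration, not the values used in this engagement.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile; `pct` is in (0, 100]."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def recommend_cpu_request(usage_millicores: list[float],
                          headroom: float = 0.20) -> int:
    """Suggest a CPU request: p95 of observed usage plus headroom, rounded up to 10m."""
    target = percentile(usage_millicores, 95) * (1 + headroom)
    return int(math.ceil(target / 10) * 10)


# Hypothetical per-pod CPU usage samples, in millicores.
usage = [110, 120, 95, 130, 140, 125, 118, 122, 135, 128]
request = recommend_cpu_request(usage)

# Shape of the Helm values fragment such a change might touch (illustrative only).
helm_values = {"resources": {"requests": {"cpu": f"{request}m"}}}
print(helm_values)
```

The point of emitting a values fragment rather than patching the running pod is the same as in the text above: the change lives in reviewable, revertible code.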

Step 04: Human control layer

Review, apply, verify

Changes shipped as normal pull requests. Reviewed by engineers. Applied with rollback ready. Metrics monitored for 48 hours across performance, latency, and availability. No regressions. Savings locked in.

Context flow

Cluster → DevOps Agent → Claude Code → IaC repo → Merged PR

Where the waste hides

Typical waste distribution

Across EKS clusters we've run this workflow on, waste clusters tightly into the same five buckets. Your numbers will vary — these are the categories to expect.

Oversized node pools: ~30%
Non-optimal instance families: ~25%
Misconfigured HPA / autoscaling: ~20%
Stale or idle workloads: ~15%
Misallocated resource requests / limits: ~10%

Percentages reflect share of waste, not share of total spend. Actual breakdown varies by workload shape and autoscaling maturity.
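Because the percentages are shares of waste, they apply to an estimated waste figure, not to the total bill. A short worked sketch, using a purely hypothetical waste estimate of $10,000 per month:

```python
# Approximate shares of waste from the list above; they sum to 100%.
WASTE_SHARES = {
    "Oversized node pools": 0.30,
    "Non-optimal instance families": 0.25,
    "Misconfigured HPA / autoscaling": 0.20,
    "Stale or idle workloads": 0.15,
    "Misallocated resource requests / limits": 0.10,
}


def split_waste(total_waste: float) -> dict[str, float]:
    """Distribute an estimated waste amount across the five buckets."""
    return {bucket: round(total_waste * share, 2)
            for bucket, share in WASTE_SHARES.items()}


# Hypothetical figure for illustration; not a number from this case study.
estimated_monthly_waste = 10_000.0
for bucket, amount in split_waste(estimated_monthly_waste).items():
    print(f"{bucket}: ${amount:,.2f}")
```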

The uncomfortable part

Why not run this yourself?

The workflow sounds straightforward. The production-grade version isn't. Six things separate a rough prototype from something you can trust on real infra.

Blast-radius analysis on every proposed change

Before a change becomes a PR, we check what it touches, what depends on it, and what would break if it rolled out wrong.
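A minimal sketch of the idea, assuming dependencies are available as a simple "X depends on Y" map: invert the map and walk it breadth-first from the changed resource. The resource names and the graph are hypothetical, and a production analysis would draw on far richer data than this.

```python
from collections import deque


def blast_radius(depends_on: dict[str, set[str]], changed: str) -> set[str]:
    """Return everything that directly or transitively depends on `changed`."""
    # Invert the graph: for each resource, record what depends on it.
    dependents: dict[str, set[str]] = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    seen: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    seen.discard(changed)  # guard against cycles leading back to the start
    return seen


# Hypothetical dependency map: key depends on every item in its set.
graph = {
    "svc/checkout": {"deploy/api", "nodegroup/general"},
    "deploy/api": {"nodegroup/general"},
    "deploy/batch": {"nodegroup/spot"},
}
affected = blast_radius(graph, "nodegroup/general")
print(sorted(affected))
```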

Approval gates so senior engineers review before apply

Nothing applies to infrastructure without an explicit review step. Every change has a named approver.

Automated rollback if metrics regress after apply

Latency, error rate, and availability are watched for 48 hours. A real regression triggers automatic rollback.
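The rollback decision can be sketched as a comparison between a pre-change baseline and post-apply metrics. The thresholds and metric values below are illustrative assumptions; the case study does not state what counts as a regression.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    p95_latency_ms: float
    error_rate: float    # fraction of requests, 0.0 to 1.0
    availability: float  # fraction of time, 0.0 to 1.0


def regressions(baseline: Metrics, current: Metrics,
                max_latency_increase: float = 0.10,
                max_error_rate_increase: float = 0.005,
                max_availability_drop: float = 0.001) -> list[str]:
    """List which metrics regressed past their (assumed) thresholds."""
    reasons = []
    if current.p95_latency_ms > baseline.p95_latency_ms * (1 + max_latency_increase):
        reasons.append("latency")
    if current.error_rate - baseline.error_rate > max_error_rate_increase:
        reasons.append("error_rate")
    if baseline.availability - current.availability > max_availability_drop:
        reasons.append("availability")
    return reasons


baseline = Metrics(p95_latency_ms=200.0, error_rate=0.002, availability=0.9995)
healthy = Metrics(p95_latency_ms=205.0, error_rate=0.002, availability=0.9995)
degraded = Metrics(p95_latency_ms=260.0, error_rate=0.002, availability=0.9995)

for label, observed in (("healthy", healthy), ("degraded", degraded)):
    failed = regressions(baseline, observed)
    print(label, "-> roll back" if failed else "-> keep", failed)
```

Returning the list of failed metrics, rather than a bare boolean, keeps the reason for a rollback available for the audit trail.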

Audit trail mapped to SOC 2, HIPAA, PCI controls

Every change is logged with the finding, the proposer, the approver, and the rollback path. Ready for audit evidence export.
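As a rough sketch, an audit entry carrying the four fields named above could be modeled and exported like this. The class name, field names, sample values, and the JSON Lines export format are assumptions; mapping entries to specific SOC 2, HIPAA, or PCI controls is left out because it depends on your control catalog.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    """Hypothetical audit entry: one per applied change."""
    finding: str
    proposer: str
    approver: str
    rollback_path: str

    def __post_init__(self) -> None:
        # Mirror the approval gate: no change without a named approver.
        if not self.approver.strip():
            raise ValueError("every change needs a named approver")


def export_jsonl(records: list[AuditRecord]) -> str:
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)


record = AuditRecord(
    finding="nodegroup/general oversized",       # illustrative values only
    proposer="claude-code",
    approver="jane.doe",
    rollback_path="git revert <commit>",
)
print(export_jsonl([record]))
```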

Deep integration with Security Hub, Config, CI/CD, and Slack

Findings pull from the tools you already run. PRs land in your existing repos. Approvals happen where your team already works.

Continuous monitoring so savings do not regress six months later

We keep watching after the fix lands. If drift creeps back in, we catch it early instead of at the next finance review.
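Drift detection can be as simple as diffing the approved configuration against what is currently running. This is a minimal sketch, assuming both sides are available as nested dictionaries; the keys and values are hypothetical.

```python
def find_drift(approved: dict, current: dict, prefix: str = "") -> dict[str, tuple]:
    """Return {dotted.path: (approved_value, current_value)} for every difference."""
    drift: dict[str, tuple] = {}
    for key in sorted(set(approved) | set(current)):
        path = f"{prefix}.{key}" if prefix else key
        a, c = approved.get(key), current.get(key)
        if isinstance(a, dict) and isinstance(c, dict):
            drift.update(find_drift(a, c, path))  # recurse into nested sections
        elif a != c:
            drift[path] = (a, c)
    return drift


# Hypothetical approved state (from the IaC repo) versus observed state.
approved = {
    "nodegroup": {"desired_size": 4, "instance_type": "m6i.large"},
    "hpa": {"max_replicas": 10},
}
current = {
    "nodegroup": {"desired_size": 7, "instance_type": "m6i.large"},
    "hpa": {"max_replicas": 10},
}
print(find_drift(approved, current))
```

Note that this sketch cannot tell a missing key from a key explicitly set to `None`; a real check would need to distinguish the two.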

We've spent a year building the safety layer around the workflow. You get to use it on day one.

Run this on your cluster.

Free assessment. 30 minutes on the call. 2–3 days for the full findings report. No commitment to remediate anything you don't approve.