Cloud Infrastructure Audit Checklist 2026 | Find & Fix Costly IT Mistakes

Most CTOs spend 40–60% more on cloud infrastructure than they need to. The waste rarely comes from bad decisions. It comes from defaults that were never revisited, engineers who spun up resources and moved on, and licensing that auto-renewed without anyone reviewing it. A one-day infrastructure audit typically uncovers $50K–$300K in annual savings for companies spending $200K+ a year on cloud. This checklist shows you where to look.

Why Infrastructure Debt Accumulates (And How Fast)

Cloud resources are easy to create and easy to forget. A developer spins up an EC2 instance to test something, the project wraps up, and the instance keeps running for 18 months. A team buys Zoom and Webex for different departments. A terminated employee's Microsoft 365 license stays active because offboarding didn't include an IT step.

None of this is negligence — it's the natural result of teams moving fast. But the bill compounds. Here's how waste typically breaks down:

Compute overprovisioning: 30–40% of average cloud bills
Idle/forgotten resources: 10–15%
Wrong storage tier: 5–10%
Missing Reserved Instances or Savings Plans: 20–30% on eligible workloads
Software license waste: 15–20% of SaaS spend

The good news: most of this is fixable in 30 days with zero architectural changes.

Before You Start: What to Gather

The audit goes faster if you have these ready:

Last 90 days of cloud billing from AWS Cost Explorer or Azure Cost Management
List of all cloud accounts, including sandbox and test accounts
Software license inventory — especially Microsoft 365, any per-seat SaaS
Current headcount vs licensed seats for each major tool
Last deployment dates for all production and dev environments

The 10 Most Expensive IT Mistakes

Mistake #1: Oversized Instances

Running an m5.xlarge when your workload only needs an m5.large costs exactly twice as much — $0.192/hr vs $0.096/hr. At 730 hours/month, that's $70/month per wasted instance. Most companies have 10–30 instances in this situation.
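To put a number on your own fleet, the arithmetic above can be sketched in a few lines of Python. This is a rough illustration rather than a pricing tool: the hourly rates are the On-Demand figures quoted above, while the function name and the 10 and 30 instance counts are placeholders taken from the typical range.

```python
HOURS_PER_MONTH = 730  # average hours in a month, as used above

# On-Demand hourly rates quoted in the text (assumed to match your region)
M5_XLARGE_HOURLY = 0.192
M5_LARGE_HOURLY = 0.096


def monthly_oversizing_cost(current_hourly, rightsized_hourly, instance_count=1):
    """Monthly cost of running a larger instance type than the workload needs."""
    return (current_hourly - rightsized_hourly) * HOURS_PER_MONTH * instance_count


per_instance = monthly_oversizing_cost(M5_XLARGE_HOURLY, M5_LARGE_HOURLY)
fleet_low = monthly_oversizing_cost(M5_XLARGE_HOURLY, M5_LARGE_HOURLY, 10)
fleet_high = monthly_oversizing_cost(M5_XLARGE_HOURLY, M5_LARGE_HOURLY, 30)

print(f"Per instance: ${per_instance:,.2f}/month")
print(f"10 to 30 instances: ${fleet_low:,.2f} to ${fleet_high:,.2f}/month")
```

Swap in the rates for your own instance families to see what the same pattern costs you.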

How to find it: Enable AWS Compute Optimizer (free). By default it analyzes the last 14 days of CPU and network utilization (and memory, if the CloudWatch agent is installed) and recommends which instances are oversized and what to downsize them to.

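# AWS: Check average CPU for a specific instance over 14 days (returns one 14-day datapoint)
# The -v-14d flag is BSD/macOS date syntax; on GNU/Linux use: date -u -d '14 days ago' +%Y-%m-%dT%H:%M:%SZ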
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --start-time $(date -u -v-14d +%Y-%m-%dT%H:%M:%SZ) \
  --end-time $(date -u +%Y-%m-%dT%H:%M:%SZ) \
  --period 1209600 \
  --statistics Average \
  --dimensions Name=InstanceId,Value=i-XXXXXXXXX

If average CPU is below 20% over 14 days, the instance is a candidate for rightsizing. Compute Optimizer automates this across your entire account.
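If you export utilization data yourself, the 20% rule is easy to apply in bulk. The sketch below assumes you have already collected CPU percentages per instance (for example from the CloudWatch call above). The instance IDs, sample values, and function name are hypothetical.

```python
from statistics import mean

CPU_THRESHOLD_PCT = 20.0  # rightsizing threshold from the text


def rightsizing_candidates(cpu_samples, threshold=CPU_THRESHOLD_PCT):
    """Return instance IDs whose average CPU is below the threshold."""
    return sorted(
        instance_id
        for instance_id, samples in cpu_samples.items()
        if samples and mean(samples) < threshold
    )


# Hypothetical data: instance ID -> daily average CPU percentages
cpu_samples = {
    "i-aaa": [5, 12, 9],
    "i-bbb": [45, 60, 52],
    "i-ccc": [18, 25, 15],
}

candidates = rightsizing_candidates(cpu_samples)
print(candidates)
```

Average CPU is only a first filter. Check memory and peak load before you resize anything.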

Azure equivalent: Open Azure Advisor → Cost → right-size or shut down underutilized virtual machines.

Mistake #2: Idle Resources Running 24/7

A single t3.medium dev instance running 24/7 costs about $365/year ($0.0416/hr On-Demand in us-east-1). Multiply by 10–20 dev/test environments that only see activity during business hours and you're paying $3,650–$7,300 a year, most of it for hours when nobody is using them.

How to find it:

# Find all running EC2 instances tagged Environment=dev or Environment=test
# (tag values are case-sensitive; adjust the key and values to your tagging scheme)
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
            "Name=tag:Environment,Values=dev,test" \
  --query 'Reservations[*].Instances[*].
    [InstanceId, LaunchTime,
     Tags[?Key==`Environment`].Value | [0],
     Tags[?Key==`Name`].Value | [0]]' \
  --output table

Cross-reference the launch date against your deployment logs. Any dev/test instance that hasn't had a deployment in 30+ days is a candidate for termination or scheduled shutdown.
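The cross-reference itself is a simple date comparison. Here is a minimal sketch, assuming you can export the last deployment time per instance from your deployment logs. The instance IDs, dates, and function name are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_instances(last_deployments, now, max_idle_days=30):
    """Return instance IDs whose last deployment is at least max_idle_days old."""
    cutoff = now - timedelta(days=max_idle_days)
    return sorted(
        instance_id
        for instance_id, deployed_at in last_deployments.items()
        if deployed_at <= cutoff
    )


# Hypothetical data: instance ID -> timestamp of its most recent deployment
now = datetime(2026, 7, 1, tzinfo=timezone.utc)
last_deployments = {
    "i-dev-api": datetime(2026, 6, 25, tzinfo=timezone.utc),
    "i-dev-batch": datetime(2026, 4, 2, tzinfo=timezone.utc),
    "i-test-web": datetime(2026, 5, 30, tzinfo=timezone.utc),
}

stale = stale_instances(last_deployments, now)
print(stale)
```

Treat the result as a review list, not a kill list. Confirm with the owning team before terminating anything.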

Fix: Implement Lambda-based auto-shutdown for non-production environments outside business hours (e.g., off at 7pm, on at 8am weekdays). Instances then run 55 of the 168 hours in a week, which alone cuts dev/test compute costs by roughly 65%.
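You can check what a schedule saves before building it. This sketch assumes the 8am to 7pm weekday schedule from the example and compute billed per hour of runtime. The function names are illustrative.

```python
HOURS_PER_WEEK = 24 * 7


def weekly_on_hours(start_hour, stop_hour, days_per_week):
    """Hours per week an instance runs under a daily start/stop schedule."""
    return (stop_hour - start_hour) * days_per_week


def schedule_savings(start_hour=8, stop_hour=19, days_per_week=5):
    """Fraction of always-on compute cost avoided by the schedule."""
    on_hours = weekly_on_hours(start_hour, stop_hour, days_per_week)
    return 1 - on_hours / HOURS_PER_WEEK


savings = schedule_savings()
print(f"On {weekly_on_hours(8, 19, 5)} of {HOURS_PER_WEEK} hours, saving {savings:.0%}")
```

Storage attached to stopped instances still bills, so the saving applies to compute hours only.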

Mistake #3: Wrong Storage Tier

S3 Standard costs $0.023/GB/month. S3 Glacier Instant Retrieval costs $0.004/GB/month — 83% cheaper for data accessed less than once per quarter. Most companies have terabytes of logs, backups, and archival data sitting in Standard.
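For a quick estimate of what a tier change is worth, multiply it out. The sketch below uses the two per-GB prices quoted above and a hypothetical 10 TB of cold data. It ignores retrieval, request, and minimum-storage-duration charges, which you should check before moving anything.

```python
# Per-GB monthly prices quoted in the text
S3_STANDARD_PER_GB = 0.023
GLACIER_IR_PER_GB = 0.004


def monthly_storage_cost(gigabytes, price_per_gb):
    """Storage-only monthly cost; excludes request and retrieval fees."""
    return gigabytes * price_per_gb


def savings_pct(current_price, target_price):
    """Percentage saved per GB by moving from one tier to another."""
    return (1 - target_price / current_price) * 100


cold_data_gb = 10_000  # hypothetical: about 10 TB of logs and backups
standard_cost = monthly_storage_cost(cold_data_gb, S3_STANDARD_PER_GB)
glacier_cost = monthly_storage_cost(cold_data_gb, GLACIER_IR_PER_GB)
pct = savings_pct(S3_STANDARD_PER_GB, GLACIER_IR_PER_GB)

print(f"Standard: ${standard_cost:,.0f}/month, Glacier IR: ${glacier_cost:,.0f}/month")
print(f"Savings: {pct:.0f}%")
```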

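How to find it: Review your S3 Storage Lens dashboard. S3 does not record a last-accessed date per object, so use the activity metrics instead (these may require the advanced metrics tier): look for buckets that hold a lot of Standard-class storage but have seen few or no GET requests in the past 90 days.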

Quick fix: Add a lifecycle rule that moves objects older than 3 months into S3 Intelligent-Tiering. It automatically shifts objects between access tiers based on usage, with little operational overhead and no retrieval fees, in exchange for a small per-object monitoring charge.

For Azure: check Blob Storage access tiers. Any blob not accessed in 30+ days is a candidate for the Cool tier; one not accessed in 90+ days is a candidate for Archive, provided you can tolerate rehydration delays of several hours.

Mistake #4: Data Transfer Costs Nobody Is Watching

AWS charges about $0.09/GB for data leaving to the internet and roughly $0.02/GB between regions, while traffic within the same Availability Zone is free (cross-AZ traffic inside a region costs $0.01/GB in each direction). Applications that serve large assets directly from S3 (instead of CloudFront), or that transfer data across regions unnecessarily, generate line items that look inexplicable on the bill.

# Find your internet data transfer (egress) costs for June 2026
# The End date is exclusive, so End=2026-07-01 covers through June 30
aws ce get-cost-and-usage \
  --time-period Start=2026-06-01,End=2026-07-01 \
  --granularity MONTHLY \
  --filter '{"Dimensions":{"Key":"USAGE_TYPE_GROUP",
    "Values":["EC2: Data Transfer - Internet (Out)"]}}' \
  --metrics BlendedCost

If this is above $500/month and you're not serving video content, investigate what's generating egress. Common culprits: application logs being shipped cross-region, S3 downloads not going through CloudFront, database replicas in the wrong region.
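It helps to translate the dollar figure back into volume so you know what you're hunting for. A minimal sketch, assuming the $0.09/GB internet egress rate quoted earlier applies to the whole amount (tiered pricing and free allowances are ignored):

```python
INTERNET_EGRESS_PER_GB = 0.09  # rate quoted in the text


def egress_gb(monthly_cost, price_per_gb=INTERNET_EGRESS_PER_GB):
    """Approximate GB transferred out for a given monthly egress charge."""
    return monthly_cost / price_per_gb


threshold_gb = egress_gb(500)
print(f"$500/month is about {threshold_gb:,.0f} GB, or {threshold_gb / 30:,.0f} GB/day")
```

If no single application plausibly moves that much data per day, the traffic is probably coming from something nobody is watching.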

Mistake #5: Microsoft 365 License Waste

At $22/user/month (Business Premium), a 100-person company spends $26,400/year on M365. When 12–15% of licenses belong to inactive accounts — terminated employees, contractors who finished, role changes that never triggered a license downgrade — that's $3,168–$3,960/year in pure waste.
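The same calculation works for any per-seat tool. Here it is as a small sketch using the numbers above. The function names are illustrative, and the inactive share is whatever your own license report shows.

```python
def annual_license_spend(seats, price_per_seat_month):
    """Yearly cost of a per-seat subscription."""
    return seats * price_per_seat_month * 12


def annual_license_waste(seats, price_per_seat_month, inactive_share):
    """Yearly spend on seats assigned to inactive accounts."""
    return annual_license_spend(seats, price_per_seat_month) * inactive_share


spend = annual_license_spend(100, 22)
waste_low = annual_license_waste(100, 22, 0.12)
waste_high = annual_license_waste(100, 22, 0.15)

print(f"Annual spend: ${spend:,.0f}")
print(f"Waste at 12% to 15% inactive: ${waste_low:,.0f} to ${waste_high:,.0f}")
```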

How to find it (PowerShell):

# Find disabled accounts that still have active licenses
# Requires the Microsoft.Graph PowerShell module
# AssignedLicenses is a collection, so it is flattened to SKU IDs for a readable CSV
Connect-MgGraph -Scopes "User.Read.All"

Get-MgUser -All -Property DisplayName,UserPrincipalName,AccountEnabled,AssignedLicenses |
  Where-Object {
    $_.AssignedLicenses.Count -gt 0 -and
    $_.AccountEnabled -eq $false
  } |
  Select-Object DisplayName, UserPrincipalName,
    @{Name = 'LicenseSkuIds'; Expression = { $_.AssignedLicenses.SkuId -join ';' }} |
  Export-Csv -Path "inactive-licensed-users.csv" -NoTypeInformation

Run this quarterly and feed the output into your HR offboarding checklist. Also check for users who haven't signed in within 90 days (via the signInActivity property, which needs the AuditLog.Read.All scope); they may have left without formal offboarding.

Beyond M365: audit every per-seat SaaS tool. Salesforce, HubSpot, GitHub, Figma, and Notion all have admin dashboards that show last-active date per seat. Run the same exercise on your 5 most expensive SaaS tools.

Mistake #6: No Reserved Instances or Savings Plans

If you've been running the same EC2 instances for more than 12 months on On-Demand pricing, you're paying a 30–60% premium for no reason. Reserved Instances and Compute Savings Plans require a 1–3 year commitment, but for stable workloads, the math is straightforward.

Example: m5.2xlarge in us-east-1

On-Demand: $0.384/hr = $280/month
1-year Reserved (No Upfront): $0.238/hr = $174/month
Annual savings: about $1,279 per instance
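The hourly rates in this example convert to monthly and annual figures as follows. This is a sketch that assumes 730 hours per month and an instance that runs around the clock. The rates are the ones quoted above and should be checked against current pricing.

```python
HOURS_PER_MONTH = 730

ON_DEMAND_HOURLY = 0.384       # m5.2xlarge On-Demand rate from the example
RESERVED_1YR_HOURLY = 0.238    # 1-year No Upfront rate from the example


def monthly_cost(hourly_rate):
    """Monthly cost of one always-on instance at the given hourly rate."""
    return hourly_rate * HOURS_PER_MONTH


on_demand_monthly = monthly_cost(ON_DEMAND_HOURLY)
reserved_monthly = monthly_cost(RESERVED_1YR_HOURLY)
annual_savings = (on_demand_monthly - reserved_monthly) * 12
discount = 1 - RESERVED_1YR_HOURLY / ON_DEMAND_HOURLY

print(f"On-Demand: ${on_demand_monthly:,.2f}/month")
print(f"Reserved:  ${reserved_monthly:,.2f}/month")
print(f"Annual savings per instance: ${annual_savings:,.2f} ({discount:.0%} off)")
```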

How to find it: AWS Cost Explorer → Recommendations → Reserved Instance Recommendations. It shows you exactly which instances to reserve and projects the annual savings. Typical finding for companies with $50K+ monthly AWS bills: $150K–$400K in annual savings from RI purchases alone.

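Rule of thumb: Reserve a workload only if it runs for more of the day than the reservation's breakeven point. At the rates above ($0.238 reserved vs $0.384 On-Demand), that is about 15 hours/day; anything that runs less is cheaper On-Demand or on a schedule. Always-on production application servers, databases, and analytics clusters usually qualify.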
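One way to sanity-check a reservation is to compute its breakeven point: a reserved instance bills for every hour whether or not it runs, so it only pays off once the workload would cost more at On-Demand rates. A minimal sketch using the m5.2xlarge rates from the example (the function name is illustrative):

```python
def breakeven_hours_per_day(on_demand_hourly, reserved_hourly):
    """Daily runtime above which a reservation is cheaper than On-Demand.

    A reservation costs reserved_hourly * 24 per day regardless of usage;
    On-Demand costs on_demand_hourly * hours_run.
    """
    return 24 * reserved_hourly / on_demand_hourly


hours = breakeven_hours_per_day(on_demand_hourly=0.384, reserved_hourly=0.238)
print(f"Breakeven: {hours:.1f} hours/day")
```

Deeper discounts (3-year terms, upfront payment) lower the breakeven, so rerun the numbers for the option you are actually considering.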

Mistake #7: No Auto-Scaling on Variable Workloads

Static instance groups sized for peak traffic waste money for most of the day. If your API tier handles 10x more requests at 2pm than at 2am and you're running a fixed 10-instance fleet, you're paying for 9 instances that are largely idle overnight.

What to check:

Any application or API server tier with more than 2 instances that isn't behind an Auto Scaling Group (AWS) or VM Scale Set (Azure)
Load balancers pointing to a fixed instance list
Kubernetes node groups without cluster autoscaler enabled

Auto-scaling on a typical web/API tier reduces compute costs by 35–50% while improving availability during traffic spikes. It's one of the highest-ROI infrastructure changes you can make.
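To see where a figure in that range comes from, compare instance-hours for a fixed fleet against a fleet that follows demand. The daily profile and the two-instance floor below are assumptions for illustration, not measurements. Replace them with your own traffic data.

```python
PEAK_FLEET = 10      # instances needed at peak, as in the example above
MIN_INSTANCES = 2    # assumed floor kept running for availability

# Hypothetical daily profile: demand per hour as a percentage of peak
hourly_demand_pct = [10] * 8 + [50] * 8 + [100] * 8


def instances_needed(demand_pct):
    """Instances required at a demand level, rounded up, never below the floor."""
    scaled = -(-(demand_pct * PEAK_FLEET) // 100)  # integer ceiling division
    return max(MIN_INSTANCES, scaled)


fixed_hours = PEAK_FLEET * len(hourly_demand_pct)
scaled_hours = sum(instances_needed(pct) for pct in hourly_demand_pct)
savings = 1 - scaled_hours / fixed_hours

print(f"Fixed fleet: {fixed_hours} instance-hours/day")
print(f"Auto-scaled: {scaled_hours} instance-hours/day ({savings:.0%} less)")
```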

Mistake #8: Untagged Resources

Untagged resources aren't directly waste — but they're a reliable signal that unmanaged resources exist. If you can't answer "which team owns this and what does it do?" for every resource in your cloud account, you have ghost resources that nobody will notice are wasted.

# AWS: Find EC2 instances with no tags at all or with no Owner tag
aws ec2 describe-instances \
  --query 'Reservations[*].Instances[?!Tags ||
    length(Tags[?Key==`Owner`]) == `0`]
    .[InstanceId, InstanceType, State.Name]' \
  --output table

Enforce four tags on every resource: Environment, Owner, Application, CostCenter. Use AWS Config Rules or Azure Policy to flag non-compliant resources automatically. This pays for itself in the first incident where you need to know "who owns this thing that's generating $800/month in egress costs."
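If you want a quick compliance report before setting up Config Rules or Azure Policy, the check is easy to script. This sketch assumes tags in the list-of-dicts shape that AWS APIs return. The sample resources and the function name are hypothetical.

```python
REQUIRED_TAGS = ("Environment", "Owner", "Application", "CostCenter")


def missing_tags(aws_tags):
    """Return the required tag keys that are absent or empty.

    aws_tags is a list like [{"Key": "Owner", "Value": "platform-team"}],
    or None when the resource has no tags at all.
    """
    present = {tag["Key"]: tag.get("Value", "") for tag in (aws_tags or [])}
    return [key for key in REQUIRED_TAGS if not present.get(key)]


# Hypothetical resources
partially_tagged = [
    {"Key": "Owner", "Value": "data-team"},
    {"Key": "Environment", "Value": ""},
]

print(missing_tags(partially_tagged))
print(missing_tags(None))
```

An empty value counts as missing here, since a blank Owner tag answers nothing when the bill arrives.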

Mistake #9: Shadow IT SaaS Subscriptions

The average company uses 130–210 SaaS applications. Your IT team knows about maybe 40 of them. The rest are credit card purchases by individual teams — Notion, Figma, Loom, Miro, Zapier, Calendly, Monday.com — often with multiple subscriptions in the same category from different teams that never compared notes.

How to find it:

Pull a report from corporate cards of all recurring charges under $1,000/month
Cross-reference against your approved vendor list
Survey all team leads: "What do you pay for with the company card?"
Check for duplicate tools — two project management tools, two video platforms, two note-taking tools
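Once the card report is in a spreadsheet, spotting duplicates is a grouping exercise. The sketch below assumes you have labeled each subscription with a category by hand. The tool list, categories, and costs are invented for illustration.

```python
from collections import defaultdict


def duplicate_categories(subscriptions):
    """Return categories that contain more than one distinct tool.

    subscriptions is a list of (tool, category, monthly_cost) tuples.
    """
    tools_by_category = defaultdict(set)
    for tool, category, _monthly_cost in subscriptions:
        tools_by_category[category].add(tool)
    return {
        category: sorted(tools)
        for category, tools in tools_by_category.items()
        if len(tools) > 1
    }


# Hypothetical card export, categorized manually
subscriptions = [
    ("Zoom", "video", 300),
    ("Webex", "video", 250),
    ("Notion", "notes", 120),
    ("Asana", "project management", 400),
    ("Monday.com", "project management", 350),
    ("Miro", "whiteboard", 90),
]

duplicates = duplicate_categories(subscriptions)
print(duplicates)
```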

Typical finding: 15–20% of SaaS spend is redundant or unused. In a 50-person company spending $50K/year on SaaS, that's $7,500–$10,000 recoverable.

Mistake #10: Infrastructure Without Code

If your infrastructure wasn't built with Terraform, Bicep, Pulumi, or another IaC tool, you have configuration drift. Servers that were manually patched, security groups that got an extra rule "just to fix something quickly," database configs that differ between staging and production.

The hidden cost isn't the infrastructure itself. It's the incident response time when something breaks in a way that's impossible to reproduce. A 3-hour production incident caused by an undocumented config change can cost more than 3 months of IaC migration work.

Quick self-assessment: Can you destroy your entire staging environment and recreate it — identically — in under 30 minutes? If the answer is no, you have drift, and it will eventually cause an outage.

How to Run the Audit: A One-Day Schedule

Morning (4 Hours): Data Collection

Pull 90-day cost breakdown from AWS Cost Explorer / Azure Cost Management — group by service, then by resource tag
Run Compute Optimizer (AWS) or Azure Advisor and export the recommendations
Export the M365 license report and identify inactive/disabled users
List every resource that has been running for more than 6 months with its monthly cost
Pull the corporate card SaaS subscription list

Afternoon (4 Hours): Analysis and Prioritization

Assign an estimated monthly saving and an implementation effort (Low/Medium/High) to each finding
Sort by monthly savings ÷ effort weight (for example Low = 1, Medium = 2, High = 3), highest first
Categorize: immediate actions (no code change needed) vs. 30-day plan (requires testing) vs. 90-day project (architectural)
Draft the remediation list with owners and deadlines
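The scoring step above can be sketched as a short script. The effort weights (Low = 1, Medium = 2, High = 3) are an assumption, and the findings are made-up examples. Tune both to how your team estimates work.

```python
EFFORT_WEIGHT = {"Low": 1, "Medium": 2, "High": 3}  # assumed weights


def prioritize(findings):
    """Sort findings by monthly savings per unit of effort, best first."""
    return sorted(
        findings,
        key=lambda f: f["monthly_savings"] / EFFORT_WEIGHT[f["effort"]],
        reverse=True,
    )


# Hypothetical audit findings
findings = [
    {"name": "Reclaim unused M365 licenses", "monthly_savings": 400, "effort": "Low"},
    {"name": "Add auto-scaling to API tier", "monthly_savings": 1500, "effort": "High"},
    {"name": "Move logs to a cheaper tier", "monthly_savings": 600, "effort": "Medium"},
]

ranked = prioritize(findings)
for finding in ranked:
    score = finding["monthly_savings"] / EFFORT_WEIGHT[finding["effort"]]
    print(f"{score:7.0f}  {finding['name']}")
```

A score only orders the list. Risk still decides what goes into the immediate, 30-day, and 90-day buckets.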

What the Audit Should Produce

Idle/oversized resources list — with monthly cost impact per item
License reclaim list — users and tools to be removed or downgraded
Reserved Instance purchase plan — which instances, which term, projected savings
Storage tier migration plan — buckets/blobs to move to cheaper tiers
SaaS rationalization list — redundant tools to consolidate or cancel
IaC migration backlog — services to bring under code, prioritized by risk

Typical Findings by Company Size

Based on infrastructure audits across companies at different cloud spend levels:

$5K–$20K/month cloud spend: Typically find $15K–$60K in annual savings. Main culprits: oversized instances, no Reserved Instances, M365 license waste.
$20K–$100K/month cloud spend: Typically find $80K–$300K in annual savings. Add: storage tier waste, data transfer inefficiencies, missing auto-scaling.
$100K+/month cloud spend: Typically find $300K–$1M+ in annual savings. Add: Reserved Instance coverage gaps, architectural inefficiencies, cross-account waste.

The ROI on a thorough audit is almost always 10:1 or better in year one. The challenge isn't finding the savings — it's implementing the changes without breaking production.

What to Do With the Findings: Implementation Order

Fix in this sequence to maximize impact and minimize risk:

Low-risk immediate actions (Week 1): Stop idle dev instances (snapshot any volumes you might need before terminating them). Reclaim unused software licenses. Cancel redundant SaaS tools. These require no code changes and carry little production risk.
Right-sizing with monitoring (Weeks 2–4): Downsize instances that Compute Optimizer flags, with CloudWatch alarms on CPU and memory (memory metrics require the CloudWatch agent) so you can resize back up quickly if load increases. Monitor for 72 hours after each change.
Reserved Instance purchases (Month 2): Only after confirming workloads are stable. Buying RIs before rightsizing locks in the wrong instance types.
Auto-scaling and storage tier migrations (Month 2–3): Requires staging environment testing. Higher engineering effort but high long-term savings.
IaC migration (Months 3–6): Phased by criticality. Start with new infrastructure, then migrate existing services by environment (dev → staging → production).

When to Bring in External Help

Do the audit yourself if:

You have a DevOps or platform engineer who can dedicate 2 days
Your monthly cloud spend is under $20K
You're comfortable navigating AWS Cost Explorer or Azure Cost Management
You know what every line item on your cloud bill is

Bring in external help if:

Your monthly cloud spend is over $50K and you haven't reviewed it in the last 6 months
You have more than one cloud account and no centralized billing view
You've had at least one "we didn't know that was still running" moment in the last year
Your infrastructure isn't in IaC and you're not sure where to start
The engineering team is too busy shipping features to own the audit

A well-scoped cloud audit engagement typically costs $4K–$12K and uncovers 5–15x that in first-year savings — with a written remediation plan, prioritized by ROI, that your team can execute independently.
