15 May 2025
I've Reviewed 20+ AWS Environments. Here's What Always Breaks.
The first time I ran a full security review on a production AWS account, I expected to find the obvious stuff. Public S3 buckets. Default VPCs with everything open. SSH exposed to the internet.
I found those. But the more reviews I've done, the more I realise the interesting vulnerabilities aren't the ones that show up on top-10 lists. They're the ones that made complete sense at 2am during a sprint, got shipped, and nobody ever revisited.
This isn't a checklist post. It's the stuff I actually look for, with the commands and Terraform I actually use.
IAM: Where Most Breaches Start
IAM misconfigurations aren't AWS's fault. The service is well-designed. The problem is how people use it when they're moving fast.
This is the pattern I find most often — attached to a Lambda role in a payments service:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "*",
      "Resource": "*"
    }
  ]
}
Administrator access on a function that processes invoices. It was added temporarily six months ago to debug something. Temporary became permanent. Nobody noticed.
The fix is specific action, specific resource:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::payments-receipts-prod/*"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:stripe-api-key-*"
    }
  ]
}
Only what the function actually touches. Scoped to the exact resource ARNs.
To find your overprivileged roles quickly:
# Find roles with AdministratorAccess attached
# (JMESPath string literals need single quotes; double quotes mean a field name)
aws iam list-roles --query 'Roles[].RoleName' --output text | \
  tr '\t' '\n' | \
  while read -r role; do
    result=$(aws iam list-attached-role-policies --role-name "$role" \
      --query "AttachedPolicies[?PolicyName=='AdministratorAccess'].PolicyName" \
      --output text 2>/dev/null)
    [ -n "$result" ] && echo "$role: $result"
  done
Anything that comes back needs a conversation.
Also worth checking: inline policies. They don't show up in the attached policies list and are often where the worst stuff hides.
# Check inline policies on all roles
aws iam list-roles --query 'Roles[].RoleName' --output text | \
  tr '\t' '\n' | \
  while read -r role; do
    policies=$(aws iam list-role-policies --role-name "$role" --query 'PolicyNames[]' --output text 2>/dev/null)
    [ -n "$policies" ] && echo "$role has inline policies: $policies"
  done
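Knowing that a role has inline policies is only half the job; the policy document itself still has to be read. A minimal sketch of that check, assuming you already have the document as a parsed dict (for example from get-role-policy). The function name find_wildcard_statements is hypothetical, and it only catches the full "*" on "*" case, not narrower wildcards such as "s3:*":

```python
from typing import Any


def _as_list(value: Any) -> list:
    """IAM accepts a single string/dict where a list is allowed; normalise both."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_wildcard_statements(policy: dict) -> list[dict]:
    """Return Allow statements that grant every action on every resource.

    Hypothetical helper for illustration, not part of any AWS SDK.
    """
    flagged = []
    for stmt in _as_list(policy.get("Statement")):
        if stmt.get("Effect") != "Allow":
            continue
        actions = _as_list(stmt.get("Action"))
        resources = _as_list(stmt.get("Resource"))
        if "*" in actions and "*" in resources:
            flagged.append(stmt)
    return flagged


admin_policy = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Allow", "Action": "*", "Resource": "*"}],
}
scoped_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": "arn:aws:s3:::payments-receipts-prod/*",
        }
    ],
}

print(len(find_wildcard_statements(admin_policy)))   # 1
print(len(find_wildcard_statements(scoped_policy)))  # 0
```

Both the single-object and list forms of Statement, Action and Resource are handled, because real policies use all of them.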
S3: Public Exposure Is Still Happening in 2025
AWS added account-level Block Public Access settings in 2018. Seven years later, I still find accounts where the account-level setting was never switched on and individual buckets have it turned off.
# Audit public access status on every bucket in the account
aws s3api list-buckets --query 'Buckets[].Name' --output text | \
  tr '\t' '\n' | \
  while read -r bucket; do
    status=$(aws s3api get-public-access-block --bucket "$bucket" 2>/dev/null | \
      python3 -c "
import sys, json
d = json.load(sys.stdin)['PublicAccessBlockConfiguration']
print('OPEN' if not all(d.values()) else 'OK')
" 2>/dev/null || echo "NO_BLOCK_CONFIG")
    [ "$status" != "OK" ] && echo "$bucket: $status"
  done
Anything that prints is worth investigating. NO_BLOCK_CONFIG means the bucket has no public access block configured at all — check its ACL and bucket policy.
The Terraform to lock this down at the account level — one resource, covers everything:
resource "aws_s3_account_public_access_block" "main" {
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
Do this first. Before anything else S3-related. It sets a floor that individual bucket settings can't undercut.
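The "floor" works because S3 applies the most restrictive combination of the account-level and bucket-level settings. A small sketch of that rule, using the four flag names from the PublicAccessBlockConfiguration structure. The function name effective_public_access_block is hypothetical and the logic is a model of the documented behaviour, not a call into AWS:

```python
FLAGS = (
    "BlockPublicAcls",
    "IgnorePublicAcls",
    "BlockPublicPolicy",
    "RestrictPublicBuckets",
)


def effective_public_access_block(account: dict, bucket: dict) -> dict:
    """Model of how the two levels combine: a flag is on if either level sets it."""
    return {flag: bool(account.get(flag)) or bool(bucket.get(flag)) for flag in FLAGS}


all_on = {flag: True for flag in FLAGS}
all_off = {flag: False for flag in FLAGS}

# Account locked down, bucket wide open: the bucket cannot undercut the account.
print(all(effective_public_access_block(all_on, all_off).values()))   # True

# No account-level setting at all: the bucket's own (open) setting wins.
print(any(effective_public_access_block({}, all_off).values()))       # False
```

The second case is the one the audit script above is looking for: nothing at the account level, so each bucket is only as safe as its own configuration.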
For buckets that legitimately need to serve public content (a static site, a public CDN origin), put CloudFront in front and keep the bucket private. The distribution reaches the bucket through an Origin Access Control, the bucket policy allows only that distribution, and CloudFront is the public endpoint. The bucket never touches the internet directly.
Security Groups: The Firewall Nobody Reviews After Day One
Security groups are stateful firewalls. They're supposed to be restrictive. In practice, they drift toward open because it's easier to allow everything and fix it later — except later never comes.
This is what I find in production RDS security groups:
# What I keep finding
resource "aws_security_group_rule" "db_ingress" {
type = "ingress"
from_port = 5432
to_port = 5432
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # the entire internet
security_group_id = aws_security_group.rds.id
}
Your database should not be reachable from the internet. It belongs in a private subnet of your VPC, and the only thing allowed to reach it should be your application tier's security group, not a CIDR range:
# What it should be
resource "aws_security_group_rule" "db_from_app_only" {
type = "ingress"
from_port = 5432
to_port = 5432
protocol = "tcp"
source_security_group_id = aws_security_group.app_tier.id
security_group_id = aws_security_group.rds.id
}
Same goes for Redis, internal APIs, anything in your private subnets. Source should be a security group reference, never a CIDR unless you're dealing with an external IP you actually control.
To audit your account for security groups with unrestricted access on sensitive ports:
aws ec2 describe-security-groups \
  --query "SecurityGroups[?IpPermissions[?
    IpRanges[?CidrIp=='0.0.0.0/0'] &&
    (FromPort==\`22\` || FromPort==\`3389\` || FromPort==\`5432\` || FromPort==\`3306\` || FromPort==\`6379\`)
  ]].[GroupId, GroupName, VpcId]" \
  --output table
This is usually where a penetration tester starts. Don't let them find it before you do.
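One limitation of matching on FromPort alone is that it misses rules that cover a sensitive port inside a wider range (0-65535), rules for all protocols, and IPv6 ranges. A sketch of a stricter check over the dict shape that describe-security-groups returns. The function name open_sensitive_ports and the sample groups are made up for illustration:

```python
SENSITIVE_PORTS = {22, 3389, 5432, 3306, 6379}
OPEN_CIDRS = {"0.0.0.0/0", "::/0"}


def open_sensitive_ports(group: dict) -> set[int]:
    """Return the sensitive ports this security group exposes to any address."""
    exposed: set[int] = set()
    for perm in group.get("IpPermissions", []):
        cidrs = {r.get("CidrIp") for r in perm.get("IpRanges", [])}
        cidrs |= {r.get("CidrIpv6") for r in perm.get("Ipv6Ranges", [])}
        if not cidrs & OPEN_CIDRS:
            continue  # rule is scoped to specific ranges or to security groups
        protocol = perm.get("IpProtocol")
        if protocol == "-1":
            exposed |= SENSITIVE_PORTS  # "all traffic" covers every port
            continue
        if protocol not in ("tcp", "6"):
            continue
        low, high = perm.get("FromPort"), perm.get("ToPort")
        if low is None or high is None:
            continue
        exposed |= {port for port in SENSITIVE_PORTS if low <= port <= high}
    return exposed


# Hypothetical sample data in the describe-security-groups shape
wide_open = {
    "GroupId": "sg-0example1",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 0, "ToPort": 65535,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}
    ],
}
app_only = {
    "GroupId": "sg-0example2",
    "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [], "UserIdGroupPairs": [{"GroupId": "sg-0example3"}]}
    ],
}

print(sorted(open_sensitive_ports(wide_open)))  # [22, 3306, 3389, 5432, 6379]
print(sorted(open_sensitive_ports(app_only)))   # []
```

The first group would slip past a FromPort equality check, because its FromPort is 0.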
CloudTrail: If You're Not Logging, You're Just Guessing
I've reviewed accounts with no CloudTrail. When I ask why, the answer is usually some version of "we weren't sure what to do with the data." Understandable. But you don't need to process logs in real-time to get value from them.
You need them the day something goes wrong. An incident without logs is just assumptions and finger-pointing.
Enable multi-region CloudTrail with log file validation:
resource "aws_cloudtrail" "account_trail" {
  name                          = "account-audit-trail"
  s3_bucket_name                = aws_s3_bucket.cloudtrail_logs.id
  include_global_service_events = true
  is_multi_region_trail         = true
  enable_log_file_validation    = true

  event_selector {
    read_write_type           = "All"
    include_management_events = true

    data_resource {
      type   = "AWS::S3::Object"
      values = ["arn:aws:s3:::"]
    }
  }

  tags = {
    Environment = "production"
    Purpose     = "audit"
  }

  # The trail can't be created until CloudTrail is allowed to write to the bucket
  depends_on = [aws_s3_bucket_policy.cloudtrail_logs]
}
# The CloudTrail bucket needs its own policy: let CloudTrail check the bucket ACL
# and write log files, and deny deletion of those files
resource "aws_s3_bucket_policy" "cloudtrail_logs" {
  bucket = aws_s3_bucket.cloudtrail_logs.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid       = "AWSCloudTrailAclCheck"
        Effect    = "Allow"
        Principal = { Service = "cloudtrail.amazonaws.com" }
        Action    = "s3:GetBucketAcl"
        Resource  = aws_s3_bucket.cloudtrail_logs.arn
      },
      {
        Sid       = "AWSCloudTrailWrite"
        Effect    = "Allow"
        Principal = { Service = "cloudtrail.amazonaws.com" }
        Action    = "s3:PutObject"
        Resource  = "${aws_s3_bucket.cloudtrail_logs.arn}/AWSLogs/*"
        Condition = {
          StringEquals = {
            "s3:x-amz-acl" = "bucket-owner-full-control"
          }
        }
      },
      {
        Sid       = "DenyLogDeletion"
        Effect    = "Deny"
        Principal = "*"
        Action    = ["s3:DeleteObject", "s3:DeleteObjectVersion"]
        Resource  = "${aws_s3_bucket.cloudtrail_logs.arn}/*"
      }
    ]
  })
}
enable_log_file_validation = true tells CloudTrail to deliver signed digest files that record a SHA-256 hash of every log file it wrote. If someone tampers with the logs (deletes a file, modifies entries), validating against the digests will show it. This matters if you ever need the logs for a legal or compliance investigation.
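The idea behind hash-based validation can be shown in a few lines. This is a toy model only: the file names and contents are invented, and CloudTrail's real digest files are also signed and chained to the previous digest, which this sketch leaves out. For real logs, the aws cloudtrail validate-logs command does the actual check.

```python
import hashlib


def build_digest(log_files: dict[str, bytes]) -> dict[str, str]:
    """Record a SHA-256 hash per log file, as a stand-in for a digest file."""
    return {name: hashlib.sha256(content).hexdigest() for name, content in log_files.items()}


def verify(log_files: dict[str, bytes], digest: dict[str, str]) -> dict[str, str]:
    """Compare current files against the digest; report missing or modified ones."""
    problems = {}
    for name, expected in digest.items():
        if name not in log_files:
            problems[name] = "missing"
        elif hashlib.sha256(log_files[name]).hexdigest() != expected:
            problems[name] = "modified"
    return problems


# Invented log contents, purely for illustration
logs = {
    "log-0001.json": b'{"eventName": "ConsoleLogin"}',
    "log-0002.json": b'{"eventName": "DeleteBucket"}',
}
digest = build_digest(logs)

print(verify(logs, digest))  # {}

tampered = {"log-0001.json": b'{"eventName": "ConsoleLogin", "edited": true}'}
print(verify(tampered, digest))
# {'log-0001.json': 'modified', 'log-0002.json': 'missing'}
```

Because the digest is computed at write time and stored separately, both an edited file and a deleted file show up as discrepancies later.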
Quick check on how CloudTrail is configured in the account you're reviewing (aws cloudtrail get-trail-status then tells you whether a given trail is actually logging):
aws cloudtrail describe-trails --include-shadow-trails false \
  --query 'trailList[].{
    Name: Name,
    MultiRegion: IsMultiRegionTrail,
    LogValidation: LogFileValidationEnabled,
    GlobalEvents: IncludeGlobalServiceEvents
  }' \
  --output table
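If you review many accounts, reading that table by eye gets tedious. A sketch that turns the same three fields into a list of gaps per trail, assuming the input is the trailList array from describe-trails. The function name trail_gaps and the sample trails are hypothetical:

```python
# Field name in the describe-trails output -> what it means when it is false
CHECKS = {
    "IsMultiRegionTrail": "not multi-region",
    "LogFileValidationEnabled": "log file validation off",
    "IncludeGlobalServiceEvents": "global service events excluded",
}


def trail_gaps(trail_list: list[dict]) -> dict[str, list[str]]:
    """Map each trail name to the settings it is missing; empty dict means all good."""
    if not trail_list:
        return {"(account)": ["no trail configured"]}
    gaps: dict[str, list[str]] = {}
    for trail in trail_list:
        missing = [message for key, message in CHECKS.items() if not trail.get(key)]
        if missing:
            gaps[trail.get("Name", "(unnamed)")] = missing
    return gaps


# Made-up sample in the describe-trails shape
sample = [
    {"Name": "account-audit-trail", "IsMultiRegionTrail": True,
     "LogFileValidationEnabled": True, "IncludeGlobalServiceEvents": True},
    {"Name": "legacy-trail", "IsMultiRegionTrail": False,
     "LogFileValidationEnabled": False, "IncludeGlobalServiceEvents": True},
]

print(trail_gaps(sample))
# {'legacy-trail': ['not multi-region', 'log file validation off']}
print(trail_gaps([]))
# {'(account)': ['no trail configured']}
```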
Secrets: Not in Environment Variables, Please
Lambda environment variables are encrypted at rest by default, but that protects less than it sounds. They use the AWS managed key and Lambda decrypts them transparently, which means any IAM principal with lambda:GetFunctionConfiguration can read them in plaintext. I've seen teams that know this but still use env vars because Secrets Manager feels like overhead.
It's not. Here's the whole thing in Python:
import json
from functools import lru_cache
from urllib.parse import quote

import boto3
from botocore.exceptions import ClientError


@lru_cache(maxsize=None)
def get_secret(secret_name: str, region: str = "eu-west-1") -> dict:
    """Fetch a JSON secret and keep it in memory for later calls."""
    client = boto3.client("secretsmanager", region_name=region)
    try:
        response = client.get_secret_value(SecretId=secret_name)
        return json.loads(response["SecretString"])
    except ClientError as e:
        raise RuntimeError(f"Failed to retrieve secret '{secret_name}': {e}") from e


# Somewhere in your Lambda handler or app startup
db = get_secret("production/myapp/postgres")
# URL-encode the password so characters like "@" or "/" don't break the URL
conn_str = (
    f"postgresql://{db['username']}:{quote(db['password'], safe='')}"
    f"@{db['host']}:{db['port']}/{db['dbname']}"
)
The @lru_cache is deliberate — you don't want to call Secrets Manager on every Lambda invocation. Cache it in memory. The secret doesn't change on every request.
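One trade-off worth knowing: an lru_cache entry never expires, so if the secret is rotated, a warm execution environment keeps using the old value until it is recycled. A sketch of a time-limited cache as an alternative, under stated assumptions: TTLSecretCache is a hypothetical class, the fetch callable stands in for a function like get_secret, and the five-minute default is an arbitrary choice.

```python
import time
from typing import Callable


class TTLSecretCache:
    """Cache secret values in memory, refetching once an entry is older than the TTL."""

    def __init__(
        self,
        fetch: Callable[[str], dict],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # e.g. a function that calls Secrets Manager
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the expiry logic can be tested
        self._entries: dict[str, tuple[float, dict]] = {}

    def get(self, name: str) -> dict:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        value = self._fetch(name)
        self._entries[name] = (now, value)
        return value


# Demo with a fake fetcher and a fake clock, so no AWS call is made
calls: list[str] = []


def fake_fetch(name: str) -> dict:
    calls.append(name)
    return {"password": f"version-{len(calls)}"}


now = [0.0]
cache = TTLSecretCache(fake_fetch, ttl_seconds=300, clock=lambda: now[0])

first = cache.get("production/myapp/postgres")   # fetched
second = cache.get("production/myapp/postgres")  # served from memory
now[0] = 301.0
third = cache.get("production/myapp/postgres")   # TTL elapsed, fetched again

print(len(calls))         # 2
print(third["password"])  # version-2
```

The TTL bounds how long a rotated secret can stay stale while still keeping Secrets Manager off the hot path.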
The Terraform to store the secret:
resource "aws_secretsmanager_secret" "db_credentials" {
  name                    = "production/myapp/postgres"
  recovery_window_in_days = 7

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

resource "aws_secretsmanager_secret_version" "db_credentials" {
  secret_id = aws_secretsmanager_secret.db_credentials.id
  secret_string = jsonencode({
    username = var.db_username
    password = var.db_password
    host     = aws_db_instance.main.address
    port     = 5432
    dbname   = var.db_name
  })
}
And the IAM permission for your Lambda or ECS task role — note the specific ARN, not a wildcard:
{
  "Effect": "Allow",
  "Action": "secretsmanager:GetSecretValue",
  "Resource": "arn:aws:secretsmanager:eu-west-1:123456789012:secret:production/myapp/postgres-*"
}
The trailing -* handles the random suffix that Secrets Manager appends to the secret ARN. Without it, the permission won't work.
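To see why the wildcard is needed, here is a rough illustration using Python's fnmatchcase as a stand-in for IAM's wildcard matching. That is an approximation (fnmatch also treats square brackets specially, which IAM does not), and the six-character suffix AbCdEf is invented for the example:

```python
from fnmatch import fnmatchcase


def resource_matches(pattern: str, arn: str) -> bool:
    """Approximate IAM Resource matching: '*' matches any run of characters."""
    return fnmatchcase(arn, pattern)


base = "arn:aws:secretsmanager:eu-west-1:123456789012:secret:production/myapp/postgres"

# What the secret's ARN actually looks like once created (suffix is made up here)
actual_arn = base + "-AbCdEf"

print(resource_matches(base + "-*", actual_arn))  # True
print(resource_matches(base, actual_arn))         # False
```

The policy without the wildcard names an ARN that no real secret has, so the Allow never applies.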
Enable Security Hub. Just Do It.
Security Hub aggregates findings from GuardDuty, Inspector, Macie, and IAM Access Analyzer into one place, runs its own checks against standards such as CIS AWS Foundations and AWS Foundational Security Best Practices, and gives you a score. Those checks rely on AWS Config, so make sure Config recording is on as well.
It takes thirty seconds to enable:
aws securityhub enable-security-hub \
  --enable-default-standards \
  --region eu-west-1
Do it in every region you operate in. The first 30 days are free. After that it costs something, but the visibility you get — a single pane that tells you what's misconfigured across your entire account — is worth it.
If you're in a multi-account setup, designate a Security Hub administrator account (AWS recommends a dedicated security account rather than the management account itself) and aggregate findings from member accounts. One dashboard, all accounts.
None of this is complicated. Every issue I've described has a known fix, documented by AWS, with working Terraform. The reason these misconfigurations keep showing up isn't that teams don't know how to fix them — it's that security review doesn't happen on the same cadence as feature development.
You ship infrastructure changes every week. If nobody's reviewing them from a security angle, you're accumulating debt that eventually becomes an incident at 3am.
The tooling I use for this kind of review — posture checks, CIS benchmarking, automated IAM analysis — is in my open-source SSDLC-Toolbox: github.com/rohithyal/ssdlc-toolbox.