Securing Hadoop/EMR on AWS — lessons from MNC scale

2026-06-13

Before AWS Security, I spent 15+ years securing distributed data platforms in production — Kafka, Hadoop and CDP at MNC scale, on estates up to ~300 nodes. When you move that workload to AWS EMR, the tools change but the questions an auditor asks don't: who can access this data, can you prove it, and what happens when it breaks.

Here's how on-prem Hadoop security translates to AWS — and the one trap I see teams fall into every time.

Practitioner guidance, not a substitute for your own threat model.

The four layers, on-prem → AWS

| Layer | On-prem Hadoop/CDP | AWS EMR equivalent |
|---|---|---|
| Encryption in transit | TLS + SASL on RPC/HTTP/shuffle | EMR security configuration: EnableInTransitEncryption + PEM certs in S3 |
| Encryption at rest | HDFS TDE, LUKS on local disk | LocalDiskEncryptionConfiguration (EBS + local via KMS); S3EncryptionConfiguration SSE-KMS for EMRFS |
| Authentication | Kerberos + AD cross-realm trust | EMR Kerberos (dedicated KDC + AD trust): still Kerberos, but cluster-scoped |
| Authorization (data) | Apache Ranger policies | IAM roles for EMRFS + Lake Formation fine-grained grants |

The encryption story maps almost one-to-one — you apply it as an EMR security configuration at cluster creation, instead of editing hdfs-site.xml and managing LUKS by hand.
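As a rough illustration of what that security configuration looks like, here is a minimal Python sketch that builds the JSON document covering both in-transit and at-rest encryption. The KMS key ARN and the S3 path to the certificate bundle are hypothetical placeholders, and using a single key for both S3 and local disk is a simplifying assumption, not a recommendation.

```python
import json

# Hypothetical placeholders: substitute your own key ARN and certificate bundle.
KMS_KEY_ARN = "arn:aws:kms:ap-southeast-2:111122223333:key/EXAMPLE-KEY-ID"
CERT_ZIP_S3 = "s3://example-emr-config/certs/emr-tls-certs.zip"


def build_security_configuration(kms_key_arn: str, cert_zip_s3: str) -> dict:
    """Return an EMR security configuration enabling transit and at-rest encryption."""
    return {
        "EncryptionConfiguration": {
            "EnableInTransitEncryption": True,
            "EnableAtRestEncryption": True,
            "InTransitEncryptionConfiguration": {
                "TLSCertificateConfiguration": {
                    # PEM certificates are supplied as a zip file in S3.
                    "CertificateProviderType": "PEM",
                    "S3Object": cert_zip_s3,
                }
            },
            "AtRestEncryptionConfiguration": {
                # EMRFS data written to S3.
                "S3EncryptionConfiguration": {
                    "EncryptionMode": "SSE-KMS",
                    "AwsKmsKey": kms_key_arn,
                },
                # Instance store and EBS volumes on the cluster nodes.
                "LocalDiskEncryptionConfiguration": {
                    "EncryptionKeyProviderType": "AwsKms",
                    "AwsKmsKey": kms_key_arn,
                    "EnableEbsEncryption": True,
                },
            },
        }
    }


security_config = build_security_configuration(KMS_KEY_ARN, CERT_ZIP_S3)
security_config_json = json.dumps(security_config, indent=2)
print(security_config_json)
```

With boto3, the resulting string would be passed as the SecurityConfiguration argument of the EMR client's create_security_configuration call, and the configuration name is then referenced when the cluster is created.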

What genuinely changes

1. Authorization moves from Ranger to IAM + Lake Formation. This is the biggest shift. Where you wrote Ranger policies for table/column access, EMR splits it: coarse cluster access via Kerberos, fine-grained data access via Lake Formation grants and IAM roles for EMRFS. If you're migrating, map each Ranger policy intent (not its syntax) to a Lake Formation grant.
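One way to picture "map the intent, not the syntax" is the sketch below. PolicyIntent is a hypothetical, deliberately simplified stand-in for what a Ranger policy means (who, which table, which columns, read or write); it is not Ranger's export format. The function turns that intent into the keyword arguments that boto3's Lake Formation grant_permissions call accepts. The role ARN, database and table names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PolicyIntent:
    """Hypothetical summary of what one Ranger policy was meant to allow."""
    principal_arn: str
    database: str
    table: str
    columns: List[str] = field(default_factory=list)  # empty means the whole table
    access: str = "read"  # "read" or "write"


# Assumed mapping from intent to Lake Formation permissions.
_ACCESS_TO_PERMISSIONS = {"read": ["SELECT"], "write": ["INSERT"]}


def to_lake_formation_grant(intent: PolicyIntent) -> dict:
    """Build kwargs for lakeformation.grant_permissions from a policy intent."""
    if intent.access not in _ACCESS_TO_PERMISSIONS:
        raise ValueError(f"unsupported access type: {intent.access}")
    if intent.columns:
        # Column-level grants are modelled here as SELECT only.
        if intent.access != "read":
            raise ValueError("column-level intents must be read-only")
        resource = {
            "TableWithColumns": {
                "DatabaseName": intent.database,
                "Name": intent.table,
                "ColumnNames": list(intent.columns),
            }
        }
    else:
        resource = {"Table": {"DatabaseName": intent.database, "Name": intent.table}}
    return {
        "Principal": {"DataLakePrincipalIdentifier": intent.principal_arn},
        "Resource": resource,
        "Permissions": _ACCESS_TO_PERMISSIONS[intent.access],
    }


intent = PolicyIntent(
    principal_arn="arn:aws:iam::111122223333:role/analyst-role",  # hypothetical
    database="sales",
    table="orders",
    columns=["order_id", "amount"],
)
grant = to_lake_formation_grant(intent)
print(grant)
```

Keeping the intent as its own small record makes the migration reviewable: an auditor can read the list of intents without knowing either Ranger or Lake Formation syntax.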

2. Audit moves to CloudTrail + S3 access logs. You stop running a self-managed audit pipeline. EMR API calls land in CloudTrail as management events; EMRFS data access lands in S3 server access logs and, once you enable them for the data-lake buckets, in CloudTrail data events. That's less to operate, and the CloudTrail side is tamper-evident if you turn on log-file validation.
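A minimal sketch of the two CloudTrail settings that matter here: log-file validation on the trail, and S3 data events scoped to the data-lake buckets. The functions only build the keyword arguments for boto3's create_trail and put_event_selectors calls; the trail and bucket names are hypothetical.

```python
from typing import Dict, List


def build_trail_request(trail_name: str, log_bucket: str) -> Dict:
    """Kwargs for cloudtrail.create_trail with log-file validation enabled."""
    return {
        "Name": trail_name,
        "S3BucketName": log_bucket,
        "IsMultiRegionTrail": True,
        # Produces signed digest files so tampering with delivered logs is detectable.
        "EnableLogFileValidation": True,
    }


def build_data_event_selectors(trail_name: str, data_lake_buckets: List[str]) -> Dict:
    """Kwargs for cloudtrail.put_event_selectors logging object-level S3 access."""
    return {
        "TrailName": trail_name,
        "EventSelectors": [
            {
                "ReadWriteType": "All",
                "IncludeManagementEvents": True,
                "DataResources": [
                    {
                        "Type": "AWS::S3::Object",
                        # A trailing slash selects every object in the bucket.
                        "Values": [f"arn:aws:s3:::{b}/" for b in data_lake_buckets],
                    }
                ],
            }
        ],
    }


# Hypothetical names for illustration only.
trail_request = build_trail_request("emr-audit-trail", "example-audit-logs")
selectors_request = build_data_event_selectors("emr-audit-trail", ["example-data-lake"])
print(trail_request)
print(selectors_request)
```

Scoping data events to the data-lake buckets rather than every bucket in the account is a cost and noise trade-off; whether that scope is sufficient depends on where your regulated data actually lives.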

3. Key management becomes KMS. Instead of a homegrown keystore, you get KMS: automatic key rotation, scoping with the kms:ViaService condition key, and deletion safeguards (a mandatory waiting period before a key is destroyed). Encryption is only as strong as key management, and KMS makes that auditable.
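To make the ViaService idea concrete, here is a sketch of a single key-policy statement that lets a role use the key only when the request arrives through S3 in one region. The role ARN and region are hypothetical, and a real key policy also needs administrative statements that are omitted here. A key that also protects local disks would need a similar statement for EC2, which this example does not cover.

```python
import json


def build_via_service_statement(role_arn: str, region: str) -> dict:
    """Key-policy statement allowing key use only for requests made via S3."""
    return {
        "Sid": "AllowUseOnlyViaS3",
        "Effect": "Allow",
        "Principal": {"AWS": role_arn},
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        # In a key policy, "*" refers to the key the policy is attached to.
        "Resource": "*",
        "Condition": {
            "StringEquals": {"kms:ViaService": f"s3.{region}.amazonaws.com"}
        },
    }


# Hypothetical EMR instance role and region.
statement = build_via_service_statement(
    "arn:aws:iam::111122223333:role/emr-ec2-instance-role", "ap-southeast-2"
)
print(json.dumps(statement, indent=2))
```

The effect is that the same role calling KMS directly, outside of S3, gets no access from this statement. Rotation is a separate switch: with boto3 it is the KMS client's enable_key_rotation call on the key ID.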

4. The node is ephemeral, the data isn't. On-prem, the cluster was the data (HDFS). On EMR, compute is ephemeral and data lives in S3 (EMRFS). So your security centre of gravity shifts from the cluster to the S3 data lake — bucket policies, Block Public Access, and SSE-KMS matter more than node hardening.
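For the S3 side, a minimal sketch of two of those controls: a bucket policy that denies any request not made over TLS, and the four Block Public Access flags. The bucket name is hypothetical, and a production policy would usually carry further statements (for example around encryption headers or allowed principals) that are left out here.

```python
import json


def build_deny_insecure_transport_policy(bucket: str) -> dict:
    """Bucket policy that denies all S3 actions on the bucket over plain HTTP."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# All four Block Public Access settings switched on.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

bucket_policy = build_deny_insecure_transport_policy("example-data-lake")  # hypothetical bucket
print(json.dumps(bucket_policy, indent=2))
print(PUBLIC_ACCESS_BLOCK)
```

With boto3, the policy would go to the S3 client's put_bucket_policy as a JSON string, and the flag dictionary to put_public_access_block as PublicAccessBlockConfiguration (on the S3 Control client for the account-wide setting).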

The trap

Teams lift-and-shift the cluster into a VPC and call it secure because it's "internal." Internal is not a security control. The blast radius of one compromised node is the whole data lake. The non-negotiables:

  • Private subnets; security groups scoped to the least path (not "open within the VPC")
  • No public access to the EMR primary (master) node; reach it via SSM Session Manager / a bastion
  • S3 Block Public Access turned on account-wide, so it covers every data-lake bucket
  • Encrypt everything (in transit + at rest), with KMS rotation on
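The "open within the VPC" pattern in the first bullet is easy to check mechanically. The sketch below takes ingress rules in the shape that EC2's describe_security_groups returns under IpPermissions and flags any IPv4 range that is open to the internet or covers the entire VPC CIDR. The sample rules and the VPC CIDR are invented for illustration, and the check ignores IPv6 ranges and security-group-to-security-group references.

```python
import ipaddress
from typing import Dict, List, Tuple


def find_broad_ingress(ip_permissions: List[Dict], vpc_cidr: str) -> List[Tuple]:
    """Return (from_port, to_port, cidr, reason) for overly broad IPv4 ingress rules."""
    vpc_net = ipaddress.ip_network(vpc_cidr)
    findings = []
    for rule in ip_permissions:
        for ip_range in rule.get("IpRanges", []):
            cidr = ip_range["CidrIp"]
            net = ipaddress.ip_network(cidr, strict=False)
            if net.prefixlen == 0:
                reason = "open to the internet"
            elif net.version == vpc_net.version and vpc_net.subnet_of(net):
                # The rule's range contains the whole VPC.
                reason = "open to the whole VPC"
            else:
                continue
            findings.append((rule.get("FromPort"), rule.get("ToPort"), cidr, reason))
    return findings


# Hypothetical rules for illustration.
sample_rules = [
    {"IpProtocol": "tcp", "FromPort": 8443, "ToPort": 8443,
     "IpRanges": [{"CidrIp": "10.0.0.0/16"}]},
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
    {"IpProtocol": "tcp", "FromPort": 8443, "ToPort": 8443,
     "IpRanges": [{"CidrIp": "10.0.1.0/24"}]},
]
findings = find_broad_ingress(sample_rules, "10.0.0.0/16")
for finding in findings:
    print(finding)
```

Here the first two rules are flagged and the third, scoped to a single subnet, is not. A check like this is a starting point for review, not proof of least privilege.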

Why this is a moat, not baggage

Securing data at that scale taught me exactly what APRA cares about. CPS 234's encryption, access-control and logging requirements are the same controls I ran on Hadoop for years — just expressed in AWS services, and producing the same audit evidence. The "old" data-platform background isn't legacy; it's the reason the AWS Security controls feel obvious.

A worked EMR security configuration + the IAM/KMS patterns are in the open-source aws-security-toolkit.


Practitioner guides on AWS Security for APRA-regulated Australia at aiopsone.com.


Primary sources: Amazon EMR security · APRA CPS 234
