What is a Data Lake in AWS?
A Data Lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale — usually for analytics, ML, and search.
Storage layer: Amazon S3 is the canonical Data Lake storage; cheap, durable, infinitely scalable.
Catalog: AWS Glue Data Catalog stores schemas, partitions, and metadata.
Ingestion: Kinesis Data Firehose, DMS, Glue, Snowball, partner ETL tools.
Query / Analytics: Athena (serverless SQL), Redshift Spectrum, EMR (Spark/Hive), QuickSight.
Governance: AWS Lake Formation manages permissions, row/column security, and audit.
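To make the storage and catalog layers concrete, here is a minimal sketch of how objects are often laid out in S3 so that the catalog can treat path segments as partitions. It uses the Hive-style `key=value` path convention. The bucket name, the `raw` zone prefix, the dataset name, and both helper functions are hypothetical, chosen only for illustration.

```python
from datetime import date


def partitioned_key(dataset: str, day: date, filename: str, zone: str = "raw") -> str:
    """Build a Hive-style partitioned object key (hypothetical layout).

    Segments such as year=2024/month=01 let a catalog register
    year, month, and day as partition columns.
    """
    return (
        f"{zone}/{dataset}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
        f"{filename}"
    )


def s3_uri(bucket: str, key: str) -> str:
    """Join a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    key = partitioned_key("clickstream", date(2024, 1, 15), "part-0000.json.gz")
    # prints s3://example-data-lake/raw/clickstream/year=2024/month=01/day=15/part-0000.json.gz
    print(s3_uri("example-data-lake", key))
```

Partitioning by date is only one possible choice; the point is that query engines can skip whole prefixes when a filter names a partition column.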
Data Lake vs Data Warehouse: Lake holds raw data in any format (schema-on-read); Warehouse (Redshift) holds curated structured data (schema-on-write).
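The schema-on-read versus schema-on-write distinction can be shown with a toy in-memory model. This is a sketch under stated assumptions, not how S3 or Redshift work internally: the two Python lists stand in for the lake and the warehouse, and `SCHEMA` and all three functions are hypothetical.

```python
import json

# Hypothetical two-column schema used by both paths.
SCHEMA = {"user_id": int, "amount": float}


def write_to_warehouse(table: list, record: dict) -> None:
    """Schema-on-write: validate and coerce before storing; reject bad rows."""
    row = {}
    for column, col_type in SCHEMA.items():
        if column not in record:
            raise ValueError(f"missing column: {column}")
        row[column] = col_type(record[column])
    table.append(row)


def write_to_lake(objects: list, raw_line: str) -> None:
    """Schema-on-read: store the raw text untouched, whatever its shape."""
    objects.append(raw_line)


def read_from_lake(objects: list) -> list:
    """Apply the schema only at query time; skip lines that do not fit it."""
    rows = []
    for line in objects:
        try:
            record = json.loads(line)
            rows.append({col: typ(record[col]) for col, typ in SCHEMA.items()})
        except (ValueError, KeyError, TypeError):
            continue  # malformed for this schema, but still kept in the lake
    return rows
```

The lake accepts every line and defers interpretation to the reader, so a different schema can be applied to the same raw data later. The warehouse pays the validation cost up front and only ever holds rows that match its schema.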
S3 is the storage foundation of the data lake. Add Glue Data Catalog + Lake Formation + Athena to sketch the modern AWS lake stack in one breath.
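As a rough illustration of how these pieces meet at query time, the sketch below assembles an Athena query request against a table registered in the Glue Data Catalog. The database name, table name, and results location are hypothetical. `build_athena_request` only builds the arguments; `run_query` assumes boto3 is installed and AWS credentials are configured, and calls the boto3 Athena client's `start_query_execution`.

```python
def build_athena_request(sql: str, database: str, output_location: str) -> dict:
    """Assemble keyword arguments for Athena's StartQueryExecution call."""
    return {
        "QueryString": sql,
        # The database is a Glue Data Catalog database name.
        "QueryExecutionContext": {"Database": database},
        # Athena writes result files to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request: dict) -> str:
    """Submit the query and return its execution ID (needs AWS credentials)."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("athena")
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
request = build_athena_request(
    sql="SELECT COUNT(*) FROM clickstream WHERE year = '2024'",
    database="example_lake_db",
    output_location="s3://example-athena-results/",
)
```

The query is asynchronous: the returned ID is what a caller would poll to find out when results are ready. Permissions on the underlying table would be governed by Lake Formation where it is enabled.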