What is a Data Lake in AWS?

Question

Accepted Answer

A data lake is a centralized repository that stores structured, semi-structured, and unstructured data at any scale, usually for analytics, machine learning, and search. On AWS it is typically assembled from these layers:

- Storage: Amazon S3 is the canonical data lake store. It is inexpensive, designed for 99.999999999% durability, and scales without capacity planning.
- Catalog: the AWS Glue Data Catalog stores table schemas, partitions, and other metadata.
- Ingestion: Kinesis Data Firehose (now named Amazon Data Firehose), AWS DMS, AWS Glue, AWS Snowball, and partner ETL tools.
- Query and analytics: Amazon Athena (serverless SQL), Redshift Spectrum, Amazon EMR (Spark/Hive), and Amazon QuickSight.
- Governance: AWS Lake Formation manages permissions, row- and column-level security, and auditing.

Data lake vs. data warehouse: a lake holds raw data in any format and applies a schema only when the data is read (schema-on-read). A warehouse such as Amazon Redshift holds curated, structured data whose schema is enforced when the data is loaded (schema-on-write).

In short: S3 is the storage foundation of the lake. Add the Glue Data Catalog, Lake Formation, and Athena and you have the core of the modern AWS data lake stack.
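
To make schema-on-read a bit more concrete, here is a minimal sketch in plain Python. It is not AWS code and calls no AWS API. It only imitates two ideas from the answer: raw files stored under Hive-style `key=value` partition prefixes (a layout Athena and Glue can recognize as partitions), and a schema applied at read time rather than at write time. All names here (`partitioned_key`, `read_with_schema`, `READ_SCHEMA`, the `orders` dataset) are hypothetical and chosen for illustration.

```python
import json
from datetime import date


def partitioned_key(dataset: str, day: date, filename: str) -> str:
    """Build a Hive-style partitioned object key (year=/month=/day=)."""
    return (
        f"{dataset}/year={day.year:04d}/month={day.month:02d}/"
        f"day={day.day:02d}/{filename}"
    )


# Raw records land in the lake untouched: every value is still a string,
# and one record carries an extra field. Nothing is validated on write.
RAW_LINES = [
    '{"user_id": "42", "amount": "19.99", "note": "first order"}',
    '{"user_id": "7", "amount": "5"}',
]

# Hypothetical read-time schema: column name -> type to cast to.
# In a real lake this role is played by a table definition in the catalog.
READ_SCHEMA = {"user_id": int, "amount": float}


def read_with_schema(lines, schema):
    """Parse raw JSON lines and project/cast them to the given schema."""
    rows = []
    for line in lines:
        record = json.loads(line)
        # Keep only the columns the schema names; ignore everything else.
        rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows


key = partitioned_key("orders", date(2024, 3, 5), "part-0000.json")
rows = read_with_schema(RAW_LINES, READ_SCHEMA)
print(key)
print(rows)
```

The same raw lines could be read later with a different schema (for example, one that also keeps `note`) without rewriting the stored files. A warehouse gives up that flexibility in exchange for enforcing one schema at load time.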