Design a file storage and sharing service like Google Drive or Dropbox
A file storage and sharing service like Google Drive or Dropbox allows users to upload, download, sync, and share files across any number of devices, with automatic background sync that feels instantaneous and conflict handling that never loses data. The system must handle files ranging from a few bytes to tens of gigabytes, scale to billions of files and petabytes of total storage, and provide fine-grained access control without sacrificing performance.
Core challenge: the hardest problems are efficient delta sync (re-uploading only the changed portions of a large file), deduplication across all users and devices, and real-time change notification so that all of a user's devices reflect the latest state within seconds. Naive designs that upload the entire file on every change become unworkable for files over a few megabytes and for users on mobile connections.
File chunking and content addressing: every file is split into fixed-size chunks of 4–8 MB. Each chunk is hashed with SHA-256 to produce a content address. The chunk is stored in object storage (S3 or GCS) keyed by its hash. Because storage is content-addressed, if two different files (or two different users) contain the same chunk, it is stored exactly once, which provides global deduplication for free. The file metadata record stores the ordered list of chunk hashes, not the bytes themselves.
Delta sync: when a file changes, the client computes chunk hashes for the new version and sends only the hash list to the server. The server responds with the chunk hashes it does not yet have. The client uploads only those missing chunks. For a 1 GB file where only the last paragraph changed, this means uploading one 4–8 MB chunk instead of the full gigabyte, a dramatic bandwidth saving. The idea is conceptually similar to rsync's block matching, but it compares whole-chunk hashes at fixed offsets rather than using a rolling checksum, so an insertion near the start of a file shifts every later chunk boundary and forces those chunks to be re-uploaded.
Metadata service: the metadata tier is a sharded relational database (PostgreSQL or MySQL) keyed by user_id. It stores the folder tree, file names, owner, parent folder ID, chunk hash list, version vectors, permission grants, and modification timestamps. This layer answers all folder-browse and search operations. It is separated from blob storage so that metadata operations (rename, move, share) are fast and transactional without touching the blob layer.
Notification and sync service: each connected device maintains a long-lived WebSocket or long-poll connection to a notification service. When a file changes, the metadata service publishes an event to a message bus (Kafka). The notification service consumes those events and pushes change notifications to all devices registered to that user. Each device then fetches updated metadata and downloads only the missing chunks. Notification delivery is at-least-once, so the client must handle duplicate notifications idempotently.
Versioning: each file write appends a new version record pointing to a new chunk hash list. The previous version's records are retained for the version history window (e.g., 30 or 180 days depending on the plan). Restoring a past version simply updates the current-version pointer to a historical record, because the chunks are already in object storage. Old versions are garbage-collected by a background job that ref-counts chunks and deletes unreferenced ones after the retention window closes.
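The chunking, content-addressing, delta-sync, and garbage-collection ideas above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not how any particular service implements it: ChunkStore is a hypothetical in-memory stand-in for the object store, and split_chunks, content_address, and delta_upload are names invented for this example. The demo uses 8-byte chunks so the effect is visible on tiny inputs.

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the low end of the range discussed above


def split_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Cut data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def content_address(chunk: bytes) -> str:
    """The chunk's storage key is the SHA-256 of its bytes."""
    return hashlib.sha256(chunk).hexdigest()


class ChunkStore:
    """Hypothetical in-memory stand-in for an object store keyed by chunk hash."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}

    def missing(self, hashes: list[str]) -> set[str]:
        """Hashes the store does not hold yet (the server's reply in delta sync)."""
        return {h for h in hashes if h not in self.blobs}

    def collect_garbage(self, retained_manifests: list[list[str]]) -> int:
        """Delete chunks referenced by no retained version; return how many were removed."""
        live = {h for manifest in retained_manifests for h in manifest}
        dead = [h for h in self.blobs if h not in live]
        for h in dead:
            del self.blobs[h]
        return len(dead)


def delta_upload(store: ChunkStore, data: bytes,
                 chunk_size: int = CHUNK_SIZE) -> tuple[list[str], int]:
    """Upload only the chunks the store lacks; return (manifest, bytes actually sent)."""
    chunks = split_chunks(data, chunk_size)
    manifest = [content_address(c) for c in chunks]
    needed = store.missing(manifest)
    sent = 0
    for digest, chunk in zip(manifest, chunks):
        if digest in needed:
            store.blobs[digest] = chunk
            needed.discard(digest)  # a chunk repeated inside the file is sent once
            sent += len(chunk)
    return manifest, sent


store = ChunkStore()
v1 = b"A" * 8 + b"B" * 8 + b"C" * 8
v2 = b"A" * 8 + b"B" * 8 + b"D" * 8  # only the last chunk changed
m1, sent1 = delta_upload(store, v1, chunk_size=8)
m2, sent2 = delta_upload(store, v2, chunk_size=8)
removed = store.collect_garbage([m2])  # pretend v1 fell out of the retention window
print(sent1, sent2, removed)  # 24 8 1
```

The second upload sends 8 bytes instead of 24 because the first two chunk hashes are already known to the store. A production garbage collector would typically keep reference counts rather than rescan every manifest; the full scan here is only for brevity.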
Sharing and access control: sharing is modeled as permission grants. A row in the permissions table ties a file_id (or folder_id) to a grantee_id (user or group) with a permission level (viewer, commenter, editor). Every download request re-validates the ACL. Downloads are served via short-lived signed URLs issued by the metadata service and redeemed directly against the object store; the application servers never proxy file bytes, which keeps download latency low and compute costs minimal.
Conflict resolution: when two devices modify the same file while offline and then reconnect, the server detects the conflict by comparing version vectors. The simpler approach (used by Dropbox) is last-write-wins for binary files plus a conflicted copy sibling for the loser, visible in the folder view. For structured formats (Google Docs), an Operational Transformation or CRDT layer merges edits at the character level, a much more complex but user-transparent resolution.
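Conflict detection by version vector can be illustrated as follows. This is a sketch under assumptions: each vector is modeled as a dict mapping a device ID to that device's edit counter, and the names Order, compare, resolve, and conflicted_copy_name are hypothetical, as is the exact wording of the conflicted-copy file name.

```python
from enum import Enum


class Order(Enum):
    EQUAL = "equal"
    BEFORE = "before"
    AFTER = "after"
    CONCURRENT = "concurrent"


def compare(a: dict[str, int], b: dict[str, int]) -> Order:
    """Compare two version vectors; a missing device counts as 0 edits."""
    devices = a.keys() | b.keys()
    a_ahead = any(a.get(d, 0) > b.get(d, 0) for d in devices)
    b_ahead = any(b.get(d, 0) > a.get(d, 0) for d in devices)
    if a_ahead and b_ahead:
        return Order.CONCURRENT  # each side has edits the other has not seen
    if a_ahead:
        return Order.AFTER
    if b_ahead:
        return Order.BEFORE
    return Order.EQUAL


def conflicted_copy_name(filename: str, device: str) -> str:
    """Build a sibling name for the losing version, keeping the extension."""
    stem, dot, ext = filename.rpartition(".")
    if not dot or not stem:  # no extension, or a dotfile such as ".bashrc"
        return f"{filename} ({device}'s conflicted copy)"
    return f"{stem} ({device}'s conflicted copy).{ext}"


def resolve(server_vv: dict[str, int], incoming_vv: dict[str, int]) -> str:
    """Decide what the server does with an incoming write."""
    order = compare(incoming_vv, server_vv)
    if order is Order.AFTER:
        return "accept"            # incoming strictly extends the server's history
    if order is Order.CONCURRENT:
        return "conflicted_copy"   # keep both: one wins, the other becomes a sibling
    return "ignore"                # stale or duplicate write


server = {"laptop": 2, "phone": 1}
print(resolve(server, {"laptop": 3, "phone": 1}))  # accept
print(resolve(server, {"laptop": 1, "phone": 2}))  # conflicted_copy
print(conflicted_copy_name("report.pdf", "phone"))  # report (phone's conflicted copy).pdf
```

Two vectors conflict only when each is ahead of the other on at least one device, which is exactly the "both edited while offline" case described above.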
A complete upload flow: the Dropbox client detects a file save event from the OS. It chunks the new file, computes chunk hashes, and POSTs the hash list to the metadata API. The API responds with the set of unknown hashes. The client uploads those chunks directly to S3 via pre-signed PUT URLs. Once all chunks land, the client calls a commit endpoint that atomically writes the new version record. The metadata service publishes a change event; the notification service pushes it to all of the user's other devices within a second. Those devices download new chunk manifests and fetch missing chunks, completing the sync.
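The commit step and the idempotent client described in this flow can be sketched as below. Everything here is an assumption made for illustration: MetadataServer fuses the metadata API, the object store, and the message bus into one in-memory class (in the flow above the chunk PUT would go straight to object storage via a pre-signed URL), and the method names unknown_hashes, put_chunk, commit, and on_event are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class MetadataServer:
    """In-memory stand-in for the metadata API, object store, and message bus."""
    chunks: dict[str, bytes] = field(default_factory=dict)
    versions: dict[str, list[list[str]]] = field(default_factory=dict)  # path -> manifests
    events: list[tuple[str, int]] = field(default_factory=list)         # (path, version)

    def unknown_hashes(self, manifest: list[str]) -> set[str]:
        return {h for h in manifest if h not in self.chunks}

    def put_chunk(self, digest: str, body: bytes) -> None:
        # Reject bodies that do not match their claimed content address.
        if hashlib.sha256(body).hexdigest() != digest:
            raise ValueError("chunk body does not match its hash")
        self.chunks[digest] = body

    def commit(self, path: str, manifest: list[str]) -> int:
        """Append a version only if every chunk has landed, then publish an event."""
        if self.unknown_hashes(manifest):
            raise RuntimeError("commit refused: some chunks are not uploaded yet")
        history = self.versions.setdefault(path, [])
        history.append(list(manifest))
        version = len(history)
        self.events.append((path, version))
        return version


def upload(server: MetadataServer, path: str, data: bytes, chunk_size: int) -> int:
    """Client side: send the hash list, upload unknown chunks, then commit."""
    pieces = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    manifest = [hashlib.sha256(p).hexdigest() for p in pieces]
    by_hash = dict(zip(manifest, pieces))
    for digest in server.unknown_hashes(manifest):
        server.put_chunk(digest, by_hash[digest])
    return server.commit(path, manifest)


class Device:
    """Another device of the same user; tolerates at-least-once notifications."""

    def __init__(self) -> None:
        self.local_chunks: dict[str, bytes] = {}
        self.applied: dict[str, int] = {}  # path -> highest version applied

    def on_event(self, server: MetadataServer, path: str, version: int) -> bool:
        if self.applied.get(path, 0) >= version:
            return False  # duplicate or stale notification: nothing to do
        for digest in server.versions[path][version - 1]:
            if digest not in self.local_chunks:  # fetch only what is missing locally
                self.local_chunks[digest] = server.chunks[digest]
        self.applied[path] = version
        return True


server = MetadataServer()
upload(server, "/notes.txt", b"hello world, hello sync", chunk_size=8)
phone = Device()
path, version = server.events[0]
print(phone.on_event(server, path, version))  # True
print(phone.on_event(server, path, version))  # False (redelivered event is ignored)
```

Refusing the commit while chunks are missing is what keeps a half-finished upload from ever becoming a visible version; tracking the highest applied version per path is one simple way to make the client idempotent.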
Key trade-off (chunk size vs deduplication ratio): smaller chunks (512 KB) yield a higher deduplication ratio because common sub-file patterns are more likely to match, but they create far more metadata rows and increase the overhead of index lookups. Larger chunks (16–32 MB) minimize metadata cost but reduce dedup effectiveness and increase the minimum re-upload size when a file changes. Dropbox originally used 4 MB and experimentally tuned upward for larger file types. The right answer depends on your users' file-size distribution and how much metadata overhead your database can sustain.
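A rough back-of-the-envelope calculation makes the metadata side of this trade-off concrete. The figures are assumptions chosen for illustration: 1 PiB of unique stored data, and 64 bytes per chunk-index row (a 32-byte SHA-256 digest plus bookkeeping). The function name chunk_index_cost is hypothetical, and the estimate ignores per-file manifests and any savings from deduplication.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3
PIB = 1024 ** 5


def chunk_index_cost(total_bytes: int, chunk_size: int,
                     bytes_per_row: int = 64) -> tuple[int, int]:
    """Return (index rows, index bytes) for storing total_bytes in chunk_size pieces."""
    rows = -(-total_bytes // chunk_size)  # ceiling division
    return rows, rows * bytes_per_row


for label, size in [("512 KiB", 512 * KIB), ("4 MiB", 4 * MIB), ("32 MiB", 32 * MIB)]:
    rows, index_bytes = chunk_index_cost(PIB, size)
    print(f"{label:>8}: {rows:>13,} rows, {index_bytes / GIB:6.1f} GiB index, "
          f"min re-upload {label}")
# 512 KiB: 2,147,483,648 rows,  128.0 GiB index, min re-upload 512 KiB
#   4 MiB:   268,435,456 rows,   16.0 GiB index, min re-upload 4 MiB
#  32 MiB:    33,554,432 rows,    2.0 GiB index, min re-upload 32 MiB
```

Under these assumptions, going from 4 MiB to 512 KiB chunks multiplies the index by eight, while going to 32 MiB shrinks it by eight but raises the smallest possible re-upload to 32 MiB.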
Further topics worth covering: client-side encryption (end-to-end encrypted services like Tresorit encrypt chunks before upload, which eliminates server-side deduplication because identical plaintext produces different ciphertexts per user key); storage tiering (infrequently accessed files migrate to cold storage like S3 Glacier after a configurable idle period); quota enforcement (soft quotas checked at upload time, hard quotas enforced via pre-commit validation); virus scanning as a post-upload sidecar; and full-text search (text files and extracted PDF/DOCX content indexed in Elasticsearch).
Lead with the chunking and content-addressing model — it simultaneously solves delta sync, deduplication, and efficient versioning, which makes it the single most important idea in this design. When discussing conflict resolution, acknowledge that last-write-wins with a conflicted copy is pragmatically correct for binary files (the only safe merge for opaque content), and note that character-level merging requires OT or CRDTs, citing Google Docs as the example.
Mentioning that downloads are served via signed S3 URLs (so app servers never proxy bytes) signals production-level experience.