Requirements
Product Requirements
- Post tweet
- Timeline
- User Timeline
- Home Timeline
- User Follow Relation
Extra:
- Trends
- Search
Timeline
- normal users (few followers)
- push model
special case:
- celebrity / "big V" accounts with huge follower counts
- pull model with Redis cache (see the hybrid fan-out sketch below)
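A minimal sketch of the hybrid fan-out, assuming plain in-memory dicts as stand-ins for the Redis timeline cache and the social-graph store; FOLLOWER_THRESHOLD and all helper names are illustrative, not an actual Twitter API:

```python
from collections import defaultdict

FOLLOWER_THRESHOLD = 10_000      # assumed cutoff between "normal" and celebrity accounts

followers = defaultdict(set)     # user_id -> follower ids (social graph)
followees = defaultdict(set)     # user_id -> followed ids
user_tweets = defaultdict(list)  # author_id -> [(ts, tweet_id)], the author's own tweets
home_cache = defaultdict(list)   # user_id -> [(ts, tweet_id)], pre-materialized timeline (push)

def follow(follower_id, followee_id):
    followers[followee_id].add(follower_id)
    followees[follower_id].add(followee_id)

def post_tweet(author_id, tweet_id, ts):
    user_tweets[author_id].append((ts, tweet_id))
    # Push model: fan out on write only for accounts with few followers.
    if len(followers[author_id]) < FOLLOWER_THRESHOLD:
        for f in followers[author_id]:
            home_cache[f].append((ts, tweet_id))

def home_timeline(user_id, limit=20):
    # Pull model: merge celebrity tweets at read time with the pushed entries.
    entries = list(home_cache[user_id])
    for author in followees[user_id]:
        if len(followers[author]) >= FOLLOWER_THRESHOLD:
            entries.extend(user_tweets[author])
    return sorted(entries, reverse=True)[:limit]   # newest first
```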
Trends
- hotspot ranking
#hashtag
- Kafka stream (ingest)
- Apache Storm (stream processing; windowed count sketch below)
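In production this would be a Kafka topic feeding a Storm topology; the sketch below shows only the core idea (a sliding-window hashtag count) with an in-process consumer, and WINDOW_SECONDS is an assumed parameter:

```python
import re
import time
from collections import Counter, deque

WINDOW_SECONDS = 5 * 60          # assumed trending window
HASHTAG = re.compile(r"#\w+")

window = deque()                 # (timestamp, hashtag) events inside the window
counts = Counter()

def consume(tweet_text, ts=None):
    """Stand-in for a Kafka consumer callback feeding a Storm-style counting bolt."""
    ts = ts if ts is not None else time.time()
    for tag in HASHTAG.findall(tweet_text.lower()):
        window.append((ts, tag))
        counts[tag] += 1
    # Expire events that fell out of the sliding window.
    while window and window[0][0] < ts - WINDOW_SECONDS:
        _, old = window.popleft()
        counts[old] -= 1
        if counts[old] == 0:
            del counts[old]

def trending(k=10):
    return [tag for tag, _ in counts.most_common(k)]
```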
Search
- inverted index
- key = word, value = list of tweet_ids
- scatter/gather across index shards (sketch below)
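A toy sketch of the inverted index and the scatter/gather query fan-out; SearchShard and scatter_gather are illustrative names, and real shards would be queried in parallel over RPC:

```python
from collections import defaultdict

# Each search shard holds an inverted index: word -> set of tweet_ids.
class SearchShard:
    def __init__(self):
        self.index = defaultdict(set)

    def add(self, tweet_id, text):
        for word in text.lower().split():
            self.index[word].add(tweet_id)

    def query(self, word):
        return self.index.get(word.lower(), set())

def scatter_gather(shards, word):
    # Scatter the query to every shard, then gather and merge the partial results.
    results = set()
    for shard in shards:          # in practice: parallel RPCs, not a sequential loop
        results |= shard.query(word)
    return results
```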
Engineering Requirements
high availability
- eventually consistent
low latency
- 200ms for timeline generation
large scale
- distributed data storage
heavy read
- read:write = 1000:1
Service & API Design
- Write New Tweet
POST /v0/tweet --data {
    "user_id": "123",
    "auth_token": "ABC123",
    "tweet_data": "hello world!",
    "media_ids": "ABC987"
}
Response:
{
    "created_at": "Wed Sep 05 00:37:15 +0000 2012",
    "status": "hello world!",
    "tweet_id": "987",
    "user_id": "123",
    ...
}
writeTweet(tweet_id, user_id, tweet_data, timestamp, location, media_ids)
- Read Timeline
GET /v0/home?user_id=123
Response:
[
    { "user_id": "456", "tweet_id": "123", "status": "foo" },
    { "user_id": "789", "tweet_id": "456", "status": "bar" },
    { "user_id": "789", "tweet_id": "579", "status": "baz" }
]
getTimeline(user_id, timestamp)
- Read Followers List
GET /v0/users/123/followers
Response:
{
    "user_id": "456",
    "nickname": "papa",
    "profile_img_url": "***",
    ...
}
getFollowers(user_id)
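One possible HTTP facade for these three endpoints, sketched with FastAPI purely for illustration (the framework choice and the placeholder return values are assumptions, not part of the design above):

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class NewTweet(BaseModel):
    user_id: str
    auth_token: str
    tweet_data: str
    media_ids: str | None = None

@app.post("/v0/tweet")
def write_tweet(tweet: NewTweet):
    # Would validate auth_token, persist the tweet, and trigger timeline fan-out.
    tweet_id = "987"                                        # placeholder id
    return {"tweet_id": tweet_id, "user_id": tweet.user_id, "status": tweet.tweet_data}

@app.get("/v0/home")
def get_timeline(user_id: str):
    # Would return the pre-materialized home timeline for this user.
    return [{"user_id": "456", "tweet_id": "123", "status": "foo"}]

@app.get("/v0/users/{user_id}/followers")
def get_followers(user_id: str):
    return [{"user_id": "456", "nickname": "papa"}]
```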
Back-of-the-envelope Calculation
users: 300M+
write/post tweet: 600 tweets/s
read: 600,000 tweets/s
read:write = 1000:1
- read heavy
Ingress (new data storage) estimate:
assumption:
- 100M new tweets per day
- text: 140 char = 280 B, metadata = 30 B
- photo: 1 photo / 5 tweets
- video: 1 video / 10 tweets
text: 100M * (280 + 30) bytes => 30GB/day
photos: (100M/5 photos * 200KB) ~= 4TB/day
videos: (100M/10 videos * 2MB) ~= 20TB/day
Egress (read data) estimate:
assumption:
- 200M DAU
- 7 timelines per day
- 20 tweets per timeline
- users watch only 1/3 of the videos
total tweets read: 200M DAU * (7 timelines * 20 tweets) => 28B tweets/day
text: 28B * (280 + 30) bytes ~= 8.68TB/day
photos: (28B/5 photos * 200KB) ~= 1.12PB/day
videos: (28B/10/3 videos * 2MB) ~= 1.87PB/day
throughput estimate:
- ingress = (30GB + 4TB + 20TB) / 86400 ~= 280MB/s
- egress = (8.6TB + 1.12PB + 1.87PB) / 86400 ~= 35GB/s
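The same back-of-the-envelope arithmetic as a runnable sanity check:

```python
DAY = 86_400                                   # seconds per day

# Ingress (new data written per day)
new_tweets = 100e6
text    = new_tweets * (280 + 30)              # ~31 GB/day
photos  = new_tweets / 5 * 200e3               # ~4 TB/day
videos  = new_tweets / 10 * 2e6                # ~20 TB/day
ingress = text + photos + videos
print(round(ingress / DAY / 1e6), "MB/s")      # ~280 MB/s

# Egress (data read per day)
tweets_read = 200e6 * 7 * 20                   # 200M DAU * 7 timelines * 20 tweets = 28B
r_text   = tweets_read * (280 + 30)            # ~8.68 TB/day
r_photos = tweets_read / 5 * 200e3             # ~1.12 PB/day
r_videos = tweets_read / 10 / 3 * 2e6          # ~1.87 PB/day (only 1/3 of videos watched)
egress   = r_text + r_photos + r_videos
print(round(egress / DAY / 1e9), "GB/s")       # ~35 GB/s
```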
Data Model Design
SQL vs NoSQL
SQL
- user table
- tweets table
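A minimal relational sketch of the user and tweets tables; the exact column set is an assumption, shown here with sqlite3 just to make the schema concrete:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE users (
    user_id     INTEGER PRIMARY KEY,
    nickname    TEXT NOT NULL,
    created_at  TEXT NOT NULL
);
CREATE TABLE tweets (
    tweet_id    INTEGER PRIMARY KEY,
    user_id     INTEGER NOT NULL REFERENCES users(user_id),
    tweet_data  TEXT NOT NULL,
    media_ids   TEXT,
    created_at  TEXT NOT NULL
);
-- Supports the user-timeline query: tweets of one author, newest first.
CREATE INDEX idx_tweets_user_time ON tweets(user_id, created_at);
""")
```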
NoSQL
Graph DB
- followers social graph
- FlockDB is a distributed graph datastore built for fast graph traversals: it stores adjacency lists, supports a high rate of add/remove/update operations, paginates through millions of entries, scales horizontally, and runs graph-walking queries.
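The core idea, sketched with plain in-memory adjacency lists (a stand-in for FlockDB, not its API): forward edges answer "who does X follow", backward edges answer "who follows X", and pagination keeps huge follower lists manageable:

```python
from collections import defaultdict

following = defaultdict(set)   # user_id -> users they follow (forward edges)
followers = defaultdict(set)   # user_id -> users following them (backward edges)

def follow(src, dst):
    following[src].add(dst)
    followers[dst].add(src)

def unfollow(src, dst):
    following[src].discard(dst)
    followers[dst].discard(src)

def follower_page(user_id, cursor=0, page_size=100):
    # Paginate through a potentially huge follower list; cursor is just an offset here.
    ordered = sorted(followers[user_id])
    page = ordered[cursor:cursor + page_size]
    return page, cursor + len(page)
```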
High-level Architecture Diagram
Component Design & Tradeoff
database design
large scale & distributed
sharding
- shard by user_id
- shard by tweet_id
- shard by create_time
- shard by tweet_id + create_time (Snowflake-style ID; sketch below)
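One way to make "shard by tweet_id + create_time" concrete is a Snowflake-style ID that embeds the creation timestamp; the bit layout below is an assumption for illustration, not Twitter's exact scheme:

```python
import threading
import time

# Snowflake-style 64-bit id: 41 bits of milliseconds since a custom epoch,
# 10 bits of worker/shard id, 12 bits of per-millisecond sequence.
EPOCH_MS = 1288834974657       # commonly cited Twitter epoch, used here as an assumption

class IdGenerator:
    def __init__(self, worker_id):
        self.worker_id = worker_id & 0x3FF
        self.sequence = 0
        self.last_ms = -1
        self.lock = threading.Lock()

    def next_id(self):
        with self.lock:
            now = int(time.time() * 1000) - EPOCH_MS
            if now == self.last_ms:
                self.sequence = (self.sequence + 1) & 0xFFF   # rollover handling omitted
            else:
                self.sequence = 0
                self.last_ms = now
            return (now << 22) | (self.worker_id << 12) | self.sequence

def shard_of(tweet_id, num_shards):
    # Hash the id to spread write load; the embedded timestamp still gives time ordering.
    return tweet_id % num_shards
```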
replication & fault tolerance
- read write separate
- master/slave replication cluster
- write to master, master replicate to slaves
- read from master or slaves
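A small sketch of the read/write split, assuming DB-API-style connection objects for the master and replicas:

```python
import random

class ReplicatedStore:
    """Route writes to the master and reads to a replica (falling back to the master)."""
    def __init__(self, master, replicas):
        self.master = master
        self.replicas = replicas

    def write(self, sql, params=()):
        # Master asynchronously replicates to the slaves.
        return self.master.execute(sql, params)

    def read(self, sql, params=()):
        # Replica reads may be slightly stale (eventual consistency).
        node = random.choice(self.replicas) if self.replicas else self.master
        return node.execute(sql, params)
```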
load balancer design
dispatch strategy
- L7 (per request): power of two choices (randomly pick 2 servers, route to the less loaded one; sketch below)
- L4 (per connection/session):
- TBD
- ensure workload distributed evenly among all servers
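A sketch of the "power of two choices" dispatch, with outstanding request count as an assumed load signal:

```python
import random

class Backend:
    def __init__(self, name):
        self.name = name
        self.active_requests = 0      # assumed load signal: outstanding requests

def pick_backend(backends):
    # Power of two choices: sample two servers at random, send to the less loaded one.
    a, b = random.sample(backends, 2)
    return a if a.active_requests <= b.active_requests else b
```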
cache design
cache replacement policy
- LRU
- last 3 days
cache update strategy
- cache-aside (read path) + write-behind / write-back (write path)
- write tweet_id & user_id to the cache synchronously, then asynchronously write the whole tweet to the DB (sketch below)
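A sketch of the combined cache-aside read path and write-behind write path, using a dict and a queue as stand-ins for Redis and the DB writer; db_lookup / db_insert are hypothetical callables:

```python
import queue
import threading

tweet_cache = {}              # stand-in for Redis: tweet_id -> tweet dict
write_queue = queue.Queue()   # pending DB writes (write-behind)

def post_tweet(tweet):
    # Write path: populate the cache synchronously, persist to the DB asynchronously.
    tweet_cache[tweet["tweet_id"]] = tweet
    write_queue.put(tweet)

def get_tweet(tweet_id, db_lookup):
    # Read path (cache-aside): on a miss, load from the DB and backfill the cache.
    tweet = tweet_cache.get(tweet_id)
    if tweet is None:
        tweet = db_lookup(tweet_id)          # db_lookup is a hypothetical callable
        if tweet is not None:
            tweet_cache[tweet_id] = tweet
    return tweet

def db_writer(db_insert):
    # Background worker that drains the write-behind queue into the database.
    while True:
        db_insert(write_queue.get())         # db_insert is a hypothetical callable

# threading.Thread(target=db_writer, args=(insert_fn,), daemon=True).start()
```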
data schema
- native list (per-user timeline stored in the Redis cache)
- key: user_id, value: "<tweet_id 8B><user_id 8B><misc 4B>"
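The 20-byte entry can be packed with struct; the field widths follow the layout above, and the Redis commands in the comment are just one possible way to store the list:

```python
import struct

# One timeline entry packed into 20 bytes: 8-byte tweet_id, 8-byte user_id, 4 bytes misc.
ENTRY = struct.Struct(">QQI")

def pack_entry(tweet_id, user_id, misc=0):
    return ENTRY.pack(tweet_id, user_id, misc)

def unpack_entry(raw):
    return ENTRY.unpack(raw)    # -> (tweet_id, user_id, misc)

# A user's home timeline can then be a Redis list of these fixed-size byte strings,
# e.g. LPUSH home:{user_id} <entry>, bounded with LTRIM.
```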
References
- https://github.com/donnemartin/system-design-primer/tree/master/solutions/system_design/twitter
- https://github.com/donnemartin/system-design-primer#sql-or-nosql
- https://massivetechinterview.blogspot.com/search?q=twitter
- https://www.youtube.com/watch?v=wYk0xPP_P_8&t=1982s
- Why is Twitter not using NoSQL?
- What Database Does Twitter Use? – A Deep Dive
- Free Source Code – Twitter Database Server: MySQL Database Schema
- Deterministic Aperture: A distributed, load balancing algorithm