Design Twitter


Requirements

Product Requirements

  • Post tweet
  • Timeline
    • User Timeline
    • Home Timeline
  • User Follow Relation
    Extra:
  • Trends
  • Search

Timeline

  • normal users (few followers)
  • push model

special case:

  • celebrity/big V
  • pull model w redis cache

Trends

  • hotspot ranking
  • #hashtag
  • kafka stream
  • apache storm

Search

  • inverted index
  • k = word, value = tweet_ids
  • scatter/gather

Engineering Requirements

  • high availability

    • eventual consistent
  • low latency

    • 200ms for timeline generation
  • large scale

    • distrubuted data storage
  • heavy read

    • read:write = 1000:1

Service & API Design

  1. Write New Tweet
    POST /v0/tweet
    --data
    { "user_id": "123", "auth_token": "ABC123", \
        "tweet_data": "hello world!", "media_ids": "ABC987" }
    
    Response:
    {
        "created_at": "Wed Sep 05 00:37:15 +0000 2012",
        "status": "hello world!",
        "tweet_id": "987",
        "user_id": "123",
        ...
    }
    
    writeTweet(tweet_id, user_id, tweet_data, timestamp, location, media_ids)
  2. Read Timeline
    GET /v0/home?user_id=123
    
    Response:
    {
        "user_id": "456",
        "tweet_id": "123",
        "status": "foo"
    },
    {
        "user_id": "789",
        "tweet_id": "456",
        "status": "bar"
    },
    {
        "user_id": "789",
        "tweet_id": "579",
        "status": "baz"
    },
    
    getTimeline(user_id, timestamp)
  3. Read Followers List
    GET /v0/user_id=123/followers
    
    Response:
    {
        "user_id": "456",
        "nickname": "papa",
        "profile_img_url": "***",
        ...
    }
    
    getFollowers(user_id)

Back-of-the-envelop Calculation

users: 300M+
write/post tweet: 600 tweets/s
read: 600,000 tweets/s

read:write = 1000:1

  • read heavy

Ingress (new data storage) estimate:
assumption:

  • 100M new tweets per day
  • text: 140 char = 280 B, metadata = 30 B
  • photo: 1 photo / 5 tweets
  • video: 1 video / 10 tweets

text: 100M * (280 + 30) bytes => 30GB/day
photos: (100M/5 photos * 200KB) ~= 4TB/day
videos: (100M/10 videos * 2MB) ~= 20TB/day

egress (read data) estimate:
assumption:

  • 200M DAU
  • 7 timeline per day
  • 20 tweets per timeline
  • user only watch 1/3 of the videos

total tweets read: 200M DAU * ((2 + 5) * 20 tweets) => 28B/day
text: 28B * (280 + 30) bytes ~= 8.68TB
photos: (28B/5 photos * 200KB) ~= 1.12PB/day
videos: (28B/10/3 videos * 2MB) ~= 1.87PB/day

throughput estimate:

  • ingress = (30GB + 4TB + 20TB) / 86400 ~= 290MB/s
  • egress = (8.6TB + 1.12PB + 1.87PB) / 86400 ~= 35GB/s

Data Model Design

SQL vs NoSQL

SQL

  • user table
  • tweets table
    sql schema

NoSQL
Graph DB

  • followers social graph
  • FlockDB is a distributed graph datastore built for fast graph traversals, storing adjacency lists, supporting a high rate of add remove update operations, paginating through millions of entries, horizontally scaling, running graph walking queries.

High-level Architecture Diagram

twitter arch diagram

Component Design & Tradeoff

database desgin

large scale & distributed
sharding

  • shard by user_id
  • shard by tweet_id
  • shard by create_time
  • shard by tweet_id+create_time

replication & fault tolorence

  • read write separate
  • master/slave replication cluster
  • write to master, master replicate to slaves
  • read from master or slaves

load balancer design

dispatch strategy

cache design

cache replacement policy

  • LRU
  • last 3 days

cache update strategy

  • cache aside + write-behind(write-back)
  • write tweet_id & user_id to cache, write async whole tweet into DB

data schema

  • native list
  • key: user_id, value: "<tweet_id 8B><user_id 8B><Mics 4B>"

References

https://github.com/donnemartin/system-design-primer/tree/master/solutions/system_design/twitter

https://github.com/donnemartin/system-design-primer/blob/master/solutions/system_design/scaling_aws/README.md

https://github.com/donnemartin/system-design-primer#sql-or-nosql

https://massivetechinterview.blogspot.com/search?q=twitter

https://www.youtube.com/watch?v=wYk0xPP_P_8&t=1982s

Why is Twitter not using NoSQL?

What Database Does Twitter Use? – A Deep Dive

Free Source Code – Twitter Database Server: MySQL Database Schema

Caching with Twemcache

Deterministic Aperture: A distributed, load balancing algorithm


Author: Zijun Zhou
Reprint policy: All articles in this blog are used except for special statements CC BY 4.0 reprint policy. If reproduced, please indicate source Zijun Zhou !
  TOC