Blob File System vs. GFS/HDFS

Similarities

  • Master nodes
    • Both GFS and TFS (Taobao File System) use a single master node with multiple slave nodes.
    • My comment: Facebook does not use this approach? I think it uses a P2P approach (e.g., Cassandra, described below).

Differences

  • Functions
    • GFS/HDFS are more general-purpose and more widely used
      • On top of them, we can build table stores such as BigTable, Hypertable, and HBase.
    • Blob File System
      • Usually used for photos, albums, etc. (so-called blob data)
  • Challenges
    • Blob FS
      • For each write, the client asks the master node to assign a blob number and a list of machines to write to.
      • Challenge
        • The volume of metadata is huge.
          • E.g., Taobao stores more than 10 billion photos; assuming 20 bytes of metadata per photo, the total is 20 bytes x 10 billion = 200 GB, far more than the memory of a single machine.
      • Solution
        • The metadata is not stored in the Blob FS itself.
        • The metadata is stored in external systems.
          • E.g., Taobao TFS assigns an id to each photo, and the ids are stored in external databases such as an Oracle or MySQL sharding cluster.
        • The Blob FS uses chunks to organize data (see the sketch after this list).
          • Every blob file is a logical file.
          • Every chunk is a physical file.
          • Multiple logical files share one physical file, which reduces the number of physical files.
            • All metadata for the physical files fits in memory, so every read of a blob file needs only one disk I/O.
    • HDFS/GFS
      • GFS v2 may be able to combine GFS and a Blob FS into a single system, but this is difficult, since
        • It needs to support both large and small files.
        • The metadata becomes too large for a single machine, so the master itself also needs to be distributed.
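
The chunk-index idea above can be summarized in a small sketch. This is a minimal illustration with invented names (BlobIndex, ChunkLocation), not TFS's actual code; it assumes the photo-id-to-blob-id mapping lives in an external database and shows only the in-memory blob index that makes each read a single disk I/O.

    import java.io.RandomAccessFile;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical sketch of a blob store's in-memory index.
    class ChunkLocation {
        final String chunkFile;  // large physical file shared by many blobs
        final long offset;       // byte offset of the blob inside that file
        final int length;        // blob size in bytes
        ChunkLocation(String chunkFile, long offset, int length) {
            this.chunkFile = chunkFile; this.offset = offset; this.length = length;
        }
    }

    class BlobIndex {
        // blob id -> location; small enough to keep entirely in memory
        private final Map<Long, ChunkLocation> index = new ConcurrentHashMap<>();

        void put(long blobId, ChunkLocation loc) { index.put(blobId, loc); }

        byte[] read(long blobId) throws Exception {
            ChunkLocation loc = index.get(blobId);
            if (loc == null) throw new IllegalArgumentException("unknown blob " + blobId);
            try (RandomAccessFile f = new RandomAccessFile(loc.chunkFile, "r")) {
                byte[] buf = new byte[loc.length];
                f.seek(loc.offset);   // one seek + one read per blob
                f.readFully(buf);
                return buf;
            }
        }
    }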

Google File System (GFS) vs. HDFS

Overview

HDFS is a simplified version of GFS.

Similarities

  • Master and Slaves
    • Both GFS and HDFS use a single-master + multiple-slaves architecture.
    • The master node handles checkpointing, data migration, and logging.
  • Data blocks and replication
    • Both maintain multiple copies of each block (usually 3) for better reliability and performance.
  • Tree structure
    • Both maintain a tree-structured file system and allow operations similar to those on a Linux system:
      • copy, rename, move, delete, etc.

Differences

  • File appends
    • GFS
      • Allows appends to the same file by multiple clients simultaneously.
      • If every append had to go through the master node, it would be inefficient.
        • GFS uses a lease mechanism to delegate write permission for a chunk to a chunk server (the primary).
        • The chunk server can accept writes to that chunk while the lease is valid (the initial lease timeout in the GFS paper is 60s).
        • Since multiple clients may append simultaneously and the API is asynchronous, records may be interleaved in an unpredictable order; this makes the system design very complicated.
    • HDFS
      • Only allows a single writer to open a file and append data.
      • The client first writes data to a local temporary file; when the temporary data reaches the size of a chunk (64 MB), it asks the HDFS master to assign a chunk number and target machines and then writes the chunk data out (see the sketch after this list).
      • Advantage
        • The master does not become a bottleneck, since it is contacted only once per 64 MB of accumulated data.
      • Disadvantage
        • If the client machine goes down during this process, the locally buffered data has not yet reached HDFS, so some data may be lost.
  • Master failure
    • GFS
      • Has backup master nodes; when the primary master fails, a new master is elected from the backups.
      • Supports snapshots using a copy-on-write approach.
    • HDFS
      • HDFS requires human intervention when the master fails.
      • Does not support snapshots.
  • Garbage Collection (GC)
    • GFS
      • Lazy GC.
      • A deleted file is first marked (e.g., renamed to a name that contains the deletion time), so normal users can no longer access it.
      • The master node periodically scans these files and permanently deletes the expired ones (usually files marked more than 3 days earlier).
    • HDFS
      • HDFS uses a simple, direct delete mechanism.
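
The HDFS-side buffering described under "File appends" can be sketched as follows. This is an assumed illustration (BufferedChunkWriter, MasterClient, and ChunkAssignment are invented names, not the real HDFS client API): writes accumulate in a local buffer, the master is contacted only when a full 64 MB chunk is ready, and the sketch also shows why a client crash can lose the buffered data.

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.util.List;

    // Hypothetical stubs standing in for the master / chunk-server RPCs.
    interface MasterClient {
        ChunkAssignment allocateChunk(long size) throws IOException;
    }

    class ChunkAssignment {
        final long chunkId;
        final List<String> replicaHosts;
        ChunkAssignment(long chunkId, List<String> replicaHosts) {
            this.chunkId = chunkId; this.replicaHosts = replicaHosts;
        }
    }

    class BufferedChunkWriter {
        static final int CHUNK_SIZE = 64 * 1024 * 1024;  // the 64 MB chunk size cited above
        private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        private final MasterClient master;

        BufferedChunkWriter(MasterClient master) { this.master = master; }

        void append(byte[] data) throws IOException {
            buffer.write(data);                // buffered locally, not yet in HDFS
            if (buffer.size() >= CHUNK_SIZE) {
                flushChunk();                  // master contacted once per 64 MB
            }
        }

        private void flushChunk() throws IOException {
            ChunkAssignment a = master.allocateChunk(buffer.size());
            for (String host : a.replicaHosts) {
                pushToChunkServer(host, a.chunkId, buffer.toByteArray());
            }
            buffer.reset();  // if the client crashes before the flush completes, this data is lost
        }

        private void pushToChunkServer(String host, long chunkId, byte[] data) {
            // Placeholder: a real client would stream the bytes to the chunk server here.
        }
    }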

    Facebook Distributed System Optimization via Asynchronous Execution: Big Data Queries

    Asynchronous Queries

    • Asynchronous queries
      • Each query is asynchronous; every function returns a "future object".
      • Every DB query is divided into two parts:
        • Send the request
        • Receive the response
    • Future Object tree
      • Each future object has two states:
        • Waiting for execution
        • Finished execution
      • Once the tree structure is constructed, execution proceeds from the leaves up to the root (see the sketch after this list).
        • When the root finishes executing, the page load is complete.
      • Lazy evaluation
        • Execution is lazy: the tree is constructed first and only then executed. This is similar to Spark, where the transformations form a DAG and a node's predecessors are executed only when that node's result is needed.
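
    A small Java illustration of this future-tree idea (the source article is not Java-specific; the query names here are invented). Leaves produce their results independently, the inner node's callback runs only when both children complete, and the root's completion corresponds to the page being fully loaded. Note that unlike the fully lazy scheme described above, supplyAsync starts the leaf work immediately; only the dependent nodes are deferred.

        import java.util.concurrent.CompletableFuture;

        public class FutureTreeDemo {
            public static void main(String[] args) {
                // Leaves: independent queries are kicked off first (send request).
                CompletableFuture<String> profile = CompletableFuture.supplyAsync(() -> "profile-data");
                CompletableFuture<String> friends = CompletableFuture.supplyAsync(() -> "friend-list");

                // Inner node: depends on both leaves and runs only after they finish.
                CompletableFuture<String> feed =
                        profile.thenCombine(friends, (p, f) -> "feed(" + p + ", " + f + ")");

                // Root: when this future completes, the "page" is fully loaded.
                CompletableFuture<String> page = feed.thenApply(body -> "<html>" + body + "</html>");

                System.out.println(page.join());
            }
        }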

    Memcache

    • The decision of which query should be executed first should not be hard-coded during development.
      • Instead, there should be a separate phase that determines this kind of schedule.
    • Why the order of execution matters
      • For example, suppose we have two queries: one finds your friends who have bought something on Taobao, and the other finds your friends who have bought a Porsche on Taobao. Intuitively, we would first look up your friends on Taobao and then apply the other condition, like this:
        IdList friends = waitFor(getFriends(myId));
        yield return getTaoBaoBuyers(friends);
        For the Porsche query, however, this order is wrong: very few people on Taobao have bought a Porsche (perhaps one or two), while you may have hundreds of friends on Taobao. So the Porsche query is better executed in this order:
        IdList buyers = waitFor(getPorscheBuyer());
        yield return getFriends(buyers);

    Reference
    [1] http://www.infoq.com/cn/news/2015/04/async-distributed-haiping

    Facebook Cassandra

    Introduction

    Cassandra is an open-source distributed database system designed for storing and managing large amounts of data across commodity servers.
    • Provides scalability and high availability without compromising performance.
    • Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data.
    • Supports replication across multiple datacenters and provides low latency for users.
    • Offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

    History

    • Combines the data model of Google BigTable with the distributed-systems design of Amazon Dynamo

    Architecture

    • Overview
      • Peer-to-peer distributed system
      • All nodes are identical (no special master)
      • Data is partitioned among all the nodes in the cluster
      • Custom data replication to ensure fault tolerance
      • Read/write-anywhere design
    • Communication
      • Nodes communicate with one another through the Gossip protocol, which exchanges state information across the cluster every second
      • A commit log on each node captures write activity, so data durability is assured
      • Data is also written to an in-memory structure (a memtable) and then flushed to disk as an SSTable once the in-memory structure is full (see the sketch after this list)
    • Advantages
      • Scalability
        • Since it is a peer-to-peer system, nodes can easily be added to grow from TBs to PBs
      • Fault tolerance
        • Replication
      • Post-relational database
        • Since it uses column indexes, it has a dynamic schema design that allows for much more flexible data storage
      • Data compression
        • Uses Google's Snappy data compression algorithm
        • Compresses data at the per-column-family level
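
    The write path described under "Communication" (commit log for durability, memtable in memory, flush to an SSTable when full) can be sketched roughly as below. The class and file names are invented for illustration; this is not Cassandra's actual code.

        import java.io.FileWriter;
        import java.io.IOException;
        import java.util.Map;
        import java.util.TreeMap;

        class TinyWritePath {
            private static final int MEMTABLE_LIMIT = 4;           // tiny limit, for illustration only
            private final TreeMap<String, String> memtable = new TreeMap<>();
            private int sstableSeq = 0;

            void write(String key, String value) throws IOException {
                appendLine("commit.log", key + "=" + value);        // durability: commit log first
                memtable.put(key, value);                           // fast in-memory update
                if (memtable.size() >= MEMTABLE_LIMIT) {
                    flush();                                        // memtable full -> write an SSTable
                }
            }

            private void flush() throws IOException {
                String name = "sstable-" + (sstableSeq++) + ".dat"; // sorted, immutable on-disk table
                for (Map.Entry<String, String> e : memtable.entrySet()) {
                    appendLine(name, e.getKey() + "=" + e.getValue());
                }
                memtable.clear();
            }

            private void appendLine(String file, String line) throws IOException {
                try (FileWriter w = new FileWriter(file, true)) {
                    w.write(line + "\n");
                }
            }
        }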

    Current status

    Facebook originally used Cassandra for Inbox search; it has since switched to HBase for that workload.

    Facebook Photo Storage

    1. Introduction

    Binary Large OBject

    – Unstructured: e.g., videos, JPEG images
    – Immutable: the contents of a post are not modified after upload

    Current scale

    – Over 400 billion photos stored in the blob system
    – The world’s largest photo store

    Blob hotness and age

    – Accesses cool down over time
    – Data is divided into hot data and warm data

    2. Storage Solutions

    Haystack 2008

    • High throughput
      • In-memory index (blobs are grouped into large files)
      • Single disk I/O per request
      • Multiple copies
    • Failure tolerance
      • RAID-6
        • 2 redundant drives
        • 1.2x overhead per copy
        • With 3 hosts (3 copies): 3.6x raw space per usable byte, e.g., 100 PB raw ≈ 28 PB usable (see the quick check after this list)
      • Multiple copies
        • 3 copies in different data centers
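
    A quick check of the 3.6x figure from the bullets above, using only the numbers stated there (1.2x RAID-6 overhead per copy, 3 copies):

        public class HaystackOverhead {
            public static void main(String[] args) {
                double raidOverhead = 1.2;                 // RAID-6: ~1.2x raw per stored byte
                int copies = 3;                            // 3 copies in different data centers
                double effective = raidOverhead * copies;  // 1.2 * 3 = 3.6x
                double usablePB = 100.0 / effective;       // 100 PB raw -> ~27.8 PB usable
                System.out.printf("effective = %.1fx, usable from 100 PB raw = %.1f PB%n",
                        effective, usablePB);
            }
        }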

    Warm Storage

    • Redundancy still required
    • Read throughput requirements are much lower
    • The volume of warm data is ever growing
    • Can we do better than 3.6x?
      • Yes, F4

    F4

    • Reed-Solomon (RS) encoding
      • Redundancy != replication
      • Leverages the lower read-throughput requirement of warm data
    • Space saving
      • 2.1x, down from 3.6x
    • Example
      • Hot content is served through Haystack
      • Warm content is served through f4

    3. F4

    Warm storage problem

    • Need to store (warm) data efficiently
    • Storage must be highly fault tolerant
    • Read latency should be comparable to Haystack's
    • Load (read throughput) is NOT the primary concern

    Solution: f4

    • 2.1x replication factor, compared to Haystack's 3.6x
    • Yet more fault tolerant than Haystack!

    Design of f4

    • Data splitting with RS(5,2)
      • Two parity blocks are computed for every 5 data blocks, and the 7 blocks are grouped into a stripe
    • RS rebuild
      • A lost block can be rebuilt from any 5 surviving blocks of its stripe

    • Block placement policy
      • Each block of a stripe is placed in a different rack (and hence on a different host)
      • RS(10,4) is used in practice (1.4x overhead)
      • Tolerates the failure of 4 racks (and hence 4 hosts/disks)
    • f4 cell anatomy
      • f4 storage consists of a set of cells
      • One cell resides entirely within one datacenter
      • A cell consists of 3 kinds of nodes, with the index distributed across the storage nodes:
        • Storage
        • Compute
        • Coordinator
    • f4 Reads
    • Reads under datacenter failures (2.1x overall replication; see the overhead sketch after this list)
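
    A back-of-the-envelope check of the replication factors quoted above. The RS(k, m) overhead of (k + m) / k gives 1.4x for both RS(5,2) and RS(10,4); the extra factor that turns 1.4x into the quoted 2.1x is assumed here to be cross-datacenter redundancy (roughly 1.5x), which is consistent with the 2.1x figure but is an assumption, not a detail stated in these notes.

        public class F4Overhead {
            // Storage overhead of an RS(k, m) code: (k + m) / k raw bytes per logical byte.
            static double rsOverhead(int dataBlocks, int parityBlocks) {
                return (double) (dataBlocks + parityBlocks) / dataBlocks;
            }

            public static void main(String[] args) {
                double rs52  = rsOverhead(5, 2);    // 1.4x, the RS(5,2) example
                double rs104 = rsOverhead(10, 4);   // 1.4x, used in practice
                // Assumption: surviving a full datacenter failure adds roughly a
                // further 1.5x of cross-datacenter redundancy, so 1.4 * 1.5 = 2.1x,
                // versus Haystack's 3.6x.
                double withGeo = rs104 * 1.5;
                System.out.printf("RS(5,2)=%.2fx  RS(10,4)=%.2fx  with geo redundancy=%.2fx%n",
                        rs52, rs104, withGeo);
            }
        }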

    • Haystack vs. f4

    4. Tips

    • Hot data is cached by the CDN

    Reference
    [1] https://www.quora.com/How-have-Facebook-distributed-systems-has-been-designed-to-look-as-a-single-system-Transparency
    [2] https://code.facebook.com/videos/334113483447122/f4-photo-storage-at-facebook-scale-presentation/