Facebook Cassandra

Introduction

Cassandra is an open source distributed database system that is designed for storing and managing large amounts of data across commodity servers. 
  • Provide scalability and high availability without compromising performance
  • Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platfrom for mission critical data.
  • Support replicating across multiple datacenters and providing low latency for your users
  • Offer the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

History

  • combined Google BigTable with Amazon Dynamo

Architecture

  • Overview
    • Peer-to-Peer distributed system
    • All nodes the same
    • Data partitioned among all nodes in the cluster
    • Custom data replication to ensure fault tolerance
    • Read/Write anywhere design
  • Communication
    • Each node communicates with each other through the Gossip protocol, which exchanges information across cluster every second
    • A commit log is used on each node to capture write activity
    • Data durability is assured
    • Data also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SStable)
  • Advantage
    • Scalability
      • Since it uses peer to peer distributed systems, it is easy to add nodes to increase from TB to PB
    • Fault-tolerance
      • Replication
    • Post-relation database
      • Since it column index, it has dynamic schema design to allows for much more flexibile data storage
    • Data compression
      • Uses Google’s Snappy data compression algorithm
      • Compresses data on a per column family level

Current status

Facebook used to use Cassandra for inbox search. Now they are using HBase for the same. 

Reference

Leave a Reply