Google File System (GFS) vs. HDFS

Overview

HDFS is an open-source file system modeled on GFS, and is essentially a simplified version of it.

Similarities

  • Master and Slaves
    • Both GFS and HDFS use a single master with multiple slaves.
    • The master node is responsible for checkpoints, data migration, and the operation log.
  • Data blocks and replication
    • Both keep multiple copies of each block (usually 3) for better reliability and performance.
  • Tree structure
    • Both maintain a tree-structured file system and allow operations similar to those on a Linux system:
      • copy, rename, move, delete, etc.
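To make the shared design concrete, here is a minimal sketch of the metadata a single master might keep: a mapping from file paths to blocks, and from each block to the servers holding its replicas. The class name ToyMaster, the server names, and the round-robin placement are assumptions made for illustration; neither GFS nor HDFS is claimed to place replicas this way, and a flat dictionary of paths stands in for the real directory tree.

```python
import itertools


class ToyMaster:
    """Hypothetical single master: keeps only metadata, never file contents."""

    def __init__(self, servers, replication=3):
        if len(servers) < replication:
            raise ValueError("need at least as many servers as replicas")
        self.servers = list(servers)
        self.replication = replication
        self.files = {}      # path -> list of block ids
        self.locations = {}  # block id -> servers holding a replica
        self._next_block = itertools.count()
        self._cursor = 0     # round-robin position (an assumption, not the real policy)

    def create(self, path):
        if path in self.files:
            raise FileExistsError(path)
        self.files[path] = []

    def add_block(self, path):
        """Allocate a new block for `path` on `replication` distinct servers."""
        block_id = next(self._next_block)
        n = len(self.servers)
        replicas = [self.servers[(self._cursor + i) % n]
                    for i in range(self.replication)]
        self._cursor = (self._cursor + 1) % n
        self.files[path].append(block_id)
        self.locations[block_id] = replicas
        return block_id, replicas

    def rename(self, old, new):
        """Rename touches metadata only; no block is copied or moved."""
        if new in self.files:
            raise FileExistsError(new)
        self.files[new] = self.files.pop(old)

    def delete(self, path):
        for block_id in self.files.pop(path):
            del self.locations[block_id]


master = ToyMaster(["s1", "s2", "s3", "s4"])
master.create("/logs/a.txt")
block_id, replicas = master.add_block("/logs/a.txt")
master.rename("/logs/a.txt", "/logs/b.txt")
```

The point of the sketch is that operations such as rename only change the master's tables, while the slaves store the actual block data.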

Differences

  • File appends
    • GFS 
      • Allows multiple appends, and allows multiple clients to append to the same file simultaneously.
      • If every append had to go through the master node, efficiency would be low.
        • GFS uses a “lease mechanism” to delegate the write permission for a chunk to a chunk server.
        • The chunk server can write to the chunk for as long as it holds the lease (the GFS paper gives an initial lease timeout of 60 seconds).
        • Since multiple clients may append simultaneously and the API is asynchronous, the records may end up in a different order. This makes the system design very complicated.
    • HDFS
      • Allows only one writer to open a file and append data at a time.
      • The client first writes the data to a local temporary file. When the temporary data reaches the size of a chunk (64 MB), the client asks the HDFS master to assign a machine and a chunk number, and then writes the chunk data there.
      • Advantage
        • The master does not become a bottleneck, since it is contacted only once 64 MB of data has accumulated.
      • Disadvantage
        • If the machine goes down in the middle of this process, the data still buffered locally (e.g., recent logs) has not reached HDFS and may be lost.
  • Master failure
    • GFS
      • Backup master nodes. When the main master node fails, a new master is elected from the backup nodes.
      • Supports snapshots using a “copy-on-write” approach.
    • HDFS
      • HDFS requires human intervention to recover from a master failure.
      • Does not support snapshots.
  • Garbage Collection (GC)
    • GFS
      • Lazy GC. 
      • It marks the files to be deleted (e.g., by renaming each file to a name that contains the deletion time), so that normal users can no longer access them.
      • The master node periodically scans these files and deletes the outdated ones (usually files marked more than 3 days ago).
    • HDFS
      • HDFS uses a simple, direct deletion mechanism.
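The lazy garbage collection described for GFS above can be illustrated with a small sketch: deleting a file only renames it to a hidden, timestamped name, and a periodic sweep later reclaims anything hidden for longer than the grace period. The class LazyNamespace, the ".deleted." naming scheme, and the injectable clock are hypothetical choices for illustration; only the rename-then-sweep idea and the three-day figure come from the text.

```python
import time

GRACE_SECONDS = 3 * 24 * 3600  # the three-day grace period mentioned above


class LazyNamespace:
    """Toy namespace with lazy deletion; not the actual GFS implementation."""

    def __init__(self, clock=time.time):
        self.clock = clock  # injectable so the example can fake time
        self.files = {}     # visible path -> data
        self.trash = {}     # hidden name -> (deleted_at, data)

    def delete(self, path):
        """Hide the file by renaming it; storage is not reclaimed yet."""
        deleted_at = self.clock()
        hidden = f"{path}.deleted.{deleted_at}"  # hypothetical naming scheme
        self.trash[hidden] = (deleted_at, self.files.pop(path))
        return hidden

    def sweep(self):
        """Periodic scan: drop hidden files older than the grace period."""
        now = self.clock()
        expired = [name for name, (deleted_at, _) in self.trash.items()
                   if now - deleted_at > GRACE_SECONDS]
        for name in expired:
            del self.trash[name]
        return expired


# Usage with a fake clock, so no real waiting is needed.
now = [0.0]
ns = LazyNamespace(clock=lambda: now[0])
ns.files["/tmp/a"] = b"data"
hidden = ns.delete("/tmp/a")

now[0] += 24 * 3600
swept_early = ns.sweep()  # one day later: still within the grace period

now[0] += 3 * 24 * 3600
swept_late = ns.sweep()   # four days later: the hidden file is reclaimed
```

By contrast, the direct deletion attributed to HDFS above would correspond to removing the entry from `files` immediately, with no hidden name and no sweep.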
