System Architecture — Centralized architecture

Lecture 2 — part 2

Client-server applications

  • Clients
    • Interacts with users through a user interface
    • Performs application functions
    • Interacts with client middleware using middleware API
    • Receives responses and displays them if needed
  • Servers
    • Implement services
    • Invoked by server middleware
    • Provide error-recovery and failure handling service

Overview

  • Common communication patterns in distributed applications
    • Client-Server
    • Group (multicast)
    • Function-shipping/Applets
  • Client
    • Process that requests services
  • Server
    • Process that provides services
  • Details
    • Client usually blocks until server responds
    • Client usually invoked by end users when they require services
    • Server usually waits for incoming requests
    • Server can have many clients making concurrent requests
    • Server is usually a program with special privileges
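A minimal Java sketch of this pattern (the port number and echo service are illustrative assumptions, not from the notes): the server loop waits for incoming requests and handles each client on its own thread, while the client blocks until the server responds.

  import java.io.*;
  import java.net.*;

  public class EchoDemo {
      // Server: waits for incoming requests; each accepted connection is
      // handled in its own thread, so many clients can connect concurrently.
      static void serve() throws IOException {
          try (ServerSocket server = new ServerSocket(9090)) {
              while (true) {
                  Socket client = server.accept();   // wait for a request
                  new Thread(() -> {
                      try (BufferedReader in = new BufferedReader(
                               new InputStreamReader(client.getInputStream()));
                           PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                          out.println("echo: " + in.readLine());   // perform the service
                      } catch (IOException ignored) {}
                  }).start();
              }
          }
      }

      // Client: sends a request and blocks until the response arrives.
      static String request(String msg) throws IOException {
          try (Socket s = new Socket("localhost", 9090);
               PrintWriter out = new PrintWriter(s.getOutputStream(), true);
               BufferedReader in = new BufferedReader(
                   new InputStreamReader(s.getInputStream()))) {
              out.println(msg);
              return in.readLine();   // blocks here until the server replies
          }
      }
  }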

Application Software Architectures

  • Many applications can be considered to be made up of three software components or logical tiers

    • user interface
    • processing layer
    • data layer

  • Client/server architecture

    • Single physical tier, two physical tiers
    • multi-tiered
    • Distributed Data
      • e.g., distributed database
    • Remote data
      • e.g., network file system 
    • Distributed programs
      • e.g., world wide web
    • Remote presentation
      • e.g., telnet
    • Distributed presentation
      • e.g., X Windows

  • Motivation for multi-tier architectures

    • Frees clients from dependencies on the exact implementation of the database
    • It allows business logic to be concentrated in one place
      • software updates are restricted to middle layer
    • Performance improvements are possible by batching requests from many clients to the database
    • Database and business logic tiers could be implemented by multiple servers for scalability
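A tiny sketch of the decoupling point above (OrderService and JdbcOrderService are hypothetical names): clients program against a middle-tier interface, so the database implementation can change, or requests can be batched, without any client code changing.

  // Middle tier: clients depend only on this interface,
  // never on how orders are actually stored.
  interface OrderService {
      double totalSpentBy(String customerId);
  }

  // One possible implementation; swapping databases or batching
  // queries from many clients only touches this middle layer.
  class JdbcOrderService implements OrderService {
      @Override
      public double totalSpentBy(String customerId) {
          // ... query the database here; clients never see this detail
          return 0.0;
      }
  }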

Distributed Software and System Architectures

Lecture 2

1. Distributed Architectures

  • Software Architecture
    • Logical organization of the collection of software components that make up a distributed application
  • System Architecture
    • Instantiation of a software architecture, i.e., physical placement of software components on computers

2. Architecture Style

  • Layered architectures
    • Pros
      • Division of tasks
      • Scalability
      • Transparency
      • Portability
    • Cons
      • e.g., layer 1 and layer 3 cannot talk directly
  • Object-based architectures
    • Pros
      • Independent components
      • Free to talk to anyone
  • Data-centered architectures
    • e.g., Google Docs
  • Event-based architectures
    • e.g., Facebook

3. System Architecture

  • Centralized architecture
    • client-server applications
  • Decentralized architecture
    • Peer-to-peer applications
  • Hybrid architecture

[Leetcode] Peek Iterator

Problem:

Given an Iterator class interface with methods next() and hasNext(), design and implement a PeekingIterator that supports the peek() operation — it essentially peeks at the element that will be returned by the next call to next().

Example:
Assume that the iterator is initialized to the beginning of the list: [1, 2, 3].
Calling next() gets you 1, the first element in the list.
Now you call peek() and it returns 2, the next element. Calling next() after that still returns 2.
You call next() a final time and it returns 3, the last element. Calling hasNext() after that should return false.

Analysis:

The Iterator interface only provides the next() and hasNext() functions:

  • next(): returns the next element, and moves the pointer to the element after it
  • hasNext(): returns whether there is a next element

The peek() function, by contrast, only reads the next element and does not actually move the pointer; that is, if peek() is called several times in a row, the returned value should not change (it stays the current element). We can therefore use next() to fetch the next element and cache it, so that if peek() is called repeatedly, we only need to return the value in the cache.

The concrete implementation is as follows:

  • peek()
    • If the cache is valid, return the cached value directly.
    • Otherwise, call next() to fetch the next element into the cache, and mark the cache valid
  • next()
    • If the cache is valid, return the cached value and mark the cache invalid
    • Otherwise, call next() directly
  • hasNext()
    • If the cache is valid, return true directly
    • Otherwise, call hasNext() directly

Code:
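A minimal Java implementation of the caching scheme above, assuming the usual LeetCode-style Iterator<Integer> interface:

  import java.util.Iterator;

  // Caching wrapper: peek() fetches and caches the next element;
  // next() and hasNext() consult the cache before touching the iterator.
  class PeekingIterator implements Iterator<Integer> {
      private final Iterator<Integer> iterator;
      private Integer cache;        // the cached element
      private boolean cacheValid;   // whether the cache holds a value

      public PeekingIterator(Iterator<Integer> iterator) {
          this.iterator = iterator;
      }

      // Return the next element without advancing past it.
      public Integer peek() {
          if (!cacheValid) {             // cache miss: fetch and remember
              cache = iterator.next();
              cacheValid = true;
          }
          return cache;
      }

      @Override
      public Integer next() {
          if (cacheValid) {              // serve from the cache and invalidate it
              cacheValid = false;
              return cache;
          }
          return iterator.next();
      }

      @Override
      public boolean hasNext() {
          return cacheValid || iterator.hasNext();
      }
  }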

Introduction to Distributed Computing Systems

Lecture 1

Definition

  • Distributed System
    • Tanenbaum
      • A distributed system is a collection of independent computers that appears to its users as a single coherent system
    • Lamport
      • You know you have one when the crash of a computer you’ve never heard of stops you from getting any work done.
  • Distributed Applications
    • Applications that consist of a set of processes that are distributed across a network of machines and work together as an ensemble to solve a common problem.
  • Types of Distributed Systems
    • Distributed computing systems
      • Cluster computing: homogeneous
      • Grid computing: heterogeneous, via virtual organizations
      • Cloud computing: everything as a service
    • Distributed Information Systems
      • Transaction processing system (Transaction RPC)
      • Enterprise Information integration
      • Publish/Subscribe systems (message-oriented vs. RPC/RMI)
    • Distributed Pervasive Systems
      • Home systems
      • Health-care systems
      • Sensor networks

Characteristic properties of transactions (ACID)

  • Atomic
    • To the outside world, the transaction happens indivisibly
  • Consistent
    • The transaction does not violate system invariants
  • Isolated
    • Concurrent transactions do not interfere with each other.
  • Durable
    • Once a transaction commits, the changes are permanent.

Goals/Benefits

  • Resource sharing
  • Distribution transparency
  • Scalability
  • Fault tolerance and availability
  • Performance
    • Parallel computing can be considered a subset of distributed computing.

Challenges

  • Heterogeneity
  • Need for “openness”
    • Open standards: key interfaces in software and communication protocols need to be standardized
    • Often defined in Interface Definition Language (IDL)
  • Security
    • Denial of service attacks
  • Scalability
    • size
    • geographically
    • administratively
  • Transparency
  • Failure handling
  • Quality of service

Scalability

  • Factors
    • Size
      • Concerning centralized services/data/algorithm
    • Geographically
      • Synchronous communication in LAN vs. asynchronous communication in WAN
    • Administratively
      • Policy conflicts from different organizations (e.g., for security, access control)
  • Scalability techniques
    • Hiding communication latency
      • Asynchronous communication
      • Code migration (to client)
    • Distribution
      • Splitting a large component into parts (e.g., DNS)
    • Replication
      • Caching (decision of clients vs. servers)
      • On demand (pull) vs. planned (push)
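A small sketch of hiding communication latency with asynchronous communication, using Java's CompletableFuture (fetchRemote is a stand-in for a real remote call):

  import java.util.concurrent.CompletableFuture;

  public class AsyncDemo {
      // Stand-in for a slow remote call.
      static String fetchRemote(String key) {
          try { Thread.sleep(100); } catch (InterruptedException ignored) {}
          return "value-for-" + key;
      }

      public static void main(String[] args) {
          // Issue the request asynchronously instead of blocking on it...
          CompletableFuture<String> reply =
              CompletableFuture.supplyAsync(() -> fetchRemote("user42"));

          doOtherUsefulWork();               // ...and overlap local work with the network delay

          System.out.println(reply.join());  // only now wait for the result
      }

      static void doOtherUsefulWork() { System.out.println("working locally..."); }
  }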

Communication

  • Communication Paradigms

    • Interprocess communication
      • Socket programming, message passing, etc.
    • Remote invocation
      • Request/Reply
      • RPC/RMI
    • Indirect communication
      • Group communication
      • Publisher-subscriber (see the sketch after this list)
      • Message queues
      • Tuple spaces

  • Communication Patterns

    • Client-server
    • Group-oriented/Peer-to-Peer
      • Applications that require reliability, scalability
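A toy sketch of the publisher-subscriber style of indirect communication mentioned above (all names are illustrative): senders and receivers are decoupled, and a publisher never learns who, if anyone, receives its messages.

  import java.util.*;
  import java.util.function.Consumer;

  class EventBus {
      private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

      void subscribe(String topic, Consumer<String> handler) {
          subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
      }

      void publish(String topic, String message) {
          // Deliver to every subscriber of the topic; the publisher
          // has no direct reference to any receiver.
          subscribers.getOrDefault(topic, List.of()).forEach(h -> h.accept(message));
      }
  }

  class PubSubDemo {
      public static void main(String[] args) {
          EventBus bus = new EventBus();
          bus.subscribe("news", msg -> System.out.println("reader A got: " + msg));
          bus.subscribe("news", msg -> System.out.println("reader B got: " + msg));
          bus.publish("news", "new lecture notes posted");
      }
  }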

Distributed Software

  • Middleware handles heterogeneity
  • High-level support
    • Makes the distributed nature of the application transparent to the user/programmer
      • Remote procedure calls
      • RPC + Object orientation = CORBA
  • Higher-level support, BUT exposes remote objects, partial failures, etc. to the programmer
  • Scalability

Fundamental/Abstract Models

  • Interaction Model
    • Reflects the assumptions about the processes and the communication channels in the distributed system.
  • Failure Model
    • Distinguish between the types of failures of the processes and the communication channels. 
  • Security Model
    • Assumptions about the principals and the adversary

Interaction Models

  • Synchronous Distributed Systems
    • A system in which the following bounds are defined
      • The time to execute each step of a process has an upper and lower bound
      • Each message transmitted over a channel is received within a known bounded delay.
      • Each process has a local clock whose drift rate from real time has a known bound.
  • Asynchronous distributed system
    • Each step of a process can take an arbitrary time
    • Message delivery time is arbitrary
    • Clock drift rates are arbitrary
  • Some implications
    • In a synchronous system, timeouts can be used to detect failures
    • In an asynchronous system, it is impossible to detect failures or “reach agreement”
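A small illustration of the timeout implication above: treating a missed deadline as a failure is only sound when delays really are bounded; in an asynchronous system a timeout may just mean "slow". The host, port, and timeout value are illustrative.

  import java.io.IOException;
  import java.net.InetSocketAddress;
  import java.net.Socket;
  import java.net.SocketTimeoutException;

  public class FailureDetectorDemo {
      // Probe a server with a fixed deadline; declare it failed if
      // the deadline passes (valid only under the synchronous model).
      static boolean looksAlive(String host, int port, int timeoutMillis) {
          try (Socket s = new Socket()) {
              s.connect(new InetSocketAddress(host, port), timeoutMillis);
              return true;                  // responded within the bound
          } catch (SocketTimeoutException e) {
              return false;                 // bound exceeded: treat as failed
          } catch (IOException e) {
              return false;                 // unreachable
          }
      }

      public static void main(String[] args) {
          System.out.println(looksAlive("localhost", 9090, 500));
      }
  }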

    Blob File System v.s. GFS/HDFS

    Similarities

    • Master nodes
      • Both GFS and TFS (Taobao File System) use a single master node with multiple slaves.
      • My comment: perhaps Facebook does not use such an approach? I think it uses a P2P approach.

    Differences

    • Functions
      • GFS/HDFS are more popular
        • On top of them, we can build big tables such as BigTable, Hypertable, and HBase.
      • Blob File System
        • Usually used for Photos, Albums (These are called Blob Data)
    • Challenges
      • Blob FS
        • For each write, it requests the master node to assign a blob number and a list of machines to write to.
        • Challenge
          • The volume of metadata is huge
            • e.g., Taobao has more than 10 billion photos; assuming 20 bytes of metadata per photo, the total is 20 B × 10^10 = 200 GB, far more than the memory of a single machine.
        • Solution
          • The metadata is not stored in Blob FS.
          • The metadata is stored in external systems.
            • e.g., Taobao TFS has an ID for each photo; the IDs are stored in external databases, such as an Oracle or MySQL sharding cluster.
          • Blob FS uses chunks to organize data (see the index sketch after this section).
            • Every blob file is a logical file.
            • Every chunk is a physical file.
            • Multiple logical files share one physical file, which reduces the number of physical files.
              • All metadata for physical files can then fit in memory, so every read of a blob file needs only one I/O access.
      • HDFS/GFS
        • GFS v2 may combine GFS and Blob FS into one system. This is difficult to do, since
          • it needs to support both large and small files, and
          • the volume of metadata is so large that the master nodes also need to be distributed.
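    A sketch of the chunk organization described above (hypothetical names, not TFS's actual API): many logical blobs share one large physical file, and the small in-memory (file, offset, length) records are what make each blob read a single I/O.

      import java.util.HashMap;
      import java.util.Map;

      // Logical blobs packed into large physical files; only the small
      // location records live in memory, so reading a blob is one seek.
      class BlobIndex {
          record Location(String physicalFile, long offset, int length) {}

          private final Map<Long, Location> index = new HashMap<>();

          void put(long blobId, String physicalFile, long offset, int length) {
              index.put(blobId, new Location(physicalFile, offset, length));
          }

          // One in-memory lookup, then a single I/O against the physical file.
          Location locate(long blobId) {
              return index.get(blobId);
          }
      }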

    Google File System (GFS) v.s. HDFS

    Overview

    HDFS is a simplified version of GFS.

    Similarities

    • Master and Slaves
      • Both GFS and HDFS use a single-master + multiple-slaves design.
      • The master node handles checkpointing, data migration, and logging
    • Data blocks and replication
      • Both maintain multiple copies (usually 3) for better reliability and performance
    • Tree structure
      • Both maintain a tree-structured file system and allow operations like those under a Linux system
        • copy, rename, move, delete, etc.

    Differences

    • File appends
      • GFS 
        • Allows multiple appends, and allows multiple clients to append simultaneously
        • If every append had to visit the master node, efficiency would be low:
          • GFS uses a leasing mechanism to delegate write permission on a chunk to a chunk server.
          • The chunk server can write the chunk within the lease period (e.g., 12 s).
          • Since multiple clients may write simultaneously and the API is asynchronous, the records might end up in a different order. This makes the system design very complicated.
      • HDFS
        • Only allows one client to open and append data at a time
        • The client first writes the data to a local tmp file; when the tmp data reaches the size of a chunk (64 MB), it asks the HDFS master to assign a machine and chunk number, and writes the chunk data there (see the buffering sketch after this list)
        • Advantage
          • The master will not be a bottleneck, since it is contacted only once per 64 MB of accumulated data
        • Disadvantage
          • If the machine goes down during this process, data still in the local tmp file has not reached HDFS and may be lost
    • Master failure
      • GFS
        • Backup master nodes: when the main master fails, a new master is elected from the backups.
        • Supports snapshots using a copy-on-write approach.
      • HDFS
        • HDFS needs human intervention to recover from master failure.
        • Does not support snapshots.
    • Garbage Collection (GC)
      • GFS
        • Lazy GC.
        • It marks files to be deleted (e.g., renames the file to a name containing time information), so that the files can no longer be visited by normal users.
        • The master node periodically checks the files and deletes the outdated ones (usually files older than 3 days).
      • HDFS
        • HDFS uses a simple, direct-delete mechanism.
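    A sketch of the HDFS-style client-side buffering described in the file-appends item above (hypothetical names; the 64 MB chunk size is from the notes): the master is contacted only once a full chunk has accumulated.

      import java.io.ByteArrayOutputStream;

      // Appends go to a local buffer; the master is asked for a chunk
      // assignment only when 64 MB has accumulated, so it sees one
      // round-trip per chunk rather than one per append.
      class BufferingWriter {
          static final int CHUNK_SIZE = 64 * 1024 * 1024;   // 64 MB

          private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

          void append(byte[] data) {
              buffer.write(data, 0, data.length);   // cheap local write; lost if this machine crashes
              if (buffer.size() >= CHUNK_SIZE) {
                  flushChunk();
              }
          }

          private void flushChunk() {
              long chunkId = askMasterForChunk();             // the only master round-trip
              writeToDataNode(chunkId, buffer.toByteArray());
              buffer.reset();
          }

          private long askMasterForChunk() { return 42L; }         // stub for a real RPC
          private void writeToDataNode(long id, byte[] chunk) {}   // stub for a real RPC
      }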

      Facebook Distributed System Optimization via Asynchronization — Big Data Queries

      Asynchronization Query

      • Asynchronization Query
        • Each query is asynchronous; all functions return a “Future Object”
        • Every DB query is divided into two parts
          • Send request
          • Receive response
      • Future Object tree
        • Each future object has two states
          • Waiting for execution
          • Finished execution
        • Once the tree structure is constructed, execution starts from the leaves and proceeds up to the root.
          • When the root finishes execution, the page loading is complete.
        • Lazy manner
          • Execution is lazy: the tree is constructed first and executed afterwards. This is similar to Spark map-reduce, where the functions form a DAG and a node's predecessors are executed only when the node is needed (see the sketch below).
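      A rough sketch of a lazy future tree (illustrative, not Facebook's actual API): each node defers its computation, nothing runs while the tree is being built, and asking for the root pulls in the leaves first.

        import java.util.function.Supplier;

        // A node is a deferred computation with two states: waiting / finished.
        class LazyNode<T> {
            private final Supplier<T> computation;
            private T value;
            private boolean done;

            LazyNode(Supplier<T> computation) { this.computation = computation; }

            T get() {
                if (!done) {                 // execute on demand, bottom-up
                    value = computation.get();
                    done = true;
                }
                return value;
            }
        }

        class FutureTreeDemo {
            public static void main(String[] args) {
                // Leaves: two independent "DB queries" (stubs).
                LazyNode<String> profile = new LazyNode<>(() -> "profile-data");
                LazyNode<String> friends = new LazyNode<>(() -> "friends-data");

                // Root: the page, which depends on both leaves.
                LazyNode<String> page =
                    new LazyNode<>(() -> profile.get() + " + " + friends.get());

                // Nothing has executed yet; demanding the root runs the leaves first.
                System.out.println(page.get());   // page loading completes here
            }
        }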

      Memcache

      • Which query should be executed first should not be decided during coding.
        • Instead, there should be an extra phase to determine this kind of schedule.
      • Importance of which query to be executed first
        • For example, suppose we have two queries: one finds your Taobao friends who have bought something, and the other finds your Taobao friends who have bought a Porsche. Intuitively, we would first query your friends on Taobao and then apply the other condition, like this:
          IdList friends = waitFor(getFriends(myId));
          yield return getTaoBaoBuyers(friends);
          But for the Porsche query this is wrong: very few people on Taobao buy a Porsche (perhaps one or two), while you may have hundreds of Taobao friends. So the Porsche query is better executed in this order:
          IdList buyers = waitFor(getPorscheBuyer());
          yield return getFriends(buyers);

      Reference
      [1] http://www.infoq.com/cn/news/2015/04/async-distributed-haiping

      Facebook Cassandra

      Introduction

      Cassandra is an open source distributed database system that is designed for storing and managing large amounts of data across commodity servers. 
      • Provide scalability and high availability without compromising performance
      • Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
      • Support replicating across multiple datacenters and providing low latency for your users
      • Offer the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

      History

      • Combines Google BigTable's data model with Amazon Dynamo's distributed design

      Architecture

      • Overview
        • Peer-to-Peer distributed system
        • All nodes are the same
        • Data partitioned among all nodes in the cluster
        • Custom data replication to ensure fault tolerance
        • Read/Write anywhere design
      • Communication
        • Nodes communicate with each other through the Gossip protocol, which exchanges information across the cluster every second
        • A commit log is used on each node to capture write activity
        • Data durability is assured
        • Data is also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SSTable); see the write-path sketch after this list
      • Advantage
        • Scalability
          • Since it uses a peer-to-peer distributed system, it is easy to add nodes and grow from TB to PB
        • Fault-tolerance
          • Replication
        • Post-relation database
          • Since it uses column indexes, its dynamic schema design allows much more flexible data storage
        • Data compression
          • Uses Google’s Snappy data compression algorithm
          • Compresses data on a per column family level
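      A much-simplified sketch of the write path above (illustrative names, not Cassandra's real classes): append to the commit log for durability, update the memtable, and turn a full memtable into an immutable SSTable.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.TreeMap;

        class WritePath {
            private final List<String> commitLog = new ArrayList<>();   // stands in for the on-disk log
            private TreeMap<String, String> memtable = new TreeMap<>();
            private final List<TreeMap<String, String>> sstables = new ArrayList<>();
            private static final int MEMTABLE_LIMIT = 4;                // tiny, for the demo

            void write(String key, String value) {
                commitLog.add(key + "=" + value);   // durability first
                memtable.put(key, value);           // then the in-memory structure
                if (memtable.size() >= MEMTABLE_LIMIT) {
                    sstables.add(memtable);         // "flush": the memtable becomes an SSTable
                    memtable = new TreeMap<>();
                }
            }
        }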

      Current status

      Facebook used to use Cassandra for inbox search. Now they use HBase for the same purpose.


      Facebook Photo Storage

      1. Introduction

      Binary Large OBject

      – Unstructured: video, jpg
      – Immutable: a post's contents will not be modified

      Current scale

      – Over 400 billion photos stored in the Blob system
      – The world’s largest photo store

      Blob hotness and age

      – Accesses cool down over time
      – Hot data and warm data

      2. Storage Solutions

      Haystack 2008

      • High throughput
        • In memory index (store in group)
        • Single I/O per request
        • Multiple copies
      • Failure tolerant
        • RAID6
          • 2 Redundant drives
          • 1.2x replication
          • with 3 hosts: 3.6x space, e.g., 100 PB raw ≈ 28 PB usable
        • Multiple copies 
          • 3 copies in different data centers
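      Working out the overhead figures above (a back-of-the-envelope check; the 10-data + 2-parity RAID6 layout is an assumption consistent with the 1.2x figure):

        public class HaystackOverhead {
            public static void main(String[] args) {
                double raid6 = 1.2;                  // e.g., 10 data + 2 parity drives
                int copies = 3;                      // 3 copies in different data centers
                double effective = raid6 * copies;
                System.out.printf("effective replication: %.1fx%n", effective);      // 3.6x
                System.out.printf("100 PB raw = %.0f PB usable%n", 100 / effective); // ~28 PB
            }
        }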

      Warm Storage

      • Redundancy still required
      • Read throughput much less
      • Ever growing
      • Can we do better than 3.6x?
        • Yes, F4

      F4

      • RS Encoding
        • Redundancy != Replication
        • Leverages the lower read-throughput needs of warm data
      • Space saving
        • 2.1x from 3.6x
      • Example
        • Hot contents through Haystack
        • Warm contents through f4

      3. F4

      Warm storage problem

      • Need to store (warm) data efficiently
      • Storage must be highly fault tolerant
      • Read latency should be comparable to haystack
      • Load is NOT primary concern

      Solution: f4

      • 2.1x replication factor compared to haystack’s 3.6x
      • Yet more fault tolerant than haystack!!

      Design of f4

      • Data splitting with RS(5,2)
        • Use two parity blocks for every 5 data blocks, and group them into a stripe
      • RS rebuild

      • Block placement policy
        • Each stripe is placed in different racks (=>hosts)
        • RS(10,4) is used in practice (1.4x)
        • Tolerates failures of 4 racks (-> 4 disks/hosts)
      • f4 cell anatomy
        • f4 storage consists of a set of cells
        • One cell resides completely in one datacenter
        • A cell consists of 3 kinds of nodes; the index is distributed across the storage nodes
          • Storage
          • Compute
          • Coordinator.
      • f4 Reads
      • Reads with datacenter failures (2.1x)

      • Haystack v.s. f4
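      The replication arithmetic behind the comparison. The cross-datacenter XOR step is how the f4 paper reconciles the 1.4x per-cell figure with the 2.1x total, and is stated here as background, since the notes give only the end figures:

        public class F4Overhead {
            public static void main(String[] args) {
                System.out.printf("Haystack: %.1fx%n", 1.2 * 3);   // RAID6 x 3 copies = 3.6x

                double rs = 14.0 / 10.0;                           // RS(10,4): 14 blocks stored per 10 data blocks
                System.out.printf("f4 cell:  %.1fx%n", rs);        // 1.4x

                // XOR of two cells' blocks stored in a third datacenter
                // adds half a copy: 1.4 * 1.5 = 2.1x.
                System.out.printf("f4 total: %.1fx%n", rs * 1.5);
            }
        }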

      4. Tips

      • Hot data will be cached by the CDN

      Reference
      [1] https://www.quora.com/How-have-Facebook-distributed-systems-has-been-designed-to-look-as-a-single-system-Transparency
      [2] https://code.facebook.com/videos/334113483447122/f4-photo-storage-at-facebook-scale-presentation/