System Architecture — Centralized architecture

Lecture 2 — part 2

Client-server applications

  • Clients
    • Interacts with users through a user interface
    • Performs application functions
    • Interacts with client middleware using middleware API
    • Receives responses and displays them if needed
  • Servers
    • Implement services
    • Invoked by server middleware
    • Provide error-recovery and failure handling service

Overview

  • Common communication patterns in distributed applications
    • Client-Server
    • Group (multicast)
    • Function-shipping/Applets
  • Client
    • Process that requests services
  • Server
    • Process that provides services
  • Details
    • Client usually blocks until server responds
    • Client usually invoked by end users when they require services
    • Server usually waits for incoming requests
    • Server can have many clients making concurrent requests
    • Server is usually a program with special privileges
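A minimal Java sketch of this pattern (the port number and echo service are illustrative assumptions, not from the notes): the server loop waits for incoming requests and handles each client on its own thread, while the client blocks until the server responds.

  import java.io.*;
  import java.net.*;

  public class EchoDemo {
      // Server: waits for incoming requests; each accepted connection is
      // handled in its own thread, so many clients can connect concurrently.
      static void serve() throws IOException {
          try (ServerSocket server = new ServerSocket(9090)) {
              while (true) {
                  Socket client = server.accept();   // wait for a request
                  new Thread(() -> {
                      try (BufferedReader in = new BufferedReader(
                               new InputStreamReader(client.getInputStream()));
                           PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
                          out.println("echo: " + in.readLine());   // perform the service
                      } catch (IOException ignored) {}
                  }).start();
              }
          }
      }

      // Client: sends a request and blocks until the response arrives.
      static String request(String msg) throws IOException {
          try (Socket s = new Socket("localhost", 9090);
               PrintWriter out = new PrintWriter(s.getOutputStream(), true);
               BufferedReader in = new BufferedReader(
                   new InputStreamReader(s.getInputStream()))) {
              out.println(msg);
              return in.readLine();   // blocks here until the server replies
          }
      }
  }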

Application Software Architectures

  • Many applications can be considered to be made up of three software components or logical tiers

    • user interface
    • processing layer
    • data layer

  • Client/server architecture

    • Single physical tier, two physical tiers
    • multi-tiered
    • Distributed Data
      • e.g., distributed database
    • Remote data
      • e.g., network file system 
    • Distributed programs
      • e.g., world wide web
    • Remote presentation
      • e.g., telnet
    • Distributed presentation
      • e.g., X Windows

  • Motivation for multi-tier architectures

    • Frees clients from dependencies on the exact implementation of the database
    • It allows business logic to be concentrated in one place
      • software updates are restricted to middle layer
    • Performance improvements are possible by batching requests from many clients to the database
    • Database and business logic tiers could be implemented by multiple servers for scalability
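A tiny sketch of the decoupling point above (OrderService and JdbcOrderService are hypothetical names): clients program against a middle-tier interface, so the database implementation can change, or requests can be batched, without any client code changing.

  // Middle tier: clients depend only on this interface,
  // never on how orders are actually stored.
  interface OrderService {
      double totalSpentBy(String customerId);
  }

  // One possible implementation; swapping databases or batching
  // queries from many clients only touches this middle layer.
  class JdbcOrderService implements OrderService {
      @Override
      public double totalSpentBy(String customerId) {
          // ... query the database here; clients never see this detail
          return 0.0;
      }
  }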

Distributed Software and System Architectures

Lecture 2

1. Distributed Architectures

  • Software Architecture
    • Logical organization of the collection of software components that make up a distributed application
  • System Architecture
    • Instantiation of a software architecture, i.e., physical placement of software components on computers

2. Architecture Style

  • Layered architectures
    • Pros
      • Division of tasks
      • Scalability
      • Transparency
      • Portability
    • Cons
      • e.g., layer 1 and layer 3 cannot talk directly
  • Object-based architectures
    • Pros
      • Independent components
      • Free to talk to anyone
  • Data-centered architectures
    • e.g., Google Docs
  • Event-based architectures
    • e.g., Facebook

3. System Architecture

  • Centralized architecture
    • client-server applications
  • Decentralized architecture
    • Peer-to-peer applications
  • Hybrid architecture

[Leetcode] Peek Iterator

Problem:

Given an Iterator class interface with methods next() and hasNext(), design and implement a PeekingIterator that supports the peek() operation — it essentially peeks at the element that will be returned by the next call to next().

Example:
Assume that the iterator is initialized to the beginning of the list: [1, 2, 3].
Calling next() gets you 1, the first element in the list.
Now you call peek() and it returns 2, the next element. Calling next() after that still returns 2.
You call next() a final time and it returns 3, the last element. Calling hasNext() after that should return false.

Analysis:

The Iterator interface only provides the next() and hasNext() functions:

  • next(): returns the next element, and moves the pointer to the element after it
  • hasNext(): returns whether there is a next element

The peek() function, by contrast, only reads the next element and does not actually move the pointer; that is, if peek() is called several times in a row, the returned value should not change (it stays the current element). We can therefore use next() to fetch the next element and cache it, so that if peek() is called repeatedly, we only need to return the value in the cache.

The concrete implementation is as follows:

  • peek()
    • If the cache is valid, return the cached value directly.
    • Otherwise, call next() to fetch the next element into the cache, and mark the cache valid
  • next()
    • If the cache is valid, return the cached value and mark the cache invalid
    • Otherwise, call next() directly
  • hasNext()
    • If the cache is valid, return true directly
    • Otherwise, call hasNext() directly

Code:
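A minimal Java implementation of the caching scheme above, assuming the usual LeetCode-style Iterator<Integer> interface:

  import java.util.Iterator;

  // Caching wrapper: peek() fetches and caches the next element;
  // next() and hasNext() consult the cache before touching the iterator.
  class PeekingIterator implements Iterator<Integer> {
      private final Iterator<Integer> iterator;
      private Integer cache;        // the cached element
      private boolean cacheValid;   // whether the cache holds a value

      public PeekingIterator(Iterator<Integer> iterator) {
          this.iterator = iterator;
      }

      // Return the next element without advancing past it.
      public Integer peek() {
          if (!cacheValid) {             // cache miss: fetch and remember
              cache = iterator.next();
              cacheValid = true;
          }
          return cache;
      }

      @Override
      public Integer next() {
          if (cacheValid) {              // serve from the cache and invalidate it
              cacheValid = false;
              return cache;
          }
          return iterator.next();
      }

      @Override
      public boolean hasNext() {
          return cacheValid || iterator.hasNext();
      }
  }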

Introduction to Distributed Computing Systems

Lecture 1

Definition

  • Distributed System
    • Tanenbaum
      • A distributed system is a collection of independent computers that appears to its users as a single coherent system
    • Lamport
      • You know you have one when the crash of a computer you’ve never heard of stops you from getting any work done.
  • Distributed Applications
    • Applications that consist of a set of processes that are distributed across a network of machines and work together as an ensemble to solve a common problem.
  • Types of Distributed Systems
    • Distributed computing systems
      • Cluster computing: homogeneous
      • Grid computing: heterogeneous, via virtual organizations
      • Cloud computing: everything as a service
    • Distributed Information Systems
      • Transaction processing system (Transaction RPC)
      • Enterprise Information integration
      • Publish/Subscribe systems (message-oriented vs. RPC/RMI)
    • Distributed Pervasive Systems
      • Home systems
      • Health-care systems
      • Sensor networks

Characteristic properties of transactions (ACID)

  • Atomic
    • To the outside world, the transaction happens indivisibly
  • Consistent
    • The transaction does not violate system invariants
  • Isolated
    • Concurrent transactions do not interfere with each other.
  • Durable
    • Once a transaction commits, the changes are permanent.

Goals/Benefits

  • Resource sharing
  • Distribution transparency
  • Scalability
  • Fault tolerance and availability
  • Performance
    • Parallel computing can be considered a subset of distributed computing.

Challenges

  • Heterogeneity
  • Need for “openness”
    • Open standards: key interfaces in software and communication protocols need to be standardized
    • Often defined in Interface Definition Language (IDL)
  • Security
    • Denial of service attacks
  • Scalability
    • size
    • geographically
    • administratively
  • Transparency
  • Failure handling
  • Quality of service

Scalability

  • Factors
    • Size
      • Concerning centralized services/data/algorithm
    • Geographically
      • Synchronous communication in LAN vs. asynchronous communication in WAN
    • Administratively
      • Policy conflicts from different organizations (e.g., for security, access control)
  • Scalability techniques
    • Hiding communication latency
      • Asynchronous communication
      • Code migration (to client)
    • Distribution
      • Splitting a large component into parts (e.g., DNS)
    • Replication
      • Caching (decision of clients vs. servers)
      • On demand (pull) vs. planned (push)
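A small sketch of hiding communication latency with asynchronous communication, using Java's CompletableFuture (fetchRemote is a stand-in for a real remote call):

  import java.util.concurrent.CompletableFuture;

  public class AsyncDemo {
      // Stand-in for a slow remote call.
      static String fetchRemote(String key) {
          try { Thread.sleep(100); } catch (InterruptedException ignored) {}
          return "value-for-" + key;
      }

      public static void main(String[] args) {
          // Issue the request asynchronously instead of blocking on it...
          CompletableFuture<String> reply =
              CompletableFuture.supplyAsync(() -> fetchRemote("user42"));

          doOtherUsefulWork();               // ...and overlap local work with the network delay

          System.out.println(reply.join());  // only now wait for the result
      }

      static void doOtherUsefulWork() { System.out.println("working locally..."); }
  }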

Communication

  • Communication Paradigms

    • Interprocess communication
      • Socket programming, message passing, etc.
    • Remote invocation
      • Request/Reply
      • RPC/RMI
    • Indirect communication
      • Group communication
      • Publisher-subscriber (see the sketch after this list)
      • Message queues
      • Tuple spaces

  • Communication Patterns

    • Client-server
    • Group-oriented/Peer-to-Peer
      • Applications that require reliability, scalability
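A toy sketch of the publisher-subscriber style of indirect communication mentioned above (all names are illustrative): senders and receivers are decoupled, and a publisher never learns who, if anyone, receives its messages.

  import java.util.*;
  import java.util.function.Consumer;

  class EventBus {
      private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

      void subscribe(String topic, Consumer<String> handler) {
          subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
      }

      void publish(String topic, String message) {
          // Deliver to every subscriber of the topic; the publisher
          // has no direct reference to any receiver.
          subscribers.getOrDefault(topic, List.of()).forEach(h -> h.accept(message));
      }
  }

  class PubSubDemo {
      public static void main(String[] args) {
          EventBus bus = new EventBus();
          bus.subscribe("news", msg -> System.out.println("reader A got: " + msg));
          bus.subscribe("news", msg -> System.out.println("reader B got: " + msg));
          bus.publish("news", "new lecture notes posted");
      }
  }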

Distributed Software

  • Middleware handles heterogeneity
  • High-level support
    • Makes the distributed nature of the application transparent to the user/programmer
      • Remote procedure calls
      • RPC + Object orientation = CORBA
  • Higher-level support, BUT exposes remote objects, partial failures, etc. to the programmer
  • Scalability

Fundamental/Abstract Models

  • Interaction Model
    • Reflects the assumptions about the processes and the communication channels in the distributed system.
  • Failure Model
    • Distinguish between the types of failures of the processes and the communication channels. 
  • Security Model
    • Assumptions about the principals and the adversary

Interaction Models

  • Synchronous Distributed Systems
    • A system in which the following bounds are defined
      • The time to execute each step of a process has an upper and lower bound
      • Each message transmitted over a channel is received within a known bounded delay.
      • Each process has a local clock whose drift rate from real time has a known bound.
  • Asynchronous distributed system
    • Each step of a process can take an arbitrary time
    • Message delivery time is arbitrary
    • Clock drift rates are arbitrary
  • Some implications
    • In a synchronous system, timeouts can be used to detect failures
    • In an asynchronous system, it is impossible to detect failures or “reach agreement”
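A small illustration of the timeout implication above: treating a missed deadline as a failure is only sound when delays really are bounded; in an asynchronous system a timeout may just mean "slow". The host, port, and timeout value are illustrative.

  import java.io.IOException;
  import java.net.InetSocketAddress;
  import java.net.Socket;
  import java.net.SocketTimeoutException;

  public class FailureDetectorDemo {
      // Probe a server with a fixed deadline; declare it failed if
      // the deadline passes (valid only under the synchronous model).
      static boolean looksAlive(String host, int port, int timeoutMillis) {
          try (Socket s = new Socket()) {
              s.connect(new InetSocketAddress(host, port), timeoutMillis);
              return true;                  // responded within the bound
          } catch (SocketTimeoutException e) {
              return false;                 // bound exceeded: treat as failed
          } catch (IOException e) {
              return false;                 // unreachable
          }
      }

      public static void main(String[] args) {
          System.out.println(looksAlive("localhost", 9090, 500));
      }
  }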

    Blob File System v.s. GFS/HDFS

    Similarities

    • Master nodes
      • Both GFS and TFS (Taobao File System) use a single master node with multiple slaves.
      • My comment: perhaps Facebook does not use such an approach? I think it uses a P2P approach.

    Differences

    • Functions
      • GFS/HDFS are more popular
        • On top of them, we can build big tables such as BigTable, Hypertable, and HBase.
      • Blob File System
        • Usually used for Photos, Albums (These are called Blob Data)
    • Challenges
      • Blob FS
        • For each write, it requests the master node to assign a blob number and a list of machines to write to.
        • Challenge
          • The volume of metadata is huge
            • e.g., Taobao has more than 10 billion photos; assuming 20 bytes of metadata per photo, the total is 20 B × 10^10 = 200 GB, far more than the memory of a single machine.
        • Solution
          • The metadata is not stored in Blob FS.
          • The metadata is stored in external systems.
            • e.g., Taobao TFS has an ID for each photo; the IDs are stored in external databases, such as an Oracle or MySQL sharding cluster.
          • Blob FS uses chunks to organize data (see the index sketch after this section).
            • Every blob file is a logical file.
            • Every chunk is a physical file.
            • Multiple logical files share one physical file, which reduces the number of physical files.
              • All metadata for physical files can then fit in memory, so every read of a blob file needs only one I/O access.
      • HDFS/GFS
        • GFS v2 may combine GFS and Blob FS into one system. This is difficult to do, since
          • it needs to support both large and small files, and
          • the volume of metadata is so large that the master nodes also need to be distributed.
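    A sketch of the chunk organization described above (hypothetical names, not TFS's actual API): many logical blobs share one large physical file, and the small in-memory (file, offset, length) records are what make each blob read a single I/O.

      import java.util.HashMap;
      import java.util.Map;

      // Logical blobs packed into large physical files; only the small
      // location records live in memory, so reading a blob is one seek.
      class BlobIndex {
          record Location(String physicalFile, long offset, int length) {}

          private final Map<Long, Location> index = new HashMap<>();

          void put(long blobId, String physicalFile, long offset, int length) {
              index.put(blobId, new Location(physicalFile, offset, length));
          }

          // One in-memory lookup, then a single I/O against the physical file.
          Location locate(long blobId) {
              return index.get(blobId);
          }
      }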

    Google File System (GFS) v.s. HDFS

    Overview

    HDFS is a simplified version of GFS.

    Similarities

    • Master and Slaves
      • Both GFS and HDFS use a single-master + multiple-slaves design.
      • The master node handles checkpointing, data migration, and logging
    • Data blocks and replication
      • Both maintain multiple copies (usually 3) for better reliability and performance
    • Tree structure
      • Both maintain a tree-structured file system and allow operations like those under a Linux system
        • copy, rename, move, delete, etc.

    Differences

    • File appends
      • GFS 
        • Allows multiple appends, and allows multiple clients to append simultaneously
        • If every append had to visit the master node, efficiency would be low:
          • GFS uses a leasing mechanism to delegate write permission on a chunk to a chunk server.
          • The chunk server can write the chunk within the lease period (e.g., 12 s).
          • Since multiple clients may write simultaneously and the API is asynchronous, the records might end up in a different order. This makes the system design very complicated.
      • HDFS
        • Only allows one client to open and append data at a time
        • The client first writes the data to a local tmp file; when the tmp data reaches the size of a chunk (64 MB), it asks the HDFS master to assign a machine and chunk number, and writes the chunk data there (see the buffering sketch after this list)
        • Advantage
          • The master will not be a bottleneck, since it is contacted only once per 64 MB of accumulated data
        • Disadvantage
          • If the machine goes down during this process, data still in the local tmp file has not reached HDFS and may be lost
    • Master failure
      • GFS
        • Backup master nodes: when the main master fails, a new master is elected from the backups.
        • Supports snapshots using a copy-on-write approach.
      • HDFS
        • HDFS needs human intervention to recover from master failure.
        • Does not support snapshots.
    • Garbage Collection (GC)
      • GFS
        • Lazy GC.
        • It marks files to be deleted (e.g., renames the file to a name containing time information), so that the files can no longer be visited by normal users.
        • The master node periodically checks the files and deletes the outdated ones (usually files older than 3 days).
      • HDFS
        • HDFS uses a simple, direct-delete mechanism.
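    A sketch of the HDFS-style client-side buffering described in the file-appends item above (hypothetical names; the 64 MB chunk size is from the notes): the master is contacted only once a full chunk has accumulated.

      import java.io.ByteArrayOutputStream;

      // Appends go to a local buffer; the master is asked for a chunk
      // assignment only when 64 MB has accumulated, so it sees one
      // round-trip per chunk rather than one per append.
      class BufferingWriter {
          static final int CHUNK_SIZE = 64 * 1024 * 1024;   // 64 MB

          private final ByteArrayOutputStream buffer = new ByteArrayOutputStream();

          void append(byte[] data) {
              buffer.write(data, 0, data.length);   // cheap local write; lost if this machine crashes
              if (buffer.size() >= CHUNK_SIZE) {
                  flushChunk();
              }
          }

          private void flushChunk() {
              long chunkId = askMasterForChunk();             // the only master round-trip
              writeToDataNode(chunkId, buffer.toByteArray());
              buffer.reset();
          }

          private long askMasterForChunk() { return 42L; }         // stub for a real RPC
          private void writeToDataNode(long id, byte[] chunk) {}   // stub for a real RPC
      }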

      Facebook Distributed System Optimization via Asynchronization — Big Data Queries

      Asynchronization Query

      • Asynchronization Query
        • Each query is asynchronous; all functions return a “Future Object”
        • Every DB query is divided into two parts
          • Send request
          • Receive response
      • Future Object tree
        • Each future object has two states
          • Waiting for execution
          • Finished execution
        • Once the tree structure is constructed, execution starts from the leaves and proceeds up to the root.
          • When the root finishes execution, the page loading is complete.
        • Lazy manner
          • Execution is lazy: the tree is constructed first and executed afterwards. This is similar to Spark map-reduce, where the functions form a DAG and a node's predecessors are executed only when the node is needed (see the sketch below).
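      A rough sketch of a lazy future tree (illustrative, not Facebook's actual API): each node defers its computation, nothing runs while the tree is being built, and asking for the root pulls in the leaves first.

        import java.util.function.Supplier;

        // A node is a deferred computation with two states: waiting / finished.
        class LazyNode<T> {
            private final Supplier<T> computation;
            private T value;
            private boolean done;

            LazyNode(Supplier<T> computation) { this.computation = computation; }

            T get() {
                if (!done) {                 // execute on demand, bottom-up
                    value = computation.get();
                    done = true;
                }
                return value;
            }
        }

        class FutureTreeDemo {
            public static void main(String[] args) {
                // Leaves: two independent "DB queries" (stubs).
                LazyNode<String> profile = new LazyNode<>(() -> "profile-data");
                LazyNode<String> friends = new LazyNode<>(() -> "friends-data");

                // Root: the page, which depends on both leaves.
                LazyNode<String> page =
                    new LazyNode<>(() -> profile.get() + " + " + friends.get());

                // Nothing has executed yet; demanding the root runs the leaves first.
                System.out.println(page.get());   // page loading completes here
            }
        }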

      Memcache

      • Which query should be executed first should not be decided during coding.
        • Instead, there should be an extra phase to determine this kind of schedule.
      • Importance of which query to be executed first
        • For example, suppose we have two queries: one finds your Taobao friends who have bought something, and the other finds your Taobao friends who have bought a Porsche. Intuitively, we would first query your friends on Taobao and then apply the other condition, like this:
          IdList friends = waitFor(getFriends(myId));
          yield return getTaoBaoBuyers(friends);
          But for the Porsche query this is wrong: very few people on Taobao buy a Porsche (perhaps one or two), while you may have hundreds of Taobao friends. So the Porsche query is better executed in this order:
          IdList buyers = waitFor(getPorscheBuyer());
          yield return getFriends(buyers);

      Reference
      [1] http://www.infoq.com/cn/news/2015/04/async-distributed-haiping

      Facebook Cassandra

      Introduction

      Cassandra is an open source distributed database system that is designed for storing and managing large amounts of data across commodity servers. 
      • Provide scalability and high availability without compromising performance
      • Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data.
      • Support replicating across multiple datacenters and providing low latency for your users
      • Offer the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching.

      History

      • Combines Google BigTable's data model with Amazon Dynamo's distributed design

      Architecture

      • Overview
        • Peer-to-Peer distributed system
        • All nodes are the same
        • Data partitioned among all nodes in the cluster
        • Custom data replication to ensure fault tolerance
        • Read/Write anywhere design
      • Communication
        • Nodes communicate with each other through the Gossip protocol, which exchanges information across the cluster every second
        • A commit log is used on each node to capture write activity
        • Data durability is assured
        • Data is also written to an in-memory structure (memtable) and then to disk once the memory structure is full (an SSTable); see the write-path sketch after this list
      • Advantage
        • Scalability
          • Since it uses a peer-to-peer distributed system, it is easy to add nodes and grow from TB to PB
        • Fault-tolerance
          • Replication
        • Post-relation database
          • Since it uses column indexes, its dynamic schema design allows much more flexible data storage
        • Data compression
          • Uses Google’s Snappy data compression algorithm
          • Compresses data on a per column family level
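      A much-simplified sketch of the write path above (illustrative names, not Cassandra's real classes): append to the commit log for durability, update the memtable, and turn a full memtable into an immutable SSTable.

        import java.util.ArrayList;
        import java.util.List;
        import java.util.TreeMap;

        class WritePath {
            private final List<String> commitLog = new ArrayList<>();   // stands in for the on-disk log
            private TreeMap<String, String> memtable = new TreeMap<>();
            private final List<TreeMap<String, String>> sstables = new ArrayList<>();
            private static final int MEMTABLE_LIMIT = 4;                // tiny, for the demo

            void write(String key, String value) {
                commitLog.add(key + "=" + value);   // durability first
                memtable.put(key, value);           // then the in-memory structure
                if (memtable.size() >= MEMTABLE_LIMIT) {
                    sstables.add(memtable);         // "flush": the memtable becomes an SSTable
                    memtable = new TreeMap<>();
                }
            }
        }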

      Current status

      Facebook used to use Cassandra for inbox search. Now they use HBase for the same purpose.


      Facebook Photo Storage

      1. Introduction

      Binary Large OBject

      – Unstructured: video, jpg
      – Immutable: a post's contents will not be modified

      Current scale

      – Over 400 billion photos stored in the Blob system
      – The world’s largest photo store

      Blob hotness and age

      – Accesses cool down over time
      – Hot data and warm data

      2. Storage Solutions

      Haystack 2008

      • High throughput
        • In memory index (store in group)
        • Single I/O per request
        • Multiple copies
      • Failure tolerant
        • RAID6
          • 2 Redundant drives
          • 1.2x replication
          • with 3 hosts: 3.6x space, e.g., 100 PB raw ≈ 28 PB usable
        • Multiple copies 
          • 3 copies in different data centers
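      Working out the overhead figures above (a back-of-the-envelope check; the 10-data + 2-parity RAID6 layout is an assumption consistent with the 1.2x figure):

        public class HaystackOverhead {
            public static void main(String[] args) {
                double raid6 = 1.2;                  // e.g., 10 data + 2 parity drives
                int copies = 3;                      // 3 copies in different data centers
                double effective = raid6 * copies;
                System.out.printf("effective replication: %.1fx%n", effective);      // 3.6x
                System.out.printf("100 PB raw = %.0f PB usable%n", 100 / effective); // ~28 PB
            }
        }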

      Warm Storage

      • Redundancy still required
      • Read throughput much less
      • Ever growing
      • Can we do better than 3.6x?
        • Yes, F4

      F4

      • RS Encoding
        • Redundancy != Replication
        • Leverages the lower read-throughput needs of warm data
      • Space saving
        • 2.1x from 3.6x
      • Example
        • Hot contents through Haystack
        • Warm contents through f4

      3. F4

      Warm storage problem

      • Need to store (warm) data efficiently
      • Storage must be highly fault tolerant
      • Read latency should be comparable to haystack
      • Load is NOT primary concern

      Solution: f4

      • 2.1x replication factor compared to haystack’s 3.6x
      • Yet more fault tolerant than haystack!!

      Design of f4

      • Data splitting with RS(5,2)
        • Use two parity blocks for every 5 data blocks, and group them into a stripe
      • RS rebuild

      • Block placement policy
        • Each stripe is placed in different racks (=>hosts)
        • RS(10,4) is used in practice (1.4x)
        • Tolerates failures of 4 racks (-> 4 disks/hosts)
      • f4 cell anatomy
        • f4 storage consists of a set of cells
        • One cell resides completely in one datacenter
        • A cell consists of 3 kinds of nodes; the index is distributed across the storage nodes
          • Storage
          • Compute
          • Coordinator.
      • f4 Reads
      • Reads with datacenter failures (2.1x)

      • Haystack v.s. f4
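      The replication arithmetic behind the comparison. The cross-datacenter XOR step is how the f4 paper reconciles the 1.4x per-cell figure with the 2.1x total, and is stated here as background, since the notes give only the end figures:

        public class F4Overhead {
            public static void main(String[] args) {
                System.out.printf("Haystack: %.1fx%n", 1.2 * 3);   // RAID6 x 3 copies = 3.6x

                double rs = 14.0 / 10.0;                           // RS(10,4): 14 blocks stored per 10 data blocks
                System.out.printf("f4 cell:  %.1fx%n", rs);        // 1.4x

                // XOR of two cells' blocks stored in a third datacenter
                // adds half a copy: 1.4 * 1.5 = 2.1x.
                System.out.printf("f4 total: %.1fx%n", rs * 1.5);
            }
        }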

      4. Tips

      • Hot data will be cached by the CDN

      Reference
      [1] https://www.quora.com/How-have-Facebook-distributed-systems-has-been-designed-to-look-as-a-single-system-Transparency
      [2] https://code.facebook.com/videos/334113483447122/f4-photo-storage-at-facebook-scale-presentation/