Process Context Switch

Context Switch

  • Whenever an interrupt arrives, the CPU must do a state-save of the currently running process, then switch into kernel mode to handle the interrupt, and then do a state-restore of the interrupted process.
  • A context switch happens when
    • the time slice for one process has expired and a new process is to be loaded from the ready queue.
    • This will be instigated by a timer interrupt, which will then cause the current process's state to be saved and the new process's state to be restored.
    • Saving and restoring state involves saving and restoring all registers and program counter(s), as well as the process control blocks.
    • Context switches happen very frequently, and the overhead of switching is just lost CPU time, so context switches (state saves and restores) need to be as fast as possible. Some hardware has specific provisions for speeding this up, such as a single machine instruction for saving or restoring all registers at once.

Registers

  • When it is time for a process to give up the processor so another process can run, the operating system needs to save the process's current state. To do this, it stores a copy of the CPU registers to memory.
  • Equally, we need to be able to restore this state when the process is given more time to run on the CPU.
  • When it is time for the process to run again, the operating system copies the register values back from memory to the CPU registers, and the process is right back where it left off (see the sketch below).
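
As a rough illustration, the sketch below shows the idea in C. The structure and function names are hypothetical (real kernels do this in assembly, often with dedicated instructions); it only illustrates "copy registers out to memory, copy them back later":

  #include <stdint.h>
  #include <string.h>

  /* Hypothetical snapshot of the registers that must be preserved
   * (x86-32 names used purely for illustration). */
  struct cpu_context {
      uint32_t eax, ebx, ecx, edx;
      uint32_t esi, edi, ebp, esp;
      uint32_t eip;      /* program counter */
      uint32_t eflags;
  };

  /* On a switch, copy the live registers into the outgoing process's
   * save area; later, copy the incoming process's saved values back. */
  void save_context(struct cpu_context *saved, const struct cpu_context *live) {
      memcpy(saved, live, sizeof *saved);
  }

  void restore_context(struct cpu_context *live, const struct cpu_context *saved) {
      memcpy(live, saved, sizeof *saved);
  }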

Reference

[1] http://www.bottomupcs.com/elements_of_a_process.xhtml

Process Heap

Introduction

The heap is an area of memory that is managed by the process for on-the-fly memory allocation.

This is for variables whose memory requirements are not known at compile time.

  • brk
    • The top of the heap is marked by the brk (the program break), so called for the system call which modifies it.
    • By using the brk call to move the break upward, the process can request that the kernel allocate more memory for it to use.
  • malloc
    • The heap is mostly managed by the malloc library call.
    • This makes managing the heap easy for the programmer, by allowing them to simply allocate (via malloc) and free (via the free call) heap memory; a short sketch follows this list.
    • Malloc can use schemes like a buddy allocator to manage the heap memory for the user.
    • Malloc can also be smarter about allocation and potentially use anonymous mmaps for extra process memory.
    • This is where instead of mmaping a file into the process memory, it directly maps an area of system RAM.
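
A minimal sketch of heap usage (Linux-specific: sbrk(0) reports the current program break; note that malloc may satisfy a request from an existing pool or via an anonymous mmap, in which case the break does not move):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int main(void) {
      printf("brk before: %p\n", sbrk(0));   /* current program break */

      /* Size not known until run time: classic heap allocation. */
      char *buf = malloc(4096);
      if (buf == NULL) return 1;

      printf("brk after:  %p\n", sbrk(0));   /* likely moved up via brk */

      free(buf);    /* hand the memory back to the allocator */
      return 0;
  }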

Process Stack

Stacks usually grow down, i.e., the stack starts at a higher address in memory and progressively gets lower.

Stack Pointer

  • To keep track of the current growth of the stack, the hardware defines a register as the stack pointer.
  • The compiler (or the programmer, when writing in assembler) uses this register to keep track of the current top of the stack.
  • Store
    • On x86, it is stored in the register ESP.

Call Stack

  • A call stack is used for several related purposes, but the main reason for having one is to keep track of the point to which each active subroutine should return control when it finishes executing.
  • Active subroutine
    • An active subroutine is one that has been called but has yet to complete execution, after which control should be handed back to the point of the call.

How It Works

  • Return address
    • Store the return address somewhere, so that we can continue executing the current function after the called function completes.
  • Other info
    • Saving other information about the current function.
  • Parameters
    • Providing the callee with the parameters
  • Local variables
    • Providing the called function some space to store its automatic variables
  • Return value
    • Providing some mechanism to get the return value from the called function.
    • It is implemented using the EAX register: the called function stores its return value in EAX.

Execution in Steps

  1. Push the parameters in reverse order (right to left)
  2. Execute the call instruction
    1. It implicitly pushes the return address onto the stack
  3. Push the current EBP
    1. i.e., push the frame pointer (FP) of the calling function onto the stack. We need to do this to continue with the execution of the calling function after returning from the called function.
  4. Copy ESP to EBP
    1. Yes, this location will be the new frame pointer
  5. Allocate space on the stack for local variables
    1. This is done by decrementing ESP
  6. Put the value to be returned in EAX
  7. Copy the current EBP to ESP
    1. This shrinks the stack, discarding the local variables. The saved old FP is now at the top of the stack, where it is popped back into EBP before ret pops the return address (see the example below).
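
To make the steps concrete, here is a small C example; the comments sketch what a traditional 32-bit cdecl compiler would emit around the call (details vary by compiler, ABI, and optimization level):

  int add(int a, int b) {   /* prologue: push EBP; mov EBP, ESP; sub ESP, <locals> */
      int sum = a + b;      /* locals live on the stack, below EBP */
      return sum;           /* result placed in EAX */
  }                         /* epilogue: mov ESP, EBP; pop EBP; ret pops the return address */

  int main(void) {
      /* caller: push 2, then push 1 (right to left); call pushes the return address */
      int r = add(1, 2);
      return r;
  }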

Stack Frame

  • The stack frame generally includes the following components
    • The return address
    • Argument variables passed on the stack
    • Local variables
    • Saved copies of any registers modified by the subprogram that need to be restored.

Stack Overflow

  • Introduction
    • A stack overflow (buffer overflow) occurs when more data is written into a stack variable than the space allocated for it, overwriting adjacent stack contents such as the saved return address.
  • Suggestion
    • If you are a programmer accepting arbitrary input into a stack variable (say, reading from the keyboard or over the network), you need to bound how big that data can be before storing it.
    • OS
      • The OS can make the stack non-executable, so that injected code placed on the stack cannot be run.

Reference

[1] http://articles.manugarg.com/stack.html

Process Communication

Shared Memory Systems

  • In general, the memory to be shared in a shared-memory system is initially within the address space of a particular process, which needs to make system calls in order to make that memory publicly available to one or more other processes. 
  • Other processes which wish to use the shared memory must then make their own system calls to attach the shared memory area onto their address space. 
  • Generally a few messages must be passed back and forth between the cooperating processes first in order to set up and coordinate the shared memory access. 

Example: POSIX Shared Memory

  • Step 1: one of the processes involved allocates some shared memory, using shmget
    • int segment_id = shmget(IPC_PRIVATE, size, S_IRUSR | S_IWUSR);
      • The first parameter 
        • specifies the key (identifier) of the segment. IPC_PRIVATE creates a new shared memory segment.
      • The second parameter
        • How big the shared memory segment is to be, in bytes.
      • The third parameter
        • A set of bitwise ORed flags. In this case, the segment is being created for reading and writing.
      • The return value
        • an integer identifier
  • Step 2: Any process which wishes to use the shared memory must attach the shared memory to its address space, using shmat
    • char *shared_memory = (char *) shmat(segment_id, NULL, 0);
      • The first parameter
        • Specifies the key of the segment that the process wishes to attach to its address space
      • The second parameter
        • Indicates where the process wishes to have the segment attached. NULL indicates that the system should decide.
      • The third parameter
        • A flag for read-only operation. Zero indicates read-write; one indicates read-only.
      • The return value
        • The return value of shmat is a void *, which the process can use (type cast) as appropriate. In this example, it is being used as a character pointer.
  • Step 3: The process may access the memory using the pointer returned by shmat. For example, using sprintf
    • sprintf(shared_memory, "Writing to shared memory\n");
  • Step 4: When a process no longer needs a piece of shared memory, it can be detached using shmdt
    • shmdt(shared_memory);
  • Step 5: And finally the process that originally allocated the shared memory can remove it from the system using shmctl.
    • shmctl(segment_id, IPC_RMID, NULL);  (the five steps are combined in the sketch below)
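
A minimal sketch combining the five steps in one process (in practice, the creator and the attacher are usually different processes sharing the segment id):

  #include <stdio.h>
  #include <sys/ipc.h>
  #include <sys/shm.h>
  #include <sys/stat.h>

  int main(void) {
      const int size = 4096;

      /* Step 1: allocate a shared memory segment. */
      int segment_id = shmget(IPC_PRIVATE, size, S_IRUSR | S_IWUSR);
      if (segment_id < 0) return 1;

      /* Step 2: attach it to this process's address space. */
      char *shared_memory = (char *) shmat(segment_id, NULL, 0);
      if (shared_memory == (void *) -1) return 1;

      /* Step 3: use it like ordinary memory. */
      sprintf(shared_memory, "Writing to shared memory\n");
      printf("%s", shared_memory);

      /* Step 4: detach the segment. */
      shmdt(shared_memory);

      /* Step 5: remove it from the system. */
      shmctl(segment_id, IPC_RMID, NULL);
      return 0;
  }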

Message-Passing

  • Message passing systems must support, at a minimum, the system calls “send message” and “receive message”
  • A communication link must be established between the communicating processes before messages can be sent.
  • There are three key issues to be resolved in message passing systems
    • Direct or indirect communication (naming)
    • Synchronous or asynchronous communication
    • Automatic or explicit buffering

Naming

  • Direct communication
    • With direct communication the sender must know the name of the receiver to which it wishes to send a message
  • Indirect communication
    • Multiple processes can share the same mailbox or boxes.
    • Only one process can read any given message in a mailbox. 
    • The OS must provide system calls to create and delete mailboxes, and to send and receive messages to/from mailboxes.

Synchronization

  • Either the sending or receiving of messages (or neither or both) may be either blocking or non-blocking.

Buffering

  • Messages are passed via queues, which may have one of three capacity configurations
    • Zero capacity
      • Messages cannot be stored in the queue, so senders must block until receivers accept the messages.
    • Bounded capacity
      • There is a certain pre-determined finite capacity in the queue. Senders must block if the queue is full, until space becomes available, but may be either blocking or non-blocking otherwise (this is the case sketched below).
    • Unbounded capacity
      • The queue has a theoretically infinite capacity, so senders are never forced to block.
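
A sketch of the bounded-capacity case using POSIX message queues (queue name and sizes are arbitrary; on Linux, link with -lrt). mq_send blocks when the queue is full and mq_receive blocks when it is empty, matching the behavior described above:

  #include <fcntl.h>
  #include <mqueue.h>
  #include <stdio.h>
  #include <string.h>

  int main(void) {
      /* Bounded capacity: at most 8 messages of up to 64 bytes each. */
      struct mq_attr attr = { .mq_maxmsg = 8, .mq_msgsize = 64 };

      /* Sender and receiver both open the queue by name. */
      mqd_t q = mq_open("/demo_queue", O_CREAT | O_RDWR, 0600, &attr);
      if (q == (mqd_t) -1) return 1;

      const char *msg = "hello";
      mq_send(q, msg, strlen(msg) + 1, 0);    /* blocks if the queue is full */

      char buf[64];
      mq_receive(q, buf, sizeof buf, NULL);   /* blocks if the queue is empty */
      printf("received: %s\n", buf);

      mq_close(q);
      mq_unlink("/demo_queue");               /* remove the queue from the system */
      return 0;
  }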

Reference

[1] https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/3_Processes.html

Process Memory, State, and Scheduling

Process Concept

  • A process is an instance of a program in execution

Process Memory

  • Text
    • comprises the compiled program code, read in from non-volatile storage when the program is launched. 
  • Data
    • Stores global and static variables, allocated and initialized prior to executing main.
  • Heap
    • Used for dynamic memory allocation; managed via calls to new, delete, malloc, free, etc.
  • Stack
    • Used for local variables
    • Space on the stack is reserved for local variables when they are declared (at function entrance or elsewhere, depending on the language), and the space is freed up when the variables go out of scope
    • The stack is also used for function return values, and the exact mechanisms for stack management may be language specific. 
  • When processes are swapped out of memory and later restored, additional information must also be stored and restored. Key among them are the program counter and the values of all program registers. A sketch below prints an address from each of the four regions.
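
The sketch below prints an address from each region; exact values and ordering are platform-specific, but typically the stack sits well above the heap:

  #include <stdio.h>
  #include <stdlib.h>

  int global_var = 42;    /* data segment (initialized globals) */

  int main(void) {        /* main itself lives in the text segment */
      int local_var = 7;  /* stack */
      int *heap_var = malloc(sizeof *heap_var);   /* heap */

      printf("text:  %p\n", (void *) main);
      printf("data:  %p\n", (void *) &global_var);
      printf("heap:  %p\n", (void *) heap_var);
      printf("stack: %p\n", (void *) &local_var);

      free(heap_var);
      return 0;
  }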

Process State

  • New
    • The process is in the stage of being created.
  • Ready
    • The process has all the resources available that it needs to run, but the CPU is not currently working on this process's instructions.
  • Running
    • The CPU is working on this process's instructions.
  • Waiting
    • The process cannot run at the moment, because it is waiting for some resource to become available or for some event to occur. For example, the process may be waiting for keyboard input, a disk access request to complete, inter-process messages, a timer to go off, or a child process to finish.
  • Terminated
    • The process has completed.

Process Control Block

For each process, there is a Process Control Block (PCB), which stores the following (types of) process-specific information (a toy declaration is sketched after the list).
  • Process State
    • Running, waiting etc. 
  • Process ID
    • and also parent process ID
  • CPU registers and Program Counter
    • These need to be saved and restored when swapping processes in and out of the CPU
  • CPU scheduling information
    • Such as priority information and pointers to scheduling queues
  • Memory-Management information
    • e.g., page tables or segment tables
  • Accounting information
    • User and kernel CPU time consumed, account numbers, limits, etc. 
  • I/O Status information
    • Devices allocated, open file tables etc. 
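
A toy PCB declaration, just to show the kinds of fields the list above implies (field names, types, and sizes are illustrative, not from any real kernel):

  #include <stdint.h>

  enum proc_state { NEW, READY, RUNNING, WAITING, TERMINATED };

  struct pcb {
      int             pid;             /* process ID */
      int             ppid;            /* parent process ID */
      enum proc_state state;           /* running, waiting, etc. */
      uint64_t        registers[16];   /* saved CPU registers */
      uint64_t        program_counter;
      int             priority;        /* CPU scheduling information */
      void           *page_table;      /* memory-management information */
      uint64_t        user_time;       /* accounting information */
      uint64_t        kernel_time;
      int             open_files[16];  /* I/O status: open file descriptors */
  };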

Process Scheduling

  • The two main objectives of the process scheduling system are to keep the CPU busy at all times and to deliver “acceptable” response times for all programs, particularly interactive ones.
  • The process scheduler must meet these objectives by implementing suitable policies for swapping processes in and out of the CPU.

Scheduling Queues

  • Job queue
    • All processes are stored in the job queue.
  • Ready queue
    • Processes in the Ready state are placed in the ready queue
  • Device queue
    • Processes waiting for a device to be available or to deliver data are placed on device queues. There is generally a separate device queue for each device.

Scheduler

  • Long-term scheduler
    • Typical of a batch system or a very heavily loaded system. It runs infrequently (such as when one process ends, selecting one or more processes to be loaded in from disk in its place), and can afford to take the time to implement intelligent and advanced scheduling algorithms
  • Short-term scheduler, or CPU scheduler
    • Runs very frequently, on the order of 100 milliseconds, and must very quickly swap one process out of the CPU and swap in another one.
  • Medium-term scheduler
    • Some systems also employ a medium-term scheduler.
    • When system load gets high, this scheduler will swap one or more processes out of the ready queue for a few seconds, in order to allow smaller, faster jobs to finish up quickly and clear the system (a toy dispatch loop is sketched below).
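
A toy sketch of a short-term scheduler cycling a ready queue round-robin style (illustrative only; real schedulers are preemptive, priority-aware, and far more involved):

  #include <stdio.h>

  #define NPROC 3

  int main(void) {
      int ready_queue[NPROC] = {101, 102, 103};   /* PIDs in the ready queue */
      int head = 0;

      for (int slice = 0; slice < 6; slice++) {
          int pid = ready_queue[head];
          printf("time slice %d: dispatch pid %d\n", slice, pid);
          /* On the timer interrupt, the running process goes to the back
           * of the queue and the next one is dispatched. */
          head = (head + 1) % NPROC;
      }
      return 0;
  }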

    Reference

    [1] https://www.cs.uic.edu/~jbell/CourseNotes/OperatingSystems/3_Processes.html

    Network Topology in the Cloud — ToR

    Top of Rack

    [1] demonstrates that most of the topology in the Amazon cloud follows the ToR model.
    • All servers in a rack are first connected to a separate Top of Rack (ToR) switch, and then the ToR switch is connected to aggregation switches.
    • Such a topology has currently become a mainstream network topology in data centers.

    Reference

    [1] A Measurement Study on Co-residence Threat inside the Cloud, by Zhang Xu, Haining Wang and Zhenyu Wu, in UsenixSecurity2016

    Virtual Private Cloud (VPC)

    VPC Introduction

    [1] VPC is a logically isolated networking environment with a separate private IP space and routing configuration.

    Characteristics

    • After creating a VPC, a customer can launch instances into the VPC, instead of the large EC2 network pool.
    • The customer can also divide a VPC into multiple subnets, where each subnet can have a preferred availability zone in which to place instances.
    • The private IP address of an instance in a VPC is only known to its owner. It cannot be detected by other users. Thus, it significantly reduces the threat of co-residence.

    Reference

    [1] A Measurement Study on Co-residence Threat inside the Cloud, by Zhang Xu, Haining Wang and Zhenyu Wu, in UsenixSecurity2016

    Transactions and Replications

    One Copy Serializability

    • Replicated transactional service
      • Each replica manager provides concurrency control and recovery of its own data items in the same way as it would for non-replicated data
    • Effects of transactions performed by various clients on replicated data items are the same as if they had been performed one at a time on a single data item
    • Additional complications
      • Failures should be serialized w.r.t. transactions, i.e., any failure observed by a transaction must appear to have happened before the transaction started

    Replication Scheme

    • Primary copy
      • All client requests are directed to a single primary server
    • Read one — write all
      • cannot handle network partitions
      • Each write operation sets a write lock to each replica manager
      • Each read sets a read lock at one replica manager
    • Schemes that can handle network partitions
      • Available copies with validation
      • Quorum consensus
      • Virtual Partition

    Available Copies Replication

    • Can handle the case when some replica managers are unavailable, because they have failed or because of a communication failure
    • Reads can be performed by any available replica manager, but writes must be performed by all available replica managers
    • The normal case is like read one/write all
      • As long as the set of available replica managers does not change during a transaction
    • RM failure case
      • One copy serializability requires that failures and recoveries be serialized w.r.t. transactions
      • This is not achieved when different transactions make conflicting failure observations
      • Examples show that local concurrency control is not enough
      • An additional concurrency control procedure (called local validation) has to be performed to ensure correctness
    • Available copies with local validation assumes no network partition
      • i.e., functioning replica managers can communicate with one another.
    • Local validation
      • Before a transaction commits it checks for failures (and recoveries) of replica managers of data items it has accessed

    Handling Network Partitions

    • Network partitions separate replica managers into two or more subgroups, in such a way that the members of a subgroup can communicate with one another but members of different subgroups cannot communicate
    • Optimistic approaches
      • Available copies with validation
    • Pessimistic approaches
      • Quorum consensus

    Available Copies with Validation

    • Available copies algorithm applied within each partition
      • Maintains availability for read operations
    • When partition is repaired, possibly conflicting transactions in separate partitions are validated
      • The effects of a committed transaction that is now aborted on validation will have to be undone
        • Only feasible for applications where such compensating actions can be taken
    • Validation
      • Version vectors (detect write-write conflicts)
      • Precedence graphs (each partition maintains a log of data items affected by the Read and Write operations of transactions)
      • The log is used to construct a precedence graph whose nodes are transactions and whose edges represent conflicts between Read and Write operations
        • The graph corresponding to each single partition has no cycle
      • If there are cycles in the combined graph, validation fails.

    Quorum Consensus

    • A quorum is a subgroup of replica managers whose size gives it the right to carry out operations
    • Majority voting is one instance of a quorum consensus scheme
      • R + W > total number of votes in the group
      • W > half of the total votes
      • These ensure that each read quorum intersects a write quorum, and that any two write quorums intersect (see the sketch below)
    • Each replica has a version number that is used to detect if the replica is up to date. 
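
    The two constraints are easy to encode; a tiny helper (hypothetical, just to make the inequalities concrete):

      #include <stdbool.h>

      /* r = votes needed for a read quorum, w = votes needed for a write
       * quorum, total = total number of votes in the group. */
      bool quorum_config_ok(int r, int w, int total) {
          return (r + w > total)    /* every read quorum overlaps every write quorum */
              && (2 * w > total);   /* any two write quorums overlap */
      }

      /* e.g., with total = 5: r = 2, w = 4 is valid; r = 2, w = 2 is not. */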

    Virtual Partition Scheme

    • Combines available copies and quorum consensus
    • Virtual partition = set of replica managers that have a read and write quorum
    • If a virtual partition can be formed, available copies is used
      • Improve performance of Reads
    • If a failure occurs, and virtual partition changes during a transaction, it is aborted
    • Have to ensure virtual partitions do not overlap.

    CAP Conjecture

    • Is it possible to achieve consistency, availability, and partition tolerance?
    • Classic distributed systems: focused on ACID semantics
      • Atomic
      • Consistent
      • Isolated
      • Durable
    • Modern internet system: focused on BASE
      • Basically Available
      • Soft-state (or scalable)
      • Eventually consistent
    • ACID vs. BASE

    Why the Divide

    • What goals might you want for a shared-data system
      • consistency
      • availability
      • partition tolerance
    • Strong consistency
      • all clients see the same view, even in the presence of updates
    • High availability
      • All clients can find some replica of the data, even in the presence of failures
    • Partition-tolerance
      • The system properties hold even when the system is partitioned. 

    CAP Conjecture

    • You can only have two out of these three properties
    • The choice of which feature to discard determines the nature of your system. 

    Consistency and Availability

    • Comment
      • Providing transaction semantics requires all nodes to be in contact with each other
    • Example
      • Single-site clustered databases
    • Typical features
      • Two-phase commit
      • Cache invalidation protocol
      • Classic distributed system style

    Consistency and Partition Tolerance

    • Comment
      • If one is willing to tolerate system-wide blocking, then can provide consistency even when there are temporary partitions
    • Examples
      • Distributed databases
      • Distributed locking
      • Quorum (majority) protocols
    • Typical features
      • Pessimistic locking
      • Minority partitions unavailable
      • Also common distributed system
        • voting vs. primary replicas

    Partition-Tolerance and Availability

    • Comment
      • sacrifice consistency
    • Examples
      • DNS
      • Web cache
      • Coda
      • Bayou
    • Typical features
      • TTLs and lease cache management
      • Optimistic updating with conflict resolution

    Techniques

    • Expiration-based caching: AP
    • Quorum/majority algorithm: PC
    • Two phase commit: AC

    Fault Tolerance

    Definitions

    • Availability
      • Probability that the system operates correctly at any given moment (a standard way to quantify this follows the list)
    • Reliability
      • Ability to run correctly for a long interval of time
    • Safety
      • Failure to operate correctly does not lead to catastrophic failures
    • Maintainability
      • Ability to “easily” repair a failed system
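
    A standard way to quantify the first two: availability is the long-run fraction of time the system is operational, commonly written as Availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair.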

    Failure Models

    • Different types of failures

    Two-Army Problem

    • Two armies must agree on a time to attack while communicating over an unreliable channel; since the last acknowledgement can always be lost, no finite protocol lets them reach agreement with certainty.

    Byzantine Agreement 

    • [Lamport et al. (1982)]
    • Goal
      • Each process learn the true values sent by correct processes
    • Assumption
      • Every message that is sent is delivered correctly
      • The receiver knows who sent the message
      • Message delivery time is bounded
    • Byzantine Agreement Result
      • In a system with m faulty processes, agreement can be achieved only if 2m+1 processes are functioning correctly (i.e., 3m+1 processes in total).
      • Note
        • This result only guarantees that each process receives the true values sent by correct processes; it does not identify which processes are faulty!

    Byzantine General Problem

    • Phase 1: Generals announce their troop strengths to each other

    • Phase 2: Each general constructs a vector with all the troop strengths

    • Phase 3: Generals send their vectors to each other and compute the majority vote (see the sketch below)
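
    A toy sketch of the three phases for N = 4 generals (at most m = 1 traitor, since N >= 3m + 1). Message plumbing is assumed away; the traitor is modeled as simply reporting garbage:

      #include <stdio.h>

      #define N 4

      /* Majority value among N votes, or -1 if there is no majority. */
      static int majority(int v[N]) {
          for (int i = 0; i < N; i++) {
              int count = 0;
              for (int j = 0; j < N; j++)
                  if (v[j] == v[i]) count++;
              if (count > N / 2) return v[i];
          }
          return -1;   /* unknown: no value has a majority */
      }

      int main(void) {
          /* vectors[i][j] = troop strength that general i reports for general j
           * after phases 1 and 2.  General 3 is the traitor and reports garbage. */
          int vectors[N][N] = {
              {1000, 2000, 3000, 4000},
              {1000, 2000, 3000, 4242},
              {1000, 2000, 3000,   17},
              { 666,  777,  888,  999},   /* the traitor's vector */
          };

          /* Phase 3: take an element-wise majority over all received vectors.
           * The loyal generals' strengths are recovered; the traitor's is not. */
          for (int j = 0; j < N; j++) {
              int column[N];
              for (int i = 0; i < N; i++) column[i] = vectors[i][j];
              printf("agreed strength of general %d: %d\n", j, majority(column));
          }
          return 0;
      }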

    Reliable Group Communication

    • Reliable Multicast
      • All nonfaulty processes which do not join/leave during communication receive the message

    • Atomic Multicast
      • All messages are delivered in the same order to all processes

    View Delivery

    • A view reflects current membership of group
    • A view is delivered when a membership change occurs and the application is notified of the change
      • Receiving a view is different from delivering a view
        • All members have to agree to the delivery of a view
    • View synchronous group communication
      • The delivery of a new view draws a conceptual line across the system and every message is either delivered on one side or the other of that line


    Atomic Multicast

    • All messages are delivered in the same order to “all” processes
    • Group view
      • The set of processes known by the sender when it multicast the message
    • Virtual synchronous multicast
      • A message multicast to a group view G is delivered to all nonfaulty processes in G
        • If sender fails after sending the message, the message may be delivered to no one

    Virtual Synchrony Implementation

    • Only stable messages are delivered
    • Stable message
      • A message received by all processes in the message’s group view
    • Assumptions (can be ensured by using TCP)
      • Point-to-point communication is reliable
      • Point-to-point communication ensures FIFO-ordering


    Message Ordering

    • Total ordering does not imply causality or FIFO! For example, a totally ordered multicast may deliver m2 before m1 at every process even though the same sender sent m1 first: the order is identical everywhere, yet it violates FIFO (and hence causal) order.


    Virtualization and Code Migration

    Motivation for Code Migration

    • Load sharing in distributed systems
      • Long-running processes can be migrated to idle processors
    • Client-server systems
      • Code for data entry shipped to client systems
      • If large quantities of data need to be processed, it is better to ship the data processing component to the client
      • Dynamically configurable client software
        • More flexibility, easier maintenance and upgrade of client software
    • Enterprise and “Desktop Grids”
      • Computationally-intensive tasks shipped to idle PCs around the network

    Models for Code Migration

    • A process has three segments
      • Code segment
        • Set of instructions making up the program
      • Execution segment
        • Private data, stack, PC, registers
      • Resource segment
        • Reference to external resources such as files, printers, devices etc
    • Weak vs Strong Mobility
      • Weak mobility
        • Only code segment + initialization data migrated
          • e.g., Java Applets
      • Strong mobility
        • Code segment + Execution segment
    • Sender-initiated vs. receiver-initiated migration
      • Receiver-initiated is much easier considering security

    Migration and Local Resources

    • Process-to-resource bindings make code migration difficult
    • Three types of process-to-resource bindings
      • Binding by identifier
        • When a process refers to a resource by its identifier
          • e.g., URL, IP address, local communication endpoint (socket)
      • Binding by value
        • Weaker form of binding when only the value of a resource is needed
          • e.g., when a program relies on standard language libraries
      • Binding by type
        • Weakest form of binding when a process indicates the type of a resource
          • e.g., a printer