Distributed Systems [Fall 2013] Course Review. Final Exam: December 19, 4:10-6:10 PM, MAT 417. Open book/notes: the textbook or printouts of the slides.

No other materials are allowed. What we have learned so far: Concepts and Theories, Distributed Algorithms,

System Issues Some protocols Distributed System Models Client-server SMTP, Telnet, SSH Thin client Diskless nodes, X11

Peer to peer BitTorrent Grid Many projects from www.distributed.net Processes, Threads and Coordination Process: an execution stream (or program) in the context of a particular process state Threads: separate streams of execution that

share an address space (lightweight processes). Processes: have separate address spaces. Use the fork() system call to create a new process and exec() to execute a different program. Need some mechanism for inter-process communication: signal, pipe, socket, shared memory, message

queue, clipboard (yes, clipboard). Threads: share an address space, but also have some per-thread state: PC and other registers, stack. Use pthread_create() to create a thread. No need for IPC: data are already shared.
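As a minimal sketch of that point (the counter name and loop bound here are arbitrary), two threads created with pthread_create() update the same global variable under a mutex, with no IPC involved:

    #include <pthread.h>
    #include <stdio.h>

    int shared_counter = 0;                          /* visible to every thread in the process */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void *worker(void *arg) {
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&m);                  /* protect the shared counter */
            shared_counter++;
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d\n", shared_counter);    /* prints 2000: both threads saw the same memory */
        return 0;
    }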

Multi-threaded Program in Network Programming Exploit multiple (core) CPUs. Fast context switching (low overhead). Exploit I/O concurrency: if a thread blocks on I/O, other threads carry on. Deal with multiple client connections.

Problems with Multi-threaded Program Creating a thread for each request is expensive. If too many requests come in, the memory will be used up. Thread Pool

Like the concept of a connection pool. Create a fixed number of threads on startup. A dispatcher thread waits for requests. For each request, choose an idle worker thread. When the worker thread is done, it goes back to the pool.

Synchronization Mutex lock protects the critical section: pthread_mutex_lock(), pthread_mutex_unlock(). Condition variables avoid busy waiting and repeatedly using locks: pthread_cond_wait(). Why Condition Variables

Some operations on shared data structures can only be done when a certain condition is met, e.g. dequeue can only be done when the queue is not empty. Possible solution: repetitive lock-and-test.

    DEQUEUE:
        try_again:
            lock(queue_mutex);
            if (queue is empty) {
                unlock(queue_mutex);
                goto try_again;
            }
            dequeue();
            unlock(queue_mutex);

    ENQUEUE:
        lock(queue_mutex);
        enqueue();
        unlock(queue_mutex);

Problem: repetitive locking and unlocking is costly. Condition Variables: Typical operations on a condition variable (used with a lock): wait: waiting on the condition variable; signal: waking up one waiting thread to run.

    DEQUEUE:
        lock(queue_mutex);
        while (queue is empty)
            wait(queue_mutex, nonempty_cond);
        dequeue();
        unlock(queue_mutex);

    ENQUEUE:
        lock(queue_mutex);
        enqueue();
        signal(nonempty_cond);
        unlock(queue_mutex);
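A minimal pthreads version of the queue above (the fixed-size ring buffer is an assumption; the slides do not specify a queue implementation). This handoff is also the core of a thread pool's dispatcher/worker interaction:

    #include <pthread.h>

    #define QSIZE 16
    static int buf[QSIZE];
    static int head = 0, count = 0;                  /* ring-buffer state, protected by the mutex */
    static pthread_mutex_t queue_mutex = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t nonempty_cond = PTHREAD_COND_INITIALIZER;

    void enqueue(int item) {
        pthread_mutex_lock(&queue_mutex);
        buf[(head + count) % QSIZE] = item;          /* assume the queue never fills, for brevity */
        count++;
        pthread_cond_signal(&nonempty_cond);         /* wake one waiting consumer */
        pthread_mutex_unlock(&queue_mutex);
    }

    int dequeue(void) {
        pthread_mutex_lock(&queue_mutex);
        while (count == 0)                           /* while, not if: recheck the condition after wakeup */
            pthread_cond_wait(&nonempty_cond, &queue_mutex);
        int item = buf[head];
        head = (head + 1) % QSIZE;
        count--;
        pthread_mutex_unlock(&queue_mutex);
        return item;
    }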

RPC: A remote procedure call that looks like a local one. Commonly used today. RPC Architecture: Create stub functions to make it appear to the user that the call is local.

The stub function contains the function's interface. The client side composes an RPC request and sends it to the server; the server side executes the procedure and sends back the result; the client side returns the result to the caller. Marshaling and Unmarshaling: Incompatibility problems:

Different byte ordering, different sizes of integers, different floating-point representations, different character sets, different alignments, ... Marshaling converts the data for transfer across platforms; unmarshaling is the opposite, constructing the object from the data received. Sometimes referred to as serialize and deserialize.
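As one small, hedged example of marshaling (only the byte-ordering problem, and only for a 32-bit integer; the function names are made up), the standard htonl()/ntohl() conversions put the value into network byte order on the wire:

    #include <arpa/inet.h>   /* htonl(), ntohl() */
    #include <stdint.h>
    #include <string.h>

    /* Marshal a 32-bit integer into a buffer in network byte order (big-endian). */
    size_t marshal_u32(uint32_t value, unsigned char *buf) {
        uint32_t net = htonl(value);          /* host order -> network order */
        memcpy(buf, &net, sizeof net);
        return sizeof net;
    }

    /* Unmarshal: reconstruct the integer regardless of the receiver's native byte order. */
    uint32_t unmarshal_u32(const unsigned char *buf) {
        uint32_t net;
        memcpy(&net, buf, sizeof net);
        return ntohl(net);                    /* network order -> host order */
    }

Real RPC stubs also have to handle strings, structs, floats, and alignment, which is what RPC frameworks generate stubs for.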

RPC Semantics: Communication and machines may fail. Most RPC systems offer either at-least-once semantics or at-most-once semantics (such as YFS RPC).

No such thing as exactly once semantics More Issues with RPC RPC is slower, a lot slower Security Authentication, encryption, etc. Clock Synchronization Why? Temporal ordering of events produced by

concurrent processes Synchronization between senders and receivers of messages Coordination of joint activity Serialization of concurrent access to shared objects Clock Synchronization: Real-world clocks: Cristian's Algorithm,

the Berkeley Algorithm. Logical clocks: the Lamport Logical Clock. Cristian's Algorithm: request the time and get a reply; measure the actual round-trip time d. Request sent at T0, reply received at T1, so d = T1 - T0. The client sets its time to

    T_new = T_server + (T1 - T0) / 2

with error bound <= (T1 - T0) / 2.
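A minimal sketch of that adjustment (timestamps in milliseconds; the function name and values are hypothetical):

    /* T0 = time the request was sent, T1 = time the reply was received (client's clock),
       t_server = time reported in the server's reply. */
    long cristian_adjust(long t0, long t1, long t_server) {
        long d = t1 - t0;               /* measured round-trip time */
        return t_server + d / 2;        /* assume the reply took about half the round trip */
    }
    /* Example: T0 = 1000, T1 = 1040, server says 5020  ->  new time = 5040, error bound +/- 20. */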

The Berkeley Algorithm Gusella & Zatti, 1989 Assumes no machine has an accurate time source Obtains an average from participating computers Synchronizes all clocks to the average Can discard outliers Berkeley Algorithm Machines run a time daemon,

a process that implements the protocol. One machine is elected (or designated) as the server (master); the others are slaves. The algorithm has provisions for ignoring readings from clocks whose skew is too great: it computes a fault-tolerant average. If the master fails, any slave can take over.
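A minimal sketch of the fault-tolerant averaging step (the threshold, the function name, and the use of per-machine clock offsets relative to the master are assumptions):

    #include <stddef.h>

    double fault_tolerant_average(const double *offsets, size_t n, double max_skew) {
        double sum = 0.0;
        size_t used = 0;
        for (size_t i = 0; i < n; i++) {
            if (offsets[i] >= -max_skew && offsets[i] <= max_skew) {   /* ignore readings whose skew is too great */
                sum += offsets[i];
                used++;
            }
        }
        return used ? sum / used : 0.0;   /* the master then tells each machine how much to adjust */
    }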

Network Time Protocol, NTP 1991, 1992 Internet Standard, version 3: RFC 1305

Synchronization similar to Cristian's algorithm. All messages delivered unreliably with UDP. Uses a stratum structure. NTP example: the Columbia NTP server is stratum 2. Problems with Real Synchronization: Clocks are never exactly synchronized

More often than not, distributed systems do not need real time, but *some* time that every machine in a protocol agrees upon! Logical Time What does Concurrent Mean? Happening at the same time? NO. There is nothing called simultaneous in the physical world.

Concurrency means the absence of causal order. Sequential and Concurrent Events Sequential = Totally ordered in time. Total ordering is feasible in a single process that has only one clock. This is not true in a distributed system. Two issues are important here:

How to synchronize physical clocks? Can we define sequential and concurrent events without using physical clocks? Causality: Causality helps identify sequential and concurrent events without using physical clocks. Example: "Joke" and "Re: joke"; the reply implies the original is causally ordered before it (happened before it).

Causality rules: Rule 1. If a, b are two events in a single process P, and the time of a is less than the time of b, then a → b. Rule 2. If a = sending a message and b = receipt of that message, then a → b.

Rule 3 (transitivity): if a → b and b → c, then a → c. Lamport Clocks: A Lamport clock assigns logical timestamps to events consistent with the happens-before ordering. All hosts use a counter (clock) with an initial value of zero. The counter is incremented by 1 and assigned to each event as its timestamp. A send(message) event carries its timestamp. For a receive(message) event, the counter is updated to max(receiver.counter, message.timestamp) + 1.
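A minimal sketch of those update rules (the type and function names are hypothetical):

    /* Lamport clock: one counter per host, updated on every event. */
    typedef struct { unsigned long counter; } lamport_clock;

    unsigned long lc_local_event(lamport_clock *c) {
        return ++c->counter;                       /* increment and use as the event's timestamp */
    }

    unsigned long lc_send(lamport_clock *c) {
        return ++c->counter;                       /* the outgoing message carries this timestamp */
    }

    unsigned long lc_receive(lamport_clock *c, unsigned long msg_ts) {
        c->counter = (c->counter > msg_ts ? c->counter : msg_ts) + 1;   /* max(counter, msg ts) + 1 */
        return c->counter;
    }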

Distributed Mutex: Coordination. Five algorithms discussed; none of the algorithms is perfect.

Safety: At most one process may execute in the CS at any time, and no deadlock. Liveness: Every request for the CS is eventually granted; this is called progress. Fairness: Requests are granted in FIFO order, with bounded wait.

Centralized Lock Server A central coordinator Grants permission to enter CS & keeps a queue of requests to enter the CS. Ensures only one thread at a time can access the CS. Centralized Lock Server

To enter the critical section: send REQUEST to the central server and wait for permission from the server. To leave: send RELEASE to the central server. Server: has an internal queue of all REQUESTs it has received but to which it hasn't yet sent OK; it delays sending OK back to a process until that process is at the head of the queue.
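A minimal, single-threaded sketch of that server logic for one lock (message transport is omitted: send_ok() just prints, whereas a real server such as the YFS lock server would send an RPC reply; all names are hypothetical):

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct waiter { int client_id; struct waiter *next; } waiter;

    static waiter *queue_head = NULL;   /* clients waiting for the lock, FIFO order */
    static int holder = -1;             /* -1 means the lock is currently free */

    static void send_ok(int client_id) { printf("OK -> client %d\n", client_id); }

    static void enqueue(int client_id) {
        waiter *w = malloc(sizeof *w);
        w->client_id = client_id;
        w->next = NULL;
        waiter **p = &queue_head;
        while (*p) p = &(*p)->next;
        *p = w;
    }

    static int dequeue(void) {
        waiter *w = queue_head;
        int id = w->client_id;
        queue_head = w->next;
        free(w);
        return id;
    }

    /* REQUEST: grant if free, otherwise queue the client. */
    static void on_request(int client_id) {
        if (holder == -1) { holder = client_id; send_ok(client_id); }
        else enqueue(client_id);
    }

    /* RELEASE: hand the lock to the head of the queue, if any. */
    static void on_release(void) {
        holder = (queue_head != NULL) ? dequeue() : -1;
        if (holder != -1) send_ok(holder);
    }

    int main(void) {
        on_request(1);   /* client 1 gets OK immediately */
        on_request(2);   /* client 2 is queued */
        on_release();    /* client 2 now gets OK */
        return 0;
    }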

Centralized Lock Server The lock server implementation in YFS Safety, liveness and order are guaranteed It takes 3 messages per entry/exit operation.

Client delay: one round-trip time (request + grant). The coordinator becomes a performance bottleneck and a single point of failure. This is bad. Token Ring Approach: Processes are organized in a logical ring; p_i has a communication channel to

p_((i+1) mod N). Operations: Only the process holding the token can enter the CS. To exit the CS, the process sends the token on to its neighbor. If a process does not need to enter the CS when it receives the token, it forwards the token to the next neighbor. Token Ring Approach

Safety & liveness are guaranteed, but ordering is not. Client delay 0 to N message transmissions. A shared priority queue Each process i locally maintains Q(i) , part of a shared priority queue To run critical section, must have replies from all other processes AND be at the front of Q(i)

When you have all replies: #1: All other processes are aware of your request. #2: You are aware of any earlier requests for the mutex. A shared priority queue: FIFO fairness. Messages per CS entry: 3(N-1) (N-1 requests + N-1 acks + N-1 releases).

Very unreliable - any process failure halts progress. Ricart & Agrawala's algorithm: An improved version of Lamport's shared priority queue. 1. Broadcast a timestamped request to all. 2. Upon receiving a request, send an ack if you do not want to enter your CS, or you are trying to enter your CS but your timestamp is higher than that of the sender. (If you are in the CS, buffer the request.)

3. Enter the CS when you receive acks from all. 4. Upon exit from the CS, send an ack to each pending request before making a new request. (No release message is necessary; no priority queue needed.) Ricart & Agrawala's algorithm: Safety: prove that at most one process can be in the CS. Fairness: prove that FIFO fairness holds even if channels are not FIFO.

Not reliable. Message complexity = 2(N-1) (N-1 requests + N-1 acks; no release message).
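A minimal sketch of just the ack-or-defer decision from step 2 above (the state enum and tie-breaking by process id are assumptions; message handling is omitted):

    #include <stdbool.h>

    enum state { RELEASED, WANTED, HELD };

    struct node {
        enum state st;
        unsigned long my_ts;     /* Lamport timestamp of my own pending request */
        int my_id;               /* used to break timestamp ties */
    };

    /* Return true to send an ack now; false means buffer the request until we leave the CS. */
    bool should_ack(const struct node *n, unsigned long req_ts, int req_id) {
        if (n->st == HELD)
            return false;                              /* in the CS: buffer the request */
        if (n->st == WANTED) {
            /* both want the CS: the earlier (smaller) timestamp wins; ids break ties */
            if (req_ts < n->my_ts || (req_ts == n->my_ts && req_id < n->my_id))
                return true;                           /* the sender is earlier: ack now */
            return false;                              /* we are earlier: defer */
        }
        return true;                                   /* RELEASED: ack immediately */
    }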

Majority rules: Instead of collecting REPLYs, collect VOTEs. Can progress with as many as N/2 - 1 failed processes. Not fair. Deadlock! No guarantee that anyone receives a majority of votes. DFS: Networked file systems with a single-server architecture. Their goals: have a consistent namespace for files across computers; let authorized users access their files from any

computer. That's why you have identical files on all CLIC machines. SUN NFS: NFS provides transparent, remote file access. Simple, portable, really popular (it's gotten a little more complex over time). Weak consistency semantics.

Requires hefty server resources to scale (flush-on-close, server queried for lots of operations). SUN NFS: NFS V1-V3 are stateless; NFS V4 is stateful. AFS: Developed at CMU (~1985), commercial product by Transarc (part of OSF/DCE). Segments the network into clusters, with one file

server per cluster. Dynamic reconfiguration to balance load. Stateful protocol + aggressive caching: servers participate in client cache management; entire files are cached on disk. Google File System (GFS): 1. Machine failures are the norm: 1000s of components; bugs, human errors, failures of memory, disk,

connectors, networking, and power supplies Monitoring, error detection, fault tolerance, automatic recovery must be integral parts of a design 2. Design for big-data workloads Search, ads, Web analytics, Map/Reduce, GFS A GFS cluster A single master Many chunkservers

Accessed by many clients. A file is divided into fixed-size chunks (similar to FS blocks), labeled with 64-bit unique global IDs (called handles), stored at chunkservers, and 3-way replicated across chunkservers. The master keeps track of metadata (e.g., which chunks belong to which files).
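A small sketch of the client-side translation from a byte offset to a chunk index (the 64 MB chunk size follows the GFS paper; the function itself is hypothetical):

    #include <stdint.h>

    #define CHUNK_SIZE (64ULL * 1024 * 1024)   /* 64 MB chunks, as in the GFS paper */

    uint64_t chunk_index(uint64_t byte_offset) {
        return byte_offset / CHUNK_SIZE;       /* the client sends (file name, chunk index) to the
                                                  master, which replies with the 64-bit chunk handle
                                                  and the replica locations */
    }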

Consistency Models: Replicated data is a huge theme in distributed systems, for performance and fault tolerance. Often easier to replicate data than computation. Caches in NFS. Involves sophisticated optimizations for performance.

How do we know it is correct? Distributed Shared Memory (Reminder): Two models for communication in distributed systems: message passing and shared memory. Shared memory is often thought to be more intuitive for writing parallel programs than message

passing. Each node can access a common address space. Naive DSM: Assume each host has a local copy of all of memory. Reads are local, so they are very fast. Send a write msg to each other host, but don't wait.

Naive DSM Example 1: Simple mutual exclusion algorithm, for locking. x and y start as zero on both CPUs. Intuitive explanation for why this should "work": if CPU0 sees y == 0, CPU1 can't have reached "y = 1", so CPU1 must see x == 1, so

it won't execute the critical section. Perhaps neither will enter, but never both.

    CPU 0:
        x = 1;
        if (y == 0) enter_cs();

    CPU 1:
        y = 1;
        if (x == 0) enter_cs();

Problem 1

Maybe both will enter the CS: CPU0 sends the write x=1 msg and reads local y=0; CPU1 reads local x=0 before the write msg arrives. Local memory and slow writes cause disagreement about r/w order: CPU0 thinks its x=1 was before CPU1's read of x; CPU1 thinks its read of x was before the arrival of x=1; so both can enter the critical section! Naive DSM Example 2

    CPU0:
        v0 = f0();
        done0 = true;

    CPU1:
        while (!done0)
            ;
        v1 = f1(v0);
        done1 = true;

    CPU2:
        while (!done1)
            ;
        v2 = f2(v0, v1);

CPU2 should execute f2() with results from CPU0 and CPU1. Waiting for CPU1 implies waiting for CPU0.

Problem 2 CPU0's writes of v0 and done0 may be interchanged by network Leaving v0 unset but done0=true But assume each CPU sees each other's writes in issue order Yet another problem: CPU2 sees CPU1's writes before CPU0's writes i.e. CPU2 and CPU1 disagree on order of CPU0 and

CPU1 writes Consistency Models There are no "right" or "wrong" models A model may make it harder or easier to program i.e. lead to more or less intuitive results A model may be harder or easier to implement efficiently

Strict Consistency: Each instruction is stamped with the wall-clock time at which it started, across all CPUs. Rule 1: a load gets the value of the most recent previous store to the same address. Rule 2: each CPU's instructions have timestamps in execution order. Essentially the same as on a uniprocessor. Property of Strict Consistency

Very intuitive behavior, but cannot be implemented efficiently. Sequential Consistency: Is an execution (a set of operations) correct? There must be some total order of operations such that Rule 1. All CPUs see results consistent with that total order, i.e. reads see the most recent write in the

total order. Rule 2. Each CPU's instructions appear in order in the total order. Implementing Sequential Consistency: 1. Each CPU executes reads/writes in program order, one at a time. 2. Each memory location executes reads/writes in arrival order, one at a time. Proof in Lamport 1979.

In what sense is sequential looser than strict? I.e. what are we giving up? I.e. what programs will break? Answer: sequential consistency doesn't let you reason about timing. You *can* reason based on per-CPU instruction order and observed values:

An Example:

    CPU0: w(x)0  w(x)1
    CPU1: w(y)0  w(y)2
    CPU2: r(y)?  r(x)?

Strict consistency requires r(y)=2 and r(x)=1, but sequential consistency allows either or both to read 0. Causal Consistency: Any execution is the same as if all causally-related read/write ops were executed in an order that reflects their causality. All concurrent ops may be seen in different orders.

Eventual Consistency: Allow stale reads, but ensure that reads will eventually reflect previously written values, even after very long times. Doesn't order concurrent writes as they are executed, which might create conflicts later: which write was first? Used in Amazon's Dynamo, a key/value store, plus a lot of academic systems,

and file synchronization. Transactions: ACID Properties: Atomicity: none or exactly once (2PC). Consistency: A transaction produces consistent results only; otherwise it aborts. Isolation: A program running under transaction protection must behave exactly as it would in single-user mode. (2PL)

Durability: The results of transactions having completed successfully must not be forgotten by the system 2-Phase Lock (2PL) Phase 1: acquire locks Phase 2: release locks You may not get any more locks after you release any locks Typically implemented by not allowing

explicit unlock calls; locks are automatically released on commit/abort.

2PL deadlocks:

    t1.lock(foo);  t1.lock(bar);
    t2.lock(bar);  t2.lock(foo);

To overcome this issue: each transaction can get all its locks at once, or each transaction can get all its locks in a

predefined order (see the sketch below). 2PL Deadlock: However, these approaches are impractical, as transactions often do not know which locks they will need in the future. Another option: automatically abort all long-running transactions.
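As an illustration of the predefined-order idea referenced above (ordering locks by address is a common convention, not something the slides prescribe; and as noted, knowing all locks up front is often impractical):

    #include <pthread.h>

    void lock_pair_in_order(pthread_mutex_t *a, pthread_mutex_t *b) {
        /* Every transaction that needs both locks takes them in the same global order,
           so the circular wait in the t1/t2 example above cannot occur. */
        if (a < b) {
            pthread_mutex_lock(a);
            pthread_mutex_lock(b);
        } else {
            pthread_mutex_lock(b);
            pthread_mutex_lock(a);
        }
    }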

Two-phase Commit (2PC): A distributed transaction protocol. Phase 1: Voting: each participant prepares to commit, and votes on whether or not it can commit. Phase 2: Committing: each participant actually commits or aborts. Two-phase Commit: The TC sends prepare messages to A and B.

A and B respond, saying whether they're willing to commit. If both say yes, the TC sends commit messages. If either says no, the TC sends abort messages.
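A minimal sketch of the coordinator's two phases (the participant count, function names, and printf stand-ins for messages are all hypothetical; a real coordinator must also record its decision on permanent storage before phase 2, as described below):

    #include <stdbool.h>
    #include <stdio.h>

    #define NPARTICIPANTS 2

    /* Hypothetical stand-ins for sending canCommit? / doCommit / doAbort messages. */
    static bool can_commit(int p) { printf("canCommit? -> %d\n", p); return true; }
    static void do_commit(int p)  { printf("doCommit  -> %d\n", p); }
    static void do_abort(int p)   { printf("doAbort   -> %d\n", p); }

    void run_two_phase_commit(void) {
        bool all_yes = true;

        /* Phase 1: voting -- collect a yes/no vote from every participant. */
        for (int p = 0; p < NPARTICIPANTS; p++)
            if (!can_commit(p))
                all_yes = false;

        /* (Log the decision to stable storage here, before telling anyone.) */

        /* Phase 2: completion -- unanimous yes commits, any no aborts. */
        for (int p = 0; p < NPARTICIPANTS; p++) {
            if (all_yes) do_commit(p);
            else         do_abort(p);
        }
    }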

A and B decide to commit if they receive a commit message. The Voting Phase Coordinator asks each participant: canCommit?(T) Participants must prepare to commit using permanent storage before answering yes Objects are still locked Once a participant votes yes, it is not allowed

to cause an abort Outcome of T is uncertain until doCommit or doAbort Other participants might still cause an abort The commit phase The coordinator collects all votes If unanimous yes, causes commit If any participant voted no, causes abort The fate of the transaction is decided atomically

at the coordinator, once all participants vote. The coordinator records the fate using permanent storage, then broadcasts doCommit or doAbort to the participants. Timeout Actions: A participant waiting for a vote request: abort. The coordinator waiting for votes: abort. A participant waiting for the decision: (tricky) the coordinator may have sent commit to the others and then crashed,

or it may have sent abort to the others and then crashed. No way to tell which is the case; have to wait (blocked). Uncertain state: must wait until the coordinator recovers (the window between all participants voting and coordinator recovery). 2PC is safe, but not live. Three-phase Commit (3PC): Reduces the uncertainty period of 2PC.

Avoids blocking as long as a majority of processes can agree on the action. Three-phase Commit: 3PC is live, but not safe. It is impossible to design a protocol that's both

safe and live in the most general case. Terminologies Mean time to failure (MTTF) Mean time to repair (MTTR)

Availability = MTTF / (MTTF + MTTR). Mean time between failures: MTBF = MTTF + MTTR. Example: Suppose the OS crashes once per month and takes 10 min to reboot. MTTF = 30 days = 720 hours = 43,200 minutes; MTTR = 10 minutes. Availability = 43,200 / 43,210 ≈ 0.9998 (~3 nines). That's 2 hours of downtime per year.

Disk Failure Whole disk failure (power supply, electronics, motor, etc.) Sector errors - soft or hard Read or write to the wrong place (e.g., disk is bumped during operation) Can fail to read or write if head is too high, coating on disk bad, etc. Disk head can hit the disk and scratch it.

Checksums and Hashes Checksums: Usually guard against accidental modification Weak but fast & easy Cryptographic hash functions: Usually more expensive, guard against malicious modification RAID (Redundant Array of Independent

Disks) RAID 0: No redundancy; improves performance. RAID 1: Mirroring; keep a full copy of the data on a second disk. RAID {2..6}: Store a strong checksum (parity) with the data to catch errors and then fix all the errors you can.

Write-Ahead Logging Keep a separate on-disk log of all operations Log first, and then do IO The tail of the log can be kept in memory until a transaction commits or a buffer page is flushed to disk Recovering from Simple Failures If we can read the log Redo all (usually) transactions (forward)

Use new-value in byte-level update records Undo uncommitted transactions (backward) Use old-value in byte-level update records It may take a long time to start the system after crashes Paxos The only known completely-safe and largely-live agreement

protocol Lets all nodes agree on the same value despite node failures, network failures, and delays Only blocks in exceptional circumstances that are vanishingly rare in practice Extremely useful, e.g.: nodes agree that client X gets a lock nodes agree that Y is the primary nodes agree that Z should be the next operation to be executed

Paxos Safety: If agreement is reached, everyone agrees on the same value, and the value agreed upon was proposed by some node. Fault tolerance (i.e., as-good-as-it-gets liveness): If fewer than half the nodes fail, the remaining nodes reach agreement eventually. No guaranteed termination (i.e., imperfect liveness): Paxos may not always converge on a value, but only in very

degenerate cases that are improbable in the real world Ah, and lots of awesomeness Basic idea is obvious in retrospect, but details are complex! Chubby What is Chubby? Lock service in a loosely-coupled distributed system Client interface similar to: Whole-file advisory locks (think .lck files)

Notification of various events (e.g., file modifications; think inotify). Primary goals: reliability, availability, easy-to-understand semantics. Bigtable: A distributed storage system for (semi-)structured data. Scalable:

Thousands of servers Terabytes of in-memory data Petabyte of disk-based data Millions of reads/writes per second, efficient scans Self-managing Servers can be added/removed dynamically

Servers adjust to load imbalance. Extremely popular at Google (as of 2008): the Bigtable paper (OSDI '06) has been cited 1560 times! Slides Ack: Some materials from Jingyu Zhou, https://jingyu.dyndns.org/~jzhou/courses/S11.distributed/ and https://jingyu.dyndns.org/~jzhou/courses/S11.distributed.grad/
