CSE 40822 - Cloud Computing - Lecture Outline

Prof. Douglas Thain, University of Notre Dame

Caution: These are high level notes that I use to organize my lectures. You may find them useful for reviewing main points, but they aren't a substitute for the readings or for participating in class.

Week 1: The Cloud Landscape

The term “cloud” is very broad and encompasses a wide variety of computing techniques. Some of them have been around for a long time (e.g. distributed computing) while others are relatively new (pay-as-you-go).

A rough working definition: A cloud is a distributed system composed of multiple machines that work together to serve multiple users with high reliability, large capacity, and rapid scalability.

Some key aspects of cloud computing (but not everything called “cloud” has all of these):

A brief history of computing, leading up to clouds:

(Many aspects of computing writ large can be seen as pendulums that swing from one extreme to another, driven by both technology and society. Centralization/distribution is one of these pendulums.)

Cloud Architecture Layers:
End User
Scalable Web Interface
Applications
Middleware
(HTCondor, Hadoop, ...)
Virtualized Resources
(VMWare, Docker, ...)
Physical Resources
(CPU, RAM, Disk, GPU)
Layers of Service Delivery:

How does this change things for IT and business as a whole?

Distinguishing related terms:

Cloud on the Hype Cycle
References:

Week 2: Principles of Distributed Computing

To understand clouds, we must first have a handle on distributed systems in general, so this week is a crash course in operating systems, then networks, and then distributed systems, which combine the two.

Definitions:

Quick Overview of Operating Systems

The earliest machines had no OS, which made sharing and portability hard. A modern OS exists to share resources between competing users, and to allow programs to move portably between different machines. Layers of a conventional operating system:

Applications: firefox, emacs, gcc
System Calls: open/read/write/fork/exec
Abstractions: Filesystem, Virtual Memory
Drivers: Disk, Network, Video
Hardware: IDE, Ethernet, VGA

A process is a running program that has its own private address space, and is protected from interference by other programs. It is both a unit of concurrency and a unit of independent failure. (i.e. A process can be safely killed.)

A thread is an additional unit of concurrency that can run inside a process. But it is not a unit of independent failure: threads cannot be killed in any reliable way.

Multiprocess Server Example:

HTTPD Example:
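
The notes for these examples are sparse here, so as a hedged sketch (not the actual code used in class) of the multiprocess pattern: a fork-per-connection server in Python serves each client in its own child process, so a crash while handling one request cannot take down the whole server.

import os
import signal
import socket

signal.signal(signal.SIGCHLD, signal.SIG_IGN)    # auto-reap exited children

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(("", 8080))
server.listen(5)

while True:
    conn, addr = server.accept()
    if os.fork() == 0:                           # child: serve this one client
        server.close()
        conn.recv(1024)                          # read (and ignore) the request
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\r\n")
        conn.close()
        os._exit(0)                              # only this child exits or crashes
    conn.close()                                 # parent: keep accepting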

Quick Overview of Networking

Architecture of the Internet:

Networking Layers:

Application: HTTP, FTP, DNS …
Transport: TCP / UDP
Network: Internet Protocol
Data Link: Ethernet, Token Ring, 802.11
Physical: Cat5, Optical, RF
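
As a small illustration (a sketch, not part of the original notes) of how the layers compose: an application-layer protocol such as HTTP is just bytes written over a transport-layer TCP connection, and everything below TCP is hidden behind the OS socket API.

import socket

s = socket.create_connection(("example.com", 80))            # transport: TCP
s.sendall(b"GET / HTTP/1.0\r\nHost: example.com\r\n\r\n")    # application: HTTP
print(s.recv(4096).decode(errors="replace"))                 # server's reply
s.close()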

Most Commonly Used Protocols:

Idealized Vision of the Internet

Reality of the Internet

Abstract view of the Internet from applications:

Principles of Distributed Systems

A distributed system consists of a set of processes that work together to accomplish some task by communicating over a network. As described above, processes are independent, self-contained programs, and the network allows them to exchange (unreliable) packets of limited size.

We would like to build distributed systems that work as simply and reliably as non-distributed systems, but that simply isn’t possible. Distributed systems are fundamentally different from standalone machines in (at least) four ways (latency, memory access, partial failure, and concurrency), as outlined by “A Note on Distributed Computing.”

“A Note” discusses this common fallacy: “Let’s take an existing program, break it into pieces (functions, objects, modules, etc) and then connect the pieces over the network. Now we have a usable distributed system that works just like the original system.” (This is the key idea in RPC, CORBA, DCOM, RMI, and many other similar systems.) It does not work because distributed systems are fundamentally different.

Easy to show with a thought experiment:

Suppose you have a regular program that makes use of a library implementing a stack data structure with the operations push(x) and x=pop(). We want to share the stack among multiple distributed users, so we put the stack in a separate server process, and have it accept and return messages. If the client sends “push(x)”, the server responds with “ok”. If the client sends “pop()”, the server responds with “x”, the value at the top of the stack. Messages can be lost, so if the client doesn’t get a response in a reasonable amount of time, it simply sends the request message again.
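
To see why this fails, here is a small Python sketch (the message format and names are illustrative) of what the naive retry scheme does to the stack:

stack = []

def server_handle(op, arg=None):
    # The server blindly executes every message it receives.
    if op == "push":
        stack.append(arg)
        return "ok"
    else:                            # op == "pop"
        return stack.pop()

# Case 1: the "ok" reply to a push is lost, so the client retries.
server_handle("push", "x")           # executed, but the reply is lost
server_handle("push", "x")           # retry: the same value is pushed twice
print(stack)                         # ['x', 'x']

# Case 2: the reply to a pop is lost, so the client retries.
server_handle("pop")                 # server pops an 'x', reply lost
server_handle("pop")                 # retry pops again: one value is gone forever
print(stack)                         # []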

Questions to consider:

  Small group discussion: Design a solution to this problem. Change the messages exchanged so that no data is lost and the stack still works as desired. (One possible sketch appears after the design principles below.)

Design principles for distributed protocols:

Moral of the story: Interfaces to distributed systems must be designed from scratch to accommodate failure and concurrency!
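
One possible redesign for the discussion problem above (a sketch, not the canonical answer): the client tags every request with a unique ID, and the server caches its reply for each ID, so a retried request replays the cached reply instead of re-executing the operation. This gives at-most-once execution even with at-least-once delivery:

stack = []
replies = {}                          # request id -> cached reply

def server_handle(reqid, op, arg=None):
    if reqid in replies:              # duplicate request: replay the old reply
        return replies[reqid]
    if op == "push":
        stack.append(arg)
        reply = "ok"
    else:                             # op == "pop"
        reply = stack.pop()
    replies[reqid] = reply
    return reply

server_handle(7, "push", "x")         # executed, but the reply is lost
server_handle(7, "push", "x")         # retry: cached "ok", no second push
print(stack)                          # ['x']

A real protocol would also need to bound the reply cache, for example by having clients acknowledge which replies they have received.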

References:
  • Maarten van Steen and Andrew S. Tanenbaum, Distributed Systems, 3rd ed., CreateSpace Independent Publishing Platform, 2017. ISBN: 978-1543057386

    Case Study: HTCondor

    Purpose:

    Basic Structure:

    Matchmaking:

    Job Universes:

    Building Computing Communities

    Example Applications

    References

    Workflows and Makeflow

    What is a workflow? A workflow is a form of parallel programming.

    Examples of Workflow Systems:

    Case Study: Makeflow

    Architecture

    Makeflow Language

    Example Used in Class:
    {
        "define" : {
            "ntemps" : 100,
            "detail" : "high",
            "grandinputs" : [ "output."+x+".txt" for x in range(1,ntemps,1) ]
        },

        "rules" : [
            {
                "command" : "echo --temp "+x+" --detail "+detail+" >output."+x+".txt",
                "inputs" : [ "input."+n+".txt" for n in range(1,11,2) ],
                "outputs" : [ "output."+x+".txt" ]
            } for x in range(1,ntemps,1),

            {
                "command" : "cat "+join(grandinputs," ")+" >grandoutput.txt",
                "inputs" : grandinputs,
                "outputs" : [ "grandoutput.txt" ]
            }
        ]
    }
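
    Assuming the file above is saved as example.jx, it would typically be run through Makeflow's JX mode, with a command along the lines of "makeflow --jx example.jx" (a hedged sketch of the invocation; see the CCTools documentation for the exact options).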
    
    References

    Map-Reduce and Hadoop

    Background and Context

    The Map-Reduce Programming Model

    The user provides two functions, Map and Reduce, and asks for them to be invoked on a given data set. They must have the following form:
    Map( key, value ) -> list( key, value )
    Reduce( key, list(values) ) -> output
    

    The framework is responsible for locating the data, applying the functions, and then storing the outputs. The user is not concerned with locality, fault tolerance, optimization, and so forth.

    The Map functions are applied to each of the files comprising the data sets, and emit a series of (key,value) pairs. Then, for each key, a bucket is created for all of the values with that key. The Reduce function is then applied to all values in that bucket.

    (Blackboard diagram of how this works.)

    WordCount is the “hello world” of Map-Reduce. This program reads in a large number of files and computes the frequency of each unique word in the input.

    Map( key, value ) {
       // key is the file name
       // value is the file contents
       For each word in value {
          Emit( word, 1 )
       }
    }
    
    Reduce( key, list(values) ) {
       count = 0;
       For each v in list(values) {
          count++;
       }
       Emit( key, count );
    }
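
    To make the data flow concrete, here is a toy, single-machine sketch in Python (illustrative only; a real framework such as Hadoop distributes both phases across machines and adds fault tolerance). WordCount is expressed against it:

    from collections import defaultdict

    def map_reduce(inputs, map_fn, reduce_fn):
        # Map phase: apply map_fn to every (key, value) input pair.
        buckets = defaultdict(list)
        for key, value in inputs:
            for k, v in map_fn(key, value):
                buckets[k].append(v)          # "shuffle": one bucket per key
        # Reduce phase: apply reduce_fn once per key, over that key's bucket.
        return [ reduce_fn(k, vs) for k, vs in buckets.items() ]

    def wc_map(filename, contents):
        for word in contents.split():
            yield (word, 1)

    def wc_reduce(word, counts):
        return (word, sum(counts))

    files = [ ("a.txt", "the cat sat"), ("b.txt", "the cat ran") ]
    print(map_reduce(files, wc_map, wc_reduce))
    # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]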
    

    Sometimes you need to run multiple rounds of Map-Reduce in order to get the desired effect. For example, suppose you now want to generate the ten most frequently used words in this set of documents. Run Map-Reduce again on the output of the previous round, with this program:

    Map( key, value ) {
       word = key
       count = value
       Emit( 1, (count, word) );
    }
    Reduce( key, list(values) ) {
       Sort list(values) by count, descending;
       For the first ten items in list(values) {
          Emit( count, word );
       }
    }

    (Using the single key 1 sends every (count, word) pair to one reducer, which is acceptable only because the final output is small.)
    

    Example Problems to Work in Class

    Suppose you have the following weather data. A set of (unsorted) tuples, each consisting of a year, month, day, and the maximum observed temp that day:

    (2007,12,10,35)
    (2008,3,22,75)
    (2015,2,15,12) ...
    
    1. Write a Map-Reduce program to compute the maximum temperature observed in each month for which data is present. (A sketch of one possible approach appears after this list.)
    2. Write a Map-Reduce program to compute the average temperature for each day of the year (over all years).
    3. Now suppose that you have data representing a graph of friends:
      A -> B,C,D
      B -> A,C,D
      C -> A,B
      D -> A,B
      
      Write a Map-Reduce program that will identify common friends:
      (A,B) -> C,D
      (A,C) -> B
      ...
    4. Write a Map-Reduce program that will identify the people with the greatest number of friends (incoming links, not outgoing links).
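
    For problem 1, one possible approach (a sketch only, written against the toy simulator above; the field names are illustrative): emit each temperature keyed by its (year, month), and reduce with max.

    def maxtemp_map(key, record):
        # Each record is (year, month, day, maxtemp); the input key is unused.
        (year, month, day, temp) = record
        yield ((year, month), temp)          # group temperatures by month

    def maxtemp_reduce(month, temps):
        return (month, max(temps))           # hottest reading in that month

    data = [ (None, (2007,12,10,35)), (None, (2008,3,22,75)), (None, (2007,12,25,40)) ]
    print(map_reduce(data, maxtemp_map, maxtemp_reduce))
    # [((2007, 12), 40), ((2008, 3), 75)]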

    The Hadoop Distributed System

    Hadoop began as an open-source implementation very similar in spirit to the Google File System (GFS) and the Map-Reduce programming model. It has grown into a complex ecosystem of interacting pieces of software.

    HDFS - Hadoop Distributed Filesystem

    Architecture:

    Interface:

    Considerations:

    Hadoop Map-Reduce

    Architecture:

    Interface:

    Considerations:

    Question: Which part of a Map-Reduce program is naturally scalable, and which part is likely to be a bottleneck? Does that affect how you would design a M-R program?

    References: