CSE 40771/60771 - Distributed Systems - Fall 2021

Prof. Douglas Thain, University of Notre Dame

Course Overview

A distributed system is any computer system consisting of multiple machines that work together on a common problem. Distributed systems appear in many areas of computing, including cloud computing, mobile computing, edge computing, the internet of things, aerospace systems, and more. Distributed systems are both interesting and difficult to build because their components may be autonomous and highly failure-prone. Students will learn the fundamental principles of distributed systems, study examples of current distributed systems, and build their own distributed systems from scratch. Topics include concurrency, fault tolerance, replication, consistency, and agreement. Students will undertake a final project that involves building and evaluating a custom distributed system. Grading will be based on assignments, exams, and a final project.

This will be a fun and challenging class for students who like to build working software systems. The study of distributed systems connects the very practical aspects of software engineering (e.g. how to handle a network disconnection) and the fundamental principles of computers (e.g. whether a partitioned system can reach agreement). The skills that you learn here will apply directly to advanced systems used in industry.

Prerequisites

CSE 20232 (Data Structures) and CSE 20289 (Systems Programming)

Textbook

Maarten van Steen and Andrew Tanenbaum, Distributed Systems, 3rd edition, 2017.

Programming Assignments

Five programming assignments are required, due approximately two weeks apart for the first ten weeks of the semester. The assignments together build towards an implementation of a scalable key-value store that could run in a cloud service or as a peer-to-peer system.
  1. Measuring Fundamentals.
    Precisely measure the cost of fundamental operations in the system: function call, hash table read/write, network packet, file system I/O, process creation.
  2. Remote Procedure Call.
    Build a system in Python for performing remote procedure call between processes. Carefully measure the performance and throughput of this system with multiple clients.
  3. Logging and Recovery.
    Make the prior system persistent by implementing logging, recovery, and periodic log compression. Measure the performance of the system, observing outliers.
  4. Caching and Consistency.
    Improve the performance of the prior system by implementing a caching layer that provides a non-trivial consistency model. Measure the performance under various workloads.
  5. Replication.
    Improve the prior system by implementing one of several possible replication models, such as chain replication, primary-backup, or round robin. Measure the performance and scalability.
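
As a rough illustration of the kind of measurement the first assignment asks for, the sketch below times a few in-memory operations by repeating each one many times and averaging. It is a minimal sketch, not the assignment's required method: the helper name `measure`, the repetition count, and the choice of operations are all assumptions made for this example, and the reported figures include the overhead of the loop and of the Python function call itself.

```python
import time

def measure(operation, repetitions=100_000):
    """Return the average wall-clock time of one call to operation, in seconds.

    Hypothetical helper for illustration; the result includes loop and
    call overhead, so it is an upper bound on the operation's own cost.
    """
    start = time.perf_counter()
    for _ in range(repetitions):
        operation()
    elapsed = time.perf_counter() - start
    return elapsed / repetitions

def empty_function():
    # Baseline: the cost of a bare function call.
    pass

# A Python dict serves as the hash table in this sketch.
table = {"key": "value"}

def hash_table_read():
    return table["key"]

def hash_table_write():
    table["key"] = "new value"

if __name__ == "__main__":
    operations = [
        ("function call", empty_function),
        ("hash table read", hash_table_read),
        ("hash table write", hash_table_write),
    ]
    for name, operation in operations:
        nanoseconds = measure(operation) * 1e9
        print(f"{name}: {nanoseconds:.1f} ns per operation")
```

Averaging over many repetitions smooths out timer resolution and scheduling noise. The same helper could be pointed at a network send, a file write, or process creation, with a smaller repetition count for those slower operations.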

Final Project

In the final project, students will propose, build, and measure a distributed system of their own design, which must make use of multiple techniques discussed in class to achieve a system that is robust and performant. Examples might include a distributed filesystem, a parallel programming model, or a peer-to-peer data routing system. The final submission will include a project report describing the design of the system.

Graduate Students

Graduate students taking CSE 60771 will have the following additional work. A selection of paper readings will be assigned that address the course topics in greater detail, balanced between "classic" results in distributed systems and specific case studies in distributed systems design. In addition, the final project report must be written as an academic paper suitable for submission to an academic conference, including a problem statement, survey of related work, system design, and a complete performance evaluation.

Outline of Topics

  1. Overview of Distributed Systems (1 week)
    Purposes: Parallelism, Communication, Redundancy, Autonomy
    Architectures: Client-Server, Manager-Worker, Peer-to-Peer, Brokers
    Settings: Cluster, Cloud, Edge, Mobile, IoT, Interplanetary, …
  2. Fundamental Properties of Distributed Systems (1 week)
    Processes: Private Computer, Relationships, Failure Models
    Networks: Messages, Layering, Failure Model
    Clocks: Local Devices, Communication, Logical Clocks
  3. Remote Procedure Call (1 week)
    Classic RPC Model
    Tradeoffs between time and failure
    Client Concurrency: Asynchronous Calls
    Server Concurrency: Processes or Threads
  4. State Management (1 week)
    Cost of Persistence: Files, Blocks, and Synchronization
    Transaction Logs and State Machines
    Checkpointing and Recovery
    Performance Tradeoffs
  5. Caching and Consistency (2 weeks)
    Classic N-Client Caching: Write-Through, Write-Back, Callbacks, and Leases
    CAP Theorem: Consistency, Availability, Partitioning Tradeoffs
    Consistency Models
  6. Replication (1 week)
    Primary-Backup
    Chain Replication
    Distributed Transactions
  7. Scalable Storage Systems (2 weeks)
    Data Versus Metadata
    Performance Evaluation Metrics
    Scalable Filesystems (Ceph)
    Scalable Data Structures (Kafka, RabbitMQ)
  8. Scalable Computation Systems (2 weeks)
    Distributed Execution Semantics
    Performance Evaluation Metrics
    Batch Systems (HTCondor, Sparrow)
  9. Peer-to-Peer Algorithms and Systems (1 week)
    Mutual Exclusion
    Leader Election
    Distributed Hash Tables (Chord)
  10. Distributed Agreement (1 week)
    Byzantine Generals
    Paxos and Raft
    Case Study: ZooKeeper
  11. Security (1 week)
    Symmetric Key Systems (Kerberos)
    Public-Key Systems (WWW Infrastructure)
    Blockchain (Bitcoin)