CSE 40771/60771 - Distributed Systems - Fall 2021
Prof. Douglas Thain, University of Notre Dame
A distributed system is any computer system consisting of multiple machines that work together on a common problem. Distributed systems appear in many areas of computing, including cloud computing, mobile computing, edge computing, the internet of things, aerospace systems, and more. Distributed systems are both interesting and difficult to build because their components may be autonomous and highly failure-prone. Students will learn the fundamental principles of distributed systems, study examples of current distributed systems, and build their own distributed systems from scratch. Topics include concurrency, fault tolerance, replication, consistency, and agreement.
Students will undertake a final project that involves building and evaluating a custom distributed system. Grading will be based on assignments, exams, and a final project.
This will be a fun and challenging class for students who like to build working software systems. The study of distributed systems connects the very practical aspects of software engineering (e.g. how to handle a network disconnection) with the fundamental principles of computers (e.g. whether a partitioned system can reach agreement). The skills that you learn here will apply directly to advanced systems used in industry.
Prerequisites: CSE 20232 (Data Structures) and CSE 20289 (Systems Programming)
Five programming assignments are required, due approximately two weeks apart for the first ten weeks of the semester. The assignments together build towards an implementation of a scalable key-value store that could run in a cloud service or as a peer-to-peer system.
- Measuring Fundamentals.
Precisely measure the cost of fundamental operations in the system: function call, hash table read/write, network packet, file system I/O, process creation.
- Remote Procedure Call.
Build a system in Python for performing remote procedure call between processes. Carefully measure the performance and throughput of this system with multiple clients.
- Logging and Recovery.
Make the prior system persistent by implementing logging, recovery, and periodic log compression. Measure the performance of the system, observing outliers.
- Caching and Consistency.
Improve the performance of the prior system by implementing a caching layer that provides a non-trivial consistency model. Measure the performance under various workloads.
- Replication.
Improve the prior system by implementing one of several possible replication models, such as chain replication, primary-backup, or round robin. Measure the performance and scalability.
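As an illustration of the logging, recovery, and log compression steps described in the assignments above, here is a minimal sketch in Python. It is not the assignment specification: the class name `LoggedStore`, the JSON-lines record format, and the single-file layout are all assumptions made for this example, and it assumes every record is written completely (a torn final record after a crash is not handled).

```python
import json
import os


class LoggedStore:
    """Hypothetical key-value store made persistent by an append-only log."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.data = {}
        self._recover()
        self.log = open(self.log_path, "a")

    def _recover(self):
        # Replay every logged operation in order to rebuild in-memory state.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path) as f:
            for line in f:
                record = json.loads(line)
                if record["op"] == "put":
                    self.data[record["key"]] = record["value"]
                elif record["op"] == "delete":
                    self.data.pop(record["key"], None)

    def _append(self, record):
        # Write the record and force it to disk before changing memory.
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())

    def put(self, key, value):
        self._append({"op": "put", "key": key, "value": value})
        self.data[key] = value

    def delete(self, key):
        self._append({"op": "delete", "key": key})
        self.data.pop(key, None)

    def get(self, key):
        return self.data.get(key)

    def compact(self):
        # Rewrite the log as one put per live key, then atomically swap it in.
        tmp_path = self.log_path + ".tmp"
        with open(tmp_path, "w") as f:
            for key, value in self.data.items():
                record = {"op": "put", "key": key, "value": value}
                f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.log.close()
        os.replace(tmp_path, self.log_path)
        self.log = open(self.log_path, "a")

    def close(self):
        self.log.close()
```

Creating a second `LoggedStore` on the same path after closing the first replays the log and returns the same values. The `os.fsync` call on every write is a likely source of the latency outliers that the assignment asks students to observe.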
In the final project, students will propose, build, and measure a distributed system of their own design, which must make use of multiple techniques discussed in class to achieve a system that is robust and performant. Examples might include a distributed filesystem, a parallel programming model, or a peer-to-peer data routing system. The final submission will include a project report describing the design of the system.
Graduate students taking CSE 60771 will have the following additional work. A selection of paper readings will be assigned that address the course topics in greater detail, balanced between "classic" results in distributed systems and specific case studies in distributed systems design. In addition, the final project report must be written as an academic paper suitable for submission to an academic conference, including a problem statement, survey of related work, system design, and a complete performance evaluation.
Outline of Topics
- Overview of Distributed Systems (1 week)
Purposes: Parallelism, Communication, Redundancy, Autonomy
Architectures: Client-Server, Manager-Worker, Peer-to-Peer, Brokers
Settings: Cluster, Cloud, Edge, Mobile, IoT, Interplanetary, …
- Fundamental Properties of Distributed Systems (1 week)
Processes: Private Computer, Relationships, Failure Models
Networks: Messages, Layering, Failure Model
Clocks: Local Devices, Communication, Logical Clocks
- Remote Procedure Call (1 week)
Classic RPC Model
Tradeoffs between time and failure.
Client Concurrency: Asynchronous Calls
Server Concurrency: Processes or Threads
- State Management (1 week)
Cost of Persistence: Files, Blocks, and Synchronization
Transaction Logs and State Machines
Checkpointing and Recovery
- Caching and Consistency (2 weeks)
Classic N-Client Caching: Write-Through, Write-Back, Callbacks, and Leases
CAP Theorem: Consistency, Availability, Partitioning Tradeoffs
- Replication (1 week)
- Scalable Storage Systems (2 weeks)
Data Versus Metadata
Performance Evaluation Metrics
Scalable Filesystems (Ceph)
Scalable Data Structures (Kafka, RabbitMQ)
- Scalable Computation Systems (2 weeks)
Distributed Execution Semantics
Performance Evaluation Metrics
Batch Systems (HTCondor, Sparrow)
- Peer-to-Peer Algorithms and Systems (1 week)
Distributed Hash Tables (Chord)
- Distributed Agreement (1 week)
Paxos and Raft
Case Study: ZooKeeper
- Security (1 week)
Symmetric Key Systems (Kerberos)
Public-Key Systems (WWW Infrastructure)
Blockchain (Bitcoin)