CSE 40771/60771 - Distributed Systems

A distributed system is any computer system consisting of multiple machines that work together on a common problem. Distributed systems appear in many areas of computing, including cloud computing, mobile computing, edge computing, the internet of things, aerospace systems, and more. Distributed systems are both interesting and difficult to build because their components may be autonomous and highly failure-prone. Students will learn the fundamental principles of distributed systems, study examples of current distributed systems, and build their own distributed systems from scratch. Topics include concurrency, fault tolerance, replication, consistency, and agreement. Students will undertake a final project that involves building and evaluating a custom distributed system. Grading will be based on assignments, exams, and a final project.

This will be a fun and challenging class for students who like to build working software systems. Distributed systems connects the very practical aspects of software engineering (e.g., how to handle a network disconnection) with the fundamental principles of computing (e.g., whether a partitioned system can reach agreement). The skills that you learn here will apply directly to advanced systems used in industry.

Prerequisites

CSE 20232 (Data Structures) and CSE 20289 (Systems Programming)

Textbook

Maarten van Steen and Andrew Tanenbaum, Distributed Systems, 3rd edition, 2017.

Programming Assignments

Five programming assignments are required, due approximately two weeks apart for the first ten weeks of the semester. The assignments together build towards an implementation of a scalable key-value store that could run in a cloud service or as a peer-to-peer system.

Final Project

In the final project, students will propose, build, and measure a distributed system of their own design, which must make use of multiple techniques discussed in class to achieve a system that is robust and performant. Examples might include a distributed filesystem, a parallel programming model, or a peer-to-peer data routing system. The final submission will include a project report describing the design of the system.

Graduate Students

Graduate students taking CSE 60771 will have the following additional work. A selection of paper readings will be assigned that address the course topics in greater detail, balanced between "classic" results in distributed systems and specific case studies in distributed systems design. In addition, the final project report must be written as an academic paper suitable for submission to an academic conference, including a problem statement, survey of related work, system design, and a complete performance evaluation.

Outline of Topics

Overview of Distributed Systems (1 week)
  Purposes: Parallelism, Communication, Redundancy, Autonomy
  Architectures: Client-Server, Manager-Worker, Peer-to-Peer, Brokers
  Settings: Cluster, Cloud, Edge, Mobile, IoT, Interplanetary, …
Fundamental Properties of Distributed Systems (1 week)
  Processes: Private Computer, Relationships, Failure Models
  Networks: Messages, Layering, Failure Model
  Clocks: Local Devices, Communication, Logical Clocks
Remote Procedure Call (1 week)
  Classic RPC Model
  Tradeoffs between Time and Failure
  Client Concurrency: Asynchronous Calls
  Server Concurrency: Processes or Threads
State Management (1 week)
  Cost of Persistence: Files, Blocks, and Synchronization
  Transaction Logs and State Machines
  Checkpointing and Recovery
  Performance Tradeoffs
Caching and Consistency (2 weeks)
  Classic N-Client Caching: Write-Through, Write-Back, Callbacks, and Leases
  CAP Theorem: Consistency, Availability, Partitioning Tradeoffs
  Consistency Models
Replication (1 week)
  Primary-Backup
  Chain Replication
  Distributed Transactions
Scalable Storage Systems (2 weeks)
  Data Versus Metadata
  Performance Evaluation Metrics
  Scalable Filesystems (Ceph)
  Scalable Data Structures (Kafka, RabbitMQ)
Scalable Computation Systems (2 weeks)
  Distributed Execution Semantics
  Performance Evaluation Metrics
  Batch Systems (HTCondor, Sparrow)
Peer-to-Peer Algorithms and Systems (1 week)
  Mutual Exclusion
  Leader Election
  Distributed Hash Tables (Chord)
Distributed Agreement (1 week)
  Byzantine Generals
  Paxos and Raft
  Case Study: ZooKeeper
Security (1 week)
  Symmetric Key Systems (Kerberos)
  Public-Key Systems (WWW Infrastructure)
  Blockchain (Bitcoin)