Isabel Frolick

Overview of Distributed Systems & Networking

Distributed Systems

A distributed system is a collection of independent machines that appear to users as a single coherent system. These machines communicate and coordinate over a network to share resources, improve efficiency, and provide fault tolerance.

It is important to distinguish between applications that run on distributed systems (e.g., cloud-based software) and the infrastructure that makes distributed systems possible (networking, protocols, storage). Applications usually interact with the system as if it were a single machine, due to abstractions that hide the complexity.

Distributed vs Decentralized

Distributed system: Tasks and resources are spread across multiple servers to increase efficiency and reliability.
Decentralized system: Systems are spread across multiple locations by design (e.g., Air Traffic Control systems at each airport, communicating).

Distributed and decentralized are not mutually exclusive. For example, Air Traffic Control systems are both distributed (multiple servers provide redundancy and backups) and decentralized (servers are physically located in different regions).

Goals of Distributed Systems

Resource Sharing – share hardware, software, and data.
Distribution Transparency – hide system complexity:
- Access transparency: Access without knowing the machine.
- Location transparency: Resource location doesn’t matter.
- Replication transparency: Multiple copies exist but are invisible to the user.
- Failure transparency: The system keeps running despite failures.

Core Concepts

Availability: The system is operating correctly at a given moment.
Reliability: The system keeps working without failure over a period of time.
Safety: When something goes wrong, the outcome is not catastrophic (e.g., the system does not act on incorrect results).
Recoverability: Ability to recover from failure.
Maintainability: Easy to update and repair.
Security: Protecting confidentiality, integrity, and availability (CIA triad).

Fault Tolerance

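Failures are inevitable in distributed systems. Systems must detect, handle, and recover from them. Adding machines improves tolerance only when they provide redundancy; every extra machine is also one more component that can fail.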

A failure occurs when the service doesn’t meet its specification.
An error is the incorrect state that leads to a failure.
A fault is the underlying cause of the error (e.g., hardware defect, software bug).

Types of Faults

Transient: Happens once, then disappears.
Intermittent: Appears, disappears, and reappears unpredictably.
Permanent: Persists until fixed.
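
One common way to handle transient faults is simply to retry the operation, since the fault may be gone on the next attempt. The sketch below is illustrative only: retry(), flaky_operation(), and broken_operation() are hypothetical names, and the faults are simulated rather than real.

```c
#include <stdbool.h>

/* Hypothetical operation type: returns true on success, false on a fault. */
typedef bool (*operation_fn)(void *ctx);

/* Call op up to max_attempts times.
   Returns the attempt number that succeeded, or -1 if every attempt failed. */
int retry(operation_fn op, void *ctx, int max_attempts) {
    for (int attempt = 1; attempt <= max_attempts; attempt++) {
        if (op(ctx)) {
            return attempt;
        }
    }
    return -1;
}

/* Simulated transient fault: fails while *remaining_failures > 0, then succeeds. */
bool flaky_operation(void *ctx) {
    int *remaining_failures = ctx;
    if (*remaining_failures > 0) {
        (*remaining_failures)--;
        return false;
    }
    return true;
}

/* Simulated permanent fault: fails on every call until someone fixes it. */
bool broken_operation(void *ctx) {
    (void)ctx;
    return false;
}
```

Retrying helps with transient and some intermittent faults. A permanent fault exhausts the attempts, and the caller has to fall back on something else, such as a replica.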

Handling Failures

Failures are usually handled with non-volatile storage and replication, where multiple copies of the data are stored. Replication adds complexity: when a user changes one copy, the other copies must be updated to stay consistent. There are a few ways to ensure consistency across replicas:

Consistency Models:

Strong consistency: All copies are updated before the change becomes visible, so every read sees the latest value (e.g., banks must always agree on account balances).
Weak consistency: Updates may take time to reach every copy, but the copies eventually synchronize (e.g., social media feeds).
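
The difference between the two models can be sketched with an in-memory toy. This is a simplified illustration, not a real replication protocol: struct replicated_value and the functions below are hypothetical, there is no network, and copies[0] is assumed to act as the primary.

```c
#include <stdbool.h>

#define NUM_REPLICAS 3

/* Hypothetical replicated integer, e.g. an account balance. */
struct replicated_value {
    int copies[NUM_REPLICAS];  /* copies[0] is treated as the primary */
    bool stale[NUM_REPLICAS];  /* true if a copy has not seen the latest write */
};

/* Strong-consistency style write: every copy is updated before returning. */
void write_strong(struct replicated_value *v, int value) {
    for (int i = 0; i < NUM_REPLICAS; i++) {
        v->copies[i] = value;
        v->stale[i] = false;
    }
}

/* Weak-consistency style write: only the primary is updated now;
   the other copies are marked stale and catch up later. */
void write_weak(struct replicated_value *v, int value) {
    v->copies[0] = value;
    v->stale[0] = false;
    for (int i = 1; i < NUM_REPLICAS; i++) {
        v->stale[i] = true;
    }
}

/* Background synchronization: bring stale copies up to date. */
void synchronize(struct replicated_value *v) {
    for (int i = 1; i < NUM_REPLICAS; i++) {
        if (v->stale[i]) {
            v->copies[i] = v->copies[0];
            v->stale[i] = false;
        }
    }
}

/* Read from one particular copy; it may return an old value if that copy is stale. */
int read_replica(const struct replicated_value *v, int i) {
    return v->copies[i];
}
```

After write_weak(), a read from a non-primary copy can return the old value until synchronize() runs. That stale-read window is what strong consistency rules out, at the price of a slower write.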

Scalability

A system can scale along three dimensions:
By size (more users, larger data volume).
By geography (nodes spread across multiple regions).
By administration (managed by different teams or organizations).

Networking

Networking is the foundation of distributed systems. It allows processes on different machines to communicate and share data using protocols like TCP and UDP.

IP Addresses & Ports

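By default, a given port on Unix/Linux (for a given protocol and local address) can be bound by only one socket at a time; a second bind fails with "address already in use" unless options such as SO_REUSEPORT are set.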
Source ports can be assigned explicitly or chosen by the system (ephemeral/temporary).
Destination ports must always be specified.
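
The port rules above can be observed directly on the loopback interface. The following is a minimal sketch assuming a POSIX system; bind_udp_loopback() and local_port() are hypothetical helper names, not part of the sockets API.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bind a UDP socket to 127.0.0.1 on the given port (host byte order).
   Port 0 asks the OS to choose an ephemeral port.
   Returns the socket descriptor, or -1 on failure with errno set. */
int bind_udp_loopback(unsigned short port) {
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd == -1) {
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        int saved = errno;  /* close() may overwrite errno */
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Return the local port the socket is bound to, or -1 on failure. */
int local_port(int fd) {
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    if (getsockname(fd, (struct sockaddr *)&addr, &len) == -1) {
        return -1;
    }
    return ntohs(addr.sin_port);
}
```

Calling bind_udp_loopback(0) and then local_port() shows which ephemeral port the system picked. Binding a second socket to that same port should fail with errno set to EADDRINUSE.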

Sockets

A socket is an endpoint for sending and receiving data.

Berkeley sockets (Unix) provide the standard API used by most operating systems.

In practice, creating a socket is done via the system call:

#include <sys/socket.h>  // declares socket()

int socket(int domain, int type, int protocol);
// domain: IPv4 (AF_INET) or IPv6 (AF_INET6)
// type: TCP (SOCK_STREAM) or UDP (SOCK_DGRAM)
// protocol: usually 0 (default for the chosen domain and type)

On success, the call returns a file descriptor (a small non-negative integer the OS uses to identify the socket). On failure, it returns -1 and sets errno.
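
A minimal sketch of checking that return value, assuming a POSIX system. open_socket() is a hypothetical wrapper introduced here for illustration.

```c
#include <stdio.h>
#include <sys/socket.h>

/* Create a socket and report failures.
   Returns the file descriptor, or -1 after printing the reason to stderr. */
int open_socket(int domain, int type) {
    int fd = socket(domain, type, 0);  /* 0 = default protocol for this type */
    if (fd == -1) {
        perror("socket");  /* prints a message based on errno */
    }
    return fd;
}
```

For example, open_socket(AF_INET, SOCK_STREAM) requests an IPv4 TCP socket, while an unsupported domain makes the call return -1.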

TCP vs UDP

Two transport protocols are commonly used to transmit messages between machines. A reliable transport protocol aims to deliver error-free, in-sequence messages with no duplicates and no lost packets. TCP provides these guarantees; UDP trades them away for speed and simplicity.

TCP (Transmission Control Protocol)
- Reliable, connection-oriented.
- Guarantees ordered, lossless delivery.
- More overhead, less efficient for many-to-many communication.
UDP (User Datagram Protocol)
- Unreliable, connectionless.
- Fast and lightweight.
- Useful for real-time apps (gaming, video calls).
- Supports multicasting (sending one message to multiple receivers).

Typical Workflow

Server-Side Flow (TCP)

socket() → create socket.
bind() → assign IP address + port to socket.
listen() → mark as listening for connections.
accept() → accept an incoming connection.
read() / write() → exchange data.
close() → free resources.

Client-Side Flow (TCP)

socket() → create socket.
connect() → connect to server IP + port.
read() / write() → exchange data.
close() → free resources.
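
The server and client flows above can be sketched together in one process over the loopback interface. This is an illustration under simplifying assumptions, not a production server: start_listener(), connect_to(), and tcp_roundtrip() are hypothetical names, and only a single short message is exchanged.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Server side: socket -> bind -> listen on 127.0.0.1.
   Port 0 lets the OS pick a free port, which is stored in *port_out.
   Returns the listening descriptor, or -1 on failure. */
int start_listener(unsigned short *port_out) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) return -1;

    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        listen(fd, 1) == -1 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) == -1) {
        close(fd);
        return -1;
    }
    *port_out = ntohs(addr.sin_port);
    return fd;
}

/* Client side: socket -> connect to 127.0.0.1:port.
   Returns the connected descriptor, or -1 on failure. */
int connect_to(unsigned short port) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1) return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);

    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Run both flows: the client writes msg, the server reads it into buf.
   Returns the number of bytes the server read, or -1 on failure.
   A single read() is enough only because the message is short;
   real code loops until all expected bytes have arrived. */
ssize_t tcp_roundtrip(const char *msg, char *buf, size_t size) {
    unsigned short port;
    int listener = start_listener(&port);
    if (listener == -1) return -1;

    /* The kernel completes the handshake and queues the connection,
       so connect() can return before accept() is called. */
    int client = connect_to(port);
    int conn = (client == -1) ? -1 : accept(listener, NULL, NULL);

    ssize_t n = -1;
    if (conn != -1 && write(client, msg, strlen(msg)) != -1) {
        n = read(conn, buf, size);
    }

    if (conn != -1) close(conn);
    if (client != -1) close(client);
    close(listener);
    return n;
}
```

In a real deployment the server and client run as separate processes, usually on separate machines. Running both ends in one process only keeps the example short.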

UDP Flow (simpler, no connection setup)

socket() → create socket.
bind() → optional, assign local port.
sendto() → send a message to destination.
recvfrom() → receive a message.
close() → free resources.
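
The UDP steps above can be sketched the same way on the loopback interface. udp_roundtrip() is a hypothetical helper, and the two-second receive timeout is an arbitrary choice: UDP gives no delivery guarantee, so the receiver should not wait forever.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Send msg from one UDP socket to another over 127.0.0.1.
   Returns the number of bytes received into buf, or -1 on failure or timeout. */
ssize_t udp_roundtrip(const char *msg, char *buf, size_t size) {
    struct sockaddr_in addr;
    socklen_t len = sizeof addr;
    struct timeval timeout = { .tv_sec = 2, .tv_usec = 0 };
    ssize_t n = -1;

    int receiver = socket(AF_INET, SOCK_DGRAM, 0);
    int sender = socket(AF_INET, SOCK_DGRAM, 0);
    if (receiver == -1 || sender == -1) goto done;

    /* The receiver binds so the sender has a known destination port.
       Port 0 lets the OS choose; getsockname() reports the choice. */
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);
    if (bind(receiver, (struct sockaddr *)&addr, sizeof addr) == -1 ||
        getsockname(receiver, (struct sockaddr *)&addr, &len) == -1 ||
        setsockopt(receiver, SOL_SOCKET, SO_RCVTIMEO,
                   &timeout, sizeof timeout) == -1) {
        goto done;
    }

    /* The sender never calls bind(): the OS assigns an ephemeral
       source port on the first sendto(). No connection is set up. */
    if (sendto(sender, msg, strlen(msg), 0,
               (struct sockaddr *)&addr, sizeof addr) == -1) {
        goto done;
    }

    n = recvfrom(receiver, buf, size, 0, NULL, NULL);

done:
    if (sender != -1) close(sender);
    if (receiver != -1) close(receiver);
    return n;
}
```

Compared with TCP, there is no listen(), accept(), or connect(): each datagram carries its own destination address.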

Server (TCP): socket → bind → listen → accept → read/write → close

Client (TCP): socket → connect → read/write → close

UDP: socket → (optional bind) → sendto/recvfrom → close

Multicasting & Broadcasting

Multicasting: Send one message to every machine that has joined the same multicast group address (IPv4 range 224.0.0.0–239.255.255.255). IP multicast is used with UDP; TCP connections are strictly one-to-one and cannot use it.
Broadcasting: Send a message to all machines on a local network segment. Less efficient since everyone receives it.

Persistent vs Transient Communication

Persistent: Message stored until delivered (e.g., email).
Transient: Message exists only while sender and receiver are active (e.g., VoIP, FaceTime).
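
The distinction can be sketched with a toy in-memory model. Everything here is hypothetical (struct channel, send_message(), receiver_connect()), and no real network or mail system is involved. The only point is what happens to a message sent while the receiver is offline.

```c
#include <stdbool.h>
#include <string.h>

#define MAILBOX_CAPACITY 8
#define MESSAGE_SIZE 64

/* Hypothetical communication channel between one sender and one receiver. */
struct channel {
    bool persistent;       /* true: store messages for an offline receiver */
    bool receiver_online;
    char stored[MAILBOX_CAPACITY][MESSAGE_SIZE];
    int stored_count;      /* messages waiting for the receiver */
    int delivered_count;   /* messages the receiver has actually received */
};

/* Returns true if the message was delivered or stored, false if it was lost. */
bool send_message(struct channel *ch, const char *msg) {
    if (ch->receiver_online) {
        ch->delivered_count++;
        return true;
    }
    if (ch->persistent && ch->stored_count < MAILBOX_CAPACITY) {
        char *slot = ch->stored[ch->stored_count];
        strncpy(slot, msg, MESSAGE_SIZE - 1);
        slot[MESSAGE_SIZE - 1] = '\0';
        ch->stored_count++;
        return true;
    }
    return false;  /* transient channel with no active receiver: message is gone */
}

/* The receiver comes online and picks up anything that was stored.
   Returns the number of stored messages delivered. */
int receiver_connect(struct channel *ch) {
    int flushed = ch->stored_count;
    ch->receiver_online = true;
    ch->delivered_count += flushed;
    ch->stored_count = 0;
    return flushed;
}
```

With persistent set, a message sent to an offline receiver waits in storage, as email does. Without it, the same message is lost, which matches a call that cannot connect.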