A long post with summaries from reading technical papers in distributed systems, focusing on distributed data processing.
Status: Forever In Progress
MapReduce is a programming model and implementation for processing and generating large data sets. Programs are written by users in a functional style using map and reduce primitives that are automatically parallelized and executed on a large cluster of machines. Using this model, users are able to express computations easily while hiding the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library.
Map takes a set of input key/value pairs and produces a set of intermediate key/value pairs that are then passed to Reduce, which merges the values for each key to produce a smaller set of values (typically zero or one output value per Reduce invocation). Some examples of problems that can be solved using this approach are word count, distributed grep, URL frequency count, reverse web-link graph, term vector per host, inverted index and distributed sort.
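The word-count example can be sketched in a minimal, single-process form; the `run_mapreduce` driver below is a stand-in for the library, simulating with a dict the shuffle/grouping step that the real system performs across machines:

```python
from collections import defaultdict

def map_fn(doc_name, text):
    # Emit an intermediate (word, 1) pair for every word in the document.
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # Merge all intermediate values for a key into a single output value.
    yield word, sum(counts)

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: apply the user's map function to every input pair.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for ikey, ivalue in map_fn(key, value):
            intermediate[ikey].append(ivalue)
    # Shuffle is simulated by the grouping in the dict; reduce phase next.
    output = {}
    for ikey, ivalues in intermediate.items():
        for okey, ovalue in reduce_fn(ikey, ivalues):
            output[okey] = ovalue
    return output

result = run_mapreduce([("doc1", "the quick the")], map_fn, reduce_fn)
# result == {"the": 2, "quick": 1}
```

The user only writes `map_fn` and `reduce_fn`; everything the driver does here (and far more: partitioning, scheduling, fault tolerance) is hidden inside the library.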
The MapReduce library in the user program splits the input files into M pieces of 16-64 MB and starts up many copies of the program on a cluster of machines. One of these copies is designated as the master, which assigns tasks to the other workers. The master takes locality into account when making this decision: it attempts to schedule a map task on a machine that contains a replica of the corresponding input data, or failing that, on a machine on the same network switch as one containing the data. A map worker reads its input split, processes it with the user's map function, buffers the output on its local disk and reports back to the master. The master then assigns reduce tasks to reduce workers; each reduce worker reads the intermediate data from the map workers' local disks, sorts it, applies the user's reduce function to produce one output file per reduce task, and reports back to the master. When all map and reduce tasks have completed, the master wakes up the user program and passes control back to it. Thanks to locality awareness, when running large MapReduce operations on a large set of workers in a cluster, most input data is read locally and consumes no network bandwidth. A common cause of lengthened total running time for a MapReduce operation is a straggler; to alleviate this, when an operation is close to completion the master schedules backup executions of the remaining in-progress tasks.
MapReduce is resilient to large-scale worker failures. The master pings workers regularly and, on worker failure, reassigns that worker's tasks to other workers, where they must be restarted from scratch since the intermediate data was stored on the failed worker's local disk. The master writes regular checkpoints of its progress, and on master failure a new master can be started from the last checkpointed state. Most map and reduce functions are deterministic, so re-execution after a failure produces the same output as a non-faulting sequential execution of the entire program would have. Finer task granularity (many more M and R tasks than workers) improves dynamic load balancing and also speeds up recovery when a worker fails.
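A detail worth capturing is that map and reduce tasks are treated differently on worker failure: completed map output lives on the dead worker's local disk and must be redone, while completed reduce output is already in the global file system. A hypothetical sketch of that rule (the task-dict shape here is my own, not the paper's):

```python
def handle_worker_failure(failed_worker, tasks):
    """Reset tasks lost with a failed worker so the master can reschedule them.

    tasks: list of dicts with 'type' ('map'/'reduce'), 'state'
    ('idle'/'in-progress'/'completed') and 'worker' keys (assumed shape).
    """
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["type"] == "map" and task["state"] in ("in-progress", "completed"):
            # Even completed map tasks are redone: their intermediate output
            # sat on the failed worker's local disk and is now unreachable.
            task["state"] = "idle"
            task["worker"] = None
        elif task["type"] == "reduce" and task["state"] == "in-progress":
            # Completed reduce tasks are safe: their output is in the
            # global file system, so only in-progress ones are reset.
            task["state"] = "idle"
            task["worker"] = None
    return tasks
```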
The MapReduce library is flexible in providing refinements: a user-supplied partitioning function (instead of the default hash(key) mod R); an ordering guarantee on the processed output; a combiner function for partial merging of data on the map worker before it is sent over the network to a reduce worker; support for reading input data in several formats; auxiliary files as additional inputs to map and reduce tasks; an execution mode that skips bad records instead of erroring out; local execution on a single machine to facilitate debugging, profiling and small-scale testing; status reports on progress, errors etc. via an internal HTTP server; and counters for counting occurrences of various events.
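The combiner is worth a concrete illustration. For word count the combiner can be the reduce logic itself, since addition is associative and commutative; a single-process sketch of the map-side merge:

```python
from collections import defaultdict

def combine(map_output):
    # Partially merge one map task's output on the map worker, so far fewer
    # pairs cross the network to the reduce workers.
    partial = defaultdict(int)
    for word, count in map_output:
        partial[word] += count
    return list(partial.items())

# Three ("the", 1) pairs collapse into a single ("the", 3) pair before
# being sent over the network.
pairs = [("the", 1), ("quick", 1), ("the", 1), ("the", 1)]
combined = combine(pairs)
```

This matters for word count in particular because word frequencies follow a Zipf-like distribution: each map task emits huge numbers of pairs for common words like "the".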
Programs that search over or sort a large amount of data (over 1 TB) are representative of most real programs written by users of MapReduce. Performance tests were done on a cluster of 1800 machines, each with a 2 GHz CPU, 4 GB RAM, two 160 GB hard disks and a 1 Gbps Ethernet link. The machines were arranged in a two-level tree-shaped switched network with 100-200 Gbps of aggregate bandwidth at the root, and were all in the same hosting facility, keeping the round-trip time between any pair of machines under 1 ms.
One of the most significant uses of MapReduce has been creating the indexing system that produces the data structures for the Google web search service. The indexing system takes as input a large set (over 20 TB) of crawled documents stored in GFS, and the indexing process runs as a sequence of 5-10 MapReduce operations. The benefits of using MapReduce here have been: the indexing code is simpler, smaller and easier to understand; conceptually unrelated computations stay separate, making it easy to change the indexing process; and the indexing process is easier to operate.
The major learnings from the MapReduce system have been: 1) restricting the programming model makes it easy to parallelize and distribute computations and make them fault-tolerant, 2) network bandwidth is a scarce resource and locality optimizations help conserve it, 3) redundant execution can be used to reduce the impact of slow machines, machine failures and data loss.
Storm is a realtime distributed streaming data processing framework. The high-level architecture consists of streaming data flowing through topologies: directed graphs whose vertices are spouts (data sources) and bolts (data processors) and whose edges are data flows. At Twitter, Storm is used for a variety of use cases: product features (computing counts, clustering for machine learning algorithms), user services, search, content discovery, revenue, etc.
Storm runs on a distributed cluster of hundreds of machines using Mesos (cluster manager), Zookeeper (centralized configuration manager) and Nimbus (the master node, created by Mesos using Zookeeper). A client submits topologies (logical query plans) to Nimbus, which distributes them amongst worker nodes. Each worker node runs one supervisor process (to report progress to the master) and many worker processes; each worker process runs executors, which are made up of tasks that do the actual work; a task is an instance of a spout or bolt.
Topologies can be created by a user (by specifying the spouts and bolts) or generated by Summingbird. Apache Thrift is an interface description language and binary communication protocol used as an RPC framework. Nimbus is a Thrift service and a topology is a Thrift object, which Summingbird can generate (using Scala to express computations and constraints).
Storm provides at-least-once or at-most-once guarantees. For at-least-once guarantees, it uses a bitwise XOR over system-generated tuple ids to detect when the processing of a tuple has completed.
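The XOR trick can be sketched in a few lines. An acker keeps, per spout tuple, a running XOR of the ids of every tuple emitted into and acked out of that tuple's tree; since each id enters the XOR exactly twice (once on emit, once on ack), the value returns to zero exactly when the whole tree has been processed. The class below is a toy sketch of the idea, not Storm's actual acker code:

```python
import random

class Acker:
    def __init__(self):
        self.pending = {}  # spout tuple id -> running XOR of tuple ids

    def emit(self, root_id, tuple_id):
        # A new tuple enters the tree rooted at root_id.
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def ack(self, root_id, ack_val):
        # A bolt acks by XOR-ing its input tuple's id with the ids of any
        # tuples it emitted, and sending that single value to the acker.
        self.pending[root_id] ^= ack_val
        if self.pending[root_id] == 0:
            del self.pending[root_id]
            return True  # tuple tree fully processed
        return False

acker = Acker()
t1, t2 = random.getrandbits(64), random.getrandbits(64)
acker.emit("root", t1)      # spout tuple t1 enters the topology
acker.ack("root", t1 ^ t2)  # a bolt acks t1 and emits child t2 in one value
acker.ack("root", t2)       # final bolt acks t2; XOR returns to zero
```

The appeal is that the acker needs only a constant amount of state (one 64-bit value) per spout tuple, no matter how large the tuple tree grows; a random 64-bit id colliding to a false zero is vanishingly unlikely.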
Spark is a distributed data processing framework suited to applications that reuse a working set of data across multiple parallel operations (unlike in MapReduce, where data has to be loaded repeatedly from disk). This is ideal for machine learning algorithms (iterative algorithms that apply a function repeatedly to the same data to optimize a parameter) and exploratory ad-hoc analytics (loading the same dataset into memory across a number of machines and querying it repeatedly). Spark outperforms Hadoop by 10x on iterative machine learning workloads and can be used interactively to scan a 39 GB dataset with sub-second latency, while providing scalability and fault tolerance similar to MapReduce.
The core of Spark is the implementation of the RDD: a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Fault tolerance is achieved through the notion of lineage (implemented as objects). RDDs are cached across machines and used in MapReduce-like parallel operations. These operations are passed to the worker nodes as closures that have access to the variables in the scope where they were created, as well as to shared variables (broadcast variables, accumulators) passed to them.
RDDs are a distributed fault-tolerant memory abstraction for in-memory computations on iterative (machine learning and graph algorithms) and interactive (data mining) workloads using Spark. They are used for sharing data in cluster applications.
RDDs provide an interface based on coarse-grained transformations that apply the same operation to many data items, and they log the transformations used to build a dataset, thus providing fault tolerance via lineage. An RDD is a read-only, partitioned collection of records created through deterministic operations (e.g. map, filter, join) on data in stable storage or on other RDDs. RDDs do not need to be materialized at all times; each carries enough information about how it was derived from other datasets (its lineage) to be recomputed on demand.
Spark exposes RDDs through a language-integrated API where each dataset is represented as an object and transformations are invoked using methods on these objects. RDDs are defined via transformations, used via actions, and transformations can be pipelined.
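A toy sketch of this model (my own illustration, not Spark's implementation): each RDD object records only its parent and the transformation that derives it, not its data. Transformations are lazy; an action (here, `collect`) walks the lineage chain and computes the result, which is also exactly how a lost partition would be rebuilt:

```python
class RDD:
    def __init__(self, source=None, parent=None, transform=None):
        self.source = source        # base data, for RDDs read from stable storage
        self.parent = parent        # lineage: the RDD this one was derived from
        self.transform = transform  # deterministic function applied to the parent

    def map(self, f):
        # Transformation: nothing is computed here, only lineage is recorded.
        return RDD(parent=self, transform=lambda data: [f(x) for x in data])

    def filter(self, pred):
        return RDD(parent=self, transform=lambda data: [x for x in data if pred(x)])

    def collect(self):
        # Action: recompute by walking the lineage chain back to the source.
        if self.parent is None:
            return list(self.source)
        return self.transform(self.parent.collect())

lines = RDD(source=["ERROR disk", "INFO ok", "ERROR net"])
errors = lines.filter(lambda l: l.startswith("ERROR")).map(lambda l: l.split()[1])
errors.collect()  # -> ["disk", "net"]
```

Defining `errors` costs nothing until `collect` is called, and the same lineage walk that serves the action serves recovery: if a cached partition of `errors` is lost, it is rebuilt by re-applying the logged filter and map to the source data.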