9 MapReduce

Massive data parallelism

Map

Takes in a (key, value) pair, produces a list of intermediate (key, value) pair
(word, frequency) for each document

Reduce

Takes a single intermediate key and a list of all associated values (of the same key)
produce a final (key, final_value)
All maps need to be complete before the reduce step
How to guarantee that all occurrences of a single intermediate key go to a single worker?
- Hashing

Pasted image 20251012203722.png

MapReduce Coordinator

orchestrates which worker runs which task and when
- maintain a data structure of all tasks
- if tasks are completed
- which worker is assigned to which task

MapReduce Worker

regularly polls the coordinator for a task
performs the task

Failures in MapReduce

What happens if the failed worker had previously completed map tasks?

rerun failed worker's task on new worker
What happens if the failed worker had previously completed reduce tasks?
we do not need to re-run completed reduce tasks, as they are written to external storage
What happens if the coordinator fails?
request the client to restart computation

Stragglers

A worker is very slow. MapReduce is sensitive to stragglers

Spawn tasks on multiple works simultaneously