9 MapReduce

Massive data parallelism

Map

  • Takes in a (key, value) pair, produces a list of intermediate (key, value) pair
  • (word, frequency) for each document

Reduce

  • Takes a single intermediate key and a list of all associated values (of the same key)
  • produce a final (key, final_value)
  • All maps need to be complete before the reduce step
  • How to guarantee that all occurrences of a single intermediate key go to a single worker?
    • Hashing

Pasted image 20251012203722.png

MapReduce Coordinator

  • orchestrates which worker runs which task and when
    • maintain a data structure of all tasks
    • if tasks are completed
    • which worker is assigned to which task

MapReduce Worker

  • regularly polls the coordinator for a task
  • performs the task

Failures in MapReduce

What happens if the failed worker had previously completed map tasks?

  • rerun failed worker's task on new worker
    What happens if the failed worker had previously completed reduce tasks?
  • we do not need to re-run completed reduce tasks, as they are written to external storage
    What happens if the coordinator fails?
  • request the client to restart computation

Stragglers

A worker is very slow. MapReduce is sensitive to stragglers

  • Spawn tasks on multiple works simultaneously