https://s3-us-west-2.amazonaws.com/secure.notion-static.com/b3a9c072-5e4d-458a-b0b5-13bf081d378b/Untitled.png

  1. system groups data in batches(groups)
  2. groups are large (all items per day for ex)
  3. high throughput at the cost of latency Example: MapReduce
  1. continuous stream of data
  2. low latency at the cost of decreased throughput Example: Spark, Flink
  1. very small batches
  2. balance between latency and throughput

Spark vs MapReduce

MapReduce allows developing and running a huge number of parallel computations on a big number of machines in the same time. But every job has to read the input from disk and write the output to disk. So lower bound of execution is determined by disk speeds.