Jiaxi Liu (Jesse)

Master’s Graduate

Software Engineer | Scalable APIs · Web Scraping · Data Integration · Code Quality & Refactoring

Big Data Review: Hadoop, Spark, and Hive

Big data systems solve a basic problem: data is too large for one machine to store or process efficiently, so storage and computation must be distributed.

Hadoop

Hadoop is a foundational big data framework. Core components:

  • HDFS: distributed file system
  • MapReduce: distributed compute model
  • YARN: cluster resource management and job scheduling

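HDFS splits large files into fixed-size blocks (128 MB by default in Hadoop 2 and later), stores the blocks across machines, and replicates each block (three copies by default) so that data survives the loss of a node.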
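The block-and-replica idea can be sketched in a few lines of plain Python. This is only an illustration, not how HDFS is implemented: the node names, the `split_into_blocks` and `place_replicas` helpers, and the round-robin placement are all assumptions made for the example (real HDFS placement is rack-aware). The 128 MB block size and replication factor of 3 mirror common HDFS defaults.

```python
MB = 1024 * 1024


def split_into_blocks(file_size, block_size):
    """Return (offset, length) pairs that cover a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


def place_replicas(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes (toy round-robin policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def survives_failure(placement, failed_node):
    """True if every block still has at least one replica on a healthy node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical 300 MB file on a hypothetical four-node cluster.
nodes = ["node-a", "node-b", "node-c", "node-d"]
blocks = split_into_blocks(300 * MB, 128 * MB)
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, replicas in placement.items():
    print(block_id, blocks[block_id], replicas)
print("readable after losing node-b:", survives_failure(placement, "node-b"))
```

Because every block lives on more than one node, losing any single machine in this sketch leaves all blocks readable, which is the fault-tolerance property the replication is there for.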

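MapReduce has two main stages:

  1. Map: tasks run in parallel, each processing one block of input, ideally on the node that stores it, and emit intermediate key-value pairs
  2. Reduce: the intermediate pairs are grouped by key (the shuffle) and each group is aggregated into a final result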

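MapReduce is reliable and scalable, but each job has high latency because intermediate results pass through disk, so it suits batch workloads rather than interactive ones.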
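The Map and Reduce stages described above can be sketched as a single-process word count. This is a teaching sketch rather than Hadoop's actual API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input are made up for the example, and the list of strings stands in for data blocks that would normally sit on different machines.

```python
from collections import defaultdict


def map_phase(block):
    """Map: turn one block of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(mapped_outputs):
    """Group the values emitted by all map tasks by key."""
    grouped = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: aggregate the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


# Each string stands in for a block held by a different node (hypothetical data).
blocks = ["spark hive hadoop", "hadoop spark", "hadoop"]

mapped = [map_phase(block) for block in blocks]  # would run in parallel on a cluster
counts = reduce_phase(shuffle(mapped))
print(counts)
```

On a real cluster the map calls run on separate machines and the grouped data travels over the network, which is a large part of why a job's latency is high.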

Spark

Spark is a large-scale data processing engine. It is usually faster than Hadoop MapReduce because it keeps as much working data in memory as it can instead of going back to disk between steps.

Common components:

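  • Spark Core: task scheduling and the base RDD API
  • Spark SQL: structured data, DataFrames, and SQL queries
  • Spark Streaming: stream processing
  • MLlib: machine learning
  • GraphX: graph processing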

Spark fits iterative computation, machine learning, interactive analysis, and streaming.

Spark vs Hadoop MapReduce

MapReduce is a traditional batch model. It writes intermediate data to disk between the map and reduce stages, and a multi-step workload has to chain separate jobs through files.

Spark keeps intermediate data in memory when possible, making it better for multi-step workloads.
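The difference between the two execution styles can be illustrated with a toy pipeline in plain Python. This sketch uses neither Spark nor Hadoop: `run_with_disk` and `run_in_memory` are hypothetical helpers, and JSON files in a temporary directory stand in for the intermediate files a chain of MapReduce jobs would write. Real Spark also spills to disk when memory runs short, so treat this as a simplification.

```python
import json
import os
import tempfile


def run_with_disk(steps, data, workdir):
    """Write each intermediate result to a file and read it back before the next step."""
    writes = 0
    for i, step in enumerate(steps):
        data = step(data)
        path = os.path.join(workdir, f"stage_{i}.json")
        with open(path, "w") as f:
            json.dump(data, f)
        writes += 1
        with open(path) as f:
            data = json.load(f)
    return data, writes


def run_in_memory(steps, data):
    """Pass each intermediate result directly to the next step."""
    for step in steps:
        data = step(data)
    return data


# A three-step workload: double, keep multiples of 3, then sum.
steps = [
    lambda xs: [x * 2 for x in xs],
    lambda xs: [x for x in xs if x % 3 == 0],
    lambda xs: [sum(xs)],
]

numbers = list(range(10))
with tempfile.TemporaryDirectory() as workdir:
    disk_result, disk_writes = run_with_disk(steps, numbers, workdir)
memory_result = run_in_memory(steps, numbers)

print(disk_result, memory_result, disk_writes)
```

Both styles produce the same answer. The disk-based version pays one write and one read per step, and that cost grows with the number of steps, which is why iterative workloads benefit most from the in-memory approach.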

Hive

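Hive is a data warehouse tool that lets users query data on HDFS with a SQL-like language, HiveQL. It compiles each query into distributed jobs (MapReduce, Tez, or Spark) that run on the cluster.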

It fits:

  • Offline analysis
  • Reporting
  • Data warehousing
  • SQL access to big data

Hive is not a traditional OLTP database. Its queries run as batch jobs, so it is not suited to high-frequency, low-latency transactions.
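To give a feel for the kind of reporting query Hive is typically used for, here is a small aggregation written in SQL and run from Python. Note the assumptions: SQLite (from the Python standard library) stands in for Hive so the example runs without a cluster, and the `page_views` table, its columns, and its rows are invented. A `SELECT` of this shape should also be accepted as HiveQL, where it would run as a distributed job over files on HDFS rather than against a local database.

```python
import sqlite3

# SQLite is only a stand-in engine here; Hive would read the table from HDFS.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE page_views (view_date TEXT, country TEXT, views INTEGER)"
)
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?, ?)",
    [
        ("2024-01-01", "US", 120),
        ("2024-01-01", "DE", 45),
        ("2024-01-02", "US", 80),
        ("2024-01-02", "DE", 60),
    ],
)

# A typical offline reporting query: total views per country.
report_query = """
    SELECT country, SUM(views) AS total_views
    FROM page_views
    GROUP BY country
    ORDER BY total_views DESC
"""
report = conn.execute(report_query).fetchall()
conn.close()

for country, total_views in report:
    print(country, total_views)
```

The query scans and aggregates the whole table, which is the access pattern Hive is built for. Looking up or updating single rows at high frequency is the OLTP pattern it is not built for.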

Summary

  • Hadoop: distributed storage and batch processing foundation
  • Spark: faster large-scale compute engine
  • Hive: SQL layer for analytical data warehouses