Big data systems solve a basic problem: data is too large for one machine to store or process efficiently, so storage and computation must be distributed.
Hadoop
Hadoop is a foundational big data framework. Core components:
- HDFS: distributed file system
- MapReduce: distributed compute model
- YARN: cluster resource management and job scheduling
HDFS splits large files into blocks, stores them across machines, and uses replication for fault tolerance.
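The splitting and replication described above can be sketched in a few lines of plain Python. This is a toy model, not the HDFS implementation: the block size, the node names, the round-robin placement, and the helper names (`split_into_blocks`, `place_replicas`, `readable_after_failure`) are all assumptions made for illustration. Real HDFS uses far larger blocks and its own placement policy.

```python
def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Cut a byte string into fixed-size blocks; the last block may be shorter."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list[str], replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes (simple round-robin, an assumption)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def readable_after_failure(placement: dict[int, list[str]], failed_node: str) -> bool:
    """True if every block still has at least one replica on a surviving node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# Hypothetical 1000-byte "file" and a four-node cluster.
data = b"x" * 1000
blocks = split_into_blocks(data, block_size=256)
nodes = ["node-a", "node-b", "node-c", "node-d"]
placement = place_replicas(len(blocks), nodes, replication=3)

for block_id, replicas in placement.items():
    print(block_id, len(blocks[block_id]), replicas)
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves the whole file readable, which is the fault-tolerance property replication is meant to provide.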
MapReduce has two stages:
- Map: each node processes its own data blocks and emits intermediate key-value pairs
- Reduce: intermediate results are grouped by key and aggregated into the final output
MapReduce is reliable and scalable, but it is a batch model, so job latency is high.
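The two stages above can be illustrated with a word count, the usual introductory MapReduce example. The sketch below runs in a single Python process and only mimics the data flow; it does not use the Hadoop API. The function names and the sample blocks are assumptions for illustration, and the `shuffle` helper stands in for the grouping step that a framework performs between map and reduce.

```python
from collections import defaultdict


def map_phase(block: str) -> list[tuple[str, int]]:
    """Map: turn one block of text into (word, 1) pairs."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs: list[tuple[str, int]]) -> dict[str, list[int]]:
    """Group intermediate values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: aggregate all values that share a key."""
    return key, sum(values)


def word_count(blocks: list[str]) -> dict[str, int]:
    intermediate = []
    for block in blocks:  # on a real cluster, each block is mapped on a different node
        intermediate.extend(map_phase(block))
    grouped = shuffle(intermediate)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three "blocks" of one file.
blocks = ["big data big", "data systems", "big systems scale"]
counts = word_count(blocks)
print(counts)
```

The map function only ever sees one block, which is what allows the map stage to run in parallel on the nodes that already hold the data.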
Spark
Spark is a large-scale data processing framework. It is usually faster than Hadoop MapReduce because it keeps working data in memory where possible instead of writing it to disk between steps.
Common components:
- Spark Core
- Spark SQL
- Spark Streaming
- MLlib
- GraphX
Spark fits iterative computation, machine learning, interactive analysis, and streaming.
Spark vs Hadoop MapReduce
MapReduce is traditional batch processing and frequently writes intermediate data to disk.
Spark keeps intermediate data in memory when possible, making it better for multi-step workloads.
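The difference can be made concrete with a toy model that counts storage operations for a three-step pipeline. This is not Spark or Hadoop code: `FakeDisk`, `run_disk_between_steps`, and `run_in_memory` are hypothetical names, and the "disk" is just a dictionary with counters. It only shows why writing every intermediate result to storage becomes expensive as the number of steps grows.

```python
class FakeDisk:
    """Stand-in for distributed storage that counts reads and writes."""

    def __init__(self):
        self.files = {}
        self.reads = 0
        self.writes = 0

    def write(self, name, data):
        self.files[name] = list(data)
        self.writes += 1

    def read(self, name):
        self.reads += 1
        return list(self.files[name])


def double(values):
    return [v * 2 for v in values]


def run_disk_between_steps(disk, source, steps):
    """MapReduce-like: every step reads its input from disk and writes its output back."""
    current = source
    for i, step in enumerate(steps):
        data = disk.read(current)
        current = f"stage_{i}"
        disk.write(current, step(data))
    return disk.read(current)


def run_in_memory(disk, source, steps):
    """Spark-like: read the input once, then pass intermediate data along in memory."""
    data = disk.read(source)
    for step in steps:
        data = step(data)
    return data


steps = [double, double, double]

disk_a = FakeDisk()
disk_a.write("input", [1, 2, 3])
result_a = run_disk_between_steps(disk_a, "input", steps)

disk_b = FakeDisk()
disk_b.write("input", [1, 2, 3])
result_b = run_in_memory(disk_b, "input", steps)

print(result_a, disk_a.reads, disk_a.writes)
print(result_b, disk_b.reads, disk_b.writes)
```

Both runs produce the same result, but the disk-based run touches storage once per step in each direction, while the in-memory run reads the input a single time. Iterative workloads repeat such steps many times, which is where the gap matters most.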
Hive
Hive is a data warehouse tool that lets users query data stored on HDFS with a SQL-like language (HiveQL).
It fits:
- Offline analysis
- Reporting
- Data warehousing
- SQL access to big data
Hive is not a traditional OLTP database and is not suited for high-frequency low-latency transactions.
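The kind of query Hive is meant for is an aggregation over many rows, as in a report. The sketch below uses Python's built-in `sqlite3` module purely as a stand-in so that the example can run anywhere; it is not Hive and does not touch HDFS. The `sales` table, its columns, and the sample rows are hypothetical. The `SELECT ... GROUP BY` statement is the part that carries over in spirit to a SQL-like warehouse query.

```python
import sqlite3

# In-memory SQLite database used only as a local stand-in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Analytical, report-style query: scan the rows and aggregate per group.
query = """
    SELECT region, SUM(amount) AS total
    FROM sales
    GROUP BY region
    ORDER BY region
"""
report = conn.execute(query).fetchall()
conn.close()

for region, total in report:
    print(region, total)
```

A query like this reads a large share of the table and returns a small summary, which suits offline analysis. It is the opposite access pattern from OLTP, where many small reads and writes each need a fast response.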
Summary
- Hadoop: distributed storage and batch processing foundation
- Spark: faster large-scale compute engine
- Hive: SQL layer for analytical data warehouses