Examples
HDFS (Hadoop Distributed File System) - an Apache project
GFS - Google File System
Requirements
- Fault tolerance
- Scalability
- Optimized for batch processing (throughput is more important than latency, and most files are expected to be appended to rather than overwritten)
GFS High-level architecture
It has a single master node and multiple chunkservers.
The master node is responsible for holding metadata, informing clients about the location of chunks on chunkservers, and doing administrative work (garbage collection, etc.).
Chunkservers are responsible for storing fixed-size chunks of data. Each chunk is identified by a globally unique 64-bit handle assigned by the master.
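A minimal sketch of the metadata the master might keep in memory, under simplifying assumptions: the Master class and its allocate_chunk/lookup methods are hypothetical names, and a plain counter stands in for real 64-bit handle assignment. The real master also keeps an operation log, chunk leases, and version numbers.

```python
import itertools
import random

class Master:
    def __init__(self, chunkservers, replication=3):
        self.chunkservers = chunkservers     # known chunkserver addresses
        self.replication = replication       # replicas per chunk
        self.files = {}                      # file path -> list of chunk handles
        self.locations = {}                  # chunk handle -> chunkserver list
        self._handles = itertools.count(1)   # stands in for unique 64-bit handles

    def allocate_chunk(self, path):
        """Assign a new unique handle and pick chunkservers to hold replicas."""
        handle = next(self._handles)
        replicas = random.sample(self.chunkservers, self.replication)
        self.files.setdefault(path, []).append(handle)
        self.locations[handle] = replicas
        return handle, replicas

    def lookup(self, path, chunk_index):
        """Tell a client which chunkservers hold the given chunk of a file."""
        handle = self.files[path][chunk_index]
        return handle, self.locations[handle]

# Usage: a client asks where chunk 0 of a file lives before reading/writing.
master = Master(["cs1:7000", "cs2:7000", "cs3:7000", "cs4:7000"])
master.allocate_chunk("/logs/events.dat")
handle, replicas = master.lookup("/logs/events.dat", 0)
```

Note the design point this illustrates: the master serves only metadata (handles and locations), so clients exchange actual chunk data with chunkservers directly and the single master does not become a data-path bottleneck.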
Storing Data
- The client contacts the master to learn which chunkservers (replicas) it should push the data to.
- Afterward, the client pushes the data to all the replicas using a form of chain replication.
- The chunkservers are arranged in a chain based on network topology, and the data is pushed linearly along it: the client pushes the data to the first chunkserver in the chain, which forwards it to the second, and so on. This utilizes each machine's full outbound network bandwidth (see the chain-push sketch after this list).
- One of the chunkservers is designated the primary replica and is responsible for committing mutations to the chunk.
- The primary assigns a serial number to each mutation, so every replica applies mutations in the same order (see the second sketch below).
- The primary waits for acknowledgments of the write from the other replicas before reporting success to the client.
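A sketch of the pipelined chain push described above, under simplifying assumptions: the Replica class and push_data helper are hypothetical names, and real GFS forwards bytes over TCP as soon as they start arriving rather than after receiving the whole payload.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.staged = {}   # data_id -> bytes buffered before commit

    def receive(self, data_id, data, chain):
        """Stage the data locally, then forward it to the next replica."""
        self.staged[data_id] = data
        if chain:
            chain[0].receive(data_id, data, chain[1:])

def push_data(data, data_id, replicas):
    """Client entry point: push to the nearest replica, which forwards onward."""
    replicas[0].receive(data_id, data, replicas[1:])

# Usage: replicas are ordered by network distance from the client, nearest first.
chain = [Replica("cs-rack1"), Replica("cs-rack2"), Replica("cs-rack3")]
push_data(b"record bytes", data_id=42, replicas=chain)
assert all(r.staged[42] == b"record bytes" for r in chain)
```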
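And a sketch of the commit step, assuming the data has already been staged on every replica by the chain push. Secondary, Primary, and their methods are hypothetical names; in real GFS the master grants the primary a chunk lease, and failed acknowledgments cause the client to retry.

```python
import itertools

class Secondary:
    def __init__(self):
        self.staged = {}    # data_id -> bytes staged by the chain push
        self.chunk = []     # (serial, bytes) applied in serial order

    def apply(self, data_id, serial):
        data = self.staged.pop(data_id, None)
        if data is None:
            return False                     # data never arrived: report failure
        self.chunk.append((serial, data))    # apply at the primary's order
        return True                          # acknowledge to the primary

class Primary(Secondary):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self._serial = itertools.count(1)    # one total order for all mutations

    def write(self, data_id):
        serial = next(self._serial)          # assign a serial number
        ok = self.apply(data_id, serial)     # apply locally first
        acks = [s.apply(data_id, serial) for s in self.secondaries]
        # Report success to the client only after every replica acknowledges.
        return ok and all(acks)

# Usage: stage data everywhere (as the chain push would), then commit.
secondaries = [Secondary(), Secondary()]
primary = Primary(secondaries)
for replica in [primary, *secondaries]:
    replica.staged[42] = b"record bytes"
assert primary.write(42) is True
```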