29 Oct 2017

Google File System

The Google File System

The paper can be found here.

Introduction

GFS is a scalable distributed file system for large, distributed, data-intensive applications. Fault tolerance and high streaming performance are built in, while the system runs on inexpensive commodity hardware.

Important Assumptions

Following are some important observations that shaped the design the way it is (you might find GFS helpful if you have similar concerns in your setup):

  • Component failures are the norm rather than the exception.
  • Files are huge by traditional standards; multi-GB files are common. (This is assumed because people generally combine many KB-sized files into larger files to avoid opening and processing each of billions of small files, which can easily become the bottleneck of a task.)
  • Most files are mutated by appending new data rather than overwriting existing data. Random writes within a file are practically non-existent.
  • Once written, files are only read, and often only sequentially.
  • Co-designing the applications and the file system API benefits the overall system by increasing flexibility.
  • An atomic append operation is helpful so that multiple clients can append concurrently to a file without extra synchronization among themselves.
  • High sustained bandwidth is more important than low latency.

Architecture

  • A GFS cluster contains: a) a single master, and b) multiple chunkservers.
  • Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. (For reliability, each chunk is replicated on multiple chunkservers.)
  • The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers.
  • The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.
  • Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.

(Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Chunkservers need no explicit cache because chunks are stored as local Linux files, so the Linux buffer cache already keeps frequently accessed data in memory.)

  • Single master: this vastly simplifies the design and lets the master make sophisticated decisions with global knowledge. Care is needed so that the master does not become a bottleneck; hence all data is exchanged directly between clients and chunkservers, and clients cache chunk-to-chunkserver mappings for a finite, well-defined time.
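
To make these pieces concrete, here is a minimal data-model sketch in Python. The type and field names are invented for illustration (the paper does not specify these structures in code); it only captures the chunk handle, the master's replica-location record, and the chunk-level read request a client sends to a chunkserver.

    from dataclasses import dataclass
    from typing import List

    CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses a fixed 64 MB chunk size

    @dataclass(frozen=True)
    class ChunkHandle:
        """Immutable, globally unique 64-bit id assigned by the master at chunk creation."""
        value: int  # must fit in 64 bits

    @dataclass
    class ChunkLocations:
        """Master-side record of which chunkservers currently hold replicas of a chunk."""
        handle: ChunkHandle
        chunkservers: List[str]  # hypothetical addresses, e.g. ["cs1:7777", "cs2:7777"]

    @dataclass
    class ChunkReadRequest:
        """What a client sends to a chunkserver: a chunk handle plus a byte range within the chunk."""
        handle: ChunkHandle
        offset: int  # byte offset within the chunk
        length: int  # number of bytes to read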

Interactions on a file read

  1. Using the fixed chunk size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
  2. Then, it sends the master a request containing the file name and chunk index.
  3. The master replies with the corresponding chunk handle and the locations of the replicas.
  4. The client caches this information using the file name and chunk index as the key.
  5. The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk.

Note: Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.

Note 2: In fact, the client typically asks for multiple chunks in the same request, and the master can also include the information for chunks immediately following those requested.
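
A minimal client-side sketch of steps 1-5 above, in Python. The master and chunkserver objects, their lookup/read_chunk calls, and the cache TTL are hypothetical stand-ins, since the paper does not give a concrete API.

    import time

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size known to the client library

    def pick_closest(replicas):
        # Placeholder: a real client would pick the replica closest in the network topology.
        return replicas[0]

    class GFSClient:
        def __init__(self, master, cache_ttl_seconds=60.0):
            self.master = master            # hypothetical master RPC stub
            self.cache = {}                 # (file_name, chunk_index) -> (handle, replicas, expiry)
            self.cache_ttl = cache_ttl_seconds

        def read(self, file_name, offset, length):
            # 1. Translate (file name, byte offset) into a chunk index using the fixed chunk size.
            chunk_index = offset // CHUNK_SIZE

            # 2.-4. Ask the master for the chunk handle and replica locations, unless cached.
            key = (file_name, chunk_index)
            cached = self.cache.get(key)
            if cached is None or cached[2] < time.time():
                handle, replicas = self.master.lookup(file_name, chunk_index)  # hypothetical RPC
                cached = (handle, replicas, time.time() + self.cache_ttl)
                self.cache[key] = cached
            handle, replicas, _ = cached

            # 5. Read the byte range directly from one replica, ideally the closest one.
            #    (For simplicity this sketch assumes the read does not cross a chunk boundary.)
            chunkserver = pick_closest(replicas)
            offset_in_chunk = offset - chunk_index * CHUNK_SIZE
            return chunkserver.read_chunk(handle, offset_in_chunk, length)     # hypothetical RPC

Note how the master never touches the file data itself; it only hands out chunk handles and replica locations, which keeps it off the data path.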

Metadata at the Master

  • The master stores three major types of metadata: a) the file and chunk namespaces, b) the mapping from files to chunks, and c) the locations of each chunk’s replicas.
  • All metadata is kept in the master’s memory for efficiency
  • The first two types (namespaces and file-to-chunk mapping) are also kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines
  • The master rebuilds chunk location information (asking each chunkserver about its chunks) at startup and whenever a chunkserver joins the cluster.

Note: the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
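
A rough sketch of how these three kinds of metadata might be held in the master's memory (Python; the class and field names are invented for illustration). The point is only which parts are persisted via the operation log and which are rebuilt from chunkserver reports.

    from dataclasses import dataclass, field
    from typing import Dict, List

    @dataclass
    class FileInfo:
        # b) the mapping from a file to its ordered list of chunk handles
        #    (persisted via the operation log)
        chunk_handles: List[int] = field(default_factory=list)
        access_control: str = ""   # per-file access control information

    @dataclass
    class MasterMetadata:
        # a) the file and chunk namespaces, kept compactly in memory
        #    (persisted via the operation log)
        namespace: Dict[str, FileInfo] = field(default_factory=dict)        # full path -> file info
        # c) chunk handle -> current replica locations
        #    (NOT persisted; rebuilt by asking chunkservers at startup and
        #     whenever a chunkserver joins the cluster)
        chunk_locations: Dict[int, List[str]] = field(default_factory=dict)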

Operation log

  • The operation log must be stored reliably, and changes must not be made visible to clients until the metadata changes are persistent.
  • Strategy: replicate the log on multiple remote machines and respond to a client operation only after flushing the corresponding log record to disk both locally and remotely. (Several log records are batched together before flushing to reduce the impact of flushing and replication on overall system throughput.)
  • The master checkpoints its state to keep the log from growing too large; checkpoints are stored persistently on disk.
  • The checkpoint is in a compact B-tree-like form that can be directly mapped into memory and used for namespace lookup without extra parsing.
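
A minimal write-ahead-log sketch of the strategy above, in Python. The on-disk record format, the remote replica objects, and their append_and_flush call are assumptions for illustration, not the paper's actual implementation.

    import json
    import os

    class OperationLog:
        def __init__(self, local_path, remote_replicas):
            self.local = open(local_path, "a", encoding="utf-8")
            self.remote_replicas = remote_replicas    # hypothetical stubs for remote log copies

        def commit(self, records):
            # Batch several log records into one flush to amortize the cost of
            # flushing and remote replication.
            payload = "".join(json.dumps(r) + "\n" for r in records)

            # 1. Flush the batched records to local disk.
            self.local.write(payload)
            self.local.flush()
            os.fsync(self.local.fileno())

            # 2. Replicate to the remote machines and wait for their acknowledgements.
            for replica in self.remote_replicas:
                replica.append_and_flush(payload)     # hypothetical RPC

            # Only after both steps may the master apply the mutation and reply to the
            # client, so the change never becomes visible before it is persistent.

Batching commits this way trades a little latency for higher metadata throughput, which matches the stated preference for sustained bandwidth over low latency.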

Conclusion

I have intentionally skipped explaining the following parts:
- Consistency guarantees by GFS
- Snapshotting
- Namespace and locking by the master
- Replica placement and rebalancing when a few nodes go down or come up
- Garbage collection
- Performance

Readers interested in those topics can refer directly to the relevant sections of the paper. The above gives a fairly good basic understanding of GFS and should be sufficient for anyone just trying to get an overview of this sophisticated file system.

