
In-Memory Fault-tolerant Collection

How can a service store and retrieve transient state data in its native form in a fault-tolerant fashion without the use of disk persistence?

Problem

Under extreme scalability conditions, disk persistence used for reliability purposes can become a processing bottleneck and may be difficult to keep synchronized as concurrency is introduced. In addition, typical implementations of the State Repository used by Stateful Services, Partial State Deferral, and their variations can become a processing bottleneck and a single point of failure, and may require special programming models and best practices for state management.

Solution

A service composition or a service instance can put and get its transient state data reliably in and out of familiar data structures, such as a Java HashMap or a .NET Dictionary, at near in-memory speeds while still achieving predictable scalability and high availability of services. This is achieved by an underlying architecture that provides redundancy and failover of in-memory application data by synchronizing in-memory data structures with a duplicate copy that exists on another machine on the network.
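
As a simple illustration, the sketch below assumes a hypothetical FaultTolerantMaps factory supplied by such an underlying grid. The service code touches nothing beyond the standard java.util.Map interface, so the fault-tolerant collection can be dropped in where a local HashMap would otherwise be used.

    import java.util.Map;

    // Hypothetical factory assumed to be provided by the underlying in-memory grid;
    // the returned Map is backed by a primary and a backup storage node.
    interface FaultTolerantMaps {
        <K, V> Map<K, V> getMap(String name);
    }

    class SessionStateService {

        private final Map<String, String> sessionState;

        SessionStateService(FaultTolerantMaps grid) {
            // Obtain (or create) a named fault-tolerant collection from the grid.
            this.sessionState = grid.getMap("session-state");
        }

        void remember(String sessionId, String state) {
            // Looks like an ordinary HashMap put(), but the entry is also
            // synchronized to a duplicate copy on another machine.
            sessionState.put(sessionId, state);
        }

        String recall(String sessionId) {
            return sessionState.get(sessionId);
        }
    }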

Application

A shared or dedicated FT HashMap or FT Dictionary utility is made available as part of the inventory or service architecture. This is a core pattern that is used by In-Memory Fault Tolerant Stateful Services and Load-Balanced Stateful Services.

Impacts

Performance bottlenecks due to disk I/O are dramatically reduced or eliminated. Synchronization of primary and backup data between network nodes may be affected by network latency.


Architecture

Inventory, Service

Status

Under Review

Contributors

David Chappell

Problem

Local storage of application state using in-memory data structures is volatile in that the state data does not survive failure of the application. In high-volume throughput scenarios, conventional means of storing data, such as file persistence or relational database storage, can become a bottleneck for processing, increasing the latency of service requests beyond the acceptable limits of Service Level Agreements (SLAs). In addition, the conversion between the native object representation of data and its relational storage in a row/column structure may introduce additional latency and impact scalability.


Solution

At the core of the FT Collection is an architecture that relies on two or more interconnected server processes cooperating across a network, with the sole purpose of providing in-memory storage of instance data for individual application objects. An application or service places its transient state data into the collection through familiar data structure APIs such as a Java HashMap or a .NET Dictionary. When the application does a put() to the HashMap, that operation is delegated to the grid, which automatically elects a primary storage node and a backup storage node on another machine.
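
The delegation described above can be pictured with the simplified sketch below. It is not the API of any particular grid product; the two java.util.Map fields stand in for the primary and backup storage nodes, which in practice are separate server processes on different machines.

    import java.util.HashMap;
    import java.util.Map;

    // Simplified illustration of the delegation: every put() goes to the elected
    // primary "node" and is synchronously copied to the backup "node". In a real
    // grid these would be remote server processes reached over the network.
    class ReplicatedMap<K, V> {

        private final Map<K, V> primaryNode = new HashMap<>();
        private final Map<K, V> backupNode  = new HashMap<>();

        synchronized void put(K key, V value) {
            primaryNode.put(key, value);   // write to the primary storage node
            backupNode.put(key, value);    // synchronize the duplicate copy
        }

        synchronized V get(K key) {
            return primaryNode.get(key);   // reads are served by the primary
        }
    }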


Application

As networked nodes acting in the role of primary data storage for a collection begin to fail, the FT Collection reverts to the backup storage node and may re-elect a new backup storage node as well. This is a core pattern that is used by In-Memory Fault Tolerant Stateful Services and Load-Balanced Stateful Services.
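
A rough sketch of this failover behavior follows. The StorageNode interface and its methods are assumptions standing in for whatever remote-access mechanism the grid actually uses, and election is reduced to picking the next live node in an ordered list.

    import java.util.List;

    // Hypothetical remote storage node; in a real grid this would be a proxy to a
    // server process on another machine.
    interface StorageNode {
        boolean isAlive();
        Object get(Object key);
    }

    class FailoverReader {

        private final List<StorageNode> nodes; // index 0 = primary, 1 = backup, ...

        FailoverReader(List<StorageNode> nodes) {
            this.nodes = nodes;
        }

        Object get(Object key) {
            // Try the primary first; as nodes fail, fall back to the backup, which
            // effectively becomes the new primary for this collection.
            for (StorageNode node : nodes) {
                if (node.isAlive()) {
                    return node.get(key);
                }
            }
            throw new IllegalStateException("no live storage node for this collection");
        }
    }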


Impacts

Performance bottlenecks due to disk I/O are dramatically reduced or eliminated. Object to relational mappings are not required. Synchronization of primary and backup data between network nodes may be affected by network latency.

A Fault-Tolerant Collection may be based on a static pair of primary/backup servers and is therefore vulnerable to failure scenarios that affect both the primary and the backup server. Alternatively, it may provide continual failover to new network storage nodes as each backup becomes primary and new backups are created on the fly. Similarly, a simplistic Fault-Tolerant Collection may require manual or static partitioning of storage data across multiple servers, while a more sophisticated one may provide full location transparency and automatic partitioning and division of labor across all storage nodes on the grid.
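
As a sketch of the more sophisticated case, automatic partitioning can amount to a deterministic mapping from each key to a primary storage node, with the backup held on a neighboring node; the modulo scheme below is an assumption used purely for illustration.

    import java.util.List;

    // Illustrative key-to-node assignment for automatic partitioning: the key's
    // hash selects the primary node, and the next node in the ring holds the backup.
    class PartitionRouter {

        private final List<String> storageNodes; // e.g. addresses of grid members

        PartitionRouter(List<String> storageNodes) {
            this.storageNodes = storageNodes;
        }

        String primaryFor(Object key) {
            return storageNodes.get(Math.floorMod(key.hashCode(), storageNodes.size()));
        }

        String backupFor(Object key) {
            int primary = Math.floorMod(key.hashCode(), storageNodes.size());
            return storageNodes.get((primary + 1) % storageNodes.size());
        }
    }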

Relationships

[Relationship diagram]