Fast cluster failover using virtual memory-mapped communication

Yuanyuan Zhou, Peter M. Chen, Kai Li

Research output: Contribution to conference › Paper › peer-review

13 Scopus citations

Abstract

This paper proposes a novel way to use virtual memory-mapped communication (VMMC) to reduce failover time on clusters. With the VMMC model, an application's virtual address space can be efficiently mirrored in remote memory, either automatically or via explicit messages. When a machine fails, its applications can restart on the failover node from the most recent checkpoints with minimal memory copying and disk I/O overhead. This method requires little change to the applications' source code. We developed two fast failover protocols: the deliberate update failover protocol (DU) and the automatic update failover protocol (AU). The first can run on any system that supports VMMC, whereas the second requires special network interface support. We implemented these two protocols on two different clusters that support VMMC communication. Our results with three transaction-based applications show that both protocols work quite well. The deliberate update protocol imposes 4-21% overhead when taking checkpoints every 2 seconds. If an application can tolerate 20% overhead, this protocol can fail over to another machine within 4 milliseconds in the best case and within 0.1 to 3 seconds in the worst case. Failover performance can be further improved by using special network interface hardware. The automatic update protocol is able to take checkpoints every 0.1 seconds with only 3-12% overhead. If 10% overhead is allowed, it can fail over applications within 0.01 to 0.4 seconds in the worst case.
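
As a rough illustration of the checkpoint-and-restart idea the abstract describes, the toy Python model below mirrors a set of memory pages to a backup copy only at explicit checkpoints, then restarts from the mirrored copy after a simulated failure. It is a sketch under stated assumptions, not the authors' DU or AU protocol: the class `MirroredMemory`, its methods, the page size, and the dirty-page tracking are all hypothetical names and choices introduced here, and no real VMMC interface is used.

```python
PAGE_SIZE = 4096  # assumed page size, chosen only for illustration


class MirroredMemory:
    """Toy model: a primary address space with a mirror on a failover node."""

    def __init__(self, num_pages):
        # Both copies start zero-filled; bytes objects are immutable,
        # so sharing a reference between the two lists is safe.
        self.primary = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.mirror = [bytes(PAGE_SIZE) for _ in range(num_pages)]
        self.dirty = set()  # pages written since the last checkpoint

    def write(self, page, data):
        # Pad or truncate the data to exactly one page, then mark it dirty.
        self.primary[page] = data.ljust(PAGE_SIZE, b"\0")[:PAGE_SIZE]
        self.dirty.add(page)

    def checkpoint(self):
        # Explicitly push every dirty page to the mirror (hypothetical
        # stand-in for sending explicit update messages).
        sent = len(self.dirty)
        for page in self.dirty:
            self.mirror[page] = self.primary[page]
        self.dirty.clear()
        return sent

    def failover(self):
        # The failover node restarts from the last checkpointed state;
        # writes made after that checkpoint are lost.
        return list(self.mirror)


mem = MirroredMemory(num_pages=4)
mem.write(0, b"committed")
mem.checkpoint()
mem.write(1, b"uncommitted")   # never checkpointed
restored = mem.failover()
print(restored[0][:9])          # data from the last checkpoint
print(restored[1] == bytes(PAGE_SIZE))
```

In this model, only pages written since the previous checkpoint are copied, which is one simple way to keep checkpoint cost proportional to the amount of modified state rather than to the size of the address space.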

Original language: English (US)
Pages: 373-382
Number of pages: 10
State: Published - 1999
Event: Proceedings of the 1999 13th ACM International Conference on Supercomputing, ICS'99 - Rhodes, Greece
Duration: Jun 20 1999 → Jun 25 1999

All Science Journal Classification (ASJC) codes

  • General Computer Science
