LIFEGUARD: Practical repair of persistent route failures

Ethan Katz-Bassett, Colin Scott, David R. Choffnes, Ítalo Cunha, Vytautas Valancius, Nick Feamster, Harsha V. Madhyastha, Thomas Anderson, Arvind Krishnamurthy

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

65 Scopus citations

Abstract

The Internet was designed to always find a route if there is a policy-compliant path. However, in many cases, connectivity is disrupted despite the existence of an underlying valid path. The research community has focused on short-term outages that occur during route convergence. There has been less progress on addressing avoidable long-lasting outages. Our measurements show that long-lasting events contribute significantly to overall unavailability. To address these problems, we develop LIFEGUARD, a system for automatic failure localization and remediation. LIFEGUARD uses active measurements and a historical path atlas to locate faults, even in the presence of asymmetric paths and failures. Given the ability to locate faults, we argue that the Internet protocols should allow edge ISPs to steer traffic to them around failures, without requiring the involvement of the network causing the failure. Although the Internet does not explicitly support this functionality today, we show how to approximate it using carefully crafted BGP messages. LIFEGUARD employs a set of techniques to reroute around failures with low impact on working routes. Deploying LIFEGUARD on the Internet, we find that it can effectively route traffic around an AS without causing widespread disruption.

Original language: English (US)
Title of host publication: SIGCOMM'12 - Proceedings of the ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication
Pages: 395-406
Number of pages: 12
State: Published - 2012
Event: ACM SIGCOMM 2012 Conference Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM 2012 - Helsinki, Finland
Duration: Aug 13 2012 to Aug 17 2012

All Science Journal Classification (ASJC) codes

  • Computer Networks and Communications
  • Hardware and Architecture
  • Electrical and Electronic Engineering

Keywords

  • availability
  • bgp
  • internet
  • measurement
  • outages
  • repair
  • routing
