TY - GEN
T1 - LIFEGUARD
T2 - Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication, ACM SIGCOMM 2012
AU - Katz-Bassett, Ethan
AU - Scott, Colin
AU - Choffnes, David R.
AU - Cunha, Italo
AU - Valancius, Vytautas
AU - Feamster, Nick
AU - Madhyastha, Harsha V.
AU - Anderson, Thomas
AU - Krishnamurthy, Arvind
PY - 2012
Y1 - 2012
N2 - The Internet was designed to always find a route if there is a policy-compliant path. However, in many cases, connectivity is disrupted despite the existence of an underlying valid path. The research community has focused on short-term outages that occur during route convergence. There has been less progress on addressing avoidable long-lasting outages. Our measurements show that long-lasting events contribute significantly to overall unavailability. To address these problems, we develop LIFEGUARD, a system for automatic failure localization and remediation. LIFEGUARD uses active measurements and a historical path atlas to locate faults, even in the presence of asymmetric paths and failures. Given the ability to locate faults, we argue that the Internet protocols should allow edge ISPs to steer traffic to them around failures, without requiring the involvement of the network causing the failure. Although the Internet does not explicitly support this functionality today, we show how to approximate it using carefully crafted BGP messages. LIFEGUARD employs a set of techniques to reroute around failures with low impact on working routes. Deploying LIFEGUARD on the Internet, we find that it can effectively route traffic around an AS without causing widespread disruption.
AB - The Internet was designed to always find a route if there is a policy-compliant path. However, in many cases, connectivity is disrupted despite the existence of an underlying valid path. The research community has focused on short-term outages that occur during route convergence. There has been less progress on addressing avoidable long-lasting outages. Our measurements show that long-lasting events contribute significantly to overall unavailability. To address these problems, we develop LIFEGUARD, a system for automatic failure localization and remediation. LIFEGUARD uses active measurements and a historical path atlas to locate faults, even in the presence of asymmetric paths and failures. Given the ability to locate faults, we argue that the Internet protocols should allow edge ISPs to steer traffic to them around failures, without requiring the involvement of the network causing the failure. Although the Internet does not explicitly support this functionality today, we show how to approximate it using carefully crafted BGP messages. LIFEGUARD employs a set of techniques to reroute around failures with low impact on working routes. Deploying LIFEGUARD on the Internet, we find that it can effectively route traffic around an AS without causing widespread disruption.
KW - Availability
KW - BGP
KW - Measurement
KW - Outages
KW - Repair
UR - http://www.scopus.com/inward/record.url?scp=84894553472&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84894553472&partnerID=8YFLogxK
U2 - 10.1145/2377677.2377756
DO - 10.1145/2377677.2377756
M3 - Conference contribution
AN - SCOPUS:84894553472
SN - 9781450314190
T3 - Computer Communication Review
SP - 395
EP - 406
BT - Proceedings of the ACM SIGCOMM 2012 and Best Papers of the Co-located Workshops
Y2 - 13 August 2012 through 17 August 2012
ER -