Despite many attempts to fix it, the Internet's interdomain routing system remains vulnerable to configuration errors, buggy software, flaky equipment, protocol oscillation, and intentional attacks. Unlike most existing solutions that prevent specific routing problems, our approach is to detect problems automatically and to identify the offending party. Fault detection is effective for a larger class of faults than fault prevention and is easier to deploy incrementally. To show that fault detection is useful and practical, we present NetReview, a fault detection system for the Border Gateway Protocol (BGP). NetReview records BGP routing messages in a tamper-evident log, and it enables ISPs to check each other's logs against a high-level description of the expected behavior, such as a peering agreement or a set of best practices. At the same time, NetReview respects the ISPs' privacy and allows them to protect sensitive information. We have implemented and evaluated a prototype of NetReview; our results show that NetReview catches common Internet routing problems, and that its resource requirements are modest.
|Original language||English (US)|
|Title of host publication||Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009|
|Number of pages||16|
|State||Published - 2009|
|Event||6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009 - Boston, United States|
Duration: Apr 22 2009 → Apr 24 2009
|Name||Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009|
|Conference||6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009|
|Period||4/22/09 → 4/24/09|
All Science Journal Classification (ASJC) codes
- Computer Networks and Communications
- Control and Systems Engineering
Other files and links
FingerprintDive into the research topics of 'Netreview: Detecting when interdomain routing goes wrong'. Together they form a unique fingerprint.
Netreview : Detecting when interdomain routing goes wrong. / Haeberlen, Andreas; Avramopoulos, Ioannis; Rexford, Jennifer et al.Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009. USENIX Association, 2009. p. 437-452 (Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009).
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution
TY - GEN
T1 - Netreview
T2 - 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009
AU - Haeberlen, Andreas
AU - Avramopoulos, Ioannis
AU - Rexford, Jennifer
AU - Druschel, Peter
N1 - Funding Information: We are grateful to the anonymous reviewers and to our shepherd, Bruce Maggs, for their detailed and helpful comments. Paul Francis, Krishna Gummadi, Alex Fabrikant, Bryan Ford, and Sharon Goldberg provided valuable feedback on earlier versions of this paper. This work was done while Ioannis Avramopoulos was at Princeton University and supported by DHS grant W911NF-05-1-0417. Andreas Haeberlen was supported in part by US National Science Foundation grant ANI-0338856. Funding Information: We are grateful to the anonymous reviewers and to our shepherd, Bruce aggs, for their detailed and helpful comments. Paul Francis, /rishna Gummadi, Alex Fab-rikant, Bryan Ford, and Sharon Goldberg provided valuable feedback on earlier versions of this paper. This work was done while Ioannis Avramopoulos was at Princeton )niversity and supported by DHS grant WI11NF-05-1-0;1B. Andreas Haeberlen was supported in part by )S National Science Foundation grant ANI-033885@. Funding Information: • Is fault detection feasible at Internet scale? 5.1 Methodology In NetReview, all communication related to a given AS occur among the direct neighbors of that AS. Hence, a =The interval sizes we use are worst-case values for a mirroring monitor 9mainly due to RAI timers). uch smaller intervals would suffice if the monitor is attached via port replicators or B P . small-scale deployment is sufficient to estimate the overhead. However, getting even a small number of contiguous Internet ASes to deploy experimental software would be extremely difficult. Instead, we used software routers to emulate a small AS topology in the lab, but we ensured that the routing table sizes and the amount of BGP traffic closely approximated those of real Internet ASes. To achieve this, we injected an Internet BGP trace into one of the ASes, including a checkpoint of the initial routing table. From there, the routes were propagated to the other ASes via BGP, creating BGP traffic on each link and populating the other routing tables. This mimicked the conditions that would have occurred if our model topology had been part of the global Internet, so we could get realistic estimates for many performance metrics, e.g., how quickly the logs grow and how much time is required for checking. We found that, since the first trace already contained a route to each available prefix, injecting additional traces would not have increased the routing table sizes. 5.2 Experimental setup Our NetReview prototype implements the basic system we have described so far, plus the additional techniques described later in Section 6, which enable NetReview to operate without a CA, in a partial deployment and with existing routers. These techniques add some overhead to our results, so the overhead of the basic algorithm would be lower than what we report here. For our experiments, we set up a synthetic network of 8= ebra BGP daemons , which form a topology of 40 ASes 9Figure 8). Our network contains a mix of AS types, ranging from large tier-4 ASes to small stub ASes, as well as both customerMprovider and peering relationships. This diversity allowed us to implement and check a variety of different routing policies. Note that AS 8 and AS = have two separate peering points, which will become important later. For each AS, we configured a default routing policy that satisfies the Gao-Rexford conditions . If a route is imported from a customer, it is exported to all neighbors; otherwise (if the route is from a peer or provider), it is exported only to customers. In some of our experiments, we vary this policy by injecting configuration errors or imposing additional constraints. Internally, each AS uses a full-mesh iBGP topology. We did not set up route reflectors because NetReview is oblivious to iBGP. We injected routing updates from a RouteViews BGP trace  into AS 2. We used a 15-minute trace that was collected by a Zebra router at Equinix in Ashburn, VA, on January 27, 2008. The collecting router peers with nine other ASes. The trace contains 15,141 updates from these neighbors, and the corresponding RIB snap-shot contains 243,198 unique prefixes. Thus, AS 2 behaved as if it were connected to the Internet in Ashburn, VA, and it exported a realistic set of prefixes to the other ASes. NetReview’s overhead depends in part on the number of neighbors an AS has. Unless otherwise noted, the numbers we report are for AS 5. Since 92% of Internet ASes have degree five or less , our results are representative of all but the largest Internet ISPs. 5.- ules $e he "ed In our experiments, we used NetReview to enforce five rules, which are shown in Table 1. In plain English, these rules state the following: • o origin mis on2guration8 An AS may only ex- port a route if it owns the corresponding IP prefix, or if the exported route is an extension of another route that the AS is currently importing (motivated by ). • Export ustomer routes8 If an AS imports a direct route from one of its customers, it must export that route to its peers and providers. • • • onor no5ad!ertise ommunity8 An AS must honor the NO ADVERTISE community; it may not re-export a route that is tagged with this community. onsistent path length8 When exporting a route to a customer or a peer, an AS must advertise AS PATHs of the same length at all peering points (motivated by ). a "up lin"8 An AS may only export a route via a backup path if its direct links become unavailable. We chose these five rules because they can be used to detect real problems that have been reported in the Internet [9, 24, 29], and because they demonstrate the different types of conditions NetReview can verify (of course, each rule could be varied and customized in a number of ways). Note that the first two rules are very powerful; together, they can find almost all of the routing problems that were studied in . In particular, the first rule covers AS PATH manipulations, which are the main focus of secure routing systems like S-BGP (it actually goes beyond S-BGP in that it can also check for timely route withdrawal). The last three rules catch routing problems that would be difficult to find without a detection system, since they can only be detected by combining information from several routers and/or ASes. 5./ Fun tionality he " We begin with a simple functionality check to show that the prototype is fully functional and works as expected. Recall that NetReview’s design precludes false positives and false negatives if each AS is audited regularly. We ran a series of six trials. In the first trial, we used the correct configuration for each AS. In the following five trials, we made a configuration change to a NetReview-enabled AS at some point during the experiment that caused one of the five rules to be violated. After each trial, we audited all the logs. As expected, NetReview did not report any problems during the first trial. In each of the other trials, it reported the fault we had injected. The output also included the time interval in which the fault appeared, as well as the variable assignments (prefixes, AS numbers etc.) for which the corresponding rule did not hold. This is valuable for administrators because it shows not only where the fault occurred (in the audited AS) but also for which prefix the exported paths did not have the same length, which peering points were affected, etc. 5.5 ro essing po$er BGP-A speakers and monitors must generate and verify cryptographic signatures. The necessary processing time is a function of the number of messages they send and receive. In our experiment, the monitor in AS 5 sent 1,973 BGP-A messages and received 1,579 during the 15-minute period. Since all messages are acknowledged, this required 3,552 signatures to be generated and an equal number to be validated, on average four signatures and validations per second. On a 3 GHz Pentium 4, a 1024-bit RSA signature can be generated and verified in less than 3.5ms. Unlike BGP messages, BGP-A messages can contain updates for multiple different routes, which explains why the number of messages is much lower than the number of routing changes in our BGP trace. This also limits the number of validations that are required when updates arrive in bursts. For example, if a router is restarted and re- 30 25 20 15 10 5 0 No origin misconfig Export routes Honor community Consistent length Backup link Figure 4: Average processing time required to check a rule over one second of log data (the error bars show the 5th and the 95th percentile). The speed is sufficient for checking multiple ASes in real time. ceives full routing tables from its neighbors, it only needs to check one signature per routing table. This is in contrast to S-BGP , which needs to check a signature for every single route. Auditors need processing power to extract the routing state from the logs, and to check it against the specified rules. In our experiments, we found that the processing time was dominated by rule checking, which in turn depends on the number of routing changes as well as the complexity of the rules. Our five rules can be evaluated independently for each prefix, so the first optimization from Section 4.4 can be used. It would take more time to check rules that depend on a large number of different prefixes, but we are not aware of any useful rules that have this property. Figure 4 shows the average time required to check a one-second log segment against each of our five rules.@ Our 15-minute log required 11,@29 such checks, which took 41.5 seconds on a Pentium-4 workstation. In practice, the checking time would also depend on the number and complexity of the rules the target AS is revealing to the auditor. There is little published information about the policies used by commercial ASes, so we cannot say how large a Ntypical’ set of rules would be. We already included a generic policy rule (rule K2) in our set, which may be sufficient for small ASes. Even if we assume that a typical set contains 20 rules (four times the size of our set), an AS with five neighbors would still only need a single workstation to perform real-time auditing. If an AS has more neighbors, it can spread the load across multiple machines, since rule checking can be trivially parallelized. @The processing time varies considerably because some one-second intervals contain many updates, while others contain none at all. 5.1 torage spa e BGP-A speakers require storage for checkpoints, the tamper-evident log, and for the certificates that bind each key to the identity of an AS. An .509 certificate with 1024-bit RSA keys is about 1kB. With web-of-trust signature chains (described in Section @.1) and a typical AS-path length of four, each certificate is 5kB; thus, a database with certificates for 40,000 ASes would require approximately 195 B. The size of a checkpoint is dominated by the RIBs; it depends on the number of prefixes and peering points. One RIB with 244,000 prefixes and a 90-second history takes about 9.0 B, so, if we conservatively assume that each prefix appears in every inRIB and every outRIB, a complete checkpoint for an AS with six peering points could take up to 108 B. If the AS records one checkpoint every minute and keeps all checkpoints for one day, plus one checkpoint for each day of the last year, it would require up to 190 GB. In our experiment, the log grew at a rate of about 332 kB per minute (without checkpoints). Hence, we estimate that one year’s worth of log data would take about 1@@ GB. The log size is also a function of the number of peering points and the frequency of routing changes. Since the log mostly contains routing updates, its growth rate is roughly proportional to the amount of BGP traffic an AS generates. Recall that the numbers we report are for an AS with five neighbors; if an AS has more neighbors (and thus more peering points), its storage requirements are higher. For the largest ASes (UUNet has 2,@52 neighbors), on the order of a hundred Terabytes of storage may be necessary to store the log for a year. However, the log would be distributed over thousands of routers. Auditors require no permanent storage; however, it makes sense for them to cache a recent checkpoint for each AS they are auditing, so they do not have to download one repeatedly. 5.3 Message o!erhead BGP-A speakers generate traffic for maintaining BGP-A sessions, for exchanging authenticators and for responding to audits. We look at each type of traffic in turn. In terms of traffic, BGP sessions and BGP-A sessions are quite similar. If 1024-bit keys are used, a BGP-A message and its acknowledgment have 3@7 header bytes, while a BGP message only has 1@. On the other hand, a BGP-A message can advertise many different routes, while a BGP message can only advertise one. In our experiment, AS 5 generated an average of 132 kB of BGP-A messages and acknowledgments per minute; these were equivalent to 135 kB of BGP messages. Processing time (ms) Upon receiving a message or an acknowledgment, a BGP-A speaker detaches the authenticator and forwards it to the sender’s neighbors. With 1024-bit keys, the size of an authenticator is 15@ bytes; in our experiment, AS 5’s neighbors sent 2.1 B of AS 5’s authenticators over the 15-minute period. However, authenticators are also collected from messages read during an audit, so the required traffic is ) ) ic in the number of neighbors: each neighbor audits each message and sends the corresponding authenticator to each of the other neighbors. This can be a problem for large ASes (e.g. UUNet). Therefore, authenticators from large ASes should be sent to only a subset of its neighbors. This does not affect NetReview’s guarantees as long as the subsets used by all neighbors intersect in at least one diligent neighbor. In our experiment, all audits were incremental; the auditor transferred a full checkpoint once and then retrieved only the log entries that were added since the last audit. In the limit, the required traffic is the size of the log times the number of auditors, plus some overhead for headers. In total, AS 5 caused about 420 kbps of BGP-A traffic, including routing updates, auditing, and authenticators sent by the neighbors. This corresponds to the bandwidth of a typical DS upstream, which is insignificant compared to the amount of traffic ISPs routinely handle. 5.4 ummary Our experiments show that NetReview’s simple rules are sufficient to describe common, nontrivial routing problems. Also, NetReview’s resource requirements are moderate: in a typically-sized AS with five neighbors, routers must sign less than four messages per second on average, a single hard disk is sufficient to keep one year’s worth of log data, and the total traffic is less than the capacity of a single broadband upstream link. Finally, we have demonstrated that fault detection is feasible at Internet update rates. By running the NetReview software on just a single workstation, an ISP can audit dozens of neighboring ASes in real time. 1 ra ti al hallenges In the previous two sections, we have shown that it is feasible to build a fault detection system with strong guarantees, and that its resource requirements are moderate. The goal of this section is to show how NetReview deals with the various practical problems that have hampered the deployment of previous solutions. In particular, we will show that NetReview can operate without a A, that it can be effective in a partial deployment, that it can initially be deployed without upgrading any routers, and that it offers incentives for incremental deployment. 1.1 et e!ie$ $ithout a Despite many proposals, deploying a global A for prefixes and ASes has so far not found acceptance . Net-Review can use such a A if it exists, but it does not require it. In the absence of a A, we need to find replacements for two services that a A provides: associating each key pair with a real-world identity, and certifying ownership of AS numbers and IP prefixes. We solve this problem using a web-of-trust approach that is inspired by [33, 34]. Each AS initially generates a key pair and creates a self-signed certificate. Then it sends the certificate to its immediate neighbors, who append their own endorsement and forward it on to their neighbors, etc. The overhead for flooding certificates is not a concern, because the AS topology changes slowly. Each AS obtains a database of all certificates, each with a chain of endorsements that corresponds to the shortest path between the local AS and the AS represented by a given certificate. an these certificates be trusted: We can safely assume that each AS knows the true identity of the neighbor attached to each of its physical links. oreover, we have assumed earlier that each AS has a diligent neighbor. This neighbor can detect if the AS signs a certificate that do not correspond to its true identity, or endorses a certificate that does not come from one of its neighbors. Thus, a node can (transitively) trust every certificate that is endorsed by one of its neighbors. In addition, we require each AS to log a public pledge that specifies its current ASN and prefix ownerships.7 ASes extract this pledge during audits and compare it to their database; if there is any change, they flood it to all other ASes. Thus, NetReview can detect if two ASes claim ownership of the same ASN or of overlapping prefixes, and it provides each with evidence of the other’s claim. The conflict can then be resolved through existing mechanisms, e.g., by a mediator or a judge. 1.2 artial deployment It would be unrealistic to expect that all ASes adopt Net-Review, much less that all ASes install the system at the same time. Therefore, NetReview must be able to work in a partial deployment, that is, it must be able to interact with non-participants via BGP. By default, BGP-A speakers and monitors record only BGP-A messages in their logs, and auditors use only BGP-A messages to reconstruct the routing information. However, legacy neighbors have no components that speak BGP-A. If we simply omitted all routes imported from or exported to these neighbors, the information in the log might not be sufficient to evaluate many interest- 7Prefixes used for IP anycast [2@] require special handling because they may be owned by multiple ASes simultaneously. ing conditions. For example, if an AS acts as a provider for another AS, it may be required to export routes for all prefixes it knows about, even if the corresponding route is through a non-participant. Therefore, if an AS has legacy neighbors, its BGP-A speakers and monitors additionally record all the (unsigned) BGP messages they exchange with these neighbors. Why keep this information in the secure record if a faulty participant AS can simply record whatever it wants: There are three reasons. First, we can isolate non-malicious faults such as misconfigurations or hardware failures, where the faulty AS still records correct information. Second, even if an AS lies about the routes it is importing or exporting via BGP, it must lie c i 5 ( to avoid detection by the auditors. For example, if the AS claims to have imported a certain route via BGP, it must re-export that route to each participating neighbor if required by its peering agreement, and it cannot export different versions to different neighbors. Third, logging BGP messages enables an intermediate level of participation in NetReview. If a non-participant AS a is a neighbor of a participant AS b, a can act as an auditor and compare b’s log to the BGP messages b actually sent, without fully deploying fault detection itself. All a needs is the NetReview auditor software and a current snapshot of its own BGP tables. If a finds a discrepancy, it can investigate it by contacting the participant AS b. This option could encourage neighbors of participant ASes to Ntry out’ fault detection. Partial deployment requires an addition to the web-of-trust technique in Section @.1. As long as the deployment is contiguous in the AS graph (which is likely if tier-1 ASes join first), the technique works as described. When a second Nisland’ of participants arises, at least one member of each island must exchange cryptographic credentials out-of-band. These members are then considered NetReview neighbors (even though they do not share a physical link), and they forward certificates from their respective islands to ensure that each AS has a full set. To increase the chance that ASes in small islands have a diligent neighbor, they also collect authenticators for each other and periodically audit each other’s log. 6.3 Using existing routers Requiring ISPs to upgrade or replace their routers to deploy NetReview would present a significant hurdle. Therefore, it is useful to have an intermediate solution that works with existing, unmodified routers. Our solution is to run the NetReview software on ordinary workstations, which we call + i . The monitors speak both BGP and BGP-A; they observe all BGP traffic incident to the AS’s existing routers and maintain BGP-A sessions with any monitors (or native BGP-A speakers where available) in adjacent ASes. The monitors also maintain tamper-evident logs and perform all cryptographic operations. Thus, the existing routers need not be modified. There are two ways to configure a monitor . A * i monitor interposes on all BGP connections of its local AS. When it receives a BGP message from a local border router, it sends an equivalent BGP-A message to the remote BGP-A speaker (or monitor) and vice versa. A +i i monitor snoops on the existing control connections, e.g., using a port replicator, the BGP monitoring protocol , or additional BGP sessions. Whenever it sees an outgoing message on the legacy BGP connection, it sends a BGP-A message with the same information over a separate connection to the neighbor’s BGP-A speaker or monitor. irroring monitors are safer because the routers do not depend on them. If a monitor fails, the routers can still send or receive routing updates via BGP and normal operation is not affected. On the other hand, mirroring monitors allow inconsistencies between the updates sent via BGP and BGP-A. onsider a case where a mis-configured or faulty router advertises some route A to its monitor and a different route B to the adjacent AS. The monitor would record route A in the tamper-evident log, and the AS could not be held accountable for route B. To address this case, mirroring monitors maintain a third RIB for each peering point, which we will call i " 5 . The inRIB contains the routes advertised via BGP-A as before, while the inRIB-BGP contains the routes received over the monitor’s BGP sessions. Normally the two are identical; the scenario described earlier would manifest itself as an inconsistency between inRIB and inRIB-BGP in two adjacent ASes. Thus, an inconsistency cannot go undetected; however, an auditor cannot decide whether an inconsistency between in-RIB and inRIB-BGP is caused by the audited AS or by its neighbor, and therefore must suspect both. Because BGP neighbors have a business relationship, they can be expected to swiftly sort out a demonstrated inconsistency between their advertised routes. 6.4 Incentives for deployment If fault detection is to be deployed incrementally in the current Internet, we need good arguments to persuade ISPs to adopt it. Here, we present two arguments we believe to be compelling: ISPs can use fault detection as a distinguishing feature to attract more customers, and they can use it for root-cause analysis in the i Internet, even in non-participating ASes. Market forces: The first adopters of NetReview are likely to be large ISPs, such as tier-1 and tier-2 ASes, who tend to adopt new routing technology and best prac- tices early. As a result, their routing performance is often excellent. These ASes can demonstrate their excellent performance by offering fault detection as a value-added service to their customers and thus distinguish themselves from the competition. Once fault detection is on the market, competitors are encouraged to measure up by offering the service themselves. Thus, small islands of participants emerge. At this point, when a fault is caused by a non-participant, the participants can handle any complaints by proving that they are not the cause, and by tracing the problem to a non-participant just outside the island’s perimeter, who must then handle the complaint. This creates an incentive for ASes to be inside the perimeter, and thus causes the islands to expand and the gaps between them to shrink. Note that this approach works for NetReview because, unlike secure routing protocols like S-BGP, it is effective even in a small deployment of just a few ASes. oot5cause analysis: As an additional benefit, participant ASes can use the fault detection system to diagnose faults even if the cause is in a non-participating AS. Since non-participants do not sign messages, do not maintain tamper-evident logs, and do not reveal any rules, we cannot guarantee that the diagnosis will always be accurate, and we cannot detect certain types of faults, such as policy violations. However, even an approximate diagnosis enables the AS to respond more effectively to faults. Since non-participants do not have tamper-evident logs, we cannot directly apply auditing to find faults. Instead, we can use the participants’ logs as a giant BGP looking glass that provides information about BGP updates from many vantage points. There are several proposed systems that can use this data to diagnose faults [10, 13, 21, 32]. In fact, because NetReview records a history of past states, it provides even more information than existing systems need; this could be used to develop even more powerful systems. 6.0 ccuracy in a partial deployment When NetReview is used for root-cause analysis in a partial deployment, it returns a candidate set – a set of ASes that could have caused the fault. The size of this set depends on the size of the NetReview deployment. To estimate this dependency, we ran a simulation based on CAIDA’s Internet AS topology , assuming a scenario in which an AS suspects that a given route has been spoofed. With NetReview in place, we can audit the participant ASes on the path and thus localize the fault to either a) a participant AS, b) a segment of non-participants between two participants, or c) the path suffix after the last participant. Here, we will ignore the possibility that the participant ASes record incorrect information. 6 5 4 3 2 1 0.001 ASes deploy in random order ASes deploy in reverse degree order Hypothetical best case 0.01 0.1 1 10 NetReview deployment (% of ASes) 100 Figure 5: Fault localization under partial deployment. Shown is the average number of ASes that could have caused an observed failure. For each deployment size, we simulated 10,000 trials as follows: we randomly picked an AS and calculated the shortest path to a random other AS using the Gao-Rexford conditions, then we picked a random AS on that path as the faulty AS and measured the number of candidates. We report averages over all 10,000 trials. Figure 5 shows our results for two different deployment assumptions: either ASes deploy NetReview in random order, or in order of decreasing node degree. The first assumption is rather conservative; in practice, large ISPs typically run the latest routers, are among the first to apply best common practices and pride themselves on their good performance, so they are more likely to be early adopters. In both cases, the average number of candidates starts at four (the average AS path length on the Internet). However, if the ASes with the most neighbors deploy NetReview first, the average decreases much more rapidly and reaches perfect localization with only a 15% deployment. The reason is that there are only about 12-15 tier-1 ASes; once these have deployed NetReview, faults can already be localized to one half of the path. 85% of the ASes in the Internet do not have customers of their own; once the other 15% participate, faults can be localized to one of them or to one of their customers. This result shows that early adopters of a fault detection system like NetReview can derive considerable benefits from it; a deployment that includes the 0.1% highest-degree ASes would already be able to double the accuracy of its diagnoses. In contrast, fault prevention systems like S-BGP are only effective when they are already widely deployed. 3 uture $ork In this section, we describe some advanced features that could be added to NetReview. Avg number of candidates 3.+ imultaneously inspecting several es NetReview inspects one log at a time, which is sufficient to detect protocol violations and policy violations. However, NetReview cannot detect problematic interactions between the policies of multiple ASes that way. An example is bad gadgets [1;], which only arise when the routing policies of several ASes conDict in a circular fashion. To detect bad gadgets, NetReview would have to inspect the logs of multiple ASes simultaneously. Technically, it is not difficult to fetch the logs from multiple ASes and to evaluate rules over multiple RIBs. However, routing policies are typically pair-wise confidential; thus, the check would have to be performed by a mutually trusted auditor. An alternative method to detect such policy conDicts, proposed in , is to have ASes annotate BGP advertisements with a histor in a manner that preserves the privacy of the routing policies. Because NetReview records and publishes histories of BGP advertisements as part of its regular operation, this tech-ni ue can be readily applied. 3., etecting data5plane inconsistencies In this paper, we have focused on providing fault detection for the contro( p(ane – the BGP announcements ASes send to each other. However, an AS could conceivably advertise one path in BGP and forward data packets on another, whether inadvertently or as part of an attack. NetReview already provides two mechanisms that can detect inconsistencies between the control and data planes: (i) it offers authoritative information about the route advertisement in the control plane, and (ii) it establishes the secure log that could also record observations about the data plane. For example, suppose AS B advertises route 7B C> to AS A but instead forwards A’s traffic to AS D. If D passively monitors the traffic received from B, D can observe that A’s packets are misrouted. D can add this observation to its log, and any auditors can thus obtain evidence of a data-plane inconsistency between B and D. 3.3 Internal audits NetReview provides fault detection for BGP inter-domain routing. It does not record any intra-domain routing messages in the tamper-evident log because these could reveal confidential information, such as the AS’s internal topology. However, NetReview could easily be adapted to cover intra-domain routing using a separate, private record. ASes could then perform internal audits to discover mis-configurations or compromised routers in their internal network, even when these routers have not (yet) caused a routing problem that would be visible to a neighbor. 4 elated *ork etection: Anomaly detection techni ues [B, 18, 21, 23, 3@] use the BGP routing updates from one or more vantage points to build a de facto registry of the AS topology and prefix ownership. They raise an alarm upon receiving updates that disagree with the registry. Root-cause analysis (RCA) algorithms analyze BGP update messages from multiple vantage points to identify the AS(es) responsible for a routing change [;, 5, 10]. In RCA, each vantage point identifies a set of suspect ASes, then the sets are correlated to determine the potential culprit(s). The accuracy of RCA depends on the number and location of the vantage points. )nlike both RCA and anomaly detection, NetReview produces no false positives or false negatives, and it is not vulnerable to compromised ASes. In addition, NetReview can detect a larger class of faults, and it produces evidence that can be used to convince a third party. AudIt  can determine which ASes are losing or delaying packets on the data plane. However, AudIt can only reveal the symptoms of a malfunctioning control plane, whereas control-plane fault detection can perform diagnosis. revention: Secure routing protocols [20, 22, 33, 3;] can ensure that (i) a route advertisement originates from the legitimate origin AS and that (ii) the AS-path of a route advertisement has not been modified or forged. On the one hand, secure routing protocols can prevent certain types of faults, whereas NetReview can only detect them; on the other hand, NetReview covers a larger class of faults, including policy violations (such as a faulty AS redistributing routes from one upstream provider to another), it can localize faults, and it provides incentives to avoid them. Perhaps more importantly, secure routing protocols do not provide appreciable benefits until many (if not all) ASes have adopted them, which explains in part why they have not yet been deployed, whereas Net-Review is effective even in small deployments N-BGP [2I] uses trusted hardware to enforce a BGP safety specification for individual routers. )nlike N-BGP, NetReview does not re uire trusted hardware and it produces evidence of faults that can be verified by third parties. oreover, NetReview is designed to check an entire AS’s operation, not only against a safety specification but also against the AS’s routing policy as specified in its peering agreements. AIP  is a clean-slate redesign of IP that, among other things, would greatly simplify the deployment of a secure routing protocol. However, even if AIP were to replace IP entirely, it would be subject to the limitations of secure routing protocols described above. ccounta)ility: NetReview’s tamper-evident log is based on the log in PeerReview [1B], a general account- ability framework for distributed systems. However, NetReview goes beyond PeerReview, which is based on assumptions that do not hold in interdomain routing. For example, PeerReview re uires a certificate authority, it cannot operate in a partial deployment, it cannot protect the business secrets of ISPs, and it detects neither policy violations nor any other condition that involves more than one router. 6 onclusion In this paper, we have presented the design, implementation, and evaluation of NetReview, a fault detection system for interdomain routing. NetReview reliably detects incorrect behavior and links it to the responsible AS, while also enabling well-behaved ASes to prove they have adhered to the protocol and their routing policies. NetReview’s correctness checks can detect and diagnose a wide variety of problems in BGP, including faulty e uipment, buggy software, policy violations, and malicious attacks, which makes it an appealing alternative to specific solutions to any one of these problems. NetReview does not re uire changes to the underlying routers and is effective even in partial deployments. We believe that a fault detection system like NetReview can play an important role in improving the reliability, stability, and security of interdomain routing. ckno$ledgments We are grateful to the anonymous reviewers and to our shepherd, Bruce aggs, for their detailed and helpful comments. Paul Francis, /rishna Gummadi, Alex Fab-rikant, Bryan Ford, and Sharon Goldberg provided valuable feedback on earlier versions of this paper. This work was done while Ioannis Avramopoulos was at Princeton )niversity and supported by DHS grant WI11NF-05-1-0;1B. Andreas Haeberlen was supported in part by )S National Science Foundation grant ANI-033885@. eferences  D. Andersen, H. Balakrishnan, N. Feamster, T. /oponen, D. oon, and S. Shenker. Accountable Internet protocol (AIP). In roceedings of S" , Aug 2008.  /. Argyraki, P. aniatis, O. Irzak, S. Ashish, and S. Shenker. oss and delay accountability for the Internet. In roc6 " "nternationa( onference on et'or$ rotoco(s, Oct. 200B.  AS relationships dataset from CAIDA. http://www.caida. org/data/active/as-relationships/index.xml. [;] . Caesar, . Subramanian, and R. H. /atz. A case for an Internet health monitoring system. In roc6 %S " -or$shop on ot !opics in S ste+ ependa,i(it , *un. 2005.  *. Chandrashekar, .-. hang, and H. Peterson. Fixing BGP, one AS at a time. In roc6 A S" et'or$ !rou,(eshooting -or$shop, 200;. [@] /. . Chandy and . amport. Distributed snapshots: Deter-mining global states of distributed systems. A !rans6 o+put6 S st6, 3(1):@3–B5, 1I85. [B] .-*. Chi, R. Oliveira, and . hang. Cyclops: the AS-level connectivity observatory. S" o+put6 o++un6 ev6, 38(5):5–1@, 2008.  N. Feamster and H. Balakrishnan. Detecting BGP configuration faults with static analysis. In roceedings of S ".81, ay 2005. [I] N. Feamster, . . ao, and *. Rexford. BorderGuard: Detecting cold potatoes from peers. In roc6 "nternet easure+ent onfer-ence, Oct 200;.  A. Feldmann, O. aennel, . ao, A. Berger, and B. aggs. ocating Internet routing instabilities. In roc6 A S" , Aug.-Sept. 200;.  . Gao and *. Rexford. Stable Internet routing without global coordination. " 7A !ransactions on et'or$ing, I(@):@81– @I2, Dec 2001.  GN) ebra. http://www.zebra.org/.  G. Goodell, W. Aiello, T. Griffin, *. Ioannidis, P. cDaniel, and A. Rubin. Working around BGP: An incremental approach to improving security and accuracy of interdomain routing. In roc6 et'or$ and istri,uted S ste+s Securit , Feb 2003. [1;] T. G. Griffin, F. B. Shepherd, and G. Wilfong. The stable paths problem and interdomain routing. " 7A !ransactions on et'or$ing, 10(2):232–2;3, April 2002.  T. G. Griffin and G. Wilfong. A safe path vector protocol. In roc6 " , ar. 2000. [1@] A. Haeberlen, P. /uznetsov, and P. Druschel. The case for Byzan-tine fault detection. In roc6 ot ep.82. )S NI , Nov 200@. [1B] A. Haeberlen, P. /uznetsov, and P. Druschel. PeerReview: Prac-tical accountability for distributed systems. In roc6 S +posiu+ on perating S ste+s rincip(es, Oct 200B.  . Hu and . . ao. Accurate real-time identification of IP pre-fix hijacking. In roc6 " S +posiu+ on Securit and rivac , ay 200B. [1I] .-C. Hu, D. cGrew, A. Perrig, B. Weis, and D. Wendlandt. (R)evolutionary bootstrapping of a global P/I for securing BGP. In roc6 ot ets -or$shop, Nov 200@.  .-C. Hu, A. Perrig, and . Sirbu. SP.: Secure path vector rout-ing for securing BGP. In roc6 A S" , 200;.  *. /arlin, S. Forrest, and *. Rexford. Autonomous security for autonomous systems. o+puter et'or$s4 specia( issue on o+-p(e* o+puter and o++unication et'or$s, Oct 2008.  S. /ent, C. ynn, and /. Seo. Secure border gateway protocol (S-BGP). " &SA , 18(;):582–5I2, Apr 2000.  . ad, D. assey, D. Pei, . Wu, B. hang, and . hang. PHAS: A prefix hijack alert system. In roc6 %S " Securit S +posiu+, Aug. 200@. [2;] R. ahajan, D. Wetherall, and T. Anderson. )nderstanding BGP misconfiguration. In roc6 A S" , Sep 2002.  O. Nordstroem and C. Dovrolis. Beware of BGP attacks. A o+puter o++unications evie' / 0, Apr 200;. [2@] C. Partridge, T. endez, and W. illiken. Host anycasting ser-vice. RFC 15;@, Nov 1II3. [2B] A. Ramachandran and N. Feamster. )nderstanding the network-level behavior of spammers. In roc6 A S" , Sep 200@.  . Rekhter, T. i, and S. Hares. A border gateway protocol ; (BGP-;). RFC ;2B1, *an 200@. [2I] P. Reynolds, O. /ennedy, . G. Sirer, and F. B. Schneider. Se-curing BGP using external security monitors. Technical Report TR-200@-20@5, Cornell )niversity, Computing and Information Science, Dec 200@.  Route.iews project. http:MMwww.routeviews.orgM.  *. Scudder, R. Fernando, and S. Stuart. BGP monitoring protocol, Nov 2008. Internet Draft, draft-ietf-grow-bmp-00.  R. Teixeira and *. Rexford. A measurement framework for pin-pointing routing changes. In roc6 A S" et'or$ !rou,(eshooting -or$shop, Sep 200;.  T. Wan, . /ranakis, and P. C. van Oorschot. Pretty secure BGP (psBGP). In roc6 et'or$ and istri,uted S ste+ Securit S +-posiu+, Feb 2005. [3;] R. White. Securing BGP through secure origin BGP. "nternet rotoco( &ourna(, @(3), 2003.  *. Wu, . . ao, *. Rexford, and *. Wang. Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network. In rocessings of S ".81, ay 2005. [3@] C. heng, . *i, D. Pei, *. Wang, and P. Francis. A light-weight distributed scheme for detecting IP prefix hijacks in real-time. In roc6 A S" , Aug. 200B. [3B] . mijewski. onger is not always better. http://www.renesys.com/blog/2009/02/ longer-is-not-better.shtml.
PY - 2009
Y1 - 2009
N2 - Despite many attempts to fix it, the Internet's interdomain routing system remains vulnerable to configuration errors, buggy software, flaky equipment, protocol oscillation, and intentional attacks. Unlike most existing solutions that prevent specific routing problems, our approach is to detect problems automatically and to identify the offending party. Fault detection is effective for a larger class of faults than fault prevention and is easier to deploy incrementally. To show that fault detection is useful and practical, we present NetReview, a fault detection system for the Border Gateway Protocol (BGP). NetReview records BGP routing messages in a tamper-evident log, and it enables ISPs to check each other's logs against a high-level description of the expected behavior, such as a peering agreement or a set of best practices. At the same time, NetReview respects the ISPs' privacy and allows them to protect sensitive information. We have implemented and evaluated a prototype of NetReview; our results show that NetReview catches common Internet routing problems, and that its resource requirements are modest.
AB - Despite many attempts to fix it, the Internet's interdomain routing system remains vulnerable to configuration errors, buggy software, flaky equipment, protocol oscillation, and intentional attacks. Unlike most existing solutions that prevent specific routing problems, our approach is to detect problems automatically and to identify the offending party. Fault detection is effective for a larger class of faults than fault prevention and is easier to deploy incrementally. To show that fault detection is useful and practical, we present NetReview, a fault detection system for the Border Gateway Protocol (BGP). NetReview records BGP routing messages in a tamper-evident log, and it enables ISPs to check each other's logs against a high-level description of the expected behavior, such as a peering agreement or a set of best practices. At the same time, NetReview respects the ISPs' privacy and allows them to protect sensitive information. We have implemented and evaluated a prototype of NetReview; our results show that NetReview catches common Internet routing problems, and that its resource requirements are modest.
UR - http://www.scopus.com/inward/record.url?scp=85076898069&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076898069&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85076898069
T3 - Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009
SP - 437
EP - 452
BT - Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2009
PB - USENIX Association
Y2 - 22 April 2009 through 24 April 2009