Network Working Group B. Zhang Internet-Draft Univ. of Arizona Intended status: Informational L. Wang Expires: April 29, 2010 Univ. of Memphis X. Zhao Univ. of Arizona Y. Liu Univ. of Memphis L. Zhang UCLA October 26, 2009 FIB Aggregation draft-zhang-fibaggregation-02.txt Status of this Memo This Internet-Draft is submitted to IETF in full conformance with the provisions of BCP 78 and BCP 79. Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet- Drafts. Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress." The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt. The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html. This Internet-Draft will expire on April 29, 2010. Copyright Notice Copyright (c) 2009 IETF Trust and the persons identified as the document authors. All rights reserved. This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents in effect on the date of publication of this document (http://trustee.ietf.org/license-info). Please review these documents carefully, as they describe your rights Zhang, et al. Expires April 29, 2010 [Page 1] Internet-Draft FIB Aggregation October 2009 and restrictions with respect to this document. This document may contain material from IETF Documents or IETF Contributions published or made publicly available before November 10, 2008. The person(s) controlling the copyright in some of this material may not have granted the IETF Trust the right to allow modifications of such material outside the IETF Standards Process. Without obtaining an adequate license from the person(s) controlling the copyright in such materials, this document may not be modified outside the IETF Standards Process, and derivative works of it may not be created outside the IETF Standards Process, except to format it for publication as an RFC or to translate it into languages other than English. Abstract The rapid growth of Forwarding Information Base (FIB) has raised concerns among many Internet Service Providers. One potential solution to this problem is FIB aggregation, i.e. letting each router aggregate its FIB entries without affecting the forwarding paths taken by data traffic. It is a simple local software optimization within a router, requiring no changes to routing protocols or router hardware. To understand the effectiveness of using FIB aggregation to extend router lifetime, in this draft we present several FIB aggregation algorithms and evaluate their performance using routing tables and updates collected from tens of networks. Our results show that FIB aggregation can reduce the FIB table size by as much as 70% with small computational overhead. We also show that the computational overhead can be controlled through various mechanisms. Zhang, et al. Expires April 29, 2010 [Page 2] Internet-Draft FIB Aggregation October 2009 Table of Contents 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 4 2. FIB Aggregation Overview . . . . . . . . . . . . . . . . . . . 6 3. FIB Aggregation Techniques and Algorithms . . . . . . . . . . 7 3.1. Terminology . . . . . . . . . . . . . . . . . . . . . . . 7 3.2. Level 1 Aggregation . . . . . . . . . . . . . . . . . . . 9 3.3. Level 2 Aggregation . . . . . . . . . . . . . . . . . . . 9 3.4. Level 3 Aggregation . . . . . . . . . . . . . . . . . . . 10 3.5. Level 4 Aggregation . . . . . . . . . . . . . . . . . . . 12 3.6. Practical Considerations . . . . . . . . . . . . . . . . . 13 4. Updating Aggregated FIB . . . . . . . . . . . . . . . . . . . 14 5. Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 16 5.1. Methodology . . . . . . . . . . . . . . . . . . . . . . . 17 5.2. Table Size Reduction and Overhead . . . . . . . . . . . . 18 5.3. Routing Update Handling . . . . . . . . . . . . . . . . . 20 6. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . 23 7. Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 23 8. Security Considerations . . . . . . . . . . . . . . . . . . . 23 9. Informative References . . . . . . . . . . . . . . . . . . . . 23 Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 24 Zhang, et al. Expires April 29, 2010 [Page 3] Internet-Draft FIB Aggregation October 2009 1. Introduction The Internet routing scalability problem has raised concerns in both industry and research communities, as documented in the report from the IAB Workshop on Routing and Addressing [RAWS]. Several solutions have been proposed under the IRTF RRG and IETF GROW Working Group. To address the root cause of the scalability problem may require changes to the Internet routing architecture and protocols. However, deploying these architectural and protocol changes may take a long time, as we have learned from the design and deployment of IPv6. Thus we should also explore simple short-term solutions that can help ISPs address problems facing them today. In particular, ISPs urgently need a solution to reduce their forwarding table size. Forwarding tables are derived from routing tables and router configurations, so their size increases as routing tables grow. However, forwarding tables use high performance memory that is more expensive and more difficult to scale than the memory used to hold routing tables. Therefore, their size is a more immediate concern to ISPs as well as vendors. This draft reports our investigation on the feasibility of a purely local solution: FIB aggregation, which is to combine multiple entries in the routing table to one entry in the forwarding table without changing the next hop for data traffic. This approach is particularly appealing because it can be done by a software upgrade at one router at a time, and its impact is limited within the router. It does not require changes to routing protocols or router hardware, nor does it affect multi-homing, traffic engineering, or other network-wide operations. It is important to note that FIB aggregation is not a replacement for the long-term architectural solutions; rather, FIB aggregation is a local solution that can be quickly implemented and deployed in the short-term, and in the long run, it can co-exist and complement architectural solutions. The idea of FIB aggregation is not new. It was mentioned as a potential strategy in "Preliminary Recommendation for a Routing Architecture" [Li09RRG]. Through personal exchanges, we have learned that one vendor has implemented a simple FIB aggregation scheme (similar to our Level-1 aggregation). There is also a patent for a FIB aggregation algorithm [Cain02AUTOAGG]. In this work, we evaluate the benefits and costs of several FIB aggregation schemes. We recognize that FIB aggregation is an opportunistic technique -- its effectiveness depends on what prefixes are present in the table, how many of them can be numerically represented by a single prefix, and how many of them share the same next-hop in data forwarding. The benefits of FIB aggregation also come with certain cost, such as extra CPU overhead. The cost also depends on the actual aggregation algorithms, and how routing changes are handled to update the Zhang, et al. Expires April 29, 2010 [Page 4] Internet-Draft FIB Aggregation October 2009 forwarding table. We design and implement five algorithms at different aggregation levels, each representing different tradeoffs between table size reduction and computation complexity. We evaluate them using publicly available routing tables from tens of networks. The results show that the lowest-level aggregation can reduce the FIB table size by 30% to 50%, equivalent to the full DFZ routing table size two and half years ago, while the highest-level aggregation can reduce the FIB table size by 70%, equivalent to the full DFZ table size eight years ago. The computation time of one full aggregation run ranges from 50 milliseconds to 250 milliseconds on a commodity Linux machine. Although these numbers may not reflect the computation time on a router, they reflect the relative speed of different levels of aggregation. Moreover, such full aggregation is performed only when certain thresholds are reached, so the computation time is amortized over time. To handle routing changes, we design and implement algorithms to incrementally update the aggregated FIB upon a change. The full aggregation algorithm is invoked only when the router CPU load is low or the FIB size rises above a given threshold. The computation time can be further reduced by not aggregating highly active prefixes, which represent a small fraction of the routing table but are responsible for many routing changes. The evaluation using one-month BGP routing updates shows that, compared with un-aggregated FIB, the computation overhead of maintaining aggregated FIB over time is small. Overall, we believe that FIB aggregation is a viable solution to the routing scalability problem, since it provides significant reduction in table size, and its extra computation can be controlled by various mechanisms, such as using different levels of aggregation, incremental update algorithms, on-deman invocation, and not aggregating highly active prefixes. Another proposed approach to FIB size reduction is Virtual Aggregation [Virtual_Aggregation]. Virtual Aggregation designates a small set of routers (APRs) to announce short prefixes (called virtual prefixes), so that other routers do not need to install in their FIB more specific prefixes under those virtual prefixes; they can simply forward packets falling with the virtual prefixes to the APRs responsible for those virtual prefixes. It can be independently deployed by a single ISP, and does not require changes to the routing architecture or protocols. However, Virtual Aggregation requires considerable changes to network-wide router configurations and specialized routers to announce virtual prefixes. Moreover, it introduces extra delays (path stretch) in packet delivery. To some extent, the Level 4 FIB aggregation described in this draft can be viewed as a local Virtual Aggregation within a router. Zhang, et al. Expires April 29, 2010 [Page 5] Internet-Draft FIB Aggregation October 2009 2. FIB Aggregation Overview FIB aggregation eliminates and aggregates entries in a FIB without affecting the delivery of data traffic. For example, it may remove an entry for prefix P1 from the FIB if another entry for prefix P2 with the same next-hop already covers P1's address space. It may also introduce a new entry to the FIB after removing several entries that share the same next-hop. To ensure packet delivery and avoid changing the path that a packet will take, a FIB aggregation scheme should satisfy the property of forwarding correctness: after longest match lookup, if a destination address has a non-NULL next-hop in the un-aggregated FIB, it should have the same next-hop in the aggregated FIB. There are two main questions in designing FIB aggregation techniques: how to aggregate the full FIB and how to update an aggregated FIB upon a routing change. In the next two sections, we will describe five aggregation algorithms and one update handling algorithm. The effectiveness of FIB aggregation depends on which address prefixes are present in the routing table and how these prefixes are distributed over next-hop routers. Generally speaking, the fewer neighbors a router has, the better aggregation it may achieve. In the extreme case, all prefixes share the same next-hop and the aggregation is maximized. According to Li et. al [RouterTopology], although some routers have high degrees up to a couple of hundreds, these routers connect to a large number of end-customers, not transit neighbor routers. Therefore, they will still use only a small number of next-hops, i.e., the transit neighbors, to reach most of the address prefixes. Besides sharing the same next-hop, prefixes also need to be numerically aggregatable. This is possible due to two factors. First, in IP address allocation, large blocks of Internet addresses are first allocated to Regional Internet Registries and then they further allocate the addresses to networks within the same region. A router outside the region tends to use the same nexthop to reach these address prefixes, which can then be aggregated. Second, for prefixes split for traffic engineering or other purposes, a router near the origin network is likely to have different next-hops, but a router further away from the origin network is more likely to have the same next-hop towards these numerically aggregatable prefixes. Therefore, although FIB aggregation is opportunistic and the aggregation degree varies from router to router, there are some inherent properties of the Internet that can make FIB aggregation effective. If FIB aggregation is effective in reducing table size, its most appealing feature is that the impact is limited within a Zhang, et al. Expires April 29, 2010 [Page 6] Internet-Draft FIB Aggregation October 2009 router's data plane. It does not change any routing protocols, or any router's routing decisions. Data traffic still flows on the same router paths. Therefore, it can co-exist with almost any new routing protocols, including those architectural solutions to the routing scalability problem in the long run. 3. FIB Aggregation Techniques and Algorithms We present four levels of full FIB aggregation, each aggregating the table more but incuring more computation and other overhead. These algorithms assume that the routing table is stored in a tree structure. Though our implementation uses patricia trie, a tree-like data structure widely used for routing tables since the BSD implementation, the algorithms should apply to any tree data structure. Note that our algorithms do not build any additional trees just for aggregation; we simply use the existing trees that the RIB and FIB use. For a network device that uses non-tree data structure to implement routing tables, the general techniques discussed here still apply. In real networks, the algorithms and implementations can always be optimized for specific hardware, operating systems, and routing table data structures. 3.1. Terminology In a patricia trie-based RIB or FIB, IP prefixes and their next-hop information are stored in certain trie nodes. The prefixes become more specific along the tree, which facilitates longest prefix matching. Figure 1 shows part of a patricia trie containing four prefixes 12.0/14, 12.2/15, 12.0/16, and 12.1/16. Note that some of the trie nodes, e.g. Node 1, may not have corresponding prefixes; we call them "internal nodes". However, they may become real nodes when their corresponding prefixes are advertised to this router. For example, if a routing announcement for 12.0/15 is received, this prefix and its next-hop information will be inserted at Node 1. 12.0/14 / \ Node 1 12.2/15 / \ 12.0/16 12.1/16 Figure 1 Before describing the detailed aggregation algorithms, we introduce a few more terms: Zhang, et al. Expires April 29, 2010 [Page 7] Internet-Draft FIB Aggregation October 2009 o Ancestor prefix and immediate ancestor prefix: given two address prefixes P1 and P2, if P1 is a less specific prefix of P2, then P1 is called an ancestor prefix of P2. For example, 12/8 is an ancestor prefix of 12.1/16. If P1 is the longest prefix among all the ancestor prefixes of P2, it is called the immediate ancestor prefix of P2. o Nearest descendant prefixes: for a given prefix P in a patricia trie, its nearest descendant prefixes includes all the prefixes whose immediate ancestor prefix is P. o Covering prefix and covered prefix: if an address prefix P has no ancestor prefix in a RIB/FIB, then it is a covering prefix. Otherwise, it is a covered prefix. Note that in a patricia trie implementation, a covering prefix may have ancestor nodes that do not have corresponding prefixes. o Real prefix and generated prefix: if a prefix exists in an unaggregated RIB/FIB, it is called a real prefix. If a prefix is generated by aggregation, it is called a generated prefix. o Hole: if a RIB/FIB entry X has a more specific prefix than another entry Y does and they have different next-hops, then X punches a hole in Y. For example, the entry (prefix: 12.1/16, nexthop: A) punches a hole in (prefix: 12/8, nexthop: B). o Sibling prefix: if two prefixes of the same length are numerically consecutive and numerically aggregatable, we call them sibling prefixes. For example, 12.2/16 and 12.3/16 are sibling prefixes. In a patricia trie, sibling prefixes are children prefixes under the same parent. o IN-FIB and NON-FIB: an aggregation algorithm determines whether a prefix should be placed in the FIB. If placed in the FIB, the prefix is labeled as IN-FIB. Otherwise, it is labeled as NON-FIB. o Postorder traversal: this is a recursive tree traversal algorithm that examines the children nodes before the parent nodes. Suppose in a given FIB tree, prefix P1 has two children prefixes P2 and P3 and prefix P2 has two children prefixes P4 and P5 (P3, P4, P5 are leaf nodes), then the algorithm will visit the prefixes in the following order: P4, P5, P2, P3, P1. o Preorder traversal: this is a recursive tree traversal algorithm that traverses each node, its left subtree and then its right subtee. In the above example, a pre-order traversal will visit the prefixes in the following order: P1, P2, P4, P5, P3. Zhang, et al. Expires April 29, 2010 [Page 8] Internet-Draft FIB Aggregation October 2009 3.2. Level 1 Aggregation (A) (A) / \ => / \ / \ / \ (A) ( ) ( ) ( ) Level-1: Removing covered prefixes Figure 2 This technique is illustrated in Figure 2. The binary tree represents part of the IP address space. Nodes labeled with letters are prefixes in the routing table, and the letter represents the next-hop for the prefix. Nodes in parenthesis without labels are internal nodes that do not have their corresponding prefixes in the FIB tree. Level-1 aggregation removes a prefix if it shares the same next-hop with its immediate ancestor prefix. Addresses that previously match the removed prefix now will match the ancestor prefix and still get the same next-hop. Previously non-routable packets, whose table lookup ends up with NULL next-hop, will still be non-routable. In other words, this aggregation does not introduce any new prefix nor extra routable space into the table. The algorithm implementing this technique simply traverses the tree recursively from the root node in postorder. When it arrives at a node with a prefix, it compares this prefix's next-hop with its immediate ancestor prefix's next-hop. If they have the same next- hop, it labels the current node NON-FIB, otherwise labels it IN-FIB. The immediate ancestor prefix's next-hop is updated and remembered during the tree traversal. Eventually every prefix node is labeled as either NON-FIB or IN-FIB, and all IN-FIB prefixes comprise the aggregated FIB. The aggregation is done recursively throughout the entire table. The computation time is O(n), where n is the total number of nodes in the tree. 3.3. Level 2 Aggregation Zhang, et al. Expires April 29, 2010 [Page 9] Internet-Draft FIB Aggregation October 2009 ( ) (A) / \ => / \ / \ / \ (A) (A) ( ) ( ) Level-2: Combining sibling prefixes Figure 3 This technique is illustrated in Figure 3. In addition to performing Level 1 aggregation, Level 2 combines sibling prefixes that share the same next-hop into a parent prefix. If the parent node already has a prefix with a different next-hop, then the aggregation cannot be done. Or if the parent node already has a prefix with the same next- hop, then it is part of Level 1 aggregation. Therefore, Level 2 is done when the parent node has no prefix. The net result is to introduce a new prefix to cover two existing prefixes, but there is no extra routable space introduced, i.e. the aggregated FIB covers the exact address space as the un-aggregated FIB. The algorithm implementing Level 2 aggregation traverses the tree recursively from the root node in postorder. Besides doing Level 1 aggregation, when it arrives at a node without a prefix, it compares this node's two children. If both children have prefixes and use the same next-hop, then both children are labeled NON-FIB, and this current node is assigned the parent prefix and labeled IN-FIB. The aggregation is done recursively throughout the entire table. The computation time is O(n), where n is the total number of nodes in the tree. 3.4. Level 3 Aggregation (A) (A) / \ / \ / \ / \ () () => () () /\ /\ /\ /\ / \ / \ / \ / \ (A) () () (A) () [] [] () Level-3: Allowing extra space Figure 4 This technique is illustrated in Figure 4 (nodes with square brackets are extra routable space introduced by the aggregation). In addition to performing the Level 1 and 2 aggregation, Level 3 aggregates a set of non-sibling prefixes that have the same next-hop into a super Zhang, et al. Expires April 29, 2010 [Page 10] Internet-Draft FIB Aggregation October 2009 prefix. Between these non-sibling prefixes, non-routable space is allowed. For example, in Figure 1c, at the bottom level of the tree, there are two nodes with address prefixes (real nodes) sharing the same next hop. However, these two nodes are separated in the tree by two nodes without address prefixes. The prefixes of the two real nodes can be aggregated into a grandparent prefix. Level 3 aggregation must be implemented with care to ensure its forwarding correctness. For example, in Figure 1c, two grandchildren prefixes are aggregated into one grandparent prefix. This would be incorrect if there is already a great-grandparent prefix (not shown in the figure) covering the subtree with a different next-hop B, because that means the two middle nodes at the bottom level are not non-routable space and their next-hops would change from B to A after the aggregation. In order to handle this case without introducing much computation overhead, we decide that in our implementation we only apply this type of aggregation to prefixes that do not have any existing ancestor prefix, i.e. covering prefixes. In a typical DFZ routing table, about half of all the prefixes are covering prefixes and the other half are covered prefixes. The covered prefixes can be aggregated by Level 1 and Level 2, therefore our choice does not lose too much aggregation capability. The algorithm implementing Level 3 aggregation traverses the tree recursively in postorder. Besides doing Level 1 and Level 2 aggregation for all nodes, when it arrives at a covering prefix, it checks whether this prefix has a sibling node that does not have a prefix. If yes, it returns a pointer of this prefix node to its parent node, which will further pass this pointer up along the tree. Whenever an upper level node has two such pointers, one from a left descendant and another from a right descendant, and these two descendants have the same next-hop, then a new prefix is created at this upper level node and labeled IN-FIB, while the two descendant nodes are labeled NON-FIB. If the two descendants have different next-hops, then aggregation cannot be done and they remain IN-FIB. The computation complexity is O(n), where n is the number of nodes in the tree. Zhang, et al. Expires April 29, 2010 [Page 11] Internet-Draft FIB Aggregation October 2009 3.5. Level 4 Aggregation ( ) (A) / \ / \ / \ / \ () () => () () /\ /\ /\ /\ / \ / \ / \ / \ (A) (B)() (A) () (B)[] () Level-4: Allowing holes in the aggregation Figure 5 This technique is illustrated in Figure Figure 5. In addition to performing Level 1, 2 and 3 aggregation, Level 4 aggregates a set of non-sibling prefixes with the same next-hops, which have other prefixes with different next-hops in between. For example, in Figure 1d, an address prefix with next-hop B is between the prefixes being aggregated to a new prefix with next-hop A, punching a "hole'' in the aggregate prefix. This type of aggregation maintains forwarding correctness and may also introduce extra routable space as Level 3 does. For the same reason as in Level 3, our algorithm only applies this type of aggregation to prefixes that do not have ancestor prefixes. The seemingly trivial difference between Level 4 and Level 3 actually has significant implication to algorithm design. It allows the maximum flexibility for aggregation. However, taking advantage of it may also require significant computational time. For example, given a set of non-sibling prefixes with different next-hops, which super- prefix should be inserted? Which next-hop should the super-prefix take? Finally, how should the decision be made without too much computational complexity? In this work, we present and evaluate two different Level 4 algorithms described as follows. The Level 4A algorithm traverses the tree recursively once in postorder. Besides doing Level 1, 2 and 3 aggregations, when it arrives at a covering prefix (our level 4 algorithm only applies to covering prefixes), it returns a pointer of this prefix node to its parent, which will further pass this pointer up along the tree. Whenever an upper level node receives two lists of its prefix descendants, one from its left child and the other from its right child, this node combines the two lists to get all its immediate prefix descendants and their next-hops, picks the most popular next- hop as its own next-hop and inserts a prefix at this node. All the immediate prefix descendants that use the most popular next-hop will be labeled NON-FIB, and other descendants are labeled IN-FIB. If Zhang, et al. Expires April 29, 2010 [Page 12] Internet-Draft FIB Aggregation October 2009 there are multiple next-hops tie for the most popular, then one of them will be randomly selected. The computation time is O(n), where n is the number of nodes in the tree. The Level 4B algorithm is based on Herrin's proposal [Herrin]. It differs from the Level 4A algorithm mainly in how the popular next- hop is calculated -- the 4A algorithm considers only the immediate prefix descendants of a node, while the 4B algorithm considers all the prefix descendants of a node (i.e. all the prefixes in the tree rooted at this node). It traverses the tree twice. The first step traverses the tree recursively in postorder, which is like sweeping all tree nodes from bottom up. During this process, the algorithm calculates the most popular next-hop among all descendant prefixes of a node and records this next-hop with the node unless this node already has a prefix with a different next-hop. The second step traverses the covering prefixes in preorder, which is like going through all tree nodes from top down. During this process, the algorithm tries to insert new prefixes with the most popular next-hop from all prefix descendants (not just immediate prefix descendants as in Level 4A), as calculated in the previous postorder tree traversal, and label the descendant prefixes NON-FIB or IN-FIB accordingly. When there are multiple equally popular next-hops, we randomly select one. The computation time is O(n), where n is the number of nodes in the tree. It tries to do a more thorough aggregation than Level 4A, but will take longer time since it traverse the tree twice. 3.6. Practical Considerations Our aggregation techniques ensure forwarding correctness by aggregating prefixes with the same next-hop. In reality, there are other types of information stored in a FIB entry, such as packet count. Aggregating two prefixes will probably lose such fine-grained statistics, which also happens in all other routing scalability solutions, usually to a much wider extent. Losing more detailed information is a necessary cost we have to pay in order to improve overall system scalability. One way to mitigate the problem is having a router-wide packet counter instead of per-FIB packet counter, thus the aggregated information from all line cards are still kept for each prefix. Another way is for the operators to configure some important prefixes not be subject to FIB aggregation, thus fine-grained statistics of these important prefixes will be kept. A side effect of Level 3 and 4 aggregations is that they introduce new prefixes that cover previously non-routable space, therefore some previously non-routable traffic (which would have been dropped by this router) will be forwarded along the next-hop of the new prefixes. This does not violate the correctness requirement since Zhang, et al. Expires April 29, 2010 [Page 13] Internet-Draft FIB Aggregation October 2009 all previously routable traffic is still routable and will follow the same path. The impact of introducing extra routable space into the FIB depends on how much traffic is destined to that address space. In normal operational conditions, the volume of such traffic should be negligible. However, malicious traffic such as port scanning usually explores such non-routable space and in certain cases it may become noticeable. Eventually these packets will be dropped, either because they arrive at a router that does not have a route for these packets, or because the packet's time-to-live expires. These packets will consume bandwidth during transit and that is a new type of overhead introduced by Level 3 aggregation. A good Level 3 or Level 4 algorithm should consider the tradeoff between table size reduction, extra routable space introduced, and computation time. In our evaluation, we limited the newly introduced prefixes to be no longer than a certain length and found that setting the prefix length limit to 15 results in a good trade-off between table size reduction and extra routable space. 4. Updating Aggregated FIB Internet routes change over time, thus the obvious question is how to update the aggregated FIB when there is a change. Re-run the full FIB aggregation will maintain the best aggregation all the time, but it will also incur significant computation overhead since the full FIB aggregation accesses every tree node, comparing to updating an un-aggregated FIB which only accesses a single prefix node. We use the combination of four mechanisms to make sure that the computation cost of updating aggregated FIBs is under the control of operators. First, operators can choose one of the four aforementioned levels of full FIB aggregation that suits their routers the best. Routers with faster CPU and fewer routing updates can use higher level FIB aggregation, otherwise they can use lower level FIB aggregation. Second, we design an algorithm that updates the aggregated FIB incrementally. The algorithm tries to minimize the number of tree nodes that have to be accessed and changed to maintain forwarding correctness after a routing change. Third, the full FIB aggregation is only invoked when needed, e.g., the table size has crossed a threshold after being incrementally updated for a while, or when the router has free CPU cycles to spare, i.e., the router load is under a threshold. Fourth, as shown in [HighlyActive], a small number of highly active prefixes are responsible for a large number of routing changes. Therefore one can keep these highly active prefixes in the FIB without aggregating them out. Then every time they have a change, the update process does not incur extra overhead. Operators can configure the criteria for Zhang, et al. Expires April 29, 2010 [Page 14] Internet-Draft FIB Aggregation October 2009 deciding which prefixes should be kept in FIB regardless of the aggregation process. This last method will be evaluated in our future work. The basic algorithm of incrementally updating the aggregated FIB is described as follows. The algorithm must satisfy three goals: o Keep forwarding correctness; o Adhere to the aggregation level. That is, there is no generated prefixes at level 1 aggregation, no extra routable space at level 2, and no holes at level 3. o Minimize the number of changes to the FIB. Processing a routing update includes two steps: updating the labelled RIB and updating the aggregated FIB. The second step is straightforward as for whichever node that has changed its label from IN-FIB to NON-FIB or vice versa we will just apply such change to the FIB. Therefore we will focus on describing how the labelled RIB is incrementally updated. In general, when a prefix gets a new nexthop, its nearest descendants need to be re-aggregated, and when a prefix is withdrawn, its nearest descendants need to be de-aggregated. The details differ depending on the level of aggregation. We first define a few basic operations, and then use them to describe the incremental update algorithm for each level. o update-node(p): when an announcement of prefix p is received, insert the corresponding node if it does not exist in RIB, otherwise update its nexthop information if it's a different one. If p existed in the RIB but was a generated prefix, label it as a real prefix. Let A be p's nearest ancestor, if nexthop(A) == nexthop(p), label p as NON-FIB, otherwise IN-FIB. o re-aggregate(p): For each D of p's nearest descendants. If nexthop(D) != nexthop(p), label D IN-FIB. Optionally, we can also label D NON-FIB if nexthop(D) == nexthop(p), which does not affect forwarding correctness but reduces the FIB size at the expense of updating more FIB nodes. o de-aggregate(p): Let A be p's nearest ancestor. For each D of p's nearest descendants, if nexthop(D) != nexthop(A), label D IN-FIB. Optionally, we can also label D NON-FIB if nexthop(D) == nexthop(A), which does not affect forwarding correctness but reduces the FIB size at the expense of updating more FIB nodes. When A does not exist, its nexthop is considered to be NULL and the aforementioned actions still hold. Zhang, et al. Expires April 29, 2010 [Page 15] Internet-Draft FIB Aggregation October 2009 Using the above basic operations, we describe the incremental update algorithm for each level of aggregation as follows. o Level One: Upon receiving an announcement of prefix p, update- node(p) and re-aggregate(p). Upon receiving a withdrawal of prefix p, de-aggregate(p) and remove p from the RIB. o Level Two: Upon receiving an announcement of prefix p, update- node(p) and re-aggregate(p). Upon receiving withdrawal of prefix p, de-aggregate(p) and remove p. However, if p's nearest ancestor, A, is a generated prefix, we also need to de- aggregate(A) and remove A in order to prevent extra routable space. For example, in Figure 2, when one of the covered prefix is withdrawn, its generated parent prefix needs to be de- aggregated. o Level Three: Upon receiving an announcement of prefix p, update- node(p) and re-aggregate(p). During the re-aggregation, if any D is a generated prefix and nexthop(D) != nexthop(p), de- aggregate(D) and remove D.. This is needed since in Level 3 aggregation, a generated prefix is not supposed to appear as a descendant of any real prefix, otherwise the forwarding correctness may not hold (explained in Section 3.4). Upon receiving a withdrawal of prefix p, de-aggregate(p) and remove p. o Level Four: The processing of withdrawals is the same as in Level 3. The processing of announcements is mostly the same as in Level 3, except that the de-aggregation will take place even if a node turn from generated prefix to a real prefix with the same nexthop. In Level 4 aggregation, it is possible that a generated prefix covers another generated prefix. Therefore when a prefix becomes real, its descendants need to be de-aggregated to make sure there's no generated prefixes underneath. While in Level 3 aggregation, a generated prefix does not cover another generated prefix. 5. Evaluation We use publicly available routing tables from tens of networks collected by the RouteViews Oregon monitor [routeviews] to evaluate the various FIB aggregation algorithms for their table size reduction, computing times, and extra routable space. We also use BGP routing updates to evaluate our FIB update algorithm in terms of the frequency of FIB updates, the scope of the updates, computing times, and how often the full algorithm may need to run. Zhang, et al. Expires April 29, 2010 [Page 16] Internet-Draft FIB Aggregation October 2009 5.1. Methodology We evaluate different FIB aggregation algorithms using publicly available BGP routing tables taken from the RouteViews Oregon monitor [routeviews]. Although these routing tables contain valid next AS hops, they either do not have next-hop router information or do not reflect the diversity of next-hops that an operational router typically has, since the route monitors are not operational routers. Therefore we need to generate realistic next-hops based on known information. Our guideline of this process is trying to overestimate the number of next-hops so that the table reduction results reflect the worst case scenario, and real routers are likely to have better aggregation ratio. The specifics of generating next-hops are as follows. Tables downloaded from RouteViews only contain the AS path for each prefix. From the AS path, we can extract the next-AS-hop for each prefix. We assume that prefixes sharing the same next-AS-hop share the same iBGP neighbor (and thus the same next-hop router at the IP level). We use routing tables from looking glass servers, which provider the IBGP neighbor information, to validate this assumption. For each next-AS-hop, if there is only one iBGP neighbor, then all the prefixes using this next-AS-hop satisfy the assumption. If there are multiple iBGP neighbors, the one that carries the most prefixes is called "popular,'' and the prefixes that use the popular iBGP neighbors satisfy the assumption. We found that most routing tables have more than 90% of prefixes satisfying the assumption. Based on this assumption, we use next-AS-hop as the next-hop router in evaluation. This reflects the worst case scenario since large networks have hundreds to thousands of neighbor ASes, but the number of real next-hops should be much smaller. We verify the correctness of each aggregated FIB by dividing each original RIB prefix into multiple /24 sub prefixes and look up the /24 sub prefixes in the aggregated FIB, which should give the same next-hop as that in the RIB. All the results from our FIB aggregation algorithms and incremental update algorithms have been verified using this method. All the evaluation has been done on a Linux machine with an Intel Core 2 Quad 2.83GHz CPU using a single thread. The algorithms are implemented in C and no performance optimization techniques have been attempted. The Patricia trie implementation is taken from the C source code of Perl's Net::Patricia module [NetPatricia], which in turn was adapted from MRTD's [mrtd] source code. We use the public BGP routing tables to do the evaluation because these tables come from a diverse set of networks, from tier-1 ISPs to Zhang, et al. Expires April 29, 2010 [Page 17] Internet-Draft FIB Aggregation October 2009 small networks. However, in operational networks, there are other types of routes, such as VPN routes, which can be of a large number too. The FIB aggregation algorithms can be applied to these other types of routes as well, even though this paper does not evaluation the effectiveness of doing so. We plan to obtain forwarding tables from operational ISP routers to validate our results. 5.2. Table Size Reduction and Overhead We applied our algorithms to 37 routing tables archived at RouteViews on Dec. 31, 2008 and then calculated the ratio between the aggregated FIB size and the original routing table size. We obtained the following results: (1) each level of aggregation can reduce the FIB size more than the previous level, which is as expected; (2) even with the simple Level 1 aggregation, the FIB size can be reduced by 30% to 50%; (3) Level 4 aggregation can reduce the FIB size from 60% to over 90% with a median around 70% -- some of the tables have almost all the prefixes sharing the same nexthop, so they can be aggregated into very few entries; and (4) the Level 4A algorithm is slightly better than the Level 4B algorithm, although the difference is almost negligible. To evaluate the effectiveness of our algorithms over a longer period of time, we applied them to the RouteViews routing tables from 2001 to 2008. For each year, we obtained the median aggregation ratio of all the routing tables (see Figure 6). We observed an overall decreasing trend in the aggregation ratio (more table size reduction), suggesting that the FIB has become more amenable to aggregation over the years. One possible explanation is that the increasing practice of prefix splitting due to multi-homing and traffic engineering has made a larger percentage of FIB entries aggregatable. We plan to further investigate this phenomenon in our future work. Zhang, et al. Expires April 29, 2010 [Page 18] Internet-Draft FIB Aggregation October 2009 +==========+======+======+======+======+======+======+======+======+ |Algorithms| 2001 | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | +----------+------+------+------+------+------+------+------+------+ | Level-1 | 0.686| 0.707| 0.686| 0.653| 0.672| 0.660| 0.669| 0.665| +----------+------+------+------+------+------+------+------+------+ | Level-2 | 0.556| 0.564| 0.527| 0.491| 0.509| 0.491| 0.494| 0.479| +----------+------+------+------+------+------+------+------+------+ | Level-3 | 0.498| 0.518| 0.470| 0.433| 0.462| 0.447| 0.455| 0.437| +----------+------+------+------+------+------+------+------+------+ | Level-4A | 0.396| 0.430| 0.367| 0.347| 0.367| 0.353| 0.358| 0.343| +----------+------+------+------+------+------+------+------+------+ | Level-4B | 0.421| 0.458| 0.389| 0.363| 0.391| 0.370| 0.366| 0.363| +==========+======+======+======+======+======+======+======+======+ Figure 6 In order to understand the significance of the above results, we use the size of the routing table from 1994 to 2009 to translate the table size reduction into how many years we turn the clock back for a router. The data is obtained from bgp.potaroo.net, a site that tracks the growth of the BGP table size. We found that the FIB size in 2001 is around 30% of the FIB size on 12/31/2008, which means that if an ISP uses our Level 4 aggregation algorithm, it will still be able to use routers that were deployed in 2001. Assuming these routers were sufficiently provisioned when they were purchased seven or eight years ago, they should still be able to accommodate some further routing table growth. We measured the computing time incurred by our algorithms to aggregate each of the 37 routing tables (see Figure 7). The Level 1 to 3 algorithms typically require tens of milliseconds, while the Level 4 algorithms consume at most a few hundreds milliseconds. An operational router may have slower CPU than our commodity Linux machine, but it has specialized hardware and software, thus it is hard to infer a router's computing time from what we report here. Nevertheless, the simplicity of the algorithms and the very short computing time suggest that the computational overhead in an operational router may be small. Moreover, our results can be used to compare the relative speed between different aggregation algorithms. For example, we can observe that the Level 4B algorithm is more computing intensive than the Level 4A algorithm. Zhang, et al. Expires April 29, 2010 [Page 19] Internet-Draft FIB Aggregation October 2009 +==========+==========+==========+=============+ |Algorithms| Max (ms) | Min (ms) | Median (ms) | +----------+----------+----------+-------------+ | Level-1 | 58.0 | 53.0 | 55.8 | +----------+----------+----------+-------------+ | Level-2 | 74.0 | 63.0 | 66.0 | +----------+----------+----------+-------------+ | Level-3 | 76.7 | 66.9 | 69.8 | +----------+----------+----------+-------------+ | Level-4A | 120.0 | 110.0 | 115.5 | +----------+----------+----------+-------------+ | Level-4B | 398.5 | 200.0 | 269.3 | +==========+==========+==========+=============+ Figure 7 We also measured the amount of extra routable space (measured by the number of /8 prefixes) introduced by three of our algorithms (see Figure 8. Since Level 1 and 2 algorithms do not introduce new routable space, we do not present them here. We found that the Level 3 algorithm is better than the Level 4 algorithms in this regard, while the Level 4B algorithm covers more routable space than the Level 4A algorithm. This is mainly because the 4B algorithm aggregates prefixes from top to bottom, which introduces more shorter prefixes than the 4A algorithm. +==========+===========+===========+==============+ |Algorithms| Max (/8s) | Min (/8s) | Median (/8s) | +----------+-----------+-----------+--------------+ | Level-3 | 6.50 | 1.49 | 1.72 | +----------+-----------+-----------+--------------+ | Level-4A | 7.03 | 4.58 | 5.09 | +----------+-----------+-----------+--------------+ | Level-4B | 7.19 | 6.76 | 7.14 | +==========+===========+===========+==============+ Figure 8 5.3. Routing Update Handling We use one month (December 2008) of BGP routing updates collected by RouteViews from the Level 3 Communications peer router to evaluate our incremental update handling algorithms. We summarize our results in Figure 9. There were 7,254,478 routing updates during this month. For each unaggregated or aggregated FIB, we counted how many of these routing updates would cause an update to the FIB, i.e. an insertion, removal, or a change to the next-hop of a FIB entry. There were Zhang, et al. Expires April 29, 2010 [Page 20] Internet-Draft FIB Aggregation October 2009 2,914,020 updates to the unaggregated FIB. We make the following observations: (a) the RIB processing time per routing update is increased from 0.593us to 0.806us for Level 1 aggregation (36% increase) and 0.824us for Level 4A aggregation (42% increase) due to the extra operations required by the update handling algorithm, i.e. de-aggregation in certain cases; (b) the total FIB processing time is decreased by 3% to 10%. More specifically, although there is a slightly higher total number of affected prefixes for the aggregated FIBs than the unaggregated FIB (6th column of Figure 9), each prefix takes less time to update in an aggregated FIB (last column of Figure 9), leading to a lower total FIB processing time. The lower FIB update time per prefix is likely due to the small FIB size after aggregation (i.e. faster prefix lookup). +==========+=====+=====+=======+=======+=======+=====+=====+ |Algorithms|T_rib|t_rib| N_fib | n_fib | p_fib |T_fib|t_fib| | | (s) | (us)| | | | (s) | (us)| +----------+-----+-----+-------+-------+-------+-----+-----+ | Original | 4.30|0.593|2914020|2914020| 1.000| 2.60|0.892| +----------+-----+-----+-------+-------+-------+-----+-----+ | Level-1 | 5.85|0.806|2904630|2921335| 1.005| 2.53|0.866| +----------+-----+-----+-------+-------+-------+-----+-----+ | Level-2 | 5.96|0.822|2901530|2940178| 1.013| 2.45|0.833| +----------+-----+-----+-------+-------+-------+-----+-----+ | Level-3 | 5.98|0.824|2900389|2941398| 1.014| 2.42|0.823| +----------+-----+-----+-------+-------+-------+-----+-----+ | Level-4A | 6.10|0.841|2897450|2942969| 1.016| 2.33|0.792| +----------+-----+-----+-------+-------+-------+-----+-----+ | Level-4B | 6.41|0.880|2913988|3388764| 1.162| 2.61|0.770| +==========+=====+=====+=======+=======+=======+=====+=====+ Figure 9 T_rib: total RIB processing time; t_rib: average RIB processing time per routing update; N_fib: total number of FIB updates; n_fib: total number of prefixes affected in the FIB; p_fib: average number of affected prefixes per FIB update; T_fib: total FIB processing time; t_fib: average FIB processing time per affected prefix. Note that there may be fewer routing updates that cause changes to an aggregated FIB than the unaggregated FIB (see the 4th column of Figure 9). For example, the aggregated FIB from our Level 4A algorithm had 16,570 fewer FIB updates than the unaggregated FIB. This is due to two reasons. First, some of the route withdrawals may be for prefixes already removed from the FIB by the aggregation Zhang, et al. Expires April 29, 2010 [Page 21] Internet-Draft FIB Aggregation October 2009 algorithm. Second, our update algorithm minimizes the number of FIB updates at the cost of slightly increased FIB size. On the other hand, each FIB update may affect more prefixes in the case of the aggregated FIB (6th column of Figure 9). Nevertheless, the total number of affected prefixes for the aggregated FIBs is less than 1% higher than the unaggregated FIB. Since each prefix takes less time to update in the FIB, there is an overall time saving in updating the aggregated FIBs. In summary, FIB aggregation can reduce both the FIB size and FIB update time. However, these benefits come at the cost of higher RIB processing time. How to reduce the time overhead of the update handling algorithm in one area of our future research. Since our update handling algorithm trades off the FIB size for fewer changes, the FIB needs to be re-aggregated when its size reaches a certain threshold. To estimate how frequently the re-aggregation will be triggered, we measure the growth of the FIB size as our algorithm handles the BGP updates during the month of Dec. 2008 (see Figure 10). The Level 4A aggregated FIB had 122,493 entries on Dec. 1, 2008. If we set the FIB size threshold to be 150,000 entries (about 55% of the full routing table size), the FIB would be re- aggregated on Dec. 4, 2008. We modified our update handling algorithm to do a full aggregation when the FIB size reaches 150,000 and found that the full aggregation was indeed triggered once every few days (see Figure 11. Considering that each full aggregation run takes at most a few hundred milliseconds on our commodity PC (perhaps a little longer on a router), incurring this overhead every few days or so should not be a concern for an ISP. 2008.12.01 122493 2008.12.02 134973 2008.12.03 144604 2008.12.04 153646 2008.12.05 161553 2008.12.06 163750 2008.12.07 165838 2008.12.08 169662 2008.12.09 174251 2008.12.10 177586 2008.12.11 182915 2008.12.12 186142 2008.12.13 188863 2008.12.14 191313 2008.12.15 192892 2008.12.16 195079 2008.12.17 197376 2008.12.18 199840 2008.12.19 200484 2008.12.20 202941 2008.12.21 203587 2008.12.22 204350 2008.12.23 205750 2008.12.24 206364 2008.12.25 208668 2008.12.26 209141 2008.12.27 209494 2008.12.28 210531 2008.12.29 211312 2008.12.30 212679 2008.12.31 212897 Figure 10 Zhang, et al. Expires April 29, 2010 [Page 22] Internet-Draft FIB Aggregation October 2009 2008.12.01 122493 2008.12.02 134973 2008.12.03 144604 2008.12.04 118495 2008.12.05 134863 2008.12.06 140979 2008.12.07 144169 2008.12.08 105650 2008.12.09 125091 2008.12.10 135525 2008.12.11 146605 2008.12.12 119185 2008.12.13 128680 2008.12.14 135729 2008.12.15 142430 2008.12.16 147821 2008.12.17 122117 2008.12.18 135979 2008.12.19 144314 2008.12.20 114800 2008.12.21 122179 2008.12.22 129412 2008.12.23 136886 2008.12.24 141400 2008.12.25 146456 2008.12.26 149222 2008.12.27 113078 2008.12.28 120509 2008.12.29 120567 2008.12.30 136701 2008.12.31 141190 Figure 11 6. Conclusion We have presented an in-depth analysis of FIB aggregation and our results suggest that it is a viable short-term solution to the problem of growing FIB table size. Our aggregation algorithms reduces the FIB size by as much as 70% and requires no hardware changes or network-wide software/configuration changes, thus reducing the need for ISP router upgrades in the short term. During this time, the research community and the industry can work on a long-term solution to reduce both the routing table and the FIB table. Moreover, FIB aggregation can co-exist with any long-term solution to further reduce ISPs' operational costs. 7. Acknowledgements The authors are part of the eFIT project team funded by the US National Science Foundation. 8. Security Considerations This draft is a discussion on the Internet's necessity to follow an evolutionary path towards the future. There is no direct impact on the Internet security. 9. Informative References [Cain02AUTOAGG] Cain, B., "Auto aggregation method for IP prefix/length pairs", Patent, June 2002, Zhang, et al. Expires April 29, 2010 [Page 23] Internet-Draft FIB Aggregation October 2009 . [Herrin] Herrin, W., "Opportunistic Topological Aggregation in the RIB-FIB Calculation?", July 2008, . [HighlyActive] Oliveira, R., Izhak-Ratzin, R., Zhang, B., and L. Zhang, "Measurement of Highly Active Prefixes in BGP". [Li09RRG] Li, T., "Preliminary Recommendation for a Routing Architecture", Internet Draft, March 2009, . [NetPatricia] Plonka, D. and P. Prindeville, "Net-Patricia Perl Module", . [RAWS] Meyer, D., Zhang, L., and K. Fall, "Report from the IAB Workshop on Routing and Addressing", Internet Draft, 2007, . [RFC4984] Meyer, D., Zhang, L., and K. Fall, "Report from the IAB Workshop on Routing and Addressing", RFC 4984, September 2007. [RouterTopology] Li, L., Alderson, D., Willinger, W., and J. Doyle, "A first-principles approach to understanding the Internet's router-level topology", ACM SIGCOMM 2004. [Virtual_Aggregation] Francis, P., Xu, X., and H. Billani, "FIB Suppression with Virtual Aggregation and Default Routes", draft-francis-idr-intra-va-01, September 2008. [mrtd] "MRTD: The Multi-Threaded Routing Toolkit", . [routeviews] Advanced Network Technology Center and University of Oregon, "The RouteViews Project", . Zhang, et al. Expires April 29, 2010 [Page 24] Internet-Draft FIB Aggregation October 2009 Authors' Addresses Beichuan Zhang Univ. of Arizona Email: bzhang@arizona.edu Lan Wang Univ. of Memphis Email: lanwang@memphis.edu Xin Zhao Univ. of Arizona Email: zhaox@email.arizona.edu Yaoqing Liu Univ. of Memphis Email: yliu6@memphis.edu Lixia Zhang UCLA Email: lixia@cs.ucla.edu Zhang, et al. Expires April 29, 2010 [Page 25]