
The Need for Fast Reroute

Network administrators have been dealing with link and node failures for as long as there have been networks. It has traditionally fallen to the Interior Gateway Protocol (IGP) to quickly route around failures, converging on the remaining topology. However, there are a few things the IGP doesn't do that well when it comes to convergence:

  • In a large network, your IGP can take quite a few seconds to converge; until the entire network is converged, there is packet loss. It's not uncommon to see 5 to 10 seconds of packet loss when a core link flaps in a large network.

  • A link failure can lead to congestion in some parts of the network while leaving other parts free of congestion.

  • Configuring the IGP to converge quickly can make it overly sensitive to minor packet loss, causing false positives (a healthy link declared down) and IGP convergence for no reason.

Also, assuming that the IGP is a link-state protocol, SPF has to be run once when the link goes down and then again when it comes back up. This problem is exacerbated with MPLS TE: If a link that is a part of an LSP fails, the LSP is torn down. After the headend recomputes a new path, SPF has to be run again for prefixes routed over the tunnel when autoroute is in place, thus making convergence times even worse than in a pure IP network.

IP networks that use SONET can also employ automatic protection switching (APS) to aid in quick recovery from link failures. The goal of APS is to switch over from an active link to a standby link within 50 milliseconds upon failure of the active link. However, if APS is run directly on a router, even after APS switches traffic over to the standby link, the IGP still needs to converge with the new neighbor on the other end of the link. Until the new IGP neighbors come up, packets might still be dropped.

APS also comes at additional cost: the hardware cost of the add/drop multiplexer (ADM) required to achieve APS.

Luckily, there's an alternative to all this. You can use MPLS TE's FRR capabilities to minimize packet loss, without all the drawbacks of APS or fast IGP convergence.

RFC 2702, "Requirements for Traffic Engineering Over MPLS," describes the "resilience attribute" as the behavior of a traffic trunk under fault conditions:

A basic resilience attribute indicates the recovery procedure to be applied to traffic trunks whose paths are impacted by faults.

In MPLS TE, the basic recovery procedure is called headend LSP reroute or simply headend reroute.

At its simplest, headend rerouting is calculating a new path for an LSP after its existing path goes down. However, during the time required to perform this basic reroute, there can be significant traffic loss; the packet loss is potentially worse than with regular IP routing if you are autorouting over the TE tunnel. This is because you first need to signal a new TE LSP through RSVP and run SPF for destinations that need to be routed over the tunnel. It is desirable to be able to deal with a link or node failure in a way that has less loss than the basic headend LSP reroute.

Normally, when a link or node fails, the failure is signaled to the headends that had LSPs going through the failed link or node. The affected headends then attempt to find new paths across the network for these tunnels.

Although a few seconds of loss is generally acceptable for data traffic, real-time applications such as voice, video, and some legacy applications might not be so forgiving. Many attempts have been made and drafts submitted to the IETF to solve this problem.

Within Cisco, the question arose as to how to make MPLS TE tunnels resistant to failures within the network. Mechanisms were developed to address this question, and they are collectively known as FRR, fast restoration mechanisms, or simply protection. Although it's impossible to have a completely lossless failure recovery mechanism, it's certainly possible to have mechanisms that minimize loss as much as possible.

Generally speaking, the goal for the different FRR mechanisms is to achieve as little packet loss as possible. Practically, this translates into anything from SONET-like recovery times (50 ms or less) to a few hundred milliseconds of loss before FRR kicks in.
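To put these recovery times in perspective, the following minimal sketch estimates how many packets a single flow loses during an outage of a given length. The flow rate of 50 packets per second is an assumption chosen for illustration (roughly one voice stream), and the 300 ms figure is an assumed stand-in for "a few hundred milliseconds"; the 50 ms and 5 to 10 second figures are the ones quoted in this section. The function name `packets_lost` is hypothetical.

```python
def packets_lost(packets_per_second: int, outage_ms: int) -> int:
    """Approximate packets dropped by one flow during an outage (integer math)."""
    if packets_per_second < 0 or outage_ms < 0:
        raise ValueError("rate and outage must be non-negative")
    return packets_per_second * outage_ms // 1000


# Hypothetical flow rate: 50 packets per second (assumed, e.g. one voice stream).
FLOW_PPS = 50

# Outage durations in milliseconds. 50 ms and 5-10 s come from the text;
# 300 ms is an assumed value for "a few hundred milliseconds".
OUTAGES_MS = {
    "FRR, SONET-like (50 ms)": 50,
    "FRR, slower case (300 ms, assumed)": 300,
    "IGP convergence (5 s)": 5_000,
    "IGP convergence (10 s)": 10_000,
}

if __name__ == "__main__":
    for label, outage_ms in OUTAGES_MS.items():
        lost = packets_lost(FLOW_PPS, outage_ms)
        print(f"{label}: ~{lost} packets lost")
```

Under these assumptions, the flow loses about 2 packets during a 50 ms outage, compared with 250 to 500 packets during 5 to 10 seconds of IGP convergence. The absolute numbers scale linearly with the flow rate; the point is the difference of two orders of magnitude.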
