Network Availability Protocols



Network availability protocols, which include link integrity protocols, link bundling protocols, loop detection protocols, first-hop redundancy protocols (FHRP), and routing protocols, increase the resiliency of devices connected within a network. Network resiliency relates to how the overall design implements redundant links and topologies and how the control-plane protocols are optimally configured to operate within that design. The use of physical redundancy is a critical part of ensuring the availability of the overall network. If a network device fails, having a redundant path means the overall network can continue to operate. The control-plane capabilities of the network provide the means to manage how the physical redundancy is leveraged, how the network load balances traffic, how the network converges, and how the network is operated.
You can apply the following basic principles to network availability technologies:
  • Wherever possible, leverage the capability of the device hardware to provide the primary detection and recovery mechanism for network failures. This ensures both a faster and a more deterministic failure recovery.
  • Implement a defense-in-depth approach to failure detection and recovery mechanisms. Multiple protocols, operating on different network layers, can complement each other in detecting and reacting to network failures.
  • Ensure that the design is self-stabilizing. Use control-plane modularization to ensure that any failures are isolated in their impact and that the control plane prevents any flooding or thrashing conditions from arising.
These principles are intended to be a complementary part of the overall structured modular design approach to the network architecture and primarily serve to reinforce good resilient network design practices.
Note 
A complete discussion of all network availability technologies and best practices could easily fill an entire volume. Therefore, this discussion introduces and provides only an overview of the network availability technologies most relevant to TelePresence enterprise network deployments.
The protocols discussed in this section can be subdivided into Layer 2 (L2) and Layer 3 (L3) network availability protocols.

Nonstop Forwarding with Stateful Switchover | Device Availability Technologies



SSO is a redundant route- and switch-processor availability feature that significantly reduces mean time to repair (MTTR) by allowing extremely fast switchover between the active and standby processors. SSO is supported on routers (such as the Cisco 7600, 10000, and 12000 series families) and switches (such as the Catalyst 4500 and 6500 series families).
Prior to discussing the details of SSO, a few definitions might be helpful. For example, “state” in SSO refers to maintaining—among many other elements—the following between the active and standby processors:
  • Layer 2 protocol configurations and current status
  • Layer 3 protocol configurations and current status
  • Multicast protocol configurations and current status
  • QoS policy configurations and current status
  • Access list policy configurations and current status
  • Interface configurations and current status
Also, the adjectives cold, warm, and hot denote the readiness of the system and its components to assume the network services functionality and the job of forwarding packets to their destination. These terms appear in conjunction with Cisco IOS verification command output relating to NSF/SSO and with many high-availability feature descriptions (a brief verification sketch follows this list):
  • Cold: Cold redundancy refers to the minimum degree of resiliency that has traditionally been provided by a redundant system. A redundant system is cold when no state information is maintained between the backup or standby system and the system it protects. Typically, a cold system must complete a boot process before it comes online and is ready to take over from a failed system.
  • Warm: Warm redundancy refers to a degree of resiliency beyond the cold standby system. In this case, the redundant system has been partially prepared, but it does not have all the state information known by the primary system, so it cannot take over immediately. Some additional information must be determined or gleaned from the traffic flow or the peer network devices to handle packet forwarding. A warm system would already be booted up and would need only to learn or generate state information prior to taking over from a failed system.
  • Hot: Hot redundancy refers to a degree of resiliency where the redundant system is fully capable of handling the traffic of the primary system. Substantial state information has been saved, so the network service is continuous, and the traffic flow is minimally or not affected.
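For example, on a Catalyst 6500-class chassis with dual supervisors, the readiness of the standby supervisor can be checked from the command line. The following is a minimal verification sketch; the exact output varies by platform and software release, so treat it as illustrative:
Router# show redundancy states
! Reports the configured and operational redundancy mode; when SSO is fully
! synchronized, the peer (standby) state is reported as hot (for example, STANDBY HOT).
Router# show module
! Lists both supervisors and indicates which is currently active and which is standby.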
To better understand SSO, it might be helpful to consider its operation in detail within a specific context, such as within a Cisco Catalyst 6500 with two supervisors per chassis.
The supervisor engine that boots first becomes the active supervisor engine. The active supervisor is responsible for control-plane and forwarding decisions. The second supervisor is the standby supervisor, which does not participate in the control- or data-plane decisions. The active supervisor synchronizes configuration and protocol state information to the standby supervisor, which is in a hot-standby mode. As a result, the standby supervisor is ready to take over the active supervisor responsibilities if the active supervisor fails. This “take-over” process from the active supervisor to the standby supervisor is referred to as switchover.
Only one supervisor is active at a time, and supervisor-engine redundancy does not provide supervisor-engine load balancing. However, the interfaces on a standby supervisor engine are active when the supervisor is up and, thus, can be used to forward traffic in a redundant configuration.
NSF/SSO evolved from a series of progressive enhancements designed to reduce the MTTR associated with specific supervisor hardware and software outages. NSF/SSO builds on the earlier work known as Route Processor Redundancy (RPR) and RPR Plus (RPR+). Each of these redundancy modes of operation incrementally improves upon the functions of the previous mode (a brief configuration sketch of the older modes follows this list):
  • RPR: The first redundancy mode of operation introduced in Cisco IOS Software. In RPR mode, the startup configuration and boot registers are synchronized between the active and standby supervisors; the standby is not fully initialized; and images between the active and standby supervisors do not need to be the same. Upon switchover, the standby supervisor becomes active automatically, but it must complete the boot process. In addition, all line cards are reloaded, and the hardware is reprogrammed. Because the standby supervisor is “cold,” the RPR switchover time is 2 or more minutes.
  • RPR+: An enhancement to RPR in which the standby supervisor is completely booted, and line cards do not reload upon switchover. The running configuration is synchronized between the active and the standby supervisors, which run the same software versions. All synchronization activities inherited from RPR are also performed. The synchronization is done before the switchover, and the information synchronized to the standby is used when the standby becomes active to minimize the downtime. No link layer or control-plane information is synchronized between the active and the standby supervisors. Interfaces might bounce after switchover, and the hardware contents need to be reprogrammed. Because the standby supervisor is “warm,” the RPR+ switchover time is 30 or more seconds.
  • NSF with SSO: NSF works in conjunction with SSO to ensure Layer 3 integrity following a switchover. It allows a router experiencing the failure of an active supervisor to continue forwarding data packets along known routes while the routing protocol information is recovered and validated. This forwarding can continue to occur even though peering arrangements with neighbor routers have been lost on the restarting router. NSF relies on the separation of the control plane and the data plane during supervisor switchover. The data plane continues to forward packets based on pre-switchover Cisco Express Forwarding (CEF) information. The control plane implements graceful restart routing protocol extensions to signal a supervisor restart to NSF-aware neighbor routers, re-form its neighbor adjacencies, and rebuild its routing protocol database (in the background) following a switchover. Because the standby supervisor is “hot,” the NSF/SSO switchover time is 0 to 3 seconds.
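For comparison with the SSO configuration shown later in this section, the older RPR and RPR+ modes are selected under the same redundancy stanza. The following is a minimal sketch assuming a Catalyst 6500-class platform; keyword availability varies by platform and software release, and only one mode can be operational at a time:
Router(config)# redundancy
Router(config-red)# mode rpr
! Cold standby: the standby supervisor must finish booting, and line cards reload on switchover
Router(config-red)# mode rpr-plus
! Warm standby: the standby is fully booted, but no link-layer or control-plane state is synchronized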
As previously described, neighbor nodes play a role in NSF function. A node that is capable of continuous packet forwarding during a route processor switchover is NSF-capable. Complementing this functionality, an NSF-aware peer router can enable neighbor recovery without resetting adjacencies and can support routing database resynchronization in the background. Figure 1 illustrates the difference between NSF-capable and NSF-aware routers. To gain the greatest benefit from NSF/SSO deployment, NSF-capable routers should be peered with NSF-aware routers, although this is not absolutely required for implementation. Only limited benefit is achieved unless routing peers are aware that the restarting node can continue forwarding packets, and unless they assist in restoring and verifying the integrity of the routing tables after a switchover.

 
Figure 1: NSF-capable versus NSF-aware routers
Cisco NSF and SSO are designed to be deployed together. NSF relies on SSO to ensure that links and interfaces remain up during switchover and that lower layer protocol state is maintained. However, it is possible to enable SSO with or without NSF because these are configured separately.
The configuration to enable SSO is simple, as shown here:
Router(config)# redundancy
Router(config-red)# mode sso
NSF, on the other hand, is configured within the routing protocol and is supported within EIGRP, OSPF, IS-IS, and (to an extent) BGP. NSF functionality is also sometimes called “graceful restart.”
To enable NSF for EIGRP, enter the following commands:
Router(config)# router eigrp 100
Router(config-router)# nsf
Similarly, to enable NSF for OSPF, enter the following commands:
Router(config)# router ospf 100
Router(config-router)# nsf
Continuing the example, to enable NSF for IS-IS, enter the following commands:
Router(config)# router isis level2
Router(config-router)# nsf cisco
And finally, to enable NSF/graceful-restart for BGP, enter the following commands:
Router(config)# router bgp 100
Router(config-router)# bgp graceful-restart
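To confirm that NSF/graceful restart is in effect for each protocol, the standard routing protocol verification commands can be used. The following is a hedged sketch; the exact wording of the output differs between protocols and software releases:
Router# show ip protocols
! For EIGRP and OSPF, indicates whether NSF is enabled and displays the associated NSF timers
Router# show ip ospf
! Reports whether Non-Stop Forwarding is enabled for the OSPF process
Router# show ip bgp neighbors
! Shows whether the Graceful Restart capability was advertised to and received from each BGP peer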
You can see from the example of NSF that the line between device-level availability technologies and network availability technologies sometimes is blurry. A discussion of more network availability technologies follows.

StackWise/StackWise Plus | Device Availability Technologies



Cisco StackWise and StackWise Plus technologies create a unified, logical switching architecture by linking multiple fixed-configuration Catalyst 3750G and 3750E switches.
Note 
Cisco 3750G switches use StackWise technology, and Cisco 3750E switches can use either StackWise or StackWise Plus. (StackWise Plus is used only if all switches within the group are 3750E switches; whereas, if some switches are 3750E and others are 3750G, StackWise technology will be used.)
Also, to avoid excessive wordiness, “StackWise” is used in this section to refer to both StackWise and StackWise Plus technologies, except at the end of this section, where the differences between the two are explicitly pointed out.
Cisco StackWise technology intelligently joins individual switches to create a single switching unit with a 32-Gbps switching stack interconnect. Configuration and routing information is shared by every switch in the stack, creating a single switching unit. Switches can be added to and deleted from a working stack without affecting availability.
The switches unite into a single logical unit using special stack interconnect cables that create a bidirectional closed-loop path. This bidirectional path acts as a switch fabric for all the connected switches. Network topology and routing information is updated continuously through the stack interconnect. All stack members have full access to the stack interconnect bandwidth. The stack is managed as a single unit by a master switch, which is elected from one of the stack member switches.
Each switch in the stack has the capability to behave as a master in the hierarchy. The master switch is elected and serves as the control center for the stack. Each switch is assigned a number. Up to nine separate switches can be joined together.
Each stack of Cisco Catalyst 3750 Series switches has a single IP address and is managed as a single object. This single IP management applies to activities such as fault detection, VLAN creation and modification, security, and quality of service (QoS) controls. Each stack has only one configuration file, which is distributed to each member in the stack. This allows each switch in the stack to share the same network topology, MAC address, and routing information. In addition, it allows for any member to immediately take over as the master, if there is a master failure.
To efficiently load balance the traffic, packets are allocated between two logical counter-rotating paths. Each counter-rotating path supports 16 Gbps in both directions, yielding a traffic total of 32 Gbps bidirectionally. When a break is detected in a cable, the traffic is immediately wrapped back across the single remaining 16-Gbps path (within microseconds) to continue forwarding.
Switches can be added to a working stack without affecting stack availability. (However, adding switches to a stack might have QoS performance implications.) Similarly, switches can be removed from a working stack without any operational effect on the remaining switches.
Stacks require no explicit configuration but are automatically created by StackWise when individual switches are joined together with stacking cables, as shown in Figure 1. When the stack ports detect electromechanical activity, each port starts to transmit information about its switch. When the complete set of switches is known, the stack elects one of the members to be the master switch, which will be responsible for maintaining and updating configuration files, routing information, and other stack information. This process is referred to as hot stacking.

 
Figure 1: Catalyst 3750G StackWise cabling
Courtesy of Cisco Systems, Inc. Unauthorized use not permitted.
Note 
Master switch election occurs only on stack initialization or if there is a master switch failure. If a new, more favorable switch is added to a stack, this will not trigger a master switch election, nor will any sort of preemption occur.
Each switch in the stack can serve as a master, creating a 1:N availability scheme for network control. In the unlikely event of a single unit failure, all other units continue to forward traffic and maintain operation. Furthermore, each switch is initialized for routing capability and is ready to be elected as master if the current master fails. Subordinate switches are not reset so that Layer 2 forwarding can continue uninterrupted.
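Although stack formation itself requires no explicit configuration, the master election can be influenced, and the state of the stack verified, from the CLI. The following is a minimal sketch assuming a Catalyst 3750 stack; the member number and priority value are illustrative:
Switch(config)# switch 1 priority 15
! Assigns the highest priority (range 1 to 15) to stack member 1 so that it is preferred
! as master at the next election; as noted earlier, no preemption occurs on a running stack.
Switch# show switch
! Lists each stack member with its member number, role (master or member), priority, and state.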
The three main differences between StackWise and StackWise Plus are as follows:
  • StackWise uses source stripping, and StackWise Plus uses destination stripping (for unicast packets). Source stripping means that when a packet is sent on the ring, it is passed to the destination, which copies the packet and then lets it pass all the way around the ring. After the packet has traveled all the way around the ring and returns to the source, it is stripped off the ring. This means bandwidth is used up all the way around the ring, even if the packet is destined for a directly attached neighbor. Destination stripping means that when the packet reaches its destination, it is removed from the ring and continues no further. This leaves the rest of the ring bandwidth free to be used. Thus, the aggregate throughput of the stack increases, to a minimum of 64 Gbps bidirectionally. This capability to free up bandwidth is sometimes referred to as spatial reuse.
    Note 
    Even in StackWise Plus, broadcast and multicast packets must use source stripping because the packet might have multiple targets on the stack.
  • StackWise Plus can locally switch; StackWise cannot. In StackWise Plus, packets originating from and destined to ports on the same local switch do not have to traverse the stack ring, which results in more efficient switching. In contrast, in StackWise, because there is no local switching and because there is source stripping, even locally destined packets must traverse the entire stack ring.
  • StackWise Plus can support up to two Ten Gigabit Ethernet ports per Cisco Catalyst 3750-E.
Finally, both StackWise and StackWise Plus can support Layer 3 NSF when two or more nodes are present in a stack. NSF and SSO are discussed together in a dedicated section.

Device Availability Technologies



Most network designs have single points of failure, and the overall availability of the network might depend on the availability of a single device. A prime example of this is the access layer of a campus network. Most endpoint devices connect to the access switch through a single network interface card (NIC), which is referred to as being single-homed. Therefore, access switches represent a single point of failure for all attached single-homed devices, including Cisco TelePresence System (CTS) codecs.
Note 
Beginning with CTS 1.5 software, Cisco TelePresence Multipoint Switches can utilize a NIC teaming feature that enables them to be multihomed devices, that is, devices that connect to multiple access switches. Multihoming prevents the access switch from being a single point of failure and thus improves overall availability.
Ensuring the availability of network services is often dependent on the resiliency of the individual devices. Device resiliency, as with network resiliency, is achieved through a combination of the appropriate level of physical redundancy, device hardening, and supporting software features. Studies indicate that the most common failures in campus networks are Layer 1 failures of components such as power supplies, fans, and fiber links. The use of diverse fiber paths with redundant links and line cards, combined with fully redundant power supplies and power circuits, is the most critical aspect of device resiliency. Redundant power supplies become even more critical in access switches with the introduction of Power over Ethernet (PoE) devices such as IP phones, because multiple devices are now dependent on the availability of the access switch and on its capability to maintain the necessary level of power for all the attached end devices.
After physical failures, the most common cause of device outage is the failure of supervisor hardware or software. Network outages caused by the loss or reset of a device following a supervisor failure can be addressed through supervisor redundancy. Cisco Catalyst switches provide two mechanisms to achieve this additional level of redundancy:
  • Cisco StackWise/StackWise-Plus
  • Cisco Nonstop Forwarding (NSF) with Stateful Switchover (SSO)
Both of these mechanisms, each discussed in its own section, provide a hot active backup for the switching fabric and control plane, thus ensuring that data forwarding and the network control plane seamlessly recover (with subsecond traffic loss, if any) during any form of software or supervisor hardware crash.

TelePresence Phases of Deployment



As TelePresence technologies evolve, so too will the complexity of deployment solutions. Therefore, customers will likely approach their TelePresence deployments in phases, with the main phases of deployment as follows:
  • Phase 1. Intracampus/Intra-enterprise deployments: Most enterprise customers will likely begin their TelePresence rollouts by provisioning (point-to-point) intra-enterprise TelePresence deployments. View this model as the basic TelePresence building block on which more complex models can be built.
  • Phase 2. Intra-enterprise multipoint deployments: Because collaboration requirements might not always be facilitated with point-to-point models, the next logical phase of TelePresence deployment is to introduce multipoint resources to the intra-enterprise deployment model. Phases 1 and 2 might at times be undertaken simultaneously.
  • Phase 3. Intercompany deployments: To expand the application and business benefits of TelePresence meetings to include external (customer- or partner-facing) meetings, an intercompany deployment model can subsequently be overlaid on either point-to-point or multipoint intra-enterprise deployments.
  • Phase 4. TelePresence to the executive home: Because of the high executive-perk appeal of TelePresence and the availability of high-speed residential bandwidth options (such as fiber to the home), some executives might benefit greatly from deploying TelePresence units to their residences. Technically, this is simply an extension of the intra-enterprise model but might also be viewed as a separate phase because of the unique provisioning and security requirements posed by such residential TelePresence deployments, as illustrated in Figure 1.

     
Figure 1: TelePresence to the executive home (an extension of the intra-enterprise deployment model)

Intercompany Deployment Model



The intercompany deployment model not only connects TelePresence systems within an enterprise, but also allows TelePresence systems within one enterprise to call systems within another enterprise. The intercompany model expands on the intracampus and intra-enterprise models to include connectivity between different enterprises, in either a point-to-point or a multipoint manner. It offers a significant increase in value to the TelePresence deployment by greatly increasing the number of endpoints with which a unit can communicate. This model is also at times referred to as the business-to-business (B2B) TelePresence deployment model.
The intercompany model offers the most flexibility and is suitable for businesses that often require employees to travel extensively for both internal and external meetings. In addition to the business advantages of the intra-enterprise model, the intercompany deployment model lets employees maintain high-quality customer relations without the associated costs of travel time and expense.
The network infrastructure of the intercompany deployment model builds on the intra-enterprise model and requires the enterprises to share a common MPLS VPN service provider. Additionally, the MPLS VPN service provider must have a “shared services” Virtual Routing and Forwarding (VRF) instance provisioned with a Cisco IOS XR Session Border Controller (SBC).
The Cisco SBC bridges a connection between two separate MPLS VPNs to enable secure inter-VPN communication between enterprises. Additionally, the SBC provides topology and address hiding services, NAT and firewall traversal, fraud and theft-of-service prevention, DDoS detection and prevention, call admission control policy enforcement, encrypted media pass-through, and guaranteed QoS.
Figure 1 illustrates the intercompany TelePresence deployment model.

 
Figure 1: Intercompany TelePresence deployment model
Note 
The initial release of the intercompany solution requires a single service provider to provide the shared services to enterprise customers, which include the secure bridging of customer MPLS VPNs. However, as this solution evolves, multiple providers will be able to peer and provide intercompany services between them, thereby removing the requirement that enterprise customers share the same service provider.
Although the focus of this chapter is TelePresence deployments within the enterprise, several of these options can be hosted or managed by service providers. For example, the Cisco Unified Communications Manager and Cisco TelePresence Manager servers and multipoint resources can be located on premises at one of the customer campus locations, colocated within the service provider network (managed by the enterprise), or hosted within the service provider network (managed by the service provider). However, with the exception of inter-VPN elements required by providers offering intercompany TelePresence services, the TelePresence solution components and network designs remain fundamentally the same whether the TelePresence systems are hosted/managed by the enterprise or by the service provider.