Connecting a CTS-1000 System



The CTS-1000 is a multi-user TelePresence system that can support one or two users at a given location. For purposes of this discussion, the CTS-1000 is virtually identical to the CTS-500 except for two main differences:
  • The CTS-1000 includes a 65-inch plasma display (as compared to the 37-inch LCD display used by the CTS-500).
  • The CTS-1000 does not support the optional DMP.
With the exception of these two components, the parts list and connection details of these systems are identical. Figure 1 shows the connectivity of the CTS-1000.


Figure 1: Connectivity schematic for a CTS-1000 system

Connecting a CTS-500 System

The CTS-500 is a personal TelePresence system intended to support single-user TelePresence conferencing. The CTS-500 system includes a 37-inch display with an integrated codec, camera, microphone, and speaker. Additionally, the CTS-500 has a connection for an optional Cisco Digital Media Player (DMP), which you can use to display live or streaming video content when you do not use the CTS-500 for TelePresence meetings. The minimum room dimensions to support a CTS-500 are 8 x 6 x 8 feet.
Specifically, the CTS-500 includes the following:
  • One Cisco TelePresence codec (a primary codec)
  • One Cisco Unified 7975G IP Phone
  • One 37-inch LCD display
  • One high-definition camera
  • One microphone
  • One speaker
  • One input for auxiliary audio
  • One input for auxiliary video that you can use for a document camera or PC
The Cisco TelePresence primary codec is the center of the CTS systems. Essentially, all internal TelePresence components connect to it, and it, in turn, provides the sole access point to the network infrastructure.
Explicitly, the Cisco Unified 7975G IP Phone connects to the TelePresence primary codec through an RJ-45 cable that provides it with network connectivity and 802.3af Power-over-Ethernet (PoE).
Another RJ-45 cable connects from the TelePresence primary codec to the camera, providing the camera with 802.3af PoE. A second cable from the primary codec to the camera provides video connectivity.
A High-Definition Multimedia Interface (HDMI) video cable also connects the primary codec to the 37-inch LCD display. This cable has a proprietary element that carries management information instead of audio signals, because the audio signals are processed independently by the primary codec.
Additionally, a speaker cable and a microphone cable connect the speaker and microphone to the primary codec, respectively.
The primary codec also has inputs for auxiliary audio and auxiliary video. Auxiliary video can come from a PC connection or from a document camera connection. An IP power switch (IPS) provides control for the on/off function of the document camera, attached projector, and lighting shroud of the CTS unit through an Ethernet connection.
Finally, an RJ-45 cable provides 10/100/1000 Ethernet connectivity from the primary codec to the network infrastructure. Figure 1 illustrates these interconnections for a CTS-500 system.

 
Figure 1: Connectivity schematic for a CTS-500 system

Interoperability with Video Conferencing



In addition to supporting audio-only participants, Cisco TelePresence also supports video conferencing participants. This is done by bridging together the TelePresence multipoint meeting (hosted on a Cisco TelePresence Multipoint Switch [CTMS]) and a regular multipoint video conference (hosted on a Cisco Unified Videoconferencing [CUVC] MCU). Bridging multipoint meetings together has been around for years in the video conferencing industry and is referred to as cascading. The Cisco implementation of cascading between TelePresence and video conferencing is similar to previous implementations in the market, except that Cisco had to create a way of mapping multiscreen TelePresence systems with standard single-screen video conferencing systems. Cisco calls this Active Segment Cascading.
Note 
A multipoint cascaded conference is the only method for interoperating between Cisco TelePresence and traditional video conferencing endpoints. Direct, point-to-point calls between a TelePresence system and a video conferencing endpoint are not allowed. This is highly likely to change in the future as Cisco continues to develop additional interoperability capabilities within the TelePresence solution.

Interoperability RTP Channels

When Cisco engineered its video conferencing interoperability solution, the vast majority of video conferencing equipment ran at CIF or 4CIF resolutions (720p was just beginning to become widely deployed), and no video conferencing endpoints or MCUs (including the CUVC platform) were capable of receiving and decoding the Cisco 1080p resolution video and AAC-LD audio. Therefore, Cisco had two choices:
  • Degrade the experience for the TelePresence participants by encoding the entire meeting at a much lower resolution and using inferior audio algorithms to accommodate the video conferencing participants
  • Maintain the 1080p/AAC-LD experience for the TelePresence participants and send an additional video and audio stream for the video conferencing MCU to digest
For obvious reasons, Cisco chose the latter method.
Note 
The methods described herein are highly likely to change in the future as video conferencing equipment becomes increasingly capable of higher-definition video resolutions (720p and 1080p) and AAC-LD audio becomes more commonplace within the installed base.
When a Cisco TelePresence system (single-screen or multiscreen model) dials into a CTMS meeting that is configured for interoperability, the CTMS requests the TelePresence endpoint to send a copy of its 1080p video in CIF resolution and a copy of its AAC-LD audio in G.711 format. These CIF and G.711 streams are then switched to the CUVC MCU, which, in turn, relays them to the video conferencing participants. In the reverse direction, the CUVC sends the CTMS its CIF resolution video and G.711 audio from the video conferencing participants, and the CTMS relays that down to the TelePresence participants.

CIF Resolution Video Channel

Multiscreen TelePresence systems, such as the CTS-3000 and CTS-3200, provide three channels of 1080p / 30 resolution video. However, only one can be sent to the CUVC at any given time, so the Cisco TelePresence codec uses a voice-activated switching methodology to choose which of the three streams it should send at any moment in time. If a user on the left screen starts talking, the left codec encodes that camera's video using H.264 at 1080p / 30 (or whatever resolution/motion-handling setting the system is set to use) and also at CIF / 30. If a user in the center starts talking, the left codec stops encoding the CIF video channel, and the center codec begins encoding the center camera's video at both 1080p / 30 and CIF / 30. This switching occurs dynamically throughout the life of the meeting between the left, center, and right codecs based on the microphone sensitivity (who is speaking the loudest) of each position.
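At its core, the selection reduces to picking the loudest microphone. The following sketch is purely illustrative; the function and the example levels are invented here, not taken from the Cisco codec:

# Illustrative sketch of voice-activated segment selection; not Cisco's code.
def select_active_segment(mic_levels_db):
    """Return the segment whose microphone is currently loudest; that
    segment's codec encodes both the 1080p and the CIF copies."""
    return max(mic_levels_db, key=mic_levels_db.get)

# Example: the center participant is speaking the loudest.
print(select_active_segment({"left": -28.0, "center": -12.5, "right": -31.0}))
# prints "center": the center codec now encodes the CIF channel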
Single-screen TelePresence systems such as the CTS-1000 and CTS-500 have only one screen (the center channel), so no switching is required.
When encoded, the CIF channel is multiplexed by the primary codec into the outgoing RTP video stream along with the other four video channels (left, center, right, and auxiliary). In the case of a single-screen system, the CIF channel is multiplexed in with the other two video channels (center and auxiliary).

G.711 Audio Channel

On multiscreen TelePresence systems, such as the CTS-3000 and CTS-3200, there are three channels of AAC-LD audio. Instead of sending one at a time, the primary (center) codec mixes all three channels together and encodes the mix in G.711 format. Therefore, all parties can be heard at any given time.
The G.711 channel is multiplexed into the outgoing RTP audio stream with the other four channels (left, center, and right AAC-LD audio channels, and the auxiliary audio channel).
Single-screen systems have only a single microphone channel (center), so there is no need to mix. The center channel is encoded in both AAC-LD and G.711 formats and multiplexed together into the outgoing RTP audio stream along with the auxiliary audio channel, for a total of three audio channels.

Additional Bandwidth Required

As a result of having to send these additional CIF resolution video and G.711 audio channels, additional bandwidth is consumed by each participating TelePresence System. The CIF resolution video is encoded at 704 kbps, and the G.711 audio is encoded at 64 kbps, for a total of 768 kbps additional bandwidth.

CTMS Switching of the Interop Channels

As previously discussed, the multiple channels of video and the multiple channels of audio are multiplexed using the SSRC field in the RTP header. Ordinarily, there are four video positions and four audio positions within the SSRC field (left, center, right, and auxiliary). A fifth SSRC position was defined to carry the CIF and G.711 interop channels within the video and audio RTP streams.
When the CTMS receives the video and audio streams from any TelePresence system, it reads the SSRC position of the RTP header and decides where to switch it. Left, center, right, and auxiliary positions are switched to the other TelePresence participants, and the interop position is switched to the CUVC MCU.
In the opposite direction, the CIF video and G.711 audio coming from the CUVC MCU to the CTMS is appended with the SSRC position of the interop channel and sent down to all the participating TelePresence rooms.
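The switching decision itself is compact enough to sketch. The following fragment is a simplified restatement of the behavior just described; the position constants and the function are assumptions made for illustration, not the CTMS implementation:

# Hypothetical SSRC position labels; the on-the-wire encoding is not shown.
LEFT, CENTER, RIGHT, AUX, INTEROP = range(5)

def forward_targets(ssrc_position, telepresence_peers, cuvc_mcu):
    """Switch a received channel by its SSRC position: the four normal
    positions go to the other TelePresence rooms, and the interop
    position goes to the CUVC MCU."""
    if ssrc_position == INTEROP:
        return [cuvc_mcu]
    return list(telepresence_peers)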

Decoding of the Interop Channels

When a Cisco TelePresence system receives RTP packets containing the SSRC value of the interop position, the primary (center) codec forwards the CIF video RTP packets to the left secondary codec to be decoded. The center codec decodes the G.711 audio, mixes it with the left channel of decoded AAC-LD audio, and plays the mix out the left speaker. This way, the video conferencing participants always appear on the left display and are heard coming out the left speaker, along with any TelePresence participants seated on that side of the system. On single-screen systems, it obviously appears on the single (center) display and speaker.
Because CIF video is 4:3 aspect ratio (352x288 resolution), and the TelePresence displays are 16:9 aspect ratio and run at 1080p / 60 resolution, the CIF video must be displayed in the best possible way. Stretching it to fit a 65-inch 1080p display would look terrible, so the left codec pixel-doubles the decoded video to 4CIF resolution (704x576) and displays it on the 1080p display surrounded by black borders.
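The border arithmetic is easy to verify with a short sketch (illustrative only): pixel-double CIF to 4CIF, then compute the black border needed to center the result on a 1920x1080 canvas.

# Pixel-double CIF (352x288) to 4CIF (704x576) and center it at 1080p.
CIF_W, CIF_H = 352, 288
CANVAS_W, CANVAS_H = 1920, 1080

out_w, out_h = CIF_W * 2, CIF_H * 2      # 704 x 576 after pixel doubling
border_x = (CANVAS_W - out_w) // 2       # 608 pixels of black left and right
border_y = (CANVAS_H - out_h) // 2       # 252 pixels of black top and bottom
print(out_w, out_h, border_x, border_y)  # 704 576 608 252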

Interoperability with Out-of-Band Collaboration Applications



Another aspect of supporting audio-only participants is providing a method for all participants to collaborate on shared documents and presentations. Earlier sections described how Cisco TelePresence provides auxiliary video inputs and outputs so that users can attach their PCs to the system, and how that signal is encoded and multiplexed over RTP so that all the other TelePresence participants can see it. But how do the audio-only participants get to view it, and can the audio-only participants share a document or presentation with the TelePresence participants?
The predominant method of sharing documents and presentations on an audio conference is through the use of web-based collaboration tools such as Cisco MeetingPlace or Cisco Webex. There have been other methods in the past, such as the infamous T.120 protocol, but those are pretty much defunct. Therefore, the way forward is to engineer a method whereby the auxiliary video channel can be automatically converted into a format viewable by the web conference participants and vice versa. At the time this book was written, this functionality was in the process of being developed.
In the meantime, there is a method for achieving the desired results, but it requires steps on the part of the user to enable it. It’s relatively simple but requires a three-step process:
Step 1
The user attaches his or her PC to the auxiliary (VGA) video input of the TelePresence system. This allows whatever he or she is sharing to be instantly viewed by all the other TelePresence participants.
Step 2
The user dials into the audio conferencing server and enters the appropriate DTMF tones to bridge the TelePresence meeting into the audio conference (as described in the previous sections).
Step 3
The user also fires up the web browser on his or her PC, logs on to the web conferencing server for that audio conference, joins the meeting as a web participant, and then activates the sharing feature within that web conference.
Now whatever the user shares on the PC is sent simultaneously over VGA to the Cisco TelePresence primary codec and over HTTP to the web conference. Although this adds a degree of complexity to the ease-of-use of TelePresence, the good news is that it works with virtually any audio and web conferencing product on the planet.

Dual-Tone Multi-Frequency | TelePresence Audio



Dual-Tone Multi-Frequency (DTMF) enables the user to interact with voice prompts using the touch-tone buttons on the telephone to enter digits, *, and # symbols. This enables the user to navigate IVR menus, enter conference numbers and passwords, check their voicemail, and so on. DTMF has been around throughout the history of telephony and has been adapted into the numerous protocols used in IP-based telephony. H.323, Media Gateway Control Protocol (MGCP), Session Initiation Protocol (SIP) and others all incorporate support for DTMF using one or more methods.
Depending on the protocol in use, there are fundamentally two forms of DTMF signaling:
  • In-band: Puts the DTMF tones inside the audio stream
  • Out-of-band: Interprets the DTMF tones and converts them into messages carried over the signaling protocol
In the case of SIP, there are two predominant methods for incorporating DTMF support:
  • RFC 2833: Defines the RTP payload type used for carrying DTMF tones in-band within the RTP audio stream
  • Key-Pad Markup Language (KPML): Defines a method for relaying DTMF tones through the SIP signaling protocol
Which method is used for a given call is negotiated through SIP during the session establishment phase and can be RFC 2833, KPML, or none, depending on what type of audio device the TelePresence system connects to.

RFC 2833

RFC 2833 is maintained by the Internet Engineering Task Force (IETF) and was standardized in 2000. It defines a method for carrying DTMF signaling events and other tones and telephony events within RTP packets.
RTP defines a variety of payload types and contains a payload type field within the RTP header to indicate what payload type the packet contains (audio, video, DTMF, and so on). Refer back to Figure 3-8 for details of the RTP header contents. The payload type for DTMF is referred to as a named event. RFC 2833 defines several different named event types; one of those is DTMF events. When a DTMF tone is identified, the encoder translates it into an RTP packet, setting the payload type (PT) field to the number associated with DTMF payload events and inserting the actual DTMF event into the body of the RTP payload. The numerical identifier used for DTMF events can be negotiated ahead of time during the session establishment period. For example, Cisco devices generally use payload type 97 for DTMF events, but this can be negotiated on a call-by-call basis.
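To make the payload format concrete, the following sketch builds the 4-byte telephone-event payload defined by RFC 2833. The surrounding RTP header, where the negotiated payload type (for example, 97) is set, is omitted:

import struct

def rfc2833_payload(event, end=False, volume=10, duration=800):
    """Build the RFC 2833 telephone-event payload: event (8 bits),
    end bit + reserved bit + volume (6 bits), duration (16 bits)."""
    flags_volume = (0x80 if end else 0x00) | (volume & 0x3F)
    return struct.pack("!BBH", event, flags_volume, duration)

# DTMF digits 0-9 are events 0-9; "*" is event 10 and "#" is event 11.
packet_body = rfc2833_payload(event=5)   # the digit "5"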

Key-Pad Markup Language

Key-Pad Markup Language (KPML) is also maintained by the IETF and was converted from an IETF Internet draft to a published RFC (RFC 4730) in November 2006. It defines a method for carrying DTMF signaling events with SIP event packages within the SIP signaling protocol.
SIP RFC 3265 defines a method within SIP by which an application can SUBSCRIBE to specific event packages and receive NOTIFY messages whenever such an event occurs. RFC 4730 leverages this SUBSCRIBE/NOTIFY architecture to define a DTMF event package. During the SIP session establishment phase (through the INVITE, 180 RINGING, 183 SESSION PROGRESS, and 200 OK messages), the SIP User Agents advertise support for KPML in the Allowed-Events and Supported headers. After being advertised, the User Agent that wants to receive DTMF events sends a SUBSCRIBE message to the other User Agent indicating that it wants to subscribe to the KPML event package. The User Agent receiving the SUBSCRIBE request sends a 200 OK acknowledgment to the subscriber. Thereafter, any DTMF event initiated by the User Agent that received the SUBSCRIBE request is sent in a NOTIFY message to the subscriber, contained as an XML document within the body of that NOTIFY message.
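As an illustration, the XML document carried in such a NOTIFY might look like the following sketch (element and namespace names per RFC 4730; the SIP message framing around it is omitted):

# Sketch of the KPML report a User Agent might place in the body of a
# NOTIFY after the digit "5" is pressed (per RFC 4730; SIP framing omitted).
kpml_notify_body = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<kpml-response xmlns="urn:ietf:params:xml:ns:kpml-response" '
    'version="1.0" code="200" text="OK" digits="5"/>'
)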

Other Protocols

Other protocols have similar approaches for DTMF support. H.323, for example, commonly uses RFC 2833 or H.245 Alpha-Numeric notation, among others. Media Gateway Control Protocol (MGCP) commonly uses RFC 2833 or MGCP DTMF-relay. In the case of Cisco TelePresence, the CUCM handles all the call signaling and deals with converting between the various signaling protocols to enable end-to-end DTMF. For example, if a Cisco TelePresence system called a Cisco Webex audio conferencing bridging service, it would need to traverse an IP-PSTN gateway. If that gateway were running MGCP, there are two ways that DTMF might be configured:
  • RFC 2833: If the gateway advertises RFC 2833 support only, the Unified CM would advertise RFC 2833 to the Cisco TelePresence system.
  • DTMF-relay: If the gateway advertises DTMF-relay, the CUCM would advertise KPML to the Cisco TelePresence system.

How DTMF Tones Are Processed in Cisco TelePresence

In the case of Cisco TelePresence, the user presses the buttons on the Cisco TelePresence IP Phone, but the Cisco TelePresence Primary (center) codec is the one doing all the SIP signaling and handling all the media; the IP Phone is just the user interface instrument to the system.
The first generation of Cisco TelePresence used the eXtensible Markup Language (XML) as the interface between the Cisco IP Phone and the primary codec. When a user wanted to enter DTMF tones, he or she pressed the Tones softkey, which would bring up a page where the user could enter the digits to send and then press the Send softkey. That XML content was then sent to the primary codec over an HTTP session. The codec would read the XML contents and convert them into SIP-KPML messages or RFC 2833 payloads.
At the time this book was written, Cisco TelePresence was moving away from XML to a newer, more robust method using Java MIDlets. With this method, a small Java application (called a MIDlet) runs on the IP Phone and intercepts the button presses. These button presses are then communicated over a TCP session between the Java MIDlet running on the IP Phone and the primary codec. The codec reads the Java TCP messages and converts them into SIP-KPML messages, or RFC 2833 payloads.
In both cases, the button presses are sent by the IP Phone to the primary codec, and the codec converts them and plays them out onto the network. If RFC 2833 is in use, the primary codec converts the XML or Java MIDlet event into an RTP DTMF event and multiplexes it into the audio RTP stream. If KPML is in use, the Unified CM sends a SUBSCRIBE request to the primary codec during the SIP session establishment phase, subscribing to the KPML event package. The primary codec, therefore, converts the XML or Java MIDlet event into an XML document and sends it in a NOTIFY message to the CUCM.

Audio-Only Participants



It is common to have one or more participants who cannot attend the meeting in person but are available to dial in and attend through a phone. These callers need to join the TelePresence meeting through an audio-only call, which is known as the Audio Add-In feature.
In addition to AAC-LD, Cisco TelePresence also supports the G.711 audio encoding standard. This makes it interoperable with virtually any telephone device or audio conferencing bridge, whether that is a standard Plain Old Telephone Service (POTS) phone, a cellular phone, an IP Phone, or an audio-conferencing bridging service, such as Cisco MeetingPlace, Cisco Webex, or the numerous other audio bridging services in the market.
The feature is invoked just like it is on regular telephones and cellular phones. The user simply presses the Conference softkey on the Cisco TelePresence IP Phone user interface and places a standard telephone call to the destination phone number. Alternatively, the remote person can dial the telephone number of the TelePresence room, and the user can answer the incoming call and then press the Conference/Join softkey to bridge the caller in.
Under the hood, the audio call is established as a completely separate session. (It is signaled using the Session Initiation Protocol [SIP], and a G.711 RTP stream is negotiated.) The RTP stream of audio coming into the TelePresence system from the remote party is decoded and blended across all the speakers (just like the auxiliary audio is) and is simultaneously mixed into the auxiliary audio stream going out to all the other participating TelePresence rooms, allowing all the TelePresence participants to hear the audio caller. In the opposite direction, sound coming from all three microphones within the room, together with all three of the audio channels received from the other participating TelePresence rooms, is mixed and sent out over the G.711 RTP stream to the audio participant, allowing him to hear everything said by the TelePresence participants. Figure 1 illustrates how this is done.

 
Figure 1: Audio-only participants input and output mapping
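The two mixes shown in Figure 1 can be sketched in a few lines. This is a simplified illustration only; real DSP code would apply automatic gain control rather than hard clipping:

import numpy as np

def mix(channels):
    """Sum equal-length 16-bit sample blocks and clip to the 16-bit range."""
    return np.clip(np.sum(channels, axis=0), -32768, 32767).astype(np.int16)

block = 160  # 20 ms of audio at 8 kHz, a typical G.711 packetization interval
local_mics = [np.zeros(block, dtype=np.int16) for _ in range(3)]    # room microphones
remote_rooms = [np.zeros(block, dtype=np.int16) for _ in range(3)]  # other TelePresence rooms

# Toward the audio caller: everything said in every TelePresence room,
# mixed into the single outgoing G.711 stream.
to_caller = mix(local_mics + remote_rooms)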
Note 
Future versions of Cisco TelePresence might incorporate support for additional audio algorithms for the audio add-in stream, such as G.722, to increase the fidelity of the Audio Add-In participant.
If multiple audio-only participants are needed, the user can use an audio conferencing bridging service, such as Cisco Webex, as illustrated in Figure 2.

 
Figure 2: Multiple audio-only participants using a conferencing bridging service
To successfully dial into a bridging service such as Cisco Webex, the TelePresence user initiating the Audio Add-In feature must navigate the Interactive Voice Response (IVR) menu of the bridging service and enter the correct conferencing ID number and password to join that audio meeting. This is a good segue into the next topic, DTMF.

Demultiplexing and Decoding



As previously discussed, Cisco TelePresence uses a multiplexing technique, using the SSRC field of the RTP header, to transport multiple video and audio channels over RTP. Each call (session) consists of two RTP streams: one for video and one for audio. On single-screen systems, the video RTP stream consists of two video channels: one for the Cisco TelePresence camera and one for the auxiliary (PC or document camera) video inputs. Likewise, the audio RTP stream consists of two audio channels: one for the Cisco TelePresence microphone and one for the auxiliary (PC) audio input. On multiscreen systems, the video RTP stream consists of four video channels, and the audio RTP stream consists of four audio channels.

Video and Audio Output Mapping

These channels must be demultiplexed, decoded, and played out the corresponding output (to the appropriate screen for video and to the appropriate speaker for audio). Because the entire TelePresence system connects to the network using a single 1000Base-T Gigabit Ethernet interface, all the packets are received by the primary (center) codec. The primary codec analyzes the SSRC field of the RTP headers and sends the left video channel to the left secondary codec and the right video channel to the right secondary codec. The primary codec then proceeds to buffer and decode the center and auxiliary video packets and all audio packets, and the two secondary codecs buffer and decode their respective video packets.
Figure 1 illustrates how these channels are mapped from the transmitting TelePresence codec to the receiving TelePresence codec. 

 
Figure 1: Video and audio output mapping
Note 
Figure 1 illustrates a multiscreen system. Single-screen systems would behave exactly the same way, except that the left and right channels would not be present.

Display Outputs, Resolution, and Refresh Rate (Hz)

The left, center, and right video channels are decoded by each Cisco TelePresence codec and sent out the corresponding HDMI interface to the left, center, and right displays. At the time this book was written, the CTS-1000, CTS-3000, and CTS-3200 use 65-inch plasma displays, whereas the CTS-500 uses a 37-inch LCD display. In all cases, these displays run at 1080p resolution at 60 Hz refresh rate using progressive scan. Therefore, the Cisco TelePresence codec must decode the video (whether it was encoded at 1080p / 30 or 720p / 30) and send it to the display at 1080p / 60.
The auxiliary video channel is also decoded and sent out the auxiliary HDMI interface to either the projector or an auxiliary LCD display, or displayed as Presentation-in-Picture (PIP) on the center display. Depending on its destination, the Cisco TelePresence codec decodes the video (which was encoded at 1024x768 at either 5 fps or 30 fps) and sends it out at the correct refresh rate. When it is sent out the auxiliary HDMI interface to either the projector or an auxiliary LCD display, the Cisco TelePresence codec outputs it at 49.5 Hz using interlaced scanning. When it is sent as PIP to the primary HDMI display port, the codec overlays it on top of the center channel's video and outputs it at 1080p / 60.

Frames per Second Versus Fields per Second Versus Refresh Rate (Hz)

It’s worth inserting a quick word here on the difference among frames, fields, and refresh or scan rates (Hz). These terms are frequently confused in the video conferencing and telepresence industries. (For example, a vendor might state that its system does 60 fields per second.)
In a Cisco TelePresence system, the camera operates at a scan rate (also known as refresh rate or clock rate) of 30 Hz. The codec encodes that video into H.264 video frames at a rate of 30 frames per second (30 fps). The plasma and LCD displays used in Cisco TelePresence operate at a scan rate of 60 Hz using progressive scan display technology. Because the displays are 60-Hz progressive scan, Cisco can claim 60 fields per second support as well. But what actually matters is that the source (the camera) is operating at 30 Hz, and the video is encoded at 30 fps. To truly claim 60 fps, the camera would need to run at 60 Hz, the encoder would need to pump out 60 video frames per second (every 16 ms or so), and the displays would need to run at 120 Hz. This would provide astounding video quality but would also double the DSP horsepower and bandwidth needed and, quite frankly, is unnecessary because the current 30-fps implementation is already the highest-quality solution on the planet and is absolutely adequate for reproducing a true-to-life visual experience.
Instead of getting caught up in a debate over Hz rates and progressive scan versus interlaced scan methods, the most accurate method for determining the true “frame rate” of any vendor's codec is to analyze its RTP packets. As described earlier in the Real-Time Transport Protocol section, all vendors implementing RTP for video transport use the marker bit to indicate the end of a video frame. Using a packet sniffer, such as the open source program Wireshark (http://www.wireshark.org), and filtering on the RTP marker bit, a graph can be produced with the marker bits highlighted. The x-axis on the graph displays the time those packets arrived and, hence, the number of milliseconds between each marker bit. Dividing 1000 by the number of milliseconds between each marker bit reveals the number of frames per second. With Cisco TelePresence, the marker bits appear every 33 ms (30 fps). With other vendor implementations, which use variable frame-rate encoders, there are much larger and variable times between marker bits. For example, if the time between two marker bits is 60 ms, the video is running at only approximately 17 fps for that frame interval. If it's 90 ms, the video is only approximately 11 fps. Because the time between marker bits often varies frame-by-frame in these implementations, you can compute the time between all marker bits to derive an average fps for the entire session.
Figure 2 shows a screenshot of a Wireshark IO Graph of a competitor’s (who shall remain nameless) 720p implementation. In this screenshot, you can see that the RTP packets that have the marker bit set to 0 (false) are colored red (gray in this screen capture), whereas the RTP packets that have the marker bit set to 1 (true) are colored black so that they stand out. The time between the first marker bit on the left (99.839s) and the next marker bit after that (99.878s) is 39 ms (which is approximately 25 fps), whereas the difference between the 99.878s marker bit and the next marker bit after that (99.928s) is 50 ms (20 fps).

 
Figure 2: Example Wireshark IO graph
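This arithmetic is easy to automate. The following sketch (illustrative; it assumes you have exported the arrival times of the marker-bit packets from the capture) reproduces the calculation using the three marker bits visible in Figure 2:

# Frame rate from the arrival times (in seconds) of marker-bit RTP packets,
# using the three marker bits visible in Figure 2.
marker_times = [99.839, 99.878, 99.928]

intervals_ms = [(b - a) * 1000 for a, b in zip(marker_times, marker_times[1:])]
for ms in intervals_ms:
    print(f"{ms:.0f} ms between frames -> {1000 / ms:.1f} fps")

avg_ms = sum(intervals_ms) / len(intervals_ms)
print(f"session average: {1000 / avg_ms:.1f} fps")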

Audio Outputs

As discussed previously, the audio from the left, center, and right microphone channels is played out the corresponding left, center, and right speakers. The speakers are mounted underneath each display, except on the CTS-500, where they are mounted above the display because the microphone array is mounted underneath it. This preserves the directionality and spatiality of the sounds, giving the user the audible perception that the sound is emanating from the correct direction and distance. The auxiliary audio is blended across all the speakers because this source is not actually associated with the left, center, or right positions.

Amplification and Volume

The Cisco TelePresence codec contains an embedded amplifier, and the amplification levels and the wattage of the speakers are closely matched to reproduce human speech and other in-room sounds at the correct decibel levels to mimic, as closely as possible, the volume you would experience if the person were actually sitting that far away in person. This means that the users can speak at normal voice levels. (They never feel like they have to raise their voices unnaturally.)

Acoustic Echo Cancellation

As sound patterns are played out of the speakers, they naturally reflect off of surfaces within the environment (walls, ceilings, floors) and return back to enter the microphones. If these sounds were not removed, people would hear their own voices reflected back to them through the system. Acoustic Echo Cancellation (AEC) is a digital algorithm that samples the audio signal before it plays out of the speakers, creates a synthetic estimate of that sound pattern, samples the audio coming into the microphones, and, when the same pattern is recognized, digitally subtracts it from the incoming audio signal, thereby canceling out the acoustic echo. This sounds simple enough but is complicated by the naturally dynamic nature of sound in various environments. Depending on the structures and surfaces in the room (such as tables, chairs, walls, doors, floors, and ceilings), the distance of those surfaces from the microphones, the materials from which those surfaces are constructed, the periodic movement of those surfaces, and the movement of human bodies within the room, the number of milliseconds the algorithm must wait to determine whether the sound coming into the microphones is echo can vary significantly. Therefore, the algorithm must automatically and dynamically adapt to these changing conditions.
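For readers curious about the shape of such an algorithm, the following is a toy normalized least-mean-squares (NLMS) echo canceller, the classic textbook approach to the problem. It is a minimal sketch, not the algorithm embedded in the Cisco codec:

import numpy as np

def nlms_echo_cancel(far_end, mic, taps=256, mu=0.5, eps=1e-8):
    """Toy NLMS echo canceller: adaptively model the room's echo path
    from the far-end (speaker) signal, subtract the synthetic echo
    estimate from the microphone signal, and adapt as the room changes."""
    w = np.zeros(taps)            # adaptive estimate of the echo path
    buf = np.zeros(taps)          # most recent far-end samples
    out = np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far_end[n]
        echo_estimate = w @ buf             # synthetic estimate of the echo
        residual = mic[n] - echo_estimate   # echo removed from the mic signal
        w += (mu / (buf @ buf + eps)) * residual * buf
        out[n] = residual
    return out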
The Cisco TelePresence codec contains an embedded AEC that requires no human tuning or calibration. It is on by default and is fully automatic. In nonstandard environments, where the Cisco TelePresence codecs are used with third-party microphone mixers, you can disable the embedded AEC using the following CLI command:
CTS>set audio aec {enable | disable}

Depacketization and Decoding



So far this chapter has discussed how video and audio signals are encoded, packetized, and multiplexed onto the IP network. The following sections describe what happens when the packets reach the destination TelePresence endpoint and how the video and audio signals are decoded.

Managing Latency, Jitter, and Loss

The first step in decoding is to receive, buffer, and reassemble the packets to prepare them to be decoded. Recall that the encoder encodes and packetizes the video and audio signals at a smooth, consistent rate. (A 30 Hz camera clock rate and 48 kHz audio sampling rate result in fixed, consistent encoding intervals.) However, as the packets containing those video and audio samples traverse the IP network, there will inevitably be variation in their arrival times and possibly the order in which they arrive. Therefore, the receiving endpoint must buffer the packets, reordering them if necessary, until an adequate number of packets have arrived to begin decoding a given video frame or audio sample. Lost packets or packets that arrive too late to be decoded (late packets) must also be dealt with by the decoder.
The following sections detail how the Cisco TelePresence codec handles latency, jitter, and loss in the packets that it receives. 

Latency

At the human experience level, latency is defined and measured as the time it takes for the speech or gestures of one individual (the speaker) to reach the ears and eyes of another (the listener), and for the audible or visual reaction of that listener to come all the way back to the speaker so that they can hear and see the listener's reaction. Hence, the human experience is round-trip in nature. This is referred to as conversational latency, or experience-level latency; 250 ms to 350 ms is the threshold at which the human mind begins to perceive latency and be annoyed by it.
At the technical level, however, the latency in Cisco TelePresence is defined and measured as the time it takes for an audio or video packet containing speech or motion to travel from the Ethernet network interface of the speaker's TelePresence system to the Ethernet network interface of the listener's TelePresence system in one direction. The listener's TelePresence system processes the incoming packets and computes a running average of the latency based on timestamps within the RTP packets and their associated RTCP sender reports. Therefore, latency is measured only at the network level from one TelePresence system to another, not at the experience level. It is measured unidirectionally by each TelePresence system, not round-trip, and does not take into account the processing time (encoding and decoding) of the packets.
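In outline, that measurement looks like the following sketch. The names are invented for illustration (this is not the codec's code), and it presumes the two systems' clocks are synchronized well enough for a one-way comparison to be meaningful:

# Illustrative one-way latency estimate from RTP timestamps and RTCP
# sender reports. A sender report binds an RTP timestamp to the sender's
# wallclock, letting the receiver convert any packet's timestamp to a
# send time.

def one_way_latency_ms(arrival_time, rtp_ts, sr_wallclock, sr_rtp_ts, clock_rate):
    """Convert the packet's RTP timestamp to a send time via the most
    recent sender report, then subtract it from the arrival time."""
    send_time = sr_wallclock + (rtp_ts - sr_rtp_ts) / clock_rate
    return (arrival_time - send_time) * 1000.0

class RunningAverage:
    """The running average of latency that the receiving system maintains."""
    def __init__(self):
        self.total, self.count = 0.0, 0

    def update(self, latency_ms):
        self.total += latency_ms
        self.count += 1
        return self.total / self.count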

Latency Target

To maintain acceptable experience-level latency, Cisco recommends that customers engineer their networks with a target of no more than 150 ms of network-level latency, in each direction, between any two TelePresence systems. Given the circumference of the earth, the speed of light, and the cabling paths that light travels on between cities, it is not always possible to achieve 150 ms between any two points on the globe. Therefore, Cisco TelePresence implements the following thresholds to alert the network administrator and the user when network-level latency exceeds acceptable levels.

Latency Thresholds

When network-level latency exceeds 250 ms averaged over any 10-second period, the Cisco TelePresence system receiving those packets generates an alarm, and an onscreen message displays to the user. The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. The onscreen message displays for 15 seconds, after which it is removed. The onscreen message does not display again for the duration of the meeting, unless the media is interrupted or restarted, such as when the user places the meeting on hold and then resumes it (using the Hold or Resume softkeys), or the user terminates the meeting and then reestablishes it (using the End Call or Redial softkeys).
Tip 
Cisco TelePresence release 1.5 added support for satellite networks. This feature requires a software license to activate. When activated, the latency threshold is adjusted from 250 ms to 2 seconds.

Understanding Latency Measurements in Multipoint Meetings

As audio and video packets traverse a Cisco TelePresence Multipoint Switch (CTMS), the RTP header containing the original timestamp information is overwritten, and a new timestamp value is applied by the CTMS. Therefore, the latency measured by each participating TelePresence system is only a measurement of the latency from the CTMS to that endpoint. It is possible for the end-to-end latency from one TelePresence system through the CTMS to another TelePresence system to exceed the 250 ms latency threshold without the TelePresence system realizing it.
For example, if the latency from one TelePresence system in Hong Kong to the CTMS in London is 125 ms, and the latency from the CTMS in London to the other TelePresence system in San Francisco is 125 ms, the end-to-end latency from the Ethernet network interface of the Hong Kong system to the Ethernet network interface of the San Francisco system is 250 ms, plus approximately 10 ms added by the CTMS, for a total of 260 ms. The TelePresence System in San Francisco will not realize this and will think that the latency for that meeting is only 125 ms. Therefore, care should be taken when designing the network and the location of the CTMS to reduce the probability of this situation occurring as much as possible. The CTMS is the only device in the network that is aware of the end-to-end latency between any two TelePresence systems in a multipoint meeting. Network administrators can view the end-to-end statistics (calculating the sum of any two legs in that meeting) through the CTMS Administration interface. 

Jitter

Simply put, jitter is variation in network latency. In Cisco TelePresence, jitter is measured by comparing the arrival time of the current video frame to the expected arrival time of that frame based on a running clock of fixed 33 ms intervals. Unlike most other video conferencing and telepresence products on the market that use variable frame rate codecs, Cisco TelePresence operates at a consistent 30 frames per second (30 fps). Therefore, the sending codec generates a video frame every 33 ms, and the receiving codec expects those video frames to arrive every 33 ms.

Frame Jitter Versus Packet Jitter

Video frames vary in size based on how much motion is represented by a given video frame. When a low amount of motion occurs within the encoded video, the video frame is relatively small. When a large amount of motion occurs within the encoded video, the video frame is large. Cisco TelePresence 1080p video frames can be as large as 65,000 bytes (65 KB) and average approximately 13 KB.
These video frames are then segmented into smaller chunks and placed within the payload of RTP packets. Cisco TelePresence video packets tend to be approximately 1100 bytes each, with relatively minor variation in size.
Given a constant end-to-end network latency and relatively constant packet sizes, you can expect to have low packet-level jitter. However, there will still inevitably be variation in the arrival times of video frames simply due to the variation in their size. This variation is primarily a function of the serialization rate (speed) of the network interfaces the packets constituting those video frames traverse but can also be affected by queuing and shaping algorithms within the network routers along the path that might need to queue (buffer) the packets to prioritize them relative to other traffic, shape them prior to transmission, and then transmit (serialize) them on their outgoing interface. On fast networks (45 Mbps DS-3 circuits or faster), the time required to serialize all the packets constituting a large 65 KB video frame versus the time required to serialize a small 13 KB video frame is inconsequential. On slower networks (10 Mbps or slower), the time difference between the serialization of these frame sizes can be significant.
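A quick back-of-the-envelope calculation shows why. The following sketch compares serialization times for an average (13 KB) and a worst-case (65 KB) video frame at two link speeds, ignoring packet header overhead:

# Back-of-the-envelope serialization delay, ignoring header overhead.
def serialization_ms(frame_bytes, link_bps):
    """Time to clock a frame's bits onto the wire."""
    return frame_bytes * 8 / link_bps * 1000

for frame_kb in (13, 65):
    for mbps in (10, 45):
        ms = serialization_ms(frame_kb * 1000, mbps * 1_000_000)
        print(f"{frame_kb} KB frame at {mbps} Mbps: {ms:.1f} ms")
# At 10 Mbps the difference between the two frame sizes is roughly 42 ms
# of frame-level jitter; at 45 Mbps it shrinks to roughly 9 ms.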
Cisco TelePresence systems implement jitter buffers to manage these variations in video frame arrival times. Upon receipt at the destination, the packets are buffered until an adequate portion of the video frame has arrived, and then the packets are removed from the buffer and decoded. The size (depth) of the jitter buffer dictates how much jitter can be managed before it begins to be noticeable to the user. Packets exceeding the jitter buffer are dropped by the receiving codec because they arrived too late to be decoded. The depth of the jitter buffer has an important consequence for the experience-level latency; every millisecond spent waiting in the jitter buffer increases the end-to-end latency between the humans, so jitter buffers must be kept as small as reasonably possible to accommodate network-level jitter without adding an unacceptable amount of experience-level latency.

Jitter Target

To maintain acceptable experience-level latency, Cisco recommends that customers engineer their networks with a target of no more than 10 ms of packet-level jitter and no more than 50 ms of video frame jitter in each direction between any two TelePresence systems. Given the desire to deploy TelePresence over the smallest and, hence, least expensive amount of bandwidth possible, and the need in some circumstances to implement shaping within the routers along the path to conform to a service provider's contractual rates and policing enforcements, 50 ms of video frame-level jitter is not always possible to accomplish. Therefore, Cisco TelePresence implements the following thresholds and jitter buffer behavior to alert the network administrator when video frame-level jitter exceeds acceptable levels.

Jitter Thresholds

Cisco TelePresence uses a quasi-adaptive jitter buffer. At the beginning of every new meeting, the jitter buffer starts out at 85 ms in depth. After monitoring the arrival time of the video frames for the first few seconds of the meeting, if the incoming jitter exceeds 85 ms average, the jitter buffer is dynamically adjusted to 125 ms. After that, if the jitter exceeds 125 ms averaged over any 10-second period, the Cisco TelePresence system receiving those video frames generates an alarm and dynamically adjusts the jitter buffer to 165 ms. The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. No onscreen message is displayed to the user.
Any packets exceeding the 165 ms jitter buffer depth are discarded by the receiving TelePresence system and logged as “late packets” in the call statistics. No alarms or onscreen messages are triggered by this threshold. However, late packets are just as bad as lost packets in that they can cause a noticeable effect on the video quality, so care should be taken to design the network so that video frame jitter never exceeds 165 ms.
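The escalation just described can be restated compactly. The class below is a simplified paraphrase of the documented thresholds, not the codec's implementation:

# Simplified restatement of the documented jitter-buffer escalation.
class JitterBufferPolicy:
    """Quasi-adaptive jitter buffer: 85 ms at meeting start, stepping to
    125 ms and then 165 ms as measured video-frame jitter grows."""
    def __init__(self):
        self.depth_ms = 85

    def on_average_jitter(self, avg_ms):
        """Return True when the step to 165 ms raises a syslog/SNMP alarm."""
        if self.depth_ms == 85 and avg_ms > 85:
            self.depth_ms = 125
        elif self.depth_ms == 125 and avg_ms > 125:
            self.depth_ms = 165
            return True   # alarm: syslog + SNMP trap, no onscreen message
        return False

    def accept(self, frame_delay_ms):
        """Frames arriving beyond the current depth are logged as late packets."""
        return frame_delay_ms <= self.depth_ms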

Packet Loss

Loss is defined as packets that did not arrive because they were dropped somewhere along the network path. Loss is measured by each TelePresence system by comparing the sequence numbers of the RTP packets it receives against the sequence numbers it expected to receive. Packet loss can occur anywhere along the path for a variety of reasons; the three most common follow:
  • Layer-1 errors on the physical interfaces and cables along the path, such as a malfunctioning optical interface
  • Misconfigured network interfaces along the path, such as Ethernet speed or duplex mismatches between two devices
  • Bursts of packets exceeding the buffer (queue) limit or policer configurations on network interfaces along the path, such as Ethernet switches with insufficient queue depth or oversubscribed backplane architectures, or WAN router interfaces that police traffic to conform to a service provider’s contractual rates
A closely related metric is late packets, which are packets that arrived but exceeded the jitter buffer (arrived too late to be decoded) and hence were discarded (dropped) by the receiving TelePresence system. Lost packets and late packets are tracked independently by Cisco TelePresence systems, but they both result in the same outcome, noticeable pixelization of the video.
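Tracking the two counters can be sketched as follows. This is illustrative only; a real receiver must also handle the 16-bit RTP sequence number wrapping back to zero:

# Count lost and late packets from RTP sequence numbers (illustrative;
# ignores the 16-bit sequence number wraparound a real receiver handles).
class LossCounter:
    """Count lost packets (sequence-number gaps) and late packets
    (arrived but exceeded the jitter buffer) independently."""
    def __init__(self):
        self.next_expected = None
        self.lost = 0
        self.late = 0

    def on_packet(self, seq, within_jitter_buffer=True):
        if self.next_expected is not None and seq > self.next_expected:
            self.lost += seq - self.next_expected   # gap: packets never arrived
        if not within_jitter_buffer:
            self.late += 1                          # arrived too late to decode
        self.next_expected = seq + 1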
Loss is by far the most stringent of the three metrics discussed here. Latency can be annoying to the users, but their meeting can still proceed, and jitter is invisible to the user; loss, however (including packets that arrived but exceeded the 165 ms jitter buffer and became late packets), is immediately apparent. Consider the following calculation:
1080p resolution uncompressed (per screen)
     2,073,600 pixels per frame
× 3 colors per pixel
× 1 byte (8 bits) per color
× 30 frames per second
= 1.5 Gbps uncompressed
The Cisco TelePresence systems use the H.264 codec to compress this down to 4 Mbps (per screen). This represents a compression ratio of > 99 percent. Therefore, each packet is representative of a large amount of video data, and, hence, a small amount of packet loss can be extremely damaging to the video quality.
At the time this book was written, Cisco TelePresence was just beginning to implement a new technique known as Long-Term Reference Frames. This enables the system to recover from packet loss significantly faster by maintaining multiple reference frames and, therefore, reducing the number of IDR reference frames that need to be retransmitted when packet loss occurs.

Loss Target

To maintain acceptable experience-level video quality, Cisco recommends that customers engineer their networks with a target of no more than 0.05 percent packet loss in each direction between any two TelePresence systems. This is an incredibly small amount, and given the complexity of today's global networks, 0.05 percent loss is not always possible to accomplish. Therefore, Cisco TelePresence implements the following thresholds to alert the network administrator when packet loss (or late packets) exceeds acceptable levels.

Loss Thresholds

When packet loss (or late packets) exceeds 1 percent averaged over any 10-second period, the Cisco TelePresence system receiving those packets generates an alarm, and an onscreen message appears. The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. The onscreen message appears for 15 seconds, after which it is removed, and a 5-minute hold timer is started. During the 5-minute hold timer, syslog/SNMP alarms continue to be generated, but no onscreen message displays.
When packet loss (or late packets) exceeds 10 percent averaged over any 10-second period, the Cisco TelePresence system receiving those packets generates a second alarm, and a second onscreen message appears (unless the hold timer is already in effect). The alarm is written to the syslog log file of that TelePresence system, and an SNMP trap message is generated. The onscreen message displays for 15 seconds, after which it is removed, and a 5-minute hold timer starts (if it wasn't already started by the first loss threshold). During the 5-minute hold timer, syslog/SNMP alarms continue to be generated, but no onscreen message appears.
If loss (or late packets) exceeds 10 percent averaged over any 60-second period, in addition to the actions described, the system downgrades the quality of its outgoing video. When the video downgrades, an alarm is generated, and an onscreen icon and message display indicating that the quality has been reduced. The video quality is downgraded by reducing its motion handling (by applying a higher compression factor to the motion), but the resolution is not affected. For example, if the meeting runs at 1080p-Best, it downgrades to 1080p-Good. If the meeting runs at 720p-Best, it downgrades to 720p-Good.
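Taken together, the loss behavior can be paraphrased as follows. This sketch restates the documented thresholds; it is not the codec's code:

# Simplified restatement of the documented loss thresholds.
def loss_actions(pct_10s_avg, pct_60s_avg, hold_timer_active):
    """Map the 10-second and 60-second loss averages to the documented actions."""
    actions = []
    if pct_10s_avg > 1:
        actions.append("syslog alarm + SNMP trap")
        if not hold_timer_active:
            actions.append("onscreen message for 15 s; start 5-minute hold timer")
    if pct_10s_avg > 10:
        actions.append("second syslog alarm + SNMP trap")
    if pct_60s_avg > 10:
        actions.append("downgrade motion handling, e.g. 1080p-Best to 1080p-Good")
    return actions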
Tip 
Ten percent packet loss can be measured as 1 out of every 10 packets lost, in which case every inter-reference frame would be impacted, causing the video to completely freeze; or it can be measured as 10 packets consecutively lost followed by 90 packets consecutively received, which would have a much less severe effect on the video quality. For example, 10 percent packet loss due to a duplex mismatch, in which packets are dropped consistently and evenly, would have a much more severe effect than 10 percent packet loss due to a queue in the network tail-dropping several packets in a burst and then forwarding the remaining packets.
Finally, if loss equals 100 percent for greater than 30 seconds, the codec hangs up the call. If the packets begin flowing again at any time before the 30-second timer expires, the codec immediately recovers.