TelePresence Packet Rates



A single TelePresence camera configured for 1080p best resolution can send up to approximately 4 Mbps of video traffic at peak rate. With an average of 1100 bytes per packet, this yields approximately 455 packets per second. However, with normal motion, each camera typically generates between 3 Mbps and 3.5 Mbps of video, yielding a packet rate of between 340 and 400 packets per second. Likewise, a single TelePresence microphone generates a voice packet of roughly 220 bytes every 20 msec, which yields a voice packet rate of approximately 50 packets per second. The audio rate is fixed regardless of the amount of speaking going on within the meeting. Therefore, under normal motion, a TelePresence endpoint with one camera and one microphone (such as the CTS-1000 or CTS-500) typically generates approximately 390 to 450 packets per second. Likewise, under normal motion, a TelePresence endpoint with three cameras and three microphones (such as the CTS-3000) typically generates approximately 1170 to 1350 packets per second. Note that this does not include the use of the auxiliary video and audio inputs, voice and video interoperability with legacy systems, or any signaling and management traffic, all of which increase the packet rate.
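The arithmetic above can be summarized in a short sketch. This is a minimal illustration using only the per-packet and bit-rate figures quoted in this section; the constants are averages, not guarantees.

    # Packet-rate arithmetic for TelePresence media streams (figures from the text)
    AVG_VIDEO_PACKET_BYTES = 1100   # average video packet size
    AUDIO_INTERVAL_SEC = 0.020      # one ~220-byte voice packet every 20 msec

    def video_pps(bitrate_bps, packet_bytes=AVG_VIDEO_PACKET_BYTES):
        """Packets per second for a video stream at a given bit rate."""
        return bitrate_bps / (packet_bytes * 8)

    audio_pps = 1 / AUDIO_INTERVAL_SEC                   # fixed at 50 pps

    print(video_pps(4_000_000))                          # peak: ~455 pps
    print(video_pps(3_000_000), video_pps(3_500_000))    # normal motion: ~340-400 pps
    print(video_pps(3_000_000) + audio_pps)              # CTS-1000 low end: ~390 pps
    print(3 * (video_pps(3_500_000) + audio_pps))        # CTS-3000 high end: ~1350 pps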

Multiplexing

The underlying transport protocol for RTP is not specified within RFC 3550. However, RTP and RTCP are typically implemented over UDP. When implemented over UDP, the port range of 16384 to 32767 is often used. RTP streams typically use the even-numbered ports, with their accompanying RTCP streams using the next higher odd-numbered port. In this case, an RTP session is identified by a unique destination transport address pair that consists of a network address plus a pair of ports for RTP and RTCP.
Cisco TelePresence endpoints are capable of both sending and receiving multiple audio and video streams. The primary codec multiplexes these streams together into a single audio RTP stream and a single video RTP stream. Multiplexing is accomplished through the use of a different Synchronization Source Identifier (SSRC) for each video camera and each audio microphone, including the auxiliary inputs. In the case of Cisco TelePresence, SSRCs indicate not only the source position of the audio or video media, but also the destination position of the display or speaker for which the media is intended. Therefore, the receiving primary codec can demultiplex the RTP session and send the individual RTP streams to the appropriate display or speaker. This is how spatial audio is achieved with TelePresence and how video display positions are maintained throughout a TelePresence meeting.
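The following sketch illustrates the demultiplexing idea: packets from one RTP session are routed to a display or speaker position based on their SSRC. The SSRC values and the dictionary-based packet representation are purely illustrative; the actual identifiers are negotiated by the endpoints.

    # Hypothetical SSRC-to-position mapping for one multiplexed RTP session
    ssrc_to_position = {
        0x1001: "left",
        0x1002: "center",
        0x1003: "right",
        0x1004: "aux",
    }

    def demultiplex(packets):
        """Group a session's packets by the position encoded in their SSRC."""
        streams = {}
        for pkt in packets:                  # each pkt is a parsed RTP packet
            position = ssrc_to_position.get(pkt["ssrc"], "unknown")
            streams.setdefault(position, []).append(pkt)
        return streams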
When endpoints join a multipoint call, they first attempt to exchange RTCP packets. Successful exchange of these packets indicates the opposite endpoint is a Cisco TelePresence device, capable of supporting various Cisco extensions. Among other things, these extensions determine the number of audio and video channels each TelePresence endpoint can send and receive and their positions. Audio and video streams are sent and received based on their position within the CTS endpoint, which is then mapped to a corresponding SSRC. Figure 1 shows an example for a three-screen endpoint, such as a CTS-3000, communicating with a CTMS.

Figure 1: Audio and video positions for three-screen TelePresence endpoints
 
Each CTS-3000 or CTS-3200 can transmit (and receive) up to four audio streams and four video streams from the left, center, right, and auxiliary positions. These correspond to the left, center, and right cameras, microphones, and the auxiliary input. Therefore, in a point-to-point meeting between CTS-3000s or CTS-3200s, there can be as many as four audio SSRCs multiplexed together in a single audio RTP stream and four video SSRCs multiplexed together in a single video RTP stream sent from each endpoint.
As Figure 1 illustrates, the CTMS can transmit up to four video streams, corresponding to the left, center, and right displays of the CTS-3000 or CTS-3200, and either a projector or monitor connected to the auxiliary video output. However, the CTMS transmits only up to three audio streams, corresponding to the left, center, and right speaker positions of the CTS-3000 or CTS-3200. Audio sent by an originating CTS-3000 or CTS-3200 toward the auxiliary position is redirected to one of the three speaker positions of the destination CTS-3000 or CTS-3200 by the CTMS. The CTMS chooses the three loudest audio streams to send to the remote CTS-3000 or CTS-3200 when there are more than three streams with audio energy.
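The selection behavior described above amounts to a top-N filter on audio energy. The following is a conceptual sketch; the position names and energy values are invented for illustration.

    def select_loudest(streams, limit=3):
        """Forward only the `limit` streams with the highest audio energy."""
        active = [s for s in streams if s["energy"] > 0]
        return sorted(active, key=lambda s: s["energy"], reverse=True)[:limit]

    streams = [
        {"position": "left", "energy": 45},
        {"position": "center", "energy": 80},
        {"position": "right", "energy": 10},
        {"position": "aux", "energy": 30},
    ]
    print(select_loudest(streams))   # center, left, and aux are forwarded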
Figure 2 shows the audio and video positions for a multipoint call consisting of CTS-1000s or CTS-500s.

Figure 2: Audio and video positions for single-screen TelePresence endpoints
 
Each CTS-1000 or CTS-500 can transmit up to two audio streams and two video streams from the center and auxiliary positions. These correspond to the single camera and microphone of the CTS-1000 or CTS-500, and the auxiliary input. Therefore, in a point-to-point meeting between CTS-1000s or CTS-500s there can be as many as two audio SSRCs multiplexed together in a single audio RTP stream and two video SSRCs multiplexed together in a single video RTP stream sent from each endpoint.
As Figure 2 illustrates, the CTMS can still transmit up to three audio streams, corresponding to the left, center, and right microphone positions of the CTS-3000 or CTS-3200, even though the CTS-1000 or CTS-500 has only a single speaker. The CTS-1000 or CTS-500 mixes the audio from each of the three positions to play out on its single speaker. The CTS-1000 or CTS-500 can receive only up to two video streams, corresponding to the center display and the picture-in-picture auxiliary video output.

RTP Control Protocol

The RTP Control Protocol (RTCP) provides four main functions for RTP sessions:
  • Provides feedback on the quality of the distribution of RTP packets from the source.
  • Carries a persistent transport-level identifier called the canonical name (CNAME), which associates multiple data streams from a given participant in a set of RTP sessions.
  • Provides a feedback mechanism to scale the actual use of RTCP itself. Because all participants send RTCP packets, each participant can estimate the total number of participants from the rate at which RTCP packets arrive and can scale back its own RTCP transmission rate so that the network is not overwhelmed by RTCP packets (see the sketch after this list).
  • Can be optionally used to convey minimal session control information between participants.
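The scaling behavior in the third bullet can be sketched as follows. This is a simplified reading of the RFC 3550 interval calculation, assuming a 5 percent RTCP bandwidth fraction; the randomization and the sender/receiver bandwidth split defined in the RFC are omitted.

    RTCP_FRACTION = 0.05        # portion of session bandwidth reserved for RTCP
    MIN_INTERVAL_SEC = 5.0      # recommended minimum report interval

    def rtcp_interval(participants, avg_rtcp_packet_bytes, session_bw_bps):
        """More participants -> each one sends RTCP reports less often."""
        rtcp_bytes_per_sec = (session_bw_bps * RTCP_FRACTION) / 8
        interval = participants * avg_rtcp_packet_bytes / rtcp_bytes_per_sec
        return max(interval, MIN_INTERVAL_SEC)

    print(rtcp_interval(4, 120, 4_000_000))     # small session: the 5-sec floor applies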
Several different RTCP packet types convey the preceding information. These include Sender Report (SR), Receiver Report (RR), Source Description (SDES), and Application-Specific (APP) packets, among others. Typically, RTCP sends compound packets containing combinations of these reports in a single packet.
As mentioned previously, each RTP stream has an accompanying RTCP stream. These RTCP streams can be sent using the next higher odd-numbered port or optionally multiplexed with the RTP streams themselves. Initial TelePresence software versions multiplexed the RTCP packets within the RTP streams. For interoperability, current TelePresence software versions support both options.
RTCP signals packet loss within TelePresence deployments, causing the sender to send a new reference frame (IDR) to resynchronize the video transmission. Additionally, RTCP is used between TelePresence endpoints to inform each other of the number of audio and video channels they are capable of supporting. The number of channels corresponds to the number of displays, cameras, microphones, and speakers supported by the particular endpoint.
Finally, the Cisco TelePresence Multipoint Switch (CTMS) uses RTCP packets to perform session control within multipoint calls. The CTMS informs TelePresence endpoints that do not have active speakers to stop transmitting video. When a particular table segment or room has an active speaker, the CTMS detects this through the audio energy and informs the table segment or room, through RTCP packets, to send video.
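The packet-loss recovery behavior described above can be outlined conceptually as follows. This sketch assumes in-order delivery and uses an invented feedback message; it is not the actual Cisco or RFC-defined encoding.

    def detect_loss(last_seq, seq):
        """RTP sequence numbers are 16-bit and wrap; any jump > 1 means loss."""
        return seq != (last_seq + 1) % 65536

    def on_rtp_packet(state, seq, send_feedback):
        # Reordering handling is omitted for brevity
        if state.get("last_seq") is not None and detect_loss(state["last_seq"], seq):
            send_feedback({"type": "IDR_REQUEST"})   # ask sender for a new reference frame
        state["last_seq"] = seq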

Real-Time Transport Protocol: Encoding and Packetization



The preceding content covered the basic concepts of video and audio inputs and encoding processes. The video is encoded using H.264, and the audio is encoded using AAC-LD. Now those video and audio samples must be turned into IP packets and sent onto the network to be transported to the other end. This is done using the Real-Time Transport Protocol (RTP).
RTP is a network protocol specifically designed to provide transport services for real-time applications such as interactive voice and video. The services it provides include identification of payload type, sequence numbering, timestamps, and monitoring of the delivery of RTP packets through the RTP control protocol (RTCP). RTP and RTCP are both specified in IETF RFC 3550.

RTP Packet Format

RTP defines a standard packet format for delivering the media, as shown in Figure 1.

Figure 1: RTP packet format
The following are the fields within the RTP packet:
  • Version (V): A 2-bit field indicating the protocol version. The current version is 2.
  • Padding (P): A 1-bit field indicating padding at the end of the RTP packet.
  • Extension Header (X): A 1-bit field indicating the presence of an optional extension header.
  • CSRC Count (CC): A 4-bit field indicating the number of Contributing Source (CSRC) identifiers that follow the fixed header. CSRC identifiers are present only when inserted by an RTP mixer, such as a conference bridge or transcoder.
  • Marker (M): A 1-bit field that marks significant events such as frame boundaries.
  • Payload Type (PT): A 7-bit field that identifies the format of the RTP payload.
  • Sequence Number: A 16-bit field that increments by one for each RTP packet sent. The receiver uses this field to identify lost packets.
  • Timestamp: A 32-bit timestamp field that reflects the sampling instant of the first octet of the RTP packet.
  • Synchronization Source Identifier (SSRC): A 32-bit field that uniquely identifies the source of a stream of RTP packets.
  • Contributing Source Identifiers (CSRC): Variable length field that contains a list of sources of streams of RTP packets that have contributed to a combined stream produced by an RTP mixer. You can use this to identify the individual speakers when a mixer combines streams in an audio or video conference.
  • RTP Extension (Optional): Variable-length field that contains a 16-bit profile-specific identifier and a 16-bit length field, followed by variable-length extension data. Intended for limited use.
  • RTP Payload: Variable-length field that holds the real-time application data (voice, video, and so on).
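As a concrete illustration of the layout above, the following minimal parser unpacks the 12-byte fixed header and the optional CSRC list using only the Python standard library. It is a sketch that ignores the optional extension header and any padding.

    import struct

    def parse_rtp_header(data: bytes) -> dict:
        if len(data) < 12:
            raise ValueError("shorter than the 12-byte fixed RTP header")
        b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", data[:12])
        header = {
            "version": b0 >> 6,             # V: should be 2
            "padding": (b0 >> 5) & 0x1,     # P
            "extension": (b0 >> 4) & 0x1,   # X
            "csrc_count": b0 & 0x0F,        # CC
            "marker": b1 >> 7,              # M: marks events such as frame boundaries
            "payload_type": b1 & 0x7F,      # PT
            "sequence": seq,
            "timestamp": timestamp,
            "ssrc": ssrc,
        }
        # The CSRC list (4 bytes per entry) follows the fixed header
        end = 12 + 4 * header["csrc_count"]
        header["csrc"] = [struct.unpack("!I", data[i:i + 4])[0] for i in range(12, end, 4)]
        return header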

Frames Versus Packets

To understand the role of RTP within TelePresence, you need to understand the behavior of voice and video over a network infrastructure. Figure 2 shows a sample comparison of voice and video traffic as it appears on a network.


Figure 2: Comparison of voice and video on the network
 
As you can see in Figure 2, voice appears as a series of audio samples, spaced at regular intervals. In the case of Cisco TelePresence, voice packets are typically sent every 20 msec. Each packet contains one or more encoded samples of the audio, depending on the encoding algorithm used and how many samples per packet it is configured to use. G.711 and G.729, two of the most common VoIP encoding algorithms, typically include two voice samples per packet. The sizes of the voice packets are consistent, averaging slightly over 220 bytes for a 2-sample G.711 packet; therefore, the overall characteristic of voice is a constant bit-rate stream.
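The 220-byte figure can be reproduced with simple arithmetic. The sketch below uses the standard G.711 rate and IP/UDP/RTP header sizes; link-layer framing (Ethernet, and possibly an 802.1Q tag) accounts for the remaining bytes up to roughly 220.

    G711_RATE_BPS = 64_000
    SAMPLE_MS = 10                       # one G.711 sample (frame) covers 10 msec
    SAMPLES_PER_PACKET = 2               # 20 msec of audio per packet

    payload = (G711_RATE_BPS // 8) * SAMPLE_MS * SAMPLES_PER_PACKET // 1000
    l3_packet = payload + 12 + 8 + 20    # RTP + UDP + IPv4 headers

    print(payload)      # 160 bytes of encoded audio
    print(l3_packet)    # 200 bytes at the IP layer; framing brings it near 220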
Video traffic appears as a series of video frames spaced at regular intervals. In the case of Cisco TelePresence, video frames are sent approximately every 33 msec. The size of each frame varies based on the amount of changes since the previous frame. Therefore, the overall characteristic of TelePresence video is a relatively bursty, variable bit-rate stream.
A video frame can also be referred to as an Access Unit in H.264 terminology. The H.264 standard defines two layers, a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). Figure 3 shows a simplified example.

 
Figure 3: Mapping TelePresence video into RTP packets
The VCL is responsible for encoding the video. Its output is a string of bits representing the encoded video. The function of the NAL is to map the string of bits into units that can then be transported across a network infrastructure. IETF RFC 3984 defines the format for H.264 video carried within the payload of the RTP packets. Each video frame consists of multiple RTP packets spaced out over the frame interval. The boundary of each video frame is indicated through the use of the marker bit, as shown in Figure 1. Each RTP packet contains one or more NAL Units (NALU), depending upon the packet type: single NAL unit packet, single-time or multi-time aggregation packet, or fragmentation unit (part of a NALU). Each NALU consists of an integer number of bytes of coded video.
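Because packets that share an RTP timestamp belong to the same frame and the marker bit flags the frame's final packet, a receiver can delimit frames with logic like the following sketch (which assumes packets have already been put back in sequence order).

    def collect_frames(packets):
        """Yield one list of RTP packets per complete video frame."""
        frame = []
        for pkt in packets:          # assumed already in sequence-number order
            frame.append(pkt)
            if pkt["marker"]:        # marker bit set: frame boundary reached
                yield frame
                frame = []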
Note 
RTP packets within a single video frame and across multiple frames are not necessarily independent of each other. In other words, if one packet within a video frame is discarded, it affects the quality of the entire video frame and might affect the quality of other video frames as well.

TelePresence Video Packet and Frame Sizes

The sizes of individual RTP packets within frames vary, depending upon the number of NALUs they carry and the sizes of those NALUs. Overall, packet sizes average 1100 bytes for Cisco TelePresence video. The number of packets per frame also varies considerably, based upon how much information is contained within the video frame. This is partially determined by how the video is encoded: as reference frames or as inter-reference frames.
Note 
Coding is actually done at the macroblock layer. An integer number of macroblocks then form a slice, and multiple slices form a frame. Therefore, technically slices are intrapredicted (I-slices) or interpredicted (P-slices).
Compression of reference frames is typically only moderate because only spatial redundancy within the frame is eliminated. Therefore, reference frames tend to be much larger than inter-reference frames; reference frames of 64 KB to 65 KB (approximately 60 individual packets) have been observed with TelePresence endpoints. Inter-reference frames achieve much higher compression because only the difference between the frame and the reference frame is sent. This information is typically sent in the form of motion vectors indicating the relative motion of objects from the reference frame. The size of TelePresence inter-reference frames depends upon the amount of motion within the conference call. Under normal motion, TelePresence inter-reference frames tend to average 13 KB in size and typically consist of approximately 12 individual packets. Under high motion, they can be approximately 19 KB in size and consist of approximately 17 individual packets. From a bandwidth utilization standpoint, much better performance can be achieved by sending reference frames infrequently. During normal operation, Cisco TelePresence codecs send reference frames (IDRs) only once every several minutes in point-to-point calls to reduce the burstiness of the video and lower the overall bit rate.
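The frame-to-packet figures above follow directly from the 1100-byte average packet size, as this back-of-the-envelope sketch shows.

    AVG_PACKET_BYTES = 1100

    def packets_per_frame(frame_bytes):
        # Round up: a final partial packet still goes on the wire
        return -(-frame_bytes // AVG_PACKET_BYTES)

    print(packets_per_frame(65_000))   # reference (IDR) frame: ~60 packets
    print(packets_per_frame(13_000))   # normal-motion inter frame: ~12 packets
    print(packets_per_frame(19_000))   # high-motion inter frame: ~17-18 packets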

Audio Encoding


The microphones used in Cisco TelePresence are purposefully designed to capture the sounds emanating from a human subject sitting within a few feet of the microphone, along with the normal background noises of the room in which that person sits, while filtering out certain unwanted frequency ranges (such as the high-frequency whirrs of spinning fans in laptop computers or the low-frequency hums of heating and ventilation systems) and electrostatic interference (such as GSM/GPRS cellular signals).
The center (primary) Cisco TelePresence codec has four microphone input ports: three for the Cisco TelePresence microphones and one auxiliary audio input. The Cisco TelePresence microphones use a proprietary 6-pin Mini-XLR connector. The auxiliary audio input is a standard 3.5 mm (1/8-inch) mini-stereo connector, which enables the users to connect the audio sound card of their PC along with the VGA video input discussed in the previous sections.
On single-screen systems such as the CTS-1000 and CTS-500, only the center microphone input and the auxiliary audio input are used. On multiscreen systems, such as the CTS-3000 and CTS-3200, the left and right inputs are also used.
Each audio input is encoded autonomously, resulting in up to four discrete channels of audio. This is superior to most other systems on the market that mix all the microphone inputs into a single outgoing channel. By maintaining the channels separately, Cisco TelePresence can maintain the directionality and spatiality of the sound. If the sound emanates from the left, it will be captured by the left microphone and reproduced by the left speaker on the other end. If it emanates from the right, it will be captured by the right microphone and reproduced by the right speaker on the other end.

AAC-LD Compression Algorithm

Cisco TelePresence uses the latest audio encoding technology known as Advanced Audio Coding–Low Delay (AAC-LD). AAC is a wideband audio coding algorithm designed to be the successor of the MP3 format and is standardized by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG). It is specified both as Part 7 of the MPEG-2 standard and Part 3 of the MPEG-4 standard. As such, it can be referred to as MPEG-2 Part 7 and MPEG-4 Part 3, depending on its implementation; however, it is most often referred to as MPEG-4 AAC, or AAC for short.
AAC-LD (Low Delay) bridges the gap between the AAC codec, which is designed for high-fidelity applications such as music, and International Telecommunication Union (ITU) speech encoders such as G.711 and G.722, which are designed for speech. AAC-LD combines the advantages of high-fidelity encoding with the low delay necessary for real-time, bidirectional communications.

Sampling Frequency and Compression Ratio

The AAC-LD standard allows for a wide range of sample frequencies (8 kHz to 96 kHz). Cisco TelePresence implements AAC-LD at 48 kHz sampling frequency. This means that the audio is sampled 48,000 times per second, per channel. These samples are then encoded and compressed to 64 kbps, per channel, resulting in a total bandwidth of 128 kbps for single-screen systems (two channels) and 256 kbps for multiscreen systems (four channels).
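The channel arithmetic is straightforward, as the following sketch shows using the rates quoted above.

    BITRATE_PER_CHANNEL_BPS = 64_000     # AAC-LD encoded rate per channel

    def audio_bandwidth(channels):
        return channels * BITRATE_PER_CHANNEL_BPS

    print(audio_bandwidth(2))   # single-screen systems (2 channels): 128 kbps
    print(audio_bandwidth(4))   # multiscreen systems (4 channels): 256 kbps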

Automatic Gain Control and Microphone Calibration

Automatic Gain Control (AGC) is an adaptive algorithm that dynamically adjusts the input gain of the microphones to adapt to varying input signal levels. Whether the people are sitting close to the microphones or far away, speaking in soft voices or yelling, or any combination in between, the microphones have to continuously adapt to keep the audio sounding lifelike and at the correct decibel levels to reproduce the sense of distance and directionality at the far end.
Keeping multiple discrete microphones autonomous and yet collectively synchronized so that the entire room is calibrated is no small task. Cisco TelePresence uses advanced, proprietary techniques to dynamically calibrate the microphones to the room and relative to each other. This is more complex for Cisco TelePresence than for other implementations because the microphones need to be kept discrete and autonomous from each other. This preserves the notion of location, which is critical to the proper operation of multipoint switching, in which the active speaker switches in on the appropriate screen.
For example, if a person is sitting in the center segment of the room but facing the left wall when she talks, the speech emanating from her hits both the left and center microphones. The system must be smart enough to detect which microphone is closest to the source and switch to the correct camera (in this case, the center camera), while playing the sound out of both speakers on the other end to retain the sense of distance and directionality of the audio. It does this by assigning each channel a rank on a 0 to 100 scale. In this scenario, the speech emanating from the person might be ranked an 80 at the center microphone and a 45 at the left microphone. These two microphone inputs are independently encoded and transported to the other end, where they are played out of both the center and right speakers at the appropriate decibel levels so that the people on the other end get the sense of distance and directionality. However, because the center microphone had a higher rank than the left microphone, the correct camera would be triggered (in this case, the center camera).
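Conceptually, the camera selection reduces to picking the position of the highest-ranked microphone while all channels continue to be encoded and sent independently. The sketch below mirrors the example's 0 to 100 rankings; the actual calibration and switching logic is proprietary.

    def select_camera(mic_energy):
        """Switch to the camera at the position of the highest-ranked microphone."""
        return max(mic_energy, key=mic_energy.get)

    mic_energy = {"left": 45, "center": 80, "right": 0}
    print(select_camera(mic_energy))   # "center": the center camera is triggered
    # Every channel is still encoded and transported separately, so the far end
    # can play each out at the appropriate level to preserve directionality.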

Video Encoding

When the video coming from the cameras is presented at the HDMI inputs of the codecs, the video passes through the DSP array to be encoded and compressed using the H.264 encoding and compression algorithm. The encoding engine within the Cisco TelePresence codec derives its clock from the camera input, so the video is encoded at 30 frames per second (30 times a second, the camera passes a video frame to the codec to be encoded and compressed).

H.264 Compression Algorithm

H.264 is a video encoding and compression standard jointly developed by the Telecommunication Standardization Sector (ITU-T) Video Coding Experts Group (VCEG) and the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG). It was originally completed in 2003, with development of additional extensions continuing through 2007 and beyond.
H.264 is equivalent to, and also known as, MPEG-4 Part 10, or MPEG-4 AVC (Advanced Video Coding). These standards are jointly maintained so that they have identical technical content and are therefore synonymous. Generally speaking, PC-based applications such as Microsoft Windows Media Player and Apple QuickTime refer to it as MPEG-4, whereas real-time, bidirectional applications such as video conferencing and telepresence refer to it as H.264.

H.264 Profiles and Levels

The H.264 standard defines a series of profiles and levels, with corresponding target bandwidths and resolutions. Basing its development of the Cisco TelePresence codec upon the standard, while using the latest digital signal processing hardware technology and advanced software techniques, Cisco developed a codec that could produce 1080p resolution (1920x1080) at a bit rate of under 4 Mbps. One of the key ingredients used to accomplish this level of performance was Context-Adaptive Binary Arithmetic Coding (CABAC).

CABAC

CABAC is a method of entropy coding that provides considerably better compression but is extremely computationally expensive and hence requires considerable processing power to encode and decode. CABAC is fully supported by the H.264 standard, but Cisco TelePresence, with its advanced array of Digital Signal Processing (DSP) resources, was the first implementation on the market capable of performing this level of computational complexity while maintaining the extremely unforgiving encode and decode times (latency) needed for real-time, bidirectional human interaction.

Resolution

The resolution of an image is defined by the number of pixels contained within the image. It is expressed in the format of the number of pixels wide x the number of pixels high (pronounced x-by-y). The H.264 standard supports myriad video resolutions, ranging from Sub-Quarter Common Interchange Format (SQCIF) (128x96) all the way up to ultra-high-definition resolutions such as Quad Full High Definition (QFHD) or 2160p (3840x2160). Table 1 lists the most common resolutions used by video conferencing and telepresence devices. The resolutions supported by Cisco TelePresence are noted with an asterisk.
Table 1: H.264 Common Resolutions

Name     Width   Height   Aspect Ratio
QCIF     176     144      4:3
CIF*     352     288      4:3
4CIF     704     576      4:3
480p     854     480      16:9
XGA*     1024    768      4:3
720p*    1280    720      16:9
1080p*   1920    1080     16:9

Aspect Ratio

The aspect ratio of an image is its width divided by its height. Aspect ratios are typically expressed in the format x:y (pronounced x-by-y, such as “16 by 9”). The two most common aspect ratios in use by video conferencing and telepresence devices are 4:3 and 16:9. Standard-definition displays are 4:3 in shape, whereas high-definition displays provide a widescreen format of 16:9. As shown in Table 1, Cisco TelePresence supports CIF and XGA, which are designed to be displayed on standard-definition displays, along with 720p and 1080p, which are designed to be displayed on high-definition displays. The Cisco TelePresence system uses high-definition format displays that are 16:9 in shape, so when a CIF or XGA resolution is displayed on the screen, it is surrounded by black borders, as illustrated in Figure 1.

 
Figure 1: 16:9 and 4:3 images displayed on a Cisco TelePresence system

Frame Rate and Motion Handling

The frame rate of the video image constitutes the number of video frames per second (fps) that is contained within the encoded video. Cisco TelePresence operates at 30 frames per second (30 fps). This means that 30 times per second (or every 33 milliseconds) a video frame is produced by the encoder. Each 33 ms period can be referred to as a frame interval.
Motion handling defines the degree of compression within the encoding algorithm to either enhance or suppress the clarity of the video when motion occurs within the image. High motion handling results in a smooth, clear image even when a lot of motion occurs within the video (people walking around, waving their hands, and so on). Low motion handling results in a noticeably choppy, blurry, grainy, or pixelized image when people or objects move around. These concepts have been around since the birth of video conferencing. Nearly all video conferencing and telepresence products on the market offer user-customizable settings to enhance or suppress the clarity of the video when motion occurs within the image. Figure 2 shows a historical reference that many people can relate to: a screenshot of a Microsoft NetMeeting desktop video conferencing client from the 1990s, which provided a slider bar to allow users to choose whether they preferred higher quality (slower frame rate but clearer motion) or faster video (higher frame rate but blurrier motion).

 
Figure 2: Microsoft NetMeeting video quality setting
Cisco TelePresence provides a similar concept. Although the Cisco TelePresence cameras operate at 1080p resolution at 30 Hz (1080p / 30), the encoder within the Cisco TelePresence codec to which the camera attaches can encode and compress the video into either 1080p or 720p resolution at three different motion-handling levels per resolution, providing the customer with the flexibility of deciding how much bandwidth the system should consume. Instead of a sliding scale, Cisco uses the terms Good, Better, and Best. Best motion handling provides the clearest image (and hence uses the most bandwidth), whereas Good motion handling provides the least-clear image (and hence the least bandwidth). The Cisco TelePresence codec also supports the CIF resolution (352x288) for interoperability with traditional video conferencing devices, and the XGA resolution (1024x768) for the auxiliary video channels used for sharing a PC application or a document camera image. Table 2 summarizes the resolutions, frame rates, motion-handling levels, and bit rates supported by the Cisco TelePresence codec.
Table 2: Cisco TelePresence Supported Resolutions, Motion Levels, and Bit Rates

Resolution   Frame Rate   Motion Handling    Bit Rate
CIF          30 fps       Not configurable   768 kbps
XGA          5 fps        Not configurable   500 kbps
XGA          30 fps       Not configurable   4 Mbps
720p         30 fps       Good               1 Mbps
720p         30 fps       Better             1.5 Mbps
720p         30 fps       Best               2.25 Mbps
1080p        30 fps       Good               3 Mbps
1080p        30 fps       Better             3.5 Mbps
1080p        30 fps       Best               4 Mbps
Tip 
Note that Cisco TelePresence always runs at 30 fps (except for the auxiliary PC and document camera inputs that normally run at 5 fps but can also run at 30 fps if the customer wants to enable that feature). Most video conferencing and telepresence providers implement variable frame-rate codecs that attempt to compensate for their lack of encoding horsepower by sacrificing frame rate to keep motion handling and resolution quality levels as high as possible. Cisco TelePresence codecs do not need to do this because they contain so much DSP horsepower. The Cisco TelePresence codec can comfortably produce 1080p resolution at a consistent 30 fps regardless of the amount of motion in the video. The only reason Cisco provides the motion handling (Good, Better, Best) setting is to give the customer the choice of sacrificing motion quality to fit over lower bandwidths.
The video from the Cisco TelePresence cameras and the auxiliary video inputs (PC or document camera) is encoded and compressed by the Cisco TelePresence codec using the H.264 standard. Each input is encoded independently and can be referred to as a channel. (This is discussed further in subsequent sections that detail how these channels are packetized and multiplexed onto the network using the Real-Time Transport Protocol.) On multiscreen systems, such as the CTS-3000 and CTS-3200, each camera is connected to its respective codec, and that codec encodes the camera's video into an H.264 stream. Thus there are three independently encoded video streams. The primary (center) codec also encodes the auxiliary (PC or document camera) video into a fourth, separate H.264 stream.
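Putting the Table 2 figures together gives a rough upper bound on the media payload for a three-screen system. This sketch assumes 1080p Best motion handling on all three cameras, an XGA 5-fps auxiliary channel, and four AAC-LD audio channels at 64 kbps each; network overhead is not included.

    VIDEO_1080P_BEST_BPS = 4_000_000   # per camera stream (Table 2)
    AUX_XGA_5FPS_BPS = 500_000         # auxiliary PC/document camera stream
    AUDIO_PER_CHANNEL_BPS = 64_000     # AAC-LD per channel

    total = (3 * VIDEO_1080P_BEST_BPS      # left, center, and right cameras
             + AUX_XGA_5FPS_BPS            # auxiliary video channel
             + 4 * AUDIO_PER_CHANNEL_BPS)  # three microphones plus auxiliary audio
    print(total / 1e6, "Mbps peak media payload")   # ~12.76 Mbps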

Frame Intervals and Encoding Techniques

Each 33-ms frame interval contains encoded slices of the video image. A reference frame is the largest and least compressed and contains a complete picture of the image. After sending a reference frame, subsequent frames contain only the changes in the image since the most recent reference frame. For example, if a person is sitting in front of the camera, talking and gesturing naturally, the background (usually a wall) behind the person does not change. Therefore, the inter-reference frames need to encode only the pixels within the image that are changing. This allows for substantial compression of the amount of data needed to reconstruct the image. Reference frames are sent at the beginning of a call, or anytime the video is interrupted, such as when the call is placed on hold and the video suspended, and then the call is taken off of hold and the video resumed.

Instantaneous Decode Refresh (IDR) Frames

An IDR frame is a reference frame containing a complete picture of the image. When an IDR frame is received, the decode buffer is refreshed so that all previously received frames are marked as “unused for reference,” and the IDR frame becomes the new reference picture. IDR frames are sent by the encoder at the beginning of the call and at periodic intervals to refresh all the receivers. They can also be requested at any time by any receiver. There are two pertinent examples of when IDRs are requested by receivers:
  • In a multipoint meeting, as different sites speak, the Cisco TelePresence Multipoint Switch (CTMS) switches the video streams to display the active speaker. During this switch, the CTMS sends an IDR request to the speaking endpoint so that all receivers can receive a new IDR reference frame.
  • Whenever packet loss occurs on the network and the packets lost are substantial enough to cause a receiver to lose sync on the encoded video image, the receiver can request a new IDR frame so that it can sync back up.
This concept of reference frames and inter-reference frames results in a highly variable bit rate. When the video traffic on the network is measured over time, peaks and valleys occur in the traffic pattern. (The peaks are the IDR frames, and the valleys are the inter-IDR frames.) It is important to note that Cisco TelePresence does not use a variable frame rate (it is always 30 fps), but it does use a variable bit rate. (The amount of data sent per frame interval varies significantly depending upon the amount of motion within each frame interval.) Figure 3 illustrates what a single stream of Cisco TelePresence 1080p / 30 encoded video traffic looks like over a one-second time interval.

 
Figure 3: 1080p / 30 traffic pattern as viewed over one second
Long-Term Reference Frames

A new technique, implemented at the time this book was authored, is Long-Term Reference (LTR) frames. LTR allows multiple frames to be marked for reference, providing the receiver with multiple points of reference from which to reconstruct the video image. This substantially reduces the need for periodic IDR frames, further reducing the amount of bandwidth needed to maintain picture quality and increasing the efficiency of the encode and decode process. With LTR, the periodic spikes in the traffic pattern illustrated in Figure 3 would be substantially less frequent, resulting in overall efficiencies in bandwidth consumption. It would also enable the receiver to more efficiently handle missing frame data, resulting in substantially higher tolerance to network packet loss.

Camera and Auxiliary Video Inputs


There are three types of video inputs on the codec:
  • The connection from the Cisco TelePresence camera
  • The connection from the user’s PC
  • The connection from the document camera
Although the PC and the document camera are two separate inputs, only one is active at any given time. (Either the PC is displayed, or the document camera video image is displayed.) These latter two inputs will be referred to collectively as the auxiliary video inputs.

Camera Resolution and Refresh Rate (Hz)

The Cisco TelePresence cameras operate at 1080p (1920x1080) resolution with a refresh rate of 30 Hz. The camera sensors encode the video into a digital format and send it down the DVI-HDMI cable to their respective codecs. The left camera is attached to the left secondary codec, the center camera to the center primary codec, and the right camera to the right secondary codec.
Note 
On single-screen systems, such as the CTS-1000 and CTS-500, there is no left or right, only center.

Auxiliary Video Inputs Resolution and Refresh Rate (Hz)

At the time of writing, the PC video input (VGA-DVI) and the document camera (DVI-HDMI) on the Cisco TelePresence codec operate at 1024x768 resolution with a refresh rate of 60 Hz. The PC must be configured to output this resolution and refresh rate on its VGA output interface. Likewise, the document camera must be configured to output this resolution and refresh rate on its DVI output interface. The majority of PCs on the market at the time the product was designed use 1024x768 resolution and VGA interfaces, although an increasing number of models are beginning to support higher resolutions and are beginning to offer DVI and even HDMI interfaces instead of, or in addition to, VGA. Future versions of the Cisco TelePresence codec might support additional resolutions, refresh rates, and interface types for these connections.
Note 
VGA is an analog interface. DVI comes in three flavors: DVI-A, which is analog; DVI-D, which is digital; and DVI-I, which can dynamically sense whether the connected device is using analog (DVI-A) or digital (DVI-D). It is worth mentioning that the first-generation Cisco TelePresence codec offers a DVI-A connector for the PC connection. The other end of the cable that attaches to the PC is VGA, so the signal path from the PC is VGA analog to DVI-A analog.