Cisco TelePresence: Audio Encoding

The microphones used in Cisco TelePresence are purposefully designed to capture the sounds emanating from a human subject sitting within a few feet of the microphone, along with the regular background noises that accompany that person in the room. At the same time, they filter out certain unwanted frequency ranges (such as the high-frequency whirr of spinning fans in laptop computers or the low-frequency hum of heating and ventilation systems) and electromagnetic interference (such as GSM/GPRS cellular signals).

The center (primary) Cisco TelePresence codec has four microphone input ports: three for the Cisco TelePresence microphones and one auxiliary audio input. The Cisco TelePresence microphones use a proprietary 6-pin Mini-XLR connector. The auxiliary audio input is a standard 3.5 mm (1/8-inch) mini-stereo connector, which lets users connect the sound card of their PC alongside the VGA video input discussed in the previous sections.

On single-screen systems such as the CTS-1000 and CTS-500, only the center microphone input and the auxiliary audio input are used. On multiscreen systems, such as the CTS-3000 and CTS-3200, the left and right inputs are also used.

Each audio input is encoded autonomously, resulting in up to four discrete channels of audio. This is superior to most other systems on the market that mix all the microphone inputs into a single outgoing channel. By maintaining the channels separately, Cisco TelePresence can maintain the directionality and spatiality of the sound. If the sound emanates from the left, it will be captured by the left microphone and reproduced by the left speaker on the other end. If it emanates from the right, it will be captured by the right microphone and reproduced by the right speaker on the other end.

AAC-LD Compression Algorithm

Cisco TelePresence uses an audio encoding technology known as Advanced Audio Coding–Low Delay (AAC-LD). AAC is a wideband audio coding algorithm designed to be the successor of the MP3 format and is standardized by the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG). It is specified both as Part 7 of the MPEG-2 standard and Part 3 of the MPEG-4 standard. As such, it can be referred to as MPEG-2 Part 7 or MPEG-4 Part 3, depending on its implementation; however, it is most often referred to as MPEG-4 AAC, or AAC for short.

AAC-LD (Low Delay) bridges the gap between the AAC codec, which is designed for high-fidelity applications such as music, and International Telecommunication Union (ITU) speech encoders such as G.711 and G.722, which are designed for speech. AAC-LD combines the advantages of high-fidelity encoding with the low delay necessary for real-time, bidirectional communications.

Sampling Frequency and Compression Ratio

The AAC-LD standard allows for a wide range of sample frequencies (8 kHz to 96 kHz). Cisco TelePresence implements AAC-LD at 48 kHz sampling frequency. This means that the audio is sampled 48,000 times per second, per channel. These samples are then encoded and compressed to 64 kbps, per channel, resulting in a total bandwidth of 128 kbps for single-screen systems (two channels) and 256 kbps for multiscreen systems (four channels).
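The bandwidth figures above are simple multiplication, and a rough compression ratio can be estimated from them. The following is a minimal sketch, not a description of the codec's internals: the 48 kHz sample rate and 64 kbps per channel come from the text, while the 16-bit sample depth is an assumption made only for illustration, and all function names are hypothetical.

```python
SAMPLE_RATE_HZ = 48_000          # from the text: 48 kHz sampling per channel
ENCODED_KBPS_PER_CHANNEL = 64    # from the text: 64 kbps per encoded channel
ASSUMED_BITS_PER_SAMPLE = 16     # assumption: sample depth is not stated in the text


def total_audio_kbps(channels: int) -> int:
    """Aggregate encoded audio bandwidth for the given number of channels."""
    return channels * ENCODED_KBPS_PER_CHANNEL


def raw_pcm_kbps(bits_per_sample: int = ASSUMED_BITS_PER_SAMPLE) -> float:
    """Uncompressed PCM bit rate for a single channel, in kbps."""
    return SAMPLE_RATE_HZ * bits_per_sample / 1000


def compression_ratio(bits_per_sample: int = ASSUMED_BITS_PER_SAMPLE) -> float:
    """Ratio of raw PCM bit rate to encoded bit rate for one channel."""
    return raw_pcm_kbps(bits_per_sample) / ENCODED_KBPS_PER_CHANNEL


if __name__ == "__main__":
    print("single-screen (2 channels):", total_audio_kbps(2), "kbps")
    print("multiscreen (4 channels):", total_audio_kbps(4), "kbps")
    print("raw PCM per channel:", raw_pcm_kbps(), "kbps")
    print("approximate compression ratio:", compression_ratio())
```

Under the 16-bit assumption, each channel carries 768 kbps of raw audio before encoding, so 64 kbps corresponds to a ratio of about 12:1. A different sample depth would change the ratio but not the 128 kbps and 256 kbps totals.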

Automatic Gain Control and Microphone Calibration

Automatic Gain Control (AGC) is an adaptive algorithm that dynamically adjusts the input gain of the microphones to accommodate varying input signal levels. Whether people are sitting close to the microphones or far away, speaking in soft voices or yelling, or any combination in between, the microphones have to continuously adapt to keep the audio sounding lifelike and at the correct decibel levels to reproduce the sense of distance and directionality at the far end.
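To make the idea of adaptive gain concrete, here is a minimal sketch of a generic feedback AGC loop. It is not Cisco's proprietary algorithm, which is not described in this article. The function names, the target level, the smoothing factor, and the gain cap are all assumptions chosen for illustration, and each frame is assumed to be a non-empty list of samples in the range -1.0 to 1.0.

```python
import math


def rms(frame):
    """Root-mean-square level of one frame of samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def apply_agc(frames, target_rms=0.1, smoothing=0.2, max_gain=10.0):
    """Scale each frame so its level drifts toward target_rms.

    All parameter values are illustrative assumptions, not Cisco settings.
    """
    gain = 1.0
    output = []
    for frame in frames:
        level = rms(frame)
        if level > 1e-6:  # leave the gain alone on near-silent frames
            desired = min(target_rms / level, max_gain)
            # Move a fraction of the way toward the desired gain each frame
            # so the level changes gradually instead of jumping.
            gain += smoothing * (desired - gain)
        output.append([s * gain for s in frame])
    return output


if __name__ == "__main__":
    quiet_talker = [[0.01] * 480 for _ in range(50)]
    loud_talker = [[0.5] * 480 for _ in range(50)]
    print("quiet in, level out:", round(rms(apply_agc(quiet_talker)[-1]), 3))
    print("loud in, level out:", round(rms(apply_agc(loud_talker)[-1]), 3))
```

In this sketch a quiet talker and a loud talker both settle near the same output level after a few dozen frames. A production system would also need to handle attack and release times separately and coordinate gain across microphones, which this example does not attempt.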

Keeping multiple discrete microphones autonomous and yet collectively synchronized so that the entire room is calibrated is no small task. Cisco TelePresence uses advanced, proprietary techniques to dynamically calibrate the microphones to the room and relative to each other. This is more complex for Cisco TelePresence than for other implementations because the microphones must remain discrete and autonomous from each other. Doing so preserves the notion of location, which is critical to the proper operation of multipoint switching, in which the active speaker switches in on the appropriate screen.

For example, if a person is sitting in the center segment of the room but facing the left wall when she talks, her speech reaches both the left and center microphones. The system must be smart enough to detect which microphone is closest to the source and switch to the correct camera (in this case the center camera), while playing the sound out of both speakers on the other end to retain the sense of distance and directionality of the audio. It does this by ranking each channel on a scale of 0 to 100. In this scenario, the person's speech might be ranked 80 at the center microphone and 45 at the left microphone. These two microphone inputs are independently encoded and transported to the other end, where they are played out of the center and left speakers at the appropriate decibel levels so that the people on the other end get the sense of distance and directionality. Because the center microphone ranked higher than the left microphone, the center camera is the one triggered.
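The ranking example can be sketched in a few lines. This is only an illustration of the selection logic described above, assuming each channel already has a 0 to 100 rank. How Cisco TelePresence actually computes the ranks is proprietary and not covered here. The function names, the segment labels, and the playback threshold are hypothetical.

```python
from typing import Dict, List


def active_segment(ranks: Dict[str, int]) -> str:
    """Return the segment whose microphone has the highest rank.

    The camera for this segment is the one that would be switched in.
    On a tie, the first segment listed wins (an arbitrary choice here).
    """
    for segment, rank in ranks.items():
        if not 0 <= rank <= 100:
            raise ValueError(f"rank for {segment!r} must be between 0 and 100")
    return max(ranks, key=ranks.get)


def speakers_in_use(ranks: Dict[str, int], threshold: int = 1) -> List[str]:
    """Segments whose channel carries audible speech.

    Each channel plays out of the matching far-end speaker. The threshold
    is an assumed cutoff, not a documented value.
    """
    return [segment for segment, rank in ranks.items() if rank >= threshold]


if __name__ == "__main__":
    # The scenario from the text: center talker facing the left wall.
    ranks = {"left": 45, "center": 80, "right": 0}
    print("camera:", active_segment(ranks))
    print("speakers:", speakers_in_use(ranks))
```

With the ranks from the scenario, the center segment is selected for the camera while both the left and center channels remain audible, which matches the behavior described in the text.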
