India Telecom

February 14, 2005
Voice quality measurement in VoIP networks

Though it may seem obvious, voice quality can be expressed and measured only with respect to the talker and the listener. It must be approached from an end-to-end perspective; that is, regardless of the systems, devices, and transmission methods used, any voice quality metric must be expressed in the context of the user's experience. This implies a very specific measurement criterion.

NEW DELHI -- The end-to-end aspect of voice quality is accompanied by the inherently subjective nature of qualitative evaluation. What a listener considers high quality (or, for that matter, low quality) is influenced by that listener's expectations, context, physiology, and mood. For example, questionable quality on high-utility calls (cellular phones, overseas calls, and so on) is often tolerated, while less-than-perfect quality on local calls is often judged quite harshly. The external environment also plays a role. For example, even though the audio signal coming out of your earpiece may be impeccable, there could be so much background noise in your environment that you cannot understand what is being said.

The PSTN has long addressed the voice quality problem by optimizing its circuits for the dynamic range of the human voice and the rhythms of human conversation. Although the PSTN does not produce 'perfect' quality, users have become accustomed to its level of voice quality. Voice quality comparisons of different delivery systems are often made against this baseline, because PSTN voice quality is relatively standard and predictable.

Clarity, delay and echo
To measure voice quality, we need to understand the contributing factors. In the context of VoIP environments, a useful model of voice quality contains three elements: 'clarity' (i.e., fidelity, clearness, lack of distortion), 'delay' (i.e., the time it takes voice signals to pass between talker and listener), and 'echo' (the talker's voice returning to the talker's ear). Generally, for voice quality to be reasonably good, the delay must be relatively short, the clarity must be reasonably good, and the echo must be imperceptible.

Clarity

In the context of voice quality testing, clarity describes the perceptual fidelity, the clearness and the non-distorted nature of a particular voice signal. Clarity can also refer to the intelligibility of a voice signal.

Clarity is one of the most important parameters to measure when evaluating the quality of a voice signal or a voice channel. In voice-over-packet environments, traditional audio measurement techniques do not always provide reliable and meaningful results. On top of that, clarity is, in many ways, a subjective metric, and special measurement techniques are required to express it in objective terms. This subjective character is what makes measuring clarity a unique and interesting challenge. Agilent VQT's Clarity measurements provide a repeatable, accurate, and objective measure of this very subjective phenomenon.

Delay
In the context of voice quality testing, delay is the time it takes a voice signal to travel end-to-end between talker and listener. Delay often manifests itself as an apparent time lag between when a talker speaks and when a listener responds. When barely perceptible delay is present, conversations can seem 'cold' or uncomfortable. When delay is present in larger amounts, the rhythm of conversations is disrupted, resulting in excessive interruptions, misunderstandings, and so forth. In traditional telephony networks (PSTNs), delay is caused by long propagation times, satellite links, and switching operations, but it is usually very small and not noticeable. When it is noticed, the utility of the call (an overseas call, for example) causes users to be somewhat tolerant of its effects.

Delay in voice-over-packet networks
With the emergence of VoIP, delay has become more of an issue. Delay in a VoIP network is primarily caused by the time it takes codecs to digitize and packetize voice signals. Also, because packets exhibit varying arrival times (jitter), compensating network devices, called jitter buffers, add even more delay. Routine router and switch processing adds small amounts of delay as well.
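The contributions listed above add up to a one-way delay budget. The following is a minimal sketch of that sum; the `DelayBudget` class, its field names, and every millisecond value are hypothetical and chosen only for illustration, not measurements from any real network.

```python
from dataclasses import dataclass


@dataclass
class DelayBudget:
    """One-way delay contributions in milliseconds (all values hypothetical)."""
    codec_ms: float          # time the codec needs to digitize/encode speech
    packetization_ms: float  # time spent filling a packet with voice samples
    jitter_buffer_ms: float  # playout delay added to absorb jitter
    network_ms: float        # router/switch processing and transmission

    def one_way_ms(self) -> float:
        # The end-to-end delay is simply the sum of the contributions.
        return (self.codec_ms + self.packetization_ms
                + self.jitter_buffer_ms + self.network_ms)


# Illustrative numbers only.
budget = DelayBudget(codec_ms=15.0, packetization_ms=20.0,
                     jitter_buffer_ms=40.0, network_ms=25.0)
print(f"one-way delay: {budget.one_way_ms()} ms")
```

Keeping each contribution as a separate field makes it easy to see which component dominates when the total grows.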

How much delay is too much? This is a question only users of a voice network can answer. A general guideline is that delay below 100-200 ms is normally not perceptible to human listeners, and properly loaded networks will probably not exceed this amount. Delay above 300-400 ms is usually obvious to users and can begin to adversely affect the conversation. Actual delay values can depend on the users and the particular conversation. Delay continues to be an important performance parameter for both individual VoIP devices and integrated PSTN/VoIP systems because of the various factors that contribute to it, and because it can exacerbate other existing system conditions that adversely affect voice quality (like echo). Hence, it has again become important to measure delay.
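The guideline above can be expressed as a simple classification. This is only a sketch: the function name `classify_delay` is hypothetical, and because the guideline quotes ranges rather than single values, the default cut-offs (100 ms and 300 ms, the lower end of each range) are an assumption that should be adjusted to the users and conversations in question.

```python
def classify_delay(one_way_ms: float,
                   imperceptible_below_ms: float = 100.0,
                   obvious_above_ms: float = 300.0) -> str:
    """Map a one-way delay in ms onto the article's rough guideline.

    The default thresholds are assumptions: the guideline gives ranges
    (100-200 ms and 300-400 ms), and the lower end of each is used here.
    """
    if one_way_ms < 0:
        raise ValueError("delay cannot be negative")
    if one_way_ms < imperceptible_below_ms:
        return "normally not perceptible"
    if one_way_ms > obvious_above_ms:
        return "usually obvious"
    return "may be noticeable"


for delay in (60, 180, 450):
    print(delay, "ms:", classify_delay(delay))
```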

Echo
In the context of telephony, echo is the sound of the talker's voice returning to the talker's ear via a phone's speaker. Echo occurs when the talker's voice signal 'leaks' from the transmit path back into the receive path. If the time between the original spoken phrase and the returning echo is short (a few milliseconds), or if the echo's level is very low (-25 dB to -50 dB), it will probably cause little annoyance or disruption to the conversation. In fact, echo exists in most PSTN environments, but it occurs so close in time to the source speech that it is very rarely noticed. Exceptions can include the echo you might hear while participating in an overseas satellite call.
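To give a feel for the decibel figures quoted above, the sketch below converts a level in dB into a linear amplitude ratio. It assumes the echo level is an amplitude ratio relative to the talker's original speech, which the text does not state explicitly; the function name is hypothetical.

```python
def db_to_amplitude_ratio(level_db: float) -> float:
    """Convert a level in dB to a linear amplitude ratio (20*log10 convention)."""
    return 10 ** (level_db / 20)


for level in (-25, -50):
    ratio = db_to_amplitude_ratio(level)
    print(f"{level} dB is about {ratio * 100:.2f}% of the original amplitude")
```

Under that assumption, an echo at -25 dB is a few percent of the original amplitude, and an echo at -50 dB is a fraction of one percent.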

In the vast majority of cases, echo is caused by an electrical mismatch between analog telephony devices and transmission media. A very common cause is the electrical mismatch between a four-wire E&M trunk line (used to connect many PBXs) and a two-wire FXO line (like the one that connects the telephone to the wall socket). This four-wire to two-wire conversion happens in a device known as a hybrid. Another cause can be the acoustic coupling problems between a telephone's speaker and microphone.

Echo in voice-over-packet networks
Though echo exists in the PSTN, it is generally imperceptible. Even the echo from the far-end tail circuit usually returns quickly enough to not be heard. So what makes echo noticeable in VoIP networks? Simply put, round-trip delay! VoIP networks introduce into the voice path a fundamental and unavoidable end-to-end delay. If echo is produced in the far-end PSTN analog tail circuit, at least twice this delay (round-trip delay) will pass before the echo reaches the near-end talker's ear. Thus, even strongly attenuated echo can become perceptible.
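The round-trip argument above is simple arithmetic, sketched below. The function name and the optional `tail_delay_ms` parameter (any extra time spent inside the far-end tail circuit) are hypothetical; with the default of zero the function returns the minimum stated in the text, twice the one-way delay.

```python
def echo_return_delay_ms(one_way_delay_ms: float,
                         tail_delay_ms: float = 0.0) -> float:
    """Time between the talker speaking and hearing far-end echo.

    The speech crosses the VoIP network once, is reflected in the far-end
    tail circuit, and crosses the network again on the way back.
    """
    if one_way_delay_ms < 0 or tail_delay_ms < 0:
        raise ValueError("delays cannot be negative")
    return 2 * one_way_delay_ms + tail_delay_ms


# Illustrative value: a 100 ms one-way delay puts the echo 200 ms
# behind the talker's own words.
print(echo_return_delay_ms(100.0))
```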

We can assume that near-end echo will not be heard (as it is too close in time to the original spoken phrase) and that the IP portion of the network will not produce echo (digitized audio does not leak into the return path as recognizable echo). So, we can often correctly conclude that any perceptible echo originates from the far-end tail circuit. To compensate for this, VoIP edge gateways and routers use echo cancellation to eliminate the echo coming back into a VoIP network from the tail circuit. Echo cancellers face the tail circuit, where echo is likely to originate.

Double-talk complicates matters
Echo cancellation is further challenged when echo is accompanied by interrupting speech, sometimes referred to as double-talk. To eliminate echo, echo cancellers construct an estimate of the echo signal based on the original speech that is likely to be echoed and the characteristics of the circuit from which the echo is likely to come. The echo canceller subtracts that estimate from the signal returning to the talker through the echo canceller. In this way, 'legitimate' speech from the far end is allowed through, while echo (if the estimate is good) is eliminated. However, an interrupting speech signal coming from the same direction as the echo can cause an echo canceller to make mistakes in this estimation.
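The estimate-and-subtract idea described above can be sketched with a normalized least-mean-squares (NLMS) adaptive filter. NLMS is a common textbook choice and is used here as an assumption; the article does not say which algorithm any particular gateway uses. The function name, parameter values, test signal, and echo path below are all hypothetical.

```python
import random


def nlms_echo_cancel(reference, returning, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive echo estimate from the returning signal.

    reference: the original speech that is likely to be echoed
    returning: the signal coming back from the tail circuit
    Returns the residual passed on to the talker.
    """
    weights = [0.0] * taps   # running estimate of the echo path
    history = [0.0] * taps   # recent reference samples, newest first
    residual = []
    for x, d in zip(reference, returning):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                    # signal after cancellation
        norm = sum(h * h for h in history) + eps
        step = mu * e / norm                     # normalized step size
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def mean_power(samples):
    return sum(s * s for s in samples) / len(samples)


# Hypothetical test: white noise echoed through a short, attenuated path.
rng = random.Random(0)
speech = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
echo_path = [0.0, 0.0, 0.3, 0.1]   # assumed tail-circuit impulse response
echo = [sum(echo_path[k] * speech[n - k]
            for k in range(len(echo_path)) if n - k >= 0)
        for n in range(len(speech))]

residual = nlms_echo_cancel(speech, echo)
print("echo power:", mean_power(echo[-500:]))
print("residual power:", mean_power(residual[-500:]))
```

In this sketch the returning signal contains only echo, so the residual shrinks as the estimate improves. During double-talk the returning signal also carries far-end speech, which enters the error term `e` and pulls the weights away from the true echo path; this is the kind of estimation mistake described above.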

Measuring voice quality - clarity (PAMS) and clarity (PSQM)
Agilent's VQT measures clarity in two ways: PSQM (perceptual speech quality measure) and PAMS (perceptual analysis measurement system). The VQT's clarity measurements provide information that can be used to objectively evaluate the overall quality of a given speech sample after it has been transmitted by a telephony device or circuit. These measurements provide repeatable and objective measures of a subjective phenomenon.

PSQM is an algorithm defined in ITU-T Recommendation P.861, designed originally to provide objective measures of the non-distorted nature of voice signals once they have passed through non-linear codecs. PSQM predicts, and correlates well with, the results of subjective listening tests (e.g. MOS). It uses a sensory model that, like any perceptual audio quality measurement algorithm, takes into account the physiology of the human ear, human cognitive processing models, and audio transmission characteristics. PSQM evaluates the quality of audio signals in a way similar to how non-linear codecs encode and decode audio signals. That is, it evaluates whether a particular piece of audio is distorted with regard to the perceptually relevant information: what a human listener would find annoying or distracting.

PSQM provides a relative score that indicates just how 'different' the distorted signal is with regard to the original, showing whether voice signals are 'better' or 'worse' than the original. Typically, PSQM scores are calculated numerous times across a single voice sample, indicating how clarity changes as single vocal phrases are delivered. PSQM provides an estimate of network delay and takes into account attenuation/gain as an influencing factor on perceived clarity. Its numerical result is an estimate of audio clarity. Although the human ear is the ultimate judge of voice quality, PSQM - via the VQT's Clarity measurement - can provide consistent, useful and repeatable results.

PAMS is an algorithm developed by British Telecom (BT) Labs, designed to provide an objective measure that predicts the results of subjective listening tests (MOS) on telephony systems. PAMS uses a sensory model that, like any perceptual audio quality measurement algorithm, takes into account the physiology of the human ear, human cognitive processing models, and audio transmission characteristics. It evaluates the quality of audio signals in a way similar to how non-linear codecs encode and decode audio signals. That is, it also evaluates whether a particular piece of audio is distorted with regard to what a human listener would find annoying or distracting.

PAMS provides two relative scores: listening quality and listening effort. These two scores indicate the clarity of a distorted signal with regard to the non-distorted version of that signal, showing whether the degraded voice signal is 'better' or 'worse' than the original. PAMS listening quality and listening effort scores correlate closely to MOS (mean opinion score), such that a system that gets a high PAMS score (good quality) would also likely get a high MOS score.

Listening quality primarily refers to the actual audible distortion of the voice signal, while listening effort refers more to the intelligibility of the voice signal. This distinction is important because both contribute to whether a listener judges voice quality as good or bad. For example, a voice signal may be quite distorted but still understood. On the other hand, the voice signal could, in theory, be very clean but, due to other conditions, difficult to actually understand. PAMS provides a measure of both aspects of voice clarity. PAMS numerical results are an estimate of clarity. Although the human ear is the ultimate judge of voice quality, PAMS, via the Agilent VQT's Clarity measurement, can provide consistent, useful and repeatable results as well.

The PAMS algorithm provides no information on the following network conditions/phenomena as part of its processing - delay, delay variance and attenuation/gain.

Mombasawala Mohmedsaeed, technical expert, Agilent Technologies, India
Disclaimer: No content may be used from this site without the written permission of the authors, Convergence Plus, Comnet Publishers Pvt. Ltd. and Exhibitions India Pvt. Ltd. The views expressed on this site are solely those of the authors and do not reflect those of Convergence Plus, Comnet Publishers Pvt. Ltd. and Exhibitions India Pvt. Ltd.