India
Telecom
February 14, 2005
Voice quality measurement in VoIP networks
Though it may seem obvious, voice quality can be expressed
and measured only with respect to the talker and the
listener. It must be approached from an end-to-end perspective;
that is, regardless of the systems, devices, and transmission
methods used, any voice quality metric must be expressed
in the context of the user's experience. This implies
a very specific measurement criterion.
NEW DELHI -- The end-to-end aspect of voice quality is accompanied
by the inherently subjective nature of qualitative evaluation.
What a listener considers high quality (or, for that
matter, low quality) is influenced by the listener's expectations,
context, physiology, and mood. For example, questionable
quality on high utility calls (cellular phones, overseas
calls, and so on) is often tolerated, while less-than-perfect
quality on local calls is often judged quite harshly.
The external environment also plays a role. For example,
even though the audio signal coming out of your earpiece
may be impeccable, there could be so much background
noise in your environment that you cannot understand
what is being said.
The PSTN has long addressed the voice quality
problem by optimizing its circuits for the dynamic range
of the human voice and the rhythms of human conversation.
While the PSTN does not produce 'perfect' quality, users have become
accustomed to its level of voice quality, which is relatively
standard and predictable. Voice quality comparisons of different
delivery systems are therefore often made against this baseline.
Clarity, delay and echo
To measure voice quality, we need to understand the
contributing factors. In the context of VoIP environments,
a useful model of voice quality contains three elements:
the 'clarity' (i.e., fidelity, clearness, lack of distortion),
the 'delay' (i.e., the time it takes voice signals to
pass between talker and listener), and the 'echo' (the
talker's voice returning to the talker's ear). Generally,
for voice quality to be reasonably good, the delay must
be relatively short, clarity must be reasonably good,
and the echo must be imperceptible.
Clarity
In the context of voice quality testing, clarity describes
the perceptual fidelity, the clearness and the non-distorted
nature of a particular voice signal. Clarity can also
refer to the intelligibility of a voice signal.
Clarity is one of the most important parameters to measure when
evaluating the quality of a voice signal or a voice
channel. In voice-over-packet environments, traditional audio
measurement techniques do not always provide reliable
and meaningful results. On top of that, clarity is,
in many ways, a subjective metric, requiring special
measurement techniques to express it in objective terms.
This subjective character is what makes measuring clarity
a unique and interesting challenge.
Agilent VQT's Clarity measurements provide a repeatable,
accurate, and objective measure of this very subjective
phenomenon.
Delay
In the context of voice quality testing, delay is the
time it takes a voice signal to travel end-to-end between
speaker and listener. Delay often manifests itself as
an apparent time lag between when a talker speaks and
when a listener responds. When barely perceptible delay
is present, conversations can seem 'cold' or uncomfortable.
When delay is present in larger amounts, the rhythm
of conversations is disrupted, resulting in excessive
interruptions, misunderstandings, and so forth. In traditional
telephony networks (PSTNs), the delay is caused by long
propagation time, satellite links, and switching operations,
but is usually very small and not noticeable. When noticed,
the utility of the call (an overseas call, for example)
causes users to be somewhat tolerant of its effects.
Delay in voice-over-packet networks
With the emergence of VoIP, delay has become more of
an issue. Delay in a VoIP network is primarily caused
by the time it takes codecs to digitize and packetize
voice signals. Also, as packets exhibit varying arrival
times (jitter), compensating network devices, called
jitter buffers, add even more delay. Routine router and
switch processing adds small amounts of delay as well.
How much delay is too much? This is a question only
users of a voice network can answer. As a general guideline,
delay below 100-200 ms is normally not perceptible
to human listeners, and properly loaded networks will
probably not exceed this amount. Delay above 300-400 ms
is usually obvious to users and can begin to adversely
affect the conversation. The actual thresholds depend
on the users and the particular conversation. Delay
continues to be an important performance parameter for
both individual VoIP devices and integrated PSTN/VoIP
systems because of the various factors that contribute
to it, and because it can exacerbate other existing
system conditions that adversely affect voice quality
(like echo). Hence, it has again become important to
measure delay.
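The delay contributions and the guideline above can be sketched as a simple delay budget. This is a minimal illustration, not a measurement method: the class and function names, the component values, and the choice of 100 ms and 400 ms as the outer edges of the quoted ranges are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class DelayBudget:
    """One-way delay contributions in milliseconds (hypothetical values)."""
    codec_ms: float          # time for the codec to digitize and encode voice
    packetization_ms: float  # time to fill a packet with voice samples
    jitter_buffer_ms: float  # buffering that compensates for jitter
    network_ms: float        # router/switch processing and propagation

    def total_ms(self) -> float:
        """Sum of all one-way contributions."""
        return (self.codec_ms + self.packetization_ms
                + self.jitter_buffer_ms + self.network_ms)


def classify_delay(total_ms: float,
                   imperceptible_below_ms: float = 100.0,
                   obvious_above_ms: float = 400.0) -> str:
    """Map a one-way delay onto the article's rough guideline.

    The defaults are an assumption: they take the outer edges of the
    100-200 ms and 300-400 ms ranges and treat everything between them
    as a gray zone that depends on the users and the conversation.
    """
    if total_ms < imperceptible_below_ms:
        return "normally not perceptible"
    if total_ms > obvious_above_ms:
        return "usually obvious"
    return "depends on users and conversation"


budget = DelayBudget(codec_ms=10, packetization_ms=20,
                     jitter_buffer_ms=40, network_ms=15)
print(budget.total_ms(), classify_delay(budget.total_ms()))
```

Keeping the thresholds as parameters reflects the point made above: the acceptable amount of delay is ultimately set by the users of the network, not by a fixed constant.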
Echo
In the context of telephony, echo is the sound of the
talker's voice returning to the talker's ear via a phone's
speaker. Echo occurs when the talker's voice signal
'leaks' from the transmit path back into the receive
path. If the time between the original spoken phrase
and the returning echo is short (a few milliseconds),
or if the echo's level is very low (-25dB to -50dB),
it probably will cause little annoyance or disruption
to the voice conversations. In fact, in most PSTN environments,
echo exists, but occurs so close in time to source speech
that it is very rarely noticed. Exceptions can include
the echo you might hear while participating in an overseas
satellite call.
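To make the quoted echo levels concrete, the sketch below converts an amplitude ratio into decibels. It assumes the dB figures above describe the echo relative to the talker's own signal; the function name and the sample amplitudes are hypothetical.

```python
import math


def echo_level_db(echo_amplitude: float, talker_amplitude: float) -> float:
    """Echo level relative to the talker's signal, in dB (amplitude ratio)."""
    if echo_amplitude <= 0 or talker_amplitude <= 0:
        raise ValueError("amplitudes must be positive")
    return 20.0 * math.log10(echo_amplitude / talker_amplitude)


# An echo at 1% of the talker's amplitude (hypothetical figures).
print(round(echo_level_db(0.01, 1.0), 1))
```

Under this reading, an echo at one hundredth of the talker's amplitude sits at -40 dB, inside the range described above as causing little annoyance.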
In the vast majority of cases, echo is caused by an
electrical mismatch between analog telephony devices
and transmission media. A very common cause is the electrical
mismatch between a four-wire E&M trunk line (used
to connect many PBXs) and a two-wire FXO line (like
the one that connects the telephone to the wall socket).
This four-wire to two-wire conversion happens in a device
known as a hybrid. Another cause can be acoustic
coupling between a telephone's speaker and
microphone.
Echo in voice-over-packet networks
Though echo exists in the PSTN, it is generally imperceptible.
Even the echo from the far-end tail circuit usually
returns quickly enough to not be heard. So what makes
echo noticeable in VoIP networks? Simply put, round-trip
delay! VoIP networks introduce into the voice path a
fundamental and unavoidable end-to-end delay. If echo
is produced in the far-end PSTN analog tail circuit,
at least twice this delay (round-trip delay) will pass
before the echo reaches the near-end talker's ear. Thus,
even strongly attenuated echo can become perceptible.
Note that near-end echo will not be heard (as it is too close
in time to the original spoken phrase) and that the
IP portion of the network will not produce echo (digitized
audio does not leak into the return path as recognizable
echo). So, we can usually conclude that any
perceptible echo originates from the far-end tail circuit.
To compensate for this, VoIP edge gateways and routers
utilize echo cancellation to eliminate the echo coming
back into a VoIP network from the tail circuit. Echo
cancellers face the tail circuit, where echo is likely
to originate.
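The round-trip argument above amounts to a one-line calculation. The sketch below is illustrative only: the function name is made up, and `tail_delay_ms` is a hypothetical parameter standing for whatever extra time the echo spends inside the far-end tail circuit.

```python
def far_end_echo_delay_ms(one_way_delay_ms: float,
                          tail_delay_ms: float = 0.0) -> float:
    """Time between speaking and hearing far-end echo, in milliseconds.

    The talker's voice crosses the VoIP network once to reach the tail
    circuit and once more to come back, so the echo arrives after at
    least twice the one-way delay.
    """
    if one_way_delay_ms < 0 or tail_delay_ms < 0:
        raise ValueError("delays must be non-negative")
    return 2.0 * one_way_delay_ms + tail_delay_ms


# Hypothetical 80 ms one-way delay plus 10 ms spent in the tail circuit.
print(far_end_echo_delay_ms(80.0, tail_delay_ms=10.0))
```

With these example numbers the echo returns 170 ms after the original phrase, far longer than the few milliseconds at which echo goes unnoticed.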
Double-talk complicates matters
Echo cancellation is further challenged when echo is
accompanied by interrupting speech, sometimes referred
to as double-talk. To eliminate echo, echo cancellers
construct an estimate of the echo signal based on the
original speech that is likely to be echoed and the
characteristics of the circuit from which the echo is
likely to come. The echo canceller subtracts that estimate
from the signal returning to the talker through the
echo canceller. In this way, 'legitimate' speech from
the far end is allowed through, while echo (if the estimate
is good) is eliminated. However, an interrupting speech
signal coming from the same direction as the echo can
cause an echo canceller to make mistakes in this estimation.
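The estimate-and-subtract idea described above can be sketched with a normalized least-mean-squares (NLMS) adaptive filter. NLMS is a common textbook choice used here as an assumption; the article does not say which algorithm any particular echo canceller uses. The function names, the tap count, the step size, and the toy echo path are all hypothetical.

```python
def simulate_echo(reference, echo_path):
    """Toy tail circuit: convolve the reference with a short impulse response."""
    return [sum(g * reference[n - k]
                for k, g in enumerate(echo_path) if n - k >= 0)
            for n in range(len(reference))]


def nlms_echo_cancel(reference, returned, taps=8, step=0.5, eps=1e-6):
    """Subtract an adaptive echo estimate from the returning signal.

    reference -- speech sent toward the tail circuit (likely to be echoed)
    returned  -- signal coming back from the tail circuit
    Returns the residual that is passed on toward the original talker.
    """
    weights = [0.0] * taps   # running model of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, returned):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = d - echo_estimate          # signal left after subtraction
        residual.append(error)
        # Normalized LMS update: nudge the model toward a smaller error.
        norm = sum(h * h for h in history) + eps
        weights = [w + step * error * h / norm
                   for w, h in zip(weights, history)]
    return residual
```

In this sketch the same `error` value is both the output and the quantity that drives adaptation. When `returned` contains only echo, the model settles and the residual shrinks. When interrupting speech arrives from the same direction, it also appears in `error` and pulls the weights away from the true echo path, which is one way to picture the estimation mistakes that double-talk causes.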
Measuring voice quality - clarity (PAMS) and clarity (PSQM)
Agilent's VQT measures clarity in two ways: PSQM (perceptual
speech quality measure) and PAMS (perceptual analysis
measurement system). The VQT's clarity measurements
provide information that can be used to objectively
evaluate the overall quality of a given speech sample
after it has been transmitted by a telephony device
or circuit. These measurements provide repeatable and
objective measures of a subjective phenomenon.
PSQM is an algorithm spelled out in ITU specification
P.861, designed originally to provide objective measures
of the non-distorted nature of voice signals once they
are passed through non-linear codecs. PSQM predicts
and correlates well with the results of subjective listening
tests (e.g. MOS). It uses a sensory model that, like
any perceptual audio quality measurement algorithm,
takes into account the physiology of the human ear,
human cognitive processing models, and audio transmission
characteristics. PSQM evaluates the quality of audio
signals in a way similar to how non-linear codecs encode and
decode audio signals. That is, it judges distortion by the
perceptually relevant information: what a human listener
would find annoying or distracting.
PSQM provides a relative score that indicates just how
'different' the distorted signal is with regard to the
original, showing whether voice signals are 'better'
or 'worse' than the original. Typically, PSQM scores
are calculated numerous times across a single voice
sample, indicating how clarity changes as single vocal
phrases are delivered. PSQM provides an estimate of
network delay and takes into account attenuation/gain
as an influencing factor on perceived clarity. Its numerical
result is an estimate of audio clarity. Although the
human ear is the ultimate judge of voice quality, PSQM
- via the VQT's Clarity measurement - can provide consistent,
useful and repeatable results.
PAMS is an algorithm developed by British Telecom (BT)
Labs, designed to provide an objective measure, which
predicts the results of subjective listening tests (MOS)
on telephony systems. PAMS uses a sensory model that,
like any perceptual audio quality measurement algorithm,
takes into account the physiology of the human ear,
human cognitive processing models, and audio transmission
characteristics. It evaluates the quality of audio signals
in a way similar to how non-linear codecs encode and decode
audio signals. That is, it also evaluates whether a
particular piece of audio is distorted with regard to
what a human listener would find annoying or distracting.
PAMS provides two relative scores: listening quality
and listening effort. These two scores indicate the
clarity of a distorted signal with regard to the non-distorted
version of that signal, showing whether the degraded
voice signal is 'better' or 'worse' than the original.
PAMS listening quality and listening effort scores correlate
closely to MOS (mean opinion score), such that a system
that gets a high PAMS score (good quality) would also
likely get a high MOS score.
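For readers unfamiliar with MOS, the sketch below shows how a mean opinion score is formed from individual listener ratings. It assumes the common five-point opinion scale (1 = bad, 5 = excellent); the function name and the sample ratings are hypothetical and are not results from any test.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings on the 1 (bad) to 5 (excellent) scale."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings)


# Made-up ratings from five listeners for one speech sample.
print(mean_opinion_score([4, 4, 5, 3, 4]))
```

Objective algorithms such as PAMS aim to predict this average without assembling a listening panel for every test.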
Listening quality primarily refers to the actual audible
distortion to the sound of the voice signal, while listening
effort refers more to the intelligibility of the voice
signal. This distinction is important as both contribute
to whether a listener judges voice quality as good or
bad. For example, a voice signal may be quite distorted,
but still understood. On the other hand, the voice signal
could, in theory, be very clean, but due to other conditions,
could be difficult to actually understand. PAMS provides
a measure of both aspects of voice clarity. The PAMS numerical
results are an estimate of clarity. Although the
human ear is the ultimate judge of voice quality, PAMS,
via the Agilent VQT's Clarity measurement, can provide
consistent, useful and repeatable results as well.
The PAMS algorithm provides no information on the following
network conditions/phenomena as part of its processing:
delay, delay variance, and attenuation/gain.