Voice Communications Using Digital Networks
July 21, 1997
TABLE OF CONTENTS
A BRIEF HISTORY OF VOIP
VOICE COMPRESSION TECHNOLOGY
VOICE DATA TRANSMISSION
CURRENTLY AVAILABLE SOFTWARE
THE VALUE OF VOIP
THE FUTURE OF VOIP
TABLE OF FIGURES
The decade of the nineteen nineties is often labeled the information age. Few would argue against
the statement that timely information availability and exchange is vital to success in every endeavor. There is a
technology that is maturing today that makes the gathering and sharing of information less costly, faster,
and easier. It is called Internet Telephony.
Internet telephony was originally conceived as a method for the military to communicate tactical
information using speech from a battlefield to commanders across the globe via a digital network.
Applications that are based on concepts developed for the military now provide the individual or small
business with unprecedented communications capabilities. Internet Telephony now gives everyone
possessing some basic equipment and an Internet connection instant access to people and data anywhere in
the world. Almost as importantly, it does this at a fraction of the price of using the traditional telephone
system. Once the equipment and Internet connection are in place, long distance calls may be made at
essentially no cost.
As appealing as free long distance calling is, this is a small portion of what Internet telephony can
accomplish. The most popular software packages also offer all of the services provided by business
telephone systems. However, they also add capabilities for collaborative computing, digital data transfer,
and video that even the most sophisticated telephone system cannot provide.
There are currently over 35 companies that produce Internet Telephony software for the consumer.
The software prices range from no cost packages provided in pre-release form, to full-featured programs
sold at retail for about $50. There are also packages targeted at the business community. These packages
use proprietary hardware to process audio and video. The price for these is in the $3000 range. In general
though, they provide little improvement over the applications described above.
Three major obstacles must be addressed before Internet Telephony will begin to challenge the
Public Switched Telephone Network for market share. First, the software has to become more user friendly.
Making a call must become as easy as it is when using a telephone. Second, the Internet must mature
to the point of being as reliable as the telephone network. Third, the adoption and implementation of
standards must be completed. This needs to happen so that every software package will be able to
communicate with all of the others. Currently this is not the case.
Projections by International Data Corporation indicate that there are now three million Internet
Telephony users worldwide. By the turn of the century, projections indicate that this number will have
increased to sixteen million. The majority of this increase will consist of business users. These
figures show that Internet Telephony is one of the fastest growing technology sectors associated with the
Putting the infrastructure into place to provide voice communications over a digital network is
much less expensive than that used for telephone systems. Using purely digital networks to carry voice
information is about 75% less expensive than using the analog/digital hybrid system that comprises the
Public Switched Telephone Network. Because of these economies, it will be possible to bring voice
communications to large numbers of people, in many areas of the world, where it has not been previously
feasible. This is likely to be the most important contribution that Internet Telephony will make in the next
In the past three years, your author has spent countless hours using applications that allow a user to
transmit voice over digital networks, and assisting various software companies in developing and marketing
them. In this time, awareness has grown of the great possibilities presented by this technology. The
main motivation for producing this document is to make more people aware of the possibilities that are
presented for the consumer or small businessperson when digital technology is used to either replace or
supplement voice communications using the standard circuit switched telephone network.
In the following document, we will explore a technology that is developing into one of the fastest
growing fields in communications. This relatively new technology uses a digital network to transmit voice
and video data. The name that has been given to this technology by the organizations that develop
standards and the participants in the industry is VOIP. This acronym is used in place of the phrase “Voice
over Internet Protocol”. This term will be used throughout the rest of this document when referring
generically to any of the several aspects of this technology.
At this point some may be asking what exactly it is that VOIP applications do. In the simplest
sense they act as an interface allowing what is essentially a telephone conversation to be held using a
digital network such as the Internet. To accomplish this, the application must do several things. It must first
provide a set of controls allowing the user to utilize the audio functions of their computer. It must provide a
method for calling people. It must provide a method to compress the digital data in order to make
transmission faster and to reduce the bandwidth required for the voice connection. The last function required
from all applications is the ability to communicate with the software that constructs and sends the packets
that carry the voice data through the network. When all of these aspects are implemented properly, the
result is the capability to conduct a conversation rivaling or surpassing in quality that which users of the
Public Switched Telephone Network (PSTN) have come to expect.
The term “new” was used above to describe VOIP as a technology. In one sense, this is not an accurate
description. Experimentation in using digital networks to carry voice data actually began in the early
nineteen seventies. However, it has taken the commercialization of the Internet and the resultant
phenomenal increase in usage, concomitant with available bandwidth, to make VOIP a practical alternative
for the general public. In the sense of a widely distributed, commercially available software application,
however, the term “new” is highly accurate. In fact, Vocaltec Corporation released the first VOIP
application only about two and one half years ago. The history of VOIP will be discussed in more depth in
the section devoted to that subject.
There are several advantages associated with the use of VOIP. These fall into two broad categories. The
first is the saving of sometimes large sums of money. The second advantage is the power and flexibility
that is available when using digital rather than analog processes.
Perhaps the greatest motivation for communicating using the Internet is the reduction of costs associated
with everyday communication requirements. Some specific case studies will be discussed later, but cost
reductions can be dramatic in most circumstances.
The advantages of having enhanced communications capabilities are harder to quantify. This is in large
part due to the fact that new functionality is being added at an amazing rate. As we shall see though,
features such as text and data transfer, video conferencing, group whiteboard, and application sharing bring
electronic communications capabilities to the average consumer that previously could be utilized only at
great cost by the most wealthy of corporations.
Before a reader gets the idea that VOIP is a panacea for the resolution of all of our communications
problems, it must be pointed out that there are certain drawbacks currently involved in adopting VOIP as a
primary communications tool.
In fact, much will need to happen before strict reliance on this technology should even be considered. The
major obstacles at present are all related to network issues concerning capacity and speed. These include
problems such as lost connections, lost packets resulting in choppy speech, and, depending upon the
computer and the specific application used, reduced voice quality from that experienced when using the
Fortunately though, as the network matures, more bandwidth becomes available, and applications improve, all
of these issues promise to be mitigated or eliminated. As we will see, even with the current problems, on
balance the technology remains an attractive alternative or, at minimum, a supplement to the use of the
PSTN system for voice communication.
A Brief History of VOIP
The term “new” was used in the introduction to describe the technology used to transmit voice data
digitally. In fact, the technology is close to being thirty years old. The first experimentation in using a
digital network to carry voice data actually began in the early nineteen seventies. Prior to even this early
research, however, much of the groundwork required to make VOIP possible was being laid. In the
nineteen sixties the Bell system laboratories were working hard to increase the ability of transatlantic cables
to carry telephone calls. The experiments performed to determine acceptable delays and other intelligibility
issues proved invaluable to later developers of the systems used to transmit speech digitally.
By 1967 the government had funded and begun construction of a network designed to link the computers of
several universities and those of many research facilities within the defense establishment. This network
was called the Advanced Research Projects Agency Network (ARPANET). The network was designed to
survive a nuclear war. The capability to transmit voice data over such a network would obviously have
great advantages over dependence upon the Public Switched Telephone Network in time of emergency.
The combination of this motivation and the availability of government funding led to research that began in
earnest by early 1970. Due to the limited capacity of the early network to carry significant amounts of data,
much of the early research concerned itself with methods of compressing voice data so that it might be sent
in a form that used the minimum amount of bandwidth possible. This research built upon
the data from Bell Labs by adding adaptive components to model the human vocal tract to determine the
minimum amount of information that had to be transmitted to reconstruct intelligible speech.
By 1975 the military had developed packet radio in which data was digitized, assembled in packets, and
transmitted using radio frequencies. It was not long before it was realized that a battlefield commander
using available satellite technology could easily connect to the international ARPANET to transmit and
receive tactical information from almost anywhere on the globe. The need to use voice communications
from within this system suddenly became of paramount importance.
From 1976 to 1978 the remaining pieces required for making packet voice a viable system fell into place.
The method used to move data on the network that was in use within ARPANET, called Transmission
Control Protocol (TCP), put a high priority on error correction. This means that the computers on each end
of a connection would negotiate the re-transmission of information until every bit of data was accounted for
at the receiving end. The problem with this concept as it relates to human conversation is the potentially
long delays that are introduced by re-sending data until a perfect copy is received. Early studies in using the
transatlantic cable showed that this led to confusion when human beings were attempting to communicate
using speech. The solution to this problem was development of a protocol called the Network Voice
Protocol (NVP). This was designed as a realtime method for sending data. Instead of repetitively sending
data until every bit was received, the data was sent only once. The premise was that it was better to get a
small data loss that would cause minor breaks in
speech than it was to wait long periods during which there is no information exchanged. This method was
dubbed a “send and forget protocol” and is the predecessor of User Datagram Protocol (UDP) and its
sibling the Real-time Transport Protocol (RTP), which are widely in use today for voice and video
transmission on the Internet.
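
The trade-off described above can be illustrated with a small simulation. The Python sketch below is not NVP, TCP, or UDP; it is a toy model written for this discussion, and the function name, the loss rate, and the assumption that every re-transmission costs exactly one round trip are all hypothetical.

```python
import random


def simulate(num_packets, loss_rate, round_trip_ms, retransmit, seed=1):
    """Toy model: return (packets delivered, worst added delay in ms).

    Assumption for illustration: each transmission is lost independently
    with probability loss_rate, and each re-send costs one round trip.
    """
    rng = random.Random(seed)
    delivered = 0
    worst_delay_ms = 0
    for _ in range(num_packets):
        resends = 0
        lost = False
        while rng.random() < loss_rate:   # this transmission was lost
            if not retransmit:
                lost = True               # send and forget: give up
                break
            resends += 1                  # reliable: wait, then re-send
        if not lost:
            delivered += 1
            worst_delay_ms = max(worst_delay_ms, resends * round_trip_ms)
    return delivered, worst_delay_ms


if __name__ == "__main__":
    for mode in (True, False):
        got, delay = simulate(1000, 0.1, 200, retransmit=mode)
        label = "re-transmit" if mode else "send and forget"
        print(f"{label}: {got} of 1000 delivered, worst added delay {delay} ms")
```

Under these assumptions the re-transmitting sender delivers every packet but occasionally stalls for several round trips, while the send and forget sender never stalls but leaves small gaps, which is the compromise the designers chose for speech.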
In 1978 Bell Telephone was awarded a basic patent covering the transmission of voice on packet switched
networks. By 1980, voice conferencing was in wide use on the ARPANET.
For the next 14 years VOIP remained a technology used primarily by researchers and those in academia,
i.e., those with sufficient resources to have a presence on the Internet. However, during this period one
important development was taking place that would thrust the technology into the public spotlight. Internet
growth was becoming what could fairly be called explosive. In 1984, there were about 1000 host machines
on the Internet. In 1991, the National Science Foundation lifted the ban that prevented businesses from
conducting commerce on the Internet. By 1992, there were in excess of one million host machines on the
Internet. Between 1992 and the end of 1993 annualized growth of traffic on the World Wide Web reached a
calculated 341,634%. (Zakon)
At about this time an enterprising group of Israelis decided that there was a commercial application for the
VOIP technology. The company that they were part of, called Vocaltec LTD, began to work seriously to
develop and market an application that would allow anyone with an Internet connection to talk to anyone
else on the planet with a similar connection. In February of 1995, the application called the Internet Phone
(Iphone) was released for download from the Vocaltec File Transfer Protocol site on the Internet. The
response was overwhelming. Units sold to date, including downloads and software sold at retail, are
estimated at four million copies.
This original application had a very simple graphical interface and was designed for a single purpose, that
is, talking to people while using the Internet as the medium. In order to find someone to converse with, one
had to log onto a server and look through a directory of those who happened to be logged on at that time.
Most of the users at that time were computer hobbyists and hard core Information Technology
professionals. The software was simple and somewhat rough, but it worked and it showed the potential of
VOIP to the public.
The huge response to Iphone was all that was needed to create a rush of development efforts from
companies that ranged from start up concerns to software and hardware giants such as Microsoft and Intel.
Today there are over 35 entrants into a market that did not even exist just over two years ago. This appears
to be just the beginning, as there are now projections that the VOIP market could show gross sales as high
as 560 million dollars per year by the turn of the century. (IDC #11407)
Today’s applications have evolved quickly from the extremely simple and single purpose original Iphone to
true multimedia communications suites. In most offerings there is a whole range of features that include
video conferencing, voice mail, call waiting, caller identification, collaborative whiteboards, realtime text
communications, and data file transfer abilities. All of these features will be explored in depth in the section
dealing with the software that is currently on the market.
Voice Compression Technology
Perhaps the most important part of any software that allows voice communications using digital networks is
the code that compresses voice information into a size that is reasonable. It is easy to see why this would be
so when one recognizes the fact that all networks have limited bandwidth available to use in carrying all of
the data that is required to be exchanged by all of the users. For the vast majority of the people using the
Internet today, the system bandwidth is limited by the modem that they use to connect via the telephone
system to the Internet. At best, this limit may be as high as 33.6 kilobits per second (Kbs). In actuality
though, the average connection speed is closer to 24.0 Kbs. This is an order of magnitude greater than what
was available only a few years ago, but it is still much slower than the standard 56 Kbs available to the
original users of ARPANET. For VOIP to be practical, methods had to be developed to compress large
amounts of data present in a digital representation of a human voice to a level that would not saturate the
limited bandwidth available to
The software that performs this important task is called a codec. The word codec is a contraction
derived from the words “compress” and “decompress”. It is the function of the codec to use an algorithm to
compress the voice data and then decompress the data on the receiving end so that it will resemble, as
closely as possible, the original digital signal. While this is a simple concept, in practice, these algorithms
are very complex and all result in certain compromises between data rates, processor usage, and the
resultant sound quality.
To begin with, it is interesting to see the kind of data rate required to send uncompressed speech
through a network. The PSTN requires a frequency response from about 300 Hz to about 3500 Hz to maintain
what is typically considered adequate sound quality for speech. To get a similar response from a digitally
sampled sound it is necessary to use 8 bits per sample, and to take a minimum of 8000 samples per second.
The math shows that: 8000 samples/second x 8 bits/sample = 64 Kbits/sec data rate. Comparing this
number to even the fastest modem connection available shows a severe deficit between the bandwidth we
have available and that which is needed.
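
The arithmetic can be sketched in a few lines of Python. This is an illustration only: the function names are invented here, and the modem figures are simply those quoted earlier in this section.

```python
def pcm_bit_rate(samples_per_second, bits_per_sample):
    """Uncompressed PCM data rate in bits per second."""
    return samples_per_second * bits_per_sample


def compression_needed(source_bps, channel_bps):
    """Factor by which the source must shrink to fit the channel."""
    return source_bps / channel_bps


if __name__ == "__main__":
    speech_bps = pcm_bit_rate(8000, 8)  # sampling figures from the text
    for label, modem_bps in (("best case", 33_600), ("typical", 24_000)):
        factor = compression_needed(speech_bps, modem_bps)
        print(f"{label}: {speech_bps} bits/s into {modem_bps} bits/s "
              f"needs about {factor:.1f} to 1 compression")
```

In practice the required ratio is likely to be higher still, since the packets that carry the voice data add header overhead of their own.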
The types of available codecs can be broadly grouped into three categories. These are the
waveform, source, and hybrid codecs. A brief description of each is presented below. The subject matter is
complex, but treatment will be kept as simple as possible. (Woodard)
Waveform codecs attempt, without using any knowledge of how the signal to be coded was generated, to
produce a reconstructed signal whose waveform is as close as possible to the original. This means that in
theory they should be signal independent and work well with all audio signals. Generally, they are low
complexity codecs that produce high quality speech at rates of about 16 Kbs. When the data rate is lowered
below this level the reconstructed speech quality that is obtained degrades rapidly.
The simplest form of waveform coding is Pulse Code Modulation (PCM), which merely involves sampling
and quantizing the input waveform. As we saw above, speech is typically frequency limited to 4 kHz and
sampled at 8 kHz. If linear quantization is used, high quality speech actually requires around twelve bits per
sample. This means that our formula yields a bit rate of 96 Kbs. This bit rate can be reduced by using non-
uniform quantization of the samples. In speech coding an approximation to a logarithmic quantizer is often
used. Such quantizers give a signal-to-noise ratio that is almost constant over a wide range of input
levels, and at a rate of eight bits per sample give a reconstructed signal which is almost indistinguishable
from the original. Such logarithmic quantizers were standardized in the 1960’s, and are still widely used
today. In America the mu-law codec is the standard for use in cellular telephones, while in Europe the
slightly different A-law compression is used. Both have the advantages of low complexity, minimal delays
and high quality reproduced speech, but
require a relatively high bit rate.
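
The effect of logarithmic quantization can be seen in a minimal sketch. The Python below uses the continuous mu-law companding formula with mu = 255; it is not the segmented encoding that standardized codecs actually use, and the function names and the choice of 127 quantizer levels per polarity are assumptions made for illustration.

```python
import math

MU = 255.0  # companding constant commonly used with 8-bit mu-law


def mu_law_compress(x):
    """Map a sample in [-1, 1] onto a logarithmic scale in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mu_law_expand(y):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(v, bits=8):
    """Uniform (linear) quantizer for a value in [-1, 1]."""
    levels = 2 ** (bits - 1) - 1
    return round(v * levels) / levels


def encode_decode(x, bits=8):
    """Compress, quantize uniformly, then expand again."""
    return mu_law_expand(quantize(mu_law_compress(x), bits))


if __name__ == "__main__":
    for sample in (0.005, 0.01, 0.1, 0.5):
        linear_err = abs(quantize(sample) - sample)
        mu_err = abs(encode_decode(sample) - sample)
        print(f"sample {sample}: linear error {linear_err:.6f}, "
              f"mu-law error {mu_err:.6f}")
```

Quiet samples, which dominate speech, come back with a much smaller error than the same eight bits would give with linear quantization, at the cost of a somewhat larger error on loud samples.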
A commonly used technique in speech coding is to attempt to predict the value of the next sample from that
of previous samples. It is possible to do this because of the correlations present in speech samples due to
the effects of the vocal tract and the vibrations of the vocal cords. If the predictions are effective then the
error signal between the predicted samples and the actual speech samples will have a lower variance than
the original speech samples. Therefore, we should be able to quantize this error signal with fewer bits than
the original speech signal. This error signal is added to or subtracted from the predicted signal on the
receiving end. Correct implementation results in reproduced speech closely approximating the original.
This is the basis of Differential Pulse Code Modulation (DPCM) schemes. That is, only differences
between the original and predicted signals are quantized and transmitted. (McElroy) (Woodard)
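
A minimal sketch of the idea follows. It assumes the simplest possible predictor, namely that the next sample equals the previous reconstructed sample, and a fixed quantizer step; both choices, and the function names, are hypothetical simplifications rather than any standardized DPCM scheme.

```python
def dpcm_encode(samples, step=4):
    """Quantize the difference between each sample and its prediction."""
    codes = []
    predicted = 0
    for sample in samples:
        error = sample - predicted
        code = round(error / step)   # small integer actually transmitted
        codes.append(code)
        predicted += code * step     # track what the decoder will rebuild
    return codes


def dpcm_decode(codes, step=4):
    """Rebuild the signal by adding each decoded error to the prediction."""
    samples = []
    predicted = 0
    for code in codes:
        predicted += code * step
        samples.append(predicted)
    return samples


if __name__ == "__main__":
    original = [0, 10, 22, 30, 33, 31, 24, 12]
    codes = dpcm_encode(original)
    print("transmitted codes:", codes)
    print("reconstructed:    ", dpcm_decode(codes))
```

Note that the encoder predicts from its own reconstructed values rather than from the original samples, so that encoder and decoder stay in step and the quantization error does not accumulate.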
The results from such codecs can be improved if the predictor and quantizer are made adaptive so that they
change to match the characteristics of the speech being coded. This leads to Adaptive Differential PCM
(ADPCM) codecs. In the mid 1980’s, a standard was adopted for an ADPCM codec operating at 32 Kbs,
giving speech quality that is very similar to the 64 Kbs PCM codecs. ADPCM codecs operating at 16, 24
and 40 Kbs were standardized shortly thereafter.
The waveform codecs described above all code speech with an entirely time domain approach. Frequency
domain approaches are also possible, and have certain advantages. For example, in Sub-Band Coding (SBC)
the input speech is split into a number of frequency bands, or sub-bands, and each is coded independently
using, for example, an ADPCM like coder. At the receiver the sub-band signals are decoded and
recombined to give the reconstructed speech signal. The advantages of doing this come from the fact that
the noise in each sub-band is dependent only on the coding used in that sub-band. Therefore, more bits can
be allocated to perceptually important sub-bands. The result is that the noise in these frequency regions is
low, while other sub-bands may contain high coding noise because noise at these frequencies is less
perceptually important. Adaptive bit allocation methods may be used to further enhance these results.
Sub-band codecs tend to produce speech of a quality similar to that of the PSTN at rates in the range of
16-32 Kbs. Due to the filtering necessary to split the speech into sub-bands they
are more complex than simple DPCM coders, and introduce more coding delay. However, the complexity
and delay are still relatively low when compared to most hybrid codecs. (Woodard)
Another frequency domain waveform coding technique is Adaptive Transform Coding (ATC), which uses
a fast transformation, such as the discrete cosine transformation, to split blocks of the speech signal into a
large number of frequency bands. The number of bits used to code each transformation coefficient is
adapted depending on the spectral properties of the speech, and PSTN quality reproduced speech can be
achieved at bit rates as low as 16 Kbs.
Source coders operate using a model of how the source was generated and attempt to extract, from
the signal being coded, the parameters of the model. It is these model parameters which are transmitted
to the decoder. Source coders for speech are called
vocoders and function as follows: The human vocal tract is represented as a time-varying filter and is
excited with either a white noise source, for unvoiced speech segments, or a train of pulses separated by the
pitch period for voiced speech. Therefore, the information that must be sent to the decoder is the filter
specification, a voiced/unvoiced flag, the necessary variance of the excitation signal, and the pitch period
for voiced speech. This is updated every 10-20 ms to follow the non-stationary nature of speech.
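
The decoder side of such a model can be sketched as follows. This Python toy is not any particular vocoder: the frame layout, the two-coefficient filter, the use of a gain value in place of the excitation variance, and the restarting of the pulse train at each frame boundary are all assumptions made to keep the illustration short.

```python
import random

FRAME_LEN = 160  # 20 ms of speech at 8000 samples per second


def excitation(voiced, pitch_period, gain, rng):
    """One frame of excitation: a pulse train if voiced, else white noise."""
    if voiced:
        return [gain if n % pitch_period == 0 else 0.0
                for n in range(FRAME_LEN)]
    return [gain * rng.gauss(0.0, 1.0) for _ in range(FRAME_LEN)]


def synthesize(frames, seed=0):
    """Drive a time-varying all-pole filter with each frame's excitation."""
    rng = random.Random(seed)
    order = len(frames[0]["coeffs"])
    history = [0.0] * order  # most recent output sample first
    speech = []
    for frame in frames:
        source = excitation(frame["voiced"], frame["pitch_period"],
                            frame["gain"], rng)
        for x in source:
            # Output = excitation + weighted sum of previous outputs.
            y = x + sum(a * h for a, h in zip(frame["coeffs"], history))
            history = [y] + history[:-1]
            speech.append(y)
    return speech


if __name__ == "__main__":
    frames = [
        {"coeffs": [1.2, -0.5], "voiced": True, "pitch_period": 80, "gain": 1.0},
        {"coeffs": [0.9, -0.2], "voiced": False, "pitch_period": 0, "gain": 0.1},
    ]
    print(len(synthesize(frames)), "samples synthesized")
```

Only the handful of values in each frame dictionary would need to cross the network, which is why the data rate of a vocoder can be so low.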
The model parameters can be determined by the encoder in a number of different ways, using either time or
frequency domain techniques. In addition, the information can be coded for transmission in various
different ways. Vocoders tend to operate at around 2.4 Kbs or below, and produce speech which, although
intelligible, is far from natural sounding. Increasing the bit rate much beyond 2.4 Kbs is not worthwhile
because of the built in limitation in the coder’s performance due to the simplified model of speech
production used. The main use of vocoders has been in military applications where natural sounding speech
is not as important as a very low bit rate to allow the lowest possible bandwidth and to enable strong
Hybrid codecs attempt to fill the gap between waveform and source codecs. As described above, waveform
coders are capable of providing good quality speech at bit rates down to about 16 Kbs, but are of limited
use at rates below this. Vocoders, on the other hand, can provide intelligible speech at 2.4 Kbs and below,
but cannot provide natural sounding speech at any bit rate. Although other forms of hybrid codecs exist, the
most successful and commonly used are time domain Analysis-by-Synthesis (AbS) codecs. Such coders
use the same linear prediction filter model of the vocal tract as found in Linear Predictive Coding (LPC)
vocoders. However, instead of
applying a simple two-state, voiced/unvoiced, model to find the necessary input to this filter, the excitation
signal is chosen by attempting to match the reconstructed speech waveform as closely as possible to the
original speech waveform. Atal and Remde first introduced AbS codecs in 1982 with what was to become
known as the Multi-Pulse Excited (MPE) codec.
Later the Regular-Pulse Excited (RPE) and the Code-Excited Linear Predictive (CELP) codecs were
The hybrid codec is the type that is in use today in the majority of the applications designed to enable voice
over digital networks. The high sound quality and low bit rates that they provide make them almost
indispensable to VOIP software developers. Each implementation is somewhat different and these
differences are responsible for the disparity in sound quality between the various applications.
Bit Rate    128 Kbs    64 Kbs    32 Kbs    13.2 Kbs    8.5 Kbs    8 Kbs      2.4 Kbs
Ratio       1 to 1     2 to 1    4 to 1    10 to 1     15 to 1    16 to 1    53 to 1
Voice Data Transmission
The standard protocol used for sending data over networks has become what is known as Transmission
Control Protocol/Internet Protocol (TCP/IP). As the slash indicates, this protocol consists of two
distinct parts, each of which is discussed below. TCP/IP is generally used for sending data that is
required to be 100% accurate. This very attribute becomes its downfall when it is used for transmitting
voice data. It is, however, the basis for traffic over digital networks. Therefore, the underlying concepts will
TCP/IP works with what is known as datagrams. Datagram is a term that is used interchangeably
with the term packet. Although packet is a term more commonly used, technically, datagram is the right
word to use when describing TCP/IP. A datagram is just a unit of data, which is what the protocols really
deal with. A packet is a physical thing, appearing on a network or a wire. In most cases, a packet simply
contains a datagram, so there is very little difference. As indicated above TCP/IP consists of two separate
but inter-related protocol families. TCP is responsible for making sure that the data gets through to
the other end. It keeps track of what is sent, and re-transmits anything that did not get through. If any
message is too
large for one datagram, such as the content of an encoded segment of voice data, TCP will split it into
several datagrams and assure that they all arrive and are correctly re-assembled. Since these functions are
needed for many applications, they are put together into a separate protocol, rather than being part of the
specifications for sending, for instance, voice data. You can think of TCP as forming a library of routines
that applications can use when they need reliable network communications with another computer.
Similarly, TCP calls on the services of IP. Although the services that TCP supplies are needed by many
applications, there are still certain applications that don’t need them. However, every application needs
some services. So these services are put together into IP. As with TCP, you can think of IP as a library of
routines that TCP calls upon, and is also available to applications that don’t use TCP. This strategy of
building several levels of protocol is called layering. We think of the applications programs such as a VOIP
application suite, TCP, and IP, as being separate layers, each of which calls on the services of the layer
below it. Generally, TCP/IP applications use four layers. For example, these might be an application layer
such as a VOIP application, a protocol such as TCP that provides services needed by many applications, IP, which
provides the basic service of getting datagrams to their destination, and the protocols needed to manage a
specific physical medium, such as a modem.
So we see that TCP is responsible for breaking up the message into datagrams, reassembling them at the
other end, re-sending anything that gets lost, and putting things back in the right order. IP is responsible for
routing individual datagrams. It may seem like TCP is doing most of the work. In small networks, this is
true. However, in the Internet, simply getting a datagram to its destination can be a complex job.
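
The division of labor just described, splitting a message, detecting what is missing, and restoring the original order, can be sketched in Python. This is a toy model and not TCP itself: real TCP also uses acknowledgements, checksums, and flow control, and the function names and the use of byte offsets as sequence numbers are assumptions made for illustration.

```python
def segment(message, max_payload):
    """Split a message into (offset, payload) datagrams."""
    return [(offset, message[offset:offset + max_payload])
            for offset in range(0, len(message), max_payload)]


def missing_offsets(received, total_length, max_payload):
    """List the datagrams the receiver would ask to have re-sent."""
    have = {offset for offset, _ in received}
    return [offset for offset in range(0, total_length, max_payload)
            if offset not in have]


def reassemble(received):
    """Drop duplicates, restore the original order, and join the payloads."""
    by_offset = dict(received)
    return b"".join(by_offset[offset] for offset in sorted(by_offset))


if __name__ == "__main__":
    message = bytes(range(256)) * 4
    sent = segment(message, 100)
    # Pretend the network lost one datagram and delivered the rest reversed.
    arrived = list(reversed(sent[:3] + sent[4:]))
    print("re-send requested for offsets:",
          missing_offsets(arrived, len(message), 100))
    arrived.append(sent[3])  # the re-transmitted datagram finally arrives
    print("message intact:", reassemble(arrived) == message)
```

The wait for the re-transmitted datagram in the last step is exactly the delay that makes this approach awkward for live speech.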