What is VAD (Voice Activity Detector)?
Voice Recognition NLP
Jan. 1, 2020

Voice Activity Detection (VAD), also known as silence suppression, is the detection of voice activity in an input acoustic signal in order to separate active speech from background noise or silence.

In telecommunication systems, the most expensive element is not the station equipment (switches, amplifiers, power supplies, and so on) but the communication lines themselves, and telephone systems are no exception. The effectiveness of a communication system is therefore determined first of all by how efficiently its lines are used. Many methods are used to increase the amount of information transmitted over a line, such as frequency-division and time-division multiplexing. In voice systems, which include mobile networks, various compression schemes are applied as well.

Speech as a natural source of information is redundant: it contains a great deal of data that carries no semantic load. Many algorithms have therefore been created that remove this redundancy and try to keep only the significant parameters of speech. Typically, several voice compression techniques are applied at the same time and combined under the common name of a voice codec, or vocoder. The most common way to compress speech data is to remove the pauses between phrases, words, and individual sounds. Studies have shown that a monologue can consist of up to 50% pauses, and in a dialogue the share of pauses can reach 70%. Since a telephone call is usually a conversation between two people, removing pauses allows roughly 2-3 times compression without loss of quality. The voice activity detector is built on this principle.
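The 2-3 times figure follows from simple arithmetic. As a rough sketch, assuming pauses are dropped entirely and nothing is sent in their place, the ratio of original to transmitted duration is 1 / (1 - pause fraction). The function name `compression_ratio` is only illustrative.

```python
def compression_ratio(pause_fraction: float) -> float:
    """Ratio of original to transmitted duration when pauses are dropped.

    Assumes pauses cost nothing to transmit, which is an idealization.
    """
    if not 0.0 <= pause_fraction < 1.0:
        raise ValueError("pause_fraction must be in [0, 1)")
    return 1.0 / (1.0 - pause_fraction)


# Pause shares quoted in the article: 50% (monologue) and 70% (dialogue).
for p in (0.5, 0.7):
    print(f"pauses {p:.0%} -> about {compression_ratio(p):.1f}x")
```

In a real system the gain is somewhat lower, because some information about the pauses still has to be sent.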

The VAD algorithm does not work by itself but as one of the steps in the speech encoding process, before the signal is sent into the telecommunication system. Pauses are usually detected by analyzing packets of digitized voice data, each of which is a short fragment of the signal. The most difficult aspect of a VAD algorithm is choosing the criterion that decides, with high probability, whether a packet contains a pause rather than speech. The cost of a wrong decision is the loss of some of the speech data. In the simplest implementation, a pause is detected by comparing the total energy of a speech data packet with a threshold value that separates pauses from voice packets. The threshold must be chosen so that speech is not mistaken for a pause too often, which would lose useful data and degrade Quality of Service, while at the same time not missing too many real pauses, which would reduce the effectiveness of the VAD algorithm. In practice, a more sophisticated algorithm is usually used that takes into account not only the energy of the packet but also the energy of the spectral components of the signal fragment. The rate of change (increase or decrease) of the energy of the fragment relative to the previous ones is also taken into account. In complex noise environments, the effectiveness of VAD can be maintained by periodically re-estimating the background noise parameters.
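The simplest variant described above, an energy threshold combined with a running estimate of the background noise, might look roughly like the sketch below. It is an illustration only, not the algorithm of any particular codec. The frame length (160 samples, i.e. 20 ms at an assumed 8 kHz sampling rate), the `margin`, the `initial_noise` level, and the `smoothing` factor are all assumed values chosen for the example.

```python
import math
import random


def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def energy_vad(samples, frame_len=160, margin=4.0,
               initial_noise=1e-4, smoothing=0.9):
    """Return one flag per full frame: True = speech, False = pause.

    A frame counts as speech when its energy exceeds `margin` times the
    current background-noise estimate. All parameter values here are
    assumptions made for this example.
    """
    flags = []
    noise = initial_noise
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        energy = frame_energy(samples[start:start + frame_len])
        is_speech = energy > margin * noise
        if not is_speech:
            # Re-estimate the background noise only during pauses.
            noise = smoothing * noise + (1.0 - smoothing) * energy
        flags.append(is_speech)
    return flags


def quiet(n, rng):
    """Low-level uniform noise standing in for a pause."""
    return [rng.uniform(-0.01, 0.01) for _ in range(n)]


# Synthetic test signal: 3 quiet frames, 2 frames of a 440 Hz tone
# standing in for speech, then 3 quiet frames.
rng = random.Random(0)
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(320)]
audio = quiet(480, rng) + tone + quiet(480, rng)
print(energy_vad(audio))
```

A production detector would add the spectral and rate-of-change criteria mentioned above. This sketch covers only the energy comparison and the noise re-estimation.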

On the receiving side, a complementary part of the VAD mechanism restores the output signal. Recovery is not simply a matter of filling the pauses with zero energy. Studies show that listeners interpret complete silence in the phone's speaker as a dropped connection, which creates discomfort. For this reason, noise is inserted into the pauses between speech. There are two options. In the first, the noise is produced by a white noise generator. This is the most efficient option, because only the length of each pause is transmitted from the source. In the second, the pause is heavily compressed on the transmitting side, but general parameters describing its level, frequency content, and so on are retained. On the receiving side, a generator recreates the pause from this additional data. This option requires extra information to be transmitted, which reduces the overall efficiency of VAD, but it yields the most natural-sounding result and virtually eliminates the 'traces' of the voice activity detector. In practice, the second option is generally used: it is more expensive, but it is more comfortable for the listener.
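The receiving side can be sketched as follows. The packet layout is hypothetical and invented for this example: a speech packet carries samples, while a pause packet carries only its length in frames and a single noise level. That places it between the two options above, since a real system using the second option would send a richer description of the noise.

```python
import random


def comfort_noise(num_samples, level=0.005, seed=None):
    """White Gaussian noise whose standard deviation is `level`."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, level) for _ in range(num_samples)]


def reconstruct(packets, frame_len=160):
    """Rebuild the output stream from a list of hypothetical packets.

    Each packet is either ("speech", samples) or
    ("pause", n_frames, level). This format is an assumption made for
    the example, not a real protocol.
    """
    out = []
    for packet in packets:
        if packet[0] == "speech":
            out.extend(packet[1])
        else:
            _, n_frames, level = packet
            # Fill the pause with noise instead of digital silence.
            out.extend(comfort_noise(n_frames * frame_len, level))
    return out


stream = reconstruct([("speech", [0.1] * 160), ("pause", 2, 0.01)])
print(len(stream))
```

The pause packet here stands for two frames of audio but carries only two numbers, which is where the bandwidth saving comes from.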

Advantages and uses

In voice digitization, signal fragments that VAD classifies as active speech can then be encoded and compressed by an audio codec (e.g. CELP). In software, VAD is used to distinguish human speech from background noise.

Using VAD (or silence suppression) reduces the amount of data transmitted over the communication channel: pauses in speech (determined by the signal level) are not digitized or encoded, so "empty" packets containing only silence are not sent over the network. This is especially important for packet transmission, as in TCP/IP networks, because in addition to the data itself, each protocol layer of the OSI model (transport, network, etc.) adds its own service information to every data packet. As a result, the size of each packet increases significantly. Dropping "empty" packets that contain only low-level noise is therefore an easy way to save bandwidth and, as a result, increase the effective capacity of the channel. For this reason, VAD is often used together with various codecs to compress IP telephony traffic effectively.
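A back-of-the-envelope calculation shows how much the headers matter. The sketch below assumes voice is carried as RTP over UDP over IPv4, which is common for VoIP but is not stated in this article, and uses the minimum header sizes of those protocols. The 20-byte payload sent every 20 ms corresponds to an 8 kbit/s codec and is only an example. Link-layer overhead and comfort-noise updates are ignored.

```python
# Minimum header sizes in bytes (assumed protocol stack: RTP/UDP/IPv4).
IPV4_HEADER = 20  # without options
UDP_HEADER = 8
RTP_HEADER = 12


def bandwidth_kbps(payload_bytes, packets_per_second, speech_fraction=1.0):
    """Approximate IP-level bandwidth of one voice stream in kbit/s.

    `speech_fraction` is the share of time in which packets are actually
    sent; with VAD enabled it is below 1.0.
    """
    packet_bytes = payload_bytes + IPV4_HEADER + UDP_HEADER + RTP_HEADER
    return packet_bytes * 8 * packets_per_second * speech_fraction / 1000


# 20-byte payload every 20 ms = 50 packets per second.
print(bandwidth_kbps(20, 50))        # without VAD
print(bandwidth_kbps(20, 50, 0.5))   # with VAD, 50% pauses
```

Under these assumptions the headers are twice the size of the payload, so every suppressed packet saves three times the bandwidth of the speech data it would have carried.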

Disadvantages and how to eliminate them

The problem with VAD is that, as a result of silence suppression (in fact, suppression of low-level sound), the listener no longer hears the small recognizable sounds that accompany live speech (breathing, sniffling, and other minor noises). This creates a problem because in ordinary conversation all of this is audible. The absence of normal background noise during playback is unpleasant and reduces perception and understanding.

Emulating these ancillary sounds, known as comfort noise generation (CNG), is the counterpart of the VAD process and can be used to solve this problem.

The VAD algorithm is used in virtually all telecommunication systems that transmit digital speech. In particular, it is widely used in VoIP, in ISDN, and in second-generation cellular systems. The gain in system efficiency and the accessibility of VAD suggest that it will continue to be applied, and that work will continue on more sophisticated mechanisms for detecting pauses in speech.
