Python speech recognition for beginners

Speech recognition has its roots in research done at Bell Labs in the early 1950s. Early systems were limited to a single speaker and had vocabularies of about a dozen words. Modern speech recognition systems have come a long way since their ancient counterparts. They can recognize speech from multiple speakers and have enormous vocabularies in many languages.

In this article, we will cover the basics of speech recognition with Python. As you may know, our platform's backend is built entirely with Python and Django, so this topic belongs on our blog.

Libraries for Speech Recognition available

PyPI hosts a number of packages for speech recognition. A few of them are apiai, google-cloud-speech, SpeechRecognition and wit.

Some of these packages, such as wit and apiai, offer built-in features, like natural language processing for identifying a speaker’s intent, which go beyond basic speech recognition. Others, like google-cloud-speech, focus solely on speech-to-text conversion.

There is one package that stands out in terms of ease-of-use: SpeechRecognition.

The flexibility and ease-of-use of the SpeechRecognition package make it an excellent choice for any Python project. However, support for every feature of each API it wraps is not guaranteed. You will need to spend some time researching the available options to find out if SpeechRecognition will work in your particular case.

How to install SpeechRecognition

You can install SpeechRecognition from a terminal with pip:

pip install SpeechRecognition

We recommend using the package with Python 3, although it is also compatible with older versions of the language.

Once installed, you should verify the installation by opening an interpreter session and typing:

# Show version of SpeechRecognition
# https://pypi.org/project/SpeechRecognition/
import speech_recognition as sr
sr.__version__
# 3.8.1

Speech recognition engine/API support:

CMU Sphinx (works offline)
Google Speech Recognition
Google Cloud Speech API
Wit.ai
Microsoft Bing Voice Recognition
Houndify API
IBM Speech to Text
Snowboy Hotword Detection (works offline)

The Recognizer

All of the magic in SpeechRecognition happens with the Recognizer class.

The primary purpose of a Recognizer instance is, of course, to recognize speech. Each instance comes with a variety of settings and functionality for recognizing speech from an audio source.

Creating a Recognizer instance is easy. In your current interpreter session, you need to type:

r = sr.Recognizer()

Each Recognizer instance has seven methods for recognizing speech from an audio source using various APIs. These are:

recognize_bing(): Microsoft Bing Speech
recognize_google(): Google Web Speech API
recognize_google_cloud(): Google Cloud Speech - requires installation of the google-cloud-speech package
recognize_houndify(): Houndify by SoundHound
recognize_ibm(): IBM Speech to Text
recognize_sphinx(): CMU Sphinx - requires installing PocketSphinx
recognize_wit(): Wit.ai

Of the seven, only recognize_sphinx() works offline with the CMU Sphinx engine. The other six all require an internet connection.

Each recognize_*() method will raise a speech_recognition.RequestError exception if the API is unreachable. For recognize_sphinx(), this could happen as the result of a missing, corrupt or incompatible Sphinx installation. For the other six methods, RequestError may be raised if quota limits are met, the server is unavailable, or there is no internet connection.

All seven recognize_*() methods of the Recognizer class require an audio_data argument. In each case, audio_data must be an instance of SpeechRecognition’s AudioData class.

Working With Audio Files

SpeechRecognition makes working with audio files easy thanks to its handy AudioFile class. This class can be initialized with the path to an audio file and provides a context manager interface for reading and working with the file’s contents.

Supported File Types

WAV: must be in PCM/LPCM format
AIFF
AIFF-C
FLAC: must be native FLAC format; OGG-FLAC is not supported

If you are working on x86-based Linux, macOS or Windows, you should be able to work with FLAC files without a problem. On other platforms, you will need to install a FLAC encoder and ensure you have access to the flac command line tool.
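If you are unsure whether a WAV file is PCM-encoded, Python's standard wave module offers a quick check, since it reads uncompressed PCM data and raises wave.Error for formats it does not understand. The sketch below is not part of SpeechRecognition; the helper names write_pcm_wav and describe_wav are made up for illustration. It writes a short test tone as 16-bit PCM and reads its parameters back.

```python
import math
import struct
import wave


def write_pcm_wav(path, seconds=1.0, sample_rate=16000, frequency=440.0):
    """Write a mono, 16-bit PCM WAV file containing a sine tone."""
    n_frames = int(seconds * sample_rate)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * frequency * i / sample_rate))
        for i in range(n_frames)
    )
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)            # mono
        wav.setsampwidth(2)            # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(b"".join(struct.pack("<h", s) for s in samples))


def describe_wav(path):
    """Return (channels, sample_width_bytes, sample_rate, n_frames).

    wave.open() raises wave.Error if it cannot parse the file as PCM WAV.
    """
    with wave.open(path, "rb") as wav:
        return (wav.getnchannels(), wav.getsampwidth(),
                wav.getframerate(), wav.getnframes())
```

A file produced by write_pcm_wav() contains only a tone, so it is useful for testing file handling rather than recognition itself.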

Using record() to Capture Data From a File

# AudioFile is a context manager; record() reads the whole file into an AudioData instance
audio_file = sr.AudioFile('audio.wav')
with audio_file as source:
    audio = r.record(source)

You can now invoke recognize_google() or some of the other methods to attempt to recognize any speech in the audio. Depending on your internet connection speed, you may have to wait several seconds before seeing the result.

r.recognize_google(audio)

Congratulations! If everything is fine with your source WAV file, you should see its contents as text.

There are certainly many more topics to cover, but this is enough for a first tutorial. If you want to add speech recognition to your site easily, you can also try our service. It is as easy as adding 10 lines of code to your site. Good luck!
