Process voice input with StanfordNLP

Have you ever wanted or needed to extract specific information from the user’s voice input without too much trouble? Well, there might be an easy way!


The goal of this article is to familiarize you with processing voice input on Android. By the end of this piece you will understand how to transform the user's speech into text and, most importantly, how to parse and extract relevant information from it.

In order to achieve this, we will use Android's native Speech API and a third-party NLP (natural language processing) tool named StanfordNLP.

Before moving into more details, let’s establish a couple of reasonable requirements.

Prerequisites:

  • Basic understanding of how NLP (natural language processing) works.
  • Basic experience with Android and Kotlin development.

Alright, now let’s get to it.

There are multiple ways to process text in client applications like the one we will be creating now, but there are a couple of important questions that you might be wondering about already.

Where will the processing take place? 

NLP processors usually require high processing power and were long considered inadequate to run on mobile phones. While this is still largely true, mobile phones keep gaining processing power and memory. I mean, heck, the Samsung S20 Ultra features 12GB of RAM, which is more than most average laptops have today!

Today we will try to push the limits and run the whole NLP processing on the client side, on Android devices in our case. Please note that the official recommendation is still to offload all this processing work to a server instead of hosting it on the device itself.

What tool should I use?

While there are a lot of possibilities, we will focus on the most accessible ones, which are APIs like OpenNLP or StanfordNLP. These APIs basically receive a text and send back information about that text: sentences, words, parts of speech, lemmas and so on.

Apache OpenNLP represents a viable option but requires a lot of processing power and has a pretty slow initialization time on Android, from 20–40 seconds on some of my devices. This is not something to hold against it because, as explained above, in most cases this kind of processing shouldn't be done on the client anyway!

Fortunately, we're in luck with the easiest and fastest tool (you guessed it): StanfordNLP. It relies on the principle of a pipeline and has really low initialization times (from 1s on good devices to 4–5s on slower ones) and decent execution times (from 3–4ms to a few hundred milliseconds depending on the task). Basically, you tell the pipeline what operations to execute on the text, and that's all!

Alright, enough with the jibber jabber. Let’s set a goal!

Why don’t we create an easy Android app that takes voice input, transforms it into text and identifies its nouns?

Alright, now show me the code!

We're almost there, hang on. Now that we know what we want to do, let's first break the work down into the tasks we have to tackle:

  • Add dependencies.
  • Convert user voice input into text.
  • Process the text and identify nouns.

Done, let’s get to it!

Dependencies

While the Speech API doesn't require any extra dependencies, for StanfordNLP we need to add its artifacts to the app build.gradle file.
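Something along these lines should do the trick (the version below is only an example, so check for the current CoreNLP release on Maven Central; also note that the models artifact is quite large, which matters for the final APK size):

    dependencies {
        // StanfordNLP (CoreNLP) core library plus the English models artifact.
        // The version here is an assumption; use the release you actually target.
        implementation("edu.stanford.nlp:stanford-corenlp:3.9.2")
        implementation("edu.stanford.nlp:stanford-corenlp:3.9.2:models")
    }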

Convert user voice input into text

As mentioned above, we will use the Speech API to get the user's input by registering a SpeechRecognizer. We will also set a RecognitionListener on the recognizer so we can react once we have results.
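A minimal sketch of that setup, assuming it lives inside an Activity (the exact signature of requestExtractQuery() is an assumption here):

    import android.os.Bundle
    import android.speech.RecognitionListener
    import android.speech.SpeechRecognizer

    private lateinit var speechRecognizer: SpeechRecognizer

    private fun setUpSpeechRecognizer() {
        // "this" is the hosting Activity, which provides the Context
        speechRecognizer = SpeechRecognizer.createSpeechRecognizer(this)
        speechRecognizer.setRecognitionListener(object : RecognitionListener {
            override fun onResults(results: Bundle?) {
                // Fired once the recognizer has the final results for the utterance
                requestExtractQuery(results)
            }

            override fun onError(error: Int) { /* log or surface the error code */ }

            // The remaining callbacks are not needed for this sample
            override fun onReadyForSpeech(params: Bundle?) {}
            override fun onBeginningOfSpeech() {}
            override fun onRmsChanged(rmsdB: Float) {}
            override fun onBufferReceived(buffer: ByteArray?) {}
            override fun onEndOfSpeech() {}
            override fun onPartialResults(partialResults: Bundle?) {}
            override fun onEvent(eventType: Int, params: Bundle?) {}
        })
    }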

Don’t worry about the requestExtractQuery() call for now, we will get into it soon.

Now, in order to have results, we need user input. Therefore, every time the user asks to provide voice input, we launch an ACTION_RECOGNIZE_SPEECH intent and trigger the speechRecognizer to listen actively for results by calling startListening with that intent.
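A sketch of that trigger, living in the same Activity as the listener above (keep in mind that live recognition also requires the RECORD_AUDIO permission):

    import android.content.Intent
    import android.speech.RecognizerIntent

    private fun startListeningForSpeech() {
        val intent = Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH).apply {
            // The free-form model works best for general dictation;
            // an EXTRA_LANGUAGE extra can be added to force a specific locale
            putExtra(
                RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM
            )
        }
        // Deliver the results to the RecognitionListener registered earlier
        speechRecognizer.startListening(intent)
    }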

Perfect! Now we are receiving voice input from the user and converting it to text! We can access the resulting text by unwrapping the received bundle and taking the most confident result from the array of text results.
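Roughly like this, assuming the unwrapping happens inside the requestExtractQuery() call from the listener above (requestQuery() is assumed to hand the text over to the viewModel):

    private fun requestExtractQuery(results: Bundle?) {
        // The recognizer returns candidate transcriptions ordered by confidence,
        // so the first entry is the most confident one
        val resultedText = results
            ?.getStringArrayList(SpeechRecognizer.RESULTS_RECOGNITION)
            ?.firstOrNull()
            ?: return

        requestQuery(resultedText)
    }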

Afterwards, we send it downstream for processing by calling requestQuery(resultedText), which feeds the resulting text into the NLP client.

Process the text and identify nouns

Before we move into the code that is responsible for the actual processing, we have to initialize the StanfordNLP client. A good UX decision would be to do this initialization — which does a lot of heavy lifting like loading the language model — in a splash screen.
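As a rough, hypothetical sketch of that flow (SplashActivity, NLPViewModel and initNLPClient() are placeholders of mine; NLPClient is the singleton discussed below):

    import android.os.Bundle
    import androidx.activity.viewModels
    import androidx.appcompat.app.AppCompatActivity
    import androidx.lifecycle.ViewModel
    import androidx.lifecycle.viewModelScope
    import kotlinx.coroutines.Dispatchers
    import kotlinx.coroutines.launch

    class SplashActivity : AppCompatActivity() {
        private val viewModel: NLPViewModel by viewModels()

        override fun onCreate(savedInstanceState: Bundle?) {
            super.onCreate(savedInstanceState)
            setContentView(R.layout.activity_splash)
            // Kick off the heavy model loading as early as possible
            viewModel.initNLPClient()
        }
    }

    class NLPViewModel : ViewModel() {
        fun initNLPClient() {
            // Loading the language models is blocking work, so keep it off the main thread
            viewModelScope.launch(Dispatchers.IO) {
                NLPClient.init()
            }
        }
    }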

As you can see, the view delegates this initialization to the viewModel, which delegates it further downstream to the NLPClient singleton instance itself. It's important to note that we launch it on the IO dispatcher because initializing the client is a heavy operation. Let's see how a StanfordNLP client can be initialized just by passing a set of annotators.
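A minimal sketch, with the client wrapped in the NLPClient singleton mentioned above (the object shape and init() signature are assumptions; the Properties-based constructor is the standard StanfordCoreNLP API):

    import edu.stanford.nlp.pipeline.StanfordCoreNLP
    import java.util.Properties

    object NLPClient {
        lateinit var pipeline: StanfordCoreNLP
            private set

        fun init() {
            val props = Properties().apply {
                // The annotators we need: tokenization, sentence splitting,
                // part-of-speech tagging and lemmatization
                setProperty("annotators", "tokenize, ssplit, pos, lemma")
            }
            // This loads the language models, hence the long initialization time
            pipeline = StanfordCoreNLP(props)
        }
    }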

How do we know which annotators to pass? Well, we consult the documentation. We know that we need to tokenize our text in order to get individual sentences and, more importantly, individual words; to achieve that we use both tokenize and ssplit. We also know that we need each token's part of speech, so we use the pos annotator. And why not go the extra mile and extract the root of each word, with the simple purpose of getting the singular form of any plural noun, by using lemma. We pass these properties into the constructor and we're done here; the client is ready for work!

Now let’s get back to the text processing which is basically what we all have been waiting for – we have the text, but we need to extract the nouns from it:
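What follows is a minimal sketch of that extraction, assuming the NLPClient pipeline from the initialization step (the function name extractFirstNoun is a placeholder of mine).

    import android.util.Log
    import edu.stanford.nlp.pipeline.CoreDocument

    fun extractFirstNoun(text: String): String? {
        // Feed the pipeline a document built from the input text
        val document = CoreDocument(text)
        NLPClient.pipeline.annotate(document)

        // For simplicity we only look at the first sentence, as described below
        val sentence = document.sentences().firstOrNull() ?: return null

        for (token in sentence.tokens()) {
            val word = token.word()   // the token's value, i.e. the word itself
            val pos = token.tag()     // its part-of-speech tag
            // Penn Treebank noun tags start with "NN" (NN, NNS, NNP, NNPS)
            if (pos.startsWith("NN")) {
                Log.d("NLPClient", "Found noun '$word' with lemma '${token.lemma()}'")
                // The lemma gives us the root (singular) form of the noun
                return token.lemma()
            }
        }
        return null
    }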

Initially, we launch the pipeline by feeding it a document built from the text we need to process. From this point, we sequentially apply the known annotators to the result of the pipeline. We first get the sentences and then iterate through the tokens of each sentence; for simplicity, we will assume the input contains only one sentence and one noun.

Afterwards, we iterate over the tokens of the sentence and extract their value, which is the word itself. Alright, we have the word, so we keep applying annotations, this time extracting its part of speech and verifying whether the identified pos tag is for a noun. Once this is done, we have only one more thing to do: extract the root noun by calling the token.lemma method.

Let’s see how it works:

We’re done! That was easy, right? 

If you are in a hurry and would like to skip this final part, jump directly to the end of this article to find a public repository with all the code presented above. Thanks for your time!

Alright. Well… not so much. We converted the user sound input into text, we processed it and extracted the first noun we could find. But how does it all work under the hood?

This is a more complicated question that even I have trouble answering, as voice transformation and NLP are vast and complex areas. Both are still under active research, and if you consider yourself passionate about them, I can give you some starting points related to this article:

  • The voice recognition interface (or voice-to-text interface) used in this sample under the name of Speech API is a very complex system with trained models underneath. You probably knew that, but do you know how a basic voice transformation system works? It relies on a probabilistic system based on a Hidden Markov Model, which takes into account phonetic and acoustic features as well as statistical information about word placement given by pre-trained language models. If you would like to understand the basic maths behind this process, I encourage you to check out Audio Processing and Speech Recognition: Concepts, Techniques and Research Overviews, chapter 2.3, Automatic Speech Recognition System.
  • StanfordNLP is a really complex system and has a ton of features from the processing perspective. You can do even more advanced tasks like using NER (Named Entity Recognition) trained models to extract words related to people, places, weather, dates and so on. One of the most complex tasks used in this sample is extracting the part of speech of the words. Stanford NLP does this by using a MEMM (Maximum Entropy Markov Model) in a probabilistic process that gives it statistical information about different features when assigning parts of speech to the words. You can find out more about the mathematics and mechanics behind this by checking out Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, chapter 5, Part-of-Speech Tagging.

Enough with the talk, where is the code?

Don’t worry, I won’t let you down with snippets only. Below you can find a repository with this exact sample and all the above code.

Please note, though, that the above repository also contains a logging system alongside the logic presented above. Therefore, on a first read, I would advise you to ignore anything related to the logging system, as it is a UI tool that helps people who actually run the app to better understand what happens behind the scenes.

Thank you for your time and I hope it helped!

Final note:

I wouldn't advise using such an implementation in production, as some devices might struggle to provide this kind of computing power. Also, because the models are loaded locally, the size of the application grows considerably, which again makes it unsuited for production. That being said, use the above piece to experiment, learn and have fun!

Let me know in the comment section about any questions you might have or what else you would like me to cover next time!

Remember, if you liked this post subscribe for more by using the bottom subscribe widget! And if you really enjoyed it, then you can buy me a coffee here! Thanks!
