Speech recognition services. So Android can recognize speech without the Internet! Generating subtitles for films

“I should say right away that this is my first time working with recognition services, so I will describe them from a layperson's point of view,” our expert noted. “To test recognition, I used three services: Google, Yandex and Azure.”

Google

The well-known IT corporation offers to test its Google Cloud Platform product online. Anyone can try the service for free. The product itself is convenient and easy to use.

Pros:

support for more than 80 languages;
fast name processing;
high-quality recognition over a poor connection and in the presence of extraneous sounds.

Cons:

messages spoken with an accent or poor pronunciation are hard to recognize, which makes the system difficult to use for anyone other than native speakers;
no clear technical support service.

Yandex

Speech recognition from Yandex is available in several options:

Cloud
Library for access from mobile applications
"Boxed" version
JavaScript API

But let's be objective. We are interested primarily in the quality of speech recognition, not in the variety of ways to use it. We therefore used the trial version of SpeechKit.

Pros:

ease of use and configuration;
good recognition of Russian speech;
the system produces several candidate transcriptions and uses neural networks to pick the one closest to what was said.

Cons:

During stream processing, some words may be determined incorrectly.

Azure

The Azure system is developed by Microsoft. It stands out from its analogues because of its price. But be prepared for some difficulties: the instructions on the official website are either incomplete or outdated. We were unable to get the service running properly ourselves, so we had to use a third-party launcher. Even then, you need an Azure service key for testing.

Pros:

Compared to other services, Azure processes messages very quickly in real time.

Cons:

the system is very sensitive to accent and has difficulty recognizing speech from non-native speakers;
The system operates only in English.

Review results:

After weighing all the pros and cons, we settled on Yandex. SpeechKit is more expensive than Azure but cheaper than Google Cloud Platform. Google's product has been steadily improving in recognition quality and accuracy, since the service improves itself through machine learning. However, Yandex’s recognition of Russian words and phrases is a level higher.

How to use voice recognition in business?

There are a lot of options for using recognition, but we will focus your attention on the one that will primarily affect your company’s sales. For clarity, let’s look at the recognition process using a real example.

Not long ago, a well-known SaaS service became our client (at the company's request, its name is not disclosed). With the help of F1Golos, they recorded two audio clips: one aimed at extending the customer lifetime of warm customers, the other at processing customer requests.

How to extend customer life using voice recognition?

SaaS services often charge a monthly subscription fee. Sooner or later, the trial or paid period ends, and the service has to be renewed. The company decided to warn users 2 days before their period expired. Users were notified by a voice message. The clip sounded like this: “Good afternoon, we remind you that your paid period for using the XXX service is ending. To extend the service, say yes; to cancel the services provided, say no.”

Calls from users who said one of the code words (YES, RENEW, I WANT, MORE DETAILS) were automatically transferred to the company's operators. As a result, about 18% of users renewed after just one call.
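The routing rule described above can be sketched in a few lines of Python. This is only an illustration, not the client's or F1Golos's actual implementation: the function names, the English "no" keyword for declining, and the idea of matching on whole words are all assumptions made for the example. It takes the text returned by a recognition service and decides where the call goes.

```python
import re

# Code words quoted in the article; treating them as a fixed set is an assumption.
RENEW_KEYWORDS = ("yes", "renew", "i want", "more details")
# Hypothetical decline keyword, taken from the prompt text "say no".
DECLINE_KEYWORDS = ("no",)


def normalize(transcript: str) -> str:
    """Lowercase the transcript and replace punctuation with single spaces."""
    return " ".join(re.findall(r"[a-z']+", transcript.lower()))


def contains_phrase(text: str, phrase: str) -> bool:
    """Match on word boundaries so that 'no' does not fire inside 'know'."""
    return re.search(r"\b" + re.escape(phrase) + r"\b", text) is not None


def route_call(transcript: str) -> str:
    """Return 'operator', 'cancel' or 'unrecognized' for a recognized phrase."""
    text = normalize(transcript)
    # Renewal keywords are checked first, so a mixed answer goes to an operator.
    if any(contains_phrase(text, kw) for kw in RENEW_KEYWORDS):
        return "operator"
    if any(contains_phrase(text, kw) for kw in DECLINE_KEYWORDS):
        return "cancel"
    return "unrecognized"
```

Checking renewal keywords first is a deliberate choice in this sketch: an ambiguous answer is handed to a human operator rather than treated as a cancellation.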

How to simplify a data processing system using speech recognition?

The second audio clip, launched by the same company, had a different purpose. They used voice messaging to reduce the cost of verifying phone numbers. Previously, they verified user numbers with a robocall in which a robot asked users to press certain keys on the phone. With the advent of recognition technologies, the company changed tactics. The text of the new clip was as follows: “You have registered on the XXX portal. To confirm your registration, say yes. If you did not submit a registration request, say no.” If the client said YES, I CONFIRM, AHA or OF COURSE, this was instantly passed to the company’s CRM system, and the registration request was confirmed automatically within a couple of minutes. The introduction of recognition technologies reduced the length of one call from 30 to 17 seconds, cutting the company's costs almost in half.
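The saving can be checked with simple arithmetic. The sketch below assumes that the cost of a call is proportional to its duration (for example, per-second billing); the article does not state the billing model, so this is an assumption, and the function name is made up for the example.

```python
def call_time_change(old_seconds: float, new_seconds: float) -> tuple[float, float]:
    """Return (speed-up factor, fraction of time saved) for one call."""
    if old_seconds <= 0 or new_seconds <= 0:
        raise ValueError("durations must be positive")
    speedup = old_seconds / new_seconds
    saved_fraction = 1 - new_seconds / old_seconds
    return speedup, saved_fraction


speedup, saved = call_time_change(30, 17)
print(f"calls are {speedup:.2f} times shorter")   # 1.76 times
print(f"time saved per call: {saved:.1%}")        # 43.3%
```

Under that assumption, going from 30 to 17 seconds makes each call about 1.76 times shorter, which is a saving of roughly 43% per call.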

If you are interested in other ways to use voice recognition, or want to learn more about voice messaging, follow the link. On F1Golos you can sign up for your first newsletter for free and find out for yourself how new recognition technologies work.

To recognize speech and convert it from audio or video to text, there are programs and browser extensions (plugins). But why bother with them if there are online services? Programs must be installed on your computer; moreover, most speech recognition programs are far from free.

A large number of plugins installed in the browser greatly slow down its operation and the speed of surfing the Internet. And the services that we will talk about today are completely free and do not require installation - just go in, use it and leave!

In this article we will look at two online speech-to-text translation services. Both of them work on a similar principle: you start recording (allow the browser access to the microphone while using the service), speak into the microphone (dictate), and the output is text that can be copied to any document on the computer.

Speechpad.ru

A Russian-language online speech recognition service with detailed instructions in Russian. Main functionality of the service:

support for 7 languages (Russian, Ukrainian, English, German, French, Spanish, Italian)
downloading an audio or video file for transcription (videos from YouTube are supported)
simultaneous translation into another language
support for voice input of punctuation marks and line feeds
button panel (change case, newline, quotes, brackets, etc.)
a personal account with a history of recordings (available after registration)
availability of a plugin for Google Chrome for entering text by voice in the text field of websites (called “Voice text input - Speechpad.ru”)

Dictation.io

The second online speech-to-text service. It is a foreign service that nevertheless works very well with Russian, which is quite surprising. Its recognition quality is not inferior to Speechpad's, but more on that later.

Main functionality of the service:

support for 30 languages, including Hungarian, Turkish, Arabic, Chinese, Malay, etc.
automatic recognition of the pronunciation of punctuation marks, line breaks, etc.
Possibility of integration with pages of any website
availability of a plugin for Google Chrome (called “VoiceRecognition”)

In speech recognition, the most important thing is the quality of the speech-to-text conversion. Pleasant extras and additional features are nothing more than a nice bonus. So what can both services boast in this regard?

Comparative test of services

For the test, we will select two difficult-to-recognize fragments that contain words and figures of speech that are rarely used in modern speech. To begin with, we read a fragment of the poem “Peasant Children” by N. Nekrasov.

Below is the result of converting speech into text by each service (errors are indicated in red):

As you can see, both services coped with speech recognition with almost the same errors. The result is quite good!
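A comparison like this can also be put into numbers. A common measure is the word error rate (WER): the minimum number of word substitutions, deletions and insertions needed to turn the recognized text into the reference text, divided by the number of words in the reference. The article itself does not compute WER; the sketch below is an illustration, and the sample sentences in the usage note are made up rather than taken from the test.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = cur[j - 1] + 1  # extra word in hypothesis
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

For example, `word_error_rate("the cat sat on the mat", "the cat sit on mat")` counts one substitution and one deletion against six reference words. Punctuation is not stripped here, so in practice both texts would be cleaned the same way before comparing.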

Now, for the test, let’s take an excerpt from the letter of the Red Army soldier Sukhov (film “White Sun of the Desert”):

Great result!

As you can see, both services cope very well with speech recognition, so choose either one! It looks like they even use the same engine: judging by the test results, the mistakes they made were too similar. But if you need additional features such as uploading an audio or video file and converting it into text (transcription) or simultaneous translation of spoken text into another language, then Speechpad will be the best choice!

By the way, here is how it performed a simultaneous translation of a fragment of Nekrasov’s poem into English:

And here is a short video guide to working with Speechpad, recorded by the author of the project himself:


Updated: Monday, July 31, 2017

What does the fantastic idea of talking to a computer have to do with professional photography? Almost nothing, unless you are a fan of the endless development of our entire technical environment. Imagine for a moment that you give your camera voice commands to change the focal length and apply an exposure correction of plus half a stop. Remote control of a camera has already been implemented, but there you have to press buttons silently; here we would have a camera that listens!

It has become a tradition to cite a science fiction film as an example of voice communication between a person and a computer, for example “2001: A Space Odyssey” directed by Stanley Kubrick. There, the on-board computer not only conducts a meaningful dialogue with the astronauts but can also read lips, like a deaf person. In other words, the machine has learned to recognize human speech without errors. Remote voice control of a camera may seem superfluous to some, but many would like to just say "Take our picture, baby" and have a photo of the whole family against the background of a palm tree ready.

Well, I have paid tribute to tradition and daydreamed a little. But, speaking from the heart, this article was difficult to write, and it all started with a gift: a smartphone running Android 4. This model, the HUAWEI U8815, has a small four-inch touch screen and an on-screen keyboard. Typing on it is a little unusual, but it turns out not to be particularly necessary. (image01)

1. Voice recognition in a smartphone running Android OS

While mastering the new toy, I noticed a microphone icon in the Google search bar and on the keyboard in Notes. Previously, I had not been interested in what this symbol meant: I talked on Skype and typed letters on a keyboard, as most Internet users do. But, as was later explained to me, voice search in Russian had been added to the Google search engine, and programs had appeared that let you dictate short messages in the Chrome browser.

I said a three-word phrase; the program identified the words and showed them in a box with a blue background. There was something to be surprised about: every word was written correctly. If you tap this box, the phrase appears in the text field of the Android notepad. So I said a couple more phrases and sent a message to my assistant via SMS.

2. A brief history of voice recognition programs.

It was no discovery for me that modern advances in voice control make it possible to issue commands to household appliances, a car or a robot. Command mode was introduced in past versions of Windows, OS/2 and Mac OS. I had come across talking programs, but what use are they? Perhaps it is a peculiarity of mine that I find it easier to speak than to type on a keyboard, and on a cell phone I cannot type at all. I have to enter contacts on a laptop with a normal keyboard and transfer them via a USB cable. But simply speaking into a microphone and having the computer type the text itself without errors was a dream for me. Forum discussions kept up the atmosphere of hopelessness. The same sad thought ran through all of them:

“However, in reality, to date, programs for real speech recognition (and even in Russian) practically do not exist, and they will obviously not be created soon. Moreover, even the inverse problem of recognition - speech synthesis, which, it would seem, is much simpler than recognition, has not been fully solved." (ComputerPress No. 12, 2004)

“There are still no normal speech recognition programs (not just Russian) because the task is quite difficult for a computer. And the worst thing is that the mechanism of word recognition by humans has not yet been realized, so there is nothing to start from when creating recognition programs.” (Another discussion on the forum).

At the same time, reviews of English-language voice text entry programs indicated clear successes. For example, IBM ViaVoice 98 Executive Edition had a basic vocabulary of 64,000 words and the ability to add the same number of your own words. The percentage of word recognition without training the program was about 80% and with subsequent work with a specific user reached 95%.

Among the Russian language recognition programs, it is worth noting “Gorynych” - an addition to the English-language Dragon Dictate 2.5. I will tell you about the search and then the “battle with the five Gorynychs” in the second part of the review. The first I found was the "English Dragon".

3. Continuous speech recognition program “Dragon Naturally Speaking”

The modern version of the program from Nuance ended up with an old friend of mine from the Minsk Institute of Foreign Languages. She brought it back from a trip abroad, having bought it in the hope that it could serve as a “computer secretary.” But something didn’t work out, and the program remained on her laptop, almost forgotten. Since she had no clear experience to share, I had to go to my friend's and try the program myself. All this lengthy introduction is necessary for a correct understanding of the conclusions that I have drawn.

The full name of my first dragon was: . The program is in English, and everything in it is clear even without a manual. The first step is to create a profile for a specific user so that the program can determine how that user pronounces words. That is what I did; the speaker’s age, country and pronunciation features are important. My choices were: age 22–54, UK English, standard pronunciation. Next come several windows where you configure your microphone. (image04)

The next stage for serious speech recognition programs is training on the pronunciation of a particular person. You are asked to choose the kind of text: I chose brief instructions on dictation, but you can also “order” a humorous story.

The essence of this stage is extremely simple: text is displayed in the window, with a yellow arrow above it. When you pronounce the text correctly, the arrow moves through the phrases, and at the bottom there is a training progress bar. I had pretty much forgotten my conversational English, so I made progress with difficulty. Time was also limited: the computer was not mine, and I had to interrupt the training. But my friend said she completed it in less than half an hour. (image05)

Without letting the program adapt to my pronunciation, I went to the main window and launched the built-in text editor. I spoke individual words from some texts I found on the computer. The program printed the words I said correctly and replaced those I said poorly with something “English.” When I clearly pronounced the command “erase line” in English, the program executed it. This means that I read the commands correctly and that the program recognizes them without prior training.

But it was important to me how this “dragon” writes in Russian. As you understood from the previous description, when training the program, you can only select English text; there is simply no Russian there. It is clear that it will not be possible to train Russian speech recognition. In the next photo you can see what phrase the program typed when pronouncing the Russian word “Hello”. (image06)

The outcome of the conversation with the first dragon turned out to be slightly comical. If you carefully read the text on the official website, you can see the English “specialization” of this software product. In addition, when loading, we read “English” in the program window. So why was all this necessary? It is clear that forums and rumors are to blame...

But there was also some useful experience. My friend asked me to check the condition of her laptop, which had somehow started to work slowly. This was not surprising: the system partition had only 5% free space. While deleting unnecessary programs, I saw that the official version took up more than 2.3 GB. This figure will be useful to us later. (image.07)

Recognizing Russian speech, as it turned out, was a non-trivial task. In Minsk I managed to find “Gorynych” through a friend. He searched for the disc for a long time among his old junk and, according to him, it is the official release. The program installed instantly, and I found out that its dictionary contains 5,000 Russian words plus 100 commands and 600 English words plus 31 commands.

First you need to set up the microphone, which I did. Then I opened the dictionary and added the word "examination" because it was not in the program dictionary. I tried to speak clearly and monotonously. Finally, I opened the Gorynych Pro 3.0 program, turned on the dictation mode and received this list of “close-sounding words.” (image.09)

The result puzzled me, because it was clearly worse than what the Android smartphone produced, so I decided to try other programs from the Google Chrome online store. I put off dealing with the “Gorynych dragons” until later. Putting things off, I thought, is very much in the original Russian spirit.

5. Google's voice capabilities

To work with voice on a regular Windows computer, you need to install the Google Chrome browser. When you are online in it, you can click the link to the software store at the bottom right. There, completely free, I found two programs and two extensions for voice text input. The programs are called "Voice notepad" and "VoiceNote - voice to text". After installation, they can be found on the "Applications" tab of the Chrome browser. (image. 10)

The extensions are called "Google Voice Search Hotword (Beta) 0.1.0.5" and "Voice text input - Speechpad.ru 5.4". After installation, they can be turned off or deleted on the "Extensions" tab. (image. 11)

VoiceNote. On the applications tab in the Chrome browser, double-click the program icon. A dialog box opens, as in the picture below. After clicking the microphone icon, you speak short phrases into the microphone. The program sends your speech to the speech recognition server and types the text in the window. All the words and phrases shown in the illustration were typed correctly on the first try. Obviously, this method only works with an active Internet connection. (image. 12)

Voice notepad. If you launch the program from the applications tab, it opens the Speechpad.ru web page in a new tab. There you will find detailed instructions on how to use the service, as well as a compact form. The latter is shown in the illustration below. (image. 13)

Voice text input lets you fill in text fields on web pages using your voice. For example, I went to my "Google+" page. In the new message input field, I right-clicked and selected "SpeechPad". The input field turns pink to show that you can dictate your text. (image. 14)

Google Voice Search lets you search by voice. When you install and activate this extension, a microphone symbol appears in the search bar. When you click it, the symbol appears inside a large red circle. Just say your search phrase, and the search results for it will appear. (image. 15)

Important note: for the microphone to work with Chrome extensions, you need to allow microphone access in your browser settings. It is disabled by default for security reasons. Go to Settings → Personal information → Content settings. (To reach all the settings, click Show additional settings at the end of the list.) The Page content settings dialog box will open. Further down the list, select Multimedia → Microphone.

6. Results of working with Russian speech recognition programs

My brief experience with voice text input programs has shown that this feature is implemented excellently on Google's servers. Words are recognized correctly without any preliminary training. This suggests that the problem of Russian speech recognition has been solved.

Google's results will now be the benchmark for evaluating products from other manufacturers. I would like the recognition system to work offline, without accessing the company’s servers; that would be more convenient and faster. But it is unknown when an independent program for working with a continuous stream of Russian speech will be released. It is reasonable to assume, however, that with the ability to be trained, such a “creation” would be a real breakthrough.

I will cover the programs by Russian developers, "Gorynych", "Dictographer" and "Combat", in detail in the second part of this review. This article was written very slowly because original discs are now hard to find. At the moment I have all versions of the Russian voice-to-text recognition engines except “Combat 2.52”. None of my friends or colleagues has this program, and all I have are a few laudatory reviews from the forums. True, there was one strange option, downloading “Combat” via SMS, but I don’t like it. (image16)

A short video clip shows how speech recognition works on an Android smartphone. The peculiarity of voice typing is that it needs to connect to Google's servers, so your Internet connection must be working.

As we already found out in the first chapter, speech recognition programs are very relevant today and are widely used in everyday life. The two main problems of machine speech recognition - achieving guaranteed accuracy with a limited set of commands for at least one fixed voice and diction-independent recognition of arbitrary continuous speech with acceptable quality - have not yet been solved, despite the long history of their development. Moreover, there are doubts about the fundamental possibility of solving both problems, since even a person cannot always fully recognize the speech of his interlocutor. Let's look at some products in this area in Table 3.

Table 2

Comparative characteristics of the products “ABBYY FlexiCapture” and “CORRECT. Automation of document entry and processing"

Program	Possibilities	System Requirements
ABBYY FlexiCapture	Automates the extraction of information from paper documents and stores the data in the enterprise information system.	OS: Windows XP SP2, Vista SP2, 7, Server 2003 SP2, Server 2008 SP2 or R2 + Desktop Experience. Computer requirements: PC with a processor of the Intel Core2/2 Quad/Pentium/Celeron/Xeon/Core i5/Core i7, AMD K6/Turion/Athlon/Duron/Sempron families, clock frequency 2 GHz or higher. Requirements for installed software: .NET Framework 2.0 or higher if .NET scripting is used. Additional requirements: Internet connection to activate the serial number, USB port for a hardware security key. Price information is available when ordering. You can order a trial version.
CORRECT. Automation of document entry and processing	A solution for automated processing of primary accounting documentation based on ABBYY FlexiCapture using outsourcing.	OS: Windows XP SP2, Vista SP2, 7, Server 2003 SP2, Server 2008 SP2 or R2 + Desktop Experience. Computer requirements: PC with a processor of the Intel Core2/2 Quad/Pentium/Celeron/Xeon/Core i5/Core i7, AMD K6/Turion/Athlon/Duron/Sempron families, clock frequency 2 GHz or higher; RAM: 512 MB for each processor core, but not less than 1 GB; disk space: 1 GB, of which 700 MB for installation; scanner with TWAIN, WIA or ISIS support; Internet connection to activate the serial number, USB port for a hardware security key; video card and monitor with a resolution of at least 1024×768; keyboard, mouse or other pointing device. Price information is available when ordering.

Table 3

Comparative characteristics of programs for voice input

Program	Available on	Features of the program
Yandex.Dictation	iPhone, iPad and Android	- Voice activation. To start recording, just say “Yandex, record.” - Speech recognition. You speak, and the application turns your speech into text. - Voice control. You can edit the text using commands, for example "Remove the last word", "Start with new line", "Add a funny smiley face." Yandex.Dictation not only recognizes words but also understands their meaning, so the list of commands is unlimited. - Arrangement of punctuation marks. The application uses pauses in speech to place punctuation marks on its own. - Speech synthesis
RealSpeaker	Windows 7 and 8; development of an Android application has begun	“Download RealSpeaker for free and you can enter text of any length using your voice in any text editor (notepad, MS Word, Skype, VKontakte, Facebook, etc.) in any of eleven languages,” the project website states. The declared system requirements for RealSpeaker are quite modest: a computer with a front camera and microphone, Internet access, Windows 7 or 8.
Gorynych 5.0 Dict Light	Microsoft Windows Me/2000/XP	Very simple and user-friendly interface. Quick and easy microphone setup. Ability to add your own words to the dictionary. Training on words directly during dictation.
VoiceType Dictation	Integrates into many different applications, primarily Microsoft Word	Built-in active dictionary. When selecting and assigning commands, you should remember that VOICETYPE has a mode in which the program automatically types as text everything that is not stored as a voice analogue of a system command. Therefore, if you use similar-sounding expressions, VOICETYPE will most likely start to stumble, which will ruin everything. The second, rather serious problem with VOICETYPE is the built-in self-learning module. If the program decides that it has correctly recognized a word or expression as text but has not fully grasped the individual subtleties of your pronunciation, it may “ask” the user to repeat the word a couple of times and then overwrite a perfectly correct fragment. With poor pronunciation you can ruin everything completely, since VOICETYPE DICTATION may confuse everything.

From the data in Table 3 it follows that voice input programs are widespread not only on computers but also on smartphones. All the programs listed in the table are easily accessible and easy to use, and all of them can be obtained free of charge.
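The voice editing commands listed for Yandex.Dictation in Table 3 can be illustrated with a toy example. The sketch below is not how Yandex.Dictation works internally (the table says it understands the meaning of commands, which goes well beyond this); it simply matches two of the quoted commands literally and treats everything else as dictated text. The function names and the token-list representation are assumptions made for the example.

```python
from typing import List

NEWLINE = "\n"


def apply_utterance(tokens: List[str], utterance: str) -> List[str]:
    """Apply one recognized utterance to the token buffer and return it."""
    command = utterance.strip().lower().rstrip(".!?")
    if command == "remove the last word":
        # Drops the last token, which may also be a line break.
        if tokens:
            tokens.pop()
    elif command == "start with new line":
        tokens.append(NEWLINE)
    else:
        # Anything that is not a known command is treated as dictated text.
        tokens.extend(utterance.split())
    return tokens


def render(tokens: List[str]) -> str:
    """Join the tokens into text, starting a new line at each NEWLINE token."""
    lines: List[List[str]] = [[]]
    for token in tokens:
        if token == NEWLINE:
            lines.append([])
        else:
            lines[-1].append(token)
    return "\n".join(" ".join(line) for line in lines)
```

Dictating "Hello dear world", then "Remove the last word", "Start with new line" and "How are you" leaves the text "Hello dear" on the first line and "How are you" on the second. A literal match like this also shows the weakness the table mentions for VOICETYPE: a dictated phrase that happens to sound like a command is executed instead of typed.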

Despite all the achievements of recent years, continuous speech recognition tools still make a large number of errors, require time-consuming setup, are demanding on hardware and user skill, and fail to work in noisy rooms, although noise tolerance matters for noisy offices, mobile systems and operation over telephone lines.

However, speech recognition, like machine translation from one language to another, belongs to the so-called cult computer technologies, to which special attention is paid. Interest in these technologies is constantly fueled by countless works of science fiction writers, so constant attempts to create a product that should correspond to our ideas about the technologies of tomorrow are inevitable. And even those projects that, in their essence, do not represent anything, are often quite commercially successful, since the consumer is keenly interested in the very possibility of such implementations, even regardless of whether he can apply it in practice.