
Speech recognition for dummies. There are many options for practical application. What are we going to do?

As we already found out in the first chapter, speech recognition programs are very relevant today and are widely used in everyday life. The two main problems of machine speech recognition - achieving guaranteed accuracy with a limited set of commands for at least one fixed voice and diction-independent recognition of arbitrary continuous speech with acceptable quality - have not yet been solved, despite the long history of their development. Moreover, there are doubts about the fundamental possibility of solving both problems, since even a person cannot always fully recognize the speech of his interlocutor. Let's look at some products in this area in Table 3.

Table 2

Comparative characteristics of the products "ABBYY FlexiCapture" and "CORRECT. Automation of document entry and processing"

Program: ABBYY FlexiCapture
Capabilities: Automates the extraction of information from paper documents and transfers the data into the enterprise's information systems.
System requirements: OS: Windows XP SP2, Vista SP2, 7, Server 2003 SP2, Server 2008 SP2 or R2 with Desktop Experience. Hardware: a PC with a processor from the Intel Core 2/Core 2 Quad/Pentium/Celeron/Xeon/Core i5/Core i7 or AMD K6/Turion/Athlon/Duron/Sempron families, clock frequency 2 GHz or higher. Installed software: .NET Framework 2.0 or higher if .NET scripting is used. Additional requirements: an Internet connection to activate the serial number and a USB port for the hardware security key.
Pricing: Price information is available on request. A trial version can be ordered.

Program: CORRECT. Automation of document entry and processing
Capabilities: A solution for automated processing of primary accounting documentation, based on ABBYY FlexiCapture and delivered as an outsourced service.
System requirements: OS: Windows XP SP2, Vista SP2, 7, Server 2003 SP2, Server 2008 SP2 or R2 with Desktop Experience. Hardware: a PC with a processor from the Intel Core 2/Core 2 Quad/Pentium/Celeron/Xeon/Core i5/Core i7 or AMD K6/Turion/Athlon/Duron/Sempron families, clock frequency 2 GHz or higher. RAM: 512 MB per processor core, but not less than 1 GB. Disk space: 1 GB, of which 700 MB is needed for installation. Other: a scanner with TWAIN, WIA or ISIS support; an Internet connection to activate the serial number; a USB port for the hardware security key; a video card and monitor with a resolution of at least 1024×768; a keyboard, mouse or other pointing device.
Pricing: Price information is available on request.

Table 3

Comparative characteristics of programs for voice input

Program: Yandex.Dictation
Available on: iPhone, iPad and Android
Features:
  • Voice activation. To start recording, just say "Yandex, record."
  • Speech recognition. You speak, and the application turns your speech into text.
  • Voice control. You can edit the text using commands, for example "Remove the last word", "Start a new line", "Add a funny smiley face." Yandex.Dictation not only recognizes words but also understands their meaning, so the list of commands is unlimited.
  • Placement of punctuation marks. The application relies on pauses in speech and places punctuation marks on its own.
  • Speech synthesis.

Program: RealSpeaker
Available on: Windows 7 and 8; development of an Android application has begun
Features: "Download RealSpeaker for free, and you will be able to enter text of any length using your voice into any text editor (Notepad, MS Word, Skype, VKontakte, Facebook, etc.) in any of eleven languages," states the project website. At the same time, RealSpeaker's system requirements are declared to be quite modest: a computer with a front camera and a microphone, Internet access, and Windows 7 or 8.

Program: Gorynych 5.0 Dict Light
Available on: Microsoft Windows Me/2000/XP
Features:
  • Very simple and user-friendly interface.
  • Quick and easy microphone setup.
  • Ability to add your own words to the dictionary.
  • Practicing words directly during dictation.
  • Integration into many different applications, primarily Microsoft Word.
  • Built-in active dictionary.

When selecting and assigning commands, keep in mind that VOICETYPE has a mode in which the program automatically types as text everything that is not stored as the voice analogue of a system command. So if you use similar-sounding expressions, VOICETYPE will most likely start to stumble, which ruins the whole thing. The second rather serious problem with VOICETYPE is its built-in self-learning module. If the program decides that it has recognized a word or expression correctly (in the sense of its text equivalent) but has not fully captured your individual subtleties of pronunciation, it may "ask" you to repeat the word a couple of times and overwrite a perfectly correct fragment. With poor pronunciation you can ruin everything completely, since VOICETYPE DICTATION can mix it all up.

The data in Table 3 show that voice input programs are widespread not only on computers but also on smartphones. All the programs listed in the table are easily accessible and straightforward to use, and all of them can be obtained free of charge.

Despite all the achievements of recent years, continuous speech recognition tools still make a large number of errors, require lengthy setup, are demanding of hardware and user qualifications, and refuse to work in noisy rooms, although the last point matters both for busy offices and for mobile systems and operation over telephone channels.

However, speech recognition, like machine translation from one language to another, belongs to the so-called iconic computer technologies, which receive special attention. Interest in these technologies is constantly fueled by countless works of science fiction, so attempts to create a product that matches our ideas about the technologies of tomorrow are inevitable. And even projects that in essence offer nothing substantial are often quite successful commercially, since consumers are keenly interested in the very possibility of such implementations, regardless of whether they can apply them in practice.


What does the fantastic idea of talking to a computer have to do with professional photography? Almost nothing, unless you are a fan of the idea of endlessly developing the entire technical environment around man. Imagine for a moment that you give voice commands to your camera to change the focal length and apply exposure compensation of plus half a stop. Remote control of the camera has already been implemented, but there you have to silently press buttons, whereas here the camera would hear you!

It has become a tradition to cite some science fiction film as an example of voice communication between a person and a computer, for example Stanley Kubrick's "2001: A Space Odyssey". There the on-board computer not only conducts a meaningful dialogue with the astronauts but can even read lips, like a deaf person. In other words, the machine has learned to recognize human speech without errors. Perhaps to some, remote voice control of a camera seems superfluous, but many would like to just say "Take us down, baby" and have the photo of the whole family against the background of a palm tree ready.

Well, I have paid tribute to tradition and dreamed a little. But, to be honest, this article was difficult to write, and it all started with a gift in the form of a smartphone running Android 4. This HUAWEI U8815 has a small four-inch touchscreen and an on-screen keyboard. It is a little unusual to type on it, but as it turns out, that is not particularly necessary. (image01)

1. Voice recognition in a smartphone running Android OS

While mastering the new toy, I noticed a microphone icon in the Google search bar and on the keyboard in Notes. Previously, I had not been interested in what this symbol meant. I had conversations on Skype and typed letters on the keyboard, which is what most Internet users do. But, as was later explained to me, voice search in Russian had been added to the Google search engine, and programs had appeared that let you dictate short messages in the Chrome browser.

I said a phrase of three words, the program identified them and showed them in a cell with a blue background. There was something to be surprised about here, because all the words were written correctly. If you click on this cell, the phrase appears in the text field of the Android notepad. So I said a couple more phrases and sent a message to the assistant via SMS.


2. A brief history of voice recognition programs.

It was not a discovery for me that modern advances in voice control make it possible to give commands to household appliances, cars and robots. Command mode was introduced in past versions of Windows, OS/2 and Mac OS. I have come across talking programs, but what is the use of them? Perhaps it is a peculiarity of mine that it is easier for me to speak than to type on a keyboard, but on a cell phone I cannot type anything at all. I have to enter contacts on a laptop with a normal keyboard and transfer them via USB cable. But to simply speak into a microphone and have the computer type the text itself, without errors, was a dream for me. The atmosphere of hopelessness was maintained by discussions on the forums, where the same sad thought appeared everywhere:

“However, in reality, to date, programs for real speech recognition (and even in Russian) practically do not exist, and they will obviously not be created soon. Moreover, even the inverse problem of recognition - speech synthesis, which, it would seem, is much simpler than recognition, has not been fully solved." (ComputerPress No. 12, 2004)

“There are still no normal speech recognition programs (not just Russian) because the task is quite difficult for a computer. And the worst thing is that the mechanism of word recognition by humans has not yet been realized, so there is nothing to start from when creating recognition programs.” (Another discussion on the forum).

At the same time, reviews of English-language voice text entry programs indicated clear successes. For example, IBM ViaVoice 98 Executive Edition had a basic vocabulary of 64,000 words and the ability to add the same number of your own words. The percentage of word recognition without training the program was about 80% and with subsequent work with a specific user reached 95%.

Among Russian-language recognition programs, it is worth noting "Gorynych", an add-on to the English-language Dragon Dictate 2.5. I will tell you about the search for it, and then about the "battle with the five Gorynyches", in the second part of the review. The first one I found was the "English dragon".

3. Continuous speech recognition program “Dragon Naturally Speaking”

A modern version of the program from Nuance ended up with an old friend of mine from the Minsk Institute of Foreign Languages. She brought it back from a trip abroad, having bought it in the hope that it could become a "computer secretary". But something did not work out, and the program remained on the laptop, almost forgotten. Since she had no real experience with it, I had to go and see my friend myself. All this lengthy introduction is needed for a correct understanding of the conclusions I have drawn.

The full name of my first dragon was: . The program is in English, and everything in it is clear even without a manual. The first step is to create a profile for a specific user in order to capture the peculiarities of how words sound as that person pronounces them. That is what I did: the speaker's age, country and pronunciation features matter. My choices were as follows: age 22-54, UK English, standard pronunciation. Next come several windows for configuring the microphone. (image04)

The next stage for serious speech recognition programs is training on the pronunciation features of a particular person. You are asked to choose the nature of the text: my choice was brief dictation instructions, but you can also "order" a humorous story.

The essence of this stage is extremely simple: text is displayed in the window with a yellow arrow above it. When you pronounce it correctly, the arrow moves through the phrases, and at the bottom there is a training progress bar. I had pretty much forgotten my conversational English, so I made progress with difficulty. Time was also limited: the computer was not mine, and I had to interrupt the training. But my friend said she completed it in less than half an hour. (image05)

Having declined further adaptation of the program to my pronunciation, I went to the main window and launched the built-in text editor. I spoke individual words from some texts I found on the computer. The program printed the words I pronounced correctly and replaced the ones I pronounced poorly with something "English". When I clearly pronounced the command "erase line" in English, the program executed it. This means that I read the commands correctly and the program recognizes them without prior training.

But what mattered to me was how this "dragon" writes in Russian. As you understood from the previous description, during training you can only select an English text; there is simply no Russian one. Clearly, it will not be possible to train it to recognize Russian speech. The next photo shows the phrase the program typed when I pronounced the Russian word for "Hello". (image06)

The outcome of my conversation with the first dragon turned out to be slightly comical. If you read the text on the official website carefully, you can see the English "specialization" of this software product. In addition, when the program loads, we read "English" in its window. So why was all this necessary? Clearly, forums and rumors are to blame...

But there was also some useful experience. My friend asked me to look at the condition of her laptop, which had somehow started to run slowly. This is not surprising: the system partition had only 5% free space. While deleting unnecessary programs, I saw that the official version took up more than 2.3 GB. This figure will be useful to us later. (image.07)



4. The Russian-language "Gorynych" program

Recognizing Russian speech, as it turned out, was a non-trivial task. In Minsk I managed to get hold of "Gorynych" from an acquaintance. He spent a long time looking for the disc among his old junk and, according to him, it is an official release. The program installed instantly, and I found out that its dictionary contains 5,000 Russian words plus 100 commands, and 600 English words plus 31 commands.

First you need to set up the microphone, which I did. Then I opened the dictionary and added the word "examination" because it was not in the program dictionary. I tried to speak clearly and monotonously. Finally, I opened the Gorynych Pro 3.0 program, turned on the dictation mode and received this list of “close-sounding words.” (image.09)

The result puzzled me, because it was clearly worse than what the Android smartphone produced, so I decided to try other programs from the Google Chrome online store and put off dealing with the "Gorynych serpents" until later; procrastination in the true Russian spirit.

5. Google's voice capabilities

To work with voice on a regular Windows computer, you will need to install the Google Chrome browser. If you work online in it, you can click the link to the software store at the bottom right. There, completely free, I found two programs and two extensions for voice text input. The programs are called "Voice Notepad" and "VoiceNote - voice to text". After installation, they can be found on the Applications tab of your Chrome browser. (image. 10)

The extensions are called "Google Voice Search Hotword (Beta) 0.1.0.5" and "Voice text input - Speechpad.ru 5.4". After installation, they can be disabled or removed on the Extensions tab. (image. 11)

VoiceNote. On the Applications tab in the Chrome browser, double-click the program icon. A dialog box will open, as in the picture below. Click the microphone icon and speak short phrases into the microphone. The program transmits your words to the speech recognition server and types the text in the window. All the words and phrases shown in the illustration were typed on the first attempt. Obviously, this method only works when there is an active Internet connection. (image. 12)

Voice Notepad. If you launch the program from the Applications tab, it opens the Speechpad.ru web page in a new tab. There you will find detailed instructions on how to use the service, as well as a compact form; the latter is shown in the illustration below. (image. 13)

Voice text input allows you to fill in text fields on web pages using your voice. For example, I went to my Google+ page. In the new message input field, I right-clicked and selected "SpeechPad". The input window, highlighted in pink, indicates that you can dictate your text. (image. 14)

Google Voice Search lets you search by voice. When you install and activate this extension, a microphone symbol appears in the search bar. When you press it, the symbol appears in a large red circle. Just say your search phrase, and it will appear in the search results. (image. 15)

Important note: for the microphone to work with Chrome extensions, you need to allow microphone access in the browser settings; it is disabled by default for security reasons. Go to Settings→Personal information→Content settings (to see all the settings at the end of the list, click "Show advanced settings"). In the "Page content settings" dialog box that opens, find the Multimedia→Microphone item in the list.

6. Results of working with Russian speech recognition programs

My brief experience with voice text input programs has shown an excellent implementation of this feature on Google's servers. Without any preliminary training, words are recognized correctly, which indicates that the problem of Russian speech recognition has been solved.

Now we can say that the results of Google's work will become a new benchmark for evaluating products from other manufacturers. I would like the recognition system to work offline, without contacting the company's servers: that would be more convenient and faster. But it is unknown when a standalone program for working with a continuous stream of Russian speech will be released. It is worth assuming, however, that with the ability to be trained, such a "creation" will be a real breakthrough.

I will cover the programs of Russian developers, "Gorynych", "Dictographer" and "Combat", in detail in the second part of this review. This article was written very slowly because it is now difficult to find the original discs. At the moment I have all versions of the Russian voice-to-text recognition engines except "Combat 2.52". None of my friends or colleagues have this program, and all I have found are a few laudatory reviews on forums. True, there was one strange option: downloading "Combat" via SMS, but I do not like that. (image16)


A short video clip shows how speech recognition works in a smartphone with Android OS. A peculiarity of voice typing is the need to connect to Google's servers, so your Internet connection must be working.

Yes, but things are still there.
I.A. Krylov. Fable "Swan, Pike and Crayfish"

The two main tasks of machine speech recognition - achieving guaranteed accuracy with a limited set of commands for at least one fixed voice and diction-independent recognition of arbitrary continuous speech with acceptable quality - have not yet been solved, despite the long history of their development. Moreover, there are doubts about the fundamental possibility of solving both problems, since even a person cannot always fully recognize the speech of his interlocutor.

Once upon a time, the possibility of a normal conversation with a computer seemed so obvious and natural to science fiction writers that the first computers, devoid of a voice interface, were perceived as something inferior.

It would seem, why not solve this problem programmatically, using “smart” computers? After all, there seem to be manufacturers of such products, and the power of computers is constantly growing, and technologies are improving. However, advances in automatic speech recognition and conversion to text seem to be at the same level as they were 20-40 years ago. I remember that back in the mid-90s, IBM confidently announced the presence of such tools in OS/2, and a little later Microsoft joined in the implementation of similar technologies. Apple also tried to work on speech recognition, but in early 2000 it officially announced its abandonment of this project. IBM (Via Voice) and Philips continue to work in this area, and IBM not only built the speech recognition function into its OS/2 operating system (now sunk into oblivion), but also still produces it as a separate product. The Via Voice continuous speech recognition package (http://www-306.ibm.com/software/voice/viavoice) from IBM was distinguished by the fact that it recognized up to 80% of words from the very beginning, even without training. During training, the probability of correct recognition increased to 95%, and in addition, in parallel with setting up the program for a specific user, the future operator mastered the skills of working with the system. Now there are rumors that similar innovations will be implemented as part of Windows XP, although the head and founder of the corporation, Bill Gates, has repeatedly stated that he considers speech technologies not yet ready for mass use.

Once upon a time, the American company Dragon Systems created probably the first commercial speech recognition system, Naturally Speaking Preferred, which worked back in 1982 on an IBM PC (not even an XT!). True, that program was more like a game, and the company made no serious progress afterward; by 2000 it went bankrupt, and its latest version of Dragon Dictate Naturally Speaking was sold to Lernout & Hauspie Speech Products (L&H), which was also one of the leaders in systems and methods for speech recognition and synthesis (Voice Xpress). L&H, in turn, also went bankrupt, with its assets and property sold off (incidentally, Dragon Systems was sold for almost 0.5 billion dollars, while L&H went for only 10 million, so on this scale what impresses is not progress in the field but regression!). The technologies of L&H and Dragon Systems passed to ScanSoft, a company previously engaged in optical image recognition (it now owns some well-known text recognition programs such as OmniPage), but it seems no one there is seriously pursuing speech recognition.

The Russian company Cognitive Technologies, which has achieved significant success in character recognition, announced in 2001 a joint project with Intel to create Russian speech recognition systems; a speech corpus of the Russian language, RuSpeech, was prepared for Intel. RuSpeech is a speech database containing fragments of continuous Russian speech with the corresponding texts, phonetic transcription and additional information about the speakers. Cognitive Technologies set itself the goal of creating a "speaker-independent" continuous speech recognition system, with a speech interface consisting of a dialogue scenario system, text-to-speech synthesis and a speech command recognition system.

However, in fact, to date, programs for real speech recognition (and even in Russian) practically do not exist, and they will obviously not be created soon. Moreover, even the inverse problem of recognition—speech synthesis, which would seem to be much simpler than recognition—has not been fully solved. Any synthesized speech is perceived by a person worse than live speech, and this is especially noticeable when transmitted over a telephone channel, that is, exactly where it is most in demand today.

“That’s it, you’re finished,” said Ivan Tsarevich, looking straight into the eyes of the third head of the Serpent Gorynych. She looked at the other two in confusion. They grinned maliciously in response.

Joke

In 1997, the famous "Gorynych" (essentially an adaptation of Dragon Dictate Naturally Speaking carried out by the then little-known Russian company White Group, the official distributor of Dragon Systems) entered the commercial market and became something of a sensation. The program seemed quite workable, and its price seemed very reasonable. However, time goes by, the "Gorynyches" change interfaces and versions, but acquire no new valuable properties. Perhaps the core of Dragon Naturally Speaking was somehow tuned to the peculiarities of English speech, but even after the dragon's head was successively replaced by three Gorynych heads, it recognizes no more than 30-40% of an average vocabulary, and only with careful pronunciation. And who needs that anyway? As is known, according to the developers at Dragon Systems, IBM and Lernout & Hauspie, their programs could correctly recognize up to 95% of text during continuous dictation, but they have not been produced for a long time, since it is known that for comfortable work recognition accuracy must be raised to 99%. Needless to say, achieving such heights in real conditions requires, to put it mildly, considerable effort.

In addition, the program requires a long period of training and customization for a specific user, is very capricious about hardware, is more than sensitive to intonation and the speed of pronunciation, and its ability to be trained to recognize different voices varies greatly.

However, maybe someone will buy this package as a kind of advanced toy, but it will not help fingers tired of working with the keyboard, even though the Gorynych manufacturers claim a speed of entering speech and converting it to text of 500-700 characters per minute, unattainable even by several experienced typists working together.

Upon closer inspection of the new version, we were unable to extract anything useful from this program. Even after lengthy "training" (and the standard dictionary did not help us at all), it turned out that dictation must still be carried out strictly word by word (that is, you need to pause after each word) and the words must be pronounced clearly, which is not always typical of natural speech. Of course, "Gorynych" is a modification of an English-language system, and for English a different approach is simply unthinkable, but speaking Russian in such a manner seemed especially unnatural to us. Besides, during a normal conversation in any language the sound intensity almost never drops to zero (this can be seen from spectrograms), whereas commercial programs learned to recognize dictation of texts on general topics delivered as continuous speech 5-10 years ago.

The system is focused primarily on input, but it contains tools for correcting a misheard word, for which Gorynych offers a list of options. You can also correct the text from the keyboard, which, by the way, is what you have to do all the time. Words that are not in the dictionary can likewise be entered from the keyboard. I remember that previous versions claimed that the more often you dictate, the more the system gets used to your voice, but we noticed no such thing either then or now. It even seemed to us that working with Gorynych is harder than, say, teaching a parrot to talk, and among the new features of version 3.0 we can note only a more "pop" multimedia interface.

In a word, there is only one sign of progress in this area: thanks to increased computer power, the delay between pronouncing a word and the appearance of its written version on the screen has disappeared completely; the number of correct hits, alas, has not increased.

Analyzing the capabilities of the program, we increasingly incline to the expert opinion that linguistic analysis of the text is a mandatory stage of automatic dictation. Without it, modern recognition quality cannot be achieved, and many experts link the prospects of speech systems to the further development of the linguistic mechanisms they contain. As a result, speech technologies become increasingly dependent on the language they work with. This means, firstly, that recognition, synthesis and processing of Russian speech is something Russian developers should do, and secondly, that only specialized domestic products, initially focused specifically on the Russian language, will be able to truly solve this problem. It should be noted here, though, that specialists at the St. Petersburg Center for Speech Technologies (TsRT) believe that creating their own dictation system would not pay off under current Russian conditions.

Other toys

So far, Russian developers have successfully applied speech recognition technologies mainly in interactive educational systems and games such as "My Talking Dictionary", Talk to Me or "Professor Higgins", created by IstraSoft. They are used to check the pronunciation of students of English and for user authentication. While developing the "Professor Higgins" program, IstraSoft employees learned to divide words into elementary segments that correspond to speech sounds and depend neither on the speaker nor on the language (previously, speech recognition systems did not perform such segmentation, and the smallest unit for them was the word). The selection of phonemes from a stream of continuous speech, their encoding and subsequent restoration happen in real time. This speech recognition technology has found a rather ingenious application: it makes it possible to significantly compress files with voice recordings or voice messages. The method proposed by IstraSoft allows speech to be compressed by a factor of 200, and at compression of less than 40 times the quality of the speech signal practically does not deteriorate. Intelligent speech processing at the phoneme level is promising not only as a compression method but also as a step toward a new generation of speech recognition systems, because, theoretically, machine speech recognition, that is, its automatic representation as text, is precisely the extreme degree of compression of the speech signal.
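To put the quoted ratios into perspective, here is a back-of-the-envelope calculation; the 16 kHz, 16-bit mono PCM baseline is an assumption for illustration, not a figure from the article.

```python
# Rough bitrate arithmetic for the compression ratios quoted above.
# Assumption: uncompressed speech stored as 16 kHz, 16-bit mono PCM.
sample_rate_hz = 16_000
bits_per_sample = 16
pcm_bitrate = sample_rate_hz * bits_per_sample      # 256,000 bit/s

for ratio in (40, 200):
    compressed = pcm_bitrate / ratio
    print(f"{ratio}x compression -> {compressed / 1000:.1f} kbit/s")

# 40x  -> 6.4 kbit/s (comparable to low-bitrate speech codecs)
# 200x -> 1.3 kbit/s (close to the rate of a phoneme or text transcript)
```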

Today, in addition to training programs, IstraSoft offers on its website (http://www.istrasoft.ru/user.html) programs for compressing and playing back sound files, as well as a demo program for speaker-independent recognition of Russian commands, Istrasoft Voice Commander.

It would seem that very little remains to be done to build a recognition system on top of this new technology...

The St. Petersburg Center for Speech Technologies (TsRT), which has been working in this field since 1990, seems to have achieved some success. TsRT has in its arsenal a whole set of software and hardware tools designed for noise reduction and for improving the quality of audio, primarily speech, signals: computer programs, stand-alone devices, and DSP boards built into devices for recording or transmitting speech (we already wrote about this company in the article "How to improve speech intelligibility?" in No. 8'2004). The Center for Speech Technologies is known as a developer of noise reduction and sound editing tools: Clear Voice, Sound Cleaner, Speech Interactive Software, Sound Stretcher, etc. The company's specialists took part in restoring audio information recorded on board the sunken submarine "Kursk" and on crashed aircraft, as well as in the investigation of a number of criminal cases in which the content of speech recordings had to be established.

The Sound Cleaner speech noise reduction complex is a professional set of software and hardware designed to restore speech intelligibility and to clean up sound signals recorded in difficult acoustic conditions or transmitted over communication channels. This truly unique software product denoises and enhances the sound quality of live (i.e. real-time) or recorded audio and can help improve the intelligibility and textual transcription of low-quality speech recordings (including archival ones) made in difficult acoustic conditions.

Naturally, Sound Cleaner works more effectively against noise and distortion of a known nature, such as typical noise and distortion of communication and recording channels, the noise of rooms and streets, operating machinery, vehicles, household appliances, a voice "cocktail", slow music, electromagnetic interference from power systems, computers and other equipment, reverberation and echo. In principle, the more uniform and "regular" the noise, the more successfully the complex copes with it.

However, when information is recorded in two channels, Sound Cleaner can significantly reduce the impact of noise of any type. For example, it has two-channel adaptive filtering methods designed to suppress both broadband non-stationary interference (such as speech, radio or television broadcasts, hall noise, etc.) and periodic interference (vibration, mains hum, etc.). These methods rely on additional information about the properties of the interference, present in a reference channel, when extracting the useful signal.
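Sound Cleaner's internals are not public, so purely as a generic illustration of the reference-channel idea, here is a minimal least-mean-squares (LMS) adaptive filter in Python/NumPy: the filter learns how the interference in the reference channel leaks into the primary channel and subtracts its estimate, leaving the useful signal. The signals and parameters below are invented for the example.

```python
import numpy as np

def lms_cancel(primary, reference, taps=32, mu=0.01):
    """Two-channel adaptive noise cancellation with the LMS algorithm.

    primary   -- useful signal plus interference from the main microphone
    reference -- interference-only signal from the reference channel
    Returns the cleaned signal (the adaptive filter's error output).
    """
    w = np.zeros(taps)                       # adaptive filter weights
    out = np.zeros(len(primary))
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]      # most recent reference samples
        noise_estimate = w @ x               # filter's guess of the leaked noise
        e = primary[n] - noise_estimate      # what remains is the useful signal
        w += 2 * mu * e * x                  # LMS weight update
        out[n] = e
    return out

# Toy example: a 200 Hz "voice" tone buried in 50 Hz mains-like hum.
fs = 8000
t = np.arange(fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 200 * t)
hum = np.sin(2 * np.pi * 50 * t)
primary = speech + 0.8 * hum     # main channel: speech + leaked hum
reference = hum                  # reference channel: hum only
cleaned = lms_cancel(primary, reference)
```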

Since we are talking about speech recognition, we cannot fail to mention another TsRT development: a family of computer transcribers. Unfortunately, these are not yet programs for automatic speech recognition and conversion to text; rather, they are computer-based digital dictation recorders controlled from a specialized text editor. These devices are designed to speed up and simplify the documentation of recorded speech when preparing reports, minutes of meetings, negotiations, lectures and interviews; they are also used in paperless office work and in many other situations. Transcribers are simple and easy to use and are accessible even to non-professional operators, while typing speed increases two to three times for professional touch typists and five to ten times for non-professionals! In addition, mechanical wear on the tape recorder and tape is significantly reduced if the source is analog. Computer transcribers also have an interactive feature for checking the typed text against the corresponding audio track: the link between text and speech is established automatically, so that when you move the cursor to a passage of text you can instantly find and listen to the corresponding fragment of the speech signal. Intelligibility can be improved here both by slowing down playback without distorting the timbre of the voice and by repeatedly looping unintelligible fragments.
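The "slow down playback without distorting the timbre" feature described above is, in general terms, time-scale modification. Purely as an illustration (this is not TsRT's implementation), the widely used librosa library exposes a phase-vocoder-based time stretch; the file names are placeholders.

```python
import librosa
import soundfile as sf

# Load a speech recording at its native sample rate (placeholder file name).
y, sr = librosa.load("dictation.wav", sr=None)

# Stretch to 75% speed: playback becomes slower, but the pitch (and thus the
# timbre of the voice) stays roughly the same, unlike naive resampling.
slow = librosa.effects.time_stretch(y, rate=0.75)

sf.write("dictation_slow.wav", slow, sr)
```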

Of course, it is much easier to implement a program that can recognize only a limited, small set of control commands and symbols. This, for example, could be numbers from 0 to 9 in the phone, the words “yes”/“no” and monosyllabic commands to call the desired subscribers, etc. Such programs were the very first to appear and have long been used in telephony for voice dialing or selecting a subscriber.

Recognition accuracy usually increases after pre-tuning to the voice of a specific user; in this way speech recognition can work even when the speaker has a diction defect or an accent. Everything seems fine, but noticeable success in this area is visible only when the equipment or software is used individually by one or, at most, a few users, with an individual "profile" created for each of them.

In short, despite all the advances of recent years, continuous speech recognition tools still make a large number of errors, require time-consuming setup, are demanding of hardware and user qualifications, and refuse to work in noisy rooms, although the last point matters both for noisy offices and for mobile systems and operation over telephone channels.

However, speech recognition, like machine translation from one language to another, is one of the so-called iconic computer technologies that receive special attention. Interest in them is constantly fueled by countless works of science fiction, so attempts to create a product that matches our ideas about the technologies of tomorrow are inevitable. And even projects that in essence offer nothing substantial are often quite successful commercially, since consumers are keenly interested in the very possibility of such implementations, regardless of whether they can apply them in practice.

Man has always been attracted by the idea of controlling a machine in natural language. Perhaps this is partly due to man's desire to be ABOVE the machine, to feel superior, so to speak. But the main motivation is to simplify human interaction with artificial intelligence. Voice control in Linux has been implemented with varying degrees of success for almost a quarter of a century. Let's look into the matter and try to get on speaking terms with our OS.

The crux of the matter

Voice recognition systems for Linux have been around for a long time, and there are a great many of them. But not all of them handle Russian speech correctly, and some have been completely abandoned by their developers. In the first part of our review we will talk about speech recognition systems and voice assistants themselves, and in the second we will look at specific examples of their use on a Linux desktop.

It is necessary to distinguish between speech recognition engines themselves (which translate speech into text or commands), such as CMU Sphinx and Julius, plus the applications built on these two engines, and voice assistants, which have become popular with the spread of smartphones and tablets. The latter are rather a by-product of speech recognition systems, their further development and the practical application of all the successful ideas of voice recognition. There are still few of them for Linux desktops.

You need to understand that the speech recognition engine and the interface to it are two different things. This reflects a basic principle of Linux architecture: splitting a complex mechanism into simpler components. The hardest work falls on the shoulders of the engines, usually boring console programs that run unnoticed by the user. The user interacts mainly with the interface program. Creating an interface is not difficult, so developers focus their main efforts on developing open-source speech recognition engines.

What happened before

Historically, all speech processing systems in Linux developed slowly and in fits and starts. The reason is not the clumsiness of the developers but the high barrier to entry: writing system code for working with voice requires a highly qualified programmer. So before digging into speech systems in Linux, we need a small excursion into history. IBM once had a wonderful operating system, OS/2 Warp (Merlin), released back in September 1996. Besides its obvious advantages over all other operating systems of the time, OS/2 came with a very advanced speech recognition system, IBM ViaVoice. That was very cool for its day, considering that the OS ran on systems with a 486 processor and 8 MB of RAM (!).

As you know, OS/2 lost the battle to Windows, but many of its components continued to exist independently. One of these components was the same IBM ViaVoice, which turned into an independent product. Since IBM always loved Linux, ViaVoice was ported to this OS, which gave the brainchild of Linus Torvalds the most advanced speech recognition system of its time.

Unfortunately, the fate of ViaVoice did not turn out the way Linux users would have liked. The engine itself was distributed free of charge, but its sources remained closed. In 2003, IBM sold the rights to the technology to the Canadian-American company Nuance, which developed perhaps the most successful commercial speech recognition product, Dragon NaturallySpeaking, and is still alive today. That is almost the whole inglorious history of ViaVoice on Linux. During the short time that ViaVoice was free and available to Linux users, several interfaces were developed for it, such as Xvoice. However, that project has long been abandoned and is now practically inoperable.

INFO

The most difficult part of machine speech recognition is natural human language.

What do we have today?

Today things are much better. In recent years, after the Google Voice API was opened up, the situation with speech recognition systems in Linux has improved significantly and recognition quality has increased. For example, the Linux Speech Recognition project based on the Google Voice API shows very good results for Russian. All engines work in roughly the same way: first, sound from the microphone of the user's device enters the recognition system, after which the voice is either processed on the local device or sent to a remote server for further processing. The second option is better suited to smartphones and tablets, and this is exactly how the commercial engines Siri, Google Now and Cortana work.
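As a minimal sketch of the "send the recording to a remote server" variant described above (not tied to any particular engine from this article), the third-party SpeechRecognition package for Python can pass microphone audio to Google's web speech service. It assumes a working microphone, the PyAudio dependency and an Internet connection.

```python
# pip install SpeechRecognition pyaudio
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)   # calibrate for background noise
    print("Say something in Russian...")
    audio = recognizer.listen(source)             # record one utterance

try:
    # The audio is sent to Google's web speech service for recognition.
    text = recognizer.recognize_google(audio, language="ru-RU")
    print("Recognized:", text)
except sr.UnknownValueError:
    print("Speech was not understood")
except sr.RequestError as err:
    print("Could not reach the recognition service:", err)
```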

Of the variety of engines for working with the human voice, there are several that are currently active.

WARNING

Installing many of the described speech recognition systems is a non-trivial task!

CMU Sphinx

Much of the development of CMU Sphinx takes place at Carnegie Mellon University, and at various times both the Massachusetts Institute of Technology and the now defunct Sun Microsystems have worked on the project. The engine's sources are distributed under the BSD license and are available for both commercial and non-commercial use. Sphinx is not an end-user application but a set of tools that can be used to develop applications for end users. Sphinx is currently the largest speech recognition project. It consists of several parts:

  • Pocketsphinx, a small, fast program that processes sound using acoustic models, grammars and dictionaries;
  • Sphinxbase, a library required for Pocketsphinx to work;
  • Sphinx4, the recognition library itself;
  • Sphinxtrain, a program for training acoustic models (on recordings of the human voice).

The project is developing slowly but surely, and most importantly, it can be used in practice, not only on PCs but also on mobile devices. Moreover, the engine handles Russian speech very well. With straight hands and a clear head, you can set up Russian speech recognition with Sphinx to control home appliances or a smart home; in fact, you can turn an ordinary apartment into a smart home, which is what we will do in the second part of this review. Implementations of Sphinx exist for Android, iOS and even Windows Phone. Unlike the cloud approach, where the speech recognition work falls on the shoulders of Google ASR or Yandex SpeechKit servers, Sphinx works more accurately, faster and cheaper, and entirely locally. If you wish, you can teach Sphinx a Russian language model and a grammar of user queries. Yes, you will have to work a little during installation: setting up Sphinx voice models and libraries is not a task for beginners. And because the core of CMU Sphinx, the Sphinx4 library, is written in Java, you can include its code in your own speech recognition applications. Specific usage examples will be described in the second part of our review.
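As a rough sketch of local, offline recognition with Pocketsphinx (not the smart-home setup promised for the second part), the older pocketsphinx Python bindings provide a LiveSpeech helper that listens to the microphone and yields recognized phrases. The model paths are placeholders, and the exact API differs between pocketsphinx releases.

```python
# pip install pocketsphinx   (the pre-5.0 bindings that ship LiveSpeech)
from pocketsphinx import LiveSpeech

# Placeholder paths: point these at a Russian acoustic model, language model
# and pronunciation dictionary (for example, ones built from VoxForge data).
speech = LiveSpeech(
    hmm='model/ru/acoustic',   # acoustic model directory
    lm='model/ru/ru.lm',       # language model
    dic='model/ru/ru.dic',     # pronunciation dictionary
)

# Everything runs locally: no audio leaves the machine.
for phrase in speech:
    command = str(phrase).strip().lower()
    print("Heard:", command)
    if command == "свет":      # react to the word for "light" (placeholder action)
        print("Toggling the lights...")
```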

VoxForge

Let us especially highlight the concept of a speech corpus. A speech corpus is a structured set of speech fragments with software access to its individual elements; in other words, it is a collection of human voices in different languages. No speech recognition system can work without a speech corpus. It is difficult to create a high-quality open speech corpus alone or even with a small team, so a special project, VoxForge, collects recordings of human voices.

Anyone with Internet access can contribute to the creation of a speech corpus simply by recording and submitting a speech fragment. This can even be done by phone, but it is more convenient to use the website. Of course, in addition to the audio recording itself, the corpus must include additional information such as a phonetic transcription; without it, the recording is meaningless for a recognition system.
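To make the idea of a "structured set" concrete, here is a hypothetical sketch of what a single corpus entry might carry; the field names are invented for illustration and do not reflect VoxForge's actual storage format.

```python
from dataclasses import dataclass

@dataclass
class CorpusEntry:
    """One speech fragment in a hypothetical corpus layout."""
    audio_path: str       # e.g. "ru/speaker042/utt001.wav"
    transcript: str       # the exact words spoken
    phonetic: str         # phonetic transcription of the utterance
    speaker_id: str       # anonymized speaker label
    language: str         # e.g. "ru"
    sample_rate_hz: int   # recording sample rate

entry = CorpusEntry(
    audio_path="ru/speaker042/utt001.wav",
    transcript="включи свет",
    phonetic="f k l u ch i   s v e t",
    speaker_id="speaker042",
    language="ru",
    sample_rate_hz=16000,
)
print(entry.transcript)
```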


HTK, Julius and Simon

HTK, the Hidden Markov Model Toolkit, is a toolkit for research and development of speech recognition tools based on hidden Markov models, developed at the University of Cambridge under the patronage of Microsoft (Microsoft once bought this code from the commercial firm Entropic Cambridge Research Laboratory Ltd and then returned it to Cambridge together with a restrictive license). The project's sources are available to everyone, but using HTK code in products intended for end users is prohibited by the license.
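To give a flavour of what "hidden Markov model" means here (a generic textbook illustration, not HTK code), the forward algorithm below computes the probability that a tiny two-state HMM produced a sequence of observed acoustic symbols; in a real recognizer the states model phonemes and the observations are acoustic feature vectors.

```python
import numpy as np

# Toy HMM: two hidden states (think of two phoneme-like units) and
# three discrete observation symbols. All numbers are made up.
start = np.array([0.6, 0.4])              # initial state probabilities
trans = np.array([[0.7, 0.3],             # state transition matrix
                  [0.4, 0.6]])
emit = np.array([[0.5, 0.4, 0.1],         # P(observation | state)
                 [0.1, 0.3, 0.6]])

def forward(observations):
    """Return P(observation sequence) under the toy HMM."""
    alpha = start * emit[:, observations[0]]       # initialization
    for obs in observations[1:]:
        alpha = (alpha @ trans) * emit[:, obs]     # propagate, then emit
    return alpha.sum()

# Likelihood of "hearing" the symbol sequence 0, 2, 1:
print(forward([0, 2, 1]))
```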

However, this does not mean that HTK is useless for Linux developers: it can be used as an auxiliary tool when developing open-source (and commercial) speech recognition tools, which is exactly what the developers of the open-source Julius engine, developed in Japan, do. Julius works best with Japanese, but the great and mighty Russian language is not left out either, since the same VoxForge is used as the voice database.


No program can completely replace the manual work of transcribing recorded speech. However, there are solutions that can significantly speed up and facilitate the translation of speech into text, that is, simplify transcription.

Transcription is converting an audio or video recording into text form. There are paid tasks on the Internet in which the performer receives a certain amount of money for transcribing a text.

Speech-to-text conversion is useful for:

  • students, to turn recorded audio or video lectures into text;
  • bloggers who run websites and blogs;
  • writers and journalists, for writing books and articles;
  • information businessmen who need the text of their webinar, speech, etc.;
  • people who find typing difficult: they can dictate a letter and send it to family or friends;
  • other cases.

Below we describe the most effective tools available: PC programs, mobile applications and online services.

1 Website speechpad.ru

This is an online service that allows you to translate speech into text using the Google Chrome browser. The service works with a microphone and ready-made files. Of course, the quality will be much higher if you use an external microphone and dictate yourself. However, the service does a good job even with YouTube videos.

Click "Enable recording" and, when asked about using the microphone, click "Allow".

The long instructions about using the service can be collapsed by clicking on button 1 in Fig. 3. You can get rid of advertising by completing a simple registration.

Fig. 3. The Speechpad service

The finished result is easy to edit: you either correct the highlighted word manually or dictate it again. The results of your work are saved in your personal account; you can also download them to your computer.

The service's website offers a list of video lessons on working with Speechpad, including a video on audio transcription. You can also transcribe video from YouTube or from files on your computer, although for this you will need a virtual audio mixer.

The service works in seven languages. There is one small drawback: if you need to transcribe an existing audio file, its sound is played through the speakers, which creates extra interference in the form of an echo.

2 Service dictation.io

A wonderful online service that allows you to translate speech into text for free and easily.

Fig. 4. The dictation.io service

Item 1 in Fig. 4: the Russian language can be selected at the bottom of the page. In the Google Chrome browser the language can be chosen, but for some reason Mozilla does not offer this option.

It is noteworthy that auto-saving of the finished result has been implemented, which protects against accidental loss when a tab or the browser is closed. The service does not recognize existing files; it works only with a microphone. You need to name punctuation marks out loud when dictating.

The text is recognized quite correctly, there are no spelling errors. You can insert punctuation marks yourself from the keyboard. The finished result can be saved on your computer.

3 RealSpeaker

This program makes it easy to turn human speech into text. It is designed to work on different systems: Windows, Android, Linux and Mac. With it, you can convert speech spoken into a microphone (for example, one built into a laptop) as well as speech recorded in audio files.

It can understand 13 world languages. There is a beta version of the program that works as an online service: follow the link above, select Russian, upload your audio or video file and pay for its transcription. After transcription you can copy the resulting text. The larger the file, the more time it takes to process.

In 2017 there was a free transcription option in RealSpeaker, but in 2018 it is gone. It is very off-putting that the transcribed file is available for any user to download; perhaps this will be improved.

The developer's contacts (VKontakte, Facebook, YouTube, Twitter, e-mail, telephone) can be found on the program's website (more precisely, in the site footer).

4 Speechlogger

An alternative to the previous application, for mobile devices running Android. It is available for free in the app store.

The text is edited automatically and punctuation marks are added. Very convenient for dictating notes to yourself or making lists. As a result, the text will be of very decent quality.

5 Dragon Dictation

This is an application that is distributed free of charge for mobile devices from Apple.

The program can work with 15 languages. It allows you to edit the result and select the desired words from the list. You need to clearly pronounce all sounds, not make unnecessary pauses and avoid intonation. Sometimes there are mistakes in the endings of words.

Owners use the Dragon Dictation application, for example, to dictate a shopping list while walking around the apartment; once in the store, they can simply look at the text in the note instead of listening to a recording.

Whatever program you use in your practice, be prepared to double-check the results and make certain adjustments. This is the only way to get a flawless text without errors.

