Team Reach
Measuring Accentedness

Overview

As we move into the future, the world around us will become much more automated. Whether it is self-service checkout or drones delivering packages, we will continue to move toward automated solutions. On this path toward automation, we have been fortunate to be tasked with a project that could change the way computers recognize speech. The problem proposed to us is to build upon previous program designs that analyze and score a person's English proficiency, in order to better help learners of the language. The goal of our project is to take human speech, analyze it, and tell the speaker how proficient their speech is. We will specifically focus on software that breaks speech down into units called phones, the basic building blocks of human speech. Our project will take previous software and integrate it with another system. This software will then take in an audio file and analyze it for a variety of phones in order to find specific speech features. The beauty of this approach is that the software does not rely on a particular language for its design, only for its analysis; because of this, the software we develop may in the future be applicable to all human languages.

We have been working closely with our client to find a better way of detecting phones. Most speech recognizers today use a Hidden Markov Model (HMM), a statistical approach, to detect phones with relatively high accuracy, and this is what our sponsor currently uses. More recently, research on Recurrent Neural Networks (RNNs), which are loosely modeled on the human brain, has shown promising results for phone and speech recognition. We will implement both a Recurrent Neural Network and a Hidden Markov Model so that we can see which one yields the better results.

Both methods will be implemented in the same way: first, we will train the respective model on the TIMIT training set; second, we will test the model on the TIMIT test set to measure its error rate; lastly, we will design further tests and run the system on data from outside the TIMIT set to ensure it works in our sponsor's use case of measuring speech proficiency. Each model created by our training component will be saved, which will let us maintain multiple models corresponding to different languages, dialects, and accents. Before data is passed into either model for training, it must be prepared; this involves converting the speech files to the correct format (WAV files) and the correct sample rate. We hope that our implementation will go smoothly and accomplish this set of goals, so that one day our software may be part of a speech recognition system applicable to all languages around the world and in a variety of fields.
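
To illustrate the preparation step, a minimal MATLAB sketch is shown below. The function name prepareAudio and the 16 kHz target rate (the rate used by TIMIT) are our own choices for the example; the exact conversion settings will be confirmed with our sponsor.

    % prepareAudio.m -- minimal sketch of the data-preparation step.
    % Converts an input recording to a mono WAV file at the target sample
    % rate so every file fed to the training component has the same format.
    function prepareAudio(inFile, outFile)
        targetFs = 16000;                  % TIMIT is distributed at 16 kHz

        [speech, fs] = audioread(inFile);  % read any format audioread supports

        if size(speech, 2) > 1
            speech = mean(speech, 2);      % mix multi-channel audio down to mono
        end

        if fs ~= targetFs
            speech = resample(speech, targetFs, fs);  % convert the sample rate
        end

        speech = max(min(speech, 1), -1);  % guard against resampling overshoot
        audiowrite(outFile, speech, targetFs);  % write a standard WAV file
    end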

Requirements

The functionality of this software system can be broken down into just a few requirements:

Read in an audio file and return a list of the English phones located in the sound file
Read in an audio file and return the locations of those English phones within the audio file
Calculate and achieve 80% accuracy using the TIMIT test data (created by speech experts)
The system must be re-trainable; in other words, the program must be able to be reset and trained with new audio files that were not used to train the previous iteration

The way these requirements are met is not important as long as the accuracy requirement is met. There is quite a bit of flexibility in the input and output formats, as long as the client is able to convert whichever format they currently have into the one required by the system. Ideally, however, such a conversion should not be required.
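
As a rough sketch of how the accuracy requirement could be checked, the function below compares predicted phone labels with the TIMIT reference labels frame by frame. The name phoneAccuracy and the simple frame-level metric are our own simplifications; the sponsor may prefer a standard phone error rate instead.

    % phoneAccuracy.m -- sketch of a simple check for the accuracy requirement.
    % predicted and reference are cell arrays of phone labels, one per frame,
    % and the return value is the fraction of frames whose labels match.
    function acc = phoneAccuracy(predicted, reference)
        assert(numel(predicted) == numel(reference), ...
            'Predictions and reference labels must cover the same frames.');
        matches = strcmp(predicted(:), reference(:));  % element-wise comparison
        acc = sum(matches) / numel(reference);         % 0.80 corresponds to 80%
    end

For example, phoneAccuracy({'sh','ih','iy'}, {'sh','ah','iy'}) returns 0.6667; the requirement would be met once this value reaches 0.80 on the TIMIT test data.
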
There also exist some mandatory environmental requirements:

The entire program must be developed/scripted in MATLAB
The program must be able to run on the average desktop computer
8 GB of RAM, the average specification of the research computers at the Applied Linguistics Speech Lab

These environmental requirements must be met to ensure that the program will be useful to the sponsors and can be used quickly and efficiently in the expected use-case scenario.

Finally, we have some optional/non-functional requirements:
We want the program to evaluate an audio file within 10 minutes
We want the program to process all training data within 24 hours
We want the trained program to fit on a 32 GB flash drive
We want all file formats to be standardized

These requirements are not strictly necessary, but meeting them would increase the likelihood that our sponsors use the program in the future. So, while or after developing the key functionality, it may be wise to work towards these requirements.
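
As a small illustration, a timing check like the one below could be used to verify the two runtime targets above. The function name checkRuntime and the function-handle argument are our own placeholders, not part of the required system.

    % checkRuntime.m -- sketch of how the optional timing targets might be
    % verified. task is a function handle for the routine being timed, and
    % limitSeconds is the target from the requirement (600 s to evaluate one
    % audio file, 86400 s to process all training data).
    function withinLimit = checkRuntime(task, limitSeconds)
        tic;
        task();                                   % run the routine being timed
        elapsed = toc;
        withinLimit = (elapsed <= limitSeconds);  % true if the target was met
        fprintf('Elapsed %.1f s (limit %.1f s)\n', elapsed, limitSeconds);
    end

For example, checkRuntime(@() recognizeFile('sample.wav'), 600) would time a single evaluation run, where recognizeFile stands in for whatever evaluation routine we end up writing.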