Class / Patent application number | Description | Number of patent applications / Date published |
704232000 | Neural network | 69 |
20080221878 | FAST SEMANTIC EXTRACTION USING A NEURAL NETWORK ARCHITECTURE - A system and method for semantic extraction using a neural network architecture includes indexing each word in an input sentence into a dictionary and using these indices to map each word to a d-dimensional vector (the features of which are learned). Together with this, position information for a word of interest (the word to be labeled) and a verb of interest (the verb that the semantic role is being predicted for) with respect to a given word are also used. These positions are integrated by employing a linear layer that is adapted to the input sentence. Several linear transformations and squashing functions are then applied to output class probabilities for semantic role labels. All the weights for the whole architecture are trained by backpropagation. | 09-11-2008 |
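As a rough illustration of the pipeline this abstract describes (word indices, learned d-dimensional embeddings, relative positions of the word and verb of interest, a linear layer with squashing, and softmax class probabilities), here is a minimal NumPy sketch. All names, dimensions, and the toy vocabulary are invented for illustration; the real architecture's trained weights and exact layer layout are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and sentence; purely illustrative.
vocab = {"the": 0, "cat": 1, "chased": 2, "mouse": 3}
sentence = ["the", "cat", "chased", "the", "mouse"]
d = 8                      # embedding dimension (features learned in training)
n_labels = 5               # number of semantic role labels

# Learned lookup table: each word index maps to a d-dimensional vector.
embeddings = rng.normal(size=(len(vocab), d))

def sentence_features(words, word_of_interest, verb_of_interest):
    """Embed each word and append its position relative to the word and verb of interest."""
    feats = []
    for i, w in enumerate(words):
        emb = embeddings[vocab[w]]
        pos = np.array([i - word_of_interest, i - verb_of_interest], dtype=float)
        feats.append(np.concatenate([emb, pos]))
    return np.stack(feats)          # shape: (len(words), d + 2)

def predict_roles(x, W1, W2):
    """Linear transformation, squashing function, then softmax class probabilities."""
    h = np.tanh(x @ W1)             # squashing
    logits = h @ W2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

x = sentence_features(sentence, word_of_interest=1, verb_of_interest=2)
W1 = rng.normal(size=(d + 2, 16))
W2 = rng.normal(size=(16, n_labels))
probs = predict_roles(x, W1, W2)    # one label distribution per word
```

In the full system these weights would be trained end-to-end by backpropagation rather than sampled randomly.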
20090106022 | System and method for learning a network of categories using prediction - An improved system and method is provided for efficiently learning a network of categories using prediction. A learning engine may receive a stream of characters and incrementally segment the stream of characters beginning with individual characters into larger and larger categories. To do so, a prediction engine may be provided for predicting a target category from the stream of characters using one or more context categories. Upon predicting the target category, the edges of the network of categories may be updated. A category composer may also be provided for composing a new category from existing categories in the network of categories, and a new category composed may then be added to the network of categories. Advantageously, iterative episodes of prediction and learning of categories for large scale applications may result in hundreds of thousands of categories connected by millions of prediction edges. | 04-23-2009 |
20090216528 | Method of adapting a neural network of an automatic speech recognition device - A method of adapting a neural network of an automatic speech recognition device, includes the steps of: providing a neural network including an input stage, an intermediate stage and an output stage, the output stage outputting phoneme probabilities; providing a linear stage in the neural network; and training the linear stage by means of an adaptation set; wherein the step of providing the linear stage includes the step of providing the linear stage after the intermediate stage. | 08-27-2009 |
20090259464 | System And Method For Facilitating Cognitive Processing Of Simultaneous Remote Voice Conversations - A system and method for facilitating cognitive processing of simultaneous remote voice conversations is provided. A plurality of remote voice conversations participated in by distributed participants are provided over a shared communication channel. A main conversation between at least two of the distributed participants and one or more subconversations between at least two other of the distributed participants are identified from within the remote voice conversations. Segments of interest to one of the distributed participants are defined including a conversation excerpt having a lower attention activation threshold for the one distributed participant. Each of the subconversations is parsed into conversation excerpts. The conversation excerpts are compared to the segments of interest. One or more gaps in the conversation flow of the main conversation are predicted. Segments of interest are selectively injected into the gaps of the main conversation as provided to the one distributed participant over the shared communication channel. | 10-15-2009 |
20090292538 | SYSTEMS AND METHODS OF IMPROVING AUTOMATED SPEECH RECOGNITION ACCURACY USING STATISTICAL ANALYSIS OF SEARCH TERMS - Systems and methods of improving speech recognition accuracy using statistical analysis of word or phrase-based search terms are disclosed. An illustrative system for statistically analyzing search terms includes an interface adapted to receive a text-based search term, a textual-linguistic analysis module that detects textual features within the search term and generates a first score, a phonetic conversion module that converts the search term into a phoneme string, a phonetic-linguistic analysis module that detects phonemic features within the phoneme string and generates a second score, and a score normalization module that normalizes the first and second scores and outputs a search term score to a user or process. | 11-26-2009 |
20100057452 | SPEECH INTERFACES - The described implementations relate to speech interfaces and in some instances to speech pattern recognition techniques that enable speech interfaces. One system includes a feature pipeline configured to produce speech feature vectors from input speech. This system also includes a classifier pipeline configured to classify individual speech feature vectors utilizing multi-level classification. | 03-04-2010 |
20100057453 | Voice activity detection system and method - Discrimination between at least two classes of events in an input signal is carried out in the following way. A set of frames containing an input signal is received, and at least two different feature vectors are determined for each of said frames. Said at least two different feature vectors are classified using respective sets of preclassifiers trained for said at least two classes of events. Values for at least one weighting factor are determined based on outputs of said preclassifiers for each of said frames. A combined feature vector is calculated for each of said frames by applying said at least one weighting factor to said at least two different feature vectors. Said combined feature vector is classified using a set of classifiers trained for said at least two classes of events. | 03-04-2010 |
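The weighting step this abstract describes (preclassifier outputs per feature vector determining a weighting factor, then a combined feature vector) can be sketched as follows. This is a minimal sketch under assumed details: the weighting rule (confidence ratio) and concatenation-style combination are illustrative choices, not the patent's specified formulas.

```python
import numpy as np

def combine_features(f1, f2, p1, p2):
    """Weight each of two different feature vectors for the same frame by its
    preclassifier's output score, then form a combined feature vector.

    f1, f2: two different feature vectors computed for one frame.
    p1, p2: preclassifier output scores (e.g. event-class posteriors) for each.
    """
    w = p1 / (p1 + p2)              # weighting factor from preclassifier outputs
    return np.concatenate([w * f1, (1.0 - w) * f2])

frame_f1 = np.array([0.2, 0.4, 0.1])    # e.g. energy-based features (invented values)
frame_f2 = np.array([0.7, 0.3, 0.9])    # e.g. spectral features (invented values)
combined = combine_features(frame_f1, frame_f2, p1=0.8, p2=0.2)
```

The combined vector would then go to the final set of classifiers trained for the event classes.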
20100217589 | Method for Automated Training of a Plurality of Artificial Neural Networks - The invention provides a method for automated training of a plurality of artificial neural networks for phoneme recognition using training data, wherein the training data comprises speech signals subdivided into frames, each frame associated with a phoneme label, wherein the phoneme label indicates a phoneme associated with the frame. A sequence of frames from the training data is provided, wherein the number of frames in the sequence of frames is at least equal to the number of artificial neural networks. Each of the artificial neural networks is assigned a different subsequence of the provided sequence, wherein each subsequence comprises a predetermined number of frames. A common phoneme label for the sequence of frames is determined based on the phoneme labels of one or more frames of one or more subsequences of the provided sequence. Each artificial neural network is then trained using the common phoneme label. | 08-26-2010 |
20110119057 | Neural Segmentation of an Input Signal and Applications Thereof - Disclosed are systems, methods, and computer-program products for segmenting content of an input signal and applications thereof. In an embodiment, the system includes simulated neurons, a phase modulator, and an entity-identifier module. Each simulated neuron is connected to one or more other simulated neurons and is associated with an activity and a phase. The activity and the phase of each simulated neuron are set based on the activity and the phase of the one or more other simulated neurons connected to that simulated neuron. The phase modulator includes individual modulators, each configured to modulate the activity and the phase of each of the plurality of simulated neurons based on a modulation function. The entity-identifier module is configured to identify one or more distinct entities (e.g., objects, sound sources, etc.) included in the input signal based on the one or more distinct collections of simulated neurons that have substantially distinct phases. | 05-19-2011 |
20110144986 | CONFIDENCE CALIBRATION IN AUTOMATIC SPEECH RECOGNITION SYSTEMS - Described is a calibration model for use in a speech recognition system. The calibration model adjusts the confidence scores output by a speech recognition engine to thereby provide an improved calibrated confidence score for use by an application. The calibration model is one that has been trained for a specific usage scenario, e.g., for that application, based upon a calibration training set obtained from a previous similar/corresponding usage scenario or scenarios. Different calibration models may be used with different usage scenarios, e.g., during different conditions. The calibration model may comprise a maximum entropy classifier with distribution constraints, trained with continuous raw confidence scores and multi-valued word tokens, and/or other distributions and extracted features. | 06-16-2011 |
20110307252 | Using Utterance Classification in Telephony and Speech Recognition Applications - Described is the use of utterance classification based methods and other machine learning techniques to provide a telephony application or other voice menu application (e.g., an automotive application) that need not use Context-Free-Grammars to determine a user's spoken intent. A classifier receives text from an information retrieval-based speech recognizer and outputs a semantic label corresponding to the likely intent of a user's speech. The semantic label is then output, such as for use by a voice menu program in branching between menus. Also described is training, including training the language model from acoustic data without transcriptions, and training the classifier from speech-recognized acoustic data having associated semantic labels. | 12-15-2011 |
20130138436 | DISCRIMINATIVE PRETRAINING OF DEEP NEURAL NETWORKS - Discriminative pretraining technique embodiments are presented that pretrain the hidden layers of a Deep Neural Network (DNN). In general, a one-hidden-layer neural network is trained first using labels discriminatively with error back-propagation (BP). Then, after discarding an output layer in the previous one-hidden-layer neural network, another randomly initialized hidden layer is added on top of the previously trained hidden layer along with a new output layer that represents the targets for classification or recognition. The resulting multiple-hidden-layer DNN is then discriminatively trained using the same strategy, and so on until the desired number of hidden layers is reached. This produces a pretrained DNN. The discriminative pretraining technique embodiments have the advantage of bringing the DNN layer weights close to a good local optimum, while still leaving them in a range with a high gradient so that they can be fine-tuned effectively. | 05-30-2013 |
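The greedy growth procedure this abstract describes (train a one-hidden-layer network with back-propagation, discard its output layer, stack a fresh hidden layer plus new output layer, retrain, repeat) can be sketched in NumPy. This is a toy sketch: the data, dimensions, and the simplified BP step (which updates only the top hidden layer and output layer) are invented stand-ins, not the patented training recipe.

```python
import numpy as np

rng = np.random.default_rng(1)

def new_layer(n_in, n_out):
    return rng.normal(scale=0.1, size=(n_in, n_out))

def forward(x, hidden, out):
    h = x
    for W in hidden:
        h = np.tanh(h @ W)          # hidden layers
    return h @ out                  # output layer (logits over targets)

def bp_train(x, y, hidden, out, steps=50, lr=0.1):
    """Discriminative training with error back-propagation; for brevity only the
    top hidden layer and the output layer are updated here."""
    for _ in range(steps):
        h = x
        for W in hidden[:-1]:
            h = np.tanh(h @ W)
        a = np.tanh(h @ hidden[-1])
        logits = a @ out
        e = np.exp(logits - logits.max(axis=1, keepdims=True))
        p = e / e.sum(axis=1, keepdims=True)
        g = (p - y) / len(x)                       # softmax cross-entropy gradient
        out -= lr * a.T @ g
        ga = g @ out.T * (1 - a**2)
        hidden[-1] -= lr * h.T @ ga
    return hidden, out

# Toy labeled data: 20 frames, 6 input features, 3 target classes.
x = rng.normal(size=(20, 6))
y = np.eye(3)[rng.integers(0, 3, size=20)]

hidden, out = [new_layer(6, 8)], new_layer(8, 3)
for _ in range(2):                                  # grow until desired depth
    hidden, out = bp_train(x, y, hidden, out)       # train current stack with BP
    out = new_layer(8, 3)                           # discard previous output layer
    hidden.append(new_layer(8, 8))                  # add a fresh hidden layer on top

pretrained_logits = forward(x, hidden, out)         # pretrained DNN, ready for fine-tuning
```

The resulting stack would then be discriminatively fine-tuned as a whole.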
20130166291 | EMOTIONAL AND/OR PSYCHIATRIC STATE DETECTION - Mental state of a person is classified in an automated manner by analysing natural speech of the person. A glottal waveform is extracted from a natural speech signal. Pre-determined parameters defining at least one diagnostic class of a class model are retrieved, the parameters determined from selected training glottal waveform features. The selected glottal waveform features are extracted from the signal. Current mental state of the person is classified by comparing extracted glottal waveform features with the parameters and class model. Feature extraction from a glottal waveform or other natural speech signal may involve determining spectral amplitudes of the signal, setting spectral amplitudes below a pre-defined threshold to zero and, for each of a plurality of sub bands, determining an area under the thresholded spectral amplitudes, and deriving signal feature parameters from the determined areas in accordance with a diagnostic class model. | 06-27-2013 |
20130317815 | METHOD AND SYSTEM FOR ANALYZING DIGITAL SOUND AUDIO SIGNAL ASSOCIATED WITH BABY CRY - A method for analyzing a digital audio signal associated with a baby cry, comprising the steps of: (a) processing the digital audio signal using a spectral analysis to generate a spectral data; (b) processing the digital audio signal using a time-frequency analysis to generate a time-frequency characteristic; (c) categorizing the baby cry into one of a basic type and a special type based on the spectral data; (d) if the baby cry is of the basic type, determining a basic need based on the time-frequency characteristic and a predetermined lookup table; and (e) if the baby cry is of the special type, determining a special need by inputting the time-frequency characteristic into a pre-trained artificial neural network. | 11-28-2013 |
20140149112 | COMBINING AUDITORY ATTENTION CUES WITH PHONEME POSTERIOR SCORES FOR PHONE/VOWEL/SYLLABLE BOUNDARY DETECTION - Phoneme boundaries may be determined from a signal corresponding to recorded audio by extracting auditory attention features from the signal and extracting phoneme posteriors from the signal. The auditory attention features and phoneme posteriors may then be combined to detect boundaries in the signal. | 05-29-2014 |
20140163977 | SPEECH MODEL RETRIEVAL IN DISTRIBUTED SPEECH RECOGNITION SYSTEMS - Features are disclosed for managing the use of speech recognition models and data in automated speech recognition systems. Models and data may be retrieved asynchronously and used as they are received or after an utterance is initially processed with more general or different models. Once received, the models and statistics can be cached. Statistics needed to update models and data may also be retrieved asynchronously so that it may be used to update the models and data as it becomes available. The updated models and data may be immediately used to re-process an utterance, or saved for use in processing subsequently received utterances. User interactions with the automated speech recognition system may be tracked in order to predict when a user is likely to utilize the system. Models and data may be pre-cached based on such predictions. | 06-12-2014 |
20140214417 | METHOD AND DEVICE FOR VOICEPRINT RECOGNITION - A method and device for voiceprint recognition, include: establishing a first-level Deep Neural Network (DNN) model based on unlabeled speech data, the unlabeled speech data containing no speaker labels and the first-level DNN model specifying a plurality of basic voiceprint features for the unlabeled speech data; obtaining a plurality of high-level voiceprint features by tuning the first-level DNN model based on labeled speech data, the labeled speech data containing speech samples with respective speaker labels, and the tuning producing a second-level DNN model specifying the plurality of high-level voiceprint features; based on the second-level DNN model, registering a respective high-level voiceprint feature sequence for a user based on a registration speech sample received from the user; and performing speaker verification for the user based on the respective high-level voiceprint feature sequence registered for the user. | 07-31-2014 |
20140244248 | CONVERSION OF NON-BACK-OFF LANGUAGE MODELS FOR EFFICIENT SPEECH DECODING - Techniques for conversion of non-back-off language models for use in speech decoders. For example, a method comprises the following step. A non-back-off language model is converted to a back-off language model. The converted back-off language model is pruned. The converted back-off language model is usable for decoding speech. | 08-28-2014 |
20140257803 | CONSERVATIVELY ADAPTING A DEEP NEURAL NETWORK IN A RECOGNITION SYSTEM - Various technologies described herein pertain to conservatively adapting a deep neural network (DNN) in a recognition system for a particular user or context. A DNN is employed to output a probability distribution over models of context-dependent units responsive to receipt of captured user input. The DNN is adapted for a particular user based upon the captured user input, wherein the adaption is undertaken conservatively such that a deviation between outputs of the adapted DNN and the unadapted DNN is constrained. | 09-11-2014 |
20140257804 | EXPLOITING HETEROGENEOUS DATA IN DEEP NEURAL NETWORK-BASED SPEECH RECOGNITION SYSTEMS - Technologies pertaining to training a deep neural network (DNN) for use in a recognition system are described herein. The DNN is trained using heterogeneous data, the heterogeneous data including narrowband signals and wideband signals. The DNN, subsequent to being trained, receives an input signal that can be either a wideband signal or narrowband signal. The DNN estimates the class posterior probability of the input signal regardless of whether the input signal is the wideband signal or the narrowband signal. | 09-11-2014 |
20140257805 | MULTILINGUAL DEEP NEURAL NETWORK - Described herein are various technologies pertaining to a multilingual deep neural network (MDNN). The MDNN includes a plurality of hidden layers, wherein values for weight parameters of the plurality of hidden layers are learned during a training phase based upon training data in terms of acoustic raw features for multiple languages. The MDNN further includes softmax layers that are trained for each target language separately, making use of the hidden layer values trained jointly with multiple source languages. The MDNN is adaptable, such that a new softmax layer may be added on top of the existing hidden layers, where the new softmax layer corresponds to a new target language. | 09-11-2014 |
20140278390 | CLASSIFIER-BASED SYSTEM COMBINATION FOR SPOKEN TERM DETECTION - Systems and methods for processing a query include determining a plurality of sets of match candidates for a query using a processor, each of the plurality of sets of match candidates being independently determined from a plurality of diverse word lattice generation components of different type. The plurality of sets of match candidates is merged by generating a first score for each match candidate to provide a merged set of match candidates. A second score is computed for each match candidate of the merged set based upon features of that match candidate. The first score and the second score are combined to provide a final set of match candidates as matches to the query. | 09-18-2014 |
20140288928 | SYSTEM AND METHOD FOR APPLYING A CONVOLUTIONAL NEURAL NETWORK TO SPEECH RECOGNITION - A system and method for applying a convolutional neural network (CNN) to speech recognition. The CNN may provide input to a hidden Markov model and has at least one pair of a convolution layer and a pooling layer. The CNN operates along the frequency axis. The CNN has units that operate upon one or more local frequency bands of an acoustic signal. The CNN mitigates acoustic variation. | 09-25-2014 |
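The abstract's key point, convolution and pooling applied along the frequency axis so that units operate on local frequency bands, can be illustrated with a one-frame toy example. The spectrum values, kernel, and pooling size below are invented; a real system would learn many kernels and feed the pooled outputs onward (e.g. toward a hidden Markov model).

```python
import numpy as np

def conv1d_freq(spectrum, kernel):
    """Convolve along the frequency axis: each unit sees one local frequency band."""
    k = len(kernel)
    return np.array([spectrum[i:i + k] @ kernel
                     for i in range(len(spectrum) - k + 1)])

def max_pool(x, size):
    """Max-pooling over neighbouring frequency units; small shifts of spectral
    energy (one form of acoustic variation) change the pooled output little."""
    return np.array([x[i:i + size].max() for i in range(0, len(x) - size + 1, size)])

# One frame of a toy 12-bin log-spectrum (illustrative values).
frame = np.array([0.1, 0.5, 0.9, 0.4, 0.2, 0.8, 0.7, 0.3, 0.6, 0.1, 0.2, 0.4])
kernel = np.array([0.25, 0.5, 0.25])        # local frequency-band filter

conv_out = conv1d_freq(frame, kernel)       # 10 units, one per local band
pooled = max_pool(conv_out, size=2)         # 5 pooled outputs fed to later layers
```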
20140372112 | RESTRUCTURING DEEP NEURAL NETWORK ACOUSTIC MODELS - A Deep Neural Network (DNN) model used in an Automatic Speech Recognition (ASR) system is restructured. A restructured DNN model may include fewer parameters compared to the original DNN model. The restructured DNN model may include a monophone state output layer in addition to the senone output layer of the original DNN model. Singular value decomposition (SVD) can be applied to one or more weight matrices of the DNN model to reduce the size of the DNN Model. The output layer of the DNN model may be restructured to include monophone states in addition to the senones (tied triphone states) which are included in the original DNN model. When the monophone states are included in the restructured DNN model, the posteriors of monophone states are used to select a small part of senones to be evaluated. | 12-18-2014 |
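The SVD step mentioned in this abstract, factoring a weight matrix into two smaller matrices to shrink the model, is standard low-rank approximation and can be shown directly. The matrix size and rank below are arbitrary illustrations.

```python
import numpy as np

rng = np.random.default_rng(2)

def svd_restructure(W, rank):
    """Replace weight matrix W (m x n) with two smaller factors A (m x rank) and
    B (rank x n) whose product approximates W, keeping the top singular values."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * s[:rank]
    B = Vt[:rank, :]
    return A, B

W = rng.normal(size=(512, 512))             # one hidden-layer weight matrix
A, B = svd_restructure(W, rank=64)

original_params = W.size
restructured_params = A.size + B.size       # fewer parameters than the original layer
```

In the restructured network the single layer `x @ W` becomes two thinner layers `(x @ A) @ B`, which is where the parameter savings come from.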
20150019214 | METHOD AND DEVICE FOR PARALLEL PROCESSING IN MODEL TRAINING - A method and a device for training a DNN model includes: at a device including one or more processors and memory: establishing an initial DNN model; dividing a training data corpus into a plurality of disjoint data subsets; for each of the plurality of disjoint data subsets, providing the data subset to a respective training processing unit of a plurality of training processing units operating in parallel, wherein the respective training processing unit applies a Stochastic Gradient Descent (SGD) process to update the initial DNN model to generate a respective DNN sub-model based on the data subset; and merging the respective DNN sub-models generated by the plurality of training processing units to obtain an intermediate DNN model, wherein the intermediate DNN model is established as either the initial DNN model for a next training iteration or a final DNN model in accordance with a preset convergence condition. | 01-15-2015 |
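The parallel scheme this abstract describes (disjoint data subsets, one SGD-updated sub-model per training unit, then a merge into an intermediate model that seeds the next iteration) can be sketched on a toy linear model. The model, merge-by-averaging rule, learning rate, and convergence check are all illustrative assumptions, not the patent's specification.

```python
import numpy as np

rng = np.random.default_rng(3)

def sgd_update(model, data, targets, lr=0.01, steps=20):
    """Stochastic Gradient Descent on a toy linear least-squares 'sub-model'."""
    w = model.copy()
    for _ in range(steps):
        i = rng.integers(len(data))
        grad = (data[i] @ w - targets[i]) * data[i]
        w -= lr * grad
    return w

# Toy training corpus, divided into disjoint subsets (one per parallel unit).
X = rng.normal(size=(300, 5))
true_w = np.array([1.0, -2.0, 0.5, 3.0, 0.0])
y = X @ true_w
subsets = np.array_split(np.arange(300), 4)          # 4 disjoint data subsets

model = np.zeros(5)                                  # initial model
for _ in range(10):                                  # training iterations
    # Each unit applies SGD to its own subset, starting from the shared model.
    sub_models = [sgd_update(model, X[idx], y[idx]) for idx in subsets]
    # Merge the sub-models (here: averaging) into the intermediate model,
    # which becomes the initial model for the next iteration.
    model = np.mean(sub_models, axis=0)

final_model = model
```

A real system would stop when a preset convergence condition is met rather than after a fixed iteration count.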
20150039301 | SPEECH RECOGNITION USING NEURAL NETWORKS - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech recognition using neural networks. A feature vector that models audio characteristics of a portion of an utterance is received. Data indicative of latent variables of multivariate factor analysis is received. The feature vector and the data indicative of the latent variables is provided as input to a neural network. A candidate transcription for the utterance is determined based on at least an output of the neural network. | 02-05-2015 |
20150039302 | SPATIAL AUDIO SIGNALING FILTERING - An apparatus comprising: an analyser configured to analyse at least one input to determine one or more expressions within the at least one input; and a controller configured to control at least one audio signal associated with the at least one input dependent on the determination of the one or more expressions. | 02-05-2015 |
20150066496 | ASSIGNMENT OF SEMANTIC LABELS TO A SEQUENCE OF WORDS USING NEURAL NETWORK ARCHITECTURES - Technologies pertaining to slot filling are described herein. A deep neural network, a recurrent neural network, and/or a spatio-temporally deep neural network are configured to assign labels to words in a word sequence set forth in natural language. At least one label is a semantic label that is assigned to at least one word in the word sequence. | 03-05-2015 |
20150095026 | SPEECH RECOGNIZER WITH MULTI-DIRECTIONAL DECODING - In an automatic speech recognition (ASR) processing system, ASR processing may be configured to process speech based on multiple channels of audio received from a beamformer. The ASR processing system may include a microphone array and the beamformer to output multiple channels of audio such that each channel isolates audio in a particular direction. The multichannel audio signals may include spoken utterances/speech from one or more speakers as well as undesired audio, such as noise from a household appliance. The ASR device may simultaneously perform speech recognition on the multi-channel audio to provide more accurate speech recognition results. | 04-02-2015 |
20150095027 | KEY PHRASE DETECTION - Methods, systems, and apparatus, including computer programs encoded on computer storage media, for key phrase detection. One of the methods includes receiving a plurality of audio frame vectors that each model an audio waveform during a different period of time, generating an output feature vector for each of the audio frame vectors, wherein each output feature vector includes a set of scores that characterize an acoustic match between the corresponding audio frame vector and a set of expected event vectors, each of the expected event vectors corresponding to one of the scores and defining acoustic properties of at least a portion of a keyword, and providing each of the output feature vectors to a posterior handling module. | 04-02-2015 |
20150100312 | SYSTEM AND METHOD OF USING NEURAL TRANSFORMS OF ROBUST AUDIO FEATURES FOR SPEECH PROCESSING - A system and method for processing speech includes receiving a first information stream associated with speech, the first information stream comprising micro-modulation features and receiving a second information stream associated with the speech, the second information stream comprising features. The method includes combining, via a non-linear multilayer perceptron, the first information stream and the second information stream to yield a third information stream. The system performs automatic speech recognition on the third information stream. The third information stream can also be used for training HMMs. | 04-09-2015 |
20150127336 | SPEAKER VERIFICATION USING NEURAL NETWORKS - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for inputting speech data that corresponds to a particular utterance to a neural network; determining an evaluation vector based on output at a hidden layer of the neural network; comparing the evaluation vector with a reference vector that corresponds to a past utterance of a particular speaker; and based on comparing the evaluation vector and the reference vector, determining whether the particular utterance was likely spoken by the particular speaker. | 05-07-2015 |
20150127337 | ASYNCHRONOUS OPTIMIZATION FOR SEQUENCE TRAINING OF NEURAL NETWORKS - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for obtaining, by a first sequence-training speech model, a first batch of training frames that represent speech features of first training utterances; obtaining, by the first sequence-training speech model, one or more first neural network parameters; determining, by the first sequence-training speech model, one or more optimized first neural network parameters based on (i) the first batch of training frames and (ii) the one or more first neural network parameters; obtaining, by a second sequence-training speech model, a second batch of training frames that represent speech features of second training utterances; obtaining one or more second neural network parameters; and determining, by the second sequence-training speech model, one or more optimized second neural network parameters based on (i) the second batch of training frames and (ii) the one or more second neural network parameters. | 05-07-2015 |
20150134330 | VOICE AND/OR FACIAL RECOGNITION BASED SERVICE PROVISION - Apparatuses, methods and storage medium associated with voice and/or facial recognition based service provision are disclosed herein. In embodiments, an apparatus may include a voice recognition engine and a facial recognition engine configured to provide, individually or in cooperation with each other, identification of a user at a plurality of identification levels. The apparatus may further include a service agent configured to provide a service to a user of the apparatus, after the user has been identified at least at an identification level required to receive the service. Other embodiments may be described and/or claimed. | 05-14-2015 |
20150149165 | Speaker Adaptation of Neural Network Acoustic Models Using I-Vectors - A method includes providing a deep neural network acoustic model, receiving audio data including one or more utterances of a speaker, extracting a plurality of speech recognition features from the one or more utterances of the speaker, creating a speaker identity vector for the speaker based on the extracted speech recognition features, and adapting the deep neural network acoustic model for automatic speech recognition using the extracted speech recognition features and the speaker identity vector. | 05-28-2015 |
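A common way to realize the adaptation this abstract describes is to feed the speaker identity vector (i-vector) to the acoustic model alongside the per-frame speech features; the sketch below shows that input augmentation. The dimensions are invented, and appending the i-vector to every frame is one plausible reading of "adapting ... using the extracted speech recognition features and the speaker identity vector", not the patent's verbatim method.

```python
import numpy as np

rng = np.random.default_rng(4)

def adapt_input(features, ivector):
    """Append the speaker identity vector to every frame's acoustic features,
    so the DNN acoustic model receives speaker information at its input."""
    tiled = np.tile(ivector, (len(features), 1))
    return np.concatenate([features, tiled], axis=1)

frames = rng.normal(size=(100, 40))     # 100 frames of 40-dim speech features
ivector = rng.normal(size=20)           # speaker identity vector (e.g. 20-dim)

dnn_input = adapt_input(frames, ivector)    # what the acoustic model would consume
```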
20150294670 | TEXT-DEPENDENT SPEAKER IDENTIFICATION - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speaker verification. The methods, systems, and apparatus include actions of inputting speech data that corresponds to a particular utterance to a first neural network and determining an evaluation vector based on output at a hidden layer of the first neural network. Additional actions include obtaining a reference vector that corresponds to a past utterance of a particular speaker. Further actions include inputting the evaluation vector and the reference vector to a second neural network that is trained on a set of labeled pairs of feature vectors to identify whether speakers associated with the labeled pairs of feature vectors are the same speaker. More actions include determining, based on an output of the second neural network, whether the particular utterance was likely spoken by the particular speaker. | 10-15-2015 |
20150310858 | SHARED HIDDEN LAYER COMBINATION FOR SPEECH RECOGNITION SYSTEMS - Providing a framework for merging automatic speech recognition (ASR) systems having a shared deep neural network (DNN) feature transformation is provided. A received utterance may be evaluated to generate a DNN-derived feature from the top hidden layer of a DNN. The top hidden layer output may then be utilized to generate a network including a bottleneck layer and an output layer. Weights representing a feature dimension reduction may then be extracted between the top hidden layer and the bottleneck layer. Scores may then be generated and combined to merge the ASR systems which share the DNN feature transformation. | 10-29-2015 |
20150340032 | TRAINING MULTIPLE NEURAL NETWORKS WITH DIFFERENT ACCURACY - Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a deep neural network. One of the methods includes generating a plurality of feature vectors that each model a different portion of an audio waveform, generating a first posterior probability vector for a first feature vector using a first neural network, determining whether one of the scores in the first posterior probability vector satisfies a first threshold value, generating a second posterior probability vector for each subsequent feature vector using a second neural network, wherein the second neural network is trained to identify the same key words and key phrases and includes more inner layer nodes than the first neural network, and determining whether one of the scores in the second posterior probability vector satisfies a second threshold value. | 11-26-2015 |
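The two-network cascade this abstract describes (a small network whose posterior score is checked against a first threshold, then a larger network trained on the same key words and phrases checked against a second threshold) can be sketched as follows. The linear "networks", thresholds, and keyword set are invented stand-ins for trained models.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cascade_detect(feature, small_net, large_net, t1=0.4, t2=0.8):
    """Run the small (cheap) network first; only if a keyword score clears the
    first threshold, confirm with the larger, more accurate network."""
    p1 = softmax(small_net(feature))          # first posterior probability vector
    if p1.max() < t1:
        return None                           # no candidate: stop early, save compute
    p2 = softmax(large_net(feature))          # second net: same keywords, more nodes
    if p2.max() < t2:
        return None
    return int(p2.argmax())                   # index of the detected key word/phrase

# Illustrative stand-in 'networks': fixed linear scorers over a 3-keyword set.
rng = np.random.default_rng(5)
W_small = rng.normal(size=(8, 3))
W_large = rng.normal(size=(8, 3)) * 3.0       # sharper scores mimic a bigger model
small_net = lambda x: x @ W_small
large_net = lambda x: x @ W_large

result = cascade_detect(rng.normal(size=8), small_net, large_net)
```

The point of the cascade is that most feature vectors are rejected by the cheap first network, so the expensive second network runs only on promising frames.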
20160019884 | METHODS AND APPARATUS FOR TRAINING A TRANSFORMATION COMPONENT - According to some aspects, a method of training a transformation component using a trained acoustic model comprising first parameters having respective first values established during training of the acoustic model using first training data is provided. The method comprises using at least one computer processor to perform coupling the transformation component to a portion of the acoustic model, the transformation component comprising second parameters, and training the transformation component by determining, for the second parameters, respective second values using second training data input to the transformation component and processed by the acoustic model, wherein the acoustic model retains the first parameters having the respective first values throughout training of the transformation component. | 01-21-2016 |
20160071515 | SECTIONED MEMORY NETWORKS FOR ONLINE WORD-SPOTTING IN CONTINUOUS SPEECH - Systems, methods, and computer program products to detect a keyword in speech, by generating, from a sequence of spectral feature vectors generated from the speech, a plurality of blocked feature vector sequences, and analyzing, by a neural network, each of the plurality of blocked feature vector sequences to detect the presence of the keyword in the speech. | 03-10-2016 |
20160071519 | SPEECH MODEL RETRIEVAL IN DISTRIBUTED SPEECH RECOGNITION SYSTEMS - Features are disclosed for managing the use of speech recognition models and data in automated speech recognition systems. Models and data may be retrieved asynchronously and used as they are received or after an utterance is initially processed with more general or different models. Once received, the models and statistics can be cached. Statistics needed to update models and data may also be retrieved asynchronously so that it may be used to update the models and data as it becomes available. The updated models and data may be immediately used to re-process an utterance, or saved for use in processing subsequently received utterances. User interactions with the automated speech recognition system may be tracked in order to predict when a user is likely to utilize the system. Models and data may be pre-cached based on such predictions. | 03-10-2016 |
20160078863 | SIGNAL PROCESSING ALGORITHM-INTEGRATED DEEP NEURAL NETWORK-BASED SPEECH RECOGNITION APPARATUS AND LEARNING METHOD THEREOF - Provided are a signal processing algorithm-integrated deep neural network (DNN)-based speech recognition apparatus and a learning method thereof. A model parameter learning method in a deep neural network (DNN)-based speech recognition apparatus implementable by a computer includes converting a signal processing algorithm for extracting a feature parameter from a time-domain speech input signal into a signal processing deep neural network (DNN), fusing the signal processing DNN with a classification DNN, and learning a model parameter in a deep learning model in which the signal processing DNN and the classification DNN are fused. | 03-17-2016 |
20160093290 | SYSTEM AND METHOD FOR COMPRESSED DOMAIN LANGUAGE IDENTIFICATION - Embodiments included herein are directed towards a system and method for compressed domain language identification. Embodiments may include receiving a bitstream of a sequence of packets at one or more computing devices and classifying each packet into speech or non-speech based upon, at least in part, compressed domain voice activity detection (VAD). Embodiments may further include extracting a pseudo-cepstral representation from the speech-detected packets by partial decoding, without extracting a PCM format, and generating a sequence of multi-frames based upon, at least in part, the pseudo-cepstral representation. Embodiments may also include providing in real time the sequence of multi-frames to a deep neural network (DNN), wherein the DNN has been trained off-line for one or more desired target languages. | 03-31-2016 |
20160093294 | ACOUSTIC MODEL TRAINING CORPUS SELECTION - The present disclosure relates to training a speech recognition system. One example method includes receiving a collection of speech data items, wherein each speech data item corresponds to an utterance that was previously submitted for transcription by a production speech recognizer. The production speech recognizer uses initial production speech recognizer components in generating transcriptions of speech data items. A transcription for each speech data item is generated using an offline speech recognizer, whose components are configured to improve speech recognition accuracy in comparison with the initial production speech recognizer components. Updated production speech recognizer components are trained for the production speech recognizer using a selected subset of the transcriptions of the speech data items generated by the offline speech recognizer. An updated production speech recognizer component is provided to the production speech recognizer for use in transcribing subsequently received speech data items. | 03-31-2016 |
20160093313 | NEURAL NETWORK VOICE ACTIVITY DETECTION EMPLOYING RUNNING RANGE NORMALIZATION - A "running range normalization" method includes computing running estimates of the range of values of features useful for voice activity detection (VAD) and normalizing the features by mapping them to a desired range. Running range normalization includes computing running estimates of the minimum and maximum values of the VAD features and normalizing the feature values by mapping the original range to a desired range. Smoothing coefficients are optionally selected to directionally bias the rate of change of at least one of the running estimates of the minimum and maximum values. The normalized VAD features are used to train a machine learning algorithm to detect voice activity, and the trained algorithm is used to isolate or enhance the speech component of the audio data. | 03-31-2016 |
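The running min/max update and range mapping described in this abstract can be sketched as follows. The smoothing coefficients (0.95) and the target range [0, 1] are illustrative values chosen for the sketch, not taken from the application.

```python
def update_running_range(value, run_min, run_max, coeff_min=0.95, coeff_max=0.95):
    """One update step of the running minimum/maximum estimates.

    Exponential smoothing pulls each estimate toward the current value;
    taking min/max against the raw value lets the estimates track outward
    excursions immediately. Choosing coeff_min != coeff_max would
    directionally bias how fast each estimate changes.
    """
    run_min = min(value, coeff_min * run_min + (1 - coeff_min) * value)
    run_max = max(value, coeff_max * run_max + (1 - coeff_max) * value)
    return run_min, run_max


def normalize(value, run_min, run_max, lo=0.0, hi=1.0):
    """Map a feature value from its running range onto [lo, hi], clamped."""
    span = run_max - run_min
    if span == 0:
        return lo
    x = (value - run_min) / span
    return lo + (hi - lo) * max(0.0, min(1.0, x))


# Feed a short stream of toy VAD feature values through the running range.
run_min, run_max = 0.0, 1.0
for v in [0.2, 3.0, -1.0, 0.5]:
    run_min, run_max = update_running_range(v, run_min, run_max)
```

After the loop the running range has widened to cover the observed values, and any new feature value normalizes into [0, 1].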
20160098987 | NEURAL NETWORK-BASED SPEECH PROCESSING - Pairs of feature vectors are obtained that represent speech. Some pairs represent two samples of speech from the same speakers, and other pairs represent two samples of speech from different speakers. A neural network feeds each feature vector in a sample pair into a separate bottleneck layer, with a weight matrix on the input of both vectors tied to one another. The neural network is trained using the feature vectors and an objective function that induces the network to classify whether the speech samples come from the same speaker. The weights from the tied weight matrix are extracted for use in generating derived features for a speech processing system that can benefit from features that are thus transformed to better reflect speaker identity. | 04-07-2016 |
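The key structural idea in this abstract, feeding both feature vectors of a pair through the same (tied) bottleneck weight matrix, can be shown in a few lines. Only the tied forward pass is sketched here; the training loop and the same/different-speaker objective are omitted, and the weight values are toy numbers.

```python
def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def bottleneck_pair(w, x1, x2):
    """Forward both feature vectors of a sample pair through the SAME
    tied bottleneck weight matrix, as the abstract describes."""
    return matvec(w, x1), matvec(w, x2)


def squared_distance(a, b):
    """A simple score a pairwise objective could build on: small for
    same-speaker pairs, large for different-speaker pairs (illustrative)."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


w = [[0.5, 0.0, 0.5],   # tied 2x3 bottleneck weights (toy values)
     [0.0, 1.0, 0.0]]
h1, h2 = bottleneck_pair(w, [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
```

After training, the abstract extracts this tied matrix and reuses it to derive speaker-sensitive features for a downstream system.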
20160098996 | MANAGEMENT OF VOICE COMMANDS FOR DEVICES IN A CLOUD COMPUTING ENVIRONMENT - Provided is a lightweight computational device that is configured to be in communication with a cloud both directly and via a smart computational device. The lightweight computational device receives a voice command from a user, wherein the lightweight computational device does not have adequate processing power to convert the voice command to a text command. The voice command is transmitted from the lightweight computational device to a smart computational device, wherein the smart computational device uses voice recognition to convert the voice command to a text command, and transmits the text command to be processed by the cloud, which provides at least one of a voice recognition service and other services. The lightweight computational device receives a data response for the user from the cloud, via the smart computational device, based on the other services provided by the cloud. | 04-07-2016 |
20160099010 | CONVOLUTIONAL, LONG SHORT-TERM MEMORY, FULLY CONNECTED DEEP NEURAL NETWORKS - Methods, systems, and apparatus, including computer programs encoded on computer storage media, for identifying the language of a spoken utterance. One of the methods includes receiving input features of an utterance; and processing the input features using an acoustic model that comprises one or more convolutional neural network (CNN) layers, one or more long short-term memory network (LSTM) layers, and one or more fully connected neural network layers to generate a transcription for the utterance. | 04-07-2016 |
20160104486 | Methods and Systems for Communicating Content to Connected Vehicle Users Based on Detected Tone/Mood in Voice Input - Methods, systems and cloud processing are provided for coordinating and processing user input provided to vehicles during use. One example is for processing voice inputs at a vehicle to identify a mood of a user and then modifying or customizing the vehicle response based on the detected mood, physical characteristic and/or physiological characteristic of the user. One example includes sending, to a cloud processing server, data from the vehicle. The vehicle includes an on-board computer for processing instructions for the vehicle and processing wireless communication to exchange data with the cloud processing server. The method then receives, at the vehicle, data for a user account to use the vehicle. The cloud processing server uses the user account to identify a user profile of a user. The method then receives, from the cloud processing server, voice profiles for the user profile. Each voice profile is associated with a tone identifier or audio signature. The voice profiles for the user are learned from a plurality of voice inputs made to the vehicle by the user in one or more prior sessions of use of the vehicle. The method further includes receiving, by the on-board computer, a voice input, processing the voice input to identify a voice profile that is correlated to the voice input, and generating a vehicle response for the voice input. The vehicle response is moderated based on the tone identifier of the identified voice profile. In one example, the tone identifier is used to infer a mood of the user, and the moderation of the vehicle response assists in selecting a type of response by the vehicle and/or a setting made to a vehicle system. | 04-14-2016 |
20160118041 | SYSTEM AND METHOD FOR ANALYZING AND CLASSIFYING CALLS WITHOUT TRANSCRIPTION VIA KEYWORD SPOTTING - A facility and method for analyzing and classifying calls without transcription via keyword spotting is disclosed. The facility uses a group of calls having known outcomes to generate one or more domain- or entity-specific grammars containing keywords and related information that are indicative of a particular outcome. The facility monitors telephone calls by determining the domain or entity associated with the call, loading the appropriate grammar or grammars associated with the determined domain or entity, and tracking keywords contained in the loaded grammar or grammars that are spoken during the monitored call, along with additional information. The facility performs a statistical analysis on the tracked keywords and additional information to determine a classification for the monitored telephone call. | 04-28-2016 |
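The load-grammar-then-score flow this abstract describes can be sketched in miniature. The domains, keywords, weights, and threshold below are all invented for illustration, and the patent's statistical analysis is reduced here to a simple weighted sum over spotted keywords.

```python
# Toy domain-specific grammars: keyword -> weight toward a "converted" outcome.
# Keywords and weights are invented; real grammars come from calls with
# known outcomes, per the abstract.
GRAMMARS = {
    "auto_repair": {"appointment": 2.0, "estimate": 1.0, "voicemail": -2.0},
    "dental": {"booking": 2.0, "insurance": 0.5, "wrong number": -3.0},
}


def classify_call(domain, spotted_keywords, threshold=1.5):
    """Classify a monitored call from keywords spotted in its audio.

    `spotted_keywords` stands in for the keyword-spotter output; no
    transcription is assumed. Returns (label, score).
    """
    grammar = GRAMMARS[domain]   # load the grammar for the call's domain
    score = sum(grammar.get(kw, 0.0) for kw in spotted_keywords)
    label = "converted" if score >= threshold else "not_converted"
    return label, score


label, score = classify_call("auto_repair", ["appointment", "estimate"])
```

A call that only hit "voicemail" would score below the threshold and be classified as not converted.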
20160125877 | MULTI-STAGE HOTWORD DETECTION - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for multi-stage hotword detection are disclosed. In one aspect, a method includes the actions of receiving, by a second stage hotword detector of a multi-stage hotword detection system that includes at least a first stage hotword detector and the second stage hotword detector, audio data that corresponds to an initial portion of an utterance. The actions further include determining a likelihood that the initial portion of the utterance includes a hotword. The actions further include determining that the likelihood that the initial portion of the utterance includes the hotword satisfies a threshold. The actions further include, in response to determining that the likelihood satisfies the threshold, transmitting a request for the first stage hotword detector to cease providing additional audio data that corresponds to one or more subsequent portions of the utterance. | 05-05-2016 |
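The two-stage control flow in this abstract, where the second stage tells the first stage to stop forwarding audio once its likelihood threshold is satisfied, can be simulated with two tiny classes. The threshold value and the class/method names are illustrative, not from the application.

```python
class FirstStage:
    """Low-cost first-stage detector: streams audio onward until told to cease."""

    def __init__(self):
        self.streaming = True

    def cease(self):
        self.streaming = False


class SecondStage:
    """Higher-accuracy second-stage detector scoring the initial portion
    of an utterance. The likelihood is taken as an input here; a real
    system would compute it from the audio data."""

    def __init__(self, first_stage, threshold=0.8):
        self.first_stage = first_stage
        self.threshold = threshold

    def process(self, likelihood):
        if likelihood >= self.threshold:
            # Hotword confirmed from the initial portion alone: request
            # that the first stage cease providing subsequent audio.
            self.first_stage.cease()
            return True
        return False


stage1 = FirstStage()
stage2 = SecondStage(stage1)
detected = stage2.process(0.93)   # likelihood satisfies the threshold
```

Once `detected` is true, `stage1` has stopped streaming, saving the transfer and processing of the rest of the utterance.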
20160125883 | SPEECH RECOGNITION CLIENT APPARATUS PERFORMING LOCAL SPEECH RECOGNITION - An object is to provide a client having a local speech recognition function that is capable of activating a speech recognition function of a speech recognition server in a natural manner, and of maintaining high precision without increasing the burden on a communication line. | 05-05-2016 |
20160155436 | METHOD AND APPARATUS FOR SPEECH RECOGNITION | 06-02-2016 |
20160163310 | METHOD AND APPARATUS FOR TRAINING LANGUAGE MODEL AND RECOGNIZING SPEECH - A method and apparatus for training a neural network language model, and a method and apparatus for recognizing speech data based on a trained language model are provided. The method of training a language model involves converting, using a processor, training data into error-containing training data, and training a neural network language model using the error-containing training data. | 06-09-2016 |
20160171974 | SYSTEMS AND METHODS FOR SPEECH TRANSCRIPTION | 06-16-2016 |
20160180838 | USER SPECIFIED KEYWORD SPOTTING USING LONG SHORT TERM MEMORY NEURAL NETWORK FEATURE EXTRACTOR | 06-23-2016 |
20160189715 | SPEECH RECOGNITION DEVICE AND METHOD - A terminal includes: a speech acquisition unit that acquires first speech information; a first speech processor that removes noise contained in the first speech information using a first removal method and outputs the noise-removed speech information as second speech information; a first speech recognition unit that performs speech recognition on the second speech information and outputs the result as first speech recognition result information; a communication unit that receives, from a server, second speech recognition result information obtained by performing speech recognition on third speech information, the third speech information being obtained by removing noise contained in the first speech information using a second removal method that removes a larger amount of noise than the first removal method; and a determination unit that selects which of the first speech recognition result information and the second speech recognition result information should be outputted. | 06-30-2016 |
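The final selection step, choosing between the fast local result (light noise removal) and the slower server result (aggressive noise removal), can be sketched as a small decision function. The abstract does not specify the selection criterion, so the confidence-based rule, its margin, and the timeout handling below are assumptions made for illustration.

```python
def select_result(local_result, server_result, timeout_expired, margin=0.1):
    """Pick between the local and server recognition results.

    Falls back to the local result when the server result is missing or
    too late; otherwise prefers the server result only when its confidence
    beats the local confidence by a margin. Rule and margin are invented.
    """
    if server_result is None or timeout_expired:
        return local_result
    if server_result["confidence"] >= local_result["confidence"] + margin:
        return server_result
    return local_result


chosen = select_result(
    {"text": "turn on the radio", "confidence": 0.6},
    {"text": "turn on the lights", "confidence": 0.9},
    timeout_expired=False,
)
```

Here the server result wins because its confidence clears the margin; with no server response, the local result would be output immediately.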
20160379626 | LANGUAGE MODEL MODIFICATION FOR LOCAL SPEECH RECOGNITION SYSTEMS USING REMOTE SOURCES - A language model is modified for a local speech recognition system using remote speech recognition sources. In one example, a speech utterance is received and sent to at least one remote speech recognition system. Text results corresponding to the utterance are received from the remote speech recognition system, and a local text result is generated using the local vocabulary. The received text results and the generated text result are compared to determine which words are out of the local vocabulary, and the local vocabulary is updated using the out-of-vocabulary words. | 12-29-2016 |
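The comparison step, finding words in the remote results that the local vocabulary lacks and folding them in, is simple enough to sketch directly. Tokenization here is a naive whitespace split, purely for illustration.

```python
def find_oov_words(remote_texts, local_vocab):
    """Return, sorted, the words appearing in the remote recognizers'
    text results that are missing from the local vocabulary."""
    remote_words = set()
    for text in remote_texts:
        remote_words.update(text.lower().split())   # naive tokenization
    return sorted(w for w in remote_words if w not in local_vocab)


local_vocab = {"call", "mom", "play", "music"}
oov = find_oov_words(
    ["play smetana", "play some smetana"],   # remote results for the utterance
    local_vocab,
)
local_vocab.update(oov)   # update the local vocabulary with the OOV words
```

Subsequent local recognitions of the same utterance can then produce the previously out-of-vocabulary words.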
20160379665 | TRAINING DEEP NEURAL NETWORK FOR ACOUSTIC MODELING IN SPEECH RECOGNITION - A method is provided for training a Deep Neural Network (DNN) for acoustic modeling in speech recognition. The method includes reading central frames and side frames as input frames from a memory, the side frames preceding and/or succeeding the central frames. The method further includes executing pre-training for only the central frames, or for both the central frames and the side frames, and fine-tuning for the central frames and the side frames, so as to emphasize connections between acoustic features in the central frames and units of the bottom hidden layer of the DNN. | 12-29-2016 |
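Assembling a DNN input from a central frame plus preceding and succeeding side frames can be sketched as a context-window concatenation. The window sizes and the edge-padding policy below are illustrative assumptions, not details from the application.

```python
def assemble_input(frames, idx, n_preceding=2, n_succeeding=2):
    """Concatenate the central frame at `idx` with its preceding and
    succeeding side frames, edge-padding at utterance boundaries."""
    window = []
    for k in range(idx - n_preceding, idx + n_succeeding + 1):
        k = max(0, min(k, len(frames) - 1))   # clamp at the utterance edges
        window.extend(frames[k])
    return window


# Toy 1-dimensional acoustic feature per frame, for readability.
frames = [[0.1], [0.2], [0.3], [0.4], [0.5]]
x = assemble_input(frames, idx=2)   # central frame 2 with 2+2 side frames
```

With real features each frame would be a vector of spectral coefficients; the concatenated window is what the bottom layer of the DNN sees.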
20160379669 | METHOD FOR DETERMINING ALCOHOL CONSUMPTION, AND RECORDING MEDIUM AND TERMINAL FOR CARRYING OUT SAME - Disclosed are a method for determining whether a person is drunk on the basis of a difference among a plurality of formant energies, which are generated by applying linear predictive coding according to a plurality of linear prediction orders, and a recording medium and a terminal for carrying out the method. The alcohol consumption determining terminal comprises: a voice input unit for receiving voice signals, converting them into voice frames, and outputting the voice frames; a voiced/unvoiced sound analysis unit for extracting the voice frames corresponding to a voiced sound; an LPC processing unit for calculating a plurality of formant energies by applying linear predictive coding according to the plurality of linear prediction orders to the voice frames corresponding to the voiced sound; and an alcohol consumption determining unit for determining whether a person is drunk on the basis of a difference among the plurality of formant energies calculated by the LPC processing unit. Whether a person is drunk is thereby determined from the change in formant energies produced by applying linear predictive coding at the plurality of linear prediction orders to the voice signals. | 12-29-2016 |
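Comparing LPC analyses at several prediction orders can be sketched with the Levinson-Durbin recursion, recording the prediction-error energy at each requested order. This per-order error energy is only a rough stand-in for the formant-energy feature the abstract describes, and the toy chirp signal is invented; the patent's exact feature extraction is not reproduced here.

```python
import math


def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of a voiced frame."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def lpc_error_energies(x, orders):
    """Levinson-Durbin recursion over the frame's autocorrelation,
    recording the prediction-error energy at each requested LPC order."""
    p = max(orders)
    r = autocorr(x, p)
    a, err, energies = [1.0], r[0], {}
    for m in range(1, p + 1):
        acc = sum(a[i] * r[m - i] for i in range(m))
        k = -acc / err                      # reflection coefficient
        a = a + [0.0]
        a = [a[i] + k * a[m - i] for i in range(m + 1)]
        err *= (1.0 - k * k)                # error shrinks with each order
        if m in orders:
            energies[m] = err
    return energies


# A toy "voiced" frame: a chirp, so no low LPC order predicts it perfectly.
frame = [math.sin(0.3 * n + 0.002 * n * n) for n in range(200)]
energies = lpc_error_energies(frame, orders=(2, 4, 8))
delta = energies[2] - energies[8]   # per-order difference a decision could use
```

The error energies are non-increasing in the prediction order, so the per-order differences are non-negative; a decision rule like the patent's would threshold such differences.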
20170236516 | System and Method for Audio-Visual Speech Recognition | 08-17-2017 |
20180025721 | AUTOMATIC SPEECH RECOGNITION USING MULTI-DIMENSIONAL MODELS | 01-25-2018 |
20190147855 | NEURAL NETWORK FOR USE IN SPEECH RECOGNITION ARBITRATION | 05-16-2019 |
20190147856 | Low-Power Automatic Speech Recognition Device | 05-16-2019 |
20190147876 | SYSTEMS AND METHODS FOR ADAPTIVE PROPER NAME ENTITY RECOGNITION AND UNDERSTANDING | 05-16-2019 |
20220138427 | SYSTEM AND METHOD FOR PROVIDING VOICE ASSISTANT SERVICE REGARDING TEXT INCLUDING ANAPHORA - A system and method for providing a voice assistant service for text including an anaphor are provided. A method, performed by an electronic device, of providing a voice assistant service includes: obtaining first text generated from a first input, detecting a target word within the first text and generating common information related to the detected target word, using a first natural language understanding (NLU) model, obtaining second text generated from a second input, inputting the common information and the second text to a second NLU model, detecting an anaphor included in the second text and outputting an intent and a parameter, based on common information corresponding to the detected anaphor, using the second NLU model, and generating response information related to the intent and the parameter. | 05-05-2022 |
20220139382 | CONTEMPORANEOUS MACHINE-LEARNING ANALYSIS OF AUDIO STREAMS - Described techniques select portions of an audio stream for transmission to a trained machine learning application, which generates response recommendations in real-time. This real-time response is facilitated by the system identifying, selecting and transmitting those portions of the audio stream likely to be most relevant to the conversation. Portions of an audio stream less likely to be relevant to the conversation are identified accordingly and not transmitted. The system may identify the relevant portions of an audio stream by detecting events in a contemporaneous event stream, use a trained machine learning model to identify events in an audio stream, or both. | 05-05-2022 |
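The selection step this abstract describes, transmitting only the audio portions near detected events and dropping the rest, can be sketched as a filter over timestamped segments. The event representation and the time window below are invented for illustration.

```python
def select_portions(segments, event_times, window=5.0):
    """Keep only the audio segments that start within `window` seconds of
    an event detected in the contemporaneous event stream; the remaining
    segments are deemed less relevant and not transmitted."""
    return [seg for seg in segments
            if any(abs(seg["start"] - t) <= window for t in event_times)]


segments = [
    {"start": 0.0, "text": "hello"},
    {"start": 12.0, "text": "my order number is 4417"},
    {"start": 40.0, "text": "thanks, bye"},
]
# One event detected at t=11.0 in the contemporaneous event stream.
kept = select_portions(segments, event_times=[11.0])
```

Only the segment near the event survives, so only that portion would be sent to the trained machine learning application for a real-time response recommendation.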
20220139383 | REAL-TIME VIDEO CONFERENCE CHAT FILTERING USING MACHINE LEARNING MODELS - In various examples, as a user is speaking or presenting content during an online video conference, the data stream may be processed to generate a textual representation (e.g., transcript) of the audio and/or information relating to the video. The textual representation and/or video related information may then be processed to determine a context or one or more topic(s) of discussion. Based on the determined context/topic(s), a corresponding neural network(s) may be selected. Once a neural network has been selected, comments may be retrieved from a chat feature of the application and applied to the neural network. The neural network may then output data to indicate the relevance of the comments to the determined discussion topic. Based on the relevance of the comment, the comment may be allowed, prioritized, deleted, de-emphasized, or otherwise filtered in the chat feature. | 05-05-2022 |
20220139393 | DRIVER INTERFACE WITH VOICE AND GESTURE CONTROL - A driver interface for use within an automobile provides responses to voice commands issued for example by a driver of the automobile. The interface includes a camera and microphone for capturing image data such as gestures and audio data from the automobile driver. The image data and audio data are processed to extract image and linguistic features from the image and audio data, which image and linguistic features are processed to interpret and infer a meaning of the voice command. | 05-05-2022 |
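The fusion step in this abstract, using image features such as a gesture to disambiguate a voice command, can be illustrated with a toy deictic-resolution function. The feature extraction itself is out of scope here, and the string-level resolution rule is invented, not the patent's method.

```python
def interpret(voice_command, gesture_target):
    """Fuse linguistic features with an image-derived gesture target to
    infer the meaning of a deictic voice command. A real system would
    operate on extracted features rather than raw strings."""
    if "that" in voice_command.split() and gesture_target is not None:
        # Resolve the demonstrative pronoun to the object being pointed at.
        return voice_command.replace("that", gesture_target)
    return voice_command


# Driver points at the radio volume knob while saying "turn that up".
meaning = interpret("turn that up", gesture_target="the radio volume")
```

Without a gesture (or without a deictic word), the command is taken at face value.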