Entries |
Document | Title | Date |
20080215329 | Methods and Apparatus for Generating Dialog State Conditioned Language Models - Techniques are provided for generating improved language modeling. Such improved modeling is achieved by conditioning a language model on a state of a dialog for which the language model is employed. For example, the techniques of the invention may improve modeling of language for use in a speech recognizer of an automatic natural language based dialog system. Improved usability of the dialog system arises from better recognition of a user's utterances by a speech recognizer, associated with the dialog system, using the dialog state-conditioned language models. By way of example, the state of the dialog may be quantified as: (i) the internal state of the natural language understanding part of the dialog system; or (ii) words in the prompt that the dialog system played to the user. | 09-04-2008 |
20080255844 | Minimizing empirical error training and adaptation of statistical language models and context free grammar in automatic speech recognition - Architecture for minimizing an empirical error rate by discriminative adaptation of a statistical language model in a dictation and/or dialog application. The architecture allows assignment of an improved weighting value to each term or phrase to reduce empirical error. Empirical errors are minimized whether a user provides correction results or not based on criteria for discriminatively adapting the user language model (LM)/context-free grammar (CFG) to the target. Moreover, algorithms are provided for the training and adaptation processes of LM/CFG parameters for criteria optimization. | 10-16-2008 |
20080262844 | Method and system for analyzing separated voice data of a telephonic communication to determine the gender of the communicant - A method and system for determining the gender of a communicant in a communication is provided. According to the method, at least one aural segment corresponding to at least one word spoken by a communicant is identified. The aural segment is then analyzed by applying a gender detection model to the aural segment, and gender detection data is generated based on the application of the gender detection model. | 10-23-2008 |
20080294441 | Speech Recognition System with Huge Vocabulary - The invention deals with speech recognition, such as a system for recognizing words in continuous speech. A speech recognition system is disclosed which is capable of recognizing a huge number of words, and in principle even an unlimited number of words. The speech recognition system comprises a word recognizer for deriving a best path through a word graph, wherein words are assigned to the speech based on the best path. A word score is obtained by applying a phonemic language model to each word of the word graph. Moreover, the invention deals with an apparatus and a method for identifying words from a sound block and with computer readable code for implementing the method. | 11-27-2008 |
20080312928 | Natural language speech recognition calculator - Disclosed herein is a computer implemented method and system for evaluating a mathematical expression spoken in a natural language by a user. The disclosed method and system provides a natural language speech recognition calculator comprising a speech recognition engine. The spoken mathematical expression is transmitted to the speech recognition engine via an audio input device. Mathematical entities of the spoken mathematical expression are extracted and represented in a hierarchical recursive format of a speech recognition grammar implemented by the speech recognition engine. A symbolic mathematical expression is generated from the extracted mathematical entities and then normalized with common measurement units. The normalized mathematical expression is then evaluated to generate a mathematical result. The mathematical result may be synthesized by a text-to-speech engine to produce a voice output. The mathematical result may be provided on an audio output device, a video display unit, a printer, and an electronic device in a network. | 12-18-2008 |
20080319750 | CONCEPT MONITORING IN SPOKEN-WORD AUDIO - Monitoring a spoken-word audio stream for a relevant concept is disclosed. A speech recognition engine may recognize a plurality of words from the audio stream. Function words that do not indicate content may be removed from the plurality of words. A concept may be determined from at least one word recognized from the audio stream. The concept may be determined via a morphological normalization of the plurality of words. The concept may be associated with a time related to when the at least one word was spoken. A relevance metric may be computed for the concept. Computing the relevance metric may include assessing the temporal frequency of the concept within the audio stream. The relevance metric for the concept may be based on respective confidence scores of the at least one word. The concept, time, and relevance metric may be displayed in a graphical display. | 12-25-2008 |
20090030691 | USING AN UNSTRUCTURED LANGUAGE MODEL ASSOCIATED WITH AN APPLICATION OF A MOBILE COMMUNICATION FACILITY - A user may control a mobile communication facility through recognized speech provided to the mobile communication facility. Speech is recorded by a user using a capture facility resident on the mobile communication facility. A speech recognition facility generates results of the recorded speech using an unstructured language model based at least in part on information relating to the recording. An application resident on the mobile communication facility is identified, wherein the resident application is capable of taking the results generated by the speech recognition facility as an input. The generated results are input to the application. | 01-29-2009 |
20090150155 | KEYWORD EXTRACTING DEVICE - The present invention aims at extracting a keyword of conversation without preparations by advanced anticipation of keywords of conversation. A keyword extracting device of the present invention includes an audio input section | 06-11-2009 |
20090271201 | STANDARD-MODEL GENERATION FOR SPEECH RECOGNITION USING A REFERENCE MODEL - A standard model creating apparatus which provides a high-precision standard model used for pattern recognition such as speech recognition, character recognition, or image recognition using a probability model based on a hidden Markov model, Bayesian theory, or linear discrimination analysis; intention interpretation using a probability model such as a Bayesian net; data-mining performed using a probability model; and so forth. The standard model creating apparatus includes a reference model preparing unit that prepares at least one reference model; a reference model storing unit that stores the reference model prepared by the reference model preparing unit; and a standard model creating unit that creates a standard model by calculating statistics of the standard model so as to maximize or locally maximize the probability or likelihood with respect to the reference model stored in the reference model storing unit. | 10-29-2009 |
20100004932 | SPEECH RECOGNITION SYSTEM, SPEECH RECOGNITION PROGRAM, AND SPEECH RECOGNITION METHOD - A speech recognition system includes the following: a feature calculating unit; a sound level calculating unit that calculates an input sound level in each frame; a decoding unit that matches the feature of each frame with an acoustic model and a linguistic model, and outputs a recognized word sequence; a start-point detector that determines a start frame of a speech section based on a reference value; an end-point detector that determines an end frame of the speech section based on a reference value; and a reference value updating unit that updates the reference value in accordance with variations in the input sound level. The start-point detector updates the start frame every time the reference value is updated. The decoding unit starts matching before being notified of the end frame and corrects the matching results every time it is notified of the start frame. The speech recognition system can suppress a delay in response time while performing speech recognition based on a proper speech section. | 01-07-2010 |
20100070278 | Method for Creating a Speech Model - A transformation can be derived which would represent that processing required to convert a male speech model to a female speech model. That transformation is subjected to a predetermined modification, and the modified transformation is applied to a female speech model to produce a synthetic children's speech model. The male and female models can be expressed in terms of a vector representing key values defining each speech model and the derived transformation can be in the form of a matrix that would transform the vector of the male model to the vector of the female model. The modification to the derived matrix comprises applying an exponential p which has a value greater than zero and less than 1. | 03-18-2010 |
20100076765 | STRUCTURED MODELS OF REPITITION FOR SPEECH RECOGNITION - Described is a technology by which a structured model of repetition is used to determine the words spoken by a user, and/or a corresponding database entry, based in part on a prior utterance. For a repeated utterance, a joint probability analysis is performed on (at least some of) the corresponding word sequences (as recognized by one or more recognizers) and associated acoustic data. For example, a generative probabilistic model or a maximum entropy model may be used in the analysis. The second utterance may be a repetition of the first utterance using the exact words, or another structural transformation thereof relative to the first utterance, such as an extension that adds one or more words, a truncation that removes one or more words, or a whole or partial spelling of one or more words. | 03-25-2010 |
20100125458 | METHOD AND APPARATUS FOR ERROR CORRECTION IN SPEECH RECOGNITION APPLICATIONS - In one embodiment, the present invention is a method and apparatus for error correction in speech recognition applications. In one embodiment, a method for recognizing user speech includes receiving a first utterance from the user, receiving a subsequent utterance from the user, and combining acoustic evidence from the first utterance with acoustic evidence from the subsequent utterance in order to recognize the first utterance. It is assumed that, if the first utterance has been incorrectly recognized on a first attempt, the user will repeat the first utterance (or at least the incorrectly recognized portion of the first utterance) in the subsequent utterance. | 05-20-2010 |
20100185447 | Markup language-based selection and utilization of recognizers for utterance processing - Embodiments are provided for selecting and utilizing multiple recognizers to process an utterance based on a markup language document. The markup language document and an utterance are received in a computing device. One or more recognizers are selected from among the multiple recognizers for returning a results set for the utterance based on markup language in the markup language document. The results set is received from the one or more selected recognizers in a format determined by a processing method specified in the markup language document. An event is then executed on the computing device in response to receiving the results set. | 07-22-2010 |
20100191531 | QUANTIZING FEATURE VECTORS IN DECISION-MAKING APPLICATIONS - A system, method and computer program product for classification of an analog electrical signal using statistical models of training data. A technique is described to quantize the analog electrical signal in a manner which maximizes the compression of the signal while simultaneously minimizing the diminution in the ability to classify the compressed signal. These goals are achieved by utilizing a quantizer designed to minimize the loss in a power of the log-likelihood ratio. A further technique is described to enhance the quantization process by optimally allocating a number of bits for each dimension of the quantized feature vector subject to a maximum number of bits available across all dimensions. | 07-29-2010 |
20100292989 | SYMBOL INSERTION APPARATUS AND SYMBOL INSERTION METHOD - Enables symbol insertion evaluation in consideration of a difference in speaking style features between speakers. For a word sequence transcribing voice information, the symbol insertion likelihood calculation means | 11-18-2010 |
20100305948 | Phoneme Model for Speech Recognition - A sub-phoneme model given acoustic data which corresponds to a phoneme. The acoustic data is generated by sampling an analog speech signal producing a sampled speech signal. The sampled speech signal is windowed and transformed into the frequency domain producing Mel frequency cepstral coefficients of the phoneme. The sub-phoneme model is used in a speech recognition system. The acoustic data of the phoneme is divided into either two or three sub-phonemes. A parameterized model of the sub-phonemes is built, where the model includes Gaussian parameters based on Gaussian mixtures and a length dependency according to a Poisson distribution. A probability score is calculated while adjusting the length dependency of the Poisson distribution. The probability score is a likelihood that the parameterized model represents the phoneme. The phoneme is subsequently recognized using the parameterized model. | 12-02-2010 |
20100318358 | RECOGNIZER WEIGHT LEARNING DEVICE, SPEECH RECOGNIZING DEVICE, AND SYSTEM - A speech recognition apparatus ( | 12-16-2010 |
20100324901 | SPEECH RECOGNITION SYSTEM - Various methods and apparatus are described for a speech recognition system. In an embodiment, the statistical language model (SLM) provides probability estimates of how linguistically likely a sequence of linguistic items is to occur in that sequence based on the number of times the sequence of linguistic items occurs in text and phrases in general use. The speech recognition decoder module requests a correction module for one or more corrected probability estimates P′(z|xy) of how likely a linguistic item z follows a given sequence of linguistic items x followed by y, where (x, y, and z) are three variable linguistic items supplied from the decoder module. The correction module is trained to linguistics of a specific domain, and is located in between the decoder module and the SLM in order to adapt the probability estimates supplied by the SLM to the specific domain when those probability estimates from the SLM significantly disagree with the linguistic probabilities in that domain. | 12-23-2010 |
20110046953 | METHOD OF RECOGNIZING SPEECH - A method for recognizing speech involves reciting, into a speech recognition system, an utterance including a numeric sequence that contains a digit string including a plurality of tokens and detecting a co-articulation problem related to at least two potentially co-articulated tokens in the digit string. The numeric sequence may be identified using i) a dynamically generated possible numeric sequence that potentially corresponds with the numeric sequence, and/or ii) at least one supplemental acoustic model. Also disclosed herein is a system for accomplishing the same. | 02-24-2011 |
20110099013 | SYSTEM AND METHOD FOR IMPROVING SPEECH RECOGNITION ACCURACY USING TEXTUAL CONTEXT - Disclosed herein are systems, methods, and computer-readable storage media for improving speech recognition accuracy using textual context. The method includes retrieving a recorded utterance, capturing text from a device display associated with the spoken dialog and viewed by one party to the recorded utterance, and identifying words in the captured text that are relevant to the recorded utterance. The method further includes adding the identified words to a dynamic language model, and recognizing the recorded utterance using the dynamic language model. The recorded utterance can be a spoken dialog. A time stamp can be assigned to each identified word. The method can include adding identified words to and/or removing identified words from the dynamic language model based on their respective time stamps. A screen scraper can capture text from the device display associated with the recorded utterance. The device display can contain customer service data. | 04-28-2011 |
20110137653 | SYSTEM AND METHOD FOR RESTRICTING LARGE LANGUAGE MODELS - Disclosed herein are systems, methods, and computer-readable storage media for performing speech recognition based on a masked language model. A system configured to practice the method receives a masked language model including a plurality of words, wherein a bit mask identifies whether each of the plurality of words is allowed or disallowed with regard to an adaptation subset, receives input speech, generates a speech recognition lattice based on the received input speech using the masked language model, removes from the generated lattice words identified as disallowed by the bit mask for the adaptation subset, and recognizes the received speech based on the lattice. Alternatively during the generation step, the system can only add words indicated as allowed by the bit mask. The bit mask can be separate from or incorporated as part of the masked language model. The system can dynamically update the adaptation subset and bit mask. | 06-09-2011 |
20120095766 | SPEECH RECOGNITION APPARATUS AND METHOD - A speech recognition apparatus is provided. The speech recognition apparatus includes a primary speech recognition unit configured to perform speech recognition on input speech and thus to generate word lattice information, a word string generation unit configured to generate one or more word strings based on the word lattice information, a language model score calculation unit configured to calculate bidirectional language model scores of the generated word strings selectively using forward and backward language models for each of words in each of the generated word strings, and a sentence output unit configured to output one or more of the generated word strings with high scores as results of the speech recognition of the input speech based on the calculated bidirectional language model scores. | 04-19-2012 |
20120221337 | METHOD AND APPARATUS FOR PREDICTING WORD ACCURACY IN AUTOMATIC SPEECH RECOGNITION SYSTEMS - The invention comprises a method and apparatus for predicting word accuracy. Specifically, the method comprises obtaining an utterance in speech data where the utterance comprises an actual word string, processing the utterance for generating an interpretation of the actual word string, processing the utterance to identify at least one utterance frame, and predicting a word accuracy associated with the interpretation according to at least one stationary signal-to-noise ratio and at least one non-stationary signal-to-noise ratio, wherein the at least one stationary signal-to-noise ratio and the at least one non-stationary signal-to-noise ratio are determined according to a frame energy associated with each of the at least one utterance frame. | 08-30-2012 |
20130238336 | RECOGNIZING SPEECH IN MULTIPLE LANGUAGES - Speech recognition systems may perform the following operations: receiving audio; recognizing the audio using language models for different languages to produce recognition candidates for the audio, where the recognition candidates are associated with corresponding recognition scores; identifying a candidate language for the audio; selecting a recognition candidate based on the recognition scores and the candidate language; and outputting data corresponding to the selected recognition candidate as a recognized version of the audio. | 09-12-2013 |
20130262117 | SPOKEN DIALOG SYSTEM USING PROMINENCE - The invention presents a method for analyzing speech in a spoken dialog system, comprising the steps of: accepting an utterance by at least one means for accepting acoustical signals, in particular a microphone, analyzing the utterance and obtaining prosodic cues from the utterance using at least one processing engine, wherein the utterance is evaluated based on the prosodic cues to determine a prominence of parts of the utterance, and wherein the utterance is analyzed to detect at least one marker feature, e.g. a negative statement, indicative of the utterance containing at least one part to replace at least one part in a previous utterance, the part to be replaced in the previous utterance being determined based on the prominence determined for the parts of the previous utterance and the replacement parts being determined based on the prominence of the parts in the utterance, and wherein the previous utterance is evaluated with the replacement part(s). | 10-03-2013 |
20130297313 | ACOUSTIC MODEL ADAPTATION USING GEOGRAPHIC INFORMATION - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for enhancing speech recognition accuracy. In one aspect, a method includes receiving an audio signal that corresponds to an utterance recorded by a mobile device, determining a geographic location associated with the mobile device, adapting one or more acoustic models for the geographic location, and performing speech recognition on the audio signal using the one or more acoustic models that are adapted for the geographic location. | 11-07-2013 |
20140025379 | Method and System for Real-Time Keyword Spotting for Speech Analytics - A system and method are presented for real-time speech analytics in the speech analytics field. Real time audio is fed along with a keyword model, into a recognition engine. The recognition engine computes the probability of the audio stream data matching keywords in the keyword model. The probability is compared to a threshold where the system determines if the probability is indicative of whether or not the keyword has been spotted. Empirical metrics are computed and any false alarms are identified and rejected. The keyword may be reported as found when it is deemed not to be a false alarm and passes the threshold for detection. | 01-23-2014 |
20140257814 | Posterior-Based Feature with Partial Distance Elimination for Speech Recognition - A high-dimensional posterior-based feature with partial distance elimination may be utilized for speech recognition. The log likelihood values of a large number of Gaussians are needed to generate the high-dimensional posterior feature. Gaussians with very small log likelihoods are associated with zero posterior values. Log likelihoods for Gaussians for a speech frame may be evaluated with a partial distance elimination method. If the partial distance of a Gaussian is already too small, the Gaussian will have a zero posterior value. The partial distance may be calculated by sequentially adding individual dimensions in a group of dimensions. The partial distance elimination occurs when less than all of the dimensions in the group are sequentially added. | 09-11-2014 |
20150025890 | MULTI-LEVEL SPEECH RECOGNITION - A method and device for recognizing an utterance. The method includes transmitting context data associated with a first device to a second device. A first speech recognition model is received from the second device. The first speech recognition model is a subset of a second speech recognition model present at the second device. The first speech recognition model is based on the context data. It is determined whether the utterance can be recognized at the first device based on the first speech recognition model. If the utterance cannot be recognized at the first device, then at least a portion of the utterance is sent to the second device. If the utterance can be recognized at the first device, then an action associated with the recognized utterance is performed. | 01-22-2015 |
20150095032 | Keyword Detection For Speech Recognition - This application discloses a method of recognizing a keyword in a speech that includes a sequence of audio frames further including a current frame and a subsequent frame. A candidate keyword is determined for the current frame using a decoding network that includes keywords and filler words of multiple languages, and used to determine a confidence score for the audio frame sequence. A word option is also determined for the subsequent frame based on the decoding network, and when the candidate keyword and the word option are associated with two distinct types of languages, the confidence score of the audio frame sequence is updated at least based on a penalty factor associated with the two distinct types of languages. The audio frame sequence is then determined to include both the candidate keyword and the word option by evaluating the updated confidence score according to a keyword determination criterion. | 04-02-2015 |
20150379988 | SYSTEM AND METHODS TO CREATE AND DETERMINE WHEN TO USE A MINIMAL USER SPECIFIC LANGUAGE MODEL - The technology of the present application provides apparatuses and methods that may be used to generate the smallest language model for a continuous speech recognition engine that covers a given speaker's speech patterns. The apparatuses and methods start with a generic language model that is an approximation to the given speaker's speech patterns. The given speaker generates corrected transcripts that allow for the generation of a user specific language model. Once the user specific language model is sufficiently robust, the continuous speech recognition system may replace the generic language model with the user specific language model. | 12-31-2015 |
20160027441 | SPEECH RECOGNITION SYSTEM, SPEECH RECOGNIZING DEVICE AND METHOD FOR SPEECH RECOGNITION - A speech recognition system is to be used on a human subject. The speech recognition system includes an image capturing device, an oral cavity detecting device and a speech recognition device. The image capturing device captures images of lips of the subject during a speech of the subject. The oral cavity detecting device detects contact with a tongue of the subject and distance from the tongue of the subject, and accordingly generates a contact signal and a distance signal. The speech recognition device processes the images of the lips and the contact and distance signals so as to obtain content of the speech of the subject. | 01-28-2016 |
20160063996 | KEYWORD SPOTTING SYSTEM FOR ACHIEVING LOW-LATENCY KEYWORD RECOGNITION BY USING MULTIPLE DYNAMIC PROGRAMMING TABLES RESET AT DIFFERENT FRAMES OF ACOUSTIC DATA INPUT AND RELATED KEYWORD SPOTTING METHOD - A keyword spotting system includes a decoder having a storage device and a decoding circuit. The storage device is used to store a log-likelihood table and a plurality of dynamic programming (DP) tables generated for recognition of a designated keyword. The decoding circuit is used to refer to features in one frame of an acoustic data input to calculate the log-likelihood table and refer to at least the log-likelihood table to adjust each of the DP tables when recognition of the designated keyword is not accepted yet, where the DP tables are reset by the decoding circuit at different frames of the acoustic data input, respectively. | 03-03-2016 |
20160379627 | METHOD AND APPARATUS FOR EXEMPLARY SEGMENT CLASSIFICATION - Method and apparatus for segmenting speech by detecting the pauses between the words and/or phrases, and for determining whether a particular time interval contains speech or non-speech, such as a pause. | 12-29-2016 |
20180025731 | Cascading Specialized Recognition Engines Based on a Recognition Policy | 01-25-2018 |