Entries
Document | Title | Date |
20080201144 | METHOD OF EMOTION RECOGNITION - A method is disclosed in the present invention for recognizing emotion by setting different weights for at least two kinds of unidentified information, such as image and audio information, based on their respective recognition reliability. The weights are determined by the distance between the test data and the hyperplane and the standard deviation of the training data, normalized by the mean distance between the training data and the hyperplane, representing the classification reliability of the different kinds of information. When the at least two kinds of unidentified information are classified differently by the hyperplane, the method recognizes the emotion according to the information having the higher weight and corrects the wrong classification result of the other information, so as to raise the accuracy of emotion recognition. Meanwhile, the present invention also provides a learning step with higher learning speed through an iterative algorithm. The learning step adjusts the hyperplane instantaneously so as to increase the capability of the hyperplane to identify the emotion in unidentified information accurately. In addition, a Gaussian kernel function for space transformation is provided in the learning step so that stable accuracy can be maintained. | 08-21-2008 |
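The reliability-weighted fusion this abstract describes can be sketched as follows. This is an illustrative reading only, not the patented implementation: each modality's weight is taken as its absolute distance to the classifier hyperplane normalized by the mean training-data distance, and when the two modalities disagree, the higher-weight modality's label wins. All function names and numeric values are assumed.

```python
# Hypothetical sketch of reliability-weighted fusion of two modality
# classifiers (e.g. image and audio), per the abstract's weighting idea.

def normalized_weight(distance, mean_train_distance):
    """Classification reliability: |distance to hyperplane| normalized by
    the mean distance of the training data to the hyperplane."""
    return abs(distance) / mean_train_distance

def fuse(label_a, dist_a, mean_a, label_b, dist_b, mean_b):
    """Return the fused emotion label from two modality classifiers."""
    if label_a == label_b:
        return label_a
    # Modalities disagree: trust the one with the higher normalized weight.
    w_a = normalized_weight(dist_a, mean_a)
    w_b = normalized_weight(dist_b, mean_b)
    return label_a if w_a >= w_b else label_b

# Audio is far from its hyperplane (reliable); image is near it (unreliable):
print(fuse("happy", 2.4, 1.2, "sad", 0.3, 1.5))   # audio label wins
```

The normalization makes the two distances comparable even though the modalities live in different feature spaces.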
20080235015 | SYSTEM AND METHOD FOR LIKELIHOOD COMPUTATION IN MULTI-STREAM HMM BASED SPEECH RECOGNITION - A system and method for speech recognition includes determining active Gaussians related to a first feature stream and a second feature stream by labeling at least one of the first and second streams, and determining active Gaussians co-occurring in the first stream and the second stream based upon joint probability. A number of Gaussians computed is reduced based upon Gaussians already computed for the first stream and a number of Gaussians co-occurring in the second stream. Speech is decoded based on the Gaussians computed for the first and second streams. | 09-25-2008 |
20080270129 | Method and System for Automatically Providing Linguistic Formulations that are Outside a Recognition Domain of an Automatic Speech Recognition System - A method for automatically providing a hypothesis of a linguistic formulation that is uttered by users of a voice service based on an automatic speech recognition system and that is outside a recognition domain of the automatic speech recognition system. The method includes providing a constrained and an unconstrained speech recognition from an input speech signal, identifying a part of the constrained speech recognition outside the recognition domain, identifying a part of the unconstrained speech recognition corresponding to the identified part of the constrained speech recognition, and providing the linguistic formulation hypothesis based on the identified part of the unconstrained speech recognition. | 10-30-2008 |
20080270130 | SYSTEMS AND METHODS FOR REDUCING ANNOTATION TIME - Systems and methods for annotating speech data. The present invention reduces the time required to annotate speech data by selecting utterances for annotation that will be of greatest benefit. A selection module uses speech models, including speech recognition models and spoken language understanding models, to identify utterances that should be annotated based on criteria such as confidence scores generated by the models. These utterances are placed in an annotation list along with a type of annotation to be performed for the utterances and an order in which the annotation should proceed. The utterances in the annotation list can be annotated for speech recognition purposes, spoken language understanding purposes, labeling purposes, etc. The selection module can also select utterances for annotation based on previously annotated speech data and deficiencies in the various models. | 10-30-2008 |
20080300875 | Efficient Speech Recognition with Cluster Methods - A speech recognition method and system, the method comprising the steps of providing a speech model, said speech model including Gaussians for at least a portion of its states; clustering the Gaussians of said speech model to give N clusters of Gaussians, wherein N is an integer; and utilizing said Gaussians in recognizing an utterance. | 12-04-2008 |
20080306738 | VOICE PROCESSING METHODS AND SYSTEMS - Voice processing methods and systems are provided. An utterance is received. The utterance is compared with teaching materials according to at least one matching algorithm to obtain a plurality of matching values corresponding to a plurality of voice units of the utterance. Respective voice units are scored in at least one first scoring item according to the matching values and a personified voice scoring algorithm. The personified voice scoring algorithm is generated according to training utterances corresponding to at least one training sentence in a phonetic-balanced sentence set of a plurality of learners and at least one real teacher, and scores corresponding to the respective voice units of the training utterances of the learners in the first scoring item provided by the real teacher. | 12-11-2008 |
20090024390 | Multi-Class Constrained Maximum Likelihood Linear Regression - A method of speech recognition converts an unknown speech input into a stream of representative features. The feature stream is transformed based on speaker dependent adaptation of multi-class feature models. Then automatic speech recognition is used to compare the transformed feature stream to multi-class speaker independent acoustic models to generate an output representative of the unknown speech input. | 01-22-2009 |
20090030683 | SYSTEM AND METHOD FOR TRACKING DIALOGUE STATES USING PARTICLE FILTERS - Disclosed are methods, systems, and computer-readable media for tracking dialog states in a spoken dialog system. The method comprises casting a plurality of dialog states, or particles, as a network describing the probability relationships between each of a plurality of variables, sampling a subset of the plurality of dialog states, or particles, in the network, for each sampled dialog state, or particle, projecting into the future, assigning a weight to each sampled particle, and normalizing the assigned weights to yield a new estimated distribution over each variable's values, wherein the distribution over the variables is used in a spoken dialog system. Also disclosed is a method of tuning performance of the methods, systems, and computer-readable media by adding or removing particles to/from the network. | 01-29-2009 |
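The per-step particle update this abstract lists (sample, project into the future, weight, normalize) can be sketched minimally. This is an illustrative toy, not the patented dialog tracker; the transition and likelihood functions and all values are assumed.

```python
import random

# Toy sketch of one particle-filter step over dialog states:
# sample particles, project each forward, weight by an observation
# likelihood, then normalize to get a distribution over state values.

def particle_filter_step(particles, transition, likelihood, n_samples, rng):
    sampled = rng.sample(particles, min(n_samples, len(particles)))
    projected = [transition(p) for p in sampled]           # project into the future
    weights = [likelihood(p) for p in projected]           # weight each particle
    total = sum(weights) or 1.0
    return [(p, w / total) for p, w in zip(projected, weights)]  # normalize

rng = random.Random(0)
# Dialog states as integers; identity dynamics and a likelihood peaked at 2.
posterior = particle_filter_step(
    particles=[0, 1, 2, 3],
    transition=lambda s: s,
    likelihood=lambda s: 1.0 if s == 2 else 0.1,
    n_samples=4,
    rng=rng,
)
print(posterior)
```

The abstract's tuning knob (adding or removing particles) corresponds to varying `n_samples`.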
20090030684 | USING SPEECH RECOGNITION RESULTS BASED ON AN UNSTRUCTURED LANGUAGE MODEL IN A MOBILE COMMUNICATION FACILITY APPLICATION - A method and system for entering information into a software application resident on a mobile communication facility is provided. The method and system may include recording speech presented by a user using a mobile communication facility resident capture facility, transmitting the recording through a wireless communication facility to a speech recognition facility, transmitting information relating to the software application to the speech recognition facility, generating results utilizing the speech recognition facility using an unstructured language model based at least in part on the information relating to the software application and the recording, transmitting the results to the mobile communications facility, loading the results into the software application and simultaneously displaying the results as a set of words and as a set of application results based on those words. | 01-29-2009 |
20090030685 | USING SPEECH RECOGNITION RESULTS BASED ON AN UNSTRUCTURED LANGUAGE MODEL WITH A NAVIGATION SYSTEM - Speech recorded by an audio capture facility of a navigation facility is processed by a speech recognition facility to generate results that are provided to the navigation facility. When information related to a navigation application running on the navigation facility is provided to the speech recognition facility, the generated results are based at least in part on the application-related information. The speech recognition facility uses an unstructured language model for generating results. The user of the navigation facility may optionally be allowed to edit the results being provided to the navigation facility. The speech recognition facility may also adapt speech recognition based on usage of the results. | 01-29-2009 |
20090048835 | FEATURE EXTRACTING APPARATUS, COMPUTER PROGRAM PRODUCT, AND FEATURE EXTRACTION METHOD - A feature extracting apparatus includes a spectrum calculator that calculates a logarithmic frequency spectrum including frequency components obtained from an input speech signal at regular intervals on a logarithmic frequency scale of a frame; a function calculator that calculates a cross-correlation function between a logarithmic frequency spectrum of a time and a logarithmic frequency spectrum of one or plural times included in a certain temporal width before and after the time, from a sequence of the logarithmic frequency spectra calculated at each time; and a feature extractor that extracts a set of the cross-correlation functions as a local and relative fundamental-frequency pattern feature at the frame. | 02-19-2009 |
20090063144 | SYSTEM AND METHOD FOR PROVIDING A COMPENSATED SPEECH RECOGNITION MODEL FOR SPEECH RECOGNITION - An automatic speech recognition (ASR) system and method is provided for controlling the recognition of speech utterances generated by an end user operating a communications device. The ASR system and method can be used with a communications device that is used in a communications network. The ASR system can be used for ASR of speech utterances input into a mobile device, to perform compensating techniques using at least one characteristic, and to update an ASR speech recognizer associated with the ASR system by determining and using a background noise value and a distortion value based on the features of the mobile device. The ASR system can be used to augment a limited data input capability of a mobile device, for example, caused by limited input devices physically located on the mobile device. | 03-05-2009 |
20090070110 | COMBINING RESULTS OF IMAGE RETRIEVAL PROCESSES - An MMR system for newspaper publishing comprises a plurality of mobile devices, an MMR gateway, an MMR matching unit and an MMR publisher. The MMR matching unit receives an image query from the MMR gateway and sends it to one or more of the recognition units to identify a result including a document, the page and the location on the page. The MMR matching unit also includes a result combiner coupled to each of the recognition units to receive recognition results. The result combiner produces a list of most likely results and associated confidence scores. This list of results is sent by the result combiner back to the MMR gateway for presentation on the mobile device. The result combiner uses the quality predictor as an input in deciding which results are best. The present invention also includes a number of novel methods including a method for generating the list of best results. | 03-12-2009 |
20090112585 | TIMING OF SPEECH RECOGNITION OVER LOSSY TRANSMISSION SYSTEMS - Recognizing a stream of speech received as speech vectors over a lossy communications link includes constructing for a speech recognizer a series of speech vectors from packets received over a lossy packetized transmission link, wherein some of the packets associated with each speech vector are lost or corrupted during transmission. Each constructed speech vector is multi-dimensional and includes associated features. After waiting for a predetermined time, speech vectors are generated and potentially corrupted features within the speech vector are indicated to the speech recognizer when present. Speech recognition is attempted at the speech recognizer on the speech vectors when corrupted features are present. This recognition may be based only on certain or valid features within each speech vector. Retransmission of a missing or corrupted packet is requested when corrupted values are indicated by the indicating step and when the attempted recognition step fails. | 04-30-2009 |
20090125306 | METHOD, SYSTEM AND COMPUTER PROGRAM FOR ENHANCED SPEECH RECOGNITION OF DIGITS INPUT STRINGS - The present invention proposes a method, system and computer program for speech recognition. According to one embodiment, a method is provided wherein, for an expected input string divided into a plurality of expected string segments, a speech segment is received for each expected string segment. Speech recognition is then performed separately on each said speech segment via the generation, for each said speech segment, of a segment n-best list comprising n highest confidence score results. A global n-best list is then generated corresponding to the expected input string utilizing the segment n-best lists and a final global speech recognition result corresponding to said expected input string is determined via the pruning of the results of the global n-best list utilizing a pruning criterion. | 05-14-2009 |
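The segmented n-best scheme this abstract describes can be sketched concretely: each expected string segment gets its own n-best list of (result, confidence) pairs, a global list is formed by combining one candidate per segment, and low-scoring combinations are pruned. This is an illustrative reading under assumed scoring (product of per-segment confidences), not the patented pruning criterion.

```python
import itertools

# Sketch: merge per-segment n-best lists into a pruned global n-best list.

def global_n_best(segment_n_bests, prune_threshold):
    """segment_n_bests: one list of (result, confidence) pairs per segment."""
    candidates = []
    for combo in itertools.product(*segment_n_bests):
        text = "".join(result for result, _ in combo)
        score = 1.0
        for _, conf in combo:              # combined confidence (assumed: product)
            score *= conf
        if score >= prune_threshold:       # pruning criterion
            candidates.append((text, score))
    candidates.sort(key=lambda c: -c[1])
    return candidates

# Two digit segments of an expected input string, each with a 2-best list:
seg1 = [("4", 0.9), ("1", 0.1)]
seg2 = [("7", 0.8), ("2", 0.2)]
print(global_n_best([seg1, seg2], prune_threshold=0.1))
```

Pruning at the global level lets a weak segment hypothesis survive only when its partners are confident enough to compensate.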
20090210226 | Method and Apparatus for Voice Searching for Stored Content Using Uniterm Discovery - A method, system and communication device for enabling voice-to-voice searching and ordered content retrieval via audio tags assigned to individual content, which tags generate uniterms that are matched against components of a voice query. The method includes storing content and tagging at least one of the content with an audio tag. The method further includes receiving a voice query to retrieve content stored on the device. When the voice query is received, the method completes a voice-to-voice search utilizing uniterms of the audio tag, scored against a phoneme lattice model generated from the voice query, to identify matching terms within the audio tags and corresponding stored content. The retrieved content(s) associated with the identified audio tags having uniterms that score within the phoneme lattice model are outputted in an order corresponding to an order in which the uniterms are structured within the voice query. | 08-20-2009 |
20090265170 | EMOTION DETECTING METHOD, EMOTION DETECTING APPARATUS, EMOTION DETECTING PROGRAM THAT IMPLEMENTS THE SAME METHOD, AND STORAGE MEDIUM THAT STORES THE SAME PROGRAM - An audio feature is extracted from audio signal data for each analysis frame and stored in a storage part. Then, the audio feature is read from the storage part, and an emotional state probability of the audio feature corresponding to an emotional state is calculated using one or more statistical models constructed based on previously input learning audio signal data. Then, based on the calculated emotional state probability, the emotional state of a section including the analysis frame is determined. | 10-22-2009 |
20090276216 | METHOD AND SYSTEM FOR ROBUST PATTERN MATCHING IN CONTINUOUS SPEECH - A method for speech recognition, the method includes: extracting time-frequency speech features from a series of reference speech elements in a first series of sampling windows; aligning reference speech elements that are not of equal time span duration; constructing a common subspace for the aligned speech features; determining a first set of coefficient vectors; extracting a time-frequency feature image from a test speech stream spanned by a second sampling window; approximating the extracted image in the common subspace for the aligned extracted time-frequency speech features with a second coefficient vector; computing a similarity measure between the first and the second coefficient vector; determining if the similarity measure is below a predefined threshold; and wherein a match between the reference speech elements and a portion of the test speech stream is made in response to a similarity measure below a predefined threshold. | 11-05-2009 |
20090313015 | MULTIPLE AUDIO/VIDEO DATA STREAM SIMULATION METHOD AND SYSTEM - A multiple audio/video data stream simulation method and system. A computing system receives first audio and/or video data streams. The first audio and/or video data streams include data associated with a first person and a second person. The computing system monitors the first audio and/or video data streams. The computing system identifies emotional attributes comprised by the first audio and/or video data streams. The computing system generates second audio and/or video data streams associated with the first audio and/or video data streams. The second audio and/or video data streams include the first audio and/or video data streams data without the emotional attributes. The computing system stores the second audio and/or video data streams. | 12-17-2009 |
20090319268 | METHOD AND APPARATUS FOR MEASURING THE INTELLIGIBILITY OF AN AUDIO ANNOUNCEMENT DEVICE - A method and an apparatus for measuring the intelligibility level of an audio announcement device. | 12-24-2009 |
20100017207 | METHOD AND DEVICE FOR ASCERTAINING FEATURE VECTORS FROM A SIGNAL - A signal is used to form intermediate feature vectors which are subjected to high-pass filtering. The high-pass-filtered intermediate feature vectors have a respective prescribed addition feature vector added to them. | 01-21-2010 |
20100063816 | Method and System for Parsing of a Speech Signal - A method for processing an analog speech signal for speech recognition. The analog speech signal is sampled to produce a sampled speech signal. The sampled speech signal is framed into multiple frames of the sampled speech signal. The absolute value of the sampled speech signal is integrated within the frames and respective integrated-absolute values of the frames are determined. Based on the integrated-absolute values, the sampled speech signal is cut into segments of non-uniform duration. The segments are not as yet identified as parts of speech prior to and during the cutting. | 03-11-2010 |
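The framing-and-integration step this abstract describes is simple enough to sketch end to end: frame the samples, integrate |x| within each frame, and cut the signal at frames whose integrated value falls below a threshold, yielding segments of non-uniform duration. The frame length, threshold, and signal values below are assumed for illustration.

```python
# Sketch of energy-based cutting: integrate |x| per frame, cut at
# low-energy frames, producing segments of non-uniform duration.

def frame_energy(samples, frame_len):
    """Integrated absolute value of each frame of the sampled signal."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    return [sum(abs(x) for x in f) for f in frames]

def cut_segments(samples, frame_len, threshold):
    energies = frame_energy(samples, frame_len)
    segments, current = [], []
    for i, e in enumerate(energies):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        if e < threshold:                  # low-energy frame: segment boundary
            if current:
                segments.append(current)
                current = []
        else:
            current.extend(frame)
    if current:
        segments.append(current)
    return segments

signal = [5, -6, 4, 0, 0, 0, 7, -8, 3]    # speech, silence, speech
print(cut_segments(signal, frame_len=3, threshold=1))
```

Note the segments are delimited purely by the integrated-absolute values, matching the abstract's point that no speech identification happens during the cutting.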
20100161328 | Utterance Processing For Network-Based Speech Recognition Utilizing A Client-Side Cache - Embodiments are provided for utilizing a client-side cache for utterance processing to facilitate network based speech recognition. An utterance comprising a query is received in a client computing device. The query is sent from the client to a network server for results processing. The utterance is processed to determine a speech profile. A cache lookup is performed based on the speech profile to determine whether results data for the query is stored in the cache. If the results data is stored in the cache, then a query is sent to cancel the results processing on the network server and the cached results data is displayed on the client computing device. | 06-24-2010 |
20100198597 | DYNAMIC PRUNING FOR AUTOMATIC SPEECH RECOGNITION - Methods, speech recognition systems, and computer readable media are provided that recognize speech using dynamic pruning techniques. A search network is expanded based on a frame from a speech signal, a best hypothesis is determined in the search network, a default beam threshold is modified, and the search network is pruned using the modified beam threshold. The search network may be further pruned based on the search depth of the best hypothesis and/or the average number of frames per state for a search path. | 08-05-2010 |
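The beam-pruning baseline that this abstract modifies can be sketched in a few lines: hypotheses whose score falls more than a beam width below the best hypothesis are discarded. The dynamic variant described above would then tighten or widen the beam per frame; the scores and beam value below are assumed for illustration.

```python
# Minimal beam-pruning sketch (the static baseline, not the patented
# dynamic modification): keep hypotheses within `beam` of the best score.

def prune(hypotheses, beam):
    """hypotheses: dict mapping hypothesis text -> log score."""
    best = max(hypotheses.values())
    return {h: s for h, s in hypotheses.items() if s >= best - beam}

hyps = {"the cat": -10.0, "the hat": -12.5, "a cat": -25.0}
print(prune(hyps, beam=5.0))   # "a cat" falls outside the beam
```

A dynamic scheme would recompute `beam` from quantities like the best hypothesis's search depth or the average frames per state, as the abstract suggests.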
20100217592 | Dialog Prediction Using Lexical and Semantic Features - The present invention provides a method for identifying a turn, such as a sentence or phrase, for addition to a platform dialog comprising a plurality of turns. Lexical features of each of a set of candidate turns relative to one or more turns in the platform dialog are determined. Semantic features associated with each candidate turn and associated with the platform dialog are determined to identify one or more topics associated with each candidate turn and with the platform dialog. Lexical features of each candidate turn are compared to lexical features of the platform dialog and semantic features associated with each candidate turn are compared to semantic features of the platform dialog to rank the candidate turns based on similarity of lexical features and semantic features of each candidate turn to lexical features and semantic features of the platform dialog. | 08-26-2010 |
20100268535 | PRONUNCIATION VARIATION RULE EXTRACTION APPARATUS, PRONUNCIATION VARIATION RULE EXTRACTION METHOD, AND PRONUNCIATION VARIATION RULE EXTRACTION PROGRAM - A problem to be solved is to robustly detect a pronunciation variation example and acquire a pronunciation variation rule having a high generalization property, with less effort. The problem can be solved by a pronunciation variation rule extraction apparatus including a speech data storage unit, a base form pronunciation storage unit, a sub word language model generation unit, a speech recognition unit, and a difference extraction unit. The speech data storage unit stores speech data. The base form pronunciation storage unit stores base form pronunciation data representing base form pronunciation of the speech data. The sub word language model generation unit generates a sub word language model from the base form pronunciation data. The speech recognition unit recognizes the speech data by using the sub word language model. The difference extraction unit extracts a difference between a recognition result outputted from the speech recognition unit and the base form pronunciation data by comparing the recognition result and the base form pronunciation data. | 10-21-2010 |
20100280827 | NOISE ROBUST SPEECH CLASSIFIER ENSEMBLE - Embodiments for implementing a speech recognition system that includes a speech classifier ensemble are disclosed. In accordance with one embodiment, the speech recognition system includes a classifier ensemble to convert feature vectors that represent a speech vector into log probability sets. The classifier ensemble includes a plurality of classifiers. The speech recognition system includes a decoder ensemble to transform the log probability sets into output symbol sequences. The speech recognition system further includes a query component to retrieve one or more speech utterances from a speech database using the output symbol sequences. | 11-04-2010 |
20100332227 | AUTOMATIC DISCLOSURE DETECTION - A method of detecting pre-determined phrases to determine compliance quality is provided. The method includes determining whether at least one of an event or a precursor event has occurred based on a comparison between pre-determined phrases and a communication between a sender and a recipient in a communications network, and rating the recipient based on the presence of the pre-determined phrases associated with the event or the presence of the pre-determined phrases associated with the precursor event in the communication. | 12-30-2010 |
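The comparison-and-rating flow this abstract describes can be sketched as a toy: check a communication for pre-determined phrases tied to an event or a precursor event, then rate the recipient on their presence. The rating scheme and all phrases below are assumed, not taken from the patent.

```python
# Toy sketch of disclosure detection: match pre-determined phrases
# against a communication and rate the recipient on compliance.

def detect(communication, event_phrases, precursor_phrases):
    text = communication.lower()
    event = any(p in text for p in event_phrases)
    precursor = any(p in text for p in precursor_phrases)
    # Assumed rating scheme: full credit for the event disclosure,
    # partial credit for the precursor alone, none otherwise.
    rating = 1.0 if event else (0.5 if precursor else 0.0)
    return event, precursor, rating

call = "This call may be recorded. Rates are subject to change."
print(detect(call, ["may be recorded"], ["rates are subject"]))
```

In a deployed system the matching would sit on top of speech-to-text output rather than raw strings.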
20110046951 | SYSTEM AND METHOD FOR BUILDING OPTIMAL STATE-DEPENDENT STATISTICAL UTTERANCE CLASSIFIERS IN SPOKEN DIALOG SYSTEMS - A system and a method to generate statistical utterance classifiers optimized for the individual states of a spoken dialog system is disclosed. The system and method make use of large databases of transcribed and annotated utterances from calls collected in a dialog system in production and log data reporting the association between the state of the system at the moment when the utterances were recorded and the utterance. From the system state, being a vector of multiple system variables, subsets of these variables, certain variable ranges, quantized variable values, etc. can be extracted to produce a multitude of distinct utterance subsets matching every possible system state. For each of these subset and variable combinations, statistical classifiers can be trained, tuned, and tested, and the classifiers can be stored together with the performance results and the state subset and variable combination. Once the set of classifiers and stored results have been put into a production system, for a given system state, the classifiers resulting in optimum performance can be selected from the result list and used to perform utterance classification. | 02-24-2011 |
20110066433 | SYSTEM AND METHOD FOR PERSONALIZATION OF ACOUSTIC MODELS FOR AUTOMATIC SPEECH RECOGNITION - Disclosed herein are methods, systems, and computer-readable storage media for automatic speech recognition. The method includes selecting a speaker independent model, and selecting a quantity of speaker dependent models, the quantity of speaker dependent models being based on available computing resources, the selected models including the speaker independent model and the quantity of speaker dependent models. The method also includes recognizing an utterance using each of the selected models in parallel, and selecting a dominant speech model from the selected models based on recognition accuracy using the group of selected models. The system includes a processor and modules configured to control the processor to perform the method. The computer-readable storage medium includes instructions for causing a computing device to perform the steps of the method. | 03-17-2011 |
20110082695 | METHODS, ELECTRONIC DEVICES, AND COMPUTER PROGRAM PRODUCTS FOR GENERATING AN INDICIUM THAT REPRESENTS A PREVAILING MOOD ASSOCIATED WITH A PHONE CALL - An electronic device includes a call analysis module that is configured to analyze characteristics of a phone call and to generate an indicium that represents a prevailing mood associated with the phone call based on the analyzed characteristics. | 04-07-2011 |
20110137650 | SYSTEM AND METHOD FOR TRAINING ADAPTATION-SPECIFIC ACOUSTIC MODELS FOR AUTOMATIC SPEECH RECOGNITION - Disclosed herein are systems, methods, and computer-readable storage media for training adaptation-specific acoustic models. A system practicing the method receives speech and generates a full size model and a reduced size model, the reduced size model starting with a single distribution for each speech sound in the received speech. The system finds speech segment boundaries in the speech using the full size model and adapts features of the speech data using the reduced size model based on the speech segment boundaries and an overall centroid for each speech sound. The system then recognizes speech using the adapted features of the speech. The model can be a Hidden Markov Model (HMM). The reduced size model can also be of a reduced complexity, such as having fewer mixture components than a model of full complexity. Adapting features of speech can include moving the features closer to an overall feature distribution center. | 06-09-2011 |
20110191104 | SYSTEM AND METHOD FOR MEASURING SPEECH CHARACTERISTICS - A method for measuring a disparity between two speech samples is disclosed that may include determining a speech granularity level at which to compare the rhythm of a student speech sample and a reference speech sample; determining a duration disparity between a first speech unit and a second, non-adjacent speech unit in the student speech sample; determining a duration disparity between a first speech unit and a second, non-adjacent speech unit in the reference speech sample; and calculating the difference between the student speech-unit duration disparity and the reference speech-unit duration disparity. | 08-04-2011 |
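The rhythm measure this abstract enumerates reduces to a short computation: take the duration gap between two non-adjacent speech units in the student sample, take the same gap in the reference sample, and compare. The syllable durations below are hypothetical values in milliseconds.

```python
# Sketch of the duration-disparity rhythm measure described above.

def duration_disparity(durations, i, j):
    """Duration difference between speech units i and j (non-adjacent)."""
    return durations[i] - durations[j]

def rhythm_difference(student, reference, i, j):
    """Difference between the student's and the reference's disparities."""
    return abs(duration_disparity(student, i, j)
               - duration_disparity(reference, i, j))

# Syllable durations in milliseconds (hypothetical values):
student = [180, 90, 220, 140]
reference = [150, 95, 200, 160]
print(rhythm_difference(student, reference, 0, 2))
```

The "granularity level" in the abstract would determine whether the duration lists hold phones, syllables, or words.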
20110213614 | METHOD OF ANALYSING AN AUDIO SIGNAL - A method of analysing an audio signal is disclosed. A digital representation of an audio signal is received and a first output function is generated based on a response of a physiological model to the digital representation. At least one property of the first output function may be determined. One or more values are determined for use in analysing the audio signal, based on the determined property of the first output function. | 09-01-2011 |
20110224982 | AUTOMATIC SPEECH RECOGNITION BASED UPON INFORMATION RETRIEVAL METHODS - Described is a technology in which information retrieval (IR) techniques are used in a speech recognition (ASR) system. Acoustic units (e.g., phones, syllables, multi-phone units, words and/or phrases) are decoded, and features found from those acoustic units. The features are then used with IR techniques (e.g., TF-IDF based retrieval) to obtain a target output (a word or words). Also described is the use of IR techniques to provide a full large vocabulary continuous speech recognition (LVCSR) recognizer. | 09-15-2011 |
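The TF-IDF retrieval step this abstract names can be sketched on toy data: treat decoded acoustic units (here, phone symbols) as "terms", index each target word by TF-IDF over its phone sequence, and retrieve the word whose vector best matches the decoded units. The smoothed IDF formula and the two-word lexicon below are assumptions for illustration.

```python
import math
from collections import Counter

# Sketch of TF-IDF retrieval over decoded acoustic units.

def tf_idf_index(docs):
    """docs: word -> list of acoustic-unit terms. Returns word -> term weights."""
    n = len(docs)
    df = Counter(t for terms in docs.values() for t in set(terms))
    index = {}
    for word, terms in docs.items():
        tf = Counter(terms)
        # Smoothed IDF (an assumed variant): log((1 + n) / (1 + df)).
        index[word] = {t: tf[t] * math.log((1 + n) / (1 + df[t])) for t in tf}
    return index

def retrieve(index, query_terms):
    """Score each indexed word against the decoded units; return the best."""
    q = Counter(query_terms)
    scores = {w: sum(wt * q[t] for t, wt in vec.items())
              for w, vec in index.items()}
    return max(scores, key=scores.get)

# Toy lexicon of phone sequences (assumed pronunciations):
index = tf_idf_index({"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]})
print(retrieve(index, ["k", "ae", "t"]))   # decoded units from the utterance
```

Shared phones like `k` and `ae` get near-zero IDF, so the discriminative phone (`t` vs. `p`) decides the match, which is exactly why IDF weighting helps here.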
20110231188 | SYSTEM AND METHOD FOR PROVIDING AN ACOUSTIC GRAMMAR TO DYNAMICALLY SHARPEN SPEECH INTERPRETATION - The system and method described herein may provide an acoustic grammar to dynamically sharpen speech interpretation. In particular, the acoustic grammar may be used to map one or more phonemes identified in a user verbalization to one or more syllables or words, wherein the acoustic grammar may have one or more linking elements to reduce a search space associated with mapping the phonemes to the syllables or words. As such, the acoustic grammar may be used to generate one or more preliminary interpretations associated with the verbalization, wherein one or more post-processing techniques may then be used to sharpen accuracy associated with the preliminary interpretations. For example, a heuristic model may assign weights to the preliminary interpretations based on context, user profiles, or other knowledge and a probable interpretation may be identified based on confidence scores associated with one or more candidate interpretations generated with the heuristic model. | 09-22-2011 |
20110270610 | PARAMETER LEARNING IN A HIDDEN TRAJECTORY MODEL - Parameters for distributions of a hidden trajectory model, including means and variances, are estimated using an acoustic likelihood function for observation vectors as an objective function for optimization. The estimation uses only acoustic data and not any intermediate estimates of hidden dynamic variables. Gradient ascent methods can be developed for optimizing the acoustic likelihood function. | 11-03-2011 |
20120016672 | Systems and Methods for Assessment of Non-Native Speech Using Vowel Space Characteristics - Computer-implemented systems and methods are provided for assessing non-native speech proficiency. A non-native speech sample is processed to identify a plurality of vowel sound boundaries in the non-native speech sample. Portions of the non-native speech sample are analyzed within the vowel sound boundaries to extract vowel characteristics. The vowel characteristics are used to identify a plurality of vowel space metrics for the non-native speech sample, and the vowel space metrics are used to determine a non-native speech proficiency score for the non-native speech sample. | 01-19-2012 |
20120046945 | MULTIMODAL AGGREGATING UNIT - In a voice processing system, a multimodal request is received from a plurality of modality input devices, and the requested application is run to provide a user with the feedback of the multimodal request. In the voice processing system, a multimodal aggregating unit is provided which receives a multimodal input from a plurality of modality input devices, and provides an aggregated result to an application control based on the interpretation of the interaction ergonomics of the multimodal input within the temporal constraints of the multimodal input. Thus, the multimodal input from the user is recognized within a temporal window. Interpretation of the interaction ergonomics of the multimodal input include interpretation of interaction biometrics and interaction mechani-metrics, wherein the interaction input of at least one modality may be used to bring meaning to at least one other input of another modality. | 02-23-2012 |
20120072214 | Frame Erasure Concealment Technique for a Bitstream-Based Feature Extractor - A frame erasure concealment technique for a bitstream-based feature extractor in a speech recognition system particularly suited for use in a wireless communication system operates to “delete” each frame in which an erasure is declared. The deletions thus reduce the length of the observation sequence, but have been found to provide for sufficient speech recognition based on both single word and “string” tests of the deletion technique. | 03-22-2012 |
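The deletion-based concealment this abstract describes is directly expressible: frames in which an erasure is declared are simply dropped from the observation sequence before recognition, shortening the sequence rather than interpolating the missing features. The feature values below are assumed for illustration.

```python
# Sketch of frame-erasure concealment by deletion: drop erased frames,
# shortening the observation sequence fed to the recognizer.

def conceal_by_deletion(frames, erased):
    """frames: list of feature vectors; erased: parallel list of bools."""
    return [f for f, bad in zip(frames, erased) if not bad]

frames = [[0.1, 0.2], [0.0, 0.0], [0.3, 0.4]]
erased = [False, True, False]   # the middle frame arrived corrupted
print(conceal_by_deletion(frames, erased))
```

The appeal of deletion over repair is its simplicity: the recognizer sees only frames it can trust, at the cost of a shorter observation sequence.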
20120109649 | SPEECH DIALECT CLASSIFICATION FOR AUTOMATIC SPEECH RECOGNITION - Automatic speech recognition including receiving speech via a microphone, pre-processing the received speech to generate acoustic feature vectors, classifying dialect of the received speech, selecting at least one of an acoustic model or a lexicon specific to the classified dialect, decoding the acoustic feature vectors using a processor and at least one of the selected dialect-specific acoustic model or selected lexicon to produce a plurality of hypotheses for the received speech, and post-processing the plurality of hypotheses to identify one of the plurality of hypotheses as the received speech. | 05-03-2012 |
20120150539 | METHOD FOR ESTIMATING LANGUAGE MODEL WEIGHT AND SYSTEM FOR THE SAME - A method of the present invention may include receiving a speech feature vector converted from a speech signal; performing a first search by applying a first language model to the received speech feature vector and outputting a word lattice and a first acoustic score of the word lattice as a continuous speech recognition result; outputting a second acoustic score as a phoneme recognition result by applying an acoustic model to the speech feature vector; comparing the first acoustic score of the continuous speech recognition result with the second acoustic score of the phoneme recognition result; outputting a first language model weight when the first acoustic score of the continuous speech recognition result is better than the second acoustic score of the phoneme recognition result; and performing a second search by applying a second language model weight, which is the same as the output first language model weight, to the word lattice. | 06-14-2012 |
20120185251 | METHOD AND SYSTEM FOR CANDIDATE MATCHING - A method and system for candidate matching, such as used in match-making services, assesses narrative responses to measure candidate qualities. A candidate database includes self-assessment data and narrative data. Narrative data concerning a defined topic is analyzed to determine candidate qualities separate from topical information. Candidate qualities thus determined are included in candidate profiles and used to identify desirable candidates. | 07-19-2012 |
20120197641 | CONSONANT-SEGMENT DETECTION APPARATUS AND CONSONANT-SEGMENT DETECTION METHOD - A signal portion is extracted from an input signal for each frame having a specific duration to generate a per-frame input signal. The per-frame input signal in a time domain is converted into a per-frame input signal in a frequency domain, thereby generating a spectral pattern. Subband average energy is derived in each of subbands adjacent one another in the spectral pattern. The subband average energy is compared in at least one subband pair of a first subband and a second subband that is a higher frequency band than the first subband, the first and second subbands being consecutive subbands in the spectral pattern. It is determined that the per-frame input signal includes a consonant segment if the subband average energy of the second subband is higher than the subband average energy of the first subband. | 08-02-2012 |
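The subband-energy comparison in this abstract can be illustrated in a few lines: take the spectral pattern of a frame, average the energy in adjacent subbands, and flag a consonant segment when a higher subband carries more average energy than the lower subband next to it. The subband count, frame length, and sampling rate below are assumptions for the example, not parameters from the patent.

```python
# Illustrative sketch of consonant-segment detection by comparing average
# energy in consecutive subbands of a frame's spectral pattern.
import numpy as np

def is_consonant_frame(frame, n_subbands=4):
    spectrum = np.abs(np.fft.rfft(frame)) ** 2        # per-frame spectral pattern
    bands = np.array_split(spectrum, n_subbands)      # adjacent subbands
    avg = [band.mean() for band in bands]             # subband average energy
    # consonant if any higher subband out-powers the lower subband before it
    return any(avg[i + 1] > avg[i] for i in range(n_subbands - 1))

t = np.arange(256) / 8000.0                           # assumed 8 kHz, 32 ms frame
vowel_like = np.sin(2 * np.pi * 200 * t)              # energy in the lowest band
noise_like = np.sin(2 * np.pi * 3500 * t)             # energy in a high band
print(is_consonant_frame(vowel_like), is_consonant_frame(noise_like))  # → False True
```

This reflects the intuition in the abstract: voiced vowels concentrate energy at low frequencies, while fricative-like consonants push relatively more energy into higher subbands.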
20120253805 | SYSTEMS, METHODS, AND MEDIA FOR DETERMINING FRAUD RISK FROM AUDIO SIGNALS - Systems, methods, and media for determining fraud risk from audio signals and non-audio data are provided herein. Some exemplary methods include receiving an audio signal and an associated audio signal identifier, receiving a fraud event identifier associated with a fraud event, determining a speaker model based on the received audio signal, determining a channel model based on a path of the received audio signal, using a server system, updating a fraudster channel database to include the determined channel model based on a comparison of the audio signal identifier and the fraud event identifier, and updating a fraudster voice database to include the determined speaker model based on a comparison of the audio signal identifier and the fraud event identifier. | 10-04-2012 |
20120253806 | System And Method For Distributed Speech Recognition - A system and method for distributed speech recognition is provided. Audio data is obtained from a caller participating in a call with an agent. A main recognizer receives a main grammar template and the audio data. A plurality of secondary recognizers each receive the audio data and a reference that identifies a secondary grammar, which is a non-overlapping section of the main grammar template. Speech recognition is performed on each of the secondary recognizers and speech recognition results are identified by applying the secondary grammar to the audio data. An n number of most likely speech recognition results are selected. The main recognizer constructs a new grammar based on the main grammar template using the speech recognition results from each of the secondary recognizers as a new vocabulary. Further speech recognition results are identified by applying the new grammar to the audio data. | 10-04-2012 |
20120278075 | System and Method for Community Feedback and Automatic Ratings for Speech Metrics - A system and method for collecting from an ASR, a first rating of an intelligibility of human speech, and collecting another intelligibility rating of such speech from networked listeners to such speech. The first rating and the second rating are weighed based on an importance to a user of the ratings, and a third rating is created from such weighted two ratings. | 11-01-2012 |
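The rating combination described above reduces to a weighted blend of the ASR rating and the community rating, with weights reflecting their importance to the user. The normalized weighting scheme below is an assumption; the patent does not specify the exact formula.

```python
# Minimal sketch of blending an ASR intelligibility rating with a
# community rating using user-chosen importance weights.

def combined_rating(asr_rating, community_rating, asr_weight, community_weight):
    total = asr_weight + community_weight
    return (asr_rating * asr_weight + community_rating * community_weight) / total

# a user who trusts community listeners three times as much as the ASR
print(combined_rating(80.0, 60.0, asr_weight=1.0, community_weight=3.0))  # → 65.0
```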
20120290302 | CHINESE SPEECH RECOGNITION SYSTEM AND METHOD - A Chinese speech recognition system and method is disclosed. Firstly, a speech signal is received and recognized to output a word lattice. Next, the word lattice is received, and word arcs of the word lattice are rescored and reranked with a prosodic break model, a prosodic state model, a syllable prosodic-acoustic model, a syllable-juncture prosodic-acoustic model and a factored language model, so as to output a language tag, a prosodic tag and a phonetic segmentation tag, which correspond to the speech signal. The present invention performs rescoring in two stages to improve the recognition rate of basic speech information and labels the language tag, prosodic tag and phonetic segmentation tag to provide the prosodic structure and language information for later-stage voice conversion and voice synthesis. | 11-15-2012 |
20120296648 | SYSTEMS AND METHODS FOR DETERMINING THE N-BEST STRINGS - Systems and methods for identifying the N-best strings of a weighted automaton. A potential for each state of an input automaton to a set of destination states of the input automaton is first determined. Then, the N-best paths are found in the result of an on-the-fly determinization of the input automaton. Only the portion of the input automaton needed to identify the N-best paths is determinized. As the input automaton is determinized, a potential for each new state of the partially determinized automaton is determined and is used in identifying the N-best paths of the determinized automaton, which correspond exactly to the N-best strings of the input automaton. | 11-22-2012 |
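The potential-guided search described above can be illustrated on a toy automaton. This sketch omits the on-the-fly determinization that is central to the patent and simply runs an A*-style best-first search over a small acyclic weighted automaton, using per-state potentials (the best residual cost to a final state) as the search heuristic. The graph, labels, and weights are invented for the example.

```python
# Hedged sketch: N-best paths of a small weighted automaton, guided by
# per-state potentials. Lower weight is better; potentials give the best
# cost from each state to a final state. Determinization is omitted.
import heapq

def n_best_paths(arcs, start, finals, potential, n):
    """arcs: {state: [(label, weight, next_state), ...]}."""
    results = []
    # queue entries: (cost_so_far + potential, cost_so_far, state, string)
    queue = [(potential[start], 0.0, start, "")]
    while queue and len(results) < n:
        _, cost, state, string = heapq.heappop(queue)
        if state in finals:
            results.append((string, cost))
            continue
        for label, weight, nxt in arcs.get(state, []):
            new_cost = cost + weight
            heapq.heappush(queue, (new_cost + potential[nxt], new_cost, nxt, string + label))
    return results

arcs = {
    0: [("a", 1.0, 1), ("b", 3.0, 1)],
    1: [("c", 1.0, 2), ("d", 2.0, 2)],
}
potential = {0: 2.0, 1: 1.0, 2: 0.0}  # best cost-to-final from each state
print(n_best_paths(arcs, start=0, finals={2}, potential=potential, n=2))
# → [('ac', 2.0), ('ad', 3.0)]
```

In the patent, because the automaton is determinized on the fly, paths and strings coincide; in this toy sketch distinct paths may spell the same string, which a real implementation must merge.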
20120323573 | Non-Scorable Response Filters For Speech Scoring Systems - A method for scoring non-native speech includes receiving a speech sample spoken by a non-native speaker and performing automatic speech recognition and metric extraction on the speech sample to generate a transcript of the speech sample and a speech metric associated with the speech sample. The method further includes determining whether the speech sample is scorable or non-scorable based upon the transcript and speech metric, where the determination is based on an audio quality of the speech sample, an amount of speech of the speech sample, a degree to which the speech sample is off-topic, whether the speech sample includes speech from an incorrect language, or whether the speech sample includes plagiarized material. When the sample is determined to be non-scorable, an indication of non-scorability is associated with the speech sample. When the sample is determined to be scorable, the sample is provided to a scoring model for scoring. | 12-20-2012 |
20130006629 | SEARCHING DEVICE, SEARCHING METHOD, AND PROGRAM - The present invention relates to a searching device, searching method, and program whereby searching for a word string corresponding to input voice can be performed in a robust manner. | 01-03-2013 |
20130030808 | Computer-Implemented Systems and Methods for Scoring Concatenated Speech Responses - Systems and methods are provided for scoring non-native speech. Two or more speech samples are received, where each of the samples are of speech spoken by a non-native speaker, and where each of the samples are spoken in response to distinct prompts. The two or more samples are concatenated to generate a concatenated response for the non-native speaker, where the concatenated response is based on the two or more speech samples that were elicited using the distinct prompts. A concatenated speech proficiency metric is computed based on the concatenated response, and the concatenated speech proficiency metric is provided to a scoring model, where the scoring model generates a speaking score based on the concatenated speech metric. | 01-31-2013 |
20130080165 | Model Based Online Normalization of Feature Distribution for Noise Robust Speech Recognition - Online histogram normalization may be provided. Upon receiving a spoken phrase from a user, a histogram/frequency distribution may be estimated on the spoken phrase according to a prior distribution. The histogram distribution may be equalized and then provided to a spoken language understanding application. | 03-28-2013 |
20130080166 | DIALOG-BASED VOICEPRINT SECURITY FOR BUSINESS TRANSACTIONS - A system for biometrically securing business transactions uses speech recognition and voiceprint authentication to biometrically secure a transaction from a variety of client devices in a variety of media. A voiceprint authentication server receives a request from a third party requestor to authenticate a previously enrolled end user of a client device. A signature collection applet presents the user a randomly generated signature string, prompting the user to speak the string, and recording the user's voice as he speaks. After transmittal to the authentication server, the signature string is recognized using voice recognition software, and compared with a stored voiceprint, using voiceprint authentication software. An authentication result is reported to both user and requestor. Voiceprints are stored in a repository along with the associated user data. Enrollment is by way of a separate enrollment applet, wherein the end user provides user information and records a voiceprint, which is subsequently stored. | 03-28-2013 |
20130090925 | SYSTEM AND METHOD FOR SUPPLEMENTAL SPEECH RECOGNITION BY IDENTIFIED IDLE RESOURCES - Disclosed herein are systems, methods, and computer-readable storage media for improving automatic speech recognition performance. A system practicing the method identifies idle speech recognition resources and establishes a supplemental speech recognizer on the idle resources based on overall speech recognition demand. The supplemental speech recognizer can differ from a main speech recognizer, and, along with the main speech recognizer, can be associated with a particular speaker. The system performs speech recognition on speech received from the particular speaker in parallel with the main speech recognizer and the supplemental speech recognizer and combines results from the main and supplemental speech recognizer. The system recognizes the received speech based on the combined results. The system can use beam adjustment in place of or in combination with a supplemental speech recognizer. A scheduling algorithm can tailor a particular combination of speech recognition resources and release the supplemental speech recognizer based on increased demand. | 04-11-2013 |
20130173266 | VOICE ANALYZER AND VOICE ANALYSIS SYSTEM - A voice analyzer includes a first voice acquisition unit provided in a place where a distance of a sound wave propagation path from a mouth of a user is a first distance, plural second voice acquisition units provided in places where distances of sound wave propagation paths from the mouth of the user are smaller than the first distance, and an identification unit that identifies whether the voices acquired by the first and second voice acquisition units are voices of the user or voices of others excluding the user on the basis of a result of comparison between first sound pressure of a voice signal of the voice acquired by the first voice acquisition unit and second sound pressure calculated from sound pressure of a voice signal of the voice acquired by each of the plural second voice acquisition units. | 07-04-2013 |
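The identification step in this abstract boils down to a sound-pressure comparison: speech from the wearer's own mouth reaches the nearby microphones at noticeably higher pressure than the distant one, while speech from other people arrives at roughly equal pressure everywhere. The RMS measure, the use of the loudest near microphone, and the ratio threshold are all assumptions for this sketch.

```python
# Illustrative sketch of own-voice identification by comparing sound
# pressure at a far microphone against the loudest of several near ones.

def rms(samples):
    return (sum(s * s for s in samples) / len(samples)) ** 0.5

def is_wearers_voice(far_mic, near_mics, ratio_threshold=2.0):
    near_pressure = max(rms(m) for m in near_mics)  # second sound pressure
    return near_pressure / rms(far_mic) >= ratio_threshold

own = ([0.1, -0.1, 0.1, -0.1],                      # far mic: quiet
       [[0.5, -0.5, 0.5, -0.5], [0.4, -0.4, 0.4, -0.4]])  # near mics: loud
other = ([0.3, -0.3, 0.3, -0.3],                    # distant talker: roughly
         [[0.3, -0.3, 0.3, -0.3], [0.3, -0.3, 0.3, -0.3]])  # equal everywhere
print(is_wearers_voice(*own), is_wearers_voice(*other))  # → True False
```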
20130231932 | Voice Activity Detection and Pitch Estimation - Implementations include systems, methods and/or devices operable to detect voice activity in an audible signal by detecting glottal pulses. The dominant frequency of a series of glottal pulses is perceived as the intonation pattern or melody of natural speech, which is also referred to as the pitch. However, as noted above, spoken communication typically occurs in the presence of noise and/or other interference. In turn, the undulation of voiced speech is masked in some portions of the frequency spectrum associated with human speech by the noise and/or other interference. In some implementations, detection of voice activity is facilitated by dividing the frequency spectrum associated with human speech into multiple sub-bands in order to identify glottal pulses that dominate the noise and/or other interference in particular sub-bands. Additionally and/or alternatively, in some implementations the analysis is furthered to provide a pitch estimate of the detected voice activity. | 09-05-2013 |
20130253930 | FACTORED TRANSFORMS FOR SEPARABLE ADAPTATION OF ACOUSTIC MODELS - Various technologies described herein pertain to adapting a speech recognizer to input speech data. A first linear transform can be selected from a first set of linear transforms based on a value of a first variability source corresponding to the input speech data, and a second linear transform can be selected from a second set of linear transforms based on a value of a second variability source corresponding to the input speech data. The linear transforms in the first and second sets can compensate for the first variability source and the second variability source, respectively. Moreover, the first linear transform can be applied to the input speech data to generate intermediate transformed speech data, and the second linear transform can be applied to the intermediate transformed speech data to generate transformed speech data. Further, speech can be recognized based on the transformed speech data to obtain a result. | 09-26-2013 |
20130268270 | Forced/Predictable Adaptation for Speech Recognition - A method is described for use with automatic speech recognition using discriminative criteria for speaker adaptation. An adaptation evaluation is performed of speech recognition performance data for speech recognition system users. Adaptation candidate users are identified based on the adaptation evaluation for whom an adaptation process is likely to improve system performance. | 10-10-2013 |
20130275135 | Automatic Updating of Confidence Scoring Functionality for Speech Recognition Systems - Automatically adjusting confidence scoring functionality is described for a speech recognition engine. Operation of the speech recognition system is revised so as to change an associated receiver operating characteristic (ROC) curve describing performance of the speech recognition system with respect to rates of false acceptance (FA) versus correct acceptance (CA). Then a confidence scoring functionality related to recognition reliability for a given input utterance is automatically adjusted such that where the ROC curve is better for a given operating point after revising the operation of the speech recognition system, the adjusting reflects a double gain constraint to maintain FA and CA rates at least as good as before revising operation of the speech recognition system. | 10-17-2013 |
20130282374 | SPEECH RECOGNITION DEVICE, SPEECH RECOGNITION METHOD, AND SPEECH RECOGNITION PROGRAM - A speech recognition device has: hypothesis search means which searches for an optimal solution of inputted speech data by generating a hypothesis which is a bundle of words which are searched for as recognition result candidates; self-repair decision means which calculates a self-repair likelihood of a word or a word sequence included in the hypothesis which is being searched for by the hypothesis search means, and decides whether or not self-repair of the word or the word sequence is performed; and transparent word hypothesis generation means which, when the self-repair decision means decides that the self-repair is performed, generates a transparent word hypothesis which is a hypothesis which regards as a transparent word a word or a word sequence included in an un-repaired interval related to the word or the word sequence, and the hypothesis search means searches hypotheses for an optimal solution, the hypotheses including as search target hypotheses the transparent word hypothesis generated by the transparent word hypothesis generation means. | 10-24-2013 |
20130289987 | Negative Example (Anti-Word) Based Performance Improvement For Speech Recognition - A system and method are presented for negative example based performance improvements for speech recognition. The presently disclosed embodiments address identified false positives and the identification of negative examples of keywords in an Automatic Speech Recognition (ASR) system. Various methods may be used to identify negative examples of keywords. Such methods may include, for example, human listening and learning possible negative examples from a large domain specific text source. In at least one embodiment, negative examples of keywords may be used to improve the performance of an ASR system by reducing false positives. | 10-31-2013 |
20130304468 | Contextual Voice Query Dilation - A method for contextual voice query dilation in a Spoken Web search includes determining a context in which a voice query is created, generating a set of multiple voice query terms based on the context and information derived by a speech recognizer component pertaining to the voice query, and processing the set of query terms with at least one dilation operator to produce a dilated set of queries. A method for performing a search on a voice query is also provided, including generating a set of multiple query terms based on information derived by a speech recognizer component processing a voice query, processing the set with multiple dilation operators to produce multiple dilated sub-sets of query terms, selecting at least one query term from each dilated sub-set to compose a query set, and performing a search on the query set. | 11-14-2013 |
20130317820 | Automatic Methods to Predict Error Rates and Detect Performance Degradation - An automatic speech recognition dictation application is described that includes a dictation module for performing automatic speech recognition in a dictation session with a speaker user to determine representative text corresponding to input speech from the speaker user. A post-processing module develops a session level metric correlated to verbatim recognition error rate of the dictation session, and determines if recognition performance degraded during the dictation session based on a comparison of the session metric to a baseline metric. | 11-28-2013 |
20130317821 | SPARSE SIGNAL DETECTION WITH MISMATCHED MODELS - Various arrangements for detecting a type of sound, such as speech, are presented. A plurality of audio snippets may be sampled. A period of time may elapse between consecutive audio snippets. A hypothetical test may be performed using the sampled plurality of audio snippets. Such a hypothetical test may include weighting one or more hypothetical values greater than one or more other hypothetical values. Each hypothetical value may correspond to an audio snippet of the plurality of audio snippets. The hypothetical test may further include using at least the greater weighted one or more hypothetical values to determine whether at least one audio snippet of the plurality of audio snippets comprises the type of sound. | 11-28-2013 |
20130332163 | VOICED SOUND INTERVAL CLASSIFICATION DEVICE, VOICED SOUND INTERVAL CLASSIFICATION METHOD AND VOICED SOUND INTERVAL CLASSIFICATION PROGRAM - The voiced sound interval classification device comprises a vector calculation unit which calculates, from a power spectrum time series of voice signals, a multidimensional vector series as a vector series of a power spectrum having as many dimensions as the number of microphones, a difference calculation unit which calculates, with respect to each time of the multidimensional vector series, a vector of a difference between the time and the preceding time, a sound source direction estimation unit which estimates, as a sound source direction, a main component of the differential vector, and a voiced sound interval determination unit which determines whether each sound source direction is in a voiced sound interval or a voiceless sound interval by using a predetermined voiced sound index indicative of a likelihood of a voiced sound interval of the voice signal applied at each time. | 12-12-2013 |
20130339018 | MULTI-SAMPLE CONVERSATIONAL VOICE VERIFICATION - A system and method of verifying the identity of an authorized user in an authorized user group through a voice user interface for enabling secure access to one or more services via a mobile device includes receiving first voice information from a speaker through the voice user interface of the mobile device, calculating a confidence score based on a comparison of the first voice information with a stored voice model associated with the authorized user and specific to the authorized user, interpreting the first voice information as a specific service request, identifying a minimum confidence score for initiating the specific service request, determining whether or not the confidence score exceeds the minimum confidence score, and initiating the specific service request if the confidence score exceeds the minimum confidence score. | 12-19-2013 |
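The gating logic in this abstract is a per-service threshold check: each service request carries its own minimum confidence score, and the request is initiated only when the speaker-verification score exceeds that minimum. The service names and threshold values below are illustrative assumptions.

```python
# Minimal sketch of a per-service confidence gate: riskier services
# demand a higher voice-verification confidence score.

MIN_CONFIDENCE = {"check_balance": 0.6, "transfer_funds": 0.9}

def authorize(service_request, confidence_score):
    """Initiate the request only if the score clears the service's minimum."""
    return confidence_score > MIN_CONFIDENCE[service_request]

print(authorize("check_balance", 0.75))   # → True
print(authorize("transfer_funds", 0.75))  # → False
```

The same verification score can thus unlock a low-risk query while denying a high-risk one, matching the abstract's per-request minimum.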
20140039890 | EFFICIENT CONTENT CLASSIFICATION AND LOUDNESS ESTIMATION - The present document relates to methods and systems for encoding an audio signal. The method comprises determining a spectral representation of the audio signal. The determining a spectral representation step may comprise determining modified discrete cosine transform, MDCT, coefficients, or a Quadrature Mirror Filter, QMF, filter bank representation of the audio signal. The method further comprises encoding the audio signal using the determined spectral representation; and classifying parts of the audio signal to be speech or non-speech based on the determined spectral representation. Finally, a loudness measure for the audio signal based on the speech parts is determined. | 02-06-2014 |
20140052443 | ELECTRONIC DEVICE WITH VOICE CONTROL FUNCTION AND VOICE CONTROL METHOD - A voice control method of an electronic device is provided. The method includes detecting surrounding voices and converting the detected voices into voice signals; periodically sensing non-vocal physical actions of the user and identifying each action; comparing the identified action with a preset action to determine whether they are the same; extracting a voice characteristic of linguistic meaning from the voice signals when the signals are received; determining whether the extracted voice characteristic matches voice templates stored in a storage unit; and performing a particular function associated with a voice template when the storage unit stores a voice template corresponding to the voice characteristic. The electronic device is also provided. | 02-20-2014 |
20140067391 | Method and System for Predicting Speech Recognition Performance Using Accuracy Scores - A system and method are presented for predicting speech recognition performance using accuracy scores in speech recognition systems within the speech analytics field. A keyword set is selected. Figure of Merit (FOM) is computed for the keyword set. Relevant features that describe the word individually and in relation to other words in the language are computed. A mapping from these features to FOM is learned. This mapping can be generalized via a suitable machine learning algorithm and be used to predict FOM for a new keyword. In at least one embodiment, the predicted FOM may be used to adjust internals of the speech recognition engine to achieve a consistent behavior for all inputs for various settings of confidence values. | 03-06-2014 |
20140067392 | CENTRALIZED SPEECH LOGGER ANALYSIS - A method of providing hands-free services using a mobile device having wireless access to computer-based services includes receiving speech in a vehicle from a vehicle occupant; recording the speech using a mobile device; transmitting the recorded speech from the mobile device to a cloud speech service; receiving automatic speech recognition (ASR) results from the cloud speech service at the mobile device; and comparing the recorded speech with the received ASR results at the mobile device to identify one or more error conditions. | 03-06-2014 |
20140074468 | System and Method for Automatic Prediction of Speech Suitability for Statistical Modeling - An embodiment according to the invention provides a capability of automatically predicting how favorable a given speech signal is for statistical modeling, which is advantageous in a variety of different contexts. In Multi-Form Segment (MFS) synthesis, for example, an embodiment according to the invention uses prediction capability to provide an automatic acoustic driven template versus model decision maker with an output quality that is high, stable and depends gradually on the system footprint. In speaker selection for a statistical Text-to-Speech synthesis (TTS) system build, as another example context, an embodiment according to the invention enables a fast selection of the most appropriate speaker among several available ones for the full voice dataset recording and preparation, based on a small amount of recorded speech material. | 03-13-2014 |
20140074469 | Apparatus and Method for Generating Signatures of Acoustic Signal and Apparatus for Acoustic Signal Identification - Method and apparatus for generating compact signatures of acoustic signal are disclosed. A method of generating acoustic signal signatures comprises the steps of dividing input signal into multiple frames, computing Fourier transform of each frame, computing difference between non-negative Fourier transform output values for the current frame and non-negative Fourier transform output values for one of previous frames, combining difference values into subgroups, accumulating difference values within a subgroup, combining accumulated subgroup values into groups, and finding an extreme value within each group. | 03-13-2014 |
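The signature pipeline enumerated in this abstract can be sketched end to end: frame the signal, take magnitude spectra, difference consecutive frames, accumulate the differences in subgroups, and keep the extreme value of each group of subgroups. The subgroup and group sizes, and the use of the maximum as the extreme value, are assumptions for this example.

```python
# Rough sketch of compact signature generation from consecutive frames:
# spectral difference -> subgroup accumulation -> per-group extreme value.
import numpy as np

def frame_signature(prev_frame, frame, subgroup_size=4, group_size=4):
    # difference of non-negative Fourier-transform magnitudes
    diff = np.abs(np.fft.rfft(frame)) - np.abs(np.fft.rfft(prev_frame))
    usable = diff[: (len(diff) // subgroup_size) * subgroup_size]
    subgroups = usable.reshape(-1, subgroup_size).sum(axis=1)   # accumulate
    usable2 = subgroups[: (len(subgroups) // group_size) * group_size]
    groups = usable2.reshape(-1, group_size)
    return groups.max(axis=1)                                   # extreme per group

rng = np.random.default_rng(0)
sig = frame_signature(rng.standard_normal(128), rng.standard_normal(128))
print(sig.shape)  # → (4,)
```

Each frame pair thus collapses to a handful of values, which is what makes the resulting signatures compact enough for large-scale acoustic identification.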
20140081636 | SYSTEM AND METHOD FOR DYNAMIC ASR BASED ON SOCIAL MEDIA - System and method to adjust an automatic speech recognition (ASR) engine, the method including: receiving social network information from a social network; data mining the social network information to extract one or more characteristics; inferring a trend from the extracted one or more characteristics; and adjusting the ASR engine based upon the inferred trend. Embodiments of the method may further include: receiving a speech signal from a user; and recognizing the speech signal by use of the adjusted ASR engine. Further embodiments of the method may further include: producing a list of candidate matching words; and ranking the list of candidate matching words by use of the inferred trend. | 03-20-2014 |
20140100847 | VOICE RECOGNITION DEVICE AND NAVIGATION DEVICE - Disclosed is a voice recognition device including: first through Mth voice recognition parts each for detecting a voice interval from sound data stored in a sound data storage unit | 04-10-2014 |
20140180688 | SPEECH RECOGNITION DEVICE AND SPEECH RECOGNITION METHOD, DATA BASE FOR SPEECH RECOGNITION DEVICE AND CONSTRUCTING METHOD OF DATABASE FOR SPEECH RECOGNITION DEVICE - A speech recognition device comprises a corpus processor which includes a refiner to classify collected corpora into domains corresponding to functions of the speech recognition device, and an extractor which extracts collected basic sentences based on functions of the speech recognition device with respect to the corpora in the domains, a database (DB) which stores therein the extracted basic sentences based on functions of the speech recognition device, a corpus receiver which receives a user's corpora, and a controller which compares a received basic sentence extracted by the extractor with collected basic sentences stored in the DB and determines the function intended by the user's corpora. | 06-26-2014 |
20140188470 | FLEXIBLE ARCHITECTURE FOR ACOUSTIC SIGNAL PROCESSING ENGINE - A disclosed speech processor includes a front end to receive a speech input and generate a feature vector indicative of a portion of the speech input and a Gaussian mixture (GMM) circuit to receive the feature vector, model any one of a plurality of GMM speech recognition algorithms, and generate a GMM score for the feature vector based on the GMM speech recognition algorithm modeled. In at least one embodiment, the GMM circuit includes a common compute block to generate a feature vector sum indicative of a weighted sum of squared differences between the feature vector and a mixture component of the GMM speech recognition algorithm. In at least one embodiment, the GMM speech recognition algorithm being modeled includes a plurality of Gaussian mixture components and the common compute block is operable to generate feature vector scores corresponding to each of the plurality of mixture components. | 07-03-2014 |
20140200890 | METHODS, SYSTEMS, AND CIRCUITS FOR SPEAKER DEPENDENT VOICE RECOGNITION WITH A SINGLE LEXICON - Embodiments reduce the complexity of speaker dependent speech recognition systems and methods by representing the code word (i.e., the word to be recognized) using a single Gaussian Mixture Model (GMM) which is adapted from a Universal Background Model (UBM). Only the parameters of the GMM need to be stored. Further reduction in computation is achieved by only checking the GMM component that is relevant to the keyword template. In this scheme, keyword template is represented by a sequence of the index of best performing component of the GMM of the keyword model. Only one template is saved by combining the registration template using Longest Common Sequence algorithm. The quality of the word model is continuously updated by performing expectation maximization iteration using the test word which is accepted as keyword model. | 07-17-2014 |
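The template representation described above, replacing each keyword frame with the index of the best-performing GMM component so that only an index sequence needs to be stored, can be sketched directly. A diagonal-covariance Gaussian and the toy two-component mixture below are assumptions; the UBM adaptation and Longest Common Sequence template merging are omitted.

```python
# Hedged sketch: represent a keyword utterance as the sequence of
# best-scoring GMM component indices, one index per feature frame.
import math

def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def best_component_sequence(frames, components):
    """components: list of (weight, mean, var) GMM mixture components."""
    sequence = []
    for frame in frames:
        scores = [math.log(w) + log_gaussian(frame, m, v)
                  for w, m, v in components]
        sequence.append(max(range(len(scores)), key=scores.__getitem__))
    return sequence

components = [
    (0.5, [0.0, 0.0], [1.0, 1.0]),   # component 0 centered at the origin
    (0.5, [5.0, 5.0], [1.0, 1.0]),   # component 1 centered at (5, 5)
]
frames = [[0.1, -0.2], [4.8, 5.1], [0.3, 0.0]]
print(best_component_sequence(frames, components))  # → [0, 1, 0]
```

Only this short integer sequence, plus the GMM parameters, needs to be stored per keyword, which is the storage reduction the abstract claims.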
20140207456 | WAVEFORM ANALYSIS OF SPEECH - A waveform analysis of speech is disclosed. Embodiments include methods for analyzing captured sounds produced by animals, such as human vowel sounds, and accurately determining the sound produced. Some embodiments utilize computer processing to identify the location of the sound within a waveform, select a particular time within the sound, and measure a fundamental frequency and one or more formants at the particular time. Embodiments compare the fundamental frequency and the one or more formants to known thresholds and multiples of the fundamental frequency, such as by a computer-run algorithm. The results of this comparison identify the sound with a high degree of accuracy. | 07-24-2014 |
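The fundamental-frequency measurement step can be illustrated with a standard autocorrelation pitch estimator: pick the autocorrelation peak within a plausible pitch range at the chosen time. The search range, sampling rate, and method choice are assumptions for this sketch; formant measurement is omitted.

```python
# Illustrative sketch of measuring the fundamental frequency at a chosen
# point in a waveform via the autocorrelation peak in a pitch range.
import numpy as np

def fundamental_frequency(samples, sample_rate, f_min=80.0, f_max=400.0):
    samples = samples - samples.mean()
    autocorr = np.correlate(samples, samples, mode="full")[len(samples) - 1:]
    lo = int(sample_rate / f_max)                 # shortest plausible period
    hi = int(sample_rate / f_min)                 # longest plausible period
    lag = lo + int(np.argmax(autocorr[lo:hi]))
    return sample_rate / lag

fs = 8000
t = np.arange(2048) / fs
# vowel-like signal: 160 Hz fundamental plus a weaker second harmonic
vowel = np.sin(2 * np.pi * 160 * t) + 0.4 * np.sin(2 * np.pi * 320 * t)
print(round(fundamental_frequency(vowel, fs)))  # → 160
```

With the fundamental in hand, formants could then be compared against thresholds and against multiples of this frequency, as the abstract describes.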
20140249816 | METHODS, APPARATUS AND COMPUTER PROGRAMS FOR AUTOMATIC SPEECH RECOGNITION - An automatic speech recognition (ASR) system includes a speech-responsive application and a recognition engine. The ASR system generates user prompts to elicit certain spoken inputs, and the speech-responsive application performs operations when the spoken inputs are recognised. The recognition engine compares sounds within an input audio signal with phones within an acoustic model, to identify candidate matching phones. A recognition confidence score is calculated for each candidate matching phone, and the confidence scores are used to help identify one or more likely sequences of matching phones that appear to match a word within the grammar of the speech-responsive application. The per-phone confidence scores are evaluated against predefined confidence score criteria (for example, identifying scores below a ‘low confidence’ threshold) and the results of the evaluation are used to influence subsequent selection of user prompts. One such system uses confidence scores to select prompts for targeted recognition training—encouraging input of sounds identified as having low confidence scores. Another system selects prompts to discourage input of sounds that were not easily recognised. | 09-04-2014 |
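The low-confidence evaluation can be sketched as a threshold filter over per-phone scores, returning (worst first) the phones a prompt-selection step might target for training. The threshold value and names are illustrative, not from the patent:

```python
LOW_CONFIDENCE = 0.45  # assumed 'low confidence' threshold

def phones_needing_training(phone_scores, threshold=LOW_CONFIDENCE):
    """Return phones whose recognition confidence fell below the threshold,
    ordered worst-first, so prompts can target them for extra training."""
    low = [(score, phone) for phone, score in phone_scores.items()
           if score < threshold]
    return [phone for score, phone in sorted(low)]

# Per-phone confidence scores from one recognition pass.
scores = {"th": 0.31, "s": 0.88, "r": 0.42, "iy": 0.95}
targets = phones_needing_training(scores)
```

A prompt generator would then favour (or, in the second system described, avoid) words rich in the returned phones.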
20140324425 | ELECTRONIC DEVICE AND VOICE CONTROL METHOD THEREOF - A voice control method is applied in an electronic device. The electronic device includes a voice input unit, a play unit, and a storage unit storing a conversation database and an association table mapping ranges of voice characteristics to styles of response voice. The method includes the following steps: obtaining voice signals input via the voice input unit; determining the input content from the obtained voice signals; searching the conversation database for a response corresponding to the input content; analyzing voice characteristics of the obtained voice signals; comparing those characteristics with the pre-stored ranges to select the associated response voice; and finally outputting the found response in the associated response voice via the play unit. | 10-30-2014 |
20140330563 | SEAMLESS AUTHENTICATION AND ENROLLMENT - Some aspects of the invention may include a computer-implemented method for enrolling voice prints generated from audio streams, in a database. The method may include receiving an audio stream of a communication session and creating a preliminary association between the audio stream and an identity of a customer that has engaged in the communication session based on identification information. The method may further include determining a confidence level of the preliminary association based on authentication information related to the customer and if the confidence level is higher than a threshold, sending a request to compare the audio stream to a database of voice prints of known fraudsters. If the audio stream does not match any known fraudsters, sending a request to generate from the audio stream a current voice print associated with the customer and enrolling the voice print in a customer voice print database. | 11-06-2014 |
20150019218 | METHOD AND APPARATUS FOR EXEMPLARY SEGMENT CLASSIFICATION - Method and apparatus for segmenting speech by detecting the pauses between words and/or phrases, and determining whether a particular time interval contains speech or non-speech, such as a pause. | 01-15-2015 |
20150039309 | METHODS AND SYSTEMS FOR IDENTIFYING ERRORS IN A SPEECH RECOGNITION SYSTEM - Methods are disclosed for identifying possible errors made by a speech recognition system without using a transcript of words input to the system. A method for model adaptation for a speech recognition system includes determining an error rate, corresponding to either recognition of instances of a word or recognition of instances of various words, without using a transcript of words input to the system. The method may further include adjusting an adaptation, of the model for the word or various models for the various words, based on the error rate. Apparatus are disclosed for identifying possible errors made by a speech recognition system without using a transcript of words input to the system. An apparatus for model adaptation for a speech recognition system includes a processor adapted to estimate an error rate, corresponding to either recognition of instances of a word or recognition of instances of various words, without using a transcript of words input to the system. The apparatus may further include a controller adapted to adjust an adaptation of the model for the word or various models for the various words, based on the error rate. | 02-05-2015 |
20150058010 | METHOD AND SYSTEM FOR BIAS CORRECTED SPEECH LEVEL DETERMINATION - Method for measuring level of speech determined by an audio signal in a manner which corrects for and reduces the effect of modification of the signal by the addition of noise thereto and/or amplitude compression thereof, and a system configured to perform any embodiment of the method. In some embodiments, the method includes steps of generating frequency banded, frequency-domain data indicative of an input speech signal, determining from the data a Gaussian parametric spectral model of the speech signal, and determining from the parametric spectral model an estimated mean speech level and a standard deviation value for each frequency band of the data; and generating speech level data indicative of a bias corrected mean speech level for each frequency band, including using at least one correction value to correct the estimated mean speech level for the frequency band, where each correction value has been predetermined using a reference speech model. | 02-26-2015 |
20150058011 | INFORMATION PROCESSING APPARATUS, INFORMATION UPDATING METHOD AND COMPUTER-READABLE STORAGE MEDIUM - An information processing apparatus recognizes input data as character information formed by character strings each being in a predetermined unit based on information relating to a character string as a recognition target, and performs processing based on the recognized character information. The apparatus includes an input information receiver that receives input information capable of being processed as characters; an input information dividing unit that divides the received input information into character strings each being in a predetermined processing unit; a popularity level calculating unit that calculates a popularity level based on history of an appearance timing of each of the divided character strings, the popularity level indicating information relating to a usage frequency for a predetermined period of time up to a current time for each of the divided character strings; and an updating processor that updates the information relating to the character string based on the calculated popularity level. | 02-26-2015 |
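One plausible reading of the popularity level is an exponentially decayed count over each character string's appearance history, so recent uses outweigh old ones. The half-life parameter and function name below are assumptions for illustration, not taken from the abstract:

```python
import math

def popularity_level(appearance_times, now, half_life=30.0):
    """Exponentially decayed usage count: each past appearance contributes
    a weight that halves every `half_life` time units."""
    decay = math.log(2) / half_life
    return sum(math.exp(-decay * (now - t))
               for t in appearance_times if t <= now)

# A string typed often and recently scores higher than one typed
# the same number of times long ago (timestamps in days).
recent = popularity_level([95, 97, 99, 100], now=100)
stale = popularity_level([1, 2, 3, 4], now=100)
```

The updating processor could then re-rank or prune dictionary entries by this level, keeping recognition biased toward currently fashionable strings.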
20150058012 | System and Method for Combining Frame and Segment Level Processing, Via Temporal Pooling, for Phonetic Classification - Disclosed herein are systems, methods, and non-transitory computer-readable storage media for combining frame and segment level processing, via temporal pooling, for phonetic classification. A frame processor unit receives an input and extracts the time-dependent features from the input. A plurality of pooling interface units generates a plurality of feature vectors based on pooling the time-dependent features and selecting a plurality of time-dependent features according to a plurality of selection strategies. Next, a plurality of segmental classification units generates scores for the feature vectors. Each segmental classification unit (SCU) can be dedicated to a specific pooling interface unit (PIU) to form a PIU-SCU combination. Multiple PIU-SCU combinations can be further combined to form an ensemble of combinations, and the ensemble can be diversified by varying the pooling operations used by the PIU-SCU combinations. Based on the scores, the plurality of segmental classification units selects a class label and returns a result. | 02-26-2015 |
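The pooling interface can be sketched as a function that collapses a variable-length run of frame feature vectors into one fixed-size segment vector under a selectable strategy (mean and max shown here; the patent's pooling operations are not limited to these, and the names are illustrative):

```python
def pool_frames(frames, strategy):
    """Pool a variable-length run of frame feature vectors into one
    fixed-size segment vector, per the chosen pooling strategy."""
    dims = range(len(frames[0]))
    if strategy == "mean":
        return [sum(f[d] for f in frames) / len(frames) for d in dims]
    if strategy == "max":
        return [max(f[d] for f in frames) for d in dims]
    raise ValueError(f"unknown pooling strategy: {strategy}")

# Three frames of 2-dimensional time-dependent features for one segment.
frames = [[0.1, 2.0], [0.3, 1.0], [0.2, 3.0]]
mean_vec = pool_frames(frames, "mean")
max_vec = pool_frames(frames, "max")
```

Varying the strategy per pooling-interface unit is what diversifies the ensemble of PIU-SCU combinations described above: each segmental classifier sees a differently summarized view of the same frames.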
20150081295 | METHOD AND APPARATUS FOR CONTROLLING ACCESS TO APPLICATIONS - According to an aspect of the present disclosure, a method for controlling access to a plurality of applications in an electronic device is disclosed. The method includes receiving a voice command from a speaker for accessing a target application among the plurality of applications, and verifying whether the voice command is indicative of a user authorized to access the applications based on a speaker model of the authorized user. In this method, each application is associated with a security level having a threshold value. The method further includes updating the speaker model with the voice command if the voice command is verified to be indicative of the user, and adjusting at least one of the threshold values based on the updated speaker model. | 03-19-2015 |
20150088506 | Speech Recognition Server Integration Device and Speech Recognition Server Integration Method - The speech recognition result from a general-purpose speech recognition server and that from a specialized speech recognition server are integrated in an optimum manner, thereby providing a speech recognition function with the fewest errors. | 03-26-2015 |
20150088507 | DETECTING POTENTIAL SIGNIFICANT ERRORS IN SPEECH RECOGNITION RESULTS - In some embodiments, the recognition results produced by a speech processing system (which may include two or more recognition results, including a top recognition result and one or more alternative recognition results) based on an analysis of a speech input, are evaluated for indications of potential significant errors. In some embodiments, the recognition results may be evaluated to determine whether a meaning of any of the alternative recognition results differs from a meaning of the top recognition result in a manner that is significant for a domain, such as the medical domain. In some embodiments, words and/or phrases that may be confused by an ASR system may be determined and associated in sets of words and/or phrases. Words and/or phrases that may be determined include those that change a meaning of a phrase or sentence when included in the phrase/sentence. | 03-26-2015 |
20150106095 | SOUND IDENTIFICATION SYSTEMS - A digital sound identification system for storing a Markov model is disclosed. A processor is coupled to a sound data input, working memory, and a stored program memory for executing processor control code to input sound data for a sound to be identified. The sample sound data defines sample frequency-domain data energy over a range of frequencies. Mean and variance values for a Markov model of the sample sound are generated. The Markov model is stored in the non-volatile memory. Interference sound data defining interference frequency domain data is inputted. The mean and variance values of the Markov model are adjusted using the interference frequency domain data. Sound data defining other sound frequency domain data are inputted. A probability of the other sound frequency domain data fitting the Markov model is determined. Finally, sound identification data dependent on the probability is outputted. | 04-16-2015 |
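Scoring how well new sound data "fits" a stored Markov model can be illustrated with the discrete forward algorithm. A real system of this kind would use Gaussian emissions built from the stored means and variances, so the two-state toy below, with invented parameters, is only a sketch of the probability computation:

```python
import math

def forward_log_prob(obs, start, trans, emit):
    """Log-probability of a discrete observation sequence under an HMM
    (forward algorithm), used as a 'fits the model' score."""
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        alpha = [sum(alpha[p] * trans[p][s] for p in range(n)) * emit[s][o]
                 for s in range(n)]
    return math.log(sum(alpha))

# Toy model for sounds whose energy moves from a low band (symbol 0) to a
# high band (symbol 1): start in state 0, transition irreversibly to state 1.
start = [1.0, 0.0]
trans = [[0.7, 0.3], [0.0, 1.0]]
emit = [[0.9, 0.1], [0.1, 0.9]]  # state 0 favours "low", state 1 "high"
match_lp = forward_log_prob([0, 0, 1, 1], start, trans, emit)
mismatch_lp = forward_log_prob([1, 1, 0, 0], start, trans, emit)
```

A sound whose band energy follows the modelled low-to-high pattern scores a higher log-probability than one moving the wrong way, which is the basis for the outputted identification data.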
20150112678 | SOUND CAPTURING AND IDENTIFYING DEVICES - Broadly speaking, embodiments of the present invention provide a device, systems and methods for capturing sounds, generating a sound model (or “sound pack”) for each captured sound, and identifying a detected sound using the sound model(s). Preferably, a single device is used to capture a sound, store sound models, and to identify a detected sound using the stored sound models. | 04-23-2015 |
20150120296 | SYSTEM AND METHOD FOR SELECTING NETWORK-BASED VERSUS EMBEDDED SPEECH PROCESSING - Disclosed herein are systems, methods, and computer-readable storage media for making a multi-factor decision whether to process speech or language requests via a network-based speech processor or a local speech processor. An example local device configured to practice the method, having a local speech processor, and having access to a remote speech processor, receives a request to process speech. The local device can analyze multi-vector context data associated with the request to identify one of the local speech processor and the remote speech processor as an optimal speech processor. Then the local device can process the speech, in response to the request, using the optimal speech processor. If the optimal speech processor is local, then the local device processes the speech. If the optimal speech processor is remote, the local device passes the request and any supporting data to the remote speech processor and waits for a result. | 04-30-2015 |
20150348537 | Source Signal Separation by Discriminatively-Trained Non-Negative Matrix Factorization - A method estimates source signals from a mixture of source signals by first training an analysis model and a reconstruction model using training data. The analysis model is applied to the mixture of source signals to obtain an analysis representation of the mixture of source signals, and the reconstruction model is applied to the analysis representation to obtain an estimate of the source signals, wherein the analysis model utilizes an analysis linear basis representation, and the reconstruction model utilizes a reconstruction linear basis representation. | 12-03-2015 |
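For contrast with the discriminatively trained analysis/reconstruction pair described here, plain multiplicative-update NMF (the usual baseline that such methods improve on) can be sketched in a few lines. The toy "spectrogram" mixes two sources with disjoint spectral bases; all names and values are invented:

```python
import random

random.seed(0)  # reproducible initialization

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, rank, iters=500):
    """Multiplicative-update NMF (Frobenius cost): V ~ W @ H, all entries >= 0.
    W holds spectral basis vectors, H their time activations."""
    rows, cols = len(V), len(V[0])
    W = [[random.random() + 0.1 for _ in range(rank)] for _ in range(rows)]
    H = [[random.random() + 0.1 for _ in range(cols)] for _ in range(rank)]
    for _ in range(iters):
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + 1e-9) for j in range(cols)]
             for i in range(rank)]
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + 1e-9) for j in range(rank)]
             for i in range(rows)]
    return W, H

# 4 frequency bins x 3 frames: two sources with disjoint spectral supports.
V = [[4.0, 0.0, 2.0],
     [4.0, 0.0, 2.0],
     [0.0, 3.0, 3.0],
     [0.0, 3.0, 3.0]]
W, H = nmf(V, rank=2)
R = matmul(W, H)
error = max(abs(R[i][j] - V[i][j]) for i in range(4) for j in range(3))
```

The multiplicative updates keep W and H nonnegative by construction; the patent's contribution is to train separate analysis and reconstruction bases discriminatively rather than reusing one factorization for both roles.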
20160027436 | SPEECH RECOGNITION DEVICE, VEHICLE HAVING THE SAME, AND SPEECH RECOGNITION METHOD - A speech recognition device is configured to increase usability by retrying speech recognition without returning to a previous operation or requiring a re-input of speech when a user's speech is misrecognized. The speech recognition device is further configured to increase accuracy of recognition by changing a search environment when the user's speech is misrecognized or when re-recognition is performed because the recognized speech is rejected due to low confidence. A vehicle includes a speech input device configured to receive speech; and a speech recognition device configured to recognize the received speech and output a recognition result of the received speech. The speech recognition device resets a recognition environment applied to speech recognition and re-recognizes the received speech when a re-recognition instruction is input by a user, and resets the recognition environment to an initial value when the re-recognition is completed. | 01-28-2016 |
20160042734 | RELATIVE EXCITATION FEATURES FOR SPEECH RECOGNITION - Relative Excitation features, in all conditions, are far superior to conventional acoustic features like Mel-Frequency Cepstrum (MFC) and Perceptual Linear Prediction (PLP), and provide much more speaker-independence, channel-independence, and noise-immunity. Relative Excitation features are radically different from conventional acoustic features. The Relative Excitation method doesn't try to model speech production or vocal-tract shape, doesn't attempt deconvolution, and doesn't utilize LP (Linear Prediction) or Cepstrum techniques. This new feature set is completely related to human hearing. The present invention is inspired by the fact that human auditory perception analyzes and tracks the relations between spectral frequency component amplitudes, and the "Relative Excitation" name refers to the relative excitation levels of human auditory neurons. Described herein is a major breakthrough for explaining and simulating human auditory perception and its robustness. | 02-11-2016 |
20160086622 | SPEECH PROCESSING DEVICE, SPEECH PROCESSING METHOD, AND COMPUTER PROGRAM PRODUCT - According to an embodiment, a speech processing device includes an analyzer, a feature quantity calculator, a comparator, and a sensation index calculator. The analyzer performs multiple pseudo frequency analyses each using different window functions on subject speech to be processed. The feature quantity calculator calculates a feature quantity of the subject speech on the basis of analysis results of the multiple pseudo frequency analyses. The comparator compares the feature quantity of the subject speech with a reference feature quantity calculated from reference speech and generates a comparison result. The sensation index calculator calculates a sensation index representing a sensation received from the subject speech on the basis of the comparison result. | 03-24-2016 |
20160093297 | METHOD AND APPARATUS FOR EFFICIENT, LOW POWER FINITE STATE TRANSDUCER DECODING - A system, apparatus and method for efficient, low power, finite state transducer decoding. For example, one embodiment of a system for performing speech recognition comprises: a processor to perform feature extraction on a plurality of digitally sampled speech frames and to responsively generate a feature vector; an acoustic model likelihood scoring unit communicatively coupled to the processor over a communication interconnect to compare the feature vector against a library of models of various known speech sounds and responsively generate a plurality of scores representing similarities between the feature vector and the models; and a weighted finite state transducer (WFST) decoder communicatively coupled to the processor and the acoustic model likelihood scoring unit over the communication interconnect to perform speech decoding by traversing a WFST graph using the plurality of scores provided by the acoustic model likelihood scoring unit. | 03-31-2016 |
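The WFST traversal can be illustrated with a toy best-path search that adds per-frame acoustic costs to arc weights. This is a drastic simplification of a real WFST decoder (no composition, epsilon arcs, or beam pruning); the graph, labels, and costs are invented for illustration:

```python
def decode(graph, start, finals, frame_scores):
    """Traverse a small weighted graph frame by frame: each frame, extend
    every active path along arcs whose label is a sound, adding that sound's
    acoustic cost for the frame; return the cheapest label sequence."""
    # Active paths: state -> (total_cost, label_sequence).
    active = {start: (0.0, [])}
    for scores in frame_scores:
        nxt = {}
        for state, (cost, labels) in active.items():
            for (src, label, dst, weight) in graph:
                if src != state:
                    continue
                c = cost + weight + scores[label]
                if dst not in nxt or c < nxt[dst][0]:
                    nxt[dst] = (c, labels + [label])  # keep best path per state
        active = nxt
    best = min((active[s] for s in finals if s in active), key=lambda t: t[0])
    return best[1]

# Tiny graph accepting "go" (g-ow) or "no" (n-ow); arcs: (src, label, dst, weight).
graph = [(0, "g", 1, 0.0), (0, "n", 1, 0.1),
         (1, "ow", 2, 0.0)]
# Per-frame acoustic costs from the scoring unit (lower = better match).
frames = [{"g": 0.2, "n": 1.5, "ow": 2.0},
          {"g": 2.0, "n": 2.0, "ow": 0.3}]
labels = decode(graph, start=0, finals=[2], frame_scores=frames)
```

The decoder picks the "g"-"ow" path because its summed graph-plus-acoustic cost is lowest, mirroring how the WFST decoder combines the scoring unit's similarities with the graph's weights.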