Patent application number | Description | Published |
20090055168 | Word Detection - Methods, systems, and apparatus, including computer program products, in which data from web documents are partitioned into a training corpus and a development corpus are provided. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison. | 02-26-2009 |
20090055381 | Domain Dictionary Creation - Methods, systems, and apparatus, including computer program products, to identify topic words in a document corpus that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on the document corpus and the topic document corpus is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document corpus and the topic document corpus. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value. | 02-26-2009 |
20090070097 | USER INPUT CLASSIFICATION - Systems and methods of classifying user input are disclosed. The user input can be, for example, in the form of Roman characters. An ambiguous word (e.g., a word that is a non-pinyin word written in Roman characters and a valid pinyin word) can be identified in the user input. Contextual words (e.g., words adjacent to the ambiguous word) are classified as a pinyin context or a non-pinyin context. The ambiguous word is classified based on the context of the contextual words. | 03-12-2009 |
20090077037 | SUGGESTING ALTERNATIVE QUERIES IN QUERY RESULTS - Methods, systems, and apparatus, including computer program products, for suggesting alternative queries based on original query search results. In one aspect, a method includes: receiving search results for a first query, where each search result refers to a respective resource and includes a snippet of content from the respective resource; receiving one or more suggested second queries; for each of the suggested second queries, selecting a set of words in one of the snippets to represent the suggested second query, associating the suggested second query with the set so that a user can interact with a word in the set to invoke the suggested second query, and marking the set so as to indicate that the user can interact with a word in the set to invoke the suggested second query; and transmitting the search results including each marked set to a client device for presentation to the user. | 03-19-2009 |
20100005086 | RESOURCE LOCATOR SUGGESTIONS FROM INPUT CHARACTER SEQUENCE - Methods, systems, and apparatus, including computer program products, in which an input method editor receives Roman character inputs, identifies keywords for candidate sets of non-Roman characters, and identifies an associated resource location. Upon identifying an associated resource location, the resource location is associated with the candidate set of non-Roman characters. | 01-07-2010 |
20100180199 | DETECTING NAME ENTITIES AND NEW WORDS - Various aspects can be implemented for detecting name entities and/or new words from input entries. In general, one aspect can be a method that includes receiving an input entry comprising a text string. The method also includes identifying segmentation information from the input entry. The method further includes generating a candidate text string from the text string of the input entry based on the segmentation information. Other implementations of this aspect include corresponding systems, apparatus, and processing engines. | 07-15-2010 |
20100306139 | CJK NAME DETECTION - Aspects directed to name detection are provided. A method includes generating a raw name detection model using a collection of family names and an annotated corpus including a collection of n-grams, each n-gram having a corresponding probability of occurring. The method includes applying the raw name detection model to a collection of semi-structured data to form annotated semi-structured data identifying n-grams identifying names and n-grams not identifying names and applying the raw name detection model to a large unannotated corpus to form large annotated corpus data identifying n-grams of the large unannotated corpus identifying names and n-grams not identifying names. The method includes generating a name detection model, including deriving a name model using the annotated semi-structured data identifying names and the large annotated corpus data identifying names, deriving a not-name model using the semi-structured data not identifying names, and deriving a language model using the large annotated corpus. | 12-02-2010 |
20110022952 | Determining Proximity Measurements Indicating Respective Intended Inputs - Determination of proximity measurements indicative of respective intended inputs is disclosed. User inputs are received, where each user input is one of a predefined plurality of inputs that each map to multiple characters in a language. Rates of user selections of candidates decoded from the user inputs into the language are received, where each of the candidates includes one or more characters in the language. User inputs for the candidates having low rates of selection are identified as non-selected user inputs. User inputs for the candidates having high rates of selection are identified as intended inputs. The intended inputs are compared with the non-selected user inputs to identify one or more pairs of a misspelled input and an intended input. A proximity measurement for each such pair is determined based on a ratio of the number of times corresponding candidates for the misspelled input were not selected to the number of times the misspelled input was entered. | 01-27-2011 |
20110137642 | Word Detection - Methods, systems, and apparatus, including computer program products, in which data from web documents are partitioned into a training corpus and a development corpus are provided. First word probabilities for words are determined for the training corpus, and second word probabilities for the words are determined for the development corpus. Uncertainty values based on the word probabilities for the training corpus and the development corpus are compared, and new words are identified based on the comparison. | 06-09-2011 |
20110238413 | DOMAIN DICTIONARY CREATION - Methods, systems, and apparatus, including computer program products, to identify topic words in a collection of documents that includes topic documents related to a topic are disclosed. A reference topic word divergence value based on a document collection and the topic document collection is determined. A candidate topic word divergence value for a candidate topic word is determined based on the document collection and the topic document collection. The candidate topic word is determined to be a topic word if the candidate topic word divergence value is greater than the reference topic word divergence value. | 09-29-2011 |
20110296374 | CUSTOM LANGUAGE MODELS - Systems, methods, and apparatuses including computer program products for generating a custom language model. In one implementation, a method is provided. The method includes receiving a collection of documents; clustering the documents into one or more clusters; generating a cluster vector for each cluster of the one or more clusters; generating a target vector associated with a target profile; comparing the target vector with each of the cluster vectors; selecting one or more of the one or more clusters based on the comparison; and generating a language model using documents from the one or more selected clusters. | 12-01-2011 |
20130103696 | Suggesting and Refining User Input Based on Original User Input - Systems and methods to generate modified/refined user inputs based on the original user input, such as a search query, are disclosed. The method may be implemented for Roman-based and/or non-Roman-based languages such as Chinese. The method may generally include receiving an original user input and identifying core terms therein, determining potential alternative inputs by replacing core term(s) in the original input with another term according to a similarity matrix and/or substituting a word sequence in the original input with another word sequence according to an expansion/contraction table where one word sequence is a substring of the other, computing the likelihood of each potential alternative input, and selecting the most likely alternative inputs according to a predetermined criterion, e.g., the likelihood of the alternative input being at least that of the original input. A cache containing pre-computed original user inputs and corresponding alternative inputs may be provided. | 04-25-2013 |
20140012839 | SUGGESTING ALTERNATIVE QUERIES IN QUERY RESULTS - Methods, systems, and apparatus, including computer program products, for suggesting alternative queries based on original query search results. In one aspect, a method includes: receiving search results for a first query, where each search result refers to a respective resource and includes a snippet of content from the respective resource; receiving one or more suggested second queries; for each of the suggested second queries, selecting a set of words in one of the snippets to represent the suggested second query, associating the suggested second query with the set so that a user can interact with a word in the set to invoke the suggested second query, and marking the set so as to indicate that the user can interact with a word in the set to invoke the suggested second query; and transmitting the search results including each marked set to a client device for presentation to the user. | 01-09-2014 |
20140258892 | RESOURCE LOCATOR SUGGESTIONS FROM INPUT CHARACTER SEQUENCE - Methods, systems, and apparatus, including computer program products, in which an input method editor receives Roman character inputs, identifies keywords for candidate sets of non-Roman characters, and identifies an associated resource location. Upon identifying an associated resource location, the resource location is associated with the candidate set of non-Roman characters. | 09-11-2014 |
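
The abstract of application 20110022952 in the listing above describes its proximity measurement as a ratio of two counts for a misspelled input. A minimal sketch of that ratio, for illustration only: the function name, parameter names, and input validation are assumptions made here and do not come from the application, which may define further steps around this computation.

```python
def proximity_measurement(not_selected_count: int, entered_count: int) -> float:
    """Ratio described in the abstract (names here are hypothetical).

    not_selected_count: times the candidates decoded from the misspelled
        input were not selected by the user.
    entered_count: times the misspelled input was entered.
    """
    if entered_count <= 0:
        # Assumption: an input that was never entered has no defined ratio.
        raise ValueError("entered_count must be positive")
    if not_selected_count < 0:
        raise ValueError("not_selected_count must be non-negative")
    return not_selected_count / entered_count


# Example with made-up counts: entered 4 times, candidates rejected 3 times.
print(proximity_measurement(3, 4))
```

Under this reading, a value near 1 means users almost never accepted the candidates offered for that input, which is consistent with treating it as a misspelling of the paired intended input.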