Patent application number | Description | Published |
20080243481 | Large Language Models in Machine Translation - Systems, methods, and computer program products for machine translation are provided. In some implementations a system is provided. The system includes a language model including a collection of n-grams from a corpus, each n-gram having a corresponding relative frequency in the corpus and an order n corresponding to a number of tokens in the n-gram, each n-gram corresponding to a backoff n-gram having an order of n-1 and a collection of backoff scores, each backoff score associated with an n-gram, the backoff score determined as a function of a backoff factor and a relative frequency of a corresponding backoff n-gram in the corpus. | 10-02-2008 |
20110129153 | Identifying Matching Canonical Documents in Response to a Visual Query - A server system receives a visual query from a client system. The visual query is an image containing text such as a picture of a document. At the receiving server or another server, optical character recognition (OCR) is performed on the visual query to produce text recognition data representing textual characters. Each character in a contiguous region of the visual query is individually scored according to its quality. The quality score of a respective character is influenced by the quality scores of neighboring or nearby characters. Using the scores, one or more high quality strings of characters are identified. Each high quality string has a plurality of high quality characters. A canonical document containing the one or more high quality textual strings is retrieved. At least a portion of the canonical document is sent to the client system. | 06-02-2011 |
20110202330 | Compound Splitting - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for decompounding compound words are disclosed. In one aspect, a method includes obtaining a token that includes a sequence of characters, identifying two or more candidate sub-words that are constituents of the token, and one or more morphological operations that are required to transform the sub-words into the token, where at least one of the morphological operations involves a use of a non-dictionary word, and determining a cost associated with each sub-word and a cost associated with each morphological operation. | 08-18-2011 |
20120047172 | PARALLEL DOCUMENT MINING - A technique includes providing a collection of documents in multiple languages, identifying, from the collection of documents, a group of candidate documents, where each candidate document in the group shares multiple corresponding rare features, evaluating pairs of candidate documents in the group using multiple common features present in the collection of documents, and determining, based on evaluating the pairs of candidate documents, whether each pair of candidate documents corresponds to a translated pair of documents. | 02-23-2012 |
20120128250 | Generating a Combination of a Visual Query and Matching Canonical Document - A server system receives a visual query from a client system distinct from the server system, performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters, including a plurality of textual characters in a contiguous region of the visual query, and scores each textual character in the plurality of textual characters. The server system identifies, in accordance with the scoring, one or more high quality textual strings, each comprising a plurality of high quality textual characters from among the plurality of textual characters in the contiguous region of the visual query; retrieves a canonical document having the one or more high quality textual strings; generates a combination of the visual query and at least a portion of the canonical document; and sends the combination to the client system. | 05-24-2012 |
20120128251 | Identifying Matching Canonical Documents Consistent with Visual Query Structural Information - A server system receives a visual query from a client system, performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters, including a plurality of textual characters in a contiguous region of the visual query. The server system also produces structural information associated with the textual characters in the visual query. Textual characters in the plurality of textual characters are scored. The method further includes identifying, in accordance with the scoring, one or more high quality textual strings, each comprising a plurality of high quality textual characters from among the plurality of textual characters in the contiguous region of the visual query. A canonical document that includes the one or more high quality textual strings and that is consistent with the structural information is retrieved. At least a portion of the canonical document is sent to the client system. | 05-24-2012 |
20120134590 | Identifying Matching Canonical Documents in Response to a Visual Query and in Accordance with Geographic Information - A server system receives a visual query from a client system distinct from the server system. The server system performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters, including a plurality of textual characters in a contiguous region of the visual query. The server system scores each textual character in the plurality of textual characters in accordance with the geographic location of the client system. The server system identifies, in accordance with the scoring, one or more high quality textual strings, each comprising a plurality of high quality textual characters from among the plurality of textual characters in the contiguous region of the visual query. Then the server system retrieves a canonical document having the one or more high quality textual strings and sends at least a portion of the canonical document to the client system. | 05-31-2012 |
20130346059 | LARGE LANGUAGE MODELS IN MACHINE TRANSLATION - Systems, methods, and computer program products for machine translation are provided. In some implementations a system is provided. The system includes a language model including a collection of n-grams from a corpus, each n-gram having a corresponding relative frequency in the corpus and an order n corresponding to a number of tokens in the n-gram, each n-gram corresponding to a backoff n-gram having an order of n−1 and a collection of backoff scores, each backoff score associated with an n-gram, the backoff score determined as a function of a backoff factor and a relative frequency of a corresponding backoff n-gram in the corpus. | 12-26-2013 |
20140334746 | Identifying Matching Canonical Documents Consistent With Visual Query Structural Information - A server system receives a visual query from a client system, performs optical character recognition (OCR) on the visual query to produce text recognition data representing textual characters, including a plurality of textual characters in a contiguous region of the visual query. The server system also produces structural information associated with the textual characters in the visual query. Textual characters in the plurality of textual characters are scored. The method further includes identifying, in accordance with the scoring, one or more high quality textual strings, each comprising a plurality of high quality textual characters from among the plurality of textual characters in the contiguous region of the visual query. A canonical document that includes the one or more high quality textual strings and that is consistent with the structural information is retrieved. At least a portion of the canonical document is sent to the client system. | 11-13-2014 |