Stratify, Inc. Patent Applications
Patent application number | Title | Published |
20130191111 | LANGUAGE IDENTIFICATION FOR DOCUMENTS CONTAINING MULTIPLE LANGUAGES - Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified by comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores. | 07-25-2013 |
20130185060 | PHRASE BASED DOCUMENT CLUSTERING WITH AUTOMATIC PHRASE EXTRACTION - Meaningful phrases are distinguished from groups of words that co-occur by chance by analyzing a large number of documents and applying a statistical metric, such as a mutual information metric. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents. | 07-18-2013 |
20110191347 | ADAPTIVE ROUTING OF DOCUMENTS TO SEARCHABLE INDEXES - Documents are assigned to one or more indexes in a document indexing system on the basis of document properties such as total number of tokens in the document, number of numeric tokens in the document, number of alphabetic tokens in the document, size of the document, and metadata associated with the document. Based on statistical distributions of document properties (over a large number of documents), different indexes can be defined, and a document router can direct a particular document to one index or another based on the properties of the particular document. In some implementations, certain document properties may be used to identify a nonrelevant document, or garbage document, so that it is either not indexed or assigned to an index dedicated for such documents. | 08-04-2011 |
20110191098 | PHRASE-BASED DOCUMENT CLUSTERING WITH AUTOMATIC PHRASE EXTRACTION - Meaningful phrases are distinguished from groups of words that co-occur by chance by analyzing a large number of documents and applying a statistical metric, such as a mutual information metric. In some embodiments, multiple lists of candidate phrases are maintained to optimize the storage requirement of the phrase-identification algorithm. After phrase identification, a combination of words and meaningful phrases can be used to construct clusters of documents. | 08-04-2011 |
20110087669 | COMPOSITE LOCALITY SENSITIVE HASH BASED PROCESSING OF DOCUMENTS - Reliable identification of highly similar documents allows such documents to be treated as identical for purposes of document analysis. Identification of highly similar documents can be based on a composite hash value or other value for which the likelihood of two documents having the same value is high if and only if the documents have a high degree of similarity. Prior to performing content based analysis, the composite hash value for the current document is determined and compared to composite hash values of previously analyzed documents. If a match is found, the results of the analysis of the previous document can be applied to the current document. If no match is found, the current document is analyzed. | 04-14-2011 |
20110087668 | CLUSTERING OF NEAR-DUPLICATE DOCUMENTS - Documents likely to be near-duplicates are clustered based on document vectors that represent word-occurrence patterns in a relatively low-dimensional space. Edit distance between documents is defined based on comparing their document vectors. In one process, initial clusters are formed by applying a first edit-distance constraint relative to a root document of each cluster. The initial clusters can be merged subject to a second edit-distance constraint that limits the maximum edit distance between any two documents in the cluster. The second edit-distance constraint can be defined such that whether it is satisfied can be determined by comparing cluster structures rather than individual documents. | 04-14-2011 |
20100125448 | AUTOMATED IDENTIFICATION OF DOCUMENTS AS NOT BELONGING TO ANY LANGUAGE - An “impostor profile” for a language is used to determine whether documents are in that language or no language. The impostor profile for a given language provides statistical information about the expected results of applying a language model for one or more other (“impostor”) languages to a document that is in fact in the given language. After a most likely language for a test document is identified, the impostor profile is used together with the scores for the test document in the various impostor languages to determine whether to identify the test document as being in the most likely language or in no language. | 05-20-2010 |
20100125447 | LANGUAGE IDENTIFICATION FOR DOCUMENTS CONTAINING MULTIPLE LANGUAGES - Multiple non-overlapping languages within a single document can be identified. In one embodiment, for each of a set of candidate languages, a set of non-overlapping languages is defined. The document is analyzed under two competing hypotheses: that the whole document is in one language, and that part of the document is in one language while the rest is in a different, non-overlapping language. Language(s) of the document are identified by comparing these competing hypotheses across a number of language pairs. In another embodiment, transitions between non-overlapping character sets are used to segment a document, and each segment is scored separately for a subset of candidate languages. Language(s) of the document are identified based on the segment scores. | 05-20-2010 |
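
The second embodiment in the language-identification abstracts (20130191111 and 20100125447) segments a document at transitions between non-overlapping character sets. The Python sketch below illustrates that general idea only; it is not taken from the applications. The function names `script_of` and `segment_by_script` are hypothetical, and using the first word of a character's Unicode name as its script label is a simplifying assumption. Scoring each segment against candidate languages is left out.

```python
import unicodedata


def script_of(ch):
    """Return a rough script label for a letter, or None for neutral characters.

    Assumption: the first word of the Unicode character name (for example
    "LATIN" or "CYRILLIC") is a good enough stand-in for the character set.
    """
    if not ch.isalpha():
        return None  # digits, spaces, punctuation do not trigger a transition
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def segment_by_script(text):
    """Split text into (script, substring) pairs at script transitions.

    Neutral characters stay attached to the segment that precedes them.
    Text that contains no letters at all yields an empty list.
    """
    segments = []
    current_script = None
    start = 0
    for i, ch in enumerate(text):
        script = script_of(ch)
        if script is None:
            continue
        if current_script is None:
            current_script = script
        elif script != current_script:
            # Close the running segment and start a new one at this letter.
            segments.append((current_script, text[start:i]))
            current_script = script
            start = i
    if current_script is not None:
        segments.append((current_script, text[start:]))
    return segments


if __name__ == "__main__":
    for script, chunk in segment_by_script("Hello world. Привет мир."):
        print(script, repr(chunk))
```

Each resulting segment could then be scored only against languages written in that script (for instance, a Cyrillic segment against Russian and Ukrainian models), which is one plausible reading of "a subset of candidate languages" in the abstracts.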