Patent application number | Description | Published |
20090307630 | System And Method for Processing A Message Store For Near Duplicate Messages - A system and method for processing a message store for near duplicate messages is provided. Metadata, content, and each attachment associated with messages are extracted. Near duplicate messages in the message store are identified. Compound digests taken of the metadata for, the content contained in, and each attachment associated with each of the messages in the message store are compared. Each message having a compound digest not matching the compound digest of any other message is marked as unique and each message having a compound digest matching the compound digest of at least one other message is marked as an exact duplicate. Messages remaining unmarked and having similar content are grouped into sets that each include one or more near duplicate messages. One of the near duplicate messages is designated as unique and each remaining near duplicate message in the set is designated as a near duplicate. | 12-10-2009 |
20100030715 | Social Network Model for Semantic Processing - A social network model, based on data relevant to a user, is used for semantic processing to enable improved entity recognition among text accessed by the user. An entity extraction module of the server, with reference to a general training corpus, general gazetteers, user-specific gazetteers, and entity models, parses text to identify entities. The entities may be, for example, people, organizations, or locations. A social network module of the server builds the social network model implicit in the data accessed by the user. The social network model includes the relationships between entities and an indication of the strength of each relationship. The social network module is also used to disambiguate names and unify entities based on the social network model. | 02-04-2010 |
20100049708 | System And Method For Scoring Concepts In A Document Set - A system and method for scoring concepts in a document set is provided. Concepts including two or more terms extracted from the document set are identified. Each document having one or more of the concepts is designated as a candidate seed document. A score is calculated for each of the concepts identified within each candidate seed document based on a frequency of occurrence, concept weight, structural weight, and corpus weight. A vector is formed for each candidate seed document. The vector is compared with a center of one or more clusters each comprising thematically-related documents. At least one of the candidate seed documents that is sufficiently distinct from the other candidate seed documents is selected as a seed document for a new cluster. Each of the unselected candidate seed documents is placed into one of the clusters having a most similar cluster center. | 02-25-2010 |
20110022597 | System And Method For Thematically Grouping Documents Into Clusters - A system and method for thematically grouping documents into clusters is provided. Concepts are extracted from a plurality of documents. The concepts include nouns or noun phrases. The number of occurrences of each concept is determined within each document. A bounded range is applied to the concepts and a subset of the concepts is selected by removing the concepts that fall outside the bounded range. The bounded range includes upper edge conditions and lower edge conditions. Themes are generated from the subset of concepts by identifying two or more concepts with common semantic meaning. Clusters of the documents are generated based on the themes. | 01-27-2011 |
20110067037 | System And Method For Processing Message Threads - A system and method for processing message threads is provided. A plurality of messages, each comprising a message body, is grouped by conversation thread. The message bodies of the messages are compared. Each message recursively contained in at least one other message is identified as a near duplicate message. An attachment sequence is generated for at least part of each attachment associated with one or more of the messages. The attachment sequences associated with the near duplicate messages are compared. Each near duplicate message having an attachment sequence not matching the attachment sequence of any other near duplicate message is identified as a unique message. | 03-17-2011 |
20110320453 | System And Method For Grouping Similar Documents - A system and method for grouping similar documents is provided. Frequencies of occurrences are determined for terms and noun phrases within a set of documents. A subset of the documents is selected by removing those documents having terms and noun phrases that fall outside a bounded range of upper and lower conditions for frequency of occurrence. Each of the documents in the subset is mapped to a cluster of documents based on a similarity of the documents to the cluster documents. | 12-29-2011 |
20120130961 | System And Method For Identifying Unique And Duplicate Messages - A system and method for identifying unique and duplicate messages is provided. Messages are maintained, and a header and message body are extracted from each of the messages. A hash code is calculated for each message over at least part of the header and the body of that message. The messages with matching hash codes are grouped. One message in each group with two or more messages is randomly selected as a unique message. The remaining messages in the group are marked as exact duplicate messages. | 05-24-2012 |
20130159300 | Computer-Implemented System and Method for Clustering Similar Documents - A computer-implemented system and method for clustering similar documents is provided. Concepts are identified for a set of documents and occurrence frequencies are determined for each concept in the document set. A distance quantifying a similarity for each of the documents in the set with one or more clusters of documents is calculated. Each document is mapped to at least one of the one or more document clusters. | 06-20-2013 |
20130268610 | Computer-Implemented System And Method For Identifying Near Duplicate Messages - A computer-implemented system and method for identifying near duplicate messages is provided. Messages each including a content body are grouped by conversation thread. One or more of the messages also includes an attachment. The messages for each conversation thread are sorted in order of message length. At least one of the messages is selected from one of the threads and the body of the selected message is compared with the body of one such shorter message in that thread. A determination is made that the body of the shorter message is included in the body of the selected message. Hash codes of the attachments for the selected message and the shorter message are compared. The shorter message is marked as a near duplicate message of the selected message when the hash codes of the attachments match. | 10-10-2013 |
20140122450 | Computer-Implemented System And Method For Identifying Duplicate And Near Duplicate Messages - A computer-implemented system and method for identifying duplicate and near duplicate messages is provided. A set of messages is obtained. A body of one such message is compared with the body of each other message. Those messages having matching bodies are identified as exact duplicates. The exact duplicates are removed from the set. The remaining messages are sorted in order of message length and a shorter message is compared with a longer message. A determination is made that the body of the shorter message is included in the body of the longer message and the shorter message is marked as a near duplicate of the longer message. | 05-01-2014 |
20140122495 | Computer-Implemented System And Method For Clustering Documents Based On Scored Concepts - A computer-implemented system and method for clustering documents based on scored concepts is provided. A set of documents is obtained and concepts are extracted from the documents. A score is calculated for each concept. The score is determined as a function of a summation of a frequency of occurrence, concept weight, structural weight, and corpus weight. The documents in the set are clustered based on the scores. A vector is formed for each document based on the concepts in that document and the scores associated with the concepts. A similarity is determined between each document and each of the other documents based on the formed vectors. Those documents that are sufficiently distinct from the other documents are identified as seed documents for separate document clusters. Each of the remaining documents is grouped into one of the clusters most similar to that remaining document. | 05-01-2014 |
20140250087 | Computer-Implemented System And Method For Identifying Relevant Documents For Display - A computer-implemented system and method for identifying relevant documents for display is provided. Themes for a set of documents are generated. The documents are clustered based on the themes. A matrix including an inner product of document frequency occurrences and cluster concept weightings for each theme is generated for the documents. From the matrix, documents most relevant to a particular theme are identified, and the relevant documents are displayed. | 09-04-2014 |
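
Several of the abstracts above (20120130961 and 20140122450 in particular) describe the same two-stage idea: group messages with matching hash codes as exact duplicates, then sort the remaining messages by length and mark a shorter message as a near duplicate when its body is contained in a longer one. The following is only a minimal sketch of that idea, not the patented implementation. The names `message_hash` and `classify_messages`, the choice of SHA-256, and the `from`/`subject` header fields are all assumptions made for illustration. Where abstract 20120130961 selects the unique message of a group at random, this sketch keeps the first one seen so the result is reproducible.

```python
import hashlib
from collections import defaultdict


def message_hash(header, body, header_fields=("from", "subject")):
    """Hash part of the header plus the body (field choice is an assumption)."""
    h = hashlib.sha256()
    for field in header_fields:
        h.update(header.get(field, "").encode("utf-8"))
        h.update(b"\x00")  # separator so adjacent fields cannot run together
    h.update(body.encode("utf-8"))
    return h.hexdigest()


def classify_messages(messages):
    """Label each message 'unique', 'exact_duplicate', or 'near_duplicate'.

    messages: list of (message_id, header_dict, body_str) tuples.
    Returns a dict mapping message_id to its label.
    """
    labels = {}

    # Stage 1: group messages whose hash codes match.
    groups = defaultdict(list)
    for msg_id, header, body in messages:
        groups[message_hash(header, body)].append((msg_id, body))

    survivors = []
    for group in groups.values():
        keeper, *rest = group  # first seen is kept (assumption; not random)
        survivors.append(keeper)
        for msg_id, _ in rest:
            labels[msg_id] = "exact_duplicate"

    # Stage 2: sort survivors longest first, then test each body for
    # containment in any longer (or equally long, earlier) body.
    survivors.sort(key=lambda m: len(m[1]), reverse=True)
    for i, (msg_id, body) in enumerate(survivors):
        contained = any(body in longer for _, longer in survivors[:i])
        labels[msg_id] = "near_duplicate" if contained else "unique"

    return labels
```

Plain substring containment is the simplest possible reading of "the body of the shorter message is included in the body of the longer message". A practical system would likely normalize quoting markers and whitespace first, and would also compare attachment hash codes as abstract 20130268610 describes; neither step is attempted here.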