Patent application number | Description | Published |
20090076989 | AUTOMATED CLASSIFICATION ALGORITHM COMPRISING AT LEAST ONE INPUT-INVARIANT PART - A classification algorithm is separated into one or more input-invariant parts and one or more input-dependent classification parts. The input-invariant parts of the classification algorithm capture the underlying and unchanging relationships between the plurality of data elements being operated upon by the classification algorithm, whereas the one or more classification parts embody the probabilistic labeling of the data elements according to the various classifications. For any given iteration, a user's input is used to modify at least one classification part of the algorithm. Recalculated classification parts (i.e., updated classification results) are determined based on computationally simple combinations of the one or more modified classification parts and the one or more input-invariant parts. Preferably, a graphical user interface is used to solicit user input. In this manner, wait times between user feedback iterations can be dramatically reduced, thereby making application of active learning to classification tasks a practical reality. | 03-19-2009 |
20100162402 | DATA ANONYMIZATION BASED ON GUESSING ANONYMITY - Privacy is defined in the context of a guessing game based on the so-called guessing inequality. The privacy of a sanitized record, i.e., guessing anonymity, is defined by the number of guesses an attacker needs to correctly guess an original record used to generate a sanitized record. Using this definition, optimization problems are formulated that optimize a second anonymization parameter (privacy or data distortion) given constraints on a first anonymization parameter (data distortion or privacy, respectively). Optimization is performed across a spectrum of possible values for at least one noise parameter within a noise model. Noise is then generated based on the noise parameter value(s) and applied to the data, which may comprise real and/or categorical data. Prior to anonymization, the data may have identifiers suppressed, whereas outlier data values in the noise perturbed data may be likewise modified to further ensure privacy. | 06-24-2010 |
20100169375 | Entity Assessment and Ranking - General entity retrieval and ranking is described. A first set of documents is retrieved from one or more document repositories based on a query formed according to a topic. The first set of documents is characterized based on its first set of metadata values. One or more candidate entities are identified based on the first set of documents and the original query is thereafter augmented according to a candidate entity. The second set of documents resulting from the augmented query is then characterized in a similar manner. For each candidate entity, the first and second document set characterizations are compared to determine their degree of similarity. Increasingly similar document set characterizations indicate that the candidate entity is increasingly relevant to the original query. Repeating this process for each of the one or more candidate entities can give rise to rankings according to the respective degrees of similarity. | 07-01-2010 |
20110054925 | CLAIMS ANALYTICS ENGINE - Methods and systems for processing claims (e.g., healthcare insurance claims) are described. For example, prior to payment of an unpaid claim, a prediction is made as to whether or not an attribute specified in the claim is correct. Depending on the prediction results, the claim can be flagged for an audit. Feedback from the audit can be used to update the prediction models in order to refine the accuracy of those models. | 03-03-2011 |
20110246467 | EXTRACTION OF ATTRIBUTES AND VALUES FROM NATURAL LANGUAGE DOCUMENTS - One or more classification algorithms are applied to at least one natural language document in order to extract both attributes and values of a given product. Supervised classification algorithms, semi-supervised classification algorithms, unsupervised classification algorithms or combinations of such classification algorithms may be employed for this purpose. The at least one natural language document may be obtained via a public communication network. Two or more attributes (or two or more values) thus identified may be merged to form one or more attribute phrases or value phrases. Once attributes and values have been extracted in this manner, association or linking operations may be performed to establish attribute-value pairs that are descriptive of the product. In a presently preferred embodiment, an (unsupervised) algorithm is used to generate seed attributes and values which can then support a supervised or semi-supervised classification algorithm. | 10-06-2011 |
20110307429 | AUTOMATED CLASSIFICATION ALGORITHM COMPRISING AT LEAST ONE INPUT-INVARIANT PART - A classification algorithm is separated into one or more input-invariant parts and one or more input-dependent classification parts. Classifiable electronic data is obtained via a communication network. Using the classification algorithm, classifications of a plurality of data elements in the classifiable data are identified, where the at least one classification part incorporates user input concerning classification of at least one data element of the plurality of data elements. | 12-15-2011 |
20120078908 | PROCESSING A REUSABLE GRAPHIC IN A DOCUMENT - A method and apparatus are provided for processing a graphic in a document so that the graphic may be reused in a different application than the one it was originally used in. For a given document, a graphic may be identified from within the document and extracted from the document. The extracted graphic may be stored in a suitable storage medium, such as a reusable graphic repository. A structural feature associated with the extracted graphic may also be extracted. The extracted graphic may then be classified based on the extracted structural feature. Furthermore, a method and apparatus are provided for generating a reusable graphic from a document. | 03-29-2012 |
20120179453 | PREPROCESSING OF TEXT - Performance of statistical machine learning techniques, particularly classification techniques applied to the extraction of attributes and values concerning products, is improved by preprocessing a body of text to be analyzed to remove extraneous information. The body of text is split into a plurality of segments. In an embodiment, sentence identification criteria are applied to identify sentences as the plurality of segments. Thereafter, the plurality of segments are clustered to provide a plurality of clusters. One or more of the resulting clusters are then analyzed to identify segments having low relevance to their respective clusters. Such low relevance segments are then removed from their respective clusters and, consequently, from the body of text. As the resulting relevance-filtered body of text no longer includes portions of the body of text containing mostly extraneous information, the reliability of any subsequent statistical machine learning techniques may be improved. | 07-12-2012 |
20120179633 | IDENTIFICATION OF ATTRIBUTES AND VALUES USING MULTIPLE CLASSIFIERS - A body of text comprises a plurality of unknown attributes and a plurality of unknown values. A first classification sub-component labels a first portion of the plurality of unknown values as a first set of values, whereas a second classification sub-component labels a portion of the plurality of unknown attributes as a set of attributes and a second portion of the plurality of unknown values as a second set of values. Learning models implemented by the first and second classification subcomponents are updated based on the set of attributes and the first and second set of values. The first classification sub-component implements at least one supervised classification technique, whereas the second classification sub-component implements an unsupervised and/or semi-supervised classification technique. Active learning may be employed to provide at least one of a corrected attribute and/or corrected value that may be used to update the learning models. | 07-12-2012 |
20120239380 | Classification-Based Redaction in Natural Language Text - When redacting natural language text, a classifier is used to provide a sensitive concepts model according to features in the natural language text, in which the various classes employed are sensitive concepts reflected in the natural language text. Similarly, the classifier is used to provide a utility concepts model based on utility concepts. Based on these models, and for at least one identified sensitive concept and at least one identified utility concept, at least one feature in the natural language text is identified that implicates the at least one identified sensitive concept more than the at least one identified utility concept. At least some of the features thus identified may be perturbed such that the modified natural language text may be provided as at least one redacted document. In this manner, features are perturbed to maximize classification error for sensitive concepts while simultaneously minimizing classification error in the utility concepts. | 09-20-2012 |
20130018651 | PROVISION OF USER INPUT IN SYSTEMS FOR JOINTLY DISCOVERING TOPICS AND SENTIMENTS (Applicants: Divna Djordjevic, Antibes, FR; Rayid Ghani, Chicago, IL, US; Marko Krema, Evanston, IL, US) - A generative model is used to develop at least one topic model and at least one sentiment model for a body of text. The at least one topic model is displayed such that, in response, a user may provide user input indicating modifications to the at least one topic model. Based on the received user input, the generative model is used to provide at least one updated topic model and at least one updated sentiment model. Thereafter, the at least one updated topic model may again be displayed in order to solicit further user input, which further input is then used to once again update the models. The at least one updated topic model and the at least one updated sentiment model may be employed to analyze target text in order to identify topics and associated sentiments therein. | 01-17-2013 |
20130018824 | SENTIMENT CLASSIFIERS BASED ON FEATURE EXTRACTION (Applicants: Rayid Ghani, Chicago, IL, US; Marko Krema, Evanston, IL, US) - A method and apparatus are provided for generating one or more sentiment classifiers from training data using supervised classification techniques based on features extracted from the training data. Training data includes a plurality of units such as, but not limited to, documents, paragraphs, sentences, and clauses. A feature extraction component extracts a plurality of features from the training data, and a feature value determination component determines a value for each extracted feature based on a frequency at which each feature occurs in the training data. Separately, a class labeling component labels each unit of the training data according to a plurality of sentiment classes to provide labeled training data. Thereafter, a sentiment classifier generation component provides at least one sentiment classifier based on the value of each extracted feature and the labeled training data using a supervised classification technique. | 01-17-2013 |
20130018825 | DETERMINATION OF A BASIS FOR A NEW DOMAIN MODEL BASED ON A PLURALITY OF LEARNED MODELS (Applicants: Rayid Ghani, Chicago, IL, US; Marko Krema, Evanston, IL, US) - In a machine learning system in which a plurality of learned models, each corresponding to a unique domain, already exist, new domain input for training a new domain model may be provided. Statistical characteristics of features in the new domain input are first determined. The resulting new domain statistical characteristics are then compared with statistical characteristics of features in prior input previously provided for training at least some of the plurality of learned models. Thereafter, at least one learned model of the plurality of learned models is identified as the basis for the new domain model when the new domain input statistical characteristics compare favorably with the statistical characteristics of the features in the prior input corresponding to the at least one learned model. | 01-17-2013 |
20130041896 | CONTEXT AND PROCESS BASED SEARCH RANKING - A search ranking system may include a context mining module to determine a set of contexts based on a profile of information rankable by the system and an access history of users that have accessed at least some of the information. A context detection module may compare an association of a user conducting a search with one or more of the contexts and rank search results based on the comparison. | 02-14-2013 |
20140095466 | ENTITY ASSESSMENT AND RANKING - General entity retrieval and ranking is described. A first set of documents is retrieved from one or more document repositories based on a query formed according to a topic. The first set of documents is characterized based on its first set of metadata values. One or more candidate entities are identified based on the first set of documents and the original query is thereafter augmented according to a candidate entity. The second set of documents resulting from the augmented query is then characterized in a similar manner. For each candidate entity, the first and second document set characterizations are compared to determine their degree of similarity. Increasingly similar document set characterizations indicate that the candidate entity is increasingly relevant to the original query. Repeating this process for each of the one or more candidate entities can give rise to rankings according to the respective degrees of similarity. | 04-03-2014 |
20140123304 | DATA ANONYMIZATION BASED ON GUESSING ANONYMITY - Privacy is defined in the context of a guessing game based on the so-called guessing inequality. The privacy of a sanitized record, i.e., guessing anonymity, is defined by the number of guesses an attacker needs to correctly guess an original record used to generate a sanitized record. Using this definition, optimization problems are formulated that optimize a second anonymization parameter (privacy or data distortion) given constraints on a first anonymization parameter (data distortion or privacy, respectively). Optimization is performed across a spectrum of possible values for at least one noise parameter within a noise model. Noise is then generated based on the noise parameter value(s) and applied to the data, which may comprise real and/or categorical data. Prior to anonymization, the data may have identifiers suppressed, whereas outlier data values in the noise perturbed data may be likewise modified to further ensure privacy. | 05-01-2014 |
20140380138 | PROCESSING A REUSABLE GRAPHIC IN A DOCUMENT - A method and apparatus are provided for processing a graphic in a document so that the graphic may be reused in a different application than the one it was originally used in. For a given document, a graphic may be identified from within the document and extracted from the document. The extracted graphic may be stored in a suitable storage medium, such as a reusable graphic repository. A structural feature associated with the extracted graphic may also be extracted. The extracted graphic may then be classified based on the extracted structural feature. Furthermore, a method and apparatus are provided for generating a reusable graphic from a document. | 12-25-2014 |
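
The entity-ranking abstracts listed above (20100169375 and 20140095466) describe comparing the characterization of a baseline document set with that of a set retrieved by an entity-augmented query. A minimal Python sketch of that idea follows. It is an illustration under stated assumptions, not the method claimed in the applications: the toy corpus, the `category` metadata field, the `toy_search` function, and the choice of cosine similarity as the comparison measure are all hypothetical, since the abstracts do not specify them.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy corpus; each document carries one metadata field, "category".
CORPUS = [
    {"text": "solar panel maker acme", "category": "energy"},
    {"text": "acme solar grid contract", "category": "energy"},
    {"text": "solar inverter review", "category": "energy"},
    {"text": "solar eclipse seen from zenith point", "category": "astronomy"},
    {"text": "acme anvil catalog", "category": "retail"},
]

def toy_search(query):
    """Return documents whose text contains every query term (assumed retrieval)."""
    terms = query.split()
    return [d for d in CORPUS if all(t in d["text"].split() for t in terms)]

def characterize(docs, field="category"):
    """Characterize a document set as a normalized distribution of metadata values."""
    counts = Counter(d[field] for d in docs if field in d)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def cosine(p, q):
    """Cosine similarity between two sparse distributions (assumed measure)."""
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

def rank_entities(search, query, candidates):
    """Rank candidates by how closely their augmented-query document set
    resembles the document set of the original query."""
    base = characterize(search(query))
    scored = [(e, cosine(base, characterize(search(f"{query} {e}")))) for e in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranked = rank_entities(toy_search, "solar", ["zenith", "acme"])
print(ranked)
```

In this toy data, augmenting the query with "acme" retrieves only energy documents, which resembles the mostly-energy baseline set, so "acme" ranks above "zenith". Any other distribution-similarity measure could be substituted for cosine similarity.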