Patent application number | Description | Published |
20110295854 | AUTOMATIC REFINEMENT OF INFORMATION EXTRACTION RULES - A method and system for automatically refining information extraction (IE) rules. A provenance graph for IE rules on a set of test documents is determined. The provenance graph indicates a sequence of evaluations of the IE rules that generates an output of each operator of the IE rules. Based on the provenance graph, high-level rule changes (HLCs) of the IE rules are determined. Low-level rule changes (LLCs) of the IE rules are determined to specify how to implement the HLCs. Each LLC specifies changing an operator's structure or inserting a new operator in between two operators. Based on how the LLCs affect the IE rules and previously received correct results of applying the rules on the test documents, a ranked list of the LLCs is determined. The IE rules are refined based on the ranked list. | 12-01-2011 |
20130226841 | EXTRACTION OF INFORMATION FROM CLINICAL REPORTS - A method for extracting information from electronic documents, including: learning terms and term variants from a training corpus, wherein the terms and the term variants correspond to a specialized dictionary related to the training corpus; generating a list of negative indicators found in the training corpus; performing a partial match of the terms and the term variants in a set of electronic documents to create initial match results; and performing a negation test using the negative indicators and a positive terms test using the terms and the term variants on the initial match results to remove matches from the initial match results that fail either the negation test or the positive terms test, resulting in final match results. | 08-29-2013 |
20130226843 | EXTRACTION OF INFORMATION FROM CLINICAL REPORTS - A method for extracting information from electronic documents, including: learning terms and term variants from a training corpus, wherein the terms and the term variants correspond to a specialized dictionary related to the training corpus; generating a list of negative indicators found in the training corpus; performing a partial match of the terms and the term variants in a set of electronic documents to create initial match results; and performing a negation test using the negative indicators and a positive terms test using the terms and the term variants on the initial match results to remove matches from the initial match results that fail either the negation test or the positive terms test, resulting in final match results. | 08-29-2013 |
20130318075 | DICTIONARY REFINEMENT FOR INFORMATION EXTRACTION - A method for refining a dictionary for information extraction, the operations including: inputting a set of extracted results from execution of an extractor comprising the dictionary on a collection of text, wherein the extracted results are labeled as correct results or incorrect results; processing the extracted results using an algorithm configured to set a score of the extractor above a score threshold, wherein the score threshold balances a precision and a recall of the extractor; and outputting a set of candidate dictionary entries corresponding to a full set of dictionary entries, wherein the candidate dictionary entries are candidates to be removed from the dictionary based on the extracted results. | 11-28-2013 |
20130318076 | REFINING A DICTIONARY FOR INFORMATION EXTRACTION - A method for refining a dictionary for information extraction, the operations including: inputting a set of extracted results from execution of an extractor comprising the dictionary on a collection of text, wherein the extracted results are labeled as correct results or incorrect results; processing the extracted results using an algorithm configured to set a score of the extractor above a score threshold, wherein the score threshold balances a precision and a recall of the extractor; and outputting a set of candidate dictionary entries corresponding to a full set of dictionary entries, wherein the candidate dictionary entries are candidates to be removed from the dictionary based on the extracted results. | 11-28-2013 |
20140143661 | BUILDING AND MAINTAINING INFORMATION EXTRACTION RULES - Methods and arrangements for managing development of information extraction rules. One or more documents are opened for extraction. An interface is provided to create a label and thereupon label a portion of the document. The created label is stored, and an extractor is developed based on the labeling. A test interface is provided for the extractor, and results of a test conducted through the test interface are displayed. The extractor is exported. In accordance with at least one embodiment, developers are presented with eased automated guidance to write extractors, which thereby reduces an overall manual effort involved in extractor development. Generally, a focused, tutorial-type environment serves as a guide based on previously developed best practices. | 05-22-2014 |
20150096041 | IDENTIFYING AND RANKING PIRATED MEDIA CONTENT - A computer identifies and ranks URL hyperlinks to possible pirated media content by searching a web page from a first website for one or more indicator keywords, wherein a strength of an indicator keyword is related to a likelihood of pirated media content. Responsive to locating a plurality of instances of the one or more indicator keywords, identifying a plurality of hyperlinks respectively associated with one or more of the plurality of instances. Weighting the identified plurality of hyperlinks based on at least one of: a strength of associated indicator keywords, number of associated indicator keywords, number of times each hyperlink was identified, and date of posting. Ranking the plurality of hyperlinks according to weight indicating a relative likelihood that respective hyperlinks point to pirated media content in a ranked list. | 04-02-2015 |
20160117522 | PROBABILISTIC SURFACING OF POTENTIALLY SENSITIVE IDENTIFIERS - Probabilistic surfacing of potentially sensitive identifiers is provided. In one embodiment of the present invention, a method of and computer program product for surfacing of potentially sensitive identifiers are provided. An input string is read. The input string has a length. The input string is divided into a plurality of tokens. Each of the tokens has a predetermined length. A score is determined for each of the plurality of tokens. A composite score is determined based on the scores of each of the plurality of tokens. Whether the input string comprises an identifier is determined by comparing the composite score to a predetermined threshold. | 04-28-2016 |
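
The flow described in the last entry (20160117522) reads an input string, divides it into tokens of a predetermined length, scores each token, combines the scores, and compares the result to a threshold. The following is a minimal sketch of that flow, not the method claimed in the application. The abstract does not say how tokens are formed or scored, so the overlapping character n-grams, the digit-fraction scoring function, the mean as the composite, and the default values are all illustrative assumptions, and every function name is hypothetical.

```python
from typing import Callable, List


def tokenize(text: str, token_len: int) -> List[str]:
    """Split text into overlapping tokens of a fixed, predetermined length.

    Overlapping character n-grams are an assumption; the abstract only
    states that each token has a predetermined length.
    """
    if token_len <= 0:
        raise ValueError("token_len must be positive")
    if len(text) < token_len:
        return []
    return [text[i:i + token_len] for i in range(len(text) - token_len + 1)]


def digit_fraction(token: str) -> float:
    """Hypothetical per-token score: the share of characters that are digits."""
    return sum(ch.isdigit() for ch in token) / len(token)


def composite_score(
    text: str,
    token_len: int = 3,
    score_token: Callable[[str], float] = digit_fraction,
) -> float:
    """Combine per-token scores into one value (here, their arithmetic mean)."""
    tokens = tokenize(text, token_len)
    if not tokens:
        return 0.0
    return sum(score_token(t) for t in tokens) / len(tokens)


def looks_like_identifier(
    text: str, threshold: float = 0.5, token_len: int = 3
) -> bool:
    """Flag the input when its composite score reaches the threshold."""
    return composite_score(text, token_len) >= threshold
```

In this sketch a string of digits such as `"4111111111111111"` is flagged, while ordinary prose such as `"hello world"` is not. A practical scorer would more plausibly use token statistics learned from known identifiers in place of `digit_fraction`, which is why the scoring function is passed in as a parameter.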