Patent application number | Description | Published |
20080275833 | Link spam detection using smooth classification function - A collection of web pages is considered as a directed graph in which the pages themselves are nodes and the hyperlinks between the pages are directed edges in the graph. A trusted entity identifies training examples for spam pages and normal pages. A random walk is conducted through the directed graph that includes the collection of web pages, and the stationary probabilities and transition probabilities among the nodes in the directed graph are obtained. A classifier training component estimates a classification function that changes slowly on densely connected subgraphs within the directed graph. The classification function assigns a value to each of the nodes in the directed graph and identifies them as spam or normal pages based upon whether the value meets a given function threshold value. | 11-06-2008 |
20080275902 | Web page analysis using multiple graphs - A collection of web pages is modeled as a directed graph, in which the nodes of the graph are the web pages and directed edges are hyperlinks. Web pages can also be represented by content, or by other features, to obtain a similarity graph over the web pages, where nodes again denote the web pages and the edge between each pair of nodes is weighted by the similarity between those two nodes. A random walk is defined for each graph, and a mixture of the random walks is obtained for the set of graphs. The collection of web pages is then analyzed based on the mixture to obtain a web page analysis result. The web page analysis result can be, for example, clustering of the web pages to discover web communities, classifying or categorizing the web pages, or spam detection indicating whether a given web page is spam or content. | 11-06-2008 |
20100185649 | SUBSTANTIALLY SIMILAR QUERIES - A system described herein includes an analyzer component that analyzes queries submitted by users and corresponding URLs selected by the users, wherein the queries include a first query and a second query, and wherein the analyzer component determines that the first query and the second query are substantially similar queries. The system additionally includes a correlator component that, responsive to the analyzer component determining that the first query and the second query are substantially similar, generates correlation data that indicates that the first and second queries are substantially similar. | 07-22-2010 |
20110282816 | LINK SPAM DETECTION USING SMOOTH CLASSIFICATION FUNCTION - A spam detection system is disclosed. The system includes a classifier training component that receives a first set of training pages labeled as normal pages and a second set of training pages labeled as spam pages. The training component trains a web page classifier based on both the first set of training pages and the second set of training pages. A spam detector then receives unlabeled web pages and uses the web page classifier to classify the unlabeled web pages as spam pages or normal pages. | 11-17-2011 |
20110295589 | LOCATING PARAPHRASES THROUGH UTILIZATION OF A MULTIPARTITE GRAPH - A method is described herein that includes acts of receiving a selection of a first phrase in a first language and executing a random walk over a computer-implemented multipartite graph, wherein the multipartite graph includes a first set of nodes that are representative of phrases in the first language, a second set of nodes that are representative of phrases in a second language, and edges between nodes that are representative of relationships between the respective phrases. The random walk includes traversals over edges of the graph between nodes. The method also includes the act of indicating that a second phrase in the first language is a paraphrase of the first phrase based at least in part upon the random walk. | 12-01-2011 |
20120166366 | HIERARCHICAL CLASSIFICATION SYSTEM - The claimed subject matter provides a method for hierarchical classification. The method includes receiving a hierarchical structure with a first level comprising a parent node and a sibling node. The structure also includes a second level comprising two child nodes. The method further includes receiving training examples. Each training example may be associated with a class of the parent node, the sibling node, or the two child nodes. The method also includes generating a first classifier for the first level. The first classifier includes a first hyperplane distinguishing the parent and sibling nodes. A first vector is normal to the first hyperplane. Additionally, the method includes generating a second classifier for the second level. The second classifier includes a second hyperplane distinguishing the two child nodes. A second vector is normal to the second hyperplane. An orthogonality of the second vector in relation to the first vector is maximized. | 06-28-2012 |
20130282632 | LINK SPAM DETECTION USING SMOOTH CLASSIFICATION FUNCTION - A spam detection system is disclosed. The system includes a classifier training component that receives a first set of training pages labeled as normal pages and a second set of training pages labeled as spam pages. The training component trains a web page classifier based on both the first set of training pages and the second set of training pages. A spam detector then receives unlabeled web pages and uses the web page classifier to classify the unlabeled web pages as spam pages or normal pages. | 10-24-2013 |
20140105488 | LEARNING-BASED IMAGE PAGE INDEX SELECTION - Architecture that performs image page index selection. A learning-based framework learns a statistical model based on previous click information for hyperlinks (URLs, uniform resource locators) obtained from image search users. The learned model can combine the features of a newly discovered URL to predict the likelihood of that URL being clicked in future image searches. In addition to existing web index selection features, image clicks are added as features, and the image clicks are aggregated over different URL segments, as well as the site modeling pattern trees, to reduce the sparsity of the image click information. | 04-17-2014 |
20140172767 | BUDGET OPTIMAL CROWDSOURCING - To optimize the number of correct decisions made by a crowdsourcing system given a fixed budget, tasks for multiple decisions are allocated to workers in a sequence. A task is allocated to a worker based on results already achieved for that task from other workers. Such allocation addresses the different levels of difficulty of decisions. A task also can be allocated to a worker based on results already received for other tasks from that worker. Such allocation addresses the different levels of reliability of workers. The process of allocating tasks to workers can be modeled as a Bayesian Markov decision process. Given the information already received for each item and worker, an estimate of the number of correct labels received can be determined. At each step, the system attempts to maximize the estimated number of correct labels it expects to have given the inputs so far. | 06-19-2014 |
20140222747 | LEARNING WITH NOISY LABELS FROM MULTIPLE JUDGES - A system and method infer true labels for multiple items. The inferred labels are generated from judgments. Multiple judges select the judgments from a specified choice of labels for each item. The method includes determining a characterization of judge expertise and item difficulties based on the judgments. The method also includes determining, using maximum entropy, a probability distribution over the specified choice of labels for each judge and item, based on the judgments. The method further includes selecting improved labels for the items from the specified choice such that the entropy over the probability distribution is reduced. The improved labels represent an improvement from the judgments toward the true labels. Additionally, the method includes performing an iterative procedure to determine the true labels, the characterizations of judge expertise and the labeling difficulties. | 08-07-2014 |
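
Several of the abstracts above (20080275833, 20080275902) rely on a random walk over a directed web graph and on its transition and stationary probabilities. As a rough illustration only, and not the method claimed in any of these applications, the Python sketch below row-normalizes a small link graph into transition probabilities and estimates the stationary distribution by power iteration. The function names, the `teleport` parameter (a small uniform jump added here so that the walk has a unique stationary distribution), and the toy graph are all assumptions made for the example.

```python
def transition_probabilities(links):
    """Turn {page: [pages it links to]} into row-normalized P[u][v].

    Hypothetical helper: a page with no out-links is assumed to jump
    uniformly to every page, so each row sums to 1.
    """
    pages = sorted(set(links) | {v for outs in links.values() for v in outs})
    P = {}
    for u in pages:
        outs = links.get(u, [])
        if outs:
            weight = 1.0 / len(outs)
            row = {}
            for v in outs:
                row[v] = row.get(v, 0.0) + weight
        else:
            row = {v: 1.0 / len(pages) for v in pages}
        P[u] = row
    return pages, P


def stationary_distribution(pages, P, teleport=0.01, iters=200):
    """Estimate stationary probabilities by power iteration.

    `teleport` is an assumed smoothing parameter (not taken from the
    abstracts): with that probability the walk jumps to a uniformly
    chosen page instead of following a link.
    """
    n = len(pages)
    pi = {u: 1.0 / n for u in pages}
    for _ in range(iters):
        nxt = {u: teleport / n for u in pages}
        for u in pages:
            for v, p in P[u].items():
                nxt[v] += (1.0 - teleport) * pi[u] * p
        pi = nxt
    return pi


if __name__ == "__main__":
    # Toy graph: "d" is linked from every other page and links back to "a".
    toy_links = {"a": ["b", "d"], "b": ["c", "d"], "c": ["d"], "d": ["a"]}
    pages, P = transition_probabilities(toy_links)
    pi = stationary_distribution(pages, P)
    for page in pages:
        print(page, round(pi[page], 4))
```

A classifier of the kind described in 20080275833 would use these probabilities as inputs to a smoothness-regularized learning step; that step is not shown here.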