Patent application number | Description | Published |
--- | --- | --- |
20110225192 | AUTO-DETECTION OF HISTORICAL SEARCH CONTEXT - Architecture that automatically detects historical search contexts as well as behaviors related to a search query. Machine learning and hand-authored rules are employed to automatically identify search contexts. Historical information likely to be useful in the current context is surfaced. When a user enters a search query or executes another search behavior, past behaviors that are contextually related to the current behavior are exposed. The architecture also provides automatic discovery of historical contexts, features related to the contexts, and training or authoring of a system for classifying behavior into contexts, using some combination of machine learning and/or hand-authored rules. A runtime system classifies the current user behavior into a context and surfaces contextual information to the user. | 09-15-2011 |
20110246573 | DISTRIBUTED NON-NEGATIVE MATRIX FACTORIZATION - Architecture that scales up the non-negative matrix factorization (NMF) technique to a distributed NMF (denoted DNMF) to handle large matrices, for example, on a web scale that can include millions and billions of data points. To analyze web-scale data, DNMF is applied through parallelism on distributed computer clusters, for example, with thousands of machines. In order to maximize the parallelism and data locality, matrices are partitioned in the short dimension. The probabilistic DNMF can employ not only Gaussian and Poisson NMF techniques, but also exponential NMF for modeling web dyadic data (e.g., dwell time of a user on browsed web pages). | 10-06-2011 |
20130246429 | MULTI-CENTER CANOPY CLUSTERING - A canopy clustering process merges at least one set of multiple single-center canopies together into a merged multi-center canopy. Multi-center canopies, as well as the single-center canopies, can then be used to partition data objects in a dataset. The multi-center canopies allow a canopy assignment condition constraint to be relaxed without risk of leaving any data objects in a dataset outside of all canopies. Approximate distance calculations can be used as similarity metrics to define and merge canopies and to assign data objects to canopies. In one implementation, a distance between a data object and a canopy is represented as the minimum of the distances between the data object and each center of a canopy (whether merged or unmerged), and the distance between two canopies is represented as the minimum of the distances for each pairing of the center(s) in one canopy and the center(s) in the other canopy. | 09-19-2013 |
20130253888 | ONE-PASS STATISTICAL COMPUTATIONS - Some embodiments of the invention employ algorithms enabling the calculation of one or more statistical moments in a single pass of a dataset. For example, some embodiments may apply algorithms for calculating statistical moments to a dataset using a map-reduce framework, whereby an input dataset is partitioned into multiple shards, a separate map process is used to apply an algorithm enabling calculation of one or more statistical moments in a single scan to each shard, and one or more reduce processes consolidate the results generated by the map processes to calculate the one or more statistical moments across the entire dataset. In other embodiments of the invention, a map-reduce framework may be employed to apply algorithms enabling calculation of a covariance between data elements expressed in a dataset, instead of or in addition to one or more statistical moments. | 09-26-2013 |
20130254280 | IDENTIFYING INFLUENTIAL USERS OF A SOCIAL NETWORKING SERVICE - Techniques for identifying influential users of a social networking service are provided. Influential users may be identified via an algorithm in which an influence score is assigned to each user based at least in part on other members of the community having taken an affirmative step with respect to the user's communications. Iterative processing may be performed, with each user's influence score being determined by contributions from other users, and each contribution being determined by the contributor's influence score as of a prior iteration. A map-reduce framework may be employed, with data representing the community being partitioned into a plurality of discrete shards, a map process corresponding to each shard calculating an influence score for users represented in the shard, and reduce processes ranking users according to influence score across all shards. | 09-26-2013 |
20130339000 | IDENTIFYING COLLOCATIONS IN A CORPUS OF TEXT IN A DISTRIBUTED COMPUTING ENVIRONMENT - Technologies pertaining to computing a metric that is indicative of whether an n-gram in a large corpus of text is a collocation are described herein. The metric is computed in connection with a distributed computing framework, wherein n-grams of varying lengths can be analyzed in a single input data pass, and wherein secondary sorting functionality of the distributed computing framework need not be invoked. | 12-19-2013 |
20130346424 | COMPUTING TF-IDF VALUES FOR TERMS IN DOCUMENTS IN A LARGE DOCUMENT CORPUS - Technologies pertaining to computing a respective TF-IDF value for each term in each document of a relatively large document corpus are described herein. TF-IDF values are computed with respect to terms in documents of a large document corpus in a single pass over the document corpus. Secondary sorting functionality of a distributed computing framework is exploited to compute TF-IDF values efficiently. | 12-26-2013 |
20130346466 | IDENTIFYING OUTLIERS IN A LARGE SET OF OBJECTS - Described herein are various technologies pertaining to identifying global outlier candidates from a relatively large collection of computer-readable objects in a distributed computing environment. The collection of computer-readable objects is partitioned into a plurality of sets of objects, and local outlier candidates are identified from each set of objects in the plurality of sets of objects. The local outlier candidates are updated through a hierarchical pairwise similarity analysis until global outlier candidates are identified. Thereafter, a pairwise similarity analysis is undertaken with respect to the global outlier candidates and the sets of objects in the plurality of sets of objects to identify true global outliers. | 12-26-2013 |
20140189000 | SOCIAL MEDIA IMPACT ASSESSMENT - A system for identifying influential users of a social network platform. The system may compute a score for each of multiple users. Such a score may be topic-based, leading to a more accurate identification of influential users. Such a topic-based score may indicate authority and/or impact of a user with respect to a topic. The impact may be computed based on authority combined with other factors, such as power of the user. The authority score may be simply computed, in whole or in part, directly from a tweet log without, for example, creating a retweet graph. As a result, the scores may be computed using MapReduce primitives or other constructs that allow the computations to be distributed across multiple parallel processors. Such scores may be used to select users based on impact as part of social trend analysis, marketing, or other functions. | 07-03-2014 |
20140189536 | SOCIAL MEDIA IMPACT ASSESSMENT - A system for identifying influential users of a social network platform. The system may compute a score for each of multiple users. Such a score may be topic-based, leading to a more accurate identification of influential users. Such a topic-based score may indicate authority and/or impact of a user with respect to a topic. The impact may be computed based on authority combined with other factors, such as power of the user. The authority score may be simply computed, in whole or in part, directly from a tweet log without, for example, creating a retweet graph. As a result, the scores may be computed using MapReduce primitives or other constructs that allow the computations to be distributed across multiple parallel processors. Such scores may be used to select users based on impact as part of social trend analysis, marketing, or other functions. | 07-03-2014 |
20150039619 | GROUPING DOCUMENTS AND DATA OBJECTS VIA MULTI-CENTER CANOPY CLUSTERING - A canopy clustering process merges at least one set of multiple single-center canopies together into a merged multi-center canopy. Multi-center canopies, as well as the single-center canopies, can then be used to partition data objects in a dataset. The multi-center canopies allow a canopy assignment condition constraint to be relaxed without risk of leaving any data objects in a dataset outside of all canopies. Approximate distance calculations can be used as similarity metrics to define and merge canopies and to assign data objects to canopies. In one implementation, a distance between a data object and a canopy is represented as the minimum of the distances between the data object and each center of a canopy (whether merged or unmerged), and the distance between two canopies is represented as the minimum of the distances for each pairing of the center(s) in one canopy and the center(s) in the other canopy. | 02-05-2015 |
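The one-pass moment computation described in 20130253888 can be illustrated with a small sketch: each shard is scanned exactly once to build a partial result (count, mean, sum of squared deviations), and the partials are then merged in a reduce step. The merge formulas below are the standard parallel-variance identities; the function names and shard layout are illustrative, not taken from the application.

```python
# Sketch of one-pass moment accumulation with a map-reduce-style merge.
# Each shard is scanned once (the "map"); partial results are then
# combined (the "reduce") to recover the dataset-wide mean and variance.

def scan_shard(shard):
    """Single pass over one shard: running count, mean, and M2 (Welford)."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in shard:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
    return n, mean, m2

def merge(a, b):
    """Combine two partial results without revisiting the data."""
    na, ma, m2a = a
    nb, mb, m2b = b
    n = na + nb
    if n == 0:
        return 0, 0.0, 0.0
    delta = mb - ma
    mean = ma + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2

shards = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
n, mean, m2 = 0, 0.0, 0.0
for part in map(scan_shard, shards):
    n, mean, m2 = merge((n, mean, m2), part)
variance = m2 / n  # population variance of the full dataset
```

Because `merge` is associative, the reduce step can consolidate shard results in any order, which is what makes the approach map-reduce friendly.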
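The distance definitions in 20130246429 and 20150039619 (a data object's distance to a canopy is the minimum over that canopy's centers; a canopy-to-canopy distance is the minimum over all center pairs) reduce to simple minima. A minimal sketch, using Euclidean distance as a stand-in for the approximate similarity metric and with hypothetical helper names and threshold:

```python
import math

def point_to_canopy(point, canopy):
    """Distance from a data object to a (possibly merged) canopy:
    the minimum distance to any of the canopy's centers."""
    return min(math.dist(point, c) for c in canopy)

def canopy_to_canopy(a, b):
    """Distance between two canopies: the minimum over all center pairs."""
    return min(math.dist(ca, cb) for ca in a for cb in b)

def merge_canopies(a, b, t_merge):
    """Merge two canopies into one multi-center canopy when close enough."""
    if canopy_to_canopy(a, b) <= t_merge:
        return [a + b]   # one multi-center canopy holding both center sets
    return [a, b]        # otherwise keep them separate

single = [(0.0, 0.0)]
other = [(1.0, 0.0)]
merged = merge_canopies(single, other, t_merge=1.5)
# merged is one two-center canopy; a point near either center is now
# close to the merged canopy under the min-over-centers distance.
```

Representing a canopy as a list of centers means a merged canopy needs no special casing: the same `min` handles one center or many.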
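The single-pass TF-IDF computation of 20130346424 can be sketched as follows. The abstract does not spell out the exact weighting or the secondary-sort mechanics, so this sketch uses the common tf × log(N/df) form and plain dictionaries in place of the distributed framework: one scan over the corpus collects all per-document term frequencies and per-term document frequencies, and the final values are computed from those counts alone.

```python
import math
from collections import Counter

def tfidf(corpus):
    """corpus: {doc_id: list of terms}. Returns {(doc_id, term): tf-idf}.
    A single scan fills both the tf table and the df table."""
    tf = {}            # (doc_id, term) -> term frequency within that doc
    df = Counter()     # term -> number of documents containing it
    for doc_id, terms in corpus.items():
        counts = Counter(terms)
        total = len(terms)
        for term, c in counts.items():
            tf[(doc_id, term)] = c / total
            df[term] += 1
    n_docs = len(corpus)
    return {key: f * math.log(n_docs / df[key[1]])
            for key, f in tf.items()}

corpus = {
    "d1": ["apple", "banana", "apple"],
    "d2": ["banana", "cherry"],
}
scores = tfidf(corpus)
# "banana" appears in every document, so its idf (and tf-idf) is zero.
```

In the distributed setting the abstract describes, the df counts would be gathered per term across shards; the secondary sort lets the framework deliver a term's document postings and its document frequency together, avoiding a second pass.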
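The iterative scheme of 20130254280 (each user's score is built from contributions of other users, each contribution scaled by the contributor's prior-iteration score) can be sketched with a PageRank-style update. The damping term and the tiny endorsement graph below are illustrative assumptions, not details from the application:

```python
def influence_scores(endorsements, iterations=50, d=0.85):
    """endorsements: {user: list of users whose posts this user endorsed,
    e.g. via a repost}. Each endorser splits d times its prior-iteration
    score among the users it endorsed; (1 - d)/n is a uniform base score."""
    users = set(endorsements)
    for targets in endorsements.values():
        users.update(targets)
    n = len(users)
    score = {u: 1.0 / n for u in users}
    for _ in range(iterations):
        nxt = {u: (1.0 - d) / n for u in users}
        for endorser, targets in endorsements.items():
            if targets:
                share = d * score[endorser] / len(targets)
                for t in targets:
                    nxt[t] += share
        score = nxt
    return score

# "alice" is endorsed by both other users, so she ranks highest.
endorsements = {"bob": ["alice"], "carol": ["alice", "bob"], "alice": []}
scores = influence_scores(endorsements)
```

In the sharded form the abstract describes, each map task would run this update for the users in its shard, and reduce tasks would rank users by score across shards.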