Patent application number | Description | Published |
20120239613 | GENERATING A PREDICTIVE MODEL FROM MULTIPLE DATA SOURCES - Techniques are disclosed for generating an ensemble model from multiple data sources. In one embodiment, the ensemble model is generated using a global validation sample, a global holdout sample and base models generated from the multiple data sources. An accuracy value may be determined for each base model, on the basis of the global validation dataset. The ensemble model may be generated from a subset of the base models, where the subset is selected on the basis of the determined accuracy values. | 09-20-2012 |
20120278275 | GENERATING A PREDICTIVE MODEL FROM MULTIPLE DATA SOURCES - Techniques are disclosed for generating an ensemble model from multiple data sources. In one embodiment, the ensemble model is generated using a global validation sample, a global holdout sample and base models generated from the multiple data sources. An accuracy value may be determined for each base model, on the basis of the global validation dataset. The ensemble model may be generated from a subset of the base models, where the subset is selected on the basis of the determined accuracy values. | 11-01-2012 |
20130006998 | INTERESTINGNESS OF DATA - Provided are techniques for analyzing fields. Statistical metrics for each field in a data set are received. A general interestingness index is generated for each field using one or more combination functions that aggregate standardized interestingness sub-indexes. One or more fields are identified as interesting for further analysis using the general interestingness index. One or more expert recommendations for field transformations are constructed for the identified one or more fields. | 01-03-2013 |
20130007003 | INTERESTINGNESS OF DATA - Provided are techniques for analyzing fields. Statistical metrics for each field in a data set are received. A general interestingness index is generated for each field using one or more combination functions that aggregate standardized interestingness sub-indexes. One or more fields are identified as interesting for further analysis using the general interestingness index. One or more expert recommendations for field transformations are constructed for the identified one or more fields. | 01-03-2013 |
20130218908 | COMPUTING AND APPLYING ORDER STATISTICS FOR DATA PREPARATION - Provided are techniques for generating order statistics and error bounds. For each of multiple, distributed data sources, a finite number of data bins are created for each field in that data source. Data values in each of the multiple, distributed data sources are processed to generate basic summaries for each of the data bins in a single pass of the data values. The data bins from each of the multiple, distributed data sources are sorted. One or more approximate order statistics are computed for a data set by accumulating counts from a number of the sorted data bins. Lower and upper error bounds are provided for each of the computed one or more approximate order statistics, wherein the lower and upper error bounds are values delimiting an interval containing a true value of an order statistic. | 08-22-2013 |
20130218909 | COMPUTING AND APPLYING ORDER STATISTICS FOR DATA PREPARATION - Provided are techniques for generating order statistics and error bounds. For each of multiple, distributed data sources, a finite number of data bins are created for each field in that data source. Data values in each of the multiple, distributed data sources are processed to generate basic summaries for each of the data bins in a single pass of the data values. The data bins from each of the multiple, distributed data sources are sorted. One or more approximate order statistics are computed for a data set by accumulating counts from a number of the sorted data bins. Lower and upper error bounds are provided for each of the computed one or more approximate order statistics, wherein the lower and upper error bounds are values delimiting an interval containing a true value of an order statistic. | 08-22-2013 |
20130226838 | MISSING VALUE IMPUTATION FOR PREDICTIVE MODELS - Provided are techniques for imputing a missing value for each of one or more predictor variables. Data is received from one or more data sources. For each of the one or more predictor variables, an imputation model is built based on information of a target variable; a type of imputation model to construct is determined based on the one or more data sources, a measurement level of the predictor variable, and a measurement level of the target variable; and the determined type of imputation model is constructed using basic statistics of the predictor variable and the target variable. The missing value is imputed for each of the one or more predictor variables using the data from the one or more data sources and one or more built imputation models to generate a completed data set. | 08-29-2013 |
20130226842 | MISSING VALUE IMPUTATION FOR PREDICTIVE MODELS - Provided are techniques for imputing a missing value for each of one or more predictor variables. Data is received from one or more data sources. For each of the one or more predictor variables, an imputation model is built based on information of a target variable; a type of imputation model to construct is determined based on the one or more data sources, a measurement level of the predictor variable, and a measurement level of the target variable; and the determined type of imputation model is constructed using basic statistics of the predictor variable and the target variable. The missing value is imputed for each of the one or more predictor variables using the data from the one or more data sources and one or more built imputation models to generate a completed data set. | 08-29-2013 |