Entries
Document | Title | Date |
20100049760 | COMPUTER TOOL FOR MANAGING DIGITAL DOCUMENTS - The invention relates to a computer device for managing documents, in particular software projects developed in co-operation. A memory stores contents of documents having time references. An extractor separates the document contents into document elements. A signature generator returns signatures of element contents. An imprint generator associates for each document the time reference thereof and the signatures of its elements. A time clock module calculates the results of a time election function. A supervisor operates the generator on the document contents. For each signature value, it calls the time clock module with a list of imprint time references containing the signature value. A unified time reference is obtained for each signature value. Each signature value and its unified time reference are stored. | 02-25-2010 |
20100094813 | REPRESENTING AND STORING AN OPTIMIZED FILE SYSTEM USING A SYSTEM OF SYMLINKS, HARDLINKS AND FILE ARCHIVES - A data de-duplication system is used with network attached storage and serves to reduce data duplication and file storage costs. Techniques utilizing both symlinks and hardlinks ensure efficient deletion file/data cleanup and avoid data loss in the event of crashes. | 04-15-2010 |
20100114842 | Detecting Duplicative Hierarchical Sets Of Files - To detect duplicative hierarchically arranged sets of files in a storage system, a method includes generating, for hierarchically arranged plural sets of files, respective collections of values computed based on files in corresponding sets of files. For a further set of files that is an ancestor of at least one of the plural sets of files, a respective collection of values that is based on the collection of values computed for the at least one set is generated. Duplicative sets according to comparisons of the collections of values are identified. | 05-06-2010 |
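The hierarchical comparison described in entry 20100114842 can be sketched in Python. The hash function and the way child values are combined into an ancestor's collection are illustrative assumptions, not details taken from the patent:

```python
import hashlib
from pathlib import Path

def dir_signature(root: Path) -> str:
    """Derive one value for a directory tree from values computed for its files."""
    parts = []
    for p in sorted(root.rglob("*")):
        if p.is_file():
            file_hash = hashlib.sha256(p.read_bytes()).hexdigest()
            parts.append((str(p.relative_to(root)), file_hash))
    # The ancestor's value is computed from its descendants' values.
    return hashlib.sha256(repr(parts).encode()).hexdigest()

def find_duplicate_trees(dirs):
    """Report pairs of directories whose signatures match."""
    seen, dups = {}, []
    for d in dirs:
        sig = dir_signature(Path(d))
        if sig in seen:
            dups.append((seen[sig], d))
        else:
            seen[sig] = d
    return dups
```

Because an ancestor's signature is built from the per-file values, trees can be compared without re-reading file contents once the leaf hashes exist.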
20100121825 | FILE SYSTEM WITH INTERNAL DEDUPLICATION AND MANAGEMENT OF DATA BLOCKS - A method for deduplicating and managing data blocks within a file system includes adding a deduplication identifier to each pointer pointing to a data block to indicate whether the data block is deduplicated, detecting duplicate data blocks, determining whether one of the duplicate data blocks has been deduplicated, when detected, determining that one duplicate data block is a master copy when it is determined that one duplicate data block has been deduplicated, selecting one of the duplicate data blocks to be a master copy when it is determined that the duplicate data blocks have not been deduplicated, and setting the deduplication identifier of the selected duplicate data block to indicate deduplication, and determining that the other duplicate data block is a new duplicate data block and setting the deduplication identifier of the other duplicate data block to indicate deduplication and directing the respective pointer to the master copy. | 05-13-2010 |
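The master-copy selection and pointer redirection of entry 20100121825 can be illustrated with a minimal sketch; the `Pointer` class, the SHA-256 content hash, and the in-memory block store are assumptions made for brevity:

```python
import hashlib

class Pointer:
    """A pointer to a data block, carrying a deduplication identifier."""
    def __init__(self, block: bytes):
        self.block = block
        self.dedup = False

def deduplicate(pointers):
    """Detect duplicate blocks, select a master copy, and redirect pointers to it."""
    masters = {}                      # content hash -> master block
    for p in pointers:
        h = hashlib.sha256(p.block).digest()
        if h in masters:
            p.block = masters[h]      # new duplicate: direct its pointer at the master
            p.dedup = True
        else:
            masters[h] = p.block      # first occurrence is selected as the master
    # Mark the deduplication identifier on every pointer whose block is shared.
    counts = {}
    for p in pointers:
        counts[id(p.block)] = counts.get(id(p.block), 0) + 1
    for p in pointers:
        if counts[id(p.block)] > 1:
            p.dedup = True
```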
20100153348 | Report Generation for a Navigation-Related Database - Systems, devices, features, and methods for updating a geographic database, such as a navigation-related database, and/or reporting discrepancies associated with geographic data of the geographic database are disclosed. For example, one method comprises capturing a photograph of an observed geographic feature in a geographic region. Comment information corresponding to the observed geographic feature may be stored. The comment information is indicative of a discrepancy between the observed geographic feature and the geographic data corresponding to the geographic region. The comment information may be associated with the photograph to generate a report, and the report is transmitted. | 06-17-2010 |
20100169287 | SYSTEMS AND METHODS FOR BYTE-LEVEL OR QUASI BYTE-LEVEL SINGLE INSTANCING - Described in detail herein are systems and methods for deduplicating data using byte-level or quasi byte-level techniques. In some embodiments, a file is divided into multiple blocks. A block includes multiple bytes. Multiple rolling hashes of the file are generated. For each byte in the file, a searchable data structure is accessed to determine if the data structure already includes an entry matching a hash of a minimum sequence length. If so, this indicates that the corresponding bytes are already stored. If one or more bytes in the file are already stored, then the one or more bytes in the file are replaced with a reference to the already stored bytes. The systems and methods described herein may be used for file systems, databases, storing backup data, or any other use case where it may be useful to reduce the amount of data being stored. | 07-01-2010 |
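The byte-level matching in entry 20100169287 can be sketched as follows. A production system would maintain a true rolling hash updated per byte (e.g. Rabin-Karp style); here each window is hashed afresh for clarity, and the window size and SHA-1 are illustrative choices, not values from the patent:

```python
import hashlib

WINDOW = 8   # hypothetical minimum sequence length, in bytes

def index_windows(stored: bytes) -> dict:
    """Hash every WINDOW-byte sequence of already-stored data into a searchable table."""
    table = {}
    for i in range(len(stored) - WINDOW + 1):
        table.setdefault(hashlib.sha1(stored[i:i + WINDOW]).digest(), i)
    return table

def deduplicate(new: bytes, stored: bytes):
    """Emit literal bytes, or (offset, length) references into already-stored bytes."""
    table = index_windows(stored)
    out, i = [], 0
    while i < len(new):
        window = new[i:i + WINDOW]
        hit = table.get(hashlib.sha1(window).digest()) if len(window) == WINDOW else None
        if hit is not None and stored[hit:hit + WINDOW] == window:
            out.append(("ref", hit, WINDOW))    # bytes already stored: reference them
            i += WINDOW
        else:
            out.append(("lit", new[i:i + 1]))   # unseen byte: keep it literally
            i += 1
    return out
```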
20100174688 | Apparatus, System and Method for Member Matching - An apparatus, system, and method for member matching. In one embodiment, the apparatus includes an input adapter, a processor, and an output adapter. The input adapter may receive a first data record from a plurality of data records stored in one or more databases. The processor may generate a first data key from one or more field values in the first data record, compare a second data key associated with a second data record with the first data key associated with the first data record, and identify a match between the first data key and the second data key. In one embodiment, the output adapter may produce an output configured to identify the first data record and the second data record in response to identification of the match. | 07-08-2010 |
20100191709 | DATA RECORDING METHOD, DATA ERASURE METHOD, DATA DISPLAY METHOD, STORAGE DEVICE, STORAGE MEDIUM, AND PROGRAM - The objective of the present invention is to manage reference movies using an index file, without confusing the user. The reference movies are generated because of, for instance, the upper limit of the file size. The index file manages sets of information regarding the files being managed. Examples of these sets of information are information for determining whether or not a file is presented to the user, information for determining whether or not a file is original, and information indicating whether or not nondestructive editing has been done. Based on such information, operations such as erasure and list display are carried out. Thus, it is possible to manage the reference movies using the index file, without confusing the user. | 07-29-2010 |
20100198797 | CLASSIFYING DATA FOR DEDUPLICATION AND STORAGE - In a method of classifying data for deduplication, data to be classified is accessed. The data is classified into a deduplication classification in accordance with a data content aware data classification policy such that classified data is created. The data classification policy includes a plurality of deduplication classifications. | 08-05-2010 |
20100205158 | FILE CREATION UTILITY FOR DEDUPLICATION TESTING - A method, system, and computer program product for facilitating deduplication product testing in a computing environment is provided. At least one deduplication test file is generated. The at least one deduplication test file is adapted for, when processed through the deduplication product testing, exhibiting a predefined deduplication factor. A definition file is initialized. The definition file defines at least one file characteristic addressed during the generating the at least one deduplication test file to obtain the predefined deduplication factor. The file characteristic may include a file pattern, a file/pattern ratio, and a pattern across multiple files. | 08-12-2010 |
20100235331 | User-determinable method and system for manipulating and displaying textual and graphical information - One or more aspects of the invention include transforming source data in order to display a work product. A plurality of rules relating to content manipulation of the source data include at least one rule relating to content selection and at least one rule relating to content compression. Source data for content manipulation may also be received. A selected portion of the source data and a compressed portion of the source data may be formed. The compressed portion may then be received and presented on a computer as a work product. | 09-16-2010 |
20100235332 | APPARATUS AND METHOD TO DEDUPLICATE DATA - A method to deduplicate data by receiving a data set, setting a data chunk size, selecting a first stage deduplication algorithm, and selecting a second stage deduplication algorithm, where the first stage deduplication algorithm differs from the second stage deduplication algorithm. The method selects a data chunk, where that data chunk comprises all or a portion of the data set, and performs a first stage deduplication analysis of the data chunk using the first stage deduplication algorithm. If the first stage deduplication analysis indicates duplicate data, then the method performs a second stage deduplication analysis of said data chunk using the second stage deduplication algorithm to verify the data as duplicate. Only if both deduplication analyses indicate duplicate data is the data chunk replaced by a deduplication stub, i.e., a reference to the identical data chunk that is already stored. | 09-16-2010 |
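The two-stage verification in entry 20100235332 can be sketched briefly. The abstract does not name the two algorithms; a SHA-1 hash comparison for the first stage and a byte-for-byte comparison for the second are stand-ins chosen for illustration:

```python
import hashlib

def dedup_chunk(chunk: bytes, store: dict):
    """Replace a chunk with a stub only when both stages report a duplicate."""
    h = hashlib.sha1(chunk).hexdigest()       # first stage: fast hash comparison
    if h in store and store[h] == chunk:      # second stage: byte-for-byte verification
        return ("stub", h)                    # reference to the identical stored chunk
    store.setdefault(h, chunk)
    return ("data", chunk)
```

The second stage guards against hash collisions: a chunk is only stubbed out once its bytes are confirmed identical to the stored copy.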
20100235333 | APPARATUS AND METHOD TO SEQUENTIALLY DEDUPLICATE DATA - A method to sequentially deduplicate data, wherein the method receives a plurality of computer files, wherein each of the plurality of computer files comprises a label comprising a file name, a file type, a version number, and file size, and stores that plurality of computer files in a deduplication queue. The method then identifies a subset of the plurality of computer files, wherein each file of the subset comprises the same file name but a different version number, and wherein the subset comprises a maximum count of version numbers, and wherein the subset comprises a portion of the plurality of computer files. The method deduplicates the subset using a hash algorithm, and removes the subset from said deduplication queue. | 09-16-2010 |
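The queue-and-subset flow of entry 20100235333 (and the near-identical entry 20100325093 below) can be sketched as follows. The record layout, the subset cap of four versions, and SHA-256 as the hash algorithm are illustrative assumptions:

```python
import hashlib
from collections import defaultdict

MAX_VERSIONS = 4   # hypothetical maximum count of version numbers per subset

def pick_subset(queue):
    """Find queued files sharing a file name but differing in version number."""
    groups = defaultdict(list)
    for f in queue:
        groups[f["name"]].append(f)
    for files in groups.values():
        if len({f["version"] for f in files}) > 1:
            return sorted(files, key=lambda f: f["version"])[:MAX_VERSIONS]
    return []

def dedup_subset(queue, store):
    """Deduplicate one subset with a hash algorithm, then remove it from the queue."""
    subset = pick_subset(queue)
    for f in subset:
        digest = hashlib.sha256(f["data"]).hexdigest()
        store.setdefault(digest, f["data"])   # identical payloads stored once
        f["ref"] = digest
        queue.remove(f)
    return subset
```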
20100250501 | STORAGE MANAGEMENT THROUGH ADAPTIVE DEDUPLICATION - One embodiment retrieves a first portion of a plurality of stored objects from at least one storage device. The embodiment further performs a base type deduplication estimation process on the first portion of stored objects, estimating a first plurality of deduplication chunk portions. The embodiment still further categorizes the first portion of the plurality of stored objects into deduplication sets based on a deduplication relationship of each object of the plurality of stored objects with each of the estimated first plurality of deduplication chunk portions. The embodiment further combines deduplication sets into broad classes based on deduplication characteristics of the objects in the deduplication sets. The embodiment still further classifies a second portion of the plurality of stored objects into broad classes using classifiers. The embodiment further selects an appropriate deduplication approach for each categorized class. | 09-30-2010 |
20100250502 | METHOD AND APPARATUS FOR CONTENTS DE-DUPLICATION - Exemplary embodiments provide, in effect, data de-duplication in storage servers without the need to compare every byte of stored data. In one embodiment, a method for providing contents from a content device to a storage device comprises receiving by a storage device a ticket including trade information of a trade by a user for content from a content device; receiving by the storage device from the content device attribute information of the content identified in the ticket; determining whether the storage device has the content identified in the ticket based on the attribute information; if the storage device does not have the content identified in the ticket, receiving the content identified in the ticket from the content device and storing the content in the storage device; and if the storage device has the content identified in the ticket, not receiving the content identified in the ticket from the content device. | 09-30-2010 |
20100250503 | ELECTRONIC COMMUNICATION DATA VALIDATION IN AN ELECTRONIC DISCOVERY ENTERPRISE SYSTEM - Embodiments of the invention relate to systems, methods, and computer program products for improved electronic discovery. Embodiments herein disclosed provide for an enterprise-wide e-discovery system that provides for validity verification of electronic communications prior to subsequent processing, such as decryption or standardized format conversion. | 09-30-2010 |
20100262587 | DEVICE FOR REMOTE DEFRAGMENTATION OF AN EMBEDDED DEVICE - An embedded device ( | 10-14-2010 |
20100268693 | ELECTRONIC DEVICE FILTERING - A filtering method and apparatus for an electronic device. The operating system's program registry is copied and then modified to direct all application data to a filter program such as a virus scanning filter program. If the filter program determines that the object is virus free, the copy of the program registry (filter registry) is queried by the filter program to determine the application program associated with the data object. The data object is then forwarded to the appropriate application program. | 10-21-2010 |
20100281003 | SYSTEM AND USES FOR GENERATING DATABASES OF PROTEIN SECONDARY STRUCTURES INVOLVED IN INTER-CHAIN PROTEIN INTERACTIONS - The present invention relates to methods and systems for generating a database of protein secondary structures that are at an interface of a two-chain inter-protein interaction. Collections of secondary structures identified according to the methods disclosed herein, and their use in identifying therapeutic drug candidates potentially effective in modulating a two-chain inter-protein interaction having a secondary structure at its interface, are also disclosed. | 11-04-2010 |
20100325093 | APPARATUS AND METHOD TO SEQUENTIALLY DEDUPLICATE GROUPS OF FILES COMPRISING THE SAME FILE NAME BUT DIFFERENT FILE VERSION NUMBERS - A method to sequentially deduplicate data, wherein the method receives a plurality of computer files, wherein each of the plurality of computer files comprises a label comprising a file name, a file type, a version number, and file size, and stores that plurality of computer files in a deduplication queue. The method then identifies a subset of the plurality of computer files, wherein each file of the subset comprises the same file name but a different version number, and wherein the subset comprises a maximum count of version numbers, and wherein the subset comprises a portion of the plurality of computer files. The method deduplicates the subset using a hash algorithm, and removes the subset from said deduplication queue. During the deduplicating, the method receives new computer files comprising the same file name, stores those new computer files to the deduplication queue, but does not add those new computer files to the subset. | 12-23-2010 |
20110010347 | ITERATOR REGISTER FOR STRUCTURED MEMORY - Loading data from a computer memory system is disclosed. A memory system is provided, wherein some or all data stored in the memory system is organized as one or more pointer-linked data structures. One or more iterator registers are provided. A first pointer chain is loaded, having two or more pointers leading to a first element of a selected pointer-linked data structure to a selected iterator register. A second pointer chain is loaded, having two or more pointers leading to a second element of the selected pointer-linked data structure to the selected iterator register. The loading of the second pointer chain reuses portions of the first pointer chain that are common with the second pointer chain. | 01-13-2011 |
20110016095 | Integrated Approach for Deduplicating Data in a Distributed Environment that Involves a Source and a Target - One aspect of the present invention includes a configuration of a storage management system that enables the performance of deduplication activities at both the client (source) and at the server (target) locations. The location of deduplication operations can then be optimized based on system conditions or predefined policies. In one embodiment, seamless switching of deduplication activities between the client and the server is enabled by utilizing uniform deduplication process algorithms and accessing the same deduplication index (containing information on the hashed data chunks). Additionally, any data transformations on the chunks are performed subsequent to identification of the data chunks. Accordingly, with use of this storage configuration, the storage system can find and utilize matching chunks generated with either client- or server-side deduplication. | 01-20-2011 |
20110022571 | METHOD OF MANAGING WEBSITE COMPONENTS OF A BROWSER - A method of uninstalling all data and objects relating to a Website is disclosed. The user, upon visiting a Website, registers the site with the uninstall software according to a defined registration event. The registration process creates a Meta URL, to which all data and objects relating to the visited Website are associated. The data and objects include URLs, HTML documents, bookmarks or favorites, temporary browser objects such as embedded multimedia, browser cookies, browser history, and browser plug-ins and extensions. A configuration file, local or remote, or a user interface determines which items are removable. Multiple browsers can share the association between the data and objects relating to a Website and the Meta URL and the user can remove a browser from the shared association. | 01-27-2011 |
20110022572 | DUPLICATE FILTERING IN A DATA PROCESSING ENVIRONMENT - A data processing method is provided. The method comprises collecting a stream of data records from one or more devices in a network; loading one or more persistent indexes associated with the stream of data records into memory; identifying duplicate data records in the stream of data records using the in-memory indexes; and updating a repository such that the duplicate data records are not stored in the repository or managed differently than non-duplicate data records. | 01-27-2011 |
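The in-memory index filtering of entry 20110022572 amounts to a membership test against an index that persists across batches of the stream. A minimal sketch, assuming string records and a set-backed index (the patent's persistent index structure is not specified):

```python
import hashlib

def filter_duplicates(records, index):
    """Pass through only records whose hash is absent from the loaded index."""
    fresh = []
    for rec in records:
        key = hashlib.sha256(rec.encode()).hexdigest()
        if key not in index:
            index.add(key)            # the index persists, so later batches see it
            fresh.append(rec)
    return fresh
```

Loading the index into memory once per stream keeps the per-record duplicate check to a single hash-table lookup.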
20110022573 | PREVENTING TRANSFER AND DUPLICATION OF REDUNDANTLY REFERENCED OBJECTS ACROSS NODES OF AN APPLICATION SYSTEM - Unique identifiers referred to as “keys” are generated for objects stored on each node. When a container object including at least one embedded object is transferred from a sending node to a receiving node, the sending node sends the key uniquely identifying the embedded object to the receiving node to determine whether the embedded object is already stored on the receiving node. If the receiving node indicates that the embedded object is already stored at the receiving node, then the sending node determines that the embedded object does not need to be sent to the receiving node. In that case, if the embedded object has not been sent, the sending node does not send the embedded object. If the sending node has already started sending the embedded object, then the sending node terminates sending of the embedded object. | 01-27-2011 |
20110029491 | DYNAMICALLY DETECTING NEAR-DUPLICATE DOCUMENTS - Techniques for detecting one or more documents that are duplicate or near-duplicate of a first document are provided. The techniques include obtaining a first document, obtaining one or more additional documents, retrieving a set of one or more document signatures for each document, and detecting one or more documents that are duplicate or near-duplicate of the first document by detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document, wherein detecting each of the one or more additional documents that have at least a minimum number of signatures in common with the first document comprises dynamically using at least one of a user-configurable similarity definition and a user-configurable similarity threshold value. | 02-03-2011 |
20110040734 | PROCESSING OF STREAMING DATA WITH KEYED AGGREGATION - Keyed aggregation is used in the processing of streaming data to streamline processing to provide higher throughput and decreased use of resources. The most recent event for each unique replacement key value(s) is maintained. In response to an incoming event having a same key as a previous event, the effect on an aggregation of the previous event is removed. The aggregation is then updated with one or more values from the arriving event and the updated aggregation is output. | 02-17-2011 |
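The replace-key aggregation of entry 20110040734 can be shown with a running sum as the aggregation (the patent covers aggregations generally; summation is an illustrative choice):

```python
def keyed_sum(events):
    """Running sum in which a repeated key replaces its earlier contribution."""
    latest, total, outputs = {}, 0, []
    for key, value in events:
        if key in latest:
            total -= latest[key]   # remove the superseded event's effect
        latest[key] = value        # keep only the most recent event per key
        total += value             # update the aggregation with the new value
        outputs.append(total)      # emit the updated aggregation
    return outputs
```

Retaining only the latest event per key bounds memory by the number of distinct keys rather than the number of events.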
20110047130 | METHOD AND APPARATUS FOR COLLECTING EVIDENCE - Method and apparatus for collecting evidence are provided. An exemplary embodiment enhances accuracy and efficiency of collecting evidence by analyzing link information in the target computer and collecting collection target file. And the exemplary embodiment can collect evidence from a target computer as well as from a remote computer through analyzing the link information in the target computer, identifying the path of collection target file and extracting the target file. | 02-24-2011 |
20110055171 | GENERATION OF REALISTIC FILE CONTENT CHANGES FOR DEDUPLICATION TESTING - Method, system, and computer program product embodiments for facilitating deduplication product testing in a computing environment are provided. In one such embodiment, data to be processed through the deduplication product testing is arranged into a single, continuous stream. At least one of a plurality of random modifications are applied to the arranged data in a self-similar pattern exhibiting scale invariance. A plurality of randomly sized subsets of the arranged data modified with the self-similar pattern is mapped into each of a plurality of randomly sized deduplication test files. | 03-03-2011 |
20110055172 | AUTOMATIC ERROR CORRECTION FOR INVENTORY TRACKING AND MANAGEMENT SYSTEMS USED AT A SHIPPING CONTAINER YARD - A method automatically detects and corrects errors in a container inventory database associated with a container inventory tracking system of a container storage facility. A processor in the inventory tracking system performs a method to detect errors; this method of error detection obtains a first data record, identifies an event (e.g., pickup or drop-off of a container, or movement of handing equipment) associated with the first record, provides a list of error types based on the identified event, and determines whether a data error has occurred through a checking process. To correct the errors, this method further sets search criteria based on the error detection results, queries the inventory tracking database using the set criteria, determines error candidates based on the query results, evaluates the error candidates to identify a match or matches among the error candidates, and corrects the error(s) by modifying the error detection results together with the identified match or matches. | 03-03-2011 |
20110055173 | Data Integration Method and System - A computer implemented method for ensuring the quality of processed corporate entity data, the method comprising: sequentially processing the corporate entity data through a series of serially connected drivers, the serially connected drivers comprising a data collection driver, an entity matching driver, an identification number driver, a corporate linkage driver, and a predictive indicator driver; and conducting a quality assurance of the corporate entity data as it is processed in each of the drivers, wherein the quality assurance comprises: (i) sampling the corporate entity data from each driver periodically, thereby generating sample data; (ii) evaluating the sample data; and (iii) adjusting the processing based upon the evaluation, thereby producing high quality data. | 03-03-2011 |
20110071989 | FILE AWARE BLOCK LEVEL DEDUPLICATION - A system provides file aware block level deduplication in a system having multiple clients connected to a storage subsystem over a network such as an Internet Protocol (IP) network. The system includes client components and storage subsystem components. Client components include a walker that traverses the namespace looking for files that meet the criteria for optimization, a file system daemon that rehydrates the files, and a filter driver that watches all operations going to the file system. Storage subsystem components include an optimizer resident on the nodes of the storage subsystem. The optimizer can use idle processor cycles to perform optimization. Sub-file compression can be performed at the storage subsystem. | 03-24-2011 |
20110082840 | SCALABLE MECHANISM FOR DETECTION OF COMMONALITY IN A DEDUPLICATED DATA SET - Mechanisms are provided for efficiently determining commonality in a deduplicated data set in a scalable manner regardless of the number of deduplicated files or the number of stored segments. Information is generated and maintained during deduplication to allow scalable and efficient determination of data segments shared in a particular file, other files sharing data segments included in a particular file, the number of files sharing a data segment, etc. Data need not be expanded or uncompressed. Deduplication processing can be validated and verified during commonality detection. | 04-07-2011 |
20110082841 | Analyzing Backup Objects Maintained by a De-Duplication Storage System - Analyzing backup objects maintained by a de-duplication server. A plurality of first objects may be maintained. Each first object may refer to second object(s) and each second object may refer back to at least one first object. For each respective first object, the respective first object may be analyzed to determine the one or more second objects referred to by the respective first object. Correspondingly, a command may be generated for each respective second object of the determined second object(s), thereby generating a plurality of commands. Each command may be used to verify that the respective second object refers back to the respective first object. The plurality of commands may be sorted into a disk access order. The commands may be used to verify that each second object refers back to first objects that refer to the second object. | 04-07-2011 |
20110099154 | Data Deduplication Method Using File System Constructs - A data deduplication method providing direct look up and storage in an instance repository (IR). The method includes receiving a data object and processing the data object to generate a fingerprint that includes a location component, which defines a file location within the IR such as by first using a hash function to create a hash for the data object and parsing the hash value into sub-strings defining sub-directories of the IR. The method includes determining whether the data object is a duplicate by verifying the presence of a file in the IR at the file location. Determining if the data is unique involves performing a system call on the IR providing the location component as the file path. The method includes, when a file is not in the IR, updating the IR to store the data object as a file at the file location defined by the location component. | 04-28-2011 |
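The location-component fingerprint of entry 20110099154 maps a content hash directly onto a file path, so the duplicate check is a filesystem lookup rather than a database query. A sketch under assumed details (SHA-256, two 2-character sub-directory levels):

```python
import hashlib
import os

def fingerprint_path(repo_root: str, data: bytes) -> str:
    """Parse the content hash into sub-strings naming sub-directories of the IR."""
    h = hashlib.sha256(data).hexdigest()
    return os.path.join(repo_root, h[:2], h[2:4], h)

def store_unique(repo_root: str, data: bytes) -> bool:
    """Return True if the object was new and stored, False if it already existed."""
    path = fingerprint_path(repo_root, data)
    if os.path.exists(path):               # duplicate test is a single system call
        return False
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as fh:
        fh.write(data)
    return True
```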
20110113018 | METHOD AND SYSTEM FOR MERGING DISPARATE VIRTUAL UNIVERSES ENTITIES - A migration tool for merging disparate virtual universes by selecting a source region or source account, selecting a destination edge or destination account, retrieving and storing virtual universe information for the source region or account, inserting the virtual universe resources of the source region or account into the destination region or account, activating the inserted resources, and deleting the source resources from the source region or account. | 05-12-2011 |
20110119239 | METHOD FOR AGGREGATING WEB FEED MINIMIZING REDUNDANCIES - Method for aggregating syndicated Web content, comprising the steps of: Retrieving ( | 05-19-2011 |
20110125719 | EFFICIENT SEGMENT DETECTION FOR DEDUPLICATION - Mechanisms are provided for efficiently detecting segments for deduplication. Data is analyzed to determine file types and file components. File types such as images may have optimal data segment boundaries set at the file boundaries. Other file types such as container files are delayered to extract objects to set optimal data segment boundaries based on file type or based on the boundaries of the individual objects. Storage of unnecessary information is minimized in a deduplication dictionary while allowing for effective deduplication. | 05-26-2011 |
20110125720 | METHODS AND APPARATUS FOR NETWORK EFFICIENT DEDUPLICATION - Mechanisms are provided for performing network efficient deduplication. Segments are extracted from files received for deduplication at a host connected to a target over one or more networks and/or fabrics in a deduplication system. Segment identifiers (IDs) are determined and compared with segment IDs for segments already deduplicated. Segments already deduplicated need not be transmitted to a target system. References and reference counts are modified at a target system. Updating references and reference counts may involve modifying filemaps, dictionaries, and datastore suitcases for both already deduplicated and not already deduplicated segments. | 05-26-2011 |
20110125721 | DELETION IN DATA FILE FORWARDING FRAMEWORK - Methods and apparatus, including computer program products, for deletion in data file forwarding framework. A framework includes a network of interconnected computer system nodes in which data files are continuously forwarded from computer memory to computer memory without storing on any physical storage device in the network, a central server coupled to the network, and a deletion server coupled to the network. | 05-26-2011 |
20110145207 | SCALABLE DE-DUPLICATION FOR STORAGE SYSTEMS - A method for performing storage system de-duplication. The method includes accessing a plurality of initial partitions of files of a storage system and performing a de-duplication on each of the initial partitions. For each duplicate found, an indicator comprising the metadata that is similar across the duplicates is determined. From these, indicators are chosen for which the likelihood that data objects bearing them contain duplicate data is high. Optimized partitions are generated in accordance with the chosen indicators. A de-duplication process is subsequently performed on each of the optimized partitions. | 06-16-2011 |
20110153576 | Multi-Client Generic Persistence for Extension Fields - Access to a networked application can be provided to multiple users while allowing user-specific extension fields to be created and maintained for exclusive access by the user creating the extension field. A user-customized data object that includes a standard field value of a standard field of a standard data object defined by the networked application and a user-specific extension field value of a user-specific extension field that modifies operation of the networked application for the user and that is not available to other users of the plurality of users can be received from a user for writing to memory. The user-specific extension field value can be separated from the standard field value. The standard field value and the user-specific extension field value can be persisted in a first database table and a second database table, respectively. Related systems, methods, and articles of manufacture are also provided. | 06-23-2011 |
20110173162 | SCRUBBING PROCEDURE FOR A DATA STORAGE SYSTEM - A method is provided for scrubbing information stored in a data storage system where the information is stored as a plurality of encoded fragments across multiple storage devices. The method includes maintaining on a first storage device a list of metadata entries corresponding to values that are stored in the data storage system at an At Maximum Redundancy (AMR) state, verifying that encoded fragments associated with each of the metadata entries are stored on a second storage, verifying that a corresponding metadata entry is stored on the first storage device for each encoded fragment that is stored on the second storage device, and scheduling for recovery any missing encoded fragments and/or any missing metadata entry. | 07-14-2011 |
20110178995 | MICROBLOG SEARCH INTERFACE - Methods, systems, and computer-readable media for searching microblog entries. The microblog entries may be generated through a single microblog website or across multiple microblog sites. Upon receiving a search input, a series of microblog entries responsive to the search input may be displayed to the user. The displayed microblog entries may be the most recently generated microblog entries that are responsive to the search input. In another embodiment, the microblog entries returned are a best match to the search criteria, which may be based on a user authority score for a user that drafted a microblog entry and additional characteristics of the microblog entry. | 07-21-2011 |
20110178996 | SYSTEM AND METHOD FOR CREATING A DE-DUPLICATED DATA SET - The present invention is directed to a system and method for creating a non-redundant data set from a plurality of data sources. Generally, the system and method operate by creating unique hash keys corresponding to unique data files; compiling the hash keys along with seeking information for the corresponding data files; de-duplicating the hash keys; and retrieving/storing the data files corresponding to the de-duplicated hash keys. Thus, in accordance with the system and method of the present invention, a non-redundant data set can be created from a plurality of data sources. The system of the present invention can operate independently or in conjunction with any de-duplicating methods and systems. For example, a de-duplicating method and system can be used to read and obtain data from a variety of media, regardless of the application used to generate the backup media. The component parts of a file may be read from a medium, including content and metadata pertaining to a file. These pieces of content and metadata may then be stored and associated. To avoid duplication of data, pieces of content and metadata may be compared to previously stored content and metadata. Furthermore, using these same methods and systems the content and metadata of a file may be associated with a location where the file resided. A database which stores these components and allows linking between the various stored components may be particularly useful in implementing embodiments of these methods and systems. | 07-21-2011 |
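The pipeline described above (create hash keys for unique files, compile the keys with seeking information, de-duplicate the keys, then retrieve or store the corresponding files) can be sketched in a few lines. All names here are illustrative, and the `(offset, data)` records are an assumed stand-in for real backup media:

```python
import hashlib

def build_unique_set(sources):
    """Compile hash keys plus seeking info, then keep one copy per unique key.

    `sources` maps a source name to a list of (offset, data) records; this
    layout is an assumed simplification of reading from backup media.
    """
    index = {}   # hash key -> (source, offset) where the data was first seen
    unique = {}  # hash key -> the single retained copy of the data
    for name, records in sources.items():
        for offset, data in records:
            key = hashlib.sha256(data).hexdigest()
            if key not in index:
                # first occurrence wins; later duplicates are dropped
                index[key] = (name, offset)
                unique[key] = data
    return index, unique
```

Because the index records where each key was first seen, a later pass can seek back to the original medium to retrieve the file content.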
20110184921 | System and Method for Data Driven De-Duplication - Described are computer-based methods and apparatuses, including computer program products, for removing redundant data from a storage system. In one example, a data delineation process delineates data targeted for de-duplication into regions using a plurality of markers. The de-duplication system determines which of these regions should be subject to further de-duplication processing by comparing metadata representing the regions to metadata representing regions of a reference data set. The de-duplication system identifies an area of data that incorporates the regions that should be subject to further de-duplication processing and de-duplicates this area with reference to a corresponding area within the reference data set. | 07-28-2011 |
20110191305 | STORAGE SYSTEM FOR ELIMINATING DUPLICATED DATA - A storage system | 08-04-2011 |
20110191306 | COMPUTER, ITS PROCESSING METHOD, AND COMPUTER SYSTEM - When a deletion request to delete a file system is made and a retention period of the file system to be deleted has not expired, the retention period end date and time is displayed at high speed. | 08-04-2011 |
20110196848 | DATA DEDUPLICATION BY SEPARATING DATA FROM META DATA - Provided are techniques for data deduplication. A chunk of data and a mapping of boundaries between file data and meta data in the chunk of data are received. The mapping is used to split the chunk of data into a file data stream and a meta data stream and to store file data from the file data stream in a first file and to store meta data from the meta data stream in a second file, wherein the first file and the second file are separate files. The file data in the first file is deduplicated. | 08-11-2011 |
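A minimal sketch of the split step in the entry above, assuming the boundary mapping arrives as `(start, end, kind)` triples; the patent does not specify the mapping's actual format:

```python
def split_chunk(chunk, boundaries):
    """Split a chunk into a file-data stream and a meta-data stream.

    `boundaries` is an assumed representation of the mapping: a list of
    (start, end, kind) triples with kind either "data" or "meta".
    """
    file_data = bytearray()
    meta_data = bytearray()
    for start, end, kind in boundaries:
        # route each byte range to the stream named by its kind
        target = file_data if kind == "data" else meta_data
        target.extend(chunk[start:end])
    return bytes(file_data), bytes(meta_data)
```

The file-data stream can then be written to its own file and deduplicated on its own, leaving the metadata stream untouched.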
20110208703 | SELECTIVITY ESTIMATION - The invention concerns the compression, querying and updating of tree structured data. For example, but not limited to, the invention concerns a synopsis. | 08-25-2011 |
20110218972 | DATA REDUCTION INDEXING - Example apparatus, methods, data structures, and computers control indexing to facilitate duplicate determinations. One example method includes indexing, in a global index, a unique chunk processed by a data de-duplicator. Indexing the unique chunk in the global index can include updating an expedited data structure associated with the global index. The example method can also include selectively indexing, in a temporal index, a relationship chunk processed by the data de-duplicator. The relationship chunk is a chunk that is related to another chunk processed by the data de-duplicator by sequence, storage location, and/or similarity hash value. Indexing the relationship chunk in the temporal index can also include updating one or more expedited data structures associated with the temporal index. The expedited data structures and indexes can then be consulted to resolve a duplicate determination being made by a data reducer. | 09-08-2011 |
20110218973 | SYSTEM AND METHOD FOR CREATING A DE-DUPLICATED DATA SET AND PRESERVING METADATA FOR PROCESSING THE DE-DUPLICATED DATA SET - The present invention provides a system and method for de-duplicating a large heterogeneous stock of data and collecting metadata associated with that data. Additionally, the system and method provide a means for retrieving data items based on specific criteria that can be identified in the collected metadata. | 09-08-2011 |
20110225128 | CLEAN STORE FOR OPERATING SYSTEM AND SOFTWARE RECOVERY - Systems, methods and apparatus for automatically identifying a version of a file that is expected to be present on a computer system and for automatically replacing a potentially corrupted copy of the file with a clean (or undamaged) copy of the expected version. Upon identifying a file on the computer system as being potentially corrupted, a clean file agent may perform an analysis based on the identity of the file and one or more other properties of the system to determine the version of the file that is expected to be present on the system. Once the expected version is identified, a clean replacement copy of the file may be obtained from a clean file repository by submitting a version identifier of the expected version. The version identifier may be a hash value, which may additionally be used to verify integrity of the clean copy. | 09-15-2011 |
20110225129 | METHOD AND SYSTEM TO SCAN DATA FROM A SYSTEM THAT SUPPORTS DEDUPLICATION - An interface is disclosed that makes information obtained from a file deduplication process available to an application for the efficient operation thereof. A data deduplication repository is scanned to determine a plurality of file segments and respective checksum values associated with the segments. A data structure is generated that allows shared segments to be identified by indexing using a common checksum value. The segments also indicate the file to which they belong and may also include a timestamp value. This data structure is updated as files are modified, etc. The data structure is accessible to an application program so that the application program can readily determine which segments are shared between multiple files. With this information, the application can efficiently process the segment once rather than multiple times. Timestamps can be used by the application to efficiently identify only those segments that were accessed after a given time. | 09-15-2011 |
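The data structure described in the entry above (segments indexed by a common checksum, each reference carrying the owning file and a timestamp) might look like the following sketch; the input layout and function names are assumptions, not the patent's interface:

```python
import hashlib

def build_segment_index(files):
    """Index segments by checksum so shared segments are found in one lookup.

    `files` maps a file name to a list of (segment_bytes, timestamp) pairs,
    an assumed simplification of scanning a deduplication repository.
    """
    index = {}  # checksum -> list of (file, timestamp) references
    for name, segments in files.items():
        for seg, ts in segments:
            cs = hashlib.md5(seg).hexdigest()
            index.setdefault(cs, []).append((name, ts))
    return index

def shared_segments(index):
    """Checksums referenced by more than one file (process these only once)."""
    return [cs for cs, refs in index.items() if len({f for f, _ in refs}) > 1]
```

An application could additionally filter the references by timestamp to visit only segments accessed after a given time.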
20110225130 | STORAGE DEVICE, AND PROGRAM AND METHOD FOR CONTROLLING STORAGE DEVICE - In a storage device, an information acquisition unit acquires and stores information in an information memory unit. A data acquisition unit acquires data. A deduplication unit divides the acquired data by a smaller division size than that indicated in additional information included in the information stored in the information memory unit, performs deduplication, and stores the resulting data in a data memory unit. The information memory unit stores the information including the additional information that indicates the division size used for dividing data in deduplication of another device. | 09-15-2011 |
20110231374 | Highly Scalable and Distributed Data De-Duplication - This disclosure relates to systems and methods for both maintaining referential integrity within a data storage system, and freeing unused storage in the system, without the need to maintain reference counts to the blocks of storage used to represent and store the data. | 09-22-2011 |
20110231375 | SPACE RECOVERY WITH STORAGE MANAGEMENT COUPLED WITH A DEDUPLICATING STORAGE SYSTEM - Provided are techniques for space recovery with storage management coupled with a deduplicating storage system. A notification is received that one or more data objects have been logically deleted by deleting metadata about the one or more data objects, wherein the notification provides storage locations within one or more logical storage volumes corresponding to the deleted one or more data objects, wherein each of the one or more data objects are divided into one or more extents. In response to determining that a sparse file represents the one or more logical storage volumes, physical space is deallocated by nulling out space in the sparse file corresponding to each of the one or more extents. | 09-22-2011 |
20110238634 | STORAGE APPARATUS WHICH ELIMINATES DUPLICATED DATA IN COOPERATION WITH HOST APPARATUS, STORAGE SYSTEM WITH THE STORAGE APPARATUS, AND DEDUPLICATION METHOD FOR THE SYSTEM - According to one embodiment, a storage apparatus includes a first storage unit, a second storage unit and a control module. The control module stores the address of a block data item and a block identifier unique to the block data item, included in a write request, in the second storage unit such that the address and the block identifier are associated with each other when a request to specify the writing of data including the block data item into the storage apparatus has been generated at a host apparatus and when the host apparatus has transmitted the write request because the data item has coincided with any one of the block data items stored in the cache of the host apparatus. | 09-29-2011 |
20110246431 | STORAGE SYSTEM - The present invention relates to a storage system including a de-duplicate function and a full-text search function or the like, and reduces the amount of full-text search index information to save storage resources. In this system, a storage apparatus includes a processing unit for de-duplicating a plurality of files having the same content within a file group of data inputted/outputted through a host apparatus. A full-text search processing server performs full-text search processing on the file group and includes a processing unit for coordinating the full-text search processing with de-duplication. Index information creation for a plurality of target files having the same content is inhibited by the full-text search processing unit according to the de-duplication status of the file group. Thereby, the amount of index information can be reduced. | 10-06-2011 |
20110270808 | Systems and Methods for Discovering Synonymous Elements Using Context Over Multiple Similar Addresses - A clustering-based approach to data standardization is provided. Certain embodiments take as input a plurality of addresses, identify one or more features of the addresses, cluster the addresses based on the one or more features, utilize the cluster(s) to provide a data-based context useful in identifying one or more synonyms for elements contained in the address(es), and standardize the address(es) to an acceptable format, with one or more synonyms and/or other elements being added to or taken away from the input address(es) as part of the standardization process. | 11-03-2011 |
20110270809 | HEAT INDICES FOR FILE SYSTEMS AND BLOCK STORAGE - Techniques and mechanisms are provided to allow for selective optimization, including deduplication and/or compression, of portions of files and data blocks. Data access is monitored to generate a heat index for identifying sections of files and volumes that are frequently and infrequently accessed. These frequently used portions may be left non-optimized to reduce or eliminate optimization I/O overhead. Infrequently accessed portions can be more aggressively optimized. | 11-03-2011 |
20110270810 | METHODS AND APPARATUS FOR ACTIVE OPTIMIZATION OF DATA - Techniques and mechanisms are provided to support live file optimization. Active I/O access to an optimization target is monitored during optimization. Active files need not be taken offline or made unavailable to an application during optimization and retain the ability to support file operations such as read, write, unlink, and truncate while an optimization engine performs deduplication and/or compression on active file ranges. | 11-03-2011 |
20110276543 | VIRTUAL BLOCK DEVICE - A virtual block device is an interface with applications that appears to the applications as a memory device, such as a standard block device. The virtual block device interacts with additional elements to do data deduplication to files at the block level such that one or more files accessed using the virtual block device have at least one block which is shared by the one or more files. | 11-10-2011 |
20110276544 | INFORMATION PROCESSING METHOD, INFORMATION PROCESSING PROGRAM AND INFORMATION PROCESSING DEVICE - Provided are an information processing method, an information processing program, and an information processing device for copying or moving a file. The method includes a step for comparing file information and judging whether the file information coincides. The comparison elements of the file information include the file content. That is, the method includes: a step for comparing the file names of the copy source and copy destination, or the movement source and the movement destination, and judging whether a file of the same name exists in the copy destination or the movement destination; a step for comparing the file contents, if a file of the same name exists in the copy destination or the movement destination, so as to judge whether the file contents are identical; and a step for outputting the comparison results of the file contents. | 11-10-2011 |
20110307455 | CONTACT INFORMATION MERGER AND DUPLICATE RESOLUTION - Merger and duplicate resolution for contact information across platforms is managed employing contact objects and aggregating the contact objects into contact models. Contact data from internal and/or external data stores may be retrieved and contact objects created for each contact from each contact store. A contact model for each contact entity may be created by aggregating contact data from contact objects across the contact stores. The aggregation may include duplicate resolution through weighting of communication system types, ranking of contact information type, and similar approaches. The contact models may be dynamically updated based on changes to the contact objects. | 12-15-2011 |
20110307456 | ACTIVE FILE INSTANT CLONING - Techniques and mechanisms are provided to instantly clone active files including active optimized files. When a new instance of an active file is created, a new stub is generated in the user namespace and a block map file is cloned. The block map file includes the same offsets and location pointers that existed in the original block map file. No user file data needs to be copied. If the cloned file is later modified, the behavior can be same as what happens when a de-duplicated file is modified. | 12-15-2011 |
20110307457 | INTEGRATED DUPLICATE ELIMINATION SYSTEM, DATA STORAGE DEVICE, AND SERVER DEVICE - First, a duplicate elimination process based on a first duplicate elimination process, in which both a duplicate elimination effect and a processing load are low, is executed. Information related to a processing result of the duplicate elimination process based on the first duplicate elimination process is acquired prior to execution of a second duplicate elimination process, in which both the duplicate elimination effect and the processing load are high. Target data of the second duplicate elimination process is narrowed down based on the acquired information. The second duplicate elimination process is applied only to the narrowed down target data. As a result, an integrated duplicate elimination system with a lower processing load than in a conventional system is realized while attaining a high duplicate elimination effect. | 12-15-2011 |
20110320415 | PIECEMEAL LIST PREFETCH - Prefetching data using a piecemeal list prefetching method. This is achieved by various means, including building a plurality of data pages, sorting the plurality of data pages into sequential data pages and a list of non-sequential pages, prefetching the sequential data pages using a first prefetching technique, and prefetching the list of non-sequential data pages using a second prefetching technique. | 12-29-2011 |
20110320416 | Eliminating Redundant Processing of Data in Plural Node Systems - According to a present invention embodiment, a system avoids duplicate processing of database objects to ensure operation integrity in a database system including a plurality of nodes. The system comprises a computer system including at least one processor. The computer system receives a data operation from a secondary node, executes the received data operation, and identifies each database object that is relocated based on the executed data operation. The computer system communicates to the secondary node operations performed by the computer system for execution of the data operation and an indication of each relocated database object. The secondary node stores an identifier reflecting the relocation for each relocated database object to prevent re-processing of the relocated database objects for the data operation. Embodiments of the present invention further include a method and computer program product for avoiding duplicate processing of database objects in substantially the same manner described above. | 12-29-2011 |
20120005171 | Deduplication of data object over multiple passes - In each of a number of passes to deduplicate a data object, a transaction is started. Where an offset into the object has previously been set, the offset is retrieved; otherwise, the offset is set to reference a beginning of the object. A portion of the object beginning at the offset is deduplicated until an end-of-transaction criterion has been satisfied. The transaction is ended to commit deduplication; where the object has not yet been completely deduplicated, the offset is moved just past where deduplication has already occurred. The object is locked during each pass; other processes cannot access the object during each pass, but can access the object between passes. Each pass is relatively short, so the length of time in which the object is inaccessible is relatively short. By comparison, deduplicating an object within a single pass prevents other processes from accessing the object for a longer time. | 01-05-2012 |
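A toy illustration of the multi-pass idea in the entry above, with the transaction boundary reduced to a bounded amount of work per pass; the chunk size, names, and in-memory `seen` table are all illustrative simplifications:

```python
def deduplicate_in_passes(obj, seen, chunk_size=4, max_chunks_per_pass=2):
    """Deduplicate `obj` over multiple short passes.

    `seen` maps chunk content to a stored-chunk id; the offset persists
    between passes so each pass resumes just past the committed work.
    """
    offset = 0
    refs = []
    while offset < len(obj):
        # begin "transaction": process only a bounded amount of work
        done = 0
        while offset < len(obj) and done < max_chunks_per_pass:
            chunk = obj[offset:offset + chunk_size]
            refs.append(seen.setdefault(chunk, len(seen)))
            offset += len(chunk)
            done += 1
        # end "transaction": commit and advance the saved offset
    return refs
```

Between the inner loops a real system would commit the transaction and release the object lock, which is what lets other processes access the object between passes.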
20120016844 | DATA PROCESSING APPARATUS, DATA PROCESSING METHOD, AND COMPUTER-READABLE STORAGE MEDIUM STORING A PROGRAM - A table is provided that associates a document name with content that is included in the document data and for which the number of times duplication is permitted is restricted. The table is referenced, and a determination is made as to whether document data targeted for duplication processing includes content for which duplication processing is restricted. If a determination is made that such content is included, deletion-completed document data, in which the content has been deleted from the document data, is generated. | 01-19-2012 |
20120016845 | SYSTEM AND METHOD FOR DATA DEDUPLICATION FOR DISK STORAGE SUBSYSTEMS - A method for data deduplication includes the following steps. First, segmenting an original data set into a plurality of data segments. Next, transforming the data in each data segment into a transformed data representation that has a band-type structure for each data segment. The band-type structure includes a plurality of bands. Next, selecting a first set of bands, grouping them together and storing them with the original data set. The first set of bands includes non-identical transformed data for each data segment. Next, selecting a second set of bands and grouping them together. The second set of bands includes identical transformed data for each data segment. Next, applying a hash function onto the transformed data of the second set of bands and thereby generating transformed data segments indexed by hash function indices. Finally, storing the hash function indices and the transformed data representation of one representative data segment in a deduplication database. | 01-19-2012 |
20120016846 | DATA DEDUPLICATION BY SEPARATING DATA FROM META DATA - Provided are techniques for data deduplication. A chunk of data and a mapping of boundaries between file data and meta data in the chunk of data are received. The mapping is used to split the chunk of data into a file data stream and a meta data stream and to store file data from the file data stream in a first file and to store meta data from the meta data stream in a second file, wherein the first file and the second file are separate files. The file data in the first file is deduplicated. | 01-19-2012 |
20120030183 | Security erase of a delete file and of sectors not currently assigned to a file - Secure erase of deleted files and unallocated sectors on storage media such that any previous data is non-recoverable. The database contains sets of data patterns used to overwrite the data on different physical media. The software programs manage the overwriting process automatically when a file has been deleted. The process also finds de-allocated sectors that were pruned from a file or that escaped the file deletion process. Data will never be found on deleted or pruned sectors because it is overwritten. | 02-02-2012 |
20120066187 | PARCEL DATA ACQUISITION AND PROCESSING - In some embodiments, scripts may be used to perform parcel data acquisition, conversion, and clean-up/repair in an automated manner and/or through graphical user interfaces. The scripts may be used, for example, to repair geometries of new parcel data, convert multi-part parcel geometries to single part parcel geometries (explode), eliminate duplicate parcel geometries, append columns, create feature classes, and append feature classes. These scripts may be executed in a predetermined manner to increase efficiency. In some embodiments, different combinations of attributes may be appended to stored parcel data. In some embodiments, a tracking application may be used to track information about sources of data. In some embodiments, a tracking application may be used to track which system users are assigned to specific tasks (e.g., in a data acquisition project). | 03-15-2012 |
20120078857 | COMPARING AND SELECTING DATA CLEANSING SERVICE PROVIDERS - The present invention extends to methods, systems, and computer program products for exploring and selecting data cleansing service providers. Embodiments of the invention permit a user to explore different data cleansing service providers and compare quality results from the different data cleansing service providers. Sample data is mapped to a specified data domain. A list of service providers, for cleansing data for the selected data domain, is provided to a user. The user selects a subset of service providers. The sample data is submitted to the subset of service providers, which return results including allegedly cleansed data. The results are profiled and a comparison of the subset of service providers is presented to the user. The user selects a service provider to use when cleansing further data. | 03-29-2012 |
20120078858 | De-Duplicating Data in a Network with Power Management - A method, computer system, and computer program product for managing copies of data objects in a network data processing system. The computer system identifies copies of a data object stored on storage devices. The computer system places the storage devices into groups. Each storage device in a group has a smallest distance from the storage device to a center location for the group as compared to distances to center locations for other groups within the groups. The computer system selects a portion of the copies of the data object for removal from the storage devices based on a management of power for the storage devices such that remaining set of storage devices in each group is capable of handling concurrent requests that have been made historically for the copies of the data object. The computer system removes the portion of the copies of the data object from the storage devices. | 03-29-2012 |
20120084268 | CONTENT ALIGNED BLOCK-BASED DEDUPLICATION - A content alignment system according to certain embodiments aligns a sliding window at the beginning of a data segment. The content alignment system performs a block alignment function on the data within the sliding window. A deduplication block is established if the output of the block alignment function meets a predetermined criteria. At least part of a gap is established if the output of the block alignment function does not meet the predetermined criteria. The predetermined criteria is changed if a threshold number of outputs fail to meet the predetermined criteria. | 04-05-2012 |
20120084269 | CONTENT ALIGNED BLOCK-BASED DEDUPLICATION - A content alignment system according to certain embodiments aligns a sliding window at the beginning of a data segment. The content alignment system performs a block alignment function on the data within the sliding window. A deduplication block is established if the output of the block alignment function meets a predetermined criteria. At least part of a gap is established if the output of the block alignment function does not meet the predetermined criteria. The predetermined criteria is changed if a threshold number of outputs fail to meet the predetermined criteria. | 04-05-2012 |
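One way to picture the sliding-window scheme from the two entries above, with a trivial sum-modulus check standing in for the patent's block alignment function and a halving step standing in for the criterion change after repeated misses:

```python
def align_blocks(data, window=4, modulus=8, max_misses=16):
    """Find content-aligned block boundaries with a sliding window.

    The sum-modulus criterion and the constants are illustrative
    stand-ins; they are not the patent's actual alignment function.
    """
    boundaries = []
    misses = 0
    i = 0
    while i + window <= len(data):
        if sum(data[i:i + window]) % modulus == 0:
            # criterion met: establish a deduplication block boundary
            boundaries.append(i + window)
            i += window
            misses = 0
        else:
            # criterion missed: the skipped byte becomes part of a gap
            i += 1
            misses += 1
            if misses >= max_misses:
                modulus = max(2, modulus // 2)  # relax the criterion
                misses = 0
    return boundaries
```

Because boundaries depend on content rather than fixed offsets, inserting bytes early in a stream shifts only nearby boundaries instead of invalidating every block that follows.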
20120084270 | STORAGE OPTIMIZATION MANAGER - Techniques and mechanisms provide a storage optimization manager. Data may be optimized and maintained on various nodes in a cluster. Particular nodes may be overburdened while other nodes remain relatively unused. Techniques are provided to efficiently optimize data onto nodes to enhance operational efficiency. Data access requests for optimized data are monitored and managed to allow for intelligent maintenance of optimized data. | 04-05-2012 |
20120089578 | Data deduplication - A method of deduplicating data is disclosed comprising mounting, by a deduplication appliance, network shared storage of a client machine, via a network, accessing data to be deduplicated on the network shared storage device, deduplicating the data, storing the deduplicated data on a second storage device, and replacing the data in the network shared storage device by at least one indicator of the location of the deduplicated data in the second storage device. A method is also disclosed for copying deduplicated data stored by a deduplication appliance, in which a client machine receives a request to copy data from a first location to a second location and, if the source and the destination are both on the deduplication appliance, provides at least one second indicator at the second location pointing to the deduplicated data on the deduplication appliance. Systems are also disclosed. | 04-12-2012 |
20120102003 | PARALLEL DATA REDUNDANCY REMOVAL - A method, system, and computer usable program product for parallel data redundancy removal are provided in the illustrative embodiments. A plurality of values is computed for a record in a plurality of records stored in a storage device. The plurality of values for the record is distributed to corresponding queues in a plurality of queues, wherein each of the plurality of queues is associated with a corresponding section of a Bloom filter. A determination is made whether each value distributed to the corresponding queues for the record is indicated by a corresponding value in the corresponding section of the Bloom filter. The record is identified as a redundant record in response to a determination that each value distributed to the corresponding queues for the record is indicated by a corresponding value in the corresponding section of the Bloom filter. | 04-26-2012 |
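The sectioned Bloom filter in the entry above can be sketched as follows; the section count, bit width, and per-section hashing are illustrative choices, and a real deployment would size the filter from the expected record count and target false-positive rate:

```python
import hashlib

class BloomFilter:
    """A tiny Bloom filter split into sections, one per worker queue."""

    def __init__(self, sections=4, bits_per_section=1024):
        self.sections = [0] * sections  # each section is an int bit-field
        self.bits = bits_per_section

    def _positions(self, record):
        # one hash value per section, mirroring the per-queue distribution
        for k in range(len(self.sections)):
            h = hashlib.sha256(f"{k}:{record}".encode()).digest()
            yield k, int.from_bytes(h[:4], "big") % self.bits

    def seen_then_add(self, record):
        """Return True if every section already had the record's bit set."""
        hits = all(self.sections[k] >> p & 1 for k, p in self._positions(record))
        for k, p in self._positions(record):
            self.sections[k] |= 1 << p
        return hits
```

A record is flagged as redundant only when every section reports a hit, which mirrors the requirement in the abstract that each distributed value be indicated in its corresponding Bloom filter section.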
20120102004 | DELETING A FILE ON READING OF THE FILE - A digital file may be stored on a storage device and deleted upon reading of the file. In one example, at least a portion of the stored file may be read, such as by a processor. In response to the reading of the stored file, at least part of the at least a portion of the stored file that has been read may be deleted. In further examples, the at least part of the at least a portion of the stored file may be deleted progressively while the stored file is being read, after a triggering event, or after a delay in time. According to another example, a user may provide an indication representative of the at least part of the at least a portion of the stored file that is to be deleted after the at least a portion of the stored file is read. | 04-26-2012 |
20120109907 | ON-DEMAND DATA DEDUPLICATION - Embodiments of the invention relate to performing on-demand data deduplication for managing data and storage space. Redundant data in a system is detected. Availability of data storage space in the system is periodically evaluated. Performance parameters of the system are evaluated. Detected redundant data is selected based on the data storage availability and performance parameters of the system. Whether at least a portion of the selected redundant data is to be deduplicated is then determined. | 05-03-2012 |
20120117036 | SMART ADDRESS BOOK - An apparatus, method, system, and computer-readable medium are provided for maintaining contact information associated with a contact. In some embodiments a request associated with a contact may be received. Contact information may be obtained from one or more external or internal sources. One or more confidence scores may be generated for the obtained contact information and for one or more values received with the request. Based on the confidence score(s), one or more values associated with the contact may be incorporated in one or more data stores. In some embodiments, suggestions for contact related information may be generated. Responses to the suggestions may be used to update the generated confidence score(s). | 05-10-2012 |
20120117037 | LOG CONSOLIDATION DEVICE, LOG CONSOLIDATION METHOD, AND COMPUTER-READABLE MEDIUM - A log consolidation device includes: a selection unit that selects at least part of fields included in multiple logs stored in a storage unit and chronologically representing processes executed by one or multiple processing units, each log including information representing content of a process and a count value relating to the process, the information being divided into multiple fields; a deletion unit that deletes, from at least part of the multiple logs stored in the storage unit, items of information in the fields selected by the selection unit; and an integration unit that integrates into a single log two or more of the multiple logs having identical items of information in fields that were not deleted by the deletion unit by summing up the count values of the two or more of the multiple logs. | 05-10-2012 |
20120124011 | METHOD FOR INCREASING DEDUPLICATION SPEED ON DATA STREAMS FRAGMENTED BY SHUFFLING - A computer-implemented method for deduplicating an incoming data sequence can include the steps of storing signature values for a plurality of data blocklets of a parent data sequence in a deduplication index, sequentially storing signature values for at least some of the plurality of data blocklets of the parent data sequence in a first storage location outside of the deduplication index, determining that a first data blocklet in the incoming data sequence is absent from the parent data sequence, storing a signature value for the first data blocklet in a second storage location outside of the deduplication index, storing a guarded link linking the first data blocklet to the second data blocklet into the second storage location, determining that a second data blocklet that follows the first data blocklet in the incoming data sequence is present in the parent data sequence, the second data blocklet having a signature value that is stored in the first storage location, and copying at least a portion of the contents of the first storage location and the second storage location into a cache to expedite access during deduplication of the incoming data sequence. | 05-17-2012 |
20120124012 | SYSTEM AND METHOD FOR CREATING DEDUPLICATED COPIES OF DATA BY TRACKING TEMPORAL RELATIONSHIPS AMONG COPIES AND BY INGESTING DIFFERENCE DATA - Systems and methods are disclosed for forming deduplicated images of a data object that changes over time using difference information between temporal states of the data object. The method includes organizing the content of the data object for a first temporal state as a plurality of content segments and storing the content segments in a data store; creating an organized arrangement of hash structures to represent the data object in its first temporal state; receiving difference information for the data object; forming at least one hash signature for the changed content; and storing the changed content that is unique in the data store as content segments, whereby a deduplicated image of the data object for a second temporal state is stored without requiring reception of a complete image of the data object for the second temporal state. | 05-17-2012 |
20120124013 | SYSTEM AND METHOD FOR CREATING DEDUPLICATED COPIES OF DATA STORING NON-LOSSY ENCODINGS OF DATA DIRECTLY IN A CONTENT ADDRESSABLE STORE - Systems and methods are disclosed for storing deduplicated images in which a portion of the image is stored in encoded form directly in a hash table, the method comprising: organizing unique content of each data object as a plurality of content segments and storing the content segments in a data store; receiving content to be included in the deduplicated image of the data object; determining if the received content may be encoded using a predefined non-lossy encoding technique such that the encoded value would fit within the field for containing a hash signature; if so, placing the encoding in the field and marking the hash structure to indicate that the field contains encoded content; otherwise, generating a hash signature for the received content and placing the hash signature in the field and placing the received content in a corresponding content segment if it is unique. | 05-17-2012 |
20120124014 | SYSTEM AND METHOD FOR CREATING DEDUPLICATED COPIES OF DATA BY SENDING DIFFERENCE DATA BETWEEN NEAR-NEIGHBOR TEMPORAL STATES - Systems and methods are disclosed for using a first deduplicating store to update a second deduplicating store with information representing how data objects change over time, said method comprising: at a first and a second deduplicating store, for each data object, maintaining an organized arrangement of temporal structures to represent a corresponding data object over time, wherein each structure is associated with a temporal state of the data object and wherein the logical arrangement of structures is indicative of the changing temporal states of the data object; finding a temporal state that is common to and in temporal proximity to the current state of the first and second deduplicating stores; and compiling and sending a set of hash signatures for the content that has changed from the common state to the current temporal state of the first deduplicating store. | 05-17-2012 |
20120124015 | METHOD FOR DATABASE CONSOLIDATION AND DATABASE SEPARATION - Methods for consolidating databases while maintaining data integrity are disclosed. A source database and target database are compared, and consolidated, and the consolidated databases are used. In other examples, a database is split to support divested entities. | 05-17-2012 |
20120130961 | System And Method For Identifying Unique And Duplicate Messages - A system and method for identifying unique and duplicate messages is provided. Messages are maintained, and a header and message body are extracted from each of the messages. A hash code is calculated for each message over at least part of the header and the body of that message. The messages with matching hash codes are grouped. One message in each group with two or more messages is randomly selected as a unique message. The remaining messages in the group are marked as exact duplicate messages. | 05-24-2012 |
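The grouping described in 20120130961 (hash each message over header and body, group matching hashes, keep one message per group as unique, mark the rest as duplicates) could be sketched as below. All names are hypothetical; the patent selects the unique message randomly, whereas this sketch simply keeps the first in each group.

```python
import hashlib
from collections import defaultdict

def split_unique_and_duplicates(messages):
    """Group messages by a hash over header and body; keep one message per
    group as unique and flag the rest as exact duplicates."""
    groups = defaultdict(list)
    for msg in messages:
        digest = hashlib.sha256(
            (msg["header"] + "\n" + msg["body"]).encode("utf-8")).hexdigest()
        groups[digest].append(msg)
    unique, duplicates = [], []
    for group in groups.values():
        unique.append(group[0])      # first message stands in for the group
        duplicates.extend(group[1:])
    return unique, duplicates

messages = [
    {"header": "From: a@x", "body": "hello"},
    {"header": "From: a@x", "body": "hello"},   # exact duplicate
    {"header": "From: b@y", "body": "hi"},
]
unique, duplicates = split_unique_and_duplicates(messages)
```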
20120130962 | DATA AUDIT SYSTEM - Embodiments of the present invention provide a system, method, and computer program product for auditing data based on server access of databases. In one embodiment, an expected rate is received, and a rate request is sent to a server, wherein the rate request promotes an entry of the rate request to access a database. A rate response is received from the server, wherein the rate response corresponds to the rate request and includes a specified rate. A sale request is sent to the server, wherein the sale request corresponds to the rate request and promotes an entry of the sale request to access the database. A sale response is received from the server, wherein the sale response corresponds to the sale request and includes a sale rate. A message is output based on the expected rate, the specified rate, and/or the sale rate. | 05-24-2012 |
20120136841 | SYSTEM AND METHOD FOR APPLICATION AWARE DE-DUPLICATION OF DATA BLOCKS ON A VIRTUALIZED STORAGE ARRAY - A system and method for application aware de-duplication (de-dup) of data blocks in a virtualized storage array is disclosed. In one embodiment, in a method of application aware de-dup of data blocks on virtualized storage arrays in a storage area network, a de-dup agent is enabled on each of one or more components of the storage area network. A master list of metadata associated with indexed data is then created and stored in the virtualized storage arrays. One or more sublists of metadata are then created from the master list and are stored. Upon receiving a write request from an application residing in the host device, it is determined whether the data block being written has an entry in a sublist stored in the host device, and if so, the data block is then replaced with a pointer indicating where the data block is residing in the virtualized storage arrays. | 05-31-2012 |
20120136842 | PARTITIONING METHOD OF DATA BLOCKS - A partitioning method of data blocks is applied to a data de-duplication process. The method includes the following steps. A file structural tank partitioning program and a data block partitioning process are performed on an input file. A fingerprint feature value of a generated data block is compared with fingerprint feature values recorded in completed file structural tanks. If a duplicate fingerprint feature value exists in another file structural tank, it is determined whether the duplicate data block is a first data block of the existing file structural tank. If the data block is the same as the first data block of the existing file structural tank, it is further determined whether the structural tank feature values of the file structural tanks of the two data blocks are the same; and if yes, the data block to be compared is deleted. | 05-31-2012 |
20120143832 | DYNAMIC REWRITE OF FILES WITHIN DEDUPLICATION SYSTEM - Various embodiments for rewriting data in a deduplication storage environment by a processor device are provided. A dynamic layer above a sequential deduplication file system (denoted as DFS) implements the rewrite functionality. A user file is composed of one or more DFS files. As incoming data is written into a user file, the data is written by the dynamic layer sequentially into DFS files, created one by one. For each user file, this dynamic layer creates and maintains a dynamic metadata file in a regular, non-deduplicated file system. This metadata file contains entries pointing to sections of DFS files. | 06-07-2012 |
20120150823 | DE-DUPLICATION INDEXING - Example apparatus, methods, and computers support data de-duplication indexing. One example apparatus includes a processor, a memory, and an interface to connect the processor, memory, and a set of logics. The set of logics includes an establishment logic to instantiate one-to-many de-duplication data structures, a manipulation logic to update the de-dupe data structure(s), a key logic to generate a key from a block of data to be de-duplicated, and a similarity logic to make a similarity determination for the block. The similarity determination identifies the block as a unique block, a duplicate block, or a block that meets a similarity threshold with respect to a stored de-duplicated block accessible through the dedupe data structure. The similarity determination involves comparing the block to be de-duplicated to a stored block available to the apparatus using a byte-by-byte approach, a hash approach, a delta hash approach and/or a sampling sequence approach. | 06-14-2012 |
20120150824 | Processing System of Data De-Duplication - A processing system of data de-duplication includes a client and a server. A characteristic value of each data block is compared with characteristic values stored in the client. If the same characteristic value exists in the client, the data block corresponding to the compared characteristic value is deleted. A server data management module is connected to a client data management module through a network. If the characteristic value does not exist in the server, a corresponding data block is obtained from the client, and the new data block and the characteristic value are stored in the server. A file management module records a storage address of the data blocks in the server into an index file. In this way, the server is not required to perform all data de-duplication processes of the clients, thus reducing the occupation of bandwidth and improving the processing efficiency of the server. | 06-14-2012 |
20120150825 | Cleansing a Database System to Improve Data Quality - According to one embodiment of the present invention, a system controls cleansing of data within a database system, and comprises a computer system including at least one processor. The system receives a data set from the database system, and one or more features of the data set are selected for determining values for one or more characteristics of the selected features. The determined values are applied to a data quality estimation model to determine data quality estimates for the data set. Problematic data within the data set are identified based on the data quality estimates, where the cleansing is adjusted to accommodate the identified problematic data. Embodiments of the present invention further include a method and computer program product for controlling cleansing of data within a database system in substantially the same manner described above. | 06-14-2012 |
20120150826 | DISTRIBUTED DEDUPLICATED STORAGE SYSTEM - A distributed, deduplicated storage system according to certain embodiments is arranged in a parallel configuration including multiple deduplication nodes. Deduplicated data is distributed across the deduplication nodes. The deduplication nodes can be networked together and communicate with one another using a light-weight, customized communication scheme (e.g., a scheme based on FTP or HTTP). In some cases, deduplication management information including deduplication signatures and/or other metadata is stored separately from the deduplicated data in deduplication management nodes, improving performance and scalability. | 06-14-2012 |
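The node layout in 20120150826 (deduplicated data spread across nodes, with signatures held separately in management nodes) could be sketched as follows. This is a minimal illustration under assumed names: the signature determines the owning node, so identical chunks are stored exactly once.

```python
import hashlib

NUM_NODES = 4
nodes = [{} for _ in range(NUM_NODES)]   # per-node deduplicated chunk stores
index = {}                               # management node: signature -> owning node

def store_chunk(data):
    """Route a chunk to a node by its signature; identical chunks share one
    signature, so only a single copy is ever stored."""
    sig = hashlib.sha256(data).hexdigest()
    if sig not in index:
        node_id = int(sig, 16) % NUM_NODES   # signature picks the owning node
        nodes[node_id][sig] = data
        index[sig] = node_id
    return sig

def fetch_chunk(sig):
    """Look up the owning node in the management index, then read the chunk."""
    return nodes[index[sig]][sig]

s1 = store_chunk(b"same payload")
s2 = store_chunk(b"same payload")    # deduplicated: no second copy stored
s3 = store_chunk(b"other payload")
```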
20120150827 | DATA STORAGE DEVICE WITH DUPLICATE ELIMINATION FUNCTION AND CONTROL DEVICE FOR CREATING SEARCH INDEX FOR THE DATA STORAGE DEVICE - A file server performs duplicate elimination on files, and creates a virtual file system that does not include a duplicate file and is used for creating a search index. A search server acquires search target files from the virtual file system in the file server, and creates the search index. | 06-14-2012 |
20120158670 | FINGERPRINTS DATASTORE AND STALE FINGERPRINT REMOVAL IN DE-DUPLICATION ENVIRONMENTS - A storage server is coupled to a storage device that stores blocks of data, and generates a fingerprint for each data block stored on the storage device. The storage server creates a fingerprints datastore that is divided into a primary datastore and a secondary datastore. The primary datastore comprises a single entry for each unique fingerprint, and the secondary datastore comprises an entry having a fingerprint identical to that of an entry in the primary datastore. The storage server merges entries in a changelog with the entries in the primary datastore to identify duplicate data blocks in the storage device and frees the identified duplicate data blocks in the storage device. The storage server stores the entries that correspond to the freed data blocks to a third datastore and overwrites the primary datastore with the entries from the merged data that correspond to the unique fingerprints to create an updated primary datastore. | 06-21-2012 |
20120158671 | METHOD AND SYSTEM FOR PROCESSING DATA - Methods, computer systems, and computer program products for processing data in a computing environment are provided. The computing environment for data deduplication storage receives a plurality of write operations for deduplication storage of the data. The data is buffered in a plurality of buffers, with overflow temporarily stored to a memory hierarchy, when the data received for deduplication storage is sequential or non-sequential. The data is accumulated and updated in the plurality of buffers per a data structure, the data structure serving as a fragment map between the plurality of buffers and a plurality of user file locations. The data is restructured in the plurality of buffers to form a complete sequence of a required sequence size. The data is provided as at least one stream to a stream-based deduplication algorithm for processing and storage. | 06-21-2012 |
20120158672 | Extensible Pipeline for Data Deduplication - The subject disclosure is directed towards data deduplication (optimization) performed by phases/modules of a modular data deduplication pipeline. At each phase, the pipeline allows modules to be replaced, selected or extended, e.g., different algorithms can be used for chunking or compression based upon the type of data being processed. The pipeline facilitates secure data processing, batch processing, and parallel processing. The pipeline is tunable based upon feedback, e.g., by selecting modules to increase deduplication quality, performance and/or throughput. Also described is selecting, filtering, ranking, sorting and/or grouping the files to deduplicate, e.g., based upon properties and/or statistical properties of the files and/or a file dataset and/or internal or external feedback. | 06-21-2012 |
20120158673 | STORING AND PUBLISHING CONTENTS OF A CONTENT STORE - Aspects are disclosed for publishing contents of a content store. A storage operation is performed, and a completion of the storage operation is detected. Here, the storage operation redundantly stores contents of a content set onto instances associated with a content store. The contents stored in the instances are then published in response to the completion of the storage operation. In another aspect, a dataset table is generated to facilitate storing contents of a content set, which include payload and metadata. The payload is stored onto a payload table, and the metadata is stored onto a metadata table. For this embodiment, the dataset table includes a first foreign key to the payload table, whereas the metadata table includes a second foreign key to the dataset table. The dataset table is monitored to determine a storage status of the contents, and the contents are subsequently published based on the storage status. | 06-21-2012 |
20120158674 | Indexing for deduplication - Systems and methods of indexing for deduplication are disclosed. An example method includes providing a first table in a first storage and a second table in a second storage. The method also includes looking up a key in the first table. If the key is not found in the first table, the key is looked up in the second table. If the key is found in the second table, the key is copied from the second table to the first table. If the key is not found in the second table either, an entry with the key is inserted in the first table. The method also includes applying an operation to the entry associated with the key in the first table. The method also includes merging data of the first table with data of the second table when the first table is full to produce a new version of the second table that replaces a previous version. | 06-21-2012 |
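The two-table lookup, promotion, and merge behavior described in 20120158674 could be sketched as below. A minimal illustration with hypothetical names; in practice the first table would sit in fast storage and the second in slower, larger storage.

```python
class TwoLevelIndex:
    """Small first table consulted first; misses fall back to the second
    table, hits there are copied up, and a full first table is merged into
    a new version of the second table."""
    def __init__(self, capacity):
        self.first, self.second, self.capacity = {}, {}, capacity

    def apply(self, key, op):
        if key not in self.first:
            if key in self.second:
                self.first[key] = self.second[key]   # copy the entry up on a hit
            else:
                self.first[key] = 0                  # insert a fresh entry
        self.first[key] = op(self.first[key])        # apply the operation in the first table
        if len(self.first) >= self.capacity:
            # Merge replaces the previous version of the second table.
            self.second = {**self.second, **self.first}
            self.first = {}

idx = TwoLevelIndex(capacity=2)
bump = lambda v: v + 1
idx.apply("a", bump)
idx.apply("b", bump)   # first table is now full and merges into the second
idx.apply("a", bump)   # found in the second table, copied back to the first
```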
20120166399 | AUTOMATED ROLE CLEAN-UP - Various embodiments of systems and methods for automated role clean-up are described herein. In various embodiments, an automated role clean-up agent can connect to a role repository system that may be configured to implement an automated role clean-up workflow. A method of an embodiment ensures that roles that are unused or outdated are safe to delete. One or more deletion buffers may be configured to determine whether roles need to be deleted from the role repository system. Assigning conditions to a deletion buffer allows roles to be incubated in these deletion buffers for a desired period of time before deletion if the conditions are met. A re-affirmation can be sent out to role owners for deletion approval before roles are deleted. Deletion of the roles is performed by the role repository system. | 06-28-2012 |
20120166400 | TECHNIQUES FOR PROCESSING OPERATIONS ON COLUMN PARTITIONS IN A DATABASE - Techniques for processing operations on column partitions of a table in a database are provided. A table includes a control column partition. Each delete container of the control column partition represents multiple rows in the table (or a row partition, if any), and each row is represented by a bit flag within a bit string. Rows of the table set for deletion have their corresponding bits within a particular delete container set to indicate those rows are deleted. | 06-28-2012 |
20120166401 | Using Index Partitioning and Reconciliation for Data Deduplication - The subject disclosure is directed towards a data deduplication technology in which a hash index service's index is partitioned into subspace indexes, with less than the entire hash index service's index cached to save memory. The subspace index is accessed to determine whether a data chunk already exists or needs to be indexed and stored. The index may be divided into subspaces based on criteria associated with the data to index, such as file type, data type, time of last usage, and so on. Also described is subspace reconciliation, in which duplicate entries in subspaces are detected so as to remove entries and chunks from the deduplication system. Subspace reconciliation may be performed at off-peak time, when more system resources are available, and may be interrupted if resources are needed. Subspaces to reconcile may be based on similarity, including via similarity of signatures that each compactly represents the subspace's hashes. | 06-28-2012 |
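The subspace partitioning and reconciliation described in 20120166401 could be sketched as below, using file type as the partitioning criterion. This is an illustrative reading only; function names and the choice of SHA-256 are assumptions, not details from the patent.

```python
import hashlib
import os

subspaces = {}   # criterion value (file extension) -> {chunk hash: chunk}

def dedup_store(filename, chunk):
    """Index a chunk in the subspace selected by file type, so only that
    subspace needs to be cached; returns True if the chunk was new there."""
    space = subspaces.setdefault(os.path.splitext(filename)[1], {})
    digest = hashlib.sha256(chunk).hexdigest()
    if digest in space:
        return False           # duplicate within this subspace
    space[digest] = chunk
    return True

def reconcile(keep, scan):
    """Off-peak reconciliation: remove entries of one subspace that
    duplicate entries already present in another; returns the count removed."""
    dups = set(subspaces.get(keep, {})) & set(subspaces.get(scan, {}))
    for digest in dups:
        del subspaces[scan][digest]
    return len(dups)

new1 = dedup_store("a.txt", b"shared chunk")
new2 = dedup_store("b.txt", b"shared chunk")   # same subspace: seen already
new3 = dedup_store("c.log", b"shared chunk")   # other subspace: not seen there
removed = reconcile(".txt", ".log")            # cross-subspace duplicate found
```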
20120166402 | TECHNIQUES FOR EXTENDING HORIZONTAL PARTITIONING TO COLUMN PARTITIONING - Techniques for extending horizontal partitioning to column partitioning are provided. A database table is partitioned into custom groups of rows and custom groups of columns. Each partitioned column is managed as a series of containers representing all values appearing under the partitioned column. A logical row represents a row of the table logically indicating each column value of a row. Compression, deletion, and insertion within the containers are managed via a control header maintained with each container. | 06-28-2012 |
20120166403 | DISTRIBUTED STORAGE SYSTEM HAVING CONTENT-BASED DEDUPLICATION FUNCTION AND OBJECT STORING METHOD - A distributed storage system having a content-based deduplication function, and a corresponding object storing method, are provided. The distributed storage system may include a plurality of data nodes and a server coupled with the plurality of data nodes. Each one of the plurality of data nodes may be configured to store at least one object. The server may be configured to perform a deduplication function based on a content-specific index of a target object and content-specific indexes of objects stored in the plurality of data nodes in response to an object storage request from a client, and configured to store the target object in one of the plurality of data nodes based on a result of the deduplication function performed by the server. | 06-28-2012 |
20120179658 | Cleansing a Database System to Improve Data Quality - According to one embodiment of the present invention, a system controls cleansing of data within a database system, and comprises a computer system including at least one processor. The system receives a data set from the database system, and one or more features of the data set are selected for determining values for one or more characteristics of the selected features. The determined values are applied to a data quality estimation model to determine data quality estimates for the data set. Problematic data within the data set are identified based on the data quality estimates, where the cleansing is adjusted to accommodate the identified problematic data. Embodiments of the present invention further include a method and computer program product for controlling cleansing of data within a database system in substantially the same manner described above. | 07-12-2012 |
20120185446 | SEARCH CLUSTERING - In one example embodiment, a method is illustrated as including retrieving item data from a plurality of listings, the item data filtered from noise data, constructing at least one base cluster having at least one document with common item data stored in a suffix ordering, compacting the at least one base cluster to create a compacted cluster representation having a reduced duplicate suffix ordering amongst the clusters, and merging the compacted cluster representation to generate a merged cluster, the merging based upon a first overlap value applied to the at least one document with common item data. | 07-19-2012 |
20120191667 | SYSTEM AND METHOD OF STORAGE OPTIMIZATION - A method and system are disclosed for storage optimization. Data parts and metadata within a source data unit are identified, and the data parts are compared with data which is already stored in the physical storage space. If identical data parts are found within the physical storage, the data parts from the source data unit are linked to the identified data and can themselves be discarded, thereby reducing the required storage capacity. The metadata parts can be separately stored in a designated storage area. | 07-26-2012 |
20120191668 | Manipulating the Actual or Effective Window Size in a Data-Dependent Variable-Length Sub-Block Parser - Example systems and methods concern a sub-block parser that is configured with a variable sized window whose size varies as a function of the actual or expected entropy of data to be parsed by the sub-block parser. Example systems and methods also concern a sub-block parser configured to compress a data sequence to be parsed before parsing the data sequence. One example method facilitates either actually changing the window size or effectively changing the window size by manipulating the data before it is parsed. The example method includes selectively reconfiguring a data set to be parsed by a data-dependent parser based, at least in part, on the entropy level of the data set, selectively reconfiguring the data-dependent parser, based, at least in part, on the entropy level of the data set, and parsing the data set. | 07-26-2012 |
20120191669 | Detection and Deduplication of Backup Sets Exhibiting Poor Locality - Described are computer-based methods and apparatuses, including computer program products, for detection and deduplication of backup sets exhibiting poor locality. A first set of summaries of a first data set are determined, each summary of the first set of summaries being indicative of a data pattern in the first data set. A second set of summaries of a second data set are determined, each summary of the second set of summaries being indicative of a data pattern in the second data set. A set of comparison metrics are calculated, each comparison metric being based on a first subset of summaries from the first set of summaries and a second subset of summaries from the second set of summaries. A locality metric is calculated based on the set of comparison metrics indicative of whether the first data set and second data set exhibit poor locality. | 07-26-2012 |
20120191670 | Dynamic Deduplication - Described are computer-based methods and apparatuses, including computer program products, for dynamic deduplication. Data is processed using an algorithm that deduplicates the data based on a first set of parameters. A first moving average of a first deduplication performance metric is calculated for the algorithm over a time period. A second moving average of a second deduplication performance metric is calculated for the algorithm over the time period. It is determined that the first moving average satisfies a first criterion, the second moving average satisfies a second criterion, or both. The algorithm is reconfigured based on a second set of parameters to deduplicate data. | 07-26-2012 |
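The feedback loop in 20120191670 (moving averages of two deduplication performance metrics, with reconfiguration when a criterion is violated) could be sketched as below. The metrics, thresholds, and parameter-set labels here are hypothetical stand-ins.

```python
from collections import deque

class DedupTuner:
    """Keep moving averages of two performance metrics; when either average
    violates its criterion, switch to the second parameter set."""
    def __init__(self, window, ratio_floor=0.5, latency_ceiling=10.0):
        self.ratios = deque(maxlen=window)      # deduplication ratio samples
        self.latencies = deque(maxlen=window)   # latency samples (ms)
        self.ratio_floor = ratio_floor
        self.latency_ceiling = latency_ceiling
        self.params = "first parameter set"

    def record(self, dedup_ratio, latency_ms):
        self.ratios.append(dedup_ratio)
        self.latencies.append(latency_ms)
        if (sum(self.ratios) / len(self.ratios) < self.ratio_floor
                or sum(self.latencies) / len(self.latencies) > self.latency_ceiling):
            self.params = "second parameter set"   # reconfigure the algorithm
        return self.params

tuner = DedupTuner(window=2)
p1 = tuner.record(0.9, 5.0)   # both averages healthy
p2 = tuner.record(0.2, 5.0)   # ratio average (0.9+0.2)/2 still above the floor
p3 = tuner.record(0.1, 5.0)   # ratio average (0.2+0.1)/2 falls below the floor
```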
20120191671 | COMPUTER SYSTEM AND DATA DE-DUPLICATION METHOD - A computer system and data de-duplication method capable of performing efficient data de-duplication are suggested. | 07-26-2012 |
20120191672 | DICTIONARY FOR DATA DEDUPLICATION - Mechanisms are provided for efficiently improving a dictionary used for data deduplication. Dictionaries are used to hold hash key and location pairs for deduplicated data. Strong hash keys prevent collisions but weak hash keys are more computation and storage efficient. Mechanisms are provided to use both a weak hash key and a strong hash key. Weak hash keys and corresponding location pairs are stored in an improved dictionary while strong hash keys are maintained with the deduplicated data itself. The need for having uniqueness from a strong hash function is balanced with the deduplication dictionary space savings from a weak hash function. | 07-26-2012 |
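The weak-key dictionary with strong-key confirmation described in 20120191672 could be sketched as below. A minimal illustration under assumed choices (CRC32 as the weak hash, SHA-256 as the strong hash): the compact dictionary is probed with the weak key, and the strong hash stored alongside the data rules out weak-key collisions.

```python
import hashlib
import zlib

dictionary = {}   # weak CRC32 key -> storage location (compact index)
store = {}        # location -> (strong SHA-256 digest, data)

def dedup_write(data, location):
    """Probe the dictionary with a cheap weak key, then confirm against the
    strong hash kept with the stored data; returns the location holding the
    single stored copy of the data."""
    weak = zlib.crc32(data)
    strong = hashlib.sha256(data).digest()
    if weak in dictionary:
        candidate = dictionary[weak]
        if store[candidate][0] == strong:   # verified duplicate
            return candidate
    dictionary[weak] = location
    store[location] = (strong, data)
    return location

loc_a = dedup_write(b"payload", 0)
loc_b = dedup_write(b"payload", 1)     # duplicate resolves to location 0
loc_c = dedup_write(b"different", 2)
```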
20120191673 | COUPLING A USER FILE NAME WITH A PHYSICAL DATA FILE STORED IN A STORAGE DELIVERY NETWORK - A method of coupling a user file name to a physical data file stored within a storage delivery network, includes: assigning a logical file identification value (LFID) to a data file stored in one or more storage nodes and storing the LFID in a computer readable memory; storing in the computer readable memory a node identification value (Node ID) indicative of where the data file is stored among a plurality of geographically distributed storage nodes and associating the Node ID with the LFID; and storing in the computer readable memory a file name for the data file created by a user and associating the file name with the LFID, wherein the LFID correlates the file name with the Node ID transparently to the user and allows the user to access the data file using just the file name. | 07-26-2012 |
20120191674 | Dynamic Monitoring of Ability to Reassemble Streaming Data Across Multiple Channels Based on History - Mechanisms are provided for processing streaming data at high sustained data rates. These mechanisms receive a plurality of data elements over a plurality of non-sequential communication channels and write the plurality of data elements directly to the file system of the data processing system in an unassembled manner. The mechanisms determine whether to perform a data scrubbing operation or not based on history information indicative of whether data elements in the plurality of data elements are being received in a substantially sequential manner. The mechanisms perform a data scrubbing operation, in response to a determination to perform data scrubbing, to identify any missing data elements in the plurality of data elements written to the file system, and assemble the plurality of data elements into a plurality of data streams in response to results of the data scrubbing indicating that there are no missing data elements. | 07-26-2012 |
20120191675 | DEVICE AND METHOD FOR ELIMINATING FILE DUPLICATION IN A DISTRIBUTED STORAGE SYSTEM - The present invention relates to an apparatus and method for eliminating duplication of a file in a distributed storage system. The apparatus and method for eliminating duplication of a file in a distributed storage system according to the present invention calculates a hash value of each chunk for an active file; calculates a secondary hash value by adding the hash values calculated for respective chunks; examines duplication of the file using the hash value of each chunk and the secondary hash value; and eliminates a duplicated file depending on a result of the examination. | 07-26-2012 |
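The two-level check in 20120191675 (a hash per chunk plus a secondary value obtained by adding the chunk hashes, both used to examine duplication) could be sketched as below. Function names, the tiny chunk size, and the use of SHA-256 are assumptions for illustration.

```python
import hashlib

def file_fingerprint(data, chunk_size=4):
    """Hash every chunk of the file, then derive a secondary value by
    adding the chunk hash values together."""
    chunk_hashes = [
        int.from_bytes(hashlib.sha256(data[i:i + chunk_size]).digest(), "big")
        for i in range(0, len(data), chunk_size)
    ]
    secondary = sum(chunk_hashes)
    return chunk_hashes, secondary

def is_duplicate(a, b):
    """Files match only when the cheap secondary value and the per-chunk
    hash values all agree."""
    hashes_a, secondary_a = file_fingerprint(a)
    hashes_b, secondary_b = file_fingerprint(b)
    return secondary_a == secondary_b and hashes_a == hashes_b
```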
20120197851 | CONSIDERING MULTIPLE LOOKUPS IN BLOOM FILTER DECISION MAKING - Example apparatus, methods, and computers are configured to consider multiple lookups when making decisions concerning whether a probabilistic data structure indicates that an item is or is not present. One example method includes receiving a first response from a probabilistic data structure, where the first response indicates whether a first element is a member of a set of stored elements. The example method also includes receiving a set of second responses from the probabilistic data structure, where the set of second responses indicate whether members of a corresponding set of second elements are members of the set of stored elements. The method then provides a present/absent signal concerning whether the first element is a member of the set of stored elements. The signal is computed as a function of the first response and the set of second responses rather than merely as a function of the first response. | 08-02-2012 |
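The multi-lookup decision in 20120197851 (a present/absent signal computed from the first response together with responses for a set of second elements) could be sketched as below, with a plain Bloom filter standing in for the probabilistic data structure. The quorum rule and all names are hypothetical.

```python
import hashlib

class BloomFilter:
    """A plain Bloom filter: k positions per item, membership is the AND of
    the bits at those positions (false positives are possible)."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits)

    def _positions(self, item):
        for i in range(self.hashes):
            digest = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item):
        for pos in self._positions(item):
            self.array[pos] = 1

    def query(self, item):
        return all(self.array[pos] for pos in self._positions(item))

def likely_present(bloom, item, neighbors, quorum=1):
    """Temper a positive first response: report the item present only if at
    least `quorum` of its neighboring elements also look present."""
    if not bloom.query(item):
        return False
    return sum(bloom.query(n) for n in neighbors) >= quorum

bloom = BloomFilter()
bloom.add(b"block-41")
bloom.add(b"block-42")
hit = likely_present(bloom, b"block-42", [b"block-41"])
miss = likely_present(BloomFilter(), b"block-42", [b"block-41"])  # empty filter
```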
20120197852 | Aggregating Sensor Data - In particular embodiments, a method includes accessing sensor data from sensor nodes in a sensor network and aggregating the sensor data for communication to an indexer in the sensor network. The aggregation of the sensor data includes deduplicating the sensor data; validating the sensor data; formatting the sensor data; generating metadata for the sensor data; and time-stamping the sensor data. The metadata identifies one or more pre-determined attributes of the sensor data. The method also includes communicating the aggregated sensor data to the indexer in the sensor network. The indexer is configured to index the aggregated sensor data according to a multi-dimensional array for querying of the aggregated sensor data along with other aggregated sensor data. One or more first ones of the dimensions of the multi-dimensional array include time, and one or more second ones of the dimensions of the multi-dimensional array include one or more of the pre-determined sensor-data attributes. | 08-02-2012 |
20120197853 | SYSTEM AND METHOD FOR SAMPLING BASED ELIMINATION OF DUPLICATE DATA - A technique for eliminating duplicate data is provided. Upon receipt of a new data set, one or more anchor points are identified within the data set. A bit-by-bit data comparison is then performed of the region surrounding the anchor point in the received data set with the region surrounding an anchor point stored within a pattern database to identify forward/backward delta values. The duplicate data identified by the anchor point, forward and backward delta values is then replaced in the received data set with a storage indicator. | 08-02-2012 |
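The forward/backward delta computation around an anchor point, as described in the entry above, can be sketched like this. The byte-buffer arguments and pre-aligned anchor offsets are hypothetical; a real implementation would locate anchors (e.g., with a rolling hash) rather than receive them as parameters.

```python
def delta_match(new, anchor_new, stored, anchor_stored):
    """Starting from aligned anchor points, scan backward and forward
    byte by byte to measure the extent of the shared duplicate region."""
    back = 0
    while (anchor_new - back - 1 >= 0 and anchor_stored - back - 1 >= 0
           and new[anchor_new - back - 1] == stored[anchor_stored - back - 1]):
        back += 1
    fwd = 0
    while (anchor_new + fwd < len(new) and anchor_stored + fwd < len(stored)
           and new[anchor_new + fwd] == stored[anchor_stored + fwd]):
        fwd += 1
    # duplicate span in the new buffer is new[anchor_new-back : anchor_new+fwd]
    return back, fwd
```

The returned delta pair identifies the duplicate region that could then be replaced with a storage indicator.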
20120209820 | GARBAGE COLLECTION FOR MERGED COLLECTIONS - A method of identifying nonreferenced memory elements in a storage system is disclosed. A plurality of lists of referenced elements from a plurality of storage subsystems is input. A union of the lists of referenced elements is compiled. The union of the lists of referenced memory elements is compared to a list of previously referenced memory elements to determine previously referenced elements that are no longer referenced. The previously referenced elements that are no longer referenced are output. | 08-16-2012 |
20120215748 | DEDUPLICATED DATA PROCESSING RATE CONTROL - A plurality of workers is configured for parallel processing of deduplicated data entities in a plurality of chunks. The deduplicated data processing rate is regulated using a rate control mechanism. The rate control mechanism incorporates a debt/credit algorithm specifying which of the plurality of workers processing the deduplicated data entities must wait for each of a plurality of calculated required sleep times. The rate control mechanism is adapted to limit a data flow rate based on a penalty acquired during a last processing of one of the plurality of chunks in a retroactive manner, and further adapted to operate on at least one vector representation of at least one limit specification to accommodate a variety of available dimensions corresponding to the at least one limit specification. | 08-23-2012 |
20120221534 | DATABASE INDEX MANAGEMENT - Managing database indexes includes creating a main index and creating at least one service index that is configured for recording a change to a node to be updated in the main index. Managing database indexes also includes detecting whether an operation that involves the main index is performed on the database, and maintaining the main index using the at least one service index in response to such an operation being performed. The maintaining is performed based on changes to a node to be updated in the main index that are recorded in the at least one service index. | 08-30-2012 |
20120226670 | IMPLEMENTING CONTINUOUS CONTROL MONITORING FOR AUDIT PURPOSES USING A COMPLEX EVENT PROCESSING ENVIRONMENT - A method of providing True Continuous Control Monitoring (CCM) of business processes for audit purposes is provided herein. The method includes the following steps: consolidating data from multiple sources, in case the transactional data is located in more than one source, to a single self-contained and comprehensive source; identifying, in the single data source, data elements that are required for detection and reporting for each audit rule; translating and streaming, where required, the transaction data into events, so that every change in a transaction is immediately reflected and identifiable; eliminating duplicate events for the same single transaction; applying the event processing engine to the events, based on event audit patterns derived from audit rules, possibly entered by non-programmers; and generating alert data in audit-style notation, to be reported back to the system, based on alert notifications derived from the event processing engine. | 09-06-2012 |
20120233134 | Openstack file deletion - Several different embodiments of a massively scalable object storage system are described. The object storage system is particularly useful for storage in a cloud computing installation whereby shared servers provide resources, software, and data to computers and other devices on demand. In several embodiments, the object storage system includes a ring implementation used to associate object storage commands with particular physical servers such that certain guarantees of consistency, availability, and performance can be met. In other embodiments, the object storage system includes a synchronization protocol used to order operations across a distributed system. In a third set of embodiments, the object storage system includes a metadata management system. In a fourth set of embodiments, the object storage system uses a structured information synchronization system. Features from each set of embodiments can be used to improve the performance and scalability of a cloud computing object storage system. | 09-13-2012 |
20120233135 | SAMPLING BASED DATA DE-DUPLICATION - Example apparatus, methods, and computers perform sampling based data de-duplication. One example method controls a data de-duplication computer to compute a sampling sequence for a sub-block of data and to use the sampling sequence to locate a stored sub-block known to the data de-duplication computer. Upon finding a stored sub-block to compare to, the method includes controlling the data de-duplication computer to determine a degree of similarity (e.g., duplicate, very similar, somewhat similar, very dissimilar, completely dissimilar, x % similar) between the sub-block and the stored sub-block and to control whether and how the sub-block is stored and/or transmitted based on the degree of similarity. The degree of similarity can also control whether and how the data de-duplication computer updates a dedupe data structure(s) that stores information for finding groups of similarity sampling sequence related sub-blocks. | 09-13-2012 |
20120233136 | DELETING RELATIONS BETWEEN SOURCES AND SPACE-EFFICIENT TARGETS IN MULTI-TARGET ARCHITECTURES - A method for deleting a relation between a source and a target in a multi-target architecture is described. The multi-target architecture includes a source and multiple space-efficient (SE) targets mapped thereto. In one embodiment, such a method includes initially identifying a relation for deletion from the multi-target architecture. A space-efficient (SE) target associated with the relation is then identified. A mapping structure maps data in logical tracks of the SE target to physical tracks of a repository. The method then identifies a sibling SE target that inherits data from the SE target. Once the SE target and the sibling SE target are identified, the method modifies the mapping structure to map the data in the physical tracks of the repository to the logical tracks of the sibling SE target. The relation is then deleted between the source and the SE target. | 09-13-2012 |
20120239630 | FILE REPAIR - Example methods and apparatus concern file repair. One example method includes storing a file in a file store and also parsing the file into a set of constituent data blocks. The method includes selectively storing, in a data store, unique data blocks from the set of constituent data blocks. The method includes maintaining, in a combination of the file store and the data store, a threshold number of copies of data blocks. The method also includes maintaining a data structure that stores data for locating the file in the file store and that stores data for recreating the file from data blocks. The method also includes maintaining a data structure that stores data for locating multiple copies of data found in members of the set of constituent data blocks. Files can be repaired using data blocks parsed from stored files or using data blocks stored as data blocks. | 09-20-2012 |
20120239631 | DISK SCRUBBING - A method, a system, and a computer-readable storage medium are provided for data management. The method may comprise identifying a predefined set of data storage parameters and a predefined set of data scrubbing parameters. The method further may comprise determining the predefined set of data scrubbing parameters for first data in a first data storage based on the predefined set of data storage parameters for the first data and performing data scrubbing for the first data using the determined predefined set of data scrubbing parameters. Furthermore, the method may comprise comparing first data in the first data storage and corresponding second data in a second data storage. Upon determining a mismatch between the first data and the second data, the method further may comprise synchronizing the first data with the second data as a result of the comparison. | 09-20-2012 |
20120239632 | METHODS FOR SECURE MULTI-ENTERPRISE STORAGE - A method in one embodiment includes receiving a data identifier (ID) associated with each of a plurality of files from multiple data providers; storing the data ID associated with each of the plurality of files to a database; identifying any duplicate data IDs in the database to determine if any of the plurality of files associated with the data IDs are non-confidential; querying one of the data providers which provided the file having the duplicate data ID to determine whether or not to store the file to the storage network; receiving a response from the data provider indicating whether or not to store the file having the duplicate data ID to the storage network; receiving one of the files having a duplicate data ID from the data provider; storing the file having the duplicate data ID to the storage network; and causing deletion of the file having the duplicate data ID that is stored to the storage network. | 09-20-2012 |
20120239633 | DYNAMIC REWRITE OF FILES WITHIN DEDUPLICATION SYSTEM - A dynamic layer above a sequential deduplication file system (denoted as DFS) implements the rewrite functionality. A user file is composed of one or more DFS files. As incoming data is written into a user file, the data is written by the dynamic layer sequentially into DFS files, created one by one. For each user file this dynamic layer creates and maintains a dynamic metadata file, in a regular, non-deduplicated file system. This metadata file contains entries pointing to sections of DFS files. | 09-20-2012 |
20120246124 | EXTERNALIZED DATA VALIDATION ENGINE - A method and system of externalized data validation. Data input to applications is received. Metadata specifying types of the received data is received. Methods to cleanse the received data are determined based on the metadata. Based on the determined methods and received metadata, a validation engine external to the applications cleanses and validates the received data. The validated data is sent to the applications for use by the applications. Via a subscription service and without requiring updates to the applications, a service provider provides dynamic updates of the validation engine to mitigate newly identified events associated with input to the applications. | 09-27-2012 |
20120246125 | DUPLICATE FILE DETECTION DEVICE, DUPLICATE FILE DETECTION METHOD, AND COMPUTER-READABLE STORAGE MEDIUM - A duplicate file detection device | 09-27-2012 |
20120246126 | Policy-based management of a redundant array of independent nodes - An archive cluster application runs across a redundant array of independent nodes. Each node runs an archive cluster application instance comprising a set of software processes: a request manager, a storage manager, a metadata manager, and a policy manager. The request manager manages requests for data, the storage manager manages data read/write functions, and the metadata manager facilitates metadata transactions and recovery. The policy manager implements policies, which are operations that determine the behavior of an “archive object” within the cluster. The archive cluster application provides object-based storage. It associates metadata and policies with the raw archived data, which together comprise an archive object. Object policies govern the object's behavior in the archive. The archive manages itself independently of client applications, acting automatically to ensure that object policies are valid. | 09-27-2012 |
20120246127 | VIRTUALIZATION OF METADATA FOR FILE OPTIMIZATION - Mechanisms are provided for optimizing files while allowing application servers access to metadata associated with preoptimized versions of the files. During file optimization involving compression and/or compaction, file metadata changes. In order to allow file optimization in a manner transparent to application servers, the metadata associated with preoptimized versions of the files is maintained in a metadata database as well as in an optimized version of the files themselves. | 09-27-2012 |
20120254131 | VIRTUAL MACHINE IMAGE CO-MIGRATION - Embodiments of the invention relate to co-migration in a shared pool of resources with similarity across data sets of a migrating application. The data sets are processed and profiled. Metadata is reviewed to remove duplicate elements and to distribute the processing load across available nodes. At the same time, a ranking may be assigned to select metadata to support a prioritized migration. Non-duplicate data chunks are migrated across the shared pool of resources responsive to the assigned prioritization. | 10-04-2012 |
20120254132 | Enhanced Contact Information - A method and an apparatus for organizing information in an electronic address book. The method comprises collecting contact information for an electronic address book, comparing a name from any field in said contact information to a database comprising name information, identifying a first name or a surname from the contact information, and relocating, in the contact information, the identified first name to a field assigned to first names or the surname to a field assigned to surnames in response to a name being identified in the wrong field. | 10-04-2012 |
20120259821 | MAINTAINING CACHES OF OBJECT LOCATION INFORMATION IN GATEWAY COMPUTING DEVICES USING MULTICAST MESSAGES - Example embodiments relate to maintaining caches of object location information in gateway computing devices using multicast messages. In example embodiments, upon updating a cache with an object identifier and a corresponding object location, a gateway computing device may transmit an update message using a simple multicast transport protocol to a plurality of other gateway computing devices. In contrast, upon deleting an object from a cache, the gateway computing device may transmit a delete message using a reliable multicast transport protocol to the plurality of other gateway computing devices. | 10-11-2012 |
20120265736 | SYSTEMS AND METHODS FOR IDENTIFYING SETS OF SIMILAR PRODUCTS - Embodiments of the present invention relate to systems and methods for determining sets of products which are similar to each other in terms of consumers' wants and needs. Queries are performed on a particular product. Documents relating to the query are received and stored. A dictionary is created from the received documents, whereby the documents, which are text files, are scrubbed of certain data to create a scrubbed text file. Topic modeling is then performed on the scrubbed text file. Various methods can be used to perform topic modeling, including, but not limited to, latent semantic analysis, nonnegative matrix factorization, and singular value decomposition. | 10-18-2012 |
20120296880 | Method and System for Building and Using a Centralized and Harmonized Relational Database - A method for building and maintaining centralized and harmonized relational database for acquiring, managing, filtering, integrating and accurately analyzing peptide and protein data based on functional class is described. In addition, a computer-based system comprising the above database and analysis tools for mining and analyzing the protein/peptide data stored in the database is provided. The database is built using curated and validated protein specific data and does not rely on probabilistic or predictive approaches to derive protein information indirectly from genomic or gene-expression data. | 11-22-2012 |
20120303594 | Multiple Node/Virtual Input/Output (I/O) Server (VIOS) Failure Recovery in Clustered Partition Mobility - A method, system, and computer program product utilizes cluster-awareness to effectively support a live partition mobility (LPM) event and provide recovery from node failure within a Virtual Input/Output (I/O) Server (VIOS) cluster. An LPM utility creates a monitoring thread on a first VIOS on initiation of a corresponding LPM event. The monitoring thread tracks a status of an LPM and records status information in the mobility table of a database. The LPM utility creates other monitoring threads on other VIOSes running on the (same) source server. If the first VIOS sustains one of multiple failures, the LPM utility provides notification to other functioning nodes/VIOSes. The LPM utility enables a functioning monitoring thread to update the LPM status. In particular, a last monitoring thread may perform cleanup/update operations within the database based on an indication that there are nodes on the first server that are in failed state. | 11-29-2012 |
20120303595 | DATA RESTORATION METHOD FOR DATA DE-DUPLICATION - A data restoration method for data de-duplication is used to restore partial data of a target file of a client. The method includes: the client queries a file attribute of a source file corresponding to the target file from a storage server; the client compares whether the file attribute of the target file is the same as the file attribute of the source file; if the file attributes of the target file and the source file are different, segmentation processing is performed on the target file to generate segmentation data blocks and corresponding fingerprints; after obtaining all the fingerprints of the source file from the storage server, the client compares the difference between the fingerprints of the source file and the target file; and the client obtains the corresponding segmentation data blocks from the storage server according to the differing fingerprints and overwrites the obtained segmentation data blocks to the corresponding positions in the target file. | 11-29-2012 |
20120310901 | System and Method for Electronically Storing Essential Data - A method for storing electronic data blocks at a storage facility uses a public database and a select database. Hash for each data block is evaluated at the facility to determine whether the data block is already stored at the facility. New data blocks are assigned a new address in the select database when encrypted with a customer key. Otherwise, they are assigned a new address in the public database by default. Duplicate data blocks are assigned a previously established address for the data block in either the public or select database. All addresses are then sent to the customer location for file integrity and only the content of new data blocks need to be sent to the storage facility (i.e. no need for duplicates). | 12-06-2012 |
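The hash-then-route flow in the entry above can be sketched as a small content-addressed store. The in-memory dictionaries, the `pub-N`/`sel-N` address scheme, and the `customer_encrypted` flag are hypothetical stand-ins for the facility's public and select databases.

```python
import hashlib

class BlockStore:
    """Illustrative sketch: route each data block by content hash. A known
    hash reuses its previously established address (no duplicate transfer);
    a new block is assigned an address in the 'select' database when it is
    customer-key encrypted, else in the 'public' database by default."""
    def __init__(self):
        self.public, self.select = {}, {}   # address -> block content
        self.index = {}                     # content hash -> address

    def store(self, block: bytes, customer_encrypted: bool = False) -> str:
        digest = hashlib.sha256(block).hexdigest()
        if digest in self.index:            # duplicate: only the address is needed
            return self.index[digest]
        db, prefix = (self.select, "sel") if customer_encrypted else (self.public, "pub")
        addr = f"{prefix}-{len(db)}"
        db[addr] = block
        self.index[digest] = addr
        return addr
```

Storing the same block twice returns the same address, so only new block content ever has to be sent to the facility.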
20120310902 | PHRASE-BASED DETECTION OF DUPLICATE DOCUMENTS IN AN INFORMATION RETRIEVAL SYSTEM - An information retrieval system uses phrases to index, retrieve, organize and describe documents. Phrases are identified that predict the presence of other phrases in documents. Documents are the indexed according to their included phrases. Related phrases and phrase extensions are also identified. Phrases in a query are identified and used to retrieve and rank documents. Phrases are also used to cluster documents in the search results, create document descriptions, and eliminate duplicate documents from the search results, and from the index. | 12-06-2012 |
20120317081 | DETECTING WASTEFUL DATA COLLECTION - A method and system comprise a duplication identifier module that analyzes data input information to automatically identify duplicate expected inputs associated with a process. The system includes logical process model information defining a logically structured series of process activities and data input information representing a plurality of expected inputs associated with respective process activities, with each expected input being indicative of expected collection of a corresponding data element during execution of the associated process activity. Each duplicate expected input comprises one of the plurality of expected inputs for which there is at least one other expected input with respect to a common corresponding data element. | 12-13-2012 |
20120317082 | QUERY-BASED INFORMATION HOLD - Systems and methods for implementing a query-based hold on electronic items hosted by a communication device and/or system. Electronic items from a plurality of user-specific folders are purged and copied to a discovery hold folder. The purged items, along with all existing items, contained within the discovery hold folder are evaluated against the query-based hold criteria. Items that fail to meet the query-based hold criteria are permanently deleted from the discovery hold folder. Items that meet the query-based hold criteria are maintained within discovery hold folder. | 12-13-2012 |
20120317083 | SYSTEM AND METHOD FOR DELETION OF DATA IN A REMOTE COMPUTING PLATFORM - Embodiments of a system and method to perform a secure deletion of a set of data from a remote cloud computing system are described. As described, in some embodiments, a user of a cloud computing service that provides data storage may securely delete his stored set of data by acquiring elevated access privileges to the stored set of data, designating at least one most significant bit in at least one data block therein as a sentinel and recording its value and position, updating the value of the sentinels and thereby rendering the data unusable, and verifying the success of the operation by checking the new value of the sentinels against the original value. In some embodiments, the verification process may be repeated in order to ensure that the data has been rendered useless across all nodes of the remote cloud platform. | 12-13-2012 |
20120317084 | METHOD AND SYSTEM FOR ACHIEVING DATA DE-DUPLICATION ON A BLOCK-LEVEL STORAGE VIRTUALIZATION DEVICE - A method and system for achieving data de-duplication on a block-level storage virtualization device, belonging to the field of data storage technologies, are disclosed. The method comprises: deleting the duplicate data in the actual physical data corresponding to the specified virtual LBA address space to obtain the data extents after physical data is de-duplicated; establishing the correspondence between the virtual LBA address space and the data extents after the physical data is de-duplicated; according to the correspondence and metadata information of the data extents, obtaining the storage position information of the actual physical data corresponding to the virtual LBA address space pointed to by external data read and write requests to complete the I/O redirection. This invention also provides a system for achieving data de-duplication on a block-level storage virtualization device. This invention can delete duplicate data across hosts and storage devices to achieve a wider scope of data de-duplication. | 12-13-2012 |
20120323859 | HIERARCHICAL IDENTIFICATION AND MAPPING OF DUPLICATE DATA IN A STORAGE SYSTEM - The technique introduced here includes a system and method for identifying and mapping duplicate data objects referenced by data objects. The technique illustratively utilizes a hierarchical tree of fingerprints for each data object to compare the data objects and identify duplicate data blocks referenced by the data objects. A progressive comparison of the hierarchical trees starts from a top layer of the hierarchical trees and proceeds toward a base layer. Between the compared data objects (i.e., the compared hierarchical trees), the technique maps matching fingerprints only at the top-most layer of the hierarchical trees at which the fingerprints match. Lower layer matching fingerprints are neither compared nor mapped. Data blocks corresponding to the matching fingerprints are then deleted. Such an identification and mapping technique substantially reduces the amount of mapping metadata stored in data objects that have been subject to deduplication. | 12-20-2012 |
20120323860 | OBJECT-LEVEL IDENTIFICATION OF DUPLICATE DATA IN A STORAGE SYSTEM - The technique introduced here includes a system and method for identification of duplicate data directly at a data-object level. The technique illustratively utilizes a hierarchical tree of fingerprints for each data object to compare data objects and identify duplicate data blocks referenced by the data objects. The hierarchical fingerprint trees are constructed in such a manner that a top-level fingerprint (or object-level fingerprint) of the hierarchical tree is representative of all data blocks referenced by a storage system. In embodiments, inline techniques are utilized to generate hierarchical fingerprints for new data objects as they are created, and an object-level fingerprint of the new data object is compared against preexisting object-level fingerprints in the storage system to identify exact or close matches. While exact matches result in complete deduplication of data blocks referenced by the data object, hierarchical comparison methods are used for identifying and mapping duplicate data blocks referenced by closely related data objects. | 12-20-2012 |
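The hierarchical fingerprint comparison in the two entries above can be approximated with a Merkle-style tree. The pairwise-hash construction and the leaf-level fallback below are assumptions for illustration, not the patents' exact layering or mapping scheme.

```python
import hashlib

def build_fingerprint_tree(blocks):
    """Leaves are per-block fingerprints; each parent is the hash of its
    children's concatenated fingerprints, up to a single top-level
    (object-level) fingerprint."""
    level = [hashlib.sha256(b).hexdigest() for b in blocks]
    tree = [level]
    while len(level) > 1:
        level = [hashlib.sha256("".join(level[i:i + 2]).encode()).hexdigest()
                 for i in range(0, len(level), 2)]
        tree.append(level)
    return tree  # tree[-1][0] is the object-level fingerprint

def duplicate_blocks(tree_a, tree_b):
    """Top-down comparison: if the object-level fingerprints match, every
    block is a duplicate and no lower layer needs inspection; otherwise
    fall back to comparing leaf fingerprints to map shared blocks."""
    if tree_a[-1][0] == tree_b[-1][0]:
        return list(range(len(tree_a[0])))
    leaves_b = set(tree_b[0])
    return [i for i, fp in enumerate(tree_a[0]) if fp in leaves_b]
```

An exact object-level match short-circuits the comparison, which is what keeps the mapping metadata small for fully duplicate objects.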
20120323861 | DEDUPLICATED CACHING OF QUERIES FOR GREEN IT MANAGEMENT - Exemplary methods, computer systems, and computer program products for smarter deduplicated caching of queries for green IT management in a computer storage device are provided. In one embodiment, the computer environment is configured for ascertaining the most-used data access chains. Multiple access paths to identical data are determined for the most-used data access chains. A generalized chain that is formed from the plurality of access paths to the identical data is determined. Multiple keys and information relating to the access paths to the identical data are deduplicated. | 12-20-2012 |
20120323862 | Identifying Duplicate Messages in a Database - A system for storing data in a memory comprises a memory operable to store a database, wherein the database comprises an array, and the array comprises a number of elements uniquely identifiable by their location in relation to an origin point of the array; an interface operable to receive first data to be stored in the array; and a processor communicatively coupled to the memory and the interface, the processor operable to convert the first data to a hash using a hash function, determine a selected number of character positions of the hash, and identify an array element according to the character values of the selected character positions of the hash. | 12-20-2012 |
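The hash-to-array-element mapping in the entry above can be sketched as follows. The choice of SHA-256, the three hex-digit positions, and the dictionary-backed array are illustrative assumptions.

```python
import hashlib

def array_slot(message: str, positions=(0, 1, 2), width=16):
    """Hash the message, take the character values at the selected hex-digit
    positions, and combine them into an element index for an array of
    width**len(positions) slots."""
    digest = hashlib.sha256(message.encode()).hexdigest()
    index = 0
    for pos in positions:
        index = index * width + int(digest[pos], 16)
    return index

def is_duplicate(array, message):
    """A message is flagged as a duplicate if its slot already holds it;
    otherwise the message is recorded in its slot."""
    slot = array_slot(message)
    if array.get(slot) == message:
        return True
    array[slot] = message
    return False
```

Storing the full message in the slot (rather than only the hash fragment) lets the check distinguish true duplicates from slot collisions.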
20120323863 | SEMANTIC REFLECTION STORAGE AND AUTOMATIC RECONCILIATION OF HIERARCHICAL MESSAGES - Database storage of hierarchically structured messages is facilitated based on structural semantic reflection of the message and automatic reconciliation of the messages. The structural semantics of an incoming message may be assessed and database storage provisioned based on the structural semantic reflection of the message. The system may auto-adapt over time as incoming messages from a known source change and automatically generate code which applies the sequential logic to a stream of messages in order to represent the latest state for a given context. Furthermore, the hierarchical semantics of messaging formats may be applied to a flexible set of database structures that represent the raw contents of the messages. | 12-20-2012 |
20120323864 | DISTRIBUTED DE-DUPLICATION SYSTEM AND PROCESSING METHOD THEREOF - A distributed de-duplication system and a processing method thereof are described. A client runs a de-duplication procedure on an input file to generate a partitioned data block and a corresponding fingerprint eigenvalue. The client sends an inquiry request having the fingerprint eigenvalue to a dispatch server. The dispatch server records a storage location of the partitioned data block. The dispatch server forwards the inquiry request to the corresponding dedup. engine according to the fingerprint eigenvalue. The dedup. engine judges whether the fingerprint eigenvalue already exists. If the fingerprint eigenvalue does not exist, the dedup. engine stores a new partitioned data block to a storage server according to a new fingerprint eigenvalue. | 12-20-2012 |
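The dispatch-by-fingerprint flow in the entry above can be sketched like this. The modulo routing rule, the shared storage dictionary, and the engine count are assumptions; the point is only that one fingerprint is always judged by the same engine.

```python
import hashlib

class DedupEngine:
    """Judges whether a fingerprint already exists; stores new blocks."""
    def __init__(self):
        self.known = set()

    def ingest(self, fingerprint, block, storage):
        if fingerprint in self.known:
            return False            # duplicate: nothing stored
        self.known.add(fingerprint)
        storage[fingerprint] = block
        return True

class DispatchServer:
    """Routes each fingerprint to one of N engines deterministically, so
    the same fingerprint is never split across engines."""
    def __init__(self, n_engines=4):
        self.engines = [DedupEngine() for _ in range(n_engines)]
        self.storage = {}           # stand-in for the storage server

    def submit(self, block: bytes):
        fp = hashlib.sha256(block).hexdigest()
        engine = self.engines[int(fp, 16) % len(self.engines)]
        return engine.ingest(fp, block, self.storage)
```

Because routing is a pure function of the fingerprint, duplicate partitioned blocks from any client converge on the engine that has already seen them.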
20120323865 | METHOD AND APPARATUS FOR DETERMINING WHETHER A PRIVATE DATA AREA IS SAFE TO PRESERVE - A system may configure a safety-tag that indicates whether a private data area is safe to preserve. During operation, the system receives a file with a private data area. Specifically, in one embodiment, the private data area is contained within an Exchangeable Image File (EXIF) MakerNote tag, which allows makers of EXIF writers to record any desired information. Next, the system determines whether the private data area is safe to preserve. If the private data area is safe to preserve, the system configures a safety-tag to indicate that the private data area is safe to preserve. Otherwise, if the private data area is not safe to preserve, the system configures the safety-tag to indicate that the private data is not safe to preserve. Specifically, in one embodiment, the safety-tag is a Digital Negative (DNG) MakerNoteSafety tag. | 12-20-2012 |
20120323866 | EFFICIENT DEVELOPMENT OF A RULE-BASED SYSTEM USING CROWD-SOURCING - Described herein are methods, systems, apparatuses and products for efficient development of a rule-based system. An aspect provides a method including accessing data records; converting said data records to an intermediate form; utilizing intermediate forms to compute similarity scores for said data records; and selecting as an example to be provided for rule making at least one record of said data records having a maximum dissimilarity score indicative of dissimilarity to already considered examples. | 12-20-2012 |
20120330903 | DEDUPLICATION IN AN EXTENT-BASED ARCHITECTURE - A request is received to remove duplicate data. A log data container associated with a storage volume in a storage server is accessed. The log data container includes a plurality of entries. Each entry is identified by an extent identifier in a data structure stored in a volume associated with the storage server. For each entry in the log data container, a determination is made if the entry matches another entry in the log data container. If the entry matches another entry in the log data container, a determination is made of a donor extent and a recipient extent. If an external reference count associated with the recipient extent equals a first predetermined value, block sharing is performed for the donor extent and the recipient extent. A determination is made if the reference count of the donor extent equals a second predetermined value. If the reference count of the donor extent equals the second predetermined value, the donor extent is freed. | 12-27-2012 |
20120330904 | EFFICIENT FILE SYSTEM OBJECT-BASED DEDUPLICATION - In accordance with one or more embodiments, an inode implemented file system may be utilized to support both offline and inline deduplication. When the first content is stored in the storage medium, one inode is used to associate a filename with the data blocks where the first content is stored. When a second content that is a duplicate of the first content is to be stored, then a parent inode is created to point to the data blocks in which a copy of the first content is stored. Further, two inodes are created, one representing the first content and the other representing the second content. Both inodes point to the same parent inode that points to the data blocks where the first content is stored. | 12-27-2012 |
20120330905 | METHOD FOR PRODUCING AND MANAGING A LARGE-VOLUME LONG - The present invention relates to a method for producing and managing a large-volume long-term archive which comprises an archive data memory and a management file, and to a corresponding long-term archive. The method according to the invention involves relocating archive data in a container file so that the legal validity of the data is maintained by virtue of qualified signing. | 12-27-2012 |
20120330906 | METHOD AND SYSTEMS FOR DETECTING DUPLICATE TRAVEL PATH - A system and method comprising: receiving itinerary data from at least two sources; identifying a traveler associated with the itinerary data; and adding information about the identified traveler to the itinerary data. | 12-27-2012 |
20120330907 | STORAGE SYSTEM FOR ELIMINATING DUPLICATED DATA - A storage system | 12-27-2012 |
20130013572 | OPTIMIZATION OF A COMPUTING ENVIRONMENT IN WHICH DATA MANAGEMENT OPERATIONS ARE PERFORMED - Described are embodiments of an invention for optimizing a computing environment that performs data management operations such as encryption, deduplication and compression. The computing environment includes data components and a management system. The data components operate on data during the lifecycle of the data. The management system identifies all the data components in a data path, how the data components are interconnected, the data management operations performed at each data component, and how many data management operations of each type are performed at each data component. Further, the management system builds a data structure to represent the flow of data through the data path and analyzes the data structure in view of policy. After the analysis, the management system provides recommendations to optimize the computing environment through the reconfiguration of the data management operation configuration and reconfigures the data management operation configuration to optimize the computing environment. | 01-10-2013 |
20130013573 | RETRIEVAL AND RECOVERY OF DATA CHUNKS FROM ALTERNATE DATA STORES IN A DEDUPLICATING SYSTEM - For recovery of data chunks from alternate data stores, a method detects a damaged copy of a first data chunk of a deduplicated data object within a first storage pool of a plurality of storage pools storing data chunks. The method further locates an undamaged copy of the first data chunk in an alternate storage pool within the plurality of storage pools from a system-wide deduplication index that indexes each data chunk in each storage pool. In addition, the method creates a new object holding the undamaged copy in the first storage pool, the new object linked to the damaged copy through the system-wide deduplication index. | 01-10-2013 |
20130018851 | INTELLIGENT DEDUPLICATION DATA PREFETCHING - Deduplication dictionaries are used to maintain data chunk identifier and location pairings in a deduplication system. When access to a particular data chunk is requested, a deduplication dictionary is accessed to determine the location of the data chunk and a datastore is accessed to retrieve the data chunk. However, deduplication dictionaries are large and typically maintained on disk, so dictionary access is expensive. Techniques and mechanisms of the present invention allow prefetches or read aheads of datastore (DS) headers. For example, if a dictionary hit results in datastore DS(X), then headers for DS(X+1), DS(X+2), ..., DS(X+read-ahead-window) are prefetched ahead of time. These datastore headers are cached in memory, and indexed by datastore identifier. Before going to the dictionary, a lookup is first performed in the cached headers to reduce deduplication data access request latency. | 01-17-2013 |
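The read-ahead idea in the abstract above can be sketched as follows; the `HeaderCache` class, `fetch_header` stand-in, and `read_ahead_window` default are illustrative assumptions, not details from the publication:

```python
# Sketch of datastore-header read-ahead for a deduplication dictionary.
# On a cache miss, the requested header plus the next few datastore headers
# are fetched together, so later lookups are served from memory.

class HeaderCache:
    """Caches datastore headers in memory, indexed by datastore id."""
    def __init__(self, fetch_header, read_ahead_window=2):
        self.fetch_header = fetch_header      # stand-in for an expensive on-disk read
        self.read_ahead_window = read_ahead_window
        self.cache = {}
        self.disk_reads = 0

    def get(self, ds_id):
        if ds_id in self.cache:               # cheap in-memory hit
            return self.cache[ds_id]
        # Miss: fetch the requested header plus the next few ahead of time.
        for i in range(ds_id, ds_id + 1 + self.read_ahead_window):
            self.cache[i] = self.fetch_header(i)
            self.disk_reads += 1
        return self.cache[ds_id]

def fetch_header(ds_id):
    return {"datastore": ds_id, "chunks": []}  # placeholder header contents

cache = HeaderCache(fetch_header, read_ahead_window=2)
cache.get(5)                              # miss: prefetches headers 5, 6, 7
assert cache.get(6)["datastore"] == 6     # served from cache, no extra disk read
assert cache.disk_reads == 3
```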
20130018852 | DELETED DATA RECOVERY IN DATA STORAGE SYSTEMS - In one embodiment, a system includes a data storage device for storing one or more storage volumes, logic adapted for associating an indicator with a data set on the one or more storage volumes, wherein the indicator is in a first state indicating that the data set is accessible, logic adapted for storing the indicator associated with the data set in a data set descriptor record, wherein the record is stored in at least one mapping of the one or more storage volumes, logic adapted for receiving a request to delete the data set, logic adapted for changing the indicator to a second state indicating that the data set is inaccessible in response to the request to delete the data set, with the proviso that the data set is unchanged, logic adapted for receiving a request to restore the deleted data set, and logic adapted for restoring the indicator from the second state to the first state in response to the request to restore the deleted data set. | 01-17-2013 |
20130018853 | ACCELERATED DEDUPLICATION - Mechanisms are provided for accelerated data deduplication. A data stream is received at an input interface and maintained in memory. Chunk boundaries are detected and chunk fingerprints are calculated using a deduplication accelerator while a processor maintains a state machine. A deduplication dictionary is accessed using a chunk fingerprint to determine if the associated data chunk has previously been written to persistent memory. If the data chunk has previously been written, reference counts may be updated but the data chunk need not be stored again. Otherwise, datastore suitcases, filemaps, and the deduplication dictionary may be updated to reflect storage of the data chunk. Direct memory access (DMA) addresses are provided to directly transfer a chunk to an output interface as needed. | 01-17-2013 |
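The dictionary-based write path described above (fingerprint each chunk; store new chunks, bump a reference count for duplicates) can be sketched as below. Fixed-size chunking and SHA-256 fingerprints are illustrative choices; the publication does not specify them:

```python
# Sketch: deduplicating write path keyed by a chunk-fingerprint dictionary.
import hashlib

CHUNK_SIZE = 4
dictionary = {}   # fingerprint -> {"data": bytes, "refs": int}

def write(stream: bytes) -> int:
    """Return how many bytes actually had to be stored."""
    stored = 0
    for off in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[off:off + CHUNK_SIZE]
        fp = hashlib.sha256(chunk).hexdigest()
        entry = dictionary.get(fp)
        if entry is None:
            dictionary[fp] = {"data": chunk, "refs": 1}
            stored += len(chunk)          # new chunk written to the datastore
        else:
            entry["refs"] += 1            # duplicate: only update the count
    return stored

assert write(b"aaaabbbb") == 8            # two new chunks stored
assert write(b"aaaacccc") == 4            # "aaaa" deduplicated, "cccc" stored
assert dictionary[hashlib.sha256(b"aaaa").hexdigest()]["refs"] == 2
```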
20130018854 | USE OF SIMILARITY HASH TO ROUTE DATA FOR IMPROVED DEDUPLICATION IN A STORAGE SERVER CLUSTER - A technique for routing data for improved deduplication in a storage server cluster includes computing, for each node in the cluster, a value collectively representative of the data stored on the node, such as a “geometric center” of the node. New or modified data is routed to the node which has stored data identical or most similar to the new or modified data, as determined based on those values. Each node stores a plurality of chunks of data, where each chunk includes multiple deduplication segments. A content hash is computed for each deduplication segment in each node, and a similarity hash is computed for each chunk from the content hashes of all segments in the chunk. A geometric center of a node is computed from the similarity hashes of the chunks stored in the node. | 01-17-2013 |
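The routing step above (send new data to the node whose "geometric center" is closest) can be sketched as follows; the 4-bucket byte histogram standing in for a similarity hash, and squared-distance comparison, are illustrative assumptions:

```python
# Sketch: route a chunk to the node whose geometric center (the mean of the
# similarity hashes of its stored chunks) is closest to the chunk's hash.

def sim_hash(data: bytes):
    h = [0, 0, 0, 0]              # toy 4-bucket histogram as a similarity hash
    for b in data:
        h[b % 4] += 1
    return h

def center(chunks):
    hs = [sim_hash(c) for c in chunks]
    return [sum(col) / len(hs) for col in zip(*hs)]

def route(chunk, nodes):
    h = sim_hash(chunk)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(h, c))
    return min(nodes, key=lambda name: dist(center(nodes[name])))

nodes = {"A": [b"\x00\x00\x00\x00"],   # center [4, 0, 0, 0]
         "B": [b"\x01\x01\x01\x01"]}   # center [0, 4, 0, 0]
assert route(b"\x00\x00\x04\x04", nodes) == "A"   # hash [4,0,0,0]: nearest A
```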
20130018855 | DATA DEDUPLICATION - A method for data deduplication includes receiving a set of hashes derived from a data chunk of a set of input data chunks | 01-17-2013 |
20130024431 | EVENT DATABASE FOR EVENT SEARCH AND TICKET RETRIEVAL - Methods, systems, and computer-readable media for managing event data and exploring the event data in an event database are provided. A data acquisition system may process the event database to remove duplicates and assign event data ranks to the event data. The event data rank may be based on query log data. In turn, a search engine communicatively connected to the event database may generate search results that include the event data. The search engine may receive an event data search request from a user. The event data matching the event data search request is retrieved from the event database and formatted, by the search engine, for display in rank order based on the event data rank, proximity of user location to an event location, and extent of query match in various event fields like title, description, etc. | 01-24-2013 |
20130031062 | ADJUSTMENT APPARATUS, ADJUSTMENT METHOD, AND RECORDING MEDIUM OF ADJUSTMENT PROGRAM - An adjustment method includes reading a record that includes a plurality of columns from a storage unit, determining whether data stored in a certain column in the plurality of columns of the read record has an attribute that corresponds to another column in the plurality of columns when the data does not have an attribute that corresponds to the certain column, and assigning the data to the another column when it is determined that the data has the attribute that corresponds to the another column. | 01-31-2013 |
20130036100 | DEDUPLICATION IN NETWORK STORAGE WITH DATA RESIDENCE COMPLIANCE - Deduplication in a network storage environment includes, for files stored in a network, determining a location constraint status specified by a compliance agreement for each of the files. Location constraint statuses include a location of persistent residency and no residency restriction. Deduplication also includes selecting a file from the files in the network and identifying corresponding redundant files, the selected file and the corresponding redundant files representing a set. Deduplication further includes determining the location constraint status for each of the files in the set. For the files in the set having a location constraint status specifying a location of persistent residency, the deduplication includes retaining a master copy at the respective location of persistent residency, and removing the corresponding redundant files from the network. | 02-07-2013 |
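A minimal sketch of the residency-aware retention rule described above: keep one master copy per required residency location, remove the remaining redundant copies. The tuple encoding of constraint statuses is an illustrative assumption:

```python
# Sketch: deduplicate a set of redundant replicas under residency constraints.

def deduplicate(file_set):
    # file_set: list of (replica_id, constraint) where constraint is either
    # ("resident", location) or ("none", None).
    keep = {}
    for replica, (kind, loc) in file_set:
        if kind == "resident" and loc not in keep:
            keep[loc] = replica            # master copy stays at this location
    if not keep:                           # no residency constraint anywhere:
        keep["anywhere"] = file_set[0][0]  # keep a single master copy
    return sorted(keep.values())           # everything else is removed

replicas = [("r1", ("resident", "EU")), ("r2", ("none", None)),
            ("r3", ("resident", "EU")), ("r4", ("resident", "US"))]
assert deduplicate(replicas) == ["r1", "r4"]   # one master per residency site
```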
20130054540 | FILE SYSTEM OBJECT-BASED DEDUPLICATION - Systems and methods for optimizing deduplication in a data storage system are provided. The method comprises associating a first name with first data blocks including first content stored in a data storage system, wherein the first name is associated with the first data blocks by way of a reference to a first meta file that points to a data file which points to the first data blocks; storing a first signature derived from the first content in an indexing data structure, wherein the first signature is used to associate the first name with the first data blocks and as means to verify whether a second content is a duplicate of the first content, based on value of a second signature derived from the second content. | 02-28-2013 |
20130054541 | Holistic Database Record Repair - A computer implemented method for repairing records of a database, comprises determining a first set of records of the database which violate a functional dependency of the database, determining a second set of records of the database comprising duplicate records, computing a cost metric representing a measure for the cost of mutually dependently modifying records in the first and second sets, modifying records in the first and second sets on the basis of the cost metric to provide a modified database instance. | 02-28-2013 |
20130054542 | METHOD AND SYSTEM FOR DETECTING DUPLICATE TRAVEL PATH INFORMATION - Method and system for detecting possible duplicate travel path information, comprising: obtaining a set of travel paths with at least two travel paths from a travel path database in communication with a processor, the processor, breaking each travel path into at least one segment, wherein the at least one segment comprises a single unit of travel with an origin and a destination; the processor, comparing each leg in each travel path to each leg in every other travel path in the set of travel paths to determine whether any travel paths are duplicates by determining whether any segments in any legs are similar by determining whether any segments have the same origin and/or the same destination as other segments in other legs in the set of travel paths, and listing any segment paths that are possible duplicates. | 02-28-2013 |
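The comparison procedure above (break each path into single-hop segments, then flag paths sharing a segment origin/destination pair) can be sketched as below; representing a path as an ordered list of stops is an illustrative assumption:

```python
# Sketch: flag travel paths as possible duplicates when they share a segment.

def segments(path):
    # path is an ordered list of stops, e.g. ["NYC", "LHR", "CDG"]
    return [(path[i], path[i + 1]) for i in range(len(path) - 1)]

def possible_duplicates(paths):
    dupes = set()
    for i, p in enumerate(paths):
        for j, q in enumerate(paths):
            # Two paths are possible duplicates if any of their segments
            # share the same origin and destination.
            if i < j and set(segments(p)) & set(segments(q)):
                dupes.update((i, j))
    return sorted(dupes)

paths = [["NYC", "LHR", "CDG"], ["NYC", "LHR"], ["SFO", "NRT"]]
assert possible_duplicates(paths) == [0, 1]    # both contain NYC -> LHR
```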
20130060739 | Optimization of a Partially Deduplicated File - The subject disclosure is directed towards transforming a file having at least one undeduplicated portion into a fully deduplicated file. For each of the at least one undeduplicated portion, a deduplication mechanism defines at least one chunk between file offsets associated with the at least one undeduplicated portion. Chunk boundaries associated with the at least one chunk are stored within deduplication metadata. The deduplication mechanism aligns the at least one chunk with chunk boundaries of at least one deduplicated portion of the file. Then, the at least one chunk is committed to a chunk store. | 03-07-2013 |
20130066841 | Accepting third party content contributions - Accepting a third party news article submission is disclosed. A first submission, including a first URL of a first news article that is different from a second URL of a previously accepted second news article submission, is received. One or more automated checks are performed on at least a portion of the first submission. Whether to accept the first submission is automatically determined based at least in part on the performed checks. | 03-14-2013 |
20130066842 | Method and Device for Storing Domain Name System Records, Method and Device for Parsing Domain Name - A method for storing domain name system (DNS) records includes locally storing received DNS records needed to be stored. If the size of all the stored DNS records does not satisfy a preset storing threshold, a part of the stored DNS records are deleted to make the size of the remaining DNS records after deletion satisfy the storing threshold. A domain name parsing method, device, and server are also provided. | 03-14-2013 |
20130073526 | LOG MESSAGE OPTIMIZATION TO IGNORE OR IDENTIFY REDUNDANT LOG MESSAGES - A method of presenting log messages during execution of a computer program. The method can include identifying at least a second log message set comprising information that is the same as information contained in a first log message set. The method can include determining to present the second log message set in a manner that indicates that the second log message set is redundant, and presenting the list of log messages accordingly, or determining not to present the second log message set in the list of log messages, and presenting the list of log messages accordingly. | 03-21-2013 |
20130073527 | DATA STORAGE DEDUPLICATION SYSTEMS AND METHODS - Storage systems and methods are presented. In one embodiment, a variable length segment storage method comprises: receiving a data stream; performing a tailored segment process on the data stream, wherein at least one of a plurality of tailored segments include corresponding data of at least one of a plurality of variable length segments and alignment padding to align with boundaries of a fixed length de-duplication scheme; performing a de-duplication process on the plurality of tailored segments; and storing information corresponding to the result of the de-duplication process. In one embodiment, the tailored segment process includes adjusting the alignment padding of the at least one of a plurality of tailored segments, wherein an adjustment in the alignment padding of the at least one of a plurality of tailored segments corresponds to a modification in the at least one of the plurality of variable length segments. | 03-21-2013 |
20130073528 | SCALABLE DEDUPLICATION SYSTEM WITH SMALL BLOCKS - For scalable data deduplication working with small data chunks in a computing environment, for each of the small data chunks, a signature is generated based on a combination of a representation of characters that appear in the small data chunks with a representation of frequencies of the small data chunks. The signature is used to help in selecting the data to be deduplicated. | 03-21-2013 |
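The signature described above, combining which characters appear in a small chunk with how often they appear, can be sketched as below; the specific packing into a `(chars, freqs)` tuple is an illustrative assumption:

```python
# Sketch: small-chunk signature from character presence plus frequencies.
from collections import Counter

def small_chunk_signature(chunk: bytes) -> tuple:
    freq = Counter(chunk)
    chars = frozenset(freq)                  # which byte values appear
    freqs = tuple(sorted(freq.values()))     # how often, in a canonical order
    return (chars, freqs)

# Chunks with the same characters at the same frequencies share a signature
# and become candidates for deduplication; others are ruled out cheaply.
assert small_chunk_signature(b"abab") == small_chunk_signature(b"baba")
assert small_chunk_signature(b"abab") != small_chunk_signature(b"aaab")
```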
20130073529 | SCALABLE DEDUPLICATION SYSTEM WITH SMALL BLOCKS - For scalable data deduplication working with small data chunks in a computing environment, for each of the small data chunks, a signature is generated based on a combination of a representation of characters that appear in the small data chunks with a representation of frequencies of the small data chunks. The signature is used to help in selecting the data to be deduplicated. | 03-21-2013 |
20130080404 | MAINTAINING DEDUPLICATION DATA IN NATIVE FILE FORMATS - Mechanisms are provided to maintain deduplication data in native file formats. Files, including entities such as volumes and databases, are analyzed to identify components suitable for deduplication. These components suitable for deduplication are delineated into chunks and identifiers are generated for each of the chunks. The identifiers are used to reference the chunks in deduplication dictionaries that provide locations indicating where deduplicated chunks are stored. The components in the files are replaced with file handles or stubs that applications can use to access deduplicated data. Applications can continue to perform operations on the files as though no deduplication has occurred. | 03-28-2013 |
20130080405 | APPLICATION TRANSPARENT DEDUPLICATION DATA - Mechanisms are provided to allow for application transparent deduplication data. A mail database associated with a mail application can be analyzed to identify attachments meeting particular administrator criteria. The attachments are analyzed and replaced with stubs to allow continued mail application interaction with the mail database. The attachments may be optimized with deduplication and/or compression. | 03-28-2013 |
20130080406 | MULTI-TIER BANDWIDTH-CENTRIC DEDUPLICATION - Example apparatus and methods concern multi-tier bandwidth-centric deduplication. One example apparatus supports inline bandwidth-centric deduplication with post-processing space-centric deduplication to improve inline bandwidth-centric deduplication and thereby reduce bandwidth requirements. One example method may include determining whether a bandwidth-centric deduplication device can satisfy a deduplication request associated with a data communication and then deciding whether to engage a space-centric deduplication device to co-operate in attempting to satisfy the request. More generally, the method includes controlling a first deduplication device to participate in bandwidth reduction and selectively controlling a second deduplication device to also participate in the bandwidth reduction. | 03-28-2013 |
20130080407 | Client-Server Transactional Pre-Archival Apparatus - An apparatus which receives client-server transactions such as HTTP REQUESTS and transforms them into a synopsis format for archival storage. HTTP transactions are logged and parsed for key words called HTTP METHODS. For each HTTP METHOD, data is extracted from the message or the resources provided by the transaction. The data is efficiently stored into a transaction store. The data is also indexed and the index is stored into the transaction store. A record is kept for all concurrent sessions by usernames associated with a directory entry. | 03-28-2013 |
20130080408 | AUTOMATED SELECTION OF FUNCTIONS TO REDUCE STORAGE CAPACITY BASED ON PERFORMANCE REQUIREMENTS - A plurality of functions to configure a unit of a storage volume is maintained, wherein each of the plurality of functions, in response to being applied to the unit of the storage volume, configures the unit of the storage volume differently. Statistics are computed on growth rate of data and access characteristics of the data stored in the unit of the storage volume. A determination is made as to which of the plurality of functions to apply to the unit of the storage volume, based on the computed statistics. | 03-28-2013 |
20130080409 | DEDUPLICATED DATA PROCESSING CONGESTION CONTROL - Various embodiments for deduplicated data processing congestion control in a computing environment are provided. In one such embodiment, a congestion target setpoint is calculated using one of a proportional constant, an integral constant, and a derivative constant, wherein the congestion target setpoint is a virtual dimension setpoint. A single congestion metric is determined from a sampling of a plurality of combined deduplicated data processing congestion statistics in a number of active deduplicated data processes. A congestion limit is calculated from a comparison of the single congestion metric to the congestion target setpoint, the congestion limit being a manipulated variable. The congestion limit is compared to the number of active deduplicated data processes. If the number of active deduplicated data processes are less than the congestion limit, a new deduplicated data process of the number of active deduplicated data processes is spawned. | 03-28-2013 |
20130086006 | METHOD FOR REMOVING DUPLICATE DATA FROM A STORAGE ARRAY - A system and method for efficiently removing duplicate data blocks at a fine-granularity from a storage array. A data storage subsystem supports multiple deduplication tables. Table entries in one deduplication table have the highest associated probability of being deduplicated. Table entries may move from one deduplication table to another as the probabilities change. Additionally, a table entry may be evicted from all deduplication tables if a corresponding estimated probability falls below a given threshold. The probabilities are based on attributes associated with a data component and attributes associated with a virtual address corresponding to a received storage access request. A strategy for searches of the multiple deduplication tables may also be determined by the attributes associated with a given storage access request. | 04-04-2013 |
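The table-migration behavior above (entries move between deduplication tables as their estimated probability changes, and are evicted below a threshold) can be sketched with a two-tier layout; the tier split and the threshold values are illustrative assumptions:

```python
# Sketch: entries migrate between "hot" and "cold" deduplication tables as
# their estimated dedup probability changes; below a floor they are evicted.

HOT_MIN, EVICT_BELOW = 0.5, 0.1
hot, cold = {}, {}            # hot is searched first, cold second

def update(fingerprint, probability):
    hot.pop(fingerprint, None)
    cold.pop(fingerprint, None)
    if probability >= HOT_MIN:
        hot[fingerprint] = probability     # highest chance of deduplication
    elif probability >= EVICT_BELOW:
        cold[fingerprint] = probability
    # else: evicted from all deduplication tables

update("fp1", 0.9)            # enters the hot table
update("fp2", 0.3)            # enters the cold table
update("fp1", 0.05)           # probability fell below the floor: evicted
assert "fp1" not in hot and "fp1" not in cold
assert "fp2" in cold
```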
20130086007 | SYSTEM AND METHOD FOR FILESYSTEM DEDUPLICATION USING VARIABLE LENGTH SHARING - Embodiments of the present invention are directed to a method and system for filesystem deduplication that uses both small fingerprint granularity and variable length sharing techniques. The method includes accessing, within an electronic system, a plurality of files in a primary storage filesystem and determining a plurality of fingerprints for the plurality of files. Each respective fingerprint may correspond to a respective portion of a respective file of the plurality of files. The method further includes determining a plurality of portions of the plurality of files where each of the plurality of portions has the same corresponding fingerprint and accessing a list comprising a plurality of portions of files previously deduplicated. A portion of a file of the plurality of files not present in the list may then be deduplicated. Consecutive portions of variable lengths having the same corresponding fingerprints may also be deduplicated. | 04-04-2013 |
20130086008 | USE OF MAILBOX FOR STORING METADATA IN CONFLICT RESOLUTION - Metadata associated with contact unification, which may involve conflict resolution and de-duplication, is stored in a user's mailbox for optimizing future automated unification operations, sharing of information between different clients and services, and providing relational data that can be used for other applications. User interactions regarding unification such as rejection or acceptance of automated actions, usage of created unified contacts, as well as data from external applications and services may be analyzed and stored in the mailbox. Such metadata may then be used to resolve conflicts for the same user or other users in future contact unification operations and shared with other applications and services through a predefined schema such that those applications and services can update their data as well. | 04-04-2013 |
20130086009 | METHOD AND SYSTEM FOR DATA DEDUPLICATION - The present disclosure discloses a method and system for data deduplication. The method comprises: acquiring meta data and multiple data chunks corresponding to at least one original data object, which are generated by using a data deduplication method; combining the acquired multiple data chunks into a new data object; performing deduplication on the new data object to generate new meta data and new data chunks corresponding to the new data object; and storing the meta data corresponding to the at least one original data object, the new meta data corresponding to the new data object, and the new data chunks. The method and system can further improve deduplication ratio, lower data storage amount, and save costs. | 04-04-2013 |
20130086010 | SYSTEMS AND METHODS FOR DATA QUALITY CONTROL AND CLEANSING - A method for detecting and cleansing suspect building automation system data is shown and described. The method includes using processing electronics to automatically determine which of a plurality of error detectors and which of a plurality of data cleansers to use with building automation system data. The method further includes using processing electronics to automatically detect errors in the data and cleanse the data using a subset of the error detectors and a subset of the cleansers. | 04-04-2013 |
20130091102 | DEDUPLICATION AWARE SCHEDULING OF REQUESTS TO ACCESS DATA BLOCKS - Systems and methods for scheduling requests to access data may adjust the priority of such requests based on the presence of de-duplicated data blocks within the requested set of data blocks. A data de-duplication process operating on a storage device may build a de-duplication data map that stores information about the presence and location of de-duplicated data blocks on the storage drive. An I/O scheduler that manages the access requests can employ the de-duplicated data map to identify and quantify any de-duplicated data blocks within an access request. The I/O scheduler can then adjust the priority of the access request based, at least in part, on whether the de-duplicated data blocks provide a large enough sequence of data blocks that servicing the request, even if it causes a head seek operation, will not reduce the overall global throughput of the storage system. | 04-11-2013 |
20130091103 | SYSTEMS AND METHODS FOR REAL-TIME DE-DUPLICATION - Disclosed are systems, apparatus, and methods for identifying and processing duplicative records in one or more database systems. In various implementations, a first data object may be created and stored in a first database system, where the first data object includes a plurality of data fields capable of storing a plurality of data values. A trigger function may be executed in response to creating the first data object, where the trigger function causes one or more servers to determine if one or more existing data objects stored in the second database system match the first data object, and where the trigger function further causes one or more servers in the first database system to retrieve one or more data values from the one or more existing data objects. The retrieved one or more data values may be stored in one or more data fields of the first data object. | 04-11-2013 |
20130091104 | SYSTEMS AND METHODS FOR REAL-TIME DE-DUPLICATION - Disclosed are systems, apparatus, and methods for identifying and visualizing duplicative records via a social network. In various implementations, a first data object may be created and stored in a first database system, where the first data object includes a plurality of data fields capable of storing a plurality of data values. In some implementations, a trigger function may be executed in response to creating the first data object, where the trigger function causes one or more servers in a second database system to determine if one or more existing data objects stored in the second database system include one or more data values that match data values included in the first data object. In various implementations, feed items may be generated in response to determining that a match exists, where the feed items provide one or more users with an indication of the determined match. | 04-11-2013 |
20130097124 | AUTOMATICALLY AGGREGATING CONTACT INFORMATION - A communication application automatically aggregates contact information. The communication application classifies contact information retrieved from data sources as either duplicate or complementary contact information to a contact. The communication application aggregates the contact information and the contact into a unified contact object by eliminating the duplicate contact information and adding the complementary contact information. The application presents the unified contact object through a user interface. | 04-18-2013 |
20130097125 | AUTOMATED ANALYSIS OF UNSTRUCTURED DATA - The current application is directed to automated methods and systems for processing and analyzing unstructured data. The methods and systems of the current application identify patterns and determine characteristics of, and interrelationships between, events parsed from the unstructured data without necessarily using user-provided or expert-provided contextual knowledge. In one implementation, the unstructured data is parsed into attributed-associated events, reduced by eliminating attributes of low-information content, and coalesced into nodes that are incorporated into one or more graphs, within which patterns are identified and characteristics and interrelationships determined. | 04-18-2013 |
20130103654 | GLOBAL DICTIONARIES USING UNIVERSAL PRIMITIVES - Embodiments are directed towards managing data storage and queries within a database system using global dictionaries with universal primitives (UNIPs) to represent non-numeric data within a mixed numeric/non-numeric environment. Common data types are managed within a same global dictionary through dictionaries that are globally used within the database system. At least non-numeric data within mixed data fields may be stored using a UNIP to identify the stored non-numeric data. The UNIP may take advantage of the IEEE-754 standard for floating point data representation by setting a first field within the UNIP to 0x7ff (HEX) to indicate that the data is non-numeric (NaN) and using remaining bits to store typed data, such as a date or unique indirect reference (e.g. a sequence number or file offset to larger piece of data). The UNIP may then replace the data within the database and be used during operations performed on the data. | 04-25-2013 |
20130110792 | Contextual Gravitation of Datasets and Data Services | 05-02-2013 |
20130110793 | DATA DE-DUPLICATION IN COMPUTER STORAGE SYSTEMS | 05-02-2013 |
20130110794 | APPARATUS AND METHOD FOR FILTERING DUPLICATE DATA IN RESTRICTED RESOURCE ENVIRONMENT | 05-02-2013 |
20130117241 | Shadow Paging Based Log Segment Directory - Replay of data transactions is initiated in a data storage application. Pages of a log segment directory characterizing metadata for a plurality of log segments are loaded into memory. Thereafter, redundant pages within the log segment directory are removed. It is then determined, based on the log segment directory, which log segments need to be accessed. These log segments are accessed to execute the log replay. Related apparatus, systems, techniques and articles are also described. | 05-09-2013 |
20130124486 | DATA STORAGE WITH SNAPSHOT-TO-SNAPSHOT RECOVERY - Embodiments of the present invention provide methods, apparatuses, systems, and computer software products for data storage. A corrupted node under a first meta-volume node in a hierarchical tree structure is deleted. The hierarchical tree structure further includes a source node under the first meta-volume node. The corrupted node and the source node each include a respective set of local pointers. The corrupted node and the source node represent respective copies of a logical volume. The source node is reconfigured to become a second meta-volume node having the same set of local pointers as the source node. A first new node is created under the second meta-volume node in the hierarchical tree structure to represent the corrupted node. A second new node is created under the second meta-volume node to represent the source node. The first and second new nodes are configured to have no local pointers. | 05-16-2013 |
20130124487 | Deduplication of data object over multiple passes - In each of a number of passes to deduplicate a data object, a transaction is started. Where an offset into the object has previously been set, the offset is retrieved; otherwise, the offset is set to reference a beginning of the object. A portion of the object beginning at the offset is deduplicated until an end-of-transaction criterion has been satisfied. The transaction is ended to commit deduplication; where the object has not yet been completely deduplicated, the offset is moved just past where deduplication has already occurred. The object is locked during each pass; other processes cannot access the object during each pass, but can access the object between passes. Each pass is relatively short, so the length of time in which the object is inaccessible is relatively short. By comparison, deduplicating an object within a single pass prevents other processes from accessing the object for a longer time. | 05-16-2013 |
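The pass structure above can be sketched as follows. The fixed chunk budget, the in-memory "commit", and the plain lock are stand-ins for the real transaction machinery, assumed here only for illustration.

```python
import threading

# A sketch of multi-pass deduplication: each pass locks the object,
# resumes from the saved offset, deduplicates until a budget (the
# end-of-transaction criterion) is exhausted, then commits and releases
# the lock so other processes can access the object between passes.

PASS_BUDGET = 2  # chunks per pass; an illustrative criterion

def deduplicate_in_passes(chunks, lock):
    """Deduplicate `chunks` over several short locked passes."""
    seen, result, offset = set(), [], 0
    while offset < len(chunks):
        with lock:  # object inaccessible only for one short pass
            for chunk in chunks[offset:offset + PASS_BUDGET]:
                if chunk not in seen:
                    seen.add(chunk)
                    result.append(chunk)
            offset = min(offset + PASS_BUDGET, len(chunks))
        # lock released here: other processes may run between passes
    return result
```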
20130138617 | DE-DUPLICATION IN BILLING SYSTEM - A computing system partitions received events into a number of channels by account identifier. The channels receive the events and perform de-duplication of the events. This de-duplication can be performed with a filter that is updated to reflect the receipt of any original event. The filter may be used either to determine that the event is not a duplicate of another, or to determine that the event cannot be ruled out as being a duplicate of another. In the latter case, further processing may be performed to definitively determine whether the event is truly a duplicate, or in the alternative, the event may be immediately treated as a duplicate. | 05-30-2013 |
20130144845 | REMOVAL OF DATA REMANENCE IN DEDUPLICATED STORAGE CLOUDS - A method implemented in a computer infrastructure including a combination of hardware and software includes receiving from a local computing device a request to securely delete a file. The method also includes determining the file is deduplicated. The method further includes determining one of: the file is referred to by at least one other file, and the file is not referred to by another file. The method additionally includes securely deleting links associating the file with the local computing device without deleting the file when the file is referred to by at least one other file. The method also includes securely deleting the file when the file is not referred to by another file. | 06-06-2013 |
20130144846 | MANAGING REDUNDANT IMMUTABLE FILES USING DEDUPLICATION IN STORAGE CLOUDS - A method includes receiving a request to save a first file as immutable. The method also includes searching for a second file that is saved and is redundant to the first file. The method further includes determining the second file is one of mutable and immutable. When the second file is mutable, the method includes saving the first file as a master copy, and replacing the second file with a soft link pointing to the master copy. When the second file is immutable, the method includes determining which of the first and second files has a later expiration date and an earlier expiration date, saving the one of the first and second files with the later expiration date as a master copy, and replacing the one of the first and second files with the earlier expiration date with a soft link pointing to the master copy. | 06-06-2013 |
20130144847 | De-Duplication of Featured Content - A system, computer-implemented method and computer-readable medium for managing duplicate articles are provided. A first and a second potentially duplicate article of a magazine edition are accessed, the first article associated with a first title and a first URL and the second article associated with a second title and a second URL. The titles are normalized. The first normalized title is compared to the second normalized title and the first URL is compared to the second URL to determine whether the first article and the second article are duplicates. It is determined that the first article and the second article are duplicates when the first normalized title is considered similar to the second normalized title and the first URL is considered similar to the second URL. Otherwise, it is determined that the first article and the second article are not duplicates. | 06-06-2013 |
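A hedged sketch of the comparison above: normalize both titles, then call the articles duplicates only when the normalized titles and the URLs are each "similar". The normalization rules and the 0.9 similarity threshold below are assumptions, not the patented parameters.

```python
import re
from difflib import SequenceMatcher

def normalize(title):
    """Lowercase a title and strip punctuation (illustrative rules)."""
    return re.sub(r"[^a-z0-9 ]", "", title.lower()).strip()

def similar(a, b, threshold=0.9):
    """Treat two strings as similar above an assumed ratio threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold

def are_duplicates(title1, url1, title2, url2):
    """Duplicates only if both normalized titles and URLs are similar."""
    return (similar(normalize(title1), normalize(title2))
            and similar(url1, url2))
```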
20130144848 | DEDUPLICATED DATA PROCESSING RATE CONTROL - Workers are configured for parallel processing of deduplicated data entities in chunks. The deduplicated data processing rate is regulated using a rate control mechanism. The rate control mechanism incorporates a debt/credit algorithm specifying which of the workers processing the deduplicated data entities must wait for each of a multiplicity of calculated required sleep times. | 06-06-2013 |
20130151481 | SYSTEM, APPARATUS AND METHOD FOR GENERATING ARRANGEMENTS OF DATA BASED ON SIMILARITY FOR CATALOGING AND ANALYTICS - Embodiments of the invention relate generally to electrical and electronic hardware, computer software, wired and wireless network communications, and computing devices, and more particularly, to a system, an apparatus and a method configured to generate arrangements of data, including data catalogs, to facilitate discovery of items via an interface depicting item representations based on similarity of one or more attributes. In one embodiment, a method includes receiving data representing a request to transmit data; executing instructions at a processor to determine the relationships of the item representations to the principal item representation for records stored in a memory; and associating the item representations with positions of a data arrangement based on those relationships. The method can also include transmitting to a computing device data representing the data arrangement, including data representing the item representations to be presented on an interface of the computing device. | 06-13-2013 |
20130151482 | De-duplication for a global coherent de-duplication repository - Example methods and apparatus associated with data de-duplication for a global coherent de-duplication repository are provided. In one example a request related to data de-duplication is transmitted to a plurality of nodes associated with the global coherent de-duplication repository. Responses to the request are received from at least a subset of nodes in the plurality of nodes. Affinity scores are assigned to nodes of the subset of nodes based, at least in part, on affinity data from the responses. A node is selected to perform the request related to de-duplication from the subset of nodes of the plurality of nodes based, at least in part, on the affinity score assigned to the nodes. | 06-13-2013 |
20130151483 | Adaptive experience based De-duplication - Example apparatus and methods associated with adaptive experience based de-duplication are provided. One example data de-duplication apparatus includes a de-duplication logic, an experience logic, and a reconfiguration logic. The de-duplication logic may be configured to perform data de-duplication according to a configurable approach that is a function of a pre-defined constraint. The experience logic may be configured to acquire de-duplication performance experience data. The reconfiguration logic may be configured to selectively reconfigure the configurable approach on the apparatus as a function of the de-duplication performance experience data. In different examples, dynamic reconfiguration may be performed locally and/or in a distributed manner based on local and/or distributed data that is acquired on a per actor (e.g., user, application) basis and/or on a per entity (e.g., computer, data stream) basis. | 06-13-2013 |
20130151484 | STORAGE DISCOUNTS FOR ALLOWING CROSS-USER DEDUPLICATION - Technologies are presented for deduplicating data storage across multiple separate users in a datacenter environment. In some examples, the deduplication may take into consideration separate encryption and packaging of various inactive data modules and machine instances, and may be performed based on customer proactive flagging of data as available for deduplication. Billing system records may be employed to track saved space for incentivizing users through discounts and as a garbage collection master reference for tracking usage of deduplication packages, which may otherwise be difficult in the multi-package environment. | 06-13-2013 |
20130159261 | DE-DUPLICATION REFERENCE TAG RECONCILIATION - Example apparatus and methods concern de-duplication reference tag reconciliation associated with garbage collection and/or reference health checking. One example method may include accessing data associated with members of a set of references to blocks of data stored by a data de-duplication system. The method may process the first data to manipulate a Bloom filter into a state from which membership in the set of references can be assessed. The method may also include accessing a block identifier identified with a member of the set of blocks of data stored by the data de-duplication system and assessing membership in the set of references for the block identifier by querying the Bloom filter with the block identifier. If the block is not referenced, as determined by querying the Bloom filter, then the method may include performing a block reclamation action on the unreferenced block. | 06-20-2013 |
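The Bloom-filter reconciliation above can be sketched as follows: build the filter from live block references, then query it once per stored block; a definite miss means the block is unreferenced and can be reclaimed, while a hit is conservatively kept. The filter sizing and double-hashing scheme are illustrative choices, not the patent's.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter using double hashing over SHA-256 output."""
    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

def unreferenced_blocks(block_ids, references):
    """Blocks definitely not referenced: candidates for reclamation.
    Bloom false positives only delay reclamation, never lose data."""
    bf = BloomFilter()
    for ref in references:
        bf.add(ref)
    return [b for b in block_ids if not bf.might_contain(b)]
```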
20130159262 | EFFICIENT SEGMENT DETECTION FOR DEDUPLICATION - Mechanisms are provided for efficiently detecting segments for deduplication. Data is analyzed to determine file types and file components. File types such as images may have optimal data segment boundaries set at the file boundaries. Other file types such as container files are delayered to extract objects to set optimal data segment boundaries based on file type or based on the boundaries of the individual objects. Storage of unnecessary information is minimized in a deduplication dictionary while allowing for effective deduplication. | 06-20-2013 |
20130173560 | DYNAMIC RECORD BLOCKING - Dynamic blocking determines which pairs of records in a data set should be examined as potential duplicates. Records are grouped together into blocks by shared properties that are indicators of duplication. Blocks that are too large to be efficiently processed are further subdivided by other properties chosen in a data-driven way. We demonstrate the viability of this algorithm for large data sets. We have scaled this system up to work on billions of records on an 80 node Hadoop cluster. | 07-04-2013 |
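The blocking-and-subdivision step above can be sketched as below. The blocking keys (ZIP code, then first letter of name) and the block-size cap are hypothetical choices made for illustration, not taken from the patent.

```python
from collections import defaultdict
from itertools import combinations

MAX_BLOCK = 3  # blocks larger than this are subdivided (assumed cap)

def candidate_pairs(records, key_fns):
    """Yield id pairs that should be examined as potential duplicates:
    group by a blocking key, subdivide oversized blocks with the next
    key, and emit pairs only within final blocks."""
    groups = defaultdict(list)
    for r in records:
        groups[key_fns[0](r)].append(r)
    for group in groups.values():
        if len(group) < 2:
            continue
        if len(group) > MAX_BLOCK and len(key_fns) > 1:
            yield from candidate_pairs(group, key_fns[1:])  # subdivide
        else:
            yield from combinations(sorted(r["id"] for r in group), 2)
```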
20130173561 | SYSTEMS AND METHODS FOR DE-DUPLICATION IN STORAGE SYSTEMS - In accordance with embodiments of the present disclosure, a storage system may include a storage array comprising one or more storage resources, a processor communicatively coupled to the storage array, and a de-duplication module comprising instructions embodied on a computer-readable medium communicatively coupled to the processor. The de-duplication module may be configured to, when read and executed by the processor: generate a fingerprint for an item of data stored on the storage array; identify a partition for the fingerprint; associate the partition with a hardware instance selected from a plurality of hardware instances, wherein each particular hardware instance comprises one or more information handling resources; and query the selected hardware instance to determine if the fingerprint exists on the hardware instance. | 07-04-2013 |
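A minimal sketch of the routing in this abstract: fingerprint an item, map the fingerprint to a partition, and query the hardware instance that owns that partition. The modulo partition-to-instance mapping and the `Instance` stand-in class are assumptions for illustration.

```python
import hashlib

NUM_PARTITIONS = 64  # assumed partition count

class Instance:
    """Stand-in for a hardware instance holding part of the index."""
    def __init__(self):
        self.fingerprints = set()
    def query(self, fp):
        return fp in self.fingerprints
    def store(self, fp):
        self.fingerprints.add(fp)

def deduplicate(data, instances):
    """Return True if `data` was already known; otherwise store it."""
    fp = hashlib.sha256(data).hexdigest()          # generate fingerprint
    partition = int(fp, 16) % NUM_PARTITIONS       # identify partition
    owner = instances[partition % len(instances)]  # associate instance
    if owner.query(fp):                            # query the instance
        return True
    owner.store(fp)
    return False
```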
20130173562 | Simplifying Lexicon Creation in Hybrid Duplicate Detection and Inductive Classifier System - A classification system includes a signature-based duplicate detector and an inductive classifier that share attribute information. To perform the duplicate detection and the classification, the duplicate detector and inductive classifier are first initialized by generating a lexicon of attributes for the duplicate detector and a classification model for the classifier. To develop a classification model, a training set of documents of known class are used by the classifier to determine the attributes of the documents that are most useful in classifying an unknown document. The model is developed from these attributes. Attribute information containing the attributes determined by the classifier is then passed to the duplicate detector and the duplicate detector uses the attribute information to generate the lexicon of attributes. | 07-04-2013 |
20130173563 | RELIABILITY OF DUPLICATE DOCUMENT DETECTION ALGORITHMS - In a single-signature duplicate document system, a secondary set of attributes is used in addition to a primary set of attributes so as to improve the precision of the system. When the projection of a document onto the primary set of attributes is below a threshold, then a secondary set of attributes is used to supplement the primary lexicon so that the projection is above the threshold. | 07-04-2013 |
20130179407 | Deduplication Seeding - Apparatus, methods, and other embodiments associated with de-duplication seeding are described. One example method includes re-configuring a data de-duplication repository with a blocklet from a data de-duplication seed corpus. Reconfiguring the repository may include adding a blocklet from the seed corpus to the repository, activating a blocklet identified with the seed corpus in the repository, removing a blocklet from the repository, and de-activating a blocklet in the repository. The example method may also include re-configuring a data de-duplication index associated with the data de-duplication repository with information about the blocklet. Reconfiguring the repository and the index increases the likelihood that a blocklet ingested by a data de-duplication apparatus that relies on the repository and the index will be treated as a duplicate blocklet by the data de-duplication apparatus. | 07-11-2013 |
20130179408 | Blocklet Pattern Identification - Apparatus, methods, and other embodiments associated with blocklet pattern identification are described. One example method includes accessing a blocklet produced by a computerized data de-duplication parsing process before providing the blocklet to a duplicate blocklet determiner. The example method also includes hashing a portion of the blocklet to produce a pattern indicating hash and then identifying the blocklet as a pattern blocklet if the pattern indicating hash matches a pre-determined pattern indicating hash. To improve efficiency in a data de-duplication process, the blocklet pattern identifying may be performed independently from a data structure and process used by the duplicate blocklet determiner. If the blocklet is a pattern blocklet, then the method includes selectively controlling the duplicate blocklet determiner to not process the pattern blocklet. The duplicate determination is not needed because a pattern determination has already been made. | 07-11-2013 |
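The short-circuit above can be sketched as follows: hash a small probe of the blocklet and compare it against precomputed hashes of known fill patterns; a match lets the caller skip the duplicate-blocklet determiner entirely. The 64-byte probe size and the all-zero/all-0xFF pattern set are illustrative assumptions.

```python
import hashlib

PROBE = 64  # assumed probe length in bytes
PATTERN_HASHES = {
    hashlib.sha256(bytes([fill]) * PROBE).digest() for fill in (0x00, 0xFF)
}

def is_pattern_blocklet(blocklet):
    """True if the blocklet's probe hash matches a known pattern hash;
    such blocklets need not be sent to the duplicate determiner."""
    if len(blocklet) < PROBE:
        return False
    return hashlib.sha256(blocklet[:PROBE]).digest() in PATTERN_HASHES
```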
20130191349 | HANDLING REWRITES IN DEDUPLICATION SYSTEMS USING DATA PARSERS - Methods, computer systems, and computer program products for deduplicating data are provided. Data is parsed to identify portions of metadata within the data. The data and identified portions of metadata are processed by a deduplication engine to be storable in a single repository. The deduplication engine is adapted for deduplicating the data without at least one of deduplicating and indexing the identified portions of metadata. | 07-25-2013 |
20130191350 | Single Instantiation Method Using File Clone and File Storage System Utilizing the Same - In file de-duplication using hash value comparison, hash values of all target files must be calculated, and the actual data of all files must be read for that calculation, so the processing time is long. The present invention provides a file storage system comprising a controller and a volume storing a plurality of files, the volume including a first directory storing a first file and a second file and a second directory storing a third file being created, wherein the controller migrates the actual data of the second file to the third file, sets up management information of the second file so that the third file is referred to when the second file is read, and, if the sizes and binaries of the actual data of the first file and the third file are identical, sets up management information of the first file to refer to the third file when the first file is read. | 07-25-2013 |
20130198148 | ESTIMATING DATA REDUCTION IN STORAGE SYSTEMS - Embodiments of the present invention provide a system, method and computer program products for estimating data reduction in a file system. A method includes selecting a sample of all data from data files in the file system, wherein said sample represents a subset of all the data in the file system. The method further includes estimating a data reduction ratio by data deduplication for the file system based on said sample. The method further includes estimating a data reduction ratio by data compression for the file system based on said sample. The method further includes generating a combined data reduction estimate for the file system based on said data compression estimate and said data deduplication estimate. | 08-01-2013 |
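A back-of-envelope sketch of such an estimate: sample some blocks, estimate the deduplication ratio from the fraction of distinct hashes, estimate the compression ratio with zlib on the sample, and combine the two. The sampling fraction and the multiplicative combination are assumptions, not the patented estimator.

```python
import hashlib
import random
import zlib

def estimate_reduction(blocks, sample_frac=0.5, seed=0):
    """Estimated fraction of the original size kept after combined
    deduplication and compression (lower means more reduction)."""
    rng = random.Random(seed)
    k = max(1, int(len(blocks) * sample_frac))
    sample = rng.sample(blocks, k)
    # Deduplication estimate: fraction of sampled blocks that are distinct.
    distinct = {hashlib.sha256(b).digest() for b in sample}
    dedup_ratio = len(distinct) / len(sample)
    # Compression estimate: zlib ratio over the concatenated sample.
    raw = b"".join(sample)
    comp_ratio = len(zlib.compress(raw)) / len(raw)
    return dedup_ratio * comp_ratio  # assumed multiplicative combination
```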
20130198149 | AUTOMATED CORRUPTION ANALYSIS OF SERVICE DESIGNS - Methods and arrangements for conducting corruption analysis of service designs. A service design is accepted. Corrupting factors within the service design are assessed, and a corruption susceptibility score is generated. An alternative service design is generated responsive to a corruption susceptibility score fulfilling predetermined criteria. | 08-01-2013 |
20130198150 | FILE-TYPE DEPENDENT DATA DEDUPLICATION - A memory system comprises a pre-processor that receives a data file and determines a type of the data file, a chunking module that chunks the data file to produce a plurality of chunks, a hash engine that generates a hash value for a chunk among the plurality of chunks, a finger print detector that determines whether the hash value matches an entry within a portion of an index table corresponding to the type of the data file, and a storage medium that stores the chunk or a pointer to the chunk according to a result of the determination performed by the finger print detector. | 08-01-2013 |
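The pipeline above can be sketched as below: a per-file-type chunk size, a hash per chunk, and a per-type index table; an index hit stores only a pointer, a miss stores the chunk. The chunk sizes and the two example types are assumptions for illustration.

```python
import hashlib

CHUNK_SIZE_BY_TYPE = {"text": 4096, "image": 65536}   # assumed sizes
index_tables = {t: {} for t in CHUNK_SIZE_BY_TYPE}    # type -> {hash: loc}
storage = []  # stored chunks; the list index serves as the "location"

def store_file(data, file_type):
    """Store a file, returning one location per chunk (deduplicated)."""
    size = CHUNK_SIZE_BY_TYPE[file_type]   # chunking depends on type
    table = index_tables[file_type]        # index portion for this type
    locations = []
    for off in range(0, len(data), size):
        chunk = data[off:off + size]
        h = hashlib.sha256(chunk).hexdigest()  # hash engine
        if h not in table:                 # fingerprint miss: store chunk
            table[h] = len(storage)
            storage.append(chunk)
        locations.append(table[h])         # hit or miss: record pointer
    return locations
```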
20130204848 | DEDUPLICATED DATA PROCESSING RATE CONTROL - A plurality of workers is configured for parallel processing of deduplicated data entities in a plurality of chunks. The deduplicated data processing rate is regulated using a rate control mechanism. The rate control mechanism incorporates a debt/credit algorithm specifying which of the plurality of workers processing the deduplicated data entities must wait for each of a plurality of calculated required sleep times. The rate control mechanism limits a data flow rate based on a penalty acquired during a last processing of one of the plurality of chunks in a retroactive manner, and operates on at least one vector representation of at least one limit specification to accommodate a variety of available dimensions corresponding to the at least one limit specification. | 08-08-2013 |
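The debt/credit accounting in these two abstracts can be sketched as below: the cost of the chunk a worker just processed is added as debt, real elapsed time is subtracted as credit, and the remainder is the sleep the worker owes before its next chunk, applied retroactively. The exact accounting is an assumption, not the patented algorithm.

```python
import time

class RateController:
    """Per-worker debt/credit pacing toward a target byte rate."""
    def __init__(self, bytes_per_sec, clock=time.monotonic):
        self.rate = bytes_per_sec
        self.clock = clock
        self.debt = 0.0          # seconds of sleep currently owed
        self.last = clock()

    def required_sleep(self, chunk_bytes):
        """Seconds the worker must sleep before taking its next chunk."""
        now = self.clock()
        self.debt += chunk_bytes / self.rate  # cost of the last chunk
        self.debt -= now - self.last          # credit for time elapsed
        self.last = now
        self.debt = max(0.0, self.debt)       # credit cannot go negative
        return self.debt
```

The injectable `clock` makes the controller testable; workers would call `time.sleep(rc.required_sleep(n))` after each chunk.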
20130204849 | DISTRIBUTED VIRTUAL STORAGE CLOUD ARCHITECTURE AND A METHOD THEREOF - The present disclosure relates to a distributed information storage system that functions as a virtual cloud storage overlay on top of physical cloud storage systems. The disclosure transparently addresses data-management concerns such as security, virtualization, and reliability, and enables transparent cloud storage migration, cloud storage virtualization, information dispersal, and integration across disparate cloud storage devices operated by different providers or on-premise storage. The cloud storage is owned or hosted by the same or different third-party providers who own the information contained in the storage, which eliminates cloud dependencies. The present disclosure functions as a distributed cloud storage delivery platform enabling functionalities such as cloud storage virtualization, cloud storage integration, cloud storage management, and cloud-level RAID. | 08-08-2013 |
20130212074 | STORAGE SYSTEM - Duplicate storage elimination is performed in units of block data generated by dividing a data stream into arbitrary-sized blocks. Further, sub-block data is generated by further dividing the block data into a plurality of pieces of data, and sub-address data based on the data content of each of the pieces of sub-block data is stored in a predetermined storage device. As such, duplicate storage elimination is also performed in sub-block data units based on the sub-address data. | 08-15-2013 |
20130218847 | FILE SERVER APPARATUS, INFORMATION SYSTEM, AND METHOD FOR CONTROLLING FILE SERVER APPARATUS - Provided is a file server apparatus | 08-22-2013 |
20130218848 | OPTIMIZING WIDE AREA NETWORK (WAN) TRAFFIC BY PROVIDING HOME SITE DEDUPLICATION INFORMATION TO A CACHE SITE - Methods, systems, and physical computer-readable storage medium are provided to optimize WAN traffic on cloud networking sites. In an embodiment, by way of example only, a method includes fetching deduplication information from a home site to build a repository comprising duplicate peer file sets, one or more of the duplicate peer file sets including one or more peer files, referring to the repository to determine whether a target file corresponds with a cache copy of a peer file of the one or more peer files included in the duplicate peer file sets, and creating a local copy of the peer file of the one or more peer files, if a determination is made that the target file corresponds with the cache copy of the peer file of the one or more peer files included in the duplicate peer file sets. | 08-22-2013 |
20130218849 | AUTOMATED DICTIONARY CREATION FOR SCIENTIFIC TERMS - Systems and methods for automated creation of a dictionary of scientific terms are described herein. Initially, input data is filtered to obtain a primary file having a plurality of term-ID pairs with each term-ID pair having a unique term ID and a scientific term. Further, a remove-term file is generated based on one or more term-ID pairs identified from the primary file such that the scientific terms of each term-ID pair corresponds to one of additional terms, frequent scientific terms, and undesirable terms. At least one term-ID pair from among the one or more term-ID pairs is altered to obtain a modified term-ID pair based on modification rules. The modified term-ID pair is added to an add-term file and a modified file is obtained based on the remove-term file and the add-term file. Duplicate term-ID pairs present in the modified file are removed to obtain the dictionary of scientific terms. | 08-22-2013 |
20130218850 | DYNAMIC REWRITE OF FILES WITHIN DEDUPLICATION SYSTEM - An original deduplication file system (DFS) file is partitioned into a first set of sections being sections including data affected by rewrite operations and a second set of sections being sections including data unaffected by rewrite operations. A new DFS file to be stored as part of a plurality of user files is created, the plurality of user files including the original DFS file and being accessible by a sequential DFS and a dynamic non-DFS, the dynamic non-DFS including a plurality of dynamic metadata files having entries pointing to corresponding sections of the original DFS files. The first set of sections of the original DFS file including data affected by rewrite operations is directly written into the new DFS file. The second set of sections from the original DFS file including data unaffected by rewrite operations is quoted into the new DFS file. The original DFS file is deleted. | 08-22-2013 |
20130218851 | STORAGE SYSTEM, DATA MANAGEMENT DEVICE, METHOD AND PROGRAM - A storage system is characterized in that the storage system includes duplication-determination-unit determining means for determining a duplication determination unit, which is a unit to be used in determining duplications of data, on the basis of a duplication generation rate computed for each of a plurality of data division units obtained as a result of division of data stored in a storage device, and duplication eliminating means for carrying out processing to eliminate duplications of the data stored in the storage device on the basis of the duplication determination unit determined by the duplication-determination-unit determining means. | 08-22-2013 |
20130226881 | FRAGMENTATION CONTROL FOR PERFORMING DEDUPLICATION OPERATIONS - The techniques introduced here provide for enabling deduplication operations for a file system without significantly affecting read performance of the file system due to fragmentation of the data sets in the file system. The techniques include determining, by a storage server that hosts the file system, a level of fragmentation that would be introduced to a data set stored in the file system as a result of performing a deduplication operation on the data set. The storage server then compares the level of fragmentation with a threshold value and determines whether to perform the deduplication operation based on a result of comparing the level of fragmentation with the threshold value. The threshold value represents an acceptable level of fragmentation in the data sets of the file system. | 08-29-2013 |
20130226882 | AUTOMATIC TABLE CLEANUP FOR RELATIONAL DATABASES - An approach for an automatic table cleanup process, implemented in relational databases, is provided. A method includes setting up a table cleanup process in a database which is operable to perform an automatic table cleanup on a table within the database using an auto purge value associated with the table. The method further includes altering the table with a virtual column to keep track of dates on the table. The method further includes turning on an automatic table maintenance capability of the database to include and initiate the table cleanup process. The method further includes running the table cleanup process to perform the automatic table cleanup using dates which are automatically filled in during an insert or update operation on the table, the table cleanup process comprising looking through the records and automatically purging the table when the auto purge value has been met. | 08-29-2013 |
20130226883 | SYSTEMS AND METHODS FOR BYTE-LEVEL OR QUASI BYTE-LEVEL SINGLE INSTANCING - Described in detail herein are systems and methods for deduplicating data using byte-level or quasi byte-level techniques. In some embodiments, a file is divided into multiple blocks. A block includes multiple bytes. Multiple rolling hashes of the file are generated. For each byte in the file, a searchable data structure is accessed to determine if the data structure already includes an entry matching a hash of a minimum sequence length. If so, this indicates that the corresponding bytes are already stored. If one or more bytes in the file are already stored, then the one or more bytes in the file are replaced with a reference to the already stored bytes. The systems and methods described herein may be used for file systems, databases, storing backup data, or any other use case where it may be useful to reduce the amount of data being stored. | 08-29-2013 |
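The rolling-hash step above can be sketched with a Rabin-Karp style hash: every window of a minimum sequence length is hashed in O(1) per byte, and each hash is looked up in a table of already-stored sequences. The base, modulus, and window length are illustrative constants, not the patented parameters.

```python
BASE = 257
MOD = (1 << 61) - 1
MIN_LEN = 8  # assumed minimum sequence length

def rolling_hashes(data):
    """Yield (offset, hash) for every MIN_LEN-byte window of data."""
    if len(data) < MIN_LEN:
        return
    power = pow(BASE, MIN_LEN - 1, MOD)
    h = 0
    for b in data[:MIN_LEN]:              # hash of the first window
        h = (h * BASE + b) % MOD
    yield 0, h
    for i in range(MIN_LEN, len(data)):   # roll: drop one byte, add one
        h = ((h - data[i - MIN_LEN] * power) * BASE + data[i]) % MOD
        yield i - MIN_LEN + 1, h

def matching_offsets(data, stored_hashes):
    """Offsets whose window hash matches an already-stored sequence;
    a real system would verify the bytes to rule out collisions
    before replacing them with a reference."""
    return [off for off, h in rolling_hashes(data) if h in stored_hashes]
```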
20130226884 | SYSTEM AND METHOD FOR CREATING DEDUPLICATED COPIES OF DATA BY SENDING DIFFERENCE DATA BETWEEN NEAR-NEIGHBOR TEMPORAL STATES - Systems and methods are disclosed for using a first deduplicating store to update a second deduplicating store with information representing how data objects change over time, said method including: at a first and a second deduplicating store, for each data object, maintaining an organized arrangement of temporal structures to represent a corresponding data object over time, wherein each structure is associated with a temporal state of the data object and wherein the logical arrangement of structures is indicative of the changing temporal states of the data object; finding a temporal state that is common to and in temporal proximity to the current state of the first and second deduplicating stores; and compiling and sending a set of hash signatures for the content that has changed from the common state to the current temporal state of the first deduplicating store. | 08-29-2013 |
20130232124 | DEDUPLICATING A FILE SYSTEM - A storage node receives a file. The storage node determines whether the file is stored on the storage node by comparing a hash value computed for content of the received file to hash values for content stored on the storage node. The storage node transfers a name and address of the file to a directory node. | 09-05-2013 |
20130232125 | STREAM LOCALITY DELTA COMPRESSION - Stream locality delta compression is disclosed. A previous stream indicated locale of data segments is selected. A first data segment is then determined to be similar to a data segment in the stream indicated locale. | 09-05-2013 |
20130232126 | HIGHLY SCALABLE AND DISTRIBUTED DATA DE-DUPLICATION - This disclosure relates to systems and methods for both maintaining referential integrity within a data storage system, and freeing unused storage in the system, without the need to maintain reference counts to the blocks of storage used to represent and store the data. | 09-05-2013 |
20130238568 | ENHANCING DATA RETRIEVAL PERFORMANCE IN DEDUPLICATION SYSTEMS - Various embodiments for processing data in a data deduplication system are provided. For data segments previously deduplicated by the data deduplication system, a supplemental hot-read link is established for those of the data segments determined to be read on at least one of a frequent and recently used basis. Other system and computer program product embodiments are disclosed and provide related advantages. | 09-12-2013 |
20130238569 | REDUNDANT ATTRIBUTE VALUES - Aspects of the present disclosure provide techniques that determine whether an attribute value is associated with each configuration item in a plurality of configuration items. If it is determined that the attribute value is associated with each configuration item in the plurality of configuration items, the attribute value is deemed a redundant attribute value. | 09-12-2013 |
20130238570 | FIXED SIZE EXTENTS FOR VARIABLE SIZE DEDUPLICATION SEGMENTS - Mechanisms are provided for maintaining variable size deduplication segments using fixed size extents. Variable size segments are identified and maintained in a datastore suitcase. Duplicate segments need not be maintained redundantly but can be managed by updating reference counts associated with the segments in the datastore suitcase. Segments are maintained using fixed size extents. A minor increase in storage overhead removes the need for inefficient recompaction when a segment is removed from the datastore suitcase. Fixed size extents can be reallocated for storage of new segments. | 09-12-2013 |
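The extent scheme above can be sketched as below: segments are stored across fixed-size extents with reference counts, and a removed segment's extents simply return to a free list for reallocation, with no recompaction. The tiny extent size is for demonstration, and keying segments by an id stands in for the content hashing a real system would use.

```python
EXTENT = 4  # bytes per extent (tiny, for demonstration)

class DatastoreSuitcase:
    """Reference-counted variable-size segments over fixed-size extents."""
    def __init__(self):
        self.extents = []    # fixed-size slots
        self.free = []       # indices of reusable extents
        self.segments = {}   # seg_id -> (extent indices, refcount)

    def _alloc(self, chunk):
        if self.free:                     # reuse a freed extent
            idx = self.free.pop()
            self.extents[idx] = chunk
            return idx
        self.extents.append(chunk)
        return len(self.extents) - 1

    def add(self, seg_id, data):
        if seg_id in self.segments:       # duplicate: bump refcount only
            idxs, rc = self.segments[seg_id]
            self.segments[seg_id] = (idxs, rc + 1)
            return
        idxs = [self._alloc(data[i:i + EXTENT])
                for i in range(0, len(data), EXTENT)]
        self.segments[seg_id] = (idxs, 1)

    def remove(self, seg_id):
        idxs, rc = self.segments[seg_id]
        if rc > 1:
            self.segments[seg_id] = (idxs, rc - 1)
        else:
            del self.segments[seg_id]
            self.free.extend(idxs)        # no recompaction needed
```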
20130238571 | ENHANCING DATA RETRIEVAL PERFORMANCE IN DEDUPLICATION SYSTEMS - Various embodiments for processing data in a data deduplication system are provided. In one embodiment, a method for processing such data is disclosed. For data segments previously deduplicated by the data deduplication system, a supplemental hot-read link is established for those of the data segments determined to be read on at least one of a frequent and recently used basis. Other system and computer program product embodiments are disclosed and provide related advantages. | 09-12-2013 |
20130238572 | PERFORMING DATA STORAGE OPERATIONS WITH A CLOUD ENVIRONMENT, INCLUDING CONTAINERIZED DEDUPLICATION, DATA PRUNING, AND DATA TRANSFER - Various systems and methods may be used for performing data storage operations, including content-indexing, containerized deduplication, and policy-driven storage, within a cloud environment. The systems support a variety of clients and cloud storage sites that may connect to the system in a cloud environment that requires data transfer over wide area networks, such as the Internet, which may have appreciable latency and/or packet loss, using various network protocols, including HTTP and FTP. Methods for content indexing data stored within a cloud environment may facilitate later searching, including collaborative searching. Methods for performing containerized deduplication may reduce the strain on a system namespace, effectuate cost savings, etc. Methods may identify suitable storage locations, including suitable cloud storage sites, for data files subject to a storage policy. Further, the systems and methods may be used for providing a cloud gateway and a scalable data object store within a cloud environment. | 09-12-2013 |
20130246370 | SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR DETERMINING WHETHER CODE IS UNWANTED BASED ON THE DECOMPILATION THEREOF - A system, method, and computer program product are provided for determining whether code is unwanted based on the decompilation thereof. In use, code is identified and the code is decompiled. In addition, it is determined whether the code is unwanted, based on the decompiled code. | 09-19-2013 |
20130246371 | System and Method for Concept Building - A method is provided in one example embodiment and it includes identifying a root term and determining one or more other terms belonging to a group associated with the root term. The method also includes selecting one or more of the terms from the group and generating a concept based on the selected terms from the group, wherein the concept is applied to a rule that affects data management for one or more documents that satisfy the rule. In more specific embodiments, the root term is identified via a search or via an incident list. In other embodiments, a collection of meaningful terms is provided to assist in determining the other terms for the group, the collection of meaningful terms being generated based on the root term. The concept can be used to automatically mark one or more documents that relate to the concept. | 09-19-2013 |
20130246372 | METHODS AND APPARATUS FOR EFFICIENT COMPRESSION AND DEDUPLICATION - Mechanisms are provided for performing efficient compression and deduplication of data segments. Compression algorithms are learning algorithms that perform better when data segments are large. Deduplication algorithms, however, perform better when data segments are small, as more duplicate small segments are likely to exist. As an optimizer is processing and storing data segments, the optimizer applies the same compression context to compress multiple individual deduplicated data segments as though they are one segment. By compressing deduplicated data segments together within the same context, data reduction can be improved for both deduplication and compression. Mechanisms are applied to compensate for possible performance degradation. | 09-19-2013 |
20130246373 | SYSTEM, METHOD AND COMPUTER PROGRAM PRODUCT FOR STORING FILE SYSTEM CONTENT IN A MULTI-TENANT ON-DEMAND DATABASE SYSTEM - In accordance with embodiments, there are provided mechanisms and methods for storing file system content in a multi-tenant on-demand database system. These mechanisms and methods for storing file system content in a multi-tenant on-demand database system can enable embodiments to reduce a number of files stored on a file system, avoid copying of all file system content to file system copies, etc. | 09-19-2013 |
20130246374 | DATA MANAGEMENT DEVICE, SYSTEM, PROGRAM STORAGE MEDIUM AND METHOD - Currently, the hit rate is not sufficiently high when an application uses a plurality of classes of data whose generation times are close to each other. | 09-19-2013 |
20130254170 | SOCIAL MEDIA IDENTITY DISCOVERY AND MAPPING FOR BANKING AND GOVERNMENT - A server executing a social media identity and discovery application and method are provided that scan social networking sites for communications. The target content is found with content indicators when communications are put on a social networking site. The content is recorded and evaluated. If the identified content is contextually significant, the alias and the user account data and/or user data from public records are correlated based on keywords and/or events, and a notification of the correlation is sent to an agency, agent, or a contact center system. The agent or agency may verify that the identity of a poster has been accurately correlated with a customer record in the database or with user data from public records. The agent, the agency, or the system has the opportunity to respond to the communication, despite the anonymity of the poster on the social networking site. | 09-26-2013 |
20130262404 | Systems, Methods, And Computer Program Products For Scheduling Processing To Achieve Space Savings - A method performed in a system that has a plurality of volumes stored to storage hardware, the method including generating, for each of the volumes, a respective space saving potential iteratively over time and scheduling space saving operations among the plurality of volumes by analyzing each of the volumes for space saving potential and assigning priority of resources based at least in part on space saving potential. | 10-03-2013 |
20130262405 | Virtual Block Devices - Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for virtual block storage. In one aspect, a method includes receiving a request to initialize a virtual machine, the virtual machine having an associated virtual block device; accessing a file map comprising a plurality of file map entries; determining file map entries corresponding to blocks of data allocated to the virtual block device and one or more files in which the blocks of data allocated to the virtual block device are stored; determining that a particular one of the blocks allocated to the virtual block device has been written to a new position not associated with the particular block in the file map; and updating the position associated with the particular block to the new position. | 10-03-2013 |
20130262406 | AUTOMATED SYSTEM AND METHOD OF DATA SCRUBBING - A system and method enabling automated data cleansing and scrubbing at the attribute level is disclosed. A consolidated view may be provided, on a single user interface, of the scrubbed data or narratives that get promoted to a final copy and of the data or narratives received from multiple sources. | 10-03-2013 |
20130268496 | INCREASED IN-LINE DEDUPLICATION EFFICIENCY - Exemplary method, system, and computer program product embodiments for increased in-line deduplication efficiency in a computing environment are provided. In one embodiment, by way of example only hash values are calculated in nth iterations for accumulative data chunks extracted from an object requested for in-line deduplication. For each of the nth iterations, the calculated hash values for the accumulative data chunks are matched in a nth hash index table with a corresponding hash value of existing objects in storage. The nth hash index table is exited upon detecting a mismatch during the matching. The mismatch is determined to be a unique object and is stored. A hash value for the object is calculated. A master hash index table is updated with the calculated hash value for the object and the calculated hash values for the unique object. Additional system and computer program product embodiments are disclosed and provide related advantages. | 10-10-2013 |
20130268497 | INCREASED IN-LINE DEDUPLICATION EFFICIENCY - Exemplary embodiments for increased in-line deduplication efficiency in a computing environment are provided. In one embodiment, by way of example only, hash values are calculated in nth iterations on data samples from fixed size data chunks extracted from an object requested for in-line deduplication. For each of the nth iterations, the calculated hash values for the data samples from the fixed size data chunks are matched in an nth hash index table with a corresponding hash value of existing objects in storage. The nth hash index table is exited upon detecting a mismatch during the matching. The mismatch is determined to be a unique object and is stored. A hash value for the object is calculated. A master hash index table is updated with the calculated hash value for the object and the calculated hash values for the unique object. | 10-10-2013 |
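The two in-line deduplication entries above (20130268496 and 20130268497) both describe calculating hash values in nth iterations and exiting the nth hash index table on the first mismatch, at which point the object is treated as unique. A minimal sketch of that early-exit matching; the prefix-based chunking and all names are illustrative assumptions, not taken from the patents:

```python
import hashlib

def iteration_signatures(data: bytes, chunk_size: int = 4):
    """Hash values calculated in nth iterations over accumulative
    data chunks (here modeled as growing prefixes of the object)."""
    return [hashlib.sha256(data[:end]).hexdigest()
            for end in range(chunk_size, len(data) + 1, chunk_size)]

def matches_existing(data: bytes, indexed_signatures: list) -> bool:
    """Match the per-iteration hashes against an nth hash index table,
    exiting on the first mismatch."""
    for n, sig in enumerate(iteration_signatures(data)):
        if n >= len(indexed_signatures) or indexed_signatures[n] != sig:
            return False  # mismatch: the object is unique and is stored
    return True
```

On a mismatch the abstracts then store the unique object and update a master hash index table with the object's full hash.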
20130268498 | PRIORITIZATION MECHANISM FOR DELETION OF CHUNKS OF DEDUPLICATED DATA OBJECTS - A reference counter corresponding to a base chunk of a plurality of chunks of a deduplicated data object is maintained, where the reference counter is incremented in response to an insertion of any chunk that references the base chunk, and where the reference counter is decremented, in response to a deletion of any chunk that references the base chunk. A queue is defined for processing dereferenced chunks of the plurality of chunks. The dereferenced chunks in the queue are processed in a predefined order, to free storage space. | 10-10-2013 |
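The prioritization mechanism in 20130268498 can be sketched as a reference counter per base chunk plus a queue of dereferenced chunks that is drained to free storage space. The class and method names below are illustrative only:

```python
from collections import deque

class ChunkStore:
    """Reference-counted base chunks with a reclamation queue."""

    def __init__(self):
        self.refcount = {}    # base chunk id -> reference count
        self.queue = deque()  # dereferenced chunks awaiting reclamation

    def insert(self, base_id):
        """A chunk referencing base_id was inserted: increment."""
        self.refcount[base_id] = self.refcount.get(base_id, 0) + 1

    def delete(self, base_id):
        """A chunk referencing base_id was deleted: decrement, and
        queue the base chunk once nothing references it."""
        self.refcount[base_id] -= 1
        if self.refcount[base_id] == 0:
            self.queue.append(base_id)

    def reclaim(self):
        """Process dereferenced chunks in order to free space."""
        freed = []
        while self.queue:
            base = self.queue.popleft()
            if self.refcount.get(base, 0) == 0:
                freed.append(base)
                self.refcount.pop(base, None)
        return freed
```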
20130268499 | INFORMATION MANAGEMENT METHOD, AND COMPUTER FOR PROVIDING INFORMATION - When an online storage service is used to expand the storage capacity of a file server, the amount of communication in synchronization processing and the amount of data retained on the online storage service are reduced to save on service charges. In a kernel module provided with a storage area on the online storage service, files are divided into block files and managed; blocks overlapping with an already registered and saved block file group are not uploaded, and only the configuration information of the files is changed. A mechanism is adopted in which the DBs for managing meta information and duplicate elimination are divided and managed, and only updated sections are appropriately uploaded. | 10-10-2013 |
20130268500 | REPRESENTING DE-DUPLICATED FILE DATA - Providing a subset of de-duplicated data as output is disclosed. In some embodiments, the output comprises a subset of data stored in de-duplicated form in a plurality of containers, each comprising a plurality of data segments. For each container that includes one or more data segments comprising the subset, a corresponding container data is included in the output. Each container may include one or more segments not included in the subset. For each container whose container data is included in the output, a corresponding value is updated in a data structure that indicates, for each container stored on the de-duplicated storage system, whether or not the corresponding container data has been included in the output. | 10-10-2013 |
20130275392 | SOLVING PROBLEMS IN DATA PROCESSING SYSTEMS BASED ON TEXT ANALYSIS OF HISTORICAL DATA - Computer program products and systems, determine solutions to a problem experienced by a data processing system user. A query is received from the user. The query includes a problem description of the problem experienced by the user with respect to the data processing system. One or more keywords are extracted from the received problem description. An index of problems and associated solutions is searched using the one or more extracted keywords. The index of problems and associated solutions is created by analyzing a document collection describing problems and associated solutions with a text analytics application. One or more documents are returned that contains words or phrases that are similar to the keywords used for searching the index of problems and associated solutions. The documents relevant for the problem and associated solutions are presented to the user. | 10-17-2013 |
20130275393 | DATA CLEANING - A computer-implemented method comprising partitioning data representing an input instance of a database including multiple tuples into multiple fragments of tuples, detecting tuples which violate a data quality specification in respective ones of the fragments, selecting a data cleaning asset on the basis of characteristics of errors in detected tuples for a fragment and based on declared asset capabilities, assigning a selected data cleaning asset to the fragment, the selected data cleaning asset to provide a set of candidate corrections for the detected tuples in the fragment, providing data representing an output instance of the database in which detected tuples are replaced with selected candidate corrections. | 10-17-2013 |
20130275394 | INFORMATION PROCESSING SYSTEM - Deduplication is executed in a storage device having low random access performance, such as an optical disk library. When an optical disk that is not inserted into an optical disk drive needs to be accessed in order to execute binary compare, data to be binary compared is stored in a temporary memory area in order to postpone the binary compare, and on the timing when the optical disk is inserted, the postponed binary compare is executed. Before second deduplication, which is deduplication between data to be stored in the storage device and the data in the optical disk library, is executed, first deduplication, which is deduplication between the data to be stored in the storage device and the data in the temporary memory area, is executed. | 10-17-2013 |
20130275395 | Method for Indexed-Field Based Difference Detection and Correction - A method and system for indexed field based difference detection and correction. A data feed file is partitioned into a plurality of subsets based on an indexed field of the data feed file. A redundancy check value is calculated for each of the subsets, and the redundancy check value is compared to a database file which corresponds to each subset. If the redundancy check values do not match for a subset and a corresponding database file, a difference is detected between the subset and the corresponding database file and the corresponding database file is replaced by the subset. | 10-17-2013 |
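The indexed-field scheme in 20130275394's neighbor, 20130275395, partitions a feed on an indexed field, computes a redundancy check value per subset, and replaces only the database files whose values differ. A hypothetical sketch, assuming CRC32 as the redundancy check and a checksum-based bucketing of the indexed field:

```python
import zlib

def partition_by_index(rows, key_col=0, num_parts=4):
    """Partition the data feed into subsets based on an indexed field."""
    parts = {}
    for row in rows:
        p = zlib.crc32(str(row[key_col]).encode()) % num_parts
        parts.setdefault(p, []).append(row)
    return parts

def partition_checksums(rows, key_col=0, num_parts=4):
    """Redundancy check value (CRC32) for each subset."""
    return {p: zlib.crc32(repr(sorted(rs)).encode())
            for p, rs in partition_by_index(rows, key_col, num_parts).items()}

def differing_partitions(feed_crcs, db_crcs):
    """Subsets whose CRC differs from the corresponding database file;
    these are the subsets that replace their database files."""
    return sorted(k for k in set(feed_crcs) | set(db_crcs)
                  if feed_crcs.get(k) != db_crcs.get(k))
```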
20130282669 | PRESERVING REDUNDANCY IN DATA DEDUPLICATION SYSTEMS - Various embodiments for preserving data redundancy in a data deduplication system in a computing environment are provided. An indicator is configured. The indicator is provided with a selected data segment to be written through the data deduplication system to designate that the selected data segment must not be subject to a deduplication operation, such that repetitive data can be written and stored on physical locations despite being identical. | 10-24-2013 |
20130282670 | PRESERVING REDUNDANCY IN DATA DEDUPLICATION SYSTEMS BY DESIGNATION OF VIRTUAL ADDRESS - Various embodiments for preserving data redundancy of identical data in a data deduplication system in a computing environment are provided. A selected range of virtual addresses of a virtual storage device in the computing environment is designated as not subject to a deduplication operation. Other system and computer program product embodiments are disclosed and provide related advantages. | 10-24-2013 |
20130282671 | PRESERVING REDUNDANCY IN DATA DEDUPLICATION SYSTEMS BY DESIGNATION OF VIRTUAL DEVICE - Various embodiments for preserving data redundancy in a data deduplication system in a computing environment are provided. At least one virtual device out of a volume set is designated as not subject to a deduplication operation. | 10-24-2013 |
20130282672 | STORAGE APPARATUS AND STORAGE CONTROL METHOD - The present invention not only reduces the load but also enhances the accuracy of de-duplication in a storage apparatus which performs in-line de-duplication processing and post-process de-duplication processing. A storage apparatus comprises a storage device and a controller. The controller receives multiple files, and by performing in-line de-duplication processing under a prescribed condition, detects from among the multiple files a file which is duplicated with a file received in the past, stores in the temporary storage area a file other than the detected file of the multiple files, and partitions the stored file into multiple chunks, and by performing post-process de-duplication processing, detects from among the multiple chunks a chunk which is duplicated with a chunk received in the past, and stores in the transfer-destination storage area a chunk other than the detected chunk of the multiple chunks. | 10-24-2013 |
20130282673 | PRESERVING REDUNDANCY IN DATA DEDUPLICATION SYSTEMS BY INDICATOR - Various embodiments for preserving data redundancy in a data deduplication system in a computing environment are provided. In one embodiment, a method for such preservation is disclosed. An indicator is configured. The indicator is provided with a selected data segment to be written through the data deduplication system to designate that the selected data segment must not be subject to a deduplication operation, such that repetitive data can be written and stored on physical locations despite being identical. | 10-24-2013 |
20130282674 | PRESERVING REDUNDANCY IN DATA DEDUPLICATION SYSTEMS BY DESIGNATION OF VIRTUAL ADDRESS - Various embodiments for preserving data redundancy of identical data in a data deduplication system in a computing environment are provided. In one embodiment, a method for such preservation is disclosed. A selected range of virtual addresses of a virtual storage device in the computing environment is designated as not subject to a deduplication operation. Other system and computer program product embodiments are disclosed and provide related advantages. | 10-24-2013 |
20130282675 | PRESERVING REDUNDANCY IN DATA DEDUPLICATION SYSTEMS BY DESIGNATION OF VIRTUAL DEVICE - Various embodiments for preserving data redundancy in a data deduplication system in a computing environment are provided. In one embodiment, a method for such preservation is disclosed in a multi-device file system. At least one virtual device out of a volume set is designated as not subject to a deduplication operation. | 10-24-2013 |
20130282676 | GARBAGE COLLECTION-DRIVEN BLOCK THINNING - An apparatus comprises one or more computer-readable storage media and program instructions stored on the one or more computer-readable storage media for facilitating garbage collection-driven volume thinning. The program instructions, when executed by a processing system, direct the processing system to at least generate deduplication data referenced to a plurality of files when deduplicating the plurality of files. The program instructions further direct the processing system to discover when the deduplication data has become unreferenced with respect to the plurality of files. Responsive to when the deduplication data has become unreferenced with respect to the plurality of files, the program instructions direct the processing system to initiate a thinning process with respect to a portion of a shared storage volume associated with the de-duplication data. The processing system is operatively coupled with the one or more computer-readable storage media and configured to execute the program instructions. | 10-24-2013 |
20130290275 | Object Synthesis - Apparatus, methods, and other embodiments associated with object synthesis are described. One example apparatus includes logic for identifying a block in a data de-duplication repository and for identifying a reference to the block. The apparatus also includes logic for representing a source object using a first named, organized collection of references to blocks in the data de-duplication repository and logic for representing a target object using a second named, organized collection of references. The apparatus is configured to synthesize the target object from the source object. Since synthesis may be complicated by edge cases, the apparatus is configured to account for conditions including a block in the target object needing less than all the data in a source object block, data to be used to synthesize the target object residing in a sparse hole in a data stream, and the target object needing data not present in the source object. | 10-31-2013 |
20130290276 | ENHANCING PERFORMANCE-COST RATIO OF A PRIMARY STORAGE ADAPTIVE DATA REDUCTION SYSTEM - Data reduction in a storage system comprises determining attributes of data for storage in the storage system and determining expected data reduction effectiveness for the data based on said attributes. Said effectiveness indicates the benefit that data reduction is expected to provide for the data based on said attributes. The data reduction further comprises applying data reduction to the data based on the expected data reduction effectiveness and performance impact, to improve resource usage efficiency. | 10-31-2013 |
20130290277 | DEDUPLICATING STORAGE WITH ENHANCED FREQUENT-BLOCK DETECTION - Detecting data duplication comprises maintaining a fingerprint directory including one or more entries, each entry including a data fingerprint and a data location for a data chunk. Each entry is associated with a seen-count attribute which is an indication of how often the fingerprint has been seen in arriving data chunks. Higher-frequency entries in the directory are retained, while also taking into account recency of data accesses. A data duplication detector detects that the data fingerprint for a new chunk is the same as the data fingerprint contained in an entry in the fingerprint directory. | 10-31-2013 |
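The frequent-block detection in 20130290277 retains higher-frequency fingerprint entries while also accounting for recency. One hypothetical realization, using an ordered map so that eviction prefers the lowest seen-count and, among ties, the least recently used entry (capacity, names, and eviction policy details are assumptions):

```python
from collections import OrderedDict

class FingerprintDirectory:
    """Fingerprint -> (location, seen-count), favoring hot entries."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # insertion order doubles as recency

    def lookup(self, fp):
        """On a hit, bump the seen-count and mark the entry recent."""
        if fp in self.entries:
            loc, count = self.entries.pop(fp)
            self.entries[fp] = (loc, count + 1)
            return loc
        return None

    def insert(self, fp, location):
        if fp in self.entries:
            return
        if len(self.entries) >= self.capacity:
            # Evict the lowest seen-count; min() scans in recency order,
            # so ties fall on the least recently used entry.
            victim = min(self.entries, key=lambda k: self.entries[k][1])
            del self.entries[victim]
        self.entries[fp] = (location, 1)
```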
20130290278 | SCALABLE DEDUPLICATION SYSTEM WITH SMALL BLOCKS - Exemplary method, system, and computer program product embodiments for scalable data deduplication working with small data chunks in a computing environment are provided. In one embodiment, by way of example only, for each small data chunk, a signature is generated based on a combination of a representation of the characters that appear in the small data chunk with a representation of their frequencies. The signature is used to help in selecting the data to be deduplicated. Additional system and computer program product embodiments are disclosed and provide related advantages. | 10-31-2013 |
20130290279 | SCALABLE DEDUPLICATION SYSTEM WITH SMALL BLOCKS - Exemplary method, system, and computer program product embodiments for scalable data deduplication working with small data chunks in a computing environment are provided. In one embodiment, by way of example only, for each small data chunk, a signature is generated based on a combination of a representation of the characters that appear in the small data chunk with a representation of their frequencies. The signature is used to help in selecting the data to be deduplicated. Additional system and computer program product embodiments are disclosed and provide related advantages. | 10-31-2013 |
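Entries 20130290278 and 20130290279 build a small-chunk signature from which characters appear combined with how often they appear. One hypothetical way to realize such a signature (the bit layout and weighting are illustrative assumptions):

```python
from collections import Counter

def small_chunk_signature(chunk: bytes) -> int:
    """Combine a presence bitmap of the characters that appear in the
    chunk with a frequency-weighted component."""
    freq = Counter(chunk)
    presence = 0
    for byte in freq:
        presence |= 1 << (byte % 64)  # which characters appear
    # how often they appear, folded into 16 bits
    freq_part = sum(c * (b + 1) for b, c in freq.items()) & 0xFFFF
    return (presence << 16) | freq_part
```

Because the signature depends only on character identities and counts, permutations of the same bytes collide by design, while chunks with different frequency profiles generally do not.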
20130290280 | DE-DUPLICATION SYSTEMS AND METHODS FOR APPLICATION-SPECIFIC DATA - Content-aware systems and methods for improving de-duplication, or single instancing, in storage operations. In certain examples, backup agents on client devices parse application-specific data to identify data objects that are candidates for de-duplication. The backup agents can then insert markers or other indictors in the data that identify the location(s) of the particular data objects. Such markers can, in turn, assist a de-duplication manager to perform object-based de-duplication and increase the likelihood that like blocks within the data are identified and single instanced. In other examples, the agents can further determine if a data object of one file type can or should be single-instanced with a data object of a different file type. Such processing of data on the client side can provide for more efficient storage and back-end processing. | 10-31-2013 |
20130297569 | ENHANCING DATA PROCESSING PERFORMANCE BY CACHE MANAGEMENT OF FINGERPRINT INDEX - Various embodiments for improving hash index key lookup caching performance in a computing environment are provided. In one embodiment, for a cached fingerprint map having a plurality of entries corresponding to a plurality of data fingerprints, reference count information is used to determine a length of time to retain the plurality of entries in cache. Those of the plurality of entries having a higher reference counts are retained longer than those having lower reference counts. | 11-07-2013 |
20130297570 | METHOD AND APPARATUS FOR DELETING DUPLICATE DATA - The present invention provides a method and an apparatus for deleting duplicate data. The method includes: receiving a modified data block for a user file stored in the data storage system; querying whether the modified data block is found in the system data block file; and, if the modified data block is not found in the system data block file, adding the modified data block to the system data block file and updating an index relationship of the user file with the system data block file to include an index pointing to the modified data block added to the system data block file. With the method and apparatus for deleting duplicate data provided by embodiments of the present invention, duplicate data after modification is deleted, which improves the performance of modifying other data block files and improves the effect of deleting duplicate data. | 11-07-2013 |
20130297571 | System and Method for Application Aware De-Duplication of Data Blocks in a Virtualized Storage Array - A system and method for application aware de-duplication of data blocks in a virtualized storage array is disclosed. In one embodiment, in a method of de-duplication of data, a master list of metadata is created based on a number of occurrences of data blocks within a storage array. A first sublist of metadata is created from the master list of metadata. The first sublist of metadata is provided to a first component of a networked storage system. It is determined whether the data block being written has a corresponding entry in the master list of metadata based on a determination that a data block being written does not have any corresponding entry in the first sublist of metadata. The data block being written is replaced with a pointer based on a determination that the data block being written has a corresponding entry in the master list of metadata. | 11-07-2013 |
20130297572 | FILE AWARE BLOCK LEVEL DEDUPLICATION - A system provides file aware block level deduplication in a system having multiple clients connected to a storage subsystem over a network such as an Internet Protocol (IP) network. The system includes client components and storage subsystem components. Client components include a walker that traverses the namespace looking for files that meet the criteria for optimization, a file system daemon that rehydrates the files, and a filter driver that watches all operations going to the file system. Storage subsystem components include an optimizer resident on the nodes of the storage subsystem. The optimizer can use idle processor cycles to perform optimization. Sub-file compression can be performed at the storage subsystem. | 11-07-2013 |
20130311432 | CONTEXT SENSITIVE REUSABLE INLINE DATA DEDUPLICATION - A computer identifies a relationship among a subset of a set of data blocks, a basis of the relationship forming a context shared by the subset of data blocks. The computer selects a code data structure from a set of code data structures using the context. The context is associated with the code data structure, and the code data structure includes a set of codes. The computer computes, for a first data block in the subset of data blocks, a first code corresponding to a content of the first data block. The computer determines whether the first code matches a stored code in the code data structure. The computer replaces, responsive to the first code matching the stored code, the first data block with a reference to an instance of the first data block. The computer causes the reference to be stored in a target data processing system. | 11-21-2013 |
20130311433 | Stream-based data deduplication in a multi-tenant shared infrastructure using asynchronous data dictionaries - Stream-based data deduplication is provided in a multi-tenant shared infrastructure but without requiring "paired" endpoints having synchronized data dictionaries. In this approach, data objects processed by the dedupe functionality are treated as objects that can be fetched as needed. Because the compressed objects are treated as just objects, a decoding peer does not need to maintain a symmetric library for the origin. Rather, if the peer does not have the chunks in cache that it needs, it follows a conventional content delivery network (CDN) procedure to retrieve them. In this way, if dictionaries between pairs of sending and receiving peers are out-of-sync, the relevant sections are re-synchronized on-demand. The approach does not require that the libraries maintained at a particular pair of sending and receiving peers be the same. Rather, the technique enables a peer, in effect, to "backfill" its dictionary on-the-fly. | 11-21-2013 |
20130311434 | METHOD, APPARATUS AND SYSTEM FOR DATA DEDUPLICATION - Techniques and mechanisms for limiting storage of duplicate data in a storage back-end. In an embodiment, a storage device of the storage back-end receives from a storage front-end a write command specifying a write of data to the storage back-end. In another embodiment, the storage device calculates and provides to the storage front-end a data signature for data which is the subject of the write command. Based on the data signature provided by the storage device, a deduplication engine of the storage front-end determines whether a deduplication operation is to be performed. | 11-21-2013 |
20130318050 | DATA DEDUPLICATION USING SHORT TERM HISTORY - Exemplary system and computer program product embodiments for data deduplication using short term history in a computing environment are provided. In one embodiment, by way of example only, a hash value is calculated on data chunks for a read operation. The calculated hash value is stored in a storage media. The calculated hash value is looked up in the storage media to verify if a current write operation was previously written and/or read. Additional system and computer program product embodiments are disclosed and provide related advantages. | 11-28-2013 |
20130318051 | SHARED DICTIONARY BETWEEN DEVICES - In one embodiment, a system and method for managing a network deduplication dictionary is disclosed. According to the method, the dictionary is divided between available deduplication engines (DDE) in deduplication devices that support shared dictionaries. The fingerprints are distributed to different DDEs based on a hash function. The hash function takes the fingerprint and hashes it and based on the hash result, it selects one of the DDEs. The hash function could select a few bits from the fingerprint and use those bits to select a DDE. | 11-28-2013 |
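The shared-dictionary scheme in 20130318051 routes each fingerprint to a deduplication engine (DDE) by hashing the fingerprint and using a few bits of the result to select an engine. A sketch of that routing; the choice of SHA-1 and of two result bytes is an assumption for illustration:

```python
import hashlib

def select_dde(fingerprint: bytes, num_ddes: int) -> int:
    """Hash the fingerprint and use a few bits of the result to pick
    one of the available deduplication engines."""
    digest = hashlib.sha1(fingerprint).digest()
    return int.from_bytes(digest[:2], "big") % num_ddes
```

Because the mapping is deterministic, every device that supports the shared dictionary routes a given fingerprint to the same DDE, which is what lets the dictionary be divided between engines.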
20130318052 | DATA DEDUPLICATION USING SHORT TERM HISTORY - Exemplary embodiments for data deduplication using short term history in a computing environment are provided. In one embodiment, by way of example only, a hash value is calculated on data chunks for a read operation. The calculated hash value is stored in a storage media. The calculated hash value is looked up in the storage media to verify if a current write operation was previously written and/or read. Additional system and computer program product embodiments are disclosed and provide related advantages. | 11-28-2013 |
20130318053 | SYSTEM AND METHOD FOR CREATING DEDUPLICATED COPIES OF DATA BY TRACKING TEMPORAL RELATIONSHIPS AMONG COPIES USING HIGHER-LEVEL HASH STRUCTURES - Systems and methods are disclosed for forming deduplicated images of a data object that changes over time using difference information between temporal states of the data object. The method includes organizing the content of the data object for a first temporal state as a plurality of content segments and storing the content segments in a data store; creating an organized arrangement of hash structures to represent the data object in its first temporal state; receiving difference information for the data object; forming at least one hash signature for the changed content; and storing the changed content that is unique in the data store as content segments. The method also includes determining, subsequent to receiving the changed content at the deduplicating content store, whether the changed content should be stored by searching for the hash signature for the changed higher-level hash structure in the global cache of the deduplicating content store. | 11-28-2013 |
20130318054 | SYSTEM AND METHOD FOR NEAR AND EXACT DE-DUPLICATION OF DOCUMENTS - A system, method and computer program product for identifying near and exact-duplicate documents in a document collection, including for each document in the collection, reading textual content from the document; filtering the textual content based on user settings; determining N most frequent words from the filtered textual content of the document; performing a quorum search of the N most frequent words in the document with a threshold M; and sorting results from the quorum search based on relevancy. Based on the values of N and M near and exact-duplicate documents are identified in the document collection. | 11-28-2013 |
20130325821 | MERGING ENTRIES IN A DEDUPLICATION INDEX - Provided are a computer program product, system, and method for merging entries in a deduplication index. An index has chunk signatures calculated from chunks of data in the data objects in the storage, wherein each index entry includes at least one of the chunk signatures and a reference to the chunk of data from which the signature was calculated. Entries in the index are selected to merge and a merge operation is performed on the chunk signatures in the selected entries to generate a merged signature. An entry is added to the index including the merged signature and a reference to the chunks in the storage referenced in the merged selected entries. The index of the signatures is used in deduplication operations when adding data objects to the storage. | 12-05-2013 |
20130332427 | COMPARING AND SELECTING DATA CLEANSING SERVICE PROVIDERS - The present invention extends to methods, systems, and computer program products for exploring and selecting data cleansing service providers. Embodiments of the invention permit a user to explore different data cleansing service providers and compare quality results from the different data cleansing service providers. Sample data is mapped to a specified data domain. A list of service providers, for cleansing data for the selected data domain, is provided to a user. The user selects a subset of service providers. The sample data is submitted to the subset of service providers, which return results including allegedly cleansed data. The results are profiled and a comparison of the subset of service providers is presented to the user. The user selects a service provider to use when cleansing further data. | 12-12-2013 |
20130339314 | ELIMINATION OF DUPLICATE OBJECTS IN STORAGE CLUSTERS - Digital objects within a fixed-content storage cluster use a page mapping table and a hash-to-UID table to store a representation of each object. For each object stored within the cluster, a record in the hash-to-UID table stores the object's hash value and its unique identifier (or portions thereof). To detect a duplicate of an object, a portion of its hash value is used as a key into the page mapping table. The page mapping table indicates a node holding a hash-to-UID table indicating currently stored objects in a particular page range. Finding the same hash value but with a different unique identifier in the table indicates that a duplicate of an object exists. Portions of the hash value and unique identifier may be used in the hash-to-UID table. Unneeded duplicate objects are deleted by copying their metadata to a manifest and then redirecting their unique identifiers to point at the manifest. | 12-19-2013 |
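The page-mapping lookup in this abstract can be sketched as follows; the one-byte prefix, four-node layout, and function name `store` are assumptions made for illustration:

```python
import hashlib

NUM_NODES = 4
# page mapping table: one-byte hash prefix -> node holding that page's table
page_map = {prefix: prefix % NUM_NODES for prefix in range(256)}
tables = {node: [] for node in range(NUM_NODES)}  # per-node (hash, UID) records

def store(uid: str, content: bytes) -> bool:
    """Record an object and report whether a duplicate already exists:
    the same hash under a different UID in the responsible node's table."""
    h = hashlib.sha256(content).digest()
    node = page_map[h[0]]               # hash prefix keys the page mapping
    duplicate = any(rh == h and ru != uid for rh, ru in tables[node])
    tables[node].append((h, uid))
    return duplicate

assert store("uid-1", b"payload") is False       # first copy
assert store("uid-2", b"payload") is True        # same content, different UID
assert store("uid-3", b"other payload") is False
```

Routing by hash prefix means all candidate duplicates of an object land in the same node's table, so detection never needs a cluster-wide scan.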
20130339315 | Configurable Data Generator - Embodiments associated with configurable, repeatable, data generation are described. One example method includes manipulating a redundancy parameter that controls data redundancy in binary large objects (BLOBs) to be included in a generated data set. The redundancy parameters may control variations in repeatable variable length sequences included in BLOBs. The example method also includes manipulating a parameter(s) that controls custom designed sequences included in BLOBs. With the redundancy and custom designed sequences described, the example method then generates BLOBs based, at least in part, on the redundancy parameters and the custom-designed sequences. BLOBs may include byte sequences repeated at different frequencies and configurable user-designed sequences. Manipulating the redundancy parameter, manipulating the custom-designed sequences, generating the BLOBs, and providing the BLOBS may be performed by separate processes acting in parallel. | 12-19-2013 |
20130339316 | PACKING DEDUPLICATED DATA INTO FINITE-SIZED CONTAINERS - Deduplicated data is packed into finite-sized containers. A similarity score is calculated between similar files of the deduplicated data. The similarity score is used for grouping the compared files of the deduplicated data into subsets and destaging each of the subsets from a deduplication system to one of the finite-sized containers. | 12-19-2013 |
20130339317 | DATA DEDUPLICATION MANAGEMENT - Technologies are generally described for a data deduplication management scheme for media files uploaded or to be uploaded to a server. In some examples, a method may include identifying, by a server, a creation time of a media file based at least in part on metadata of the media file; identifying, by the server, an uploading time of the media file; calculating, by the server, a difference between the creation time and the uploading time; and performing, by the server, a data deduplication process when the difference is greater than a predetermined value. | 12-19-2013 |
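The creation-time/upload-time comparison is simple enough to show directly; the threshold value and the function name are illustrative assumptions:

```python
from datetime import datetime, timedelta

THRESHOLD = timedelta(days=1)  # assumed "predetermined value"

def should_deduplicate(creation_time, upload_time, threshold=THRESHOLD):
    """A large gap between a media file's creation and its upload suggests
    a re-shared copy, so such files are routed to the dedup process."""
    return (upload_time - creation_time) > threshold

created = datetime(2013, 6, 1, 12, 0)
assert should_deduplicate(created, datetime(2013, 6, 10))        # old file
assert not should_deduplicate(created, datetime(2013, 6, 1, 13)) # fresh capture
```

The heuristic trades some missed duplicates for skipping expensive dedup work on freshly captured (and therefore likely unique) media.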
20130339318 | METHOD AND SYSTEM FOR DELETING OBSOLETE FILES FROM A FILE SYSTEM - A method for deleting obsolete files from a file system is provided. The method includes: receiving a request to delete a reference to a target file in a file system from a file reference data structure, wherein the file reference data structure includes target file names and reference file names; identifying a reference file name in the file reference data structure, wherein the reference file name includes a file name of the target file; deleting a reference file from the file system, wherein the reference file has the identified reference file name; checking whether the file system includes at least one reference file whose file name matches the file name of the target file; if there is no such reference file in the file system: deleting the target file from the file system; and deleting the file name of the target file from the file reference data structure. | 12-19-2013 |
20130339319 | SYSTEM AND METHOD FOR CACHING HASHES FOR CO-LOCATED DATA IN A DEDUPLICATION DATA STORE - Systems and methods are provided for caching hashes for deduplicated data. A request to read data from the deduplication data store is received. A persist header stored in a deduplication data store is identified in a first hash structure that is not stored in memory of the computing device. The persist header comprises a set of hashes that includes a hash that is indicative of the data the computing device requested to read. Each hash in the set of hashes represents data stored in the deduplication data store after the persist header that is co-located with other data represented by the remaining hashes in the set of hashes. The set of hashes is cached in a second hash structure stored in the memory, whereby the computing device can identify the additional data using the second hash structure if the additional data is represented by the persist header. | 12-19-2013 |
20130339320 | STORAGE SYSTEM - The storage system includes a data dividing means for dividing writing target data into a plurality of units of partial data, and generating units of new divided file data; an index file generation means for generating, for each of the units of partial data, an index entry, and generating index file data by adding test data for error detection; a data writing means for writing the divided file data and the index file data; and a recovery means for detecting an error in the index entries written in the storage device, based on the test data included in each of the index entries. The recovery means deletes an index entry in which an error is detected and all of the subsequent index entries in the index file data stored in the storage device, from the index file data. | 12-19-2013 |
20130346376 | De-Duplicating Immutable Data at Runtime - De-duplication of immutable data items at runtime may include identifying a set of potentially duplicate immutable data items in use by one or more applications. The applications may access the immutable data items through pointers of respective objects corresponding to the immutable data items. A de-duplication component executing distinctly from the applications may analyze the identified set of potentially duplicate immutable data items to determine two or more that have identical content and may then modify one or more pointers of the corresponding objects so that at least two of the pointers point to a single immutable data item. | 12-26-2013 |
20130346377 | SYSTEM AND METHOD FOR ALIGNING DATA FRAMES IN TIME - In a method and apparatus for merging data acquired by two or more capture devices from two or more points in a computer system, duplicate frames are analyzed to determine the time difference between the timestamps of a first capture device and a second capture device. The disclosure compares the frames for duplicates. If the duplicate frames are the first set of duplicate frames discovered, then all previous timestamps and all subsequent timestamps from the second capture device are adjusted by the calculated time difference. If duplicate frames are again discovered, the time difference is recalculated and all subsequent frames from the second capture device are adjusted by the recalculated time difference. After all the frames have been analyzed and the timestamps adjusted, the frames are merged together and put into chronological order to simulate a single capture of data encompassing all of the points where the data was collected. | 12-26-2013 |
20140006362 | Low-Overhead Enhancement of Reliability of Journaled File System Using Solid State Storage and De-Duplication | 01-02-2014 |
20140006363 | OPTIMIZED DATA PLACEMENT FOR INDIVIDUAL FILE ACCESSES ON DEDUPLICATION-ENABLED SEQUENTIAL STORAGE SYSTEMS | 01-02-2014 |
20140012822 | SUB-BLOCK PARTITIONING FOR HASH-BASED DEDUPLICATION - Sub-block partitioning for hash-based deduplication is performed by defining a minimal size and maximum size of the sub-block. For each boundary start position of the sub-block, starting a search, after the minimal size of the sub-block, for a boundary position of a subsequent sub-block by using multiple search criteria to test hash values that are calculated during the search. If one of the multiple search criteria is satisfied by one of the hash values, declaring the position of the hash value as a boundary end position of the sub-block. If the maximum size of the sub-block is reached prior to satisfying one of the multiple search criteria, declaring a position of an alternative one of the hash values that is selected based upon another one of the multiple search criteria as the boundary end position of the sub-block. | 01-09-2014 |
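The min/max-bounded boundary search with primary and alternative criteria is the classic content-defined-chunking pattern. The following sketch uses a toy byte-wise hash and a low-bits-zero test; the hash function, mask, and fallback rule are assumptions, not the patent's specific criteria:

```python
def partition(data: bytes, min_size=64, max_size=256, mask=0x3F):
    """Content-defined sub-block partitioning sketch. After min_size bytes
    of a sub-block, a byte-wise hash is tested against a primary criterion
    (low bits zero); if max_size is reached first, the position of the
    highest hash value seen serves as the alternative boundary."""
    boundaries, start = [], 0
    while start < len(data):
        end = min(start + max_size, len(data))
        h, cut, best_h, best_pos = 0, None, -1, end
        for i in range(start + min_size, end):
            h = (h * 31 + data[i]) & 0xFFFFFFFF  # toy rolling-style hash
            if h & mask == 0:                    # primary criterion met
                cut = i + 1
                break
            if h > best_h:                       # track alternative candidate
                best_h, best_pos = h, i + 1
        if cut is None:
            # use the alternative boundary only when max_size was reached
            cut = best_pos if end - start == max_size else end
        boundaries.append(cut)
        start = cut
    return boundaries

data = bytes(range(256)) * 8
bnds = partition(data)
assert bnds[-1] == len(data)
sizes = [b - a for a, b in zip([0] + bnds, bnds)]
assert all(s <= 256 for s in sizes)        # max size respected
assert all(s > 64 for s in sizes[:-1])     # min size respected (except tail)
```

Guaranteeing a boundary within `[min_size, max_size]` keeps chunk sizes predictable even on data where the primary criterion rarely fires.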
20140012823 | GENERATION OF REALISTIC FILE CONTENT CHANGES FOR DEDUPLICATION TESTING - Method, system, and computer program product embodiments for facilitating deduplication product testing in a computing environment are provided. In one such embodiment, data to be processed through the deduplication product testing is arranged into a single, continuous stream. At least one of a plurality of random modifications are applied to the arranged data in a self-similar pattern exhibiting scale invariance. A plurality of randomly sized subsets of the arranged data modified with the self-similar pattern is mapped into each of a plurality of randomly sized deduplication test files. | 01-09-2014 |
20140019425 | FILE SERVER AND FILE MANAGEMENT METHOD - The file server identifies, as a file group, two or more files that each include duplicated data, from among a plurality of files that have been stored in the logical storage device, based on the file system information. The file server deletes from the logical storage device all copies of the duplicated data other than shared data, which is one copy of the duplicated data included in the two or more files. The file server makes each file of the file group that is not the shared file refer to the shared file, that is, the file configured from the shared data. The file server creates a group link that associates the files that belong to the file group with each other. | 01-16-2014 |
20140025648 | Method of Optimizing Data Flow Between a Software Application and a Database Server - A method may include receiving a request for a resource on a database server, the request being from a request initiator coupled to a network. Redundant data in the request is identified based on data optimization rules, where the redundant data is unnecessary for the database server to satisfy the request for the resource. The redundant data is removed from the request based on the data optimization rules to create an optimized request. The optimized request is provided, using the network, to the database server. | 01-23-2014 |
20140032507 | DE-DUPLICATION USING A PARTIAL DIGEST TABLE - Data de-duplication is done on a data set. The data de-duplication is done using a partial digest table. Some digests are selectively removed from the partial digest table when a pre-determined condition occurs. | 01-30-2014 |
20140032508 | ACCELERATED DEDUPLICATION - Mechanisms are provided for accelerated data deduplication. A data stream is received at an input interface and maintained in memory. Chunk boundaries are detected and chunk fingerprints are calculated using a deduplication accelerator while a processor maintains a state machine. A deduplication dictionary is accessed using a chunk fingerprint to determine if the associated data chunk has previously been written to persistent memory. If the data chunk has previously been written, reference counts may be updated but the data chunk need not be stored again. Otherwise, datastore suitcases, filemaps, and the deduplication dictionary may be updated to reflect storage of the data chunk. Direct memory access (DMA) addresses are provided to directly transfer a chunk to an output interface as needed. | 01-30-2014 |
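The dictionary lookup and reference-count update on the write path can be sketched as below. The dictionary layout and function name are assumptions; the real system also tracks datastore suitcases and filemaps, which this sketch omits:

```python
import hashlib

dictionary = {}   # fingerprint -> {"data": chunk bytes, "refs": count}

def write_chunk(chunk: bytes) -> str:
    """Look the chunk fingerprint up in the dedup dictionary: bump the
    reference count for a known chunk, store the chunk only when new."""
    fp = hashlib.sha256(chunk).hexdigest()
    entry = dictionary.get(fp)
    if entry is not None:
        entry["refs"] += 1                       # duplicate: no new storage
    else:
        dictionary[fp] = {"data": chunk, "refs": 1}
    return fp

a = write_chunk(b"chunk-1")
write_chunk(b"chunk-1")   # duplicate write
write_chunk(b"chunk-2")
assert dictionary[a]["refs"] == 2
assert len(dictionary) == 2   # only two unique chunks stored
```

In the patent's design the fingerprinting itself is offloaded to a hardware accelerator; only the state machine above runs on the processor.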
20140046911 | DE-DUPLICATING ATTACHMENTS ON MESSAGE DELIVERY AND AUTOMATED REPAIR OF ATTACHMENTS - Systems and techniques of de-duplicating files and/or blobs within a file system are presented. In one embodiment, an email system is disclosed wherein the email system receives email messages comprising a set of associated attachments. The system determines whether the associated attachments have been previously stored in the email system and the state of the stored attachment and, if the state of the attachment is appropriate for sharing copies of the attachment, provides a reference to the attachment upon a request to share the attachment. In another embodiment, the system may detect whether stored attachments are corrupted and, if so, attempt to repair the attachment, possibly prior to sharing references to it. | 02-13-2014 |
20140046912 | METHODS AND SYSTEMS FOR DATA CLEANUP USING PHYSICAL IMAGE OF FILES ON STORAGE DEVICES - Systems and computer program products are provided for optimizing selection of files for deletion from one or more data storage devices to free up a predetermined amount of space in the one or more data storage devices. A method includes analyzing an effective space occupied by each file of a plurality of files in the one or more data storage devices, identifying, from the plurality of files, one or more data blocks making up a file to free up the predetermined amount of space based on the analysis of the effective space of each file of the plurality of files, selecting one or more of the plurality of files as one or more candidate files for deletion, based on the identified one or more data blocks, and deleting the one or more candidate files for deletion from the one or more data storage devices. | 02-13-2014 |
20140046913 | SYSTEMS, METHODS, AND COMPUTER PROGRAM PRODUCTS FOR SECURE MULTI-ENTERPRISE STORAGE - In one embodiment, a computer program product for storing data to a storage network includes a non-transitory computer readable storage medium having computer readable program code embodied therewith, the computer readable program code including computer readable program code configured to receive a plurality of data identifiers (IDs) from multiple data providers, each data ID being associated with one of a plurality of files, computer readable program code configured to store the plurality of data IDs to a database, computer readable program code configured to identify any duplicate data IDs in the database to determine if any of the plurality of files associated with the plurality of data IDs are non-confidential, computer readable program code configured to receive one of the files having a duplicate data ID, and computer readable program code configured to store the file having the duplicate data ID to a storage network. | 02-13-2014 |
20140052698 | Virtual Machine Image Access De-Duplication - A system and an article of manufacture for de-duplicating virtual machine image accesses include identifying one or more identical blocks in two or more images in a virtual machine image repository, generating a block map for mapping different blocks with identical content into a same block, deploying a virtual machine image by reconstituting an image from the block map and fetching any unique blocks remotely on-demand, and de-duplicating virtual machine image accesses by storing the deployed virtual machine image in a local disk cache. | 02-20-2014 |
20140052699 | ESTIMATION OF DATA REDUCTION RATE IN A DATA STORAGE SYSTEM - Systems and methods for estimating data reduction ratio for a data set is provided. The method comprises selecting a plurality of m elements from a data set comprising a plurality of N elements; associating an identifier h | 02-20-2014 |
20140059015 | SELECTING CANDIDATE ROWS FOR DEDUPLICATION - The present invention extends to methods, systems, and computer program products for selecting candidate records for deduplication from a table. A table can be processed to compute an inverse index for each field of the table. A deduplication algorithm can traverse the inverse indices in accordance with a flexible user-defined policy to identify candidate records for deduplication. Both exact matches and approximate matches can be found. | 02-27-2014 |
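The inverse-index traversal can be sketched with a toy table; the field names, tokenization by whitespace, and the "share at least one token" policy are assumptions chosen for illustration:

```python
from collections import defaultdict

rows = [
    {"name": "acme corp", "city": "berlin"},
    {"name": "acme corporation", "city": "berlin"},
    {"name": "globex", "city": "springfield"},
]

# one inverse index per field: token -> ids of rows containing that token
indices = {field: defaultdict(set) for field in ("name", "city")}
for rid, row in enumerate(rows):
    for field, value in row.items():
        for token in value.split():
            indices[field][token].add(rid)

def candidates(policy):
    """Rows sharing at least one token in any field named by the policy
    become candidate pairs for the (more expensive) matching step."""
    pairs = set()
    for field in policy:
        for rids in indices[field].values():
            for a in rids:
                for b in rids:
                    if a < b:
                        pairs.add((a, b))
    return pairs

assert candidates(["name"]) == {(0, 1)}          # share the token "acme"
assert (0, 2) not in candidates(["name", "city"])
```

Only candidate pairs (not all O(n²) row pairs) go on to exact or approximate comparison, which is what makes the approach scale.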
20140059016 | DEDUPLICATION DEVICE AND DEDUPLICATION METHOD - A deduplication device includes: first through N-th (N≧3) bloom filters; a counting unit that judges, for each bloom filter in sequence, whether information indicating that a duplicate of the storing-target data exists in a storage device is registered, continuing until either an unregistered bloom filter (one in which the information is not registered) is found or it is found that the information is registered in the N-th bloom filter, and that registers the information indicating that the duplicate data exists into the unregistered bloom filter when one is found; and a deduplicating unit that stores the storing-target data in the storage device when the counting unit finds an unregistered bloom filter, and stores index information relating the duplicate data in the storage device with the storing-target data when the counting unit finds that the information is registered in the N-th bloom filter. | 02-27-2014 |
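The cascade of N bloom filters acts as an approximate counter: a key "advances" one filter per sighting, and only after it is registered in all N filters is the data treated as a known duplicate. A minimal sketch, assuming a toy bloom filter and N=3 (class and function names are illustrative):

```python
import hashlib

class Bloom:
    """Minimal bloom filter over an integer bitmask."""
    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.bitmask = bits, hashes, 0
    def _positions(self, key: bytes):
        for i in range(self.hashes):
            d = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(d[:4], "big") % self.bits
    def add(self, key: bytes):
        for p in self._positions(key):
            self.bitmask |= 1 << p
    def contains(self, key: bytes) -> bool:
        return all((self.bitmask >> p) & 1 for p in self._positions(key))

N = 3
filters = [Bloom() for _ in range(N)]

def observe(key: bytes) -> str:
    """Walk filters 1..N in order; register the key in the first filter
    where it is absent. If it is already registered in all N filters, the
    data has been seen often enough to store only index information."""
    for f in filters:
        if not f.contains(key):
            f.add(key)
            return "store data"
    return "store index only"

k = b"popular block"
assert observe(k) == "store data"        # first sighting
assert observe(k) == "store data"
assert observe(k) == "store data"
assert observe(k) == "store index only"  # registered in all N filters
```

Rarely repeated data never reaches the N-th filter, so the device avoids dedup bookkeeping for data unlikely to ever be deduplicated.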
20140059017 | DATA RELATIONSHIPS STORAGE PLATFORM - A data relationships storage platform for analysis of one or more data sources is described herein. A data processing system may be communicatively coupled to one or more data sources and one or more big-data databases. One or more collectors may collect data pieces from the one or more data sources. One or more analyzer may analyze the collected data pieces to determine whether one or more relationships exist between the collected data pieces. The analysis results in one or more data globs that include one or more of the data pieces and relationship information, such as tags. The tagged data globs may be communicated to and stored in one or more big-data databases. | 02-27-2014 |
20140059018 | DATA DE-DUPLICATION IN A DISTRIBUTED NETWORK - A computer-implemented method for efficient data storage is provided. A first storage medium associates data stored on one or more data storage media with a unique identification value (ID) for the purpose of determining de-duplication status of the data. In response to receiving a request to read the data from a logical address, the first storage medium retrieves the data from a second storage medium based on the unique ID. In response to receiving a request to write the data to a logical address, the one or more data storage media store at least one copy of the data based on the de-duplication status of the data. | 02-27-2014 |
20140059019 | METHODS AND SYSTEMS FOR DATA CLEANUP USING PHYSICAL IMAGE OF FILES ON STORAGE DEVICES - Methods, systems, and computer program products are provided for optimizing selection of files for deletion from one or more data storage devices to free up a predetermined amount of space in the one or more data storage devices. A method includes analyzing an effective space occupied by each file of a plurality of files in the one or more data storage devices, identifying, from the plurality of files, one or more data blocks making up a file to free up the predetermined amount of space based on the analysis of the effective space of each file of the plurality of files, selecting one or more of the plurality of files as one or more candidate files for deletion, based on the identified one or more data blocks, and deleting the one or more candidate files for deletion from the one or more data storage devices. | 02-27-2014 |
20140059020 | REDUCED DISK SPACE STANDBY - A method and system for replicating database data is provided. One or more standby database replicas can be used for servicing read-only queries, and the amount of storage required is scalable in the size of the primary database storage. One technique is described for combining physical database replication to multiple physical databases residing within a common storage system that performs de-duplication. Having multiple physical databases allows for many read-only queries to be processed, and the de-duplicating storage system provides scalability in the size of the primary database storage. Another technique uses one or more diskless standby database systems that share a read-only copy of physical standby database files. Notification messages provide consistency between each diskless system's in-memory cache and the state of the shared database files. Use of a transaction sequence number ensures that each database system only accesses versions of data blocks that are consistent with a transaction checkpoint. | 02-27-2014 |
20140067774 | SOCIAL NETWORK RECOMMENDATIONS THROUGH DUPLICATE FILE DETECTION - The present disclosure relates generally to the field of social network recommendations through duplicate file detection. In various examples, social network recommendations through duplicate file detection may be implemented in the form of systems, methods and/or algorithms. | 03-06-2014 |
20140067775 | SYSTEM, METHOD AND COMPUTER PROGRAM PRODUCT FOR CONDITIONALLY PERFORMING DE-DUPING ON DATA - In accordance with embodiments, there are provided mechanisms and methods for conditionally performing de-duping on data. These mechanisms and methods for conditionally performing de-duping on data can enable increased resource efficiency, optimized data analysis, faster report generation, etc. | 03-06-2014 |
20140067776 | Method and System For Operating System File De-Duplication - When one considers all of the servers at an organization, the exact same operating system and application files will appear on many of them. Thus, there is an opportunity for saving an enormous amount of disk space for the organization as a whole by de-duplicating stored files. The present invention addresses the above needs by providing a method and system for saving at least one copy of a duplicate file in a location on a common storage system accessible to all relevant server computers and then removing the duplicates from the storage allocated to each server. Whenever the operating system on a server whose duplicate file has been removed requires access to the file, then the method redirects the operating system to access the file from the common storage system file location. | 03-06-2014 |
20140074801 | DATA DE-DUPLICATION SYSTEM - A data de-duplication system is provided that supports the loading and integration of data from multiple data sources. The data de-duplication system identifies and merges duplicate dimension data records that describe the same entity by creating a single dimension data record that is identified as a single best record (“SBR”). The data de-duplication system further adjusts foreign keys that reference the duplicate dimension data records so that the foreign keys correctly reference the merged dimension data record (i.e., the SBR). | 03-13-2014 |
20140074802 | SECURE DELETION OPERATIONS IN A WIDE AREA NETWORK - Methods, systems, and computer program products are provided for performing a secure delete operation in a wide area network (WAN) including a cache site and a home site. A method includes identifying a file for deletion at the cache site, determining whether the file has a copy stored at the home site, detecting a location of the copy at the home site prior to a disconnection event of the cache site from the home site, deleting the file from the cache site during the disconnection event, and performing a secure deletion of the copy at the home site immediately after a reconnection event of the cache site to the home site. | 03-13-2014 |
20140074803 | LOG MESSAGE OPTIMIZATION TO IGNORE OR IDENTIFY REDUNDANT LOG MESSAGES - A method of presenting log messages during execution of a computer program. The method can include identifying at least a second log message set comprising information that is the same as information contained in a first log message set. The method can include determining to present the second log message set in a manner that indicates that the second log message set is redundant, and presenting such list of log messages accordingly, or determining not to present the second log message set in the list of log messages, and presenting the list of log messages accordingly. | 03-13-2014 |
20140074804 | METHOD FOR MAINTAINING MULTIPLE FINGERPRINT TABLES IN A DEDUPLICATING STORAGE SYSTEM - A system and method for managing multiple fingerprint tables in a deduplicating storage system. A computer system includes a data storage medium, a first fingerprint table comprising a first plurality of entries, and a second fingerprint table comprising a second plurality of entries. Each of the first plurality of entries and each of the second plurality of entries are configured to store fingerprint related data corresponding to data stored in the data storage medium. A data storage controller is configured to select the first fingerprint table for storage of entries corresponding to data stored in the data storage medium that has been deemed more likely to be successfully deduplicated than other data stored in the data storage medium; and select the second fingerprint table for storage of entries corresponding to data stored in the data storage medium that has been deemed less likely to be successfully deduplicated than other data stored in the data storage medium. | 03-13-2014 |
20140081925 | Managing Incident Reports - The present disclosure describes methods, systems, and computer program products for managing incident reports can include receiving alert messages from multiple tenants and aggregating the alert messages into a reduced, correlated incident reports. For example, the method includes receiving, from a number of tenants, alert reports that represent at least one system alert incident associated with the tenants. The alert reports can be collected and analyzed for duplicate reports. The analysis for duplicate reports can include identifying a number of duplicate alert reports and correlating each identified duplicate alert reports into a correlated incident report. The correlated incident report can be aggregated into a summarized incident report for processing. | 03-20-2014 |
20140081926 | IMAGE DUPLICATION PREVENTION APPARATUS AND IMAGE DUPLICATION PREVENTION METHOD - An image duplication prevention apparatus comprising: image duplication prevention means for, when a determining means of the image duplication prevention apparatus determines that a subset of metadata of an image to be transferred from a first location to a second location is identical to a corresponding subset of metadata of any image already stored at the second location, preventing the transfer of the image from the first location to the second location. The image duplication prevention means is further for, when the determining means determines that the subset of metadata of the image to be transferred is not identical to the corresponding subset of metadata of the any image already stored at the second location, allowing the transfer of the image from the first location to the second location. | 03-20-2014 |
20140081927 | DATA NODE FENCING IN A DISTRIBUTED FILE SYSTEM - Systems and methods for data node fencing in a distributed file system to prevent data inconsistencies and corruptions are disclosed. An embodiment includes implementing a protocol whereby data nodes detect a failover and determine an active name node based on transaction identifiers associated with transaction requests. The data nodes also provide to the active name node block location information and an acknowledgment. The embodiment further includes a protocol whereby a name node refrains from issuing invalidation requests to the data nodes until the name node receives acknowledgments from all data nodes that are functional. | 03-20-2014 |
20140081928 | SKILL EXTRACTION SYSTEM - In an example, disclosed is a machine automated method of identifying a set of skills. In some examples, the method includes extracting a plurality of skill seed phrases from a plurality of member profiles of a social networking site, creating a plurality of disambiguated skill seed phrases by disambiguating the plurality of skill seed phrases using one or more computer processors, and de-duplicating the plurality of disambiguated skill seed phrases to create a plurality of de-duplicated skill seed phrases. | 03-20-2014 |
20140089272 | METHOD AND APPARATUS FOR TAGGED DELETION OF USER ONLINE HISTORY - An approach is provided for deleting a user's online data across different services and platforms based on contextual selection criteria. The deletion manager determines at least one request to delete data associated with at least one user, the request specifying at least in part one or more contextual parameters. The deletion manager determines one or more data records associated with the at least one user from one or more services, one or more applications, or a combination thereof. The deletion manager causes, at least in part, a deletion of the one or more data records based, at least in part, on whether the data at least substantially meet the one or more contextual parameters. | 03-27-2014 |
20140089273 | LARGE SCALE FILE STORAGE IN CLOUD COMPUTING - Storing and retrieving files based on hashes for the files. One method for storing files includes: identifying a file; identifying a hash calculated based on the file; renaming the file based on the hash based on the file; and storing the file in a particular location based on the hash calculated based on the file. Another method for retrieving files includes: identifying a hash for a given file; using the hash, traversing a hierarchical file structure to find a location where the given file should be stored; determining that the file is at the location; and as a result, retrieving the file. | 03-27-2014 |
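Renaming and placing a file by its hash can be shown concretely; the two-level directory fan-out and SHA-256 choice here are assumptions, not the patent's specific layout:

```python
import hashlib

def hash_path(content: bytes, depth: int = 2, width: int = 2) -> str:
    """Derive a storage path from the file's hash: successive hex pairs
    name nested directories, then the full hash names the file itself."""
    h = hashlib.sha256(content).hexdigest()
    dirs = [h[i * width:(i + 1) * width] for i in range(depth)]
    return "/".join(dirs + [h])

p = hash_path(b"report.pdf contents")
assert p.count("/") == 2                       # two directory levels
assert p == hash_path(b"report.pdf contents")  # same content, same location
# Retrieval traverses the same hierarchy from the hash alone, and two
# uploads of identical content collapse onto a single stored path.
```

Spreading files across hash-named directories keeps any single directory small even at cloud scale, while making duplicate storage structurally impossible.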
20140089274 | AGENT COMMUNICATION BULLETIN BOARD - A data communication system comprising a first plurality of software entities, each having a respective entity identifier and a respective plurality of characteristics, and a data repository, wherein a first software entity of the first plurality of software entities instigates establishment of a first collection of data at the data repository, the first collection of data having at least one collection identifier selected from the plurality of characteristics of the first software entity, each of a second plurality of the first plurality of software entities having a respective set of the respective plurality of characteristics that matches the at least one collection identifier instigates addition of the entity identifier of the respective software entity to the first collection of data, at least one of the second plurality of software entities instigates addition of data to the first collection of data, and at least one other of the second plurality of software entities obtains a portion of the data from the first collection of data. | 03-27-2014 |
20140089275 | EFFICIENT FILE RECLAMATION IN DEDUPLICATING VIRTUAL MEDIA - Expired files in the deduplicating virtual media are selectively erased using a backup application for notifying a backup repository of which expired files are no longer required. The space of the expired files is reclaimed for reuse. Virtual space of the expired files is reserved for allowing the backup application to seek past the reclaimed space to subsequent data in the deduplicating virtual media. | 03-27-2014 |
20140095455 | HEAT INDICES FOR FILE SYSTEMS AND BLOCK STORAGE - Techniques and mechanisms are provided to allow for selective optimization, including deduplication and/or compression, of portions of files and data blocks. Data access is monitored to generate a heat index for identifying sections of files and volumes that are frequently and infrequently accessed. These frequently used portions may be left non-optimized to reduce or eliminate optimization I/O overhead. Infrequently accessed portions can be more aggressively optimized. | 04-03-2014 |
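The heat-index idea above (count accesses, leave hot regions raw, optimize cold ones) can be sketched as follows. The threshold, region granularity, and the two-way raw/optimize split are illustrative assumptions, not the patent's parameters.

```python
from collections import Counter

def build_plan(accesses: list[str], threshold: int = 3) -> dict[str, str]:
    """Count accesses per region to form a heat index.

    Regions at or above the threshold are left non-optimized to avoid
    optimization I/O overhead; cold regions are marked for aggressive
    deduplication/compression.
    """
    heat = Counter(accesses)  # the heat index: region -> access count
    return {region: "raw" if heat[region] >= threshold else "optimize"
            for region in set(accesses)}
```

In a real system the heat index would decay over time so that formerly hot regions eventually become candidates for optimization.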
20140101113 | Locality Aware, Two-Level Fingerprint Caching - The present disclosure provides for implementing a two-level fingerprint caching scheme for a client cache and a server cache. The client cache hit ratio can be improved by pre-populating the client cache with fingerprints that are relevant to the client. Relevant fingerprints include fingerprints used during a recent time period (e.g., fingerprints of segments that are included in the last full backup image and any following incremental backup images created for the client after the last full backup image), and thus are referred to as fingerprints with good temporal locality. Relevant fingerprints also include fingerprints associated with a storage container that has good spatial locality, and thus are referred to as fingerprints with good spatial locality. A pre-set threshold established for the client cache (e.g., threshold Tc) is used to determine whether a storage container (and thus fingerprints associated with the storage container) has good spatial locality. | 04-10-2014 |
20140101114 | METHOD AND SYSTEM FOR PROCESSING DATA - Methods, computer systems, and computer program products for processing data in a computing environment are provided. The computing environment for data deduplication storage receives a plurality of write operations for deduplication storage of the data. The data is buffered in a plurality of buffers, with overflow temporarily stored to a memory hierarchy, when the data received for deduplication storage is sequential or non-sequential. The data is accumulated and updated in the plurality of buffers per a data structure, the data structure serving as a fragment map between the plurality of buffers and a plurality of user file locations. The data is restructured in the plurality of buffers to form a complete sequence of a required sequence size. The data is provided as at least one stream to a stream-based deduplication algorithm for processing and storage. | 04-10-2014 |
20140101115 | SEGMENT GROUP-BASED SEGMENT CLEANING APPARATUS AND METHODS FOR STORAGE UNITS - Victim segments to be returned to a free area in a segment cleaning process from a plurality of segments included in each segment group are selected by using a method corresponding to the segment group. A host comprises an interface relaying data exchange with a storage device; and a file system module performing a segment cleaning process by selecting victim segments from a plurality of segments stored in the storage device, discovering live blocks in each of the victim segments, writing back the discovered live blocks to the storage device through the interface, and returning the victim segments to a free area. The file system module calculates victim points for all segments included in a first segment group using a first victim point calculation formula, calculates victim points for all segments included in a second segment group using a second victim point calculation formula, and selects the victim segments based on the victim points. | 04-10-2014 |
20140108359 | SCALABLE DATA PROCESSING FRAMEWORK FOR DYNAMIC DATA CLEANSING - Methods and systems for reconstructing data are disclosed. One method includes receiving a selection of one or more input data streams at a data processing framework, and receiving a definition of one or more analytics components at the data processing framework. The method further includes applying a dynamic principal component analysis to the one or more input data streams, and detecting a fault in the one or more input data streams based at least in part on a prediction error and a variation in principal component subspace generated based on the dynamic principal component analysis. The method also includes reconstructing data at the fault within the one or more input data streams based on data collected prior to occurrence of the fault. | 04-17-2014 |
20140114932 | SELECTIVE DEDUPLICATION - Methods and apparatuses for performing selective deduplication in a storage system are introduced here. Techniques are provided for determining a probability of deduplication for a data object based on a characteristic of the data object and performing a deduplication operation on the data object in the storage system prior to the data object being stored in persistent storage of the storage system if the probability of deduplication for the data object has a specified relationship to a specified threshold. | 04-24-2014 |
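The selective-deduplication decision above (estimate a dedup probability from an object characteristic, dedupe inline only when it meets a threshold) reduces to a small predicate. The characteristic-to-probability table below is entirely hypothetical, used only to make the threshold comparison concrete.

```python
def should_deduplicate(obj: dict, threshold: float = 0.5) -> bool:
    """Decide whether to deduplicate an object before it reaches
    persistent storage, based on a characteristic of the object.

    The prior probabilities keyed by object type are illustrative
    assumptions, not values from the patent.
    """
    priors = {"vm-image": 0.9, "database": 0.7, "media": 0.1}
    probability = priors.get(obj.get("type"), 0.3)  # default prior for unknown types
    return probability >= threshold                 # specified relationship to threshold
```

Objects that fail the test would simply be written through and, optionally, deduplicated later in a background pass.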
20140114933 | MIGRATING DEDUPLICATED DATA - Methods and apparatuses for efficiently migrating deduplicated data are provided. In one example, a data management system includes a data storage volume, a memory including machine executable instructions, and a computer processor. The data storage volume includes data objects and free storage space. The computer processor executes the instructions to perform deduplication of the data objects and determine migration efficiency metrics for groups of the data objects. Determining the migration efficiency metrics includes determining, for each group, a relationship between the free storage space that will result if the group is migrated from the volume and the resources required to migrate the group from the volume. | 04-24-2014 |
20140114934 | MULTI-LEVEL INLINE DATA DEDUPLICATION - Technologies are presented for data deduplication that operates at relatively high throughput and with relatively less storage space than conventional techniques. Building upon content-dependent chunking (CDC) using Rabin fingerprints, data may be fingerprinted and stored in variable-size chunks. In some examples, data may be chunked on multiple levels, for example, two levels, variable size large chunks in the first level and fixed-size sub-chunks in the second level, to ensure that sub-chunks common to two or more data chunks are still deduplicated. For example, at a first level, a CDC algorithm may be employed to fingerprint and chunk data in content-dependent sizes (variable sizes), and at a second level the CDC chunks may be sliced into small fixed-size chunks. The sliced CDC chunks may then be used for deduplication. | 04-24-2014 |
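The two-level chunking described above can be sketched as follows. A simple multiplicative rolling sum stands in for the Rabin fingerprint, and the cut mask, minimum chunk size, and sub-chunk size are illustrative assumptions rather than the patent's parameters.

```python
def cdc_chunks(data: bytes, mask: int = 0x3F, min_size: int = 32) -> list[bytes]:
    """First level: content-defined chunking. A cut point is declared
    wherever the rolling value matches the mask, yielding variable-size
    chunks whose boundaries depend on content, not offset."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * 31 + b) & 0xFFFFFFFF       # stand-in for a Rabin fingerprint
        if i - start >= min_size and (h & mask) == mask:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])          # trailing chunk
    return chunks

def sub_chunks(chunk: bytes, size: int = 64) -> list[bytes]:
    """Second level: slice each variable-size chunk into fixed-size
    sub-chunks, so content shared between large chunks still dedupes."""
    return [chunk[i:i + size] for i in range(0, len(chunk), size)]
```

Deduplication would then be performed on the fixed-size sub-chunks, using the first-level boundaries only to localize comparisons.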
20140122447 | SYSTEM AND METHOD FOR PREVENTING DUPLICATE FILE UPLOADS IN A SYNCHRONIZED CONTENT MANAGEMENT SYSTEM - A method and system for preventing duplicate file uploads in a remote content management system is described. The user device receives a hash value list associated with the files stored in the remote content management system. The user device calculates a hash value associated with new files to be uploaded. The system then compares the hash value(s) associated with the new file(s) to be uploaded with the hash value list received from the remote file storage system. If the hash values of any of the new files to be uploaded match a hash value on the hash value list, then the system prevents the new files from being uploaded to the remote file storage system. | 05-01-2014 |
20140122448 | INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD AND PROGRAM STORAGE MEDIUM - According to one embodiment, an information processing apparatus includes a management controller, a determination module, and a receiver that receives a deletion request. When the deletion request is from a user other than a predetermined user, the management controller is configured to perform control such that, without deleting the target data of the deletion request, the target data is not displayed to the user other than the predetermined user while remaining displayed to the predetermined user. | 05-01-2014 |
20140122449 | Manipulating The Actual or Effective Window Size In A Data-Dependent Variable-Length Sub-Block Parser - Example systems and methods concern a sub-block parser that is configured with a variable sized window whose size varies as a function of the actual or expected entropy of data to be parsed by the sub-block parser. Example systems and methods also concern a sub-block parser configured to compress a data sequence to be parsed before parsing the data sequence. One example method facilitates either actually changing the window size or effectively changing the window size by manipulating the data before it is parsed. The example method includes selectively reconfiguring a data set to be parsed by a data-dependent parser based, at least in part, on the entropy level of the data set, selectively reconfiguring the data-dependent parser, based, at least in part, on the entropy level of the data set, and parsing the data set. | 05-01-2014 |
20140122450 | Computer-Implemented System And Method For Identifying Duplicate And Near Duplicate Messages - A computer-implemented system and method for identifying duplicate and near duplicate messages is provided. A set of messages is obtained. A body of one such message is compared with the body of each other message. Those messages having matching bodies are identified as exact duplicates. The exact duplicates are removed from the set. The remaining messages are sorted in order of message length and a shorter message is compared with a longer message. A determination is made that the body of the shorter message is included in the body of the longer message and the shorter message is marked as a near duplicate of the longer message. | 05-01-2014 |
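The pipeline above (exact-match bodies first, then sort the remainder by length and mark shorter bodies contained in longer ones as near duplicates) can be sketched directly. Containment via Python's substring operator is a simplification; the patent's comparison may be more elaborate.

```python
def find_duplicates(bodies: list[str]):
    """Return (exact_duplicates, near_duplicates) for a set of message
    bodies: repeats of an earlier body are exact duplicates; a shorter
    body contained in a longer one is a near duplicate."""
    seen, exact, remaining = set(), [], []
    for body in bodies:
        (exact if body in seen else remaining).append(body)
        seen.add(body)
    remaining.sort(key=len)                  # shorter messages compared to longer
    near = set()
    for i, short in enumerate(remaining):
        for long in remaining[i + 1:]:
            if short != long and short in long:
                near.add(short)              # shorter body included in longer body
                break
    return exact, near
```

This is the classic approach for threading email: a reply that quotes an earlier message in full subsumes it, so the quoted original is a near duplicate.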
20140136490 | Methods and Systems For Vectored Data De-Duplication - The present invention is directed toward methods and systems for data de-duplication. More particularly, in various embodiments, the present invention provides systems and methods for data de-duplication that may utilize a vectoring method for data de-duplication wherein a stream of data is divided into “data sets” or blocks. For each block, a code, such as a hash or cyclic redundancy code may be calculated and stored. The first block of the set may be written normally and its address and hash can be stored and noted. Subsequent block hashes may be compared with previously written block hashes. | 05-15-2014 |
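The block-level scheme above (hash each block, write the first occurrence normally, compare subsequent block hashes with stored ones) is sketched below, with SHA-256 standing in for the hash or cyclic redundancy code the abstract mentions.

```python
import hashlib

def dedupe_stream(stream: bytes, block_size: int = 4096):
    """Divide a stream into fixed-size blocks and deduplicate by hash.

    The first block with a given hash is written normally and its hash
    noted; later blocks with the same hash are stored as references."""
    blocks: dict[str, bytes] = {}   # hash -> block contents (written once)
    layout: list[str] = []          # ordered hashes reconstruct the stream
    for i in range(0, len(stream), block_size):
        block = stream[i:i + block_size]
        h = hashlib.sha256(block).hexdigest()
        if h not in blocks:
            blocks[h] = block       # first occurrence: write normally
        layout.append(h)            # subsequent hashes only reference it
    return blocks, layout

def rehydrate(blocks: dict, layout: list) -> bytes:
    """Rebuild the original stream from the unique blocks and layout."""
    return b"".join(blocks[h] for h in layout)
```

The savings depend on block alignment: an insertion of one byte shifts every later fixed-size block, which is exactly the weakness content-defined chunking (as in entry 20140214934 above) addresses.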
20140136491 | STORAGE SYSTEM, STORAGE SYSTEM CONTROL METHOD, AND STORAGE CONTROL DEVICE - Provided is a storage system including a storage device for storing data, and a controller for controlling data read/write in the storage device. The controller includes a processor for executing a program, and a memory for storing the program that is executed by the processor. The processor executes deduplication processing for converting a duplicate part of data that is stored in the storage device into shared data, and calculates a distributed capacity consumption, which represents a capacity of a storage area that is used by a user in the storage device, by using a size of the data prior to the deduplication processing and a count of pieces of data referring to the shared data that is referred to by this data. | 05-15-2014 |
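One plausible reading of the distributed capacity consumption above is that each file is charged its pre-deduplication size divided by the number of files referring to the same shared data, so shared blocks are billed fractionally. The formula below is that interpretation, not text from the patent.

```python
def distributed_capacity(files: list[dict]) -> float:
    """Charge each user file size/refs, where `size` is the file's
    pre-deduplication size and `refs` is the count of files referring
    to the shared data (1 for unshared data)."""
    return sum(f["size"] / f["refs"] for f in files)
```

Under this accounting, total charges across all users sum to the actual post-deduplication capacity consumed, which is why per-user quotas remain meaningful after deduplication.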
20140143212 | AGGREGATING IDENTIFIERS FOR MEDIA ITEMS - A server device may receive multiple provider identifiers for a media item from multiple client devices. The multiple provider identifiers may each be associated with different media providers and may each be associated with the same media item. The server device may aggregate the multiple provider identifiers into entries in a data store. The server device may also analyze the entries in the data store and may request missing provider identifiers, merge entries that have duplicate information, and may indicate whether a media item is playable. | 05-22-2014 |
20140143213 | DEDUPLICATION IN A STORAGE SYSTEM - An IO handler receives a write command including write data that is associated with an LBA. The IO handler reserves a deduplication ID according to the LBA with which the write data is associated; within the scope of each LBA, each deduplication ID is unique. The IO handler computes a hash value for the write data. In case a deduplication database does not include an entry which is associated with the hash value, the IO handler: provides a reference key which is a combination of the LBA and the deduplication ID; adds to the deduplication database an entry which is uniquely associated with the hash value and references the reference key; and adds to a virtual address database an entry, including: the reference key; a reference indicator indicating if there is an entry that is associated with the present entry; and a pointer to where the write data is stored. | 05-22-2014 |
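The two databases described above can be sketched as a small class: a deduplication database mapping hash values to reference keys, and a virtual address database keyed by `LBA:ID` reference keys. The in-memory dicts and string-formatted keys are simplifications for illustration.

```python
import hashlib

class DedupStore:
    """Minimal sketch of the abstract's scheme: per-LBA deduplication
    IDs, a hash -> reference-key deduplication database, and a virtual
    address database holding the reference indicator and data pointer."""
    def __init__(self):
        self.dedup_db = {}    # hash value -> reference key "LBA:ID"
        self.vaddr_db = {}    # reference key -> {referenced, data}
        self.next_id = {}     # per-LBA deduplication ID counter

    def write(self, lba: int, data: bytes) -> str:
        h = hashlib.sha256(data).hexdigest()
        if h in self.dedup_db:                    # duplicate: reference the master
            key = self.dedup_db[h]
            self.vaddr_db[key]["referenced"] = True
            return key
        dedup_id = self.next_id.get(lba, 0)       # unique within this LBA's scope
        self.next_id[lba] = dedup_id + 1
        key = f"{lba}:{dedup_id}"                 # reference key = LBA + dedup ID
        self.dedup_db[h] = key
        self.vaddr_db[key] = {"referenced": False, "data": data}
        return key
```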
20140149364 | SYSTEM AND METHOD FOR PICK-AND-DROP SAMPLING - A database system includes an input to a database server configured to deliver a data stream formed of a sequence of elements, D={p | 05-29-2014 |
20140149365 | Method and Apparatus for Handling Digital Objects in a Communication Network - Systems and methods for accelerating relational database applications are disclosed whereby the retrieval of objects can be 100,000 times faster than state of the art methods. According to embodiments of the present invention, an application may directly obtain digital objects from an in-memory store rather than querying a possibly remote data source. In some embodiments, several in-memory nodes are deployed simultaneously, for example, in clusters. Changes in underlying data store(s) can be updated to in-memory cache with SQL triggers. Potential queries may be predicted with automatically generated code. Advanced read/write locking mechanisms further improve the performance of data access. | 05-29-2014 |
20140149366 | SIMILARITY ANALYSIS METHOD, APPARATUS, AND SYSTEM - Embodiments of the present invention provide a similarity analysis method, an apparatus, and a system. The method includes: acquiring file fingerprint information of a file to be analyzed; sending an analysis request that carries the file fingerprint information to at least two MDSs; selecting at least one group according to an analysis result returned by each MDS, where the analysis result includes a group number and a similarity of at least one group that has the highest similarity with the file fingerprint information and is found by the MDS; and the MDS locally queries a duplicate data block in the selected group. In this way, each MDS needs to query only a file fingerprint information set of a group that the MDS itself is responsible for, which reduces the amount of data retrieval and waiting time of reading, writing, and locking a database file. | 05-29-2014 |
20140156606 | Method and System for Integrating Data Into a Database - A method for integrating data into a database comprises storing data comprising a plurality of records which each comprise a plurality of attributes; analysing a sample of records from the plurality of records by: identifying duplicate pairs of records in the sample records; analysing each attribute of each record of the duplicate pairs of records to identify a respective attribute condition which is indicative that the pairs of records are duplicates; wherein the method further comprises: comparing each attribute of a record with the respective attribute condition and, if the attribute satisfies the attribute condition, allocating the record to a disjoint group which comprises records with an attribute that satisfies the same respective attribute condition; identifying duplicate pairs of records in the records in each disjoint group; identifying duplicate pairs of records in records that are not allocated to a disjoint group; and consolidating each duplicate pair of records into one consolidated record and storing the consolidated record in an integrated database. | 06-05-2014 |
20140156607 | INDEX FOR DEDUPLICATION - Techniques for deduplication include an index, a receiver module, and an indexer module. The index can store information about data blocks. The receiver module can receive a data block. The indexer module can check whether information about the data block is in the index, and if information about the data block is not found in the index, then it can make a random decision about whether to store information about the data block in the index, and if the random decision is to store information about the data block in the index, then it can store information about the data block in the index. | 06-05-2014 |
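The randomized admission policy above is easy to make concrete: on an index miss, flip a biased coin before storing the block's information. The admission probability and what gets stored per entry are illustrative assumptions.

```python
import hashlib
import random

def maybe_index(index: dict, block: bytes, p: float = 0.25, rng=None) -> bool:
    """Check the index for a block; on a miss, make a random decision
    (probability p) about whether to store information about it.

    Random admission keeps the index small while still catching blocks
    that recur often enough to be seen repeatedly."""
    rng = rng or random.Random()
    h = hashlib.sha256(block).hexdigest()
    if h in index:
        return True                 # information found: block can be deduplicated
    if rng.random() < p:
        index[h] = len(block)       # random decision fell on "store"
    return False
```

A block that appears k times is admitted with probability 1 - (1 - p)^(k-1), so popular blocks almost surely end up indexed even with a small p.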
20140164338 | ORGANIZING INFORMATION DIRECTORIES - Building an information directory can include sending source data to an information extractor, wherein the source data includes first source metadata, extracting second source metadata using the source data, using the information extractor, merging the first source metadata and the second source metadata into third source metadata, and organizing the third source metadata in the information directory. | 06-12-2014 |
20140164339 | REPETITIVE DATA BLOCK DELETING SYSTEM AND METHOD - An analysis device obtains hash lists from databases of a server cluster. The analysis device determines repetitive hash values and repetitive data blocks. The analysis device deletes the repetitive data blocks from servers of the server cluster. | 06-12-2014 |
20140172805 | CONTACT MANAGEMENT - The description relates to contact management. One example can be manifest as a system that can include a display and a processor. The processor can be configured to process instructions to create a graphical user interface on the display. The graphical user interface can include an aggregate view of contact information relating to an entity. The graphical user interface can be configured to indicate a source of individual instances of the contact information. The aggregate view can be configured to distinguish first individual instances of the contact information that are editable from second individual instances of the contact information that are read-only. | 06-19-2014 |
20140181054 | DATA DEDUPLICATION IN A REMOVABLE STORAGE DEVICE - An apparatus and associated methodology contemplate a data storage system having a removable storage device operably transferring user data between the data storage system and another device via execution of a plurality of input/output commands. A commonality factoring module executes computer instructions stored in memory to assign commonality information to the user data. A deduplication module executes computer instructions stored in memory to combine a plurality of files of the user data (user data files) with at least one file of corresponding commonality information (commonality information file), the combined files forming a sequential data stream. | 06-26-2014 |
20140188817 | System and Method for Object Integrity Service - An embodiment for object integrity service in a storage system includes generating a list of objects stored in a storage system, wherein the list of objects may list an unchecked object, and wherein the unchecked object is an object that has not been checked within a set time period, walking through the list of objects to identify the unchecked object, adding a task to a queue to check the unchecked object, and clearing the task from the queue by checking the unchecked object. | 07-03-2014 |
20140188818 | OPTIMIZING A PARTITION IN DATA DEDUPLICATION - For optimizing a partition of a data block into matching and non-matching segments in data deduplication using a processor device in a computing environment, an optimal calculation operation is applied in polynomial time to the matching segments for selecting a globally optimal subset of a set of matching segments according to overhead considerations for minimizing an overall size of a deduplicated file by determining a trade-off between time complexity and space complexity. | 07-03-2014 |
20140188819 | COMPRESSION AND DEDUPLICATION LAYERED DRIVER - A method, apparatus, and system for an interposed file system driver are provided, which provide a logical file system on top of an existing base file system. One such interposed file system driver is a compression and deduplication layered driver (“COLD driver”). File system operations are intercepted from the operating system through the COLD driver, which is provided as an upper-level operating system driver that operates on top of an existing base file system. By processing file data through various modules, the existing base file system can be extended as a logical file system with compression, deduplication, indexing, and other functionality. The COLD driver can be implemented without requiring modifications to existing base file system structures or base file system drivers. Server deployments may thus leverage the additional file system functionality provided by the COLD driver without having to migrate to another file system. | 07-03-2014 |
20140195493 | PACKING DEDUPLICATED DATA IN A SELF-CONTAINED DEDUPLICATED REPOSITORY - Deduplicated data is packed in a self-contained deduplicated repository having unique data blocks with each being referenced by a globally unique identifier (GUID). The self-contained deduplicated repository has information regarding both deduplicated data files and the unique data blocks of each of the deduplicated data files and a master GUID list containing a location of each of the unique data blocks. | 07-10-2014 |
20140195494 | METHOD AND SYSTEM FOR CREATING AND MAINTAINING UNIQUE DATA REPOSITORY - In accordance with the disclosure, there is provided a system and method for creating and maintaining a unique data repository, comprising a matching process based on a set of predefined matching conditions and the performance of an action type corresponding to the outcome of the matching process. The present disclosure provides for real-time data de-duplication and updating of the unique data repository to obtain a unified view of unique and matching records. | 07-10-2014 |
20140195495 | PACKING DEDUPLICATED DATA IN A SELF-CONTAINED DEDUPLICATED REPOSITORY - Deduplicated data is packed in a self-contained deduplicated repository having unique data blocks with each being referenced by a globally unique identifier (GUID). The self-contained deduplicated repository has information regarding both deduplicated data files and the unique data blocks of each of the deduplicated data files and a master GUID list containing a location of each of the unique data blocks. | 07-10-2014 |
20140195496 | USE OF PREDEFINED BLOCK POINTERS TO REDUCE DUPLICATE STORAGE OF CERTAIN DATA IN A STORAGE SUBSYSTEM OF A STORAGE SERVER - A method and system for eliminating the redundant allocation and deallocation of special data on disk, wherein the redundant allocation and deallocation of special data on disk is eliminated by providing an innovative technique for specially allocating special data of a storage system. Specially allocated data is data that is pre-allocated on disk and stored in memory of the storage system. “Special data” may include any pre-decided data, one or more portions of data that exceed a pre-defined sharing threshold, and/or one or more portions of data that have been identified by a user as special. For example, in some embodiments, a zero-filled data block is specially allocated by a storage system. As another example, in some embodiments, a data block whose contents correspond to a particular type of document header is specially allocated. | 07-10-2014 |
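The zero-filled-block example above reduces to a simple rule at allocation time: recognize the special pattern and return a predefined pointer instead of writing a new copy. The pointer naming and dict-as-disk below are hypothetical, purely for illustration.

```python
ZERO_BLOCK = b"\x00" * 4096
ZERO_PTR = "ptr:zero"            # predefined block pointer, allocated once

def allocate(block: bytes, disk: dict) -> str:
    """Return a block pointer for `block`, using the predefined pointer
    for zero-filled blocks so they are never duplicated on disk."""
    if block == ZERO_BLOCK:
        return ZERO_PTR          # special data: no allocation, no write
    ptr = f"ptr:{len(disk)}"     # ordinary data: allocate a new slot
    disk[ptr] = block
    return ptr
```

Freeing is equally cheap: a pointer equal to `ZERO_PTR` is simply dropped, with no deallocation of disk space.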
20140201167 | SYSTEMS AND METHODS FOR FILE SYSTEM MANAGEMENT - A method may include in response to receiving a command to delete data on a storage resource, determining whether a storage unit has an area to delete responsive to the command that is not aligned with boundaries of the storage unit. The method may also include in response to determining that the storage unit has an area to delete responsive to the command that is unaligned with boundaries of the storage unit, determining whether the entire storage unit, other than the area to delete responsive to the command that is unaligned with boundaries of the storage unit, is marked for unmapping. The method may further include in response to determining that the entire storage unit, other than the area to delete responsive to the command that is unaligned with boundaries of the storage unit, is marked for unmapping, unmapping the storage unit from a logical-to-physical map for the storage resource. | 07-17-2014 |
20140201168 | DEDUPLICATION IN AN EXTENT-BASED ARCHITECTURE - A request is received to remove duplicate data. A log data container associated with a storage volume in a storage server is accessed. The log data container includes a plurality of entries. Each entry is identified by an extent identifier in a data structure stored in a volume associated with the storage server. For each entry in the log data container, a determination is made if the entry matches another entry in the log data container. If the entry matches another entry in the log data container, a determination is made of a donor extent and a recipient extent. If an external reference count associated with the recipient extent equals a first predetermined value, block sharing is performed for the donor extent and the recipient extent. A determination is made if the reference count of the donor extent equals a second predetermined value. If the reference count of the donor extent equals the second predetermined value, the donor extent is freed. | 07-17-2014 |
20140201169 | DATA PROCESSING METHOD AND APPARATUS IN CLUSTER SYSTEM - In embodiments of the present invention, when a duplicate data query is performed on a received data stream, a first physical node which corresponds to each first sketch value and is in a cluster system is identified according to a first sketch value representing the data stream, and then the first sketch value representing the data stream is sent to the identified physical node for the duplicate data query, and a procedure of the duplicate data query does not change with an increase of the number of nodes in the cluster system; therefore, a calculation amount of each node does not increase with an increase of the number of nodes in the cluster system. | 07-17-2014 |
20140201170 | HIGH AVAILABILITY DISTRIBUTED DEDUPLICATED STORAGE SYSTEM - A high availability distributed, deduplicated storage system according to certain embodiments is arranged to include multiple deduplication database media agents. The deduplication database media agents store signatures of data blocks stored in secondary storage. In addition, the deduplication database media agents are configured as failover deduplication database media agents in the event that one of the deduplication database media agents becomes unavailable. | 07-17-2014 |
20140201171 | HIGH AVAILABILITY DISTRIBUTED DEDUPLICATED STORAGE SYSTEM - A high availability distributed, deduplicated storage system according to certain embodiments is arranged to include multiple deduplication database media agents. The deduplication database media agents store signatures of data blocks stored in secondary storage. In addition, the deduplication database media agents are configured as failover deduplication database media agents in the event that one of the deduplication database media agents becomes unavailable. | 07-17-2014 |
20140201172 | Using Flow Space Alignment to Distinguish Duplicate Reads - Systems and methods for identifying duplicate reads can receive first and second reads, determine whether the first and second reads have the same start and end positions, determine a binary flow difference, and identify the second read as a duplicate of the first read when the binary flow difference exceeds a threshold. | 07-17-2014 |
20140207743 | Method for Storage Driven De-Duplication of Server Memory - A method for storage driven de-duplication of server memory comprises configuring a storage controller, as part of each IO operation, to generate a unique signature for each data page passing through the controller. The method associates the signature with the data page and stores the associated page and signature. The signature is added to a signature queue for signature match analysis with signatures stored in server memory. Signature analysis is limited to read-only pages to speed up analysis of pages more likely to be duplicates. Once a duplicate page is found, a page table is updated to point to the match page and the duplicate page is added to a free list. | 07-24-2014 |
20140214775 | SCALABLE DATA DEDUPLICATION - A method implemented on a node, the method comprising receiving a key according to a sub-index of the key, wherein the sub-index identifies the node, and wherein the key corresponds to a data segment of a file, determining whether the data segment is stored in a data storage system according to whether the key appears in a hash table. | 07-31-2014 |
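The routing step above (use a sub-index of the key to identify which node's hash table to consult) can be sketched as follows. Taking the key's leading four bytes modulo the node count is one plausible sub-index scheme; the patent does not specify it.

```python
def node_for_key(key: bytes, num_nodes: int) -> int:
    """Derive the owning node from a sub-index of the key (here, its
    leading four bytes interpreted as an integer, modulo node count)."""
    return int.from_bytes(key[:4], "big") % num_nodes

def is_stored(key: bytes, tables: list[set]) -> bool:
    """A data segment is already in the storage system iff its key
    appears in the hash table of the node the sub-index identifies."""
    return key in tables[node_for_key(key, len(tables))]
```

Because the sub-index fully determines the owning node, each node needs to hold only its own shard of the global key space, which is what makes the scheme scale.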
20140214776 | DATA DE-DUPLICATION FOR DISK IMAGE FILES - The invention relates to a data processing system, comprising at least two disk emulators operating in parallel and emulating a disk subsystem each, the disk emulators each using a file in a file system for any data stored on the respective disk, a separate de-duplicator for de-duplicating the data stored in the files, the de-duplicator operating in parallel to the disk emulators, the de-duplicator further using an additional disk emulator emulating an additional disk subsystem by using an additional file in a file system for storing data shared between the other disk subsystems. | 07-31-2014 |
20140214777 | SYSTEM, METHOD, AND COMPUTER READABLE MEDIA FOR IDENTIFYING A USER-INITIATED LOG FILE RECORD IN A LOG FILE - A system, a method, and a computer readable media for identifying a user-initiated log file record in a log file are provided. The log file has a user-initiated log file record and a repeating pattern of log file records automatically generated by a software program. The system allows a user to identify first and second timestamp values corresponding to first and second times which identify a time interval of interest in the log file. The system further analyzes the log file to identify the user-initiated log file record having a timestamp value between the first and second timestamp values. The system further identifies the repeating pattern of log file records in the log file. | 07-31-2014 |
20140214778 | Entity Normalization Via Name Normalization - Systems and methods for normalizing entities via name normalization are disclosed. In some implementations, a computer-implemented method of identifying duplicate objects in a plurality of objects is provided. Each object in the plurality of objects is associated with one or more facts, and each of the one or more facts having a value. The method includes: using a computer processor to perform: associating facts extracted from web documents with a plurality of objects; and for each of the plurality of objects, normalizing the value of a name fact, the name fact being among one or more facts associated with the object; processing the plurality of objects in accordance with the normalized value of the name facts of the plurality of objects. In some implementations, normalizing the value of the name fact is optionally carried out by applying a group of normalization rules to the value of the name fact. | 07-31-2014 |
20140222768 | MULTI-ROW DATABASE DATA LOADING FOR ENTERPRISE WORKFLOW APPLICATION - Embodiments of the invention are directed to a system, method, or computer program product for providing expedited loading/inserting of data by an entity. Specifically, the invention expedites the loading/inserting of large quantities of data to database tables. Initially received data for loading is processed, via multi-row insert, onto in-memory or temporary tables. The data is staged on a temporary table while the appropriate base table is determined. Once determined, data from the temporary table is pointed to the base table. In this way, a massive amount of data loading from the temporary table to a base table may occur. This prevents the logging and locking associated with adding individual data points or rows to a base table independently. Errors are checked and processed accordingly. Once updated, the data on the temporary table is deleted en masse and a checkpoint restart is issued. | 08-07-2014 |
20140222769 | OBJECT DEDUPLICATION AND APPLICATION AWARE SNAPSHOTS - Embodiments deploy delayering techniques, and the relationships between successive versions of a rich-media file become apparent. With this, modified rich-media files suddenly present far smaller storage overhead as compared to traditional application-unaware snapshot and versioning implementations. Optimized file data is stored in suitcases. As a file is versioned, each new version of the file is placed in the same suitcase as the previous version, allowing embodiments to employ correlation techniques to enhance optimization savings. | 08-07-2014 |
20140222770 | DE-DUPLICATION DATA BANK - Facility for transferring data over a network between two network endpoints by transferring hash signatures over the network instead of the actual data. The hash signatures are pre-generated from local static data and stored in a hash database before any data is transferred between source and destination. The hash signatures are created on both sides of the network at the point where data is local, and the hash database consists of hash signatures of blocks of data that are stored locally. The hash signatures are created using different traversal patterns across local data so that the hash database can represent a larger dataset than the actual physical storage of the local data. If no local data is present, then arbitrary data is generated and then remains static. | 08-07-2014 |
20140236905 | METHOD AND SYSTEM FOR SCANNING FILES OF A DEVICE BY USING CLOUD COMPUTING - A method and system for scanning redundant files of a mobile terminal by using cloud computing are provided. The method comprises: scanning, by a client on the mobile terminal, a file system on a local mobile terminal to generate a list of file information; submitting, by the client on the mobile terminal, to a server side the list of file information; comparing, by the server side, the list of file information received by the server side with an associated list of file information in a server side database and returning the comparison result; comparing, by the client, the comparison result returned by the server side with the application list of the mobile terminal; performing, by the client, a cleanup operation based on the comparison result of the client. | 08-21-2014 |
20140236906 | ELIMINATION OF DUPLICATE OBJECTS IN STORAGE CLUSTERS - Digital objects within a fixed-content storage cluster use a page mapping table and a hash-to-UID table to store a representation of each object. For each object stored within the cluster, a record in the hash-to-UID table stores the object's hash value and its unique identifier (or portions thereof). To detect a duplicate of an object, a portion of its hash value is used as a key into the page mapping table. The page mapping table indicates a node holding a hash-to-UID table indicating currently stored objects in a particular page range. Finding the same hash value but with a different unique identifier in the table indicates that a duplicate of an object exists. Portions of the hash value and unique identifier may be used in the hash-to-UID table. Unneeded duplicate objects are deleted by copying their metadata to a manifest and then redirecting unique identifiers to point at the manifest. | 08-21-2014 |
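The entry above describes detecting duplicate objects by pairing each object's hash value with its unique identifier: the same hash under a different identifier signals a duplicate. A minimal Python sketch of that idea follows; the class and function names, the SHA-256 choice, and the hash-prefix page key are all invented for illustration and are not taken from the patent.

```python
import hashlib

def make_record(data: bytes, uid: str, prefix_bits: int = 16):
    """Hash an object and derive a page key from the hash prefix.

    The prefix would index a page mapping table in a real cluster;
    here it is returned only to show the idea."""
    h = hashlib.sha256(data).hexdigest()
    page = h[:prefix_bits // 4]  # hex characters covering the prefix bits
    return h, page, uid

class HashToUIDTable:
    """Toy hash-to-UID table: maps a hash value to the UIDs storing it."""
    def __init__(self):
        self.table = {}

    def add(self, h: str, uid: str) -> None:
        self.table.setdefault(h, set()).add(uid)

    def has_duplicate(self, h: str, uid: str) -> bool:
        # A duplicate exists if the same hash is already recorded
        # under a different unique identifier.
        return any(u != uid for u in self.table.get(h, ()))
```

Two objects with identical content but distinct UIDs would then collide on the hash, exposing the redundant copy for deletion.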
20140236907 | SELECTING CANDIDATE ROWS FOR DEDUPLICATION - The present invention extends to methods, systems, and computer program products for selecting candidate records for deduplication from a table. A table can be processed to compute an inverse index for each field of the table. A deduplication algorithm can traverse the inverse indices in accordance with a flexible user-defined policy to identify candidate records for deduplication. Both exact matches and approximate matches can be found. | 08-21-2014 |
20140244598 | INTEGRITY CHECKING AND SELECTIVE DEDUPLICATION BASED ON NETWORK PARAMETERS - An approach for managing a data package is provided. Network utilization is determined to exceed a threshold. A sender computer determines a hash digest of the data package by using a hash function selected based on central processing unit utilization. If the hash digest is in a sender hash table, then without sending the data package, the sender computer sends the hash digest and an index referring to the hash digest so that a recipient computer can use the index to locate a matching hash digest and the data package in a recipient hash table. If the hash digest is not in the sender hash table, then the sender computer adds the data package and the hash digest to the sender hash table and sends the data package and the hash digest to the recipient computer to check the integrity of the data package based on the hash digest. | 08-28-2014 |
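The sender-side decision described in the entry above (ship only the digest and index when the package is known, otherwise ship the data and record it) can be sketched as below. This is an illustrative reduction, not the patented protocol; the function name, the SHA-256 choice, and the use of a plain dict as the "sender hash table" are assumptions.

```python
import hashlib

def send_package(package: bytes, sender_table: dict):
    """Decide what the sender transmits for one data package.

    sender_table maps digest -> index. Returns a tuple describing the
    simulated transmission: either the digest plus its index, or the
    full package with its digest."""
    digest = hashlib.sha256(package).hexdigest()
    if digest in sender_table:
        # Known package: transmit only the digest and the index the
        # recipient can use to look up the data in its own table.
        return ("digest-only", digest, sender_table[digest])
    # Unknown package: record it, then transmit data plus digest so the
    # recipient can verify integrity and populate its table.
    index = len(sender_table)
    sender_table[digest] = index
    return ("full", digest, package)
```

Repeating a package therefore costs only a digest-sized transmission on the second and later sends.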
20140244599 | DEDUPLICATION STORAGE SYSTEM WITH EFFICIENT REFERENCE UPDATING AND SPACE RECLAMATION - A deduplication storage system and associated methods are described. The deduplication storage system may split data objects into segments and store the segments. A plurality of data segment containers may be maintained. Each of the containers may include two or more of the data segments. Maintaining the containers may include maintaining a respective logical size of each container. In response to detecting that the logical size of a particular container has fallen below a threshold level, the deduplication storage system may perform an operation to reclaim the storage space allocated to one or more of the data segments included in the particular container. | 08-28-2014 |
20140244600 | MANAGING DUPLICATE MEDIA ITEMS - Systems, methods, devices, and computer-readable media for managing duplicate media items. The system first analyzes a first file from a first source, wherein the first file is a duplicate of a second file. Next, the system deduplicates the first file and the second file to yield a deduplicated file. The system then selects metadata associated with at least one of the first file or the second file to be assigned as metadata for the deduplicated file, the metadata being selected based on a priority preference. | 08-28-2014 |
20140244601 | GRANULAR PARTIAL RECALL OF DEDUPLICATED FILES - The subject disclosure is directed towards partially recalling file ranges of deduplicated files based on tracking dirty (write modified) ranges (user writes) in a way that eliminates or minimizes reading and writing already-optimized adjacent data. The granularity of the ranges does not depend on any file-system granularity for tracking ranges. In one aspect, lazy flushing of tracking data that preserves data-integrity and crash-consistency is provided. In one aspect, also described is supporting granular partial recall on an open file while a data deduplication system is optimizing that file. | 08-28-2014 |
20140250086 | WAN Gateway Optimization by Indicia Matching to Pre-cached Data Stream Apparatus, System, and Method of Operation - A network gateway coupled to a backup server on a wide area network which receives and de-duplicates binary objects. The backup server provides selected data segments of binary objects to the gateway to store into a prescient cache (p-cache) store. The network gateway optimizes network traffic by fulfilling a local client request from its local p-cache store instead of requiring further network traffic when it matches indicia of stored data segments stored in its p-cache store with indicia of a first segment of a binary object requested from and received from a remote server. | 09-04-2014 |
20140250087 | Computer-Implemented System And Method For Identifying Relevant Documents For Display - A computer-implemented system and method for identifying relevant documents for display are provided. Themes for a set of documents are generated. The documents are clustered based on the themes. A matrix including an inner product of document frequency occurrences and cluster concept weightings for each theme is generated for the documents. From the matrix, documents most relevant to a particular theme are identified, and the relevant documents are displayed. | 09-04-2014 |
20140250088 | SYSTEMS AND METHODS FOR BYTE-LEVEL OR QUASI BYTE-LEVEL SINGLE INSTANCING - Described in detail herein are systems and methods for deduplicating data using byte-level or quasi byte-level techniques. In some embodiments, a file is divided into multiple blocks. A block includes multiple bytes. Multiple rolling hashes of the file are generated. For each byte in the file, a searchable data structure is accessed to determine if the data structure already includes an entry matching a hash of a minimum sequence length. If so, this indicates that the corresponding bytes are already stored. If one or more bytes in the file are already stored, then the one or more bytes in the file are replaced with a reference to the already stored bytes. The systems and methods described herein may be used for file systems, databases, storing backup data, or any other use case where it may be useful to reduce the amount of data being stored. | 09-04-2014 |
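The byte-level technique above relies on computing a rolling hash at every byte offset of a file. A minimal Rabin-Karp style rolling hash, which yields one hash per window position in a single pass, can be sketched as follows; the window size, base, and modulus are illustrative parameters, not values from the patent.

```python
def rolling_hashes(data: bytes, window: int,
                   base: int = 257, mod: int = (1 << 31) - 1):
    """Return the polynomial rolling hash of every window-sized span.

    Sliding the window by one byte updates the hash in O(1):
    drop the outgoing byte's contribution, shift, add the new byte."""
    if len(data) < window:
        return []
    h = 0
    for b in data[:window]:
        h = (h * base + b) % mod
    hashes = [h]
    top = pow(base, window - 1, mod)  # weight of the outgoing byte
    for i in range(window, len(data)):
        h = ((h - data[i - window] * top) * base + data[i]) % mod
        hashes.append(h)
    return hashes
```

Equal hash values at two offsets flag byte sequences that are candidates for single instancing; a real system would then confirm the match byte-for-byte before replacing one copy with a reference.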
20140250089 | SYSTEM AND METHOD FOR OPTIMIZING DATA REMANENCE OVER HYBRID DISK CLUSTERS USING VARIOUS STORAGE TECHNOLOGIES - A method is implemented in a computer infrastructure having computer executable code tangibly embodied on a computer readable storage medium having programming instructions. The programming instructions are operable to optimize data remanence over hybrid disk clusters using various storage technologies, determine one or more data storage technologies accessible by a file system, and determine secure delete rules for each of the one or more storage technologies accessible by the file system. The secure delete rules include a number of overwrites required for data to be securely deleted from each of the one or more storage technologies. The programming instructions are further operable to provide the secure delete rules to the file system upon a request for deletion of data for each of the one or more storage technologies a specific amount of times germane to secure delete data from the one or more storage technologies. | 09-04-2014 |
20140258244 | STORAGE SYSTEM DEDUPLICATION WITH SERVICE LEVEL AGREEMENTS - Mechanisms are provided for adjusting a configuration of data stored in a storage system. According to various embodiments, a storage module may be configured to store a configuration of data. A processor may be configured to identify an estimated performance level for the storage system based on a configuration of data stored on the storage system. The processor may also be configured to transmit an instruction to adjust the configuration of data on the storage system to meet the service level objective when the estimated performance level fails to meet a service level objective for the storage system. | 09-11-2014 |
20140258245 | EFFICIENT DATA DEDUPLICATION - Efficient data deduplication is described herein. A deduplication bit array partition can be created that corresponds to a number of data items in an expected dataset. The deduplication bit array partition can track whether the data items have been received. When a data item in the expected dataset is received, a bit in the deduplication bit array partition corresponding to the received data item can be accessed to determine, based on the value of the bit, if the received data item has already been received. When the value of the bit indicates that the received data item has not already been received, the value can be changed to indicate that the data item has now been received. When the value of the bit indicates that the received data item has already been received, the data item can be deleted or ignored. | 09-11-2014 |
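The bit-array mechanism above maps each expected data item to one bit and tests that bit on receipt. A compact Python sketch of such a partition follows; the class and method names are invented, and indexing by item position assumes the expected dataset assigns each item a stable index.

```python
class DedupBitArray:
    """One bit per item in the expected dataset.

    check_and_set reports whether the item was already received and
    marks it as received in the same operation."""
    def __init__(self, expected_items: int):
        self.bits = bytearray((expected_items + 7) // 8)

    def check_and_set(self, index: int) -> bool:
        byte, mask = index // 8, 1 << (index % 8)
        seen = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return seen
```

A caller would drop or ignore any item for which `check_and_set` returns `True`, at a cost of a single bit of memory per expected item.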
20140258246 | RECOGNIZING AND COMBINING REDUNDANT MERCHANT DESIGNATIONS IN A TRANSACTION DATABASE - Determining whether two merchant location database entries are describing the same merchant location. A subject merchant location database entry and comparison candidate merchant location database entries include a DBA name field, a street address field, and one or more additional descriptive fields descriptive of one or more predetermined characteristics of the respective merchant location. The subject merchant location database entry is compared to a set populated with candidate merchant location database entries having a predetermined minimum textual similarity with the subject entry on the basis of each entry's DBA name field or street address field. The subject entry is compared with each of the candidate entries on the basis of the one or more additional descriptive fields, and a logistic regression is performed using the results of the comparison in order to calculate a probability that the database entries refer to the same merchant location. | 09-11-2014 |
20140279948 | INDUSTRIAL ASSET EVENT CHRONOLOGY - Among other things, one or more techniques and/or systems are provided for developing a timeline chronicling events pertaining to an industrial asset. Data is received from a plurality of assets, processed (e.g., to reduce duplicative and/or redundant data), and organized chronologically for presentation in a timeline. The data is further grouped and/or prioritized to display some portions of the data more prominently relative to other portions of the data in the timeline (e.g., which may be hidden). Grouping rules and/or prioritization rules for grouping and/or prioritizing the data may be a function of user interaction with the timeline and/or a function of a machine learning algorithm which may be configured to identify patterns in how users interact with the timeline based upon, among other things, a role the user plays relative to the industrial asset and/or an operating state of the industrial asset. | 09-18-2014 |
20140279949 | Method and system for Data De-Duplication in storage devices - A method and system for data de-duplication in storage devices is disclosed. The method scans the content within the storage device. Once it has obtained all the content within the storage device, it checks for duplicate content. The method identifies duplicate content based on two criteria: the parametric level and the metadata level. The method switches to the metadata level when it fails to identify duplicate content at the parametric level. Further, the method obtains input from the user to delete or retain the duplicate content. If the user confirms deletion of the duplicate content, the method deletes it. | 09-18-2014 |
20140279950 | SYSTEM AND METHOD FOR METADATA MODIFICATION - The present invention provides a method for modifying a first storage medium having a plurality of files, the method including providing a first modification tool; operatively coupling the first storage medium to the modification tool, wherein the operatively coupling includes bypassing a first operating system used to access the plurality of files; and dematerializing, using the first modification tool, at least a first file to form one or more dematerialized files. In some embodiments, the present invention provides a modification system for modifying a first storage medium having a plurality of files, the system including a first modification tool that includes an attachment module configured to operatively couple the modification tool to the first storage medium such that a first operating system used to access the plurality of files is bypassed; and a dematerialization module configured to dematerialize at least a first file to form one or more dematerialized files. | 09-18-2014 |
20140279951 | DIGEST RETRIEVAL BASED ON SIMILARITY SEARCH IN DATA DEDUPLICATION - For digest retrieval based on similarity search in deduplication processing in a data deduplication system using a processor device in a computing environment, input data is partitioned into fixed sized data chunks. Similarity elements and digest block boundaries and digest values are calculated for each of the fixed sized data chunks. Matching similarity elements are searched for in a search structure containing the similarity elements for each of the fixed sized data chunks in a repository of data. Positions of similar data are located in the repository. The positions of the similar data are used to locate and load into the memory stored digest values and corresponding stored digest block boundaries of the similar data in the repository. The digest values and the corresponding digest block boundaries of the input data are matched with the stored digest values and the corresponding stored digest block boundaries to find data matches. | 09-18-2014 |
20140279952 | EFFICIENT CALCULATION OF SIMILARITY SEARCH VALUES AND DIGEST BLOCK BOUNDARIES FOR DATA DEDUPLICATION - For efficient calculation of both similarity search values and boundaries of digest blocks in data deduplication, input data is partitioned into chunks, and for each chunk a set of rolling hash values is calculated. A single linear scan of the rolling hash values is used to produce both similarity search values and boundaries of the digest blocks of the chunk. | 09-18-2014 |
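The key claim in the entry above is that one linear scan over a chunk's rolling hash values can yield both digest-block boundaries and similarity search values. A sketch of that single pass is shown below; the boundary condition (hash low bits equal to zero) and the choice of the k maximal hashes as similarity values are common techniques assumed for illustration, not details taken from the patent.

```python
import heapq

def scan_chunk(rolling, boundary_mask=0xFFF, top_k=4):
    """One pass over precomputed rolling hash values.

    Emits digest-block boundaries where the hash's low bits are zero,
    and maintains the top_k maximal hashes as similarity values."""
    boundaries, top = [], []
    for pos, h in enumerate(rolling):
        if (h & boundary_mask) == 0:
            boundaries.append(pos)        # content-defined cut point
        if len(top) < top_k:
            heapq.heappush(top, h)        # fill the top-k heap
        elif h > top[0]:
            heapq.heapreplace(top, h)     # keep only the k largest
    return boundaries, sorted(top, reverse=True)
```

Because both outputs derive from the same hash stream, the chunk's bytes are touched once, which is the efficiency the abstract emphasizes.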
20140279953 | REDUCING DIGEST STORAGE CONSUMPTION IN A DATA DEDUPLICATION SYSTEM - For reducing digests storage consumption in a data deduplication system using a processor device in a computing environment, digest values are calculated for input data. The digest values are used to locate matches with data stored in a repository. The digest values are stored in the repository. The digest values of the data stored in the repository that is determined to be redundant with the input data are removed. | 09-18-2014 |
20140279954 | REDUCING DIGEST STORAGE CONSUMPTION BY TRACKING SIMILARITY ELEMENTS IN A DATA DEDUPLICATION SYSTEM - For reducing digests storage consumption in a data deduplication system using a processor device in a computing environment, input data is partitioned into chunks, and the chunks are grouped into chunk sets. Digests are calculated for input data and stored in sets corresponding to the chunk sets. Similarity elements are calculated for the input data and the similarity elements are stored in a similarity search structure. The number of similarity elements associated with a chunk set which are currently contained in the similarity search structure is maintained for each chunk set, and when this number of a specific chunk set becomes lower than a threshold, the digests set associated with that chunk set are removed from the repository. | 09-18-2014 |
20140279955 | OBJECT STORE MANAGEMENT OPERATIONS WITHIN COMPUTE-CENTRIC OBJECT STORES - Object store management operations within compute-centric object stores are provided herein. An exemplary method may include transforming an object storage dump into an object store table by a table generator container, wherein the object storage dump includes at least objects within an object store that are marked for deletion, transmitting records for objects from the object store table to reducer containers, such that each reducer container receives object records for at least one object, the object records comprising all object records for the at least one object, generating a set of cleanup tasks by the reducer containers, and executing the cleanup tasks by cleanup agents. | 09-18-2014 |
20140279956 | SYSTEMS AND METHODS OF LOCATING REDUNDANT DATA USING PATTERNS OF MATCHING FINGERPRINTS - A system configured to compute match potential between first data and second data is provided. The system includes data storage storing the first data and the second data, and at least one processor coupled to the data storage. The at least one processor is configured to identify a first sequence of fingerprints characterizing a first plurality of sections of the first data, the first sequence being ordered according to an order of the first plurality of sections within the first data; identify a second sequence of fingerprints comprising fingerprints that match fingerprints within the first sequence, the second sequence of fingerprints characterizing a second plurality of sections of the second data, the second sequence being ordered according to an order of the second plurality of sections within the second data; quantify a similarity between the first sequence and the second sequence; and adjust the match potential based on the similarity. | 09-18-2014 |
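The entry above scores redundancy by comparing the *order* of matching fingerprints, not just their presence. As an illustrative stand-in for the patented scoring, the sketch below uses Python's standard-library `difflib.SequenceMatcher`, whose ratio rewards order-preserving matches between two fingerprint sequences; the function name and this particular similarity measure are assumptions.

```python
from difflib import SequenceMatcher

def match_potential(fp_a, fp_b) -> float:
    """Quantify similarity between two ordered fingerprint sequences.

    Returns a value in [0.0, 1.0]: 1.0 when the sequences of section
    fingerprints match in full and in order, 0.0 when nothing aligns."""
    return SequenceMatcher(None, fp_a, fp_b).ratio()
```

Two regions whose sections match in the same order score higher than regions sharing the same fingerprints scattered in different orders, which is the adjustment of "match potential" the abstract describes.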
20140279957 | TABULAR DATA MANIPULATION SYSTEM AND METHOD - A system and method that implements a tabular graph editor are disclosed. The system supports employing tables to browse and edit comparisons by multiple attributes of nodes in a graph. | 09-18-2014 |
20140279958 | REPRESENTING DE-DUPLICATED FILE DATA - Providing a subset of de-duplicated data as output is disclosed. In some embodiments, the output comprises a subset of data stored in de-duplicated form in a plurality of containers, each comprising a plurality of data segments. For each container that includes one or more data segments comprising the subset, a corresponding container data is included in the output. Each container may include one or more segments not included in the subset. For each container whose container data is included in the output, a corresponding value is updated in a data structure that stores, for each container on the de-duplicated storage system, a value indicating whether or not that container's data has been included in the output. | 09-18-2014 |
20140297601 | SYSTEM AND METHOD FOR DELETION COMPACTOR FOR LARGE STATIC DATA IN NOSQL DATABASE - System and method to compact a NoSQL database, the method including: receiving, by a receiver coupled to a processor, an indication of a record to delete in the NoSQL database; for each file in the NoSQL database, perform the steps of: if said file does not contain the record to delete, placing said file in a first memory; if said file contains the record to delete: placing said file in a second memory; searching whether the record to delete from said file in the second memory matches a record in one or more files in the first memory; and if the searched files in the first memory contain the record to delete from said file in the second memory, compacting said file in the second memory with the files in the first memory that contain the record to delete. | 10-02-2014 |
20140297602 | MULTIPLE USER PROFILE CLEANER - A cleaning application that can clean, for one or more user profiles, at least one of one or more files of a computer or a registry of the computer is provided. The cleaning application can include a cleaning module. The cleaning module can select a plurality of user profiles of the computer. The cleaning module can further select at least one of a file location or a user profile hive for each user profile of the plurality of user profiles. The cleaning module can further clean at least one of one or more files stored within the file location or a registry stored within the user profile hive for each user profile of the plurality of user profiles. | 10-02-2014 |
20140297603 | METHOD AND APPARATUS FOR DEDUPLICATION OF REPLICATED FILE - A replicated file deduplication apparatus generates a hash key of a requested data block, determines whether the same data block as the requested data block exists in data blocks of a replicated image file that is derived from the same golden image file as the requested data block using the hash key of the requested data block, and records, if the same data block as the requested data block exists, information of a chunk in which the same data block as the requested data block is stored at a layout of the requested data block. | 10-02-2014 |
20140297604 | TECHNIQUES FOR RECONCILING METADATA AND DATA IN A CLOUD STORAGE SYSTEM WITHOUT SERVICE INTERRUPTION - A system and methods for reconciling data and metadata in a cloud storage system while the cloud storage system is fully operational are provided. The method comprises scanning for broken references in a metadata database containing metadata of blocks stored in the cloud storage system, wherein the scanning for the broken references is performed as a background process; and synchronously verifying blocks for at least existence of the blocks in the object storage system, wherein the synchronous block verification is performed using a foreground process as blocks are requested. | 10-02-2014 |
20140304238 | METHOD AND APPARATUS FOR DETECTING DUPLICATE MESSAGES - An approach is provided for detecting duplicate messages with multiple probabilistic data structures. A de-duplication platform causes, at least in part, a representing of one or more messages in two or more probabilistic data structures. The de-duplication platform further causes, at least in part, an alternating clearing of the two or more probabilistic data structures as respective probabilistic data structures are filled with the one or more messages to respective thresholds, with the two or more probabilistic data structures facilitating determination of one or more duplicates among the one or more messages. | 10-09-2014 |
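The alternating-clear scheme above can be sketched with two filters that swap roles: when the active one reaches its fill threshold, the older one is cleared and becomes the new active filter, so recently seen messages remain detectable while memory stays bounded. In this illustration plain sets stand in for the probabilistic structures (a real implementation would use Bloom filters); the class name and threshold are invented.

```python
import hashlib

class AlternatingDupFilter:
    """Two message filters cleared alternately as they fill."""
    def __init__(self, threshold: int = 100):
        self.filters = [set(), set()]   # sets stand in for Bloom filters
        self.active = 0
        self.threshold = threshold

    def is_duplicate(self, msg: bytes) -> bool:
        k = hashlib.sha256(msg).hexdigest()
        # A message is a duplicate if either structure has seen it.
        dup = any(k in f for f in self.filters)
        self.filters[self.active].add(k)
        if len(self.filters[self.active]) >= self.threshold:
            # Swap roles and clear the older structure.
            self.active ^= 1
            self.filters[self.active].clear()
        return dup
```

Because the two structures overlap in time, a message seen just before a clear is still caught by the sibling filter, avoiding the blind spot a single periodically cleared filter would have.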
20140304239 | SYSTEMS AND METHODS FOR SCHEDULING DEDUPLICATION OF A STORAGE SYSTEM - Systems for deduplicating one or more storage units of a storage system provide a scheduler, which is operable to select at least one storage unit (e.g. a storage volume) for deduplication and perform a deduplication process, which removes duplicate data blocks from the selected storage volume. The systems are operable to determine the state of one or more storage units and manage deduplication requests based in part on state information. The system is further operable to manage user generated requests and manage deduplication requests based in part on user input information. The system may include a rules engine which prioritizes system operations, including determining an order in which to perform state-gathering operations and an order in which to perform deduplication. The system is further operable to determine the order in which storage units are processed. | 10-09-2014 |
20140304240 | Pruning of Blob Replicas - A method allocates object replicas in a distributed storage system. The method identifies a plurality of objects in the distributed storage system. Each object has an associated storage policy that specifies a target number of object replicas stored at distinct instances of the distributed storage system. The method identifies an object of the plurality of objects whose number of object replicas exceeds the target number of object replicas specified by the storage policy associated with the object. The method selects a first replica of the object for removal based on last access times for replicas of the object, and transmits a request to a first instance of the distributed storage system that stores the first replica. The request instructs the first instance to remove the first replica of the object. | 10-09-2014 |
20140304241 | SYSTEM AND METHOD FOR ACCELERATING ANCHOR POINT DETECTION - A sampling based technique for eliminating duplicate data (de-duplication) stored on storage resources is provided. According to the invention, when a new data set, e.g., a backup data stream, is received by a server, e.g., a storage system or virtual tape library (VTL) system implementing the invention, one or more anchors are identified within the new data set. The anchors are identified using novel anchor detection circuitry in accordance with an illustrative embodiment of the present invention. Upon receipt of the new data set by, for example, a network adapter of a VTL system, the data set is transferred using direct memory access (DMA) operations to a memory associated with an anchor detection hardware card that is operatively interconnected with the storage system. The anchor detection hardware card may be implemented as, for example, an FPGA, to quickly identify anchors within the data set. As the anchor detection process is performed using a hardware assist, the load on a main processor of the system is reduced, thereby enabling line-speed de-duplication. | 10-09-2014 |
20140304242 | STORAGE SYSTEM FOR ELIMINATING DUPLICATED DATA - A storage system | 10-09-2014 |
20140310250 | STORAGE-NETWORK DE-DUPLICATION - Techniques are provided for de-duplication of data. In one embodiment, a system comprises de-duplication logic that is coupled to a de-duplication repository. The de-duplication logic is operable to receive, from a client device over a network, a request to store a file in the de-duplicated repository using a single storage encoding. The request includes a file identifier and a set of signatures that identify a set of chunks from the file. The de-duplication logic determines whether any chunks in the set are missing from the de-duplicated repository and requests the missing chunks from the client device. Then, for each missing chunk, the de-duplication logic stores in the de-duplicated repository that chunk and a signature representing that chunk. The de-duplication logic also stores, in the de-duplicated repository, a file entry that represents the file and that associates the set of signatures with the file identifier. | 10-16-2014 |
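The entry above describes a signature-exchange protocol: the client sends a file identifier plus chunk signatures, the repository reports which chunks it lacks, and only those chunks travel over the network. A minimal Python sketch of both sides follows; the chunk size, SHA-256 signatures, and all class and method names are illustrative assumptions, not the patented design.

```python
import hashlib

def chunk_signatures(data: bytes, chunk_size: int = 4):
    """Split a file into fixed-size chunks and sign each one."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]

class DedupRepository:
    """Server side of the exchange: store each unique chunk once,
    plus a per-file entry associating the signature list with the file."""
    def __init__(self):
        self.chunks = {}   # signature -> chunk bytes
        self.files = {}    # file id -> ordered signature list

    def missing(self, signatures):
        # Tell the client which chunks it must actually transmit.
        return [s for s in signatures if s not in self.chunks]

    def store_file(self, file_id, signatures, supplied_chunks):
        self.chunks.update(supplied_chunks)   # only the missing ones arrive
        self.files[file_id] = list(signatures)
```

A second file sharing chunks with the first then triggers transfer of only its novel chunks, which is the bandwidth and storage saving the abstract claims.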
20140310251 | INTELLIGENT DEDUPLICATION DATA PREFETCHING - Deduplication dictionaries are used to maintain data chunk identifier and location pairings in a deduplication system. When access to a particular data chunk is requested, a deduplication dictionary is accessed to determine the location of the data chunk and a datastore is accessed to retrieve the data chunk. However, deduplication dictionaries are large and typically maintained on disk, so dictionary access is expensive. Techniques and mechanisms of the present invention allow prefetches or read aheads of datastore (DS) headers. For example, if a dictionary hit results in datastore DS(X), then headers for DS (X+1), DS (X+2), DS(X+read-ahead-window) are prefetched ahead of time. These datastore headers are cached in memory, and indexed by datastore identifier. Before going to the dictionary, a lookup is first performed in the cached headers to reduce deduplication data access request latency. | 10-16-2014 |
20140310252 | INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND COMPUTER PROGRAM - An information processing apparatus is provided, in which content and position information generated independently of each other are recorded in a recording medium. The apparatus includes a recording medium in which the content and the position information are recorded and a deletion unit deleting position information temporally associated with a piece of the content from the recording medium when the piece of content is deleted from the recording medium. | 10-16-2014 |
20140317067 | DATA DE-DUPLICATION - Disclosed are computer implemented methods, computer program products, and computer systems for storing a file into a storage system. An embodiment includes, responsive to a determination that a descriptive information describing content of a first file corresponds to a descriptive information describing content of a second file, that a format of the first file is convertible to a format of the second file using a transformation matrix, and that the format of the first file has a higher quality indicator value than the format of the second file, storing the first file into the storage system. | 10-23-2014 |
20140324788 | CLEANER WITH BROWSER MONITORING - A cleaning application that can monitor one or more browser applications that are executed on a computer, and that can, for at least one browser application, clean at least one of one or more files or a registry associated with the at least one browser application is provided. The cleaning application can include a cleaning module. The cleaning module can monitor one or more browser applications that are executed on a computer. The cleaning module can further detect a closing of at least one browser application. The cleaning module can further perform a pre-defined action in response to the closing of the at least one browser application. The pre-defined action can include cleaning at least one of one or more files or a registry associated with the at least one browser application. | 10-30-2014 |
20140324789 | CLEANER WITH COMPUTER MONITORING - A cleaning application that can monitor one or more characteristics of a computer, and that can clean at least one of one or more files or a registry of the computer, is provided. The cleaning application can include a cleaning module. The cleaning module can monitor one or more characteristics of the computer. The cleaning module can further detect an occurrence of pre-defined criteria involving the one or more characteristics. The cleaning module can further perform a pre-defined action in response to the pre-defined criteria. The pre-defined action can include cleaning at least one of one or more files or a registry associated with the computer. | 10-30-2014 |
20140324790 | METHOD AND SYSTEM FOR MULTI-BLOCK OVERLAP-DETECTION IN A PARALLEL ENVIRONMENT WITHOUT INTER-PROCESS COMMUNICATION - Techniques for avoiding duplicate comparisons while comparing customer records to identify linked customer records pertaining to a single customer entity are provided. The techniques include the computer system comparing a first electronic customer record with a second electronic customer record to determine if the first electronic customer record and the second electronic customer record pertain to a single customer entity if the computer system identifies a common blocker key corresponding to a selected blocker from a data field in the first electronic customer record and from a data field in the second electronic customer record and if the computer system does not identify a common blocker key corresponding to an additional lower order blocker from another data field in the first electronic customer record and from a data field in the second electronic customer record. | 10-30-2014 |
20140324791 | SYSTEM AND METHOD FOR EFFICIENTLY DUPLICATING DATA IN A STORAGE SYSTEM, ELIMINATING THE NEED TO READ THE SOURCE DATA OR WRITE THE TARGET DATA - A method for copying data efficiently within a deduplicating storage system eliminates the need to read or write the data per se within the storage system. The copying is accomplished by creating duplicates of the metadata block pointers only. The result is a process that creates an arbitrary number of copies using minimal time and bandwidth. | 10-30-2014 |
20140324792 | EXTRACTING A SOCIAL GRAPH FROM CONTACT INFORMATION ACROSS A CONFINED USER BASE - Embodiments of the present invention relate to extraction of a social graph from contact information across a confined user base. Users are typically subscribed to a service that backs up data from end-user devices to a cloud. The data includes contacts from mobile address books. The service is able to determine relationships of contacts in the cloud to build a social graph or map of these contacts. The social graph can be used to drive individual and group analytics to, for example, increase membership and provide value-added features to its service members. | 10-30-2014 |
20140324793 | Method for Layered Storage of Enterprise Data - A computer-implemented method for layered storage of enterprise data comprises receiving from one or more virtual machines data blocks; de-duplicating the data blocks per hypervisor; storing de-duplicated data blocks in a local cache memory; time-based grouping the data blocks into data containers; dividing each data container in X fixed length mega-blocks; for each data container applying erasure encoding to the X fixed length mega-blocks to thereby generate Y fixed length mega-blocks with redundant data, Y being larger than X; and distributed storing the Y fixed length mega-blocks across multiple backend storage systems. | 10-30-2014 |
20140324794 | Methods for decomposing events from managed infrastructures - Methods are provided for clustering events. Data is received at an extraction engine from managed infrastructure. Events are converted into alerts and the alerts mapped to a matrix M. One or more common steps are determined from the events, and clusters of events are produced relating to the alerts and/or events. | 10-30-2014 |
20140324795 | DATA MANAGEMENT - Methods and systems for data management are disclosed. With embodiments of the present disclosure, data files originating from the same source data can be de-duplicated. One such method comprises calculating one or more of a first characteristic value for first data in a first format, and one or more second characteristic values for one or more data in one or more second formats into which the first data can be converted, said characteristic value uniquely representing an arrangement characteristic of at least part of bits of data in a particular format. The method also includes storing one of the first data and the second data in response to one of the calculated characteristic values being the same as a stored characteristic value corresponding to a second data. | 10-30-2014 |
20140324796 | STATE-BASED DIRECTING OF SEGMENTS IN A MULTINODE DEDUPLICATED STORAGE SYSTEM - A system for directing for storage comprises a processor and a memory. The processor is configured to determine a segment overlap for each of a plurality of nodes. The processor is further configured to determine a selected node of the plurality of nodes based at least in part on the segment overlap for each of the plurality of nodes and based at least in part on a selection criteria. The memory is coupled to the processor and configured to provide the processor with instructions. | 10-30-2014 |
20140324797 | Displaying Social Networking System User Information Via a Historical Newsfeed - The invention provides a display interface in a social networking system that enables the presentation of information related to a user in a timeline or map view. The system accesses information about a user of a social networking system, including both data about the user and social network activities related to the user. The system then selects one or more of these pieces of data and/or activities from a certain time period and gathers them into timeline units based on their relatedness and their relevance to users. These timeline units are ranked by relevance to the user, and are used to generate a timeline or map view for the user containing visual representations of the timeline units organized by location or time. The timeline or map view is then provided to other users of the social networking system that wish to view information about the user. | 10-30-2014 |
20140324798 | SYSTEM AND METHOD FOR SELECTIVE FILE ERASURE USING METADATA MODIFICATIONS - A process that ensures the virtual destruction of data files that a user wishes to erase from a storage medium, such as a hard drive, flash drive, or removable disk. This approach is appropriate for managing custom distributions from large file sets, as it is roughly linear in compute complexity to the number of files erased but is capped when many files are batch erased. | 10-30-2014 |
20140330793 | CAPACITY FORECASTING FOR A DEDUPLICATING STORAGE SYSTEM - A system for managing a storage system comprises a processor and a memory. The processor is configured to receive storage system information from a deduplicating storage system. The processor is further configured to determine a capacity forecast based at least in part on the storage system information. The processor is further configured to provide a compression forecast. The memory is coupled to the processor and configured to provide the processor with instructions. | 11-06-2014 |
20140330794 | SYSTEM AND METHOD FOR CONTENT SCORING - The various implementations of the present invention are provided as a computer-based system for content scoring. Content from a variety of source feeds may be considered for inclusion in an aggregated feed, based on the content of the source feed. The content of the source feed may be “scored” according to a variety of user-configurable options, thereby identifying the most valuable content from the source feeds for inclusion in the aggregated feed. For example, certain content elements may be extracted from a variety of source feeds and then combined to create an aggregated feed, where the aggregated feed contains only the highest-scoring elements from the various source feeds, as determined by the feed creator. | 11-06-2014 |
20140330795 | OPTIMIZING RESTORATION OF DEDUPLICATED DATA - A computer identifies a plurality of data retrieval requests that may be serviced using a plurality of unique data chunks. The computer services the data retrieval requests by utilizing at least one of the unique data chunks. At least one of the unique data chunks is utilized for servicing two or more of the data retrieval requests. The computer determines a servicing sequence for the plurality of data retrieval requests such that the two or more of the data retrieval requests that are serviced utilizing the at least one of the unique data chunks are serviced consecutively. The computer services the plurality of data retrieval requests according to the servicing sequence. | 11-06-2014 |
20140337299 | Method And Apparatus For Content-Aware And Adaptive Deduplication - A method, a system, an apparatus, and a computer readable medium for transmission of data across a network are disclosed. The method includes receiving a data stream, analyzing the received data stream to determine a starting location and an ending location of each zone within the received data stream, based on the starting and ending locations, generating a zone stamp identifying the zone, the zone stamp includes a sequence of contiguous characters representing at least a portion of data in the zone, wherein the order of characters in the zone stamp corresponds to the order of data in the zone, comparing the zone stamp with another zone stamp of another zone in any data stream received, determining whether the zone is substantially similar to another zone by detecting that the zone stamp is substantially similar to another zone stamp, delta-compressing zones within any data stream received that have been determined to have substantially similar zone stamps, thereby deduplicating zones having substantially similar zone stamps within any data stream received, and transmitting the deduplicated zones across the network from one storage location to another storage location. | 11-13-2014 |
20140344229 | SYSTEMS AND METHODS FOR DATA CHUNK DEDUPLICATION - A method includes receiving information about a plurality of data chunks and determining if one or more of a plurality of back-end nodes already stores more than a threshold amount of the plurality of data chunks where one of the plurality of back-end nodes is designated as a sticky node. The method further includes, responsive to determining that none of the plurality of back-end nodes already stores more than a threshold amount of the plurality of data chunks, deduplicating the plurality of data chunks against the back-end node designated as the sticky node. Finally, the method includes, responsive to an amount of data being processed, designating a different back-end node as the sticky node. | 11-20-2014 |
20140351226 | Distributed Feature Collection and Correlation Engine - A distributed feature collection and correlation engine is provided. Feature extraction comprises obtaining one or more data records; extracting information from the one or more data records based on domain knowledge; transforming the extracted information into a key/value pair comprised of a key K and a value V, wherein the key comprises a feature identifier; and storing the key/value pair in a feature store database if the key/value pair does not already exist in the feature store database using a de-duplication mechanism. Features extracted from data records can be queried by obtaining a feature store database comprised of the extracted features stored as a key/value pair comprised of a key K and a value V, wherein the key comprises a feature identifier; receiving a query comprised of at least one query key; retrieving values from the feature store database that match the query key; and returning one or more retrieved key/value pairs. | 11-27-2014 |
20140351227 | Distributed Feature Collection and Correlation Engine - A distributed feature collection and correlation engine is provided. Feature extraction comprises obtaining one or more data records; extracting information from the one or more data records based on domain knowledge; transforming the extracted information into a key/value pair comprised of a key K and a value V, wherein the key comprises a feature identifier; and storing the key/value pair in a feature store database if the key/value pair does not already exist in the feature store database using a de-duplication mechanism. Features extracted from data records can be queried by obtaining a feature store database comprised of the extracted features stored as a key/value pair comprised of a key K and a value V, wherein the key comprises a feature identifier; receiving a query comprised of at least one query key; retrieving values from the feature store database that match the query key; and returning one or more retrieved key/value pairs. | 11-27-2014 |
20140351228 | DIALOG SYSTEM, REDUNDANT MESSAGE REMOVAL METHOD AND REDUNDANT MESSAGE REMOVAL PROGRAM - There are provided an answer evaluation means | 11-27-2014 |
20140358867 | DE-DUPLICATION DEPLOYMENT PLANNING - Assignment of files to a de-duplication domain. Address space of data files is divided into multiple containers. For each of the containers, a file metadata scan is performed to obtain file system metadata, which is aggregated and summarized in a content feature summary. A content similarity prediction measurement is computed between containers from the generated content feature summaries, and files from each container are assigned to a de-duplication domain based upon the content similarity prediction measurement. | 12-04-2014 |
20140358868 | LIFE CYCLE MANAGEMENT OF METADATA - The program code assigns a first record to a first object having a first life cycle and a second record to a second object having a second life cycle, wherein the first object is associated with the second object, and wherein the assigning is based on configurable predefined rules. In response to receiving a request to perform a delete action on at least one of the first object and the second object, the program code performs the delete action when the at least one of the first object and the second object has a life cycle that is in a destroy phase. | 12-04-2014 |
20140358869 | SYSTEM AND METHOD FOR ACCELERATING MAPREDUCE OPERATION - Provided are a system and method for accelerating a mapreduce operation. The system for accelerating a mapreduce operation includes at least one map node configured to perform a map operation in response to a map operation request of a master node, and at least one reduce node configured to perform a reduce operation using result data of the map operation. The map node includes at least one map operation accelerator configured to generate a data stream by merging a plurality of data blocks generated as results of the map operation and establish a transmission channel for transmission of the data stream, and the reduce node includes at least one reduce operation accelerator configured to receive the data stream from the map operation accelerator through the transmission channel, recover the plurality of data blocks from the received data stream, and provide the recovered data blocks for the reduce operation. | 12-04-2014 |
20140358870 | DE-DUPLICATION DEPLOYMENT PLANNING - Assignment of files to a de-duplication domain. Address space of data files is divided into multiple containers. For each of the containers, a file metadata scan is performed to obtain file system metadata, which is aggregated and summarized in a content feature summary. A content similarity prediction measurement is computed between containers from the generated content feature summaries, and files from each container are assigned to a de-duplication domain based upon the content similarity prediction measurement. | 12-04-2014 |
20140358871 | DEDUPLICATION FOR A STORAGE SYSTEM - A method and system for deduplication of data to be stored on a storage system. A deduplication system performs a method that includes the steps of: segmenting a storage object into a plurality of data segments; generating a content similarity key indicative of a content of a data segment as well as associating a physical position on the storage medium for the data segment with the generated content similarity key; storing the association in deduplication index information; and using the stored associations for optimizing the deduplication. | 12-04-2014 |
20140358872 | STORAGE SYSTEM AND METHOD FOR PERFORMING DEDUPLICATION IN CONJUNCTION WITH HOST DEVICE AND STORAGE DEVICE - Provided is a method for performing deduplication in conjunction with a host device and a storage device, and a storage system therefor. The host device includes a brief examination device which is configured to briefly examine whether data to be stored is duplicated or not based on a hash value of the data to be stored, and a data transmission device which is configured to transmit the data to be stored with an examination request or a data storage request to the at least one storage device according to a result of the examination. | 12-04-2014 |
20140358873 | Systems, Methods, and Computer Program Products for Scheduling Processing to Achieve Space Savings - A method performed in a system that has a plurality of volumes stored to storage hardware, the method including generating, for each of the volumes, a respective space saving potential iteratively over time and scheduling space saving operations among the plurality of volumes by analyzing each of the volumes for space saving potential and assigning priority of resources based at least in part on space saving potential. | 12-04-2014 |
20140365448 | TRENDING SUGGESTIONS - Aspects of the subject matter described herein relate to trending suggestions. In aspects, trending data is collected and prepared for sending to one or more target machines. Upon receiving the trending data, a target machine installs the trending data locally and deletes previously installed trending data. After installation, the trending data may be used to suggest text in response to input from a user. If a user selects suggested text, the text may be added to a local dictionary of the target machine. | 12-11-2014 |
20140365449 | INLINE LEARNING-BASED SELECTIVE DEDUPLICATION FOR PRIMARY STORAGE SYSTEMS - A computing device receives a plurality of writes; each write is comprised of chunks of data. The computing device records metrics associated with the deduplication of the chunks of data from the plurality of writes. The computing device generates groups based on associating each group with a portion of a range of the metrics, such that each of the chunks of data are associated with one of the groups, and a similar number of chunks of data are associated with each group. The computing device determines a deduplication affinity for each of the groups based on the chunks of data that are duplicates and at least one metric. The computing device sets a threshold for the deduplication affinity and in response to any of the groups exceeding the threshold, the computing device excluding the chunks of data associated with a group exceeding the threshold, from deduplication. | 12-11-2014 |
20140365450 | SYSTEM AND METHOD FOR MULTI-SCALE NAVIGATION OF DATA - A system configured to generate a macro-fingerprint from at least one predefined set of summaries is provided. The system includes data storage storing a first predefined set of summaries associated with a first region of data, each member of the first predefined set of summaries characterizing data within the first region of data; and at least one processor coupled to the data storage and configured to: read the first predefined set of summaries; select at least one first member from the first predefined set of summaries based on a value of the at least one first member; and store the at least one first member within a first macro-fingerprint. The first region of data may have a first size indicative of a quantity of data included in the first region of data. The macro-fingerprints are created from previously created smaller (micro) fingerprints without having to reread the data. | 12-11-2014 |
20140365451 | METHOD AND SYSTEM FOR CLEANING UP FILES ON A DEVICE - A method and system for cleaning up junk files on a mobile terminal are provided. The method comprises: scanning, by a mobile terminal client, a file system on a local mobile terminal to generate a list of file information; submitting, by the mobile terminal client, to a server side the list of file information; comparing, by the server side, the list of file information submitted by the client with an associated list of file information in a server side database and returning the comparison result; determining a request for cleaning up in the file system on the basis of the comparison result, and performing, by the mobile terminal client, an operation of cleaning up. | 12-11-2014 |
20140372386 | DETECTING WASTEFUL DATA COLLECTION - A method and system comprising a duplication identifier module to analyze data input information to automatically identify duplicate expected inputs associated with a process are shown. The system includes logical process model information defining a logically structured series of process activities and data input information representing a plurality of expected inputs associated with respective process activities, with each expected input being indicative of expected collection of a corresponding data element during execution of the associated process activity. Each duplicate expected input comprises one of the plurality of expected inputs for which there is at least one other expected input with respect to a common corresponding data element. | 12-18-2014 |
20140372387 | METHOD AND MECHANISM FOR REDUCING CLIENT-SIDE MEMORY FOOTPRINT OF TRANSMITTED DATA - The present invention is directed to a method and mechanism for reducing the expense of data transmissions between a client and a server. According to one aspect, data prefetching is utilized to predictively retrieve information between the client and server. Another aspect pertains to data redundancy management for reducing the expense of transmitting and storing redundant data between the client and server. Another aspect relates to moved data structures for tracking and managing data at a client in conjunction with data redundancy management. | 12-18-2014 |
20140379670 | Data Item Deletion in a Database System - Example systems and methods of deleting data stored in a database system are presented. In one example, a plurality of data items is received from an application and stored at the database system. Also received from the application and stored at the database system is deletion timing information for each of the data items. The deletion timing information for a data item may indicate when the data item is to be deleted from the database system. At least one of the data items may be deleted at the database system at a time indicated by its corresponding deletion timing information without assistance from the application. | 12-25-2014 |
20140379671 | DATA SCRUBBING IN CLUSTER-BASED STORAGE SYSTEMS - Disclosed is the technology for data scrubbing in a cluster-based storage system. This technology allows protecting data against failures of storage devices by periodically reading data object replicas and data object hashes stored in a plurality of storage devices and rewriting those data object replicas that have errors. The present disclosure addresses aspects of writing data object replicas and hashes, checking validity of data object replicas, and performing data scrubbing based upon results of the checking. | 12-25-2014 |
20140379672 | SINGLE INSTANTIATION METHOD USING FILE CLONE AND FILE STORAGE SYSTEM UTILIZING THE SAME - The file storage system includes a controller and a volume storing a plurality of files, the volume including a first directory storing a first file and a second file and a second directory storing a third file being created. The controller migrates actual data of the second file to the third file, sets up a management information of the second file so that the third file is referred to when the second file is read, and if the sizes of actual data of the first file and the actual data of the third file are identical and the binaries of the actual data of the first file and the actual data of the third file are identical, sets up a management information of the first file to refer to the third file when reading the first file. | 12-25-2014 |
20150012503 | SELF-HEALING BY HASH-BASED DEDUPLICATION - For self-healing in a hash-based deduplication system using a processor device in a computing environment, deduplication digests of data and a corresponding list of the deduplication digests in a table of contents (TOC) are maintained for the self-healing of data that is lost or unreadable. The input data digests are compared to the TOC if directed to data that is lost or unreadable, and the input data digests are used to repair the lost or unreadable data. | 01-08-2015 |
20150012504 | PROVIDING IDENTIFIERS TO DATA FILES IN A DATA DEDUPLICATION SYSTEM - Data files in the data deduplication system are associated with a file identifier defined to have a first part identifier for denoting a location of the data file in a storage, and a second part identifier for uniquely identifying the data file in the data deduplication system over time. | 01-08-2015 |
20150019499 | DIGEST BASED DATA MATCHING IN SIMILARITY BASED DEDUPLICATION - Data matches are calculated between input data and repository data via a digest based matching algorithm where the reference digests corresponding to a repository interval of data identified as similar to an input interval of data are loaded into a sequential array and into a search structure. Each of the matching digests found using the search structure are extended using the sequential array of reference digests. Repository data intervals are determined as similar to an input data interval. Reference digests corresponding to the similar repository data interval are loaded into a sequential representation and into a search structure. Matches of input digests and the reference digests are found using the search structure. Each one of the found matches of the input digests and repository digests are extended using the sequential representation. Data matches are determined between the input data and the repository data using extended matches of digests. | 01-15-2015 |
20150019500 | REDUCING ACTIVATION OF SIMILARITY SEARCH IN A DATA DEDUPLICATION SYSTEM - For conditional activation of similarity search in a data deduplication system using a processor device in a computing environment, input data is partitioned into data chunks. A determination is made as to whether to apply the similarity search process for an input data chunk based on deduplication results of a previous input data chunk in the input data. | 01-15-2015 |
20150019501 | GLOBAL DIGESTS CACHING IN A DATA DEDUPLICATION SYSTEM - For utilizing a global digests cache in deduplication processing in a data deduplication system using a processor device in a computing environment, input data is partitioned into data chunks and digest values are calculated for each of the data chunks. The positions of similar repository data are found in a repository of data for each of the data chunks. The repository digests of the similar repository data are located and loaded into the global digests cache. The global digests cache contains digests previously loaded by other deduplication processes. The input digests of the input data are matched with the repository digests contained in the global digests cache for locating data matches. | 01-15-2015 |
20150019502 | READ AHEAD OF DIGESTS IN SIMILARITY BASED DATA DEDUPLICATION - For read ahead of digests in similarity based data deduplication in a data deduplication system using a processor device in a computing environment, input data is partitioned into data chunks and digest values are calculated for each of the data chunks. The positions and sizes of similar data intervals in a repository of data are found for each of the data chunks. The positions and the sizes of read ahead intervals are calculated based on the similar data intervals. The read ahead digests of the read ahead intervals are located and loaded into memory in a background read ahead process. | 01-15-2015 |
20150019503 | DIGEST BLOCK SEGMENTATION BASED ON REFERENCE SEGMENTATION IN A DATA DEDUPLICATION SYSTEM - For producing digest block segmentations based on reference segmentations in a data deduplication system using a processor device in a computing environment, digests are calculated for an input data chunk. Data matches and data mismatches are produced based on matching input digests with reference digests. Secondary digest block segmentations are obtained from similar reference intervals for each of the data mismatches and applied to the input data. | 01-15-2015 |
20150019504 | CALCULATION OF DIGEST SEGMENTATIONS FOR INPUT DATA USING SIMILAR DATA IN A DATA DEDUPLICATION SYSTEM - For calculation of digest segmentations for input data using similar data in a data deduplication system using a processor device in a computing environment, a stream of input data is partitioned into input data chunks. Similar repository intervals are calculated for each input data chunk. Anchor positions are determined between an input data chunk and the similar repository intervals, based on data matches between a previous input data chunk and previous similar repository intervals. Digest segmentations of the similar repository intervals are projected onto the input data chunk, starting at the anchor positions. | 01-15-2015 |
20150019505 | DATA STRUCTURES FOR DIGESTS MATCHING IN A DATA DEDUPLICATION SYSTEM - Data matches are calculated in a data deduplication system by matching input and repository digests using a digest based data matching process where the reference digests corresponding to a repository interval of data identified as similar to an input interval of data are loaded into two data structures. The two data structures include a sequential buffer containing digests in a sequence of occurrence in the data and a search structure for searching of the reference digests matching a version digest. | 01-15-2015 |
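The two data structures in the entry above (a sequential buffer plus a search structure over reference digests) can be sketched in Python; the function names and the use of a plain dict for the search structure are illustrative assumptions, not details from the application:

```python
from collections import defaultdict

def load_reference_digests(digests):
    """Load reference digests into a sequential buffer (a list, in
    order of occurrence in the data) and a search structure (a map
    from digest value to its positions in the buffer)."""
    sequential = list(digests)
    search = defaultdict(list)
    for pos, d in enumerate(sequential):
        search[d].append(pos)
    return sequential, dict(search)

def match_input_digest(digest, search):
    """Return the buffer positions of reference digests matching an
    input (version) digest, or an empty list if none match."""
    return search.get(digest, [])
```

Once a match is found via the search structure, the sequential buffer lets the matching run be extended forward and backward by position, which is the role the compact index in entry 20150019507 plays.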
20150019506 | OPTIMIZING DIGEST BASED DATA MATCHING IN SIMILARITY BASED DEDUPLICATION - Data matches are calculated between input data and repository data via a digest based matching algorithm where in a first step digest matches, anchored at already verified matching positions in the input data and in the repository data, are extended to produce data matches. In a second step the remaining unmatched input digests are matched with repository digests and extended to produce further data matches. | 01-15-2015 |
20150019507 | OPTIMIZING HASH TABLE STRUCTURE FOR DIGEST MATCHING IN A DATA DEDUPLICATION SYSTEM - Repository data intervals are determined as similar to an input data interval. Repository digests corresponding to the similar repository data interval are loaded into a sequential representation and into a search structure. Matches of input digests and the repository digests are found using the search structure. Each one of the found matches of the input digests and repository digests are extended using the sequential representation. Data matches are determined between the input data and the repository data using extended matches of digests. A compact index pointing to a position in the sequential representation of digests is incorporated into entries of the search structure. | 01-15-2015 |
20150019508 | PRODUCING ALTERNATIVE SEGMENTATIONS OF DATA INTO BLOCKS IN A DATA DEDUPLICATION SYSTEM - For producing secondary segmentations of data into blocks and corresponding digests for input data in a data deduplication system using a processor device in a computing environment, digests are calculated for an input data chunk using a primary segmentation into blocks. Secondary segmentations are produced for each of the data mismatches based on reference data, and used to calculate further data matches. The primary segmentation and the corresponding primary digests are stored for the input data chunk. | 01-15-2015 |
20150019509 | COMPATIBILITY AND INCLUSION OF SIMILARITY ELEMENT RESOLUTIONS - For adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, multiple resolution levels are configured for a similarity search. Input similarity elements are calculated in one resolution level for a chunk of input data. The input similarity elements of the one resolution level are used to find similar data in a repository of data where similarity elements of the stored similar repository data are of the multiple resolution levels. | 01-15-2015 |
20150019510 | APPLYING A MAXIMUM SIZE BOUND ON CONTENT DEFINED SEGMENTATION OF DATA - Applying a content defined maximum size bound on blocks produced by content defined segmentation of data by calculating the size of the interval of data between a newly found candidate segmenting position and a last candidate segmenting position of the same or higher hierarchy level, and then using the intermediate candidate segmenting positions of that interval if the size of the interval exceeds the maximum size bound, or discarding the intermediate candidate segmenting positions of that interval if the size of the interval does not exceed the maximum size bound. | 01-15-2015 |
20150019511 | APPLYING A MINIMUM SIZE BOUND ON CONTENT DEFINED SEGMENTATION OF DATA - Applying a content defined minimum size bound on blocks produced by content defined segmentation of data by calculating the size of the interval of data between a newly found candidate segmenting position and a last candidate segmenting position of same or higher hierarchy level, and then discarding the newly found candidate segmenting position if a size of an interval of data is lower than the minimum size bound, or retaining the newly found candidate segmenting position if the size of the interval of data is not lower than the minimum size bound or if there is no last candidate segmenting position of a same or higher hierarchy level as the newly found candidate segmenting position. When a last candidate segmenting position of a same or higher hierarchy level becomes available, the evaluation is reiterated to converge edge segmenting positions of the outputs of consecutive calculation units. | 01-15-2015 |
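The minimum/maximum size bound idea in the two entries above can be illustrated with a simplified content-defined chunker. The rolling value and mask below are toy stand-ins, and the hierarchical candidate-segmenting-position scheme of the applications is not modeled:

```python
def content_defined_chunks(data, min_size=4, max_size=16, mask=0x07):
    """Split `data` (bytes) into chunks at content-defined boundaries.
    A boundary is declared where a simple rolling value satisfies a
    mask test, but never before `min_size` bytes (candidates that come
    too early are discarded) and never later than `max_size` bytes
    (a cut is forced at the bound)."""
    chunks, start, rolling = [], 0, 0
    for i, byte in enumerate(data):
        rolling = (rolling * 31 + byte) & 0xFFFFFFFF
        length = i - start + 1
        if length < min_size:
            continue                      # minimum bound: skip early candidates
        if (rolling & mask) == 0 or length >= max_size:
            chunks.append(data[start:i + 1])
            start, rolling = i + 1, 0     # maximum bound forces a cut
    if start < len(data):
        chunks.append(data[start:])       # trailing remainder
    return chunks
```

Because boundaries depend on content rather than fixed offsets, an insertion early in the data shifts at most a few chunks, which is what makes this segmentation useful for deduplication.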
20150019512 | SYSTEMS AND METHODS FOR FILTERING LOW UTILITY VALUE MESSAGES FROM SYSTEM LOGS - Systems and methods disclosed herein provide intelligent filtering of system log messages having low utility value. In providing the filtering, the systems and methods determine the utility value of a system log message and delete the message from the system log if the message is determined to be of low utility value. As such, embodiments herein provide a system log filter, which reduces the amount of data stored in the system log based on the utility value of the message. | 01-15-2015 |
20150019513 | TIME-SERIES ANALYSIS BASED ON WORLD EVENT DERIVED FROM UNSTRUCTURED CONTENT - The present subject matter relates to analysis of time-series data based on world events derived from unstructured content. According to one embodiment, a method comprises obtaining event information corresponding to at least one world event from unstructured content obtained from a plurality of data sources. The event information includes at least time of occurrence of the world event, time of termination of the world event, and at least one entity associated with the world event. Further, the method comprises retrieving time-series data pertaining to the entity associated with the world event from a time-series data repository. Based on the event information and the time-series data, the world event is aligned and correlated with at least one time-series event to identify at least one pattern indicative of cause-effect relationship amongst the world event and the time-series event. | 01-15-2015 |
20150026135 | ADAPTIVE SIMILARITY SEARCH RESOLUTION IN A DATA DEDUPLICATION SYSTEM - For adaptive similarity search resolution in a data deduplication system using a processor device in a computing environment, input data is partitioned into data chunks. Input similarity elements are calculated for an input chunk. The input similarity elements are used to find similar data in a repository of data using a similarity search structure. A resolution level is calculated for storing the input similarity elements. The input similarity elements are stored in the calculated resolution level in the similarity search structure. | 01-22-2015 |
20150026136 | Automated Data Validation - According to some embodiments, logic executing on a processor receives a request to compare a first file and a second file. Each file comprises records, attributes, and attribute values. An attribute value is a value that a record associates with a corresponding attribute. The logic receives a mapping file indicating a key and one or more selected attributes for comparison. The logic compares each record in the first file to its corresponding record in the second file, the corresponding record determined according to the key. For records that fail to match, the logic determines which of the selected attributes are unmatched. The logic communicates a report indicating a result of comparing the first file and the second file. | 01-22-2015 |
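The record-by-key comparison described in the entry above might look like the following sketch, assuming CSV input and reducing the mapping file to a key name plus a list of selected attributes (both assumptions for illustration):

```python
import csv
import io

def compare_files(first_csv, second_csv, key, selected):
    """Compare records of two CSV files by `key`; for each key present
    in both files, report which of the `selected` attributes differ."""
    def load(text):
        rows = list(csv.DictReader(io.StringIO(text)))
        return {r[key]: r for r in rows}

    a, b = load(first_csv), load(second_csv)
    report = {}
    for k in a.keys() & b.keys():          # records matched via the key
        mismatched = [attr for attr in selected
                      if a[k].get(attr) != b[k].get(attr)]
        if mismatched:
            report[k] = mismatched         # which selected attributes failed
    return report
```

A fuller implementation would also report keys present in only one file; this sketch covers only the unmatched-attribute part of the result.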
20150026137 | RECOVERING FROM A PENDING UNCOMPLETED REORGANIZATION OF A DATA SET - Provided are a computer program product, system, and method for recovering from a pending uncompleted reorganization of a data set when managing data sets in storage. In response to initiation of an operation to access a data set, an operation is initiated to complete a pending uncompleted reorganization of the data set, in response to the data set being in a pending uncompleted reorganization state with no other process currently accessing the data set. | 01-22-2015 |
20150026138 | SYSTEMS, METHODS, AND COMPUTER PROGRAM PRODUCTS FOR MODIFYING AND DELETING DATA FROM A MOBILE DEVICE - Systems, methods, and computer program products are provided for transmitting modified sets of data to, or deleting existing sets of data from, mobile wallet applications on mobile devices. Data set identifiers associated with existing sets of data, attributes defining existing sets of data, and other information associated with existing sets of data are stored on a server. A change request to modify or delete an existing set of data is received from a service provider system. The server is searched for an existing set of data corresponding to the existing set of data identified in the change request. The change request is processed and a modified set of data, or a request to delete the existing set of data, is transmitted to mobile devices that have previously received the existing set of data. | 01-22-2015 |
20150026139 | SCALABLE MECHANISM FOR DETECTION OF COMMONALITY IN A DEDUPLICATED DATA SET - Mechanisms are provided for efficiently determining commonality in a deduplicated data set in a scalable manner regardless of the number of deduplicated files or the number of stored segments. Information is generated and maintained during deduplication to allow scalable and efficient determination of data segments shared in a particular file, other files sharing data segments included in a particular file, the number of files sharing a data segment, etc. Data need not be expanded or uncompressed. Deduplication processing can be validated and verified during commonality detection. | 01-22-2015 |
20150026140 | MERGING ENTRIES IN A DEDUPLICATION INDEX - Provided are a computer program product, system, and method for merging entries in a deduplication index. An index has chunk signatures calculated from chunks of data in the data objects in the storage, wherein each index entry includes at least one of the chunk signatures and a reference to the chunk of data from which the signature was calculated. Entries in the index are selected to merge and a merge operation is performed on the chunk signatures in the selected entries to generate a merged signature. An entry is added to the index including the merged signature and a reference to the chunks in the storage referenced in the selected entries that were merged. The index of the signatures is used in deduplication operations when adding data objects to the storage. | 01-22-2015 |
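A minimal sketch of merging index entries, assuming each entry is a (signature, chunk-references) pair and that the merge operation is simply a hash over the concatenated child signatures (the application does not specify the merge function; SHA-256 is an illustrative choice):

```python
import hashlib

def merge_index_entries(entries):
    """Merge deduplication-index entries. Each entry is a tuple
    (signature, [chunk_refs]). The merged entry carries one combined
    signature plus references to every chunk the children referenced,
    shrinking the index at the cost of coarser matching."""
    sigs = [sig for sig, _ in entries]
    refs = [r for _, chunk_refs in entries for r in chunk_refs]
    merged_sig = hashlib.sha256("".join(sigs).encode()).hexdigest()
    return merged_sig, refs
```

The trade-off this models: after merging, a lookup hits only when the whole merged run of chunks recurs, which suits long sequential data objects.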
20150032702 | SYSTEMS AND METHODS OF UNIFIED RECONSTRUCTION IN STORAGE SYSTEMS - Systems and methods for reconstructing unified data in an electronic storage network are provided which may include the identification and use of metadata stored centrally within the system. The metadata may be generated by a group of storage operation cells during storage operations within the network. The unified metadata is used to reconstruct data throughout the storage operation cells that may be missing, deleted or corrupt. | 01-29-2015 |
20150032703 | GETTING DEPENDENCY METADATA USING STATEMENT EXECUTION PLANS - A database statement can be identified in a software artifact that is configured to issue the database statement. At least one execution plan for the database statement can be retrieved, and reference(s) to database object(s) can be identified in the execution plan(s). Metadata from the reference(s) can be assembled, where the metadata can reflect one or more dependencies of the software artifact on the object(s). The metadata can be included in a data structure. | 01-29-2015 |
20150039570 | MANAGING REDUNDANT IMMUTABLE FILES USING DEDUPLICATION IN STORAGE CLOUDS - A method includes receiving a request to save a first file as immutable. The method also includes searching for a second file that is saved and is redundant to the first file. The method further includes determining the second file is one of mutable and immutable. When the second file is mutable, the method includes saving the first file as a master copy, and replacing the second file with a soft link pointing to the master copy. When the second file is immutable, the method includes determining which of the first and second files has a later expiration date and an earlier expiration date, saving the one of the first and second files with the later expiration date as a master copy, and replacing the one of the first and second files with the earlier expiration date with a soft link pointing to the master copy. | 02-05-2015 |
20150039571 | ACCELERATED DEDUPLICATION - Mechanisms are provided for accelerated data deduplication. A data stream is received at an input interface and maintained in memory. Chunk boundaries are detected and chunk fingerprints are calculated using a deduplication accelerator while a processor maintains a state machine. A deduplication dictionary is accessed using a chunk fingerprint to determine if the associated data chunk has previously been written to persistent memory. If the data chunk has previously been written, reference counts may be updated but the data chunk need not be stored again. Otherwise, datastore suitcases, filemaps, and the deduplication dictionary may be updated to reflect storage of the data chunk. Direct memory access (DMA) addresses are provided to directly transfer a chunk to an output interface as needed. | 02-05-2015 |
20150039572 | SYSTEM AND METHOD FOR REMOVING OVERLAPPING RANGES FROM A FLAT SORTED DATA STRUCTURE - A system and method efficiently removes ranges of entries from a flat sorted data structure, such as a fingerprint database, of a storage system. The ranges of entries represent fingerprints that have become stale, i.e., are not representative of current states of corresponding blocks in the file system, due to various file system operations such as, e.g., deletion of a data block without overwriting its contents. A deduplication module of a file system executing on the storage system performs a fingerprint verification procedure to remove the stale fingerprints from the fingerprint database. As part of the fingerprint verification procedure, the deduplication module performs an attributes intersect range calculation (AIRC) procedure on the stale fingerprint data structure to compute a set of non-overlapping and latest consistency point (CP) ranges. During the AIRC procedure, an inode associated with a data container, e.g., a file, is selected and the FBN tuple of each deleted data block in the file is sorted in a predefined, e.g., increasing, FBN order. The AIRC procedure then identifies the most recent fingerprint associated with a deleted data block. The output from the AIRC procedure, i.e., the set of non-overlapping and latest CP ranges, is then used to remove stale fingerprints associated with that deleted block (as well as each other deleted data block) from the fingerprint database. Notably, only a single pass through the fingerprint database is required to identify the set of non-overlapping and latest CP ranges, thereby improving efficiency of the storage system. | 02-05-2015 |
20150046408 | Method and apparatus for reducing duplicates of multimedia data items in service system - A method of reducing duplicates of multimedia data items in a service system includes maintaining service system hash values for the multimedia data items of the service system; receiving a first multimedia data item; and hashing the received multimedia data item to provide a first hash value. The method further includes searching for the first hash value among the service system hash values; and approving the received multimedia data item to the service system in response to the first hash value not being found among the service system hash values. | 02-12-2015 |
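The hash-check-then-approve flow in the entry above can be sketched as a toy service; SHA-256 stands in for whatever hash the system actually uses, and the class shape is invented for illustration:

```python
import hashlib

class MediaService:
    """Toy service that keeps one hash value per stored multimedia
    item and rejects (deduplicates) uploads whose hash is already
    present among the service system hash values."""

    def __init__(self):
        self.hashes = set()
        self.items = []

    def upload(self, item: bytes) -> bool:
        h = hashlib.sha256(item).hexdigest()
        if h in self.hashes:         # duplicate: not approved
            return False
        self.hashes.add(h)           # approved: record hash and item
        self.items.append(item)
        return True
```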
20150046409 | FINGERPRINTS DATASTORE AND STALE FINGERPRINT REMOVAL IN DE-DUPLICATION ENVIRONMENTS - A storage server is coupled to a storage device that stores blocks of data, and generates a fingerprint for each data block stored on the storage device. The storage server creates a fingerprints datastore that is divided into a primary datastore and a secondary datastore. The primary datastore comprises a single entry for each unique fingerprint and the secondary datastore comprises an entry having an identical fingerprint as an entry in the primary datastore. The storage server merges entries in a changelog with the entries in the primary datastore to identify duplicate data blocks in the storage device and frees the identified duplicate data blocks in the storage device. The storage server stores the entries that correspond to the freed data blocks to a third datastore and overwrites the primary datastore with the entries from the merged data that correspond to the unique fingerprints to create an updated primary datastore. | 02-12-2015 |
20150046410 | ENHANCED RELIABILITY IN DEDUPLICATION TECHNOLOGY OVER STORAGE CLOUDS - Methods and systems for enhancing reliability in deduplication over storage clouds are provided. A method includes: determining a weight for each of a plurality of duplicate files based on parameters associated with a respective storage device of each of the plurality of duplicate files; and designating one of the plurality of duplicate files as a master copy based on the determined weight. | 02-12-2015 |
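A hedged sketch of the weight-based master-copy designation above; the parameter names (`device_reliability`, `device_free_space`) and the 0.6/0.4 weighting are invented for illustration, since the application leaves the concrete parameters open:

```python
def designate_master(duplicates):
    """Given duplicate files, each described by parameters of the
    storage device holding it, compute a weighted score per copy and
    designate the highest-scoring copy as the master."""
    def weight(f):
        return 0.6 * f["device_reliability"] + 0.4 * f["device_free_space"]
    return max(duplicates, key=weight)["path"]
```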
20150052112 | FILE SERVER, STORAGE APPARATUS, AND DATA MANAGEMENT METHOD - A file server coupled to a client terminal via a network includes a storage unit for storing received files and a control unit for controlling writing or reading of the files to or from the storage unit, wherein the control unit: performs deduplication by deciding one of files with the same content, which are stored in the storage unit, as a clone source file, and deciding another file as a clone file, which refers to data of the clone source file; and appends data to the clone source file in accordance with an update instruction for the clone file from the client terminal. | 02-19-2015 |
20150058301 | EFFICIENT DATA DEDUPLICATION IN A DATA STORAGE NETWORK - Machines, systems and methods of uploading data files, the method comprising a first client machine dividing a first file into N data chunks to be uploaded to a server, wherein the N data chunks are of size kX, where k is an integer and X is size of a minimal size data chunk, wherein X is known by the server and by at least a second client machine used for uploading a second file to the server in data chunks of size k′X; and uploading the first file to the server, wherein a first unique signature is calculated for the first file based on applying a signature function to a collection of signatures calculated for the minimal size data chunks of size X that make up the data chunks of size kX in the first file, wherein the uploading of the first file is accomplished by uploading the data chunks of size kX to the server in any order. | 02-26-2015 |
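The signature-of-signatures idea above can be sketched as follows, with SHA-256 standing in for the unspecified signature function. Because the file signature is built from the minimal X-sized chunks, it comes out the same regardless of which kX upload chunking a client uses:

```python
import hashlib

def chunk_signature(chunk: bytes) -> str:
    """Signature of one minimal-size data chunk."""
    return hashlib.sha256(chunk).hexdigest()

def file_signature(data: bytes, X: int) -> str:
    """Unique file signature: apply the signature function to the
    collection of signatures of the minimal X-sized chunks. Chunks of
    size k*X decompose into the same X-sized pieces for any k, so the
    result does not depend on the upload chunk size or order."""
    minimal = [data[i:i + X] for i in range(0, len(data), X)]
    sigs = "".join(chunk_signature(c) for c in minimal)
    return hashlib.sha256(sigs.encode()).hexdigest()
```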
20150058302 | ADDRESSING CACHE COHERENCE IN UPDATES TO A SHARED DATABASE IN A NETWORK ENVIRONMENT - Example embodiments are provided that may include receiving a request to update a particular object based on a modified object, where the particular object is one of a number of objects in a shared database, and the request includes an identification of one or more referenced objects and version information of the one or more referenced objects. Embodiments further include determining whether any of the referenced objects is stale based on the version information, where the particular object is not updated if any of the referenced objects is stale. More specific embodiments include updating the particular object if none of the referenced objects is stale. In yet further embodiments, determining a referenced object is stale is based on a comparison of a version identifier of the referenced object and a version identifier of an object in the shared database that corresponds to the referenced object. | 02-26-2015 |
20150058303 | DECREASING DUPLICATES AND LOOPS IN AN ACTIVITY RECORD - The claimed subject matter decreases duplicate entries and loops in an activity record. An exemplary method comprises analyzing a new entry from a user to determine an originating service and a type of activity and extracting an identifying portion of the new entry. The identifying portion includes a predetermined number of characters at a beginning of the entry. Additionally, the predetermined number of characters is based on a likelihood of duplicates in the activity record. The identifying portion is compared to a list of prior entries from the user, and an exclusion action is performed, if the new entry matches one in the list of prior entries. The exclusion action may be to hide the new entry, to delete the new entry, or to collapse the new entry into a matching prior entry. | 02-26-2015 |
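The identifying-portion comparison above can be sketched like this; the 20-character prefix length and the returned action labels are illustrative, and a real system would tune the prefix length to the observed likelihood of duplicates:

```python
def filter_activity(new_entry, prior_entries, prefix_len=20, action="hide"):
    """Compare the identifying portion (a fixed-length prefix) of a
    new activity entry against prior entries from the user. On a
    match, return the exclusion action ("hide", "delete", or
    "collapse"); otherwise record the entry and keep it."""
    ident = new_entry[:prefix_len]
    for prior in prior_entries:
        if prior[:prefix_len] == ident:
            return action              # duplicate: excluded
    prior_entries.append(new_entry)    # novel: added to the record
    return "keep"
```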
20150066871 | DATA DEDUPLICATION IN AN INTERNET SMALL COMPUTER SYSTEM INTERFACE (iSCSI) ATTACHED STORAGE SYSTEM - Embodiments of the present invention disclose a method, computer program product, and system for data deduplication. Receiving a protocol data unit (PDU) that includes data to be stored on a system and a hash value that corresponds to the data. Determining whether the hash value of the received PDU matches a stored hash value that corresponds to data that is stored in the system. Responsive to determining that the hash value of the received PDU does not match a stored hash value, storing the data included in the received PDU in the system. In another embodiment, the system is an iSCSI attached storage system, and the PDU is an iSCSI PDU. | 03-05-2015 |
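The receive-side logic of the entry above can be sketched as follows; the PDU is reduced to a (data, hash) pair and SHA-256 is assumed for illustration, since sending the hash alongside the data is what lets the receiver skip redundant writes:

```python
import hashlib

def make_pdu(data: bytes):
    """Sender side: pair the data with its hash value (the PDU payload)."""
    return data, hashlib.sha256(data).hexdigest()

def receive_pdu(store: dict, data: bytes, hash_value: str) -> str:
    """Receiver side: if the received hash matches a stored hash, the
    data is already present, so skip the write; otherwise store it."""
    if hash_value in store:
        return "deduplicated"
    store[hash_value] = data
    return "stored"
```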
20150066872 | Efficient Duplicate Elimination - Methods and systems for identifying unique values in an input list are provided. A method in a processor is provided. The method includes generating a hash value list based on an input list of items using a respective work item of the processor for each item in the input list and identifying unique items in the input list based at least on the hash value list. | 03-05-2015 |
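The two-phase hash-then-identify scheme above can be sketched sequentially; Python's built-in `hash()` stands in for the per-work-item hash, and a real implementation would confirm equality on hash collisions rather than trusting the hash alone:

```python
def unique_items(input_list):
    """Phase 1: compute one hash per item (one work item per element
    in the parallel original). Phase 2: scan the hash value list and
    keep the first item seen for each hash."""
    hash_values = [hash(item) for item in input_list]   # phase 1
    seen, uniques = set(), []                           # phase 2
    for item, h in zip(input_list, hash_values):
        if h not in seen:
            seen.add(h)
            uniques.append(item)
    return uniques
```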
20150066873 | POLICY BASED DEDUPLICATION TECHNIQUES - Policy based deduplication techniques are described. A deduplication application may manage deduplication operations for a storage system. The deduplication application may comprise, among other elements, a deduplication handler component to receive a deduplication request to perform deduplication operations for a logical container of a storage system. The deduplication application may further comprise a policy manager component to retrieve a data compliance policy associated with the logical container, the data compliance policy to comprise a set of rules to control deduplication operations for the logical container. The deduplication application may still further comprise a deduplication manager component to determine whether to perform deduplication operations for the logical container based on the data compliance policy for the logical container. Other embodiments are described and claimed. | 03-05-2015 |
20150066874 | DATA DEDUPLICATION IN AN INTERNET SMALL COMPUTER SYSTEM INTERFACE (iSCSI) ATTACHED STORAGE SYSTEM - Embodiments of the present invention disclose a method, computer program product, and system for data deduplication. Receiving a protocol data unit (PDU) that includes data to be stored on a system and a hash value that corresponds to the data. Determining whether the hash value of the received PDU matches a stored hash value that corresponds to data that is stored in the system. Responsive to determining that the hash value of the received PDU does not match a stored hash value, storing the data included in the received PDU in the system. In another embodiment, the system is an iSCSI attached storage system, and the PDU is an iSCSI PDU. | 03-05-2015 |
20150066875 | UPDATING DE-DUPLICATION TRACKING DATA FOR A DISPERSED STORAGE NETWORK - A method begins by a dispersed storage (DS) processing module of a dispersed storage network (DSN) determining whether a change has occurred to a data object of a set of data objects. When a change has occurred, the method continues with the DS processing module accessing de-duplication tracking data for the set of data objects. When the change is deletion of an identified data object of the set of data objects, the method continues with the DS processing module determining whether the identified data object is the only data object in the set of data objects. When the identified data object is not the only data object in the set of data objects, the method continues with the DS processing module updating the linking information to delete linking the identified data object to addressing information. | 03-05-2015 |
20150066876 | DATA DE-DUPLICATION - A method and device for data de-duplication, comprising: performing data chunk partition on a current data object by using a different standard in each of a plurality of logical passes; searching one or more first redundant data chunks of the current data object in each logical pass based on the data chunks partitioned from the current data object in that logical pass, respectively; and performing data de-duplication on the current data object based on all of the found first redundant data chunks of the current data object. Other embodiments of the present invention may also relate to a data de-duplication system and a corresponding computer program product. | 03-05-2015 |
20150066877 | SEGMENT COMBINING FOR DEDUPLICATION - A non-transitory computer-readable storage device includes instructions that, when executed, cause one or more processors to receive a sequence of hashes. Next, the one or more processors are further caused to determine locations of previously stored copies of a subset of the data chunks corresponding to the hashes. The one or more processors are further caused to group hashes and corresponding data chunks into segments based in part on the determined information. The one or more processors are caused to choose, for each segment, a store to deduplicate that segment against. Finally, the one or more processors are further caused to combine two or more segments chosen to be deduplicated against the same store and deduplicate them as a whole using a second index. | 03-05-2015 |
20150074064 | DEFRAGMENTATION-LESS DEDUPLICATION - For defragmentation-less deduplication using a processor device, holes are punched in a file during a data deduplication process, avoiding the need for defragmentation by allowing the file system to reclaim the punched holes as free space added to its free space pool. | 03-12-2015 |
20150074065 | Data Access in a Storage Infrastructure - The present invention relates to a method for data access in a storage infrastructure. The storage infrastructure comprises a host system connected to at least a first storage system and a second storage system. The first storage system receives, from the host system, a write request for storing a data chunk, the write request is indicative of a first identifier of the data chunk. The first storage system calculates a hash value of the received data chunk using a hash function. The first storage system determines a first storage location in the first storage system of the data chunk and sends a write message including the hash value, the first identifier and the first storage location to the de-duplication module. The de-duplication module determines whether the hash value exists in the data structure. | 03-12-2015 |
20150081649 | IN-LINE DEDUPLICATION FOR A NETWORK AND/OR STORAGE PLATFORM - An apparatus comprising a classification block, a pattern generator block, a hash key block and a replacement block. The classification block may be configured to (i) receive a data signal and (ii) identify a portion of the data signal that contains a duplicated data pattern. The pattern generation block may be configured to generate a common continuous pattern of data in response to the data signal. The hash key block may be configured to generate a hash key representing the duplicated data pattern. The replacement block may be configured to replace the duplicated data pattern with the hash key. | 03-19-2015 |
20150088837 | RESPONDING TO SERVICE LEVEL OBJECTIVES DURING DEDUPLICATION - Technology is described for responding to service level objectives during deduplication. In various embodiments, the technology receives a service level objective (SLO); receives data to be stored at the data storage system; computes an amount of deduplication to apply to the received data responsive to the SLO; deduplicates the data to the computed amount; and stores the deduplicated data. The deduplicated data may be stored in such a manner that the data can be read in a manner that meets the SLO. | 03-26-2015 |
20150088838 | DATA STORAGE DEVICE DEFERRED SECURE DELETE - A method of securely deleting data from a data storage device is described. The method includes the steps of receiving a secure delete command to securely delete a file. A data block of the file to securely delete is identified. A pointer to the data block is stored in a deletion buffer. It is then determined whether the secure delete command has the highest priority over other data storage device commands. In response to the secure delete command having the highest priority, the secure delete command on the data block is performed. | 03-26-2015 |
20150088839 | REPLACING A CHUNK OF DATA WITH A REFERENCE TO A LOCATION - Examples disclose a computing device comprising a deduplication module to analyze a signature associated with a chunk of data to identify a corresponding signature in an index of signatures on a hard drive. The corresponding signature indicates the chunk of data corresponds to a stored chunk of data within a removable media. Further, the deduplication module determines whether the chunk of data is redundant based on the identification of the corresponding signature and replaces the chunk of data with a reference to a location of the stored chunk of data. Additionally, the examples also disclose the removable media to store the reference to the chunk of data. | 03-26-2015 |
20150088840 | DETERMINING SEGMENT BOUNDARIES FOR DEDUPLICATION - A sequence of hashes is received. Each hash corresponds to a data chunk of data to be deduplicated. Locations of previously stored copies of the data chunks are determined, the locations determined based on the hashes. A breakpoint in the sequence of data chunks is determined based on the locations, the breakpoint forming a boundary of a segment of data chunks. | 03-26-2015 |
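The location-based breakpoint idea above can be sketched as follows, assuming each chunk hash maps to the store holding its previously stored copy (or to none, for new data); the breakpoint placement rule shown, cutting wherever the location changes, is one simple instantiation:

```python
def segment_breakpoints(hashes, locations):
    """Given a sequence of chunk hashes and a map from hash to the
    location of a previously stored copy (absent means new data),
    place a breakpoint wherever the location changes, so that each
    segment groups chunks that deduplicate against the same store."""
    breakpoints = []
    prev = object()  # sentinel unequal to any location
    for i, h in enumerate(hashes):
        loc = locations.get(h)
        if i > 0 and loc != prev:
            breakpoints.append(i)    # segment boundary before chunk i
        prev = loc
    return breakpoints
```

This pairs naturally with entry 20150066877 above: once segments are formed, each is deduplicated as a whole against its chosen store.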
20150088841 | TECHNIQUES FOR CORRELATING DATA IN A REPOSITORY SYSTEM - Techniques are described for determining correlations between data in a repository system. The data may include information corresponding to resources (e.g., an application, a process, a service, an endpoint, or a method) in a computing environment. A correlation between objects can indicate a similarity or a relationship based on one or more of the attributes of each object that is correlated. The repository system can store information about each object in a data structure, such as an entity, including the attributes about the object. The repository system can determine the relationships between entities based on correlations identified from the attributes of entities. The repository system can perform correlations based on groups of entities corresponding to a group of objects. Upon determining that two groups of entities match, the repository system can compare individual entities in the groups to identify correlations between individual entities corresponding to objects that are correlated. | 03-26-2015 |
20150088842 | DATA STORAGE SYSTEM AND METHOD BY SHREDDING AND DESHREDDING - A system and method for data storage by shredding and deshredding of the data allows for various combinations of processing of the data to provide various resultant storage of the data. Data storage and retrieval functions include various combinations of data redundancy generation, data compression and decompression, data encryption and decryption, and data integrity by signature generation and verification. Data shredding is performed by shredders and data deshredding is performed by deshredders that have some implementations that allocate processing internally in the shredder and deshredder either in parallel to multiple processors or sequentially to a single processor. Other implementations use multiple processing through multi-level shredders and deshredders. Redundancy generation includes implementations using non-systematic encoding, systematic encoding, or a hybrid combination. Shredder based tag generators and deshredder based tag readers are used in some implementations to allow the deshredders to adapt to various versions of the shredders. | 03-26-2015 |
20150088843 | OPTIMIZING A PARTITION IN DATA DEDUPLICATION - For optimizing a partition of a data block into matching and non-matching segments in data deduplication using a processor device in a computing environment, a sequence of matching segments is split into sub-parts for obtaining a globally optimal subset, to which an optimal calculation is applied. The solutions of optimal calculations for the entire range of the sequence are combined, and a globally optimal subset is built by means of a first two-dimensional table represented by a matrix C[i, j], and storing a representation of the globally optimal subset in a second two-dimensional table represented by a matrix PS[i, j] that holds, at entry [i, j] of the matrix, the globally optimal subset for a plurality of parameters in form of a bit-string of length j−i+1, wherein i and j are indices of bit positions corresponding to segments. | 03-26-2015 |
20150095291 | Identifying Product Groups in Ecommerce - Systems and methods are disclosed herein for supplementing product records with product groups that are relevant to the product records. Queries from users may be analyzed to extract keywords. Search results for keywords are evaluated to determine category consistency among product records, including such values as entropy and taxonomy depth. Those keywords with search results having adequate category consistency are selected as product groups, and the search results are associated with those product groups. Product groups are associated with product records according to a random walk of a graph having products and product groups as nodes and links representing membership of a product in a product group. Product groups may be selected based on a transition probability based on a random walk and a quality score based on usage of a product group page for the product group. | 04-02-2015 |
20150100554 | ATTRIBUTE REDUNDANCY REMOVAL - Systems, methods, and other embodiments associated with attribute redundancy removal are described. In one embodiment, a method includes identifying redundant attribute values in a group of attributes that describe two items. The example method also includes generating a pruned group of attributes having the redundant attribute values removed. The similarity of the two items is calculated based, at least in part, on the pruned group of attribute values. | 04-09-2015 |
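One hedged reading of the entry above: if the set of redundant attribute values is known (e.g. boilerplate values shared across the catalog), similarity can be computed over the pruned groups. The sketch below uses Jaccard similarity as the (assumed) similarity measure; the patent does not specify one.

```python
def pruned_similarity(attrs_a, attrs_b, redundant_values):
    """Jaccard similarity of two items' attribute values, computed
    after the redundant values have been pruned from both groups so
    they cannot inflate the score."""
    a = set(attrs_a) - redundant_values
    b = set(attrs_b) - redundant_values
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

Pruning the shared boilerplate value leaves only the genuinely informative attributes in the comparison.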
20150106343 | TECHNIQUE FOR GLOBAL DEDUPLICATION ACROSS DATACENTERS WITH MINIMAL COORDINATION - A system and method for global data de-duplication in a cloud storage environment utilizing a plurality of data centers is provided. Each cloud storage gateway appliance divides a data stream into a plurality of data objects and generates a content-based hash value as a key for each data object. An IMMUTABLE PUT operation is utilized to store the data object at the associated key within the cloud. | 04-16-2015 |
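The key-per-object scheme above can be modelled in a few lines (a toy sketch under stated assumptions, not the patented system): the key is the content hash of the object, and an IMMUTABLE PUT never overwrites an existing key, so gateways at different data centers storing identical objects need no coordination.

```python
import hashlib


class ImmutableObjectStore:
    """Toy model of a cloud object store keyed by content hash."""

    def __init__(self):
        self._objects = {}

    def immutable_put(self, data: bytes) -> str:
        """Store data under its SHA-256 key; a repeat PUT is a no-op."""
        key = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(key, data)  # never overwrite an existing key
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]
```

Two gateways uploading the same data object derive the same key, and the second PUT changes nothing, which is exactly the global deduplication effect.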
20150106344 | METHODS AND SYSTEMS FOR INTELLIGENT ARCHIVE SEARCHING IN MULTIPLE REPOSITORY SYSTEMS - Systems and methods of providing a configurable table of rules that defines a repository/archive search priority that includes multiple repositories/archives. In this manner, repositories/archives are searched successively, and the search is stopped after a first result is returned. Repositories/archives are searched in priority order based on their location in pre-configured "tiers." This enables searches to be directed to the repositories/archives that are best able to handle the load for different types of searches, and for different types of studies as well. A duplicate priority list enables an administrator to designate which repository/archive will appear on the search results list if duplicates are found. For example, in clinical study archiving systems, the search priority enables an administrator to direct searches to the repository best able to handle the load for different types of searches and for different types of studies. | 04-16-2015 |
20150106345 | MULTI-NODE HYBRID DEDUPLICATION - According to at least one embodiment, a data storage system is provided. The data storage system includes memory, at least one processor in data communication with the memory, and a deduplication director component executable by the at least one processor. The deduplication director component is configured to receive data for storage on the data storage system, analyze the data to determine whether the data is suitable for at least one of summary-based deduplication, content-based deduplication, and no deduplication, and store, in a common object store, at least one of the data and a reference to duplicate data stored in the common object store. | 04-16-2015 |
20150112950 | SYSTEMS AND METHODS FOR PROVIDING INCREASED SCALABILITY IN DEDUPLICATION STORAGE SYSTEMS - A computer-implemented method for providing increased scalability in deduplication storage systems may include (1) identifying a database that stores a plurality of reference objects, (2) determining that at least one size-related characteristic of the database has reached a predetermined threshold, (3) partitioning the database into a plurality of sub-databases capable of being updated independent of one another, (4) identifying a request to perform an update operation that updates one or more reference objects stored within at least one sub-database, and then (5) performing the update operation on less than all of the sub-databases to avoid processing costs associated with performing the update operation on all of the sub-databases. Various other systems, methods, and computer-readable media are also disclosed. | 04-23-2015 |
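Step (3) above, partitioning the reference-object database into independently updatable sub-databases, can be sketched as follows (a minimal illustration with hypothetical names; routing by a stable hash of the key is an assumption, since the patent does not fix a partitioning function):

```python
import hashlib


def partition_database(records, num_parts):
    """Split one reference-object table into sub-databases that can be
    updated independently; a stable hash of the key selects the part."""
    parts = [dict() for _ in range(num_parts)]
    for key, value in records.items():
        idx = int(hashlib.sha1(key.encode()).hexdigest(), 16) % num_parts
        parts[idx][key] = value
    return parts
```

Because the routing is deterministic, an update touching one reference object needs to lock and rewrite only the one sub-database that owns its key.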
20150120680 | DISCUSSION SUMMARY - One or more techniques and/or systems are provided for providing a discussion summary corresponding to a search query and/or for providing discussion session search results. For example, discussion data (e.g., corresponding to real-time messaging, such as a microblog discussion) may be evaluated to identify a discussion topic for a discussion session (e.g., a kitchen renovation topic may be assigned to a 1 hour exchange of kitchen renovation messages by a discussion group). A discussion summary of a discussion session may be provided based upon the discussion session having a discussion topic corresponding to a search query topic of a search query. The discussion summary may be provided along with other results for the query and may describe the discussion group, identifiers such as hashtags used by the discussion group, meeting dates/times, average number(s) of participants, other discussion sessions hosted by the discussion group, future discussion sessions, and/or other information. | 04-30-2015 |
20150120681 | SYSTEM AND METHOD FOR AGGREGATING MEDIA CONTENT METADATA - A system and a method are provided to aggregate multiple content servers' metadata into a local database, enabling features such as improved performance, non-searchable server support, duplicate handling and protocol independence. The system performs local content crawling, remote server crawling and remote server searching to create an aggregated database of metadata. Because the metadata is consolidated in a single database, duplicate metadata can be removed easily. | 04-30-2015 |
20150120682 | AUTOMATED RECOGNITION OF PATTERNS IN A LOG FILE HAVING UNKNOWN GRAMMAR - Embodiments of the present invention disclose a method, computer program product, and system for recognizing patterns in log files with unknown grammar. A computer replaces one or more alphanumeric strings with a first alphanumeric character to generate a first resulting string. The computer then replaces one or more identical pairs of characters of the first resulting string with a second alphanumeric character to generate a second resulting string. The computer then replaces one or more consecutive instances of the second alphanumeric character, in the second resulting string, with one instance of the second alphanumeric character to generate a compressed string. | 04-30-2015 |
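The three replacement passes above are concrete enough to sketch (one hedged interpretation; the choice of placeholder characters and of regular expressions is mine, not the patent's):

```python
import re


def compress_log_line(line, token_char="0", pair_char="1"):
    """Three-pass normalisation of a log line with unknown grammar:
    1) collapse every alphanumeric run to one placeholder character,
    2) replace each identical adjacent character pair with a second
       placeholder character,
    3) collapse consecutive instances of the second placeholder into
       a single instance, yielding the compressed string."""
    s = re.sub(r"[A-Za-z0-9]+", token_char, line)
    s = re.sub(r"(.)\1", pair_char, s)
    s = re.sub(re.escape(pair_char) + r"+", pair_char, s)
    return s
```

Lines with the same structure map to the same compressed string, so recurring patterns can be counted without knowing the log's grammar.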
20150127621 | USE OF SOLID STATE STORAGE DEVICES AND THE LIKE IN DATA DEDUPLICATION - Systems and methods of data deduplication are disclosed comprising generating a hash value of a data block and comparing the hash value to a table in a first memory that correlates ranges of hash values with buckets of hash values in a second memory different from the first memory. A bucket is identified based on the comparison and the bucket is searched to locate the hash value. If the hash value is not found in the bucket, the hash value is stored in the bucket and the data block is stored in a third memory. The first memory may be volatile memory and the second memory may be non-volatile random access memory, such as an SSD. Rebalancing of buckets and the table, and use of additional metadata to determine where data blocks should be stored, are also disclosed. | 05-07-2015 |
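The two-tier lookup above can be sketched as follows (a minimal in-memory model; the sorted range table stands in for the volatile-memory tier and plain sets stand in for the SSD-resident buckets):

```python
import bisect


class BucketedIndex:
    """First tier: a small sorted table of bucket lower bounds (RAM).
    Second tier: buckets of hash values (modelled here as sets, in a
    real system resident on an SSD or similar device)."""

    def __init__(self, bounds):
        self.bounds = sorted(bounds)  # lower bound of each hash range
        self.buckets = [set() for _ in self.bounds]

    def _bucket_for(self, h):
        # Find the range whose lower bound is the greatest one <= h.
        return bisect.bisect_right(self.bounds, h) - 1

    def lookup_or_insert(self, h):
        """Return True if h was already present (duplicate block);
        otherwise store it and return False (block must be written)."""
        bucket = self.buckets[self._bucket_for(h)]
        if h in bucket:
            return True
        bucket.add(h)
        return False
```

Only one bucket is searched per hash, so the expensive tier is touched once per lookup rather than scanned in full.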
20150127622 | METHODS AND APPARATUS FOR NETWORK EFFICIENT DEDUPLICATION - Mechanisms are provided for performing network efficient deduplication. Segments are extracted from files received for deduplication at a host connected to a target over one or more networks and/or fabrics in a deduplication system. Segment identifiers (IDs) are determined and compared with segment IDs for segments already deduplicated. Segments already deduplicated need not be transmitted to a target system. References and reference counts are modified at a target system. Updating references and reference counts may involve modifying filemaps, dictionaries, and datastore suitcases for both already deduplicated and not already deduplicated segments. | 05-07-2015 |
20150134623 | PARALLEL DATA PARTITIONING - A method, system, and data storage medium for parallel partitioning of input data into chunks for data deduplication, comprising: dividing said input data into segments; for at least one segment, appending a portion of a subsequent segment; searching the segments in parallel for candidate breaking points; and partitioning each segment into chunks based on a group of final breaking points selected from said candidate breaking points. | 05-14-2015 |
20150134624 | CONTENT ITEM PURGING - Methods, systems, and computer readable media for content item purging functionality are provided. A contact item purger, such as may be incorporated within a local client application of a content management system, leverages its knowledge as to which items have been uploaded to the content management system, and how long content items have been stored on the user device, to propose items for local deletion and thus reclaiming storage on the user device. A contact item purger may run on one or more devices of a user associated with an account on a content management system upon various triggering events, and may run with or without user interaction, thus maintaining available user device memory capacity at all times. | 05-14-2015 |
20150134625 | PRUNING OF SERVER DUPLICATION INFORMATION FOR EFFICIENT CACHING - Technology is disclosed for improving the storage efficiency and communication efficiency for a storage client device by maximizing the cache hit rate and minimizing data requests to the storage server. The storage server provides a duplication list to the storage client device. The duplication list contains references (e.g. storage addresses) to data blocks that contain duplicate data content. The storage client uses the duplication list to improve the cache hit rate. The duplication list is pruned to contain references to data blocks relevant to the storage client device. The storage server can prune the duplication list based on a working set of storage objects for a client. Alternatively, the storage server can prune the duplication list based on content characteristics, e.g. duplication degree and access frequency. Duplicate blocks to which the client does not have access can be excluded from the duplication list. | 05-14-2015 |
20150142755 | STORAGE APPARATUS AND DATA MANAGEMENT METHOD - A control unit of a storage apparatus divides received data into one or more chunks and compresses the divided chunk(s); and regarding the chunk whose compressibility is equal to or lower than a threshold value, the control unit does not store the chunk in the first storage area, but calculates a hash value of the compressed chunk, compares the hash value with a hash value of another data already stored in the second storage area and executes primary deduplication processing; and regarding the chunk whose compressibility is higher than the threshold value, the control unit stores the compressed chunk in the first storage area, reads the compressed chunk from the first storage area, calculates a hash value of the compressed chunk, compares the relevant hash value with a hash value of another data already stored in the second storage area, and executes secondary deduplication processing. | 05-21-2015 |
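The routing decision above can be sketched as follows. This is a hedged illustration: "compressibility" is assumed here to mean the fraction of space saved by compression, and the threshold value is arbitrary; the patent defines neither precisely.

```python
import zlib


def route_chunk(chunk: bytes, threshold: float = 0.5):
    """Route a compressed chunk to inline (immediate) or post-process
    (staged) deduplication based on how well it compresses.

    Poorly compressing chunks skip the staging area and are
    deduplicated inline; well-compressing chunks are staged first and
    deduplicated in a later pass.
    """
    compressed = zlib.compress(chunk)
    compressibility = 1.0 - len(compressed) / len(chunk)  # fraction saved
    if compressibility <= threshold:
        return "inline", compressed
    return "post-process", compressed
```

Highly repetitive data is cheap to stage (it shrinks a lot), while incompressible data gains nothing from staging, which motivates the split.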
20150142756 | DEDUPLICATION IN DISTRIBUTED FILE SYSTEMS - Deduplication in a distributed file system is described. Key classes are determined from a set of potential keys, the potential keys used to represent file content stored by the file system. Control of the key classes is apportioned among index nodes of the file system. Nodes in the file system, during deduplication of data chunks of the file content, generate keys calculated from the data chunks. The keys are distributed among the index nodes based on relations between the keys and the key classes controlled by the index nodes. | 05-21-2015 |
20150142757 | INFORMATION PROCESSING METHOD AND ELECTRONIC DEVICE - The disclosure provides an information processing method and an electronic device. The electronic device generates M components to be embedded into a first application program when installing a recording application program, M is an integer greater than or equal to 1. There is an association relationship between the M components and the recording application program. In a case where the M components are embedded into the first application program, the method includes: when the first application program runs, displaying a first graphical interface corresponding to the first application program by the electronic device, the first graphical interface including the M components; obtaining a first triggering operation for a first component of the M components; collecting, in response to the first triggering operation, first data content under the first graphical interface directly; and storing the collected first data content. | 05-21-2015 |
20150142758 | Method for Intelligently Categorizing Data to Delete Specified Amounts of Data Based on Selected Data Characteristics - A method assigns stored documents within a distributed storage system (DSS) to various document categories to enable a target number of documents to be deleted. An intelligent storage management (ISM) utility identifies a data storage threshold value used to control data storage within the DSS. If a current storage usage exceeds the data storage threshold value, the ISM utility calculates, based on the current storage usage, a target number of documents that can be deleted from the DSS. The ISM utility utilizes a recursive process which includes assigning stored documents to groups including a set of document categories based on data characteristics of the stored documents. The ISM utility further utilizes the recursive process to delete, based on an established ordering of the groups, all of the stored documents assigned to a subset of the groups in order to remove the target number of stored documents. | 05-21-2015 |
20150142759 | METHOD FOR DETECTING THE PLAYBACK OF A DATA PACKET - A method of detecting whether a packet from a plurality of packets transmitted by at least one transmitting station over a network has been played back is disclosed. Each packet includes a message and an identifier, the packets being successively transmitted over several consecutive time periods. The method includes receiving the packet by at least one receiving station and reading of the identifier of the received packet to obtain a received identifier, and consulting, by the receiving station, a database of identifiers already received to determine whether the received identifier has already been received. If the received identifier has not already been received, the method also includes updating the database to include the received identifier. The identifier includes an indicator of belonging to groups of packets. | 05-21-2015 |
20150142760 | METHOD AND DEVICE FOR DEDUPLICATING WEB PAGE - A method and a device is described for de-duplicating a web page. The method includes: extracting at least one core sentence from a target web page; mapping each core sentence to a unique numeric value to form a first numeric value set; determining an intersection set of the first numeric value set and each second numeric value set, and the number of numeric values included in each intersection set, and determining a maximum number of numeric values included in each intersection set; and when a ratio of the maximum number to a total number of numeric values in the first numeric value set is greater than a set threshold, processing the target web page as a duplicate web page. In embodiments of the present invention, during web page de-duplication processing, accuracy can be improved, an anti-noise capability can be enhanced, and a calculating scale can be reduced. | 05-21-2015 |
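The numeric-value-set comparison above maps directly to code (a minimal sketch; using MD5 of each sentence as the "unique numeric value" and 0.7 as the threshold are my assumptions):

```python
import hashlib


def sentence_values(sentences):
    """Map each core sentence to a (practically) unique numeric value."""
    return {int(hashlib.md5(s.encode()).hexdigest(), 16) for s in sentences}


def is_duplicate_page(target_sentences, known_value_sets, threshold=0.7):
    """The target page is a duplicate when its largest intersection
    with any known page's value set covers more than `threshold` of
    the target's own values."""
    target = sentence_values(target_sentences)
    if not target:
        return False
    best = max((len(target & known) for known in known_value_sets), default=0)
    return best / len(target) > threshold
```

Comparing small sets of sentence hashes, rather than full page text, is what keeps the calculating scale down.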
20150293949 | DATA SAMPLING DEDUPLICATION - Techniques for deduplication include receiving a series of data blocks that includes a first data block and deciding whether the first data block is a sampled data block. If the first data block is a sampled data block and information about the first data block is not in an index, storing information about the first data block in the index. If the first data block is not a sampled data block and information about the first data block is not in the index, deciding whether to store information about the first data block in the index based in part on whether it is near data blocks whose information is stored in the index. | 10-15-2015 |
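One way to read the sampling rule above (a sketch under stated assumptions: sampling by hash value modulo a rate, and "near" meaning within a small window of recent positions — neither is specified by the abstract):

```python
SAMPLE_RATE = 4  # index roughly one block in four


def is_sampled(block_hash: int) -> bool:
    return block_hash % SAMPLE_RATE == 0


def maybe_index(block_hash, position, index, window=2):
    """Sampled blocks are always indexed; an unsampled block is indexed
    only when a nearby earlier position is already in the index, so
    regions that look duplicate-dense get full coverage."""
    if block_hash in index:
        return
    indexed_positions = set(index.values())
    near_hit = any(p in indexed_positions
                   for p in range(position - window, position))
    if is_sampled(block_hash) or near_hit:
        index[block_hash] = position
```

The index stays small on unique data but densifies around runs where duplicates are being found.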
20150293950 | METHOD, APPARATUS, AND STORAGE MEDIUM FOR REMOVING REDUNDANT INFORMATION FROM TERMINAL - This application relates to the technical field of network communications, and discloses a method and an apparatus for removing redundant information of a terminal. The method includes the steps of: calculating an estimated redundancy value of at least one type of redundant information in a terminal; determining that a redundancy value of a type of redundant information reaches a threshold of the type of redundancy value; prompting a user to remove redundant information; and according to confirmation from the user, removing the type of redundant information or all redundant information. The apparatus includes a first calculating unit, a determining unit, a prompting unit and a cleanup unit. According to the method and the apparatus of this application, an estimated redundancy value of redundant information of a terminal can be calculated actively by analyzing historical redundant data information of a user, and the user is prompted to process redundant information that reaches a threshold without the need for scanning, thereby saving system resources, improving system performance and also saving user time. | 10-15-2015 |
20150301903 | CROSS-SYSTEM, USER-LEVEL MANAGEMENT OF DATA OBJECTS STORED IN A PLURALITY OF INFORMATION MANAGEMENT SYSTEMS - Systems and methods are disclosed for cross-system user-level management of data objects stored in one or more information management systems, and for user-level management of data storage quotas in information management systems, including data objects in secondary storage. An illustrative quota manager is associated with one or more information management systems. The quota manager comprises a quota value representing the maximum amount of data storage allowed for a given end-user's primary and secondary data in the one or more information management systems. The quota manager determines whether data associated with the end-user has exceeded the storage quota, and if so, prompts the end-user to select data for deletion, the deletion to be implemented globally, across the primary and secondary storage subsystems of the respective one or more information management systems. Meanwhile, so long as the quota is exceeded, the quota manager instructs storage managers to block backups of end-user's data. | 10-22-2015 |
20150302022 | DATA DEDUPLICATION METHOD AND APPARATUS - A data deduplication method includes separating data into a plurality of data chunks that correspond to first to N-th positions, N being a positive integer that is greater than 1; determining discrimination indexes of the first to N-th positions, respectively; arranging the order of the first to N-th positions according to values of the discrimination indexes; recording the arranged order of the first to N-th positions on a position vector; and generating fingerprints through combination of the data chunks that correspond to the first to N-th positions according to the order of the first to N-th positions recorded on the position vector, wherein the determining discrimination indexes includes determining the discrimination indexes according to a ratio of duplicate data chunks to the data chunks that correspond to a same position in a plurality of pieces of data. | 10-22-2015 |
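The position-vector construction above can be sketched as follows. This is a hedged reading: the discrimination index is taken here as the ratio of duplicate chunks at a position across the sample pieces, with lower-duplication (more discriminating) positions ordered first; the patent leaves the exact scoring open.

```python
from collections import Counter
import hashlib


def position_order(pieces):
    """Build the position vector: order chunk positions by their
    discrimination index (duplicate ratio at that position)."""
    num_positions = len(pieces[0])
    scored = []
    for pos in range(num_positions):
        counts = Counter(piece[pos] for piece in pieces)
        dup_ratio = max(counts.values()) / len(pieces)  # share of duplicates
        scored.append((dup_ratio, pos))
    return [pos for _, pos in sorted(scored)]  # most discriminating first


def fingerprint(piece, order):
    """Combine the piece's chunks in position-vector order."""
    return hashlib.sha256(b"".join(piece[p] for p in order)).hexdigest()
```

Putting the discriminating chunks first means a prefix of the combined fingerprint already separates most non-duplicate pieces.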
20150309880 | EFFICIENT VIDEO DATA DEDUPLICATION - Various embodiments for performing video data deduplication by a processor device are provided. An accompanying audio stream of a video stream for a selected data block is analyzed for similarity with a pre-existing data block having a predetermined value representative of a plurality of coordinate points of corresponding video at a certain time. | 10-29-2015 |
20150310132 | EVENT-TRIGGERED DATA QUALITY VERIFICATION - A method is directed to associating quality metadata with underlying data. The method includes, for one or more data items, a computing system identifying one or more threshold conditions related to the data items. The computing system determines that the one or more threshold conditions related to the data items have been met. As a result of determining that the one or more threshold conditions related to the data items have been met, the computing system associates quality metadata with the data items. | 10-29-2015 |
20150317328 | MANAGING REDUNDANT IMMUTABLE FILES USING DEDUPLICATION IN STORAGE CLOUDS - A method includes receiving a request to save a first file as immutable. The method also includes searching for a second file that is saved and is redundant to the first file. The method further includes determining the second file is one of mutable and immutable. When the second file is mutable, the method includes saving the first file as a master copy, and replacing the second file with a soft link pointing to the master copy. When the second file is immutable, the method includes determining which of the first and second files has a later expiration date and an earlier expiration date, saving the one of the first and second files with the later expiration date as a master copy, and replacing the one of the first and second files with the earlier expiration date with a soft link pointing to the master copy. | 11-05-2015 |
20150324419 | REDUCING DIGEST STORAGE CONSUMPTION BY TRACKING SIMILARITY ELEMENTS IN A DATA DEDUPLICATION SYSTEM - For reducing digests storage consumption in a data deduplication system using a processor device in a computing environment, input data is partitioned into chunks, and the chunks are grouped into chunk sets. Digests are calculated for input data and stored in sets corresponding to the chunk sets. Similarity elements are calculated for the input data and the similarity elements are stored in a similarity search structure, and the number of similarity elements associated with a chunk set which are currently contained in the similarity search structure is maintained for each chunk set. | 11-12-2015 |
20150331864 | RANKING AND RATING SYSTEM AND METHOD UTILIZING A COMPUTER NETWORK - A ranking and rating system and method is disclosed. The ranking and rating system and method comprises aggregating information or data; storing the information or data; analyzing the information or data; creating preliminary results; removing duplicates from the preliminary results; allowing users to input their custom requirements; providing a final result; and adjusting the final result to allow the system to incorporate past results in future analyses. | 11-19-2015 |
20150331897 | INFORMATION PROCESSING APPARATUS, INFORMATION PROCESSING METHOD, AND NON-TRANSITORY COMPUTER READABLE MEDIUM - An information processing apparatus includes a first determination unit determining whether a storage period of a target document expires on the basis of a first table in which type information, creation department information, and the storage period determined by a period from a document reviewed time are associated with each other using the type information and the creation department information about the target document; a second determination unit determining whether the storage period of the target document expires on the basis of a second table in which medical care department information is associated with the storage period using the information indicating the medical care department where the target document is reviewed; and a deletion unit that, if the storage period of the target document expires, performs any of deletion of the target document from a memory storing the target document, compression, and movement to another document memory. | 11-19-2015 |
20150331915 | INTERNET AND DATABASE MINING TO GENERATE DATABASE RECORDS - A method of generating database records. The method includes receiving by a processor, a user input defining a common search criteria; identifying, by the processor, one or more database records in a searchable database, wherein each of the one or more records is associated with the common search criteria; and extracting by the processor, the one or more database records from the searchable database to build a set of extracted records, wherein the extracted records are in a different format from the database records. | 11-19-2015 |
20150339316 | DATA DEDUPLICATION METHOD - Data deduplication is performed by separating data into a plurality of data chunks that correspond to first through N | 11-26-2015 |
20150347439 | INCREMENTAL DATA PROCESSING - Event logs in a video advertisement insertion network are processed to remove duplicate entries. One or more ad servers are continuously generating new event entries and writing them to a database. The entries are randomized such that generated time contiguous entries are distributed over multiple storage locations, thereby facilitating resource scaling and a uniform use of storage and computing resources. The distributed entries are read from the storage locations using sequential reads of chunks of the entries and processed to remove duplicate entries. | 12-03-2015 |
20150347442 | DATA EMBEDDING IN RUN LENGTH ENCODED STREAMS - One or more system, apparatus, method, and computer readable media for embedding supplemental data into a compressed data stream to form a supplemented compressed data stream. In embodiments, supplemental data is embedded at a run-length encoded (RLE) compression stage. In embodiments, supplemental data is extracted from a supplemented RLE stream to recover supplemental data and/or reconstruct the compressed data stream from which the supplemental data is extracted. | 12-03-2015 |
20150347444 | CLOUD LIBRARY DE-DUPLICATION - Disclosed herein are systems, methods, and non-transitory computer-readable storage media identifying duplicate media items that occur during a batch upload process from a client device to the cloud media library and performing media de-duplication and re-mapping of duplicate media items. | 12-03-2015 |
20150347445 | DEDUPLICATION OF FILE - The present invention discloses a method for deduplication of a file, a computer program product, and an apparatus thereof. In the method, the file is partitioned into at least one composite block, wherein the composite block includes a fixed-size block and a variable-size block, the variable-size block being determined based on content of the file. Then a deduplication operation is performed on the at least one composite block. | 12-03-2015 |
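The composite-block partitioning above can be sketched like this (a minimal illustration; the tiny window sum used as the content-defined boundary test is a stand-in for a real rolling fingerprint, and all sizes are arbitrary):

```python
def composite_blocks(data: bytes, fixed=8, min_var=4, max_var=32, mask=0x0F):
    """Cut a file into composite blocks: a fixed-size part followed by
    a variable-size part whose end is determined by content (here,
    where the sum of the last four bytes masks to zero)."""
    blocks, i = [], 0
    while i < len(data):
        fixed_part = data[i:i + fixed]
        j = min(i + fixed + min_var, len(data))      # variable part minimum
        end = min(i + fixed + max_var, len(data))    # variable part maximum
        while j < end and (sum(data[j - 4:j]) & mask) != 0:
            j += 1                                   # slide until boundary
        blocks.append((fixed_part, data[i + fixed:j]))
        i = j
    return blocks
```

The fixed parts give cheap, aligned units, while the content-defined variable parts keep block boundaries stable when bytes are inserted or deleted.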
20150347596 | Bulletin Board Data Mapping and Presentation - The system provides a method and apparatus for sorting and displaying information from a BBS. The system provides a method of sorting and presenting messages from a BBS so that the relationships within message threads can be easily observed and related messages can be identified. The system provides a way to view messages and map message threads in two and three dimensions so that the content of messages can be easily reviewed and the relationship between messages can be seen and followed. The system allows a user to enter into a message thread at any point and to then produce a visualization of the related threads and messages associated with each individual message. The system provides interfaces for either a linear or threaded BBS or even a hybrid BBS that is some combination of linear and threaded. | 12-03-2015 |
20150347614 | Synchronized Web Browsing Histories: Processing Deletions and Limiting Communications to Server - Deletion of synchronized web browsing history is enabled. A deletion filter record that specifies synchronized web browsing history to be deleted is received from a first client. The deletion filter record is stored in association with an identifier of the first client. A check-in message is received from a second client. Responsive to the check-in message, a determination is made that the stored deletion filter record is relevant to the second client. The stored deletion filter record is sent to the second client. Separately, a client's communications to a server are limited. A request is received to communicate with the server. A throttling policy is accessed. The throttling policy includes multiple ordered policy sections. A policy section indicates that all messages sent from the client to the server, up to the number of messages, must be separated by at least the time period. | 12-03-2015 |
20150356109 | STORAGE APPARATUS AND DATA MANAGEMENT METHOD - The present invention relates to a storage apparatus that executes de-duplication processing. Specifically, a storage apparatus includes a storing apparatus configured to provide a first storage area and a second storage area and a control unit. The control unit determines, on the basis of a result of comparison of a compression ratio of compressed data with a threshold, whether first duplication determination for determining whether data same as the data compressed without being stored in the first storage area is stored in the second storage area is executed or second duplication determination for determining whether data same as the data compressed after being stored in the first storage area is stored in the second storage area is executed. Further, the control unit changes the threshold on the basis of a state of the storage apparatus. | 12-10-2015 |
20150356124 | MANAGING DATA SETS OF A STORAGE SYSTEM - A method, system, and computer program product for managing data sets of a storage facility is disclosed. The method, system, and computer program product include determining, by analyzing a first data set, that the first data set includes a first record having padded data. To identify the padded data, the method, system, and computer program product include comparing at least a portion of the first record of the first data set with a second record of a second data set. Next, the method, system, and computer program product include removing, from the first record of the first data set, the padded data. | 12-10-2015 |
20150356134 | DE-DUPLICATION SYSTEM AND METHOD THEREOF - Chunk de-duplication performance is improved. A de-duplication system has a cut-out processing unit which inputs a content from a client terminal thereinto, determines a calculation range from a predetermined maximum chunk size and a predetermined minimum chunk size, divides the calculation range into at least two small calculation ranges, sets the positions of windows for rolling hash calculation so that the rolling hash calculation is continuous between the two small calculation ranges, and subjects the at least two small calculation ranges to the rolling hash calculation with shifting of the windows based on parallel processing to cut out a chunk from the content, and a de-duplication processing unit which does not store the cut-out chunk into a storage device when the chunk having the same contents as those of the cut-out chunk is already stored in the storage device. | 12-10-2015 |
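The chunk cut-out described in the entry above can be sketched in miniature. The hash, mask, and size parameters below are illustrative assumptions, and a simple byte-accumulating hash stands in for the patent's windowed rolling hash and parallel small-range processing:

```python
# Content-defined chunking: declare a chunk boundary where the hash
# matches a bit pattern, bounded by minimum and maximum chunk sizes.
MIN_CHUNK, MAX_CHUNK, MASK = 64, 1024, 0x1F   # illustrative parameters

def cut_chunks(data: bytes):
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h * 31) + byte) & 0xFFFFFFFF     # toy hash, not a true window
        size = i - start + 1
        if (size >= MIN_CHUNK and (h & MASK) == 0) or size >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])            # tail chunk
    return chunks
```

A de-duplication step would then hash each returned chunk and skip storing any chunk whose hash is already present in the store.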
20150363418 | DATA RESTRUCTURING OF DEDUPLICATED DATA - Various embodiments for enhancing storage of deduplicated data in a computing storage environment. Analytics are applied to at least one data storage characteristic observed in the computing storage environment to restructure the deduplicated data in a more sequential manner so as to enhance performance of the computing storage environment. | 12-17-2015 |
20150363420 | MEDIA ASSET MANAGEMENT - A method includes processing a number of media assets stored in one or more media asset repositories to determine a number of signatures for media in time intervals of the media assets of the number of media assets, processing the number of signatures to identify duplicate instances of the signatures in the number of signatures, processing the identified duplicate instances to identify relationships between the identified duplicate instances, and storing the number of signatures and the relationships between the identified duplicate instances of the signatures in a signature data store. | 12-17-2015 |
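The duplicate-instance identification step above can be illustrated with a minimal signature index; the tuple layout `(asset_id, interval, signature)` is an assumed representation, not the patent's data store:

```python
from collections import defaultdict

def index_duplicates(interval_signatures):
    """interval_signatures: iterable of (asset_id, interval, signature).

    Returns signature -> list of (asset_id, interval) for every
    signature seen in more than one place, i.e. the duplicate
    instances and the relationships between them."""
    index = defaultdict(list)
    for asset, interval, sig in interval_signatures:
        index[sig].append((asset, interval))
    return {sig: hits for sig, hits in index.items() if len(hits) > 1}
```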
20150363437 | DATA COLLECTION AND CLEANING AT SOURCE - Apparatus and method to cleanse data, the apparatus including: a receiver to collect electronic data to cleanse; a processor coupled to the receiver, the processor configured to receive the data from the receiver; a memory coupled to the processor, the memory configured to store an application program; a first interface to an instantiation module, to process data collected by the receiver; and a second interface to a configuration manager module, the configuration manager module configured to control data structure and rules used by the instantiation module to process data, wherein the first interface and the second interface are callable from the application program to cleanse the data collected by the receiver. | 12-17-2015 |
20150363438 | EFFICIENTLY ESTIMATING COMPRESSION RATIO IN A DEDUPLICATING FILE SYSTEM - A system for estimating a quantity of unique identifiers comprises a processor and a memory. The processor is configured to, for each of k times, associate a bin of a set of bins with each received identifier. The processor is further configured to determine an estimate of the quantity of unique identifiers based at least in part on an average minimum associated bin value. The memory is coupled to the processor and configured to provide the processor with instructions. | 12-17-2015 |
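The average-minimum-bin estimator above resembles a classical minimum-value distinct-count sketch. A toy version, assuming each of k salted hash functions maps an identifier to a pseudo-uniform value in (0, 1): the minimum of n uniform draws has expected value 1/(n+1), so n is roughly 1/min − 1, averaged over k trials to reduce variance.

```python
import hashlib

def estimate_unique(identifiers, k=64):
    """Distinct-count estimate from the average minimum hash value."""
    mins = [1.0] * k
    for ident in identifiers:
        for j in range(k):  # k salted hash functions
            digest = hashlib.sha256(f"{j}:{ident}".encode()).digest()
            value = int.from_bytes(digest[:8], "big") / 2.0**64
            if value < mins[j]:
                mins[j] = value
    avg_min = sum(mins) / k
    return 1.0 / avg_min - 1.0
```

Because duplicates hash to the same values, they never change the minima, which is what makes the estimate useful in a deduplicating file system.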
20150370835 | HASH BASED DE-DUPLICATION IN A STORAGE SYSTEM - A method for de-duplication, the method may include receiving a request to store in a storage system a received data entity; obtaining a received data entity signature that is responsive to the received data entity; selecting a selected data structure out of a set of data structures that comprises K data structures; wherein K is a positive integer; wherein for each value of a variable k that ranges between 2 and K, a stored data entity signature that is stored in a k'th data structure out of the set collided with stored data entity signatures that are stored in each one of the first through the (k−1)'th data structures of the set; calculating an index by applying, on the received data entity signature, a hash function that is associated with the selected data structure; determining whether an entry that is associated with the index and belongs to the selected data structure is empty; writing to the entry, if the entry is empty, the received data entity signature, and storing the received data entity in the storage system in response to a location of the entry in the set; selecting, if (a) the entry is not empty and (b) the received data entity signature differs from a stored data entity signature that is stored in the entry, a new data structure of the set, and repeating at least the stages of calculating and determining. | 12-24-2015 |
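The cascade of K data structures above can be sketched as a set of hash tables where a colliding signature falls through to the next table; the table count, table size, and per-level hash salt are illustrative assumptions:

```python
import hashlib

class CascadeIndex:
    """K tables; a signature whose slot is occupied by a different
    signature in tables 0..k-1 cascades to table k (sketch)."""

    def __init__(self, k=4, size=97):
        self.tables = [dict() for _ in range(k)]   # slot index -> signature
        self.size = size

    def _slot(self, sig, level):
        digest = hashlib.sha256(f"{level}:{sig}".encode()).digest()
        return int.from_bytes(digest[:4], "big") % self.size

    def insert(self, sig):
        """Return (level, slot) where sig is stored, new or pre-existing."""
        for level, table in enumerate(self.tables):
            slot = self._slot(sig, level)
            if slot not in table or table[slot] == sig:
                table[slot] = sig
                return level, slot
        raise RuntimeError("signature collided in all K tables")
```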
20150378638 | HIGH READ BLOCK CLUSTERING AT DEDUPLICATION LAYER - Methods, systems, and computer program products are provided for deduplicating data. In one embodiment, a method comprises mapping a plurality of file blocks of selected data to a plurality of logical blocks, deduplicating the plurality of logical blocks to thereby associate each logical block with a corresponding physical block of a plurality of physical blocks located on a physical memory device, two or more of the corresponding physical blocks being non-contiguous with each other, and determining whether one or more of the corresponding physical blocks are one or more frequently accessed physical blocks being accessed at a frequency above a threshold frequency and being referred to by a common set of applications. | 12-31-2015 |
20150378775 | LOG-BASED TRANSACTION CONSTRAINT MANAGEMENT - A transaction request is received at a log-based transaction manager, indicating a logical constraint to be satisfied before the corresponding transaction is committed. The transaction manager identifies a subset of transaction records stored in a persistent change log that are to be examined to evaluate the logical constraint. Based at least in part on the result of a comparison of one or more constraint-related data signatures included in the transaction request with corresponding data signatures in the subset of transaction records, a decision is made to commit the requested transaction. | 12-31-2015 |
20150379066 | SYSTEM AND METHOD FOR PICK-AND-DROP SAMPLING - A database system includes an input to a database server configured to deliver a data stream formed of a sequence of elements, D={p | 12-31-2015 |
20160004716 | HASH-BASED MULTI-TENANCY IN A DEDUPLICATION SYSTEM - In a hash-based multi-tenancy in a deduplication system, incorporating, as if part of input data, a tenant identification (ID) into a hash value calculation using a single hash based index table for separating data segments in a multi-tenant deduplication system. | 01-07-2016 |
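Folding the tenant ID into the hash input, as the abstract above describes, can be as simple as prefixing it before fingerprinting; the separator byte and the choice of SHA-256 are assumptions for illustration:

```python
import hashlib

def tenant_fingerprint(tenant_id: str, segment: bytes) -> str:
    """Incorporate the tenant ID into the hash calculation so that
    identical data segments from different tenants occupy distinct
    entries in a single shared hash-based index table."""
    return hashlib.sha256(tenant_id.encode() + b"\x00" + segment).hexdigest()
```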
20160004717 | Storage and Retrieval of File-Level Changes on Deduplication Backup Server - When a backup client sends a request to back up a file to a backup server, the file and an index (e.g., checksum, hash, encryption, etc.) of the file are stored on the backup server in an efficient deduplication storage. If a backup client sends a request to back up a modified version of a file already stored on a backup server, the modified portion of the file is stored. In addition, an index of the modified portion is generated and stored along with the modified portions on the backup server. The indices can be used to reconstruct the file or modified version of the file when retrieved. The efficient deduplication storage method ensures that multiple copies of files or portions of files do not exist on the servers. | 01-07-2016 |
20160004730 | MINING OF POLICY DATA SOURCE DESCRIPTION BASED ON FILE, STORAGE AND APPLICATION META-DATA - A method and system determines discrete policy target groups for information objects stored in an enterprise IT system. The method and system provide cleansed information about information objects stored on the enterprise IT system. Criteria for sorting the information objects is determined. Initial sorting of the information objects is carried out, resulting in an initial set of clusters. The information objects are clustered into discrete policy target groups based on the information about the information objects and the initial set of clusters, and human-understandable names and definite descriptions for policy target groups are computed. | 01-07-2016 |
20160012082 | CONTENT-BASED REVISION HISTORY TIMELINES | 01-14-2016 |
20160012098 | USING INDEX PARTITIONING AND RECONCILIATION FOR DATA DEDUPLICATION | 01-14-2016 |
20160019232 | DISTRIBUTED DEDUPLICATION USING LOCALITY SENSITIVE HASHING - Deduplication in a distributed storage system is described. A deduplication manager identifies a data item that includes multiple data chunks. The deduplication manager defines a first extent on a first node in a distributed storage system. The deduplication manager compares the first extent to existing groups of similar extents to find one of the existing groups that has extents that are similar to the first extent. The deduplication manager selects a second extent from the found group of extents. The second extent closely matches the first extent, and the deduplication manager removes from the first extent one or more data chunks that are included in both the first extent and the second extent. The deduplication manager associates, with the first extent, a pointer to the second extent for the removed one or more data chunks. | 01-21-2016 |
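The chunk-removal step above can be sketched with lists of chunk identifiers; representing the pointer as an index into the similar extent is an illustrative assumption:

```python
def shrink_extent(first, second):
    """Drop from `first` the chunks also present in `second`, keeping a
    pointer (position in `second`) for each dropped chunk (sketch)."""
    positions = {chunk: i for i, chunk in enumerate(second)}
    kept, pointers = [], {}
    for pos, chunk in enumerate(first):
        if chunk in positions:
            pointers[pos] = positions[chunk]   # pointer into second extent
        else:
            kept.append((pos, chunk))
    return kept, pointers

def rebuild(kept, pointers, second, length):
    """Reconstruct the original first extent from its residue."""
    out = [None] * length
    for pos, chunk in kept:
        out[pos] = chunk
    for pos, ref in pointers.items():
        out[pos] = second[ref]
    return out
```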
20160026652 | SYSTEM PERFORMING DATA DEDUPLICATION USING A DENSE TREE DATA STRUCTURE - In one embodiment, as new blocks of data are written to storage devices of a storage system, fingerprints are generated for those new blocks and inserted as entries into a top level (L0) of a dense tree data structure. When L0 is filled, the contents from L0 may be merged with level 1 (L1). After the initial merge, new fingerprints are added to L0 until L0 fills up again, which triggers a new merge. Duplicate fingerprints in L0 and L1 are identified which, in turn, indicates duplicate data blocks. A post-processing deduplication operation is then performed to remove duplicate data blocks corresponding to the duplicate fingerprints. In a different embodiment, as new fingerprint entries are loaded into L0, those new fingerprints may be compared with existing fingerprints loaded into L0 and/or other levels to facilitate inline deduplication to identify duplicate fingerprints and subsequently perform the deduplication operation. | 01-28-2016 |
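The L0/L1 merge with duplicate-fingerprint detection described above can be sketched as two in-memory maps; the capacity and merge trigger are illustrative, not the dense tree's on-disk layout:

```python
class DenseTreeIndex:
    """Two-level fingerprint index: new entries land in L0; when L0
    fills, its contents merge into L1, and duplicate fingerprints
    (hence duplicate data blocks) are recorded for post-processing."""

    def __init__(self, l0_capacity=4):
        self.l0, self.l1 = {}, {}
        self.l0_capacity = l0_capacity
        self.duplicates = []            # (new_block, existing_block) pairs

    def insert(self, fingerprint, block_id):
        self.l0[fingerprint] = block_id
        if len(self.l0) >= self.l0_capacity:
            self._merge()

    def _merge(self):
        for fp, block in self.l0.items():
            if fp in self.l1:           # duplicate data block detected
                self.duplicates.append((block, self.l1[fp]))
            else:
                self.l1[fp] = block
        self.l0.clear()
```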
20160026653 | LOOKUP-BASED DATA BLOCK ALIGNMENT FOR DATA DEDUPLICATION - Calculating fingerprints for each one of a multiplicity of alignment combinations of fixed-size deduplication data blocks and comparing each of the fingerprints to stored deduplicated data fingerprints in a lookup database for determining a preferred deduplication data block alignment. A deduplication data block comprises each of the fixed-size deduplication data blocks. | 01-28-2016 |
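Trying every alignment of the fixed-size block grid and scoring it against the lookup database, as described above, can be sketched as follows (raw blocks stand in for fingerprints, and the block size is illustrative):

```python
def best_alignment(data: bytes, stored_fingerprints, block_size=8):
    """Score every offset of the fixed-size block grid by how many of
    its blocks already appear in the dedup lookup set; return the
    offset with the most hits as the preferred alignment."""
    best_offset, best_hits = 0, -1
    for offset in range(block_size):
        blocks = (data[i:i + block_size]
                  for i in range(offset, len(data) - block_size + 1, block_size))
        hits = sum(1 for b in blocks if b in stored_fingerprints)
        if hits > best_hits:
            best_offset, best_hits = offset, hits
    return best_offset, best_hits
```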
20160034489 | SCHEDULING DEDUPLICATION IN A STORAGE SYSTEM - A system can maintain multiple queues for deduplication requests of different priorities. The system can also designate priority of storage units. The scheduling priority of a deduplication request is based on the priority of the storage unit indicated in the deduplication request and a trigger for the deduplication request. | 02-04-2016 |
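Scheduling by storage-unit priority plus request trigger, as the entry above describes, can be sketched with a single heap keyed on both; the trigger ranking is an assumption for illustration:

```python
import heapq
import itertools

class DedupScheduler:
    """Priority = (storage-unit priority, trigger rank); lower wins.
    The counter breaks ties in submission order."""

    TRIGGERS = {"policy": 0, "threshold": 1, "background": 2}  # assumed

    def __init__(self):
        self.heap, self.counter = [], itertools.count()

    def submit(self, unit_priority, trigger, request):
        key = (unit_priority, self.TRIGGERS[trigger], next(self.counter))
        heapq.heappush(self.heap, (key, request))

    def next_request(self):
        return heapq.heappop(self.heap)[1]
```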
20160034523 | SUB-BLOCK PARTITIONING FOR HASH-BASED DEDUPLICATION - Sub-block partitioning for hash-based deduplication is performed by defining a minimal size and maximum size of the sub-block. If one of a plurality of search criteria is satisfied by one of a plurality of hash values, declaring a position of the hash value as a boundary end position of the sub-block. If the maximum size of the sub-block is reached prior to satisfying one of the multiple search criteria, declaring a position of an alternative one of the hash values that is selected based upon another one of the multiple search criteria as the boundary end position of the sub-block. One of the plurality of search criteria is satisfied if n bits at predefined positions of a value, calculated by applying an XOR operation on the last calculated k hash values, are equal to one of m predefined different patterns of bits. | 02-04-2016 |
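The XOR-over-last-k-hashes boundary test above can be sketched directly; the values of k, the mask selecting the n bits, and the pattern set are all illustrative assumptions:

```python
def boundary_positions(hash_values, k=3, mask=0b111, patterns=frozenset({0})):
    """Positions where the XOR of the last k hash values, restricted
    to the bits selected by `mask`, equals a predefined pattern."""
    positions = []
    for i in range(k - 1, len(hash_values)):
        acc = 0
        for h in hash_values[i - k + 1:i + 1]:
            acc ^= h                       # XOR of the last k hashes
        if (acc & mask) in patterns:
            positions.append(i)            # boundary end position
    return positions
```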
20160042007 | CONTENT ALIGNED BLOCK-BASED DEDUPLICATION - A content alignment system according to certain embodiments aligns a sliding window at the beginning of a data segment. The content alignment system performs a block alignment function on the data within the sliding window. A deduplication block is established if the output of the block alignment function meets a predetermined criteria. At least part of a gap is established if the output of the block alignment function does not meet the predetermined criteria. The predetermined criteria is changed if a threshold number of outputs fail to meet the predetermined criteria. | 02-11-2016 |
20160042008 | TECHNIQUE SELECTION IN A DEDUPLICATION AWARE CLIENT ENVIRONMENT - Techniques and mechanisms described herein facilitate the transmission of a data stream to a networked storage system. According to various embodiments, a determination may be made as to whether an amount of available computing resources at a client device meets or exceeds a computing resource availability threshold at the client device. A processing operation on a data stream may be performed at the client device to produce a pre-processed data stream when the amount of available computing resources meets or exceeds the computing resource availability threshold. The pre-processed data stream may be transmitted to a networked storage system for storage via a network. The networked storage system may be operable to store deduplicated data for retrieval via the network. | 02-11-2016 |
20160042016 | Deleting Records In A Multi-Level Storage Architecture - Deleting a data record from the second level storage or main store is disclosed. A look-up is performed for the data record in the first level storage, where the data record is defined by a row identifier. If the row identifier is found in the first level storage, a look-up is performed for an updated row identifier representing an update of the data record in the second level storage and the main store, the update of the data record being defined by an updated row identifier. If the updated row identifier is found in the second level storage, an undo log is generated from the first level storage to invalidate the row identifier. A flag is generated representing an invalid updated row identifier, and a redo log is generated to restore the data record in the first level storage. | 02-11-2016 |
20160042026 | METHOD OF REDUCING REDUNDANCY BETWEEN TWO OR MORE DATASETS - A method for reducing redundancy between two or more datasets of potentially very large size. The method improves upon current technology by oversubscribing the data structure that represents a digest of data blocks and using positional information about matching data, so that very large datasets can be analyzed and their redundancies removed: having found a match on a digest, the method expands the match in both directions to detect large runs of duplicate data and replaces those runs with references to common data. The method is particularly useful for capturing the states of images of a hard disk. The method permits several files to have their redundancy removed and the files to later be reconstituted. The method is appropriate for use on a WORM device. The method can also make use of L2 cache to improve performance. | 02-11-2016 |
20160042051 | DECOMPOSING EVENTS FROM MANAGED INFRASTRUCTURES USING GRAPH ENTROPY - Methods are provided for clustering events. Data is received at an extraction engine from managed infrastructure. Events are converted into alerts and the alerts mapped to a matrix M. One or more common steps are determined from the events and clusters of events are produced relating to the alerts and or events. | 02-11-2016 |
20160048541 | AUTOMATIC TABLE CLEANUP FOR RELATIONAL DATABASES - An approach for an automatic table cleanup process, implemented in relational databases, is provided. A method includes setting up a table cleanup process in a database which is operable to perform an automatic table cleanup on a table within the database using an auto purge value associated with the table. The method further includes altering the table with a virtual column to keep track of dates on the table. The method further includes turning on an automatic table maintenance capability of the database to include and initiate the table cleanup process. The method further includes running the table cleanup process to perform the automatic table cleanup using dates which are automatically filled in during an insert or update operation on the table, the table cleanup process comprising looking through the records and automatically purging the table when the auto purge value has been met. | 02-18-2016 |
20160048542 | DATA CURATION SYSTEM WITH VERSION CONTROL FOR WORKFLOW STATES AND PROVENANCE - A data curation system that includes various methods to enable efficient reuse of human and machine effort. To reuse effort, various facilities are presented that model, save, and allow the querying of provenance and state information of a curation workflow and allow for incremental, stateful transitions of the data and the metadata. | 02-18-2016 |
20160055200 | SCALABLE DEDUPLICATION SYSTEM AND METHOD - A system and method for data deduplication includes a first computer device that determines whether a data item is a duplicate. If the data item is not a duplicate, the first computer device transmits a request to add an entry for the data item in a deduplication table of a deduplication database. The database adds the entry for the data item while enforcing uniqueness of data across one or more data fields of the deduplication table, where, in enforcing the uniqueness, the database denies an attempt by a second computer device to add an entry in the deduplication table for the same data item. | 02-25-2016 |
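Delegating uniqueness enforcement to the database, as described above, maps naturally onto a UNIQUE constraint; this SQLite sketch is illustrative, not the patented system:

```python
import sqlite3

def add_entry(conn, digest, location):
    """Insert a dedup-table row; the UNIQUE constraint makes the
    database itself reject a second entry for the same digest."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO dedup(digest, location) VALUES (?, ?)",
                (digest, location))
        return True
    except sqlite3.IntegrityError:
        return False                    # another device won the race

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dedup(digest TEXT UNIQUE, location TEXT)")
```

Because the constraint is enforced inside the database, two devices can attempt the insert concurrently and exactly one succeeds.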
20160063022 | RAPID INDEXING OF DOCUMENT TAGS - Document tags are rapidly indexed using a text based index and a graph index. A tag signal is received. A tag and a type of the tag that are located in the tag signal are stored in a data store. The tag is indexed as a tag document in the text based index. One or more relationships between the tag and a content document are managed in the graph index. | 03-03-2016 |
20160063024 | NETWORK STORAGE DEDUPLICATING METHOD AND SERVER USING THE SAME - A network storage deduplicating method and a server using the same method are proposed. The method includes the following steps: receiving a first data through an Internet small computer system interface protocol; calculating identification information of the first data; determining whether a second data having the identification information is already stored in the server; if yes, generating and storing a pointer pointing to the second data and neglecting the first data. | 03-03-2016 |
20160070714 | LOW-OVERHEAD RESTARTABLE MERGE OPERATION WITH EFFICIENT CRASH RECOVERY - A low-overhead merge technique enables restart of a merge operation with minimal logging of state information relating to progress of the merge operation by a volume layer of a storage input/output (I/O) stack executing on one or more nodes of a cluster. The technique enables restart of the merge operation by ensuring that metadata, i.e., metadata pages, generated during the merge operation is not subject to de-duplication by providing a unique value in each metadata page that distinguishes the page, i.e., renders the page distinct or “unique”, from other metadata pages in an extent store. In addition, the technique ensures that a reference count on each metadata page is a value denoting a lack of de-duplication. To that end, the extent store layer is configured to not increment the reference count for a metadata page if, during the merge operation, the page is identical (and thus subject to deduplication) to an existing metadata page in the extent store. | 03-10-2016 |
20160070715 | STORING DATA IN A DISTRIBUTED FILE SYSTEM - A device for storing data in a distributed file system, the distributed file system including a plurality of deduplication storage devices, includes a determination unit configured to determine a characteristic of first data to be stored in the distributed file system; an identification unit configured to identify one of the deduplication storage devices of the distributed file system as deduplication storage device for the first data based on the characteristic of the first data; and a storing unit configured to store the first data in the identified deduplication storage device such that the first data and second data being redundant to the first data are deduplicatable within the identified deduplication storage device. | 03-10-2016 |
20160070716 | SYNCHRONIZATION OF A SERVER SIDE DEDUPLICATION CACHE WITH A CLIENT SIDE DEDUPLICATION CACHE - A server computational device maintains commonly occurring duplicate chunks of deduplicated data that have already been stored in a server side repository via one or more client computational devices. The server computational device provides a client computational device with selected elements of the commonly occurring duplicate chunks of deduplicated data, in response to receiving a request by the server computational device from the client computational device to prepopulate, refresh or update a client side deduplication cache maintained in the client computational device. | 03-10-2016 |
20160070724 | DATA QUALITY ANALYSIS AND CLEANSING OF SOURCE DATA WITH RESPECT TO A TARGET SYSTEM - A system transfers data between source systems and a target system. The system determines a domain score for data domains of source data from the source systems based on data quality metrics for the target system. The domain score indicates data quality with respect to the target system. Corresponding processes of the target system are identified for the data domains, and a process score is determined for the identified processes based on a corresponding domain score. The process score indicates data quality with respect to the identified processes. The system cleanses the source data based on the domain score and/or process score, and validates the cleansed source data against the target system for transference. Embodiments of the present invention further include a method and computer program product for transferring data between source systems and a target system in substantially the same manner described above. | 03-10-2016 |
20160070725 | DATA QUALITY ANALYSIS AND CLEANSING OF SOURCE DATA WITH RESPECT TO A TARGET SYSTEM - A system transfers data between source systems and a target system. The system determines a domain score for data domains of source data from the source systems based on data quality metrics for the target system. The domain score indicates data quality with respect to the target system. Corresponding processes of the target system are identified for the data domains, and a process score is determined for the identified processes based on a corresponding domain score. The process score indicates data quality with respect to the identified processes. The system cleanses the source data based on the domain score and/or process score, and validates the cleansed source data against the target system for transference. Embodiments of the present invention further include a method and computer program product for transferring data between source systems and a target system in substantially the same manner described above. | 03-10-2016 |
20160077924 | SELECTING A STORE FOR DEDUPLICATED DATA - A technique includes communicating a plurality of hashes associated with chunks of an object to at least some stores of a plurality of stores on which the object is distributed; and in response to the communication, receiving responses indicating a distribution of the associated chunks. The technique includes selecting one of the stores based at least in part on the responses and communicating deduplicated data associated with the object to the selected store. | 03-17-2016 |
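Selecting the store that already holds the most of the object's chunks, as the entry above describes, can be sketched as a set-overlap maximization (representing each store's response as a set of held hashes is an assumed simplification):

```python
def select_store(object_chunk_hashes, store_catalogs):
    """Pick the store whose catalog already contains the most of the
    object's chunk hashes, minimizing the deduplicated data to send.

    store_catalogs: dict mapping store name -> set of chunk hashes
    that store reported holding."""
    return max(store_catalogs,
               key=lambda s: len(object_chunk_hashes & store_catalogs[s]))
```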
20160078068 | FAST DEDUPLICATION DATA VERIFICATION - An information management system provides a data deduplication system that uses a primary table, a deduplication chunk table, and a chunk integrity table to ensure that a referenced deduplicated data block is only verified once during the data verification of a backup or other replication operation. The data deduplication system may reduce the computational and storage overhead associated with traditional data verification processes. The primary table, the deduplication chunk table, and the chunk integrity table, all of which are stored in a deduplication database, can also ensure synchronization between the deduplication database and secondary storage devices. | 03-17-2016 |
20160085682 | Caching Methodology for Dynamic Semantic Tables - A method for caching includes determining a degree of relatedness for a database entry stored in a concept table. The concept table is stored in cache. The degree of relatedness is based on a comparison between a concept of data of the database entry and a concept of the concept table. The method includes determining an amount of data usage for the database entry where the data usage includes an amount of usage of the database entry while in cache. The method includes determining a cache flushing rating for the database entry. The cache flushing rating is determined from the degree of relatedness of the database entry and the amount of data usage of the database entry. The method includes flushing the database entry from the cache in response to the cache flushing rating of the database entry being below a cache flush threshold. | 03-24-2016 |
20160085767 | TOPONYM RESOLUTION WITH ONE HUNDRED PERCENT RECALL - Various presentation systems may benefit from appropriate toponym resolution. For example, a system such as a search engine may benefit from toponym resolution with one hundred percent recall. A method can include receiving a set of geographic data comprising recognized toponyms. The method can also include recalling correctly all correctly recognized toponyms of the set. The recalling can include displaying the geographic data on a plurality of related displays. A first display can include at least a subset of the set. A second display can include an overview of the set. | 03-24-2016 |
20160085792 | SYSTEMS AND METHODS FOR LARGE-SCALE SYSTEM LOG ANALYSIS, DEDUPLICATION AND MANAGEMENT - Systems and methods for parsing raw log data into structured log data, removing duplicate entries, storing the deduplicated log data in binary format, and managing system events. The subject matter can increase the speed of log data analysis and storage, reduce data storage for log data, and ease management of system events. | 03-24-2016 |
20160085807 | Deriving a Multi-Pass Matching Algorithm for Data De-Duplication - Methods, systems, and computer program products for deriving a multi-pass matching algorithm for data de-duplication are provided herein. A method includes identifying multiple passes across multiple databases using a set of one or more blocking columns derived from a set of trained input data; identifying, in each of the multiple passes, one or more columns across the multiple databases that match one or more of the blocking columns; selecting a given pass from the multiple passes, wherein said given pass comprises a maximum number of matching columns within the multiple passes; determining, for the given pass, data that conform to the given pass comprising (i) a set of matching columns, (ii) one or more matching types and (iii) one or more weights; and determining one or more subsequent passes across the multiple databases iteratively by removing the data that conform to the given pass. | 03-24-2016 |
20160092312 | DEDUPLICATED DATA DISTRIBUTION TECHNIQUES - In connection with a data distribution architecture, client-side “deduplication” techniques may be utilized for data transfers occurring among various file system nodes. In some examples, these deduplication techniques involve fingerprinting file system elements that are being shared and transferred, and dividing each file into separate units referred to as “blocks” or “chunks.” These separate units may be used for independently rebuilding a file from local and remote collections, storage locations, or sources. The deduplication techniques may be applied to data transfers to prevent unnecessary data transfers, and to reduce the amount of bandwidth, processing power, and memory used to synchronize and transfer data among the file system nodes. The described deduplication concepts may also be applied for purposes of efficient file replication, data transfers, and file system events occurring within and among networks and file system nodes. | 03-31-2016 |
20160092477 | DETECTION AND QUANTIFYING OF DATA REDUNDANCY IN COLUMN-ORIENTED IN-MEMORY DATABASES - Methods, systems, and computer-readable storage media for quantifying a redundancy of data stored in tables of a database. In some implementations, actions include, for each primary key and table pair in a set of primary key and table pairs, determining an aggregate severity sub-score based on one or more values of the primary key in the table, the primary key being included in a set of primary keys and the table being included in a set of tables, determining an aggregate severity score for each primary key in the set of primary keys based on aggregate severity sub-scores associated with the primary key to provide a plurality of aggregate severity scores, each aggregate severity score indicating a relative redundancy of values of the primary key across all tables in the set of tables, and providing a list of aggregate severity scores and corresponding primary keys for display to a user. | 03-31-2016 |
20160092478 | DELETING TUPLES USING SEPARATE TRANSACTION IDENTIFIER STORAGE - Data from a database object are processed. Transaction information for a set of data of the database object is stored separate from the set of data in an allocated storage space, where the transaction information indicates visibility of the set of data to other transactions. A map structure is generated indicating storage of the set of data and the allocated storage space of the transaction information. The transaction information is altered in response to a transaction to the set of data to alter visibility of the set of data. Altering the transaction information is accomplished by providing updated transaction information within a new storage space in accordance with the transaction to the set of data and generating a descriptor for the transaction indicating an existing location of the set of data and the new storage space. | 03-31-2016 |
20160092479 | DATA DE-DUPLICATION - A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein. | 03-31-2016 |
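The similarity-score merge step above can be sketched with a generic string similarity; the metric (`difflib` ratio) and the threshold are illustrative assumptions, and the column-pivoting stage is omitted:

```python
from difflib import SequenceMatcher

def dedupe_records(records, threshold=0.85):
    """Keep the first of any group of records whose pairwise similarity
    score exceeds `threshold`; later near-duplicates are merged away."""
    kept = []
    for record in records:
        if any(SequenceMatcher(None, record, existing).ratio() > threshold
               for existing in kept):
            continue                     # near-duplicate of a kept record
        kept.append(record)
    return kept
```

A fuller implementation would merge field values from the duplicate into the surviving record rather than discarding it outright.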
20160092494 | DATA DE-DUPLICATION - A method, executed by a computer, for de-duplicating data includes receiving a dataset, pivoting the dataset along a set of columns that have a common domain to provide a pivoted dataset, de-duplicating the pivoted dataset to provide a de-duplicated dataset, and using the de-duplicated dataset. De-duplicating the pivoted dataset may include computing similarity scores for records that have different primary keys and merging records that have a similarity score that exceeds a selected threshold value. The method may include determining the set of columns having a common domain by referencing a business catalog and/or conducting a data classification operation on some or all of the columns of the dataset. The method may also include pivoting the dataset along another set of columns that have a different common domain. A computer system and computer program product corresponding to the method are also disclosed herein. | 03-31-2016 |
20160092495 | DELETING TUPLES USING SEPARATE TRANSACTION IDENTIFIER STORAGE - Data from a database object are processed. Transaction information for a set of data of the database object is stored separate from the set of data in an allocated storage space, where the transaction information indicates visibility of the set of data to other transactions. A map structure is generated indicating storage of the set of data and the allocated storage space of the transaction information. The transaction information is altered in response to a transaction to the set of data to alter visibility of the set of data. Altering the transaction information is accomplished by providing updated transaction information within a new storage space in accordance with the transaction to the set of data and generating a descriptor for the transaction indicating an existing location of the set of data and the new storage space. | 03-31-2016 |
20160092496 | REMOVAL OF GARBAGE DATA FROM A DATABASE - Elements of a database object are removed. The database object is stored as a plurality of different object portions, where each object portion is associated with one or more versions of transaction identifiers stored separately from the database object. An oldest transaction identifier is determined for a transaction for which data portions of the database object remain visible. Each object portion is examined, and object portions with a threshold amount of data to remove are determined based on a comparison of the transaction identifiers for those object portions and the oldest transaction identifier. Data from the database object are removed in response to determining that a sufficient quantity of data is to be removed from object portions containing the threshold amount of data. | 03-31-2016 |
20160103868 | EFFICIENT CALCULATION OF SIMILARITY SEARCH VALUES AND DIGEST BLOCK BOUNDARIES FOR DATA DEDUPLICATION - For efficient calculation of both similarity search values and boundaries of digest blocks in data deduplication, input data is partitioned into chunks, and for each chunk a set of rolling hash values is calculated. A single linear scan of the rolling hash values is used to produce both similarity search values and boundaries of the digest blocks of the chunk. The rolling hash values are used to contribute to the calculation of the similarity search values and to the calculation of the boundaries of the digest blocks. | 04-14-2016 |
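The single linear scan described in the entry above can be sketched as follows. The polynomial rolling hash, the minimum-hash similarity value, and the low-bits boundary condition are all assumed illustrative choices, not the patented scheme:

```python
def rolling_hashes(data: bytes, window: int = 16):
    # Polynomial rolling hash over a sliding window (illustrative parameters).
    base, mod = 257, (1 << 31) - 1
    drop = pow(base, window, mod)   # weight of the byte leaving the window
    h, out = 0, []
    for i, b in enumerate(data):
        h = (h * base + b) % mod
        if i >= window:
            h = (h - data[i - window] * drop) % mod
        if i >= window - 1:
            out.append(h)
    return out

def single_scan(data: bytes, window: int = 16, mask: int = 0xFF):
    # One linear pass over the rolling hash values yields both a similarity
    # search value (here: the minimum hash) and digest-block boundaries
    # (here: positions where the low bits of the hash are all set).
    similarity, boundaries = None, []
    for pos, h in enumerate(rolling_hashes(data, window), start=window):
        if similarity is None or h < similarity:
            similarity = h
        if h & mask == mask:
            boundaries.append(pos)
    return similarity, boundaries
```

The point of the single pass is that each rolling hash value contributes to both outputs, so the chunk is scanned only once.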
20160110261 | CLOUD STORAGE USING MERKLE TREES - Efficient cloud storage systems, methods, and media are provided herein. Exemplary methods may include storing a data stream on a client side de-duplicating block store of a client device, generating a data stream Merkle tree of the data stream, storing a secure hash algorithm (SHA) key for the data stream Merkle tree, as well as the data stream Merkle tree on the client side de-duplicating block store, recursively iterating through the data stream Merkle tree using an index of a snapshot Merkle tree of the client device that is stored on a cloud data center to determine missing Merkle nodes or missing data blocks which are present in the data stream Merkle tree but not present in the snapshot Merkle tree stored on the cloud data center, and transmitting over a wide area network (WAN) the missing data blocks to the cloud data center. | 04-21-2016 |
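The recursive Merkle-tree walk described in the entry above can be sketched in Python. The binary tree layout and the hash-concatenation rule are assumptions for illustration; the snapshot index stands in for the cloud-side structure:

```python
import hashlib

def sha(b: bytes) -> str:
    return hashlib.sha256(b).hexdigest()

def build_merkle(blocks):
    # Leaves hash the data blocks; each parent hashes its children's hashes.
    level = [sha(b) for b in blocks]
    children = {h: () for h in level}   # node hash -> child hashes
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            pair = tuple(level[i:i + 2])
            parent = sha("".join(pair).encode())
            children[parent] = pair
            nxt.append(parent)
        level = nxt
    return level[0], children

def missing_nodes(node, children, snapshot_index):
    # Recursively iterate through the data-stream tree, pruning any subtree
    # whose hash the cloud-side snapshot index already contains; what is
    # left is exactly what must be transmitted over the WAN.
    if node in snapshot_index:
        return []
    found = [node]
    for child in children.get(node, ()):
        found += missing_nodes(child, children, snapshot_index)
    return found
```

With an up-to-date snapshot the walk terminates at the root; after a local edit only the changed leaf and its ancestor path are reported as missing.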
20160110354 | MATCHING OBJECTS USING KEYS BASED ON MATCH RULES - Matching objects using keys based on match rules is described. A system generates a match rule key based on a match rule, wherein the match rule specifies whether two objects match. The system creates candidate keys by applying the match rule key to data objects. The system creates a probe key by applying the match rule key to a probe object. The system determines whether the probe key matches a candidate key. The system determines whether the probe object matches a candidate object based on applying the match rule to the probe object and the candidate object if the probe key matches the candidate key corresponding to the candidate object. The system identifies the probe object and the candidate object as matching based on the match rule if the probe object matches the candidate object. | 04-21-2016 |
20160110375 | MANAGING DELETION OF DATA IN A DATA STORAGE SYSTEM - In certain embodiments, a system comprises a memory and a processor communicatively coupled to the memory. The memory includes executable instructions that upon execution cause the system to generate, at a first time, a first snapshot capturing data stored in storage units of a storage device. The executable instructions upon execution cause the system to receive an indication to delete at least a first portion of the data in the storage units and captured by the first snapshot, and to mark, in response to receiving the indication, the one or more storage units that store the at least a first portion of the data as available. The executable instructions upon execution cause the system to generate, at a second time subsequent to the first time, a second snapshot that omits the one or more storage units marked as available. | 04-21-2016 |
20160110388 | DEDUPLICATION IN A STORAGE SYSTEM - An IO handler receives a write command including write data that is associated with an LBA. The IO handler reserves a deduplication ID according to the LBA with which the write data is associated; within the scope of each LBA, each deduplication ID is unique. The IO handler computes a hash value for the write data. In case a deduplication database does not include an entry which is associated with the hash value, the IO handler: provides a reference key which is a combination of the LBA and the deduplication ID; adds to the deduplication database an entry which is uniquely associated with the hash value and references the reference key; and adds to a virtual address database an entry including: the reference key; a reference indicator indicating whether there is an entry that is associated with the present entry; and a pointer to where the write data is stored. | 04-21-2016 |
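The write path in the entry above can be sketched with plain dictionaries. The hash function, the in-memory database layout, and the list standing in for physical storage are illustrative assumptions:

```python
import hashlib

dedup_db = {}   # hash value -> reference key (LBA, dedup ID)
vaddr_db = {}   # reference key -> {referenced flag, pointer to stored data}
next_id = {}    # per-LBA counter: dedup IDs are unique within each LBA's scope
store = []      # stand-in for physical storage

def handle_write(lba: int, data: bytes):
    h = hashlib.sha256(data).hexdigest()
    if h in dedup_db:                       # duplicate: reference existing copy
        ref = dedup_db[h]
        vaddr_db[ref]["referenced"] = True
        return ref
    dedup_id = next_id.get(lba, 0)          # reserve a dedup ID for this LBA
    next_id[lba] = dedup_id + 1
    ref = (lba, dedup_id)                   # reference key = LBA + dedup ID
    store.append(data)
    dedup_db[h] = ref                       # entry uniquely tied to the hash
    vaddr_db[ref] = {"referenced": False, "ptr": len(store) - 1}
    return ref
```

A second write of the same payload, even at a different LBA, returns the existing reference key rather than storing a second copy.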
20160117344 | USER RE-ENGAGEMENT WITH ONLINE PHOTO MANAGEMENT SERVICE - An online photo management service that stores a collection of photos belonging to a user can send re-engagement messages to the user that can include photos automatically selected from the collection. The selection can be based on a scoring algorithm that rates the photos according to a set of attributes and computes a score based on the attributes and a set of weights. Based on user responses to re-engagement messages, the weights can be tuned to more reliably select photos likely to result in user re-engagement with the stored collection of photos. | 04-28-2016 |
20160117349 | COLLECTIVE RECONCILIATION - Methods, systems, and computer-readable media are provided for collective reconciliation. In some implementations, a collective reconciliation module may remove duplicate entries from merged data sources. The collective reconciliation module may identify a first entity reference in a first data source and may identify one or more entity references in a second data source based on an identifier match. The collective reconciliation module may generate a set of pairings defined by the first entity reference with each of a subset of the one or more entity references based on an iterative analysis of common attributes for the set of pairings. The collective reconciliation module may determine whether a commonality exists for each of the set of pairings. The collective reconciliation module may merge the first data source and the second data source, wherein duplications are identified based at least in part on the determination. | 04-28-2016 |
20160124985 | PRESERVING REDUNDANCY IN DATA DEDUPLICATION SYSTEMS BY DESIGNATION OF VIRTUAL ADDRESS - Various embodiments for preserving data redundancy of identical data in a data deduplication system in a computing environment are provided. In one embodiment, a method for such preservation is disclosed. A selected range of virtual addresses of a virtual storage device in the computing environment is designated as not subject to a deduplication operation. Other system and computer program product embodiments are disclosed and provide related advantages. | 05-05-2016 |
20160125021 | EFFICIENT UPDATES IN NON-CLUSTERED COLUMN STORES - The processing of transaction-oriented data tends to be row-oriented, while the processing of analytical operations tends to be column-oriented. Various systems, sometimes referred to as operational data warehouses, may comprise mechanisms adapted for use in scenarios where both transactional data processing and analytical queries are to be performed efficiently. The operational data warehouse (ODW) may store and update data efficiently by maintaining a table in structures comprising a column store, a delta store, a delete bitmap, and a delete buffer. In this environment, key values may be associated with each row such that the ODW may more efficiently seek rows. Further, rows may also be excluded from a column store based at least in part on a filter criterion. The filter criterion may be used to filter out rows based on a predicate created by a user or the system. | 05-05-2016 |
20160125169 | DUPLICATION DETECTION IN CLINICAL DOCUMENTATION - Methods, systems, and computer-readable media are provided to detect similarities in clinical documents that might be inaccurate or inappropriate. A first clinical document and a second clinical document that are to be compared are identified. This identification of the documents is based on times associated with the first and second clinical documents, an identity of clinicians who authored the first and second clinical documents, an identity of patients associated with the first and second clinical documents, a type of the first and second clinical documents, or contents of the first and second clinical documents. A portion of the first clinical document is compared to a portion of the second clinical document. A report is automatically generated, where the report indicates the similarities between the portion of the first clinical document and the portion of the second clinical document that are potentially inaccurate or inappropriate. | 05-05-2016 |
20160132519 | APPLYING A MINIMUM SIZE BOUND ON CONTENT DEFINED SEGMENTATION OF DATA - A content defined minimum size bound is applied to blocks produced by content defined segmentation of data by calculating the size of the interval of data between a newly found candidate segmenting position and a last candidate segmenting position of the same or a higher hierarchy level, and then discarding the newly found candidate segmenting position if the size of the interval of data is lower than the minimum size bound, or retaining it if the size of the interval is not lower than the minimum size bound or if there is no last candidate segmenting position of a same or higher hierarchy level as the newly found candidate segmenting position. When a last candidate segmenting position of a same or higher hierarchy level becomes available, the evaluation is reiterated to converge the edge segmenting positions of the outputs of consecutive calculation units. | 05-12-2016 |
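The discard/retain rule in the entry above can be sketched directly. Representing candidates as (position, hierarchy level) pairs is an assumption for illustration:

```python
def apply_min_size_bound(candidates, min_size):
    # candidates: (position, hierarchy_level) pairs in ascending position
    # order. A newly found candidate is discarded when the interval back to
    # the last retained candidate of the same or a higher level is below the
    # minimum size bound; otherwise (or when no such candidate exists yet)
    # it is retained.
    retained = []
    for pos, level in candidates:
        last = next((p for p, l in reversed(retained) if l >= level), None)
        if last is None or pos - last >= min_size:
            retained.append((pos, level))
    return retained
```

Note that a higher-level candidate is never blocked by a lower-level one, which matches the "same or higher hierarchy level" condition in the abstract.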
20160132523 | EXPLOITING NODE-LOCAL DEDUPLICATION IN DISTRIBUTED STORAGE SYSTEM - Data deduplication is carried out in a storage system in which a set of volumes of data is distributed among a plurality of servers. The technique comprises computing a similarity metric among volumes of the set and making a determination that a difference in the similarity metric is less than a predetermined threshold value. Responsive to the determination, the data of the volumes of the set is migrated within their respective servers to distribute the migrated data in like manner in the respective servers. Thereafter, data deduplication is performed on the respective servers. | 05-12-2016 |
20160132524 | OBJECT DEDUPLICATION AND APPLICATION AWARE SNAPSHOTS - Embodiments deploy delayering techniques so that the relationships between successive versions of a rich-media file become apparent. With this, modified rich-media files present far smaller storage overhead as compared to traditional application-unaware snapshot and versioning implementations. Optimized file data is stored in suitcases. As a file is versioned, each new version of the file is placed in the same suitcase as the previous version, allowing embodiments to employ correlation techniques to enhance optimization savings. | 05-12-2016 |
20160139997 | DATASETS PROFILING TOOLS, METHODS, AND SYSTEMS - A dataset profiling tool configured to identify unique and non-unique column combinations in a dataset which comprises a plurality of tuples, the tool including: an inserts handler module configured to: receive one or more new tuples for insertion into the dataset, receive one or more minimal uniques and one or more maximal non-uniques for the dataset, identify and group, for each minimal unique, any tuples of the dataset and any of the one or more new tuples which contain duplicate values in the column combinations of the minimal unique, to form grouped tuples which are grouped according to the minimal unique to which the tuples relate, validate the grouped tuples to identify supersets of the minimal uniques for which duplicate values were identified, to generate a new set of one or more minimal uniques and one or more maximal non-uniques, and output the new set of one or more updated minimal uniques and one or more maximal non-uniques. | 05-19-2016 |
20160140137 | READ AND DELETE INPUT/OUTPUT OPERATION FOR DATABASE MANAGEMENT - A computer-implemented method for improving database management may include selecting one or more database records that are requested based on a query statement. The one or more database records may be read from a first database file, wherein the one or more database records are copied from the first database file and stored to a memory. The one or more database records may be deleted from the first database file at substantially the same time as the reading of the one or more database records, wherein the reading and the deleting occur through a single read and delete input/output (I/O) operation. | 05-19-2016 |
20160140138 | DE-DUPLICATING ATTACHMENTS ON MESSAGE DELIVERY AND AUTOMATED REPAIR OF ATTACHMENTS - Systems and techniques of de-duplicating files and/or blobs within a file system are presented. In one embodiment, an email system is disclosed wherein the email system receives email messages comprising a set of associated attachments. The system determines whether the associated attachments have been previously stored in the email system and the state of the stored attachment; if the state of the attachment is appropriate for sharing copies of the attachment, the system provides a reference to the attachment upon a request to share it. In another embodiment, the system may detect whether stored attachments are corrupted and, if so, attempt to repair the attachment, possibly prior to sharing references to the attachment. | 05-19-2016 |
20160147785 | HOST-BASED DEDUPLICATION USING ARRAY GENERATED DATA TAGS - Exemplary methods, apparatuses, and systems include a host computer detecting a request to utilize data stored at a storage address in an external storage device. The host computer, in response to the detected request, transmits a request to the storage device for a tag that uniquely identifies the data. The tag for the data is received from the storage device. In response to determining that the received tag matches a local mapping of tags stored in the host computer, the host computer utilizes the local mapping of tags to process the detected request. | 05-26-2016 |
20160147797 | OPTIMIZING DATABASE DEDUPLICATION - A method and associated systems for optimized deduplication of a database stored on multiple tiers of storage devices. A database-deduplication system, upon receiving a request to update a database record, uses memory-resident logs and previously generated database-maintenance tables to identify a first logical block that identifies an updated value, stored in a first physical block of storage, to be used to update a database record and to further identify a second logical block that stores in the database a corresponding existing value of the same record. After determining that the first and second logical blocks reside within the same storage tier, the system directs a deduplication module to associate both logical blocks with the first physical block. | 05-26-2016 |
20160147798 | DATA CLEANSING AND GOVERNANCE USING PRIORITIZATION SCHEMA - According to an embodiment of the present invention, a computer-implemented method of cleansing data is provided that comprises determining a criticality score and a complexity score for identified attributes of an enterprise, wherein the criticality score represents a relevance of an attribute to one or more enterprise dimensions and the complexity score represents the complexity of cleansing data for an attribute. The identified attributes are prioritized for data cleansing based on the criticality and complexity scores, and data of the identified attributes is cleansed in accordance with the priority of the identified attributes. Embodiments further include a system, apparatus and computer readable media to cleanse data in substantially the same manner as described above. | 05-26-2016 |
20160147800 | Data Processing Method and System and Client - A data processing method and system and a client, where a target storage node is determined by comparing a second vector of received data against first vectors that correspond to all storage nodes and are prestored on the client that receives the data. The target storage node no longer needs to be determined by extracting some fingerprint values as samples from received data, sending the fingerprint values to all storage nodes in a data processing system for query, and waiting for a feedback from the storage nodes. | 05-26-2016 |
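The client-side node selection in the entry above reduces to a nearest-vector lookup. A minimal sketch, assuming a dot-product similarity as a stand-in for the patented comparison:

```python
def pick_target_node(node_vectors, data_vector):
    # Compare the received data's vector against the prestored per-node
    # vectors and return the most similar node. Dot product stands in for
    # whatever comparison the actual scheme uses; no round trip to the
    # storage nodes is needed.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return max(node_vectors, key=lambda node: dot(node_vectors[node], data_vector))
```

Because the first vectors are prestored on the client, the decision is purely local, which is the stated advantage over querying every node and waiting for feedback.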
20160154816 | SCALABLE MECHANISM FOR DETECTION OF COMMONALITY IN A DEDUPLICATED DATA SET | 06-02-2016 |
20160154830 | SYSTEMS AND METHODS FOR DATA INTEGRATION | 06-02-2016 |
20160154839 | SECURITY FOR MULTI-TENANT DEDUPLICATION DATASTORE AGAINST OTHER TENANTS | 06-02-2016 |
20160154840 | AVOID DOUBLE COUNTING OF MAPPED DATABASE DATA | 06-02-2016 |
20160162218 | DISTRIBUTED DATA DEDUPLICATION IN ENTERPRISE NETWORKS - Distributed data deduplication may include or utilize containers attached to nodes or byte caches in a cluster or enterprise network. The containers may store a mapping of byte caches and the hashes the byte caches hold. An encoding byte cache may communicate with its attached container to determine which nodes should send which hash values, and may encode an output stream accordingly. A decoding byte cache decompresses the output stream by communicating with its attached container to receive hash values and associated content from one or more byte caches specified in the output stream. | 06-09-2016 |
20160162368 | REMOTE STORAGE - Remote storage of consumer data is achieved by processing consumer data for deduplication at a client computing system that includes creating metadata comprising information relating to a consumer directory tree structure of the consumer data, and transferring the deduplicated data and metadata for remote storage. | 06-09-2016 |
20160162507 | AUTOMATED DATA DUPLICATE IDENTIFICATION - In an approach to identifying duplicates in data, one or more computer processors receive a request from a user to identify duplicates in a data set. The one or more computer processors retrieve the data set utilizing data discovery. The one or more computer processors perform data profiling on the data set. The one or more computer processors determine one or more domain types of the data set, based, at least in part, on the performed data profiling. The one or more computer processors perform data standardization on the data set, based, at least in part, on the one or more determined domain types. Responsive to performing data standardization, the one or more computer processors perform probabilistic matching on the data set. The one or more computer processors identify two or more duplicates in the data set, based, at least in part, on the probabilistic matching. | 06-09-2016 |
20160162508 | MANAGING DEDUPLICATION IN A DATA STORAGE SYSTEM USING A BLOOMIER FILTER DATA DICTIONARY - A method including maintaining a library having a plurality of storage tablets, each storage tablet storing a plurality of hash-to-storage mappings, each mapping a hash value to a storage location at which a block of data is stored, the block of data translating to the hash value pursuant to a hashing algorithm. The method also including upon receipt and/or determination of a new hash for incoming data pursuant to the hashing algorithm: a) querying a tablet cache for a hash-to-storage mapping having the new hash, the tablet cache comprising a subset of storage tablets copied from the library; and/or b) querying a secondary index for a hash-to-storage tablet mapping having the new hash, the secondary index including a plurality of filters, each filter mapping each of a plurality of key hashes to a storage tablet of the library storing that particular key hash in a hash-to-storage mapping. | 06-09-2016 |
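The two-level lookup in the entry above, tablet cache first, secondary index second, can be sketched as below. The entry names Bloomier filters for the secondary index; an exact dict stands in here, and the promote-whole-tablet policy is an assumption:

```python
class DedupDictionary:
    def __init__(self, library):
        self.library = library      # tablet id -> {hash: storage location}
        self.cache = {}             # tablet cache: subset copied from library
        # Secondary index: key hash -> tablet of the library storing it. The
        # patent uses compact (Bloomier-style) filters; an exact dict stands
        # in for illustration.
        self.secondary = {h: tid for tid, m in library.items() for h in m}

    def lookup(self, new_hash):
        for mapping in self.cache.values():    # a) query the tablet cache
            if new_hash in mapping:
                return mapping[new_hash]
        tid = self.secondary.get(new_hash)     # b) query the secondary index
        if tid is None:
            return None                        # hash unseen: new, unique data
        self.cache[tid] = self.library[tid]    # pull the whole tablet into cache
        return self.cache[tid][new_hash]
```

Promoting the whole tablet exploits locality: nearby hashes in the same tablet are likely to be queried next and will then hit the cache directly.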
20160171009 | METHOD AND APPARATUS FOR DATA DEDUPLICATION | 06-16-2016 |
20160171021 | RECOVERING FROM A PENDING UNCOMPLETED REORGANIZATION OF A DATA SET | 06-16-2016 |
20160179836 | METHOD FOR UPDATING DATA TABLE OF KEYVALUE DATABASE AND APPARATUS FOR UPDATING TABLE DATA | 06-23-2016 |
20160188583 | METHOD AND SYSTEM FOR METADATA MODIFICATION - The present invention provides a method for modifying a first storage medium having a plurality of files, the method including providing a first modification tool; operatively coupling the first storage medium to the modification tool, wherein the operatively coupling includes bypassing a first operating system used to access the plurality of files; and dematerializing, using the first modification tool, at least a first file to form one or more dematerialized files. In some embodiments, the present invention provides a modification system for modifying a first storage medium having a plurality of files, the system including a first modification tool that includes an attachment module configured to operatively couple the modification tool to the first storage medium such that a first operating system used to access the plurality of files is bypassed; and a dematerialization module configured to dematerialize at least a first file to form one or more dematerialized files. | 06-30-2016 |
20160188589 | TECHNOLOGIES FOR COMPUTING ROLLING HASHES - Technologies for computing rolling hashes include a computing device having a first hash table that includes a first plurality of random-valued entries and a second hash table that includes a second plurality of random-valued entries. The computing device retrieves a block of data from a data buffer and generates a hash based on the block of data, a previously generated hash, the first hash table, and the second hash table. The computing device further determines whether the generated hash matches a predefined trigger and records a data boundary in response to a determination that the generated hash matches the trigger. | 06-30-2016 |
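The two-table rolling hash in the entry above matches the shape of a buzhash-style scheme, where the second table is the first table pre-rotated by the window size. The window length, mask trigger, and seed are illustrative assumptions:

```python
import random

WINDOW = 48
rng = random.Random(2016)
T1 = [rng.getrandbits(32) for _ in range(256)]   # first table: random entries

def rotl(x, n):
    n %= 32
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

# Second table: the first table pre-rotated by the window size, used to
# cancel the contribution of the byte leaving the window.
T2 = [rotl(e, WINDOW) for e in T1]

def boundaries(data: bytes, mask: int = 0xFF):
    h, out = 0, []
    for i, b in enumerate(data):
        h = rotl(h, 1) ^ T1[b]                 # mix in the incoming byte
        if i >= WINDOW:
            h ^= T2[data[i - WINDOW]]          # remove the outgoing byte
        if h & mask == mask:                   # hash matches the trigger
            out.append(i + 1)                  # record a data boundary
    return out
```

Because the hash depends only on the last WINDOW bytes, boundaries past the first window of a shared payload line up even when the preceding data differs, which is the content-defined property such hashes are used for.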
20160188694 | CLUSTERS OF POLYNOMIALS FOR DATA POINTS - A method, system and storage device are generally directed to determining, for each of a plurality of data points, a neighborhood of data points about each such data point. For each such neighborhood of data points, a projection set of polynomials is generated based on candidate polynomials. The projection set of polynomials evaluated on the neighborhood of data points is subtracted from the plurality of candidate polynomials evaluated on the neighborhood of data points to generate a subtraction matrix of evaluated resulting polynomials. The singular value decomposition of the subtraction matrix is then computed. The resulting polynomials are clustered into multiple clusters and then partitioned based on a threshold. | 06-30-2016 |
20160188700 | OPTIMIZED PLACEMENT OF DATA - The disclosed embodiments included a system, apparatus, method, and computer program product for optimizing the placement of data utilizing cloud-based IT services. The apparatus comprises a processor that executes computer-readable program code embodied on a computer program product. By executing that computer-readable program code, the processor extracts content from data and determines the context in which that data was generated, modified, and/or accessed. The processor also classifies the data based on its content and context, determines the cost of storing the data at each of a plurality of locations, and specifies which of those locations the data is to be stored at based on the classification of that data and the cost of storing that data at each of the plurality of locations. | 06-30-2016 |
20160196275 | HEAT INDICES FOR FILE SYSTEMS AND BLOCK STORAGE | 07-07-2016 |
20160196305 | ANALYZING USER BEHAVIORS | 07-07-2016 |
20160203156 | METHOD, APPARATUS AND SYSTEM FOR DATA ANALYSIS | 07-14-2016 |
20160203187 | SYSTEM AND METHOD FOR GENERATING SOCIAL SUMMARIES | 07-14-2016 |
20160253348 | PURGING USER DATA FROM VEHICLE MEMORY | 09-01-2016 |
20160253351 | Scalable Grid Deduplication | 09-01-2016 |
20160253362 | METHOD, DEVICE, NODE AND SYSTEM FOR MANAGING FILE IN DISTRIBUTED DATA WAREHOUSE | 09-01-2016 |
20160253363 | TWO-PHASE CONSTRUCTION OF DATA GRAPHS FROM DISPARATE INPUTS | 09-01-2016 |
20160378781 | Log File Analysis to Locate Anomalies - Method and system are provided for log file analysis to locate anomalies. The method includes comparing each line of a log file with the other lines of the log file to determine duplicate and similar lines. The step of comparing includes: locating two or more duplicate lines of the log file; and locating two or more similar lines of the log file using pattern matching of a string of each of the lines. The method also includes outputting a line of the log file as a potential anomaly if it is rejected as a duplicate or a similar line. | 12-29-2016 |
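The duplicate/similar/anomaly classification in the entry above can be sketched with stdlib string matching. Using `difflib.SequenceMatcher` and a 0.8 ratio threshold are assumptions standing in for the patented pattern matching:

```python
from difflib import SequenceMatcher

def find_anomalies(lines, threshold=0.8):
    # A line is a potential anomaly when it is neither an exact duplicate of
    # nor sufficiently similar (by string pattern matching) to any other line.
    anomalies = []
    for i, line in enumerate(lines):
        others = lines[:i] + lines[i + 1:]
        if line in others:
            continue          # rejected: exact duplicate
        if any(SequenceMatcher(None, line, o).ratio() >= threshold
               for o in others):
            continue          # rejected: similar line
        anomalies.append(line)
    return anomalies
```

Routine log lines that differ only in an identifier score as similar and are filtered out, so only genuinely unusual lines surface.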
20160378796 | MATCH FIX-UP TO REMOVE MATCHING DOCUMENTS - The technology described herein provides for a match fix-up stage that removes matching documents identified for a search query that don't actually contain terms from the search query. A representation of each document (e.g., a forward index storing a list of terms for each document) is used to identify valid matching documents (i.e., documents containing terms from the search query) and invalid matching documents (i.e., documents that don't contain terms from the search query). Any invalid matching documents are removed from further processing and ranking for the search query. | 12-29-2016 |
20160378797 | CONTENT ITEM PURGING - Methods, systems, and computer readable media for content item purging are provided. A content item purger, such as may be incorporated within a local client application of a content management system running on a user device, may leverage knowledge as to which items have been uploaded to the content management system, and how long such content items have been stored on the user device, to propose items for deletion from the user device so as to reclaim storage space. A content item purger may run on one or more user devices, and may activate upon various triggering events, based on various conditions and parameters, with or without user interaction, thus maintaining available memory capacity at all times. | 12-29-2016 |
20160378856 | OPTIMIZED METHOD OF AND SYSTEM FOR SUMMARIZING UTILIZING FACT CHECKING AND DELETING FACTUALLY INACCURATE CONTENT - An optimized fact checking system analyzes and determines the factual accuracy of information and/or characterizes the information by comparing the information with source information. The optimized fact checking system automatically monitors information, processes the information, fact checks the information in an optimized manner and/or provides a status of the information. In some embodiments, the optimized fact checking system generates, aggregates, and/or summarizes content. | 12-29-2016 |
20170235496 | DATA DEDUPLICATION WITH AUGMENTED CUCKOO FILTERS | 08-17-2017 |
20170235742 | TRANSFER OF DIGITAL MEDIA OBJECTS VIA MIGRATION | 08-17-2017 |
20170235746 | METHODS AND APPARATUS FOR REMOVING A DUPLICATED WEB PAGE | 08-17-2017 |
20180024892 | USER-LEVEL QUOTA MANAGEMENT OF DATA OBJECTS STORED IN INFORMATION MANAGEMENT SYSTEMS | 01-25-2018 |
20180025011 | COMPLIANCE VIOLATION DETECTION | 01-25-2018 |
20180025018 | AVOIDING REDUNDANT PRESENTATION OF CONTENT | 01-25-2018 |
20180025019 | Platform for Analytic Applications | 01-25-2018 |
20180025046 | Reference Set Construction for Data Deduplication | 01-25-2018 |
20190147047 | OBJECT-LEVEL IMAGE QUERY AND RETRIEVAL | 05-16-2019 |
20220138168 | MAINTAINING ROW DURABILITY DATA IN DATABASE SYSTEMS - A database system operates by: receiving a plurality of row data associated with a first data source; identifying a subset of row data from the plurality of row data that includes only ones of the plurality of row data that compare favorably to maintained row durability data; generating at least one page from ones of the plurality of row data included in the subset of row data; storing the at least one page in long term storage; generating updated row durability data indicating a least favorably ordered row number of a plurality of row numbers corresponding to the subset of row data based on storing the at least one page in long term storage; and updating the maintained row durability data to indicate the least favorably ordered row number of the updated row durability data. | 05-05-2022 |