Patent application number | Description | Published |
20080320130 | VISIBILITY AND CONTROL OF WIRELESS SENSOR NETWORKS - A computer implemented technique framework, prototype tool and associated methods that provide a high degree of visibility and control over the in-field execution of software in a minimally intrusive manner wherein developer-defined correctness tests and validation logic are embedded into the sensor node itself, making in-field software testing autonomous without necessitating continuous developer participation. | 12-25-2008 |
20100088492 | SYSTEMS AND METHODS FOR IMPLEMENTING BEST-EFFORT PARALLEL COMPUTING FRAMEWORKS - Implementations of the present principles include best-effort computing systems and methods. In accordance with various exemplary aspects of the present principles, application computation requests directed to a processing platform may be intercepted and classified as either guaranteed computations or best-effort computations. Best-effort computations may be dropped to improve processing performance while minimally affecting the end result of application computations. In addition, interdependencies between best-effort computations may be relaxed to improve parallelism and processing speed while maintaining accuracy of computation results. | 04-08-2010 |
20110029471 | DYNAMICALLY CONFIGURABLE, MULTI-PORTED CO-PROCESSOR FOR CONVOLUTIONAL NEURAL NETWORKS - A coprocessor and method for processing convolutional neural networks includes a configurable input switch coupled to an input. A plurality of convolver elements are enabled in accordance with the input switch. An output switch is configured to receive outputs from the set of convolver elements to provide data to output branches. A controller is configured to provide control signals to the input switch and the output switch such that the set of convolver elements are rendered active and a number of output branches are selected for a given cycle in accordance with the control signals. | 02-03-2011 |
20110119467 | MASSIVELY PARALLEL, SMART MEMORY BASED ACCELERATOR - Systems and methods for massively parallel processing on an accelerator that includes a plurality of processing cores. Each processing core includes multiple processing chains configured to perform parallel computations, each of which includes a plurality of interconnected processing elements. The cores further include multiple smart memory blocks configured to store and process data, each memory block accepting the output of one of the plurality of processing chains. The cores communicate with at least one off-chip memory bank. | 05-19-2011 |
20110173155 | DATA AWARE SCHEDULING ON HETEROGENEOUS PLATFORMS - Systems and methods for data-aware scheduling of applications on a heterogeneous platform having at least one central processing unit (CPU) and at least one accelerator. Such systems and methods include a function call handling module configured to intercept, analyze, and schedule library calls on a processing element. The function call handling module further includes a function call interception module configured to intercept function calls to predefined libraries, a function call analysis module configured to analyze argument size and location, and a function call redirection module configured to schedule library calls and data transfers. The systems and methods also use a memory unification module, configured to keep data coherent between memories associated with the at least one CPU and the at least one accelerator based on the output of the function call redirection module. | 07-14-2011 |
20120079298 | ENERGY EFFICIENT HETEROGENEOUS SYSTEMS - Low-power systems and methods are disclosed for executing application software on a general purpose processor and a plurality of accelerators with a runtime controller. The runtime controller splits a workload across the processor and the accelerators to minimize energy. The system includes building one or more performance models in an application-agnostic manner; and monitoring system performance in real-time and adjusting the workload splitting to minimize energy while conforming to a target quality of service (QoS). | 03-29-2012 |
20120081373 | ENERGY-AWARE TASK CONSOLIDATION ON GRAPHICS PROCESSING UNIT (GPU) - A method includes configuring a shared library, stored in a memory, to be loaded into applications to intercept graphics processing unit (GPU) computation requests for different types of workload kernels corresponding to the applications. The method further includes generating a power prediction and a performance prediction for at least one candidate kernel combination for execution on a GPU responsive to the GPU computation requests. The at least one candidate kernel combination pertains to at least two of the workload kernels. The method also includes rendering a decision of whether to execute the at least one candidate kernel combination or to execute the at least two of the workload kernels pertaining thereto separately, based on the power prediction and the performance prediction. | 04-05-2012 |
20120084747 | PARTITIONED ITERATIVE CONVERGENCE PROGRAMMING MODEL - Methods and systems for iterative convergence include performing at least one global iteration. Each global iteration includes partitioning input data into multiple input data partitions according to an input data partitioning function, partitioning a model into multiple model partitions according to a model partitioning function, performing at least one local iteration using a processor to compute sub-problems formed from a model partition and an input data partition to produce multiple locally updated models, and combining the locally updated models from the at least one local iteration according to a model merging function to produce a merged model. | 04-05-2012 |
20120124591 | SCHEDULER AND RESOURCE MANAGER FOR COPROCESSOR-BASED HETEROGENEOUS CLUSTERS - A system and method for scheduling client-server applications onto heterogeneous clusters includes storing at least one client request of at least one application in a pending request list on a computer readable storage medium. A priority metric is computed for each application, where the computed priority metric is applied to each client request belonging to that application. The priority metric is determined based on estimated performance of the client request and load on the pending request list. The at least one client request of the at least one application is scheduled based on the priority metric onto one or more heterogeneous resources. | 05-17-2012 |
20120131389 | CROSS-LAYER SYSTEM ARCHITECTURE DESIGN - Methods and systems for cross-layer forgiveness exploitation include executing one or more applications using a processing platform that includes a first reliable processing core and at least one additional processing core having a lower reliability than the first processing core, modifying application execution according to one or more best-effort techniques to improve performance, and controlling parameters associated with the processing platform and the best-effort layer that control performance and error rate such that performance is maximized in a region of low hardware-software interference. | 05-24-2012 |
20120233486 | LOAD BALANCING ON HETEROGENEOUS PROCESSING CLUSTERS IMPLEMENTING PARALLEL EXECUTION - Methods and systems for managing data loads on a cluster of processors that implement an iterative procedure through parallel processing of data for the procedure are disclosed. One method includes monitoring, for at least one iteration of the procedure, completion times of a plurality of different processing phases that are undergone by each of the processors in a given iteration. The method further includes determining whether a load imbalance factor threshold is exceeded in the given iteration based on the completion times for the given iteration. In addition, the data is repartitioned by reassigning the data to the processors based on predicted dependencies between assigned data units of the data and completion times of a plurality of the processors for at least two of the phases. Further, the parallel processing is implemented on the cluster of processors in accordance with the reassignment. | 09-13-2012 |
20130055224 | OPTIMIZING COMPILER FOR IMPROVING APPLICATION PERFORMANCE ON MANY-CORE COPROCESSORS - A system and method for compiling includes parsing code of an application stored in a computer readable storage medium to identify one or more parallelizable code portions. At least one parallelizable code portion is optimized by transforming offload construct code portions to provide an optimized application. | 02-28-2013 |
20130055225 | COMPILER FOR X86-BASED MANY-CORE COPROCESSORS - A system and method for compiling includes, for a parallelizable code portion of an application stored on a computer readable storage medium, determining one or more variables that are to be transferred to and/or from a coprocessor if the parallelizable code portion were to be offloaded. A start location and an end location are determined for at least one of the one or more variables as a size in memory. The parallelizable code portion is transformed by inserting an offload construct around the parallelizable code portion and passing the one or more variables and the size as arguments of the offload construct such that the parallelizable code portion is offloaded to a coprocessor at runtime. | 02-28-2013 |
20140053131 | AUTOMATIC ASYNCHRONOUS OFFLOAD FOR MANY-CORE COPROCESSORS - Methods and systems for asynchronous offload to many-core coprocessors include splitting a loop in an input source code into a sampling sub-part, a many integrated core (MIC) sub-part, and a central processing unit (CPU) sub-part; executing the sampling sub-part with a processor to determine loop characteristics including memory- and processor-operations executed by the loop; identifying optimal split boundaries based on the loop characteristics such that the MIC sub-part will complete in a same amount of time when executed on a MIC processor as the CPU sub-part will take when executed on a CPU; and modifying the input source code to split the loop at the identified boundaries, such that the MIC sub-part is executed on a MIC processor and the CPU sub-part is concurrently executed on a CPU. | 02-20-2014 |
20140236913 | Accelerating Distributed Transactions on Key-Value Stores Through Dynamic Lock Localization - Systems and methods for accelerating distributed transactions on key-value stores include applying one or more policies of dynamic lock-localization, the policies including a lock migration stage that decreases nodes on which locks are present so that a transaction needs fewer network round trips to acquire locks, the policies including a lock ordering stage for pipelining during lock acquisition and wherein the order on locks to avoid deadlock is controlled by average contentions for the locks rather than static lexicographical ordering; and dynamically migrating and placing locks for distributed objects in distinct entity-groups in a datastore through the policies of dynamic lock-localization. | 08-21-2014 |
20140237477 | SIMULTANEOUS SCHEDULING OF PROCESSES AND OFFLOADING COMPUTATION ON MANY-CORE COPROCESSORS - Methods and systems for scheduling jobs to manycore nodes in a cluster include selecting a job to run according to the job's wait time and the job's expected execution time; sending job requirements to all nodes in a cluster, where each node includes a manycore processor; determining at each node whether said node has sufficient resources to ever satisfy the job requirements and, if no node has sufficient resources, deleting the job; creating a list of nodes that have sufficient free resources at a present time to satisfy the job requirements; and assigning the job to a node, based on a difference between an expected execution time and associated confidence value for each node and a hypothetical fastest execution time and associated hypothetical maximum confidence value. | 08-21-2014 |
20140289637 | Remote Visualization and Control for Virtual Mobile Infrastructure - A method for running application software for a mobile device by virtualizing a mobile device operating system (OS); running a virtual instance of the mobile device OS with the application software on a server on the cloud; and rendering on the server and sending a display image for the mobile device screen to be displayed on the mobile device. | 09-25-2014 |
20140325495 | Semi-Automatic Restructuring of Offloadable Tasks for Accelerators - A computer implemented method entails identifying code regions in an application from which offloadable tasks can be generated by a compiler for a heterogeneous computing system with processor and accelerator memory, including adding relaxed semantics to a directive-based language in the heterogeneous computing system to allow suggesting, rather than specifying, a parallel code region as an offloadable candidate, and identifying one or more offloadable tasks in a neighborhood of the code region marked by the directive. | 10-30-2014 |
20150066988 | SCALABLE PARALLEL SORTING ON MANYCORE-BASED COMPUTING SYSTEMS - Systems and methods for sorting data, including chunking unsorted data such that each chunk is of a size that fits within a last level cache of the system. One or more threads are instantiated in each physical core of the system, and chunks assigned to physical cores are distributed evenly across the threads on the physical cores. Subchunks in the physical cores are sorted using vector intrinsics, the subchunks being data assigned to the threads in the physical cores, and the subchunks are merged to generate sorted large chunks. A binary tree, which includes leaf nodes that correspond to the sorted large chunks, is built, leaf nodes are assigned to threads, and tree nodes are assigned to a circular buffer, wherein the circular buffer is lock and synchronization free. The large chunks are sorted to generate sorted data as output. | 03-05-2015 |
20150067225 | AUTOMATIC COMMUNICATION AND OPTIMIZATION OF MULTI-DIMENSIONAL ARRAYS FOR MANY-CORE COPROCESSOR USING STATIC COMPILER ANALYSIS - There are provided source-to-source transformation methods for a multi-dimensional array and/or a multi-level pointer for a computer program. A method includes minimizing a number of holes for variable length elements for a given dimension of the array and/or pointer using at least two stride values included in stride buckets. The minimizing step includes modifying memory allocation sites, for the array and/or pointer, to allocate memory based on the stride values. The minimizing step further includes modifying a multi-dimensional memory access, for accessing the array and/or pointer, into a single dimensional memory access using the stride values. The minimizing step also includes inserting an offload pragma for a data transfer of the array and/or pointer as at least one of a single-dimensional array and a single-level pointer. The data transfer is from a central processing unit to a coprocessor over peripheral component interconnect express. | 03-05-2015 |
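
The array transformation described in entry 20150067225 (turning a multi-dimensional access into a single-dimensional access by means of a stride) can be illustrated with a minimal sketch. This is not the patented method: it uses a single stride rather than the bucketed stride values the abstract mentions, and the names `flatten`, `get`, and `count_holes` are hypothetical helpers introduced only for illustration.

```python
def flatten(rows, stride):
    """Pack variable-length rows into one 1-D buffer, padding each row to `stride`.

    Assumption: a single stride is used for all rows; the abstract instead
    describes at least two stride values held in stride buckets.
    """
    buf = []
    for row in rows:
        if len(row) > stride:
            raise ValueError("row is longer than the chosen stride")
        buf.extend(row)
        # Padding slots are the "holes" left by variable-length rows.
        buf.extend([None] * (stride - len(row)))
    return buf


def get(buf, stride, i, j):
    """Single-dimensional equivalent of the two-dimensional access rows[i][j]."""
    return buf[i * stride + j]


def count_holes(rows, stride):
    """Number of padding slots the chosen stride wastes."""
    return sum(stride - len(row) for row in rows)


rows = [[1, 2, 3], [4], [5, 6]]
stride = max(len(row) for row in rows)
buf = flatten(rows, stride)
```

A flat buffer of this kind is what a single contiguous transfer to a coprocessor would need; a smaller hole count means fewer wasted bytes moved across the bus.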
Patent application number | Description | Published |
20090119556 | METHOD AND APPARATUS FOR TESTING LOGIC CIRCUIT DESIGNS - Disclosed is a logic testing system that includes a decompressor and a tester in communication with the decompressor. The tester is configured to store a seed and locations of scan inputs and is further configured to transmit the seed and the locations of scan inputs to the decompressor. The decompressor is configured to generate a test pattern from the seed and the locations of scan inputs. The decompressor includes a first test pattern generator, a second test pattern generator, and a selector configured to select the test pattern generated by the first test pattern generator or the test pattern generated by the second test pattern generator using the locations of scan inputs. | 05-07-2009 |
20090119563 | METHOD AND APPARATUS FOR TESTING LOGIC CIRCUIT DESIGNS - Disclosed is a logic testing system that includes a decompressor and a tester in communication with the decompressor. The tester is configured to store a seed and locations of scan inputs and is further configured to transmit the seed and the locations of scan inputs to the decompressor. The decompressor is configured to generate a test pattern from the seed and the locations of scan inputs. The decompressor includes a first test pattern generator, a second test pattern generator, and a selector configured to select the test pattern generated by the first test pattern generator or the test pattern generated by the second test pattern generator using the locations of scan inputs. | 05-07-2009 |
20090210762 | Method for Blocking Unknown Values in Output Response of Scan Test Patterns for Testing Circuits - A method includes compressing control patterns describing values required at the control signals of blocking logic gates, by linear feedback shift register (LFSR) reseeding; bypassing blocking logic gates for some groups of scan chains that do not capture unknown values in output response of scan test patterns for testing circuits; and reducing numbers of specified bits in densely specified ones of the control patterns for further reducing the size of a seed of the LFSR. | 08-20-2009 |
20090304268 | System and Method for Parallelizing and Accelerating Learning Machine Training and Classification Using a Massively Parallel Accelerator - A method and system for training an apparatus to recognize a pattern includes providing the apparatus with a host processor executing steps of a machine learning process; providing the apparatus with an accelerator including at least two processors; inputting training pattern data into the host processor; determining coefficient changes in the machine learning process with the host processor using the training pattern data; transferring the training data to the accelerator; determining kernel dot-products with the at least two processors of the accelerator using the training data; and transferring the dot-products back to the host processor. | 12-10-2009 |
20100088490 | METHODS AND SYSTEMS FOR MANAGING COMPUTATIONS ON A HYBRID COMPUTING PLATFORM INCLUDING A PARALLEL ACCELERATOR - In accordance with exemplary implementations, application computation operations and communications between operations on a host processing platform may be adapted to conform to the memory capacity of a parallel accelerator. Computation operations may be split and scheduled such that the computation operations fit within the memory capacity of the accelerator. Further, the operations may be automatically adapted without any modification to the code of an application. In addition, data transfers between a host processing platform and the parallel accelerator may be minimized in accordance with exemplary aspects of the present principles to improve processing performance. | 04-08-2010 |
20120188263 | Method and system to dynamically bind and unbind applications on a general purpose graphics processing unit - A system for dynamically binding and unbinding graphics processing unit (GPU) applications includes virtual memory management for tracking memory of a GPU used by an application, and a source-to-source compiler for identifying nested structures allocated on the GPU so that the virtual memory management can track these nested structures, and for identifying all instances where nested structures on the GPU are modified inside kernels. | 07-26-2012 |
20120192198 | Method and System for Memory Aware Runtime to Support Multitenancy in Heterogeneous Clusters - The invention solves the problem of sharing many-core devices (e.g. GPUs) among concurrent applications running on heterogeneous clusters. In particular, the invention provides transparent mapping of applications to many-core devices (that is, the user does not need to be aware of the many-core devices present in the cluster and of their utilization), time-sharing of many-core devices among applications also in the presence of conflicting memory requirements, and dynamic binding/unbinding of applications to/from many-core devices (that is, applications do not need to be statically mapped to the same many-core device for their whole life-time). | 07-26-2012 |
20130091507 | OPTIMIZING DATA WAREHOUSING APPLICATIONS FOR GPUS USING DYNAMIC STREAM SCHEDULING AND DISPATCH OF FUSED AND SPLIT KERNELS - Systems and methods for managing a processor and one or more co-processors for a database application whose queries have been processed into an intermediate form (IR) containing kernels of the database application that have been fused and split; dynamically scheduling such kernels on CUDA streams and further dynamically dispatching kernels to GPU devices by estimating execution time in order to achieve high performance. | 04-11-2013 |
20130097593 | Computer-Guided Holistic Optimization of MapReduce Applications - A method for compiler-guided optimization of MapReduce type applications that includes applying transformations and optimizations to Java bytecode of an original application by an instrumenter, which carries out static analysis to determine application properties depending on the optimization being performed and provides an output of optimized Java bytecode, and executing the application and analyzing the generated trace by a trace analyzer, which feeds information back into the instrumenter, the trace analyzer and instrumenter invoking each other iteratively and exchanging information through files. | 04-18-2013 |
20130191612 | INTERFERENCE-DRIVEN RESOURCE MANAGEMENT FOR GPU-BASED HETEROGENEOUS CLUSTERS - Systems and methods are disclosed that share coprocessor resources between two or more applications in a computing cluster using a job selector to receive jobs from a job queue; a node selector coupled to the job selector; an offline profiler with an interference prediction model; a coprocessor dynamic interference detection module; and a coprocessor interference response module. | 07-25-2013 |
20130298130 | AUTOMATIC PIPELINING FRAMEWORK FOR HETEROGENEOUS PARALLEL COMPUTING SYSTEMS - Systems and methods for automatic generation of software pipelines for heterogeneous parallel systems (AHP) include pipelining a program with one or more tasks on a parallel computing platform with one or more processing units and partitioning the program into pipeline stages, wherein each pipeline stage contains one or more tasks. The one or more tasks in the pipeline stages are scheduled onto the one or more processing units, and execution times of the one or more tasks in the pipeline stages are estimated. The above steps are repeated until a specified termination criterion is reached. | 11-07-2013 |
20140047422 | COMPILER-GUIDED SOFTWARE ACCELERATOR FOR ITERATIVE HADOOP JOBS - Various methods are provided directed to a compiler-guided software accelerator for iterative HADOOP jobs. A method includes identifying intermediate data, generated by an iterative HADOOP application, below a predetermined threshold size and used less than a predetermined threshold time period. The intermediate data is stored in a memory device. The method further includes minimizing input, output, and synchronization overhead for the intermediate data by selectively using at any given time any one of a Message Passing Interface and Distributed File System as a communication layer. The Message Passing Interface is co-located with the HADOOP Distributed File System. | 02-13-2014 |
20140208072 | USER-LEVEL MANAGER TO HANDLE MULTI-PROCESSING ON MANY-CORE COPROCESSOR-BASED SYSTEMS - A method is disclosed to manage a multi-processor system with one or more multiple-core coprocessors by intercepting coprocessor offload infrastructure application program interface (API) calls; scheduling user processes to run on one of the coprocessors; scheduling offloads within user processes to run on one of the coprocessors; and affinitizing offloads to predetermined cores within one of the coprocessors by selecting and allocating cores to an offload, and obtaining a thread-to-core mapping from a user. | 07-24-2014 |
20140208327 | METHOD FOR SIMULTANEOUS SCHEDULING OF PROCESSES AND OFFLOADING COMPUTATION ON MANY-CORE COPROCESSORS - A method is disclosed to manage a multi-processor system with one or more manycore devices, by managing real-time bag-of-tasks applications for a cluster, wherein each task runs on a single server node, and uses the offload programming model, and wherein each task has a deadline and three specific resource requirements: total processing time, a certain number of manycore devices and peak memory on each device; when a new task arrives, querying each node scheduler to determine which node can best accept the task and each node scheduler responds with an estimated completion time and a confidence level, wherein the node schedulers use an urgency-based heuristic to schedule each task and its offloads; responding to an accept/reject query phase, wherein the cluster scheduler sends the task requirements to each node and queries if the node can accept the task with an estimated completion time and confidence level; and scheduling tasks and offloads using an aging and urgency-based heuristic, wherein the aging guarantees fairness, and the urgency prioritizes tasks and offloads so that maximal deadlines are met. | 07-24-2014 |
20140208331 | METHODS OF PROCESSING CORE SELECTION FOR APPLICATIONS ON MANYCORE PROCESSORS - A runtime method is disclosed that dynamically sets up core containers and thread-to-core affinity for processes running on manycore coprocessors. The method is completely transparent to user applications and incurs low runtime overhead. The method is implemented within a user-space middleware that also performs scheduling and resource management for both offload and native applications using the manycore coprocessors. | 07-24-2014 |
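
The aging and urgency-based heuristic named in entry 20140208327 can be sketched as a simple scoring function. The abstract does not give a formula, so everything below is an assumption made for illustration: the `Task` fields, the inverse-slack urgency term, the linear aging term, and the `aging_weight` parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    arrival: float    # time the task entered the queue
    deadline: float   # absolute deadline
    remaining: float  # estimated processing time still needed


def score(task, now, aging_weight=0.01):
    """Hypothetical priority: urgency grows as slack shrinks, aging grows with wait time."""
    slack = task.deadline - now - task.remaining
    urgency = 1.0 / max(slack, 1e-9)            # guard against zero or negative slack
    aging = aging_weight * (now - task.arrival)  # fairness term for long-waiting tasks
    return urgency + aging


def pick_next(tasks, now, aging_weight=0.01):
    """Return the task with the highest combined score."""
    return max(tasks, key=lambda t: score(t, now, aging_weight))


tasks = [
    Task("a", arrival=0.0, deadline=100.0, remaining=10.0),
    Task("b", arrival=5.0, deadline=20.0, remaining=8.0),
]
urgent_first = pick_next(tasks, now=10.0)
aged_first = pick_next(tasks, now=10.0, aging_weight=1.0)
```

With a small aging weight the task closest to missing its deadline is chosen; raising the weight lets the longest-waiting task win, which is the fairness role the abstract attributes to aging.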