Class / Patent application number | Description | Number of patent applications / Date published |
717150000 | Loop compiling | 23 |
20090113405 | RECONFIGURABLE COPROCESSOR ARCHITECTURE TEMPLATE FOR NESTED LOOPS AND PROGRAMMING TOOL - The architectures derived from the proposed template are integrated in a generic System on Chip (SoC) and consist of reconfigurable coprocessors for executing nested program loops whose bodies are expressions of operations performed in a functional unit array in parallel. The data arrays are accessed from one or more system inputs and from an embedded memory array in parallel. The processed data arrays are sent back to the memory array or to system outputs. The architectures enable the acceleration of nested loops compared to execution on a standard processor, where only one operation or datum access can be performed at a time. The invention can be used in a number of applications especially those which involve digital signal processing, such as multimedia and communications. The architectures are used preferably in conjunction with von Neumann processors which are better at implementing control flow. The architectures can be scaled easily in the number of data stream inputs, outputs, embedded memories, functional units and configuration registers. A computational system may entail several general purpose processors and several coprocessors derived from this architectural template. The coprocessors are connected either synchronously or using asynchronous first in first out memories (FIFOs), forming a globally asynchronous locally synchronous system. Each coprocessor can be programmed by tagging and rewriting the nested loops in the original program. The programming tool produces a coprocessor configuration per each nested loop group, which is replaced in the original code with coprocessor input/output operations and control. | 04-30-2009 |
20100192138 | Methods And Apparatus For Local Memory Compaction - Methods, apparatus and computer software product for local memory compaction are provided. In an exemplary embodiment, a processor in connection with a memory compaction module identifies inefficiencies in array references contained within received source code, allocates a local array and maps the data from the inefficient array reference to the local array in a manner which improves the memory size requirements for storing and accessing the data. In another embodiment, a computer software product implementing a local memory compaction module is provided. In a further embodiment a computing apparatus is provided. The computing apparatus is configured to improve the efficiency of data storage in array references. This Abstract is provided for the sole purpose of complying with the Abstract requirement rules. This Abstract is submitted with the explicit understanding that it will not be used to interpret or to limit the scope or the meaning of the claims. | 07-29-2010 |
20100205589 | Non-Localized Constraints for Automated Program Generation - A method and a system for non-locally constraining a plurality of related but separated program entities (e.g., a loop operation and a related accumulation operation within the loop's scope) such that any broad program transformation affecting both will have the machinery to assure that the changes to both entities will preserve the invariant properties of and dependencies among them. For example, if a program transform alters one entity (e.g., re-expresses an accumulation operation as a vector operation incorporating some or all of the loop's iteration) the constraint will provide the machinery to assure a compensating alteration of the other entities (e.g., the loop operation is reduced to reflect the vectorization of the accumulation operation). One realization of this method comprises specialized instances of the related entities that while retaining their roles as program entities (i.e., operators), also contain data and machinery to define the non-local constraint relationship. | 08-12-2010 |
20100275191 | CONCURRENT MUTATION OF ISOLATED OBJECT GRAPHS - Fine-grained parallelism within isolated object graphs is used to provide safe concurrent operations within the isolated object graphs. One example provides an abstraction labeled IsolatedObjectGraph that encapsulates at least one object graph, but often two or more object graphs, rooted by an instance of a type member. By encapsulating the object graph, no references from outside of the object graph are allowed to objects inside of the object graph. Also, the encapsulated object graph does not contain references to objects outside of the graphs. The isolated object graphs provide for safe data parallel operations, including safe data parallel mutations such as for each loops. In an example, the ability to isolate the object graph is provided through type permissions. | 10-28-2010 |
20120079466 | Systems And Methods For Compiler-Based Full-Function Vectorization - Systems and methods for the vectorization of software applications are described. In some embodiments, a compiler may automatically generate both scalar and vector versions of a function from a single source code description. A vector interface may be exposed in a persistent dependency database that is associated with the function. This may allow a compiler to make vector function calls from within vectorized loops, rather than making multiple serialized scalar function calls from within a vectorized loop. This may in turn facilitate the vectorization of hierarchical code, which may improve application performance when vector execution resources are available. | 03-29-2012 |
20120079467 | PROGRAM PARALLELIZATION DEVICE AND PROGRAM PRODUCT - According to one embodiment, a parallelizing unit divides a loop into first and second processes based on a program to be converted and division information. The first and second processes respectively have termination control information, loop control information, and change information indicating change of data to be referred to in a process subsequent to the loop. The parallelizing unit inserts a determination process into the first process, which determines whether the second process is terminated at execution of an (n−1)th iteration of the second process when the second process is subsequent to the first process, or determines whether the second process is terminated at execution of an nth iteration of the second process when the second process precedes the first process, and notifies the second process of a result of the determination, and inserts a control process into the second process, which controls execution of the second process based on the result of determination notified. | 03-29-2012 |
20120089970 | APPARATUS AND METHOD FOR CONTROLLING LOOP SCHEDULE OF A PARALLEL PROGRAM - A compiling apparatus and method are provided. The compiling apparatus includes a first setting unit that sets a first parameter of a parallel programming model for a parallel region of a caller, a callee detection unit that detects a callee that is called by the caller and that has at least one loop region, and a second setting unit that sets a second parameter of the parallel programming model for the loop region of the callee using the first parameter. | 04-12-2012 |
20120192167 | Runtime Extraction of Data Parallelism - Mechanisms for extracting data dependencies during runtime are provided. The mechanisms execute a portion of code having a loop and generate, for the loop, a first parallel execution group comprising a subset of iterations of the loop less than a total number of iterations of the loop. The mechanisms further execute the first parallel execution group and determine, for each iteration in the subset of iterations, whether the iteration has a data dependence. Moreover, the mechanisms commit store data to system memory only for stores performed by iterations in the subset of iterations for which no data dependence is determined. Store data of stores performed by iterations in the subset of iterations for which a data dependence is determined is not committed to the system memory. | 07-26-2012 |
20130024849 | COMPILER DEVICE, COMPILER PROGRAM, AND LOOP PARALLELIZATION METHOD - According to the conventional loop parallelization method, when a loop in which a value of a loop-carried dependency variable can be calculated in all of the iterations without sequentially executing the loop from the start, it is determined that DOALL parallelization is not applicable due to the loop-carried dependency variable. Accordingly, the loop is sequentially executed or parallelized by using DOACROSS parallelization that executes a loop including a loop-carried dependency variable. That is, there is a problem that an expression including a loop-carried dependency cannot be parallelized and efficiently processed with use of a multi-processor. By generating initial value calculating codes | 01-24-2013 |
20130055225 | COMPILER FOR X86-BASED MANY-CORE COPROCESSORS - A system and method for compiling includes, for a parallelizable code portion of an application stored on a computer readable storage medium, determining one or more variables that are to be transferred to and/or from a coprocessor if the parallelizable code portion were to be offloaded. A start location and an end location are determined for at least one of the one or more variables as a size in memory. The parallelizable code portion is transformed by inserting an offload construct around the parallelizable code portion and passing the one or more variables and the size as arguments of the offload construct such that the parallelizable code portion is offloaded to a coprocessor at runtime. | 02-28-2013 |
20130191817 | Optimisation of loops and data flow sections - The present invention relates to a method for compiling code for a multi-core processor, comprising: detecting and optimizing a loop, partitioning the loop into partitions executable and mappable on physical hardware with optimal instruction level parallelism, optimizing the loop iterations and/or loop counter for ideal mapping on hardware, and chaining the loop partitions, generating a list representing the execution sequence of the partitions. | 07-25-2013 |
20130232476 | AUTOMATIC PIPELINE PARALLELIZATION OF SEQUENTIAL CODE - A system and associated method for automatically pipeline parallelizing a nested loop in sequential code over a predefined number of threads. Pursuant to task dependencies of the nested loop, each subloop of the nested loop is allocated to a respective thread. Combinations of stage partitions executing the nested loop are configured for parallel execution of a subloop where permitted. For each combination of stage partitions, a respective bottleneck is calculated and a combination with a minimum bottleneck is selected for parallelization. | 09-05-2013 |
20140007061 | STAGED LOOP INSTRUCTIONS | 01-02-2014 |
20140019949 | Method and System for Automated Improvement of Parallelism in Program Compilation - A method of program compilation to improve parallelism during the linking of the program by a compiler. The method includes converting statements of the program to canonical form, constructing an abstract syntax tree (AST) for each procedure in the program, and traversing the program to construct a graph by making each non-control flow statement and each control structure into at least one node of the graph. | 01-16-2014 |
20140173575 | PROCESSORS AND COMPILING METHODS FOR PROCESSORS - A compiling method compiles an object program to be executed by a processor having a plurality of execution units operable in parallel. In the method a first availability chain is created from a producer instruction (p1), scheduled for execution by a first one of the execution units ( | 06-19-2014 |
20140325495 | Semi-Automatic Restructuring of Offloadable Tasks for Accelerators - A computer implemented method entails identifying code regions in an application from which offloadable tasks can be generated by a compiler for a heterogeneous computing system with processor and accelerator memory, including adding relaxed semantics to a directive-based language in the heterogeneous computing system to allow suggesting, rather than specifying, a parallel code region as an offloadable candidate, and identifying one or more offloadable tasks in a neighborhood of the code region marked by the directive. | 10-30-2014 |
20140344793 | APPARATUS AND METHOD FOR EXECUTING CODE - An apparatus and method for executing code are provided. The apparatus includes a memory manager that allocates a stack in memory to store processed data that needs to be retained; a loop generator that divides program code programmed to be processed in parallel into regions based on a barrier function, transforms a region that includes the processed data that needs to be retained in the stack into a first coalescing loop, and transforms a region that uses the processed data stored in the stack into a second coalescing loop such that the transformed program code may be serially processed; and a loop changer that reverses a processing order of the second coalescing loop in comparison to a processing order of the first coalescing loop. | 11-20-2014 |
20150058832 | AUTO MULTI-THREADING IN MACROSCALAR COMPILERS - Systems and methods for the parallelization of software applications are described. In some embodiments, a compiler may automatically identify within source code dependencies of a function called by another function. A persistent database may be generated to store identified dependencies. When calls to the function are encountered within the source code, the persistent database may be checked, and a parallelized implementation of the function may be employed dependent upon the dependency indicated in the persistent database. | 02-26-2015 |
20150135171 | INFORMATION PROCESSING APPARATUS AND COMPILATION METHOD - A storage unit stores source code including loop processing that is written with an array referenced by an index, a loop variable, and a parameter. A computing unit generates a conditional expression indicating that the index of the array satisfies a predetermined condition, using the loop variable and the parameter. The computing unit generates determination information on the parameter, by eliminating the loop variable from the conditional expression through formula manipulation. Then, the computing unit generates object code corresponding to the source code in accordance with the determination information. | 05-14-2015 |
20160098258 | METHOD AND SYSTEM FOR AUTOMATED IMPROVEMENT OF PARALLELISM IN PROGRAM COMPILATION - A method of program compilation to improve parallelism during the linking of the program by a compiler. The method includes converting statements of the program to canonical form, constructing a traversable representation, such as an abstract syntax tree (AST), for each procedure in the program, and traversing the program to construct a graph by making each non-control flow statement and each control structure into at least one node of the graph. | 04-07-2016 |
20160139901 | SYSTEMS, METHODS, AND COMPUTER PROGRAMS FOR PERFORMING RUNTIME AUTO PARALLELIZATION OF APPLICATION CODE - Systems, methods, and computer programs are disclosed for performing runtime auto-parallelization of application code. One embodiment of such a method comprises receiving application code to be executed in a multi-processor system. The application code comprises an injected code cost computation expression for at least one loop in the application code defining a serial workload for processing the loop. A runtime profitability check of the loop is performed based on the injected code cost computation expression to determine whether the serial workload can be profitably parallelized. If the serial workload can be profitably parallelized, the loop is executed in parallel using two or more processors in the multi-processor system. | 05-19-2016 |
20160147514 | SYSTEMS AND METHODS FOR STENCIL AMPLIFICATION - In a sequence of major computational steps or in an iterative computation, a stencil amplifier can increase the number of data elements accessed from one or more data structures in a single major step or iteration, thereby decreasing the total number of computations and/or communication operations in the overall sequence or the iterative computation. Stencil amplification, which can be optimized according to a specified parameter such as compile time, run time, code size, etc., can improve the performance of a computing system executing the sequence or the iterative computation in terms of run time, memory load, energy consumption, etc. The stencil amplifier typically determines boundaries, to avoid erroneously accessing data elements not present in the one or more data structures. | 05-26-2016 |
20160147516 | EXECUTION OF COMPLEX RECURSIVE ALGORITHMS - This application discloses tools and mechanisms to convert a program from a sequentially-executable format into a parallel-executable format, and then modify the program in the parallel-executable format to either allow compilation for parallel execution or to speed-up the parallel execution by an accelerated processing unit. The tools and mechanisms can identify various features of the program, such as recursive calls, search loops, inline function calls, uncompressed data structures, memory utilization, and inter-dependent kernel instances. The tools and mechanisms can modify the program to replace or otherwise augment the identified features, which can allow the modified program to be compiled for parallel execution, or speed-up the parallel execution by an accelerated processing unit. | 05-26-2016 |