Entries |
Document | Title | Date |
20080216064 | Method, Architecture and Software of Meta-Operating System, Operating Systems and Applications For Parallel Computing Platforms - A method, architecture, and tangible medium storing computer readable software provides a meta-operating system, native operating systems and native applications which have been designed to operate upon one or more parallel computing platforms. The meta-operating system provides for an abstracted model of a standard operating system designed to manage one or more underlying standard operating systems, and may comprise components such as computer programming languages, development environments, user interfaces, hardware platform interfaces, operating system interfaces, parallelization platform interfaces, non-native application support, or others. | 09-04-2008 |
20080222620 | PARALLEL PROGRAMMING COMPUTING SYSTEM - A computing system receives a program created by a technical computing environment, analyzes the program, generates multiple program portions based on the analysis of the program, dynamically allocates the multiple program portions to multiple software units of execution for parallel programming, receives multiple results associated with the multiple program portions from the multiple software units of execution, and provides the multiple results or a single result to the program. | 09-11-2008 |
20080276232 | Processor Dedicated Code Handling in a Multi-Processor Environment - Code handling, such as interpreting language instructions or performing “just-in-time” compilation, is performed using a heterogeneous processing environment that shares a common memory. In a heterogeneous processing environment that includes a plurality of processors, one of the processors is programmed to perform a dedicated code-handling task, such as performing just-in-time compilation or interpreting instructions of an interpreted language such as Java. The other processors request code-handling processing that is performed by the dedicated processor. Speed is achieved using a shared memory map so that the dedicated processor can quickly retrieve data provided by one of the other processors. | 11-06-2008 |
20090044179 | MEDIA FOR PERFORMING PARALLEL PROCESSING OF DISTRIBUTED ARRAYS - One or more computer-readable media store executable instructions that, when executed by processing logic, perform parallel processing. The media store one or more instructions for initiating a single programming language, and identifying, via the single programming language, one or more data distribution schemes for executing a program. The media also store one or more instructions for transforming, via the single programming language, the program into a parallel program with an optimum data distribution scheme selected from the one or more identified data distribution schemes, and allocating the parallel program to two or more labs for parallel execution. The media further store one or more instructions for receiving one or more results associated with the parallel execution of the parallel program from the two or more labs, and providing the one or more results to the program. | 02-12-2009 |
20090044180 | DEVICE FOR PERFORMING PARALLEL PROCESSING OF DISTRIBUTED ARRAYS - A device for performing parallel processing includes a processor to initiate a single programming language, and identify, via the single programming language, one or more data distribution schemes for executing a program. The processor also transforms, via the single programming language, the program into a parallel program with an optimum data distribution scheme selected from the one or more identified data distribution schemes, and allocates the parallel program to two or more labs for parallel execution. The processor further receives one or more results associated with the parallel execution of the parallel program from the two or more labs, and provides the one or more results to the program. | 02-12-2009 |
20090049433 | Method and apparatus for ordering code based on critical sections - A method of compiling code includes ordering instructions that protect and release critical sections in the code to improve parallel execution of the code according to an intrinsic order of the critical sections. According to one embodiment, the intrinsic order of the critical sections in the code is determined from data dependence and control dependence of instructions in the critical sections, and additional dependencies are generated to enforce the intrinsic order of the critical sections. Other embodiments are described and claimed. | 02-19-2009 |
20090049434 | PROGRAM TRANSLATING APPARATUS AND COMPILER PROGRAM - A program translating apparatus and compiler program of this invention translates program source code into intermediate code containing multiple instructions, extracts at least one combination of two parallelization candidate instructions from the intermediate code, extracts, for each parallelization candidate instruction, a dependency related instruction having a dependency relation with the parallelization candidate instruction from the intermediate code, determines, for each parallelization candidate instruction, a movement-feasible range for the parallelization candidate instruction based on the execution position of the extracted dependency related instruction for the parallelization candidate instruction, moves the two parallelization candidate instructions to an execution position contained in the common movement-feasible range of the two parallelization candidate instructions, thereby modifying the intermediate code, and translates it into instruction code. | 02-19-2009 |
20090049435 | PARALLEL PROCESSING OF DISTRIBUTED ARRAYS - A computing device-implemented method includes initiating a single programming language, and identifying, via the single programming language, one or more data distribution schemes for executing a program. The method also includes transforming, via the single programming language, the program into a parallel program with an optimum data distribution scheme selected from the one or more identified data distribution schemes, and allocating the parallel program to two or more labs for parallel execution. The method further includes receiving one or more results associated with the parallel execution of the parallel program from the two or more labs, and providing the one or more results to the program. | 02-19-2009 |
20090064115 | Enabling graphical notation for parallel programming - In one embodiment, the present invention includes a method for developing a parallel program by specifying graphical representations for input data objects into a parallel computation code segment; specifying graphical representations for parallel program schemes, each including at least one graphical representation of an operator to perform an operation on a data object; determining whether any of the parallel program schemes includes at least one alternative computation; and unrolling the corresponding parallel program schemes and generating alternative parallel program scheme fragments therefrom. Other embodiments are described and claimed. | 03-05-2009 |
20090113404 | Optimum code generation method and compiler device for multiprocessor - A method is provided for generating optimum parallel code from source code for a computer system configured of plural processors that share a cache memory or a main memory. Preset code is read, and operation amounts and process contents are analyzed while distinguishing dependence and independence among processes in the code. Then the amount of data reused among processes and the amount of data that accesses the main memory are analyzed. Further, upon receiving a parallel code generation policy input by a user, the processes of the code are divided. While the execution cycle is estimated from the operation amounts and process contents, the cache use of the reused data, and the amount of main memory access data, the parallelization method that yields the shortest execution cycle is executed. | 04-30-2009 |
20090119652 | Computer Program Functional Partitioning System for Heterogeneous Multi-processing Systems - The present invention provides for a system for computer program functional partitioning for heterogeneous multi-processing systems. At least one system parameter of a computer system comprising one or more disparate processing nodes is identified. Computer program code comprising a program to be run on the computer system is received. A whole program representation is generated based on received computer program code. At least one single-entry-single-exit (SESE) region is identified based on the whole program representation. At least one node-specific SESE region is identified based on identified SESE regions and the at least one system parameter. Each node-specific SESE region is grouped into a node-specific subroutine. Each node-specific subroutine is compiled based on a specified node characteristic. The computer program code is modified based on the node-specific subroutines and the modified computer program code is compiled. | 05-07-2009 |
20090133007 | COMPILER AND TOOL CHAIN - A compiler for a DRP inputs a source program, and outputs the final CPU code to operate in an information processing device having hierarchical memories of at least three hierarchies comprising addressable memories. The compiler outputs a code which transfers instructions or configurations to a processor of the information processing device, in the hierarchical memories, with the memory close to the processor as the upper layer, from the memory of the lower layer to the memory of the upper layer, step by step. | 05-21-2009 |
20090138862 | PROGRAM PARALLELIZATION SUPPORTING APPARATUS AND PROGRAM PARALLELIZATION SUPPORTING METHOD - A program parallelization supporting apparatus determines a determinacy in at least one dependency relationship of a data dependency, a control dependency and a pointer dependency in a program, extracts a critical path in the program, and extracts a processing instruction which exists on the critical path and has a non-deterministic determinacy in the dependency relationship. Furthermore, if a process related to a path of the extracted non-deterministic processing instruction is parallelized and the path of the non-deterministic processing instruction is deleted, the program parallelization supporting apparatus outputs parallelization labor hour information depending on the number of dependency relationships disturbing the parallelization and parallelization effect information depending on the number of processing instructions which are shortened by the parallelization. | 05-28-2009 |
20090271774 | SYSTEM AND METHOD FOR THE DISTRIBUTION OF A PROGRAM AMONG COOPERATING PROCESSING ELEMENTS - A Veil program analyzes the source code and/or data of an existing sequential target program and determines how best to distribute the target program and data among the processing elements of a multi-processing element computing system. The Veil program analyzes source code loops, data sizes and types to prepare a set of distribution attempts, whereby each distribution is run under a run-time evaluation wrapper and evaluated to determine the optimal distribution across the available processing elements. | 10-29-2009 |
20090300591 | COMPOSABLE AND CANCELABLE DATAFLOW CONTINUATION PASSING - Parallel tasks are created, and the tasks include a first task and a second task. Each task resolves a future. At least one of three possible continuations for each of the tasks is supplied. The three continuations include a success continuation, a cancellation continuation, and a failure continuation. A value is returned as the future of the first task upon a success continuation for the first task. The value from the first task is used in the second task to compute a second future. The cancellation continuation is supplied if the task is cancelled and the failure continuation is supplied if the task does not return a value and the task is not cancelled. | 12-03-2009 |
20090307671 | SYSTEM AND METHOD FOR SCALING SIMULATIONS AND GAMES - A system and method model simulation and game artificial intelligence as a data management problem. A scripting language provides game designers and players with a data-driven artificial intelligence scheme for customizing the behavior of individual agents. Query processing and indexing techniques efficiently execute large numbers of agent scripts, providing a framework for games with a large number of agents. | 12-10-2009 |
20090320005 | CONTROLLING PARALLELIZATION OF RECURSION USING PLUGGABLE POLICIES - A parallelism policy object provides a control parallelism interface whose implementation evaluates parallelism conditions that are left unspecified in the interface. User-defined and other parallelism policy procedures can make recommendations to a worker program for transitioning between sequential program execution and parallel execution. Parallelizing assistance values obtained at runtime can be used in the parallelism conditions on which the recommendations are based. A consistent parallelization policy can be employed across a range of parallel constructs, and inside recursive procedures. | 12-24-2009 |
20100031241 | Method and apparatus for detection and optimization of presumably parallel program regions - A method and apparatus for optimizing source code for use in a parallel computing environment by compiling an application source code, performing analysis, and optimizing the application source code. At the time of compilation, a compiler adds instrumentation to a prepared executable. An analysis program then analyzes the prepared executable and generates an analysis result. The analysis result is then used by the analysis program to optimize the application source code for parallel processing. | 02-04-2010 |
20100070958 | PROGRAM PARALLELIZING METHOD AND PROGRAM PARALLELIZING APPARATUS - Provided are a program parallelizing method and a program parallelizing apparatus that efficiently generate a parallelized program with a shorter parallel execution time. | 03-18-2010 |
20100153937 | SYSTEM AND METHOD FOR PARALLEL EXECUTION OF A PROGRAM - A computer system for executing a computer program on parallel processors, the system having a compiler for identifying within a computer program concurrency markers that indicate that code between them can be executed in parallel and should be executed with delayed side-effects; and an execution system that is operable to execute the code identified by the concurrency markers to generate a queue of side-effects and after execution of that code is completed, sequentially execute the queue of side-effects. | 06-17-2010 |
20100205588 | GENERAL PURPOSE DISTRIBUTED DATA PARALLEL COMPUTING USING A HIGH LEVEL LANGUAGE - General-purpose distributed data-parallel computing using a high-level language is disclosed. Data parallel portions of a sequential program that is written by a developer in a high-level language are automatically translated into a distributed execution plan. The distributed execution plan is then executed on large compute clusters. Thus, the developer is allowed to write the program using familiar programming constructs in the high level language. Moreover, developers without experience with distributed compute systems are able to take advantage of such systems. | 08-12-2010 |
20100229161 | COMPILE METHOD AND COMPILER - A compile technique for multicore allocation is provided that achieves a desired running performance. The technique comprises the steps of analyzing a taskization directive, taskizing a specified part, and assigning the task to a specified CPU. Under this program-to-tasks-decomposition compile technique, multicore decomposition is performed by allocating tasks to CPUs individually while following a task decomposition directive, designated by a user, for a main part. When no CPU is specified for a task, its relation to a principal task is judged from the invocation relationship and dependencies, and the CPU to be allocated is then determined. During allocation to CPUs, an efficient multicore task decomposition is achieved by considering the copying of one process and its assignment to more than one CPU, while balancing processing speed against resources. | 09-09-2010 |
20100275190 | METHOD OF CONVERTING PROGRAM CODE OF PROGRAM RUNNING IN MULTI-THREAD TO PROGRAM CODE CAUSING LESS LOCK COLLISIONS, COMPUTER PROGRAM AND COMPUTER SYSTEM FOR THE SAME - A method converts the program code of a multithreaded program into program code that causes fewer lock collisions. The method includes the following steps:

- reading the program code into memory;
- searching it for a first conditional statement that branches to a path which lies in a synchronized block and has no side effect on that block;
- duplicating the side-effect-free path reached by the first conditional statement outside the synchronized block;
- in response to the duplication, adding a second conditional statement to the program code that branches to the duplicated side-effect-free path.

A system and an article of manufacture that cause a computer to carry out the steps of the method are also provided. | 10-28-2010 |
20100306752 | Automatically Creating Parallel Iterative Program Code in a Graphical Data Flow Program - System and method for automatically parallelizing iterative functionality in a data flow program. A data flow program is stored that includes a first data flow program portion, where the first data flow program portion is iterative. Program code implementing a plurality of second data flow program portions is automatically generated based on the first data flow program portion, where each of the second data flow program portions is configured to execute a respective one or more iterations. The plurality of second data flow program portions are configured to execute at least a portion of iterations concurrently during execution of the data flow program. Execution of the plurality of second data flow program portions is functionally equivalent to sequential execution of the iterations of the first data flow program portion. | 12-02-2010 |
20100306753 | Loop Parallelization Analyzer for Data Flow Programs - System and method for automatically parallelizing iterative functionality in a data flow program. A data flow program is stored that includes a first data flow program portion, where the first data flow program portion is iterative. Program code implementing a plurality of second data flow program portions is automatically generated based on the first data flow program portion, where each of the second data flow program portions is configured to execute a respective one or more iterations. The plurality of second data flow program portions are configured to execute at least a portion of iterations concurrently during execution of the data flow program. Execution of the plurality of second data flow program portions is functionally equivalent to sequential execution of the iterations of the first data flow program portion. | 12-02-2010 |
20110010695 | ARCHITECTURE FOR ACCELERATED COMPUTER PROCESSING - A data processing system includes a host computer, an additional computer, an application module including a first executable code, a module for analyzing said first executable code and a module for generating a second executable code segmented notably into code blocks which are executed in a preferential manner on one of the two computers. The second executable code includes a sub-module for managing the distribution of the processing operations between the host computer and the additional computer and a sub-module for managing the additional computer as a virtual machine which executes the blocks allocated to the additional computer. | 01-13-2011 |
20110035736 | GRAPHICAL PROCESSING UNIT (GPU) ARRAYS - A device initiates a technical computing environment (TCE), and receives, via the TCE, a program command that permits the TCE to access a graphical processing unit that is remote to the device, where the program command permits the TCE to seamlessly transfer data to the remote GPU. The device transforms, via the TCE, the program command into a program command that is executable by the remote GPU, and provides the transformed program command to the remote GPU for execution. The device also receives, from the remote GPU, one or more results associated with execution of the transformed program command by the remote GPU, and utilizes the one or more results via the TCE. | 02-10-2011 |
20110035737 | SAVING AND LOADING GRAPHICAL PROCESSING UNIT (GPU) ARRAYS - A device receives, via a technical computing environment, a program that includes a parallel construct and a command to be executed by graphical processing units, and analyzes the program. The device also creates, based on the parallel construct and the analysis, one or more instances of the command to be executed in parallel by the graphical processing units, and transforms, via the technical computing environment, the one or more command instances into one or more command instances that are executable by the graphical processing units. The device further allocates the one or more transformed command instances to the graphical processing units for parallel execution, and receives, from the graphical processing units, one or more results associated with parallel execution of the one or more transformed command instances by the graphical processing units. | 02-10-2011 |
20110067014 | PIPELINED PARALLELIZATION WITH LOCALIZED SELF-HELPER THREADING - A system and method for automatically parallelizing a computer program for multi-threaded execution. A compiler identifies and parallelizes non-DOALL parallel regions, such as loops, within a computer program. The compiler determines enhanced helper thread instructions based upon the main body instructions of the non-DOALL region. These helper thread instructions are inserted ahead of the main body instructions within each of the plurality of threads, rather than within a single main thread. Next, synchronization instructions are inserted in one or more threads such that the main body of work of each thread is performed in a pipelined manner. The helper thread instructions within each thread may reduce the total execution time of each thread. | 03-17-2011 |
20110067015 | PROGRAM PARALLELIZATION APPARATUS, PROGRAM PARALLELIZATION METHOD, AND PROGRAM PARALLELIZATION PROGRAM - A program parallelization apparatus that generates a parallelized program with a shorter parallel execution time is provided. The program parallelization apparatus inputs a sequential processing intermediate program and outputs a parallelized intermediate program. In the apparatus, a thread start time limitation analysis part analyzes an instruction-allocatable time based on a limitation on the instruction execution start time of each thread. A thread end time limitation analysis part analyzes an instruction-allocatable time based on a limitation on the instruction execution end time of each thread. An occupancy status analysis part analyzes the time not occupied by already-scheduled instructions. A dependence delay analysis part analyzes an instruction-allocatable time based on a delay resulting from dependence between instructions. A schedule candidate instruction select part selects the next instruction to schedule. An instruction arrangement part allocates a processor and an execution time to an instruction. | 03-17-2011 |
20110067016 | EFFICIENT PARALLEL COMPUTATION ON DEPENDENCY PROBLEMS - A computing method includes accepting a definition of a computing task ( | 03-17-2011 |
20110072420 | APPARATUS AND METHOD FOR CONTROLLING PARALLEL PROGRAMMING - A parallel programming adjusting apparatus and method are provided. Parameters of a parallel programming model that influence system performance are grouped into parameter sets, and the parameter sets are combined across the groups to generate parameter combinations. Execution files are executed for each parameter combination, and the runtime of a parallel region is measured for each combination. An optimum parameter combination is selected based on the measured runtimes. | 03-24-2011 |
20110078670 | PROCESS AND SYSTEM FOR DEVELOPMENT OF PARALLEL PROGRAMS - A method for developing parallel programs comprises creating a file in the memory facilities of a terminal; recording of source code in the file by a user using the input facilities and display facilities of the terminal, the source code being a combination of imperative code, algebraic code, and reference elements; checking of all reference elements included in the source code, during compilation, by an analysis module stored in the memory facilities of the terminal till all the reference elements included in the source code are confirmed as correct or missing by the compiler; if all the reference elements are found to be correct during checking, generating the parallel program; and if one or more reference elements are found to be incorrect or missing during checking, displaying information to the user using the display facilities of the terminal so that the user can carry out corrections. A system for development of parallel programs includes a terminal comprising input facilities, display facilities, and computation facilities. The system allows creation of a file including source code, execution of a compiler able to check the reference elements included in the source code. The display facilities of the terminal can be used for presenting information about reference elements. | 03-31-2011 |
20110083125 | PARALLELIZATION PROCESSING METHOD, SYSTEM AND PROGRAM - A unified parallelization table is formed by the following steps:

- describing a process to be executed with a plurality of control blocks and edges connecting the control blocks;
- selecting highly predictable edges from the edges;
- identifying strongly-connected clusters;
- creating a parallelization table for each node in the strongly-connected clusters and for each non-strongly-connected cluster between the strongly-connected clusters; each table's entries hold a number of processors, the corresponding cost, and the corresponding clusters;
- creating a graph consisting of the parallelization tables;
- converting the graph of parallelization tables into a series-parallel graph;
- merging the parallelization tables for each serial path and for each parallel section.

Then, based on the number of processors and the cost value in the unified parallelization table, the best entry is selected and executable code to be allocated to each processor is generated. | 04-07-2011 |
20110088020 | PARALLELIZATION OF IRREGULAR REDUCTIONS VIA PARALLEL BUILDING AND EXPLOITATION OF CONFLICT-FREE UNITS OF WORK AT RUNTIME - An optimizing compiler device, a method, and a computer program product are capable of parallelizing irregular reductions. The method for parallelizing irregular reductions includes receiving a program at a compiler and selecting, at compile time, at least one unit of work (UW) from the program. Each UW is configured to operate on at least one reduction operation, and at least one reduction operation in the UW operates on a reduction variable whose address is determinable at run time. At run time, for each successive current UW, a list of the reduction operations accessed by that unit of work is recorded. Further, it is determined at run time whether the reduction operations accessed by a current UW conflict with any reduction operations recorded as accessed by previously selected units of work. When no conflicts are found, the unit of work is assigned as a conflict-free unit of work (CFUW). Finally, two or more processing threads are scheduled for parallel run-time operation, each processing a respective one of the two or more assigned CFUWs. | 04-14-2011 |
20110088021 | Parallel Dynamic Optimization - Technologies are generally described for parallel dynamic optimization using multicore processors. A runtime compiler may be adapted to generate multiple instances of executable code from a portable intermediate software module. The various instances of executable code may be generated with variations of optimization parameters such that each code instance expresses a different optimization attempt. A multicore processor may be leveraged to simultaneously execute some, or all, of the various code instances. Preferred optimization parameters may be determined from the executable code instances that correctly complete in the least time, use the least amount of memory, or prove superior according to some other fitness metric. Preferred optimization parameters may be used to seed future optimization attempts. Output generated from the preferred instances may be used as soon as the first instance correctly completes. | 04-14-2011 |
20110093837 | Method and apparatus for enabling parallel processing during execution of a cobol source program using two-stage compilation - A method and apparatus are disclosed for compiling an original Cobol program and building an executable program whose performance is improved by increased parallelism, using multiple threads of processing during execution. The approach includes a compilation (or translation) step that uses a first compiler or translating program, which is a parallel-aware translating compiler. This parallel-aware first compiler is a specialized compiler/translator that takes a Cobol source program as input and produces as output an intermediate program in a second programming language. The intermediate program includes parallelization directives and is intended for further compilation by an existing, selected second compiler that supports parallelism for programs written in the second programming language. The approach optionally allows pragmas to serve as parallelization directives to the compiler, either in the original Cobol program or in the intermediate program. | 04-21-2011 |
20110113410 | Apparatus and Method for Simplified Microparallel Computation - The embodiments provide schemes for micro parallelization. That is, they involve methods of executing segments of code that might be executed in parallel but have typically been executed serially because of the lack of a suitable mechanism | 05-12-2011 |
20110119660 | PROGRAM CONVERSION APPARATUS AND PROGRAM CONVERSION METHOD - A program conversion apparatus according to the present invention includes three units:

- a thread creation unit, which creates a plurality of threads equivalent to a program part included in a program, based on path information on a plurality of execution paths. Each execution path goes from the start to the end of the program part, and each thread is equivalent to at least one of the execution paths.
- a replacement unit, which performs variable replacement on the threads so that a variable shared by the threads is accessed by only one of the threads, in order to avoid an access conflict among the threads.
- a thread parallelization unit, which generates a program that causes the threads to be speculatively executed in parallel after the variable replacement.

| 05-19-2011 |
20110126180 | Methods and System for Executing a Program in Multiple Execution Environments - A method and a medium are disclosed for executing a technical computing program in parallel in multiple execution environments. A program is invoked for execution in a first execution environment. From that invocation, the program is executed in the first execution environment and in one or more additional execution environments, providing parallel execution of the program. New constructs in a technical computing programming language are disclosed for parallel programming of a technical computing program for execution in multiple execution environments. A system and method are also disclosed for changing the mode of operation of an execution environment from a sequential mode to a parallel mode and vice versa. | 05-26-2011 |
20110126181 | Methods and System for Executing a Program in Multiple Execution Environments - A method and medium are disclosed for executing a technical computing program in parallel in multiple execution environments. A program is invoked for execution in a first execution environment and from the invocation the program is executed in the first execution environment and one or more additional execution environments to provide for parallel execution of the program. New constructs in a technical computing programming language are disclosed for parallel programming of a technical computing program for execution in multiple execution environments. A system and method are also disclosed for changing the mode of operation of an execution environment from a sequential mode to a parallel mode of operation and vice versa. | 05-26-2011 |
20110161943 | METHOD TO DYNAMICALLY DISTRIBUTE A MULTI-DIMENSIONAL WORK SET ACROSS A MULTI-CORE SYSTEM - A method provides efficient dispatch/completion of an N Dimensional (ND) Range command in a data processing system (DPS). The method comprises: a compiler generating one or more commands from received program instructions; ND Range work processing (WP) logic determining when a command generated by the compiler will be implemented over an ND configuration of operands, where N is greater than one (1); automatically decomposing the ND configuration of operands into a one (1) dimension (1D) work element comprising P sequentially ordered work items that each represent one of the operands; placing the 1D work element within a command queue of the DPS; enabling sequential dispatching of 1D work items in ordered sequence to one or more processing units; and generating an ND Range output by mapping the 1D work output result to an ND position corresponding to an original location of the operand represented by the 1D work item. | 06-30-2011 |
20110161944 | METHOD AND APPARATUS FOR TRANSFORMING PROGRAM CODE - Provided is a method of transforming program code written such that a plurality of work-items are allocated respectively to and concurrently executed on a plurality of processing elements included in a computing unit. A program code translator may identify, in the program code, two or more code regions, which are to be enclosed by work-item coalescing loops (WCLs), based on a synchronization barrier function contained in the program code, such that the work-items are serially executable on a smaller number of processing elements than a number of the processing elements, and may enclose the identified code regions with the WCLs, respectively. | 06-30-2011 |
20110167416 | SYSTEMS, APPARATUSES, AND METHODS FOR A HARDWARE AND SOFTWARE SYSTEM TO AUTOMATICALLY DECOMPOSE A PROGRAM TO MULTIPLE PARALLEL THREADS - Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program into multiple parallel threads are described. In some embodiments, the systems and apparatuses execute a method of original code decomposition and/or generated thread execution. | 07-07-2011 |
20110167417 | PROGRAMMING SYSTEM IN MULTI-CORE, AND METHOD AND PROGRAM OF THE SAME - A first compiler generates one or more object codes from a program code for a first processor included in an arithmetic processing system to which a plurality of processors are mutually connected. A first linker links the generated one or more object codes to generate an execution file for the first processor. A parameter information generation unit generates, based on the information acquired from the first linker, parameter information used in a second processor included in the arithmetic processing system. A second compiler refers to a program code and the parameter information for the second processor to generate one or more object codes. A second linker links the generated one or more object codes to generate an execution file for the second processor. | 07-07-2011 |
20110209129 | PARALLELIZATION METHOD, SYSTEM AND PROGRAM - A parallelization method, system and program. A program expressed by a block diagram or the like is divided into strands, and calculation time is balanced among the strands. The functional blocks are divided into strands, and the strand with the maximum calculation time in the strand set is found. One or more movable blocks in that strand are then found. Next, the calculation time of each strand is obtained for the case where the movable block is moved to a strand in its input or output direction (according to its property), and the block is moved to the strand that most reduces the calculation time of the strand that had the maximum calculation time before the move. This process loops until the calculation time is no longer reduced. The strands are then transformed into source code, which is compiled and assigned to separate cores or processors for execution. | 08-25-2011 |
20110239201 | METHOD AND SYSTEM FOR PARALLELIZATION OF SEQUENCIAL COMPUTER PROGRAM CODES - A method and system for parallelization of sequential computer program code are described. In one embodiment, an automatic parallelization system includes a syntactic analyzer to analyze the structure of the sequential computer program code and identify the positions at which to insert SPI into the sequential computer program code; a profiler to profile the sequential computer program code by preparing a call graph that determines the dependencies of each line of the sequential computer program code and the time required to execute each of its functions; an analyzer to determine the parallelizability of the sequential computer program code from the information obtained by analyzing and profiling it; and a code generator to insert SPI into the sequential computer program code, once it is determined to be parallelizable, to obtain parallel computer program code, which is then output to a parallel computing environment for execution. A corresponding method is also described. | 09-29-2011 |
20110252411 | IDENTIFICATION AND TRANSLATION OF PROGRAM CODE EXECUTABLE BY A GRAPHICAL PROCESSING UNIT (GPU) - A device receives program code, and receives size/type information associated with inputs to the program code. The device determines, prior to execution of the program code and based on the input size/type information, a portion of the program code that is executable by a graphical processing unit (GPU), and determines, prior to execution of the program code and based on the input size/type information, a portion of the program code that is executable by a central processing unit (CPU). The device compiles the GPU-executable portion of the program code to create a compiled GPU-executable portion of the program code, and compiles the CPU-executable portion of the program code to create a compiled CPU-executable portion of the program code. The device provides, to the GPU for execution, the compiled GPU-executable portion of the program code, and provides, to the CPU for execution, the compiled CPU-executable portion of the program code. | 10-13-2011 |
20110265068 | Single Thread Performance in an In-Order Multi-Threaded Processor - A mechanism is provided for improving single-thread performance for a multi-threaded, in-order processor core. In a first phase, a compiler analyzes application code to identify instructions that can be executed in parallel, with a focus on instruction-level parallelism, and removes any register interference between the threads. The compiler inserts, as appropriate, synchronization instructions supported by the apparatus to ensure that the resulting execution of the threads is equivalent to the execution of the application code in a single thread. In a second phase, an operating system schedules the threads produced in the first phase on the hardware threads of a single processor core such that they execute simultaneously. In a third phase, the microprocessor core executes the threads specified by the second phase such that each application thread is executed by one hardware thread. | 10-27-2011 |
20110271263 | Compiling Software For A Hierarchical Distributed Processing System - Compiling software for a hierarchical distributed processing system including providing to one or more compiling nodes software to be compiled, wherein at least a portion of the software to be compiled is to be executed by one or more other nodes; compiling, by the compiling node, the software; maintaining, by the compiling node, any compiled software to be executed on the compiling node; selecting, by the compiling node, one or more nodes in a next tier of the hierarchy of the distributed processing system in dependence upon whether any compiled software is for the selected node or the selected node's descendants; sending to the selected node only the compiled software to be executed by the selected node or the selected node's descendants. | 11-03-2011 |
20110271264 | METHOD FOR THE TRANSLATION OF PROGRAMS FOR RECONFIGURABLE ARCHITECTURES - Data processing using multidimensional fields is described along with methods for advantageously using high-level language codes. | 11-03-2011 |
20110314458 | BINDING DATA PARALLEL DEVICE SOURCE CODE - A compile environment is provided in a computer system that allows programmers to program both CPUs and data parallel devices (e.g., GPUs) using a high level general purpose programming language that has data parallel (DP) extensions. A compilation process translates modular DP code written in the general purpose language into DP device source code in a high level DP device programming language using a set of binding descriptors for the DP device source code. A binder generates a single, self-contained DP device source code unit from the set of binding descriptors. A DP device compiler generates a DP device executable for execution on one or more data parallel devices from the DP device source code unit. | 12-22-2011 |
20120005662 | INDEXABLE TYPE TRANSFORMATIONS - A high level programming language provides an extensible set of transformations for use on indexable types in a data parallel processing environment. A compiler for the language implements each transformation as a map from indexable types to allow each transformation to be applied to other transformations. At compile time, the compiler identifies sequences of the transformations on each indexable type in data parallel source code and generates data parallel executable code to implement the sequences as a combined operation at runtime using the transformation maps. The compiler also incorporates optimizations that are based on the sequences of transformations into the data parallel executable code. | 01-05-2012 |
20120066668 | C/C++ LANGUAGE EXTENSIONS FOR GENERAL-PURPOSE GRAPHICS PROCESSING UNIT - A general-purpose programming environment allows users to program a GPU as a general-purpose computation engine using familiar C/C++ programming constructs. Users may use declaration specifiers to identify which portions of a program are to be compiled for a CPU or a GPU. Specifically, functions, objects and variables may be specified for GPU binary compilation using declaration specifiers. A compiler separates the GPU binary code and the CPU binary code in a source file using the declaration specifiers. The location of objects and variables in different memory locations in the system may be identified using the declaration specifiers. Cooperative thread array (CTA) threading information is also provided for the GPU to support parallel processing. | 03-15-2012 |
20120151459 | NESTED COMMUNICATION OPERATOR - A high level programming language provides a nested communication operator that partitions a computational space. An indexable type with a rank and element type defines the computational space. The nested communication operator partitions a specified dimension of an indexable type into segments specified by a segmentation vector and returns an output indexable type that represents the segments. By doing so, the nested communication operator allows data parallel algorithms to operate on the segments as individual units. | 06-14-2012 |
20120151460 | Procedural Concurrency Graph Generator - A parallel-code optimization system includes a Procedural Concurrency Graph (PCG) generator. The PCG generator produces an initial PCG of a computer program including parallel code, and determines a refined PCG from the initial PCG by applying concurrency-type refinements and interference-type refinements to the initial PCG. The initial PCG and the refined PCG include nodes and edges connecting pairs of the nodes. The nodes represent defined procedures in the parallel code, and each edge represents a may-happen-in-parallel relation, and is associated with a set of lvalues that represents the immediate interference between the corresponding pair of nodes. | 06-14-2012 |
20120180030 | SYSTEMS AND METHODS FOR DYNAMICALLY CHOOSING A PROCESSING ELEMENT FOR A COMPUTE KERNEL - A runtime system implemented in accordance with the present invention provides an application platform for parallel-processing computer systems. Such a runtime system enables users to leverage the computational power of parallel-processing computer systems to accelerate/optimize numeric and array-intensive computations in their application programs. This enables greatly increased performance of high-performance computing (HPC) applications. | 07-12-2012 |
20120180031 | Data Parallel Function Call for Determining if Called Routine is Data Parallel - Mechanisms for performing data parallel function calls in code during runtime are provided. These mechanisms may operate to execute, in the processor, a portion of code having a data parallel function call to a target portion of code. The mechanisms may further operate to determine, at runtime by the processor, whether the target portion of code is a data parallel portion of code or a scalar portion of code and determine whether the calling code is data parallel code or scalar code. Moreover, the mechanisms may operate to execute the target portion of code based on the determination of whether the target portion of code is a data parallel portion of code or a scalar portion of code, and the determination of whether the calling code is data parallel code or scalar code. | 07-12-2012 |
20120192164 | UTILIZING SPECIAL PURPOSE ELEMENTS TO IMPLEMENT A FSM - Apparatus, systems, and methods for a compiler are described. One such compiler generates machine code corresponding to a set of elements including a general purpose element and a special purpose element. The compiler identifies a portion in an arrangement of relationally connected operators that corresponds to a special purpose element. The compiler also determines whether the portion meets a condition to be mapped to the special purpose element. The compiler also converts the arrangement into an automaton comprising a plurality of states, wherein the portion is converted using a special purpose state that corresponds to the special purpose element if the portion meets the condition. The compiler also converts the automaton into machine code. Additional apparatus, systems, and methods are disclosed. | 07-26-2012 |
20120192165 | UNROLLING QUANTIFICATIONS TO CONTROL IN-DEGREE AND/OR OUT-DEGREE OF AUTOMATON - Apparatus, systems, and methods for a compiler are disclosed. One such compiler parses a human readable expression into a syntax tree and converts the syntax tree into an automaton having in-transitions and out-transitions. Converting can include unrolling the quantification as a function of in-degree limitations, wherein the in-degree limitations include a limit on the number of transitions into a state of the automaton. The compiler can also convert the automaton into an image for programming a parallel machine and publish the image. Additional apparatus, systems, and methods are disclosed. | 07-26-2012 |
20120192166 | STATE GROUPING FOR ELEMENT UTILIZATION - Embodiments of a system and method for generating an image configured to program a parallel machine from source code are disclosed. One such parallel machine includes a plurality of state machine elements (SMEs) grouped into pairs, such that SMEs in a pair have a common output. One such method includes converting source code into an automaton comprising a plurality of interconnected states, and converting the automaton into a netlist comprising instances corresponding to states in the automaton, wherein converting includes pairing states corresponding to pairs of SMEs based on the fact that SMEs in a pair have a common output. The netlist can be converted into the image and published. | 07-26-2012 |
20120260239 | PARALLELIZATION OF PLC PROGRAMS FOR OPERATION IN MULTI-PROCESSOR ENVIRONMENTS - A method has been developed for identifying and extracting functional parallelism from a PLC program so that the extracted program fragments can be executed in parallel across a plurality of separate resources, together with a compiler configured to perform the functional parallelism (i.e., the identification and extraction processes) and to schedule the separate fragments within a given set of resources. The inventive functional parallelism creates a larger number of separable elements than was possible with prior dataflow analysis methodologies. | 10-11-2012 |
20120317558 | BINDING EXECUTABLE CODE AT RUNTIME - The present invention extends to methods, systems, and computer program products for binding executable code at runtime. Embodiments of the invention include late binding of specified aspects of code to improve execution performance. A runtime dynamically binds lower level code based on runtime information to optimize execution of a higher level algorithm. Aspects of a higher level algorithm having a requisite (e.g., higher) impact on execution performance can be targeted for late binding. Improved performance can be achieved with minimal runtime costs using late binding for aspects having the requisite execution performance impact. | 12-13-2012 |
20120331450 | SYSTEM AND METHOD FOR APPLYING A SEQUENCE OF OPERATIONS CODE TO PROGRAM CONFIGURABLE LOGIC CIRCUITRY - A method and system are provided for deriving a resultant software program from an originating software program that may include overlapping branch logic. The method may include deriving a plurality of software objects from a sequence of processor instructions; associating software objects in accordance with an original logic of the sequence of processor instructions; determining and resolving memory precedence conflicts within the associated plurality of software objects; de-overlapping the execution of the associated plurality of software objects by replacing all overlapping branch logic instructions of the associated series of software objects with equivalent and non-overlapping branch logic instructions; and/or applying the de-overlapped associated plurality of software objects in a programming operation by a parallel execution logic circuitry. The resultant software is more easily converted into programming reconfigurable logic than the originating software program, computers or processors, or by means of a computer or a communications network. | 12-27-2012 |
20130014095 | SOFTWARE-TO-HARDWARE COMPILER WITH SYMBOL SET INFERENCE ANALYSIS - A software-to-hardware compiler is provided that generates hardware constructs in programmable logic resources. The programmable logic resources may be optimized by configuring them to make additional copies of regions on memory devices other than the programmable logic resources (e.g., RAM). This facilitates multiple reads during a single clock cycle. Symbol set analysis is used to minimize the size of regions to allow for more efficient use of hardware resources. | 01-10-2013 |
20130019230 | Program Generating Apparatus, Method of Generating Program, and Medium - According to an embodiment, a program generating apparatus includes a cross-compiling unit, a processing time calculating unit, a source code converting unit, and a self-compiling unit. The cross-compiling unit generates an instruction string for each basic block based on a source code and specifies instructions performing a memory access. The processing time calculating unit calculates a processing time of the instruction string for each basic block. The source code converting unit inserts a first code, which adds the processing time of the basic block to an accumulated processing time variable of an executed thread of the basic block, and a second code, which calculates the processing time for the specified memory access and adds the calculated processing time to the accumulated processing time variable, into the source code. The self-compiling unit generates a performance estimating program outputting the accumulated processing time variable of the executed thread. | 01-17-2013 |
20130055224 | OPTIMIZING COMPILER FOR IMPROVING APPLICATION PERFORMANCE ON MANY-CORE COPROCESSORS - A system and method for compiling includes parsing code of an application stored in a computer readable storage medium to identify one or more parallelizable code portions. At least one parallelizable code portion is optimized by transforming offload construct code portions to provide an optimized application. | 02-28-2013 |
20130061213 | METHODS AND SYSTEMS FOR OPTIMIZING EXECUTION OF A PROGRAM IN A PARALLEL PROCESSING ENVIRONMENT - An automated method of optimizing execution of a program in a parallel processing environment is described. The program is adapted to execute in data memory and instruction memory. An optimizer receives the program to be optimized. The optimizer instructs the program to be compiled and executed. The optimizer observes execution of the program and identifies a subset of instructions that execute most often. The optimizer also identifies groups of instructions associated with the subset of instructions that execute most often. The identified groups of instructions include the identified subset of instructions that execute most often. The optimizer recompiles the program and stores the identified groups of instructions in instruction memory. The remaining instruction portions of the program are stored in the data memory. The instruction memory has a higher access rate and smaller capacity than the data memory. Once recompiled, subsequent execution of the program occurs using the recompiled program. | 03-07-2013 |
20130067443 | Parallel Processing Development Environment Extensions - A method for parallelization of an algorithm executing on a parallel processing system. An extension element is generated for each of the sections of the algorithm, where the sections comprise: distribution of data to multiple processing elements, transfer of data from outside of the algorithm to inside of the algorithm, global cross-communication of data between processing elements, moving data to a subset of the processing elements, and transfer of data from inside of the algorithm to outside of the algorithm. Each extension element functions to provide parallelization at a respective place in the algorithm where parallelization of the algorithm may occur. | 03-14-2013 |
20130117734 | TECHNIQUE FOR LIVE ANALYSIS-BASED REMATERIALIZATION TO REDUCE REGISTER PRESSURES AND ENHANCE PARALLELISM - A device compiler and linker within a parallel processing unit (PPU) is configured to optimize program code of a co-processor enabled application by rematerializing a subset of live-in variables for a particular block in a control flow graph generated for that program code. The device compiler and linker identifies the block of the control flow graph that has the greatest number of live-in variables, then selects a subset of the live-in variables associated with the identified block for which rematerializing confers the greatest estimated profitability. The profitability of rematerializing a given subset of live-in variables is determined based on the number of live-in variables reduced, the cost of rematerialization, and the potential risk of rematerialization. | 05-09-2013 |
20130132934 | APPLICATON INTERFACE ON MULTIPLE PROCESSORS - A method and an apparatus that execute a parallel computing program in a programming language for a parallel computing architecture are described. The parallel computing program is stored in memory in a system with parallel processors and allocates threads between a host processor and a GPU. The programming language includes an API to allow an application to make calls using the API to allocate execution of the threads between the host processor and the GPU. The programming language includes host function data tokens for host functions performed in the host processor and kernel function data tokens for compute kernel functions performed in one or more compute processors, e.g., GPUs or CPUs, separate from the host processor. | 05-23-2013 |
20130254754 | METHODS AND SYSTEMS FOR OPTIMIZING THE PERFORMANCE OF SOFTWARE APPLICATIONS AT RUNTIME - Systems and methods for optimizing the performance of software applications are described. Embodiments include computer implemented steps for identifying at least two constituent software components for parallel execution, executing the identified software components, profiling the performance of the one or more software components at an execution time, creating an optimization model with the set of data gathered from profiling the execution of the one or more software components, and marking at least two software components for execution in parallel in a subsequent execution on the basis of the optimization model. In additional embodiments, the optimization model may be reconfigured on the basis of a cost-benefit analysis of parallelization, and the software components involved marked for sequential execution if the resource overhead associated with parallelization exceeds the corresponding resource or throughput benefit. | 09-26-2013 |
20130263100 | EFFICIENT PARALLEL COMPUTATION OF DEPENDENCY PROBLEMS - A computing method includes accepting a definition of a computing task, which includes multiple Processing Elements (PEs) having execution dependencies. The computing task is compiled for concurrent execution on a multiprocessor device, by arranging the PEs in a series of two or more invocations of the multiprocessor device, including assigning the PEs to the invocations depending on the execution dependencies. The multiprocessor device is invoked to run software code that executes the series of the invocations, so as to produce a result of the computing task. | 10-03-2013 |
20140047421 | PARALLELIZATION METHOD, SYSTEM, AND PROGRAM - A segment including a set of blocks necessary to calculate blocks having internal states and blocks having no outputs is extracted by tracing, in the reverse direction of dependence, from the blocks used in calculating inputs into the blocks having internal states and from the blocks having no outputs. To newly extract segments from which the blocks contained in the already extracted segments are removed, a set of nodes to be temporarily removed is determined on the basis of parallelism. Segments executable independently of other segments are extracted by tracing in the upstream direction from nodes whose child nodes are lost through removal of those nodes. Segments are divided into upstream segments representing the newly extracted segments and downstream segments representing the nodes temporarily removed. Upstream and downstream segments are merged so as to reduce overlapping blocks between segments, such that the number of segments is reduced to the number of parallel executions. | 02-13-2014 |
20140068581 | OPTIMIZED DIVISION OF WORK AMONG PROCESSORS IN A HETEROGENEOUS PROCESSING SYSTEM - A compiler implemented by a computer performs optimized division of work across heterogeneous processors. The compiler divides source code into code sections and characterizes each of the code sections based on pre-defined criteria. Each of the code sections is characterized as at least one of: allocate to a main processor, allocate to a processing element, allocate to one of a parameterized main processor and a parameterized processing element, and indeterminate. The compiler analyzes side-effects and costs of executing the code sections on allocated processors, and transforms the code sections based on results of the analyzing. The transforming includes re-characterizing the code sections for alternate execution in a runtime environment. | 03-06-2014 |
20140068582 | OPTIMIZED DIVISION OF WORK AMONG PROCESSORS IN A HETEROGENEOUS PROCESSING SYSTEM - A compiler implemented by a computer performs optimized division of work across heterogeneous processors. The compiler divides source code into code sections and characterizes each of the code sections based on pre-defined criteria. Each of the code sections is characterized as at least one of: allocate to a main processor, allocate to a processing element, allocate to one of a parameterized main processor and a parameterized processing element, and indeterminate. The compiler analyzes side-effects and costs of executing the code sections on allocated processors, and transforms the code sections based on results of the analyzing. The transforming includes re-characterizing the code sections for alternate execution in a runtime environment. | 03-06-2014 |
20140096117 | DATA DEPENDENCE ANALYSIS SUPPORT DEVICE, DATA DEPENDENCE ANALYSIS SUPPORT PROGRAM, AND DATA DEPENDENCE ANALYSIS SUPPORT METHOD - A data dependence analysis support device calculates pointer information by performing a context-sensitive pointer analysis on every pointer used in a program; calculates dataflow information between statements by performing a context-sensitive dataflow analysis, using the context-sensitive pointer information, on all statements in an analysis target region and all statements that might be called upon execution of the analysis target region; and calculates inter-region data dependence information, using the dataflow information, for two or more threaded regions included in the source program. | 04-03-2014 |
20140109069 | METHOD OF COMPILING PROGRAM TO BE EXECUTED ON MULTI-CORE PROCESSOR, AND TASK MAPPING METHOD AND TASK SCHEDULING METHOD OF RECONFIGURABLE PROCESSOR - A method of compiling a program to be executed on a multicore processor is provided. The method may include generating an initial solution by mapping a task to a source processing element (PE) and a destination PE, and selecting a communication scheme for transmission of the task from the source PE to the destination PE, approximately optimizing the mapping and communication scheme included in the initial solution, and scheduling the task, wherein the communication scheme is designated in a compiling process. | 04-17-2014 |
20140196017 | SYSTEM AND METHOD FOR COMPILER ASSISTED PARALLELIZATION OF A STREAM PROCESSING OPERATOR - A method of enabling compiler assisted parallelization of one or more stream processing operators in a stream processing application, which consists of a data flow graph with operators as vertices connected by streams. The method includes specifying, in the stream application, a parallelized version of one or more of the operators with a parameterized degree of parallelism; evaluating whether to use the parallelized operator; and, if a parallelized operator is needed, deciding its degree of parallelism. | 07-10-2014 |
20140223419 | COMPILER, OBJECT CODE GENERATION METHOD, INFORMATION PROCESSING APPARATUS, AND INFORMATION PROCESSING METHOD - According to one embodiment, a compiler is applicable to a parallel computer including processors, wherein a source program is input to the compiler and local code is generated for each of the processors. The compiler includes a generation module and an object code generation module. The generation module is configured to analyze the input source program, extract a data transfer point from a procedure described in the source program, and generate a call processing for data copy. The object code generation module is configured to generate an object code including the call processing. | 08-07-2014 |
20140258995 | Compiler and Language for Parallel and Pipelined Computation - A compiler and language that use the comma as a parallelism operator may ensure that variables on the left-hand side of a line of code are used only once and are not used as function arguments. Commas may be replaced with semicolons. | 09-11-2014 |
20140325494 | METHODS AND SYSTEMS FOR DETECTION IN A STATE MACHINE - A device includes a data analysis element that includes a plurality of memory cells. The memory cells analyze at least a portion of a data stream and output a result of the analysis. The device also includes a detection cell. The detection cell includes an AND gate. The AND gate receives the result of the analysis as a first input. The detection cell also includes a D flip-flop whose output is coupled to a second input of the AND gate. | 10-30-2014 |
20140359589 | Graphical Development and Deployment of Parallel Floating-Point Math Functionality on a System with Heterogeneous Hardware Components - System and method for configuring a system of heterogeneous hardware components, including at least one programmable hardware element (PHE), at least one digital signal processor (DSP) core, and at least one programmable communication element (PCE). A program, e.g., a graphical program (GP), is created that includes floating-point math functionality and is targeted for distributed deployment on the system. Respective portions of the program for deployment to respective ones of the hardware components are automatically determined. Program code that implements communication functionality between the at least one PHE and the at least one DSP core, and that is targeted for deployment to the at least one PCE, is automatically generated. At least one hardware configuration program (HCP) is generated from the program and the code, including compiling the respective portions of the program and the program code for deployment to respective hardware components. The HCP is deployable to the system for concurrent execution of the program. | 12-04-2014 |
20140359590 | Development and Deployment of Parallel Floating-Point Math Functionality on a System with Heterogeneous Hardware Components - System and method for configuring a system of heterogeneous hardware components, including at least one programmable hardware element (PHE), at least one digital signal processor (DSP) core, and at least one programmable communication element (PCE). A program, e.g., a graphical program (GP), is created that includes floating-point math functionality and is targeted for distributed deployment on the system. Respective portions of the program for deployment to respective ones of the hardware components are automatically determined. Program code that implements communication functionality between the at least one PHE and the at least one DSP core, and that is targeted for deployment to the at least one PCE, is automatically generated. At least one hardware configuration program (HCP) is generated from the program and the code, including compiling the respective portions of the program and the program code for deployment to respective hardware components. The HCP is deployable to the system for concurrent execution of the program. | 12-04-2014 |
20140380288 | UTILIZING SPECIAL PURPOSE ELEMENTS TO IMPLEMENT A FSM - Apparatus, systems, and methods for a compiler are described. One such compiler generates machine code corresponding to a set of elements including a general purpose element and a special purpose element. The compiler identifies a portion in an arrangement of relationally connected operators that corresponds to a special purpose element. The compiler also determines whether the portion meets a condition to be mapped to the special purpose element. The compiler also converts the arrangement into an automaton comprising a plurality of states, wherein the portion is converted using a special purpose state that corresponds to the special purpose element if the portion meets the condition. The compiler also converts the automaton into machine code. Additional apparatus, systems, and methods are disclosed. | 12-25-2014 |
20150100948 | IRREDUCIBLE MODULES - An approach to generating irreducible modules is provided. The approach includes a method that includes receiving, by at least one computing device, data associated with a specification. The method includes defining, by the at least one computing device, a pattern on the received data. The pattern reduces a set of rules to a single condition. The method includes generating, by the at least one computing device, an irreducible module based on the pattern. The irreducible module has one output dependent variable and is associated with a data flow application. | 04-09-2015 |
20150100949 | PROCESSING METHOD - A method is provided for processing computer program code so that different parts of the code can be executed by different processing elements of a plurality of communicating processing elements. The method comprises identifying at least one first part of the computer program code that is to be executed by a particular one of said processing elements. The method further comprises identifying at least one further part of the computer program code that is related to the at least one first part. The at least one first part and the at least one further part of the computer program code are caused to be executed by that particular processing element. | 04-09-2015 |
20150113514 | SOURCE-TO-SOURCE TRANSFORMATIONS FOR GRAPH PROCESSING ON MANY-CORE PLATFORMS - Methods are provided for source-to-source transformations for graph processing on many-core platforms. A method includes receiving a graph application including one graph, expressed by a graph application programming interface configured for defining and manipulating graphs. The method further includes transforming, by a source-to-source compiler, the graph application into a plurality of parallel code variants. Each of the plurality of parallel code variants is specifically configured for parallel execution by a target one of a plurality of different many-core processors. The method also includes selecting and tuning, by a runtime component, a particular one of the parallel code variants for the parallel execution responsive to graph application characteristics, graph data, and an underlying code execution platform of the plurality of different many-core processors. | 04-23-2015 |
20150293752 | Unrestricted, Fully-Source-Preserving, Concurrent, Wait-Free, Synchronization-Free, Fully-Error-Handling Frontend With Inline Schedule Of Tasks And Constant-Space Buffers - A concurrent, wait-free compiler/compiler front-end for C/C++ and other programming languages, comprising parallel stages that perform character translation, line translation, macro rewriting, lexing, parsing, and error handling on input text and translate it to an object form, with features including (a) long lexemes, (b) display modifiers, (c) look-ahead isolation, (d) line-by-line processing followed by tokenization, (e) complete error handlers, and/or (f) precise and inline context switches. | 10-15-2015 |
20150347106 | METHOD AND APPARATUS FOR A COMPILER AND RELATED COMPONENTS FOR STREAM-BASED COMPUTATIONS FOR A GENERAL-PURPOSE, MULTIPLE-CORE SYSTEM - A method and system for compiling and linking source stream programs for efficient use of multi-node devices. The system includes a compiler, a linker, a loader and a runtime component. The process converts a source code stream program to compiled object code that is used with a programmable node-based computing device having a plurality of processing nodes coupled to each other. The programming modules include stream statements for input values and output values in the form of sources and destinations for at least one of the plurality of processing nodes, and stream statements that determine the streaming flow of values for the at least one of the plurality of processing nodes. The compiler converts the source code stream-based program to object modules, object module instances and executables. The linker matches the object module instances to at least one of the multiple cores. The loader loads the tasks required by the object modules into the nodes and configures the nodes matched with the object module instances. The runtime component runs the converted program. | 12-03-2015 |
20150378696 | HYBRID PARALLELIZATION STRATEGIES FOR MACHINE LEARNING PROGRAMS ON TOP OF MAPREDUCE - Hybrid parallelization strategies for machine learning programs on top of MapReduce are provided. In one embodiment, a method of and computer program product for parallel execution of machine learning programs are provided. Program code is received. The program code contains at least one parallel for statement having a plurality of iterations. A parallel execution plan is determined for the program code. According to the parallel execution plan, the plurality of iterations is partitioned into a plurality of tasks. Each task comprises at least one iteration. The iterations of each task are independent. | 12-31-2015 |
20160092182 | METHODS AND SYSTEMS FOR OPTIMIZING EXECUTION OF A PROGRAM IN A PARALLEL PROCESSING ENVIRONMENT - An automated method of optimizing execution of a program in a parallel processing environment is described. The program is adapted to execute in data memory and instruction memory. An optimizer receives the program to be optimized. The optimizer instructs the program to be compiled and executed. The optimizer observes execution of the program and identifies a subset of instructions that execute most often. The optimizer also identifies groups of instructions associated with the subset of instructions that execute most often. The identified groups of instructions include the identified subset of instructions that execute most often. The optimizer recompiles the program and stores the identified groups of instructions in instruction memory. The remaining portions of the program are stored in the data memory. The instruction memory has a higher access rate and smaller capacity than the data memory. Once recompiled, subsequent execution of the program occurs using the recompiled program. | 03-31-2016 |
20160103663 | PROGRAMMING A MULTI-PROCESSOR SYSTEM - A computer-implemented method for creating a program for a multi-processor system comprising a plurality of interspersed processors and memories. A user may specify or create source code using a programming language. The source code specifies a plurality of tasks and communication of data among the plurality of tasks. However, the source code need not (and preferably is not required to) explicitly specify 1) which physical processor will execute each task or 2) which communication mechanism to use among the plurality of tasks. The method then creates machine language instructions based on the source code, wherein the machine language instructions are designed to execute on the plurality of processors. Creation of the machine language instructions comprises assigning tasks for execution on respective processors and selecting communication mechanisms between the processors based on the locations of the respective processors and the data communication required to satisfy system requirements. | 04-14-2016 |
20160117152 | METHOD AND SYSTEM OF A COMMAND BUFFER BETWEEN A CPU AND GPU - A method and system for a command processor that efficiently processes a program on a multi-core system with a CPU and a GPU. The multi-core system includes a general-purpose CPU executing commands in a CPU programming language and a graphics processing unit (GPU) executing commands in a GPU programming language. A command processor is coupled to the CPU and GPU. The command processor sequences jobs from a program for processing by the CPU or the GPU. The command processor creates commands from the jobs in a state-free command format. The command processor generates a sequence of commands in this command format for execution by either the CPU or the GPU. A compiler running a meta language converts program data for the commands into a first format readable by the CPU programming language and a second format readable by the GPU programming language. | 04-28-2016 |
20160124730 | HYBRID PARALLELIZATION STRATEGIES FOR MACHINE LEARNING PROGRAMS ON TOP OF MAPREDUCE - Parallel execution of machine learning programs is provided. Program code is received. The program code contains at least one parallel for statement having a plurality of iterations. A parallel execution plan is determined for the program code. According to the parallel execution plan, the plurality of iterations is partitioned into a plurality of tasks. Each task comprises at least one iteration. The iterations of each task are independent. Data required by the plurality of tasks is determined. An access pattern by the plurality of tasks of the data is determined. The data is partitioned based on the access pattern. | 05-05-2016 |
20160139900 | ALGORITHM TO ACHIEVE OPTIMAL LAYOUT OF DECISION LOGIC ELEMENTS FOR PROGRAMMABLE NETWORK DEVICES - A processing network includes a plurality of lookup and decision engines (LDEs), each having one or more configuration registers, and a plurality of on-chip routers forming a matrix for routing data between the LDEs, wherein each of the on-chip routers is communicatively coupled with one or more of the LDEs. The processing network further includes an LDE compiler stored on a memory and communicatively coupled with each of the LDEs, wherein the LDE compiler is configured to generate values based on input source code that, when programmed into the configuration registers of the LDEs, cause the LDEs to implement the functionality defined by the input source code. | 05-19-2016 |
20160147515 | METHOD AND APPARATUS FOR PROCESSING DATA USING CALCULATORS HAVING DIFFERENT DEGREES OF ACCURACY - A method of processing data includes classifying input data into first data and second data, the second data being different from the first data, separately compiling the first data and the second data, and providing the compiled first data and the compiled second data to a first operator and a second operator, respectively, in which the first operator performs an operation different from an operation performed by the second operator. | 05-26-2016 |
20160170728 | UTILIZING SPECIAL PURPOSE ELEMENTS TO IMPLEMENT A FSM | 06-16-2016 |
20160196123 | BINARY FILE FOR COMPUTER PROGRAM HAVING MULTIPLE EXECUTABLE CODE VARIANTS FOR A FUNCTION THAT ARE EXECUTABLE ON A SAME PROCESSOR ARCHITECTURE | 07-07-2016 |