Class / Patent application number | Description | Number of patent applications / Date published |
712241000 | Loop execution | 60 |
20080282072 | Executing Software Within Real-Time Hardware Constraints Using Functionally Programmable Branch Table - A computer system is disclosed which includes a CPU or microprocessor to drive tightly constrained hardware events. The system comprises a processor having a set of system inputs to drive a functionally programmable event, and a fast branch in the CPU including a state handler to execute instructions from the CPU to process the event. A queue in the CPU stores the events such that the non-pre-empted events are serviced in the order they are received. | 11-13-2008 |
20080294882 | DISTRIBUTED LOOP CONTROLLER ARCHITECTURE FOR MULTI-THREADING IN UNI-THREADED PROCESSORS - In one aspect, a virtually multi-threaded distributed instruction memory hierarchy that can support the execution of multiple incompatible loops in parallel is disclosed. In addition to regular loops, irregular loops with conditional constructs and nested loops can be mapped. The loop buffers are clustered, each loop buffer having its own local controller, and each local controller is responsible for indexing and regulating accesses to its loop buffer. | 11-27-2008 |
20080301421 | Method of speeding up execution of repeatable commands and microcontroller able to speed up execution of repeatable commands - A method to speed up the execution of repeatable commands and a microcontroller able to speed up the execution of repeatable commands are disclosed. When the microcontroller is to execute repeatable commands in a program, it temporarily stores the repeatable commands in a storage unit. If execution of the repeatable command loop continues, the repeatable command loop is retrieved from the storage unit and executed at a higher clock frequency. The start and end of the repeatable command loop are defined by a starting point and an end point, respectively, which are used to determine whether the repeatable command loop should continue to execute. The microcontroller thereby speeds up the execution of the repeatable commands and improves performance. | 12-04-2008 |
20080320289 | STORAGE MEDIUM STORING CALCULATION PROCESSING VISUALIZATION PROGRAM, CALCULATION PROCESSING VISUALIZATION APPARATUS, AND CALCULATION PROCESSING VISUALIZATION METHOD - The execution status of pipeline processing is made easier to visualize by displaying processes that form loops in a simplified manner. A loop-information storage unit stores loop-defining information specifying the address of an instruction that causes a pipeline process forming a loop. An operation-information storage unit stores operation information that includes the address of an instruction input into a pipeline and information indicating the execution status of the pipeline process caused by the instruction. A loop determination unit determines whether each pipeline process indicated by the operation information forms a loop by referring to the loop-defining information. An output unit outputs visualization information indicating, in a visually comprehensible manner, the execution status of a pipeline process determined to form a loop for a predetermined number of executions of the loop, and the execution status of a pipeline process determined to form no loop. | 12-25-2008 |
20090024842 | Precise Counter Hardware for Microcode Loops - In an embodiment, a microcode unit for a processor is contemplated. The microcode unit comprises a microcode memory storing a plurality of microcode routines executable by the processor, wherein each microcode routine comprises two or more microcode operations. A sequence control unit coupled to the microcode memory is configured to control the reading of microcode operations from the microcode memory to be issued for execution by the processor. The sequence control unit is configured to stall issuance of microcode operations forming the body of a loop in a first routine of the plurality of microcode routines until a loop counter value that indicates the number of iterations of the loop is received by the sequence control unit. | 01-22-2009 |
20090055635 | PROGRAM EXECUTION CONTROL DEVICE - A program execution control device controls execution of a program by a processor having a predicate function for conditional execution of instructions. The program includes a branch instruction to control iterations in loop processing; the branch instruction also generates an execute-or-not condition indicating whether the branch instruction is to be executed at an iteration in the loop processing after the current iteration, and reflects the execute-or-not condition on a predicate flag used for conditional execution of the branch instruction. The program execution control device comprises a processor status changing unit configured to change, before an execution cycle of the branch instruction, a status of the processor in advance for execution of an instruction following the branch instruction, the status being changed based on the execute-or-not condition reflected on the predicate flag. | 02-26-2009 |
20090077361 | DETECTING SPIN LOOPS IN A VIRTUAL MACHINE ENVIRONMENT - Embodiments of apparatuses, methods, and systems for detecting spin loops in a virtual machine environment are disclosed. In one embodiment, an apparatus includes detection logic and virtualization logic. The detection logic is to detect whether a guest is executing a spin loop. The virtualization logic is to transfer control of the apparatus from the guest to a host in response to the detection logic detecting that the guest is executing the spin loop. | 03-19-2009 |
20090083527 | Counter circuit, dynamic reconfigurable circuitry, and loop processing control method - A dynamic reconfigurable circuit implements arbitrary processing by dynamically switching the processing content of reconfigurable processing elements (PEs) and the connections between the PEs in accordance with a context. The circuit includes: a configuration register section for setting the content of loop processing on the basis of the context, the loop processing content including an output source of an output signal from each of a set of the reconfigured PEs, an output destination of the output signal, and a condition for outputting the output signal to the output destination; and at least one counter circuit including a loop control section and an output register section that implement the set loop processing, count the number of times the loop processing is performed by the loop control section, and output the output signal to the output destination based on the counted number and the condition. | 03-26-2009 |
20090113191 | Apparatus and Method for Improving Efficiency of Short Loop Instruction Fetch - A method, system and computer program product for instruction fetching within a processor instruction unit, utilizing a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. During instruction fetch, modified instruction buffers coupled to an instruction cache (I-cache) temporarily store instructions from a single branch, backwards short loop. The modified instruction buffers may be a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. Instructions are stored in the modified instruction buffers for the length of the loop cycle. The instruction fetch within the instruction unit of a processor retrieves the instructions for the short loop from the modified buffers during the loop cycle, rather than from the instruction cache. | 04-30-2009 |
20090113192 | DESIGN STRUCTURE FOR IMPROVING EFFICIENCY OF SHORT LOOP INSTRUCTION FETCH - A design structure provides instruction fetching within a processor instruction unit, utilizing a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. During instruction fetch, modified instruction buffers coupled to an instruction cache (I-cache) temporarily store instructions from a single branch, backwards short loop. The modified instruction buffers may be a loop buffer, one or more virtual loop buffers, and/or an instruction buffer. Instructions are stored in the modified instruction buffers for the length of the loop cycle. The instruction fetch within the instruction unit of a processor retrieves the instructions for the short loop from the modified buffers during the loop cycle, rather than from the instruction cache. | 04-30-2009 |
20090144529 | SIMD Code Generation For Loops With Mixed Data Lengths - Generating loop code to execute on Single-Instruction Multiple-Datapath (SIMD) architectures, where the loop operates on datatypes having different lengths, is disclosed. Further, a preferred embodiment of the present invention includes a novel technique to efficiently realign or shift arbitrary streams to an arbitrary offset, regardless of whether the alignments or offsets are known at compile time. This technique enables the application of advanced alignment optimizations to runtime alignment. Length conversion operations, for packing and unpacking data values, are included in the alignment handling framework. These operations are formally defined in terms of standard SIMD instructions that are readily available on various SIMD platforms. This allows sequential loop code operating on datatypes of disparate length to be transformed (“simdized”) into optimized SIMD code through a fully automated process. | 06-04-2009 |
20090150658 | Processor and Signal Processing Method - This invention combines a loop support mechanism and a branch prediction mechanism. After an instruction execution unit executes an end block instruction of a block repeat, the loop control unit branches to the first instruction in the loop and sends a pseudo branch instruction to the instruction execution unit. The instruction execution unit acts as if the last instruction in the block is an instruction for branching to the start address of the block. This is stored in the branch prediction unit and branch prediction is performed thereafter. | 06-11-2009 |
20090158018 | Method and System for Auto Parallelization of Zero-Trip Loops Through Induction Variable Substitution - Provided is a method and system of auto parallelization of zero-trip loops that substitutes a nested basic linear induction variable by exploiting a parallelizing compiler. For a typical loop iterating from 1 to N, in which N is loop invariant, a max{0,N} value is used for the number of loop iterations when no information is known about the value of N. For the nested basic induction variables, an induction variable substitution process is applied to the nested loops, starting from the innermost loop and proceeding to the outermost one. The max operator is then removed through a copy propagation pass of the IBM compiler. In doing so, the loop dependency on the induction variable is eliminated and an opportunity for a parallelizing compiler to parallelize the outermost loop is provided. | 06-18-2009 |
20090217017 | METHOD, SYSTEM AND COMPUTER PROGRAM PRODUCT FOR MINIMIZING BRANCH PREDICTION LATENCY - A method, system, and computer program product for minimizing branch prediction latency in a pipelined computer processing environment are provided. The method includes detecting a branch loop utilizing branch instruction addresses and corresponding target addresses stored in a branch target buffer (BTB). The method also includes fetching the branch loop into a pre-decode instruction buffer and qualifying the branch loop for loop lockdown. The method further includes locking an instruction stream that forms the branch loop in the pre-decode instruction buffer and processing qualified branch loop instructions from the buffer and powering down instruction fetching and branch prediction logic (BPL) associated with the BTB. | 08-27-2009 |
20090259832 | RETARGETTING AN APPLICATION PROGRAM FOR EXECUTION BY A GENERAL PURPOSE PROCESSOR - One embodiment of the present invention sets forth a technique for translating application programs, written using a parallel programming model for execution on a multi-core graphics processing unit (GPU), for execution by a general purpose central processing unit (CPU). Portions of the application program that rely on specific features of the multi-core GPU are converted by a translator for execution by the general purpose CPU. The application program is partitioned into regions of synchronization-independent instructions. The instructions are classified as convergent or divergent, and divergent memory references that are shared between regions are replicated. Thread loops are inserted to ensure correct sharing of memory between various threads during execution by the general purpose CPU. | 10-15-2009 |
20090265534 | Fairness, Performance, and Livelock Assessment Using a Loop Manager With Comparative Parallel Looping - A method, apparatus, and computer program are provided for assessing fairness, performance, and livelock in a logic development process utilizing comparative parallel looping. Multiple loop macros are generated, the multiple loop macros respectively correspond to multiple processor threads, and the multiple loop macros are parallel comparative loop macros. The multiple processor threads for the multiple loop macros are executed in which a common resource is accessed. A forward performance of each of the multiple processor threads is verified. The forward performance of the multiple processor threads is compared with each other. It is determined whether any of the multiple processor threads fails to meet a minimum loop count or a minimum loop time. It is determined whether any of the multiple processor threads exceeds a maximum loop count or a maximum loop time. It is recognized whether fairness is maintained during the execution of the multiple processor threads. | 10-22-2009 |
20090307472 | Method and Apparatus for Nested Instruction Looping Using Implicit Predicates - A method and apparatus for executing a nested program loop on a vector processor, the loop comprising outer-pre, inner and outer-post portions. An input stream unit of the vector processor provides a data value to a data path and sets an associated data validity tag to ‘valid’ once per outer loop iteration, as indicated by an inner counter of the input stream unit. The tag is set to ‘invalid’ in other iterations. Functional units of the vector processor operate on data values in the data path, each functional unit producing a valid result if the data validity tags associated with input data values are set to ‘valid’. An output stream unit of the vector processor sinks a data value from the data path once per outer loop iteration if an associated data validity tag indicates that the data value is valid. | 12-10-2009 |
20090327674 | Loop Control System and Method - Loop control systems and methods are disclosed. In a particular embodiment, a hardware loop control logic circuit includes a detection unit to detect an end of loop indicator of a program loop. The hardware loop control logic circuit also includes a decrement unit to decrement a loop count and to decrement a predicate trigger counter. The hardware loop control logic circuit further includes a comparison unit to compare the predicate trigger counter to a reference to determine when to set a predicate value. | 12-31-2009 |
20100049958 | METHOD FOR EXECUTING AN INSTRUCTION LOOP AND A DEVICE HAVING INSTRUCTION LOOP EXECUTION CAPABILITIES - A method for managing a hardware instruction loop, the method includes: (i) detecting, by a branch prediction unit, an instruction loop; wherein a size of the instruction loop exceeds a size of a storage space allocated in a fetch unit for storing fetched instructions; (ii) requesting from the fetch unit to fetch instructions of the instruction loop that follow the first instructions of the instruction loop; and (iii) selecting, during iterations of the instruction loop, whether to provide to a dispatch unit one of the first instructions of the instruction loop or another instruction that is fetched by the fetch unit; wherein the first instructions of the instruction loop are stored at the dispatch unit. | 02-25-2010 |
20100058039 | INSTRUCTION FETCH PIPELINE FOR SUPERSCALAR DIGITAL SIGNAL PROCESSORS AND METHOD OF OPERATION THEREOF - A next program counter (PC) value generator. The next PC value generator includes a discontinuity decoder that is provided to detect a discontinuity instruction among a plurality of instructions and a tight loop decoder that is provided to: a) detect a tight loop instruction, and b) provide a tight loop instruction target address. The next PC value generator further includes next PC value logic having a plurality of inputs: a first input coupled to an output of the discontinuity decoder, and a second input coupled to an output of the tight loop decoder. The next PC value logic provides as an output, without a stall, a control signal indicating that a next PC value is to be loaded with the tight loop instruction target address if the discontinuity decoder detects a discontinuity instruction and the tight loop decoder detects a tight loop instruction. | 03-04-2010 |
20100185839 | APPARATUS AND METHOD FOR SCHEDULING INSTRUCTION - An apparatus and method for scheduling an instruction are provided. The apparatus includes an analyzer configured to analyze dependencies of a plurality of recurrence loops and a scheduler configured to schedule the recurrence loops based on the analyzed dependencies. When scheduling a plurality of recurrence loops, the apparatus first schedules a dominant loop whose loop head has no dependency on another loop among the recurrence loops. | 07-22-2010 |
20100199076 | COMPUTING APPARATUS AND METHOD OF HANDLING INTERRUPT - A computing apparatus and method of handling an interrupt are provided. The computing apparatus includes a coarse-grained array, a host processor, and an interrupt supervisor. When an interrupt occurs in the coarse-grained array while performing a loop operation, the host processor processes the interrupt, and the interrupt supervisor may perform mode switching between the coarse-grained array and the host processor. | 08-05-2010 |
20100235612 | MACROSCALAR PROCESSOR ARCHITECTURE - A macroscalar processor architecture is described herein. In one embodiment, an exemplary processor includes one or more execution units to execute instructions and one or more iteration units coupled to the execution units. The one or more iteration units receive one or more primary instructions of a program loop that comprise a machine executable program. For each of the primary instructions received, at least one of the iteration units generates multiple secondary instructions that correspond to multiple loop iterations of the task of the respective primary instruction when executed by the one or more execution units. Other methods and apparatuses are also described. | 09-16-2010 |
20100299509 | SIMULATION SYSTEM, METHOD AND PROGRAM - A computer-implemented pipeline execution system, method, and program product for executing loop processing in a multi-core or a multiprocessor computing environment, where the loop processing includes multiple function blocks in a multiple-stage pipeline manner. The system includes: a pipelining unit for pipelining the loop processing and assigning the loop processing to a computer processor or core; a calculating unit for calculating a first-order gradient term from a value calculated with the use of a predicted value of the input to a pipeline; and a correcting unit for correcting an output value of the pipeline with the value of the first-order gradient term. | 11-25-2010 |
20110029763 | BRANCH PREDICTOR FOR SETTING PREDICATE FLAG TO SKIP PREDICATED BRANCH INSTRUCTION EXECUTION IN LAST ITERATION OF LOOP PROCESSING - A processor simultaneously issues instructions to multiple threads in a same instruction execution cycle. An instruction issuer controls issuance of an instruction for each of the multiple threads. A detector detects, for each of the multiple threads, whether a loop processing is currently being executed. A unit causes the instruction issuer to increase a number of instructions to be issued when the detector detects that the loop processing is currently being executed. | 02-03-2011 |
20110072251 | PILE PROCESSING SYSTEM AND METHOD FOR PARALLEL PROCESSORS - A system, method and computer program product are provided for processing exceptions. Initially, computational operations are processed in a loop. Moreover, exceptions are identified and stored while processing the computational operations. Such exceptions are then processed separate from the loop. | 03-24-2011 |
20110161642 | Parallel Execution Unit that Extracts Data Parallelism at Runtime - Mechanisms for extracting data dependencies during runtime are provided. With these mechanisms, a portion of code having a loop is executed. A first parallel execution group is generated for the loop, the group comprising a subset of iterations of the loop less than a total number of iterations of the loop. The first parallel execution group is executed by executing each iteration in parallel. Store data for the iterations is stored in corresponding store caches of the processor. Dependency checking logic of the processor determines, for each iteration, whether the iteration has a data dependence. Only the store data for stores for which no data dependence was determined is committed to memory. | 06-30-2011 |
20110161643 | Runtime Extraction of Data Parallelism - Mechanisms for extracting data dependencies during runtime are provided. The mechanisms execute a portion of code having a loop and generate, for the loop, a first parallel execution group comprising a subset of iterations of the loop less than a total number of iterations of the loop. The mechanisms further execute the first parallel execution group and determining, for each iteration in the subset of iterations, whether the iteration has a data dependence. Moreover, the mechanisms commit store data to system memory only for stores performed by iterations in the subset of iterations for which no data dependence is determined. Store data of stores performed by iterations in the subset of iterations for which a data dependence is determined is not committed to the system memory. | 06-30-2011 |
20110219222 | Building Approximate Data Dependences with a Moving Window - Mechanisms for building approximate data dependences using a moving look-back window are provided. The mechanisms track dependence information for memory accesses over iterations of execution of a portion of code. The mechanisms receive a memory access of an iteration of the portion of code, the memory access having an address for accessing the memory and an access type indicating at least one of a read or a write access type. An entry in a moving look-back window data structure is generated corresponding to a memory location accessed by the memory access. The entry comprises at least an identification of the address, the access type, and an iteration number corresponding to the iteration of the memory access. The moving look-back window data structure is utilized to determine dependence information for memory accesses over a plurality of iterations of the portion of code. | 09-08-2011 |
20110302397 | Method and Apparatus for Improved Secure Computing and Communications - A computing and communications system and method may comprise a primitive recursive function computing engine including an instruction set architecture prohibiting loop operations that continue for an indefinite time. The system and method may further comprise the instruction set architecture comprising system identifiers selected from a group comprising things, places, paths, actions and causes. The instruction set architecture may comprise organizing at least one data thing into a processing path to be acted upon by an action. The instruction set architecture may comprise defining a processing element as comprising an input interface configured to receive a data thing into the processing path; a processor in the processing path configured to perform the action on the data thing; and an output interface configured to receive a result of performing the action on the data thing and to provide the result as an output of the processing element. | 12-08-2011 |
20120023316 | PARALLEL LOOP MANAGEMENT - The illustrative embodiments comprise a method, data processing system, and computer program product having a processor unit for processing instructions with loops. A processor unit creates a first group of instructions having a first set of loops and second group of instructions having a second set of loops from the instructions. The first set of loops have a different order of parallel processing from the second set of loops. A processor unit processes the first group. The processor unit monitors terminations in the first set of loops during processing of the first group. The processor unit determines whether a number of terminations being monitored in the first set of loops is greater than a selectable number of terminations. In response to a determination that the number of terminations is greater than the selectable number of terminations, the processor unit ceases processing the first group and processes the second group. | 01-26-2012 |
20120096247 | RECONFIGURABLE PROCESSOR AND METHOD FOR PROCESSING LOOP HAVING MEMORY DEPENDENCY - Provided are a reconfigurable processor, which is capable of reducing the probability of an incorrect computation by analyzing the dependence between memory access instructions and allocating the memory access instructions between a plurality of processing elements (PEs) based on the results of the analysis, and a method of controlling the reconfigurable processor. The reconfigurable processor extracts an execution trace from simulation results, and analyzes the memory dependence between instructions included in different iterations based on parts of the execution trace of memory access instructions. | 04-19-2012 |
20120124350 | TABLE-DRIVEN SOAKER TOOL FOR INFORMATION HANDLING SYSTEMS - A soaker tool for an information handling system (IHS) exercises the IHS to provide a predetermined amount of utilization that a user may specify. The soaker tool schedules wait times following respective utilization times in alternating fashion to achieve a desired utilization value for a predetermined time period. The soaker tool monitors for a dispatch interrupt during the utilization times. Should a dispatch interrupt occur during a utilization time, the soaker tool accounts for the dispatch interrupt by determining a remainder utilization time to maintain utilization accuracy. The soaker tool may employ a parameter table that specifies utilization times, wait times, loop counts and adjustment cycles indexed to the respective utilization values that a user may select. The soaker tool may employ adjustment cycles to compensate for cumulative timing errors that may occur when running the tool for extended time periods. | 05-17-2012 |
20120124351 | APPARATUS AND METHOD FOR DYNAMICALLY DETERMINING EXECUTION MODE OF RECONFIGURABLE ARRAY - An apparatus and method for dynamically determining the execution mode of a reconfigurable array are provided. Performance information of a loop may be obtained before and/or during the execution of the loop. The performance information may be used to determine whether to operate the apparatus in a very long instruction word (VLIW) mode or in a coarse grained array (CGA) mode. | 05-17-2012 |
20120131316 | METHOD AND APPARATUS FOR IMPROVED SECURE COMPUTING AND COMMUNICATIONS - A method and apparatus are disclosed that may comprise applying compact markup notation to a general recursive computing system including hardware and software components, the compact markup notation defining things, places, paths, actions and causes within at least one of the hardware and the software of the general recursive computing system, to establish a set of data comprising a definitive description of the general recursive computing system in the compact notation; and synthesizing a self-aware and self-monitoring primitive recursive computing system utilizing the definitive description in the compact markup notation. | 05-24-2012 |
20120137111 | LOOP DETECTION APPARATUS, LOOP DETECTION METHOD, AND LOOP DETECTION PROGRAM - A loop detection method, system, and article of manufacture for determining, by means of computational processing performed by a computer, whether a sequence of unit processes continuously executed among unit processes in a program is a loop. The method includes: reading address information on the sequence of unit processes; comparing an address of a unit process as a loop starting point candidate with an address of a last unit process in the sequence of unit processes; reading call stack information on the sequence of unit processes; comparing a call stack upon execution of the unit process as the loop starting point candidate with a call stack upon execution of the last unit process; and outputting a determination result indicating that the sequence of unit processes forms a loop if the respective comparisons of the addresses and the call stacks both match. | 05-31-2012 |
20120226894 | PROCESSOR, AND METHOD OF LOOP COUNT CONTROL BY PROCESSOR - The present invention provides a processor comprising: a loop counter that is reset to 0 when a loop instruction for executing a process in a loop from a loop start address to a loop end address is issued; a data memory that receives, from outside, data that is used for executing a process in the loop; a calculator that uses the data transferred to said data memory to execute the process in the loop; a data counter that increments said loop counter by 1 every time a certain amount of data is transferred from outside to said data memory; and a loop controller that decrements said loop counter by 1 and causes said calculator to execute the process in the loop when the loop count value of said loop counter is not 0. | 09-06-2012 |
20120239914 | MULTITHREADED PARALLEL EXECUTION DEVICE, BROADCAST STREAM PLAYBACK DEVICE, BROADCAST STREAM STORAGE DEVICE, STORED STREAM PLAYBACK DEVICE, STORED STREAM RE-ENCODING DEVICE, INTEGRATED CIRCUIT, MULTITHREADED PARALLEL EXECUTION METHOD, AND MULTITHREADED COMPILER - When a temporary data storage unit | 09-20-2012 |
20130124839 | APPARATUS AND METHOD FOR EXECUTING EXTERNAL OPERATIONS IN PROLOGUE OR EPILOGUE OF A SOFTWARE-PIPELINED LOOP - A technology for executing an external operation from a software-pipelined loop is provided. Code performance efficiency can be improved by overlapping the execution of the external operations of the loop and the iterations of the loop. | 05-16-2013 |
20130297919 | EFFICIENT IMPLEMENTATION OF RSA USING GPU/CPU ARCHITECTURE - Various embodiments are directed to a heterogeneous processor architecture comprising a CPU and a GPU on the same processor die. The heterogeneous processor architecture may optimize source code in a GPU compiler using vector strip mining, which reduces instructions of arbitrary vector lengths into GPU-supported vector lengths, and loop peeling. It may first be determined that the source code is eligible for optimization if more than one machine code instruction of the compiled source code under-utilizes GPU instruction bandwidth limitations. The initial vector strip mining results may be discarded and the first iteration of the inner loop body may be peeled out of the loop. The type of the operands in the source code may be lowered, and the peeled-out inner loop body of the source code may be vector strip mined again to obtain optimized source code. | 11-07-2013 |
20130339699 | LOOP BUFFER PACKING - Methods, apparatuses, and processors for packing multiple iterations of a loop in a loop buffer. A loop candidate that meets the criteria for buffering is detected in the instruction stream being executed by a processor. When the loop is being written to the loop buffer and the end of the loop is detected, another iteration of the loop is written to the loop buffer if the loop buffer is not yet halfway full. In this way, short loops are written to the loop buffer multiple times to maximize the instruction operations per cycle throughput out of the loop buffer when the processor is in loop buffer mode. | 12-19-2013 |
20130339700 | LOOP BUFFER LEARNING - Methods, apparatuses, and processors for tracking loop candidates in an instruction stream. A loop buffer control unit detects a backwards taken branch and starts tracking the loop candidate. The control unit tracks taken branches of the loop candidate, and keeps track of the distance to each taken branch from the start of the loop. If the distance to each taken branch stays the same over multiple iterations of the loop, then the loop is stored in a loop buffer. The loop is then dispatched from the loop buffer, and the front-end of the processor is powered down until the loop terminates. | 12-19-2013 |
20140082340 | SYSTEM AND METHOD FOR CONTROLLING PROCESSOR INSTRUCTION EXECUTION - A system and method for controlling processor instruction execution. In one example, a method for controlling a total number of instructions executed by a processor includes instructing the processor to iteratively execute instructions over multiple iterations until a predetermined time period has elapsed. The number of instructions executed in each iteration is less than the number of instructions executed in a prior iteration. The method also includes determining the total number of instructions executed during the predetermined time period. | 03-20-2014 |
20140095850 | LOOP VECTORIZATION METHODS AND APPARATUS - Loop vectorization methods and apparatus are disclosed. An example method includes generating a first control mask for a set of iterations of a loop by evaluating a condition of the loop, wherein generating the first control mask includes setting a bit of the control mask to a first value when the condition indicates that an operation of the loop is to be executed, and setting the bit of the first control mask to a second value when the condition indicates that the operation of the loop is to be bypassed. The example method also includes compressing indexes corresponding to the first set of iterations of the loop according to the first control mask. | 04-03-2014 |
20140115305 | METHOD FOR INCREASING THE SPEED OF SPECULATIVE EXECUTION - A method for increasing the speed of execution by a processor including the steps of selecting a sequence of instructions to optimize, optimizing the sequence of instructions, creating a duplicate of instructions from the sequence of instructions which has been selected to optimize, executing the optimized sequence of instructions, and responding to an error during the execution of the optimized sequence of instructions by rolling back to the duplicate of instructions from the sequence of instructions. | 04-24-2014 |
20140136822 | EXECUTION OF INSTRUCTION LOOPS USING AN INSTRUCTION BUFFER - In a normal, non-loop mode a uOp buffer receives and stores for dispatch the uOps generated by a decode stage based on a received instruction sequence. In response to detecting a loop in the instruction sequence, the uOp buffer is placed into a loop mode whereby, after the uOps associated with the loop have been stored at the uOp buffer, storage of further uOps at the buffer is suspended. To execute the loop, the uOp buffer repeatedly dispatches the uOps associated with the loop's instructions until the end condition of the loop is met and the uOp buffer exits the loop mode. | 05-15-2014 |
20140189331 | SYSTEM OF IMPROVED LOOP DETECTION AND EXECUTION - A method may include identifying loop information corresponding to a plurality of loop instructions. The loop instructions are stored into a queue. The loop instructions are replayed from the queue for execution. Loop iterations are counted based on the identified loop information. It is determined whether the last iteration of the loop is done. If the last iteration is not done, the loop instructions continue to be replayed until the last iteration is done. | 07-03-2014 |
20140201510 | SYSTEM, APPARATUS AND METHOD FOR GENERATING A LOOP ALIGNMENT COUNT OR A LOOP ALIGNMENT MASK - A loop alignment instruction indicates a base address of an array as a first operand, an iteration limit of a loop as a second operand, and a destination. The loop contains iterations and each iteration includes a data element of the array. A processor receives the loop alignment instruction, decodes the instruction for execution, and stores a result of the execution in the destination. The result indicates the number of data elements at a beginning of the array that are to be handled separately from a remaining portion of the array, such that the base address of the remaining portion of the array aligns with an alignment width. | 07-17-2014 |
20140208085 | INSTRUCTION AND LOGIC TO EFFICIENTLY MONITOR LOOP TRIP COUNT - Logic and instruction to efficiently monitor loop trip count. Loop trip count information of a loop may be stored in a dedicated hardware buffer. Average loop trip count of the loop may be calculated based on the stored loop trip count information. Based on the average trip count, loop optimizations may be applied or removed from the loop. The stored loop trip count information may include an identifier identifying the loop, a total loop trip count of the loop, and an exit count of the loop. | 07-24-2014 |
20140297997 | AUTOMATED COOPERATIVE CONCURRENCY WITH MINIMAL SYNTAX - Various embodiments are generally directed to techniques for reducing syntax requirements in application code to cause concurrent execution of multiple iterations of at least a portion of a loop thereof to reduce overall execution time in solving a large scale problem. At least one non-transitory machine-readable storage medium includes instructions that when executed by a computing device, cause the computing device to parse an application code to identify a loop instruction indicative of an instruction block that includes instructions that define a loop of which multiple iterations are capable of concurrent execution, the instructions including at least one call instruction to an executable routine capable of concurrent execution; and insert at least one coordinating instruction into an instruction sub-block of the instruction block to cause sequential execution of instructions of the instruction sub-block across the multiple iterations based on identification of the loop instruction. Other embodiments are described and claimed. | 10-02-2014 |
20140337606 | DIGITAL SIGNAL PROCESSOR, PROGRAM CONTROL METHOD, AND CONTROL PROGRAM - When the branch condition of a branch command for a loop process is satisfied and the processor enters loop mode, the relative branch address is saved in a branch relative address save circuit that points to the branch command for loop processing, and a loop state flag is set in a loop state save circuit. When the loop state flag is set, if the absolute value of the value outputted by a command code counter circuit matches the absolute value of the relative branch address outputted by the branch relative address save circuit, a program counter sum value switching circuit outputs the relative branch address to a program counter adder. If the absolute values do not match, the program counter sum value switching circuit outputs the value ‘1’ to the program counter adder. With this, the branch penalty during loop processing is eliminated even with little hardware. | 11-13-2014 |
20150106603 | METHOD AND APPARATUS OF INSTRUCTION SCHEDULING USING SOFTWARE PIPELINING - A modulo scheduling method including calculating at least two candidate initiation intervals between adjacent iterations, searching for schedules of the instructions in parallel by using the candidate initiation intervals, and selecting a schedule determined to be valid from among the searched schedules. | 04-16-2015 |
20150121051 | KERNEL FUNCTIONALITY CHECKER - A debugging system and method, referred to as a kernel functionality checker, is described for enabling debugging of software written for device-specific APIs (application program interfaces) without requiring support or changes in the software driver or hardware. Specific example embodiments are described for OpenCL, but the disclosed methods may also be used to enable debugging capabilities for other device-specific APIs such as DirectX® and OpenGL®. | 04-30-2015 |
20150149747 | METHOD OF SCHEDULING LOOPS FOR PROCESSOR HAVING A PLURALITY OF FUNCTIONAL UNITS - Provided is a loop scheduling method including scheduling a first loop using execution units, and scheduling a second loop using execution units that remain available as a result of the scheduling of the first loop. An n-th loop (n>2) may likewise be scheduled using the result of scheduling the (n−1)-th loop. The first loop may be a higher-priority loop than the second loop. | 05-28-2015 |
20150309795 | ZERO OVERHEAD LOOP - A method and apparatus for zero overhead loops is provided herein. The method includes the steps of identifying, by a decoder, a loop instruction and identifying, by the decoder, a last instruction in a loop body that corresponds to the loop instruction. The method further includes the steps of generating, by the decoder, a branch instruction that returns execution to a beginning of the loop body, and enqueuing, by the decoder, the branch instruction into a branch reservation queue concurrently with the enqueuing of the last instruction in a reservation queue. | 10-29-2015 |
20160011871 | Computer Processor Employing Explicit Operations That Support Execution of Software Pipelined Loops and a Compiler That Utilizes Such Operations for Scheduling Software Pipelined Loops | 01-14-2016 |
20160019060 | ENFORCING LOOP-CARRIED DEPENDENCY (LCD) DURING DATAFLOW EXECUTION OF LOOP INSTRUCTIONS BY OUT-OF-ORDER PROCESSORS (OOPs), AND RELATED CIRCUITS, METHODS, AND COMPUTER-READABLE MEDIA - Enforcing loop-carried dependency (LCD) during dataflow execution of loop instructions by out-of-order processors (OOPs), and related circuits, methods, and computer-readable media, is disclosed. In one aspect, a reservation station circuit is provided, comprising one or more reservation station segments configured to store a consumer loop instruction. Each reservation station segment also includes an operand buffer for each operand of the consumer loop instruction, the operand buffer indicating a producer loop instruction and an LCD distance between the producer loop instruction and the consumer loop instruction. Each reservation station segment receives an execution result of the producer loop instruction, and a loop iteration indicator that indicates a current loop iteration for the producer loop instruction. The reservation station segment generates an operand buffer index based on the loop iteration indicator of the producer loop instruction and the LCD offset indicator of the operand buffer corresponding to the execution result. | 01-21-2016 |
20160019061 | MANAGING DATAFLOW EXECUTION OF LOOP INSTRUCTIONS BY OUT-OF-ORDER PROCESSORS (OOPs), AND RELATED CIRCUITS, METHODS, AND COMPUTER-READABLE MEDIA - Managing dataflow execution of loop instructions by out-of-order processors (OOPs), and related circuits, methods, and computer-readable media are disclosed. In one aspect, a reservation station circuit is provided. The reservation station circuit includes multiple reservation station segments, each storing a loop instruction of a loop of a computer program. Each reservation station segment also stores an instruction execution credit indicating whether the corresponding loop instruction may be provided for dataflow execution. The reservation station circuit further includes a dataflow monitor that distributes an initial instruction execution credit to each reservation station segment. As each loop iteration is executed, each reservation station segment determines whether the instruction execution credit indicates that the loop instruction for the reservation station segment may be provided for dataflow execution. If so, the reservation station segment provides the loop instruction for dataflow execution, and adjusts the instruction execution credit for the reservation station segment. | 01-21-2016 |
20160092230 | LOOP PREDICTOR-DIRECTED LOOP BUFFER - A loop predictor trains a branch instruction to determine a trained loop count of a loop. When the loop fits in an instruction buffer, the processor stops fetching from an instruction cache, sends the loop instructions to an execution engine from the buffer without fetching from the cache, maintains a loop pop count of times the branch is sent to the execution engine from the buffer, and predicts the branch instruction is taken when the loop pop count is less than the trained loop count and otherwise predicts not taken. | 03-31-2016 |
20160110195 | COORDINATED START INTERPRETIVE EXECUTION EXIT FOR A MULTITHREADED PROCESSOR - A system and method of executing a plurality of threads, including a first thread and a set of remaining threads, on a computer processor core. The system and method include determining that a start interpretive execution exit condition exists; determining that the computer processor core is within a grace period; and entering, by the first thread, a start interpretive execution exit sync loop without signaling any of the set of remaining threads. In turn, the first thread remains in the start interpretive execution exit sync loop until the grace period expires or each of the remaining threads enters a corresponding start interpretive execution exit sync loop. | 04-21-2016 |
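Several of the entries above (for example 20090327674, 20150309795, 20130339699, and 20160092230) revolve around the same idea: once a short loop has been captured in a buffer and its iteration count is programmed or trained, dedicated counter hardware repeats the buffered body without refetching instructions or spending an execution slot on an explicit compare-and-branch each iteration. The C sketch below models that behavior at a very high level, as a software stand-in for the hardware counter; the names (hw_loop_t, hw_loop_issue) are illustrative only and are not drawn from any of the listed applications.

```c
/* Hypothetical C model of the zero-overhead / buffered-loop idea that recurs
 * in the abstracts above: a dedicated counter decides whether the buffered
 * loop body is replayed, so the program itself carries no per-iteration
 * compare-and-branch. Names are illustrative, not from any listed patent. */
#include <stdio.h>

typedef struct {
    unsigned trained_count;   /* iterations programmed or trained for the loop */
    unsigned pop_count;       /* iterations issued so far from the loop buffer */
} hw_loop_t;

/* "Issue" one pass over the buffered loop body; return 1 while the counter
 * says the backwards branch is taken, 0 once the loop is exhausted. */
static int hw_loop_issue(hw_loop_t *lp, void (*body)(unsigned))
{
    if (lp->pop_count >= lp->trained_count)
        return 0;                 /* predict not-taken: fall out of the loop */
    body(lp->pop_count);          /* replay the buffered body, no refetch */
    lp->pop_count++;
    return 1;                     /* predict taken: go around again */
}

static void body(unsigned i) { printf("iteration %u\n", i); }

int main(void)
{
    hw_loop_t lp = { .trained_count = 4, .pop_count = 0 };
    while (hw_loop_issue(&lp, body))
        ;                         /* the "branch" is decided by the counter */
    return 0;
}
```

In real hardware the decision belongs to the loop counter or loop predictor rather than to software, which is what lets the front end power down or stop fetching while the buffered loop runs.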