Patent application number | Description | Published |
20120124337 | Size mis-match hazard detection - An out-of-order processor | 05-17-2012 |
20120124346 | Decoding conditional program instructions - A processor | 05-17-2012 |
20130275720 | ZERO CYCLE MOVE - A system and method for reducing the latency of data move operations. A register rename unit within a processor determines whether a decoded move instruction is eligible for a zero cycle move operation. If so, control logic assigns a physical register identifier associated with a source operand of the move instruction to the destination operand of the move instruction. Additionally, the register rename unit marks the given move instruction to prevent it from proceeding in the processor pipeline. Further maintenance of the particular physical register identifier may be done by the register rename unit during commit of the given move instruction. | 10-17-2013 |
20130290680 | OPTIMIZING REGISTER INITIALIZATION OPERATIONS - A system and method for efficiently reducing the latency of initializing registers. A register rename unit within a processor determines whether it is known, prior to an execution pipeline stage, that a given decoded instruction writes a particular numerical value to its destination operand. An example is a move immediate instruction that writes a value of 0 in its destination operand. Other examples may also qualify. If the determination is made, a given physical register identifier is assigned to the destination operand, wherein the given physical register identifier is associated with the particular numerical value, but it is not associated with an actual physical register in a physical register file. The given instruction is marked to prevent it from proceeding to an execution pipeline stage. When the given physical register identifier is used to read the physical register file, no actual physical register is accessed. | 10-31-2013 |
20130290681 | REGISTER FILE POWER SAVINGS - A system and method for efficiently reducing the power consumption of register file accesses. A processor is operable to execute instructions with two or more data types, each with an associated size and alignment. Data operands for a first data type use operand sizes equal to an entire width of a physical register within a physical register file. Data operands for a second data type use operand sizes less than an entire width of a physical register. Accesses of the physical register file for operands associated with a non-full-width data type do not access a full width of the physical registers. A given numerical value may be bypassed for the portion of the physical register that is not accessed. | 10-31-2013 |
20130339699 | LOOP BUFFER PACKING - Methods, apparatuses, and processors for packing multiple iterations of a loop in a loop buffer. A loop candidate that meets the criteria for buffering is detected in the instruction stream being executed by a processor. When the loop is being written to the loop buffer and the end of the loop is detected, another iteration of the loop is written to the loop buffer if the loop buffer is not yet halfway full. In this way, short loops are written to the loop buffer multiple times to maximize the instruction operations per cycle throughput out of the loop buffer when the processor is in loop buffer mode. | 12-19-2013 |
20130339700 | LOOP BUFFER LEARNING - Methods, apparatuses, and processors for tracking loop candidates in an instruction stream. A loop buffer control unit detects a backwards taken branch and starts tracking the loop candidate. The control unit tracks taken branches of the loop candidate, and keeps track of the distance to each taken branch from the start of the loop. If the distance to each taken branch stays the same over multiple iterations of the loop, then the loop is stored in a loop buffer. The loop is then dispatched from the loop buffer, and the front-end of the processor is powered down until the loop terminates. | 12-19-2013 |
20140075156 | FETCH WIDTH PREDICTOR - Various techniques for predicting instruction fetch widths. In one embodiment, a fetch prediction unit in a processor is configured to generate a fetch width that specifies a number of bits to be retrieved in a subsequent fetch from an instruction cache. The fetch prediction unit may also generate a fetch prediction that includes the fetch width in response to a current fetch request. A number of bits corresponding to the fetch width may be fetched from the instruction cache. The fetch width may correspond to a location of a predicted-taken control transfer instruction. This fetch width prediction may lead to power savings in instruction cache accesses. | 03-13-2014 |
20140195789 | Usefulness Indication For Indirect Branch Prediction Training - A circuit for implementing a branch target buffer. The branch target buffer may include a memory that stores a plurality of entries. Each entry may include a tag value, a target value, and a prediction accuracy value. A received index value corresponding to an indirect branch instruction may be used to select entries of the plurality of entries, and a received tag value may then be compared to the tag value of the selected entries in the memory. An entry in the memory may be selected in response to a determination that the received tag does not match the tag value of the compared entries. The selected entry may be allocated to the indirect branch instruction dependent upon the prediction accuracy values of the plurality of entries. | 07-10-2014 |
20140208073 | Arithmetic Branch Fusion - A processor and method for fusing together an arithmetic instruction and a branch instruction. The processor includes an instruction fetch unit configured to fetch instructions. The processor may also include an instruction decode unit that may be configured to decode the fetched instructions into micro-operations for execution by an execution unit. The decode unit may be configured to detect an occurrence of an arithmetic instruction followed by a branch instruction in program order, wherein the branch instruction, upon execution, changes a program flow of control dependent upon a result of execution of the arithmetic instruction. In addition, the processor may further be configured to fuse together the arithmetic instruction and the branch instruction such that a single micro-operation is formed. The single micro-operation includes execution information based upon both the arithmetic instruction and the branch instruction. | 07-24-2014 |
20140215188 | Multi-Level Dispatch for a Superscalar Processor - In an embodiment, a processor includes a multi-level dispatch circuit configured to supply operations for execution by multiple parallel execution pipelines. The multi-level dispatch circuit may include multiple dispatch buffers, each of which is coupled to multiple reservation stations. Each reservation station may be coupled to a respective execution pipeline and may be configured to schedule instruction operations (ops) for execution in the respective execution pipeline. The sets of reservation stations coupled to each dispatch buffer may be non-overlapping. Thus, if a given op is to be executed in a given execution pipeline, the op may be sent to the dispatch buffer which is coupled to the reservation station that provides ops to the given execution pipeline. | 07-31-2014 |
20140244976 | IT INSTRUCTION PRE-DECODE - Various techniques for processing and pre-decoding branches within an IT instruction block. Instructions are fetched and cached in an instruction cache, and pre-decode bits are generated to indicate the presence of an IT instruction and the likely boundaries of the IT instruction block. If an unconditional branch is detected within the likely boundaries of an IT instruction block, the unconditional branch is treated as if it were a conditional branch. The unconditional branch is sent to the branch direction predictor and the predictor generates a branch direction prediction for the unconditional branch. | 08-28-2014 |
20140337605 | Mechanism for Reducing Cache Power Consumption Using Cache Way Prediction - A mechanism for reducing power consumption of a cache memory of a processor includes a processor with a cache memory that stores instruction information for one or more instruction fetch groups fetched from a system memory. The cache memory may include a number of ways that are each independently controllable. The processor also includes a way prediction unit. The way prediction unit may enable, in a next execution cycle, a given way within which instruction information corresponding to a target of a next branch instruction is stored in response to a branch taken prediction for the next branch instruction. The way prediction unit may also, in response to the branch taken prediction for the next branch instruction, enable, one at a time, each corresponding way within which instruction information corresponding to respective sequential instruction fetch groups that follow the next branch instruction is stored. | 11-13-2014 |
20140344558 | NEXT FETCH PREDICTOR RETURN ADDRESS STACK - A system and method for efficient branch prediction. A processor includes a next fetch predictor to generate a fast branch prediction for branch instructions at an early pipeline stage. The processor also includes a main return address stack (RAS) at a later pipeline stage for predicting the target of return instructions. When a return instruction is encountered, the prediction from the next fetch predictor is replaced by the top of the main RAS. If there are any recent call or return instructions in flight toward the main RAS, then a separate prediction is generated by a mini-RAS. | 11-20-2014 |
20150039860 | RDA CHECKPOINT OPTIMIZATION - A system and method for efficiently performing microarchitectural checkpointing. A register rename unit within a processor determines whether a physical register number qualifies to have duplicate mappings. Information for maintenance of the duplicate mappings is stored in a register duplicate array (RDA). To reduce the penalty for misspeculation or exception recovery, control logic in the processor supports multiple checkpoints. The RDA is one of multiple data structures to have checkpoint copies of state. The RDA utilizes a content addressable memory (CAM) to store physical register numbers. The duplicate counts for both the current state and the checkpoint copies for a given physical register number are updated when instructions utilizing the given physical register number are retired. To reduce on-die real estate and power consumption, a single CAM entry stores the physical register number and the other fields are stored in separate storage elements. | 02-05-2015 |
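
The ZERO CYCLE MOVE (20130275720) and RDA CHECKPOINT OPTIMIZATION (20150039860) entries above both rely on several architectural registers mapping to one physical register, with a duplicate count deciding when that register can be freed. The following is a minimal Python sketch of that bookkeeping, not the method claimed in either application. The class `RenameUnit`, its methods, and the register names are hypothetical. It also assumes, for brevity, that a mapping is released at rename time, whereas the abstracts describe maintenance at commit or retirement, and it omits checkpoints entirely.

```python
class RenameUnit:
    """Toy rename unit with zero cycle moves (illustrative, hypothetical names)."""

    def __init__(self, num_physical):
        self.free_list = list(range(num_physical))  # free physical register ids
        self.rename_map = {}  # architectural register -> physical register id
        self.dup_count = {}   # physical register id -> number of mappings to it

    def _release(self, arch):
        # Drop the old mapping of `arch`; free the physical register only
        # when no other architectural register still maps to it.
        old = self.rename_map.pop(arch, None)
        if old is None:
            return
        self.dup_count[old] -= 1
        if self.dup_count[old] == 0:
            del self.dup_count[old]
            self.free_list.append(old)

    def write(self, dst):
        """Rename an ordinary instruction that produces a new value in dst."""
        self._release(dst)
        phys = self.free_list.pop(0)
        self.rename_map[dst] = phys
        self.dup_count[phys] = 1
        return phys

    def move(self, dst, src):
        """Rename `mov dst, src` by sharing src's physical register id."""
        phys = self.rename_map[src]
        if self.rename_map.get(dst) == phys:
            return phys  # dst already shares the register; nothing to do
        self._release(dst)
        self.rename_map[dst] = phys
        self.dup_count[phys] += 1  # one more mapping to the same register
        return phys


unit = RenameUnit(num_physical=4)
p_r1 = unit.write("r1")       # r1 gets a fresh physical register
p_r2 = unit.move("r2", "r1")  # r2 shares it; no execution needed in this model
unit.write("r1")              # r1 is overwritten; the shared register stays live
unit.write("r2")              # last mapping gone; the shared register is freed
```

In this sketch the move allocates no new physical register, and the shared register returns to the free list only after both `r1` and `r2` have been overwritten.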