Entries |
Document | Title | Date |
20080209162 | PROCESSOR AND PROGRAM EXECUTION METHOD CAPABLE OF EFFICIENT PROGRAM EXECUTION - A processor for sequentially executing a plurality of programs using a plurality of register value groups stored in a memory that correspond one-to-one with the programs. The processor includes a plurality of register groups; a select/switch unit operable to select one of the plurality of register groups as an execution target register group on which a program execution is based, and to switch the selection target every time a first predetermined period elapses; a restoring unit operable to restore, every time the switching is performed, one of the register value groups into one of the register groups that is not selected as the execution target register group; a saving unit operable to save, prior to the restoring, register values in the register group targeted for restoring, by overwriting a register value group in the memory that corresponds to the register values; and a program execution unit operable to execute, every time the switching is performed, a program corresponding to a register value group in the execution target register group. | 08-28-2008 |
20080244220 | Filter and Method For Filtering - A filter and method of filtering modifies the computation order to accommodate horizontal symmetric filtering, and modifies the source operands while modifying the SIMD computation, so as to eliminate such heavy overhead of transposing a pixel matrix. The filter and method of filtering reformats the equations involved in the prior art to the following equations, thereby acquiring the interpolation results by reducing the required clock cycles to three cycles: | 10-02-2008 |
20080256329 | Multi-Magnitudinal Vectors with Resolution Based on Source Vector Features - Methods, systems and computer program products for resolving multiple magnitudes assigned to a target vector are disclosed. A target vector that includes one or more target vector dimensions is received. One of the target vector dimensions is processed to determine a total number of magnitudes assigned to the processed target vector dimension. Also, a source vector that includes one or more source vector dimensions is received. The received source vector is processed to determine a total number of features associated with the source vector. When it is detected that the total number of magnitudes assigned to the processed target vector dimension exceeds one, one of the assigned magnitudes is selected based on one of the determined features associated with the source vector. | 10-16-2008 |
20080288745 | GENERATING PREDICATE VALUES DURING VECTOR PROCESSING - A method for performing parallel operations in a computer system when one or more memory hazards may be present, which may be implemented by a processor, is described. During operation, the processor receives instructions for detecting conflict between memory addresses in vectors when operations are performed in parallel using at least a portion of the vectors, and generating one or more predicate values corresponding to any detected conflict between the memory addresses, where a given predicate value indicates elements in at least the portion of the vector that can be processed in parallel. Next, the processor executes the instructions for detecting the conflict between the memory addresses and generating the one or more predicate values. | 11-20-2008 |
20080301401 | Processor - A processor includes: a plurality of registers; an instruction readout circuit configured to read out an instruction from a memory; an instruction generation circuit configured to generate instructions for saving data into a predetermined storage area, for the respective registers, if the instruction read out by the instruction readout circuit is an instruction causing the data stored in each of the plurality of registers to be saved; and an instruction execution circuit configured to execute the instruction read out from the memory and the instructions generated by the instruction generation circuit. | 12-04-2008 |
20090077345 | SIMD DOT PRODUCT OPERATIONS WITH OVERLAPPED OPERANDS - A data processing system includes a plurality of general purpose registers, and processor circuitry for executing one or more instructions, including a vector dot product instruction for simultaneously performing at least two dot products. The vector dot product instruction identifies a first and second source register, each for storing a plurality of vector elements, where a first dot product is to be performed between a first subset of vector elements of the first source register and a first subset of vector elements of the second source register, and a second dot product is to be performed between a second subset of vector elements of the first source register and a second subset of vector elements of the second source register. The first and second subsets of the second source register are different and at least two vector elements of the first and second subsets of the second source register overlap. | 03-19-2009 |
20090100247 | SIMD PERMUTATIONS WITH EXTENDED RANGE IN A DATA PROCESSOR - A processor in a data processing system executes a permutation instruction which identifies a first source register, at least one other source register, and a destination register. The first source register stores at least one in-range index value for the at least one other source register and at least one out-of-range index value for the at least one other source register. The at least one other source register stores a plurality of vector element values, wherein each in-range index value indicates which vector element value of the at least one other source register is to be stored into a corresponding vector element of the destination register. Each out-of-range index value is used to indicate which one of at least two predetermined constant values is to be stored into a corresponding vector element of the destination register. Partial table lookups using a permutation instruction shortens the time required to retrieve data. | 04-16-2009 |
20090106526 | Scalar Float Register Overlay on Vector Register File for Efficient Register Allocation and Scalar Float and Vector Register Sharing - Embodiments of the invention are generally related to image processing, and more specifically to register files for supporting image processing. An integrated register file is also provided for storing vector and scalar data. Therefore, the transfer of data to memory to exchange data between independent vector and scalar units is obviated. | 04-23-2009 |
20090113168 | Software Pipelining Using One or More Vector Registers - A method for managing multiple values assigned to a variable during various stages of a software pipelined process executed in a computing environment. The method comprises allocating two or more slots in a vector register to two or more values associated with said variable during two or more stages of a pipeline process; and rotating values in each slot responsive to an instruction. | 04-30-2009 |
20090150648 | Vector Permute and Vector Register File Write Mask Instruction Variant State Extension for RISC Length Vector Instructions - Embodiments of the invention generally relate to the field of image processing, and more specifically to instructions and hardware for supporting image processing. An integrated processing unit configured to process vector instructions and vector permute instructions is provided. A vector permute instruction may be issued to the integrated processing unit to set controls of one or more multiplexers so that the multiplexers rearrange the results of a subsequent vector instruction. | 06-11-2009 |
20090150649 | CAPACITY REGISTER FILE - An apparatus for storing X-bit digitized data, the register file comprising: a plurality of registers each register configured for storing X bits, wherein each register is partitioned into Y sub-registers such that each sub-register stores at least X/Y bits, and wherein at least one extra X/Y-bit sub-register is incorporated in each register to provide redundancy in the number of sub-registers for a total of at least Y+1 sub-registers per register, so that if a first sub-register in a first register includes faulty bits, data destined for storage in the first sub-register is stored in a second sub-register, in the first register, that does not include faulty bits. | 06-11-2009 |
20090172348 | METHODS, APPARATUS, AND INSTRUCTIONS FOR PROCESSING VECTOR DATA - A computer processor includes control logic for executing LoadUnpack and PackStore instructions. In one embodiment, the processor includes a vector register and a mask register. In response to a PackStore instruction with an argument specifying a memory location, a circuit in the processor copies unmasked vector elements from the vector register to consecutive memory locations, starting at the specified memory location, without copying masked vector elements. In response to a LoadUnpack instruction, the circuit copies data items from consecutive memory locations, starting at an identified memory location, into unmasked vector elements of the vector register, without copying data to masked vector elements. Other embodiments are described and claimed. | 07-02-2009 |
20090172349 | METHODS, APPARATUS, AND INSTRUCTIONS FOR CONVERTING VECTOR DATA - A computer processor includes a decoder for decoding machine instructions and an execution unit for executing those instructions. The decoder and the execution unit are capable of decoding and executing vector instructions that include one or more format conversion indicators. For instance, the processor may be capable of executing a vector-load-convert-and-write (VLoadConWr) instruction that provides for loading data from memory to a vector register. The VLoadConWr instruction may include a format conversion indicator to indicate that the data from memory should be converted from a first format to a second format before the data is loaded into the vector register. Other embodiments are described and claimed. | 07-02-2009 |
20090172350 | Non-volatile processor register - A processor using a vertically configured non-volatile memory array that can retain values through a power failure is disclosed. The processor may include a register block configured to store and retrieve one or more values, the register block being a vertically configured non-volatile memory array, an arithmetic block configured to perform an arithmetic operation on the one or more values, and a control block configured to control the register block, the arithmetic block, and a memory block. The vertically configured non-volatile memory array may include a plurality of two-terminal memory elements. The two-terminal memory elements may be resistivity-sensitive and store data in the absence of power. The two-terminal memory elements store data as plurality of conductivity profiles that can be non-destructively read by applying a read voltage across the terminals of the memory element and data can be written by applying a write voltage across the terminals. | 07-02-2009 |
20090249026 | Vector instructions to enable efficient synchronization and parallel reduction operations - In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to be executed under a first mask and a second vector instruction to be executed under a second mask. Other embodiments are described and claimed. | 10-01-2009 |
20090259822 | STREAMING DIGITAL DATA FILTER - A method of filtering streaming digital data in real time. The method including: (a) initializing and storing a set of m data elements and an associated set of m pointer data from 1 to m in sequence, m an integer greater than 2; (b) receiving in real time a first or next data element of a digital data stream of sequential data elements; (c) simultaneously with (b), replacing a stored data element associated with the pointer datum m with the received data element, changing the pointer datum of m to 1, and incrementing the value of all other pointer data by 1; (d) simultaneously with (b) sorting in order from a low to high all stored data elements; (e) simultaneously with (b), maintaining the association of pointer datum and data elements; (f) simultaneously with (b), filtering all stored data elements; and (g) repeating (b) through (f) multiple times. | 10-15-2009 |
20090259823 | CIRCUIT AND DESIGN STRUCTURE FOR A STREAMING DIGITAL DATA FILTER - A circuit and design structure for a streaming digital data filter embodied in a machine readable medium, the design structure including: a data processing unit and a pointer processing unit, the data processing unit and the pointer unit connected to a control logic; the pointer processing unit consisting of n serially connected pointer processing stages from a first to a last pointer processing stage, each pointer processing stage except for the first and last processing stages of the pointer processing unit including a pointer register and a multiplexer, wherein n is a positive integer greater than 2; the data processing unit consisting of n serially connected data processing stages from a first data processing stage to a last data processing stage, each data processing stage including a multiplexer, a data register and a comparator; and one or more filter output stages connected to the data processing unit. | 10-15-2009 |
20090276606 | METHOD AND SYSTEM FOR PARALLEL HISTOGRAM CALCULATION IN A SIMD AND VLIW PROCESSOR - The present invention provides histogram calculation for images and video applications using a SIMD and VLIW processor with vector Look-Up Table (LUT) operations. This provides a speed up of histogram calculation by a factor of N times over a scalar processor where the SIMD processor could perform N LUT operations per instruction. Histogram operation is partitioned into a vector LUT operation, followed by vector increment, vector LUT update, and at the end by reduction of vector histogram components. The present invention could be used for intensity, RGBA, YUV, and other type of multi-component images. | 11-05-2009 |
20100095087 | Dynamic Data Driven Alignment and Data Formatting in a Floating-Point SIMD Architecture - Mechanisms are provided for dynamic data driven alignment and data formatting in a floating point SIMD architecture. At least two operand inputs are input to a permute unit of a processor. Each operand input contains at least one floating point value upon which a permute operation is to be performed by the permute unit. A control vector input, having a plurality of floating point values that together constitute the control vector input, is input to the permute unit of the processor for controlling the permute operation of the permute unit. The permute unit performs a permute operation on the at least two operand inputs according to a permutation pattern specified by the plurality of floating point values that constitute the control vector input. Moreover, a result output of the permute operation is output from the permute unit to a result vector register of the processor. | 04-15-2010 |
20100106939 | TRANSFERRING DATA FROM INTEGER TO VECTOR REGISTERS - A method for transferring data from a general purpose register to a vector register, the method including splatting a byte of data directly from a general purpose register (GPR) to a vector register (VR) by means of vector permute instructions, and splatting another byte of data from the GPR to the VR and vectorially combining the data in the VR. | 04-29-2010 |
20100115232 | LARGE INTEGER SUPPORT IN VECTOR OPERATIONS - A vector processor or vector processing computer has a first vector register operable to store two or more vector elements that together comprise a single first large integer and a second vector register operable to store two or more vector elements that together comprise a single second large integer. An adder having a carry-in bit is operable to add the large integer in the first vector register to the large integer in the second vector register by using the carry-in bit to add sequential elements of the vector registers. | 05-06-2010 |
20100223444 | METHOD FOR PERFORMING PLURALITY OF BIT OPERATIONS AND A DEVICE HAVING PLURALITY OF BIT OPERATIONS CAPABILITIES - A method and a device having a plurality of bit operations capability, the device includes: a first and a second registers and an instruction fetch circuit, and an arithmetic logic unit adapted to: calculate, during a first clock cycle, a position value representative of a position, within a first information vector, of a first bit of information that has a first value; and to multiply the position value by a multiplication factor to provide a first result and to alter the value of the first bit to a second value to provide an updated information vector, during the first clock cycle. | 09-02-2010 |
20100332792 | Integrated Vector-Scalar Processor - Systems and methods for improved vector data processing based on separately processing elements of a vector in multiple simultaneously executing vector element processing units are disclosed. One embodiment of the present invention is a vector processing system including a plurality of vector element processing units and a routing infrastructure. The routing infrastructure is configured to route each element of a received vector to a respective one of the vector element processing units. The received vector may be from a memory which is coupled to the vector element processing units by the routing infrastructure. Each vector element processing unit is configured to simultaneously process two or more elements, wherein each of the two or more elements is from a separate vector. Embodiments of the present invention also provide for forwarding of data and results of computation between vector element processing units. | 12-30-2010 |
20110040951 | DEDUPLICATED DATA PROCESSING RATE CONTROL - Various embodiments for deduplicated data processing rate control using at least one processor device in a computing environment are provided. A plurality of workers is configured for parallel processing of deduplicated data entities in a plurality of chunks. The deduplicated data processing rate is regulated using a rate control mechanism. The rate control mechanism incorporates a debt/credit algorithm specifying which of the plurality of workers processing the deduplicated data entities must wait for each of a plurality of calculated required sleep times. The rate control mechanism is adapted to limit a data flow rate based on a penalty acquired during a last processing of one of the plurality of chunks in a retroactive manner, and further adapted to operate on at least one vector representation of at least one limit specification to accommodate a variety of available dimensions corresponding to the at least one limit specification. | 02-17-2011 |
20110072236 | Method for efficient and parallel color space conversion in a programmable processor - The present invention relates to an efficient implementation of color space conversion in a SIMD processor as part of converting output of video decompression to interface to a display unit. | 03-24-2011 |
20110087859 | SYSTEM CYCLE LOADING AND STORING OF MISALIGNED VECTOR ELEMENTS IN A SIMD PROCESSOR - The present invention provides efficient transfer of misaligned vector elements between a vector register file and data memory in a single clock cycle. One vector register of N elements can be loaded from memory with any memory element address alignment during a single clock cycle of the processor. Also, a partial segment of vector register elements can be loaded into a vector register in a single clock cycle with any element alignment from data memory. The present invention comprises properly partitioned multiple multi-port data memory modules in conjunction with a crossbar and address generation circuit. A preferred embodiment of the present invention uses a dual-issue processor containing both a RISC-type scalar processor and a vector/SIMD processor, whereby one scalar and one SIMD instruction are executed every clock cycle, and the RISC processor handles program flow control and also loading and storing of vector registers. | 04-14-2011 |
20110161624 | Floating Point Collect and Operate - Mechanisms are provided for performing a floating point collect and operate for a summation across a vector for a dot product operation. A routing network placed before the single instruction multiple data (SIMD) unit allows the SIMD unit to perform a summation across a vector with a single stage of adders. The routing network routes the vector elements to the adders in a first cycle. The SIMD unit stores the results of the adders into a results vector register. The routing network routes the summation results from the results vector register to the adders in a second cycle. The SIMD unit then stores the results from the second cycle in the results vector register. | 06-30-2011 |
20110276782 | RUNNING SUBTRACT AND RUNNING DIVIDE INSTRUCTIONS FOR PROCESSING VECTORS - The described embodiments provide a processor for generating a result vector with subtracted or mathematically divided values from a first input vector. During operation, the processor receives the first input vector, a second input vector, and a control vector, and optionally receives a predicate vector. The processor then records a value from an element at a key element position in the second input vector into a base value. Next, the processor generates a result vector. When generating the result vector, for each active element in the result vector to the right of the key element position, the processor is configured to set the element in the result vector equal to the base value minus a total of the values in each relevant element of the first input vector or to set the element in the result vector equal to the result of dividing the base value by a value in each relevant element of the first input vector, wherein the relevant elements include relevant elements from an element at the key element position to and including a predetermined element in the first input vector. | 11-10-2011 |
20120060015 | Vector Loads with Multiple Vector Elements from a Same Cache Line in a Scattered Load Operation - Mechanisms for performing a scattered load operation are provided. With these mechanisms, an extended address is received in a cache memory of a processor. The extended address has a plurality of data element address portions that specify a plurality of data elements to be accessed using the single extended address. Each of the plurality of data element address portions is provided to corresponding data element selector logic units of the cache memory. Each data element selector logic unit in the cache memory selects a corresponding data element from a cache line buffer based on a corresponding data element address portion provided to the data element selector logic unit. Each data element selector logic unit outputs the corresponding data element for use by the processor. | 03-08-2012 |
20120060016 | Vector Loads from Scattered Memory Locations - Mechanisms for performing a scattered load operation are provided. With these mechanisms, a gather instruction is receive in a logic unit of a processor, the gather instruction specifying a plurality of addresses in a memory from which data is to be loaded into a target vector register of the processor. A plurality of separate load instructions for loading the data from the plurality of addresses in the memory are automatically generated within the logic unit. The plurality of separate load instructions are sent, from the logic unit, to one or more load/store units of the processor. The data corresponding to the plurality of addresses is gathered in a buffer of the processor. The logic unit then writes data stored in the buffer to the target vector register. | 03-08-2012 |
20120185670 | SCALAR INTEGER INSTRUCTIONS CAPABLE OF EXECUTION WITH THREE REGISTERS - A processing core implemented on a semiconductor chip is described. The processing core includes logic circuitry to identify whether vector instructions and integer scalar instructions are to be executed with two registers or three registers, where, in the case of two registers input operand information is destroyed in one of two registers, and, in the case of three registers input operand is not destroyed. The processing core also includes steering circuitry coupled to the logic circuitry. The steering circuitry is to control first data paths between scalar integer execution units and a scalar integer register bank such that two registers are accessed from the scalar register bank if two register execution is identified for the scalar integer instructions or three registers are accessed from the scalar integer register bank if three register execution is identified for the scalar integer instructions. The steering circuitry is also to control second data paths between vector execution units and a vector register bank such that two registers are accessed from the vector register bank if two register execution is identified for the vector instructions or three registers are accessed from the vector register bank if three register execution is identified for the vector instructions. | 07-19-2012 |
20120260062 | SYSTEM AND METHOD FOR PROVIDING DYNAMIC ADDRESSABILITY OF DATA ELEMENTS IN A REGISTER FILE WITH SUBWORD PARALLELISM - A method and system for providing dynamic addressability of data elements in a vector register file with subword parallelism. The method includes the steps of: determining a plurality of data elements required for an instruction; storing an address for each of the data elements into a pointer register where the addresses are stored as a number of offsets from the vector register file's origin; reading the addresses from the pointer register; extracting the data elements located at the addresses from the vector register file; and placing the data elements in a subword slot of the vector register file so that the data elements are located on a single vector within the vector register file; where at least one of the steps is carried out using a computer device so that data elements in a vector register file with subword parallelism are dynamically addressable. | 10-11-2012 |
20130024653 | ACCELERATION OF STRING COMPARISONS USING VECTOR INSTRUCTIONS - A processor, method, and medium for using vector instructions to perform string comparisons. A single instruction compares the elements of two vectors and simultaneously checks for the null character. If an inequality or the null character is found, then the string comparison loop terminates, and a further check is performed to determine if the strings are equal. If all elements are equal and the null character is not found, then another iteration of the string comparison loop is executed. The vectors are loaded with the next portions of the strings, and then the next comparison is performed. The loop continues until either an inequality or the null character is found. | 01-24-2013 |
20130080737 | Interleaving data accesses issued in response to vector access instructions - A vector data access unit includes data access ordering circuitry, for issuing data access requests indicated by the elements to the data store, and configured in response to receipt of at least two decoded vector data access instructions, and one of the instructions being a write instruction. Data accesses are performed in the instructed order to determine an element indicating the next data access for each of said vector data access instructions. One of the next data accesses is selected to be issued to the data store in dependence upon an order in which the at least two vector data instructions were received. The position of the elements indicates the next data accesses relative to each other within their respective plurality of elements. A numerical position of the element indicating the next data access within the plurality of elements of an earlier instruction is less than a predetermined value. | 03-28-2013 |
20130091339 | SIMD Memory Circuit And Methodology To Support Upsampling, Downsampling And Transposition - An apparatus and method for creation of reordered vectors from sequential input data for block based decimation, filtering, interpolation and matrix transposition using a memory circuit for a Single Instruction, Multiple Data (SIMD) Digital Signal Processor (DSP). This memory circuit includes a two-dimensional storage array, a rotate-and-distribute unit, a read-controller and a write to controller, to map input vectors containing sequential data elements in columns of the two-dimensional array and extract reordered target vectors from this array. The data elements and memory configuration are received from the SIMD DSP. | 04-11-2013 |
20130212353 | System for implementing vector look-up table operations in a SIMD processor - The present invention incorporates a system for vector Look-Up Table (LUT) operations into a single-instruction multiple-data (SIMD) processor in order to implement plurality of LUT operations simultaneously, where each of the LUT contents could be the same or different. Elements of one or two vector registers are used to form LUT indexes, and the output of vector LUT operation is written into a vector register. No dedicated LUT memory is required; rather, data memory is organized as multiple separate data memory banks, where a portion of each data memory bank is used for LUT operations. For a single-input vector LUT operation, the address input of each LUT is operably coupled to any of the input vector register's elements using input vector element mapping logic in one embodiment. Thus, one input vector element can produce (a positive integer) N output elements using N different LUTs, or (another positive integer) K input vector elements can produce N output elements, where K is an integer from one to N. | 08-15-2013 |
20130212354 | Method for efficient data array sorting in a programmable processor - The present invention provides a method for performing data array sorting of vector elements in a N-wide SIMD that is accelerated by a factor of about N/2 over scalar implementation excluding scalar load/store instructions. A vector compare instruction with ability to compare any two vector elements in accordance to optimized data array sorting algorithms, followed by a vector-multiplex instruction which performs exchanges of vector elements in accordance with condition flags generated by the vector compare instruction provides an efficient but programmable method of performing data sorting with a factor of about N/2 acceleration. A mask bit prevents changes to elements which is not involved in a certain stage of sorting. | 08-15-2013 |
20130232318 | METHODS, APPARATUS, AND INSTRUCTIONS FOR CONVERTING VECTOR DATA - A computer processor includes a decoder for decoding machine instructions and an execution unit for executing those instructions. The decoder and the execution unit are capable of decoding and executing vector instructions that include one or more format conversion indicators. For instance, the processor may be capable of executing a vector-load-convert-and-write (VLoadConWr) instruction that provides for loading data from memory to a vector register. The VLoadConWr instruction may include a format conversion indicator to indicate that the data from memory should be converted from a first format to a second format before the data is loaded into the vector register. Other embodiments are described and claimed. | 09-05-2013 |
20130238877 | CORE SYSTEM FOR PROCESSING AN INTERRUPT AND METHOD FOR TRANSMISSION OF VECTOR REGISTER FILE DATA THEREFOR - Provided is a technique for improving the transfer latency of vector register file data when an interrupt is generated. According to an aspect, when interrupt occurs, a core determines whether to store vector register file data currently being executed in a first memory or in a second memory based on whether or not the first memory can store the vector register file data therein. In response to not being able to store the vector register file data in the first memory, a data transfer unit, which is implemented as hardware, is provided to store vector register file data in the second memory. | 09-12-2013 |
20140013075 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING A HORIZONTAL ADD OR SUBTRACT IN RESPONSE TO A SINGLE INSTRUCTION - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed horizontal add or subtract of packed data elements in response to a single vector packed horizontal add or subtract instruction that includes a destination vector register operand, a source vector register operand, and an opcode are describes. | 01-09-2014 |
20140013076 | EFFICIENT HARDWARE INSTRUCTIONS FOR SINGLE INSTRUCTION MULTIPLE DATA PROCESSORS - A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack a fixed-width bit values in a bit stream to a fixed width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented. | 01-09-2014 |
20140019712 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING VECTOR PACKED COMPRESSION AND REPEAT - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed compression and repeat in response to a single vector packed compression and repeat instruction that includes a first and second source vector register operand, a destination vector register operand, and an opcode are described. | 01-16-2014 |
20140019713 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING A DOUBLE BLOCKED SUM OF ABSOLUTE DIFFERENCES - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector double block packed sum of absolute differences (SAD) in response to a single vector double block packed sum of absolute differences instruction that includes a destination vector register operand, first and second source operands, an immediate, and an opcode are described. | 01-16-2014 |
20140019714 | VECTOR FREQUENCY EXPAND INSTRUCTION - A processor core that includes a hardware decode unit and an execution engine unit. The hardware decode unit to decode a vector frequency expand instruction, wherein the vector frequency compress instruction includes a source operand and a destination operand, wherein the source operand specifies a source vector register that includes one or more pairs of a value and run length that are to be expanded into a run of that value based on the run length. The execution engine unit to execute the decoded vector frequency expand instruction which causes, a set of one or more source data elements in the source vector register to be expanded into a set of destination data elements comprising more elements than the set of source data elements and including at least one run of identical values which were run length encoded in the source vector register. | 01-16-2014 |
20140047211 | VECTOR REGISTER FILE - An aspect includes accessing a vector register in a vector register file. The vector register file includes a plurality of vector registers and each vector register includes a plurality of elements. A read command is received at a read port of the vector register file. The read command specifies a vector register address. The vector register address is decoded by an address decoder to determine a selected vector register of the vector register file. An element address is determined for one of the plurality of elements associated with the selected vector register based on a read element counter of the selected vector register. A word is selected in a memory array of the selected vector register as read data based on the element address. The read data is output from the selected vector register based on the decoding of the vector register address by the address decoder. | 02-13-2014 |
20140122831 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR COMPRESS AND ROTATE FUNCTIONALITY - Instructions and logic provide vector compress and rotate functionality. Some embodiments, responsive to an instruction specifying: a vector source, a mask, a vector destination and destination offset, read the mask, and copy corresponding unmasked vector elements from the vector source to adjacent sequential locations in the vector destination, starting at the vector destination offset location. In some embodiments, the unmasked vector elements from the vector source are copied to adjacent sequential element locations modulo the total number of element locations in the vector destination. In some alternative embodiments, copying stops whenever the vector destination is full, and upon copying an unmasked vector element from the vector source to an adjacent sequential element location in the vector destination, the value of a corresponding field in the mask is changed to a masked value. Alternative embodiments zero elements of the vector destination, in which no element from the vector source is copied. | 05-01-2014 |
20140129801 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING DELTA ENCODING ON PACKED DATA ELEMENTS - Embodiments of systems, apparatuses, and methods for performing delta encoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta encode instruction are described. | 05-08-2014 |
20140136815 | VERIFICATION OF A VECTOR EXECUTION UNIT DESIGN - A method for verification of a vector execution unit design. The method includes issuing an instruction into a first instance and a second instance of a vector execution unit. The method includes issuing a random operand into a first lane of the first instance of the vector execution unit and into a second lane of the second instance of the vector execution unit. The method further includes receiving results from execution of the instruction and the random operand in both the first and the second instance of the vector execution unit and comparing the received results. | 05-15-2014 |
20140149713 | MULTI-REGISTER GATHER INSTRUCTION - A processor fetches a multi-register gather instruction that includes a destination operand that specifies a destination vector register, and a source operand that identifies content that indicates multiple vector registers, a first set of indexes of each of the vector registers that each identifies a source data element, and a second set of indexes of the destination vector register for each identified source element. The instruction is decoded and executed, causing, for each of the first set of indexes of each of the vector registers, the source data element that corresponds to that index of that vector register to be stored in a set of destination data elements that correspond to the second set of identified indexes of the destination vector register for that source data element. | 05-29-2014 |
20140156969 | VERIFICATION OF A VECTOR EXECUTION UNIT DESIGN - A method for verification of a vector execution unit design. The method includes issuing an instruction into a first instance and a second instance of a vector execution unit. The method includes issuing a random operand into a first lane of the first instance of the vector execution unit and into a second lane of the second instance of the vector execution unit. The method further includes receiving results from execution of the instruction and the random operand in both the first and the second instance of the vector execution unit and comparing the received results. | 06-05-2014 |
20140164733 | TRANSPOSE INSTRUCTION - A transpose instruction is described. A transpose instruction is fetched, where the transpose instruction includes an operand that specifies a vector register or a location in memory. The transpose instruction is decoded. The decoded transpose instruction is executed causing each data element in the specified vector register or location in memory to be stored in that specified vector register or location in memory in reverse order. | 06-12-2014 |
20140201497 | INSTRUCTION FOR ELEMENT OFFSET CALCULATION IN A MULTI-DIMENSIONAL ARRAY - An apparatus is described having functional unit logic circuitry. The functional unit logic circuitry has a first register to store a first input vector operand having an element for each dimension of a multi-dimensional data structure. Each element of the first vector operand specifying the size of its respective dimension. The functional unit has a second register to store a second input vector operand specifying coordinates of a particular segment of the multi-dimensional structure. The functional unit also has logic circuitry to calculate an address offset for the particular segment relative to an address of an origin segment of the multi-dimensional structure. | 07-17-2014 |
20140201498 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR SCATTER-OP AND GATHER-OP FUNCTIONALITY - Instructions and logic provide vector scatter-op and/or gather-op functionality. In some embodiments, responsive to an instruction specifying: a gather and a second operation, a destination register, an operand register, and a memory address; execution units read values in a mask register, wherein fields in the mask register correspond to offset indices in the indices register for data elements in memory. A first mask value indicates the element has not been gathered from memory and a second value indicates that the element does not need to be, or has already been gathered. For each having the first value, the data element is gathered from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. When all mask register fields have the second value, the second operation is performed using corresponding data in the destination and operand registers to generate results. | 07-17-2014 |
20140201499 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING CONVERSION OF A LIST OF INDEX VALUES INTO A MASK VALUE - Embodiments of systems, apparatuses, and methods for performing in a computer processor conversion of a list of index values into a mask value in response to a single vector packed conversion of a list of index values into a mask value instruction that includes a destination writemask register operand, a source vector register operand, and an opcode are described. | 07-17-2014 |
20140223138 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING CONVERSION OF A MASK REGISTER INTO A VECTOR REGISTER. - Embodiments of systems, apparatuses, and methods for performing in a computer processor conversion of a mask register into a vector register in response to a single vector packed convert a mask register to a vector register instruction that includes a destination vector register operand, a source writemask register operand, and an opcode are described. | 08-07-2014 |
20140281368 | CYCLE SLICED VECTORS AND SLOT EXECUTION ON A SHARED DATAPATH - An example method for executing multiple instructions in one or more slots includes receiving a packet including multiple instructions and executing the multiple instructions in one or more slots in a time shared manner. Each slot is associated with an execution data path or a memory data path. An example method for executing at least one instruction in a plurality of phases includes receiving a packet including an instruction, splitting the instruction into a plurality of phases, and executing the instruction in the plurality of phases. | 09-18-2014 |
20140281369 | APPARATUS AND METHOD FOR SLIDING WINDOW DATA GATHER - An apparatus and method are described for fetching and storing a plurality of portions of a data stream into a plurality of registers. For example, a method according to one embodiment includes the following operations: determining a set of N vector registers into which to read N designated portions of a data stream stored in system memory; determining the system memory addresses for each of the N designated portions of the data stream; fetching the N designated portions of the data stream from the system memory at the system memory addresses; and storing the N designated portions of the data stream into the N vector registers. | 09-18-2014 |
20140317377 | VECTOR FREQUENCY COMPRESS INSTRUCTION - A processor core that includes a hardware decode unit to decode a vector frequency compress instruction that includes a source operand and a destination operand. The source operand specifying a source vector register that includes a plurality of source data elements including one or more runs of identical data elements that are each to be compressed in a destination vector register as a value and run length pair. The destination operand identifies the destination vector register. The processor core also includes an execution engine unit to execute the decoded vector frequency compress instruction which causes, for each source data element, a value to be copied into the destination vector register to indicate that source data element's value. One or more runs of the source data elements equal are encoded in the destination vector register as the predetermined compression value followed by a run length for that run. | 10-23-2014 |
20140365747 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING A HORIZONTAL PARTIAL SUM IN RESPONSE TO A SINGLE INSTRUCTION - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed horizontal partial sum of packed data elements in response to a single vector packed horizontal sum instruction that includes a destination vector register operand, a source vector register operand, and an opcode are described. | 12-11-2014 |
20150039851 | METHODS, APPARATUS, INSTRUCTIONS AND LOGIC TO PROVIDE VECTOR SUB-BYTE DECOMPRESSION FUNCTIONALITY - Methods, apparatus, instructions and logic provide SIMD vector sub-byte decompression functionality. Embodiments include shuffling a first and second byte into the least significant portion of a first vector element, and a third and fourth byte into the most significant portion. Processing continues shuffling a fifth and sixth byte into the least significant portion of a second vector element, and a seventh and eighth byte into the most significant portion. Then by shifting the first vector element by a first shift count and the second vector element by a second shift count, sub-byte elements are aligned to the least significant bits of their respective bytes. Processors then shuffle a byte from each of the shifted vector elements' least significant portions into byte positions of a destination vector element, and from each of the shifted vector elements' most significant portions into byte positions of another destination vector element. | 02-05-2015 |
20150039852 | DATA COMPACTION USING VECTORIZED INSTRUCTIONS - Techniques for performing database operations using vectorized instructions are provided. In one technique, data compaction is performed using vectorized instructions to identify a shuffle mask based on matching bits and update an output array based on the shuffle mask and an input array. In a related technique, a hash table probe involves using vectorized instructions to determine whether each key in one or more hash buckets matches a particular input key. | 02-05-2015 |
20150039853 | ESTIMATING A COST OF PERFORMING DATABASE OPERATIONS USING VECTORIZED INSTRUCTIONS - Techniques for performing database operations using vectorized instructions are provided. In one technique, it is determined whether to perform a database operation using one or more vectorized instructions or without using any vectorized instructions. This determination may comprise estimating a first cost of performing the database operation using one or more vectorized instructions and estimating a second cost of performing the database operation without using any vectorized instructions. Multiple factors that may be used to determine which approach to follow, such as the number of data elements that may fit into a SIMD register, a number of vectorized instructions in the vectorized approach, a number of data movement instructions that involve moving data from a SIMD register to a non-SIMD register and/or vice versa, a size of a cache, and a projected size of a hash table. | 02-05-2015 |
20150039854 | VECTORIZED LOOKUP OF FLOATING POINT VALUES - Systems and techniques disclosed herein include methods for de-quantization of feature vectors used in automatic speech recognition. A SIMD vector processor is used in one embodiment for efficient vectorized lookup of floating point values in conjunction with fMPE processing for increasing the discriminative power of input signals. These techniques exploit parallelism to effectively reduce the latency of speech recognition in a system operating in a high dimensional feature space. In one embodiment, a bytewise integer lookup operation effectively performs a floating point or a multiple byte lookup. | 02-05-2015 |
20150046671 | METHODS, APPARATUS, INSTRUCTIONS AND LOGIC TO PROVIDE VECTOR POPULATION COUNT FUNCTIONALITY - Instructions and logic provide SIMD vector population count functionality. Some embodiments store in each data field of a portion of n data fields of a vector register or memory vector, a plurality of bits of data. In a processor, a SIMD instruction for a vector population count is executed, such that for that portion of the n data fields in the vector register or memory vector, the occurrences of binary values equal to each of a first one or more predetermined binary values, are counted and the counted occurrences are stored, in a portion of a destination register corresponding to the portion of the n data fields in the vector register or memory vector, as a first one or more counts corresponding to the first one or more predetermined binary values. | 02-12-2015 |
20150046672 | METHODS, APPARATUS, INSTRUCTIONS AND LOGIC TO PROVIDE POPULATION COUNT FUNCTIONALITY FOR GENOME SEQUENCING AND ALIGNMENT - Instructions and logic provide SIMD vector population count functionality. Some embodiments store in each data field of a portion of n data fields of a vector register or memory vector, at least two bits of data. In a processor, a SIMD instruction for a vector population count is executed, such that for that portion of the n data fields in the vector register or memory vector, the occurrences of binary values equal to each of a first one or more predetermined binary values, are counted and the counted occurrences are stored, in a portion of a destination register corresponding to the portion of the n data fields in the vector register or memory vector, as a first one or more counts corresponding to the first one or more predetermined binary values. | 02-12-2015 |
20150089189 | Predicate Vector Pack and Unpack Instructions - In an embodiment, a processor may implement a vector instruction set including predicate vectors and multiple vector element sizes. The vector instruction set may include predicate vector pack and unpack instructions. Responsive to the predicate vector pack instruction, the processor may pack predicates from multiple predicate vector source registers into a destination predicate vector register. Responsive to the predicate vector unpack instruction, the processor may select a portion of a source predicate vector register and write the result to a destination predicate vector register. Additionally, the predicate vector register may store one or more vector attributes associated with the corresponding vector. The processor may modify the attribute as part of the pack/unpack operation (e.g. based on a pack/unpack factor). Additionally, vector pack/unpack instructions that are controlled by the attribute in a corresponding predicate vector register may be implemented. | 03-26-2015 |
20150143074 | VECTOR EXCEPTION CODE - Vector exception handling is facilitated. A vector instruction is executed that operates on one or more elements of a vector register. When an exception is encountered during execution of the instruction, a vector exception code is provided that indicates a position within the vector register that caused the exception. The vector exception code also includes a reason for the exception. | 05-21-2015 |
20160179529 | METHOD AND APPARATUS FOR PERFORMING A VECTOR BIT REVERSAL AND CROSSING | 06-23-2016 |
20160188335 | METHOD AND APPARATUS FOR PERFORMING A VECTOR BIT GATHER - An apparatus and method for performing a vector bit gather. For example, one embodiment of a processor comprises: a first vector register to store one or more source data elements; a second vector register to store one or more control elements, each of the control elements comprising a plurality of bit fields, each bit field to be associated with a corresponding bit position in a destination vector register and to identify a bit from the one or more source data elements to be copied to each of the particular bit positions; and vector bit gather logic to read each bit field from the second vector register to identify a bit from the one or more source data elements and to responsively copy the bit from each of the one or more source data elements to each of the corresponding bit positions in the destination vector register. | 06-30-2016 |