Patent application number | Description | Published |
20120185670 | SCALAR INTEGER INSTRUCTIONS CAPABLE OF EXECUTION WITH THREE REGISTERS - A processing core implemented on a semiconductor chip is described. The processing core includes logic circuitry to identify whether vector instructions and integer scalar instructions are to be executed with two registers or three registers, where, in the case of two registers input operand information is destroyed in one of two registers, and, in the case of three registers input operand is not destroyed. The processing core also includes steering circuitry coupled to the logic circuitry. The steering circuitry is to control first data paths between scalar integer execution units and a scalar integer register bank such that two registers are accessed from the scalar register bank if two register execution is identified for the scalar integer instructions or three registers are accessed from the scalar integer register bank if three register execution is identified for the scalar integer instructions. The steering circuitry is also to control second data paths between vector execution units and a vector register bank such that two registers are accessed from the vector register bank if two register execution is identified for the vector instructions or three registers are accessed from the vector register bank if three register execution is identified for the vector instructions. | 07-19-2012 |
20120254588 | SYSTEMS, APPARATUSES, AND METHODS FOR BLENDING TWO SOURCE OPERANDS INTO A SINGLE DESTINATION USING A WRITEMASK - Embodiments of systems, apparatuses, and methods for performing a blend instruction in a computer processor are described. In some embodiments, the execution of a blend instruction causes a data element-by-element selection of data elements of first and second source operands using the corresponding bit positions of a writemask as a selector between the first and second operands and storage of the selected data elements into the destination at the corresponding position in the destination. | 10-04-2012 |
20130275730 | APPARATUS AND METHOD OF IMPROVED EXTRACT INSTRUCTIONS - An apparatus is described that includes instruction execution logic circuitry to execute first, second, third and fourth instructions. Both the first instruction and the second instruction select a first group of input vector elements from one of multiple first non overlapping sections of respective first and second input vectors. The first group has a first bit width. Each of the multiple first non overlapping sections have a same bit width as the first group. Both the third instruction and the fourth instruction select a second group of input vector elements from one of multiple second non overlapping sections of respective third and fourth input vectors. The second group has a second bit width that is larger than the first bit width. Each of the multiple second non overlapping sections have a same bit width as the second group. The apparatus includes masking layer circuitry to mask the first and second groups of the first and third instructions at a first granularity, where, respective resultants produced therewith are respective resultants of the first and third instructions. The masking circuitry is also to mask the first and second groups of the second and fourth instructions at a second granularity, where, respective resultants produced therewith are respective resultants of the second and fourth instructions. | 10-17-2013 |
20130275731 | VECTOR INSTRUCTION FOR PRESENTING COMPLEX CONJUGATES OF RESPECTIVE COMPLEX NUMBERS - An apparatus is described having a semiconductor chip that has an instruction execution pipeline. The instruction execution pipeline has an execution unit with logic circuitry to perform the following for an instruction: accept input vector elements representing real and imaginary parts of a plurality of complex numbers; and, present the complex conjugates of the complex numbers. | 10-17-2013 |
20130283018 | Packed Data Rearrangement Control Indexes Generation Processors, Methods, Systems and Instructions - A method of an aspect includes receiving a packed data rearrangement control indexes generation instruction. The packed data rearrangement control indexes generation instruction indicates a destination storage location. A result is stored in the destination storage location in response to the packed data rearrangement control indexes generation instruction. The result includes a sequence of at least four non-negative integers representing packed data rearrangement control indexes. In an aspect, values of the at least four non-negative integers are not calculated using a result of a preceding instruction. Other methods, apparatus, systems, and instructions are disclosed. | 10-24-2013 |
20130283021 | APPARATUS AND METHOD OF IMPROVED INSERT INSTRUCTIONS - An apparatus is described having instruction execution logic circuitry to execute first, second, third and fourth instructions. Both the first instruction and the second instruction insert a first group of input vector elements to one of multiple first non overlapping sections of respective first and second resultant vectors. The first group has a first bit width. Each of the multiple first non overlapping sections have a same bit width as the first group. Both the third instruction and the fourth instruction insert a second group of input vector elements to one of multiple second non overlapping sections of respective third and fourth resultant vectors. The second group has a second bit width that is larger than said first bit width. Each of the multiple second non overlapping sections have a same bit width as the second group. The apparatus also includes masking layer circuitry to mask the first and third instructions at a first resultant vector granularity, and, mask the second and fourth instructions at a second resultant vector granularity. | 10-24-2013 |
20130290254 | INSTRUCTION EXECUTION THAT BROADCASTS AND MASKS DATA VALUES AT DIFFERENT LEVELS OF GRANULARITY - An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second instruction to create a second replication data structure. The execution unit also includes masking logic circuitry to mask the first replication data structure at a first granularity and mask the second replication data structure at a second granularity. The second granularity is twice as fine as the first granularity. | 10-31-2013 |
20130290672 | APPARATUS AND METHOD OF MASK PERMUTE INSTRUCTIONS - An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector element routing circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths. | 10-31-2013 |
20130290687 | APPARATUS AND METHOD OF IMPROVED PERMUTE INSTRUCTIONS - An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector element routing circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths. | 10-31-2013 |
20130305020 | VECTOR FRIENDLY INSTRUCTION FORMAT AND EXECUTION THEREOF - A vector friendly instruction format and execution thereof. According to one embodiment of the invention, a processor is configured to execute an instruction set. The instruction set includes a vector friendly instruction format. The vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field, wherein the first instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field, and wherein only one of the different values may be placed in each of the base operation field, the modifier field, the alpha field, the beta field, and the data element width field on each occurrence of an instruction in the first instruction format in instruction streams. | 11-14-2013 |
20130318328 | APPARATUS AND METHOD FOR SHUFFLING FLOATING POINT OR INTEGER VALUES - An apparatus and method are described for shuffling data elements from source registers to a destination register. For example, a method according to one embodiment includes the following operations: reading each mask bit stored in a mask data structure, the mask data structure containing mask bits associated with data elements of a destination register, the values usable for determining whether a masking operation or a shuffle operation should be performed on data elements stored within a first source register and a second source register; for each data element of the destination register, if a mask bit associated with the data element indicates that a shuffle operation should be performed, then shuffling data elements from the first source register and the second source register to the specified data element within the destination register; and if the mask bit indicates that a masking operation should be performed, then performing a specified masking operation with respect to the data element of the destination register. | 11-28-2013 |
20130326192 | BROADCAST OPERATION ON MASK REGISTER - Embodiments of systems, apparatuses, and methods for performing a mask broadcast instruction in a computer processor are described. In some embodiments, the execution of a mask broadcast instruction causes a broadcast of a data element of the source operand to a destination register of the destination operand according to the broadcast size. | 12-05-2013 |
20130326196 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING VECTOR PACKED UNARY DECODING USING MASKS - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed unary value decoding using masks in response to a single vector packed unary decoding using masks instruction that includes a destination vector register operand, a source writemask register operand, and an opcode are described. | 12-05-2013 |
20130339661 | EFFICIENT ZERO-BASED DECOMPRESSION - A processor core including a hardware decode unit to decode vector instructions for decompressing a run length encoded (RLE) set of source data elements and an execution unit to execute the decoded instructions. The execution unit generates a first mask by comparing the set of source data elements with a set of zeros and then counts the trailing zeros in the mask. A second mask is made based on the count of trailing zeros. The execution unit then copies the set of source data elements to a buffer using the second mask and then reads the number of RLE zeros from the set of source data elements. The buffer is shifted and copied to a result and the set of source data elements is shifted to the right. If more valid data elements are in the set of source data elements, this is repeated until all valid data is processed. | 12-19-2013 |
20130339664 | INSTRUCTION EXECUTION UNIT THAT BROADCASTS DATA VALUES AT DIFFERENT LEVELS OF GRANULARITY - An apparatus is described that includes an execution unit to execute a first instruction and a second instruction. The execution unit includes input register space to store a first data structure to be replicated when executing the first instruction and to store a second data structure to be replicated when executing the second instruction. The first and second data structures are both packed data structures. Data values of the first packed data structure are twice as large as data values of the second packed data structure. The first data structure is four times as large as the second data structure. The execution unit also includes replication logic circuitry to replicate the first data structure when executing the first instruction to create a first replication data structure, and, to replicate the second data structure when executing the second instruction to create a second replication data structure. | 12-19-2013 |
20130339668 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING DELTA DECODING ON PACKED DATA ELEMENTS - Embodiments of systems, apparatuses, and methods for performing delta decoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta decode instruction are described. | 12-19-2013 |
20130339678 | MULTI-ELEMENT INSTRUCTION WITH DIFFERENT READ AND WRITE MASKS - A method is described that includes reading a first read mask from a first register. The method also includes reading a first vector operand from a second register or memory location. The method also includes applying the read mask against the first vector operand to produce a set of elements for operation. The method also includes performing an operation on the set of elements. The method also includes creating an output vector by producing multiple instances of the operation's result. The method also includes reading a first write mask from a third register, the first write mask being different than the first read mask. The method also includes applying the write mask against the output vector to create a resultant vector. The method also includes writing the resultant vector to a destination register. | 12-19-2013 |
20130339682 | METHODS TO OPTIMIZE A PROGRAM LOOP VIA VECTOR INSTRUCTIONS USING A SHUFFLE TABLE AND A MASK STORE TABLE - According to one embodiment, a code optimizer is configured to receive first code having a program loop implemented with scalar instructions to store values of a first array to a second array based on values of a third array. The code optimizer is configured to generate second code representing the program loop with vector instructions including a shuffle instruction and a store instruction, the shuffle instruction to shuffle using a shuffle table elements of the first array based on the third array in a vector manner, the store instruction to store using a mask store table the shuffled elements in the second array in a vector manner. | 12-19-2013 |
20140006756 | Systems, Apparatuses, and Methods for Performing a Shuffle and Operation (Shuffle-Op) | 01-02-2014 |
20140019712 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING VECTOR PACKED COMPRESSION AND REPEAT - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed compression and repeat in response to a single vector packed compression and repeat instruction that includes a first and second source vector register operand, a destination vector register operand, and an opcode are described. | 01-16-2014 |
20140019713 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING A DOUBLE BLOCKED SUM OF ABSOLUTE DIFFERENCES - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector double block packed sum of absolute differences (SAD) in response to a single vector double block packed sum of absolute differences instruction that includes a destination vector register operand, first and second source operands, an immediate, and an opcode are described. | 01-16-2014 |
20140019714 | VECTOR FREQUENCY EXPAND INSTRUCTION - A processor core that includes a hardware decode unit and an execution engine unit. The hardware decode unit to decode a vector frequency expand instruction, wherein the vector frequency expand instruction includes a source operand and a destination operand, wherein the source operand specifies a source vector register that includes one or more pairs of a value and run length that are to be expanded into a run of that value based on the run length. The execution engine unit to execute the decoded vector frequency expand instruction which causes a set of one or more source data elements in the source vector register to be expanded into a set of destination data elements comprising more elements than the set of source data elements and including at least one run of identical values which were run length encoded in the source vector register. | 01-16-2014 |
20140019715 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING A CONVERSION OF A WRITEMASK REGISTER TO A LIST OF INDEX VALUES IN A VECTOR REGISTER - Embodiments of systems, apparatuses, and methods for performing in a computer processor conversion of a mask register into a list of index values in response to a single vector packed convert a mask register into a list of index values instruction that includes a destination vector register operand, a source writemask register operand, and an opcode are described. | 01-16-2014 |
20140019732 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING MASK BIT COMPRESSION - Embodiments of systems, apparatuses, and methods for performing in a computer processor mask bit compression in response to a single mask bit compression instruction that includes a source writemask register operand, a destination writemask register operand, and an opcode are described. | 01-16-2014 |
20140032877 | APPARATUS AND METHOD FOR AN INSTRUCTION THAT DETERMINES WHETHER A VALUE IS WITHIN A RANGE - A method is described that includes performing the following with a single instruction: receiving a first input operand V; receiving a second input operand S; calculating V−S; determining if V−S is positive or negative; and, providing as a resultant: V if V−S is negative; V−S if V−S is positive. | 01-30-2014 |
20140059322 | APPARATUS AND METHOD FOR BROADCASTING FROM A GENERAL PURPOSE REGISTER TO A VECTOR REGISTER - An apparatus and method are described for broadcasting from a general purpose source register to a destination vector register. For example, a method according to one embodiment includes the following operations: selecting data element position N within the destination vector register to be updated; broadcasting a set of data from the general purpose source register to data element position N within the destination vector register if a mask indicator is set to a first indication; and either copying zeroes to data element position N within the destination vector register or maintaining existing values stored within data element position N within the destination vector register if the mask indicator is set to a second indication. | 02-27-2014 |
20140068227 | SYSTEMS, APPARATUSES, AND METHODS FOR EXTRACTING A WRITEMASK FROM A REGISTER - Embodiments of systems, apparatuses, and methods for performing in a computer processor mask extraction from a general purpose register in response to a single mask extraction from a general purpose register instruction that includes a source general purpose register operand, a destination writemask register operand, an immediate value, and an opcode are described. | 03-06-2014 |
20140082333 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING AN ABSOLUTE DIFFERENCE CALCULATION BETWEEN CORRESPONDING PACKED DATA ELEMENTS OF TWO VECTOR REGISTERS - Embodiments of systems, apparatuses, and methods for performing in a computer processor absolute difference calculation in response to a single vector packed absolute difference instruction that includes a first and second source vector register operand, a destination vector register operand, and an opcode are described. | 03-20-2014 |
20140108480 | APPARATUS AND METHOD FOR VECTOR COMPUTE AND ACCUMULATE - An apparatus and method are described for comparing elements between two immediate values. For example, a method according to one embodiment includes the following operations: reading values of a first set of elements stored in a first immediate value, each element having a defined element position in the first immediate value; comparing each element from the first set of elements with each of a second set of elements stored in a second immediate value; counting the number of times the value of each element of the first set of elements is found in the second set of elements to arrive at a final count for each element of the first set of elements; and transferring the final count for each element to a third immediate value, wherein the final count is stored in an element position in the third immediate value corresponding to the defined element position in the first immediate value. | 04-17-2014 |
20140122831 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR COMPRESS AND ROTATE FUNCTIONALITY - Instructions and logic provide vector compress and rotate functionality. Some embodiments, responsive to an instruction specifying: a vector source, a mask, a vector destination and destination offset, read the mask, and copy corresponding unmasked vector elements from the vector source to adjacent sequential locations in the vector destination, starting at the vector destination offset location. In some embodiments, the unmasked vector elements from the vector source are copied to adjacent sequential element locations modulo the total number of element locations in the vector destination. In some alternative embodiments, copying stops whenever the vector destination is full, and upon copying an unmasked vector element from the vector source to an adjacent sequential element location in the vector destination, the value of a corresponding field in the mask is changed to a masked value. Alternative embodiments zero elements of the vector destination, in which no element from the vector source is copied. | 05-01-2014 |
20140129801 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING DELTA ENCODING ON PACKED DATA ELEMENTS - Embodiments of systems, apparatuses, and methods for performing delta encoding on packed data elements of a source and storing the results in packed data elements of a destination using a single vector packed delta encode instruction are described. | 05-08-2014 |
20140149724 | VECTOR FRIENDLY INSTRUCTION FORMAT AND EXECUTION THEREOF - A vector friendly instruction format and execution thereof. According to one embodiment of the invention, a processor is configured to execute an instruction set. The instruction set includes a vector friendly instruction format. The vector friendly instruction format has a plurality of fields including a base operation field, a modifier field, an augmentation operation field, and a data element width field, wherein the first instruction format supports different versions of base operations and different augmentation operations through placement of different values in the base operation field, the modifier field, the alpha field, the beta field, and the data element width field, and wherein only one of the different values may be placed in each of the base operation field, the modifier field, the alpha field, the beta field, and the data element width field on each occurrence of an instruction in the first instruction format in instruction streams. | 05-29-2014 |
20140149802 | Apparatus And Method To Obtain Information Regarding Suppressed Faults - A processor includes an execution unit, a fault mask coupled to the execution unit, and a suppress mask coupled to the execution unit. The fault mask is to store a first plurality of bit values to indicate which elements of a multi-element vector have an associated fault generated in response to execution of an instruction on the element in the execution unit. The suppress mask is to store a second plurality of bit values to indicate which of the elements are to have an associated fault suppressed. The processor also includes counter logic to increment a counter in response to an indication of a first fault associated with the first element and received from the fault mask, and an indication of a first suppression associated with the first element and received from the suppress mask. Other embodiments are described and claimed. | 05-29-2014 |
20140188961 | Vectorization Of Collapsed Multi-Nested Loops - In an embodiment a method of vectorizing a collapsed multi-nested loop includes executing, in a vector unit of a processor, the collapsed loop to obtain a vector of offsets, including for each of a plurality of iterations, calculating a scalar offset into a multi-dimensional data structure, storing the scalar offset in a data element of a first vector register, and updating a loop counter value of a multi-dimensional loop counter vector. In turn, a plurality of data elements are loaded from the multi-dimensional data structure using a base value and indexes from the vector of offsets, at least one computation is performed on the loaded plurality of data elements to obtain a plurality of results, and the plurality of results are stored into the multi-dimensional data structure using the base value and the indexes from the vector of offsets. Other embodiments are described and claimed. | 07-03-2014 |
20140189287 | COLLAPSING OF MULTIPLE NESTED LOOPS, METHODS AND INSTRUCTIONS - In an embodiment, the present invention is directed to a processor including a decode logic to receive a multi-dimensional loop counter update instruction and to decode the multi-dimensional loop counter update instruction into at least one decoded instruction, and an execution logic to execute the at least one decoded instruction to update at least one loop counter value of a first operand associated with the multi-dimensional loop counter update instruction by a first amount. Methods to collapse loops using such instructions are also disclosed. Other embodiments are described and claimed. | 07-03-2014 |
20140189294 | SYSTEMS, APPARATUSES, AND METHODS FOR DETERMINING DATA ELEMENT EQUALITY OR SEQUENTIALITY - Systems, apparatuses, and methods of performing in a computer processor broadcasting data in response to a single vector packed broadcasting instruction that includes a source writemask register operand, a destination vector register operand, and an opcode. In some embodiments, the data of the source writemask register is zero extended prior to broadcasting. | 07-03-2014 |
20140189295 | Apparatus and Method of Efficient Vector Roll Operation - A machine readable storage medium containing program code is described that when processed by a processor causes a method to be performed. The method includes creating a resultant rolled version of an input vector by forming a first intermediate vector, forming a second intermediate vector and forming a resultant rolled version of an input vector. The first intermediate vector is formed by barrel rolling elements of the input vector along a first of two lanes defined by an upper half and a lower half of the input vector. The second intermediate vector is formed by barrel rolling elements of the input vector along a second of the two lanes. The resultant rolled version of the input vector is formed by incorporating upper portions of one of the intermediate vector's upper and lower halves as upper portions of the resultant's upper and lower halves and incorporating lower portions of the other intermediate vector's upper and lower halves as lower portions of the resultant's upper and lower halves. | 07-03-2014 |
20140189296 | SYSTEM, APPARATUS AND METHOD FOR LOOP REMAINDER MASK INSTRUCTION - A loop remainder mask instruction indicates a current iteration count of a loop as a first operand, an iteration limit of a loop as a second operand, and a destination. The loop contains iterations and each iteration includes a data element of the array. A processor receives the loop remainder mask instruction, decodes the instruction for execution, and stores a result of the execution in the destination. The result indicates a number of data elements of the array past an end of a preceding portion of the array that are to be handled separately from the preceding portion, the end of the preceding portion being where the current iteration count is recorded. | 07-03-2014 |
20140189307 | METHODS, APPARATUS, INSTRUCTIONS, AND LOGIC TO PROVIDE VECTOR ADDRESS CONFLICT RESOLUTION WITH VECTOR POPULATION COUNT FUNCTIONALITY - Instructions and logic provide SIMD address conflict resolution with vector population count functionality. Some embodiments include processors with a register with a variable plurality of data fields, each of the data fields to store a variable second plurality of bits. A destination register has corresponding data fields, each of these data fields to store a count of the number of bits set to one for corresponding data fields. Responsive to decoding a vector population count instruction, execution units count the number of bits set to one for each of the data fields in the register, and store the counts in corresponding data fields of the first destination register. Vector population count instructions can be used with variable sized elements and conflict masks to generate iteration counts and completion masks to be used each iteration to resolve dependencies in gather-modify-scatter SIMD operations. | 07-03-2014 |
20140189308 | METHODS, APPARATUS, INSTRUCTIONS, AND LOGIC TO PROVIDE VECTOR ADDRESS CONFLICT DETECTION FUNCTIONALITY - Instructions and logic provide SIMD address conflict detection functionality. Some embodiments include processors with a register with a variable plurality of data fields, each of the data fields to store an offset for a data element in a memory. A destination register has corresponding data fields, each of these data fields to store a variable second plurality of bits to store a conflict mask having a mask bit for each offset. Responsive to decoding a vector conflict instruction, execution units compare the offset in each data field with every less significant data field to determine if they hold a matching offset, and in corresponding conflict masks in the destination register, set any mask bits corresponding to a less significant data field with a matching offset. Vector address conflict detection can be used with variable sized elements and to generate conflict masks to resolve dependencies in gather-modify-scatter SIMD operations. | 07-03-2014 |
20140189321 | INSTRUCTIONS AND LOGIC TO VECTORIZE CONDITIONAL LOOPS - Instructions and logic provide vectorization of conditional loops. A vector expand instruction has a parameter to specify a source vector, a parameter to specify a conditions mask register, and a destination parameter to specify a destination vector to hold n consecutive vector elements, each of the plurality of n consecutive vector elements having a same variable partition size of m bytes. In response to the processor instruction, data is copied from consecutive vector elements in the source vector, and expanded into unmasked vector elements of the specified destination vector, without copying data into masked vector elements of the destination vector, wherein n varies responsive to the processor instruction executed. The source vector may be a register and the destination vector may be in memory. Some embodiments store counts of the condition decisions. Alternative embodiments may store other data, for example such as target addresses, or table offsets, or indicators of processing directives, etc. | 07-03-2014 |
20140189322 | Systems, Apparatuses, and Methods for Masking Usage Counting - Embodiments of systems, apparatuses, and methods for counting instructions of a particular type are described herein. In some embodiments, a processor includes a plurality of write mask registers, logic to determine write mask register usage of an instruction in a particular manner and a counter to count a number of instances of instructions that have been determined to use a write mask register in the particular manner. | 07-03-2014 |
20140195775 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR LOADS AND STORES WITH STRIDES AND MASKING FUNCTIONALITY - Instructions and logic provide vector loads and/or stores with stride and mask functionality. Some embodiments, responsive to an instruction specifying: a set of loads, destination register, mask register, memory address, and stride length; execution units read values in the mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been loaded. For each having the first value, the corresponding multiple of said stride length is generated according to the data field's position in the mask register to load the data element from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. These instructions can restart after faults. | 07-10-2014 |
20140195778 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR LOAD-OP/STORE-OP WITH STRIDE FUNCTIONALITY - Instructions and logic provide vector load-op and/or store-op with stride functionality. Some embodiments, responsive to an instruction specifying: a set of loads, a second operation, destination register, operand register, memory address, and stride length; execution units read values in a mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been loaded. For each having the first value, the data element is loaded from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. Then the second operation is performed using corresponding data in the destination and operand registers to generate results. The instruction may be restarted after faults. | 07-10-2014 |
20140195783 | DOT PRODUCT PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS - A method of an aspect includes receiving a dot product instruction. The dot product instruction indicates a first source packed data including at least four data elements, indicates a second source packed data including at least eight data elements, and indicates a destination storage location. A result packed data is stored in the destination storage location in response to the dot product instruction. The result includes a plurality of data elements that each includes a dot product result. Each of the dot product results includes a sum of products of the at least four data elements of the first source packed data with corresponding data elements in a different subset of at least four data elements of the second source packed data. Other methods, apparatus, systems, and instructions are disclosed. | 07-10-2014 |
20140201497 | INSTRUCTION FOR ELEMENT OFFSET CALCULATION IN A MULTI-DIMENSIONAL ARRAY - An apparatus is described having functional unit logic circuitry. The functional unit logic circuitry has a first register to store a first input vector operand having an element for each dimension of a multi-dimensional data structure. Each element of the first vector operand specifying the size of its respective dimension. The functional unit has a second register to store a second input vector operand specifying coordinates of a particular segment of the multi-dimensional structure. The functional unit also has logic circuitry to calculate an address offset for the particular segment relative to an address of an origin segment of the multi-dimensional structure. | 07-17-2014 |
20140201498 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR SCATTER-OP AND GATHER-OP FUNCTIONALITY - Instructions and logic provide vector scatter-op and/or gather-op functionality. In some embodiments, responsive to an instruction specifying: a gather and a second operation, a destination register, an operand register, and a memory address; execution units read values in a mask register, wherein fields in the mask register correspond to offset indices in the indices register for data elements in memory. A first mask value indicates the element has not been gathered from memory and a second value indicates that the element does not need to be, or has already been gathered. For each having the first value, the data element is gathered from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. When all mask register fields have the second value, the second operation is performed using corresponding data in the destination and operand registers to generate results. | 07-17-2014 |
20140201499 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING CONVERSION OF A LIST OF INDEX VALUES INTO A MASK VALUE - Embodiments of systems, apparatuses, and methods for performing in a computer processor conversion of a list of index values into a mask value in response to a single vector packed conversion of a list of index values into a mask value instruction that includes a destination writemask register operand, a source vector register operand, and an opcode are described. | 07-17-2014 |
20140201502 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING A BUTTERFLY HORIZONTAL AND CROSS ADD OR SUBSTRACT IN RESPONSE TO A SINGLE INSTRUCTION - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed butterfly horizontal cross add or subtract of packed data elements in response to a single vector packed butterfly horizontal cross add or subtract instruction that includes a destination vector register operand, a source vector register operand, an immediate, and an opcode are described. | 07-17-2014 |
20140201510 | SYSTEM, APPARATUS AND METHOD FOR GENERATING A LOOP ALIGNMENT COUNT OR A LOOP ALIGNMENT MASK - A loop alignment instruction indicates a base address of an array as a first operand, an iteration limit of a loop as a second operand, and a destination. The loop contains iterations and each iteration includes a data element of the array. A processor receives the loop alignment instruction, decodes the instruction for execution, and stores a result of the execution in the destination. The result indicates the number of data elements at a beginning of the array that are to be handled separately from a remaining portion of the array, such that the base address of the remaining portion of the array aligns with an alignment width. | 07-17-2014 |
20140208065 | APPARATUS AND METHOD FOR MASK REGISTER EXPAND OPERATION - An apparatus and method are described for expanding bits from a mask register in a processor and computing system with vector registers and vector data elements. For example, a method according to one embodiment includes the following operations: reading each mask register bit stored in a mask register, the mask register containing mask values used for performing operations on vector values stored in a set of vector registers; and replicating each mask register bit N times into a destination register, where N is the number of vector elements stored in each vector register. | 07-24-2014 |
20140208080 | APPARATUS AND METHOD FOR DOWN CONVERSION OF DATA TYPES - An apparatus and method are described for down-converting from a source operand to a destination operand with masking. For example, a method according to one embodiment includes the following operations: reading a source operand value to be down-converted from a first value to a down-converted value and stored in a destination location; reading each mask register bit stored in a mask register, the mask register bit(s) indicating whether to perform a masking operation or a conversion operation on the source operand value; if the mask register bit(s) indicates that a masking operation is to be performed, then performing a specified masking operation and storing the results of the masking operation in the destination location; and if the mask register bit(s) indicates that a masking operation is not to be performed, then down-converting the source operand value and storing the down-converted value in the specified destination location. | 07-24-2014 |
20140215186 | SYSTEMS, APPARATUSES, AND METHODS FOR MAPPING A SOURCE OPERAND TO A DIFFERENT RANGE - Embodiments of systems, apparatuses, and methods for performing a range mapping instruction in a computer processor are described. In some embodiments, the execution of a range mapping instruction maps a data element having a source data range to a destination data element having a destination data range and stores the destination data element. | 07-31-2014 |
20140223138 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING CONVERSION OF A MASK REGISTER INTO A VECTOR REGISTER. - Embodiments of systems, apparatuses, and methods for performing in a computer processor conversion of a mask register into a vector register in response to a single vector packed convert a mask register to a vector register instruction that includes a destination vector register operand, a source writemask register operand, and an opcode are described. | 08-07-2014 |
20140223140 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING VECTOR PACKED UNARY ENCODING USING MASKS - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed unary encoding using masks in response to a single vector packed unary encoding using masks instruction that includes a source vector register operand, a destination writemask register operand, and an opcode are described. | 08-07-2014 |
20140258683 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR HORIZONTAL COMPARE FUNCTIONALITY - Instructions and logic provide vector horizontal compare functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read values from data fields of the specified size in the source operand, corresponding to the mask and compare the values for equality. In some embodiments, responsive to a detection of inequality, a trap may be taken. In some alternative embodiments, a flag may be set. In other alternative embodiments, a mask field may be set to a masked state for the corresponding unequal value(s). In some embodiments, responsive to all unmasked data fields of the source operand being equal to a particular value, that value may be broadcast to all data fields of the specified size in the destination operand. | 09-11-2014 |
20140281395 | Systems, Apparatuses, and Methods for Reducing the Number of Short Integer Multiplications - Systems, methods, and apparatuses for calculating a square of a data value of a first source operand, a square of a data value of a second source operand, and a multiplication of the data of the first and second operands only using one multiplication are described. | 09-18-2014 |
20140281400 | Systems, Apparatuses, and Methods for Zeroing of Bits in a Data Element - Embodiments of systems, methods and apparatuses for executing a VPBZHI instruction are described. The execution of a VPBZHI instruction causes, on a per data element basis of a second source, a zeroing of bits higher (more significant) than a starting point in the data element. The starting point is defined by the contents of a data element in a first source. The resultant data elements are stored in a corresponding data element position of a destination. | 09-18-2014 |
20140281425 | LIMITED RANGE VECTOR MEMORY ACCESS INSTRUCTIONS, PROCESSORS, METHODS, AND SYSTEMS - A processor of an aspect includes a plurality of packed data registers. The processor also includes a unit coupled with the packed data registers. The unit is operable, in response to a limited range vector memory access instruction. The instruction is to indicate a source packed memory indices, which is to have a plurality of packed memory indices, which are to be selected from 8-bit memory indices and 16-bit memory indices. The unit is operable to access memory locations, in only a limited range of a memory, in response to the limited range vector memory access instruction. Other processors are disclosed, as are methods, systems, and instructions. | 09-18-2014 |
20140289494 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR HORIZONTAL MAJORITY VOTING FUNCTIONALITY - Instructions and logic provide vector horizontal majority voting functionality. Some embodiments, responsive to an instruction specifying: a destination operand, a size of the vector elements, a source operand, and a mask corresponding to a portion of the vector element data fields in the source operand; read a number of values from data fields of the specified size in the source operand, corresponding to the mask specified by the instruction and store a result value to that number of corresponding data fields in the destination operand, the result value computed from the majority of values read from the number of data fields of the source operand. | 09-25-2014 |
20140317377 | VECTOR FREQUENCY COMPRESS INSTRUCTION - A processor core that includes a hardware decode unit to decode a vector frequency compress instruction that includes a source operand and a destination operand. The source operand specifying a source vector register that includes a plurality of source data elements including one or more runs of identical data elements that are each to be compressed in a destination vector register as a value and run length pair. The destination operand identifies the destination vector register. The processor core also includes an execution engine unit to execute the decoded vector frequency compress instruction which causes, for each source data element, a value to be copied into the destination vector register to indicate that source data element's value. One or more runs of the source data elements equal are encoded in the destination vector register as the predetermined compression value followed by a run length for that run. | 10-23-2014 |
20140365747 | SYSTEMS, APPARATUSES, AND METHODS FOR PERFORMING A HORIZONTAL PARTIAL SUM IN RESPONSE TO A SINGLE INSTRUCTION - Embodiments of systems, apparatuses, and methods for performing in a computer processor vector packed horizontal partial sum of packed data elements in response to a single vector packed horizontal sum instruction that includes a destination vector register operand, a source vector register operand, and an opcode are described. | 12-11-2014 |
20150026439 | APPARATUS AND METHOD FOR PERFORMING PERMUTE OPERATIONS - An apparatus and method are described for permuting data elements with masking. For example, a method according to one embodiment includes the following operations: reading values from a mask data structure to determine whether masking is implemented for each data element of a destination operand; if masking is not implemented for a particular data element, then selecting data elements from a first source operand and a second source operand based on index values stored in the destination operand to be copied to data element positions within the destination operand, wherein any one of the data elements from either the first source operand or the second source operand may be copied to any one of the data element positions within the destination operand; and if masking is implemented for a particular data element of the destination operand, then performing a designated masking operation with respect to that particular data element. | 01-22-2015 |
20150026440 | APPARATUS AND METHOD FOR PERFORMING A PERMUTE OPERATION - An apparatus and method are described for permuting data elements with masking. For example, a method according to one embodiment includes the following operations: reading values from a mask data structure to determine whether masking is implemented for each data element of a destination operand; if masking is not implemented for a particular data element, then selecting data elements from the destination operand and a second source operand based on index values stored in a first source operand to be copied to data element positions within the destination operand, wherein any one of the data elements from either the destination operand or the second source operand may be copied to any one of the data element positions within the destination operand; and if masking is implemented for a particular data element of the destination operand, then performing a designated masking operation with respect to that particular data element. | 01-22-2015 |
20150039851 | METHODS, APPARATUS, INSTRUCTIONS AND LOGIC TO PROVIDE VECTOR SUB-BYTE DECOMPRESSION FUNCTIONALITY - Methods, apparatus, instructions and logic provide SIMD vector sub-byte decompression functionality. Embodiments include shuffling a first and second byte into the least significant portion of a first vector element, and a third and fourth byte into the most significant portion. Processing continues shuffling a fifth and sixth byte into the least significant portion of a second vector element, and a seventh and eighth byte into the most significant portion. Then by shifting the first vector element by a first shift count and the second vector element by a second shift count, sub-byte elements are aligned to the least significant bits of their respective bytes. Processors then shuffle a byte from each of the shifted vector elements' least significant portions into byte positions of a destination vector element, and from each of the shifted vector elements' most significant portions into byte positions of another destination vector element. | 02-05-2015 |
20150046671 | METHODS, APPARATUS, INSTRUCTIONS AND LOGIC TO PROVIDE VECTOR POPULATION COUNT FUNCTIONALITY - Instructions and logic provide SIMD vector population count functionality. Some embodiments store in each data field of a portion of n data fields of a vector register or memory vector, a plurality of bits of data. In a processor, a SIMD instruction for a vector population count is executed, such that for that portion of the n data fields in the vector register or memory vector, the occurrences of binary values equal to each of a first one or more predetermined binary values, are counted and the counted occurrences are stored, in a portion of a destination register corresponding to the portion of the n data fields in the vector register or memory vector, as a first one or more counts corresponding to the first one or more predetermined binary values. | 02-12-2015 |
20150046672 | METHODS, APPARATUS, INSTRUCTIONS AND LOGIC TO PROVIDE POPULATION COUNT FUNCTIONALITY FOR GENOME SEQUENCING AND ALIGNMENT - Instructions and logic provide SIMD vector population count functionality. Some embodiments store in each data field of a portion of n data fields of a vector register or memory vector, at least two bits of data. In a processor, a SIMD instruction for a vector population count is executed, such that for that portion of the n data fields in the vector register or memory vector, the occurrences of binary values equal to each of a first one or more predetermined binary values, are counted and the counted occurrences are stored, in a portion of a destination register corresponding to the portion of the n data fields in the vector register or memory vector, as a first one or more counts corresponding to the first one or more predetermined binary values. | 02-12-2015 |
20150100760 | PROCESSOR TO PERFORM A BIT RANGE ISOLATION INSTRUCTION - Receiving an instruction indicating a source operand and a destination operand. Storing a result in the destination operand in response to the instruction. The result operand may have: (1) first range of bits having a first end explicitly specified by the instruction in which each bit is identical in value to a bit of the source operand in a corresponding position; and (2) second range of bits that all have a same value regardless of values of bits of the source operand in corresponding positions. Execution of instruction may complete without moving the first range of the result relative to the bits of identical value in the corresponding positions of the source operand, regardless of the location of the first range of bits in the result. Execution units to execute such instructions, computer systems having processors to execute such instructions, and machine-readable medium storing such an instruction are also disclosed. | 04-09-2015 |
20150100761 | SYSTEM-ON-CHIP (SoC) TO PERFORM A BIT RANGE ISOLATION INSTRUCTION - Receiving an instruction indicating a source operand and a destination operand. Storing a result in the destination operand in response to the instruction. The result operand may have: (1) first range of bits having a first end explicitly specified by the instruction in which each bit is identical in value to a bit of the source operand in a corresponding position; and (2) second range of bits that all have a same value regardless of values of bits of the source operand in corresponding positions. Execution of instruction may complete without moving the first range of the result relative to the bits of identical value in the corresponding positions of the source operand, regardless of the location of the first range of bits in the result. Execution units to execute such instructions, computer systems having processors to execute such instructions, and machine-readable medium storing such an instruction are also disclosed. | 04-09-2015 |
20150143084 | HAND HELD DEVICE TO PERFORM A BIT RANGE ISOLATION INSTRUCTION - Receiving an instruction indicating a source operand and a destination operand. Storing a result in the destination operand in response to the instruction. The result operand may have: (1) first range of bits having a first end explicitly specified by the instruction in which each bit is identical in value to a bit of the source operand in a corresponding position; and (2) second range of bits that all have a same value regardless of values of bits of the source operand in corresponding positions. Execution of instruction may complete without moving the first range of the result relative to the bits of identical value in the corresponding positions of the source operand, regardless of the location of the first range of bits in the result. Execution units to execute such instructions, computer systems having processors to execute such instructions, and machine-readable medium storing such an instruction are also disclosed. | 05-21-2015 |
20150186136 | SYSTEMS, APPARATUSES, AND METHODS FOR EXPAND AND COMPRESS - Systems, methods, and apparatuses for expanding and compressing vectors are described. In some embodiments, logic is to execute a vector expand (VPEXPANDBIT) instruction to determine from each packed data element of the second source operand every bit position that has been set to indicate that a bit of data from a corresponding packed data element of the first source operand is to be written into a corresponding bit position in a packed data element of the destination operand, wherein the bits of data to be written in the destination packed data element are consecutive bits from the packed data element of the first source operand, and to store consecutive bit values from each packed data element of the first source at the identified bit positions. | 07-02-2015 |
20150186137 | SYSTEMS, APPARATUSES, AND METHODS FOR VECTOR BIT TEST - Systems, methods, and apparatuses for vector bit test are described. In some embodiments, a vector bit test instruction is executed to shift each packed data element of a first source by a number of bits indicated by a corresponding packed data element of a second source, and store consecutive bit values from each packed data element of the first source at the identified bit positions of a corresponding packed data element of a destination. | 07-02-2015 |
20150261534 | PACKED TWO SOURCE INTER-ELEMENT SHIFT MERGE PROCESSORS, METHODS, SYSTEMS, AND INSTRUCTIONS - A processor includes a decoder to receive an instruction that indicates first and second source packed data operands and at least one shift count. An execution unit is operable, in response to the instruction, to store a result packed data operand. Each result data element includes a first least significant bit (LSB) portion of a first data element of a corresponding pair of data elements in a most significant bit (MSB) portion, and a second MSB portion of a second data element of the corresponding pair in a LSB portion. One of the first LSB portion of the first data element and the second MSB portion of the second data element has a corresponding shift count number of bits. The other has a number of bits equal to a size of a data element of the first source packed data minus the corresponding shift count. | 09-17-2015 |
20160092226 | Systems, Apparatuses, and Methods for Zeroing of Bits in a Data Element - Embodiments of systems, methods and apparatuses for executing a VPBZHI instruction are described. The execution of a VPBZHI instruction causes, on a per data element basis of a second source, a zeroing of bits higher (more significant) than a starting point in the data element. The starting point is defined by the contents of a data element in a first source. The resultant data elements are stored in a corresponding data element position of a destination. | 03-31-2016 |
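
The per-element zeroing described in the last entry can be illustrated with a small software model. This is only a sketch under stated assumptions, not the patented implementation: the function name zero_high_bits, the 32-bit default element width, and the choice to clear the bit at the starting position itself (as the scalar x86 BZHI instruction does) are all assumptions, since the abstract does not fix them.

```python
def zero_high_bits(second_source, first_source, width=32):
    """Software model of a per-element "zero high bits" operation.

    second_source: list of data elements to be masked.
    first_source:  list of starting points, one per data element.
    width:         element width in bits (assumed; not given in the abstract).

    Assumption: bits at positions >= the starting point are cleared,
    mirroring scalar BZHI. A starting point >= width leaves the element
    unchanged.
    """
    element_mask = (1 << width) - 1
    destination = []
    for value, start in zip(second_source, first_source):
        start = min(start, width)      # clamp oversized starting points
        keep = (1 << start) - 1        # ones in bit positions [0, start)
        destination.append(value & keep & element_mask)
    return destination


if __name__ == "__main__":
    # Keep the low 4 bits of the first element and the low 8 of the second.
    print([hex(x) for x in zero_high_bits([0xFF, 0xABCD], [4, 8])])
```

Each destination element depends only on the source elements at the same position, which is what "on a per data element basis" suggests; a hardware implementation would presumably process all elements in parallel rather than in a loop.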