Entries |
Document | Title | Date |
20080282058 | Message queuing system for parallel integrated circuit architecture and related method of operation - An integrated circuit comprises an external memory, a plurality of parallel connected Vector Processing Engines (VPEs), and an External Memory Unit (EMU) providing a data transfer path between the VPEs and the external memory. Each VPE contains a plurality of data processing units and a message queuing system adapted to transfer messages between the data processing units and other components of the integrated circuit. | 11-13-2008 |
20080301402 | Method and System for Stealing Interrupt Vectors - A system for stealing interrupt vectors from an operating system. Custom interrupt handler extensions are copied into an allocated block of memory from a kernel module. Also, operating system interrupt handlers are copied into a reserved space in the allocated block of memory from an interrupt vector memory location. In response to copying the operating system interrupt handlers into the reserved space in the allocated block of memory, custom interrupt handlers from the kernel module are copied over the operating system interrupt handlers in the interrupt vector memory location. The custom interrupt handlers after being copied into the interrupt vector memory location handle all interrupts received by the operating system. | 12-04-2008 |
20090106527 | Scalar Precision Float Implementation on the "W" Lane of Vector Unit - Embodiments of the invention are generally related to image processing, and more specifically to vector units for supporting image processing. A combined vector/scalar unit is provided wherein one or more processing lanes of the vector unit are used for performing scalar operations. An integrated register file is also provided for storing vector and scalar data. Therefore, the transfer of data to memory to exchange data between independent vector and scalar units is obviated and a significant amount of chip area is saved. | 04-23-2009 |
20090300323 | Vector Processor System - A vector processing system provides high performance vector processing using a System-On-a-Chip (SOC) implementation technique. One or more scalar processors (or cores) operate in conjunction with a vector processor, and the processors collectively share access to a plurality of memory interfaces coupled to Dynamic Random Access read/write Memories (DRAMs). In typical embodiments the vector processor operates as a slave to the scalar processors, executing computationally intensive Single Instruction Multiple Data (SIMD) codes in response to commands received from the scalar processors. The vector processor implements a vector processing Instruction Set Architecture (ISA) including machine state, instruction set, exception model, and memory model. | 12-03-2009 |
20100064115 | VECTOR PROCESSING UNIT - It is an object to speed up a vector store instruction on a memory that is divided into banks as setting a plurality of elements as a unit while minimizing an increase in physical quantity. A vector processing apparatus has a plurality of register banks and processes a data string including a plurality of data elements retained in the plurality of register banks, wherein: the plurality of register banks each have a read pointer | 03-11-2010 |
20100106940 | Processing Unit With Operand Vector Multiplexer Sequence Control - Operand vector multiplexer sequence control is used in a vector-based execution unit to control the shuffling of data elements in operand vectors used by a sequence of vector instructions processed by the vector-based execution unit. A swizzle sequence instruction is defined in an instruction set for the vector-based execution unit and is used to selectively apply a sequence of vector data element shuffle orders to one or more operand vectors to be used by the associated sequence of vector instructions. As a result, when a common sequence of data element shuffle orders is used frequently for a sequence of vector instructions, a single swizzle sequence instruction may be used to select the desired sequence of custom data element ordering for each of the vector instructions in the sequence. | 04-29-2010 |
20100115233 | DYNAMICALLY-SELECTABLE VECTOR REGISTER PARTITIONING - The present invention is directed generally to dynamically-selectable vector register partitioning, and more specifically to a processor infrastructure (e.g., co-processor infrastructure in a multi-processor system) that supports dynamic setting of vector register partitioning to any of a plurality of different vector partitioning modes. Thus, rather than being restricted to a fixed vector register partitioning mode, embodiments of the present invention enable a processor to be dynamically set to any of a plurality of different vector partitioning modes. Thus, for instance, different vector register partitioning modes may be employed for different applications being executed by the processor, and/or different vector register partitioning modes may even be employed for use in processing different vector oriented operations within a given applications being executed by the processor, in accordance with certain embodiments of the present invention. | 05-06-2010 |
20100115234 | CONFIGURABLE VECTOR LENGTH COMPUTER PROCESSOR - A processor core, comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core. | 05-06-2010 |
20100138631 | Process for QR Transformation using a CORDIC Processor - A CORDIC processor has a plurality of stages, each of the stages having a X input, Y input, a sign input, a sign output, an X output, a Y output, a mode control input having a ROTATE or VECTOR value, and a stage number k input, each CORDIC stage having a first shift generating an output by shifting the Y input k times, a second shift generating an output by shifting X input k times, a multiplexer having an output coupled to the sign input when the mode control input is ROTATE and to the sign of the Y input when the mode input is VECTOR, a first multiplier forming the product of the first shift output and the multiplexer output, a second multiplier forming the product of the second shift output and an inverted the multiplexer output, a first adder forming the X output from the sum of the first multiplier output and the X input, and a second adder forming the Y output from the sum of the second multiplier output and the Y input. | 06-03-2010 |
20100138632 | Programmable CORDIC Processor with Stage Re-Use - A CORDIC processor has a plurality of stages, each of the stages having a X input, Y input, a sign input, a sign output, an X output, a Y output, a mode control input having a ROTATE or VECTOR value, and a stage number k input, each CORDIC stage having a first shift generating an output by shifting the Y input k times, a second shift generating an output by shifting X input k times, a multiplexer having an output coupled to the sign input when the mode control input is ROTATE and to the sign of the Y input when the mode input is VECTOR, a first multiplier forming the product of the first shift output and the multiplexer output, a second multiplier forming the product of the second shift output and an inverted the multiplexer output, a first adder forming the X output from the sum of the first multiplier output and the X input, and a second adder forming the Y output from the sum of the second multiplier output and the Y input. | 06-03-2010 |
20100318764 | SYSTEM AND METHOD FOR MANAGING PROCESSOR-IN-MEMORY (PIM) OPERATIONS - A system and method of compiling program code, wherein the program code includes an operation on an array of data elements stored in memory of a computer system. The program code is scanned for operations that are vectorizable. The vectorizable operations are examined to determine whether they should be executed at least in part in a vector atomic memory operation (AMO) functional unit attached to memory. If so, the compiled code includes vector AMO instructions. | 12-16-2010 |
20110035568 | SELECT FIRST AND SELECT LAST INSTRUCTIONS FOR PROCESSING VECTORS - The described embodiments include a processor that executes a vector instruction. The processor starts by receiving a vector instruction that uses a first input vector, a second input vector, and a control vector, and optionally a predicate vector as inputs, wherein each of the vectors includes N elements. The processor then executes the vector instruction. In the described embodiments, when executing the vector instruction, the processor determines a key element position. If the predicate vector is received, the key element position is a predetermined active element position in the predicate vector, otherwise, the key element position is in a predetermined element position. The processor then uses the key element position to copy a result value into a result variable. When copying the result value into the result variable, if an element in the key element position of the control vector contains a predetermined value, the processor copies a value from the key element position in the second input vector into the result variable. Otherwise, the processor copies a value from the key element position in the first input vector into the result variable. | 02-10-2011 |
20110055517 | METHOD AND STRUCTURE OF USING SIMD VECTOR ARCHITECTURES TO IMPLEMENT MATRIX MULTIPLICATION - A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=−1. | 03-03-2011 |
20110113217 | GENERATE PREDICTES INSTRUCTION FOR PROCESSING VECTORS - The described embodiments include a processor that executes a vector instruction. The processor starts by receiving a first input vector, a second input vector, and optionally receiving a predicate vector (each of which includes N elements) as inputs. The processor then executes the vector instruction. Executing the vector instruction causes the processor to generate a result vector. When generating the result vector, if the predicate vector was received, for each element of the result vector for which the corresponding element of the predicate vector is active, otherwise, for each element of the result vector, the processor determines elements that are to be set in the result vector based on values in elements in the first input vector and the second input vector. The processor then sets the determined elements of the result vector to a first predetermined value. | 05-12-2011 |
20110153980 | MULTI-STAGE RECONFIGURATION DEVICE AND RECONFIGURATION METHOD, LOGIC CIRCUIT CORRECTION DEVICE, AND RECONFIGURABLE MULTI-STAGE LOGIC CIRCUIT - To provide a device to reconfigure multi-level logic networks, which enable logic modification and reconfiguration of a multi-level logic network with small circuit area and low-power dissipation in a simple manner. For example, in the case of reconfiguring a multi-level logic network following logic modification for deleting an output vector F(b) of an objective logic function F(X) corresponding to an input vector b, unmodified pq elements are selected one by one from the nearest pq element E | 06-23-2011 |
20110219207 | RECONFIGURABLE PROCESSOR AND RECONFIGURABLE PROCESSING METHOD - A reconfigurable processor for efficiently performing a vector operation, and a method of controlling the reconfigurable processor are provided. The reconfigurable processor designates at least one of a plurality of processing elements as a vector lane based on vector lane configuration information, and allocates a vector operation to the designated vector lane. | 09-08-2011 |
20110314254 | METHOD FOR VECTOR PROCESSING - The present application relates to a method for processing data in a vector processor. The present application relates also to a vector processor for performing said method and a cellular communication device comprising said vector processor. The method for processing data in a vector processor comprises executing segmented operations on a segment of a vector for generating results, collecting the results of the segmented operations, and delivering the results in a result vector in such a way that subsequent operations remain processing in vector mode. | 12-22-2011 |
20110320765 | VARIABLE WIDTH VECTOR INSTRUCTION PROCESSOR - A computer processor, method, and computer program product for executing vector processing instructions on a variable width vector register file. An example embodiment is a computer processor that includes an instruction execution unit coupled to a variable width vector register file which contains a number of vector registers, the width of the vector registers is changeable during operation of the computer processor. | 12-29-2011 |
20120023308 | PARALLEL COMPARISON/SELECTION OPERATION APPARATUS, PROCESSOR, AND PARALLEL COMPARISON/SELECTION OPERATION METHOD - Provided is a parallel comparison/selection operation apparatus which efficiently executes a search for a maximum value or a search for a minimum value with an index. The parallel comparison/selection operation apparatus includes a vector comparison/selection unit | 01-26-2012 |
20120079233 | VECTOR LOGICAL REDUCTION OPERATION IMPLEMENTED ON A SEMICONDUCTOR CHIP - A semiconductor processor is described. The semiconductor processor includes logic circuitry to perform a logical reduction instruction. The logic circuitry has swizzle circuitry to swizzle a vector's elements so as to form a swizzle vector. The logic circuitry also has vector logic circuitry to perform a vector logic operation on said vector and said swizzle vector. | 03-29-2012 |
20120102299 | STALL PROPAGATION IN A PROCESSING SYSTEM WITH INTERSPERSED PROCESSORS AND COMMUNICATON ELEMENTS - A processing system includes processors and dynamically configurable communication elements (DCCs) coupled together in an interspersed arrangement. A source device may transfer a data item through an intermediate subset of the DCCs to a destination device. The source and destination devices may each correspond to different processors, DCCs, or input/output devices, or mixed combinations of these. In response to detecting a stall after the source device begins transfer of the data item to the destination device and prior to receipt of all of the data item at the destination device, a stalling device is operable to propagate stalling information through one or more of the intermediate subset towards the source device. In response to receiving the stalling information, at least one of the intermediate subset is operable to buffer all or part of the data item. | 04-26-2012 |
20120131308 | SYSTEM, DEVICE, AND METHOD FOR ON-THE-FLY PERMUTATIONS OF VECTOR MEMORIES FOR EXECUTING INTRA-VECTOR OPERATIONS - A device system and method for processing program instructions, for example, to execute intra vector operations. A fetch unit may receive a program instruction defining different operations on data elements stored at the same vector memory address. A processor may include different types of execution units each executing a different one of a predetermined plurality of elemental instructions. Each program instruction may be a combination of one or more of the elemental instructions. The processor may receive a vector of data elements stored non-consecutively at the same vector memory address to be processed by a same one of the elemental instructions and a vector of configuration values independently associated with executing the same elemental instruction on the non-consecutive data elements. At least two configuration values may be different to implement different operations by executing the same elemental instruction using the different configuration values on the vector of non-consecutive data elements. | 05-24-2012 |
20120166761 | VECTOR CONFLICT INSTRUCTIONS - A processing core implemented on a semiconductor chip is described having first execution unit logic circuitry that includes first comparison circuitry to compare each element in a first input vector against every element of a second input vector. The processing core also has second execution logic circuitry that includes second comparison circuitry to compare a first input value against every data element of an input vector. | 06-28-2012 |
20120216011 | APPARATUS AND METHOD OF SINGLE-INSTRUCTION, MULTIPLE-DATA VECTOR OPERATION MASKING - An apparatus, method, and medium for performing a vector operation on portions of one or more source vector registers. A vector unit performs an operation on the source vector registers and only stores results in the target vector register for elements which are selected by the vector operation mask. The vector operation mask can be read by the vector unit or loaded into the vector unit for each instruction cycle. The vector operation mask allows the vector unit to be used with partially filled source vector registers and eliminates the need for scalar operations to be performed on vector data. | 08-23-2012 |
20120221830 | CONFIGURABLE VECTOR LENGTH COMPUTER PROCESSOR - A processor core, comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core. | 08-30-2012 |
20120284487 | Vector Slot Processor Execution Unit for High Speed Streaming Inputs - A vector slot processor that is capable of supporting multiple signal processing operations for multiple demodulation standards is provided. The vector slot processor includes a plurality of micro execution slot (MES) that performs the multiple signal processing operations on the high speed streaming inputs. Each of the MES includes one or more n-way signal registers that receive the high speed streaming inputs, one or more n-way coefficient registers that store filter coefficients for the multiple signal processing, and one or more n-way Multiply and Accumulate (MAC) units that receive the high speed streaming inputs from the one or more n-way signal registers and filter coefficients from one or more n-way coefficient registers. The one or more n-way MAC units perform a vertical MAC operation and a horizontal multiply and add operation on the high speed streaming inputs. | 11-08-2012 |
20130024654 | VECTOR OPERATIONS FOR COMPRESSING SELECTED VECTOR ELEMENTS - A processor, method, and medium for using vector operations to compress selected elements of a vector. An input vector is compared to a criteria vector, and then a subset of the plurality of elements of the input vector are selected based on the comparison. A permutation vector is generated based on the locations of the selected elements and then the permutation vector is used to permute the selected elements of the input vector to an output vector. The selected elements of the input vector are stored in contiguous locations in the leftmost elements of the output vector. Then, the output vector is stored to memory and a pointer to the memory location is incremented by the number of selected elements. | 01-24-2013 |
20130024655 | PROCESSING VECTORS USING WRAPPING INCREMENT AND DECREMENT INSTRUCTIONS IN THE MACROSCALAR ARCHITECTURE - Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a fixed-value addition operation dependent upon the input vector and the control vector. | 01-24-2013 |
20130024656 | PROCESSING VECTORS USING WRAPPING BOOLEAN INSTRUCTIONS IN THE MACROSCALAR ARCHITECTURE - Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a Boolean operation on another input vector dependent upon the input vector and the control vector. | 01-24-2013 |
20130036293 | PROCESSING VECTORS USING WRAPPING MINIMA AND MAXIMA INSTRUCTIONS IN THE MACROSCALAR ARCHITECTURE - Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a minima or maxima operation on another input vector dependent upon the input vector and the control vector. | 02-07-2013 |
20130067196 | VECTORIZATION OF MACHINE LEVEL SCALAR INSTRUCTIONS IN A COMPUTER PROGRAM DURING EXECUTION OF THE COMPUTER PROGRAM - A method of operating a computer processor includes storing at least one machine level vector instruction in a memory and replacing a plurality of machine level scalar instructions in a computer program with the at least one machine level vector instruction during execution of the computer program based on execution addresses associated with the plurality of machine level scalar instructions and/or instruction opcodes associated with the plurality of machine level scalar instructions. | 03-14-2013 |
20130132707 | SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR ASSIGNING ELEMENTS OF A MATRIX TO PROCESSING THREADS WITH INCREASED CONTIGUOUSNESS - A system, method, and computer program product are provided for assigning elements of a matrix to processing threads. In use, a matrix is received to be processed by a parallel processing architecture. Such parallel processing architecture includes a plurality of processors each capable of processing a plurality of threads. Elements of the matrix are assigned to each of the threads for processing, utilizing an algorithm that increases a contiguousness of the elements being processed by each thread. | 05-23-2013 |
20130159666 | REDUCING ISSUE-TO-ISSUE LATENCY BY REVERSING PROCESSING ORDER IN HALF-PUMPED SIMD EXECUTION UNITS - Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, and execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between and output and an input of the execution core unit and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction. | 06-20-2013 |
20130159667 | Vector Size Agnostic Single Instruction Multiple Data (SIMD) Processor Architecture - A computer has a memory adapted to store a first plurality of instructions encoded with a first vector size and a second plurality of instructions encoded with a second vector size. An execution unit executes the first plurality of instructions and the second plurality of instructions by processing vector units in a uniform manner regardless of vector size. | 06-20-2013 |
20130159668 | PREDECODE LOGIC FOR AUTOVECTORIZING SCALAR INSTRUCTIONS IN AN INSTRUCTION BUFFER - A circuit arrangement, method, and program product for substituting a plurality of scalar instructions in an instruction stream with a functionally equivalent vector instruction for execution by a vector execution unit. Predecode logic is coupled to an instruction buffer which stores instructions in an instruction stream to be executed by the vector execution unit. The predecode logic analyzes the instructions passing through the instruction buffer to identify a plurality of scalar instructions that may be replaced by a vector instruction in the instruction stream. The predecode logic may generate the functionally equivalent vector instruction based on the plurality of scalar instructions, and the functionally equivalent vector instruction may be substituted into the instruction stream, such that the vector execution unit executes the vector instruction in lieu of the plurality of scalar instructions. | 06-20-2013 |
20130185540 | PROCESSOR WITH MULTI-LEVEL LOOPING VECTOR COPROCESSOR - A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core includes a program memory interface through which the scalar processor retrieves instructions from a program memory. The instructions include scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core. The vector coprocessor core includes a plurality of execution units and a vector command buffer. The vector command buffer is configured to decode vector instructions passed by the scalar processor core, to determine whether vector instructions defining an instruction loop have been decoded, and to initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded. | 07-18-2013 |
20130290672 | APPARATUS AND METHOD OF MASK PERMUTE INSTRUCTIONS - An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector routing element circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths. | 10-31-2013 |
20130297908 | Decomposing Operations in More than One Dimension into One Dimensional Point Operations - A processing architecture uses stationary operands and opcodes common on a plurality of processors. Only data moves through the processors. The same opcode and operand is used by each processor assigned to operate, for example, on one row of pixels, one row of numbers, or one row of points in space. | 11-07-2013 |
20130332701 | APPARATUS AND METHOD FOR SELECTING ELEMENTS OF A VECTOR COMPUTATION - An apparatus and method are described for selecting elements to be used in a vector computation. For example, a method according to one embodiment includes the following operations: specifying whether to identify the first, last or next after last active element of an input mask register using an immediate value; identifying the first, last or next after last active element in the input mask register according to the immediate value; reading a value from an input vector register corresponding to the identified first, last or next after last active element in the input mask register; and writing the value to an output vector register. | 12-12-2013 |
20130339661 | EFFICIENT ZERO-BASED DECOMPRESSION - A processor core including a hardware decode unit to decode vector instructions for decompressing a run length encoded (RLE) set of source data elements and an execution unit to execute the decoded instructions. The execution unit generates a first mask by comparing set of source data elements with a set of zeros and then counts the trailing zeros in the mask. A second mask is made based on the count of trailing zeros. The execution unit then copies the set of source data elements to a buffer using the second mask and then reads the number of RLE zeros from the set of source data elements. The buffer is shifted and copied to a result and the set of source data elements is shifted to the right. If more valid data elements are in the set of source data elements this is repeated until all valid data is processed. | 12-19-2013 |
20140032877 | APPARATUS AND METHOD FOR AN INSTRUCTION THAT DETERMINES WHETHER A VALUE IS WITHIN A RANGE - A method is described that includes performing the following with a single instruction: receiving a first input operand V; receiving a second input operand S; calculating V−S; determining if V−S is positive or negative; and, providing as a resultant: V if V−S is negative; V−S if V−S is positive. | 01-30-2014 |
20140052959 | Experimental engineering optimization algorithm at point of performance - A method is provided for reducing the data set used in creating an optimization algorithm, thus to permit the use of microprocessors, that in turn permits embedding the optimization algorithm at the point of performance, in which a subset of data points in a performance window is used to derive a vector that is utilized to create an initial optimization algorithm. | 02-20-2014 |
20140059323 | SYSTEMS AND METHODS OF DATA EXTRACTION IN A VECTOR PROCESSOR - Systems and methods of data extraction in a vector processor are disclosed. In a particular embodiment a method of data extraction in a vector processor includes copying at least one data element to a source register of a permutation network. The method includes reordering multiple data elements of the source register, populating a destination register of the permutation network with the reordered data elements, and copying the reordered data elements from the destination register to a memory. | 02-27-2014 |
20140068226 | VECTOR INSTRUCTIONS TO ENABLE EFFICIENT SYNCHRONIZATION AND PARALLEL REDUCTION OPERATIONS - In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to be executed under a first mask and a second vector instruction to be executed under a second mask. Other embodiments are described and claimed. | 03-06-2014 |
20140075153 | REDUCING ISSUE-TO-ISSUE LATENCY BY REVERSING PROCESSING ORDER IN HALF-PUMPED SIMD EXECUTION UNITS - Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, and execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between and output and an input of the execution core unit and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction. | 03-13-2014 |
20140122832 | PARTIAL VECTORIZATION COMPILATION SYSTEM - Generally, this disclosure provides technologies for generating and executing partially vectorized code that may include backward dependencies within a loop body of the code to be vectorized. The method may include identifying backward dependencies within a loop body of the code; selecting one or more ranges of iterations within the loop body, wherein the selected ranges exclude the identified backward dependencies; and vectorizing the selected ranges. The system may include a vector processor configured to provide predicated vector instruction execution, loop iteration range enabling, and dynamic loop dependence checking. | 05-01-2014 |
20140129803 | MULTI-MAGNITUDINAL VECTORS WITH RESOLUTION BASED ON SOURCE VECTOR FEATURES - Methods, systems and computer program products for resolving multiple magnitudes assigned to a target vector are disclosed. A target vector that includes one or more target vector dimensions is received. One of the target vector dimensions is processed to determine a total number of magnitudes assigned to the processed target vector dimension. Also, a source vector that includes one or more source vector dimensions is received. The received source vector is processed to determine a total number of features associated with the source vector. When it is detected that the total number of magnitudes assigned to the processed target vector dimension exceeds one, one of the assigned magnitudes is selected based on one of the determined features associated with the source vector. | 05-08-2014 |
20140181466 | PROCESSORS HAVING FULLY-CONNECTED INTERCONNECTS SHARED BY VECTOR CONFLICT INSTRUCTIONS AND PERMUTE INSTRUCTIONS - An apparatus includes a decode unit to decode a permute instruction and a vector conflict instruction. A vector execution unit is coupled with the decode unit and includes a fully-connected interconnect. The fully-connected interconnect has at least four inputs to receive at least four corresponding data elements of at least one source vector. The fully-connected interconnect has at least four outputs. Each of the at least four inputs is coupled with each of the at least four outputs. The execution unit also includes a permute instruction execution logic coupled with the at least four outputs and operable to store a first vector result in response to the permute instruction. The execution unit also includes a vector conflict instruction execution logic coupled with the at least four outputs and operable to store a second vector result in a destination storage location in response to the vector conflict instruction. | 06-26-2014 |
20140189289 | INSTRUCTION FOR ACCELERATING SNOW 3G WIRELESS SECURITY ALGORITHM - Vector instructions for performing SNOW 3G wireless security operations are received and executed by the execution circuitry of a processor. The execution circuitry receives a first operand of the first instruction specifying a first vector register that stores a current state of a finite state machine (FSM). The execution circuitry also receives a second operand of the first instruction specifying a second vector register that stores data elements of a liner feedback shift register (LFSR) that are needed for updating the FSM. The execution circuitry executes the first instruction to produce a updated state of the FSM and an output of the FSM in a destination operand of the first instruction. | 07-03-2014 |
20140189290 | INSTRUCTION FOR FAST ZUC ALGORITHM PROCESSING - Vector instructions for performing ZUC stream cipher operations are received and executed by the execution circuitry of a processor. The execution circuitry receives a first vector instruction to perform an update to a liner feedback shift register (LFSR), and receives a second vector instruction to perform an update to a state of a finite state machine (FSM), where the FSM receives inputs from re-ordered bits of the LFSR. The execution circuitry executes the first vector instruction and the second vector instruction in a single-instruction multiple data (SIMD) pipeline. | 07-03-2014 |
20140189291 | Method And Apparatus For Integral Image Computation Instructions - A method is described that performing an image integral calculation by creating a second vector and creating a third vector. The second vector is created by executing a first instruction that adds alternating elements of a first vector to respective neighboring elements of the first vector and presents resulting summations into said second vector. The first instruction also passes through the respective neighboring elements to said second vector. The third vector is created by executing a second instruction that adds elements of one side of the second vector to an element of another side of the second vector and passes through the another side of the second vector. | 07-03-2014 |
20140189292 | Functional Unit Having Tree Structure To Support Vector Sorting Algorithm and Other Algorithms - An apparatus is described having a functional unit of an instruction execution pipeline. The functional unit has a plurality of compare-and-exchange circuits coupled to network circuitry to implement a vector sorting tree for a vector sorting instruction. Each of the compare-and-exchange circuits has a respective comparison circuit that compares a pair of inputs. Each of the compare-and-exchange circuits have a same sided first output for presenting a higher of the two inputs and a same sided second output for presenting a lower of the two inputs, said comparison circuit to also support said functional unit's execution of a prefix min and/or prefix add instruction. | 07-03-2014 |
20140189293 | Instructions for Sliding Window Encoding Algorithms - A processor is described having an instruction execution pipeline having a functional unit to execute an instruction that compares vector elements against an input value. Each of the vector elements and the input value have a first respective section identifying a location within data and a second respective section having a byte sequence of the data. The functional unit has comparison circuitry to compare respective byte sequences of the input vector elements against the input value's byte sequence to identify a number of matching bytes for each comparison. The functional unit also has difference circuitry to determine respective distances between the input vector ‘s elements’ byte sequences and the input value's byte sequence within the data. | 07-03-2014 |
20140189294 | SYSTEMS, APPARATUSES, AND METHODS FOR DETERMINING DATA ELEMENT EQUALITY OR SEQUENTIALITY - Systems, apparatuses, and methods of performing in a computer processor broadcasting data in response to a single vector packed broadcasting instruction that includes a source writemask register operand, a destination vector register operand, and an opcode. In some embodiments, the data of the source writemask register is zero extended prior to broadcasting. | 07-03-2014 |
20140195776 | MEMORY ACCESS FOR A VECTOR PROCESSOR - A method and device for memory access in processors is provided. A processor, comprising a plurality of computational units, is capable of executing a single instruction on multiple pieces of data simultaneously (SIMD). A read operation is initiated to load data from memory into the plurality of computational units (CUs) arranged into a plurality of CU groups. The memory is arranged into a plurality of memory macro-blocks each associated with a respective CU group of the plurality of CU groups. For each CU group a respective first memory address is determined and for each CU group, the data in the associated memory macro-block is accessed at the respective first memory address. | 07-10-2014 |
20140208066 | VECTOR GENERATE MASK INSTRUCTION - A Vector Generate Mask instruction. For each element in the first operand, a bit mask is generated. The mask includes bits set to a selected value starting at a position specified by a first field of the instruction and ending at a position specified by a second field of the instruction. | 07-24-2014 |
20140208067 | VECTOR ELEMENT ROTATE AND INSERT UNDER MASK INSTRUCTION - A Vector Element Rotate and Insert Under Mask instruction. Each element of a second operand of the instruction is rotated in a specified direction by a specified number of bits. For each bit in a third operand of the instruction that is set to one, the corresponding bit of the rotated elements in the second operand replaces the corresponding bit in a first operand of the instruction. | 07-24-2014 |
20140215182 | Persistent Relocatable Reset Vector for Processor - In an embodiment, an integrated circuit includes at least one processor. The processor may include a reset vector base address register configured to store a reset vector address for the processor. Responsive to a reset, the processor may be configured to capture a reset vector address on an input, updating the reset vector base address register. Upon release from reset, the processor may initiate instruction execution at the reset vector address. The integrated circuit may further include a logic circuit that is coupled to provide the reset vector address. The logic circuit may include a register that is programmable with the reset vector address. More particularly, in an embodiment, the register may be programmable via a write operation issued by the processor (e.g. a memory-mapped write operation). Accordingly, the reset vector address may be programmable in the integrated circuit, and may be changed from time to time. | 07-31-2014 |
20140244969 | List Vector Processing Apparatus, List Vector Processing Method, Storage Medium, Compiler, and Information Processing Apparatus - Disclosed is a list vector processing apparatus (LVPA) or the like which can process the indirect reference at a high speed. | 08-28-2014 |
20140244970 | DIGITAL SIGNAL PROCESSOR AND BASEBAND COMMUNICATION DEVICE - For increased efficiency, a digital signal processor comprises a vector execution unit arranged to execute instructions that are to be performed on multiple data in the form of a vector, comprising a vector controller arranged to determine if an instruction is a vector instruction and, if it is, inform a count register arranged to hold the vector length, said vector controller being further arranged receive an issue signal and control the execution of instructions based on this issue signal, said vector execution unit being characterized in that it comprises
| 08-28-2014 |
20140258677 | ANALYZING POTENTIAL BENEFITS OF VECTORIZATION - Embodiments of computer-implemented methods, systems, computing devices, and computer-readable media (transitory and non-transitory) are described herein for analyzing execution of a plurality of executable instructions and, based on the analysis, providing an indication of a benefit to be obtained by vectorization of at least a subset of the plurality of executable instructions. In various embodiments, the analysis may include identification of the subset of the plurality of executable instructions suitable for conversion to one or more single-instruction multiple-data (“SIMD”) instructions. | 09-11-2014 |
20140281370 | VECTOR PROCESSING ENGINES HAVING PROGRAMMABLE DATA PATH CONFIGURATIONS FOR PROVIDING MULTI-MODE VECTOR PROCESSING, AND RELATED VECTOR PROCESSORS, SYSTEMS, AND METHODS - Embodiments disclosed herein include vector processing engines (VPEs) having programmable data path configurations for providing multi-mode vector processing. Related vector processors, systems, and methods are also disclosed. The VPEs include a vector processing stage(s) configured to process vector data according to a vector instruction executed in the vector processing stage. Each vector processing stage includes vector processing blocks each configured to process vector data based on the vector instruction being executed. The vector processing blocks are capable of providing different vector operations for different types of vector instructions based on data path configurations. Data paths of the vector processing blocks are programmable to be reprogrammable to process vector data differently according to the particular vector instruction being executed. In this manner, a VPE can be provided with its data paths configuration programmable to execute different types of functions based on data path configuration according to the vector instruction being executed. | 09-18-2014 |
20140281371 | TECHNIQUES FOR ENABLING BIT-PARALLEL WIDE STRING MATCHING WITH A SIMD REGISTER - Various embodiments are generally directed to overcoming limitations of vector registers in their use with bit-parallel string matching algorithms. An apparatus includes a processor element; and logic to receive a pattern comprising a first string of elements to employ in a string matching operation, instantiate a test bitmask in a first vector register of the processor element, the first vector register comprising multiple lanes, copy bit values at MSB bit positions of the multiple lanes of the first vector register to a first vector mask as a vector value, bit-shift the vector value as a scalar value, bit-shift the first vector register, employ the vector value of the first vector mask to selectively fill LSB bit positions of lanes of a second vector register of the processor element; and OR the second vector register into the first vector register. Other embodiments are described and claimed. | 09-18-2014 |
20140281372 | VECTOR INDIRECT ELEMENT VERTICAL ADDRESSING MODE WITH HORIZONTAL PERMUTE - An example method for placing one or more element data values into an output vector includes identifying a vertical permute control vector including a plurality of elements, each element of the plurality of elements including a register address. The method also includes for each element of the plurality of elements, reading a register address from the vertical permute control vector. The method further includes retrieving a plurality of element data values based on the register address. The method also includes identifying a horizontal permute control vector including a set of addresses corresponding to an output vector. The method further includes placing at least some of the retrieved element data values of the plurality of element data values into the output vector based on the set of addresses in the horizontal permute control vector. | 09-18-2014 |
20140281373 | DIGITAL SIGNAL PROCESSOR AND BASEBAND COMMUNICATION DEVICE - A digital signal processor has a vector execution unit arranged to execute instructions on multiple data in the form of a vector, comprising a local queue arranged to receive instructions from a program memory and to hold them in the local queue until a predefined condition is fulfilled. The local queue being arranged to receive a sequence of instructions at a time from the program memory and to store the last N instructions, N being an integer. A vector controller in the vector execution unit comprises queue control means arranged to make the local queue repeat a sequence of M instructions stored in the local queue, M being an integer less than or equal to N, a number K of times. This reduces the time the vector execution unit is kept waiting because of IDLE commands in the program memory. | 09-18-2014 |
20140289495 | ENHANCED PREDICATE REGISTERS - Systems, apparatuses and methods for utilizing enhanced predicate registers which specify the element width and which elements are to be processed. The predicate size is dynamic, depending on the contents of the enhanced predicate register used for an instruction rather than being a static quality of a specific instruction. Specifying the element size in the enhanced predicate registers results in fewer instructions in an instruction set. | 09-25-2014 |
20140344549 | DIGITAL SIGNAL PROCESSOR AND BASEBAND COMMUNICATION DEVICE - The invention relates to a digital signal processor comprising a processor core, an integer execution unit and a number of vector execution units, said digital signal processor comprising a program memory arranged to hold instructions for the execution units and issue logic for issuing instructions. The digital signal processor comprises an issue control unit for selecting at least two execution units that are to receive and execute the same instruction at the same time, and logic for sending the instruction to said at least two execution units. | 11-20-2014 |
20140359252 | DIGITAL SIGNAL PROCESSOR - A multicore processor is achieved by a processor assembly, comprising a first processor having a first core and at least a first and a second unit, each being selected from the group of vector execution units, memory units and accelerators, said first core and first and second units being interconnected by a first network, and a second processor having a second core wherein the first core is arranged to enable the second core to control at least one of the units in the first processor. Each processors generally comprises a combination of execution units, memory units and accelerators, which may be controlled and/or accessed by units in the other processor. | 12-04-2014 |
20140372728 | VECTOR EXECUTION UNIT FOR DIGITAL SIGNAL PROCESSOR - A vector execution unit for use in a digital signal processor enables a new set of instructions. The unit comprises a first input port for receiving at least a first input data vector, an instruction decoder, a vector output port, and least one data-path. The instruction decoding unit is arranged to control the data-path to perform a comparison related to the first input data vector, and the processor comprises an integer port arranged to output the result of the comparison in the form of a decision vector to a memory unit or a functional unit in the digital signal processor. Alternatively or in addition, the integer port is also arranged to receive a decision vector of integer data, and the instruction decoding unit is arranged to control the data-path to process the first input data in dependence of the value of the integer data. | 12-18-2014 |
20150046673 | VECTOR PROCESSOR - A vector processor is disclosed including a variety of variable-length instructions. Computer-implemented methods are disclosed for efficiently carrying out a variety of operations in a time-conscious, memory-efficient, and power-efficient manner. Methods for more efficiently managing a buffer by controlling the threshold based on the length of delay line instructions are disclosed. Methods for disposing multi-type and multi-size operations in hardware are disclosed. Methods for condensing look-up tables are disclosed. Methods for in-line alteration of variables are disclosed. | 02-12-2015 |
20150046674 | LOW POWER COMPUTATIONAL IMAGING - The present application discloses a computing device that can provide a low-power, highly capable computing platform for computational imaging. The computing device can include one or more processing units, for example one or more vector processors and one or more hardware accelerators, an intelligent memory fabric, a peripheral device, and a power management module. The computing device can communicate with external devices, such as one or more image sensors, an accelerometer, a gyroscope, or any other suitable sensor devices. | 02-12-2015 |
20150046675 | APPARATUS, SYSTEMS, AND METHODS FOR LOW POWER COMPUTATIONAL IMAGING - The present application discloses a computing device that can provide a low-power, highly capable computing platform for computational imaging. The computing device can include one or more processing units, for example one or more vector processors and one or more hardware accelerators, an intelligent memory fabric, a peripheral device, and a power management module. The computing device can communicate with external devices, such as one or more image sensors, an accelerometer, a gyroscope, or any other suitable sensor devices. | 02-12-2015 |
20150074373 | SCATTER USING INDEX ARRAY AND FINITE STATE MACHINE - Methods and apparatus are disclosed using an index array and finite state machine for scatter/gather operations. Embodiment of apparatus may comprise: decode logic to decode scatter/gather instructions and generate micro-operations. An index array holds a set of indices and a corresponding set of mask elements. A finite state machine facilitates the scatter operation. Address generation logic generates an address from an index of the set of indices for at least each of the corresponding mask elements having a first value. Storage is allocated in a buffer for each of the set of addresses being generated. Data elements corresponding to the set of addresses being generated are copied to the buffer. Addresses from the set are accessed to store data elements if a corresponding mask element has said first value and the mask element is changed to a second value responsive to completion of their respective stores. | 03-12-2015 |
20150074374 | METHOD AND APPARATUS FOR ASYNCHRONOUS PROCESSOR WITH AUXILIARY ASYNCHRONOUS VECTOR PROCESSOR - An asynchronous processing system comprising an asynchronous scalar processor and an asynchronous vector processor coupled to the scalar processor. The asynchronous scalar processor is configured to perform processing functions on input data and to output instructions. The asynchronous vector processor is configured to perform processing functions in response to a very long instruction word (VLIW) received from the scalar processor. The VLIW comprises a first portion and a second portion, at least the first portion comprising a vector instruction. | 03-12-2015 |
20150089190 | Predicate Attribute Tracker - In an embodiment, a processor includes a register attribute tracker configured to track one or more attributes corresponding to registers. The register attribute tracker may track the attributes associated with the registers when those registers are used as output registers of instructions that explicitly define the attributes and, if the register attribute tracker has a tracked attribute associated with an input register of an instruction that does not explicitly define the attribute, the register attribute tracker may annotate the instruction with an attribute and/or associate an attribute with the output register of the instruction in the register attribute tracker. | 03-26-2015 |
20150089191 | Early Issue of Null-Predicated Operations - In an embodiment, a processor includes an issue circuit configured to issue instruction operations for execution. The issue circuit may be configured to monitor the source operands of the instruction operations, and to issue instruction operations for which the source operands (including predicate operands, as appropriate) are resolved. Additionally, the issue circuit may be configured to detect a null predicate that indicates that none of the vector elements will be modified by a corresponding instruction operation. The issue circuit may be configured to issue the corresponding instruction operation with the null predicate even if other source operands are not yet resolved. | 03-26-2015 |
20150089192 | Dynamic Attribute Inference - In an embodiment, a processor may be configured to dynamically infer one or more attributes of input and/or output registers of an instruction, given the attributes corresponding to at least one input registers. The inference may be made at the issue circuit/stage of the processor, for those registers that do not have attribute information at the issue circuit/stage. In an embodiment, the processor may also include a register attribute tracker configured to track attributes of registers prior to the issue stage of the processor pipeline. The processor may feed back, to the register attribute tracker, inferred attributes and the register addresses of the registers to which the inferred attributes apply. The register attribute tracker may be configured to may associate the inferred attribute with the identified register attribute tracker may also be configured to infer input register attributes from other input register attributes. | 03-26-2015 |
20150100755 | DATA PROCESSING APPARATUS AND METHOD FOR CONTROLLING PERFORMANCE OF SPECULATIVE VECTOR OPERATIONS - A data processing apparatus and a method of controlling performance of speculative vector operations are provided. The apparatus comprises processing circuitry for performing a sequence of speculative vector operations on vector operands, each vector operand comprising a plurality of vector elements, and speculation control circuitry for maintaining a speculation width indication indicating the number of vector elements of each vector operand to be subjected to the speculative vector operations. The speculation width indication is set to an initial value prior to performance of the sequence of speculative vector operations. The processing circuitry generates progress indications during performance of the sequence of speculative vector operations, and the speculation control circuitry detects, with reference to the progress indications and speculation reduction criteria, presence of a speculation reduction condition. The speculation reduction condition is a condition indicating that a reduction in the speculation width indication is expected to improve at least one performance characteristic of the data processing apparatus relative to continued operation without the reduction in the speculation width indication. The speculation control circuitry is responsive to detection of the speculation reduction condition to reduce the speculation width indication. This can significantly increase performance (for example in terms of throughput and/or energy consumption) when performing speculative vector operations. | 04-09-2015 |
20150113246 | INSTRUCTION AND LOGIC TO PROVIDE CONVERSIONS BETWEEN A MASK REGISTER AND A GENERAL PURPOSE REGISTER OR MEMORY - Instructions and logic provide conversions between a mask register and a general purpose register or memory. Some embodiments, responsive to an instruction specifying: a destination operand, a mask length corresponding to a number of mask data fields, and a source operand; values are read from data fields in the source operand, corresponding to the specified mask length, and stored to corresponding data fields in the destination operand specified by the instruction, wherein one of the source or the destination operands is a mask register. Values indicative of masked vector elements may be stored to any data fields in the destination operand other than the number of data fields corresponding to the specified mask length. For some embodiments, the other one of the source or the destination operands may be a general purpose register or a memory location. | 04-23-2015 |
20150121036 | CRYPTOGRAPHIC SUPPORT INSTRUCTIONS - A data processing system | 04-30-2015 |
20150143076 | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING DESPREADING CIRCUITRY IN DATA FLOW PATHS BETWEEN EXECUTION UNITS AND VECTOR DATA MEMORY TO PROVIDE IN-FLIGHT DESPREADING OF SPREAD-SPECTRUM SEQUENCES, AND RELATED VECTOR PROCESSING INSTRUCTIONS, SYSTEMS, AND METHODS - Vector processing engines (VPEs) employing merging circuitry in data flow paths between execution units and vector data memory to provide in-flight merging of output vector data stored to vector data memory are disclosed. Related vector processing instructions, systems, and methods are also disclosed. Merging circuitry is provided in data flow paths between execution units and vector data memory in the VPE. The merging circuitry is configured to merge an output vector data sample set from execution units as a result of performing vector processing operations in-flight while the output vector data sample set is being provided over the output data flow paths from the execution units to the vector data memory to be stored. The merged output vector data sample set is stored in a merged form in the vector data memory without requiring additional post-processing steps, which may delay subsequent vector processing operations to be performed in execution units. | 05-21-2015 |
20150143077 | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING MERGING CIRCUITRY IN DATA FLOW PATHS BETWEEN EXECUTION UNITS AND VECTOR DATA MEMORY TO PROVIDE IN-FLIGHT MERGING OF OUTPUT VECTOR DATA STORED TO VECTOR DATA MEMORY, AND RELATED VECTOR PROCESSING INSTRUCTIONS, SYSTEMS, AND METHODS - Vector processing engines (VPEs) employing merging circuitry in data flow paths between execution units and vector data memory to provide in-flight merging of output vector data stored to vector data memory are disclosed. Related vector processing instructions, systems, and methods are also disclosed. Merging circuitry is provided in data flow paths between execution units and vector data memory in the VPE. The merging circuitry is configured to merge an output vector data sample set from execution units as a result of performing vector processing operations in-flight while the output vector data sample set is being provided over the output data flow paths from the execution units to the vector data memory to be stored. The merged output vector data sample set is stored in a merged form in the vector data memory without requiring additional post-processing steps, which may delay subsequent vector processing operations to be performed in execution units. | 05-21-2015 |
20150143078 | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING A TAPPED-DELAY LINE(S) FOR PROVIDING PRECISION FILTER VECTOR PROCESSING OPERATIONS WITH REDUCED SAMPLE RE-FETCHING AND POWER CONSUMPTION, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS - Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision filter vector processing operations with reduced sample re-fetching and power consumption are disclosed. Related vector processor systems and methods are also disclosed. The VPEs are configured to provide filter vector processing operations. To minimize re-fetching of input vector data samples from memory to reduce power consumption, a tapped-delay line(s) is included in the data flow paths between a vector data file and execution units in the VPE. The tapped-delay line(s) is configured to receive and provide input vector data sample sets to execution units for performing filter vector processing operations. The tapped-delay line(s) is also configured to shift the input vector data sample set for filter delay taps and provide the shifted input vector data sample set to execution units, so the shifted input vector data sample set does not have to be re-fetched during filter vector processing operations. | 05-21-2015 |
20150143079 | VECTOR PROCESSING ENGINES (VPEs) EMPLOYING TAPPED-DELAY LINE(S) FOR PROVIDING PRECISION CORRELATION / COVARIANCE VECTOR PROCESSING OPERATIONS WITH REDUCED SAMPLE RE-FETCHING AND POWER CONSUMPTION, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS - Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision correlation/covariance vector processing operations with reduced sample re-fetching and/or power consumption are disclosed. The VPEs disclosed herein are configured to provide correlation/covariance vector processing operations, such as code division multiple access (CDMA) correlation/covariance vector processing operations as a non-limiting example. A tapped-delay line(s) is included in the data flow paths between memory and execution units in the VPE. The tapped-delay line (s) is configured to receive and provide an input vector data sample set to execution units for performing correlation/covariance vector processing operations. The tapped-delay line(s) is also configured to shift the input vector data sample set for each filter delay tap and provide the shifted input vector data sample set to the execution units, so the shifted input vector data sample set need not be re-fetched from the vector data file during the filter vector processing operations. | 05-21-2015 |
20150143080 | VECTOR CHECKSUM INSTRUCTION - Elements from a second operand are added together one-by-one to obtain a first result. The adding includes performing one or more end around carry add operations. The first result is placed in an element of a first operand of the instruction. After each addition of an element, a carry out of a chosen position of the sum, if any, is added to a selected position in an element of the first operand. | 05-21-2015 |
20150149744 | DATA PROCESSING APPARATUS AND METHOD FOR PERFORMING VECTOR PROCESSING - A data processing apparatus and method are provided for processing execution threads, where each execution thread specifies at least one instruction. The data processing apparatus has a vector processing unit providing a plurality M of lanes of parallel processing, within each lane the vector processing unit being configured to perform a processing operation on a data element input to that lane for each of one or more input operands. A vector instruction is received that is specified by a group of the execution threads, that vector instruction identifying an associated processing operation and also providing an indication of the data elements of each input operand that are to be subjected to that associated processing operation. Vector merge circuitry then determines, based on that information, a required number of lanes of parallel processing for performing the associated processing operation. If it is determined that the required number of lanes is less than or equal to half the available number of lanes within the vector processing unit, then the vector merge circuitry allocates a plurality of the execution threads of the group to the vector processing unit such that each execution thread in that plurality is allocated different lanes amongst the available lanes of parallel processing. As a result, the vector processing unit then performs the associated processing operation in parallel for each of the plurality of execution threads, significantly increasing performance. | 05-28-2015 |
20150355905 | VECTOR MEMORY ACCESS INSTRUCTIONS FOR BIG-ENDIAN ELEMENT ORDERED AND LITTLE-ENDIAN ELEMENT ORDERED COMPUTER CODE AND DATA - Embodiments relate to vector memory access instructions for big-endian (BE) element ordered computer code and little-endian (LE) element ordered computer code. An aspect includes determining a mode of a computer system comprising one of a BE mode and an LE mode. Another aspect includes determining a code type comprising one of BE code and LE code. Another aspect includes determining a data type of data in a main memory that is associated with the object code comprising one of BE data and LE data. Another aspect includes based on the mode, code type, and data type, inserting a memory access instruction into the object code to perform a memory access associated with the vector in the object code, such that the memory access instruction performs element ordering of elements of the vector, and data ordering within the elements of the vector, in accordance with the determined mode, code type, and data type. | 12-10-2015 |
20150355906 | VECTOR MEMORY ACCESS INSTRUCTIONS FOR BIG-ENDIAN ELEMENT ORDERED AND LITTLE-ENDIAN ELEMENT ORDERED COMPUTER CODE AND DATA - Embodiments relate to vector memory access instructions for big-endian (BE) element ordered computer code and little-endian (LE) element ordered computer code. An aspect includes determining a mode of a computer system comprising one of a BE mode and an LE mode. Another aspect includes determining a code type comprising one of BE code and LE code. Another aspect includes determining a data type of data in a main memory that is associated with the object code comprising one of BE data and LE data. Another aspect includes based on the mode, code type, and data type, inserting a memory access instruction into the object code to perform a memory access associated with the vector in the object code, such that the memory access instruction performs element ordering of elements of the vector, and data ordering within the elements of the vector, in accordance with the determined mode, code type, and data type. | 12-10-2015 |
20160092218 | Conditional Stop Instruction with Accurate Dependency Detection - In an embodiment, a processor may implement a conditional stop instruction that includes a first predicate vector identifying the active elements of the instruction, a second predicate vector indicating true and false results for a conditional expression within a loop that is being vectorized, and a source operand specifying which combinations in the true and false results may indicate a dependency. The conditional stop instruction may generate a vector result indicating vector elements that have a dependency on a prior vector element, as well as an identification of which element position the dependency is on. More particularly, dependencies may be detected only on active elements as indicated by the first predicate vector. False dependencies that may occur due to inactive elements may be avoided, which may improve performance and/or provide for correct functional operation. | 03-31-2016 |
20160092234 | METHOD AND APPARATUS FOR SPECULATIVE VECTORIZATION - An apparatus and method for speculative vectorization. For example, one embodiment of a processor comprises: a queue comprising a set of locations for storing addresses associated with vectorized memory access instructions; and execution logic to execute a first vectorized memory access instruction to access the queue and to compare a new address associated with the first vectorized memory access instruction with existing addresses stored within a specified range of locations within the queue to detect whether a conflict exists, the existing addresses having been previously stored responsive to one or more prior vectorized memory access instructions. | 03-31-2016 |
20160092398 | Conditional Termination and Conditional Termination Predicate Instructions - In an embodiment, a processor may implement a vector instruction set including a conditional termination instruction (CTerm). The CTerm instruction may take two source operands and compare them according to a specified condition, updating flags as a result of the instruction. The flags may be used to affect predicate vector generation to control vectorized loop execution. In an embodiment, the vector instruction set may also include a conditional termination predicate instruction (CTPred). The CTPred instruction may take a pair of predicate vectors and a set of flags as operands, and may generate: a predicate vector to control parallel processing of vector elements, and a set of flags to control further loop processing. Either instruction may be used to efficiently manage vector loops in various embodiments, or the instructions may be used together. | 03-31-2016 |
20160092400 | Instruction and Logic for a Vector Format for Processing Computations - A processor includes a front end to fetch an instruction. The instruction is to calculate a data point using inputs from a plurality of adjacent source data in a plurality of dimensions. The processor includes a decoder to decode the instruction. The processor also includes a core to, based on the decoded instruction, perform a plurality of tabular vector read operations to read the plurality of adjacent source data and perform a tabular vector calculation to execute the instruction. The tabular vector calculation is based upon results of performing the plurality of tabular vector read operations. The core is further to write results of the tabular vector calculation. | 03-31-2016 |
20160094241 | APPARATUS AND METHOD FOR VECTOR COMPRESSION - An apparatus and method are described for performing vector compression. For example, one embodiment of a processor comprises: vector compression logic to compress a source vector comprising a plurality of valid data elements and invalid data elements to generate a destination vector in which valid data elements are stored contiguously on one side of the destination vector, the vector compression logic to utilize a bit mask associated with the source vector and comprising a plurality of bits, each bit corresponding to one of the plurality of data elements of the source vector and indicating whether the data element comprises a valid data element or an invalid data element, the vector compression logic to utilize indices of the bit mask and associated bit values of the bit mask to generate a control vector; and shuffle logic to shuffle/permute the data elements of the source vector to the destination vector in accordance with the control vector. | 03-31-2016 |
20160139921 | VECTOR INSTRUCTION TO COMPUTE COORDIANTE OF NEXT POINT IN A Z-ORDER CURVE - In one embodiment, a processor includes machine level instructions to compute a next point in a Z-order curve of a specified dimension for a specified coordinate. A processor decode unit is configured to decode an instruction having a source and immediate operands including a first z-curve index, the specified dimension and the specified coordinate. A processor execution unit is configured to execute the decoded instruction to compute the coordinate of the next point by incrementing the coordinate value associated with the specified coordinate to generate a second z-curve index including the incremented coordinate. | 05-19-2016 |
20160179530 | INSTRUCTION AND LOGIC TO PERFORM A VECTOR SATURATED DOUBLEWORD/QUADWORD ADD | 06-23-2016 |
20160179531 | COMPILER METHOD FOR GENERATING INSTRUCTIONS FOR VECTOR OPERATIONS IN A MULTI-ENDIAN INSTRUCTION SET | 06-23-2016 |
20160179537 | METHOD AND APPARATUS FOR PERFORMING REDUCTION OPERATIONS ON A SET OF VECTOR ELEMENTS | 06-23-2016 |
20160188334 | HARDWARE APPARATUSES AND METHODS RELATING TO ELEMENTAL REGISTER ACCESSES - Methods and apparatuses relating to a vector instruction with a register operand with an elemental offset are described. In one embodiment, a hardware processor includes a decode unit to decode a vector instruction with a register operand with an elemental offset to access a first number of elements in a register specified by the register operand, wherein the first number is a total number of elements in the register minus the elemental offset, access a second number of elements in a next logical register, wherein the second number is the elemental offset, and combine the first number of elements and the second number of elements as a data vector, and an execution unit to execute the vector instruction on the data vector. | 06-30-2016 |
20160188530 | METHOD AND APPARATUS FOR PERFORMING A VECTOR PERMUTE WITH AN INDEX AND AN IMMEDIATE - An apparatus and method for performing a vector permute. For example, one embodiment of a processor comprises: a source vector register to store a plurality of source data elements; a destination vector register to store a plurality of destination data elements; a control vector register to store a plurality of control data elements, each control data element corresponding to one of the destination data elements and including an N bit value indicating whether a source data element is to be copied to the corresponding destination data element; vector permute logic to compare the N bit value of each control data element to an N bit portion of an immediate to determine whether to copy a source data element to the corresponding destination data element, wherein if the N bit values match, then the vector permute logic is to identify a source data element using an index value included in the control data element and to responsively copy the source data element to the corresponding destination data element in the destination vector register. | 06-30-2016 |
20190146793 | Apparatus and Methods for Vector Based Transcendental Functions | 05-16-2019 |