Vector processor operation

Subclass of:

712 - Electrical computers and digital processing systems: processing architectures and instruction processing (e.g., processors)

712001000 - PROCESSING ARCHITECTURE

712002000 - Vector processor

Patent class list (only non-empty classes are listed)

Deeper subclasses:

Class / Patent application number	Description	Number of patent applications / Date published
712009000	Concurrent	7
712008000	Sequential	2
20140189295	Apparatus and Method of Efficient Vector Roll Operation - A machine readable storage medium containing program code is described that when processed by a processor causes a method to be performed. The method includes creating a resultant rolled version of an input vector by forming a first intermediate vector, forming a second intermediate vector and forming a resultant rolled version of an input vector. The first intermediate vector is formed by barrel rolling elements of the input vector along a first of two lanes defined by an upper half and a lower half of the input vector. The second intermediate vector is formed by barrel rolling elements of the input vector along a second of the two lanes. The resultant rolled version of the input vector is formed by incorporating upper portions of one of the intermediate vector's upper and lower halves as upper portions of the resultant's upper and lower halves and incorporating lower portions of the other intermediate vector's upper and lower halves as lower portions of the resultant's upper and lower halves.	07-03-2014
20150019837	DATA PROCESSOR - A data processor includes: a plurality of controllers that process data; a program memory that stores a standby instruction and a data processing instruction at a plurality of addresses respectively; and a queue that stores different execution start addresses for the plurality of controllers, wherein after the plurality of controllers sequentially access the queue, the plurality of controllers acquire the different execution start addresses from the queue in an order of the sequential access, start execution of instructions from the acquired different execution start addresses in the program memory, and execute the data processing instruction and execute the standby instruction a number of times that differs for each of the controllers.	01-15-2015
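
The vector roll entry above (20140189295) builds a full-width roll out of two lane-local barrel rolls and a blend. The following is a minimal Python sketch of that general idea, not the claimed method: the function names `rotate_lane` and `roll_vector`, the use of lists as vectors, and the restriction of the roll amount to less than half the vector length are all assumptions made for illustration.

```python
def rotate_lane(lane, k):
    """Barrel-roll one lane by k positions toward higher indices."""
    k %= len(lane)
    return lane[-k:] + lane[:-k] if k else list(lane)


def roll_vector(vec, k):
    """Roll a whole vector by k using only lane-local rolls and a blend.

    Assumes an even-length vector split into a lower and an upper lane,
    and 0 <= k < half the vector length (illustrative restriction).
    """
    half = len(vec) // 2
    if len(vec) % 2 or not 0 <= k < half:
        raise ValueError("need an even-length vector and 0 <= k < len(vec) // 2")
    lo, hi = vec[:half], vec[half:]
    # First intermediate: each lane rolled in place.
    a = rotate_lane(lo, k) + rotate_lane(hi, k)
    # Second intermediate: lanes swapped, then rolled.
    b = rotate_lane(hi, k) + rotate_lane(lo, k)
    # Blend: within each half, the first k positions hold elements that
    # crossed a lane boundary (taken from b); the rest come from a.
    return [b[i] if i % half < k else a[i] for i in range(len(vec))]
```

For example, `roll_vector([0, 1, 2, 3, 4, 5, 6, 7], 3)` yields `[5, 6, 7, 0, 1, 2, 3, 4]`, the same result as rotating the full vector by three positions.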

Document	Title	Date
Entries
20080282058	Message queuing system for parallel integrated circuit architecture and related method of operation - An integrated circuit comprises an external memory, a plurality of parallel connected Vector Processing Engines (VPEs), and an External Memory Unit (EMU) providing a data transfer path between the VPEs and the external memory. Each VPE contains a plurality of data processing units and a message queuing system adapted to transfer messages between the data processing units and other components of the integrated circuit.	11-13-2008
20080301402	Method and System for Stealing Interrupt Vectors - A system for stealing interrupt vectors from an operating system. Custom interrupt handler extensions are copied into an allocated block of memory from a kernel module. Also, operating system interrupt handlers are copied into a reserved space in the allocated block of memory from an interrupt vector memory location. In response to copying the operating system interrupt handlers into the reserved space in the allocated block of memory, custom interrupt handlers from the kernel module are copied over the operating system interrupt handlers in the interrupt vector memory location. The custom interrupt handlers after being copied into the interrupt vector memory location handle all interrupts received by the operating system.	12-04-2008
20090106527	Scalar Precision Float Implementation on the "W" Lane of Vector Unit - Embodiments of the invention are generally related to image processing, and more specifically to vector units for supporting image processing. A combined vector/scalar unit is provided wherein one or more processing lanes of the vector unit are used for performing scalar operations. An integrated register file is also provided for storing vector and scalar data. Therefore, the transfer of data to memory to exchange data between independent vector and scalar units is obviated and a significant amount of chip area is saved.	04-23-2009
20090300323	Vector Processor System - A vector processing system provides high performance vector processing using a System-On-a-Chip (SOC) implementation technique. One or more scalar processors (or cores) operate in conjunction with a vector processor, and the processors collectively share access to a plurality of memory interfaces coupled to Dynamic Random Access read/write Memories (DRAMs). In typical embodiments the vector processor operates as a slave to the scalar processors, executing computationally intensive Single Instruction Multiple Data (SIMD) codes in response to commands received from the scalar processors. The vector processor implements a vector processing Instruction Set Architecture (ISA) including machine state, instruction set, exception model, and memory model.	12-03-2009
20100064115	VECTOR PROCESSING UNIT - It is an object to speed up a vector store instruction on a memory that is divided into banks as setting a plurality of elements as a unit while minimizing an increase in physical quantity. A vector processing apparatus has a plurality of register banks and processes a data string including a plurality of data elements retained in the plurality of register banks, wherein: the plurality of register banks each have a read pointer	03-11-2010
20100106940	Processing Unit With Operand Vector Multiplexer Sequence Control - Operand vector multiplexer sequence control is used in a vector-based execution unit to control the shuffling of data elements in operand vectors used by a sequence of vector instructions processed by the vector-based execution unit. A swizzle sequence instruction is defined in an instruction set for the vector-based execution unit and is used to selectively apply a sequence of vector data element shuffle orders to one or more operand vectors to be used by the associated sequence of vector instructions. As a result, when a common sequence of data element shuffle orders is used frequently for a sequence of vector instructions, a single swizzle sequence instruction may be used to select the desired sequence of custom data element ordering for each of the vector instructions in the sequence.	04-29-2010
20100115233	DYNAMICALLY-SELECTABLE VECTOR REGISTER PARTITIONING - The present invention is directed generally to dynamically-selectable vector register partitioning, and more specifically to a processor infrastructure (e.g., co-processor infrastructure in a multi-processor system) that supports dynamic setting of vector register partitioning to any of a plurality of different vector partitioning modes. Thus, rather than being restricted to a fixed vector register partitioning mode, embodiments of the present invention enable a processor to be dynamically set to any of a plurality of different vector partitioning modes. Thus, for instance, different vector register partitioning modes may be employed for different applications being executed by the processor, and/or different vector register partitioning modes may even be employed for use in processing different vector oriented operations within a given application being executed by the processor, in accordance with certain embodiments of the present invention.	05-06-2010
20100115234	CONFIGURABLE VECTOR LENGTH COMPUTER PROCESSOR - A processor core, comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.	05-06-2010
20100138631	Process for QR Transformation using a CORDIC Processor - A CORDIC processor has a plurality of stages, each of the stages having an X input, a Y input, a sign input, a sign output, an X output, a Y output, a mode control input having a ROTATE or VECTOR value, and a stage number k input, each CORDIC stage having a first shift generating an output by shifting the Y input k times, a second shift generating an output by shifting the X input k times, a multiplexer having an output coupled to the sign input when the mode control input is ROTATE and to the sign of the Y input when the mode input is VECTOR, a first multiplier forming the product of the first shift output and the multiplexer output, a second multiplier forming the product of the second shift output and an inverted multiplexer output, a first adder forming the X output from the sum of the first multiplier output and the X input, and a second adder forming the Y output from the sum of the second multiplier output and the Y input.	06-03-2010
20100138632	Programmable CORDIC Processor with Stage Re-Use - A CORDIC processor has a plurality of stages, each of the stages having an X input, a Y input, a sign input, a sign output, an X output, a Y output, a mode control input having a ROTATE or VECTOR value, and a stage number k input, each CORDIC stage having a first shift generating an output by shifting the Y input k times, a second shift generating an output by shifting the X input k times, a multiplexer having an output coupled to the sign input when the mode control input is ROTATE and to the sign of the Y input when the mode input is VECTOR, a first multiplier forming the product of the first shift output and the multiplexer output, a second multiplier forming the product of the second shift output and an inverted multiplexer output, a first adder forming the X output from the sum of the first multiplier output and the X input, and a second adder forming the Y output from the sum of the second multiplier output and the Y input.	06-03-2010
20100318764	SYSTEM AND METHOD FOR MANAGING PROCESSOR-IN-MEMORY (PIM) OPERATIONS - A system and method of compiling program code, wherein the program code includes an operation on an array of data elements stored in memory of a computer system. The program code is scanned for operations that are vectorizable. The vectorizable operations are examined to determine whether they should be executed at least in part in a vector atomic memory operation (AMO) functional unit attached to memory. If so, the compiled code includes vector AMO instructions.	12-16-2010
20110035568	SELECT FIRST AND SELECT LAST INSTRUCTIONS FOR PROCESSING VECTORS - The described embodiments include a processor that executes a vector instruction. The processor starts by receiving a vector instruction that uses a first input vector, a second input vector, and a control vector, and optionally a predicate vector as inputs, wherein each of the vectors includes N elements. The processor then executes the vector instruction. In the described embodiments, when executing the vector instruction, the processor determines a key element position. If the predicate vector is received, the key element position is a predetermined active element position in the predicate vector, otherwise, the key element position is in a predetermined element position. The processor then uses the key element position to copy a result value into a result variable. When copying the result value into the result variable, if an element in the key element position of the control vector contains a predetermined value, the processor copies a value from the key element position in the second input vector into the result variable. Otherwise, the processor copies a value from the key element position in the first input vector into the result variable.	02-10-2011
20110055517	METHOD AND STRUCTURE OF USING SIMD VECTOR ARCHITECTURES TO IMPLEMENT MATRIX MULTIPLICATION - A structure (and method) including a plurality of coprocessing units and a controller that selectively loads data for processing on the plurality of coprocessing units, using a compound loading instruction. The compound loading instruction includes a plurality of low-level software instructions that preliminarily processes input data in a manner predetermined to simulate an effect of a single hardware loading instruction that would provide optimal loading of complex matrix data by loading input data in accordance with the effect of multiplying i·i=−1.	03-03-2011
20110113217	GENERATE PREDICATES INSTRUCTION FOR PROCESSING VECTORS - The described embodiments include a processor that executes a vector instruction. The processor starts by receiving a first input vector, a second input vector, and optionally receiving a predicate vector (each of which includes N elements) as inputs. The processor then executes the vector instruction. Executing the vector instruction causes the processor to generate a result vector. When generating the result vector, if the predicate vector was received, for each element of the result vector for which the corresponding element of the predicate vector is active, otherwise, for each element of the result vector, the processor determines elements that are to be set in the result vector based on values in elements in the first input vector and the second input vector. The processor then sets the determined elements of the result vector to a first predetermined value.	05-12-2011
20110153980	MULTI-STAGE RECONFIGURATION DEVICE AND RECONFIGURATION METHOD, LOGIC CIRCUIT CORRECTION DEVICE, AND RECONFIGURABLE MULTI-STAGE LOGIC CIRCUIT - To provide a device to reconfigure multi-level logic networks, which enable logic modification and reconfiguration of a multi-level logic network with small circuit area and low-power dissipation in a simple manner. For example, in the case of reconfiguring a multi-level logic network following logic modification for deleting an output vector F(b) of an objective logic function F(X) corresponding to an input vector b, unmodified pq elements are selected one by one from the nearest pq element E	06-23-2011
20110219207	RECONFIGURABLE PROCESSOR AND RECONFIGURABLE PROCESSING METHOD - A reconfigurable processor for efficiently performing a vector operation, and a method of controlling the reconfigurable processor are provided. The reconfigurable processor designates at least one of a plurality of processing elements as a vector lane based on vector lane configuration information, and allocates a vector operation to the designated vector lane.	09-08-2011
20110314254	METHOD FOR VECTOR PROCESSING - The present application relates to a method for processing data in a vector processor. The present application relates also to a vector processor for performing said method and a cellular communication device comprising said vector processor. The method for processing data in a vector processor comprises executing segmented operations on a segment of a vector for generating results, collecting the results of the segmented operations, and delivering the results in a result vector in such a way that subsequent operations remain processing in vector mode.	12-22-2011
20110320765	VARIABLE WIDTH VECTOR INSTRUCTION PROCESSOR - A computer processor, method, and computer program product for executing vector processing instructions on a variable width vector register file. An example embodiment is a computer processor that includes an instruction execution unit coupled to a variable width vector register file which contains a number of vector registers, the width of the vector registers is changeable during operation of the computer processor.	12-29-2011
20120023308	PARALLEL COMPARISON/SELECTION OPERATION APPARATUS, PROCESSOR, AND PARALLEL COMPARISON/SELECTION OPERATION METHOD - Provided is a parallel comparison/selection operation apparatus which efficiently executes a search for a maximum value or a search for a minimum value with an index. The parallel comparison/selection operation apparatus includes a vector comparison/selection unit	01-26-2012
20120079233	VECTOR LOGICAL REDUCTION OPERATION IMPLEMENTED ON A SEMICONDUCTOR CHIP - A semiconductor processor is described. The semiconductor processor includes logic circuitry to perform a logical reduction instruction. The logic circuitry has swizzle circuitry to swizzle a vector's elements so as to form a swizzle vector. The logic circuitry also has vector logic circuitry to perform a vector logic operation on said vector and said swizzle vector.	03-29-2012
20120102299	STALL PROPAGATION IN A PROCESSING SYSTEM WITH INTERSPERSED PROCESSORS AND COMMUNICATION ELEMENTS - A processing system includes processors and dynamically configurable communication elements (DCCs) coupled together in an interspersed arrangement. A source device may transfer a data item through an intermediate subset of the DCCs to a destination device. The source and destination devices may each correspond to different processors, DCCs, or input/output devices, or mixed combinations of these. In response to detecting a stall after the source device begins transfer of the data item to the destination device and prior to receipt of all of the data item at the destination device, a stalling device is operable to propagate stalling information through one or more of the intermediate subset towards the source device. In response to receiving the stalling information, at least one of the intermediate subset is operable to buffer all or part of the data item.	04-26-2012
20120131308	SYSTEM, DEVICE, AND METHOD FOR ON-THE-FLY PERMUTATIONS OF VECTOR MEMORIES FOR EXECUTING INTRA-VECTOR OPERATIONS - A device, system, and method for processing program instructions, for example, to execute intra-vector operations. A fetch unit may receive a program instruction defining different operations on data elements stored at the same vector memory address. A processor may include different types of execution units each executing a different one of a predetermined plurality of elemental instructions. Each program instruction may be a combination of one or more of the elemental instructions. The processor may receive a vector of data elements stored non-consecutively at the same vector memory address to be processed by a same one of the elemental instructions and a vector of configuration values independently associated with executing the same elemental instruction on the non-consecutive data elements. At least two configuration values may be different to implement different operations by executing the same elemental instruction using the different configuration values on the vector of non-consecutive data elements.	05-24-2012
20120166761	VECTOR CONFLICT INSTRUCTIONS - A processing core implemented on a semiconductor chip is described having first execution unit logic circuitry that includes first comparison circuitry to compare each element in a first input vector against every element of a second input vector. The processing core also has second execution logic circuitry that includes second comparison circuitry to compare a first input value against every data element of an input vector.	06-28-2012
20120216011	APPARATUS AND METHOD OF SINGLE-INSTRUCTION, MULTIPLE-DATA VECTOR OPERATION MASKING - An apparatus, method, and medium for performing a vector operation on portions of one or more source vector registers. A vector unit performs an operation on the source vector registers and only stores results in the target vector register for elements which are selected by the vector operation mask. The vector operation mask can be read by the vector unit or loaded into the vector unit for each instruction cycle. The vector operation mask allows the vector unit to be used with partially filled source vector registers and eliminates the need for scalar operations to be performed on vector data.	08-23-2012
20120221830	CONFIGURABLE VECTOR LENGTH COMPUTER PROCESSOR - A processor core, comprises one or more vector units operable to change between a fine-grained vector mode having a shorter maximum vector length and a coarse-grained vector mode having a longer maximum vector length. Changing vector modes comprises halting all instruction stream execution in the core, flushing one or more registers in a register space, reconfiguring one or more vector registers in the register space, and restarting instruction execution in the core.	08-30-2012
20120284487	Vector Slot Processor Execution Unit for High Speed Streaming Inputs - A vector slot processor that is capable of supporting multiple signal processing operations for multiple demodulation standards is provided. The vector slot processor includes a plurality of micro execution slots (MESs) that perform the multiple signal processing operations on the high speed streaming inputs. Each of the MESs includes one or more n-way signal registers that receive the high speed streaming inputs, one or more n-way coefficient registers that store filter coefficients for the multiple signal processing, and one or more n-way Multiply and Accumulate (MAC) units that receive the high speed streaming inputs from the one or more n-way signal registers and filter coefficients from one or more n-way coefficient registers. The one or more n-way MAC units perform a vertical MAC operation and a horizontal multiply and add operation on the high speed streaming inputs.	11-08-2012
20130024654	VECTOR OPERATIONS FOR COMPRESSING SELECTED VECTOR ELEMENTS - A processor, method, and medium for using vector operations to compress selected elements of a vector. An input vector is compared to a criteria vector, and then a subset of the plurality of elements of the input vector are selected based on the comparison. A permutation vector is generated based on the locations of the selected elements and then the permutation vector is used to permute the selected elements of the input vector to an output vector. The selected elements of the input vector are stored in contiguous locations in the leftmost elements of the output vector. Then, the output vector is stored to memory and a pointer to the memory location is incremented by the number of selected elements.	01-24-2013
20130024655	PROCESSING VECTORS USING WRAPPING INCREMENT AND DECREMENT INSTRUCTIONS IN THE MACROSCALAR ARCHITECTURE - Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a fixed-value addition operation dependent upon the input vector and the control vector.	01-24-2013
20130024656	PROCESSING VECTORS USING WRAPPING BOOLEAN INSTRUCTIONS IN THE MACROSCALAR ARCHITECTURE - Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a Boolean operation on another input vector dependent upon the input vector and the control vector.	01-24-2013
20130036293	PROCESSING VECTORS USING WRAPPING MINIMA AND MAXIMA INSTRUCTIONS IN THE MACROSCALAR ARCHITECTURE - Embodiments of a system and a method in which a processor may execute instructions that cause the processor to receive an input vector and a control vector are disclosed. The executed instructions may also cause the processor to perform a minima or maxima operation on another input vector dependent upon the input vector and the control vector.	02-07-2013
20130067196	VECTORIZATION OF MACHINE LEVEL SCALAR INSTRUCTIONS IN A COMPUTER PROGRAM DURING EXECUTION OF THE COMPUTER PROGRAM - A method of operating a computer processor includes storing at least one machine level vector instruction in a memory and replacing a plurality of machine level scalar instructions in a computer program with the at least one machine level vector instruction during execution of the computer program based on execution addresses associated with the plurality of machine level scalar instructions and/or instruction opcodes associated with the plurality of machine level scalar instructions.	03-14-2013
20130132707	SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR ASSIGNING ELEMENTS OF A MATRIX TO PROCESSING THREADS WITH INCREASED CONTIGUOUSNESS - A system, method, and computer program product are provided for assigning elements of a matrix to processing threads. In use, a matrix is received to be processed by a parallel processing architecture. Such parallel processing architecture includes a plurality of processors each capable of processing a plurality of threads. Elements of the matrix are assigned to each of the threads for processing, utilizing an algorithm that increases a contiguousness of the elements being processed by each thread.	05-23-2013
20130159666	REDUCING ISSUE-TO-ISSUE LATENCY BY REVERSING PROCESSING ORDER IN HALF-PUMPED SIMD EXECUTION UNITS - Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment a processor functional unit is provided comprising a frontend unit, an execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between an output and an input of the execution core unit and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction.	06-20-2013
20130159667	Vector Size Agnostic Single Instruction Multiple Data (SIMD) Processor Architecture - A computer has a memory adapted to store a first plurality of instructions encoded with a first vector size and a second plurality of instructions encoded with a second vector size. An execution unit executes the first plurality of instructions and the second plurality of instructions by processing vector units in a uniform manner regardless of vector size.	06-20-2013
20130159668	PREDECODE LOGIC FOR AUTOVECTORIZING SCALAR INSTRUCTIONS IN AN INSTRUCTION BUFFER - A circuit arrangement, method, and program product for substituting a plurality of scalar instructions in an instruction stream with a functionally equivalent vector instruction for execution by a vector execution unit. Predecode logic is coupled to an instruction buffer which stores instructions in an instruction stream to be executed by the vector execution unit. The predecode logic analyzes the instructions passing through the instruction buffer to identify a plurality of scalar instructions that may be replaced by a vector instruction in the instruction stream. The predecode logic may generate the functionally equivalent vector instruction based on the plurality of scalar instructions, and the functionally equivalent vector instruction may be substituted into the instruction stream, such that the vector execution unit executes the vector instruction in lieu of the plurality of scalar instructions.	06-20-2013
20130185540	PROCESSOR WITH MULTI-LEVEL LOOPING VECTOR COPROCESSOR - A processor includes a scalar processor core and a vector coprocessor core coupled to the scalar processor core. The scalar processor core includes a program memory interface through which the scalar processor retrieves instructions from a program memory. The instructions include scalar instructions executable by the scalar processor and vector instructions executable by the vector coprocessor core. The vector coprocessor core includes a plurality of execution units and a vector command buffer. The vector command buffer is configured to decode vector instructions passed by the scalar processor core, to determine whether vector instructions defining an instruction loop have been decoded, and to initiate execution of the instruction loop by one or more of the execution units based on a determination that all of the vector instructions of the instruction loop have been decoded.	07-18-2013
20130290672	APPARATUS AND METHOD OF MASK PERMUTE INSTRUCTIONS - An apparatus is described having instruction execution logic circuitry. The instruction execution logic circuitry has input vector element routing circuitry to perform the following for each of three different instructions: for each of a plurality of output vector element locations, route into an output vector element location an input vector element from one of a plurality of input vector element locations that are available to source the output vector element. The output vector element and each of the input vector element locations are one of three available bit widths for the three different instructions. The apparatus further includes masking layer circuitry coupled to the input vector element routing circuitry to mask a data structure created by the input vector routing element circuitry. The masking layer circuitry is designed to mask at three different levels of granularity that correspond to the three available bit widths.	10-31-2013
20130297908	Decomposing Operations in More than One Dimension into One Dimensional Point Operations - A processing architecture uses stationary operands and opcodes common on a plurality of processors. Only data moves through the processors. The same opcode and operand is used by each processor assigned to operate, for example, on one row of pixels, one row of numbers, or one row of points in space.	11-07-2013
20130332701	APPARATUS AND METHOD FOR SELECTING ELEMENTS OF A VECTOR COMPUTATION - An apparatus and method are described for selecting elements to be used in a vector computation. For example, a method according to one embodiment includes the following operations: specifying whether to identify the first, last or next after last active element of an input mask register using an immediate value; identifying the first, last or next after last active element in the input mask register according to the immediate value; reading a value from an input vector register corresponding to the identified first, last or next after last active element in the input mask register; and writing the value to an output vector register.	12-12-2013
20130339661	EFFICIENT ZERO-BASED DECOMPRESSION - A processor core including a hardware decode unit to decode vector instructions for decompressing a run length encoded (RLE) set of source data elements and an execution unit to execute the decoded instructions. The execution unit generates a first mask by comparing the set of source data elements with a set of zeros and then counts the trailing zeros in the mask. A second mask is made based on the count of trailing zeros. The execution unit then copies the set of source data elements to a buffer using the second mask and then reads the number of RLE zeros from the set of source data elements. The buffer is shifted and copied to a result and the set of source data elements is shifted to the right. If more valid data elements are in the set of source data elements, this is repeated until all valid data is processed.	12-19-2013
20140032877	APPARATUS AND METHOD FOR AN INSTRUCTION THAT DETERMINES WHETHER A VALUE IS WITHIN A RANGE - A method is described that includes performing the following with a single instruction: receiving a first input operand V; receiving a second input operand S; calculating V−S; determining if V−S is positive or negative; and, providing as a resultant: V if V−S is negative; V−S if V−S is positive.	01-30-2014
20140052959	Experimental engineering optimization algorithm at point of performance - A method is provided for reducing the data set used in creating an optimization algorithm, thus to permit the use of microprocessors, that in turn permits embedding the optimization algorithm at the point of performance, in which a subset of data points in a performance window is used to derive a vector that is utilized to create an initial optimization algorithm.	02-20-2014
20140059323	SYSTEMS AND METHODS OF DATA EXTRACTION IN A VECTOR PROCESSOR - Systems and methods of data extraction in a vector processor are disclosed. In a particular embodiment a method of data extraction in a vector processor includes copying at least one data element to a source register of a permutation network. The method includes reordering multiple data elements of the source register, populating a destination register of the permutation network with the reordered data elements, and copying the reordered data elements from the destination register to a memory.	02-27-2014
20140068226	VECTOR INSTRUCTIONS TO ENABLE EFFICIENT SYNCHRONIZATION AND PARALLEL REDUCTION OPERATIONS - In one embodiment, a processor may include a vector unit to perform operations on multiple data elements responsive to a single instruction, and a control unit coupled to the vector unit to provide the data elements to the vector unit, where the control unit is to enable an atomic vector operation to be performed on at least some of the data elements responsive to a first vector instruction to be executed under a first mask and a second vector instruction to be executed under a second mask. Other embodiments are described and claimed.	03-06-2014
20140075153	REDUCING ISSUE-TO-ISSUE LATENCY BY REVERSING PROCESSING ORDER IN HALF-PUMPED SIMD EXECUTION UNITS - Techniques for reducing issue-to-issue latency by reversing processing order in half-pumped single instruction multiple data (SIMD) execution units are described. In one embodiment, a processor functional unit is provided comprising a frontend unit, an execution core unit, a backend unit, an execution order control signal unit, a first interconnect coupled between an output and an input of the execution core unit, and a second interconnect coupled between an output of the backend unit and an input of the frontend unit. In operation, the execution order control signal unit generates a forwarding order control signal based on the parity of an applied clock signal on reception of a first vector instruction. This control signal is in turn used to selectively forward first and second portions of an execution result of the first vector instruction via the interconnects for use in the execution of a dependent second vector instruction.	03-13-2014
20140122832	PARTIAL VECTORIZATION COMPILATION SYSTEM - Generally, this disclosure provides technologies for generating and executing partially vectorized code that may include backward dependencies within a loop body of the code to be vectorized. The method may include identifying backward dependencies within a loop body of the code; selecting one or more ranges of iterations within the loop body, wherein the selected ranges exclude the identified backward dependencies; and vectorizing the selected ranges. The system may include a vector processor configured to provide predicated vector instruction execution, loop iteration range enabling, and dynamic loop dependence checking.	05-01-2014
20140129803	MULTI-MAGNITUDINAL VECTORS WITH RESOLUTION BASED ON SOURCE VECTOR FEATURES - Methods, systems and computer program products for resolving multiple magnitudes assigned to a target vector are disclosed. A target vector that includes one or more target vector dimensions is received. One of the target vector dimensions is processed to determine a total number of magnitudes assigned to the processed target vector dimension. Also, a source vector that includes one or more source vector dimensions is received. The received source vector is processed to determine a total number of features associated with the source vector. When it is detected that the total number of magnitudes assigned to the processed target vector dimension exceeds one, one of the assigned magnitudes is selected based on one of the determined features associated with the source vector.	05-08-2014
20140181466	PROCESSORS HAVING FULLY-CONNECTED INTERCONNECTS SHARED BY VECTOR CONFLICT INSTRUCTIONS AND PERMUTE INSTRUCTIONS - An apparatus includes a decode unit to decode a permute instruction and a vector conflict instruction. A vector execution unit is coupled with the decode unit and includes a fully-connected interconnect. The fully-connected interconnect has at least four inputs to receive at least four corresponding data elements of at least one source vector. The fully-connected interconnect has at least four outputs. Each of the at least four inputs is coupled with each of the at least four outputs. The execution unit also includes a permute instruction execution logic coupled with the at least four outputs and operable to store a first vector result in response to the permute instruction. The execution unit also includes a vector conflict instruction execution logic coupled with the at least four outputs and operable to store a second vector result in a destination storage location in response to the vector conflict instruction.	06-26-2014
20140189289	INSTRUCTION FOR ACCELERATING SNOW 3G WIRELESS SECURITY ALGORITHM - Vector instructions for performing SNOW 3G wireless security operations are received and executed by the execution circuitry of a processor. The execution circuitry receives a first operand of the first instruction specifying a first vector register that stores a current state of a finite state machine (FSM). The execution circuitry also receives a second operand of the first instruction specifying a second vector register that stores data elements of a linear feedback shift register (LFSR) that are needed for updating the FSM. The execution circuitry executes the first instruction to produce an updated state of the FSM and an output of the FSM in a destination operand of the first instruction.	07-03-2014
20140189290	INSTRUCTION FOR FAST ZUC ALGORITHM PROCESSING - Vector instructions for performing ZUC stream cipher operations are received and executed by the execution circuitry of a processor. The execution circuitry receives a first vector instruction to perform an update to a linear feedback shift register (LFSR), and receives a second vector instruction to perform an update to a state of a finite state machine (FSM), where the FSM receives inputs from re-ordered bits of the LFSR. The execution circuitry executes the first vector instruction and the second vector instruction in a single-instruction multiple data (SIMD) pipeline.	07-03-2014
20140189291	Method And Apparatus For Integral Image Computation Instructions - A method is described that performs an image integral calculation by creating a second vector and creating a third vector. The second vector is created by executing a first instruction that adds alternating elements of a first vector to respective neighboring elements of the first vector and presents resulting summations into said second vector. The first instruction also passes through the respective neighboring elements to said second vector. The third vector is created by executing a second instruction that adds elements of one side of the second vector to an element of another side of the second vector and passes through the other side of the second vector.	07-03-2014
20140189292	Functional Unit Having Tree Structure To Support Vector Sorting Algorithm and Other Algorithms - An apparatus is described having a functional unit of an instruction execution pipeline. The functional unit has a plurality of compare-and-exchange circuits coupled to network circuitry to implement a vector sorting tree for a vector sorting instruction. Each of the compare-and-exchange circuits has a respective comparison circuit that compares a pair of inputs. Each of the compare-and-exchange circuits has a same sided first output for presenting a higher of the two inputs and a same sided second output for presenting a lower of the two inputs, said comparison circuit to also support said functional unit's execution of a prefix min and/or prefix add instruction.	07-03-2014
20140189293	Instructions for Sliding Window Encoding Algorithms - A processor is described having an instruction execution pipeline having a functional unit to execute an instruction that compares vector elements against an input value. Each of the vector elements and the input value has a first respective section identifying a location within data and a second respective section having a byte sequence of the data. The functional unit has comparison circuitry to compare respective byte sequences of the input vector elements against the input value's byte sequence to identify a number of matching bytes for each comparison. The functional unit also has difference circuitry to determine respective distances between the input vector elements' byte sequences and the input value's byte sequence within the data.	07-03-2014
20140189294	SYSTEMS, APPARATUSES, AND METHODS FOR DETERMINING DATA ELEMENT EQUALITY OR SEQUENTIALITY - Systems, apparatuses, and methods of performing in a computer processor broadcasting data in response to a single vector packed broadcasting instruction that includes a source writemask register operand, a destination vector register operand, and an opcode. In some embodiments, the data of the source writemask register is zero extended prior to broadcasting.	07-03-2014
20140195776	MEMORY ACCESS FOR A VECTOR PROCESSOR - A method and device for memory access in processors is provided. A processor, comprising a plurality of computational units, is capable of executing a single instruction on multiple pieces of data simultaneously (SIMD). A read operation is initiated to load data from memory into the plurality of computational units (CUs) arranged into a plurality of CU groups. The memory is arranged into a plurality of memory macro-blocks each associated with a respective CU group of the plurality of CU groups. For each CU group a respective first memory address is determined and for each CU group, the data in the associated memory macro-block is accessed at the respective first memory address.	07-10-2014
20140208066	VECTOR GENERATE MASK INSTRUCTION - A Vector Generate Mask instruction. For each element in the first operand, a bit mask is generated. The mask includes bits set to a selected value starting at a position specified by a first field of the instruction and ending at a position specified by a second field of the instruction.	07-24-2014
20140208067	VECTOR ELEMENT ROTATE AND INSERT UNDER MASK INSTRUCTION - A Vector Element Rotate and Insert Under Mask instruction. Each element of a second operand of the instruction is rotated in a specified direction by a specified number of bits. For each bit in a third operand of the instruction that is set to one, the corresponding bit of the rotated elements in the second operand replaces the corresponding bit in a first operand of the instruction.	07-24-2014
20140215182	Persistent Relocatable Reset Vector for Processor - In an embodiment, an integrated circuit includes at least one processor. The processor may include a reset vector base address register configured to store a reset vector address for the processor. Responsive to a reset, the processor may be configured to capture a reset vector address on an input, updating the reset vector base address register. Upon release from reset, the processor may initiate instruction execution at the reset vector address. The integrated circuit may further include a logic circuit that is coupled to provide the reset vector address. The logic circuit may include a register that is programmable with the reset vector address. More particularly, in an embodiment, the register may be programmable via a write operation issued by the processor (e.g. a memory-mapped write operation). Accordingly, the reset vector address may be programmable in the integrated circuit, and may be changed from time to time.	07-31-2014
20140244969	List Vector Processing Apparatus, List Vector Processing Method, Storage Medium, Compiler, and Information Processing Apparatus - Disclosed is a list vector processing apparatus (LVPA) or the like which can process the indirect reference at a high speed.	08-28-2014
20140244970	DIGITAL SIGNAL PROCESSOR AND BASEBAND COMMUNICATION DEVICE - For increased efficiency, a digital signal processor comprises a vector execution unit arranged to execute instructions that are to be performed on multiple data in the form of a vector, comprising a vector controller arranged to determine if an instruction is a vector instruction and, if it is, inform a count register arranged to hold the vector length, said vector controller being further arranged to receive an issue signal and control the execution of instructions based on this issue signal, said vector execution unit being characterized in that it comprises	08-28-2014
20140258677	ANALYZING POTENTIAL BENEFITS OF VECTORIZATION - Embodiments of computer-implemented methods, systems, computing devices, and computer-readable media (transitory and non-transitory) are described herein for analyzing execution of a plurality of executable instructions and, based on the analysis, providing an indication of a benefit to be obtained by vectorization of at least a subset of the plurality of executable instructions. In various embodiments, the analysis may include identification of the subset of the plurality of executable instructions suitable for conversion to one or more single-instruction multiple-data (“SIMD”) instructions.	09-11-2014
20140281370	VECTOR PROCESSING ENGINES HAVING PROGRAMMABLE DATA PATH CONFIGURATIONS FOR PROVIDING MULTI-MODE VECTOR PROCESSING, AND RELATED VECTOR PROCESSORS, SYSTEMS, AND METHODS - Embodiments disclosed herein include vector processing engines (VPEs) having programmable data path configurations for providing multi-mode vector processing. Related vector processors, systems, and methods are also disclosed. The VPEs include a vector processing stage(s) configured to process vector data according to a vector instruction executed in the vector processing stage. Each vector processing stage includes vector processing blocks each configured to process vector data based on the vector instruction being executed. The vector processing blocks are capable of providing different vector operations for different types of vector instructions based on data path configurations. Data paths of the vector processing blocks are programmable, and can be reprogrammed, to process vector data differently according to the particular vector instruction being executed. In this manner, a VPE can be provided with its data path configurations programmable to execute different types of functions according to the vector instruction being executed.	09-18-2014
20140281371	TECHNIQUES FOR ENABLING BIT-PARALLEL WIDE STRING MATCHING WITH A SIMD REGISTER - Various embodiments are generally directed to overcoming limitations of vector registers in their use with bit-parallel string matching algorithms. An apparatus includes a processor element; and logic to receive a pattern comprising a first string of elements to employ in a string matching operation, instantiate a test bitmask in a first vector register of the processor element, the first vector register comprising multiple lanes, copy bit values at MSB bit positions of the multiple lanes of the first vector register to a first vector mask as a vector value, bit-shift the vector value as a scalar value, bit-shift the first vector register, employ the vector value of the first vector mask to selectively fill LSB bit positions of lanes of a second vector register of the processor element; and OR the second vector register into the first vector register. Other embodiments are described and claimed.	09-18-2014
20140281372	VECTOR INDIRECT ELEMENT VERTICAL ADDRESSING MODE WITH HORIZONTAL PERMUTE - An example method for placing one or more element data values into an output vector includes identifying a vertical permute control vector including a plurality of elements, each element of the plurality of elements including a register address. The method also includes for each element of the plurality of elements, reading a register address from the vertical permute control vector. The method further includes retrieving a plurality of element data values based on the register address. The method also includes identifying a horizontal permute control vector including a set of addresses corresponding to an output vector. The method further includes placing at least some of the retrieved element data values of the plurality of element data values into the output vector based on the set of addresses in the horizontal permute control vector.	09-18-2014
20140281373	DIGITAL SIGNAL PROCESSOR AND BASEBAND COMMUNICATION DEVICE - A digital signal processor has a vector execution unit arranged to execute instructions on multiple data in the form of a vector, comprising a local queue arranged to receive instructions from a program memory and to hold them in the local queue until a predefined condition is fulfilled. The local queue is arranged to receive a sequence of instructions at a time from the program memory and to store the last N instructions, N being an integer. A vector controller in the vector execution unit comprises queue control means arranged to make the local queue repeat a sequence of M instructions stored in the local queue, M being an integer less than or equal to N, a number K of times. This reduces the time the vector execution unit is kept waiting because of IDLE commands in the program memory.	09-18-2014
20140289495	ENHANCED PREDICATE REGISTERS - Systems, apparatuses and methods for utilizing enhanced predicate registers which specify the element width and which elements are to be processed. The predicate size is dynamic, depending on the contents of the enhanced predicate register used for an instruction rather than being a static quality of a specific instruction. Specifying the element size in the enhanced predicate registers results in fewer instructions in an instruction set.	09-25-2014
20140344549	DIGITAL SIGNAL PROCESSOR AND BASEBAND COMMUNICATION DEVICE - The invention relates to a digital signal processor comprising a processor core, an integer execution unit and a number of vector execution units, said digital signal processor comprising a program memory arranged to hold instructions for the execution units and issue logic for issuing instructions. The digital signal processor comprises an issue control unit for selecting at least two execution units that are to receive and execute the same instruction at the same time, and logic for sending the instruction to said at least two execution units.	11-20-2014
20140359252	DIGITAL SIGNAL PROCESSOR - A multicore processor is achieved by a processor assembly, comprising a first processor having a first core and at least a first and a second unit, each being selected from the group of vector execution units, memory units and accelerators, said first core and first and second units being interconnected by a first network, and a second processor having a second core wherein the first core is arranged to enable the second core to control at least one of the units in the first processor. Each processor generally comprises a combination of execution units, memory units and accelerators, which may be controlled and/or accessed by units in the other processor.	12-04-2014
20140372728	VECTOR EXECUTION UNIT FOR DIGITAL SIGNAL PROCESSOR - A vector execution unit for use in a digital signal processor enables a new set of instructions. The unit comprises a first input port for receiving at least a first input data vector, an instruction decoder, a vector output port, and at least one data-path. The instruction decoding unit is arranged to control the data-path to perform a comparison related to the first input data vector, and the processor comprises an integer port arranged to output the result of the comparison in the form of a decision vector to a memory unit or a functional unit in the digital signal processor. Alternatively or in addition, the integer port is also arranged to receive a decision vector of integer data, and the instruction decoding unit is arranged to control the data-path to process the first input data in dependence of the value of the integer data.	12-18-2014
20150046673	VECTOR PROCESSOR - A vector processor is disclosed including a variety of variable-length instructions. Computer-implemented methods are disclosed for efficiently carrying out a variety of operations in a time-conscious, memory-efficient, and power-efficient manner. Methods for more efficiently managing a buffer by controlling the threshold based on the length of delay line instructions are disclosed. Methods for disposing multi-type and multi-size operations in hardware are disclosed. Methods for condensing look-up tables are disclosed. Methods for in-line alteration of variables are disclosed.	02-12-2015
20150046674	LOW POWER COMPUTATIONAL IMAGING - The present application discloses a computing device that can provide a low-power, highly capable computing platform for computational imaging. The computing device can include one or more processing units, for example one or more vector processors and one or more hardware accelerators, an intelligent memory fabric, a peripheral device, and a power management module. The computing device can communicate with external devices, such as one or more image sensors, an accelerometer, a gyroscope, or any other suitable sensor devices.	02-12-2015
20150046675	APPARATUS, SYSTEMS, AND METHODS FOR LOW POWER COMPUTATIONAL IMAGING - The present application discloses a computing device that can provide a low-power, highly capable computing platform for computational imaging. The computing device can include one or more processing units, for example one or more vector processors and one or more hardware accelerators, an intelligent memory fabric, a peripheral device, and a power management module. The computing device can communicate with external devices, such as one or more image sensors, an accelerometer, a gyroscope, or any other suitable sensor devices.	02-12-2015
20150074373	SCATTER USING INDEX ARRAY AND FINITE STATE MACHINE - Methods and apparatus are disclosed using an index array and finite state machine for scatter/gather operations. Embodiments of the apparatus may comprise decode logic to decode scatter/gather instructions and generate micro-operations. An index array holds a set of indices and a corresponding set of mask elements. A finite state machine facilitates the scatter operation. Address generation logic generates an address from an index of the set of indices for at least each of the corresponding mask elements having a first value. Storage is allocated in a buffer for each of the set of addresses being generated. Data elements corresponding to the set of addresses being generated are copied to the buffer. Addresses from the set are accessed to store data elements if a corresponding mask element has said first value and the mask element is changed to a second value responsive to completion of their respective stores.	03-12-2015
20150074374	METHOD AND APPARATUS FOR ASYNCHRONOUS PROCESSOR WITH AUXILIARY ASYNCHRONOUS VECTOR PROCESSOR - An asynchronous processing system comprising an asynchronous scalar processor and an asynchronous vector processor coupled to the scalar processor. The asynchronous scalar processor is configured to perform processing functions on input data and to output instructions. The asynchronous vector processor is configured to perform processing functions in response to a very long instruction word (VLIW) received from the scalar processor. The VLIW comprises a first portion and a second portion, at least the first portion comprising a vector instruction.	03-12-2015
20150089190	Predicate Attribute Tracker - In an embodiment, a processor includes a register attribute tracker configured to track one or more attributes corresponding to registers. The register attribute tracker may track the attributes associated with the registers when those registers are used as output registers of instructions that explicitly define the attributes and, if the register attribute tracker has a tracked attribute associated with an input register of an instruction that does not explicitly define the attribute, the register attribute tracker may annotate the instruction with an attribute and/or associate an attribute with the output register of the instruction in the register attribute tracker.	03-26-2015
20150089191	Early Issue of Null-Predicated Operations - In an embodiment, a processor includes an issue circuit configured to issue instruction operations for execution. The issue circuit may be configured to monitor the source operands of the instruction operations, and to issue instruction operations for which the source operands (including predicate operands, as appropriate) are resolved. Additionally, the issue circuit may be configured to detect a null predicate that indicates that none of the vector elements will be modified by a corresponding instruction operation. The issue circuit may be configured to issue the corresponding instruction operation with the null predicate even if other source operands are not yet resolved.	03-26-2015
20150089192	Dynamic Attribute Inference - In an embodiment, a processor may be configured to dynamically infer one or more attributes of input and/or output registers of an instruction, given the attributes corresponding to at least one input register. The inference may be made at the issue circuit/stage of the processor, for those registers that do not have attribute information at the issue circuit/stage. In an embodiment, the processor may also include a register attribute tracker configured to track attributes of registers prior to the issue stage of the processor pipeline. The processor may feed back, to the register attribute tracker, inferred attributes and the register addresses of the registers to which the inferred attributes apply. The register attribute tracker may be configured to associate the inferred attribute with the identified register. The register attribute tracker may also be configured to infer input register attributes from other input register attributes.	03-26-2015
20150100755	DATA PROCESSING APPARATUS AND METHOD FOR CONTROLLING PERFORMANCE OF SPECULATIVE VECTOR OPERATIONS - A data processing apparatus and a method of controlling performance of speculative vector operations are provided. The apparatus comprises processing circuitry for performing a sequence of speculative vector operations on vector operands, each vector operand comprising a plurality of vector elements, and speculation control circuitry for maintaining a speculation width indication indicating the number of vector elements of each vector operand to be subjected to the speculative vector operations. The speculation width indication is set to an initial value prior to performance of the sequence of speculative vector operations. The processing circuitry generates progress indications during performance of the sequence of speculative vector operations, and the speculation control circuitry detects, with reference to the progress indications and speculation reduction criteria, presence of a speculation reduction condition. The speculation reduction condition is a condition indicating that a reduction in the speculation width indication is expected to improve at least one performance characteristic of the data processing apparatus relative to continued operation without the reduction in the speculation width indication. The speculation control circuitry is responsive to detection of the speculation reduction condition to reduce the speculation width indication. This can significantly increase performance (for example in terms of throughput and/or energy consumption) when performing speculative vector operations.	04-09-2015
20150113246	INSTRUCTION AND LOGIC TO PROVIDE CONVERSIONS BETWEEN A MASK REGISTER AND A GENERAL PURPOSE REGISTER OR MEMORY - Instructions and logic provide conversions between a mask register and a general purpose register or memory. In some embodiments, responsive to an instruction specifying a destination operand, a mask length corresponding to a number of mask data fields, and a source operand, values are read from data fields in the source operand, corresponding to the specified mask length, and stored to corresponding data fields in the destination operand specified by the instruction, wherein one of the source or the destination operands is a mask register. Values indicative of masked vector elements may be stored to any data fields in the destination operand other than the number of data fields corresponding to the specified mask length. For some embodiments, the other one of the source or the destination operands may be a general purpose register or a memory location.	04-23-2015
20150121036	CRYPTOGRAPHIC SUPPORT INSTRUCTIONS - A data processing system	04-30-2015
20150143076	VECTOR PROCESSING ENGINES (VPEs) EMPLOYING DESPREADING CIRCUITRY IN DATA FLOW PATHS BETWEEN EXECUTION UNITS AND VECTOR DATA MEMORY TO PROVIDE IN-FLIGHT DESPREADING OF SPREAD-SPECTRUM SEQUENCES, AND RELATED VECTOR PROCESSING INSTRUCTIONS, SYSTEMS, AND METHODS - Vector processing engines (VPEs) employing merging circuitry in data flow paths between execution units and vector data memory to provide in-flight merging of output vector data stored to vector data memory are disclosed. Related vector processing instructions, systems, and methods are also disclosed. Merging circuitry is provided in data flow paths between execution units and vector data memory in the VPE. The merging circuitry is configured to merge an output vector data sample set from execution units as a result of performing vector processing operations in-flight while the output vector data sample set is being provided over the output data flow paths from the execution units to the vector data memory to be stored. The merged output vector data sample set is stored in a merged form in the vector data memory without requiring additional post-processing steps, which may delay subsequent vector processing operations to be performed in execution units.	05-21-2015
20150143077	VECTOR PROCESSING ENGINES (VPEs) EMPLOYING MERGING CIRCUITRY IN DATA FLOW PATHS BETWEEN EXECUTION UNITS AND VECTOR DATA MEMORY TO PROVIDE IN-FLIGHT MERGING OF OUTPUT VECTOR DATA STORED TO VECTOR DATA MEMORY, AND RELATED VECTOR PROCESSING INSTRUCTIONS, SYSTEMS, AND METHODS - Vector processing engines (VPEs) employing merging circuitry in data flow paths between execution units and vector data memory to provide in-flight merging of output vector data stored to vector data memory are disclosed. Related vector processing instructions, systems, and methods are also disclosed. Merging circuitry is provided in data flow paths between execution units and vector data memory in the VPE. The merging circuitry is configured to merge an output vector data sample set from execution units as a result of performing vector processing operations in-flight while the output vector data sample set is being provided over the output data flow paths from the execution units to the vector data memory to be stored. The merged output vector data sample set is stored in a merged form in the vector data memory without requiring additional post-processing steps, which may delay subsequent vector processing operations to be performed in execution units.	05-21-2015
20150143078	VECTOR PROCESSING ENGINES (VPEs) EMPLOYING A TAPPED-DELAY LINE(S) FOR PROVIDING PRECISION FILTER VECTOR PROCESSING OPERATIONS WITH REDUCED SAMPLE RE-FETCHING AND POWER CONSUMPTION, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS - Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision filter vector processing operations with reduced sample re-fetching and power consumption are disclosed. Related vector processor systems and methods are also disclosed. The VPEs are configured to provide filter vector processing operations. To minimize re-fetching of input vector data samples from memory to reduce power consumption, a tapped-delay line(s) is included in the data flow paths between a vector data file and execution units in the VPE. The tapped-delay line(s) is configured to receive and provide input vector data sample sets to execution units for performing filter vector processing operations. The tapped-delay line(s) is also configured to shift the input vector data sample set for filter delay taps and provide the shifted input vector data sample set to execution units, so the shifted input vector data sample set does not have to be re-fetched during filter vector processing operations.	05-21-2015
20150143079	VECTOR PROCESSING ENGINES (VPEs) EMPLOYING TAPPED-DELAY LINE(S) FOR PROVIDING PRECISION CORRELATION / COVARIANCE VECTOR PROCESSING OPERATIONS WITH REDUCED SAMPLE RE-FETCHING AND POWER CONSUMPTION, AND RELATED VECTOR PROCESSOR SYSTEMS AND METHODS - Vector processing engines (VPEs) employing a tapped-delay line(s) for providing precision correlation/covariance vector processing operations with reduced sample re-fetching and/or power consumption are disclosed. The VPEs disclosed herein are configured to provide correlation/covariance vector processing operations, such as code division multiple access (CDMA) correlation/covariance vector processing operations as a non-limiting example. A tapped-delay line(s) is included in the data flow paths between memory and execution units in the VPE. The tapped-delay line(s) is configured to receive and provide an input vector data sample set to execution units for performing correlation/covariance vector processing operations. The tapped-delay line(s) is also configured to shift the input vector data sample set for each filter delay tap and provide the shifted input vector data sample set to the execution units, so the shifted input vector data sample set need not be re-fetched from the vector data file during the filter vector processing operations.	05-21-2015
20150143080	VECTOR CHECKSUM INSTRUCTION - Elements from a second operand are added together one-by-one to obtain a first result. The adding includes performing one or more end around carry add operations. The first result is placed in an element of a first operand of the instruction. After each addition of an element, a carry out of a chosen position of the sum, if any, is added to a selected position in an element of the first operand.	05-21-2015
20150149744	DATA PROCESSING APPARATUS AND METHOD FOR PERFORMING VECTOR PROCESSING - A data processing apparatus and method are provided for processing execution threads, where each execution thread specifies at least one instruction. The data processing apparatus has a vector processing unit providing a plurality M of lanes of parallel processing, within each lane the vector processing unit being configured to perform a processing operation on a data element input to that lane for each of one or more input operands. A vector instruction is received that is specified by a group of the execution threads, that vector instruction identifying an associated processing operation and also providing an indication of the data elements of each input operand that are to be subjected to that associated processing operation. Vector merge circuitry then determines, based on that information, a required number of lanes of parallel processing for performing the associated processing operation. If it is determined that the required number of lanes is less than or equal to half the available number of lanes within the vector processing unit, then the vector merge circuitry allocates a plurality of the execution threads of the group to the vector processing unit such that each execution thread in that plurality is allocated different lanes amongst the available lanes of parallel processing. As a result, the vector processing unit then performs the associated processing operation in parallel for each of the plurality of execution threads, significantly increasing performance.	05-28-2015
20150355905	VECTOR MEMORY ACCESS INSTRUCTIONS FOR BIG-ENDIAN ELEMENT ORDERED AND LITTLE-ENDIAN ELEMENT ORDERED COMPUTER CODE AND DATA - Embodiments relate to vector memory access instructions for big-endian (BE) element ordered computer code and little-endian (LE) element ordered computer code. An aspect includes determining a mode of a computer system comprising one of a BE mode and an LE mode. Another aspect includes determining a code type comprising one of BE code and LE code. Another aspect includes determining a data type of data in a main memory that is associated with the object code comprising one of BE data and LE data. Another aspect includes based on the mode, code type, and data type, inserting a memory access instruction into the object code to perform a memory access associated with the vector in the object code, such that the memory access instruction performs element ordering of elements of the vector, and data ordering within the elements of the vector, in accordance with the determined mode, code type, and data type.	12-10-2015
20150355906	VECTOR MEMORY ACCESS INSTRUCTIONS FOR BIG-ENDIAN ELEMENT ORDERED AND LITTLE-ENDIAN ELEMENT ORDERED COMPUTER CODE AND DATA - Embodiments relate to vector memory access instructions for big-endian (BE) element ordered computer code and little-endian (LE) element ordered computer code. An aspect includes determining a mode of a computer system comprising one of a BE mode and an LE mode. Another aspect includes determining a code type comprising one of BE code and LE code. Another aspect includes determining a data type of data in a main memory that is associated with the object code comprising one of BE data and LE data. Another aspect includes based on the mode, code type, and data type, inserting a memory access instruction into the object code to perform a memory access associated with the vector in the object code, such that the memory access instruction performs element ordering of elements of the vector, and data ordering within the elements of the vector, in accordance with the determined mode, code type, and data type.	12-10-2015
20160092218	Conditional Stop Instruction with Accurate Dependency Detection - In an embodiment, a processor may implement a conditional stop instruction that includes a first predicate vector identifying the active elements of the instruction, a second predicate vector indicating true and false results for a conditional expression within a loop that is being vectorized, and a source operand specifying which combinations in the true and false results may indicate a dependency. The conditional stop instruction may generate a vector result indicating vector elements that have a dependency on a prior vector element, as well as an identification of which element position the dependency is on. More particularly, dependencies may be detected only on active elements as indicated by the first predicate vector. False dependencies that may occur due to inactive elements may be avoided, which may improve performance and/or provide for correct functional operation.	03-31-2016
20160092234	METHOD AND APPARATUS FOR SPECULATIVE VECTORIZATION - An apparatus and method for speculative vectorization. For example, one embodiment of a processor comprises: a queue comprising a set of locations for storing addresses associated with vectorized memory access instructions; and execution logic to execute a first vectorized memory access instruction to access the queue and to compare a new address associated with the first vectorized memory access instruction with existing addresses stored within a specified range of locations within the queue to detect whether a conflict exists, the existing addresses having been previously stored responsive to one or more prior vectorized memory access instructions.	03-31-2016
20160092398	Conditional Termination and Conditional Termination Predicate Instructions - In an embodiment, a processor may implement a vector instruction set including a conditional termination instruction (CTerm). The CTerm instruction may take two source operands and compare them according to a specified condition, updating flags as a result of the instruction. The flags may be used to affect predicate vector generation to control vectorized loop execution. In an embodiment, the vector instruction set may also include a conditional termination predicate instruction (CTPred). The CTPred instruction may take a pair of predicate vectors and a set of flags as operands, and may generate: a predicate vector to control parallel processing of vector elements, and a set of flags to control further loop processing. Either instruction may be used to efficiently manage vector loops in various embodiments, or the instructions may be used together.	03-31-2016
20160092400	Instruction and Logic for a Vector Format for Processing Computations - A processor includes a front end to fetch an instruction. The instruction is to calculate a data point using inputs from a plurality of adjacent source data in a plurality of dimensions. The processor includes a decoder to decode the instruction. The processor also includes a core to, based on the decoded instruction, perform a plurality of tabular vector read operations to read the plurality of adjacent source data and perform a tabular vector calculation to execute the instruction. The tabular vector calculation is based upon results of performing the plurality of tabular vector read operations. The core is further to write results of the tabular vector calculation.	03-31-2016
20160094241	APPARATUS AND METHOD FOR VECTOR COMPRESSION - An apparatus and method are described for performing vector compression. For example, one embodiment of a processor comprises: vector compression logic to compress a source vector comprising a plurality of valid data elements and invalid data elements to generate a destination vector in which valid data elements are stored contiguously on one side of the destination vector, the vector compression logic to utilize a bit mask associated with the source vector and comprising a plurality of bits, each bit corresponding to one of the plurality of data elements of the source vector and indicating whether the data element comprises a valid data element or an invalid data element, the vector compression logic to utilize indices of the bit mask and associated bit values of the bit mask to generate a control vector; and shuffle logic to shuffle/permute the data elements of the source vector to the destination vector in accordance with the control vector.	03-31-2016
20160139921	VECTOR INSTRUCTION TO COMPUTE COORDIANTE OF NEXT POINT IN A Z-ORDER CURVE - In one embodiment, a processor includes machine level instructions to compute a next point in a Z-order curve of a specified dimension for a specified coordinate. A processor decode unit is configured to decode an instruction having a source and immediate operands including a first z-curve index, the specified dimension and the specified coordinate. A processor execution unit is configured to execute the decoded instruction to compute the coordinate of the next point by incrementing the coordinate value associated with the specified coordinate to generate a second z-curve index including the incremented coordinate.	05-19-2016
20160179530	INSTRUCTION AND LOGIC TO PERFORM A VECTOR SATURATED DOUBLEWORD/QUADWORD ADD	06-23-2016
20160179531	COMPILER METHOD FOR GENERATING INSTRUCTIONS FOR VECTOR OPERATIONS IN A MULTI-ENDIAN INSTRUCTION SET	06-23-2016
20160179537	METHOD AND APPARATUS FOR PERFORMING REDUCTION OPERATIONS ON A SET OF VECTOR ELEMENTS	06-23-2016
20160188334	HARDWARE APPARATUSES AND METHODS RELATING TO ELEMENTAL REGISTER ACCESSES - Methods and apparatuses relating to a vector instruction with a register operand with an elemental offset are described. In one embodiment, a hardware processor includes a decode unit to decode a vector instruction with a register operand with an elemental offset to access a first number of elements in a register specified by the register operand, wherein the first number is a total number of elements in the register minus the elemental offset, access a second number of elements in a next logical register, wherein the second number is the elemental offset, and combine the first number of elements and the second number of elements as a data vector, and an execution unit to execute the vector instruction on the data vector.	06-30-2016
20160188530	METHOD AND APPARATUS FOR PERFORMING A VECTOR PERMUTE WITH AN INDEX AND AN IMMEDIATE - An apparatus and method for performing a vector permute. For example, one embodiment of a processor comprises: a source vector register to store a plurality of source data elements; a destination vector register to store a plurality of destination data elements; a control vector register to store a plurality of control data elements, each control data element corresponding to one of the destination data elements and including an N bit value indicating whether a source data element is to be copied to the corresponding destination data element; vector permute logic to compare the N bit value of each control data element to an N bit portion of an immediate to determine whether to copy a source data element to the corresponding destination data element, wherein if the N bit values match, then the vector permute logic is to identify a source data element using an index value included in the control data element and to responsively copy the source data element to the corresponding destination data element in the destination vector register.	06-30-2016
20190146793	Apparatus and Methods for Vector Based Transcendental Functions	05-16-2019
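
Illustrative sketch (not taken from any listed application): the vector compression summarized in entry 20160094241 above can be modeled in software as two steps, deriving a control vector from the bit mask and then shuffling source elements according to it. The function names `control_vector` and `compress`, the choice to pack valid elements toward index 0, and the `fill` value for unused destination slots are all assumptions made for this example; the abstract does not specify them.

```python
def control_vector(mask):
    """Return the indices of the set mask bits, in ascending order.

    Hypothetical helper: models "utilize indices of the bit mask and
    associated bit values to generate a control vector".
    """
    return [i for i, bit in enumerate(mask) if bit]


def compress(source, mask, fill=0):
    """Pack the valid elements of `source` contiguously at the low end.

    `mask[i]` is truthy when `source[i]` is a valid element. Slots left
    over in the destination are set to `fill` (an assumption; the
    abstract does not say what they hold).
    """
    if len(source) != len(mask):
        raise ValueError("source and mask must have the same length")
    control = control_vector(mask)
    # Shuffle/permute step: destination slot k takes source[control[k]].
    packed = [source[i] for i in control]
    return packed + [fill] * (len(source) - len(packed))


if __name__ == "__main__":
    # Elements 0, 2 and 3 are valid, so they are packed to the front.
    print(compress([10, 20, 30, 40], [1, 0, 1, 1]))
```

Separating control-vector generation from the shuffle mirrors the split between the compression logic and the shuffle logic in the abstract; a hardware implementation would perform both on all lanes in parallel rather than in a loop.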
