Entries |
Document | Title | Date |
20080209164 | Microprocessor Architectures - A microprocessor architecture comprises a plurality of processing elements arranged in a single instruction multiple data (SIMD) array, wherein each processing element includes a plurality of execution units, each of which is operable to process an instruction of a particular instruction type, a serial processor which includes a plurality of execution units, each of which is operable to process an instruction of a particular instruction type, and an instruction controller operable to receive a plurality of instructions, and to distribute received instructions to the execution units in dependence upon the instruction types of the received instructions. The execution units of the serial processor are operable to process respective instructions in parallel. | 08-28-2008 |
20080209165 | SIMD MICROPROCESSOR, IMAGE PROCESSING APPARATUS INCLUDING SAME, AND IMAGE PROCESSING METHOD USED THEREIN - A SIMD microprocessor, which can be included in an image processing apparatus using an image processing method used therein, includes a global processor and multiple processor elements controlled by the global processor. Each single processor element of the multiple processor elements includes multiple operation units. The global processor is configured to control the multiple processing elements to uniformly change a configuration of the multiple operation units in the single processor element to determine a number of data units of operation according to the multiple operation units either operated individually or in cooperation with each other in the single processor element and a width of data processed per data unit of operation performed in the single processor element. A processor element number is assigned per data unit of operation to the single processor element to use for executing an operation. | 08-28-2008 |
20080215851 | Method and arrangement for the power-efficient control of processors - A method is provided for the functional control of program and/or data flows in digital signal processors and processors, which have respective closed and separated modules for program and data flow control, working in parallel with computers. The method enables a power-efficient adaptation of the signal processing with the applied SIMD command-type in the individual paths and minimizes the occurrence of NOP-commands with which the VLIW-architecture of the processor must be supplied. The adaptation of the signal processing is achieved by individually controlling the parallel signal processing of the processor in the data paths (DP) which respectively belong to a first and second slice. For this purpose, a single slice halt outputted from an SSM register bank switches the register clockline according to state-dependent signal processing. | 09-04-2008 |
20080294871 | MULTIDIMENSIONAL PROCESSOR ARCHITECTURE - A processor architecture includes a number of processing elements for treating input signals. The architecture is organized according to a matrix including rows and columns, the columns of which each include at least one microprocessor block having a computational part and a set of associated processing elements that are able to receive the same input signals. The number of associated processing elements is selectively variable in the direction of the column so as to exploit the parallelism of said signals. Additionally, the processor architecture of the present invention enables dynamic switching between instruction parallelism and data parallel processing typical of vectorial functionality. The architecture can be scaled in various dimensions in an optimal configuration for the algorithm to be executed. | 11-27-2008 |
20080301403 | SYSTEM FOR INTEGRITY PROTECTION FOR STANDARD 2N-BIT MULTIPLE SIZED MEMORY DEVICES - An apparatus including a first circuit and a second circuit. The first circuit may be configured to generate one or more command signals, a read data path control signal and one or more write data path control signals in response to an integrity protection control signal and one or more arbitration signals. The second circuit may be configured to write data to a memory and read data from the memory in response to the one or more command signals, the read data path control signal and the one or more write data path control signals. In a first mode, the data may be written and read without integrity protection. In a second mode the data may be written and read with integrity protection, and the integrity protection is written and read separately from the data. | 12-04-2008 |
20080313423 | Distributed Memory Type Information Processing System - An information processing system includes a plurality of PMM and data transmission paths for connection between the PMM and transmitting a value of a PMM to another PMM. A memory of each PMM holds a list of values of first items arranged in the ascending order or descending order without overlap and/or a list of values of the second item to be shared. A memory module of each PMM transmits a value contained in the value list to another PMM, receives a value contained in the value list from the another PMM, references the value list of the first item and the value list of the second item of the another PMM, and generates a list of common values considering the values contained in the value lists of the first item and the second item of all the other PMM. | 12-18-2008 |
20080320273 | Interconnections in Simd Processor Architectures - A single instruction multiple data (SIMD) processor ( | 12-25-2008 |
20090013150 | SIMD MICROPROCESSOR AND DATA TRANSFER METHOD FOR USE IN SIMD MICROPROCESSOR - A disclosed SIMD microprocessor includes plural processor elements each having n arithmetic circuits and n registers configured to temporarily store data pieces to be input to the arithmetic circuits, n being a natural number equal to or greater than 2; and a control circuit configured to determine an arrangement order of the processor elements and an arrangement order of the arithmetic circuits in the processor elements and determine whether to use the n arithmetic circuits as a single arithmetic circuit or as n arithmetic circuits. Each processor element further includes n shifter pairs each including a PE shifter and a bit shifter; and n shift data selection circuits configured to select arbitrary data pieces from the data pieces in the shifter pairs, perform bit extension on the data pieces, and transfer the data pieces to the arithmetic circuits. | 01-08-2009 |
20090013151 | SIMD TYPE MICROPROCESSOR - An SIMD type microprocessor is disclosed. The SIMD type microprocessor includes plural PEs (processor elements) each of which provides an ALU (arithmetic and logic unit) for lower-order bits, an ALU for upper-order bits, a control circuit for lower-order bits, a control circuit for upper-order bits, a range determining circuit for lower-order bits, and a range determining circuit for upper-order bits. The SIMD type microprocessor further includes a global processor, a range designation bus for lower-order bits which connects the global processor to the range determining circuit for lower-order bits, and a range designation bus for upper-order bits which connects the global processor to the range determining circuit for upper-order bits. The global processor instructs the range determining circuits to designate corresponding ranges to be operated on by the corresponding ALUs via the corresponding range designation buses so that the ALU for lower-order bits and the ALU for upper-order bits can be operated separately. | 01-08-2009 |
20090013152 | COMPUTING UNIT AND IMAGE FILTERING DEVICE - A processor capable of performing filter processing at high speed is provided. A computing unit comprises a computer for performing filter processing. Data is supplied to the computer from an internal register configured by a flip-flop. Data read from the internal register is output to a shift register, and the data is supplied to the computer per cycle. The computing unit also comprises a mechanism for changing the filter computing direction according to a motion vector, thereby preventing the performance loss caused by branch commands by performing horizontal filtering and vertical filtering with the same command. | 01-08-2009 |
20090024832 | PROCESS FOR THE AUTOMATIC PRODUCTION OF A PROCESSOR FROM A MACHINE DESCRIPTION - The invention is based on the task to undertake machine descriptions, with which an automated optimal hardware design of SIMD processors can be carried out. This is solved by the fact that functional units are selected from a criterion in the machine description, which is vector processible. A first or second reduced functional unit are selectively defined from a respective vector-processing functional unit, in which the reduced functional units process only a data element of a vectoral value. All reduced functional units, which use common control signals for the processing of a respective data element belonging to the vectoral value, are condensed to a disk. Reduced functional units, which process the same data elements in a sequence at least indirectly, are condensed to a disk module. The disk is reproduced with the contained reduced functional units so often that all reduced functional units represent the functionality of their respective selected vector-processing functional unit. | 01-22-2009 |
20090106528 | Parallel Image Processing System Control Method And Apparatus - To reduce the required amount of program codes when processing the whole image in a one-dimensional SIMD parallel image processing system having a smaller number of PEs than the number of pixels in the width direction of the image to be processed. A controller for controlling a PE array includes a command repetitive-execution part, which includes an operand converting part, a memory address converting part, and an operation code converting part. When a command fetching/decoding part reads and executes program codes stored in a program memory, the repetitive-execution part determines the program codes to cause the operand converting part, memory address converting part and operation code converting part to perform conversions in accordance with the command, thereby performing a repetitive execution of the one-command program description adaptive to a plurality of related pixels assigned to the PEs, whereby the program code amount can be reduced. | 04-23-2009 |
20090125702 | SIMD processor and addressing method - A single instruction, multiple data (SIMD) processor including a plurality of addressing register sets, used to flexibly calculate effective operand source and destination memory addresses, is disclosed. Two or more address generators calculate effective addresses using the register sets. Each register set includes a pointer register and a scale register. An address generator forms effective addresses from a selected register set's pointer register and scale register, and an offset. For example, the effective memory address may be formed by multiplying the scale value by an offset value and summing the pointer and the scale value multiplied by the offset value. | 05-14-2009 |
20090132785 | SIMD processor executing min/max instructions - A SIMD processor responds to a single min/max instruction to find the minimum or maximum valued data unit in an array of data units. The determined minimum/maximum value and an associated index value thereto may be output. Alternatively, the value of a data unit in another array may be output at a corresponding location. A further single instruction executable by the SIMD processor may be applied to results obtained using such a single min/max instruction, to allow such instructions to operate on two-dimensional arrays. | 05-21-2009 |
20090132786 | METHOD AND SYSTEM FOR LOCAL MEMORY ADDRESSING IN SINGLE INSTRUCTION, MULTIPLE DATA COMPUTER SYSTEM - A single instruction, multiple data (“SIMD”) computer system includes a central control unit coupled to 256 processing elements (“PEs”) and to 32 static random access memory (“SRAM”) devices. Each group of eight PEs can access respective groups of eight columns in a respective SRAM device. Each PE includes a local column address register that can be loaded through a data bus of the respective PE. A local column address stored in the local column address register is applied to an AND gate, which selects either the local column address or a column address applied to the AND gate by the central control unit. As a result, the central control unit can globally access the SRAM device, or a specific one of the eight columns that can be accessed by each PE can be selected locally by the PE. | 05-21-2009 |
20090144523 | MULTIPLE-SIMD PROCESSOR FOR PROCESSING MULTIMEDIA DATA AND ARITHMETIC METHOD USING THE SAME - A multiple-single instruction multiple data (SIMD) processor and an arithmetic method using the same are disclosed. When various arithmetic operations should be individually carried out by SIMD arithmetic units, control right is sub-divided to perform the arithmetic operations, such that the time of the arithmetic operations can be shortened and the efficiency thereof can be raised. When sub-divided control is not required, the control right is withdrawn and the arithmetic operations are carried out using a minimum number of program memories and a minimum number of SIMD arithmetic units, such that memory and power consumption thereof can be reduced. | 06-04-2009 |
20090187734 | Efficient Texture Processing of Pixel Groups with SIMD Execution Unit - A circuit arrangement and method perform concurrent texture processing of groups of pixels with a single instruction multiple data (SIMD) execution unit to improve the utilization of the SIMD execution unit when performing scalar operations associated with a texture processing algorithm. In addition, when utilized in connection with a multi-threaded SIMD execution unit, groups of pixels may be concurrently processed in different threads executed by the SIMD execution unit to further maximize the utilization of the SIMD execution unit by reducing the adverse effects of dependencies in scalar and/or vector operations incorporated into a texture processing algorithm. | 07-23-2009 |
20090210653 | METHOD AND DEVICE FOR TREATING AND PROCESSING DATA - Procedures and methods for managing and transmitting data within multidimensional systems of transmitters and receivers are described. Splitting a data stream into a plurality of independent branches and subsequent merging of the individual branches to form a data stream is to be performable in a simple manner, the individual data streams being recombined in the correct sequence. This method may be particularly useful for executing reentrant code. The method is well suited, in particular, for configurable architectures; particular attention is paid to the efficient control of configuration and reconfiguration. | 08-20-2009 |
20090222644 | Merge Operations of Data Arrays Based on SIMD Instructions - A method and apparatus are provided to perform efficient merging operations of two or more streams of data by using SIMD instructions. Streams of data are merged together in parallel and with mitigated or removed conditional branching. The merge operations of the streams of data include Merge AND and Merge OR operations. | 09-03-2009 |
20090271591 | VECTOR SIMD PROCESSOR - A data processor whose level of operation parallelism is enhanced by composing floating-point inner product execution units to be compatible with single instruction multiple data (SIMD) and thereby enhancing the operation processing capability is made possible. An operating system that can significantly enhance the level of operation parallelism per instruction while maintaining the efficiency of the floating-point length-4 vector inner product execution units is to be implemented. The floating-point length-4 vector inner product execution units are defined in the minimum width (32 bits for single precision) even where an extensive operating system becomes available, and compose the inner product execution units to be compatible with SIMD. The mutually augmenting effects of the inner product execution units and SIMD-compatible composition enhance the level of operation parallelism dramatically. Composition of the floating-point length-4 vector inner product execution units to calculate the sum of the inner product of length-4 vectors and scalar to be compatible with SIMD of four in parallel results in a processing capability of 32 FLOPS per cycle. | 10-29-2009 |
20090300325 | DATA PROCESSING SYSTEM, APPARATUS AND METHOD FOR PERFORMING FRACTIONAL MULTIPLY OPERATIONS - A data processing system, apparatus and method for performing fractional multiply operations are disclosed. The system includes a memory that stores instructions for SIMD operations and a processing core. The processing core includes registers that store operands for the fractional multiply operations. A coprocessor included in the processing core performs the fractional multiply operations on the operands and stores the result in a destination register that is also included in the processing core. | 12-03-2009 |
20100031002 | SIMD MICROPROCESSOR AND OPERATION METHOD - A disclosed SIMD microprocessor includes a processor element unit including multiple processor elements; and a global processor unit configured to interpret a program pre-recorded in a memory and supply a control signal to the processor element unit. Each of the processor elements includes an operational circuit; a first forwarding path for forwarding, to an input side of the operational circuit, an operation result obtained by the operational circuit; second forwarding paths, each of which forwards, to the input side of the operational circuit, an operation result obtained by an operational circuit of a neighboring processor element among the multiple processor elements; and a selection unit configured to select one of the first forwarding path and the second forwarding paths. | 02-04-2010 |
20100042808 | PROVISION OF EXTENDED ADDRESSING MODES IN A SINGLE INSTRUCTION MULTIPLE DATA (SIMD) DATA PROCESSOR - Executing a first memory access instruction with update by an N-bit processor includes accessing at least one source register of a plurality of registers, wherein the accessing includes accessing a first register, wherein each register of the plurality of registers includes a main portion of N bits and an extension portion of M bits, wherein the main portion of the first register includes a first address operand. The execution of the first instruction further includes forming a memory access address using the first address operand; using the memory access address as an address for a memory access; producing an updated address operand; and writing the updated address operand to the main portion of the first register. The producing includes accessing an extension portion of a source register of the at least one source register to obtain modifying information and using the modifying information in the producing an updated address operand. | 02-18-2010 |
20100077176 | METHOD AND APPARATUS FOR IMPROVED CALCULATION OF MULTIPLE DIMENSION FAST FOURIER TRANSFORMS - Apparatus and methods for storing data in a block to provide improved accessibility of the stored data in two or more dimensions. The data is loaded into memory macros constituting a row of the block such that sequential values in the data are loaded into sequential memory macros. The data loaded in the row is circularly shifted a predetermined number of columns relative to the preceding row. The circularly shifted row of data is stored, and the process is repeated until a predetermined number of rows of data are stored. A two dimensional (2D) data block is thereby formed. Each memory macro is a predetermined number of bits wide and each column is one memory macro wide. | 03-25-2010 |
20100082939 | TECHNIQUES FOR EFFICIENT IMPLEMENTATION OF BROWNIAN BRIDGE ALGORITHM ON SIMD PLATFORMS - Methods and apparatus for implementing Brownian Bridge algorithm on Single Instruction Multiple Data (SIMD) computing platforms are described. In one embodiment, a memory stores a plurality of data corresponding to an SIMD (Single Instruction, Multiple Data) instruction. A processor may include a plurality of SIMD lanes. Each of the plurality of the SIMD lanes may process one of the plurality of data stored in the memory in accordance with the SIMD instruction. Other embodiments are also described. | 04-01-2010 |
20100146241 | Modified-SIMD Data Processing Architecture - An apparatus and method for processing data includes an array of processing elements to simultaneously perform operations on multiple data elements using a single instruction. A grouping module assigns each processing element within the array to one of several groups. A modification module designates how each group of processing elements should handle the single instruction. This enables each group of processing elements to handle the single instruction differently. Each processing element is configured to handle the single instruction based on the group the processing element belongs to. | 06-10-2010 |
20100180100 | Matrix microprocessor and method of operation - A microprocessor includes a direct memory access (DMA) engine which is responsive to pairs of block indices associated with one or more blocks in a first logical plane and transfers the one or more blocks between the first logical plane, a second logical plane, and a physical memory space according to the pairs of block indices. The logical planes represent two-dimensional fields of data such as those found in images and videos. The microprocessor further comprises cache memory which updates its content with one or more cache-blocks which are in the neighborhood of the one or more blocks, improving the operation of the cache memory by increasing cache hits. The DMA engine may further operate on n-dimensional blocks in an n-dimensional logical space. The microprocessor further includes special-purpose instructions, operative on a single-instruction-multiple-data (SIMD) computation unit, especially tailored to perform matrix operations. The SIMD may share scalar operands with an onboard single-instruction-single-data (SISD) computation unit. | 07-15-2010 |
20100211758 | MICROPROCESSOR AND MEMORY-ACCESS CONTROL METHOD - A microprocessor that can perform sequential processing in data array unit includes: a load store unit that loads, when a fetched instruction is a load instruction for data, a data sequence including designated data from a data memory in memory width unit and specifies, based on an analysis result of the instruction, data scheduled to be designated in a load instruction in future; and a data temporary storage unit that stores use-scheduled data as the data specified by the load store unit. | 08-19-2010 |
20100241824 | PROCESSING ARRAY DATA ON SIMD MULTI-CORE PROCESSOR ARCHITECTURES - Techniques are disclosed for converting data into a format tailored for efficient multidimensional fast Fourier transforms (FFTs) on single instruction, multiple data (SIMD) multi-core processor architectures. The technique includes converting data from a multidimensional array stored in a conventional row-major order into SIMD format. Converted data in SIMD format consists of a sequence of blocks, where each block interleaves s rows such that SIMD vector processors may operate on s rows simultaneously. As a result, the converted data in SIMD format enables smaller-sized 1D FFTs to be optimized in SIMD multi-core processor architectures. | 09-23-2010 |
20100250897 | Addressing Device for Parallel Processor - The invention relates to a parallel processor which comprises elementary processors ( | 09-30-2010 |
20100274989 | ACCELERATING TRACEBACK ON A SIGNAL PROCESSOR - A method executed by an instruction set on a processor is described. The method includes providing a tbbit instruction, inputting a first index for the tbbit instruction, loading a second value for the tbbit instruction, wherein the second value comprises at least 2 | 10-28-2010 |
20100274990 | Apparatus and Method for Performing SIMD Multiply-Accumulate Operations - An apparatus and method for performing SIMD multiply-accumulate operations includes SIMD data processing circuitry responsive to control signals to perform data processing operations in parallel on multiple data elements. Instruction decoder circuitry is coupled to the SIMD data processing circuitry and is responsive to program instructions to generate the required control signals. The instruction decoder circuitry is responsive to a single instruction (referred to herein as a repeating multiply-accumulate instruction) having as input operands a first vector of input data elements, a second vector of coefficient data elements, and a scalar value indicative of a plurality of iterations required, to generate control signals to control the SIMD processing circuitry. In response to those control signals, the SIMD data processing circuitry performs the plurality of iterations of a multiply-accumulate process, each iteration involving performance of N multiply-accumulate operations in parallel in order to produce N multiply-accumulate data elements. For each iteration, the SIMD data processing circuitry determines N input data elements from said first vector and a single coefficient data element from the second vector to be multiplied with each of the N input data elements. The N multiply-accumulate data elements produced in a final iteration of the multiply-accumulate process are then used to produce N multiply-accumulate results. This mechanism provides a particularly energy efficient mechanism for performing SIMD multiply-accumulate operations, as for example are required for FIR filter processes. | 10-28-2010 |
20100318766 | PROCESSOR AND INFORMATION PROCESSING SYSTEM - A processor includes a processing unit capable of executing single-instruction multiple-data operations; a register file configured to store data that is to be supplied to the processing unit and to be subjected to operations, and a buffer provided separately from the register file, the buffer being a buffer where an integer “n” number of data columns each having a plurality of data elements are written on a column-by-column basis, and data elements at the same location are selected and read as “n” data elements from the respective “n” data columns, wherein the “n” data elements read from the buffer are supplied to the processing unit as data to be subjected to a single-instruction multiple-data operation. | 12-16-2010 |
20100332794 | UNPACKING PACKED DATA IN MULTIPLE LANES - Receiving an instruction indicating first and second operands. Each of the operands having packed data elements that correspond in respective positions. A first subset of the data elements of the first operand and a first subset of the data elements of the second operand each corresponding to a first lane. A second subset of the data elements of the first operand and a second subset of the data elements of the second operand each corresponding to a second lane. Storing result, in response to instruction, including: (1) in first lane, only lowest order data elements from first subset of first operand interleaved with corresponding lowest order data elements from first subset of second operand; and (2) in second lane, only highest order data elements from second subset of first operand interleaved with corresponding highest order data elements from second subset of second operand. | 12-30-2010 |
20110010524 | SIMD PROCESSOR ARRAY SYSTEM AND DATA TRANSFER METHOD THEREOF - There is provided an SIMD processor array system in which data can be efficiently transferred between processor elements located at different distances. The SIMD processor array system includes a control processor (CP) that is capable of issuing a plurality of instructions at the same time, and a PE array that includes a plurality of mutually-connected processing elements (PEs) to be controlled by the CP. The CP issues an inter-PE data shift instruction to each PE. According to the inter-PE data shift instruction, each PE performs a data sending operation of copying all the contents of a transfer data storing part of an adjoining PE to a transfer data storing part (MBF) of the own PE, and a data fetch operation of copying part or all of the contents of the MBF of the adjoining PE to a transfer data fetch and storing part (RBUF) of the own PE if part of the contents of the MBF of the adjoining PE coincide with the contents of an ID storing part (IDB) of the own PE. | 01-13-2011 |
20110029756 | Method and System for Decoding Low Density Parity Check Codes - A method for decoding a codeword in a data stream encoded according to a low density parity check (LDPC) code having an m×j parity check matrix H by initializing variable nodes with soft values based on symbols in the codeword, wherein a graph representation of H includes m check nodes and j variable nodes, and wherein a check node m provides a row value estimate to a variable node j and a variable node j provides a column value estimate to a check node m if H(m,j) contains a 1, computing row value estimates for each check node, wherein amplitudes of only a subset of column value estimates provided to the check node are computed, computing soft values for each variable node based on the computed row value estimates, determining whether the codeword is decoded based on the soft values, and terminating decoding when the codeword is decoded. | 02-03-2011 |
20110040952 | SIMD PARALLEL COMPUTER SYSTEM, SIMD PARALLEL COMPUTING METHOD, AND CONTROL PROGRAM - Uniforming of the processing load is efficiently realized. Each processing element configuring an SIMD parallel computer system includes a data storage module that stores data processed or transferred, a number-of-data-sets storage device that stores number of data sets, and a front data storage device that stores the front data. Each processing element further includes a control processor that compares the number of data sets stored in one processing element with the number of data sets stored in the own processing element, and issues a data distribution leveling instruction that designates an action for updating contents of the data storage module, the number-of-data-sets storage device, and the front data storage device according to a rule determined based on a comparison result of the own processing element and that of the other processing elements and an action for moving the data stored in the one processing element to the own processing element. | 02-17-2011 |
20110047349 | PROCESSOR AND PROCESSOR CONTROL METHOD - A processor includes a plurality of subfunctional units provided corresponding to respective slots of one or more pieces of operation result data including a plurality of slots for an SIMD operation; and an enable generating unit configured to, in each of the one or more pieces of the operation result data, compare a value of a predetermined slot with a value of a slot other than the predetermined slot, and disable one or more subfunctional units to which the value equal to the value of the predetermined slot is inputted, and the processor outputs the value of the predetermined slot as the value of the one or more subfunctional units which have been disabled. | 02-24-2011 |
20110072238 | Method for variable length opcode mapping in a VLIW processor - The present invention provides a method for reducing the program memory size required for a dual-issue processor with a scalar processor plus a SIMD vector processor. Coding the map of the next group of instruction pairs in a no-operation (NOP) instruction of the scalar and vector processor reduces the cases where one of the scalar or vector opcodes is a NOP opcode. A NOP for either the scalar or the vector processor defines the next 13 instructions as scalar-plus-vector, scalar-followed-by-scalar, or vector-followed-by-vector so that the execution unit performs accordingly until the next NOP or a branch instruction. | 03-24-2011 |
20110083000 | DATA PROCESSING ARCHITECTURES FOR PACKET HANDLING - A data processing architecture includes an input device that receives an incoming stream of data packets. A plurality of processing elements are operable to process data received from the input device. The input device is operable to distribute data packets in whole or in part to the processing elements in dependence upon the data processing bandwidth of the processing elements. | 04-07-2011 |
20110087860 | PARALLEL DATA PROCESSING SYSTEMS AND METHODS USING COOPERATIVE THREAD ARRAYS - Parallel data processing systems and methods use cooperative thread arrays (CTAs), i.e., groups of multiple threads that concurrently execute the same program on an input data set to produce an output data set. Each thread in a CTA has a unique identifier (thread ID) that can be assigned at thread launch time. The thread ID controls various aspects of the thread's processing behavior such as the portion of the input data set to be processed by each thread, the portion of an output data set to be produced by each thread, and/or sharing of intermediate results among threads. Mechanisms for loading and launching CTAs in a representative processing core and for synchronizing threads within a CTA are also described. | 04-14-2011 |
20110093682 | METHOD AND APPARATUS FOR PACKING DATA - An apparatus includes an instruction decoder, first and second source registers and a circuit coupled to the decoder to receive packed data from the source registers and to pack the packed data responsive to a pack instruction received by the decoder. A first packed data element and a second packed data element are received from the first source register. A third packed data element and a fourth packed data element are received from the second source register. The circuit packs a portion of each of the packed data elements into a destination register, with the portion from the second packed data element adjacent to the portion from the first packed data element, and the portion from the fourth packed data element adjacent to the portion from the third packed data element. | 04-21-2011 |
20110099352 | Automatic control of multiple arithmetic/logic SIMD units - There is provided a method of performing single instruction multiple data (SIMD) operations. The method comprises storing a plurality of arrays in memory for performing SIMD operations thereon; determining a total number of SIMD operations to be performed on the plurality of arrays; loading a counter with the total number of SIMD operations to be performed on the plurality of arrays; enabling a plurality of arithmetic logic units (ALUs) to perform a first number of operations on first elements of the plurality of arrays; performing the first number of operations on first elements of the plurality of arrays using the plurality of ALUs; decrementing the counter by the first number of operations to provide a remaining number of operations; and enabling a number of the plurality of ALUs to perform the remaining number of operations on second elements of the plurality of arrays. | 04-28-2011 |
20110107060 | TRANSPOSING ARRAY DATA ON SIMD MULTI-CORE PROCESSOR ARCHITECTURES - Systems, methods and articles of manufacture are disclosed for transposing array data on a SIMD multi-core processor architecture. A matrix in a SIMD format may be received. The matrix may comprise a SIMD conversion of a matrix M in a conventional data format. A mapping may be defined from each element of the matrix to an element of a SIMD conversion of a transpose of matrix M. A SIMD-transposed matrix T may be generated based on matrix M and the defined mapping. A row-wise algorithm may be applied to T, without modification, to operate on columns of matrix M. | 05-05-2011 |
20110153983 | Gathering and Scattering Multiple Data Elements - According to a first aspect, efficient data transfer operations can be achieved by: decoding, by a processor device, a single instruction specifying a transfer operation for a plurality of data elements between a first storage location and a second storage location; issuing the single instruction for execution by an execution unit in the processor; detecting an occurrence of an exception during execution of the single instruction; and in response to the exception, delivering pending traps or interrupts to an exception handler prior to delivering the exception. | 06-23-2011 |
20110173414 | MAXIMIZED MEMORY THROUGHPUT ON PARALLEL PROCESSING DEVICES - In parallel processing devices, for streaming computations, processing of each data element of the stream may not be computationally intensive and thus processing may take relatively small amounts of time to compute as compared to the memory access times required to read the stream and write the results. Therefore, memory throughput often limits the performance of the streaming computation. Generally stated, provided are methods for achieving improved, optimized, or ultimately, maximized memory throughput in such memory-throughput-limited streaming computations. Streaming computation performance is maximized by improving the aggregate memory throughput across the plurality of processing elements and threads. High aggregate memory throughput is achieved by balancing processing loads between threads and groups of threads and a hardware memory interface coupled to the parallel processing devices. | 07-14-2011 |
20110185150 | Low-Overhead Misalignment and Reformatting Support for SIMD - Systems and methods for performing single instruction multiple data (SIMD) operations on a data set. The methods may include examining a structure of the data set to determine what reorganization may be necessary to facilitate SIMD processing. The method may include selecting a stored bit mask corresponding to the organization of the data set and loading the bit mask into an application specific register (ASR). Subsequently, the data may be reorganized inline according to the ASR as the data is loaded into the SIMD functional unit such that the SIMD functional unit may operate on the data set. The results of the SIMD operation may be written to a results register. | 07-28-2011 |
20110185151 | Data Processing Architecture - A parallel processor is described which is operated in a SIMD manner. The processor comprises: a plurality of processing elements connected in a string and grouped into a plurality of processing units, wherein each processing unit comprises a plurality of processing elements which each have direct interconnections with all of the other processing elements within the respective processing unit, the interconnections enabling data transfer between any two elements within a unit to be effected in a single clock cycle. | 07-28-2011 |
20110191567 | Improvements Relating to Single Instruction Multiple Data (SIMD) Architectures - A parallel processor for processing a plurality of different processing instruction streams in parallel is described. The processor comprises a plurality of data processing units; and a plurality of SIMD (Single Instruction Multiple Data) controllers, each connectable to a group of data processing units of the plurality of data processing units, and each SIMD controller arranged to handle an individual processing task with a subgroup of actively connected data processing units selected from the group of data processing units. The parallel processor is arranged to vary dynamically the size of the subgroup of data processing units to which each SIMD controller is actively connected under control of received processing instruction streams, thereby permitting each SIMD controller to be actively connected to a different number of processing units for different processing tasks. | 08-04-2011 |
20110208946 | Dual Mode Floating Point Multiply Accumulate Unit - Disclosed are various embodiments of a stream processing unit for single instruction multiple data (SIMD) processing, wherein the stream processing unit executes a stage of a Multiply-Accumulate calculation. In one embodiment, the stream processing unit comprises a plurality of scalar arithmetic logic units (ALUs) configured to receive data having a plurality of data types. The number and type of scalar ALUs correspond to an SIMD factor. In one embodiment, the scalar ALUs are executed sequentially with a delay being introduced in between execution of each of the scalar ALUs, wherein the delay corresponds to the SIMD factor. | 08-25-2011 |
20120151183 | ENHANCING PERFORMANCE BY INSTRUCTION INTERLEAVING AND/OR CONCURRENT PROCESSING OF MULTIPLE BUFFERS - An embodiment may include circuitry to execute, at least in part, a first list of instructions and/or to concurrently process, at least in part, first and second buffers. The execution of the first list of instructions may result, at least in part, from invocation of a first function call. The first list of instructions may include at least one portion of a second list of instructions interleaved, at least in part, with at least one other portion of a third list of instructions. The portions may be concurrently carried out, at least in part, by one or more sets of execution units of the circuitry. The second and third lists of instructions may implement, at least in part, respective algorithms that are amenable to being invoked by separate respective function calls. The concurrent processing may involve, at least in part, complementary algorithms. | 06-14-2012 |
20120159120 | Method and apparatus for scheduling the issue of instructions in a microprocessor using multiple phases of execution - A microprocessor configured to execute programs divided into discrete phases, comprising: a scheduler for scheduling program instructions to be executed on the processor; a plurality of resources for executing programming instructions issued by the scheduler; wherein the scheduler is configured to schedule each phase of the program only after receiving an indication that execution of the preceding phase of the program has been completed. By splitting programs into multiple phases and providing a scheduler that is able to determine whether execution of a phase has been completed, each phase can be separately scheduled and the results of preceding phases can be used to inform the scheduling of subsequent phases. | 06-21-2012 |
20120210098 | ENABLING VIRTUAL CALLS IN A SIMD ENVIRONMENT - Systems and methods of enabling virtual calls in a single instruction multiple data (SIMD) environment may involve detecting a virtual call of a function and using a single dispatch of the function to invoke the virtual call for two or more channels of the virtual call. In one example, it is determined that the two or more channels share a common target address and a single dispatch of the function is conducted with respect to the common target address. The process may be iterated for additional channels of the virtual call that share a common target address. | 08-16-2012 |
20120254585 | METHOD AND APPARATUS FOR FAST BRANCH-FREE VECTOR DIVISION COMPUTATION - Methods and apparatus for double precision division/inversion vector computations on Single Instruction Multiple Data (SIMD) computing platforms are described. In one embodiment, an input argument is represented by an exponent portion and a fraction portion. These portions are scaled, inverted, and multiplied to generate an inverse version of the input argument. In an embodiment, the inversion of the exponent portion may be done by changing the sign of the exponent. Other embodiments are also described. | 10-04-2012 |
20120265964 | DATA PROCESSING DEVICE AND DATA PROCESSING METHOD THEREOF - Disclosed is a data processing device capable of efficiently performing an arithmetic process on variable-length data and an arithmetic process on fixed-length data. The data processing device includes first PEs of SIMD type, SRAMs provided respectively for the first PEs, and second PEs. The first PEs each perform an arithmetic operation on data stored in a corresponding one of the SRAMs. The second PEs each perform an arithmetic operation on data stored in corresponding ones of the SRAMs. Therefore, the SRAMs can be shared so as to efficiently perform the arithmetic process on variable-length data and the arithmetic process on fixed-length data. | 10-18-2012 |
20120297163 | AUTOMATIC KERNEL MIGRATION FOR HETEROGENEOUS CORES - A system and method for automatically migrating the execution of work units between multiple heterogeneous cores. A computing system includes a first processor core with a single instruction multiple data micro-architecture and a second processor core with a general-purpose micro-architecture. A compiler predicts that execution of a function call in a program migrates at a given location to a different processor core. The compiler creates a data structure to support moving live values associated with the execution of the function call at the given location. An operating system (OS) scheduler schedules at least code before the given location in program order to the first processor core. In response to receiving an indication that a condition for migration is satisfied, the OS scheduler moves the live values to a location indicated by the data structure for access by the second processor core and schedules code after the given location to the second processor core. | 11-22-2012 |
20130124824 | Bitstream Buffer Manipulation With A SIMD Merge Instruction - Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block. | 05-16-2013 |
20130138917 | Bitstream Buffer Manipulation With A SIMD Merge Instruction - Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block. | 05-30-2013 |
20130145120 | Bitstream Buffer Manipulation With A SIMD Merge Instruction - Method, apparatus, and program means for performing bitstream buffer manipulation with a SIMD merge instruction. The method of one embodiment comprises determining whether any unprocessed data bits for a partial variable length symbol exist in a first data block. A shift merge operation is performed to merge the unprocessed data bits from the first data block with a second data block. A merged data block is formed. A merged variable length symbol comprised of the unprocessed data bits and a plurality of data bits from the second data block is extracted from the merged data block. | 06-06-2013 |
20130166878 | VECTOR SIMD PROCESSOR - Operation parallelism of a data processor is enhanced by floating-point inner product execution units compatible with single instruction multiple data (SIMD). An operating system that can significantly enhance the level of operation parallelism per instruction while maintaining efficiency of floating-point length-4 vector inner product execution units is implemented. The floating-point length-4 vector inner product execution units are defined in the minimum width (32 bits for single precision) even where an extensive operating system becomes available, and compose the inner product execution units to be compatible with SIMD. The mutually augmenting effects of the inner product execution units and SIMD-compatible composition enhances the level of operation parallelism dramatically. Composition of the floating-point length-4 vector inner product execution units to calculate the sum of the inner product of length-4 vectors and scalar to be compatible with SIMD of four in parallel results in a processing capability of 32 FLOPS per cycle. | 06-27-2013 |
20130227249 | Three-Dimensional Permute Unit for a Single-Instruction Multiple-Data Processor - A three-dimensional (3D) permute unit for a single-instruction-multiple-data stacked processor includes a first vector permute subunit and a second vector permute subunit. The first and second vector permute subunits are arranged in different layers of a 3D chip package. The vector permute subunits are each configured to process a portion of at least two input vectors. A first contact sub-field of the first vector permute subunit is configured to connect output ports of a first crossbar of the first vector permute subunit, holding an intermediate result of the first vector permute subunit, to a second contact sub-field of the second vector permute subunit. A first contact sub-field of the second vector permute subunit is configured to connect output ports of a first crossbar of the second vector permute subunit, holding an intermediate result of the second vector permute subunit, to a second contact sub-field of the first vector permute subunit. | 08-29-2013 |
20130318324 | MINICORE-BASED RECONFIGURABLE PROCESSOR AND METHOD OF FLEXIBLY PROCESSING MULTIPLE DATA USING THE SAME - A minicore-based reconfigurable processor and a method of flexibly processing multiple data using the same are provided. The reconfigurable processor includes minicores, each of the minicores including function units configured to perform different operations, respectively. The reconfigurable processor further includes a processing unit configured to activate two or more function units of two or more respective minicores, among the minicores, that are configured to perform an operation of a single instruction multiple data (SIMD) instruction, the processing unit further configured to execute the SIMD instruction using the activated two or more function units. | 11-28-2013 |
20140013077 | EFFICIENT HARDWARE INSTRUCTIONS FOR SINGLE INSTRUCTION MULTIPLE DATA PROCESSORS - A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack fixed-width bit values in a bit stream to a fixed-width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented. | 01-09-2014 |
20140013078 | EFFICIENT HARDWARE INSTRUCTIONS FOR SINGLE INSTRUCTION MULTIPLE DATA PROCESSORS - A method and apparatus for efficiently processing data in various formats in a single instruction multiple data (“SIMD”) architecture is presented. Specifically, a method to unpack fixed-width bit values in a bit stream to a fixed-width byte stream in a SIMD architecture is presented. A method to unpack variable-length byte packed values in a byte stream in a SIMD architecture is presented. A method to decompress a run length encoded compressed bit-vector in a SIMD architecture is presented. A method to return the offset of each bit set to one in a bit-vector in a SIMD architecture is presented. A method to fetch bits from a bit-vector at specified offsets relative to a base in a SIMD architecture is presented. A method to compare values stored in two SIMD registers is presented. | 01-09-2014 |
20140032879 | CIRCUIT AND METHOD FOR SEARCHING A DATA ARRAY AND SINGLE-INSTRUCTION, MULTIPLE-DATA PROCESSING UNIT INCORPORATING THE SAME - Search circuitry responsive to a single instruction for undertaking a step of a search of a data array for an extreme value therein, a method of searching a data array to identify an extreme value therein and a location thereof and a single-instruction, multiple-data (SIMD) processing unit incorporating the search circuitry or the method. In one embodiment, the search circuitry includes: (1) a comparison element configured to compare two values in the data array, (2) multiplexers coupled to the comparison element and configured to select a more extreme value of the two values and a location in the data array of the more extreme value and (3) an incrementer configured to increment a counter associated with the search. | 01-30-2014 |
20140181467 | HIGH LEVEL SOFTWARE EXECUTION MASK OVERRIDE - Methods, media, and computer systems are provided. The method includes, the media include control logic for, and the computer system includes a processor with control logic for, overriding an execution mask of SIMD hardware to enable at least one of a plurality of lanes of the SIMD hardware. Overriding the execution mask is responsive to a data parallel computation and a diverged control flow of a workgroup. | 06-26-2014 |
20140189296 | SYSTEM, APPARATUS AND METHOD FOR LOOP REMAINDER MASK INSTRUCTION - A loop remainder mask instruction indicates a current iteration count of a loop as a first operand, an iteration limit of a loop as a second operand, and a destination. The loop contains iterations and each iteration includes a data element of the array. A processor receives the loop remainder mask instruction, decodes the instruction for execution, and stores a result of the execution in the destination. The result indicates a number of data elements of the array past an end of a preceding portion of the array that are to be handled separately from the preceding portion, the end of the preceding portion being where the current iteration count is recorded. | 07-03-2014 |
20140195778 | INSTRUCTION AND LOGIC TO PROVIDE VECTOR LOAD-OP/STORE-OP WITH STRIDE FUNCTIONALITY - Instructions and logic provide vector load-op and/or store-op with stride functionality. In some embodiments, responsive to an instruction specifying a set of loads, a second operation, destination register, operand register, memory address, and stride length, execution units read values in a mask register, wherein fields in the mask register correspond to stride-length multiples from the memory address to data elements in memory. A first mask value indicates the element has not been loaded from memory and a second value indicates that the element does not need to be, or has already been, loaded. For each field having the first value, the data element is loaded from memory into the corresponding destination register location, and the corresponding value in the mask register is changed to the second value. Then the second operation is performed using corresponding data in the destination and operand registers to generate results. The instruction may be restarted after faults. | 07-10-2014 |
20140208068 | DATA COMPRESSION AND DECOMPRESSION USING SIMD INSTRUCTIONS - Compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks. This abstract does not limit the scope of the invention as described in the claims. | 07-24-2014 |
20140208069 | SIMD INSTRUCTIONS FOR DATA COMPRESSION AND DECOMPRESSION - An execution unit configured for compression and decompression of numerical data utilizing single instruction, multiple data (SIMD) instructions is described. The numerical data includes integer and floating-point samples. Compression supports three encoding modes: lossless, fixed-rate, and fixed-quality. SIMD instructions for compression operations may include attenuation, derivative calculations, bit packing to form compressed packets, header generation for the packets, and packed array output operations. SIMD instructions for decompression may include packed array input operations, header recovery, decoder control, bit unpacking, integration, and amplification. Compression and decompression may be implemented in a microprocessor, digital signal processor, field-programmable gate array, application-specific integrated circuit, system-on-chip, or graphics processor, using SIMD instructions. Compression and decompression of numerical data can reduce memory, networking, and storage bottlenecks. This abstract does not limit the scope of the invention as described in the claims. | 07-24-2014 |
20150012724 | DATA PROCESSING APPARATUS HAVING SIMD PROCESSING CIRCUITRY - A data processing apparatus has permutation circuitry for performing a permutation operation for changing a data element size or data element positioning of at least one source operand to generate first and second SIMD operands, and SIMD processing circuitry for performing a SIMD operation on the first and second SIMD operands. In response to a first SIMD instruction requiring a permutation operation, the instruction decoder controls the permutation circuitry to perform the permutation operation to generate the first and second SIMD operands and then controls the SIMD processing circuitry to perform the SIMD operation using these operands. In response to a second SIMD instruction not requiring a permutation operation, the instruction decoder controls the SIMD processing circuitry to perform the SIMD operation using the first and second SIMD operands identified by the instruction, without passing them via the permutation circuitry. | 01-08-2015 |
20150019838 | Vector Load and Duplicate Operations - A method of loading and duplicating scalar data from a source into a destination register. The data may be duplicated in byte, half word, word or double word parts, according to a duplication pattern. | 01-15-2015 |
20150100758 | DATA PROCESSOR AND METHOD OF LANE REALIGNMENT - A data processor includes a register file divided into at least a first portion and a second portion for storing data. A single instruction, multiple data (SIMD) unit is also divided into at least a first lane and a second lane. The first and second lanes of the SIMD unit correspond respectively to the first and second portions of the register file. Furthermore, each lane of the SIMD unit is capable of data processing. The data processor also includes a realignment element in communication with the register file and the SIMD unit. The realignment element is configured to selectively realign conveyance of data between the first portion of the register file and the first lane of the SIMD unit to the second lane of the SIMD unit. | 04-09-2015 |
20150127924 | METHOD AND APPARATUS FOR PROCESSING SHUFFLE INSTRUCTION - A method and corresponding apparatus for processing a shuffle instruction are provided. Shuffle units are configured in a hierarchical structure, and each of the shuffle units generates a shuffled data element array by performing shuffling on an input data element array. In the hierarchical structure, which includes an upper shuffle unit and a lower shuffle unit, the shuffled data element array output from the lower shuffle unit is input to the upper shuffle unit as a portion of the input data element array for the upper shuffle unit. | 05-07-2015 |
20150317157 | TECHNIQUES FOR SERIALIZED EXECUTION IN A SIMD PROCESSING SYSTEM - A SIMD processor may be configured to determine one or more active threads from a plurality of threads, select one active thread from the one or more active threads, and perform a divergent operation on the selected active thread. The divergent operation may be a serial operation. | 11-05-2015 |
20150356054 | DATA PROCESSOR AND METHOD FOR DATA PROCESSING - An integrated circuit device has at least one instruction processing module arranged for executing vector data processing upon receipt of a respective one of a set of data processing instructions. The data processing instructions include at least one matrix processing instruction for processing elements of a matrix. The elements of rows of the matrix are stored in a set of registers, and the instruction processing module comprises an accessing unit for accessing selected elements of the matrix, which selected elements are non-sequentially located according to a predetermined pattern across multiple registers of the set of registers, the accessing enabling respective processing lanes to write or read different registers. Advantageously, elements in columns of a matrix can be processed efficiently. | 12-10-2015 |
20150363357 | MEMORY CONTROLLER AND SIMD PROCESSOR - Technology to suppress the drop in SIMD processor efficiency that occurs when exchanging two-dimensional data in a plurality of rectangular regions, between an external section and a plurality of processor elements in an SIMD processor, so that one rectangular region corresponds to one processor element. In the SIMD processor, an address storage unit in a memory controller is capable of setting N number of addresses Ai (i=1 through N) in an external memory by utilizing a control processor. A parameter storage unit is capable of setting a first parameter OSV, a second parameter W, and a third parameter L by utilizing a control processor. A data transfer unit executes the transfer of data between an external memory, and the buffers in N number of processor elements contained in the applicable SIMD processor, based on the contents of the address storage unit and the parameter storage unit. | 12-17-2015 |
20150370755 | SIMD PROCESSOR AND CONTROL PROCESSOR, AND PROCESSING ELEMENT WITH ADDRESS CALCULATING UNIT - To improve processing efficiency of a SIMD processor that divides two-dimensional data into blocks, each having a width of PE number N, to store the data in a local memory of each of PEs by a lateral direction priority method. | 12-24-2015 |
20160026464 | Programmable Counters for Counting Floating-Point Operations in SIMD Processors - A processor includes one or more execution units to execute instructions, each having one or more elements in different element sizes using one or more registers in different register sizes. The processor further includes a counter configured to count a number of instructions performing predetermined types of operations executed by the one or more execution units. The processor further includes one or more registers to allow an external component to configure the counter to count a number of instructions associated with a combination of a register size and an element size (register/element size) and to retrieve a counter value produced by the counter. | 01-28-2016 |
20160034282 | INSTRUCTION AND LOGIC TO PROVIDE SIMD SECURE HASHING ROUND SLICE FUNCTIONALITY - Instructions and logic provide SIMD secure hashing round slice functionality. Some embodiments include a processor comprising: a decode stage to decode an instruction for a SIMD secure hashing algorithm round slice, the instruction specifying a source data operand set, a message-plus-constant operand set, a round-slice portion of the secure hashing algorithm round, and a rotator set portion of rotate settings. Processor execution units are responsive to the decoded instruction to perform a secure hashing round-slice set of round iterations upon the source data operand set, applying the message-plus-constant operand set and the rotator set, and to store a result of the instruction in a SIMD destination register. One embodiment of the instruction specifies a hash round type as one of four MD5 round types. Other embodiments may specify a hash round type by an immediate operand as one of three SHA-1 round types or as a SHA-2 round type. | 02-04-2016 |
20160055005 | System and Method for Page-Conscious GPU Instruction - Embodiments disclose a system and method for reducing virtual address translation latency in a wide execution engine that implements virtual memory. One example method comprises receiving a wavefront, classifying the wavefront into a subset based on classification criteria selected to reduce virtual address translation latency associated with a memory support structure, and scheduling the wavefront for processing based on the classifying. | 02-25-2016 |
20160062771 | OPTIMIZE CONTROL-FLOW CONVERGENCE ON SIMD ENGINE USING DIVERGENCE DEPTH - There are provided a system, a method and a computer program product for selecting an active data stream (a lane) while running SPMD (Single Program Multiple Data) code on a SIMD (Single Instruction Multiple Data) machine. The machine runs an instruction stream over input data streams. The machine increments lane depth counters of all active lanes upon the thread-PC reaching a branch operation. The machine updates the lane-PC of each active lane according to targets of the branch operation. The machine selects an active lane and activates only lanes whose lane-PCs match the thread-PC. The machine decrements the lane depth counters of the selected active lanes and updates the lane-PC of each active lane upon the instruction stream reaching a first instruction. The machine assigns the lane-PC of a lane with a largest lane depth counter value to the thread-PC and activates all lanes whose lane-PCs match the thread-PC. | 03-03-2016 |
20160070571 | REGISTER FILES FOR STORING DATA OPERATED ON BY INSTRUCTIONS OF MULTIPLE WIDTHS - A processor core includes even and odd execution slices each having a register file. The slices are each configured to perform operations specified in a first set of instructions on data from its respective register file, and together configured to perform operations specified in a second set of instructions on data stored across both register files. During utilization, the processor receives a first instruction of the first set specifying an operation, a target register, and a source register. Next, a second instruction upon which content of the source register depends is identified as being of the second set. In response, the first instruction is dispatched to the even slice. In accordance with the operation specified in the first instruction, the even slice uses content of the source register in its register file to produce a result. Copies of the result are written to the target register in both register files. | 03-10-2016 |
20160077838 | Single Instruction Multiple Data (SIMD) Architectures - A parallel processor for processing a plurality of different processing instruction streams in parallel is described. The processor comprises a plurality of data processing units; and a plurality of SIMD (Single Instruction Multiple Data) controllers, each connectable to a group of data processing units of the plurality of data processing units, and each SIMD controller arranged to handle an individual processing task with a subgroup of actively connected data processing units selected from the group of data processing units. The parallel processor is arranged to vary dynamically the size of the subgroup of data processing units to which each SIMD controller is actively connected under control of received processing instruction streams, thereby permitting each SIMD controller to be actively connected to a different number of processing units for different processing tasks. | 03-17-2016 |
20160092215 | INSTRUCTION AND LOGIC FOR MULTIPLIER SELECTORS FOR MERGING MATH FUNCTIONS - A processor includes a front end with logic to identify a multiplier, multiplicand, and mathematical mode based upon an instruction. The processor also includes a multiplier circuit to apply Booth encoding to multiply the multiplier and multiplicand. The multiplier circuit includes circuitry to determine leftmost and rightmost partial products of multiplying the multiplier and multiplicand using Booth encoding. The circuitry includes a most significant bit (MSB) array and least significant bit (LSB) array corresponding to the multiplier. The multiplier circuit also includes logic to selectively enable selectors of the circuitry to find partial products based upon the mathematical mode of the instruction. | 03-31-2016 |
20160092237 | Variable Length Execution Pipeline - In an aspect, a pipelined execution resource can produce an intermediate result for use in an iterative approximation algorithm in an odd number of clock cycles. The pipelined execution resource executes SIMD requests by staggering commencement of execution of the requests from a SIMD instruction. When executing one or more operations for a SIMD iterative approximation algorithm, and an operation for another SIMD iterative approximation algorithm is ready to begin execution, control logic causes intermediate results completed by the pipelined execution resource to pass through a wait state, before being used in a subsequent computation. This wait state presents two open scheduling cycles in which both parts of the next SIMD instruction can begin execution. Although the wait state increases latency to complete an in-progress algorithm, a total throughput of execution on the pipeline increases. | 03-31-2016 |
20160092241 | SINGLE INSTRUCTION ARRAY INDEX COMPUTATION - Embodiments are directed to a method of adjusting an index, wherein the index identifies a location of an element within an array. The method includes executing, by a computer, a single instruction that adjusts a first parameter of the index to match a parameter of an array address. The single instruction further adjusts a second parameter of the index to match a parameter of the array element. The adjustment of the first parameter includes a sign extension. | 03-31-2016 |
20160103684 | COALESCING ADJACENT GATHER/SCATTER OPERATIONS - According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read a contiguous first and second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location. | 04-14-2016 |
20160103785 | GATHER USING INDEX ARRAY AND FINITE STATE MACHINE - Methods and apparatus are disclosed for using an index array and finite state machine for scatter/gather operations. Embodiments of the apparatus may comprise: decode logic to decode a scatter/gather instruction and generate a set of micro-operations, and an index array to hold a set of indices and a corresponding set of mask elements. A finite state machine facilitates the gather operation. Address generation logic generates an address from an index of the set of indices for at least each of the corresponding mask elements having a first value. An address is accessed to load a corresponding data element if the mask element had the first value. The data element is written at an in-register position in a destination vector register according to a respective in-register position of the index. Values of corresponding mask elements are changed from the first value to a second value responsive to completion of their respective loads. | 04-14-2016 |
20160103787 | COALESCING ADJACENT GATHER/SCATTER OPERATIONS - According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read a contiguous first and second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location. | 04-14-2016 |
20160103789 | COALESCING ADJACENT GATHER/SCATTER OPERATIONS - According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read a contiguous first and second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location. | 04-14-2016 |
20160103790 | COALESCING ADJACENT GATHER/SCATTER OPERATIONS - According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read a contiguous first and second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location. | 04-14-2016 |
20160110196 | COALESCING ADJACENT GATHER/SCATTER OPERATIONS - According to one embodiment, a processor includes an instruction decoder to decode a first instruction to gather data elements from memory, the first instruction having a first operand specifying a first storage location and a second operand specifying a first memory address storing a plurality of data elements. The processor further includes an execution unit coupled to the instruction decoder, in response to the first instruction, to read a contiguous first and second of the data elements from a memory location based on the first memory address indicated by the second operand, and to store the first data element in a first entry of the first storage location and a second data element in a second entry of a second storage location corresponding to the first entry of the first storage location. | 04-21-2016 |
20160124709 | FAST, ENERGY-EFFICIENT EXPONENTIAL COMPUTATIONS IN SIMD ARCHITECTURES - In one embodiment, a computer-implemented method includes receiving as input a value of a variable x and receiving as input a degree n of a polynomial function being used to evaluate an exponential function e | 05-05-2016 |
20160139934 | HARDWARE INSTRUCTION SET TO REPLACE A PLURALITY OF ATOMIC OPERATIONS WITH A SINGLE ATOMIC OPERATION - Systems and methods may process a single atomic operation. An instruction set may be generated to replace a plurality of atomic operations with a single atomic operation. The instruction set may include an accumulation instruction to compute a prefix sum for a plurality of initial values associated with a plurality of processing lanes to generate a plurality of accumulated values. The instruction set may also include a broadcast instruction to return a pre-existing value to be added with each of the plurality of accumulated values to generate a plurality of intermediate accumulated values. In one example, a graphics processor may execute the instruction set to process the single atomic operation. | 05-19-2016 |
20160140079 | IMPLEMENTING 128-BIT SIMD OPERATIONS ON A 64-BIT DATAPATH - A method of implementing a processor architecture and corresponding system includes operands of a first size and a datapath of a second size. The second size is different from the first size. Given a first array of registers and a second array of registers, each register of the first and second arrays being of the second size, selecting a first register and corresponding second register from the first array and the second array, respectively, to perform operations of the first size. This allows a user, who is interfacing with the hardware processor through software, to provide data of the datapath bit-width instead of the register bit-width. Advantageously, the user is agnostic to the size of the registers. | 05-19-2016 |
20160170771 | SIMD K-NEAREST-NEIGHBORS IMPLEMENTATION | 06-16-2016 |
20160179535 | METHOD AND APPARATUS FOR EFFICIENT EXECUTION OF NESTED BRANCHES ON A GRAPHICS PROCESSOR UNIT | 06-23-2016 |