Patent application number | Description | Published |
20090172677 | Efficient State Management System - The present invention provides an efficient state management system for a complex ASIC, and applications thereof. In an embodiment, a computer-based system executes state-dependent processes. The computer-based system includes a command processor (CP) and a plurality of processing blocks. The CP receives commands in a command stream and manages a global state responsive to global context events in the command stream. The plurality of processing blocks receive the commands in the command stream and manage respective block states responsive to block context events in the command stream. Each respective processing block executes a process on data in a data stream based on the global state and the block state of the respective processing block. | 07-02-2009 |
20090300288 | Write Combining Cache with Pipelined Synchronization - Systems and methods for pipelined synchronization in a write-combining cache are described herein. An embodiment to transmit data to a memory to enable pipelined synchronization of a cache includes obtaining a plurality of synchronization events for transactions with said memory, calculating one or more matches between said events and said data stored in one or more cache-lines of said cache, storing event time stamps of events associated with said matches, generating one or more priority values based on said event time stamps, and concurrently transmitting said data to said memory based on said priority values. | 12-03-2009 |
20100017652 | APPARATUS WITH REDUNDANT CIRCUITRY AND METHOD THEREFOR - An apparatus with circuit redundancy includes a set of parallel arithmetic logic units (ALUs), a redundant parallel ALU, and input data shifting logic that is coupled to the set of parallel ALUs and that is operatively coupled to the redundant parallel ALU. The input data shifting logic shifts input data for a defective ALU, in a first direction, to a neighboring ALU in the set. When the neighboring ALU is the last or end ALU in the set, the shifting logic continues to shift the input data for the end ALU that is not defective, to the redundant parallel ALU. The redundant parallel ALU then operates for the defective ALU. Output data shifting logic is coupled to an output of the parallel redundant ALU and all other ALU outputs to shift the output data in a second and opposite direction than the input shifting logic, to realign output of data for continued processing, including for storage or for further processing by other circuitry. | 01-21-2010 |
20110050716 | Processing Unit with a Plurality of Shader Engines - A processor includes a first shader engine and a second shader engine. The first shader engine is configured to process pixel shaders for a first subset of pixels to be displayed on a display device. The second shader engine is configured to process pixel shaders for a second subset of pixels to be displayed on the display device. Both the first and second shader engines are also configured to process general-compute shaders and non-pixel graphics shaders. The processor may also include a level-one (L1) data cache, coupled to and positioned between the first and second shader engines. | 03-03-2011 |
20110055511 | Interlocked Increment Memory Allocation and Access - A method of allocating a memory to a plurality of concurrent threads is presented. The method includes dynamically determining writer threads each having at least one pending write to the memory; and dynamically allocating respective contiguous blocks in the memory for each of the writer threads. Another method of allocating a memory to a plurality of concurrent threads includes launching the plurality of threads as a plurality of wavefronts, dynamically determining a group of wavefronts each having at least one thread requiring a write to the memory, and dynamically allocating respective contiguous blocks in the memory for each wavefront from the group of wavefronts. A corresponding method of assigning a memory to a plurality of reader threads includes determining a first number corresponding to a number of writer threads having a block allocated in said memory, launching a first number of reader threads, entering a first wavefront of said reader threads from said group of wavefronts to an atomic operation, and assigning a first block in the memory to the first wavefront during the corresponding atomic operation, where the first block is contiguous to a previously allocated block dynamically allocated to another wavefront from said group of wavefronts. Corresponding system embodiments and computer program product embodiments are also presented. | 03-03-2011 |
20110057942 | Efficient Data Access for Unified Pixel Interpolation - Disclosed herein are methods, apparatuses, and systems for accessing vertex data stored in a memory, and applications thereof. Such a method includes writing vertex data of primitives into contiguous banks of a memory such that the vertex data of consecutively written primitives spans more than one row of the memory. Vertex data of two consecutively written primitives are read from the memory in a single clock cycle. | 03-10-2011 |
20110066813 | Method And System For Local Data Sharing - Embodiments for a local data share (LDS) unit are described herein. Embodiments include a co-operative set of threads to load data into shared memory so that the threads can have repeated memory access allowing higher memory bandwidth. In this way, data can be shared between related threads in a cooperative manner by providing a re-use of a locality of data from shared registers. Furthermore, embodiments of the invention allow a cooperative set of threads to fetch data in a partitioned manner so that it is only fetched once into a shared memory that can be repeatedly accessed via a separate low latency path. | 03-17-2011 |
20110115802 | Processing Unit that Enables Asynchronous Task Dispatch - A processing unit that includes a plurality of virtual engines and a shader core. The plurality of virtual engines is configured to (i) receive, from an operating system (OS), a plurality of tasks substantially in parallel with each other and (ii) load a set of state data associated with each of the plurality of tasks. The shader core is configured to execute the plurality of tasks substantially in parallel based on the set of state data associated with each of the plurality of tasks. The processing unit may also include a scheduling module that schedules the plurality of tasks to be issued to the shader core. | 05-19-2011 |
20120110309 | Data Output Transfer To Memory - Methods, systems, and computer readable media for improved transfer of processing data outputs to memory are disclosed. According to an embodiment, a method for transferring outputs of a plurality of threads concurrently executing in one or more processing units to a memory includes: forming, based upon one or more of the outputs, a combined memory export instruction comprising one or more data elements and one or more control elements; and sending the combined memory export instruction to the memory. The combined memory export instruction can be sent to memory in a single clock cycle. Another method includes: forming, based upon outputs from two or more of the threads, a memory export instruction comprising two or more data elements; embedding at least one address representative of the two or more of the outputs in a second memory instruction; and sending the memory export instruction and the second memory instruction to the memory. | 05-03-2012 |
20120131596 | Method and System for Synchronizing Thread Wavefront Data and Events - Systems and methods for synchronizing thread wavefronts and associated events are disclosed. According to an embodiment, a method for synchronizing one or more thread wavefronts and associated events includes inserting a first event associated with a first data output from a first thread wavefront into an event synchronizer. The event synchronizer is configured to release the first event before releasing events inserted subsequent to the first event. The method further includes releasing the first event from the event synchronizer after the first data is stored in the memory. Corresponding system and computer readable medium embodiments are also disclosed. | 05-24-2012 |
20120188259 | Mechanisms for Enabling Task Scheduling - Embodiments described herein provide a method including receiving a command to schedule a first process and selecting a command queue associated with the first process. The method also includes scheduling the first process to run on an accelerated processing device and preempting a second process running on the accelerated processing device to allow the first process to run on the accelerated processing device. | 07-26-2012 |
20120194524 | Preemptive Context Switching - Methods, systems, and computer readable media embodiments are disclosed for preemptive context-switching of processes running on an accelerated processing device. Embodiments include detecting by an accelerated processing device a memory exception, and preempting a process from running on the accelerated processing device based upon the detected exception. | 08-02-2012 |
20120194525 | Managed Task Scheduling on a Graphics Processing Device (APD) - Provided herein is a method including receiving a run list including one or more processes to run on an accelerated processing device, wherein each of the one or more processes is associated with a corresponding independent job command queue. The method also includes scheduling each of the one or more processes to run on the accelerated processing device based on a criteria associated with each process. | 08-02-2012 |
20120194527 | Method for Preempting Graphics Tasks to Accommodate Compute Tasks in an Accelerated Processing Device (APD) - Embodiments described herein provide a method of arbitrating a processing resource. The method includes receiving a command to preempt a task and preventing additional wavefronts associated with the task from being processed. The method also includes evicting currently executing wavefronts associated with the task from being processed based upon predetermined criteria. | 08-02-2012 |
20120194528 | Method and System for Context Switching - Embodiments of the present invention provide a method of preempting a task. The method includes removing the task from the parallel processors via a scheduling mechanism. Responsive to the removing, the method also includes ceasing (i) retrieval of commands from a buffer associated with the task, (ii) dispatch of groups of work-items associated with the task, (iii) dispatch of wavefronts associated with the task, and (iv) execution of the wavefronts. State information related to the task is saved. | 08-02-2012 |
20120198458 | Methods and Systems for Synchronous Operation of a Processing Device - Embodiments of the present invention provide a method of synchronous operation of a first processing device and a second processing device. The method includes executing a process on the first processing device, responsive to a determination that execution of the process on the first device has reached a serial-parallel boundary, passing an execution thread of the process from the first processing device to the second processing device, and executing the process on the second processing device. | 08-02-2012 |
20120200576 | Preemptive context switching of processes on an accelerated processing device (APD) based on time quanta - Methods, systems, and computer readable media for preemptive context-switching of processes on an accelerated processing device are based upon a comparison of the running time of the process and a threshold time quanta. A method includes preempting a process running on an accelerated processing device based upon a running time of the process and a threshold time quanta. | 08-09-2012 |
20120200579 | Process Device Context Switching - Methods, systems, and computer readable media embodiments are disclosed for preemptive context-switching of processes running on an accelerated processing device. A method includes, responsive to an exception upon access to a memory by a process running on an accelerated processing device, determining whether to preempt the process based on the exception, and preempting, based upon the determining, the process from running on the accelerated processing device. | 08-09-2012 |
20120204014 | Systems and Methods for Improving Divergent Conditional Branches - Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads. | 08-09-2012 |
20130117750 | Method and System for Workitem Synchronization - Method, system, and computer program product embodiments for synchronizing workitems on one or more processors are disclosed. The embodiments include executing a barrier skip instruction by a first workitem from the group, and responsive to the executed barrier skip instruction, reconfiguring a barrier to synchronize other workitems from the group in a plurality of points in a sequence without requiring the first workitem to reach the barrier in any of the plurality of points. | 05-09-2013 |
20130135327 | Saving and Restoring Non-Shader State Using a Command Processor - Provided is a system including a command processor configured for interrupting processing of a first set of instructions executing within a shader core. | 05-30-2013 |
20130141446 | Method and Apparatus for Servicing Page Fault Exceptions - A method, apparatus and computer readable media for servicing page fault exceptions in an accelerated processing device (APD). A page fault related to a wavefront is detected. A fault handling request to a translation mechanism is sent when the page fault is detected. A fault handling response corresponding to the detected page fault from the translation mechanism is received. Confirmation that the detected page fault has been handled through performing page mapping based on the fault handling response is received. | 06-06-2013 |
20130141447 | Method and Apparatus for Accommodating Multiple, Concurrent Work Inputs - A method of accommodating more than one compute input is provided. The method creates an APD arbitration policy that dynamically assigns compute instructions from a sequence of instructions awaiting processing to the APD compute units for execution of a run list. | 06-06-2013 |
20130145202 | Handling Virtual-to-Physical Address Translation Failures - A method tolerates virtual to physical address translation failures. A translation request is sent from a graphics processing device to a translation mechanism. The translation request is associated with a first wavefront. A fault notification is received within an accelerated processing device (APD) from the translation mechanism that a request cannot be acknowledged. The first wavefront is stored within a shader core of the APD if the fault notification is received. The first wavefront is replaced with a second wavefront if the fault notification is received, the second wavefront being ready to be executed. | 06-06-2013 |
20130147816 | Partitioning Resources of a Processor - Embodiments described herein provide an apparatus, a computer readable medium and a method for simultaneously processing tasks within an APD. The method includes processing a first task within an APD. The method also includes reducing utilization of the APD by the first task to facilitate simultaneous processing of the second task, such that the utilization remains below a threshold. | 06-13-2013 |
20130155074 | SYSCALL MECHANISM FOR PROCESSOR TO PROCESSOR CALLS - Provided is a method for processing system calls from a GPU to a CPU. The method includes a GPU storing a plurality of tasks in a memory, with each task representing a function to be performed on the CPU. The method also includes generating a CPU interrupt, and processing of the stored plurality of tasks by the CPU. | 06-20-2013 |
20130155077 | Policies for Shader Resource Allocation in a Shader Core - A method of determining priority within an accelerated processing device is provided. The accelerated processing device includes compute pipeline queues that are processed in accordance with predetermined criteria. The queues are selected based on priority characteristics and the selected queue is processed until a time quantum lapses or a queue having a higher priority becomes available for processing. | 06-20-2013 |
20130155079 | Saving and Restoring Shader Context State - Provided is a method for processing a command in a computing system including an accelerated processing device (APD) having a command processor. The method includes executing an interrupt routine to save one or more contexts related to a first set of instructions on a shader core in response to an instruction to preempt processing of the first set of instructions. | 06-20-2013 |
20130160017 | Software Mechanisms for Managing Task Scheduling on an Accelerated Processing Device (APD) - Embodiments described herein provide a method for managing task scheduling on an accelerated processing device. The method includes executing a first task within the accelerated processing device (APD), monitoring for an interruption of the execution of the first task, and switching to a second task when an interruption is detected. | 06-20-2013 |
20130160019 | Method for Resuming an APD Wavefront in Which a Subset of Elements Have Faulted - A method resumes an accelerated processing device (APD) wavefront in which a subset of elements have faulted. A restore command for a job including a wavefront is received. A list of context states for the wavefront is read from a memory associated with an APD. An empty shell wavefront is created for restoring the list of context states. A portion of not acknowledged data is masked over a portion of acknowledged data within the restored wavefronts. | 06-20-2013 |
20130191852 | Multithreaded Computing - A system, method, and computer program product are provided for improving resource utilization of multithreaded applications. Rather than requiring threads to block while waiting for data from a channel or requiring context switching to minimize blocking, the techniques disclosed herein provide an event-driven approach to launch kernels only when needed to perform operations on channel data, and then terminate in order to free resources. These operations are handled efficiently in hardware, but are flexible enough to be implemented in all manner of programming models. | 07-25-2013 |
20130262812 | Hardware Managed Allocation and Deallocation Evaluation Circuit - A system and method is provided for improving efficiency, power, and bandwidth consumption in parallel processing. Rather than using memory polling to ensure that enough space is available in memory locations for, for example, write instructions, the techniques disclosed herein provide a system and method to automate this evaluation mechanism in environments such as data-parallel processing to efficiently check available space in memory locations before instructions such as write threads are allowed. These operations are handled efficiently in hardware, but are flexible enough to be implemented in all manner of programming models. | 10-03-2013 |
20130262834 | Hardware Managed Ordered Circuit - A system and method is provided for improving efficiency, power, and bandwidth consumption in parallel processing. Rather than requiring memory polling to ensure ordered execution of processes or threads, the techniques disclosed herein provide a system and method to allow any process or thread to run out of order as long as needed, but ensure ordered execution of multiple ordered instructions when needed. These operations are handled efficiently in hardware, but are flexible enough to be implemented in all manner of programming models. | 10-03-2013 |
20130326524 | Method and System for Synchronization of Workitems with Divergent Control Flow - Disclosed method, system, and computer program product embodiments include synchronizing a group of workitems on a processor by storing a respective program counter associated with each of the workitems, selecting at least one first workitem from the group for execution, and executing the selected at least one first workitem on the processor. The selecting is based upon the respective stored program counter associated with the at least one first workitem. | 12-05-2013 |
20140022263 | METHOD FOR URGENCY-BASED PREEMPTION OF A PROCESS - The desire to use an Accelerated Processing Device (APD) for general computation has increased due to the APD's exemplary performance characteristics. However, current systems incur high overhead when dispatching work to the APD because a process cannot be efficiently identified or preempted. The occupying of the APD by a rogue process for arbitrary amounts of time can prevent the effective utilization of the available system capacity and can reduce the processing progress of the system. Embodiments described herein can overcome this deficiency by enabling the system software to pre-empt a process executing on the APD for any reason. The APD provides an interface for initiating such a pre-emption. This interface exposes an urgency of the request which determines whether the process being preempted is allowed a grace period to complete its issued work before being forced off the hardware. | 01-23-2014 |
20140149677 | Prefetch Kernels on Data-Parallel Processors - Embodiments include methods, systems and computer readable media configured to execute a first kernel (e.g. compute or graphics kernel) with reduced intermediate state storage resource requirements. These include executing a first and second (e.g. prefetch) kernel on a data-parallel processor, such that the second kernel begins executing before the first kernel. The second kernel performs memory operations that are based upon at least a subset of memory operations in the first kernel. | 05-29-2014 |
20140157287 | Optimized Context Switching for Long-Running Processes - Methods, systems, and computer readable storage media embodiments allow for low overhead context switching of threads. In embodiments, applications, such as, but not limited to, iterative data-parallel applications, substantially reduce the overhead of context switching by adding a user or higher-level program configurability of a state to be saved upon preempting of an executing thread. These methods, systems, and computer readable storage media include aspects of running a group of threads on a processor, saving state information by respective threads in the group in response to a signal from a scheduler, and pre-empting running of the group after the saving of the state information. | 06-05-2014 |
20140292756 | Hybrid Render with Deferred Primitive Batch Binning - A system, method and a computer program product are provided for hybrid rendering with deferred primitive batch binning. A primitive batch is generated from a sequence of primitives. Initial bin intercepts are identified for primitives in the primitive batch. A bin for processing is identified. The bin corresponds to a region of a screen space. Pixels of the primitives intercepting the identified bin are processed. Next bin intercepts are identified while the primitives intercepting the identified bin are processed. | 10-02-2014 |
20140362102 | GRAPHICS PROCESSING HARDWARE FOR USING COMPUTE SHADERS AS FRONT END FOR VERTEX SHADERS - A GPU is configured to read and process data produced by a compute shader via the one or more ring buffers and pass the resulting processed data to a vertex shader as input. The GPU is further configured to allow the compute shader and vertex shader to write through a cache. Each ring buffer is configured to synchronize the compute shader and the vertex shader to prevent processed data generated by the compute shader that is written to a particular ring buffer from being overwritten before the data is accessed by the vertex shader. It is emphasized that this abstract is provided to comply with the rules requiring an abstract that will allow a searcher or other reader to quickly ascertain the subject matter of the technical disclosure. It is submitted with the understanding that it will not be used to interpret or limit the scope or meaning of the claims. | 12-11-2014 |
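
The ring-buffer behavior described in the last entry (20140362102), where data written by the compute shader must not be overwritten before the vertex shader has accessed it, can be illustrated with a small software sketch. This is only an illustration under stated assumptions, not the mechanism claimed in the application: the class `RingBuffer` and its methods `try_write` and `try_read` are hypothetical names, and the producer and consumer here are plain Python calls rather than shaders.

```python
class RingBuffer:
    """Fixed-capacity ring buffer that refuses to overwrite unread entries.

    Hypothetical illustration only; not the hardware design in the listing.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._read = 0    # index of the oldest unread entry
        self._count = 0   # number of unread entries currently stored

    def try_write(self, item):
        """Producer side: return False rather than overwrite unread data."""
        if self._count == len(self._slots):
            return False
        write = (self._read + self._count) % len(self._slots)
        self._slots[write] = item
        self._count += 1
        return True

    def try_read(self):
        """Consumer side: return (True, item), or (False, None) if empty."""
        if self._count == 0:
            return False, None
        item = self._slots[self._read]
        self._read = (self._read + 1) % len(self._slots)
        self._count -= 1
        return True, item


# The producer fills both slots; the third write is refused until a read
# frees a slot, so unread data is never overwritten.
ring = RingBuffer(2)
wrote = [ring.try_write(v) for v in ("v0", "v1", "v2")]
first = ring.try_read()
wrote_after_read = ring.try_write("v2")
```

In this sketch a full buffer rejects the write and leaves the retry to the caller. A real producer might stall or be rescheduled instead; the listing does not say which.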