Patent application number | Description | Published |
20110078427 | TRAP HANDLER ARCHITECTURE FOR A PARALLEL PROCESSING UNIT - A trap handler architecture is incorporated into a parallel processing subsystem such as a GPU. The trap handler architecture minimizes design complexity and verification efforts for concurrently executing threads by imposing a property that all thread groups associated with a streaming multi-processor are either all executing within their respective code segments or are all executing within the trap handler code segment. (Sketch below.) | 03-31-2011 |
20130124838 | INSTRUCTION LEVEL EXECUTION PREEMPTION - One embodiment of the present invention sets forth a technique for instruction-level and compute thread array granularity execution preemption. Preempting at the instruction level does not require any draining of the processing pipeline: no new instructions are issued, and the context state is unloaded from the processing pipeline. When preemption is performed at a compute thread array boundary, the amount of context state to be stored is reduced because execution units within the processing pipeline complete execution of in-flight instructions and become idle. If the amount of time needed to complete execution of the in-flight instructions exceeds a threshold, then the preemption may dynamically change to be performed at the instruction level instead of at compute thread array granularity. (Sketch below.) | 05-16-2013 |
20130145102 | MULTI-LEVEL INSTRUCTION CACHE PREFETCHING - One embodiment of the present invention sets forth an improved way to prefetch instructions in a multi-level cache. The fetch unit initiates a prefetch operation to transfer one of a set of multiple cache lines, based on a function of a pseudorandom number generator and the sector corresponding to the current instruction L1 cache line. The fetch unit selects a prefetch target from the set of multiple cache lines according to a probability function. (Sketch below.) | 06-06-2013 |
20140165072 | TECHNIQUE FOR SAVING AND RESTORING THREAD GROUP OPERATING STATE - A streaming multiprocessor (SM) included within a parallel processing unit (PPU) is configured to suspend a thread group executing on the SM and to save the operating state of the suspended thread group. A load-store unit (LSU) within the SM re-maps local memory associated with the thread group to a location in global memory. Subsequently, the SM may re-launch the suspended thread group. The LSU may then perform local memory access operations on behalf of the re-launched thread group with the re-mapped local memory that resides in global memory. (Sketch below.) | 06-12-2014 |
20140189260 | APPROACH FOR CONTEXT SWITCHING OF LOCK-BIT PROTECTED MEMORY - A streaming multiprocessor in a parallel processing subsystem processes atomic operations for multiple threads in a multi-threaded architecture. The streaming multiprocessor receives a request from a thread in a thread group to acquire access to a memory location in a lock-protected shared memory, and determines whether an address lock in a plurality of address locks is asserted, where the address lock is associated with the memory location. If the address lock is asserted, then the streaming multiprocessor refuses the request. Otherwise, the streaming multiprocessor asserts the address lock, asserts a thread group lock in a plurality of thread group locks, where the thread group lock is associated with the thread group, and grants the request. One advantage of the disclosed techniques is that acquired locks are released when a thread is preempted. As a result, a preempted thread that has previously acquired a lock does not retain the lock indefinitely. (Sketch below.) | 07-03-2014 |
20140189329 | COOPERATIVE THREAD ARRAY GRANULARITY CONTEXT SWITCH DURING TRAP HANDLING - Techniques are provided for handling a trap encountered in a thread that is part of a thread array that is being executed in a plurality of execution units. In these techniques, a data structure with an identifier associated with the thread is updated to indicate that the trap occurred during the execution of the thread array. Also in these techniques, the execution units execute a trap handling routine that includes a context switch. The execution units perform this context switch for at least one of the execution units as part of the trap handling routine while allowing the remaining execution units to exit the trap handling routine before the context switch. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors. (Sketch below.) | 07-03-2014 |
20140189711 | COOPERATIVE THREAD ARRAY GRANULARITY CONTEXT SWITCH DURING TRAP HANDLING - Techniques are provided for restoring thread groups in a cooperative thread array (CTA) within a processing core. Each thread group in the CTA is launched to execute a context restore routine. Each thread group executes the context restore routine to restore from memory a first portion of context associated with the thread group, and determines whether the thread group completed an assigned function prior to executing the context restore routine. If the thread group completed the assigned function prior to executing the context restore routine, then the thread group exits the context restore routine. If the thread group did not complete the assigned function prior to executing the context restore routine, then the thread group executes one or more operations associated with a trap handler routine. One advantage of the disclosed techniques is that the trap handling routine operates efficiently in parallel processors. (Sketch below.) | 07-03-2014 |
20140337389 | SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR SCHEDULING TASKS ASSOCIATED WITH CONTINUATION THREAD BLOCKS - A system, method, and computer program product for scheduling tasks associated with continuation thread blocks. The method includes the steps of generating a first task metadata data structure in a memory, generating a second task metadata data structure in the memory, executing a first task corresponding to the first task metadata data structure in a processor, generating state information representing a continuation task related to the first task and storing the state information in the second task metadata data structure, executing the continuation task in the processor after the one or more child tasks have finished execution, and indicating that the first task has logically finished execution once the continuation task has finished execution. The second task metadata data structure is related to the first task metadata data structure, and at least one instruction in the first task causes one or more child tasks to be executed by the processor. (Sketch below.) | 11-13-2014 |
20140337569 | SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR LOW LATENCY SCHEDULING AND LAUNCH OF MEMORY DEFINED TASKS - A system, method, and computer program product for low-latency scheduling and launch of memory defined tasks. The method includes the steps of receiving a task metadata data structure to be stored in a memory associated with a processor, transmitting the task metadata data structure to a scheduling unit of the processor, storing the task metadata data structure in a cache unit included in the scheduling unit, and copying the task metadata data structure from the cache unit to the memory. (Sketch below.) | 11-13-2014 |
20140372703 | SYSTEM, METHOD, AND COMPUTER PROGRAM PRODUCT FOR WARMING A CACHE FOR A TASK LAUNCH - A system, method, and computer program product for warming a cache for a task launch is described. The method includes the steps of receiving a task data structure that defines a processing task, extracting information stored in a cache warming field of the task data structure, and, prior to executing the processing task, generating a cache warming instruction that is configured to load one or more entries of a cache storage with data fetched from a memory. (Sketch below.) | 12-18-2014 |
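
The C sketches below model the mechanisms summarized in the table. They are minimal illustrations under stated assumptions, not the patented implementations; every identifier, size, and policy in them is invented for exposition. First, the all-or-nothing trap handler property of 20110078427: a minimal model, assuming a simple per-group state flag, of the invariant that thread groups on a streaming multiprocessor are never split between their own code segments and the trap handler.

```c
/* Sketch of the invariant imposed by the trap handler architecture in
 * 20110078427: on a trap, every thread group on the SM moves into the
 * trap handler together, so groups are never split between their own
 * code segments and the handler. The state model is an assumption. */
#include <stdbool.h>
#include <stdio.h>

#define GROUPS 4
typedef enum { IN_CODE_SEGMENT, IN_TRAP_HANDLER } Where;
static Where where[GROUPS];   /* all groups start in their code segments */

/* Deliver a trap: all thread groups enter the trap handler at once. */
void deliver_trap(void) {
    for (int g = 0; g < GROUPS; g++) where[g] = IN_TRAP_HANDLER;
}

/* The property that minimizes design and verification effort: mixed
 * states never occur. */
bool invariant_holds(void) {
    for (int g = 1; g < GROUPS; g++)
        if (where[g] != where[0]) return false;
    return true;
}

int main(void) {
    deliver_trap();
    printf("invariant holds: %s\n", invariant_holds() ? "yes" : "no");
    return 0;
}
```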
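
For 20130124838, a sketch of the dynamic granularity decision, assuming an invented pipeline model and cycle counts: CTA-boundary preemption is attempted by draining in-flight instructions, and the policy falls back to instruction-level preemption when draining exceeds a threshold.

```c
/* Host-side simulation of the preemption policy described in
 * 20130124838. Structures and numbers are illustrative assumptions,
 * not the patented hardware interface. */
#include <stdio.h>

typedef enum { PREEMPT_CTA_BOUNDARY, PREEMPT_INSTRUCTION_LEVEL } PreemptLevel;

typedef struct {
    int inflight_instructions;   /* instructions still in the pipeline   */
    int drain_rate_per_cycle;    /* how many retire each simulated cycle */
} Pipeline;

/* Try to drain the pipeline so preemption can happen at a CTA boundary
 * (less context state to save). If draining takes longer than
 * `threshold_cycles`, dynamically fall back to instruction-level
 * preemption, which needs no draining. */
PreemptLevel preempt(Pipeline *p, int threshold_cycles) {
    int cycles = 0;
    while (p->inflight_instructions > 0) {
        if (cycles >= threshold_cycles) {
            /* Too slow: stop issuing and unload context state as-is. */
            return PREEMPT_INSTRUCTION_LEVEL;
        }
        p->inflight_instructions -= p->drain_rate_per_cycle;
        cycles++;
    }
    return PREEMPT_CTA_BOUNDARY;  /* execution units went idle in time */
}

int main(void) {
    Pipeline fast = { 8, 4 }, slow = { 400, 1 };
    printf("fast pipeline: %s\n",
           preempt(&fast, 100) == PREEMPT_CTA_BOUNDARY ? "CTA boundary"
                                                       : "instruction level");
    printf("slow pipeline: %s\n",
           preempt(&slow, 100) == PREEMPT_CTA_BOUNDARY ? "CTA boundary"
                                                       : "instruction level");
    return 0;
}
```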
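
For 20130145102, a loose sketch of prefetch target selection, assuming a 128-byte line, a four-line candidate set, and rand() as the pseudorandom source; the abstract does not specify the actual selection function, so this is only one plausible reading.

```c
/* Illustrative selection of a prefetch target among several candidate
 * cache lines, loosely following the abstract of 20130145102. The
 * candidate set, line size, sector count, and selection function are
 * all assumptions. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define LINE_BYTES  128u  /* assumed instruction cache line size        */
#define CANDIDATES  4u    /* assumed size of the prefetch candidate set */

/* Pick which of the next CANDIDATES lines to prefetch, mixing a
 * pseudorandom draw with the sector of the current L1 line. */
uint64_t select_prefetch_line(uint64_t current_line_addr) {
    unsigned sector = (current_line_addr / LINE_BYTES) & 0x3;  /* assumed 4 sectors */
    unsigned pick   = ((unsigned)rand() + sector) % CANDIDATES;
    return current_line_addr + (uint64_t)(pick + 1) * LINE_BYTES;
}

int main(void) {
    srand(42);
    uint64_t pc_line = 0x10000;  /* current instruction L1 cache line */
    for (int i = 0; i < 4; i++)
        printf("prefetch target: 0x%llx\n",
               (unsigned long long)select_prefetch_line(pc_line));
    return 0;
}
```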
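
For 20140165072, a host-side sketch of re-mapping a suspended thread group's local memory to a global-memory backing store, with an assumed remap table consulted by the load-store unit after re-launch.

```c
/* Host-side sketch of re-mapping a thread group's local memory into a
 * global-memory backing store, as described in 20140165072. The table
 * layout and sizes are assumptions for illustration. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define MAX_GROUPS   4
#define LOCAL_BYTES  256   /* assumed per-thread-group local memory size */

static uint8_t global_mem[MAX_GROUPS * LOCAL_BYTES];  /* stand-in for DRAM */
static uint8_t *remap_table[MAX_GROUPS];              /* LSU remap entries */

/* Suspend: copy the group's local memory out and record the new base. */
void remap_local_to_global(int group, const uint8_t *local) {
    uint8_t *backing = &global_mem[group * LOCAL_BYTES];
    memcpy(backing, local, LOCAL_BYTES);
    remap_table[group] = backing;
}

/* After re-launch, the LSU services local loads from the remapped base
 * that now resides in global memory. */
uint8_t lsu_local_load(int group, uint32_t offset) {
    return remap_table[group][offset];
}

int main(void) {
    uint8_t local[LOCAL_BYTES];
    memset(local, 0xAB, sizeof local);
    remap_local_to_global(2, local);          /* suspend thread group 2 */
    printf("byte 17 after re-launch: 0x%02X\n", lsu_local_load(2, 17));
    return 0;
}
```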
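
For 20140189260, a model of the lock acquisition path using C11 atomics in place of hardware lock bits; the table sizes and the address-to-slot hash are assumptions.

```c
/* Simplified model of the lock acquisition path in 20140189260, using
 * C11 atomics in place of hardware lock bits. Table sizes and the hash
 * from address to lock slot are illustrative assumptions. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_ADDR_LOCKS   64
#define NUM_GROUP_LOCKS  8

static atomic_bool addr_lock[NUM_ADDR_LOCKS];   /* one lock bit per slot    */
static atomic_bool group_lock[NUM_GROUP_LOCKS]; /* one lock per thread group */

/* A thread in `group` asks for exclusive access to `addr`. If the
 * address lock is already asserted the request is refused; otherwise
 * both the address lock and the group's lock are asserted. */
bool acquire(uint64_t addr, int group) {
    int slot = (int)(addr % NUM_ADDR_LOCKS);    /* assumed address hash */
    if (atomic_exchange(&addr_lock[slot], true))
        return false;                           /* already asserted: refuse */
    atomic_exchange(&group_lock[group], true);  /* assert thread group lock */
    return true;
}

/* On preemption, the locks held by the thread group are released so a
 * preempted thread cannot retain a lock indefinitely. */
void release_on_preempt(uint64_t addr, int group) {
    atomic_store(&addr_lock[(int)(addr % NUM_ADDR_LOCKS)], false);
    atomic_store(&group_lock[group], false);
}

int main(void) {
    printf("first acquire: %s\n",  acquire(0x1000, 3) ? "granted" : "refused");
    printf("second acquire: %s\n", acquire(0x1000, 5) ? "granted" : "refused");
    release_on_preempt(0x1000, 3);               /* group 3 is preempted */
    printf("after preemption: %s\n", acquire(0x1000, 5) ? "granted" : "refused");
    return 0;
}
```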
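
For 20140189329, a sketch of the selective context switch: every execution unit runs the trap handler, but only the unit hosting the trapping thread performs the context switch while the others exit early. The trap record and the mapping of 32 threads per unit are assumptions.

```c
/* Sketch of the trap-handling flow in 20140189329: all execution units
 * enter the trap handler, but only the unit whose thread raised the
 * trap performs the (expensive) context switch; the rest exit the
 * handler before the switch. Structures here are assumptions. */
#include <stdbool.h>
#include <stdio.h>

#define NUM_UNITS         4
#define THREADS_PER_UNIT  32   /* assumed thread-to-unit mapping */

typedef struct {
    bool trap_raised;
    int  trapping_thread_id;   /* identifier recorded when the trap hit */
} TrapRecord;

void context_switch(int unit) { printf("unit %d: context switch\n", unit); }

/* Trap handling routine run by every execution unit. */
void trap_handler(int unit, const TrapRecord *rec) {
    if (rec->trap_raised && rec->trapping_thread_id / THREADS_PER_UNIT == unit) {
        context_switch(unit);   /* only the affected unit switches */
    } else {
        printf("unit %d: exits trap handler before the switch\n", unit);
    }
}

int main(void) {
    TrapRecord rec = { .trap_raised = true, .trapping_thread_id = 70 };
    for (int u = 0; u < NUM_UNITS; u++)
        trap_handler(u, &rec);   /* thread 70 maps to unit 2 (70/32) */
    return 0;
}
```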
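
For 20140189711, a control-flow sketch of the per-thread-group context restore routine; the completion flag and field names are assumptions.

```c
/* Control-flow sketch of the per-thread-group context restore routine
 * in 20140189711. Field names and the completion flag are assumptions. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int  group_id;
    bool completed_before_trap;  /* had the group already finished its work? */
} GroupContext;

void run_trap_handler_ops(int g) { printf("group %d: trap handler ops\n", g); }

void context_restore_routine(GroupContext *ctx) {
    /* Restore the first portion of this group's saved context. */
    printf("group %d: context restored from memory\n", ctx->group_id);

    if (ctx->completed_before_trap) {
        /* Nothing left to do: leave the restore routine immediately. */
        printf("group %d: exits restore routine\n", ctx->group_id);
        return;
    }
    /* Otherwise resume via the trap handler's remaining operations. */
    run_trap_handler_ops(ctx->group_id);
}

int main(void) {
    GroupContext done = { 0, true }, pending = { 1, false };
    context_restore_routine(&done);
    context_restore_routine(&pending);
    return 0;
}
```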
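
For 20140337389, a sketch of the two related task metadata structures, with invented fields: the first task is marked logically finished only after its continuation runs.

```c
/* Sketch of the continuation-task bookkeeping in 20140337389: a second
 * task metadata structure records the continuation of a first task, and
 * the first task is marked logically finished only after the
 * continuation completes. Struct fields are illustrative assumptions. */
#include <stdbool.h>
#include <stdio.h>

typedef struct TaskMD {
    const char    *name;
    struct TaskMD *continuation;   /* related continuation metadata */
    bool           logically_done;
} TaskMD;

void run(const TaskMD *t) { printf("running %s\n", t->name); }

int main(void) {
    TaskMD cont  = { "continuation", NULL, false };
    TaskMD first = { "first task",  &cont, false };

    run(&first);                 /* spawns child tasks (not modeled here) */
    /* ... child tasks finish execution ... */
    run(first.continuation);     /* continuation runs after the children */

    cont.logically_done  = true;
    first.logically_done = true; /* only now is the first task finished  */
    printf("first task logically finished: %s\n",
           first.logically_done ? "yes" : "no");
    return 0;
}
```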
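
For 20140337569, a sketch of the launch path in which the task metadata descriptor lands in the scheduling unit's cache first, so the launch need not wait on memory, and is copied to memory afterward; sizes and write-back timing are assumptions.

```c
/* Sketch of the low-latency launch path in 20140337569: an incoming
 * task metadata structure is placed directly in the scheduling unit's
 * cache so the launch avoids a round trip to memory, then copied from
 * the cache to memory. Sizes and timing are assumptions. */
#include <stdio.h>
#include <string.h>

typedef struct { char payload[64]; } TaskMetadata;

#define CACHE_SLOTS 4
static TaskMetadata sched_cache[CACHE_SLOTS]; /* cache in the scheduling unit */
static TaskMetadata memory[1024];             /* backing memory               */

void launch_task(const TaskMetadata *tmd, int slot, int mem_index) {
    sched_cache[slot] = *tmd;               /* scheduler launches immediately */
    printf("launched from cache slot %d\n", slot);
    memory[mem_index] = sched_cache[slot];  /* copy from cache to memory      */
}

int main(void) {
    TaskMetadata t;
    strcpy(t.payload, "grid config + kernel pointer");
    launch_task(&t, 0, 17);
    return 0;
}
```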
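
For 20140372703, a sketch of turning a task data structure's cache warming field into loads issued before the task executes; the field layout and line size are assumptions.

```c
/* Sketch of the cache-warming step in 20140372703: before a task runs,
 * the information in its cache warming field is turned into loads that
 * populate cache entries. The field layout is an assumption. */
#include <stdint.h>
#include <stdio.h>

typedef struct {
    uint64_t warm_base;   /* cache warming field: where to start fetching */
    uint32_t warm_lines;  /* ...and how many lines to pull in             */
} TaskDesc;

#define LINE_BYTES 128u   /* assumed cache line size */

/* Issue one load per line so the data is resident before the task runs;
 * the volatile pointer keeps the compiler from eliding the touches. */
void warm_cache(const TaskDesc *t, const volatile uint8_t *mem) {
    for (uint32_t i = 0; i < t->warm_lines; i++)
        (void)mem[t->warm_base + (uint64_t)i * LINE_BYTES];  /* touch line */
}

int main(void) {
    static uint8_t mem[1u << 16];
    TaskDesc task = { .warm_base = 0x2000, .warm_lines = 8 };
    warm_cache(&task, mem);          /* generated before task execution */
    printf("warmed %u lines starting at 0x%llx\n",
           task.warm_lines, (unsigned long long)task.warm_base);
    return 0;
}
```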