Emberling

Brian Emberling, Palo Alto, CA US

Patent application number	Description	Published
20090037918	Thread sequencing for multi-threaded processor with instruction cache - Execution of the first thread of a new program is prioritized ahead of older threads for a previously running program. The new program is invoked during the execution of a thread of the previous program. The first thread of the program is prioritized ahead of the remaining threads of the previous program. In an embodiment of the invention, additional threads of the new program are also prioritized ahead of the older threads. A thread's context may include a table of constant values that can be referenced by each program and are shared by multiple threads. Changing the values in a constant table for a new thread is time intensive. To avoid changes to the constant table (and thereby save time), a higher priority status is conferred to the first thread that follows a change to the constant table.	02-05-2009
20090276777	Multiple Programs for Efficient State Transitions on Multi-Threaded Processors - A system and method are described for optimizing processor performance and minimizing average thread latency by selectively loading a cache when a program state, the resources required for execution of a program, or the program itself changes. An embodiment of the invention supports a “cache priming program” that is selectively executed for a first thread/program/sub-routine of each process. Such a program is optimized for situations when instructions and other program data are not yet resident in cache(s), and/or whenever resources required for program execution or the program itself changes. By pre-loading the cache with two resources required for two instructions for only a first thread, average thread latency is reduced because the resources are already present in the cache. Since such a mechanism is carried out only for one thread in a program cycle, the pitfalls of a conventional general pre-fetch scheme, which involves parsing the program in advance to determine which resources and instructions will be needed at a later time, are avoided.	11-05-2009
20110055511	Interlocked Increment Memory Allocation and Access - A method of allocating a memory to a plurality of concurrent threads is presented. The method includes dynamically determining writer threads each having at least one pending write to the memory; and dynamically allocating respective contiguous blocks in the memory for each of the writer threads. Another method of allocating a memory to a plurality of concurrent threads includes launching the plurality of threads as a plurality of wavefronts, dynamically determining a group of wavefronts each having at least one thread requiring a write to the memory, and dynamically allocating respective contiguous blocks in the memory for each wavefront from the group of wavefronts. A corresponding method of assigning a memory to a plurality of reader threads includes determining a first number corresponding to a number of writer threads having a block allocated in said memory, launching a first number of reader threads, entering a first wavefront of said reader threads from said group of wavefronts to an atomic operation, and assigning a first block in the memory to the first wavefront during the corresponding atomic operation, where the first block is contiguous to a previously allocated block dynamically allocated to another wavefront from said group of wavefronts. Corresponding system embodiments and computer program product embodiments are also presented.	03-03-2011
20110173629	Thread Synchronization - A method of processing threads is provided. The method includes receiving a first thread that accesses a memory resource in a current state, holding the first thread, and releasing the first thread in response to receiving a final thread that accesses the memory resource in the current state.	07-14-2011
20120013627	DYNAMIC CONTROL OF SIMDs - Systems and methods to improve performance in a graphics processing unit are described herein. Embodiments achieve power saving in a graphics processing unit by dynamically activating/deactivating individual SIMDs in a shader complex that comprises multiple SIMD units. On-the-fly dynamic disabling and enabling of individual SIMDs provides flexibility in achieving a required performance and power level for a given processing application. Embodiments of the invention also achieve dynamic medium grain clock gating of SIMDs in a shader complex. Embodiments reduce switching power by shutting down clock trees to unused logic by providing a clock on demand mechanism. In this way, embodiments enhance clock gating to save more switching power for the duration of time when SIMDs are idle (or assigned no work). Embodiments can also save leakage power by power gating SIMDs for a duration when SIMDs are idle for an extended period of time.	01-19-2012
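
The allocation idea in application 20110055511 (only threads with a pending write receive a contiguous block, reserved by an atomic increment of a shared offset) can be illustrated with a minimal sketch. This is not the patented implementation: `BumpAllocator`, `run_writers`, and the use of a lock in place of a hardware atomic are all assumptions made for illustration.

```python
import threading


class BumpAllocator:
    """Hypothetical allocator: hands out contiguous, non-overlapping blocks."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._next = 0
        # Assumption: a lock stands in for a hardware interlocked increment.
        self._lock = threading.Lock()

    def allocate(self, size):
        """Atomically reserve `size` slots; return the start offset, or None if full."""
        with self._lock:
            start = self._next
            if start + size > self.capacity:
                return None
            self._next = start + size
            return start


def run_writers(sizes, capacity):
    """Launch one thread per entry in `sizes`; only threads with size > 0 allocate."""
    alloc = BumpAllocator(capacity)
    blocks = {}

    def writer(index, size):
        if size > 0:  # threads without a pending write never touch the allocator
            blocks[index] = (alloc.allocate(size), size)

    threads = [
        threading.Thread(target=writer, args=(i, s)) for i, s in enumerate(sizes)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return blocks
```

Calling `run_writers([3, 0, 2, 4], capacity=16)` returns a mapping from writer index to `(start, size)`. The order in which writers obtain blocks depends on scheduling, but the blocks are always adjacent and the thread with no pending write gets none.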


Brian Emberling, San Mateo, CA US

Patent application number	Description	Published
20090300621	Local and Global Data Share - A graphics processing unit is disclosed, the graphics processing unit having a processor having one or more SIMD processing units, and a local data share corresponding to one of the one or more SIMD processing units, the local data share comprising one or more low latency accessible memory regions for each group of threads assigned to one or more execution wavefronts, and a global data share comprising one or more low latency memory regions for each group of threads.	12-03-2009

Brian D. Emberling, Palo Alto, CA US

Patent application number	Description	Published
20090276563	Incremental State Updates - A system and method are described that manage incremental state updates in such a way that multiple threads within a processor can each operate, in effect, on their own set of state data. The system and method are applicable to any processor in which multiple threads require access to sets of state information which differ from one another by a relatively small number of state changes.	11-05-2009
20100107143	Method and System for Thread Monitoring - An apparatus and methods for hardware-based performance monitoring of a computer system are presented. The apparatus includes: processing units; a memory; a connector device connecting the processing units and the memory; probes inserted in the processing units, the probes generating probe signals when selected processing events are detected; and a thread trace device connected to the connector device. The thread trace device includes an event interface to receive probe signals, and an event memory controller to send probe event messages to the memory, where probe event messages are based on probe signals. Also presented is a method that includes: inserting event probes in hardware-based processing units, where the event probes generate probe events when predetermined processing events are detected; configuring a hardware-based device to generate probe event messages based on said probe events; and transferring the probe event messages to a memory. The probe event messages transferred to memory can be subsequently analyzed using a software program to determine, for example, thread-to-thread interactions.	04-29-2010
20130124900	PROCESSOR WITH POWER CONTROL VIA INSTRUCTION ISSUANCE - Methods and apparatuses are provided for power control in a processor. The apparatus comprises a plurality of operational units arranged as a group of operational units. A power consumption monitor determines when cumulative power consumption of the group of operational units exceeds a threshold (e.g., either or both of the cumulative power threshold and the cumulative power rate threshold) during a time interval, after which a filter for issuing instructions to the group of operational units suspends instruction issuance to the group of operational units for the remainder of the time interval. The method comprises monitoring cumulative power consumption by a group of operational units within a processor over a time interval. If the cumulative power consumption of the group of operational units exceeds the threshold, instruction issuance to the group of operational units is suspended for the remainder of the time interval.	05-16-2013
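
The throttling rule in application 20130124900 (track cumulative power over a time interval and suspend instruction issuance for the rest of that interval once a threshold is exceeded) can be illustrated with a small simulation. This is a sketch under assumptions, not the patented design: the function name, the per-cycle cost list, and the fixed interval length are hypothetical, and only a single cumulative threshold is modeled.

```python
def issue_with_power_cap(costs, interval_len, threshold):
    """Simulate instruction issue under a per-interval cumulative power cap.

    costs        -- hypothetical power cost of the instruction offered each cycle
    interval_len -- number of cycles per monitoring interval
    threshold    -- cumulative power allowed per interval before issue is suspended

    Returns a list of booleans, one per cycle: True if the instruction was issued.
    """
    issued = []
    cumulative = 0
    suspended = False
    for cycle, cost in enumerate(costs):
        if cycle % interval_len == 0:
            # A new interval starts: reset the monitor and resume issuing.
            cumulative = 0
            suspended = False
        if suspended:
            issued.append(False)
            continue
        cumulative += cost
        issued.append(True)
        if cumulative > threshold:
            # Threshold exceeded: stall until the interval ends.
            suspended = True
    return issued


print(issue_with_power_cap([3, 3, 3, 3, 1, 1, 1, 1], interval_len=4, threshold=5))
# prints [True, True, False, False, True, True, True, True]
```

In this simplified model the instruction that pushes the total over the threshold is still issued, and stalled cycles are simply recorded as not issued rather than retried.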


Brian D. Emberling, San Mateo, CA US

Patent application number	Description	Published
20080313436	Handling of extra contexts for shader constants - The present invention provides a system for handling extra contexts for shader constants, and applications thereof. In an embodiment there is provided a computer-based method for executing a series of compute packets in an execution pipeline. The execution pipeline includes a first plurality of registers configured to store state-updates of a first type and a second plurality of registers configured to store state-updates of a second type. A first number of state-updates of the first type and a second number of state-updates of the second type are respectively identified and stored in the first and second plurality of registers. A compute packet is sent to the execution pipeline responsive to the first number and the second number. Then, the compute packet is executed by the execution pipeline.	12-18-2008
20120204014	Systems and Methods for Improving Divergent Conditional Branches - Embodiments of the present invention provide systems, methods, and computer program products for improving divergent conditional branches in code being executed by a processor. For example, in an embodiment, a method comprises detecting a conditional statement of a program being simultaneously executed by a plurality of threads, determining which threads evaluate a condition of the conditional statement as true and which threads evaluate the condition as false, pushing an identifier associated with the larger set of the threads onto a stack, executing code associated with a smaller set of the threads, and executing code associated with the larger set of the threads.	08-09-2012
20130117750	Method and System for Workitem Synchronization - Method, system, and computer program product embodiments for synchronizing workitems on one or more processors are disclosed. The embodiments include executing a barrier skip instruction by a first workitem from the group, and responsive to the executed barrier skip instruction, reconfiguring a barrier to synchronize other workitems from the group in a plurality of points in a sequence without requiring the first workitem to reach the barrier in any of the plurality of points.	05-09-2013
20140108871	Method and System for Thread Monitoring - An apparatus and methods for hardware-based performance monitoring of a computer system are presented. The apparatus includes: processing units; a memory; a connector device connecting the processing units and the memory; probes inserted in the processing units, the probes generating probe signals when selected processing events are detected; and a thread trace device connected to the connector device. The thread trace device includes an event interface to receive probe signals, and an event memory controller to send probe event messages to the memory, where probe event messages are based on probe signals. The probe event messages transferred to memory can be subsequently analyzed using a software program to determine, for example, thread-to-thread interactions.	04-17-2014
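
The ordering described in application 20120204014 (split the threads by branch outcome, push the larger set onto a stack, run the smaller set first, then run the larger set) can be illustrated with a minimal sketch. The function name, the lane-index lists, and the sequential execution of each set are assumptions for illustration only. Real hardware would apply an execution mask across SIMD lanes.

```python
def run_divergent_branch(values, cond, then_fn, else_fn):
    """Execute an if/else over per-thread values, smaller thread set first.

    Returns (results, order), where `order` lists the lane indices of each
    set in the order the sets were executed.
    """
    lanes = range(len(values))
    taken = [i for i in lanes if cond(values[i])]
    not_taken = [i for i in lanes if not cond(values[i])]

    # Sort the two paths by thread count; ties keep the taken path first.
    paths = sorted([(taken, then_fn), (not_taken, else_fn)], key=lambda p: len(p[0]))
    smaller, larger = paths

    stack = [larger]  # the larger set is deferred on a stack
    results = list(values)
    order = []

    mask, fn = smaller
    for i in mask:
        results[i] = fn(values[i])
    order.append(list(mask))

    mask, fn = stack.pop()  # resume the deferred, larger set
    for i in mask:
        results[i] = fn(values[i])
    order.append(list(mask))

    return results, order


results, order = run_divergent_branch(
    [1, 2, 3, 4, 5], lambda x: x % 2 == 0, lambda x: x * 10, lambda x: -x
)
print(results)  # prints [-1, 20, -3, 40, -5]
print(order)    # prints [[1, 3], [0, 2, 4]]
```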
