Patent application number | Description | Published |
20080244538 | Multi-core processor virtualization based on dynamic binary translation - A processor virtualization abstracts the behavior of a processor instruction set architecture from an underlying micro-architecture implementation. It is capable of running any processor instruction set architecture compatible software on any micro-architecture implementation. A system wide dynamic binary translator translates source system programs to target programs and manages the execution of those target programs. It also provides the necessary and sufficient infrastructure required to render multi-core processor virtualization. | 10-02-2008 |
20080282116 | Transient Fault Detection by Integrating an SRMT Code and a Non SRMT Code in a Single Application - Disclosed is a method for running a first code generated by a Software-based Redundant Multi-Threading (SRMT) compiler along with a second code generated by a normal compiler at runtime, the first code including a first function and a second function, the second code including a third function. The method comprises running the first function in a leading thread and a trailing thread… | 11-13-2008 |
20080282257 | Transient Fault Detection by Integrating an SRMT Code and a Non SRMT Code in a Single Application - Disclosed is a method for running a first code generated by a Software-based Redundant Multi-Threading (SRMT) compiler along with a second code generated by a normal compiler at runtime, the first code including a first function and a second function, the second code including a third function. The method comprises running the first function in a leading thread and a trailing thread… | 11-13-2008 |
20090077360 | Software constructed strands for execution on a multi-core architecture - In one embodiment, the present invention includes a software-controlled method of forming instruction strands. The software may include instructions to obtain code of a superblock including a plurality of basic blocks, build a dependency directed acyclic graph (DAG) for the code, sort nodes coupled by edges of the dependency DAG into a topological order, form strands from the nodes based on hardware constraints, rule constraints, and scheduling constraints, and generate executable code for the strands and store the executable code in a storage. Other embodiments are described and claimed. | 03-19-2009 |
20090125894 | Highly scalable parallel static single assignment for dynamic optimization on many core architectures - A method, system, and computer readable medium for converting a series of computer executable instructions in control flow graph form into an intermediate representation, of a type similar to Static Single Assignment (SSA), used in the compiler arts. The intermediate representation may facilitate compilation optimizations such as constant propagation, sparse conditional constant propagation, dead code elimination, global value numbering, partial redundancy elimination, strength reduction, and register allocation. The method, system, and computer readable medium are capable of operating on the control flow graph to construct an SSA representation in parallel, thus exploiting recent advances in multi-core processing and massively parallel computing systems. Other embodiments may be employed, and other embodiments are described and claimed. | 05-14-2009 |
20090172644 | SOFTWARE FLOW TRACKING USING MULTIPLE THREADS - Methods, systems and machine readable media are disclosed for performing dynamic information flow tracking. One method includes executing operations of a program with a main thread, and tracking the main thread's execution of the operations of the program with a tracking thread. The method further includes updating, with the tracking thread, a taint value associated with the value of the main thread to reflect whether the value is tainted, and determining, with the tracking thread based upon the taint value, whether use of the value by the main thread violates a specific security policy. | 07-02-2009 |
20090172654 | PROGRAM TRANSLATION AND TRANSACTIONAL MEMORY FORMATION - Disclosed are methods, machine readable medium and systems that dynamically translate binary programs. The dynamic binary translation may include identifying a hot code trace of a program. The translation may further include determining a completion ratio for the hot code trace. The translation may also include packaging the hot code trace into a transactional memory region in response to the completion ratio having a predetermined relationship to a threshold ratio. | 07-02-2009 |
20090172713 | ON-DEMAND EMULATION VIA USER-LEVEL EXCEPTION HANDLING - Methods and apparatuses enable on-demand instruction emulation via user-level exception handling. A non-supported instruction triggers an exception during runtime of a program. In response to the exception, a user-level or application-level exception handler is launched, instead of a kernel-level handler. Then the exception handler can execute at the application layer instead of the kernel level. The handler identifies the instruction and emulates the instruction, where emulation of the instruction is supported by the handler. Emulating the instructions enables the program to continue execution. Repeated instruction emulation is amortized via dynamic binary translation of hot code. | 07-02-2009 |
20090313616 | Code reuse and locality hinting - A method and apparatus for improving parallelism through optimal code replication is herein described. An optimal replication factor for code is determined based on costs associated with a plurality of replication factors. The code is replicated by the optimal replication factor, and then the code is potentially executed in parallel to obtain parallelized efficient execution. | 12-17-2009 |
20100083236 | COMPACT TRACE TREES FOR DYNAMIC BINARY PARALLELIZATION - Methods and apparatus relating to compact trace trees for dynamic binary parallelization are described. In one embodiment, a compact trace tree (CTT) is generated to improve the effectiveness of dynamic binary parallelization. CTT may be used to determine which traces are to be duplicated and specialized for execution on separate processing elements. Other embodiments are also described and claimed. | 04-01-2010 |
20100169861 | ENERGY/PERFORMANCE WITH OPTIMAL COMMUNICATION IN DYNAMIC PARALLELIZATION OF SINGLE-THREADED PROGRAMS - A method and apparatus for optimizing parallelized single threaded programs is herein described. Code regions, such as dependency chains, are replicated utilizing any known method, such as dynamic code replication. A flow network associated with a replicated code region is built and a minimum cut algorithm is applied to determine duplicated nodes, which may include a single instruction or a group of instructions, to be removed. The dependency of removed nodes is fulfilled with inserted communication to ensure proper data consistency of the original single-threaded program. As a result, both performance and power consumption are optimized for parallel code sections through removal of expensive workload nodes and replacement with communication between other replicated code regions to be executed in parallel. | 07-01-2010 |
20100306512 | COMPILER TECHNIQUE FOR EFFICIENT REGISTER CHECKPOINTING TO SUPPORT TRANSACTION ROLL-BACK - A method and apparatus for efficient register checkpointing is herein described. A transaction is detected in program code. A recovery block is inserted in the program code to perform recovery operations in response to an abort of the first transaction. A roll-back edge is potentially inserted from an abort point to the recovery block. A control flow edge is inserted from the recovery block to an entry point of the transaction. Checkpoint code is inserted before the entry point to backup live-in registers in backup storage elements and recovery code is inserted in the recovery block to restore the live-in registers from the backup storage elements in response to an abort of the transaction. | 12-02-2010 |
20110099541 | Context-Sensitive Slicing For Dynamically Parallelizing Binary Programs - In one embodiment of the invention a method comprising (1) receiving an unstructured binary code region that is single-threaded; (2) determining a slice criterion for the region; (3) determining a call edge, a return edge, and a fallthrough pseudo-edge for the region based on analysis of the region at a binary level; and (4) determining a context-sensitive slice based on the call edge, the return edge, the fallthrough pseudo-edge, and the slice criterion. Embodiments of the invention may include a program analysis technique that can be used to provide context-sensitive slicing of binary programs for slicing hot regions identified at runtime, with few underlying assumptions about the program from which the binary is derived. Also, in an embodiment a slicing method may include determining a context-insensitive slice, when a time limit is met, by determining the context-insensitive slice while treating call edges as normal control flow edges. | 04-28-2011 |
20110145551 | TWO-STAGE COMMIT (TSC) REGION FOR DYNAMIC BINARY OPTIMIZATION IN X86 - Generally, the present disclosure provides systems and methods to generate a two-stage commit (TSC) region which has two separate commit stages. Frequently executed code may be identified and combined for the TSC region. Binary optimization operations may be performed on the TSC region to enable the code to run more efficiently by, for example, reordering load and store instructions. In the first stage, load operations in the region may be committed atomically and in the second stage, store operations in the region may be committed atomically. | 06-16-2011 |
20110153999 | METHODS AND APPARATUS TO MANAGE PARTIAL-COMMIT CHECKPOINTS WITH FIXUP SUPPORT - Example methods and apparatus to manage partial commit-checkpoints are disclosed. A disclosed example method includes identifying a commit instruction associated with a region of instructions executed by a processor, identifying candidate instructions from the region of instructions, and generating a processor partial commit-checkpoint to save a current state of the processor, the checkpoint based on calculated register values associated with live instructions, and including instruction reference addresses to link the candidate instructions. | 06-23-2011 |
20110154002 | Methods And Apparatuses For Efficient Load Processing Using Buffers - Various embodiments of the invention concern methods and apparatuses for power and time efficient load handling. A compiler may identify producer loads, consumer reuse loads, consumer forwarded loads, and producer/consumer hybrid loads. Based on this identification, performance of the load may be efficiently directed to a load value buffer, store buffer, data cache, or elsewhere. Consequently, accesses to cache are reduced, through direct loading from load value buffers and store buffers, thereby efficiently processing the loads. | 06-23-2011 |
20110167416 | SYSTEMS, APPARATUSES, AND METHODS FOR A HARDWARE AND SOFTWARE SYSTEM TO AUTOMATICALLY DECOMPOSE A PROGRAM TO MULTIPLE PARALLEL THREADS - Systems, apparatuses, and methods for a hardware and software system to automatically decompose a program into multiple parallel threads are described. In some embodiments, the systems and apparatuses execute a method of original code decomposition and/or generated thread execution. | 07-07-2011 |
20110320766 | APPARATUS, METHOD, AND SYSTEM FOR IMPROVING POWER, PERFORMANCE EFFICIENCY BY COUPLING A FIRST CORE TYPE WITH A SECOND CORE TYPE - An apparatus and method is described herein for coupling a processor core of a first type with a co-designed core of a second type. Execution of program code on the first core is monitored and hot sections of the program code are identified. Those hot sections are optimized for execution on the co-designed core, such that upon subsequently encountering those hot sections, the optimized hot sections are executed on the co-designed core. When the co-designed core is executing optimized hot code, the first processor core may be in a low-power state to save power or executing other code in parallel. Furthermore, multiple threads of cold code may be pipelined on the first core, while multiple threads of hot code are pipelined on the co-designed core to achieve maximum performance. | 12-29-2011 |
20110320781 | DYNAMIC DATA SYNCHRONIZATION IN THREAD-LEVEL SPECULATION - In one embodiment, the present invention introduces a speculation engine to parallelize serial instructions by creating separate threads from the serial instructions and inserting processor instructions to set a synchronization bit before a dependence source and to clear the synchronization bit after a dependence source, where the synchronization bit is designed to stall a dependence sink from a thread running on a separate core. Other embodiments are described and claimed. | 12-29-2011 |
20120016853 | EFFICIENT AND CONSISTENT SOFTWARE TRANSACTIONAL MEMORY - A method and apparatus for efficient and consistent validation/conflict detection in a Software Transactional Memory (STM) system is herein described. A version check barrier is inserted after a load to compare versions of loaded values before and after the load. In addition, a global timestamp (GTS) is utilized to track a latest committed transaction. Each transaction is associated with a local timestamp (LTS) initialized to the GTS value at the start of a transaction. As a transaction commits it updates the GTS to a new value and sets versions of modified locations to the new value. Pending transactions compare versions determined in read barriers to their LTS. If the version is greater than their LTS indicating another transaction has committed after the pending transaction started and initialized the LTS, then the pending transaction validates its read set to maintain efficient and consistent transactional execution. | 01-19-2012 |
20120079245 | DYNAMIC OPTIMIZATION FOR CONDITIONAL COMMIT - An apparatus and method is described herein for conditionally committing and/or speculatively checkpointing transactions, which potentially results in dynamic resizing of transactions. During dynamic optimization of binary code, transactions are inserted to provide memory ordering safeguards, which enables a dynamic optimizer to optimize code more aggressively. The conditional commit enables efficient execution of the dynamically optimized code while attempting to prevent transactions from running out of hardware resources, and the speculative checkpoints enable quick and efficient recovery upon abort of a transaction. Processor hardware is adapted to support dynamic resizing of the transactions, such as by including decoders that recognize a conditional commit instruction, a speculative checkpoint instruction, or both, and is further adapted to perform operations to support conditional commit or speculative checkpointing in response to decoding such instructions. | 03-29-2012 |
20120079246 | APPARATUS, METHOD, AND SYSTEM FOR PROVIDING A DECISION MECHANISM FOR CONDITIONAL COMMITS IN AN ATOMIC REGION - An apparatus and method is described herein for conditionally committing and/or speculatively checkpointing transactions, which potentially results in dynamic resizing of transactions. During dynamic optimization of binary code, transactions are inserted to provide memory ordering safeguards, which enables a dynamic optimizer to optimize code more aggressively. The conditional commit enables efficient execution of the dynamically optimized code while attempting to prevent transactions from running out of hardware resources, and the speculative checkpoints enable quick and efficient recovery upon abort of a transaction. Processor hardware is adapted to support dynamic resizing of the transactions, such as by including decoders that recognize a conditional commit instruction, a speculative checkpoint instruction, or both, and is further adapted to perform operations to support conditional commit or speculative checkpointing in response to decoding such instructions. | 03-29-2012 |
20120185714 | METHOD, APPARATUS, AND SYSTEM FOR ENERGY EFFICIENCY AND ENERGY CONSERVATION INCLUDING CODE RECIRCULATION TECHNIQUES - An apparatus, method and system is described herein for enabling intelligent recirculation of hot code sections. A hot code section is determined and marked with a begin and end instruction. When the begin instruction is decoded, recirculation logic in a back-end of a processor enters a detection mode and loads decoded loop instructions. When the end instruction is decoded, the recirculation logic enters a recirculation mode. And during the recirculation mode, the loop instructions are dispatched directly from the recirculation logic to execution stages for execution. Since the loop is being directly serviced out of the back-end, the front-end may be powered down into a standby state to save power and increase energy efficiency. Upon finishing the loop, the front-end is powered back on and continues normal operation, which potentially includes propagating next instructions after the loop that were prefetched before the front-end entered the standby mode. | 07-19-2012 |
20120198426 | METHODS AND APPARATUS TO FORM A RESILIENT OBJECTIVE INSTRUCTION CONSTRUCT - Methods and an apparatus to form a resilient objective instruction construct are provided. An example method obtains a source instruction construct and forms a resilient objective instruction construct by compiling one or more resilient transactions. | 08-02-2012 |
20120233477 | DYNAMIC CORE SELECTION FOR HETEROGENEOUS MULTI-CORE SYSTEMS - Dynamically switching cores on a heterogeneous multi-core processing system may be performed by executing program code on a first processing core. Power up of a second processing core may be signaled. A first performance metric of the first processing core executing the program code may be collected. When the first performance metric is better than a previously determined core performance metric, power down of the second processing core may be signaled and execution of the program code may be continued on the first processing core. When the first performance metric is not better than the previously determined core performance metric, execution of the program code may be switched from the first processing core to the second processing core. | 09-13-2012 |
20120260072 | REGISTER ALLOCATION IN ROTATION BASED ALIAS PROTECTION REGISTER - A system may comprise an optimizer/scheduler to schedule a set of instructions, compute a data dependence, a checking constraint and/or an anti-checking constraint for the set of scheduled instructions, and allocate alias registers for the set of scheduled instructions based on the data dependence, the checking constraint and/or the anti-checking constraint. In one embodiment, the optimizer is to release unused registers to reduce the alias registers used to protect the scheduled instructions. The optimizer is further to insert a dummy instruction after a fused instruction to break cycles in the checking and anti-checking constraints. | 10-11-2012 |
20130246712 | Methods And Apparatuses For Efficient Load Processing Using Buffers - Various embodiments of the invention concern methods and apparatuses for power and time efficient load handling. A compiler may identify producer loads, consumer reuse loads, consumer forwarded loads, and producer/consumer hybrid loads. Based on this identification, performance of the load may be efficiently directed to a load value buffer, store buffer, data cache, or elsewhere. Consequently, accesses to cache are reduced, through direct loading from load value buffers and store buffers, thereby efficiently processing the loads. | 09-19-2013 |
20130275700 | BI-DIRECTIONAL COPYING OF REGISTER CONTENT INTO SHADOW REGISTERS - Embodiments of the present disclosure describe a processor, which may include copy circuitry coupled to a shadow register file and a control register. The copy circuitry may be configured to copy content from a range of a number of registers to a shadow range of the shadow register file in a forward or backward direction. The forward or backward direction may be based at least in part on a value stored in the control register. | 10-17-2013 |
20130283014 | EXPEDITING EXECUTION TIME MEMORY ALIASING CHECKING - Embodiments of apparatus, computer-implemented methods, systems, and computer-readable media are described herein for expediting execution time memory alias checking. A sequence of instructions targeted for execution on an execution processor may be received or retrieved. The execution processor may include a plurality of alias registers and circuitry configured to check entries in the alias registers for memory aliasing. One or more optimizations may be performed on the received or retrieved sequence of instructions to optimize execution performance of the received or retrieved sequence of instructions. This may include a reorder of a plurality of memory instructions in the received or retrieved sequence of instructions. After the optimization, one or more move instructions may be inserted in the optimized sequence of instructions to move one or more entries among the alias registers during execution, to expedite alias checking at execution time. Other embodiments may be described and/or claimed. | 10-24-2013 |
20130318507 | APPARATUS, METHOD, AND SYSTEM FOR PROVIDING A DECISION MECHANISM FOR CONDITIONAL COMMITS IN AN ATOMIC REGION - An apparatus and method is described herein for conditionally committing and/or speculatively checkpointing transactions, which potentially results in dynamic resizing of transactions. During dynamic optimization of binary code, transactions are inserted to provide memory ordering safeguards, which enables a dynamic optimizer to optimize code more aggressively. The conditional commit enables efficient execution of the dynamically optimized code while attempting to prevent transactions from running out of hardware resources, and the speculative checkpoints enable quick and efficient recovery upon abort of a transaction. Processor hardware is adapted to support dynamic resizing of the transactions, such as by including decoders that recognize a conditional commit instruction, a speculative checkpoint instruction, or both, and is further adapted to perform operations to support conditional commit or speculative checkpointing in response to decoding such instructions. | 11-28-2013 |
20130346781 | Power Gating Functional Units Of A Processor - In one embodiment, the present invention includes an apparatus having a core including functional units each to execute instructions of a target instruction set architecture (ISA) and a power controller to control a power mode of a first functional unit responsive to a power identification field of a power instruction of a power region of a code block to be executed on the core. Other embodiments are described and claimed. | 12-26-2013 |
20140007054 | METHODS AND SYSTEMS TO IDENTIFY AND REPRODUCE CONCURRENCY VIOLATIONS IN MULTI-THREADED PROGRAMS USING EXPRESSIONS | 01-02-2014 |
20140032885 | METHODS AND APPARATUS TO MANAGE PARTIAL-COMMIT CHECKPOINTS WITH FIXUP SUPPORT - Example methods and apparatus to manage partial commit-checkpoints are disclosed. A disclosed example method includes identifying a commit instruction associated with a region of instructions executed by a processor, identifying candidate instructions from the region of instructions, and generating a processor partial commit-checkpoint to save a current state of the processor, the checkpoint based on calculated register values associated with live instructions, and including instruction reference addresses to link the candidate instructions. | 01-30-2014 |
20140096132 | FLEXIBLE ACCELERATION OF CODE EXECUTION - Technologies for performing flexible code acceleration on a computing device include initializing an accelerator virtual device on the computing device. The computing device allocates memory-mapped input and output (I/O) for the accelerator virtual device and also allocates an accelerator virtual device context for a code to be accelerated. The computing device accesses a bytecode of the code to be accelerated and determines whether the bytecode is an operating system-dependent bytecode. If not, the computing device performs hardware acceleration of the bytecode via the memory-mapped I/O using an internal binary translation module. However, if the bytecode is operating system-dependent, the computing device performs software acceleration of the bytecode. | 04-03-2014 |
20140122845 | OVERLAPPING ATOMIC REGIONS IN A PROCESSOR - In one embodiment, the present invention includes a processor having a core to execute instructions. This core can include various structures and logic that enable instructions of different atomic regions to be executed in an overlapping manner. To this end, the core can include a register file having registers to store data for use in execution of the instructions, and multiple shadow register files each to store a register checkpoint on initiation of a given atomic region. In this way, overlapping execution of atomic regions identified by a programmer or compiler can occur. Other embodiments are described and claimed. | 05-01-2014 |
20140208085 | INSTRUCTION AND LOGIC TO EFFICIENTLY MONITOR LOOP TRIP COUNT - Logic and instruction to efficiently monitor loop trip count. Loop trip count information of a loop may be stored in a dedicated hardware buffer. Average loop trip count of the loop may be calculated based on the stored loop trip count information. Based on the average trip count, loop optimizations may be applied or removed from the loop. The stored loop trip count information may include an identifier identifying the loop, a total loop trip count of the loop, and an exit count of the loop. | 07-24-2014 |
20140223166 | DYNAMIC CORE SELECTION FOR HETEROGENEOUS MULTI-CORE SYSTEMS - Dynamically switching cores on a heterogeneous multi-core processing system may be performed by executing program code on a first processing core. Power up of a second processing core may be signaled. A first performance metric of the first processing core executing the program code may be collected. When the first performance metric is better than a previously determined core performance metric, power down of the second processing core may be signaled and execution of the program code may be continued on the first processing core. When the first performance metric is not better than the previously determined core performance metric, execution of the program code may be switched from the first processing core to the second processing core. | 08-07-2014 |
20140281246 | INSTRUCTION BOUNDARY PREDICTION FOR VARIABLE LENGTH INSTRUCTION SET - A system, processor, and method to predict with high accuracy and retain instruction boundaries for previously executed instructions in order to decode variable length instructions is disclosed. In at least one embodiment, a disclosed processor includes an instruction fetch unit, an instruction cache, a boundary byte predictor, and an instruction decoder. In some embodiments, the instruction fetch unit provides an instruction address and the instruction cache produces an instruction tag and instruction cache content corresponding to the instruction address. The instruction decoder, in some embodiments, includes boundary byte logic to determine an instruction boundary in the instruction cache content. | 09-18-2014 |
20140281382 | MODIFIED EXECUTION USING CONTEXT SENSITIVE AUXILIARY CODE - A system and method to enhance execution of architected instructions in a processor uses auxiliary code to optimize execution of base microcode. An execution context of the architected instructions may be profiled to detect potential optimizations, resulting in generation and storage of auxiliary microcode. When the architected instructions are decoded to base microcode for execution, the base microcode may be enhanced or modified using retrieved auxiliary code. | 09-18-2014 |
20140282423 | METHODS AND APPARATUS TO MANAGE CONCURRENT PREDICATE EXPRESSIONS - Methods, apparatus, systems and articles of manufacture are disclosed to manage concurrent predicate expressions. An example method includes inserting a first condition hook into a first thread, the first condition hook associated with a first condition, inserting a second condition hook into a second thread, the second condition hook associated with a second condition, preventing the second thread from executing until the first condition is satisfied, and identifying a concurrency violation when the second condition is satisfied. | 09-18-2014 |
20140298306 | SOFTWARE PIPELINING AT RUNTIME - Apparatuses and methods may provide for determining a level of performance for processing one or more loops by a dynamic compiler and executing code optimizations to generate a pipelined schedule for the one or more loops that achieves the determined level of performance within a prescribed time period. In one example, a dependence graph may be established for the one or more loops, and each dependence graph may be partitioned into stages based on the level of performance. | 10-02-2014 |
20140359591 | DYNAMIC OPTIMIZATION OF PIPELINED SOFTWARE - In an embodiment, a system includes a processor including at least one core to execute operations of a loop that includes S stages. The system also includes stage insertion means for adding a delay stage to the loop to increase a lifetime of a corresponding register associated with a first variable of the loop and to delay storage of contents of the register. The system also includes a dynamic random access memory (DRAM). Other embodiments are described and claimed. | 12-04-2014 |
20150039861 | ALLOCATION OF ALIAS REGISTERS IN A PIPELINED SCHEDULE - In an embodiment, a system includes a processor including one or more cores and a plurality of alias registers to store memory range information associated with a plurality of operations of a loop. The memory range information references one or more memory locations within a memory. The system also includes register assignment means for assigning each of the alias registers to a corresponding operation of the loop, where the assignments are made according to a rotation schedule, and one of the alias registers is assigned to a first operation in a first iteration of the loop and to a second operation in a subsequent iteration of the loop. The system also includes the memory coupled to the processor. Other embodiments are described and claimed. | 02-05-2015 |
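Several of the entries above describe mechanisms concrete enough to model in a few lines of code. The sketches that follow are informal illustrations written against the abstracts only; every class, function, and parameter name in them is an assumption introduced for illustration, not part of the patented designs. First, the dual-thread dynamic information flow tracking of application 20090172644: a main thread publishes the operations it executes, and a tracking thread propagates taint values and checks a security policy.

```python
# Illustrative model of dual-thread dynamic information flow tracking
# (application 20090172644). The event format, taint map, and policy check
# are assumptions made for this sketch.
import queue
import threading

class TaintTracker:
    def __init__(self):
        self.taint = {}               # value name -> True if tainted
        self.violations = []          # recorded policy violations
        self.events = queue.Queue()   # main thread -> tracking thread

    # Main thread: publish each executed operation to the tracking thread.
    def emit(self, op, dst, srcs, sensitive_sink=False):
        self.events.put((op, dst, tuple(srcs), sensitive_sink))

    # Tracking thread: replay operations, propagate taint, enforce the policy.
    def run(self):
        while True:
            event = self.events.get()
            if event is None:         # shutdown sentinel
                break
            op, dst, srcs, sensitive_sink = event
            tainted = op == "input" or any(self.taint.get(s, False) for s in srcs)
            self.taint[dst] = tainted
            if sensitive_sink and tainted:
                self.violations.append((op, dst))

tracker = TaintTracker()
worker = threading.Thread(target=tracker.run)
worker.start()
tracker.emit("input", "x", [])                             # external input taints x
tracker.emit("add", "y", ["x", "k"])                       # taint flows from x to y
tracker.emit("store", "addr", ["y"], sensitive_sink=True)  # tainted use is flagged
tracker.events.put(None)
worker.join()
print(tracker.violations)                                  # [('store', 'addr')]
```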
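The timestamp-based validation summarized in application 20120016853 can likewise be sketched. This is a minimal, single-process model assuming a global timestamp (`_gts`), a per-transaction local timestamp (`lts`), and versioned memory locations; none of these names come from the patent, and the serial commit lock is a simplification.

```python
# Minimal sketch of GTS/LTS-based STM validation (application 20120016853).
# The classes and the locking discipline are simplifying assumptions.
import threading

_gts = 0                          # global timestamp of the latest commit
_commit_lock = threading.Lock()   # serializes commits in this toy model

class VersionedLocation:
    """A shared cell tagged with the version of the transaction that wrote it."""
    def __init__(self, value=0):
        self.value = value
        self.version = 0

class Transaction:
    def __init__(self):
        self.lts = _gts           # local timestamp: GTS snapshot at start
        self.read_set = []        # (location, version observed by the read)
        self.write_set = {}       # location -> buffered new value

    def read(self, loc):
        if loc in self.write_set:             # read-your-own-writes
            return self.write_set[loc]
        # Version check barrier: compare the version before and after the load.
        before = loc.version
        value = loc.value
        after = loc.version
        if before != after or after > self.lts:
            # A newer commit was observed: validate the read set so far.
            self.validate()
            self.lts = _gts                   # extend the snapshot on success
        self.read_set.append((loc, loc.version))
        return value

    def write(self, loc, value):
        self.write_set[loc] = value

    def validate(self):
        for loc, seen in self.read_set:
            if loc.version != seen:
                raise RuntimeError("read-set conflict: abort and retry")

    def commit(self):
        global _gts
        with _commit_lock:
            self.validate()
            _gts += 1                         # publish a new global timestamp
            for loc, value in self.write_set.items():
                loc.value = value
                loc.version = _gts            # stamp writes with the new version
```

The point the abstract emphasizes is the comparison in the read barrier: a pending transaction only pays for full read-set validation when it actually observes a version newer than its LTS.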
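The core-switching policy of applications 20120233477 and 20140223166 reduces to a small decision routine. The hooks passed in (`collect_metric`, `power_up`, `power_down`, `migrate`) are hypothetical placeholders for platform services, and the assumption that a larger metric is better is mine.

```python
# Sketch of the heterogeneous-core switching decision
# (applications 20120233477 / 20140223166). All hooks are hypothetical.
def select_core(current_core, other_core, reference_metric,
                collect_metric, power_up, power_down, migrate):
    """Return (core to continue on, updated reference metric)."""
    power_up(other_core)                      # signal power-up of the second core
    metric = collect_metric(current_core)     # e.g. retired instructions per joule

    if metric > reference_metric:
        # The current core still wins: cancel the switch and keep executing here.
        power_down(other_core)
        return current_core, metric
    # Otherwise move execution to the newly powered core.
    migrate(current_core, other_core)
    return other_core, metric
```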
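The trip-count buffer of application 20140208085 can be modeled in software as a small table keyed by loop identifier. The fields follow the abstract (loop id, total trip count, exit count); the threshold policy in `should_apply_optimization` is an assumption, since the abstract only says optimizations are applied or removed based on the average.

```python
# Software model of a loop trip count buffer (application 20140208085).
class LoopTripCountBuffer:
    def __init__(self, threshold=8.0):
        self.entries = {}             # loop_id -> [total_trip_count, exit_count]
        self.threshold = threshold    # illustrative policy knob

    def record_iteration(self, loop_id):
        self.entries.setdefault(loop_id, [0, 0])[0] += 1

    def record_exit(self, loop_id):
        self.entries.setdefault(loop_id, [0, 0])[1] += 1

    def average_trip_count(self, loop_id):
        total, exits = self.entries.get(loop_id, (0, 0))
        return total / exits if exits else 0.0

    def should_apply_optimization(self, loop_id):
        # Keep a loop optimization (e.g. unrolling) only while the observed
        # average trip count stays above the threshold; otherwise remove it.
        return self.average_trip_count(loop_id) >= self.threshold
```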
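The boundary-byte prediction of application 20140281246 amounts to remembering, per fetched cache line, which byte offsets began instructions the last time that line was decoded. The 64-byte line size and the direct tag-to-offsets table below are simplifications I am assuming, not the claimed hardware structure.

```python
# Software sketch of an instruction boundary-byte predictor for a variable
# length ISA (application 20140281246).
class BoundaryBytePredictor:
    LINE_BYTES = 64

    def __init__(self):
        self.table = {}   # line tag -> set of byte offsets that start instructions

    def predict(self, fetch_addr):
        """Predicted instruction-start offsets within the fetched line."""
        return self.table.get(fetch_addr // self.LINE_BYTES, set())

    def train(self, fetch_addr, decoded_start_offsets):
        """After decode, record where instructions actually began."""
        self.table[fetch_addr // self.LINE_BYTES] = set(decoded_start_offsets)

predictor = BoundaryBytePredictor()
predictor.train(0x1000, {0, 3, 5, 11})     # boundaries seen on the first decode
print(predictor.predict(0x1020))           # same cache line -> {0, 3, 5, 11}
```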
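Finally, the rotating alias-register assignment described in applications 20120260072 and 20150039861 can be illustrated with a simple modulo rotation. The rotate-by-one-register-per-iteration rule is my assumption of one possible rotation schedule; it reproduces the abstract's property that the same register serves different operations in successive iterations.

```python
# Illustrative rotation schedule for alias register assignment
# (applications 20120260072 / 20150039861). The rotate-by-one rule is assumed.
def assign_alias_registers(num_ops, num_regs, num_iterations):
    """Map (iteration, operation index) -> alias register index."""
    assignment = {}
    for it in range(num_iterations):
        for op in range(num_ops):
            # Rotating by one register per iteration lets the same physical
            # alias register protect different operations in later iterations.
            assignment[(it, op)] = (op + it) % num_regs
    return assignment

schedule = assign_alias_registers(num_ops=4, num_regs=4, num_iterations=2)
print(schedule[(0, 0)], schedule[(1, 3)])   # 0 0: same register, different operations
```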