Patent application number | Description | Published |
20120179881 | Performing An Allreduce Operation Using Shared Memory - Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit. | 07-12-2012 |
20120204041 | Profiling An Application For Power Consumption During Execution On A Compute Node - Methods, apparatus, and products are disclosed for profiling an application for power consumption during execution on a compute node that include: receiving an application for execution on a compute node; identifying a hardware power consumption profile for the compute node, the hardware power consumption profile specifying power consumption for compute node hardware during performance of various processing operations; determining a power consumption profile for the application in dependence upon the application and the hardware power consumption profile for the compute node; and reporting the power consumption profile for the application. | 08-09-2012 |
20120290863 | Budget-Based Power Consumption For Application Execution On A Plurality Of Compute Nodes - Methods, apparatus, and products are disclosed for budget-based power consumption for application execution on a plurality of compute nodes that include: assigning an execution priority to each of one or more applications; executing, on the plurality of compute nodes, the applications according to the execution priorities assigned to the applications at an initial power level provided to the compute nodes until a predetermined power consumption threshold is reached; and applying, upon reaching the predetermined power consumption threshold, one or more power conservation actions to reduce power consumption of the plurality of compute nodes during execution of the applications. | 11-15-2012 |
20120304193 | Scheduling Applications For Execution On A Plurality Of Compute Nodes Of A Parallel Computer To Manage Temperature Of The Nodes During Execution - Methods, apparatus, and products are disclosed for scheduling applications for execution on a plurality of compute nodes of a parallel computer to manage temperature of the plurality of compute nodes during execution that include: identifying one or more applications for execution on the plurality of compute nodes; creating a plurality of physically discontiguous node partitions in dependence upon temperature characteristics for the compute nodes and a physical topology for the compute nodes, each discontiguous node partition specifying a collection of physically adjacent compute nodes; and assigning, for each application, that application to one or more of the discontiguous node partitions for execution on the compute nodes specified by the assigned discontiguous node partitions. | 11-29-2012 |
20130060557 | DISTRIBUTED HARDWARE DEVICE SIMULATION - Distributed hardware device simulation, including: identifying a plurality of hardware components of the hardware device; providing software components simulating the functionality of each hardware component, wherein the software components are installed on compute nodes of a distributed processing system; receiving, in at least one of the software components, one or more messages representing an input to the hardware component; simulating the operation of the hardware component with the software component, thereby generating an output of the software component representing the output of the hardware component; and sending, from the software component to at least one other software component, one or more messages representing the output of the hardware component. | 03-07-2013 |
20130060844 | DIRECT INJECTION OF DATA TO BE TRANSFERRED IN A HYBRID COMPUTING ENVIRONMENT - Direct injection of data to be transferred in a hybrid computing environment that includes a host computer and a plurality of accelerators, the host computer and the accelerators adapted to one another for data communications by a system level message passing module. Each accelerator includes a Power Processing Element (‘PPE’) and a plurality of Synergistic Processing Elements (‘SPEs’). Direct injection includes reserving, by each SPE, a slot in a shared memory region accessible by the host computer; loading, by each SPE into local memory of the SPE, a portion of data to be transferred to the host computer; executing, by each SPE in parallel, a data processing operation on the portion of the data loaded in local memory of each SPE; and writing, by each SPE, the processed data to the SPE's reserved slot in the shared memory region accessible by the host computer. | 03-07-2013 |
20130067111 | ROUTING DATA COMMUNICATIONS PACKETS IN A PARALLEL COMPUTER - Routing data communications packets in a parallel computer that includes compute nodes organized for collective operations, each compute node including an operating system kernel and a system-level messaging module that is a module of automated computing machinery that exposes a messaging interface to applications, each compute node including a routing table that specifies, for each of a multiplicity of route identifiers, a data communications path through the compute node, including: receiving in a compute node a data communications packet that includes a route identifier value; retrieving from the routing table a specification of a data communications path through the compute node; and routing, by the compute node, the data communications packet according to the data communications path identified by the compute node's routing table entry for the data communications packet's route identifier value. | 03-14-2013 |
20130067206 | Endpoint-Based Parallel Data Processing In A Parallel Active Messaging Interface Of A Parallel Computer - Endpoint-based parallel data processing in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks. | 03-14-2013 |
20130073603 | SEND-SIDE MATCHING OF DATA COMMUNICATIONS MESSAGES - Send-side matching of data communications messages in a distributed computing system comprising a plurality of compute nodes, including: issuing by a receiving node to source nodes a receive message that specifies receipt of a single message to be sent from any source node, the receive message including message matching information, a specification of a hardware-level mutual exclusion device, and an identification of a receive buffer; matching by two or more of the source nodes the receive message with pending send messages in the two or more source nodes; operating by one of the source nodes having a matching send message the mutual exclusion device, excluding messages from other source nodes with matching send messages and identifying to the receiving node the source node operating the mutual exclusion device; and sending to the receiving node from the source node operating the mutual exclusion device a matched pending message. | 03-21-2013 |
20130073832 | PERFORMING A DETERMINISTIC REDUCTION OPERATION IN A PARALLEL COMPUTER - A parallel computer that includes compute nodes having computer processors and a CAU (Collectives Acceleration Unit) that couples processors to one another for data communications. In embodiments of the present invention, a deterministic reduction operation includes: organizing processors of the parallel computer and a CAU into a branched tree topology, where the CAU is a root of the branched tree topology and the processors are children of the root CAU; establishing a receive buffer that includes receive elements associated with processors and configured to store the associated processor's contribution data; receiving, in any order from the processors, each processor's contribution data; tracking receipt of each processor's contribution data; and reducing the contribution data in a predefined order, only after receipt of contribution data from all processors in the branched tree topology. | 03-21-2013 |
20130074097 | ENDPOINT-BASED PARALLEL DATA PROCESSING WITH NON-BLOCKING COLLECTIVE INSTRUCTIONS IN A PARALLEL ACTIVE MESSAGING INTERFACE OF A PARALLEL COMPUTER - Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing by the parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation. | 03-21-2013 |
20130074098 | PROCESSING DATA COMMUNICATIONS EVENTS IN A PARALLEL ACTIVE MESSAGING INTERFACE OF A PARALLEL COMPUTER - Processing data communications events in a parallel active messaging interface (‘PAMI’) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context. | 03-21-2013 |
20130080563 | EFFECTING HARDWARE ACCELERATION OF BROADCAST OPERATIONS IN A PARALLEL COMPUTER - Effecting hardware acceleration of broadcast operations in a parallel computer whose compute nodes are organized for collective operations via a network, each compute node having a receive buffer, including: establishing a topology for the network; selecting a schedule for a broadcast operation; depositing, by a root node of the topology, broadcast data in a target node's receive buffer, including performing a DMA operation with a well-known memory location for the target node's receive buffer; depositing, by the root node in a memory region designated for storing broadcast data length, a length of the broadcast data, including performing a DMA operation with a well-known memory location of the broadcast data length memory region; and triggering, by the root node, the target node to perform a next DMA operation, including depositing, in a memory region designated for receiving injection instructions for the target node, an instruction to inject the broadcast data into the receive buffer of a subsequent target node. | 03-28-2013 |
20130081059 | DATA COMMUNICATIONS IN A PARALLEL ACTIVE MESSAGING INTERFACE OF A PARALLEL COMPUTER - Data communications in a parallel active messaging interface (‘PAMI’) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint. | 03-28-2013 |
20130086358 | COLLECTIVE OPERATION PROTOCOL SELECTION IN A PARALLEL COMPUTER - Collective operation protocol selection in a parallel computer that includes compute nodes may be carried out by calling a collective operation with operating parameters; selecting a protocol for executing the operation and executing the operation with the selected protocol. Selecting a protocol includes: iteratively, until a prospective protocol meets predetermined performance criteria: providing, to a protocol performance function for the prospective protocol, the operating parameters; determining whether the prospective protocol meets predefined performance criteria by evaluating a predefined performance fit equation, calculating a measure of performance of the protocol for the operating parameters; determining that the prospective protocol meets predetermined performance criteria and selecting the protocol for executing the operation only if the calculated measure of performance is greater than a predefined minimum performance threshold. | 04-04-2013 |
20130091510 | DATA COMMUNICATIONS IN A PARALLEL ACTIVE MESSAGING INTERFACE OF A PARALLEL COMPUTER - Data communications in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (‘RTS’) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data. | 04-11-2013 |
20130097404 | DATA COMMUNICATIONS IN A PARALLEL ACTIVE MESSAGING INTERFACE OF A PARALLEL COMPUTER - Eager send data communications in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address. | 04-18-2013 |
20130111482 | ESTABLISHING A GROUP OF ENDPOINTS IN A PARALLEL COMPUTER | 05-02-2013 |
20130111496 | PERFORMING A LOCAL BARRIER OPERATION | 05-02-2013 |
20130124666 | MANAGING INTERNODE DATA COMMUNICATIONS FOR AN UNINITIALIZED PROCESS IN A PARALLEL COMPUTER - A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory. | 05-16-2013 |
20130125135 | INTRANODE DATA COMMUNICATIONS IN A PARALLEL COMPUTER - Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process. | 05-16-2013 |
20130125140 | INTERNODE DATA COMMUNICATIONS IN A PARALLEL COMPUTER - Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory. | 05-16-2013 |
20130173675 | PERFORMING A GLOBAL BARRIER OPERATION IN A PARALLEL COMPUTER - Performing a global barrier operation in a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier. | 07-04-2013 |
20130176904 | Providing Full Point-To-Point Communications Among Compute Nodes Of An Operational Group In A Global Combining Network Of A Parallel Computer - Methods, apparatus, and products are disclosed for providing full point-to-point communications among compute nodes of an operational group in a global combining network of a parallel computer, each compute node connected to each adjacent compute node in the global combining network through a link, that include: receiving a network packet in a compute node, the network packet specifying a destination compute node; selecting, in dependence upon the destination compute node, at least one of the links for the compute node along which to forward the network packet toward the destination compute node; and forwarding the network packet along the selected link to the adjacent compute node connected to the compute node through the selected link. | 07-11-2013 |
20130179897 | Thread Selection During Context Switching On A Plurality Of Compute Nodes - Methods, apparatus, and products are disclosed for thread selection during context switching on a plurality of compute nodes that include: executing, by a compute node, an application using a plurality of threads of execution, including executing one or more of the threads of execution; selecting, by the compute node from a plurality of available threads of execution for the application, a next thread of execution in dependence upon power characteristics for each of the available threads; determining, by the compute node, whether criteria for a thread context switch are satisfied; and performing, by the compute node, the thread context switch if the criteria for a thread context switch are satisfied, including executing the next thread of execution. | 07-11-2013 |
20130290673 | PERFORMING A DETERMINISTIC REDUCTION OPERATION IN A PARALLEL COMPUTER - Performing a deterministic reduction operation in a parallel computer that includes compute nodes, each of which includes computer processors and a CAU (Collectives Acceleration Unit) that couples computer processors to one another for data communications, including organizing processors and a CAU into a branched tree topology in which the CAU is a root and the processors are children; receiving, from each of the processors in any order, dummy contribution data, where each processor is restricted from sending any other data to the root CAU prior to receiving an acknowledgement of receipt from the root CAU; sending, by the root CAU to the processors in the branched tree topology, in a predefined order, acknowledgements of receipt of the dummy contribution data; receiving, by the root CAU from the processors in the predefined order, the processors' contribution data to the reduction operation; and reducing, by the root CAU, the processors' contribution data. | 10-31-2013 |
20130304995 | Scheduling Synchronization In Association With Collective Operations In A Parallel Computer - Methods, apparatuses, and computer program products for scheduling synchronization in association with collective operations in a parallel computer that includes a shared memory and a plurality of compute nodes that execute a parallel application utilizing the shared memory are provided. Embodiments include acquiring an available channel of the shared memory; posting to the acquired channel of the shared memory one or more collective operations and a synchronization point; determining that processing within the acquired channel has reached the synchronization point; and posting to the acquired channel, in response to determining that processing within the acquired channel has reached the synchronization point, a background synchronization operation corresponding to the one or more collective operations. | 11-14-2013 |
20130312010 | Processing Posted Receive Commands In A Parallel Computer - Processing posted receive commands in a parallel computer, including: posting, by a parallel process of a compute node, a receive command, the receive command including a set of parameters excluding the receive command from being directed among parallel posted receive queues; flattening the parallel unexpected message queues into a single unexpected message queue; determining whether the posted receive command is satisfied by an entry in the single unexpected message queue; if the posted receive command is satisfied by an entry in the single unexpected message queue, processing the posted receive command; if the posted receive command is not satisfied by an entry in the single unexpected message queue: flattening the parallel posted receive queues into a single posted receive queue; and storing the posted receive command in the single posted receive queue. | 11-21-2013 |
20130312011 | PROCESSING POSTED RECEIVE COMMANDS IN A PARALLEL COMPUTER - Processing posted receive commands in a parallel computer, including: posting, by a parallel process of a compute node, a receive command, the receive command including a set of parameters excluding the receive command from being directed among parallel posted receive queues; flattening the parallel unexpected message queues into a single unexpected message queue; determining whether the posted receive command is satisfied by an entry in the single unexpected message queue; if the posted receive command is satisfied by an entry in the single unexpected message queue, processing the posted receive command; if the posted receive command is not satisfied by an entry in the single unexpected message queue: flattening the parallel posted receive queues into a single posted receive queue; and storing the posted receive command in the single posted receive queue. | 11-21-2013 |
20130346997 | MECHANISM OF SUPPORTING SUB-COMMUNICATOR COLLECTIVES WITH O(64) COUNTERS AS OPPOSED TO ONE COUNTER FOR EACH SUB-COMMUNICATOR - A system and method for enhancing barrier collective synchronization on a computer system comprises a computer system including a data storage device. The computer system includes a program stored in the data storage device and steps of the program being executed by a processor. The system includes providing a plurality of communicators for storing state information for a barrier algorithm. Each communicator designates a master core in a multi-processor environment of the computer system. The system allocates or designates one counter for each of a plurality of threads. The system configures a table with a number of entries equal to the maximum number of threads. The system sets a table entry with an ID associated with a communicator when a process thread initiates a collective. The system determines an allocated or designated counter by searching entries in the table. | 12-26-2013 |
20140047450 | Utilizing A Kernel Administration Hardware Thread Of A Multi-Threaded, Multi-Core Compute Node Of A Parallel Computer - Methods, apparatuses, and computer program products for utilizing a kernel administration hardware thread of a multi-threaded, multi-core compute node of a parallel computer are provided. Embodiments include a kernel assigning a memory space of a hardware thread of an application processing core to a kernel administration hardware thread of a kernel processing core. A kernel administration hardware thread is configured to advance the hardware thread to a next memory space associated with the hardware thread in response to the assignment of the kernel administration hardware thread to the memory space of the hardware thread. Embodiments also include the kernel administration hardware thread executing an instruction within the assigned memory space. | 02-13-2014 |
20140047451 | Optimizing Collective Communications Within A Parallel Computer - Methods, apparatuses, and computer program products for optimizing collective communications within a parallel computer comprising a plurality of hardware threads for executing software threads of a parallel application are provided. Embodiments include a processor of a parallel computer determining for each software thread, an affinity of the software thread to a particular hardware thread. Each affinity indicates an assignment of a software thread to a particular hardware thread. The processor also generates one or more affinity domains based on the affinities of the software threads. Embodiments also include a processor generating, for each affinity domain, a topology of the affinity domain based on the affinities of the software threads to the hardware threads. According to embodiments of the present application, a processor also performs, based on the generated topologies of the affinity domains, a collective operation on one or more software threads. | 02-13-2014 |
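
The deterministic reduction summarized in entry 20130073832 above (receive contributions in any order, track receipt, reduce in a predefined order only once every contribution has arrived) can be illustrated with a minimal single-process sketch. This is not the patented implementation or any CAU interface: the class `DeterministicReducer`, the helper `run`, and the sample values are hypothetical names and data chosen for illustration, and rank order is assumed as the predefined reduction order.

```python
from typing import Callable, List, Optional


class DeterministicReducer:
    """Hypothetical stand-in for the root of the tree: one receive
    element per processor, reduced in a fixed (rank) order."""

    def __init__(self, num_processors: int) -> None:
        # One receive element per processor; None marks "not yet received".
        self.receive_buffer: List[Optional[float]] = [None] * num_processors
        self.received_count = 0

    def receive(self, rank: int, value: float) -> None:
        """Store a contribution; arrival order does not matter."""
        if self.receive_buffer[rank] is None:
            self.received_count += 1  # track receipt per processor
        self.receive_buffer[rank] = value

    def ready(self) -> bool:
        """True once every processor has contributed."""
        return self.received_count == len(self.receive_buffer)

    def reduce(self, op: Callable[[float, float], float]) -> float:
        """Reduce in rank order, only after all contributions arrived."""
        if not self.ready():
            raise RuntimeError("contributions still missing")
        result = self.receive_buffer[0]
        for value in self.receive_buffer[1:]:
            result = op(result, value)
        return result


def run(arrival_order: List[int]) -> float:
    """Deliver the same (assumed) contributions in a given arrival order."""
    contributions = {0: 1e16, 1: 1.0, 2: -1e16, 3: 1.0}
    reducer = DeterministicReducer(len(contributions))
    for rank in arrival_order:
        reducer.receive(rank, contributions[rank])
    return reducer.reduce(lambda a, b: a + b)


if __name__ == "__main__":
    # Different arrival orders yield the same result.
    print(run([0, 1, 2, 3]), run([3, 2, 1, 0]), run([2, 0, 3, 1]))
```

Because floating-point addition is not associative, summing contributions in arrival order could give different results from run to run; buffering every contribution and reducing in a fixed order removes that dependence, at the cost of waiting for the slowest contributor.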