Patent application number | Description | Published |
20080222303 | LATENCY HIDING MESSAGE PASSING PROTOCOL - A method, system, and article of manufacture that provide latency hiding, high bandwidth message passing protocols used for data communication between nodes of a parallel computer system are disclosed. A source node transmits a request to send message to a receiving node. Prior to receiving a clear to send message, the sending node continues to send deterministically routed (or fully described) data packets to the receiving node, thereby hiding the latency inherent in the request to send—clear to send message exchange. Once the sending node receives the clear to send message, any remaining portion of the message may be sent using partially described packets which may be routed dynamically, thereby maximizing bandwidth. | 09-11-2008 |
20080259916 | OPPORTUNISTIC QUEUEING INJECTION STRATEGY FOR NETWORK LOAD BALANCING - Embodiments of the invention include a method, system, and article of manufacture that provide an opportunistic queuing injection strategy used for data communication between nodes of a parallel computer system. A message may be encapsulated into a set of data packets. When the packets are sent, an opportunistic injection queue may be configured to transmit them to multiple hardware injection ports. This approach allows for complete network link saturation. In a parallel system with network links in multiple dimensions, sending message packets using more than one dimension may substantially increase network throughput. | 10-23-2008 |
20080263320 | Executing a Scatter Operation on a Parallel Computer - Executing a scatter operation on a parallel computer includes: configuring a send buffer on a logical root, the send buffer having positions, each position corresponding to a ranked node in an operational group of compute nodes and for storing contents scattered to that ranked node; and repeatedly for each position in the send buffer: broadcasting, by the logical root to each of the other compute nodes on a global combining network, the contents of the current position of the send buffer using a bitwise OR operation, determining, by each compute node, whether the current position in the send buffer corresponds with the rank of that compute node, if the current position corresponds with the rank, receiving the contents and storing the contents in a reception buffer of that compute node, and if the current position does not correspond with the rank, discarding the contents. | 10-23-2008 |
20080270852 | MULTI-DIRECTIONAL FAULT DETECTION SYSTEM - An apparatus, program product and method checks for nodal faults in a group of nodes comprising a center node and all adjacent nodes. The center node concurrently communicates with the immediately adjacent nodes in three dimensions. The communications are analyzed to determine a presence of a faulty node or connection. | 10-30-2008 |
20080281997 | Low Latency, High Bandwidth Data Communications Between Compute Nodes in a Parallel Computer - Methods, parallel computers, and computer program products are disclosed for low latency, high bandwidth data communications between compute nodes in a parallel computer. Embodiments include receiving, by an origin direct memory access (‘DMA’) engine of an origin compute node, data for transfer to a target compute node; sending, by the origin DMA engine of the origin compute node to a target DMA engine on the target compute node, a request to send (‘RTS’) message; transferring, by the origin DMA engine, a predetermined portion of the data to the target compute node using a memory FIFO operation; determining, by the origin DMA engine, whether an acknowledgement of the RTS message has been received from the target DMA engine; if the acknowledgement of the RTS message has not been received, transferring, by the origin DMA engine, another predetermined portion of the data to the target compute node using a memory FIFO operation; and if the acknowledgement of the RTS message has been received by the origin DMA engine, transferring, by the origin DMA engine, any remaining portion of the data to the target compute node using a direct put operation. | 11-13-2008 |
20080288820 | MULTI-DIRECTIONAL FAULT DETECTION SYSTEM - An apparatus, program product and method checks for nodal faults in a group of nodes comprising a center node and all adjacent nodes. The center node concurrently communicates with the immediately adjacent nodes in three dimensions. The communications are analyzed to determine a presence of a faulty node or connection. | 11-20-2008 |
20080301683 | Performing an Allreduce Operation Using Shared Memory - Methods, apparatus, and products are disclosed for performing an allreduce operation using shared memory that include: receiving, by at least one of a plurality of processing cores on a compute node, an instruction to perform an allreduce operation; establishing, by the core that received the instruction, a job status object for specifying a plurality of shared memory allreduce work units, the plurality of shared memory allreduce work units together performing the allreduce operation on the compute node; determining, by an available core on the compute node, a next shared memory allreduce work unit in the job status object; and performing, by that available core on the compute node, that next shared memory allreduce work unit. | 12-04-2008 |
20080301704 | Controlling Data Transfers from an Origin Compute Node to a Target Compute Node - Methods, apparatus, and products are disclosed for controlling data transfers from an origin compute node to a target compute node that include: receiving, by an application messaging module on the target compute node, an indication of a data transfer from an origin compute node to the target compute node; and administering, by the application messaging module on the target compute node, the data transfer using one or more messaging primitives of a system messaging module in dependence upon the indication. | 12-04-2008 |
20080307194 | Parallel, Low-Latency Method for High-Performance Deterministic Element Extraction From Distributed Arrays - The present invention provides a system and method for extracting elements from distributed arrays on a parallel processing system. The system includes a module that populates a local array with elements from input, a module that submits a largest element value in the local array and a processor ID for a local processor, and a module that determines a globally largest element value from the largest element values submitted by each one of the plurality of processors. The system further includes a module that broadcasts a winning globally largest element value and winning processor ID to the plurality of processors, and a module that increments an element pointer to the next value in the local array if the winning processor ID equals the processor ID for the local processor. | 12-11-2008 |
20080307195 | Parallel, Low-Latency Method for High-Performance Speculative Element Extraction From Distributed Arrays - The present invention provides a system and method for extracting elements from distributed arrays on a parallel processing system. The system includes a module that populates a result array with globally largest elements from the input, a module that generates a partition element, a module that counts the number of local elements greater than the partition and a module that determines the globally largest elements. The method for extracting elements from distributed arrays on a parallel processing system includes populating a result array with globally largest elements from the input, generating a partition element, counting the number of local elements greater than the partition and determining the globally largest elements. | 12-11-2008 |
20080313341 | Data Communications - Data communications, including issuing, by an application program to a high level data communications library, a request for initialization of a data communications service; issuing to a low level data communications library a request for registration of data communications functions; registering the data communications functions, including instantiating a factory object for each of the one or more data communications functions; issuing by the application program an instruction to execute a designated data communications function; issuing, to the low level data communications library, an instruction to execute the designated data communications function, including passing to the low level data communications library a call parameter that identifies a factory object; creating with the identified factory object the data communications object that implements the data communications function according to the protocol; and executing by the low level data communications library the designated data communications function. | 12-18-2008 |
20080313376 | Heuristic Status Polling - Methods, compute nodes, and computer program products are provided for heuristic status polling of a component in a computing system. Embodiments include receiving, by a polling module from a requesting application, a status request requesting status of a component; determining, by the polling module, whether an activity history for the component satisfies heuristic polling criteria; polling, by the polling module, the component for status if the activity history for the component satisfies the heuristic polling criteria; and not polling, by the polling module, the component for status if the activity history for the component does not satisfy the heuristic criteria. | 12-18-2008 |
20080313506 | BISECTIONAL FAULT DETECTION SYSTEM - An apparatus and program product logically divide a group of nodes and cause node pairs comprising a node from each section to communicate. Results from the communications may be analyzed to determine performance characteristics, such as bandwidth and proper connectivity. | 12-18-2008 |
20080320329 | ROW FAULT DETECTION SYSTEM - An apparatus and program product check for nodal faults in a row of nodes by causing each node in the row to concurrently communicate with its adjacent neighbor nodes in the row. The communications are analyzed to determine a presence of a faulty node or connection. | 12-25-2008 |
20080320330 | ROW FAULT DETECTION SYSTEM - An apparatus, program product and method check for nodal faults in a row of nodes by causing each node in the row to concurrently communicate with its adjacent neighbor nodes in the row. The communications are analyzed to determine a presence of a faulty node or connection. | 12-25-2008 |
20090037511 | Effecting a Broadcast with an Allreduce Operation on a Parallel Computer - Methods, parallel computers, and computer program products are disclosed for effecting a broadcast with an allreduce operation on a parallel computer, the parallel computer comprising a plurality of compute nodes, the compute nodes organized into at least one operational group of compute nodes for collective parallel operations of the parallel computer, each compute node in the operational group assigned a unique rank, the compute nodes of the operational group coupled for data communications through a global combining network; and one compute node assigned to be a logical root. Embodiments include configuring, by the logical root node, a send buffer having a contribution to be broadcast to each ranked node in the operational group; configuring, by all ranked nodes other than the logical root, a receive buffer for receiving the contribution from the logical root; and repeatedly for each element of the contribution of the logical root in the send buffer: contributing, by the logical root, the element of the contribution in the send buffer; injecting, by all ranked nodes other than the logical root, one or more zeros corresponding to a size of the element; performing, by all the compute nodes of the operational group, an allreduce operation with a bitwise OR using the element and the injected zeros, yielding a result for the allreduce operation; and storing in each receive buffer, by all ranked nodes other than the logical root, the result of the allreduce. | 02-05-2009 |
20090037598 | Providing Nearest Neighbor Point-to-Point Communications Among Compute Nodes of an Operational Group in a Global Combining Network of a Parallel Computer - Methods, apparatus, and products are disclosed for providing nearest neighbor point-to-point communications among compute nodes of an operational group in a global combining network of a parallel computer, each compute node connected to each adjacent compute node in the global combining network through a link, that include: identifying each link in the global combining network for each compute node of the operational group; designating one of a plurality of point-to-point class routing identifiers for each link such that no compute node in the operational group is connected to two adjacent compute nodes in the operational group with links designated for the same class routing identifiers; and configuring each compute node of the operational group for point-to-point communications with each adjacent compute node in the global combining network through the link between that compute node and that adjacent compute node using that link's designated class routing identifier. | 02-05-2009 |
20090043912 | Providing Full Point-To-Point Communications Among Compute Nodes of an Operational Group in a Global Combining Network of a Parallel Computer - Methods, apparatus, and products are disclosed for providing full point-to-point communications among compute nodes of an operational group in a global combining network of a parallel computer, each compute node connected to each adjacent compute node in the global combining network through a link, that include: receiving a network packet in a compute node, the network packet specifying a destination compute node; selecting, in dependence upon the destination compute node, at least one of the links for the compute node along which to forward the network packet toward the destination compute node; and forwarding the network packet along the selected link to the adjacent compute node connected to the compute node through the selected link. | 02-12-2009 |
20090043988 | Configuring Compute Nodes of a Parallel Computer in an Operational Group into a Plurality of Independent Non-Overlapping Collective Networks - Methods, apparatus, and products are disclosed for configuring compute nodes of a parallel computer in an operational group into a plurality of independent non-overlapping collective networks, the compute nodes in the operational group connected together for data communications through a global combining network, that include: partitioning the compute nodes in the operational group into a plurality of non-overlapping subgroups; designating one compute node from each of the non-overlapping subgroups as a master node; and assigning, to the compute nodes in each of the non-overlapping subgroups, class routing instructions that organize the compute nodes in that non-overlapping subgroup as a collective network such that the master node is a physical root. | 02-12-2009 |
20090044052 | CELL BOUNDARY FAULT DETECTION SYSTEM - An apparatus and program product determine a nodal fault along the boundary, or face, of a computing cell. Nodes on adjacent cell boundaries communicate with each other, and the communications are analyzed to determine if a node or connection is faulty. | 02-12-2009 |
20090300384 | Reducing Power Consumption While Performing Collective Operations On A Plurality Of Compute Nodes - Methods, apparatus, and products are disclosed for reducing power consumption while performing collective operations on a plurality of compute nodes that include: receiving, by each compute node, instructions to perform a type of collective operation; selecting, by each compute node from a plurality of collective operations for the collective operation type, a particular collective operation in dependence upon power consumption characteristics for each of the plurality of collective operations; and executing, by each compute node, the selected collective operation. | 12-03-2009 |
20090300385 | Reducing Power Consumption While Synchronizing A Plurality Of Compute Nodes During Execution Of A Parallel Application - Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation. | 12-03-2009 |
20090300386 | Reducing power consumption during execution of an application on a plurality of compute nodes - Methods, apparatus, and products are disclosed for reducing power consumption during execution of an application on a plurality of compute nodes that include: powering up, during compute node initialization, only a portion of computer memory of the compute node, including configuring an operating system for the compute node in the powered up portion of computer memory; receiving, by the operating system, an instruction to load an application for execution; allocating, by the operating system, additional portions of computer memory to the application for use during execution; powering up the additional portions of computer memory allocated for use by the application during execution; and loading, by the operating system, the application into the powered up additional portions of computer memory. | 12-03-2009 |
20090300394 | Reducing Power Consumption During Execution Of An Application On A Plurality Of Compute Nodes - Methods, apparatus, and products are disclosed for reducing power consumption during execution of an application on a plurality of compute nodes that include: executing, by each compute node, an application, the application including power consumption directives corresponding to one or more portions of the application; identifying, by each compute node, the power consumption directives included within the application during execution of the portions of the application corresponding to those identified power consumption directives; and reducing power, by each compute node, to one or more components of that compute node according to the identified power consumption directives during execution of the portions of the application corresponding to those identified power consumption directives. | 12-03-2009 |
20090300399 | Profiling power consumption of a plurality of compute nodes while processing an application - Methods, apparatus, and products are disclosed for profiling power consumption of a plurality of compute nodes while processing an application that include: executing the application on the plurality of compute nodes; monitoring performance characteristics for components of the plurality of compute nodes during execution of the application; and recording, in a power profile for the application, power consumption during execution of the application in dependence upon the performance characteristics for components of the plurality of compute nodes. | 12-03-2009 |
20090307036 | Budget-Based Power Consumption For Application Execution On A Plurality Of Compute Nodes - Methods, apparatus, and products are disclosed for budget-based power consumption for application execution on a plurality of compute nodes that include: assigning an execution priority to each of one or more applications; executing, on the plurality of compute nodes, the applications according to the execution priorities assigned to the applications at an initial power level provided to the compute nodes until a predetermined power consumption threshold is reached; and applying, upon reaching the predetermined power consumption threshold, one or more power conservation actions to reduce power consumption of the plurality of compute nodes during execution of the applications. | 12-10-2009 |
20090307703 | Scheduling Applications For Execution On A Plurality Of Compute Nodes Of A Parallel Computer To Manage temperature of the nodes during execution - Methods, apparatus, and products are disclosed for scheduling applications for execution on a plurality of compute nodes of a parallel computer to manage temperature of the plurality of compute nodes during execution that include: identifying one or more applications for execution on the plurality of compute nodes; creating a plurality of physically discontiguous node partitions in dependence upon temperature characteristics for the compute nodes and a physical topology for the compute nodes, each discontiguous node partition specifying a collection of physically adjacent compute nodes; and assigning, for each application, that application to one or more of the discontiguous node partitions for execution on the compute nodes specified by the assigned discontiguous node partitions. | 12-10-2009 |
20090307708 | Thread Selection During Context Switching On A Plurality Of Compute Nodes - Methods, apparatus, and products are disclosed for thread selection during context switching on a plurality of compute nodes that include: executing, by a compute node, an application using a plurality of threads of execution, including executing one or more of the threads of execution; selecting, by the compute node from a plurality of available threads of execution for the application, a next thread of execution in dependence upon power characteristics for each of the available threads; determining, by the compute node, whether criteria for a thread context switch are satisfied; and performing, by the compute node, the thread context switch if the criteria for a thread context switch are satisfied, including executing the next thread of execution. | 12-10-2009 |
20100005189 | Pacing Network Traffic Among A Plurality Of Compute Nodes Connected Using A Data Communications Network - Methods, apparatus, and products are disclosed for pacing network traffic among a plurality of compute nodes connected using a data communications network. The network has a plurality of network regions, and the plurality of compute nodes are distributed among these network regions. Pacing network traffic among a plurality of compute nodes connected using a data communications network includes: identifying, by a compute node for each region of the network, a roundtrip time delay for communicating with at least one of the compute nodes in that region; determining, by the compute node for each region, a pacing algorithm for that region in dependence upon the roundtrip time delay for that region; and transmitting, by the compute node, network packets to at least one of the compute nodes in at least one of the network regions in dependence upon the pacing algorithm for that region. | 01-07-2010 |
20100005326 | Profiling An Application For Power Consumption During Execution On A Compute Node - Methods, apparatus, and products are disclosed for profiling an application for power consumption during execution on a compute node that include: receiving an application for execution on a compute node; identifying a hardware power consumption profile for the compute node, the hardware power consumption profile specifying power consumption for compute node hardware during performance of various processing operations; determining a power consumption profile for the application in dependence upon the application and the hardware power consumption profile for the compute node; and reporting the power consumption profile for the application. | 01-07-2010 |
20100037035 | Generating An Executable Version Of An Application Using A Distributed Compiler Operating On A Plurality Of Compute Nodes - Methods, apparatus, and products are disclosed for generating an executable version of an application using a distributed compiler operating on a plurality of compute nodes that include: receiving, by each compute node, a portion of source code for an application; compiling, in parallel by each compute node, the portion of the source code received by that compute node into a portion of object code for the application; performing, in parallel by each compute node, inter-procedural analysis on the portion of the object code of the application for that compute node, including sharing results of the inter-procedural analysis among the compute nodes; optimizing, in parallel by each compute node, the portion of the object code of the application for that compute node using the shared results of the inter-procedural analysis; and generating the executable version of the application in dependence upon the optimized portions of the object code of the application. | 02-11-2010 |
20100274997 | Executing a Gather Operation on a Parallel Computer - Methods, apparatus, and computer program products are disclosed for executing a gather operation on a parallel computer according to embodiments of the present invention. Embodiments include configuring, by the logical root, a result buffer on the logical root, the result buffer having positions, each position corresponding to a ranked node in the operational group and for storing contribution data gathered from that ranked node. Embodiments also include repeatedly for each position in the result buffer: determining, by each compute node of an operational group, whether the current position in the result buffer corresponds with the rank of the compute node; if the current position in the result buffer corresponds with the rank of the compute node, contributing, by that compute node, the compute node's contribution data; if the current position in the result buffer does not correspond with the rank of the compute node, contributing, by that compute node, a value of zero for the contribution data; and storing, by the logical root in the current position in the result buffer, results of a bitwise OR operation of all the contribution data by all compute nodes of the operational group for the current position, the results received through the global combining network. | 10-28-2010 |
20100318835 | BISECTIONAL FAULT DETECTION SYSTEM - An apparatus, program product and method logically divide a group of nodes and cause node pairs comprising a node from each section to communicate. Results from the communications may be analyzed to determine performance characteristics, such as bandwidth and proper connectivity. | 12-16-2010 |
20110219208 | MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER - A Multi-Petascale Highly Efficient Parallel Supercomputer of 100 petaOPS-scale computing, at decreased cost, power and footprint, and that allows for a maximum packaging density of processing nodes from an interconnect point of view. The Supercomputer exploits technological advances in VLSI that enable a computing model where many processors can be integrated into a single Application Specific Integrated Circuit (ASIC). Each ASIC computing node comprises a system-on-chip ASIC utilizing four or more processors integrated into one die, with each having full access to all system resources and enabling adaptive partitioning of the processors to functions such as compute or messaging I/O on an application by application basis, and preferably, enabling adaptive partitioning of functions in accordance with various algorithmic phases within an application; if I/O or other processors are underutilized, they can participate in computation or communication. Nodes are interconnected by a five dimensional torus network with DMA that optimally maximizes the throughput of packet communications between nodes and minimizes latency. | 09-08-2011 |
20110238949 | Distributed Administration Of A Lock For An Operational Group Of Compute Nodes In A Hierarchical Tree Structured Network - Distributed administration of a lock for an operational group of compute nodes in a hierarchical tree structured network including assigning the root node of the operational group to send acknowledgments for lock requests, the root lock administration module comprising a module of automated computing machinery; receiving a lock request assigned to a particular node from a child node; determining whether another request from another child is directly ahead in an acknowledgement queue; if a request from another child is directly ahead in the acknowledgement queue, putting the lock request for the particular node in the acknowledgement queue until the lock request directly ahead in the acknowledgement queue is satisfied and when the lock request ahead in the queue is satisfied, sending the particular node for whom the lock request is assigned a message acknowledging the particular node has the lock; and if a request from another child is not directly ahead in a queue, sending to the particular node for whom the lock request is assigned a message acknowledging that the particular node has the lock. | 09-29-2011 |
20110238950 | Performing A Scatterv Operation On A Hierarchical Tree Network Optimized For Collective Operations - Performing a scatterv operation on a hierarchical tree network optimized for collective operations including receiving, by the scatterv module installed on the node, from a nearest neighbor parent above the node a chunk of data having at least a portion of data for the node; maintaining, by the scatterv module installed on the node, the portion of the data for the node; determining, by the scatterv module installed on the node, whether any portions of the data are for a particular nearest neighbor child below the node or one or more other nodes below the particular nearest neighbor child; and sending, by the scatterv module installed on the node, those portions of data to the nearest neighbor child if any portions of the data are for a particular nearest neighbor child below the node or one or more other nodes below the particular nearest neighbor child. | 09-29-2011 |
20110239003 | Direct Injection of Data To Be Transferred In A Hybrid Computing Environment - Direct injection of data to be transferred in a hybrid computing environment that includes a host computer and a plurality of accelerators, the host computer and the accelerators adapted to one another for data communications by a system level message passing module. Each accelerator includes a Power Processing Element (‘PPE’) and a plurality of Synergistic Processing Elements (‘SPEs’). Direct injection includes reserving, by each SPE, a slot in a shared memory region accessible by the host computer; loading, by each SPE into local memory of the SPE, a portion of data to be transferred to the host computer; executing, by each SPE in parallel, a data processing operation on the portion of the data loaded in local memory of each SPE; and writing, by each SPE, the processed data to the SPE's reserved slot in the shared memory region accessible by the host computer. | 09-29-2011 |
20110246582 | Message Passing with Queues and Channels - In an embodiment, a send thread receives an identifier that identifies a destination node and a pointer to data. The send thread creates a first send request in response to the receipt of the identifier and the data pointer. The send thread selects a selected channel from among a plurality of channels. The selected channel comprises a selected hand-off queue and an identification of a selected message unit. Each of the channels identifies a different message unit. The selected hand-off queue is randomly accessible. If the selected hand-off queue contains an available entry, the send thread adds the first send request to the selected hand-off queue. If the selected hand-off queue does not contain an available entry, the send thread removes a second send request from the selected hand-off queue and sends the second send request to the selected message unit. | 10-06-2011 |
20110258281 | QUERY PERFORMANCE DATA ON PARALLEL COMPUTER SYSTEM HAVING COMPUTE NODES - Embodiments of the invention provide a method for querying performance counter data on a massively parallel computing system, while minimizing the costs associated with interrupting computer processors and limited memory resources. DMA descriptors may be inserted into an injection FIFO of a remote compute node in the massively parallel computing system. Upon executing the DMA operations described by the DMA descriptors, performance counter data may be transferred from the remote compute node to a destination node. | 10-20-2011 |
20110265098 | Message Passing with Queues and Channels - In an embodiment, a reception thread receives a source node identifier, a type, and a data pointer from an application and, in response, creates a receive request. If the source node identifier specifies a source node, the reception thread adds the receive request to a fast-post queue. If a message received from a network does not match a receive request on a posted queue, a polling thread adds a receive request that represents the message to an unexpected queue. If the fast-post queue contains the receive request, the polling thread removes the receive request from the fast-post queue. If the receive request that was removed from the fast-post queue does not match the receive request on the unexpected queue, the polling thread adds the receive request that was removed from the fast-post queue to the posted queue. The reception thread and the polling thread execute asynchronously from each other. | 10-27-2011 |
20110270942 | COMBINING MULTIPLE HARDWARE NETWORKS TO ACHIEVE LOW-LATENCY HIGH-BANDWIDTH POINT-TO-POINT COMMUNICATION - Systems, methods and articles of manufacture are disclosed for performing a collective operation on a parallel computing system that includes multiple compute nodes and multiple networks connecting the compute nodes. Each of the networks may have different characteristics. A source node may broadcast a DMA descriptor over a first network to a target node, to initialize the collective operation. The target node may perform the collective operation over a second network and using the broadcast DMA descriptor. | 11-03-2011 |
20110271006 | PIPELINING PROTOCOLS IN MISALIGNED BUFFER CASES - Systems, methods and articles of manufacture are disclosed for effecting a desired collective operation on a parallel computing system that includes multiple compute nodes. The compute nodes may pipeline multiple collective operations to effect the desired collective operation. To select protocols suitable for the multiple collective operations, the compute nodes may also perform additional collective operations. The compute nodes may pipeline the multiple collective operations and/or the additional collective operations to effect the desired collective operation more efficiently. | 11-03-2011 |
20110271263 | Compiling Software For A Hierarchical Distributed Processing System - Compiling software for a hierarchical distributed processing system including providing to one or more compiling nodes software to be compiled, wherein at least a portion of the software to be compiled is to be executed by one or more other nodes; compiling, by the compiling node, the software; maintaining, by the compiling node, any compiled software to be executed on the compiling node; selecting, by the compiling node, one or more nodes in a next tier of the hierarchy of the distributed processing system in dependence upon whether any compiled software is for the selected node or the selected node's descendants; sending to the selected node only the compiled software to be executed by the selected node or selected node's descendant. | 11-03-2011 |
20110288848 | PASSING NON-ARCHITECTED REGISTERS VIA A CALLBACK/ADVANCE MECHANISM IN A SIMULATOR ENVIRONMENT - Embodiments of the invention provide a method of calculating performance counter data for a computer simulator, while minimizing the performance costs associated with cycle-accurate simulation. A callback may be associated with the instructions of a user program and, when the instructions are executed, the associated callbacks may be executed as well. Upon execution, the callbacks may calculate performance counter data related to the associated instruction. | 11-24-2011 |
20110289177 | Effecting Hardware Acceleration Of Broadcast Operations In A Parallel Computer - Compute nodes of a parallel computer organized for collective operations via a network, each compute node having a receive buffer and establishing a topology for the network; selecting a schedule for a broadcast operation; depositing, by a root node of the topology, broadcast data in a target node's receive buffer, including performing a DMA operation with a well-known memory location for the target node's receive buffer; depositing, by the root node in a memory region designated for storing broadcast data length, a length of the broadcast data, including performing a DMA operation with a well-known memory location of the broadcast data length memory region; and triggering, by the root node, the target node to perform a next DMA operation, including depositing, in a memory region designated for receiving injection instructions for the target node, an instruction to inject the broadcast data into the receive buffer of a subsequent target node. | 11-24-2011 |
20110296137 | Performing A Deterministic Reduction Operation In A Parallel Computer - A parallel computer includes compute nodes having computer processors and a CAU (Collectives Acceleration Unit) that couples processors to one another for data communications. In embodiments of the present invention, a deterministic reduction operation includes: organizing processors of the parallel computer and a CAU into a branched tree topology, where the CAU is a root of the branched tree topology and the processors are children of the root CAU; establishing a receive buffer that includes receive elements associated with processors and configured to store the associated processor's contribution data; receiving, in any order from the processors, each processor's contribution data; tracking receipt of each processor's contribution data; and reducing the contribution data in a predefined order, only after receipt of contribution data from all processors in the branched tree topology. | 12-01-2011 |
20110296139 | Performing A Deterministic Reduction Operation In A Parallel Computer - Performing a deterministic reduction operation in a parallel computer that includes compute nodes, each of which includes computer processors and a CAU (Collectives Acceleration Unit) that couples computer processors to one another for data communications, including organizing processors and a CAU into a branched tree topology in which the CAU is a root and the processors are children; receiving, from each of the processors in any order, dummy contribution data, where each processor is restricted from sending any other data to the root CAU prior to receiving an acknowledgement of receipt from the root CAU; sending, by the root CAU to the processors in the branched tree topology, in a predefined order, acknowledgements of receipt of the dummy contribution data; receiving, by the root CAU from the processors in the predefined order, the processors' contribution data to the reduction operation; and reducing, by the root CAU, the processors' contribution data. | 12-01-2011 |
20120036384 | Reducing Power Consumption While Synchronizing A Plurality Of Compute Nodes During Execution Of A Parallel Application - Methods, apparatus, and products are disclosed for reducing power consumption while synchronizing a plurality of compute nodes during execution of a parallel application that include: beginning, by each compute node, performance of a blocking operation specified by the parallel application, each compute node beginning the blocking operation asynchronously with respect to the other compute nodes; reducing, for each compute node, power to one or more hardware components of that compute node in response to that compute node beginning the performance of the blocking operation; and restoring, for each compute node, the power to the hardware components having power reduced in response to all of the compute nodes beginning the performance of the blocking operation. | 02-09-2012 |
20120066284 | Send-Side Matching Of Data Communications Messages - Send-side matching of data communications messages in a distributed computing system comprising a plurality of compute nodes organized for collective operations, including: issuing by a receiving node to source nodes a receive message that specifies receipt of a single message to be sent from any source node, the receive message including message matching information, a specification of a hardware-level mutual exclusion device, and an identification of a receive buffer; matching by two or more of the source nodes the receive message with pending send messages in the two or more source nodes; operating by one of the source nodes having a matching send message the mutual exclusion device, excluding messages from other source nodes with matching send messages and identifying to the receiving node the source node operating the mutual exclusion device; and sending to the receiving node from the source node operating the mutual exclusion device a matched pending message. | 03-15-2012 |
20120066310 | COMBINING MULTIPLE HARDWARE NETWORKS TO ACHIEVE LOW-LATENCY HIGH-BANDWIDTH POINT-TO-POINT COMMUNICATION OF COMPLEX TYPES - Systems, methods and articles of manufacture are disclosed for performing a vector collective operation on a parallel computing system that includes multiple compute nodes and a network connecting the compute nodes that includes an ALU. A collective operation may be performed to determine displacements for the vector collective operation. Descriptors for the vector collective operation may be generated based on the displacements. The vector collective operation may then be performed using the descriptors. | 03-15-2012 |
20120079035 | Administering Truncated Receive Functions In A Parallel Messaging Interface - Administering truncated receive functions in a parallel messaging interface (‘PMI’) of a parallel computer comprising a plurality of compute nodes coupled for data communications through the PMI and through a data communications network, including: sending, through the PMI on a source compute node, a quantity of data from the source compute node to a destination compute node; specifying, by an application on the destination compute node, a portion of the quantity of data to be received by the application on the destination compute node and a portion of the quantity of data to be discarded; receiving, by the PMI on the destination compute node, all of the quantity of data; providing, by the PMI on the destination compute node to the application on the destination compute node, only the portion of the quantity of data to be received by the application; and discarding, by the PMI on the destination compute node, the portion of the quantity of data to be discarded. | 03-29-2012 |
20120079133 | Routing Data Communications Packets In A Parallel Computer - Routing data communications packets in a parallel computer that includes compute nodes organized for collective operations, each compute node including an operating system kernel and a system-level messaging module that is a module of automated computing machinery that exposes a messaging interface to applications, each compute node including a routing table that specifies, for each of a multiplicity of route identifiers, a data communications path through the compute node, including: receiving in a compute node a data communications packet that includes a route identifier value; retrieving from the routing table a specification of a data communications path through the compute node; and routing, by the compute node, the data communications packet according to the data communications path identified by the compute node's routing table entry for the data communications packet's route identifier value. | 03-29-2012 |
20120079165 | Paging Memory From Random Access Memory To Backing Storage In A Parallel Computer - Paging memory from random access memory (‘RAM’) to backing storage in a parallel computer that includes a plurality of compute nodes, including: executing a data processing application on a virtual machine operating system in a virtual machine on a first compute node; providing, by a second compute node, backing storage for the contents of RAM on the first compute node; and swapping, by the virtual machine operating system in the virtual machine on the first compute node, a page of memory from RAM on the first compute node to the backing storage on the second compute node. | 03-29-2012 |
20120117361 | Processing Data Communications Events In A Parallel Active Messaging Interface Of A Parallel Computer - Processing data communications events in a parallel active messaging interface (‘PAMI’) of a parallel computer that includes compute nodes that execute a parallel application, with the PAMI including data communications endpoints, and the endpoints are coupled for data communications through the PAMI and through other data communications resources, including determining by an advance function that there are no actionable data communications events pending for its context, placing by the advance function its thread of execution into a wait state, waiting for a subsequent data communications event for the context; responsive to occurrence of a subsequent data communications event for the context, awakening by the thread from the wait state; and processing by the advance function the subsequent data communications event now pending for the context. | 05-10-2012 |
20120137294 | Data Communications In A Parallel Active Messaging Interface Of A Parallel Computer - Data communications in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a SEND instruction, the SEND instruction specifying a transmission of transfer data from the origin endpoint to a first target endpoint; transmitting from the origin endpoint to the first target endpoint a Request-To-Send (‘RTS’) message advising the first target endpoint of the location and size of the transfer data; assigning by the first target endpoint to each of a plurality of target endpoints separate portions of the transfer data; and receiving by the plurality of target endpoints the transfer data. | 05-31-2012 |
20120151485 | Data Communications In A Parallel Active Messaging Interface Of A Parallel Computer - Data communications in a parallel active messaging interface (‘PAMI’) of a parallel computer, the parallel computer including a plurality of compute nodes that execute a parallel application, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources, including receiving in an origin endpoint of the PAMI a data communications instruction, the instruction characterized by an instruction type, the instruction specifying a transmission of transfer data from the origin endpoint to a target endpoint and transmitting, in accordance with the instruction type, the transfer data from the origin endpoint to the target endpoint. | 06-14-2012 |
20120185230 | Distributed Hardware Device Simulation - Distributed hardware device simulation, including: identifying a plurality of hardware components of the hardware device; providing software components simulating the functionality of each hardware component, wherein the software components are installed on compute nodes of a distributed processing system; receiving, in at least one of the software components, one or more messages representing an input to the hardware component; simulating the operation of the hardware component with the software component, thereby generating an output of the software component representing the output of the hardware component; and sending, from the software component to at least one other software component, one or more messages representing the output of the hardware component. | 07-19-2012 |
20120185679 | Endpoint-Based Parallel Data Processing With Non-Blocking Collective Instructions In A Parallel Active Messaging Interface Of A Parallel Computer - Endpoint-based parallel data processing with non-blocking collective instructions in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing by the parallel application a data communications geometry, the geometry specifying a set of endpoints that are used in collective operations of the PAMI, including associating with the geometry a list of collective algorithms valid for use with the endpoints of the geometry; registering in each endpoint in the geometry a dispatch callback function for a collective operation; and executing without blocking, through a single one of the endpoints in the geometry, an instruction for the collective operation. | 07-19-2012 |
20120185873 | Data Communications In A Parallel Active Messaging Interface Of A Parallel Computer - Data communications in a parallel active messaging interface (‘PAMI’) of a parallel computer composed of compute nodes that execute a parallel application, each compute node including application processors that execute the parallel application and at least one management processor dedicated to gathering information regarding data communications. The PAMI is composed of data communications endpoints, each endpoint composed of a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes and the endpoints coupled for data communications through the PAMI and through data communications resources. Embodiments function by gathering call site statistics describing data communications resulting from execution of data communications instructions and identifying in dependence upon the call site statistics a data communications algorithm for use in executing a data communications instruction at a call site in the parallel application. | 07-19-2012 |
20120210094 | Data Communications In A Parallel Active Messaging Interface Of A Parallel Computer - Eager send data communications in a parallel active messaging interface (PAMI) of a parallel computer, the PAMI composed of data communications endpoints that specify a client, a context, and a task, including receiving an eager send data communications instruction with transfer data disposed in a send buffer characterized by a read/write send buffer memory address in a read/write virtual address space of the origin endpoint; determining for the send buffer a read-only send buffer memory address in a read-only virtual address space, the read-only virtual address space shared by both the origin endpoint and the target endpoint, with all frames of physical memory mapped to pages of virtual memory in the read-only virtual address space; and communicating by the origin endpoint to the target endpoint an eager send message header that includes the read-only send buffer memory address. | 08-16-2012 |
20120254344 | Endpoint-Based Parallel Data Processing In A Parallel Active Messaging Interface Of A Parallel Computer - Endpoint-based parallel data processing in a parallel active messaging interface (‘PAMI’) of a parallel computer, the PAMI composed of data communications endpoints, each endpoint including a specification of data communications parameters for a thread of execution on a compute node, including specifications of a client, a context, and a task, the compute nodes coupled for data communications through the PAMI, including establishing a data communications geometry, the geometry specifying, for tasks representing processes of execution of the parallel application, a set of endpoints that are used in collective operations of the PAMI including a plurality of endpoints for one of the tasks; receiving in endpoints of the geometry an instruction for a collective operation; and executing the instruction for a collective operation through the endpoints in dependence upon the geometry, including dividing data communications operations among the plurality of endpoints for one of the tasks. | 10-04-2012 |
20120265835 | QUERY PERFORMANCE DATA ON PARALLEL COMPUTER SYSTEM HAVING COMPUTE NODES - Embodiments of the invention provide a method for querying performance counter data on a massively parallel computing system, while minimizing the costs associated with interrupting computer processors and limited memory resources. DMA descriptors may be inserted into an injection FIFO of a remote compute node in the massively parallel computing system. Upon executing the DMA operations described by the DMA descriptors, performance counter data may be transferred from the remote compute node to a destination node. | 10-18-2012 |
20130042088 | Collective Operation Protocol Selection In A Parallel Computer - Collective operation protocol selection in a parallel computer that includes compute nodes may be carried out by calling a collective operation with operating parameters; selecting a protocol for executing the operation and executing the operation with the selected protocol. Selecting a protocol includes: iteratively, until a prospective protocol meets predetermined performance criteria: providing, to a protocol performance function for the prospective protocol, the operating parameters; determining whether the prospective protocol meets predefined performance criteria by evaluating a predefined performance fit equation, calculating a measure of performance of the protocol for the operating parameters; determining that the prospective protocol meets predetermined performance criteria and selecting the protocol for executing the operation only if the calculated measure of performance is greater than a predefined minimum performance threshold. | 02-14-2013 |
20130042245 | Performing A Global Barrier Operation In A Parallel Computer - Performing a global barrier operation in a parallel computer that includes compute nodes coupled for data communications, where each compute node executes tasks, with one task on each compute node designated as a master task, including: for each task on each compute node until all master tasks have joined a global barrier: determining whether the task is a master task; if the task is not a master task, joining a single local barrier; if the task is a master task, joining the global barrier and the single local barrier only after all other tasks on the compute node have joined the single local barrier. | 02-14-2013 |
20130042254 | Performing A Local Barrier Operation - Performing a local barrier operation with parallel tasks executing on a compute node including, for each task: retrieving a present value of a counter; calculating, in dependence upon the present value of the counter and a total number of tasks performing the local barrier operation, a base value of the counter, the base value representing the counter's value prior to any task joining the local barrier; calculating, in dependence upon the base value and the total number of tasks performing the local barrier operation, a target value of the counter, the target value representing the counter's value when all tasks have joined the local barrier; joining the local barrier, including atomically incrementing the value of the counter; and repetitively, until the present value of the counter is no less than the target value of the counter: retrieving the present value of the counter and determining whether the present value equals the target value. | 02-14-2013 |
20130067479 | Establishing A Group Of Endpoints In A Parallel Computer - A parallel computer executes a number of tasks, each task includes a number of endpoints, and the endpoints are configured to support collective operations. In such a parallel computer, establishing a group of endpoints includes receiving a user specification of a set of endpoints included in a global collection of endpoints, where the user specification defines the set in accordance with a predefined virtual representation of the endpoints, the predefined virtual representation is a data structure setting forth an organization of tasks and endpoints included in the global collection of endpoints and the user specification defines the set of endpoints without a user specification of a particular endpoint; and defining a group of endpoints in dependence upon the predefined virtual representation of the endpoints and the user specification. | 03-14-2013 |
20130074086 | PIPELINING PROTOCOLS IN MISALIGNED BUFFER CASES - Systems, methods and articles of manufacture are disclosed for effecting a desired collective operation on a parallel computing system that includes multiple compute nodes. The compute nodes may pipeline multiple collective operations to effect the desired collective operation. To select protocols suitable for the multiple collective operations, the compute nodes may also perform additional collective operations. The compute nodes may pipeline the multiple collective operations and/or the additional collective operations to effect the desired collective operation more efficiently. | 03-21-2013 |
20130086551 | Providing A User With A Graphics Based IDE For Developing Software For Distributed Computing Systems - Graphics based IDE for distributed computing systems software development including providing a graphical representation of a topology of a distributed computing system for which the user is to develop a software application; receiving an identification of a system component upon which a portion of the application is to execute; providing a text editor for receiving from the user computer program instructions forming the portion of the application; inserting, without user intervention as part of the portion of the application, predetermined computer program instructions configured to support the identified system component; receiving, through the text editor, the portion of the application including the predetermined computer program instructions configured to support the identified system component; and storing the computer program instructions forming the portion of the application at a user specified location within the application. | 04-04-2013 |
20130117403 | Managing Internode Data Communications For An Uninitialized Process In A Parallel Computer - A parallel computer includes nodes, each having main memory and a messaging unit (MU). Each MU includes computer memory, which in turn includes MU message buffers. Each MU message buffer is associated with an uninitialized process on the compute node. In the parallel computer, managing internode data communications for an uninitialized process includes: receiving, by an MU of a compute node, one or more data communications messages in an MU message buffer associated with an uninitialized process on the compute node; determining, by an application agent, that the MU message buffer associated with the uninitialized process is full prior to initialization of the uninitialized process; establishing, by the application agent, a temporary message buffer for the uninitialized process in main computer memory; and moving, by the application agent, data communications messages from the MU message buffer associated with the uninitialized process to the temporary message buffer in main computer memory. | 05-09-2013 |
20130117761 | Intranode Data Communications In A Parallel Computer - Intranode data communications in a parallel computer that includes compute nodes configured to execute processes, where the data communications include: allocating, upon initialization of a first process of a compute node, a region of shared memory; establishing, by the first process, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; sending, to a second process on the same compute node, a data communications message without determining whether the second process has been initialized, including storing the data communications message in the message buffer of the second process; and upon initialization of the second process: retrieving, by the second process, a pointer to the second process's message buffer; and retrieving, by the second process from the second process's message buffer in dependence upon the pointer, the data communications message sent by the first process. | 05-09-2013 |
20130117764 | Internode Data Communications In A Parallel Computer - Internode data communications in a parallel computer that includes compute nodes that each include main memory and a messaging unit, the messaging unit including computer memory and coupling compute nodes for data communications, in which, for each compute node at compute node boot time: a messaging unit allocates, in the messaging unit's computer memory, a predefined number of message buffers, each message buffer associated with a process to be initialized on the compute node; receives, prior to initialization of a particular process on the compute node, a data communications message intended for the particular process; and stores the data communications message in the message buffer associated with the particular process. Upon initialization of the particular process, the process establishes a messaging buffer in main memory of the compute node and copies the data communications message from the message buffer of the messaging unit into the message buffer of main memory. | 05-09-2013 |