Collective Communications¶
A collective communication is a communication that involves a group (or groups) of processes.
The group of processes is always represented by a communicator, which provides a context for the operation.
The syntax and semantics of the collective operations are consistent with the syntax and semantics of the point-to-point operations.
For collective operations, the amount of data sent must exactly match the amount of data specified by the receiver.
Mixing types of calls: collective communication calls may use the same communicators as point-to-point communication; any conforming MPI implementation guarantees that messages generated on behalf of collective communication calls will not be confused with messages generated by point-to-point communication.
Broadcast, Gather and Scatter¶
The broadcast operation. In the broadcast, initially just the first process contains the data \(a_0\), but after the broadcast all processes contain it.
This is an example of a one-to-all communication, i.e., only one process contributes to the result, while all processes receive the result.
int MPI_Bcast(void* buffer, int count,
MPI_Datatype datatype, int root, MPI_Comm comm)
Broadcasts a message from the process with rank root to all processes of the group, itself included.
void* buffer : on return, the content of root's buffer is copied to all other processes
int count : size of the message
MPI_Datatype datatype : type of the buffer
int root : rank of the process broadcasting the message
MPI_Comm comm : communicator grouping the processes involved in the broadcast operation
The scatter and gather operations
In the scatter, initially just the first process contains the data \(a_0,\ldots,a_3\), but after the scatter the \(j\)th process contains the \(a_j\) data.
In the gather, initially the \(j\)th process contains the \(a_j\) data, but after the gather the first process contains the data \(a_0,\ldots,a_3\)
Each process (root process included) sends the contents of its send buffer to the root process. The latter receives the messages and stores them in rank order.
int MPI_Gather(const void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
const void* sendbuf : starting address of send buffer
int sendcount : number of elements in send buffer
MPI_Datatype sendtype : data type of send buffer elements
void* recvbuf : address of receive buffer
int recvcount : number of elements for any single receive (and not the total number of items!)
MPI_Datatype recvtype : data type of receive buffer elements
int root : rank of receiving process
MPI_Comm comm : communicator
Observe that the type signature of sendcount, sendtype on each process must be equal to the type signature of recvcount, recvtype at the root; the amount of data sent must be equal to the amount of data received, pairwise between each process and the root.
Therefore, if we need to have a varying count of data from each process, we need to use instead
int MPI_Gatherv(const void* sendbuf, int sendcount,
MPI_Datatype sendtype, void* recvbuf,
const int recvcounts[], const int displs[],
MPI_Datatype recvtype, int root, MPI_Comm comm)
where
const int recvcounts[] : an array (of length group size) containing the number of elements that are received from each process
const int displs[] : an array (of length group size); entry i specifies the displacement relative to recvbuf at which to place the incoming data from process i
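As a minimal sketch (the variable names are illustrative, and we assume MPI_Init has already been called and that mpi.h and stdlib.h are included), each rank could contribute rank+1 integers that the root assembles contiguously:
// Sketch: rank r contributes r+1 integers; rank 0 gathers them contiguously.
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int sendcount = rank + 1;
int *sendbuf = (int *) malloc(sendcount * sizeof(int));
for (int i = 0; i < sendcount; i++) sendbuf[i] = rank;
int *recvcounts = NULL, *displs = NULL, *recvbuf = NULL;
if (rank == 0) {
  recvcounts = (int *) malloc(size * sizeof(int));
  displs = (int *) malloc(size * sizeof(int));
  int total = 0;
  for (int r = 0; r < size; r++) {
    recvcounts[r] = r + 1;   // how many elements rank r sends
    displs[r] = total;       // where rank r's block starts in recvbuf
    total += recvcounts[r];
  }
  recvbuf = (int *) malloc(total * sizeof(int));
}
MPI_Gatherv(sendbuf, sendcount, MPI_INT, recvbuf, recvcounts, displs,
            MPI_INT, 0, MPI_COMM_WORLD);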
If we need to have the result of the gather operation on every process involved in the communicator we can use the variant
int MPI_Allgather(const void* sendbuf, int sendcount,
MPI_Datatype sendtype, void* recvbuf, int recvcount,
MPI_Datatype recvtype, MPI_Comm comm)
All processes in the communicator comm receive the result. The block of data sent from the \(j\)th process is received by every process and placed in the \(j\)th block of the buffer recvbuf. The type signature associated with sendcount, sendtype at a process must be equal to the type signature associated with recvcount, recvtype at any other process.
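For instance, a small sketch (illustrative names, MPI already initialized, stdlib.h included) in which every rank contributes its own rank number and afterwards every rank holds the full table:
// Sketch: every rank contributes one int; afterwards all ranks hold the full array.
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int myvalue = rank;                            // one element per process
int *allvalues = (int *) malloc(size * sizeof(int));
MPI_Allgather(&myvalue, 1, MPI_INT, allvalues, 1, MPI_INT, MPI_COMM_WORLD);
// On every rank, allvalues[j] == j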
This function also has a version for gathering messages of different sizes:
int MPI_Allgatherv(const void* sendbuf, int sendcount,
MPI_Datatype sendtype, void* recvbuf, const int recvcounts[],
const int displs[], MPI_Datatype recvtype, MPI_Comm comm)
which works analogously to MPI_Gatherv.
The scatter is simply the inverse operation of MPI_Gather:
int MPI_Scatter(const void* sendbuf, int sendcount,
MPI_Datatype sendtype, void* recvbuf, int recvcount,
MPI_Datatype recvtype, int root, MPI_Comm comm)
const void* sendbuf : address of send buffer
int sendcount : number of elements sent to each process
MPI_Datatype sendtype : data type of send buffer elements
void* recvbuf : address of receive buffer
int recvcount : number of elements in receive buffer
MPI_Datatype recvtype : data type of receive buffer elements
int root : rank of sending process
MPI_Comm comm : communicator
Observe that the type signature of sendcount, sendtype at the root must be equal to the type signature of recvcount, recvtype on each process; the amount of data sent must be equal to the amount of data received, pairwise between each process and the root.
Therefore, if we need to have a varying count of data from each process, we need to use instead
int MPI_Scatterv(const void* sendbuf, const int sendcounts[],
const int displs[], MPI_Datatype sendtype, void* recvbuf,
int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
where
const int sendcounts[] : an array (of length group size) containing the number of elements that are sent to each process
const int displs[] : an array (of length group size); entry i specifies the displacement relative to sendbuf from which to take the outgoing data destined for process i
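A minimal sketch (illustrative names, MPI already initialized, stdlib.h included) where rank 0 hands out blocks of different lengths, one per rank:
// Sketch: rank 0 owns the data and sends r+1 doubles to rank r.
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
double *sendbuf = NULL;
int *sendcounts = NULL, *displs = NULL;
if (rank == 0) {
  sendcounts = (int *) malloc(size * sizeof(int));
  displs = (int *) malloc(size * sizeof(int));
  int total = 0;
  for (int r = 0; r < size; r++) {
    sendcounts[r] = r + 1;   // block length for rank r
    displs[r] = total;       // offset of that block inside sendbuf
    total += sendcounts[r];
  }
  sendbuf = (double *) malloc(total * sizeof(double));
  for (int i = 0; i < total; i++) sendbuf[i] = (double) i;
}
int recvcount = rank + 1;    // each rank knows the length of its own block
double *recvbuf = (double *) malloc(recvcount * sizeof(double));
MPI_Scatterv(sendbuf, sendcounts, displs, MPI_DOUBLE,
             recvbuf, recvcount, MPI_DOUBLE, 0, MPI_COMM_WORLD);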
Modifying the 1st derivative code¶
Let us perform the following modifications to our first derivative code:
Taking from input the number of points to use in each interval,
Collecting the whole result on one process and printing it to file.
For the first step we use the MPI_Bcast function:
if(mynode == 0){
if(argc != 2){
n = 20;
}else{
n = atoi(argv[1]);
}
}
MPI_Bcast(&n,1,MPI_INT,
0,MPI_COMM_WORLD);
We read the number n from the command line on rank \(0\); then we broadcast it with MPI_Bcast. Pay attention to the fact that the broadcast operation happens on all the processes!
Then we gather the derivatives from the various processes and collect them on process 0.
if(mynode == 0)
globalderiv = (double *)
malloc(sizeof(double)
*(n*totalnodes));
MPI_Gather(fx,n,MPI_DOUBLE,
globalderiv,n,MPI_DOUBLE,
0,MPI_COMM_WORLD);
We allocate on rank 0 the memory that is necessary to store the whole derivative array; then we use MPI_Gather to collect the fx arrays (of double) into the globalderiv array.
Finally, we print it out to a file on rank 0:
if(mynode == 0){
  FILE *fptr = fopen("derivative", "w");
  for(int i = 0; i < n*totalnodes; i++)
    fprintf(fptr,"%f %f\n",globala+i*dx,globalderiv[i]);
  fclose(fptr);
  free(globalderiv);
}
All-to-All Scatter/Gather¶
MPI_ALLTOALL is an extension of MPI_ALLGATHER to the case where each process sends distinct data to each of the receivers.
int MPI_Alltoall(const void* sendbuf, int sendcount,
MPI_Datatype sendtype, void* recvbuf, int recvcount,
MPI_Datatype recvtype, MPI_Comm comm)
The \(j\)th block sent from process \(i\) is received by process \(j\) and is placed in the \(i\)th block of recvbuf. The type signature for sendcount, sendtype at a process must be equal to the type signature for recvcount, recvtype at any other process.
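A small sketch (illustrative names, MPI already initialized, stdlib.h included): every rank prepares one integer per destination and, after the call, holds one integer from every source:
// Sketch: rank r sends the value 100*r + j to rank j.
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int *sendbuf = (int *) malloc(size * sizeof(int));
int *recvbuf = (int *) malloc(size * sizeof(int));
for (int j = 0; j < size; j++) sendbuf[j] = 100 * rank + j;  // block j is destined for rank j
MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
// On rank r, recvbuf[i] == 100*i + r (the block that rank i destined for rank r)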
If we need to send data of different sizes between the processes, we can use
int MPI_Alltoallv(const void* sendbuf, const int sendcounts[],
const int sdispls[], MPI_Datatype sendtype, void* recvbuf,
const int recvcounts[], const int rdispls[],
MPI_Datatype recvtype, MPI_Comm comm);
const void* sendbuf : starting address of send buffer
const int sendcounts[] : array specifying the number of elements to send to each rank
const int sdispls[] : entry \(j\) specifies the displacement (relative to sendbuf) from which to take the outgoing data destined for process \(j\)
void* recvbuf : address of receive buffer
const int recvcounts[] : array specifying the number of elements that can be received from each rank
const int rdispls[] : entry \(i\) specifies the displacement (relative to recvbuf) at which to place the incoming data from process \(i\)
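As a sketch of how the four arrays fit together (illustrative names, MPI already initialized, stdlib.h included), here every rank sends j+1 integers to rank j:
// Sketch: every rank sends j+1 integers to rank j, hence rank j
// receives j+1 integers from each of the other ranks.
int rank, size;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
int *sendcounts = (int *) malloc(size * sizeof(int));
int *recvcounts = (int *) malloc(size * sizeof(int));
int *sdispls = (int *) malloc(size * sizeof(int));
int *rdispls = (int *) malloc(size * sizeof(int));
int stotal = 0, rtotal = 0;
for (int j = 0; j < size; j++) {
  sendcounts[j] = j + 1;    sdispls[j] = stotal;  stotal += sendcounts[j];
  recvcounts[j] = rank + 1; rdispls[j] = rtotal;  rtotal += recvcounts[j];
}
int *sendbuf = (int *) malloc(stotal * sizeof(int));
int *recvbuf = (int *) malloc(rtotal * sizeof(int));
for (int i = 0; i < stotal; i++) sendbuf[i] = rank;   // tag the data with the sender's rank
MPI_Alltoallv(sendbuf, sendcounts, sdispls, MPI_INT,
              recvbuf, recvcounts, rdispls, MPI_INT, MPI_COMM_WORLD);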
Global reduce operation¶
The reduce operation for a given operator takes a data buffer from each of the processes in the communicator group and combines it according to the operator's rules.
int MPI_Reduce(const void* sendbuf, void* recvbuf,
int count, MPI_Datatype datatype, MPI_Op op,
int root, MPI_Comm comm);
const void* sendbuf : address of send buffer
void* recvbuf : address of receive buffer
int count : number of elements in send buffer
MPI_Datatype datatype : data type of elements of send buffer
MPI_Op op : reduce operation
int root : rank of root process
MPI_Comm comm : communicator
The value of MPI_Op op for the reduce operation can be taken from any of the following predefined operators.
Constant | Operation | Constant | Operation
---|---|---|---
MPI_MAX | Maximum | MPI_MAXLOC | Max value and location
MPI_MIN | Minimum | MPI_MINLOC | Minimum value and location
MPI_SUM | Sum | MPI_LOR | Logical or
MPI_PROD | Product | MPI_BOR | Bit-wise or
MPI_LAND | Logical and | MPI_LXOR | Logical exclusive or
MPI_BAND | Bit-wise and | MPI_BXOR | Bit-wise exclusive or
Moreover, if a different operator is needed, it is possible to create it by means of the function
int MPI_Op_create(MPI_User_function* user_fn, int commute,
MPI_Op* op)
In C, the prototype for an MPI_User_function is
typedef void MPI_User_function(void* invec, void* inoutvec,
int *len, MPI_Datatype *datatype);
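For instance, a sketch (illustrative names; math.h is assumed to be included and MPI initialized; localvec, globalvec and n are illustrative placeholders) of a commutative user-defined operator that keeps, element by element, the value with the largest absolute value:
// User function: combine invec into inoutvec element by element.
// This particular function assumes the data are of type MPI_DOUBLE.
void maxabs_fn(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
  double *in = (double *) invec, *inout = (double *) inoutvec;
  for (int i = 0; i < *len; i++)
    if (fabs(in[i]) > fabs(inout[i])) inout[i] = in[i];
}

/* ... later, after MPI_Init ... */
MPI_Op maxabs_op;
MPI_Op_create(&maxabs_fn, 1, &maxabs_op);   /* second argument: 1 = commutative */
MPI_Reduce(localvec, globalvec, n, MPI_DOUBLE, maxabs_op, 0, MPI_COMM_WORLD);
MPI_Op_free(&maxabs_op);                    /* release the operator when done */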
As for other collective operations, we may want to have the result of the reduction available on every process in a group. The routine for obtaining such a result is
int MPI_Allreduce(const void* sendbuf, void* recvbuf,
int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
const void* sendbuf : address of send buffer
void* recvbuf : address of receive buffer
int count : number of elements in send buffer
MPI_Datatype datatype : data type of elements of send buffer
MPI_Op op : reduce operation
MPI_Comm comm : communicator
This instruction behaves like a combination of a reduction and broadcast operation.
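A typical use, sketched here with illustrative names (nloc and x are assumed to describe the local piece of a distributed vector, and math.h is assumed to be included), is computing the Euclidean norm of a distributed vector, where every process needs the global sum:
// Sketch: every rank owns nloc entries of a distributed vector x.
double local_sum = 0.0, global_sum;
for (int i = 0; i < nloc; i++) local_sum += x[i] * x[i];
MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
double norm = sqrt(global_sum);   // the norm is now available on every rank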
This is another variant of the reduction operation, in which the result is scattered to all processes in a group on return.
int MPI_Reduce_scatter_block(const void* sendbuf,
void* recvbuf, int recvcount, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm);
The routine is called by all group members using the same arguments for recvcount, datatype, op and comm. The resulting vector is treated as n consecutive blocks of recvcount elements that are scattered to the processes of the group comm. The \(i\)th block is sent to process \(i\) and stored in the receive buffer defined by recvbuf, recvcount, and datatype.
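A compact sketch (illustrative names; size obtained as before from MPI_Comm_size, stdlib.h included): every rank contributes a vector of size*recvcount ones, and rank i receives the i-th block of the element-wise sum:
// Sketch: element-wise sum of vectors of length size*recvcount;
// block i of the reduced vector is delivered to rank i.
int recvcount = 2;                 // number of elements each rank receives
int total = size * recvcount;
double *sendbuf = (double *) malloc(total * sizeof(double));
double *recvbuf = (double *) malloc(recvcount * sizeof(double));
for (int i = 0; i < total; i++) sendbuf[i] = 1.0;   // every rank contributes ones
MPI_Reduce_scatter_block(sendbuf, recvbuf, recvcount, MPI_DOUBLE,
                         MPI_SUM, MPI_COMM_WORLD);
// On every rank, recvbuf[k] == (double) size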
A variant of this function with variable block size is also available:
int MPI_Reduce_scatter(const void* sendbuf, void* recvbuf,
const int recvcounts[], MPI_Datatype datatype, MPI_Op op,
MPI_Comm comm);
This routine first performs a global element-wise reduction on vectors of \(\verb|count|=\sum_{i=0}^{n-1}\verb|recvcounts[i]|\) elements in the send buffers defined by sendbuf, count and datatype, using the operation op, where n is the size of the communicator. The routine is called by all group members using the same arguments for recvcounts, datatype, op and comm. The resulting vector is treated as n consecutive blocks, where the number of elements of the \(i\)th block is recvcounts[i]. The \(i\)th block is sent to process \(i\) and stored in the receive buffer defined by recvbuf, recvcounts[i] and datatype.
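A sketch with variable block sizes (illustrative names; rank and size obtained as before): rank i keeps i+1 elements of the element-wise sum:
// Sketch: rank i receives i+1 elements of the element-wise sum.
int *recvcounts = (int *) malloc(size * sizeof(int));
int total = 0;
for (int r = 0; r < size; r++) { recvcounts[r] = r + 1; total += recvcounts[r]; }
double *sendbuf = (double *) malloc(total * sizeof(double));
double *recvbuf = (double *) malloc(recvcounts[rank] * sizeof(double));
for (int i = 0; i < total; i++) sendbuf[i] = 1.0;
MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, MPI_DOUBLE,
                   MPI_SUM, MPI_COMM_WORLD);
// On rank i, each of the i+1 entries of recvbuf equals (double) size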
Some computations using collective communications¶
Computing Integrals¶
For an integrable function \(f : [a,b] \rightarrow \mathbb{R}\) the midpoint rule (sometimes called the rectangle rule) is given by
$$\int_{a}^{b}f(x)\, dx \approx I_1 = (b-a) f\left(\frac{a+b}{2}\right).$$
This is a very crude approximation; to make it more accurate we may break up the interval \([a,b]\) into a number \(n\) of non-overlapping subintervals \([a_k,b_k]\) such that \([a,b] = \cup_k [a_k,b_k]\), obtaining
$$I_n = \sum_{k=0}^{n-1}(b_k-a_k) f\left(\frac{a_k+b_k}{2}\right).$$
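Before parallelizing, a serial sketch of the composite midpoint rule on a uniform subdivision may help fix ideas (the function midpoint_rule and its arguments are illustrative):
// Serial composite midpoint rule on n equal subintervals of [a,b].
double midpoint_rule(double (*f)(double), double a, double b, int n)
{
  double h = (b - a) / n;          // width of each subinterval
  double sum = 0.0;
  for (int k = 0; k < n; k++)
    sum += f(a + (k + 0.5) * h);   // evaluate f at the midpoint of subinterval k
  return h * sum;
}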
If we want to transform this computation into a parallel computation, we can adopt the following sketch:
if (mynode == 0) get the number of intervals for the quadrature,
broadcast the number of intervals to all the processes,
assign the non-overlapping intervals to the processes,
sum the function values at the center of each interval,
reduce with the sum operator the integral on process 0.
As a test function for the parallel integration routine we can use
$$f(x) = \frac{4}{1+x^2}; \qquad I = \int_{0}^{1} \frac{4}{1+x^2}\, dx = \pi.$$
To evaluate the error we can use the value:
double PI25DT = 3.141592653589793238462643;
h = 1.0 / ((double) n*totalnodes);
sum = 0.0;
for (i = 1+mynode*n;
i <= n*(mynode+1);
i++){
x = h * ((double)i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1,
MPI_DOUBLE,
MPI_SUM, 0,
MPI_COMM_WORLD);
We assume that all the intervals have the same size, hence the spacing h = 1.0 / ((double) n*totalnodes). We compute all the values \(x\) that belong to the local process and increment the local sum; finally we perform an MPI_Reduce to sum together all the local sums.
You can then print out the obtained value of \(\pi\) and the error with
respect to PI25DT
as
if (mynode == 0){
printf("pi is approximately %.16f, Error is %e\n",
pi, fabs(pi - PI25DT));
}
Python version¶
We can implement the same algorithm using mpi4py.
%%file quadrature.py
"""
Computing pi with parallel quadrature formula
"""
import numpy
from mpi4py import MPI
import sys
comm = MPI.COMM_WORLD
mynode = comm.Get_rank()
totalnodes = comm.Get_size()
def fun(x):
return (4.0/(1.0 + x*x))
truepi = 3.141592653589793238462643
dest = 0
pi = numpy.zeros(1)
# Initialize value of n only if this is rank 0
# we are not checking the input, if the user does something that
# has no meaning we are going to fail badly!
if mynode == 0:
if len(sys.argv) == 1:
n = numpy.full(1, 20, dtype=int) # default value
else:
n = numpy.full(1,int(sys.argv[1]),dtype=int)
else:
n = numpy.zeros(1, dtype=int)
# Broadcast n to all processes
comm.Bcast(n, root=0)
# Compute local integral
my_pi = numpy.zeros(1)
h = 1.0/(n*totalnodes)
for i in numpy.arange(1+mynode*n,n*(mynode+1)+1):
x = h*(i - 0.5)
my_pi = my_pi + fun(x)
my_pi = h*my_pi
# Send partition back to root process:
comm.Reduce(my_pi, pi, MPI.SUM, dest)
# Only print the result in process 0
if mynode == 0:
print('The Integral Sum =', pi[0]," The Error is ",numpy.abs(pi[0]-truepi))
Overwriting quadrature.py
We can execute it as usual by doing
!mpiexec -np 4 python quadrature.py 50
The Integral Sum = 3.141594736923127 The Error is 2.083333333846582e-06
Random number generation: Monte Carlo type algorithms¶
Monte Carlo methods are algorithms that rely on a procedure of repeated random sampling to obtain numerical results [1].
A generic Monte Carlo algorithm can be described by the following four steps:
define a domain of possible samples
generate the samples from a probability distribution over such domain
perform a deterministic computation on the inputs
aggregate the results
We can write the parallel version of such an algorithm in the following way:
we divide a square into a number of parts equal to the number of processes we have,
we generate a number of random points \((x,y)\) in the area owned by each process,
we compute how many points fall in the circle,
we sum-reduce the number of points in the square and in the circle,
we divide the two numbers on process 0 and multiply by 4 to get the approximation of \(\pi\).
We can generate on each node the samples on its portion of the reference square \([-1,1]\times[-1,1]\) by
h = 2.0 / (double) totalnodes;
x1 = -1.0 + mynode * h;
x2 = x1 + h;
y1 = -1.0;
y2 = 1.0;
my_SqPoints = 0;
my_CiPoints = 0;
for (i = 1; i <= n; i += totalnodes){
x = rand(); x = x / RAND_MAX; x = x1 + x * (x2 - x1);
y = rand(); y = y / RAND_MAX; y = y1 + y * (y2 - y1);
my_SqPoints++;
if ( ( x*x + y*y ) <= 1.0 ) my_CiPoints++;
}
Then we perform the reduction by doing
SqPoints = 0;
CiPoints = 0;
MPI_Reduce(&my_SqPoints, &SqPoints, 1, MPI_INT, MPI_SUM, 0,
MPI_COMM_WORLD);
MPI_Reduce(&my_CiPoints, &CiPoints, 1, MPI_INT, MPI_SUM, 0,
MPI_COMM_WORLD);
and print the approximation
if (mynode == 0){
pi = 4.0 * (double)CiPoints / (double)SqPoints;
printf("Pi is approximately %.16f, Error is %e\n"
,pi, fabs(pi - PI25DT));
}
Python version¶
Again we can implement the same procedure in Python.
%%file montecarlo.py
"""
Computing pi with a parallel Monte Carlo method
"""
import numpy
from mpi4py import MPI
import sys
import random
comm = MPI.COMM_WORLD
mynode = comm.Get_rank()
totalnodes = comm.Get_size()
truepi = 3.141592653589793238462643
dest = 0
pi = numpy.zeros(1)
# Initialize value of n only if this is rank 0
# we are not checking the input, if the user does something that
# has no meaning we are going to fail badly!
if mynode == 0:
if len(sys.argv) == 1:
n = numpy.full(1, 20, dtype=int) # default value
else:
n = numpy.full(1,int(sys.argv[1]),dtype=int)
else:
n = numpy.zeros(1, dtype=int)
# Broadcast n to all processes
comm.Bcast(n, root=0)
# Compute local quantities
h = 2.0 / totalnodes
x1 = -1.0 + mynode * h
x2 = x1 + h
y1 = -1.0
y2 = 1.0
my_SqPoints = numpy.zeros(1, dtype=int)
my_CiPoints = numpy.zeros(1, dtype=int)
for i in numpy.arange(1,n+1,totalnodes):
x = random.random()
x = x1 + x*(x2-x1)
y = random.random()
y = y1 + y*(y2-y1)
my_SqPoints = my_SqPoints + 1
if ( x**2 + y**2 <= 1.0 ):
my_CiPoints = my_CiPoints + 1
SqPoints = numpy.zeros(1, dtype=int)
CiPoints = numpy.zeros(1, dtype=int)
# Send back to the root process the data
comm.Reduce(my_SqPoints, SqPoints, MPI.SUM, dest)
comm.Reduce(my_CiPoints, CiPoints, MPI.SUM, dest)
# Only print the result in process 0
if mynode == 0:
pi = 4.0*float(CiPoints)/float(SqPoints)
print('The Integral Sum =', pi," The Error is ",numpy.abs(pi-truepi))
Overwriting montecarlo.py
We can execute it as usual by doing
!mpiexec -np 4 python montecarlo.py 1000
The Integral Sum = 3.164 The Error is 0.02240734641020703
[1] For some historical information about this idea: http://shorturl.at/mAWY8