A First Scientific Computation¶

The 1st derivative of a function with finite differences Given a function $f(x) : [a,b] \rightarrow \mathbb{R}$ we want to approximate $f'(x)$ on a (uniform) grid on the $[a,b]$ interval by using a finite difference scheme in parallel.

Given an integer $n \in \mathbb{N}$ we can subdivide the interval $[a,b]$ into intervals of length $\Delta x = \frac{(b-a)}{n-1}$ with grid points $\{x_j\}_{j=0}^{n} = \{x_j = a + j \Delta x\}_{j=0}^{n-1}$:

,
and consider the values $\{f_j\}_{j=0}^{n-1} = \{f(x_j)\}_{j=0}^{n-1}$
We can approximate the values of $f'(x_j)$, for $j=1,\ldots,n-2$, by using only the values of $f$ at the knots $\{f_j\}_{j=0}^{n-1}$

The 1st derivative of a function with finite differences

The first derivative of $f$ at $x = x_j$ can be expressed by using knots for $j' > j$
or equivalently by using knots for $j' < j$
at last we can consider the arithmetic mean of previous two:

Writing the sequential algorithm The sequential algorithms needs to break the approximation process into three parts

evaluate the derivative $f'(x_i)$ for $i=1,\ldots,n-2$,
evaluate the derivative at the left–hand side $f'(x_0)$,
evaluate the derivative at the right–hand side $f'(x_{n-1})$.

To have the same order of approximation at each point of the grid we need to use a one–sided formula for the steps 2 and 3, specifically $$f'(x_0) \approx \frac{-3 f_0 + 4 f_1 - f_2}{2 \Delta x}, \quad f'(x_{n-1}) \approx \frac{3f_{n-1} -4 f_{n-2} + f_{n-3}}{2 \Delta x}$$

Writing the sequential algorithm Then the sequential algorithm can be written as

void firstderiv1D_vec(int n, double dx, double *f, double *fx){
double scale;
scale = 1.0/(2.0*dx);
for (int i = 1; i < n-1; i++){
 fx[i] = (f[i+1] - f[i-1])*scale;
}
 fx[0] = (-3.0*f[0] + 4.0*f[1] - f[2])*scale;
 fx[n-1] = (3.0*f[n-1] - 4.0*f[n-2] + f[n-3])*scale;
 return;
}

The function takes as input

the number of grid points is $n$,
the amplitude of such intervals $\Delta x$,
the array containing the evaluation of $f$ (intent: input),
the array that will contain the value of the derivative (intent: output)

Writing the parallel algorithm To implement the sequential differencing functions in parallel with MPI, we have to perform several steps

partition our domain $[a,b]$ among the processors,
each processor then computes the finite differences for all the points contained on that processor

To actually perform the second step, we need to observe that the end-points on each subdomain needs information that is not contained on the processor, but that resides on a different one, we need to communicate boundary data!

Red dots are halo data, the one we need to communicate, while gray dots are data owned by the process.

Writing the parallel algorithm The prototype of the function we want to write can be, in this case,

void firstderiv1Dp_vec(int n, double dx, double *f,
  double *fx, int mynode, int totalnodes)

where

int n is the number of points per process,
double dx the amplitude of each interval,
double *f, double *fx the local portions with the values of $f(x)$ (input) and $f'(x)$ (output),
int mynode the rank of the current process,
int totalnodes the size of the communicator

We declare then the variables

double scale = 1.0/(2.0*dx);
double mpitemp;
MPI_Status status;

Writing the parallel algorithm Then we can treat the case in which we are at the beginning or at the end of the global interval

if(mynode == 0){
 fx[0] = (-3.0*f[0] + 4.0*f[1] - f[2])*scale;
}
if(mynode == (totalnodes-1)){
 fx[n-1] = (3.0*f[n-1] - 4.0*f[n-2] + f[n-3])*scale;
}

this approximate the derivative at the first and last point of the global interval.

Then, we can compute the inner part (the gray points) of the local interval by doing:

for(int i=1;i<n-1;i++){
 fx[i] = (f[i+1]-f[i-1])*scale;
}

Writing the parallel algorithm The other case we need to treat is again the particular case in which we are in the first, or in the last interval. In both cases we have only one communication to perform

if(mynode == 0){
 mpitemp = f[n-1];
 MPI_Send(&mpitemp,1,MPI_DOUBLE,1,1,MPI_COMM_WORLD);
 MPI_Recv(&mpitemp,1,MPI_DOUBLE,1,1,MPI_COMM_WORLD,&status);
 fx[n-1] = (mpitemp - f[n-2])*scale;
}
else if(mynode == (totalnodes-1)){
 MPI_Recv(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD, 
   &status);
 fx[0] = (f[1]-mpitemp)*scale;
 mpitemp = f[0];
 MPI_Send(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD);
}

Writing the parallel algorithm Finally, the only remaining case is the one in which we need to communicate both the extremes of the interval

else{
 MPI_Recv(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD, 
   &status);
 fx[0] = (f[1]-mpitemp)*scale;
 mpitemp = f[0];
 MPI_Send(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD);
 mpitemp = f[n-1];
 MPI_Send(&mpitemp,1,MPI_DOUBLE,mynode+1,1,MPI_COMM_WORLD);
 MPI_Recv(&mpitemp,1,MPI_DOUBLE,mynode+1,1,MPI_COMM_WORLD, 
   &status);
 fx[n-1] = (mpitemp-f[n-2])*scale;
}

And the routine is complete!

Writing the parallel algorithm A simple (and not very useful) principal program for this routine can be written by first initializing the parallel environment, and discovering who we are.

MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &mynode );
MPI_Comm_size( MPI_COMM_WORLD, &totalnodes );

Then we build the local values of the $f$ function

globala = 0;
globalb = 1;
a = globala + ((double) mynode)*(globalb - globala)
    /( (double) totalnodes);
b = globala + ((double) mynode+1)*(globalb - globala)
    /( (double) totalnodes);
f  = (double *) malloc(sizeof(double)*(n));
fx = (double *) malloc(sizeof(double)*(n));
dx = (b-a)/((double) n);
for( int i = 0; i < n; i++){
 f[i] = fun(a+((double) i)*dx);
}

Finally we invoke our parallel computation

Writing the parallel algorithm To check if what we have done makes sens we evaluate the error in the $\|\cdot\|_2$ norm on the grid, i.e., $\sqrt{\Delta x} \| \mathbf{f}' - \mathbf{fx}\|_2$ on every process

error = 0.0;
for(int i = 0; i < n; i++){
 error += pow( fx[i]-funprime(a+((b-a)*((double) i))
    /((double) n)),2.0);
}
error = sqrt(dx*error);
printf("Node %d ||f' - fx||_2 = %e\n",mynode,error);

Then we clear the memory and close the parallel environment

free(f);
free(fx);
MPI_Finalize();

Further modifications

In every case the function void firstderiv1Dp_vec wants to exchange information between two adjacent processes, i.e., every process wants to “swap” is halo with its adjacent process. We can rewrite the whole function by using the MPI_Sendrecv_replace point-to-point communication routine.
We can rewrite the entire program in an “embarrassing parallel” way, if every process has access to $f$, and are assuming that all the interval are partitioned the same way, by using the knowledge of our rank we can compute what are the boundary elements at the previous and following process. Thus, no communication at all!

Full implementation¶

First we can implement the sequential version of the code. This simply applies the formulas from the previous section.

%%file ccode/sequentialderiv.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void firstderiv1D(int n, double a, double b, double *fx);
void firstderiv1D_vec(int n, double dx, double *f, double *fx);
double fun(double x);
double funprime( double x);

int main(int argc, char **argv){
  double a,b,dx,*f,*fx;
  int n;
  if(argc != 2){
    n = 20;
  }else{
    n = atoi(argv[1]);
  }
  a = 0.0;
  b = 1.0;
  f  = (double *) malloc(sizeof(double)*n);
  fx = (double *) malloc(sizeof(double)*n);
  // firstderiv1D(n, a, b, fx);
  dx = (b-a)/((double) n);
  for( int i = 0; i <= n; i++){
    f[i] = fun(a+((double) i)*dx);
  }
  firstderiv1D_vec(n, dx, f, fx);
  for(int i = 0; i < n; i++){
    printf("fx[%d] = %1.2f f'[%d] = %1.2f |fx - f'| = %1.1e\n",
      i,fx[i],i,funprime(a+((b-a)*i)/n),abs(fx[i]-
        funprime(a+((b-a)*((double) i))/((double) (n-1)))));
  }

  return 1;
}

void firstderiv1D(int n, double a, double b, double *fx){
  double dx = (b-a)/((double) n-1);
  double scale = 1.0/(2.0*dx);
  for (int i = 1; i < n-1; i++){
    fx[i] = (fun(a + (i+1)*dx) - fun(a + (i-1)*dx))*scale;
  }
  fx[0] = (-3.0*fun(a) + 4.0*fun(a + dx) - fun(a + 2.0*dx))*scale;
  fx[n-1] = (3.0*fun(a+ (n-1)*dx) - 4.0*fun(a + (n-2)*dx) + fun(a + (n-3)*dx))*scale;
  return;
}

void firstderiv1D_vec(int n, double dx, double *f, double *fx){
  double scale;
  scale = 1.0/(2.0*dx);
  for (int i = 1; i < n-1; i++){
    fx[i] = (f[i+1] - f[i-1])*scale;
  }
  fx[0] = (-3.0*f[0] + 4.0*f[1] - f[2])*scale;
  fx[n-1] = (3.0*f[n-1] - 4.0*f[n-2] + f[n-3])*scale;
  return;
}

double fun(double x){
  return(x*x + 1);
}

double funprime( double x){
  return(2*x);
}

Overwriting ccode/sequentialderiv.c

Code that can then be converted into its (distributed) parallel version as

%%file ccode/parallelderiv.c
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "mpi.h"

void firstderiv1Dp_vec(int n, double dx, double *f,double *fx, int mynode, int totalnodes);

double fun(double x);
double funprime( double x);

int main(int argc, char **argv){
  double globala,globalb,a,b,dx,*f,*fx, error;
  int n, mynode, totalnodes;
  MPI_Status status;

  MPI_Init( &argc, &argv );
  MPI_Comm_rank( MPI_COMM_WORLD, &mynode );
  MPI_Comm_size( MPI_COMM_WORLD, &totalnodes );

  n = 20;

  globala = 0;
  globalb = 1;

  a = globala + ((double) mynode)*(globalb - globala)/( (double) totalnodes);
  b = globala + ((double) mynode+1)*(globalb - globala)/( (double) totalnodes);

  f  = (double *) malloc(sizeof(double)*(n));
  fx = (double *) malloc(sizeof(double)*(n));

  dx = (b-a)/((double) n);
  for( int i = 0; i < n; i++){
    f[i] = fun(a+((double) i)*dx);
  }
  firstderiv1Dp_vec( n, dx, f, fx, mynode, totalnodes);

  error = 0.0;
  for(int i = 0; i < n; i++){
    error += pow( fx[i]-funprime(a+((b-a)*((double) i))/((double) n)),2.0);
  }
  error = sqrt(dx*error);
  printf("Node %d ||f' - fx||_2 = %e\n",mynode,error);


  free(f);
  free(fx);
  MPI_Finalize();

  return 0;
}


void firstderiv1Dp_vec(int n, double dx, double *f,double *fx, int mynode, int totalnodes){
double scale = 1.0/(2.0*dx);
double mpitemp;
MPI_Status status;

if(mynode == 0){
 fx[0] = (-3.0*f[0] + 4.0*f[1] - f[2])*scale;
}
if(mynode == (totalnodes-1)){
 fx[n-1] = (3.0*f[n-1] - 4.0*f[n-2] + f[n-3])*scale;
}
for(int i=1;i<n-1;i++){
 fx[i] = (f[i+1]-f[i-1])*scale;
}
if(mynode == 0){
   mpitemp = f[n-1];
   MPI_Send(&mpitemp,1,MPI_DOUBLE,1,1,MPI_COMM_WORLD);
   MPI_Recv(&mpitemp,1,MPI_DOUBLE,1,1,MPI_COMM_WORLD,&status);
   fx[n-1] = (mpitemp - f[n-2])*scale;
}
else if(mynode == (totalnodes-1)){
  MPI_Recv(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD, &status);
  fx[0] = (f[1]-mpitemp)*scale;
  mpitemp = f[0];
  MPI_Send(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD);
}
else{
 MPI_Recv(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD, &status);
 fx[0] = (f[1]-mpitemp)*scale;
 mpitemp = f[0];
 MPI_Send(&mpitemp,1,MPI_DOUBLE,mynode-1,1,MPI_COMM_WORLD);
 mpitemp = f[n-1];
 MPI_Send(&mpitemp,1,MPI_DOUBLE,mynode+1,1,MPI_COMM_WORLD);
 MPI_Recv(&mpitemp,1,MPI_DOUBLE,mynode+1,1,MPI_COMM_WORLD, &status);
 fx[n-1] = (mpitemp-f[n-2])*scale;
}
return;
}

double fun(double x){
  return(sin(x));
}

double funprime( double x){
  return(cos(x));
}

Overwriting ccode/parallelderiv.c

For both the codes we add again to the Makefile

sequentialderiv: sequentialderiv.c
	$(CC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@
parallelderiv: parallelderiv.c
	$(MPICC) $(CFLAGS) $(LDFLAGS) $? $(LDLIBS) -o $@

and compile with make sequentialderiv and make parallelderiv and run as

!(cd ccode && make parallelderiv)
!mpirun -np 4 ccode/parallelderiv
!(cd ccode && make sequentialderiv)
!mpirun -np 4 ccode/sequentialderiv

make[1]: ingresso nella directory "/home/cirdan/Documenti/RTDa-PISA/CorsoCalcoloParallelo2021/introtoparallelcomputing/intrompi/ccode"
mpicc			 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/cirdan/anaconda3/envs/parallel/include -g			 -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/cirdan/anaconda3/envs/parallel/lib -Wl,-rpath-link,/home/cirdan/anaconda3/envs/parallel/lib -L/home/cirdan/anaconda3/envs/parallel/lib parallelderiv.c -lm -ldl -o parallelderiv

make[1]: uscita dalla directory "/home/cirdan/Documenti/RTDa-PISA/CorsoCalcoloParallelo2021/introtoparallelcomputing/intrompi/ccode"

Node 0 ||f' - fx||_2 = 1.384681e-05
Node 1 ||f' - fx||_2 = 1.211868e-05
Node 2 ||f' - fx||_2 = 1.059319e-05
Node 3 ||f' - fx||_2 = 8.880167e-06

make[1]: ingresso nella directory "/home/cirdan/Documenti/RTDa-PISA/CorsoCalcoloParallelo2021/introtoparallelcomputing/intrompi/ccode"
/home/cirdan/anaconda3/envs/parallel/bin/x86_64-conda_cos6-linux-gnu-cc -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /home/cirdan/anaconda3/envs/parallel/include -g			 -Wl,-O2 -Wl,--sort-common -Wl,--as-needed -Wl,-z,relro -Wl,-z,now -Wl,--disable-new-dtags -Wl,--gc-sections -Wl,-rpath,/home/cirdan/anaconda3/envs/parallel/lib -Wl,-rpath-link,/home/cirdan/anaconda3/envs/parallel/lib -L/home/cirdan/anaconda3/envs/parallel/lib sequentialderiv.c -lm -ldl -o sequentialderiv

make[1]: uscita dalla directory "/home/cirdan/Documenti/RTDa-PISA/CorsoCalcoloParallelo2021/introtoparallelcomputing/intrompi/ccode"

fx[0] = -0.00 f'[0] = 0.00 |fx - f'| = -2.2e-15
fx[1] = 0.10 f'[1] = 0.10 |fx - f'| = -5.3e-03
fx[2] = 0.20 f'[2] = 0.20 |fx - f'| = -1.1e-02
fx[3] = 0.30 f'[3] = 0.30 |fx - f'| = -1.6e-02
fx[4] = 0.40 f'[4] = 0.40 |fx - f'| = -2.1e-02
fx[5] = 0.50 f'[5] = 0.50 |fx - f'| = -2.6e-02
fx[6] = 0.60 f'[6] = 0.60 |fx - f'| = -3.2e-02
fx[7] = 0.70 f'[7] = 0.70 |fx - f'| = -3.7e-02
fx[8] = 0.80 f'[8] = 0.80 |fx - f'| = -4.2e-02
fx[9] = 0.90 f'[9] = 0.90 |fx - f'| = -4.7e-02
fx[10] = 1.00 f'[10] = 1.00 |fx - f'| = -5.3e-02
fx[11] = 1.10 f'[11] = 1.10 |fx - f'| = -5.8e-02
fx[12] = 1.20 f'[12] = 1.20 |fx - f'| = -6.3e-02
fx[13] = 1.30 f'[13] = 1.30 |fx - f'| = -6.8e-02
fx[14] = 1.40 f'[14] = 1.40 |fx - f'| = -7.4e-02
fx[15] = 1.50 f'[15] = 1.50 |fx - f'| = -7.9e-02
fx[16] = 1.60 f'[16] = 1.60 |fx - f'| = -8.4e-02
fx[17] = 1.70 f'[17] = 1.70 |fx - f'| = -8.9e-02
fx[18] = 1.80 f'[18] = 1.80 |fx - f'| = -9.5e-02
fx[19] = 1.90 f'[19] = 1.90 |fx - f'| = -1.0e-01
fx[0] = -0.00 f'[0] = 0.00 |fx - f'| = -2.2e-15
fx[1] = 0.10 f'[1] = 0.10 |fx - f'| = -5.3e-03
fx[2] = 0.20 f'[2] = 0.20 |fx - f'| = -1.1e-02
fx[3] = 0.30 f'[3] = 0.30 |fx - f'| = -1.6e-02
fx[4] = 0.40 f'[4] = 0.40 |fx - f'| = -2.1e-02
fx[5] = 0.50 f'[5] = 0.50 |fx - f'| = -2.6e-02
fx[6] = 0.60 f'[6] = 0.60 |fx - f'| = -3.2e-02
fx[7] = 0.70 f'[7] = 0.70 |fx - f'| = -3.7e-02
fx[8] = 0.80 f'[8] = 0.80 |fx - f'| = -4.2e-02
fx[9] = 0.90 f'[9] = 0.90 |fx - f'| = -4.7e-02
fx[10] = 1.00 f'[10] = 1.00 |fx - f'| = -5.3e-02
fx[11] = 1.10 f'[11] = 1.10 |fx - f'| = -5.8e-02
fx[12] = 1.20 f'[12] = 1.20 |fx - f'| = -6.3e-02
fx[13] = 1.30 f'[13] = 1.30 |fx - f'| = -6.8e-02
fx[14] = 1.40 f'[14] = 1.40 |fx - f'| = -7.4e-02
fx[15] = 1.50 f'[15] = 1.50 |fx - f'| = -7.9e-02
fx[16] = 1.60 f'[16] = 1.60 |fx - f'| = -8.4e-02
fx[17] = 1.70 f'[17] = 1.70 |fx - f'| = -8.9e-02
fx[18] = 1.80 f'[18] = 1.80 |fx - f'| = -9.5e-02
fx[19] = 1.90 f'[19] = 1.90 |fx - f'| = -1.0e-01
fx[0] = -0.00 f'[0] = 0.00 |fx - f'| = -2.2e-15
fx[1] = 0.10 f'[1] = 0.10 |fx - f'| = -5.3e-03
fx[2] = 0.20 f'[2] = 0.20 |fx - f'| = -1.1e-02
fx[3] = 0.30 f'[3] = 0.30 |fx - f'| = -1.6e-02
fx[4] = 0.40 f'[4] = 0.40 |fx - f'| = -2.1e-02
fx[5] = 0.50 f'[5] = 0.50 |fx - f'| = -2.6e-02
fx[6] = 0.60 f'[6] = 0.60 |fx - f'| = -3.2e-02
fx[7] = 0.70 f'[7] = 0.70 |fx - f'| = -3.7e-02
fx[8] = 0.80 f'[8] = 0.80 |fx - f'| = -4.2e-02
fx[9] = 0.90 f'[9] = 0.90 |fx - f'| = -4.7e-02
fx[10] = 1.00 f'[10] = 1.00 |fx - f'| = -5.3e-02
fx[11] = 1.10 f'[11] = 1.10 |fx - f'| = -5.8e-02
fx[12] = 1.20 f'[12] = 1.20 |fx - f'| = -6.3e-02
fx[13] = 1.30 f'[13] = 1.30 |fx - f'| = -6.8e-02
fx[14] = 1.40 f'[14] = 1.40 |fx - f'| = -7.4e-02
fx[15] = 1.50 f'[15] = 1.50 |fx - f'| = -7.9e-02
fx[16] = 1.60 f'[16] = 1.60 |fx - f'| = -8.4e-02
fx[17] = 1.70 f'[17] = 1.70 |fx - f'| = -8.9e-02
fx[18] = 1.80 f'[18] = 1.80 |fx - f'| = -9.5e-02
fx[19] = 1.90 f'[19] = 1.90 |fx - f'| = -1.0e-01
fx[0] = -0.00 f'[0] = 0.00 |fx - f'| = -2.2e-15
fx[1] = 0.10 f'[1] = 0.10 |fx - f'| = -5.3e-03
fx[2] = 0.20 f'[2] = 0.20 |fx - f'| = -1.1e-02
fx[3] = 0.30 f'[3] = 0.30 |fx - f'| = -1.6e-02
fx[4] = 0.40 f'[4] = 0.40 |fx - f'| = -2.1e-02
fx[5] = 0.50 f'[5] = 0.50 |fx - f'| = -2.6e-02
fx[6] = 0.60 f'[6] = 0.60 |fx - f'| = -3.2e-02
fx[7] = 0.70 f'[7] = 0.70 |fx - f'| = -3.7e-02
fx[8] = 0.80 f'[8] = 0.80 |fx - f'| = -4.2e-02
fx[9] = 0.90 f'[9] = 0.90 |fx - f'| = -4.7e-02
fx[10] = 1.00 f'[10] = 1.00 |fx - f'| = -5.3e-02
fx[11] = 1.10 f'[11] = 1.10 |fx - f'| = -5.8e-02
fx[12] = 1.20 f'[12] = 1.20 |fx - f'| = -6.3e-02
fx[13] = 1.30 f'[13] = 1.30 |fx - f'| = -6.8e-02
fx[14] = 1.40 f'[14] = 1.40 |fx - f'| = -7.4e-02
fx[15] = 1.50 f'[15] = 1.50 |fx - f'| = -7.9e-02
fx[16] = 1.60 f'[16] = 1.60 |fx - f'| = -8.4e-02
fx[17] = 1.70 f'[17] = 1.70 |fx - f'| = -8.9e-02
fx[18] = 1.80 f'[18] = 1.80 |fx - f'| = -9.5e-02
fx[19] = 1.90 f'[19] = 1.90 |fx - f'| = -1.0e-01

A Short Introduction to Parallel Computing

A First Scientific Computation¶

Full implementation¶