Hello,

I get the following error message when running the demo code mpi_poisson.py

`Intel MKL FATAL ERROR: Cannot load symbol MKLMPI_Get_wrappers`

This seems to be related to dynamic/static link of MKL library.

The solver is PCG using bddc with mumps preconditioner.

This error is only activated when the problem size is “sufficiently large”.

The code works fine when maxh = 0.08, but fails when maxh = 0.06.

Meanwhile, bddc + usehypre works fine for me.

I build the version ‘6.2.2006-83-gdc140c7’ with the following configuration file (static MKL library):

```
...
-DUSE_MPI=ON \
-DUSE_MUMPS=ON \
-DUSE_HYPRE=ON \
-DUSE_MKL=ON \
-DMKL_STATIC=ON \
...
```

The compiler is gcc 8.3.0, MPI is a MPICH v3.3, MKL library is from Intel19.0

Also, a side question, what options are available for the inverse flag of bddc solver that is suitable for MPI?

`c = Preconditioner(a, type="bddc", inverse = "???")`

Best,

Guosheng

Okay, this is a new error for me too. I have never worked with MPICH.

It looks ike MKLMPI_Get_wrappers is in the MKL BLACS library, so maybe NGSolve is linked with the wrong one. MPICH should be linked against libmkl_blacs_intelmpi…, not against libmkl_blacs_openmpi…

Could you upload your CMakeCache.txt from the NGSolve build directory?

You could also try to manually load the SDL library by putting this at the top of your python file:

```
from ctypes import CDLL, RTLD_GLOBAL
CDLL(<path_to_libmkl_rt.so>, RTDL_GLOBAL)
```

I know it is not a very clean solution, but it usually works for me.

If this does not bother you too much you could even build NGSolve with dynamic MKL linking and MKL_SDL=ON.

Only “mumps” and “masterinverse”. Some more direct solvers are available through the PETSc interface.

However, that does not work with the “inverse=” flag, but has to be accessed by “coarsetype=petsc_pc”. What petsc_pc does can then be configured via flags.

Best,

Lukas

Yes, I am using intelmpi from MKL, otherwise the build stage will complain.

But strangely, I can not reproduce the bug anymore today.

What I did is re-install petsc to include mumps and superlu_dist by the following configuration file (adapted from yours) (I can not link the scalapack lib from MKL, so I downloaded it):

```
20 ./configure --with-debugging=0 \
21 --PETSC_ARCH="arch-linux-c-opt" \
22 --prefix=$HOME/NGSolve-X/petsc \
23 --with-fortran-bindings=0 \
24 --with-fortran-type-initialize=0 \
25 --with-shared-libraries \
26 --with-mpi=1 --with-mpi-dir=/XXX/mpich/3.3/gcc/8.3.0 \
27 --with-cxx-dialect=c++11 \
28 --with-metis=1 --with-metis-dir=XXX/ngbuild-mpi/dependencies/parmetis \
29 --with-parmetis=1 --with-parmetis-dir=XXX/ngbuild-mpi/dependencies/parmetis \
30 --with-hypre=1 --download-hypre=yes --download-hypre-shared=0 \
31 --with-blaslapack=1 --with-blaslapack-dir=XXX/intel/19.0/mkl \
32 --with-zlib=1 \
33 --with-ml=1 --download-ml=yes --download-ml-shared=0 \
34 --with-ilu=1 \
35 --with-cmake=1 \
36 --download-slepc=yes \
37 --with-superlu=1 --download-superlu=yes \
38 --with-mumps=1 --download-mumps=yes --download-mumps-shared=0 \
39 --with-scalapack=1 --download-scalapack=yes \
40 --with-superlu_dist=1 --download-superlu_dist=yes --download-superlu_dist-shared=0
```

After that, I can use mumps and superlu_dist direct solver via ngs-petsc. And this seems fixed my mumps error. I guess the code just link to the petsc mumps first, rather than the ngsolve-build mumps. Is it correct?

I recall that building with MKL_SDL=ON failed on my machine once, so I will stick with the current version for now.

I am looking into a scalable solver for the hybrid-mixed poisson operator. Here is what I got for the timing of system solver for a test problem with RT0-P0 (code is attached):

`mpirun -np $NSLOTS ngspy mpi_hybrid.py 0 60`

24 cores: ~4s

48 cores: ~3s

96 cores: ~5s

192 cores: >100s

I didn’t see the expected speed-up when using more than 24 cores. (my machine is 24 core/node)

Do you know what’s going on for my code?

Best,

Guosheng

Attachment: mpi_hybrid.py

Just a follow-up on the hybrid-mixed solver.

I ended up using PETSc’s ksp solver with a mumps direct solver as the preconditioner:

```
ksp = petsc.KSP(mat=mat_wrap, name="someksp", petsc_options={
"ksp_type": "preonly",
"pc_type": "cholesky",
"pc_factor_mat_solver_type": "mumps"})
```

As I didn’t find a good preconditioner for the high-order hybrid-mixed matrix.

One thing I noticed is that the PETSc ksp solver returns “DISTRUBUTED” vector, and I need to manually convert it to a “CUMULATIVE” vector to make the code work.

```
gfu.vec.data = ksp * rhs
gfu.vec.Cumulate()
```

Best,

Guosheng

No idea. Are you measuring simply the time the entire job takes? Bigger jobs can take longer to start up.

You could try writing trace-files on some ranks and looking at those to get an idea of where the problem comes from

`with Taskmanager(pajetrace=10*1024*1024): `

What are you doing with the vector? Parallel NGSolve objects should accept a CUMULATED or DISTRIBUTED vector as input and do the conversion automatically if needed. But if you are doing something directly to the values you need to cumulate it.

Best,

Lukas

[quote]What are you doing with the vector? Parallel NGSolve objects should accept a CUMULATED or DISTRIBUTED vector as input and do the conversion automatically if needed. But if you are doing something directly to the values you need to cumulate it.

[/quote]

I am simply using a static condensation approach for linear system solver. Without Cumulate, the output error is not correct. (see attached code)

For timing the solver, I am simply using python timer for the cost of linear system solve:

```
import time
...
t0 = time.time()
gfu.vec.data = ksp * rhs
t1 = time.time()
gfu.vec.Cumulate() # cumulate, this line is necessary
gfu.vec.data += a.harmonic_extension * gfu.vec
gfu.vec.data += a.inner_solve * rhs
if rank==0:
print("cost: ", t1-t0)
```

Iterative solver just product unexpected behavior for me.

Attached is a simply hybrid-Poisson solver with hypre preconditioner that fails to be scalable on my machine.

The cost of linear system solver as recorded as follows:

```
mpirun -np $NSLOTS ngspy mpi_hybrid.py
12 cpus: 5.7 s
24 cpus: 3.5 s
results for three separate runs :
48 cpus: [2.3, 2.4s, 3.2 s]
96 cpus: [11 s, 8s, 8.4s] [speed is degraded from 48 cpus]
192 cpus: [19 s, 31s, (larger than 100s)]
```

You can clearly see the downgraded performance from 48 cores to 96 cores although # of iterations is still the same. With 192 cpus, the cost is even worse.This result simply does not make any sense to me.

Also, with more cpus, it takes more time to set the hypre preconditioner.

(~9s for 48 cores, ~48s for 96 cores, and ~80 s for 192 cores)

Best,

Guosheng

Attachment: mpi_hybrid_2020-07-28.py

Ah, I see. The problem here is that harmonic_extension and harmonix_extension_trans are local operators. They directly access the locally stored vector values. You can wrap a ParallelMatrix around them to make the code a little more elegant

```
hex = ParallelMatrix(harmonic_extension, row_pardods=..., col_pardofs=..., op_type=C2C)
hext = ParallelMatrix(harmonic_extension_trans, row_pardods=..., col_pardofs=..., op_type=D2D)
```

Now “hex” is an operator that takes a cumulated vector as input and has a cumulated vector as an output. So, when you give it the distributed solution vector from ksp, it automatically cumulates it. You can also write down the entire operation as a single operator:

```
hex, hext, aiii = a.harmonic_extension, a.harmonic_extension_trans, a.inner_solve
Id = IdentityMatrix(a.mat.height)
if a.space.mesh.comm.size == 1:
full_precond = ((Id + hex) @ precond @ (Id + hext)) + aiii
else:
Ihex = ParallelMatrix(Id + hex, row_pardofs = a.mat.row_pardofs, col_pardofs = a.mat.row_pardofs, op = ngs.ParallelMatrix.C2C)
Ihext = ParallelMatrix(Id + hext, row_pardofs = self.a.mat.row_pardofs, col_pardofs = self.a.mat.row_pardofs, op = ngs.ParallelMatrix.D2D)
Isolve = ParallelMatrix(aiii, row_pardofs = self.a.mat.row_pardofs, col_pardofs = self.a.mat.row_pardofs, op = ngs.ParallelMatrix.D2C)
full_precond = ( Ihex @ precond @ Ihext ) + aiii
```

Then you can just write

`gfu.vec.data = full_precond * f.vec`

(Nothing wrong with your code, I just find this a little more neat)

I will have a look at the timings.

Best,

Lukas