SIGSEGV fault for large problems

Dear NGSolve Developers,

I am having issues with the direct solver for problems with ndof ~ 2-10 million. My problem involves an inhomogeneous Dirichlet BC, and I use the technique given in Sec. 1.3 of the documentation, namely:

u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
f = LinearForm (fes)
gfu = GridFunction (fes)
gfu.Set (ubar, definedon=BND)
#
with TaskManager():
    a.Assemble()
    f.Assemble()
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec   
    gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="sparsecholesky") * res

The Slurm script I use to launch NGSolve:

#!/bin/bash
#SBATCH --job-name=ngs
#SBATCH -N 4
#SBATCH --ntasks  96
#SBATCH --ntasks-per-node=24
#SBATCH --ntasks-per-core=1
#SBATCH --mem=24gb
#Load ngsolve_mpi module
module load apps/ngsolve_mpi 
mpirun ngspy script.py

However, the Slurm job returns the following message for one of the nodes:

-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  node16
  System call: unlink(2) /tmp/openmpi-sessions-1608000011@node16_0/19664/1/2/vader_segment.node16.2
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------

The exact same piece of code of course runs fine when the system is smaller; it is only when I use a refined mesh for the same geometry that I run into these errors. I have tried refining both outside NGSolve (using gmsh and then reading the refined mesh into NGSolve) and inside (i.e. reading a coarse mesh and then refining it with NGSolve's Refine), but both end up in errors. Could you comment on the likely cause of the problem? Thank you in advance for your help.

Trying with umfpack, I get an out-of-memory error: UMFPACK V5.7.4 (Feb 1, 2016): ERROR: out of memory.
NGSolve also writes an error message for the line where I call umfpack:

Traceback (most recent call last):
File "script.py", line 107, in <module>
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="umfpack") * res
netgen.libngpy._meshing.NgException: UmfpackInverse: Symbolic factorization failed.

The solver is running out of memory; 24 GB is probably not enough for millions of unknowns with a direct solver. For larger problems you might want to switch to an iterative solver instead; have a look at the documentation about preconditioners:

https://ngsolve.org/docu/latest/i-tutorials/unit-2.1.1-preconditioners/preconditioner.html
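
For the Laplace example from your first post, a preconditioned solve could look roughly like this (a minimal, untested sketch; "bddc" is just one of the available preconditioner types, and the CGSolver keyword names may differ slightly between NGSolve versions):

u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
c = Preconditioner(a, "bddc")   # must be registered before Assemble(); "local" or "multigrid" are alternatives
f = LinearForm(fes)
gfu = GridFunction(fes)
gfu.Set(ubar, definedon=BND)

with TaskManager():
    a.Assemble()
    f.Assemble()
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec
    # preconditioned CG iteration instead of a.mat.Inverse(...)
    inv = CGSolver(a.mat, c.mat, precision=1e-8, maxsteps=500)
    gfu.vec.data += inv * res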

Best,
Matthias

Also, "sparsecholesky" and "umfpack" do not work with parallel matrices; they are for local matrices only!

For parallel matrices use "mumps" as a direct solver (if you have configured NGSolve with it).

(You can also use "masterinverse" for small problems; then the master process gathers the entire matrix and inverts it by itself.)
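
So the solve line from your first post would become something like this (a rough sketch, assuming your NGSolve build was configured with MUMPS):

# parallel direct factorization with MUMPS
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps") * res
# or, for small problems, gather and factor the whole matrix on the master rank:
# gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="masterinverse") * res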

I had the wrong information on how much RAM I can use (the reason for setting 24 GB in the earlier script). We have a 15-node cluster with 256 GB of RAM on each node. Using 6 nodes (each with 48 parallel threads) and allowing 150 GB of RAM per node (leaving some for the OS), I still could not get the code running with MUMPS. With some 900 GB of RAM across the 6 nodes requested, I can't understand why it still struggles to solve. I even reduced the problem size considerably: from the previous 9M dofs, I am now down to 5.2M dofs.

#SBATCH --job-name=ngs_phi
#SBATCH -N 6
#SBATCH --mem=150G
#SBATCH --nodelist="node[11,16-20]"

The Slurm output file has:

[node16:71446] *** An error occurred in MPI_Comm_rank
[node16:71446] *** reported by process [694681601,1]
[node16:71446] *** on communicator MPI_COMM_WORLD
[node16:71446] *** MPI_ERR_COMM: invalid communicator
[node16:71446] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node16:71446] ***    and potentially your MPI job)

whereas the Slurm error file has:

[node11:93553] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node11:93553] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

[quote=“matthiash” post=2406]For larger problems you might want to switch to an iterative solver instead; have a look at the documentation about preconditioners:
[/quote]
I am trying to solve high-frequency Helmholtz problems, and I have not tried the iterative solvers yet, as I don't know enough about the suitability of the preconditioners available in NGSolve.

This does not look like a memory issue.

Sorry for asking, but are you properly distributing the mesh in the beginning of your script?

If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.

[quote=“lkogler” post=2412]This does not look like a memory issue.

Sorry for asking, but are you properly distributing the mesh in the beginning of your script?

If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.[/quote]

Ok - how do you distribute the mesh properly when it’s generated by an external tool such as gmsh?

Here is my complete code (well, almost: everything except the script that writes the gmsh file):

#!/usr/bin/env python
# coding: utf-8
#
from ngsolve import *
from netgen.csg import *
import numpy as np
import sys
import math
import time
import csv
import os
from GMRes_v2 import GMResv2
#
import subprocess
import multiprocessing
#
from netgen.read_gmsh import ReadGmsh
# initialise mpi
comm = mpi_world
rank = comm.rank
nproc= comm.size
PI = np.pi

# ************************************
# problem data:
freq = float(sys.argv[1])
polOrder = int(sys.argv[2])
elmperlam = int(sys.argv[3])


# ************************************
# geometrical params:
x0 = -0.011817
y0 = -38.122429
z0 = -0.004375
# ************************************
cspeed = 343e3 # in mm/s
waveno = 2.0*PI*freq / cspeed
wavelength = 2.0*PI/waveno
helem = wavelength / elmperlam
dpml = 2.0*wavelength
radcomp = 27.5 + 4.0*wavelength # radius of sensor plus 4 wavelengths
Rext = radcomp
rpml = radcomp - dpml

# ************************************
meshfilename = '../../meshes/model.msh'
# import the Gmsh file to a Netgen mesh object
mesh = ReadGmsh(meshfilename)
mesh = Mesh(mesh)
print('mesh1 ne: ', mesh.ne)
mesh.Refine()
mesh.Refine()
print('mesh2 ne: ', mesh.ne)

if (rank==0):
    print(mesh.GetBoundaries())
    print ("num vol elements:", mesh.GetNE(VOL))
    print ("num bnd elements:", mesh.GetNE(BND))
    print('add pml..')
mesh.SetPML(pml.Radial(origin=(x0,y0,z0), rad=rpml, alpha=1j), definedon="air")
ubar = exp (1J*waveno*x)
fes = H1(mesh, complex=True, order=polOrder, dirichlet="sensor_srf")
if (rank==0):
    print('ndof = ', fes.ndof)
u = fes.TrialFunction()
v = fes.TestFunction()
print("rank "+str(rank)+" has "+str(fes.ndof)+" of "+str(fes.ndofglobal)+" dofs!")
mesh.GetMaterials()

start = time.time()
gfu = GridFunction (fes)
gfu.Set (ubar, definedon='sensor_srf')
a = BilinearForm (fes, symmetric=True)
a += SymbolicBFI (grad(u)*grad(v) )
a += SymbolicBFI (-waveno*waveno*u*v)
f = LinearForm (fes)
from datetime import datetime
with TaskManager():
    # create threads and assemble
    print('cpus: ', multiprocessing.cpu_count())
    a.Assemble()
    f.Assemble()
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec
    end = time.time()
    if (rank==0):
        print('tassm: ', end - start)
    start = time.time()
    print("solve started: ", datetime.now().strftime("%H:%M:%S") )
    gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps") * res
    end = time.time()
    print("solve ended: ", datetime.now().strftime("%H:%M:%S") )
    if (rank==0):
        print('tsolve: ', end - start)

[quote=“lkogler” post=2412]
If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.[/quote]

I am not sure that is the case, as I load the NGSolve parallel build specifically by calling:

module load apps/ngsolve_mpi

There is a serial NGSolve build available, but I don't think that is being invoked.

Something like this:

from ngsolve import Mesh, mpi_world
from netgen.read_gmsh import ReadGmsh
import netgen.meshing

if mpi_world.rank == 0:
    ngmesh = ReadGmsh(meshfilename)   # read the gmsh file on the master rank only
    if mpi_world.size > 1:
        ngmesh.Distribute(mpi_world)  # partition and send the pieces to the other ranks
else:
    ngmesh = netgen.meshing.Mesh.Receive(mpi_world)
mesh = Mesh(ngmesh)

I have not personally tested it with a mesh loaded from gmsh, but it should work.

I still encounter the SIGSEGV fault with MUMPS despite distributing the mesh. Just for the sake of completeness, I tried the mpi_poisson.py script in master/py_tutorials/mpi/ with MUMPS (both with and without the preconditioning):

u.vec.data = a.mat.Inverse(V.FreeDofs(), inverse="mumps") * f.vec  # use MUMPS parallel inverse

and it still fails with the same error. This probably tells me something is wrong in the parallel build with MUMPS. mpi_poisson.py works with sparsecholesky, however.

When you use "sparsecholesky" with a parallel matrix, NGSolve falls back to a different inverse type that works with parallel matrices (I believe "masterinverse") without telling you (not very pretty, I know), which is why mpi_poisson.py works.

Errors like this usually indicate a problem with the installation, or with how the job is started.

Are you using the MUMPS built with NGSolve or a separate MUMPS install? We have had issues with MUMPS 5.1 for larger problems; upgrading to 5.2 resolved those.

If you use a separate MUMPS install, you have to make sure that MUMPS and NGSolve have been built with the same MPI libraries.

If this fallback inverse is sequential, I am guessing I will only be limited by memory. I will give it a try, but it will probably be too slow.

Even with mpirun -n 8 ngspy mpi_poisson.py or mpirun -n 8 python3 mpi_poisson.py I have no success, and I get SIGSEGV faults.

[quote]Are you using the MUMPS built with NGSolve or a separate MUMPS install? We have had issues with MUMPS 5.1 for larger problems; upgrading to 5.2 resolved those.

If you use a separate MUMPS install, you have to make sure that MUMPS and NGSolve have been built with the same MPI libraries.[/quote]

There is no MUMPS installation on the cluster other than the one that gets downloaded by CMake, and its version is 5.2.0.

Thinking about compatibility: do you think the following is an issue?
For NGSolve I seem to have (in my ngsolve-src/build/CMakeCache.txt):

CMAKE_CXX_COMPILER:FILEPATH=/opt/gridware/depots/3dc76222/el7/pkg/compilers/gcc/8.3.0/bin/g++
CMAKE_C_COMPILER:FILEPATH=/opt/gridware/depots/3dc76222/el7/pkg/compilers/gcc/8.3.0/bin/gcc
CMAKE_Fortran_COMPILER:FILEPATH=/opt/gridware/depots/3dc76222/el7/pkg/compilers/gcc/8.3.0/bin/gfortran

whereas within the Makefile.inc for MUMPS (in build/ngsolve-src/build/dependencies/src/project_mumps/Makefile.inc) I have:

CC = /opt/gridware/depots/3dc76222/el7/pkg/mpi/openmpi/2.0.2/gcc-4.8.5/bin/mpicc
FC = /opt/gridware/depots/3dc76222/el7/pkg/mpi/openmpi/2.0.2/gcc-4.8.5/bin/mpif90
FL = /opt/gridware/depots/3dc76222/el7/pkg/mpi/openmpi/2.0.2/gcc-4.8.5/bin/mpif90

Thanks for your help, much appreciated!

What does "mpicc --version" produce?
If MPI was built with gcc 4.8.5 and NGSolve with gcc 8.3, that might lead to problems.
Ideally you want to use the same compilers.
Do you have access to an MPI version that has been compiled with a compiler version that can also compile NGSolve (e.g. gcc 7.3/8.3)?

I can confirm this issue.
I had the same problem and compiling openmpi with gcc 8.3.0 fixed it.

Maybe we should show a warning when the compiler versions mismatch.

Probably.
But maybe a brief section in the documentation would already be sufficient.
I will share my experience once I have gained more experience with MPI and also ngs-petsc, so that users can install working versions right from the start.

[quote]What does "mpicc --version" produce?
If MPI was built with gcc 4.8.5 and NGSolve with gcc 8.3, that might lead to problems.
Ideally you want to use the same compilers.[/quote]

I have:

mpicc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)

I am actually confused by the output of module load for the MPI build of NGSolve. When I load it, I see:

module load apps/ngsolve_mpi/6.2/gcc-8.3.0+python-3.8.0+mpi-2.0.2+atlas-3.10.3+scalapack_atlasshared-2.0.2 
libs/gcc/8.3.0
 |
 OK
apps/python/3.8.0/gcc-4.8.5
 | -- libs/gcc/system ... VARIANT (have alternative: libs/gcc/8.3.0)
 |
 OK
libs/atlas/3.10.3/gcc-8.3.0
 | -- libs/gcc/8.3.0 ... SKIPPED (already loaded)
 |
 OK
mpi/openmpi/2.0.2/gcc-4.8.5
 | -- libs/gcc/system ... VARIANT (have alternative: libs/gcc/8.3.0)
 |
 OK
libs/scalapack_atlasshared/2.0.2/gcc-8.3.0+openmpi-2.0.2+atlas-3.10.3
 | -- libs/gcc/8.3.0 ... SKIPPED (already loaded)
 | -- mpi/openmpi/2.0.2/gcc-4.8.5 ... SKIPPED (already loaded)
 | -- libs/atlas/3.10.3/gcc-8.3.0 ... SKIPPED (already loaded)
 |
 OK
apps/ffmpeg/4.1.3/gcc-4.8.5
 | -- libs/gcc/system ... VARIANT (have alternative: libs/gcc/8.3.0)
 |
 OK

So I thought I had an 8.3 version available (as an alternative), but I see only 4.8.5 builds:

module load mpi/openmpi/1.
mpi/openmpi/1.10.2/gcc-4.8.5  mpi/openmpi/1.4.5/gcc-4.8.5   mpi/openmpi/1.6.5/gcc-4.8.5   mpi/openmpi/1.8.5/gcc-4.8.5

So most likely I need an OpenMPI built with gcc 8.3?

[quote=“JuliusZ” post=2438]I can confirm this issue.
I had the same problem and compiling openmpi with gcc 8.3.0 fixed it.[/quote]

OK, so the version of the C or Fortran compiler used for MUMPS has no effect, and it's just OpenMPI that has to be compiled with gcc 8.3.0?

Well, gcc contains a C compiler (gcc) and a Fortran compiler (gfortran). Ideally, everything is compiled with the same version. You can also set this via the environment variables CC, CXX, and FC for CMake.