I am having issues with the direct solver on problems with ndof ~ 2-10 million. My problem involves an inhomogeneous Dirichlet BC and I use the technique given in Sec. 1.3 of the documentation, namely:
u, v = fes.TnT()
a = BilinearForm(fes, symmetric=True)
a += grad(u)*grad(v)*dx
f = LinearForm (fes)
gfu = GridFunction (fes)
gfu.Set (ubar, definedon=BND)
#
with TaskManager():
    a.Assemble()
    f.Assemble()
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec
    gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="sparsecholesky") * res
However, the Slurm script returns the following message for one of the nodes:
-------------------------------------------------------
Primary job terminated normally, but 1 process returned
a non-zero exit code.. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have. It is likely that your MPI job will now either abort or
experience performance degradation.
Local host: node16
System call: unlink(2) /tmp/openmpi-sessions-1608000011@node16_0/19664/1/2/vader_segment.node16.2
Error: No such file or directory (errno 2)
--------------------------------------------------------------------------
The exact same piece of code runs fine, of course, when the system is smaller. It's only when I use a refined mesh for the same geometry that I run into these errors. I have tried refining both outside (using gmsh and then reading the refined mesh into ngsolve) and inside (i.e. reading a coarse mesh and then refining it with ngsolve's Refine), but both end up in errors. Could you comment on the likely cause of the problem? Thank you in advance for your help.
Trying with umfpack, I get an out-of-memory error: UMFPACK V5.7.4 (Feb 1, 2016): ERROR: out of memory.
ngsolve also writes an error message for the line where I call umfpack:
Traceback (most recent call last):
File "script.py", line 107, in
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="umfpack") * res
netgen.libngpy._meshing.NgException: UmfpackInverse: Symbolic factorization failed.
The solver is running out of memory; 24GB is probably not enough for millions of unknowns with a direct solver. For larger problems you might want to switch to an iterative solver instead; have a look at the documentation about preconditioners:
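For the symmetric Poisson-type problem from the snippet at the top (reusing a, f, gfu, fes from there), a minimal sketch of such an iterative solve might look like this; the "bddc" preconditioner type and the CGSolver parameters are illustrative choices, not recommendations from this thread:
pre = Preconditioner(a, "bddc")   # must be created before a.Assemble() so it is built during assembly
with TaskManager():
    a.Assemble()
    f.Assemble()
res = gfu.vec.CreateVector()
res.data = f.vec - a.mat * gfu.vec
# preconditioned CG iteration instead of the sparse factorization
inv = CGSolver(a.mat, pre.mat, maxsteps=1000, printrates=True)
gfu.vec.data += inv * res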
I had the wrong information on how much RAM I can use (the reason for setting 24G in the earlier script). We have a 15-node cluster with 256GB of RAM on each node. Requesting 6 nodes (each with 48 parallel threads) and 150GB of RAM per node (leaving some for the OS), I still could not get the code running with MUMPS. With some 900GB of RAM among the 6 nodes requested, I can't understand why it still struggles to solve. I even reduced the problem size considerably: from the previous 9M dofs, I now have 5.2M dofs.
[node16:71446] *** An error occurred in MPI_Comm_rank
[node16:71446] *** reported by process [694681601,1]
[node16:71446] *** on communicator MPI_COMM_WORLD
[node16:71446] *** MPI_ERR_COMM: invalid communicator
[node16:71446] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[node16:71446] *** and potentially your MPI job)
whereas the error file from slurm has:
[node11:93553] 3 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[node11:93553] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[quote=“matthiash” post=2406]For larger problems you might want to switch to an iterative solver instead, have a look at the documentation about preconditioners:
[/quote]
I am trying to solve high-frequency Helmholtz problems and I have not tried the iterative solvers yet as I don’t know enough about the suitability of the preconditioners available in ngsolve.
Sorry for asking, but are you properly distributing the mesh in the beginning of your script?
If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.
[quote=“lkogler” post=2412]This does not look like a memory issue.
Sorry for asking, but are you properly distributing the mesh in the beginning of your script?
If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.[/quote]
Ok - how do you distribute the mesh properly when it’s generated by an external tool such as gmsh?
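(For reference, the mesh-distribution pattern from the mpi_poisson.py tutorial, adapted here to a gmsh import, would look roughly like this; reading the file on rank 0 only and the Distribute/Receive calls reflect my understanding of that tutorial and are not confirmed later in this thread:)
from ngsolve import *
from netgen.read_gmsh import ReadGmsh
import netgen.meshing
comm = mpi_world
if comm.rank == 0:
    # only the master rank reads the gmsh file, then distributes it
    ngmesh = ReadGmsh('../../meshes/model.msh')
    ngmesh.Distribute(comm)
else:
    # all other ranks receive their piece of the distributed mesh
    ngmesh = netgen.meshing.Mesh.Receive(comm)
mesh = Mesh(ngmesh)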
Here is my complete code (well, almost complete: everything except the script that writes the gmsh file):
#!/usr/bin/env python
# coding: utf-8
#
from ngsolve import *
from netgen.csg import *
import numpy as np
import sys
import math
import time
import csv
import os
from GMRes_v2 import GMResv2
#
import subprocess
import multiprocessing
#
from netgen.read_gmsh import ReadGmsh
# initialise mpi
comm = mpi_world
rank = comm.rank
nproc= comm.size
PI = np.pi;
# ************************************
# problem data:
freq = float(sys.argv[1])
polOrder = int(sys.argv[2])
elmperlam = int(sys.argv[3])
# ************************************
# geometrical params:
x0 = -0.011817;
y0 = -38.122429;
z0 = -0.004375;
# ************************************
cspeed = 343e3 # in mm/s
waveno = 2.0*PI*freq / cspeed
wavelength = 2.0*PI/waveno
helem = wavelength / elmperlam
dpml = 2.0*wavelength
radcomp = 27.5 + 4.0*wavelength # radius of sensor plus 4 wavelengths
Rext = radcomp
rpml = radcomp - dpml
# ************************************
meshfilename = '../../meshes/model.msh'
# import the Gmsh file to a Netgen mesh object
mesh = ReadGmsh(meshfilename)
mesh = Mesh(mesh)
print('mesh1 ne: ', mesh.ne)
mesh.Refine()
mesh.Refine()
print('mesh2 ne: ', mesh.ne)
if (rank==0):
    print(mesh.GetBoundaries())
    print("num vol elements:", mesh.GetNE(VOL))
    print("num bnd elements:", mesh.GetNE(BND))
print('add pml..')
mesh.SetPML(pml.Radial(origin=(x0,y0,z0), rad=rpml, alpha=1j), definedon="air")
ubar = exp (1J*waveno*x)
fes = H1(mesh, complex=True, order=polOrder, dirichlet="sensor_srf")
if (rank==0):
    print('ndof = ', fes.ndof)
u = fes.TrialFunction()
v = fes.TestFunction()
print("rank "+str(rank)+" has "+str(fes.ndof)+" of "+str(fes.ndofglobal)+" dofs!")
mesh.GetMaterials()
start = time.time()
gfu = GridFunction (fes)
gfu.Set (ubar, definedon='sensor_srf')
a = BilinearForm (fes, symmetric=True)
a += SymbolicBFI (grad(u)*grad(v) )
a += SymbolicBFI (-waveno*waveno*u*v)
f = LinearForm (fes)
from datetime import datetime
with TaskManager():
    # create threads and assemble
    print('cpus: ', multiprocessing.cpu_count())
    a.Assemble()
    f.Assemble()
    res = gfu.vec.CreateVector()
    res.data = f.vec - a.mat * gfu.vec
end = time.time()
if (rank==0):
    print('tassm: ', end - start)
start = time.time()
print("solve started: ", datetime.now().strftime("%H:%M:%S") )
gfu.vec.data += a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="mumps") * res
end = time.time()
print("solve ended: ", datetime.now().strftime("%H:%M:%S") )
if (rank==0):
    print('tsolve: ', end - start)
[quote=“lkogler” post=2412]
If it is not that, I would say there is a problem with MPI. I have come across similar error messages when there are multiple MPI installations present and I was using the wrong one.[/quote]
I am not sure if that is the case as I load the ngsolve parallel build specifically by calling:
module load apps/ngsolve_mpi
There is a serial ngsolve build available, but I don't think that's being invoked.
I still encounter the SIGSEGV fault with MUMPS despite distributing the mesh. Just for the sake of completeness, I tried the mpi_poisson.py script in master/py_tutorials/mpi/ with MUMPS (both with and without the preconditioning),
and it still fails with the same error. This probably tells me something is wrong in the parallel build with MUMPS. mpi_poisson.py works with sparsecholesky, however.
When you use "sparsecholesky" with a parallel matrix, NGSolve reverts to a different inverse type that works with parallel matrices (I believe "masterinverse") without telling you (not very pretty, I know), which is why mpi_poisson.py works.
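Explicitly requesting that parallel-capable inverse would look something like this (a sketch; as far as I understand, "masterinverse" gathers the matrix on the master rank and factorizes it there sequentially):
inv = a.mat.Inverse(freedofs=fes.FreeDofs(), inverse="masterinverse")
gfu.vec.data += inv * res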
Errors like this usually indicate a problem with the installation, or with how the job is started.
Are you using the MUMPS built with NGSolve or a separate MUMPS install? We have had issues with MUMPS 5.1 for larger problems; upgrading to 5.2 resolved those.
If you use a separate MUMPS install, you have to make sure that MUMPS and NGSolve have been built with the same MPI libraries.
If this is sequential, I am guessing I will only be limited by memory. I will give this a try, but it will probably be too slow.
Even with mpirun -n 8 ngspy mpi_poisson.py or mpirun -n 8 python3 mpi_poisson.py I have no success; I get SIGSEGV faults.
[quote]Are you using the MUMPS built with NGSolve or a separate MUMPS install? We have had issues with MUMPS 5.1 for larger problems. Upgrading to 5.2 resolved those.
If you use a separate MUMPS install you have to make sure that MUMPS and NGSolve have been built with the same MPI libraries.[/quote]
There’s no other MUMPS installation on the cluster other than the one that gets downloaded with CMake and the version is 5.2.0.
Thinking about the compatibilities: do you think the following is an issue?
For ngsolve I seem to have (in my ngsolve-src/build/CMakeCache.txt)
whereas, within the Makefile.inc for MUMPS (in build/ngsolve-src/build/dependencies/src/project_mumps/Makefile.inc), I have:
CC = /opt/gridware/depots/3dc76222/el7/pkg/mpi/openmpi/2.0.2/gcc-4.8.5/bin/mpicc
FC = /opt/gridware/depots/3dc76222/el7/pkg/mpi/openmpi/2.0.2/gcc-4.8.5/bin/mpif90
FL = /opt/gridware/depots/3dc76222/el7/pkg/mpi/openmpi/2.0.2/gcc-4.8.5/bin/mpif90
What does “mpicc --version” produce?
If MPI was built with gcc 4.8.5 and NGSolve with gcc 8.3 that might lead to problems.
Ideally you want to use the same compilers.
Do you have access to an MPI version that has been compiled with a compiler version that can also compile NGSolve (e.g. gcc 7.3/8.3)?
Probably.
But maybe a brief section in the documentation would already be sufficient.
I will share my experience once I have more experience with MPI and also ngs-petsc, so that other users can install working versions from the first moment on.
[quote]What does “mpicc --version” produce?
If MPI was built with gcc 4.8.5 and NGSolve with gcc 8.3 that might lead to problems.
Ideally you want to use the same compilers.[/quote]
I have:
mpicc --version
gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-16)
I am actually confused by the output of module load for the MPI build of ngsolve: when I load the MPI build of ngsolve via module load, I see:
Well, gcc comes with both a C compiler (gcc) and a Fortran compiler (gfortran). Ideally, everything is compiled with the same version. You can also set this via the environment variables CC, CXX, and FC for cmake.