So, I solve Poisson on a 3-D cube with, say, 353k dofs. It takes about 4 seconds on my laptop with 12 threads. I’m using the multigrid option in NGSolve.
I then tried this on a higher-end compute node with 80 threads. It took about 5 seconds, i.e. longer than on the laptop! I also refined the mesh to get about 2.8M dofs, and the compute time scaled up accordingly, which makes sense. But I would have thought that it should be somewhat faster than my laptop with more threads. Note: NGSolve does print a statement saying it is using 80 threads.
Is there some setting I am missing?
Many parts of NGSolve scale very well with shared-memory parallelization, some less so, and some are not implemented in parallel at all.
To localize the problem, you should try some trace profiling. I have added a new tutorial for that:
Some quick guesses for common problems:
- make sure that your LAPACK kernel runs sequentially (something like OMP_NUM_THREADS=1), so that you are not creating 80 * 80 threads
- point Gauss-Seidel iteration is not parallel, block Gauss-Seidel is
- check if you are using hyper-threading. Usually it does not help, but it also should not make things worse.
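For the first point, a minimal sketch of how the BLAS/LAPACK thread count could be pinned from within a Python script. The helper name `pin_blas_threads` is made up for illustration. Which variable actually takes effect depends on the BLAS/LAPACK backend your build links against, so setting all three is an assumption on my side, not something specific to NGSolve.

```python
import os

# Environment variables read by common threaded BLAS/LAPACK backends
# (OpenMP-based builds, Intel MKL, OpenBLAS). Which one matters depends
# on the backend in use; this list is an assumption, not NGSolve-specific.
BLAS_THREAD_VARS = ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS")


def pin_blas_threads(env=os.environ, n=1):
    """Set each BLAS/LAPACK thread variable to n; return the previous values."""
    previous = {}
    for name in BLAS_THREAD_VARS:
        previous[name] = env.get(name)  # None if the variable was unset
        env[name] = str(n)
    return previous


# These variables are typically read when the library is loaded, so call
# this before importing numpy, ngsolve, or anything else that loads BLAS.
pin_blas_threads()
```

Setting the variable in the shell before starting Python (as in `OMP_NUM_THREADS=1`) has the same intent and avoids any import-order issues.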
Thank you. I played around with it more, and there seemed to be an ideal number of threads. If I use too many, then it gets slower (because of overhead, I guess).
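In case it helps someone else, here is a rough sketch of how such a thread-count sweep could be scripted. The helpers `sweep_threads` and `fastest` are hypothetical names, and the two callables are placeholders you supply yourself. With NGSolve, I would expect `set_threads` to be `SetNumThreads` and `solve` to wrap the actual solve in a `with TaskManager():` block, but treat that as an assumption and check it against your version.

```python
import time


def sweep_threads(solve, set_threads, counts, repeats=3):
    """Time solve() for each thread count; return {count: best wall time in s}."""
    timings = {}
    for n in counts:
        set_threads(n)  # user-supplied: tells the library how many threads to use
        best = float("inf")
        for _ in range(repeats):
            start = time.perf_counter()
            solve()  # user-supplied: runs one full solve
            best = min(best, time.perf_counter() - start)
        timings[n] = best  # keep the fastest of the repeats to reduce noise
    return timings


def fastest(timings):
    """Return the thread count with the smallest measured time."""
    return min(timings, key=timings.get)
```

Taking the best of a few repeats rather than a single run helps on a shared compute node, where other jobs can distort individual timings.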