I need to evaluate a coefficient function that is itself an integral over the whole domain, and
this integral varies at every evaluation point (in the variational formulation this becomes a double integral;
I posted a related question here: Integral over product space)

With sufficiently fine meshes this is a very expensive task, which I try to mitigate by using
MPI.
At the moment, I load the global mesh on every rank to compute this integral.
The PDE I am trying to solve is time-dependent, and I need to use part of the solution from
the last time step in the integrand. Is there any way I can gather a grid function on every
rank, so that the global values are available there?
Or is it possible to compute an integral globally over grid functions? (At the moment,
this step is done in C++ in a custom coefficient function implementation.)

That is quite the difficult problem, I must say! Unfortunately, I am afraid I have no easy solution for you.
I do have some suggestions, though.

The first thing you could try is hybrid parallelization. Only run one MPI rank per node (or socket), but use multiple threads on every rank. By default, NGSolve turns off shared-memory parallelization when run with MPI, but you can turn it back on from Python with

SetNumThreads(N)

That way, you would still have to load the global mesh on every rank to compute the integral, but you would have fewer ranks in total.
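To make the hybrid setup concrete, here is a minimal sketch of how one might pick the thread count per rank; the helper function, the core counts, and the commented NGSolve calls are illustrative assumptions, not anything prescribed by NGSolve.

```python
# Hedged sketch: choose the thread count per rank for a hybrid MPI+threads run
# (one MPI rank per node or socket, threads fill the remaining cores).
# The function name and parameters are illustrative.

def threads_per_rank(cores_per_node, ranks_per_node):
    """Cores available to each rank's thread pool (at least 1)."""
    return max(1, cores_per_node // ranks_per_node)

# In an NGSolve script, you would then call (after importing ngsolve):
#   SetNumThreads(threads_per_rank(cores_per_node=32, ranks_per_node=2))
#   with TaskManager():
#       ...  # assembly and integral computation use the thread pool

n = threads_per_rank(32, 2)   # 16 threads per rank in this example
```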

To speed up the computation of the integral itself, could you compute multiple integrals at the same time? Possibly, your CoefficientFunction could compute all integrals when it is called the first time in each timestep. Then you would only need to loop through elements once and compute integrals for all points you need.
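The batching idea above can be sketched in plain Python; the class and the callback are hypothetical stand-ins for the real NGSolve CoefficientFunction and element loop, but they show the intended control flow: one sweep per timestep, cached lookups afterwards.

```python
# Minimal sketch of per-timestep caching for an expensive integral-valued
# coefficient. Names are illustrative, not NGSolve API.

class IntegralCache:
    """Computes all integral values once per timestep, then serves them
    from a cache, so the element loop runs only once per step."""
    def __init__(self, compute_all):
        # compute_all(t) -> dict mapping evaluation point -> integral value
        self.compute_all = compute_all
        self.timestep = None
        self.values = {}

    def __call__(self, t, point):
        if t != self.timestep:                 # first call in this timestep:
            self.values = self.compute_all(t)  # one sweep for all points
            self.timestep = t
        return self.values[point]

# Usage with a dummy "integral": count how often the full sweep runs.
calls = []
def compute_all(t):
    calls.append(t)
    return {p: t * p for p in range(4)}   # stand-in for the real integrals

cache = IntegralCache(compute_all)
v = [cache(1.0, p) for p in range(4)]     # sweep happens here, once
w = [cache(1.0, p) for p in range(4)]     # served entirely from the cache
```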

Finally, if your mesh gets too fine for the hybrid parallelization approach, you could use MPI sub-communicators. Then, every rank that participates in the “main” computation could have a series of workers that partition one instance of the global mesh among them and are responsible for computing the integral only.
I think this would require a good amount of additional C++ coding. You would have to send the coefficients of the gridfunction to the worker nodes, and the integral values from workers to master ranks. This might be quite nasty to do efficiently.
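A possible layout for such sub-communicators, sketched under the assumption of equally sized worker teams; the helper function is hypothetical, and the actual mpi4py split is shown only as a comment since it needs a running MPI job.

```python
# Sketch of a sub-communicator layout: every "main" rank leads a team of
# worker ranks that share one copy of the global mesh and compute only the
# integral. The layout function is illustrative.

def team_layout(world_rank, team_size):
    """Map a global MPI rank to (color, key): color selects the team
    sub-communicator, key orders ranks within it (key 0 = team leader)."""
    color = world_rank // team_size
    key = world_rank % team_size
    return color, key

# With mpi4py, the split itself would look like this (inside an MPI job):
#   from mpi4py import MPI
#   color, key = team_layout(MPI.COMM_WORLD.rank, team_size=4)
#   team = MPI.COMM_WORLD.Split(color, key)
#   # team.rank == 0 is the leader that talks to the "main" computation;
#   # the leader broadcasts gridfunction coefficients to its workers and
#   # reduces partial integrals back.

# 8 world ranks, teams of 4 -> two teams with leaders at world ranks 0 and 4
layout = [team_layout(r, 4) for r in range(8)]
```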

I managed to figure out that I could reuse the code in gridfunction.cpp for loading and saving
grid functions in parallel, so I gather the coefficients at rank zero and redistribute from there.
I wouldn't call it efficient, but it works and makes the computations at least feasible (a time
step now takes only 20 min).

By accident, I was already following your first suggestion: I specify the thread numbers explicitly every time.
I didn't know shared-memory parallelization would be turned off by default with MPI. Thanks for this clarification.

I also implemented your second suggestion: I use a cache for the coefficient function and compute its values at the beginning of each time step for later reuse. Is there an especially well-suited data structure in NGSolve for this?

The third approach is quite interesting. To make sure I understood correctly, let me rephrase it:
you suggest that, in addition to (or instead of) thread workers, I spawn MPI workers for every rank.
The advantage over thread workers would be that their number is not limited by the cores on a socket.
I am definitely not an MPI expert, which might explain the basic question.

Do I understand correctly that there is no Python mechanism for gathering a grid function? I mean, I understand. It is not something one would usually want to do…