A more recent discussion of this on python-dev: https://mail.python.org/pipermail/python-dev/2019-January/156095.html
The situation there appears to be a case of "Hand off an OS level thread from the creating interpreter to a different subinterpreter. As far as I can tell, calling GILState_Ensure in such a thread will still acquire the GIL of the creating interpreter (or something equally nonsensical)."
It's a single-threaded application using subinterpreters, but every callback from the NumPy code ends up hitting the original interpreter, i.e. the one that initialised the thread-local state in the main thread.
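
To make the failure mode concrete, here is a toy model in plain Python. It is only a sketch of the behaviour described above, not CPython's actual implementation: `Interpreter`, `gilstate_ensure` and `_tls` are hypothetical names invented for illustration, and the assumption is simply that the lookup consults per-thread state and never asks which interpreter is currently running.

```python
import threading

# Toy model only: hypothetical names, not CPython internals.
_tls = threading.local()


class Interpreter:
    def __init__(self, name):
        self.name = name


def gilstate_ensure(current):
    """Return the interpreter recorded for this OS thread.

    `current` is the interpreter actually running the caller. It is
    only used the first time a thread calls in; afterwards the cached
    per-thread value wins, which is the behaviour being modelled.
    """
    if not hasattr(_tls, "interp"):
        _tls.interp = current  # first caller on this thread is remembered
    return _tls.interp


main = Interpreter("main")
sub = Interpreter("sub")

# The main interpreter initialises the thread-local state first.
assert gilstate_ensure(main) is main

# A callback made while the subinterpreter is running on the same thread
# still resolves to the main interpreter.
seen = gilstate_ensure(sub)
print(seen.name)  # prints "main"
```

In this model the only way for the callback to land in `sub` would be for the lookup to take the running interpreter into account, which is exactly what a per-thread-only lookup cannot do.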