EPICS base 7.0.3.1, deadlock #163

Open
SkyToGround opened this issue May 4, 2020 · 6 comments

@SkyToGround

Calling the destructor on more than one instance of a pvac::ClientProvider that was constructed with the provider name "ca" (Channel Access) causes what appears to be a deadlock. Using "pva" (pvAccess) does not appear to trigger it. I have attached a screenshot showing where the deadlock appears to be located (taken by pausing the application while running it under a debugger). The code that triggers the deadlock can be found here: https://github.com/ess-dmsc/forward-epics-to-kafka/blob/a75fab2a7343906c147722825a258332fc2126e7/src/EpicsClient/EpicsClientMonitorImpl.h

This code uses EPICS base 7.0.3.1. I believe I had the same issue when testing with earlier versions of EPICS 7 a few months back as well.
deadlock_screenshot

I have also attached screenshots of stack traces of other threads executing EPICS code.
thread_8
thread_3
thread_2

@mdavidsaver
Member

The main thread is waiting for the first of four caProvider worker threads to exit. Can you say whether this thread (in PutDoneThread::run()) is still running? If so, please take a stack trace. If not, can you say whether there was a prior error message about, e.g., an unhandled exception?

putDoneThread->stop();
getDoneThread->stop();
monitorEventThread->stop();
channelConnectThread->stop();

Also, your emails mentioned that you had made a workaround. The details of what you had to change might also give some hints.

@mdavidsaver
Member

Also, since the trace for the main thread has been cropped: is the main() function still in progress, or has it returned?

@SkyToGround
Author

  • Yes, it is the main thread/function that has deadlocked (well) before reaching return 0.
  • The threads not shown in the screenshots were created by librdkafka, so no, PutDoneThread::run() seems to not be running.
  • I see no exception originating from EPICS but that does not mean there is not one. The original developer(s) of this application were kind of bad and had a tendency to use catch(...) to make "problems" "disappear".
  • The workaround was to make (the CA) channel provider static and hence not deallocate it more than once. See this link.

@mdavidsaver
Member

Thanks for the details. This seems likely to be a straightforward issue with CAProvider. Stopping the singleton worker hangs because the worker has already been stopped.

@mrkraimer @anjohnson Over to you guys.
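One way to picture the failure described above is a worker whose stop() can safely be called more than once. The following is a minimal sketch using only the C++ standard library, not the EPICS code: `Worker`, `joins()` and `demoStopTwice()` are hypothetical names, and the sketch assumes stop() is only ever called from one thread at a time.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for a worker such as PutDoneThread; not the EPICS class.
class Worker {
public:
    Worker() : thread_(&Worker::run, this) {}
    ~Worker() { stop(); }

    // Safe to call repeatedly: only the first call signals and joins the thread.
    // Assumes callers do not invoke stop() concurrently from several threads.
    void stop() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            if (stopRequested_) return;  // already stopped, nothing to wait for
            stopRequested_ = true;
        }
        wakeup_.notify_all();
        thread_.join();
        ++joins_;
    }

    // Number of times the thread was actually joined (0 or 1).
    int joins() const { return joins_; }

private:
    // Thread body: sleeps until a stop has been requested, then exits.
    void run() {
        std::unique_lock<std::mutex> lock(mutex_);
        wakeup_.wait(lock, [this] { return stopRequested_; });
    }

    std::mutex mutex_;
    std::condition_variable wakeup_;
    bool stopRequested_ = false;
    int joins_ = 0;
    std::thread thread_;  // declared last so the other members exist before run() starts
};

// Usage: the second stop() returns immediately instead of waiting forever.
inline void demoStopTwice() {
    Worker worker;
    worker.stop();
    worker.stop();
    assert(worker.joins() == 1);
}
```

Making stop() idempotent only hides the symptom. In this sketch a second owner would still find its worker stopped, so the ownership question remains open.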

@anjohnson
Member

@mdavidsaver Is it sensible to instantiate multiple Channel Providers of the same type? I thought these were intended to be one-off objects, and it looks like the CA provider was coded with that assumption but without trying to enforce it. If an application creates multiple CA providers it will use multiple CA client contexts and multiple TCP sockets when talking to the same IOC, which could be a major resource drain on the IOCs it connects to. Is that true for the pva provider too?

Currently the {{get,put}Done,monitorEvent,channelConnect}Thread classes each have a singleton that owns the underlying thread, but each CAChannelProvider object has its own ca_client_context pointer (so destroying the first provider also stops all the threads, hence this bug). What do you think the best solution would be here: add another singleton inside the CAChannelProvider that owns a single CA client context, or allow users to shoot themselves in the foot by creating a separate context for each provider?
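A third possibility, beyond the two options raised above, is to let the providers share the workers and stop them only when the last provider goes away. The sketch below shows that idea with std::shared_ptr and std::weak_ptr. All names (`SharedWorkers`, `acquireWorkers`, `Provider`) are hypothetical and nothing here is taken from the CA provider source; the "workers" are reduced to a counter so that the lifetime is easy to observe.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <mutex>

// Hypothetical stand-in for the shared worker threads; it only counts instances.
class SharedWorkers {
public:
    SharedWorkers() { ++counter(); }
    ~SharedWorkers() { --counter(); }  // real code would stop and join threads here

    // Number of SharedWorkers objects currently alive (expected to be 0 or 1).
    static int liveCount() { return counter().load(); }

private:
    static std::atomic<int>& counter() {
        static std::atomic<int> n{0};
        return n;
    }
};

// Returns the shared instance, creating a fresh one if no provider holds it.
inline std::shared_ptr<SharedWorkers> acquireWorkers() {
    static std::mutex mutex;
    static std::weak_ptr<SharedWorkers> cached;
    std::lock_guard<std::mutex> lock(mutex);
    std::shared_ptr<SharedWorkers> workers = cached.lock();
    if (!workers) {
        workers = std::make_shared<SharedWorkers>();
        cached = workers;
    }
    return workers;
}

// Hypothetical provider: each instance keeps the workers alive while it exists.
class Provider {
public:
    Provider() : workers_(acquireWorkers()) {}

private:
    std::shared_ptr<SharedWorkers> workers_;
};

// Usage: destroying the first of two providers leaves the workers running.
inline void demoTwoProviders() {
    auto first = std::make_unique<Provider>();
    auto second = std::make_unique<Provider>();
    first.reset();
    assert(SharedWorkers::liveCount() == 1);  // second still owns the workers
    second.reset();
    assert(SharedWorkers::liveCount() == 0);  // last owner gone, workers released
}
```

Because the cache holds only a weak reference, a provider created after all others have been destroyed gets a fresh set of workers rather than a stopped one.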

@mdavidsaver
Member

> Is it sensible to instantiate multiple Channel Providers of the same type?

In general, yes. I can't say whether it is called for in this particular application. For example, gateways create a different PVA client for each network interface.

> What do you think the best solution ...

We're several steps down a road of fixes that introduce further bugs. IMO, it is well past time for a redesign. Why are there four workers doing (almost) the same task? Why 4x the code? Do these workers need to be singletons?

Also, this issue highlights a gap in testing. Clearly the tests only ever create one instance. At minimum, testCaProvider could run through twice.
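A test along those lines might create and destroy providers twice under a watchdog, so that a hang shows up as a test failure rather than a stuck test run. This is a rough sketch using only the C++ standard library: `finishesWithin`, `FakeProvider` and `createAndDestroyTwice` are hypothetical names, and `FakeProvider` merely marks where a real provider would be constructed. It is not how testCaProvider is written.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <future>
#include <memory>
#include <string>
#include <thread>
#include <utility>

// Runs `body` on a helper thread and reports whether it finished in time.
// The helper is detached so that a hung body cannot block the caller.
inline bool finishesWithin(std::function<void()> body,
                           std::chrono::milliseconds limit) {
    auto done = std::make_shared<std::promise<void>>();
    std::future<void> finished = done->get_future();
    std::thread([body, done]() {
        body();
        done->set_value();
    }).detach();
    return finished.wait_for(limit) == std::future_status::ready;
}

// Hypothetical placeholder for the provider under test.
struct FakeProvider {
    explicit FakeProvider(std::string providerName)
        : name(std::move(providerName)) {}
    std::string name;
};

// Two passes, each with two live instances destroyed at the end of the pass.
inline void createAndDestroyTwice() {
    for (int pass = 0; pass < 2; ++pass) {
        FakeProvider first("ca");
        FakeProvider second("ca");
    }
}

// Usage: fails the assertion if the lifecycle does not complete in two seconds.
inline void demoLifecycleTest() {
    assert(finishesWithin(createAndDestroyTwice, std::chrono::milliseconds(2000)));
}
```

The detached helper is a deliberate choice: a future returned by std::async blocks in its destructor until the task completes, which would turn a hung body back into a hung test.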
