-
Notifications
You must be signed in to change notification settings - Fork 64
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
676 subscribed messages which go out of scope get garbage collected and cause undefined behaviour #884
base: develop
Are you sure you want to change the base?
Conversation
738e21a
to
da19670
Compare
Inviting @sassy-asjp to review this PR as well. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the work! I like the approach of having implementation-agnostic callbacks to handle reference counting with foreign languages (for example, Python). However, I don't think we are fully implementing/testing all required functionality (see comments).
* @param incRef Function to increment reference count | ||
* @param decRef Function to decrement reference count | ||
*/ | ||
void setSourceRef(void* src, void (*incRef)(void*), void (*decRef)(void*)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I can tell, the reference increase and decrease functions are only called when setSourceRef
is called, which means that the reference count should be decreased correctly when a reader is subscribed to a new standalone message. However, what happens when the reader goes out of scope? That should also trigger a reference count decrease, otherwise the standalone messages will leak when all subscribed readers go out of scope (simulation end, for example). While this might be easy to do in C++ (add setSourceRef(nullptr, nullptr, nullptr)
to the reader destructor), I don't see an easy way to do this for C modules ("readers" in C modules) since those don't have destructors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Howdy @juan-g-bonilla , thanks for the feedback. Right now we are fully relying on the BSK develop to ensure stand-alone messages remain in scope and are not deleted. If we have a solution that partially helps keep messages in scope, i.e. a solution that doesn't work with C stand-along messages which are more rare, then I am ok with this as long as this behavior is fully documented.
In our tutorials on making stand-alone messages we can add a section about the challenges to consider with such messages, discuss how this is handled for C++ messages connected to C++ modules, but also explain how to solve this with other situations where the coder has to ensure the messages remain in scope as they currently do. Right now we don't have a good discussion on this but could add this to this branch.
Would you agree with such a partial approach? We are making the coder's life easier in some cases, but in others we are more open about what needs to be done to retain the msg in memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, I think the problem would be with C readers connected to standalone Python messages, since C doesn't even have a concept of reader/writer (my knowledge of C messages is iffy, so take this with a grain of salt). Since C objects don't have a destructor, there is no way to notify the standalone Python message that a C in-message has gone out of scope and thus it should decrease its reference count, which would lead to leaks. Also, right now, C++ readers of
I agree that a solution that makes life easier today is better than a perfect one in the future. However, today, C++ and C models look very similar in the Python layer, so I can also imagine that introducing this inconsistent behavior between them can produce more headaches than cure them. Moreover, the current behavior is that you get a segfault / UB for standalone messages, while with the changes the messages would leak. The former is catastrophic (usually), while the latter is more forgiving but also insidious (and we have been fighting hard against leaking objects recently).
If we are relying on documentation, which of the following are conducive to fewer headaches?
- Stand-alone Python messages should be kept in scope during the entire simulation (otherwise you may get segfaults or UB).
- Stand-alone Python messages should be kept in scope during the entire simulation when connected to at least one model implemented in C (otherwise the standalone message might leak).
It's your call, I would be ok with either.
However, there might be a way to address this issue also for C messages. @Natsoulas take a look at the SWIG wrappers of C models (in swig_c_wrap.i
). These produce a C++ object, which could have a destructor. There might be a way to solve this issue where the refDecCallback
is called not on Reader
destruction in C (since there are not readers or destructors in C) but on destruction of the SWIG wrapper of the C model, which is a C++ object. Essentially, the swig_c_wrap.i
could be modified to have a std::vector<std::function<void()>> onDestructionCallbacks
, which get called on destruction of the wrapper. Then, it would be a matter of registering the inMsg refDecCallback
in this vector. All of this easier said than done, but there might be a path here (which could be spun into its own PR).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with your concern, that is what prompted my earlier question. With a partial documented solution, are we causing more headaches in debugging than if we better document that the coder must keep the stand-alone messages in memory.
@Natsoulas, let me know what you think/find of Juan's suggestion at the bottom of this post.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding Juan's first comment about reference counting and destructors:
I appreciate the detailed feedback about potential memory leaks when readers go out of scope. The suggestion about using SWIG wrappers is interesting and I'm hoping it helps resolve things. I can explore modifying swig_c_wrap.i to add destruction callbacks, which would help handle C module cleanup properly. This would definitely be a more complete solution than the current approach.
Regarding the documentation vs partial solution discussion:
@schaubh I agree with Juan that introducing inconsistent behavior between C++ and C models would be generally problematic. Rather than proceeding with the current "partial" solution, I propose that I:
- First implement the suggested SWIG wrapper approach to handle C module cleanup like in Juan's comment
- Add more comprehensive unit tests like the ones Juan suggested
- Then document the unified behavior
@juan-g-bonilla let me know if this rough plan of action is missing anything else important. Lastly, thanks I do think that using std::function instead of function pointers would be an improvement. Trying these ideas out, I'll force push again once I have tests passing/expected functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@juan-g-bonilla I've updated this PR to incorporate your suggestions and provide further unit test coverage, I'd appreciate it if you could check this out again. Thanks!
if testFailCount == 0: | ||
print("PASSED") | ||
else: | ||
[print(msg) for msg in testMessages] | ||
assert(testFailCount == 0) | ||
|
||
|
||
def test_standalone_message_scope(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see additional tests:
- A standalone message is connected to multiple readers: the standalone message is kept alive until all readers are gone.
- A standalone message is connected to a C++ reader, both the python message and C++ reader go out of scope, the standalone message is deleted (no leaks).
- Same as above but for C reader.
- Same as above but a mix of C++ and C readers.
Feel free to combine multiple of the above tests into a single test if that's easier. I'd suggest calling del
and/or the Python garbage collector manually to ensure the objects are actually calling their delete function when going out of scope (as you know the Python GC can delay that arbitrarily).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To add to this, I think it is also important to test Python modules and Python standalone readers in those combinations as well.
It would also be nice if the test checked that after deleting the name and manually forcing Python garbage collector, that the standalone message was actually deleted, though that might be more trouble than its worth.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sassy-asjp and @juan-g-bonilla let me know if you agree with my further unit test additions/whether or not it covers what you suggested how you like
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! I had some comments about test implementation, but I think they cover what I requested
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sassy-asjp No problem, I just incorporated your comments into the unit tests' implementation. Let me know if my changes are satisfactory or if anything remains unclear/needs work. I appreciate your help with this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Natsoulas Thanks! It looks good to me
427fb81
to
16408c3
Compare
5166b04
to
159bb4c
Compare
159bb4c
to
e8b2386
Compare
if testFailCount == 0: | ||
print("PASSED") | ||
else: | ||
[print(msg) for msg in testMessages] | ||
assert(testFailCount == 0) | ||
|
||
|
||
def test_standalone_message_scope(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Natsoulas Thanks! It looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First of all, sorry for the brief and maybe incomplete feedback. It's a complicated week, but I didn't want to leave this PR unreviewed for too long. I didn't do an exhaustive review, so please feel free to push back if you think I missed something.
Thank you for considering our suggestions, especially the extensive test suite! Seems like a lot of work and thought has been into this PR.
However, I don't think this addresses our two main concerns since the last review:
- Ensure there are no leaks.
- Ensure that C modules keep standalone Python messages alive.
See the comments in test_messaging
for suggested test alterations that prove this. I think we need to go back to our previous discussion about reducing the reference count on C++ reader destruction and on C module destruction (through the destructionCallbacks
).
I'm happy to provide more feedback or help in the coming weeks, but it might take a minute.
@@ -63,6 +65,22 @@ class CWrapper : public SysModel { | |||
std::unique_ptr<TConfig> config; //!< class variable | |||
}; | |||
|
|||
class CModuleWrapper { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this class is doing anything? I'm pretty sure if we delete it, the code runs the same way.
I think I didn't properly explain what I meant with the destructionCallbacks
feature. My idea was to add it to CWrapper
. However, the bigger problem is that we would NEED to use addDestructionCallback
at some point to handle C messages, which the code right now is not doing.
# Delete message first, then module | ||
del msg_ref | ||
gc.collect() | ||
del mod | ||
gc.collect() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# Delete reader, which should also delete the message
del mod
gc.collect()
# Verify message was actually cleaned up
if msg_ref() is not None:
testFailCount += 1
testMessages.append("Message was not properly cleaned up")
This test (and maybe others, please check!) are not actually checking for memory leaks. You had the right intuition with using weak references, deleting, and using the garbage collector. However, what we actually want to ensure is that once the original Python message has gone out of scope (msg
in the create_and_subscribe
scope) and the subscribed C++ reader has also gone out of scope (mod
was deleted), then the Python message gets actually garbage collected (thus trying to get the object from the weakref should return None
). You successfully implemented this test for the test_standalone_message_c_reader_cleanup
!
Try the code above in place of this block and it should fail (because we are causing leaks!!)
return True, msg_ref | ||
|
||
# Test in local scope | ||
success, msg_ref = create_and_subscribe() | ||
if not success: | ||
testFailCount += 1 | ||
testMessages.append("Failed to read message in local scope") | ||
|
||
# Force cleanup | ||
gc.collect() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
return True, msg_ref, mod
# Test in local scope
success, msg_ref, mod = create_and_subscribe()
if not success:
testFailCount += 1
testMessages.append("Failed to read message in local scope")
# Force cleanup
gc.collect()
# Verify that the message was kept around by 'mod'
if msg_ref() is None:
testFailCount += 1
testMessages.append("Message was not kept alive by reader")
# Delete reader, which should also delete the message
del mod
I don't think this test is actually checking that the Python message is kept alive by the C reader. Try the suggested block and you can see that the test fails because the Python message gets garbage collected while the C reader (mod
) is still alive.
This means that we still don't have a mechanism for C message 'readers' to keep standalone Python messages around.
Thanks a lot to @Natsoulas and the reviewers for working on this issue! I've tested the latest commit (e8b2386), but it seems I may have found a case where it still does not work. If you uncomment the line marked with I think it may be due to the type of from Basilisk.architecture import messaging
from Basilisk.utilities import SimulationBaseClass
from Basilisk.utilities import macros
from Basilisk.fswAlgorithms import hillPoint
import numpy as np
def test_Msg_outOfScope():
# Create a sim module as an empty container
scSim = SimulationBaseClass.SimBaseClass()
# create the simulation process
dynProcess = scSim.CreateNewProcess("dynamicsProcess")
# create the dynamics task and specify the integration update time
dynProcess.addTask(scSim.CreateNewTask("dynamicsTask", macros.sec2nano(1.)))
# create modules
mod1 = hillPoint.hillPoint()
mod1.ModelTag = "cppModule1"
scSim.AddModelToTask("dynamicsTask", mod1)
# setup message recording
msgRec = mod1.attRefOutMsg.recorder()
scSim.AddModelToTask("dynamicsTask", msgRec)
keep_msgs_in_scope = []
def assert_msgs_data(module, data):
celBodyInMsg = module.celBodyInMsg.read()
assert np.equal(celBodyInMsg.r_BdyZero_N, 0).all()
assert np.equal(celBodyInMsg.v_BdyZero_N, 0).all()
transNavInMsg = module.transNavInMsg.read()
assert np.equal(transNavInMsg.r_BN_N, data).all()
assert np.equal(transNavInMsg.v_BN_N, data).all()
def create_scoped_messages(module, data):
# Create stand-alone messages with some data.
CelBodyData = messaging.EphemerisMsgPayload()
celBodyInMsg = messaging.EphemerisMsg().write(CelBodyData)
transNavData = messaging.NavTransMsgPayload()
transNavData.r_BN_N = data
transNavData.v_BN_N = data
transNavInMsg = messaging.NavTransMsg().write(transNavData)
# Subscribe the module to the messages. The module will keep a pointer
# to them but not take ownerhsip.
module.celBodyInMsg.subscribeTo(celBodyInMsg)
module.transNavInMsg.subscribeTo(transNavInMsg)
# Confirm that the data we wrote is in the messages.
assert_msgs_data(module, data)
# XXX: Uncomment the following line to keep the messages in scope, so the test passes.
# keep_msgs_in_scope.extend([celBodyInMsg, transNavInMsg])
# On return, the messages will go out-of-scope and get garbage
# collected. The module will no longer be pointing to message data, but
# to random memory...
# Create and subscribe to messages in a local scope, so that they don't
# exist in the scope here.
data = [1., 20., -500.75]
create_scoped_messages(mod1, data=data)
# Assert that the messages have the correct values written to them.
# (If they went out-of-scope and got garbage collected, this will fail.)
assert_msgs_data(mod1, data=data)
# initialize Simulation:
scSim.InitializeSimulation()
# configure a simulation stop time and execute the simulation run
scSim.ConfigureStopTime(macros.sec2nano(3.0))
scSim.ExecuteSimulation()
# Assert that the messages STILL have the correct values.
# (If they went out-of-scope and got garbage collected, this will fail.)
assert_msgs_data(mod1, data=data)
# Confirm that the simulated data matches the expected
expected = np.array([
[0., 0.33262475, 0.01328506],
[0., 0.33262475, 0.01328506],
[0., 0.33262475, 0.01328506],
[0., 0.33262475, 0.01328506]
])
assert np.isclose(msgRec.sigma_RN, expected).all()
assert np.equal(msgRec.omega_RN_N, 0).all()
assert np.equal(msgRec.domega_RN_N, 0).all() |
Howdy @dpad , thanks for the feedback. We'll keep working on this branch while wrapping up other branches. Just returned back from conference travel and plan to spend the weekend catching up on several branches and PRs. |
Description
A non-invasive fix to the messaging subscription scoping bug highlighted in issue #676. Additionally, this pr adds a unit test that validates this fix. This fix is summarized as reference counting to ensure message subscription is not lost due to python scoping.
Verification
Changes are validated via unit test and documentation build.
Documentation
Added docstring where appropriate to additions to the messaging header file.
Future work
Investigate memory leaks and other relevant messaging issues that pop up.