676 subscribed messages which go out of scope get garbage collected and cause undefined behaviour #884

Natsoulas · 2024-12-23T09:10:42Z

Tickets addressed: issue Subscribed messages which go out of scope get garbage collected and cause undefined behaviour #676
Review: By commit
Merge strategy: Merge (no squash)

Description

A non-invasive fix to the messaging subscription scoping bug highlighted in issue #676. Additionally, this pr adds a unit test that validates this fix. This fix is summarized as reference counting to ensure message subscription is not lost due to python scoping.

Verification

Changes are validated via unit test and documentation build.

Documentation

Added docstring where appropriate to additions to the messaging header file.

Future work

Investigate memory leaks and other relevant messaging issues that pop up.

schaubh · 2024-12-27T16:01:46Z

Inviting @sassy-asjp to review this PR as well.

juan-g-bonilla

Thanks for the work! I like the approach of having implementation-agnostic callbacks to handle reference counting with foreign languages (for example, Python). However, I don't think we are fully implementing/testing all required functionality (see comments).

juan-g-bonilla · 2025-01-02T15:57:37Z

src/architecture/messaging/messaging.h

+     * @param incRef Function to increment reference count
+     * @param decRef Function to decrement reference count
+     */
+    void setSourceRef(void* src, void (*incRef)(void*), void (*decRef)(void*)) {


As far as I can tell, the reference increase and decrease functions are only called when setSourceRef is called, which means that the reference count should be decreased correctly when a reader is subscribed to a new standalone message. However, what happens when the reader goes out of scope? That should also trigger a reference count decrease, otherwise the standalone messages will leak when all subscribed readers go out of scope (simulation end, for example). While this might be easy to do in C++ (add setSourceRef(nullptr, nullptr, nullptr) to the reader destructor), I don't see an easy way to do this for C modules ("readers" in C modules) since those don't have destructors.

Howdy @juan-g-bonilla , thanks for the feedback. Right now we are fully relying on the BSK develop to ensure stand-alone messages remain in scope and are not deleted. If we have a solution that partially helps keep messages in scope, i.e. a solution that doesn't work with C stand-along messages which are more rare, then I am ok with this as long as this behavior is fully documented.

In our tutorials on making stand-alone messages we can add a section about the challenges to consider with such messages, discuss how this is handled for C++ messages connected to C++ modules, but also explain how to solve this with other situations where the coder has to ensure the messages remain in scope as they currently do. Right now we don't have a good discussion on this but could add this to this branch.

Would you agree with such a partial approach? We are making the coder's life easier in some cases, but in others we are more open about what needs to be done to retain the msg in memory.

For reference, I think the problem would be with C readers connected to standalone Python messages, since C doesn't even have a concept of reader/writer (my knowledge of C messages is iffy, so take this with a grain of salt). Since C objects don't have a destructor, there is no way to notify the standalone Python message that a C in-message has gone out of scope and thus it should decrease its reference count, which would lead to leaks. Also, right now, C++ readers of

I agree that a solution that makes life easier today is better than a perfect one in the future. However, today, C++ and C models look very similar in the Python layer, so I can also imagine that introducing this inconsistent behavior between them can produce more headaches than cure them. Moreover, the current behavior is that you get a segfault / UB for standalone messages, while with the changes the messages would leak. The former is catastrophic (usually), while the latter is more forgiving but also insidious (and we have been fighting hard against leaking objects recently).

If we are relying on documentation, which of the following are conducive to fewer headaches?

Stand-alone Python messages should be kept in scope during the entire simulation (otherwise you may get segfaults or UB).

Stand-alone Python messages should be kept in scope during the entire simulation when connected to at least one model implemented in C (otherwise the standalone message might leak).
It's your call, I would be ok with either.

However, there might be a way to address this issue also for C messages. @Natsoulas take a look at the SWIG wrappers of C models (in swig_c_wrap.i). These produce a C++ object, which could have a destructor. There might be a way to solve this issue where the refDecCallback is called not on Reader destruction in C (since there are not readers or destructors in C) but on destruction of the SWIG wrapper of the C model, which is a C++ object. Essentially, the swig_c_wrap.i could be modified to have a std::vector<std::function<void()>> onDestructionCallbacks, which get called on destruction of the wrapper. Then, it would be a matter of registering the inMsg refDecCallback in this vector. All of this easier said than done, but there might be a path here (which could be spun into its own PR).

I agree with your concern, that is what prompted my earlier question. With a partial documented solution, are we causing more headaches in debugging than if we better document that the coder must keep the stand-alone messages in memory.

@Natsoulas, let me know what you think/find of Juan's suggestion at the bottom of this post.

Regarding Juan's first comment about reference counting and destructors:
I appreciate the detailed feedback about potential memory leaks when readers go out of scope. The suggestion about using SWIG wrappers is interesting and I'm hoping it helps resolve things. I can explore modifying swig_c_wrap.i to add destruction callbacks, which would help handle C module cleanup properly. This would definitely be a more complete solution than the current approach.

Regarding the documentation vs partial solution discussion:
@schaubh I agree with Juan that introducing inconsistent behavior between C++ and C models would be generally problematic. Rather than proceeding with the current "partial" solution, I propose that I:

First implement the suggested SWIG wrapper approach to handle C module cleanup like in Juan's comment

Add more comprehensive unit tests like the ones Juan suggested

Then document the unified behavior

@juan-g-bonilla let me know if this rough plan of action is missing anything else important. Lastly, thanks I do think that using std::function instead of function pointers would be an improvement. Trying these ideas out, I'll force push again once I have tests passing/expected functionality.

@juan-g-bonilla I've updated this PR to incorporate your suggestions and provide further unit test coverage, I'd appreciate it if you could check this out again. Thanks!

juan-g-bonilla · 2025-01-02T16:03:01Z

src/tests/test_messaging.py

    if testFailCount == 0:
        print("PASSED")
    else:
        [print(msg) for msg in testMessages]
    assert(testFailCount == 0)


+def test_standalone_message_scope():


I'd like to see additional tests:

A standalone message is connected to multiple readers: the standalone message is kept alive until all readers are gone.

A standalone message is connected to a C++ reader, both the python message and C++ reader go out of scope, the standalone message is deleted (no leaks).

Same as above but for C reader.

Same as above but a mix of C++ and C readers.

Feel free to combine multiple of the above tests into a single test if that's easier. I'd suggest calling del and/or the Python garbage collector manually to ensure the objects are actually calling their delete function when going out of scope (as you know the Python GC can delay that arbitrarily).

To add to this, I think it is also important to test Python modules and Python standalone readers in those combinations as well.

It would also be nice if the test checked that after deleting the name and manually forcing Python garbage collector, that the standalone message was actually deleted, though that might be more trouble than its worth.

@sassy-asjp and @juan-g-bonilla let me know if you agree with my further unit test additions/whether or not it covers what you suggested how you like

Thank you! I had some comments about test implementation, but I think they cover what I requested

@sassy-asjp No problem, I just incorporated your comments into the unit tests' implementation. Let me know if my changes are satisfactory or if anything remains unclear/needs work. I appreciate your help with this.

@Natsoulas Thanks! It looks good to me

src/architecture/messaging/messaging.h

src/tests/test_messaging.py

sassy-asjp · 2025-01-15T05:00:26Z

src/tests/test_messaging.py

    if testFailCount == 0:
        print("PASSED")
    else:
        [print(msg) for msg in testMessages]
    assert(testFailCount == 0)


+def test_standalone_message_scope():


@Natsoulas Thanks! It looks good to me

juan-g-bonilla

First of all, sorry for the brief and maybe incomplete feedback. It's a complicated week, but I didn't want to leave this PR unreviewed for too long. I didn't do an exhaustive review, so please feel free to push back if you think I missed something.

Thank you for considering our suggestions, especially the extensive test suite! Seems like a lot of work and thought has been into this PR.

However, I don't think this addresses our two main concerns since the last review:

Ensure there are no leaks.
Ensure that C modules keep standalone Python messages alive.

See the comments in test_messaging for suggested test alterations that prove this. I think we need to go back to our previous discussion about reducing the reference count on C++ reader destruction and on C module destruction (through the destructionCallbacks).

I'm happy to provide more feedback or help in the coming weeks, but it might take a minute.

juan-g-bonilla · 2025-01-15T07:02:07Z

src/architecture/_GeneralModuleFiles/swig_c_wrap.i

@@ -63,6 +65,22 @@ class CWrapper : public SysModel {
    std::unique_ptr<TConfig> config; //!< class variable
 };

+class CModuleWrapper {


I don't think this class is doing anything? I'm pretty sure if we delete it, the code runs the same way.

I think I didn't properly explain what I meant with the destructionCallbacks feature. My idea was to add it to CWrapper. However, the bigger problem is that we would NEED to use addDestructionCallback at some point to handle C messages, which the code right now is not doing.

juan-g-bonilla · 2025-01-15T07:07:47Z

src/tests/test_messaging.py

+    # Delete message first, then module
+    del msg_ref
+    gc.collect()
+    del mod
+    gc.collect()


# Delete reader, which should also delete the message del mod gc.collect() # Verify message was actually cleaned up if msg_ref() is not None: testFailCount += 1 testMessages.append("Message was not properly cleaned up")

This test (and maybe others, please check!) are not actually checking for memory leaks. You had the right intuition with using weak references, deleting, and using the garbage collector. However, what we actually want to ensure is that once the original Python message has gone out of scope (msg in the create_and_subscribe scope) and the subscribed C++ reader has also gone out of scope (mod was deleted), then the Python message gets actually garbage collected (thus trying to get the object from the weakref should return None). You successfully implemented this test for the test_standalone_message_c_reader_cleanup!

Try the code above in place of this block and it should fail (because we are causing leaks!!)

juan-g-bonilla · 2025-01-15T07:18:25Z

src/tests/test_messaging.py

+        return True, msg_ref
+
+    # Test in local scope
+    success, msg_ref = create_and_subscribe()
+    if not success:
+        testFailCount += 1
+        testMessages.append("Failed to read message in local scope")
+
+    # Force cleanup
+    gc.collect()
+


return True, msg_ref, mod # Test in local scope success, msg_ref, mod = create_and_subscribe() if not success: testFailCount += 1 testMessages.append("Failed to read message in local scope") # Force cleanup gc.collect() # Verify that the message was kept around by 'mod' if msg_ref() is None: testFailCount += 1 testMessages.append("Message was not kept alive by reader") # Delete reader, which should also delete the message del mod

I don't think this test is actually checking that the Python message is kept alive by the C reader. Try the suggested block and you can see that the test fails because the Python message gets garbage collected while the C reader (mod) is still alive.

This means that we still don't have a mechanism for C message 'readers' to keep standalone Python messages around.

dpad · 2025-01-24T08:11:14Z

Thanks a lot to @Natsoulas and the reviewers for working on this issue!

I've tested the latest commit (e8b2386), but it seems I may have found a case where it still does not work. If you uncomment the line marked with XXX, the test passes. (Sorry if you're already aware of this failing case, I haven't kept up with the discussion.)

I think it may be due to the type of mod1.celBodyInMsg and mod1.transNavInMsg, which are NavTransMsg_C and EphemerisMsg_C respectively. By contrast, the test-cases which pass are on CppModuleTemplate.dataInMsg which is a ReadFunctor<CModuleTemplateMsgPayload> type.

from Basilisk.architecture import messaging
from Basilisk.utilities import SimulationBaseClass
from Basilisk.utilities import macros
from Basilisk.fswAlgorithms import hillPoint

import numpy as np
def test_Msg_outOfScope():
    #  Create a sim module as an empty container
    scSim = SimulationBaseClass.SimBaseClass()

    #  create the simulation process
    dynProcess = scSim.CreateNewProcess("dynamicsProcess")

    # create the dynamics task and specify the integration update time
    dynProcess.addTask(scSim.CreateNewTask("dynamicsTask", macros.sec2nano(1.)))

    # create modules
    mod1 = hillPoint.hillPoint()
    mod1.ModelTag = "cppModule1"
    scSim.AddModelToTask("dynamicsTask", mod1)

    # setup message recording
    msgRec = mod1.attRefOutMsg.recorder()
    scSim.AddModelToTask("dynamicsTask", msgRec)

    keep_msgs_in_scope = []

    def assert_msgs_data(module, data):
        celBodyInMsg = module.celBodyInMsg.read()
        assert np.equal(celBodyInMsg.r_BdyZero_N, 0).all()
        assert np.equal(celBodyInMsg.v_BdyZero_N, 0).all()

        transNavInMsg = module.transNavInMsg.read()
        assert np.equal(transNavInMsg.r_BN_N,     data).all()
        assert np.equal(transNavInMsg.v_BN_N,     data).all()

    def create_scoped_messages(module, data):
        # Create stand-alone messages with some data.
        CelBodyData = messaging.EphemerisMsgPayload()
        celBodyInMsg = messaging.EphemerisMsg().write(CelBodyData)

        transNavData = messaging.NavTransMsgPayload()
        transNavData.r_BN_N     = data
        transNavData.v_BN_N     = data
        transNavInMsg = messaging.NavTransMsg().write(transNavData)

        # Subscribe the module to the messages. The module will keep a pointer
        # to them but not take ownerhsip.
        module.celBodyInMsg.subscribeTo(celBodyInMsg)
        module.transNavInMsg.subscribeTo(transNavInMsg)

        # Confirm that the data we wrote is in the messages.
        assert_msgs_data(module, data)

        # XXX: Uncomment the following line to keep the messages in scope, so the test passes.
        # keep_msgs_in_scope.extend([celBodyInMsg, transNavInMsg])

        # On return, the messages will go out-of-scope and get garbage
        # collected. The module will no longer be pointing to message data, but
        # to random memory...

    # Create and subscribe to messages in a local scope, so that they don't
    # exist in the scope here.
    data = [1., 20., -500.75]
    create_scoped_messages(mod1, data=data)

    # Assert that the messages have the correct values written to them.
    # (If they went out-of-scope and got garbage collected, this will fail.)
    assert_msgs_data(mod1, data=data)

    #  initialize Simulation:
    scSim.InitializeSimulation()

    #   configure a simulation stop time and execute the simulation run
    scSim.ConfigureStopTime(macros.sec2nano(3.0))
    scSim.ExecuteSimulation()

    # Assert that the messages STILL have the correct values.
    # (If they went out-of-scope and got garbage collected, this will fail.)
    assert_msgs_data(mod1, data=data)

    # Confirm that the simulated data matches the expected
    expected = np.array([
        [0., 0.33262475, 0.01328506],
        [0., 0.33262475, 0.01328506],
        [0., 0.33262475, 0.01328506],
        [0., 0.33262475, 0.01328506]
    ])
    assert np.isclose(msgRec.sigma_RN, expected).all()
    assert np.equal(msgRec.omega_RN_N, 0).all()
    assert np.equal(msgRec.domega_RN_N, 0).all()

schaubh · 2025-01-24T18:25:32Z

Howdy @dpad , thanks for the feedback. We'll keep working on this branch while wrapping up other branches. Just returned back from conference travel and plan to spend the weekend catching up on several branches and PRs.

Natsoulas added the bug Something isn't working label Dec 23, 2024

Natsoulas self-assigned this Dec 23, 2024

Natsoulas requested a review from a team as a code owner December 23, 2024 09:10

Natsoulas linked an issue Dec 23, 2024 that may be closed by this pull request

Subscribed messages which go out of scope get garbage collected and cause undefined behaviour #676

Open

Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch 7 times, most recently from 738e21a to da19670 Compare December 25, 2024 09:31

schaubh requested review from schaubh and juan-g-bonilla December 27, 2024 15:50

juan-g-bonilla requested changes Jan 2, 2025

View reviewed changes

Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch 5 times, most recently from 427fb81 to 16408c3 Compare January 9, 2025 20:24

sassy-asjp reviewed Jan 10, 2025

View reviewed changes

src/tests/test_messaging.py Outdated Show resolved Hide resolved

src/tests/test_messaging.py Show resolved Hide resolved

src/tests/test_messaging.py Show resolved Hide resolved

src/tests/test_messaging.py Outdated Show resolved Hide resolved

Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch 2 times, most recently from 5166b04 to 159bb4c Compare January 10, 2025 22:43

Natsoulas requested review from juan-g-bonilla and sassy-asjp January 10, 2025 22:52

Natsoulas added 3 commits January 10, 2025 15:57

fix messaging subscription scoping

cb3ccd4

add unit tests verifying fix

0ae022a

update release notes

e8b2386

Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch from 159bb4c to e8b2386 Compare January 10, 2025 22:57

sassy-asjp approved these changes Jan 15, 2025

View reviewed changes

juan-g-bonilla requested changes Jan 15, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

676 subscribed messages which go out of scope get garbage collected and cause undefined behaviour #884

676 subscribed messages which go out of scope get garbage collected and cause undefined behaviour #884

Natsoulas commented Dec 23, 2024

schaubh commented Dec 27, 2024

juan-g-bonilla left a comment

juan-g-bonilla Jan 2, 2025 •

edited

Loading

schaubh Jan 2, 2025

juan-g-bonilla Jan 2, 2025

schaubh Jan 2, 2025

Natsoulas Jan 4, 2025

Natsoulas Jan 9, 2025

juan-g-bonilla Jan 2, 2025

sassy-asjp Jan 6, 2025

Natsoulas Jan 9, 2025

sassy-asjp Jan 10, 2025

Natsoulas Jan 10, 2025

sassy-asjp Jan 15, 2025

sassy-asjp Jan 15, 2025

juan-g-bonilla left a comment

juan-g-bonilla Jan 15, 2025

juan-g-bonilla Jan 15, 2025

juan-g-bonilla Jan 15, 2025

dpad commented Jan 24, 2025 •

edited

Loading

schaubh commented Jan 24, 2025

676 subscribed messages which go out of scope get garbage collected and cause undefined behaviour #884

Are you sure you want to change the base?

676 subscribed messages which go out of scope get garbage collected and cause undefined behaviour #884

Conversation

Natsoulas commented Dec 23, 2024

Description

Verification

Documentation

Future work

schaubh commented Dec 27, 2024

juan-g-bonilla left a comment

Choose a reason for hiding this comment

juan-g-bonilla Jan 2, 2025 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

juan-g-bonilla left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dpad commented Jan 24, 2025 • edited Loading

schaubh commented Jan 24, 2025

juan-g-bonilla Jan 2, 2025 •

edited

Loading

dpad commented Jan 24, 2025 •

edited

Loading