Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

676 subscribed messages which go out of scope get garbage collected and cause undefined behaviour #884

Open
wants to merge 3 commits into
base: develop
Choose a base branch
from

Conversation

Natsoulas
Copy link
Contributor

Description

A non-invasive fix to the messaging subscription scoping bug highlighted in issue #676. Additionally, this pr adds a unit test that validates this fix. This fix is summarized as reference counting to ensure message subscription is not lost due to python scoping.

Verification

Changes are validated via unit test and documentation build.

Documentation

Added docstring where appropriate to additions to the messaging header file.

Future work

Investigate memory leaks and other relevant messaging issues that pop up.

@Natsoulas Natsoulas added the bug Something isn't working label Dec 23, 2024
@Natsoulas Natsoulas self-assigned this Dec 23, 2024
@Natsoulas Natsoulas requested a review from a team as a code owner December 23, 2024 09:10
@Natsoulas Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch 7 times, most recently from 738e21a to da19670 Compare December 25, 2024 09:31
@schaubh
Copy link
Contributor

schaubh commented Dec 27, 2024

Inviting @sassy-asjp to review this PR as well.

Copy link
Contributor

@juan-g-bonilla juan-g-bonilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the work! I like the approach of having implementation-agnostic callbacks to handle reference counting with foreign languages (for example, Python). However, I don't think we are fully implementing/testing all required functionality (see comments).

* @param incRef Function to increment reference count
* @param decRef Function to decrement reference count
*/
void setSourceRef(void* src, void (*incRef)(void*), void (*decRef)(void*)) {
Copy link
Contributor

@juan-g-bonilla juan-g-bonilla Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I can tell, the reference increase and decrease functions are only called when setSourceRef is called, which means that the reference count should be decreased correctly when a reader is subscribed to a new standalone message. However, what happens when the reader goes out of scope? That should also trigger a reference count decrease, otherwise the standalone messages will leak when all subscribed readers go out of scope (simulation end, for example). While this might be easy to do in C++ (add setSourceRef(nullptr, nullptr, nullptr) to the reader destructor), I don't see an easy way to do this for C modules ("readers" in C modules) since those don't have destructors.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Howdy @juan-g-bonilla , thanks for the feedback. Right now we are fully relying on the BSK develop to ensure stand-alone messages remain in scope and are not deleted. If we have a solution that partially helps keep messages in scope, i.e. a solution that doesn't work with C stand-along messages which are more rare, then I am ok with this as long as this behavior is fully documented.

In our tutorials on making stand-alone messages we can add a section about the challenges to consider with such messages, discuss how this is handled for C++ messages connected to C++ modules, but also explain how to solve this with other situations where the coder has to ensure the messages remain in scope as they currently do. Right now we don't have a good discussion on this but could add this to this branch.

Would you agree with such a partial approach? We are making the coder's life easier in some cases, but in others we are more open about what needs to be done to retain the msg in memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For reference, I think the problem would be with C readers connected to standalone Python messages, since C doesn't even have a concept of reader/writer (my knowledge of C messages is iffy, so take this with a grain of salt). Since C objects don't have a destructor, there is no way to notify the standalone Python message that a C in-message has gone out of scope and thus it should decrease its reference count, which would lead to leaks. Also, right now, C++ readers of

I agree that a solution that makes life easier today is better than a perfect one in the future. However, today, C++ and C models look very similar in the Python layer, so I can also imagine that introducing this inconsistent behavior between them can produce more headaches than cure them. Moreover, the current behavior is that you get a segfault / UB for standalone messages, while with the changes the messages would leak. The former is catastrophic (usually), while the latter is more forgiving but also insidious (and we have been fighting hard against leaking objects recently).

If we are relying on documentation, which of the following are conducive to fewer headaches?

  • Stand-alone Python messages should be kept in scope during the entire simulation (otherwise you may get segfaults or UB).
  • Stand-alone Python messages should be kept in scope during the entire simulation when connected to at least one model implemented in C (otherwise the standalone message might leak).
    It's your call, I would be ok with either.

However, there might be a way to address this issue also for C messages. @Natsoulas take a look at the SWIG wrappers of C models (in swig_c_wrap.i). These produce a C++ object, which could have a destructor. There might be a way to solve this issue where the refDecCallback is called not on Reader destruction in C (since there are not readers or destructors in C) but on destruction of the SWIG wrapper of the C model, which is a C++ object. Essentially, the swig_c_wrap.i could be modified to have a std::vector<std::function<void()>> onDestructionCallbacks, which get called on destruction of the wrapper. Then, it would be a matter of registering the inMsg refDecCallback in this vector. All of this easier said than done, but there might be a path here (which could be spun into its own PR).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with your concern, that is what prompted my earlier question. With a partial documented solution, are we causing more headaches in debugging than if we better document that the coder must keep the stand-alone messages in memory.

@Natsoulas, let me know what you think/find of Juan's suggestion at the bottom of this post.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding Juan's first comment about reference counting and destructors:
I appreciate the detailed feedback about potential memory leaks when readers go out of scope. The suggestion about using SWIG wrappers is interesting and I'm hoping it helps resolve things. I can explore modifying swig_c_wrap.i to add destruction callbacks, which would help handle C module cleanup properly. This would definitely be a more complete solution than the current approach.

Regarding the documentation vs partial solution discussion:
@schaubh I agree with Juan that introducing inconsistent behavior between C++ and C models would be generally problematic. Rather than proceeding with the current "partial" solution, I propose that I:

  1. First implement the suggested SWIG wrapper approach to handle C module cleanup like in Juan's comment
  2. Add more comprehensive unit tests like the ones Juan suggested
  3. Then document the unified behavior

@juan-g-bonilla let me know if this rough plan of action is missing anything else important. Lastly, thanks I do think that using std::function instead of function pointers would be an improvement. Trying these ideas out, I'll force push again once I have tests passing/expected functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@juan-g-bonilla I've updated this PR to incorporate your suggestions and provide further unit test coverage, I'd appreciate it if you could check this out again. Thanks!

if testFailCount == 0:
print("PASSED")
else:
[print(msg) for msg in testMessages]
assert(testFailCount == 0)


def test_standalone_message_scope():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to see additional tests:

  • A standalone message is connected to multiple readers: the standalone message is kept alive until all readers are gone.
  • A standalone message is connected to a C++ reader, both the python message and C++ reader go out of scope, the standalone message is deleted (no leaks).
  • Same as above but for C reader.
  • Same as above but a mix of C++ and C readers.

Feel free to combine multiple of the above tests into a single test if that's easier. I'd suggest calling del and/or the Python garbage collector manually to ensure the objects are actually calling their delete function when going out of scope (as you know the Python GC can delay that arbitrarily).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To add to this, I think it is also important to test Python modules and Python standalone readers in those combinations as well.

It would also be nice if the test checked that after deleting the name and manually forcing Python garbage collector, that the standalone message was actually deleted, though that might be more trouble than its worth.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sassy-asjp and @juan-g-bonilla let me know if you agree with my further unit test additions/whether or not it covers what you suggested how you like

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! I had some comments about test implementation, but I think they cover what I requested

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sassy-asjp No problem, I just incorporated your comments into the unit tests' implementation. Let me know if my changes are satisfactory or if anything remains unclear/needs work. I appreciate your help with this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Natsoulas Thanks! It looks good to me

src/architecture/messaging/messaging.h Outdated Show resolved Hide resolved
@Natsoulas Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch 5 times, most recently from 427fb81 to 16408c3 Compare January 9, 2025 20:24
src/tests/test_messaging.py Outdated Show resolved Hide resolved
src/tests/test_messaging.py Show resolved Hide resolved
src/tests/test_messaging.py Show resolved Hide resolved
src/tests/test_messaging.py Outdated Show resolved Hide resolved
@Natsoulas Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch 2 times, most recently from 5166b04 to 159bb4c Compare January 10, 2025 22:43
@Natsoulas Natsoulas force-pushed the 676-subscribed-messages-which-go-out-of-scope-get-garbage-collected-and-cause-undefined-behaviour branch from 159bb4c to e8b2386 Compare January 10, 2025 22:57
if testFailCount == 0:
print("PASSED")
else:
[print(msg) for msg in testMessages]
assert(testFailCount == 0)


def test_standalone_message_scope():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Natsoulas Thanks! It looks good to me

Copy link
Contributor

@juan-g-bonilla juan-g-bonilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First of all, sorry for the brief and maybe incomplete feedback. It's a complicated week, but I didn't want to leave this PR unreviewed for too long. I didn't do an exhaustive review, so please feel free to push back if you think I missed something.

Thank you for considering our suggestions, especially the extensive test suite! Seems like a lot of work and thought has been into this PR.

However, I don't think this addresses our two main concerns since the last review:

  • Ensure there are no leaks.
  • Ensure that C modules keep standalone Python messages alive.

See the comments in test_messaging for suggested test alterations that prove this. I think we need to go back to our previous discussion about reducing the reference count on C++ reader destruction and on C module destruction (through the destructionCallbacks).

I'm happy to provide more feedback or help in the coming weeks, but it might take a minute.

@@ -63,6 +65,22 @@ class CWrapper : public SysModel {
std::unique_ptr<TConfig> config; //!< class variable
};

class CModuleWrapper {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this class is doing anything? I'm pretty sure if we delete it, the code runs the same way.

I think I didn't properly explain what I meant with the destructionCallbacks feature. My idea was to add it to CWrapper. However, the bigger problem is that we would NEED to use addDestructionCallback at some point to handle C messages, which the code right now is not doing.

Comment on lines +488 to +492
# Delete message first, then module
del msg_ref
gc.collect()
del mod
gc.collect()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    # Delete reader, which should also delete the message
    del mod
    gc.collect()

    # Verify message was actually cleaned up
    if msg_ref() is not None:
        testFailCount += 1
        testMessages.append("Message was not properly cleaned up")

This test (and maybe others, please check!) are not actually checking for memory leaks. You had the right intuition with using weak references, deleting, and using the garbage collector. However, what we actually want to ensure is that once the original Python message has gone out of scope (msg in the create_and_subscribe scope) and the subscribed C++ reader has also gone out of scope (mod was deleted), then the Python message gets actually garbage collected (thus trying to get the object from the weakref should return None). You successfully implemented this test for the test_standalone_message_c_reader_cleanup!

Try the code above in place of this block and it should fail (because we are causing leaks!!)

Comment on lines +524 to +534
return True, msg_ref

# Test in local scope
success, msg_ref = create_and_subscribe()
if not success:
testFailCount += 1
testMessages.append("Failed to read message in local scope")

# Force cleanup
gc.collect()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

        return True, msg_ref, mod

    # Test in local scope
    success, msg_ref, mod = create_and_subscribe()
    if not success:
        testFailCount += 1
        testMessages.append("Failed to read message in local scope")

    # Force cleanup
    gc.collect()

    # Verify that the message was kept around by 'mod'
    if msg_ref() is None:
        testFailCount += 1
        testMessages.append("Message was not kept alive by reader")

    # Delete reader, which should also delete the message
    del mod

I don't think this test is actually checking that the Python message is kept alive by the C reader. Try the suggested block and you can see that the test fails because the Python message gets garbage collected while the C reader (mod) is still alive.

This means that we still don't have a mechanism for C message 'readers' to keep standalone Python messages around.

@dpad
Copy link
Contributor

dpad commented Jan 24, 2025

Thanks a lot to @Natsoulas and the reviewers for working on this issue!

I've tested the latest commit (e8b2386), but it seems I may have found a case where it still does not work. If you uncomment the line marked with XXX, the test passes. (Sorry if you're already aware of this failing case, I haven't kept up with the discussion.)

I think it may be due to the type of mod1.celBodyInMsg and mod1.transNavInMsg, which are NavTransMsg_C and EphemerisMsg_C respectively. By contrast, the test-cases which pass are on CppModuleTemplate.dataInMsg which is a ReadFunctor<CModuleTemplateMsgPayload> type.

from Basilisk.architecture import messaging
from Basilisk.utilities import SimulationBaseClass
from Basilisk.utilities import macros
from Basilisk.fswAlgorithms import hillPoint

import numpy as np
def test_Msg_outOfScope():
    #  Create a sim module as an empty container
    scSim = SimulationBaseClass.SimBaseClass()

    #  create the simulation process
    dynProcess = scSim.CreateNewProcess("dynamicsProcess")

    # create the dynamics task and specify the integration update time
    dynProcess.addTask(scSim.CreateNewTask("dynamicsTask", macros.sec2nano(1.)))

    # create modules
    mod1 = hillPoint.hillPoint()
    mod1.ModelTag = "cppModule1"
    scSim.AddModelToTask("dynamicsTask", mod1)

    # setup message recording
    msgRec = mod1.attRefOutMsg.recorder()
    scSim.AddModelToTask("dynamicsTask", msgRec)

    keep_msgs_in_scope = []

    def assert_msgs_data(module, data):
        celBodyInMsg = module.celBodyInMsg.read()
        assert np.equal(celBodyInMsg.r_BdyZero_N, 0).all()
        assert np.equal(celBodyInMsg.v_BdyZero_N, 0).all()

        transNavInMsg = module.transNavInMsg.read()
        assert np.equal(transNavInMsg.r_BN_N,     data).all()
        assert np.equal(transNavInMsg.v_BN_N,     data).all()

    def create_scoped_messages(module, data):
        # Create stand-alone messages with some data.
        CelBodyData = messaging.EphemerisMsgPayload()
        celBodyInMsg = messaging.EphemerisMsg().write(CelBodyData)

        transNavData = messaging.NavTransMsgPayload()
        transNavData.r_BN_N     = data
        transNavData.v_BN_N     = data
        transNavInMsg = messaging.NavTransMsg().write(transNavData)

        # Subscribe the module to the messages. The module will keep a pointer
        # to them but not take ownerhsip.
        module.celBodyInMsg.subscribeTo(celBodyInMsg)
        module.transNavInMsg.subscribeTo(transNavInMsg)

        # Confirm that the data we wrote is in the messages.
        assert_msgs_data(module, data)

        # XXX: Uncomment the following line to keep the messages in scope, so the test passes.
        # keep_msgs_in_scope.extend([celBodyInMsg, transNavInMsg])

        # On return, the messages will go out-of-scope and get garbage
        # collected. The module will no longer be pointing to message data, but
        # to random memory...

    # Create and subscribe to messages in a local scope, so that they don't
    # exist in the scope here.
    data = [1., 20., -500.75]
    create_scoped_messages(mod1, data=data)

    # Assert that the messages have the correct values written to them.
    # (If they went out-of-scope and got garbage collected, this will fail.)
    assert_msgs_data(mod1, data=data)

    #  initialize Simulation:
    scSim.InitializeSimulation()

    #   configure a simulation stop time and execute the simulation run
    scSim.ConfigureStopTime(macros.sec2nano(3.0))
    scSim.ExecuteSimulation()

    # Assert that the messages STILL have the correct values.
    # (If they went out-of-scope and got garbage collected, this will fail.)
    assert_msgs_data(mod1, data=data)

    # Confirm that the simulated data matches the expected
    expected = np.array([
        [0., 0.33262475, 0.01328506],
        [0., 0.33262475, 0.01328506],
        [0., 0.33262475, 0.01328506],
        [0., 0.33262475, 0.01328506]
    ])
    assert np.isclose(msgRec.sigma_RN, expected).all()
    assert np.equal(msgRec.omega_RN_N, 0).all()
    assert np.equal(msgRec.domega_RN_N, 0).all()

@schaubh
Copy link
Contributor

schaubh commented Jan 24, 2025

Howdy @dpad , thanks for the feedback. We'll keep working on this branch while wrapping up other branches. Just returned back from conference travel and plan to spend the weekend catching up on several branches and PRs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
Status: 👀 In review
Development

Successfully merging this pull request may close these issues.

Subscribed messages which go out of scope get garbage collected and cause undefined behaviour
5 participants