Increasing Memory Consumption #2365

Open
steffenbeermann opened this issue Nov 12, 2024 · 5 comments
Labels: bug (Something isn't working)

Comments

steffenbeermann commented Nov 12, 2024

Describe the bug
We noticed that we sometimes run out of memory on the devices running IoT Edge with the OPC Publisher. After investigating, we found that the OPC Publisher module consumes more and more memory over time, see:
[Image: OPC Publisher module memory usage increasing over time]

Restarting the module resets the memory usage (see the dip at the start of the graph).
We noticed the same behavior on OPC Publisher 2.9.11 and 2.9.0.

To Reproduce
We have the following config:

{
  "Hostname": "publisher_axh_active",
  "Cmd": [
    "--cl=5",
    "--cf",
    "--aa",
    "--pf=/mount/publishednodes.json",
    "--PkiRootPath=/mount/pki",
    "--si=90",
    "--di=3600"
  ],
  "HostConfig": {
    "Binds": [
      "/home/edge/OPCPublisherMount_Active:/mount"
    ]
  }
}

and we have around 20 endpoints configured with a total of around 200,000 nodes in multiple subscriptions.

On the older version 2.8.1 it seems to run fine, without memory increasing over time:
[Image: memory usage staying flat over time on version 2.8.1]

Expected behavior
The OPC Publisher performs its own garbage collection and does not build up high memory usage.

@marcschier marcschier added the bug Something isn't working label Nov 15, 2024
@marcschier marcschier added this to the 2.9.12 milestone Nov 15, 2024
@marcschier (Collaborator) commented

Could you run without Prometheus by setting --em=False?

Also, how does the OOM show up for you?
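
As a minimal sketch, assuming the container create options from the original post stay otherwise unchanged, the flag would simply be appended to the Cmd array:

{
  "Cmd": [
    "--cl=5",
    "--cf",
    "--aa",
    "--pf=/mount/publishednodes.json",
    "--PkiRootPath=/mount/pki",
    "--si=90",
    "--di=3600",
    "--em=False"
  ]
}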

@steffenbeermann (Author) commented Nov 15, 2024

We rely on the metrics of the OPC Publisher module. With Prometheus disabled, will the edge metrics collector still function?

We noticed the issue because the IoT Edge runtime crashed and restarted all modules. It stayed down for a few hours until it came up again. See the drop in memory on the 10th of November:
[Image: memory drop across all modules on 10 November]

This coincided with a downtime of all modules of about 3 hours, during which they did not respond. After some automatic restarts, all modules came up again on their own.

@marcschier (Collaborator) commented

Could you take a look at the diagnostic log and check the numbers of the data flow pipeline (encoder related, send queue, egress, etc.)? Check whether they are low or increasing. If they are increasing, that could mean we need to tune the data flow path so it does not hold on to incoming data it cannot send out.

My colleague also found a resource leak in the security handshake of the OPC UA stack, but that is likely not enough to explain the increases, especially the huge jump at the end of the run from 2 GB to 4 GB.
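
As a minimal sketch (not part of OPC Publisher, just one way to watch those numbers over time), you could pipe the module log through a small script that prints timestamped values of the pipeline counters, so it is easy to see whether they stay low or keep growing. The counter names are the fields from the OPC Publisher diagnostic output; the regex is an assumption about how they appear in your log text.

# Sketch only: extract data flow pipeline counters from the module log so you
# can see whether they stay low or keep growing over time. The counter names
# are fields from the OPC Publisher diagnostic output; the regex is an
# assumption about how they appear in the log text.
import re
import sys
from datetime import datetime, timezone

COUNTERS = (
    "ingressBatchBlockBufferSize",
    "encodingBlockInputSize",
    "encodingBlockOutputSize",
    "encoderNotificationsDropped",
    "outgressInputBufferCount",
    "outgressInputBufferDropped",
)
pattern = re.compile(r'"?({0})"?\s*[:=]\s*([0-9.]+)'.format("|".join(COUNTERS)))

for line in sys.stdin:
    for name, value in pattern.findall(line):
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        print("{0},{1},{2}".format(stamp, name, value), flush=True)

Usage would be something like iotedge logs <publisher module name> | python3 watch_pipeline.py >> pipeline.csv, where the module name and the script file name are placeholders for your setup.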

@steffenbeermann (Author) commented

I have this diagnostic info for one endpoint:

    "sentMessagesPerSec": 0.12591434486187222,
    "ingestionDuration": "6.17:47:16.9692019",
    "ingressDataChanges": 3494295,
    "ingressValueChanges": 12381772,
    "ingressBatchBlockBufferSize": 48,
    "encodingBlockInputSize": 0,
    "encodingBlockOutputSize": 0,
    "encoderNotificationsProcessed": 3495567,
    "encoderNotificationsDropped": 0,
    "encoderIoTMessagesProcessed": 73343,
    "encoderAvgNotificationsMessage": 47.660540201519055,
    "encoderAvgIoTMessageBodySize": 37767.330201928016,
    "encoderAvgIoTChunkUsage": 9.220539600080082,
    "estimatedIoTChunksPerDay": 0,
    "outgressInputBufferCount": 0,
    "outgressInputBufferDropped": 0,
    "outgressIoTMessageCount": 73336,
    "connectionRetries": 0,
    "opcEndpointConnected": true,
    "monitoredOpcNodesSucceededCount": 6862,
    "monitoredOpcNodesFailedCount": 0,
    "ingressEventNotifications": 0,
    "ingressEvents": 0,
    "encoderMaxMessageSplitRatio": 0,
    "ingressDataChangesInLastMinute": 360,
    "ingressValueChangesInLastMinute": 1144,
    "ingressHeartbeats": 0,
    "ingressCyclicReads": 0

I don't exactly understand what you mean by the data flow pipeline.

@steffenbeermann (Author) commented

> Could you run without Prometheus by setting --em=False?
>
> Also, how does the OOM show up for you?

I tried with the setting --em=False but it seems it did not solve the problem, as there is still a slow increase over the last two weeks:
[Image: slow memory increase over the last two weeks with --em=False set]
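
To keep an eye on the trend with the Prometheus endpoint disabled, here is a minimal sketch of sampling the container memory from the device itself, assuming the docker CLI is available there; the container name below is taken from the Hostname in the config above and may differ on the actual device.

# Sketch only: sample the container's memory usage once a minute via the
# docker CLI and print a timestamped CSV line, so the memory trend can still
# be followed with the Prometheus endpoint (--em=False) disabled.
import subprocess
import time
from datetime import datetime, timezone

# Assumption: taken from the "Hostname" in the create options above; the
# actual container/module name on the device may be different.
CONTAINER = "publisher_axh_active"

while True:
    result = subprocess.run(
        ["docker", "stats", "--no-stream",
         "--format", "{{.Name}},{{.MemUsage}}", CONTAINER],
        capture_output=True, text=True,
    )
    stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
    print("{0},{1}".format(stamp, result.stdout.strip()), flush=True)
    time.sleep(60)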

@marcschier marcschier modified the milestones: 2.9.12, 2.9.13 Jan 22, 2025