Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

metrics: Expose litep2p metrics in an agnostic manner #294

Open
wants to merge 32 commits into
base: master
Choose a base branch
from

Conversation

lexnv
Copy link
Collaborator

@lexnv lexnv commented Nov 29, 2024

This PR optionally exposes litep2p metrics in an agnostic manner.

API Design

Litep2p supports at the moment the following primitives to register and operate on metrics:

/// A registry for metrics.
pub trait MetricsRegistryT: Send + Sync {
    /// Register a new counter.
    fn register_counter(&self, name: String, help: String) -> Result<MetricCounter, Error>;

    /// Register a new gauge.
    fn register_gauge(&self, name: String, help: String) -> Result<MetricGauge, Error>;
}


/// Represents a metric that can only go up.
pub trait MetricCounterT: Send + Sync {
    /// Increment the counter by `value`.
    fn inc(&self, value: u64);
}

/// Represents a metric that can arbitrarily go up and down.
pub trait MetricGaugeT: Send + Sync {
    /// Set the gauge to `value`.
    fn set(&self, value: u64);

    /// Increment the gauge.
    fn inc(&self);
...
}

Around these primitives, substrate can expose prometheus metrics, and if need be update seamlessly to other metric crates (like prometheus-client).

Metrics Exposed

The exposed metrics inform the user about the state of litep2p components (like kademlia number of store elements, identify negotiating substreams etc) and help developers detect abnormal behavior (like memory leaks / unbounded growth / stalls in some protocols).

Transport Manager

  • number of incoming connections
  • number of outgoing connections
  • number of opening errors
  • number of managed peers
  • number of pending connections

Transport Layer (TCP + Websocket)

  • number of pending inbound connections
  • number of pending open connections
  • number of raw unnegotiated connections
  • number of pending substreams

Kademlia

  • number of total peers
  • number of engine running queries
  • number of executor queries
  • number of memory store records
  • number of memory store local providers
  • number of memory store providers
  • number of memory store provider refreshes

Identify / Ping

  • number of total peers
  • number of pending inbound substreams
  • number of pending outbound substreams

Request Response Protocol

These metrics are exposed for every req-resp protocol.

  • number of connected peers
  • number of pending dials
  • number of pending inbound substreams
  • number of pending inbound requests
  • number of pending outbound
  • number of outbound cancels
  • number of pending outbound responses

Notification Protocol

These metrics are exposed for every notification protocol.

  • number of connected peers
  • number of outbound initiated handshakes
  • number of outbound substreams
  • number of validation
  • number of ready substream handshakes
  • number of timers

Dashboards

Screenshot 2024-12-03 at 15 02 27 Screenshot 2024-12-03 at 15 02 51 Screenshot 2024-12-03 at 15 02 43 Screenshot 2024-12-03 at 15 03 09

Review notes: Metric traits are define in src/metrics.rs, they should give enough background for the metric registration / update that is happening in the rest of the code

@lexnv lexnv added the enhancement New feature or request label Nov 29, 2024
@lexnv lexnv self-assigned this Nov 29, 2024
lexnv added 10 commits November 29, 2024 17:43
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
lexnv added a commit that referenced this pull request Dec 3, 2024
Similar to #296, there is a
possibility of leaking memory in the following edge-case:
- T0: Connection is established and outbound substream is initiated with
peer
  - This maps the substream ID to the request bytes information
- T1: Connection is closed before the service has a chance to report
`TransportEvent::SubstreamOpened` or
`TransportEvent::SubstreamOpenFailure`

In this case, if we connect and immediately disconnect with a request in
flight, we are effectively leaking the request bytes.


Detected by:
- #294


### Dashboard

- We are leaking ~111 requests over 3 days timespan:

<img width="1484" alt="Screenshot 2024-12-03 at 10 41 01"
src="https://github.com/user-attachments/assets/f6701017-4add-4aa1-aee1-e1f8d33d54f3">


cc @paritytech/networking

Signed-off-by: Alexandru Vasile <[email protected]>
@lexnv lexnv changed the title wip: Metrics metrics: Expose litep2p metrics in an agnostic manner Dec 3, 2024
lexnv added a commit that referenced this pull request Dec 3, 2024
This PR fixes a subtle memory leak that can happen in the following
edge-case situation:
- connection is established and substream outbound is initiated with
remote peer
- the substream ID is tracked until the substream either completes
successfully or fails
- the connection is closed soon after, leading to no substream events
ever being generated

For this edge-cases, we need to remove the tracking of the substream ID
when the connection is reported as closed.

This has been detected after running a node for more than 2 days with
the following generic metrics PR:
- #294

Closes: #295

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
Comment on lines +130 to +131
#[error("Invalid metric: `{0}`")]
MetricError(String),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used anywhere?

Comment on lines +212 to +219
pending_outbound: registry.register_gauge(
"litep2p_identify_pending_outbound".into(),
"Pending outbound substreams".into(),
)?,
pending_inbound: registry.register_gauge(
"litep2p_identify_pending_inbound".into(),
"Pending inbound substreams".into(),
)?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this info reported?

Comment on lines +178 to +186
peers: MetricGauge,

mem_store_records: MetricGauge,
mem_store_providers: MetricGauge,
mem_store_local_providers: MetricGauge,
mem_store_provider_refresh: MetricGauge,

engine_queries: MetricGauge,
executor_queries: MetricGauge,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may be of interest to also know the size of the routing table.

Comment on lines +103 to +110
pending_outbound: registry.register_gauge(
"litep2p_ping_pending_outbound".into(),
"Pending outbound substreams".into(),
)?,
pending_inbound: registry.register_gauge(
"litep2p_ping_pending_inbound".into(),
"Pending inbound substreams".into(),
)?,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with Identify, why do we need this info? Metrics reporting should be optimized in the prometheus libs, but it's not completely free. Also, storing them in the database is not free either. So, I would keep the amount of metrics reported to the necessary minimum.

Also, Identify & Ping peer counts should be equal to the number of connected peers, so I would not report Identify & Ping metrics at all, and just keep one global peer counter.

Comment on lines +292 to 299
struct Metrics {
connected_peers: MetricGauge,
pending_outbound_num: MetricGauge,
pending_outbound_handshake_num: MetricGauge,
ready_substreams_handshake_num: MetricGauge,
pending_validations_num: MetricGauge,
timers_num: MetricGauge,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would reduce all these counters to just one MetricCounter — the number of discovered peers via mDNS.


/// Metric incremented when the connection starts
/// and decremented when the connection closes.
pub _active_connections_num: ScopeGaugeMetric,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: starting with an underscore looks confusing, like it's not implemented or so. May be #[allow_unused]?

/// The following metrics are shared with all TCP connections.
active_connections_num: MetricGauge,
pending_substreams_num: MetricGauge,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As with other metrics, I would may be reconsider what we need and what we don't. For example, I am not sure wee need the sizes of all internal containers. At the same time, the total number of substreams may be of interest.

fn new(
context: TransportHandle,
config: Self::Config,
_registry: Option<crate::metrics::MetricsRegistry>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need an issue to implement this in the future?

/// The following metrics are shared with all TCP connections.
active_connections_num: MetricGauge,
pending_substreams_num: MetricGauge,
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dq: What metrics we need and what don't + number of open substreams.

@@ -215,6 +215,7 @@ impl TransportBuilder for QuicTransport {
fn new(
context: TransportHandle,
mut config: Self::Config,
_registry: Option<crate::metrics::MetricsRegistry>,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need an issue to implement this?

@lexnv
Copy link
Collaborator Author

lexnv commented Jan 13, 2025

Thanks for the review here @dmitry-markin 🙏 Offhand, everything makes sense!

Will try to get to this soon, immediately after we gather a bit more info on: paritytech/polkadot-sdk#7076 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants