-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
metrics: Expose litep2p metrics in an agnostic manner #294
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Ideally this should be Into<String>, but that way the we cannot be object safe Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Similar to #296, there is a possibility of leaking memory in the following edge-case: - T0: Connection is established and outbound substream is initiated with peer - This maps the substream ID to the request bytes information - T1: Connection is closed before the service has a chance to report `TransportEvent::SubstreamOpened` or `TransportEvent::SubstreamOpenFailure` In this case, if we connect and immediately disconnect with a request in flight, we are effectively leaking the request bytes. Detected by: - #294 ### Dashboard - We are leaking ~111 requests over 3 days timespan: <img width="1484" alt="Screenshot 2024-12-03 at 10 41 01" src="https://github.com/user-attachments/assets/f6701017-4add-4aa1-aee1-e1f8d33d54f3"> cc @paritytech/networking Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
This PR fixes a subtle memory leak that can happen in the following edge-case situation: - connection is established and substream outbound is initiated with remote peer - the substream ID is tracked until the substream either completes successfully or fails - the connection is closed soon after, leading to no substream events ever being generated For this edge-cases, we need to remove the tracking of the substream ID when the connection is reported as closed. This has been detected after running a node for more than 2 days with the following generic metrics PR: - #294 Closes: #295 cc @paritytech/networking --------- Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
#[error("Invalid metric: `{0}`")] | ||
MetricError(String), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used anywhere?
pending_outbound: registry.register_gauge( | ||
"litep2p_identify_pending_outbound".into(), | ||
"Pending outbound substreams".into(), | ||
)?, | ||
pending_inbound: registry.register_gauge( | ||
"litep2p_identify_pending_inbound".into(), | ||
"Pending inbound substreams".into(), | ||
)?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this info reported?
peers: MetricGauge, | ||
|
||
mem_store_records: MetricGauge, | ||
mem_store_providers: MetricGauge, | ||
mem_store_local_providers: MetricGauge, | ||
mem_store_provider_refresh: MetricGauge, | ||
|
||
engine_queries: MetricGauge, | ||
executor_queries: MetricGauge, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may be of interest to also know the size of the routing table.
pending_outbound: registry.register_gauge( | ||
"litep2p_ping_pending_outbound".into(), | ||
"Pending outbound substreams".into(), | ||
)?, | ||
pending_inbound: registry.register_gauge( | ||
"litep2p_ping_pending_inbound".into(), | ||
"Pending inbound substreams".into(), | ||
)?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with Identify, why do we need this info? Metrics reporting should be optimized in the prometheus libs, but it's not completely free. Also, storing them in the database is not free either. So, I would keep the amount of metrics reported to the necessary minimum.
Also, Identify & Ping peer counts should be equal to the number of connected peers, so I would not report Identify & Ping metrics at all, and just keep one global peer counter.
struct Metrics { | ||
connected_peers: MetricGauge, | ||
pending_outbound_num: MetricGauge, | ||
pending_outbound_handshake_num: MetricGauge, | ||
ready_substreams_handshake_num: MetricGauge, | ||
pending_validations_num: MetricGauge, | ||
timers_num: MetricGauge, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would reduce all these counters to just one MetricCounter
— the number of discovered peers via mDNS.
|
||
/// Metric incremented when the connection starts | ||
/// and decremented when the connection closes. | ||
pub _active_connections_num: ScopeGaugeMetric, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: starting with an underscore looks confusing, like it's not implemented or so. May be #[allow_unused]
?
/// The following metrics are shared with all TCP connections. | ||
active_connections_num: MetricGauge, | ||
pending_substreams_num: MetricGauge, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As with other metrics, I would may be reconsider what we need and what we don't. For example, I am not sure wee need the sizes of all internal containers. At the same time, the total number of substreams may be of interest.
fn new( | ||
context: TransportHandle, | ||
config: Self::Config, | ||
_registry: Option<crate::metrics::MetricsRegistry>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need an issue to implement this in the future?
/// The following metrics are shared with all TCP connections. | ||
active_connections_num: MetricGauge, | ||
pending_substreams_num: MetricGauge, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dq: What metrics we need and what don't + number of open substreams.
@@ -215,6 +215,7 @@ impl TransportBuilder for QuicTransport { | |||
fn new( | |||
context: TransportHandle, | |||
mut config: Self::Config, | |||
_registry: Option<crate::metrics::MetricsRegistry>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need an issue to implement this?
Thanks for the review here @dmitry-markin 🙏 Offhand, everything makes sense! Will try to get to this soon, immediately after we gather a bit more info on: paritytech/polkadot-sdk#7076 (comment) |
This PR optionally exposes litep2p metrics in an agnostic manner.
API Design
Litep2p supports at the moment the following primitives to register and operate on metrics:
Around these primitives, substrate can expose
prometheus
metrics, and if need be update seamlessly to other metric crates (likeprometheus-client
).Metrics Exposed
The exposed metrics inform the user about the state of litep2p components (like kademlia number of store elements, identify negotiating substreams etc) and help developers detect abnormal behavior (like memory leaks / unbounded growth / stalls in some protocols).
Transport Manager
Transport Layer (TCP + Websocket)
Kademlia
Identify / Ping
Request Response Protocol
These metrics are exposed for every req-resp protocol.
Notification Protocol
These metrics are exposed for every notification protocol.
Dashboards
Review notes: Metric traits are define in
src/metrics.rs
, they should give enough background for the metric registration / update that is happening in the rest of the code