Add capability to publish metrics to prometheus #2684

Merged Jan 9, 2025 (34 commits)

@chesterxgchen chesterxgchen commented Jul 7, 2024

Description

One of the feature requests is to add system metrics so that FLARE runtime metrics can be monitored via Prometheus + Grafana or other monitoring systems.

This PR adds that missing piece. Here are the pieces that make this work:

  1. JobMetricsCollector/SysMetricsCollector: each collector subscribes a callback to the ReservedTopic.APP_METRICS topic in the DataBus and receives the callback when the topic is published.

    The SysMetricsCollector listens to the parent process events (system start/end, etc.) for the client and server processes.
    The JobMetricsCollector listens to the job process events, mostly related to the job, task, etc.

  2. StatsD-reporter
    The statsd-reporter posts the metrics received (from the event callback) to the statsd-exporter interface, by default localhost:9125.

The statsd-exporter exposes the :9102/metrics web interface for Prometheus to scrape; Prometheus can then be used as a data source for Grafana to visualize.
This is a standard setup. We added an example with a docker-compose file to illustrate the process.
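
The flow described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: `MiniBus`, `MetricsCollector`, `StatsdUdpReporter`, the `APP_METRICS_TOPIC` constant, and the shape of the published dict are all hypothetical stand-ins. The only parts taken as given are the default `localhost:9125` address mentioned above and the StatsD line format `<name>:<value>|<type>`.

```python
import socket
from collections import defaultdict
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for ReservedTopic.APP_METRICS.
APP_METRICS_TOPIC = "APP_METRICS"


class MiniBus:
    """Toy in-process pub/sub bus standing in for the DataBus (not FLARE's class)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str, dict], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str, dict], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, datum: dict) -> None:
        # Invoke every callback registered for this topic.
        for callback in self._subscribers[topic]:
            callback(topic, datum)


def to_statsd_line(name: str, value: float, metric_type: str) -> str:
    """Format one metric in the StatsD line protocol: <name>:<value>|<type>."""
    return f"{name}:{value}|{metric_type}"


class StatsdUdpReporter:
    """Posts StatsD lines over UDP; the default address is the one named in the description."""

    def __init__(self, host: str = "localhost", port: int = 9125,
                 send: Optional[Callable[[bytes], None]] = None) -> None:
        self._addr = (host, port)
        # `send` is injectable so the reporter can be exercised without a network.
        self._send = send if send is not None else self._send_udp

    def _send_udp(self, payload: bytes) -> None:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
            sock.sendto(payload, self._addr)

    def report(self, name: str, value: float, metric_type: str) -> None:
        self._send(to_statsd_line(name, value, metric_type).encode("utf-8"))


class MetricsCollector:
    """Subscribes a callback to the metrics topic and forwards each datum to a reporter."""

    def __init__(self, bus: MiniBus, reporter: StatsdUdpReporter) -> None:
        self._reporter = reporter
        bus.subscribe(APP_METRICS_TOPIC, self._on_metrics)

    def _on_metrics(self, topic: str, datum: dict) -> None:
        # Assumed datum shape: {"name": ..., "value": ..., "type": "c" | "g" | "ms"}.
        self._reporter.report(datum["name"], datum["value"], datum["type"])
```

Publishing `{"name": "_job_started_count", "value": 1, "type": "c"}` on the topic would cause the reporter to send the bytes `_job_started_count:1|c`. Passing a custom `send` callable keeps the sketch testable without a running statsd-exporter.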

NVFLARE Monitoring Metrics

| Event | Metric Count | Metric Time Taken |
| --- | --- | --- |
| SYSTEM_START | _system_start_count | |
| SYSTEM_END | _system_end_count | _system_time_taken |
| ABOUT_TO_START_RUN | _about_to_start_run_count | |
| START_RUN | _start_run_count | |
| ABOUT_TO_END_RUN | _about_to_end_run_count | |
| END_RUN | _end_run_count | _run_time_taken |
| CHECK_END_RUN_READINESS | _check_end_run_readiness_count | |
| SWAP_IN | _swap_in_count | |
| SWAP_OUT | _swap_out_count | |
| START_WORKFLOW | _start_workflow_count | |
| END_WORKFLOW | _end_workflow_count | _workflow_time_taken |
| ABORT_TASK | _abort_task_count | |
| FATAL_SYSTEM_ERROR | _fatal_system_error_count | |
| JOB_DEPLOYED | _job_deployed_count | |
| JOB_STARTED | _job_started_count | |
| JOB_COMPLETED | _job_completed_count | _job_time_taken |
| JOB_ABORTED | _job_aborted_count | |
| JOB_CANCELLED | _job_cancelled_count | |
| CLIENT_DISCONNECTED | _client_disconnected_count | |
| CLIENT_RECONNECTED | _client_reconnected_count | |
| BEFORE_PULL_TASK | _before_pull_task_count | |
| AFTER_PULL_TASK | _after_pull_task_count | _pull_task_time_taken |
| BEFORE_PROCESS_TASK_REQUEST | _before_process_task_request_count | |
| AFTER_PROCESS_TASK_REQUEST | _after_process_task_request_count | _process_task_request_time_taken |
| BEFORE_PROCESS_SUBMISSION | _before_process_submission_count | |
| AFTER_PROCESS_SUBMISSION | _after_process_submission_count | _process_submission_time_taken |
| BEFORE_TASK_DATA_FILTER | _before_task_data_filter_count | |
| AFTER_TASK_DATA_FILTER | _after_task_data_filter_count | _data_filter_time_taken |
| BEFORE_TASK_RESULT_FILTER | _before_task_result_filter_count | |
| AFTER_TASK_RESULT_FILTER | _after_task_result_filter_count | _result_filter_time_taken |
| BEFORE_TASK_EXECUTION | _before_task_execution_count | |
| AFTER_TASK_EXECUTION | _after_task_execution_count | _task_execution_time_taken |
| BEFORE_SEND_TASK_RESULT | _before_send_task_result_count | |
| AFTER_SEND_TASK_RESULT | _after_send_task_result_count | _send_task_result_time_taken |
| BEFORE_PROCESS_RESULT_OF_UNKNOWN_TASK | _before_process_result_of_unknown_task_count | |
| AFTER_PROCESS_RESULT_OF_UNKNOWN_TASK | _after_process_result_of_unknown_task_count | _process_result_of_unknown_task_time_taken |
| PRE_RUN_RESULT_AVAILABLE | _pre_run_result_available_count | |
| BEFORE_CHECK_CLIENT_RESOURCES | _before_check_client_resources_count | |
| AFTER_CHECK_CLIENT_RESOURCES | _after_check_client_resources_count | _check_client_resources_time_taken |
| SUBMIT_JOB | _submit_job_count | |
| DEPLOY_JOB_TO_SERVER | _deploy_job_to_server_count | |
| DEPLOY_JOB_TO_CLIENT | _deploy_job_to_client_count | |
| BEFORE_CHECK_RESOURCE_MANAGER | _before_check_resource_manager_count | |
| BEFORE_SEND_ADMIN_COMMAND | _before_send_admin_command_count | |
| BEFORE_CLIENT_REGISTER | _before_client_register_count | |
| AFTER_CLIENT_REGISTER | _after_client_register_count | client_register_time_taken |
| CLIENT_REGISTER_RECEIVED | _client_register_received_count | |
| CLIENT_REGISTER_PROCESSED | _client_register_processed_count | |
| CLIENT_QUIT | _client_quit_count | |
| SYSTEM_BOOTSTRAP | _system_bootstrap_count | |

These metrics can be separated into Job Metrics and System Metrics. System Metrics are associated with the Client and Server parent processes, while Job Metrics are associated with each job.
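
The names in the metrics table appear to follow a pattern: each event gets a count metric named after the lowercased event, and some closing events also carry a time-taken metric. A minimal sketch of that pattern is shown below. It is inferred from the table only, not taken from the PR's code: `EventMetrics`, `count_metric_name`, and `TIMED_PAIRS` are hypothetical names, the choice of which start event pairs with each end event is an assumption, and measuring elapsed time in seconds is also an assumption.

```python
import time
from typing import Callable, Dict, Tuple

# end event -> (start event, time-taken metric). Metric names are copied from
# the table; the start event each end event pairs with is an assumption.
TIMED_PAIRS: Dict[str, Tuple[str, str]] = {
    "SYSTEM_END": ("SYSTEM_START", "_system_time_taken"),
    "AFTER_PULL_TASK": ("BEFORE_PULL_TASK", "_pull_task_time_taken"),
    "AFTER_TASK_EXECUTION": ("BEFORE_TASK_EXECUTION", "_task_execution_time_taken"),
}
START_EVENTS = {start for start, _ in TIMED_PAIRS.values()}


def count_metric_name(event: str) -> str:
    """Derive the count metric name, e.g. JOB_STARTED -> _job_started_count."""
    return f"_{event.lower()}_count"


class EventMetrics:
    """Turns a stream of event names into count and time-taken metrics."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so elapsed time is testable
        self._started: Dict[str, float] = {}

    def on_event(self, event: str) -> Dict[str, float]:
        now = self._clock()
        # Every event contributes an increment of 1 to its counter.
        metrics: Dict[str, float] = {count_metric_name(event): 1}
        if event in START_EVENTS:
            self._started[event] = now
        if event in TIMED_PAIRS:
            start_event, metric = TIMED_PAIRS[event]
            started = self._started.pop(start_event, None)
            if started is not None:
                # Elapsed time since the matching start event (assumed seconds).
                metrics[metric] = now - started
        return metrics
```

Only three pairs are listed to keep the sketch short; the remaining rows of the table would extend `TIMED_PAIRS` in the same way, once the actual start event for each has been confirmed against the code.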

We support three different setups:

setup-1
setup-2
setup-3

Detailed examples for setups 1 and 2 are provided, using hello-pt.

Types of changes

  • Non-breaking change (fix or new feature that would not break existing functionality).
  • Breaking change (fix or new feature that would cause existing functionality to change).
  • New tests added to cover the changes.
  • Quick tests passed locally by running ./runtest.sh.
  • In-line docstrings updated.
  • Documentation updated.

@chesterxgchen chesterxgchen marked this pull request as draft July 7, 2024 04:47
@chesterxgchen chesterxgchen marked this pull request as ready for review July 19, 2024 04:39
@chesterxgchen chesterxgchen marked this pull request as draft July 25, 2024 23:16
@chesterxgchen chesterxgchen marked this pull request as ready for review August 17, 2024 02:51
@chesterxgchen chesterxgchen marked this pull request as draft August 17, 2024 02:53
@chesterxgchen chesterxgchen marked this pull request as ready for review January 4, 2025 05:17
@chesterxgchen chesterxgchen marked this pull request as draft January 5, 2025 16:02
@chesterxgchen chesterxgchen marked this pull request as ready for review January 7, 2025 23:20
@chesterxgchen chesterxgchen marked this pull request as draft January 8, 2025 20:27
@chesterxgchen chesterxgchen marked this pull request as ready for review January 9, 2025 00:09
@IsaacYangSLA

/build

@IsaacYangSLA IsaacYangSLA self-requested a review January 9, 2025 00:34
IsaacYangSLA previously approved these changes Jan 9, 2025

@IsaacYangSLA IsaacYangSLA left a comment

Thanks for the updated commits.

@chesterxgchen

/build

@chesterxgchen

/build

@chesterxgchen

/build

@IsaacYangSLA IsaacYangSLA left a comment

Thanks. Approving after rebase.

@IsaacYangSLA IsaacYangSLA merged commit 2cedf04 into NVIDIA:main Jan 9, 2025
20 checks passed