Reimplement async monitor handling #133

Closed
wants to merge 2 commits into from

@manveerxyz (Contributor) commented Oct 25, 2024

We had to revert #131 because it broke log submission.
This PR reimplements it with fixes: it changes how we handle async loops and makes batch creation synchronous.

Tested on 2x RTX 3090s. Logs are being sent to and received by a local protocol service:

...
22:39:10 [INFO] [Rank 0] Pending tasks: deque([])
22:39:10 [INFO] [Rank 0] Cleaning up 0 completed tasks
22:39:10 [INFO] [Rank 0] step: 1, loss: 7.4207, diloco_peers: 2
22:39:10 [INFO] [Rank 0] Pending tasks: deque([])
22:39:10 [INFO] [Rank 0] Cleaning up 0 completed tasks
22:39:10 [INFO] [Rank 0] step: 2, loss: 7.4049, tokens_per_second: 262783.83, mfu: 1.52, diloco_peers: 2
22:39:10 [INFO] [Rank 0] Pending tasks: deque([])
22:39:10 [INFO] [Rank 0] Cleaning up 0 completed tasks
22:39:10 [INFO] [Rank 0] step: 3, loss: 7.4265, tokens_per_second: 278822.70, mfu: 1.62, diloco_peers: 2
22:39:10 [INFO] [Rank 0] Pending tasks: deque([])
22:39:10 [INFO] [Rank 0] Cleaning up 0 completed tasks
22:39:10 [INFO] [Rank 0] step: 4, loss: 7.4168, tokens_per_second: 284322.89, mfu: 1.65, diloco_peers: 2
22:39:10 [INFO] [Rank 0] Pending tasks: deque([])
22:39:10 [INFO] [Rank 0] Cleaning up 0 completed tasks
22:39:10 [INFO] [Rank 0] Sending batch with 5 logs
22:39:11 [INFO] [Rank 0] Sent 5 logs to server
22:39:11 [INFO] [Rank 0] step: 5, loss: 7.4077, tokens_per_second: 277327.57, mfu: 1.61, diloco_peers: 2
22:39:11 [INFO] [Rank 0] Pending tasks: deque([])
22:39:11 [INFO] [Rank 0] Cleaning up 0 completed tasks
22:39:11 [INFO] [Rank 0] Sending batch with 1 logs
22:39:11 [INFO] [Rank 0] Sent 1 logs to server
22:39:11 [DEBUG] [Rank 0] [0] Resolving world
22:39:11 [DEBUG] [Rank 0] Node 0 last heartbeat: 1729895950.8916602
22:39:11 [DEBUG] [Rank 0] Node 1 last heartbeat: 1729895950.8478436
22:39:11 [DEBUG] [Rank 0] Joiners (not admitting): [], Dead nodes: [], Evicting nodes: []
22:39:11 [DEBUG] [Rank 0] World resolved in 0.0012244440003996715 seconds
22:39:11 [DEBUG] [Rank 0] sync pseudo gradient  with world size 2
22:39:11 [DEBUG] [Rank 0] Waiting on barrier
22:39:11 [DEBUG] [Rank 0] [0] Monitored Barrier 0
22:39:11 [DEBUG] [Rank 0] Others have 600 seconds to resolve
22:39:12 [DEBUG] [Rank 0] Monitored barrier resolved in 0.20101371300552273 seconds
22:39:12 [DEBUG] [Rank 0] Beginning all reduce
22:39:12 [DEBUG] [Rank 0] 0/4 all reduce bucket done in 0.003842 seconds, numel: 262144
22:39:12 [DEBUG] [Rank 0] 1/4 all reduce bucket done in 0.006478 seconds, numel: 852480
22:39:12 [DEBUG] [Rank 0] 2/4 all reduce bucket done in 0.005521 seconds, numel: 852480
22:39:12 [DEBUG] [Rank 0] 3/4 all reduce bucket done in 0.002027 seconds, numel: 262400
22:39:12 [DEBUG] [Rank 0] All reduce takes 0.219897 seconds numels: 2229504
22:39:12 [INFO] [Rank 0] Sync psuedo-gradient in 0.265378 seconds
22:39:12 [INFO] [Rank 0] all reduce pseudo gradient in: 0.26548387900402304 seconds
22:39:12 [DEBUG] [Rank 0] sync inner model
22:39:12 [INFO] [Rank 0] effective mfu: 0.023154298032237574
22:39:12 [INFO] [Rank 0] Pending tasks: deque([])
22:39:12 [INFO] [Rank 0] Cleaning up 0 completed tasks
22:39:12 [INFO] [Rank 0] Sending batch with 1 logs
22:39:12 [INFO] [Rank 0] Sent 1 logs to server
22:39:12 [INFO] [Rank 0] Training finished, exiting ...
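
For readers without the diff at hand, the pattern described above (batches built synchronously, sends scheduled as async tasks, finished tasks cleaned up on each log call) could look roughly like the sketch below. This is not the code from this PR: `LogMonitor`, `_send`, `batch_size`, and the in-memory `sent` list standing in for the protocol service are all hypothetical names chosen for illustration, and the sketch assumes `log()` is called while an asyncio event loop is running.

```python
import asyncio
from collections import deque
from typing import Any


class LogMonitor:
    """Hypothetical monitor: builds batches synchronously, sends them as async tasks."""

    def __init__(self, batch_size: int = 5) -> None:
        self.batch_size = batch_size
        self.buffer: list[dict[str, Any]] = []
        self.pending: deque[asyncio.Task] = deque()
        # Stand-in for the remote service; a real monitor would do network I/O.
        self.sent: list[list[dict[str, Any]]] = []

    async def _send(self, batch: list[dict[str, Any]]) -> None:
        await asyncio.sleep(0)  # placeholder for the actual request
        self.sent.append(batch)

    def _cleanup(self) -> int:
        """Drop finished tasks, re-raising any exception they hit."""
        done = [task for task in self.pending if task.done()]
        for task in done:
            task.result()
            self.pending.remove(task)
        return len(done)

    def _flush(self) -> None:
        if not self.buffer:
            return
        # Batch creation is synchronous: swap the buffer out before scheduling,
        # so later log() calls cannot mutate a batch that is already in flight.
        batch, self.buffer = self.buffer, []
        task = asyncio.get_running_loop().create_task(self._send(batch))
        self.pending.append(task)

    def log(self, data: dict[str, Any]) -> None:
        self._cleanup()
        self.buffer.append(data)
        if len(self.buffer) >= self.batch_size:
            self._flush()

    async def finish(self) -> None:
        """Send whatever is left and wait for every in-flight send."""
        self._flush()
        await asyncio.gather(*self.pending)
        self._cleanup()


async def main() -> list[list[dict[str, Any]]]:
    monitor = LogMonitor(batch_size=5)
    for step in range(1, 7):
        monitor.log({"step": step})
        await asyncio.sleep(0)  # yield so scheduled sends can make progress
    await monitor.finish()
    return monitor.sent


batches = asyncio.run(main())
print([len(batch) for batch in batches])
```

Swapping the buffer out before the task is created is the design choice that matters here: the coroutine only ever sees a list that nothing else still references.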

@manveerxyz requested a review from @samsja on Oct 25, 2024, 23:12
@samsja closed this on Dec 10, 2024