[feat] Track Exact Fast-LLM Version in Training Outputs and wandb Logs #101

Open
tscholak opened this issue Dec 31, 2024 · 2 comments
Labels
enhancement (New feature or request), Priority

Comments

@tscholak
Collaborator

tscholak commented Dec 31, 2024

🧐 Problem Description

The version of Fast-LLM used for training is currently not easily accessible. While training job specs (e.g., Toolkit, Kubeflow) provide the image path/URL, references like ghcr.io/servicenow/fast-llm:latest don't indicate which commit or tagged version was used. This makes it difficult to trace back to the exact codebase version for a training run.

💡 Proposed Solution

Include a version string in the output directory of each training run and log it to wandb for visibility.

Details:

  • For tagged release commits, use the semantic version (e.g., v1.2.3).
  • For non-tagged commits, include the commit hash (e.g., abcdef1) and mark the build as "dirty" if uncommitted changes exist (e.g., abcdef1-dirty).
  • Example formats:
    • Tagged release: v1.2.3
    • Non-tagged commit: abcdef1
    • Modified tagged release: v1.2.3-dirty
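The format rules above could be sketched as follows. This is a minimal illustration, not Fast-LLM's implementation: the helper names `format_version` and `version_from_git` are hypothetical, and the git-based part assumes `git` is on the PATH and the code runs inside a checkout.

```python
import subprocess
from typing import Optional


def format_version(tag: Optional[str], commit: str, dirty: bool) -> str:
    """Build the version string from git state (hypothetical helper)."""
    # Tagged release commits use the tag; other commits use the short hash.
    base = tag if tag else commit
    return f"{base}-dirty" if dirty else base


def _git(*args: str) -> subprocess.CompletedProcess:
    # Run a git command and capture its output as text.
    return subprocess.run(["git", *args], capture_output=True, text=True)


def version_from_git() -> str:
    """Query git for the pieces; assumes the working directory is a checkout."""
    commit = _git("rev-parse", "--short", "HEAD").stdout.strip()
    # `git describe --tags --exact-match` fails unless HEAD is exactly at a tag.
    described = _git("describe", "--tags", "--exact-match")
    tag = described.stdout.strip() if described.returncode == 0 else None
    # `git status --porcelain` prints nothing when the tree is clean.
    # Note: this also treats untracked files as "dirty".
    dirty = bool(_git("status", "--porcelain").stdout.strip())
    return format_version(tag, commit, dirty)
```

Keeping the formatting separate from the git queries means the format itself can be checked without a repository, for example `format_version("v1.2.3", "abcdef1", True)` for a modified tagged release.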

This version string should:

  1. Be written to a file in the training output directory (e.g., fast_llm_version.txt).
  2. Be logged to wandb:
    • As part of the run configuration (wandb.init(config=...)).
    • As a standalone field (wandb.log).
    • Optionally, as a tag for easier filtering (wandb.init(tags=...)).
  3. Be shown in stdout logs.
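A minimal sketch of these three steps, under assumptions: `record_version` is a hypothetical helper, and the config key `fast_llm_version` is an illustrative name. To stay runnable without wandb installed, the function only prepares the keyword arguments for `wandb.init` rather than calling it.

```python
from pathlib import Path


def record_version(output_dir: str, version: str) -> dict:
    """Write the version file, print it, and return kwargs for wandb.init.

    Hypothetical helper; the key name "fast_llm_version" is an assumption.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # 1. Persist the version next to the training outputs.
    (out / "fast_llm_version.txt").write_text(version + "\n")
    # 3. Show the version in stdout.
    print(f"Fast-LLM version: {version}")
    # 2. Prepare the wandb run configuration and an optional tag for filtering.
    return {"config": {"fast_llm_version": version}, "tags": [version]}
```

In a real run the returned dictionary could be merged into the existing arguments and passed on, as in `wandb.init(**kwargs)`.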

🔄 Alternatives Considered

  1. Using container image tags in job specs:
    • Problem: Tags like latest are ambiguous. Job descriptions may not persist (e.g., they could be garbage-collected or lost when a Kubernetes instance is decommissioned).

📈 Potential Benefits

  • Reproducibility: Trace models back to the exact version of Fast-LLM used.
  • Transparency: Facilitates auditing and debugging of training runs.
  • Usability: Avoids manual tracking of version information.

📝 Additional Context

This feature aligns with best practices for software versioning and reproducibility. Common formats like semantic versioning (semver) and commit hashes are widely supported and easy to interpret.


@jlamypoirier
Collaborator

jlamypoirier commented Jan 2, 2025

That's a good idea, but the git information is lost in the docker image. Do you have an idea on how to recover it?

I'd also show the version in stdout logs and make it match the version saved in the checkpoint. For non-release versions, I'd add the Fast-LLM version to the string, e.g., v1.2.3-abcdef1-dirty

@tscholak
Collaborator Author

tscholak commented Jan 8, 2025

the git information is lost in the docker image. Do you have an idea on how to recover it?

We could modify the Docker build GitHub Action to overwrite fast_llm.version.
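One way this could work, sketched under assumptions: the build workflow passes the commit hash and a dirty flag into the image as environment variables (for example through Docker build arguments), and the package combines them with its base version at runtime. The variable names `FAST_LLM_GIT_COMMIT` and `FAST_LLM_GIT_DIRTY` and the helper `full_version` are hypothetical and not part of Fast-LLM. The output follows the v1.2.3-abcdef1-dirty format suggested earlier in the thread.

```python
import os
from typing import Mapping


def full_version(base: str, env: Mapping[str, str] = os.environ) -> str:
    """Combine the package version with git info baked into the image.

    Hypothetical helper: the environment variable names are assumptions.
    """
    commit = env.get("FAST_LLM_GIT_COMMIT", "")
    dirty = env.get("FAST_LLM_GIT_DIRTY", "") == "1"
    parts = [base]
    if commit:
        # Keep only a short hash so the string stays readable.
        parts.append(commit[:7])
    if dirty:
        parts.append("dirty")
    return "-".join(parts)
```

For a release build the workflow would leave the commit variable unset, so the result is just the base version.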
