[feat] Track Exact Fast-LLM Version in Training Outputs and wandb Logs #101

tscholak · 2024-12-31T15:10:15Z

🧐 Problem Description

The version of Fast-LLM used for training is currently not easily accessible. While training job specs (e.g., Toolkit, Kubeflow) provide the image path/URL, references like ghcr.io/servicenow/fast-llm:latest don't indicate which commit or tagged version was used. This makes it difficult to trace back to the exact codebase version for a training run.

💡 Proposed Solution

Include a version string in the output directory of each training run and log it to wandb for visibility.

Details:

For tagged release commits, use the semantic version (e.g., v1.2.3).
For non-tagged commits, use the short commit hash (e.g., abcdef1). In either case, mark the build as "dirty" if uncommitted changes exist (e.g., abcdef1-dirty).
Example formats:
- Tagged release: v1.2.3
- Non-tagged commit: abcdef1
- Modified tagged release: v1.2.3-dirty
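The format rules above could be sketched roughly as follows. This is a minimal illustration, not Fast-LLM code: the helper names `format_version` and `get_version_string` are hypothetical, and it assumes `git` is on the PATH and the code runs inside a checkout. Untracked files count as "dirty" here, which may or may not be what is wanted.

```python
import subprocess
from typing import List, Optional


def _git(args: List[str], cwd: str = ".") -> Optional[str]:
    """Run a git command; return stripped stdout, or None if it fails."""
    try:
        result = subprocess.run(
            ["git", *args], cwd=cwd, capture_output=True, text=True
        )
    except OSError:
        # git is not installed, or the directory does not exist.
        return None
    return result.stdout.strip() if result.returncode == 0 else None


def format_version(tag: Optional[str], commit: str, dirty: bool) -> str:
    """Use the tag if there is one, else the commit hash; append -dirty if needed."""
    base = tag if tag else commit
    return f"{base}-dirty" if dirty else base


def get_version_string(repo_dir: str = ".") -> str:
    """Hypothetical helper: derive the version string from the git checkout."""
    # Succeeds only when HEAD is exactly at a tag.
    tag = _git(["describe", "--tags", "--exact-match"], repo_dir)
    commit = _git(["rev-parse", "--short", "HEAD"], repo_dir) or "unknown"
    # Any output from --porcelain means modified or untracked files exist.
    dirty = bool(_git(["status", "--porcelain"], repo_dir))
    return format_version(tag, commit, dirty)


if __name__ == "__main__":
    print(get_version_string())
```

Keeping the formatting in a pure function separates it from the git calls, so the four example formats can be checked without a repository.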

This version string should:

Be written to a file in the training output directory (e.g., fast_llm_version.txt).
Be logged to wandb:
- As part of the run configuration (wandb.init(config=...)).
- As a standalone field (wandb.log).
- Optionally, as a tag for easier filtering (wandb.init(tags=...)).
Be shown in stdout logs.
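A rough sketch of the three outputs listed above, assuming the version string has already been computed. The function name `record_version` and the config key `fast_llm_version` are hypothetical. The sketch does not import wandb; it only builds the `config` and `tags` keyword arguments that `wandb.init` accepts, so it runs without wandb installed.

```python
import tempfile
from pathlib import Path


def record_version(version: str, output_dir: str) -> dict:
    """Write the version to the output directory, print it, and
    return keyword arguments that could be merged into wandb.init()."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    # File name taken from the example in the proposal.
    (out / "fast_llm_version.txt").write_text(version + "\n")
    # Show the version in stdout logs.
    print(f"Fast-LLM version: {version}")
    return {
        "config": {"fast_llm_version": version},  # hypothetical key name
        "tags": [version],  # optional, for filtering runs
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wandb_kwargs = record_version("v1.2.3", tmp)
        print(wandb_kwargs)
        # With wandb installed, usage might look like:
        #   run = wandb.init(project="...", **wandb_kwargs)
        #   run.summary["fast_llm_version"] = "v1.2.3"
```

In a real training script the returned `config` entry would need to be merged into the existing run configuration rather than replacing it.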

🔄 Alternatives Considered

Using container image tags in job specs:
- Problem: Tags like latest are ambiguous. Job descriptions may not persist (e.g., they could be garbage-collected or lost when a Kubernetes instance is decommissioned).

📈 Potential Benefits

Reproducibility: Trace models back to the exact version of Fast-LLM used.
Transparency: Facilitates auditing and debugging of training runs.
Usability: Avoids manual tracking of version information.

📝 Additional Context

This feature aligns with best practices for software versioning and reproducibility. Common formats like semantic versioning (semver) and commit hashes are widely supported and easy to interpret.

Relevant references:

- Semantic Versioning
- Git documentation on describing tags
- wandb documentation on logging custom metrics and tags

jlamypoirier · 2025-01-02T20:41:36Z

That's a good idea, but the git information is lost in the docker image. Do you have an idea on how to recover it?

I'd also show the version in stdout logs and make it match the version saved in the checkpoint. For non-release versions, I'd add the Fast-LLM version to the string, e.g., v1.2.3-abcdef1-dirty.

tscholak · 2025-01-08T14:02:06Z

> the git information is lost in the docker image. Do you have an idea on how to recover it?

we could modify the docker build GitHub action to tamper with fast_llm.version.
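One way this might look, as a sketch only: the build workflow passes the git-derived version string into the image (for example as a Docker build argument exported as an environment variable), and the code prefers that value over the version hard-coded in the package. The variable name `FAST_LLM_BUILD_VERSION` and the function `resolve_version` are assumptions made for illustration, not existing Fast-LLM names.

```python
import os
from typing import Mapping, Optional


def resolve_version(
    package_version: str, env: Optional[Mapping[str, str]] = None
) -> str:
    """Prefer a version string baked into the image at build time;
    fall back to the package's own version when none is set."""
    env = os.environ if env is None else env
    # Hypothetical variable, set by the Docker build from git metadata.
    baked = env.get("FAST_LLM_BUILD_VERSION", "").strip()
    return baked or package_version


if __name__ == "__main__":
    print(resolve_version("0.0.0"))
```

Passing `env` explicitly keeps the function testable without touching the process environment.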

tscholak added the enhancement label (New feature or request) Dec 31, 2024

tscholak mentioned this issue Dec 31, 2024

[meta] Fast-LLM Improvements Tracker 🌟 #100

Open

jlamypoirier added the Priority label Jan 22, 2025
