🧐 Problem Description
The version of Fast-LLM used for training is currently not easily accessible. While training job specs (e.g., Toolkit, Kubeflow) provide the image path/URL, references like ghcr.io/servicenow/fast-llm:latest don't indicate which commit or tagged version was used. This makes it difficult to trace back to the exact codebase version for a training run.
💡 Proposed Solution
Include a version string in the output directory of each training run and log it to wandb for visibility.
Details:
For tagged release commits, use the semantic version (e.g., v1.2.3).
For non-tagged commits, include the commit hash (e.g., abcdef1) and mark the build as "dirty" if uncommitted changes exist (e.g., abcdef1-dirty).
Example formats (see the sketch for deriving such a string after this list):
Tagged release: v1.2.3
Non-tagged commit: abcdef1
Modified tagged release: v1.2.3-dirty
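A minimal sketch of how such a string could be derived at runtime, assuming git and the repository's .git directory are available; the function name and helper below are illustrative, not existing Fast-LLM code:

```python
# Sketch only: derive "v1.2.3" for a tagged release, "abcdef1" for an
# untagged commit, with "-dirty" appended when uncommitted changes exist.
# Assumes git and the repository's .git directory are available.
import subprocess


def get_version_string(repo_root: str = ".") -> str:
    def git(*args: str) -> str:
        return subprocess.check_output(
            ("git", "-C", repo_root) + args,
            text=True,
            stderr=subprocess.DEVNULL,  # silence "no tag matches" noise
        ).strip()

    try:
        # Exact tag if HEAD is a tagged release commit...
        version = git("describe", "--tags", "--exact-match")
    except subprocess.CalledProcessError:
        # ...otherwise the short commit hash.
        version = git("rev-parse", "--short", "HEAD")

    if git("status", "--porcelain"):  # non-empty output means a dirty tree
        version += "-dirty"
    return version
```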
This version string should (see the recording sketch after this list):
Be written to a file in the training output directory (e.g., fast_llm_version.txt).
Be logged to wandb:
As part of the run configuration (wandb.init(config=...)).
As a standalone field (wandb.log).
Optionally, as a tag for easier filtering (wandb.init(tags=...)).
Be shown in stdout logs.
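A hedged sketch of the recording side: wandb.init, wandb.log, and their config/tags parameters are the real wandb API, while the function name, file name, and logger wiring are assumptions made here for illustration:

```python
# Sketch only: record the version in the output directory, in wandb, and
# in stdout logs. wandb.init/wandb.log and their config/tags parameters
# are the real wandb API; the function and file names are assumptions.
import logging
import pathlib

import wandb

logger = logging.getLogger(__name__)


def record_version(version: str, output_dir: str) -> None:
    # 1. Write the version to a file in the training output directory.
    out = pathlib.Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    (out / "fast_llm_version.txt").write_text(version + "\n")

    # 2. Log to wandb: run config, a tag for filtering, and a standalone
    #    field. In a real trainer, wandb.init would run once at startup.
    wandb.init(config={"fast_llm_version": version}, tags=[version])
    wandb.log({"fast_llm_version": version})

    # 3. Show it in stdout logs.
    logger.info("Fast-LLM version: %s", version)
```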
🔄 Alternatives Considered
Using container image tags in job specs:
Problem: Tags like latest are ambiguous. Job descriptions may not persist (e.g., they could be garbage-collected or lost when a Kubernetes instance is decommissioned).
📈 Potential Benefits
Reproducibility: Trace models back to the exact version of Fast-LLM used.
Transparency: Facilitates auditing and debugging of training runs.
Usability: Avoids manual tracking of version information.
📝 Additional Context
This feature aligns with best practices for software versioning and reproducibility. Common formats like semantic versioning (semver) and commit hashes are widely supported and easy to interpret.
That's a good idea, but the git information is lost in the Docker image. Do you have an idea on how to recover it?
I'd also show the version in stdout logs, and make it match the version saved in the checkpoint. For non-release versions I'd add the Fast-LLM version to the string, e.g. v1.2.3-abcdef1-dirty.
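One possible way to recover it, sketched under assumptions rather than taken from the project's actual build setup: bake the string into the image at build time (a docker build --build-arg computed from git describe, exported via ENV in the Dockerfile), and fall back to git when running from a checkout. Note that git describe --tags --dirty --always naturally yields combined strings close to the v1.2.3-abcdef1-dirty format suggested above; the FAST_LLM_VERSION variable name below is hypothetical:

```python
# Sketch only: resolve the version inside a container where the .git
# directory is absent. Assumes the image build bakes the string in, e.g.
#   docker build --build-arg FAST_LLM_VERSION="$(git describe --tags --dirty --always)" ...
# with ARG/ENV FAST_LLM_VERSION in the Dockerfile; the variable name and
# fallback logic are illustrative, not existing code.
import os
import subprocess


def resolve_version() -> str:
    # Prefer the value baked in at image build time.
    baked = os.environ.get("FAST_LLM_VERSION")
    if baked:
        return baked
    # Fall back to git when running from a checkout: "--dirty" appends
    # the suffix and "--always" degrades to a bare hash when no tag is
    # reachable, giving e.g. "v1.2.3", "v1.2.3-4-gabcdef1-dirty", "abcdef1".
    try:
        return subprocess.check_output(
            ["git", "describe", "--tags", "--dirty", "--always"],
            text=True,
            stderr=subprocess.DEVNULL,
        ).strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return "unknown"
```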