E2E testing for RHEL AI images #391
I had a short chat with Rom. The first step would be to bootstrap (with Terraform) a test environment instance. The first one would be for Nvidia. @Gregory-Pereira, could you please identify the AWS instance requirements?
According to Stef, this is the instance type for the Nvidia test environment: g5.8xlarge with a 128 GB disk.
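To make that spec concrete, here is a minimal Terraform sketch of such an instance. This is illustrative only: the region, AMI, resource name, and tags are assumptions, not what the actual config in this repo uses.

```hcl
# Minimal sketch of the Nvidia e2e test instance; names and AMI are placeholders.
provider "aws" {
  region = "us-east-1" # assumed region
}

resource "aws_instance" "nvidia_e2e" {
  ami           = "ami-xxxxxxxx" # placeholder for a RHEL AI / bootc-capable AMI
  instance_type = "g5.8xlarge"   # Nvidia test environment, per Stef

  root_block_device {
    volume_size = 128 # GB disk, per the spec above
  }

  tags = {
    Name = "rhel-ai-e2e-nvidia" # assumed tag
  }
}
```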
The first pass of this was merged in #411 using the basic image, but in the code you can see the options for …
Currently working on the next pass in #413. This PR will install the bootc and e2e test deps onto the Terraform-provisioned instance via the Ansible playbook. I am, however, running into an issue: the E2E tests require the …
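For reference, one way the Terraform-provisioned instance could be handed to an Ansible playbook that installs those deps is sketched below, building on the instance sketch above. The playbook name, the local-exec approach, and the omitted SSH details are all assumptions; the actual PR may drive Ansible from CI instead.

```hcl
# Sketch only: run the dependency playbook against the instance provisioned above.
resource "null_resource" "install_e2e_deps" {
  triggers = {
    instance_id = aws_instance.nvidia_e2e.id # re-run if the instance is replaced
  }

  provisioner "local-exec" {
    # Hypothetical playbook that installs bootc and the e2e test deps.
    # SSH user/key flags are omitted for brevity.
    command = "ansible-playbook -i '${aws_instance.nvidia_e2e.public_ip},' install-e2e-deps.yml"
  }
}
```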
To summarize what I'm trying to do with instructlab/instructlab#1016:
The GPU worker from GitHub is a Tesla T4 with 16 GB RAM, so there are some limitations. It sounds like you're testing with more powerful infrastructure, so you'll be able to exercise a more extensive workflow than the "smoke test" style I'm trying to get into …
A couple of us have kicked off writing E2E tests for the InstructLab workflow, based on the Containers that @rhatdan and friends have been adding to the AI Lab Recipes ... in the training directory.
To do that you'll probably need:
From Dan Walsh: The InstructLab RHEL AI container images ... these are here https://github.com/containers/ai-lab-recipes/tree/main/training/instructlab ... or somewhere near quay.io/ai-lab
From Russell Bryant: The work on getting the e2e tests complete. I worked a bunch on this today as did he ... they should be somewhere near: instructlab/instructlab#1016 ... in particular Russell made CPU testing (on beefy machines) run in less than an hour.
As an initial test goal, I think the following combination should work:
Beefy AWS instance with many CPUs (see the variable sketch after this list)
The InstructLab CUDA RHEL AI image ... which should automatically fall back to CPU PyTorch logic
The E2E test from Russell and the PR above
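To cover the beefy-CPU case without a second Terraform config, the instance type from the Nvidia sketch above could be made a variable. The variable name and the CPU instance type below are only assumptions for illustration:

```hcl
# Hypothetical variable so the same config covers both the GPU and the
# many-CPU test instances.
variable "e2e_instance_type" {
  type    = string
  default = "g5.8xlarge"
}

# In the aws_instance resource above, instance_type would then become:
#   instance_type = var.e2e_instance_type
# and the CPU-only pass could be provisioned with, e.g.:
#   terraform apply -var 'e2e_instance_type=c6i.32xlarge'
```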
Obviously the actual goal here is to use Bifrost RHEL AI accelerated images + internal InstructLab application container images for that testing. Dan is still working on those. But this is a good place to start. WDYT?
@Russell Bryant is also adding this test to InstructLab CI ... albeit without the Bifrost accelerated images. Anything to add, Russell?