E2E testing for RHEL AI images #391

Closed
lmilbaum opened this issue Apr 29, 2024 · 5 comments · Fixed by #436
@lmilbaum
Collaborator

A couple of us have kicked off writing E2E tests for the InstructLab workflow, based on the Containers that @rhatdan and friends have been adding to the AI Lab Recipes ... in the training directory.

To do that you'll probably need:
  • From Dan Walsh: The InstructLab RHEL AI container images ... these are here https://github.com/containers/ai-lab-recipes/tree/main/training/instructlab ... or somewhere near quay.io/ai-lab
  • From Russell Bryant: The work on getting the e2e tests complete. I worked a bunch on this today as did he ... they should be somewhere near: instructlab/instructlab#1016 ... in particular Russell made CPU testing (on beefy machines) run in less than an hour.

As an initial test goal, I think the following combination should work:
  • Beefy AWS instance with many CPUs
  • The InstructLab cuda RHEL AI image ... which should automatically fall back to CPU pytorch logic
  • The E2E test from Russell and the PR above
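The CPU fallback mentioned in the list above can be sketched roughly as follows. This is a minimal illustration, not the logic the InstructLab image actually ships: `pick_device` and its `cuda_probe` parameter are hypothetical names, and the only real API used is PyTorch's `torch.cuda.is_available()`.

```python
from typing import Callable


def _torch_cuda_available() -> bool:
    """Return True only if PyTorch is importable and reports a usable CUDA device."""
    try:
        import torch  # imported lazily so the sketch also runs where torch is absent
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


def pick_device(cuda_probe: Callable[[], bool] = _torch_cuda_available) -> str:
    """Hypothetical helper: choose "cuda" when a GPU is usable, otherwise "cpu"."""
    return "cuda" if cuda_probe() else "cpu"


if __name__ == "__main__":
    # On a CPU-only AWS instance this is expected to select the CPU path.
    print(pick_device())
```

Passing the probe as a parameter keeps the choice testable on machines with no GPU and no PyTorch install.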
Obviously the actual goal here is to use Bifrost RHEL AI accelerated images + internal InstructLab application container images for that testing. Dan is still working on those. But this is a good place to start. WDYT?

@Russell Bryant is also adding this test to InstructLab CI ... albeit without the Bifrost accelerated images. Anything to add Russell?

@lmilbaum
Collaborator Author

I had a short chat with Rom. First step would be to bootstrap (with Terraform) a test environment instance. The first one would be for Nvidia. @Gregory-Pereira could you please identify the AWS instance requirements?

@lmilbaum
Collaborator Author

lmilbaum commented Apr 30, 2024

According to Stef, the instance type for the Nvidia test environment is g5.8xlarge with a 128 GB disk.
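One way to pin these requirements down for the Terraform bootstrap is to keep them in a small spec and render a JSON variables file from it. This is only a sketch: the instance type and disk size come from this thread, while `InstanceSpec`, `render_tfvars_json`, and the variable names `instance_type` and `root_volume_size_gb` are assumptions that would need to match whatever the Terraform module really declares.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InstanceSpec:
    """Hypothetical container for the Nvidia test environment requirements."""

    instance_type: str = "g5.8xlarge"  # from the thread
    root_volume_size_gb: int = 128     # from the thread; the variable name is assumed


def render_tfvars_json(spec: InstanceSpec) -> str:
    """Serialize the spec as JSON suitable for a Terraform *.tfvars.json file."""
    return json.dumps(asdict(spec), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render_tfvars_json(InstanceSpec()))
```

The output could be written to a file such as `nvidia.tfvars.json` (a made-up name) and passed to Terraform with `-var-file`.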

@Gregory-Pereira
Collaborator

A first pass of this was merged in #411 using the basic image. The code there includes options for the g5.8xlarge machine type and 128 GB of storage, but they are commented out: I want to get something working on the absolute minimum infra we need and then scale up. I am currently working on the Ansible playbook that will install the required dependencies, based on Stef's PR.

@Gregory-Pereira
Collaborator

Gregory-Pereira commented May 1, 2024

Currently working on the next pass in #413. This PR will install the bootc and e2e test deps onto the Terraform-provisioned instance via the Ansible playbook. I am, however, running into an issue. The E2E tests require the cuda-toolkit and build-essentials, among other things, and the cuda-toolkit has no version available for Fedora 40. The options are a new AMI based on Fedora 39 or one based on Ubuntu 22.04, but I have no access to the AWS account to create either. After discussion with Russell, it seems that their current workflows run on Ubuntu 22.04, so we will base our AMI on that to interface with them as fast as possible, and we can iterate from there. Will have to pick this up tomorrow with access from @lmilbaum

@russellb

russellb commented May 1, 2024

To summarize what I'm trying to do with instructlab/instructlab#1016:

  • Get a first end-to-end workflow running in CI, running ilab on the host OS using the single GPU worker type that is built in to GitHub. That's what the PR includes, and it's working as of this afternoon.
  • Once ^^^ is settled, make a variant of it that builds the instructlab/containers/cuda/Containerfile image and uses that to run ilab instead of installing it directly. This is the more interesting test, but the step above was a helpful stepping stone. It also provides a reference to compare back to if something isn't working with the container. Anyway, this is what I want to do next for instructlab CI.

The GPU worker from GitHub is a Tesla T4 with 16 GB RAM, so there are some limitations. ilab train on Linux typically needs more than that, but there's a --4-bit-quant option that makes it work ... up to a point. Converting the resulting model to gguf doesn't work. That's this issue: instructlab/instructlab#579
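As a rough illustration of how a test harness might switch between the two modes, the sketch below builds the `ilab train` argument list and adds `--4-bit-quant` only when the caller flags the GPU as memory-constrained. `build_train_command` and `low_gpu_memory` are hypothetical names. The thread gives no exact memory threshold, so that decision is left to the caller.

```python
import shlex
from typing import List


def build_train_command(low_gpu_memory: bool) -> List[str]:
    """Hypothetical helper: assemble the argument list for an `ilab train` run."""
    cmd = ["ilab", "train"]
    if low_gpu_memory:
        # Flag named in this thread for GPUs such as the 16 GB Tesla T4.
        cmd.append("--4-bit-quant")
    return cmd


if __name__ == "__main__":
    # Print the commands rather than running them, so the sketch has no side effects.
    print(shlex.join(build_train_command(low_gpu_memory=True)))
    print(shlex.join(build_train_command(low_gpu_memory=False)))
```

Per the comment above, a smoke test that takes the quantized path would also need to skip the gguf conversion step until instructlab/instructlab#579 is resolved.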

It sounds like you're testing with more powerful infrastructure, so you'll be able to exercise a more extensive workflow than the "smoke test" style I'm trying to get into instructlab CI.

@lmilbaum lmilbaum moved this from Todo For Summit to In Progress in AI Lab Recipes Planning May 2, 2024
@lmilbaum lmilbaum linked a pull request May 2, 2024 that will close this issue
@github-project-automation github-project-automation bot moved this from In Progress to Done in AI Lab Recipes Planning May 2, 2024