E2E testing for RHEL AI images #391
I had a short chat with Rom. The first step would be to bootstrap (with Terraform) a test environment instance. The first one would be for Nvidia. @Gregory-Pereira, could you please identify the AWS instance requirements?
According to Stef, this is the instance type for the Nvidia test environment: g5.8xlarge with a 128 GB disk.
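To make that spec concrete, here is a minimal Terraform sketch of such an instance. This is illustrative only: the region, AMI, resource name, and tags are assumptions, not what the actual config in this repo uses.

```hcl
# Minimal sketch of the Nvidia e2e test instance; names and AMI are placeholders.
provider "aws" {
  region = "us-east-1" # assumed region
}

resource "aws_instance" "nvidia_e2e" {
  ami           = "ami-xxxxxxxx" # placeholder for a RHEL AI / bootc-capable AMI
  instance_type = "g5.8xlarge"   # Nvidia test environment, per Stef

  root_block_device {
    volume_size = 128 # GB disk, per the spec above
  }

  tags = {
    Name = "rhel-ai-e2e-nvidia" # assumed tag
  }
}
```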
The first pass of this was merged in #411 using the basic image, but in the code you can see the options for …
Currently working on the next pass in #413. This PR will install the bootc and e2e test deps onto the Terraform-provisioned instance via the Ansible playbook. I am, however, running into an issue: the E2E tests require the …
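For reference, one way the Terraform-provisioned instance could be handed to an Ansible playbook that installs those deps is sketched below, building on the instance sketch above. The playbook name, the local-exec approach, and the omitted SSH details are all assumptions; the actual PR may drive Ansible from CI instead.

```hcl
# Sketch only: run the dependency playbook against the instance provisioned above.
resource "null_resource" "install_e2e_deps" {
  triggers = {
    instance_id = aws_instance.nvidia_e2e.id # re-run if the instance is replaced
  }

  provisioner "local-exec" {
    # Hypothetical playbook that installs bootc and the e2e test deps.
    # SSH user/key flags are omitted for brevity.
    command = "ansible-playbook -i '${aws_instance.nvidia_e2e.public_ip},' install-e2e-deps.yml"
  }
}
```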
To summarize what I'm trying to do with instructlab/instructlab#1016:
The GPU worker from GitHub is a Tesla T4 with 16 GB RAM, so there are some limitations. It sounds like you're testing with more powerful infrastructure, so you'll be able to exercise a more extensive workflow than the "smoke test" style I'm trying to get into …
A couple of us have kicked off writing E2E tests for the InstructLab workflow, based on the Containers that @rhatdan and friends have been adding to the AI Lab Recipes ... in the training directory.
To do that you'll probably need:
From Dan Walsh: The InstructLab RHEL AI container images ... these are here https://github.com/containers/ai-lab-recipes/tree/main/training/instructlab ... or somewhere near quay.io/ai-lab
From Russell Bryant: The work on getting the e2e tests complete. I worked a bunch on this today as did he ... they should be somewhere near: instructlab/instructlab#1016 ... in particular Russell made CPU testing (on beefy machines) run in less than an hour.
As an initial test goal, I think the following combination should work:
Beefy AWS instance with many CPUs (see the variable sketch after this list)
The InstructLab CUDA RHEL AI image ... which should automatically fall back to CPU PyTorch logic
The E2E test from Russell and the PR above
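To cover the beefy-CPU case without a second Terraform config, the instance type from the Nvidia sketch above could be made a variable. The variable name and the CPU instance type below are only assumptions for illustration:

```hcl
# Hypothetical variable so the same config covers both the GPU and the
# many-CPU test instances.
variable "e2e_instance_type" {
  type    = string
  default = "g5.8xlarge"
}

# In the aws_instance resource above, instance_type would then become:
#   instance_type = var.e2e_instance_type
# and the CPU-only pass could be provisioned with, e.g.:
#   terraform apply -var 'e2e_instance_type=c6i.32xlarge'
```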
Obviously the actual goal here is to use Bifrost RHEL AI accelerated images + internal InstructLab application container images for that testing. Dan is still working on those. But this is a good place to start. WDYT?
@Russell Bryant is also adding this test to InstructLab CI ... albeit without the Bifrost accelerated images. Anything to add, Russell?