Sync with upstream main #456

Merged
israel-hdez merged 33 commits into opendatahub-io:master
Dec 20, 2024

Conversation

hdefazio

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Type of changes
Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • This change requires a documentation update

Feature/Issue validation/testing:

Please describe the tests that you ran to verify your changes and relevant result summary. Provide instructions so it can be reproduced.
Please also list any relevant details for your test configuration.

  • Test A

  • Test B

  • Logs

Special notes for your reviewer:

  1. Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Checklist:

  • Have you added unit/e2e tests that prove your fix is effective or that this feature works?
  • Has code been commented, particularly in hard-to-understand areas?
  • Have you made corresponding changes to the documentation?

Release note:


Re-running failed tests

  • /rerun-all - rerun all failed workflows.
  • /rerun-workflow <workflow name> - rerun a specific failed workflow. Only one workflow name can be specified. Multiple /rerun-workflow commands are allowed per comment.

LOADBC and others added 30 commits November 7, 2024 09:02
…rve#4012)

* Fix readiness probe logic and update test scenarios for HTTPGet, TCPSocket, and Exec handling

Signed-off-by: Snehomoy <[email protected]>

* Update: Refactor logic for readiness probe handling

Signed-off-by: Snehomoy <[email protected]>

* Apply gofmt formatting to agent_injector.go

Signed-off-by: Snehomoy <[email protected]>

* Added logger to replace fmt.Printf for better consistency and observability

Signed-off-by: Snehomoy <[email protected]>

* Formatted file using goimports with -local

Signed-off-by: Snehomoy <[email protected]>

---------

Signed-off-by: Snehomoy <[email protected]>
) (kserve#4018)

* Feat: Fix memory issue by replacing io.ReadAll with io.Copy (kserve#4017)

Previously, io.ReadAll was causing out-of-memory problems when downloading large files from GCS.
This change replaces io.ReadAll() with io.Copy() to stream data and prevent excessive memory usage.

Signed-off-by: ops-jaeha <[email protected]>

* Feat: Fix add newline at end of file to satisfy golang lint

Signed-off-by: ops-jaeha <[email protected]>

* Feat: Refactor log Info for golang lint (kserve#4017)

Signed-off-by: ops-jaeha <[email protected]>

---------

Signed-off-by: ops-jaeha <[email protected]>
chore:	Fix CVE-2024-26130 - NULL Pointer Dereference
	  - Upgrade cryptography to version 42.0.4 or higher.
	Update Python version to match KServe 0.14.0
	Update tensorflow, tensorflow-io-gcs-filesystem and dill libraries

Signed-off-by: Spolti <[email protected]>
…rve#4024)

* Fix huggingface server not working with return_probabilities

Signed-off-by: oplushappy <[email protected]>

* Fix pytest huggingface server assertion error

Signed-off-by: oplushappy <[email protected]>

* Fix the lint error and add approx for assertion

Signed-off-by: oplushappy <[email protected]>

* Parse string output to dictionary for accurate assertion

Signed-off-by: oplushappy <[email protected]>

* Fix linting error

Signed-off-by: oplushappy <[email protected]>

---------

Signed-off-by: oplushappy <[email protected]>
* Add deeper readiness and liveness check for transformer

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add unit tests

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* put the feature behind flag

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update tests

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* resolve comments

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Make use of inference client

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add e2e test

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Make inference client singleton and lazy initialize

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Raise 503 If server is not ready / live

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add test for custom transformer with rest protocol

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Fix CI running out of space

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Increase memory limit

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Check for model ready

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Webhook debug

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Address reviews

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Check for retry count in grpc client

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Update python/kserve/kserve/model_server.py

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Sivanantham <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Sivanantham <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
* add storageaccesskey to azure env builder

Signed-off-by: bentohset <[email protected]>

* update integration and unit test for azure storage access key

Signed-off-by: bentohset <[email protected]>

* fix formatting

Signed-off-by: bentohset <[email protected]>

---------

Signed-off-by: bentohset <[email protected]>
* support single digit azure zone id

Signed-off-by: bentohset <[email protected]>

* add single digit azure dns zone id tests

Signed-off-by: bentohset <[email protected]>

* fix formatting

Signed-off-by: bentohset <[email protected]>

---------

Signed-off-by: bentohset <[email protected]>
* Fix trust_remote_code not passed in encoder model

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Add test

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Fix name conflict in e2e test

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Sivanantham <[email protected]>
* introduce the prepare-for-release.sh script

chore:	The purpose of this script is to facilitate the
	release process by updating the KServe version
	everywhere that is necessary.

fixes kserve#3399

Signed-off-by: Spolti <[email protected]>

* review - update release_process_v2.md

Signed-off-by: Spolti <[email protected]>

* Update hack/prepare-for-release.sh

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Filippe Spolti <[email protected]>

* Update hack/prepare-for-release.sh

Signed-off-by: Filippe Spolti <[email protected]>

* Update hack/prepare-for-release.sh

Signed-off-by: Filippe Spolti <[email protected]>

---------

Signed-off-by: Spolti <[email protected]>
Signed-off-by: Filippe Spolti <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
* LocalModelNode Daemonset Controller Skeleton (kserve#4026)

* hello world controller

Signed-off-by: Gavin Li <[email protected]>

* go fmt

Signed-off-by: Gavin Li <[email protected]>

* daemonset

Signed-off-by: Gavin Li <[email protected]>

* Update Makefile

Co-authored-by: Jin Dong <[email protected]>
Signed-off-by: Gavin Li <[email protected]>

* make generate

Signed-off-by: Gavin Li <[email protected]>

* install LocalModelNode CRD

Signed-off-by: Gavin Li <[email protected]>

* feedback

Signed-off-by: Gavin Li <[email protected]>

* make manifests

Signed-off-by: Gavin Li <[email protected]>

* agent

Signed-off-by: Gavin Li <[email protected]>

Co-authored-by: Jin Dong <[email protected]>

* LocalModelController creates LocalModelNode resource for ready nodes (kserve#4036)

* Manage localmodelNode

Signed-off-by: Jin Dong <[email protected]>

* Update patch

Signed-off-by: Jin Dong <[email protected]>

* Fix rbac

Signed-off-by: Jin Dong <[email protected]>

* Add a test to controller_test.go

Signed-off-by: Jin Dong <[email protected]>

* Update pkg/controller/v1alpha1/localmodel/controller.go

Co-authored-by: Dan Sun <[email protected]>
Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>
Co-authored-by: Dan Sun <[email protected]>

* Delete from LocalModelNode when the localmodel is deleted (kserve#4053)

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Address comments

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>

* Update Model status from LocalModelNode status (kserve#4056)

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Update status

Signed-off-by: Jin Dong <[email protected]>

* Update localmodel node status

Signed-off-by: Jin Dong <[email protected]>

* Remove job dependency from localmodel controller

Signed-off-by: Jin Dong <[email protected]>

* Remove some unused lines

Signed-off-by: Jin Dong <[email protected]>

* Add comments

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>

* LocalModelNode Agent that creates download jobs and update statuses from jobs (kserve#4075)

* download working

Signed-off-by: Gavin Li <[email protected]>

* delete working

Signed-off-by: Gavin Li <[email protected]>

* cleanup

Signed-off-by: Gavin Li <[email protected]>

* gofmt

Signed-off-by: Gavin Li <[email protected]>

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Cleanup code

Signed-off-by: Jin Dong <[email protected]>

* Fix lint

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Update status

Signed-off-by: Jin Dong <[email protected]>

* Update localmodel node status

Signed-off-by: Jin Dong <[email protected]>

* Remove job dependency from localmodel controller

Signed-off-by: Jin Dong <[email protected]>

* Remove some unused lines

Signed-off-by: Jin Dong <[email protected]>

* Add comments

Signed-off-by: Jin Dong <[email protected]>

* Update manager

Signed-off-by: Jin Dong <[email protected]>

* Update rbac

Signed-off-by: Jin Dong <[email protected]>

* Add tests and temporarily remove delete models code

Signed-off-by: Jin Dong <[email protected]>

* Do not create download jobs if model is already downloaded

Signed-off-by: Jin Dong <[email protected]>

* remove misleading log line

Signed-off-by: Jin Dong <[email protected]>

* Clean up code a little bit

Signed-off-by: Jin Dong <[email protected]>

* Update configurations

Signed-off-by: Jin Dong <[email protected]>

* update test

Signed-off-by: Jin Dong <[email protected]>

* Use a fixed name for the download container

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Jin Dong <[email protected]>
Co-authored-by: Gavin Li <[email protected]>

* Delete models from local disk when they are not in LocalModelNode spec (kserve#4084)

* download working

Signed-off-by: Gavin Li <[email protected]>

* delete working

Signed-off-by: Gavin Li <[email protected]>

* Delete model from LocalModelNode

Signed-off-by: Jin Dong <[email protected]>

* Initializer node status map

Signed-off-by: Jin Dong <[email protected]>

* Update status

Signed-off-by: Jin Dong <[email protected]>

* Update localmodel node status

Signed-off-by: Jin Dong <[email protected]>

* Update manager

Signed-off-by: Jin Dong <[email protected]>

* Update rbac

Signed-off-by: Jin Dong <[email protected]>

* Add tests and temporarily remove delete models code

Signed-off-by: Jin Dong <[email protected]>

* Do not create download jobs if model is already downloaded

Signed-off-by: Jin Dong <[email protected]>

* Delete function

Signed-off-by: Jin Dong <[email protected]>

* Update configurations

Signed-off-by: Jin Dong <[email protected]>

* Add test and Fix deletion code

Signed-off-by: Jin Dong <[email protected]>

* Use a fixed name for the download container

Signed-off-by: Jin Dong <[email protected]>

* Remove deleted models from status and periodically trigger reconciliation

Signed-off-by: Jin Dong <[email protected]>

* Fix storagecontainer permissions and a minor change

Signed-off-by: Jin Dong <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
Signed-off-by: Jin Dong <[email protected]>
Co-authored-by: Gavin Li <[email protected]>

---------

Signed-off-by: Jin Dong <[email protected]>
Signed-off-by: Gavin Li <[email protected]>
Co-authored-by: Gavin Li <[email protected]>
Co-authored-by: Jin Dong <[email protected]>
storage containers typo fix

Signed-off-by: Andrews Arokiam <[email protected]>
Support datetime object in v1/v2 response

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
* Update ClusterLocalModel to LocalModelCache

Signed-off-by: Dan Sun <[email protected]>

* Fix generation fmt

Signed-off-by: Dan Sun <[email protected]>

* black fmt

Signed-off-by: Dan Sun <[email protected]>

* Fix generated code

Signed-off-by: Dan Sun <[email protected]>

* Run go mod tidy

Signed-off-by: Dan Sun <[email protected]>

* Fix model status

Signed-off-by: Dan Sun <[email protected]>

---------

Signed-off-by: Dan Sun <[email protected]>
* Fix LocalModel controller reconciles deleted resource

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Rebase

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Fix path base routing e2e workflow

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
…erve#4003)

* Requeue and then double check the Pending status

Signed-off-by: Hannah DeFazio <[email protected]>

* Add test case, fix old tests

Signed-off-by: Hannah DeFazio <[email protected]>

* Check the return value for PropagateModelStatus, add knative failure case

Signed-off-by: Hannah DeFazio <[email protected]>

---------

Signed-off-by: Hannah DeFazio <[email protected]>
Co-authored-by: Hannah DeFazio <[email protected]>
* init

Signed-off-by: Gavin Li <[email protected]>

* broken code

Signed-off-by: Gavin Li <[email protected]>

* register webhook

Signed-off-by: Gavin Li <[email protected]>

* rename + working

Signed-off-by: Gavin Li <[email protected]>

* pass in client

Signed-off-by: Gavin Li <[email protected]>

* check storageURI

Signed-off-by: Gavin Li <[email protected]>

---------

Signed-off-by: Gavin Li <[email protected]>
…art (kserve#4111)

add localmodelnode agent image

Signed-off-by: Rituraj Singh <[email protected]>
Co-authored-by: Rituraj Singh <[email protected]>
* added vllm cpu image dockerfile

Signed-off-by: ayush <[email protected]>

* updated predictor controller to add '-gpu' suffix to huggingfaceserver image tag for GPU deployments

Signed-off-by: ayush <[email protected]>

* cleanup

Signed-off-by: ayush <[email protected]>

* added unit testcase for UpdateImageTag util

Signed-off-by: ayush <[email protected]>

* added documentation for vLLM CPU support

Signed-off-by: ayush <[email protected]>

* updated vllm-cpu example with llama 3.1 model

Signed-off-by: ayush <[email protected]>

* modified dockerfile to use vllm requirements-build to install dependencies

Signed-off-by: ayush <[email protected]>

* shifted to use vLLM with OpenVINO for CPU workloads

Signed-off-by: ayush <[email protected]>

* upgraded vllm and torch versions for huggingfaceserver

Signed-off-by: ayush <[email protected]>

* change base image to ubuntu

Signed-off-by: ayush <[email protected]>

* addressed comments in dockerfile and github workflow

Signed-off-by: ayush <[email protected]>

* added e2e test case

Signed-off-by: ayush <[email protected]>

* added huggingface_server_cpu_openvino image build in CI

Signed-off-by: ayush <[email protected]>

* updated poetry version

Signed-off-by: ayush <[email protected]>

* done linting

Signed-off-by: ayush <[email protected]>

* ran poetry lock --no-update

Signed-off-by: ayush <[email protected]>

* ran black formatting

Signed-off-by: ayush <[email protected]>

* removed huggingface server gpu image build in e2e tests

Signed-off-by: ayush <[email protected]>

* made separate job for e2e test of huggingface server vllm backend

Signed-off-by: ayush <[email protected]>

* updated vllm completion response in test

Signed-off-by: ayush <[email protected]>

* added vllm marker in pytest.ini file

Signed-off-by: ayush <[email protected]>

* reverted to vLLM v0.6.3.post1

Signed-off-by: ayush <[email protected]>

* added vllm-openvino limitations in documentation

Signed-off-by: ayush <[email protected]>

* updated poetry lock

Signed-off-by: ayush <[email protected]>

---------

Signed-off-by: ayush <[email protected]>
Signed-off-by: Ayush Sawant <[email protected]>
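
One commit in the series above adds a '-gpu' suffix to the huggingfaceserver image tag for GPU deployments. The sketch below shows what such a tag rewrite could look like. It is an illustration under stated assumptions, not the UpdateImageTag util from the commit: `withGPUSuffix` is a hypothetical name, the `gpuRequested` flag stands in for however the controller detects a GPU deployment, and the handling of untagged or digest-pinned images is a guess.

```go
package main

import (
	"fmt"
	"strings"
)

// withGPUSuffix (hypothetical helper) appends "-gpu" to the tag of an image
// reference when gpuRequested is true. References without a tag, or pinned
// by digest, are returned unchanged; that policy is an assumption.
func withGPUSuffix(image string, gpuRequested bool) string {
	if !gpuRequested {
		return image
	}
	// Digest-pinned references (name@sha256:...) have no tag to rewrite.
	if strings.Contains(image, "@") {
		return image
	}
	// A tag separator is a ':' that appears after the last '/'. This keeps
	// a registry port such as localhost:5000/... from being read as a tag.
	slash := strings.LastIndex(image, "/")
	colon := strings.LastIndex(image, ":")
	if colon <= slash {
		return image
	}
	// Do not append the suffix twice.
	if strings.HasSuffix(image, "-gpu") {
		return image
	}
	return image + "-gpu"
}

func main() {
	for _, img := range []string{
		"kserve/huggingfaceserver:v0.14.0",
		"localhost:5000/huggingfaceserver",
		"kserve/huggingfaceserver:v0.14.0-gpu",
	} {
		fmt.Println(withGPUSuffix(img, true))
	}
}
```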
* chore: use patch instead of update for finalizer changes

Signed-off-by: Derek Wang <[email protected]>

* go mod tidy

Signed-off-by: Derek Wang <[email protected]>

* lint

Signed-off-by: Derek Wang <[email protected]>

---------

Signed-off-by: Derek Wang <[email protected]>
Signed-off-by: Dan Sun <[email protected]>
Co-authored-by: Dan Sun <[email protected]>
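
The commit above switches finalizer changes from update to patch. The commit text does not say which patch type is used, so the following is only a sketch that assumes a JSON merge patch (RFC 7386); `finalizerMergePatch` is a hypothetical helper. A patch of this kind sends just the finalizers field instead of the whole object, so a write does not carry stale copies of unrelated fields.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// finalizerMergePatch (hypothetical helper) builds a JSON merge patch body
// that sets metadata.finalizers to the given list. A merge patch replaces
// arrays as a whole, so the caller passes the full desired list.
func finalizerMergePatch(finalizers []string) ([]byte, error) {
	if finalizers == nil {
		// Encode as [] rather than null; in a merge patch, null means
		// "remove this field".
		finalizers = []string{}
	}
	patch := map[string]any{
		"metadata": map[string]any{
			"finalizers": finalizers,
		},
	}
	return json.Marshal(patch)
}

func main() {
	// "example.com/cleanup" is a placeholder finalizer name.
	body, err := finalizerMergePatch([]string{"example.com/cleanup"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```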
* Fix localmodelcache permission for isvc

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

* Patch localmodelcache webhook for kubeflow overlay

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>

---------

Signed-off-by: Sivanantham Chinnaiyan <[email protected]>
@hdefazio

/rerun-all

@hdefazio

hdefazio commented Dec 19, 2024

test

Signed-off-by: Edgar Hernández <[email protected]>

openshift-ci bot commented Dec 19, 2024

@hdefazio: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-fast 9469f1f link true /test e2e-fast
ci/prow/e2e-slow 9469f1f link true /test e2e-slow
ci/prow/e2e-raw 9469f1f link true /test e2e-raw

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@hdefazio hdefazio requested a review from israel-hdez December 19, 2024 17:50

openshift-ci bot commented Dec 20, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hdefazio, israel-hdez

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [hdefazio,israel-hdez]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@israel-hdez israel-hdez merged commit 4b2f139 into opendatahub-io:master Dec 20, 2024
26 of 30 checks passed