TELCODOCS-1788: OpenShift 4.15 secrets disabled #27

Merged 3 commits on Mar 15, 2024

@mikemckiernan added the "documentation" label (Issue/PR focused on fixing/editing/adding documentation bits) on Mar 8, 2024
@mikemckiernan self-assigned this on Mar 8, 2024

github-actions bot commented Mar 8, 2024

Documentation preview

https://nvidia.github.io/cloud-native-docs/review/pr-27

Member Author

@mikemckiernan mikemckiernan left a comment

@StephenJamesSmith, I thought that I submitted this feedback before I lost internet connectivity. PTAL.

Special Considerations for OpenShift 4.15
=========================================

In OpenShift 4.15, secrets are no longer automatically generated when the integrated OpenShift image registry is disabled.
Member Author

I honestly got lost about whether the issue was the registry is disabled or if it was that storage for the registry wasn't configured.

the ``DriverToolkit DaemonSet`` when it checks for the existence of a ``build-dockercfg`` secret for
the Driver Toolkit service account. This results in a stalled state for the NVIDIA GPU Operator:

.. code-block:: console
Member Author

Because this is prep for installation, before checking pod status, I think Erwan suggested that customers could check if storage was configured ("Managed"?) for the image registry.

I've lost the output at this point, but on a bare-metal deployment, the following command did not show the storage as "Managed" (IIRC):

oc describe configs.imageregistry.operator.openshift.io cluster

On the cluster that Erwan started for me (AWS, not BM), I was noodling around with the following command to see if we could cook up a statement like "If your output looks like (or does not look like), then the NVIDIA GPU Operator installation will be able to generate secrets and no further action is required. Proceed to ":

oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.storage}'

...but I had an AWS cluster, and the problem can't be reproduced on AWS.

% oc patch configs.imageregistry.operator.openshift.io cluster --type merge --patch '{"spec":{"managementState":"Managed"}}'


When the registry is in Managed state, the NVIDIA GPU Operator creates the secrets:
Member Author

While this is true, the customer won't have installed the Operator yet, I don't think. So, I don't think we need from here to the end. We could include a sample command that indicates the installation will succeed. I think it was the oc describe configs.imageregistry.operator.openshift.io cluster that Erwan suggested, with the key piece being the section under the Storage: field. As long as storage was configured, my understanding is that the secrets could be made and the Operator installation would succeed.

@mikemckiernan
Member Author

Stephen, I put a few notes at the end of Erwan's document. Unless someone knows of a better way, I'm thinking that the oc get configs.imageregistry.operator.openshift.io cluster -o jsonpath='{.spec.storage}' command is the best way to predict failure and the need to configure storage for the image registry.
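
A minimal sketch of how that check might be scripted. The thread did not capture the exact output for a registry without storage, so this assumes that an unconfigured registry prints an empty value, `{}`, or `map[]` for `.spec.storage`. The function name `storage_configured` is hypothetical and only for illustration.

```shell
# Hypothetical helper: succeeds (exit status 0) when the printed
# .spec.storage value looks non-empty, fails (exit status 1) otherwise.
# Assumption: an unconfigured registry prints "", "{}", or "map[]".
storage_configured() {
  case "$1" in
    "" | "{}" | "map[]") return 1 ;;
    *) return 0 ;;
  esac
}

# Possible usage on a cluster where `oc` is logged in (not run here):
#   storage=$(oc get configs.imageregistry.operator.openshift.io cluster \
#     -o jsonpath='{.spec.storage}')
#   if storage_configured "$storage"; then
#     echo "Registry storage appears to be configured."
#   else
#     echo "Registry storage is not configured; configure it before installing."
#   fi
```

The helper only inspects the string it is given, so it can be tried with sample values before running it against a real cluster.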

@StephenJamesSmith
Contributor

StephenJamesSmith commented Mar 11, 2024

@mikemckiernan Saw your "sloppy notes". Will you be making these changes to the PR?


You do not need an entitlement on OpenShift Container Platform versions greater than 4.9.9.
This change affects the instatllation of NVIDIA GPU Operator.
Contributor

s/instatllation/installation

Member Author

I need to find a different codespell hook.

pre-commit run --files openshift/steps-overview.rst 
mixed line ending........................................................Passed
trim trailing whitespace.................................................Passed
check yaml...........................................(no files to check)Skipped
codespell................................................................Passed

Thank you for pointing it out to me.

Contributor

@StephenJamesSmith StephenJamesSmith left a comment

Found just 1 typo.

@egallen

egallen commented Mar 14, 2024

@mikemckiernan except for the typo, the content is good.
/lgtm

StephenJamesSmith and others added 3 commits March 15, 2024 08:06
Signed-off-by: Mike McKiernan <[email protected]>
@mikemckiernan mikemckiernan merged commit a064c52 into main Mar 15, 2024
5 checks passed
@mikemckiernan mikemckiernan deleted the TELCODOCS-1788 branch March 15, 2024 12:13