
Initial Configuration API Design #1

Merged: 5 commits into kubernetes-sigs:main on Jan 16, 2025

Conversation

LionelJouin (Member)

  • Init go project
  • Initial API design documentation
  • Code Generation
  • Initial API struct

The document to review is docs/design/device-configuration-api.md.
I took the hack/update-codegen.sh code from the Gateway-API project and adapted it to this project.
This is the initial API; it will be extended later. For now we can use it to invoke the CNI ADD and DEL operations.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Dec 16, 2024
* Init go project
* Initial API design documentation
* Code Generation
* Initial API struct

Signed-off-by: Lionel Jouin <[email protected]>

aojea commented Dec 16, 2024

@s1061123 @dougbtv can you please take a look?

dougbtv left a comment

I approve, wholeheartedly! The conversation herein is questions and thoughts that I have. But at this point I believe getting the repository started with the first PR is most important, and those details can be hammered out; nothing in my comments is a show-stopper.

```yaml
- macvlan-eth0
opaque:
  driver: cni.dra.networking.k8s.io
  parameters: # CNIParameters with the GVK, interface name and CNI Config (in YAML format).
```

This is.... quasi-uncharted territory, but... I think it'll work, and I think people will like it overall. It's an improvement over "straight up json in the yaml" -- for sure. Here's the tricky part... It's hard to validate. I wonder what it'll look like if it's invalid, a CNI plugin kicks it back, and then we see the failing pod... I'm curious how that experience will be for the user. This is a common CNI problem I see users facing.


As Doug mentioned, somewhere we need to describe the way to convert from JSON to YAML, because even if the conversion is intuitive, there is currently no official document for it. Otherwise, we will get issues in this repo asking: how do I apply my current CNI JSON file?
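
For illustration, here is a typical macvlan CNI config in its usual JSON form and a line-for-line YAML rendering of it (a sketch only; the field values are made up and not taken from the design doc):

```json
{
  "cniVersion": "1.0.0",
  "name": "macvlan-eth0",
  "type": "macvlan",
  "master": "eth0",
  "ipam": {
    "type": "host-local",
    "ranges": [[{ "subnet": "10.10.1.0/24" }]]
  }
}
```

```yaml
cniVersion: 1.0.0
name: macvlan-eth0
type: macvlan
master: eth0
ipam:
  type: host-local
  ranges:
  - - subnet: 10.10.1.0/24
```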

Member Author

I understand. Maybe a solution could be to accept JSON format in the config field (type runtime.RawExtension) and have a mutating webhook that will json.Unmarshal the CNI config and put it back in the config field (config.raw). This way, the CNI config will be transformed into YAML format even though the user applied it in JSON.
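
A minimal sketch of that webhook idea, assuming the config field is a `runtime.RawExtension` as suggested above (the helper name and package layout are hypothetical):

```go
package webhook

import (
	"encoding/json"
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
	sigsyaml "sigs.k8s.io/yaml"
)

// normalizeCNIConfig re-encodes a JSON CNI config as YAML so the stored form
// is consistent, while still letting users submit plain JSON.
func normalizeCNIConfig(ext *runtime.RawExtension) error {
	var cfg map[string]interface{}
	// If the raw bytes are not JSON, assume they are already YAML and leave them untouched.
	if err := json.Unmarshal(ext.Raw, &cfg); err != nil {
		return nil
	}
	out, err := sigsyaml.Marshal(cfg)
	if err != nil {
		return fmt.Errorf("re-encoding CNI config as YAML: %w", err)
	}
	ext.Raw = out
	return nil
}
```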

* [cni.dev](https://www.cni.dev/)
* [k8snetworkplumbingwg/multus-cni](https://github.com/k8snetworkplumbingwg/multus-cni)
* [KEP-4817 - Resource Claim Status with possible standardized network interface data](https://github.com/kubernetes/enhancements/issues/4817)
* [containernetworking/cni#1132 - VALIDATE Operation](https://github.com/containernetworking/cni/issues/1132)

Well clearly you've already thought about the validation, because... Here it is :) So... Yeah. This may be inevitable. Maybe each terminal CNI plugin can provide its own kind of schema.

Member Author

Yes, I think that's the idea with this CNI issue: each CNI plugin should provide a validation schema. I will try to work on it soon in CNI; once this is in place we could have full validation.
One of the issues with the CNI-DRA-Driver might be how to gather all the schemas.


If the invocation of the CNI `ADD` operation fails, the `DEL` operation will be invoked immediately to clean up the failing interface. An error will also be returned during pod creation to fail the pod creation.

The interface will still be reported in the `status.devices` field of the ResourceClaim but with the `Ready` condition type set to `False`, with the `reason` field set to `NetworkInterfaceNotReady`, and with the `message` field set with the error returned while calling the CNI `ADD` operation.
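
A rough sketch of how such a failure could surface in the ResourceClaim status (the device, driver, and pool names are made up; the condition fields follow the text above):

```yaml
status:
  devices:
  - device: macvlan-eth0                  # hypothetical device name
    driver: cni.dra.networking.x-k8s.io   # hypothetical driver name
    pool: worker-1
    conditions:
    - type: Ready
      status: "False"
      reason: NetworkInterfaceNotReady
      message: 'CNI ADD failed: <error returned by the plugin>'
```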

I think this is really clean to have that message from the failed ADD available. Good one. Do we need any messaging other than the static "CNI-DRA-Driver configured the interface" on a successful ADD?

Member Author

Do you have anything else in mind?

```yaml
uid: 680f0a77-8d0b-4e21-8599-62581e335ed6
```

#### Failures

Do we need to have anything about failed CNI DELs? Or, does it not matter since the pod object is being deleted at this point?

Member Author

Maybe we do need something there. The pod could be deleted, but maybe we could block the deletion of the ResourceClaim and keep retrying the DEL?
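
One possible shape of that idea, sketched with a hypothetical finalizer name (the API version and names are illustrative only):

```yaml
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: macvlan-eth0-attachment
  finalizers:
  - cni.dra.networking.x-k8s.io/cni-del   # removed by the driver only once the CNI DEL has succeeded
```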

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dougbtv, LionelJouin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Opaque Parameter validation:
* All properties in the CNI object must be valid.
* The CNI config must follow correct syntax and semantics.


I'd like to know how to validate the CNI object. For example, will cni-dra-driver check each plugin's parameters somehow, as follows? (This is an invalid config sample with 3 syntax/semantics errors. Do you see what is wrong?)

```yaml
- type: bridge
  master: ens2f0np0
  hairpinMode: yes-we-need!
  vlan: "110"
  (snip)
```

Yeah, I understand that validating the user input is important, but in this spec I'd like to know what should be validated and what should be skipped. Currently the CNI community does not have a 'CNI config validator', and it is actually impossible to implement one because each CNI plugin introduces its own parameters and we cannot clarify which are 'required' or 'optional'. From the CNI semantics point of view, I'm wondering whether we can verify the CNI config or not...

Member Author

My idea was to rely on the CNI project to provide the tools for that (they don't exist yet, but we can build them from this issue: containernetworking/cni#1132). I don't think the CNI-DRA-Driver should try to validate anything else in the CNI config by itself.


From the above description, it (validation) seems to be a 'must have'. How about adding some text to mention that validation is optional? At least, current CNI plugins do not have containernetworking/cni#1132, so you cannot use it for now.

Member Author

Thank you for the suggestion, I added a note to mention that the feature must first be provided by the CNI project.

* A ResourceClaim utilizing the device class `cni.networking.k8s.io` must be claimed by one and only one pod.

Opaque Parameter validation:
* All properties in the CNI object must be valid.


What does cni-dra-driver validate in the CNI object? I mean, does cni-dra-driver check the CNI object syntax, or more? I'd like to know how cni-dra-driver checks each field (from a syntax/semantics point of view).

Member Author

I was referring to the other properties of the CNI object, like IfName; I clarified this in the latest commit.

hack/update-codegen.sh (outdated; resolved)
```
@@ -0,0 +1,15 @@
/*
Copyright The Kubernetes Authors.
```

@MikeZappa87

@LionelJouin looking forward to reviewing this tomorrow!

@shaneutt (Member)

/cc

```yaml
- opaque:
    driver: cni.dra.networking.k8s.io
    parameters:
      apiVersion: cni.dra.networking.k8s.io/v1alpha1
```

If this is going to be an opaque config of the drivers, I feel we should not use the reserved namespace k8s.io and instead use x-k8s.io: https://github.com/kubernetes/community/blob/master/sig-architecture/api-review-process.md#voluntary

Member Author

Thank you, I also replaced the driver name and the DeviceClass name. I am not sure if it should apply there.


## Design

A new API with the Kind `CNI` will be introduced under the `cni.networking.k8s.io` Group. This API will provide the CNI configuration along with the necessary parameters and optional fields for invoking CNI plugins. The configuration data specified in the `opaque.parameters` will be reflected in the ResourceClaim status by Kubernetes. This reported status will provide the necessary details for executing CNI operations such as `ADD` on pod creation and `DEL` on pod deletion, enabling seamless lifecycle management.

it sounds more like CNIConfig

Member Author

To me, the CNI config is the "JSON-serialized plugin configuration object"; there are also the "required environment parameters". That's why I called this CNI, but a better name is welcome.
https://www.cni.dev/docs/spec/#add-add-container-to-network-or-apply-modifications
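
For reference, the "required environment parameters" from the CNI spec are the `CNI_*` variables passed when the plugin binary is executed (the values below are placeholders):

```sh
CNI_COMMAND=ADD                    # ADD, DEL, CHECK, or VERSION
CNI_CONTAINERID=<container-id>
CNI_NETNS=/var/run/netns/<ns>      # path to the network namespace
CNI_IFNAME=net1                    # interface name to create in the container
CNI_ARGS=<key=value;key=value>     # optional extra arguments
CNI_PATH=/opt/cni/bin              # plugin search path
```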


As a reader of the Kind CNIConfig, I would understand what its purpose is. CNIConfig is a good suggestion.


+1

Member Author

Thank you, I renamed the CNI struct to CNIConfig.

Comment on lines +33 to +34
```go
// IfName represents the name of the network interface requested.
IfName string `json:"ifName"`
```
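
A rough sketch of the surrounding type after the rename, assembled from the excerpt above, the `CNIConfig` rename, and the `runtime.RawExtension` idea discussed earlier (the exact fields may differ in the merged code):

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
)

// CNIConfig carries the parameters needed to invoke a CNI plugin for one requested interface.
type CNIConfig struct {
	metav1.TypeMeta `json:",inline"`

	// IfName represents the name of the network interface requested.
	IfName string `json:"ifName"`

	// Config holds the CNI plugin configuration (the JSON-serialized plugin
	// configuration object, possibly written as YAML).
	Config runtime.RawExtension `json:"config"`
}
```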

I miss more detail about the entire workflow. Let's see: there is a driver that exposes Devices in a ResourceSlice with some Attributes and Quantities; what is the driver going to expose there?

Then there is a user who is going to choose to schedule a Pod using a Claim that references one of these Devices... and they can use a DeviceClass and/or ResourceClaim to pass configuration... which I assume is this object. How does the driver merge the configs from the Class and the Claim?
How does the driver execute against the exposed Device? This seems to create an entirely new device.

Member Author

I still have to investigate this part in more detail and I wanted to try to address it in another design document. I am not sure the current ResourceSlice API supports it since, as you mentioned, a new interface is being created with this driver. Here are some ideas/examples I had in mind to address in the scheduler part (some were mentioned some time ago during the multi-network meeting):

  • Interface Availability: Macvlan is requested, the pod must be scheduled on a node where the master interface exists.
  • IP Pool Exhaustion: The pod must be scheduled on a node where the IPAM can provide IPs (let's say a subnet is provided per node).
  • CNI Availability: The CNI must exist on the node.
  • Bandwidth limits: A pod must be scheduled on a node where the master interface/interface provided will fulfill the bandwidth requirement.

Otherwise, for the DeviceClass, I don't really have any configuration ideas. It will be provided by this project (cni-dra-driver), right? Only the ResourceClaim will be provided by the user.


Let's start from the use cases then and walk through them.

> Interface Availability: Macvlan is requested, the pod must be scheduled on a node where the master interface exists.

OK, so we need to expose "master interfaces" and users need to be able to "share" this master interface.
I talked with @johnbelamaric the other day about this; the simplest solution as of today seems to be adding some kind of quota in the driver, say "interface X has 50 macvlan type subinterfaces", and users can request them and the driver can track them.

> IP Pool Exhaustion: The pod must be scheduled on a node where the IPAM can provide IPs (let's say a subnet is provided per node).

Is this the primary IPAM or secondary IPAMs? We need to model this in terms of resources; is the resource an IPPool?

> CNI Availability: The CNI must exist on the node.

Can you be more specific on what CNI availability means? If in the example of macvlan we mean a macvlan CNI plugin, then that is determined by the resources exposed: you expose macvlan subinterfaces in the driver only if you have that capability.

> Bandwidth limits: A pod must be scheduled on a node where the master interface/interface provided will fulfill the bandwidth requirement.

IIUIC this means that the driver needs to expose the total bandwidth of the uplink and each pod in the node must claim some bandwidth, no? This only works if the pods claim bandwidth so the driver can track it.

Member Author

> OK, so we need to expose "master interfaces" and users need to be able to "share" this master interface.
> I talked with @johnbelamaric the other day about this; the simplest solution as of today seems to be adding some kind of quota in the driver, say "interface X has 50 macvlan type subinterfaces", and users can request them and the driver can track them.

Thank you for checking this. This seems fine to me; I believe that would be the first step for the scheduling part.

> Is this the primary IPAM or secondary IPAMs? We need to model this in terms of resources; is the resource an IPPool?

I was referring to any IPAM (primary or secondary, it doesn't really matter). I am not sure this could be solved with this driver. Maybe the IPAMs could publish and model the resources by themselves somehow?

> Can you be more specific on what CNI availability means? If in the example of macvlan we mean a macvlan CNI plugin, then that is determined by the resources exposed: you expose macvlan subinterfaces in the driver only if you have that capability.

I meant the CNI binary itself exists on the node.

> IIUIC this means that the driver needs to expose the total bandwidth of the uplink and each pod in the node must claim some bandwidth, no? This only works if the pods claim bandwidth so the driver can track it.

Right, this is correct. Probably, if a pod uses a specific bandwidth of an interface, then only pods requesting specific bandwidth could be scheduled on that node?

These are some ideas to solve in the scheduling part. There are also the VLAN/VxLAN IDs: a pod requesting a VLAN with ID 2000 cannot be scheduled if that VLAN is already in use on that node (with the same master interface).


Hi,
Please allow me to join this discussion. I'm also interested in this project, especially to support shared network devices. However, I don't think this discussion should block this PR.

I have checked the latest code of Kubernetes's DRA allocator. There is no reduction of the allocated amount from the device; in my understanding, the amounts are used only for matching. However, I also think it would be useful for some other cases as well to have a feature that deducts from the available capacity in the attribute and tracks the remainder, allowing the device to be shared.

Code pointers:

I did a PoC by allowing the latest claim status to be stored and merging the results, to have macvlan work with the as-is idea of a shared resource claim (as in the GPU example). The merged results I have from this example for now are:

```yaml
  devices:
  - conditions: null
    data:
      results:
      - cniVersion: 1.0.0
        interfaces:
        - mac: 66:19:a8:eb:09:ce
          name: net1
          sandbox: /var/run/netns/cni-4f0a3a8f-f68d-a71b-82de-7e4c166f4501
        ips:
        - address: 10.10.1.55/24
          gateway: 10.10.1.1
          interface: 0
      - cniVersion: 1.0.0
        interfaces:
        - mac: 96:44:e9:f3:fb:99
          name: net1
          sandbox: /var/run/netns/cni-acb0eed4-39d8-e525-f722-89c310f4b222
        ips:
        - address: 10.10.1.56/24
          gateway: 10.10.1.1
          interface: 0
    device: cni
    driver: poc.dra.networking
    networkData:
      hardwareAddress: 66:19:a8:eb:09:ce,96:44:e9:f3:fb:99
      interfaceName: net1
      ips:
      - 10.10.1.55/24
      - 10.10.1.56/24
    pool: kind-worker
  reservedFor:
  - name: demo-b-1
    resource: pods
    uid: 25fc07d1-e28a-4fb7-b99b-9f0c475c90d0
  - name: demo-b-2
    resource: pods
    uid: 07b785a9-c5f6-4b70-b310-e10d32b1b219
```


No blocking at all. What I'm trying to say is that to design an API we need to understand the use cases that we want to support. Let's pick 1 or 2 to start with, the ones you are sure about, and then we walk through the use cases and represent how the user is expected to use them... it is like writing an e2e test before coding.

I think we have it more or less clear how the claim config will look and what its output is, but we first need to define how the driver is going to expose things and how the user will claim them; those are fundamental pieces that at least I fail to see here.


@aojea Thank you for your explanation. I somehow see the point.

To have a picture of the result from the defined CNI configuration, how about adding an example result of the configured devices inside the Pod (from the macvlan+host-local config example)? Or having an illustration of how the driver passes the configuration from the claim config to the CNI.

About exposing the network device, I also see that the document still misses the definitions of the resource naming and attributes that will be published to the ResourceSlice. I would like to suggest also including the host device's common attributes such as PCI address, vendor, product and IP address. I have exposed and used this information in the multi-nic CNI project's interface discovery function to cover configuring macvlan, ipvlan, and SR-IOV networks.

For the host-device CNI plugin, I think we are good with the current design. A user/admin can use a ResourceClaim or ResourceClaimTemplate for that as described in the doc. We may need to add an example of this use case.
For shared devices like macvlan and ipvlan, the pod-end device is created on the fly, and I don't think we can have multiple ResourceClaims linked to the same device in the current version. In my understanding, we need a pre-defined ResourceClaim (per master) and have user pods linked to it.

Is the goal of this document to design something that works with the current version of DRA or for the future of DRA?

Member Author

This document defines the configuration API. My goal is to have a separate document for the allocation API (scheduling).

In that new document (allocation API) we could define each use case (virtual devices, macvlan, vlan, bandwidth...) and start with the simplest ones for now (1 or 2 would be good as @aojea mentioned; macvlan and host-device seem fine to me). More advanced use cases might require DRA changes/updates/KEPs, but we will see at that time, I guess.

```go
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object
```

do we need the generated code?

Member Author

I am not fully sure yet. It's just that the example DRA drivers were doing it.

aojea commented Dec 18, 2024

I think those were based on the CRDs and the legacy/classic DRA; we may not need that complexity.


## Design

A new API with the Kind `CNI` will be introduced under the `cni.networking.x-k8s.io` Group. This API will provide the CNI configuration along with the necessary parameters and optional fields for invoking CNI plugins. The configuration data specified in the `opaque.parameters` will be reflected in the ResourceClaim status by Kubernetes. This reported status will provide the necessary details for executing CNI operations such as `ADD` on pod creation and `DEL` on pod deletion, enabling seamless lifecycle management.
Member

Small thing I figured I would ask while I'm here: is/how-is the version of the underlying CNI itself being handled?


I believe since it'll basically have a YAML representation of a CNI config JSON, it'll have the version information there for when it's exec'd. (Definitely correct me if I'm wrong there, Lionel.)
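
To make that concrete, here is a sketch of how the CNI spec version would travel inside the embedded config (the group name is assumed after the x-k8s.io rename; the earlier excerpt used cni.dra.networking.k8s.io/v1alpha1):

```yaml
parameters:
  apiVersion: cni.networking.x-k8s.io/v1alpha1   # assumption, post-rename
  kind: CNIConfig
  ifName: net1
  config:
    cniVersion: 1.0.0          # the underlying CNI version is carried here
    name: macvlan-eth0
    type: macvlan
    master: eth0
    ipam:
      type: host-local
      ranges:
      - - subnet: 10.10.1.0/24
```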

As each ResourceClaim can be claimed by at most one pod, a ResourceClaimTemplate can be used to link the same ResourceClaim configuration to multiple pods.

To support scenarios where multiple network interfaces are required, a ResourceClaim can contain one or multiple requests and a pod can reference one or multiple ResourceClaims.

Member

What do you think about adding to this section something about the roles of users who are expected to use or be affected by this API and its consequences? Could it be valuable to explicitly spell out who the users/roles are?

Member Author

You refer to cluster operators, application developers... right?
Maybe it could be useful, yes. Maybe we should have another document about that? Or maybe this should be part of the DRA documentation instead?
From what I understand, ResourceClaims are deployed mainly by developers, and drivers are deployed by operators.


The design aims to support all possible functionalities the Container Network Interface (CNI) project offers while ensuring compliance with its specifications.

## Design
Member

Since this design introduces a whole new API, do you think it would be valuable to add a ## Motivation section here for future readers to better understand the "why"?

Member Author

Thank you for the suggestion, I added the motivation section.

@k8s-ci-robot k8s-ci-robot removed the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 7, 2025
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jan 7, 2025
* Add Motivation section for the device configuration API
* Rename CNI struct to CNIConfig in the API
* Remove generated code and codegen
* Add note for the CNI config validation

Signed-off-by: Lionel Jouin <[email protected]>
@LionelJouin LionelJouin changed the title Initial API Design Initial Configuration API Design Jan 15, 2025
@LionelJouin LionelJouin merged commit c9899d2 into kubernetes-sigs:main Jan 16, 2025
1 of 2 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.