Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Flyte Admin RBAC + Project/Domain Isolation #5871

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Sovietaced
Copy link
Contributor

@Sovietaced Sovietaced commented Oct 20, 2024

Tracking issue

Related to #5189
Related to #4622

Signed-off-by: Jason Parraga <[email protected]>
Copy link

codecov bot commented Oct 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 36.71%. Comparing base (bdaf79f) to head (ef505e2).
Report is 16 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #5871   +/-   ##
=======================================
  Coverage   36.71%   36.71%           
=======================================
  Files        1304     1304           
  Lines      130081   130081           
=======================================
  Hits        47764    47764           
  Misses      78147    78147           
  Partials     4170     4170           
Flag Coverage Δ
unittests-datacatalog 51.58% <ø> (ø)
unittests-flyteadmin 54.41% <ø> (ø)
unittests-flytecopilot 11.73% <ø> (ø)
unittests-flytectl 62.40% <ø> (ø)
unittests-flyteidl 6.89% <ø> (ø)
unittests-flyteplugins 53.62% <ø> (ø)
unittests-flytepropeller 42.84% <ø> (ø)
unittests-flytestdlib 54.78% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Signed-off-by: Jason Parraga <[email protected]>
@Sovietaced Sovietaced marked this pull request as ready for review October 21, 2024 05:48
@eapolinario
Copy link
Contributor

cc: @robert-ulbrich-mercedes-benz

* This is a dedicated utility function used in repository code to create resource since you cannot add a WHERE clause filter for records that don't exist yet :)
* ```go
func (r *ExecutionRepo) Create(ctx context.Context, input models.Execution, executionTagModel []*models.ExecutionTag) error {
if err := util.AuthorizeResourceCreation(ctx, input.Project, input.Domain); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interesting, I like the middleware approach because it lets you intercept new service methods for free. In this proposal, how do you ensure that new DB methods always call the authorization util?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is cooperative so you cannot ensure it. The implementation I have adds it for every resource possible IIRC so it would probably have to be enforced through code review

- name: read-only
rules:
- name: "read everything"
methodPattern: "Get.*|List.*"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious about your thoughts on brittleness for regex parsing here. We mostly try to follow restful gRPC method names but we have a few notable exceptions like TerminateExecution, RecoverExecution, RelaunchExecution. If upsteam OSS Flyte merges a change that adds a new syntax, is it the responsibility of whoever deploys Flyte at their company and manages these policies to update the regexes appropriately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using something besides .* on the method pattern is opt in so yeah I think it would be up to the company to update their regexes and I'd imagine that is the behavior they would want. Ideally they validate the ux in non-prod and roll out to prod.

I'm also open to supporting different matching types like prefix and exact matches.

rules:
- name: "r/w for every project in production"
methodPattern: ".*"
domain: production # you can wildcard project and declare domain level access
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does the same work for wild carding domain if you want, for example, mapping team write access?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes absolutely

#### Authorization Utils + DB Layer

The final piece of the puzzle is what performs resource level authorization and filtering. Historically, I have found
that the best (albeit challenging) way to do this is at the database layer for a few reasons:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proposal makes sense, but there's some notable instances where we could have unauthorized side effects

specifically, for create execution, a user who has read access may be able to read the respective launch plan, we then create the CRD and only then insert the model in the DB (which gets rejected) here:

workflowExecutor := plugins.Get[workflowengineInterfaces.WorkflowExecutor](m.pluginRegistry, plugins.PluginIDWorkflowExecutor)
execInfo, execErr := workflowExecutor.Execute(ctx, workflowengineInterfaces.ExecutionData{
Namespace: namespace,
ExecutionID: workflowExecutionID,
ReferenceWorkflowName: workflow.Id.Name,
ReferenceLaunchPlanName: launchPlan.Id.Name,
WorkflowClosure: workflow.Closure.CompiledWorkflow,
WorkflowClosureReference: storage.DataReference(workflowModel.RemoteClosureIdentifier),
ExecutionParameters: executionParameters,
OffloadedInputsReference: inputsURI,
})
if execErr != nil {
createExecModelInput.Error = execErr
m.systemMetrics.PropellerFailures.Inc()
logger.Infof(ctx, "failed to execute workflow %+v with execution id %+v and inputs %+v with err %v",
request, workflowExecutionID, executionInputs, execErr)
} else {
m.systemMetrics.AcceptanceDelay.Observe(acceptanceDelay.Seconds())
createExecModelInput.Cluster = execInfo.Cluster
}
executionModel, err := transformers.CreateExecutionModel(createExecModelInput)
if err != nil {
logger.Infof(ctx, "Failed to create execution model in transformer for id: [%+v] with err: %v",
workflowExecutionID, err)
return nil, nil, nil, err
}
executionTagModel, err := transformers.CreateExecutionTagModel(createExecModelInput)
if err != nil {
logger.Infof(ctx, "Failed to create execution tag model in transformer for id: [%+v] with err: %v",
workflowExecutionID, err)
return nil, nil, nil, err
}
return ctx, executionModel, executionTagModel, nil

so in this scenario, we create a rogue Flyte CRD that never gets committed to the db (operation fails with permission denied) that propeller will attempt to process and send event updates for, leading to potential system weirdness

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point, I wasn't aware of that interaction.

I think there are a couple ways to work around this.

  1. Bring authorization for creation into the application layer (in addition to the db layer I suppose).
  2. Rework the code such that the execution is written to the database (maybe partially) prior to creating the CRD. And then writing the rest to the DB after the CRD is created if necessary.

I am being a bit naive with the complexity to approach number 2 but I do think this is a general problem with distributed systems. For example, the way the code is written as is makes the system susceptible to ghost CRDs since the system could crash before writing to the DB. I also understand this issue is probably much more likely to happen with authorization errors though :)

The architectural issue I see with this type of behavior boils down to how business logic does what is effectively a distributed transaction across other services (submitting CRDs in this case) and writing to the database. The most robust way to attempt this sort of behavior that I've seen is to always write to the DB first (this is the source of truth) and then opportunistically do the rest of the distributed transaction. For cases where the server crashes after writing to the DB you'll basically need some anti entropy background task to handle edge cases.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For sure, it's definitely a problem now even without authz checks failing. I recall we waffled on this approach initially because we didn't want to commit the entry to the db until after the CRD was created (since the workflowengine packages gracefully handles already exists on retries, but falsely implying an execution was created in the reversed case was more confusing)

if we do bring authorization to the application layer (even for these exceptional cases) I'm a bit concerned we're now depending on code review to enforce authorization checks in both the db and application levels


#### Authorization Config
The authorization config will be used to configure the authorization behavior. The first thing it describe is a way to
resolve roles from the user's identity context. Ideally this should be flexible enough to resolve a role from different
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for a multi-cluster set-up, where the control plane (admin) and data plane (propeller) reside on separate clusters, propeller authenticates with a set of client credentials against flyteadmin to send execution events and create executions (child workflows, e.g. calling a launch plan in code)

if I'm understanding correctly, in this case we could use the userID of the propeller credentials to permit deployment-wide RW access? this isn't quite userID, but I think your suggestion can very well support applicationID using this approach. My understanding is that it's a bit more tricky to add scopes & claims to applications rather than user identities

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use a multi cluster setup and admittedly our implementation of this proposal actually has a separate config field for allow listing certain user IDs, which in our case are Okta client Ids for the propeller Okta app. This allow list gives them full access to all APIs and all projects/domains. This was more of an incremental thing I added when I realized very quickly that authorization broke propeller :)

When I was writing the proposal I realized that I could probably do away with user ID allow lists in our implementation and support role resolution more generally by allowing users to pick out which values from the token they want to potentially match with policies. So yeah, I think for Flyte workloads you'd just turn on the user ID or application ID resolution strategy and create a role like so, not having to deal with scopes at all.

 - name: 0oahjhk34aUxGnWcZ0h7 # the names can even include things like okta app IDs
      rules:
        - name: "flyte propeller"
          methodPattern: ".*"

I guess to be clear, the role resolution strategies configured will ultimately generate a set of strings and then try and match those strings against the policy map keys.

But I'm open to doing something special for the Flyte workloads

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this proposal sounds like it unblocks propeller access! I'm curious how this works for supporting dynamically provisioned app credentials (besides just propeller, but for things like CICD, etc) since I believe Okta doesn't allow you to attach claims to applications unless you use custom claims in the authorization server which is a bit more involved

@davidmirror-ops davidmirror-ops added the rfc A label for RFC issues label Oct 24, 2024
@davidmirror-ops
Copy link
Contributor

davidmirror-ops commented Nov 7, 2024

2024-11-7 Contributor's sync notes: Jason introduced the proposal and expects to work on the implementation as a feature gate. Community feedback is important to learn how the community is using providers other than Okta to handle authz and groups.

@chicco785
Copy link

What's the current status of the proposal?

@Sovietaced
Copy link
Contributor Author

What's the current status of the proposal?

I am working through an initial pull request to add this functionality.

@chicco785
Copy link

What's the current status of the proposal?

I am working through an initial pull request to add this functionality.

That's cool and I am happy to have a look! What about the TSC evaluation phases? I have seen other posts where RBAC was dismissed...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
rfc A label for RFC issues
Projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

5 participants