-
Notifications
You must be signed in to change notification settings - Fork 674
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[RFC] Flyte Admin RBAC + Project/Domain Isolation #5871
base: master
Are you sure you want to change the base?
Conversation
Signed-off-by: Jason Parraga <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #5871 +/- ##
=======================================
Coverage 36.71% 36.71%
=======================================
Files 1304 1304
Lines 130081 130081
=======================================
Hits 47764 47764
Misses 78147 78147
Partials 4170 4170
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
Signed-off-by: Jason Parraga <[email protected]>
* This is a dedicated utility function used in repository code to create resource since you cannot add a WHERE clause filter for records that don't exist yet :) | ||
* ```go | ||
func (r *ExecutionRepo) Create(ctx context.Context, input models.Execution, executionTagModel []*models.ExecutionTag) error { | ||
if err := util.AuthorizeResourceCreation(ctx, input.Project, input.Domain); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is interesting, I like the middleware approach because it lets you intercept new service methods for free. In this proposal, how do you ensure that new DB methods always call the authorization util?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is cooperative so you cannot ensure it. The implementation I have adds it for every resource possible IIRC so it would probably have to be enforced through code review
- name: read-only | ||
rules: | ||
- name: "read everything" | ||
methodPattern: "Get.*|List.*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious about your thoughts on brittleness for regex parsing here. We mostly try to follow restful gRPC method names but we have a few notable exceptions like TerminateExecution
, RecoverExecution
, RelaunchExecution
. If upsteam OSS Flyte merges a change that adds a new syntax, is it the responsibility of whoever deploys Flyte at their company and manages these policies to update the regexes appropriately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using something besides .*
on the method pattern is opt in so yeah I think it would be up to the company to update their regexes and I'd imagine that is the behavior they would want. Ideally they validate the ux in non-prod and roll out to prod.
I'm also open to supporting different matching types like prefix and exact matches.
rules: | ||
- name: "r/w for every project in production" | ||
methodPattern: ".*" | ||
domain: production # you can wildcard project and declare domain level access |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does the same work for wild carding domain if you want, for example, mapping team write access?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes absolutely
#### Authorization Utils + DB Layer | ||
|
||
The final piece of the puzzle is what performs resource level authorization and filtering. Historically, I have found | ||
that the best (albeit challenging) way to do this is at the database layer for a few reasons: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This proposal makes sense, but there's some notable instances where we could have unauthorized side effects
specifically, for create execution, a user who has read access may be able to read the respective launch plan, we then create the CRD and only then insert the model in the DB (which gets rejected) here:
flyte/flyteadmin/pkg/manager/impl/execution_manager.go
Lines 1027 to 1062 in 5f69589
workflowExecutor := plugins.Get[workflowengineInterfaces.WorkflowExecutor](m.pluginRegistry, plugins.PluginIDWorkflowExecutor) | |
execInfo, execErr := workflowExecutor.Execute(ctx, workflowengineInterfaces.ExecutionData{ | |
Namespace: namespace, | |
ExecutionID: workflowExecutionID, | |
ReferenceWorkflowName: workflow.Id.Name, | |
ReferenceLaunchPlanName: launchPlan.Id.Name, | |
WorkflowClosure: workflow.Closure.CompiledWorkflow, | |
WorkflowClosureReference: storage.DataReference(workflowModel.RemoteClosureIdentifier), | |
ExecutionParameters: executionParameters, | |
OffloadedInputsReference: inputsURI, | |
}) | |
if execErr != nil { | |
createExecModelInput.Error = execErr | |
m.systemMetrics.PropellerFailures.Inc() | |
logger.Infof(ctx, "failed to execute workflow %+v with execution id %+v and inputs %+v with err %v", | |
request, workflowExecutionID, executionInputs, execErr) | |
} else { | |
m.systemMetrics.AcceptanceDelay.Observe(acceptanceDelay.Seconds()) | |
createExecModelInput.Cluster = execInfo.Cluster | |
} | |
executionModel, err := transformers.CreateExecutionModel(createExecModelInput) | |
if err != nil { | |
logger.Infof(ctx, "Failed to create execution model in transformer for id: [%+v] with err: %v", | |
workflowExecutionID, err) | |
return nil, nil, nil, err | |
} | |
executionTagModel, err := transformers.CreateExecutionTagModel(createExecModelInput) | |
if err != nil { | |
logger.Infof(ctx, "Failed to create execution tag model in transformer for id: [%+v] with err: %v", | |
workflowExecutionID, err) | |
return nil, nil, nil, err | |
} | |
return ctx, executionModel, executionTagModel, nil |
so in this scenario, we create a rogue Flyte CRD that never gets committed to the db (operation fails with permission denied) that propeller will attempt to process and send event updates for, leading to potential system weirdness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That is a good point, I wasn't aware of that interaction.
I think there are a couple ways to work around this.
- Bring authorization for creation into the application layer (in addition to the db layer I suppose).
- Rework the code such that the execution is written to the database (maybe partially) prior to creating the CRD. And then writing the rest to the DB after the CRD is created if necessary.
I am being a bit naive with the complexity to approach number 2 but I do think this is a general problem with distributed systems. For example, the way the code is written as is makes the system susceptible to ghost CRDs since the system could crash before writing to the DB. I also understand this issue is probably much more likely to happen with authorization errors though :)
The architectural issue I see with this type of behavior boils down to how business logic does what is effectively a distributed transaction across other services (submitting CRDs in this case) and writing to the database. The most robust way to attempt this sort of behavior that I've seen is to always write to the DB first (this is the source of truth) and then opportunistically do the rest of the distributed transaction. For cases where the server crashes after writing to the DB you'll basically need some anti entropy background task to handle edge cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure, it's definitely a problem now even without authz checks failing. I recall we waffled on this approach initially because we didn't want to commit the entry to the db until after the CRD was created (since the workflowengine packages gracefully handles already exists on retries, but falsely implying an execution was created in the reversed case was more confusing)
if we do bring authorization to the application layer (even for these exceptional cases) I'm a bit concerned we're now depending on code review to enforce authorization checks in both the db and application levels
|
||
#### Authorization Config | ||
The authorization config will be used to configure the authorization behavior. The first thing it describe is a way to | ||
resolve roles from the user's identity context. Ideally this should be flexible enough to resolve a role from different |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for a multi-cluster set-up, where the control plane (admin) and data plane (propeller) reside on separate clusters, propeller authenticates with a set of client credentials against flyteadmin to send execution events and create executions (child workflows, e.g. calling a launch plan in code)
if I'm understanding correctly, in this case we could use the userID of the propeller credentials to permit deployment-wide RW access? this isn't quite userID, but I think your suggestion can very well support applicationID using this approach. My understanding is that it's a bit more tricky to add scopes & claims to applications rather than user identities
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use a multi cluster setup and admittedly our implementation of this proposal actually has a separate config field for allow listing certain user IDs, which in our case are Okta client Ids for the propeller Okta app. This allow list gives them full access to all APIs and all projects/domains. This was more of an incremental thing I added when I realized very quickly that authorization broke propeller :)
When I was writing the proposal I realized that I could probably do away with user ID allow lists in our implementation and support role resolution more generally by allowing users to pick out which values from the token they want to potentially match with policies. So yeah, I think for Flyte workloads you'd just turn on the user ID or application ID resolution strategy and create a role like so, not having to deal with scopes at all.
- name: 0oahjhk34aUxGnWcZ0h7 # the names can even include things like okta app IDs
rules:
- name: "flyte propeller"
methodPattern: ".*"
I guess to be clear, the role resolution strategies configured will ultimately generate a set of strings and then try and match those strings against the policy map keys.
But I'm open to doing something special for the Flyte workloads
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this proposal sounds like it unblocks propeller access! I'm curious how this works for supporting dynamically provisioned app credentials (besides just propeller, but for things like CICD, etc) since I believe Okta doesn't allow you to attach claims to applications unless you use custom claims in the authorization server which is a bit more involved
2024-11-7 Contributor's sync notes: Jason introduced the proposal and expects to work on the implementation as a feature gate. Community feedback is important to learn how the community is using providers other than Okta to handle authz and groups. |
What's the current status of the proposal? |
I am working through an initial pull request to add this functionality. |
That's cool and I am happy to have a look! What about the TSC evaluation phases? I have seen other posts where RBAC was dismissed... |
Tracking issue
Related to #5189
Related to #4622