-
Notifications
You must be signed in to change notification settings - Fork 68
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Copy action #396
base: main
Are you sure you want to change the base?
Copy action #396
Conversation
- Action result is empty, similar to symlink actions. | ||
|
||
The API looks like `actions.copy(in: File, out: File, path: String|None): None`. | ||
- `actions.copy(in, out)` is a simple same-type copy operation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "same type" refer to the artifact type or to the filesystem type? The two don't necessarily match: artifacts of file or directory type can be materialized as symlinks in the filesystem.
What should actions.copy(in, out)
do if in
and out
are both declare_file
(or both declare_directory
), but in
is materialized in the filesystem as a symlink? I think there are three possible answers:
- Copy the symlink as-is
- Indirect through the symlink and make a copy of its target
- Execution error (I like this one the least because it makes builds with the same analysis properties and input digests result in different execution behavior, depending on the state of the filesystem - a sort of non-determinism, if you will).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does "same type" refer to the artifact type or to the filesystem type? The two don't necessarily match: artifacts of file or directory type can be materialized as symlinks in the filesystem.
I was thinking artifact type when I wrote this, symlinks are something I've yet to figure out. It's very easy to break them, which to be fair is the case for the existing copy rules. The challenge here is that with some symlinks Bazel will actually dereference them and track the targeted artifact.
Not sure what behaviour to go with here just yet. I feel like "copy-as-is" would be the least surprising option, keeps it simple.
- Indirect through the symlink and make a copy of its target
Would make sense when in the source tree a file has been symlinked specifically so it's content can appear in multiple locations. Such cases would be ideally served by hardlinks (multiple canonical handles, same file) but that's not something version control systems keep track of.
Symlinks to a directory ramps up the complexity. They can be used to present the same content in multiple locations, but they also have more novel uses like in pnpm
.
Additionally symlinks can point to anything, including other symlinks (e.g. symlink-a -> symlink-b -> file
, symlink-a -> symlink-b -> symlink-a
). Solving for that in the copy implementation doesn't sound like much fun.
A case could also be made for reading the symlink target then creating the new "copy" with an adjusted target path.
Opting to not do anything special for symlinks may be the best option, can always do the special niche operation (e.g. copying the file a symlink points to) in a separate action.
The API looks like `actions.copy(in: File, out: File, path: String|None): None`. | ||
- `actions.copy(in, out)` is a simple same-type copy operation. | ||
Unanswered question: How are `actions.symlink` inputs handled? Update the target? Error? | ||
- `actions.copy(in, out, path = "foo/bar")` copies a file from the input directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be clear, I think you're referring to the case where in
is a declare_directory
and out
is a declare_file
? Can you work that into the sentence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Tiago comment.
I would prefer to see separate definitions for cases such as:
- File to File
- Dir to Dir
- File to Dir
I am also ok if we decide to only support File to File initially.
`path` cannot be used with generic symlinks (`actions.declare_symlink`, source files??? and within tree artifacts support is later added) as Bazel does not track their target. | ||
|
||
Compared to existing solutions, `actions.copy` should; | ||
- Complete much more quickly, as spawn machinery is not used and no subprocess is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
An interesting question is whether the implementation is required to make an actual copy, or if hardlinks would suffice where supported.
Do you know whether hardlinks also present problems for the use cases where actions.symlink
doesn't work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hardlinks are probably fine for outputs, only surprising if output tree checking is turned off and a file is modified.
Copy-on-write is also an option (macOS has clonefile
, Linux has reflink
, Windows has something recent), though it requires the source and destination to be on the same volume, and for the filesystem to support it. Cached feature detection would be necessary for optimal performance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we start with an actual copy and add options to use hard link or copy-on-write(CoW) copy separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, executable hardlinks have an issue with macOS Gatekeeper killing the wrong executables:
https://developer.apple.com/forums/thread/663456.
|
||
Introduce a new `copy` action with the following behaviors; | ||
- Content hash of output is identical to input, only the name is different. | ||
- Realisation of output is deferred, similar to how remote spawn actions are handled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implementation-wise this is probably the most challenging/contentious part, because (1) the "build without the bytes" machinery currently only allows delayed materialization of artifacts that already exist remotely, and (2) architecturally we don't have a way for an action to say "copy metadata from input to output", only "collect output metadata from the filesystem".
(I reworded this from an earlier version to state the difficulties more clearly. These are not insurmountable problems, but they'd be responsible for most of the implementation complexity, so we should be clear that accepting this proposal would require to take them on.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm under no illusions as to the complexity this particular requirement brings. Unfortunately it is also necessary to avoid a potential performance regression (vs just throwing a bunch of concurrent copy spawns at the remote).
For context, internally we've patched skylib
copy rules that are configured out of the box to run locally, so that they can run remotely. This is for performance primarily (cutting down on uploads/downloads from the remote), but also correctness in general (remote is Linux while host is either Linux or macOS, will be less of an issue if/when my other proposal's implementation gets in).
- `actions.copy(in, out)` is a simple same-type copy operation. | ||
Unanswered question: How are `actions.symlink` inputs handled? Update the target? Error? | ||
- `actions.copy(in, out, path = "foo/bar")` copies a file from the input directory. | ||
`path` cannot be used with generic symlinks (`actions.declare_symlink`, source files??? and within tree artifacts support is later added) as Bazel does not track their target. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using it with source artifacts should be fine (what determines whether Bazel tracks the contents is the artifact type, not the filesystem type, and source artifacts are currently always of file
type) but then we run into the same question as above in cases where they're materialized as symlinks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For symlink source files, I think this is a non-issue. If the symlink points to a file, Bazel will treat it as a file and that can't be passed to in
. If the symlink points to a directory, Bazel will treat it as a directory and directories aren't allowed as source inputs (they do trigger a "unsound directory dependency warning in Bazel v7, v8 I think makes it an error).
I worded it poorly but I believe tree artifacts cannot contain symlinks, and even if they do later they'll be appropriately classified (directory, file, generic symlink) once copied out so I think the last item can be removed there.
Introduce a new `copy` action with the following behaviors; | ||
- Content hash of output is identical to input, only the name is different. | ||
- Realisation of output is deferred, similar to how remote spawn actions are handled. | ||
- Action result is empty, similar to symlink actions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what you mean here is that the result of the action isn't uploaded to a remote or disk cache?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with having an empty action result. I think it's still valuable to have one I think things might break badly if we don't have one to track the output directory life time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did a very poor job of describing the requirement here, and this is after having rewritten this several times (while failing to fully recall what I was on about).
Thinking on this again now, I think my past self was concerned about how remote/BES APIs would respond to a large number of copy operations (10k+) that are unbatched. That is, it is very easy to process 10k copy actions in a second (or there abouts) so keeping the number of events/calls they produce as a side effect low is important for keeping connections from becoming flooded and remote services healthy.
There is definitely a better way to phrase this requirement, I'll need to poke around at the remote/BES API surfaces to fill some knowledge gaps first I think. This may be a simple as batching (not of the actions themselves, more the side effects like BES events and remote interactions).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, this is something that we have discussed internally with BuildBuddy as these copy actions are used heavily among the rules_js, rules_oci ecosystem.
Specifically for rules_oci, the copy contents could range anywhere between a few bytes of json, to multi-gigabytes of compressed tarball container layer. Which make them quite "expensive" to schedule and execute remotely.
An alternative way of achieving this, which we are discussing internally, is for the RBE scheduler to "detect" the known copy actions and skip the executions completely by creating the ActionResult directly instead. Though, it would be much nicer if it's Bazel doing these "short-circuit" executions instead of our server.
Introduce a new `copy` action with the following behaviors; | ||
- Content hash of output is identical to input, only the name is different. | ||
- Realisation of output is deferred, similar to how remote spawn actions are handled. | ||
- Action result is empty, similar to symlink actions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with having an empty action result. I think it's still valuable to have one I think things might break badly if we don't have one to track the output directory life time.
The API looks like `actions.copy(in: File, out: File, path: String|None): None`. | ||
- `actions.copy(in, out)` is a simple same-type copy operation. | ||
Unanswered question: How are `actions.symlink` inputs handled? Update the target? Error? | ||
- `actions.copy(in, out, path = "foo/bar")` copies a file from the input directory. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 to Tiago comment.
I would prefer to see separate definitions for cases such as:
- File to File
- Dir to Dir
- File to Dir
I am also ok if we decide to only support File to File initially.
`path` cannot be used with generic symlinks (`actions.declare_symlink`, source files??? and within tree artifacts support is later added) as Bazel does not track their target. | ||
|
||
Compared to existing solutions, `actions.copy` should; | ||
- Complete much more quickly, as spawn machinery is not used and no subprocess is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if we start with an actual copy and add options to use hard link or copy-on-write(CoW) copy separately.
dst = ctx.actions.declare_directory(ctx.attr.out) | ||
ctx.actions.copy(ctx.attr.src, dst) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this example, the desired output here would be a Directory which should be uploaded to remote cache CAS so that subsequent downstream actions could consume it.
If the copy is large (i.e. Dir to Dir recursively), having an ActionResult would also save other Bazel clients from having to do the work.
I also think that if File to Dir
is supported, then we should consider making it a list of files instead so we can compose the directory contents in one go.
Proposal to introduce a builtin copy action.
Goals
no-remote
andno-cache
that can ultimately lead to greater overhead in remote builds (e.g. forcing remote files to be downloaded under--remote_download_minimal
).