system_binary "test binary" caches failures across pantsd invocations (and doesn't work with shim scripts) #21744

Open

trhodeos opened this issue Dec 11, 2024 · 4 comments

trhodeos commented Dec 11, 2024

Describe the bug
We use julia via juliaup (similar to pyenv, but for Julia). We have the following system_binary declaration:

system_binary(
    name="julia",
    binary_name="julia",
    fingerprint=r".*1.10.*",
    fingerprint_args=["-v"],
)

Caching this makes sense for "static" binaries. However, juliaup installs a shim julia script that dispatches to whatever the default juliaup Julia version currently is. The "test binary" step therefore gets cached, since its inputs (i.e. the shim script itself) haven't changed even when the version the shim dispatches to has.
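
To illustrate the mechanism, here's a hypothetical stand-in for a juliaup-style shim (not juliaup's actual implementation; the config key and install layout are assumptions for illustration). The shim's bytes on disk never change, so any content-based cache key for it is stable, even though the binary it executes depends on external mutable state:

#!/usr/bin/env python3
# Hypothetical juliaup-style shim: this file's content is constant, so
# fingerprinting the shim itself can never observe a version switch.
import json
import os
import sys

# External state that `juliaup default <version>` rewrites.
with open(os.path.expanduser("~/.julia/juliaup/juliaup.json")) as f:
    config = json.load(f)

version = config["Default"]  # key name is an assumption for illustration

# Dispatch to the real binary for the currently selected version.
os.execv(
    os.path.expanduser(f"~/.julia/juliaup/julia-{version}/bin/julia"),
    ["julia", *sys.argv[1:]],
)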

This means that if the "test binary" rule fails (or passes) once, it will keep returning that result until the cache is deleted.

Pants version
2.23.0

OS
macOS (but it most likely happens on Linux as well)

Additional info
Here's a repro case. Note that you need juliaup installed. Toggle between juliaup default 1.10 and juliaup default 1.9 to see the failure modes.

trhodeos added the bug label Dec 11, 2024
trhodeos changed the title to "system_binary "test binary" caches failures across pantsd invocations (and doesn't work with shim scripts)" Dec 11, 2024
huonw (Contributor) commented Dec 17, 2024

Sorry for the trouble. Thanks for tracking down that code.

There's a bit more context/info in the called function's docstring:

def executable_search_path_cache_scope(
    self, *, cache_failures: bool = False
) -> ProcessCacheScope:
    """Whether it's safe to cache executable search path discovery or not.

    If the environment may change on us, e.g. the user upgrades a binary, then it's not safe to
    cache the discovery to disk. Technically, in that case, we should recheck the binary every
    session (i.e. Pants run), but we instead settle for every Pantsd lifetime to have more
    acceptable performance.

    Meanwhile, when running with Docker, we already invalidate whenever the image changes
    thanks to https://github.com/pantsbuild/pants/pull/17101.

    Remote execution often is safe to cache, but depends on the remote execution server.
    So, we rely on the user telling us what is safe.
    """
    caching_allowed = self.val and (
        self.val.has_field(DockerImageField)
        or (
            self.val.has_field(RemoteEnvironmentCacheBinaryDiscovery)
            and self.val[RemoteEnvironmentCacheBinaryDiscovery].value
        )
    )
    if cache_failures:
        return (
            ProcessCacheScope.ALWAYS
            if caching_allowed
            else ProcessCacheScope.PER_RESTART_ALWAYS
        )
    return (
        ProcessCacheScope.SUCCESSFUL
        if caching_allowed
        else ProcessCacheScope.PER_RESTART_SUCCESSFUL
    )

In particular, when running locally, this is only cached "per restart", i.e. for a pantsd session. Thus, getting Pantsd to restart (rm -rf .pants.d/pids is one way) will cause the binary search to rerun and find the new one.

This ends up being a performance/"correctness" trade-off: no caching makes for a slower experience.
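
Per that docstring, full disk caching only kicks in when the target environment opts in. For remote execution that looks roughly like the BUILD snippet below (the field name is inferred from the RemoteEnvironmentCacheBinaryDiscovery field above, so check the remote_environment reference for your Pants version before relying on it):

remote_environment(
    name="remote",
    # Tell Pants the remote execution server is stable enough that binary
    # discovery results can safely be cached to disk.
    cache_binary_discovery=True,
)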

Does the pantsd tip help?

Maybe we could even improve the error message to tell people about this, since it's easy to trip over:

async def _find_binary(
    address: Address,
    binary_name: str,
    search_path: SearchPath,
    fingerprint_pattern: str | None,
    fingerprint_args: tuple[str, ...] | None,
    fingerprint_dependencies: tuple[str, ...] | None,
    log_fingerprinting_errors: bool,
) -> BinaryPath:
    binaries = await Get(
        BinaryPaths,
        BinaryPathRequest(
            binary_name=binary_name,
            search_path=search_path,
        ),
    )

    fingerprint_args = fingerprint_args or ()

    deps = await Get(
        ResolvedExecutionDependencies,
        ResolveExecutionDependenciesRequest(address, (), fingerprint_dependencies),
    )
    rds = deps.runnable_dependencies
    env: dict[str, str] = {}
    append_only_caches: Mapping[str, str] = {}
    immutable_input_digests: Mapping[str, Digest] = {}
    if rds:
        env = {"PATH": rds.path_component}
        env.update(**(rds.extra_env or {}))
        append_only_caches = rds.append_only_caches
        immutable_input_digests = rds.immutable_input_digests

    tests: tuple[FallibleProcessResult, ...] = await MultiGet(
        Get(
            FallibleProcessResult,
            Process(
                description=f"Testing candidate for `{binary_name}` at `{path.path}`",
                argv=(path.path,) + fingerprint_args,
                input_digest=deps.digest,
                env=env,
                append_only_caches=append_only_caches,
                immutable_input_digests=immutable_input_digests,
            ),
        )
        for path in binaries.paths
    )

    for test, binary in zip(tests, binaries.paths):
        if test.exit_code != 0:
            if log_fingerprinting_errors:
                logger.warning(
                    f"Error occurred while fingerprinting candidate binary `{binary.path}` "
                    f"for binary `{binary_name}` (exit code {test.exit_code}) (use the "
                    f"`{SystemBinaryLogFingerprintingErrorsField.alias}` field to control this warning):\n\n"
                    f"stdout:\n{test.stdout.decode(errors='ignore')}\n"
                    f"stderr:\n{test.stderr.decode(errors='ignore')}"
                )
            # Skip this binary since fingerprinting failed.
            continue
        if fingerprint_pattern:
            fingerprint = test.stdout.decode().strip()
            match = re.match(fingerprint_pattern, fingerprint)
            if not match:
                continue
        return binary

    message = StringIO()
    message.write(f"Could not find a binary with name `{binary_name}`")
    if fingerprint_pattern:
        message.write(
            f" with output matching `{fingerprint_pattern}` when run with arguments "
            f"`{' '.join(fingerprint_args or ())}`"
        )
    message.write(". The following paths were searched:\n")
    for sp in search_path:
        message.write(f"- {sp}\n")

    failed_tests = [
        (test, binary) for test, binary in zip(tests, binaries.paths) if test.exit_code != 0
    ]
    if failed_tests:
        message.write(
            "\n\nThe following binaries were skipped because each binary returned an error when invoked:"
        )
        for failed_test, failed_binary in failed_tests:
            message.write(f"\n\n- {failed_binary.path} (exit code {failed_test.exit_code})\n")
            message.write(f"  stdout:\n{failed_test.stdout.decode(errors='ignore')}\n")
            message.write(f"  stderr:\n{failed_test.stderr.decode(errors='ignore')}\n")

    raise ValueError(message.getvalue())
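
As the code above shows, candidates that fail fingerprinting are skipped, and the warning is only emitted when the log_fingerprinting_errors field is set. A sketch of opting into that warning (the field alias is taken from SystemBinaryLogFingerprintingErrorsField in the snippet above, so confirm it exists in your Pants version):

system_binary(
    name="julia",
    binary_name="julia",
    fingerprint=r".*1.10.*",
    fingerprint_args=["-v"],
    # Surface fingerprinting failures (with stdout/stderr) as warnings;
    # failing candidates are skipped either way.
    log_fingerprinting_errors=True,
)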

trhodeos (Author) commented Dec 17, 2024

It seemed to be "sticky" for me (i.e. it persisted after restarting pantsd). Let me double-check that, though; I may be misspeaking.

Yeah, I'm not sure what the answer is here. For this specific case, there's a config file that controls the "live" binary. It'd be nice if we could add that config file ($HOME/.julia/juliaup/juliaup.json in this case) to the fingerprint of system_binary, but I realize that comes with its own set of trade-offs (fingerprinting files outside of the "pants" realm).

tdyas (Contributor) commented Dec 26, 2024

An alternative solution would be to add an attribute to the system_binary target type that makes caching "per session", so every Pants invocation re-evaluates the binary path search. "Per session" is one of the ProcessCacheScope variants.

We already have a similar idea for shell_command and adhoc_tool target types coming in v2.24.x via a new cache_scope field; for example, https://www.pantsbuild.org/prerelease/reference/targets/adhoc_tool#cache_scope.
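
A hypothetical sketch of what that could look like on system_binary (no such field exists on system_binary today; the value name below is an assumption mirroring the "per session" ProcessCacheScope variant and the adhoc_tool field linked above):

system_binary(
    name="julia",
    binary_name="julia",
    fingerprint=r".*1.10.*",
    fingerprint_args=["-v"],
    # Hypothetical field: re-run the "test binary" search in every Pants
    # session instead of caching it for the pantsd lifetime.
    cache_scope="per_session",
)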

tdyas (Contributor) commented Dec 26, 2024

> Yeah, I'm not sure what the answer is here. For this specific case, there's a config file that controls the "live" binary. It'd be nice if we could add that config file ($HOME/.julia/juliaup/juliaup.json in this case) to the fingerprint of system_binary, but I realize that comes with its own set of trade-offs (fingerprinting files outside of the "pants" realm).

This is feasible to implement in local execution on Pants v2.23.x+ because the rules engine now has the ability to retrieve metadata of system paths outside of the repository (via the PathMetadataRequest / PathMetadataResult types).
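
A rough sketch of that idea follows. The PathMetadataRequest / PathMetadataResult types exist per the comment above, but the import path, constructor, and result attributes used here are assumptions, not working Pants code:

from pants.engine.fs import PathMetadataRequest, PathMetadataResult  # assumed location
from pants.engine.rules import Get, rule


@rule
async def juliaup_config_state() -> str:
    # Requesting metadata for the juliaup config file would let the engine
    # invalidate binary discovery whenever `juliaup default <version>`
    # rewrites it, even though the file lives outside the repository.
    result = await Get(
        PathMetadataResult,
        PathMetadataRequest("/Users/me/.julia/juliaup/juliaup.json"),  # absolute path; $HOME expansion hand-waved
    )
    return repr(result.metadata)  # attribute name is an assumption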
