Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long strings dumped to dvc.lock contain extra space upon load #10668

Open
janpawlowskiof opened this issue Jan 13, 2025 · 2 comments
Open

Long strings dumped to dvc.lock contain extra space upon load #10668

janpawlowskiof opened this issue Jan 13, 2025 · 2 comments
Labels
bug Did we break something? upstream Issues which need to be resolved in an upstream dependency

Comments

@janpawlowskiof
Copy link

janpawlowskiof commented Jan 13, 2025

Bug Report

repro: long strings dumped to dvc.lock contain extra space upon load

Edit:

In PR, I replaced the float("inf") with sys.maxsize as suggested in #9397 (comment)

Description

The actual bug seems to be originating from ruamel.yaml, but we should mitigate it here.

How it manifests:

I have this stage indvc.yaml

  faulty_stage:
    cmd: >-
      echo imruneverytime > faulty.txt
    params:
      - fault_parameter_name
    outs:
      - faulty.txt

and this in params.yaml

fault_parameter_name: |
  This is a prompt.
  This is a prompt.
  This is a prompt.
  This is a prompt.
  This is a prompt.

Despite not changing anything, dvc sees the step as changed, but then realizes the step was cached and loads it from cache.

# Running for the first time, expected

bash-5.2$ dvc repro -s faulty_stage
Running stage 'faulty_stage':                                         
> echo imruneverytime > faulty.txt
Updating lock file 'dvc.lock'                                                                                                                                              
Use `dvc push` to send your updates to remote storage.

# Nothing changes here, yet it tries to run, but finally does not, because of the previous run being cached

bash-5.2$ dvc repro -s faulty_stage
Stage 'faulty_stage' is cached - skipping run, checking out outputs   
Updating lock file 'dvc.lock'                                                                                                                                              
Use `dvc push` to send your updates to remote storage.
bash-5.2$ 

Why it occurs:

When long lines are dumped to dvc.lock the line gets wrapped.
The issue happens (only sometimes I think), when this wrap occurs after "\n" character.
When the yaml is then loaded it contains and additional space.

You can see this happen directly in the ruamel.yaml
Image

This can be worked around by adding this line, which makes it so that string is dumped in one line.

yaml.width = float("inf")
Image You can see this fixes the bug.

How it can be solved.

This can be probably solved be adding this line

yaml.width = float("inf")

here

yaml = YAML()

This solves my issue, and while I didn't do any real testing of this, I don't think this should cause any issues elsewhere.

Output of dvc doctor:

DVC version: 3.58.0 (pip)
-------------------------
Platform: Python 3.11.9 on macOS-14.3-arm64-arm-64bit
Subprojects:
        dvc_data = 3.16.7
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.3.9
Supports:
        http (aiohttp = 3.11.11, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.11.11, aiohttp-retry = 2.9.1)
Config:
        Global: /Users/jpawlowski/Library/Application Support/dvc
        System: /Library/Application Support/dvc
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk3s1s1
Caches: local
Remotes: None
Workspace directory: apfs on /dev/disk3s1s1
Repo: dvc, git
Repo.site_cache_dir: /Library/Caches/dvc/repo/c7d0d0e3856141270a471060952e5a84

This will hopefully be a one line fix, and I can add a PR for it today.

Best regards!

@skshetry
Copy link
Member

Setting the width might change the existing dvc.lock files. Although setting width may seem to fix it, the issue is really with ruamel-yaml not round-tripping properly.

See https://sourceforge.net/p/ruamel-yaml/tickets/526/.

@janpawlowskiof
Copy link
Author

janpawlowskiof commented Jan 13, 2025

Correct, although since currently there seems to be no work being done ruamel-yaml's side, I sill would consider simply changing YAML's settings to avoid this issue.
Especially since I don't see a way to work around it.
Everytime I do dvc commit I need to confirm that I want to commit since params.yaml changed, when in reality they didn't. This adds a lot of confusion.

As for exisiting dvc.lock:

I'm pretty sure this change will not affect loading existing dvc.lock files.

The first time you recreate dvc.lock with this change, it will create a different dvc.lock, however when loaded, both old and new files should yield identical objects.
So I think it should not cause any troubles?
Let me know if you have doubts, so I can test for them.

@shcheklein shcheklein added bug Did we break something? upstream Issues which need to be resolved in an upstream dependency labels Jan 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? upstream Issues which need to be resolved in an upstream dependency
Projects
None yet
Development

No branches or pull requests

3 participants