Creator is in error state while processing assets of bio.libretexts.org #128

Open
benoit74 opened this issue Jan 9, 2025 · 3 comments

@benoit74
Contributor

benoit74 commented Jan 9, 2025

See https://farm.openzim.org/pipeline/e7ddd2d3-eae7-43fc-94d7-0aa4b1e77d04

The source of the problem seems to be:

ERROR Detected
[mindtouch2zim::Thread-10 (worker)::2025-01-09 18:15:57,141] WARNING:Exception while processing asset from https://bio.libretexts.org/@api/deki/files/82600/Emissions-by-sector-%2525E2%252580%252593-pie-charts.svg?revision=1 used by page ID 110266 (https://bio.libretexts.org/Bookshelves/Biochemistry/Fundamentals_of_Biochemistry_(Jakubowski_and_Flatt)/Unit_IV_-_Special_Topics/32%3A_Biochemistry_and_Climate_Change/32.18%3A__Part_4_-_Turning_Trees_into_Plexiglass%3A_Synthetic_Biology_For_Production_of_Green_Foods_and_Products): Asynchronous error: N3zim29IncoherentImplementationErrorE
Declared provider's size (3735174) is not equal to total size returned by feed() calls (0).

I will restart the recipe on the same worker and see if the issue happens again.

Note that the issue looks transient: all assets fail to be added, then it works again, then it fails again, and so on.

We should probably catch "Creator is in error state." exceptions and fail the scrape when they occur; continuing is only going to create a broken ZIM.

@benoit74 added the bug label Jan 9, 2025
@benoit74 added this to the 0.2.0 milestone Jan 9, 2025
@benoit74 self-assigned this Jan 9, 2025
@rgaudin
Member

rgaudin commented Jan 10, 2025

> Declared provider's size (3735174) is not equal to total size returned by feed() calls (0).

Those are nasty ones. Let's get in touch, because this may or may not be a regression in scraperlib.

> We should probably catch Creator is in error state. exceptions

Do you mean that it currently ignores all exceptions? That sounds like a bad idea. C-originated exceptions are all RuntimeError with different text. RuntimeError should not be ignored at all IMO.
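
A minimal sketch of the behaviour discussed here, assuming a worker loop that tolerates per-asset download failures but treats any RuntimeError as fatal. `AssetDownloadError`, `process_assets`, and the `add_asset` callback are hypothetical names used for illustration; they are not mindtouch2zim or scraperlib APIs.

```python
import logging
from collections.abc import Callable, Iterable

logger = logging.getLogger("scraper-sketch")


class AssetDownloadError(Exception):
    """Hypothetical recoverable error (e.g. one asset returns a 404)."""


def process_assets(
    asset_urls: Iterable[str], add_asset: Callable[[str], None]
) -> list[str]:
    """Add every asset, skipping recoverable failures only.

    Returns the URLs that were skipped. RuntimeError is deliberately not
    caught: C++-originated errors surface as RuntimeError, and continuing
    after one would only produce a broken ZIM.
    """
    skipped: list[str] = []
    for url in asset_urls:
        try:
            add_asset(url)
        except AssetDownloadError as exc:
            # Recoverable: log and move on to the next asset.
            logger.warning("Skipping asset %s: %s", url, exc)
            skipped.append(url)
        # Anything else, RuntimeError included, propagates and fails the scrape.
    return skipped
```

The design choice is to list the exceptions that are safe to skip, rather than listing the ones that are fatal, so an unexpected error text from the C++ side can never be swallowed by accident.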

@benoit74
Contributor Author

What I observed now:

  • most tasks (I restarted all libretexts.org recipes) are not affected by the bug
  • libretexts.org_en_med failed twice on different workers
  • libretexts.org_en_bio ran twice on the same worker, and the second run failed for another reason; there was no "Creator is in error state" message
  • we have bigger recipes which succeeded

This means that:

  • the bug looks like it is caused by a race condition
  • the bug is not linked to a specific worker

As a reminder, all assets are passed to libzim as bytes (coming from a BytesIO):

creator.add_item_for(
    path="content/" + asset_path.value,
    content=asset_content.getvalue(),
)

Given the libzim message, I suspect the race condition might be that Python frees the bytes before libzim has finished consuming them. But I have no clue how that can happen or what we should do about it.
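
To make the error message concrete, here is a pure-Python model of the consistency check it describes: a provider declares a size up front, then hands over its content in chunks, and the consumer compares the two totals. This is an illustration only, not libzim's implementation; `BytesProvider`, `consume`, and `IncoherentProviderError` are hypothetical names.

```python
class IncoherentProviderError(RuntimeError):
    """Hypothetical stand-in for the error reported in the log."""


class BytesProvider:
    """Serves a bytes payload in fixed-size chunks."""

    def __init__(self, content: bytes, chunk_size: int = 4) -> None:
        self._content = content
        self._chunk_size = chunk_size
        self._offset = 0
        self.declared_size = len(content)  # size announced before feeding

    def feed(self) -> bytes:
        """Return the next chunk, or b"" once the content is exhausted."""
        chunk = self._content[self._offset : self._offset + self._chunk_size]
        self._offset += len(chunk)
        return chunk


def consume(provider) -> bytes:
    """Drain a provider and verify the fed size against the declared size."""
    chunks = []
    while chunk := provider.feed():
        chunks.append(chunk)
    data = b"".join(chunks)
    if len(data) != provider.declared_size:
        raise IncoherentProviderError(
            f"Declared provider's size ({provider.declared_size}) is not equal "
            f"to total size returned by feed() calls ({len(data)})."
        )
    return data
```

In this model, a provider that declares a non-zero size but returns nothing on its first feed() call produces a message of the same shape as the logged one (declared N, fed 0). That is consistent with the content no longer being available when it is read, although it does not by itself confirm the race-condition hypothesis.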

@rgaudin does this remind you of anything? Did we have a recent change around this in python-libzim or python-scraperlib? Regarding bytes and path manipulation, I recall changes around stream_file in scraperlib, but that is something totally different.

@benoit74
Contributor Author

> Do you mean that it currently ignores all exceptions? That sounds like a bad idea. C-originated exceptions are all RuntimeError with different text. RuntimeError should not be ignored at all IMO.

I think so as well ... now 🤣
