Many fixes for reliability of the scraper #78

Merged
benoit74 merged 9 commits into main from reliability
Nov 25, 2024

benoit74
Contributor

@benoit74 benoit74 commented Nov 22, 2024

Fix #74
Fix #76
Fix #77 (and glossary had the same problem)

Workaround for #71 (real solution postponed to "later") and many other likely situations where we encounter an "unknown" src/href/srcset (inline JS and CSS, ...)

Changes: see list of commits

Some remarks:

  • With these changes, we more often log only a warning for situations that are known to be not totally OK but that do not significantly harm the final ZIM. The goal is to still be able to create a ZIM when weird situations are encountered in the wild. I prefer these warnings to silently handling each situation because they make investigation possible without debugging: for empty index or glossary pages, for example, I chose to log a warning rather than quietly handle the empty case, since an empty page is not really normal and might be a real problem that only manual analysis can confirm.
  • I dropped --html-issues-warn-only because it is no longer needed: we now always just log a warning for these issues. They are here to stay, and we do not want to fail the scrape for a few isolated issues (the test scrapes run over the last few days show they really are isolated).
  • As discussed, we now have a new dev option to control how many assets are allowed to fail (to provide some slack when needed) and a prod option to register known bad assets (we automatically include some known ones coming from automatic CSS).
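
The asset-failure policy from the last remark could be sketched roughly as follows. This is only an illustration, not the scraper's actual code: `AssetFailurePolicy`, `max_failures` and `known_bad_assets` are hypothetical names (the real option names are not given in this thread), and treating known bad assets as "logged but not counted" is an assumption.

```python
import logging
from dataclasses import dataclass, field

logger = logging.getLogger("asset-sketch")


@dataclass
class AssetFailurePolicy:
    """Hypothetical policy: tolerate a bounded number of unexpected asset failures."""

    max_failures: int  # assumed "slack" setting; real option name is not stated
    known_bad_assets: set[str] = field(default_factory=set)
    failures: int = 0

    def record_failure(self, url: str, exc: Exception) -> None:
        if url in self.known_bad_assets:
            # Assumption: known bad assets are expected to fail, so they are
            # logged but do not consume any of the allowed failures.
            logger.warning("Ignoring known bad asset %s: %s", url, exc)
            return
        self.failures += 1
        logger.warning(
            "Asset %s failed (%d failures, %d tolerated): %s",
            url, self.failures, self.max_failures, exc,
        )
        if self.failures > self.max_failures:
            # Only abort once the tolerated number of failures is exceeded.
            raise OSError(
                f"Too many asset failures ({self.failures} > {self.max_failures})"
            ) from exc
```

With `max_failures=0` such a policy would fail on the first unexpected asset error, while a larger value gives the slack mentioned above.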

@benoit74 benoit74 self-assigned this Nov 22, 2024

codecov bot commented Nov 22, 2024

Codecov Report

Attention: Patch coverage is 13.23529% with 118 lines in your changes missing coverage. Please review.

Project coverage is 43.14%. Comparing base (5236501) to head (ad99067).
Report is 10 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| scraper/src/mindtouch2zim/processor.py | 1.04% | 95 Missing ⚠️ |
| scraper/src/mindtouch2zim/asset.py | 23.07% | 20 Missing ⚠️ |
| scraper/src/mindtouch2zim/entrypoint.py | 0.00% | 1 Missing ⚠️ |
| scraper/src/mindtouch2zim/html_rewriting.py | 88.88% | 1 Missing ⚠️ |
| scraper/src/mindtouch2zim/utils.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #78      +/-   ##
==========================================
- Coverage   43.85%   43.14%   -0.72%     
==========================================
  Files          15       15              
  Lines         969      978       +9     
  Branches      133      133              
==========================================
- Hits          425      422       -3     
- Misses        529      545      +16     
+ Partials       15       11       -4     

@benoit74 benoit74 marked this pull request as ready for review November 22, 2024 10:36
@benoit74 benoit74 requested a review from rgaudin November 22, 2024 10:36
@benoit74
Contributor Author

For the record, I abused the dev Docker image by also building it from this branch, just to be able to run the scraper ASAP in Zimfarm. Since we have not released 0.1 yet, who cares.

Member

@rgaudin rgaudin left a comment

LGTM ; see suggestion to rename Exception

if len(private_pages) == len(selected_pages):
    # we should never get here since we already check fail early if root
    # page is private, but we are better safe than sorry
    raise Exception("All pages have been ignored, not creating an empty ZIM")
Member

Use OSError instead
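
Applied to the guard quoted above, this suggestion might look like the sketch below. The wrapping function `ensure_pages_remaining` and its parameter types are hypothetical and only there to make the example self-contained; the thread does not show how the change was eventually made.

```python
def ensure_pages_remaining(selected_pages: list[str], private_pages: list[str]) -> None:
    """Refuse to continue when every selected page turned out to be private."""
    if len(private_pages) == len(selected_pages):
        # Same guard as in the reviewed code, with a narrower exception type.
        raise OSError("All pages have been ignored, not creating an empty ZIM")
```

Callers that already catch a broad `Exception` keep working, since `OSError` is a subclass of it.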

f"Exception while processing asset for {asset_url.value}: "
f"{exc}"
)
raise Exception( # noqa: B904
Member

We should use more qualified exceptions such as OSError
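
One possible shape for this, as a hedged sketch: a dedicated exception type raised with explicit chaining, which would also make the `# noqa: B904` marker unnecessary (B904 is the lint rule asking for `raise ... from err` inside an `except` block). `AssetProcessingError`, `process_asset` and the `fetch` callable are hypothetical names, not taken from the scraper.

```python
import logging
from typing import Callable

logger = logging.getLogger("asset-sketch")


class AssetProcessingError(OSError):
    """Hypothetical, more specific error type for asset failures."""


def process_asset(url: str, fetch: Callable[[str], bytes]) -> bytes:
    """Fetch an asset, converting any failure into a qualified, chained error."""
    try:
        return fetch(url)
    except Exception as exc:
        logger.warning("Exception while processing asset for %s: %s", url, exc)
        # "from exc" records the original exception as __cause__, so the
        # root error stays visible in the traceback.
        raise AssetProcessingError(f"Asset failure for {url}") from exc
```

Subclassing `OSError` keeps the new type compatible with handlers that catch `OSError`, while still letting callers catch asset failures specifically.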

@benoit74
Contributor Author

I've opened #91 for the Exception; there are many more than the ones you mention here. But good point indeed.

@benoit74 benoit74 merged commit 08e9733 into main Nov 25, 2024
11 checks passed
@benoit74 benoit74 deleted the reliability branch November 25, 2024 08:25