-
-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Many fixes for reliability of the scraper #78
Conversation
… blank URL to not load external assets + display warning
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #78 +/- ##
==========================================
- Coverage 43.85% 43.14% -0.72%
==========================================
Files 15 15
Lines 969 978 +9
Branches 133 133
==========================================
- Hits 425 422 -3
- Misses 529 545 +16
+ Partials 15 11 -4 ☔ View full report in Codecov by Sentry. 🚨 Try these New Features:
|
For the record, I abused the |
…ff only when needed
1d4b766
to
32d8faf
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM ; see suggestion to rename Exception
if len(private_pages) == len(selected_pages): | ||
# we should never get here since we already check fail early if root | ||
# page is private, but we are better safe than sorry | ||
raise Exception("All pages have been ignored, not creating an empty ZIM") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use OSError instead
f"Exception while processing asset for {asset_url.value}: " | ||
f"{exc}" | ||
) | ||
raise Exception( # noqa: B904 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should use more qualified exceptions such as OSError
I've opened #91 for the |
Fix #74
Fix #76
Fix #77 (and glossary had the same problem)
Workaround for #71 (real solution postponed to "later") and many other likely situations where we encounter an "unknown" src/href/srcset (inline JS and CSS, ...)
Changes: see list of commits
Some remarks:
--html-issues-warn-only
because now we do not have such issues anymore, we are always just logging a warning because these issues are here to stay and we do not want to fail the scrape just for few isolated issues (they are really isolated from test scrape which ran last days)