-
-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
skip download of speaker image if url is "-" #228
Conversation
@benoit74 the tests are failing due to failure to upload to codecov. I think we need to fix that and also not fail on such scenario, as we did on another scraper (youtube?) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't we enhance the tests to also check for:
- bare URL like just
-
,www.host.com
(without thehttps://
) - absolute URLs like
//www.host.com
- At least one valid URL with file name and query string (here we have only domain names)
For some I'm not sure if these URLs should be valid or not, tbc if rest of the code is capable to handle such a URL, and I'm not sure that we might receive these as speaker URL, but I'm sure it should be clear what would happen in such a case.
I will fix the codecov issue as mentioned by rgaudin: add proper credentials and make codecov results informational.
Edit: codecov results are already informational, see https://github.com/openzim/ted/blob/main/codecov.yml ; here CI is failing because upload failed. Which is a good thing I think, we want the upload to be done. |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #228 +/- ##
========================================
- Coverage 5.00% 5.00% -0.01%
========================================
Files 8 8
Lines 1098 1100 +2
Branches 238 239 +1
========================================
Hits 55 55
- Misses 1042 1044 +2
Partials 1 1 ☔ View full report in Codecov by Sentry. |
OK, maybe the repo is not properly configured then? I remember we have to add the TOKEN for some a while ago |
I've updated the CODECOV_TOKEN, it was only 4 months old so I'm fairly surprised it suddenly stopped working. Maybe it was just a glitch somewhere. Anyway, now it worked fine. |
I ran some tests locally with sample urls and for bare urls like I didn't want to consider query string and other parts seeing as the input is coming from an external source (not user input). Thus, I decided to restrict it to checking the validity of the domain names because the error was about a name resolution failure (due to invalid domain name). It's also why I named the function I could refactor it into a more robust url matcher though. What do you think? EDIT: Or would it be better to use a third-party library to validate this as url regex matching can get complicated. Noticed mine doesn't cover IPv6 urls although I don't think any of the speaker images match that pattern. |
d8bf387
to
17ad543
Compare
I just rebased your branch to use latest codecov action, not sure it was the problem around failing CI, but it might be. |
I'm sorry but this makes no sense to me. First because a URI is more generic than a URL, which is necessarily online (where a URI could be found in an XML document for instance and is not online), so I don't get the relation of naming it If checking this URL for validity is too complex, I don't mind to just ignore |
I looked up existing URL matchers and I think the simple approach of checking against |
I think this is more appropriate indeed: to only focus on the existing issue. I suggest you remove the |
80f2c09
to
88c69df
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thank you ! I indeed prefer this simple thing for now.
Rationale
Sometimes, speaker image URLs are invalid and causes the scraper to throw errors after attempting to fetch the resource. This PR ensures the hostname of the URLs are at least valid before attempting to fetch the resource.
This PR resolves #224
Changes