Shutdown of Apache Tika Corpora #3035
Comments
Is the complete list of affected files a subset of example_files.yaml (pseudocode: local_filename.startswith("tika"))? And is the reason we need each of them to be found in the tests where they are used?
Not necessarily - it is every file that references https://corpora.tika.apache.org somewhere. Finding the reasons might require deeper research: looking at the code and usages themselves, looking into the commits/PRs that introduced them, etc.
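To make "every file referencing https://corpora.tika.apache.org" concrete, here is a minimal sketch of how such references could be collected from any text (a YAML file, a test module, etc.). The function name find_corpora_urls and the URL regex are illustrative assumptions, not code from the repository, and the paths in the usage note are hypothetical.

```python
import re
from urllib.parse import urlparse

# Host named in this issue; everything else here is an illustrative assumption.
CORPORA_HOST = "corpora.tika.apache.org"

# Rough URL matcher: stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def find_corpora_urls(text: str) -> list[str]:
    """Return the distinct URLs in `text` whose host is the Tika corpora site."""
    found: list[str] = []
    for match in URL_PATTERN.finditer(text):
        # Drop trailing sentence punctuation that the regex may have swallowed.
        url = match.group(0).rstrip(".,;")
        if urlparse(url).hostname == CORPORA_HOST and url not in found:
            found.append(url)
    return found
```

Applied to each file in a checkout (for example via pathlib.Path(".").rglob("*.py") and Path.read_text()), this would list the affected URLs per file; matching on the parsed hostname rather than on a filename prefix avoids missing entries whose local name does not start with "tika".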
Could one option be to ask on the mailing list for access to the corpora? Then, for each file used in the test suite, we could determine whether the file can be copied into the samples directory. Failing that, we would at least have an easier time figuring out which property of the PDF was necessary for a test, and could then create a synthetic PDF that exercises the same behavior.
Most of the active developers should still have access to the relevant files due to local caching and, if in doubt, by getting them from the Ubuntu cache used by GitHub Actions (I have done this in the past and should have more than one local copy of the relevant files). Whether our specific files are affected by the take-down requests is unclear. I have some doubts about moving them into the sample files, though. They would have to be CC-BY-SA-4.0, which requires re-building local copies of them with the problematic features. In my experience from looking into the few arXiv files (#2904), identifying the actual reason for using a given PDF requires quite some research. Additionally, generating synthetic files requires experience with lots of PDF internals. (To be honest: I would indeed prefer to avoid having to rely on unclear copyright at all, but this is another topic.) Given the time this cleanup would take, further development and new releases would be blocked for an unknown amount of time, while a new release is overdue. (I planned to do this in the last two weeks, but did not find enough time.) Thus, we probably need a short-term solution like disabling Windows CI altogether for now, as the Ubuntu builds rely on the cache (and thus pass) and Windows-specific issues are quite rare anyway.
If possible, I would still prefer to generate actual minimal files with the desired features or maybe even get rid of obsolete cases. This avoids future copyright issues like the one in this case. (While the files seem to originate from US government sources, which are public domain in the US, this does not necessarily hold true for usage outside of the US: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_the_federal_government_of_the_United_States)
The files are available under CC0 (https://web.archive.org/web/20230625181442/https://digitalcorpora.org/about-digitalcorpora/terms-of-use/ ). Apart from the huge job required if we have to rebuild the examples, we may reduce test coverage if we replace them with generic files.
But this still refers to US law - with us maintainers all being EU-based, this might be different - at least this is how I read the Wikipedia article and its referenced material like section 3.1.7 of https://www.cendi.gov/pdf/FAQ_Copyright_30jan18.pdf or section 105 of https://en.wikisource.org/wiki/Copyright_Law_Revision_(House_Report_No._94-1476):
Windows tests started failing because the Apache Tika Corpora site was taken offline some hours ago: https://lists.apache.org/thread/l53lct6hjojwlhsfwcnzgtj5b1kpyo0h
Example error:
We have to review all the corresponding URLs and check for suitable solutions.