Shutdown of Apache Tika Corpora #3035

Open
stefan6419846 opened this issue Jan 9, 2025 · 8 comments
Labels
is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements nf-testing Non-functional change: Testing

Comments

@stefan6419846
Collaborator

Windows tests started failing because the Apache Tika Corpora site was taken offline a few hours ago: https://lists.apache.org/thread/l53lct6hjojwlhsfwcnzgtj5b1kpyo0h

Example error:

FAILED tests/test_page.py::test_extract_text[https://corpora.tika.apache.org/base/docs/govdocs1/932/932446.pdf-tika-932446.pdf] - urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>

We have to review all the corresponding URLs and check for suitable solutions.

@stefan6419846 stefan6419846 added nf-testing Non-functional change: Testing is-maintenance Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements labels Jan 9, 2025
@j-t-1
Contributor

j-t-1 commented Jan 9, 2025

Is the complete list of affected files a subset of example_files.yaml (pseudocode: local_filename.startswith("tika"))? And is the reason we need each of them to be found in the tests where they are used?

@stefan6419846
Collaborator Author

stefan6419846 commented Jan 9, 2025

Not necessarily - it is every file referencing https://corpora.tika.apache.org somewhere. For the reasons, deeper research might be required - looking at the code/usages itself, looking into the commits/PRs introducing them etc.
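A minimal sketch of how such a list could be collected, assuming a local checkout of the repository: it walks the tree and reports every line mentioning the host. The function name `find_tika_references` and the set of file suffixes are illustrative assumptions, not part of the project.

```python
from pathlib import Path

# Host named in this issue; everything else below is a hypothetical helper.
TIKA_HOST = "corpora.tika.apache.org"


def find_tika_references(root, suffixes=(".py", ".yaml", ".yml", ".md")):
    """Return (relative path, line number, stripped line) for each line mentioning TIKA_HOST."""
    root = Path(root)
    hits = []
    for path in sorted(root.rglob("*")):
        # Only look at text-like files with one of the assumed suffixes.
        if not path.is_file() or path.suffix not in suffixes:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if TIKA_HOST in line:
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits
```

Calling `find_tika_references(".")` from the repository root would give a starting list; the reason each file is needed would still have to be researched per test.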

@MasterOdin
Member

MasterOdin commented Jan 9, 2025

Could one option be to ask on the mailing list for access to the corpora? Then, for each file used in the test suite, we could determine whether it can be copied into the samples directory. Failing that, we would at least have an easier time figuring out which property of the PDF a test needed, and could create a synthetic PDF that exercises the same behavior.

@stefan6419846
Collaborator Author

Most of the active developers should still have access to the relevant files due to local caching and - if in doubt - getting them from the Ubuntu cache used by GitHub Actions (I have done this in the past and should have more than one local copy of the relevant files). Whether our specific files are relevant for the take-down requests or not is unclear here.

I have some doubts about the sample files, though. They would have to be CC-BY-SA-4.0, which requires re-building local copies of them with the problematic features. Identifying the actual reason for using the PDF requires quite some research from my experience with looking into the few arXiv files (#2904). Additionally, generating synthetic files requires additional experience with lots of PDF internals. (To be honest: I would indeed prefer to avoid having to rely on unclear copyright at all, but this is another topic.)

Given the time this cleanup would take, further development and new releases would be blocked for an unknown amount of time, while a new release is overdue. (I planned to do this in the last two weeks, but did not find enough time to do so.) Thus, we probably need a short-term solution like disabling Windows CI altogether for now, as the Ubuntu builds rely on the cache (and thus pass) and Windows-specific issues are quite sparse anyway.
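To illustrate why a cached run keeps passing while an uncached one fails, here is a minimal sketch of a cache-first download helper. The name `get_data_cached`, the `pdf_cache` directory, and the `opener` parameter are assumptions for illustration and not necessarily how the test suite implements it.

```python
from pathlib import Path
from urllib.request import urlopen


def get_data_cached(url, name, cache_dir="pdf_cache", opener=urlopen):
    """Return the bytes for `name`, touching the network only on a cache miss."""
    cache_path = Path(cache_dir) / name
    if cache_path.exists():
        # Cache hit: the remote host is never contacted, so an offline
        # server does not matter (the situation of the Ubuntu builds).
        return cache_path.read_bytes()
    # Cache miss: this raises URLError if the host refuses the connection
    # (the situation of the Windows builds in this issue).
    with opener(url) as response:
        data = response.read()
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    cache_path.write_bytes(data)
    return data
```

With a pre-populated cache directory the helper returns immediately; with an empty one it fails as soon as the host is unreachable, which matches the behavior described above.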

@pubpub-zz
Collaborator

pubpub-zz commented Jan 9, 2025

After some research I've found an (original?) location for the files:
https://digitalcorpora.org/corpora/file-corpora/files/

In recent years a significant amount of forensic research has involved the analysis of files or file fragments. In the absence of such corpora, researchers and students who wish to work with files first need to collect files—a surprisingly difficult task if one wishes a large number of files of many types from a variety of sources. Although many files can be freely downloaded from the web, building and running a high-performance document discovery and downloading tool is not a trivial task. Once files are downloaded they need to be analyzed, characterized and curated. Finally, many corpora that might be assembled cannot be easily redistributed due to privacy or copyright concerns.
For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain using the Yahoo and Google search engines.

Since the files are stored in zip archives, they are not easy to use directly. I therefore propose to put a copy in this thread and change the links.
List of links to be addressed:
https://corpora.tika.apache.org/base/docs/govdocs1/906/906769.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/909/909655.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/911/911260.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/912/912552.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/913/913678.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/914/914102.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/914/914133.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/914/914568.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/914/914902.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/915/915194.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/918/918113.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/918/918137.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/922/922840.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/923/923621.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/924/924546.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/924/924562.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/924/924666.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/924/924794.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/930/930513.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/932/932446.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/932/932449.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/933/933322.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/934/934771.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/935/935066.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/935/935981.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/935/935996.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/937/937334.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/938/938702.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/940/940704.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/941/941536.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/942/942050.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/942/942303.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/942/942358.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/948/948176.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/950/950337.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/952/952016.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/952/952133.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/953/953770.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/956/956939.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/957/957304.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/957/957721.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/958/958496.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/959/959184.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/959/959519.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/960/960317.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/961/961883.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/962/962292.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/963/963692.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/964/964029.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/965/965118.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/966/966635.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/967/967399.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/967/967943.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/969/969502.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/971/971703.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/972/972174.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/972/972243.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/972/972486.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/972/972962.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/974/974966.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/976/976030.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/976/976488.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/977/977609.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/977/977774.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/978/978477.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/980/980613.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/981/981961.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/982/982336.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/984/984877.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/985/985770.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/986/986065.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/988/988698.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/989/989691.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/992/992472.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/994/994636.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/994/994759.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/995/995175.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/997/997511.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/998/998719.pdf
https://corpora.tika.apache.org/base/docs/govdocs1/999/999944.pdf

waiting for some feedback before uploading the files
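As a rough sketch of how the links above could be matched to the zipped corpus: the code below derives an archive name and member path from each URL and reads the file from a locally downloaded archive. The layout (one archive per three-digit directory, such as `906.zip` containing `906/906769.pdf`) is an assumption that would need to be verified against the actual downloads; the function names are hypothetical.

```python
import zipfile
from urllib.parse import urlparse


def zip_location(url):
    """Map a govdocs1 URL to an (archive name, member path) pair.

    ASSUMPTION: one zip per three-digit directory, with members stored
    as "<directory>/<file name>". Not confirmed in this thread.
    """
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 3 or parts[-3] != "govdocs1":
        raise ValueError(f"not a govdocs1 URL: {url}")
    directory, filename = parts[-2], parts[-1]
    return f"{directory}.zip", f"{directory}/{filename}"


def read_from_archive(archive, url):
    """Read the file referenced by `url` from an already downloaded archive."""
    _, member = zip_location(url)
    with zipfile.ZipFile(archive) as zf:
        return zf.read(member)
```

Grouping the links by archive name first would keep the number of archives to download small, since several links share the same three-digit directory.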

@stefan6419846
Collaborator Author

If possible, I would still prefer to generate actual minimal files with the desired features or maybe even get rid of obsolete cases. This avoids future issues about copyrights as in this case. (While the files seem to originate from US government sources which are public domain in the US, this does not necessarily hold true for usage outside of the US: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_the_federal_government_of_the_United_States)

@pubpub-zz
Collaborator

If possible, I would still prefer to generate actual minimal files with the desired features or maybe even get rid of obsolete cases. This avoids future issues about copyrights as in this case. (While the files seem to originate from US government sources which are public domain in the US, this does not necessarily hold true for usage outside of the US: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_the_federal_government_of_the_United_States)

The files are available under CC0 (https://web.archive.org/web/20230625181442/https://digitalcorpora.org/about-digitalcorpora/terms-of-use/).

Apart from the huge job required if we have to rebuild the examples, we may reduce the test coverage that the generic files provide.

@stefan6419846
Collaborator Author

But this still refers to US law - with us maintainers all being EU-based, this might be different - at least this is how I read the Wikipedia article and its referenced material like section 3.1.7 of https://www.cendi.gov/pdf/FAQ_Copyright_30jan18.pdf or section 105 of https://en.wikisource.org/wiki/Copyright_Law_Revision_(House_Report_No._94-1476):

The prohibition on copyright protection for United States Government works is not intended to have any effect on protection of these works abroad. Works of the governments of most other countries are copyrighted. There are no valid policy reasons for denying such protection to United States Government works in foreign countries, or for precluding the Government from making licenses for the use of its works abroad.


4 participants