Shutdown of Apache Tika Corpora #3035

stefan6419846 · 2025-01-09T12:38:21Z

Windows tests started failing because the Apache Tika Corpora site was taken offline a few hours ago: https://lists.apache.org/thread/l53lct6hjojwlhsfwcnzgtj5b1kpyo0h

Example error:

FAILED tests/test_page.py::test_extract_text[https://corpora.tika.apache.org/base/docs/govdocs1/932/932446.pdf-tika-932446.pdf] - urllib.error.URLError: <urlopen error [WinError 10061] No connection could be made because the target machine actively refused it>

We have to review all the corresponding URLs and check for suitable solutions.

j-t-1 · 2025-01-09T14:54:30Z

Is the complete list of affected files a subset of example_files.yaml (pseudocode: local_filename.startswith("tika"))? And is the reason we need each of them to be found in the tests where they are used?

stefan6419846 · 2025-01-09T14:57:32Z

Not necessarily - it is every file referencing https://corpora.tika.apache.org somewhere. Finding the reasons might require deeper research: looking at the code and usages themselves, looking into the commits/PRs that introduced them, etc.
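To get a first overview of the affected files, a plain text scan of the repository may be enough. The following is only a minimal sketch: the function name `find_tika_references`, the set of scanned file suffixes, and the URL pattern are assumptions made for illustration and are not part of the project.

```python
import re
from pathlib import Path

# Assumption: references are written as plain https URLs on this host.
TIKA_URL = re.compile(r"https://corpora\.tika\.apache\.org/[^\s\"'\]\)>]+")

# Assumption: only these text file types are worth scanning.
SCANNED_SUFFIXES = {".py", ".yaml", ".yml", ".md"}


def find_tika_references(root):
    """Map each scanned file under root to the sorted Tika corpora URLs it mentions."""
    root = Path(root)
    hits = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SCANNED_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        urls = sorted(set(TIKA_URL.findall(text)))
        if urls:
            # Use forward slashes so the keys look the same on Windows and Linux.
            hits[path.relative_to(root).as_posix()] = urls
    return hits
```

Calling `find_tika_references(".")` from the repository root returns a dictionary of file names and the URLs found in them. It only answers which files are affected; why each PDF was chosen still has to be researched by hand.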

MasterOdin · 2025-01-09T16:24:45Z

Could one option be to ask on the mailing list for access to the corpora? Then, for each file used in the test suite, we could determine whether it can be copied into the samples directory. Failing that, access would at least make it easier to figure out which quality of the PDF a test needs, so that we can create a synthetic PDF that exercises the same behavior.

stefan6419846 · 2025-01-09T16:44:55Z

Most of the active developers should still have access to the relevant files due to local caching and - if in doubt - getting them from the Ubuntu cache used by GitHub Actions (I have done this in the past and should have more than one local copy of the relevant files). Whether our specific files are relevant for the take-down requests or not is unclear here.

I have some doubts about the sample files, though. They would have to be CC-BY-SA-4.0, which requires re-building local copies of them with the problematic features. In my experience from looking into the few arXiv files (#2904), identifying the actual reason for using a given PDF requires quite some research. Additionally, generating synthetic files requires additional experience with lots of PDF internals. (To be honest: I would indeed prefer to avoid having to rely on unclear copyright at all, but this is another topic.)

With the time cleaning this up would take, further development and pushing new releases will/would be blocked for an unknown amount of time, while a new release is overdue. (I planned to do this in the last two weeks, but did not find enough time to do so.) Thus, we probably need a short-term solution like disabling Windows CI altogether for now as the Ubuntu builds rely on the cache (and thus pass) and Windows-specific issues are quite sparse anyway.
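The reason cached builds keep passing while uncached ones fail can be illustrated with a small cache-first download helper. This is a hedged sketch, not the project's actual test helper: the names `get_data`, `cache_dir`, and `fetch` are hypothetical and chosen only for this example.

```python
from pathlib import Path
from urllib.request import urlopen


def get_data(url, name, cache_dir, fetch=None):
    """Return the bytes for name, downloading url only on a cache miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / name
    if cached.exists():
        # Cache hit: no network access, so an offline host goes unnoticed.
        return cached.read_bytes()
    if fetch is None:
        # Default downloader; raises urllib.error.URLError if the host is down.
        def fetch(target):
            with urlopen(target) as response:
                return response.read()
    data = fetch(url)
    cached.write_bytes(data)
    return data
```

With a warm cache the download branch is never reached, so such a run says nothing about whether the remote files still exist. A cold cache hits `fetch` and surfaces the connection error immediately.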

pubpub-zz · 2025-01-09T20:14:43Z

After some research, I've found a (the original?) location for the files:
https://digitalcorpora.org/corpora/file-corpora/files/

In recent years a significant amount of forensic research has involved the analysis of files or file fragments. In the absence of such corpora, researchers and students who wish to work with files first need to collect files—a surprisingly difficult task if one wishes a large number of files of many types from a variety of sources. Although many files can be freely downloaded from the web, building and running a high-performance document discovery and downloading tool is not a trivial task. Once files are downloaded they need to be analyzed, characterized and curated. Finally, many corpora that might be assembled cannot be easily redistributed due to privacy or copyright concerns.
For these reasons, we have created and released a corpus of 1 million documents that are freely available for research and may be (to the best of our knowledge) freely redistributed. These documents were obtained by performing searches for words randomly chosen from the Unix dictionary, numbers randomly chosen between 1 and 1 million, and randomized combinations of the two, for documents of specified file types that resided on web servers in the .gov domain using the Yahoo and Google search engines.

Waiting for some feedback before uploading the files.

stefan6419846 · 2025-01-09T20:19:21Z

If possible, I would still prefer to generate actual minimal files with the desired features or maybe even get rid of obsolete cases. This avoids future issues about copyrights as in this case. (While the files seem to originate from US government sources which are public domain in the US, this does not necessarily hold true for usage outside of the US: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_the_federal_government_of_the_United_States)

pubpub-zz · 2025-01-09T20:33:39Z

> If possible, I would still prefer to generate actual minimal files with the desired features or maybe even get rid of obsolete cases. This avoids future issues about copyrights as in this case. (While the files seem to originate from US government sources which are public domain in the US, this does not necessarily hold true for usage outside of the US: https://en.wikipedia.org/wiki/Copyright_status_of_works_by_the_federal_government_of_the_United_States)

The files are available under CC0 (https://web.archive.org/web/20230625181442/https://digitalcorpora.org/about-digitalcorpora/terms-of-use/).

Apart from the huge job required if we have to rebuild the examples, we may reduce the test coverage that the generic files provide.

stefan6419846 · 2025-01-09T20:47:55Z

But this still refers to US law - with us maintainers all being EU-based, this might be different - at least this is how I read the Wikipedia article and its referenced material like section 3.1.7 of https://www.cendi.gov/pdf/FAQ_Copyright_30jan18.pdf or section 105 of https://en.wikisource.org/wiki/Copyright_Law_Revision_(House_Report_No._94-1476):

The prohibition on copyright protection for United States Government works is not intended to have any effect on protection of these works abroad. Works of the governments of most other countries are copyrighted. There are no valid policy reasons for denying such protection to United States Government works in foreign countries, or for precluding the Government from making licenses for the use of its works abroad.

stefan6419846 added the labels nf-testing (Non-functional change: Testing) and is-maintenance (Anything that is just internal: Simplifying code, syntax changes, updating docs, speed improvements) on Jan 9, 2025

stefan6419846 mentioned this issue on Jan 9, 2025 in "MAINT: Add a comment" #3034 (Open)
