Are we having duplicate files in the Stack Exchange repo? #1017

Closed
Popolechien opened this issue May 31, 2024 · 12 comments
Labels
Question Further information is requested

Comments

@Popolechien
Collaborator

Popolechien commented May 31, 2024

I'm looking at our Stack Exchange files and I see a bunch of them with nearly identical descriptions yet very different sizes.
I haven't investigated yet but @RavanJAltaie can you please have a look?

Screenshot 2024-05-31 at 16 01 28
Screenshot 2024-05-31 at 16 05 38

@Popolechien Popolechien added the Question Further information is requested label May 31, 2024
@benoit74
Contributor

The Korean issue was already pointed out by a contributor a long time ago, but no one cared: #838 (a Portuguese file was also pointed out in that issue).

For English, this is weird.

@Popolechien
Collaborator Author

Ok so I guess these old files need to be identified and deleted @RavanJAltaie

@RavanJAltaie
Contributor

@benoit74 I checked in the library
image

There is only one file per language, each with a different description.
@Popolechien Which files do you mean to delete?

@benoit74 benoit74 reopened this Jul 4, 2024
@Popolechien
Collaborator Author

@RavanJAltaie The correct search term would be Korean (https://library.kiwix.org/#lang=eng&q=korean), in which case you get this:
Screenshot 2024-07-04 at 13 13 24

The first one (on the left) is labeled as Mul, which is incorrect since this is for English speakers learning Korean. Not sure we can fix and restart the recipe, but at least try.

The second one is from 2020, so you need to find it on download.kiwix.org and ask for deletion. I have no idea why the older one should be bigger than the more recent one; maybe this should be investigated in case the recent one is broken or faulty.

@benoit74
Contributor

benoit74 commented Jul 4, 2024

There are at least these two issues:

I can't work out why there are 3 "English" ZIMs in the imager service catalog. This is probably a bug in the imager service; it will be easier to confirm once the rest is cleaned up.

@Popolechien
Collaborator Author

Popolechien commented Jul 4, 2024

I see we also have a problem with https://library.kiwix.org/#lang=eng&q=english&category=stack_exchange

So please @RavanJAltaie try to identify how many of these duplicates we have and ask for deletions. Regarding Korean, looking at download.kiwix.org/zim/stack_exchange, I see that the following should probably be deleted:
[ ] korean.stackexchange.com_en_all_2020-10.zim
[ ] korean.stackexchange.com_ko_all_2023-05.zim
[ ] korean.stackexchange.com_ko_all_2023-10.zim

@benoit74 any idea why the library would choose to display the 2020 file rather than the 2023 one?

@benoit74
Contributor

benoit74 commented Jul 4, 2024

> @benoit74 any idea why the library would choose to display the 2020 file rather than the 2023 one?

korean.stackexchange.com_en_all is not the same ZIM as korean.stackexchange.com_ko_all, so the library treats them as two separate entries and shows the latest version of each.

For korean.stackexchange.com_en_all, the 2020-10 version is the latest one.

For korean.stackexchange.com_ko_all, the 2023-10 version is the latest one.
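
To illustrate the distinction, here is a minimal sketch that groups the three files listed earlier by base name and keeps the newest period for each. It is not the library's actual code: the `<name>_<YYYY-MM>.zim` pattern is inferred from the filenames in this thread, and `group_by_name` and `latest_and_superseded` are hypothetical helper names.

```python
import re
from collections import defaultdict

# Assumed naming pattern: <name>_<YYYY-MM>.zim, where <name> already
# includes the language code (e.g. "..._en_all" vs "..._ko_all").
ZIM_RE = re.compile(r"^(?P<name>.+)_(?P<period>\d{4}-\d{2})\.zim$")


def group_by_name(filenames):
    """Map each base name to its periods, sorted oldest first."""
    groups = defaultdict(list)
    for filename in filenames:
        match = ZIM_RE.match(filename)
        if match:  # skip anything that does not fit the assumed pattern
            groups[match["name"]].append(match["period"])
    # YYYY-MM strings sort chronologically when sorted as plain text.
    return {name: sorted(periods) for name, periods in groups.items()}


def latest_and_superseded(filenames):
    """Return (latest, superseded) as two sorted lists of filenames."""
    latest, superseded = [], []
    for name, periods in group_by_name(filenames).items():
        latest.append(f"{name}_{periods[-1]}.zim")
        superseded.extend(f"{name}_{p}.zim" for p in periods[:-1])
    return sorted(latest), sorted(superseded)


files = [
    "korean.stackexchange.com_en_all_2020-10.zim",
    "korean.stackexchange.com_ko_all_2023-05.zim",
    "korean.stackexchange.com_ko_all_2023-10.zim",
]
latest, superseded = latest_and_superseded(files)
print("latest:", latest)
print("superseded:", superseded)
```

Under these assumptions, the `_en_all` 2020-10 file and the `_ko_all` 2023-10 file both come out as "latest", and only the `_ko_all` 2023-05 file is superseded by a newer file with the same name.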

@benoit74
Contributor

benoit74 commented Jul 4, 2024

> The first one (on the left) is labeled as Mul, which is incorrect since this is for English speakers learning Korean. Not sure we can fix and restart the recipe, but at least try.

Not possible with sotoki AFAIK, at least for now.

@Popolechien
Collaborator Author

Popolechien commented Jul 4, 2024

Good catch. But then why don't we have three of them (mul, ko and en) appearing on the library?

@benoit74
Contributor

benoit74 commented Jul 4, 2024

> Good catch. But then why don't we have three of them (mul, ko and en) appearing on the library?

Either the scraper has changed or the metadata sourced from StackExchange has changed. I think the second option is the most probable.

@benoit74
Contributor

benoit74 commented Jul 4, 2024

I've created the deletion request: #1103

Let's keep this open to see what happens with the three English ZIMs once the deletion is done.

@benoit74
Contributor

benoit74 commented Jul 4, 2024

No need to keep this open, I found the bugs: openzim/sotoki#321 and offspot/imager-service#436

@benoit74 benoit74 closed this as completed Jul 4, 2024