Correction: Diacritics missing from author name though present in PDF #333
Comments
The goal is generally for (a) BibTeX to reflect PDF and (b) author pages to collect all observed name variants. So this is a mistake that should be corrected in the XML. Want to open a PR (and be added to our list of volunteers)? |
Before making a one-time change I'd like to understand the underlying problem. Could it be that his name is listed without the diacritic in START, so it is showing up that way in the metadata for many of the venues? |
I just checked -- indeed, his name in START is just |
Missing diacritics are a widespread problem that we hoped to sidestep by allowing name variants. I suppose one could write a scraper to detect them. If we knew when the switch was made to using START names, then looking for frequent mismatches after that year would help to identify people to contact and ask them to consider updating their profile. |
I adapted the |
In L18 (528 papers, wow), the script made 150 changes (also wow) and printed another 100+ warnings that usually indicate a typo or missing word. The automatic changes are easy to check, and they all look good except for a few:
The first changes ogoneks to cedillas, I believe incorrectly. @mjpost, do you think the PDF should be followed in such cases? |
What system does LREC use to fill metadata? Do they use START also? I'm running the script on L16 now (for #341) and seeing some PDF/XML mismatches that are the same as in L18. For example (not an exhaustive list):
So it would be nice to fix these at the source instead of on our end. |
Seems to be START for both years: |
@mjpost what are your thoughts about editing XML to match PDF in these cases where the PDF has less information than the current XML:
|
Just to be clear, since my tone may indicate otherwise, we can discuss any of these. |
As a general rule, I would say the xml should reflect how you would want to cite the paper, and not necessarily have to match the PDF 100% of the time. On that basis, I would say that the xml should have:
Unfortunately these rules require some research/judgment, but I think it is better to leave things the way they are in case of doubt than it is to exactly mirror the PDFs. |
@mjpost : Approve means that you would like to keep the XML data and not the PDF one, correct?
As an expert in this field [ :-) ]: it depends. You cannot change an oe to ö without any evidence. However, if either the PDF or the XML actually has Köhn in it, it is probably safe to say that the umlaut is the preferred version. Case in point: Philipp Köhn spells his name Koehn in all publications and probably has it that way in Softconf as well. No algorithm would try to change it to Köhn. I try to use the umlaut, but in some cases have to enter an ASCII-only name, so Koehn will be in some database as well.
In this case, one should go with Rösiger. |
@akoehn: yes, approve means I was in favor of the XML diverging from the PDF in the cases mentioned above. I like @danielgildea's concise summary above. We should start throwing conclusions from these discussions into a wiki page that describes our approach. Just seeing (6) above: I think capitalization falls under the Gildea Principles: we correct it to English conventions unless we have evidence the author prefers it that way (e.g., danah boyd, e e cummings). |
I think I hear a consensus about:
(3) Don't copy errors from the PDF into the XML. Note that this can occasionally be a tough call: for example, I had difficulty figuring out Elahe Khorasani vs. Elahe Khorashani.
(4-5) Assuming that neither the PDF nor the XML has an error, go with the PDF.
(1-2) Override styles (like first initials or all caps) imposed by a conference (which is rare).
But: (1-2) There's less clarity about individual papers that use first initials or all caps -- @mjpost says go with the PDF.
So for the examples mentioned in this thread, and a couple more: Consensus cases:
Not sure about whether these are considered typos or not:
And the cases where there is difference of opinion:
|
I'm not sure about this one. Why would an author choose to abbreviate their name in some publications but not others? It could be that the names in the PDF all follow one convention which is inconsistent with what some of the authors normally do. I would generally prefer more information over less information, so if the name was spelled out in START but abbreviated in the PDF, I'd go with the non-abbreviated version. |
I agree about having more information whenever possible; I just want us to have some degree of certainty about it, so that we don't "Gilbert Keith" someone's "GK". If we have information from another source (START ID, inference about a conference convention, etc.) that suggests the author is fine with an evidenced fuller version, I'm fine with it. But part of the reason to have a strong preference for the PDF is that without that convention, one can spend endless time trying to figure out what's right in all these situations. |
@nschneid we previously discussed first initials at length in #245; @mjpost sorry to bring it up again. In the current situation (LREC and other conferences that use START), the full names are known to be correct because they are provided by the authors, so it seems especially sad to delete information (and indeed, I didn't do it in PR #340). I will try to summarize the above discussion in the wiki, and I will back out the changes in L18 that made some last names all-caps. |
Do you want to further discuss how to get people to change their names in START? If not, we can close this issue. |
I think we can pretty reliably restore accents now by scraping them from PDFs. What's the best way to use this -- to identify people to ask to update their START accounts, or just run the scraper as part of ingestion? The scraper also changes casing and inserts/deletes spaces and hyphens. But it can only flag, not autocorrect, changes in spelling or insertion/deletion of names or initials. |
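To make the auto-correct vs. flag distinction concrete, here is a minimal sketch. It is not the actual scraper: the function names `skeleton` and `reconcile` and the three status labels are hypothetical, and it assumes the PDF name has already been extracted as a string. A mismatch is accepted automatically only when the two names differ in nothing but accents, casing, spaces, hyphens, or punctuation; anything else is left alone and flagged for a human.

```python
import unicodedata


def skeleton(name: str) -> str:
    """Reduce a name to bare letters: no accents, case, spaces, hyphens or punctuation."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining accents are not alphabetic, so isalpha() drops them
    # together with spaces, hyphens and periods.
    return "".join(c for c in decomposed if c.isalpha()).casefold()


def reconcile(xml_name: str, pdf_name: str) -> tuple[str, str]:
    """Return (name to keep, status), where status is 'same', 'auto' or 'flag'."""
    if xml_name == pdf_name:
        return xml_name, "same"
    if skeleton(xml_name) == skeleton(pdf_name):
        # Only accents, casing, spacing or hyphenation differ: follow the PDF.
        return pdf_name, "auto"
    # Spelling differs, or a name or initial was added or removed: needs review.
    return xml_name, "flag"


if __name__ == "__main__":
    for xml_name, pdf_name in [("Jan Hajic", "Jan Hajič"), ("Koehn", "Köhn")]:
        print(xml_name, "|", pdf_name, "->", reconcile(xml_name, pdf_name))
```

Under these assumptions Koehn vs. Köhn is flagged rather than changed, which matches the point made earlier in the thread that oe should not be turned into ö without evidence. Letters that have no Unicode decomposition (ø, ł) would also end up flagged in this sketch.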
As far as getting people to update their names in START, it seems like there are a
|
I think we should focus our efforts on implementing this ourselves when we generate the XML (say in
(1) and (2) are still good ideas to reduce the amount that has to be fixed, though. |
I agree, it would be annoying for everyone involved to email individual people. So, we have an author-name scraper (https://github.com/acl-org/acl-anthology/blob/auto_accents/bin/auto_authors.py) that could be incorporated into
|
@davidweichiang, do you want a local copy of the Anthology PDFs? It's 35 GB. If you have a CLSP account we could set this up, or find another way. |
I don’t have a CLSP account (I don’t think). But a local copy might be a good idea if we can figure out a way. |
Short cross link: #295 (comment) for a discussion of how to mirror PDFs in bulk. Should take ~5 minutes to implement. |
I've posted a file with checksums here [14 MB]. |
Can this file (as well as the mirroring script) become part of the repo? |
Can we discuss that further in #295 (the mirroring issue)? I can write the script & create a pull request later today; I am currently on a train with limited bandwidth. Adding the checksums file to the repository seems like a good idea to me. |
Hi, I just ran I wonder if we should change the anthology code to consider any two names that slugify to the same thing to be the same person. That way these people could have one author page without us having to track down every name variant during ingestion, which we don't have any consistent process for currently. |
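A rough sketch of what merging by slug could look like. The `slugify` below is a simplified stand-in written for illustration, not the function the Anthology code actually uses, and `group_by_slug` is a hypothetical helper.

```python
import re
import unicodedata
from collections import defaultdict


def slugify(name: str) -> str:
    """Simplified stand-in slug: strip accents, lowercase, join tokens with '-'."""
    decomposed = unicodedata.normalize("NFKD", name)
    ascii_name = decomposed.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_name.lower()).strip("-")


def group_by_slug(names):
    """Map each slug to the set of observed name variants that produce it."""
    groups = defaultdict(set)
    for name in names:
        groups[slugify(name)].add(name)
    return dict(groups)


if __name__ == "__main__":
    observed = ["Jan Hajič", "Jan Hajic", "Philipp Koehn"]
    for slug, variants in group_by_slug(observed).items():
        print(slug, sorted(variants))
```

With this stand-in, "Jan Hajič" and "Jan Hajic" land on one page, while Köhn and Koehn would still be kept apart because their slugs differ. The obvious risk is the reverse case: two different people whose names produce the same slug would be merged as well, so some way to split them by hand would still be needed.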
I like that idea. We should wrap up discussion in #623 and come up with a solution. |
Ironically, Diacritics Restoration Using Neural Networks lists "Jan Hajic" on the page and in the BibTeX whereas it's spelled "Jan Hajič" in the PDF.
I see he is listed with the diacritic in some venues but not others, though going by the PDFs, Jan Hajič seems to be the preferred spelling.
Should the policy be that if an author is listed with multiple spellings differing only in diacritics, the one with the most diacritics should be applied?
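For illustration, the proposed rule could be sketched as follows. The helper names are hypothetical, and "differing only in diacritics" is interpreted here as "identical after removing Unicode combining marks", which is an assumption rather than an agreed definition.

```python
import unicodedata


def count_diacritics(name: str) -> int:
    """Count the combining marks in the decomposed (NFD) form of the name."""
    return sum(1 for c in unicodedata.normalize("NFD", name) if unicodedata.combining(c))


def strip_diacritics(name: str) -> str:
    """Remove combining marks, leaving the base letters untouched."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def preferred_variant(variants):
    """Pick the variant with the most diacritics; refuse if the variants differ otherwise."""
    if len({strip_diacritics(v) for v in variants}) != 1:
        raise ValueError("variants differ in more than diacritics")
    return max(variants, key=count_diacritics)


if __name__ == "__main__":
    print(preferred_variant(["Jan Hajic", "Jan Hajič"]))
```

The check on the stripped forms keeps the rule narrow: a pair that differs in spelling as well as accents is rejected instead of being resolved automatically.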