This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

References are being replaced into XML with wrong ref id #91

Open
axfelix opened this issue Jan 26, 2017 · 8 comments

Comments

@axfelix
Contributor

axfelix commented Jan 26, 2017

I still need to track down where this is happening, but I've noticed that some Word documents have ref IDs that look like id="ID83ddb41b-5d29-4b6f-b862-74e019db4ec7" after being processed by meTypeset, but wind up with "R56" after being output by our stack. Something is replacing the original IDs...

@axfelix
Contributor Author

axfelix commented Jan 26, 2017

Probably being caused by https://github.com/pkp/xmlps/blob/master/module/ReferencesConversion/src/ReferencesConversion/Model/Converter/References.php#L311, which was done to avoid Pandoc breaking on non-numeric ref IDs. Need to think of a way to fix this without breaking inline ref IDs...

@axfelix
Contributor Author

axfelix commented Jan 26, 2017

@axfelix
Contributor Author

axfelix commented Jan 30, 2017

Actually @kaschioudi, I'm noticing that we seem to be replacing the wrong ref ID numbers into XML documents processed from PDF via Cermine too -- this might be a wider problem in our implementation...

@Vitaliy-1

Actually, there are many issues with generating references. For example, meTypeset treats every string between ( ) as a citation, which is really inconvenient. Placing citations in square brackets does not solve the problem, because occasionally numbers that are not in square brackets are also parsed as references.

For now I have written Java code that parses all references in square brackets and assigns them the needed IDs. I also think it might be better to write a Java app with the JAXB library that parses the JATS after the DOCX XSLT transformation and gives well-formed JATS as output.
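To illustrate the bracket-only approach described above, here is a minimal Python sketch (not the Java code mentioned in the comment, which is not shown in this thread). It assumes numeric citations such as `[1]`, `[2, 4]` or `[3-5]` with plain hyphens; the names `BRACKET_CITATION` and `cited_numbers` are hypothetical.

```python
import re

# One citation item: a number, optionally followed by "-number" for a range.
ITEM = r"\d+(?:\s*-\s*\d+)?"

# Assumption: citations are comma-separated items inside square brackets.
# Parenthesised text and bare numbers are deliberately not matched.
BRACKET_CITATION = re.compile(rf"\[\s*({ITEM}(?:\s*,\s*{ITEM})*)\s*\]")


def cited_numbers(text):
    """Return every reference number cited in square brackets, in order."""
    numbers = []
    for match in BRACKET_CITATION.finditer(text):
        for part in match.group(1).split(","):
            if "-" in part:
                # Expand a range such as "3-5" into 3, 4, 5.
                start, end = (int(p) for p in part.split("-"))
                numbers.extend(range(start, end + 1))
            else:
                numbers.append(int(part))
    return numbers
```

Each returned number could then be turned into an `rid` using whatever ID scheme the reference list uses. Ranges written with an en dash are not handled in this sketch.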

@axfelix
Contributor Author

axfelix commented Feb 5, 2017

We're aware that meTypeset overdetects parentheses as references -- I actually thought that this must be due to some recent changes we made to it as I've been noticing the problem more and more lately, but I tried reverting to an older version and the problem is still there, so it turns out it just never came up to this extent in our earlier testing. It's flagged as an issue.

As for whether we're investing more effort into parsing pre-JATS transformation or post-JATS transformation, it's a balance to strike between "cleanness" and lossiness.

@axfelix
Contributor Author

axfelix commented Feb 5, 2017

@kaschioudi, I think I probably misspoke when I was asking you to fix #50 -- we can't be arbitrarily incrementing ref IDs like in https://github.com/pkp/xmlps/blob/master/module/ReferencesConversion/src/ReferencesConversion/Model/Converter/References.php#L315; we need to update the matching inline xref rid whenever we change a ref ID.

Heidelberg's MPT script has a component which I believe is designed to recurse through meTypeset output and change UUIDs to integer ref IDs when needed, so I'm going to test that first: https://github.com/withanage/mpt/blob/master/static/tools/archive/postProcess.py#L502
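As a rough illustration of keeping ref IDs and inline `rid` values in sync, here is a minimal Python sketch. It is neither the ReferencesConversion code nor the MPT script. It assumes JATS-style `<ref id="...">` and `<xref rid="...">` elements without namespaces, and the function name `renumber_refs` and the `R` prefix are illustrative choices.

```python
import xml.etree.ElementTree as ET


def renumber_refs(root, prefix="R"):
    """Give each <ref> a sequential id and rewrite matching xref rids.

    Returns the old-id -> new-id mapping that was applied.
    """
    mapping = {}
    for n, ref in enumerate(root.iter("ref"), start=1):
        old_id = ref.get("id")
        new_id = f"{prefix}{n}"
        if old_id is not None:
            mapping[old_id] = new_id
        ref.set("id", new_id)

    for xref in root.iter("xref"):
        rid = xref.get("rid")
        if rid is None:
            continue
        # rid may hold several space-separated targets; targets that are
        # not reference ids (figures, tables) are left unchanged.
        xref.set("rid", " ".join(mapping.get(r, r) for r in rid.split()))
    return mapping


SAMPLE = (
    '<article><body><p>See <xref ref-type="bibr" '
    'rid="ID83ddb41b-5d29-4b6f-b862-74e019db4ec7">1</xref>.</p></body>'
    '<back><ref-list><ref id="ID83ddb41b-5d29-4b6f-b862-74e019db4ec7">'
    "<mixed-citation>Example</mixed-citation></ref></ref-list></back></article>"
)
```

Because the mapping is built from the original IDs first and the xrefs are rewritten from it in a single pass, an xref keeps pointing at the same reference even when IDs are reassigned.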

@axfelix
Contributor Author

axfelix commented Feb 5, 2017

Actually, I'm afraid MPT might be too convoluted to add to the workflow just for this -- let's see if we can handle it directly in ReferencesConversion.

axfelix added a commit that referenced this issue Mar 28, 2017
@axfelix
Contributor Author

axfelix commented Mar 29, 2017

OK, this is working and merged into master!

There still seem to be a few issues -- the attached doc has a few unmatched rid="ref5" attributes, but all of the xrefs following the pattern rid="R20" are now matched. Not sure what's causing the difference between "R#" and "ref#" but will look into it. Have removed the branch for now because it was mixed in with #92 when I did the merge, but leaving the issue open.
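A quick way to list leftover unmatched `rid` values like the ones described above is sketched below in Python. It assumes bibliographic xrefs carry `ref-type="bibr"` and that the XML uses no namespaces; the function name `unmatched_rids` is hypothetical and not part of the xmlps stack.

```python
import xml.etree.ElementTree as ET


def unmatched_rids(root):
    """Return the sorted rids of bibliographic xrefs with no matching <ref id>."""
    ref_ids = {ref.get("id") for ref in root.iter("ref")}
    missing = set()
    for xref in root.iter("xref"):
        # Assumption: only ref-type="bibr" xrefs point into the reference list.
        if xref.get("ref-type") != "bibr":
            continue
        for rid in xref.get("rid", "").split():
            if rid not in ref_ids:
                missing.add(rid)
    return sorted(missing)


CHECK_SAMPLE = (
    "<article><body><p>"
    '<xref ref-type="bibr" rid="R20">20</xref> '
    '<xref ref-type="bibr" rid="ref5">5</xref> '
    '<xref ref-type="fig" rid="F1">Fig. 1</xref>'
    "</p></body><back><ref-list>"
    '<ref id="R20"><mixed-citation>Example</mixed-citation></ref>'
    "</ref-list></back></article>"
)
```

Running it over an output file, for example `unmatched_rids(ET.parse("document.xml").getroot())`, would show whether only the "ref#"-style targets are left dangling.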

document.xml.txt
33345-106540-1-PB.pdf
