TODO

Kpdfsync                                                                               THINGS-TO-DO
---------------------------------------------------------------------------------------------------

# Alpha Release
# TASKS                                                                 Estimated       Actual
[X] String comparison algorithm, that can analyze the degree of match.
    So that minor differences between the pattern and the read text 
    from pdf files are handled.
[X] Use PDFClown library to highlight the text which matches the most
    with the highlighted text from My Clippings file.
[X] Parse the 'My Clippings.txt' file.
[-] Gui POC
[X] Manual and Automatic creation of association between highlights.
    and notes.
[ ] Use grid layout for displaying and creating page mappings.
    (Not done, in favor of below)
[X] Use custom renderer in list box to show highlight nore mappings.
[X] A separate dialog window for selection of notes for a highlight.
[X] Loging

# Beta Release
# TASKS                                                                 Estimated       Actual
[ ] Optimization and cleanup objects.
[ ] Lib - Use Iterator instead of Enumeration. (Not sure)
[ ] GUI - Status bar showing last error or success message.
[ ] Lib - parseLine function can be protected. It is public now.
[ ] Lib - matching Bom bytes can be put inside a method in the
          ByteOrderMarkTypes enum. It is now separe in
          ByteOrderMark file.
# BUGS:
[X] (GitHub Issue #9)
    Book: Steve Jobs (MOBI)
    Highlights and notes are not getting automatically mapped

    On some pages, getting the yellow exclamation point,
    indicating that there are nore notes than highlight.
    However that is not the case in the Clippings file.

Cause:
    HighlightNotePairManager.pairAutomatic() loops though each PageResource
    and calls pairAutomatic on it. This loop stops at the 1st error, so if
    a pairAutomatic call fails, all subsequent page resources are not paired
    automatically.
    This came to light due to the below `#8` bug. Because that exception
    is thrown before, no page resource after that is paired automatically.

Solution:
    1. Do not break the loop on exception. Skip the page resource which
    cannot be mapped and go to the next. The message box with the error
    will not appear at that point, but the `Bug` icon will be there.

    2. Have a method in PageResource to dry run automatic pairing, if
    this function returns false, skep it.

[ ] (GitHub Issue #8)
    Book: Steve Jobs (MOBI)
    Invalid. There are more notes than highlights

    Invalid. There are more notes than highlights is wrongly
    reported even when there is no issue in the Clippings.txt
    file.

Cause:
    On multi page highlights, there are two location numbers in
    the clippings file. If there is also a note associated with
    the highlight, it however has only one page in the clippings
    file.
    Say the highlights are on page 1107-1108 and the note is on
    1108. Kpdfsync will parse and read the 1st location number
    of the highlight and the only location number of the note.
    Which means, highlight location will be 1107 and note
    locaiton will be 1108. Thus the error - A single note in a
    page.

    Note that, if there was one highligh in 1108-1108 page, then
    the user will see no error message, but 1108 will have note
    whose assiciated highlight is in page 1107. See solution.

    [Kindle does not do multipage highlights on PDF files, so this
    issue not possible in PDF files]

Solution:
    1. When double clicking on a highlight, allow notes from other
    pages to be associated with - in which case move the note to the
    page of the highlight.

    2. When clicking a page which more notes than highlights, we get
    an error. After the error, show the NotesHighlights Map dialog and
    which will show the notes in the page, then allow selected notes
    to be moved to a different page.

    Both this options will move the note to a different page and this
    solving the error. Option 1 starts which the highlight, option 2
    starts from the note, thats all the difference. In both cases the
    user need to know the page of the offending note or the correct
    page where the associated highlight resides.
    I prefer the 1st option, because it is easy to know the page of the
    offending note.

[ ] The string matching algo is too simple, and gives wrong match
    percentage, if the strings being compared differ in the number
    of non-whitespace characters. The two indexes get out of sync
    at the first mismatch and never recover.
    Example:
    PDF text      = 123 56 789
    Clipping text = 123 456 789
    % match       = 3/8          (Wrong)
    % match       = 7/8          (What is expected)

[ ] Related to the above bug, we are highlighting more characters -
    by that many characters as the diffence in the number of
    characters, between the text read from the PDF and the pattern
    read from the clippings file.
    The algorithm matches character by character, the pattern and the
    text from the pdf. The matching and thus the highlighting is as 
    long as the shortest string. If the pattern is shorter, the length
    of the highlight is that. However we humans can see the end of the
    match should have ended before.
    Example:
    PDF text      = 12 67 89
    Clipping text = 12 45 67
    Highlighting  = 12 67 89     (Wrong, Because length of Clipping text
                                  is 8)
    Highlighting  = 12 67        (Expected, Much more accepatble 
                                  highlighting)

[X] (GitHub Issue #2)
    Book: Concrete Mathematics original PDF file.

    For some PDF files, org.pdfclown.tools.TextExtractor.extract() is returning null.
    This is seen with the Concrete Mathematics original PDF file. May be a TrueType font issue.
    Here is the stack trace:
    java.lang.NullPointerException
            at java.base/java.util.Hashtable.put(Hashtable.java:476)
            at org.pdfclown.documents.contents.fonts.PfbParser.parse(PfbParser.java:99)
            at org.pdfclown.documents.contents.fonts.Type1Font.getNativeEncoding(Type1Font.java:96)
            at org.pdfclown.documents.contents.fonts.Type1Font.loadEncoding(Type1Font.java:141)
            at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
            at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
            at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
            at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
            at org.pdfclown.documents.contents.fonts.Type1Font.<init>(Type1Font.java:75)
            at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:249)
            at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
            at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
            at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
            at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
            at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
            at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
            at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
            at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
            at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:817)
            at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
            at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
            at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
            at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
            at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
            at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
            at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
            at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
            at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:172)
    
    Solution:
    Running pdftocairo tool (from poppler-utils package), solves this error.
    Command: pdftocairo -pdf <in pdf file> <out pdf file>

[ ] Highlight is not visible on the output PDF file. This was seen on the Concrete Mathematics
    cropped PDF file.

[ ] If there are too many highlights per page, then 'sometimes' there is a 'heap full' 
    exception at 'new TextMarkup(page, note, MarkupTypeEnum.Highlight, highlights)' place.

    Exception can be reproduced by:
    1. Clippings file: My Clippings_8Feb22.txt
    2. Book          : progit
    3. Skip pages    : 6
    4. Threshold     : 10
    5. Begin highlighting.

    The times, this exception occures, it occures around the 73% mark.

[X] (GitHub Issue #1)
    EOFException at org.pdfclown.tools.TextExtractor.extract() method. This is seen on
    'the_evolution_of_operating_system_cropped.pdf' file. Could also be a font issue.
    Here is the stack trace
java.lang.RuntimeException: java.io.EOFException
        at org.pdfclown.documents.contents.fonts.CffParser.load(CffParser.java:703)
        at org.pdfclown.documents.contents.fonts.CffParser.<init>(CffParser.java:640)
        at org.pdfclown.documents.contents.fonts.Type1Font.getNativeEncoding(Type1Font.java:104)
        at org.pdfclown.documents.contents.fonts.Type1Font.loadEncoding(Type1Font.java:151)
        at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
        at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
        at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
        at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
        at org.pdfclown.documents.contents.fonts.Type1Font.<init>(Type1Font.java:75)
        at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:249)
        at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
        at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
        at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
        at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
        at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
        at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
        at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
        at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
        at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
        at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
        at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
        at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:172)
        at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.EOFException
        at org.pdfclown.bytes.Buffer.readUnsignedShort(Buffer.java:511)
        at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:306)
        at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:324)
        at org.pdfclown.documents.contents.fonts.CffParser.load(CffParser.java:669)
        ... 27 more
:: Cause #1
java.io.EOFException
        at org.pdfclown.bytes.Buffer.readUnsignedShort(Buffer.java:511)
        at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:306)
        at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:324)
        at org.pdfclown.documents.contents.fonts.CffParser.load(CffParser.java:669)
        at org.pdfclown.documents.contents.fonts.CffParser.<init>(CffParser.java:640)
        at org.pdfclown.documents.contents.fonts.Type1Font.getNativeEncoding(Type1Font.java:104)
        at org.pdfclown.documents.contents.fonts.Type1Font.loadEncoding(Type1Font.java:151)
        at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
        at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
        at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
        at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
        at org.pdfclown.documents.contents.fonts.Type1Font.<init>(Type1Font.java:75)
        at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:249)
        at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
        at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
        at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
        at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
        at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
        at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
        at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
        at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
        at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
        at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
        at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
        at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:172)

    Solution:
    Running pdftocairo tool (from poppler-utils package), solves this error.
    Command: pdftocairo -pdf <in pdf file> <out pdf file>

[X] (GitHub Issue #4)
    Book: MMURTL 
    org.pdfclown.util.NotImplementedException: LZWDecode

        Stack trace:
        org.pdfclown.util.NotImplementedException: LZWDecode
        at org.pdfclown.bytes.filters.Filter.get(Filter.java:74)
        at org.pdfclown.objects.PdfStream.getBody(PdfStream.java:193)
        at org.pdfclown.objects.PdfStream.getBody(PdfStream.java:155)
        at org.pdfclown.documents.contents.Contents$ContentStream.moveNextStream(Contents.java:279)
        at org.pdfclown.documents.contents.Contents$ContentStream.(Contents.java:86)
        at org.pdfclown.documents.contents.Contents.load(Contents.java:591)
        at org.pdfclown.documents.contents.Contents.(Contents.java:366)
        at org.pdfclown.documents.contents.Contents.wrap(Contents.java:345)
        at org.pdfclown.documents.Page.getContents(Page.java:571)
        at org.pdfclown.documents.contents.ContentScanner.(ContentScanner.java:1033)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:297)
        at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
        at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:202)
        at java.base/java.lang.Thread.run(Thread.java:833)

    Solution:
    Running pdftocairo tool (from poppler-utils package), solves this error.
    Command: pdftocairo -pdf <in pdf file> <out pdf file>

[ ] (GitHub Issue #3)
    Book: https://plan9.io/sys/doc/lexnames.pdf

    Cannot invoke "org.pdfclown.documents.contents.IContentContext.getContents()" 
    because "contentContext" is null

        Stack Trace:
        Exception :Cannot invoke "org.pdfclown.documents.contents.IContentContext.getContents()" because "contentContext" is null
        java.lang.NullPointerException: Cannot invoke "org.pdfclown.documents.contents.IContentContext.getContents()" because "contentContext" is null
        at org.pdfclown.documents.contents.ContentScanner.(ContentScanner.java:1033)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:297)
        at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
        at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:202)
        at java.base/java.lang.Thread.run(Thread.java:833)

[ ] (GitHub Issue #5) 
    Book :/home/coder/kpdfsync/test-files/Books/Classic Operating Systems_ From Batch Processing To Distributed Systems_cropped.pdf
    
    Index Out Of Bounds in PdfAnnotatorV1
      Exception :index -1, length 0
        java.lang.StringIndexOutOfBoundsException: index -1, length 0
        at java.base/java.lang.String.checkIndex(String.java:4560)
        at java.base/java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:970)
        at java.base/java.lang.StringBuilder.deleteCharAt(StringBuilder.java:298)
        at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.doHighlight(PdfAnnotatorV1.java:96)
        at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:65)
        at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:201)
      at java.base/java.lang.Thread.run(Thread.java:833)

[X] (GitHub Issue #6)
    Book :resulting pdf after fixing original progit.pdf

    Exception :'name' table does NOT exist.
    org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
        at org.pdfclown.documents.contents.fonts.OpenFontParser.getName(OpenFontParser.java:570)
        at org.pdfclown.documents.contents.fonts.OpenFontParser.load(OpenFontParser.java:221)
        at org.pdfclown.documents.contents.fonts.OpenFontParser.<init>(OpenFontParser.java:205)
        at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)
        at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
        at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
        at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
        at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
        at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
        at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
        at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
        at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
        at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
        at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
        at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
        at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
        at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
        at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
        at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
        at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
        at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
        at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
        at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
        at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:201)
        at java.lang.Thread.run(Thread.java:748)

    Solution:
    Modifed pdfclown to treat 'name' and 'post' tables as optional. It is 
    released with kpdfsync 0.9.0-alpha.

[ ] Book: Rust Programming Language (Duplicate issue)
    Highlight is not visible on the output PDF file. The 'Annotations' list shows that the
    highlights and comments exits (comments contents match) but are not visible.