Kpdfsync THINGS-TO-DO
---------------------------------------------------------------------------------------------------
# Alpha Release
# TASKS Estimated Actual
[X] String comparison algorithm that can measure the degree of match,
so that minor differences between the pattern and the text read
from PDF files are handled.
[X] Use PDFClown library to highlight the text which matches the most
with the highlighted text from My Clippings file.
[X] Parse the 'My Clippings.txt' file.
[-] GUI POC
[X] Manual and automatic creation of associations between highlights
and notes.
[ ] Use grid layout for displaying and creating page mappings.
(Not done, in favor of below)
[X] Use custom renderer in list box to show highlight-note mappings.
[X] A separate dialog window for selection of notes for a highlight.
[X] Logging
# Beta Release
# TASKS Estimated Actual
[ ] Optimization and cleanup objects.
[ ] Lib - Use Iterator instead of Enumeration. (Not sure)
[ ] GUI - Status bar showing last error or success message.
[ ] Lib - parseLine function can be protected. It is public now.
[ ] Lib - matching BOM bytes can be put inside a method in the
ByteOrderMarkTypes enum. It is now separate, in the
ByteOrderMark file.
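A minimal sketch of the BOM task above, assuming the enum simply stores each
mark's bytes. BomType, matches() and detect() are hypothetical names, not the
existing ByteOrderMarkTypes API. Only the UTF-8 and UTF-16 marks are listed;
UTF-32 is left out (its little-endian mark starts with the UTF-16LE bytes, so
it would have to be checked first).

```java
/** Hypothetical enum that owns its BOM bytes and the matching logic. */
enum BomType {
    UTF8(0xEF, 0xBB, 0xBF),
    UTF16_BE(0xFE, 0xFF),
    UTF16_LE(0xFF, 0xFE),
    NONE();

    private final byte[] bytes;

    BomType(int... unsigned) {
        bytes = new byte[unsigned.length];
        for (int i = 0; i < unsigned.length; i++) {
            bytes[i] = (byte) unsigned[i];
        }
    }

    /** True when 'head' (the first bytes of a file) starts with this BOM. */
    boolean matches(byte[] head) {
        if (bytes.length == 0 || head.length < bytes.length) {
            return false;
        }
        for (int i = 0; i < bytes.length; i++) {
            if (head[i] != bytes[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the first BOM type that matches, or NONE. */
    static BomType detect(byte[] head) {
        for (BomType type : values()) {
            if (type.matches(head)) {
                return type;
            }
        }
        return NONE;
    }
}
```

Usage: BomType.detect(firstBytesOfFile) on the first few bytes read from the
clippings file.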
# BUGS:
[X] (GitHub Issue #9)
Book: Steve Jobs (MOBI)
Highlights and notes are not getting automatically mapped.
On some pages, a yellow exclamation point is shown,
indicating that there are more notes than highlights.
However, that is not the case in the Clippings file.
Cause:
HighlightNotePairManager.pairAutomatic() loops through each PageResource
and calls pairAutomatic on it. This loop stops at the 1st error, so if
a pairAutomatic call fails, all subsequent page resources are not paired
automatically.
This came to light due to the `#8` bug below. Because that exception
is thrown first, no page resource after it is paired automatically.
Solution:
1. Do not break the loop on exception. Skip the page resource which
cannot be mapped and go to the next. The message box with the error
will not appear at that point, but the `Bug` icon will be there.
2. Have a method in PageResource to dry run automatic pairing; if
this method returns false, skip it.
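A minimal sketch of solution 1, assuming each page can be paired independently.
The Page interface is a hypothetical stand-in for PageResource (its real API is
not shown in this file), and pairAll() is an illustrative name. Failures are
collected rather than rethrown, so the caller can mark those pages with the
`Bug` icon.

```java
import java.util.ArrayList;
import java.util.List;

final class PairAllSketch {
    /** Hypothetical stand-in for PageResource. */
    interface Page {
        int number();
        void pairAutomatic() throws Exception;
    }

    /**
     * Tries to pair every page. A failing page is skipped and its number
     * recorded; the loop continues with the next page.
     */
    static List<Integer> pairAll(List<? extends Page> pages) {
        List<Integer> failedPages = new ArrayList<>();
        for (Page page : pages) {
            try {
                page.pairAutomatic();
            } catch (Exception e) {
                failedPages.add(page.number());
            }
        }
        return failedPages;
    }
}
```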
[ ] (GitHub Issue #8)
Book: Steve Jobs (MOBI)
Invalid. There are more notes than highlights
'Invalid. There are more notes than highlights' is wrongly
reported even when there is no issue in the Clippings.txt
file.
Cause:
For multi-page highlights, there are two location numbers in
the clippings file. A note associated with such a highlight,
however, has only one location number in the clippings
file.
Say the highlight is on pages 1107-1108 and the note is on
1108. Kpdfsync parses the 1st location number of the highlight
and the only location number of the note.
So the highlight location will be 1107 and the note
location will be 1108. Thus the error: a page with a single
note and no highlight.
Note that if there was also a highlight on page 1108-1108, then
the user will see no error message, but page 1108 will have a note
whose associated highlight is on page 1107. See solution.
[Kindle does not do multi-page highlights on PDF files, so this
issue is not possible with PDF files]
Solution:
1. When double clicking on a highlight, allow notes from other
pages to be associated with it, in which case move the note to the
page of the highlight.
2. Clicking a page which has more notes than highlights gives
an error. After the error, show the NotesHighlights Map dialog,
which will show the notes on the page, then allow selected notes
to be moved to a different page.
Both options move the note to a different page, thus
solving the error. Option 1 starts with the highlight, option 2
starts from the note; that is the only difference. In both cases the
user needs to know the page of the offending note or the correct
page where the associated highlight resides.
I prefer the 1st option, because it is easy to know the page of the
offending note.
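To illustrate the cause described for this bug (this is not one of the two
proposed solutions): a sketch that keeps both ends of a highlight's location
range, so a note at 1108 can be recognised as belonging to a highlight at
1107-1108. LocationRange is a hypothetical class, and the "first-last" input
form is an assumption about how the location text has already been cut out of
the clippings line.

```java
/** Hypothetical value type holding both ends of a location range. */
final class LocationRange {
    final int first;
    final int last;

    LocationRange(int first, int last) {
        this.first = first;
        this.last = last;
    }

    /** Parses "1107-1108" or a single number such as "1108". */
    static LocationRange parse(String text) {
        String[] parts = text.trim().split("-", 2);
        int first = Integer.parseInt(parts[0].trim());
        int last = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : first;
        return new LocationRange(first, last);
    }

    /** True when 'location' falls inside this range (inclusive). */
    boolean contains(int location) {
        return first <= location && location <= last;
    }
}
```

With this, LocationRange.parse("1107-1108").contains(1108) is true, which could
be used to suggest the page to move the note to in option 1.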
[ ] The string matching algo is too simple, and gives wrong match
percentage, if the strings being compared differ in the number
of non-whitespace characters. The two indexes get out of sync
at the first mismatch and never recover.
Example:
PDF text = 123 56 789
Clipping text = 123 456 789
% match = 3/8 (Wrong)
% match = 7/8 (What is expected)
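One possible way to keep the two indexes from drifting, sketched here with
Levenshtein edit distance in place of the index-by-index comparison. This is a
swapped-in technique, not the current kpdfsync algorithm; FuzzyMatch and its
method names are hypothetical. It assumes whitespace is ignored and that the
ratio is taken over the longer string, so the example above scores 8/9 rather
than 7/8. The denominator is a convention that can be changed.

```java
final class FuzzyMatch {
    /** Removes all whitespace, as the comparison is on non-whitespace characters. */
    static String stripWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    /** Classic Levenshtein distance using two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1], prev[j]) + 1;
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Match percentage relative to the longer of the two stripped strings. */
    static double matchPercent(String pdfText, String clippingText) {
        String a = stripWhitespace(pdfText);
        String b = stripWhitespace(clippingText);
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 100.0;
        }
        return 100.0 * (longest - editDistance(a, b)) / longest;
    }
}
```

A single missing character now costs one edit instead of invalidating
everything after it.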
[ ] Related to the above bug, we are highlighting too many characters:
the excess equals the difference in the number of characters
between the text read from the PDF and the pattern
read from the clippings file.
The algorithm matches the pattern and the text from the PDF
character by character. The matching, and thus the highlighting, is as
long as the shorter string. If the pattern is shorter, the highlight
has the length of the pattern. However, a human reader can see that
the match should have ended earlier.
Example:
PDF text = 12 67 89
Clipping text = 12 45 67
Highlighting = 12 67 89 (Wrong, because length of Clipping text
is 8)
Highlighting = 12 67 (Expected, much more acceptable
highlighting)
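A sketch of one way to find where the highlight should stop, assuming a
longest-common-subsequence (LCS) alignment between the PDF text and the
pattern. Again a swapped-in technique, not the current algorithm; HighlightSpan
and matchedEnd() are hypothetical names. The highlight ends at the last PDF
character that the alignment pairs with a pattern character. A coincidental
match late in the text can still stretch the span, so this is only a starting
point.

```java
final class HighlightSpan {
    /**
     * Returns the exclusive end index in 'text' of the last character that
     * an LCS alignment pairs with a character of 'pattern' (0 if none).
     */
    static int matchedEnd(String text, String pattern) {
        int n = text.length();
        int m = pattern.length();
        int[][] lcs = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                if (text.charAt(i - 1) == pattern.charAt(j - 1)) {
                    lcs[i][j] = lcs[i - 1][j - 1] + 1;
                } else {
                    lcs[i][j] = Math.max(lcs[i - 1][j], lcs[i][j - 1]);
                }
            }
        }
        // Walk back from the bottom-right corner. The first pair of equal
        // characters met on the way is the last aligned pair.
        int i = n;
        int j = m;
        while (i > 0 && j > 0) {
            if (text.charAt(i - 1) == pattern.charAt(j - 1)) {
                return i;
            }
            if (lcs[i - 1][j] >= lcs[i][j - 1]) {
                i--; // prefer dropping trailing text, giving the shorter span
            } else {
                j--;
            }
        }
        return 0;
    }
}
```

For the example above, "12 67 89".substring(0, HighlightSpan.matchedEnd("12 67 89", "12 45 67"))
gives "12 67".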
[X] (GitHub Issue #2)
Book: Concrete Mathematics original PDF file.
For some PDF files, org.pdfclown.tools.TextExtractor.extract() fails with a NullPointerException.
This is seen with the Concrete Mathematics original PDF file. May be a font issue (TrueType was suspected, though the trace goes through Type1Font and PfbParser).
Here is the stack trace:
java.lang.NullPointerException
at java.base/java.util.Hashtable.put(Hashtable.java:476)
at org.pdfclown.documents.contents.fonts.PfbParser.parse(PfbParser.java:99)
at org.pdfclown.documents.contents.fonts.Type1Font.getNativeEncoding(Type1Font.java:96)
at org.pdfclown.documents.contents.fonts.Type1Font.loadEncoding(Type1Font.java:141)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
at org.pdfclown.documents.contents.fonts.Type1Font.<init>(Type1Font.java:75)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:249)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:817)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:172)
Solution:
Running the pdftocairo tool (from the poppler-utils package) solves this error.
Command: pdftocairo -pdf <in pdf file> <out pdf file>
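If this workaround were ever run from kpdfsync itself, it could look roughly
like the sketch below. It assumes pdftocairo is installed and on the PATH;
PdfRepair and its methods are hypothetical names, and nothing like this exists
in the code base as far as this file shows.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class PdfRepair {
    /** Builds the same command line as the one given above. */
    static List<String> command(Path in, Path out) {
        return Arrays.asList("pdftocairo", "-pdf", in.toString(), out.toString());
    }

    /** Runs pdftocairo and returns its exit code (0 means success). */
    static int rewrite(Path in, Path out) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command(in, out))
                .inheritIO() // show pdftocairo's own messages on the console
                .start();
        return process.waitFor();
    }
}
```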
[ ] Highlight is not visible on the output PDF file. This was seen on the Concrete Mathematics
cropped PDF file.
[ ] If there are too many highlights per page, then there is sometimes a 'heap full'
exception at the 'new TextMarkup(page, note, MarkupTypeEnum.Highlight, highlights)' call.
Exception can be reproduced by:
1. Clippings file: My Clippings_8Feb22.txt
2. Book : progit
3. Skip pages : 6
4. Threshold : 10
5. Begin highlighting.
When this exception occurs, it occurs around the 73% mark.
[X] (GitHub Issue #1)
EOFException in the org.pdfclown.tools.TextExtractor.extract() method. This is seen on the
'the_evolution_of_operating_system_cropped.pdf' file. Could also be a font issue.
Here is the stack trace:
java.lang.RuntimeException: java.io.EOFException
at org.pdfclown.documents.contents.fonts.CffParser.load(CffParser.java:703)
at org.pdfclown.documents.contents.fonts.CffParser.<init>(CffParser.java:640)
at org.pdfclown.documents.contents.fonts.Type1Font.getNativeEncoding(Type1Font.java:104)
at org.pdfclown.documents.contents.fonts.Type1Font.loadEncoding(Type1Font.java:151)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
at org.pdfclown.documents.contents.fonts.Type1Font.<init>(Type1Font.java:75)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:249)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:172)
at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.io.EOFException
at org.pdfclown.bytes.Buffer.readUnsignedShort(Buffer.java:511)
at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:306)
at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:324)
at org.pdfclown.documents.contents.fonts.CffParser.load(CffParser.java:669)
... 27 more
:: Cause #1
java.io.EOFException
at org.pdfclown.bytes.Buffer.readUnsignedShort(Buffer.java:511)
at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:306)
at org.pdfclown.documents.contents.fonts.CffParser$Index.parse(CffParser.java:324)
at org.pdfclown.documents.contents.fonts.CffParser.load(CffParser.java:669)
at org.pdfclown.documents.contents.fonts.CffParser.<init>(CffParser.java:640)
at org.pdfclown.documents.contents.fonts.Type1Font.getNativeEncoding(Type1Font.java:104)
at org.pdfclown.documents.contents.fonts.Type1Font.loadEncoding(Type1Font.java:151)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
at org.pdfclown.documents.contents.fonts.Type1Font.<init>(Type1Font.java:75)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:249)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:172)
Solution:
Running the pdftocairo tool (from the poppler-utils package) solves this error.
Command: pdftocairo -pdf <in pdf file> <out pdf file>
[X] (GitHub Issue #4)
Book: MMURTL
org.pdfclown.util.NotImplementedException: LZWDecode
Stack trace:
org.pdfclown.util.NotImplementedException: LZWDecode
at org.pdfclown.bytes.filters.Filter.get(Filter.java:74)
at org.pdfclown.objects.PdfStream.getBody(PdfStream.java:193)
at org.pdfclown.objects.PdfStream.getBody(PdfStream.java:155)
at org.pdfclown.documents.contents.Contents$ContentStream.moveNextStream(Contents.java:279)
at org.pdfclown.documents.contents.Contents$ContentStream.<init>(Contents.java:86)
at org.pdfclown.documents.contents.Contents.load(Contents.java:591)
at org.pdfclown.documents.contents.Contents.<init>(Contents.java:366)
at org.pdfclown.documents.contents.Contents.wrap(Contents.java:345)
at org.pdfclown.documents.Page.getContents(Page.java:571)
at org.pdfclown.documents.contents.ContentScanner.<init>(ContentScanner.java:1033)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:297)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:202)
at java.base/java.lang.Thread.run(Thread.java:833)
Solution:
Running the pdftocairo tool (from the poppler-utils package) solves this error.
Command: pdftocairo -pdf <in pdf file> <out pdf file>
[ ] (GitHub Issue #3)
Book: https://plan9.io/sys/doc/lexnames.pdf
Cannot invoke "org.pdfclown.documents.contents.IContentContext.getContents()"
because "contentContext" is null
Stack Trace:
Exception :Cannot invoke "org.pdfclown.documents.contents.IContentContext.getContents()" because "contentContext" is null
java.lang.NullPointerException: Cannot invoke "org.pdfclown.documents.contents.IContentContext.getContents()" because "contentContext" is null
at org.pdfclown.documents.contents.ContentScanner.<init>(ContentScanner.java:1033)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:297)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:202)
at java.base/java.lang.Thread.run(Thread.java:833)
[ ] (GitHub Issue #5)
Book :/home/coder/kpdfsync/test-files/Books/Classic Operating Systems_ From Batch Processing To Distributed Systems_cropped.pdf
Index Out Of Bounds in PdfAnnotatorV1
Exception :index -1, length 0
java.lang.StringIndexOutOfBoundsException: index -1, length 0
at java.base/java.lang.String.checkIndex(String.java:4560)
at java.base/java.lang.AbstractStringBuilder.deleteCharAt(AbstractStringBuilder.java:970)
at java.base/java.lang.StringBuilder.deleteCharAt(StringBuilder.java:298)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.doHighlight(PdfAnnotatorV1.java:96)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:65)
at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:201)
at java.base/java.lang.Thread.run(Thread.java:833)
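The message 'index -1, length 0' is what StringBuilder.deleteCharAt() reports
when asked to delete index length() - 1 on an empty builder. The code at
PdfAnnotatorV1.java:96 is not shown here, so the following is only a guess at
the kind of guard that would avoid it; SafeTrim and dropLastChar() are
hypothetical names.

```java
final class SafeTrim {
    /**
     * Removes the last character if there is one; an empty builder is
     * returned unchanged instead of throwing.
     */
    static StringBuilder dropLastChar(StringBuilder sb) {
        if (sb.length() > 0) {
            sb.deleteCharAt(sb.length() - 1);
        }
        return sb;
    }
}
```

Whether an empty string at that point is itself a symptom (for example a page
with no extractable text) still needs checking.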
[X] (GitHub Issue #6)
Book :resulting pdf after fixing original progit.pdf
Exception :'name' table does NOT exist.
org.pdfclown.util.parsers.ParseException: 'name' table does NOT exist.
at org.pdfclown.documents.contents.fonts.OpenFontParser.getName(OpenFontParser.java:570)
at org.pdfclown.documents.contents.fonts.OpenFontParser.load(OpenFontParser.java:221)
at org.pdfclown.documents.contents.fonts.OpenFontParser.<init>(OpenFontParser.java:205)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.loadEncoding(TrueTypeFont.java:91)
at org.pdfclown.documents.contents.fonts.SimpleFont.onLoad(SimpleFont.java:118)
at org.pdfclown.documents.contents.fonts.Font.load(Font.java:738)
at org.pdfclown.documents.contents.fonts.Font.<init>(Font.java:351)
at org.pdfclown.documents.contents.fonts.SimpleFont.<init>(SimpleFont.java:62)
at org.pdfclown.documents.contents.fonts.TrueTypeFont.<init>(TrueTypeFont.java:68)
at org.pdfclown.documents.contents.fonts.Font.wrap(Font.java:253)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:72)
at org.pdfclown.documents.contents.FontResources.wrap(FontResources.java:1)
at org.pdfclown.documents.contents.ResourceItems.get(ResourceItems.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getResource(SetFont.java:119)
at org.pdfclown.documents.contents.objects.SetFont.getFont(SetFont.java:83)
at org.pdfclown.documents.contents.objects.SetFont.scan(SetFont.java:97)
at org.pdfclown.documents.contents.ContentScanner.moveNext(ContentScanner.java:1330)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.extract(ContentScanner.java:811)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:777)
at org.pdfclown.documents.contents.ContentScanner$TextWrapper.<init>(ContentScanner.java:770)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.get(ContentScanner.java:690)
at org.pdfclown.documents.contents.ContentScanner$GraphicsObjectWrapper.access$0(ContentScanner.java:682)
at org.pdfclown.documents.contents.ContentScanner.getCurrentWrapper(ContentScanner.java:1154)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:633)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:647)
at org.pdfclown.tools.TextExtractor.extract(TextExtractor.java:296)
at coderarjob.kpdfsync.lib.annotator.PdfAnnotatorV1.highlight(PdfAnnotatorV1.java:62)
at coderarjob.kpdfsync.poc.MainFrame$2.run(MainFrame.java:201)
at java.lang.Thread.run(Thread.java:748)
Solution:
Modified pdfclown to treat the 'name' and 'post' tables as optional. It is
released with kpdfsync 0.9.0-alpha.
[ ] Book: Rust Programming Language (Duplicate issue)
Highlight is not visible on the output PDF file. The 'Annotations' list shows that the
highlights and comments exist (comment contents match) but are not visible.