-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathChangelog
executable file
·377 lines (313 loc) · 15.1 KB
/
Changelog
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
Version 3.92 - 2021.04.16
-------------------------
Ensured that flexrules also have the initial string "\rV3\r" if the training
set is unambiguous. The string is inserted by the function 'flexcombi', which
is called to combine flexrules from each pass that is needed to handle
ambiguities. The function 'flexcombi' is now also called after the first (and
perhaps only) pass, with no merging really taking place.
Version 3.91 - 2019.01.16
-------------------------
When training with tags, copy the original option structure to a tag specific
option structure, so that parameter training processes start with a fresh
initial set of weights, instead of inheriting them from the previous tag.
Version 3.90 - 2019.01.09
-------------------------
Increased size of 'spass' buffer in affixtrain.cpp.
Write information to a file, not stdout.
Some signed/unsigned issues solved.
Set minimum no. of lines in fractional training file to 1.
Compute 'fraclines' in affixtrain.cpp as number of non-empty lines.
Fixed null-pointer issue in node.cpp.
Version 3.89 - 2018.06.28
-------------------------
Increased size of line buffers in testrules.cpp.
Version 3.88 - 2018.06.28
-------------------------
Another cause of (null) problem found and fixed. (options->tag == 0 is legal
but generated file names like pairsToTrainInNextPass.1(null).pass1 .)
Version 3.87 - 2018.03.14
-------------------------
Cause of (null) problem found and fixed it.
Version 3.86 - 2018.03.11
-------------------------
The string "(null)" appeared in the file names starting with "evaluation" and
some others, where the string "NoTags" was expected. Currently there is no
explanation for this bug. Inserted an assert() to check optionStruct::POStag().
Version 3.85 - 2018.02.28
-------------------------
The rewritten node::Prune function was able to prune (delete) `this' node.
As a consequence the top root could be deleted, which should never happen. The
fix is to call Prune on the child of the top node instead of on the top node
itself. One can still use the old Prune function by setting #define OLDPRUNE
in settingsaffixtrain.h. However, the old Prune function is more recursive and
can cause stack overflow. Thanks to Marek Medveď for pointing out the problem.
Version 3.84 - 2018.02.17
-------------------------
Fixed errors in prediction of OOV word ambiguity.
Numbers were only derived from last run in cross validation, but normalized as
if all runs had been counted. Also, the factor 0.5 in the computation of true
positives and false negatives has been dropped. This factor had been introduced
to compensate for the fact that ambiguous words occur twice (or more often) in
the test set. However, other statistics did count with respect to number of
test lines, and did not "corrigate" for ambiguity.
Made start with conversion of recursive calls to iterations.
Version 3.83 - 2018.02.02
-------------------------
Extra print statement when verbose.
Version 3.82 - 2017.12.20
-------------------------
Made it possible to compute optimal training parameters with a training set
that has more than two columns, e.g. a text corpus in with tab separations
between word, lemma and tag, one text word per line. However, it is still
better to prepare a training list without duplicate lines, as affixtrain
does not remove duplicates.
Version 3.81 - 2017.12.8
-------------------------
Renamed the global boolean variable UTF8 (defined in letterfunc/utf8func.h) to globUTF8.
Renamed the parameter file "parms.txt" to "approx_parms.txt".
Version 3.8 - 2017.09.26
-------------------------
New option -m that sets the maximum ambiguity of the created rules.
Default is -m 30. -m 1 means: create unambiguous rules.
If m > 1 affixtrain creates two or more trees and merges them.
Version 3.74 - 2017.09.22
-------------------------
Fixed errors when RULESASTEXTINDENTED = 1, added typecasts to stop complaining,
made node::compatibleSibling(node *) iterative, set maximum recursion depth at
30000 for cleanup(node *,int), removed wrong assert() in testrules.cpp.
Version 3.73 - 2017.09.16
-------------------------
Added description of -R option in output from affixtrain -h.
Do not require -D option if -G is set. (-D Parameters are for affixtrain
itself, not for external training programs.)
-c option is set to 0 if -G is set.
A bug is fixed so that one can test with the training set also if the training
set is clustered. (Which is meaningless if test set == training set).
Version 3.72 - 2017.08.09
-------------------------
Adapted example/readme.txt to the new default behaviour when the -n parameter
is not given explictly. Also explained that you always need to generate
flexrules on the basis of a fullform / lemma list, disregarding POS tags in the
training data, even if you have plans to also generate flexrules on a per POS
tag basis. The neutral flexrules are needed as a last resort for words with
tags that are not represented in the training data.
The help text (affixtrain -h) explains that affixtrain assumes -n FBT if there
are three or more columns in the training data and otherwise -n FB.
(Previously, the default has been -n FB in all cases. That was the reason why
the example did not work as intended.)
Version 3.71 - 2017.06.20
-------------------------
Fixed bugs related to the new behaviour in 3.7
Changed -v option. Before it was a boolean, now it is a natural number or -
If - or 0, the program is not verbose. If 1 or greater, the program is verbose.
The greater the number, the more verbose.
When validating the rules, if there are few available word lemma pairs, the
program does merely n-fold cross validation, where n is the number of clusters
of word lemma pairs. A cluster, or "clump", is a set of word lemma pairs that
are linked by having the same word (homographs) or by having the same lemma
(belonging to same paradigm). If there are many clumps, validation is still
done by cross validation, but n varies so as to give statistically significant
results. For example, if there are a million clumps and 1.5 % is used for
testing, just two iterations are enough to establish reliable values for
accuracy and precision, since 1.5 % of a million still is 15000 (randomly
chosen) test pairs.
If no column format is specified, the program guesses whether it is FB or FBT,
depending on whether it finds two or three columns. If the input has three
columns, but that column has to be ignored, then explicitly state that on the
command line, e.g. -n FBO.
If the column format is specified with a column for tags, or the column format
is not specified, and only two columns are found, then the format FB is
assumed.
Version 3.7 - 2017.06.18
------------------------
Parameters for optimal training of flexrules are now computed for each PoS tag
separately. New command line option -k <tag name> for training rules for a
single PoS tag. If <tag name> is empty and the -n option indicates that one of
the columns in the training set contains tags, then rules are computed for each
tag occurring in the training set, each training with individual optimal
parameters.
Version 3.67 - 2017.03.20
-------------------------
Right trimming of tag and other fields. Tag fields, often at the end of a line,
could contain spurious characters, such as Carriage Return.
Version 3.66 - 2016.10.03
-------------------------
Fixed error that caused combination of flexrules to fail.
#define _NA 0
Version 3.65 - 2016.09.29
-------------------------
fixed missing arguments
Version 3.64 - 2016.09.19
-------------------------
Removed parameters from deeply recursive print functions.
Fixed deallocation error.
Version 3.63 - 2016.09.07
-------------------------
Fixed error caused by skipping trainingpairs.
Version 3.62 - 2016.09.07
-------------------------
Replaced \r in printf by \n if not Windows.
Version 3.61 - 2016.09.07
-------------------------
Fixed assertion error.
Version 3.60 - 2016.09.06
-------------------------
When testing candidate rules, if the amount of tests is very large due to the
amount of candidate rules and the amount of training pairs, training pairs
will be skipped so the running time remains more or less acceptable.
Changed the type of some ints to size_t or ptrdiff_t to silence the 64-bit
Visual C++ compiler. (Microsoft Visual Studio 2008. Version 9.0)
Version 3.52 - 2016.08.13
-------------------------
Eliminated more calls to the expensive apply() function (if #defined _NA 1)
For very big training sets it may be better to set #defined _NA 0 to
restrict running time.
Version 3.51 - 2016.08.12
-------------------------
Eliminated calls to the expensive apply() function.
Version 3.50 - 2016.08.12
-------------------------
New functionality (if #define PRUNETRAININGPAIRS 1 in settingsaffixtrain.h):
While building the decision tree, discard the training pairs that require
rules that are not shared/supported by any other training pairs.
Version 3.41
-------------------------
More cosmetic changes, eliminiation of a few function calls.
Version 3.4 - 2016.08.09
-------------------------
Reorganization of source code - most classes in separate files.
Version 3.37 - 2016.05.03
-------------------------
Typecast char to unsigned char when calling isspace and isUpper.
Version 3.36 - 2016.02.26
-------------------------
Added parameters for Norwegian and Swedish.
Version 3.35 - 2016.02.08
-------------------------
Allow ; or : or , as separator between parameters (-D option)
Added new parameters for et, it and la.
Version 3.34 - 2016.02.07
-------------------------
Evaluation file is now called 'evaluation...' and moved to current working
directory.
Described -x option (remove intermediary files).
Version 3.31 - 2016.01.25
-------------------------
New: 10-fold cross validation
New: test other trainers/lemmatizers
Version 3.20 - 2015.09.21
-------------------------
Increased buffer size. Morphological info for some (Polish) words can be
enormous.
Version 3.19 - 2015.09.21
-------------------------
Added vector for Russian based on turglem data (1.6 million pairs).
Added vectors for Polish, Icelandic and Latin. Results for Latin OOV words are
remarkably bad, on paper.
Version 3.18 - 2015.09.07
-------------------------
Added several some more vectors to try. Serbian, Macedonian, Farsi.
Version 3.17 - 2015.09.08
-------------------------
Added vectors for Portuguese. Set minimum number of lines to start
determination of "best" parameters with at 50000 (up from 20000).
Version 3.16 - 2015.09.07
-------------------------
Added several some more vectors to try.
Version 3.15 - 2015.09.04
-------------------------
Added some more vectors to try. Greek and Romanian.
Version 3.14 - 2015.09.03
-------------------------
Added some more vectors to try. Added metadata to those vectors that
can later be used for visualisation and evaluation.
Version 3.13 - 2015.09.02
-------------------------
Set delta to distance between the most outlying parameter vector and its
closest neighbour. Added two parameter lines for Dutch and Hungarian
both with -XS. (Which gave remarkably small counts of rules.)
Version 3.12 - 2015.09.01
-------------------------
Option -XS: penalty increases with the number of characters (including
wildcards) in the pattern.
Version 3.11 - 2015.09.01
-------------------------
Fixed bug having to do with reading training data in DOS format (CR LF line
endings) under Linux.
Version 3.10 - 2015.08.31
-------------------------
Show 10 decimal digits of training parameters (-D option). (Previously 6)
Ignore CR in input.
Replaced tab by four spaces in source code.
Use Unicode.txt 09-Feb-2015 20:08 and CaseFolding-8.0.0.txt (11-Feb-2015 12:34)
Version 3.09 - 2015.08.26
-------------------------
Changes to approximation algorithm for finding optimal penalty parameters:
1) Use ziggurat method to compute normal distribution for delta vector
coordinates.
2) Recompute minimal fraction by setting minumum number of examples to 10000
or 1% of available training examples, whichever is largest, arbitrarily
regarding less as not representative.
3) An array of good parameter sets that are returned during the first calls to
brown(), so the algorithm doesn't have to search a too big parameter space.
4) Setting default parameters to 0.0;-0.7;0.7;-0.1;0.1;0.0
Version 3.08 - 2015.08.21
-------------------------
Fixed correct marking of ambiguous examples.
Version 3.07 - 2015.08.21
-------------------------
Fixed option -b, which was broken due to previous changes.
Version 3.06 - 2015.08.20
-------------------------
Removed "whereof used for training <number>" in report + other improvements.
Reinserted functionality that marks control answer as ambiguous (one out of
multiple correct answers).
Version 3.05 - 2015.08.19
-------------------------
Undid last changes. Solved problem bu cutting path part off before
constructing file names.
Version 3.04 - 2015.08.19
-------------------------
Generated parameter file is named with ".parms" suffix. Before, the name
had "parms." prefix. Same for flexrules: ".flexrules" suffix instead of
"flexrules." prefix.
Version 3.03 - 2015.08.17
-------------------------
When testing with clumped test data (i.e. homographs and/or unrelated words
with same looking lemma combined in clusters, using empty lines between
clusters) lines that have same full form and same lemma (but e.g.,
different word classes) are reduced to a single line. (Only full form and
lemma are taken into account when testting.) This reduction already happened
in previous versions for unclustered test data. This change can have some
influence on the evaluation, because duplicates often are 'easy' cases,
so the reported precision is expected to be somewhat lower than previously.
Version 3.00 - 2015.08.13
-------------------------
Improved user interface, fewer files in temporary directory. Defaults
for some options changed. You need only provide a single option,
-i <full-form/lemma file>, whereafter affixtrain computes optimal
parameters for reducing the size of a trained tree (locally) maximally
for the provided training data. Then affixtrain validates the results
both with OOV and vocabulary words. Finally affixtrain produces
flexrule files using all training data.
In a separate run, using option -b, a flexrule file can be pretty printed
and converted to a Bracmat file (another text format). Optionally a list
of words can be lemmatised.
There is still some dead wood removal and code rationalizing to do.
Version 2.28 - 2015.08.06
-------------------------
"Zero-length" lemmas are excluded from proposed list of possible lemmas.
Version 2.26 - 2015.07.16
-------------------------
Started changelog.
Current state of affixtrain:
1. Testing with training data gives zero incorrectly lemmatized words. This is
as at always should have been, but it wasn't 100% the case until now.
2. Unlimited ambiguity is allowed. The first versions of affixtrain could
handle at most two lemmas per word form. After introducing a new flexrule
file format, this limitation is gone.
3. When running affixtrain, a parameter file is created that documents what
was done. Any test results are added at the end. These results can help
decide which pruning threshold (also called cutoff) is optimal for
lemmatization of OOV words. (You probably want to use the optional
dictionary if you choose to use pruned flexrules.)