Better `fancify.sh` #3

fabi1cazenave · 2024-12-09T18:54:32Z

Given that:

- most keyboard layouts have no support for fancy letters or punctuation marks such as æ, ’, “”, …, etc.
- many corpus texts don’t use these fancy characters either
- the kalamine analyzer can default to ASCII when these characters are not supported by a keyboard layout: ae instead of æ, ' instead of ’, ... instead of …, "" instead of “”, etc.

our corpus should be “fancified” before being transformed into a JSON dictionary, so as not to penalize keyboard layouts that properly support these special characters. That’s what the `fancify.sh` script (or the `make fancy` target) does. But this is still a work in progress: several substitutions are missing, e.g.:

- straight quote pairs into “”, « », „“ depending on the language
- narrow (“fine”) no-break space before ? : ; ! in French
- ¿ sign in Spanish
- dashes rather than --
- etc.
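
As a rough illustration of the language-independent part of such substitutions, here is a minimal Python sketch. It is not what `fancify.sh` does today; the function name `fancify_generic` is made up, and choosing an em dash (rather than an en dash) for `--` is an assumption.

```python
import re

ELLIPSIS = "\u2026"    # single-character ellipsis
APOSTROPHE = "\u2019"  # typographic apostrophe
EM_DASH = "\u2014"     # assumption: "--" stands for an em dash


def fancify_generic(text):
    """Apply a few language-independent typographic substitutions."""
    # Three ASCII dots become one ellipsis character.
    text = text.replace("...", ELLIPSIS)
    # A straight apostrophe between two letters becomes a typographic one.
    # [^\W\d_] matches any Unicode letter (word char that is not a digit or "_").
    text = re.sub(r"(?<=[^\W\d_])'(?=[^\W\d_])", APOSTROPHE, text)
    # A double hyphen becomes a dash.
    text = text.replace("--", EM_DASH)
    return text
```

Quote pairs and language-specific rules are deliberately left out of this sketch, since they depend on the language of the corpus.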


Ced-C · 2024-12-19T12:31:22Z

One might want to also add accent marks on capital letters (e.g. Etre→Être)
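
To make the idea concrete, here is a small sketch of how accents could be restored from a lookup table. The table contents and the names `ACCENTED_CAPITALS` and `restore_capital_accents` are hypothetical; a real word list would have to be built per language.

```python
import re

# Hypothetical lookup table: unaccented capitalized spelling -> accented form.
ACCENTED_CAPITALS = {
    "Etre": "Être",
    "Etat": "État",
    "Ecole": "École",
}


def restore_capital_accents(text, table=ACCENTED_CAPITALS):
    """Replace whole words found in `table` with their accented form."""
    if not table:
        return text
    # \b...\b restricts matches to whole words, so "Etres" is left alone.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, table)) + r")\b")
    return pattern.sub(lambda m: table[m.group(1)], text)
```

A table like this only covers words that are unambiguous; a capital such as "A" (which may or may not stand for "À") would need context.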

Also, we might eventually want to support several languages, so it should be an entry parameter.
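
A possible shape for that, sketched in Python: a table of rule functions keyed by language code. All names here (`LANGUAGE_RULES`, `french_spacing`, `fancify_for_language`) are made up for illustration, and the only rule shown is the French spacing one from the issue description.

```python
import re

NNBSP = "\u202f"  # narrow no-break space


def french_spacing(text):
    """Turn an ordinary or no-break space before ? ; : ! into a narrow no-break space."""
    # Only existing spaces are converted. Inserting a missing space is left out,
    # because it would need exclusions for URLs, times such as 12:30, etc.
    return re.sub("[ \u00a0]([?;:!])", NNBSP + r"\1", text)


# Hypothetical rule table: language code -> list of functions applied in order.
LANGUAGE_RULES = {
    "fr": [french_spacing],
    "en": [],
}


def fancify_for_language(text, lang):
    """Apply the language-specific rules registered for `lang`, if any."""
    for rule in LANGUAGE_RULES.get(lang, []):
        text = rule(text)
    return text
```

Generic substitutions would run for every language, and this table would only hold the language-specific ones.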

Given that the above use case and some functionality such as quote pair matching might be a little complex to implement, I would tend to switch from a bash script to Python. WDYT?

Ced-C · 2024-12-19T13:36:32Z

I was comparing the Alice ebook I have:

Publication: 1865
Catégorie(s): Fiction, Fantasy, Jeunesse
Source: http://www.ebooksgratuits.com

to the edition in this repo…
The edition I have, albeit different, already has all the proper typography (œ,  ?, « », etc.).
This makes me wonder… does a generic parsing process for corpora make a lot of sense?

I mean, sure, some standard mistakes could be addressed (like the ones pointed out at the beginning of this issue),
but it will never be perfect.

For instance, in Alice (Gutenberg edition), ellipses (…) are typeset as four hyphens (----). Is that standard, or does it only apply to this edition? Looking on the web, I have found no mention of this use of hyphens.

Wouldn’t it be better to choose editions that are already well typeset?

fabi1cazenave · 2024-12-22T03:04:41Z

> One might want to also add accent marks on capital letters (e.g. Etre→Être)

+1

> Also, we might eventually want to support several languages, so it should be an entry parameter.

True. Some substitutions are generic, others are language-specific.

> Given that the above use case and some functionality such as quote pair matching might be a little complex to implement, I would tend to switch from a bash script to Python. WDYT?

I’d bet sed would be enough. Moving to a full-Python solution can still be a later option.
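
For what it’s worth, quote pair matching can be approximated with a single regex substitution, which suggests a sed `s///g` expression of the same shape could work too. Here is a minimal Python sketch of the pattern. The names `QUOTE_STYLES` and `fancify_quotes` are made up, and the no-break spaces inside the French guillemets are an assumption.

```python
import re

# Hypothetical per-language quote styles: (opening, closing).
QUOTE_STYLES = {
    "en": ("\u201c", "\u201d"),              # “ ”
    "fr": ("\u00ab\u00a0", "\u00a0\u00bb"),  # « » with no-break spaces (assumption)
    "de": ("\u201e", "\u201c"),              # „ “
}


def fancify_quotes(text, lang="en"):
    """Replace each pair of straight double quotes found on a single line."""
    opening, closing = QUOTE_STYLES[lang]
    # [^"\n]* keeps a match inside one line, so a stray unpaired quote
    # cannot swallow the rest of the text.
    return re.sub(
        r'"([^"\n]*)"',
        lambda m: opening + m.group(1) + closing,
        text,
    )
```

This would not handle nested quotes or quotations spanning several lines; those cases would probably need something smarter.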

> I mean, sure, some standard mistakes could be addressed (like the ones pointed out at the beginning of this issue),
> but it will never be perfect.

I agree it won’t be perfect. But even non-perfect typographic improvements would be welcome for conversational corpora such as Leipzig.

> For instance, in Alice (Gutenberg edition), ellipses (…) are typeset as four hyphens (----). Is that standard, or does it only apply to this edition?

This is absolutely non-standard and should be fixed. I expect most books to require specific tuning/replacements.
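
Such per-book tuning could be as simple as a replacement table keyed by corpus name. Here is a sketch in Python, where the key `alice_gutenberg` and the names `BOOK_SPECIFIC` and `apply_book_fixes` are hypothetical:

```python
# Hypothetical per-book replacement tables: corpus name -> list of (old, new).
BOOK_SPECIFIC = {
    # The Gutenberg edition of Alice uses four hyphens for an ellipsis.
    "alice_gutenberg": [("----", "\u2026")],
}


def apply_book_fixes(text, book):
    """Apply the literal replacements registered for `book`, if any."""
    for old, new in BOOK_SPECIFIC.get(book, []):
        text = text.replace(old, new)
    return text
```

These fixes would have to run before any generic rule that turns `--` into a dash, otherwise `----` would already have been rewritten.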

> Wouldn’t it be better to choose editions that are already well typeset?

I agree it would be better but I’m not sure we can find a significant book collection with an open licence.

fabi1cazenave changed the title ~~Better fanciftcation~~ Better fancification Dec 9, 2024

fabi1cazenave changed the title ~~Better fancification~~ Better fancify.sh Dec 9, 2024
