start/end position slippage #157
Comments
There is an --- the rest is not directly relevant to your issue, but describes some issues I encountered when figuring out offsets

Do you know what the encoding of the input is? GNfinder converts texts to UTF-8 and provides offsets according to that converted text instead of the input. That would be my first guess. The � appears when conversion of a character to UTF-8 failed, or when the finder encounters a character that should never be inside a name (in this case sylverstris/nigra was considered as one word).

On top of different encodings, there are also two different ways to generate diacritics in UTF-8 (a single character and a character combination). These are also normalized to the 'single character' UTF-8 encoding.

There is an option in There is another option, where the offset is calculated by bytes instead of UTF-8 runes. Probably it should also be added to the web UI.
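To illustrate the diacritics point, here is a minimal Python sketch. It assumes that the "single character" normalization described above behaves like Unicode NFC; that is a guess on my part, not something confirmed for gnfinder. The genus name is only sample data.

```python
import unicodedata

# The same name written two ways:
# precomposed "ë" (U+00EB) vs. "e" followed by a combining diaeresis (U+0308).
composed = "Iso\u00ebtes"
decomposed = "Isoe\u0308tes"

def describe(text):
    """Return (code point count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))

print(describe(composed))    # (7, 8)
print(describe(decomposed))  # (8, 9)

# Assumption: NFC stands in for the normalization gnfinder applies.
normalized = unicodedata.normalize("NFC", decomposed)
print(normalized == composed)  # True
```

If the input uses the combined form, every such character shortens the normalized text by one code point, so offsets reported against the normalized text drift from offsets in the original file.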
Thanks for the fast answer! It is good to be aware of Hm, diacritic conversion can be an issue... The problem could also be in how text editors calculate position (I checked in Notepad++). The encoding of my file is UTF-8, but the slippage remains and gets really noticeable on larger queries. For example, for the attached example, when using start/end coordinates, the last capture instead of The reason I'm concerned with this at all is that I tried to convert start positions to line numbers and got slightly off results. Now I see that a source of my issues is probably that I calculated by bytes using dd:
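For converting start positions to line numbers, a sketch like the following may be less error-prone than counting bytes. It assumes the reported offsets count code points of the UTF-8 text and that normalization did not change the text; `line_starts` and `offset_to_line` are hypothetical helper names, and the sample text is invented.

```python
import bisect

def line_starts(text):
    """Code point offsets at which each line of the text begins."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts

def offset_to_line(starts, offset):
    """1-based line number that contains the given code point offset."""
    return bisect.bisect_right(starts, offset)

# When reading a real file, decode the raw bytes (open(path, "rb").read().decode("utf-8"))
# so that "\r\n" line endings are not silently translated and offsets stay aligned.
text = "Über Urtica\nPardosa moesta\n"
starts = line_starts(text)

print(offset_to_line(starts, text.index("Urtica")))   # 1
print(offset_to_line(starts, text.index("Pardosa")))  # 2
```

Because Python strings are indexed by code point, the "Ü" on the first line does not shift the positions on later lines, which is what happens when the same arithmetic is done on bytes.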
If the slippage increases, it probably means that the offset in your editor is based on bytes, not on UTF-8 runes. I will add byte offset to the GUI (probably next week); it would help in this case. If you are able to use gnfinder on the command line, it already has a byte offset option:
# to get UTF-8 input and byte offset
echo "Pardosa moesta is a spider" | gnfinder -b -i -f compact | jq
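If only one kind of offset is available, the two can be converted into each other as long as the exact text is at hand. This is a rough Python sketch, not part of gnfinder; `rune_to_byte_offset` and `byte_to_rune_offset` are hypothetical names and the sentence is sample data.

```python
def rune_to_byte_offset(text, rune_offset):
    """Byte offset in the UTF-8 encoding that matches a code point offset."""
    return len(text[:rune_offset].encode("utf-8"))

def byte_to_rune_offset(data, byte_offset):
    """Code point offset that matches a byte offset in UTF-8 encoded data.

    Raises UnicodeDecodeError if byte_offset falls inside a multi-byte character.
    """
    return len(data[:byte_offset].decode("utf-8"))

text = "Isoëtes lacustris and Pardosa moesta"
data = text.encode("utf-8")

start = text.index("Pardosa")
print(start)                             # 22
print(rune_to_byte_offset(text, start))  # 23
print(byte_to_rune_offset(data, 23))     # 22
```

Each multi-byte character before a name adds to the gap between the two offsets, which matches the slippage growing toward the end of a long text.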
gnfinder always uses external method for subslices
I may be misunderstanding the output of gnfinder online, but it seems that positions in the Start/End columns sometimes slip. In the example below, Urtica has start-end coordinates 322-328, but should be 322-327; Ceratodon has 1305-1314 instead of 1307-1315, etc. (Also, � appears in the output, but that can be a problem on my side.)
Options selected:
Input:
Output: