start/end position slippage #157

Open
sav-che opened this issue Jan 16, 2025 · 4 comments


sav-che commented Jan 16, 2025

I may be misunderstanding the output of gnfinder online, but the positions in the Start/End columns sometimes seem to slip. In the example below, Urtica has start-end coordinates 322-328 but should be 322-327; Ceratodon has 1305-1314 instead of 1307-1315, etc.

(Also, there is a � in the output, but that may be a problem on my side.)

Options selected:

  • Freeform text
  • Output format: TSV
  • All occurrences
  • Show Ambiguous Uninomials
  • Verification data sources: GBIF, Index Fungorum, WFO

Input:

Substrate.Text
Ruderaal terrein.
Op Sambucus.
Naaldhouttak
Ranunculus repens.
Berkenbos op zand.

Op Larixstomp in Larixbos.
Op Larixstomp in Larixbos.


Ranunculus repens.


Op kale turf.

Op houtstrooisel (naaldhout).





in essenhakhoutbosje op grond.
On Crataegus.
Op hout.
Gemengd bos, dode Urtica stengels.





Oude begraafplaats, bodem.

Juniperus-struweel, Noord-mos (Dicranum scoparium, Deschampsia flexuosa).
Boomstronk, eik.




Oud, hoog opgaand, donker Douglassparrenbos, op rottende stronk.
On rather dry, humose sand along road through deciduous forest.


zandgebied ingeplant met loofboompjes.

Jeneverbesstruweel.
In der Laubstreu.

Open plek in een Juniperus-struweel, stuifzand begroeid met Dicranum scoparium.
Gemengd bos, op eikenhout.



Nat bosje, onder wilg in gras en blad samen met Hypholoma myosotis.
Eiken-beukenbos.
Op dode tak op de grond.

Open grasvegetatie op ± lemig zand.

Tussen afgevallen Fagusblad.




Hulstrijk eikenbos.


In pomis putridis et floribus emarcidis.
Loam pits.
Op bladstrooisel.


Regelmatig betreden, kortgrazige, mosrijke, schrale vegetatie op droge zand -op-leembodem.

In schrale, droge, zandige vlakke, soms bereden wegberm op zwak kalkhoudend zand (pH = ± 6.7), tussen Ceratodon en Polytrichum piliferum.
On Prunus.
Pinus sylvestris/nigra plantation.

Output:

Index	Verbatim	Name	Start	End
0	Sambucus.	Sambucus	38	47
1	Ranunculus repens.	Ranunculus repens	63	81
2	Ranunculus repens.	Ranunculus repens	165	183
3	Crataegus.	Crataegus	282	292
4	Urtica	Urtica	322	328
5	(Dicranum scoparium,	Dicranum scoparium	410	430
6	Deschampsia flexuosa).	Deschampsia flexuosa	431	453
7	Dicranum scoparium.	Dicranum scoparium	760	779
8	Nat	Nat	815	818
9	Hypholoma myosotis.	Hypholoma myosotis	863	882
10	Ceratodon	Ceratodon	1305	1314
11	Polytrichum piliferum.	Polytrichum piliferum	1318	1340
12	Prunus.	Prunus	1345	1352
13	Pinus sylvestris/nigra	Pinus sylvestris�nigra	1354	1376

dimus (Member) commented Jan 16, 2025

There are two conventions for taking a slice of an array: the exclusive one, where 'cat' is [0:3], and the inclusive one, where 'cat' is [0:2].
The exclusive convention is the more common one, which is why the End offset appears to be one character too large.
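As a quick check of the exclusive convention against this issue's output, the sketch below takes three rows from the TSV above and confirms that End minus Start equals the length of the verbatim match. Python is used only for illustration (gnfinder itself is not called), and `to_inclusive_end` is a hypothetical helper name.

```python
# Rows copied from the TSV output in this issue: (verbatim, start, end).
rows = [
    ("Sambucus.", 38, 47),
    ("Urtica", 322, 328),
    ("Ceratodon", 1305, 1314),
]


def to_inclusive_end(exclusive_end: int) -> int:
    """Convert an exclusive end offset to the inclusive convention."""
    return exclusive_end - 1


for verbatim, start, end in rows:
    # With an exclusive end, end - start is exactly the length of the match.
    assert end - start == len(verbatim)

# The 'cat' example: exclusive slice [0:3], inclusive positions 0..2.
word = "cat"
exclusive_slice = word[0:3]
inclusive_last_index = to_inclusive_end(3)
```

Under this reading, the reported 322-328 for Urtica and the expected 322-327 describe the same six characters.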

--- the rest is not directly relevant to your issue, but describes some issues I encountered when figuring out offsets

Do you know the encoding of the input? GNfinder converts text to UTF-8 and reports offsets against the converted text rather than the original input. That would be my first guess. The � appears when the conversion of a character to UTF-8 fails, or when the finder encounters a character that should never occur inside a name (in this case sylvestris/nigra was treated as one word). On top of the different encodings, there are also two ways to represent diacritics in UTF-8 (a single character, or a character combination). These are also normalized to the 'single character' UTF-8 encoding.
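To illustrate how the two diacritic representations can shift offsets, here is a minimal sketch using Python's standard `unicodedata` module. It assumes that the 'single character' normalization corresponds to Unicode NFC, which is not confirmed in this thread; the sample strings are made up for the example.

```python
import unicodedata

# The same word with 'é' written two ways (sample strings are hypothetical).
composed = "caf\u00e9 Urtica"      # 'é' as the single code point U+00E9
decomposed = "cafe\u0301 Urtica"   # 'e' followed by combining acute U+0301

# The decomposed form is one code point longer, so later offsets shift by one.
offset_composed = composed.index("Urtica")
offset_decomposed = decomposed.index("Urtica")

# Assumption: normalizing to NFC mimics the 'single character' form.
normalized = unicodedata.normalize("NFC", decomposed)
offset_normalized = normalized.index("Urtica")

# A byte sequence that is not valid UTF-8 decodes to U+FFFD (shown as �).
# This covers only the failed-conversion case, not the in-name character case.
recovered = b"sylvestris\xffnigra".decode("utf-8", errors="replace")
```

If the input file uses the combined form, offsets computed against the original file would drift by one position per such character relative to offsets computed against the normalized text.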

There is an option in the gnfinder command-line app to return the normalized text that was used for name finding. I think I need to add this option to the web UI as well.

There is another option, where the offset is calculated in bytes instead of UTF-8 runes. It should probably be added to the web UI as well.

sav-che (Author) commented Jan 17, 2025

Thanks for the fast answer!

It is good to be aware of the inclusive and exclusive methods; that explains some cases. Are they mixed, taxon by taxon, or is one method applied to the whole query text?

Hm, diacritic conversion can be an issue... The problem may also lie in how text editors calculate positions (I check in Notepad++). My file is encoded as UTF-8, but the slippage remains and becomes really noticeable on larger queries. For example, in the attached example, slicing by the start/end coordinates turns the last capture from Corylus. into nosa and Co.

The reason I'm concerned with this at all is that I tried to convert start positions to line numbers and got slightly off results. Now I see that a likely source of my issues is that I counted in bytes using dd:

cat "$START_POSITIONS" | parallel -j 8 --keep-order --tag '(dd if='$FILE_TO_PARSE' bs=1 count={1} 2>/dev/null; printf "\n") | wc -l' > "$OUTPUT"

example output.tsv.txt
example query.txt
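The conversion from start positions to line numbers described above can be sketched in Python by counting newlines over characters rather than bytes. This is an illustration only: both function names are hypothetical, the sample text is made up, and it assumes the offsets refer to the same (UTF-8, normalized) text that is being sliced.

```python
def line_of_char_offset(text: str, char_offset: int) -> int:
    """Return the 1-based line number of the character at char_offset."""
    # Count newlines among the characters that precede the offset.
    return text.count("\n", 0, char_offset) + 1


def line_of_byte_offset(data: bytes, byte_offset: int) -> int:
    """Same count over raw bytes, mirroring the dd-based pipeline."""
    return data.count(b"\n", 0, byte_offset) + 1


# Hypothetical sample: ten two-byte characters on line 1, a name on line 2.
text = "\u00b1" * 10 + "\nOn Prunus.\n"
start = text.index("Prunus")  # character (rune) offset of the name

by_chars = line_of_char_offset(text, start)
# Treating the same number as a byte count stops before the first newline.
by_bytes = line_of_byte_offset(text.encode("utf-8"), start)
```

Here the character-based count places Prunus on line 2, while reading the same number as a byte count stays on line 1, which is the kind of drift that grows with every multi-byte character in the file.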

dimus (Member) commented Jan 17, 2025

If the slippage increases, it probably means that the offset in your editor is based on bytes, not on UTF-8 runes. I will add the byte offset to the GUI (probably next week); it would help in this case. If you are able to use gnfinder on the command line, it already has a byte-offset option:

# to get UTF-8 input and byte offset
echo "Pardosa moesta is a spider" |gnfinder -b -i -f compact |jq
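When only rune-based offsets are available, they can be converted to byte offsets after the fact. The sketch below is one way to do that in Python, using a line from the sample input; `char_to_byte_offset` is a hypothetical helper, not part of gnfinder. In the sample input two ± characters appear before Ceratodon, and each takes two bytes in UTF-8, which would be consistent with the reported difference of two (1305 vs 1307) if the editor counts bytes; that reading is an inference, not something confirmed in this thread.

```python
def char_to_byte_offset(text: str, char_offset: int) -> int:
    """Number of UTF-8 bytes that precede the character at char_offset."""
    return len(text[:char_offset].encode("utf-8"))


# A line from the sample input; it contains one two-byte character (±).
line = ("zwak kalkhoudend zand (pH = \u00b1 6.7), "
        "tussen Ceratodon en Polytrichum piliferum.")

start = line.index("Ceratodon")   # rune offset within this line
end = start + len("Ceratodon")    # exclusive end

byte_start = char_to_byte_offset(line, start)
byte_end = char_to_byte_offset(line, end)

# Slicing the encoded bytes with byte offsets recovers the same name.
name_from_bytes = line.encode("utf-8")[byte_start:byte_end].decode("utf-8")
```

The byte offset here is one larger than the rune offset because a single ± precedes the name on this line.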

dimus (Member) commented Jan 17, 2025

gnfinder always uses the exclusive method for subslices.
