This is just for my own use for now, as it is barely developed. I use C's memchr to search a byte block for a specific byte, in this case a delimiter. This is similar to findfirst, but without the safety checks, for speed. We then make a best-effort attempt to extract the integer using C's atoi. Unlike Julia's Base parse, this can deal with messy "strings" (actually bytes) like "300++" or "+300messy". The advantage? It's faster. The disadvantage is that there is no error checking at all, so you can only use this if you know exactly what your input looks like.
This is around 2x faster when parsing numbers from a file and 5x faster when parsing from an existing UInt8 vector (see benchmark scripts).
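To make the memchr plus atoi idea concrete, here is a minimal sketch of such a loop. It is not NumberScan's actual implementation: the helper name scan_ints is hypothetical, the input is copied so that a NUL terminator can be appended (atoi needs one), and it assumes libc's memchr and atoi are reachable through ccall on your platform.

```julia
# Hypothetical helper, not part of NumberScan: collect the integer that starts
# each `delim`-separated token of `bytes`.
function scan_ints(bytes::Vector{UInt8}, delim::UInt8)
    numbers = Int[]
    # atoi expects a NUL-terminated buffer, so work on a copy with a trailing 0 byte.
    buf = vcat(bytes, 0x00)
    n = length(bytes)
    GC.@preserve buf begin
        base = pointer(buf)
        offset = 0
        while offset < n
            start = base + offset
            # memchr returns a pointer to the first `delim` byte, or C_NULL if there is none.
            hit = ccall(:memchr, Ptr{UInt8}, (Ptr{UInt8}, Cint, Csize_t),
                        start, delim, n - offset)
            # atoi skips leading whitespace, reads an optional sign and digits, then stops.
            push!(numbers, Int(ccall(:atoi, Cint, (Ptr{UInt8},), start)))
            hit == C_NULL && break
            # Continue right after the delimiter that was just found.
            offset = Int(hit - base) + 1
        end
    end
    return numbers
end

scan_ints(Vector{UInt8}("100+ 200- 400+ 800-"), UInt8(' '))  # [100, 200, 400, 800]
```

Neither call checks that a token really holds a number, which is where both the speed and the risk come from.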
add https://github.com/rickbeeloo/NumberScan
Note
- atoi will return 0 if it cannot parse the number. So if you are interested in 0 itself, this is useless.
- No safety checks are performed; if you use this, make sure to test it.
- If you want to parse an integer from a messy string you can also just call NumberScan._atoi.
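The first note is easy to see by calling C's atoi directly. The wrapper c_atoi below is a hypothetical stand-in written for this illustration; the signature of NumberScan._atoi may differ.

```julia
# Hypothetical wrapper around C's atoi (not NumberScan._atoi itself).
c_atoi(s::AbstractString) = Int(ccall(:atoi, Cint, (Cstring,), s))

c_atoi("300++")         # 300, parsing stops at the first non-digit
c_atoi("+300messy")     # 300, a leading sign is accepted
c_atoi("messy")         # 0, nothing could be parsed
c_atoi("0")             # 0, indistinguishable from the failure above
tryparse(Int, "300++")  # nothing, Base refuses the messy input
```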
We will use this to quickly parse numbers from the GFA format. So for example:
x = Vector{UInt8}("100+ 200- 400+ 800-")
for number in numberScan(x, ' ')
    println(number)
end
Since files are also just byte blocks, we can Mmap files and scan them for numbers:
targets = Set([100, 200, 300, 1000])
found = 0
for number in mmapScan("data/test.txt", ' ')
    if number in targets
        # `global` is needed when this loop runs at the top level of a script
        global found += 1
    end
end
println(found)
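For readers unfamiliar with memory mapping, the sketch below shows the same counting task using only Julia's Mmap standard library. It is an illustration under assumptions, not how mmapScan is implemented: count_targets is a hypothetical name, and a plain Julia digit loop stands in for atoi, so tokens with a leading sign are not parsed here.

```julia
using Mmap

# Hypothetical helper: count the delimiter-separated tokens in a file whose
# leading digits form a number contained in `targets`.
function count_targets(path::AbstractString, targets::Set{Int}, delim::UInt8)
    bytes = Mmap.mmap(path)  # Vector{UInt8} backed by the file contents
    found = 0
    value = 0
    in_digits = true    # still reading the leading digits of the current token
    has_digits = false  # the current token started with at least one digit
    for b in bytes
        if b == delim
            found += has_digits && value in targets
            value = 0
            in_digits = true
            has_digits = false
        elseif in_digits && UInt8('0') <= b <= UInt8('9')
            value = 10 * value + Int(b - UInt8('0'))
            has_digits = true
        else
            in_digits = false  # trailing junk such as '+' or '-' is ignored
        end
    end
    # The last token has no trailing delimiter.
    found += has_digits && value in targets
    return found
end

# Usage with a temporary file, so the example does not depend on data/test.txt.
path, io = mktemp()
write(io, "100+ 200- 300+ 555- 1000+")
close(io)
count_targets(path, Set([100, 200, 300, 1000]), UInt8(' '))  # 4
```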
- Make this more general, e.g. keep iterating until a number is found, so that you don't have to check for number != 0. Perhaps also make this work without a delimiter.
- Maybe add some basic safety checks.
- Could use unsafe type conversion of the pointers to avoid any error-handling instructions in the native code.
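One possible direction for the first two items, sketched under assumptions: a bounds-checked parser that returns nothing when a token has no digits, so a real 0 can be told apart from a failed parse. The name leading_int is hypothetical, this is not part of NumberScan, it will be slower than atoi, and it does not guard against integer overflow.

```julia
# Hypothetical checked alternative to atoi: returns `nothing` when no digits are found.
function leading_int(bytes::AbstractVector{UInt8}, start::Integer = 1)
    i = Int(start)
    n = length(bytes)
    sgn = 1
    # Accept a single optional leading sign.
    if i <= n && (bytes[i] == UInt8('+') || bytes[i] == UInt8('-'))
        sgn = bytes[i] == UInt8('-') ? -1 : 1
        i += 1
    end
    value = 0
    count = 0
    # Accumulate digits until the first non-digit byte or the end of the block.
    while i <= n && UInt8('0') <= bytes[i] <= UInt8('9')
        value = 10 * value + Int(bytes[i] - UInt8('0'))
        count += 1
        i += 1
    end
    return count == 0 ? nothing : sgn * value
end

leading_int(Vector{UInt8}("+300messy"))  # 300
leading_int(Vector{UInt8}("0"))          # 0
leading_int(Vector{UInt8}("messy"))      # nothing
```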