RFC for "gnfinder plus" project
gnfinder finds names in plain UTF-8 encoded texts. However, users often need to find names in PDF, MS Word, and MS Excel files, HTML pages, and images. Sometimes the document comes from local storage, and sometimes it is fetched by URL. GlobalNames has the GNRD service, which can process such documents; however, it would be handy to have a command-line application or gRPC service with the same functionality.
This document is a place to discuss ideas about creating gnfinder+, a command-line and gRPC tool that can normalize local or remote documents or images into UTF-8 text, send it to gnfinder, and return the results. Such a program could then be used in GNRD or on its own.
It makes sense to keep gnfinder simple and develop gnfinder+ as a separate project. Such an approach would help avoid feature bloat and keep the name-finding and text-normalizing functionality independent of each other.
It would be easier to use Go for such a project, because gnfinder could then be used directly as a library, which would simplify the implementation significantly. Go also has low overhead and produces small binaries with a small memory footprint, and its concurrency model is robust and straightforward. Go is well suited for command-line applications.
It would be reasonable to reuse tests from GNRD and modify them as needed for gnfinder+.
Having functionality similar to the file command on Linux and Mac would be beneficial.
The filetype project is a pure Go implementation of this functionality.
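As a rough illustration of content-based detection, the sketch below uses `http.DetectContentType` from the Go standard library instead of the filetype project. The function name `detectKind` and the kind labels it returns are assumptions made for this example and are not part of gnfinder or any existing tool.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// detectKind classifies a document by its leading bytes.
// The returned labels ("pdf", "html", ...) are hypothetical names for this sketch.
func detectKind(head []byte) string {
	// DetectContentType considers at most the first 512 bytes and always
	// returns a MIME type, falling back to application/octet-stream.
	mime := http.DetectContentType(head)
	switch {
	case mime == "application/pdf":
		return "pdf"
	case strings.HasPrefix(mime, "text/html"):
		return "html"
	case strings.HasPrefix(mime, "image/"):
		return "image"
	case mime == "application/zip":
		// DOCX and XLSX files are ZIP containers, so they need a closer look.
		return "zip"
	case strings.HasPrefix(mime, "text/"):
		return "text"
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(detectKind([]byte("%PDF-1.7\n")))                              // pdf
	fmt.Println(detectKind([]byte("<html><body>Homo sapiens</body></html>"))) // html
	fmt.Println(detectKind([]byte("Pardosa moesta is a spider.")))            // text
}
```

This approach cannot tell a DOCX file from an XLSX file or from any other ZIP archive, which is one reason a dedicated library may still be preferable.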
The unipdf project (AGPL license) can be used to convert PDF to plain text.
There is also docconv, a wrapper over the pdftotext app.
The ability to recognize multi-column PDF layouts would help greatly in extracting more names from the scientific literature.
The unioffice project (AGPL license) seems to be a powerful and actively developed library that could be used in gnfinder+ to handle MS Word and MS Excel documents.
The gosseract project provides Go bindings to the C++ Tesseract library, so it can use the full power of Tesseract for recognizing text in images. It seems that the project would still allow gnfinder+ to be compiled into a standalone binary. However, portability would decrease somewhat, because gnfinder+ would no longer be a pure Go project.
Stripping tags from HTML can be done with external projects like html-strip-tags-go or with home-grown code. A similar task is solved in gnparser. It can be argued that UTF-8 encoded HTML texts should be supported by the "core" gnfinder, because <i> tags might be an additional clue to the location of scientific names in texts.
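To give a sense of what the home-grown option might look like, here is a minimal sketch that uses only the Go standard library. The function name `stripTags` is an assumption for this example. It is not the approach used by html-strip-tags-go or gnparser, and it discards the <i> clue mentioned above.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripTags removes everything between '<' and '>' and decodes HTML entities.
// It is deliberately naive: it does not skip <script> or <style> bodies and
// does not handle '>' inside attribute values.
func stripTags(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
			// Replace the tag with a space so adjacent words do not fuse.
			b.WriteByte(' ')
		case !inTag:
			b.WriteRune(r)
		}
	}
	// Decode entities such as &amp; and collapse all whitespace runs,
	// including newlines, into single spaces.
	return strings.Join(strings.Fields(html.UnescapeString(b.String())), " ")
}

func main() {
	fmt.Println(stripTags("<p>The spider <i>Pardosa moesta</i> is common.</p>"))
	// prints "The spider Pardosa moesta is common."
}
```

Replacing each tag with a space is a trade-off: it keeps words in neighboring elements apart, but it would split a word that is only partly italicized.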
gnfinder can be used for name-finding in normalized UTF-8 texts. Examples of using gnfinder can be found in the bhlindex and htindex projects.
The Go standard library can provide this functionality. Here is an example of an approach.
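Assuming the functionality in question is fetching a remote document by URL, a standard-library-only sketch might look like the following. The function name `fetchDocument`, the 30-second timeout, and the size limit are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"time"
)

// fetchDocument downloads at most maxBytes from an http(s) URL.
// Content beyond maxBytes is silently dropped by the LimitReader.
func fetchDocument(rawURL string, maxBytes int64) ([]byte, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return nil, err
	}
	// Reject other schemes before touching the network.
	if u.Scheme != "http" && u.Scheme != "https" {
		return nil, fmt.Errorf("unsupported URL scheme %q", u.Scheme)
	}
	// The timeout value is an arbitrary choice for this sketch.
	client := &http.Client{Timeout: 30 * time.Second}
	resp, err := client.Get(u.String())
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	// Cap the read so a huge document cannot exhaust memory.
	return io.ReadAll(io.LimitReader(resp.Body, maxBytes))
}

func main() {
	// A non-HTTP scheme is rejected without any network access.
	_, err := fetchDocument("ftp://example.org/paper.pdf", 1<<20)
	fmt.Println(err) // unsupported URL scheme "ftp"
}
```

The returned bytes could then be handed to a file-type detection step before any text extraction takes place.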
gnfinder+ would have flags to distinguish input types. It would also be great to detect file types automatically; in the worst case, detection can rely on file extensions. The output would closely correspond to the output of gnfinder.
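A minimal sketch of how such a command line could be wired up with the standard `flag` package is shown below. The `-format` flag, the program name passed to `flag.NewFlagSet`, the helper names, and the kind labels are all hypothetical; none of them are existing gnfinder options.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"os"
	"path/filepath"
	"strings"
)

// kindFromExt is the worst-case fallback: guess the format from the extension.
func kindFromExt(name string) string {
	switch strings.ToLower(filepath.Ext(name)) {
	case ".pdf":
		return "pdf"
	case ".doc", ".docx":
		return "msword"
	case ".xls", ".xlsx":
		return "msexcel"
	case ".htm", ".html":
		return "html"
	case ".png", ".jpg", ".jpeg", ".tif", ".tiff":
		return "image"
	case ".txt":
		return "text"
	default:
		return "unknown"
	}
}

// isURL reports whether the input should be fetched rather than opened locally.
func isURL(input string) bool {
	return strings.HasPrefix(input, "http://") || strings.HasPrefix(input, "https://")
}

// run parses the arguments and reports how each input would be treated.
func run(args []string, out io.Writer) error {
	fs := flag.NewFlagSet("gnfinder-plus", flag.ContinueOnError)
	// Hypothetical flag: an explicit format overrides extension-based guessing.
	format := fs.String("format", "", "input format; empty means guess from extension")
	if err := fs.Parse(args); err != nil {
		return err
	}
	for _, input := range fs.Args() {
		kind := *format
		if kind == "" {
			kind = kindFromExt(input)
		}
		fmt.Fprintf(out, "%s\tremote=%t\tkind=%s\n", input, isURL(input), kind)
	}
	return nil
}

func main() {
	if err := run(os.Args[1:], os.Stdout); err != nil {
		os.Exit(2)
	}
}
```

Extension-based guessing is fragile for URLs, since a query string after the file name hides the extension, so content-based detection would be the more reliable first step.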
The gRPC service would mirror the functionality of the command-line application. It would take as parameters either binary files or URLs that point to a document in a supported format.
gnfinder+ would be incorporated into GNRD, replacing the Ruby code for text normalization.
Harsh Zalavadiya (@harshzalavadiya) started a proof of concept project gnfinder-plus. This project might evolve into a production-ready application.