RFC for "gnfinder plus" project

Introduction

gnfinder finds names in plain UTF-8 encoded texts. However, users often need to find names in PDF, MS Word, and MS Excel files, HTML pages, and images. Sometimes the document comes from local storage, and sometimes it is fetched by URL. GlobalNames has a GNRD service that can handle such documents; however, it would be handy to have a command-line application or gRPC service with the same functionality.

This document is a place to discuss ideas about creating gnfinder+, a command-line and gRPC tool that can normalize local or remote documents or images into UTF-8 texts, send them to gnfinder, and return the results. Such a program could then be used in GNRD or by itself.

It makes sense to keep gnfinder simple and develop gnfinder+ as a separate project. Such an approach would help to avoid feature bloat and keep the name-finding and text-normalizing functionalities independent of each other.

Language

It would be easier to use Go for such a project because gnfinder could then be used directly as a library, which would simplify the implementation significantly. Go also has low overhead, produces small binaries, and has a small memory footprint; its concurrency model is robust and straightforward. Go is well suited for command-line applications.

Tests

It would be reasonable to reuse tests from GNRD and modify them as needed for gnfinder+.

Libraries

File type detection

Functionality similar to the file command on Linux and Mac would be beneficial. The filetype project is a pure Go implementation of this functionality.
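As a minimal sketch of content-based detection, the following uses only the Go standard library (http.DetectContentType, which inspects at most the first 512 bytes) rather than the filetype project itself. The Kind type and the detectKind function are hypothetical names introduced for illustration. Newer MS Word and MS Excel files are ZIP containers, so they would need further inspection beyond what this sketch does.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Kind is a hypothetical coarse document category used only in this sketch.
type Kind string

const (
	KindPDF     Kind = "pdf"
	KindHTML    Kind = "html"
	KindImage   Kind = "image"
	KindZip     Kind = "zip" // .docx and .xlsx files are ZIP containers
	KindText    Kind = "text"
	KindUnknown Kind = "unknown"
)

// detectKind classifies a document by its leading bytes.
// http.DetectContentType considers at most the first 512 bytes.
func detectKind(head []byte) Kind {
	mime := http.DetectContentType(head)
	switch {
	case mime == "application/pdf":
		return KindPDF
	case strings.HasPrefix(mime, "text/html"):
		return KindHTML
	case strings.HasPrefix(mime, "image/"):
		return KindImage
	case mime == "application/zip":
		return KindZip
	case strings.HasPrefix(mime, "text/"):
		return KindText
	default:
		return KindUnknown
	}
}

func main() {
	samples := [][]byte{
		[]byte("%PDF-1.7\n"),
		[]byte("<html><body>Homo sapiens</body></html>"),
		[]byte("Homo sapiens is a species."),
	}
	for _, s := range samples {
		fmt.Println(detectKind(s))
	}
}
```

Detection by content is more reliable than detection by file extension, which is why it is a natural first step before choosing a converter.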

PDF processing

The unipdf project (AGPL license) can be used to convert PDF to plain text.

There is also docconv, a wrapper over the pdftotext app.

The ability to recognize multi-column PDF documents would help considerably in extracting more names from the scientific literature.

MS Word, MS Excel processing

The unioffice project (AGPL license) seems to be a powerful and actively developed library that can be used in gnfinder+ to deal with MS Word and MS Excel texts.

Images, scanned PDF documents

The gosseract project provides Go bindings to the C++ Tesseract library. The library can use all the power of Tesseract for recognizing texts in images. It seems that the project makes it possible to compile gnfinder+ into a standalone file. However, portability is somewhat decreased, because gnfinder+ would stop being a pure Go project.

HTML texts

Stripping tags from HTML can be done by external projects like html-strip-tags-go or by home-grown code. A similar task is solved in gnparser. It can be argued that UTF-8 encoded HTML texts should be supported by the "core" gnfinder, because <i> tags might be an additional clue for the location of scientific names in texts.
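A home-grown approach could be as small as the sketch below. It is a deliberately naive illustration, not the method used by html-strip-tags-go or gnparser: it drops everything between '<' and '>', replaces each tag with a space so that adjacent words do not merge, decodes entities with the standard html package, and collapses whitespace. The stripTags name is hypothetical, and the sketch does not handle script or style contents, comments containing '>', or a literal '<' in running text.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// stripTags removes anything that looks like an HTML tag and returns
// plain text with entities decoded and whitespace collapsed.
// Naive by design: it does not parse HTML.
func stripTags(s string) string {
	var b strings.Builder
	inTag := false
	for _, r := range s {
		switch {
		case r == '<':
			inTag = true
		case r == '>' && inTag:
			inTag = false
			b.WriteByte(' ') // keep words on both sides of a tag apart
		case !inTag:
			b.WriteRune(r)
		}
	}
	text := html.UnescapeString(b.String())
	return strings.Join(strings.Fields(text), " ")
}

func main() {
	fmt.Println(stripTags("<p>Found <i>Homo sapiens</i> &amp; others</p>"))
}
```

Because this version discards <i> tags entirely, it also discards the italics clue mentioned above; preserving that clue would require passing tag positions along with the text.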

Name-finding

gnfinder can be used for name-finding in normalized UTF-8 texts. Examples of using gnfinder can be found in the bhlindex and htindex projects.

Remote or local file access

The Go standard library can provide this functionality. Here is an example of an approach.
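A minimal sketch of reading a document from either a local path or an http(s) URL, using only the standard library, might look as follows. The isRemote and readSource names and the 30-second timeout are assumptions made for illustration; a production version would likely also limit the download size.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// isRemote reports whether src parses as an http or https URL.
func isRemote(src string) bool {
	u, err := url.Parse(src)
	return err == nil && (u.Scheme == "http" || u.Scheme == "https")
}

// readSource returns the raw bytes of a local file or a remote document.
func readSource(src string) ([]byte, error) {
	if !isRemote(src) {
		return os.ReadFile(src)
	}
	client := &http.Client{Timeout: 30 * time.Second} // assumed timeout
	resp, err := client.Get(src)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("fetching %s: %s", src, resp.Status)
	}
	return io.ReadAll(resp.Body)
}

func main() {
	// Each command-line argument is treated as a path or a URL.
	for _, src := range os.Args[1:] {
		data, err := readSource(src)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %d bytes\n", src, len(data))
	}
}
```

Returning raw bytes keeps this step independent of the format: the same bytes can then go to file type detection and on to the matching converter.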

Project user interfaces

Command Line

gnfinder+ would have flags to distinguish inputs. It would also be great to have automatic detection of file types. In the worst case, detection can be done by file extensions. The output would closely correspond to the output of gnfinder.
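The combination of an explicit flag with a fallback to file extensions can be sketched with the standard flag package. The -format flag, the format labels, and the function names below are hypothetical and are not part of any existing gnfinder+ interface.

```go
package main

import (
	"flag"
	"fmt"
	"path/filepath"
	"strings"
)

// extFormats maps lower-case file extensions to hypothetical format labels.
var extFormats = map[string]string{
	".txt":  "text",
	".pdf":  "pdf",
	".doc":  "msword",
	".docx": "msword",
	".xls":  "msexcel",
	".xlsx": "msexcel",
	".htm":  "html",
	".html": "html",
	".png":  "image",
	".jpg":  "image",
	".jpeg": "image",
	".tif":  "image",
	".tiff": "image",
}

// formatFromExt guesses the format from the file extension alone.
func formatFromExt(path string) string {
	ext := strings.ToLower(filepath.Ext(path))
	if f, ok := extFormats[ext]; ok {
		return f
	}
	return "unknown"
}

// resolveFormat prefers an explicit user choice and falls back to
// the extension-based guess.
func resolveFormat(explicit, path string) string {
	if explicit != "" {
		return explicit
	}
	return formatFromExt(path)
}

func main() {
	format := flag.String("format", "", "input format; empty means guess from the file extension")
	flag.Parse()
	for _, path := range flag.Args() {
		fmt.Printf("%s\t%s\n", path, resolveFormat(*format, path))
	}
}
```

Under these assumptions, an invocation such as `gnfinder-plus -format pdf scan.bin` would override the guess, while `gnfinder-plus paper.pdf` would rely on the extension.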

gRPC service

The gRPC service would mirror the functionality of the command-line application. It would take as parameters either binary files or URLs that point to documents in a supported format.

Web application

gnfinder+ would be incorporated into GNRD, replacing the Ruby code for text normalization.

Development

Harsh Zalavadiya (@harshzalavadiya) started a proof-of-concept project, gnfinder-plus. This project might evolve into a production-ready application.
