LIMA Technical Documentation
LIMA configuration files are searched for in several places. First, in the folders pointed to by the LIMA_CONF environment variable (which has the same syntax as the PATH variable: /path/one:/path/two:…). If it is not defined, they are searched for in XDG_DATA_HOME or ~/.local… and then in the /usr/share/lima/conf folder (under GNU/Linux). This can be overridden on the command line using the --config-dir parameter.
If your configuration files seem to be ignored, it could be because another file is read instead. To check in which order configuration files are searched, define the LIMA_SHOW_CONFIG_PATH environment variable to a non-empty string and the list of the searched folders will be displayed on the console.
The main configuration files are lima-common.xml and lima-analysis.xml. These can be overridden on the command line using the --common-config-file and --lp-config-file parameters respectively. lima-common.xml defines some general information common to all languages and the names of the files defining language-specific data (by default lima-common-<lang>.xml for the language lang). lima-analysis.xml defines the names of the files describing language-specific pipelines and process units (by default lima-lp-<lang>.xml for the language lang). It also defines a mapping between global and language-specific pipeline names.
The file that you will mainly have to look at to change the behavior of LIMA for a given <lang> language is lima-lp-<lang>.xml.
The last configuration files, which can be very helpful for debugging or for understanding the internals of LIMA, are the log4cpp.properties files. They allow activating several levels of debugging information.
All LIMA XML configuration files have the following structure:
<?xml version='1.0' encoding='UTF-8'?>
<modulesConfig>
  <module name="moduleName">
    <group name="groupName">
      <param key="paramName" value="param value"/>
      <list name="listName">
        <item value="1st item value"/>
        <item value="2nd item value"/>
        <item value="..."/>
      </list>
      <map name="mapName">
        <entry key="FirstKey" value="1st key value"/>
        <entry key="SecondKey" value="2nd key value"/>
        <entry key="..." value="..."/>
      </map>
    </group>
    <group name="...">
      ...
    </group>
  </module>
  <module name="...">
    ...
  </module>
</modulesConfig>
One can include configuration data from external files with the following syntax:
<group name="include">
<list name="includeList">
<item value="<filename to include>/<module name to include>"/>
</list>
</group>
This must be placed as a child of a module tag. It includes the content of the referenced module from the referenced file into the current module, at the place where the include statement is.
The lima-lp-<lang>.xml file defines all the processing done during linguistic analysis and the resources it uses. It contains two modules: Processors for pipelines and process units, and Resources for the linguistic resources.
It contains several groups in four categories:
- Definition of pipelines
- Definition of process units
- Definition of loggers
- Definition of dumpers
In fact, loggers and dumpers are special kinds of process units: loggers write log messages tracing the results of some process units, while dumpers write or print the final results.
Each group has a name and a class, which is the identifier of the C++ class to instantiate using the dedicated factory.
The pipeline groups are all of the class ProcessUnitPipeline:
<group name="main" class="ProcessUnitPipeline" >
They contain one list named processUnitSequence whose items are the names of process units, loggers and dumpers. When a pipeline is selected (main by default, or the one selected with the --pipeline= or -p option), its elements are executed in sequence. There is no check of the possible dependencies between units: it is the role of the user to define coherent sequences. Please refer to the process units reference documentation for details about the role, dependencies, and configuration of each process unit.
The beginning of the main pipeline at the time of this writing is:
<group name="main" class="ProcessUnitPipeline" >
<list name="processUnitSequence">
<!--item value="beginStatusLogger"/-->
<item value="flattokenizer"/>
<item value="regexmatcher"/>
<!--item value="fullTokenXmlLoggerTokenizer"/-->
<item value="simpleWord"/>
<item value="hyphenWordAlternatives"/>
<item value="idiomaticAlternatives"/>
As you can see, several elements are commented out. Some of them are pipeline units that can be activated to do more things, like semantic analysis. The others are loggers, which one can activate to see the results of the previous units, and dumpers that are alternatives to the default one, the conllDumper. Note that several dumpers can be activated at the same time. Note also that some dumpers need a handler different from the default one; the correct handler is activated on the command line using the --dumper= or -d parameter.
The other predefined pipelines (one can define more) are:
- limaserver: the pipeline used by the LIMA HTTP/JSON server;
- easy: produces output for the Easy parsing evaluation campaign;
- none: an empty pipeline that does nothing, for tests.
All other Processors groups have parameters specific to each class. Some of the parameters are references to linguistic resources defined in the next module.
This module describes linguistic resources that are loaded at initialization time and that can be referenced from the process units using their name. Each one is described in a group that has a name and a class, which is the id of the C++ class used to instantiate it.
Resources are described on a dedicated page.
The log4cpp.properties files are in the log4j format. They allow configuring the categories defined in the C++ code. Each debug message is emitted in one category and at one level. If the level set for this category in the log4cpp.properties files is lower than or equal to the level of the message, then the message is printed on the standard output.
The levels are:
NOTSET < TRACE < DEBUG < INFO < NOTICE < WARN < ERROR < CRIT < ALERT < FATAL = EMERG
Note that the destination of each category (file, standard output, system logs, etc.) should be configurable, but this is not currently the case.
The log4cpp.properties files are searched for in the same places as the XML configuration files. They are all loaded, in reverse order of their discovery, so that definitions found in earlier (higher-priority) locations overwrite those found in later ones. In each searched folder (given by LIMA_CONF, etc.), all .properties files in the log4cpp subfolder are loaded first, and then the log4cpp.properties file itself.
In LIMA, plugins are shared libraries that can be loaded and linked dynamically. This task is handled by the class Lima::AmosePluginsManager. It allows adding new process units or resources without having to recompile the LIMA executables.
The class Lima::AmosePluginsManager is declared and defined in the following files:
lima_common/src/common/AbstractFactoryPattern/AmosePluginsManager.h
lima_common/src/common/AbstractFactoryPattern/AmosePluginsManager.cpp
It is exported by the lima-common-factory library.
The class Lima::AmosePluginsManager publicly inherits from Singleton<AmosePluginsManager>, which allows it to exist as a single instance at runtime. This instance is responsible for loading the plugins: this is done by calling its unique method loadPlugins() in its constructor, at instantiation time.
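As an illustration, here is a minimal sketch of this singleton mechanism; the Singleton template shown here and the stubbed loadPlugins() body are simplified assumptions, not the actual LIMA headers.
// Simplified sketch of the singleton pattern described above (assumptions,
// not the actual LIMA code).
#include <cstdio>

template <typename T>
class Singleton
{
public:
  // Returns the unique instance, constructing it on the first call.
  static T& single()
  {
    static T instance;
    return instance;
  }
protected:
  Singleton() = default;
};

class AmosePluginsManager : public Singleton<AmosePluginsManager>
{
  friend class Singleton<AmosePluginsManager>;
public:
  // In LIMA this scans $LIMA_CONF/plugins; stubbed out here.
  bool loadPlugins() { std::puts("loading plugins..."); return true; }
private:
  // Plugins are loaded as a side effect of constructing the unique instance.
  AmosePluginsManager() { loadPlugins(); }
};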
The loadPlugins() method looks for a plugins directory under the $LIMA_CONF folder. Then, for each file found in this directory, it reads the plugin names and deduces the shared libraries that need to be loaded. Once a shared library is identified, Lima::AmosePluginsManager delegates to DynamicLibrariesManager the task of actually loading it.
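The following sketch illustrates this scanning step (it is not the actual implementation); it assumes one plugin name per line in each file of the plugins folder, and it only looks at the first path listed in LIMA_CONF.
// Sketch (not the actual LIMA code) of the scan described above: collect one
// plugin name per line (an assumed format) from every file in $LIMA_CONF/plugins.
#include <QDir>
#include <QFile>
#include <QString>
#include <QStringList>
#include <QTextStream>
#include <cstdlib>

QStringList collectPluginNames()
{
  QStringList names;
  const char* conf = std::getenv("LIMA_CONF");
  if (conf == nullptr)
    return names;
  // LIMA_CONF may hold several paths; only the first one is considered here.
  const QString firstPath = QString::fromUtf8(conf).split(':').first();
  QDir pluginsDir(firstPath + "/plugins");
  for (const QString& fileName : pluginsDir.entryList(QDir::Files))
  {
    QFile file(pluginsDir.absoluteFilePath(fileName));
    if (!file.open(QIODevice::ReadOnly | QIODevice::Text))
      continue;
    QTextStream in(&file);
    while (!in.atEnd())
    {
      const QString line = in.readLine().trimmed();
      if (!line.isEmpty())
        names << line; // each name maps to a shared library to load
    }
  }
  return names;
}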
Any LIMA project (lima_common, lima_linguisticprocessings) can have a corresponding plugins file. These files are automatically filled during the project compilation step. In order to declare a library as a LIMA plugin, the developer needs to use the LIMA CMake macro DECLARE_LIMA_PLUGIN (defined in each project's root CMakeLists.txt). This macro appends a line containing the library name to the project plugins file, and adds a CMake rule to create a shared library (using the add_library command).
The class Lima::DynamicLibrariesManager is defined in the following files:
lima_common/src/common/AbstractFactoryPattern/DynamicLibrariesManager.h
lima_common/src/common/AbstractFactoryPattern/DynamicLibrariesManager.cpp
It is exported by the lima-common-factory library.
The class Lima::DynamicLibrariesManager publicly inherits from Singleton<DynamicLibrariesManager>, which allows it to exist as a single instance at runtime. This instance is responsible for the dynamic loading of the shared libraries that are passed to it: it provides a method loadLibrary(const std::string& libname) for that task. This class uses the Qt QLibrary class to dynamically load libraries in a portable way. In order to prevent loading a library multiple times, it holds a map, called m_handles, whose keys are the library names and whose values are the corresponding QLibrary instances.
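The following sketch illustrates this caching pattern; it is a simplified illustration of the idea, not the actual DynamicLibrariesManager code.
// Simplified illustration of the caching pattern described above: each
// library is loaded at most once through QLibrary.
#include <QLibrary>
#include <QString>
#include <map>
#include <memory>
#include <string>

class LibrariesCache
{
public:
  bool loadLibrary(const std::string& libName)
  {
    // Already loaded: nothing to do.
    if (m_handles.find(libName) != m_handles.end())
      return true;
    auto lib = std::make_unique<QLibrary>(QString::fromStdString(libName));
    if (!lib->load())
      return false; // loading failed (file not found, unresolved symbols, ...)
    m_handles[libName] = std::move(lib);
    return true;
  }
private:
  // Key: library name; value: its QLibrary handle, kept for the whole run.
  std::map<std::string, std::unique_ptr<QLibrary>> m_handles;
};
Keeping the QLibrary instances in the map keeps the libraries loaded for the whole lifetime of the manager.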
Lima::AmosePluginsManager::single();
Only one call is needed. As written above, the constructor of AmosePluginsManager will load the plugins found under $LIMA_CONF/plugins.
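A minimal usage sketch follows; the include path is an assumption based on the file locations given above, and the rest of the analyzer setup is omitted.
// Minimal usage sketch (include path assumed from the locations listed above).
#include "common/AbstractFactoryPattern/AmosePluginsManager.h"

int main()
{
  // Constructs the unique instance; its constructor calls loadPlugins(),
  // which loads every plugin found under $LIMA_CONF/plugins.
  Lima::AmosePluginsManager::single();

  // ... read the configuration and create LIMA clients here ...
  return 0;
}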
LIMA components are based on the "factory of factories" design pattern. Their goal is to design software modules that are independent from each other and that can easily be implemented either as local tools or as remote ones, like REST or CORBA services.
There is currently one such component in LIMA, the LIMA client, but other LVIC components (text and image index, search engine client…) use the same elements.
Components must be independent. That is, they can only use the libraries available in a common project, and the public APIs of other components. The public API of a component must be clearly identified, and must be independent of the rest of the component code (this is possible with the AbstractFactory pattern).
Each component will be the subject of a separate subproject. Common libraries are grouped together in the lima-common project.
The component must be able to execute, at the request of the user, either through a remote server (REST, WSDL, CORBA…) or as an integrated local tool. The core of the component must be completely independent of the way it is served, i.e. it must be possible to compile a component without using any remote server.
Each component is initialized from a configuration file (which can contain references to other files). This configuration file is in the LIMA XML format (module / group / param / list / map). The XMLConfigurationFile library of the lima-common project offers all the tools to read this type of file; this library can be enriched.
Each component offers an entry point for its initialization. Initialization is the responsibility of the calling program, a component must never initialize itself. Similarly, a component must never initialize another component.
The public API of a component will be placed in a particular subproject. This API must contain at least the abstract class defining the component's client, and the main factory for configuring the component and creating clients. There may also be other objects needed for interaction with the component, or result structures. The public part of a component must have NO DEPENDENCY on the core of the component.
In this part, 'component' stands for the name of the LIMA component (for example: linguisticProcessing).
- src/component: a subproject with the name of the component
- src/component/client: public API
- src/component/client/AbstractComponentClient.h: definition of the client interface
- src/component/client/ComponentClientFactory.h: main factory
- src/component/core: definition of the core client
- src/component/corba: definition of the CORBA client
- test: test programs
The public part of the component is located under src/component/client. Each client type is located in a separate subproject, e.g. src/component/core for the core client.
In the lima-common project, common/clientFactory/clientFactory.h defines a client factory template. This template avoids rewriting the component factory in each component.
It contains 2 classes: ClientFactory (the abstract factory for a client) and MainClientFactory (the main factory).
ClientFactory represents an abstract client factory. For each type of client (core, corba), a factory inheriting from ClientFactory must be defined, performing 2 operations: initialization and creation of the client. Initialization is called once, before any client is created, and must perform all the operations necessary to create clients. The creation of a client can be called as many times as necessary, and must create a new client of the desired type.
Each factory must be a singleton, and must register itself with the corresponding MainClientFactory.
MainClientFactory is the main factory of the component's clients. It allows initializing the component for a particular client type and creating clients (in practice, it delegates these calls to the corresponding ClientFactory).
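The following sketch illustrates the pattern with hypothetical names; it is not the actual LIMA API.
// Hypothetical sketch of the client factory pattern described above.
#include <map>
#include <memory>
#include <string>

class AbstractClient
{
public:
  virtual ~AbstractClient() = default;
};

// Abstract factory: one concrete subclass per client type (core, corba, ...).
class ClientFactory
{
public:
  virtual ~ClientFactory() = default;
  // Called once, before any client is created.
  virtual void configure(const std::string& configFile) = 0;
  // Called as many times as necessary; returns a new client each time.
  virtual std::unique_ptr<AbstractClient> createClient() const = 0;
};

// Main factory: concrete factories register themselves here, and calls are
// delegated to the factory registered for the requested client type.
class MainClientFactory
{
public:
  static MainClientFactory& single()
  {
    static MainClientFactory instance;
    return instance;
  }
  void registerFactory(const std::string& clientType, ClientFactory* factory)
  {
    m_factories[clientType] = factory;
  }
  void configure(const std::string& clientType, const std::string& configFile)
  {
    m_factories.at(clientType)->configure(configFile);
  }
  std::unique_ptr<AbstractClient> createClient(const std::string& clientType) const
  {
    return m_factories.at(clientType)->createClient();
  }
private:
  std::map<std::string, ClientFactory*> m_factories;
};
A core factory and, for example, a CORBA factory would each register themselves under their own client type; the calling program only deals with MainClientFactory.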
An architecture with abstract clients has the advantage of being modular and flexible. The core client performs the requested processing. One can then create a CORBA client whose only role is to pass the call to a server, which executes it using a core client. Later, if several servers have to be used, a third client can be written that sends the call to all the servers and merges the results.