ElasticSearch Integration
Integration with ElasticSearch is being developed as a new type of Target Storage (i.e., ACHE stores crawled pages in ElasticSearch indexes instead of on disk).
Currently, ACHE creates an index named "achecrawler" and indexes documents into two ElasticSearch types:
- `target`, for pages classified as on-topic by the page classifier
- `negative`, for pages classified as off-topic by the page classifier
Currently, these two types use the same schema, which has the following fields:
- `domain`: domain of the URL
- `topPrivateDomain`: top private domain of the URL
- `url`: complete URL
- `title`: title of the page, extracted from the HTML tag `<title>`
- `text`: clean text extracted from the HTML using Boilerpipe's DefaultExtractor
- `retrieved`: date when the page was fetched, in ISO-8601 representation. Ex: "2015-04-16T07:03:50.257+0000"
- `words`: array of strings with tokens extracted from the text content
- `wordsMeta`: array of strings with tokens extracted from the `<meta>` tags of the HTML content
- `html`: raw HTML content
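To make the schema concrete, here is a minimal Python sketch of what one indexed document might look like. The field names come from the list above; every value (the URL, title, tokens, and HTML) is made up for illustration, and the helper `format_retrieved` is a hypothetical name, not part of ACHE.

```python
import json
from datetime import datetime, timezone


def format_retrieved(moment):
    """Format an aware datetime with millisecond precision and a numeric
    UTC offset, matching the shape of the `retrieved` example above."""
    millis = moment.microsecond // 1000
    return (
        moment.strftime("%Y-%m-%dT%H:%M:%S")
        + ".{:03d}".format(millis)
        + moment.strftime("%z")
    )


# Hypothetical page; all values below are illustrative, not real crawl output.
fetched_at = datetime(2015, 4, 16, 7, 3, 50, 257000, tzinfo=timezone.utc)

sample_page = {
    "domain": "www.example.com",
    "topPrivateDomain": "example.com",
    "url": "http://www.example.com/page.html",
    "title": "Example page",
    "text": "Example page body text.",
    "retrieved": format_retrieved(fetched_at),
    "words": ["example", "page", "body", "text"],
    "wordsMeta": ["example", "keywords"],
    "html": "<html><head><title>Example page</title></head></html>",
}

if __name__ == "__main__":
    # Print the document as the JSON that would be stored in the index.
    print(json.dumps(sample_page, indent=2))
```

Because both types share this schema, the same shape would apply to a document stored under either `target` or `negative`.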
To use ElasticSearch, add the following lines to the configuration file `target_storage.cfg`:
DATA_FORMAT ELASTICSEARCH
ELASTICSEARCH_HOST localhost
ELASTICSEARCH_PORT 9300
Note: `ELASTICSEARCH_PORT` must point to the transport client port (which defaults to 9300), not the JSON API port (which defaults to 9200).
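A quick sanity check of these settings can catch the common mistake of pointing at the JSON API port. The sketch below assumes the file consists of whitespace-separated `KEY VALUE` lines, as in the snippet above, and that lines starting with `#` are comments (an assumption, not confirmed here). The function names are hypothetical and this is not ACHE's own parser.

```python
def parse_target_storage_config(text):
    """Parse whitespace-separated 'KEY VALUE' lines into a dict.

    Blank lines are skipped; lines starting with '#' are treated as
    comments (an assumption about the file format).
    """
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def check_elasticsearch_settings(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if settings.get("DATA_FORMAT") != "ELASTICSEARCH":
        problems.append("DATA_FORMAT is not set to ELASTICSEARCH")
    if not settings.get("ELASTICSEARCH_HOST"):
        problems.append("ELASTICSEARCH_HOST is missing")
    port = settings.get("ELASTICSEARCH_PORT", "")
    if not port.isdigit():
        problems.append("ELASTICSEARCH_PORT is missing or not a number")
    elif int(port) == 9200:
        # 9200 is ElasticSearch's default JSON (HTTP) API port.
        problems.append(
            "ELASTICSEARCH_PORT is 9200, the default JSON API port; "
            "the transport port (default 9300) is expected"
        )
    return problems


if __name__ == "__main__":
    with open("target_storage.cfg") as config_file:
        for problem in check_elasticsearch_settings(
            parse_target_storage_config(config_file.read())
        ):
            print("warning:", problem)
```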
When running ACHE with ElasticSearch, provide the name of the ElasticSearch index to use on the command line, with either `-e <arg>` or `--elasticIndex <arg>`.
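After a crawl, the index named on the command line can be inspected through ElasticSearch's JSON API. The following is a rough sketch, not part of ACHE. It assumes the JSON API listens on its default port 9200, that the ElasticSearch version in use still supports per-type search URLs of the form `/<index>/<type>/_search` (later versions removed types), and that a simple `match` query on the `text` field is sufficient. The function names are hypothetical.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_search_request(index, doc_type, text,
                         host="localhost", http_port=9200, size=5):
    """Build the URL and JSON body for a full-text search on `text`.

    `index` is the name passed to ACHE via -e / --elasticIndex and
    `doc_type` is either "target" or "negative".
    """
    url = "http://{}:{}/{}/{}/_search".format(
        host, http_port, quote(index), quote(doc_type)
    )
    body = {"query": {"match": {"text": text}}, "size": size}
    return url, body


def search(index, doc_type, text):
    """Send the search to a running ElasticSearch node and decode the reply."""
    url, body = build_search_request(index, doc_type, text)
    request = Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `search("achecrawler", "target", "some keyword")` would return the raw search response for on-topic pages, provided a node is reachable on localhost.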