ElasticSearch Integration

Aécio Santos edited this page Jun 19, 2015 · 6 revisions

Introduction

Integration with ElasticSearch is being developed as a new type of Target Storage (i.e., ACHE stores crawled pages in ElasticSearch indexes instead of storing them on disk).

Index, types, and fields

Currently, ACHE creates an index named "achecrawler" and indexes documents under two ElasticSearch types:

  • target, for pages classified as on-topic by the page classifier
  • negative, for pages classified as off-topic by the page classifier

For now, both types use the same schema, which has the following fields:

  • domain: domain of the url
  • topPrivateDomain: top private domain of the url
  • url: complete URL
  • title: title of the page, extracted from the HTML <title> tag
  • text: clean text extracted from the HTML using Boilerpipe's DefaultExtractor
  • retrieved: date and time when the page was fetched, in ISO-8601 representation, e.g. "2015-04-16T07:03:50.257+0000"
  • words: array of strings with tokens extracted from the text content
  • wordsMeta: array of strings with tokens extracted from the <meta> tags of the HTML content
  • html: raw HTML content
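
To make the schema concrete, here is a minimal Python sketch that builds a document with the fields listed above. It is only an illustration, not ACHE's implementation: the function names (build_document, format_retrieved, type_for) are hypothetical, the whitespace tokenizer stands in for whatever tokenization ACHE applies, and the "last two labels" rule for topPrivateDomain is a rough approximation that is wrong for suffixes such as "co.uk".

```python
from datetime import datetime, timezone
from urllib.parse import urlparse


def format_retrieved(fetched_at):
    """Format an aware datetime like "2015-04-16T07:03:50.257+0000"."""
    utc = fetched_at.astimezone(timezone.utc)
    millis = utc.microsecond // 1000
    return utc.strftime("%Y-%m-%dT%H:%M:%S") + ".%03d+0000" % millis


def type_for(is_on_topic):
    """Pick the ElasticSearch type from the page classifier's decision."""
    return "target" if is_on_topic else "negative"


def build_document(url, title, text, html, fetched_at, words_meta=()):
    """Assemble a dict with the same field names as the schema above."""
    host = urlparse(url).hostname or ""
    return {
        "domain": host,
        # Assumption: naive approximation that keeps the last two labels.
        "topPrivateDomain": ".".join(host.split(".")[-2:]),
        "url": url,
        "title": title,
        "text": text,
        "retrieved": format_retrieved(fetched_at),
        # Assumption: plain whitespace tokenization, for illustration only.
        "words": text.split(),
        "wordsMeta": list(words_meta),
        "html": html,
    }
```

A document built this way could then be indexed under the type returned by type_for, using whichever ElasticSearch client you prefer.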

Configuration

To use ElasticSearch, add the following lines to the configuration file target_storage.cfg:

DATA_FORMAT ELASTICSEARCH
ELASTICSEARCH_HOST localhost
ELASTICSEARCH_PORT 9300

Note: ELASTICSEARCH_PORT must point to the transport client port (which defaults to 9300), not the JSON (HTTP) API port (which defaults to 9200).
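
Because mixing up the two ports is an easy mistake, a small sanity check on the configuration can help before starting a crawl. The Python sketch below is not part of ACHE; parse_cfg and check_elasticsearch_settings are hypothetical helpers. It assumes the file holds one "KEY VALUE" pair per line separated by whitespace, as in the lines above, and that lines starting with "#" are comments (an assumption about the file format).

```python
def parse_cfg(text):
    """Parse "KEY VALUE" lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and "#" comments carry no settings.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return settings


def check_elasticsearch_settings(settings):
    """Return a list of human-readable problems (empty if none found)."""
    problems = []
    if settings.get("DATA_FORMAT") != "ELASTICSEARCH":
        problems.append("DATA_FORMAT is not set to ELASTICSEARCH")
    if not settings.get("ELASTICSEARCH_HOST"):
        problems.append("ELASTICSEARCH_HOST is missing")
    port = settings.get("ELASTICSEARCH_PORT", "")
    if not port.isdigit():
        problems.append("ELASTICSEARCH_PORT is missing or not a number")
    elif port == "9200":
        # 9200 is the default HTTP/JSON API port, not the transport port.
        problems.append("ELASTICSEARCH_PORT is 9200; the transport port defaults to 9300")
    return problems
```

For example, check_elasticsearch_settings(parse_cfg(open("target_storage.cfg").read())) returns an empty list for the three lines shown above.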

Running ACHE using ElasticSearch storage

When running ACHE with ElasticSearch storage, provide the name of the ElasticSearch index to use on the command line with one of the following arguments: -e <arg> or --elasticIndex <arg>
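
The short and long forms are interchangeable. As an illustration only, the Python sketch below declares an option pair with the same names using the standard argparse module; it is not ACHE's own argument parser, and build_parser and the elastic_index attribute are hypothetical names chosen for this example.

```python
import argparse


def build_parser():
    """Declare an option that accepts either -e or --elasticIndex."""
    parser = argparse.ArgumentParser(description="Illustrative option parser")
    parser.add_argument(
        "-e", "--elasticIndex",
        dest="elastic_index",
        metavar="<arg>",
        required=True,
        help="name of the ElasticSearch index to use",
    )
    return parser


def index_name_from(argv):
    """Return the index name given a list of command-line tokens."""
    return build_parser().parse_args(argv).elastic_index
```

Both index_name_from(["-e", "achecrawler"]) and index_name_from(["--elasticIndex", "achecrawler"]) yield the same index name.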