ElasticSearch Integration

Introduction

Integration with ElasticSearch is being developed as a new type of Target Storage (i.e., ACHE stores crawled pages in ElasticSearch indexes instead of storing them on disk).

Index, types, and fields

Currently, ACHE creates an index named "achecrawler" and indexes documents into two ElasticSearch types:

target, for pages classified as on-topic by the page classifier
negative, for pages classified as off-topic by the page classifier

Currently, these two types use the same schema, which has the following fields:

domain: domain of the url
topPrivateDomain: top private domain of the url
url: complete URL
title: title of the page extracted from the html tag <title>
text: clean text extracted from the html using Boilerpipe's DefaultExtractor
retrieved: date when the page was fetched, in ISO-8601 representation (e.g., "2015-04-16T07:03:50.257+0000")
words: array of strings with tokens extracted from the text content
wordsMeta: array of strings with tokens extracted from tags <meta> of the html content
html: raw html content
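
To make the schema above concrete, here is a minimal sketch of what one indexed document might look like. The field names come from the list above; every value (URL, title, tokens) is made up for illustration, and the helper format_retrieved is a hypothetical name that is not part of ACHE.

```python
import json
from datetime import datetime, timezone


def format_retrieved(ts: datetime) -> str:
    """Format a timestamp in the style shown above, e.g. 2015-04-16T07:03:50.257+0000.

    Hypothetical helper: it only mimics the example format, not ACHE's own code.
    """
    ts = ts.astimezone(timezone.utc)
    millis = ts.microsecond // 1000
    return ts.strftime("%Y-%m-%dT%H:%M:%S") + f".{millis:03d}+0000"


# Illustrative document; all values are invented placeholders.
sample_doc = {
    "domain": "www.example.com",
    "topPrivateDomain": "example.com",
    "url": "http://www.example.com/index.html",
    "title": "Example page",
    "text": "Example page body text",
    "retrieved": format_retrieved(
        datetime(2015, 4, 16, 7, 3, 50, 257000, tzinfo=timezone.utc)
    ),
    "words": ["example", "page", "body", "text"],
    "wordsMeta": ["example", "keywords"],
    "html": "<html><head><title>Example page</title></head>"
            "<body>Example page body text</body></html>",
}

print(json.dumps(sample_doc, indent=2))
```

Both the target and negative types would hold documents of this shape, since they currently share the same schema.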

Configuration

To use ElasticSearch, you need to add the following lines in the configuration file target_storage.cfg:

DATA_FORMAT ELASTICSEARCH
ELASTICSEARCH_HOST localhost
ELASTICSEARCH_PORT 9300

Note: ELASTICSEARCH_PORT should point to the transport client port (which defaults to 9300), not the JSON API port.
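
As a sanity check on the settings above, the following sketch reads the three lines and flags the most likely port mistake. It assumes the file uses whitespace-separated KEY VALUE pairs, as in the snippet above, and that lines starting with # are comments (an assumption). The functions parse_cfg and port_warning are hypothetical helpers, not part of ACHE; 9200 is ElasticSearch's default HTTP (JSON API) port.

```python
CFG_TEXT = """\
DATA_FORMAT ELASTICSEARCH
ELASTICSEARCH_HOST localhost
ELASTICSEARCH_PORT 9300
"""


def parse_cfg(text):
    """Parse whitespace-separated KEY VALUE lines into a dict of strings."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skipping '#' comment lines is an assumption about the file format.
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        settings[parts[0]] = parts[1] if len(parts) > 1 else ""
    return settings


def port_warning(settings):
    """Return a warning string if the port looks like the JSON API port."""
    if settings.get("ELASTICSEARCH_PORT") == "9200":
        return "ELASTICSEARCH_PORT is 9200 (JSON API); the transport port defaults to 9300"
    return None


settings = parse_cfg(CFG_TEXT)
print(settings)
print(port_warning(settings))
```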

Running ACHE using ElasticSearch storage

When running ACHE with ElasticSearch storage, provide the name of the ElasticSearch index to use on the command line with one of the following arguments: -e <arg> or --elasticIndex <arg>
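
Once a crawl has run, one way to inspect the stored pages is through ElasticSearch's JSON API. The sketch below only builds the search URL and request body and sends nothing over the network. It assumes the JSON API listens on its default port 9200, that the installed ElasticSearch version still accepts a type name in the URL path, and that the index name is the one passed with -e. The functions search_url and match_query are hypothetical helpers, not part of ACHE.

```python
import json
from urllib.parse import quote


def search_url(index, doc_type, host="localhost", http_port=9200):
    """Build the _search endpoint for one index and type (JSON API, not transport)."""
    return (
        f"http://{host}:{http_port}/"
        f"{quote(index, safe='')}/{quote(doc_type, safe='')}/_search"
    )


def match_query(field, value, size=10):
    """Build a simple match query body as a JSON string."""
    return json.dumps({"query": {"match": {field: value}}, "size": size})


# "achecrawler" stands in for whatever index name was given on the command line.
url = search_url("achecrawler", "target")
body = match_query("title", "example")

print(url)
print(body)
```

The same URL with negative in place of target would address the pages classified as off-topic.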
