Skip to content
ignore-me edited this page Sep 18, 2014 · 4 revisions

Configuration

Cmr is configured through a single ini configuration file located at /etc/cmr/config.ini. After installation it is necessary to provide some additional information before CMR is usable.

basepath must be configured, this path is intended to point at the mount point in which the data warehouse is located. scratch_path is derived from basepath and is intended to point at a location also within the mount where it is suitable to write intermediate data necessary for CMR to run jobs. There are three important paths derived from scratch_path, output_path, error_path, and bundle_path. These locations indicate where CMR should put its output, error, and bundles respectively during job execution, so it is important that they too are accessible from all nodes.

For instance, if the clustered file system's mount point is /mnt/gv0 a valid configuration could look like this.

basepath=/mnt/gv0
scratch_path=${basepath}/home/${USER}/.scratch/cmr/${JOB_ID}
output_path=${scratch_path}/output
error_path=${scratch_path}/error
bundle_path=${scratch_path}/bundle

During job execution each path will be expanded. So if a user... octodata runs a job his set of paths might be as follows

basepath:      /mnt/gv0
scratch_path:  /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221
output_path:   /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221/output
error_path:    /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221/error
bundle_path:   /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221/bundle

As shown above by default the scratch_path is user dependent, and to function correctly it is necessary for each user directory to be writable by each individual user. There are two ways of addressing this, either the user directories must be pre-created with the correct permissions (so that clients may write to them, they'll probably want to read from them too) or the directory immediately above to be world writable, in which case CMR clients will create the users directory. i.e. if the scratch path is ${basepath}/home/${USER}/.scratch/cmr/${JOB_ID} then ${basepath}/home/ should be world writable.

If intending to use CMR in a multi-node environment the bind addresses server_in, server_out, caster_in, and caster_out must be configured as well by default they reference localhost. The cmr-worker, and cmr-clients will attempt to connect to these addresses so it is important that the same addresses are configured on all nodes in a CMR installation.

Running CMR

In order to run a CMR job, the cmr-caster and cmr-server, and at least one cmr-worker instance must be started. Currently it is necessary for CMR clients, and the cmr-server server to be run on the same system.

Provided the configuration is available in /etc/cmr/config.ini all cmr components are start-able without specifying any additional command line arguments, so getting everything up and running is as easy as running each component.

cmr-server
cmr-caster
cmr-worker

Alternatively, If CMR was installed using the Debian packages each component may be started via service.

service cmr-server start
service cmr-worker start

The cmr-server service script also controls the cmr-caster component as the pair rely on each other. Each component will write logs into /var/log/cmr/

If everything has been started successfully, a service status call will result in the following:

:~$ service cmr-server status
 * cmr-server (3137) is running ok
 * cmr-caster (31289) is running ok
:~$
:~$ service cmr-worker status
 * cmr-worker (1: 22404) is running ok
 * cmr-worker (2: 22414) is running ok
 * cmr-worker (3: 22424) is running ok
 * cmr-worker (4: 22434) is running ok
:~$

Each status return will give the pid of the running process.

Do note that the cmr-worker process count can be configured by changing NUMINSTANCES in /etc/default/cmr-worker (debian packages only). The above example is using 4 processes. In our experience running a few more processes rather than increasing the number of threads in each cmr-worker (through configuration) yielded a small boost in performance.

Configuring cget

cget requires additional information about the data warehouse in order for it to provide a simplified user interface. Specifically variables concerning the location of files within the warehouse (warehouse_file_path) must be filled in for cget to function. warehouse_file_path describes the location of files in the warehouse as a glob pattern given a mixture of variable and fixed path elements. A warehouse_file_path might look like this.

basepath=/mnt/gv0
warehouse_file_path=${basepath}/user/hive/warehouse/${TABLE}/pdate=${DATE}/ptime=${HOUR}/log_${MINUTE}.gz

Now if we execute a cget command with this configuration

cget --select "day,type" --from "purchases" --between "2014-01-01 03:23:00 and 2014-01-04 01:16:00"

It will produce the following cmr command (a complex date, and fairly granular warehouse_file_path was chosen intentionally to show how specifying inputs can get rather complex and error prone if done by hand)

cmr
    --mapper 'cmr_map_json  day type  _1'
    --reducer 'cmr_reduce s'  --final-reducer 'cmr_reduce s -o "\t"'
    --input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-01/ptime=3/log_{23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59}.gz"
    --input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-01/ptime={4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}/log_*.gz"
    --input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-04/ptime=1/log_{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}.gz"
    --input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-{02,03}/ptime=*/log_*.gz"
    --stdout  2> /dev/null

resulting in the following reduced data

2014-01-01  banana  38
2014-01-01  orange  22
2014-01-02  banana  44
2014-01-02  orange  32
2014-01-03  banana  38
2014-01-03  orange  33
2014-01-04  banana  0
2014-01-04  orange  0

Multiple tables in cget

cget can support querying multiple tables in a single query (allowing their results to be merged if they can be mapped to the same domain). In order to enable this functionality a relationship between the contents of a row and it's table must be established. These relatio nships may be expressed in the cmr configuration under the section cget-table as follows.

[cget-table]
purchases=handler:purchases
impressions=handler:impressions

This configuration will instruct cget that when accessing the table purchases rows that contain the key value pair {"handler":"purchases"} belong to the purchases table, and when accessing the table impressions rows that contain the key value pair {"handler":"imp ressions"} belong to the impressions table. With this additional configuration more complex queries may be constructed in cget, because the mapper can now choose to map differently depending on which table a row belongs to. For instance, lets say that the type field fr om purchases corresponds directly to the item field in the impressions table. The following query will allow the two to be mapped into the same domain so that they can be reduced together.

cget --select "day,type" --from "purchases" --select "day,item" --from "impressions" --between "2014-01-01 03:23:00 and 2014-01-04 01:16:00"

This query will now map rows from purchases table to "day, type, 1, 0" and rows from the impressions table to "day, item, 0, 1" the resulting mapped rows will be reduced. This will create output rows that contain "day, type, number of purchases, number of impressions".

2014-01-01  banana  38  327
2014-01-01  orange  22  223
2014-01-02  banana  44  400
2014-01-02  orange  32  252
2014-01-03  banana  38  304
2014-01-03  orange  33  237
2014-01-04  banana  0 3
2014-01-04  orange  0 2

Configuration sections & options

[global]        configuration options affect all components of CMR and are overridden by configuration options specific to each component
[cmr-server]    configuration options specific to cmr-server
[cmr-worker]    configuration options specific to cmr-worker
[cmr]           configuration options specific to cmr client
[cmr-grep]      configuration options specific to cmr-grep client
log_level               minimum logging severity, one of [TRACE, DEBUG, INFO, WARN, ERROR, FATAL]
server_in               cmr-server binds to this address, which is used to service cmr client requests
server_out              cmr-server binds to this address, which is used to communicate with worker instances
caster_in               cmr-caster binds to this address, which is used to receive requests to broadcast events
caster_out              cmr-caster binds to this address, which is used to broadcast events
base-relative-path    output paths specified by users is assumed to be relative to basepath
basepath                path to the root of the clustered file system
scratch_path            scratch location (must be within the clustered file system)
output_path             default output location (must be within the clustered file system)
error_path              error log output location (must be within the clustered file system)
bundle_path             script bundle location (must be within the clustered file system)
warehouse_path      path to the root of the data warehouse (must be within the clustered file system)
warehouse_file_path   path to a file located within the data warehouse (must be within the clustered file system)
batch_size              number of files to be included in a single task
max_task_attempts       maximum allowed attempts on a single task before job failure
accept_timeout          time allowed for a worker to acknowledge aquiring a client task
task_timeout            base timeout allowed for a worker to complete a client task
retry_timeout           additional time allotted to a task for susequent attempts upon task failure
deadline_scale_factor   increase task timeout by this amount for each task submitted by a job (scale for large jobs)
max_threads             controls the number of threads used by a reactor.
              Within the [global] section it controls the number of threads used for crawling the file system
              Within the [cmr-worker] section it controls the number of worker threads
tasks_per_thread        number of tasks that should be queued on a single thread
max_thread_backlog      absolute maximum number of tasks allowed to be queued on a single thread
drought_backoff         extra sleep time for server/worker when no work is available to be completed
enabled         on/off switch
work_resend_interval  time to wait between work requests (if there is no response)
dispatch_interval   time to wait between dispatch task attempts
delete_zerobyte_output  delete partial/output files that contain no output
anti_spoof        attempt to confirm client identity (will only work if the server and client are run on the same system)