Configuration
CMR is configured through a single INI configuration file located at /etc/cmr/config.ini. After installation it is necessary to provide some additional information before CMR is usable.
basepath must be configured; this path is intended to point at the mount point in which the data warehouse is located. scratch_path is derived from basepath and is intended to point at a location, also within the mount, where it is suitable to write the intermediate data CMR needs while running jobs. Three important paths are derived from scratch_path: output_path, error_path, and bundle_path. These locations indicate where CMR should put its output, errors, and bundles respectively during job execution, so it is important that they too are accessible from all nodes.
For instance, if the clustered file system's mount point is /mnt/gv0, a valid configuration could look like this:
basepath=/mnt/gv0
scratch_path=${basepath}/home/${USER}/.scratch/cmr/${JOB_ID}
output_path=${scratch_path}/output
error_path=${scratch_path}/error
bundle_path=${scratch_path}/bundle
During job execution each path will be expanded. So if a user named octodata runs a job, the set of paths might be as follows:
basepath: /mnt/gv0
scratch_path: /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221
output_path: /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221/output
error_path: /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221/error
bundle_path: /mnt/gv0/home/octodata/.scratch/cmr/06ab1fff-05f4-4ba3-877c-ff309abcb221/bundle
As shown above, the scratch_path is user dependent by default, and to function correctly each user directory must be writable by the individual user. There are two ways of addressing this: either the user directories must be pre-created with the correct permissions (so that clients may write to them, and most likely read from them too), or the directory immediately above must be world writable, in which case CMR clients will create the user directories themselves. I.e. if the scratch path is ${basepath}/home/${USER}/.scratch/cmr/${JOB_ID}, then ${basepath}/home/ should be world writable.
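As a sketch, assuming the example mount point above, the two approaches might look like this (the ownership and modes are assumptions; adjust them to your environment):
# Option 1: pre-create a per-user directory with the correct ownership
mkdir -p /mnt/gv0/home/octodata
chown octodata:octodata /mnt/gv0/home/octodata
# Option 2: make the parent world writable (the sticky bit stops users
# from removing each other's directories) and let the CMR clients
# create the user directories on demand
chmod 1777 /mnt/gv0/home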
If intending to use CMR in a multi-node environment, the bind addresses server_in, server_out, caster_in, and caster_out must be configured as well; by default they reference localhost. The cmr-worker and cmr-client processes will attempt to connect to these addresses, so it is important that the same addresses are configured on all nodes in a CMR installation.
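For example, a multi-node sketch where the cmr-server and cmr-caster run on a host reachable at 10.0.0.10 (the addresses and ports here are purely illustrative; consult the shipped config.ini for the exact address format):
server_in=10.0.0.10:5555
server_out=10.0.0.10:5556
caster_in=10.0.0.10:5557
caster_out=10.0.0.10:5558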
In order to run a CMR job, the cmr-caster, the cmr-server, and at least one cmr-worker instance must be started. Currently it is necessary for CMR clients and the cmr-server to be run on the same system.
Provided the configuration is available in /etc/cmr/config.ini, all CMR components can be started without specifying any additional command line arguments, so getting everything up and running is as easy as running each component.
cmr-server
cmr-caster
cmr-worker
Alternatively, if CMR was installed using the Debian packages, each component may be started via service.
service cmr-server start
service cmr-worker start
The cmr-server service script also controls the cmr-caster component, as the pair rely on each other. Each component will write its logs into /var/log/cmr/.
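To follow the logs while bringing the system up (the exact log file names are not specified here, so a glob is used):
tail -f /var/log/cmr/*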
If everything has been started successfully, a service status call will result in the following:
:~$ service cmr-server status
* cmr-server (3137) is running ok
* cmr-caster (31289) is running ok
:~$
:~$ service cmr-worker status
* cmr-worker (1: 22404) is running ok
* cmr-worker (2: 22414) is running ok
* cmr-worker (3: 22424) is running ok
* cmr-worker (4: 22434) is running ok
:~$
Each status return will give the pid of the running process.
Do note that the cmr-worker process count can be configured by changing NUMINSTANCES in /etc/default/cmr-worker (Debian packages only); the example above is using 4 processes. In our experience, running a few more processes rather than increasing the number of threads in each cmr-worker (through configuration) yielded a small boost in performance.
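For example, in /etc/default/cmr-worker (the surrounding contents of the file are assumed):
# number of cmr-worker processes started by the service script
NUMINSTANCES=4
A restart of the service is needed for the change to take effect:
service cmr-worker restart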
cget requires additional information about the data warehouse in order to provide a simplified user interface. Specifically, the variables concerning the location of files within the warehouse (warehouse_file_path) must be filled in for cget to function. warehouse_file_path describes the location of files in the warehouse as a glob pattern, given a mixture of variable and fixed path elements. A warehouse_file_path might look like this:
basepath=/mnt/gv0
warehouse_file_path=${basepath}/user/hive/warehouse/${TABLE}/pdate=${DATE}/ptime=${HOUR}/log_${MINUTE}.gz
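With this pattern, the file holding purchases rows logged at 2014-01-01 03:23 would, for example, expand to the following path (a hypothetical file, shown only to illustrate how the variables expand):
/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-01/ptime=3/log_23.gz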
Now, if we execute a cget command with this configuration:
cget --select "day,type" --from "purchases" --between "2014-01-01 03:23:00 and 2014-01-04 01:16:00"
It will produce the following cmr command (a complex date range and a fairly granular warehouse_file_path were chosen intentionally to show how specifying inputs can get rather complex and error prone if done by hand):
cmr
--mapper 'cmr_map_json day type _1'
--reducer 'cmr_reduce s' --final-reducer 'cmr_reduce s -o "\t"'
--input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-01/ptime=3/log_{23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59}.gz"
--input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-01/ptime={4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23}/log_*.gz"
--input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-04/ptime=1/log_{0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15}.gz"
--input "/mnt/gv0/user/hive/warehouse/purchases/pdate=2014-01-{02,03}/ptime=*/log_*.gz"
--stdout 2> /dev/null
resulting in the following reduced data:
2014-01-01 banana 38
2014-01-01 orange 22
2014-01-02 banana 44
2014-01-02 orange 32
2014-01-03 banana 38
2014-01-03 orange 33
2014-01-04 banana 0
2014-01-04 orange 0
cget can support querying multiple tables in a single query (allowing their results to be merged if they can be mapped to the same domain). In order to enable this functionality, a relationship between the contents of a row and its table must be established. These relationships may be expressed in the CMR configuration under the section cget-table as follows.
[cget-table]
purchases=handler:purchases
impressions=handler:impressions
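For illustration, rows carrying these handler keys might look like this (all fields other than handler are invented for the example):
{"handler":"purchases","day":"2014-01-01","type":"banana"}
{"handler":"impressions","day":"2014-01-01","item":"banana"}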
This configuration will instruct cget that when accessing the table purchases, rows that contain the key value pair {"handler":"purchases"} belong to the purchases table, and when accessing the table impressions, rows that contain the key value pair {"handler":"impressions"} belong to the impressions table. With this additional configuration more complex queries may be constructed in cget, because the mapper can now choose to map differently depending on which table a row belongs to. For instance, let's say that the type field from purchases corresponds directly to the item field in the impressions table. The following query will allow the two to be mapped into the same domain so that they can be reduced together.
cget --select "day,type" --from "purchases" --select "day,item" --from "impressions" --between "2014-01-01 03:23:00 and 2014-01-04 01:16:00"
This query will now map rows from the purchases table to "day, type, 1, 0" and rows from the impressions table to "day, item, 0, 1"; the resulting mapped rows will then be reduced, creating output rows that contain "day, type, number of purchases, number of impressions".
2014-01-01 banana 38 327
2014-01-01 orange 22 223
2014-01-02 banana 44 400
2014-01-02 orange 32 252
2014-01-03 banana 38 304
2014-01-03 orange 33 237
2014-01-04 banana 0 3
2014-01-04 orange 0 2
The configuration file is organised into the following sections:
[global] - configuration options that affect all components of CMR; these are overridden by configuration options specific to each component
[cmr-server] - configuration options specific to cmr-server
[cmr-worker] - configuration options specific to cmr-worker
[cmr] - configuration options specific to the cmr client
[cmr-grep] - configuration options specific to the cmr-grep client
The following configuration options are available:
log_level - minimum logging severity, one of [TRACE, DEBUG, INFO, WARN, ERROR, FATAL]
server_in - cmr-server binds to this address, which is used to service cmr client requests
server_out - cmr-server binds to this address, which is used to communicate with worker instances
caster_in - cmr-caster binds to this address, which is used to receive requests to broadcast events
caster_out - cmr-caster binds to this address, which is used to broadcast events
base-relative-path - output paths specified by users are assumed to be relative to basepath
basepath - path to the root of the clustered file system
scratch_path - scratch location (must be within the clustered file system)
output_path - default output location (must be within the clustered file system)
error_path - error log output location (must be within the clustered file system)
bundle_path - script bundle location (must be within the clustered file system)
warehouse_path - path to the root of the data warehouse (must be within the clustered file system)
warehouse_file_path - glob pattern describing the files located within the data warehouse (must be within the clustered file system)
batch_size - number of files to be included in a single task
max_task_attempts - maximum allowed attempts on a single task before job failure
accept_timeout - time allowed for a worker to acknowledge acquiring a client task
task_timeout - base timeout allowed for a worker to complete a client task
retry_timeout - additional time allotted to a task for subsequent attempts upon task failure
deadline_scale_factor - increase task timeout by this amount for each task submitted by a job (scaling for large jobs)
max_threads - controls the number of threads used by a reactor:
  within the [global] section it controls the number of threads used for crawling the file system
  within the [cmr-worker] section it controls the number of worker threads
tasks_per_thread - number of tasks that should be queued on a single thread
max_thread_backlog - absolute maximum number of tasks allowed to be queued on a single thread
drought_backoff - extra sleep time for the server/worker when no work is available to be completed
enabled - on/off switch
work_resend_interval - time to wait between work requests (if there is no response)
dispatch_interval - time to wait between dispatch task attempts
delete_zerobyte_output - delete partial/output files that contain no output
anti_spoof - attempt to confirm client identity (will only work if the server and client are run on the same system)
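To tie these together, a minimal sketch of a complete config.ini, built from the options above (the values are illustrative, not recommended defaults):
[global]
log_level=INFO
basepath=/mnt/gv0
scratch_path=${basepath}/home/${USER}/.scratch/cmr/${JOB_ID}
output_path=${scratch_path}/output
error_path=${scratch_path}/error
bundle_path=${scratch_path}/bundle
warehouse_file_path=${basepath}/user/hive/warehouse/${TABLE}/pdate=${DATE}/ptime=${HOUR}/log_${MINUTE}.gz

[cmr-worker]
max_threads=4
tasks_per_thread=2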