Stuck Harvest Fix

The goal is to automate this process so that the system will automatically fix a stuck harvest source after 12 hours (or some amount of fixed time). This was originally done in the geodatagov extension, though the process is outdated and needs to be revamped.

Understanding what is stuck

Mostly, harvest jobs get stuck in the fetch stage. To determine what is stuck, consider:

Are the harvest-fetch workers idle?
For harvest jobs in the Running state, is the timestamp of gather_finished?

Review the /var/log/harvest-fetch.log on catalog-harvest1p. Are harvest objects being processed, or does it seem like the workers are idle?

Review the harvest jobs in the Running state. SELECT * FROM harvest_job WHERE status = 'Running';. Is the timestamp of gather_finished older than an hour?

If you answer "yes" to both of these, the job is probably stuck. Otherwise, let the fetch workers continue working since "unstucking" the jobs that might finish could cause undesired side effects.

"Unstucking" the jobs

To manually force a completion of a harvest job, the harvest object records in the DB that are not processed (marked as complete or error) will need to be marked as an error, and to save that error. The process looks like this:

Create harvest error descriptions for the stuck harvest tasks (fetch or gather tasks).
Mark the stuck harvest tasks as errored.
Mark any harvest jobs that were not started properly as New.
(automated) harvest run will see the errored tasks and cleanup or restart jobs as appropriate.

Create error descriptions for unprocessed/stuck tasks

The SQL for saving the unprocessed records is below (replace harvest_job_id text with the job that needs to be released).

Note: the query below will fail with a non-unique key error on harvest_object_error.id. The addition of a newly unique id for the harvest object error record should be created, but will need to be generated by python: it cannot be generated via SQL natively.

# Creates harvest_object_errors to be shown to the user after the job is complete.
INSERT INTO harvest_object_error ("harvest_object_id", "message", "stage", "id")
SELECT id, 'Unknown error occurred, object did not harvest',  'unknown', 'ERROR_TEST'
FROM harvest_object
WHERE state <> 'COMPLETE'
AND state <> 'ERROR'
AND state <> 'STUCK'
AND harvest_job_id = '';

The above query could be improved by extracting the stage of failure by evaluating which timestamp is null: gathered, fetch_started, fetch_finished, import_started, import_finished, and giving one of these 3 values: Import, Fetch, and Validation.

Mark stuck tasks as errored

The SQL for forcing the complete for a stuck task is below (replace job_id text with the job that needs to be released). Once harvester run checks for completed jobs, it will see all the tasks as errored or complete and mark the harvest job complete.

# Marks stuck task as error for given harvest job id
UPDATE harvest_object
SET state = 'ERROR'
WHERE state <> 'COMPLETE'
AND harvest_job_id = '';

Eventually these queries could be run on any jobs that have not had any activity in 12 or more hours. This would look something like:

# Marks all stuck tasks as errored after 12 hours
UPDATE harvest_object 
SET state='ERROR' 
WHERE 
state <> 'COMPLETE' AND state <> 'ERROR' 
AND import_finished IS NULL 
AND import_started < now() - '12 hours'::interval;

Mark harvest jobs that never started correctly as New

There are also use cases where a harvest is started, but the harvesters are so backed up that the gather process is never started, and there are no harvest objects to mark as error. At this point, you can either restart the job or mark it as completed. To force the restart, use the following SQL:

UPDATE harvest_job 
SET status = 'New' 
WHERE gather_started IS null
AND status = 'Running' 
AND created < now() - interval '1 day';

Provide feedback

Saved searches