
Error management on random-connector-incremental #60

Open
matteogrolla opened this issue Jan 21, 2020 · 3 comments

@matteogrolla

Hi,
I'm Matteo Grolla from Sourcesense, Lucidworks' partner in Italy.
I'm developing a custom connector for a customer, but I have some questions about error management; Robert Lucarini suggested that I post them here.
Let's use random-content-incremental for the discussion and focus on the fetch method.
What I've noticed is:

  • if an exception is thrown inside generateRandom, the framework restarts the crawl from the previous checkpoint (or from the beginning if it was the first).
    How can I terminate the crawl, marking it as failed?
    I'd like the crawl to proceed from the last saved checkpoint the next time I restart it.
  • if an exception is thrown inside emitDocument, the framework logs the error and proceeds with the crawl.
    Will this document be recrawled? When? Can we control this?
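For reference, the fetch flow I'm reasoning about looks roughly like this (condensed from random-content-incremental; the builder names come from my reading of the example and are approximate, not authoritative SDK signatures; isCandidate and checkpointId are placeholders of mine):

```java
// Condensed sketch of the fetch flow in random-content-incremental, as I
// read it. Builder names (newCandidate, newCheckpoint, newResult, ...) are
// approximate; isCandidate() and checkpointId() are placeholders of mine.
public FetchResult fetch(FetchContext ctx) {
  if (isCandidate(ctx.getFetchInput())) {
    // Input is a previously emitted candidate: build and emit the document.
    emitDocument(ctx);                    // exception here -> logged, crawl proceeds
  } else {
    // Initial input: generate ids and emit them as transient candidates.
    for (String id : generateRandom()) {  // exception here -> restart from checkpoint
      ctx.newCandidate(id).withTransient(true).emit();
    }
    ctx.newCheckpoint(checkpointId()).emit();
  }
  return ctx.newResult();
}
```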
Thanks a lot
@roblucar
Member

Hi @matteogrolla, thank you for posting your questions.

  1. The crawlDB will manage the state of the job runs. In particular, the BlockId identifies a series of one or more jobs, and the lifetime of a BlockId spans from the start of a crawl to the crawl's completion. When a job starts and the previous job did not complete (failed or stopped), the previous job's BlockId is reused. The same BlockId will be reused until the crawl successfully completes. BlockIds are used to quickly identify items in the crawlDB which may not have been fully processed (completed). This addresses the restart. Unfortunately, there is currently no way to programmatically stop a crawl job the way a user or external process can initiate a stop through the Fusion UI or API.

  2. Yes, the document will be marked as failed and retried on the next crawl job.
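Schematically, the reuse rule is (an illustration of the behavior described above, not the actual crawlDB code):

```java
import java.util.UUID;

// Illustration of the BlockId reuse rule described above; this is a
// schematic of the stated behavior, not the crawlDB implementation.
record Job(String blockId, boolean completedSuccessfully) {}

class BlockIds {
  static String forNewJob(Job previous) {
    if (previous != null && !previous.completedSuccessfully()) {
      // The previous job failed or was stopped: reuse its BlockId so items
      // that may not have been fully processed can still be identified.
      return previous.blockId();
    }
    // The previous crawl completed (or this is the first run): fresh block.
    return UUID.randomUUID().toString();
  }
}
```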

@mwmitchell
Contributor

Hi @matteogrolla,

For this one:

if an exception is thrown inside generateRandom the framework restarts the crawl from previous checkpoint (or the beginning if it was the first)
How can I terminate the crawl marking it as failed?
I'd like that next time I restart the crawl it proceeds from last saved checkpoint

Are you saying that you'd like the job to stop immediately, due to the exception that was thrown?

Will this document be recrawled? When? Can we control this?

Do you have another way you'd like errors to behave?

@matteogrolla
Author

Hi @mwmitchell,
this is a closed-source framework of an established product, so I expected to find a paragraph in the documentation describing how to deal with the different kinds of exceptions, but the only example is a NullPointerException.
Anyway, since you ask, I'll try to approach the subject in general and then describe some practical scenarios that I have to deal with.

In the context of a batch job, errors can be partitioned into (see the sketch after this list):

  • Unretriable errors (will never work, don't bother retrying):
    when they arise, the failed operation should be logged and, if possible, the crawl should continue (otherwise it should stop).

  • Retriable errors (may work on the next attempt):
    when they arise, the failed operation should be retried a certain number of times (maybe indefinitely) and, if it still fails, logged.
    Then, if possible, the crawl should continue (otherwise it should stop).
    The attempts necessary to succeed can be many, and it may be useful to stop the crawl and restart it when it can proceed successfully (maybe Fusion needs maintenance and must be restarted).
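In plain Java, independent of the SDK, the policy I have in mind looks something like this (maxAttempts and backoffMillis are arbitrary knobs):

```java
import java.util.concurrent.Callable;

// Generic retry policy matching the taxonomy above: unretriable errors are
// rethrown immediately; retriable ones are retried with a growing pause,
// then rethrown so the caller can log them and decide to continue or stop.
class Retry {
  static class UnretriableException extends RuntimeException {
    UnretriableException(String message, Throwable cause) { super(message, cause); }
  }

  static <T> T withRetries(Callable<T> op, int maxAttempts, long backoffMillis)
      throws Exception {
    Exception last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return op.call();
      } catch (UnretriableException e) {
        throw e;                          // will never work: don't bother retrying
      } catch (Exception e) {
        last = e;                         // may work next time: pause and retry
        Thread.sleep(backoffMillis * attempt);
      }
    }
    throw last;                           // still failing after maxAttempts
  }
}
```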

Most errors will be thrown while communicating with the document source (a web service, a mail server, ...), but if I'm not wrong the connector framework is a distributed system, so even fetchContext's emits are not error-free, and I'd like to understand what happens when those errors arise.

Here are some practical scenarios that I have to deal with:

- scenario A: source system goes offline (retriable exception needing many retries)

-- scenario A1 (I've understood how to implement it):
connector: asks for ids of docs published on 2020-01-01
source: returns doc ids
connector: emits those ids into the fetchContext as transient candidates and checkpoints 2020-01-01
source: GOES OFFLINE
connector: keeps trying to fetch ids for 2020-01-02
tries to fetch doc bodies for the ids in the fetchContext
both requests fail
requests are retried endlessly

next morning
source: GOES ONLINE
connector:
bodies of doc ids for 2020-01-01 are fetched
doc ids for 2020-01-02 are fetched
the crawl proceeds
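The "retried endlessly" step, in plain Java (SourceClient is a hypothetical stand-in for the real source system's client; the one-minute pause is arbitrary):

```java
import java.util.List;

// Hypothetical stand-in for the real source system's client.
interface SourceClient {
  List<String> fetchIdsPublishedOn(String date) throws Exception;
}

class EndlessRetry {
  // Blocks until the source answers, retrying forever while it is offline.
  static List<String> fetchIdsBlocking(SourceClient source, String date)
      throws InterruptedException {
    while (true) {
      try {
        return source.fetchIdsPublishedOn(date);
      } catch (InterruptedException e) {
        throw e;              // let a stop request interrupt the loop
      } catch (Exception e) {
        Thread.sleep(60_000); // source offline: wait a minute, try again
      }
    }
  }
}
```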

QUESTION: what happens if the crawl is stopped while the source is offline? And maybe Fusion is restarted?
In randomContentIncremental, doc ids are emitted as TRANSIENT candidates, and I don't know what transient means in this context.

-- scenario A2 (a proposal, sketched in code below):
connector: asks for ids of docs published on 2020-01-01
source: returns doc ids
connector: emits those ids into the fetchContext as transient candidates and checkpoints 2020-01-01
source: GOES OFFLINE
connector: keeps trying to fetch ids for 2020-01-02
tries to fetch doc bodies for the ids in the fetchContext
both requests fail
crawl is STOPPED with (for example) fetchContext.stopCrawl()

next morning someone (or maybe a scheduler) restarts the crawl
source: GOES ONLINE
connector:
bodies of doc ids for 2020-01-01 are fetched
doc ids for 2020-01-02 are fetched
the crawl continues
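In code, the proposal is roughly this (fetchContext.stopCrawl() does not exist today, it's the call I'm proposing; the other fetchContext methods follow the example's API as I read it, and Retry.withRetries is the helper sketched earlier):

```java
// Sketch of proposal A2. stopCrawl() is the hypothetical API I'm asking
// for; the retry budget (5 attempts, 1 minute backoff) is just an example.
try {
  List<String> ids = Retry.withRetries(
      () -> source.fetchIdsPublishedOn(date), 5, 60_000);
  for (String id : ids) {
    fetchContext.newCandidate(id).withTransient(true).emit();
  }
  fetchContext.newCheckpoint(date).emit();
} catch (Exception e) {
  // Source still offline after the retry budget: mark the crawl failed so
  // that the next (manual or scheduled) start resumes from the checkpoint.
  fetchContext.stopCrawl();  // <-- proposed, not an existing SDK method
}
```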

- scenario B: wrong request to the source system (unretriable exception that should stop the crawl)
user: specifies a batch size that is too large
connector: asks for a large batch of doc ids
source: fails
connector: stops the crawl with fetchContext.stopCrawl()

- scenario C: doc is deleted between fetching its id and fetching its body (unretriable exception that lets the crawl proceed)
connector: fetches the id of doc1 from the source system
user: deletes doc1 from the source system
connector: tries to fetch the body of doc1
logs the error and proceeds (I'd like at least the number of errors to be visible in the UI at the end of the crawl, so the exception should reach the framework and not just be logged by custom code; see the sketch below)
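A sketch of what I'd like C to look like; emitError() and emitDocumentWithBody() are hypothetical, the point being that the failure reaches the framework's counters instead of only my log:

```java
// Scenario C sketch. DocNotFoundException stands for whatever the source
// client throws for a deleted doc; emitError() is a hypothetical way to
// report the failure to the framework so it is counted in the UI.
try {
  String body = source.fetchBody(docId);           // doc1 may have been deleted
  emitDocumentWithBody(fetchContext, docId, body); // helper, names approximate
} catch (DocNotFoundException e) {
  // Unretriable, but the crawl can proceed: report it, then move on.
  fetchContext.emitError(docId, e);                // <-- hypothetical API
}
```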

QUESTION: I don't understand the responsibility of fetchContext.newResult().
I believed it meant "we are done with this input, let's continue with the next",
but in randomContentIncremental this doesn't hold if the input triggers emitDocument (the else branch, line 61):
the input triggers emitDocument,
emitDocument may throw an exception,
fetchContext.newResult() is never reached,
but we continue anyway with the next input.
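If newResult() really means "done with this input", I'd expect the else branch to need something like this (a sketch; I may be misreading the contract):

```java
// Today (around line 61 of randomContentIncremental):
emitDocument(fetchContext);       // may throw
return fetchContext.newResult();  // skipped on exception, yet the crawl
                                  // still moves on to the next input

// What I'd have expected, if newResult() closes the input:
try {
  emitDocument(fetchContext);
} catch (Exception e) {
  log.warn("emitDocument failed", e);  // surface it, then still close
}
return fetchContext.newResult();
```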
