Scheduler lockfile for RunDeck task not being released #612
Everything seems to be OK with locking itself (well,
Thanks for the good work on looking into this. Let me address your feedback line by line:
Do you have more information about what the unhandled error could be? Something else worth looking into that could be related is something I have noticed happening sometimes. This is a sign that the logic for managing a certain set of tests or measurements is probably a bit broken. It may also be related to (5).
Yes you are right, it's documented inside of some comment if I recall correctly. If we want to be a bit smarter about this we could skip adding to the queue if it already has an element in it. There may be some twisted construct that abstracts this sort of logic already.
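To illustrate, a minimal sketch (hypothetical names, not the actual ooni-probe code) of skipping the enqueue when a run is already waiting, using Twisted's DeferredFilesystemLock:

```python
from twisted.internet import defer

class PeriodicTask(object):
    """Hypothetical periodic task that allows at most one queued run."""

    def __init__(self, name):
        self._lock = defer.DeferredFilesystemLock(name + '.lock')
        self._pending = False

    @defer.inlineCallbacks
    def tick(self):
        if self._pending:
            # A run is already waiting on the lock: skip instead of
            # piling up another waiter in the queue.
            return
        self._pending = True
        try:
            yield self._lock.deferUntilLocked()
            yield self.run()
        finally:
            self._pending = False
            if self._lock.locked:
                self._lock.unlock()

    def run(self):
        return defer.succeed(None)  # placeholder for the actual task body
```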
It's interesting that you see this exception. It's probably an issue with how that is being set up. Do you have a traceback of this exception?
What is the reason for censorship being detected? The web_connectivity test does produce some false positives; the situation is a bit better than with the older http_requests test, but there are certainly improvements that can be made to its accuracy. If you have some example reports of this, can you attach them here?
Yes, this seems like a bug, and in general I was hoping that it would end up being addressed as part of the refactor documented in https://github.com/TheTorProject/ooni-probe/issues/563. That ticket may actually contain some useful information for looking into the refactoring of the scheduler.
This is a very good point, though it's actually another deck, and we will be adding new decks in the future. I will think a bit about how we can support this use case while keeping it abstract enough to also work when we have more decks.
Yes, I have also noticed that sometimes the web UI will start blocking (HTTP requests stay pending and the UI becomes unresponsive). I fear this is due to some code being blocking and the twisted event loop getting stuck. Is this what you are talking about, or are you talking about the startup time?
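As a generic illustration of that failure mode (not ooni-probe's actual code): any blocking call on the reactor thread stalls the whole event loop, and the usual Twisted remedy is to push the work to a thread pool:

```python
import time

from twisted.internet import threads

def blocking_work():
    # Stand-in for any blocking call: running this directly on the
    # reactor thread would leave all pending HTTP requests hanging.
    time.sleep(30)
    return 'done'

def handle_request():
    # deferToThread runs the blocking work in the reactor's thread
    # pool and returns a Deferred, keeping the web UI responsive.
    return threads.deferToThread(blocking_work)
```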
One of the reliable ways to reproduce the issue was to throw an exception in the task generator.
If the exception is raised inside the task, the lock is never released. That's not the only cause though, as I can still reproduce the issue after the hotfix, just not reliably. I'm still investigating it, as it does not give clues as clean as tracebacks :)
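A minimal sketch of that failure mode and the try/finally-style hotfix; do_the_deck is a hypothetical stand-in for the real task body:

```python
from twisted.internet import defer

lock = defer.DeferredFilesystemLock('rundeck.lock')

def do_the_deck():
    # Hypothetical task body; any exception raised here used to
    # leave the lockfile held on disk forever.
    raise RuntimeError('boom')

@defer.inlineCallbacks
def run_task():
    yield lock.deferUntilLocked()
    try:
        yield defer.maybeDeferred(do_the_deck)
    finally:
        # Releasing in finally is the hotfix: the lock is dropped
        # even when the task errbacks.
        lock.unlock()
```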
The easy way to trigger that behaviour is to run a test against a TARPIT: twisted has a huge timeout before it triggers. It looks like a bug, but it's not a blocker; I'm forking it to a separate ticket to document it, though I'm unsure that we'll ever fix it, given the upcoming MeasurementKit integration.
The smartest thing I can imagine is
It looks like a miscommunication between ooni-wui & ooni-probe, which does not validate the submitted form.
IMHO, making decks "orthogonal", without any intersection between them, is the way to go. I would love to hear the reasoning for avoiding that in ooni-probe :)
OK, forking that to a separate ticket.
Yes this makes sense. I suggested a slightly better fix for this in the PR.
Damn, what steps are you following to reproduce this bug?
Yes that seems like a wise thing to do. I agree it's not a blocker and we probably want to fix it as part of another issue (https://github.com/TheTorProject/ooni-probe/issues/618).
Hmm, I'm unsure this is good behaviour. The problem is that in normal cases the queue of scheduled tasks will accumulate a very big backlog of tasks waiting on the lock (the normal deck takes 1 hour to run, and the scheduler will try to reschedule it and wait on the lock every 30 seconds). So I think there are actually two problems here:
Damn, you are right. Filed this issue about it: OpenObservatory/ooni-wui#9
Yes, this makes sense. It does make the configuration of decks slightly harder, but I think in general this is a good approach. Maybe what we can do is put the "risky" test (http-invalid-request-line) inside its own deck, and the web-connectivity one in its own as well.
If the lock has too many waiters, a stack overflow raises RuntimeError and the scheduler fails altogether.
The trivial fix makes sense, as the current SchedulerService ticks every 30 seconds and imprecise scheduling is acceptable.
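A sketch of what such a trivial fix could look like (an assumption about the shape of the code, not the merged patch): each tick tries a non-blocking acquire and simply skips when the lock is held, instead of parking another waiter:

```python
from twisted.internet import defer
from twisted.python import lockfile

def tick(run, lock_name='rundeck.lock'):
    lock = lockfile.FilesystemLock(lock_name)
    if not lock.lock():
        # Non-blocking acquire failed: a run is in progress, so skip
        # this tick; the scheduler retries in 30 seconds anyway.
        return defer.succeed(None)
    d = defer.maybeDeferred(run)

    def _release(result):
        # Release on both success and failure, passing the result on.
        lock.unlock()
        return result

    d.addBoth(_release)
    return d
```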
I decided to avoid suicide-on-deadlock: some tasks are deferred to other threads, and I'm unsure whether ooni-probe is robust enough to handle interruptions and damaged data, so I don't consider it wise to risk introducing such a questionable feature during the RC stage.
So it seems like the issues related to this are not entirely solved. When going through the process of updating from lepidopter 1.6.1 to the latest 2.0.0, I was able to reproduce the permanent locking of the scheduler again... Here is a relevant traceback:
When this happens, all the lockfiles are permanently locked, leading to no measurements being scheduled. This doesn't happen when I install it fresh on other systems, so I am not yet sure exactly why it's happening on this pi.
Shall we introduce a newer update recipe to fix https://github.com/TheTorProject/ooni-probe/issues/612#issuecomment-254012251? @hellais Have you tried reproducing this on other non-Pi, non-fresh systems (upgrade instead of fresh install)?
LOL. I realised that instead of commenting I edited your comment. Posting it here and restoring the original.
I am going to monitor how many lepidopters update today, and if it seems like they all went offline I will push a new update that sleeps a bit and then restarts the service.
No luck so far reproducing it on non-Raspberry Pi deployments. It could also have been a temporary issue.
Yep, AFAIR we discussed that the scheduler is still not exception-safe. I'm going to fix it.
What is the status of this? Is it still an issue? I have not seen this happening anymore in more recent releases.
Issue moved to ooni/probe-legacy #26 via ZenHub
It seems like there is an issue in the releasing of the lockfile for the scheduler of the RunDeck task. This was found while looking into the issue related to https://github.com/TheTorProject/ooni-probe/issues/606.
As a result of this, decks are only run once and are then not re-triggered daily, due to the lockfile still existing on disk.
The relevant code is: https://github.com/TheTorProject/ooni-probe/blob/master/ooni/agent/scheduler.py#L24.
In debugging this I noticed that the last run time is being updated properly (https://github.com/TheTorProject/ooni-probe/blob/master/ooni/agent/scheduler.py#L129), but for some reason the lockfile is never being released.
As a way of debugging this, I would suggest writing some unit tests that trigger a long-running task, queue a bunch of periodic tasks that acquire the lock on the filesystem, and then check whether the lockfile is released at the end.
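For example, a hedged sketch of such a test using twisted.trial and a fake clock; it assumes one DeferredFilesystemLock instance per waiter sharing a lockfile path, which may differ from the scheduler's actual internals:

```python
from twisted.internet import defer, task
from twisted.trial import unittest

class LockReleaseTests(unittest.TestCase):
    def test_waiter_acquires_after_long_task_releases(self):
        clock = task.Clock()
        name = self.mktemp()

        # The "long-running task" grabs the lockfile first.
        holder = defer.DeferredFilesystemLock(name, scheduler=clock)
        self.assertTrue(holder.lock())

        # A periodic task queues up on the same lockfile, the way
        # the scheduler does every 30 seconds.
        waiter = defer.DeferredFilesystemLock(name, scheduler=clock)
        d = waiter.deferUntilLocked()
        self.assertFalse(d.called)  # still blocked on the holder

        # The long task finishes; advance the fake clock so the
        # waiter retries the acquisition.
        holder.unlock()
        clock.advance(1)
        self.assertTrue(d.called)

        # The waiter must leave the lockfile released at the end.
        waiter.unlock()
        self.assertFalse(waiter.locked)
```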
@darkk can you look into this and see if you are able to see why this is happening?