Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

reduce load on SGE with qacct command #2

Open
wants to merge 11 commits into
base: 3.21.x
Choose a base branch
from

Conversation

ionox0
Copy link
Member

@ionox0 ionox0 commented Jul 1, 2020

Hi @nikhil

I opened this PR just as a suggestion

The DMP team had to make a few changes to Toil to get it to work on SGE. The main issue is similar to how we had to do some rate-limiting for LSF

Perhaps @huyu335 can describe these other changes, and we can merge them into a single release of toil-msk that works for both teams

Perhaps you can let me know if there is somewhere else you might suggest we merge this, if you think it's a good idea

Thank you!

@huyu335
Copy link

huyu335 commented Jul 1, 2020

Hi @ionox0,

Thanks for tagging me! Hope you are staying safe in this time period.

  1. When job IDs are used up in SGE, new jobs will be attached to the same job IDs. That means when we check the job status, old jobs will be picked up. So I added job name into the checking point for the completion of a job (considering both jobID and jobName, instead of just jobID) (5fae86c)
  2. related to point 1, hotfix on checking the job names, and a boolean switch if finding a job (196c756, 3d936cc)
  3. changed the sleep time from 1s to 15s, because 1s is too short for our system (2ea6c95, 6d388bb, 1a0ad90)
  4. when submitting to SGE, the mem value was used to be the value a user typed in multiply by the core number. After the change, the mem value a user type will be the total mem for this job, for example, if the input is "mem=48, core=4", the old mem will be "48*4", the new will be "48"(6e6eb8e)

Please let me know if I need to add more information.

Thanks,
Yu Hu

@ionox0
Copy link
Member Author

ionox0 commented Jul 7, 2020

Thank you!

I'll first have to make sure that these changes don't affect the lsfBatchSystem, because that class inherits from abstractGridEngineBatchSystem.

@nikhil I see you have the toil tests passing under travis. Is the issues/3097-3098-3099-3100-improve-LSF-batchsystem branch a better place for this PR to go?

Or maybe you can point me to the commit where you fixed the tests and I can cherry-pick it into this PR?

@nikhil
Copy link
Member

nikhil commented Jul 7, 2020

Hey @ionox0 and @huyu335,

Awesome work! This looks fine so far, though I would want a confirmation from @ionox0 that this does not affect the LSF batchsystem.

  • I think these changes would be really useful to the TOIL community and I would recommend that you make a PR to their master branch as well (https://github.com/dataBiosphere/toil). This repo is basically a staging area for features/fixes to be contributed back to the TOIL community.

  • The contributing guidelines are here under Pull Requests. You have already described the issues in your comment, you just need to itemize them into issue tickets and link it to your PR. As an example, here is one I recently added (Improve lsf batchsystem stability DataBiosphere/toil#3101)

  • @ionox0, regarding tests, I always had the travis tests fail for this repo but pass on base TOIL.

@ionox0
Copy link
Member Author

ionox0 commented Jul 8, 2020

Thanks for the info! Yes I agree, I'll submit the PR to the base Toil repo, much appreciated 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants