Large numbers of jobs cause slow loading and many error messages in Job status tab #376

Open
iamh2o opened this issue Nov 24, 2024 · 0 comments
Labels
bug Something isn't working

Comments

iamh2o commented Nov 24, 2024

Description

  • PCUI has been amazing. Thank you. My bug:
    • I am running a snakemake pipeline which has ~2000 tasks to complete, and I am allowing 200 jobs to be in queue at a time (limited by my max spot quota as well).
    • When I go to the jobs list in PCUI, it stalls a bit, error messages begin to appear, and eventually behind them the list of jobs appears.

Steps to reproduce the issue

  • Launch a lot of jobs, open the jobs tab in PCUI.
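
For anyone trying to reproduce this without a snakemake pipeline, the following is a minimal sketch that queues a batch of placeholder Slurm jobs. The job names, the `sleep` payload, and the helper functions are hypothetical and do not come from the report; only the count of 200 queued jobs is taken from the description. It defaults to a dry run that prints the commands instead of submitting them.

```python
import subprocess


def build_sbatch_commands(count: int, sleep_seconds: int = 600) -> list[list[str]]:
    """Build one sbatch argv per placeholder job; nothing is submitted here."""
    return [
        # Job name and sleep payload are illustrative assumptions.
        ["sbatch", f"--job-name=pcui-repro-{i}", f"--wrap=sleep {sleep_seconds}"]
        for i in range(count)
    ]


def submit(commands: list[list[str]], dry_run: bool = True) -> int:
    """Print each command (dry run) or pass it to sbatch; return how many were handled."""
    for argv in commands:
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)
    return len(commands)


if __name__ == "__main__":
    # 200 matches the number of jobs allowed in the queue in the report.
    submit(build_sbatch_commands(200))
```

Run it on the head node with `dry_run=False` to actually queue the jobs, then open the jobs tab in PCUI.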

Expected behaviour

Job status errors

  • To see the jobs list as it appears with fewer jobs in queue.

Actual behaviour

  • Open Jobs status
  • Spins a moment
  • Every 10s or so, an error appears in a red bar. There are a variety:
Error: Expecting property name enclosed in double quotes: line 1176 column 5 (char 23980)
Error: Expecting value: line 1177 column 1 (char 23980)
Error: Unterminated string starting at: line 1176 column 5 (char 23973)
Error: Expecting property name enclosed in double quotes: line 1176 column 4 (char 23980)
[Image: Screenshot 2024-11-24 at 2 01 04 PM]
  • The jobs list does eventually begin to appear.
  • Once the jobs list has appeared, it seems to load without error messages for a while.
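
The messages above match the wording of Python's `json` decoder, and every reported offset sits just under 24,000 characters. That pattern is consistent with a JSON job listing being cut off before it is parsed, although the report does not confirm this. AWS Systems Manager Run Command, for example, returns only the first 24,000 characters of a command's standard output; whether PCUI fetches the job list through such a path is an assumption here. The sketch below uses a made-up payload (the field names are hypothetical) to show how cutting the same JSON at slightly different offsets produces different decoder errors.

```python
import json


def parse_truncated(payload: str, limit: int) -> str:
    """Parse `payload` cut to `limit` characters; return the decoder error text or 'ok'."""
    try:
        json.loads(payload[:limit])
    except json.JSONDecodeError as err:
        return str(err)
    return "ok"


# Hypothetical stand-in for a scheduler's JSON job listing.
jobs = {
    "jobs": [
        {"job_id": i, "name": f"task_{i}", "job_state": "PENDING"}
        for i in range(2000)
    ]
}
payload = json.dumps(jobs, indent=2)

# The exact message depends on where the cut lands (mid-string, after a comma, ...).
for limit in (23973, 23980, 24000, len(payload)):
    print(limit, "->", parse_truncated(payload, limit))
```

If truncation is the cause, the error would be expected only while the listing exceeds the cap, which would fit the errors stopping once fewer jobs remain in the queue.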

Job ID Link Error

  • This is a bug that happens with every job, irrespective of the large number of jobs causing errors I report above. When clicking on a Job status ID from the ID column, I get the following error for every ID:
Error: not enough values to unpack (expected 2, got 1)
  • New, and only occurring while the large-number-of-jobs behavior described above is present: I also get this error:
Error: An error occurred while trying to complete your request. Please try again later. If the problem persists, please contact support for further assistance.
[Image: Screenshot 2024-11-24 at 2 18 38 PM]
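
The first message, `not enough values to unpack (expected 2, got 1)`, is the text of a Python `ValueError` raised when a two-name unpacking receives a single item, which typically happens when a `key=value` split meets a token with no `=`. The sketch below reproduces that message and shows a tolerant alternative. The function names and the sample line are hypothetical; how PCUI actually parses job details is not known from this report.

```python
def parse_key_values(text: str) -> dict[str, str]:
    """Strict parser: assumes every whitespace-separated token is key=value."""
    fields = {}
    for token in text.split():
        # Raises ValueError when the token contains no '='.
        key, value = token.split("=", 1)
        fields[key] = value
    return fields


def parse_key_values_safe(text: str) -> dict[str, str]:
    """Tolerant parser: skips tokens that do not contain '='."""
    fields = {}
    for token in text.split():
        key, sep, value = token.partition("=")
        if sep:  # only keep well-formed key=value tokens
            fields[key] = value
    return fields


# Illustrative line: the last token has no '=' and breaks the strict parser.
sample = "JobId=12 JobName=task_12 orphan"

try:
    parse_key_values(sample)
except ValueError as err:
    print("strict parser failed:", err)

print("tolerant parser:", parse_key_values_safe(sample))
```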

Required info

In order to help us determine the root cause of the issue, please provide the following information:

  • Region PCUI : us-west-2
  • AZ of cluster: us-west-2d
  • version PCUI: public.ecr.aws/pcm/parallelcluster-ui:2024.10.0 (is this it?)
  • version pcluster: 3.11.1

Additional info

The following information is not required but helpful:

  • I connect to PCUI from a Mac via Chrome.

If having problems with cluster creation or update

My cluster yaml:

---
Region: us-west-2  
Image:
  Os: ubuntu2204
HeadNode:
  InstanceType: r7i.2xlarge
  Networking:
    ElasticIp: true
    SubnetId: subnet-pub 
  DisableSimultaneousMultithreading: false
  Ssh:
    KeyName: KEY  # must be ed25519 for ubuntu
    AllowedIps: "0.0.0.0/0" # SET THIS TO YOUR DESIRED FILTER
  Dcv:
    Enabled: false
  LocalStorage:
    RootVolume:
      Size: 775
      VolumeType: gp3
      DeleteOnTermination: true
    EphemeralVolume:
      MountDir: /head_root
  CustomActions:
    OnNodeConfigured:
      Script: 
        s3://BUCKET/cluster_boot_config/post_install_ubuntu_combined.sh       # head and each compute can have different scripts if desired
      Args:
      - us-west-2
      - BUCKET
      - na
      - na
  Iam:
    S3Access:
    - BucketName: BUCKET
      EnableWriteAccess: false
    AdditionalIamPolicies:
    - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
    - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    EnableMemoryBasedScheduling: false
    ScaledownIdletime: 5
    Dns:
      DisableManagedDns: false
    QueueUpdateStrategy: DRAIN
  SlurmQueues:
  - Name: i8
    CapacityType: SPOT
    AllocationStrategy: lowest-price
    ComputeResources:
    - Name: r7gb64
      Instances:
      - InstanceType: r7i.2xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 1.2488 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: r6gb64
      Instances:
      - InstanceType: r6i.2xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 1.2462 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    Networking:
      SubnetIds:
      - subnet-012424a948f57e9ee
    CustomActions:
      OnNodeConfigured:
        Script: 
          s3://BUCKET2/cluster_boot_config/post_install_ubuntu_combined.sh
        Args:
        - us-west-2
        - BUCKET
        - na
        - na
    Iam:
      S3Access:
      - BucketName: BUCKET
        EnableWriteAccess: false
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  - Name: i128
    CapacityType: SPOT
    AllocationStrategy: lowest-price
    ComputeResources:
    - Name: c6gb256
      Instances:
      - InstanceType: c6i.metal
      - InstanceType: c6i.32xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 1.8034 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: m6gb512
      Instances:
      - InstanceType: m6i.32xlarge
      - InstanceType: m6i.metal
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.1581 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: r6gb1024r6
      Instances:
      - InstanceType: r6i.metal
      - InstanceType: r6i.32xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.0494 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    Networking:
      SubnetIds:
      - subnet-012424a948f57e9ee
    CustomActions:
      OnNodeConfigured:
        Script: 
          s3://BUCKET2/cluster_boot_config/post_install_ubuntu_combined.sh
        Args:
        - us-west-2
        - BUCKET
        - na
        - na
    Iam:
      S3Access:
      - BucketName: BUCKET
        EnableWriteAccess: false
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
  - Name: i192
    CapacityType: SPOT
    AllocationStrategy: lowest-price
    ComputeResources:
    - Name: c7gb384
      Instances:
      - InstanceType: c7i.48xlarge
      - InstanceType: c7i.metal-48xl
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.4093 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: m7gb768
      Instances:
      - InstanceType: m7i.metal-48xl
      - InstanceType: m7i.48xlarge
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.5016 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    - Name: r7gb1536
      Instances:
      - InstanceType: r7i.48xlarge
      - InstanceType: r7i.metal-48xl
      MinCount: 0
      MaxCount: 22
      SpotPrice: 2.209 # Calculated using (median spot price)+1.01.
      Networking:
        PlacementGroup:
          Enabled: false
      Efa:
        Enabled: false
    Networking:
      SubnetIds:
      - subnet-012424a948f57e9ee
    CustomActions:
      OnNodeConfigured:
        Script: 
          s3://BUCKET/cluster_boot_config/post_install_ubuntu_combined.sh
        Args:
        - us-west-2
        - BUCKET
        - na
        - na
    Iam:
      S3Access:
      - BucketName: BUCKET
        EnableWriteAccess: false
      AdditionalIamPolicies:
      - Policy: arn:aws:iam::acct:policy/pclusterTagsAndBudget
      - Policy: arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
Monitoring:
  DetailedMonitoring: false
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 3  # must be 0,1,3,5,7,14,30,60,90...
SharedStorage:  # This is the local FS which is expensive but fast, could be swapped for EFS, etc.
- MountDir: /fsx    # The cost of this will be roughly $22.93 per day, so it should not be kept hot unless in active use.
  Name: fsx-daylily-07123j     # WARNING, EDIT NAME WILL DEL EXISTING DATA
  StorageType: FsxLustre
  FsxLustreSettings:
    ImportPath: s3://BUCKET/data/
    StorageCapacity: 4800
    DeploymentType: SCRATCH_2
    AutoImportPolicy: NEW_CHANGED_DELETED
    DeletionPolicy: Retain    # Retain keeps the FSx file system after the cluster is deleted
Tags:  # TAGs necessary for per-user/project/job cost tracking 
- Key: aws-parallelcluster-username
  Value: daylily
- Key: aws-parallelcluster-jobid
  Value: NA
- Key: aws-parallelcluster-project
  Value: da-us-west-2d-daylily-07123j
- Key: aws-parallelcluster-clustername
  Value: daylily-07123j
- Key: aws-parallelcluster-enforce-budget
  Value: enforce
DevSettings:
  Timeouts:
    HeadNodeBootstrapTimeout: 3600
    ComputeNodeBootstrapTimeout: 3600
...


If having problems with custom image creation

n/a

@iamh2o iamh2o added the bug Something isn't working label Nov 24, 2024