New Machine requirement: TRSS replacement #3116

Closed
sxa opened this issue Jun 26, 2023 · 33 comments

Comments

@sxa
Member

sxa commented Jun 26, 2023

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): Linux
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): Probably x64
  • Provider (leave blank if it does not matter):
  • Desired usage: Replacement for the TRSS server
  • Any unusual specification/setup required: New system will be running under docker, so will need docker to be available.
  • How many of them are required: 1

Please explain what this machine is needed for: Replacement for the existing TRSS server, which runs on AWS. Because AWS is not a sponsored provider, that server should be decommissioned, as most of the other machines there already have been.

@sxa
Member Author

sxa commented Jun 26, 2023

@sxa
Member Author

sxa commented Aug 8, 2023

Created at 172.187.145.103 [CHANGED - SEE LATER COMMENT] - infrastructure team's keys added

@sxa
Member Author

sxa commented Aug 8, 2023

@llxia @smlambert Should the second link with the docker-compose docs work straight out of the box with the current aqa-test-tools repo? The package.json has references to docker compose instead of docker-compose, and if I fix that it gets a bit further and then hits this problem:

Removing intermediate container 125d62fd987c
 ---> dc4be6de7db7
Step 4/13 : COPY package.json package-lock.json .
When using COPY with more than one source file, the destination must be a directory and end with a /
ERROR: Service 'client' failed to build : Build failed
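
The error message states the rule directly: when COPY has more than one source, the destination must be a directory and end with a /. A minimal sketch of a local workaround, assuming the offending line is exactly the one shown in the log above; the helper name `fix_copy_dest` and the Dockerfile path in the example are illustrative, not something taken from the aqa-test-tools repo:

```shell
# Hypothetical helper: rewrite a multi-source COPY whose destination is a bare
# "." so that it ends in "./", which is what the build error asks for.
# Reads Dockerfile text on stdin and writes the patched text to stdout.
# Single-source lines such as "COPY . ." are left untouched.
fix_copy_dest() {
  sed -E 's#^(COPY[[:space:]]+[^[:space:]]+[[:space:]]+.*[[:space:]])\.[[:space:]]*$#\1./#'
}

# Example (the path is an assumption; adjust to wherever the client Dockerfile lives):
#   fix_copy_dest < Dockerfile > Dockerfile.new && mv Dockerfile.new Dockerfile
```

Editing the one line by hand is just as good; the function only spells out which lines the rule applies to.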

@sxa
Member Author

sxa commented Aug 9, 2023

Also, from taking a look at the video, is it storing the database in the directory where you run npm docker run, as opposed to in the docker container itself? (I need to know so that I can set up the file systems appropriately.)
The docs for docker-compose also say "Using Docker is a good way to test and development locally." Is it definitely also suitable and ready for production use this way?

@Haroon-Khel
Contributor

Haroon-Khel commented Aug 22, 2023

The new machine has 100g in /var/lib/docker. However, because of https://github.com/adoptium/aqa-test-tools/blob/6ffa683e929c20b6eb31eb2ecaf473d2e977544e/docker-compose.yml#L9C30-L9C30, mongo will store everything in a docker-host mounted volume in the top level aqa-test-tools directory, because that's where the docker-compose file is. Right now on the new machine I have the service running from /home/jenkins/aqa-test-tools, which means mongo is storing everything in a docker-host mounted volume in /home/jenkins/aqa-test-tools, i.e. not in /var/lib/docker where all the space is.

I think a solution is to add a disk and mount it to a separate directory, like /data/, and launch docker compose from there
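
For reference, compose resolves a relative host path in a volumes entry against the directory that contains docker-compose.yml, which is why the data lands next to the checkout. A small sketch of that rule; the helper name `resolve_bind_source` and the `./data/db` path in the example are illustrative assumptions, not copied from the repo's compose file:

```shell
# Hypothetical helper: show where the host side of a "host:container" volume
# entry ends up. Relative paths are taken relative to the compose file's
# directory; absolute paths are used as-is.
resolve_bind_source() {
  compose_dir=$1   # directory holding docker-compose.yml
  source_path=$2   # host side of the volume entry
  case $source_path in
    /*) printf '%s\n' "$source_path" ;;
    *)  printf '%s\n' "$compose_dir/${source_path#./}" ;;
  esac
}

# Example: check which file system the resolved directory lives on
#   df -h "$(resolve_bind_source /home/jenkins/aqa-test-tools ./data/db)"
```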

@Haroon-Khel Haroon-Khel self-assigned this Aug 22, 2023
@Haroon-Khel
Contributor

Haroon-Khel commented Aug 23, 2023

What I've done so far:

  • TRSS is running on http://172.187.145.103/
  • It is running as a non root user
  • port 80 is open on the machine via the azure console
  • The mongo data from the current machine has been imported into the new machine

What needs to be done:

  • The back end may not be 'pointing' at our adoptium jenkins. I copied the credentials file from the current machine into a trssConf.json file on the new machine, and the logs indicate that the credentials file is being used. At first the new TRSS instance did not look like it was getting any new test data from the adoptium jenkins server, but I now think it may be pointing at the adoptium jenkins instance after all: some of the builds have a 23/08/2023 timestamp, and I kicked off the service yesterday (22/08).
  • Certificate files need to be copied from the current machine
  • The nginx conf files on the new machine are the bare minimum. I suggest the conf files from the current machine are copied over after the certificates get copied over

@sxa
Member Author

sxa commented Aug 24, 2023

> I think a solution is to add a disk and mount it to a separate directory, like /data/, and launch docker compose from there

In that case I recommend that we redo the disk that has /var/lib/docker on it, creating a smaller file system for that and a larger one that covers the data. I'd probably prefer the data file system to be mounted on /home or /home/jenkins or another path underneath those: it seems simpler not to have a completely new separate directory off / for this, and it's always good to have /home separate to avoid it filling up the root file system.

@Haroon-Khel
Contributor

Haroon-Khel commented Aug 24, 2023

I'll add the disk to /home/jenkins/trss/ and launch the service from there.

Regarding the import of the mongo data, I used exportMongo.sh on the existing machine. However, since mongo now runs in a docker container, the restore command (mentioned in that script) needs to be run in the container, and the archive needs to be copied into the running container.
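
Roughly, the copy-then-restore sequence looks like the sketch below. This is an assumption-laden outline rather than the exact commands that were run: it assumes the archive was produced with `mongodump --archive --gzip` (drop `--gzip` otherwise), and the function name, container name and `/tmp/trss-dump.gz` path are all placeholders.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical helper: copy a mongodump archive into the running mongo
# container, then run mongorestore inside that container.
restore_into_container() {
  container=$1   # name or ID of the mongo container
  archive=$2     # archive file on the docker host
  run docker cp "$archive" "$container:/tmp/trss-dump.gz" &&
  run docker exec "$container" mongorestore --gzip --archive=/tmp/trss-dump.gz
}

# Example: preview the commands without touching anything
#   DRY_RUN=1; restore_into_container <container> /path/to/archive
```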

See https://adoptium.slack.com/archives/C5219G28G/p1692865247390769?thread_ts=1692109686.752019&cid=C5219G28G: at the moment the performance of the machine when it comes to loading data is extremely slow. I'll continue to monitor the performance after I've added the new disk.

@Haroon-Khel
Contributor

Update

The machine's ip has changed to 20.90.182.165. The TRSS service is running on port 80 on the machine. The data disk is 128g with 30g for /var/lib/docker, 60g for /home/jenkins/trss (which is where mongo will be storing its data) and 4g swap.

More can be added if need be. At the moment mongo (and the other trss services) are taking up 4.9G (that's including the import of the data from the current trss machine).

@sxa
Member Author

sxa commented Sep 19, 2023

MongoDB was chewing up two full cores of CPU continuously which the original server was not doing. It appears ok today though - has a configuration change occurred to resolve that problem @Haroon-Khel ?

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 19, 2023

I checked too. It still eats up the CPU when you go to http://20.90.182.165/. By this I mean mongod uses 0.7% when idle and goes up to 200% when the service is accessed. Funnily enough, I did the same thing for the current trss service and found it behaved the same way, i.e. on the current trss machine mongod eats up to 400% of the CPU, but only when accessed, and stays at around 0.3% when idle.

 PID USER      PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
9630 mongodb   20   0 10.439g 8.002g  6932 S 355.1 53.0 139390:51 mongod

Managed to capture top output after refreshing https://trss.adoptium.net/. So it's not specific to the new trss instance.
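
To compare the two machines without watching an interactive session, the %CPU figure can be pulled out of batch-mode top output. A minimal sketch, assuming the default procps column layout shown in the capture above (where %CPU is the ninth column); `cpu_of` is a made-up helper name:

```shell
# Hypothetical helper: print the %CPU column for every process whose COMMAND
# column matches the given name. Expects top-style rows on stdin.
cpu_of() {
  awk -v cmd="$1" '$NF == cmd { print $9 }'
}

# Example: take one non-interactive sample of the running system
#   top -b -n 1 | cpu_of mongod
```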

@Haroon-Khel
Contributor

Using htop, I'm seeing mongodb hit up to 500% CPU usage on the current trss machine.

@sxa
Member Author

sxa commented Sep 19, 2023

Interesting - I'm no longer seeing it getting stuck at high CPU load although it does sit at 200% when loading the initial TRSS homepage - I wonder if we could optimise some of the queries there or improve the indexing?

@smlambert Can you see if this new TRSS instance is working adequately for the JDK21 triage? Would be good to give it a bit of a workout before switching it to be the primary one.

@smlambert
Contributor

smlambert commented Sep 19, 2023

Something is not quite right with it: I am struggling to add new pipelines to monitor. When I try to add https://ci.adoptium.net/job/build-scripts/job/release-openjdk21-pipeline/, the service seems to 'go away' (it would be good to see the logs to see what is going on). I can give details when I am back from 3 day PTO and can share a screen. Let's keep the old one up and running.

@sxa sxa pinned this issue Sep 20, 2023
@sxa sxa added this to the 2023-09 (September) milestone Sep 22, 2023
@Haroon-Khel Haroon-Khel moved this to In Progress in Adoptium 4Q 2023 Plan Sep 26, 2023
@sxa
Member Author

sxa commented Sep 28, 2023

Haroon has indicated that he's been able to replicate the problem that Shelley mentioned in the previous comment.

Given that we are having issues here with the new instance and are unable to switch over the production server we should perhaps consider creating a direct replica of the existing server for now and see if that behaves as expected and look at fixing underlying issues with the new deployment asynchronously to avoid delaying the switchover.

How long would it take to see if that works?

@Haroon-Khel
Contributor

Yes I have seen the data just disappear on occasion. Annoyingly there is no log file in the normal mongo log file location

root@5a6e8cab57d6:/# ls -la /var/log/mongodb/
total 12
drwxr-xr-x 1 mongodb mongodb 4096 Sep  2  2022 .
drwxr-xr-x 1 root    root    4096 Sep  2  2022 ..

@Haroon-Khel
Contributor

> Given that we are having issues here with the new instance and are unable to switch over the production server we should perhaps consider creating a direct replica of the existing server for now and see if that behaves as expected and look at fixing underlying issues with the new deployment asynchronously to avoid delaying the switchover.

I could increase the new trss server's cpu and ram to make it similar to the current server and then observe any differences. That would be quicker than creating a whole replica

@sxa
Member Author

sxa commented Sep 28, 2023

> I could increase the new trss server's cpu and ram to make it similar to the current server and then observe any differences. That would be quicker than creating a whole replica

I'm sceptical as to whether that will make any difference but if you can quickly test it then feel free.

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 28, 2023

I've resized the machine to 8 CPUs and 16g of RAM. In htop I'm seeing CPU usage of up to 500% when the trss service is accessed and 0.7% when idle, so similar to the current trss machine. Data loads a lot faster than before.

@smlambert Could you try adding another pipeline again, to see if we hit the same error?

@sxa
Member Author

sxa commented Sep 29, 2023

> Yes I have seen the data just disappear on occasion. Annoyingly there is no log file in the normal mongo log file location
>
> root@5a6e8cab57d6:/# ls -la /var/log/mongodb/
> total 12
> drwxr-xr-x 1 mongodb mongodb 4096 Sep  2  2022 .
> drwxr-xr-x 1 root    root    4096 Sep  2  2022 ..

Can we see if we can enable the logging? https://betterstack.com/community/questions/how-to-log-all-or-slow-mongodb-queries/ shows how to enable query logging, which might give us an idea of where the slowness is occurring. It also covers tracking slow-to-execute queries specifically, which might be useful. https://www.mongodb.com/docs/manual/reference/log-messages/#logging-slow-operations is the official documentation on logging slow queries.
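
As a rough illustration of the slow-operation logging those docs describe: `db.setProfilingLevel(0, { slowms: N })` leaves the profiler off but changes the threshold above which operations are written to the mongod log. The sketch below wraps that in docker exec. The function name, container name, database name and the 50 ms value in the example are placeholders, and whether the image ships `mongosh` or the legacy `mongo` shell is an assumption to check:

```shell
# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical helper: lower the slow-operation threshold inside the mongo
# container so that slower queries appear in the mongod log.
set_slow_threshold() {
  container=$1              # mongo container name or ID
  database=$2               # database to connect to
  slowms=$3                 # threshold in milliseconds
  shell_bin=${4:-mongosh}   # assumption: pass "mongo" on older images
  run docker exec "$container" "$shell_bin" "$database" --quiet \
    --eval "db.setProfilingLevel(0, { slowms: $slowms })"
}

# Example: preview the command first
#   DRY_RUN=1; set_slow_threshold <container> <database> 50
```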

@Haroon-Khel
Contributor

Looks like logging is already enabled in /etc/mongod.conf:

# where to write logging data.
systemLog:
  destination: file
  logAppend: true
  path: /var/log/mongodb/mongod.log

@Haroon-Khel
Contributor

Haroon-Khel commented Sep 29, 2023

Running db.adminCommand( { getLog: "global" } ) in mongo shows me a somewhat readable log

@Haroon-Khel
Contributor

@smlambert The changes in https://github.com/adoptium/aqa-test-tools/pull/821/files have been applied to the new trss machine. Could you try again to add a new pipeline?

@smlambert
Contributor

I am now able to add new pipelines without issue, thanks @Haroon-Khel !

@sxa
Member Author

sxa commented Oct 11, 2023

Let's see how it goes this week. From the comments being made it seems to be working well now (other than a problem when I configured the nginx frontend to rate limit more than it needed). We can then look at switching this to be the primary server next week. @Haroon-Khel Have you looked at copying the SSL certificate from the old machine across as a test to make sure we can enable it with that? (It will show as an invalid certificate when you connect, but worth making sure the setup of nginx is correct.)

@sxa
Member Author

sxa commented Oct 11, 2023

@Haroon-Khel It looks like the light rate limiting may have stopped the mongodb becoming overwhelmed with requests as it's not jumping up to 500% CPU in the way it was previously. Can you confirm this (i.e. it's not just me not looking at it properly!)? If that's the case we might be able to drop the server back down to 2 CPUs.
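
For anyone unfamiliar with nginx rate limiting, it generally takes the shape below: a `limit_req_zone` in the http context plus a `limit_req` where it should apply. This is only a sketch of the mechanism; the zone name, rate and burst values are placeholders and not the settings in use on the TRSS server, and `rate_limit_snippet` is a made-up helper that just prints the fragment.

```shell
# Hypothetical helper: print an example per-client rate limit for nginx.
# The quoted heredoc delimiter stops the shell expanding $binary_remote_addr.
rate_limit_snippet() {
  cat <<'EOF'
# http {} context: track clients by address, allow 5 requests/second each
limit_req_zone $binary_remote_addr zone=trss:10m rate=5r/s;

# server {} or location {} context: queue short bursts, reject the rest
limit_req zone=trss burst=10;
EOF
}

# Example: review the fragment before copying it into the nginx config by hand
#   rate_limit_snippet
```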

@Haroon-Khel
Contributor

> It looks like the light rate limiting may have stopped the mongodb becoming overwhelmed with requests as it's not jumping up to 500% CPU in the way it was previously.

Can confirm. At most it is hitting 100% cpu. If it stays like this we could certainly drop the cpus down

@Haroon-Khel
Contributor

Haroon-Khel commented Oct 11, 2023

> Have you looked at copying the SSL certificate from the old machine across as a test to make sure we can enable it with that (It will show as an invalid certificate when you connect, but worth making sure the setup of nginx is correct)

I did, but it is a bit more complicated than that. The certs on the existing trss server were set up using certbot, so I need to find out how to mimic that setup on the new server.

I did try copying the cert files (that the nginx config on the current machine points to) over to the new machine, along with the current nginx config, but I did not get anywhere. I was at least expecting http://20.90.182.165/ to have a faulty certs proceed with caution barrier but instead it just gave me nginx errors.

@sxa
Member Author

sxa commented Oct 11, 2023

> I was at least expecting http://20.90.182.165/ to have a faulty certs proceed with caution barrier but instead it just gave me nginx errors

Hmmm http would never give a faulty cert option (unless it redirected!) - you'd need to go to the HTTPS port for it to present the certificate. Do you know what the errors were? Copying the files across shouldn't have caused anything related to the HTTP connections to break.
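
Something along these lines would exercise the HTTPS side directly. It is a sketch, not a record of what was run: `check_https` is a made-up name, `nginx -t` normally needs root, and the address and hostname in the example are simply the ones mentioned in this thread.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical checks: validate the nginx configuration, then connect to the
# HTTPS port and show the certificate it presents. -servername sends the SNI
# name so nginx selects the matching server block.
check_https() {
  host=$1   # address to connect to
  name=$2   # hostname expected in the certificate
  run nginx -t &&
  run openssl s_client -connect "$host:443" -servername "$name" </dev/null
}

# Example:
#   check_https 20.90.182.165 trss.adoptium.net
```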

If it's not an obvious reason I suggest we take a look at this and the switch of CPUs tomorrow morning as I doubt too many people will be using the server until the afternoon.

@sxa sxa unpinned this issue Oct 12, 2023
@sxa
Member Author

sxa commented Nov 10, 2023

DNS entry has been updated so should propagate everywhere soon: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/3904

@sxa
Member Author

sxa commented Dec 18, 2023

LetsEncrypt certbot is now set up and will auto-renew via a systemd timer
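
A quick way to confirm the renewal setup is sketched below. The timer's unit name depends on how certbot was installed, so the `*certbot*` pattern is a guess, and `check_cert_renewal` is a made-up helper name; `certbot renew --dry-run` performs a test renewal without replacing the live certificate.

```shell
# Print the command instead of running it when DRY_RUN is set.
run() {
  if [ -n "${DRY_RUN:-}" ]; then printf '%s\n' "$*"; else "$@"; fi
}

# Hypothetical helper: list the systemd timer(s) that drive renewal, then do
# a test renewal.
check_cert_renewal() {
  run systemctl list-timers '*certbot*' &&
  run certbot renew --dry-run
}

# Example (run as root on the TRSS machine):
#   check_cert_renewal
```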

@sxa
Member Author

sxa commented Jan 3, 2024

Old machine in AWS has now been decommissioned so will not be incurring future charges - closing.

@sxa sxa closed this as completed Jan 3, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Adoptium 4Q 2023 Plan Jan 3, 2024