New Machine requirement: TRSS replacement #3116
We have an existing playbook for setting up a TRSS server, but the current recommended way to set up TRSS is using docker compose. Some references:
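For reference, a rough sketch of the compose-based setup. This is hedged: the exact location of the compose file and the service names inside aqa-test-tools may differ from what is assumed here.

```sh
# Hedged sketch: stand up TRSS with docker compose.
# Assumes a docker-compose.yml at the root of the aqa-test-tools checkout;
# adjust the path if the repo keeps it elsewhere.
git clone https://github.com/adoptium/aqa-test-tools.git
cd aqa-test-tools
docker compose up -d    # build and start the TRSS and mongo containers
docker compose ps       # confirm the containers are running
```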
@llxia @smlambert Should the second link with the docker-compose docs work straight out of the box with the current
Also from taking a look at the video, is it storing the database in the directory where you run
The new machine has 100g in . I think a solution is to add a disk and mount it to a separate directory, like . A rough sketch of adding and mounting a disk follows the list below.
What I've done so far:
What needs to be done:
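A rough sketch of the disk setup described above. The device name is an assumption; the mount point matches the one mentioned later in this thread, but check `lsblk` on the actual machine before running anything.

```sh
# Hedged sketch: format a newly attached disk and mount it for the mongo data.
# /dev/sdc is an assumed device name - identify the real one with lsblk first.
lsblk                                    # find the new, unpartitioned disk
sudo mkfs.ext4 /dev/sdc                  # create a filesystem on it
sudo mkdir -p /home/jenkins/trss
sudo mount /dev/sdc /home/jenkins/trss   # mount it where mongo will keep its data
echo '/dev/sdc /home/jenkins/trss ext4 defaults 0 2' | sudo tee -a /etc/fstab   # persist across reboots
```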
In that case I recommend that we redo the disk with
I'll add the disk to . Regarding the import of the mongo data, I used exportMongo.sh on the existing machine. However, since mongo now runs in a docker container, the restore command (mentioned in that script) needs to be run inside the container, and the archive needs to be copied into the running container. https://adoptium.slack.com/archives/C5219G28G/p1692865247390769?thread_ts=1692109686.752019&cid=C5219G28G. At the moment the performance of the machine when it comes to loading data is extremely slow; I'll continue to monitor the performance after I've added the new disk.
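For the restore step, something along these lines should work. This is a sketch, assuming the container is named `mongo` and the dump was produced with `mongodump --archive --gzip`; both are assumptions, so check `docker ps` and the export script for the real names and options.

```sh
# Hedged sketch: copy the archive into the running mongo container and restore it.
# "mongo" is an assumed container name and trss.archive an assumed dump filename.
docker cp trss.archive mongo:/tmp/trss.archive
docker exec mongo mongorestore --archive=/tmp/trss.archive --gzip
```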
Update: The machine's IP has changed to 20.90.182.165. The TRSS service is running on port 80 on the machine. The data disk is 128g, with 30g for /var/lib/docker, 60g for /home/jenkins/trss (which is where mongo will be storing its data) and 4g swap. More can be added if need be; at the moment mongo (and the other TRSS services) are taking up 4.9G (that's including the import of the data from the current TRSS machine).
MongoDB was chewing up two full cores of CPU continuously, which the original server was not doing. It appears ok today though - has a configuration change occurred to resolve that problem @Haroon-Khel?
I checked too. It still eats up the CPU when you go to http://20.90.182.165/. By this I mean
Managed to capture
Using
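For anyone else wanting to watch this, one simple way to see per-container CPU on the host (not necessarily how the figures above were captured) is:

```sh
# Snapshot of CPU/memory usage per container; drop --no-stream for a live view.
docker stats --no-stream
```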
Interesting - I'm no longer seeing it getting stuck at high CPU load although it does sit at 200% when loading the initial TRSS homepage - I wonder if we could optimise some of the queries there or improve the indexing? @smlambert Can you see if this new TRSS instance is working adequately for the JDK21 triage? Would be good to give it a bit of a workout before switching it to be the primary one. |
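If indexing does turn out to be the issue, adding an index is a one-liner from the host. The database, collection and field names below are purely illustrative placeholders, not the real TRSS schema.

```sh
# Hedged sketch: create an index inside the running mongo container.
# "mongo", "exampleDb", "testResults" and "buildName" are placeholders;
# older mongo images ship the `mongo` shell instead of `mongosh`.
docker exec mongo mongosh exampleDb --eval \
  'db.testResults.createIndex({ buildName: 1 })'
```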
Something is not quite right with it, as I am struggling to add new pipelines to monitor: when trying to add https://ci.adoptium.net/job/build-scripts/job/release-openjdk21-pipeline/ the service seems to 'go away' (it would be good to see the logs to see what is going on). I can give details when I am back from 3 days of PTO and can share a screen. Let's keep the old one up and running.
Haroon has indicated that he's been able to replicate the problem that Shelley mentioned in the previous comment. Given that we are having issues with the new instance and are unable to switch over the production server, we should perhaps consider creating a direct replica of the existing server for now, see if that behaves as expected, and look at fixing the underlying issues with the new deployment asynchronously to avoid delaying the switchover. How long would it take to see if that works?
Yes, I have seen the data just disappear on occasion. Annoyingly, there is no log file in the normal mongo log file location
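Since mongo runs in a container, its output goes to the container's stdout rather than the usual log file on disk, so something like this should show it (container name assumed):

```sh
# View the mongod output for the containerised instance.
docker logs --tail 200 -f mongo    # or: docker compose logs -f mongo
```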
I could increase the new TRSS server's CPU and RAM to make it similar to the current server and then observe any differences. That would be quicker than creating a whole replica.
I'm sceptical as to whether that will make any difference but if you can quickly test it then feel free. |
I've resized the machine to 8 CPUs and 16g RAM. On . @smlambert Could you try adding another pipeline again and see if we hit the same error?
Can we see if we can enable the logging? https://betterstack.com/community/questions/how-to-log-all-or-slow-mongodb-queries/ shows how to enable query logging, which might give us an idea of where the slowness is occurring. There's also material in there on specifically tracking slow-to-execute queries, which might be useful. https://www.mongodb.com/docs/manual/reference/log-messages/#logging-slow-operations is the official documentation on logging slow queries.
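A hedged sketch of turning on slow-operation profiling from the host, based on the docs linked above; the container and database names are assumptions.

```sh
# Log any operation slower than 100ms; profiling level 1 = slow ops only.
# Use the `mongo` shell instead of `mongosh` on older images.
docker exec mongo mongosh exampleDb --eval \
  'db.setProfilingLevel(1, { slowms: 100 })'
# Show the most recent profiled (slow) operations.
docker exec mongo mongosh exampleDb --eval \
  'db.system.profile.find().sort({ ts: -1 }).limit(5)'
```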
Looks like logging is already enabled in
Running
@smlambert The changes in https://github.com/adoptium/aqa-test-tools/pull/821/files have been applied to the new TRSS machine. Could you try again to add a new pipeline?
I am now able to add new pipelines without issue, thanks @Haroon-Khel ! |
Let's see how it goes this week - from the comments being made it seems to be working well now (other than a problem when I configured the nginx frontend to rate limit more than it needed to). We can then look at switching this to be the primary server next week. @Haroon-Khel Have you looked at copying the SSL certificate from the old machine across as a test to make sure we can enable it with that? (It will show as an invalid certificate when you connect, but it's worth making sure the nginx setup is correct.)
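For the certificate test, a rough sketch of the idea, assuming the usual letsencrypt directory layout on the old machine and that the copied nginx config already points at those paths; the hostname and paths are assumptions.

```sh
# Hedged sketch: copy the existing certs/keys to the new machine, then check nginx.
# "newhost" is a placeholder for the new TRSS machine.
scp -r /etc/letsencrypt/live/ /etc/letsencrypt/archive/ newhost:/etc/letsencrypt/
# On the new machine:
sudo nginx -t                   # confirm the copied config parses and can find the cert files
sudo systemctl reload nginx
```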
@Haroon-Khel It looks like the light rate limiting may have stopped mongodb from becoming overwhelmed with requests, as it's not jumping up to 500% CPU in the way it was previously. Can you confirm this (i.e. that it's not just me not looking at it properly)? If that's the case we might be able to drop the server back down to 2 CPUs.
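For the record, the kind of light rate limiting referred to above can be expressed with a couple of nginx directives; the zone name, rate and burst below are illustrative, not the values actually configured on the machine.

```sh
# Hedged sketch: a basic per-IP request limit for the TRSS frontend.
sudo tee /etc/nginx/conf.d/trss-ratelimit.conf >/dev/null <<'EOF'
limit_req_zone $binary_remote_addr zone=trss:10m rate=10r/s;
EOF
# Then, inside the relevant `location` block: limit_req zone=trss burst=20;
sudo nginx -t && sudo systemctl reload nginx
```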
Can confirm. At most it is hitting 100% CPU. If it stays like this we could certainly drop the CPUs down.
I did, but it is a bit more complicated than that. The certs on the existing TRSS server were set up using certbot, so I need to find out how to mimic that setup on the new server. I did try copying the cert files (that the nginx config on the current machine points to) over to the new machine, along with the current nginx config, but I did not get anywhere. I was at least expecting http://20.90.182.165/ to have a
Hmmm, HTTP would never give a faulty cert warning (unless it redirected!) - you'd need to go to the HTTPS port for it to present the certificate. Do you know what the errors were? Copying the files across shouldn't have caused anything related to the HTTP connections to break. If there's no obvious reason I suggest we take a look at this, and the switch of CPUs, tomorrow morning, as I doubt too many people will be using the server until the afternoon.
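A quick way to check which certificate the HTTPS port is actually presenting; the hostname passed via SNI is an assumption about what the TRSS DNS name will be.

```sh
# Show the subject and validity dates of whatever cert nginx serves on port 443.
openssl s_client -connect 20.90.182.165:443 -servername trss.adoptium.net </dev/null \
  | openssl x509 -noout -subject -dates
```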
DNS entry has been updated so it should propagate everywhere soon: https://gitlab.eclipse.org/eclipsefdn/helpdesk/-/issues/3904
LetsEncrypt certbot is now set up and will auto-renew via a systemd timer |
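For anyone checking the renewal setup later, a couple of commands to confirm the timer is scheduled and exercise the renewal path without changing anything:

```sh
systemctl list-timers certbot.timer    # confirm the renewal timer is scheduled
sudo certbot renew --dry-run           # test renewal without touching the live certs
```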
Old machine in AWS has now been decommissioned so will not be incurring future charges - closing. |
I need to request a new machine:
Please explain what this machine is needed for: Replacement for the existing TRSS server, which is running on AWS and should therefore be decommissioned, as most others there have been, since AWS is not a sponsored provider.