
New Machine requirement: Solaris/x64 systems (Equinix replacement) #3347

Closed
Tracked by #3292
sxa opened this issue Jan 16, 2024 · 10 comments

Comments

@sxa (Member) commented Jan 16, 2024

I need to request a new machine:

  • New machine operating system (e.g. linux/windows/macos/solaris/aix): Solaris
  • New machine architecture (e.g. x64/aarch32/arm32/ppc64/ppc64le/sparc): x64
  • Provider (leave blank if it does not matter):
  • Desired usage: Build, test and TC
  • Any unusual specification/setup required: Can be separate servers or hosted on a hypervisor. Existing ones are on an ESXi server
  • How many of them are required: 4 (minimum) - one build, 2 test, 1 TC

Please explain what this machine is needed for: Replacement for the Equinix-hosted Solaris/x64 systems (tracked by #3292).

@sxa (Member Author) commented Feb 21, 2024

Noting that Broadcom recently changed the licensing for ESXi, so it is likely that it will not be possible to utilise it for the replacement.

@sxa (Member Author) commented Feb 28, 2024

@steelhead31 @Haroon-Khel Have either of you used Solaris VMs with the libvirt/kvm provider in vagrant instead of virtualbox?

@Haroon-Khel (Contributor):

I have not

@steelhead31 (Contributor):

Nor have I, though there are some libvirt vagrant boxes available on vagrantup.
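
For context, switching the provider in the Vagrantfile from VirtualBox to libvirt/KVM would look roughly like the sketch below. This is only an illustration: the box name and resource sizes are assumptions, and the vagrant-libvirt plugin needs to be installed first with vagrant plugin install vagrant-libvirt.

Vagrant.configure("2") do |config|
  # Assumed local box name, i.e. whatever was given to "vagrant box add"
  config.vm.box = "solaris10"
  config.vm.provider :libvirt do |lv|
    # Placeholder resource sizes
    lv.memory = 8192
    lv.cpus = 4
  end
end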

@sxa (Member Author) commented Mar 1, 2024

Ubuntu EFI secure boot warning with Azure trusted VMs

│ UEFI Secure Boot requires additional configuration to work with third-party drivers.

│ The system will assist you in configuring UEFI Secure Boot. To permit the use of
│ third-party drivers, a new Machine-Owner Key (MOK) has been generated. This key now
│ needs to be enrolled in your system's firmware.

│ To ensure that this change is being made by you as an authorized user, and not by an
│ attacker, you must choose a password now and then confirm the change after reboot using
│ the same password, in both the "Enroll MOK" and "Change Secure Boot state" menus that
│ will be presented to you when this system reboots.
│ If you proceed but do not confirm the password upon reboot, Ubuntu will still be able
│ to boot on your system but any hardware that requires third-party drivers to work
│ correctly may not be usable.

If you try to bring up a VM without additional work, then you'll get this error:

Error while connecting to Libvirt: Error making a connection to libvirt URI qemu:///system: Call to virConnectOpen failed: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory
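
That particular message usually just means the libvirt daemon is not installed or running on the host; a likely remedy (an assumption, not verified on these machines) is:

# Install and start libvirtd so /var/run/libvirt/libvirt-sock exists
sudo apt-get install -y libvirt-daemon-system
sudo systemctl enable --now libvirtd
# Allow the unprivileged user to talk to it (log out/in afterwards)
sudo usermod -aG libvirt "$USER"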

In theory this can be mitigated with:

/usr/src/linux-headers-6.5.0-1015-azure/scripts/sign-file sha256 /var/lib/shim-signed/mok/MOK.der /var/lib/shim-signed/mok/MOK.priv /var/lib/dkms/virtualbox/6.1.50/6.5.0-1015-azure/x86_64/module/vboxdrv.ko

but I haven't got that to work yet, probably because the MOK password hasn't been entered on startup (you're prompted to set up MOK during the install of virtualbox).
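
If the problem really is the un-enrolled MOK, the usual route (a sketch, not something verified here) is to re-import the key and then confirm the password in the blue MokManager screen on the next boot:

# Re-request enrolment of the existing machine-owner key; prompts for a one-time password
sudo mokutil --import /var/lib/shim-signed/mok/MOK.der
# On the next reboot choose "Enroll MOK" and enter the same password
sudo reboot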

Using a "standard" VM of a D4 specification (which supports nested virtualisation) allows vagrant to work successfully without a reboot loop. Note that a d3as-V4 or B4ls_V2 will not work and gives the message Stderr: VBoxManage: error: AMD-V is not available (VERR_SVM_NO_SVM) when attempting to start the VM from Vagrant. Standard D16ds v4 (16 vcpus, 64 GiB memory) works ok.
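
A quick generic check for whether a given size actually exposes the virtualisation extensions (VT-x/AMD-V) to the guest OS; if this prints 0, VirtualBox will fail as above:

# Count CPUs advertising the vmx (Intel) or svm (AMD) flags
egrep -c '(vmx|svm)' /proc/cpuinfo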

To connect, note that the default ssh configuration on the Ubuntu client will not work, so you need to use:

  • ssh [email protected] -p 2222 -o HostKeyAlgorithms=ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519 -o PubKeyAcceptedKeyTypes=ssh-rsa -i .vagrant/machines/adoptopenjdkSol10/virtualbox/private_key (run from the directory containing the Vagrantfile, since the key path is relative to it; we should probably see whether some of these algorithm options can be set in the Vagrantfile, as in the sketch below this note)

Noting that vagrant ssh by default also uses -o LogLevel=FATAL -o Compression=yes -o IdentitiesOnly=yes -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null but those aren't mandatory
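
One possible way to push those algorithm options into the Vagrantfile so that vagrant ssh picks them up automatically (untested here; config.ssh.extra_args only affects the vagrant ssh command itself):

  # In the Vagrantfile: extra arguments appended to "vagrant ssh"
  config.ssh.extra_args = [
    "-o", "HostKeyAlgorithms=ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519",
    "-o", "PubKeyAcceptedKeyTypes=ssh-rsa"
  ]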

Steps to recreate

scp 150.239.60.120:/home/will/solaris10_homemade_v2.box . 
sudo apt-get -y update && sudo apt install -y joe vagrant virtualbox
vagrant box add --name solaris10 solaris10_homemade_v2.box
wget -O Vagrantfile https://raw.githubusercontent.com/adoptium/infrastructure/master/ansible/vagrant/Vagrantfile.Solaris10
vagrant up

Working system types (numbers in brackets are cores / memory in GB):

  • D16ds_V4 (16/64)
  • D2s_V3 (2/8)
  • d8s_V3 (8/32)
  • f4s_V2 (4/8)

Failing system types (reboot loop in the VM):

  • D8ls_V5 (8/16)
  • D16dsV5 (16/64)

Failures with no VMX/SVM:

  • d3as-V4
  • B4ls_V2 (Intel/VMX)
  • B4als_V2 (AMD/SVM)
  • D16as v4 (AMD/SVM - Note that the intel equivalent works)

I'd ideally have an 8/16 or 16/32, but from what I've found so far those seem to be available only in configurations that don't work :-(

@sxa (Member Author) commented Mar 5, 2024

Created a new system on dockerhost-azure-ubuntu2204-x64-1, which has had vagrant and virtualbox installed from the adoptium repositories. This machine has had ssh exposed via port 2200 on the host, although the algorithm requirements mean there are issues connecting to it. I have set it up in jenkins using JNLP for now.

A build run locally completed in about 20 minutes.

The AQA pipeline job has been run at https://ci.adoptium.net/job/AQA_Test_Pipeline/220/ although that may need a re-run since it was running during today's jenkins update. The "Second run" table below from job 221 is after the /etc/hosts fix and after the jenkins upgrade was fully complete:

Job | First run | Second run
sanity.openjdk | link 😢 [1] | link
extended.openjdk | link 😢 [1] | link 😢 (10 failures)
sanity.perf | link | link
extended.perf | link | link
sanity.system | link | link
extended.system | link 😢 [2] | link 😢 [2]
sanity.functional | link | link
extended.functional | link | link
special.functional | link | link

Key: no icon = job passed; 😢 = job completed with failures; jobs that fell over didn't run to completion.

[1] - Many of these were "unable to resolve hostname" errors - I have manually added azsol10b to /etc/hosts, although this may well get resolved on a reboot.

[2] - Message (Noting that /export/home is a 22Gb file system with 90% free at the start of a test job):

11:41:54  There is 2499 Mb free
11:41:54  Test machine has only 2499 Mb free on drive containing /export/home/jenkins/workspace/
11:41:54  There must be at least 3Gb (3072Mb) free to be sure of capturing diagnostics
11:41:54  files in the event of a test failure.

Re-queuing extended.system after creating a dummy 1Gb file to fix the buggy space detection: https://ci.adoptium.net/job/Test_openjdk8_hs_extended.system_x86-64_solaris/376/console PASSED ✅
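
For reference, a dummy file like that can be created on the Solaris guest with something along these lines (the exact path is an assumption):

# One way to create a 1Gb file on Solaris
mkfile 1g /export/home/jenkins/dummy-1g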

So we're left with the ten failures from extended.openjdk.

@jiekang moved this from Todo to In Progress in 2024 1Q Adoptium Plan, Mar 5, 2024
@sxa (Member Author) commented Mar 6, 2024

So we're left with the ten failures from extended.openjdk. Re-running the appropriate targets in Grinder:

Grinder | Machine | Time | Result
9047 | azure-1 release | 2h36m | 9 failures
9048 | esxi-bld-1 release | 1h39m | 2 failures: jdk_security3_0, jdk_tools_0
9049 | esxi-test-1 release | 1h42m | 2 failures: jdk_security3_0, jdk_tools_0
9050 | esxi-test-1 nightly | n/a |
9051 | azure-1 nightly | 1h42m |
9052 | esxi-test-1 nightly | 3h23m | 5 failures
9053 | esxi-test-1 nightly | - | Repeat for good measure

@sxa (Member Author) commented Mar 7, 2024

Starting over with a cleaner setup now that we have prototyped this. Both of the dockerhost machines have had a /home/solaris file system created alongside an appropriate user with enough space to host the VMs. The Vagrantfile is under a subdirectory of solaris' home with the same name as the machine. The vagrant processes will run as that user:

Host | Guest
dockerhost-skytap-ubuntu2204-x64-1 | build-skytap-solaris10-x64-1
dockerhost-azure-ubuntu2204-x64-1 | test-skytap-solaris10-x64-1

The setup process uses the box we defined in the past (this is a repeat of the steps from an earlier comment in this issue):

scp 150.239.60.120:/home/will/solaris10_homemade_v2.box . 
sudo apt-get -y update && sudo apt install -y joe vagrant virtualbox
vagrant box add --name solaris10 solaris10_homemade_v2.box
wget -O Vagrantfile https://raw.githubusercontent.com/adoptium/infrastructure/master/ansible/vagrant/Vagrantfile.Solaris10
vagrant up

Noting that I started getting issues with the audio driver:

Stderr: VBoxManage: error: Failed to construct device 'ichac97' instance #0 (VERR_CFGM_NOT_ENOUGH_SPACE)
VBoxManage: error: Details: code NS_ERROR_FAILURE (0x80004005), component ConsoleWrap, interface IConsole

This can be solved by disabling audio support in the VirtualBox UI for the machine (unclear why it started happening when it was previously OK on the Azure machine).

To connect to the machine use the following, after which you can enable an appropriate key for the root user via sudo, and adjust /etc/ssh/sshd_config to allow root logins without-password:
ssh [email protected] -p 2222 -o HostKeyAlgorithms=ssh-rsa,ssh-dss,ecdsa-sha2-nistp256,ssh-ed25519 -o PubKeyAcceptedKeyTypes=ssh-rsa -i .vagrant/machines/adoptopenjdkSol10/virtualbox/private_key
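
The sshd_config change described above amounts to something like the following on the guest (a sketch; the service restart assumes Solaris 10's SMF service name for sshd):

# In /etc/ssh/sshd_config on the Solaris guest
PermitRootLogin without-password
# Then restart the ssh service
svcadm restart svc:/network/ssh:default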

Until we get jenkins able to ssh to these machines, I am starting the agents with the following script:

#!/bin/sh
# Put the local tools and the JDK on the PATH for the agent process
PATH=/usr/local/bin:/opt/csw/bin:/usr/lib/jvm/bell-jdk-11.0.18/bin:$PATH; export PATH
# Preload the fallocate shim for 64-bit processes
LD_PRELOAD_64=/usr/lib/jvm/fallocate.so; export LD_PRELOAD_64
# Run the Jenkins inbound agent, restarting it five minutes after any exit
while true; do
  java -jar agent.jar -url https://ci.adoptium.net/ -secret XXXXX -name "XXXXX" -workDir "/export/home/jenkins"
  sleep 300
done
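
On the guest this script is typically left running in the background so it survives the login session ending; for example (the script name and log path are assumptions):

nohup /export/home/jenkins/start-agent.sh > /export/home/jenkins/agent.log 2>&1 &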

@sxa (Member Author) commented Mar 7, 2024

Systems are live and operating as expected. Note to infra team: you can switch to the solaris user on the machine and, from the machine's subdirectory, use the ssh command in the previous comment to connect to the Solaris VM. I've added the team's keys onto the machine too, so you can get to it as the root user.

/etc/hosts had to be updated manually to have an entry for the hostname output - we should have the playbooks doing that if we can - hopefully it won't disappear on restart since I've adjusted /etc/hostname accordingly.
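
For illustration, the manual fix amounts to a line of this form in the guest's /etc/hosts (the address is a placeholder; azsol10b is the hostname mentioned in the earlier comment):

10.0.2.15   azsol10b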

This could do with being documented somewhere else, but since the machines are operational (other than adoptium/aqa-tests#5127, which is being tracked in that issue) I'm closing this issue.

@steelhead31 (Contributor):

When creating an Azure VM that supports nested virtualization, the following restrictions are in place:

  • Must be a TYPE D or TYPE E machine, of Version 3.
  • Must only use the "Standard" security model; trusted launch should not be used/enabled.
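
Put another way, creating a suitable host with the Azure CLI would look roughly like this (a sketch only; resource group, VM name and image are placeholders, and --security-type Standard is what avoids the trusted launch restriction above):

# D16ds_v4 is one of the sizes noted as working earlier in this issue
az vm create \
  --resource-group adoptium-infra \
  --name dockerhost-azure-ubuntu2204-x64-2 \
  --image Ubuntu2204 \
  --size Standard_D16ds_v4 \
  --security-type Standard \
  --admin-username azureuser \
  --generate-ssh-keys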
