Create new jobs to handle running Solaris tests via a Linux proxy machine #4099

Open
Tracked by #3742
sxa opened this issue Dec 20, 2024 · 16 comments
Assignees
Labels
macos Issues that affect or relate to the MAC OS solaris Issues that affect or relate to the SOLARIS OS testing Issues that enhance or fix our test suites

Comments

@sxa
Member

sxa commented Dec 20, 2024

This covers the implementation of what has been discussed in adoptium/infrastructure#3742 (comment)

Current status: Prototype jobs have been created for x64 and SPARC at:

These are currently connecting to the target machine as the vagrant user (even on SPARC where I've created a user with that name for consistency).

Jobs are currently set up to run the full AQA suite of tests and archive the artefacts, instead of running as individual jobs, but that can be changed later if desired.

@github-actions github-actions bot added macos Issues that affect or relate to the MAC OS solaris Issues that affect or relate to the SOLARIS OS testing Issues that enhance or fix our test suites labels Dec 20, 2024
@sxa sxa removed this from Adoptium Backlog Dec 20, 2024
@sxa sxa self-assigned this Dec 20, 2024
@sxa
Member Author

sxa commented Jan 7, 2025

Prototype now working (although hard coded to a specific tag). There is a dotests.x64.sh script and a dotests.sparcv9 script on the proxy host; the appropriate one is copied across to the target machine using scp as dotests.sh and then executed.

It needs to be parameterised to be able to take the tag/URL as a parameter (for use when retrieving the artifact directly from jenkins, since we can't use copyArtifact) but otherwise it works.
It currently loops over each required suite and then copies the TAP output back to the proxy machine for archiving. The proxy agent is running as the solaris user on dockerhost-azure-ubuntu2204-x64-1.
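
The copy, run and collect flow described above can be sketched roughly as follows. This is a minimal illustration and not the real job: the target host name, the idea of passing the suite name to dotests.sh as an argument, and the TAP file location are all assumptions. By default the sketch only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only. TARGET_HOST, the suite argument and the TAP path are assumptions,
# not taken from the real dotests scripts.
TARGET_HOST=${TARGET_HOST:-vagrant@solaris-target}
RUN=${RUN-echo}   # default: print commands; export RUN="" to really execute them

run_suite_via_proxy() {
    suite=$1      # e.g. sanity.openjdk
    script=$2     # e.g. dotests.x64.sh on the proxy host
    # Copy the platform-specific script to the target as dotests.sh, then run it.
    $RUN scp "$script" "$TARGET_HOST:dotests.sh" &&
        $RUN ssh "$TARGET_HOST" "sh ./dotests.sh $suite"
    status=$?
    # Fetch TAP output even if the suite failed, so the proxy can archive it.
    $RUN scp "$TARGET_HOST:aqa-tests/TKG/output_*/*.tap" .
    return $status
}

failed=""
for suite in sanity.openjdk sanity.system; do
    run_suite_via_proxy "$suite" dotests.x64.sh || failed="$failed $suite"
done
echo "Suites with failures:${failed:- none}"
```

Recording the failing suites instead of stopping at the first one keeps the TAP files for every suite available for archiving.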

@sxa
Member Author

sxa commented Jan 10, 2025

Verification (Solaris/x64)

Failures (based on looking at https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/50/tapResults):

  • sanity.openjdk:
    • jdk/lambda/vm/InterfaceAccessFlagsTest.java
  • sanity.system (all due to being unable to locate mauve/mauve.jar):
    • MauveSingleThrdLoad_HS_5m_0
    • MauveSingleInvocLoad_HS_5m_0
    • MauveMultiThrdLoad_5m_0
  • extended.system (all except OAuthTest_0 are due to being unable to locate mauve/mauve.jar):
    • MiniMix_5m_0
    • MiniMix_10m_0
    • MiniMix_aot_5m_0
    • OAuthTest_0
  • extended.openjdk:
    • jdk_beans_0 (10 failures - mostly font/color related)
    • jdk_security3_0 (javax/net/ssl/ciphersuites/DisabledAlgorithms.java)
    • jdk_management_0 (sun/management/jmxremote/bootstrap/SSLConfigFilePermissionTest.sh)
    • jdk_imageio_0 (2 failures - plugins/jpeg/JPEGsNotAcceleratedTest.java and javax/imageio/AppletResourceTest.java)
  • special.openjdk:
    • jdk_math_jre_0 (Error: JDK not found)

Solaris/SPARC

The SPARC run had mostly the same failures although OAuthTest_0 in the extended.system suite didn't fail as it was skipped. Two other failures:

  • extended.openjdk:
    • hotspot_jdk_0: serviceability/sa/jmap-hashcode/Test8028623.java
    • jdk_security3_0: sun/security/ssl/SSLSocketImpl/ClientSocketCloseHang.java (NOTE: Different failure in this suite from the x64 run)

@adamfarley
Contributor

adamfarley commented Jan 10, 2025

These test failures were seen in the old Solaris pipelines:

Sparc:

x64:


These test failures appear to be new:

Sparc:

  • ClientSocketCloseHang.java Example.

x64: in progress

  • InterfaceAccessFlagsTest.java Example.
  • jdk_math_jre_0 Example.
  • jdk_beans_0 Example
  • AppletResourceTest Example
  • All the "could not find mauve jar" issues, though we did have an instance of "could not find stf.pl" in those extended tests. Example.
  • OAuthTest_0 timeout (connection refused errors are common in the old pipeline, though). Example.

@sxa
Member Author

sxa commented Jan 10, 2025

@adamfarley FYI I've brought the "normal" solaris jenkins agents for the test boxes back online in case you want to try anything via Grinder

@sxa
Member Author

sxa commented Jan 10, 2025

@adamfarley Also if you're going to run grinders it would probably be good to compare on the last published EA ones and the "new" builds from my pipelines in https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-sparcv9-temurin-simplepipe/ and https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simplepipe/ just in case there's something wrong with the build itself.

@adamfarley
Contributor

adamfarley commented Jan 10, 2025

Sure thing.

TLDR: No problems on the new build that weren't on the old build (when run in Grinder, anyway).

Details:

x64:

OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error
Note: The jdk_custom target appears to have run incorrectly (test/TKG bug, I think) for both builds, so here are the reruns (which all passed): Old build, new build.

System result:

Can't open perl script "/export/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/system/security/..//STF/stf.core/scripts/stf.pl": No such file or directory

This error occurs with both old and new builds, so the framework is equally broken in both cases. :/

sparc:

OpenJDK result: Targets passed, except for math jre, which failed on both builds. Custom reruns are here: Old build, new build.

System result: Same as above. stf.pl not found.

@sxa
Member Author

sxa commented Jan 13, 2025

Test jobs are currently being run from the top-level "simplepipe" pipelines with propagate: false, because the job fails with an ERROR state if one suite fails, which stops the pipeline from continuing to the following steps, e.g.

15:30:04 TOTAL: 23   EXECUTED: 9   PASSED: 8   FAILED: 1   DISABLED: 0   SKIPPED: 14
15:30:04 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:05 
15:30:06 TESTCASES RESULTS SUMMARY: passed: 4,947; failed: 1; error: 0; skipped: 0
15:30:06 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:06 To rebuild the failed test in a jenkins job, copy the following link and fill out the <Jenkins URL> and <FAILED test target>:
15:30:06 <Jenkins URL>/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=temurin&BUILD_LIST=openjdk&PLATFORM=sparcv9_solaris&TARGET=<FAILED test target>
15:30:06 
15:30:06 For example, to rebuild the failed tests in <Jenkins URL>=https://ci.adoptium.net/job/Grinder, use the following links:
15:30:06 https://ci.adoptium.net/job/Grinder/parambuild/?JDK_VERSION=8&JDK_IMPL=hotspot&JDK_VENDOR=temurin&BUILD_LIST=openjdk&PLATFORM=sparcv9_solaris&TARGET=jdk_lang_0
15:30:06 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
15:30:06 gmake[1]: *** [settings.mk:450: resultsSummary] Error 2
15:30:06 gmake[1]: Leaving directory '/export/home/vagrant/aqa-tests/TKG'
15:30:06 gmake: *** [makefile:62: _sanity.openjdk] Error 2

I will continue to look at resolving that, but in the meantime propagate: false seems to work, although it means the yellow warning status is not shown in the pipeline block.
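
For reference, the log above shows gmake returning Error 2 because one test target failed, not because of an infrastructure problem. As a purely hypothetical illustration (this is not how the jobs currently work, and failed_count is an invented helper name), a wrapper script could read the FAILED count from the TKG summary line to tell the two cases apart:

```shell
# Hypothetical helper: extract the FAILED count from a TKG "TOTAL:" summary line.
# Prints nothing if the line does not contain a FAILED count.
failed_count() {
    printf '%s\n' "$1" | sed -n 's/.*FAILED: \([0-9][0-9]*\).*/\1/p'
}

summary='15:30:04 TOTAL: 23   EXECUTED: 9   PASSED: 8   FAILED: 1   DISABLED: 0   SKIPPED: 14'
failed_count "$summary"   # prints 1
```

A non-zero count alongside a non-zero gmake exit would then suggest "tests failed" rather than "job broke".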

@sxa
Member Author

sxa commented Jan 13, 2025

OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error

To be clear, are you saying that when running on Grinder the other failures in the openjdk suite do not occur and my failing tests have all passed? If so we need to see why that is.

These test failures appear to be new:

Is that "new" in the newer builds since the last 8u332 GA, or "new" as in only showing up in my "simple" pipelines and not in the original test jobs?

@adamfarley
Contributor

OpenJDK result: All tests passed except for the jdk_math_jre_0 target which failed in both cases with the same error

To be clear, are you saying that when running on Grinder the other failures in the openjdk suite do not occur and my failing tests have all passed? If so we need to see why that is.

Yes, the tests that failed in the simple pipeline have passed in the Grinder job (except for jdk_math_jre_0, which fails with both normal builds and simple pipeline builds).

These test failures appear to be new:

Is that "new" in the newer builds since the last 8u332 GA, or "new" as in only showing up in my "simple" pipelines and not in the original test jobs?

The "cannot find mauve" issue appears to have been first seen on the simple pipelines, but I can't prove that it only affects the simple pipelines because the grinder can't find stf.pl, and that issue was only seen once before.

@sxa
Member Author

sxa commented Jan 14, 2025

Can't open perl script "/export/home/jenkins/workspace/Grinder/aqa-tests/TKG/../../jvmtest/system/security/..//STF/stf.core/scripts/stf.pl": No such file or directory

This was potentially caused by the /tmp/mauve directory which was present on the machine but owned by a different user (The new pipelines are NOT running as the jenkins user). Removing that directory caused the tests to run through properly:
Passing test | Previous failing test
There is still an issue with mauve.jar not being found, which appears to happen only in the new jobs. I wonder if it's because I'm running without a WORKSPACE variable set, so it's choosing to put things elsewhere. Continuing investigation.

@sxa
Member Author

sxa commented Jan 15, 2025

Noting that the parsing of the df output is not quite correct on Solaris. It's giving this message:

Test machine has only 1893 Mb free on drive containing /export/home/local/vagrant/aqa-tests/TKG/../TKG/output_17369506484404/TestJlmLocal_0.

There must be at least 3Gb (3072Mb) free to be sure of capturing diagnostics
files in the event of a test failure.

despite parsing the df output with df_header of:

Filesystem           1024-blocks        Used   Available Capacity  Mounted on

and df_body of

/dev/dsk/c1t0d0s7       23510982     1939385    21336488     9%    /export/home

Based on the detection of 1893 Mb free it seems likely that it is parsing the used instead of available field, so as a temporary workaround I'll fill up the space for a while ;-)
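
The arithmetic supports that reading: 1939385 / 1024 is roughly 1893, which matches the reported figure, while the Available column (21336488 / 1024) is roughly 20836 Mb. A minimal sketch of a more robust approach, assuming a POSIX shell and awk, is to find the "Available" column from the header line instead of relying on a fixed field position. The function name is hypothetical and this is not the code the test framework actually uses:

```shell
# Hypothetical helper: report free space in Mb by locating the "Avail*" column
# in the df header instead of assuming a fixed field position.
free_mb_from_df() {
    # $1 = df header line, $2 = df body line (sizes in 1024-byte blocks)
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /^Avail/) col = i; next }
        NR == 2 { if (col) print int($col / 1024); else exit 1 }
    '
}

df_header='Filesystem           1024-blocks        Used   Available Capacity  Mounted on'
df_body='/dev/dsk/c1t0d0s7       23510982     1939385    21336488     9%    /export/home'
free_mb_from_df "$df_header" "$df_body"   # prints 20836
```

Looking the column up by name works here because "Mounted on" (the only header that splits into two words) comes after the Available column.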

@sxa
Member Author

sxa commented Jan 15, 2025

I think the problem with stf.pl may have disappeared by either:

  • removing /opt/xpg4/bin from the PATH (where the "other" df command was).
  • adding a definition of $WORKSPACE pointing to $HOME/workspace, although I'm not convinced it is doing much with that.

I've also added smoke test functionality to the test job, as described by the commands being added to the FAQ in "doc: update FAQ.md to include smoke test reproduction" (infrastructure#3860), so that is now also being run.
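
The two environment changes listed above could be applied in a wrapper script along these lines. This is a sketch assuming a POSIX shell; strip_path_entry is a hypothetical helper name and not part of the real dotests.sh:

```shell
# Hypothetical helper: remove one directory from a colon-separated PATH string.
strip_path_entry() {
    # $1 = PATH-style string, $2 = directory to drop
    result=""
    old_ifs=$IFS
    IFS=:
    for dir in $1; do
        [ "$dir" = "$2" ] && continue
        result="${result:+$result:}$dir"
    done
    IFS=$old_ifs
    printf '%s\n' "$result"
}

# Drop the directory holding the "other" df, and give the tests a WORKSPACE
# if the environment has not already defined one.
PATH=$(strip_path_entry "$PATH" /opt/xpg4/bin)
WORKSPACE=${WORKSPACE:-$HOME/workspace}
export PATH WORKSPACE
```

Filtering the existing PATH, instead of hard coding a new one, leaves any other site-specific entries on the machine untouched.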

https://ci.adoptium.net/job/build-scripts/job/jobs/job/jdk8u/job/jdk8u-solaris-x64-temurin-simpletest/62/console is running with the smoke tests and sanity.system

Notes:

  1. The smoke test on Solaris only runs Java_Version_0 and not Adopt_HS_FeatureTests_0
  2. The installed version of ant in /usr/local/bin is not used by the system tests. stf seems to ALWAYS download version 1.10.2 (earlier than what we now install elsewhere). It goes into /var/tmp

@sxa
Member Author

sxa commented Jan 15, 2025

The simpletest jobs are now back to the original mauve.jar error on both x64 and SPARC, on the three mauve tests:

FAILED test targets:
	MauveSingleThrdLoad_HS_5m_0
	MauveSingleInvocLoad_HS_5m_0
	MauveMultiThrdLoad_5m_0

The mauve references in the log are as follows:

21:11:44 check-if-already-built:
21:11:44      [echo] Checking if /export/home/vagrant/aqa-tests/systemtest_prereqs/mauve/mauve.jar already exists
21:11:44      [echo] openjdk_test_mauve_already_built is ${openjdk_test_mauve_already_built}
[...]
21:11:46 check-if-work_jar_file-exists:
21:11:46      [echo] Checking if /var/tmp//mauve/mauve.jar exists
21:11:46      [echo] openjdk_test_mauve_work_jar_file_exists is ${openjdk_test_mauve_work_jar_file_exists}
[...]
21:12:01 GEN 16:27:31.003 - Using Mode NoOptions. Values = ''
21:12:01 GEN stderr Exception in thread "main" net.adoptopenjdk.stf.StfException: Note: file 'mauve/mauve.jar' could not be found in any of the supplied test roots: '/export/home/vagrant/jvmtest/system/systemtest_prereqs'

The original sanity.system jobs are still failing to locate stf.pl:

12:55:06  Can't open perl script "/export/home/jenkins/workspace/Test_openjdk8_hs_sanity.system_x86-64_solaris/aqa-tests/TKG/../../jvmtest/system/mauveLoadTest/..//STF/stf.core/scripts/stf.pl": No such file or directory
12:55:06  -----------------------------------
12:55:06  MauveSingleThrdLoad_HS_5m_0_FAILED
12:55:06  -----------------------------------

Also from simpletest#63 - this is coming from the code in https://github.com/adoptium/aqa-systemtest/blob/5279e7ee7ddf2a4381f8e5c650b4c13b239f00f7/openjdk.test.mauve/build.xml#L362 and may indicate a CVS retrieval issue:

delete-work-dir:
   [delete] Deleting directory /var/tmp/mauve

create-work-dir:
    [mkdir] Created dir: /var/tmp/mauve

get-source:
     [exec] Could not read password for host: java.io.FileNotFoundException: /export/home/local/vagrant/.cvspass (No such file or directory)
     [exec] Cannot connect to host sourceware.org:2401.
     [exec] Result: 1

check-if-source-available:
     [echo] Checking if /var/tmp//mauve/mauve/gnu/testlet/config.java.in exists
     [echo] mauve_source_available is ${mauve_source_available}

By comparison, this is from a passing run on Linux/x64:

21:44:33  delete-work-dir:
21:44:33  
21:44:33  create-work-dir:
21:44:33  
21:44:33  get-source:
21:44:33  
21:44:33  check-if-source-available:
21:44:33       [echo] Checking if /tmp/mauve/mauve/gnu/testlet/config.java.in exists
21:44:33       [echo] mauve_source_available is ${mauve_source_available}
21:44:33 

Note also that the inability to resolve some of these variables appears unique to the recent runs of simpletest and did not occur in the last successful "normal" Solaris/x64 run at https://ci.adoptium.net/job/Test_openjdk8_hs_sanity.system_x86-64_solaris/401/consoleFull

@sxa
Member Author

sxa commented Jan 16, 2025

There are a small number of test cases (mostly in java_beans) which are failing due to the absence of a DISPLAY variable - I've started an Xvfb on :5 on both machines and adjusted the dotests.sh script to point at that so hopefully the next runs will be better.
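
A rough sketch of how dotests.sh could make sure a display is available, assuming a POSIX shell. The function name, the XVFB_CMD override and the lock-file check are assumptions for illustration; the real script simply relies on an Xvfb that was started by hand:

```shell
# Sketch only: ensure_display, XVFB_CMD and the lock-file check are assumptions.
XVFB_CMD=${XVFB_CMD:-Xvfb}

ensure_display() {
    display_num=${1:-5}
    # X servers normally leave /tmp/.X<n>-lock in place while they are running,
    # so only start a new one if that file is absent and the command exists.
    if [ ! -e "/tmp/.X${display_num}-lock" ] && command -v "$XVFB_CMD" >/dev/null 2>&1; then
        "$XVFB_CMD" ":${display_num}" >/dev/null 2>&1 &
    fi
    DISPLAY=":${display_num}"
    export DISPLAY
}
```

Calling ensure_display 5 before the make steps would give the AWT-dependent tests the same DISPLAY=:5 setting described above.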

@sxa
Member Author

sxa commented Jan 17, 2025

Summary from performing triage on the January dry-runs using builds created with the new pipelines:

  • as per the last comment, Xvfb wasn't being started as that is typically done via the pipelines. This was causing a small number of tests in java_beans_0 and java_imageio_0 to fail. To mitigate that for now I have manually started an Xvfb on display :5 and hard coded the setting of DISPLAY=:5 in the environment in dotests.sh
  • Some of the system tests (Mauve in sanity, MiniMix in extended) require mauve.jar to be present, and that is not put in place by the normal make compile; make TARGET process, so I have added an explicit curl of mauve.jar from the systemtest.getDependency job:
curl -o `pwd`/aqa-tests/systemtest_prereqs/mauve/mauve.jar \
    https://ci.adoptium.net/job/systemtest.getDependency/lastSuccessfulBuild/artifact/systemtest_prereqs/mauve/mauve.jar

This should significantly reduce the number of outstanding test failures.

@sxa sxa removed the macos Issues that affect or relate to the MAC OS label Jan 17, 2025
@github-actions github-actions bot added the macos Issues that affect or relate to the MAC OS label Jan 17, 2025
@sxa
Member Author

sxa commented Jan 17, 2025

Latest version of the dotests.sh script which is used by the simpletest jobs is: dotests.sh.txt
Note that this currently relies on the Xvfb already being run as DISPLAY :5

@sxa sxa removed the macos Issues that affect or relate to the MAC OS label Jan 17, 2025
@github-actions github-actions bot added the macos Issues that affect or relate to the MAC OS label Jan 17, 2025
@sxa sxa moved this to In Progress in 2025 1Q Adoptium Plan Jan 17, 2025