JlmRemote* failures #246

Closed · jdekonin opened this issue Apr 8, 2019 · 18 comments

@jdekonin commented Apr 8, 2019

I can reproduce two different scenarios with the first test (I didn't look into the others). In the first, LT1 appears to complete its workload and exit cleanly, but this causes CL1 to end with an UndeclaredThrowableException. In the second, CL1 terminates with very little meaningful output to determine the cause.

03:57:46  	TestJlmRemoteClassNoAuth_0
03:57:46  	TestJlmRemoteMemoryAuth_0
03:57:46  	TestJlmRemoteMemoryNoAuth_0
03:57:46  	TestIBMJlmRemoteClassAuth_0
03:57:46  	TestIBMJlmRemoteClassNoAuth_0
03:57:46  	TestIBMJlmRemoteMemoryAuth_0
03:57:46  	TestIBMJlmRemoteMemoryNoAuth_0

Could be related to #145 and #208, although the platform I see it on is x86-64 Linux, variant openj9.
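
For reference, scenario 1 is consistent with a JMX proxy losing its connection mid-call: the platform MXBean interfaces do not declare IOException, so if the monitored VM (LT1) goes away while the client (CL1) is still polling, the failure surfaces as UndeclaredThrowableException. A minimal sketch of that pattern (illustrative only, not the test code; the service URL matches the one seen in the logs later in this issue):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JlmProxySketch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            ThreadMXBean remoteThreads = ManagementFactory.newPlatformMXBeanProxy(
                    server, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // If the monitored VM exits while we are still polling, this proxy call
            // fails with an IOException that ThreadMXBean does not declare, so it is
            // rethrown as java.lang.reflect.UndeclaredThrowableException.
            for (int i = 0; i < 60; i++) {
                System.out.println("Live threads: " + remoteThreads.getThreadCount());
                Thread.sleep(1000);
            }
        }
    }
}
```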

@Mesbah-Alam

For scenario 1, what we need to try is increasing the workload on LT1 so that it does not end before CL1 finishes. Apparently the workload LT1 runs is not long-running enough, especially on the internal machines.

@jdekonin commented Apr 8, 2019

@Mesbah-Alam

The PPC64LE failure above is different:

CL2 stderr j> 2019/04/08 10:22:08.197   Memory Pool:              G1 Old Gen
CL2 stderr j> 2019/04/08 10:22:08.198   Memory Type:              HEAP
CL2 stderr j> 2019/04/08 10:22:08.199   Peak Usage:               init = 118489088(115712K) used = 19607136(19147K) committed = 120586240(117760K) max = 2147483648(2097152K)
CL2 stderr j> 2019/04/08 10:22:08.199   Current Usage:            init = 118489088(115712K) used = 19874816(19409K) committed = 45088768(44032K) max = 2147483648(2097152K)
CL2 stderr Exception in thread "main" java.lang.AssertionError: Peak Usage used memory smaller than Current Usage used memory
CL2 stderr 	at org.junit.Assert.fail(Assert.java:88)
CL2 stderr 	at net.adoptopenjdk.test.jlm.resources.MemoryData.checkPeakAndCurrentMemoryUsage(MemoryData.java:521)
CL2 stderr 	at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:401)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:251)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)
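
For context, the failing check compares peak against current usage for each memory pool. A minimal sketch of that kind of comparison using the standard java.lang.management API (assumed from the stack trace; the actual test queries the values over the remote JMX connection, so the two snapshots are taken at slightly different times):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PeakVsCurrentSketch {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage peak = pool.getPeakUsage();
            MemoryUsage current = pool.getUsage();
            if (peak == null || current == null) {
                continue; // some pools do not report usage
            }
            // The assertion expects the peak snapshot's used bytes to be at least as
            // large as the current snapshot's used bytes for the same pool.
            if (peak.getUsed() < current.getUsed()) {
                throw new AssertionError(
                        "Peak Usage used memory smaller than Current Usage used memory");
            }
        }
        System.out.println("Peak >= current for all reporting pools");
    }
}
```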

@Mesbah-Alam

#247 increases the workload of the server process in all JLM remote tests. This should eliminate the error caused by the server process finishing earlier than the client. We may still see JLM test failures caused by other issues, however.

@jdekonin commented Apr 9, 2019

I can see the change -suite.mini-mix.totalNumberTests 300000 -> 900000 in the job, and I still see problems. The errors have changed: at "Step 5 - Wait for the processes to complete" I now get either "Process LT1 has ended unexpectedly" or "Process CL1 has ended unexpectedly".

@pshipton

Note we also saw similar failures to the previous comment on Monday night in OpenJ9 builds.

@Mesbah-Alam commented Apr 10, 2019

In all 4 cases in #246 (comment), it's the JLMRemoteThreadNoAuth test that fails, and they all fail with the same cause:

STF 01:18:03.240 - Monitoring processes: CL1 LT1
CL1 j> 2019/04/10 01:18:03.645 ServerURL=service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi
CL1 j> 2019/04/10 01:18:03.704 Trying to connect using JMXConnectorFactory
CL1 j> 2019/04/10 01:18:14.231 Monitored VM not ready at Apr 10, 2019 1:18:14 AM (attempt 0).
CL1 j> 2019/04/10 01:18:14.232 Wait 10 secs and trying again...
CL1 j> 2019/04/10 01:18:14.233 Trying to connect using JMXConnectorFactory
CL1 j> 2019/04/10 01:18:17.573 Connection established!
CL1 j> 2019/04/10 01:18:28.154 Starting to write data
STF 01:23:02.432 - Heartbeat: Process LT1 is still running
STF 01:28:02.362 - Heartbeat: Process LT1 is still running
STF 01:28:04.375 - **FAILED** Process CL1 has timed out

This means the client process connects to the server successfully and starts writing data, but simply takes longer than the set timeout (10 minutes) to complete.

I had a conversation with Joe; we will not increase the time limit of the client process in these tests at the moment, but will wait for the resource shortage on the machines to be resolved first.
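
The connect-and-retry pattern visible in that log is essentially the following (a minimal sketch; the retry count, delay, and class name are illustrative, not the test's actual code):

```java
import java.io.IOException;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ConnectWithRetrySketch {
    static JMXConnector connectWithRetry(String serviceUrl, int maxAttempts)
            throws Exception {
        JMXServiceURL url = new JMXServiceURL(serviceUrl);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                System.out.println("Trying to connect using JMXConnectorFactory");
                return JMXConnectorFactory.connect(url);
            } catch (IOException e) {
                // The server VM may not have registered its connector yet.
                System.out.println("Monitored VM not ready (attempt " + attempt + ").");
                System.out.println("Wait 10 secs and trying again...");
                Thread.sleep(10_000);
            }
        }
        throw new IOException("Could not connect after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws Exception {
        try (JMXConnector c = connectWithRetry(
                "service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi", 10)) {
            System.out.println("Connection established!");
        }
    }
}
```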

@pshipton commented Apr 10, 2019

@Mesbah-Alam @jdekonin please clarify "wait for the lack of resource issue in the machines to be solved first".

This problem has occurred in the open OpenJ9 nightly builds since Monday night. It also occurred in a 0.14 release build (https://ci.eclipse.org/openj9/job/Test-sanity.system-JDK8-linux_x86-64/288/), where there were few OpenJ9/OMR/OpenJDK code changes; eclipse-openj9/openj9#5393 was merged.

We did add the CentOS 6 machines on Monday, and the 0.14 failure occurred on cent6-x64-6, but this problem isn't restricted to xlinux.

@Mesbah-Alam commented Apr 10, 2019

Joe would have the details, but in short there is simply too much swapping going on on the machines, because too many processes are sharing the resources at one time. That may explain why the tests are running slower than usual in general, causing failures especially in system tests that have a client-server model (e.g. JLM tests, SCC tests).

@jdekonin

I watched with htop and nmon on a Linux x86 machine and an AIX machine running sanity.system testing. Both experienced the above-mentioned timeouts when the system was at high memory usage (high 90%) and high CPU load.

At one point of failure on AIX, a ps | grep java showed 6 processes, none of which were running with an -Xmx setting. I believe the default the JVM was starting up with was 512m. The system at idle was using 2.8 of 8 GB; add to that the heap usage of 3 GB (at that point in time) and there is only 2.2 GB left for those 6 Java processes and any others. Add to that each Java instance using 16 GC threads, and I believe we are getting a lot of thrashing on the machine.

From what I could see, swap was hardly being used on either Linux or AIX, although on AIX it is only set up as 512m, whereas on x86 Linux it is 4g.
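
A quick way to see what heap ceiling a JVM on one of these machines actually picks up when no -Xmx is given (an illustrative check, not part of the test suite):

```java
public class DefaultHeapSketch {
    public static void main(String[] args) {
        // Run without any -Xmx option to see the platform/JVM default; on the machines
        // described above this was believed to be around 512m per process.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Default max heap: %d bytes (%.0f MB)%n",
                maxBytes, maxBytes / (1024.0 * 1024.0));
        // GC thread count also scales with CPU count unless capped explicitly
        // (e.g. -Xgcthreads<n> on OpenJ9), so many concurrent JVMs on one box
        // multiply the pressure described in this comment.
    }
}
```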

@Mesbah-Alam commented Apr 10, 2019

I've updated the JLM tests:

  1. Added -Xmx256 to all the sub-processes they start (see the sketch below).
  2. Decreased the thread limit of the server's workload from 200 to 30.
  3. Increased the timeout limit for client processes (the CL process is seen to be timing out especially on AIX).

#251.

TestJlmRemoteThreadNoAuth now passes a 5x Grinder on Linux x64 (run on an internal Grinder) and a 2x Grinder on AIX. I'll go ahead and deliver the change.
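
For illustration, the idea behind item 1 is to pass an explicit heap cap to each spawned sub-process so it cannot claim the JVM default. A minimal sketch (the exact -Xmx value, the main class, and the use of ProcessBuilder here are assumptions for the example; the actual change is in #251):

```java
import java.io.File;
import java.io.IOException;

public class SpawnWithHeapCapSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        String javaHome = System.getProperty("java.home");
        String javaBin = javaHome + File.separator + "bin" + File.separator + "java";

        // Cap the sub-process heap so several of these processes can coexist on a
        // loaded machine without claiming the larger platform default.
        ProcessBuilder pb = new ProcessBuilder(
                javaBin, "-Xmx256m", "-cp", System.getProperty("java.class.path"),
                "some.illustrative.WorkloadMain");   // hypothetical main class
        pb.inheritIO();
        Process p = pb.start();
        System.exit(p.waitFor());
    }
}
```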

@pshipton

Why did the failure start occurring on Monday? I'd never seen it before that. Did the tests change on Monday?

@Mesbah-Alam commented Apr 11, 2019

Tests were changed 2 days ago: the thread limit was increased to ensure the server process runs long enough for the clients to finish.

That could have caused the resource issue on some platforms. The thread limit has been put back to its original value today.

@pshipton

Can system test changes be tested more thoroughly before they are merged, i.e. with a test run on all platforms?

@Mesbah-Alam

Yes, that's what needs to be done going forward. So far we have only been testing on one or two platforms before delivering a change.

@lumpfish commented Sep 8, 2020

Updated #360 with an analysis of the JLM timeouts and a PR to fix: #360 (comment)

@lumpfish

The tests are now passing due to the fix in #361.

This issue also mentions the symptoms described in #274. Closing this as that issue can be used to resolve that problem.

@karianna added this to the October 2020 milestone on Oct 18, 2020