JlmRemote* failures #246

Closed · jdekonin opened this issue Apr 8, 2019 · 18 comments

@jdekonin commented Apr 8, 2019

I can reproduce two different scenarios with the first test (I didn't look into the others). In the first, LT1 appears to complete its workload and exit cleanly, but this causes CL1 to end with an UndeclaredThrowableException. In the second, CL1 terminates with very little meaningful output to determine the cause.

03:57:46  	TestJlmRemoteClassNoAuth_0
03:57:46  	TestJlmRemoteMemoryAuth_0
03:57:46  	TestJlmRemoteMemoryNoAuth_0
03:57:46  	TestIBMJlmRemoteClassAuth_0
03:57:46  	TestIBMJlmRemoteClassNoAuth_0
03:57:46  	TestIBMJlmRemoteMemoryAuth_0
03:57:46  	TestIBMJlmRemoteMemoryNoAuth_0

Could be related to #145 and #208, although the platform I see it on is x86-64 Linux, variant openj9.
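
For reference, scenario 1 is consistent with a JMX proxy losing its connection mid-call: the platform MXBean interfaces do not declare IOException, so if the monitored VM (LT1) goes away while the client (CL1) is still polling, the failure surfaces as UndeclaredThrowableException. A minimal sketch of that pattern (illustrative only, not the test code; the service URL matches the one seen in the logs later in this issue):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JlmProxySketch {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection server = connector.getMBeanServerConnection();
            ThreadMXBean remoteThreads = ManagementFactory.newPlatformMXBeanProxy(
                    server, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // If the monitored VM exits while we are still polling, this proxy call
            // fails with an IOException that ThreadMXBean does not declare, so it is
            // rethrown as java.lang.reflect.UndeclaredThrowableException.
            for (int i = 0; i < 60; i++) {
                System.out.println("Live threads: " + remoteThreads.getThreadCount());
                Thread.sleep(1000);
            }
        }
    }
}
```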

@Mesbah-Alam

For scenario 1, what we need to try is increasing the workload on LT1 so that it does not end before CL1 finishes. Apparently the workload LT1 runs is not long-running enough, especially on the internal machines.

@jdekonin commented Apr 8, 2019

@Mesbah-Alam

The PPC64LE failure above is different:

CL2 stderr j> 2019/04/08 10:22:08.197   Memory Pool:              G1 Old Gen
CL2 stderr j> 2019/04/08 10:22:08.198   Memory Type:              HEAP
CL2 stderr j> 2019/04/08 10:22:08.199   Peak Usage:               init = 118489088(115712K) used = 19607136(19147K) committed = 120586240(117760K) max = 2147483648(2097152K)
CL2 stderr j> 2019/04/08 10:22:08.199   Current Usage:            init = 118489088(115712K) used = 19874816(19409K) committed = 45088768(44032K) max = 2147483648(2097152K)
CL2 stderr Exception in thread "main" java.lang.AssertionError: Peak Usage used memory smaller than Current Usage used memory
CL2 stderr 	at org.junit.Assert.fail(Assert.java:88)
CL2 stderr 	at net.adoptopenjdk.test.jlm.resources.MemoryData.checkPeakAndCurrentMemoryUsage(MemoryData.java:521)
CL2 stderr 	at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:401)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:251)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)
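
For context, the failing check compares peak against current usage for each memory pool. A minimal sketch of that kind of comparison using the standard java.lang.management API (assumed from the stack trace; the actual test queries the values over the remote JMX connection, so the two snapshots are taken at slightly different times):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

public class PeakVsCurrentSketch {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage peak = pool.getPeakUsage();
            MemoryUsage current = pool.getUsage();
            if (peak == null || current == null) {
                continue; // some pools do not report usage
            }
            // The assertion expects the peak snapshot's used bytes to be at least as
            // large as the current snapshot's used bytes for the same pool.
            if (peak.getUsed() < current.getUsed()) {
                throw new AssertionError(
                        "Peak Usage used memory smaller than Current Usage used memory");
            }
        }
        System.out.println("Peak >= current for all reporting pools");
    }
}
```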

@Mesbah-Alam

#247 increases the workload of the server process in all JLM remote tests. This should eliminate the error caused by the server process finishing earlier than the client. We may still see JLM test failures caused by other issues, however.

@jdekonin commented Apr 9, 2019

I can see the change -suite.mini-mix.totalNumberTests 300000 -> 900000 in the job, and I still see problems. The errors have changed: at "Step 5 - Wait for the processes to complete" I now get either "Process LT1 has ended unexpectedly" or "Process CL1 has ended unexpectedly".

@pshipton

Note we also saw similar failures to the previous comment on Monday night in OpenJ9 builds.

@Mesbah-Alam commented Apr 10, 2019

In all 4 cases in #246 (comment), it's the JLMRemoteThreadNoAuth test that fails, and they all fail with the same cause:

STF 01:18:03.240 - Monitoring processes: CL1 LT1
CL1 j> 2019/04/10 01:18:03.645 ServerURL=service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi
CL1 j> 2019/04/10 01:18:03.704 Trying to connect using JMXConnectorFactory
CL1 j> 2019/04/10 01:18:14.231 Monitored VM not ready at Apr 10, 2019 1:18:14 AM (attempt 0).
CL1 j> 2019/04/10 01:18:14.232 Wait 10 secs and trying again...
CL1 j> 2019/04/10 01:18:14.233 Trying to connect using JMXConnectorFactory
CL1 j> 2019/04/10 01:18:17.573 Connection established!
CL1 j> 2019/04/10 01:18:28.154 Starting to write data
STF 01:23:02.432 - Heartbeat: Process LT1 is still running
STF 01:28:02.362 - Heartbeat: Process LT1 is still running
STF 01:28:04.375 - **FAILED** Process CL1 has timed out

This means the client process connects to the server successfully and starts writing data, but simply takes longer than the set timeout (10 minutes) to complete.

I had a conversation with Joe; we will not increase the time limit of the client process in these tests at the moment, but will wait for the resource shortage on the machines to be resolved first.
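
The connect-and-retry pattern visible in that log is essentially the following (a minimal sketch; the retry count, delay, and class name are illustrative, not the test's actual code):

```java
import java.io.IOException;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class ConnectWithRetrySketch {
    static JMXConnector connectWithRetry(String serviceUrl, int maxAttempts)
            throws Exception {
        JMXServiceURL url = new JMXServiceURL(serviceUrl);
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                System.out.println("Trying to connect using JMXConnectorFactory");
                return JMXConnectorFactory.connect(url);
            } catch (IOException e) {
                // The server VM may not have registered its connector yet.
                System.out.println("Monitored VM not ready (attempt " + attempt + ").");
                System.out.println("Wait 10 secs and trying again...");
                Thread.sleep(10_000);
            }
        }
        throw new IOException("Could not connect after " + maxAttempts + " attempts");
    }

    public static void main(String[] args) throws Exception {
        try (JMXConnector c = connectWithRetry(
                "service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi", 10)) {
            System.out.println("Connection established!");
        }
    }
}
```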

@pshipton commented Apr 10, 2019

@Mesbah-Alam @jdekonin please clarify "wait for the lack of resource issue in the machines to be solved first".

This problem has occurred in the open OpenJ9 nightly builds since Monday night. It also occurred in a 0.14 release build (https://ci.eclipse.org/openj9/job/Test-sanity.system-JDK8-linux_x86-64/288/), where there were few OpenJ9/OMR/OpenJDK code changes; eclipse-openj9/openj9#5393 was merged.

We did add the CentOS 6 machines on Monday, and the 0.14 failure occurred on cent6-x64-6, but this problem isn't restricted to xlinux.

@Mesbah-Alam commented Apr 10, 2019

Joe would have the details, but in short there is simply too much swapping going on on the machines, because too many processes are sharing the resources at one time. That may explain why the tests are running slower than usual in general, causing failures especially in system tests that have a client-server model (e.g. JLM tests, SCC tests).

@jdekonin

I watched with htop and nmon on a Linux x86 machine and an AIX machine running sanity.system testing. Both experienced the above-mentioned timeouts when the system was at high memory usage (high 90%) and high CPU load.

At one point of failure on AIX, a ps | grep java showed 6 processes, none of which were running with an -Xmx setting. I believe the default the JVM was starting up with was 512m. The system at idle was using 2.8 of 8 GB; add to that the heap usage of 3 GB (at that point in time) and there is only 2.2 GB left for those 6 Java processes and any others. Add to that each Java instance using 16 GC threads, and I believe we are getting a lot of thrashing on the machine.

From what I could see, swap was hardly being used on either Linux or AIX, although on AIX it is only set up as 512m, whereas on x86 Linux it is 4g.
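
A quick way to see what heap ceiling a JVM on one of these machines actually picks up when no -Xmx is given (an illustrative check, not part of the test suite):

```java
public class DefaultHeapSketch {
    public static void main(String[] args) {
        // Run without any -Xmx option to see the platform/JVM default; on the machines
        // described above this was believed to be around 512m per process.
        long maxBytes = Runtime.getRuntime().maxMemory();
        System.out.printf("Default max heap: %d bytes (%.0f MB)%n",
                maxBytes, maxBytes / (1024.0 * 1024.0));
        // GC thread count also scales with CPU count unless capped explicitly
        // (e.g. -Xgcthreads<n> on OpenJ9), so many concurrent JVMs on one box
        // multiply the pressure described in this comment.
    }
}
```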

@Mesbah-Alam commented Apr 10, 2019

I've updated the JLM tests:

  1. Added -Xmx256 to all the sub-processes they start (see the sketch below).
  2. Decreased the thread limit of the server's workload from 200 to 30.
  3. Increased the timeout limit for client processes (the CL process is seen to be timing out especially on AIX).

#251.

TestJlmRemoteThreadNoAuth now passes a 5x Grinder on Linux x64 (run on an internal Grinder) and a 2x Grinder on AIX. I'll go ahead and deliver the change.
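
For illustration, the idea behind item 1 is to pass an explicit heap cap to each spawned sub-process so it cannot claim the JVM default. A minimal sketch (the exact -Xmx value, the main class, and the use of ProcessBuilder here are assumptions for the example; the actual change is in #251):

```java
import java.io.File;
import java.io.IOException;

public class SpawnWithHeapCapSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        String javaHome = System.getProperty("java.home");
        String javaBin = javaHome + File.separator + "bin" + File.separator + "java";

        // Cap the sub-process heap so several of these processes can coexist on a
        // loaded machine without claiming the larger platform default.
        ProcessBuilder pb = new ProcessBuilder(
                javaBin, "-Xmx256m", "-cp", System.getProperty("java.class.path"),
                "some.illustrative.WorkloadMain");   // hypothetical main class
        pb.inheritIO();
        Process p = pb.start();
        System.exit(p.waitFor());
    }
}
```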

@pshipton

Why did the failure start occurring on Monday? I'd never seen it before that. Did the tests change on Monday?

@Mesbah-Alam commented Apr 11, 2019

Tests were changed 2 days ago: the thread limit was increased to ensure the server process runs long enough for the clients to finish.

That could have caused the resource issue on some platforms. The thread limit has been put back to its original value today.

@pshipton

Can system test changes be tested more thoroughly before they are merged, i.e. with a test run on all platforms?

@Mesbah-Alam

Yes, that's what needs to be done going forward. So far we have only been testing on one or two platforms before delivering a change.

@lumpfish commented Sep 8, 2020

Updated #360 with an analysis of the JLM timeouts and a PR to fix: #360 (comment)

@lumpfish

The tests are now passing due to the fix in #361.

This issue also mentions the symptoms described in #274. Closing this as that issue can be used to resolve that problem.

@karianna added this to the October 2020 milestone on Oct 18, 2020