JlmRemote* failures #246
For scenario 1, what we need to try is to increase the workload on LT1 so that it does not end before CL1 finishes. Apparently the workload LT1 is running is not long-running enough, especially on the internal machines.
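A minimal sketch of the idea, assuming a time-bounded workload is acceptable (the class name, duration, and shape of the work are illustrative, not the actual test code):

```java
// Illustrative only: run the server-side workload for a fixed minimum
// duration rather than a fixed iteration count, so LT1 cannot finish
// before the JLM client (CL1) has completed its checks.
public class LongRunningWorkload {
    private static final long MIN_RUN_MILLIS = 10 * 60 * 1000; // 10 minutes, illustrative

    public static void main(String[] args) {
        long end = System.currentTimeMillis() + MIN_RUN_MILLIS;
        long counter = 0;
        while (System.currentTimeMillis() < end) {
            // Some lock contention keeps lock-monitoring data flowing
            // for the remote JLM client to sample.
            synchronized (LongRunningWorkload.class) {
                counter++;
            }
        }
        System.out.println("Workload finished after " + counter + " iterations");
    }
}
```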
This is almost identical to what I am seeing.
The PPC64LE failure above is different.
#247 increases the workload in the server process in all JLM remote tests. This should eliminate the errors caused by the server process finishing earlier than the client. We may still see JLM test failures caused by other issues, however.
I can see the change.
Note that we also saw failures similar to the previous comment on Monday night in OpenJ9 builds.
In all 4 cases in #246 (comment), it's the JLMRemoteThreadNoAuth test that fails, and they all fail with the same cause, which means the client process is connecting successfully to the server and writing data, but is simply taking longer than the set timeout (10m) to complete. I had a conversation with Joe, and we will not increase the time limit in the client process in these tests at the moment, but will wait for the resource shortage issue on the machines to be solved first.
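As a rough illustration of the failure mode (not the actual harness code; the command line and class names are placeholders), the limit behaves like a process wait with a 10-minute deadline:

```java
import java.util.concurrent.TimeUnit;

// Minimal sketch, assuming the harness launches the client as a separate JVM
// and fails the test if it has not exited within the configured limit.
public class ClientTimeoutSketch {
    public static void main(String[] args) throws Exception {
        Process client = new ProcessBuilder("java", "JlmRemoteClient") // placeholder command
                .inheritIO()
                .start();

        if (!client.waitFor(10, TimeUnit.MINUTES)) {
            // The client may still be connected and writing data, just slowly;
            // on an overloaded machine this is where the timeout fires.
            client.destroyForcibly();
            throw new AssertionError("Client did not complete within 10 minutes");
        }
    }
}
```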
@Mesbah-Alam @jdekonin please clarify "wait for the resource shortage issue on the machines to be solved first". This problem has been occurring in the open nightly builds at OpenJ9 since Monday night. It also occurred in a 0.14 release build (https://ci.eclipse.org/openj9/job/Test-sanity.system-JDK8-linux_x86-64/288/), where there were few OpenJ9/OMR/OpenJDK code changes; eclipse-openj9/openj9#5393 was merged. We did add the CentOS 6 machines on Monday, and the 0.14 failure occurred on cent6-x64-6, but this problem isn't restricted to xlinux.
Joe would have the details, but in short, there is simply too much swapping going on on the machines: too many processes are sharing the resources at one time, which may explain why the tests are running slower than usual in general, causing failures especially in system tests that have a client-server model (e.g. JLM tests, SCC tests).
I watched the machines while the tests ran, including at one point of failure on AIX. From what I could see, swap was hardly being used on either Linux or AIX, although on AIX it is only set up as 512m, whereas on x86 Linux it is 4g.
I've updated the JLM tests: #251. TestJlmRemoteThreadNoAuth now passes a 5x Grinder on Linux x64 (run on the internal Grinder). It also passes a 2x Grinder on AIX. I'll go ahead and deliver the change.
Why did the failure start occurring on Monday? I'd never seen it before that. Did the tests change on Monday?
The tests were changed 2 days ago: the thread limit was increased to ensure the server process runs long enough for the clients to finish. This could have caused the resource issue on some platforms. The thread limit has been put back to its original value today.
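To illustrate the trade-off (purely hypothetical names and defaults, not the real test configuration): each extra worker thread keeps the server busy longer, but also costs memory and CPU, which is what hurt the smaller machines.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a thread-limited workload: raising the limit
// lengthens the run, lowering it reduces resource pressure.
public class ThreadLimitSketch {
    public static void main(String[] args) throws InterruptedException {
        int threadLimit = Integer.getInteger("workload.threads", 5); // illustrative property and default
        ExecutorService pool = Executors.newFixedThreadPool(threadLimit);
        for (int i = 0; i < threadLimit; i++) {
            pool.submit(() -> {
                long n = 0;
                while (n < 1_000_000_000L) {
                    n++; // simple CPU-bound work per thread
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(15, TimeUnit.MINUTES);
    }
}
```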
Can system test changes be tested more thoroughly before they are merged, i.e. with a test run on all platforms?
Yes, that's what needs to be done going forward. We have only been testing on one or two platforms before delivering a change so far.
Updated #360 with an analysis of the JLM timeouts and a PR with the fix: #360 (comment)
I can reproduce 2 different scenarios with the first test (I didn't look into the others). In the first, LT1 appears to complete a workload and exit cleanly, but this causes CL1 to end with an `UndeclaredThrowableException`. In the second, CL1 terminates with very little meaningful output to determine the cause. This could be related to #145 and #208, although the platform I see it on is x86 Linux, variant openj9.
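A small self-contained demonstration of where that exception can come from (not the test code itself; the interface and message are made up): if the client is calling through a dynamic proxy, as JMX MXBean proxies are, then any checked exception the handler throws that the interface method does not declare, such as an IOException from a dropped connection, surfaces as `UndeclaredThrowableException`.

```java
import java.io.IOException;
import java.lang.reflect.Proxy;

// Sketch only: simulates the remote side (LT1) going away while the client
// (CL1) is still calling through a proxy with no declared checked exceptions.
public class UndeclaredThrowableDemo {
    interface ThreadStats {
        long getLockedThreadCount(); // declares no checked exceptions
    }

    public static void main(String[] args) {
        ThreadStats proxy = (ThreadStats) Proxy.newProxyInstance(
                ThreadStats.class.getClassLoader(),
                new Class<?>[] { ThreadStats.class },
                (p, method, a) -> {
                    // Stand-in for the connector noticing the server has exited.
                    throw new IOException("connection closed by remote host");
                });

        // Throws java.lang.reflect.UndeclaredThrowableException wrapping the IOException.
        proxy.getLockedThreadCount();
    }
}
```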