Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

multiple-pause-resume.sh: Check for aplay/arecord pid when waiting #1126

Conversation

ujfalusi
Copy link
Contributor

@ujfalusi ujfalusi commented Nov 3, 2023

It has been observed that the expect process has been terminated (pidof returns with no PID) but the child aplay is still running. This only happens with ChainDMA enabled PCMs where the time to stop the stream takes more than 100ms (from trigger stop to PCM close). With the right timing on such PCM the pidof expect will break the wait loop and we progress to the next iteration which - again with right timing and setup - might fail if the aplay is still in a process of closing and the next iteration includes the same PCM.

Add the aplay and arecord commands to the pidof command to make sure that the iteration has been completed before starting a new one. We will still catch timeouts and in that case we should kill the stuck processes.

It has been observed that the expect process has been terminated (pidof
returns with no PID) but the child aplay is still running.
This only happens with ChainDMA enabled PCMs where the time to stop
the stream takes more than 100ms (from trigger stop to PCM close).
With the right timing on such PCM the pidof expect will break the wait
loop and we progress to the next iteration which - again with right
timing and setup - might fail if the aplay is still in a process of closing
and the next iteration includes the same PCM.

Add the aplay and arecord commands to the pidof command to make sure that
the iteration has been completed before starting a new one.
We will still catch timeouts and in that case we  should kill the stuck
processes.

Signed-off-by: Peter Ujfalusi <[email protected]>
@ujfalusi ujfalusi requested a review from a team as a code owner November 3, 2023 08:48
@ujfalusi
Copy link
Contributor Author

ujfalusi commented Nov 3, 2023

400 iteration on local device:

TPLG=/lib/firmware/intel/sof-ipc4-tplg/sof-hda-generic.tplg MODEL=TGLU_UP_HDA_IPC4ZPH SOF_TEST_INTERVAL=5 ~/work/sof-test/test-case/multiple-pause-resume.sh -l 400 -r 5
....
declare -- cmd="kernel_logger_cmd --since=@1699008086"
2023-11-03 10:41:44 UTC [INFO] Starting func_exit_handler(0)
2023-11-03 10:42:58 UTC [INFO] pkill -TERM -f mtrace-reader.py
2023-11-03 10:42:58 UTC [INFO] nlines=1361430 /home/sof/work/sof-test/logs/multiple-pause-resume/2023-11-03-10:47:48-6916/mtrace.txt
2023-11-03 10:44:12 UTC [INFO] ktime=9574 sof-test PID=22522: ending
2023-11-03 10:44:12 UTC [INFO] Test Result: PASS!

Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Except for cleanups, we should stop using pkill and pgrep and use pid_list instead.

@keqiaozhang
Copy link
Contributor

SOFCI TEST

@marc-hb
Copy link
Collaborator

marc-hb commented Nov 9, 2023

Still no jenkins test results published :-(

Screenshot 2023-11-09 at 12 54 26

@marc-hb
Copy link
Collaborator

marc-hb commented Nov 10, 2023

A fix for thesofproject/sof#8445 has been merged. @ujfalusi is this still needed? Naive question sorry.

@ujfalusi
Copy link
Contributor Author

A fix for thesofproject/sof#8445 has been merged. @ujfalusi is this still needed? Naive question sorry.

While now ChainDMA will stop faster (and pause) and we might not see the issue this patch is fixing, I'm fine to close it, but we should note that the pidof expect is not an indication that aplay/arecord has been actually terminated.

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 15, 2024

@ujfalusi I completely rewrote the Expect script in commit 006946c. Can you re-test and see if that makes any difference for this PR?

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 15, 2024

SOFCI TEST

This just failed and seems relevant? Everything else is green.

https://sof-ci.01.org/softestpr/PR1126/build646/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50

t=20373 ms: cmd1 arecord Port0: (48/50) Found   === PAUSE ===  ,  pausing for 27 ms
t=20626 ms: cmd1 arecord Port0: (49/50) Found volume ### | xx%, recording for 43 ms
t=20751 ms: cmd1 arecord Port0: (49/50) Found   === PAUSE ===  ,  pausing for 45 ms
t=21004 ms: cmd1 arecord Port0: (50/50) Found volume ### | xx%, recording for 22 ms
t=21129 ms: cmd1 arecord Port0: (50/50) Found   === PAUSE ===  ,  pausing for 31 ms
t=21129 ms: cmd1 arecord Port0: WARNING: volume was always 00%!
t=21129 ms: cmd1 arecord Port0: SUCCESS: /home/ubuntu/sof-test/case-lib/apause.exp arecord -D hw:0,0 -r 48000 -c 2 -f S16_LE -vv -i /dev/null

2024-07-15 21:19:26 UTC [REMOTE_ERROR] Still have expect process not finished after wait for 250
11836 expect /home/ubuntu/sof-test/case-lib/apause.exp cmd6 aplay Port2 50 20 30 aplay -D hw:0,2 -r 48000 -c 2 -f S16_LE -vv -i /dev/zero
11836 expect /home/ubuntu/sof-test/case-lib/apause.exp cmd6 aplay Port2 50 20 30 aplay -D hw:0,2 -r 48000 -c 2 -f S16_LE -vv -i /dev/zero
11838 aplay -D hw:0,2 -r 48000 -c 2 -f S16_LE -vv -i /dev/zero
pkill aplay with return 0
pkill arecord with return 1
t=252604 ms: cmd6 aplay Port2: ERROR: /home/ubuntu/sof-test/case-lib/apause.exp: EOF.
t=252604 ms: cmd6 aplay Port2: #+                                                 | 00%
process: aplay, status: not found
process: arecord, status: not found
2024-07-15 21:19:27 UTC [REMOTE_WARNING] Kill process catch error
2024-07-15 21:19:27 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2024-07-15 21:19:27 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-07-15 21:19:27 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pause-resume.sh
2024-07-15 21:19:28 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2024-07-15 21:19:28 UTC [REMOTE_INFO] nlines=31147 /home/ubuntu/sof-test/logs/multiple-pause-resume/2024-07-15-21:10:06-1294/mtrace.txt
+ grep -B 2 -A 1 -i --word-regexp -e ERR -e ERROR -e '' -e OSError /home/ubuntu/sof-test/logs/multiple-pause-resume/2024-07-15-21:10:06-1294/mtrace.txt
2024-07-15 21:19:30 UTC [REMOTE_INFO] ktime=1854 sof-test PID=10427: ending
2024-07-15 21:19:30 UTC [REMOTE_INFO] Test Result: FAIL!

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 20, 2024

SOFCI TEST

@marc-hb marc-hb merged commit 6fec8e0 into thesofproject:main Jul 20, 2024
5 of 7 checks passed
@marc-hb
Copy link
Collaborator

marc-hb commented Jul 20, 2024

This looks simple and useful: merged.

@ujfalusi can you please take a look at failure https://sof-ci.01.org/softestpr/PR1126/build682/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50 and the previous one above? Does it give the extra information that you were expecting?

EDIT: here's a timeout example: https://sof-ci.01.org/softestpr/PR1224/build686/devicetest/index.html

@ujfalusi
Copy link
Contributor Author

Both shows that after a set of timeout there is an expect and aplay running.
Looks to me that the internal expect timeout is not triggered and the main script kills expect/aplay, then the expect fort he aplay process hits EOL (aplay is killed).

I'm not sure what to make of the whole issue, but for sure we had a stall aplay process.

What I'm not sure with the rewritten expect script is the impact of the sof-process-kill.sh call... It kills the stall processes and that forces the expect to exit as well, but not sure what would happen if it is not done.

The history of this change was that in certain cases the test reported an error on subsequent iteration, depending on few factors and that was hard to piece back together.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants