kernel panic on reboot since l4tr35.2.1 on kirkstone (jetson-xavier-agx) #1197

elbbit01 · 2023-03-07T17:28:52Z

Hello,
I opened a new topic because, even though this follows on from a previous one, I think it deserves its own topic in case others have the same problem. It comes from here: #1182

First I will describe the state of my layers and how I get to the kernel panic, then paste the kernel panic, and finally list some other observations compared with the behavior of the NVIDIA Ubuntu sample rootfs. I understand that it is a lot of information; I hope I am clear enough and that I am just missing a small detail that triggers all of it.

Summarizing the context:

I am adapting my old A/B update system from the old dunfell branch with cboot to kirkstone with EFI. I first tried with an older commit of the kirkstone branch of meta-tegra, still on L4T 35.1, but nvbootctrl was not working properly. After updating to the latest version of the meta-tegra kirkstone branch, which includes L4T r35.2.1, nvbootctrl works better but I get a kernel panic on every reboot.
I have some changes in the kernel: extra .cfg fragments and a patch adapting the device tree to rootfs redundancy. The device tree is correct (it is redundant and looks as it should), and all the .cfg fragments were working before the update to L4T r35.2.1.
If I build my previous commit (the kirkstone branch as of February 17th), the reboot works. If I build after checking out the latest kirkstone, I get this reboot problem (but a better nvbootctrl).

With that context, here is my lovely kernel panic, in the hope that one of you recognizes what it could be and can give me a hint (in bold, my suspicion of what is important):

[ 222.795879] systemd-shutdown[1]: Watchdog running with a timeout of 10min.
[ 222.818502] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 222.828040] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[ 222.844916] systemd-journald[293]: Received SIGTERM from PID 1 (systemd-shutdow).
[ 222.847098] audit: type=1335 audit(1678204612.177:20): pid=293 uid=0 auid=4294967295 tty=(none) ses=4294967295 subj=kernel comm="systemd-journal" exe="/lib/systemd/systemd-journald" nl-mcgrp=1 op=disconnect res=1
[ 222.860092] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[ 222.875634] systemd-shutdown[1]: Unmounting file systems.
[ 222.877750] [546]: Remounting '/' read-only with options 'n/a'.
[ 222.890700] EXT4-fs (mmcblk0p1): re-mounted. Opts: (null)
[ 222.893337] systemd-shutdown[1]: All filesystems unmounted.
[ 222.893534] systemd-shutdown[1]: Deactivating swaps.
[ 222.893830] systemd-shutdown[1]: All swaps deactivated.
[ 222.893997] systemd-shutdown[1]: Detaching loop devices.
[ 222.898154] systemd-shutdown[1]: All loop devices detached.
[ 222.898305] systemd-shutdown[1]: Stopping MD devices.
[ 222.898804] systemd-shutdown[1]: All MD devices stopped.
[ 222.904067] systemd-shutdown[1]: Detaching DM devices.
[ 222.909467] systemd-shutdown[1]: All DM devices detached.
[ 222.914655] systemd-shutdown[1]: All filesystems, swaps, loop devices, MD devices and DM devices detached.
[ 222.934327] systemd-shutdown[1]: Syncing filesystems and block devices.
[ 222.935328] systemd-shutdown[1]: Rebooting.
[ 222.935472] kvm: exiting hardware virtualization
[ 222.939388] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
[ 222.948455] Mem abort info:
[ 222.951309] ESR = 0x96000004
[ 222.954008] EC = 0x25: DABT (current EL), IL = 32 bits
[ 222.959651] SET = 0, FnV = 0
[ 222.962511] EA = 0, S1PTW = 0
[ 222.965726] Data abort info:
[ 222.968646] ISV = 0, ISS = 0x00000004
[ 222.972395] CM = 0, WnR = 0
[ 222.975198] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000106d03000
[ 222.981685] [0000000000000010] pgd=0000000000000000, p4d=0000000000000000
[ 222.988754] Internal error: Oops: 96000004 [#1] PREEMPT SMP
[ 222.994272] Modules linked in: nvgpu nvmap
[ 222.998285] CPU: 5 PID: 1 Comm: systemd-shutdow Not tainted 5.10.104-l4t-r35.2.1-KS-Gamma1-MinimalOS-1.0.0+02+ #1
[ 223.008713] Hardware name: Unknown Jetson-AGX/Jetson-AGX, BIOS v35.2.1 01/24/2023
[ 223.016236] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[ 223.022521] pc : nvgpu_cond_signal+0x1c/0x50 [nvgpu]
[ 223.027026] lr : nvgpu_cond_signal+0x1c/0x50 [nvgpu]
[ 223.031794] sp : ffff80001004bc30
[ 223.035464] x29: ffff80001004bc30 x28: ffff348b8013e580
[ 223.040632] x27: 0000000000000000 x26: 0000000000000000
[ 223.045973] x25: 0000000000000000 x24: ffffa3679d782c00
[ 223.051396] x23: 0000000000000000 x22: 0000000000000001
[ 223.057182] x21: ffff348b892b01b8 x20: ffff348b892b0000
[ 223.062502] x19: 0000000000000010 x18: ffffffffffffffff
[ 223.067670] x17: 0000000000000000 x16: ffffa3679bafd7ac
[ 223.073182] x15: ffff80009004b957 x14: 0000000000000004
[ 223.078792] x13: 0000000000000000 x12: ffff348b845ad2d8
[ 223.083945] x11: 0000000000000040 x10: 0000000000000a80
[ 223.089458] x9 : ffff80001004b9b0 x8 : 0000000000000004
[ 223.094971] x7 : 0000000000000000 x6 : 0000000000000003
[ 223.100740] x5 : 0000000000000000 x4 : ffffffffffff7648
[ 223.106165] x3 : 0000000000000000 x2 : ffffa3679bc7d170
[ 223.111502] x1 : 0000000000000020 x0 : ffffa3679bc7d170
[ 223.116839] Call trace:
[ 223.119427] nvgpu_cond_signal+0x1c/0x50 [nvgpu]
[ 223.123702] nvgpu_kernel_shutdown_notification+0xa8/0xd0 [nvgpu]
[ 223.129714] blocking_notifier_call_chain+0x78/0xac
[ 223.134535] __do_sys_reboot+0x1cc/0x290
[ 223.138718] __arm64_sys_reboot+0x30/0x40
[ 223.142488] el0_svc_common.constprop.0+0x80/0x1c4
[ 223.147569] do_el0_svc+0x74/0x8c
[ 223.150539] el0_svc+0x1c/0x2c
[ 223.153955] el0_sync_handler+0x9c/0x120
[ 223.157705] el0_sync+0x16c/0x180
[ 223.160961] Code: aa0003f3 d50320ff aa1e03e0 9402d415 (39400260)
[ 223.167072] ---[ end trace e11be0ffdabe8544 ]---
[ 223.171359] Kernel panic - not syncing: Oops: Fatal exception
[ 223.177216] SMP: stopping secondary CPUs
[ 223.180911] Kernel Offset: 0x23678bad0000 from 0xffff800010000000
[ 223.186844] PHYS_OFFSET: 0xffffcb7580000000
[ 223.190957] CPU features: 0x8240002,03802a30
[ 223.195507] Memory Limit: none
[ 223.198412] ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
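For anyone triaging a similar trace, here is a minimal Python sketch (not part of the original post) that pulls the faulting address, the function at `pc`, and the call-trace frames out of an oops like the one above. The function name `summarize_oops` and the regular expressions are illustrative assumptions; they only cover the line shapes seen in this particular log.

```python
import re

# A few lines copied from the panic above; in practice, pass the full console log.
SAMPLE = """\
[ 222.939388] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000010
[ 223.022521] pc : nvgpu_cond_signal+0x1c/0x50 [nvgpu]
[ 223.116839] Call trace:
[ 223.119427] nvgpu_cond_signal+0x1c/0x50 [nvgpu]
[ 223.123702] nvgpu_kernel_shutdown_notification+0xa8/0xd0 [nvgpu]
[ 223.129714] blocking_notifier_call_chain+0x78/0xac
[ 223.134535] __do_sys_reboot+0x1cc/0x290
[ 223.160961] Code: aa0003f3 d50320ff aa1e03e0 9402d415 (39400260)
"""

# Leading "[ 222.939388] " timestamp.
STAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")
# "symbol+0xoff/0xsize" with an optional trailing "[module]".
FRAME = re.compile(
    r"^(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?$"
)


def summarize_oops(log):
    """Return the fault address, pc (symbol, module), and call-trace frames."""
    fault_addr = None
    pc = None
    frames = []
    in_trace = False
    for raw in log.splitlines():
        line = STAMP.sub("", raw).strip()
        addr = re.search(r"at virtual address ([0-9a-f]+)", line)
        if addr:
            fault_addr = int(addr.group(1), 16)
        elif line.startswith("pc :"):
            frame = FRAME.match(line[4:].strip())
            if frame:
                pc = (frame["sym"], frame["mod"])
        elif line == "Call trace:":
            in_trace = True
        elif in_trace:
            frame = FRAME.match(line)
            if frame:
                frames.append((frame["sym"], frame["mod"]))
            else:
                in_trace = False  # first non-frame line ends the trace
    return {"fault_addr": fault_addr, "pc": pc, "frames": frames}


if __name__ == "__main__":
    info = summarize_oops(SAMPLE)
    print(hex(info["fault_addr"]), info["pc"])
    for sym, mod in info["frames"]:
        print(f"  {sym} [{mod or 'vmlinux'}]")
```

Frames without a `[module]` suffix are reported with `None` as the module, which makes it easy to see that the top two frames here come from nvgpu and the rest from the core kernel reboot path.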

I am not sure whether it is related to the problem, but I have been comparing with the sample rootfs from NVIDIA, and these are the differences I found:

After a manual reset, the device starts properly. If I set it to boot from partition B with nvbootctrl, it does so. It does not switch back from B to A, though. It creates the variable in the esp partition in the same way (BootChainFwNext-781e084c-a330-417c-b678-38e696380cb has the proper value "0"), but somehow it is not picked up. On the sample Ubuntu this works.
nvbootctrl dump-slot-info always shows retry_count=2 (instead of 3) for the current slot and retry_count=3 for the other slot, and that is right after a first flash, before any reboot. After kernel panics and reboots it shows the same: the count never goes down, and the bootloader never takes over to fall back to the previous partition.
On the sample Ubuntu, both slots have nothing on the esp after a reboot except one variable file, BootChainFwStatus-781e084c-a330-417c-b678-38e696380cb9. On the first boot after flashing, this variable is not there, but two others are (TegraPlatformCompatSpec-781e084c-a330-417c-b678-38e696380cb9 and TegraPlatformSpec-781e084c-a330-417c-b678-38e696380cb9), and they are gone after a reboot.
On my distribution built with kirkstone, BootChainFwStatus is not even under /sys/firmware/efi/efivars and, of course, is not created as a variable on the esp partition. The other two are always there, even after a reboot (remember, there is a kernel panic, so I have to reset physically).
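To make this comparison repeatable between the two root filesystems, a small Python sketch along these lines could list which of the variables named above exist under efivarfs. This is not from the original post: the function names `read_vars` and `report` are hypothetical, the GUID is the one from the file names quoted above, and the only format assumption is the standard efivarfs layout (a 4-byte attribute header followed by the payload). Reading these files may require root.

```python
from pathlib import Path

# GUID taken from the variable file names quoted in the post.
GUID = "781e084c-a330-417c-b678-38e696380cb9"
# Variable names mentioned in the post.
EXPECTED = (
    "BootChainFwNext",
    "BootChainFwStatus",
    "TegraPlatformCompatSpec",
    "TegraPlatformSpec",
)


def read_vars(efivars_dir="/sys/firmware/efi/efivars", guid=GUID):
    """Map variable name -> payload bytes for every variable with this GUID."""
    suffix = "-" + guid
    found = {}
    for entry in sorted(Path(efivars_dir).iterdir()):
        if entry.name.endswith(suffix):
            raw = entry.read_bytes()
            # efivarfs prefixes each value with 4 bytes of attributes.
            found[entry.name[: -len(suffix)]] = raw[4:]
    return found


def report(found, expected=EXPECTED):
    """Return name -> hex payload, or None when the variable is absent."""
    return {
        name: (found[name].hex() if name in found else None)
        for name in expected
    }


if __name__ == "__main__":
    for name, value in report(read_vars()).items():
        print(f"{name}: {'<missing>' if value is None else value}")
```

Running it once on the sample rootfs and once on the custom image, before and after a reboot, gives two small tables that can be compared directly.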

I think that is all. I hope somebody recognizes something, because I am running out of options.

Best regards,
Alvaro.

dwalkes · 2023-03-07T19:39:59Z

Collaborator

Great writeup, thanks for the detail.

I don't have any ideas, but cross linking what I suspect is relevant source at https://github.com/OE4T/linux-tegra-5.10/blob/5921377f5ffb5b1fbca9e40a187d1059743ef631/nvidia/nvgpu/drivers/gpu/nvgpu/os/linux/module.c#L152 based on the stacktrace.

Is there a way to reproduce this with https://github.com/OE4T/tegra-demo-distro ?

1 reply

elbbit01 Mar 9, 2023
Author

Thanks for the link and for the idea to reproduce it with the demo-distro.

With some light changes, I still cannot reproduce it on the demo-distro. I am working on adapting my kernel patches to it, to see whether it works there or reproduces the error.

In parallel, I will try to find out what I could have done that affects the nvgpu module.

Regards,
Alvaro.

elbbit01 · 2023-03-12T10:50:32Z

Author

Hi, just for information: I solved the kernel panic problem.

It was really tricky.

First of all, the underlying problem had been there for a while, but the kernel panic only occurs with the latest version of L4T. I found out that I also had an error before, but the system still managed to shut down.

I had made a change to my machine configuration and, even though it has a "require jetson-agx-xavier-devkit.conf", it seems that it was not including tegra-firmware at all.

I also did not notice this for a while, since it was a minimal OS without GPU drivers, but I built a small C++ program with Qt dependencies that activated nvgpu, which triggers the kernel panic when trying to unload the kernel module without the rest of the firmware.

Of course, this was not easy to reproduce, because unless all these conditions were met, the problem disappeared.

I hope nobody has this problem again, but just in case: I have had this error visible in dmesg ever since I introduced my own machine configuration, and it goes away once tegra-firmware is installed again:

[ 14.828712] gk20a 17000000.gv11b: Direct firmware load for gv11b/gpmu_ucode_image.bin failed with error -2
[ 14.828742] gk20a 17000000.gv11b: Direct firmware load for tegra19x/gpmu_ucode_image.bin failed with error -2
[ 14.828752] nvgpu: 17000000.gv11b pmu_fw_read:242 [ERR] failed to load pmu ucode!!
[ 14.828765] nvgpu: 17000000.gv11b nvgpu_finalize_poweron:1010 [ERR] Failed initialization for: g->ops.pmu.pmu_early_init
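As a quick check for this condition on a running image, a minimal Python sketch (not from the original post) could scan dmesg output for failed firmware loads and test whether the named files exist. The helper names `failed_firmware_loads` and `present_on_disk` are hypothetical, and `/lib/firmware` as the search root is an assumption; kernels can be configured with other firmware search paths. Error -2 corresponds to ENOENT, i.e. the file was not found.

```python
import re
from pathlib import Path

# Two lines copied from the dmesg excerpt above.
SAMPLE = """\
[ 14.828712] gk20a 17000000.gv11b: Direct firmware load for gv11b/gpmu_ucode_image.bin failed with error -2
[ 14.828742] gk20a 17000000.gv11b: Direct firmware load for tegra19x/gpmu_ucode_image.bin failed with error -2
"""

PATTERN = re.compile(r"Direct firmware load for (\S+) failed with error (-?\d+)")


def failed_firmware_loads(dmesg_text):
    """Return (firmware name, error code) for every failed direct load."""
    return [(m.group(1), int(m.group(2))) for m in PATTERN.finditer(dmesg_text)]


def present_on_disk(names, firmware_root="/lib/firmware"):
    """Map each firmware name to whether it exists under firmware_root."""
    root = Path(firmware_root)
    return {name: (root / name).is_file() for name in names}


if __name__ == "__main__":
    failures = failed_firmware_loads(SAMPLE)
    status = present_on_disk(name for name, _ in failures)
    for name, code in failures:
        state = "present" if status[name] else "missing"
        print(f"{name}: error {code}, {state} under /lib/firmware")
```

On the target, the `SAMPLE` text would be replaced by the real `dmesg` output. If every requested file is reported missing, the firmware package is probably not in the image.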

Anyway, thanks for the support and your work on the layer.

Regards,
Alvaro.

2 replies

dwalkes Mar 12, 2023
Collaborator

Thanks for the follow-up @elbbit01 and glad you solved the problem. Anything we should add at https://github.com/OE4T/meta-tegra/wiki/Creating-a-custom-MACHINE#custom-machine-definitions-for-existing-hardware which would have helped you avoid this?

elbbit01 Mar 20, 2023
Author

Hi, I almost left this unanswered, sorry!

I actually thought I knew how to do it from the normal Yocto documentation and did not look further. Your wiki looks good. Thanks again!
