CrowdPose test: AP (easy), AP (medium), and AP (hard) scores are 0 #8

Open
da13132 opened this issue Oct 28, 2023 · 11 comments

@da13132

da13132 commented Oct 28, 2023

Thank you very much for your excellent work. When I used the Swin-L checkpoint you released to test on the CrowdPose dataset, the AP (easy), AP (medium), and AP (hard) scores were 0 or about 0.05, and the AP score was only 63.4. When I trained on CrowdPose from scratch with ResNet-50 as the backbone, AP (easy), AP (medium), and AP (hard) were still 0. I haven't found the cause of this error yet, and I hope you can help me!

@Michel-liu
Owner

Thanks for your interest in our work. These issues may be helpful: jeffffffli/CrowdPose#3 and HRNet/HigherHRNet-Human-Pose-Estimation#26. I will also check our checkpoint next month; sorry, I am too busy right now.

@da13132
Author

da13132 commented Oct 29, 2023

Thank you for your help. I tried applying the modifications from the two linked issues, but they didn't work. Did you also encounter AP (easy) = 0 when you first developed the code?

@YSLDTZY

YSLDTZY commented Oct 30, 2023

Dear author, I have encountered a similar issue. During training, my loss decreases very slowly: after 60 epochs it has only dropped from about 200 to about 100, and the evaluation scores are all close to 0:
IoU metric: keypoints
Average Precision (AP) @ [IoU=0.50:0.95 | area=all | maxDets=20]=0.000
Average Precision (AP) @ [IoU=0.50 | area=all | maxDets=20]=0.000
Average Precision (AP) @ [IoU=0.75 | area=all | maxDets=20]=0.000
Average Precision (AP) @ [IoU=0.50:0.95 | area=medium | maxDets=20]=0.000
Average Precision (AP) @ [IoU=0.50:0.95 | area=large | maxDets=20]=0.000
Average Recall (AR) @ [IoU=0.50:0.95 | area=all | maxDets=20]=0.003
Average Recall (AR) @ [IoU=0.50 | area=all | maxDets=20]=0.014
Average Recall (AR) @ [IoU=0.75 | area=all | maxDets=20]=0.001
Average Recall (AR) @ [IoU=0.50:0.95 | area=medium | maxDets=20]=0.001
Average Recall (AR) @ [IoU=0.50:0.95 | area=large | maxDets=20]=0.006
My training command is:
python -m torch.distributed.launch --nproc_per_node=2 --master_port 30694 /17106/TZY/GroupPose-main/GroupPose-main/main.py -c /17106/TZY/GroupPose-main/Configure/grouppose.py --coco_path /17106/TZY/co --output_dir /17106/TZY/GroupPose-main/GroupPose-main/output
Log from the first iteration of training (epoch 0):
Epoch: [0] [ 0/7075] eta: 3:00:53 lr: 0.000100 class_error: 15.28 loss: 258.0212 (258.0212) loss_ce: 1.9009 (1.9009) loss_ce_0: 2.0638 (2.0638) loss_ce_1: 2.3746 (2.3746) loss_ce_2: 1.7082 (1.7082) loss_ce_3: 2.0051 (2.0051) loss_ce_4: 1.9215 (1.9215) loss_ce_interm: 2.2982 (2.2982) loss_keypoints: 31.0085 (31.0085) loss_keypoints_0: 31.0085 (31.0085) loss_keypoints_1: 31.0085 (31.0085) loss_keypoints_2: 31.0085 (31.0085) loss_keypoints_3: 31.0085 (31.0085) loss_keypoints_4: 31.0085 (31.0085) loss_keypoints_interm: 31.0121 (31.0121) loss_oks: 3.8123 (3.8123) loss_oks_0: 3.8123 (3.8123) loss_oks_1: 3.8123 (3.8123) loss_oks_2: 3.8123 (3.8123) loss_oks_3: 3.8123 (3.8123) loss_oks_4: 3.8123 (3.8123) loss_oks_interm: 3.8123 (3.8123) class_error_unscaled: 15.2778 (15.2778) loss_ce_unscaled: 0.9505 (0.9505) loss_ce_0_unscaled: 1.0319 (1.0319) loss_ce_1_unscaled: 1.1873 (1.1873) loss_ce_2_unscaled: 0.8541 (0.8541) loss_ce_3_unscaled: 1.0025 (1.0025) loss_ce_4_unscaled: 0.9607 (0.9607) loss_ce_interm_unscaled: 1.1491 (1.1491) loss_keypoints_unscaled: 3.1008 (3.1008) loss_keypoints_0_unscaled: 3.1008 (3.1008) loss_keypoints_1_unscaled: 3.1008 (3.1008) loss_keypoints_2_unscaled: 3.1008 (3.1008) loss_keypoints_3_unscaled: 3.1008 (3.1008) loss_keypoints_4_unscaled: 3.1008 (3.1008) loss_keypoints_interm_unscaled: 3.1012 (3.1012) loss_oks_unscaled: 0.9531 (0.9531) loss_oks_0_unscaled: 0.9531 (0.9531) loss_oks_1_unscaled: 0.9531 (0.9531) loss_oks_2_unscaled: 0.9531 (0.9531) loss_oks_3_unscaled: 0.9531 (0.9531) loss_oks_4_unscaled: 0.9531 (0.9531) loss_oks_interm_unscaled: 0.9531 (0.9531) set_class_unscaled: 0.9961 (0.9961) set_class_0_unscaled: 1.0696 (1.0696) set_class_1_unscaled: 1.1464 (1.1464) set_class_2_unscaled: 0.8642 (0.8642) set_class_3_unscaled: 1.0210 (1.0210) set_class_4_unscaled: 0.9893 (0.9893) set_class_interm_unscaled: 1.1464 (1.1464) set_keypoints_unscaled: 7.8091 (7.8091) set_keypoints_0_unscaled: 7.8091 (7.8091) set_keypoints_1_unscaled: 7.8091 (7.8091) 
set_keypoints_2_unscaled: 7.8091 (7.8091) set_keypoints_3_unscaled: 7.8091 (7.8091) set_keypoints_4_unscaled: 7.8091 (7.8091) set_keypoints_interm_unscaled: 7.8091 (7.8091) time: 1.5341 data: 0.0470 max mem: 7013

Log from the last iteration of training (epoch 59):
Epoch: [59] [7074/7075] eta: 0:00:00 lr: 0.000010 class_error: 0.00 loss: 97.1752 (107.5205) loss_ce: 0.8269 (0.8742) loss_ce_0: 0.8250 (0.8765) loss_ce_1: 0.8282 (0.8757) loss_ce_2: 0.8362 (0.8806) loss_ce_3: 0.8302 (0.8735) loss_ce_4: 0.8297 (0.8736) loss_ce_interm: 0.8417 (0.8987) loss_keypoints: 9.7989 (11.3177) loss_keypoints_0: 9.9545 (11.5409) loss_keypoints_1: 9.9025 (11.4450) loss_keypoints_2: 9.9144 (11.4255) loss_keypoints_3: 9.8633 (11.3787) loss_keypoints_4: 9.8677 (11.3704) loss_keypoints_interm: 10.0733 (11.8164) loss_oks: 2.8996 (2.9902) loss_oks_0: 2.9507 (3.0196) loss_oks_1: 2.8922 (3.0087) loss_oks_2: 2.9134 (3.0060) loss_oks_3: 2.9199 (2.9955) loss_oks_4: 2.9171 (2.9949) loss_oks_interm: 3.0561 (3.0582) class_error_unscaled: 0.0000 (0.0247) loss_ce_unscaled: 0.4134 (0.4371) loss_ce_0_unscaled: 0.4125 (0.4382) loss_ce_1_unscaled: 0.4141 (0.4379) loss_ce_2_unscaled: 0.4181 (0.4403) loss_ce_3_unscaled: 0.4151 (0.4367) loss_ce_4_unscaled: 0.4148 (0.4368) loss_ce_interm_unscaled: 0.4208 (0.4494) loss_keypoints_unscaled: 0.9799 (1.1318) loss_keypoints_0_unscaled: 0.9955 (1.1541) loss_keypoints_1_unscaled: 0.9903 (1.1445) loss_keypoints_2_unscaled: 0.9914 (1.1425) loss_keypoints_3_unscaled: 0.9863 (1.1379) loss_keypoints_4_unscaled: 0.9868 (1.1370) loss_keypoints_interm_unscaled: 1.0073 (1.1816) loss_oks_unscaled: 0.7249 (0.7476) loss_oks_0_unscaled: 0.7377 (0.7549) loss_oks_1_unscaled: 0.7231 (0.7522) loss_oks_2_unscaled: 0.7284 (0.7515) loss_oks_3_unscaled: 0.7300 (0.7489) loss_oks_4_unscaled: 0.7293 (0.7487) loss_oks_interm_unscaled: 0.7640 (0.7645) set_class_unscaled: 0.3826 (0.3858) set_class_0_unscaled: 0.3852 (0.3888) set_class_1_unscaled: 0.3880 (0.3913) set_class_2_unscaled: 0.3795 (0.3793) set_class_3_unscaled: 0.3804 (0.3856) set_class_4_unscaled: 0.3754 (0.3827) set_class_interm_unscaled: 0.3664 (0.3689) set_keypoints_unscaled: 5.6254 (6.0854) set_keypoints_0_unscaled: 5.6518 (6.0633) set_keypoints_1_unscaled: 5.6157 (6.0832) 
set_keypoints_2_unscaled: 5.6051 (6.0713) set_keypoints_3_unscaled: 5.5765 (6.0492) set_keypoints_4_unscaled: 5.5902 (6.0639) set_keypoints_interm_unscaled: 5.7375 (6.0779) time: 0.3616 data: 0.0433 max mem: 16698

I hope you can help me.
Thank you.
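
To compare the two log lines above without reading them by eye, the `name: current (running average)` pairs can be pulled out with a regular expression. This is only a minimal sketch: `parse_metrics` is a hypothetical helper, not part of the GroupPose code, and the two sample strings are shortened excerpts of the lines posted above.

```python
import re

# Matches "name: current (running_average)" pairs as printed in the logs above.
# Fields without a parenthesised average (lr, class_error, eta) are skipped.
METRIC_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?) \((-?\d+(?:\.\d+)?)\)")


def parse_metrics(log_line):
    """Return {metric_name: (current_value, running_average)} for one log line."""
    return {
        name: (float(current), float(average))
        for name, current, average in METRIC_RE.findall(log_line)
    }


# Shortened excerpts of the first and last log lines from this comment.
first = ("Epoch: [0] [ 0/7075] eta: 3:00:53 lr: 0.000100 class_error: 15.28 "
         "loss: 258.0212 (258.0212) loss_keypoints: 31.0085 (31.0085) "
         "loss_oks: 3.8123 (3.8123)")
last = ("Epoch: [59] [7074/7075] eta: 0:00:00 lr: 0.000010 class_error: 0.00 "
        "loss: 97.1752 (107.5205) loss_keypoints: 9.7989 (11.3177) "
        "loss_oks: 2.8996 (2.9902)")

for name in ("loss", "loss_keypoints", "loss_oks"):
    start = parse_metrics(first)[name][0]   # value at the first iteration
    end = parse_metrics(last)[name][1]      # running average at the end
    print(f"{name}: {start:.4f} -> {end:.4f} ({end / start:.0%} of initial)")
```

Comparing the per-term ratios, rather than only the total loss, shows which loss component is stalling.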

@da13132
Author

da13132 commented Oct 30, 2023

You trained on a single card, right? Your log shows that the loss has not really decreased, which may be caused by the learning rate; that would explain why your overall AP is 0. Your AP (easy) result, however, is consistent with my problem.

@Michel-liu
Owner

Sorry for the delayed reply. Yes, don't worry about AP=0. I fixed it successfully before by following instructions from many issues on the web (including the two above), but that was a long time ago and I have forgotten the details. I am sure there are some bugs in the CrowdPose evaluation API. How about writing the results to a JSON file and then running the evaluation code on it? That should work.

I will check the CrowdPose evaluation API in detail later. If you prefer, you can also check it yourself and share what you find in this issue. @da13132
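
The suggestion above (write results to a JSON file, then evaluate offline) can be sketched as follows. This is not the repository's code. It assumes the COCO-style keypoint results layout (`image_id`, `category_id`, flat `keypoints`, `score`), 14 joints per person as in CrowdPose, and `category_id` 1 for the person class. `to_result_entries`, `save_results`, and `evaluate` are hypothetical helper names, and `evaluate` assumes `crowdposetools` is installed and mirrors the pycocotools interface.

```python
import json

NUM_KEYPOINTS = 14  # CrowdPose annotates 14 joints per person


def to_result_entries(image_id, poses, scores):
    """Convert one image's predictions into COCO-style keypoint result dicts.

    poses:  list of per-person lists of (x, y, confidence) triples.
    scores: one detection score per person.
    """
    entries = []
    for pose, score in zip(poses, scores):
        if len(pose) != NUM_KEYPOINTS:
            raise ValueError(f"expected {NUM_KEYPOINTS} joints, got {len(pose)}")
        flat = [float(v) for joint in pose for v in joint]  # [x1, y1, c1, x2, ...]
        entries.append({
            "image_id": int(image_id),
            "category_id": 1,  # assumption: person category id is 1
            "keypoints": flat,
            "score": float(score),
        })
    return entries


def save_results(entries, path):
    """Write all result entries to a single JSON file."""
    with open(path, "w") as f:
        json.dump(entries, f)


def evaluate(gt_json, result_json):
    """Run keypoint evaluation offline (assumes crowdposetools is installed)."""
    from crowdposetools.coco import COCO
    from crowdposetools.cocoeval import COCOeval

    gt = COCO(gt_json)
    dt = gt.loadRes(result_json)
    coco_eval = COCOeval(gt, dt, "keypoints")
    coco_eval.evaluate()
    coco_eval.accumulate()
    coco_eval.summarize()
    return coco_eval.stats
```

Keeping the dump and the evaluation as separate steps makes it possible to inspect the JSON by hand, or to feed the same file to a different evaluation package, when a score looks wrong.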

@YSLDTZY

YSLDTZY commented Oct 30, 2023

Bro, I'm using 4 cards.

@da13132
Author

da13132 commented Oct 30, 2023

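When I tried single-card training, I also saw the loss fail to decrease, which may be caused by the learning rate and batch size settings. I am currently using a single card with LR=0.0001 and batch size=8, and training works normally. My initial settings were a single card with batch size=1 and LR=0.001, and in that case the loss did not decrease. You could try adjusting the learning rate and batch size; training may then return to normal.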

@da13132
Author

da13132 commented Oct 30, 2023

Thanks for your reply. I'll try testing from a JSON file. Beyond that, I will try replacing crowdposetools with xtcocotools. If I find anything further, I will share it here.

@YSLDTZY

YSLDTZY commented Oct 30, 2023

Thanks, bro, I'll give it a try.

@Michel-liu
Owner

Thanks for your interest in our work. The model is sensitive to the learning rate and batch size. To align with previous methods, we have not tested our models with smaller batch sizes. I therefore strongly recommend following the official training settings to reproduce the reported performance. You can, however, try some common rules, such as linear adjustment. You are also welcome to share your experience in this issue. @YSLDTZY
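
One common reading of "linear adjustment" is the linear scaling heuristic: scale the learning rate in proportion to the total batch size (number of GPUs times batch size per GPU). The sketch below only illustrates that heuristic. `linearly_scaled_lr` is a hypothetical helper, and the reference total batch size of 16 is a placeholder rather than the repository's official setting; only the base learning rate of 0.0001 is taken from the log posted earlier in this thread.

```python
def linearly_scaled_lr(base_lr, base_total_batch, num_gpus, batch_per_gpu):
    """Scale the learning rate in proportion to the total batch size.

    base_lr and base_total_batch describe the reference configuration;
    num_gpus * batch_per_gpu is the total batch size actually used.
    """
    if base_total_batch <= 0 or num_gpus <= 0 or batch_per_gpu <= 0:
        raise ValueError("batch sizes and GPU count must be positive")
    total_batch = num_gpus * batch_per_gpu
    return base_lr * total_batch / base_total_batch


# Placeholder reference: lr 1e-4 at a total batch of 16 (hypothetical value).
BASE_LR = 1e-4
BASE_TOTAL_BATCH = 16

for gpus, per_gpu in [(1, 8), (2, 2), (4, 4)]:
    lr = linearly_scaled_lr(BASE_LR, BASE_TOTAL_BATCH, gpus, per_gpu)
    print(f"{gpus} GPU(s) x batch {per_gpu}: lr = {lr:.2e}")
```

Since the model is reported to be sensitive to both settings, a value obtained this way is best treated as a starting point to verify against the loss curve, not as a guarantee of matching the reported performance.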

@YSLDTZY

YSLDTZY commented Oct 30, 2023

Thank you for your answer. I trained with the official parameters; the only difference was the number of cards used for training.
