
Best training strategy for Qwen2-series models #5

Open
Siegfried-qgf opened this issue Dec 11, 2024 · 33 comments

Comments

@Siegfried-qgf

Hi! I'd like to train Qwen2-series models on a single machine with 8 A100 GPUs. How should I modify Galvatron's code to find the best training strategy? Which modules need changes, and what is the code's execution flow?

@Fizzmy
Collaborator

Fizzmy commented Dec 11, 2024

Thanks for your interest! You can follow this tutorial to run the complete profile, search, and training workflow. For Qwen-series models, you can use the scripts under galvatron/models/llama_hf. Specifically, add a JSON file with your desired model configuration to the meta_configs directory (following the existing format), add your model name to the choices of model_size in arguments.py, and update path_dict in meta_configs/config_utils.py so that the model name maps to the configuration JSON file. Then run the code as described in the tutorial; note that you need to change model_size in the relevant scripts to your custom model name. We provide a JSON example of the qwen2.5-72b configuration in the repository that you can use as a reference.
If you have any further questions, feel free to ask!
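For concreteness, a minimal sketch of the two registration points described above; the model name qwen2-7b, the file names, and the surrounding entries are illustrative assumptions, not the repository's actual contents:

import argparse

# meta_configs/config_utils.py: map model names to config JSON files
# (mirroring the existing llama entries).
path_dict = {
    "llama-13b": "llama-13b.json",
    "qwen2.5-72b": "qwen2.5-72b.json",
    "qwen2-7b": "qwen2-7b.json",  # hypothetical new entry
}

# arguments.py: add the same name to the model_size choices.
parser = argparse.ArgumentParser()
parser.add_argument("--model_size", type=str, default="llama-13b",
                    choices=list(path_dict.keys()))
print(parser.parse_args(["--model_size", "qwen2-7b"]))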

@Siegfried-qgf
Author

Siegfried-qgf commented Dec 14, 2024

Thank you very much for the explanation. I'm now working on the computation-time and memory profiling steps; the llama time and memory profiling results you shared are shown below.

llama computation profile
{ "layernum[6]_bsz2": 77.27328643798828, "layernum[12]_bsz2": 143.17074584960938, "layertype_0": 4.789421272277832, "layernum[2]_bsz2": 31.147062301635746, "layernum[4]_bsz2": 54.69291191101074, "layernum[2]_bsz8": 109.32056350708005, "layernum[4]_bsz8": 194.11891174316403, "layernum[2]_bsz1": 18.34820804595947, "layernum[2]_bsz3": 44.21484451293946, "layernum[2]_bsz4": 56.972291564941415, "layernum[2]_bsz5": 72.85300216674804, "layernum[2]_bsz6": 86.12442932128906, "layernum[2]_bsz7": 96.37412567138674, "layernum[2]_bsz9": 125.0731384277344, "layernum[2]_bsz10": 138.55611572265624, "layernum[2]_bsz11": 152.41731567382814, "layernum[2]_bsz12": 162.45811767578127, "layernum[2]_bsz13": 178.4720306396484, "layernum[2]_bsz14": 189.28246612548827, "layernum[4]_bsz1": 30.91028499603272, "layernum[4]_bsz3": 77.54770202636718, "layernum[4]_bsz4": 99.15223999023438, "layernum[4]_bsz5": 124.74575653076172, "layernum[4]_bsz6": 145.07332458496094, "layernum[4]_bsz7": 167.81986389160159, "layernum[4]_bsz9": 214.1194534301758, "layernum[4]_bsz10": 237.16009979248045, "layernum[4]_bsz11": 260.8160919189453, "layernum[4]_bsz12": 284.8269073486328, "layernum[4]_bsz13": 307.13702087402345, "layernum[4]_bsz14": 332.1334320068359, "layertype_0_bsz1": 6.281038475036624, "layertype_other_0_bsz1": 5.786131095886223, "layertype_0_bsz2": 5.886462402343748, "layertype_other_0_bsz2": 3.800606346130376, "layertype_0_bsz3": 5.555476252237953, "layertype_other_0_bsz3": 3.627328999837246, "layertype_0_bsz4": 5.27249355316162, "layertype_other_0_bsz4": 3.698085784912113, "layertype_0_bsz5": 5.189275436401369, "layertype_other_0_bsz5": 4.192049560546872, "layertype_0_bsz6": 4.912407938639324, "layertype_other_0_bsz6": 4.529255676269528, "layertype_0_bsz7": 5.1032670157296325, "layertype_other_0_bsz7": 3.5611982073102695, "layertype_0_bsz8": 5.299896764755249, "layertype_other_0_bsz8": 3.065276908874509, "layertype_0_bsz9": 4.947017500135633, "layertype_other_0_bsz9": 4.002980380588114, "layertype_0_bsz10": 4.9301992034912105, "layertype_other_0_bsz10": 3.9952131652832037, "layertype_0_bsz11": 4.927217102050779, "layertype_other_0_bsz11": 4.00168540261009, "layertype_0_bsz12": 5.0986995697021475, "layertype_other_0_bsz12": 3.3407773335774777, "layertype_0_bsz13": 4.948653470552886, "layertype_other_0_bsz13": 3.8313108004056438, "layertype_0_bsz14": 5.10182021004813, "layertype_other_0_bsz14": 3.3165357317243314 }

llama memory profile
{ "1_1_8": { "layernum[1]_bsz8_rank0_ms": 918.7900390625, "layernum[1]_bsz8_rank0_act": 457.322265625, "layernum[1]_bsz8_rank0_act_peak": 1277.06005859375, "layernum[1]_bsz8_rank7_ms": 918.7900390625, "layernum[1]_bsz8_rank7_act": 457.322265625, "layernum[1]_bsz8_rank7_act_peak": 1277.06005859375, "layernum[2]_bsz8_rank0_ms": 1305.30615234375, "layernum[2]_bsz8_rank0_act": 758.60400390625, "layernum[2]_bsz8_rank0_act_peak": 1600.853515625, "layernum[2]_bsz8_rank7_ms": 1305.30615234375, "layernum[2]_bsz8_rank7_act": 758.60400390625, "layernum[2]_bsz8_rank7_act_peak": 1600.853515625 }, "1_2_4": { "layernum[1]_bsz8_rank0_ms": 918.8056640625, "layernum[1]_bsz8_rank0_act": 537.353515625, "layernum[1]_bsz8_rank0_act_peak": 936.84375, "layernum[1]_bsz8_rank7_ms": 918.8056640625, "layernum[1]_bsz8_rank7_act": 537.353515625, "layernum[1]_bsz8_rank7_act_peak": 936.84375, "layernum[2]_bsz8_rank0_ms": 1305.33740234375, "layernum[2]_bsz8_rank0_act": 901.66650390625, "layernum[2]_bsz8_rank0_act_peak": 1204.64892578125, "layernum[2]_bsz8_rank7_ms": 1305.33740234375, "layernum[2]_bsz8_rank7_act": 901.66650390625, "layernum[2]_bsz8_rank7_act_peak": 1204.64892578125 }, "1_4_2": { "layernum[1]_bsz8_rank0_ms": 918.8369140625, "layernum[1]_bsz8_rank0_act": 697.416015625, "layernum[1]_bsz8_rank0_act_peak": 1096.8984375, "layernum[1]_bsz8_rank7_ms": 918.8369140625, "layernum[1]_bsz8_rank7_act": 697.416015625, "layernum[1]_bsz8_rank7_act_peak": 1096.8984375, "layernum[2]_bsz8_rank0_ms": 1305.39990234375, "layernum[2]_bsz8_rank0_act": 1189.79150390625, "layernum[2]_bsz8_rank0_act_peak": 1492.75830078125, "layernum[2]_bsz8_rank7_ms": 1305.39990234375, "layernum[2]_bsz8_rank7_act": 1189.79150390625, "layernum[2]_bsz8_rank7_act_peak": 1492.75830078125 }, "1_8_1": { "layernum[1]_bsz8_rank0_ms": 918.8994140625, "layernum[1]_bsz8_rank0_act": 1018.541015625, "layernum[1]_bsz8_rank0_act_peak": 1417.0078125, "layernum[1]_bsz8_rank7_ms": 918.8994140625, "layernum[1]_bsz8_rank7_act": 1018.541015625, "layernum[1]_bsz8_rank7_act_peak": 1417.0078125, "layernum[2]_bsz8_rank0_ms": 1305.52490234375, "layernum[2]_bsz8_rank0_act": 1767.04150390625, "layernum[2]_bsz8_rank0_act_peak": 2069.97705078125, "layernum[2]_bsz8_rank7_ms": 1305.52490234375, "layernum[2]_bsz8_rank7_act": 1767.04150390625, "layernum[2]_bsz8_rank7_act_peak": 2069.97705078125 }, "1_1_8_c": { "layernum[1]_bsz8_rank0_ms": 918.7900390625, "layernum[1]_bsz8_rank0_act": 173.04052734375, "layernum[1]_bsz8_rank0_act_peak": 1357.06005859375, "layernum[1]_bsz8_rank7_ms": 918.7900390625, "layernum[1]_bsz8_rank7_act": 173.04052734375, "layernum[1]_bsz8_rank7_act_peak": 1357.06005859375, "layernum[2]_bsz8_rank0_ms": 1306.30224609375, "layernum[2]_bsz8_rank0_act": 189.04052734375, "layernum[2]_bsz8_rank0_act_peak": 1396.57177734375, "layernum[2]_bsz8_rank7_ms": 1305.30615234375, "layernum[2]_bsz8_rank7_act": 189.04052734375, "layernum[2]_bsz8_rank7_act_peak": 1397.56787109375 }, "1_2_4_c": { "layernum[1]_bsz8_rank0_ms": 918.8056640625, "layernum[1]_bsz8_rank0_act": 205.04052734375, "layernum[1]_bsz8_rank0_act_peak": 886.322265625, "layernum[1]_bsz8_rank7_ms": 918.8056640625, "layernum[1]_bsz8_rank7_act": 205.04052734375, "layernum[1]_bsz8_rank7_act_peak": 886.322265625, "layernum[2]_bsz8_rank0_ms": 1305.33740234375, "layernum[2]_bsz8_rank0_act": 237.04052734375, "layernum[2]_bsz8_rank0_act_peak": 885.337890625, "layernum[2]_bsz8_rank7_ms": 1305.33740234375, "layernum[2]_bsz8_rank7_act": 237.04052734375, "layernum[2]_bsz8_rank7_act_peak": 885.337890625 }, "1_4_2_c": { 
"layernum[1]_bsz8_rank0_ms": 918.8369140625, "layernum[1]_bsz8_rank0_act": 269.04052734375, "layernum[1]_bsz8_rank0_act_peak": 992.392578125, "layernum[1]_bsz8_rank7_ms": 918.8369140625, "layernum[1]_bsz8_rank7_act": 269.04052734375, "layernum[1]_bsz8_rank7_act_peak": 993.392578125, "layernum[2]_bsz8_rank0_ms": 1305.39990234375, "layernum[2]_bsz8_rank0_act": 333.04052734375, "layernum[2]_bsz8_rank0_act_peak": 991.892578125, "layernum[2]_bsz8_rank7_ms": 1305.39990234375, "layernum[2]_bsz8_rank7_act": 334.04052734375, "layernum[2]_bsz8_rank7_act_peak": 992.892578125 }, "1_8_1_c": { "layernum[1]_bsz8_rank0_ms": 918.8994140625, "layernum[1]_bsz8_rank0_act": 397.04052734375, "layernum[1]_bsz8_rank0_act_peak": 1413.751953125, "layernum[1]_bsz8_rank7_ms": 918.8994140625, "layernum[1]_bsz8_rank7_act": 397.04052734375, "layernum[1]_bsz8_rank7_act_peak": 1413.751953125, "layernum[2]_bsz8_rank0_ms": 1305.52490234375, "layernum[2]_bsz8_rank0_act": 525.04052734375, "layernum[2]_bsz8_rank0_act_peak": 1412.751953125, "layernum[2]_bsz8_rank7_ms": 1305.52490234375, "layernum[2]_bsz8_rank7_act": 525.04052734375, "layernum[2]_bsz8_rank7_act_peak": 1412.751953125 }, "2_1_4": { "layernum[2]_bsz8_rank0_ms": 1323.80517578125, "layernum[2]_bsz8_rank0_act": 632.57861328125, "layernum[2]_bsz8_rank0_act_peak": 980.65673828125, "layernum[2]_bsz8_rank7_ms": 1355.8134765625, "layernum[2]_bsz8_rank7_act": 914.5205078125, "layernum[2]_bsz8_rank7_act_peak": 1174.615234375 }, "2_2_2": { "layernum[2]_bsz8_rank0_ms": 1324.81298828125, "layernum[2]_bsz8_rank0_act": 792.67236328125, "layernum[2]_bsz8_rank0_act_peak": 938.76611328125, "layernum[2]_bsz8_rank7_ms": 1419.8291015625, "layernum[2]_bsz8_rank7_act": 1074.6142578125, "layernum[2]_bsz8_rank7_act_peak": 1286.716796875 }, "2_4_1": { "layernum[2]_bsz8_rank0_ms": 1323.87548828125, "layernum[2]_bsz8_rank0_act": 1112.85986328125, "layernum[2]_bsz8_rank0_act_peak": 1203.53173828125, "layernum[2]_bsz8_rank7_ms": 1548.8603515625, "layernum[2]_bsz8_rank7_act": 1394.923828125, "layernum[2]_bsz8_rank7_act_peak": 1606.904296875 }, "4_1_2": { "layernum[4]_bsz8_rank0_ms": 2624.87548828125, "layernum[4]_bsz8_rank0_act": 1265.14111328125, "layernum[4]_bsz8_rank0_act_peak": 1493.26611328125, "layernum[4]_bsz8_rank7_ms": 2688.9072265625, "layernum[4]_bsz8_rank7_act": 1829.0390625, "layernum[4]_bsz8_rank7_act_peak": 1943.24169921875 }, "4_2_1": { "layernum[4]_bsz8_rank0_ms": 2624.93798828125, "layernum[4]_bsz8_rank0_act": 1585.32861328125, "layernum[4]_bsz8_rank0_act_peak": 1643.48486328125, "layernum[4]_bsz8_rank7_ms": 2816.9697265625, "layernum[4]_bsz8_rank7_act": 2149.470703125, "layernum[4]_bsz8_rank7_act_peak": 2263.65771484375 }, "layertype_0": { "parameter_size": 773.2509765625, "tp_activation_per_bsz_dict": { "1": 301.28173828125, "2": 182.156494140625, "4": 123.0938720703125, "8": 93.56256103515625, "checkpoint": 16.0 } }, "other_memory_pp_off": { "model_states": 4258.0, "activation": 572.53076171875 }, "other_memory_pp_on_first": { "model_states": 2157.0, "activation": 46.5582275390625 }, "other_memory_pp_on_last": { "model_states": 2285.0, "activation": 184.5286865234375 }, "layertype_0_sp": { "parameter_size": 773.0166015625, "tp_activation_per_bsz_dict": { "1": 430.2666015625, "2": 223.13330078125, "4": 111.566650390625, "8": 55.7833251953125, "checkpoint": 16.0 } }, "other_memory_pp_off_sp": { "model_states": { "1": 4286.2578125, "2": 2209.376953125, "4": 1168.8447265625, "8": 648.57861328125 }, "activation": { "1": 749.54736328125, "2": 312.853759765625, "4": 
140.6016845703125, "8": 66.53814697265625 } }, "other_memory_pp_on_first_sp": { "model_states": { "1": 2036.8134765625, "2": 1148.51611328125, "4": 654.39111328125, "8": 327.195556640625 }, "activation": { "1": 79.4915771484375, "2": 24.26336669921875, "4": 17.14447021484375, "8": 8.572235107421875 } }, "other_memory_pp_on_last_sp": { "model_states": { "1": 2164.8759765625, "2": 1148.57861328125, "4": 653.51611328125, "8": 326.758056640625 }, "activation": { "1": 639.0853271484375, "2": 304.08563232421875, "4": 151.90032958984375, "8": 75.95016479492188 } }, "1_1_8_sp": { "layernum[1]_bsz8_rank0_ms": 922.29052734375, "layernum[1]_bsz8_rank0_act": 776.28515625, "layernum[1]_bsz8_rank0_act_peak": 1265.044921875, "layernum[1]_bsz8_rank7_ms": 922.29052734375, "layernum[1]_bsz8_rank7_act": 776.28515625, "layernum[1]_bsz8_rank7_act_peak": 1265.044921875, "layernum[2]_bsz8_rank0_ms": 1308.798828125, "layernum[2]_bsz8_rank0_act": 1206.5517578125, "layernum[2]_bsz8_rank0_act_peak": 1734.8232421875, "layernum[2]_bsz8_rank7_ms": 1308.798828125, "layernum[2]_bsz8_rank7_act": 1206.5517578125, "layernum[2]_bsz8_rank7_act_peak": 1734.8232421875 }, "1_2_4_sp": { "layernum[1]_bsz8_rank0_ms": 938.32177734375, "layernum[1]_bsz8_rank0_act": 792.28515625, "layernum[1]_bsz8_rank0_act_peak": 1195.83740234375, "layernum[1]_bsz8_rank7_ms": 938.32177734375, "layernum[1]_bsz8_rank7_act": 792.28515625, "layernum[1]_bsz8_rank7_act_peak": 1195.83740234375, "layernum[2]_bsz8_rank0_ms": 1324.353515625, "layernum[2]_bsz8_rank0_act": 1238.5517578125, "layernum[2]_bsz8_rank0_act_peak": 1545.59619140625, "layernum[2]_bsz8_rank7_ms": 1324.353515625, "layernum[2]_bsz8_rank7_act": 1238.5517578125, "layernum[2]_bsz8_rank7_act_peak": 1545.59619140625 }, "1_2_4_vtp_sp": { "layernum[1]_bsz8_rank0_ms": 938.3759765625, "layernum[1]_bsz8_rank0_act": 792.33056640625, "layernum[1]_bsz8_rank0_act_peak": 1071.97412109375, "layernum[1]_bsz8_rank7_ms": 938.3759765625, "layernum[1]_bsz8_rank7_act": 792.33056640625, "layernum[1]_bsz8_rank7_act_peak": 1071.97412109375, "layernum[2]_bsz8_rank0_ms": 1324.40771484375, "layernum[2]_bsz8_rank0_act": 1238.59716796875, "layernum[2]_bsz8_rank0_act_peak": 1421.73291015625, "layernum[2]_bsz8_rank7_ms": 1324.40771484375, "layernum[2]_bsz8_rank7_act": 1238.59716796875, "layernum[2]_bsz8_rank7_act_peak": 1421.73291015625 }, "1_4_2_sp": { "layernum[1]_bsz8_rank0_ms": 970.35302734375, "layernum[1]_bsz8_rank0_act": 792.28515625, "layernum[1]_bsz8_rank0_act_peak": 1195.82958984375, "layernum[1]_bsz8_rank7_ms": 970.35302734375, "layernum[1]_bsz8_rank7_act": 792.28515625, "layernum[1]_bsz8_rank7_act_peak": 1195.82958984375, "layernum[2]_bsz8_rank0_ms": 1356.416015625, "layernum[2]_bsz8_rank0_act": 1238.5517578125, "layernum[2]_bsz8_rank0_act_peak": 1545.58056640625, "layernum[2]_bsz8_rank7_ms": 1356.416015625, "layernum[2]_bsz8_rank7_act": 1238.5517578125, "layernum[2]_bsz8_rank7_act_peak": 1545.58056640625 }, "1_4_2_vtp_sp": { "layernum[1]_bsz8_rank0_ms": 970.4853515625, "layernum[1]_bsz8_rank0_act": 792.38525390625, "layernum[1]_bsz8_rank0_act_peak": 1008.67333984375, "layernum[1]_bsz8_rank7_ms": 970.4853515625, "layernum[1]_bsz8_rank7_act": 792.38525390625, "layernum[1]_bsz8_rank7_act_peak": 1008.67333984375, "layernum[2]_bsz8_rank0_ms": 1356.54833984375, "layernum[2]_bsz8_rank0_act": 1240.65185546875, "layernum[2]_bsz8_rank0_act_peak": 1360.42431640625, "layernum[2]_bsz8_rank7_ms": 1356.54833984375, "layernum[2]_bsz8_rank7_act": 1240.65185546875, "layernum[2]_bsz8_rank7_act_peak": 1360.42431640625 }, 
"1_8_1_sp": { "layernum[1]_bsz8_rank0_ms": 1034.41552734375, "layernum[1]_bsz8_rank0_act": 792.28515625, "layernum[1]_bsz8_rank0_act_peak": 1195.81396484375, "layernum[1]_bsz8_rank7_ms": 1034.41552734375, "layernum[1]_bsz8_rank7_act": 792.28515625, "layernum[1]_bsz8_rank7_act_peak": 1195.81396484375, "layernum[2]_bsz8_rank0_ms": 1420.541015625, "layernum[2]_bsz8_rank0_act": 1238.5517578125, "layernum[2]_bsz8_rank0_act_peak": 1545.54931640625, "layernum[2]_bsz8_rank7_ms": 1420.541015625, "layernum[2]_bsz8_rank7_act": 1238.5517578125, "layernum[2]_bsz8_rank7_act_peak": 1545.54931640625 }, "1_8_1_vtp_sp": { "layernum[1]_bsz8_rank0_ms": 1034.7041015625, "layernum[1]_bsz8_rank0_act": 792.49462890625, "layernum[1]_bsz8_rank0_act_peak": 978.57177734375, "layernum[1]_bsz8_rank7_ms": 1034.7041015625, "layernum[1]_bsz8_rank7_act": 792.49462890625, "layernum[1]_bsz8_rank7_act_peak": 978.57177734375, "layernum[2]_bsz8_rank0_ms": 1420.82958984375, "layernum[2]_bsz8_rank0_act": 1240.76123046875, "layernum[2]_bsz8_rank0_act_peak": 1330.30712890625, "layernum[2]_bsz8_rank7_ms": 1420.82958984375, "layernum[2]_bsz8_rank7_act": 1240.76123046875, "layernum[2]_bsz8_rank7_act_peak": 1330.30712890625 }, "1_1_8_c_sp": { "layernum[1]_bsz8_rank0_ms": 922.29052734375, "layernum[1]_bsz8_rank0_act": 362.0185546875, "layernum[1]_bsz8_rank0_act_peak": 1313.044921875, "layernum[1]_bsz8_rank7_ms": 922.29052734375, "layernum[1]_bsz8_rank7_act": 362.0185546875, "layernum[1]_bsz8_rank7_act_peak": 1313.044921875, "layernum[2]_bsz8_rank0_ms": 1308.306640625, "layernum[2]_bsz8_rank0_act": 378.0185546875, "layernum[2]_bsz8_rank0_act_peak": 1368.556640625, "layernum[2]_bsz8_rank7_ms": 1308.306640625, "layernum[2]_bsz8_rank7_act": 378.0185546875, "layernum[2]_bsz8_rank7_act_peak": 1368.556640625 }, "2_1_4_sp": { "layernum[2]_bsz8_rank0_ms": 1296.32958984375, "layernum[2]_bsz8_rank0_act": 922.50146484375, "layernum[2]_bsz8_rank0_act_peak": 1528.54833984375, "layernum[2]_bsz8_rank7_ms": 1327.34521484375, "layernum[2]_bsz8_rank7_act": 1614.56884765625, "layernum[2]_bsz8_rank7_act_peak": 2294.72021484375 }, "2_2_2_sp": { "layernum[2]_bsz8_rank0_ms": 1362.32958984375, "layernum[2]_bsz8_rank0_act": 922.50146484375, "layernum[2]_bsz8_rank0_act_peak": 1294.53271484375, "layernum[2]_bsz8_rank7_ms": 1359.37646484375, "layernum[2]_bsz8_rank7_act": 1614.56884765625, "layernum[2]_bsz8_rank7_act_peak": 2294.72021484375 }, "2_2_2_vtp_sp": { "layernum[2]_bsz8_rank0_ms": 1361.40771484375, "layernum[2]_bsz8_rank0_act": 922.54052734375, "layernum[2]_bsz8_rank0_act_peak": 1169.60302734375, "layernum[2]_bsz8_rank7_ms": 1360.32958984375, "layernum[2]_bsz8_rank7_act": 1614.58837890625, "layernum[2]_bsz8_rank7_act_peak": 2170.89208984375 }, "2_4_1_sp": { "layernum[2]_bsz8_rank0_ms": 1424.39208984375, "layernum[2]_bsz8_rank0_act": 922.50146484375, "layernum[2]_bsz8_rank0_act_peak": 1217.04833984375, "layernum[2]_bsz8_rank7_ms": 1424.40771484375, "layernum[2]_bsz8_rank7_act": 1614.56884765625, "layernum[2]_bsz8_rank7_act_peak": 2294.72021484375 }, "2_4_1_vtp_sp": { "layernum[2]_bsz8_rank0_ms": 1426.51708984375, "layernum[2]_bsz8_rank0_act": 922.54833984375, "layernum[2]_bsz8_rank0_act_peak": 1029.68896484375, "layernum[2]_bsz8_rank7_ms": 1425.64208984375, "layernum[2]_bsz8_rank7_act": 1614.62744140625, "layernum[2]_bsz8_rank7_act_peak": 2107.73583984375 }, "4_1_2_sp": { "layernum[4]_bsz8_rank0_ms": 2564.43994140625, "layernum[4]_bsz8_rank0_act": 1843.00146484375, "layernum[4]_bsz8_rank0_act_peak": 2337.06396484375, "layernum[4]_bsz8_rank7_ms": 
2628.47119140625, "layernum[4]_bsz8_rank7_act": 3227.13525390625, "layernum[4]_bsz8_rank7_act_peak": 4341.42333984375 }, "4_2_1_sp": { "layernum[4]_bsz8_rank0_ms": 2692.51806640625, "layernum[4]_bsz8_rank0_act": 1843.00146484375, "layernum[4]_bsz8_rank0_act_peak": 2103.03271484375, "layernum[4]_bsz8_rank7_ms": 2692.54931640625, "layernum[4]_bsz8_rank7_act": 3227.13525390625, "layernum[4]_bsz8_rank7_act_peak": 4341.40771484375 }, "4_2_1_vtp_sp": { "layernum[4]_bsz8_rank0_ms": 2692.64306640625, "layernum[4]_bsz8_rank0_act": 1843.07958984375, "layernum[4]_bsz8_rank0_act_peak": 1979.17333984375, "layernum[4]_bsz8_rank7_ms": 2692.70556640625, "layernum[4]_bsz8_rank7_act": 3227.17431640625, "layernum[4]_bsz8_rank7_act_peak": 4217.75146484375 } }

Please forgive me for not having read your code carefully yet. Before studying it in detail, I have a few questions:
1. What do the different keys in the computation and memory profiling results mean? If I want to reproduce the result files you provided, how should I set the parameters, including profile_batch_size, layernum, and so on? And is the search space of the later search step limited to what these computation and memory result files cover?
2. In the computation profiling results, global_tp_deg=1 and no TP time is measured. If the search is supposed to explore TP, do I need to additionally profile the computation time under TP?
3. In the memory profiling results, "1_1_8" means pp_tp_dp, right? Why is "layernum[1]_bsz8_rank0_ms": 918.7900390625 for 1_1_8 identical to that of 1_2_4? With tp=2, shouldn't the memory occupied on rank0 be smaller?

@Fizzmy
Collaborator

Fizzmy commented Dec 14, 2024

  1. In general you do not need to change the layer num parameters. For more details, see the parameter documentation here and the profile parameter descriptions in the repository code. The defaults are usually fine; if you hit an OOM error, consider reducing the batch size or sequence length as described there. The search space is not limited, because Galvatron fits computation and memory cost curves from the profiled data; once profiling is complete, it can model arbitrary sequence lengths, model sizes, and batch sizes. If you want to model different sequence lengths, we recommend profiling several different sequence lengths so the fitted curves stay accurate (the code requires at least 8 different sequence-length data points before fitting the computation-time curve).
  2. No, you do not need to profile TP time. In our experience, computation time is inversely proportional to the TP degree, so only the tp=1 time is needed; the times for the other degrees can be computed from it (see the sketch after this list).
  3. Our profiling scripts enable ZeRO-3 by default, so for any TP degree the parameters are fully sharded and the model states are equal. No extra change is needed; at the end, our scripts compute the correct memory consumption from the profiling results. The variables actually used for modeling are stored under the keys beginning with layertype and other_memory in the memory file; you can refer to those entries.
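A minimal sketch of the inverse-proportionality assumption in point 2; the numbers are illustrative only:

# Estimate per-layer compute time at higher TP degrees from the profiled
# tp=1 time, assuming time scales inversely with the TP degree.
def estimate_layer_time_ms(t_tp1_ms: float, tp_deg: int) -> float:
    return t_tp1_ms / tp_deg

t_tp1 = 5.30  # ms, e.g. the profiled "layertype_0_bsz8" value above
for tp in (1, 2, 4, 8):
    print(f"tp={tp}: ~{estimate_layer_time_ms(t_tp1, tp):.2f} ms per layer")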

@Siegfried-qgf
Author

Thanks!
Could you update requirements.txt to the latest version?
I pulled the latest repository version
and hit the following error: ╰─❯ bash profile_computation.sh
Traceback (most recent call last):
File "profiler.py", line 3, in <module>
from galvatron.core import GalvatronProfiler, initialize_galvatron
File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/core/__init__.py", line 6, in <module>
from .parallel import *
File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/core/parallel.py", line 13, in <module>
from .utils import rgetattr, rsetattr, rhasattr
File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/core/utils.py", line 5, in <module>
from .dataloader import compile_helpers
File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/core/dataloader.py", line 1, in <module>
from megatron.training.training import build_train_valid_test_data_iterators
File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/site_package/megatron/training/__init__.py", line 17, in <module>
from .training import pretrain, get_model, get_train_valid_test_num_samples
File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/site_package/megatron/training/training.py", line 29, in <module>
from megatron.core.optimizer import get_megatron_optimizer, OptimizerConfig
File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/site_package/megatron/core/optimizer/__init__.py", line 6, in <module>
from apex.optimizers import FusedAdam as Adam
File "/root/anaconda3/envs/galvatron/lib/python3.8/site-packages/apex/__init__.py", line 18, in <module>
from apex.interfaces import (ApexImplementation,
File "/root/anaconda3/envs/galvatron/lib/python3.8/site-packages/apex/interfaces.py", line 1, in <module>
from zope.interface import implements
ImportError: cannot import name 'implements' from 'zope.interface' (/root/anaconda3/envs/galvatron/lib/python3.8/site-packages/zope/interface/__init__.py)
Could this be related to the versions of apex and zope.interface? Which versions are you using?

@Fizzmy
Collaborator

Fizzmy commented Dec 15, 2024

We're sorry this was overlooked in the documentation. You need to install apex separately; you can download the latest version of apex and build and install it as follows:

git clone https://github.com/NVIDIA/apex
cd apex
# if pip >= 23.1 (ref: https://pip.pypa.io/en/stable/news/#v23-1) which supports multiple `--config-settings` with the same key... 
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
# otherwise
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

We will also update the documentation to cover this shortly.
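After the build finishes, a quick import check (the FusedAdam import below is the same one that failed in the traceback above):

import apex
from apex.optimizers import FusedAdam  # requires the --cpp_ext/--cuda_ext build
print("apex FusedAdam import OK")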

@Siegfried-qgf
Author

My torch version is 2.0.1+cu118, as required by the documentation, but I got the following error.

File "/mnt/gefei/LLM-X/Hetu-Galvatron/galvatron/core/utils.py", line 10, in
from torch.distributed.fsdp._common_utils import _named_parameters_with_duplicates
ImportError: cannot import name '_named_parameters_with_duplicates' from 'torch.distributed.fsdp._common_utils' (/root/anaconda3/envs/galvatron/lib/python3.8/site-packages/torch/distributed/fsdp/_common_utils.py)

The torch/distributed/fsdp/_common_utils.py of torch 2.0.1 does not match the import in Hetu-Galvatron/galvatron/core/utils.py. Has the supported torch version been updated?

@Fizzmy
Collaborator

Fizzmy commented Dec 16, 2024

Yes, we have updated the supported torch version; please upgrade to 2.1. The documentation has been updated, but requirements.txt has not yet, so please follow the latest documentation.
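A quick sanity check of the environment (the import below is exactly the one that failed on torch 2.0.1 in the traceback above, and is present in torch >= 2.1):

import torch
from torch.distributed.fsdp._common_utils import _named_parameters_with_duplicates

print(torch.__version__)  # should report 2.1.x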

@Siegfried-qgf
Author

I ran into a bug: when profile_time writes its results, a JSON error occurs while recording the second result.
Traceback (most recent call last):
File "profiler.py", line 88, in <module>
profiler.process_profiled_data()
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/profiler.py", line 365, in process_profiled_data
config = read_json_config(time_config_path)
File "/usr/local/lib/python3.8/dist-packages/galvatron/utils/config_utils.py", line 15, in read_json_config
return json.load(open(path,'r',encoding="utf-8"))
File "/usr/lib/python3.8/json/__init__.py", line 293, in load
return loads(fp.read(),
File "/usr/lib/python3.8/json/__init__.py", line 357, in loads
return _default_decoder.decode(s)
File "/usr/lib/python3.8/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 4 column 2 (char 105)

The JSON file ends with an extra closing brace. Have you ever seen this happen?
{
"layernum[1]_bsz8_seq4096": 25.143798446655275,
"layernum[2]_bsz8_seq4096": 43.06454353332519
}}

@Fizzmy
Collaborator

Fizzmy commented Dec 19, 2024

Does this happen consistently? You can try resetting the file content to {} and re-profiling.
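A small sketch automating that workaround; the path below is a placeholder for your actual time-profile config file:

import json

# If the profile JSON fails to parse (e.g. a stray trailing brace),
# reset it to an empty object so the next profiling run starts clean.
path = "path/to/computation_profiling_config.json"  # placeholder
try:
    with open(path, encoding="utf-8") as f:
        json.load(f)
except json.JSONDecodeError:
    with open(path, "w", encoding="utf-8") as f:
        f.write("{}")
    print(f"reset corrupted profile file: {path}")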

@Siegfried-qgf
Author

Your suggestion worked; the error occurred when overwriting the previous file.

@Siegfried-qgf
Author

I ran into another problem: when running profile_memory, tp=1 works fine,
but tp>1 raises the error below.

WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
Traceback (most recent call last):
File "train_dist_random.py", line 104, in
train(args)
File "train_dist_random.py", line 82, in train
loss = model.forward_backward(batch, iter, profiler,
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/hybrid_parallel_model.py", line 51, in forward_backward
loss = model.no_pipeline_forward_backward(batch, loss_func,
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 270, in no_pipeline_forward_backward
output_tensor = self.forward_step(
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 736, in forward_step
output_tensor, loss_func = forward_step_func(batch[0], model)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 32, in forward_step
outputs = model(*inputs,**kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 1428, in forward
inputs = module(inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/parallel.py", line 219, in forward
return self.module(*inputs_relocated, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/models/llama_hf/LlamaModel_sequential.py", line 74, in forward
hidden_states = self.layer(hidden_states, attention_mask = attention_mask) # , position_ids = position_ids)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/models/llama_hf/LlamaModel_tensor_parallel.py", line 80, in forward
attention_output = self.attention(
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/models/llama_hf/LlamaModel_tensor_parallel.py", line 49, in forward
hidden_states, bias = self.attention(hidden_states, attention_mask,rotary_pos_emb=rotary_pos_emb)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/tensor_parallel/transformer.py", line 815, in forward
query_layer = apply_rotary_pos_emb(query_layer, q_pos_emb,self.config)
File "/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/models/common/embeddings/rotary_pos_embedding.py", line 247, in apply_rotary_pos_emb
return apply_rotary_pos_emb_bshd(t, freqs, rotary_interleaved=config.rotary_interleaved)
File "/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/models/common/embeddings/rotary_pos_embedding.py", line 195, in apply_rotary_pos_emb_bshd
t = (t * cos_) + (_rotate_half(t, rotary_interleaved) * sin_)
RuntimeError: The size of tensor a (4096) must match the size of tensor b (8192) at non-singleton dimension 0

@Fizzmy
Collaborator

Fizzmy commented Dec 19, 2024

What does the exact launch script you are running look like?

@Siegfried-qgf
Author

What does the exact launch script you are running look like?

export NUM_NODES=1
export NUM_GPUS_PER_NODE=8
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT

export NCCL_SOCKET_IFNAME=ib0

export NODE_RANK=$RANK

LAUNCHER="python3 -m torch.distributed.launch"
LAUNCHER="${LAUNCHER} --nnodes ${NUM_NODES}"
LAUNCHER="${LAUNCHER} --nproc_per_node ${NUM_GPUS_PER_NODE}"

export PROFILE_LAUNCHER="$LAUNCHER"
export PROFILE_TRAINER="train_dist_random.py"

PROFILE_ARGS_BF16="
--profile_type memory
--set_model_config_manually 0
--set_layernum_manually 0
--profile_batch_size 8
--layernum_min 1
--layernum_max 2
--max_tp_deg 8
--profile_dp_type zero3
--mixed_precision bf16
--use-flash-attn
--shape_order SBH
--make-vocab-size-divisible-by 128"

MODEL_ARGS="
--model_size llama-13b
--vocab_size 32000
--hidden_size 4096
--num_attention_heads 32
--seq_length 2048"

python3 profiler.py ${MODEL_ARGS} ${PROFILE_ARGS_BF16}

@Fizzmy
Collaborator

Fizzmy commented Dec 19, 2024

You can try adding --sequence_parallel to the script. Also, if you want to profile a specific seq length, add --set_layernum_manually 1 so that --seq_length takes effect.

@Siegfried-qgf
Author

You can try adding --sequence_parallel to the script. Also, if you want to profile a specific seq length, add --set_layernum_manually 1 so that --seq_length takes effect.

I modified my script as follows. I don't want to specify the seq length, so this should use the default model config, right?
export NUM_NODES=1
export NUM_GPUS_PER_NODE=8
export MASTER_ADDR=$MASTER_ADDR
export MASTER_PORT=$MASTER_PORT
export NODE_RANK=$RANK

LAUNCHER="python3 -m torch.distributed.launch"
LAUNCHER="${LAUNCHER} --nnodes ${NUM_NODES}"
LAUNCHER="${LAUNCHER} --nproc_per_node ${NUM_GPUS_PER_NODE}"

export PROFILE_LAUNCHER="$LAUNCHER"
export PROFILE_TRAINER="train_dist_random.py"

PROFILE_ARGS_BF16="
--profile_type memory
--set_model_config_manually 0
--set_layernum_manually 0
--profile_batch_size 8
--layernum_min 1
--layernum_max 2
--max_tp_deg 8
--profile_dp_type zero3
--mixed_precision bf16
--use-flash-attn
--shape_order SBH
--sequence_parallel
--make-vocab-size-divisible-by 128"

MODEL_ARGS="
--model_size llama-13b "

python3 profiler.py ${MODEL_ARGS} ${PROFILE_ARGS_BF16}

With --sequence_parallel added, configurations such as 1_1_8_sp and 1_2_4_sp run fine, but 2_4_1_sp fails:

return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/models/llama_hf/LlamaModel_sequential.py", line 164, in forward
loss = tensor_parallel.vocab_parallel_cross_entropy(logits_parallel, labels, tp_group = self.tp_group)
File "/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/cross_entropy.py", line 168, in vocab_parallel_cross_entropy
return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target, label_smoothing, tp_group)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/cross_entropy.py", line 92, in forward
loss = torch.log(sum_exp_logits) - predicted_logits
RuntimeError: The size of tensor a (8) must match the size of tensor b (4) at non-singleton dimension 1

If I remove --sequence_parallel from the script above, the earlier rotary-position-embedding shape error comes back. Could it be caused by this warning? Do you see it when you run the code?
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version

Also, when profiling memory, do I need to run both the sp and non-sp cases?

@Fizzmy
Collaborator

Fizzmy commented Dec 19, 2024

Please provide the complete runtime log. When profiling memory, enabling sp by default is sufficient, unless you need to search parallel strategies that use Megatron-TP without SP.

@Siegfried-qgf
Author

Please provide the complete runtime log. When profiling memory, enabling sp by default is sufficient, unless you need to search parallel strategies that use Megatron-TP without SP.

========================Galvatron Parallel Config =============================
Galvatron parallel config mode: [GLOBAL config mode]
[GLOBAL config mode] Loaded global hybrid parallel strategy:
global_batch_size: 8, chunks: 1
pp_deg: 2, tp_deg: 2, sdp_deg: 2, tp_consecutive_flag: 1, checkpoint_flag: 0
pipeline_type: gpipe, default_dp_type: zero3, dtype: bf16
pp_division: [1, 1]
pp_ranks: [0, 1]
use_sp: [False]
================================================================================
Creating Model...
Model Layer Types:
['embed', 'gpt_dec', 'gpt_dec', 'norm', 'cls']
tp_sizes_whole: [1, 2, 2, 1, 1]
sp_sizes_whole: [1, 1, 1, 1, 1]
tp_consec_whole: [1, 1, 1, 1, 1]
dp_types_whole: [0, 1, 1, 0, 0]
pp_ranks_whole: [0, 0, 1, 1, 1]
checkpoint_flags_whole: [0, 0, 0, 0, 0]
dp_sizes_whole: [4, 2, 2, 4, 4]
================================================================================
====================== Galvatron Communication Group ===========================
Embedding group for rank 0:
[0, 4]
TP groups for rank 0 (all layers):
[0] [0, 1] [0, 1] [0] [0]
SP groups for rank 0 (all layers):
[0] [0] [0] [0] [0]
DP groups for rank 0 (all layers):
[0, 1, 2, 3] [0, 2] [0, 2] [0, 1, 2, 3] [0, 1, 2, 3]
SDP groups for rank 0 (all layers):
[0, 1, 2, 3] [0, 2] [0, 2] [0, 1, 2, 3] [0, 1, 2, 3]
Split groups for rank 0:
None None None None None
AllGather groups for rank 0:
None None None None None
Fused split groups for rank 0:
None None None None None
Fused allgather groups for rank 0:
None None None None None
================================================================================
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/layers.py:787: UserWarning: sequence_parallel is set to True, but tensor model parallel size is 1. Disabling sequence parallel.
warnings.warn(
Creating Dataset...
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
After creating model [Allocated]
Max memory: 1668.85 MB Current memory : 458.77 MB
Start training...
Before Forward [Allocated]
Max memory: 458.90 MB Current memory : 458.90 MB
After creating model [Allocated]
Max memory: 1710.78 MB Current memory : 458.78 MB
Before Forward [Allocated]
Max memory: 458.90 MB Current memory : 458.90 MB
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
After Forward [Allocated]
Max memory: 3563.08 MB Current memory : 2940.56 MB
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[W ProcessGroupNCCL.cpp:1849] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
WARNING:megatron.core.models.common.embeddings.rotary_pos_embedding:Setting apply_rope_fusion to false because its implementation is not included in Apex. Try upgrading to the latest version
Traceback (most recent call last):
File "train_dist_random.py", line 104, in
train(args)
File "train_dist_random.py", line 82, in train
loss = model.forward_backward(batch, iter, profiler,
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/hybrid_parallel_model.py", line 44, in forward_backward
loss = model.gpipe_forward(batch, loss_func, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 635, in gpipe_forward
output_tensor = self.forward_step(
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 738, in forward_step
output_tensor, loss_func = forward_step_func(input_tensor, model)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 32, in forward_step
outputs = model(*inputs,**kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/pipeline/pipeline.py", line 1428, in forward
inputs = module(inputs, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/core/parallel.py", line 219, in forward
return self.module(*inputs_relocated, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 839, in forward
output = self._fsdp_wrapped_module(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/galvatron/models/llama_hf/LlamaModel_sequential.py", line 164, in forward
loss = tensor_parallel.vocab_parallel_cross_entropy(logits_parallel, labels, tp_group = self.tp_group)
File "/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/cross_entropy.py", line 168, in vocab_parallel_cross_entropy
return _VocabParallelCrossEntropy.apply(vocab_parallel_logits, target, label_smoothing, tp_group)
File "/usr/local/lib/python3.8/dist-packages/torch/autograd/function.py", line 539, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/usr/local/lib/python3.8/dist-packages/galvatron/site_package/megatron/core/tensor_parallel/cross_entropy.py", line 92, in forward
loss = torch.log(sum_exp_logits) - predicted_logits
RuntimeError: The size of tensor a (4) must match the size of tensor b (2) at non-singleton dimension 1

@Siegfried-qgf
Author

vtp works:
"2_1_4_sp": {
"layernum[2]_bsz8_seq4096_rank0_ms": 1851.41552734375,
"layernum[2]_bsz8_seq4096_rank0_act": 2309.49169921875,
"layernum[2]_bsz8_seq4096_rank0_act_peak": 3417.29443359375,
"layernum[2]_bsz8_seq4096_rank7_ms": 1851.43505859375,
"layernum[2]_bsz8_seq4096_rank7_act": 3289.60986328125,
"layernum[2]_bsz8_seq4096_rank7_act_peak": 3643.59423828125
},
"2_2_2_vtp_sp": {
"layernum[2]_bsz8_seq4096_rank0_ms": 2011.59912109375,
"layernum[2]_bsz8_seq4096_rank0_act": 2309.56005859375,
"layernum[2]_bsz8_seq4096_rank0_act_peak": 2889.18505859375,
"layernum[2]_bsz8_seq4096_rank7_ms": 2011.63818359375,
"layernum[2]_bsz8_seq4096_rank7_act": 3289.62353515625,
"layernum[2]_bsz8_seq4096_rank7_act_peak": 3487.59326171875
},
but 2_2_2_sp is missing in between.

@Fizzmy
Collaborator

Fizzmy commented Dec 19, 2024

We have fixed this issue; profiling now works correctly.

@Siegfried-qgf
Author

We have fixed this issue; profiling now works correctly.

Thank you very much for your help!

I have now run the search_dist step successfully.
Here is my script:

export NUM_NODES=1
export NUM_GPUS_PER_NODE=8

MODEL_SIZE="llama-13b"
MEMORY=60
SEQ=4096
FINE_GRAINED=1
MODEL_ARGS="
--model_size ${MODEL_SIZE}
--set_model_config_manually 0
--set_layernum_manually 0
--set_seqlen_manually 1
--vocab_size 32000
--hidden_size 4096
--num_hidden_layers 24
--num_attention_heads 32
--seq_length ${SEQ}"

BSZ_ARGS="
--min_bsz 64
--max_bsz 64
--bsz_scale 1
--settle_bsz -1
--recommend_min_bsz 0
"

SEARCH_SPACE_ARGS="
--search_space full
--sp_space tp+sp
--disable_dp 0
--disable_tp 0
--disable_pp 0
--disable_sdp 0
--disable_ckpt 0
--disable_vtp 0
--disable_tp_consec 1
--max_tp_deg 8
--max_pp_deg 8
--fine_grained_mode ${FINE_GRAINED}
--no_async_grad_reduce
--sequence_parallel
"
#--profile_mode sequence
SEARCH_ARGS="
${BSZ_ARGS}
${SEARCH_SPACE_ARGS}
${MODEL_ARGS}
--num_nodes ${NUM_NODES}
--num_gpus_per_node ${NUM_GPUS_PER_NODE}
--memory_constraint $MEMORY
--mixed_precision bf16
--pipeline_type pipedream_flush
--default_dp_type zero2
--embed_sdp 0
"

BACKGROUND=1

if [ $BACKGROUND -eq 1 ]; then
echo "Search in background..."
OUTPUT_FILE="log/Search_${MODEL_SIZE}${MEMORY}GB${NUM_NODES}Nodes_${NUM_GPUS_PER_NODE}GPUs_per_node_${SEQ}_${FINE_GRAINED}.log"
nohup python3 search_dist.py ${SEARCH_ARGS} > "$OUTPUT_FILE" 2>&1 &

else
echo "Search in foreground..."
python3 search_dist.py ${SEARCH_ARGS}
fi

The resulting output:
{
"pp_deg": 2,
"tp_sizes_enc": "2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2,2",
"tp_consecutive_flags": "1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1",
"dp_types_enc": "1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0",
"use_sp": "1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1",
"checkpoint": "1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1",
"global_bsz": 64,
"chunks": 8,
"pp_division": "20,20",
"pipeline_type": "pipedream_flush",
"default_dp_type": "zero2",
"vtp": 2,
"vsp": 0
}

I don't fully understand what the fields in this result mean. Could you explain them a bit? And do you have any suggestions for my script? Thanks!

@Fizzmy
Collaborator

Fizzmy commented Dec 20, 2024

pp_deg: pipeline parallel degree, i.e. how many stages the model is split into for pipeline parallelism
tp_sizes_enc: per-layer tensor parallel degree; each number is the TP degree used by the corresponding layer
tp_consecutive_flags: GPU assignment for tensor parallelism; 1 means consecutive GPUs, 0 means non-consecutive
dp_types_enc: data parallel type; 0 means default_dp_type, 1 means ZeRO-3
use_sp: Ulysses sequence parallelism flag; 0 means the layer does not use sequence parallelism, 1 means it does
global_bsz: global batch size, the total training batch size
chunks: number of micro-batches
pp_division: pipeline partition; each number is the number of layers in that stage
checkpoint: activation checkpointing configuration; 1 means the layer uses checkpointing, 0 means it does not
pipeline_type: pipeline scheduling strategy
default_dp_type: default data parallel type
vtp: tensor parallel degree of the vocabulary layers
vsp: whether the vocabulary layers use Ulysses sequence parallelism; 0 means no, 1 means yes

Your script looks fine. If you want to use more features of the search engine, see our API documentation. The docs are still being improved, and an explanation of the results will be added; further suggestions are welcome!
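To make the per-layer encoding concrete, a small sketch that expands the comma-separated fields of a result like the one above (field meanings as listed; the file path is a placeholder):

import json

with open("path/to/search_result.json") as f:  # placeholder path
    cfg = json.load(f)

tp = cfg["tp_sizes_enc"].split(",")
dp = cfg["dp_types_enc"].split(",")
sp = cfg["use_sp"].split(",")
ckpt = cfg["checkpoint"].split(",")
for i, (t, d, s, c) in enumerate(zip(tp, dp, sp, ckpt)):
    dp_name = "zero3" if d == "1" else cfg["default_dp_type"]
    print(f"layer {i:2d}: tp={t} dp={dp_name} ulysses_sp={s} ckpt={c}")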

@Siegfried-qgf
Author

Hi, I'm considering taking the parallel strategy produced by Galvatron and training with my own Megatron-based framework, or with DeepSpeed (training frameworks like LLaMA-Factory).
The strategy Galvatron produces is specified per layer, which some frameworks may not support, and the same may apply to some parameters. If I want to apply the strategy Galvatron produces to another training framework, how should I set things up? Or do you have plans for work in this direction?

@Fizzmy
Collaborator

Fizzmy commented Dec 24, 2024

You can set the Galvatron search engine's fine_grained_mode parameter to 0 and use the disable_* options to restrict the parallel-strategy search space, and set sp_space to restrict the SP strategy search space, so that the search produces a coarse-grained strategy supported by your target framework as a reference. Note that since profiling is done within the Galvatron framework, the memory and time estimates may deviate. Also, Megatron only supports the ZeRO-1 DP strategy, while Galvatron only offers DDP, ZeRO-2, or ZeRO-3; we recommend fixing the search to the ZeRO-2 DP strategy, which should yield similar results. If you hit OOM errors, try lowering the memory constraint.
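For a coarse-grained result (fine_grained_mode=0, so every layer shares one strategy), a hedged sketch of mapping it onto the knobs most frameworks expose; deriving the DP degree and micro-batch size this way is an assumption for illustration:

# Map a uniform Galvatron result onto TP/PP/DP degrees and a micro-batch size.
world_size = 8  # num_nodes * num_gpus_per_node
cfg = {"pp_deg": 2, "tp_sizes_enc": "2", "global_bsz": 64, "chunks": 8}

tp = int(cfg["tp_sizes_enc"].split(",")[0])  # uniform when fine_grained_mode=0
pp = cfg["pp_deg"]
dp = world_size // (tp * pp)                 # remaining degree is data parallel
micro_bsz = cfg["global_bsz"] // (dp * cfg["chunks"])
print(f"TP={tp} PP={pp} DP={dp} micro_batch_size={micro_bsz}")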

@Siegfried-qgf
Author

The result contains "dp_types_enc": "1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0". Is using different data parallel strategies for different layers a distinctive feature of Galvatron? Can training frameworks like Megatron do this? What if I want either all layers on zero2 or all layers on zero3?

@Fizzmy
Collaborator

Fizzmy commented Dec 26, 2024

Yes, Galvatron supports different parallel strategies for different layers, a feature other training frameworks (to our knowledge) do not have. If you want all layers to use the same strategy, set the Galvatron search engine's fine_grained_mode parameter to 0. What does the search_dist script you are using look like?

@Siegfried-qgf
Author

export NUM_NODES=1
export NUM_GPUS_PER_NODE=8

MODEL_SIZE="llama-13b"
MEMORY=60
SEQ=4096
FINE_GRAINED=1
MODEL_ARGS="
--model_size ${MODEL_SIZE}
--set_model_config_manually 0
--set_layernum_manually 0
--set_seqlen_manually 1
--vocab_size 32000
--hidden_size 4096
--num_hidden_layers 24
--num_attention_heads 32
--seq_length ${SEQ}"

BSZ_ARGS="
--min_bsz 64
--max_bsz 64
--bsz_scale 1
--settle_bsz -1
--recommend_min_bsz 0
"

SEARCH_SPACE_ARGS="
--search_space full
--sp_space tp+sp
--disable_dp 0
--disable_tp 0
--disable_pp 0
--disable_sdp 0
--disable_ckpt 0
--disable_vtp 0
--disable_tp_consec 1
--max_tp_deg 8
--max_pp_deg 8
--fine_grained_mode ${FINE_GRAINED}
--no_async_grad_reduce
--sequence_parallel
"
#--profile_mode sequence
SEARCH_ARGS="
${BSZ_ARGS}
${SEARCH_SPACE_ARGS}
${MODEL_ARGS}
--num_nodes ${NUM_NODES}
--num_gpus_per_node ${NUM_GPUS_PER_NODE}
--memory_constraint $MEMORY
--mixed_precision bf16
--pipeline_type pipedream_flush
--default_dp_type zero2
--embed_sdp 0
"

BACKGROUND=1

if [ $BACKGROUND -eq 1 ]; then
echo "Search in background..."
OUTPUT_FILE="log/Search_${MODEL_SIZE}${MEMORY}GB${NUM_NODES}Nodes_${NUM_GPUS_PER_NODE}GPUs_per_node_${SEQ}_${FINE_GRAINED}.log"
nohup python3 search_dist.py ${SEARCH_ARGS} > "$OUTPUT_FILE" 2>&1 &

else
echo "Search in foreground..."
python3 search_dist.py ${SEARCH_ARGS}
fi

Using different data parallel types for different layers looks quite unusual, haha. Have you demonstrated the advantage over Megatron? If the improvement is solid, I'd like to extend my DeepSpeed-based framework to support it as well.

@Fizzmy
Collaborator

Fizzmy commented Dec 26, 2024

You need to set FINE_GRAINED=0 to search with the same strategy for all layers. For comparisons with other training frameworks, see our papers Galvatron and Galvatron-BMW.

@Siegfried-qgf
Author

Does this happen consistently? You can try resetting the file content to {} and re-profiling.

This problem still occurs from time to time.

@Fizzmy
Collaborator

Fizzmy commented Dec 27, 2024

Could you provide complete reproduction steps, along with the relevant system information and the full run log? We have not been able to reproduce this issue in our environment.

@Siegfried-qgf
Author

File "/usr/local/lib/python3.8/dist-packages/galvatron/core/search_engine.py", line 230, in get_profiled_model_configs
tp_activation_per_bsz_dict = layer_mem_config[[self.seqlen_list[i]]]['tp_activation_per_bsz_dict'].copy()

Isn't there an extra pair of brackets in layer_mem_config[[self.seqlen_list[i]]] here?
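Presumably the fix is just to drop the duplicated bracket (a sketch of the corrected line; the maintainers confirm the bug below):

# search_engine.py line 230 with the extra bracket removed:
tp_activation_per_bsz_dict = layer_mem_config[self.seqlen_list[i]]['tp_activation_per_bsz_dict'].copy()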

@Fizzmy
Collaborator

Fizzmy commented Dec 30, 2024

Yes, and thanks for the report; we will fix this in the next release.

@Siegfried-qgf
Author

Hello, and happy New Year!
I ran into some problems during search:
when search_space is set to dp+pp, or to dp only, an error is raised.
[screenshot of the error]

Could you confirm whether Galvatron supports restricting the search space to just dp or dp+pp?
Here is my script:
export NUM_NODES=1
export NUM_GPUS_PER_NODE=8

MODEL_SIZE="qwen2.5-14b"
MEMORY=80
SEQ=4096
FINE_GRAINED=0
MODEL_ARGS="
--model_size ${MODEL_SIZE}
--set_model_config_manually 0
--set_layernum_manually 0
--set_seqlen_manually 1
--vocab_size 32000
--hidden_size 4096
--num_hidden_layers 24
--num_attention_heads 32
--seq_length ${SEQ}"

BSZ_ARGS="
--min_bsz 64
--max_bsz 64
--bsz_scale 8
--settle_bsz -1
--recommend_min_bsz 0
"
#--sp_space tp+sp
SEARCH_SPACE_ARGS="
--search_space dp+pp
--sp_space tp+sp
--disable_dp 0
--disable_tp 0
--disable_pp 0
--disable_sdp 0
--disable_ckpt 0
--disable_vtp 0
--disable_tp_consec 1
--max_tp_deg 8
--max_pp_deg 8
--fine_grained_mode ${FINE_GRAINED}
--no_async_grad_reduce
"
#--sequence_parallel

#--profile_mode sequence
SEARCH_ARGS="
${BSZ_ARGS}
${SEARCH_SPACE_ARGS}
${MODEL_ARGS}
--num_nodes ${NUM_NODES}
--num_gpus_per_node ${NUM_GPUS_PER_NODE}
--memory_constraint $MEMORY
--mixed_precision bf16
--pipeline_type pipedream_flush
--default_dp_type zero2
--embed_sdp 0
"

BACKGROUND=1

if [ $BACKGROUND -eq 1 ]; then
echo "Search in background..."
OUTPUT_FILE="log/Search_${MODEL_SIZE}${MEMORY}GB${NUM_NODES}Nodes_${NUM_GPUS_PER_NODE}GPUs_per_node_${SEQ}_${FINE_GRAINED}.log"
nohup python3 search_dist.py ${SEARCH_ARGS} > "$OUTPUT_FILE" 2>&1 &

else
echo "Search in foreground..."
python3 search_dist.py ${SEARCH_ARGS}
fi

@Fizzmy
Collaborator

Fizzmy commented Jan 4, 2025

Happy New Year! Which version of the code are you using? Please try updating to the latest version.
