
Fixing PyTorch multi-GPU hangs on AMD CPUs


dataparallel not working on nvidia gpus and amd cpus

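https://github.com/pytorch/pytorch/issues/13045

Problem: when running on multiple GPUs, the network hangs and never makes progress. The machine has an AMD Ryzen 5 1600X and two TITAN Xp cards. An earlier setup with a 2070 plus a TITAN Xp could run multi-GPU jobs; the only catch was that the two cards had different amounts of memory. The dmesg log was full of errors like these: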

[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
[1118468.942189] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000040 flags=0x0020]
[1118468.942227] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00007c0 flags=0x0020]
[1118468.942265] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001040 flags=0x0020]
[1118468.942303] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000f40 flags=0x0020]
[1118468.942340] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00016c0 flags=0x0020]
[1118468.942377] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0002040 flags=0x0020]
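
To check whether a machine shows the same symptom, the kernel log can be scanned for these entries. The following is only a rough sketch: the regular expression is written against the sample lines above, and the names FAULT_RE, find_io_page_faults and faults_per_device are made up for illustration. Other kernel versions may format the message differently.

```python
import re
from collections import Counter

# Pattern modelled on the sample dmesg lines above (an assumption: other
# kernels may print AMD-Vi events in a different layout).
FAULT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+"
    r"(?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]):\s+"
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT\s+"
    r"domain=(?P<domain>0x[0-9a-f]+)\s+"
    r"address=(?P<address>0x[0-9a-f]+)\s+"
    r"flags=(?P<flags>0x[0-9a-f]+)\]",
    re.IGNORECASE,
)


def find_io_page_faults(log_text):
    """Return one dict per IO_PAGE_FAULT entry found in log_text."""
    return [m.groupdict() for m in FAULT_RE.finditer(log_text)]


def faults_per_device(log_text):
    """Count IO_PAGE_FAULT entries per PCI device address."""
    return Counter(fault["device"] for fault in find_io_page_faults(log_text))
```

Passing the text printed by dmesg to faults_per_device gives a quick count per PCI device; in the log above every entry belongs to 0000:0a:00.0.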

After some searching, this appears to be a bug. Temporary workaround: edit /etc/default/grub as follows.
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX="iommu=soft" # note: this is the line to change ...
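
If the edit needs to be scripted rather than done by hand, the change to that one line might look like the sketch below. It only transforms the file contents as a string; reading and writing /etc/default/grub (which needs root) is left out. The function name add_kernel_param is hypothetical, and appending a new line when none exists is an assumed fallback, not something the original fix describes.

```python
import re


def add_kernel_param(grub_text, param="iommu=soft", key="GRUB_CMDLINE_LINUX"):
    """Return grub_text with param added to the key="..." line if it is missing."""
    # Anchoring on key followed directly by '=' keeps GRUB_CMDLINE_LINUX_DEFAULT
    # from matching when key is GRUB_CMDLINE_LINUX.
    pattern = re.compile(r"^(" + re.escape(key) + r')="([^"]*)"', re.MULTILINE)

    def repl(match):
        params = match.group(2).split()
        if param not in params:
            params.append(param)
        return '{}="{}"'.format(match.group(1), " ".join(params))

    new_text, count = pattern.subn(repl, grub_text, count=1)
    if count == 0:
        # Assumed fallback: no such line yet, so append one at the end.
        sep = "" if (not new_text or new_text.endswith("\n")) else "\n"
        new_text = '{}{}{}="{}"\n'.format(new_text, sep, key, param)
    return new_text
```

Anything after the closing quote on the matched line, such as a trailing comment, is left untouched, and calling the function a second time changes nothing.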

Then run sudo update-grub and reboot. After that, multi-GPU runs work normally.
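
After the reboot it can be worth confirming that the running kernel actually picked up the option. A minimal sketch, assuming a Linux system where the boot command line is exposed at /proc/cmdline; the helper names kernel_has_param and read_kernel_cmdline are invented for this example.

```python
def kernel_has_param(cmdline, param="iommu=soft"):
    """True if param appears as a whole token in a kernel command line string."""
    return param in cmdline.split()


def read_kernel_cmdline(path="/proc/cmdline"):
    """Read the command line the running kernel was booted with (Linux only)."""
    with open(path) as f:
        return f.read().strip()
```

Calling kernel_has_param(read_kernel_cmdline()) should return True once the new GRUB configuration is in effect. Matching whole tokens avoids a false positive from a different option such as amd_iommu=soft.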
