Fixing PyTorch multi-GPU hangs on AMD CPUs
阿新 • Published: 2019-03-12
dataparallel not working on nvidia gpus and amd cpus
https://github.com/pytorch/pytorch/issues/13045

Problem: when running on multiple GPUs, the network hangs and never makes progress. The system is an AMD Ryzen 5 1600X with two Titan Xp cards. An earlier pairing of a 2070 and a Titan Xp ran multi-GPU without trouble; the only difference between those two cards was the amount of video memory...

Checking the logs showed the same error over and over. These error messages were found in the dmesg log:

[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
[1118468.942189] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000040 flags=0x0020]
[1118468.942227] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00007c0 flags=0x0020]
[1118468.942265] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0001040 flags=0x0020]
[1118468.942303] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0000f40 flags=0x0020]
[1118468.942340] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d00016c0 flags=0x0020]
[1118468.942377] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000d0002040 flags=0x0020]
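To check quickly whether a machine shows the same symptom, the dmesg output can be scanned for these AMD-Vi events. The following is only a rough sketch: the function name count_page_faults and the regular expression are my own assumptions, written to match the line format shown in the log above, and the last line of the sample is a made-up placeholder rather than real log output.

```python
import re
from collections import Counter

# Pattern modelled on the dmesg lines quoted above (an assumption about the format):
# nvidia <pci address>: AMD-Vi: Event logged [IO_PAGE_FAULT domain=... address=... flags=...]
FAULT_RE = re.compile(
    r"nvidia (?P<device>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]): "
    r"AMD-Vi: Event logged \[IO_PAGE_FAULT "
    r"domain=(?P<domain>0x[0-9a-f]+) "
    r"address=(?P<address>0x[0-9a-f]+) "
    r"flags=(?P<flags>0x[0-9a-f]+)\]"
)


def count_page_faults(dmesg_text):
    """Count IO_PAGE_FAULT events per PCI device address (hypothetical helper)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = FAULT_RE.search(line)
        if match:
            counts[match.group("device")] += 1
    return counts


# The first two lines are copied from the log above; the third is a placeholder.
SAMPLE = """\
[1118468.873266] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea13a000 flags=0x0020]
[1118468.942145] nvidia 0000:0a:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000f address=0x00000000ea139068 flags=0x0020]
unrelated line (placeholder, not from the original log)
"""

if __name__ == "__main__":
    for device, n in count_page_faults(SAMPLE).items():
        print(device, n)
```

On a real system the text would come from the dmesg command instead of SAMPLE; a non-zero count for the GPU's PCI address suggests the same IOMMU problem.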
After some searching, this appears to be a bug...

Temporary workaround: edit /etc/default/grub
GRUB_DEFAULT=0
GRUB_TIMEOUT_STYLE=hidden
GRUB_TIMEOUT=10
GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
GRUB_CMDLINE_LINUX="iommu=soft" # note: this is the line to change
...
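For anyone who would rather script the change than edit the file by hand, the substitution can be sketched as a pure text transformation. This is only an illustration under assumptions: add_kernel_param is a hypothetical helper, it expects the value to be double-quoted as in the file above, and it deliberately does not write to /etc/default/grub.

```python
import re


def add_kernel_param(grub_text, param="iommu=soft", key="GRUB_CMDLINE_LINUX"):
    """Return grub_text with param added to the double-quoted value of key.

    Hypothetical helper: it only transforms text and never touches the real file.
    """
    # Anchoring on key followed by =" keeps GRUB_CMDLINE_LINUX_DEFAULT untouched.
    pattern = re.compile(r'^(' + re.escape(key) + r')="([^"]*)"', re.MULTILINE)

    def repl(match):
        params = match.group(2).split()
        if param not in params:  # avoid adding the parameter twice
            params.append(param)
        return '{}="{}"'.format(match.group(1), " ".join(params))

    return pattern.sub(repl, grub_text)


if __name__ == "__main__":
    before = 'GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"\nGRUB_CMDLINE_LINUX=""\n'
    print(add_kernel_param(before))
```

Keeping the function free of file I/O makes it easy to inspect the result before copying it into place with root privileges.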
Then run sudo update-grub, and finally reboot. After that, multi-GPU runs work normally.
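After the reboot it is worth confirming that the kernel actually booted with the new parameter. On Linux the active command line can be read from /proc/cmdline; the sketch below does that. The helper names has_kernel_param and read_kernel_cmdline are assumptions for illustration, not part of any library.

```python
def has_kernel_param(cmdline, param="iommu=soft"):
    """Return True if param appears as a whole token in the kernel command line."""
    return param in cmdline.split()


def read_kernel_cmdline(path="/proc/cmdline"):
    """Return the running kernel's command line, or None if it cannot be read."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_kernel_cmdline()
    if cmdline is None:
        print("could not read /proc/cmdline (not a Linux system?)")
    elif has_kernel_param(cmdline):
        print("iommu=soft is active")
    else:
        print("iommu=soft is missing; check /etc/default/grub and rerun update-grub")
```

If the parameter is missing, the usual causes are editing the wrong variable or skipping the update-grub step.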