Troubleshooting
Performance fluctuation due to CPU bottlenecks
- Set the CPU governor to performance mode.
~$ sudo apt install cpufrequtils
~$ cat /etc/init.d/cpufrequtils
...
GOVERNOR="performance"
...
~$ sudo systemctl daemon-reload && sudo /etc/init.d/cpufrequtils reload
~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
...
performance
performance
...
- Try disabling Hyperthreading.
- Use the taskset command to pin the application to CPUs on the same NUMA node as the Mango GPUBoost card.
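The NUMA pinning step can be sketched as below. This is a minimal sketch, not a vendor-provided tool: it assumes the RDMA device is named mb_0 and that it exposes a numa_node entry under /sys/class/infiniband, which may not hold on every system. The SYSFS_ROOT variable and the ./my_rdma_app name are hypothetical and exist only for illustration.

```shell
# Sketch: find the CPUs local to an RDMA device so taskset can pin to them.
# Assumption: SYSFS_ROOT is a hypothetical override used only to point the
# lookup at another directory; by default the real /sys is used.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

# Print the CPU list (for example "0-15") of the NUMA node hosting device $1.
numa_cpulist() {
    dev="$1"
    node_file="$SYSFS_ROOT/class/infiniband/$dev/device/numa_node"
    [ -r "$node_file" ] || { echo "no numa_node entry for $dev" >&2; return 1; }
    node="$(cat "$node_file")"
    # The kernel reports -1 when it has no NUMA information for the device.
    [ "$node" -ge 0 ] 2>/dev/null || { echo "NUMA node unknown for $dev" >&2; return 1; }
    cat "$SYSFS_ROOT/devices/system/node/node$node/cpulist"
}

# Usage (./my_rdma_app is a placeholder for the real application):
#   taskset -c "$(numa_cpulist mb_0)" ./my_rdma_app
```

If the lookup fails, the NUMA node can still be chosen by hand from the output of lscpu or numactl --hardware.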
Cannot allocate MR due to the memlock limitation
- RDMA operations require memory pinning to prevent swapping. If the allowed pinned memory size is too small, MR (memory region) registration may fail. To resolve this, modify /etc/security/limits.conf to increase the pinned memory limit. You may also set it to unlimited if necessary.
~$ cat /etc/security/limits.conf
...
* soft memlock unlimited
* hard memlock unlimited
~$ ulimit -l
unlimited
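A quick way to confirm the limit before launching an application is sketched below. It assumes bash, where ulimit -l prints the limit in KiB, and the required size (1 GiB here) is an illustrative figure rather than a requirement of the driver. Changes to limits.conf normally take effect only in new login sessions, so run the check from a fresh shell.

```shell
# Sketch: check whether the memlock limit covers the amount you plan to pin.
# $1 = limit as printed by `ulimit -l` (KiB or "unlimited"), $2 = needed KiB.
memlock_sufficient() {
    limit="$1"
    need_kib="$2"
    [ "$limit" = "unlimited" ] && return 0
    [ "$limit" -ge "$need_kib" ]
}

# Usage: warn if less than 1 GiB (1048576 KiB, an example value) can be pinned.
#   memlock_sufficient "$(ulimit -l)" 1048576 || echo "memlock limit too small" >&2
```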
The RDMA interface name appears as rocep.. instead of mb_<i>
- To change the interface name to mb_<i>, modify /usr/lib/udev/rules.d/60-rdma-persistent-naming.rules and reload the driver.
~$ sudo modprobe -r mango-aux-rdma
~$ cat /usr/lib/udev/rules.d/60-rdma-persistent-naming.rules
...
ACTION=="add", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_KERNEL"
...
~$ sudo modprobe mango-aux-rdma
~$ rdma link
link mb_0/1 state DOWN physical_state DISABLED netdev ens102np0
link mb_1/1 state DOWN physical_state DISABLED netdev ens104np0
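To verify the rename across all ports at once, the rdma link output can be filtered as in the sketch below. It assumes that <i> is a decimal index and that each line has the "link <device>/<port> ..." shape shown above; the function name is hypothetical.

```shell
# Sketch: list RDMA devices whose name does not match mb_<i>.
# Reads `rdma link` output on stdin; prints nothing when every link is renamed.
unrenamed_links() {
    awk '$1 == "link" {
        split($2, parts, "/")    # "mb_0/1" -> device "mb_0", port "1"
        if (parts[1] !~ /^mb_[0-9]+$/) print parts[1]
    }'
}

# Usage:
#   rdma link | unrenamed_links
```

Any name it prints is a device that still carries its old name, which suggests the rule change was not picked up when the driver was reloaded.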
Application terminates with completion status 12
...
Completion with error at client
Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=64
- Completion status 12 (IBV_WC_RETRY_EXC_ERR) means that the number of retransmission attempts exceeded the allowed limit. This issue may be mitigated by enabling congestion control to reduce packet drops.
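When scanning logs for this failure, a small helper such as the sketch below can pull the status number out of the "Failed status N" line and name it. The function names are hypothetical, and the table is deliberately partial: it lists only a few values of the libibverbs ibv_wc_status enum, so treat anything reported as unknown as a prompt to check the verbs header.

```shell
# Sketch: map a few ibv_wc_status numbers to their names (partial list).
wc_status_name() {
    case "$1" in
        0)  echo "IBV_WC_SUCCESS" ;;
        5)  echo "IBV_WC_WR_FLUSH_ERR" ;;
        12) echo "IBV_WC_RETRY_EXC_ERR" ;;
        13) echo "IBV_WC_RNR_RETRY_EXC_ERR" ;;
        *)  echo "unknown ($1)" ;;
    esac
}

# Extract N from lines such as "Failed status 12: wr_id 0 syndrom 0x81".
failed_status() {
    sed -n 's/^ *Failed status \([0-9][0-9]*\):.*/\1/p'
}

# Usage (app.log is a placeholder for a captured application log):
#   failed_status < app.log | while read -r s; do wc_status_name "$s"; done
```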
Invalid argument error when loading the driver
~$ sudo modprobe mango-aux-rdma
modprobe: ERROR: could not insert 'mango_aux_rdma': Invalid argument
- Make sure the linux-modules-extra package is installed (see Prerequisites).
- Check whether Mellanox OFED is installed (see Prerequisites).