Troubleshooting

Performance fluctuation due to CPU bottlenecks

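  • Set the CPU governor to performance mode: set GOVERNOR="performance" in /etc/init.d/cpufrequtils, reload the service, and confirm that every core reports performance.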
    ~$ sudo apt install cpufrequtils
    ~$ cat /etc/init.d/cpufrequtils
    ...
    GOVERNOR="performance"
    ...
    ~$ sudo systemctl daemon-reload && sudo /etc/init.d/cpufrequtils reload
    ~$ cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    ...
    performance
    performance
    ...
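  • Try disabling Hyper-Threading (simultaneous multithreading), so that the application does not share a physical core with a sibling hardware thread.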
  • Use the taskset command to explicitly set the application’s CPU affinity to the same NUMA node as the Mango BoostX™ RoCE AI card.
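
The taskset step above can be sketched as follows. This is a minimal example, assuming the RDMA device is named mb_0 and that its sysfs device link exposes a numa_node file, as PCI devices normally do. The helper name numa_cpus_for_rdma_dev and the application name ./my_rdma_app are hypothetical.

```shell
#!/bin/bash
# Print the CPU list of the NUMA node that an RDMA device is attached to.
#   $1: RDMA device name (e.g. mb_0)
#   $2: sysfs root (optional, defaults to /sys)
numa_cpus_for_rdma_dev() {
    local dev="$1" sysfs="${2:-/sys}" node
    node=$(cat "${sysfs}/class/infiniband/${dev}/device/numa_node") || return 1
    # -1 means the platform reports no NUMA affinity; fall back to node 0.
    if [ "${node}" -lt 0 ]; then
        node=0
    fi
    cat "${sysfs}/devices/system/node/node${node}/cpulist"
}

# Example (hypothetical application name):
#   taskset -c "$(numa_cpus_for_rdma_dev mb_0)" ./my_rdma_app
```

The helper only prints a CPU list such as 16-31; passing it to taskset -c restricts the application to those cores.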

Cannot allocate MR due to the memlock limitation

  • RDMA requires memory regions (MRs) to be pinned so that they cannot be swapped out. If the pinned-memory (memlock) limit is too small, MR registration may fail. To resolve this, raise the memlock limit in /etc/security/limits.conf, or set it to unlimited if necessary. The new limit applies to sessions started after the change, so log in again before checking it with ulimit -l.
    ~$ cat /etc/security/limits.conf
    ...
    * soft memlock unlimited
    * hard memlock unlimited
    ~$ ulimit -l
    unlimited
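
Before launching an application, it can help to confirm that the current shell's limit covers the memory you intend to register. The following is a minimal sketch. The helper name memlock_sufficient and the 1 GiB requirement are illustrative assumptions, not values from this guide.

```shell
#!/bin/bash
# Return 0 if a memlock limit, as printed by `ulimit -l` (KiB or "unlimited"),
# is at least the required number of KiB.
memlock_sufficient() {
    local limit="$1" required_kib="$2"
    [ "${limit}" = "unlimited" ] && return 0
    [ "${limit}" -ge "${required_kib}" ]
}

# Example: require 1 GiB of pinnable memory (an illustrative figure).
if memlock_sufficient "$(ulimit -l)" $((1024 * 1024)); then
    echo "memlock limit is sufficient"
else
    echo "memlock limit is too low: $(ulimit -l) KiB" >&2
fi
```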

The RDMA interface name appears as rocep.. instead of mb_<i>

  • To change the interface name to mb_<i>, unload the driver, set the rename policy in /usr/lib/udev/rules.d/60-rdma-persistent-naming.rules to NAME_KERNEL as shown below, and load the driver again.
    ~$ sudo modprobe -r mango-aux-rdma
    ~$ cat /usr/lib/udev/rules.d/60-rdma-persistent-naming.rules
    ...
    ACTION=="add", SUBSYSTEM=="infiniband", PROGRAM="rdma_rename %k NAME_KERNEL"
    ...
    ~$ sudo modprobe mango-aux-rdma
    ~$ rdma link
    link mb_0/1 state DOWN physical_state DISABLED netdev ens102np0
    link mb_1/1 state DOWN physical_state DISABLED netdev ens104np0

Application terminates with completion status 12

...
Completion with error at client
Failed status 12: wr_id 0 syndrom 0x81
scnt=128, ccnt=64
  • Completion status 12 (IBV_WC_RETRY_EXC_ERR) means that the transport retry count was exceeded: the sender retransmitted a packet the maximum number of times without receiving an acknowledgment from the remote side. This issue may be mitigated by enabling congestion control to reduce packet drops.

Invalid argument error when loading the driver

~$ sudo modprobe mango-aux-rdma
modprobe: ERROR: could not insert 'mango_aux_rdma': Invalid argument

Uninstall Mellanox OFED

The Mango BoostX™ RoCE AI package conflicts with the Mellanox OFED software stack. If Mellanox OFED is installed in your environment, uninstall it with the following command:

~$ sudo /usr/sbin/ofed_uninstall.sh --force
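
To check whether Mellanox OFED is present before or after uninstalling, one option is to look for the ofed_info utility that Mellanox OFED installs. This is a minimal sketch. The function name check_mlnx_ofed is hypothetical, and a missing ofed_info is only a hint that the stack is absent, not proof.

```shell
#!/bin/bash
# Report whether the Mellanox OFED ofed_info utility is on the PATH.
check_mlnx_ofed() {
    if command -v ofed_info > /dev/null 2>&1; then
        # `ofed_info -s` prints the installed OFED version string.
        echo "Mellanox OFED detected: $(ofed_info -s)"
    else
        echo "Mellanox OFED not detected"
    fi
}

check_mlnx_ofed
```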

Disabling ACS for P2P

To enable PCIe peer-to-peer data transfer between devices (e.g., between a GPU and the Mango BoostX™ RoCE AI card; see GPU Workloads → PCIe Configuration), PCIe Access Control Services (ACS) must be disabled. On some platforms ACS is disabled by default; if it is enabled, run the following script with root privileges.

disable_acs.sh
#!/bin/bash
# Disable ACS on every PCI device that supports it.

# Must be run as root to access extended PCI config space (and for dmidecode)
if [ "$EUID" -ne 0 ]; then
    echo "ERROR: $0 must be run as root" >&2
    exit 1
fi

PLATFORM=$(dmidecode --string system-product-name)
logger "PLATFORM=${PLATFORM}"

# Enforce platform check here.
# case "${PLATFORM}" in
#     "OAM"*)
#         logger "INFO: Disabling ACS is no longer necessary for ${PLATFORM}"
#         exit 0
#         ;;
#     *)
#         ;;
# esac

for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
    # Skip devices that do not expose the ACS extended capability
    if ! setpci -v -s "${BDF}" ECAP_ACS+0x6.w > /dev/null 2>&1; then
        continue
    fi

    logger "Disabling ACS on $(lspci -s "${BDF}")"
    # Clear the ACS Control register (offset 0x6 in the ACS capability)
    if ! setpci -v -s "${BDF}" ECAP_ACS+0x6.w=0000; then
        logger "Error disabling ACS on ${BDF}"
        continue
    fi

    # Read the register back to confirm that the write took effect
    NEW_VAL=$(setpci -v -s "${BDF}" ECAP_ACS+0x6.w | awk '{print $NF}')
    if [ "${NEW_VAL}" != "0000" ]; then
        logger "Failed to disable ACS on ${BDF}"
        continue
    fi
done
exit 0
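
After running the script, the result can be checked from the ACSCtl lines that lspci -vvv prints for each ACS-capable device: a flag followed by + is still enabled. The following is a minimal sketch. The helper name count_acs_enabled is hypothetical, and it assumes the usual lspci output format.

```shell
#!/bin/bash
# Read `lspci -vvv` output on stdin and print the number of ACSCtl lines
# that still have at least one ACS control flag enabled (marked with "+").
count_acs_enabled() {
    grep 'ACSCtl:' | grep -c '+' || true
}

# Usage (root is needed to read the extended capabilities):
#   sudo lspci -vvv | count_acs_enabled
```

A result of 0 indicates that no device reports an enabled ACS control flag.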

Other issues not listed above

  • If you encounter an issue not covered here, report it to contact@mangoboost.io. Please attach a debug log capturing the state of both the hardware and the software (a tar.gz of /var/log/mango/rdma).
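
The debug log archive mentioned above can be created with tar. This is a minimal sketch. The function name collect_rdma_logs and the timestamped output file name are illustrative choices, not part of the Mango BoostX™ package.

```shell
#!/bin/bash
# Pack a log directory into a tar.gz archive and print the archive path.
#   $1: log directory (optional, defaults to /var/log/mango/rdma)
#   $2: output archive path (optional, defaults to a timestamped name)
collect_rdma_logs() {
    local log_dir="${1:-/var/log/mango/rdma}"
    local out="${2:-mango-rdma-debug-$(date +%Y%m%d-%H%M%S).tar.gz}"
    # -C switches to the parent directory so the archive holds relative paths.
    tar -czf "${out}" -C "$(dirname "${log_dir}")" "$(basename "${log_dir}")" \
        && echo "${out}"
}

# Example (may need sudo if the logs are readable only by root):
#   collect_rdma_logs
```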