Troubleshooting
Using multiple NTI cards in a single server
- It is not recommended to use multiple NTI cards in a single server to expand bandwidth.
- In the current version, such configurations have not been sufficiently validated for performance and stability.
- Support for this feature is planned for future versions.
Requesting multiple target subsystem connections to the NTI agent
- If even one connection request succeeds, the NTI agent considers the operation successful.
- Currently, failed connection requests are not retried continuously.
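Because failed connection requests are not retried, it is worth confirming which target subsystems actually connected. A minimal check, assuming the nvme-cli package is installed (the package name shown is the Ubuntu one):

```shell
# Check which NVMe-oF subsystems are currently connected.
if command -v nvme >/dev/null 2>&1; then
    nvme list-subsys && status=ok || status=failed
else
    status=missing-nvme-cli
    echo "nvme-cli not found; install it with: sudo apt install nvme-cli"
fi
echo "connection check: $status"
```

Any subsystem missing from the output did not connect and must be requested again.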
Network failure or target server reboot
- A network failure or a target server reboot will trigger an NVMe timeout on the host.
- This timeout causes the NVMe controller to reset as part of the host NVMe driver’s recovery process.
- If the target server returns to a normal state before the NVMe reset process completes, the NVMe will recover automatically.
- If the target server returns to a normal state after the NVMe reset process has completed, the NVMe must be manually re-bound.
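The manual re-bind can be done through the standard PCI driver sysfs interface. A sketch, assuming the NTI device is bound to the in-kernel nvme driver; the PCI address below is a placeholder, so substitute the real one from lspci:

```shell
# Manually re-bind the NVMe device after a late target recovery.
BDF=0000:3b:00.0   # hypothetical PCI address of the NTI device
if [ -e "/sys/bus/pci/drivers/nvme/$BDF" ]; then
    echo "$BDF" | sudo tee /sys/bus/pci/drivers/nvme/unbind
    echo "$BDF" | sudo tee /sys/bus/pci/drivers/nvme/bind
else
    echo "device $BDF is not bound to the nvme driver on this machine"
fi
```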
Unexpected low FIO performance
- Ensure that both the initiator and target nodes are configured with the correct NUMA setup, as NVMe/TCP performance is highly sensitive to NUMA settings.
- You can verify the NUMA status of PCIe devices with the following commands:
~$ sudo apt install hwloc
~$ lstopo
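As an alternative to lstopo, the NUMA node of a PCIe device can be read directly from sysfs. The PCI address below is a placeholder for the NTI card's address:

```shell
# Read the NUMA node of a PCIe device from sysfs.
BDF=0000:3b:00.0
node_file="/sys/bus/pci/devices/$BDF/numa_node"
if [ -r "$node_file" ]; then
    numa_node=$(cat "$node_file")   # -1 means no NUMA affinity is reported
    echo "device $BDF is on NUMA node $numa_node"
else
    numa_node=unknown
    echo "device $BDF is not present on this machine"
fi
```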
Performance bottleneck in the NVMe-oF target node
- Ensure that the target server has at least 32 cores and a proper NUMA configuration to achieve maximum full-duplex performance (5.5 million IOPS).
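A quick sanity check of the target server's sizing against the guidance above, using only standard tools (nproc, lscpu):

```shell
# Verify core count and NUMA node count on the target server.
cores=$(nproc)
nodes=$(lscpu | awk -F: '/NUMA node\(s\)/ {gsub(/ /,"",$2); print $2}')
echo "online cores: $cores, NUMA nodes: ${nodes:-unknown}"
if [ "$cores" -lt 32 ]; then
    echo "warning: fewer than 32 cores; peak full-duplex IOPS may be unreachable"
fi
```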
High CPU cycles consumed due to spinlock contention in FIO
- When too many FIO threads access a single file, excessive CPU cycles can be spent on spinlock contention.
- Example call graph
aio_read/aio_write -> security_file_permission -> spin_lock
- This issue can lead to performance degradation and may also occur with local Samsung NVMe devices.
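One way to avoid the contention is to give each FIO job its own file rather than sharing one. A sketch of such a job file; the directory is a placeholder for a mount point backed by the NTI namespace:

```ini
; One file per job: with numjobs>1 and no shared filename option,
; fio creates a separate file for each job, so threads do not contend
; on a single file's permission-check spinlock.
[global]
ioengine=libaio
direct=1
rw=randread
bs=4k
iodepth=32
numjobs=8

[per-job-files]
directory=/mnt/test
size=1G
```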
Performance impact due to CPU scheduling
- If workloads are scheduled on cores with a long NUMA distance from the NTI device, performance degradation may occur.
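Pinning the workload to the NUMA node local to the NTI device avoids this. A sketch using numactl; node 0 is a placeholder for the node reported for the card's PCI address:

```shell
# Pin a workload's CPUs and memory to the device-local NUMA node.
if command -v numactl >/dev/null 2>&1; then
    numactl --cpunodebind=0 --membind=0 echo "pinned run on node 0"
    pin_status=ok
else
    pin_status=missing-numactl
    echo "numactl not found (try: sudo apt install numactl)"
fi
```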
Performance fluctuation caused by bandwidth disparity
- A bandwidth gap between the NTI TOE engine and the target-side NIC can result in packet drops, which appear on the host as performance fluctuation.
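Link speed and drop counters on the target-side NIC can be inspected with ethtool to confirm whether packets are being dropped. The interface name below is a placeholder:

```shell
# Inspect link speed and drop/discard counters on the target-side NIC.
IFACE=eth0
if command -v ethtool >/dev/null 2>&1 && [ -e "/sys/class/net/$IFACE" ]; then
    ethtool "$IFACE" | grep -i speed            # confirm the negotiated link speed
    ethtool -S "$IFACE" | grep -iE 'drop|discard' || echo "no drop counters matched"
    drop_check=done
else
    drop_check=skipped
    echo "ethtool or interface $IFACE not available on this machine"
fi
```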