Troubleshooting

Using multiple NTI cards in a single server

  • It is not recommended to use multiple NTI cards in a single server to expand bandwidth.
  • In the current version, such configurations have not been sufficiently validated for performance and stability.
  • Support for this feature is planned for future versions.

Requesting multiple target subsystem connections to the NTI agent

  • If at least one connection request succeeds, the NTI agent considers the overall operation successful.
  • In the current version, failed connection requests are not continuously retried.

Network failure or target server reboot

  • A network failure or a target server reboot will trigger an NVMe timeout on the host.
  • This timeout causes the NVMe controller to reset as part of the host NVMe driver’s recovery process.
  • If the target server returns to a normal state before the NVMe reset process completes, the NVMe device will recover automatically.
  • If the target server returns to a normal state after the NVMe reset process has completed, the NVMe device must be re-bound manually, as in the sketch below.
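  • A minimal re-bind sketch, assuming the NTI device is driven by the standard Linux nvme PCI driver (the PCI address 0000:3b:00.0 is a placeholder; find the actual address with lspci):
    ~$ echo 0000:3b:00.0 | sudo tee /sys/bus/pci/drivers/nvme/unbind
    ~$ echo 0000:3b:00.0 | sudo tee /sys/bus/pci/drivers/nvme/bind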

Unexpected low FIO performance

  • Ensure that both the initiator and target nodes are configured with the correct NUMA setup, as NVMe/TCP performance is highly sensitive to NUMA settings.
  • You can verify the NUMA status of PCIe devices with the following commands:
    ~$ sudo apt install hwloc
    ~$ lstopo
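  • As a hedged alternative that needs no extra packages, you can read a PCIe device's NUMA node directly from sysfs (the PCI address 0000:3b:00.0 is a placeholder):
    ~$ cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
    0                      # a value of -1 means the platform reports no NUMA affinity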

Performance bottleneck in the NVMe-oF target node

  • Ensure that the target server has at least 32 cores and a proper NUMA configuration to achieve maximum full-duplex performance (5.5 million IOPS).
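  • As a hedged sketch, you can pin the target process and its memory to the NUMA node local to the NIC (node 0 is assumed here and should be verified with lstopo; <target-launch-command> is a placeholder for your target's startup command):
    ~$ sudo numactl --cpunodebind=0 --membind=0 <target-launch-command>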

High CPU cycles consumed due to spinlock contention in FIO

  • When many FIO threads access a single file concurrently, excessive CPU cycles can be spent on spinlock contention; a workaround sketch follows this list.
  • Example call graph
    • aio_read/aio_write -> security_file_permission -> spin_lock
  • This issue can lead to performance degradation and may also occur with local Samsung NVMe devices.
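  • A hedged workaround sketch: confirm the contention with perf, then give each FIO job its own file instead of a shared one (the directory and job parameters are illustrative):
    ~$ sudo perf top -g        # look for spin_lock under security_file_permission
    ~$ fio --name=randread --directory=/mnt/test --numjobs=8 --rw=randread \
           --bs=4k --iodepth=32 --size=1g --ioengine=libaio --direct=1
  • With --directory set and no shared --filename, FIO creates a separate file per job, so each job's I/O no longer contends on a single file's locks.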

Performance impact due to CPU scheduling

  • If workloads are scheduled on cores with a long NUMA distance from the NTI device, performance degradation may occur.
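  • A hedged mitigation sketch: check the inter-node distances, then pin the workload to the node local to the NTI device (node 0 is a placeholder; <job-file> stands for your FIO job file):
    ~$ numactl --hardware                  # prints the NUMA node distance matrix
    ~$ sudo numactl --cpunodebind=0 --membind=0 fio <job-file>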

Performance fluctuation caused by bandwidth disparity

  • A bandwidth gap between the NTI TOE engine and the target-side NIC can result in packet drops, which manifest as performance fluctuation.
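  • As a hedged first check, inspect the target-side NIC's drop counters and flow-control (pause frame) settings with ethtool (eth0 is a placeholder interface name):
    ~$ ethtool -S eth0 | grep -i drop
    ~$ ethtool -a eth0     # shows whether RX/TX pause frames are enabled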