NVIDIA GPUDirect Remote Direct Memory Access (RDMA) allows an application on one computer to directly access the memory of another computer without going through the operating system. Bypassing the kernel frees up resources and greatly reduces the CPU overhead normally needed to process network communications, which is useful for distributing GPU-accelerated workloads across clusters. Because RDMA is well suited to high-bandwidth, low-latency applications, it is ideal for big data and machine learning applications.
There are currently three configuration methods for NVIDIA GPUDirect RDMA:
Shared device: This method allows an NVIDIA GPUDirect RDMA device to be shared among multiple pods on the OKD worker node where the device is exposed.
Host device: This method provides direct physical Ethernet access on the worker node by creating an additional host network on a pod. A plugin allows the network device to be moved from the host network namespace to the network namespace on the pod.
SR-IOV device: The Single Root IO Virtualization (SR-IOV) method can share a single network device, such as an Ethernet adapter, with multiple pods. SR-IOV segments the device, recognized on the host node as a physical function (PF), into multiple virtual functions (VFs). Each VF is used like any other network device.
Each of these methods can be used across either the NVIDIA GPUDirect RDMA over Converged Ethernet (RoCE) or InfiniBand infrastructures, providing a total of six configuration methods.
All methods of NVIDIA GPUDirect RDMA configuration require the installation of specific Operators. Use the following steps to install the Operators:
Install the Node Feature Discovery Operator.
Install the SR-IOV Operator.
Install the NVIDIA Network Operator (NVIDIA documentation).
Install the NVIDIA GPU Operator (NVIDIA documentation).
On some systems, including the Dell R750xa, the IRDMA kernel module creates problems for the NVIDIA Network Operator when unloading and loading the DOCA drivers. Use the following procedure to disable the module.
Generate the following machine configuration file by running the following command:
$ cat <<EOF > 99-machine-config-blacklist-irdma.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-blacklist-irdma
spec:
  kernelArguments:
    - "module_blacklist=irdma"
EOF
Create the machine configuration on the cluster and wait for the nodes to reboot by running the following command:
$ oc create -f 99-machine-config-blacklist-irdma.yaml
machineconfig.machineconfiguration.openshift.io/99-worker-blacklist-irdma created
Validate in a debug pod on each node that the module has not loaded by running the following command:
$ oc debug node/nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-32nvidiaengrdu2dcredhatcom-debug-btfj2 ...
To use host binaries, run `chroot /host`
pod IP: 10.6.135.11
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# lsmod|grep irdma
sh-5.1#
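The check above passes when `lsmod | grep irdma` prints nothing. Note that `grep` also matches lines where irdma appears only in the "Used by" column. The following is a minimal sketch of a stricter check, assuming the usual three-column `lsmod` layout. The function name `module_loaded` is hypothetical, and the sample text is a stand-in for real node output.

```shell
# module_loaded NAME: read `lsmod` output on stdin and succeed only if a
# module with exactly that name appears in the first column.
module_loaded() {
  awk -v mod="$1" 'NR > 1 && $1 == mod { found = 1 } END { exit !found }'
}

# Stand-in for `lsmod` output on a node where irdma is blacklisted.
sample='Module                  Size  Used by
ib_uverbs             217088  2 rdma_ucm,mlx5_ib
mlx5_ib               569344  0'

if printf '%s\n' "$sample" | module_loaded irdma; then
  echo "irdma is loaded"
else
  echo "irdma is not loaded"
fi
```

Inside the debug pod, after `chroot /host`, the same function could be fed with `lsmod | module_loaded irdma`.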
In some cases, device names do not persist following a reboot. For example, on R760xa systems Mellanox devices might be renamed after a reboot. You can avoid this problem by using a MachineConfig to set persistence.
Gather the MAC addresses of the interfaces on the worker nodes into a file, and provide the names that those interfaces must keep. This example uses the file 70-persistent-net.rules to store the details.
$ cat <<EOF > 70-persistent-net.rules
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:29",ATTR{type}=="1",NAME="ens8f0np0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d0",ATTR{type}=="1",NAME="ibs2f0"
SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:f0:36:d1",ATTR{type}=="1",NAME="ens8f0np0"
EOF
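Each rule in the file differs only in the MAC address and the interface name. As a sketch, assuming the same rule format as the example, a small helper can print the rules from MAC and name pairs. The function name `udev_rule` is hypothetical.

```shell
# udev_rule MAC NAME: print one persistent-naming rule in the format used above.
udev_rule() {
  printf 'SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="%s",ATTR{type}=="1",NAME="%s"\n' "$1" "$2"
}

# MAC address and interface name pairs taken from the example file.
udev_rule b8:3f:d2:3b:51:28 ibs2f0
udev_rule b8:3f:d2:3b:51:29 ens8f0np0
```

Redirecting the output of the calls into 70-persistent-net.rules would produce the same kind of file as the heredoc.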
Convert that file into a base64 string without line breaks and set the output to the variable PERSIST:
$ PERSIST=`cat 70-persistent-net.rules| base64 -w 0`
$ echo $PERSIST
U1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjozYjo1MToyOSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMCIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImliczJmMCIKU1VCU1lTVEVNPT0ibmV0IixBQ1RJT049PSJhZGQiLEFUVFJ7YWRkcmVzc309PSJiODozZjpkMjpmMDozNjpkMSIsQVRUUnt0eXBlfT09IjEiLE5BTUU9ImVuczhmMG5wMCIK
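Before embedding the string in a MachineConfig, it can help to confirm that it decodes back to the original rules. The following is a minimal sketch, assuming GNU coreutils `base64` (the `-w 0` option used above) and a temporary file that holds only the first rule of the example.

```shell
# Encode a rules file without line wrapping, then decode it again and
# compare, to confirm that the string embedded in the MachineConfig is intact.
rules_file=$(mktemp)
printf '%s\n' 'SUBSYSTEM=="net",ACTION=="add",ATTR{address}=="b8:3f:d2:3b:51:28",ATTR{type}=="1",NAME="ibs2f0"' > "$rules_file"

PERSIST=$(base64 -w 0 < "$rules_file")

if printf '%s' "$PERSIST" | base64 -d | cmp -s - "$rules_file"; then
  result="round trip OK"
else
  result="round trip FAILED"
fi
echo "$result"
rm -f "$rules_file"
```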
Create a machine configuration and set the base64 encoding in the custom resource file by running the following command:
$ cat <<EOF > 99-machine-config-udev-network.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-machine-config-udev-network
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;base64,$PERSIST
          filesystem: root
          mode: 420
          path: /etc/udev/rules.d/70-persistent-net.rules
EOF
Create the machine configuration on the cluster by running the following command:
$ oc create -f 99-machine-config-udev-network.yaml
machineconfig.machineconfiguration.openshift.io/99-machine-config-udev-network created
Use the oc get mcp command to view the machine configuration status:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-9adfe851c2c14d9598eea5ec3df6c187 True False False 1 1 1 0 6h21m
worker rendered-worker-4568f1b174066b4b1a4de794cf538fee False True False 2 0 0 0 6h21m
The nodes reboot. When the UPDATING field returns to False, you can validate the configuration on the nodes by looking at the devices in a debug pod.
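The status check can be scripted instead of read by eye. The following is a minimal sketch, assuming the column order shown in the `oc get mcp` output (NAME, CONFIG, UPDATED, UPDATING, DEGRADED). The function name `mcp_ready` is hypothetical.

```shell
# mcp_ready POOL: read `oc get mcp` output on stdin and succeed only when the
# named pool reports UPDATED=True, UPDATING=False, and DEGRADED=False.
mcp_ready() {
  awk -v pool="$1" '
    $1 == pool { seen = 1; ok = ($3 == "True" && $4 == "False" && $5 == "False") }
    END { exit !(seen && ok) }
  '
}

# Status lines copied from the example output (trailing count columns omitted).
status='NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-9adfe851c2c14d9598eea5ec3df6c187   True      False      False
worker   rendered-worker-4568f1b174066b4b1a4de794cf538fee   False     True       False'

for pool in master worker; do
  if printf '%s\n' "$status" | mcp_ready "$pool"; then
    echo "$pool: up to date"
  else
    echo "$pool: still updating or degraded"
  fi
done
```

On a cluster, a loop such as `until oc get mcp | mcp_ready worker; do sleep 30; done` would block until the worker pool finishes.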
The Node Feature Discovery (NFD) Operator manages the detection of hardware features and configuration in an OKD cluster by labeling the nodes with hardware-specific information. NFD labels the host with node-specific attributes, such as PCI cards, kernel, operating system version, and so on.
You have installed the NFD Operator.
Validate that the Operator is installed and running by looking at the pods in the openshift-nfd namespace by running the following command:
$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-8698c88cdd-t8gbc 2/2 Running 0 2m
With the NFD controller running, generate the NodeFeatureDiscovery instance and add it to the cluster.
The ClusterServiceVersion specification for the NFD Operator provides default values, including the NFD operand image that is part of the Operator payload. Retrieve its value by running the following command:
$ NFD_OPERAND_IMAGE=`echo $(oc get csv -n openshift-nfd -o json | jq -r '.items[0].metadata.annotations["alm-examples"]') | jq -r '.[] | select(.kind == "NodeFeatureDiscovery") | .spec.operand.image'`
Optional: Add entries to the default deviceClassWhitelist field to support more network adapters, such as the NVIDIA BlueField DPUs.
apiVersion: nfd.openshift.io/v1
kind: NodeFeatureDiscovery
metadata:
  name: nfd-instance
  namespace: openshift-nfd
spec:
  instance: ''
  operand:
    image: '${NFD_OPERAND_IMAGE}'
    servicePort: 12000
  prunerOnDelete: false
  topologyUpdater: false
  workerConfig:
    configData: |
      core:
        sleepInterval: 60s
      sources:
        pci:
          deviceClassWhitelist:
            - "02"
            - "03"
            - "0200"
            - "0207"
            - "12"
          deviceLabelFields:
            - "vendor"
Create the NodeFeatureDiscovery instance by running the following command:
$ oc create -f nfd-instance.yaml
nodefeaturediscovery.nfd.openshift.io/nfd-instance created
Validate that the instance is up and running by looking at the pods under the openshift-nfd namespace by running the following command:
$ oc get pods -n openshift-nfd
NAME READY STATUS RESTARTS AGE
nfd-controller-manager-7cb6d656-jcnqb 2/2 Running 0 4m
nfd-gc-7576d64889-s28k9 1/1 Running 0 21s
nfd-master-b7bcf5cfd-qnrmz 1/1 Running 0 21s
nfd-worker-96pfh 1/1 Running 0 21s
nfd-worker-b2gkg 1/1 Running 0 21s
nfd-worker-bd9bk 1/1 Running 0 21s
nfd-worker-cswf4 1/1 Running 0 21s
nfd-worker-kp6gg 1/1 Running 0 21s
Wait a short period of time and then verify that NFD has added labels to the node. The NFD labels are prefixed with feature.node.kubernetes.io, so you can easily filter them.
$ oc get node -o json | jq '.items[0].metadata.labels | with_entries(select(.key | startswith("feature.node.kubernetes.io")))'
{
"feature.node.kubernetes.io/cpu-cpuid.ADX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AESNI": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX": "true",
"feature.node.kubernetes.io/cpu-cpuid.AVX2": "true",
"feature.node.kubernetes.io/cpu-cpuid.CETSS": "true",
"feature.node.kubernetes.io/cpu-cpuid.CLZERO": "true",
"feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8": "true",
"feature.node.kubernetes.io/cpu-cpuid.CPBOOST": "true",
"feature.node.kubernetes.io/cpu-cpuid.EFER_LMSLE_UNS": "true",
"feature.node.kubernetes.io/cpu-cpuid.FMA3": "true",
"feature.node.kubernetes.io/cpu-cpuid.FP256": "true",
"feature.node.kubernetes.io/cpu-cpuid.FSRM": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSR": "true",
"feature.node.kubernetes.io/cpu-cpuid.FXSROPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBPB": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS_PREFERRED": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBRS_PROVIDES_SMP": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSBRNTRGT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSFETCHSAM": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSFFV": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPCNT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPCNTEXT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSOPSAM": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSRDWROPCNT": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBSRIPINVALIDCHK": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_FETCH_CTLX": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_OPFUSE": "true",
"feature.node.kubernetes.io/cpu-cpuid.IBS_PREVENTHOST": "true",
"feature.node.kubernetes.io/cpu-cpuid.INT_WBINVD": "true",
"feature.node.kubernetes.io/cpu-cpuid.INVLPGB": "true",
"feature.node.kubernetes.io/cpu-cpuid.LAHF": "true",
"feature.node.kubernetes.io/cpu-cpuid.LBRVIRT": "true",
"feature.node.kubernetes.io/cpu-cpuid.MCAOVERFLOW": "true",
"feature.node.kubernetes.io/cpu-cpuid.MCOMMIT": "true",
"feature.node.kubernetes.io/cpu-cpuid.MOVBE": "true",
"feature.node.kubernetes.io/cpu-cpuid.MOVU": "true",
"feature.node.kubernetes.io/cpu-cpuid.MSRIRC": "true",
"feature.node.kubernetes.io/cpu-cpuid.MSR_PAGEFLUSH": "true",
"feature.node.kubernetes.io/cpu-cpuid.NRIPS": "true",
"feature.node.kubernetes.io/cpu-cpuid.OSXSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.PPIN": "true",
"feature.node.kubernetes.io/cpu-cpuid.PSFD": "true",
"feature.node.kubernetes.io/cpu-cpuid.RDPRU": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_64BIT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_ALTERNATIVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_DEBUGSWAP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_ES": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_RESTRICTED": "true",
"feature.node.kubernetes.io/cpu-cpuid.SEV_SNP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SHA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SME": "true",
"feature.node.kubernetes.io/cpu-cpuid.SME_COHERENT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD": "true",
"feature.node.kubernetes.io/cpu-cpuid.SSE4A": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP": "true",
"feature.node.kubernetes.io/cpu-cpuid.STIBP_ALWAYSON": "true",
"feature.node.kubernetes.io/cpu-cpuid.SUCCOR": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVM": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMDA": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMFBASID": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVML": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMNP": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMPF": "true",
"feature.node.kubernetes.io/cpu-cpuid.SVMPFT": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSCALL": "true",
"feature.node.kubernetes.io/cpu-cpuid.SYSEE": "true",
"feature.node.kubernetes.io/cpu-cpuid.TLB_FLUSH_NESTED": "true",
"feature.node.kubernetes.io/cpu-cpuid.TOPEXT": "true",
"feature.node.kubernetes.io/cpu-cpuid.TSCRATEMSR": "true",
"feature.node.kubernetes.io/cpu-cpuid.VAES": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMCBCLEAN": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMPL": "true",
"feature.node.kubernetes.io/cpu-cpuid.VMSA_REGPROT": "true",
"feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ": "true",
"feature.node.kubernetes.io/cpu-cpuid.VTE": "true",
"feature.node.kubernetes.io/cpu-cpuid.WBNOINVD": "true",
"feature.node.kubernetes.io/cpu-cpuid.X87": "true",
"feature.node.kubernetes.io/cpu-cpuid.XGETBV1": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVE": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEC": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT": "true",
"feature.node.kubernetes.io/cpu-cpuid.XSAVES": "true",
"feature.node.kubernetes.io/cpu-hardware_multithreading": "false",
"feature.node.kubernetes.io/cpu-model.family": "25",
"feature.node.kubernetes.io/cpu-model.id": "1",
"feature.node.kubernetes.io/cpu-model.vendor_id": "AMD",
"feature.node.kubernetes.io/kernel-config.NO_HZ": "true",
"feature.node.kubernetes.io/kernel-config.NO_HZ_FULL": "true",
"feature.node.kubernetes.io/kernel-selinux.enabled": "true",
"feature.node.kubernetes.io/kernel-version.full": "5.14.0-427.35.1.el9_4.x86_64",
"feature.node.kubernetes.io/kernel-version.major": "5",
"feature.node.kubernetes.io/kernel-version.minor": "14",
"feature.node.kubernetes.io/kernel-version.revision": "0",
"feature.node.kubernetes.io/memory-numa": "true",
"feature.node.kubernetes.io/network-sriov.capable": "true",
"feature.node.kubernetes.io/pci-102b.present": "true",
"feature.node.kubernetes.io/pci-10de.present": "true",
"feature.node.kubernetes.io/pci-10de.sriov.capable": "true",
"feature.node.kubernetes.io/pci-15b3.present": "true",
"feature.node.kubernetes.io/pci-15b3.sriov.capable": "true",
"feature.node.kubernetes.io/rdma.available": "true",
"feature.node.kubernetes.io/rdma.capable": "true",
"feature.node.kubernetes.io/storage-nonrotationaldisk": "true",
"feature.node.kubernetes.io/system-os_release.ID": "rhcos",
"feature.node.kubernetes.io/system-os_release.OPENSHIFT_VERSION": "4.17",
"feature.node.kubernetes.io/system-os_release.OSTREE_VERSION": "417.94.202409121747-0",
"feature.node.kubernetes.io/system-os_release.RHEL_VERSION": "9.4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID": "4.17",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.major": "4",
"feature.node.kubernetes.io/system-os_release.VERSION_ID.minor": "17"
}
Confirm that a network device is discovered:
$ oc describe node | grep -E 'Roles|pci' | grep pci-15b3
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true
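The `pci-15b3` label carries the PCI vendor ID 15b3, which belongs to Mellanox; 10de is NVIDIA. As a sketch, assuming the `key=value` label format printed by `oc describe node`, the following helper lists every vendor ID that NFD marked as present. The function name `pci_vendors` is hypothetical.

```shell
# pci_vendors: read node label lines on stdin and print each PCI vendor ID
# that NFD marked as present, one per line, without duplicates.
pci_vendors() {
  sed -n 's/.*pci-\([0-9a-f]\{4\}\)\.present=true.*/\1/p' | sort -u
}

# Label lines in the form printed by `oc describe node`.
labels='feature.node.kubernetes.io/pci-102b.present=true
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-10de.sriov.capable=true
feature.node.kubernetes.io/pci-15b3.present=true
feature.node.kubernetes.io/pci-15b3.sriov.capable=true'

printf '%s\n' "$labels" | pci_vendors
```

On a cluster, the input could come from `oc describe node | grep pci-`.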
Single root I/O virtualization (SR-IOV) enhances the performance of NVIDIA GPUDirect RDMA by sharing a single device across multiple pods.
You have installed the SR-IOV Operator.
Validate that the Operator is installed and running by looking at the pods in the openshift-sriov-network-operator namespace by running the following command:
$ oc get pods -n openshift-sriov-network-operator
NAME READY STATUS RESTARTS AGE
sriov-network-operator-7cb6c49868-89486 1/1 Running 0 22s
For the default SriovOperatorConfig CR to work with the MLNX_OFED container, create a custom resource file, sriov-operator-config.yaml in this example, with the following values:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovOperatorConfig
metadata:
  name: default
  namespace: openshift-sriov-network-operator
spec:
  enableInjector: true
  enableOperatorWebhook: true
  logLevel: 2
Create the resource on the cluster by running the following command:
$ oc create -f sriov-operator-config.yaml
sriovoperatorconfig.sriovnetwork.openshift.io/default created
Patch the SriovOperatorConfig resource so that the MOFED container can work with it by running the following command:
$ oc patch sriovoperatorconfig default --type=merge -n openshift-sriov-network-operator --patch '{ "spec": { "configDaemonNodeSelector": { "network.nvidia.com/operator.mofed.wait": "false", "node-role.kubernetes.io/worker": "", "feature.node.kubernetes.io/pci-15b3.sriov.capable": "true" } } }'
sriovoperatorconfig.sriovnetwork.openshift.io/default patched
The NVIDIA Network Operator manages NVIDIA networking resources and networking-related components, such as drivers and device plugins, to enable NVIDIA GPUDirect RDMA workloads.
You have installed the NVIDIA Network Operator.
Validate that the Network Operator is installed and running by confirming that the controller is running in the nvidia-network-operator namespace by running the following command:
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
nvidia-network-operator-controller-manager-6f7d6956cd-fw5wg 1/1 Running 0 5m
With the Operator running, create the NicClusterPolicy custom resource file. The device you choose depends on your system configuration. In this example, the InfiniBand interface ibs2f0 is hard coded and is used as the shared NVIDIA GPUDirect RDMA device.
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  nicFeatureDiscovery:
    image: nic-feature-discovery
    repository: ghcr.io/mellanox
    version: v0.0.1
  docaTelemetryService:
    image: doca_telemetry
    repository: nvcr.io/nvidia/doca
    version: 1.16.5-doca2.6.0-host
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: v1.5.1
  secondaryNetwork:
    ipoib:
      image: ipoib-cni
      repository: ghcr.io/mellanox
      version: v1.2.0
  nvIpam:
    enableWebhook: false
    image: nvidia-k8s-ipam
    repository: ghcr.io/mellanox
    version: v0.2.0
  ofedDriver:
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    forcePrecompiled: false
    terminationGracePeriodSeconds: 300
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: true
        enable: true
        force: true
        timeoutSeconds: 300
        podSelector: ''
      maxParallelUpgrades: 1
      safeLoad: false
      waitForCompletion:
        timeoutSeconds: 0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
Create the NicClusterPolicy custom resource on the cluster by running the following command:
$ oc create -f network-sharedrdma-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Validate the NicClusterPolicy by confirming that the DOCA/MOFED container and the related pods are running:
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
doca-telemetry-service-hwj65 1/1 Running 2 160m
kube-ipoib-cni-ds-fsn8g 1/1 Running 2 160m
mofed-rhcos4.16-9b5ddf4c6-ds-ct2h5 2/2 Running 4 160m
nic-feature-discovery-ds-dtksz 1/1 Running 2 160m
nv-ipam-controller-854585f594-c5jpp 1/1 Running 2 160m
nv-ipam-controller-854585f594-xrnp5 1/1 Running 2 160m
nv-ipam-node-xqttl 1/1 Running 2 160m
nvidia-network-operator-controller-manager-5798b564cd-5cq99 1/1 Running 2 5d23h
rdma-shared-dp-ds-p9vvg 1/1 Running 0 85m
Use rsh to open a shell in the mofed container and check the status by running the following commands:
$ MOFED_pod=$(oc get pods -n nvidia-network-operator -o name | grep mofed)
$ oc rsh -n nvidia-network-operator -c mofed-container ${MOFED_pod}
sh-5.1# ofed_info -s
OFED-internal-24.07-0.6.1:
sh-5.1# ibdev2netdev -v
0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) BlueField-3 E-series SuperNIC 400GbE/NDR single port QSFP112, PCIe Gen5.0 x16 FHHL, Crypto Enabled, 16GB DDR5, BMC, Tall Bracket fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)
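The `ibdev2netdev -v` lines are long, and the useful part is the mapping from RDMA device to network interface. The following is a minimal sketch that extracts it, assuming each line has the `==> <interface> (<state>)` tail shown above. The function name `rdma_links` is hypothetical.

```shell
# rdma_links: read `ibdev2netdev -v` output on stdin and print
# "<rdma device> <network interface> <link state>" for each line.
rdma_links() {
  awk '{
    for (i = 1; i < NF; i++)
      if ($i == "==>") {
        state = $(i + 2)
        gsub(/[()]/, "", state)
        print $2, $(i + 1), state
      }
  }'
}

# Shortened copies of the two example lines (the adapter description is elided).
sample='0000:0d:00.0 mlx5_0 (MT41692 - 900-9D3B4-00EN-EA0) fw 32.42.1000 port 1 (ACTIVE) ==> ibs2f0 (Up)
0000:a0:00.0 mlx5_1 (MT41692 - 900-9D3B4-00EN-EA0) fw 32.42.1000 port 1 (ACTIVE) ==> ens8f0np0 (Up)'

printf '%s\n' "$sample" | rdma_links
```

The interface names in the second column are the ones that the network custom resources reference as master.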
Create an IPoIBNetwork custom resource file:
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork
spec:
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.6.225/28",
      "exclude": [
        "192.168.6.229/30",
        "192.168.6.236/32"
      ]
    }
  master: ibs2f0
  networkNamespace: default
Create the IPoIBNetwork resource on the cluster by running the following command:
$ oc create -f ipoib-network.yaml
ipoibnetwork.mellanox.com/example-ipoibnetwork created
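The ipam range and exclude entries in the IPoIBNetwork example are written with host addresses rather than network addresses, so the blocks they cover are not obvious. The following is a small sketch that computes the first and last address of each block, assuming a shell with 64-bit arithmetic. The function name `cidr_bounds` is hypothetical, and the sketch says nothing about which addresses whereabouts finally assigns.

```shell
# cidr_bounds CIDR: print the first and last address of the block that
# contains the given address, as "first-last". The body runs in a subshell
# so the IFS change and the variables do not leak into the caller.
cidr_bounds() (
  ip=${1%/*}
  prefix=${1#*/}
  old_ifs=$IFS; IFS=.
  set -- $ip            # split the dotted quad into four octets
  IFS=$old_ifs
  addr=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
  mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
  first=$(( addr & mask ))
  last=$(( first | (mask ^ 0xFFFFFFFF) ))
  printf '%d.%d.%d.%d-%d.%d.%d.%d\n' \
    $(( (first >> 24) & 255 )) $(( (first >> 16) & 255 )) \
    $(( (first >> 8) & 255 ))  $(( first & 255 )) \
    $(( (last >> 24) & 255 ))  $(( (last >> 16) & 255 )) \
    $(( (last >> 8) & 255 ))   $(( last & 255 ))
)

cidr_bounds 192.168.6.225/28   # the range
cidr_bounds 192.168.6.229/30   # first exclude entry
cidr_bounds 192.168.6.236/32   # second exclude entry
```

The /28 range spans 192.168.6.224 to 192.168.6.239, the /30 exclude covers 192.168.6.228 to 192.168.6.231, and the /32 exclude is the single address 192.168.6.236.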
Create a MacvlanNetwork custom resource file for your other interface:
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: ens8f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type": "whereabouts", "range": "192.168.2.0/24", "gateway": "192.168.2.1"}'
Create the resource on the cluster by running the following command:
$ oc create -f macvlan-network.yaml
macvlannetwork.mellanox.com/rdmashared-net created
The GPU Operator automates the management of the NVIDIA drivers, device plugins for GPUs, the NVIDIA Container Toolkit, and other components required for GPU provisioning.
You have installed the GPU Operator.
Check that the Operator pod is running by looking at the pods in the nvidia-gpu-operator namespace by running the following command:
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-operator-b4cb7d74-zxpwq 1/1 Running 0 32s
Create a GPU cluster policy custom resource file similar to the following example:
apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: gpu-cluster-policy
spec:
  vgpuDeviceManager:
    config:
      default: default
    enabled: true
  migManager:
    config:
      default: all-disabled
      name: default-mig-parted-config
    enabled: true
  operator:
    defaultRuntime: crio
    initContainer: {}
    runtimeClass: nvidia
    use_ocp_driver_toolkit: true
  dcgm:
    enabled: true
  gfd:
    enabled: true
  dcgmExporter:
    config:
      name: ''
    serviceMonitor:
      enabled: true
    enabled: true
  cdi:
    default: false
    enabled: false
  driver:
    licensingConfig:
      nlsEnabled: true
      configMapName: ''
    certConfig:
      name: ''
    rdma:
      enabled: false
    kernelModuleConfig:
      name: ''
    upgradePolicy:
      autoUpgrade: true
      drain:
        deleteEmptyDir: false
        enable: false
        force: false
        timeoutSeconds: 300
      maxParallelUpgrades: 1
      maxUnavailable: 25%
      podDeletion:
        deleteEmptyDir: false
        force: false
        timeoutSeconds: 300
      waitForCompletion:
        timeoutSeconds: 0
    repoConfig:
      configMapName: ''
    virtualTopology:
      config: ''
    enabled: true
    useNvidiaDriverCRD: false
    useOpenKernelModules: true
  devicePlugin:
    config:
      name: ''
      default: ''
    mps:
      root: /run/nvidia/mps
    enabled: true
  gdrcopy:
    enabled: true
  kataManager:
    config:
      artifactsDir: /opt/nvidia-gpu-operator/artifacts/runtimeclasses
  mig:
    strategy: single
  sandboxDevicePlugin:
    enabled: true
  validator:
    plugin:
      env:
        - name: WITH_WORKLOAD
          value: 'false'
  nodeStatusExporter:
    enabled: true
  daemonsets:
    rollingUpdate:
      maxUnavailable: '1'
    updateStrategy: RollingUpdate
  sandboxWorkloads:
    defaultWorkload: container
    enabled: false
  gds:
    enabled: true
    image: nvidia-fs
    version: 2.20.5
    repository: nvcr.io/nvidia/cloud-native
  vgpuManager:
    enabled: false
  vfioManager:
    enabled: true
  toolkit:
    installDir: /usr/local/nvidia
    enabled: true
After you generate the GPU ClusterPolicy custom resource file, create the resource on the cluster by running the following command:
$ oc create -f gpu-cluster-policy.yaml
clusterpolicy.nvidia.com/gpu-cluster-policy created
Validate that the Operator is installed and running by running the following command:
$ oc get pods -n nvidia-gpu-operator
NAME READY STATUS RESTARTS AGE
gpu-feature-discovery-d5ngn 1/1 Running 0 3m20s
gpu-feature-discovery-z42rx 1/1 Running 0 3m23s
gpu-operator-6bb4d4b4c5-njh78 1/1 Running 0 4m35s
nvidia-container-toolkit-daemonset-bkh8l 1/1 Running 0 3m20s
nvidia-container-toolkit-daemonset-c4hzm 1/1 Running 0 3m23s
nvidia-cuda-validator-4blvg 0/1 Completed 0 106s
nvidia-cuda-validator-tw8sl 0/1 Completed 0 112s
nvidia-dcgm-exporter-rrw4g 1/1 Running 0 3m20s
nvidia-dcgm-exporter-xc78t 1/1 Running 0 3m23s
nvidia-dcgm-nvxpf 1/1 Running 0 3m20s
nvidia-dcgm-snj4j 1/1 Running 0 3m23s
nvidia-device-plugin-daemonset-fk2xz 1/1 Running 0 3m23s
nvidia-device-plugin-daemonset-wq87j 1/1 Running 0 3m20s
nvidia-driver-daemonset-416.94.202410211619-0-ngrjg 4/4 Running 0 3m58s
nvidia-driver-daemonset-416.94.202410211619-0-tm4x6 4/4 Running 0 3m58s
nvidia-node-status-exporter-jlzxh 1/1 Running 0 3m57s
nvidia-node-status-exporter-zjffs 1/1 Running 0 3m57s
nvidia-operator-validator-l49hx 1/1 Running 0 3m20s
nvidia-operator-validator-n44nn 1/1 Running 0 3m23s
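With this many pods, a scripted check is easier than scanning the list. The following is a minimal sketch, assuming the standard `oc get pods` columns (NAME, READY, STATUS). It prints any pod that is neither Completed nor Running with all containers ready. The function name `pods_not_ready` and the failing pod in the sample are hypothetical.

```shell
# pods_not_ready: read `oc get pods` output on stdin and print the name of
# every pod that is neither Completed nor Running with all containers ready.
pods_not_ready() {
  awk 'NR > 1 {
    split($2, ready, "/")
    if ($3 == "Completed") next
    if ($3 == "Running" && ready[1] == ready[2]) next
    print $1
  }'
}

# Three lines from the example output plus one hypothetical failing pod.
sample='NAME                                                  READY   STATUS             RESTARTS   AGE
gpu-operator-6bb4d4b4c5-njh78                         1/1     Running            0          4m35s
nvidia-cuda-validator-4blvg                           0/1     Completed          0          106s
nvidia-driver-daemonset-416.94.202410211619-0-ngrjg   4/4     Running            0          3m58s
example-broken-pod                                    0/1     CrashLoopBackOff   3          2m'

printf '%s\n' "$sample" | pods_not_ready
```

An empty result from `oc get pods -n nvidia-gpu-operator | pods_not_ready` would indicate that every pod is in one of the two expected states.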
Optional: After you verify that the pods are running, open a remote shell into the NVIDIA driver daemonset pod and confirm that the NVIDIA modules are loaded. Specifically, ensure that the nvidia_peermem module is loaded.
$ oc rsh -n nvidia-gpu-operator $(oc -n nvidia-gpu-operator get pod -o name -l app.kubernetes.io/component=nvidia-driver)
sh-4.4# lsmod|grep nvidia
nvidia_fs 327680 0
nvidia_peermem 24576 0
nvidia_modeset 1507328 0
video 73728 1 nvidia_modeset
nvidia_uvm 6889472 8
nvidia 8810496 43 nvidia_uvm,nvidia_peermem,nvidia_fs,gdrdrv,nvidia_modeset
ib_uverbs 217088 3 nvidia_peermem,rdma_ucm,mlx5_ib
drm 741376 5 drm_kms_helper,drm_shmem_helper,nvidia,mgag200
Optional: Run the nvidia-smi utility to show the details about the driver and the hardware:
sh-4.4# nvidia-smi
Wed Nov 6 22:03:53 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.07 Driver Version: 550.90.07 CUDA Version: 12.4 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA A40 On | 00000000:61:00.0 Off | 0 |
| 0% 37C P0 88W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA A40 On | 00000000:E1:00.0 Off | 0 |
| 0% 28C P8 29W / 300W | 1MiB / 46068MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
While you are still in the driver pod, set the GPU clock to maximum by using the nvidia-smi command:
$ oc rsh -n nvidia-gpu-operator nvidia-driver-daemonset-416.94.202410172137-0-ndhzc
sh-4.4# nvidia-smi -i 0 -lgc $(nvidia-smi -i 0 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:61:00.0
All done.
sh-4.4# nvidia-smi -i 1 -lgc $(nvidia-smi -i 1 --query-supported-clocks=graphics --format=csv,noheader,nounits | sort -h | tail -n 1)
GPU clocks set to "(gpuClkMin 1740, gpuClkMax 1740)" for GPU 00000000:E1:00.0
All done.
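The command substitution in the `-lgc` calls picks the highest supported graphics clock: the query prints one value per line, the values are sorted, and `tail -n 1` keeps the last one. The following sketch isolates that selection step, using a plain numeric sort and a stand-in list of clocks. The function name `max_clock` is hypothetical, and only the value 1740 comes from the example output.

```shell
# max_clock: read one clock value (MHz) per line on stdin and print the highest.
max_clock() {
  sort -n | tail -n 1
}

# Hypothetical list of supported graphics clocks; only 1740 is from the example.
printf '%s\n' 210 1740 885 1305 | max_clock
```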
Validate that the resources are available by describing the worker nodes with the following command:
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A9
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596712Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445736Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596672Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445696Ki
nvidia.com/gpu: 2
pods: 250
rdma/rdma_shared_device_eth: 63
rdma/rdma_shared_device_ib: 63
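The resources of interest in this output are the extended ones, whose names contain a slash. The following is a minimal sketch that filters them out of the Allocatable section, assuming the indented layout that `oc describe node` prints on a live cluster. The function name `allocatable_extended` is hypothetical, and the sample is a shortened stand-in for real output.

```shell
# allocatable_extended: read `oc describe node` output on stdin and print the
# extended resources (names containing "/") from each Allocatable section.
allocatable_extended() {
  awk '
    /^Allocatable:/ { in_alloc = 1; next }
    /^[^ ]/         { in_alloc = 0 }
    in_alloc && $1 ~ /\// { sub(/:$/, "", $1); print $1, $2 }
  '
}

# Shortened stand-in for one node; the values match the example output.
sample='Capacity:
  nvidia.com/gpu:               2
Allocatable:
  cpu:                          127500m
  nvidia.com/gpu:               2
  pods:                         250
  rdma/rdma_shared_device_eth:  63
  rdma/rdma_shared_device_ib:   63
System Info:'

printf '%s\n' "$sample" | allocatable_extended
```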
Before you create the resource pods, create the machineconfig.yaml custom resource (CR), which provides access to the GPU and networking resources without requiring user privileges.
Generate a MachineConfig CR:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 02-worker-container-runtime
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZV0KZGVmYXVsdF91bGltaXRzID0gWwoibWVtbG9jaz0tMTotMSIKXQo=
          mode: 420
          overwrite: true
          path: /etc/crio/crio.conf.d/10-custom
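The file content in this MachineConfig is base64-encoded, so what it writes to the node is not visible at a glance. As a sketch, decoding the payload with `base64 -d` shows the CRI-O drop-in file.

```shell
# Decode the base64 payload from the MachineConfig to see the CRI-O drop-in it writes.
payload='W2NyaW8ucnVudGltZV0KZGVmYXVsdF91bGltaXRzID0gWwoibWVtbG9jaz0tMTotMSIKXQo='
decoded=$(printf '%s' "$payload" | base64 -d)
printf '%s\n' "$decoded"
```

The decoded file sets `default_ulimits` in the `[crio.runtime]` table to `memlock=-1:-1`, which removes the soft and hard locked-memory limits for containers. RDMA workloads typically need this because they pin memory.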
Use the procedures in this section to create the workload pods for the shared and host devices.
Create the workload pods for a shared RDMA device on RDMA over Converged Ethernet (RoCE) for the NVIDIA Network Operator and test the pod configuration.
The NVIDIA GPUDirect RDMA device is shared among pods on the OKD worker node where the device is exposed.
Ensure that the Operator is running.
Delete the NicClusterPolicy custom resource (CR), if it exists.
Generate custom pod resources:
$ cat <<EOF > rdma-eth-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  containers:
  - image: quay.io/edge-infrastructure/nvidia-tools:0.1.5
    name: rdma-eth-32-workload
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
$ cat <<EOF > rdma-eth-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-eth-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  containers:
  - image: quay.io/edge-infrastructure/nvidia-tools:0.1.5
    name: rdma-eth-33-workload
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_eth: 1
EOF
Create the pods on the cluster by using the following commands:
$ oc create -f rdma-eth-32-workload.yaml
pod/rdma-eth-32-workload created
$ oc create -f rdma-eth-33-workload.yaml
pod/rdma-eth-33-workload created
Verify that the pods are running by using the following command:
$ oc get pods -n default
NAME READY STATUS RESTARTS AGE
rdma-eth-32-workload 1/1 Running 0 25s
rdma-eth-33-workload 1/1 Running 0 22s
Create the workload pods for a host device Remote Direct Memory Access (RDMA) for the NVIDIA Network Operator and test the pod configuration.
Ensure that the Operator is running.
Delete the NicClusterPolicy custom resource (CR), if it exists.
Generate a new host device NicClusterPolicy CR, as shown in the following example:
$ cat <<EOF > network-hostdev-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }
EOF
Create the NicClusterPolicy CR on the cluster by using the following command:
$ oc create -f network-hostdev-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Verify the host device NicClusterPolicy CR by checking that the DOCA/MOFED driver pods and the SR-IOV device plugin pods are running. Use the following command:
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
mofed-rhcos4.16-696886fcb4-ds-9sgvd 2/2 Running 0 2m37s
mofed-rhcos4.16-696886fcb4-ds-lkjd4 2/2 Running 0 2m37s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 0 141m
sriov-device-plugin-6v2nz 1/1 Running 0 2m14s
sriov-device-plugin-hc4t8 1/1 Running 0 2m14s
Confirm that the resources appear in the Capacity and Allocatable sections of the oc describe node output by using the following command:
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A7
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596708Ki
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445732Ki
nvidia.com/hostdev: 2
pods: 250
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596704Ki
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445728Ki
nvidia.com/hostdev: 2
pods: 250
Create a HostDeviceNetwork CR file:
$ cat <<EOF > hostdev-network.yaml
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ]
    }
EOF
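The CIDR values in the ipam block can be worked out by hand. The following is a minimal sketch of that arithmetic in POSIX shell. The helper names `ip_to_int`, `int_to_ip`, and `cidr_bounds` are hypothetical and are not part of whereabouts or any other tool in this procedure. The sketch only computes the first and last address of the block that each CIDR covers and makes no claim about which addresses whereabouts hands out first.

```shell
# Hypothetical helpers for illustration only.
# Convert a dotted-quad IPv4 address to an integer (runs in a subshell so
# that the IFS change stays local).
ip_to_int() (
  IFS=.
  set -- $1
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Convert an integer back to dotted-quad notation.
int_to_ip() {
  echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Print the first and last address of the block containing ADDR/PREFIX.
# Intended for prefix lengths from 1 to 32.
cidr_bounds() (
  addr=${1%/*}
  prefix=${1#*/}
  ip=$(ip_to_int "$addr")
  mask=$(( (4294967295 << (32 - $prefix)) & 4294967295 ))
  net=$(( $ip & $mask ))
  last=$(( $net | (~$mask & 4294967295) ))
  echo "$(int_to_ip "$net") $(int_to_ip "$last")"
)

cidr_bounds 192.168.3.225/28   # the "range" value
cidr_bounds 192.168.3.229/30   # first "exclude" entry
cidr_bounds 192.168.3.236/32   # second "exclude" entry
```

The /28 range spans 192.168.3.224 through 192.168.3.239, the /30 exclusion spans 192.168.3.228 through 192.168.3.231, and the /32 exclusion is the single address 192.168.3.236.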
Create the HostDeviceNetwork resource on the cluster by using the following command:
$ oc create -f hostdev-network.yaml
hostdevicenetwork.mellanox.com/hostdev-net created
Confirm that the resources appear in the Capacity and Allocatable sections of the oc describe node output by using the following command:
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596708Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445732Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596680Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445704Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
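To check a single resource rather than reading the whole listing, the output can be filtered further. The following is a minimal sketch under the assumption that the resource appears as a `name: value` line, as in the output above. `node_resource_values` is a hypothetical helper name, and the sample is an abbreviated copy of one node from that output.

```shell
# Hypothetical helper: print every value reported for one resource in
# `oc describe node` output read on standard input.
# $1 = resource name, for example nvidia.com/hostdev
node_resource_values() {
  awk -v key="$1:" '$1 == key { print $2 }'
}

# Abbreviated sample taken from the output above (one node).
sample_node='Capacity:
cpu: 128
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250
Allocatable:
cpu: 127500m
nvidia.com/gpu: 2
nvidia.com/hostdev: 2
pods: 250'

# Prints one line for Capacity and one for Allocatable.
printf '%s\n' "$sample_node" | node_resource_values nvidia.com/hostdev
```

On a live cluster, the input could come from `oc describe node -l node-role.kubernetes.io/worker=` instead of the sample text.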
Configure a Single Root I/O Virtualization (SR-IOV) legacy mode host device RDMA on RoCE.
Generate a new host device NicClusterPolicy custom resource (CR):
$ cat <<EOF > network-sriovleg-nic-cluster-policy.yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 24.10-0.7.0.0-0
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
    env:
    - name: UNLOAD_STORAGE_MODULES
      value: "true"
    - name: RESTORE_DRIVER_ON_POD_TERMINATION
      value: "true"
    - name: CREATE_IFNAMES_UDEV
      value: "true"
EOF
Create the policy on the cluster by using the following command:
$ oc create -f network-sriovleg-nic-cluster-policy.yaml
nicclusterpolicy.mellanox.com/nic-cluster-policy created
Verify that the DOCA/MOFED driver pods are running by using the following command:
$ oc get pods -n nvidia-network-operator
NAME READY STATUS RESTARTS AGE
mofed-rhcos4.16-696886fcb4-ds-4mb42 2/2 Running 0 40s
mofed-rhcos4.16-696886fcb4-ds-8knwq 2/2 Running 0 40s
nvidia-network-operator-controller-manager-68d547dbbd-qsdkf 1/1 Running 13 (4d ago) 4d21h
Create an SriovNetworkNodePolicy CR that generates the virtual functions (VFs) for the device that you want to operate in SR-IOV legacy mode. See the following example:
$ cat <<EOF > sriov-network-node-policy.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
EOF
Ensure that SR-IOV Global Enable is enabled. For more information, see Unable to enable SR-IOV and receiving the message "not enough MMIO resources for SR-IOV" in Red Hat Enterprise Linux.
Create the CR on the cluster by using the following command:
$ oc create -f sriov-network-node-policy.yaml
sriovnetworknodepolicy.sriovnetwork.openshift.io/sriov-legacy-policy created
Each node has scheduling disabled. The nodes reboot to apply the configuration. You can view the nodes by using the following command:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
edge-19.edge.lab.eng.rdu2.redhat.com Ready control-plane,master,worker 5d v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com Ready worker 4d22h v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com NotReady,SchedulingDisabled worker 4d22h v1.29.8+632b078
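Because the nodes reboot one after another, it can help to list only the nodes that are not yet back in service. The following is a minimal sketch, with the hypothetical helper `nodes_not_ready` filtering `oc get nodes` output for any STATUS other than plain Ready. The sample text is copied from the output above.

```shell
# Hypothetical helper: print the name of every node whose STATUS column is
# anything other than "Ready" (for example NotReady,SchedulingDisabled).
nodes_not_ready() {
  awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Sample input copied from the output above.
sample_nodes='NAME STATUS ROLES AGE VERSION
edge-19.edge.lab.eng.rdu2.redhat.com Ready control-plane,master,worker 5d v1.29.8+632b078
nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com Ready worker 4d22h v1.29.8+632b078
nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com NotReady,SchedulingDisabled worker 4d22h v1.29.8+632b078'

printf '%s\n' "$sample_nodes" | nodes_not_ready
```

When the helper prints nothing for `oc get nodes | nodes_not_ready`, every node reports Ready and the procedure can continue.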
After the nodes have rebooted, verify that the VF interfaces exist by opening up a debug pod on each node. Run the following command:
$ oc debug node/nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
Starting pod/nvd-srv-33nvidiaengrdu2dcredhatcom-debug-cqfjz ...
To use host binaries, run `chroot /host`
pod IP: 10.6.135.12
If you don't see a command prompt, try pressing enter.
sh-5.1# chroot /host
sh-5.1# ip link show | grep ens8
26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
44: ens8f0v2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
45: ens8f0v3: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
46: ens8f0v4: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
47: ens8f0v5: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
48: ens8f0v6: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
49: ens8f0v7: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
Repeat the previous steps on the second node, if necessary.
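The VF count can be compared against the numVfs value of the policy instead of counting lines by eye. The following is a minimal sketch that assumes the VF naming pattern shown above (the PF prefix followed by `v` and an index). `count_vfs` is a hypothetical helper name, and the sample is an abbreviated copy of the output above that contains the PF and two VFs.

```shell
# Hypothetical helper: count VF interfaces in `ip link show` output read on
# standard input.
# $1 = PF name prefix that the VF names start with, for example ens8f0
count_vfs() {
  grep -c -E "^[0-9]+: ${1}v[0-9]+:"
}

# Abbreviated sample copied from the output above: the PF and two VFs.
sample_links='26: ens8f0np0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
42: ens8f0v0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
43: ens8f0v1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000'

found=$(printf '%s\n' "$sample_links" | count_vfs ens8f0)
echo "VF interfaces found: $found"
```

In the debug pod, `ip link show | count_vfs ens8f0` would be expected to match the numVfs value of 8 from the SriovNetworkNodePolicy CR.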
Optional: Confirm that the resources appear in the Capacity and Allocatable sections of the oc describe node output by using the following command:
$ oc describe node -l node-role.kubernetes.io/worker=| grep -E 'Capacity:|Allocatable:' -A8
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596692Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445716Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Capacity:
cpu: 128
ephemeral-storage: 1561525616Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 263596688Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
--
Allocatable:
cpu: 127500m
ephemeral-storage: 1438028263499
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 262445712Ki
nvidia.com/gpu: 2
nvidia.com/hostdev: 0
openshift.io/sriovlegacy: 8
After the VFs for SR-IOV legacy mode are in place, generate the SriovNetwork CR file. See the following example:
$ cat <<EOF > sriov-network.yaml
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ]
    }
EOF
Create the custom resource on the cluster by using the following command:
$ oc create -f sriov-network.yaml
sriovnetwork.sriovnetwork.openshift.io/sriov-network created
Create the workload pods for a shared device Remote Direct Memory Access (RDMA) for an Infiniband installation.
Generate custom pod resources:
$ cat <<EOF > rdma-ib-32-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-32-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-32.nvidia.eng.rdu2.dc.redhat.com
  containers:
  - image: quay.io/edge-infrastructure/nvidia-tools:0.1.5
    name: rdma-ib-32-workload
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF
$ cat <<EOF > rdma-ib-33-workload.yaml
apiVersion: v1
kind: Pod
metadata:
  name: rdma-ib-33-workload
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: example-ipoibnetwork
spec:
  nodeSelector:
    kubernetes.io/hostname: nvd-srv-33.nvidia.eng.rdu2.dc.redhat.com
  containers:
  - image: quay.io/edge-infrastructure/nvidia-tools:0.1.5
    name: rdma-ib-33-workload
    securityContext:
      capabilities:
        add: [ "IPC_LOCK" ]
    resources:
      limits:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
      requests:
        nvidia.com/gpu: 1
        rdma/rdma_shared_device_ib: 1
EOF
Create the pods on the cluster by using the following commands:
$ oc create -f rdma-ib-32-workload.yaml
pod/rdma-ib-32-workload created
$ oc create -f rdma-ib-33-workload.yaml
pod/rdma-ib-33-workload created
Verify that the pods are running by using the following command:
$ oc get pods
NAME READY STATUS RESTARTS AGE
rdma-ib-32-workload 1/1 Running 0 10s
rdma-ib-33-workload 1/1 Running 0 3s
Confirm Remote Direct Memory Access (RDMA) connectivity is working between the systems, specifically for Legacy Single Root I/O Virtualization (SR-IOV) Ethernet.
Connect to each rdma-workload-client pod by using the following command:
$ oc rsh -n default rdma-sriov-32-workload
sh-5.1#
Check the IP address assigned to the first workload pod by using the following command. In this example, the first workload pod is the RDMA test server.
sh-5.1# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if3970: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
link/ether 0a:58:0a:80:02:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.128.2.167/23 brd 10.128.3.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::858:aff:fe80:2a7/64 scope link
valid_lft forever preferred_lft forever
3843: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 26:34:fd:53:a6:ec brd ff:ff:ff:ff:ff:ff
altname enp55s0f0v5
inet 192.168.4.225/28 brd 192.168.4.239 scope global net1
valid_lft forever preferred_lft forever
inet6 fe80::2434:fdff:fe53:a6ec/64 scope link
valid_lft forever preferred_lft forever
sh-5.1#
The IP address of the RDMA server is the address assigned to the net1 interface of this pod. In this example, the IP address is 192.168.4.225.
Run the ibstatus command to get the link_layer type, Ethernet or Infiniband, associated with each RDMA device mlx5_x. The output also shows the status of each RDMA device in the state field, which shows either ACTIVE or DOWN.
sh-5.1# ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:e8eb:d303:0072:1415
base lid: 0xc
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand
Infiniband device 'mlx5_2' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_3' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_4' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_5' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_6' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_7' port 1 status:
default gid: fe80:0000:0000:0000:2434:fdff:fe53:a6ec
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_8' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_9' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
sh-5.1#
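With ten devices in the listing, the active ones are easy to miss. The following is a minimal sketch that condenses ibstatus output to one line per ACTIVE device with its link layer. `active_rdma_devices` and `sample_ibstatus` are hypothetical names, and the sample is an abbreviated copy of the first three devices from the output above.

```shell
# Hypothetical helper: read ibstatus output on standard input and print
# "<device> <link_layer>" for every device whose state is ACTIVE.
active_rdma_devices() {
  awk '
    $1 == "Infiniband" && $2 == "device" { gsub("\047", "", $3); dev = $3 }
    $1 == "state:"      { state = $3 }
    $1 == "link_layer:" { if (state == "ACTIVE") print dev, $2 }
  '
}

# Abbreviated sample copied from the output above (first three devices).
sample_ibstatus() {
  cat <<'EOF'
Infiniband device 'mlx5_0' port 1 status:
state: 4: ACTIVE
phys state: 5: LinkUp
link_layer: Ethernet
Infiniband device 'mlx5_1' port 1 status:
state: 4: ACTIVE
phys state: 5: LinkUp
link_layer: InfiniBand
Infiniband device 'mlx5_2' port 1 status:
state: 1: DOWN
phys state: 3: Disabled
link_layer: Ethernet
EOF
}

sample_ibstatus | active_rdma_devices
```

Inside the workload pod, the same filter could be applied as `ibstatus | active_rdma_devices`.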
To get the link_layer for each RDMA mlx5 device on your worker node, run the ibstat command:
sh-5.1# ibstat | egrep "Port|Base|Link"
Port 1:
Physical state: LinkUp
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Port 1:
Physical state: LinkUp
Base lid: 12
Port GUID: 0xe8ebd30300721415
Link layer: InfiniBand
Port 1:
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Port 1:
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Port 1:
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Port 1:
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Port 1:
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Port 1:
Physical state: LinkUp
Base lid: 0
Port GUID: 0x2434fdfffe53a6ec
Link layer: Ethernet
Port 1:
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
Port 1:
Base lid: 0
Port GUID: 0x0000000000000000
Link layer: Ethernet
sh-5.1#
For RDMA Shared Device or Host Device workload pods, the RDMA device named mlx5_x is already known and is typically mlx5_0 or mlx5_1. For RDMA Legacy SR-IOV workload pods, you need to determine which RDMA device is associated with which Virtual Function (VF) subinterface. Obtain this information by using the following command:
sh-5.1# rdma link show
link mlx5_0/1 state ACTIVE physical_state LINK_UP
link mlx5_1/1 subnet_prefix fe80:0000:0000:0000 lid 12 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP
link mlx5_2/1 state DOWN physical_state DISABLED
link mlx5_3/1 state DOWN physical_state DISABLED
link mlx5_4/1 state DOWN physical_state DISABLED
link mlx5_5/1 state DOWN physical_state DISABLED
link mlx5_6/1 state DOWN physical_state DISABLED
link mlx5_7/1 state ACTIVE physical_state LINK_UP netdev net1
link mlx5_8/1 state DOWN physical_state DISABLED
link mlx5_9/1 state DOWN physical_state DISABLED
In this example, the RDMA device named mlx5_7 is associated with the net1 interface. This device name is used in the next command to perform the RDMA bandwidth test, which also verifies RDMA connectivity between worker nodes.
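The device lookup described above can be done with a small filter rather than by reading the listing. The following is a minimal sketch that assumes the `rdma link show` line format shown above, where the owning interface follows the `netdev` keyword. `rdma_dev_for_netdev` is a hypothetical helper name, and the sample is an abbreviated copy of that output.

```shell
# Hypothetical helper: read `rdma link show` output on standard input and
# print the RDMA device bound to the given network interface.
# $1 = interface name, for example net1
rdma_dev_for_netdev() {
  awk -v nd="$1" '
    $1 == "link" {
      for (i = 2; i < NF; i++)
        if ($i == "netdev" && $(i + 1) == nd) {
          split($2, parts, "/")   # "mlx5_7/1" is device "mlx5_7", port "1"
          print parts[1]
        }
    }'
}

# Abbreviated sample copied from the output above.
sample_rdma='link mlx5_0/1 state ACTIVE physical_state LINK_UP
link mlx5_1/1 subnet_prefix fe80:0000:0000:0000 lid 12 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP
link mlx5_6/1 state DOWN physical_state DISABLED
link mlx5_7/1 state ACTIVE physical_state LINK_UP netdev net1'

printf '%s\n' "$sample_rdma" | rdma_dev_for_netdev net1
```

Inside the workload pod, the result of `rdma link show | rdma_dev_for_netdev net1` is the value to pass to the -d switch of ib_write_bw.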
Run the following ib_write_bw RDMA bandwidth test command:
sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_7 -p 10000 --source_ip 192.168.4.225 --use_cuda=0 --use_cuda_dmabuf
where:
The mlx5_7 RDMA device is passed in the -d switch.
The --source_ip switch is set to 192.168.4.225, the source IP address on which the RDMA server starts.
The --use_cuda=0 and --use_cuda_dmabuf switches indicate the use of GPUDirect RDMA.
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
************************************
* Waiting for client to connect... *
************************************
Open another terminal window and run the oc rsh command on the second workload pod, which acts as the RDMA test client pod:
$ oc rsh -n default rdma-sriov-33-workload
sh-5.1#
Obtain the RDMA test client pod IP address from the net1 interface by using the following command:
sh-5.1# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0@if4139: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1400 qdisc noqueue state UP group default
link/ether 0a:58:0a:83:01:d5 brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.131.1.213/23 brd 10.131.1.255 scope global eth0
valid_lft forever preferred_lft forever
inet6 fe80::858:aff:fe83:1d5/64 scope link
valid_lft forever preferred_lft forever
4076: net1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
link/ether 56:6c:59:41:ae:4a brd ff:ff:ff:ff:ff:ff
altname enp55s0f0v0
inet 192.168.4.226/28 brd 192.168.4.239 scope global net1
valid_lft forever preferred_lft forever
inet6 fe80::546c:59ff:fe41:ae4a/64 scope link
valid_lft forever preferred_lft forever
sh-5.1#
Obtain the link_layer type associated with each RDMA device mlx5_x by using the following command:
sh-5.1# ibstatus
Infiniband device 'mlx5_0' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_1' port 1 status:
default gid: fe80:0000:0000:0000:e8eb:d303:0072:09f5
base lid: 0xd
sm lid: 0x1
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: InfiniBand
Infiniband device 'mlx5_2' port 1 status:
default gid: fe80:0000:0000:0000:546c:59ff:fe41:ae4a
base lid: 0x0
sm lid: 0x0
state: 4: ACTIVE
phys state: 5: LinkUp
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_3' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_4' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_5' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_6' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_7' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_8' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Infiniband device 'mlx5_9' port 1 status:
default gid: 0000:0000:0000:0000:0000:0000:0000:0000
base lid: 0x0
sm lid: 0x0
state: 1: DOWN
phys state: 3: Disabled
rate: 200 Gb/sec (4X HDR)
link_layer: Ethernet
Optional: Obtain the firmware version of the Mellanox cards by using the ibstat command:
sh-5.1# ibstat
CA 'mlx5_0'
CA type: MT4123
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0xe8ebd303007209f4
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_1'
CA type: MT4123
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0xe8ebd303007209f5
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 13
LMC: 0
SM lid: 1
Capability mask: 0xa651e848
Port GUID: 0xe8ebd303007209f5
Link layer: InfiniBand
CA 'mlx5_2'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0x566c59fffe41ae4a
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Active
Physical state: LinkUp
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x546c59fffe41ae4a
Link layer: Ethernet
CA 'mlx5_3'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0xb2ae4bfffe8f3d02
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Down
Physical state: Disabled
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_4'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0x2a9967fffe8bf272
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Down
Physical state: Disabled
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_5'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0x5aff2ffffe2e17e8
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Down
Physical state: Disabled
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_6'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0x121bf1fffe074419
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Down
Physical state: Disabled
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_7'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0xb22b16fffed03dd7
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Down
Physical state: Disabled
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_8'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0x523800fffe16d105
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Down
Physical state: Disabled
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
CA 'mlx5_9'
CA type: MT4124
Number of ports: 1
Firmware version: 20.43.1014
Hardware version: 0
Node GUID: 0xd2b4a1fffebdc4a9
System image GUID: 0xe8ebd303007209f4
Port 1:
State: Down
Physical state: Disabled
Rate: 200
Base lid: 0
LMC: 0
SM lid: 0
Capability mask: 0x00010000
Port GUID: 0x0000000000000000
Link layer: Ethernet
sh-5.1#
To determine which RDMA device is associated with the Virtual Function subinterface that the client workload pod uses, run the following command. In this example, the net1 interface is using the RDMA device mlx5_2.
sh-5.1# rdma link show
link mlx5_0/1 state ACTIVE physical_state LINK_UP
link mlx5_1/1 subnet_prefix fe80:0000:0000:0000 lid 13 sm_lid 1 lmc 0 state ACTIVE physical_state LINK_UP
link mlx5_2/1 state ACTIVE physical_state LINK_UP netdev net1
link mlx5_3/1 state DOWN physical_state DISABLED
link mlx5_4/1 state DOWN physical_state DISABLED
link mlx5_5/1 state DOWN physical_state DISABLED
link mlx5_6/1 state DOWN physical_state DISABLED
link mlx5_7/1 state DOWN physical_state DISABLED
link mlx5_8/1 state DOWN physical_state DISABLED
link mlx5_9/1 state DOWN physical_state DISABLED
sh-5.1#
Run the following ib_write_bw RDMA bandwidth test command:
sh-5.1# /root/perftest/ib_write_bw -R -T 41 -s 65536 -F -x 3 -m 4096 --report_gbits -q 16 -D 60 -d mlx5_2 -p 10000 --source_ip 192.168.4.226 --use_cuda=0 --use_cuda_dmabuf 192.168.4.225
where:
The mlx5_2 RDMA device is passed in the -d switch.
The --source_ip switch is set to the client source IP address 192.168.4.226, and the final argument, 192.168.4.225, is the destination IP address of the RDMA server.
The --use_cuda=0 and --use_cuda_dmabuf switches indicate the use of GPUDirect RDMA.
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
Requested mtu is higher than active mtu
Changing to active mtu - 3
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 61:00
Picking device No. 0
[pid = 8909, dev = 0] device name = [NVIDIA A40]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
using DMA-BUF for GPU buffer address at 0x7f8738600000 aligned at 0x7f8738600000 with aligned size 2097152
allocated GPU buffer of a 2097152 address at 0x23a7420 for type CUDA_MEM_DEVICE
Calling ibv_reg_dmabuf_mr(offset=0, size=2097152, addr=0x7f8738600000, fd=40) for QP #0
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_2
Number of qps : 16 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
TX depth : 128
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm TOS : 41
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x012d PSN 0x3cb6d7
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x012e PSN 0x90e0ac
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x012f PSN 0x153f50
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0130 PSN 0x5e0128
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0131 PSN 0xd89752
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0132 PSN 0xe5fc16
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0133 PSN 0x236787
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0134 PSN 0xd9273e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0135 PSN 0x37cfd4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0136 PSN 0x3bff8f
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0137 PSN 0x81f2bd
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0138 PSN 0x575c43
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x0139 PSN 0x6cf53d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x013a PSN 0xcaaf6f
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x013b PSN 0x346437
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
local address: LID 0000 QPN 0x013c PSN 0xcc5865
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x026d PSN 0x359409
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x026e PSN 0xe387bf
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x026f PSN 0x5be79d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0270 PSN 0x1b4b28
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0271 PSN 0x76a61b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0272 PSN 0x3d50e1
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0273 PSN 0x1b572c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0274 PSN 0x4ae1b5
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0275 PSN 0x5591b5
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0276 PSN 0xfa2593
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0277 PSN 0xd9473b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0278 PSN 0x2116b2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x0279 PSN 0x9b83b6
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x027a PSN 0xa0822b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x027b PSN 0x6d930d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x027c PSN 0xb1a4d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 10329004 0.00 180.47 0.344228
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007f8738600000
destroying current CUDA Ctx
sh-5.1#
The test is successful when the output reports the expected values for BW average and MsgRate (in Mpps).
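The result row can be checked against a threshold automatically. The following is a minimal sketch: `bw_average` and `bw_at_least` are hypothetical helper names, and the 150 Gb/sec minimum is an assumed value chosen only for illustration, because the procedure does not state what the expected bandwidth is. The sample is copied from the results table above.

```shell
# Hypothetical helper: read ib_write_bw output on standard input and print
# the "BW average[Gb/sec]" value from the row that follows the header.
bw_average() {
  awk '$1 == "#bytes" { getline; print $4; exit }'
}

# Hypothetical helper: succeed if the measured value ($1) is at least the
# minimum ($2). Both values are compared as numbers.
bw_at_least() {
  awk -v got="$1" -v min="$2" 'BEGIN { exit !(got + 0 >= min + 0) }'
}

# Sample copied from the results table above.
sample_bw='#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 10329004 0.00 180.47 0.344228'

avg=$(printf '%s\n' "$sample_bw" | bw_average)
min_gbps=150   # assumed threshold, not taken from the procedure
if bw_at_least "$avg" "$min_gbps"; then
  echo "BW average $avg Gb/sec meets the $min_gbps Gb/sec minimum"
else
  echo "BW average $avg Gb/sec is below the $min_gbps Gb/sec minimum"
fi
```

A suitable minimum depends on the link rate that ibstatus reports, which is 200 Gb/sec in this example, and on the workload.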
Upon completion of the ib_write_bw command, the server side output also appears on the server pod. See the following example:
WARNING: BW peak won't be measured in this run.
Perftest doesn't supports CUDA tests with inline messages: inline size set to 0
************************************
* Waiting for client to connect... *
************************************
Requested mtu is higher than active mtu
Changing to active mtu - 3
initializing CUDA
Listing all CUDA devices in system:
CUDA device 0: PCIe address is 61:00
Picking device No. 0
[pid = 9226, dev = 0] device name = [NVIDIA A40]
creating CUDA Ctx
making it the current CUDA Ctx
CUDA device integrated: 0
using DMA-BUF for GPU buffer address at 0x7f447a600000 aligned at 0x7f447a600000 with aligned size 2097152
allocated GPU buffer of a 2097152 address at 0x2406400 for type CUDA_MEM_DEVICE
Calling ibv_reg_dmabuf_mr(offset=0, size=2097152, addr=0x7f447a600000, fd=40) for QP #0
---------------------------------------------------------------------------------------
RDMA_Write BW Test
Dual-port : OFF Device : mlx5_7
Number of qps : 16 Transport type : IB
Connection type : RC Using SRQ : OFF
PCIe relax order: ON Lock-free : OFF
ibv_wr* API : ON Using DDP : OFF
CQ Moderation : 1
CQE Poll Batch : 16
Mtu : 1024[B]
Link type : Ethernet
GID index : 3
Max inline data : 0[B]
rdma_cm QPs : ON
Data ex. method : rdma_cm TOS : 41
---------------------------------------------------------------------------------------
Waiting for client rdma_cm QP to connect
Please run the same command with the IB/RoCE interface IP
---------------------------------------------------------------------------------------
local address: LID 0000 QPN 0x026d PSN 0x359409
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x026e PSN 0xe387bf
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x026f PSN 0x5be79d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0270 PSN 0x1b4b28
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0271 PSN 0x76a61b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0272 PSN 0x3d50e1
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0273 PSN 0x1b572c
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0274 PSN 0x4ae1b5
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0275 PSN 0x5591b5
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0276 PSN 0xfa2593
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0277 PSN 0xd9473b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0278 PSN 0x2116b2
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x0279 PSN 0x9b83b6
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x027a PSN 0xa0822b
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x027b PSN 0x6d930d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
local address: LID 0000 QPN 0x027c PSN 0xb1a4d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:225
remote address: LID 0000 QPN 0x012d PSN 0x3cb6d7
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x012e PSN 0x90e0ac
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x012f PSN 0x153f50
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0130 PSN 0x5e0128
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0131 PSN 0xd89752
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0132 PSN 0xe5fc16
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0133 PSN 0x236787
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0134 PSN 0xd9273e
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0135 PSN 0x37cfd4
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0136 PSN 0x3bff8f
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0137 PSN 0x81f2bd
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0138 PSN 0x575c43
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x0139 PSN 0x6cf53d
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x013a PSN 0xcaaf6f
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x013b PSN 0x346437
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
remote address: LID 0000 QPN 0x013c PSN 0xcc5865
GID: 00:00:00:00:00:00:00:00:00:00:255:255:192:168:04:226
---------------------------------------------------------------------------------------
#bytes #iterations BW peak[Gb/sec] BW average[Gb/sec] MsgRate[Mpps]
65536 10329004 0.00 180.47 0.344228
---------------------------------------------------------------------------------------
deallocating GPU buffer 00007f447a600000
destroying current CUDA Ctx