$ sudo sysctl -a
You can configure namespaced sysctls to manage kernel parameters for specific pods or network interfaces. By enabling safe or unsafe namespaced sysctls, you can fine-tune your environment’s performance and networking behavior at the pod level.
Only sysctls that are namespaced can be set independently on pods. If a sysctl is not namespaced, it is known as a node-level sysctl. You must use another method of setting the sysctl, such as by using the Node Tuning Operator. To set node-level sysctls, see "Using the Node Tuning Operator".
Network sysctls are a special category of sysctl. Network sysctls include:
System-wide sysctls, for example net.ipv4.ip_local_port_range, that are valid for all networking. You can set these independently for each pod on a node.
Interface-specific sysctls, for example net.ipv4.conf.IFNAME.accept_local, that only apply to a specific additional network interface for a given pod. You can set these independently for each additional network configuration. You set these by using a configuration in the tuning-cni after the network interfaces are created.
Only those sysctls considered safe are enabled by default. A cluster administrator can manually enable unsafe sysctls on the node to be available to the user.
The Linux sysctl interface allows you to modify kernel parameters at runtime to manage subsystems such as networking, virtual memory, and MDADM. By accessing the sysctl interface, you can view and adjust system configurations without rebooting the operating system.
You can modify the following subsystems by using sysctls:
kernel (common prefix: kernel.)
networking (common prefix: net.)
virtual memory (common prefix: vm.)
MDADM (common prefix: dev.)
Refer to the Kernel.org documentation for more information on the subsystems you can manage. You can get a list of all parameters by running the following command:
$ sudo sysctl -a
In a OKD cluster, you can use namespaced sysctls, which apply to specific pods on a node, or node-level sysctls, which affect an entire node.
Some sysctls are namespaced in the Linux kernels. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.
The following sysctls are known to be namespaced:
kernel.shm*
kernel.msg*
kernel.sem
fs.mqueue.*
Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.
Node-level sysctls are not namespaced and must be set
manually by a cluster administrator, either by using of the underlying Linux
distribution of the nodes, such as by modifying the /etc/sysctls.conf file,
or by using a daemon set with privileged containers. You can also use the Node Tuning Operator to set node-level sysctls.
|
Consider marking nodes with special sysctls as tainted. Only schedule pods onto them that need those sysctl settings. Use the taints and toleration feature to mark the nodes. |
In a OKD cluster, you can use safe or unsafe sysctls.
For system-wide sysctls to be considered safe, they must be namespaced. A namespaced sysctl ensures there is isolation between namespaces and therefore pods. If you set a sysctl for one pod it must not take any of the following actions:
Influence on any other pod on the node
Harm the node health
Gain CPU or memory resources outside of the resource limits of a pod
|
Being namespaced alone is not sufficient for the sysctl to be considered safe. |
Any sysctl that is not added to the allowed list on OKD is considered unsafe for OKD.
Unsafe sysctls are not allowed by default. For system-wide sysctls, a cluster administrator must manually enable them on a per-node basis. Pods with disabled unsafe sysctls are scheduled but do not launch.
|
You cannot manually enable interface-specific unsafe sysctls. |
OKD adds the following system-wide and interface-specific safe sysctls to an allowed safe list:
| sysctl | Description |
|---|---|
|
When set to |
|
Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first port number, and the second number is the last local port number. If possible, it is better if these numbers have different parity (one even and one odd value). They must be greater than or equal to |
|
When |
|
Restricts |
|
Defines the first unprivileged port in the network namespace. To disable all privileged ports, set to |
|
Specifies a range of comma-separated local ports that you want to reserve for applications or services. |
|
Specifies the interval in seconds before the first |
|
Specifies the time in seconds that a connection remains in the |
|
Specifies the interval in seconds between the |
|
Specifies how many |
| sysctl | Description |
|---|---|
|
Accepts IPv4 ICMP redirect messages. |
|
Accepts IPv4 packets with strict source route (SRR) option. |
|
Defines the behavior for gratuitous ARP frames with an IPv4 address that is not already present in the ARP table:
|
|
Defines the mode for notification of IPv4 address and device changes. |
|
Disables IPSEC policy (SPD) for this IPv4 interface. |
|
Accepts ICMP redirect messages only to gateways listed in the interface’s current gateway list. |
|
Sends redirects only if the node acts as a router. That is, a host should not send an ICMP redirect message. It is used by routers to notify the host about a better routing path that is available for a particular destination. |
|
Accepts IPv6 Router advertisements; autoconfigure using them. It also determines whether or not to transmit router solicitations. Router solicitations are transmitted only if the functional setting is to accept router advertisements. |
|
Accepts IPv6 ICMP redirect messages. |
|
Accepts IPv6 packets with SRR option. |
|
Defines the behavior for gratuitous ARP frames with an IPv6 address that is not already present in the ARP table:
|
|
Defines the mode for notification of IPv6 address and device changes. |
|
Controls the hardware address to IP mapping lifetime in the neighbor table for IPv6. |
|
Sets the retransmit timer for neighbor discovery messages. |
|
When setting these values using the |
You can modify the default list of safe interface-specific sysctls by updating the cni-sysctl-allowlist in the openshift-multus namespace.
|
The support for updating the interface-specific safe sysctls list is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process. For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope. |
For example, the following procedure modifies the predefined list of safe sysctls to add sysctls that allow users to enforce stricter reverse path forwarding for IPv4. For more information on reverse path forwarding see Reverse Path Forwarding.
View the existing predefined list by running the following command:
$ oc get cm -n openshift-multus cni-sysctl-allowlist -oyaml
apiVersion: v1
data:
allowlist.conf: |-
^net.ipv4.conf.IFNAME.accept_redirects$
^net.ipv4.conf.IFNAME.accept_source_route$
^net.ipv4.conf.IFNAME.arp_accept$
^net.ipv4.conf.IFNAME.arp_notify$
^net.ipv4.conf.IFNAME.disable_policy$
^net.ipv4.conf.IFNAME.secure_redirects$
^net.ipv4.conf.IFNAME.send_redirects$
^net.ipv6.conf.IFNAME.accept_ra$
^net.ipv6.conf.IFNAME.accept_redirects$
^net.ipv6.conf.IFNAME.accept_source_route$
^net.ipv6.conf.IFNAME.arp_accept$
^net.ipv6.conf.IFNAME.arp_notify$
^net.ipv6.neigh.IFNAME.base_reachable_time_ms$
^net.ipv6.neigh.IFNAME.retrans_time_ms$
kind: ConfigMap
metadata:
annotations:
kubernetes.io/description: |
Sysctl allowlist for nodes.
release.openshift.io/version: 4.0-0.nightly-2022-11-16-003434
creationTimestamp: "2022-11-17T14:09:27Z"
name: cni-sysctl-allowlist
namespace: openshift-multus
resourceVersion: "2422"
uid: 96d138a3-160e-4943-90ff-6108fa7c50c3
Edit the list by using the following command:
$ oc edit cm -n openshift-multus cni-sysctl-allowlist -oyaml
Add the ^net.ipv4.conf.IFNAME.rp_filter$ and ^net.ipv6.conf.IFNAME.rp_filter$ fields to the list of parameters to allow users to implement stricter reverse path forwarding.
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
allowlist.conf: |-
^net.ipv4.conf.IFNAME.accept_redirects$
^net.ipv4.conf.IFNAME.accept_source_route$
^net.ipv4.conf.IFNAME.arp_accept$
^net.ipv4.conf.IFNAME.arp_notify$
^net.ipv4.conf.IFNAME.disable_policy$
^net.ipv4.conf.IFNAME.secure_redirects$
^net.ipv4.conf.IFNAME.send_redirects$
^net.ipv4.conf.IFNAME.rp_filter$
^net.ipv6.conf.IFNAME.accept_ra$
^net.ipv6.conf.IFNAME.accept_redirects$
^net.ipv6.conf.IFNAME.accept_source_route$
^net.ipv6.conf.IFNAME.arp_accept$
^net.ipv6.conf.IFNAME.arp_notify$
^net.ipv6.neigh.IFNAME.base_reachable_time_ms$
^net.ipv6.neigh.IFNAME.retrans_time_ms$
^net.ipv6.conf.IFNAME.rp_filter$
Save the changes to the file and exit.
|
The removal of |
Create a network attachment definition, such as reverse-path-fwd-example.yaml, with the following content:
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
name: tuningnad
namespace: default
spec:
config: '{
"cniVersion": "0.4.0",
"name": "tuningnad",
"plugins": [{
"type": "bridge"
},
{
"type": "tuning",
"sysctl": {
"net.ipv4.conf.IFNAME.rp_filter": "1"
}
}
]
}'
Apply the yaml by running the following command:
$ oc apply -f reverse-path-fwd-example.yaml
networkattachmentdefinition.k8.cni.cncf.io/tuningnad created
Create a pod such as examplepod.yaml using the following YAML:
apiVersion: v1
kind: Pod
metadata:
name: example
labels:
app: httpd
namespace: default
annotations:
k8s.v1.cni.cncf.io/networks: tuningnad
spec:
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
containers:
- name: httpd
image: 'image-registry.openshift-image-registry.svc:5000/openshift/httpd:latest'
ports:
- containerPort: 8080
securityContext:
allowPrivilegeEscalation: false
capabilities:
drop:
- ALL
where:
metadata.annotationsSpecifies the name of the configured NetworkAttachmentDefinition.
Apply the YAML by running the following command:
$ oc apply -f examplepod.yaml
Verify that the pod is created by running the following command:
$ oc get pod
NAME READY STATUS RESTARTS AGE
example 1/1 Running 0 47s
Log in to the pod by running the following command:
$ oc rsh example
Verify the value of the configured sysctl flag. For example, find the value net.ipv4.conf.net1.rp_filter by running the following command:
sh-4.4# sysctl net.ipv4.conf.net1.rp_filter
net.ipv4.conf.net1.rp_filter = 1
You can modify kernel parameters for all containers in a pod by adding the sysctls parameter to the securityContext parameter in a pod spec.
Safe sysctls are allowed by default.
This example uses the pod securityContext to set the following safe sysctls:
kernel.shm_rmid_forced
net.ipv4.ip_local_port_range
net.ipv4.tcp_syncookies
net.ipv4.ping_group_range
|
To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects. |
The following procedure shows how to start a pod with the configured sysctl settings.
|
In most cases you modify an existing pod definition and add the |
Create a YAML file sysctl_pod.yaml that defines an example pod and add the securityContext spec, as shown in the following example:
apiVersion: v1
kind: Pod
metadata:
name: sysctl-example
namespace: default
spec:
containers:
- name: podexample
image: centos
command: ["bin/bash", "-c", "sleep INF"]
securityContext:
runAsUser: 2000
runAsGroup: 3000
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
sysctls:
- name: kernel.shm_rmid_forced
value: "1"
- name: net.ipv4.ip_local_port_range
value: "32770 60666"
- name: net.ipv4.tcp_syncookies
value: "0"
- name: net.ipv4.ping_group_range
value: "0 200000000"
where:
spec.containers.securityContext.runAsUserSpecifies which user ID the container is run with.
spec.containers.securityContext.runAsGroupSpecifies which primary group ID the containers is run with.
spec.containers.securityContext.allowPrivilegeEscalationSpecifies whether a pod can request privilege escalation. The default is true. This boolean directly controls whether the no_new_privs flag gets set on the container process.
spec.containers.securityContext.capabilitiesSpecifies permitted privileged actions without giving full root access. This policy ensures all capabilities are dropped from the pod.
spec.securityContext.runAsNonRoot: trueSpecifies that the container runs with a user with any UID other than 0.
spec.securityContext.seccompProfile.type: RuntimeDefaultSpecifies that the default seccomp profile is enabled for a pod or container workload.
Create the pod by running the following command:
$ oc apply -f sysctl_pod.yaml
Check that the pod is created by running the following command:
$ oc get pod
NAME READY STATUS RESTARTS AGE
sysctl-example 1/1 Running 0 14s
Log in to the pod by running the following command:
$ oc rsh sysctl-example
Verify the values of the configured sysctl flags. For example, find the value kernel.shm_rmid_forced by running the following command:
sh-4.4# sysctl kernel.shm_rmid_forced
kernel.shm_rmid_forced = 1
You can run a pod that is configured to use unsafe sysctls on a node where a cluster administrator explicitly enabled unsafe sysctls. You might use unsafe sysctls for situations such as high performance or real-time application tuning.
You can use the taints and toleration feature or labels on nodes to schedule those pods onto the right nodes.
The following example uses the pod securityContext to set a safe sysctl kernel.shm_rmid_forced and two unsafe sysctls, net.core.somaxconn and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
|
To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects. |
The following example illustrates what happens when you add safe and unsafe sysctls to a pod specification:
Create a YAML file sysctl-example-unsafe.yaml that defines an example pod and add the securityContext specification, as shown in the following example:
apiVersion: v1
kind: Pod
metadata:
name: sysctl-example-unsafe
spec:
containers:
- name: podexample
image: centos
command: ["bin/bash", "-c", "sleep INF"]
securityContext:
runAsUser: 2000
runAsGroup: 3000
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
sysctls:
- name: kernel.shm_rmid_forced
value: "0"
- name: net.core.somaxconn
value: "1024"
- name: kernel.msgmax
value: "65536"
Create the pod by using the following command:
$ oc apply -f sysctl-example-unsafe.yaml
Verify that the pod is scheduled but does not deploy because unsafe sysctls are not allowed for the node using the following command:
$ oc get pod
NAME READY STATUS RESTARTS AGE
sysctl-example-unsafe 0/1 SysctlForbidden 0 14s
As a cluster administrator, you can allow certain unsafe sysctls for very special situations such as high performance or real-time application tuning.
If you want to use unsafe sysctls, a cluster administrator must enable them individually for a specific type of node. The sysctls must be namespaced.
You can further control which sysctls are set in pods by specifying lists of sysctls or sysctl patterns in the allowedUnsafeSysctls field of the Security Context Constraints.
The allowedUnsafeSysctls option controls specific needs such as high performance or real-time application tuning.
|
Because these sysctls are considered unsafe, the use of unsafe sysctls is at-your-own-risk and can lead to severe problems, such as improper behavior of containers, resource shortage, or node breakage. |
List existing MachineConfig objects for your OKD cluster to decide how to label your machine config by running the following command:
$ oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-bfb92f0cd1684e54d8e234ab7423cc96 True False False 3 3 3 0 42m
worker rendered-worker-21b6cb9a0f8919c88caf39db80ac1fce True False False 3 3 3 0 42m
Add a label to the machine config pool where the containers with the unsafe sysctls are to run by running the following command:
$ oc label machineconfigpool worker custom-kubelet=sysctl
Create a YAML file that defines a KubeletConfig custom resource (CR):
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: custom-kubelet
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: sysctl
kubeletConfig:
allowedUnsafeSysctls:
- "kernel.msg*"
- "net.core.somaxconn"
where:
spec.machineConfigPoolSelector.matchLabelsSpecifies the label from the machine config pool.
spec.kubeletConfig.allowedUnsafeSysctlsSpecifies a list of unsafe sysctls to allow.
Create the object by running the following command:
$ oc apply -f set-sysctl-worker.yaml
Wait for the Machine Config Operator to generate the new rendered configuration and apply it to the machines by running the following command:
$ oc get machineconfigpool worker -w
After some minutes the UPDATING status changes from True to False:
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-f1704a00fc6f30d3a7de9a15fd68a800 False True False 3 2 2 0 71m
worker rendered-worker-f1704a00fc6f30d3a7de9a15fd68a800 False True False 3 2 3 0 72m
worker rendered-worker-0188658afe1f3a183ec8c4f14186f4d5 True False False 3 3 3 0 72m
Create a YAML file that defines a pod and add the securityContext spec, as shown in the following example:
apiVersion: v1
kind: Pod
metadata:
name: sysctl-example-safe-unsafe
spec:
containers:
- name: podexample
image: centos
command: ["bin/bash", "-c", "sleep INF"]
securityContext:
runAsUser: 2000
runAsGroup: 3000
allowPrivilegeEscalation: false
capabilities:
drop: ["ALL"]
securityContext:
runAsNonRoot: true
seccompProfile:
type: RuntimeDefault
sysctls:
- name: kernel.shm_rmid_forced
value: "0"
- name: net.core.somaxconn
value: "1024"
- name: kernel.msgmax
value: "65536"
Create the pod by running the following command:
$ oc apply -f sysctl-example-safe-unsafe.yaml
Warning: would violate PodSecurity "restricted:latest": forbidden sysctls (net.core.somaxconn, kernel.msgmax)
pod/sysctl-example-safe-unsafe created
Verify that the pod is created by running the following command:
$ oc get pod
NAME READY STATUS RESTARTS AGE
sysctl-example-safe-unsafe 1/1 Running 0 19s
Log in to the pod by running the following command:
$ oc rsh sysctl-example-safe-unsafe
Verify the values of the configured sysctl flags. For example, find the value of the net.core.somaxconn parameter by running the following command:
sh-4.4# sysctl net.core.somaxconn
net.core.somaxconn = 1024
The unsafe sysctl is now allowed and the value is set as defined in the securityContext spec of the updated pod specification.