Sysctl settings are exposed via Kubernetes, allowing users to modify certain kernel parameters at runtime for namespaces within a container. Only namespaced sysctls can be set independently on pods. A sysctl that is not namespaced, called node-level, cannot be set within OpenShift Container Platform. Moreover, only sysctls considered safe are whitelisted by default; a cluster administrator can manually enable other, unsafe sysctls on a node to make them available to users.
In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems, such as:
kernel (common prefix: kernel.)
networking (common prefix: net.)
virtual memory (common prefix: vm.)
MDADM (common prefix: dev.)
More subsystems are described in Kernel documentation. To get a list of all parameters, you can run:
$ sudo sysctl -a
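Because each parameter is a file under /proc/sys/, a sysctl name can be translated into its path by replacing the dots with slashes. The following is a minimal sketch of that mapping; sysctl_to_path is a hypothetical helper name, and the simple substitution assumes the name contains no dots other than separators (some network interface names do).

```shell
# sysctl_to_path is a hypothetical helper, not part of the sysctl tool.
sysctl_to_path() {
    # Replace every "." in the name with "/" and prepend the /proc/sys root.
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

path=$(sysctl_to_path kernel.shm_rmid_forced)
echo "$path"

# Print the current value only if the file is readable on this host.
if [ -r "$path" ]; then
    cat "$path"
fi
```

Reading the file this way returns the same value that sysctl reports for the parameter.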
A number of sysctls are namespaced in today’s Linux kernels. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.
The following sysctls are known to be namespaced:
kernel.shm*
kernel.msg*
kernel.sem
fs.mqueue.*
Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.
To check which net.* sysctls are namespaced on your system, run the following command:
$ podman run --rm -ti docker.io/fedora \
/bin/sh -c "dnf install -y findutils && find /proc/sys/ \
| grep -e /proc/sys/net"
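The name patterns listed above can also be checked mechanically. The sketch below classifies a sysctl name against them; classify_sysctl is a hypothetical helper, and because only most net.* sysctls are namespaced, it reports that group as "possibly namespaced" rather than giving a firm answer.

```shell
# classify_sysctl is a hypothetical helper that only matches names against
# the patterns listed in this section; it does not inspect the kernel.
classify_sysctl() {
    case "$1" in
        kernel.shm*|kernel.msg*|kernel.sem|fs.mqueue.*)
            echo "namespaced" ;;
        net.*)
            # Depends on the kernel version and distributor.
            echo "possibly namespaced" ;;
        *)
            echo "not in the known list" ;;
    esac
}

classify_sysctl kernel.msgmax
classify_sysctl net.ipv4.tcp_syncookies
classify_sysctl vm.swappiness
```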
Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either through the underlying Linux distribution of the nodes, for example by modifying the /etc/sysctl.conf file, or by using a DaemonSet with privileged containers.
Consider marking nodes with special sysctls as tainted. Only schedule pods onto them that need those sysctl settings. Use the taints and toleration feature to mark the nodes.
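As an illustration of tainting such a node, the sketch below builds an oc adm taint command and prints it instead of running it, so it can be reviewed first. The node name node1 and the taint key and value sysctl=custom are hypothetical placeholders, and taint_cmd is a hypothetical helper.

```shell
# taint_cmd is a hypothetical helper that prints the command, it does not run it.
# $1 = node name, $2 = taint key, $3 = taint value (all placeholders here).
taint_cmd() {
    printf 'oc adm taint nodes %s %s=%s:NoSchedule\n' "$1" "$2" "$3"
}

taint_cmd node1 sysctl custom
```

Pods that need the special sysctl settings would then carry a matching toleration in their spec so that they can still be scheduled onto the tainted node.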
Sysctls are grouped into safe and unsafe sysctls. In addition to proper namespacing, a safe sysctl must be properly isolated between pods on the same node. This means that setting a safe sysctl for one pod must not:
Influence any other pod on the node
Harm the node’s health
Gain CPU or memory resources outside of the resource limits of a pod
Most namespaced sysctls are not necessarily considered safe.
Currently, OpenShift Container Platform supports, or whitelists, the following sysctls in the safe set:
kernel.shm_rmid_forced
net.ipv4.ip_local_port_range
net.ipv4.tcp_syncookies
This list might be extended in future versions when the kubelet supports better isolation mechanisms.
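A quick way to tell whether a sysctl needs administrator action is to compare its name against the safe set listed above. This is a minimal sketch that hard-codes the three names from this section; is_safe_sysctl is a hypothetical helper, and the list would need updating if the safe set changes in a later version.

```shell
# is_safe_sysctl is a hypothetical helper; the names are copied from the
# safe set listed in this section.
is_safe_sysctl() {
    case "$1" in
        kernel.shm_rmid_forced|net.ipv4.ip_local_port_range|net.ipv4.tcp_syncookies)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

for name in kernel.shm_rmid_forced kernel.msgmax; do
    if is_safe_sysctl "$name"; then
        echo "$name: safe"
    else
        echo "$name: not in the safe set"
    fi
done
```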
All safe sysctls are enabled by default. All unsafe sysctls are disabled by default, and the cluster administrator must manually enable them on a per-node basis. Pods with disabled unsafe sysctls will be scheduled but will fail to launch.
The cluster administrator can allow certain unsafe sysctls for very special situations such as high performance or real-time application tuning.
If you want to use unsafe sysctls, cluster administrators must enable them individually on nodes. They can enable only namespaced sysctls.
You can further control which sysctls can be set in pods by specifying lists of sysctls or sysctl patterns in the forbiddenSysctls and allowedUnsafeSysctls fields of the Security Context Constraints.
The forbiddenSysctls option excludes specific sysctls.
The allowedUnsafeSysctls option controls specific needs such as high performance or real-time application tuning.
Because they are unsafe by nature, you use unsafe sysctls at your own risk. They can lead to severe problems such as incorrect container behavior, resource shortage, or complete breakage of a node.
Add the unsafe sysctls to the kubeletArguments parameter of the appropriate node configuration map file, as described in Configuring Node Resources:
For example:
$ oc edit cm node-config-compute -n openshift-node
...
kubeletArguments:
  allowed-unsafe-sysctls: (1)
    - "net.ipv4.tcp_keepalive_time"
    - "net.ipv4.tcp_keepalive_intvl"
    - "net.ipv4.tcp_keepalive_probes"
(1) Add the unsafe sysctls you want to use.
Create a new SCC that uses the contents of the restricted SCC and add the unsafe sysctls:
...
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
allowPrivilegeEscalation: true
allowPrivilegedContainer: false
allowedCapabilities: null
allowedUnsafeSysctls: (1)
- net.ipv4.tcp_keepalive_time
- net.ipv4.tcp_keepalive_intvl
- net.ipv4.tcp_keepalive_probes
...
metadata:
  name: restricted-sysctls (2)
...
(1) Add the unsafe sysctls you want to use.
(2) Specify a new name for the SCC.
Grant the new SCC access to your pod ServiceAccount:
$ oc adm policy add-scc-to-user restricted-sysctls -z default -n your_project_name
Add the unsafe sysctls to the DeploymentConfig for your pods.
kind: DeploymentConfig
...
  template:
    ...
    spec:
      containers:
      ...
      securityContext:
        sysctls:
        - name: net.ipv4.tcp_keepalive_time
          value: "300"
        - name: net.ipv4.tcp_keepalive_intvl
          value: "20"
        - name: net.ipv4.tcp_keepalive_probes
          value: "3"
Restart the node service to apply the changes:
# systemctl restart atomic-openshift-node
Sysctls are set on pods using the pod’s securityContext. The securityContext applies to all containers in the same pod.
The following example uses the pod securityContext to set a safe sysctl, kernel.shm_rmid_forced, and two unsafe sysctls, net.ipv4.route.min_pmtu and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.
Modify the YAML file that defines the pod and add the securityContext spec, as shown in the following example:
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.ipv4.route.min_pmtu
      value: "552"
    - name: kernel.msgmax
      value: "65536"
  ...
A pod with the unsafe sysctls specified above fails to launch on any node where the administrator has not explicitly enabled those two unsafe sysctls. As with node-level sysctls, use the taints and toleration feature or labels on nodes to schedule those pods onto the right nodes.
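When several pods need the same settings, the securityContext stanza can be generated instead of typed by hand. The following is a rough sketch only: sysctls_stanza is a hypothetical helper, it takes name=value arguments, and it emits every value as a quoted string, as in the example pod definition.

```shell
# sysctls_stanza is a hypothetical helper, not an oc or kubectl command.
# It prints a pod-level securityContext stanza from name=value arguments.
sysctls_stanza() {
    printf 'securityContext:\n  sysctls:\n'
    for pair in "$@"; do
        # Text before the first "=" is the name; the rest is the value.
        printf '  - name: %s\n    value: "%s"\n' "${pair%%=*}" "${pair#*=}"
    done
}

sysctls_stanza kernel.shm_rmid_forced=0 net.ipv4.route.min_pmtu=552 kernel.msgmax=65536
```

The output still has to be indented to fit under the spec key of the pod definition.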