Configuring cluster memory to meet container memory and risk requirements - Working with clusters | Nodes

Understanding managing application memory
- Managing application memory strategy
Understanding OpenJDK settings for OpenShift Dedicated
Finding the memory request and limit from within a pod
Understanding OOM kill policy
Understanding pod eviction

As a cluster administrator, you can help your clusters operate efficiently through managing application memory by:

Determining the memory and risk requirements of a containerized application component and configuring the container memory parameters to suit those requirements.
Configuring containerized application runtimes (for example, OpenJDK) to adhere optimally to the configured container memory parameters.
Diagnosing and resolving memory-related error conditions associated with running in a container.

Understanding managing application memory

It is recommended to fully read the overview of how OpenShift Dedicated manages Compute Resources before proceeding.

For each kind of resource (memory, CPU, storage), OpenShift Dedicated allows optional request and limit values to be placed on each container in a pod.

Note the following about memory requests and memory limits:

Memory request
- The memory request value, if specified, influences the OpenShift Dedicated scheduler. The scheduler considers the memory request when scheduling a container to a node, then fences off the requested memory on the chosen node for the use of the container.
- If a node’s memory is exhausted, OpenShift Dedicated prioritizes evicting its containers whose memory usage most exceeds their memory request. In serious cases of memory exhaustion, the node OOM killer may select and kill a process in a container based on a similar metric.
- The cluster administrator can assign quota or assign default values for the memory request value.
- The cluster administrator can override the memory request values that a developer specifies, to manage cluster overcommit.
Memory limit
- The memory limit value, if specified, provides a hard limit on the memory that can be allocated across all the processes in a container.
- If the memory allocated by all of the processes in a container exceeds the memory limit, the node Out of Memory (OOM) killer will immediately select and kill a process in the container.
- If both memory request and limit are specified, the memory limit value must be greater than or equal to the memory request.
- The cluster administrator can assign quota or assign default values for the memory limit value.
- The minimum memory limit is 12 MB. If a container fails to start due to a Cannot allocate memory pod event, the memory limit is too low. Either increase or remove the memory limit. Removing the limit allows pods to consume unbounded node resources.

Managing application memory strategy

The steps for sizing application memory on OpenShift Dedicated are as follows:

Determine expected container memory usage

Determine expected mean and peak container memory usage, empirically if necessary (for example, by separate load testing). Remember to consider all the processes that may potentially run in parallel in the container: for example, does the main application spawn any ancillary scripts?
Determine risk appetite

Determine risk appetite for eviction. If the risk appetite is low, the container should request memory according to the expected peak usage plus a percentage safety margin. If the risk appetite is higher, it may be more appropriate to request memory according to the expected mean usage.
Set container memory request

Set container memory request based on the above. The more accurately the request represents the application memory usage, the better. If the request is too high, cluster and quota usage will be inefficient. If the request is too low, the chances of application eviction increase.
Set container memory limit, if required

Set container memory limit, if required. Setting a limit has the effect of immediately killing a container process if the combined memory usage of all processes in the container exceeds the limit, and is therefore a mixed blessing. On the one hand, it may make unanticipated excess memory usage obvious early ("fail fast"); on the other hand it also terminates processes abruptly.

Note that some OpenShift Dedicated clusters may require a limit value to be set; some may override the request based on the limit; and some application images rely on a limit value being set as this is easier to detect than a request value.

If the memory limit is set, it should not be set to less than the expected peak container memory usage plus a percentage safety margin.
Ensure application is tuned

Ensure application is tuned with respect to configured request and limit values, if appropriate. This step is particularly relevant to applications which pool memory, such as the JVM. The rest of this page discusses this.

Understanding OpenJDK settings for OpenShift Dedicated

The default OpenJDK settings do not work well with containerized environments. As a result, some additional Java memory settings must always be provided whenever running the OpenJDK in a container.

The JVM memory layout is complex, version dependent, and describing it in detail is beyond the scope of this documentation. However, as a starting point for running OpenJDK in a container, at least the following three memory-related tasks are key:

Overriding the JVM maximum heap size.
Encouraging the JVM to release unused memory to the operating system, if appropriate.
Ensuring all JVM processes within a container are appropriately configured.

Optimally tuning JVM workloads for running in a container is beyond the scope of this documentation, and may involve setting multiple additional JVM options.

Understanding how to override the JVM maximum heap size

For many Java workloads, the JVM heap is the largest single consumer of memory. Currently, the OpenJDK defaults to allowing up to 1/4 (1/-XX:MaxRAMFraction) of the compute node’s memory to be used for the heap, regardless of whether the OpenJDK is running in a container or not. It is therefore essential to override this behavior, especially if a container memory limit is also set.

There are at least two ways the above can be achieved:

If the container memory limit is set and the experimental options are supported by the JVM, set -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap.

The UseCGroupMemoryLimitForHeap option has been removed in JDK 11. Use -XX:+UseContainerSupport instead.

This sets -XX:MaxRAM to the container memory limit, and the maximum heap size (-XX:MaxHeapSize / -Xmx) to 1/-XX:MaxRAMFraction (1/4 by default).
Directly override one of -XX:MaxRAM, -XX:MaxHeapSize or -Xmx.

This option involves hard-coding a value, but has the advantage of allowing a safety margin to be calculated.

Understanding how to encourage the JVM to release unused memory to the operating system

By default, the OpenJDK does not aggressively return unused memory to the operating system. This may be appropriate for many containerized Java workloads, but notable exceptions include workloads where additional active processes co-exist with a JVM within a container, whether those additional processes are native, additional JVMs, or a combination of the two.

Java-based agents can use the following JVM arguments to encourage the JVM to release unused memory to the operating system:

-XX:+UseParallelGC
-XX:MinHeapFreeRatio=5 -XX:MaxHeapFreeRatio=10 -XX:GCTimeRatio=4
-XX:AdaptiveSizePolicyWeight=90.

These arguments are intended to return heap memory to the operating system whenever allocated memory exceeds 110% of in-use memory (-XX:MaxHeapFreeRatio), spending up to 20% of CPU time in the garbage collector (-XX:GCTimeRatio). At no time will the application heap allocation be less than the initial heap allocation (overridden by -XX:InitialHeapSize / -Xms). Detailed additional information is available Tuning Java’s footprint in OpenShift (Part 1), Tuning Java’s footprint in OpenShift (Part 2), and at OpenJDK and Containers.

Understanding how to ensure all JVM processes within a container are appropriately configured

In the case that multiple JVMs run in the same container, it is essential to ensure that they are all configured appropriately. For many workloads it will be necessary to grant each JVM a percentage memory budget, leaving a perhaps substantial additional safety margin.

Many Java tools use different environment variables (JAVA_OPTS, GRADLE_OPTS, and so on) to configure their JVMs and it can be challenging to ensure that the right settings are being passed to the right JVM.

The JAVA_TOOL_OPTIONS environment variable is always respected by the OpenJDK, and values specified in JAVA_TOOL_OPTIONS will be overridden by other options specified on the JVM command line. By default, to ensure that these options are used by default for all JVM workloads run in the Java-based agent image, the OpenShift Dedicated Jenkins Maven agent image sets:

JAVA_TOOL_OPTIONS="-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap -Dsun.zip.disableMemoryMapping=true"

The UseCGroupMemoryLimitForHeap option has been removed in JDK 11. Use -XX:+UseContainerSupport instead.

This does not guarantee that additional options are not required, but is intended to be a helpful starting point.

Finding the memory request and limit from within a pod

An application wishing to dynamically discover its memory request and limit from within a pod should use the Downward API.

Procedure

Configure the pod to add the MEMORY_REQUEST and MEMORY_LIMIT stanzas:

Create a YAML file similar to the following:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: test
    image: fedora:latest
    command:
    - sleep
    - "3600"
    env:
    - name: MEMORY_REQUEST (1)
      valueFrom:
        resourceFieldRef:
          containerName: test
          resource: requests.memory
    - name: MEMORY_LIMIT (2)
      valueFrom:
        resourceFieldRef:
          containerName: test
          resource: limits.memory
    resources:
      requests:
        memory: 384Mi
      limits:
        memory: 512Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: [ALL]

1	Add this stanza to discover the application memory request value.
2	Add this stanza to discover the application memory limit value.

Create the pod by running the following command:
```
$ oc create -f <file_name>.yaml
```

Verification

Access the pod using a remote shell:
```
$ oc rsh test
```

Check that the requested values were applied:

$ env | grep MEMORY | sort

Example output

MEMORY_LIMIT=536870912
MEMORY_REQUEST=402653184

The memory limit value can also be read from inside the container by the /sys/fs/cgroup/memory/memory.limit_in_bytes file.

Understanding OOM kill policy

OpenShift Dedicated can kill a process in a container if the total memory usage of all the processes in the container exceeds the memory limit, or in serious cases of node memory exhaustion.

When a process is Out of Memory (OOM) killed, this might result in the container exiting immediately. If the container PID 1 process receives the SIGKILL, the container will exit immediately. Otherwise, the container behavior is dependent on the behavior of the other processes.

For example, a container process exited with code 137, indicating it received a SIGKILL signal.

If the container does not exit immediately, an OOM kill is detectable as follows:

Access the pod using a remote shell:
```
# oc rsh test
```
Run the following command to see the current OOM kill count in /sys/fs/cgroup/memory/memory.oom_control:
```
$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
```
Example output
```
oom_kill 0
```
Run the following command to provoke an OOM kill:
```
$ sed -e '' </dev/zero
```
Example output
```
Killed
```
Run the following command to view the exit status of the sed command:
```
$ echo $?
```
Example output
```
137
```
The 137 code indicates the container process exited with code 137, indicating it received a SIGKILL signal.
Run the following command to see that the OOM kill counter in /sys/fs/cgroup/memory/memory.oom_control incremented:
```
$ grep '^oom_kill ' /sys/fs/cgroup/memory/memory.oom_control
```
Example output
```
oom_kill 1
```
If one or more processes in a pod are OOM killed, when the pod subsequently exits, whether immediately or not, it will have phase Failed and reason OOMKilled. An OOM-killed pod might be restarted depending on the value of restartPolicy. If not restarted, controllers such as the replication controller will notice the pod’s failed status and create a new pod to replace the old one.

Use the follwing command to get the pod status:
```
$ oc get pod test
```
Example output
```
NAME      READY     STATUS      RESTARTS   AGE
test      0/1       OOMKilled   0          1m
```
- If the pod has not restarted, run the following command to view the pod:
  $ oc get pod test -o yaml
  Example output
  ... status: containerStatuses: - name: test ready: false restartCount: 0 state: terminated: exitCode: 137 reason: OOMKilled phase: Failed
- If restarted, run the following command to view the pod:
  $ oc get pod test -o yaml
  Example output
  ... status: containerStatuses: - name: test ready: true restartCount: 1 lastState: terminated: exitCode: 137 reason: OOMKilled state: running: phase: Running

Understanding pod eviction

OpenShift Dedicated may evict a pod from its node when the node’s memory is exhausted. Depending on the extent of memory exhaustion, the eviction may or may not be graceful. Graceful eviction implies the main process (PID 1) of each container receiving a SIGTERM signal, then some time later a SIGKILL signal if the process has not exited already. Non-graceful eviction implies the main process of each container immediately receiving a SIGKILL signal.

An evicted pod has phase Failed and reason Evicted. It will not be restarted, regardless of the value of restartPolicy. However, controllers such as the replication controller will notice the pod’s failed status and create a new pod to replace the old one.

$ oc get pod test

Example output

NAME      READY     STATUS    RESTARTS   AGE
test      0/1       Evicted   0          1m

$ oc get pod test -o yaml

Example output

...
status:
  message: 'Pod The node was low on resource: [MemoryPressure].'
  phase: Failed
  reason: Evicted