Scalability and Performance | Serving | Red Hat OpenShift Serverless 1.34

OpenShift Serverless consists of several components that have different resource requirements and scaling behaviors. These components are horizontally and vertically scalable, but their resource requirements and configuration depend heavily on the actual use case.

Control-plane components

These components, for example the controller pods, are responsible for observing and reacting to custom resources and for continuously reconfiguring the system.

Data-plane components

These components, for example the Knative Serving activator component, are directly involved in request and response handling.

The following metrics and findings were recorded using the following test setup:

  • A cluster running OpenShift Container Platform 4.13

  • The cluster running 4 compute nodes in AWS with a machine type of m6.xlarge

  • OpenShift Serverless 1.30

Overhead of OpenShift Serverless Serving

As components of OpenShift Serverless Serving are part of the data-plane, requests from clients are routed through:

  • The ingress-gateway (Kourier or Service Mesh)

  • The activator component

  • The queue-proxy sidecar container in each Knative service

Each of these components introduces an additional network hop and performs additional tasks, for example, adding observability and request queuing. The following are the measured latency overheads:

  • Each additional network hop adds 0.5 ms to 1 ms of latency to a request. The activator component is not always part of the request path: this depends on the current load of the Knative service and on whether the Knative service was scaled to zero before the request.

  • Depending on the payload size, each of the components consumes up to 1 vCPU for handling 2500 requests per second.
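The per-hop figure above can be turned into a rough per-request overhead estimate. The following is a minimal sketch, not a measured result: it assumes that the ingress gateway and the queue-proxy are always in the request path, that the activator is optional, and that each component counts as exactly one hop. The function name is hypothetical.

```python
# Per-hop latency overhead range taken from the measurements above (milliseconds).
HOP_LATENCY_MS = (0.5, 1.0)


def added_latency_ms(activator_in_path: bool) -> tuple[float, float]:
    """Return the (low, high) estimated latency overhead per request in ms.

    Assumption: ingress gateway and queue-proxy are always in the path, and
    every component in the path adds exactly one network hop.
    """
    hops = 2 + (1 if activator_in_path else 0)
    low, high = HOP_LATENCY_MS
    return hops * low, hops * high


print(added_latency_ms(activator_in_path=True))   # (1.5, 3.0)
print(added_latency_ms(activator_in_path=False))  # (1.0, 2.0)
```

Treat the result as an order-of-magnitude estimate only; the real overhead also depends on payload size and load.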

Known limitations of OpenShift Serverless Serving

The maximum number of Knative services that can be created is 3,000. This corresponds to the OpenShift Container Platform Kubernetes services limit of 10,000, since 1 Knative service creates 3 Kubernetes services.
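The relationship between the two limits is simple arithmetic. The sketch below only restates the numbers from the paragraph above; the constant and function names are hypothetical, and the remaining headroom is shown purely as an illustration, not as a documented guarantee.

```python
# Figures taken from the limitation described above.
K8S_SERVICE_LIMIT = 10_000
K8S_SERVICES_PER_KNATIVE_SERVICE = 3
DOCUMENTED_KNATIVE_SERVICE_LIMIT = 3_000


def kubernetes_services_used(knative_services: int) -> int:
    """Number of Kubernetes services created by the given Knative services."""
    return knative_services * K8S_SERVICES_PER_KNATIVE_SERVICE


def remaining_kubernetes_services(knative_services: int) -> int:
    """Kubernetes services still available below the platform limit."""
    return K8S_SERVICE_LIMIT - kubernetes_services_used(knative_services)


print(kubernetes_services_used(DOCUMENTED_KNATIVE_SERVICE_LIMIT))       # 9000
print(remaining_kubernetes_services(DOCUMENTED_KNATIVE_SERVICE_LIMIT))  # 1000
```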

Scaling and performance of OpenShift Serverless Serving

OpenShift Serverless Serving has to be scaled and configured based on the following parameters:

  • Number of Knative services

  • Number of Revisions

  • Number of concurrent requests in the system

  • Size of payloads of the requests

  • The startup latency and response latency that the user’s web application adds to the Knative service

  • Number of changes to the Knative service custom resource (CR) over time

KnativeServing default configuration

By default, OpenShift Serverless Serving is configured to run all components with high availability and medium-sized CPU and memory requests and limits. This means that the high-availability replicas field (spec.high-availability.replicas) in the KnativeServing CR is automatically set to 2 and all system components are scaled to two replicas. This configuration is suitable for medium workload scenarios and has been tested with:

  • 170 Knative services

  • 1-2 Revisions per Knative service

  • 89 test scenarios mainly focused on testing the control plane

      • 48 re-creating scenarios, in which Knative services are deleted and re-created

      • 41 stable scenarios, in which requests are slowly but continuously sent to the system

During these test cases, the system components effectively consumed:

Component                                        Measured resources
Operator in project openshift-serverless         1 GB memory, 0.2 CPU cores
Serving components in project knative-serving    5 GB memory, 2.5 CPU cores

Minimal requirements of OpenShift Serverless Serving

While the default setup is suitable for medium-sized workloads, it might be over-sized for smaller setups or under-sized for high-workload scenarios. To configure OpenShift Serverless Serving for a minimal workload scenario, you need to know the idle consumption of the system components.

Idle consumption

The idle consumption is dependent on the number of Knative services. The following memory usage has been measured for the components in the knative-serving and knative-serving-ingress OpenShift Container Platform projects:

Component                 0 services   100 services   500 services   1000 services
activator                 55Mi         86Mi           300Mi          450Mi
autoscaler                52Mi         102Mi          225Mi          350Mi
controller                100Mi        135Mi          310Mi          500Mi
webhook                   60Mi         60Mi           60Mi           60Mi
3scale-kourier-gateway    20Mi         60Mi           190Mi          330Mi
net-kourier-controller    90Mi         170Mi          340Mi          430Mi

Only one ingress option is installed: either the 3scale-kourier-gateway and net-kourier-controller components, or the istio-ingressgateway and net-istio-controller components.

The memory consumption of net-istio is based on the total number of pods within the mesh.
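To get a quick total for a Kourier-based installation, the idle values from the table can be summed per service count. This is a minimal sketch: the data is copied from the table above, the function name is hypothetical, and it assumes the figures can be added up as one value per component. The table does not state whether the values are per pod or per component, so check that against your own measurements.

```python
# Idle memory in Mi, copied from the table above (Kourier ingress option).
IDLE_MEMORY_MI = {
    "activator": {0: 55, 100: 86, 500: 300, 1000: 450},
    "autoscaler": {0: 52, 100: 102, 500: 225, 1000: 350},
    "controller": {0: 100, 100: 135, 500: 310, 1000: 500},
    "webhook": {0: 60, 100: 60, 500: 60, 1000: 60},
    "3scale-kourier-gateway": {0: 20, 100: 60, 500: 190, 1000: 330},
    "net-kourier-controller": {0: 90, 100: 170, 500: 340, 1000: 430},
}


def total_idle_memory_mi(service_count: int) -> int:
    """Sum the measured idle memory over all components.

    Only the measured service counts (0, 100, 500, 1000) are supported;
    any other value raises KeyError instead of guessing an interpolation.
    """
    return sum(by_count[service_count] for by_count in IDLE_MEMORY_MI.values())


for count in (0, 100, 500, 1000):
    print(count, total_idle_memory_mi(count))
# 0 377
# 100 613
# 500 1425
# 1000 2120
```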

Configuring Serving for minimal workloads

Procedure
  • You can configure Knative Serving for minimal workloads using the KnativeServing custom resource (CR):

    A minimal workload configuration in KnativeServing CR
    apiVersion: operator.knative.dev/v1beta1
    kind: KnativeServing
    metadata:
      name: knative-serving
      namespace: knative-serving
    spec:
      high-availability:
        replicas: 1 # (1)
      workloads:
        - name: activator
          replicas: 2 # (2)
          resources:
            - container: activator
              requests:
                cpu: 250m # (3)
                memory: 60Mi # (4)
              limits:
                cpu: 1000m
                memory: 600Mi
        - name: controller
          replicas: 1 # (5)
          resources:
            - container: controller
              requests:
                cpu: 10m
                memory: 100Mi
              limits: # (6)
                cpu: 200m
                memory: 300Mi
        - name: webhook
          replicas: 2
          resources:
            - container: webhook
              requests:
                cpu: 100m # (7)
                memory: 60Mi
              limits:
                cpu: 200m
                memory: 200Mi
      podDisruptionBudgets: # (8)
        - name: activator-pdb
          minAvailable: 1
        - name: webhook-pdb
          minAvailable: 1
    1 Setting this to 1 scales all system components to one replica.
    2 Activator should always be scaled to a minimum of 2 instances to avoid downtime.
    3 Activator CPU requests should not be set lower than 250m, as a HorizontalPodAutoscaler will use this as a reference to scale up and down.
    4 Adjust memory requests to the idle values from the previous table. Also adjust memory limits according to your expected load (this might need custom testing to find the best values).
    5 One webhook and one controller are sufficient for a minimal-workload scenario.
    6 These limits are sufficient for a minimal-workload scenario, but they might need adjustments depending on your specific workload.
    7 Webhook CPU requests should not be set lower than 100m, as a HorizontalPodAutoscaler will use this as a reference to scale up and down.
    8 Adjust the PodDisruptionBudgets to a value lower than replicas, to avoid problems during node maintenance.
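The rules in the callouts above can be checked mechanically before applying a configuration. The following is a minimal sketch under several assumptions: the dictionaries are a simplified, hypothetical representation of the CR (not the operator's schema), the mapping from each PodDisruptionBudget name to a workload is assumed from the naming, and the helper names are invented for illustration.

```python
def millicores(value: str) -> int:
    """Convert a Kubernetes CPU quantity such as '250m' or '1' to millicores."""
    if value.endswith("m"):
        return int(value[:-1])
    return int(float(value) * 1000)


def check_minimal_config(workloads: dict, pdbs: dict) -> list[str]:
    """Return a list of rule violations; an empty list means no problems found.

    workloads: {name: {"replicas": int, "cpu_request": str}}  (hypothetical shape)
    pdbs: {pdb_name: (workload_name, min_available)}          (assumed mapping)
    """
    problems = []
    if workloads["activator"]["replicas"] < 2:
        problems.append("activator should run at least 2 replicas")
    if millicores(workloads["activator"]["cpu_request"]) < 250:
        problems.append("activator CPU request should not be below 250m")
    if millicores(workloads["webhook"]["cpu_request"]) < 100:
        problems.append("webhook CPU request should not be below 100m")
    for pdb_name, (workload_name, min_available) in pdbs.items():
        if min_available >= workloads[workload_name]["replicas"]:
            problems.append(f"{pdb_name}: minAvailable must be lower than replicas")
    return problems


# Values taken from the minimal workload example above.
workloads = {
    "activator": {"replicas": 2, "cpu_request": "250m"},
    "controller": {"replicas": 1, "cpu_request": "10m"},
    "webhook": {"replicas": 2, "cpu_request": "100m"},
}
pdbs = {"activator-pdb": ("activator", 1), "webhook-pdb": ("webhook", 1)}

print(check_minimal_config(workloads, pdbs))  # []
```

Reducing the webhook to one replica in this sketch would flag webhook-pdb, because minAvailable: 1 would then no longer be lower than the replica count.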

Configuring Serving for high workloads

You can configure Knative Serving for high workloads using the KnativeServing custom resource (CR). The following findings are relevant to configuring Knative Serving for a high workload:

These findings have been tested with requests with a payload size of 0-32 KB. The Knative service backends used in those tests had a startup latency between 0 and 10 seconds and response times between 0 and 5 seconds.

  • All data-plane components mainly increase their CPU usage under higher request rates and larger payloads, so the CPU requests and limits have to be tested and potentially increased.

  • The activator component might also need more memory when it has to buffer more or larger request payloads, so the memory requests and limits might need to be increased as well.

  • One activator pod can handle approximately 2500 requests per second before it starts to increase latency and, at some point, leads to errors.

  • One 3scale-kourier-gateway or istio-ingressgateway pod can also handle approximately 2500 requests per second before it starts to increase latency and, at some point, leads to errors.

  • Each of the data-plane components consumes up to 1 vCPU for handling 2500 requests per second. Note that this depends heavily on the payload size and the response times of the Knative service backend.
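As a starting point for your own tests, the per-pod throughput finding can be converted into a minimum replica count. This is a rough sketch only: it assumes throughput and CPU scale linearly with the number of pods, which the findings above do not guarantee, and the function names are hypothetical.

```python
import math

# Figures taken from the findings above.
REQUESTS_PER_SECOND_PER_POD = 2500
MAX_VCPU_PER_POD_AT_CAPACITY = 1.0


def min_pods_for(target_rps: float, ha_minimum: int = 2) -> int:
    """Estimate the minimum number of activator or ingress gateway pods.

    Never returns fewer than ha_minimum, matching the recommendation to keep
    at least two instances of every component.
    """
    needed = math.ceil(target_rps / REQUESTS_PER_SECOND_PER_POD)
    return max(ha_minimum, needed)


def cpu_upper_bound_vcpu(target_rps: float) -> float:
    """Rough upper bound on total vCPU per component (assumes linear scaling)."""
    return target_rps / REQUESTS_PER_SECOND_PER_POD * MAX_VCPU_PER_POD_AT_CAPACITY


print(min_pods_for(1_000))           # 2
print(min_pods_for(10_000))          # 4
print(min_pods_for(12_000))          # 5
print(cpu_upper_bound_vcpu(10_000))  # 4.0
```

Use the result as the initial replicas value for load tests, then adjust it based on the observed latency and error rates.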

Fast startup and fast response times of your Knative service user workloads are critical for good performance of the overall system. The Knative Serving components buffer incoming requests while the Knative service user backend is scaling up or when request concurrency has reached its capacity. If your Knative service user workload introduces long startup or request latency, it either overloads the activator component (when the CPU and memory configuration is too low) or leads to errors for the calling clients.
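Little's law (requests in flight = arrival rate × time in system) gives a feel for how quickly slow backends inflate the amount of buffered work. The sketch below is a worst-case illustration, not a description of the activator's actual buffering behavior: it assumes every in-flight request is held with its full payload, and the function names are hypothetical.

```python
def in_flight_requests(arrival_rps: float, latency_s: float) -> float:
    """Little's law: average number of requests in the system."""
    return arrival_rps * latency_s


def buffered_payload_mib(arrival_rps: float, latency_s: float, payload_kib: float) -> float:
    """Worst-case payload volume held in memory, in MiB.

    Assumption: each in-flight request keeps its complete payload buffered.
    """
    return in_flight_requests(arrival_rps, latency_s) * payload_kib / 1024


# 500 requests per second against a backend that takes 10 seconds to start,
# with 32 KiB payloads (the upper end of the tested range above).
print(in_flight_requests(500, 10))        # 5000
print(buffered_payload_mib(500, 10, 32))  # 156.25
```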

Procedure
  • To fine-tune your installation, use the previous findings combined with your own test results to configure the KnativeServing custom resource:

    A high workload configuration in KnativeServing CR
    apiVersion: operator.knative.dev/v1beta1
    kind: KnativeServing
    metadata:
      name: knative-serving
      namespace: knative-serving
    spec:
      high-availability:
        replicas: 2 # (1)
      workloads:
        - name: component-name # (2)
          replicas: 2 # (3)
          resources:
            - container: container-name
              requests:
                cpu: # (4)
                memory:
              limits:
                cpu:
                memory:
      podDisruptionBudgets: # (5)
        - name: name-of-pod-disruption-budget
          minAvailable: 1
    1 Set this parameter to at least 2 to make sure you always have at least two instances of every component running. You can also use workloads to override the replicas for certain components.
    2 Use the workloads list to configure specific components. Use the deployment name of the component and set the replicas field.
    3 For the activator, webhook, and 3scale-kourier-gateway components, which use horizontal pod autoscalers (HPAs), the replicas field sets the minimum number of replicas. The actual number of replicas depends on the CPU load and scaling done by the HPAs.
    4 Set the CPU and memory requests and limits to at least the idle consumption, while also taking the previous findings and your own test results into consideration.
    5 Adjust the PodDisruptionBudgets to a value lower than replicas to avoid problems during node maintenance. The default minAvailable is set to 1, so if you increase the required replicas, you must also increase minAvailable.

As each environment is highly specific, it is essential to test and find your own ideal configuration. Use the monitoring and alerting functionality of OpenShift Container Platform to continuously monitor your actual resource consumption and make adjustments if needed.

If you are using the OpenShift Serverless and Service Mesh integration, additional CPU processing is added by the istio-proxy sidecar containers. For more information about this, see the Service Mesh documentation.