Open Cluster Management (OCM) was accepted to the Cloud Native Computing Foundation (CNCF) in late 2021 and is currently at the Sandbox project maturity level. OCM is a community-driven project focused on multicluster and multicloud scenarios for Kubernetes applications. This article shows how to bootstrap Open Cluster Management and handle work distribution using ManifestWork. We also discuss several ways to select clusters for various tasks using the ManagedClusterSet and Placement resources.
Basic Open Cluster Management
Open Cluster Management is based on a hub-spoke architecture. In this design, a single hub cluster issues prescriptions, and one or more spoke clusters act upon them. In Open Cluster Management, spoke clusters are called managed clusters. The component running on the hub cluster is the cluster manager. The component running on a managed cluster is the klusterlet agent.
OCM requires only one-sided communication, from the managed clusters to the hub cluster. The communication to the hub is handled by two APIs:
- Registration: For joining managed clusters to the hub cluster and managing their lifecycle.
- Work: For relaying workloads in the form of prescriptions defined on the hub cluster and reconciled on the managed clusters.
OCM resources are easily managed using a command-line interface (CLI) named clusteradm.
A number of resources help you select the clusters used in your application. We will look at all these resources in this article:
- ManagedClusterSets collect ManagedClusters into groups.
- Placements let you select clusters from a ManagedClusterSet.
- To select clusters for a placement, you can use labels, ClusterClaims, taints and tolerations, and prioritizers.
As a simple, basic use case for placements, say you have two ManagedClusterSets, one for all the clusters in Israel and one for all the clusters in Canada. You can use placements to cherry-pick from these sets the managed clusters that are suited for testing or production.
Setting up the example environment
To get started, you'll need to install clusteradm and kubectl and start up three Kubernetes clusters. To simplify cluster administration, this article starts up three kind clusters with the following names and purposes:
- kind-fedora1 runs the hub cluster.
- kind-rhel1 runs one managed cluster.
- kind-qnap1 runs another managed cluster.
Bootstrapping OCM
The bootstrapping task consists of initializing the hub cluster and joining the managed clusters to it. OCM registration involves a double opt-in handshake initiated by the managed clusters and accepted by the hub cluster. At any point in time, the connection can be ended by either party.
Initializing the hub cluster
Run the following command to initialize the hub cluster. Output from the clusteradm command is filtered by the grep command and assigned to the joinCommand variable:
$ joinCommand=$(clusteradm init --context kind-fedora1 --wait | grep clusteradm)
Note: This command includes the deployment of the cluster manager, so it might take a couple of minutes.
You can verify that the cluster manager is running by checking for pods in the designated open-cluster-management-hub and open-cluster-management namespaces.
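For example, a quick check on the hub might look like this (the exact pod names vary by OCM release):
$ kubectl --context kind-fedora1 get pods -n open-cluster-management-hub
$ kubectl --context kind-fedora1 get pods -n open-cluster-management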
Joining and accepting the managed clusters
Join each managed cluster to the hub by injecting the cluster name into the stored joinCommand variable and running the command. This command should be run for every managed cluster; in our example, we have two clusters to manage. As mentioned earlier, registration is a double opt-in handshake, so every join request needs to be accepted by the hub cluster through a clusteradm command:
$ eval $(echo "$joinCommand --context kind-rhel1 --insecure-skip-tls-verify --wait" | sed 's/<cluster_name>/kind-rhel1/g' -)
$ eval $(echo "$joinCommand --context kind-qnap1 --insecure-skip-tls-verify --wait" | sed 's/<cluster_name>/kind-qnap1/g' -)
$ clusteradm --context kind-fedora1 accept --clusters kind-rhel1,kind-qnap1 --wait
Note: These commands deploy the klusterlet agents and initialize the registration, so they might take a couple of minutes.
You can verify that the klusterlet agent is running by checking for pods in the designated open-cluster-management-agent and open-cluster-management namespaces for every managed cluster.
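For example, for the kind-rhel1 cluster (repeat with the kind-qnap1 context for the other managed cluster):
$ kubectl --context kind-rhel1 get pods -n open-cluster-management-agent
$ kubectl --context kind-rhel1 get pods -n open-cluster-management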
For every managed cluster initiating a join request, the Registration API creates a cluster-scoped ManagedCluster resource on the hub cluster with the specification and status for the associated managed cluster.
Excerpts from the resource for one of the clusters in our example, kind-rhel1, follow. Note the status object, which holds valuable information about the cluster:
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
name: kind-rhel1
...
spec:
hubAcceptsClient: true
leaseDurationSeconds: 60
...
status:
allocatable:
cpu: "4"
ephemeral-storage: 71645Mi
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 8012724Ki
pods: "110"
capacity:
cpu: "4"
ephemeral-storage: 71645Mi
hugepages-1Gi: "0"
hugepages-2Mi: "0"
memory: 8012724Ki
pods: "110"
conditions:
- lastTransitionTime: "2022-12-21T10:23:47Z"
message: Accepted by hub cluster admin
reason: HubClusterAdminAccepted
status: "True"
type: HubAcceptedManagedCluster
- lastTransitionTime: "2022-12-21T10:23:47Z"
message: Managed cluster joined
reason: ManagedClusterJoined
status: "True"
type: ManagedClusterJoined
- lastTransitionTime: "2022-12-21T10:23:47Z"
message: Managed cluster is available
reason: ManagedClusterAvailable
status: "True"
type: ManagedClusterConditionAvailable
version:
kubernetes: v1.25.3
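To inspect the full resource yourself, fetch it from the hub cluster:
$ kubectl --context kind-fedora1 get managedcluster kind-rhel1 -o yaml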
Workload distribution across managed clusters
After a successful registration, the hub cluster creates a designated namespace for every joined managed cluster. These namespaces are called cluster namespaces and form the targets for workload distributions. In our example, we expect two namespaces named kind-qnap1 and kind-rhel1.
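You can confirm that the cluster namespaces exist on the hub:
$ kubectl --context kind-fedora1 get namespaces kind-rhel1 kind-qnap1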
It's important to note that cluster namespaces are used only for workload distributions and certain other activities related to the managed cluster. Anything else, such as Placement resources (discussed later in this article) and subscriptions (to be discussed in a future article), goes into your application's namespace.
To distribute workloads across managed clusters, apply a ManifestWork resource describing your workload in the cluster namespace. The klusterlet agents periodically check for ManifestWorks in their designated namespaces, reconcile the prescribed resources, and report back with the reconciliation status.
Here's a basic example of a ManifestWork that includes a namespace and a simple deployment. When applied to a cluster namespace, the associated managed cluster will apply the workload in an orderly fashion:
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
namespace: <target managed cluster>
name: hello-work-demo
spec:
workload:
manifests:
- apiVersion: v1
kind: Namespace
metadata:
name: hello-workload
- apiVersion: apps/v1
kind: Deployment
metadata:
name: hello
namespace: hello-workload
spec:
selector:
matchLabels:
app: hello
template:
metadata:
labels:
app: hello
spec:
containers:
- name: hello
image: quay.io/asmacdo/busybox
command:
["sh", "-c", 'echo "Hello, Kubernetes!" && sleep 3600']
You can verify that the distribution took place by looking for the hello deployment in the hello-workload namespace in each cluster you intended to deploy to.
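For example, assuming the ManifestWork targeted kind-rhel1:
$ kubectl --context kind-rhel1 get deployment hello -n hello-workload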
After deploying this ManifestWork, the klusterlet agent of the associated managed cluster creates the necessary resources and reports back. Excerpts from a status report for the previous ManifestWork follow:
apiVersion: work.open-cluster-management.io/v1
kind: ManifestWork
metadata:
...
spec:
...
status:
conditions:
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: All resources are available
observedGeneration: 1
reason: ResourcesAvailable
status: "True"
type: Available
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: Apply manifest work complete
observedGeneration: 1
reason: AppliedManifestWorkComplete
status: "True"
type: Applied
resourceStatus:
manifests:
- conditions:
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: Apply manifest complete
reason: AppliedManifestComplete
status: "True"
type: Applied
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: Resource is available
reason: ResourceAvailable
status: "True"
type: Available
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: ""
reason: NoStatusFeedbackSynced
status: "True"
type: StatusFeedbackSynced
resourceMeta:
group: ""
kind: Namespace
name: hello-workload
namespace: ""
ordinal: 0
resource: namespaces
version: v1
statusFeedback: {}
- conditions:
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: Apply manifest complete
reason: AppliedManifestComplete
status: "True"
type: Applied
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: Resource is available
reason: ResourceAvailable
status: "True"
type: Available
- lastTransitionTime: "2022-12-21T11:15:35Z"
message: ""
reason: NoStatusFeedbackSynced
status: "True"
type: StatusFeedbackSynced
resourceMeta:
group: apps
kind: Deployment
name: hello
namespace: hello-workload
ordinal: 1
resource: deployments
version: v1
statusFeedback: {}
For every ManifestWork resource identified on the hub cluster, the klusterlet agent creates a cluster-scoped AppliedManifestWork resource on the managed cluster. This resource serves as the owner and the status reporter for the workload. Excerpts from the AppliedManifestWork for the previous ManifestWork follow:
apiVersion: work.open-cluster-management.io/v1
kind: AppliedManifestWork
metadata:
...
spec:
...
manifestWorkName: hello-work-demo
status:
appliedResources:
- group: ""
name: hello-workload
namespace: ""
resource: namespaces
version: v1
....
- group: apps
name: hello
namespace: hello-workload
resource: deployments
version: v1
...
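You can list these resources directly on a managed cluster:
$ kubectl --context kind-rhel1 get appliedmanifestworks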
Grouping managed clusters
As explained earlier, the hub cluster creates a cluster-scoped ManagedCluster resource to represent each joined managed cluster. You can group multiple ManagedClusters using the cluster-scoped ManagedClusterSet resource.
At the start, there are two pre-existing ManagedClusterSets: the default set, which includes every newly joined ManagedCluster, and the global set, which includes all ManagedClusters:
$ clusteradm --context kind-fedora1 get clustersets
<ManagedClusterSet>
└── <default>
│ ├── <BoundNamespace>
│ ├── <Status> 2 ManagedClusters selected
└── <global>
└── <Status> 2 ManagedClusters selected
└── <BoundNamespace>
Now add your own set:
$ clusteradm --context kind-fedora1 create clusterset managed-clusters-region-a
Next, configure both of your clusters as members. The following command overwrites the designated label on the ManagedCluster custom resources:
$ clusteradm --context kind-fedora1 clusterset set managed-clusters-region-a --clusters kind-rhel1,kind-qnap1
When you set your clusters as members of your cluster set, they are removed from the pre-existing default set.
As stated earlier, the ManagedClusterSet resource is cluster-scoped. When you write your application using the cluster set, you have to bind it into your application's namespace, which you can do using clusteradm.
The following command creates a namespace-scoped ManagedClusterSetBinding custom resource in your application namespace:
$ clusteradm --context kind-fedora1 clusterset bind managed-clusters-region-a --namespace our-application-ns
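(If the bind fails because the our-application-ns namespace doesn't exist yet on the hub, create it first with kubectl create namespace our-application-ns.) You can verify the resulting binding as follows:
$ kubectl --context kind-fedora1 get managedclustersetbindings -n our-application-ns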
You can verify through the following command that your clusters have moved from the default set to your new set, and that your set is bound to your application namespace:
$ clusteradm --context kind-fedora1 get clustersets
<ManagedClusterSet>
└── <default>
│ ├── <BoundNamespace>
│ ├── <Status> No ManagedCluster selected
└── <global>
│ ├── <Status> 2 ManagedClusters selected
│ ├── <BoundNamespace>
└── <managed-clusters-region-a>
└── <BoundNamespace> our-application-ns
└── <Status> 2 ManagedClusters selected
Selecting clusters from the set
With your ManagedClusterSet bound to your application namespace, you can create a Placement to dynamically select clusters from the set. The following subsections show several ways to select clusters, fetch the selected cluster list, and prioritize managed clusters.
Using labels to select clusters
You can select clusters for a placement using labels. The following configuration tells your placement which labels to look for in ManagedClusters within the configured ManagedClusterSets:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
name: our-label-placement
namespace: our-application-ns
spec:
numberOfClusters: 1
clusterSets:
- managed-clusters-region-a
predicates:
- requiredClusterSelector:
labelSelector:
matchLabels:
our-custom-label: "include-me-4-tests"
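Apply the Placement to the hub cluster; the file name used here (placement-label.yaml) is just a hypothetical example:
$ kubectl --context kind-fedora1 apply -f placement-label.yaml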
Because you haven't labeled any of your ManagedClusters yet, your placement will not find any clusters:
$ kubectl --context kind-fedora1 get placement -n our-application-ns
NAME SUCCEEDED REASON SELECTEDCLUSTERS
our-label-placement False NoManagedClusterMatched 0
So add the appropriate label to one of your ManagedClusters, in this case kind-rhel1:
$ kubectl --context kind-fedora1 label managedcluster kind-rhel1 our-custom-label="include-me-4-tests"
Your placement should now pick this up:
$ kubectl --context kind-fedora1 get placement -n our-application-ns
NAME SUCCEEDED REASON SELECTEDCLUSTERS
our-label-placement True AllDecisionsScheduled 1
Using ClusterClaims to select clusters
ClusterClaims address two concerns with the use of labels for placement. The first is that, although labels are useful, their overuse makes them error-prone. The second is that, in our case, the labels are added to resources on the hub cluster, which requires the hub cluster administrator or another user with the appropriate permissions to make the change.
With ClusterClaims, the selection of clusters can be delegated to the managed clusters. ClusterClaims are custom resources applied on the managed cluster. Their content is propagated to the hub cluster and reflected in the status of the associated ManagedCluster resource. The cluster administrators of a managed cluster can decide, for instance, which of their clusters are used for tests and which are used for production, simply by applying this agreed-upon custom resource.
ClusterClaims can also be used in conjunction with labels to fine-tune your selection.
Apply the following YAML in one of your managed clusters. This example uses kind-qnap1:
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: ClusterClaim
metadata:
name: our-custom-clusterclaim
spec:
value: include-me-for-tests
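Assuming you saved the claim to a file (a hypothetical clusterclaim.yaml), apply it against the managed cluster rather than the hub:
$ kubectl --context kind-qnap1 apply -f clusterclaim.yaml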
Propagation can be verified on the associated ManagedCluster resource on the hub cluster:
apiVersion: cluster.open-cluster-management.io/v1
kind: ManagedCluster
metadata:
name: kind-qnap1
...
spec:
hubAcceptsClient: true
...
status:
allocatable:
...
capacity:
...
clusterClaims:
- name: our-custom-clusterclaim
value: include-me-for-tests
conditions:
...
version:
kubernetes: v1.25.3
Now you can create a placement based on your ClusterClaim:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
name: our-clusterclaim-placement
namespace: our-application-ns
spec:
numberOfClusters: 1
clusterSets:
- managed-clusters-region-a
predicates:
- requiredClusterSelector:
claimSelector:
matchExpressions:
- key: our-custom-clusterclaim
operator: In
values:
- include-me-for-tests
And verify the placement selected the claimed ManagedCluster:
$ kubectl --context kind-fedora1 get placement -n our-application-ns
NAME SUCCEEDED REASON SELECTEDCLUSTERS
our-clusterclaim-placement True AllDecisionsScheduled 1
our-label-placement True AllDecisionsScheduled 1
Using taints and tolerations for the selection
Taints and tolerations help you filter out unhealthy or otherwise not-ready clusters.
Taints are properties of ManagedCluster resources. The following command adds a taint to make your placement deselect your kind-qnap1 cluster. The command also removes existing taints from this ManagedCluster, so be careful when executing such commands:
$ kubectl --context kind-fedora1 patch managedcluster kind-qnap1 --type='json' -p='[{"op": "add", "path": "/spec/taints", "value": [{"effect": "NoSelect", "key": "our-custom-taint-key", "value": "our-custom-taint-value" }] }]'
To verify that the taint was added, execute:
$ kubectl --context kind-fedora1 get managedcluster kind-qnap1 -o jsonpath='{.spec.taints[*]}'
{"effect":"NoSelect","key":"our-custom-taint-key","timeAdded":"2022-12-22T15:38:47Z","value":"our-custom-taint-value"}
Now verify that the relevant placement has deselected your cluster based on the NoSelect effect:
$ kubectl --context kind-fedora1 get placement -n our-application-ns our-clusterclaim-placement
NAME SUCCEEDED REASON SELECTEDCLUSTERS
our-clusterclaim-placement False NoManagedClusterMatched 0
A toleration overrides matching taints. So your next experiment is to make your placement ignore the previous taint using a toleration. Again, be careful when executing commands like the following, because it removes any existing tolerations from the placement:
$ kubectl --context kind-fedora1 patch placement -n our-application-ns our-clusterclaim-placement --type='json' -p='[{"op": "add", "path": "/spec/tolerations", "value": [{"key": "our-custom-taint-key", "value": "our-custom-taint-value", "operator": "Equal" }] }]'
Verify that the toleration was added:
$ kubectl --context kind-fedora1 get placement -n our-application-ns our-clusterclaim-placement -o jsonpath='{.spec.tolerations[*]}'
{"key":"our-custom-taint-key","operator":"Equal","value":"our-custom-taint-value"}
And verify that the relevant placement has reselected your cluster:
$ kubectl --context kind-fedora1 get placement -n our-application-ns our-clusterclaim-placement
NAME SUCCEEDED REASON SELECTEDCLUSTERS
our-clusterclaim-placement True AllDecisionsScheduled 1
Two taints are automatically created by the system:
- cluster.open-cluster-management.io/unavailable
- cluster.open-cluster-management.io/unreachable
You can't manually modify these taints, but you can add tolerations to override them. You can even issue temporary tolerations that expire after a specified number of seconds (tolerationSeconds), as described in the placement documentation.
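For instance, a toleration like the following sketch (field names follow the Placement tolerations API; the 300-second grace period is just an illustration) would let a placement keep selecting a cluster carrying the unreachable taint for a limited time:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
  name: our-clusterclaim-placement
  namespace: our-application-ns
spec:
  ...
  tolerations:
    - key: cluster.open-cluster-management.io/unreachable
      operator: Exists
      tolerationSeconds: 300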
Fetching the selected cluster list
As long as a placement has selected at least one cluster, the system creates PlacementDecision resources listing the selected clusters:
$ kubectl --context kind-fedora1 get placementdecisions -n our-application-ns
NAME AGE
our-clusterclaim-placement-decision-1 10m
our-label-placement-decision-1 22m
A PlacementDecision is created in the same namespace as its Placement counterpart and is named after it.
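To extract just the selected cluster names, you can query the decisions list with jsonpath. For example, for the label-based placement:
$ kubectl --context kind-fedora1 get placementdecision our-label-placement-decision-1 -n our-application-ns -o jsonpath='{.status.decisions[*].clusterName}'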
Your PlacementDecision for the placement selected by label should display your labeled ManagedCluster:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
name: our-label-placement-decision-1
namespace: our-application-ns
ownerReferences:
- apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
name: our-label-placement
...
...
status:
decisions:
- clusterName: kind-rhel1
reason: ""
Similarly, your PlacementDecision for the placement selected by ClusterClaim should display your claimed ManagedCluster:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: PlacementDecision
metadata:
name: our-clusterclaim-placement-decision-1
namespace: our-application-ns
ownerReferences:
- apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
name: our-clusterclaim-placement
...
...
status:
decisions:
- clusterName: kind-qnap1
reason: ""
Prioritizing clusters for selection
Prioritizers tell placements to prefer some clusters based on built-in ScoreCoordinates. You can also extend prioritization by using an AddOnPlacementScore.
The following settings configure your placement to use prioritization and sort your clusters based on their allocatable memory and CPU capacity (as seen in the cluster status). Each coordinate is assigned a different weight for the selection:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
name: our-clusterclaim-placement
namespace: our-application-ns
spec:
numberOfClusters: 2
clusterSets:
...
prioritizerPolicy:
mode: Exact
configurations:
- scoreCoordinate:
builtIn: ResourceAllocatableMemory
weight: 2
- scoreCoordinate:
builtIn: ResourceAllocatableCPU
weight: 3
predicates:
- requiredClusterSelector:
...
To extend the built-in score coordinates, use a designated AddOnPlacementScore in your application namespace:
apiVersion: cluster.open-cluster-management.io/v1alpha1
kind: AddOnPlacementScore
metadata:
name: our-addon-placement-score
namespace: our-application-ns
status:
conditions:
- lastTransitionTime: "2021-10-28T08:31:39Z"
message: AddOnPlacementScore updated successfully
reason: AddOnPlacementScoreUpdated
status: "True"
type: AddOnPlacementScoreUpdated
validUntil: "2021-10-29T18:31:39Z"
scores:
- name: "our-custom-score-a"
value: 66
- name: "our-custom-score-b"
value: 55
Now, you can modify your placement and add score coordinates referencing your custom addon scores:
apiVersion: cluster.open-cluster-management.io/v1beta1
kind: Placement
metadata:
name: our-clusterclaim-placement
namespace: our-application-ns
spec:
numberOfClusters: 2
clusterSets:
...
prioritizerPolicy:
mode: Exact
configurations:
- scoreCoordinate:
builtIn: ResourceAllocatableMemory
weight: 2
- scoreCoordinate:
builtIn: ResourceAllocatableCPU
weight: 3
- scoreCoordinate:
builtIn: AddOn
addOn:
resourceName: our-addon-placement-score
scoreName: our-custom-score-a
weight: 1
- scoreCoordinate:
builtIn: AddOn
addOn:
resourceName: our-addon-placement-score
scoreName: our-custom-score-b
weight: 4
predicates:
- requiredClusterSelector:
...
Summarizing OCM basics
This article introduced Open Cluster Management and explained how to:
- Bootstrap OCM on hub and spoke (managed) clusters.
- Deploy a workload across managed clusters.
- Group managed clusters into sets.
- Select clusters from cluster sets.
- Customize placement scheduling.
Upcoming articles will cover various features, addons, frameworks, and integrations related to OCM. These components make use of the work distribution and placement infrastructure described in this article. In the meantime, you can learn how to prevent computer overload with remote kind clusters. I want to thank you for taking the time to read this article, and I hope you got something out of it. Feel free to comment below if you have questions. We welcome your feedback.