
Kubernetes should configure the ambient capability set #56374

@danderson

/kind bug

What happened:

The following takes place on a k8s 1.8.2 cluster.

I have a Docker container image that wants to listen on :80, and specifies a non-root USER. To get this running, in my pod spec the container has the following security context:

securityContext:
    capabilities:
        drop:
        - all
        add:
        - NET_BIND_SERVICE
    allowPrivilegeEscalation: false

When I schedule this pod on the cluster, the container fails to bind to :80 (permission denied), and goes into a crashloop. Note that Kubernetes did not complain that this configuration is in any way infeasible.

The reason for this is that Linux capabilities interact in surprising ways with other security mechanisms. In this case, the problem is that I'm also running the container as a non-root user, and Kubernetes/Docker only set the inheritable, permitted, effective and bounding capability sets. The catch is that the effective and permitted sets get cleared when you transition from UID 0 to UID !0, so my container ends up with:

CapInh:	0000000000000400
CapPrm:	0000000000000000
CapEff:	0000000000000000
CapBnd:	0000000000000400
CapAmb:	0000000000000000
NoNewPrivs:	1

0x400 is CAP_NET_BIND_SERVICE, and as you can see my effective capabilities do not have this bit set.
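For anyone who wants to check the masks above without counting bits by hand, here is a small Python sketch that decodes the Cap* fields of a /proc/<pid>/status dump. The helper names (`decode_caps`, `parse_status`) are made up for illustration, and the name table lists only a few capabilities; the bit numbers are the ones from linux/capability.h.

```python
# Bit positions from linux/capability.h (only a few are listed here).
CAP_NAMES = {
    0: "CAP_CHOWN",
    10: "CAP_NET_BIND_SERVICE",
    12: "CAP_NET_ADMIN",
    21: "CAP_SYS_ADMIN",
}


def decode_caps(hex_mask):
    """Return the names of the capability bits set in a hex mask string."""
    mask = int(hex_mask, 16)
    names = []
    bit = 0
    while mask >> bit:
        if (mask >> bit) & 1:
            # Fall back to a generic label for bits missing from the table.
            names.append(CAP_NAMES.get(bit, f"cap_{bit}"))
        bit += 1
    return names


def parse_status(text):
    """Map each Cap* field of a /proc/<pid>/status dump to decoded names."""
    result = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key.startswith("Cap"):
            result[key] = decode_caps(value.strip())
    return result


# The values reported in this issue.
STATUS = """\
CapInh:\t0000000000000400
CapPrm:\t0000000000000000
CapEff:\t0000000000000000
CapBnd:\t0000000000000400
CapAmb:\t0000000000000000
NoNewPrivs:\t1
"""

if __name__ == "__main__":
    for field, names in parse_status(STATUS).items():
        print(field, names)
```

Run against the dump above, the inheritable and bounding sets decode to CAP_NET_BIND_SERVICE while the permitted, effective and ambient sets come out empty.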

The Linux kernel addressed this very confusing behavior by introducing the ambient capability set (available since Linux 4.3), which does not have surprising behaviors when you transition from UID 0 to !0. If you've set a capability as ambient, you keep it across exec unless you explicitly revoke it.
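To make the effect of the ambient set concrete, here is a rough Python model of the capability transformation that capabilities(7) documents for execve() by a non-root process. It is only a sketch: the `Caps` class and `execve` function are illustrative names, the model ignores set-user-ID/set-group-ID binaries and securebits, and it does not model the UID change itself, only the exec that follows it.

```python
from dataclasses import dataclass

NET_BIND = 1 << 10  # CAP_NET_BIND_SERVICE, i.e. 0x400


@dataclass
class Caps:
    inheritable: int = 0
    permitted: int = 0
    effective: int = 0
    bounding: int = 0
    ambient: int = 0


def execve(proc, file_inh=0, file_prm=0, file_eff=False):
    """Capability sets after a non-root execve(), following capabilities(7).

    file_inh / file_prm / file_eff stand for the file capabilities attached
    to the binary (for example with setcap). Simplification: a file counts
    as "privileged" only if it carries file capabilities.
    """
    file_is_privileged = bool(file_inh or file_prm)
    ambient = 0 if file_is_privileged else proc.ambient
    permitted = (
        (proc.inheritable & file_inh)
        | (file_prm & proc.bounding)
        | ambient
    )
    effective = permitted if file_eff else ambient
    return Caps(proc.inheritable, permitted, effective, proc.bounding, ambient)


# 1. The state from this issue: only inheritable and bounding are populated.
current = Caps(inheritable=NET_BIND, bounding=NET_BIND)
plain = execve(current)

# 2. Same state, but the binary was marked with setcap net_bind_service=+ep.
with_setcap = execve(current, file_prm=NET_BIND, file_eff=True)

# 3. The requested behavior: the runtime also raises the ambient bit.
wanted = Caps(inheritable=NET_BIND, permitted=NET_BIND,
              bounding=NET_BIND, ambient=NET_BIND)
with_ambient = execve(wanted)

if __name__ == "__main__":
    for label, caps in [("plain", plain), ("setcap", with_setcap),
                        ("ambient", with_ambient)]:
        print(f"{label:8} CapPrm={caps.permitted:#06x} CapEff={caps.effective:#06x}")
```

In this model, only the setcap'd binary or a populated ambient set leaves CAP_NET_BIND_SERVICE in the effective set of an unprivileged process running an ordinary binary.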

What you expected to happen:

I expect the capabilities I assign in my podspec to still exist when my main binary execs, regardless of other security context configuration (assuming k8s accepted my manifest as valid). To me, that translates to: k8s should be writing the caps described by securityContext.capabilities into the ambient capability set, as well as the other capability sets.

Alternatively, if you believe the current behavior of securityContext.capabilities is working as intended, there should be another knob somewhere that I can use to populate the ambient capability set. However, I would strongly encourage you to instead consider the current behavior of securityContext.capabilities combined with non-root users as a bug, because it will likely trip up ~everyone using it unless they know a lot about the Linux capability implementation.

How to reproduce it (as minimally and precisely as possible):

Deploy this pod to a cluster using the default container runtime. You should see it crashlooping, with kubectl logs bug-demo showing that netcat is not allowed to bind to :80. If you comment out runAsUser and let the container binary run as root, it'll work fine. Similarly, if you modify the container to have a binary that has been altered with setcap net_bind_service=+ep, the container will run correctly as !root, because the setcap'd binary allows the container to regain the privileges it lost when transitioning out of UID 0.

apiVersion: v1
kind: Pod
metadata:
  name: bug-demo
spec:
  containers:
  - name: netcat
    image: danderson/bug-demo:latest
    args:
    - /bin/sh
    - -c
    - "netcat -l -p 80"
    securityContext:
      runAsUser: 65534
      capabilities:
        drop:
        - all
        add:
        - NET_BIND_SERVICE
      allowPrivilegeEscalation: false
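Since Kubernetes accepts this manifest without complaint, one way to catch the combination early would be a check along these lines. This is purely a hypothetical sketch, not an existing Kubernetes validation: `warn_unusable_caps` is a made-up name, it operates on the container spec as a plain dict, and it can only see an explicit runAsUser (a non-root USER baked into the image would go unnoticed). It also cannot know whether the binary carries file capabilities, so it reports a warning rather than an error.

```python
def warn_unusable_caps(container):
    """Warn when added capabilities are likely to be lost for a non-root UID.

    `container` is one entry of spec.containers, already parsed into a dict.
    """
    sc = container.get("securityContext", {})
    uid = sc.get("runAsUser")
    added = sc.get("capabilities", {}).get("add", [])
    if uid not in (None, 0) and added:
        return [
            f"container {container.get('name')!r}: runs as UID {uid}, so "
            f"added capabilities {added} will not be effective unless the "
            "binary has matching file capabilities"
        ]
    return []


# The container from the manifest above, as a dict.
BUG_DEMO = {
    "name": "netcat",
    "image": "danderson/bug-demo:latest",
    "args": ["/bin/sh", "-c", "netcat -l -p 80"],
    "securityContext": {
        "runAsUser": 65534,
        "capabilities": {"drop": ["all"], "add": ["NET_BIND_SERVICE"]},
        "allowPrivilegeEscalation": False,
    },
}

if __name__ == "__main__":
    for warning in warn_unusable_caps(BUG_DEMO):
        print(warning)
```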

Anything else we need to know?:

Environment:

  • Kubernetes version (use kubectl version): 1.8.2 server, 1.8.1 kubectl
  • Cloud provider or hardware configuration: bare metal cluster, single node (master taint removed), set up with kubeadm.
  • OS (e.g. from /etc/os-release): Debian Testing
  • Kernel (e.g. uname -a): Linux pandora 4.12.0-2-amd64 #1 SMP Debian 4.12.13-1 (2017-09-19) x86_64 GNU/Linux
  • Install tools: kubeadm
  • Others:


    Labels

    kind/feature: Categorizes issue or PR as related to a new feature.
    lifecycle/frozen: Indicates that an issue or PR should not be auto-closed due to staleness.
    sig/node: Categorizes an issue or PR as relevant to SIG Node.
    sig/security: Categorizes an issue or PR as relevant to SIG Security.
