memory leak in kubelet 1.12.5 #73587

@szuecs

Description

What happened:
After upgrading to Kubernetes 1.12.5 we observe failing nodes, caused by kubelet consuming all available memory over time.

I use the image k8s.gcr.io/hyperkube:v1.12.5 to run kubelet on 102 clusters, and for about a week we have seen some nodes leaking memory, caused by kubelet.

I investigated some of these kubelets with strace and pprof.
Within 3 seconds of running strace I saw >= 50 openat() calls to the same file from the same thread ID (pid) of kubelet:

    50 [pid 12416] openat(10, "__pycache__", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 8
    54 [pid 12368] openat(6, "__pycache__", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 4
    72 [pid 12368] openat(8, "__pycache__", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 4
    80 [pid 12368] openat(4, "__pycache__", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 3
    85 [pid 12416] openat(8, "__pycache__", O_RDONLY|O_NOCTTY|O_NONBLOCK|O_DIRECTORY|O_NOFOLLOW) = 6
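
The exact strace invocation isn't shown above; a tally like that can be produced with something along these lines (one possible way, assuming kubelet is traceable by PID on the host, not necessarily the command used here):

# Attach strace to kubelet for ~3 seconds, trace only openat(), and count identical lines.
# "pidof kubelet" is an assumption; on hyperkube nodes the process name may differ.
timeout 3 strace -f -e trace=openat -p "$(pidof kubelet)" 2>&1 \
  | sort | uniq -c | sort -n | tail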

Profiling kubelet with pprof shows that the client-go metrics and compression are taking up most of the compute time.

root@ip-172-31-10-50:~# go tool pprof http://localhost:10248/debug/pprof/heap
Fetching profile from http://localhost:10248/debug/pprof/heap
Saved profile in /root/pprof/pprof.hyperkube.localhost:10248.alloc_objects.alloc_space.inuse_objects.inuse_space.002.pb.gz
Entering interactive mode (type "help" for commands)
(pprof) top
2406.93MB of 2451.55MB total (98.18%)
Dropped 2863 nodes (cum <= 12.26MB)
Showing top 10 nodes out of 34 (cum >= 2411.39MB)
      flat  flat%   sum%        cum   cum%
 2082.07MB 84.93% 84.93%  2082.07MB 84.93%  k8s.io/kubernetes/vendor/github.com/beorn7/perks/quantile.newStream (inline)
  311.65MB 12.71% 97.64%  2398.72MB 97.84%  k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.newSummary
   10.71MB  0.44% 98.08%  2414.43MB 98.49%  k8s.io/kubernetes/vendor/github.com/prometheus/client_golang/prometheus.(*MetricVec).getOrCreateMetricWithLabelValues
    2.50MB   0.1% 98.18%  2084.57MB 85.03%  k8s.io/kubernetes/vendor/github.com/beorn7/perks/quantile.NewTargeted
         0     0% 98.18%  2412.06MB 98.39%  k8s.io/kubernetes/cmd/kubelet/app.startKubelet.func1
         0     0% 98.18%  2412.06MB 98.39%  k8s.io/kubernetes/pkg/kubelet.(*Kubelet).HandlePodAdditions
         0     0% 98.18%  2412.06MB 98.39%  k8s.io/kubernetes/pkg/kubelet.(*Kubelet).Run
         0     0% 98.18%  2412.06MB 98.39%  k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncLoop
         0     0% 98.18%  2412.06MB 98.39%  k8s.io/kubernetes/pkg/kubelet.(*Kubelet).syncLoopIteration
         0     0% 98.18%  2411.39MB 98.36%  k8s.io/kubernetes/pkg/kubelet/pod.(*basicManager).AddPod

Memory profile png: mem_profile001 (attached)
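
For reference, a graph like mem_profile001 can be rendered straight from the same endpoint; this is just one way to do it (requires graphviz) and not necessarily how the attached image was produced:

# Render the heap profile as a PNG call graph (requires graphviz installed).
go tool pprof -png http://localhost:10248/debug/pprof/heap > mem_profile001.png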

The reflector metrics seem to be the problem:

curl -s localhost:10255/metrics | sed 's/{.*//' | sort | uniq -c | sort -nr
  25749 reflector_watch_duration_seconds
  25749 reflector_list_duration_seconds
  25749 reflector_items_per_watch
  25749 reflector_items_per_list
   8583 reflector_watches_total
   8583 reflector_watch_duration_seconds_sum
   8583 reflector_watch_duration_seconds_count
   8583 reflector_short_watches_total
   8583 reflector_lists_total
   8583 reflector_list_duration_seconds_sum
   8583 reflector_list_duration_seconds_count
   8583 reflector_last_resource_version
   8583 reflector_items_per_watch_sum
   8583 reflector_items_per_watch_count
   8583 reflector_items_per_list_sum
   8583 reflector_items_per_list_count
    165 storage_operation_duration_seconds_bucket
     51 kubelet_runtime_operations_latency_microseconds
     44 rest_client_request_latency_seconds_bucket
     33 kubelet_docker_operations_latency_microseconds
     17 kubelet_runtime_operations_latency_microseconds_sum
     17 kubelet_runtime_operations_latency_microseconds_count
     17 kubelet_runtime_operations
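
Most of these series come from Prometheus Summaries (the *_duration_seconds and items_per_* metrics), and each Summary keeps its own quantile streams, which matches the quantile.newStream allocations in the heap profile above. A simple way to confirm that the series count keeps growing over time (a hypothetical check, not part of the original investigation):

# Print the number of reflector_* metric series once a minute; on an affected node
# this count keeps climbing. (Hypothetical check, not from the original report.)
while true; do
  printf '%s %s\n' "$(date -u +%FT%TZ)" "$(curl -s localhost:10255/metrics | grep -c '^reflector_')"
  sleep 60
done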

What you expected to happen:

I expect kubelet not to need this much memory.

How to reproduce it (as minimally and precisely as possible):

I don't know

Anything else we need to know?:

One of the affected clusters has only 120 pods; 3 pods are in CrashLoopBackOff state, one of them for 6 days on the affected node that was investigated.

Environment:

  • Kubernetes version (use kubectl version): v1.12.5
  • Cloud provider or hardware configuration: AWS
  • OS (e.g. from /etc/os-release):
# cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1800.7.0
VERSION_ID=1800.7.0
BUILD_ID=2018-08-15-2254
PRETTY_NAME="Container Linux by CoreOS 1800.7.0 (Rhyolite)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Metadata

Labels

kind/bug: Categorizes issue or PR as related to a bug.
priority/important-soon: Must be staffed and worked on either currently, or very soon, ideally in time for the next release.
sig/api-machinery: Categorizes an issue or PR as relevant to SIG API Machinery.
sig/node: Categorizes an issue or PR as relevant to SIG Node.
