Kubernetes Monitoring Handbook 07 - Monitoring controller-manager

2022-12-22 14:17:11 Source: 51CTO Blog

Preface

controller-manager is a Kubernetes control-plane component. It rarely breaks, so monitoring the generic process metrics is usually good enough. That said, controller-manager does expose a large set of white-box metrics on /metrics, so let's walk through those as well.

Black-box Testing

Following the approach described in the previous post, "Kubernetes Monitoring Handbook 06 - Monitoring APIServer", let's first test from a black-box perspective whether controller-manager's /metrics endpoint is directly accessible.

[root@tt-fc-dev01.nj manifests]# ss -tlnp|grep controller
LISTEN 0      128                *:10257            *:*    users:(("kube-controller",pid=2782446,fd=7))
[root@tt-fc-dev01.nj manifests]# curl -s http://localhost:10257/metrics
Client sent an HTTP request to an HTTPS server.
[root@tt-fc-dev01.nj manifests]# curl -k -s https://localhost:10257/metrics
{
  "kind": "Status",
  "apiVersion": "v1",
  "metadata": {},
  "status": "Failure",
  "message": "forbidden: User \"system:anonymous\" cannot get path \"/metrics\"",
  "reason": "Forbidden",
  "details": {},
  "code": 403
}

It clearly requires authentication as well. Let's reuse the Token created in the previous post and see whether we can fetch the data:



[root@tt-fc-dev01.nj yamls]# token=`kubectl get secret categraf-token-6whbs -n flashcat -o jsonpath={.data.token} | base64 -d`
[root@tt-fc-dev01.nj yamls]# curl -s -k -H "Authorization: Bearer $token" https://localhost:10257/metrics > cm.metrics
[root@tt-fc-dev01.nj yamls]# head -n 6 cm.metrics
# HELP apiserver_audit_event_total [ALPHA] Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total [ALPHA] Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
[root@tt-fc-dev01.nj yamls]# cat cm.metrics | wc -l
10070

That works: the previous Token can be reused.
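A side note on the base64 -d step in that pipeline: the token sits base64-encoded in the Secret's .data.token field, so it must be decoded before use as a Bearer token. A toy round-trip (the value below is made up; a real token comes from the ServiceAccount Secret):

```shell
# .data fields in a Secret are base64-encoded; base64 -d reverses that.
encoded=$(printf 'fake-token-value' | base64)
printf '%s' "$encoded" | base64 -d   # prints: fake-token-value
```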

Configuring Collection

We again use Prometheus in agent mode to pull the data, keeping it as vanilla as possible; all we need is to add a controller-manager section. The updated prometheus-agent-configmap.yaml looks like this:

apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-conf
  labels:
    name: prometheus-agent-conf
  namespace: flashcat
data:
  prometheus.yml: |-
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      - job_name: "apiserver"
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: default;kubernetes;https
      - job_name: "controller-manager"
        kubernetes_sd_configs:
        - role: endpoints
        scheme: https
        tls_config:
          insecure_skip_verify: true
        authorization:
          credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
        - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
          action: keep
          regex: kube-system;kube-controller-manager;https
    remote_write:
    - url: "http://10.206.0.16:19000/prometheus/v1/write"

Here I added a new scrape job named controller-manager. Kubernetes service discovery still uses the endpoints role, and the match imposes three conditions (implemented via the keep action in relabel_configs):

- __meta_kubernetes_namespace: the endpoint's namespace must be kube-system
- __meta_kubernetes_service_name: the service name must be kube-controller-manager
- __meta_kubernetes_endpoint_port_name: the endpoint's port name must be https
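Under the hood, Prometheus evaluates a keep rule by joining the source label values with `;` (the default separator) and matching the result against the anchored regex. A minimal shell sketch of that logic, using the label values from the rule above:

```shell
# keep rule: join source_labels with ';', keep the target only if the
# anchored regex matches the whole joined string (grep -x anchors both ends).
match() {
  printf '%s' "$1" | grep -Eqx 'kube-system;kube-controller-manager;https' \
    && echo keep || echo drop
}
match "kube-system;kube-controller-manager;https"   # controller-manager endpoint
match "default;kubernetes;https"                    # apiserver endpoint
```

The first call prints `keep`, the second `drop`: only the controller-manager endpoint survives this job's relabeling.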

If the scrape does not succeed, check whether this endpoint exists:

[work@tt-fc-dev01.nj yamls]$ kubectl get endpoints -n kube-system
NAME                      ENDPOINTS                                                            AGE
etcd                      10.206.0.16:2381                                                     126d
etcd-service              10.206.0.16:2379                                                     75d
etcd-service2             10.206.10.16:2379                                                    75d
kube-controller-manager   10.206.0.16:10257                                                    74d
kube-dns                  172.16.0.85:53,172.16.1.4:53,172.16.0.85:53 + 3 more...              324d
kube-scheduler            10.206.0.16:10259                                                    131d
kube-state-metrics        172.16.3.198:8081,172.16.3.198:8080                                  75d
kubelet                   10.206.0.11:10250,10.206.0.16:10250,10.206.0.17:10250 + 15 more...   315d
[work@tt-fc-dev01.nj yamls]$ kubectl get endpoints -n kube-system kube-controller-manager -o yaml
apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    endpoints.kubernetes.io/last-change-trigger-time: "2022-09-15T09:43:21Z"
  creationTimestamp: "2022-09-15T09:43:21Z"
  labels:
    k8s-app: kube-controller-manager
  name: kube-controller-manager
  namespace: kube-system
  resourceVersion: "112212043"
  uid: 52cfb383-6d2b-452e-9a1f-95c7a898a1b4
subsets:
- addresses:
  - ip: 10.206.0.16
    nodeName: 10.206.0.16
    targetRef:
      kind: Pod
      name: kube-controller-manager-10.206.0.16
      namespace: kube-system
      resourceVersion: "112211925"
      uid: d9515495-057c-4ea6-ad1f-28341498710f
  ports:
  - name: https
    port: 10257
    protocol: TCP

__meta_kubernetes_endpoint_port_name corresponds to the third line from the bottom above. All of this exists in my environment. If your environment lacks the corresponding endpoint, you can create a Service by hand; Kong Fei prepared one earlier at https://github.com/flashcatcloud/categraf/blob/main/k8s/controller-service.yaml, so just apply that controller-service.yaml. Also, if controller-manager was installed via kubeadm, remember to edit /etc/kubernetes/manifests/kube-controller-manager.yaml and adjust the startup flag to --bind-address=0.0.0.0.
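For reference, a minimal sketch of what such a Service could look like. This is modeled on, but not necessarily identical to, the linked controller-service.yaml; the selector label below is an assumption and must match the labels on your kube-controller-manager pods, and the port name must stay `https` to satisfy the relabel rule above:

```shell
# Write a hypothetical Service manifest exposing controller-manager's secure
# port 10257; verify the selector against your pods before kubectl apply.
cat <<'EOF' > controller-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: kube-controller-manager
  namespace: kube-system
  labels:
    k8s-app: kube-controller-manager
spec:
  clusterIP: None
  selector:
    component: kube-controller-manager   # assumption: check your pod labels
  ports:
  - name: https      # must be "https" to match the keep regex
    port: 10257
    targetPort: 10257
    protocol: TCP
EOF
cat controller-service.yaml
```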

Monitoring Dashboard

A dashboard for controller-manager is ready at https://github.com/flashcatcloud/categraf/blob/main/k8s/cm-dash.json; it can be imported directly into Nightingale. If you see room for improvement, PRs are welcome.

Monitoring Metrics

Kong Fei previously compiled the meanings of controller-manager's key metrics; I have reproduced his notes here:

# HELP rest_client_request_duration_seconds [ALPHA] Request latency in seconds. Broken down by verb and URL.
# TYPE rest_client_request_duration_seconds histogram
Latency distribution of requests to the apiserver, broken down by URL and verb

# HELP cronjob_controller_cronjob_job_creation_skew_duration_seconds [ALPHA] Time between when a cronjob is scheduled to be run, and when the corresponding job is created
# TYPE cronjob_controller_cronjob_job_creation_skew_duration_seconds histogram
Distribution of the time from when a cronjob is scheduled to run until its job is created

# HELP leader_election_master_status [ALPHA] Gauge of if the reporting system is master of the relevant lease, 0 indicates backup, 1 indicates master. "name" is the string used to identify the lease. Please make sure to group by name.
# TYPE leader_election_master_status gauge
Leader-election status of the controller: 0 means backup, 1 means master

# HELP node_collector_zone_health [ALPHA] Gauge measuring percentage of healthy nodes per zone.
# TYPE node_collector_zone_health gauge
Percentage of healthy nodes in each zone

# HELP node_collector_zone_size [ALPHA] Gauge measuring number of registered Nodes per zones.
# TYPE node_collector_zone_size gauge
Number of registered nodes in each zone

# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
CPU usage (can also be read as CPU utilization)

# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
Number of file descriptors opened by the controller

# HELP pv_collector_bound_pv_count [ALPHA] Gauge measuring number of persistent volume currently bound
# TYPE pv_collector_bound_pv_count gauge
Number of PVs currently bound

# HELP pv_collector_unbound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently unbound
# TYPE pv_collector_unbound_pvc_count gauge
Number of PVCs currently unbound

# HELP pv_collector_bound_pvc_count [ALPHA] Gauge measuring number of persistent volume claim currently bound
# TYPE pv_collector_bound_pvc_count gauge
Number of PVCs currently bound

# HELP pv_collector_total_pv_count [ALPHA] Gauge measuring total number of persistent volumes
# TYPE pv_collector_total_pv_count gauge
Total number of PVs

# HELP workqueue_adds_total [ALPHA] Total number of adds handled by workqueue
# TYPE workqueue_adds_total counter
Total number of items accepted by each controller's workqueue; similar to the apiserver's workqueue_adds_total

# HELP workqueue_depth [ALPHA] Current depth of workqueue
# TYPE workqueue_depth gauge
Queue depth of each controller, i.e. the number of pending items in that controller's workqueue; similar to the apiserver's workqueue_depth; smaller is better

# HELP workqueue_queue_duration_seconds [ALPHA] How long in seconds an item stays in workqueue before being requested.
# TYPE workqueue_queue_duration_seconds histogram
How long items wait in the queue before processing, broken down by controller

# HELP workqueue_work_duration_seconds [ALPHA] How long in seconds processing an item from workqueue takes.
# TYPE workqueue_work_duration_seconds histogram
Time from dequeue to completion of an item, broken down by controller

# HELP workqueue_retries_total [ALPHA] Total number of retries handled by workqueue
# TYPE workqueue_retries_total counter
Number of times items were re-queued for retry

# HELP workqueue_longest_running_processor_seconds [ALPHA] How many seconds has the longest running processor for workqueue been running.
# TYPE workqueue_longest_running_processor_seconds gauge
Processing time of the longest-running item currently in flight

# HELP endpoint_slice_controller_syncs [ALPHA] Number of EndpointSlice syncs
# TYPE endpoint_slice_controller_syncs counter
Number of EndpointSlice syncs (Kubernetes 1.20+)

# HELP get_token_fail_count [ALPHA] Counter of failed Token() requests to the alternate token source
# TYPE get_token_fail_count counter
Number of failed token fetches

# HELP go_memstats_gc_cpu_fraction The fraction of this program's available CPU time used by the GC since the program started.
# TYPE go_memstats_gc_cpu_fraction gauge
Fraction of the controller's CPU time spent in GC
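To make the workqueue metrics concrete, here is a small sketch that extracts per-controller queue depth from an exposition-format dump. The sample lines and controller names below are hypothetical; in practice you would read the cm.metrics file scraped earlier:

```shell
# Hypothetical sample of controller-manager metrics in Prometheus
# exposition format (real lines come from the cm.metrics dump).
cat > /tmp/cm-sample.metrics <<'EOF'
workqueue_depth{name="deployment"} 0
workqueue_depth{name="replicaset"} 3
workqueue_depth{name="endpoint"} 1
EOF

# Print "<controller> <depth>" for every workqueue_depth sample; a depth
# that keeps growing means that controller cannot keep up with its events.
awk -F'[{}" ]+' '/^workqueue_depth/ {print $3, $NF}' /tmp/cm-sample.metrics
```

The same pattern works for workqueue_adds_total or workqueue_retries_total by changing the metric name in the awk pattern.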

Related Articles

Kubernetes Monitoring Handbook 01 - Overview of the Series
Kubernetes Monitoring Handbook 02 - Host Monitoring Overview
Kubernetes Monitoring Handbook 03 - Host Monitoring in Practice
Kubernetes Monitoring Handbook 04 - Monitoring Kube-Proxy
Kubernetes Monitoring Handbook 05 - Monitoring Kubelet
Kubernetes Monitoring Handbook 06 - Monitoring APIServer

About the Author

This article was written by Qin Xiaohui, co-founder of Flashcat. The content distills the collective experience of the Flashcat technical team, edited and organized by the author. We will keep publishing articles on monitoring and reliability engineering. The article may be reposted; please credit the source and respect the engineers' work.

If you are interested in Nightingale, Categraf, Prometheus, or related technologies, you are welcome to join our WeChat group: contact me (picobyte) to be added, and discuss monitoring with the community.

