2019-01-22

GKE node migration


Here is the background.
My GKE lab initially used MACHINE_TYPE: g1-small, which provides only 1 CPU.
That spec was not enough for the Master role, so I wanted to move it to a higher-spec node.

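I followed this article for the procedure. The article uses the migration of an entire node pool as its example,
whereas I applied it to a single node only. The steps are as follows: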

First, cordon the node

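kubectl cordon marks a node as unschedulable. It does not affect the services and pods already running on that node; it only means that new pods will not be scheduled onto the node while it is cordoned.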

Before cordoning, create a new node pool on GKE with enough resources, for example MACHINE_TYPE: n1-standard-2
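
The post does not show how pool-2 was created. A minimal sketch, assuming it was done with gcloud: `create_pool` is a hypothetical helper name, the node count of 1 is an assumption, and the cluster is assumed to live in the default zone or region of the local gcloud config. Setting `GCLOUD=echo` prints the arguments instead of calling gcloud.

```shell
#!/bin/sh
# Sketch: create a larger node pool alongside the existing one.
# create_pool is a hypothetical helper, not something from the original post.
# Override GCLOUD (e.g. GCLOUD=echo) to preview the arguments without running gcloud.
create_pool() {
  # $1 = cluster name, $2 = new pool name, $3 = machine type, $4 = node count
  "${GCLOUD:-gcloud}" container node-pools create "$2" \
    --cluster "$1" \
    --machine-type "$3" \
    --num-nodes "$4"
}

# Usage (the node count of 1 is an assumption; the post does not state it):
#   create_pool afu-first-cluster-1 pool-2 n1-standard-2 1
```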

# List the node pools
$ gcloud container node-pools list --cluster afu-first-cluster-1
NAME    MACHINE_TYPE   DISK_SIZE_GB  NODE_VERSION
pool-1  g1-small       30            1.11.6-gke.2
pool-2  n1-standard-2  100           1.11.6-gke.2

Cordon the old node:

# Check the nodes currently in pool-1
$ kubectl get nodes -l cloud.google.com/gke-nodepool=pool-1
NAME                                           STATUS   ROLES    AGE   VERSION
gke-afu-first-cluster-1-pool-1-dd90def6-7h0t   Ready    <none>   21h   v1.11.6-gke.2
gke-afu-first-cluster-1-pool-1-dd90def6-dhfn   Ready    <none>   21h   v1.11.6-gke.2
gke-afu-first-cluster-1-pool-1-dd90def6-lrcb   Ready    <none>   23h   v1.11.6-gke.2
gke-afu-first-cluster-1-pool-1-dd90def6-slvd   Ready    <none>   21h   v1.11.6-gke.2

# Cordon the node gke-afu-first-cluster-1-pool-1-dd90def6-lrcb
$ kubectl cordon gke-afu-first-cluster-1-pool-1-dd90def6-lrcb
node/gke-afu-first-cluster-1-pool-1-dd90def6-lrcb cordoned

# Check the nodes in pool-1 again; the cordoned node now shows SchedulingDisabled
$ kubectl get nodes -l cloud.google.com/gke-nodepool=pool-1
NAME                                           STATUS                     ROLES    AGE   VERSION
gke-afu-first-cluster-1-pool-1-dd90def6-7h0t   Ready                      <none>   21h   v1.11.6-gke.2
gke-afu-first-cluster-1-pool-1-dd90def6-dhfn   Ready                      <none>   21h   v1.11.6-gke.2
gke-afu-first-cluster-1-pool-1-dd90def6-lrcb   Ready,SchedulingDisabled   <none>   23h   v1.11.6-gke.2
gke-afu-first-cluster-1-pool-1-dd90def6-slvd   Ready                      <none>   21h   v1.11.6-gke.2
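
To confirm the cordon without reading the table by eye, the STATUS column can be filtered. A small sketch: `cordoned_nodes` is a hypothetical helper name, and it assumes the default `kubectl get nodes` table layout shown above, with the node name in column 1 and the status in column 2.

```shell
#!/bin/sh
# Sketch: print the names of nodes whose STATUS contains SchedulingDisabled.
# cordoned_nodes is a hypothetical helper; it reads `kubectl get nodes` output from stdin.
cordoned_nodes() {
  # Column 1 is NAME, column 2 is STATUS; the header row never matches the pattern.
  awk '$2 ~ /SchedulingDisabled/ { print $1 }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get nodes -l cloud.google.com/gke-nodepool=pool-1 | cordoned_nodes
```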

Then, move the pods

Migrate the services off the old node.
kubectl drain evicts the pods running on the node so that they are rescheduled onto other nodes.

# Drain the node; the command aborts because DaemonSet-managed pods are present
$ kubectl drain --force gke-afu-first-cluster-1-pool-1-dd90def6-lrcb
node/gke-afu-first-cluster-1-pool-1-dd90def6-lrcb already cordoned
error: unable to drain node "gke-afu-first-cluster-1-pool-1-dd90def6-lrcb", aborting command...

There are pending nodes to be drained:
 gke-afu-first-cluster-1-pool-1-dd90def6-lrcb
error: DaemonSet-managed pods (use --ignore-daemonsets to ignore): consul-zk9wn, calico-node-6fdsp, ip-masq-agent-l85nd

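# Drain the node again with extra flags
# --ignore-daemonsets: skip DaemonSet-managed pods instead of aborting
# --delete-local-data: continue even if pods use emptyDir volumes (that local data is deleted)
# --grace-period: seconds each pod is given to terminate gracefully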
$ kubectl drain --force --ignore-daemonsets --delete-local-data --grace-period=10 gke-afu-first-cluster-1-pool-1-dd90def6-lrcb
node/gke-afu-first-cluster-1-pool-1-dd90def6-lrcb already cordoned
WARNING: Ignoring DaemonSet-managed pods: consul-zk9wn, calico-node-6fdsp, ip-masq-agent-l85nd
pod/l7-default-backend-7ff48cffd7-zbd8m evicted
pod/kube-dns-autoscaler-67c97c87fb-nc2rd evicted
pod/tiller-deploy-77c96688d7-jckdw evicted
pod/metrics-server-v0.2.1-fd596d746-cdp8p evicted
pod/calico-typha-horizontal-autoscaler-5ff7f558cc-zms8d evicted
pod/calico-typha-vertical-autoscaler-5d4bf57df5-pwjvn evicted
pod/calico-typha-5b857668fd-lzkf7 evicted
pod/calico-node-vertical-autoscaler-547d98499d-nt874 evicted
pod/kube-dns-7549f99fcc-kbvkt evicted
node/gke-afu-first-cluster-1-pool-1-dd90def6-lrcb evicted
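
After the drain, it can be worth checking which pods are still bound to the node; only the DaemonSet-managed ones should remain. A minimal sketch: `pods_on_node` is a hypothetical helper name, and it relies on the `spec.nodeName` field selector that kubectl supports for pods. Setting `KUBECTL=echo` prints the arguments instead of calling kubectl.

```shell
#!/bin/sh
# Sketch: list the pods still scheduled on a node after draining it.
# pods_on_node is a hypothetical helper, not something from the original post.
# Override KUBECTL (e.g. KUBECTL=echo) to preview the arguments without a cluster.
pods_on_node() {
  # $1 = node name
  "${KUBECTL:-kubectl}" get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}

# Usage (requires kubectl access):
#   pods_on_node gke-afu-first-cluster-1-pool-1-dd90def6-lrcb
```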

Finally, shrink pool-1 on GKE

Finally, reduce the size of pool-1 on GKE from 4 nodes to 3. GKE then removes the node that was cordoned and drained above.
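
The post does not say whether the resize was done in the console or on the command line. A sketch of the command-line route, assuming a current gcloud release where the flag is `--num-nodes` (older releases used `--size`) and a default zone or region set in the local gcloud config. `resize_pool` is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: shrink a node pool to a target node count.
# resize_pool is a hypothetical helper, not something from the original post.
# Override GCLOUD (e.g. GCLOUD=echo) to preview the arguments without running gcloud.
resize_pool() {
  # $1 = cluster name, $2 = pool name, $3 = target node count
  "${GCLOUD:-gcloud}" container clusters resize "$1" \
    --node-pool "$2" \
    --num-nodes "$3"
}

# Usage (gcloud asks for confirmation before resizing):
#   resize_pool afu-first-cluster-1 pool-1 3
```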

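Done. I ran through this twice, and both runs met my node migration needs. With this experience, I have a much better idea of what to do the next time I migrate.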

Reference: official documentation page
Reference: a community write-up