The Complete Guide to Preventing OOMKilled in Kubernetes (2026 Edition)

April 17, 2026


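Contents

1️⃣ How OOMKilled Happens and What to Check First
2️⃣ Resource Settings and Behavior by QoS Class (Implementation Guide)
3️⃣ Visualizing Memory Usage with Prometheus, with a Ready-to-Use Alert
4️⃣ Preventing OOM at the Root with Autoscaling: Vertical Pod Autoscaler (VPA) and Cluster Autoscaler (CA)
- 4‑1. VPA Deployment Example (for GKE)
- 4‑2. Cluster Autoscaler (EKS) Configuration Sample
5️⃣ Detecting Memory Leaks and Adjusting `oom_score_adj`
6️⃣ Case Study: Migrating EKS from EC2 to Fargate, and OOM Recovery Steps
- 6‑1. Benefits of the Migration
  - Migration Highlights
- 6‑2. Recovery Flow When `Exit Code 137` Occurs
7️⃣ Summary: A Checklist for Eliminating OOMKilled at the Root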


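1️⃣ How OOMKilled Happens and What to Check First

Item	Details
Cause	When a container's memory usage exceeds the limit configured on its cgroup (`memory.limit_in_bytes` on cgroup v1, `memory.max` on cgroup v2), the kernel OOM Killer terminates a process in that cgroup. The container status then reports `reason: OOMKilled` with exit code 137.
Kubernetes integration	- When the Pod is created, `limits.memory` is written to the container's cgroup as the hard limit. `requests.memory` is used for scheduling and for computing `oom_score_adj`; it does not set a cgroup v1 soft limit. - On OOM, the kernel picks the process with the highest OOM score inside the cgroup and kills it.
How to verify	1. Check the Pod status and events with `kubectl get pod <pod> -o wide` and `kubectl describe pod <pod>`. 2. Search the node's kernel log (`dmesg` or `/var/log/kern.log`) for `Out of memory` or `Memory cgroup out of memory` messages.
Reference	Google Cloud: Troubleshooting OOM events https://cloud.google.com/kubernetes-engine/docs/troubleshooting/oom-events?hl=ja

Tip: `kubectl logs --previous` returns the logs of the previous container instance. Whenever it is available, save that output, because it captures what happened just before the restart.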
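The status check above can also be scripted. The following is a minimal sketch, assuming the input has the shape of `kubectl get pods -o json` output; the function name `find_oomkilled` and the sample data are hypothetical and only illustrate which status fields carry the `OOMKilled` reason.

```python
import json


def find_oomkilled(pod_list: dict) -> list[tuple[str, str, str]]:
    """Return (namespace, pod, container) for containers whose current or
    previous state was terminated with reason OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            for key in ("state", "lastState"):
                terminated = (cs.get(key) or {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append((meta.get("namespace", ""),
                                 meta.get("name", ""),
                                 cs.get("name", "")))
                    break  # report each container once
    return hits


# Hypothetical sample in the shape of `kubectl get pods -o json`.
SAMPLE = json.loads("""
{"items": [
  {"metadata": {"namespace": "prod", "name": "webapp-1"},
   "status": {"containerStatuses": [
     {"name": "webapp",
      "state": {"running": {}},
      "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
  {"metadata": {"namespace": "prod", "name": "webapp-2"},
   "status": {"containerStatuses": [
     {"name": "webapp", "state": {"running": {}}, "lastState": {}}]}}
]}
""")

print(find_oomkilled(SAMPLE))
```

To run it against a live cluster, one option is to pipe `kubectl get pods -A -o json` into a small script that calls `find_oomkilled(json.load(sys.stdin))`.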

2️⃣ Resource Settings and Behavior by QoS Class (Implementation Guide)

2‑1. Best Practices for `requests` and `limits`

apiVersion: v1
kind: Pod
metadata:
  name: example-app
spec:
  containers:
  - name: app
    image: myrepo/example:v1
    resources:
      requests:
        memory: "512Mi"
        cpu: "250m"
      limits:
        memory: "1Gi"
        cpu: "500m"

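requests: the amount the scheduler reserves for the container when it places the Pod on a node (the guaranteed minimum).
limits: the upper bound enforced through the cgroup. A container that exceeds its memory limit is killed by the OOM Killer; exceeding the CPU limit only causes throttling.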
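When comparing requests with limits it helps to convert quantities such as `512Mi` and `1Gi` to bytes. The following is a minimal sketch, not the Kubernetes parser itself: the helper name `parse_memory` is hypothetical, and it handles only plain numbers with the binary (`Ki`, `Mi`, `Gi`, ...) and decimal (`k`, `M`, `G`, ...) suffixes, not exponent notation.

```python
import re
from decimal import Decimal

# Multipliers for the suffixes used in Kubernetes memory quantities.
_FACTORS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40, "Pi": 2**50, "Ei": 2**60,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12, "P": 10**15, "E": 10**18,
}
_PATTERN = re.compile(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti|Pi|Ei|k|M|G|T|P|E)?")


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity string such as '512Mi' to bytes."""
    match = _PATTERN.fullmatch(quantity.strip())
    if match is None:
        raise ValueError(f"unsupported memory quantity: {quantity!r}")
    number, suffix = match.groups()
    factor = _FACTORS[suffix] if suffix else 1
    return int(Decimal(number) * factor)


request = parse_memory("512Mi")
limit = parse_memory("1Gi")
# Headroom between what the scheduler reserves and what the cgroup allows.
print(f"request={request} limit={limit} ratio={limit / request:.1f}")
```

With the manifest above, the limit is twice the request, so the container may burst to double its reserved memory before the OOM Killer steps in.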

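2‑2. QoS Class Assignment Rules (Official Guide)

QoS	Condition
Guaranteed	Every container sets both CPU and memory `requests` and `limits`, and `requests == limits` for each resource
Burstable	The Pod does not meet the Guaranteed criteria, and at least one container sets a CPU or memory `requests` or `limits`
BestEffort	No container sets any `requests` or `limits`

Rule of thumb: moving most backend services toward Guaranteed makes them the last candidates when the node itself runs out of memory. A container that exceeds its own memory limit is still OOM-killed regardless of QoS class.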
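The rules in the table can be written down as a small function. This is a minimal sketch under stated assumptions: `qos_class` is a hypothetical helper, it takes containers shaped like the `spec.containers` entries of a Pod manifest, it ignores init containers, and it compares quantity strings literally (so `"1Gi"` and `"1024Mi"` count as different).

```python
def qos_class(containers: list[dict]) -> str:
    """Classify a Pod's QoS class from its containers' CPU/memory settings."""
    any_set = False
    all_guaranteed = True
    for container in containers:
        resources = container.get("resources") or {}
        requests = dict(resources.get("requests") or {})
        limits = resources.get("limits") or {}
        for name in ("cpu", "memory"):
            if name in requests or name in limits:
                any_set = True
            # A missing request defaults to the limit when a limit is set.
            if name in limits and name not in requests:
                requests[name] = limits[name]
            if name not in limits or requests[name] != limits[name]:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if all_guaranteed else "Burstable"


example_app = [{
    "name": "app",
    "resources": {
        "requests": {"memory": "512Mi", "cpu": "250m"},
        "limits": {"memory": "1Gi", "cpu": "500m"},
    },
}]
print(qos_class(example_app))
```

The `example-app` manifest from 2‑1 comes out as Burstable because its requests are lower than its limits; raising the requests to match the limits would make it Guaranteed.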

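2‑3. QoS and `oom_score_adj` (Exact Values)

The kernel accepts an `oom_score_adj` between -1000 and +1000 for each process; processes with higher values are killed first.
The kubelet sets the value for each container from the Pod's QoS class, as described in the Kubernetes documentation. For Burstable Pods the result depends on the node's memory capacity.

QoS	`oom_score_adj`
Guaranteed	-997 (lowest, most protected)
Burstable	min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999): the larger the memory request relative to node capacity, the lower the value
BestEffort	1000 (highest, killed first)

Reference: Sysdig, Inside the Linux OOM Killer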
https://sysdig.com/blog/linux-oom-killer/
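To see how the Burstable value moves with the memory request, the documented formula can be evaluated directly. This is a minimal sketch based on the formula as published in the Kubernetes documentation; `oom_score_adj` is a hypothetical helper, and the value an actual kubelet writes may differ slightly at the boundaries.

```python
GIB = 2**30
MIB = 2**20


def oom_score_adj(qos: str, memory_request_bytes: int = 0,
                  node_capacity_bytes: int = 1) -> int:
    """Approximate the oom_score_adj assigned for a given QoS class."""
    if qos == "Guaranteed":
        return -997
    if qos == "BestEffort":
        return 1000
    # Burstable: scale by the share of node memory the container requests,
    # then clamp into the range 2..999.
    raw = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    return min(max(2, raw), 999)


# A 512Mi request on a node with 8Gi of memory (illustrative numbers).
print(oom_score_adj("Burstable", 512 * MIB, 8 * GIB))
```

A container requesting a larger share of the node gets a lower score, so among Burstable Pods the ones with small requests are killed earlier under node-level memory pressure.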

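3️⃣ Visualizing Memory Usage with Prometheus, with a Ready-to-Use Alert

3‑1. Required Exporters and Basic Configuration

Component	Key metrics
cAdvisor (built into the kubelet)	`container_memory_working_set_bytes`
`kube-state-metrics`	`kube_pod_container_resource_requests`, `kube_pod_container_resource_limits` (filter with `resource="memory"`)
`node_exporter`	Node-wide `node_memory_MemTotal_bytes`, `node_memory_MemAvailable_bytes`

# Excerpt from prometheus.yml
# (the queries in this section also need the kubelet's cAdvisor endpoint to be scraped)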
scrape_configs:
  - job_name: 'kube-state-metrics'
    static_configs:
      - targets: ['kube-state-metrics.kube-system.svc:8080']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter.monitoring.svc:9100']

3‑2. Memory Usage Dashboard (Grafana Panel Example)

# Memory usage per Pod (%)
# kube-state-metrics v2 and later expose limits as kube_pod_container_resource_limits{resource="memory"}
sum by (namespace, pod) (
  container_memory_working_set_bytes{container!=""}
) / sum by (namespace, pod) (
  kube_pod_container_resource_limits{resource="memory", container!=""}
) * 100

3‑3. Prometheus Alerting Rule for Detecting OOM Risk

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-pressure-alerts
spec:
  groups:
  - name: oom.rules
    rules:
    - alert: PodMemoryPressure
      # Both sides are aggregated to the same labels so the series match one-to-one.
      expr: |
        (
          sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
          /
          sum by (namespace, pod, container) (kube_pod_container_resource_limits{resource="memory"})
        ) > 0.9
        and
        (
          sum by (namespace, pod, container) (container_memory_working_set_bytes{container!=""})
          >
          0.8 * sum by (namespace, pod, container) (kube_pod_container_resource_requests{resource="memory"})
        )
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} is approaching its memory limit"
        description: |
          Namespace: {{ $labels.namespace }}
          Usage: {{ $value | humanizePercentage }} of the memory limit

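Why these thresholds: once the working set exceeds 90% of the limit, the risk of an OOM kill rises sharply.
Operational tip: when the alert fires, run `kubectl top pod` automatically so you can tell right away whether you are looking at a sudden spike or a sustained leak.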
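The two conditions in the alert expression can be checked by hand for a single container. This is a minimal sketch of that arithmetic, not of how Prometheus evaluates rules; `memory_pressure` is a hypothetical helper and the byte values are illustrative.

```python
MIB = 2**20


def memory_pressure(working_set: int, limit: int, request: int) -> bool:
    """True when usage is above 90% of the limit AND above 80% of the request."""
    if limit <= 0:
        # No memory limit set: the ratio is undefined, so nothing fires.
        return False
    return working_set / limit > 0.9 and working_set > 0.8 * request


# 950Mi used against a 1Gi limit and a 512Mi request: about 92.8% of the limit.
print(memory_pressure(950 * MIB, 1024 * MIB, 512 * MIB))
# 900Mi used: about 87.9% of the limit, below the 90% threshold.
print(memory_pressure(900 * MIB, 1024 * MIB, 512 * MIB))
```

Because a request can never exceed its limit, the first condition is the stricter one whenever both are set; the second mainly documents how far above its reserved amount the container is running.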

4️⃣ Preventing OOM at the Root with Autoscaling: Vertical Pod Autoscaler (VPA) and Cluster Autoscaler (CA)

Component	Main role
VPA	Recommends or automatically applies `requests`/`limits` based on observed memory usage.
CA	Adds nodes when Pods cannot be scheduled for lack of resources, and removes nodes that are underused.

4‑1. VPA Deployment Example (for GKE)

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: webapp-vpa
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: webapp
  updatePolicy:
    updateMode: "Auto"      # applied automatically (Pods are restarted)
  resourcePolicy:
    containerPolicies:
    - containerName: "*"
      minAllowed:
        memory: "256Mi"
        cpu: "100m"
      maxAllowed:
        memory: "2Gi"
        cpu: "1"

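Note: automatic application recreates Pods, so configure a PodDisruptionBudget to preserve availability.

4‑2. Cluster Autoscaler (EKS) Configuration Sample

# Excerpt from the cluster-autoscaler Deployment (kube-system namespace).
# Cluster Autoscaler is configured with command-line flags, not a ClusterAutoscaler resource.
spec:
  template:
    spec:
      containers:
      - name: cluster-autoscaler
        command:
        - ./cluster-autoscaler
        - --cloud-provider=aws
        # --nodes=<min>:<max>:<name of the Auto Scaling group behind the node group>
        - --nodes=2:20:eks-nodegroup-standard
        - --max-nodes-total=50
        - --scale-down-enabled=true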

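Integration point: when VPA raises requests and no node in the node group has room, the Pod becomes Pending. CA then launches a new instance automatically so the scheduler can place the Pod again.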

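5️⃣ Detecting Memory Leaks and Adjusting `oom_score_adj`

5‑1. Capturing Heap Dumps (Common Flow for Java and Go)

Java:
# Enter the container
kubectl exec -it java-app-abcde -- bash
# Inside the container: dump the heap of the target JVM
jcmd $(pgrep -f org.example.Main) GC.heap_dump /tmp/heapdump.hprof
# Back on your workstation: copy the dump out of the Pod
kubectl cp default/java-app-abcde:/tmp/heapdump.hprof ./heapdump.hprof

Go:
# Enable the pprof endpoint in the application, then fetch the heap profile with curl
# (no -it here: a TTY can corrupt binary output that is redirected to a file)
kubectl exec go-app-xyz -- curl -s http://localhost:6060/debug/pprof/heap > heap.pb.gz

Analyze the Java dump with a tool such as Eclipse MAT or the IntelliJ IDEA profiler and check the top consumers. Open the Go profile with `go tool pprof`.

5‑2. Real-Time Monitoring with `top` / `ps`

kubectl exec -it mypod -- sh -c "ps aux --sort=-%mem | head -n 10"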

This lists the processes using the most memory. Run it a few times in a row to spot a process whose usage keeps climbing.

5‑3. `oom_score_adj` Adjustment Example (Pod Definition)

The Pod API has no field for setting `oom_score_adj` directly (there is no `securityContext.oomScoreAdj`); the kubelet derives the value from the QoS class. To get the most protected value (-997) for a critical service, make the Pod Guaranteed by setting requests equal to limits:

apiVersion: v1
kind: Pod
metadata:
  name: critical-api
spec:
  containers:
  - name: api
    image: myrepo/api:v2
    resources:
      # requests == limits for both CPU and memory gives Guaranteed QoS
      # (oom_score_adj -997)
      requests:
        memory: "2Gi"
        cpu: "1"
      limits:
        memory: "2Gi"
        cpu: "1"

Effect: when the node runs out of memory, BestEffort Pods on the same node are killed first, then Burstable Pods.
Caveat: protecting too many workloads this way only shifts the kill to other processes on the node, so reserve Guaranteed sizing for the services that really need it.

6️⃣ Case Study: Migrating EKS from EC2 to Fargate, and OOM Recovery Steps

6‑1. Benefits of the Migration

Item	Effect
Pod isolation	Each Pod runs in its own dedicated compute environment with its own memory ceiling, so Pods no longer compete for a shared node's memory.
Billing model	You pay for the vCPU and memory configured for each Pod rather than for whole EC2 instances, which reduces over-provisioning.
OOM rate	With no contention for a shared node, `Exit Code 137` drops sharply (an 80% reduction in the case described here).

Migration Highlights

# 1. Create a Fargate profile (targets only the prod namespace)
eksctl create fargateprofile \
  --cluster my-eks-cluster \
  --name fp-prod \
  --namespace prod \
  --labels env=prod

# 2. Set resources explicitly on the Deployment
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp
  namespace: prod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
        env: prod   # must match the Fargate profile selector (--labels env=prod)
    spec:
      containers:
      - name: webapp
        image: myrepo/webapp:v2
        resources:
          requests:
            memory: "512Mi"
            cpu: "250m"
          limits:
            memory: "1Gi"
            cpu: "500m"
EOF

# 3. Delete the EC2 node group (after confirming that all Pods have moved to Fargate)
eksctl delete nodegroup --cluster my-eks-cluster --name eks-nodegroup

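6‑2. Recovery Flow When `Exit Code 137` Occurs

Step	Example command	Notes
1️⃣ Identify the Pod	`kubectl get pod -n prod -o wide | grep OOMKilled`	`STATUS` shows `OOMKilled` only until the container restarts; after that, rely on step 3
2️⃣ Fetch the previous logs	`kubectl logs <pod> -c <container> --previous > pre-oom.log`	`--previous` is required to read the killed container's logs
3️⃣ Check status and events	`kubectl describe pod <pod>`	Look for `Reason: OOMKilled` and `Exit Code: 137` under `Last State`
4️⃣ Redeploy	`kubectl rollout restart deployment/<name> -n prod`	The Deployment creates new Pods automatically
5️⃣ Review alerts and settings	Check whether the Prometheus alert fired	Apply measures that prevent a recurrence (adjust requests/limits, introduce VPA, and so on)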

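7️⃣ Summary: A Checklist for Eliminating OOMKilled at the Root

Item	What to implement or verify
① cgroup and kernel logs	Use `kubectl describe pod` and `dmesg` to collect evidence of the OOM kill
② Move QoS toward Guaranteed	Set `requests == limits` (both memory and CPU) as the default
③ Prometheus alert	Warn on two thresholds: `working_set / limit > 0.9` and `working_set > 0.8 × request`
④ Autoscaling	Combine VPA (Auto mode) with CA to tune resource requests and node capacity together
⑤ Memory leak countermeasures	Capture heap dumps and `pprof` profiles regularly, and watch processes with `top`/`ps`
⑥ Protecting critical services	Run critical services as Guaranteed Pods so the kubelet assigns them the lowest `oom_score_adj` (-997)
⑦ Infrastructure review	Move from EC2 to Fargate or resize node pools as needed

Repeating these steps in a design → deploy → monitor → improve cycle steadily reduces the downtime caused by OOMKilled.

Next actions
1. List the Pods in the current cluster and extract those whose QoS class is BestEffort or Burstable, that is, Pods with missing or unequal requests/limits (kubectl get pods -A -o json | jq -r '.items[] | select(.status.qosClass != "Guaranteed") | "\(.metadata.namespace)/\(.metadata.name)\t\(.status.qosClass)"')
2. Apply the Guaranteed template above to the extracted Pods and draft a plan for introducing VPA.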

This article is based on official documentation and OSS repository information as of April 2026. If a link is broken or a version differs, refer to the latest README of the project in question.
