What Is an Error Budget? Formula, Remaining-Budget Tracking, and Practical Usage Guide

May 9, 2026

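1. What Is an Error Budget? How It Relates to SLI and SLO

Term	Definition	Primary source
SLI (Service Level Indicator)	A measured indicator of service quality (e.g., availability, latency).	Google SRE Book, Measuring Service Levels https://sre.google/sre-book/monitoring/#measuring-service-level-indicators
SLO (Service Level Objective)	A quality target the business can accept, set as the target value for an SLI.	Same as above
Error budget	Computed as `ErrorBudget = (1 - SLO) × period`; the amount of downtime (in seconds) that can be tolerated.	Google SRE Book, Error Budgets https://sre.google/sre-book/alerting/#error-budgets
SLA (Service Level Agreement)	A contractual guarantee made to customers. The error budget, by contrast, is an internal input for operational decisions.	Same as above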

1‑1. Basis of the Formula

\[
\text{ErrorBudget}_{\text{seconds}} = (1 - \text{SLO}) \times \underbrace{\text{period (seconds)}}_{\text{period\_seconds}}
\]

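The SLO is written as a decimal between 0 and 1 (e.g., 99.9 % → 0.999).
The period can be anything, such as a month or a quarter, but it is usually converted to seconds for the calculation.

Practical notes
- With an SLO of 99.95 % (0.9995), the monthly (30-day) error budget is 0.0005 × 2 592 000 ≈ 1 296 seconds (≈ 21.6 minutes).
- Monitor the error budget regularly as a remaining balance; when it runs out, it becomes the signal to slow the release pace.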
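
As a quick check of the arithmetic above, here is a minimal Python sketch. The function name `error_budget_seconds` and its parameters are illustrative assumptions, not part of any library.

```python
def error_budget_seconds(slo: float, period_days: int = 30) -> float:
    """Return the allowed downtime in seconds for an SLO given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be a fraction in (0, 1], e.g. 0.999 for 99.9 %")
    period_seconds = period_days * 24 * 3600  # convert the period to seconds
    return (1.0 - slo) * period_seconds


if __name__ == "__main__":
    # The two SLO values used in this article, over a 30-day month
    for slo in (0.999, 0.9995):
        budget = error_budget_seconds(slo)
        print(f"SLO={slo}: {budget:.0f} s (~{budget / 60:.1f} min)")
```

Passing a different `period_days` (for example 90 for a quarter) gives the budget for that window with the same formula.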

2. Worked Example and Excel Template

2‑1. One Month (30 Days) with SLO = 99.9 %

Item	Value
SLO	`0.999`
Period	`30 days = 30 × 24 × 3 600 = 2 592 000 seconds`
Allowed downtime (ErrorBudget)	`(1 - 0.999) × 2 592 000 = 2 592 seconds ≈ 43.2 minutes`

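2‑2. Excel/CSV Template

File: error_budget_template.xlsx (place the download link on your internal file server)
Sheet layout
- Settings: enter the SLO, the period (in days), and the service name.
- Calc: computes AllowedSeconds and RemainingBudget automatically using the formula above.
- Chart: a time-series graph of the remaining budget plus a "depletion warning" cell (shown in red when negative).

Tip: the formulas in the template are locked, which helps prevent input mistakes.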
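
For teams that prefer the CSV variant, the Calc sheet logic could be sketched in Python roughly as follows. `AllowedSeconds` and `RemainingBudget` come from the template description above; the other column names (`UsedSeconds`, `Depleted`) and both function names are assumptions made for illustration.

```python
import csv
import io


def budget_row(service, slo, period_days, used_seconds):
    """Build one CSV row mirroring the Settings inputs and the Calc outputs."""
    allowed = (1.0 - slo) * period_days * 24 * 3600
    remaining = allowed - used_seconds
    return {
        "Service": service,
        "SLO": slo,
        "PeriodDays": period_days,
        "AllowedSeconds": round(allowed, 1),
        "UsedSeconds": used_seconds,          # hypothetical column: measured downtime
        "RemainingBudget": round(remaining, 1),
        "Depleted": remaining < 0,            # hypothetical column: the "depletion warning"
    }


def to_csv(rows):
    """Serialize rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(to_csv([budget_row("myservice", 0.999, 30, 600)]))
```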

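3. Measuring Actual Error Budget Consumption with Prometheus

3‑1. A Query That Correctly Totals Downtime (Seconds)

The up metric is a gauge: 1 means the target was scraped successfully ("healthy") and 0 means the scrape failed ("down").
increase() is meant for counters, so applying it directly as increase(up[30d]) == 0 does not total downtime as you might expect. Instead, compute it with the following steps.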

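# ------------------------------------------------------------
# Assumption: scrape_interval is 15 seconds (the value in Prometheus's
# sample configuration; the built-in default is 1m)
# Note: the names on the left label each step; they are not PromQL syntax.
# ------------------------------------------------------------

# 1. Build a down flag (swap 0 and 1)
down_flag = (1 - up{job="myservice"})   # up=1 → 0, up=0 → 1

# 2. Sum the down samples
#    A range over an expression needs a subquery: [30d:15s]
down_samples = sum_over_time(down_flag[30d:15s])

# 3. Convert to actual downtime (seconds)
#    number of down samples × scrape_interval (seconds)
actual_down_seconds = down_samples * 15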

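Example query (written as a single line)

sum_over_time((1 - up{job="myservice"})[30d:15s]) * 15

Here 15 is the scrape_interval (in seconds) you actually use; change both the subquery step (:15s) and the multiplier to match your environment.
The result is the total number of seconds the service was down during the period, so the remaining error budget follows from the expression below.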

# Allowed downtime (seconds) = (1 - slo) × period_seconds
allowed_seconds = 0.001 * 30 * 24 * 3600   # e.g., monthly budget for SLO = 99.9 %

error_budget_remaining_seconds = allowed_seconds - sum_over_time((1 - up{job="myservice"})[30d:15s]) * 15

3‑2. Query Walkthrough (Comments Inside the Code Block)

# ------------------------------------------------------------
# Allowed downtime (seconds) = (1 - SLO) × period (seconds)
#   SLO=0.999 → (1-0.999)=0.001
#   period_seconds = 30 days × 24 h × 3600 s = 2_592_000 s
# ------------------------------------------------------------
allowed_seconds = 0.001 * 30 * 24 * 3600   # → 2 592 seconds

# ------------------------------------------------------------
# Compute the actual downtime (seconds)
#   down_flag = 1 - up    ← equals 1 only while up is 0
#   sum_over_time() adds up the 0/1 flags, i.e. the number of down samples,
#   so multiply it by scrape_interval (seconds)
#   [30d:15s] is a subquery: a 30-day range sampled every 15 seconds
# ------------------------------------------------------------
actual_down_seconds = sum_over_time((1 - up{job="myservice"})[30d:15s]) * 15

# ------------------------------------------------------------
# Remaining error budget (seconds)
# ------------------------------------------------------------
error_budget_remaining_seconds = allowed_seconds - actual_down_seconds

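
To make the arithmetic behind the query concrete, the following Python sketch replays the same calculation on a plain list of 0/1 samples. It is only an offline illustration: the function names are hypothetical, and it assumes each failed scrape stands for one full scrape interval of downtime, as the query does.

```python
def downtime_seconds(up_samples, scrape_interval_s=15):
    """Mirror sum_over_time(1 - up) * scrape_interval on a list of 0/1 samples."""
    down_samples = sum(1 - s for s in up_samples)  # count samples where up == 0
    return down_samples * scrape_interval_s


def remaining_budget_seconds(up_samples, slo=0.999, period_days=30, scrape_interval_s=15):
    """Allowed downtime for the period minus the downtime observed in the samples."""
    allowed = (1.0 - slo) * period_days * 24 * 3600
    return allowed - downtime_seconds(up_samples, scrape_interval_s)


if __name__ == "__main__":
    # 40 failed scrapes (10 minutes of downtime) surrounded by healthy samples
    samples = [1] * 1000 + [0] * 40 + [1] * 1000
    print(downtime_seconds(samples))
    print(round(remaining_budget_seconds(samples), 1))
```

With a 15-second interval, the 40 failed scrapes count as 600 seconds, which leaves roughly 1 992 seconds of the 2 592-second monthly budget.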

4. Visualization and Alerting

4‑1. Grafana Dashboard (Stat Panel): Full Structure

Below is the complete JSON, exported for Grafana 9.x. Once imported, it shows "Remaining error budget (seconds)" right away.

{
  "dashboard": {
    "id": null,
    "uid": "error-budget-demo",
    "title": "Error Budget Monitoring",
    "timezone": "browser",
    "panels": [
      {
        "type": "stat",
        "title": "Remaining Error Budget (s)",
        "datasource": "Prometheus",
        "targets": [
          {
            "expr": "(0.001 * 30 * 24 * 3600) - sum_over_time((1 - up{job=\"myservice\"})[30d:15s]) * 15",
            "legendFormat": "",
            "refId": "A"
          }
        ],
        "options": {
          "colorMode": "value",
          "graphMode": "none",
          "justifyMode": "auto",
          "orientation": "auto",
          "reduceOptions": {
            "calcs": ["last"],
            "fields": "",
            "values": false
          },
          "textMode": "auto"
        },
        "fieldConfig": {
          "defaults": {
            "unit": "seconds",
            "thresholds": {
              "mode": "absolute",
              "steps": [
                { "value": null, "color": "red" },
                { "value": 0,    "color": "green" }
              ]
            }
          },
          "overrides": []
        },
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 6}
      }
    ],
    "schemaVersion": 38,
    "version": 1,
    "refresh": "5m"
  },
  "folderId": 0,
  "overwrite": false
}

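Import steps
1. In the Grafana UI, go to "+" → "Import".
2. Paste the JSON above, click "Load", and set the data source to Prometheus to finish. If the import screen does not accept the outer wrapper ("dashboard", "folderId", "overwrite", the shape used by the dashboard HTTP API), paste only the object under "dashboard".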
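
Because the SLO, the window, and the scrape interval are hard-coded in the panel expression, one option is to generate the expression from parameters. The sketch below is an assumption-laden helper (the name `remaining_budget_expr` is made up); it uses the PromQL subquery form `[30d:15s]`, which is what a range over an expression such as `1 - up{...}` requires.

```python
import json


def remaining_budget_expr(job, slo=0.999, window_days=30, scrape_interval_s=15):
    """Build the 'remaining error budget (seconds)' PromQL expression as a string."""
    allowed_seconds = round((1.0 - slo) * window_days * 24 * 3600, 3)
    return (
        f"{allowed_seconds} - "
        f'sum_over_time((1 - up{{job="{job}"}})[{window_days}d:{scrape_interval_s}s]) '
        f"* {scrape_interval_s}"
    )


if __name__ == "__main__":
    # Emit a target object that can replace the one in the panel JSON
    target = {"expr": remaining_budget_expr("myservice"), "legendFormat": "", "refId": "A"}
    print(json.dumps(target, indent=2))
```

Generating the string this way keeps the subquery step and the multiplier in sync when the scrape interval changes.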

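4‑2. Creating an Error Budget Depletion Alert in Datadog

Datadog monitors are simplest to set up with a custom metric. In the example below, an external script submits a gauge named myservice.error_budget_remaining (in seconds), and the monitor notifies when the value drops below the threshold of 0.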

# Datadog Monitor (YAML): Error Budget Depletion
# Field names follow the Datadog Monitors API; convert to JSON if you submit it through the API.
name: "Error Budget Depletion - myservice"
type: metric alert
query: |
  max(last_5m):avg:myservice.error_budget_remaining{env:prod,service:myservice} < 0
message: |
  ⚠️ The error budget is exhausted. Review the release pace immediately and shift to incident recovery tasks.
options:
  thresholds:
    critical: 0
  notify_no_data: true
  no_data_timeframe: 10
  renotify_interval: 30
  include_tags: true
  evaluation_delay: 300

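The value of myservice.error_budget_remaining is computed by a script (for example, in Python) that takes the result of the Prometheus query above and submits it through the datadog-agent's DogStatsD.
For details, see the official Datadog documentation, Metric Monitors: https://docs.datadoghq.com/monitors/create/types/#metric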
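
A minimal sketch of the submission side, using only the Python standard library, might look like this. It assumes a local agent listening for DogStatsD on UDP port 8125 (the usual default) and formats the gauge in the plain-text DogStatsD datagram format; the function names are hypothetical, and fetching the value from Prometheus is left out.

```python
import socket


def format_gauge(metric, value, tags=None):
    """Format a DogStatsD gauge datagram: name:value|g|#tag1:v1,tag2:v2"""
    payload = f"{metric}:{value}|g"
    if tags:
        tag_part = ",".join(f"{k}:{v}" for k, v in sorted(tags.items()))
        payload += f"|#{tag_part}"
    return payload


def send_gauge(metric, value, tags=None, host="127.0.0.1", port=8125):
    """Send the gauge over UDP to the local agent (fire-and-forget)."""
    data = format_gauge(metric, value, tags).encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(data, (host, port))


if __name__ == "__main__":
    # Show the datagram; call send_gauge(...) with the same arguments to submit it
    print(format_gauge(
        "myservice.error_budget_remaining",
        1992.0,
        {"env": "prod", "service": "myservice"},
    ))
```

The tags match the `env:prod,service:myservice` scope used in the monitor query, so the monitor would pick up the submitted series.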

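5. Concrete Action Flow When the Error Budget Is Exhausted

Phase	What to do	Owner / tool
1️⃣ Receive the alert	Confirm the red status in Grafana/Datadog.	SRE on-call
2️⃣ Block releases	Set the `ERROR_BUDGET_OK=false` flag in the CI/CD pipeline to stop deployments automatically.	GitHub Actions / Jenkins
3️⃣ Prioritize recovery work	Raise the priority of existing incident tickets and assign owners.	Jira, PagerDuty
4️⃣ Postmortem	Analyze the cause of the depletion (e.g., latency in an external API) and document the improvements.	Confluence
5️⃣ Re-evaluate the SLO	If needed, discuss with the business side and either relax the SLO or lengthen the period.	Product manager, executives

Best practices
- Once depletion is confirmed, block releases within 30 minutes (treat this as a team guideline and write it into your own error budget policy).
- Keep postmortems blameless, recording only the causes and the measures to prevent recurrence.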
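
The release-block step can be sketched as a small gate script that a pipeline runs before deploying. This is an illustration under assumptions: the function names are made up, the remaining budget is passed in as a command-line argument rather than fetched from monitoring, and releases are blocked once the remaining budget is negative.

```python
import sys


def release_allowed(remaining_seconds):
    """Releases are allowed while the remaining budget has not gone negative."""
    return remaining_seconds >= 0


def gate(remaining_seconds):
    """Print the ERROR_BUDGET_OK flag and return a process exit code (0 = deploy, 1 = block)."""
    ok = release_allowed(remaining_seconds)
    print(f"ERROR_BUDGET_OK={'true' if ok else 'false'}")
    return 0 if ok else 1


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical file name): python budget_gate.py <remaining_seconds>
    sys.exit(gate(float(sys.argv[1])))
```

A non-zero exit code fails the pipeline step, which is one simple way to stop deployments in GitHub Actions or Jenkins without extra plugins.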

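6. Operating Guide for Error Budget Review Meetings

6‑1. Designing the Recurring Meetings

Frequency	Main purpose	Recommended attendees
Monthly (30 min)	Check the dashboard and understand the trend in remaining budget.	SRE lead, product owner, development lead
Quarterly (90 min)	Review major incidents and revisit the SLO and error budget settings.	The above plus executives and customer success

Sample agenda (quarterly review)

1. Remaining-budget trend since the last review (Grafana screenshots).
2. Summary and outcome of the top three incidents.
3. Progress check on improvement measures (list of Jira tickets).
4. SLO revision proposals (if needed, to match new business requirements).
5. Action items for the next review and their owners.

Share the meeting materials in advance via Google Slides and store the minutes in Confluence.
Track action items as Jira epics until their status becomes "Done".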

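6‑2. Key Points for Reaching Agreement

Item	Recommended approach
SLO validity	Quantify user impact and cost, then reach agreement using a value-based pricing approach.
Error budget period	Rather than choosing only monthly or only quarterly, combine both (30 days + 90 days) for better visibility.
Alert thresholds	In addition to 0 seconds, set a warning level (e.g., 20 % or less remaining) to prompt early action.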
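
The two-level threshold idea can be expressed as a small classification function. This is a sketch only: the name `budget_status`, the status labels, and the default 20 % warning ratio are illustrative choices based on the example above.

```python
def budget_status(remaining_seconds, allowed_seconds, warn_ratio=0.20):
    """Classify the budget as 'ok', 'warning' (at or below warn_ratio left), or 'critical' (negative)."""
    if allowed_seconds <= 0:
        raise ValueError("allowed_seconds must be positive")
    if remaining_seconds < 0:
        return "critical"
    if remaining_seconds / allowed_seconds <= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    allowed = 2592  # monthly budget for SLO = 99.9 %
    for remaining in (1992, 500, -30):
        print(remaining, budget_status(remaining, allowed))
```

Running the same function against both a 30-day and a 90-day budget is one way to apply the hybrid-window suggestion.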

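7. Summary and Next Steps

1. Understand the concepts correctly → recognize that SLI, SLO, and the error budget depend on each other within a single framework.
2. Automate the calculation → use the Excel template and the Prometheus queries to obtain "allowed downtime" and "actual downtime" in real time.
3. Visualize and alert → introduce the Grafana Stat panel and the Datadog monitor so that you are notified as soon as the remaining budget drops below 0.
4. Process on depletion → standardize the flow of release block → incident recovery → postmortem → SLO re-evaluation.
5. Build the culture through regular reviews → share error budget metrics with executives in monthly and quarterly meetings to secure continued investment in reliability.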

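What to do next
- Import the JSON from this article into Grafana and enable the dashboard in production.
- If your Prometheus scrape_interval is not 15 seconds, adjust the multiplier in the query accordingly.
- Distribute the Excel template to all teams and confirm that filling in only the first "Settings" sheet is enough to compute the remaining budget.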

References (official and reliable sources)

Item	URL
Google SRE Book, Monitoring & Error Budgets	https://sre.google/sre-book/monitoring/#error-budgets
Prometheus Query Basics	https://prometheus.io/docs/prometheus/latest/querying/basics/
Grafana Stat Panel Documentation	https://grafana.com/docs/grafana/latest/panels-visualizations/stat-panel/
Datadog Metric Monitor Guide	https://docs.datadoghq.com/monitors/create/types/#metric
SLA vs SLO, Cloudflare Blog (supplementary)	https://www.cloudflare.com/ja-jp/learning/ddos/what-is-sla/