Legal Considerations for Web Scraping and an Internal Compliance Checklist

April 28, 2026

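Contents

1 Legal considerations and sources
2 Internal compliance checklist
- 2.1 1. Gathering prerequisite information
- 2.2 2. Approval flow
3 Target site selection criteria
- 3.1 Selection flow
4 A basic retrieval flow in Python
5 Speeding up bulk retrieval with asynchronous processing
6 Data cleansing, persistence, and operations
7 Periodically refreshing robots.txt
8 Sample code and how to join the community
- 8.1 GitHub repository
- 8.2 Newsletter
9 📌 Summary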

Legal considerations and sources

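Item	Main provisions / case law	Reference link
Act on Prohibition of Unauthorized Computer Access (Articles 2 and 3)	Article 2 defines unauthorized access and Article 3 prohibits it; the Act targets access that circumvents access controls. Note: automated retrieval that deliberately places load on a system can still be unlawful, typically under other provisions such as obstruction of business rather than under this Act.	https://elaws.e-gov.go.jp/document?lawid=332AC0000000202
Copyright Act (Articles 30 and 32)	Article 30 covers reproduction for private use and Article 32 covers quotation. Extraction and reuse that stays within such limitations is generally permissible, whereas offering the data to the public or reusing it commercially generally is not. Where a site's terms of use explicitly prohibit extraction, that prohibition applies as well.	https://elaws.e-gov.go.jp/document?lawid=415AC0000000078
Case law: scraping litigation (Company A v. Company B, 2021)	Gist of the ruling: where the terms of use explicitly prohibit automated retrieval, that clause is valid as a contractual obligation.	https://www.courts.go.jp/判例検索
Act on the Protection of Personal Information (Article 15)	Information taken from pages that contain personal data must not be used beyond the stated purpose and must be protected by security control measures.	https://elaws.e-gov.go.jp/document?lawid=415M50000002031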

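Practical points
1. robots.txt is not legally binding in itself, but it should be honored in good faith, in the same way as the site's Terms of Service.
2. Always check the body of the terms for restrictions such as "commercial use" or "bulk retrieval", and carry each one into the checklist used for internal review.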
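
Honoring robots.txt in practice can be checked mechanically before any request is sent. The following is a minimal sketch using the standard library's `urllib.robotparser`; the sample rules, the `my-bot` user agent, and the `example.com` URLs are assumptions made for illustration, not values taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; in practice this would be downloaded
# from https://{domain}/robots.txt.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Allow: /quote/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://example.com/quote/7203.T"))
    print(is_allowed(SAMPLE_ROBOTS, "my-bot", "https://example.com/private/report"))
```

A check like this covers only robots.txt; restrictions written in the Terms of Service still need a separate review.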

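Internal compliance checklist

1. Gathering prerequisite information

Item	Action	Where recorded
Target URL and robots.txt	Fetch `https://{domain}/robots.txt` and list its `Disallow`/`Allow` rules.	Confluence page "スクレイピング_対象サイト" (scraping target sites)
Terms of Service (ToS)	Retrieve the terms from the page footer or `/terms` and search them for keywords (automated retrieval, commercial use).	SharePoint folder "Legal_TOS"
Legal assessment	The legal department judges compliance with the Act on Prohibition of Unauthorized Computer Access and the Copyright Act.	Legal approval email (PDF)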
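
The first two rows of the table can be partly automated. The sketch below lists the `Allow`/`Disallow` rules found in a robots.txt body and scans a terms-of-service text for keywords. The function names `list_robots_rules` and `find_tos_keywords` are hypothetical helpers introduced here, and the keyword list is an assumption; the legal assessment itself remains a human task.

```python
def list_robots_rules(robots_txt: str) -> list:
    """Return (user_agent, directive, path) tuples for Allow/Disallow lines."""
    rules = []
    agents = []
    last_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not last_was_agent:
                agents = []  # a new group of user agents starts here
            agents.append(value)
            last_was_agent = True
        else:
            if field in ("allow", "disallow"):
                for agent in agents:
                    rules.append((agent, field.capitalize(), value))
            last_was_agent = False
    return rules


def find_tos_keywords(tos_text: str, keywords: list) -> list:
    """Return (line_number, keyword, line) for every keyword hit (case-insensitive)."""
    hits = []
    for lineno, line in enumerate(tos_text.splitlines(), start=1):
        for keyword in keywords:
            if keyword.lower() in line.lower():
                hits.append((lineno, keyword, line.strip()))
    return hits
```

The output of both helpers is plain data, so it can be pasted into the record locations named in the table.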

2. Approval flow

flowchart TD
    A[Requirements definition] --> B[Legal review]
    B --> C[Information security check]
    C --> D[Data retention and deletion policy]
    D --> E["Final approval (CIO)"]

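Step	Department	Main check items	How recorded
1️⃣ Requirements definition	Business owner	Purpose of data retrieval, target tickers and frequency	Confluence requirements sheet
2️⃣ Legal review	Legal department	ToS violations, handling of copyright and personal information	Approval email → SharePoint
3️⃣ Security check	IS department	Scope of header spoofing, TLS usage, log retention policy	Signed checklist
4️⃣ Data retention policy	Governance committee	Retention period, backups, disposal procedure	Wiki page "DataRetention"
5️⃣ Final approval	CIO/CTO	Overall consolidated check	Approved template (PDF)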
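
Because the steps run strictly in order, a scraper can refuse to start until every step is signed off. The sketch below models that gate; the `ApprovalStep` class and both helper functions are hypothetical names introduced for illustration and do not correspond to any existing internal tool.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ApprovalStep:
    name: str
    owner: str
    approved: bool = False


def next_pending(steps: List[ApprovalStep]) -> Optional[ApprovalStep]:
    """Return the first step that is not yet approved, or None if all are."""
    for step in steps:
        if not step.approved:
            return step
    return None


def may_start_scraping(steps: List[ApprovalStep]) -> bool:
    """Scraping may start only once every step in the flow is approved."""
    return next_pending(steps) is None


# Step names and owners follow the table above; approval states are made up.
FLOW = [
    ApprovalStep("Requirements definition", "Business owner", approved=True),
    ApprovalStep("Legal review", "Legal department"),
    ApprovalStep("Security check", "IS department"),
    ApprovalStep("Data retention policy", "Governance committee"),
    ApprovalStep("Final approval", "CIO/CTO"),
]
```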

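Target site selection criteria

Site	API available	HTML stability (2025-04)	Retrieval permitted by ToS	Recommended use
Yahoo! Finance (Yahoo!ファイナンス)	No (an unofficial API exists)	High	Pages under `/quote/` may be retrieved*; bulk commercial retrieval is not allowed	Periodic collection for a small number of tickers
EDINET (Financial Services Agency)	Yes (XBRL / PDF API)	High	Public data may be used freely, with no upper limit	Full financial statements and past results
Nikkei Company Information (日経会社情報)	No	Medium	"Scraping prohibited" is stated explicitly	Not recommended

* The restriction on bulk commercial retrieval is stated in the Yahoo! Finance Terms of Service (as of 2026-04-27).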

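Selection flow

API availability → if an API exists, use it in preference to scraping.
HTML stability → the less often the markup changes, the lower the maintenance cost.
ToS constraints → make the final decision based on whether clauses such as "automated retrieval prohibited" or "no commercial use" are present.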
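
The three checks above can be written down as a small decision function, which makes the selection reproducible across reviewers. This is a sketch under assumptions: `SiteProfile`, `recommend`, and the returned labels are hypothetical, and the sample profiles simply restate the table rows rather than any verified reading of the sites' terms.

```python
from dataclasses import dataclass


@dataclass
class SiteProfile:
    name: str
    has_api: bool
    html_stability: str  # "high", "medium", or "low"
    tos_forbids_scraping: bool
    tos_forbids_commercial_bulk: bool = False


def recommend(site: SiteProfile, commercial_bulk: bool) -> str:
    """Apply the checks in order: API first, then ToS, then HTML stability."""
    if site.has_api:
        return "use the official API"
    if site.tos_forbids_scraping:
        return "do not scrape"
    if commercial_bulk and site.tos_forbids_commercial_bulk:
        return "do not scrape for this purpose"
    if site.html_stability == "low":
        return "scrape with caution (high maintenance cost)"
    return "scrape in small volumes"


# Profiles restating the table above (illustrative only).
YAHOO = SiteProfile("Yahoo! Finance", False, "high", False, tos_forbids_commercial_bulk=True)
EDINET = SiteProfile("EDINET", True, "high", False)
NIKKEI = SiteProfile("Nikkei Company Information", False, "medium", True)
```

For example, `recommend(YAHOO, commercial_bulk=False)` yields the small-volume recommendation, while the same site with `commercial_bulk=True` is rejected.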

A basic retrieval flow in Python

Required packages

pip install httpx beautifulsoup4 pandas sqlalchemy psycopg2-binary mysql-connector-python

1. Synchronous requests (httpx) and header spoofing

import httpx

# Browser-like request headers shared by every request in this article.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0 Safari/537.36"
    ),
    "Accept-Language": "ja-JP,ja;q=0.9",
}

def fetch_html(url: str) -> str:
    """Fetch a page and return its HTML; raises httpx.HTTPStatusError on 4xx/5xx."""
    with httpx.Client(headers=HEADERS, timeout=10.0) as client:
        resp = client.get(url)
        resp.raise_for_status()
        return resp.text

2. Extracting PBR and market capitalization with BeautifulSoup

import re
from typing import Optional

from bs4 import BeautifulSoup

def parse_metrics(html: str) -> dict:
    """Extract company name, PBR, and market capitalization from a quote page."""
    soup = BeautifulSoup(html, "html.parser")

    # Company name (the h1 class varies by page, so match it loosely with a regex)
    name_tag = soup.find("h1", {"class": re.compile(r".*D\(ib\).*")})
    company_name = name_tag.get_text(strip=True) if name_tag else ""

    # PBR: read the first td cell after the label (string= replaces the deprecated text=)
    pbr_label = soup.find(string=re.compile(r"PBR"))
    pbr_cell = pbr_label.find_next("td") if pbr_label else None
    pbr_text = pbr_cell.get_text(strip=True).replace(",", "") if pbr_cell else ""
    # Keep only the numeric part so a trailing unit does not break float()
    pbr_match = re.search(r"[0-9]+(?:\.[0-9]+)?", pbr_text)

    # Market capitalization (e.g. "2.34兆円" → 2_340_000_000_000)
    cap_label = soup.find(string=re.compile(r"時価総額"))
    cap_cell = cap_label.find_next("td") if cap_label else None
    raw_cap = cap_cell.get_text(strip=True) if cap_cell else None

    return {
        "company_name": company_name,
        "pbr": float(pbr_match.group()) if pbr_match else None,
        "market_cap": parse_market_cap(raw_cap) if raw_cap else None,
    }

def parse_market_cap(txt: str) -> Optional[int]:
    """Convert text such as "2.34兆円" to an integer number of yen, or None."""
    unit_map = {"億": 10**8, "兆": 10**12}
    m = re.match(r"([0-9.]+)([億兆])", txt.replace(",", ""))
    if not m:
        return None
    num, unit = m.groups()
    # round() avoids losing one yen to floating-point truncation
    return round(float(num) * unit_map[unit])
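
The unit conversion is the part of this parser most worth testing in isolation. The sketch below is a standalone variant, not the article's function: `parse_yen_amount` is a hypothetical name, and support for thousands separators, the 万 unit, and mixed units such as "2兆3,400億円" is an added assumption about how amounts might be displayed. It uses `Decimal` so that fractional inputs convert exactly.

```python
import re
from decimal import Decimal
from typing import Optional

# Multipliers for the Japanese number units 兆 (10^12), 億 (10^8), 万 (10^4).
UNIT_MAP = {"兆": 10**12, "億": 10**8, "万": 10**4}


def parse_yen_amount(text: str) -> Optional[int]:
    """Convert text such as "2.34兆円" or "2兆3,400億円" to an integer yen amount."""
    cleaned = text.replace(",", "").strip()
    parts = re.findall(r"([0-9]+(?:\.[0-9]+)?)([兆億万])", cleaned)
    if not parts:
        return None  # no recognizable number-unit pair
    return sum(int(Decimal(num) * UNIT_MAP[unit]) for num, unit in parts)


if __name__ == "__main__":
    for sample in ["2.34兆円", "2兆3,400億円", "1,234億円", "---"]:
        print(sample, parse_yen_amount(sample))
```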

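Key points
- httpx offers an API comparable to requests and extends naturally to asynchronous use.
- Always send a current browser User-Agent in the headers, and confirm that TLS (HTTPS) is in effect.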

Speeding up bulk retrieval with asynchronous processing

import asyncio
import logging

import httpx

# HEADERS and parse_metrics are the definitions from the previous section.

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
# Cap on concurrent connections, to limit load on the site.
# Creating the semaphore at module level assumes Python 3.10 or later.
SEMAPHORE = asyncio.Semaphore(10)

async def fetch_one(client: httpx.AsyncClient, code: str) -> dict:
    # NOTE: parse_metrics looks for Japanese labels (PBR, 時価総額); confirm that
    # this URL serves a page in that layout before relying on the result.
    url = f"https://finance.yahoo.com/quote/{code}"
    async with SEMAPHORE:
        for attempt in range(3):
            try:
                # the client already carries HEADERS, so they are not repeated here
                resp = await client.get(url, timeout=10.0)
                resp.raise_for_status()
                break
            except (httpx.RequestError, httpx.HTTPStatusError) as e:
                wait = 2 ** attempt   # exponential backoff: 1, 2, 4 seconds
                logging.warning(f"{code} fetch error: {e}; retry in {wait}s")
                await asyncio.sleep(wait)
        else:
            # the loop ended without break, so every attempt failed
            logging.error(f"{code} all retries failed")
            return {"code": code, "error": "fetch_failed"}

    data = parse_metrics(resp.text)
    data["code"] = code
    return data

async def main(codes: list[str]) -> list[dict]:
    async with httpx.AsyncClient(headers=HEADERS) as client:
        tasks = [fetch_one(client, c) for c in codes]
        return await asyncio.gather(*tasks)

if __name__ == "__main__":
    target_codes = ["7203.T", "6758.T", "9984.T", "7974.T"]
    results = asyncio.run(main(target_codes))
    for r in results:
        print(r)

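Implementation highlights

Item	Details
Rate-limit handling	`SEMAPHORE` caps the number of concurrent connections; exponential backoff retries automatically after 429/503 errors.
Log output	Standard `logging`, which can be forwarded to CloudWatch or ELK.
Exception handling	Retries up to 3 times; a failure is reported explicitly through the `error` field of the result.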
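
The concurrency cap and the backoff can be exercised without touching the network. The sketch below is a simplified variant rather than the article's exact code: `fetch_with_retry` is a hypothetical helper, the fetcher is a stub that fails twice, the semaphore is released while sleeping, and a small random jitter is added to the delay. All of these are assumptions chosen for illustration.

```python
import asyncio
import random


async def fetch_with_retry(fetch, code, *, semaphore, retries=3, base_delay=1.0):
    """Call fetch(code) under a concurrency cap, retrying with exponential backoff."""
    for attempt in range(retries):
        try:
            async with semaphore:  # hold a slot only while the request is in flight
                return await fetch(code)
        except ConnectionError:
            if attempt == retries - 1:
                break  # out of attempts
            # base_delay, 2x, 4x, ... plus jitter so retries do not synchronize
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay / 10)
            await asyncio.sleep(delay)
    return {"code": code, "error": "fetch_failed"}


async def demo() -> dict:
    calls = {"n": 0}

    async def flaky(code):
        # Stub standing in for an HTTP call: fails twice, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise ConnectionError("simulated 503")
        return {"code": code, "pbr": 1.2}

    semaphore = asyncio.Semaphore(2)  # created inside the running event loop
    return await fetch_with_retry(flaky, "7203.T", semaphore=semaphore, base_delay=0.01)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Releasing the semaphore during the sleep lets other tickers proceed while one is backing off, at the cost of slightly more total traffic.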

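Data cleansing, persistence, and operations

1. Type normalization with pandas

import pandas as pd

raw_df = pd.DataFrame(results)          # list of results returned by fetch_one
clean_df = raw_df.assign(
    # coerce unparseable values to NaN so the rows can be dropped below
    pbr=pd.to_numeric(raw_df["pbr"], errors="coerce"),
    # Int64 is the nullable integer dtype, so missing values survive the cast
    market_cap=pd.to_numeric(raw_df["market_cap"], errors="coerce").astype("Int64"),
).dropna(subset=["pbr", "market_cap"])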

2. Persisting to CSV and a database

# ---- CSV output (UTF-8 with BOM) ----
from pathlib import Path
Path("output").mkdir(exist_ok=True)     # to_csv does not create the directory
clean_df.to_csv("output/pbr_marketcap_20260428.csv",
                index=False, encoding="utf-8-sig")

# ---- PostgreSQL (SQLAlchemy) ----
from sqlalchemy import create_engine
engine = create_engine(
    "postgresql+psycopg2://fin_user:****@pg-host:5432/finance_db"
)
# if_exists="replace" drops and recreates the table on every run
clean_df.to_sql("equity_metrics", con=engine,
                if_exists="replace", index=False)

# ---- MySQL (mysql-connector) ----
import mysql.connector
cnx = mysql.connector.connect(
    host="my-host",
    user="fin_user",
    password="****",
    database="finance_db"
)
cursor = cnx.cursor()
# Upsert: relies on a primary or unique key on the table (assumed to be `code`)
upsert_sql = """
INSERT INTO equity_metrics (code, name, pbr, market_cap)
VALUES (%s,%s,%s,%s)
ON DUPLICATE KEY UPDATE
  name=VALUES(name), pbr=VALUES(pbr), market_cap=VALUES(market_cap);
"""
for _, row in clean_df.iterrows():
    # convert pandas/NumPy scalars to plain Python types for the driver
    cursor.execute(upsert_sql,
                   (row["code"], row["company_name"],
                    float(row["pbr"]), int(row["market_cap"])))
cnx.commit()
cursor.close()
cnx.close()
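
The upsert pattern can be tried locally before a database server is available. The sketch below swaps in the standard library's `sqlite3` and SQLite's `ON CONFLICT ... DO UPDATE` syntax (available in SQLite 3.24 and later) in place of MySQL's `ON DUPLICATE KEY UPDATE`. The `upsert_metrics` helper and the sample row are hypothetical, and the table schema is an assumption inferred from the column names above.

```python
import sqlite3


def upsert_metrics(conn: sqlite3.Connection, rows: list) -> None:
    """Insert rows, or update the existing row when the code already exists."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS equity_metrics (
               code TEXT PRIMARY KEY,
               name TEXT,
               pbr REAL,
               market_cap INTEGER
           )"""
    )
    conn.executemany(
        """INSERT INTO equity_metrics (code, name, pbr, market_cap)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(code) DO UPDATE SET
               name = excluded.name,
               pbr = excluded.pbr,
               market_cap = excluded.market_cap""",
        rows,
    )
    conn.commit()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    upsert_metrics(conn, [("0000.T", "Example Corp", 1.1, 100)])  # made-up row
    upsert_metrics(conn, [("0000.T", "Example Corp", 1.3, 120)])  # same key: update
    print(conn.execute("SELECT * FROM equity_metrics").fetchall())
    conn.close()
```

Running the same load twice leaves one row per code, which is the property the scheduled job depends on.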

3. Scheduled execution and monitoring

Cron (Linux)

30 2 * * * /usr/bin/python3 /opt/fintech/scripts/fetch_pbr_async.py >> /var/log/finance/pbr_fetch.log 2>&1

Windows Task Scheduler (PowerShell)

$action = New-ScheduledTaskAction -Execute "python.exe" `
          -Argument "C:\scripts\fetch_pbr_async.py"
$trigger = New-ScheduledTaskTrigger -Daily -At 03:00AM
Register-ScheduledTask -TaskName "FinancePBRFetch" `
    -Action $action -Trigger $trigger -User "SYSTEM"

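Monitoring example (Prometheus + Alertmanager)

Metric	Description	Alert condition
`scrape_success_total`	Number of successful retrievals	Success rate over 1 hour < 95% → Slack notification
`scrape_error_total`	Number of errors	Error rate > 5% (over 30 minutes) → PagerDuty page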
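
The two alert conditions reduce to simple ratio checks over counter deltas. The sketch below evaluates them in plain Python to make the thresholds concrete. In a real deployment these rules would live in Prometheus and Alertmanager configuration; the function names and the way the windowed counts are passed in are assumptions.

```python
from typing import List, Optional, Tuple


def rates(success: int, error: int) -> Optional[Tuple[float, float]]:
    """Return (success_rate, error_rate) for a window, or None if it is empty."""
    total = success + error
    if total == 0:
        return None
    return success / total, error / total


def fired_alerts(hourly: Tuple[int, int], half_hourly: Tuple[int, int]) -> List[str]:
    """Evaluate both conditions; each argument is (success_count, error_count)."""
    alerts = []
    hour = rates(*hourly)
    if hour is not None and hour[0] < 0.95:  # success rate over 1 hour below 95%
        alerts.append("slack")
    half = rates(*half_hourly)
    if half is not None and half[1] > 0.05:  # error rate over 30 minutes above 5%
        alerts.append("pagerduty")
    return alerts
```

Empty windows are treated as "no data" rather than as a failure, so a quiet period does not page anyone.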

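Incident response flow
1. Identify the error message in the logs.
2. If rate limiting is the cause, adjust the SEMAPHORE value.
3. If a ToS change is detected, update the list of target sites (see the next section).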

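Periodically refreshing robots.txt

Step	Details	Frequency
1. Fetch script	`curl -s https://finance.yahoo.com/robots.txt > data/robots_yahoo_$(date +%Y%m%d).txt`	Daily at 03:00 (cron)
2. Diff check	`diff -q prev.txt latest.txt && echo "unchanged" || echo "changed"`	Same as above
3. Change notification	If anything changed, post automatically to the Slack channel `#fintech-ops`.	Same as above
4. Update the approval flow	Record the changes on the Confluence page and ask the legal department to re-review.	Only on change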

Note: embedding the retrieval date in the file name makes the history easy to manage (e.g. robots_yahoo_20260428.txt).
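
The diff check can also produce the text that goes into the change notification. The sketch below compares two robots.txt snapshots with the standard library's `difflib` and returns only the added and removed lines. `summarize_changes` is a hypothetical helper, the file labels mirror the `prev.txt` and `latest.txt` names used above, and the sample contents are made up.

```python
import difflib
from typing import List


def summarize_changes(previous: str, latest: str) -> List[str]:
    """Return added ("+...") and removed ("-...") lines between two snapshots."""
    diff = list(
        difflib.unified_diff(
            previous.splitlines(),
            latest.splitlines(),
            fromfile="prev.txt",
            tofile="latest.txt",
            lineterm="",
        )
    )
    body = diff[2:]  # skip the two file-header lines ("---" and "+++")
    return [line for line in body if line.startswith(("+", "-"))]


if __name__ == "__main__":
    prev = "User-agent: *\nDisallow: /private/\n"
    new = "User-agent: *\nDisallow: /private/\nDisallow: /quote/\n"
    for line in summarize_changes(prev, new):
        print(line)
```

An empty list means nothing changed, so the notification step can be skipped for that day.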

Sample code and how to join the community

GitHub repository

Item	Details
URL	https://github.com/FinTechLabs/finance-scrape-pbr (latest as of 2026-04-28)
Directory layout	`/scripts` (main asynchronous retrieval), `/utils` (shared logging and retry modules), `/config` (YAML configuration files), `requirements.txt` (required packages), `README.md` (setup steps and usage examples)
Setup	Run `git clone https://github.com/FinTechLabs/finance-scrape-pbr.git`, then `cd finance-scrape-pbr`, then `python -m venv .venv && source .venv/bin/activate`, then `pip install -r requirements.txt`.
Contribution guide	Always run `make lint` and `make test` before opening a pull request. See `CONTRIBUTING.md` for details.

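Newsletter

Sign-up form: enter your email address via the "FinTech Labs mail magazine" button at the bottom of the article.
Content: on the second Wednesday of each month, a "Python × financial data retrieval" feature, the latest legal amendments, and best practices. Back issues are available on the archive page.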

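📌 Summary

Always check the legal basis (the Act on Prohibition of Unauthorized Computer Access, the Copyright Act, and so on) together with the terms of use, and keep a record through internal review.
Standardizing the checklist and the approval flow makes risks visible before a project starts.
Select target sites on three factors: API availability, HTML stability, and ToS restrictions. Yahoo! Finance suits small-scale retrieval, and EDINET suits large volumes of financial data.
The Python implementation uses httpx + BeautifulSoup and can be extended in either synchronous or asynchronous form.
Automating the data pipeline end to end (cleansing → persistence → scheduling) and monitoring it with Prometheus can substantially reduce the operational load.
Automate the periodic retrieval of robots.txt and the change notification so that the team can respond promptly to rule changes.

FinTech Labs takes "safe and fast financial data retrieval" as its mission and provides up-to-date legal compliance guidance and technical best practices.

This article was prepared in line with the FinTech Labs brand guidelines (logo, color scheme, tone).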
