AI Search Evaluation ⑩ Citation Consistency


10-citation-consistency.md · updated 2026-07-29 · 3056 words


Applied Sciences · Engineering

EN

Citation Consistency measures whether an AI search engine cites the same domains for the same query from one week to the next, using Jaccard similarity (the overlap ratio of two sets). It captures how volatile the cited sources are and indicates how many observation weeks are needed to verify an intervention's effect.

Definition & Formula

Jaccard similarity of cited domains: |Domains_week(N) ∩ Domains_week(N+1)| ÷ |Domains_week(N) ∪ Domains_week(N+1)|, where Domains_week(N) is the set of domains cited for a query in week N.
Computed per query, then averaged; per-engine × query breakdowns are also examined.
Temporal extension: build a weekly Jaccard time series and inspect its variance.
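
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the input layout (one dict per week mapping a query ID to its set of cited domains), the function names, the sample domains, and the convention that two empty sets score 1.0 are all assumptions made for the example.

```python
from statistics import mean, pvariance


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two domain sets.

    Returns 1.0 when both sets are empty (an assumed convention:
    "nothing cited" twice in a row is treated as perfectly consistent).
    """
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def weekly_consistency(weeks: list[dict[str, set[str]]]) -> list[float]:
    """One score per consecutive week pair: the mean per-query Jaccard.

    Only queries observed in both weeks are compared; a week pair with
    no shared queries would raise statistics.StatisticsError.
    """
    series = []
    for prev, curr in zip(weeks, weeks[1:]):
        shared = sorted(prev.keys() & curr.keys())
        series.append(mean(jaccard(prev[q], curr[q]) for q in shared))
    return series


# Hypothetical citations for two queries over three weeks.
weeks = [
    {"q1": {"a.com", "b.com", "c.com"}, "q2": {"x.org", "y.org"}},
    {"q1": {"a.com", "b.com", "d.com"}, "q2": {"x.org", "y.org"}},
    {"q1": {"a.com", "e.com", "f.com"}, "q2": {"x.org", "z.org"}},
]

series = weekly_consistency(weeks)   # weekly Jaccard time series
spread = pvariance(series)           # variance of that series
print(series, spread)
```

Averaging per query first (rather than pooling all domains) keeps a single query with many citations from dominating the score; whether that weighting is right for a given study is a design choice.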

Why It Matters

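Business: high consistency → a stable signal, similar to a classic SERP (search engine results page), so intervention effects are easy to measure.
Low consistency → high noise; changes in cited domains are harder to attribute to the interventions actually taken.
Academic grounding: analogous to test-retest reliability (Cohen 1988); Krippendorff's α, an inter-rater agreement coefficient, can supplement it.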
Helps judge whether an engine's retrieval policy is a stable trait or a shifting state.

Worked Example (hypothetical, 77 factual queries × 2 weeks)

ChatGPT Search: 0.78 — highly stable.
Copilot: 0.71 · Gemini: 0.65.
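AI Overview: 0.42, the least stable. Only 42% of the domains cited across the two weeks appear in both (a Jaccard distance of 0.58); if both weeks cite the same number of domains, roughly 41% of a week's domains are replaced the following week.
With this much week-to-week variance, verifying a client intervention on AI Overview takes roughly 4 consecutive weeks of data so that the noise averages out.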
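
To read these scores as turnover, it helps to convert Jaccard similarity into the share of one week's domains that survive into the next. The sketch below assumes both weeks cite the same number of domains n with k in common, so J = k / (2n − k) and therefore k / n = 2J / (1 + J). The function name is hypothetical, and the scores are the hypothetical ones from the example above.

```python
def retained_share(j: float) -> float:
    """Share of one week's cited domains that reappear the next week.

    Assumes both weeks cite the same number of domains; with unequal
    set sizes the conversion no longer holds exactly.
    """
    if not 0.0 <= j <= 1.0:
        raise ValueError("Jaccard similarity must lie in [0, 1]")
    return 2 * j / (1 + j)


# Hypothetical scores from the worked example.
scores = {
    "ChatGPT Search": 0.78,
    "Copilot": 0.71,
    "Gemini": 0.65,
    "AI Overview": 0.42,
}

for engine, j in scores.items():
    kept = retained_share(j)
    print(f"{engine}: J={j:.2f} retained={kept:.0%} replaced={1 - kept:.0%}")
```

Under this assumption a score of 0.42 corresponds to about 59% of domains retained and about 41% replaced per week, whereas 1 − J = 0.58 is the share of the combined two-week domain pool that is not common to both weeks.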

Use in the AI-Search Project

Applied to all queries in daily longitudinal tracking, rolled up weekly.
Aggregated by engine × intent category × week; visualised as a volatility trend chart.
Represents the temporal-stability aspect of C2 Source Trustworthiness Bias.
Also used to distinguish state vs. trait in engine retrieval behaviour.
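
The roll-up described above could look roughly like the following. It is only a sketch under stated assumptions: the record layout (day, engine, intent, query, domain), the function name, and the sample rows are hypothetical, and the rule that a week's domain set is the union of that week's daily citations is an assumption, since the note does not specify how daily data is rolled up.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

# One citation observation: (day, engine, intent, query, cited_domain).
Record = tuple[date, str, str, str, str]


def consistency_by_cell(records: list[Record]) -> dict[tuple[str, str, date], float]:
    """Mean week-over-week Jaccard per (engine, intent, week_start) cell."""
    # Step 1: union daily citations into one domain set per query and week,
    # keyed by the Monday that starts the week.
    weekly: dict[tuple[str, str, str, date], set[str]] = defaultdict(set)
    for day, engine, intent, query, domain in records:
        week_start = day - timedelta(days=day.weekday())
        weekly[(engine, intent, query, week_start)].add(domain)

    # Step 2: compare each query-week with the following calendar week.
    cells: dict[tuple[str, str, date], list[float]] = defaultdict(list)
    for (engine, intent, query, week), domains in weekly.items():
        following = weekly.get((engine, intent, query, week + timedelta(days=7)))
        if following is None:
            continue  # no observation next week, so nothing to compare
        score = len(domains & following) / len(domains | following)
        cells[(engine, intent, week)].append(score)

    # Step 3: average the per-query scores inside each cell.
    return {cell: mean(scores) for cell, scores in cells.items()}


# Hypothetical daily tracking rows for one query across two weeks.
records = [
    (date(2026, 7, 6), "AI Overview", "factual", "q1", "a.com"),
    (date(2026, 7, 8), "AI Overview", "factual", "q1", "b.com"),
    (date(2026, 7, 13), "AI Overview", "factual", "q1", "a.com"),
    (date(2026, 7, 15), "AI Overview", "factual", "q1", "c.com"),
]

result = consistency_by_cell(records)
print(result)
```

A query that returns no citations at all in a week produces no rows here and is silently skipped; a real pipeline would need an explicit "queried but nothing cited" record to score those weeks.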

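→ A score of 0.42 means only 42% of the domains cited across two consecutive weeks are shared (roughly 41% weekly turnover if both weeks cite equally many domains), so you need 4+ weeks of data before an AI Overview intervention can be reliably evaluated.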


