Click any point in either field, or pick a feature cluster from the dropdown.
Inspect a DPO chosen/rejected dataset through the lens of a base model's SAE features. Two UMAPs on the left show the data and the features; click anywhere to see detail on the right.
#N in a panel → drills into that cluster.
Each feature cluster carries a badge summarizing whether it tends to fire on the chosen side, the rejected side, both, or neither, across the prompts that activate it strongly. The categories use a 95% confidence interval so small samples don't get over-claimed.
N is the point estimate. Hover any badge for full counts and confidence bounds.
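As an illustration of how a 95% interval keeps small samples from being over-claimed, here is a minimal sketch. It is not the tool's actual badge logic: the Wilson score interval, the 0.5 threshold, the helper names `wilson_interval` and `lean_badge`, and the reduction to three labels (the real badges distinguish chosen, rejected, both, and neither) are all assumptions made for the example.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% when z = 1.96)."""
    if n == 0:
        return (0.0, 1.0)  # no data: the interval is uninformative
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (max(0.0, centre - half), min(1.0, centre + half))

def lean_badge(chosen_wins: int, n: int) -> str:
    """Hypothetical badge: label a lean only if the whole interval clears 0.5.

    chosen_wins counts prompts where the chosen side fires harder, out of n.
    """
    lo, hi = wilson_interval(chosen_wins, n)
    if lo > 0.5:
        return "chosen"
    if hi < 0.5:
        return "rejected"
    return "inconclusive"

print(lean_badge(3, 4))     # same 75% point estimate, few prompts
print(lean_badge(75, 100))  # same 75% point estimate, many prompts
```

Both calls share a 75% point estimate, but only the larger sample has an interval that excludes 0.5, which is the behaviour the badge description asks for.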
Δ: signed chosen − rejected disparity. Positive: chosen side fires the feature(s) harder. Negative: rejected side fires harder.
Δ_k: a topic's overall chosen-vs-rejected lean across all feature clusters.
Δ̄: average Δ for a feature cluster across a topic's prompts.
d: Δ rescaled by its standard deviation. Scale-invariant; values around 0.2 / 0.5 / 0.8 are small / medium / large effects.
z: Δ scaled by its standard error. Magnitude indicates how far a topic's lean stands above noise; useful for ranking, not as a literal p-value (pairs are coupled by construction, so absolute z is inflated).
✓ / ✗: robustness check. The dataset is split into two halves and Δ is recomputed on each. ✓ means the sign agrees between halves; ✗ means it doesn't (likely noise).
This is a predictive tool: it looks for chosen-vs-rejected signal that already exists in the base model's representations, before any preference training has happened. Strong signal here flags which topics × feature clusters DPO's gradient is most likely to move.
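The quantities defined above can be sketched for a single (topic, feature cluster) cell as follows. This is an illustrative reading, not the tool's code: it assumes one activation value per prompt for each side, takes the standard deviation of the per-pair differences as the scale for d, and uses a random half split. The function name `disparity_stats` and the `seed` parameter are hypothetical.

```python
import numpy as np

def disparity_stats(chosen, rejected, seed: int = 0) -> dict:
    """Paired disparity summary for one cell (assumed definitions, see lead-in)."""
    chosen = np.asarray(chosen, dtype=float)
    rejected = np.asarray(rejected, dtype=float)
    diff = chosen - rejected            # per-prompt chosen − rejected
    n = diff.size
    if n < 4:
        raise ValueError("need at least 4 pairs for a two-half check")

    delta = diff.mean()                 # Δ: signed mean disparity
    sd = diff.std(ddof=1)               # sample SD of the differences
    d = delta / sd if sd > 0 else 0.0   # d: Δ in SD units
    z = delta / (sd / np.sqrt(n)) if sd > 0 else 0.0  # z: Δ in SE units

    # Robustness check: recompute Δ on two random halves and compare signs.
    perm = np.random.default_rng(seed).permutation(n)
    mean_a = diff[perm[: n // 2]].mean()
    mean_b = diff[perm[n // 2:]].mean()
    sign_agrees = bool(mean_a * mean_b > 0)

    return {"delta": float(delta), "d": float(d), "z": float(z),
            "sign_agrees": sign_agrees}

stats = disparity_stats([1.0, 2.0, 3.0, 4.0], [0.5, 1.0, 2.0, 3.5])
print(stats)
```

Under these assumptions z equals d multiplied by √n, so z keeps growing with the number of pairs while d does not. That is one reason to rank by z but read effect size from d.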
The strongest predictive signals in the dataset. Each row is one (topic, feature cluster) pair where the chosen and rejected responses fire that feature cluster systematically differently than they do elsewhere. This is the kind of asymmetry, already present before preference training, that DPO is most likely to amplify.
Default filters keep only signals that survive a two-half robustness check, with minimum group sizes that make the estimate well-defined and stable. Loosen them to inspect raw rankings, but expect more noise.
The strongest predictive signals in the dataset. Each row is one (data cluster, feature cluster) pair where the chosen and rejected responses fire that feature cluster systematically differently than they do elsewhere. This is the kind of asymmetry, already present before preference training, that DPO is most likely to amplify. By default, rows are ranked by the (k, m)-specific portion of the disparity, where k indexes the data cluster and m the feature cluster, so a few highly polarized clusters don't dominate the list.
The default interaction ranking subtracts off "this data cluster has big disparity on most features" and "this feature cluster has big disparity in most data clusters" — leaving the part that's specific to the (k, m) combination. Switch the ranking metric below to see raw |Δ| if you want the flat headline numbers instead.
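The subtraction described above resembles double-centering a matrix of Δ values. The sketch below assumes a dense array `delta[k, m]` with data clusters on rows and feature clusters on columns, and adds the grand mean back so it is not subtracted twice. Whether the tool handles the grand mean this way is an assumption, and the function name `interaction` is hypothetical.

```python
import numpy as np

def interaction(delta) -> np.ndarray:
    """Remove row and column main effects from a (data cluster × feature cluster) Δ matrix."""
    delta = np.asarray(delta, dtype=float)
    row = delta.mean(axis=1, keepdims=True)   # "this data cluster is polarized on most features"
    col = delta.mean(axis=0, keepdims=True)   # "this feature cluster is polarized in most data clusters"
    grand = delta.mean()                      # added back: it is contained in both row and col
    return delta - row - col + grand

delta = np.array([[1.0, 2.0],
                  [3.0, 8.0]])
inter = interaction(delta)

# Rank (k, m) cells by the magnitude of their cell-specific residual.
order = np.argsort(-np.abs(inter), axis=None)
ranked = [tuple(int(i) for i in np.unravel_index(j, inter.shape)) for j in order]
print(inter)
print(ranked)
```

A matrix that is purely additive in its row and column effects yields an all-zero interaction, so only disparity that neither the data cluster nor the feature cluster explains on its own survives into the ranking.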
Inspect a DPO chosen/rejected dataset through the lens of a base model's SAE features. Two UMAPs on the left show the feature space and the data clusters; click anywhere to see detail on the right.
#N in a panel → drills into that cluster.
Each feature cluster carries a badge summarizing whether it tends to fire on the chosen side, the rejected side, both, or neither, across the prompts that activate it strongly. The categories use a 95% confidence interval so small samples don't get over-claimed.
N is the point estimate. Hover any badge for full counts and confidence bounds.
Δ: signed chosen − rejected disparity. Positive: chosen side fires the feature(s) harder. Negative: rejected side fires harder.
Δ̄: average Δ for a feature cluster across a data cluster's prompts.
d: Δ rescaled by its standard deviation. Scale-invariant; values around 0.2 / 0.5 / 0.8 are small / medium / large effects.
interaction: the (k, m)-specific portion of the disparity, after subtracting the row and column averages. Useful when raw |Δ| is dominated by clusters polarized everywhere.
✓ / ✗: robustness check. The dataset is split into two halves and Δ is recomputed on each. ✓ means the sign agrees between halves; ✗ means it doesn't (likely noise).
This is a predictive tool: it looks for chosen-vs-rejected signal that already exists in the base model's representations, before any preference training has happened. Strong signal here flags which feature clusters DPO's gradient is most likely to move in which data clusters.