RNA-velocity benchmark — Ancheta et al. (PLoS Comput Biol 2026)

Ancheta, Dorman, Le Treut, Gurung, Huber, Royer, Granados & Lange. PLOS Computational Biology 22(6):e1014303 (2026-06-01). Chan Zuckerberg Biohub SF (Ancheta & Huber also UCSF; Lange now Institut de la Vision, Paris; Granados now Calico). This is the “UCSF benchmark” bookmarked in velocity-discourse-2025-2026 (RNA-Seq Blog gloss), now ingested — a second, independent reliability benchmark alongside Luo et al. (velocity-benchmark-17studies). The two converge: no method is uniformly best; treat velocity as hypothesis-generating.

Summary

A focused comparison of five RNA-velocity methods across three developmental datasets of increasing complexity, scoring not “which is right” but how much the methods agree, how self-consistent each field is, how stable the driver-gene rankings are, and how robust each is to shallow sequencing. The headline is sobering: agreement is dataset-dependent and often low, driver-gene rankings are method-sensitive (only Pdx1 survives a five-method intersection on the β-lineage), and “none of the methods perform highly in the three evaluated parameters.” The authors’ explicit stance: RNA velocity is a hypothesis-generation tool whose predictions must be experimentally validated; cross-method concordance is a conservative robustness check, not evidence of causality. Notably, the benchmark is purely directional: it never touches absolute time, kinetic-rate scale, or labeling, which is itself a signal for the wiki’s physical-time-grounding lens.

Methods & datasets

  • Five methods: velocyto (steady-state), scVelo stochastic (scv-Sto), scVelo dynamical (scv-Dyn), UniTVelo (unified latent time, top-down RBF), DeepVelo (Chen 2022 neural-ODE — note the DeepVelo name-collision caveat).
  • Three datasets, rising complexity: mouse pancreatic development (3,696 cells); zebrafish neuro-mesodermal progenitors (ZF NMP, 16,035); zebrafish embryo 24 hpf (12,914).

Key Claims

  • Local Consistency (LC): mean cosine similarity between a cell’s velocity and those of its 30 nearest neighbors. UniTVelo highest (≈0.8–1.0 across datasets); scv-Dyn consistently lowest; DeepVelo / scv-Sto intermediate. System complexity dominates: in pancreas 92% of cells reach consensus LC > 0.5, in ZF embryo 74%, but in ZF NMP only 39%. The harder the biology, the less self-consistent every method gets.
  • Method agreement (A1 pairwise, A2 vs the five-method median vector) is strongly dataset-dependent and frequently low. DeepVelo A2 = 0.894 (pancreas) vs 0.667 (ZF NMP); UniTVelo A2 = 0.848 (pancreas) vs 0.305 (ZF NMP), 0.392 (ZF embryo); velocyto shows the inverse pattern — poor on pancreas (0.432) but best on the zebrafish sets (0.908–0.922). No method is universally concordant.
  • Driver-gene rankings are method-sensitive. Taking each method’s top-100 genes, only one gene (Pdx1) sits in the intersection of all five for the β-lineage terminal state. Methods split into two camps: {DeepVelo, scv-Sto, velocyto} agree on 79%; {scv-Dyn, UniTVelo} agree on 55%. “RNA-velocity-based driver-gene rankings … should be treated as hypothesis-generating.”
  • Robustness to sequencing depth (read subsampling 2%→98%). velocyto most robust (~0.85 cosine similarity already at 5% reads); scv-Sto / scv-Dyn / UniTVelo plateau ~0.7; DeepVelo least robust (~0.5). The simplest steady-state model degrades most gracefully.
  • No winner declared. “This paper does not aim to claim which method is superior; instead, it seeks to equip scientists with guidance.” And the load-bearing caution: “None of the methods perform highly in the three evaluated parameters … we always recommend validating RNA velocity predictions.”
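
The LC and A2 scores in the claims above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`cosine_rows`, `local_consistency`, `agreement_to_median`), the brute-force neighbor search, and the component-wise median are all choices made here for illustration; the paper's exact neighbor graph and any vector normalization are not specified in this note.

```python
import numpy as np


def cosine_rows(a, b, eps=1e-12):
    """Row-wise cosine similarity between two (n_cells, n_dims) arrays."""
    num = np.sum(a * b, axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return num / np.maximum(den, eps)


def local_consistency(velocity, embedding, k=30):
    """Per-cell mean cosine similarity between a cell's velocity and the
    velocities of its k nearest neighbors in `embedding` (assumed space)."""
    # Brute-force squared distances: fine for a sketch, too slow for large data.
    sq = np.sum(embedding ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(d2, np.inf)               # a cell is not its own neighbor
    nbrs = np.argsort(d2, axis=1)[:, :k]       # (n_cells, k) neighbor indices
    norms = np.linalg.norm(velocity, axis=1, keepdims=True)
    unit = velocity / np.maximum(norms, 1e-12)
    # Dot products of unit vectors are cosine similarities.
    sims = np.einsum("id,ikd->ik", unit, unit[nbrs])
    return sims.mean(axis=1)


def agreement_to_median(fields):
    """A2-style score: mean cosine similarity of each method's field to the
    per-cell, component-wise median across methods (median choice assumed)."""
    stack = np.stack(list(fields.values()))    # (n_methods, n_cells, n_dims)
    median = np.median(stack, axis=0)
    return {name: float(cosine_rows(v, median).mean())
            for name, v in fields.items()}


# Synthetic demo: two coherent fields and one random field (toy data only).
rng = np.random.default_rng(0)
embedding = rng.normal(size=(200, 2))
flow = np.tile([1.0, 0.0], (200, 1))
fields = {
    "coherent_a": flow + 0.1 * rng.normal(size=(200, 2)),
    "coherent_b": flow + 0.1 * rng.normal(size=(200, 2)),
    "random": rng.normal(size=(200, 2)),
}
for name, v in fields.items():
    print(name, "mean LC:", local_consistency(v, embedding).mean())
print("A2:", agreement_to_median(fields))
```

A pairwise A1-style score would follow the same pattern, calling `cosine_rows` on two methods' fields directly instead of comparing each to the median.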

Physical-time grounding (standing lens)

A benchmark, not a method — but scored on the four axes, the absences are the point:

  1. Latent time — ordinal or metric? Not addressed. The benchmark evaluates only directionality (cosine-similarity-based LC / agreement / transition probabilities); it never asks whether any method’s time is ordinal vs metric. The whole exercise lives at the direction level — exactly the level JianhuaXing and GennadyGorin argue is being over-read as dynamics.
  2. Scale degeneracy. Not addressed / sidestepped. Magnitude enters only via the depth-robustness “magnitude stability” check; absolute rate scale and its non-identifiability are never examined. Comparing on cosine similarity deliberately discards magnitude, so the degeneracy is invisible by construction.
  3. External time anchor. None. No metabolic-labeling, no real-time series, no absolute-rate measurement enters the evaluation — consistent with all five methods being snapshot-only.
  4. Constant-rate assumptions. Not analyzed. Rate constancy isn’t a benchmarked axis; the methods’ differing α/β/γ treatments are folded into “method choice,” whose downstream effect the paper measures empirically (low agreement) without diagnosing the kinetic cause.
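
The scale-degeneracy point in item 2 is easy to check numerically: cosine similarity is unchanged when a vector is multiplied by any positive constant, so two fields that differ only in speed score as identical. The snippet below is a toy sketch with an arbitrary made-up vector; the helper name `cosine` is an assumption for illustration.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


v = np.array([0.3, -1.2, 0.5])   # arbitrary toy velocity vector
for scale in (0.01, 1.0, 100.0):
    # Same direction, very different magnitude: the score stays at 1
    # up to floating-point error, so rate scale is invisible to the metric.
    print(scale, cosine(v, scale * v))
```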

The benchmark says that velocity is unstable (low cross-method agreement, complexity-sensitive, method-dependent drivers); the wiki’s lens and the skeptics (velocity-skepticism) say why — the temporal/identifiability axis is under-constrained from snapshot data. This paper is the empirical complement to the theoretical critique. That it scores everything on direction, never on physical time, is itself evidence the field’s evaluation culture has not internalized the physical-time axis FlowVelo targets.

Key Quotes

“None of the methods perform highly in the three evaluated parameters … we always recommend validating RNA velocity predictions.” — Conclusions.

“RNA-velocity-based driver-gene rankings are method-sensitive and should be treated as hypothesis-generating. Cross-method concordance can serve as a conservative robustness check to prioritize candidates for follow-up, but does not establish causality.” — Discussion.
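
The concordance check described in this quote can be sketched as a top-k intersection across methods. This is a hedged illustration only: the function names, the `{gene: score}` input format, and the use of shared-gene fraction as the overlap measure are assumptions made here, and the method and gene labels in the example are placeholders, not results from the paper.

```python
def top_k(scores, k=100):
    """Set of the k highest-scoring genes from a {gene: score} mapping."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def consensus_genes(rankings, k=100):
    """Genes that appear in every method's top-k list."""
    tops = [top_k(scores, k) for scores in rankings.values()]
    return set.intersection(*tops)


def pairwise_overlap(rankings, k=100):
    """Fraction of top-k genes shared by each pair of methods
    (assumes every method scores at least k genes)."""
    names = list(rankings)
    tops = {name: top_k(rankings[name], k) for name in names}
    return {
        (a, b): len(tops[a] & tops[b]) / k
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Placeholder scores for two hypothetical methods (toy values only).
toy = {
    "method_1": {"geneA": 3.0, "geneB": 2.0, "geneC": 1.0},
    "method_2": {"geneA": 1.0, "geneB": 3.0, "geneD": 2.0},
}
print(consensus_genes(toy, k=2))
print(pairwise_overlap(toy, k=2))
```

Genes that survive `consensus_genes` are candidates to prioritize for follow-up, in line with the quote; surviving the intersection says nothing about causality.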

Connections

  • velocity-benchmark-17studies — the other (Luo et al.) reliability benchmark; same conclusion (no uniform winner) from a wider 14×17 sweep. Cite the two together as independent empirical corroboration.
  • velocity-skepticism — this is the empirical-reliability strand, now with two benchmarks.
  • velocyto — comes off best on robustness + zebrafish agreement (the simplest model ages well).
  • scVelo / DeepVelo — both benchmarked here (scVelo in two flavors; DeepVelo = Chen 2022).
  • physical-time-grounding / physical-time-grounding-across-methods — the lens this benchmark conspicuously does not apply (direction-only evaluation).
  • LiorPachter — his “gobbledygook” tweet amplified velocity-benchmark-17studies, not this one; but this paper’s measured tone is the better citation (honesty guardrail).
  • velocity-discourse-2025-2026 — bookmarked there as the “UCSF benchmark.”
  • FlowVelo — motivates evaluating on physical time, not just direction; and using cross-method concordance only as a conservative prior.

Contradictions

  • No factual conflict with velocity-benchmark-17studies; this benchmark reinforces it with an independent group and independent datasets. Minor framing nuance: where Luo et al. span 14 methods × 17 studies, Ancheta et al. go deep on 5 methods × 3 datasets with finer per-dataset agreement/robustness numbers. Together: breadth (Luo) + depth (Ancheta), same verdict.