Benchmarking RNA velocity methods across 17 independent studies

Luo, Ren, Yang, You, Zhou, Qin & Li. Cell Reports Methods 6:101367 (2026-04-20), Xiamen University. (Published version of the bioRxiv preprint that LiorPachter amplified as “gobbledygook”; full PDF now in raw/.)

Summary

A systematic benchmark of 15 RNA-velocity methods across 17 real datasets (5 human, 12 mouse) plus 3 simulated topologies (bifurcation, linear, cycle; via dyngen), scored on accuracy (CBDir, ICCoh, velocity consistency), method agreement (A1/A2), stability (cell downsampling, varying HVGs), and usability (runtime, memory). Headline: no method is best on all axes, direction accuracy is poor on average (mean CBDir ≈ 0.1), and methods disagree with each other (pairwise A1 mostly < 0.4), so the authors recommend running multiple methods and trusting their consensus. Crucially for this wiki, every metric scores direction or coherence, never physical time. Since even the weak claim (correct arrows) turns out to be shaky, any latent-time claim built on it is shakier still.
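The "run several methods and see where they agree" recommendation can be sketched as follows. This is only an illustration, not the paper's A1/A2 metrics, whose exact definitions are not reproduced in this note: it assumes agreement is the mean per-cell cosine similarity between two methods' velocity vectors in a shared embedding. The method names, toy vectors, and the 0.5 threshold are all hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def pairwise_agreement(fields):
    """fields maps method name -> list of per-cell velocity vectors (same cell order).
    Returns {(method_1, method_2): mean per-cell cosine similarity}.
    Assumed proxy for method agreement, not the paper's A1."""
    out = {}
    for m1, m2 in combinations(sorted(fields), 2):
        sims = [cosine(u, v) for u, v in zip(fields[m1], fields[m2])]
        out[(m1, m2)] = sum(sims) / len(sims)
    return out

def consensus_cells(fields, threshold=0.5):
    """Indices of cells whose mean cosine over all method pairs is >= threshold."""
    names = sorted(fields)
    n_cells = len(fields[names[0]])
    keep = []
    for c in range(n_cells):
        sims = [cosine(fields[a][c], fields[b][c]) for a, b in combinations(names, 2)]
        if sum(sims) / len(sims) >= threshold:
            keep.append(c)
    return keep

# Hypothetical toy input: three methods, four cells, 2-D velocity vectors.
fields = {
    "method_a": [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "method_b": [(1.0, 0.0), (0.0, 1.0), (0.0, 1.0), (-1.0, 0.0)],
    "method_c": [(1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, 1.0)],
}

agreement = pairwise_agreement(fields)
print(agreement[("method_a", "method_b")])  # 0.25
print(consensus_cells(fields))              # [0, 2]
```

In this toy case only cells 0 and 2 have arrows that all three methods agree on. Under the assumptions above, those are the cells whose direction one would be more willing to trust.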

Key Claims

  • Direction barely recovered. Mean CBDir ≈ 0.1 across methods (cross-boundary direction correctness; 1 = perfect, 0 = chance). Best: veloVI (0.23), Pyro-Velocity (0.17). VeloVAE’s directions were mostly reversed (negative CBDir). Accuracy declines with complexity: human bone marrow (multiple trajectories) had mean CBDir −0.193, and mature PBMC gave biologically wrong directions (e.g. CD8⁺ cytotoxic → naive T, backwards).
  • Methods disagree. Pairwise method-agreement A1 mostly < 0.4; LatentVelo and cell2fate disagree with nearly everything. Only a few groups of methods show some consensus: velocyto↔scVelo-sto, CellRank↔cellDancer, and DeepVelo↔veloVI↔Dynamo-sto.
  • In-cluster coherence is high (and easy). ICCoh ≥ 0.7 for most methods (LatentVelo 0.99, UniTVelo 0.96, MultiVelo 0.96); that is, fields look locally smooth even when globally wrong. Smoothness ≠ correctness.
  • Stability / usability. CBDir is highly sensitive to HVG choice (≈1,000 HVGs optimal); UniTVelo / LatentVelo are robust to downsampling, and veloVI is stable across HVG counts. GPU methods (DeepVelo, veloVI) are fast and low-memory; cell2fate / Pyro-Velocity are memory-hungry; cellDancer / MultiVelo are slow.
  • Scenario-based best practice (the deliverable).
    • Million-cell atlas → veloVI, DeepVelo, Dynamo-sto, scVelo-sto (scale + low memory).
    • Low-quality / sparse / low-depth → UniTVelo, LatentVelo, veloVI, Pyro-Velocity.
    • Complex / multi-branching lineage → DeepVelo, veloVI, LatentVelo.
    • Overall best-balanced accuracy: veloVI, Pyro-Velocity, DeepVelo.
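To make the CBDir / ICCoh contrast in the bullets above concrete, here is a minimal sketch. It assumes the commonly used cosine-based definitions: CBDir averages the cosine between a source-cluster cell's velocity and the displacement to its neighbours in the target cluster, and ICCoh averages the cosine between a cell's velocity and those of its same-cluster neighbours. The paper's exact implementation may differ. The function names, the brute-force neighbour search, `k=3`, and the toy data are all illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def knn(positions, i, k):
    """Indices of the k cells nearest to cell i (Euclidean, excluding i)."""
    others = [j for j in range(len(positions)) if j != i]
    others.sort(key=lambda j: math.dist(positions[i], positions[j]))
    return others[:k]

def cbdir(positions, velocities, labels, source, target, k=3):
    """Mean cosine between each source cell's velocity and the displacement
    to each of its k nearest neighbours that belongs to the target cluster."""
    scores = []
    for i, lab in enumerate(labels):
        if lab != source:
            continue
        for j in knn(positions, i, k):
            if labels[j] == target:
                disp = [b - a for a, b in zip(positions[i], positions[j])]
                scores.append(cosine(velocities[i], disp))
    return sum(scores) / len(scores) if scores else float("nan")

def iccoh(positions, velocities, labels, cluster, k=3):
    """Mean cosine between a cell's velocity and those of its k nearest
    neighbours that belong to the same cluster."""
    scores = []
    for i, lab in enumerate(labels):
        if lab != cluster:
            continue
        for j in knn(positions, i, k):
            if labels[j] == cluster:
                scores.append(cosine(velocities[i], velocities[j]))
    return sum(scores) / len(scores) if scores else float("nan")

# Toy lineage along x: cluster "A" (x = 0..3) differentiates into "B" (x = 4..7).
positions = [(float(x), 0.0) for x in range(8)]
labels = ["A"] * 4 + ["B"] * 4
forward = [(1.0, 0.0)] * 8     # arrows point A -> B (correct)
reversed_ = [(-1.0, 0.0)] * 8  # equally smooth, but pointing B -> A

print(cbdir(positions, forward, labels, "A", "B"), iccoh(positions, forward, labels, "A"))      # 1.0 1.0
print(cbdir(positions, reversed_, labels, "A", "B"), iccoh(positions, reversed_, labels, "A"))  # -1.0 1.0
```

The reversed field scores a perfect ICCoh of 1.0 while its CBDir is −1.0, which is the toy version of "smoothness ≠ correctness": coherence alone cannot tell a correct field from a backwards one.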

Physical-time grounding (standing lens)

A reliability benchmark, not a method — and a telling one. It scores only direction and coherence (CBDir, ICCoh, consistency, agreement); it never tests physical or even ordinal time. So it supplies the empirical leg of velocity-skepticism: if the direction of the field is near-chance on average and methods don’t agree, claims of metric time built on top are far less secure. Note the field-internal honesty: even the published abstract is measured (“limitations and challenges”, “best-practice”); the “gobbledygook” framing is LiorPachter’s gloss, not the authors’.

Figure

Benchmark Fig 2 — per-method accuracy

Fig 2 — accuracy across 15 methods × 17 datasets (Luo et al., Cell Rep Methods 2026; velocity-benchmark-17studies). (A) CBDir — note how many methods straddle 0 (chance) and several go negative (reversed direction); veloVI leads but only at ~0.23. (B) ICCoh — uniformly high (local smoothness is easy). (C) velocity consistency — moderate. The gap between high ICCoh and near-zero CBDir is the key reading: fields look coherent while pointing the wrong way.

Key Quotes

“no single method exhibited superior performance in all the assessments, and unexpected underperformance was observed in certain cases … the lack of uniformity in the inference results highlights the necessity to compare and control of multiple methods in a single analysis.” — Abstract.

Connections

Contradictions

  • None. It documents cross-method instability and near-chance direction accuracy — consistent with, and strengthening, the wiki’s skeptical framing. (Upgraded from abstract-level to full text; method count corrected 14 → 15, now Cell Reports Methods 2026.)