"Historical citations (PPO Schulman 1707.06347, InstructGPT 2203.02155, DPO Rafailov 2023 NeurIPS, DeepSeekMath GRPO 2402.03300, DeepSeek-R1 2501.12948, KTO/IPO/SimPO/ORPO)", "Callout 'empty ...
"MAJOR: Codex cross-checked the Tinker source (train_on_policy.py); the default is sampled-token + importance sampling (IS) + negative-KL advantage, NOT full-vocab. The tutorial repeatedly attributed the full-vocab path to Tinker, which is wrong.