TY - JOUR
T1 - A survey of slow thinking-based reasoning LLMs using reinforcement learning and test-time scaling law
AU - Pan, Qianjun
AU - Ji, Wenkai
AU - Ding, Yuyang
AU - Li, Junsong
AU - Chen, Shilian
AU - Wang, Junyi
AU - Zhou, Jie
AU - Chen, Qin
AU - Zhang, Min
AU - Wu, Yulan
AU - He, Liang
N1 - Publisher Copyright:
© 2025
PY - 2026/3
Y1 - 2026/3
AB - This survey presents a focused and conceptually distinct framework for understanding recent advancements in reasoning large language models (LLMs) designed to emulate “slow thinking”, a deliberate, analytical mode of cognition analogous to System 2 in dual-process theory from cognitive psychology. While prior review works have surveyed reasoning LLMs through fragmented lenses, such as isolated technical paradigms (e.g., reinforcement learning or test-time scaling) or broad post-training taxonomies, this work uniquely integrates reinforcement learning and test-time scaling as synergistic mechanisms within a unified “slow thinking” paradigm. By synthesizing insights from over 200 studies, we identify three interdependent pillars that collectively enable advanced reasoning: (1) Test-time scaling, which dynamically allocates computational resources based on task complexity via search, adaptive computation, and verification; (2) Reinforcement learning, which refines reasoning trajectories through reward modeling, policy optimization, and self-improvement; and (3) Slow-thinking frameworks, which structure reasoning into stepwise, hierarchical, or hybrid processes such as long Chain-of-Thought and multi-agent deliberation. Unlike existing surveys, our framework is goal-oriented, centering on the cognitive objective of “slow thinking” as both a unifying principle and a design imperative. This perspective enables a systematic analysis of how diverse techniques converge toward human-like deep reasoning. The survey charts a trajectory toward next-generation LLMs that balance cognitive fidelity with computational efficiency, while also outlining key challenges and future directions. Advancing such reasoning capabilities is essential for deploying LLMs in high-stakes domains including scientific discovery, autonomous agents, and complex decision support systems.
KW - Deep thinking
KW - Long chain-of-thought
KW - Reasoning LLMs
KW - Slow thinking
KW - Survey
KW - Test-time scaling law
UR - https://www.scopus.com/pages/publications/105015417091
U2 - 10.1016/j.ipm.2025.104394
DO - 10.1016/j.ipm.2025.104394
M3 - Article
AN - SCOPUS:105015417091
SN - 0306-4573
VL - 63
JO - Information Processing and Management
JF - Information Processing and Management
IS - 2
M1 - 104394
ER -