ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models addresses the memory bottleneck caused by the massive key-value (KV)...
CIAware-Bench: Benchmarking Control Intervention Awareness Across Frontier LLMs. As AI models become more autonomous, developers often use "control protocols"...
Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields introduces a new benchmark designed to test how...
This research investigates whether on-premise, open-source Large Language Models (LLMs) can assist in tuning controllers for complex industrial processes.
Monte Carlo Pass Search (MCPS) is a new framework designed to evaluate football passes by treating them as a distribution of possible outcomes rather than a...
The Role of Feedback Alignment in Self-Distillation explores how to improve language models by refining the "context" they receive during training.
What Fits (Into Few Tokens) Doesn’t Overfit: Compression and Generalization in ML Research Agents. This paper investigates why machine learning benchmarks, wh...
A History-Aware Visually Grounded Critic for Computer Use Agents. Computer Use Agents (CUAs) are AI models designed to perform complex, multi-step tasks on co...
ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity introduces a new evaluation framework designed to measure how effectively AI agents can perf...
Superficial Beliefs in LLM Decision-Making. This research investigates whether Large Language Models (LLMs) possess a genuine, structured internal logic when...