Hugging Face Papers | May 27, 2026 | Worth watching

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Best model hits 48% accuracy and drops ~47 points over long tasks; agents fail by losing analytical state, not by running short.

AI Agents | Applied AI