LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

Best model hits 48% accuracy and drops ~47 points over long tasks; agents fail by losing analytical state, not by running short.

AI Agents, Applied AI
Read original on Hugging Face Papers