LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis
Best model hits 48% accuracy and drops ~47 points over long tasks; agents fail by losing analytical state, not by running short.