GTA-2: Benchmarking General Tool Agents from Atomic Tool-Use to Open-Ended Workflows
Advanced agent harnesses beat the base model by 172%, proving execution design matters more than model size for real-world workflow completion.