Post by Mary Newhauser
Member of Technical Staff @ Fastino Labs
🥇 The paper we’ve been waiting for is finally here.

Amazon Web Services (AWS) researchers just published a wild paper on arXiv, demonstrating that a fine-tuned 350M-parameter SLM outperformed several fine-tuned LLMs on tool use and agentic tasks.

How did the experiment play out?

First, they fine-tuned an SLM (facebook/opt-350m) on the ToolBench dataset for three specific tasks generally performed by LLMs:

1. Document summarization
2. Query answering
3. Structured data interpretation

Then, they compared the performance of their fine-tuned SLM to proprietary CoT LLMs (ChatGPT, Claude) and open LLMs fine-tuned for tool use on the ToolBench dataset.

The results were wild. The fine-tuned SLM outperformed ALL LLMs on all aspects of tool use on the ToolBench evaluation framework.

They also report that:

• 350M is the sweet spot for SLM size for tool use
• The SLM learned to suppress irrelevant behaviors and focus better on tool use only

Why is this interesting? For a few reasons:

• The SLM is very, very small, well under 1B parameters. The fact that it could outperform all LLMs in the experiment is very impressive.
• The SLM is a decoder-only model that performs generative tasks. Many modern SLMs are fine-tuned to perform non-generative tasks like classification and NER.

It’s important to remember that an SLM like this isn’t intended to replace LLMs across multiple applications and domains. A fine-tuned SLM can replace an LLM for a specific domain and application.

But still, this is wild.

📄 Paper: https://lnkd.in/gTqmzfTg
🔗 ToolBench Repo: https://lnkd.in/gCCfyNyp
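
For anyone wondering what fine-tuning a small decoder-only model for tool use might look like in practice, here is a minimal sketch of the data-prep step: turning one tool-use record into a prompt/target text pair that a causal LM such as facebook/opt-350m could be trained on. This is an illustration only, not the paper's pipeline. The record fields, the Thought/Action/Action Input layout, and the `build_example` helper are all assumptions made up for this example, and the actual ToolBench schema may differ.

```python
import json


def build_example(query, tools, call):
    """Serialize one tool-use record into a (prompt, target) pair of strings.

    Hypothetical helper: the field names and text layout are assumptions
    for illustration, not the ToolBench format or the paper's method.
    """
    # List the tools the model is allowed to call, one per line.
    tool_lines = "\n".join(f"- {t['name']}: {t['description']}" for t in tools)
    prompt = (
        "You can call the following tools:\n"
        f"{tool_lines}\n\n"
        f"User query: {query}\n"
        "Response:\n"
    )
    # The target is the text the model should learn to generate.
    target = (
        f"Thought: {call['thought']}\n"
        f"Action: {call['name']}\n"
        f"Action Input: {json.dumps(call['arguments'], sort_keys=True)}"
    )
    return prompt, target


# A made-up record, purely for demonstration.
tools = [{"name": "get_weather", "description": "Look up current weather for a city"}]
call = {
    "thought": "I need the current weather for the requested city.",
    "name": "get_weather",
    "arguments": {"city": "Berlin"},
}

prompt, target = build_example("What's the weather in Berlin?", tools, call)
print(prompt + target)
```

During supervised fine-tuning, the prompt and target would typically be concatenated and tokenized, with the loss usually computed only on the target tokens so the model learns to emit the tool call and nothing else.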