{"by":"HenryNdubuaku","descendants":210,"id":48111896,"kids":[48116333,48113166,48117983,48113387,48116454,48113889,48115385,48115978,48125737,48113615,48112914,48113360,48118236,48117879,48115552,48132122,48116373,48115524,48118253,48114111,48114828,48114709,48114927,48115290,48131437,48114180,48133248,48114414,48115220,48117970,48128602,48114350,48117417,48123144,48119554,48114445,48115116,48126882,48137173,48116171,48115334,48114163,48118182,48115161,48123148,48115928,48127269,48159047,48130433,48157445,48143906,48118796,48113335,48128500,48117647,48124608,48130724,48132703,48124234,48140939,48122630,48117773,48123669,48130920,48136452,48122076,48150018,48119190,48120546,48118588,48116260,48115470,48121382,48117521,48123040,48127573,48118696,48112181,48113596],"score":755,"text":"Hey HN, Henry here from Cactus. We open-sourced Needle, a 26M parameter function-calling (tool use) model. It runs at 6000 tok&#x2F;s prefill and 1200 tok&#x2F;s decode on consumer devices.<p>We were always frustrated by the little effort made towards building agentic models that run on budget phones, so we conducted investigations that led to an observation: agentic experiences are built upon tool calling, and massive models are overkill for it. Tool calling is fundamentally retrieval-and-assembly (match query to tool name, extract argument values, emit JSON), not reasoning. Cross-attention is the right primitive for this, and FFN parameters are wasted at this scale.<p>Simple Attention Networks: the entire model is just attention and gating, no MLPs anywhere. Needle is an experimental run for single-shot function calling for consumer devices (phones, watches, glasses...).<p>Training:\n- Pretrained on 200B tokens across 16 TPU v6e (27 hours)\n- Post-trained on 2B tokens of synthesized function-calling data (45 minutes)\n- Dataset synthesized via Gemini with 15 tool categories (timers, messaging, navigation, smart home, etc.)<p>You can test it right now and finetune on your Mac&#x2F;PC: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;cactus-compute&#x2F;needle\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;cactus-compute&#x2F;needle</a><p>The full writeup on the architecture is here: <a href=\"https:&#x2F;&#x2F;github.com&#x2F;cactus-compute&#x2F;needle&#x2F;blob&#x2F;main&#x2F;docs&#x2F;simple_attention_networks.md\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;cactus-compute&#x2F;needle&#x2F;blob&#x2F;main&#x2F;docs&#x2F;simp...</a><p>We found that the &quot;no FFN&quot; finding generalizes beyond function calling to any task where the model has access to external structured knowledge (RAG, tool use, retrieval-augmented generation). The model doesn&#x27;t need to memorize facts in FFN weights if the facts are provided in the input. Experimental results to published.<p>While it beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling, those models have more scope&#x2F;capacity and excel in conversational settings. We encourage you to test on your own tools via the playground and finetune accordingly.<p>This is part of our broader work on Cactus (<a href=\"https:&#x2F;&#x2F;github.com&#x2F;cactus-compute&#x2F;cactus\" rel=\"nofollow\">https:&#x2F;&#x2F;github.com&#x2F;cactus-compute&#x2F;cactus</a>), an inference engine built from scratch for mobile, wearables and custom hardware. 
This is part of our broader work on Cactus (https://github.com/cactus-compute/cactus), an inference engine built from scratch for mobile, wearables, and custom hardware. We wrote about Cactus here previously: https://news.ycombinator.com/item?id=44524544

Everything is MIT licensed.
Weights: https://huggingface.co/Cactus-Compute/needle
GitHub: https://github.com/cactus-compute/needle