Why Publish Research?
Our mission is to unlock human excellence. We've begun with Pokket, a performance coach in your pocket. A coach that works with you each day, learns from you, and guides you step by step toward your goals and beyond.
But in creating Pokket, our research has implications that extend far beyond this domain. AI systems will shape every aspect of the world in the coming decades. If our work can contribute to making this future more positive for humanity, we believe in sharing it whenever possible.
The Problem
When developing a system of any scope, it's important to begin with its foundation. The more critical the system, the stronger its foundation needs to be.
In autonomous AI agents, a key part of this foundation stems from what are known as "tool calls". These tools enable large language models (LLMs) to do more than generate text – to use search engines, send emails, or execute code. In short, tools act as connective tissue between LLMs and the outside world.
For Pokket, similar tool calls are integrated across our systems. Pokket can use them to identify the correct research to use for a given conversation, or to check on someone's goals, or to search our social media. In some cases, Pokket uses tools to signal a safety concern.
Yet the tool calling of existing systems leaves much to be desired. Even top-tier systems and models often lack consistency and reliability – inexcusable in mission-critical architectures or domains. So, we began exploring alternative solutions, beginning with the concept of Natural Language Tools (NLT).
The Research
In most LLM-based systems, tool calling is defined in strict, programmatic structures, with inputs and outputs like so:
{
  "tool_calls": [{
    "name": "check_talk_to_a_human",
    "description": "Used when the user requests..."
  }]
}
Natural Language Tools takes a different approach, instead using plain, human-written language to define and invoke tools.
In the paper, we evaluated our NLT approach across 10 models and 6,400 trials, spanning customer service and mental health domains. We compared it to cutting-edge tool-calling implementations from OpenAI, Google, and others. Through these trials, we found that NLT significantly improved tool-call performance, boosting accuracy by more than 18 percentage points (from 69.1% to 87.5%). Overall variance fell by more than 70%.
Ultimately, this means that with NLT we are able to increase the resilience, effectiveness, and reliability of our systems. This gives us greater confidence that our safety architecture will trigger when necessary, or that the right research will be applied to the right problem. We've since expanded this paradigm, applying it and similar changes across Pokket's system.
Final Thoughts
Natural Language Tools represents a step toward more reliable, and ultimately safer, AI systems. We hope this research can help others build more reliable tool-calling implementations, especially for mission-critical and safety-related tools. We'll be publishing more research in the coming weeks and months, ranging from agentic architecture to modular safety systems.
Stay tuned.
For questions about this research, please connect with us via our Discord. For those interested in experiencing Pokket firsthand, click here to sign up!