AI Agent Tool Calling Explained: Full Guide

The moment I realized LLMs can’t actually do anything on their own

I was demoing an AI agent to a colleague when she asked a simple question: “But how does it actually search the web? Doesn’t the AI just know things?” I paused, because the honest answer is one that most AI demos carefully obscure: a language model, by itself, knows nothing about what happened last Tuesday, can’t run a Python script, can’t check a database, and can’t send an email.

An LLM on its own is a text-in, text-out function. Everything that makes an AI agent feel like it’s doing things (browsing, calculating, writing files, calling APIs) comes from tool calling. It’s the mechanism that connects a language model’s reasoning to the real world, and it’s one of the most important concepts to understand if you’re building, deploying, or evaluating AI agents.

This article explains exactly how AI agent tool calling works, from the underlying mechanics to how you design tools well, where it breaks, and how to fix it.

What is tool calling?

Tool calling, also called function calling, is the mechanism by which an LLM signals that it wants to invoke an external function and receives the result back to continue its reasoning. The model doesn’t execute anything directly. It generates a structured output that says “I want to call this function, with these arguments.” Your application code intercepts that output, runs the actual function, and feeds the result back into the model’s context.

That handoff (model requests, code executes, result returns) is the entire engine behind every AI agent that does anything useful in the real world.

The major LLM providers all support this natively. OpenAI calls it function calling. Anthropic calls it tool use. Google calls it function calling too. The mechanics differ slightly by provider, but the concept is identical across all of them.

How AI agent tool calling works, step by step

Let’s walk through a complete tool calling cycle from the first prompt to the final answer, the unobscured version.

Step 1: Define your tools

Before the conversation starts, you tell the model what tools are available. Each tool is described using a structured schema, typically JSON Schema, that specifies the tool’s name, what it does, and what arguments it accepts.

// Tool definition (Anthropic style; OpenAI nests the same schema under `parameters`)
const tools = [
  {
    name: "search_web",
    description: "Search the internet for current information on a topic. Use when the user asks about recent events, current data, or anything that may have changed. Do NOT use for general knowledge you can answer from training.",
    input_schema: {
      type: "object",
      properties: {
        query: {
          type: "string",
          description: "The search query to run"
        },
        num_results: {
          type: "integer",
          description: "Number of results to return. Default 5, max 10.",
          default: 5
        }
      },
      required: ["query"]
    }
  },
  {
    name: "run_python",
    description: "Execute a Python code snippet and return stdout output. Use for calculations, data processing, or anything requiring computation.",
    input_schema: {
      type: "object",
      properties: {
        code: {
          type: "string",
          description: "Valid Python code to execute"
        }
      },
      required: ["code"]
    }
  }
];

The description field is not decorative. It’s how the model decides when to call the tool. Write it like documentation for a colleague who needs to know exactly what the tool does and when to reach for it.
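The definitions above use the `input_schema` key, which is Anthropic’s naming. OpenAI’s Chat Completions API expects the same JSON Schema under `parameters`, wrapped in a `{ type: "function", function: { ... } }` envelope. As a rough sketch of how one definition could serve both providers, the helper below does the conversion. `toOpenAITool` is a hypothetical name, not part of either SDK.

```javascript
// Hypothetical helper (not part of any SDK): convert an Anthropic-style
// tool definition into the shape OpenAI's Chat Completions API expects.
function toOpenAITool(tool) {
  return {
    type: "function",
    function: {
      name: tool.name,
      description: tool.description,
      parameters: tool.input_schema, // same JSON Schema, different key
    },
  };
}

const searchTool = {
  name: "search_web",
  description: "Search the internet for current information on a topic.",
  input_schema: {
    type: "object",
    properties: {
      query: { type: "string", description: "The search query to run" },
    },
    required: ["query"],
  },
};

const openAITool = toOpenAITool(searchTool);
console.log(JSON.stringify(openAITool, null, 2));
```

Keeping one canonical definition and converting at the edge means the description, which does most of the work, is written only once.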

Step 2: The model receives the user message and tool list

When the user sends a message, you pass both the conversation and the tool definitions to the API. The model reads both and decides whether it can answer directly or whether it needs to call a tool first.

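import Anthropic from "@anthropic-ai/sdk";

// Reads the API key from the ANTHROPIC_API_KEY environment variable
const anthropic = new Anthropic();

const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  tools: tools,
  messages: [
    { role: "user", content: "What's the current price of NVIDIA stock?" }
  ]
});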

Step 3: Model outputs a tool call request

Instead of answering directly, the model returns a structured tool call. It does not return the answer; it returns its intent to look up the answer. This is the key moment most people misunderstand: the model isn’t fetching anything. It’s asking your code to fetch it.

// Model response (stop_reason: "tool_use")
{
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "tool_use",
      "id": "tool_call_abc123",
      "name": "search_web",
      "input": {
        "query": "NVIDIA stock price today",
        "num_results": 3
      }
    }
  ]
}

Step 4: Your code executes the tool

Your application reads the tool call from the response, routes it to the actual function, and runs it. The model is not involved in this step at all. This is important: tool execution happens entirely in your code, which means you control what tools exist, what they can access, and what guardrails surround them.

// Your application handles the execution
async function handleToolCall(toolCall) {
  if (toolCall.name === "search_web") {
    const results = await searchWeb(toolCall.input.query, toolCall.input.num_results);
    return results;
  }
  if (toolCall.name === "run_python") {
    const output = await executePython(toolCall.input.code);
    return output;
  }
  throw new Error(`Unknown tool: ${toolCall.name}`);
}

Step 5: The tool result is passed back to the model

You take the output of the tool execution and send it back to the model as a new message in the conversation, tagged as a tool result. The model now has the information it needed and can generate a final response.

// Run the requested tool, then send its result back to the model
const toolCall = response.content.find(block => block.type === "tool_use");
const searchResults = await handleToolCall(toolCall);

const finalResponse = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 1024,
  tools: tools,
  messages: [
    { role: "user", content: "What's the current price of NVIDIA stock?" },
    { role: "assistant", content: response.content },  // the tool call
    {
      role: "user",
      content: [
        {
          type: "tool_result",
          tool_use_id: toolCall.id,  // must match the id of the tool call
          content: JSON.stringify(searchResults)
        }
      ]
    }
  ]
});

Step 6: The model generates the final answer

With the tool result in context, the model now has what it needs to answer the original question. It synthesizes the retrieved data into a natural language response, and if it needs another tool call to complete the answer, the loop continues.

Parallel AI agent tool calling: doing multiple things at once

Modern LLMs can request multiple tool calls in a single response. If a user asks “Compare NVIDIA and AMD stock prices,” the model doesn’t have to search for one, wait for the result, then search for the other. It can request both in parallel.

// Model requests two tools at once
{
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "tool_use",
      "id": "call_001",
      "name": "search_web",
      "input": { "query": "NVIDIA stock price today" }
    },
    {
      "type": "tool_use",
      "id": "call_002",
      "name": "search_web",
      "input": { "query": "AMD stock price today" }
    }
  ]
}
 
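// Collect the tool calls from the response, then execute both in parallel
const toolCalls = response.content.filter(block => block.type === "tool_use");
const [nvidiaResult, amdResult] = await Promise.all([
  handleToolCall(toolCalls[0]),
  handleToolCall(toolCalls[1])
]);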

Parallel tool calling cuts latency significantly on tasks that require multiple independent lookups. Always execute parallel tool calls concurrently in your application code; running them sequentially when the model requested them together wastes time.
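One wrinkle with `Promise.all` is that a single failed lookup rejects the whole batch. A possible alternative, sketched here under the assumption that you want every call answered, is `Promise.allSettled`, reporting failures back through the `is_error` flag that Anthropic’s `tool_result` blocks accept. The function name `executeToolCalls` and the fake handler are illustrative only.

```javascript
// Sketch: run every tool call from one model response concurrently and
// turn each outcome into a tool_result block. `handler` is whatever
// function executes a single call (handleToolCall in this article).
async function executeToolCalls(toolCalls, handler) {
  // The async wrapper turns synchronous throws into rejections too.
  const settled = await Promise.allSettled(
    toolCalls.map(async (call) => handler(call))
  );
  return settled.map((outcome, i) => {
    const base = { type: "tool_result", tool_use_id: toolCalls[i].id };
    if (outcome.status === "fulfilled") {
      return { ...base, content: JSON.stringify(outcome.value) };
    }
    // A failed call is reported to the model instead of sinking the batch.
    const reason = outcome.reason;
    return { ...base, is_error: true, content: String(reason?.message ?? reason) };
  });
}

// Demo with a fake handler (assumption: real tools would hit the network).
const demoCalls = [
  { id: "call_001", name: "search_web", input: { query: "NVIDIA stock price today" } },
  { id: "call_002", name: "search_web", input: { query: "AMD stock price today" } },
];
const fakeHandler = async (call) => {
  if (call.input.query.startsWith("AMD")) throw new Error("search backend unavailable");
  return { query: call.input.query, results: [] };
};

const demoResults = executeToolCalls(demoCalls, fakeHandler);
demoResults.then((results) => console.log(results));
```

Results come back in the same order as the requests, so each `tool_use_id` lines up with the call that produced it.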

Designing good tools: the part most tutorials skip

The quality of your tool definitions directly determines how well your agent performs. A poorly described tool gets called at the wrong time, with the wrong arguments, producing results the model doesn’t know how to use. Here’s what separates good tool design from bad.

Write descriptions that answer “when,” not just “what”

Most developers write tool descriptions that explain what the tool does: “Searches the web.” That’s necessary but not sufficient. The model needs to know when to call it versus not. Add decision guidance:

// Weak description
"description": "Search the web for information."
 
// Strong description
"description": "Search the web for current, real-time, or recent information. Use when the user asks about events after your knowledge cutoff, current prices, live data, or anything that may have changed recently. Do NOT use for general knowledge questions you can answer from your own training."

Use precise argument types and constraints

The model fills in arguments guided by your schema, so whatever the schema leaves open, the model is free to vary. If you accept a string with no constraints, you’ll get strings of any length, format, or language. Be specific:

// Vague schema
"date": { "type": "string", "description": "A date" }
 
// Precise schema
"date": {
  "type": "string",
  "description": "Date in ISO 8601 format (YYYY-MM-DD). Example: 2025-06-01.",
  "pattern": "^\\d{4}-\\d{2}-\\d{2}$"
}

Keep each tool focused on one thing

A tool that does too many things gets called inconsistently. A manage_calendar tool that can create, read, update, and delete events is harder for the model to use correctly than four separate focused tools. Single-responsibility tools produce more reliable behavior.
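To make the split concrete, here is one way the calendar example might look as four definitions. All tool names, fields, and descriptions below are hypothetical and only illustrate the shape.

```javascript
// Hypothetical tool names and fields, shown only to illustrate the split.
const eventId = { type: "string", description: "ID of an existing event, as returned by list_calendar_events" };
const title = { type: "string", description: "Event title" };
const isoDateTime = (what) => ({
  type: "string",
  description: `${what} in ISO 8601 format. Example: 2025-06-01T14:00:00Z.`,
});

const calendarTools = [
  {
    name: "list_calendar_events",
    description: "Read events in a time range. Use to check availability or find an event ID. Never modifies the calendar.",
    input_schema: {
      type: "object",
      properties: { start: isoDateTime("Range start"), end: isoDateTime("Range end") },
      required: ["start", "end"],
    },
  },
  {
    name: "create_calendar_event",
    description: "Create a new event. Do NOT use to change an existing event.",
    input_schema: {
      type: "object",
      properties: { title, start: isoDateTime("Event start"), end: isoDateTime("Event end") },
      required: ["title", "start", "end"],
    },
  },
  {
    name: "update_calendar_event",
    description: "Change the title or time of an existing event identified by event_id.",
    input_schema: {
      type: "object",
      properties: { event_id: eventId, title, start: isoDateTime("New start"), end: isoDateTime("New end") },
      required: ["event_id"],
    },
  },
  {
    name: "delete_calendar_event",
    description: "Permanently delete an existing event. Use only when the user explicitly asks to cancel or remove it.",
    input_schema: { type: "object", properties: { event_id: eventId }, required: ["event_id"] },
  },
];

console.log(calendarTools.map((tool) => tool.name));
```

Each definition now has its own required fields and its own “when to use” guidance, which a single catch-all tool with an `action` argument cannot express as cleanly.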

Return structured, parseable results

What the tool returns matters as much as what it accepts. Return JSON with consistent keys rather than raw text blobs. The model will reason over the structure of your output; inconsistent formats degrade reasoning quality.

// Unstructured return (hard to reason over)
return "NVIDIA is trading at $924.31, up 2.3% today";
 
// Structured return (easy to reason over)
return {
  ticker: "NVDA",
  price: 924.31,
  change_percent: 2.3,
  currency: "USD",
  timestamp: "2025-06-01T14:32:00Z"
};

The tool calling loop in multi-step agents

Most real tasks require more than one tool call. An agent asked to “research recent AI developments and write a summary” might call a search tool five or six times, then synthesize everything. The model keeps calling tools, and your application keeps executing them and returning results until the model decides it has enough information to give a final answer.

// Multi-step tool loop
// `llm.call` stands in for your provider's SDK call
const MAX_ITERATIONS = 25;

async function runAgentLoop(userMessage, tools) {
  const messages = [{ role: "user", content: userMessage }];

  for (let i = 0; i < MAX_ITERATIONS; i++) {
    const response = await llm.call({ messages, tools });

    // If the model is not asking for a tool, it is done: return the final text
    if (response.stop_reason !== "tool_use") {
      const textBlock = response.content.find(b => b.type === "text");
      return textBlock ? textBlock.text : "";
    }

    // Handle all tool calls in this response
    const toolCalls = response.content.filter(b => b.type === "tool_use");
    messages.push({ role: "assistant", content: response.content });

    const toolResults = await Promise.all(toolCalls.map(async (call) => ({
      type: "tool_result",
      tool_use_id: call.id,
      content: JSON.stringify(await handleToolCall(call))
    })));

    messages.push({ role: "user", content: toolResults });
  }

  // Safety: prevent infinite loops
  throw new Error("Max iterations exceeded");
}

The max iterations guard is not optional. Without a hard stop, a confused or looping agent will run indefinitely, consuming tokens and budget with no exit condition.

Common tool calling failures and how to fix them

The model calls the wrong tool

The model reaches for a tool that isn’t the right one for the task, usually because descriptions overlap or are too vague. Fix by sharpening the decision boundary in your descriptions. Make each tool’s “when to use” and “when NOT to use” explicit.

The model passes malformed arguments

The model generates arguments that don’t match the expected type or format: a string where an integer was expected, a date in the wrong format, or a missing required field. Fix by tightening your JSON Schema constraints. Add patterns for strings, ranges for numbers, and explicit examples in descriptions.
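Tighter schemas reduce malformed arguments but don’t eliminate them, so it helps to check arguments in your own code before executing. The sketch below is a deliberately small, hand-rolled check that covers only `required`, basic `type`, `pattern`, `minimum`, and `maximum`. It is not a full JSON Schema implementation, and in production a dedicated validator library would likely be the better choice. `validateArgs` is a hypothetical name.

```javascript
// Minimal sketch of argument validation against a tool's input_schema.
// Returns a list of human-readable problems; an empty list means "ok".
function validateArgs(schema, args) {
  const errors = [];
  const properties = schema.properties ?? {};
  for (const key of schema.required ?? []) {
    if (!(key in args)) errors.push(`missing required field "${key}"`);
  }
  for (const [key, value] of Object.entries(args)) {
    if (!Object.hasOwn(properties, key)) {
      errors.push(`unexpected field "${key}"`);
      continue;
    }
    const rule = properties[key];
    const typeOk =
      rule.type === "integer" ? Number.isInteger(value)
      : rule.type === "number" ? Number.isFinite(value)
      : typeof value === rule.type; // covers "string" and "boolean" only
    if (!typeOk) {
      errors.push(`"${key}" should be of type ${rule.type}`);
      continue;
    }
    if (rule.pattern && !new RegExp(rule.pattern).test(value)) {
      errors.push(`"${key}" does not match pattern ${rule.pattern}`);
    }
    if (rule.minimum !== undefined && value < rule.minimum) errors.push(`"${key}" is below ${rule.minimum}`);
    if (rule.maximum !== undefined && value > rule.maximum) errors.push(`"${key}" is above ${rule.maximum}`);
  }
  return errors;
}

const exampleSchema = {
  type: "object",
  properties: {
    date: { type: "string", pattern: "^\\d{4}-\\d{2}-\\d{2}$" },
    num_results: { type: "integer", minimum: 1, maximum: 10 },
  },
  required: ["date"],
};

const goodErrors = validateArgs(exampleSchema, { date: "2025-06-01", num_results: 3 });
const badErrors = validateArgs(exampleSchema, { date: "June 1st", num_results: "3" });
console.log(goodErrors, badErrors);
```

Returning the error list to the model as the tool result, rather than throwing, gives it a chance to correct the arguments and try again.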

The tool returns an error, and the agent halts

A tool fails (network timeout, invalid API key, rate limit) and the agent doesn’t know what to do with the error. Always return structured error responses that the model can reason about:

// Instead of throwing, return a structured error
async function callToolSafely(args) {
  try {
    return await callExternalAPI(args);
  } catch (err) {
    // Example values; set error_type and message from the actual failure
    return {
      error: true,
      error_type: "api_timeout",
      message: "The external service timed out after 10 seconds.",
      suggestion: "Retry the request or try a different data source."
    };
  }
}

When the model receives a structured error, it can decide to retry, try a different tool, or explain the problem to the user rather than silently stalling.
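In practice the error type should reflect what actually went wrong. As a sketch, a wrapper can classify the caught error and attach a matching suggestion. The `error_type` labels and the `name`, `code`, and `status` fields inspected here are assumptions; adapt them to whatever your HTTP client really throws.

```javascript
// Sketch: map a caught error to a category the model can act on.
// Field names checked here are assumptions about your client library.
function classifyError(err) {
  const e = err ?? {};
  if (e.name === "AbortError" || e.code === "ETIMEDOUT") {
    return { error_type: "timeout", suggestion: "Retry the request or try a different data source." };
  }
  if (e.status === 429) {
    return { error_type: "rate_limit", suggestion: "Wait before retrying." };
  }
  if (e.status === 401 || e.status === 403) {
    return { error_type: "auth", suggestion: "Do not retry. Tell the user the service is misconfigured." };
  }
  return { error_type: "unknown", suggestion: "Explain the failure to the user." };
}

// Wrap any async tool function so failures come back as structured data.
function withStructuredErrors(toolFn) {
  return async (args) => {
    try {
      return await toolFn(args);
    } catch (err) {
      return { error: true, message: String(err?.message ?? err), ...classifyError(err) };
    }
  };
}

// Demo: a stub tool that always hits a rate limit.
const flakyTool = async () => {
  throw Object.assign(new Error("Too many requests"), { status: 429 });
};
const safeTool = withStructuredErrors(flakyTool);
const safeResult = safeTool({ query: "NVIDIA stock price today" });
safeResult.then((result) => console.log(result));
```

Wrapping at registration time keeps each tool implementation free of error-formatting code.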

The model hallucinates a tool that doesn’t exist

In some cases, especially with less capable models, the model will generate a tool call for a function you never defined. Your router will throw an “unknown tool” error. Fix by explicitly listing available tools in the system prompt and handling unknown tool calls gracefully with a message back to the model: “That tool is not available. Available tools: [list].”
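A registry-based router is one way to handle this gracefully. In the sketch below, tool names map to implementations, and an unknown name produces a message for the model instead of an exception. The stub implementations and the `routeToolCall` name are placeholders.

```javascript
// Sketch: tool names map to implementations (stubs here).
const registry = {
  search_web: async ({ query }) => ({ query, results: [] }),
  run_python: async ({ code }) => ({ stdout: "" }),
};

async function routeToolCall(toolCall, tools = registry) {
  // Object.hasOwn avoids matching inherited keys such as "toString".
  if (!Object.hasOwn(tools, toolCall.name)) {
    return {
      error: true,
      error_type: "unknown_tool",
      message: `That tool is not available. Available tools: ${Object.keys(tools).join(", ")}.`,
    };
  }
  return tools[toolCall.name](toolCall.input);
}

// Demo: the model asks for a tool that was never defined.
const routed = routeToolCall({ id: "call_009", name: "send_email", input: {} });
routed.then((result) => console.log(result));
```

Because the list of available tools is generated from the registry itself, the message stays accurate as tools are added or removed.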

Infinite tool call loops

The agent calls tool A, gets a result, calls tool A again slightly differently, gets another result, and repeats, never converging on a final answer. This usually happens when the tool’s output doesn’t give the model enough signal to decide it has what it needs. Fix by enriching tool return values with a completeness signal, and implementing a hard iteration cap in your loop.
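Beyond a global iteration cap, it can help to notice the loop as it forms. This sketch tracks two signals: exact repeats of the same call, and a per-tool budget that catches the “slightly different each time” pattern. The thresholds and the `createLoopGuard` name are arbitrary choices for illustration.

```javascript
// Sketch: flag an agent that keeps reaching for the same tool.
// Returns null when a call is fine, or a reason string when it is not.
function createLoopGuard({ maxIdentical = 2, maxPerTool = 8 } = {}) {
  const identical = new Map(); // name + serialized input -> count
  const perTool = new Map();   // name -> count
  return function check(toolCall) {
    // Note: JSON.stringify is key-order sensitive, so this only catches
    // byte-for-byte repeats; the per-tool budget covers near-duplicates.
    const key = `${toolCall.name}:${JSON.stringify(toolCall.input)}`;
    identical.set(key, (identical.get(key) ?? 0) + 1);
    perTool.set(toolCall.name, (perTool.get(toolCall.name) ?? 0) + 1);
    if (identical.get(key) > maxIdentical) return "identical_call_repeated";
    if (perTool.get(toolCall.name) > maxPerTool) return "tool_budget_exceeded";
    return null;
  };
}

// Demo with tight limits.
const check = createLoopGuard({ maxIdentical: 2, maxPerTool: 3 });
const sameCall = { name: "search_web", input: { query: "AI news" } };
const verdicts = [
  check(sameCall),
  check(sameCall),
  check(sameCall),
  check({ name: "search_web", input: { query: "AI news this week" } }),
];
console.log(verdicts);
```

When the guard returns a reason, you might feed it back to the model as a tool result (“you have already run this search”) before falling back to the hard stop.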

Tool security: what most developers ignore until it’s too late

Tool calling is powerful precisely because it lets an AI agent interact with real systems. That power requires guardrails.

Principle of least privilege. Every tool should have the minimum permissions needed to do its job. A search tool doesn’t need write access to your database. A calendar reader doesn’t need the ability to send emails. Scope your tool implementations as narrowly as possible.

Validate all arguments server-side. The model generates arguments, but your code executes them. Never trust the model’s output as inherently safe. Validate argument values against your own rules before passing them to an API or database, just as you would with user-submitted form data.

Confirm before destructive actions. If a tool can delete, modify, or send something on behalf of a user, build in a confirmation step. Have the agent present what it’s about to do and require explicit approval before executing. An AI that can silently delete files is a liability.

Log every tool call. For every tool invocation, record the name, arguments, result, and timestamp. This is your audit trail for debugging, compliance, and detecting misuse.
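Two of these guardrails, confirmation and logging, fit naturally into a single wrapper around execution. The sketch below assumes you supply three callbacks: `execute` runs the tool, `confirm` asks the user for approval, and `log` writes the audit entry. The `DESTRUCTIVE_TOOLS` set and all names are illustrative.

```javascript
// Sketch: gate destructive tools behind approval and log every call.
const DESTRUCTIVE_TOOLS = new Set(["delete_file", "send_email"]);

async function guardedExecute(toolCall, { execute, confirm, log }) {
  const entry = {
    timestamp: new Date().toISOString(),
    name: toolCall.name,
    arguments: toolCall.input,
  };
  if (DESTRUCTIVE_TOOLS.has(toolCall.name) && !(await confirm(toolCall))) {
    entry.result = { error: true, error_type: "not_approved", message: "The user declined this action." };
  } else {
    try {
      entry.result = await execute(toolCall);
    } catch (err) {
      entry.result = { error: true, error_type: "execution_failed", message: String(err?.message ?? err) };
    }
  }
  log(entry); // logged whether the call ran, failed, or was declined
  return entry.result;
}

// Demo with in-memory stand-ins for the three callbacks.
const auditLog = [];
const executed = [];
const deps = {
  execute: async (call) => { executed.push(call.name); return { ok: true }; },
  confirm: async () => false, // stand-in for a real approval prompt; always declines
  log: (entry) => auditLog.push(entry),
};

const demo = (async () => {
  const declined = await guardedExecute({ name: "delete_file", input: { path: "notes.txt" } }, deps);
  const allowed = await guardedExecute({ name: "search_web", input: { query: "weather" } }, deps);
  console.log(declined, allowed, auditLog.length);
  return { declined, allowed };
})();
```

Declined and failed calls are logged alongside successful ones, so the audit trail shows what the agent attempted, not only what it did.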

Tool calling quick reference

Concept	What it means	Key implementation note
Tool definition	Schema describing name, purpose, and arguments	Description quality directly affects call accuracy
Tool call request	Structured output that the model returns when it wants to invoke a tool	The model doesn’t execute your code
Tool execution	Your application runs the actual function	You control permissions, validation, and error handling
Tool result	Output sent back to the model as context	Return structured JSON for the best reasoning quality
Parallel tool calls	Multiple tool call requests in a single model response	Execute concurrently, not sequentially
Tool loop	Repeated call-execute-return cycle until the goal is met	Always enforce a maximum iteration limit
Tool security	Permissions, validation, and confirmation for destructive actions	Treat tool arguments like untrusted user input
