

Image by Author
llama.cpp is the original, high-performance framework that powers many popular local AI tools, including Ollama, local chatbots, and other on-device LLM solutions. By working directly with llama.cpp, you can reduce overhead, gain fine-grained control, and optimize performance for your specific hardware, making your local AI agents and applications faster and more configurable.
In this tutorial, I will guide you through building AI applications using llama.cpp, a powerful C/C++ library for running large language models (LLMs) efficiently. We will cover setting up a llama.cpp server, integrating it with LangChain, and building a ReAct agent capable of using tools like web search and a Python REPL.
1. Setting Up the llama.cpp Server
This section covers installing llama.cpp and its dependencies, configuring the build for CUDA support, compiling the necessary binaries, and running the server.
Note: we are using an NVIDIA RTX 4090 graphics card on a Linux operating system with the CUDA toolkit pre-configured. If you don't have access to similar local hardware, you can rent GPU instances from Vast.ai at a low price.


Screenshot from Vast.ai | Console
- Update your system's package list and install essential tools like build-essential, cmake, curl, and git. pciutils is included for hardware information, and libcurl4-openssl-dev is required for llama.cpp to download models from Hugging Face.
apt-get update
apt-get install pciutils build-essential cmake curl libcurl4-openssl-dev git -y
- Clone the official llama.cpp repository from GitHub and use cmake to configure the build.
# Clone llama.cpp repository
git clone https://github.com/ggml-org/llama.cpp

# Configure build with CUDA support
cmake llama.cpp -B llama.cpp/build \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_CUDA=ON \
    -DLLAMA_CURL=ON
- Compile llama.cpp and all of its tools, including the server. For convenience, copy all of the compiled binaries from the llama.cpp/build/bin/ directory to the main llama.cpp/ directory.
# Build all necessary binaries including server
cmake --build llama.cpp/build --config Release -j --clean-first
# Copy all binaries to main directory
cp llama.cpp/build/bin/* llama.cpp/
- Start the llama.cpp server with the unsloth/gemma-3-4b-it-GGUF model. The flags below offload all layers to the GPU, set an 8K-token context window, use all available CPU threads, quantize the key cache to q4_0, and enable the Jinja chat template.
./llama.cpp/llama-server \
    -hf unsloth/gemma-3-4b-it-GGUF:Q4_K_XL \
    --host 0.0.0.0 \
    --port 8000 \
    --n-gpu-layers 999 \
    --ctx-size 8192 \
    --threads $(nproc) \
    --temp 0.6 \
    --cache-type-k q4_0 \
    --jinja
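The model is downloaded on the first run, which can take a few minutes, so it helps to wait until the server reports ready before sending requests. As a lightweight readiness probe before the full chat-completion test below, you can poll the server's built-in /health endpoint. This is a minimal sketch in Python, assuming the requests package is installed and the server runs on localhost:8000:
import time

import requests

# Poll llama-server's /health endpoint until it responds with HTTP 200.
while True:
    try:
        if requests.get("http://localhost:8000/health", timeout=2).status_code == 200:
            print("Server is ready.")
            break
    except requests.ConnectionError:
        pass  # server not up yet (e.g. still downloading the model)
    time.sleep(2)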
- You can test whether the server is working correctly by sending a POST request using curl.
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "Hello! How are you today?"}
    ],
    "max_tokens": 150,
    "temperature": 0.7
  }'
Output:
{"selections":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"nOkay, user greeted me with a simple "Hello! How are you today?" nnHmm, this seems like a casual opening. The user might be testing the waters to see if I respond naturally, or maybe they genuinely want to know how an AI assistant conceptualizes "being" but in a friendly way. nnI notice they used an exclamation mark, which feels warm and possibly playful. Maybe they're in a good mood or just trying to make conversation feel less robotic. nnSince I don't have emotions, I should clarify that gently but still keep it warm. The response should acknowledge their greeting while explaining my nature as an AI. nnI wonder if they're asking because they're curious about AI consciousness, or just being polite"}}],"created":1749319250,"mannequin":"gpt-3.5-turbo","system_fingerprint":"b5605-5787b5da","object":"chat.completion","utilization":{"completion_tokens":150,"prompt_tokens":9,"total_tokens":159},"id":"chatcmpl-jNfif9mcYydO2c6nK0BYkrtpNXSnseV1","timings":{"prompt_n":9,"prompt_ms":65.502,"prompt_per_token_ms":7.278,"prompt_per_second":137.40038472107722,"predicted_n":150,"predicted_ms":1207.908,"predicted_per_token_ms":8.052719999999999,"predicted_per_second":124.1816429728092}}
2. Building an AI Agent with LangGraph and llama.cpp
Now, let's use LangGraph and LangChain to interact with the llama.cpp server and build a multi-tool AI agent.
- Set your Tavily API key for search capabilities.
- For LangChain to work with the local llama.cpp server (which emulates the OpenAI API), you can set OPENAI_API_KEY to local or any non-empty string, since base_url will direct requests to the local server.
export TAVILY_API_KEY="your_api_key_here"
export OPENAI_API_KEY=local
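If you are working in a notebook rather than a shell, the same variables can be set with os.environ before the LangChain imports. A small sketch; substitute your real Tavily key:
import os

# Equivalent to the shell exports above, for notebook sessions.
os.environ["TAVILY_API_KEY"] = "your_api_key_here"
os.environ["OPENAI_API_KEY"] = "local"  # any non-empty string works locally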
- Install the necessary Python libraries: langgraph for creating agents, tavily-python for the Tavily search tool, and the various langchain packages for LLM interactions and tools.
%%capture
!pip install -U \
    langgraph tavily-python langchain langchain-community langchain-experimental langchain-openai
- Configure ChatOpenAI from LangChain to communicate with your local llama.cpp server.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="unsloth/gemma-3-4b-it-GGUF:Q4_K_XL",
    temperature=0.6,
    base_url="http://localhost:8000/v1",
)
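Before building the agent, it is worth a one-line sanity check that LangChain can actually reach the server. A minimal sketch; the exact reply text will vary:
# Single-turn test against the local llama.cpp server.
reply = llm.invoke("In one short sentence, what is llama.cpp?")
print(reply.content)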
- Set up the tools that your agent will be able to use:
- TavilySearchResults: allows the agent to search the web.
- PythonREPLTool: provides the agent with a Python Read-Eval-Print Loop to execute code.
from langchain_community.tools import TavilySearchResults
from langchain_experimental.tools.python.tool import PythonREPLTool

search_tool = TavilySearchResults(max_results=5, include_answer=True)
code_tool = PythonREPLTool()
tools = [search_tool, code_tool]
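Each tool can also be exercised on its own, which makes it easier to separate tool failures from agent failures. A small sketch; the search call requires the TAVILY_API_KEY set earlier:
# Run the Python REPL tool directly, outside the agent loop.
print(code_tool.run("print(sum(range(10)))"))  # expect 45

# Run a standalone web search; returns a list of result dicts.
results = search_tool.invoke("llama.cpp GGUF quantization")
print(len(results), "results")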
- Use LangGraph's prebuilt create_react_agent function to create an agent that can reason and act (ReAct framework) using the LLM and the defined tools.
from langgraph.prebuilt import create_react_agent

agent = create_react_agent(
    model=llm,
    tools=tools,
)
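The agent is now ready to use. To watch the ReAct loop in action, you can stream intermediate steps instead of waiting only for the final answer. A minimal sketch using LangGraph's streaming interface:
# Stream agent state after each step to observe tool calls as they happen.
for step in agent.stream(
    {"messages": [{"role": "user", "content": "What is 21 * 2? Use Python."}]},
    stream_mode="values",
):
    step["messages"][-1].pretty_print()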
3. Testing the AI Agent with Example Queries
Now, we will test the AI agent and also display which tools the agent uses.
- This helper function extracts the names of the tools used by the agent from the conversation history, which is useful for understanding the agent's decision-making process.
def extract_tool_names(conversation: dict) -> list[str]:
    """Collect the names of all tools invoked in the agent's message history."""
    tool_names = set()
    for msg in conversation.get('messages', []):
        calls = []
        # Tool calls may live on a message attribute, in a dict, or in additional_kwargs.
        if hasattr(msg, 'tool_calls'):
            calls = msg.tool_calls or []
        elif isinstance(msg, dict):
            calls = msg.get('tool_calls') or []
            if not calls and isinstance(msg.get('additional_kwargs'), dict):
                calls = msg['additional_kwargs'].get('tool_calls', [])
        else:
            ak = getattr(msg, 'additional_kwargs', None)
            if isinstance(ak, dict):
                calls = ak.get('tool_calls', [])
        for call in calls:
            if isinstance(call, dict):
                if 'name' in call:
                    tool_names.add(call['name'])
                elif 'function' in call and isinstance(call['function'], dict):
                    fn = call['function']
                    if 'name' in fn:
                        tool_names.add(fn['name'])
    return sorted(tool_names)
- Define a function to run the agent with a given question and return the tools used along with the final answer.
def run_agent(question: str):
    # Invoke the ReAct agent with a single user message.
    result = agent.invoke({"messages": [{"role": "user", "content": question}]})
    # The last message in the final state holds the agent's answer.
    raw_answer = result["messages"][-1].content
    tools_used = extract_tool_names(result)
    return tools_used, raw_answer
- Let's ask the agent for the top 5 breaking news stories. It should use the tavily_search_results_json tool.
tools, answer = run_agent("What are the top 5 breaking news stories?")
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['tavily_search_results_json']
Here are the top 5 breaking news stories based on the provided sources:
1. **Gaza Humanitarian Crisis:** Ongoing conflict and challenges in Gaza, including the Eid al-Adha holiday and the retrieval of a Thai hostage's body.
2. **Russian Drone Attacks on Kharkiv:** Russia continues to target Ukrainian cities with drone and missile strikes.
3. **Wagner Group Departure from Mali:** The Wagner Group is leaving Mali after heavy losses, but Russia's Africa Corps remains.
4. **Trump-Musk Feud:** A dispute between former President Trump and Elon Musk could have implications for Tesla stock and the U.S. space program.
5. **Education Department Staffing Cuts:** The Biden administration is seeking Supreme Court intervention to block planned staffing cuts at the Education Department.
- Let's ask the agent to write and execute Python code for the Fibonacci sequence. It should use the Python_REPL tool.
tools, answer = run_agent(
    "Write a code for the Fibonacci sequence and execute it using Python REPL."
)
print("Tools used ➡️", tools)
print(answer)
Output:
Tools used ➡️ ['Python_REPL']
The Fibonacci sequence up to 10 terms is [0, 1, 1, 2, 3, 5, 8, 13, 21, 34].
Final Thoughts
In this guide, I used a small quantized LLM, which can struggle with accuracy, especially when it comes to selecting tools. If your goal is to build production-ready AI agents, I highly recommend running the latest full-sized models with llama.cpp. Larger and more recent models generally deliver better results and more reliable outputs.
It is important to note that setting up llama.cpp can be more challenging than user-friendly tools like Ollama. However, if you are willing to invest the time to debug, optimize, and tailor llama.cpp to your specific hardware, the performance gains and flexibility are well worth it.
One of the biggest advantages of llama.cpp is its efficiency: you don't need high-end hardware to get started. It runs well on regular CPUs and laptops without dedicated GPUs, making local AI accessible to almost everyone. And if you ever need more power, you can always rent an affordable GPU instance from a cloud provider.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.