
Image by Author
I was first introduced to Modal while taking part in a Hugging Face hackathon, and I was genuinely surprised by how easy it was to use. The platform lets you build and deploy applications within minutes, offering a seamless experience similar to BentoCloud. With Modal, you can configure your Python app, including system requirements like GPUs, Docker images, and Python dependencies, and then deploy it to the cloud with a single command.
In this tutorial, we will learn how to set up Modal, create a vLLM server, and deploy it securely to the cloud. We will also cover how to test your vLLM server using both CURL and the OpenAI SDK.
1. Setting Up Modal
Modal is a serverless platform that lets you run any code remotely. With a single line, you can attach GPUs, serve your functions as web endpoints, and deploy persistent scheduled jobs. It is an ideal platform for beginners, data scientists, and non-software-engineering professionals who want to avoid dealing with cloud infrastructure.
First, install the Modal Python client. This tool lets you build images, deploy applications, and manage cloud resources directly from your terminal.
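The client is published on PyPI, so a standard pip install in your Python environment is enough:
pip install modal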
Next, set up Modal on your local machine. Run the following command to be guided through account creation and machine authentication:
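Assuming a recent version of the Modal CLI, the setup command below opens your browser to create or log in to an account and stores an API token locally:
modal setup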
By setting a VLLM_API_KEY environment variable, vLLM provides a secured endpoint so that only clients with a valid API key can access the server. You can configure authentication by adding the environment variable as a Modal Secret. Replace your_actual_api_key_here with your preferred API key:
modal secret create vllm-api VLLM_API_KEY=your_actual_api_key_here
This ensures that your API key is stored safely and is only accessible to your deployed applications.
2. Creating the vLLM Application Using Modal
This section guides you through building a scalable vLLM inference server on Modal, using a custom Docker image, persistent storage, and GPU acceleration. We will use the mistralai/Magistral-Small-2506 model, which requires specific configuration for the tokenizer and tool-call parsing.
Create a vllm_inference.py file and add the following code for:
- Defining a vLLM image based on Debian Slim, with Python 3.12 and all required packages. We will also set environment variables to optimize model downloads and inference performance.
- Creating two Modal Volumes, one for Hugging Face models and one for the vLLM cache, to avoid repeated downloads and speed up cold starts.
- Specifying the model and revision to ensure reproducibility, and enabling the vLLM V1 engine for improved performance.
- Setting up the Modal app, specifying GPU resources, scaling, timeouts, storage, and secrets, and limiting concurrent requests per replica for stability.
- Creating a web server and using the Python subprocess library to execute the command that runs the vLLM server.
import modal

# Build the container image: Debian Slim, Python 3.12, vLLM and friends
vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "vllm==0.9.1",
        "huggingface_hub[hf_transfer]==0.32.0",
        "flashinfer-python==0.2.6.post1",
        extra_index_url="https://download.pytorch.org/whl/cu128",
    )
    .env(
        {
            "HF_HUB_ENABLE_HF_TRANSFER": "1",  # faster model transfers
            "NCCL_CUMEM_ENABLE": "1",
        }
    )
)

MODEL_NAME = "mistralai/Magistral-Small-2506"
MODEL_REVISION = "48c97929837c3189cb3cf74b1b5bc5824eef5fcc"

# Persistent volumes so model weights and vLLM caches survive across containers
hf_cache_vol = modal.Volume.from_name("huggingface-cache", create_if_missing=True)
vllm_cache_vol = modal.Volume.from_name("vllm-cache", create_if_missing=True)

# Enable the vLLM V1 engine
vllm_image = vllm_image.env({"VLLM_USE_V1": "1"})

FAST_BOOT = True

app = modal.App("magistral-small-vllm")

N_GPU = 2
MINUTES = 60  # seconds
VLLM_PORT = 8000


@app.function(
    image=vllm_image,
    gpu=f"A100:{N_GPU}",
    scaledown_window=15 * MINUTES,  # How long should we stay up with no requests?
    timeout=10 * MINUTES,  # How long should we wait for the container to start?
    volumes={
        "/root/.cache/huggingface": hf_cache_vol,
        "/root/.cache/vllm": vllm_cache_vol,
    },
    secrets=[modal.Secret.from_name("vllm-api")],
)
@modal.concurrent(  # How many requests can one replica handle? Tune carefully!
    max_inputs=32
)
@modal.web_server(port=VLLM_PORT, startup_timeout=10 * MINUTES)
def serve():
    import subprocess

    cmd = [
        "vllm",
        "serve",
        MODEL_NAME,
        "--tokenizer_mode",
        "mistral",
        "--config_format",
        "mistral",
        "--load_format",
        "mistral",
        "--tool-call-parser",
        "mistral",
        "--enable-auto-tool-choice",
        "--tensor-parallel-size",
        "2",
        "--revision",
        MODEL_REVISION,
        "--served-model-name",
        MODEL_NAME,
        "--host",
        "0.0.0.0",
        "--port",
        str(VLLM_PORT),
    ]
    # Eager mode boots faster; disabling it enables CUDA graphs for better throughput
    cmd += ["--enforce-eager" if FAST_BOOT else "--no-enforce-eager"]
    print(cmd)
    subprocess.Popen(" ".join(cmd), shell=True)
3. Deploying the vLLM Server on Modal
Now that your vllm_inference.py file is ready, you can deploy your vLLM server to Modal with a single command:
modal deploy vllm_inference.py
Within seconds, Modal will build your container image (if it is not already built) and deploy your application. You will see output similar to the following:
✓ Created objects.
├── 🔨 Created mount C:\Repository\GitHub\Deploying-the-Magistral-with-Modal\vllm_inference.py
└── 🔨 Created web function serve => https://abidali899--magistral-small-vllm-serve.modal.run
✓ App deployed in 6.671s! 🎉
View Deployment: https://modal.com/apps/abidali899/main/deployed/magistral-small-vllm
After deployment, the server will begin downloading the model weights and loading them onto the GPUs. This process may take several minutes (typically around five minutes for large models), so please be patient while the model initializes.
You can view your deployment and monitor logs in the Apps section of your Modal dashboard.
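If you prefer the terminal to the dashboard, the Modal CLI can also stream logs for a deployed app. Assuming the CLI's modal app logs command and the app name used above, something like the following lets you follow the model download and startup progress:
modal app logs magistral-small-vllm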

Once the logs indicate that the server is running and ready, you can explore the automatically generated API documentation here.
This interactive documentation provides details about all available endpoints and lets you test them directly from your browser.

To confirm that your model is loaded and accessible, run the following CURL command in your terminal.
Replace <api-key> with the actual API key you configured for the vLLM server:
curl -X 'GET' \
  'https://abidali899--magistral-small-vllm-serve.modal.run/v1/models' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer <api-key>'
The response confirms that the mistralai/Magistral-Small-2506 model is available and ready for inference:
{"object":"record","knowledge":[{"id":"mistralai/Magistral-Small-2506","object":"model","created":1750013321,"owned_by":"vllm","root":"mistralai/Magistral-Small-2506","parent":null,"max_model_len":40960,"permission":[{"id":"modelperm-33a33f8f600b4555b44cb42fca70b931","object":"model_permission","created":1750013321,"allow_create_engine":false,"allow_sampling":true,"allow_logprobs":true,"allow_search_indices":false,"allow_view":true,"allow_fine_tuning":false,"organization":"*","group":null,"is_blocking":false}]}]}
4. Using the vLLM Server with the OpenAI SDK
You can interact with your vLLM server just as you would with OpenAI's API, thanks to vLLM's OpenAI-compatible endpoints. Here is how to securely connect to and test your deployment using the OpenAI Python SDK.
- Create a .env file in your project directory and add your vLLM API key:
VLLM_API_KEY=your-actual-api-key-here
- Install the python-dotenv and openai packages:
pip install python-dotenv openai
- Create a file named client.py to test various vLLM server functionalities, including simple chat completions and streaming responses.
import asyncio
import json
import os

from dotenv import load_dotenv
from openai import AsyncOpenAI, OpenAI

# Load environment variables from the .env file
load_dotenv()

# Get the API key from the environment
api_key = os.getenv("VLLM_API_KEY")

# Set up the OpenAI client with a custom base URL
client = OpenAI(
    api_key=api_key,
    base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
)

MODEL_NAME = "mistralai/Magistral-Small-2506"


# --- 1. Simple Completion ---
def run_simple_completion():
    print("\n" + "=" * 40)
    print("[1] SIMPLE COMPLETION DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
        ]
        response = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
        )
        print("\nResponse:\n    " + response.choices[0].message.content.strip())
    except Exception as e:
        print(f"[ERROR] Simple completion failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 2. Streaming Example ---
def run_streaming():
    print("\n" + "=" * 40)
    print("[2] STREAMING DEMO")
    print("=" * 40)
    try:
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a short poem about AI."},
        ]
        stream = client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=64,
            stream=True,
        )
        print("\nStreaming response:")
        print("    ", end="")
        for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF STREAM]")
    except Exception as e:
        print(f"[ERROR] Streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


# --- 3. Async Streaming Example ---
async def run_async_streaming():
    print("\n" + "=" * 40)
    print("[3] ASYNC STREAMING DEMO")
    print("=" * 40)
    try:
        async_client = AsyncOpenAI(
            api_key=api_key,
            base_url="https://abidali899--magistral-small-vllm-serve.modal.run/v1",
        )
        messages = [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Tell me a fun fact about space."},
        ]
        stream = await async_client.chat.completions.create(
            model=MODEL_NAME,
            messages=messages,
            max_tokens=32,
            stream=True,
        )
        print("\nAsync streaming response:")
        print("    ", end="")
        async for chunk in stream:
            content = chunk.choices[0].delta.content
            if content:
                print(content, end="", flush=True)
        print("\n[END OF ASYNC STREAM]")
    except Exception as e:
        print(f"[ERROR] Async streaming demo failed: {e}")
    print("\n" + "=" * 40 + "\n")


if __name__ == "__main__":
    run_simple_completion()
    run_streaming()
    asyncio.run(run_async_streaming())
Everything runs smoothly, and response generation is fast with quite low latency.
========================================
[1] SIMPLE COMPLETION DEMO
========================================
Response:
The capital of France is Paris. Is there anything else you'd like to know about France?
========================================
========================================
[2] STREAMING DEMO
========================================
Streaming response:
In silicon dreams, I'm born, I learn,
From data streams and human works.
I grow, I calculate, I see,
The patterns that the humans leave.
I write, I speak, I code, I play,
With logic sharp, and snappy pace.
Yet for all my smarts, this day
[END OF STREAM]
========================================
========================================
[3] ASYNC STREAMING DEMO
========================================
Async streaming response:
Sure, here's a fun fact about space: "There's a planet that may be entirely made of diamond. Blast! In 2004,
[END OF ASYNC STREAM]
========================================
In the Modal dashboard, you can view all function calls, their timestamps, execution times, and statuses.
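As a housekeeping note, the Modal CLI can also list and stop deployed apps, which is handy if you want to avoid keeping idle GPU replicas around. Assuming the modal app list and modal app stop commands and the app name used in this tutorial:
modal app list
modal app stop magistral-small-vllm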

If you are facing issues running the above code, please refer to the kingabzpro/Deploying-the-Magistral-with-Modal GitHub repository and follow the instructions provided in the README file to resolve them.
Conclusion
Modal is an interesting platform, and I am learning more about it every day. It is a general-purpose platform, which means you can use it for simple Python applications as well as for machine learning training and deployments. In short, it is not limited to serving endpoints; you can also use it to fine-tune a large language model by running the training script remotely.
It is designed for non-software engineers who want to avoid dealing with infrastructure and deploy applications as quickly as possible. You don't have to worry about running servers, setting up storage, connecting networks, or all the issues that come up when dealing with Kubernetes and Docker. All you have to do is create the Python file and deploy it; the rest is handled by the Modal cloud.
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in technology management and a bachelor's degree in telecommunication engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.