
Deep dive into Docker Model Runner


TL;DR: Docker Model Runner represents a paradigm shift in local AI development, bringing the familiarity and reliability of Docker workflows to large language model inference. Unlike traditional containerized solutions, it runs models directly on the host for optimal performance while maintaining Docker’s ecosystem benefits.

Introduction

The landscape of AI development is undergoing a fundamental transformation. Local development for applications powered by LLMs is gaining momentum, and for good reason. Privacy concerns, cost optimization, and the need for offline functionality are driving developers away from cloud-based APIs toward local inference solutions.

Enter Docker Model Runner, a beta feature introduced with Docker Desktop 4.40 for macOS on Apple silicon (and since rolled out to other platforms) that promises to revolutionize how developers build, test, and deploy AI-powered applications. This isn’t just another local inference tool; it’s a complete reimagining of how AI models fit into modern development workflows.

What is Docker Model Runner?

Docker Model Runner is designed to make AI model execution as simple as running a container. This beta release gives developers a fast, low-friction way to run models, test them, and iterate on application code that uses models locally, without all the usual setup headaches.

At its core, Docker Model Runner is a lightweight runtime integrated directly into Docker Desktop that allows developers to pull, run, and manage AI models using familiar Docker commands. But there’s a crucial architectural difference that sets it apart from traditional containerized solutions.

Key Characteristics

  • Host-Native Execution: Unlike typical Docker workloads, Model Runner doesn’t run the AI model in a container. Instead, Docker Desktop runs the inference engine (currently `llama.cpp`) directly on your host machine.

  • OCI Artifact Distribution: Models are packaged as OCI Artifacts, an open standard that lets you distribute and version them through the same registries and workflows you already use for containers.

  • OpenAI API Compatibility: Docker Model Runner exposes an OpenAI-compatible API, making integration with existing tools and libraries seamless.
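Because the API is OpenAI-compatible, any client that can target a custom base URL will work against Model Runner. A minimal sketch using only Python’s standard library (the port assumes the TCP enablement shown later in this post, and `ai/smollm2` is just an example model):

```python
import json
import urllib.request

# Assumes Model Runner's TCP endpoint is enabled on port 12434.
BASE_URL = "http://localhost:12434/engines/llama.cpp/v1"

def build_chat_body(model, prompt):
    """Build an OpenAI-style chat-completions request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(model, prompt):
    """POST a chat completion to the local inference server and return the reply text."""
    req = urllib.request.Request(
        BASE_URL + "/chat/completions",
        data=json.dumps(build_chat_body(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

The same `build_chat_body` payload would work against any OpenAI-compatible endpoint; only the base URL changes.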

Host-Native Approach

The most significant architectural decision in Docker Model Runner is its departure from traditional containerization for model execution. When you run a model, Docker calls an Inference Server API endpoint hosted by the Model Runner through Docker Desktop, which exposes an OpenAI-compatible API. The Inference Server uses `llama.cpp` as the inference engine, running as a native host process.

This design choice delivers several critical advantages:

  • Performance Optimization: By using host-based execution, we avoid the performance limitations of running models inside virtual machines. This translates to significantly faster inference times, especially on Apple Silicon where direct Metal API access is crucial.

  • GPU Acceleration: Apple Silicon’s Metal API is used for GPU acceleration, providing native performance without the overhead of virtualization layers.

  • Memory Efficiency: The model will stay in memory until another model is requested, or until a pre-defined inactivity timeout (currently 5 minutes) is reached.

API Architecture

GET /engines/llama.cpp/v1/models
POST /engines/llama.cpp/v1/chat/completions
POST /engines/llama.cpp/v1/completions
POST /engines/llama.cpp/v1/embeddings
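All four routes follow the standard OpenAI response shapes, so parsing is straightforward. As a sketch, listing models over the TCP endpoint (the port is an assumption based on the enablement command in the next section; `model_ids` extracts identifiers from the usual `{"data": [...]}` payload):

```python
import json
import urllib.request

# Assumes Model Runner's TCP endpoint is enabled on port 12434.
BASE_URL = "http://localhost:12434/engines/llama.cpp/v1"

def model_ids(payload):
    """Extract model identifiers from an OpenAI-style model-list payload."""
    return [m["id"] for m in payload.get("data", [])]

def list_models():
    """GET /models from the local inference server."""
    with urllib.request.urlopen(BASE_URL + "/models") as resp:
        return model_ids(json.load(resp))
```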

Enabling Docker Model Runner

Run:

docker desktop enable model-runner --tcp 12434

Verifying Docker Model Runner

docker model status

Usage of Docker Model Runner

docker model list - Lists all available models
docker model pull ai/smollm2 - Pulls a model from Docker Hub
docker model ls - Lists all downloaded models
docker model rm ai/smollm2 - Removes a model
docker model version - Shows the Docker Model Runner version
docker model run ai/smollm2 "Explain Cloud Computing" - Runs a model with a one-shot prompt
docker model run ai/smollm2 - Starts an interactive chat with the model
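These commands are also easy to script. A hedged sketch wrapping `docker model run` with Python’s `subprocess` (it assumes the Docker CLI with the Model Runner plugin is on your PATH):

```python
import subprocess

def model_run_cmd(model, prompt=None):
    """Build the argv for `docker model run`; omitting the prompt would start interactive mode."""
    cmd = ["docker", "model", "run", model]
    if prompt is not None:
        cmd.append(prompt)
    return cmd

def ask(model, prompt):
    """Run a one-shot prompt against a local model and return its stdout."""
    result = subprocess.run(model_run_cmd(model, prompt), capture_output=True, text=True)
    return result.stdout
```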

Docker

Part 1 of 1

In this series, I will help you uncover tips and tricks within Docker. The tips shared will be beneficial for you to start your hacking journey!