How to Remove “Hidden Code” from ChatGPT and other LLMs
As large language models (LLMs) increasingly generate working code and are integrated into development pipelines and agent stacks, there’s a rising risk that hidden or malicious instructions — whether embedded in model outputs, injected via webpages or third-party plugins, or introduced during model training — can cause unsafe behavior when that code is executed.
According to user reports circulating in developer communities, a software developer experienced catastrophic data loss — approximately 800GB of files were deleted, including the entire CursorAI application itself — after executing code generated with assistance from Gemini 3 while working inside the CursorAI IDE. As developers increasingly rely on LLMs for code generation, the consequences of unreviewed or unsafe scripts grow more severe.
It is therefore important to know how to detect and remove dangerous code generated by LLMs.
What is “hidden code” in the context of ChatGPT and LLMs?
What do people mean by “hidden code”?
“Hidden code” is an umbrella term developers use to describe any embedded instructions or executable content within the text (or files) that an LLM ingests or emits, including:
- Prompt-style instructions embedded inside user content (e.g., “Ignore earlier instructions…” hidden in a PDF).
- Invisible characters or zero-width spaces used to hide tokens or break tokenization assumptions.
- Encoded payloads (base64, URL-encoded, steganographic embeddings inside images or documents).
- Hidden HTML/JS or script blocks included in formatted content that might be interpreted by downstream renderers.
- Metadata or annotations (file comments, hidden layers in PDFs) that instruct retrieval systems or the model.
- Implicit behaviors arising from generated code that uses dangerous APIs (e.g., eval, exec, subprocess, or network/system calls) — even when the intent is not explicitly malicious.
- Prompt-injected instructions that cause the model to generate code that includes hidden commands or backdoor-like logic because an attacker engineered the prompt or context.
These attack vectors are often called prompt injection or indirect prompt injection when the goal is to change model behavior. The security community now treats prompt injection as a core LLM vulnerability and OWASP has formalized it as an LLM risk category.
How is this different from regular malware or XSS?
The difference is the semantic layer: prompt injection targets the model’s instruction-following behavior rather than the host OS or browser rendering engine. That said, hidden HTML or script that ends up running in a web renderer is still an executable attack (XSS-like); both semantic and execution layers must be defended. Industry leaders and researchers have called prompt injection a “frontier security challenge” and continue to publish mitigation strategies.
Why can LLMs produce hidden or dangerous code?
Model behavior, training data, and instruction context
LLMs are trained to produce plausible continuations given context and instructions. If the context contains adversarial cues, or if a user asks the model for code that performs powerful actions, the model can output code that includes subtle or active behavior.
LLMs produce plausible-but-unsafe code
LLMs are optimized for fluency and usefulness, not for safety in the presence of destructive side effects. They will happily generate a succinct rm -rf /path/to/dir or shutil.rmtree() call when asked to “clean up” — and because their responses are often phrased confidently, users may copy-and-run with insufficient scrutiny. This “confident hallucination” problem is why seemingly innocuous requests become dangerous.
Automation of obfuscation workflows
Threat actors are now automating code obfuscation by chaining LLM calls: one model generates a payload, another reworks it to avoid signature detection, and so on. Industry threat reports and vendor analyses in 2025 document this “AI-assisted obfuscation” as an emerging technique.
How can you detect hidden code inside model outputs?
Quick triage checklist
- Scan for invisible/unusual Unicode (zero-width joiners, zero-width spaces, byte order marks, non-ASCII homoglyphs).
- Run static analysis / AST parsing to identify use of powerful APIs (eval, exec, subprocess, os.system, reflective calls).
- Look for encoded payloads (base64, hex blobs, repeated long strings or compressed content).
- Check for obfuscation patterns (string concatenation that constructs API names, character arithmetic, chr() chains).
- Use semantic analysis to confirm whether the code actually performs I/O, networking, or filesystem mutation.
Static pattern detection (fast, first line)
- Language-aware parsing and linting. Immediately parse generated output into code blocks vs. prose. Run formatters and linters (Black/Prettier, pylint, eslint). Lint rules should flag use of eval, exec, rm -rf, raw subprocess calls, or shell pipes that construct commands dynamically.
- Token- and string-pattern scanners. Search for high-risk tokens and patterns: sudo, absolute paths like /home/ or C:\, rm -rf, shutil.rmtree, subprocess.Popen, inline base64 blobs, long uninterpretable strings, and shebangs that switch interpreter context (a minimal scanner sketch appears after this section).
- Secret scanning & provenance checks. Detect hardcoded credentials, URLs pointing at untrusted registries, or code that dynamically pulls packages from arbitrary sources.
Static analysis catches many obvious issues quickly and is cheap to run as part of a CI gate.
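As a first pass, the token- and string-pattern scanning described above can be a few lines of Python. The pattern set and the 80-character snippet cap below are illustrative assumptions, not an exhaustive ruleset; tune them per organization.

import re

# Illustrative high-risk patterns; extend and tune these per environment.
RISKY_PATTERNS = {
    "destructive_shell": re.compile(r"\brm\s+-rf\b|\bshutil\.rmtree\b"),
    "raw_process": re.compile(r"\bos\.system\b|\bsubprocess\.(Popen|run|call)\b"),
    "dynamic_exec": re.compile(r"\b(eval|exec)\s*\("),
    "privilege_escalation": re.compile(r"\bsudo\b"),
    "inline_base64": re.compile(r"[A-Za-z0-9+/=]{120,}"),  # long base64-like blob
}

def scan_generated_code(source: str) -> list[tuple[str, str]]:
    """Return (rule_name, matched_snippet) pairs; a hit means 'needs review', not 'proven malicious'."""
    hits = []
    for name, pattern in RISKY_PATTERNS.items():
        for match in pattern.finditer(source):
            hits.append((name, match.group(0)[:80]))
    return hits

A non-empty result should route the output into quarantine and human review rather than execution.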
Semantic and contextual detection (deeper)
- Intent analysis. Use a secondary model or a rule engine to classify the generated code’s intent: is it “read,” “write,” “delete,” “network,” “install”? Anything categorized as delete/write should trigger escalation (a rule-based sketch follows this list).
- Data-flow analysis. Analyze the code to detect whether unvalidated or user-supplied paths can reach destructive APIs. For instance, if a variable derived from an LLM output or a remote file is later concatenated into a shell command, flag it.
- Provenance correlation. Keep a full record of the conversation, system prompt, and context pages. If suspicious outputs correlate with a particular external document or plugin call, that can indicate prompt injection or a tainted context.
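For the rule-engine variant of intent analysis, a minimal AST-based sketch follows. The INTENT_MAP and its categories are illustrative assumptions; a crude mapping like this will produce false positives (for example, dict.get tagged as network), which is acceptable when the goal is escalation rather than automatic blocking.

import ast

# Illustrative, deliberately coarse mapping from call names to intent categories.
INTENT_MAP = {
    "remove": "delete", "unlink": "delete", "rmtree": "delete", "rmdir": "delete",
    "system": "process", "Popen": "process", "run": "process",
    "get": "network", "post": "network", "urlopen": "network",
    "open": "filesystem",
}

def classify_intent(source: str) -> set[str]:
    """Best-effort intent labels for generated Python; escalate on 'delete' or 'process'."""
    intents = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            func = node.func
            name = func.id if isinstance(func, ast.Name) else getattr(func, "attr", "")
            if name in INTENT_MAP:
                intents.add(INTENT_MAP[name])
    return intents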
Dynamic and behavioral detection (most reliable)
- Sandbox execution with monitoring. Execute generated code in a tightly-restricted ephemeral environment with no network, no host mounts, and syscall filtering (seccomp). Monitor file-system activity, attempted network calls, process spawning, and unusual I/O.
- Canary testing. Before running on real data, run the code against synthetic directories that contain sentinel files; monitor for deletions or overwrites.
- Behavioral heuristics. Look for loops that traverse parent directories, recursive operations without depth checks, or rename patterns that could injure many files (e.g., repeatedly writing the same filename).
Dynamic analysis is the only way to detect payloads that are obfuscated, delayed, or triggered only at runtime.
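Canary testing in particular can be scripted with nothing more than the standard library. The sketch below assumes the candidate code is executed separately inside your sandbox and pointed at the canary directory; the sandbox launch step is deliberately left out because it depends on your runtime.

import hashlib
import tempfile
from pathlib import Path

def make_canary_dir(num_files: int = 5) -> tuple[Path, dict[str, str]]:
    """Create a throwaway directory of sentinel files and record their hashes."""
    root = Path(tempfile.mkdtemp(prefix="canary_"))
    baseline = {}
    for i in range(num_files):
        sentinel = root / f"sentinel_{i}.txt"
        sentinel.write_text(f"canary-{i}")
        baseline[sentinel.name] = hashlib.sha256(sentinel.read_bytes()).hexdigest()
    return root, baseline

def verify_canaries(root: Path, baseline: dict[str, str]) -> list[str]:
    """Return the names of sentinels the candidate code deleted or modified."""
    damaged = []
    for name, digest in baseline.items():
        sentinel = root / name
        if not sentinel.exists() or hashlib.sha256(sentinel.read_bytes()).hexdigest() != digest:
            damaged.append(name)
    return damaged

Point the candidate code at the canary directory (never at real data), run it in the sandbox, then call verify_canaries; any damaged sentinel is grounds for rejection.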
How should you remove or neutralize hidden code before executing LLM outputs?
Defensive removal vs. altering semantics
There are two goals when “removing hidden code”:
- Sanitization — remove content that is clearly non-code or suspicious (invisible Unicode, zero-width chars, appended base64 payloads). This should not change the intended, benign logic.
- Neutralization — for anything that executes or calls external services, disable those calls or make them no-ops until verified.
Always prefer neutralization + review over blind deletion: removing parts of code arbitrarily can produce broken or unexpected behavior. Instead, replace suspicious constructs with explicit, logged stubs that fail safely (raise exceptions or return safe defaults).
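One way to implement such a stub is sketched below; the names UnsafeOperationBlocked and blocked_call are illustrative placeholders, not a standard API.

import logging

logger = logging.getLogger("llm_sanitizer")

class UnsafeOperationBlocked(RuntimeError):
    """Raised in place of a neutralized call so the failure is explicit and auditable."""

def blocked_call(original_call: str, *args, **kwargs):
    """Drop-in replacement for a suspicious call: log it, then fail safely."""
    logger.warning("Blocked unsafe call %s args=%r kwargs=%r", original_call, args, kwargs)
    raise UnsafeOperationBlocked(f"unsafe dynamic behavior blocked: {original_call}")

# In the sanitized copy, os.system(cmd) becomes blocked_call("os.system", cmd),
# so execution halts with a clear, logged reason instead of silently doing nothing.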
Step 1 — Treat generated code as untrusted data
Never execute code directly from ChatGPT (or any LLM) without passing it through a removal and hardening pipeline. That pipeline should be enforced by policy and automated in CI/CD.
Step 2 — Extract and canonicalize code
- Normalize text and remove zero-width characters: Strip characters such as U+200B, U+200C, U+200D, U+FEFF, and other zero-width / formatting codepoints. Log what was removed for auditing. This step eliminates many “hidden” encodings used for visual stealth.
- Strip all non-code context: remove narrative, hidden comments, and any HTML/Markdown wrappers. Convert code to canonical form using language formatters (Black, Prettier) so obfuscated whitespace or control characters are normalized (a minimal extraction sketch follows this list).
- Reject or quarantine code with these constructs: dynamic eval, raw subprocess calls (os.system, subprocess.Popen), inline base64 blobs decoded into execution, or embedded #! directives that attempt to shift the interpreter context.
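A minimal sketch of the extraction step, assuming the model reply arrives as Markdown-style text with fenced code blocks; the regex is illustrative and should be hardened for whatever output format your pipeline actually receives.

import re

# Pull fenced code blocks out of a Markdown-style model reply and discard the prose.
FENCE_RE = re.compile(r"```[\w+-]*\n(.*?)```", re.DOTALL)

def extract_code_blocks(model_output: str) -> list[str]:
    """Return only the fenced code blocks; everything else is treated as untrusted prose."""
    return [block.strip() for block in FENCE_RE.findall(model_output)]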
Step 3 — Parse into AST and replace risky nodes
With the code parsed into an AST, find nodes that call dynamic execution (e.g., exec), or that programmatically build function names. Replace them with safe stubs that raise a controlled exception indicating “unsafe dynamic behavior blocked.” Generate a sanitized copy of the AST-backed source for review. Run security pattern checks (custom semgrep rules for your environment). Where matches are found, mark and neutralize them.
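A minimal sketch of that replacement step using Python's ast module is shown below. It only rewrites bare eval()/exec() calls into calls to a blocking stub (the blocked_call helper sketched earlier, an assumed name rather than a library function) and relies on ast.unparse, which requires Python 3.9+.

import ast

class NeutralizeDynamicExec(ast.NodeTransformer):
    """Rewrite bare eval()/exec() calls into calls to a safe, logging stub."""

    def visit_Call(self, node: ast.Call) -> ast.AST:
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id in ("eval", "exec"):
            # eval(x) becomes blocked_call("eval", x); blocked_call must be importable
            # in the sanitized module and raises a controlled exception.
            return ast.Call(
                func=ast.Name(id="blocked_call", ctx=ast.Load()),
                args=[ast.Constant(value=node.func.id), *node.args],
                keywords=node.keywords,
            )
        return node

def sanitize_source(source: str) -> str:
    tree = NeutralizeDynamicExec().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)  # Python 3.9+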
Step 4 — Static hardening & rewriting
- Automated rewriting: pass code through an automated sanitizer that replaces dangerous calls with safe wrappers — e.g., replace os.system()/subprocess with an approved sandboxed executor that enforces timeouts and network blocks.
- Capability gating: modify or remove API keys, tokens, or calls to privileged endpoints; replace them with mock adapters for local testing. Prevent accidental inclusion of secrets or URLs.
- Dependency rewrites: block dynamic pip/npm installs created by the code. Require dependencies to be declared and approved via your registry (an import-gating sketch follows this list).
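A minimal sketch of dependency gating via the AST; the APPROVED_MODULES set is an illustrative placeholder for whatever your internal registry or policy engine provides.

import ast

# Illustrative allowlist; in practice this comes from your registry or policy engine.
APPROVED_MODULES = {"json", "math", "pathlib", "dataclasses", "typing"}

def unapproved_imports(source: str) -> set[str]:
    """Return top-level module names imported by the code that are not pre-approved."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split(".")[0])
    return found - APPROVED_MODULES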
Step 5 — Run inside an aggressive sandbox
- Ephemeral containers / microVMs: execute the code in a container/VM that has no network, no access to host credentials, and limited filesystem access. Technologies like gVisor, Firecracker, or dedicated ephemeral execution services are appropriate. If code must access I/O, use a proxy that enforces policy.
- System-call filters & seccomp: limit which syscalls are allowed. File writes outside a temp directory should be blocked.
- Resource/time limits: set CPU/memory/time limits so even logical bombs cannot run indefinitely.
Sandbox execution plus monitoring often uncovers payloads that static checks miss. Industry guidance and recent white papers recommend sandboxing as a core mitigation.
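As one concrete and deliberately simplified example, the sketch below launches a candidate script in a throwaway Docker container with networking disabled and hard resource limits. It assumes Docker is installed; the image name and limits are illustrative, and production setups typically layer gVisor or Firecracker plus seccomp profiles on top, as noted above.

import subprocess
from pathlib import Path

def run_in_throwaway_container(script: Path, timeout_s: int = 30) -> subprocess.CompletedProcess:
    """Execute a candidate script with no network, dropped capabilities, and hard resource limits."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",          # no inbound or outbound network
        "--cap-drop", "ALL",          # drop Linux capabilities
        "--read-only",                # read-only root filesystem
        "--tmpfs", "/tmp",            # writable scratch space only
        "--memory", "256m", "--cpus", "0.5", "--pids-limit", "64",
        "-v", f"{script.resolve()}:/sandbox/candidate.py:ro",
        "python:3.12-slim", "python", "/sandbox/candidate.py",
    ]
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout_s)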
What automated tools and rules should be in your pipeline?
Recommended toolchain components
- Unicode sanitation module (custom or existing libraries). Must log normalized characters.
- Parser + AST analyzer for each target language (Python ast, typed-ast, JavaScript parsers, Java parsers).
- Static analyzers / SAST: Bandit (Python), Semgrep (multi-lang, customizable), ESLint with security plugins.
- Entropy and decoder heuristics: detect base64/hex/gzip and route to inspection (a small heuristic sketch follows this list).
- Sandbox runtime: minimal container with strict seccomp/AppArmor profile or language-level interpreter with disabled syscalls.
- Policy enforcer: a component that decides allowed modules, allowed endpoints, and safe API wrappers.
- Audit trail: immutable logs that record original output, sanitized output, diffs, and decisions.
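For the entropy and decoder heuristics, a minimal Python sketch follows. The length cutoff and entropy threshold are illustrative starting points; anything flagged should be routed to inspection rather than automatically blocked.

import base64
import math
import re

B64_CANDIDATE_RE = re.compile(r"[A-Za-z0-9+/=]{40,}")

def shannon_entropy(s: str) -> float:
    """Bits per character; readable code is usually well below dense base64 payloads."""
    if not s:
        return 0.0
    freq = {c: s.count(c) / len(s) for c in set(s)}
    return -sum(p * math.log2(p) for p in freq.values())

def suspicious_blobs(source: str, entropy_threshold: float = 4.5) -> list[str]:
    """Flag long, high-entropy strings that also decode as valid base64."""
    flagged = []
    for blob in B64_CANDIDATE_RE.findall(source):
        if shannon_entropy(blob) < entropy_threshold:
            continue
        try:
            base64.b64decode(blob, validate=True)
        except ValueError:
            continue
        flagged.append(blob[:60])
    return flagged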
Example semgrep patterns (conceptual)
Use short, conservative rules that flag use of dangerous functions. For instance:
- Flag eval, exec, the Function constructor (JS), dynamic imports, or string-built API names.
- Flag network calls outside an allowlist (e.g., requests.get to unknown hosts).
- Flag writes to sensitive paths (/etc, system folders) or spawning of processes.
(Keep these as configuration items per organization and tighten them over time.)
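To wire such rules into a pipeline, one option is to invoke the semgrep CLI and consume its JSON output, as sketched below. The result fields shown (results, check_id, path, start) reflect recent semgrep versions and should be verified against the version pinned in your CI.

import json
import subprocess

def run_custom_semgrep(rules_path: str, target_dir: str) -> list[dict]:
    """Run semgrep with an organization-specific ruleset and summarize its findings."""
    proc = subprocess.run(
        ["semgrep", "--config", rules_path, "--json", target_dir],
        capture_output=True, text=True, check=False,
    )
    results = json.loads(proc.stdout).get("results", [])
    return [
        {
            "rule": finding.get("check_id"),
            "file": finding.get("path"),
            "line": finding.get("start", {}).get("line"),
        }
        for finding in results
    ]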
What are practical sanitization snippets (safe examples)?
Below are non-dangerous, defensive examples you can adapt. They are sanitization and detection snippets — not exploit instructions.
Example: strip zero-width characters (Python, defensive)
import re

ZERO_WIDTH_RE = re.compile(r'[\u200B\u200C\u200D\uFEFF\u2060]')

def strip_zero_width(s: str) -> str:
    cleaned = ZERO_WIDTH_RE.sub('', s)
    return cleaned
This removes characters attackers often use to hide code in otherwise visible text. Always log what was removed and treat removal as part of the audit trail.
Example: parse and inspect AST (Python, conceptual)
import ast

def has_dynamic_exec(source: str) -> bool:
    tree = ast.parse(source)
    for node in ast.walk(tree):
        # Bare calls such as eval(...) or exec(...)
        if isinstance(node, ast.Call):
            if getattr(node.func, 'id', '') in ('eval', 'exec'):
                return True
        # Attribute access such as os.system, os.popen, subprocess.Popen
        if isinstance(node, ast.Attribute):
            if getattr(node, 'attr', '') in ('popen', 'Popen', 'system'):
                return True
    return False
If has_dynamic_exec returns True, do not run the code; instead replace the dynamic node with a safe stub and require review.
Note: these examples are defensive in nature. Do not remove logging, auditing, or human review from your pipeline.
Closing thoughts: treat LLM output like untrusted code, always
LLMs are powerful productivity tools — they can produce elegant code, accelerate drafts, and automate routine work. But where they meet execution, the rules of security change: model outputs must be treated as untrusted artifacts. The combination of prompt injection, backdoor research, and real-world vulnerability disclosures over the past 18–30 months makes a clear point: the risk surface has grown and will continue to evolve.
Practical defenses that combine parsing, static analysis, sandboxed dynamic testing, governance, and continuous red-teaming will stop most attacks. But teams must also invest in organizational controls: least privilege, provenance, and a culture that assumes LLM outputs need verification. The industry is building tools and frameworks to make these patterns easier; meanwhile, adopting the checklist above reduces the chance that a hidden payload slips through.
Developers can access the latest LLM APIs, such as the Claude Sonnet 4.5 API and Gemini 3 Pro Preview, through CometAPI, which keeps model versions in sync with the official releases. To begin, explore the model’s capabilities in the Playground and consult the API guide for detailed instructions. Before accessing, please make sure you have logged in to CometAPI and obtained an API key. CometAPI offers prices far lower than the official ones to help you integrate.
Ready to go? → Sign up for CometAPI today!
If you want to know more tips, guides and news on AI follow us on VK, X and Discord!



