Binary patch as the only artifact for n-day research
When a Linux distribution releases a security update, a window opens in which the vulnerable and the patched binaries are available at the same time. For n-day vulnerability research, this pair of ELF files is the main source of information: not the NVD entry (which often contains a single sentence), not an upstream commit, but two binaries, old and new. More details are in our Binary Vulnerabilities Analysis.
The upstream Linux kernel follows a “bugs are bugs” culture: security fixes often land in public commits without explicit labeling. A commit message may say “fix bounds check” or “avoid in-place operation” without referencing a CVE identifier. Distributions (Ubuntu, Debian, RHEL) collect these fixes into binary packages, and the diff between two versions of a .deb or .rpm becomes the primary artifact for patch-based vulnerability discovery.
The task is stated directly: given a pair of binary packages (vulnerable and patched), determine which function was changed for security reasons, which vulnerability class the change closed, through which input surface the vulnerable code is reachable, and how program behavior differs between the versions. Done manually, this takes from hours to weeks, depending on the size of the diff and the number of candidate functions. Patch2Vuln tests how far an LLM agent can automate the recovery of vulnerabilities from patches in the complete absence of auxiliary information.
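Before any function-level diffing, the first practical step is to narrow a package pair down to the ELF files that actually changed. The sketch below is a minimal illustration, not part of Patch2Vuln: it assumes both packages were already unpacked into two directories (for example with `dpkg-deb -x`), and the helper names `elf_digests` and `changed_elf_files` are hypothetical.

```python
import hashlib
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def elf_digests(root):
    """Map relative path -> SHA-256 for every regular ELF file under root."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_symlink() or not path.is_file():
            continue
        data = path.read_bytes()
        if data[:4] != ELF_MAGIC:
            continue  # scripts, docs, configs are ignored
        digests[str(path.relative_to(root))] = hashlib.sha256(data).hexdigest()
    return digests


def changed_elf_files(old_root, new_root):
    """Relative paths of ELF files present in both trees whose content differs."""
    old, new = elf_digests(old_root), elf_digests(new_root)
    return sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
```

Only the files returned by `changed_elf_files` need to be handed to a binary differ; files that were added or removed outright would have to be handled separately.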
The work deliberately excludes the following from the agent's input: the CVE page, the distribution's security advisory (USN), the source patch, the package changelog, public PoCs, and web access. The only source is the binary artifacts. This simulates a real scenario: the package has been updated, but the advisory has not yet been published or is intentionally terse.
Agent Loop: from raw diff to structured audit
The agent receives a dossier for each candidate function and passes through three stages. Each stage produces an artifact.
Preliminary Audit. Based on static analysis (decompiled diff, symbols, strings, call context), the agent forms an initial hypothesis: which vulnerability class the patch closes, through which input the patched code is reachable, and how confident the agent is. The output is structured JSON plus Markdown, suitable for both machine and human processing.
Bounded Validation Plan. The agent formulates a concrete plan for testing the hypothesis: the inputs for the old and new binaries and the expected difference in behavior (a crash, a sanitizer signal, a difference in output). Validation is limited to local execution in Docker. No networking, shellcode, or exploit chains: the system explicitly does not generate exploit code.
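The old/new comparison described above can be illustrated with a small local harness. This is a sketch under assumptions, not the Patch2Vuln implementation: the function names are hypothetical, the only sanitizer it recognizes is AddressSanitizer (by looking for that word in stderr), and it is meant to run inside an already isolated container.

```python
import subprocess


def run_once(argv, timeout=10):
    """Run one command locally and summarize its observable behavior."""
    proc = subprocess.run(argv, capture_output=True, timeout=timeout)
    return {
        "returncode": proc.returncode,
        # On POSIX a negative return code means the process died from a signal.
        "killed_by_signal": proc.returncode < 0,
        # Assumption: an ASan-instrumented build prints a report to stderr.
        "sanitizer_report": b"AddressSanitizer" in proc.stderr,
        "stdout": proc.stdout,
    }


def behavioral_differential(old_argv, new_argv, timeout=10):
    """Run old and new binaries on the same input; return only the fields that differ."""
    old = run_once(old_argv, timeout)
    new = run_once(new_argv, timeout)
    return {key: (old[key], new[key]) for key in old if old[key] != new[key]}
```

An empty result means the two versions are indistinguishable on that input, which is the expected outcome for a negative control. A hang surfaces as `subprocess.TimeoutExpired`, which a caller would need to catch and record as its own signal.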
Final Audit. The agent integrates the validation results with the preliminary audit. If validation confirmed the hypothesis (the old binary crashes on the prepared input, the new one does not), confidence increases. If not, the agent lowers confidence or revises the root-cause class.
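The confidence update in the final stage can be sketched as a one-step move on an ordered scale. The scale and the function name below are assumptions made for illustration; the source does not specify how the agent represents confidence internally, and this sketch models only the "lower confidence" branch, not a revision of the root-cause class.

```python
# Hypothetical ordered confidence scale, lowest to highest.
LEVELS = ["low", "medium", "medium-high", "high"]


def finalize_audit(preliminary, validation_confirmed):
    """Return a final audit dict derived from the preliminary one."""
    final = dict(preliminary)  # do not mutate the preliminary artifact
    idx = LEVELS.index(preliminary["confidence"])
    if validation_confirmed:
        idx = min(idx + 1, len(LEVELS) - 1)  # raise by one step, capped at the top
    else:
        idx = max(idx - 1, 0)  # lower by one step, floored at the bottom
    final["confidence"] = LEVELS[idx]
    final["validated"] = validation_confirmed
    return final
```

Keeping the preliminary dict untouched preserves both artifacts, so a later diagnosis can compare what the agent believed before and after validation.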
The conceptual structure of the final audit for the tcpdump case:
{
  "patch_family": "filter_file_bounds_check",
  "root_cause_class": "out-of-bounds-read",
  "input_surface": "crafted BPF filter file via -F flag",
  "confidence": "medium-high",
  "static_evidence": "bounds check added in read_filter()",
  "validation_result": "old crashes on oversized filter; new rejects"
}
The three stages are not a formality. Separating the preliminary audit from the final one makes it possible to pinpoint where an error occurred: either the model misinterpreted the diff (a reasoning failure) or validation did not work. This diagnosis is critical for improving the system.
Benchmark results: 25 pairs of Ubuntu .deb packages
Patch2Vuln is evaluated on 25 pairs of Ubuntu .deb packages: 20 real security updates and 5 negative controls (packages without security fixes). Each pair is manually verified against ground truth: the source patch and the mapping of binary functions. The benchmark is not synthetic: these are real updates from the Ubuntu security pipeline.
Aggregated metrics:
• Patch localization: security-relevant function identified in 10 of 20 security pairs (50%)
• Root-cause class: correctly determined in 11 of 20 (55%); in one case the root cause was determined correctly without exact localization of the function, through indirect evidence in the call graph
• Negative controls: all 5 correctly classified as "unknown", with no false positives and no generated validation differential
• Behavioral validation: 2 of 20 pairs (both tcpdump) yielded a minimized old/new behavioral differential, that is, a real, reproducible difference in how a particular input is processed
The reasons for failure are the most valuable part of the work. Oracle diagnostics (an evaluator with full access to ground truth) showed:
• 6 pairs: failure before the model's reasoning even begins. The binary differ or ranker did not include the security-relevant function in the candidate set. The LLM could not have seen it in principle: there was nothing to analyze
• 1 pair: context export error. The function was among the candidates, but the decompiled context was truncated or passed to the agent incorrectly
• 3 pairs: model reasoning errors. The function was visible and the context was passed, but the agent misinterpreted the diff
The 6:1:3 ratio (differ : export : model) is the key result. The main bottleneck in automated binary patch analysis is not the LLM component but binary diffing and ranking. Ghidriff, like BinDiff or Diaphora, does not guarantee that the security-relevant change ends up near the top. The function can be inlined by the compiler at -O2, spread across several call sites after LTO, or masked by mass changes, for example by stack canaries being added to all functions.
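The failure attribution is simple arithmetic over the reported counts, and writing it out makes the bottleneck explicit. The snippet below uses only the numbers stated above; the variable names are illustrative.

```python
# Counts reported by the oracle diagnostics for the 20 security pairs.
failures = {"differ/ranker": 6, "context export": 1, "model reasoning": 3}
security_pairs = 20
localized = 10

total_failures = sum(failures.values())
# The three failure buckets account for every pair that was not localized.
assert total_failures == security_pairs - localized

shares = {stage: count / total_failures for stage, count in failures.items()}
# Failures that happen before the model receives usable input.
pre_model_share = (failures["differ/ranker"] + failures["context export"]) / total_failures

for stage, share in shares.items():
    print(f"{stage}: {share:.0%} of failures")
print(f"before model reasoning: {pre_model_share:.0%} of failures")
```

Seven of the ten failures occur before the model reasons over a usable context, so improving the model alone could recover at most the remaining three pairs.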
tcpdump (bionic): the agent successfully identified a vulnerability in filter file processing. The decompiled diff showed a bounds check added to the functions that read the BPF filter. The root cause was classified as an out-of-bounds read, and bounded validation confirmed it: the old binary read past the end of the buffer on a specially crafted input, while the new one correctly rejected it.
Expat: strong static reconstruction. The agent correctly determined the root-cause class in the XML parser but failed to build a working trigger. For parsers this is a typical story: the vulnerable path requires a specific combination of nested XML structures, which is hard to generate automatically. The static audit is correct; behavioral validation is missing.
Libarchive: localization succeeded, but the final root-cause class is conservative: “possible memory safety issue” instead of a specific CWE type. For n-day research, this caution is justified. A conservative final class is better than a false positive with high confidence. That's what I'd write in the report, too.
LLM-powered patch diffing in the industry: Bishop Fox approach
Patch2Vuln is not the only attempt to apply LLMs to binary vulnerability analysis. Bishop Fox uses a different stack and a different methodology, which makes it possible to separate the invariant patterns from the limitations specific to each approach.
Bishop Fox uses Binary Ninja for decompilation and BinDiff for comparison instead of the Ghidra / Ghidriff combination. The LLM receives two prompts: the first contains the decompiled functions with a request to name and describe them; the second contains the advisory text plus the results of the first prompt, with a request to rank the functions by relevance.
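The two-prompt structure can be sketched without committing to any particular model API. Everything below is a hypothetical illustration: `llm` stands for any callable that takes a prompt string and returns text, and the prompt wording is invented here, not taken from Bishop Fox.

```python
def describe_functions(llm, decompiled):
    """Stage 1: ask for a name and short description of each changed function."""
    summaries = {}
    for func_id, code in decompiled.items():
        prompt = (
            "Suggest a descriptive name and a one-sentence summary "
            "for this decompiled function:\n" + code
        )
        summaries[func_id] = llm(prompt)
    return summaries


def rank_functions(llm, advisory, summaries):
    """Stage 2: ask for function identifiers ordered by relevance to the advisory."""
    listing = "\n".join(f"{func_id}: {text}" for func_id, text in summaries.items())
    prompt = (
        "Advisory:\n" + advisory
        + "\n\nChanged functions:\n" + listing
        + "\n\nReturn the function identifiers, most relevant first, one per line."
    )
    reply = llm(prompt)
    known = set(summaries)
    # Keep only identifiers that really exist, in the order the model gave them.
    return [line.strip() for line in reply.splitlines() if line.strip() in known]
```

Filtering the reply against the known identifiers guards against the model inventing function names, which matters when the ranking feeds a Top-N metric.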
Bishop Fox gives the agent the vendor advisory. Patch2Vuln intentionally runs without it. In real n-day research both scenarios occur: USN/RHSA/DSA advisories are available more often than it seems, but they often contain minimal technical detail.
Bishop Fox tested the workflow on four CVEs with CVSS 9.4+ and three models (Claude Haiku 3.5, Claude Sonnet 3.7, Claude Sonnet 4):
• Information disclosure (27 modified functions): all three models placed the vulnerable function in the Top 5 in every run, the reference result
• Format string injection (134 functions): Sonnet 3.7 and Sonnet 4 reached the Top 5 in 100% of cases; Haiku failed completely
• Authorization bypass (1400+ functions): Sonnet 3.7 showed an acceptable result, but the average cost was $35 per test
• Stack buffer overflow (708 functions): all three models failed. The reasons are the extremely terse advisory and the noise from the mass addition of stack canaries in the diff. Even after canary changes were filtered out, Haiku 3.5 only reached the Top 25 in 7 of 9 runs
Bishop Fox's overall result: 66% of tests ended with the vulnerable function in the Top 25. The stack buffer overflow failure is telling: when the advisory is minimal and the diff is noisy with changes of the same type, the model loses its anchor for ranking. Patch2Vuln has no advisory at all, but it also receives no false signals from a misleading one.
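Filtering out canary-only changes, as mentioned for the stack buffer overflow case, can be approximated with a textual heuristic over the changed lines of each function. This is a sketch under assumptions: the marker list is a guess at how stack-protector code shows up in decompiler output (`__stack_chk_fail` and `__stack_chk_guard` are the usual GCC/glibc symbols, `in_FS_OFFSET` is how Ghidra typically renders the x86-64 canary load), and it is not the filter Bishop Fox used.

```python
# Assumed textual markers of stack-protector plumbing in decompiled code.
CANARY_MARKERS = ("__stack_chk_fail", "__stack_chk_guard", "in_FS_OFFSET")


def is_canary_only(changed_lines):
    """True if every non-empty changed line looks like stack-protector code."""
    lines = [line for line in changed_lines if line.strip()]
    return bool(lines) and all(
        any(marker in line for marker in CANARY_MARKERS) for line in lines
    )


def drop_canary_noise(function_diffs):
    """Keep only functions whose diff contains something besides canary changes."""
    return {
        name: lines
        for name, lines in function_diffs.items()
        if not is_canary_only(lines)
    }
```

A function whose diff mixes canary lines with a real change is kept, so the heuristic only removes candidates; it cannot hide a fix that touches any non-canary line.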

The disclosure window: why patch diffing automation is critical
In the spring of this year, two cases in the Linux kernel showed how the interval between the publication of a commit and the understanding of its security significance is shrinking.
The first case: the NVD description points to the revert of a commit in algif_aead, eliminating the in-place operation during AEAD processing. The CVE was added to CISA KEV as actively exploited. The essence: the interaction between the AF_ALG interface and the splice() system call during an in-place AEAD operation on data from different mappings created the potential for unwanted data modification. The page cache is a system-wide shared cache, so an in-place operation on data from different mappings created the potential to corrupt cached pages. Details of possible exploitation in public sources are limited.
The second case: a series of related splice/SKBFL_SHARED_FRAG fixes in the network subsystem: CVE-2026-43284 (xfrm/ESP-in-UDP, CVSS 8.8 HIGH, CWE-123 Write-what-where Condition, vector CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:C/C:H/I:H/A:H) and CVE-2026-43500 (rxrpc, CVSS 7.8 HIGH, CWE-787 Out-of-bounds Write). Both stem from the fact that splice() via __ip_append_data attaches pages from a pipe to an skb, while the xfrm/ESP-in-UDP and rxrpc handlers do not check the SKBFL_SHARED_FRAG flag and modify shared pages. No public PoCs for these CVEs were found at the time of writing.
Upstream commit messages look like “avoid in-place decrypt on shared skb frags” or “revert to operating out-of-place”. Analyzing such a commit manually takes hours of work from an analyst who knows the kernel's crypto subsystem and the mechanics of splice and the page cache. An LLM agent changes the economics of this process in three ways.
Mass monitoring of commits gets cheaper. A pipeline can collect diffs by subsystem, classify them for security relevance, and route suspicious changes to humans. The model does not have to be accurate: it is enough to shrink the haystack.
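The "shrink the haystack" idea can be shown with a deliberately crude stand-in for the classifier. The sketch below scores commit subjects with weighted keyword patterns instead of an LLM; the patterns, weights, and threshold are illustrative assumptions, and a real pipeline would replace `score` with a model call.

```python
import re

# Hypothetical weighted patterns; a real triage step would use a model instead.
PATTERNS = {
    r"\bbounds? check": 3,
    r"\bout[- ]of[- ]bounds\b": 3,
    r"\buse[- ]after[- ]free\b": 3,
    r"\bin-place\b": 2,
    r"\boverflow\b": 2,
    r"\brevert\b": 1,
}


def score(subject):
    """Sum the weights of all patterns found in a commit subject line."""
    text = subject.lower()
    return sum(weight for pattern, weight in PATTERNS.items() if re.search(pattern, text))


def triage(subjects, threshold=2):
    """Return subjects at or above the threshold, highest score first."""
    scored = [(score(subject), subject) for subject in subjects]
    scored.sort(key=lambda item: -item[0])  # stable: ties keep input order
    return [subject for value, subject in scored if value >= threshold]
```

The point is the shape of the pipeline, not the scoring: anything that discards most irrelevant commits while keeping the suspicious ones leaves a human with a much shorter list to read.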
Variant analysis (searching for similar patterns in neighboring subsystems) becomes more accessible. Found “splice + in-place AEAD = page cache corruption” in algif_aead? An LLM does a good job of finding the same structural pattern in xfrm, rxrpc, and other splice users.
Parallel independent discovery of the same vulnerability by different teams, a well-known pattern in kernel security, becomes even more likely.
Limitations: where automatic analysis of binary patches does not work
Patch2Vuln honestly documents its limitations, and for a practicing vulnerability researcher they are fundamental.
Binary-diff coverage is the main bottleneck. If the compiler aggressively inlines the function, changes the code layout due to LTO (Link-Time Optimization), or the security fix is spread across several translation units, the differ may not highlight the key change. 6 of Patch2Vuln's 20 security pairs failed here. For packages built with -O2 -flto, diff coverage is systematically worse than for -O0 or -Og.
Behavioral validation remains unsolved. Of the 20 security pairs, only 2 yielded a behavioral differential. Generating PoC input is a fundamentally hard task: deep-state bugs in parsers, race conditions, and kernel-specific paths are beyond the agent's current capabilities. A crash is the simplest signal, but not all vulnerabilities lead to a crash. A CWE-125 out-of-bounds read, as in CVE-2014-0160 (Heartbleed), manifests as a data leak rather than a process crash.
Context window limits. The decompiled code of complex kernel functions can run to thousands of lines. The pipeline is forced to truncate the context. One of the 20 security pairs failed at context export: the function made it into the candidates, but the truncated decompiled fragment deprived the agent of key information. A classic situation: the most interesting part was outside the window.
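One way to reduce this kind of export failure is to truncate around the changed lines rather than from the top of the function. The sketch below is an assumed strategy for illustration, not what Patch2Vuln does: `changed` holds zero-based indices of lines touched by the patch, and `budget` is the maximum number of lines that fit into the context.

```python
def window_around_changes(lines, changed, budget, context=3):
    """Select at most `budget` lines: changed lines first, then nearest neighbors."""
    selected = set()

    def add(index):
        if 0 <= index < len(lines) and len(selected) < budget:
            selected.add(index)

    for index in changed:  # changed lines have top priority
        add(index)
    for distance in range(1, context + 1):  # then grow outward evenly
        for index in changed:
            add(index - distance)
            add(index + distance)
    return [lines[i] for i in sorted(selected)]
```

With this ordering, a tight budget drops surrounding context before it drops any changed line, so the patched statements themselves survive as long as the budget covers them.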
The false positive rate has not been evaluated at scale. All 5 negative controls were rejected, which is excellent, but 5 samples prove nothing statistically. A real update stream contains dozens of non-security packages for every security fix.
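How little five clean controls prove can be quantified. If the true false positive rate is p, the probability of seeing zero false positives in n independent controls is (1 - p)^n; solving (1 - p)^n = 0.05 for p gives the largest rate still consistent with the observation at 95% confidence. The function name below is illustrative, and independence of the controls is an assumption.

```python
def zero_event_upper_bound(n, confidence=0.95):
    """One-sided upper bound on an event rate after 0 events in n independent trials."""
    alpha = 1.0 - confidence
    # Solve (1 - p) ** n == alpha for p.
    return 1.0 - alpha ** (1.0 / n)


bound = zero_event_upper_bound(5)
print(f"upper bound on false positive rate: {bound:.1%}")
```

For n = 5 the bound is roughly 45%, so a false positive rate anywhere below that is still compatible with the benchmark result.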
Kernel modules are outside the benchmark. Patch2Vuln was tested on userspace binaries (tcpdump, libexpat, libarchive). Kernel modules with RCU sections, memory barriers, and per-CPU structures are a completely different beast. A use-after-free in io_uring or a race condition in netfilter requires an understanding of context that a decompiler conveys poorly: allocation, locking order, atomicity of operations.
The cost of scaling. Bishop Fox showed that the authorization bypass case with 1400+ functions costs $35 per run on Claude Sonnet 3.7. For mass triage of hundreds of .deb pairs, the budget adds up quickly. Local models are a potential alternative, but their quality on security reasoning has not yet been benchmarked under comparable conditions.
Patch2Vuln sets the right frame: this is not “automatic exploit generation” but a study of how deeply an LLM agent can penetrate the semantics of a security patch given only binary artifacts. 10 out of 20 is half, but it comes with a transparent failure diagnosis that makes it possible to improve the pipeline components in a targeted way.