Bypassing ML detectors in IDS: adversarial attacks and model robustness testing

Depov · Jun 5, 2026

How IDS ML Detectors Make Traffic Decisions
An ML-based NIDS runs a three-stage pipeline. Understanding each stage is critical for building an evasion attack - without it you will be poking perturbations in at random.

Feature extraction. Network traffic is converted into numerical features - flow-based or packet-based. A typical set: flow duration, average packet size, standard deviation of inter-arrival time (IAT), number of packets in each direction, payload-to-header ratio, TCP flags. NSL-KDD contains 41 features plus a class label, UNSW-NB15 has 47 features plus the attack_cat and label columns (49 columns total), and CICIDS2017 has more than 80 features.
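
As a rough illustration of this step, the sketch below computes a few of the flow features listed above from a list of packets. The tuple layout `(timestamp, size, direction)`, the function name `flow_features`, and the sample flow are assumptions made for the example; real extractors compute many more fields.

```python
from statistics import mean, pstdev

def flow_features(packets):
    """packets: list of (timestamp_seconds, size_bytes, direction), direction is 'fwd' or 'bwd'."""
    times = [t for t, _, _ in packets]
    sizes = [s for _, s, _ in packets]
    iats = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
    return {
        "duration": times[-1] - times[0],
        "mean_pkt_size": mean(sizes),
        "iat_std": pstdev(iats) if iats else 0.0,
        "fwd_pkts": sum(1 for _, _, d in packets if d == "fwd"),
        "bwd_pkts": sum(1 for _, _, d in packets if d == "bwd"),
    }

# Hypothetical four-packet flow, one packet per second
flow = [(0.0, 100, "fwd"), (1.0, 200, "bwd"), (2.0, 300, "fwd"), (3.0, 400, "bwd")]
features = flow_features(flow)
```

Every value in this dictionary is something the attacker can influence by padding or delaying packets, which is why the set of extracted features defines the attack surface.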

Normalization and selection. Features are scaled to a common range (MinMax to [0, 1]) and passed through feature selection. Adversarial perturbation operates in this normalized space, so mapping it back to real units (bytes, milliseconds) requires an inverse transformation - more on this below.
An adversarial perturbation is a small, specially computed change to the input data that causes the model to make gross errors.
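
To make the normalized-space point concrete, here is a minimal sketch of MinMax scaling and its inverse. The helper names (`fit_minmax`, `scale`, `unscale`) and the training rows are hypothetical; the two columns are assumed to be packet size in bytes and IAT in milliseconds.

```python
def fit_minmax(rows):
    """Per-column minimum and maximum of the training data."""
    cols = list(zip(*rows))
    return [min(c) for c in cols], [max(c) for c in cols]

def scale(row, lo, hi):
    return [(x - l) / (h - l) if h > l else 0.0 for x, l, h in zip(row, lo, hi)]

def unscale(row, lo, hi):
    return [x * (h - l) + l for x, l, h in zip(row, lo, hi)]

# Assumed columns: packet size (bytes), IAT (ms)
train = [[60.0, 0.5], [1500.0, 200.0], [780.0, 50.0]]
lo, hi = fit_minmax(train)

x = scale([780.0, 50.0], lo, hi)
x_adv = [x[0] + 0.1, x[1]]       # perturbation of 0.1 in normalized space
real = unscale(x_adv, lo, hi)    # what the attacker must actually send
```

With this scaler a shift of 0.1 on the first feature corresponds to 144 bytes of extra padding, because the training range for packet size is 1440 bytes wide.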

Classification. The trained model (Random Forest, XGBoost, DNN, LSTM) returns a verdict: benign or malicious, sometimes broken down by attack type (DoS, Probe, R2L, U2R). The decision boundary between classes is what the attacker tries to push a sample across.

According to a study presented at IEEE AI+ TrustCom 2024, the main weakness is manipulation of mutable features: packet size and flow timing patterns. These are what make malicious traffic “invisible” to the classifier. Immutable features (IP addresses, destination ports for a particular service) stay unchanged - otherwise the attack itself breaks.

What to check right now: look at which features your model relies on. If 80% of the feature importance sits on mutable values (packet size, timing, IAT), the model is vulnerable to adversarial perturbation by definition. Not “may be vulnerable” - vulnerable.
Where adversarial ML fits in the attack chain: mapping to MITRE ATT&CK
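
A quick way to run this check is to sum the importance that falls on mutable features. The sketch below assumes you already have per-feature importances (for example from a tree model); the feature names, the values, and the mutable/immutable split are all hypothetical.

```python
def mutable_share(importances, mutable):
    """Fraction of total feature importance that sits on attacker-controllable features."""
    total = sum(importances.values())
    return sum(v for name, v in importances.items() if name in mutable) / total

# Hypothetical importances exported from the model under audit
importances = {
    "pkt_size_mean": 0.35,
    "iat_std": 0.30,
    "flow_duration": 0.15,
    "dst_port": 0.12,
    "tcp_flags": 0.08,
}
MUTABLE = {"pkt_size_mean", "iat_std", "flow_duration"}

share = mutable_share(importances, MUTABLE)
```

Here the share comes out at about 0.8, which is exactly the situation described above.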
Adversarial evasion is not an isolated technique. It is an evasion layer built into a full chain: the attacker gains initial access, deploys an agent, and then masks C2 communications from the IDS ML module. But even before that, the attacker needs a toolkit - surrogate models trained on public datasets and frameworks for generating perturbations.

A separate note on T1071: this technique is often associated with Windows (Cobalt Strike, Sliver), but in the MITRE ATT&CK classification it is cross-platform (Windows, Linux, macOS), although the public Atomic Red Team tests cover only Windows. HTTP-based C2 also runs on Linux servers (Mythic, Havoc) with the same set of network features that the ML detector tries to classify. Adversarial perturbation of C2 traffic is equally relevant for any OS - network flow features do not depend on the OS.

Operational context: adversarial perturbation is used after initial access, when you need to hide C2 communications from the IDS ML module. The previous step is deploying an agent on the target host. The next is lateral movement through the disguised channel. Without bypassing the ML detectors in the IDS, the channel survives for minutes. According to the CrowdStrike Global Threat Report 2024, the average breakout time (from initial access to the start of lateral movement) was 62 minutes; the 2025 report puts it at 48 minutes. That is enough time for an IDS with an ML module to react if C2 is not disguised.
Taxonomy of evasion attacks on IDS ML detectors
Evasion attacks on intrusion detection systems differ in the level of access to the model and in the method used to generate adversarial examples.

By access to the model. White-box - the attacker knows the architecture, weights, and hyperparameters; gradient-based attacks (FGSM, PGD) work directly. Black-box - access to predictions only; you need a surrogate model or query-based optimization. Grey-box - partial knowledge: the attacker knows that the IDS is Suricata-based with ML on top of eve-log features, but does not know the specific model. In a real pentest, grey-box is the most typical scenario. I have met a pure white-box exactly once, and only because the customer had posted the model in a Jupyter notebook on the internal GitLab.
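
For the black-box case, query-based optimization can be as simple as random search. The sketch below is one possible variant, not a method from the cited studies: it assumes the detector exposes a malicious-class score, and `black_box_score` is a hypothetical stand-in with made-up weights that the attacker would not actually see.

```python
import math
import random

def query_attack(x, score, eps, queries=200, seed=0):
    """Random search: keep a candidate only if the detector's malicious score drops."""
    rng = random.Random(seed)
    best, best_score = list(x), score(x)
    for _ in range(queries):
        # sample inside the eps-box around the original sample, clipped to [0, 1]
        cand = [min(1.0, max(0.0, xi + rng.uniform(-eps, eps))) for xi in x]
        s = score(cand)
        if s < best_score:
            best, best_score = cand, s
    return best, best_score

def black_box_score(x):
    """Hypothetical detector; the attacker only observes its output."""
    w, b = [4.0, 3.0, -1.0], -3.0
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

x = [0.6, 0.5, 0.2]   # normalized malicious sample
x_adv, adv_score = query_attack(x, black_box_score, eps=0.2)
```

Every query is a flow the IDS actually sees, so the query budget is an operational constraint, not just a tuning parameter.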

By generation method. According to a number of publications, GAN-based evasion attacks on CICIDS2017 can reduce the accuracy of a Decision Tree more than that of Logistic Regression. With a poisoning attack the picture is reversed: Logistic Regression degrades faster. The conclusion is simple - the choice of adversarial ML attack method against an IDS depends on the type of target model. There is no one-size-fits-all attack.

Feature-space vs problem-space - a distinction many miss. A feature-space attack modifies the numerical feature vector directly. A problem-space attack changes the real traffic so that the changes show up correctly in the extracted feature vector. Not every feature-space perturbation is feasible in practice - you cannot set a negative packet size or a timing below the network RTT. It is like constraint enforcement when generating payloads: a vector that is valid in theory can be physically unrealizable.
Constraint enforcement is the process of guaranteeing that a system, process, or data set stays strictly within given bounds, security limits, or logical rules.
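
A minimal sketch of such constraint enforcement on a feature vector follows. The feature order, the bounds, and the 5 ms RTT floor are assumptions chosen for the example; the point is only that immutable features are restored and mutable ones are clipped to what the network allows.

```python
def project(x_orig, x_adv, mutable, lower, upper):
    """Keep immutable features at their original value and clip mutable ones to feasible bounds."""
    projected = []
    for i, (orig, adv) in enumerate(zip(x_orig, x_adv)):
        if i in mutable:
            projected.append(min(upper[i], max(lower[i], adv)))
        else:
            projected.append(orig)
    return projected

# Assumed feature order: packet size (bytes), IAT (ms), destination port
x_orig = [500.0, 20.0, 443.0]
x_adv = [-40.0, 0.5, 8443.0]     # raw optimizer output, physically impossible
mutable = {0, 1}
lower = [500.0, 5.0, None]       # padding only grows a packet; 5 ms stands in for the RTT
upper = [1500.0, 1000.0, None]   # MTU-sized packet, 1 s maximum delay

feasible = project(x_orig, x_adv, mutable, lower, upper)
```

The projected vector stays at the original packet size, raises the IAT to the RTT floor, and leaves the port alone.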
Transferability: when the surrogate model is sufficient
In a real pentest, white-box access to the IDS model is rare. The main working scenario is to train a surrogate model on a public dataset and count on the adversarial examples transferring to the target.

Papernot et al. (Transferability in Machine Learning: from Phenomena to Black-Box Attacks using Adversarial Samples, 2016) showed high transferability of adversarial examples for DNNs. But for tree-based models (Random Forest, XGBoost) - which are the ones more often deployed in production IDS because of their inference speed - transferability is much lower and depends heavily on feature engineering. If the surrogate's feature set does not match the target's, transfer of adversarial examples in network security behaves unpredictably.

What this means in practice:
• Target IDS built on a DNN (Darktrace, Vectra AI and similar) - adversarial examples from a surrogate DNN of a different architecture transfer with high probability, provided the feature set is similar. This is the main scenario for evasion attacks using machine learning.
• Target IDS built on tree-based models (typical for open-source solutions based on Suricata with ML plugins) - transferability is unpredictable. The surrogate should replicate the target system's feature set as closely as possible. For PGD this is critical; FGSM is less sensitive to a feature mismatch, but its result is also coarser.
• Ensemble of different architectures - seriously reduces transferability. An adversarial example that fools the DNN often does not fool the RF in the same ensemble. If during testing you see the ensemble holding up under attack, that is a good sign.
Constraints on adversarial perturbation in real networks
The IEEE AI+ TrustCom 2024 study stresses that in real networks the attacker does not control all the features. Mutable ones (packet size, timing, packet count) can be changed through padding and delays. Immutable ones (IP addresses, destination ports, TCP flags required for a correct handshake) cannot be changed without losing functionality.
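
Transferability can be measured directly: take the adversarial examples that fool the surrogate and count how many also fool the target. In the sketch below both models are hypothetical stand-ins (a linear surrogate and a one-rule, tree-like target), and the adversarial examples are assumed to have been crafted against the surrogate beforehand.

```python
def transfer_rate(adv_examples, surrogate_predict, target_predict):
    """Share of examples fooling the surrogate (predicted 0 = benign) that also fool the target."""
    fooled_surrogate = [x for x in adv_examples if surrogate_predict(x) == 0]
    if not fooled_surrogate:
        return 0.0
    transferred = sum(1 for x in fooled_surrogate if target_predict(x) == 0)
    return transferred / len(fooled_surrogate)

def surrogate_predict(x):
    """Hypothetical linear surrogate: 1 = malicious, 0 = benign."""
    w, b = [4.0, 3.0, -1.0], -3.0
    return int(sum(wi * xi for wi, xi in zip(w, x)) + b >= 0)

def target_predict(x):
    """Hypothetical tree-like target: a single threshold on the first feature."""
    return int(x[0] >= 0.45)

# Adversarial versions of three malicious flows (normalized features)
adv_examples = [[0.4, 0.3, 0.4], [0.5, 0.4, 0.3], [0.7, 0.6, 0.2]]
rate = transfer_rate(adv_examples, surrogate_predict, target_predict)
```

Two of the three examples fool the surrogate and only one of those also gets past the target, so the measured transfer rate is 0.5. Reporting it relative to surrogate successes keeps the metric from being diluted by examples that never worked in the first place.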

An additional headache: the features are interdependent. A change in packet size indirectly affects flow-level byte counts and byte rate. A change in IAT shifts the jitter statistics. The genetic algorithm (as proposed in the same study) takes these dependencies into account, but FGSM and PGD do not. They need an additional projection step that filters out “impossible” feature combinations. A vector that is valid in theory can be physically unrealizable - like a payload that passes every check but crashes on real execution.
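
One way to filter out “impossible” combinations is to treat derived features as functions of a few primitives and check that they still agree. The sketch below assumes a toy feature set in which mean packet size and byte rate are derived from total bytes, packet count, and duration; the function names are hypothetical.

```python
def derive(total_bytes, n_packets, duration_s):
    """Build a feature dict in which derived fields are computed from the primitives."""
    return {
        "total_bytes": total_bytes,
        "n_packets": n_packets,
        "duration_s": duration_s,
        "mean_pkt_size": total_bytes / n_packets,
        "byte_rate": total_bytes / duration_s,
    }

def is_consistent(f, tol=1e-6):
    """True if the derived fields still match the primitives."""
    return (abs(f["mean_pkt_size"] * f["n_packets"] - f["total_bytes"]) <= tol
            and abs(f["byte_rate"] * f["duration_s"] - f["total_bytes"]) <= tol)

original = derive(total_bytes=10_000, n_packets=20, duration_s=2.0)

# A gradient step that touches only one field yields an unrealizable vector
naive = dict(original, mean_pkt_size=650.0)

# Padding every packet to 650 bytes changes the dependent fields as well
repaired = derive(total_bytes=650 * 20, n_packets=20, duration_s=2.0)
```

The naive vector fails the check, while the repaired one passes because total bytes and byte rate moved together with the mean packet size.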

When adversarial perturbation of traffic does not work:
• IDS uses deep packet inspection in parallel with ML - perturbing flow features will not hide a signature in the payload
• Stateful firewall tracks TCP states - you can't arbitrarily change TCP flags without breaking the session
• NGFW with a HIDS component correlates network and host data - an NGFW bypass built on adversarial examples is caught by host telemetry
I have seen a case where the flow features were perfectly disguised, but Suricata with its signature module still caught the C2 by its JA3 TLS fingerprint. Adversarial perturbation is not a silver bullet.
Defensive measures and where ML classifiers remain vulnerable
Results of the defensive framework from IEEE AI+ TrustCom 2024: a combination of adversarial training, dataset balancing, feature engineering, and ensemble learning substantially increases detection robustness under adversarial attack compared to the baseline (in the published benchmarks - by tens of percentage points of accuracy, with a noticeable drop in false positive rate). Testing was carried out on NSL-KDD and UNSW-NB15.

Adversarial training: the model is trained on a mixture of normal and adversarial examples. It works against the attacks used in training (FGSM, PGD), but weakly against new methods (GAN-based, genetic algorithms). The analogy with network security is direct: a signature IDS detects what it has already seen. Adversarial training widens the “field of view”, but does not close the zero-day adversarial vector.

Ensemble learning: a combination of models with different architectures (DNN + XGBoost). An adversarial example optimized for one architecture often fails to fool the other. The ensemble reduces the effectiveness of transfer attacks but increases inference latency - for an IDS on 10 Gbps links, every millisecond counts. There is a trade-off to make here.
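
The effect can be shown with soft voting over two stand-in models. Both `dnn_like` and `tree_like` below are hypothetical placeholders with made-up parameters, and the adversarial sample is assumed to have been optimized against the first one only.

```python
import math

def ensemble_predict(x, models, threshold=0.5):
    """Soft voting: average the malicious-class probabilities of all models."""
    avg = sum(m(x) for m in models) / len(models)
    return int(avg >= threshold), avg

def dnn_like(x):
    """Hypothetical smooth model (stand-in for the DNN)."""
    w, b = [4.0, 3.0, -1.0], -3.0
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

def tree_like(x):
    """Hypothetical tree-based model: one threshold, two leaf probabilities."""
    return 0.9 if x[0] >= 0.35 else 0.1

x_adv = [0.4, 0.3, 0.4]   # pushes dnn_like below 0.5 on its own
label, avg = ensemble_predict(x_adv, [dnn_like, tree_like])
```

The smooth model is fooled, but the averaged score stays above the threshold and the flow is still flagged. The price is that every flow now pays the inference cost of both models.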

Protocol-aware feature engineering: adding features tied to the protocol specification (correctness of TCP flags, HTTP header validity, TLS handshake order). These features are harder to mutate without breaking the attack's functionality. In my opinion, the protocol-aware approach is the most effective measure against adversarial ML attacks on IDS, because it increases the share of immutable features that shape the model's decision boundary. The attacker simply has nothing left to tweak.
Blind spots that no measure closes
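
As an example of such a feature, the sketch below flags TCP flag combinations that a normal stack should not produce and turns them into a per-flow ratio. The rule set is deliberately simplified and illustrative, not an exhaustive reading of the TCP specification; the function names are hypothetical.

```python
FIN, SYN, RST, PSH, ACK, URG = 0x01, 0x02, 0x04, 0x08, 0x10, 0x20
XMAS = FIN | PSH | URG

def tcp_flags_valid(flags):
    """Simplified sanity rules for a single packet's TCP flags."""
    if (flags & 0x3F) == 0:
        return False                      # no flags at all (null scan)
    if (flags & SYN) and (flags & FIN):
        return False                      # open and close in one packet
    if (flags & SYN) and (flags & RST):
        return False                      # open and reset in one packet
    if (flags & XMAS) == XMAS:
        return False                      # FIN+PSH+URG set together (Xmas scan)
    return True

def invalid_flag_ratio(flow_flags):
    """Protocol-aware flow feature: share of packets with impossible flag combinations."""
    return sum(1 for f in flow_flags if not tcp_flags_valid(f)) / len(flow_flags)

flow_flags = [SYN, SYN | ACK, ACK, SYN | FIN]
ratio = invalid_flag_ratio(flow_flags)
```

An attacker cannot push this feature around freely: changing the flags changes whether the session works at all.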
Neither adversarial training nor ensembling protects against problem-space attacks that modify real traffic without knowing the feature set. An attacker who simply adds random padding to packets and randomizes timing is not optimizing an adversarial example - they are shifting the statistical features into the model's zone of uncertainty. Such “crude” perturbations are less effective than gradient-based ones, but they require no knowledge of the model at all. Trivial to the point of absurdity, but it works.
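
A crude perturbation of this kind fits in a few lines. The sketch below works on `(timestamp, size)` pairs, only ever adds padding and delay, and keeps packet order; the function name, the parameter values, and the 1500-byte cap are assumptions for the example.

```python
import random

def pad_and_jitter(packets, max_pad, max_delay, mtu=1500, seed=None):
    """Add random padding and cumulative random delay to each packet of a flow."""
    rng = random.Random(seed)
    out, shift = [], 0.0
    for t, size in packets:
        shift += rng.uniform(0.0, max_delay)             # delays accumulate, order is preserved
        padded = min(mtu, size + rng.randint(0, max_pad))
        out.append((t + shift, padded))
    return out

# Hypothetical C2 beacon: (timestamp in seconds, size in bytes)
packets = [(0.0, 120), (0.5, 1400), (1.0, 300)]
perturbed = pad_and_jitter(packets, max_pad=200, max_delay=0.05, seed=42)
```

Because the changes are made to the traffic itself, the result is always realizable, and the flow features simply end up wherever the extractor puts them.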

The second blind spot is concept drift. The ML model is trained on traffic from a certain period. Legitimate traffic migrates towards cloud services, and its distribution changes. The model degrades without any active attack. According to Mandiant M-Trends 2025, the median attacker dwell time before detection is 11 days. That is enough to study normal traffic and shape C2 patterns to fit the legitimate network profile.
Checklist: robustness audit of the IDS ML detector
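
Drift in a single feature can be tracked with the Population Stability Index (PSI), one of several possible drift metrics and not something the cited reports prescribe. The sketch below bins the training-period values, compares them with recent values, and uses synthetic data; a PSI above roughly 0.25 is a commonly quoted rule of thumb for a significant shift.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference sample and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0

    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = min(bins - 1, max(0, int((v - lo) / width)))  # out-of-range values go to edge bins
            counts[idx] += 1
        return [max(c / len(values), eps) for c in counts]      # floor avoids log(0)

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

# Synthetic example: mean packet size at training time vs. several months later
train_period = [float(v) for v in range(100)]
recent = [v + 30.0 for v in train_period]

same = psi(train_period, train_period)
drifted = psi(train_period, recent)
```

Running the same comparison per feature on a schedule gives an early warning before detection accuracy visibly drops.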
1. Record the model's baseline accuracy and F1-score on a clean test set - separately for each traffic class
2. Run FGSM with ε=0.05, 0.10, 0.15, 0.20 - plot the degradation curve
3. Run a PGD attack (20 iterations, α=0.01) - compare with FGSM, record the difference
4. Train a surrogate model of a different architecture - evaluate the transferability of adversarial examples
5. Split the feature set into mutable and immutable - check feature importance: if the top 5 features are mutable, the model is vulnerable
6. Add 20-30% FGSM/PGD adversarial examples to the training set - retrain, repeat the tests (does not protect against GAN-based and genetic attacks)
7. Assemble an ensemble of at least two architectures (DNN + tree-based) - compare with a single model under attack
8. Introduce protocol-aware features (TCP flags, HTTP method validity, TLS handshake order) - measure the gain in robustness
9. Check the model for concept drift: feed it traffic from a period 3-6 months later than the training set
10. Document: baseline accuracy, accuracy under FGSM/PGD, accuracy after adversarial training - for the report
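
Points 1-3 of the checklist can be sketched against a toy model as follows. The logistic model with weights `W` and bias `B`, and the six labeled samples, are hypothetical stand-ins for the detector and test set under audit; features are assumed to be MinMax-normalized, and FGSM is written as a single full-size PGD step.

```python
import math

W, B = [4.0, 3.0, -1.0], -3.0   # hypothetical stand-in for the audited model

def prob(x):
    """Malicious-class probability of the logistic model."""
    return 1.0 / (1.0 + math.exp(-(sum(w * v for w, v in zip(W, x)) + B)))

def sign(g):
    return (g > 0) - (g < 0)

def pgd(x, y, eps, alpha, steps):
    """Iterated gradient-sign steps, projected into the eps-box around x and into [0, 1]."""
    x_adv = list(x)
    for _ in range(steps):
        p = prob(x_adv)
        grad = [(p - y) * w for w in W]   # d(cross-entropy)/dx for logistic regression
        x_adv = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(xo + eps, max(xo - eps, xa)) for xo, xa in zip(x, x_adv)]
        x_adv = [min(1.0, max(0.0, xa)) for xa in x_adv]
    return x_adv

def fgsm(x, y, eps):
    return pgd(x, y, eps=eps, alpha=eps, steps=1)

def accuracy(data, attack=None):
    hits = 0
    for x, y in data:
        x_eval = attack(x, y) if attack else x
        hits += int((prob(x_eval) >= 0.5) == bool(y))
    return hits / len(data)

# (features, label) with 1 = malicious, 0 = benign
data = [([0.6, 0.5, 0.2], 1), ([0.9, 0.8, 0.0], 1), ([0.55, 0.45, 0.3], 1),
        ([0.2, 0.3, 0.5], 0), ([0.1, 0.1, 0.9], 0), ([0.4, 0.35, 0.2], 0)]

baseline = accuracy(data)                                            # point 1
curve = {eps: accuracy(data, lambda x, y, e=eps: fgsm(x, y, e))      # point 2
         for eps in (0.05, 0.10, 0.15, 0.20)}
pgd_acc = accuracy(data, lambda x, y: pgd(x, y, eps=0.10, alpha=0.01, steps=20))  # point 3
```

On a linear model like this one PGD and FGSM land in nearly the same place, so the gap between them is only informative on non-linear detectors. For a real audit, replace `prob` and the gradient with the audited model's own and report F1 per class alongside accuracy.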
Most security engineers who operate ML modules in IDS test accuracy on a holdout sample from the same dataset and treat a 97% detection rate as the final metric. In fact, 97% accuracy means one thing: the model works well on data distributed the same way as the training data. All it takes is for an attacker to spend half a day on a surrogate model and a 20-line FGSM script - and that 97% turns into 60% or lower.

Adversarial training helps, ensembling helps, protocol-aware features help - but each measure alone is insufficient; only the combination works. And the main thing I see in real projects: the resistance of ML models to attacks is not tested at all. The ML module is deployed as a “black box”, the vendor promises AI detection, the security team ticks the box and moves on.

Regular retraining on fresh traffic is more critical than any individual adversarial defense - concept drift kills models more quietly than FGSM, but on a larger scale. If the ML module in the IDS has not been retrained for six months, testing its robustness is pointless - it has already degraded on its own. Take the checklist above and run points 1-3 against your model. If accuracy drops by more than 15 p.p. at ε=0.10, you have a problem, and it is not the adversarial attacks - it is that nobody ever checked the model.
