Feature engineering for network traffic: why feature quality matters more than the choice of algorithm in IDS ML models

Depov

Six months ago we burned two weeks tuning LightGBM hyperparameters for a C2 beacon detector in corporate traffic. F1 hovered between 0.71 and 0.74 no matter whether we tweaked the learning rate, max_depth, or the number of trees. Then, in one evening, we added three features: the standard deviation of inter-packet intervals, the entropy of payload sizes, and the ratio of outgoing to incoming bytes at the flow level. F1 jumped to 0.93 on the same test set. The algorithm is the same, the data is the same; what changed is what the model "sees" in the traffic. In feature engineering of network traffic for machine learning, the quality of the features sets the model's ceiling, and the algorithm only brings the model closer to that ceiling.
Feature engineering is the process of transforming raw data into a representation better suited to training machine learning models.
The algorithm is secondary: what the numbers say
A typical mistake when building ML models for intrusion detection is to spend 80% of the time on model architecture and 20% on data preparation. The proportion should be reversed.

Take three algorithms: Random Forest, LightGBM, and a simple fully connected neural network. Feed them raw flow features from Zeek conn.log: duration, orig_bytes, resp_bytes, orig_pkts, resp_pkts, proto. The spread in F1 on the "malicious" class is 2-4 percentage points. Add derived features (statistics over time windows, entropy, ratio features) and each of the algorithms gains 15-25% in F1. The difference between the algorithms themselves on the enriched features stays within the same 2-3%.
Random Forest is a machine learning algorithm that combines many independent decision trees into a single ensemble.
LightGBM is a popular gradient boosting library over decision trees, developed by Microsoft.

The reason is simple: the algorithm finds a separating boundary in the feature space it was given, but it cannot invent new axes for that space. If the data has no "standard deviation of IAT over 60 seconds" feature, no model will detect a jittered beacon by its signature in the inter-packet intervals; it simply has nowhere to see the pattern.

According to the IBM X-Force Threat Intelligence Index 2025, 70% of the incidents X-Force investigated in 2024 affected critical infrastructure. A signature-based IDS built on Suricata or Snort catches known patterns but does not adapt to new TTPs. An ML-based NIDS theoretically solves this problem, provided that the features reflect real traffic behavior and not noise in a pretty wrapper.
Three Levels of Network Traffic Features
Feature extraction from network traffic for machine learning works on three levels. Each level adds context and improves detection capability, but requires more engineering work.
Packet level: atomic characteristics
The lowest level is the individual packet. What gets extracted: packet size (total_length from the IP header), TTL, TCP flags (SYN, ACK, FIN, RST, PSH, URG), transport layer protocol, source and destination ports, payload size.

In isolation, the packet level is almost useless for classification. A single SYN packet to port 443 is indistinguishable from the start of a legitimate TLS session or from the first packet of a SYN scan. Context appears only with aggregation.

But packet-level features are the foundation. If at this level you parse flags incorrectly in Scapy or lose retransmissions, every derived feature will be contaminated. Wireshark is your best friend here for visually checking what the parser outputs for a particular session.
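Because everything downstream depends on the flags being decoded correctly, it can help to keep a tiny reference decoder to compare parser output against. This is a minimal sketch that assumes the raw TCP flags byte is already available; the helper names `decode_tcp_flags` and `is_syn_only` are illustrative and not part of Scapy or Zeek.

```python
# Bit masks of the six classic TCP flags inside the flags byte of the TCP header.
TCP_FLAGS = {
    "FIN": 0x01,
    "SYN": 0x02,
    "RST": 0x04,
    "PSH": 0x08,
    "ACK": 0x10,
    "URG": 0x20,
}


def decode_tcp_flags(flags_byte: int) -> dict:
    """Turn a raw TCP flags byte into one boolean feature per flag."""
    return {name: bool(flags_byte & mask) for name, mask in TCP_FLAGS.items()}


def is_syn_only(flags_byte: int) -> bool:
    """True for a SYN without ACK: the packet that a scan and a real handshake share."""
    flags = decode_tcp_flags(flags_byte)
    return flags["SYN"] and not flags["ACK"]


if __name__ == "__main__":
    print(decode_tcp_flags(0x12))  # SYN and ACK set: second step of a handshake
    print(is_syn_only(0x02))       # bare SYN
```

Comparing a few packets decoded this way with what Wireshark shows for the same session is a quick sanity check before any aggregation is built on top.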
Flow level: aggregation into flows
A flow is a group of packets that share a 5-tuple (src_ip, dst_ip, src_port, dst_port, proto). This is where the real feature work for IDS begins.

For each flow, Zeek conn.log provides: ts, uid, id.orig_h/p, id.resp_h/p, proto, service, duration, orig_bytes, resp_bytes, conn_state, orig_pkts, resp_pkts. CICFlowMeter generates more than 80 statistical features: mean, minimum, and maximum packet size in each direction, standard deviation, and a full set of IAT statistics.
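To make the aggregation concrete, here is a minimal sketch of what a flow exporter does conceptually: group packets by 5-tuple and compute per-direction size statistics. The packet tuple layout and the helpers `build_flows` and `size_stats` are assumptions for illustration only; Zeek and CICFlowMeter have their own internals and handle many more edge cases.

```python
from collections import defaultdict
from statistics import mean, pstdev

# Assumed packet layout: (ts, src_ip, src_port, dst_ip, dst_port, proto, size)


def build_flows(packets):
    """Group packets into bidirectional flows keyed by the initiator's 5-tuple."""
    flows = defaultdict(lambda: {"fwd": [], "bwd": []})
    for ts, src, sport, dst, dport, proto, size in sorted(packets):
        fwd_key = (src, sport, dst, dport, proto)
        bwd_key = (dst, dport, src, sport, proto)
        if bwd_key in flows:
            # Reply direction of a flow that was already opened by the other side.
            flows[bwd_key]["bwd"].append(size)
        else:
            flows[fwd_key]["fwd"].append(size)
    return flows


def size_stats(sizes):
    """Mean, min, max and std of packet sizes; zeros for an empty direction."""
    if not sizes:
        return {"mean": 0.0, "min": 0, "max": 0, "std": 0.0}
    return {"mean": mean(sizes), "min": min(sizes),
            "max": max(sizes), "std": pstdev(sizes)}


packets = [
    (0.00, "10.0.0.5", 50000, "10.0.0.9", 443, "tcp", 60),
    (0.01, "10.0.0.9", 443, "10.0.0.5", 50000, "tcp", 60),
    (0.02, "10.0.0.5", 50000, "10.0.0.9", 443, "tcp", 200),
    (0.05, "10.0.0.9", 443, "10.0.0.5", 50000, "tcp", 1400),
]
flows = build_flows(packets)
flow = flows[("10.0.0.5", 50000, "10.0.0.9", 443, "tcp")]
fwd, bwd = size_stats(flow["fwd"]), size_stats(flow["bwd"])
print(fwd, bwd)
```

The four packets collapse into one flow with separate forward and backward size lists, which is the raw material for the ratio and statistical features discussed later.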

The critical parameter at this level is the flow timeout. Zeek delimits TCP sessions by FIN/RST and uses an inactivity timeout for UDP. CICFlowMeter defaults to 120 seconds. This setting hits the features directly: too long a timeout glues two different sessions together, too short a timeout breaks one session apart. For detecting C2 beacons with a 300-second interval, the CICFlowMeter default timeout no longer fits, and this is a trap many people fall into.
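The effect of the timeout is easy to see on a toy example. The sketch below is a simplified model of an inactivity timeout, not CICFlowMeter's actual logic; `split_by_idle_timeout` and the beacon timestamps are hypothetical.

```python
def split_by_idle_timeout(timestamps, idle_timeout):
    """Split one 5-tuple's packet timestamps into flows on gaps longer than idle_timeout."""
    flows, current = [], []
    for ts in sorted(timestamps):
        if current and ts - current[-1] > idle_timeout:
            flows.append(current)
            current = []
        current.append(ts)
    if current:
        flows.append(current)
    return flows


# Hypothetical beacon: one check-in every 300 s over 30 minutes.
beacon = [i * 300 for i in range(7)]

short = split_by_idle_timeout(beacon, idle_timeout=120)
long_ = split_by_idle_timeout(beacon, idle_timeout=600)

print(len(short))  # every check-in ends up in its own flow
print(len(long_))  # a single flow that contains all check-ins
```

With the 120-second timeout each check-in becomes a separate single-packet flow, so a per-flow IAT statistic has nothing to measure; the periodicity then has to be computed across flows rather than inside one.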
Behavioral level: windows and entropy
The most valuable and the most difficult level. Here the unit is not a single flow but the behavior of a group of flows over a time window, usually 30-120 seconds.

Key behavioral features for detecting anomalies in network traffic:
• The number of unique dst_port values from one src_ip in 60 seconds. A value > 20 points to a port scan. A normal user touches 2-5 ports per minute.
• Shannon entropy of dst_port. If a src_ip hits 200 different ports within the window, the entropy will be above 7 bits. A legitimate client stays below 1.5 bits.
• Ratio of SYN-only to total TCP connections within the window. A SYN flood or SYN scan gives a ratio > 0.8; normal traffic gives 0.05-0.15.
• Standard deviation of IAT (inter-arrival time). A C2 beacon with jitter shows a characteristic pattern: a small std against the background of the random noise of normal traffic. A stable IAT (std < 0.5 s) over a long session is a red flag.
• Payload entropy at the byte level. A value above 7.5 bits per byte with a small packet size is a likely sign of an encrypted or obfuscated channel.
• Byte ratio (sent/received) at the flow level. C2 with "poll" commands shows a sharply asymmetric pattern: many small outgoing messages, rare large incoming ones.
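Several of the features in the list reduce to a few lines of code. The following is a minimal sketch using only the standard library; the function names are illustrative, and treating Zeek's conn_state "S0" (connection attempt seen, no reply) as "SYN-only" is an assumption of this sketch.

```python
import math
from collections import Counter
from statistics import pstdev


def shannon_entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def iat_std(timestamps):
    """Population standard deviation of inter-arrival times, in seconds."""
    ts = sorted(timestamps)
    iats = [b - a for a, b in zip(ts, ts[1:])]
    return pstdev(iats) if len(iats) > 1 else 0.0


def syn_only_ratio(conn_states):
    """Share of connections in state 'S0' (assumed here to mean SYN-only)."""
    if not conn_states:
        return 0.0
    return sum(s == "S0" for s in conn_states) / len(conn_states)


scan_ports = list(range(1, 201))       # 200 distinct ports, each hit once
client_ports = [443] * 18 + [80, 53]   # mostly HTTPS with a little HTTP and DNS

print(shannon_entropy(scan_ports))           # equals log2(200) for a uniform spread
print(shannon_entropy(client_ports))         # low: one port dominates
print(iat_std([0, 60, 120, 180, 240]))       # perfectly regular check-ins
print(shannon_entropy(bytes(range(256))))    # every byte value once: the 8-bit maximum
```

The same `shannon_entropy` function works for ports and for payload bytes because it only looks at the frequency of each distinct value.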
Practical pipeline: from pcap to trained model
Environment requirements
• OS: Ubuntu 22.04+ or Kali Linux 2024+
• RAM: minimum 8 GB for pcap up to 500 MB; recommended 16 GB for large datasets (CIC-IDS2017 in pcap format occupies ~50 GB)
• Zeek: 6.0+ (sudo apt install zeek or build from source)
• Python: 3.10+, dependencies: pandas, numpy, scikit-learn, lightgbm, shap
• Mode: fully offline once the dependencies are installed; no Internet access required
Pipeline Steps
1. Generating conn.log. The command zeek -r capture.pcap creates structured logs in the current directory. Each line of conn.log is one flow with the fields ts, uid, id.orig_h, id.orig_p, id.resp_h, id.resp_p, proto, service, duration, orig_bytes, resp_bytes, and a dozen more.

2. Parsing and baseline derived features.
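Python:
import pandas as pd

# Default Zeek conn.log column order; verify it against the '#fields' header of your log
cols = ['ts', 'uid', 'src_ip', 'src_port', 'dst_ip', 'dst_port',
        'proto', 'service', 'duration', 'orig_bytes', 'resp_bytes',
        'conn_state', 'local_orig', 'local_resp', 'missed_bytes', 'history',
        'orig_pkts', 'orig_ip_bytes', 'resp_pkts', 'resp_ip_bytes',
        'tunnel_parents']
# Zeek writes '-' for unset fields; read them as NaN so the counters stay numeric
df = pd.read_csv('conn.log', sep='\t', comment='#', names=cols, na_values='-')
df[['orig_bytes', 'resp_bytes']] = df[['orig_bytes', 'resp_bytes']].fillna(0)

# Ratio features; + 1 in the denominator avoids division by zero
df['byte_ratio'] = df['orig_bytes'] / (df['resp_bytes'] + 1)
df['pkt_ratio'] = df['orig_pkts'] / (df['resp_pkts'] + 1)
df['bytes_per_pkt_fwd'] = df['orig_bytes'] / (df['orig_pkts'] + 1)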
The + 1 in the denominators is not laziness but a necessity. Without it, division by zero on unidirectional flows (a typical situation for a SYN scan, where resp_bytes = 0 and resp_pkts = 0) fills the dataset with inf and NaN values and kills the training.
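A hard-coded names= list breaks silently when the column set differs, which can happen between Zeek versions or with local configuration. A more defensive variant reads the column names from the "#fields" header line that Zeek writes at the top of its TSV logs. This is a minimal sketch; `read_zeek_tsv` is a hypothetical helper and the sample log is abbreviated to a handful of columns.

```python
import io


def read_zeek_tsv(text):
    """Parse a Zeek TSV log, taking column names from its '#fields' header line."""
    fields, rows = None, []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if line.startswith("#fields"):
            fields = line.split("\t")[1:]
        elif line and not line.startswith("#"):
            if fields is None:
                raise ValueError("no #fields header found before the first data line")
            values = line.split("\t")
            # Zeek writes '-' for unset fields; map them to None.
            rows.append({k: (None if v == "-" else v) for k, v in zip(fields, values)})
    return rows


# Abbreviated, made-up sample: a real conn.log has many more columns.
sample = (
    "#separator \\x09\n"
    "#fields\tts\tuid\tid.orig_h\tid.resp_p\tproto\torig_bytes\tresp_bytes\n"
    "#types\ttime\tstring\taddr\tport\tenum\tcount\tcount\n"
    "1700000000.0\tCabc1\t10.0.0.5\t443\ttcp\t1200\t-\n"
)
rows = read_zeek_tsv(sample)
print(rows[0])
```

All values come back as strings, so numeric columns still need an explicit conversion before the ratio features are computed.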

3. Behavioral features over time windows. Sort by timestamp and, for each flow, count how many unique dst_port values this src_ip contacted in the last 60 seconds. On large datasets a naive groupby + rolling runs for hours. More efficient: pd.merge_asof, or pre-aggregation into 60-second buckets with a nunique calculation. The entropy of dst_port is computed with the Shannon formula H = -sum(p * log2(p)) over the distribution of ports in the window.
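As an alternative to the pandas approaches, the trailing-window count can also be done in a single pass with a queue per source address. The sketch below is dependency-free and assumes flows arrive as (ts, src_ip, dst_port) tuples; `trailing_unique_ports` is an illustrative name. It only looks backwards in time, so it cannot pick up future events.

```python
from collections import Counter, deque


def trailing_unique_ports(flows, window=60.0):
    """For each flow, count the distinct dst_port values the same src_ip contacted
    in the trailing `window` seconds (current flow included, future flows excluded).

    `flows` is an iterable of (ts, src_ip, dst_port) tuples.
    """
    recent = {}   # src_ip -> deque of (ts, dst_port) still inside the window
    counts = {}   # src_ip -> Counter of dst_port values inside the window
    out = []
    for ts, src, port in sorted(flows):
        q = recent.setdefault(src, deque())
        c = counts.setdefault(src, Counter())
        # Evict flows that have fallen out of the trailing window.
        while q and q[0][0] <= ts - window:
            _, old_port = q.popleft()
            c[old_port] -= 1
            if c[old_port] == 0:
                del c[old_port]
        q.append((ts, port))
        c[port] += 1
        out.append((ts, src, len(c)))
    return out


# Hypothetical traffic: a fast scanner (one port per second) and a slow one (one per minute).
fast = [(float(i), "10.0.0.66", 1000 + i) for i in range(30)]
slow = [(i * 60.0, "10.0.0.77", 2000 + i) for i in range(30)]

result = trailing_unique_ports(fast + slow)
fast_counts = [n for _, s, n in result if s == "10.0.0.66"]
slow_counts = [n for _, s, n in result if s == "10.0.0.77"]
print(max(fast_counts), max(slow_counts))
```

The per-source Counter kept for the window can also be passed to an entropy function to get the dst_port entropy for the same window at no extra cost.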

4. Training and verification via SHAP.
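Python:
import shap
import lightgbm as lgb

# X_train, y_train, X_test: feature matrices and labels prepared in the previous steps
model = lgb.LGBMClassifier(n_estimators=200, max_depth=6)
model.fit(X_train, y_train)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Older shap releases return a list [class 0, class 1] for binary models;
# newer ones may return a single array, so handle both cases
if isinstance(shap_values, list):
    shap_values = shap_values[1]
shap.summary_plot(shap_values, X_test, max_display=15)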
In my experiments the top three features are stable: byte_ratio, dst_port_entropy_60s, and std_iat. Features like TTL or src_port end up in the tail: they carry no behavioral information and act as noise.
Leakage via IP addresses
If you leave src_ip and dst_ip as categorical features (or one-hot encode them), the model memorizes: "traffic from 192.168.1.105 = attack". On the training dataset F1 is close to 0.99. On new traffic it is 0.50. The model learned the addresses of the attacker's lab machines, not the behavior.

I have seen this mistake in three different projects, and each time the team wondered why the "perfect" model fell apart in production. The solution: never include IP addresses in the features directly. Use per-IP behavioral aggregates: the number of unique dst_port values, the average session duration, the entropy of destinations.
Temporal leakage
When computing window features, a window-boundary mistake or a bidirectional window (one that captures future events) creates this situation: the model sees that "in the 60 seconds after this flow there will be 50 more SYNs". It predicts an attack with suspicious accuracy, but in production there is no data from the future, and the model falls apart.

The check: train the model on data up to a certain timestamp and test it on data after that point. If F1 drops by 15+ percentage points compared to a random split, there is a leak.
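The check itself needs very little code. This is a minimal sketch, assuming each row is a dict with a "ts" key; `chronological_split` and `leakage_gap` are hypothetical helpers, and the scores passed to `leakage_gap` are made-up numbers used only to show the arithmetic.

```python
def chronological_split(rows, train_fraction=0.8, ts_key="ts"):
    """Train on the earliest part of the capture, test on everything after it."""
    ordered = sorted(rows, key=lambda r: r[ts_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


def leakage_gap(f1_random_split, f1_time_split):
    """Drop in F1, in percentage points, when moving from a random to a time-based split."""
    return (f1_random_split - f1_time_split) * 100


rows = [{"ts": t, "label": t % 2} for t in (50, 10, 40, 20, 30, 70, 60, 90, 80, 100)]
train, test = chronological_split(rows)

# Every training row is older than every test row.
print(max(r["ts"] for r in train), min(r["ts"] for r in test))

# Hypothetical scores: a gap above 15 points would suggest a leak.
print(leakage_gap(0.97, 0.78) > 15)
```

Training the same model once on a random split and once on this split, then feeding both F1 values to `leakage_gap`, gives the comparison described above.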
CICFlowMeter Trap
CICFlowMeter generates 80+ features, and the temptation to "throw everything into the model" is understandable. The problem: many of them are linear combinations of one another. total_fwd_packets multiplied by avg_fwd_packet_length is approximately equal to the total forward byte count. Multicollinearity does not hurt trees directly, but it masks the real contribution of each feature. SHAP shows 15 "important" features, of which 10 are the same thing seen from different angles. Before training, compute the correlation matrix and drop features with pairwise correlation > 0.95. It is boring, but without it the interpretation of the model is fiction.
Blind spots of ML features: how an attacker bypasses a feature-based IDS
Feature engineering is a tool with clear limits. An attacker who understands how detection works deliberately evades specific features. For a pentester, knowing these mechanisms is the operational context that shapes OPSEC on an internal pentest.

Mimicry attacks. If the IDS relies on byte_ratio, the attacker adds padding to the C2 responses, leveling the ratio to normal (~1:1). The SHAP profile of such traffic becomes indistinguishable from legitimate HTTPS. On an internal pentest this is done by configuring junk data in the C2 framework profile: Cobalt Strike Malleable C2 profiles or the Sliver HTTP C2 config.

Slow scanning. The "number of unique dst_port in 60 s" feature will not fire if the attacker probes one port per minute. Widening the window to 1 hour helps, but it adds computational load and a detection delay that is unacceptable for a real-time IDS. A classic trade-off, and there is no perfect solution.

Tunneling through legitimate protocols. A DNS tunnel with subdomain entropy < 3 bits (a short dictionary instead of base64) will not be caught by an "entropy of subdomain > 4.5" feature. Protocol Tunneling (T1572, Command and Control) in this form requires semantic rather than statistical features: analysis of the content of DNS queries, which goes beyond classic feature engineering.

Jittered beaconing. A beacon with jitter > 30% blurs the standard deviation of IAT to the level of normal traffic. Detection requires second-order features: autocorrelation of the IAT series catches a hidden periodicity even under significant noise. In practice, computing autocorrelation for every src_ip in real time is an engineering challenge, and few teams pull it off.
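One possible way to compute such a second-order feature is to bin the connection start times of a src_ip into a count series and look at its autocorrelation at candidate lags. This is an assumption of the sketch, not necessarily how the measurements above were made; `bin_counts` and `autocorr` are illustrative names, and the example uses a jitter-free beacon so the result is deterministic.

```python
def bin_counts(timestamps, bin_size, duration):
    """Number of connections that start in each `bin_size`-second bin."""
    counts = [0] * int(duration // bin_size)
    for ts in timestamps:
        idx = int(ts // bin_size)
        if 0 <= idx < len(counts):
            counts[idx] += 1
    return counts


def autocorr(series, lag):
    """Normalized autocorrelation of `series` at a positive integer `lag`."""
    n = len(series)
    m = sum(series) / n
    var = sum((x - m) ** 2 for x in series)
    if var == 0 or lag >= n:
        return 0.0
    cov = sum((series[t] - m) * (series[t + lag] - m) for t in range(n - lag))
    return cov / var


# Hypothetical beacon without jitter: one connection every 60 s for 20 minutes.
beacon = [i * 60 for i in range(20)]
series = bin_counts(beacon, bin_size=10, duration=1200)

print(autocorr(series, lag=6))  # 6 bins = 60 s, the beacon period: strong positive peak
print(autocorr(series, lag=3))  # half a period: no peak
```

With jitter the peak would be expected to spread over neighboring lags, so in practice it makes sense to scan a band of lags and take the maximum rather than test a single value.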

Traffic Signaling (T1205, Command and Control). The attacker uses specific combinations of TCP flags or TTL values as signals to activate a backdoor. Statistical features do not detect one-off triggers: they "dissolve" in the aggregation window. This calls for a different approach: rule-based detection at the level of individual packets, not ML.

Understanding these limits is critical for both sides: the blue team sees where to strengthen detection; the pentester understands which features the IDS tracks and how to build C2 communication without tripping thresholds. On an internal pentest in a modern infrastructure with an ML-based IDS (Darktrace, Vectra AI), knowing the specific features is the difference between a caught beacon and an unnoticed channel.
Conclusion
For three years I have been building pipelines for detecting anomalies in network traffic, and the main conclusion is disappointing: the industry overinvests in algorithms and underinvests in features. Conferences discuss Transformer-based IDS and graph neural networks for detecting lateral movement. Meanwhile, the basic feature set in most teams is raw conn.log from Zeek without a single derived feature. Swapping Random Forest for a Transformer on bad features is like fitting a sports exhaust to a car with a flat tire.

The gain from properly constructed features is an order of magnitude larger than the gain from changing the algorithm, and this effect has reproduced on every dataset I have worked with. The second inconvenient fact: feature engineering for ML intrusion detection models cannot be fully automated. AutoML selects a pipeline, but it does not understand why a high dst_port entropy over 60 seconds is a scan and not CDN load balancing. Domain expertise, meaning an understanding of what stands behind a number in real traffic, remains manual work, and it is what separates an IDS with 0.70 precision from an IDS with 0.95 precision.