Malware has long been a major threat to information security. Approaches to analyzing and protecting against these types of attacks vary. Generally, two approaches are distinguished: static and dynamic analysis.
The goal of static analysis is to find patterns of malicious content in a file or process memory. These can be strings, fragments of encoded or compressed data, or sequences of compiled code. The search can include not only individual patterns but also combinations of patterns with additional conditions (for example, linking to the signature location or checking the relative distance between locations).
Dynamic analysis is the analysis of program behavior. It's worth noting that a program can be run in so-called emulated mode, which allows for safe interpretation of actions without causing damage to the operating system. Another approach is to run the program in a virtual environment (sandbox). In this case, actions are executed honestly on the system, with subsequent call logging. The level of logging verbosity is a balance between the depth of observation and the performance of the analyzing system. The output is a log of the program's actions in the operating system (a behavior trace), which can be further analyzed.
Dynamic or behavioral analysis offers a key advantage: regardless of attempts to obfuscate the program code or conceal the attacker's intentions from the malware analyst, malicious activity will be detected. Reducing malware detection to behavioral analysis allows for the hypothesis of the robustness of an advanced malware detection algorithm. And behavioral reproducibility, thanks to the same initial state of the analysis environment (a snapshot of the virtual server state), simplifies the task of classifying legitimate and malicious behavior.
Behavioral analysis approaches are often based on rule sets. Expert analysis is translated into signatures, which are used by malware and file detection tools to draw conclusions. However, this can lead to a problem: only attacks that strictly match the written rules may be considered, while attacks that do not meet these conditions but are still malicious may be missed. The same problem arises when the same malware changes. This can be addressed either by using more flexible trigger criteria, meaning a more general rule, or by using multiple rules specific to each malware type. The first scenario risks many false positives, while the second is time-consuming, potentially delaying necessary updates.
There's a need to extend existing knowledge to other similar cases. That is, cases we haven't encountered or processed with rules before, but based on the similarity of certain features, we can conclude that the activity may be malicious. This is where machine learning algorithms come in.
When trained correctly, ML models have generalization ability. This means that the trained model has not only learned all the examples it was trained on, but is also capable of making decisions about new examples based on patterns in the training set.
However, for the generalizing ability to work, two main factors must be taken into account at the learning stage:
Dataset compilation began with the expert team. Features were selected that, in the experts' opinion, were expected to be significant for malware detection. All features could be reduced to n-grams of system calls. Next, a model was used to evaluate which features contributed most to detection, discarding the redundant features, and producing the final dataset.
Initial data:
{"count":1,"PID":"764","Method":"NtQuerySystemInformation","unixtime":"1639557419.628073","TID":"788","plugin":"syscall","PPID":"416","Others":"REST: ,Module=\"nt\",vCPU=1,CR3=0x174DB000,Syscall=51,NArgs=4,SystemInformationClass=0x53,SystemInformation=0x23BAD0,SystemInformationLength=0x10,ReturnLength=0x0","ProcessName":"windows\\system32\\svchost.exe"}
{"Key":"\\registry\\machine","GraphKey":"\\REGISTRY\\MACHINE","count":1,"plugin":"regmon","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe","PPID":"1324","PID":"616"}
{"count":1,"PID":"616","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","plugin":"syscall","PPID":"1324","Others":"REST: ,Module=\"nt\",vCPU=0,CR3=0x4B7BF000,Syscall=19,NArgs=5,KeyHandle=0x1F8,KeyInformationClass=0x7,KeyInformation=0x20CD88,Length=0x4,ResultLength=0x20CD98","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe"}
Intermediate data (sequences):
syscall_NtQuerySystemInformation*regmon_NtQueryKey*syscall_NtQueryKey
The feature vector is presented in the table:
The answer is the same old reference sample. We consider it the most accurate because examples from this sample are manually checked and labeled by experts, and with each update, we primarily verify that we still guarantee 100% accuracy on this sample. Testing in the wild confirms that accuracy is improving.
This is achieved by removing inconsistent reference data from the training set. By inconsistent data, we mean examples accumulated from the stream that are sufficiently close in vector distance to traces from the reference set but have the opposite label.
Our experiments showed that such examples are outliers even from the point of view of the data from the stream, since after removing them from the training set in order to increase the accuracy on the reference set, the accuracy on the stream also increased.
Examples where the ML approach was able to truly expand the solution include:
The goal of static analysis is to find patterns of malicious content in a file or process memory. These can be strings, fragments of encoded or compressed data, or sequences of compiled code. The search can include not only individual patterns but also combinations of patterns with additional conditions (for example, linking to the signature location or checking the relative distance between locations).
Dynamic analysis is the analysis of program behavior. It's worth noting that a program can be run in so-called emulated mode, which allows for safe interpretation of actions without causing damage to the operating system. Another approach is to run the program in a virtual environment (sandbox). In this case, actions are executed honestly on the system, with subsequent call logging. The level of logging verbosity is a balance between the depth of observation and the performance of the analyzing system. The output is a log of the program's actions in the operating system (a behavior trace), which can be further analyzed.
Dynamic or behavioral analysis offers a key advantage: regardless of attempts to obfuscate the program code or conceal the attacker's intentions from the malware analyst, malicious activity will be detected. Reducing malware detection to behavioral analysis allows for the hypothesis of the robustness of an advanced malware detection algorithm. And behavioral reproducibility, thanks to the same initial state of the analysis environment (a snapshot of the virtual server state), simplifies the task of classifying legitimate and malicious behavior.
Behavioral analysis approaches are often based on rule sets. Expert analysis is translated into signatures, which are used by malware and file detection tools to draw conclusions. However, this can lead to a problem: only attacks that strictly match the written rules may be considered, while attacks that do not meet these conditions but are still malicious may be missed. The same problem arises when the same malware changes. This can be addressed either by using more flexible trigger criteria, meaning a more general rule, or by using multiple rules specific to each malware type. The first scenario risks many false positives, while the second is time-consuming, potentially delaying necessary updates.
There's a need to extend existing knowledge to other similar cases. That is, cases we haven't encountered or processed with rules before, but based on the similarity of certain features, we can conclude that the activity may be malicious. This is where machine learning algorithms come in.
When trained correctly, ML models have generalization ability. This means that the trained model has not only learned all the examples it was trained on, but is also capable of making decisions about new examples based on patterns in the training set.
However, for the generalizing ability to work, two main factors must be taken into account at the learning stage:
- The set of features should be as complete as possible (so that the model can see as many patterns as possible and, accordingly, better extend its knowledge to new examples), but not redundant (so as not to store and process features that do not carry useful information for the model).
- The data set must be representative, balanced and regularly updated.
How expert knowledge is transferred to machine learning models
In the context of malware analysis, the source data is the files themselves, and the intermediate data is the auxiliary processes they create. These processes, in turn, make system calls. The sequences of these calls are the data we need to transform into a set of features.Dataset compilation began with the expert team. Features were selected that, in the experts' opinion, were expected to be significant for malware detection. All features could be reduced to n-grams of system calls. Next, a model was used to evaluate which features contributed most to detection, discarding the redundant features, and producing the final dataset.
Initial data:
{"count":1,"PID":"764","Method":"NtQuerySystemInformation","unixtime":"1639557419.628073","TID":"788","plugin":"syscall","PPID":"416","Others":"REST: ,Module=\"nt\",vCPU=1,CR3=0x174DB000,Syscall=51,NArgs=4,SystemInformationClass=0x53,SystemInformation=0x23BAD0,SystemInformationLength=0x10,ReturnLength=0x0","ProcessName":"windows\\system32\\svchost.exe"}
{"Key":"\\registry\\machine","GraphKey":"\\REGISTRY\\MACHINE","count":1,"plugin":"regmon","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe","PPID":"1324","PID":"616"}
{"count":1,"PID":"616","Method":"NtQueryKey","unixtime":"1639557419.752278","TID":"3420","plugin":"syscall","PPID":"1324","Others":"REST: ,Module=\"nt\",vCPU=0,CR3=0x4B7BF000,Syscall=19,NArgs=5,KeyHandle=0x1F8,KeyInformationClass=0x7,KeyInformation=0x20CD88,Length=0x4,ResultLength=0x20CD98","ProcessName":"users\\john\\desktop\\e95b20e76110cb9e3ecf0410441e40fd.exe"}
Intermediate data (sequences):
syscall_NtQuerySystemInformation*regmon_NtQueryKey*syscall_NtQueryKey
The feature vector is presented in the table:
… | syscall_NtQuerySystemInformation regmon_NtQueryKey | regmon_NtQueryKeysyscall_NtQueryKey | syscall_NtQuerySystemInformation*syscall_NtQueryKey | … |
… | 1 | 1 | 0 | … |
How the model's knowledge accumulated, how this process changed, and why it is important to stop data accumulation in time
As mentioned above, the main requirements for data are representativeness, balance, and regular updating. Let's explain all three points in the context of behavioral analysis of malicious files:- Representativeness. The distribution of data by features should be close to the distribution in real life.
- Balance. The initial data for training the model is labeled as "legitimate" or "malicious," and this information is fed into the model. This means we solve the problem when the number of malicious examples is close to the number of clean examples.
- Regular updates. This is largely related to representativeness. Since trends in malicious files are constantly changing, the model's knowledge must be updated regularly.
- The data is divided into two types: the main data stream and benchmark examples. The benchmark examples are manually verified by experts, and their labeling accuracy is guaranteed. They are needed for model validation and managing the training set by adding benchmarks. The main stream is labeled using rules and automated checks. It is needed to enrich the data set with a variety of real-world examples.
- All standards are immediately added to the training set.
- Additionally, a certain initial dataset is added from the stream to obtain the required training data volume. The required data volume here refers to the volume at which the training set is sufficiently complete (in terms of data diversity) and representative. Since benchmark examples are manually verified by experts, it is impossible to collect tens of thousands of examples from benchmarks alone, hence the need to obtain data diversity from the stream.
- The model is periodically tested on new data from the stream.
- Accuracy must be guaranteed primarily for reference examples; if contradictions arise, preference is given to the reference data, which are retained in any case.
- The training sample accumulated to the current moment is fixed;
- Data from the stream is now used only for testing the model; no instances are added to the training set;
- Updating the training set is only possible if the set of reference examples is updated.
- We verified that the trained and fixed model can be sufficiently robust to data drift;
- We control each new example added to the training set (reference examples are checked manually by experts);
- We can track every change and guarantee accuracy on a reference data set.
How to ensure model quality improves with each update
After the described process of data accumulation, a completely natural question may arise: why are we so sure that each update of the model actually improves it?The answer is the same old reference sample. We consider it the most accurate because examples from this sample are manually checked and labeled by experts, and with each update, we primarily verify that we still guarantee 100% accuracy on this sample. Testing in the wild confirms that accuracy is improving.
This is achieved by removing inconsistent reference data from the training set. By inconsistent data, we mean examples accumulated from the stream that are sufficiently close in vector distance to traces from the reference set but have the opposite label.
Our experiments showed that such examples are outliers even from the point of view of the data from the stream, since after removing them from the training set in order to increase the accuracy on the reference set, the accuracy on the stream also increased.
Complementarity of the ML approach and behavioral detections in the form of correlations
The ML model performed very well in combination with behavioral detections in the form of correlations. It's important to note that this was done in combination, as the model's generalization ability is beneficial in cases where a solution needs to be expanded to detect similar and related incidents, but not in cases where detection is required within a clear understanding of the rules and criteria for what constitutes malware.Examples where the ML approach was able to truly expand the solution include:
- Anomalous subprocess chains. A large number of branching chains is, in itself, a legitimate phenomenon. However, the model notices anomalies in the number of nodes, the degree of nesting, and the recurrence or non-recurrence of specific process names, while a human wouldn't presume to find such things malicious.
- Non-standard default call parameter values. In most cases, the analyst is interested in the significant parameters of the functions being searched for malware. The remaining parameters are, roughly speaking, default values and are of little interest. But at some point, it happens that instead of, say, five default values, a sixth one is encountered. The analyst might not have anticipated this, but the model noticed.
- Atypical function call sequences. This is the case when each function individually doesn't do anything malicious. Nor does it do anything malicious when combined. But it so happens that this sequence isn't found in legitimate software. It would take a tremendous amount of experience for an analyst to spot such a pattern on their own. The model, however, notices it (and more than one) by solving a non-standard classification problem based on a characteristic that wasn't originally intended as an indicator of maliciousness.
- The use of a specific component in a single call for malicious action. The system uses hundreds of objects in varying degrees and to varying degrees. It's unlikely to detect the use of one among millions of others—the granularity of the anomaly is still rather low.
- Proactive detection based on a threat model. It's decided that a certain action on a certain object in the system, even just once, is unacceptable. The model may not immediately recognize this as a significant event, and there's a risk of error or an uncertain decision when classifying something similar.
- Obfuscation of action sequences. For example, it might be known that 3-4 actions need to be performed in a specific order. The intervening steps are irrelevant. Adding unnecessary steps between the 3-4 key steps will confuse the model, and the decision will be incorrect. However, the limited number of features prevents such obfuscation from being accounted for by storing all combinations of call sequences, not just the total number.