Under the Hood of SAST: How Code Analysis Tools Find Security Defects

Today, we'll discuss how SAST solutions detect security flaws. I'll explain how different approaches to identifying potential vulnerabilities complement each other, why each is needed, and how theory translates into practice.


This article is based on the talk "Under the Hood of SAST: How Code Analysis Tools Find Security Defects" from TechLead Conf 2022. The content has been shortened and edited for readability.

SAST (Static Application Security Testing) is an approach to finding security defects without running the application. While "classic" static analysis focuses on finding errors, SAST focuses on finding potential vulnerabilities.

How do we view SAST from the outside? We take the source code, feed it to the analyzer, and the output is a report listing potential security issues.


The main goal of this article is to answer the question of how SAST tools find potential vulnerabilities.

Types of information used
SAST solutions don't analyze source code in its simple textual representation: this is inconvenient, inefficient, and often insufficient. Therefore, analyzers work with intermediate representations of the code and several types of information. Together, they provide the most complete picture of the application.

Syntactic information
Analyzers use intermediate representations of code to perform their work. The most common are syntax trees (abstract syntax trees or parse trees).

Let's look at the error pattern:

operand#1 <operator> operand#1
The point is that the same operand is used on both the left and right sides of the operator. Code like this can contain an error, for example, when using a comparison operator:

a == a
However, this is just one special case, and there are many variations:

one or both operands may be wrapped in parentheses;
the operator may be not only '==', but also '!=', '||', etc.;
the operands may be not identifiers, but element accesses, function calls, etc.
Matching all of these variations against the code as plain text is inconvenient. This is where syntax trees come in handy.

Consider the expression a == (a). The parse tree for it might look like this:


Working with such trees is convenient: there's information about the structure, and extracting operands and operators from expressions is easy. Need to omit parentheses? No problem, just descend the tree.

Thus, trees serve as a convenient structured representation of code. But trees alone are not enough.
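To make the pattern concrete, here is a minimal sketch of the "same operand on both sides" check, written in Python with its standard ast module rather than with any real analyzer's API. The function name find_identical_operands and the sample code are my own illustration. Note that Python's tree is an abstract syntax tree, so the parentheses are already gone by the time we look at the operands, and the sketch covers only comparison operators.

```python
import ast

# Operators for which identical operands on both sides are suspicious.
SUSPICIOUS_COMPARISONS = (ast.Eq, ast.NotEq, ast.Lt, ast.Gt, ast.LtE, ast.GtE)


def find_identical_operands(source: str) -> list[int]:
    """Return line numbers of comparisons whose two operands are the same expression."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # Only simple binary comparisons such as `a == a`, not chains like `a < b < c`.
        if isinstance(node, ast.Compare) and len(node.ops) == 1:
            left, right = node.left, node.comparators[0]
            if isinstance(node.ops[0], SUSPICIOUS_COMPARISONS):
                # ast.dump gives a structural fingerprint that ignores
                # parentheses, spacing, and source positions.
                if ast.dump(left) == ast.dump(right):
                    hits.append(node.lineno)
    return sorted(hits)


sample = "x = a == (a)\ny = a == b\nz = items[0] != (items[0])\n"
print(find_identical_operands(sample))  # lines 1 and 3 are reported
```

Comparing structural dumps instead of source text is what lets the same few lines handle identifiers, element accesses, and parenthesized operands alike.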

Semantic information
Let's look at an example:

if (lhsVar == rhsVar)
{ .... }
If lhsVar and rhsVar are variables of type double, this code may be problematic. For example, if both lhsVar and rhsVar are exactly equal to 0.5, the comparison evaluates to true. However, if one value is 0.5 and the other is 0.4999999999999, the comparison evaluates to false. This raises the question: what behavior does the developer expect? If a difference this small should count as within acceptable error, the comparison should be rewritten.
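As a quick illustration of what "rewritten" could mean, here is a small Python sketch. The values and the tolerance are my own choice for the example; the right tolerance always depends on the application.

```python
import math

lhs_var = 0.1 + 0.2  # accumulates a tiny rounding error
rhs_var = 0.3

# Exact comparison: fails because the two values differ in the last bits.
exactly_equal = lhs_var == rhs_var

# Tolerance-based comparison: treats values within a relative error as equal.
close_enough = math.isclose(lhs_var, rhs_var, rel_tol=1e-9)

print(exactly_equal, close_enough)
```

The first comparison is the kind of code the analyzer should point at; the second is one possible way to state the intent explicitly.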

Let's say we want to catch such cases. But here's the problem: the same comparison would be perfectly valid if the lhsVar and rhsVar types were integers.

Let's imagine: the analyzer checks the code and encounters the following expression:

if (lhsVar == rhsVar)
{ .... }
Question: should the analyzer issue a warning here or not? We can look at the tree and see that the operands are identifiers and that the infix operation is a comparison. However, we can't say whether this case is dangerous, since we don't know the types of the lhsVar and rhsVar variables.

This is where semantic information comes in. Using semantics, you can obtain information about tree nodes:

what type (in terms of the programming language) the expression corresponding to the node has;
what entity the node represents: a local variable, a parameter, a field, etc.;
...
In the example above, we need the types of the lhsVar and rhsVar variables. All we need to do is retrieve them through the semantic model. If the variables are of a floating-point type, the analyzer issues a warning.
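Here is a rough sketch of that idea in Python. The "semantic model" is reduced to a dictionary built from parameter type annotations, which is a big simplification of my own: real analyzers resolve types through a full semantic model. The function name find_float_equality and the analyzed snippet are hypothetical.

```python
import ast


def find_float_equality(source: str) -> list[int]:
    """Flag `==`/`!=` comparisons where both operands are names declared as float."""
    hits = []
    for func in ast.walk(ast.parse(source)):
        if not isinstance(func, ast.FunctionDef):
            continue
        # Toy "semantic model": parameter name -> declared type name.
        types = {
            arg.arg: arg.annotation.id
            for arg in func.args.args
            if isinstance(arg.annotation, ast.Name)
        }
        for node in ast.walk(func):
            if isinstance(node, ast.Compare) and len(node.ops) == 1:
                operands = [node.left, node.comparators[0]]
                if isinstance(node.ops[0], (ast.Eq, ast.NotEq)) and all(
                    isinstance(op, ast.Name) and types.get(op.id) == "float"
                    for op in operands
                ):
                    hits.append(node.lineno)
    return sorted(hits)


code = (
    "def same_ints(lhs: int, rhs: int):\n"
    "    return lhs == rhs\n"
    "def same_doubles(lhs: float, rhs: float):\n"
    "    return lhs == rhs\n"
)
print(find_float_equality(code))  # only the float comparison on line 4 is reported
```

The two functions have identical syntax trees for the comparison itself; only the type lookup separates the harmless case from the suspicious one.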


Function annotations
Sometimes syntax and semantics aren't enough. Let's look at an example:

IEnumerable<int> seq = null;
var list = Enumerable.ToList(seq);
....
The ToList method is declared in an external library, and the analyzer doesn't have access to its source code. The seq variable holds null and is passed to ToList. Is this a safe operation or not?

Let's use the syntactic information. We can tell where the literal is, where the identifier is, and where the method call is. Is the method call safe? It's unclear.

Let's try some semantics. We can see that seq is a local variable, and even calculate its value. What can we learn about Enumerable.ToList? For example, the return type and the parameter type. Is it safe to pass null to it? It's unclear.

One possible solution is annotations. Annotations are a way to tell the analyzer what a method does, what constraints it imposes on input and output values, and so on.

A conditional annotation for the ToList method in the analyzer code might look like this:

Annotation("System.Linq",
           nameof(Enumerable),
           nameof(Enumerable.ToList),
           AddReturn(ReturnFlags.NotNull),
           AddArg(ArgFlags.NotNull));
The main information this annotation contains:

The fully qualified name of the method (including the type name and namespace). If there are overloads, additional parameter information may be required;
Return value constraints. ReturnFlags.NotNull indicates that the return value will not be null;
Input value constraints. ArgFlags.NotNull tells the analyzer that the method's only argument must not be null.
Let's go back to the original example:

IEnumerable<int> seq = null;
var list = Enumerable.ToList(seq);
....
With the annotation mechanism, the analyzer knows the constraints of the ToList method. If it also tracks the value of the seq variable, it can warn that the call will throw an exception (for Enumerable.ToList with a null source, an ArgumentNullException).
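To show how an annotation table and a known value combine into a warning, here is a minimal Python sketch. The Annotation class, the ANNOTATIONS table, and check_call are all hypothetical names of mine and do not reproduce any real analyzer's annotation format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """What the analyzer is told about a function it cannot look inside (toy format)."""
    returns_not_null: bool
    not_null_args: frozenset  # indexes of arguments that must not be null


# Hypothetical annotation table keyed by fully qualified method name.
ANNOTATIONS = {
    "System.Linq.Enumerable.ToList": Annotation(
        returns_not_null=True, not_null_args=frozenset({0})
    ),
}


def check_call(qualified_name, arg_is_null):
    """Return warnings for a call, given which arguments are known to be null."""
    annotation = ANNOTATIONS.get(qualified_name)
    if annotation is None:
        return []  # nothing is known about this function, so stay silent
    return [
        f"argument {index} of {qualified_name} must not be null"
        for index, is_null in enumerate(arg_is_null)
        if is_null and index in annotation.not_null_args
    ]


# `seq` is known to be null at the call site `Enumerable.ToList(seq)`.
print(check_call("System.Linq.Enumerable.ToList", [True]))
```

The important design point is that the check itself knows nothing about ToList: all library-specific knowledge lives in the table, so covering a new method means adding data, not code.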

Types of analysis
Now that we have an idea of the information used for analysis, let's move on to the types of analysis themselves.

Pattern-based analysis
Sometimes "ordinary" errors are actually security flaws. Let's look at an example of such a vulnerability.

iOS: CVE-2014-1266

Vulnerability Information:

CVE-ID: CVE-2014-1266
CWE-ID: CWE-20: Improper Input Validation
Record in the NVD database
Description: The SSLVerifySignedServerKeyExchange function in libsecurity_ssl/lib/sslKeyExchange.c in the Secure Transport feature in the Data Security component in Apple iOS 6.x before 6.1.6 and 7.x before 7.0.6, Apple TV 6.x before 6.0.2, and Apple OS X 10.9.x before 10.9.2 does not check the signature in a TLS Server Key Exchange message, which allows man-in-the-middle attackers to spoof SSL servers by (1) using an arbitrary private key for the signing step or (2) omitting the signing step.
Code:

....
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
    goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;
....
At a glance, the code might seem fine. In fact, the second goto is unconditional, so the check that calls SSLHashSHA1.final is never performed.

Formatted to match its actual execution logic, the code looks like this:

....
if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)
    goto fail;
goto fail;
if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)
    goto fail;
....
How can static analysis catch such a defect?

The first way is to notice that the goto is unconditional and is followed by statements that are not preceded by any label.

Let's take a simplified code with the same meaning:

{
    if (condition)
        goto fail;
        goto fail;
    ....
}
The tree for it might look like this:


Block is a set of statements. The tree shows that:

one goto statement belongs to the if statement, and the second one belongs directly to the block;
between the GotoStatement (the goto operator) and the LabeledStatement (the target label) there is an ExpressionStatement;
The goto statement that belongs to the block is executed unconditionally, and there is no label before the ExpressionStatement. Therefore, the ExpressionStatement is unreachable.
Of course, this is a specific heuristic. In practice, such problems are better solved using more general mechanisms for calculating code reachability.
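For illustration, the heuristic can be sketched in a few lines of Python. The block is modeled as a flat list of (kind, payload) tuples, which is a hypothetical stand-in of mine for the children of a Block node, not a real tree API.

```python
def find_unreachable(block):
    """Return indexes of statements that follow an unconditional goto with no label between.

    `block` is a flat list of (kind, payload) tuples standing in for the
    direct children of a Block node. Kinds: "if", "goto", "label", "expr".
    """
    unreachable = []
    after_jump = False
    for index, (kind, _payload) in enumerate(block):
        if kind == "label":
            after_jump = False  # a label can be a jump target, so code is reachable again
        elif after_jump:
            unreachable.append(index)
        if kind == "goto":
            after_jump = True  # a goto directly in the block always transfers control
    return unreachable


block = [
    ("if", "goto fail"),     # goto nested in the if statement: conditional
    ("goto", "fail"),        # goto directly in the block: unconditional
    ("expr", "final(...)"),  # never executed
    ("label", "fail"),
    ("expr", "cleanup()"),
]
print(find_unreachable(block))  # the statement at index 2 is reported
```

The goto inside the if statement does not appear as a direct child of the block, which is exactly why only the second one switches the sketch into the "after jump" state.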

Another way to catch a defect is to see that the code formatting does not match the execution logic.

The simplified algorithm will be as follows:

Measure the indentation of the then branch of the if statement.
Take the statement that follows the if statement.
If that statement is on the line right after the then branch and has the same indentation, issue a warning.

For clarity, the algorithms are simplified and do not take into account corner cases. Most often, diagnostic rules are more complex and contain numerous exceptions for situations where a warning should not be issued.
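The formatting-based algorithm can be sketched like this in Python. It works on raw lines and assumes a brace-less if with a one-line then branch, so it is a toy version of mine: the function name is hypothetical, and a real rule would work on the tree and handle tabs, comments, macros, and many exceptions.

```python
def find_misleading_indentation(lines):
    """Return 1-based numbers of lines that look like part of an `if` body but are not.

    Simplified: handles only a brace-less `if (...)` followed by a one-line then branch.
    """
    def indent(text):
        return len(text) - len(text.lstrip())

    warnings = []
    for i in range(len(lines) - 2):
        header, then_branch, next_stmt = lines[i], lines[i + 1], lines[i + 2]
        is_braceless_if = (
            header.lstrip().startswith("if ") and not header.rstrip().endswith("{")
        )
        if not is_braceless_if or not next_stmt.strip():
            continue
        then_is_indented = indent(then_branch) > indent(header)
        if then_is_indented and indent(next_stmt) == indent(then_branch):
            warnings.append(i + 3)  # line number of the statement after the then branch
    return warnings


code = [
    "if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)",
    "    goto fail;",
    "    goto fail;",
    "if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)",
    "    goto fail;",
]
print(find_misleading_indentation(code))  # the second goto on line 3 is reported
```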

Data flow analysis
Let's look at an example:

if (ptr || ptr->foo())
{ .... }
The developer messed up the logic and confused the '&&' and '||' operators. This means that if ptr is a null pointer, it will be dereferenced.

The context here is quite local, so the error can be found with pattern-based analysis. Problems arise when the context is spread over more code. For example:

if (ptr)
{ .... }
// 50 lines of code
....
auto test = ptr->foo();
Here the ptr pointer is also checked for NULL and then dereferenced without a check, which looks suspicious.

Note: in this text, I use NULL to denote the null pointer value, not the C language macro.

It's difficult to catch this case using patterns. For example, the code above should trigger a warning, but the code below should not, since ptr definitely won't be a null pointer when dereferenced:

if (ptr)
{ .... }
// 50 lines of code
....
if (ptr)
{
    auto test = ptr->foo();
    ....
}
Ultimately, we come to the conclusion that it would be a good idea to track variable values. For the examples above, this would tell us what value the ptr pointer may hold at a given point in the application. If the pointer may be NULL when it is dereferenced, issue a warning; otherwise, don't.

Data flow analysis helps track expression values at different points in the code. Based on this data, the analyzer generates warnings.

Data flow analysis is applicable to various types of data. Examples:

boolean: true or false;
integer: value ranges;
pointers/references: null state.
Let's look at the pointer example again. Null pointer dereference is a security flaw, CWE-476: NULL Pointer Dereference.

if (ptr)
{ .... }
// 50 lines of code
....
auto test = ptr->foo();
The analyzer first encounters a NULL check on ptr. This places restrictions on the value of ptr: in the then branch of the if statement, ptr is not a null pointer. Knowing this, the analyzer will not issue a warning for code like this:

if (ptr)
{
    ptr->foo();
}
What is the value of ptr outside the if statement?

if (ptr)
{ .... }
// ptr - ???

// 50 lines of code
....
auto test = ptr->foo();
In general, it's unknown. However, the analyzer may take into account that ptr has already been checked for NULL. By writing that check, the developer declares that ptr may be NULL. The analyzer can record this fact.

As a result, when the analyzer encounters the expression auto test = ptr->foo(), it can check two conditions:

the exact value of ptr at the point of dereference is unknown;
earlier in the code, ptr was checked for NULL.
When both conditions hold, the dereference looks suspicious and should trigger a warning.
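The two conditions can be sketched as a small state tracker in Python. The event list is a hypothetical linearized view of the code that I invented for the example; a real analyzer walks a control flow graph, and this toy version does not handle nested checks of the same variable or reassignment.

```python
def check_dereferences(events):
    """Walk (kind, variable, line) events and report suspicious dereferences.

    Hypothetical event kinds:
      "enter_check" / "leave_check": the body of `if (var)` begins / ends
      "deref": the variable is dereferenced
    """
    guarded = set()      # variables known to be non-null right now
    was_checked = set()  # variables the code has compared against null at some point
    warnings = []
    for kind, var, line in events:
        if kind == "enter_check":
            guarded.add(var)
            was_checked.add(var)
        elif kind == "leave_check":
            guarded.discard(var)
        elif kind == "deref" and var not in guarded and var in was_checked:
            # Value unknown here, yet the earlier check says null is possible.
            warnings.append(line)
    return warnings


suspicious = [
    ("enter_check", "ptr", 1),
    ("leave_check", "ptr", 2),
    ("deref", "ptr", 53),  # auto test = ptr->foo();
]
safe = [
    ("enter_check", "ptr", 1),
    ("leave_check", "ptr", 2),
    ("enter_check", "ptr", 53),
    ("deref", "ptr", 55),
    ("leave_check", "ptr", 57),
]
print(check_dereferences(suspicious), check_dereferences(safe))
```

Both inputs contain the same check and the same dereference; the only difference is whether the dereference happens while the guard is active, which is exactly what a pattern cannot see.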

Now let's look at how data flow analysis works with integer types. For this, let's take code that contains the security flaw CWE-570: Expression is Always False.

void DataFlowTest(int x)
{
    if (x > 10)
    {
        var y = x - 10;
        if (y < 0)
            ....
        if (y <= 1)
            ....
    }
}
Let's start from the beginning. Let's look at the method definition:

void DataFlowTest(int x)
{ .... }
In a local context (analysis within a single method), the analyzer has no information about the possible values of x. However, the parameter type is known: int. This already limits the range of possible values to [-2,147,483,648; 2,147,483,647] (assuming a 4-byte int).

Further in the code there is a condition:

if (x > 10)
{ .... }
If the analyzer enters the then branch of the if statement, it imposes an additional range constraint: in the then branch, the value of x is in the range [11; 2,147,483,647].

Next comes the declaration and initialization of the variable y:

var y = x - 10;
Since the analyzer knows the range of x, it can also calculate the possible values of y by subtracting 10 from the boundary values. This means that the value of y lies in the range [1; 2,147,483,637].

Next is the if statement:

if (y < 0)
    ....
The analyzer knows that at this point in execution, the value of y lies in the range [1; 2,147,483,637]. So y is always greater than 0, and the expression y < 0 is always false.
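The whole walkthrough fits into a few lines of interval arithmetic. This Python sketch uses helper names of my own and deliberately ignores integer overflow, which a real analyzer has to model.

```python
INT_MIN, INT_MAX = -2_147_483_648, 2_147_483_647  # assuming a 4-byte int


def narrow_greater_than(value_range, bound):
    """Range of a value inside the then branch of `if (value > bound)`."""
    low, high = value_range
    return (max(low, bound + 1), high)


def subtract_constant(value_range, constant):
    """Range of `value - constant` (overflow is ignored in this sketch)."""
    low, high = value_range
    return (low - constant, high - constant)


def is_always_false_less_than(value_range, bound):
    """True if `value < bound` cannot hold for any value in the range."""
    low, _high = value_range
    return low >= bound


x = (INT_MIN, INT_MAX)                # void DataFlowTest(int x)
x_then = narrow_greater_than(x, 10)   # if (x > 10)
y = subtract_constant(x_then, 10)     # var y = x - 10;
print(x_then, y, is_always_false_less_than(y, 0))
```

With these ranges in hand, the y < 0 verdict is a single comparison of the lower bound against 0.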

Let's look at a security flaw that can be found using data flow analysis.

ytnef: CVE-2017-6298

Vulnerability Information:

CVE-ID: CVE-2017-6298
CWE-ID: CWE-476 NULL Pointer Dereference
Record in the NVD database
Description: An issue was discovered in ytnef before 1.9.1. This is related to a patch described as "1 of 9. Null Pointer Deref / calloc return value not checked."
Let's look at the code:

....
TNEF->subject.data = calloc(size, sizeof(BYTE));
TNEF->subject.size = vl->size;
memcpy(TNEF->subject.data, vl->data, vl->size);
....
Let's analyze where the vulnerability comes from:

The calloc function allocates a block of memory and initializes it to zero. If the allocation fails, calloc returns a null pointer.
A potentially null pointer is written to the TNEF->subject.data field.
The TNEF->subject.data field is used as the first argument to memcpy. If the first argument to memcpy is a null pointer, the behavior is undefined. As noted above, TNEF->subject.data can be a null pointer.
To find such a problem, both annotations and data flow analysis are useful.

Annotations:

calloc may return a null pointer;
the first argument to memcpy must not be a null pointer (nor, by the way, may the second).
Data flow analysis tracks:

a potentially null pointer being written from the return value of calloc to TNEF->subject.data;
the value being carried by the TNEF->subject.data field;
the potentially null pointer reaching the first argument of memcpy from the TNEF->subject.data field.

The illustration above shows how the analyzer monitors expression values to find dereferences of a potentially null pointer.
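Here is a rough Python sketch of how the two annotations and the value tracking could work together on this fragment. The statement tuples and table names are hypothetical and of my own making, and the "null_check" statement assumes the code bails out when the check fails.

```python
# Hypothetical annotation tables, in the spirit of the two annotations listed above.
MAY_RETURN_NULL = {"calloc"}
MUST_NOT_BE_NULL = {"memcpy": {0, 1}}  # argument indexes


def find_null_arguments(statements):
    """Report (callee, argument index, expression) for possibly-null arguments.

    Hypothetical statement shapes:
      ("assign_call", target, callee)  target = callee(...)
      ("null_check", expression)       code below runs only if expression is non-null
      ("call", callee, [arguments])    callee(arguments...)
    """
    maybe_null = set()
    findings = []
    for statement in statements:
        kind = statement[0]
        if kind == "assign_call":
            _, target, callee = statement
            if callee in MAY_RETURN_NULL:
                maybe_null.add(target)
            else:
                maybe_null.discard(target)
        elif kind == "null_check":
            maybe_null.discard(statement[1])
        elif kind == "call":
            _, callee, arguments = statement
            for index in sorted(MUST_NOT_BE_NULL.get(callee, ())):
                if index < len(arguments) and arguments[index] in maybe_null:
                    findings.append((callee, index, arguments[index]))
    return findings


vulnerable = [
    ("assign_call", "TNEF->subject.data", "calloc"),
    ("call", "memcpy", ["TNEF->subject.data", "vl->data", "vl->size"]),
]
patched = [
    ("assign_call", "TNEF->subject.data", "calloc"),
    ("null_check", "TNEF->subject.data"),
    ("call", "memcpy", ["TNEF->subject.data", "vl->data", "vl->size"]),
]
print(find_null_arguments(vulnerable), find_null_arguments(patched))
```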

Taint analysis
Sometimes the analyzer doesn't know the exact values of variables, or the possible values are too general to draw conclusions. However, the analyzer may know that the data came from an external source and could be compromised. This opens the door to finding new security flaws.

Let's look at an example of code vulnerable to SQL injection:

using (SqlConnection connection = new SqlConnection(_connectionString))
{
    String userName = Request.Form["userName"];
    using (var command = new SqlCommand()
    {
        Connection = connection,
        CommandText = "SELECT * FROM Users WHERE UserName = '" + userName + "'",
        CommandType = System.Data.CommandType.Text
    })
    {
        using (var reader = command.ExecuteReader())
        { /* Data processing */ }
    }
}
What interests us here is this:

data comes from the user and is written to the userName variable;
userName is substituted into the query, which is written to the CommandText property;
the resulting SQL command is sent for execution.
Let's say the username received from a user is "_SergVasiliev_". The resulting query will look like this:

SELECT * FROM Users WHERE UserName = '_SergVasiliev_'
The original logic is preserved: data for the user named _SergVasiliev_ is retrieved from the database.

Now let's assume the user sent the following string: ' OR '1'='1. After inserting it into the template, the query will look like this:

SELECT * FROM Users WHERE UserName = '' OR '1'='1'
The attacker managed to modify the query logic. The '1'='1' part of the condition is always true, so the query returns data about all users.
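The effect is easy to reproduce with an in-memory SQLite database from Python's standard library. The table contents and the find_users helper are made up for the demonstration; the query template is the same as in the example above.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE Users (UserName TEXT)")
connection.executemany("INSERT INTO Users VALUES (?)", [("alice",), ("bob",)])


def find_users(user_name):
    # Deliberately vulnerable: the input is concatenated into the SQL text.
    query = "SELECT * FROM Users WHERE UserName = '" + user_name + "'"
    return connection.execute(query).fetchall()


print(find_users("alice"))        # one matching row
print(find_users("' OR '1'='1"))  # every row in the table
```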

By the way, this is where the meme about cars with strange license plates comes from:


Let's look at the vulnerable code again:

using (SqlConnection connection = new SqlConnection(_connectionString))
{
    String userName = Request.Form["userName"];
    using (var command = new SqlCommand()
    {
        Connection = connection,
        CommandText = "SELECT * FROM Users WHERE UserName = '" + userName + "'",
        CommandType = System.Data.CommandType.Text
    })
    {
        using (var reader = command.ExecuteReader())
        { /* Data processing */ }
    }
}
The analyzer doesn't know the exact value that will be written to userName. It could be either the safe _SergVasiliev_ or the dangerous ' OR '1'='1. The code itself doesn't impose any restrictions on the string either.

It turns out that data flow analysis isn't suitable for finding SQL injection vulnerabilities in code. This is where taint analysis comes in.

Taint analysis works with data propagation paths. It helps the analyzer track where data comes from, how it spreads throughout the application, and where it ends up.

Taint analysis is used to detect various types of injections and security flaws that arise due to insufficient validation of user input.

For the SQL injection example, taint analysis can construct a data transfer path that can help identify the security flaw:
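A minimal sketch of how such a path could be built, in Python. The source and sink names mirror the C# example, but the statement format and the find_tainted_sinks function are hypothetical; real taint analysis also has to deal with sanitizers, method calls, and branching.

```python
TAINT_SOURCES = {"Request.Form"}
TAINT_SINKS = {"SqlCommand.CommandText"}


def find_tainted_sinks(statements):
    """Propagate taint through (target, kind, operands) statements and report sink hits.

    Hypothetical statement kinds:
      "source": target receives data from the external source named in operands[0]
      "assign": target is computed from the variables/literals listed in operands
      "sink":   operands[0] is written to the sink named by target
    """
    tainted = set()
    findings = []
    for target, kind, operands in statements:
        if kind == "source" and operands[0] in TAINT_SOURCES:
            tainted.add(target)
        elif kind == "assign":
            # The result is tainted if any input is; otherwise the target becomes clean.
            if any(operand in tainted for operand in operands):
                tainted.add(target)
            else:
                tainted.discard(target)
        elif kind == "sink" and target in TAINT_SINKS and operands[0] in tainted:
            findings.append((operands[0], target))
    return findings


program = [
    ("userName", "source", ["Request.Form"]),
    ("query", "assign", ["<literal>", "userName", "<literal>"]),
    ("SqlCommand.CommandText", "sink", ["query"]),
]
print(find_tainted_sinks(program))
```

Notice that the sketch never looks at the actual string values: it only needs to know that userName came from the request and that query was built from it.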


Let's look at an example of a real-world vulnerability that taint analysis can be useful for detecting.

BlogEngine.NET: CVE-2018-14485

Vulnerability Information:

CVE-ID: CVE-2018-14485
CWE-ID: CWE-611 Improper Restriction of XML External Entity Reference
Record in the NVD database
Description: BlogEngine.NET 3.3 allows XXE attacks via the POST body to metaweblog.axd.
We'll cover the BlogEngine.NET vulnerability briefly, as a detailed analysis would require a full article. By the way, such an article exists; you can read it here.

BlogEngine.NET is a blogging platform written in C#. Several blog handlers were found to be vulnerable to XXE (XML eXternal Entity). This vulnerability allows data to be stolen from the machine hosting the blog. To do this, a specially crafted XML file must be sent to a specific URL.

The XXE vulnerability has 2 components:

an insecurely configured XML parser;
data from the attacker that this parser processes.
It's possible to track only the dangerous parser and issue a warning regardless of the data it processes. This approach has its pros and cons:

Pros: Diagnostics are simplified because they don't rely on the data transfer path. If the analyzer can't track how data is transferred within the program, no problem: the warning will still be generated.
Cons: More false positive warnings. Warnings will be issued even when the parser processes only trusted data.
Let's say we decide to track user data after all. This is where taint analysis comes to the rescue again.

Let's return to XXE. CVE-2018-14485 from BlogEngine.NET can be caught like this:


The analyzer tracks the transfer of data from the HTTP request and sees how it is passed between variables and methods. At the same time, the analyzer tracks the movement of the dangerous parser instance (request of type XmlDocument) through the program.

Together, this data converges in the request.LoadXml(xml) call: a parser with a dangerous configuration processes user data.

The theory behind XXE and a detailed description of this vulnerability are collected in the article "Vulnerabilities due to XML file processing: XXE in C# applications in theory and practice."

I also recommend watching the talk on which this article is based: it includes a video example of exploiting the vulnerability (timestamp: 28:43).

Conclusion
We've looked at some approaches used to find vulnerabilities, including their strengths and weaknesses. The main goal of this article is to explain how SAST tools find vulnerabilities. However, in conclusion, I'd like to remind you why they do it.

1. The number of vulnerabilities is growing year after year, as confirmed by statistics. This means security is a must.


2. The sooner a vulnerability is found, the easier and cheaper it is to fix. SAST helps reduce financial and reputational risks by detecting security flaws early. I covered this topic in more detail in the article "SAST's Place in the Secure SDLC: 3 Reasons to Implement It in the DevSecOps Pipeline."



As a reminder, the text above is a condensed, edited version of the talk "Under the Hood of SAST: How Code Analysis Tools Find Security Defects." The talk itself is similar in structure, but it includes more examples.
 