The idea of DevSecOps, when it comes to software development and deployment, is to catch as many security issues as possible at the beginning of an application's life cycle. One way to do this is to perform security testing on every source code commit to prevent insecure code from being introduced into a codebase. In the prior lab, memory corruption vulnerabilities in C were identified and exploited after program compilation. In this lab, we will examine the effectiveness of a lightweight source code analysis tool called semgrep (semantic regular expression search).
semgrep allows a developer to specify code rules in a YAML file; the tool then searches a codebase for code that matches those rules. Large libraries of semgrep rules have been developed to identify vulnerable source patterns in a range of programming languages. Examples can be found at the semgrep developer site, its GitHub repository, and the semgrep-rules GitHub repository.
To begin, bring up the Kali VM. Within the VM, create a directory for the lab and a Python virtual environment inside that directory. Then activate the environment and install the semgrep package.
mkdir sast
cd sast
virtualenv -p python3 env
source env/bin/activate
pip install semgrep
The Standard C library initially contained string manipulation calls that did not explicitly take into account the length of the data being handled. Calls for string copying (strcpy) and string concatenation (strcat) were thus vulnerable to overflow attacks, in which an adversary controlling one of the source arguments of the call could corrupt the memory of a program. A semgrep rule that can identify the use of these insecure function calls is shown below. It uses the pattern-either directive and specifies two code patterns (strcpy and strcat) that should be flagged if they appear in a file.
rules:
  - id: insecure-string-operation
    message: >-
      Potential buffer overflow
    severity: ERROR
    languages:
      - c
    patterns:
      - pattern-either:
          - pattern: strcpy(...)
          - pattern: strcat(...)
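To make the matching behavior concrete, the sketch below marks which calls the two patterns would be expected to match. It is an illustration only and is not one of the lab files; the function name build_label and the 16-byte requirement on out are assumptions made for the example. The sizes are chosen so that the code itself is safe, which underlines that the rule matches on function names alone.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical example: builds "user: root!" in out and returns its length.
 * The caller must supply a buffer of at least 16 bytes.
 */
size_t build_label(char *out) {
    strcpy(out, "user: ");   /* matches the pattern strcpy(...) */
    strcat(out, "root");     /* matches the pattern strcat(...) */
    strncat(out, "!", 1);    /* different function name, so neither pattern matches */
    return strlen(out);
}
```

Because the ellipsis in each pattern stands for any argument list, the rule has no way to tell these safe calls apart from unsafe ones.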
We can test the rule by creating a simple C program that contains an insecure use of strcpy. For example, the program below attempts to copy the name of the program (argv[0]) into a 10-character array (some_string), then print it out. Unfortunately, if the name of the program is 10 or more characters long, the string and its terminating null byte no longer fit in the array, and the copy results in memory corruption.
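#include <stdio.h>
#include <string.h>

int main(int argc, char** argv) {
    char some_string[10];
    /* Intentionally unsafe: nothing checks that argv[0] fits in some_string */
    strcpy(some_string, argv[0]);
    printf("Name of command is: %s\n", some_string);
}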
Create the two files (the rule as strcpy-rule.yaml and the program as strcpy.c), then run semgrep to apply the rule to the program and detect the insecure code.
semgrep --config strcpy-rule.yaml strcpy.c
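The overflow in the program above comes down to one comparison: the bytes needed by the source string, including its terminating null byte, against the size of the destination. A minimal sketch of a copy that performs this check is shown here; copy_if_fits is a hypothetical helper introduced for illustration and is not part of the lab files or the C library.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy src into dst only if the string and its
 * terminating '\0' fit in dst_size bytes.
 * Returns 0 on success and -1 if the copy would overflow dst.
 */
int copy_if_fits(char *dst, size_t dst_size, const char *src) {
    assert(dst != NULL && src != NULL);   /* caller bugs, not input errors */

    size_t needed = strlen(src) + 1;      /* +1 for the terminating '\0' */
    if (needed > dst_size) {
        return -1;                        /* refuse rather than corrupt memory */
    }
    memcpy(dst, src, needed);
    return 0;
}
```

With a 10-byte destination, a 9-character name is accepted and a 10-character name is rejected, which is the boundary the original program ignores.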
One of the problems with static analysis is that it can sometimes lead to false positives. When a developer is given too many false positives, they stop using the tool altogether. Consider the functions below. While one function contains a vulnerability, the other does not.
void str0(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, argv[0]);
}
void str1(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, "Hello");
}
Run the rule on the new file.
semgrep --config strcpy-rule.yaml strcpy-test-1.c
In the str1 function above, a static string that is shorter than the allocated buffer is copied into it, so flagging it is a false positive. We can attempt to fix this false positive by adding a bit more to the semgrep rule. Consider the rule below, which adds logic to skip calls in which a static string is being copied.
rules:
  - id: insecure-string-operation
    message: >-
      Potential buffer overflow
    severity: ERROR
    languages:
      - c
    patterns:
      - pattern-either:
          - pattern: strcpy(...)
          - pattern: strcat(...)
      - pattern-not: $FUN($BUF, "...")
Run semgrep on the file again using the updated rule.
semgrep --config strcpy-rule-fix.yaml strcpy-test-1.c
While false positives are undesirable, so are false negatives: vulnerable code that passes through our check without being flagged. Consider the code below that copies a larger static string into the buffer. In this case, an overflow exists that should be caught.
void str2(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, "Hello world!");
}
Run the previous semgrep rule on this new file.
semgrep --config strcpy-rule-fix.yaml strcpy-test-2.c
Unfortunately, this example requires more than simple pattern matching to address. Without directly comparing the length of the static string in the strcpy call against the length of the buffer, a rule will have difficulty avoiding both false positives and false negatives.
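The length comparison that a pattern-based rule cannot express is simple arithmetic for a compiler. The sketch below works it out for the two static strings used in this lab, assuming a C11 compiler; the macro LITERAL_FITS and the two function names are hypothetical and exist only for this illustration.

```c
#include <assert.h>   /* static_assert (C11) */

/* sizeof on a string literal includes its terminating '\0'. */
static_assert(sizeof("Hello") == 6, "5 characters plus the terminator");
static_assert(sizeof("Hello world!") == 13, "12 characters plus the terminator");

/* Hypothetical macro: nonzero if the literal, terminator included, fits in the array. */
#define LITERAL_FITS(buf, lit) (sizeof(lit) <= sizeof(buf))

int str1_copy_fits(void) {
    char some_string[10];
    return LITERAL_FITS(some_string, "Hello");          /* 6 <= 10 holds */
}

int str2_copy_fits(void) {
    char some_string[10];
    return LITERAL_FITS(some_string, "Hello world!");   /* 13 <= 10 does not hold */
}
```

The first function reports that the copy fits and the second reports that it does not, which is the distinction the rule would need to draw between a safe and an unsafe copy of a static string.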
semgrep provides a language for specifying rules that the semgrep engine applies to determine whether vulnerable code patterns exist. Rather than learning the language itself, one might use generative artificial intelligence tools to generate rules. Using one of ChatGPT, Gemini, or Claude, develop a semgrep rule that attempts to return the correct results for all three functions, using a prompt such as the one below.
You are a talented developer of semgrep rules. Craft a rule that
detects vulnerable usage of the strcpy() call in C code. The rule
should accurately identify when unsafe usage causes a buffer overflow
while ignoring safe usage. For example, the rule should flag the
strcpy functions in str0 and str2, but not str1.

void str0(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, argv[0]);
}
void str1(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, "Hello");
}
void str2(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, "Hello world!");
}
Because semgrep is a pattern matcher with limited ability to reason, it is difficult to create a rule that must reason about the lengths of the source and destination buffers to determine whether a vulnerability exists in code.
Generative AI models incorporate reasoning into their training and can help address semgrep's limited ability to reason. However, it is important to note that such models are heavyweight solutions that only make sense for tasks that cannot be done more efficiently by tools such as semgrep. In this step, using one of the generative AI models from the previous step, create a new prompt for each code instance and ask the model whether the code can be exploited to corrupt memory in the program.
You are a talented analyzer of memory corruption vulnerabilities
in C. Explain whether or not the code below is vulnerable.
void str0(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, argv[0]);
}

You are a talented analyzer of memory corruption vulnerabilities
in C. Explain whether or not the code below is vulnerable.
void str1(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, "Hello");
}

You are a talented analyzer of memory corruption vulnerabilities
in C. Explain whether or not the code below is vulnerable.
void str2(int argc, char** argv) {
    char some_string[10];
    strcpy(some_string, "Hello world!");
}