Learn CodeQL in X Minutes
CodeQL is a white-box source code auditing tool that organizes code and its metadata in a novel way, letting researchers “query code as if it were a database” and discover security issues in it. In 2019, GitHub acquired Semmle, the company that developed CodeQL, and the two jointly established GitHub Security Lab. Semmle had previously launched LGTM, a source code analysis platform for open source communities and enterprises. LGTM and CodeQL both remain free for the open source community and developers.
This article starts from the basics of CodeQL and follows the interactive course [1] released by GitHub Security Lab to show how CodeQL is applied in practical code auditing: a specific version of the U-Boot source code contains multiple security issues, and a single CodeQL query locates nine CVE vulnerabilities.
How CodeQL works
A CodeQL query runs against a database, which is built by an extractor that analyzes the source code. Once the database exists, we can use CodeQL to explore the source code and find instances of known problem patterns in it.
For compiled languages, CodeQL monitors the normal build process when creating a database. Each time a build tool such as make invokes a compiler such as gcc, CodeQL also invokes the extractor with the same compilation arguments to collect all relevant information about the source code, such as the abstract syntax tree (AST), name bindings and types of functions and variables, and preprocessor operations. For interpreted languages there is no compiler, so the extractor runs directly on the source code to obtain similar information.
After using the CodeQL CLI to run the analysis [4] on a repository, we get a “snapshot database” [5], which stores a hierarchical representation of the repository at a specific point in time (when the database was created), including the abstract syntax tree (AST), the control flow graph (CFG), and the data flow graph (DFG). In this database, every element of the code, such as a function definition (Function), a function call (FunctionCall), or a macro invocation (MacroInvocation), is an entity that can be queried. On this basis, we write CodeQL queries to analyze the code.
Installing the CodeQL environment is not covered here; it is described in detail in the official tutorial [6] and in the course [1] that this article follows. After importing the database in VS Code, we can start writing our first CodeQL query.
Let’s take the byte order conversion functions as an example and find the definitions of ntohs, ntohl, and ntohll in the U-Boot codebase, where they are defined as macros.
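A minimal sketch of such a query is given here, assuming the standard CodeQL C/C++ library (`import cpp`) and a database built from the U-Boot sources. It is a reconstruction in the spirit of the course, not necessarily the exact query the course uses.

```ql
import cpp

// Find macro definitions whose name is exactly ntohs, ntohl, or ntohll.
// regexpMatch requires the whole string to match the pattern.
from Macro m
where m.getName().regexpMatch("ntoh(s|l|ll)")
select m
```

The `from` clause declares a variable `m` of type `Macro`, the `where` clause filters it by name, and `select` decides what appears in the results.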
From the snippet above we can see that CodeQL follows basic semantics similar to SQL, although the clauses are written in the order from ... where ... select. The difference is that CodeQL adds object-oriented ideas: m.getName() returns the name of the queried object, and calling a string-matching predicate such as regexpMatch on that name gives the logical expression we need to filter by name. In the results table for this example, the blue code fragment in each row can be clicked to jump to the definition of the corresponding macro in the U-Boot codebase.
Several commonly used query object types:
- Function: a function definition or declaration
- FunctionCall: a function call
- Macro: a macro definition
- MacroInvocation: a macro invocation
- Expr: an expression
- AssignExpr: an assignment expression (a subclass of Expr)
- ConditionalStmt: a conditional statement such as if or switch
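As a quick illustration of how these types are used, the sketch below lists every call to memcpy together with its third argument. It assumes the standard C/C++ library and that memcpy is a real function (not a macro) in the analyzed code.

```ql
import cpp

// List each call whose target function is named memcpy,
// along with the expression passed as the length argument (index 2).
from FunctionCall call
where call.getTarget().getName() = "memcpy"
select call, call.getArgument(2)
```

Arguments are indexed from 0, so `getArgument(2)` is the third parameter.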
With the approach above, we can use CodeQL to query the basic units of the code. Going one step further, we can define classes that encapsulate complex conditions and produce more precise results.
We extend the macro query from the previous example by defining a NetworkByteSwap class, which represents the set of all expressions that meet “certain characteristics”. In this example, we restrict it to expressions produced by invocations of macros such as ntohs and ntohl, and list them with a simple from NetworkByteSwap n select n.
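One way such a class might look is sketched below, again assuming the standard C/C++ library. The class name comes from the text; the exact regular expression is an assumption that also covers ntohll.

```ql
import cpp

// An expression that results from expanding ntohs, ntohl, or ntohll.
class NetworkByteSwap extends Expr {
  NetworkByteSwap() {
    // The characteristic predicate: `this` must be the expression
    // generated by some invocation of one of the target macros.
    exists(MacroInvocation mi |
      mi.getMacroName().regexpMatch("ntoh(s|l|ll)") and
      this = mi.getExpr()
    )
  }
}

from NetworkByteSwap n
select n, "Network byte swap"
```

Because the class extends Expr, every NetworkByteSwap value can be used anywhere an Expr is expected, which is what makes it convenient as a taint source later.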
Running the query above lists every matching expression. Clicking the first result makes the editor jump to the file that contains the expression and highlight the whole expression, which is very intuitive.
Now that we have the basics, we can try using CodeQL to hunt for vulnerabilities!
Recalling the previous step, we defined a NetworkByteSwap class to pick out the expressions that call ntohs and its relatives. Next, we bring in CodeQL's taint tracking module, mark these expressions as the taint source, and set the sink to the third parameter of memcpy. According to the Linux man page [7], the third parameter of memcpy is the length of the block to copy. Because ntohs is a byte order conversion function, the data passing through it most likely comes from the network and is therefore user-controllable. If such a value reaches memcpy as the length, the result is a user-controllable memory operation, and that is the source of the vulnerability. This data flow relationship is expressed as a CodeQL query.
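The data flow described above can be sketched as follows. This uses the class-based `TaintTracking::Configuration` API that was current when the course was published; newer CodeQL releases have moved to a module-based data flow API, so the query may need adapting. The configuration name string is illustrative.

```ql
/**
 * @kind path-problem
 */

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import DataFlow::PathGraph

// Source side: expressions produced by the ntoh* macros.
class NetworkByteSwap extends Expr {
  NetworkByteSwap() {
    exists(MacroInvocation mi |
      mi.getMacroName().regexpMatch("ntoh(s|l|ll)") and
      this = mi.getExpr()
    )
  }
}

class Config extends TaintTracking::Configuration {
  // Illustrative name; any unique string works.
  Config() { this = "NetworkToMemcpyLength" }

  // Taint starts at any byte-swapped network value.
  override predicate isSource(DataFlow::Node source) {
    source.asExpr() instanceof NetworkByteSwap
  }

  // Taint is reported when it reaches the length argument of memcpy.
  override predicate isSink(DataFlow::Node sink) {
    exists(FunctionCall call |
      call.getTarget().getName() = "memcpy" and
      sink.asExpr() = call.getArgument(2)
    )
  }
}

from Config cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink, source, sink, "Network byte swap flows to memcpy"
```

The `@kind path-problem` metadata and the `PathGraph` import let the results view show the full path from each source to each sink, not only the endpoints.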
We added only about 20 lines of code to the previous example. Running the query returns 9 results. According to the course introduction [1], these should correspond to 9 CVE vulnerabilities. Comparing against the descriptions in the CVE database, a rough count shows 6 CVEs that can be identified directly from the results. Readers can try it themselves.
This concludes the U-Boot Challenge course. With about 40 lines of code, a batch of CVE vulnerabilities was uncovered.
Code auditing is not an emerging field. Industry and academia offer many mature tools, such as Fortify SCA, RIPS, and Coverity. Commercial software such as Fortify ships a very complete rule base that can quickly and automatically discover generic security issues. CodeQL is closer to an analysis framework: it lets researchers build more complex security models of the audit target, but it also depends more heavily on researchers having a deep understanding of the target and the underlying technology. Simple vulnerabilities can be found by tools, while more complex ones still need people. As an analysis framework, CodeQL reaches its full potential only when backed by expert experience. This is its shortcoming. As a framework, however, it provides rich APIs and concise semantics, so researchers can quickly validate analysis ideas and discover vulnerabilities that arise in specific scenarios. This high degree of freedom is an advantage that cannot be ignored.
[2] CodeQL principle explanation: How Semmle QL works: https://blog.semmle.com/introduction-to-variant-analysis-part-2/
[3] Thoughts on some issues of CodeQL and detailed explanation of CVE-2019-3560 audit | Lenny’s Blog: https://lenny233.github.io/2020/02/20/codql-and-cve-2019-3560/
[4] Creating CodeQL databases (CodeQL): https://help.semmle.com/codeql/codeql-cli/procedures/create-codeql-database.html
[5] Snapshot Database: https://help.semmle.com/codeql/about-codeql.html#about-codeql-databases
[6] Semmle official tutorial: the use of VSCode plug-ins: https://help.semmle.com/codeql/codeql-for-vscode/procedures/using-extension.html
[7] memcpy - copy memory area - Library Functions | ManKier: https://www.mankier.com/3/memcpy