Learn CodeQL in X Minutes
This article was machine translated and has not yet been proofread by the author, so some of the information may be inaccurate. The author will do his best to come back and revise these articles when he has time. 🥰
For the Chinese version of this article, see here.
CodeQL is a white-box source code auditing tool that organizes code and its metadata in a novel way, allowing researchers to “search the code like a database” and discover security issues in it. Last year GitHub acquired Semmle, the company that developed CodeQL, and together they established GitHub Security Lab. Semmle had previously launched LGTM, a source code analysis platform for the open source community and enterprises, which automatically discovers and reports security issues in open source software on GitHub. Like CodeQL, it remains free for the open source community and developers.
This article starts from the basics of CodeQL and follows the interactive course [^course] published by GitHub Security Lab to show how CodeQL is applied in practice to code auditing: multiple security issues were found in a specific version of the U-Boot source code, and 9 CVE vulnerabilities were located with a single CodeQL query.
How CodeQL works
The overall workflow of CodeQL [^workflow] is shown in the following figure:
A CodeQL query runs against a database, which the Extractor module produces by analyzing the source code and extracting information from it. Once the database is built, we can use CodeQL to explore the source code and find known classes of problems in it.
For compiled languages, CodeQL observes the normal build process when creating a database. Each time a build tool such as make invokes a compiler such as gcc, the extractor module is also run with the same compilation parameters and collects all relevant information about the source code, such as the abstract syntax tree (AST), function and variable types, and preprocessor operations. For interpreted languages there is no compiler to observe, so the extractor obtains similar information by running directly on the source code.
After running the analysis [^cdb] on the codebase with the CodeQL CLI, we get a “snapshot database” [^sdb], which stores a hierarchical representation of the codebase at a specific point in time (the moment the database was created), including the abstract syntax tree (AST), the control flow graph (CFG), and the data flow graph (DFG). In this database, every element of the code, such as a function definition (Function), a function call (FunctionCall), or a macro invocation (MacroInvocation), is an entity that can be retrieved. On this basis, we write CodeQL queries to analyze the code.
Installing the CodeQL environment is not covered here. It is explained in detail in the official tutorial [^tutorial] and in the course this article follows [^course]. After importing the database in VSCode, we can start writing our first CodeQL query.
Let’s take the byte order conversion functions as an example and find the definitions of ntohs, ntohl, and ntohll in the U-Boot codebase.
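A query for this step could look roughly like the following sketch. It assumes the standard cpp library and that the three functions are defined as macros in this codebase; the variable name m, the regular expression, and the message text are illustrative choices rather than the exact listing from the course.

```ql
import cpp

// Find macro definitions whose name is exactly ntohs, ntohl, or ntohll.
// regexpMatch has to match the whole string, so no anchors are needed.
from Macro m
where m.getName().regexpMatch("ntoh(s|l|ll)")
select m, "byte order conversion macro"
```

Listing the three names in a single regular expression keeps the where clause to one condition; three equality tests joined with or would be an equivalent way to write it.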
As the snippet above shows, CodeQL queries follow a structure similar to SQL: from ... where ... select. The difference is that CodeQL adds object-oriented ideas. For example, calling getName() on the query object returns its name, and a further call performs regular expression matching on that name, which gives us the name-matching logic we need. The output of this example is shown in the figure below. The blue code link in each row of the table can be clicked to jump to the place in the U-Boot codebase where the corresponding macro is defined.
Several commonly used query object types:
- Function: function definition or declaration
- FunctionCall: function call
- Macro: macro definition
- MacroInvocation: macro invocation
- Expr: expression
- AssignExpr: assignment expression (a subclass of Expr)
- ConditionalStmt: conditional statement
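As an illustration of querying one of these types, the sketch below looks for macro invocations instead of macro definitions. It assumes the standard cpp library; the regular expression and the message text are again illustrative.

```ql
import cpp

// Find every place where one of the ntoh* macros is invoked,
// as opposed to the place where the macro is defined.
from MacroInvocation mi
where mi.getMacroName().regexpMatch("ntoh(s|l|ll)")
select mi, "invocation of a byte order conversion macro"
```

Only the type in the from clause and the predicate used to read the name change compared with a query over Macro; the from, where, select shape stays the same.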
In the above way, we can use CodeQL to query and retrieve the basic units in the code. Further, we can define classes to encapsulate complex judgment conditions and output more accurate results.
We extend the macro invocation query from the previous example by defining a NetworkByteSwap class that represents the complete set of expressions matching “certain characteristics”. In this example, we restrict the class to the expressions produced by invocations of the ntohs and ntohl macros, and simply enumerate the macro names.
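A class of this kind might be sketched as follows. The class name NetworkByteSwap comes from the text; the body of the characteristic predicate, the regular expression (which also admits ntohll), and the message text are assumptions based on the standard cpp library rather than the original listing.

```ql
import cpp

// The set of expressions that result from expanding an ntoh* macro.
class NetworkByteSwap extends Expr {
  NetworkByteSwap() {
    // An expression belongs to the class if some macro invocation with a
    // matching name expands to it.
    exists(MacroInvocation mi |
      mi.getMacroName().regexpMatch("ntoh(s|l|ll)") and
      this = mi.getExpr()
    )
  }
}

from NetworkByteSwap n
select n, "network byte swap"
```

Wrapping the condition in a class lets later queries write instanceof NetworkByteSwap instead of repeating the exists clause.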
The output of the above CodeQL statement is shown in the figure below.
When we click on the first result returned, we can see that the editor will jump to the file where the expression is located, and select and highlight the entire expression, which is very intuitive.
Now we can use CodeQL to hunt for vulnerabilities!
Recall that in the previous step we defined a NetworkByteSwap class to pick out expressions that call ntohs. Next, we bring in CodeQL’s taint tracking module, mark these expressions as taint sources, and set the sink to the third argument of memcpy. According to the Linux man page [^memcpy], the third argument of memcpy is the length of the data block to copy. Because ntohs converts a value from network byte order, data that passes through ntohs has probably arrived from the network and is likely to be user-controllable. If such a value reaches memcpy as the length, the result is a user-controllable memory operation, and that is where the vulnerabilities come from. Expressed in CodeQL, this data flow relationship looks as follows.
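The sketch below shows one way to write such a query. It uses the class-based TaintTracking::Configuration API that was current when the course was published; newer CodeQL releases favor a module-based configuration, so the imports and the configuration may need adapting. The configuration name, the @id value, and the message strings are hypothetical.

```ql
/**
 * @name Network byte swap flows to memcpy length
 * @kind path-problem
 * @id cpp/example/network-byte-swap-to-memcpy
 */

import cpp
import semmle.code.cpp.dataflow.TaintTracking
import DataFlow::PathGraph

// Source side: expressions produced by expanding an ntoh* macro.
class NetworkByteSwap extends Expr {
  NetworkByteSwap() {
    exists(MacroInvocation mi |
      mi.getMacroName().regexpMatch("ntoh(s|l|ll)") and
      this = mi.getExpr()
    )
  }
}

class Config extends TaintTracking::Configuration {
  Config() { this = "NetworkToMemcpyLength" }

  // Taint starts at any byte-swapped network value.
  override predicate isSource(DataFlow::Node source) {
    source.asExpr() instanceof NetworkByteSwap
  }

  // Taint ends at the length argument of memcpy (third argument, index 2).
  override predicate isSink(DataFlow::Node sink) {
    exists(FunctionCall call |
      call.getTarget().getName() = "memcpy" and
      sink.asExpr() = call.getArgument(2)
    )
  }
}

from Config cfg, DataFlow::PathNode source, DataFlow::PathNode sink
where cfg.hasFlowPath(source, sink)
select sink, source, sink, "Network byte swap flows to memcpy"
```

The path-problem kind and the PathGraph import let the results view show the full path from each source to each sink, which helps when judging whether a reported flow is really exploitable.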
We only added about 20 lines of code to the previous example. Running it returns 9 results. According to the course introduction [^course], these should correspond to 9 CVE vulnerabilities. With my limited knowledge, and going by the descriptions in the CVE database, a rough count gave me 6 CVE vulnerabilities that can be recognized directly. Readers can try it for themselves.
This concludes the U-Boot Challenge course: 40 lines of code were enough to dig out a batch of CVE vulnerabilities.
Code auditing is not an emerging field. Industry and academia offer many mature tools, such as Fortify SCA, RIPS, and Coverity. Commercial software such as Fortify ships with a very complete rule base that can quickly and automatically discover generic security issues. CodeQL is closer to an analysis framework: it empowers researchers to build more complex security models of the audit target, but it also depends more heavily on the researcher having a deep understanding of the target and the underlying technology. Simple vulnerabilities can be discovered by tools, while more complex ones still need to be discovered by humans. As an analysis framework, CodeQL reaches its full potential only when backed by expert experience. This is its disadvantage. As a framework, however, it provides rich APIs and concise semantics, so researchers can quickly verify their analysis ideas and discover vulnerabilities that arise in specific scenarios. This high degree of freedom is an advantage that cannot be ignored.
Thinking about some issues of CodeQL and detailed explanation of CVE-2019-3560 audit | Lenny’s Blog