\documentclass[a4paper,12pt]{article}
\usepackage[utf8]{inputenc}
\usepackage[english]{babel}
\usepackage{float}
\usepackage{multicol}
% Images
\usepackage{graphicx}
\def\imgscale{0.20}
% Tables
\usepackage{array}
\renewcommand{\arraystretch}{2}
% Index Module
\usepackage{imakeidx}
\makeindex[intoc,title=Index of classes]
\makeindex[intoc,name=notion,title=Index of notions]
% Fonts And Paragraph Styles
\usepackage[a4paper, total={7.3in, 10.1in}]{geometry}
\setlength\parindent{0pt}
\setlength\parskip{1em}
\renewcommand*{\familydefault}{\sfdefault}
\usepackage[T1]{fontenc}
\usepackage[default]{sourcesanspro}
\usepackage{sfmath}
\clubpenalty=10000
\widowpenalty=10000
\usepackage{dingbat}
% No-Line-Break Sections That Hold The Listings Together
\newenvironment{nobr}{\begin{minipage}{\textwidth}\setlength\parskip{1em}
}{\end{minipage}\ignorespacesafterend}
% Fancy Enumerations
\usepackage{enumitem}
\setlist{topsep=-3pt}
% Colorful Section Styles
\setcounter{section}{-1}
\usepackage[usenames,dvipsnames]{color}
\definecolor{Section}{RGB}{0, 64, 128}
\definecolor{SectionNumber}{RGB}{255, 255, 255}
\usepackage{titlesec}
\usepackage{needspace}
\titleformat{\section}
{\nopagebreak\parskip0.2em\color{Section}\titlerule\normalfont\Large\bfseries}
{\parindent-2em\fcolorbox{Section}{Section}{\color{SectionNumber}\quad\thesection.\quad}}
{1em}{}
\titleformat{\subsection}
{\nopagebreak\parskip0.2em\color{Section}\titlerule\normalfont\large\bfseries}
{\fcolorbox{Section}{Section}{\color{SectionNumber}\ \ \thesubsection.\ \ }\nopagebreak}
{1em}{}
\titleformat{\subsubsection}
{\nopagebreak\parskip0.2em\color{Section}\titlerule\normalfont\bfseries}
{\fcolorbox{Section}{Section}{\color{SectionNumber}\thesubsubsection.}\nopagebreak}
{1em}{}
% Fancy Listings
\usepackage{listings}
\definecolor{InlineListing}{RGB}{0, 64, 128}
\lstset{basicstyle=\ttfamily\color{InlineListing}}
\definecolor{Command}{RGB}{0, 0, 0}
\definecolor{CommandOutput}{RGB}{128, 128, 128}
\definecolor{Console}{RGB}{240, 240, 240}
\definecolor{Executable}{RGB}{0, 96, 0}
\definecolor{Prompt}{RGB}{0, 64, 128}
\definecolor{Rule}{RGB}{192, 192, 192}
\lstdefinestyle{commandline}{
aboveskip=1.0em,
backgroundcolor=\color{Console},
basicstyle=\ttfamily\color{Prompt}\footnotesize,
belowskip=0.0em,
breaklines=false,
captionpos=b,
emptylines=1,
frame=single,
moredelim=**[is][\color{Command}]{@}{@},
moredelim=**[is][\color{Executable}]{@@}{@@},
moredelim=**[is][\color{CommandOutput}]{@@@}{@@@},
rulecolor=\color{Rule},
xleftmargin=1.9em,
xrightmargin=0.3em,
}
\definecolor{Background}{RGB}{240, 240, 240}
\definecolor{Code}{RGB}{0, 0, 0}
\definecolor{Comment}{RGB}{0, 128, 128}
\definecolor{Keyword}{RGB}{0, 0, 128}
\definecolor{String}{RGB}{192, 0, 192}
\lstdefinestyle{cplusplus}{
aboveskip=1.5em,
backgroundcolor=\color{Background},
basicstyle=\ttfamily\color{Code}\footnotesize,
belowskip=0.0em,
breaklines=false,
captionpos=b,
commentstyle=\color{Comment},
emptylines=1,
frame=single,
keywordstyle=\color{Keyword},
language=C++,
numbers=left,
rulecolor=\color{Rule},
showstringspaces=false,
stringstyle=\color{String},
xleftmargin=1.9em,
xrightmargin=0.3em,
}
% Hyper Reference Styles
\definecolor{Url}{RGB}{0, 64, 128}
\usepackage{hyperref}
\hypersetup{
colorlinks,
linkcolor={Url},
citecolor={Url},
urlcolor={Url}
}
% Fancy Table Of Contents
\usepackage{tocloft}
\renewcommand\cftbeforesecskip{10pt}
\renewcommand\cftbeforesubsecskip{0pt}
\renewcommand\cftsubsecafterpnum{\vskip0pt}
\begin{document}
\begin{center}
\hrule\bigskip\bigskip
{\Huge\textsc{Clang\ \ Static\ \ Analyzer}}
\bigskip
{\Large A\ \ Checker\ \ Developer's\ \ Guide}
\bigskip
{Rev. --- \today}
\bigskip\bigskip\hrule
\end{center}
\newpage
\tableofcontents
\newpage
\section{Preface}
The early draft of this document was composed mostly by Artem Dergachev during his work for the Samsung Research \& Development institute in Moscow. This document is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit \url{http://creativecommons.org/licenses/by/4.0/} or send a letter to Creative Commons, PO Box 1866, Mountain View, CA 94042, USA.
{\begin{center}\includegraphics{license.png}\end{center}}
The author is grateful to everybody who contributed to this guide by finding mistakes and omissions and giving valuable suggestions (a list that will probably grow as new revisions of this guide are released), in particular to Alexey Sidorin, Julia Trofimovich, and Kirill Romanenkov.
The guide is still incomplete in parts, and it will need updates whenever the analyzer core changes, which will inevitably happen from time to time. Additionally, because the author is not a native speaker of English, any suggestions on improving the grammar of the guide are warmly welcome.
Below is a rough to-do list of stuff that is not yet properly explained in the guide, but would be considered useful to have, in no particular order:
\begin{multicols}{2}
\begin{enumerate}
\item Direct and default bindings in the region store.
\item Recent changes in work with live and dead symbols.
\item The \lstinline|WasInlined| attribute of the checker context.
\item The newly introduced \lstinline|CodeSpaceRegion| memory space.
\item The syntax for adding bug reporter visitors to the report.
\item How to read the \lstinline|stderr| dump of the program state.
\item The \lstinline|loc::GotoLabel| value class.
\item How to use the AST parent map.
\item Use \lstinline|check::ASTDecl<>|, probably for the whole translation unit, instead of \lstinline|check::EndOfTranslationUnit| for syntax-only checks.
\item Symbols of structural type.
\item A picture of how super-regions of any memory region usually look.
\item Add a picture of how path-sensitive bug reports look, e.g.\ the HTML ones.
\item Describe more Objective-C-related stuff.
\item Using the \lstinline|SVal| visitor.
\item Writing tests.
\item Coding style needs updating~--- outdated constructs are used.
\end{enumerate}
\end{multicols}
The author welcomes suggestions, bug reports, pull requests, forks, and whatever may come out of it, on GitHub at \url{https://github.com/haonoq/clang-analyzer-guide}!
On the other hand, please do not send analyzer-related questions in private messages or by e-mail! The best place to ask questions is the cfe-dev mailing list, \url{http://lists.llvm.org/pipermail/cfe-dev/}, because other people would see the question and the answer, and probably even be able to find the discussion later through web search.
\newpage
\subsection{FAQ: a quick guide through the guide}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{Okay, so how do I write CSA checkers?}
\item[\textbf{A:}] For a step-by-step quick start guide on coding checkers, see subsection \ref{subsec:coding_intro}.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{Reference guides are boring. I prefer learning by example. Should I keep reading?}
\item[\textbf{A:}] This is not really a reference guide, but rather a free-hand introduction to Clang Static Analyzer. You will encounter code samples and useful snippets on almost every page. We did not try to copy the official Clang doxygen, which you would definitely refer to during your work on CSA checkers. However, we also strongly advise you to search through the official checker source code to find out how different classes and methods are used in practice.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{This guide is quite big. My checker only needs to find calls to a certain function, and it shouldn't be hard. Do I really need to read the whole guide to implement it?}
\item[\textbf{A:}] For finding simple code patterns, an AST matcher would easily do the job. Probably the example in subsection~\ref{subsec:ast_matchers} is all you need to know.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{Now I have a real problem. My program crashes, due to double-close of} \lstinline|FlyingElephantDescriptor|\emph{, once in a few weeks, and I badly want to catch and debug it. Please help!}
\item[\textbf{A:}] You came to the right place! If you want a checker that finds program execution paths on which a certain sequence of events, such as a double-free, occurs, then you need to implement a path-sensitive checker. You would probably need to read section \ref{sec:path_sensitive}, most importantly subsections~\ref{subsec:program_state} and~\ref{subsec:program_state_2}, paying a lot of attention to~\ref{subsubsec:gdm}, and probably look through subsection~\ref{subsec:path_sensitive_callbacks} to find the right callback to hook into.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{So, how do I know if I need a path-sensitive or path-insensitive checker?}
\item[\textbf{A:}] It depends on what information you need the analyzer core to provide. If you want to understand this matter in depth, see section \ref{sec:data_structures}. Most of the time, you'd pick a path-sensitive checker. Only if your check is really simple would you want to rely on AST-based checkers.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{My path-sensitive bug report is too short, I cannot figure out what's going on!}
\item[\textbf{A:}] By default, the analyzer doesn't draw the path through sub-functions that returned before the bug was found. The report only shows the event of the bug, and the decisions that led to it in the same function or in its direct callers. If you need to highlight other events, probably inside sub-functions, then you need to implement a bug report visitor, as described in subsection~\ref{subsec:bug_visitors}.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{Wait a minute, section} \ref{sec:data_structures} \emph{also mentions CFG-based analysis. When do I use this one?}
\item[\textbf{A:}] Almost never, unless you really know what you're doing, and in this case you almost certainly don't need this guide. Even though it'd be great to have a rough idea of what a CFG is and how it looks, most of the time path-sensitive checkers turn out to do the same job much more easily.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{I'm reading the guide randomly, and I've no idea what you mean by ``GDM''.}
\item[\textbf{A:}] You can always refer to the alphabetical index at the end of the guide. In fact, because the index is quite short, you may also read through it to find things you missed. The index of classes also highlights the most useful methods in CSA classes and points to usage examples for each class or method across the guide.
\end{itemize}
\medskip
\end{nobr}
\begin{nobr}
\begin{itemize}
\item[\textbf{Q:}] \emph{Your path-sensitive engine is fantastic! How does it work?}
\item[\textbf{A:}] The easiest way to explain how it works is probably to say ``it constructs an exploded graph''. Even though the whole of section \ref{sec:path_sensitive} is about how the path-sensitive engine works, you may also refer to the explanation of the exploded graph in subsection \ref{subsec:exploded_graph} for a clearer understanding.
\end{itemize}
\medskip
\end{nobr}
\newpage
\section{Introduction to the Clang Static Analyzer}\label{sec:intro}
The Clang compiler, based on the LLVM infrastructure, provides much more than a way to turn your C, C++, or Objective-C code into a binary executable file. Clang lets you reliably hook into the compilation process and obtain exhaustive information about the data structures the compiler generates in each phase of the compilation. In other words, if you want to know more about your program, the compiler is the best person to ask, and Clang is open to answering your questions. Assuming you ask the right questions, that is.
One of the applications for Clang tools is automatically finding defects in programs, providing many more warnings than your compiler would. For instance, the \lstinline|clang-tidy| tool finds style issues and unsafe or potentially unportable constructs by observing the syntax used in the program.
\emph{Clang Static Analyzer} is another tool that finds defects in programs. By exploring the program source code, this particular tool tries to execute parts of the program without compiling them or running the program (as if reading the source code and imagining what would happen if it ran) and reports run-time errors that would occur in such an imaginary run. Because the actual behavior of any real-world program depends on external factors, such as input values, random numbers, and the behavior of library components (for which source code is not always available!), the analyzer engine denotes unknown values with algebraic symbols, and performs symbolic computations based on these symbols. It also discovers conditions on the symbolic values that lead the program towards the error.
As a result, Clang Static Analyzer is capable of finding deep bugs that occur only on rare program paths. These paths might have been missed by manual testers or automated test suites. Upon finding a bug, the analyzer draws the whole path that led to the bug, with jump directions on each conditional statement.
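\begin{nobr}
To make the idea of a rare path concrete, here is a small made-up function; the name \lstinline|sharePerUser| and its logic are purely illustrative and not taken from any real project. Treating \lstinline|Budget| and \lstinline|Users| as unknown symbols, a path-sensitive engine can follow the branch on which \lstinline|Users| equals 3 and conclude that the divisor is zero there, even if no test ever passes that value:
\begin{lstlisting}[style=cplusplus,numbers=none]
// Hypothetical example: returns the budget share per active user.
int sharePerUser(int Budget, int Users) {
  if (Users <= 0)
    return 0;              // nothing to divide among
  int Active = Users;
  if (Users == 3)          // rare input, easy to miss in testing
    Active = Users - 3;    // Active is 0 on this path only
  return Budget / Active;  // division by zero when Users == 3
}
\end{lstlisting}
\end{nobr}
Ordinary inputs such as \lstinline|sharePerUser(100, 4)| behave as expected, which is exactly why a defect of this kind can survive testing.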
However, the analyzer can only find bugs that it has been specifically engineered to find. Otherwise, upon encountering the problem, the analysis runs further and doesn't notice anything. For every particular kind of defect the analyzer finds, such as a null pointer dereference or a buffer overflow, there is a special module, a \emph{checker}, that reacts to such defects during analysis.
So, essentially, the analyzer core is responsible for executing the program in a symbolic manner, and checkers subscribe to events they're interested in, check various assumptions on symbolic values at these events, and throw warnings if these assumptions are found to fail on the given path.
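\begin{nobr}
This division of labour can be illustrated with a deliberately tiny model. Nothing below is Clang API: \lstinline|ToyEngine|, \lstinline|ToyCallEvent|, and \lstinline|registerToyMainCallChecker| are invented names, and the ``path'' is just a list of calls rather than a symbolic execution. The sketch only shows the shape of the relationship, in which the engine walks a path and notifies every subscribed checker:
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Toy stand-ins for illustration; these are NOT Clang classes.
struct ToyCallEvent {
  std::string Callee; // name of the function called on this path
  int Line;           // source line of the call
};

class ToyEngine {
  using Callback = std::function<void(const ToyCallEvent &)>;
  std::vector<Callback> PreCallSubscribers;
  std::vector<std::string> Warnings;

public:
  // A checker subscribes by handing the engine a callback.
  void subscribePreCall(Callback CB) {
    PreCallSubscribers.push_back(std::move(CB));
  }
  void warn(const std::string &Message) { Warnings.push_back(Message); }

  // The engine walks one path and notifies every subscriber per call.
  void run(const std::vector<ToyCallEvent> &Path) {
    for (const ToyCallEvent &Call : Path)
      for (const Callback &CB : PreCallSubscribers)
        CB(Call);
  }
  const std::vector<std::string> &warnings() const { return Warnings; }
};

// A toy checker: warns about any call whose callee is named "main".
void registerToyMainCallChecker(ToyEngine &Engine) {
  Engine.subscribePreCall([&Engine](const ToyCallEvent &Call) {
    if (Call.Callee == "main")
      Engine.warn("Call to main on line " + std::to_string(Call.Line));
  });
}
\end{lstlisting}
\end{nobr}
In the real analyzer the events carry symbolic values and program states, not plain strings, but the subscription pattern is the part worth keeping in mind.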
It means that you may want to not only use the analyzer to find defects, but also adapt it to your particular project. For instance, you may want to enforce rules specific to your project, or find misuses of a specific library API you are using. In order to do that, you may find yourself wanting to write a new checker module for the analyzer. And no matter how easy it may be (because Clang Static Analyzer is a very easy tool), this guide should be able to help you.
\begin{nobr}
\subsection{MainCallChecker --- a simple tutorial checker}
For a quick start, we shall write a simple, though probably not very useful, static analyzer checker. The checker would find violations of the following rule defined by the C++ standard:
\qquad\textbf{basic.start.main.3:} The function \lstinline|main| shall not be used within a program.
\end{nobr}
In other words, the \lstinline|main()| function cannot be recursive; the program should never call \lstinline|main()|, otherwise behavior is undefined. Finding such a defect sounds easy at first glance: just see if there's a function call in the program whose callee is named ``main''. Well, not in real life. The programmer may put a pointer to \lstinline|main| into a variable, pass this pointer around, and then accidentally call a function through that pointer, which happens to point to \lstinline|main|:
\begin{nobr}
\begin{lstlisting}[style=cplusplus,title=\lstinline|Example_Test.c|,numbers=none]
typedef int (*main_t)(int, char **);
int main(int argc, char **argv) {
  main_t foo = main;
  int exit_code = foo(argc, argv); // actually calls main()!
  return exit_code;
}
\end{lstlisting}
\end{nobr}
So even in this simple case, the analyzer's path-sensitive engine has an advantage over a simple syntax-based check. Let's see if we can detect the error in \lstinline|Example_Test.c| with the help of the static analyzer.
All right, you got me: even putting the pointer to \lstinline|main| into a variable already means that \lstinline|main| was ``used'' within the program. But for educational purposes, we shall detect the point where it is actually called.
\begin{nobr}
First, let us add a definition of the checker to the list of checkers. Open up \lstinline|lib/StaticAnalyzer/Checkers/Checkers.td| in the clang source tree, and add a simple description of the checker somewhere in, say, the \lstinline|alpha.core| package of checkers:
\begin{lstlisting}[style=commandline,title=\lstinline|@Checkers.td@|]
@@@...@@@
  HelpText<"Check for assignment of a fixed address to a pointer">,
  DescFile<"FixedAddressChecker.cpp">;
@@def MainCallChecker : Checker<"MainCall">,@@
@@  HelpText<"Check for calls to main">,@@
@@  DescFile<"MainCallChecker.cpp">;@@
def PointerArithChecker : Checker<"PointerArithm">,
  HelpText<"Check for pointer arithmetic on locations other than array elements">,
  DescFile<"PointerArithChecker">;
@@@...@@@
\end{lstlisting}
\end{nobr}
\begin{nobr}
After re-compiling Clang, this would make the checker appear in the list of checkers:
\begin{lstlisting}[style=commandline]
~ $ @clang -cc1 -analyzer-checker-help@
OVERVIEW: Clang Static Analyzer Checkers List
USAGE: -analyzer-checker <CHECKER or PACKAGE,...>
CHECKERS:
@@@...@@@
alpha.core.FixedAddr Check for assignment of a fixed address to a p
ointer
alpha.core.IdenticalExpr Warn about unintended use of identical express
ions in operators
@@alpha.core.MainCall Check for calls to main@@
alpha.core.PointerArithm Check for pointer arithmetic on locations othe
r than array elements
alpha.core.PointerSub Check for pointer subtractions on two pointers
pointing to different memory chunks
@@@...@@@
\end{lstlisting}
\end{nobr}
In this example, ``\lstinline|alpha.core.MainCall|'' is the name of the checker in the registry, as shown in the help output above. Once the checker is registered, it can be enabled via CSA command-line options by the given name, and the relevant short description line also appears in the analyzer checker help. ``\lstinline|alpha.core|'' is the package the checker belongs to. For example, \lstinline|-analyzer-checker alpha.core| would enable all checkers in the \lstinline|alpha.core| package.
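\begin{nobr}
As a sketch of both forms, assuming a Clang build that already contains the checker and the \lstinline|Example_Test.c| file shown earlier, the first command below would enable only our checker, while the second would enable the whole package:
\begin{lstlisting}[style=commandline]
~ $ @clang -cc1 -analyze -analyzer-checker=alpha.core.MainCall Example_Test.c@
~ $ @clang -cc1 -analyze -analyzer-checker=alpha.core Example_Test.c@
\end{lstlisting}
\end{nobr}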
\begin{nobr}
Then, add the checker source file to the \lstinline|lib/StaticAnalyzer/Checkers/CMakeLists.txt| file, so that it gets compiled on the next rebuild of Clang:
\begin{lstlisting}[style=commandline,title=\lstinline|@CMakeLists.txt@|]
@@@...@@@
LocalizationChecker.cpp
MacOSKeychainAPIChecker.cpp
MacOSXAPIChecker.cpp
@@MainCallChecker.cpp@@
MallocChecker.cpp
MallocOverflowSecurityChecker.cpp
@@@...@@@
\end{lstlisting}
\end{nobr}
\begin{nobr}
Finally, write some code in \lstinline|lib/StaticAnalyzer/Checkers/MainCallChecker.cpp| that we will soon explain:
\begin{lstlisting}[style=cplusplus,title=\lstinline|MainCallChecker.cpp|]
#include "ClangSACheckers.h"
#include "clang/StaticAnalyzer/Core/BugReporter/BugType.h"
#include "clang/StaticAnalyzer/Core/Checker.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/CallEvent.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"

using namespace clang;
using namespace clang::ento;

namespace {
class MainCallChecker : public Checker<check::PreCall> {
  mutable std::unique_ptr<BugType> BT;

public:
  void checkPreCall(const CallEvent &Call, CheckerContext &C) const;
};
}

void MainCallChecker::checkPreCall(const CallEvent &Call,
                                   CheckerContext &C) const {
  if (const IdentifierInfo *II = Call.getCalleeIdentifier())
    if (II->isStr("main")) {
      if (!BT)
        BT.reset(new BugType(this, "Call to main", "Example checker"));
      ExplodedNode *N = C.generateErrorNode();
      auto Report = llvm::make_unique<BugReport>(*BT, BT->getName(), N);
      C.emitReport(std::move(Report));
    }
}

void ento::registerMainCallChecker(CheckerManager &Mgr) {
  Mgr.registerChecker<MainCallChecker>();
}
\end{lstlisting}
\end{nobr}
\begin{nobr}
After compiling clang, you should be able to run the static analyzer, enable the checker, and see the warning:
\begin{lstlisting}[style=commandline]
~ $ @clang -cc1 -analyze -analyzer-checker=alpha.core Example_Test.c@
Example_Test.c:4:19: warning: Call to main
  int exit_code = foo(argc, argv); // actually calls main()!
                  @@^~~~~~~~~~~~~~~@@
1 warning generated.
\end{lstlisting}
\end{nobr}
\subsection{Checker example code explained}\label{subsec:coding_intro}
Now let us figure out how \lstinline|MainCallChecker| works internally. \lstinline|MainCallChecker| is a \emph{path-sensitive} check\-er: it can detect how values flow through variables on different program paths, and understand which execution paths are taken based on these values. We have already demonstrated this in \lstinline|Example_Test.c|, where we call a function after storing a pointer to this function in a variable \lstinline|foo|.
\begin{nobr}
\subsubsection{Declaring a checker class}
A CSA checker is implemented by inheriting from a class template \lstinline|Checker<...>|\index{Checker|textbf}, in which template parameters indicate the list of callbacks to which the checker subscribes:
\begin{lstlisting}[style=cplusplus,firstnumber=10]
namespace {
class MainCallChecker : public Checker<check::PreCall> {
  mutable std::unique_ptr<BugType> BT;

public:
  void checkPreCall(const CallEvent &Call, CheckerContext &C) const;
};
}
\end{lstlisting}
\end{nobr}
Checker class definitions are usually put into anonymous namespaces to avoid name collisions upon loading multiple checkers into the analyzer.
\lstinline|MainCallChecker| subscribes to the \lstinline|check::PreCall| event. The \lstinline|checkPreCall(...)| callback\index{Checker!check::PreCall} defined inside the checker will be called every time the path-sensitive engine of the analyzer encounters a function call and is about to analyze~it.
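\begin{nobr}
Since the template parameter list may name more than one event, a checker can subscribe to several callbacks at once. The sketch below is an assumption-laden illustration rather than code from the Clang tree: \lstinline|CallTraceChecker| is a hypothetical name, the callback bodies are left empty, and registration is omitted. It relies on \lstinline|check::PostCall| and \lstinline|checkPostCall(...)| being the post-call counterparts of the pre-call event, as in the Clang version this guide targets:
\begin{lstlisting}[style=cplusplus,numbers=none]
#include "ClangSACheckers.h"
#include "clang/StaticAnalyzer/Core/Checker.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/CallEvent.h"
#include "clang/StaticAnalyzer/Core/PathSensitive/CheckerContext.h"

using namespace clang;
using namespace clang::ento;

namespace {
// Hypothetical checker that is notified both before and after
// the analyzer core models a call.
class CallTraceChecker
    : public Checker<check::PreCall, check::PostCall> {
public:
  void checkPreCall(const CallEvent &Call, CheckerContext &C) const {}
  void checkPostCall(const CallEvent &Call, CheckerContext &C) const {}
};
}
\end{lstlisting}
\end{nobr}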
\begin{nobr}
\subsubsection{Implementing checker callbacks}
Now let us look at the implementation of the \lstinline|checkPreCall(...)| callback:
\begin{lstlisting}[style=cplusplus,firstnumber=19]
void MainCallChecker::checkPreCall(const CallEvent &Call,
                                   CheckerContext &C) const {
  if (const IdentifierInfo *II = Call.getCalleeIdentifier())
    if (II->isStr("main")) {
      if (!BT)
        BT.reset(new BugType(this, "Call to main", "Example checker"));
      ExplodedNode *N = C.generateErrorNode();
      auto Report = llvm::make_unique<BugReport>(*BT, BT->getName(), N);
      C.emitReport(std::move(Report));
    }
}
\end{lstlisting}
\end{nobr}
The \lstinline|CallEvent| structure\index{CallEvent} available at the callback contains all the data on the function call event the analyzer core managed to gather for us. In particular, it contains information about the callee function, and values of the arguments.
Because our checker is path-sensitive, this information is a lot more than you may obtain by looking at the syntax tree. In particular, it may know the callee identifier even if a function is called through a function pointer, because the analyzer core has predicted the value of this pointer on this execution path. On line 21, we use this to obtain the \emph{identifier} (\lstinline|IdentifierInfo| structure) for the callee function from the \lstinline|CallEvent| structure. In case we cannot obtain such info, that is, if \lstinline|getCalleeIdentifier()|\index{CallEvent!getCalleeIdentifier()} returns a null pointer, we return from our callback and continue the analysis.
Now, on line 22, we see if the identifier we encountered has the name ``\lstinline|main|''. We are only interested in functions with the name ``\lstinline|main|'', so all further checks are made under this assumption.
\subsubsection{Throwing bug reports}
That's it for the checker logic. What remains is to produce a bug report for the user. For this, we use another object available in our callback, the \lstinline|CheckerContext|\index{CheckerContext|textbf} structure. This structure is a Swiss Army knife that contains various functions checkers can use to obtain information on the analysis and affect the analysis flow.
There's a variable \lstinline|BT|\index{BugType} in the checker, which stores a ``bug type'' for the checker --- a common way to identify bugs belonging to different checkers. A checker may have multiple bug types; they are traditionally stored and re-used inside the checker for performance. On line 24, the checker initializes its bug type structure \lstinline|BT| as a bug called ``Call to main'', within category ``Example checker'', unless it is already initialized.
On line 25, we use \lstinline|CheckerContext| to generate a \emph{sink node}\index[notion]{Exploded Node!Sink Node}\index{CheckerContext!generateSink()}, which means that the program would most likely crash after encountering this defect, and it is pointless to continue the analysis beyond this point. The node itself represents a point in the execution path. It is not necessary for the checker to stop the analysis once it finds a defect, if the defect is not critical.
Finally, on line 26, the checker creates a new \lstinline|BugReport|\index{BugReport|textbf} object. The report is thrown \emph{against} the sink node we generated before. \lstinline|BugReport| also contains the warning message, which in our case coincides with the bug type name.
Then we pass the report back to the \lstinline|CheckerContext| using the \lstinline|emitReport(...)|\index{CheckerContext!emitReport()} method. Reports would be gathered together, de-duplicated, and displayed to the user in the preferred manner.
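\begin{nobr}
One detail the tutorial callback glosses over: in-tree checkers usually test the node returned by \lstinline|generateErrorNode()| before building a report, because, to the best of our knowledge, it may be a null pointer when no new node could be produced. The variant below is a sketch of that defensive style for the same \lstinline|MainCallChecker|; it is not the listing that the line numbers above refer to, and it assumes the same includes and class declaration:
\begin{lstlisting}[style=cplusplus,numbers=none]
void MainCallChecker::checkPreCall(const CallEvent &Call,
                                   CheckerContext &C) const {
  const IdentifierInfo *II = Call.getCalleeIdentifier();
  if (!II || !II->isStr("main"))
    return; // unknown callee, or not a call to main
  if (!BT)
    BT.reset(new BugType(this, "Call to main", "Example checker"));
  ExplodedNode *N = C.generateErrorNode();
  if (!N)
    return; // no node to attach the report to
  C.emitReport(llvm::make_unique<BugReport>(*BT, BT->getName(), N));
}
\end{lstlisting}
\end{nobr}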
\begin{nobr}
\subsubsection{Registering the checker}
Finally, there is a small piece of magic code to actually create the checker instance when the analysis starts. You may use this section to disable certain checkers for whole translation units (e.g., checkers for C++-only defects in plain C files), introduce dependencies between checkers, or set checker options. The code below creates exactly one instance of the \lstinline|MainCallChecker| to feed upon the events yet to unfold:
\begin{lstlisting}[style=cplusplus,firstnumber=31]
void ento::registerMainCallChecker(CheckerManager &Mgr) {
  Mgr.registerChecker<MainCallChecker>();
}
\end{lstlisting}\index{CheckerManager}\index{CheckerManager!registerChecker}
\end{nobr}
\begin{nobr}
\subsection{Compiling the checker as a standalone module}
We have been compiling the checker inside the Clang source tree. However, it is possible to compile the checker as a shared plugin library instead. In this case, you don't need to modify \lstinline|Checkers.td| or \lstinline|CMakeLists.txt| in order to run the checker; instead, you compile the checker as a standalone library, and load it at run time.
\end{nobr}
\begin{nobr}
The syntax for registering the checker changes in the case of compiling as a plugin. You don't need to include the \lstinline|ClangSACheckers.h| header, but instead you include the \lstinline|CheckerRegistry.h| header:
\begin{lstlisting}[style=cplusplus,numbers=none]
#include "clang/StaticAnalyzer/Core/CheckerRegistry.h"
\end{lstlisting}
\end{nobr}
\begin{nobr}
Then we define an externally visible function in our library that would register the checker dynamically in the analyzer's \lstinline|CheckerRegistry|\index{CheckerRegistry|textbf}\index{CheckerRegistry!addChecker()}\index{clang\_registerCheckers()|textbf}:
\begin{lstlisting}[style=cplusplus,numbers=none]
extern "C"
void clang_registerCheckers (CheckerRegistry &registry) {
  registry.addChecker<MainCallChecker>("alpha.core.MainCallChecker",
                                       "Checks for calls to main");
}
\end{lstlisting}
\end{nobr}
\begin{nobr}
The clang API version needs to match the plugin API version. Hence, the checker needs to store its version string in the externally visible \lstinline|clang_analyzerAPIVersionString|\index{clang\_analyzerAPIVersionString|textbf} variable for the purpose of compatibility checking:
\begin{lstlisting}[style=cplusplus,numbers=none]
extern "C" const char clang_analyzerAPIVersionString[] =
    CLANG_ANALYZER_API_VERSION_STRING;
\end{lstlisting}
\end{nobr}
Once you load the checker via the usual clang plugin syntax --- \lstinline|clang -cc1 -load Checker.so| --- it would appear as a normal checker in the \lstinline|-analyzer-checker-help| list, and you should be able to enable it via \lstinline|-analyzer-checker|.
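\begin{nobr}
Put together, a session might look roughly like the one below. This is only a sketch under several assumptions: a Linux-like system, an installed \lstinline|llvm-config| whose \lstinline|--cxxflags| output matches your Clang build, and the hypothetical file names \lstinline|MainCallChecker.cpp| and \lstinline|MainCallChecker.so|. The exact build flags depend on your LLVM installation; the checker name is the one passed to \lstinline|addChecker| above:
\begin{lstlisting}[style=commandline]
~ $ @clang++ -shared -fPIC $(llvm-config --cxxflags) \@
@      MainCallChecker.cpp -o MainCallChecker.so@
~ $ @clang -cc1 -load ./MainCallChecker.so -analyze \@
@      -analyzer-checker=alpha.core.MainCallChecker Example_Test.c@
\end{lstlisting}
\end{nobr}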
\subsection{Further reading}
The quick introduction in this section covers only a small part of the Clang Static Analyzer capabilities. There are various ways of learning how to develop CSA checkers efficiently.
Further sections of this guide should give you an idea of how the analyzer works and what sort of information is available for the checkers to use, and cover various technologies and tricks involved in creating a checker.
However, while developing CSA checkers, you would also inevitably consult the \emph{official LLVM\footnote{\url{http://llvm.org/doxygen}} and Clang\footnote{\url{http://clang.llvm.org/doxygen}} documentation}, which contains exhaustive information on classes, functions, and data structures you would regularly encounter.
The CSA website contains a quick-start checker development manual\footnote{\url{http://clang-analyzer.llvm.org/checker_dev_manual.html}}. We also highly recommend the presentation video by CSA developers, ``Building a Checker in 24 hours''\footnote{\url{http://llvm.org/devmtg/2012-11/videos/Zaks-Rose-Checker24Hours.mp4}}, which describes how a slightly more complicated path-sensitive checker works.
\newpage
\section{Kinds of analyses and program representations}\label{sec:data_structures}
The first decision you usually need to make when you create a checker is whether you need path-sensitivity to implement the desired check. Alternatively, you may implement your check by exploring the program on the syntax level.
Path-sensitive analysis is usually many times slower than compilation. However, most of the time is taken by the analyzer core to construct the necessary data structures; the checkers are usually lightweight, unless some extremely heavy calculations were explicitly required. So if you are already running at least one path-sensitive checker, then adding another path-sensitive checker would not make the analysis significantly slower.
On the contrary, syntax-only analysis is usually as fast as compilation, or even faster, because code generation doesn't take place. However, syntax-level analysis does not gather enough information for most checks.
The easiest way to understand how different kinds of analyses compare to each other is to see how the program is represented from the point of view of each kind of analysis.
\begin{nobr}
As an example, let us construct the \emph{abstract syntax tree}, the \emph{control flow graph}, and the path-sensitive \emph{exploded graph} for a simple function \lstinline|foo(...)|, which we put into a file called \lstinline|test.c| for further reference:
\begin{lstlisting}[style=cplusplus,title=\lstinline|test.c|]
void foo(int x) {
  int y, z;
  if (x == 0)
    y = 5;
  if (!x)
    z = 6;
}
\end{lstlisting}
\end{nobr}
\subsection{Abstract syntax tree}
Clang \emph{abstract syntax tree}\index[notion]{Abstract Syntax Tree|textbf} (AST)\index[notion]{AST|see{Abstract Syntax Tree}} is the structure produced by the \emph{compiler} frontend; it serves as the \emph{intermediate representation} of the program used by Clang. Binary code is generated from the AST. Unlike the AST of the GCC C/C++ compiler, the Clang AST contains not only the minimal information necessary to compile the program correctly, but also complete information about the program source code: each element of the tree remembers its source locations and how exactly it was written in the source, even before preprocessing took place. This makes the AST itself the easiest framework to use for source-level analysis.
Below is a command-line dump of the abstract syntax tree for \lstinline|test.c|:
\begin{nobr}
\begin{lstlisting}[style=commandline]
~ $ @clang -cc1 -ast-dump test.c@ @@@
@@@TranslationUnitDecl@@@ <<invalid sloc>> <invalid sloc>
`-@@@FunctionDecl@@@ <test.c:1:1, line:7:1> line:1:6 @foo@ '@@void (int)@@'
  |-@@@ParmVarDecl@@@ 0x3625c60 <col:10, col:14> col:14 used @x@ '@@int@@'
  `-@@@CompoundStmt@@@ <col:17, line:7:1>
    |-@@@DeclStmt@@@ <line:2:3, col:11>
    | |-@@@VarDecl@@@ 0x3625de0 <col:3, col:7> col:7 used @y@ '@@int@@'
    | `-@@@VarDecl@@@ 0x3625e50 <col:3, col:10> col:10 used @z@ '@@int@@'
    |-@@@IfStmt@@@ <line:3:3, line:4:9>
    | |-<<<NULL>>>
    | |-@@@BinaryOperator@@@ <line:3:7, col:12> '@@int@@' '@==@'
    | | |-@@@ImplicitCastExpr@@@ <col:7> '@@int@@' <LValueToRValue>
    | | | `-@@@DeclRefExpr@@@ <col:7> '@@int@@' lvalue ParmVar 0x3625c60 '@x@' '@@int@@'
    | | `-@@@IntegerLiteral@@@ <col:12> '@@int@@' @0@
    | |-@@@BinaryOperator@@@ <line:4:5, col:9> '@@int@@' '@=@'
    | | |-@@@DeclRefExpr@@@ <col:5> '@@int@@' lvalue Var 0x3625de0 '@y@' '@@int@@'
    | | `-@@@IntegerLiteral@@@ <col:9> '@@int@@' @5@
    | `-<<<NULL>>>
    `-@@@IfStmt@@@ <line:5:3, line:6:9>
      |-<<<NULL>>>
      |-@@@UnaryOperator@@@ <line:5:7, col:8> '@@int@@' prefix '@!@'
      | `-@@@ImplicitCastExpr@@@ <col:8> '@@int@@' <LValueToRValue>
      |   `-@@@DeclRefExpr@@@ <col:8> '@@int@@' lvalue ParmVar 0x3625c60 '@x@' '@@int@@'
      |-@@@BinaryOperator@@@ <line:6:5, col:9> '@@int@@' '@=@'
      | |-@@@DeclRefExpr@@@ <col:5> '@@int@@' lvalue Var 0x3625e50 '@z@' '@@int@@'
      | `-@@@IntegerLiteral@@@ <col:9> '@@int@@' @6@
      `-<<<NULL>>>
\end{lstlisting}
\end{nobr}
Reading the AST is similar to reading the original program, annotated to display the semantics that weren't necessarily instantly obvious from the raw source code. For instance, you may understand that variable \lstinline|x| on line 3, on which the \lstinline|if| statement argument depends, is actually a parameter of \lstinline|foo(...)|, while variable~\lstinline|y| referenced on line 4 is a local variable declared on line 2 together with \lstinline|z|.
However, while constructing the AST, the compiler does not try to understand and model what exactly is going on in the program. It does not construct different execution paths through the branch statements, or try to predict how different branches interact.
The best thing you can do with the AST is to detect unwanted \emph{code patterns}. For example, if you want to avoid C-style casts in your C++ project, you can easily create an AST-based checker that warns on all C-style casts. A slightly more complicated example would be ensuring that the return value of some function is always checked for errors.
The \lstinline|security.InsecureAPI| family of checkers may serve as a good example of AST-based checkers in the default distribution of CSA.
However, suppose you try to catch, say, divisions by zero with an AST-based checker. It might be easy to find code patterns like ``\lstinline|y = x / 0|'', but it would be much harder to find code like ``\lstinline|z = 0; ...; y = x / z;|''. For such checks, a more powerful approach is necessary.
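The difference between the two patterns can be illustrated with a toy model. The sketch below does not use the Clang AST; \lstinline|Expr|, \lstinline|literal|, \lstinline|var|, \lstinline|divide| and \lstinline|isDivisionByLiteralZero| are hypothetical names introduced only for this illustration. It shows a purely syntactic check that recognizes a literal zero divisor, but has no way of knowing the value of a variable divisor.
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <memory>
#include <string>

// Toy expression tree; an illustration only, NOT the Clang AST.
struct Expr {
  enum Kind { Literal, VarRef, Div };
  Kind kind = Literal;
  int value = 0;                   // used by Literal
  std::string name;                // used by VarRef
  std::shared_ptr<Expr> lhs, rhs;  // used by Div
};
using ExprPtr = std::shared_ptr<Expr>;

ExprPtr literal(int V) {
  auto E = std::make_shared<Expr>();
  E->kind = Expr::Literal;
  E->value = V;
  return E;
}

ExprPtr var(const std::string &Name) {
  auto E = std::make_shared<Expr>();
  E->kind = Expr::VarRef;
  E->name = Name;
  return E;
}

ExprPtr divide(ExprPtr L, ExprPtr R) {
  auto E = std::make_shared<Expr>();
  E->kind = Expr::Div;
  E->lhs = L;
  E->rhs = R;
  return E;
}

// Purely syntactic check: fires only if the divisor is the literal 0.
bool isDivisionByLiteralZero(const Expr &E) {
  return E.kind == Expr::Div && E.rhs &&
         E.rhs->kind == Expr::Literal && E.rhs->value == 0;
}
\end{lstlisting}
With these definitions, \lstinline|isDivisionByLiteralZero(*divide(var("x"), literal(0)))| holds, while the same check on \lstinline|divide(var("x"), var("z"))| does not, no matter what was assigned to \lstinline|z| earlier.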
\subsection{Control flow graph}
Clang \emph{control flow graph}\index[notion]{Control Flow Graph|textbf} (CFG\index[notion]{CFG|see{Control Flow Graph}}) is a representation, using graph notation, of all paths that might be traversed through a program during its execution, judging by its syntax alone. The CFG is constructed separately for every function body. Each node of the CFG represents a \emph{basic block}\index[notion]{Control Flow Graph!Basic Block|textbf} of statements that do not contain any branch statements, and are therefore executed sequentially. Each basic block ends with a \emph{terminator statement}\index[notion]{Control Flow Graph!Terminator|textbf}, which is a branch statement or a return from the function. Outgoing edges connect the basic block to other blocks that may be reached depending on the run-time value of the terminator branch condition.
\begin{nobr}
CSA provides an easy way of dumping the control flow graph:
\begin{lstlisting}[style=commandline]
~ $ @clang -cc1 -analyze -analyzer-checker=debug.ViewCFG test.c@ @@@
Writing '/tmp/CFG-02fc89.dot'... done.
Running 'xdot.py' program... done.@@@
\end{lstlisting}
\end{nobr}
Figure \ref{fig:cfg} shows the control flow graph for \lstinline|test.c|, simplified for easier reading.
\begin{figure}[!ht]\center
\includegraphics[scale=\imgscale]{cfg.pdf}
\caption{Simplified control flow graph for \lstinline|test.c|.}
\label{fig:cfg}
\end{figure}
CFG-based analysis is useful for creating safe checks, for which it is necessary to consider all possible program paths. For example, if you want to ensure that a certain branch condition always evaluates to \lstinline|false|, and thus the code below it is ``dead'', then you'd probably have no choice but to \emph{reach definitions} of all variables referenced inside the condition expression, and CFG would be the right tool for such analysis.
The CFG is constructed relatively easily from the AST. However, CFG-based analysis is often difficult to implement, because the CFG alone does not provide data flow analysis; additional coding is required to achieve that. The Clang framework provides some ready-made CFG-based solutions for checkers ``out of the box'', such as \lstinline|LivenessAnalysis|\index{LivenessAnalysis}.
The \lstinline|deadcode.DeadStores| checker is a good example of a CFG-based checker in the default distribution of CSA.
Sometimes CFG-based analysis is used in combination with path-sensitive analysis: the path-sensitive part of the checker finds a potential defect location, and a CFG-based heuristic then inspects other paths to or from the defect in order to improve the true positive rate. The official \lstinline|deadcode.UnreachableCode| checker is an example of combining path-sensitive and CFG-based analysis.
However, an attentive reader would instantly find a flaw in Figure \ref{fig:cfg} that makes CFG-based analysis less useful. By simply looking at the CFG, you would not be able to figure out that once the \lstinline|true| branch is taken in basic block \lstinline|[B4]|, the \lstinline|true| branch is also inevitably taken in basic block \lstinline|[B2]|, and vice versa, because the branching conditions of these blocks are related. In fact, depending on the initial value of \lstinline|x|, there are only two ways of reaching the exit block \lstinline|[B0]| from the entry block \lstinline|[B5]|: the program either goes through both \lstinline|[B3]| and \lstinline|[B1]|, when \lstinline|x == 0|, or goes through neither of them otherwise. This limitation significantly reduces the precision of CFG-based analysis.
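To make this limitation concrete, one can count the paths that the CFG alone admits. The following sketch is a toy model rather than Clang code: it hard-codes the successor lists of the basic blocks of \lstinline|test.c| as described above, and the names \lstinline|CFG|, \lstinline|buildTestCFG| and \lstinline|countPaths| are hypothetical.
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <map>
#include <vector>

// Successors of each basic block, numbered as in the CFG of test.c.
using CFG = std::map<int, std::vector<int>>;

CFG buildTestCFG() {
  return {
      {5, {4}},     // entry block
      {4, {3, 2}},  // if (x == 0): true -> B3, false -> B2
      {3, {2}},     // y = 5
      {2, {1, 0}},  // if (!x): true -> B1, false -> B0
      {1, {0}},     // z = 6
      {0, {}},      // exit block
  };
}

// Counts the paths from block From to the exit block B0.
// Works only for acyclic graphs such as this one.
int countPaths(const CFG &G, int From) {
  if (From == 0)
    return 1;
  int N = 0;
  for (int Succ : G.at(From))
    N += countPaths(G, Succ);
  return N;
}
\end{lstlisting}
Here \lstinline|countPaths(buildTestCFG(), 5)| yields four paths, although only two of them can actually be executed; the graph itself carries no information that would rule out the other two.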
\subsection{Exploded graph}\label{subsec:exploded_graph}
CSA \emph{exploded graph}\index[notion]{Exploded Graph|textbf}\index{ExplodedGraph} is the basic data structure of the path-sensitive static analyzer engine. The analyzer core tries to ``interpret'' the program code, and treats different paths through the CFG separately, even if they pass through the same statements or basic blocks, hence the term ``exploded''. The exploded graph consists of all paths through the CFG that were explored by the analyzer engine, and carries information regarding the program state\index[notion]{Exploded Node!Program State}\index{ProgramState} on each path at every statement. Nodes of the graph, referred to as \emph{exploded nodes}\index[notion]{Exploded Node}\index{ExplodedNode}, are pairs composed of the state of the program and the program point\index[notion]{Exploded Node!Program Point}\index{ProgramPoint} currently being analyzed.
\begin{nobr}
You can display the complete exploded graph for every analysis pass by turning on the special checker called \lstinline|debug.ViewExplodedGraph|:
\begin{lstlisting}[style=commandline]
~ $ @clang -cc1 -analyze -analyzer-checker=debug.ViewExplodedGraph test.c@ @@@
Writing '/tmp/ExprEngine-0528e9.dot'... done.
Running 'xdot.py' program... done.@@@
\end{lstlisting}
\end{nobr}
Exploded graphs are often very large. The exploded graph of \lstinline|test.c| generated by CSA has over 50 nodes, and is too large to include in this document. Still, Figure \ref{fig:expgraph} should give you a rough idea of what it looks like.
\begin{figure}[!ht]\center
\includegraphics[scale=\imgscale]{expgraph.pdf}
\caption{Extremely simplified exploded graph for \lstinline|test.c|.}
\label{fig:expgraph}
\end{figure}
Let us see how path-sensitive analysis goes on inside \lstinline|test.c|. The analyzer starts with emulating the first operation, namely the comparison operator \lstinline|x == 0|. Because value of \lstinline|x| is unknown at this point of the analysis (and, in fact, will never be known), this value is represented as a \emph{symbol}\index[notion]{Symbolic Value!Symbolic Expression}\index{SymExpr}\index{SymExpr!SymbolRegionValue} \lstinline|reg_$0<x>|. This symbol is to be understood as ``the value stored at the memory region of parameter variable \lstinline|x| at the beginning of the analysis''.
Once the comparison statement is emulated, we reach the terminator of our CFG\index[notion]{Control Flow Graph!Terminator}, namely, the \lstinline|if| statement. Depending on the terminator condition, we jump to another CFG block\index[notion]{Control Flow Graph!Basic Block}, either \lstinline|[B3]| or \lstinline|[B2]|. Because we are unsure which branch we take, we split the exploded graph into the two possible paths. On each path, the new node is created by \emph{assuming} that symbol \lstinline|reg_$0<x>| takes values from a certain \emph{range}\index[notion]{Exploded Node!Program State!Range Constraint|textbf}: on the \lstinline|true| branch, it is assumed\index[notion]{Exploded Node!Program State!Assumption|see{Range Constraint}} to be an integer in range $[0, 0]$ (that is, equal to $0$), and on the \lstinline|false| branch, it is assumed to belong to $[-2147483648, -1]\cup[1, 2147483647]$. The range assumed on the symbol stays inside all nodes branching off the current node, unless the symbol itself is no longer referenced anywhere in the node and gets \emph{garbage-collected}\index[notion]{Garbage Collection}.
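The way a range is split by an assumption can be sketched with a small toy model. The code below is not the analyzer's own range constraint implementation; \lstinline|Range|, \lstinline|RangeSet|, \lstinline|assumeEqual|, \lstinline|assumeNotEqual| and \lstinline|fullIntRange| are hypothetical names, and the sketch only covers comparison of a symbol with a constant.
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <climits>
#include <utility>
#include <vector>

// A toy set of closed integer ranges; NOT the analyzer's own class.
using Range = std::pair<long long, long long>;
using RangeSet = std::vector<Range>;

// Full range of an int symbol before any assumption is made.
RangeSet fullIntRange() { return RangeSet{Range(INT_MIN, INT_MAX)}; }

// Ranges that remain if the symbol is assumed to be equal to Value.
RangeSet assumeEqual(const RangeSet &RS, long long Value) {
  for (const Range &R : RS)
    if (R.first <= Value && Value <= R.second)
      return RangeSet{Range(Value, Value)};
  return RangeSet();  // empty set: this branch is infeasible
}

// Ranges that remain if the symbol is assumed to differ from Value.
RangeSet assumeNotEqual(const RangeSet &RS, long long Value) {
  RangeSet Out;
  for (const Range &R : RS) {
    if (Value < R.first || Value > R.second) {
      Out.push_back(R);  // Value lies outside: keep the range whole
      continue;
    }
    if (R.first < Value)
      Out.push_back(Range(R.first, Value - 1));
    if (Value < R.second)
      Out.push_back(Range(Value + 1, R.second));
  }
  return Out;
}
\end{lstlisting}
Applied to \lstinline|fullIntRange()| and the value 0, \lstinline|assumeEqual| leaves the single range $[0, 0]$, while \lstinline|assumeNotEqual| leaves two ranges, one below zero and one above it. An empty result means that the corresponding branch cannot be taken in the current state.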
On \lstinline|[B3]|, we execute an assignment operator \lstinline|y = 5|. This is instantly represented in the node as a \emph{binding}: the value of variable \lstinline|y| is a \emph{concrete}\index[notion]{Symbolic Value!Concrete Value} (non-symbolic) value 5. Then we jump to \lstinline|[B2]| anyway. However, we are reaching \lstinline|[B2]| in a different state, hence it is represented by a different node in the exploded graph.
Now, upon reaching \lstinline|[B2]|, note how we no longer assume anything or try to guess which branch we take from there. Because the range of symbol \lstinline|reg_$0<x>| is still stored inside the node, we already know the truth value of \lstinline|!x|.
On the \lstinline|true| branch, assignment \lstinline|z = 6| is executed. Note how the binding \lstinline|y = 5| is no longer present in the program state: it was garbage-collected\index[notion]{Garbage Collection} because variable \lstinline|y| is no longer referenced in further code. Finally, all branches reach the end block \lstinline|[B0]|, and the analysis stops.
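The walk described above can be imitated by a toy exploration routine. This is a heavily simplified sketch and not the analyzer engine: the ``program state'' is reduced to a single three-valued constraint on \lstinline|x|, the block layout of \lstinline|test.c| is hard-coded, and the names \lstinline|XConstraint|, \lstinline|explore| and \lstinline|explorePaths| are hypothetical.
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <string>
#include <vector>

// What the toy program state knows about the symbol for x.
enum class XConstraint { Unknown, Zero, NonZero };

// Explores test.c path-sensitively, starting from the given block.
// Every completed path is stored as a sequence of block names.
void explore(int Block, XConstraint X, std::string Path,
             std::vector<std::string> &Paths) {
  Path += "B" + std::to_string(Block) + " ";
  switch (Block) {
  case 5: explore(4, X, Path, Paths); break;  // entry block
  case 3: explore(2, X, Path, Paths); break;  // y = 5
  case 1: explore(0, X, Path, Paths); break;  // z = 6
  case 0: Paths.push_back(Path); break;       // exit block
  case 4:                                     // if (x == 0)
  case 2: {                                   // if (!x)
    int TrueSucc = (Block == 4) ? 3 : 1;
    int FalseSucc = (Block == 4) ? 2 : 0;
    // Both conditions are true exactly when x is zero. A branch is
    // followed only if it does not contradict the stored constraint.
    if (X != XConstraint::NonZero)
      explore(TrueSucc, XConstraint::Zero, Path, Paths);
    if (X != XConstraint::Zero)
      explore(FalseSucc, XConstraint::NonZero, Path, Paths);
    break;
  }
  }
}

std::vector<std::string> explorePaths() {
  std::vector<std::string> Paths;
  explore(5, XConstraint::Unknown, "", Paths);
  return Paths;
}
\end{lstlisting}
In this model the state is split only at the first condition, where nothing is known about \lstinline|x|. At the second condition the stored constraint already decides the branch, so \lstinline|explorePaths()| finds two paths: one through both \lstinline|[B3]| and \lstinline|[B1]|, and one through neither.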
Information stored in the exploded graph is exhaustive, and contains the best assumptions the analyzer core can make about the program execution. Moreover, unlike AST- and CFG-based analysis, path-sensitive CSA checkers rarely read the exploded graph passively, but instead actively participate in its construction, adding their own nodes, bindings, assumptions, leaving checker-specific marks, and splitting paths in the exploded graph at their own will.
\subsection{Further reading}
Coding AST- or CFG-based checkers is discussed in detail in section \ref{sec:ast_based}. If you are interested in coding path-sensitive checkers, jump directly to chapter \ref{sec:path_sensitive}; however, some knowledge of the Clang AST may still be useful. There is a highly recommended introduction to Clang AST available at the official Clang website\footnote{\url{http://clang.llvm.org/docs/IntroductionToTheClangAST.html}}.
\newpage
\section{AST-based checkers}\label{sec:ast_based}\index[notion]{Checker!AST-based|textbf}
Many simple checks can be implemented by looking at the syntax tree of the program and catching unwanted code patterns. Checkers that do not make use of the CSA path-sensitive engine are fast and often have a good true positive rate, but are capable of catching only a very limited set of defects.
In this section we proceed to discuss two common technologies for creating syntax-only checkers: AST visitors and AST matchers. Usually having a good command of one of these technologies is sufficient, but sometimes you may want to use both matchers and visitors in the same checker.
AST-based checks are not ``the'' strength of the Clang Static Analyzer --- if you are using only AST-based information, then you could have done this check with any other Clang-based tool. In the official Clang Static Analyzer, there are a few AST-based checks, but normally AST-based checks go to the \lstinline|clang-tidy| tool.
On the other hand, it is not uncommon to use AST-based checks from within the path-sensitive engine, in order to have a better idea of the syntax behind the path-sensitive analysis events. So, even though in this section we deliberately learn how to write AST-only checks, exactly the same techniques may make it into a path-sensitive checker, and it is useful to have a good command of them.
\subsection{Path-insensitive checker callbacks}\index[notion]{Checker!Callback}
The path-sensitive engine of CSA starts working only once at least one checker subscribes to a checker callback that requires path-sensitive analysis to fire. If you are interested only in AST-based checkers, and disable all path-sensitive checkers, the analysis would run significantly faster.
Because AST-based checkers do not participate in construction of the data structures they analyze, only a few AST-only callbacks are defined. The most useful callbacks that do not instantly trigger the path-sensitive engine are \lstinline|check::EndOfTranslationUnit| and \lstinline|check::ASTCodeBody|.
\begin{nobr}
\subsubsection{check::EndOfTranslationUnit}\index{Checker!check::EndOfTranslationUnit|textbf}
\begin{lstlisting}[style=cplusplus,numbers=none]
void checkEndOfTranslationUnit(const TranslationUnitDecl *TU,
                               AnalysisManager &AM, BugReporter &BR) const;
\end{lstlisting}
In this callback, the complete AST of the program is available for analysis.
\end{nobr}
The entry point for visiting the AST --- the declaration of the whole translation unit --- is provided as the first argument, \lstinline|TU|. This callback is commonly used when not only executable code, but also declarations need to be checked.
\begin{nobr}
\subsubsection{check::ASTCodeBody}\index{Checker!check::ASTCodeBody|textbf}
\begin{lstlisting}[style=cplusplus,numbers=none]
void checkASTCodeBody(const Decl *D,
                      AnalysisManager &AM, BugReporter &BR) const;
\end{lstlisting}
This callback is invoked for every declaration whose code body the analyzer would normally analyze, such as a function; the declaration is passed as \lstinline|D|.
\end{nobr}
The body of the function that needs to be analyzed is available as \lstinline|D->getBody()|. This callback is convenient when only executable code needs to be analyzed.
\begin{nobr}
\subsubsection{check::ASTDecl<T>}\index{Checker!check::ASTDecl|textbf}
\begin{lstlisting}[style=cplusplus,numbers=none]
void checkASTDecl(const T *D, AnalysisManager &Mgr, BugReporter &BR) const;
\end{lstlisting}
This callback is called for all AST declarations of type \lstinline|T| (for example, for all variables if \lstinline|T| is \lstinline|VarDecl| or for all class fields if \lstinline|T| is \lstinline|FieldDecl|). This is often a convenient simple alternative for declaration visitors.
\end{nobr}
\subsection{AST visitors}\index[notion]{Checker!AST-based!AST Visitor|textbf}
The AST visitor mechanism is the most flexible tool for exploring the Clang AST. Clang provides numerous visitors for the AST, with similar syntax. For implementing checkers, two kinds of visitors are mostly useful:
\begin{itemize}
\item[---]\lstinline|ConstStmtVisitor|\index{ConstStmtVisitor|textbf} is widely used for checking code bodies, which is most often exactly what you need,
\item[---]\lstinline|ConstDeclVisitor|\index{ConstDeclVisitor|textbf} is sometimes used for checking declarations outside code bodies (such as global variables).
\end{itemize}
In order to use a visitor, you need to inherit a class from it and implement visitor callbacks for different kinds of AST nodes. Whenever a callback is not implemented for a particular node, a callback for a more generic node would be called anyway: so, for example, a \lstinline|CXXOperatorCallExpr| would be visited in one of the following callbacks:
\begin{itemize}
\item[---]\lstinline|VisitCXXOperatorCallExpr(...)|,
\item[---]\lstinline|VisitCallExpr(...)|,
\item[---]\lstinline|VisitExpr(...)|,
\item[---]\lstinline|VisitStmt(...)|,
\end{itemize}
whichever turns out to be the first one to be defined.
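The fallback rule can be reproduced in a few lines of plain C++. The sketch below is a toy model that does not use the Clang visitor classes; \lstinline|NodeKind|, \lstinline|ToyVisitor| and \lstinline|ToyWalker| are hypothetical names. It assumes that the fallback is implemented by default callbacks in the base class that forward to the callback of the more generic node, which is an analogy and not a description of the actual Clang implementation.
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <string>

// Toy node kinds, ordered from the most generic to the most specific.
enum class NodeKind { Stmt, Expr, CallExpr, OperatorCallExpr };

template <typename Derived> class ToyVisitor {
  Derived &self() { return static_cast<Derived &>(*this); }

public:
  // Dispatches to the callback that corresponds to the node kind.
  std::string Visit(NodeKind K) {
    switch (K) {
    case NodeKind::OperatorCallExpr: return self().VisitOperatorCallExpr();
    case NodeKind::CallExpr:         return self().VisitCallExpr();
    case NodeKind::Expr:             return self().VisitExpr();
    case NodeKind::Stmt:             return self().VisitStmt();
    }
    return "";
  }

  // Default callbacks: each one forwards to the more generic callback.
  std::string VisitOperatorCallExpr() { return self().VisitCallExpr(); }
  std::string VisitCallExpr() { return self().VisitExpr(); }
  std::string VisitExpr() { return self().VisitStmt(); }
  std::string VisitStmt() { return "unhandled"; }
};

// This visitor implements only two of the four callbacks.
struct ToyWalker : ToyVisitor<ToyWalker> {
  std::string VisitCallExpr() { return "VisitCallExpr"; }
  std::string VisitStmt() { return "VisitStmt"; }
};
\end{lstlisting}
In this model, visiting an \lstinline|OperatorCallExpr| node with \lstinline|ToyWalker| ends up in \lstinline|VisitCallExpr()|, the first callback in the chain that the derived class defines, while visiting a plain \lstinline|Expr| node ends up in \lstinline|VisitStmt()|.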
\subsubsection{Implementing a simple statement visitor}
As an example, let us see if we can rewrite \lstinline|alpha.core.MainCallChecker| described in section \ref{sec:intro} as an AST visitor. This checker would also serve as an example of using the \lstinline|check::ASTCodeBody| callback.
First, let us declare an AST visitor. The visitor stores references to the \lstinline|BugReporter|\index{BugReporter} object in order to throw path-insensitive reports, and also the current \lstinline|AnalysisDeclContext|\index{AnalysisDeclContext}, which is required for producing diagnostic locations for bug reports. The latter also wraps the original function we are analyzing.
\begin{nobr}
\begin{lstlisting}[style=cplusplus,numbers=none]
namespace {
class WalkAST : public ConstStmtVisitor<WalkAST> {
  BugReporter &BR;
  AnalysisDeclContext *ADC;
  void VisitChildren(const Stmt *S);
public:
  WalkAST(BugReporter &Reporter, AnalysisDeclContext *Context)
      : BR(Reporter), ADC(Context) {}
  void VisitStmt(const Stmt *S);
  void VisitCallExpr(const CallExpr *CE);
};
}
\end{lstlisting}\index{ConstStmtVisitor}
\end{nobr}
The visitor defines two public callbacks: \lstinline|VisitCallExpr(...)| for special handling of function call expressions, and \lstinline|VisitStmt(...)| for visiting all other kinds of statements.
\begin{nobr}
These callbacks have one thing in common: they need to visit sub-statements whenever they're done visiting their statement. This operation is often separated into a sub-function called \lstinline|VisitChildren(...)|:
\begin{lstlisting}[style=cplusplus,numbers=none]
void WalkAST::VisitChildren(const Stmt *S) {
  for (Stmt::const_child_iterator I = S->child_begin(), E = S->child_end();
       I != E; ++I)
    if (const Stmt *Child = *I)
      Visit(Child);
}
\end{lstlisting}
\end{nobr}
\begin{nobr}
Now, \lstinline|VisitStmt(...)| doesn't really need to do anything else:
\begin{lstlisting}[style=cplusplus,numbers=none]
void WalkAST::VisitStmt(const Stmt *S) {
  VisitChildren(S);
}
\end{lstlisting}
\end{nobr}
\begin{nobr}
Most of the checker logic is written out in \lstinline|VisitCallExpr(...)|. We obtain the function declaration for the current call expression, take its identifier, and see if this identifier coincides with \lstinline|"main"|. If it does, we throw a path-insensitive (``basic'') report. Note that unlike path-sensitive checkers, syntax-only checkers do not have the convenient \lstinline|CheckerContext| wrapper available, so they need to access the \lstinline|BugReporter| object directly, and also put some effort into obtaining the necessary source locations.
\begin{lstlisting}[style=cplusplus,numbers=none]
void WalkAST::VisitCallExpr(const CallExpr *CE) {
  if (const FunctionDecl *FD = CE->getDirectCallee())
    if (const IdentifierInfo *II = FD->getIdentifier())
      if (II->isStr("main")) {
        SourceRange R = CE->getSourceRange();
        PathDiagnosticLocation ELoc =
            PathDiagnosticLocation::createBegin(CE, BR.getSourceManager(), ADC);
        BR.EmitBasicReport(ADC->getDecl(), "Call to main", "Example checker",
                           "Call to main", ELoc, R);
      }
  VisitChildren(CE);
}
\end{lstlisting}\index{BugReporter!EmitBasicReport()}\index{PathDiagnosticLocation}\index{PathDiagnosticLocation!createBegin()}\index{BugReporter!getSourceManager()}\index{SourceManager}\index{AnalysisDeclContext}
\end{nobr}
\begin{nobr}
That's it for the implementation of the visitor. Now we simply need to create it and give it some code to visit. Since ``all code'' is a declaration rather than a statement, we subscribe to the \lstinline|check::ASTCodeBody| callback:
\begin{lstlisting}[style=cplusplus,numbers=none]
namespace {
class MainCallCheckerAST : public Checker<check::ASTCodeBody> {
public:
  void checkASTCodeBody(const Decl *D, AnalysisManager &AM,
                        BugReporter &B) const;
};
}
\end{lstlisting}\index{Checker!check::ASTCodeBody}\index{AnalysisManager}\index{BugReporter}\index{AnalysisManager!getAnalysisDeclContext()}\index{AnalysisDeclContext}
\end{nobr}
\begin{nobr}
And implement the callback as follows:
\begin{lstlisting}[style=cplusplus,numbers=none]
void MainCallCheckerAST::checkASTCodeBody(const Decl *D, AnalysisManager &AM,
    BugReporter &BR) const {
  WalkAST Walker(BR, AM.getAnalysisDeclContext(D));
  Walker.Visit(D->getBody());
}
\end{lstlisting}
\end{nobr}
This way the visitor starts from the compound statement that represents the function body, and descends into sub-statements.
\begin{nobr}
The checker is now ready. However, on the example code from chapter \ref{sec:intro} it is silent: we now detect only direct calls, not calls through function pointers, because only that much is present in the AST. The checker does warn on simpler code, though:
\begin{lstlisting}[style=cplusplus,numbers=none]
void foo() {
  main(0, 0); // Call to main!
}
\end{lstlisting}
\end{nobr}
\begin{nobr}
\subsubsection{Merging statement and declaration visitors}
Sometimes you'd like to intermix the two visitors together, in order to visit both statements and declarations. In this case, you can inherit your visitor from both visitors:
\begin{lstlisting}[style=cplusplus,numbers=none]
class WalkAST : public ConstStmtVisitor<WalkAST>,
                public ConstDeclVisitor<WalkAST> {
  /* ... */
public:
  using ConstStmtVisitor<WalkAST>::Visit;
  using ConstDeclVisitor<WalkAST>::Visit;
  /* ... */
};
\end{lstlisting}\index{ConstStmtVisitor}\index{ConstDeclVisitor}
\end{nobr}
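The reason for the two \lstinline|using| declarations is a general C++ rule and can be shown without any Clang classes. In the toy sketch below, all names (\lstinline|ToyStmt|, \lstinline|ToyDecl|, \lstinline|ToyStmtVisitor|, \lstinline|ToyDeclVisitor|, \lstinline|MergedWalker|) are hypothetical. Both base classes declare a member function called \lstinline|Visit|, so an unqualified call on the derived class would be ambiguous unless both overloads are brought into the scope of the derived class.
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <string>

struct ToyStmt {};
struct ToyDecl {};

struct ToyStmtVisitor {
  std::string Visit(const ToyStmt *) { return "stmt"; }
};

struct ToyDeclVisitor {
  std::string Visit(const ToyDecl *) { return "decl"; }
};

struct MergedWalker : ToyStmtVisitor, ToyDeclVisitor {
  // Without these two lines a call to Visit(...) does not compile:
  // name lookup finds Visit in two different base classes and stops
  // before overload resolution has a chance to pick one of them.
  using ToyStmtVisitor::Visit;
  using ToyDeclVisitor::Visit;
};
\end{lstlisting}
With the declarations in place, \lstinline|Visit(...)| on a \lstinline|MergedWalker| object is resolved by the ordinary overloading rules, according to whether a statement or a declaration is passed.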
\subsection{AST matchers}\label{subsec:ast_matchers}\index[notion]{Checker!AST-based!AST Matcher|textbf}
AST matchers are the new API for finding simple code patterns in the Clang AST. They allow writing extremely concise declarative definitions of such patterns --- almost as short as describing them in words in a natural language --- and provide an interface for taking actions on every pattern found. Being preferable for simple code patterns, for their simplicity and code readability, AST matchers are not as omnipotent as AST visitors.
\subsubsection{Implementing a simple AST matcher}
As an example, let us see if we can rewrite \lstinline|alpha.core.MainCallChecker| described in section \ref{sec:intro} with the help of AST matchers.
\begin{nobr}
Recall that the checker needs to find calls to functions with the name \lstinline|"main"|. Knowing just that, we can instantly write a matcher that finds such calls:
\begin{lstlisting}[style=cplusplus,numbers=none]
callExpr(callee(functionDecl(hasName("main")))).bind("call")
\end{lstlisting}
\end{nobr}
This means that the complete checker logic now fits into a single line of code! All that remains is to write out the checker bureaucracy and throw the bug report. Note how the \lstinline|bind(...)| matcher command assigns a name to the AST node it is applied to, for future reference.
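The idea of composing small predicates and naming the matched node can be sketched in plain C++. The code below is a toy model and not the Clang AST matchers library; \lstinline|ToyCall|, \lstinline|Bindings|, \lstinline|Matcher|, \lstinline|hasCalleeName| and \lstinline|bindAs| are hypothetical names chosen only for this illustration.
\begin{lstlisting}[style=cplusplus,numbers=none]
#include <functional>
#include <map>
#include <string>

// Toy stand-in for a call expression; NOT a Clang class.
struct ToyCall {
  std::string Callee;
};

// Nodes that were given a name during matching.
using Bindings = std::map<std::string, const ToyCall *>;

// A matcher is a predicate that may also record bindings.
using Matcher = std::function<bool(const ToyCall &, Bindings &)>;

// Matches calls to a function with the given name.
Matcher hasCalleeName(std::string Name) {
  return [Name](const ToyCall &C, Bindings &) { return C.Callee == Name; };
}

// Wraps another matcher and stores the matched node under the name Id.
Matcher bindAs(Matcher Inner, std::string Id) {
  return [Inner, Id](const ToyCall &C, Bindings &B) {
    if (!Inner(C, B))
      return false;
    B[Id] = &C;
    return true;
  };
}
\end{lstlisting}
In this model, \lstinline|bindAs(hasCalleeName("main"), "call")| plays the role of the matcher above: it succeeds only on calls to \lstinline|main|, and on success the matched node can be looked up in the bindings under the name \lstinline|"call"|.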
\begin{nobr}
The first thing we need to define is the matcher callback. This callback would fire whenever the matcher finds something. Matcher callbacks need to inherit from \lstinline|MatchFinder::MatchCallback|\index{MatchFinder|textbf}\index{MatchFinder!MatchCallback|textbf} and implement the method called \lstinline|run(...)|:
\begin{lstlisting}[style=cplusplus,numbers=none]
namespace {
class Callback : public MatchFinder::MatchCallback {
  BugReporter &BR;
  AnalysisDeclContext *ADC;
public:
  void run(const MatchFinder::MatchResult &Result);
  Callback(BugReporter &Reporter, AnalysisDeclContext *Context)
      : BR(Reporter), ADC(Context) {}
};
}
\end{lstlisting}\index{MatchFinder!MatchResult}
\end{nobr}
\begin{nobr}
Ideally, the only thing the match callback needs to do is throw the basic bug report. This is the case here. However, sometimes matchers cannot cover the whole checker logic, and it is natural to leave some final checks to the callback.
\begin{lstlisting}[style=cplusplus,numbers=none]
void Callback::run(const MatchFinder::MatchResult &Result) {
  // Retrieve the node that the matcher bound to the name "call".
  const CallExpr *CE = Result.Nodes.getNodeAs<CallExpr>("call");
  assert(CE);
  SourceRange R = CE->getSourceRange();
  PathDiagnosticLocation ELoc =
      PathDiagnosticLocation::createBegin(CE, BR.getSourceManager(), ADC);
  BR.EmitBasicReport(ADC->getDecl(), "Call to main", "Example checker",
                     "Call to main", ELoc, R);
}
\end{lstlisting}
\index{MatchFinder!MatchResult!Nodes}\index{BugReporter!EmitBasicReport()}\index{PathDiagnosticLocation}\index{PathDiagnosticLocation!createBegin()}\index{BugReporter!getSourceManager()}\index{SourceManager}\index{AnalysisDeclContext}
\end{nobr}
In the callback, we obtain the call expression by its name, \lstinline|"call"|, defined via \lstinline|bind(...)|. Looking at the matcher, we are sure that such call expression is present, so we can assert that we have successfully obtained the statement by name.
\begin{nobr}
Now it is time to define the checker. This time, let us try out the whole-translation-unit matching:
\begin{lstlisting}[style=cplusplus,numbers=none]
namespace {
class MainCallCheckerMatchers : public Checker<check::EndOfTranslationUnit> {
public:
  void checkEndOfTranslationUnit(const TranslationUnitDecl *TU,
                                 AnalysisManager &AM, BugReporter &B) const;
};
}
\end{lstlisting}
\end{nobr}
\begin{nobr}
Finally, in the checker callback, we need to construct our matcher and use it to find bugs:
\begin{lstlisting}[style=cplusplus,numbers=none]
void MainCallCheckerMatchers::checkEndOfTranslationUnit(
    const TranslationUnitDecl *TU, AnalysisManager &AM, BugReporter &B) const {
  MatchFinder F;
  Callback CB(B, AM.getAnalysisDeclContext(TU));
  // matchAST(...) tries the matcher on every node of the AST,
  // so the matcher only needs to describe the call itself.
  F.addMatcher(callExpr(callee(functionDecl(hasName("main")))).bind("call"),
               &CB);
  F.matchAST(AM.getASTContext());
}
\end{lstlisting}\index{Checker!check::EndOfTranslationUnit}\index{AnalysisManager}\index{BugReporter}\index{AnalysisManager!getASTContext()}\index{ASTContext}
\end{nobr}
The \lstinline|matchAST(...)|\index{MatchFinder!matchAST()|textbf} method of \lstinline|MatchFinder| lets it match the whole AST of the translation unit. The check\-er is now done. The output is similar to the visitor version of the checker.
The \lstinline|ASTContext|\index{ASTContext|textbf} structure, which we obtained from the \lstinline|AnalysisManager|, contains the whole AST of the program, and also various meta-information regarding the AST, such as implementation-specific traits imposed during compilation.
\begin{nobr}
\subsubsection{Re-using matchers}
If a certain sub-pattern repeats multiple times in your matcher, you can store and re-use it. In the example below, matcher \lstinline|TypeM| is stored and then re-used twice in two other matchers, which are in turn stored for later use:
\begin{lstlisting}[style=cplusplus]
TypeMatcher TypeM = templateSpecializationType().bind("type");
DeclarationMatcher VarDeclM = varDecl(hasType(TypeM)).bind("decl");
StatementMatcher TempObjM = temporaryObjectExpr(hasType(TypeM)).bind("stmt");
\end{lstlisting}\index{TypeMatcher}\index{DeclarationMatcher}\index{StatementMatcher}
\end{nobr}
\begin{nobr}
\subsubsection{Defining custom matchers}
Sometimes combining the predefined matchers is not enough to implement the desired check. In this case, it is often convenient to implement a custom AST matcher. Implementing an AST matcher is a matter of a few lines of code, and many examples can be found in \lstinline|ASTMatchers.h|. When implementing a custom AST matcher inside the checker, you need to put it into the \lstinline|clang::ast_matchers| namespace. The example below defines a custom declaration matcher that matches \lstinline|RecordDecl| nodes that declare unions rather than structures:
\begin{lstlisting}[style=cplusplus]
namespace clang {
namespace ast_matchers {
AST_MATCHER(RecordDecl, isUnion) {
  return Node.isUnion();
}
} // end namespace ast_matchers
} // end namespace clang
\end{lstlisting}\index{AST\_MATCHER()}
\end{nobr}
\subsubsection{Matching particular statements}
As we mentioned before, the \lstinline|matchAST(...)|\index{MatchFinder!matchAST()} method of \lstinline|MatchFinder| matches the whole AST of the translation unit. Sometimes you want to match only a particular section of the AST. In this case, you can use the \lstinline|match(...)|\index{MatchFinder!match()|textbf} method.
For instance, let us try to implement \lstinline|MainCallChecker| using \lstinline|check::ASTCodeBody|. Then we need to match \lstinline|D->getBody()| with the \lstinline|MatchFinder|.
\begin{nobr}
However, the semantics of \lstinline|match(...)| differs from that of \lstinline|matchAST(...)|: the former only tries to match the statement itself, whereas the latter tries to match its sub-statements as well. So we need to modify our matcher to make it look for sub-statements manually:
\begin{lstlisting}[style=cplusplus]
void MainCallCheckerMatchers::checkASTCodeBody(const Decl *D,
AnalysisManager &AM,
BugReporter &BR) const {
MatchFinder F;
Callback CB(BR, AM.getAnalysisDeclContext(D));
F.addMatcher(
stmt(hasDescendant(
callExpr(callee(functionDecl(hasName("main")))).bind("call"))),