Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[clang][analyzer] Stabilize path-constraint order by using alloc IDs #121347

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

necto
Copy link
Contributor

@necto necto commented Dec 30, 2024

These IDs are essentially allocator offsets for the SymExpr pool. They are superior to raw pointer values because they are more controllable and are not randomized across executions.

They are not stable across runs because some spurious allocations might happen only in certain runs. For example, an unrelated ImmutableMap rotates its AVL tree based on the key values that are pointers. The rotation causes extra allocations in the shared allocator, and the next SymExpr allocation happens at a different offset.

Yet, these IDs order is stable across runs because SymExprs are allocated in the same order and allocator offset grows monotonously.

Stability of the constraint order is important for the stability of the analyzer results. I evaluated this change on a set of 200+ open-source C and C++ projects with the total number of ~78 000 symbolic-execution issues.

This patch reduced the run-to-run churn (flakiness) in SE issues from 80-90 to 30-40 (out of 78K) in our CSA deployment (in our setting flaky issues are mostly due to Z3 refutation instability).

Importantly, this change is necessary for the next step in stabilizing analysis results by caching Z3 query outcomes between analysis runs (work in progress).

Across our admittedly noisy CI runs, I detected no noticeable effect on memory footprint or analysis time.

CPP-5919

These IDs are essentially allocator offsets for the SymExpr pool. They
are superior to raw pointer values because they are more controllable
and are not randomized across executions.

They are not stable across runs because some spurious allocations might happen
only in certain runs. For example, an unrelated ImmutableMap rotates its
AVL tree based on the key values that are pointers. The rotation causes
extra allocations in the shared allocator, and the next SymExpr
allocation happens at a different offset.

Yet, these IDs *order* is stable across runs because SymExprs are
allocated in the same order and allocator offset grows monotonously.

Stability of the constraint order is important for the stability of the
analyzer results. I evaluated this change on a set of 200+ open-source
C++ projects with the total number of ~78 000 symbolic-execution issues.

This patch reduced the run-to-run churn (flakyness) in SE issues from
80-90 to 30-40 (out of 78K) in our CSA deployment (in our setting flaky
issues are mostly due to Z3 refutation instability).

Importantly, this change is necessary for the next step in stabilizing
analysis results by caching Z3 query outcomes between analysis
runs (work in progress).

Across our admittedly noisy CI runs, I detected no noticeable effect on
memory footprint or analysis time.

CPP-5919
@llvmbot llvmbot added clang Clang issues not falling into any other category clang:static analyzer labels Dec 30, 2024
@llvmbot
Copy link
Member

llvmbot commented Dec 30, 2024

@llvm/pr-subscribers-clang-static-analyzer-1

Author: Arseniy Zaostrovnykh (necto)

Changes

These IDs are essentially allocator offsets for the SymExpr pool. They are superior to raw pointer values because they are more controllable and are not randomized across executions.

They are not stable across runs because some spurious allocations might happen only in certain runs. For example, an unrelated ImmutableMap rotates its AVL tree based on the key values that are pointers. The rotation causes extra allocations in the shared allocator, and the next SymExpr allocation happens at a different offset.

Yet, these IDs order is stable across runs because SymExprs are allocated in the same order and allocator offset grows monotonously.

Stability of the constraint order is important for the stability of the analyzer results. I evaluated this change on a set of 200+ open-source C++ projects with the total number of ~78 000 symbolic-execution issues.

This patch reduced the run-to-run churn (flakyness) in SE issues from 80-90 to 30-40 (out of 78K) in our CSA deployment (in our setting flaky issues are mostly due to Z3 refutation instability).

Importantly, this change is necessary for the next step in stabilizing analysis results by caching Z3 query outcomes between analysis runs (work in progress).

Across our admittedly noisy CI runs, I detected no noticeable effect on memory footprint or analysis time.

CPP-5919


Full diff: https://github.com/llvm/llvm-project/pull/121347.diff

4 Files Affected:

  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h (+16-1)
  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h (+7-2)
  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h (+33-26)
  • (modified) clang/lib/StaticAnalyzer/Core/SymbolManager.cpp (+18-10)
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h
index 49ea006e27aa54..27ee809a99b2cd 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h
@@ -401,7 +401,22 @@ class RangeSet {
   friend class Factory;
 };
 
-using ConstraintMap = llvm::ImmutableMap<SymbolRef, RangeSet>;
+struct ConstraintKVInfo : llvm::ImutKeyValueInfo<SymbolRef, RangeSet> {
+  static inline bool isEqual(key_type_ref L, key_type_ref R) {
+    return L->getAllocID() == R->getAllocID();
+  }
+
+  static inline bool isLess(key_type_ref L, key_type_ref R) {
+    return L->getAllocID() < R->getAllocID();
+  }
+
+  static inline void Profile(llvm::FoldingSetNodeID &ID, value_type_ref V) {
+    ID.AddInteger(V.first->getAllocID());
+    ID.Add(V.second);
+  }
+};
+
+using ConstraintMap = llvm::ImmutableMap<SymbolRef, RangeSet, ConstraintKVInfo>;
 ConstraintMap getConstraintMap(ProgramStateRef State);
 
 class RangedConstraintManager : public SimpleConstraintManager {
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
index 862a30c0e73633..91c62886f68261 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
@@ -31,6 +31,7 @@ class SymExpr : public llvm::FoldingSetNode {
   virtual void anchor();
 
 public:
+  using AllocIDType = int64_t;
   enum Kind {
 #define SYMBOL(Id, Parent) Id##Kind,
 #define SYMBOL_RANGE(Id, First, Last) BEGIN_##Id = First, END_##Id = Last,
@@ -39,9 +40,10 @@ class SymExpr : public llvm::FoldingSetNode {
 
 private:
   Kind K;
+  AllocIDType AllocID;
 
 protected:
-  SymExpr(Kind k) : K(k) {}
+  SymExpr(Kind K, AllocIDType AllocID) : K(K), AllocID(AllocID) {}
 
   static bool isValidTypeForSymbol(QualType T) {
     // FIXME: Depending on whether we choose to deprecate structural symbols,
@@ -56,6 +58,8 @@ class SymExpr : public llvm::FoldingSetNode {
 
   Kind getKind() const { return K; }
 
+  AllocIDType getAllocID() const { return AllocID; }
+
   virtual void dump() const;
 
   virtual void dumpToStream(raw_ostream &os) const {}
@@ -122,7 +126,8 @@ class SymbolData : public SymExpr {
   void anchor() override;
 
 protected:
-  SymbolData(Kind k, SymbolID sym) : SymExpr(k), Sym(sym) {
+  SymbolData(Kind k, SymbolID sym, AllocIDType AllocID)
+      : SymExpr(k, AllocID), Sym(sym) {
     assert(classof(this));
   }
 
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
index 73732d532f630f..da13f25323d8a2 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
@@ -44,8 +44,9 @@ class SymbolRegionValue : public SymbolData {
   const TypedValueRegion *R;
 
 public:
-  SymbolRegionValue(SymbolID sym, const TypedValueRegion *r)
-      : SymbolData(SymbolRegionValueKind, sym), R(r) {
+  SymbolRegionValue(SymbolID sym, const TypedValueRegion *r,
+                    AllocIDType AllocID)
+      : SymbolData(SymbolRegionValueKind, sym, AllocID), R(r) {
     assert(r);
     assert(isValidTypeForSymbol(r->getValueType()));
   }
@@ -86,8 +87,9 @@ class SymbolConjured : public SymbolData {
 
 public:
   SymbolConjured(SymbolID sym, const Stmt *s, const LocationContext *lctx,
-                 QualType t, unsigned count, const void *symbolTag)
-      : SymbolData(SymbolConjuredKind, sym), S(s), T(t), Count(count),
+                 QualType t, unsigned count, const void *symbolTag,
+                 AllocIDType AllocID)
+      : SymbolData(SymbolConjuredKind, sym, AllocID), S(s), T(t), Count(count),
         LCtx(lctx), SymbolTag(symbolTag) {
     // FIXME: 's' might be a nullptr if we're conducting invalidation
     // that was caused by a destructor call on a temporary object,
@@ -138,8 +140,10 @@ class SymbolDerived : public SymbolData {
   const TypedValueRegion *R;
 
 public:
-  SymbolDerived(SymbolID sym, SymbolRef parent, const TypedValueRegion *r)
-      : SymbolData(SymbolDerivedKind, sym), parentSymbol(parent), R(r) {
+  SymbolDerived(SymbolID sym, SymbolRef parent, const TypedValueRegion *r,
+                AllocIDType AllocID)
+      : SymbolData(SymbolDerivedKind, sym, AllocID), parentSymbol(parent),
+        R(r) {
     assert(parent);
     assert(r);
     assert(isValidTypeForSymbol(r->getValueType()));
@@ -181,8 +185,8 @@ class SymbolExtent : public SymbolData {
   const SubRegion *R;
 
 public:
-  SymbolExtent(SymbolID sym, const SubRegion *r)
-      : SymbolData(SymbolExtentKind, sym), R(r) {
+  SymbolExtent(SymbolID sym, const SubRegion *r, AllocIDType AllocID)
+      : SymbolData(SymbolExtentKind, sym, AllocID), R(r) {
     assert(r);
   }
 
@@ -223,16 +227,17 @@ class SymbolMetadata : public SymbolData {
   const void *Tag;
 
 public:
-  SymbolMetadata(SymbolID sym, const MemRegion* r, const Stmt *s, QualType t,
-                 const LocationContext *LCtx, unsigned count, const void *tag)
-      : SymbolData(SymbolMetadataKind, sym), R(r), S(s), T(t), LCtx(LCtx),
-        Count(count), Tag(tag) {
-      assert(r);
-      assert(s);
-      assert(isValidTypeForSymbol(t));
-      assert(LCtx);
-      assert(tag);
-    }
+  SymbolMetadata(SymbolID sym, const MemRegion *r, const Stmt *s, QualType t,
+                 const LocationContext *LCtx, unsigned count, const void *tag,
+                 AllocIDType AllocID)
+      : SymbolData(SymbolMetadataKind, sym, AllocID), R(r), S(s), T(t),
+        LCtx(LCtx), Count(count), Tag(tag) {
+    assert(r);
+    assert(s);
+    assert(isValidTypeForSymbol(t));
+    assert(LCtx);
+    assert(tag);
+  }
 
     LLVM_ATTRIBUTE_RETURNS_NONNULL
     const MemRegion *getRegion() const { return R; }
@@ -287,8 +292,8 @@ class SymbolCast : public SymExpr {
   QualType ToTy;
 
 public:
-  SymbolCast(const SymExpr *In, QualType From, QualType To)
-      : SymExpr(SymbolCastKind), Operand(In), FromTy(From), ToTy(To) {
+  SymbolCast(const SymExpr *In, QualType From, QualType To, AllocIDType AllocID)
+      : SymExpr(SymbolCastKind, AllocID), Operand(In), FromTy(From), ToTy(To) {
     assert(In);
     assert(isValidTypeForSymbol(From));
     // FIXME: GenericTaintChecker creates symbols of void type.
@@ -333,8 +338,9 @@ class UnarySymExpr : public SymExpr {
   QualType T;
 
 public:
-  UnarySymExpr(const SymExpr *In, UnaryOperator::Opcode Op, QualType T)
-      : SymExpr(UnarySymExprKind), Operand(In), Op(Op), T(T) {
+  UnarySymExpr(const SymExpr *In, UnaryOperator::Opcode Op, QualType T,
+               AllocIDType AllocID)
+      : SymExpr(UnarySymExprKind, AllocID), Operand(In), Op(Op), T(T) {
     // Note, some unary operators are modeled as a binary operator. E.g. ++x is
     // modeled as x + 1.
     assert((Op == UO_Minus || Op == UO_Not) && "non-supported unary expression");
@@ -381,8 +387,9 @@ class BinarySymExpr : public SymExpr {
   QualType T;
 
 protected:
-  BinarySymExpr(Kind k, BinaryOperator::Opcode op, QualType t)
-      : SymExpr(k), Op(op), T(t) {
+  BinarySymExpr(Kind k, BinaryOperator::Opcode op, QualType t,
+                AllocIDType AllocID)
+      : SymExpr(k, AllocID), Op(op), T(t) {
     assert(classof(this));
     // Binary expressions are results of arithmetic. Pointer arithmetic is not
     // handled by binary expressions, but it is instead handled by applying
@@ -427,8 +434,8 @@ class BinarySymExprImpl : public BinarySymExpr {
 
 public:
   BinarySymExprImpl(LHSTYPE lhs, BinaryOperator::Opcode op, RHSTYPE rhs,
-                    QualType t)
-      : BinarySymExpr(ClassKind, op, t), LHS(lhs), RHS(rhs) {
+                    QualType t, AllocIDType AllocID)
+      : BinarySymExpr(ClassKind, op, t, AllocID), LHS(lhs), RHS(rhs) {
     assert(getPointer(lhs));
     assert(getPointer(rhs));
   }
diff --git a/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp b/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp
index f21e5c3ad7bd7c..35682bb6645729 100644
--- a/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp
+++ b/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp
@@ -170,7 +170,8 @@ SymbolManager::getRegionValueSymbol(const TypedValueRegion* R) {
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolRegionValue(SymbolCounter, R);
+    SD = new (BPAlloc)
+        SymbolRegionValue(SymbolCounter, R, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -188,7 +189,8 @@ const SymbolConjured* SymbolManager::conjureSymbol(const Stmt *E,
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolConjured(SymbolCounter, E, LCtx, T, Count, SymbolTag);
+    SD = new (BPAlloc) SymbolConjured(SymbolCounter, E, LCtx, T, Count,
+                                      SymbolTag, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -204,7 +206,8 @@ SymbolManager::getDerivedSymbol(SymbolRef parentSymbol,
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolDerived(SymbolCounter, parentSymbol, R);
+    SD = new (BPAlloc) SymbolDerived(SymbolCounter, parentSymbol, R,
+                                     BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -219,7 +222,8 @@ SymbolManager::getExtentSymbol(const SubRegion *R) {
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolExtent(SymbolCounter, R);
+    SD = new (BPAlloc)
+        SymbolExtent(SymbolCounter, R, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -236,7 +240,8 @@ SymbolManager::getMetadataSymbol(const MemRegion* R, const Stmt *S, QualType T,
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolMetadata(SymbolCounter, R, S, T, LCtx, Count, SymbolTag);
+    SD = new (BPAlloc) SymbolMetadata(SymbolCounter, R, S, T, LCtx, Count,
+                                      SymbolTag, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -252,7 +257,7 @@ SymbolManager::getCastSymbol(const SymExpr *Op,
   void *InsertPos;
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
   if (!data) {
-    data = new (BPAlloc) SymbolCast(Op, From, To);
+    data = new (BPAlloc) SymbolCast(Op, From, To, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -268,7 +273,7 @@ const SymIntExpr *SymbolManager::getSymIntExpr(const SymExpr *lhs,
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
 
   if (!data) {
-    data = new (BPAlloc) SymIntExpr(lhs, op, v, t);
+    data = new (BPAlloc) SymIntExpr(lhs, op, v, t, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -284,7 +289,8 @@ const IntSymExpr *SymbolManager::getIntSymExpr(APSIntPtr lhs,
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
 
   if (!data) {
-    data = new (BPAlloc) IntSymExpr(lhs, op, rhs, t);
+    data =
+        new (BPAlloc) IntSymExpr(lhs, op, rhs, t, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -301,7 +307,8 @@ const SymSymExpr *SymbolManager::getSymSymExpr(const SymExpr *lhs,
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
 
   if (!data) {
-    data = new (BPAlloc) SymSymExpr(lhs, op, rhs, t);
+    data =
+        new (BPAlloc) SymSymExpr(lhs, op, rhs, t, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -316,7 +323,8 @@ const UnarySymExpr *SymbolManager::getUnarySymExpr(const SymExpr *Operand,
   void *InsertPos;
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
   if (!data) {
-    data = new (BPAlloc) UnarySymExpr(Operand, Opc, T);
+    data = new (BPAlloc)
+        UnarySymExpr(Operand, Opc, T, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 

@llvmbot
Copy link
Member

llvmbot commented Dec 30, 2024

@llvm/pr-subscribers-clang

Author: Arseniy Zaostrovnykh (necto)

Changes

These IDs are essentially allocator offsets for the SymExpr pool. They are superior to raw pointer values because they are more controllable and are not randomized across executions.

They are not stable across runs because some spurious allocations might happen only in certain runs. For example, an unrelated ImmutableMap rotates its AVL tree based on the key values that are pointers. The rotation causes extra allocations in the shared allocator, and the next SymExpr allocation happens at a different offset.

Yet, these IDs order is stable across runs because SymExprs are allocated in the same order and allocator offset grows monotonously.

Stability of the constraint order is important for the stability of the analyzer results. I evaluated this change on a set of 200+ open-source C++ projects with the total number of ~78 000 symbolic-execution issues.

This patch reduced the run-to-run churn (flakyness) in SE issues from 80-90 to 30-40 (out of 78K) in our CSA deployment (in our setting flaky issues are mostly due to Z3 refutation instability).

Importantly, this change is necessary for the next step in stabilizing analysis results by caching Z3 query outcomes between analysis runs (work in progress).

Across our admittedly noisy CI runs, I detected no noticeable effect on memory footprint or analysis time.

CPP-5919


Full diff: https://github.com/llvm/llvm-project/pull/121347.diff

4 Files Affected:

  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h (+16-1)
  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h (+7-2)
  • (modified) clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h (+33-26)
  • (modified) clang/lib/StaticAnalyzer/Core/SymbolManager.cpp (+18-10)
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h
index 49ea006e27aa54..27ee809a99b2cd 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/RangedConstraintManager.h
@@ -401,7 +401,22 @@ class RangeSet {
   friend class Factory;
 };
 
-using ConstraintMap = llvm::ImmutableMap<SymbolRef, RangeSet>;
+struct ConstraintKVInfo : llvm::ImutKeyValueInfo<SymbolRef, RangeSet> {
+  static inline bool isEqual(key_type_ref L, key_type_ref R) {
+    return L->getAllocID() == R->getAllocID();
+  }
+
+  static inline bool isLess(key_type_ref L, key_type_ref R) {
+    return L->getAllocID() < R->getAllocID();
+  }
+
+  static inline void Profile(llvm::FoldingSetNodeID &ID, value_type_ref V) {
+    ID.AddInteger(V.first->getAllocID());
+    ID.Add(V.second);
+  }
+};
+
+using ConstraintMap = llvm::ImmutableMap<SymbolRef, RangeSet, ConstraintKVInfo>;
 ConstraintMap getConstraintMap(ProgramStateRef State);
 
 class RangedConstraintManager : public SimpleConstraintManager {
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
index 862a30c0e73633..91c62886f68261 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymExpr.h
@@ -31,6 +31,7 @@ class SymExpr : public llvm::FoldingSetNode {
   virtual void anchor();
 
 public:
+  using AllocIDType = int64_t;
   enum Kind {
 #define SYMBOL(Id, Parent) Id##Kind,
 #define SYMBOL_RANGE(Id, First, Last) BEGIN_##Id = First, END_##Id = Last,
@@ -39,9 +40,10 @@ class SymExpr : public llvm::FoldingSetNode {
 
 private:
   Kind K;
+  AllocIDType AllocID;
 
 protected:
-  SymExpr(Kind k) : K(k) {}
+  SymExpr(Kind K, AllocIDType AllocID) : K(K), AllocID(AllocID) {}
 
   static bool isValidTypeForSymbol(QualType T) {
     // FIXME: Depending on whether we choose to deprecate structural symbols,
@@ -56,6 +58,8 @@ class SymExpr : public llvm::FoldingSetNode {
 
   Kind getKind() const { return K; }
 
+  AllocIDType getAllocID() const { return AllocID; }
+
   virtual void dump() const;
 
   virtual void dumpToStream(raw_ostream &os) const {}
@@ -122,7 +126,8 @@ class SymbolData : public SymExpr {
   void anchor() override;
 
 protected:
-  SymbolData(Kind k, SymbolID sym) : SymExpr(k), Sym(sym) {
+  SymbolData(Kind k, SymbolID sym, AllocIDType AllocID)
+      : SymExpr(k, AllocID), Sym(sym) {
     assert(classof(this));
   }
 
diff --git a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
index 73732d532f630f..da13f25323d8a2 100644
--- a/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
+++ b/clang/include/clang/StaticAnalyzer/Core/PathSensitive/SymbolManager.h
@@ -44,8 +44,9 @@ class SymbolRegionValue : public SymbolData {
   const TypedValueRegion *R;
 
 public:
-  SymbolRegionValue(SymbolID sym, const TypedValueRegion *r)
-      : SymbolData(SymbolRegionValueKind, sym), R(r) {
+  SymbolRegionValue(SymbolID sym, const TypedValueRegion *r,
+                    AllocIDType AllocID)
+      : SymbolData(SymbolRegionValueKind, sym, AllocID), R(r) {
     assert(r);
     assert(isValidTypeForSymbol(r->getValueType()));
   }
@@ -86,8 +87,9 @@ class SymbolConjured : public SymbolData {
 
 public:
   SymbolConjured(SymbolID sym, const Stmt *s, const LocationContext *lctx,
-                 QualType t, unsigned count, const void *symbolTag)
-      : SymbolData(SymbolConjuredKind, sym), S(s), T(t), Count(count),
+                 QualType t, unsigned count, const void *symbolTag,
+                 AllocIDType AllocID)
+      : SymbolData(SymbolConjuredKind, sym, AllocID), S(s), T(t), Count(count),
         LCtx(lctx), SymbolTag(symbolTag) {
     // FIXME: 's' might be a nullptr if we're conducting invalidation
     // that was caused by a destructor call on a temporary object,
@@ -138,8 +140,10 @@ class SymbolDerived : public SymbolData {
   const TypedValueRegion *R;
 
 public:
-  SymbolDerived(SymbolID sym, SymbolRef parent, const TypedValueRegion *r)
-      : SymbolData(SymbolDerivedKind, sym), parentSymbol(parent), R(r) {
+  SymbolDerived(SymbolID sym, SymbolRef parent, const TypedValueRegion *r,
+                AllocIDType AllocID)
+      : SymbolData(SymbolDerivedKind, sym, AllocID), parentSymbol(parent),
+        R(r) {
     assert(parent);
     assert(r);
     assert(isValidTypeForSymbol(r->getValueType()));
@@ -181,8 +185,8 @@ class SymbolExtent : public SymbolData {
   const SubRegion *R;
 
 public:
-  SymbolExtent(SymbolID sym, const SubRegion *r)
-      : SymbolData(SymbolExtentKind, sym), R(r) {
+  SymbolExtent(SymbolID sym, const SubRegion *r, AllocIDType AllocID)
+      : SymbolData(SymbolExtentKind, sym, AllocID), R(r) {
     assert(r);
   }
 
@@ -223,16 +227,17 @@ class SymbolMetadata : public SymbolData {
   const void *Tag;
 
 public:
-  SymbolMetadata(SymbolID sym, const MemRegion* r, const Stmt *s, QualType t,
-                 const LocationContext *LCtx, unsigned count, const void *tag)
-      : SymbolData(SymbolMetadataKind, sym), R(r), S(s), T(t), LCtx(LCtx),
-        Count(count), Tag(tag) {
-      assert(r);
-      assert(s);
-      assert(isValidTypeForSymbol(t));
-      assert(LCtx);
-      assert(tag);
-    }
+  SymbolMetadata(SymbolID sym, const MemRegion *r, const Stmt *s, QualType t,
+                 const LocationContext *LCtx, unsigned count, const void *tag,
+                 AllocIDType AllocID)
+      : SymbolData(SymbolMetadataKind, sym, AllocID), R(r), S(s), T(t),
+        LCtx(LCtx), Count(count), Tag(tag) {
+    assert(r);
+    assert(s);
+    assert(isValidTypeForSymbol(t));
+    assert(LCtx);
+    assert(tag);
+  }
 
     LLVM_ATTRIBUTE_RETURNS_NONNULL
     const MemRegion *getRegion() const { return R; }
@@ -287,8 +292,8 @@ class SymbolCast : public SymExpr {
   QualType ToTy;
 
 public:
-  SymbolCast(const SymExpr *In, QualType From, QualType To)
-      : SymExpr(SymbolCastKind), Operand(In), FromTy(From), ToTy(To) {
+  SymbolCast(const SymExpr *In, QualType From, QualType To, AllocIDType AllocID)
+      : SymExpr(SymbolCastKind, AllocID), Operand(In), FromTy(From), ToTy(To) {
     assert(In);
     assert(isValidTypeForSymbol(From));
     // FIXME: GenericTaintChecker creates symbols of void type.
@@ -333,8 +338,9 @@ class UnarySymExpr : public SymExpr {
   QualType T;
 
 public:
-  UnarySymExpr(const SymExpr *In, UnaryOperator::Opcode Op, QualType T)
-      : SymExpr(UnarySymExprKind), Operand(In), Op(Op), T(T) {
+  UnarySymExpr(const SymExpr *In, UnaryOperator::Opcode Op, QualType T,
+               AllocIDType AllocID)
+      : SymExpr(UnarySymExprKind, AllocID), Operand(In), Op(Op), T(T) {
     // Note, some unary operators are modeled as a binary operator. E.g. ++x is
     // modeled as x + 1.
     assert((Op == UO_Minus || Op == UO_Not) && "non-supported unary expression");
@@ -381,8 +387,9 @@ class BinarySymExpr : public SymExpr {
   QualType T;
 
 protected:
-  BinarySymExpr(Kind k, BinaryOperator::Opcode op, QualType t)
-      : SymExpr(k), Op(op), T(t) {
+  BinarySymExpr(Kind k, BinaryOperator::Opcode op, QualType t,
+                AllocIDType AllocID)
+      : SymExpr(k, AllocID), Op(op), T(t) {
     assert(classof(this));
     // Binary expressions are results of arithmetic. Pointer arithmetic is not
     // handled by binary expressions, but it is instead handled by applying
@@ -427,8 +434,8 @@ class BinarySymExprImpl : public BinarySymExpr {
 
 public:
   BinarySymExprImpl(LHSTYPE lhs, BinaryOperator::Opcode op, RHSTYPE rhs,
-                    QualType t)
-      : BinarySymExpr(ClassKind, op, t), LHS(lhs), RHS(rhs) {
+                    QualType t, AllocIDType AllocID)
+      : BinarySymExpr(ClassKind, op, t, AllocID), LHS(lhs), RHS(rhs) {
     assert(getPointer(lhs));
     assert(getPointer(rhs));
   }
diff --git a/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp b/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp
index f21e5c3ad7bd7c..35682bb6645729 100644
--- a/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp
+++ b/clang/lib/StaticAnalyzer/Core/SymbolManager.cpp
@@ -170,7 +170,8 @@ SymbolManager::getRegionValueSymbol(const TypedValueRegion* R) {
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolRegionValue(SymbolCounter, R);
+    SD = new (BPAlloc)
+        SymbolRegionValue(SymbolCounter, R, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -188,7 +189,8 @@ const SymbolConjured* SymbolManager::conjureSymbol(const Stmt *E,
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolConjured(SymbolCounter, E, LCtx, T, Count, SymbolTag);
+    SD = new (BPAlloc) SymbolConjured(SymbolCounter, E, LCtx, T, Count,
+                                      SymbolTag, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -204,7 +206,8 @@ SymbolManager::getDerivedSymbol(SymbolRef parentSymbol,
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolDerived(SymbolCounter, parentSymbol, R);
+    SD = new (BPAlloc) SymbolDerived(SymbolCounter, parentSymbol, R,
+                                     BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -219,7 +222,8 @@ SymbolManager::getExtentSymbol(const SubRegion *R) {
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolExtent(SymbolCounter, R);
+    SD = new (BPAlloc)
+        SymbolExtent(SymbolCounter, R, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -236,7 +240,8 @@ SymbolManager::getMetadataSymbol(const MemRegion* R, const Stmt *S, QualType T,
   void *InsertPos;
   SymExpr *SD = DataSet.FindNodeOrInsertPos(profile, InsertPos);
   if (!SD) {
-    SD = new (BPAlloc) SymbolMetadata(SymbolCounter, R, S, T, LCtx, Count, SymbolTag);
+    SD = new (BPAlloc) SymbolMetadata(SymbolCounter, R, S, T, LCtx, Count,
+                                      SymbolTag, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(SD, InsertPos);
     ++SymbolCounter;
   }
@@ -252,7 +257,7 @@ SymbolManager::getCastSymbol(const SymExpr *Op,
   void *InsertPos;
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
   if (!data) {
-    data = new (BPAlloc) SymbolCast(Op, From, To);
+    data = new (BPAlloc) SymbolCast(Op, From, To, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -268,7 +273,7 @@ const SymIntExpr *SymbolManager::getSymIntExpr(const SymExpr *lhs,
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
 
   if (!data) {
-    data = new (BPAlloc) SymIntExpr(lhs, op, v, t);
+    data = new (BPAlloc) SymIntExpr(lhs, op, v, t, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -284,7 +289,8 @@ const IntSymExpr *SymbolManager::getIntSymExpr(APSIntPtr lhs,
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
 
   if (!data) {
-    data = new (BPAlloc) IntSymExpr(lhs, op, rhs, t);
+    data =
+        new (BPAlloc) IntSymExpr(lhs, op, rhs, t, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -301,7 +307,8 @@ const SymSymExpr *SymbolManager::getSymSymExpr(const SymExpr *lhs,
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
 
   if (!data) {
-    data = new (BPAlloc) SymSymExpr(lhs, op, rhs, t);
+    data =
+        new (BPAlloc) SymSymExpr(lhs, op, rhs, t, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 
@@ -316,7 +323,8 @@ const UnarySymExpr *SymbolManager::getUnarySymExpr(const SymExpr *Operand,
   void *InsertPos;
   SymExpr *data = DataSet.FindNodeOrInsertPos(ID, InsertPos);
   if (!data) {
-    data = new (BPAlloc) UnarySymExpr(Operand, Opc, T);
+    data = new (BPAlloc)
+        UnarySymExpr(Operand, Opc, T, BPAlloc.getBytesAllocated());
     DataSet.InsertNode(data, InsertPos);
   }
 

@necto
Copy link
Contributor Author

necto commented Dec 30, 2024

Hey, @steakhal, @NagyDonat
I am working caching Z3 refutation query outcomes to reduce the number of flaky issues, and constraint order is important to maximize the cache-hit rate. Turns out, global constraint order is beneficial regardless query cache. Would you take a look?

BTW, a couple of points on the design of the change:

  • I considered using simple incremental ids instead of allocator offsets, but found no advantage. Do you see any?
  • Would it make sense to unify SymbolData::Sym and SymExpr::AllocID, and maybe get rid of the intermediary SymbolData?

@steakhal
Copy link
Contributor

Disclaimer: I haven't checked the actual patch, but I'll come back to it :D Maybe next year.
I think I've seen already a variant of this downstream and I generally agreed with the vision. I don't expect much friction on this front, but I'll have a deeper look.


This patch reduced the run-to-run churn (flakiness) in SE issues from 80-90 to 30-40 (out of 78K) in our CSA deployment (in our setting flaky issues are mostly due to Z3 refutation instability).

Does the unstable constraint order only affect Z3 refutation, or did you measure/notice differences in report flakyness even without Z3 refutation? I think it would be great to see the impact of this change, while we eliminate the Z3 from the equation.


I considered using simple incremental ids instead of allocator offsets, but found no advantage. Do you see any?

I'm okay with using allocator offsets for this purpose. That's already a global state, I don't think having another one would really bring much benefit (aka. having a separate global counter for them).

Would it make sense to unify SymbolData::Sym and SymExpr::AllocID, and maybe get rid of the intermediary SymbolData?

I think SymbolData::Sym and SymExpr::AllocID could be unified. The SymbolData abstraction is useful in general and allows us to differentiate the "source" of the different symbols we have identities for (aka. they are the "interesting" symbols). These are always the "leaves" of the SymExpr trees, so that could be another useful way of looking at SymbolData - once you see that, you won't traverse any deeper.

Copy link
Contributor

@steakhal steakhal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The content look narrow and precise.
It raises me questions why is the only user of this new field the Constraint manager?
And if the constraint manager is the only user, why is the SMT-based solver not using this? Should all the other immutableSet/Map benefit from this stability when the key is a SymbolRef?
Why can't the default Profile changed for SymExpr to automatically "opt-in" this new ordering for the rest of the foldingset-user datastructures, such as DenseSet/Map too?

Comment on lines +404 to +419
struct ConstraintKVInfo : llvm::ImutKeyValueInfo<SymbolRef, RangeSet> {
static inline bool isEqual(key_type_ref L, key_type_ref R) {
return L->getAllocID() == R->getAllocID();
}

static inline bool isLess(key_type_ref L, key_type_ref R) {
return L->getAllocID() < R->getAllocID();
}

static inline void Profile(llvm::FoldingSetNodeID &ID, value_type_ref V) {
ID.AddInteger(V.first->getAllocID());
ID.Add(V.second);
}
};

using ConstraintMap = llvm::ImmutableMap<SymbolRef, RangeSet, ConstraintKVInfo>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this ordering be a more sensible ordering for all containers where SymbolRefs are the keys?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should. I am preparing a superceeding PR with this ordering applied universally, as ImutContainerInfo specialization, and with AllocID merged with SymbolID. I'll open it here once I have evaluation results

Comment on lines +437 to +438
QualType t, AllocIDType AllocID)
: BinarySymExpr(ClassKind, op, t, AllocID), LHS(lhs), RHS(rhs) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In all of these SymExprs, we now have an AllocID - that is unique to the object we allocated.
So, why couldn't we just simplify all of the Profile functions of these with a unified impl at the SymExpr level, just "hashing" that AllocID only.

I feel like I'm coming back to this question all the time, and if you have something for it then we could formalize a comment at the Profile and/or at the AllocID data field to explain this "redundancy".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason is the same as for SymbolData::Sym that it also a unique single integer identifying a SymbolData: it is not available until you allocate the symbol. The only use of SymExpr::Profile I observed is for SymbolManager to avoid reallocating identical SymExprs, for which it keeps track of all SymExprs allocated and allocates a new one only when necessary. But for that it needs to identify identical SymExprs before they are allocated.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright. That makes sense :D Could you leave a comment about this somewhere?

@necto
Copy link
Contributor Author

necto commented Jan 2, 2025

Does the unstable constraint order only affect Z3 refutation, or did you measure/notice differences in report flakyness even without Z3 refutation? I think it would be great to see the impact of this change, while we eliminate the Z3 from the equation.

I've measured it now, and it seems that it affects the execution flow of issues (from 40-80 updated issues to 10-15 out of 80 K). It doesn't seem to have much effect on the appearing/disappearing issues (although without Z3 refutation there are only ~10 out of 80 K).

It raises me questions why is the only user of this new field the Constraint manager?
And if the constraint manager is the only user, why is the SMT-based solver not using this? Should all the other immutableSet/Map benefit from this stability when the key is a SymbolRef?

In an upcoming superceeding PR I generalized it for all structures using ImutContainerInfo, but SMT-based solver is using this anyway because it uses ConstraintsMap type that I update in this PR.

Why can't the default Profile changed for SymExpr to automatically "opt-in" this new ordering for the rest of the foldingset-user datastructures, such as DenseSet/Map too?

Symbol*::Profile alone can't help here, you need to use it explicitly in ImutContainerInfo. However, I failed to implement stable Profile particularly because of the QualTypes used in some of SymExprs: they do not have stable Profile implementation and rely on pointers. I am hesitant to go there as that will be affecting Clang globally and not only CSA.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clang:static analyzer clang Clang issues not falling into any other category
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants