[LV][VPlan] Add initial support for CSA vectorization #106560

michaelmaitland · 2024-08-29T14:15:49Z

This patch adds initial support for CSA vectorization LLVM. This new class
can be characterized by vectorization of assignment to a scalar in a loop,
such that the assignment is conditional from the perspective of its use.
An assignment is conditional in a loop if a value may or may not be assigned
in the loop body.

For example:

int t = init_val;
for (int i = 0; i < N; i++) {
  if (cond[i]) 
    t = a[i];
}
s = t; // use t

Using pseudo-LLVM code this can be vectorized as

vector.ph:
  ...
  %t = %init_val
  %init.mask = <all-false-vec>
  %init.data = <poison-vec> ; uninitialized
vector.body:
  ...
  %mask.phi = phi [%init.mask, %vector.ph], [%new.mask, %vector.body]
  %data.phi = phi [%data.mask, %vector.ph], [%new.mask, %vector.body]
  %cond.vec = <widened-cmp> ...
  %a.vec    = <widened-load> %a, %i
  %b        = <any-lane-active> %cond.vec
  %new.mask = select %b, %cond.vec, %mask.phi
  %new.data = select %b, %a.vec, %data.phi
  ...
middle.block:
  %s = <extract-last-active-lane> %new.mask, %new.data

On each iteration, we track whether any lane in the widened condition was active,
and if it was take the current mask and data as the new mask and data vector.
Then at the end of the loop, the scalar can be extracted only once.

This transformation works the same way for integer, pointer, and floating point
conditional assignment, since the transformation does not require inspection
of the data being assigned.

In the vectorization of a CSA, we will be introducing recipes into the vector
preheader, the vector body, and the middle block. Recipes that are introduced
into the preheader and middle block are executed only one time, and recipes
that are in the vector body will be possibly executed multiple times. The more
times that the vector body is executed, the less of an impact the preheader
and middle block cost have on the overall cost of a CSA.

A detailed explanation of the concept can be found here.

This patch is further tested in llvm/llvm-test-suite#155.

llvmbot · 2024-08-29T14:16:23Z

@llvm/pr-subscribers-backend-risc-v

@llvm/pr-subscribers-llvm-transforms

Author: Michael Maitland (michaelmaitland)

Changes

This PR contains a series of commits which together implement an initial version of conditional scalar vectorization. I will edit this description with a link to the RFC which gives a more in depth explanation of the changes.

This patch is further tested in llvm/llvm-test-suite#155.

Patch is 331.04 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/106560.diff

19 Files Affected:

(added) llvm/include/llvm/Analysis/CSADescriptors.h (+78)
(modified) llvm/include/llvm/Analysis/TargetTransformInfo.h (+9)
(modified) llvm/include/llvm/Analysis/TargetTransformInfoImpl.h (+2)
(modified) llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h (+18)
(modified) llvm/lib/Analysis/CMakeLists.txt (+1)
(added) llvm/lib/Analysis/CSADescriptors.cpp (+73)
(modified) llvm/lib/Analysis/TargetTransformInfo.cpp (+4)
(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp (+5)
(modified) llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h (+4)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp (+30-4)
(modified) llvm/lib/Transforms/Vectorize/LoopVectorize.cpp (+203-10)
(modified) llvm/lib/Transforms/Vectorize/VPlan.cpp (+5-2)
(modified) llvm/lib/Transforms/Vectorize/VPlan.h (+195)
(modified) llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp (+367-2)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp (+49)
(modified) llvm/lib/Transforms/Vectorize/VPlanTransforms.h (+9)
(modified) llvm/lib/Transforms/Vectorize/VPlanValue.h (+3)
(modified) llvm/lib/Transforms/Vectorize/VPlanVerifier.cpp (+3-3)
(added) llvm/test/Transforms/LoopVectorize/RISCV/csa.ll (+4234)

diff --git a/llvm/include/llvm/Analysis/CSADescriptors.h b/llvm/include/llvm/Analysis/CSADescriptors.h
new file mode 100644
index 00000000000000..3f95b3484d1e22
--- /dev/null
+++ b/llvm/include/llvm/Analysis/CSADescriptors.h
@@ -0,0 +1,78 @@
+//===- llvm/Analysis/CSADescriptors.h - CSA Descriptors --*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file "describes" conditional scalar assignments (CSA).
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/LoopInfo.h"
+#include "llvm/IR/Instructions.h"
+#include "llvm/IR/Value.h"
+
+#ifndef LLVM_ANALYSIS_SIFIVECSADESCRIPTORS_H
+#define LLVM_ANALYSIS_SIFIVECSADESCRIPTORS_H
+
+namespace llvm {
+
+/// A Conditional Scalar Assignment (CSA) is an assignment from an initial
+/// scalar that may or may not occur.
+class CSADescriptor {
+  /// If the conditional assignment occurs inside a loop, then Phi chooses
+  /// the value of the assignment from the entry block or the loop body block.
+  PHINode *Phi = nullptr;
+
+  /// The initial value of the CSA. If the condition guarding the assignment is
+  /// not met, then the assignment retains this value.
+  Value *InitScalar = nullptr;
+
+  /// The Instruction that conditionally assigned to inside the loop.
+  Instruction *Assignment = nullptr;
+
+  /// Create a CSA Descriptor that models an invalid CSA.
+  CSADescriptor() = default;
+
+  /// Create a CSA Descriptor that models a valid CSA with its members
+  /// initialized correctly.
+  CSADescriptor(PHINode *Phi, Instruction *Assignment, Value *InitScalar)
+      : Phi(Phi), InitScalar(InitScalar), Assignment(Assignment) {}
+
+public:
+  /// If Phi is the root of a CSA, return the CSADescriptor of the CSA rooted by
+  /// Phi. Otherwise, return a CSADescriptor with IsValidCSA set to false.
+  static CSADescriptor isCSAPhi(PHINode *Phi, Loop *TheLoop);
+
+  operator bool() const { return isValid(); }
+
+  /// Returns whether SI is the Assignment in CSA
+  static bool isCSASelect(CSADescriptor Desc, SelectInst *SI) {
+    return Desc.getAssignment() == SI;
+  }
+
+  /// Return whether this CSADescriptor models a valid CSA.
+  bool isValid() const { return Phi && InitScalar && Assignment; }
+
+  /// Return the PHI that roots this CSA.
+  PHINode *getPhi() const { return Phi; }
+
+  /// Return the initial value of the CSA. This is the value if the conditional
+  /// assignment does not occur.
+  Value *getInitScalar() const { return InitScalar; }
+
+  /// The Instruction that is used after the loop
+  Instruction *getAssignment() const { return Assignment; }
+
+  /// Return the condition that this CSA is conditional upon.
+  Value *getCond() const {
+    if (auto *SI = dyn_cast_or_null<SelectInst>(Assignment))
+      return SI->getCondition();
+    return nullptr;
+  }
+};
+} // namespace llvm
+
+#endif // LLVM_ANALYSIS_CSADESCRIPTORS_H
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfo.h b/llvm/include/llvm/Analysis/TargetTransformInfo.h
index b2124c6106198e..d0192f7d90a812 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfo.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfo.h
@@ -1767,6 +1767,10 @@ class TargetTransformInfo {
         : EVLParamStrategy(EVLParamStrategy), OpStrategy(OpStrategy) {}
   };
 
+  /// \returns true if the loop vectorizer should vectorize conditional
+  /// scalar assignments for the target.
+  bool enableCSAVectorization() const;
+
   /// \returns How the target needs this vector-predicated operation to be
   /// transformed.
   VPLegalization getVPLegalizationStrategy(const VPIntrinsic &PI) const;
@@ -2175,6 +2179,7 @@ class TargetTransformInfo::Concept {
   virtual bool supportsScalableVectors() const = 0;
   virtual bool hasActiveVectorLength(unsigned Opcode, Type *DataType,
                                      Align Alignment) const = 0;
+  virtual bool enableCSAVectorization() const = 0;
   virtual VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const = 0;
   virtual bool hasArmWideBranch(bool Thumb) const = 0;
@@ -2940,6 +2945,10 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
     return Impl.hasActiveVectorLength(Opcode, DataType, Alignment);
   }
 
+  bool enableCSAVectorization() const override {
+    return Impl.enableCSAVectorization();
+  }
+
   VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const override {
     return Impl.getVPLegalizationStrategy(PI);
diff --git a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
index 11b07ac0b7fc47..dbf0cf888e168a 100644
--- a/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
+++ b/llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
@@ -956,6 +956,8 @@ class TargetTransformInfoImplBase {
     return false;
   }
 
+  bool enableCSAVectorization() const { return false; }
+
   TargetTransformInfo::VPLegalization
   getVPLegalizationStrategy(const VPIntrinsic &PI) const {
     return TargetTransformInfo::VPLegalization(
diff --git a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
index 0f4d1355dd2bfe..7ef29a8cb36e49 100644
--- a/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
+++ b/llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h
@@ -27,6 +27,7 @@
 #define LLVM_TRANSFORMS_VECTORIZE_LOOPVECTORIZATIONLEGALITY_H
 
 #include "llvm/ADT/MapVector.h"
+#include "llvm/Analysis/CSADescriptors.h"
 #include "llvm/Analysis/LoopAccessAnalysis.h"
 #include "llvm/Support/TypeSize.h"
 #include "llvm/Transforms/Utils/LoopUtils.h"
@@ -257,6 +258,10 @@ class LoopVectorizationLegality {
   /// induction descriptor.
   using InductionList = MapVector<PHINode *, InductionDescriptor>;
 
+  /// CSAList contains the CSA descriptors for all the CSAs that were found
+  /// in the loop, rooted by their phis.
+  using CSAList = MapVector<PHINode *, CSADescriptor>;
+
   /// RecurrenceSet contains the phi nodes that are recurrences other than
   /// inductions and reductions.
   using RecurrenceSet = SmallPtrSet<const PHINode *, 8>;
@@ -309,6 +314,12 @@ class LoopVectorizationLegality {
   /// Returns True if V is a Phi node of an induction variable in this loop.
   bool isInductionPhi(const Value *V) const;
 
+  /// Returns the CSAs found in the loop.
+  const CSAList& getCSAs() const { return CSAs; }
+
+  /// Returns true if Phi is the root of a CSA in the loop.
+  bool isCSAPhi(PHINode *Phi) const { return CSAs.count(Phi) != 0; }
+
   /// Returns a pointer to the induction descriptor, if \p Phi is an integer or
   /// floating point induction.
   const InductionDescriptor *getIntOrFpInductionDescriptor(PHINode *Phi) const;
@@ -463,6 +474,10 @@ class LoopVectorizationLegality {
   void addInductionPhi(PHINode *Phi, const InductionDescriptor &ID,
                        SmallPtrSetImpl<Value *> &AllowedExit);
 
+  // Updates the vetorization state by adding \p Phi to the CSA list.
+  void addCSAPhi(PHINode *Phi, const CSADescriptor &CSADesc,
+                 SmallPtrSetImpl<Value *> &AllowedExit);
+
   /// The loop that we evaluate.
   Loop *TheLoop;
 
@@ -507,6 +522,9 @@ class LoopVectorizationLegality {
   /// variables can be pointers.
   InductionList Inductions;
 
+  /// Holds the conditional scalar assignments
+  CSAList CSAs;
+
   /// Holds all the casts that participate in the update chain of the induction
   /// variables, and that have been proven to be redundant (possibly under a
   /// runtime guard). These casts can be ignored when creating the vectorized
diff --git a/llvm/lib/Analysis/CMakeLists.txt b/llvm/lib/Analysis/CMakeLists.txt
index 393803fad89383..24ca426990d9ed 100644
--- a/llvm/lib/Analysis/CMakeLists.txt
+++ b/llvm/lib/Analysis/CMakeLists.txt
@@ -46,6 +46,7 @@ add_llvm_component_library(LLVMAnalysis
   CostModel.cpp
   CodeMetrics.cpp
   ConstantFolding.cpp
+  CSADescriptors.cpp
   CtxProfAnalysis.cpp
   CycleAnalysis.cpp
   DDG.cpp
diff --git a/llvm/lib/Analysis/CSADescriptors.cpp b/llvm/lib/Analysis/CSADescriptors.cpp
new file mode 100644
index 00000000000000..d0377c8c16de33
--- /dev/null
+++ b/llvm/lib/Analysis/CSADescriptors.cpp
@@ -0,0 +1,73 @@
+//=== llvm/Analysis/CSADescriptors.cpp - CSA Descriptors -*- C++ -*-===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+//
+// This file "describes" conditional scalar assignments (CSA).
+//
+//===----------------------------------------------------------------------===//
+
+#include "llvm/Analysis/CSADescriptors.h"
+#include "llvm/IR/PatternMatch.h"
+#include "llvm/IR/Type.h"
+
+using namespace llvm;
+using namespace llvm::PatternMatch;
+
+#define DEBUG_TYPE "csa-descriptors"
+
+CSADescriptor CSADescriptor::isCSAPhi(PHINode *Phi, Loop *TheLoop) {
+  // Return CSADescriptor that describes a CSA that matches one of these
+  // patterns:
+  //   phi loop_inv, (select cmp, value, phi)
+  //   phi loop_inv, (select cmp, phi, value)
+  //   phi (select cmp, value, phi), loop_inv
+  //   phi (select cmp, phi, value), loop_inv
+  // If the CSA does not match any of these paterns, return a CSADescriptor
+  // that describes an InvalidCSA.
+
+  // Must be a scalar
+  Type *Type = Phi->getType();
+  if (!Type->isIntegerTy() && !Type->isFloatingPointTy() &&
+      !Type->isPointerTy())
+    return CSADescriptor();
+
+  // Match phi loop_inv, (select cmp, value, phi)
+  //    or phi loop_inv, (select cmp, phi, value)
+  //    or phi (select cmp, value, phi), loop_inv
+  //    or phi (select cmp, phi, value), loop_inv
+  if (Phi->getNumIncomingValues() != 2)
+    return CSADescriptor();
+  auto SelectInstIt = find_if(Phi->incoming_values(), [&Phi](Use &U) {
+    return match(U.get(), m_Select(m_Value(), m_Specific(Phi), m_Value())) ||
+           match(U.get(), m_Select(m_Value(), m_Value(), m_Specific(Phi)));
+  });
+  if (SelectInstIt == Phi->incoming_values().end())
+    return CSADescriptor();
+  auto LoopInvIt = find_if(Phi->incoming_values(), [&](Use &U) {
+    return U.get() != *SelectInstIt && TheLoop->isLoopInvariant(U.get());
+  });
+  if (LoopInvIt == Phi->incoming_values().end())
+    return CSADescriptor();
+
+  // Phi or Sel must be used only outside the loop,
+  // excluding if Phi use Sel or Sel use Phi
+  auto IsOnlyUsedOutsideLoop = [=](Value *V, Value *Ignore) {
+    return all_of(V->users(), [Ignore, TheLoop](User *U) {
+      if (U == Ignore)
+        return true;
+      if (auto *I = dyn_cast<Instruction>(U))
+        return !TheLoop->contains(I);
+      return true;
+    });
+  };
+  auto *Sel = cast<SelectInst>(SelectInstIt->get());
+  auto *LoopInv = LoopInvIt->get();
+  if (!IsOnlyUsedOutsideLoop(Phi, Sel) || !IsOnlyUsedOutsideLoop(Sel, Phi))
+    return CSADescriptor();
+
+  return CSADescriptor(Phi, Sel, LoopInv);
+}
diff --git a/llvm/lib/Analysis/TargetTransformInfo.cpp b/llvm/lib/Analysis/TargetTransformInfo.cpp
index 2c26493bd3f1ca..4f882475b74e74 100644
--- a/llvm/lib/Analysis/TargetTransformInfo.cpp
+++ b/llvm/lib/Analysis/TargetTransformInfo.cpp
@@ -1304,6 +1304,10 @@ bool TargetTransformInfo::preferEpilogueVectorization() const {
   return TTIImpl->preferEpilogueVectorization();
 }
 
+bool TargetTransformInfo::enableCSAVectorization() const {
+  return TTIImpl->enableCSAVectorization();
+}
+
 TargetTransformInfo::VPLegalization
 TargetTransformInfo::getVPLegalizationStrategy(const VPIntrinsic &VPI) const {
   return TTIImpl->getVPLegalizationStrategy(VPI);
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
index 537c62bb0aacd1..b76f2fc72c6f47 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp
@@ -1985,6 +1985,11 @@ bool RISCVTTIImpl::isLSRCostLess(const TargetTransformInfo::LSRCost &C1,
                   C2.ScaleCost, C2.ImmCost, C2.SetupCost);
 }
 
+bool RISCVTTIImpl::enableCSAVectorization() const {
+  return ST->hasVInstructions() &&
+         ST->getProcFamily() == RISCVSubtarget::SiFive7;
+}
+
 bool RISCVTTIImpl::isLegalMaskedCompressStore(Type *DataTy, Align Alignment) {
   auto *VTy = dyn_cast<VectorType>(DataTy);
   if (!VTy || VTy->isScalableTy())
diff --git a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
index cc69e1d118b5a1..17245150ec10ae 100644
--- a/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
+++ b/llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
@@ -287,6 +287,10 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {
     return TLI->isVScaleKnownToBeAPowerOfTwo();
   }
 
+  /// \returns true if the loop vectorizer should vectorize conditional
+  /// scalar assignments for the target.
+  bool enableCSAVectorization() const;
+
   /// \returns How the target needs this vector-predicated operation to be
   /// transformed.
   TargetTransformInfo::VPLegalization
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
index 66a779da8c25bc..9633ba9cc70ee9 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp
@@ -79,6 +79,10 @@ static cl::opt<LoopVectorizeHints::ScalableForceKind>
                 "Scalable vectorization is available and favored when the "
                 "cost is inconclusive.")));
 
+static cl::opt<bool>
+    EnableCSA("enable-csa-vectorization", cl::init(false), cl::Hidden,
+              cl::desc("Control whether CSA loop vectorization is enabled"));
+
 /// Maximum vectorization interleave count.
 static const unsigned MaxInterleaveFactor = 16;
 
@@ -749,6 +753,15 @@ bool LoopVectorizationLegality::setupOuterLoopInductions() {
   return llvm::all_of(Header->phis(), IsSupportedPhi);
 }
 
+void LoopVectorizationLegality::addCSAPhi(
+    PHINode *Phi, const CSADescriptor &CSADesc,
+    SmallPtrSetImpl<Value *> &AllowedExit) {
+  assert(CSADesc.isValid() && "Expected Valid CSADescriptor");
+  LLVM_DEBUG(dbgs() << "LV: found legal CSA opportunity" << *Phi << "\n");
+  AllowedExit.insert(Phi);
+  CSAs.insert({Phi, CSADesc});
+}
+
 /// Checks if a function is scalarizable according to the TLI, in
 /// the sense that it should be vectorized and then expanded in
 /// multiple scalar calls. This is represented in the
@@ -866,14 +879,23 @@ bool LoopVectorizationLegality::canVectorizeInstrs() {
           continue;
         }
 
-        // As a last resort, coerce the PHI to a AddRec expression
-        // and re-try classifying it a an induction PHI.
+        // Try to coerce the PHI to a AddRec expression and re-try classifying
+        // it a an induction PHI.
         if (InductionDescriptor::isInductionPHI(Phi, TheLoop, PSE, ID, true) &&
             !IsDisallowedStridedPointerInduction(ID)) {
           addInductionPhi(Phi, ID, AllowedExit);
           continue;
         }
 
+        // Check if the PHI can be classified as a CSA PHI.
+        if (EnableCSA || (TTI->enableCSAVectorization() &&
+                          EnableCSA.getNumOccurrences() == 0)) {
+          if (auto CSADesc = CSADescriptor::isCSAPhi(Phi, TheLoop)) {
+            addCSAPhi(Phi, CSADesc, AllowedExit);
+            continue;
+          }
+        }
+
         reportVectorizationFailure("Found an unidentified PHI",
             "value that could not be identified as "
             "reduction is used outside the loop",
@@ -1555,11 +1577,15 @@ bool LoopVectorizationLegality::canFoldTailByMasking() const {
   for (const auto &Reduction : getReductionVars())
     ReductionLiveOuts.insert(Reduction.second.getLoopExitInstr());
 
+  SmallPtrSet<const Value *, 8> CSALiveOuts;
+  for (const auto &CSA: getCSAs())
+    CSALiveOuts.insert(CSA.second.getAssignment());
+
   // TODO: handle non-reduction outside users when tail is folded by masking.
   for (auto *AE : AllowedExit) {
     // Check that all users of allowed exit values are inside the loop or
-    // are the live-out of a reduction.
-    if (ReductionLiveOuts.count(AE))
+    // are the live-out of a reduction or a CSA
+    if (ReductionLiveOuts.count(AE) || CSALiveOuts.count(AE))
       continue;
     for (User *U : AE->users()) {
       Instruction *UI = cast<Instruction>(U);
diff --git a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
index 56f51e14a6eba9..5e45f500482826 100644
--- a/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
+++ b/llvm/lib/Transforms/Vectorize/LoopVectorize.cpp
@@ -180,6 +180,8 @@ const char LLVMLoopVectorizeFollowupEpilogue[] =
 STATISTIC(LoopsVectorized, "Number of loops vectorized");
 STATISTIC(LoopsAnalyzed, "Number of loops analyzed for vectorization");
 STATISTIC(LoopsEpilogueVectorized, "Number of epilogues vectorized");
+STATISTIC(CSAsVectorized,
+          "Number of conditional scalar assignments vectorized");
 
 static cl::opt<bool> EnableEpilogueVectorization(
     "enable-epilogue-vectorization", cl::init(true), cl::Hidden,
@@ -500,6 +502,10 @@ class InnerLoopVectorizer {
   virtual std::pair<BasicBlock *, Value *>
   createVectorizedLoopSkeleton(const SCEV2ValueTy &ExpandedSCEVs);
 
+  /// For all vectorized CSAs, replace uses of live-out scalar from the orignal
+  /// loop with the extracted scalar from the vector loop for.
+  void fixCSALiveOuts(VPTransformState &State, VPlan &Plan);
+
   /// Fix the vectorized code, taking care of header phi's, live-outs, and more.
   void fixVectorizedLoop(VPTransformState &State, VPlan &Plan);
 
@@ -2932,6 +2938,25 @@ LoopVectorizationCostModel::getVectorIntrinsicCost(CallInst *CI,
                                    TargetTransformInfo::TCK_RecipThroughput);
 }
 
+void InnerLoopVectorizer::fixCSALiveOuts(VPTransformState &State, VPlan &Plan) {
+  for (const auto &CSA: Plan.getCSAStates()) {
+    VPCSADataUpdateRecipe *VPDataUpdate = CSA.second->getDataUpdate();
+    assert(VPDataUpdate &&
+           "VPDataUpdate must have been introduced prior to fixing live outs");
+    Value *V = VPDataUpdate->getUnderlyingValue();
+    Value *ExtractedScalar = State.get(CSA.second->getExtractScalarRecipe(), 0,
+                                       /*NeedsScalar=*/true);
+    // Fix LCSSAPhis
+    llvm::SmallPtrSet<PHINode *, 2> ToFix;
+    for (User *U : V->users())
+      if (auto *Phi = dyn_cast<PHINode>(U);
+          Phi && Phi->getParent() == LoopExitBlock)
+        ToFix.insert(Phi);
+    for (PHINode *Phi : ToFix)
+      Phi->addIncoming(ExtractedScalar, LoopMiddleBlock);
+  }
+}
+
 void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                                             VPlan &Plan) {
   // Fix widened non-induction PHIs by setting up the PHI operands.
@@ -2972,6 +2997,8 @@ void InnerLoopVectorizer::fixVectorizedLoop(VPTransformState &State,
                    getOrCreateVectorTripCount(VectorLoop->getLoopPreheader()),
                    IVEndValues[Entry.first], LoopMiddleBlock,
                    VectorLoop->getHeader(), Plan, State);
+
+    fixCSALiveOuts(State, Plan);
   }
 
   // Fix live-out phis not already fixed earlier.
@@ -4110,7 +4137,6 @@ LoopVectorizationCostModel::computeMaxVF(ElementCount UserVF, unsigned UserIC) {
   // found modulo the vectorization factor is not zero, try to fold the tail
   // by masking.
   // FIXME: look for a smaller MaxVF that does divide TC rather than masking.
-  setTailFoldingStyles(MaxFactors.ScalableVF.isScalable(), UserIC);
   if (foldTailByMasking()) {
     if (getTailFoldingStyle() == TailFoldingStyle::DataWithEVL) {
       LLVM_DEBUG(
@@ -4482,6 +4508,9 @@ static bool willGenerateVectors(VPlan &Plan, ElementCount VF,
       case VPDef::VPEVLBasedIVPHISC:
       case VPDef::VPPredInstPHISC:
       case VPDef::VPBranchOnMaskSC:
+      case VPRecipeBase::VPCSADataUpdateSC:
+      case VPRecipeBase::VPCSAExtractScalarSC:
+      case VPRecipeBase::VPCSAHe...
[truncated]

github-actions · 2024-08-29T14:19:41Z

✅ With the latest revision this PR passed the C/C++ code formatter.

artagnon

I've done a very rough initial review, to get the ball rolling. Can you kindly look at iv-select-cmp.ll (and associated tests) to increase test coverage to include negative tests as well? It would be nice if this patch created changes in iv-select-cmp.ll, although I'll need to review in more detail to find out why.

llvm/test/Transforms/LoopVectorize/RISCV/csa.ll

llvm/lib/Transforms/Vectorize/VPlan.h

llvm/include/llvm/Analysis/CSADescriptors.h

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

artagnon · 2024-08-30T15:47:30Z

Another high-level question: is it possible to not nest the test under RISCV? I realized that you've said it's disabled in BasicTTIImpl, but having a separate cas.ll under each target wouldn't be ideal, no?

michaelmaitland

@artagnon thanks so much for the initial review!

I've addressed comments and made some responses :)

llvm/lib/Target/RISCV/RISCVTargetTransformInfo.cpp

llvm/lib/Transforms/Vectorize/LoopVectorizationLegality.cpp

llvm/test/Transforms/LoopVectorize/RISCV/csa.ll

michaelmaitland · 2024-08-30T15:58:55Z

It would be nice if this patch created changes in iv-select-cmp.ll, although I'll need to review in more detail to find out why.

CSA vectorization is not enabled by default.

michaelmaitland · 2024-08-30T16:04:43Z

is it possible to not nest the test under RISCV? I realized that you've said it's disabled in BasicTTIImpl, but having a separate cas.ll under each target wouldn't be ideal, no?

I can do this. It's a good idea.

artagnon · 2024-08-30T16:13:24Z

This PR contains a series of commits which together implement an initial version of conditional scalar vectorization.

A detailed explanation of the changes can be found here.

Kindly note that the only functionality is squash-and-merge, and this will be the final commit message: "series of commits" doesn't make sense in the final commit message. Also, it would be nice to have a short explanation in the commit message itself (possibly with a few examples). Finally, kindly ensure that markdown doesn't end up when someone does a git log: it's plain-text.

artagnon · 2024-08-30T16:22:13Z

is it possible to not nest the test under RISCV? I realized that you've said it's disabled in BasicTTIImpl, but having a separate cas.ll under each target wouldn't be ideal, no?

I can do this. It's a good idea.

Perhaps a good idea to name the test conditional-scalar-assignment.ll instead of the more cryptic csa.ll.

michaelmaitland · 2024-08-30T17:58:46Z

I have updated the tests to use -NOT: vector.body for the cases that cannot be vectorized
I have also moved tests up a directory. I had to keep a copy of some in the RISCV target directory since they depend on TTI specific to RISC-V (i.e. pointer size).

artagnon

I did a closer inspection of the patch, and found mostly small things to fix. Otherwise, I'm a fan of the approach you've taken: the code is quite easy to understand, and there is little bloat. Thanks again for the patch, and I hope the other reviewers also have similar sentiments!

On a minor note, it might be helpful if you could locally enable CSA vectorization, and post the diff of the test changes somewhere, just to make sure we're doing the right thing.

llvm/include/llvm/Analysis/CSADescriptors.h

llvm/lib/Analysis/CSADescriptors.cpp

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

llvm/lib/Transforms/Vectorize/VPlanUtils.cpp

llvm/test/Transforms/LoopVectorize/conditional-scalar-assignment.ll

The docstring of this function: > Returns true if this VPInstruction's operands are single scalars and the > result is also a single scalar. ExplicitVectorLength fits this description. I don't have a test for it now, but it is needed by llvm#106560.

michaelmaitland · 2024-11-12T15:32:38Z

rebase; ping

As discussed in #112738, it may be better to have an intrinsic to represent vector element extracts based on mask bits. This intrinsic is for the case of extracting the last active element, if any, or a default value if the mask is all-false. The target-agnostic SelectionDAG lowering is similar to the IR in #106560.

As discussed in llvm#112738, it may be better to have an intrinsic to represent vector element extracts based on mask bits. This intrinsic is for the case of extracting the last active element, if any, or a default value if the mask is all-false. The target-agnostic SelectionDAG lowering is similar to the IR in llvm#106560.

michaelmaitland · 2024-11-21T19:11:06Z

ping

michaelmaitland · 2024-12-02T14:56:48Z

ping!

michaelmaitland · 2024-12-06T17:50:02Z

rebased

michaelmaitland · 2024-12-09T14:41:07Z

@fhahn @ayalz ping

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

michaelmaitland · 2024-12-16T15:42:45Z

ping! @fhahn @ayalz

fhahn

I added some comments inline.

On the high level, it would be good to have a test that checks the printed VPlan, to see things on a higher level.

Supporting both EVL and non-EVL in a single patch makes the patch quite big and I think it would help to separate them and first start with the simpler case

fhahn · 2024-12-22T21:13:26Z

llvm/include/llvm/Transforms/Vectorize/LoopVectorizationLegality.h

@@ -550,6 +560,10 @@ class LoopVectorizationLegality {
  void addInductionPhi(PHINode *Phi, const InductionDescriptor &ID,
                       SmallPtrSetImpl<Value *> &AllowedExit);

+  // Updates the vetorization state by adding \p Phi to the CSA list.


Suggested change

// Updates the vetorization state by adding \p Phi to the CSA list.

/// Updates the vetorization state by adding \p Phi to the CSA list.

fhahn · 2024-12-22T21:13:52Z

llvm/include/llvm/Analysis/IVDescriptors.h

@@ -423,6 +424,61 @@ class InductionDescriptor {
  SmallVector<Instruction *, 2> RedundantCasts;
 };

+/// A Conditional Scalar Assignment (CSA) is an assignment from an initial
+/// scalar that may or may not occur.
+class CSADescriptor {


I don't think CAS is a very common term, would be good to have a more descriptive name if possible

Intel has used the term conditional scalar assingmnet. I have abbreviated it as CSA for short. I have documented the acronym in the code in this patch in multiple places

/// A Conditional Scalar Assignment (CSA) is an assignment from an initial
/// scalar that may or may not occur.

// This file "describes" induction, recurrence, and conditional scalar
// assignment (CSA) variables.

STATISTIC(CSAsVectorized,
"Number of conditional scalar assignments vectorized");

I thought that ConditionalScalarAssignmentDescriptor, createConditionalScalarAssignmentMaskPhi, and VPConditionalScalarAssignmentDescriptorExtractScalarRecipe were quite long for example.

Do you have any suggestion on what you'd like it to be named? Is expanding CSA to ConditionalScalarAssignment everywhere your preference?

For now, I've tried to be proactive in ab17128.

fhahn · 2024-12-22T21:14:18Z

llvm/lib/Analysis/IVDescriptors.cpp

+/// that describes an InvalidCSA.
+bool CSADescriptor::isCSAPhi(PHINode *Phi, Loop *TheLoop, CSADescriptor &CSA) {
+
+  // Must be a scalar


Suggested change

// Must be a scalar

// Must be a scalar.

fhahn · 2024-12-22T21:14:37Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

+  // preheader and middle block. It also contains recipes that are not backed by
+  // underlying instructions in the original loop. This makes it difficult to
+  // model in the legacy cost model.
+  if (!Legal.getCSAs().empty())


Better check for the CAS recipes?

What is the reason to walk all recipes in the plan looking for CSA recipes when we can do this check in O(1) like this?

fhahn · 2024-12-22T21:15:01Z

llvm/lib/Transforms/Vectorize/LoopVectorize.cpp

@@ -8998,6 +9108,16 @@ static SetVector<VPIRInstruction *> collectUsersInExitBlocks(
          if (ExitVPBB->getSinglePredecessor() == MiddleVPBB)
            continue;
        }
+        // Exit values for CSAs are computed and updated outside of VPlan and


I am missing where the exit values are computed outside of VPlan. What's preventing them from being competed in VPlan?

fhahn · 2024-12-22T21:15:18Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+
+    Value *StartValue =
+        ConstantInt::get(WidenedCond->getType()->getScalarType(), 0);
+    Value *AnyOf = State.Builder.CreateIntrinsic(


Better use VPWidenIntrinsic like for other EVL based recipes if possible?

fhahn · 2024-12-22T21:15:32Z

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp

+    Value *VLPhi = State.get(getOperand(1), /*NeedsScalar=*/true);
+    Value *EVL = State.get(getOperand(2), /*NeedsScalar=*/true);
+    Value *VLSel = State.Builder.CreateSelect(AnyOf, EVL, VLPhi, "csa.vl.sel");
+    cast<PHINode>(VLPhi)->addIncoming(VLSel, State.CFG.PrevBB);


Why do we have to update the IR for one of the operands here? Can we not update the incoming values like for other phi recipes?

Then we could use regular select?

fhahn · 2024-12-22T21:15:49Z

llvm/lib/Transforms/Vectorize/VPlanTransforms.cpp

+
+  // We build the scalar version of a CSA when VF=ElementCount::getFixed(1),
+  // which does not require an EVL.
+  if (!Plan.hasScalarVFOnly())


does this apply to the other EVL recipes as well?

fhahn · 2024-12-22T21:17:55Z

llvm/test/Transforms/LoopVectorize/RISCV/conditional-scalar-assignment.ll

@@ -0,0 +1,2932 @@
+; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5


I think I am missing something, but the non-EVL codegen doens't seem to use and predicated memory accesses, so it should be able to do it with a generic target?

This patch adds initial support for CSA vectorization LLVM. This new class can be characterized by vectorization of assignment to a scalar in a loop, such that the assignment is conditional from the perspective of its use. An assignment is conditional in a loop if a value may or may not be assigned in the loop body. For example: ``` int t = init_val; for (int i = 0; i < N; i++) { if (cond[i]) t = a[i]; } s = t; // use t ``` Using pseudo-LLVM code this can be vectorized as ``` vector.ph: ... %t = %init_val %init.mask = <all-false-vec> %init.data = <poison-vec> ; uninitialized vector.body: ... %mask.phi = phi [%init.mask, %vector.ph], [%new.mask, %vector.body] %data.phi = phi [%data.mask, %vector.ph], [%new.mask, %vector.body] %cond.vec = <widened-cmp> ... %a.vec = <widened-load> %a, %i %b = <any-lane-active> %cond.vec %new.mask = select %b, %cond.vec, %mask.phi %new.data = select %b, %a.vec, %data.phi ... middle.block: %s = <extract-last-active-lane> %new.mask, %new.data ``` On each iteration, we track whether any lane in the widened condition was active, and if it was take the current mask and data as the new mask and data vector. Then at the end of the loop, the scalar can be extracted only once. This transformation works the same way for integer, pointer, and floating point conditional assignment, since the transformation does not require inspection of the data being assigned. In the vectorization of a CSA, we will be introducing recipes into the vector preheader, the vector body, and the middle block. Recipes that are introduced into the preheader and middle block are executed only one time, and recipes that are in the vector body will be possibly executed multiple times. The more times that the vector body is executed, the less of an impact the preheader and middle block cost have on the overall cost of a CSA. A detailed explanation of the concept can be found [here](https://discourse.llvm.org/t/vectorization-of-conditional-scalar-assignment-csa/80964). This patch is further tested in llvm/llvm-test-suite#155. This patch contains only the non-EVL related code. The is based on the larger patch of llvm#106560, which contained both EVL and non-EVL related parts.

michaelmaitland · 2024-12-27T19:03:47Z

Closing in favor of smaller patches. First one is posted at #121222

This patch adds initial support for CSA vectorization LLVM. This new class can be characterized by vectorization of assignment to a scalar in a loop, such that the assignment is conditional from the perspective of its use. An assignment is conditional in a loop if a value may or may not be assigned in the loop body. For example: ``` int t = init_val; for (int i = 0; i < N; i++) { if (cond[i]) t = a[i]; } s = t; // use t ``` Using pseudo-LLVM code this can be vectorized as ``` vector.ph: ... %t = %init_val %init.mask = <all-false-vec> %init.data = <poison-vec> ; uninitialized vector.body: ... %mask.phi = phi [%init.mask, %vector.ph], [%new.mask, %vector.body] %data.phi = phi [%data.mask, %vector.ph], [%new.mask, %vector.body] %cond.vec = <widened-cmp> ... %a.vec = <widened-load> %a, %i %b = <any-lane-active> %cond.vec %new.mask = select %b, %cond.vec, %mask.phi %new.data = select %b, %a.vec, %data.phi ... middle.block: %s = <extract-last-active-lane> %new.mask, %new.data ``` On each iteration, we track whether any lane in the widened condition was active, and if it was take the current mask and data as the new mask and data vector. Then at the end of the loop, the scalar can be extracted only once. This transformation works the same way for integer, pointer, and floating point conditional assignment, since the transformation does not require inspection of the data being assigned. In the vectorization of a CSA, we will be introducing recipes into the vector preheader, the vector body, and the middle block. Recipes that are introduced into the preheader and middle block are executed only one time, and recipes that are in the vector body will be possibly executed multiple times. The more times that the vector body is executed, the less of an impact the preheader and middle block cost have on the overall cost of a CSA. A detailed explanation of the concept can be found [here](https://discourse.llvm.org/t/vectorization-of-conditional-scalar-assignment-csa/80964). This patch is further tested in llvm/llvm-test-suite#155. This patch contains only the non-EVL related code. The is based on the larger patch of llvm#106560, which contained both EVL and non-EVL related parts.

michaelmaitland added the vectorizers label Aug 29, 2024

michaelmaitland requested review from fhahn, alexey-bataev and ayalz August 29, 2024 14:15

llvmbot added backend:RISC-V llvm:analysis llvm:transforms labels Aug 29, 2024

michaelmaitland force-pushed the csa-vectorization branch from c15cf30 to 3dc4a47 Compare August 29, 2024 14:30

michaelmaitland mentioned this pull request Aug 29, 2024

[SingleSource/Vectorizer] Add unit tests for conditional scalar assignment pattern llvm/llvm-test-suite#155

Open

michaelmaitland force-pushed the csa-vectorization branch from 3dc4a47 to 1e31c22 Compare August 29, 2024 14:46

artagnon reviewed Aug 30, 2024

View reviewed changes

michaelmaitland force-pushed the csa-vectorization branch from 1e31c22 to 51e4c67 Compare August 30, 2024 15:55

michaelmaitland commented Aug 30, 2024

View reviewed changes

artagnon requested a review from david-arm August 30, 2024 16:53

michaelmaitland force-pushed the csa-vectorization branch 2 times, most recently from c59efaf to ecd9285 Compare August 30, 2024 18:50

artagnon reviewed Aug 30, 2024

View reviewed changes

michaelmaitland mentioned this pull request Aug 31, 2024

[VPlan] Add ExplicitVectorLength to isSingleScalar #106818

Open

michaelmaitland force-pushed the csa-vectorization branch 3 times, most recently from a03e73b to c2e26ff Compare September 3, 2024 18:18

michaelmaitland requested a review from fhahn November 5, 2024 00:25

michaelmaitland force-pushed the csa-vectorization branch 2 times, most recently from e55606a to de0df31 Compare November 5, 2024 21:29

michaelmaitland force-pushed the csa-vectorization branch from de0df31 to dac2839 Compare November 12, 2024 15:32

michaelmaitland force-pushed the csa-vectorization branch 2 times, most recently from ca6cfa8 to 27131da Compare December 6, 2024 17:39

ElvisWang123 reviewed Dec 10, 2024

View reviewed changes

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp Outdated Show resolved Hide resolved

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp Outdated Show resolved Hide resolved

llvm/lib/Transforms/Vectorize/VPlanRecipes.cpp Outdated Show resolved Hide resolved

michaelmaitland added 3 commits December 16, 2024 07:15

[LV] Precommit csa vectorization test cases

5b34a9e

[CSA] Add CSADescriptors Analysis

e741182

[LVL][CSA] Legalize CSA vectorization

0c71559

michaelmaitland force-pushed the csa-vectorization branch from 27131da to e280239 Compare December 16, 2024 15:42

[LV] Build VPlan for CSA

80a98a5

michaelmaitland force-pushed the csa-vectorization branch from e280239 to 80a98a5 Compare December 16, 2024 19:30

fhahn reviewed Dec 22, 2024

View reviewed changes

michaelmaitland mentioned this pull request Dec 27, 2024

[LV][VPlan] Add initial support for CSA vectorization #121222

Open

michaelmaitland closed this Dec 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[LV][VPlan] Add initial support for CSA vectorization #106560

[LV][VPlan] Add initial support for CSA vectorization #106560

michaelmaitland commented Aug 29, 2024 •

edited

Loading

llvmbot commented Aug 29, 2024 •

edited

Loading

github-actions bot commented Aug 29, 2024 •

edited

Loading

artagnon left a comment

artagnon commented Aug 30, 2024

michaelmaitland left a comment

michaelmaitland commented Aug 30, 2024

michaelmaitland commented Aug 30, 2024

artagnon commented Aug 30, 2024

artagnon commented Aug 30, 2024

michaelmaitland commented Aug 30, 2024

artagnon left a comment

michaelmaitland commented Nov 12, 2024

michaelmaitland commented Nov 21, 2024

michaelmaitland commented Dec 2, 2024

michaelmaitland commented Dec 6, 2024

michaelmaitland commented Dec 9, 2024

michaelmaitland commented Dec 16, 2024

fhahn left a comment

fhahn Dec 22, 2024

fhahn Dec 22, 2024

michaelmaitland Dec 27, 2024

fhahn Dec 22, 2024

fhahn Dec 22, 2024

michaelmaitland Dec 27, 2024

fhahn Dec 22, 2024

fhahn Dec 22, 2024

fhahn Dec 22, 2024

fhahn Dec 22, 2024

fhahn Dec 22, 2024

michaelmaitland commented Dec 27, 2024

	// Updates the vetorization state by adding \p Phi to the CSA list.
	/// Updates the vetorization state by adding \p Phi to the CSA list.

		@@ -0,0 +1,2932 @@
		; NOTE: Assertions have been autogenerated by utils/update_test_checks.py UTC_ARGS: --version 5

[LV][VPlan] Add initial support for CSA vectorization #106560

[LV][VPlan] Add initial support for CSA vectorization #106560

Conversation

michaelmaitland commented Aug 29, 2024 • edited Loading

llvmbot commented Aug 29, 2024 • edited Loading

github-actions bot commented Aug 29, 2024 • edited Loading

artagnon left a comment

Choose a reason for hiding this comment

artagnon commented Aug 30, 2024

michaelmaitland left a comment

Choose a reason for hiding this comment

michaelmaitland commented Aug 30, 2024

michaelmaitland commented Aug 30, 2024

artagnon commented Aug 30, 2024

artagnon commented Aug 30, 2024

michaelmaitland commented Aug 30, 2024

artagnon left a comment

Choose a reason for hiding this comment

michaelmaitland commented Nov 12, 2024

michaelmaitland commented Nov 21, 2024

michaelmaitland commented Dec 2, 2024

michaelmaitland commented Dec 6, 2024

michaelmaitland commented Dec 9, 2024

michaelmaitland commented Dec 16, 2024

fhahn left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

michaelmaitland commented Dec 27, 2024

michaelmaitland commented Aug 29, 2024 •

edited

Loading

llvmbot commented Aug 29, 2024 •

edited

Loading

github-actions bot commented Aug 29, 2024 •

edited

Loading