Commit e84e5d03 authored 6 months ago by Tamar Christina
aarch64: Take into account when VF is higher than known scalar iters

Consider low overhead loops like:

void
foo (char *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < 9; i++)
    {
      int res = c[i];
      int t = b[i];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

For such loops we use latency-only costing, since the loop bound is known and
small.

The current costing, however, does not consider the case where niters < VF.

So when comparing the scalar and vector costs, it does not account for the
fact that the scalar code cannot perform VF iterations.  This makes it
overestimate the cost of the scalar loop, and we incorrectly vectorize.

This patch takes the minimum of the VF and niters in such cases.

Before the patch we generate:

 note:  Original vector body cost = 46
 note:  Vector loop iterates at most 1 times
 note:  Scalar issue estimate:
 note:    load operations = 2
 note:    store operations = 1
 note:    general operations = 1
 note:    reduction latency = 0
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 32) = 32.000000
 note:  SVE issue estimate:
 note:    load operations = 5
 note:    store operations = 4
 note:    general operations = 11
 note:    predicate operations = 12
 note:    reduction latency = 0
 note:    estimated min cycles per iteration without predication = 5.500000
 note:    estimated min cycles per iteration for predication = 12.000000
 note:    estimated min cycles per iteration = 12.000000
 note:  Low iteration count, so using pure latency costs
 note:  Cost model analysis:

vs after:

 note:  Original vector body cost = 46
 note:  Known loop bounds, capping VF to 9 for analysis
 note:  Vector loop iterates at most 1 times
 note:  Scalar issue estimate:
 note:    load operations = 2
 note:    store operations = 1
 note:    general operations = 1
 note:    reduction latency = 0
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 9) = 9.000000
 note:  SVE issue estimate:
 note:    load operations = 5
 note:    store operations = 4
 note:    general operations = 11
 note:    predicate operations = 12
 note:    reduction latency = 0
 note:    estimated min cycles per iteration without predication = 5.500000
 note:    estimated min cycles per iteration for predication = 12.000000
 note:    estimated min cycles per iteration = 12.000000
 note:  Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations
 note:  Low iteration count, so using pure latency costs
 note:  Cost model analysis:
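
The difference between the two dumps comes down to the scalar estimate: with
VF 32 the scalar loop is charged 32 cycles per vector iteration, while with
the VF capped at the 9 known iterations it is charged 9, which is below the
12 cycles estimated for SVE.  The following is a minimal standalone sketch of
that arithmetic, not the code in aarch64.cc; the function names cap_vf,
scalar_cycles_per_vector_iter and scalar_issues_faster are hypothetical and
exist only for illustration.

```c
/* Illustrative sketch only: these helpers are hypothetical and are not
   part of GCC.  They mirror the arithmetic visible in the dumps.  */

/* Cap the vectorization factor at the known scalar iteration count,
   since the scalar loop can never run more than NITERS iterations.  */
static unsigned int
cap_vf (unsigned int vf, unsigned int niters)
{
  return niters < vf ? niters : vf;
}

/* Cycles the scalar loop needs to do the work of one vector iteration.  */
static double
scalar_cycles_per_vector_iter (double cycles_per_iter, unsigned int vf)
{
  return cycles_per_iter * vf;
}

/* Return nonzero if the scalar code could issue in fewer cycles than
   the vector code needs for one vector iteration.  */
static int
scalar_issues_faster (double scalar_cycles, double vector_cycles)
{
  return scalar_cycles < vector_cycles;
}
```

With the numbers from the dumps (1 scalar cycle per iteration, 12 SVE cycles
per iteration), scalar_cycles_per_vector_iter (1.0, 32) gives 32 and the
scalar loop looks slower than SVE, whereas
scalar_cycles_per_vector_iter (1.0, cap_vf (32, 9)) gives 9 and the scalar
loop looks faster.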

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (adjust_body_cost):
	Cap VF for low iteration loops.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
	* gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
	* gcc.target/aarch64/sve/miniloop_1.c: Likewise.
	* gcc.target/aarch64/sve/spill_6.c: Likewise.
	* gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
	* gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
parent 67382245
Showing with 79 additions and 29 deletions