aarch64: Take into account when VF is higher than known scalar iters
Consider low overhead loops like:

void
foo (char *restrict a, int *restrict b, int *restrict c, int n)
{
  for (int i = 0; i < 9; i++)
    {
      int res = c[i];
      int t = b[i];
      if (a[i] != 0)
        res = t;
      c[i] = res;
    }
}

For such loops we use latency-only costing since the loop bounds are known and small.

The current costing, however, does not consider the case where niters < VF. So when comparing the scalar and vector costs it does not take into account that the scalar code cannot perform VF iterations. This makes it overestimate the cost of the scalar loop, and we incorrectly vectorize.

This patch takes the minimum of the VF and niters in such cases.

Before the patch we generate:

 note:  Original vector body cost = 46
 note:  Vector loop iterates at most 1 times
 note:  Scalar issue estimate:
 note:    load operations = 2
 note:    store operations = 1
 note:    general operations = 1
 note:    reduction latency = 0
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 32) = 32.000000
 note:  SVE issue estimate:
 note:    load operations = 5
 note:    store operations = 4
 note:    general operations = 11
 note:    predicate operations = 12
 note:    reduction latency = 0
 note:    estimated min cycles per iteration without predication = 5.500000
 note:    estimated min cycles per iteration for predication = 12.000000
 note:    estimated min cycles per iteration = 12.000000
 note:  Low iteration count, so using pure latency costs
 note:  Cost model analysis:

vs after:

 note:  Original vector body cost = 46
 note:  Known loop bounds, capping VF to 9 for analysis
 note:  Vector loop iterates at most 1 times
 note:  Scalar issue estimate:
 note:    load operations = 2
 note:    store operations = 1
 note:    general operations = 1
 note:    reduction latency = 0
 note:    estimated min cycles per iteration = 1.000000
 note:    estimated cycles per vector iteration (for VF 9) = 9.000000
 note:  SVE issue estimate:
 note:    load operations = 5
 note:    store operations = 4
 note:    general operations = 11
 note:    predicate operations = 12
 note:    reduction latency = 0
 note:    estimated min cycles per iteration without predication = 5.500000
 note:    estimated min cycles per iteration for predication = 12.000000
 note:    estimated min cycles per iteration = 12.000000
 note:  Increasing body cost to 1472 because the scalar code could issue within the limit imposed by predicate operations
 note:  Low iteration count, so using pure latency costs
 note:  Cost model analysis:

gcc/ChangeLog:

	* config/aarch64/aarch64.cc (adjust_body_cost): Cap VF for low
	iteration loops.

gcc/testsuite/ChangeLog:

	* gcc.target/aarch64/sve/asrdiv_4.c: Update bounds.
	* gcc.target/aarch64/sve/cond_asrd_2.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_6.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_7.c: Likewise.
	* gcc.target/aarch64/sve/cond_uxt_8.c: Likewise.
	* gcc.target/aarch64/sve/miniloop_1.c: Likewise.
	* gcc.target/aarch64/sve/spill_6.c: Likewise.
	* gcc.target/aarch64/sve/sve_iters_low_1.c: New test.
	* gcc.target/aarch64/sve/sve_iters_low_2.c: New test.
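The effect of capping the VF on the scalar-versus-vector comparison can be illustrated with a small standalone sketch. This is not the code in adjust_body_cost: the helper names (effective_vf, scalar_cycles_per_vector_iter, vector_looks_cheaper) are hypothetical, and the real cost model weighs more than this single comparison. The sketch only shows why min (VF, niters) changes the outcome for the figures in the dumps above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper, not GCC's implementation: the number of scalar
   iterations that one vector iteration can actually replace.  When the
   trip count is known and smaller than the VF, the scalar loop never
   runs more than NITERS iterations.  */
static unsigned int
effective_vf (unsigned int vf, unsigned int niters, bool niters_known)
{
  if (niters_known && niters < vf)
    return niters;
  return vf;
}

/* Cycles the scalar loop needs to do the work of one vector iteration.  */
static double
scalar_cycles_per_vector_iter (double scalar_cycles_per_iter,
                               unsigned int vf)
{
  assert (scalar_cycles_per_iter >= 0.0);
  return scalar_cycles_per_iter * vf;
}

/* Simplified comparison (an assumption of this sketch): the vector loop
   looks cheaper only if one vector iteration takes fewer cycles than the
   scalar iterations it replaces.  */
static bool
vector_looks_cheaper (double scalar_cycles_per_iter,
                      double vector_cycles_per_iter, unsigned int vf)
{
  return vector_cycles_per_iter
         < scalar_cycles_per_vector_iter (scalar_cycles_per_iter, vf);
}
```

With the dump's figures (1 scalar cycle per iteration, 12 vector cycles per iteration), the uncapped VF of 32 gives a scalar estimate of 32 cycles, so the vector loop looks cheaper. Capping to the 9 known iterations gives 9 cycles, so the scalar loop wins.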
Changed files:
- gcc/config/aarch64/aarch64.cc: 13 additions, 0 deletions
- gcc/testsuite/gcc.target/aarch64/sve/asrdiv_4.c: 6 additions, 6 deletions
- gcc/testsuite/gcc.target/aarch64/sve/cond_asrd_2.c: 6 additions, 6 deletions
- gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_6.c: 4 additions, 4 deletions
- gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_7.c: 4 additions, 4 deletions
- gcc/testsuite/gcc.target/aarch64/sve/cond_uxt_8.c: 4 additions, 4 deletions
- gcc/testsuite/gcc.target/aarch64/sve/miniloop_1.c: 1 addition, 1 deletion
- gcc/testsuite/gcc.target/aarch64/sve/spill_6.c: 4 additions, 4 deletions
- gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_1.c: 17 additions, 0 deletions
- gcc/testsuite/gcc.target/aarch64/sve/sve_iters_low_2.c: 20 additions, 0 deletions