Commit 91430b73 authored by Juzhe-Zhong, committed by Pan Li

RISC-V: Add vwadd.wv/vwsub.wv auto-vectorization lowering optimization

1. This patch optimizes the codegen of the following auto-vectorized code:

void foo (int32_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, int n)
{
    for (int i = 0; i < n; i++)
      c[i] = (int64_t)a[i] + b[i];
}

It combines the instruction sequence:

...
vsext.vf2
vadd.vv
...

into:

...
vwadd.wv
...
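
The vwsub.wv side of the patch targets the analogous subtraction loop; as an illustrative
sketch (this function is not one of the patch's own examples), such a loop is expected to
lower to vwsub.wv in the same way:

void bar (int32_t * __restrict a, int64_t * __restrict b, int64_t * __restrict c, int n)
{
    for (int i = 0; i < n; i++)
      c[i] = b[i] - (int64_t)a[i];
}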

For the PLUS operation, GCC prefers the following RTL operand order when combining:

(plus: (sign_extend:..)
       (reg:))

instead of

(plus: (reg:..)
       (sign_extend:))

which is different from the MINUS pattern.

I therefore split the vwadd/vwsub patterns and add dedicated patterns for each.
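
For reference, the canonical MINUS form that the dedicated vwsub.wv pattern needs to match keeps
the wide (non-extended) operand first; in the same shorthand (a sketch, not the literal vector.md
pattern):

(minus: (reg:..)
        (sign_extend:))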

2. This patch not only optimizes the case mentioned in (1) above, but also enhances vwadd.vv/vwsub.vv
   optimization for more complicated PLUS/MINUS code. Consider the following code:

__attribute__ ((noipa)) void
vwadd_int16_t_int8_t (int16_t *__restrict dst, int16_t *__restrict dst2,
		      int16_t *__restrict dst3, int8_t *__restrict a,
		      int8_t *__restrict b, int8_t *__restrict a2,
		      int8_t *__restrict b2, int n)
{
  for (int i = 0; i < n; i++)
    {
      dst[i] = (int16_t) a[i] + (int16_t) b[i];
      dst2[i] = (int16_t) a2[i] + (int16_t) b[i];
      dst3[i] = (int16_t) a2[i] + (int16_t) a[i];
    }
}

Before this patch:
...
	vsetvli zero,a6,e8,mf2,ta,ma
	vle8.v  v2,0(a3)
	vle8.v  v1,0(a4)
	vsetvli t1,zero,e16,m1,ta,ma
	vsext.vf2       v3,v2
	vsext.vf2       v2,v1
	vadd.vv v1,v2,v3
	vsetvli zero,a6,e16,m1,ta,ma
	vse16.v v1,0(a0)
	vle8.v  v4,0(a5)
	vsetvli t1,zero,e16,m1,ta,ma
	vsext.vf2       v1,v4
	vadd.vv v2,v1,v2
...

After this patch:
...
	vsetvli	zero,a6,e8,mf2,ta,ma
	vle8.v	v3,0(a4)
	vle8.v	v1,0(a3)
	vsetvli	t4,zero,e8,mf2,ta,ma
	vwadd.vv	v2,v1,v3
	vsetvli	zero,a6,e16,m1,ta,ma
	vse16.v	v2,0(a0)
	vle8.v	v2,0(a5)
	vsetvli	t4,zero,e8,mf2,ta,ma
	vwadd.vv	v4,v3,v2
	vsetvli	zero,a6,e16,m1,ta,ma
	vse16.v	v4,0(a1)
	vsetvli	t4,zero,e8,mf2,ta,ma
	sub	a7,a7,a6
	vwadd.vv	v3,v2,v1
	vsetvli	zero,a6,e16,m1,ta,ma
	vse16.v	v3,0(a2)
...

The reason current upstream GCC cannot thoroughly optimize such code with vwadd is that the combine
pass needs an intermediate RTL pattern that extends only one of the operands (vwadd.wv); based on
that intermediate pattern, it can then extend the other operand to generate vwadd.vv.

So the vwadd.wv/vwsub.wv patterns directly help the vwadd.vv/vwsub.vv optimization, as sketched below.
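
In the same RTL shorthand as above, the two-step combine progression for the complicated case
looks roughly like this (an illustration of the idea with placeholder names, not an actual
combine dump):

(plus: (reg: ext_a)              ;; ext_a/ext_b come from separate vsext.vf2 insns,
       (reg: ext_b))             ;; so this is a plain vadd.vv

=> (plus: (sign_extend: a)       ;; one extension folded in: matches the new
          (reg: ext_b))          ;; vwadd.wv pattern

=> (plus: (sign_extend: a)       ;; the other extension folded in: matches the
          (sign_extend: b))      ;; existing vwadd.vv pattern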

gcc/ChangeLog:

	* config/riscv/riscv-vector-builtins-bases.cc: Change vwadd.wv/vwsub.wv
	intrinsic API expander.
	* config/riscv/vector.md
	(@pred_single_widen_<plus_minus:optab><any_extend:su><mode>): Remove it.
	(@pred_single_widen_sub<any_extend:su><mode>): New pattern.
	(@pred_single_widen_add<any_extend:su><mode>): New pattern.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/widen/widen-5.c: New test.
	* gcc.target/riscv/rvv/autovec/widen/widen-6.c: New test.
	* gcc.target/riscv/rvv/autovec/widen/widen-complicate-1.c: New test.
	* gcc.target/riscv/rvv/autovec/widen/widen-complicate-2.c: New test.
	* gcc.target/riscv/rvv/autovec/widen/widen_run-5.c: New test.
	* gcc.target/riscv/rvv/autovec/widen/widen_run-6.c: New test.
parent bf9eee73