Commit f9df0034 authored 1 year ago by Juzhe-Zhong Committed by Pan Li 1 year ago

RISC-V: Lower vmv.v.x (avl = 1) into vmv.s.x

On an AI benchmark, GCC shows a 3% performance drop compared to Clang.

This is because Clang/LLVM has a simplification that transforms vmv.v.x (avl = 1) into vmv.s.x.

vmv.s.x places more flexible demands on vsetvl than vmv.v.x, which gives us a
better chance to fuse vsetvl instructions.
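
The equivalence behind this lowering can be sketched with a small scalar model.
This is an illustrative simplification, not GCC code: the names model_vmv_v_x,
model_vmv_s_x and same_result are hypothetical, VLMAX = 4 assumes VLEN = 128
with SEW = 32 and LMUL = 1, and only the tail-undisturbed behaviour is modelled
(a tail-agnostic implementation may also overwrite tail elements with all ones).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VLMAX 4 /* assumption: VLEN = 128, SEW = 32, LMUL = 1 */

/* Model of vmv.v.x: elements [0, vl) receive the scalar,
   elements [vl, VLMAX) are left untouched (tail undisturbed). */
static void model_vmv_v_x(uint32_t vd[VLMAX], uint32_t x, size_t vl)
{
  for (size_t i = 0; i < vl && i < VLMAX; i++)
    vd[i] = x;
}

/* Model of vmv.s.x: only element 0 is written, and only when vl > 0.
   The actual value of vl does not matter beyond being non-zero. */
static void model_vmv_s_x(uint32_t vd[VLMAX], uint32_t x, size_t vl)
{
  if (vl > 0)
    vd[0] = x;
}

/* Hypothetical helper: run both models on the same starting register
   contents and report whether the results match. */
static bool same_result(uint32_t x, size_t splat_vl, size_t scalar_vl)
{
  uint32_t a[VLMAX] = {1, 2, 3, 4};
  uint32_t b[VLMAX] = {1, 2, 3, 4};
  model_vmv_v_x(a, x, splat_vl);
  model_vmv_s_x(b, x, scalar_vl);
  return memcmp(a, b, sizeof a) == 0;
}
```

In this model, same_result (0xaaaa, 1, 4) holds: a splat with vl = 1 matches a
scalar move executed under vl = 4, which is why the scalar move can share the
vsetvl of the surrounding instructions. same_result (0xaaaa, 4, 4) does not
hold, so the lowering is only valid when avl = 1.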

Consider the following case:

void
foo (uint32_t *outputMat, uint32_t *inputMat)
{
  vuint32m1_t matRegIn0 = __riscv_vle32_v_u32m1 (inputMat, 4);
  vuint32m1_t matRegIn1 = __riscv_vle32_v_u32m1 (inputMat + 4, 4);
  vuint32m1_t matRegIn2 = __riscv_vle32_v_u32m1 (inputMat + 8, 4);
  vuint32m1_t matRegIn3 = __riscv_vle32_v_u32m1 (inputMat + 12, 4);

  vbool32_t oddMask
    = __riscv_vreinterpret_v_u32m1_b32 (__riscv_vmv_v_x_u32m1 (0xaaaa, 1));

  vuint32m1_t smallTransposeMat0
    = __riscv_vslideup_vx_u32m1_tumu (oddMask, matRegIn0, matRegIn1, 1, 4);
  vuint32m1_t smallTransposeMat2
    = __riscv_vslideup_vx_u32m1_tumu (oddMask, matRegIn2, matRegIn3, 1, 4);

  vuint32m1_t outMat0 = __riscv_vslideup_vx_u32m1_tu (smallTransposeMat0,
						      smallTransposeMat2, 2, 4);

  __riscv_vse32_v_u32m1 (outputMat, outMat0, 4);
}
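
For readers unfamiliar with the intrinsics, a scalar reference model of foo may
help. This is a sketch under stated assumptions, not part of the patch:
foo_model and model_vslideup are hypothetical names, VL is fixed at 4 as in the
example, and tail and masked-off elements are treated as undisturbed.

```c
#include <stddef.h>
#include <stdint.h>

#define VL 4 /* the vl used by every intrinsic in foo */

/* Model of a masked vslideup with tail/mask undisturbed: for each i in
   [offset, VL), dest[i] = src[i - offset] when mask bit i is set.
   Elements below offset are never written. */
static void model_vslideup(uint32_t dest[VL], const uint32_t src[VL],
                           size_t offset, uint32_t mask_bits)
{
  for (size_t i = offset; i < VL; i++)
    if ((mask_bits >> i) & 1u)
      dest[i] = src[i - offset];
}

/* Hypothetical scalar equivalent of foo. */
void foo_model(uint32_t *outputMat, const uint32_t *inputMat)
{
  uint32_t m0[VL], m2[VL];
  for (size_t i = 0; i < VL; i++)
    {
      m0[i] = inputMat[i];     /* matRegIn0 */
      m2[i] = inputMat[8 + i]; /* matRegIn2 */
    }

  /* 0xaaaa has bits 1, 3, 5, ... set, so only odd elements are active. */
  const uint32_t odd_mask = 0xaaaa;

  model_vslideup(m0, inputMat + 4, 1, odd_mask);  /* smallTransposeMat0 */
  model_vslideup(m2, inputMat + 12, 1, odd_mask); /* smallTransposeMat2 */
  model_vslideup(m0, m2, 2, ~0u);                 /* unmasked: outMat0 */

  for (size_t i = 0; i < VL; i++)
    outputMat[i] = m0[i];
}
```

Under this model the output is inputMat[0], inputMat[4], inputMat[8],
inputMat[12], i.e. the first column of the 4x4 input matrix. Only the low bits
of the mask register matter, and they are produced by a single-element write,
which is what makes the avl = 1 splat a candidate for vmv.s.x.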

Before this patch:

        vsetivli        zero,4,e32,m1,ta,ma
        li      a5,45056
        addi    a2,a1,16
        addi    a3,a1,32
        addi    a4,a1,48
        vle32.v v1,0(a1)
        vle32.v v4,0(a2)
        vle32.v v2,0(a3)
        vle32.v v3,0(a4)
        addiw   a5,a5,-1366
        vsetivli        zero,1,e32,m1,ta,ma
        vmv.v.x v0,a5                         ---> Since avl = 1, we can transform it into vmv.s.x
        vsetivli        zero,4,e32,m1,tu,mu
        vslideup.vi     v1,v4,1,v0.t
        vslideup.vi     v2,v3,1,v0.t
        vslideup.vi     v1,v2,2
        vse32.v v1,0(a0)
        ret

After this patch:

	li	a5,45056
	addi	a2,a1,16
	vsetivli	zero,4,e32,m1,tu,mu
	addiw	a5,a5,-1366
	vle32.v	v3,0(a2)
	addi	a3,a1,32
	addi	a4,a1,48
	vle32.v	v1,0(a1)
	vmv.s.x	v0,a5
	vle32.v	v2,0(a3)
	vslideup.vi	v1,v3,1,v0.t
	vle32.v	v3,0(a4)
	vslideup.vi	v2,v3,1,v0.t
	vslideup.vi	v1,v2,2
	vse32.v	v1,0(a0)
	ret

Tested on both RV32 and RV64 with no regressions.

gcc/ChangeLog:

	* config/riscv/riscv-protos.h (splat_to_scalar_move_p): New function.
	* config/riscv/riscv-v.cc (splat_to_scalar_move_p): Ditto.
	* config/riscv/vector.md: Simplify vmv.v.x into vmv.s.x.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/vsetvl/attribute-2.c: New test.
	* gcc.target/riscv/rvv/vsetvl/attribute-3.c: New test.

parent 8229214f

Showing with 94 additions and 1 deletion