Commit d34cda72 authored 1 year ago by Richard Biener Committed by Richard Biener 6 months ago

Handle non-grouped stores as single-lane SLP

The following enables single-lane loop SLP discovery for non-grouped stores
and adjusts vectorizable_store to properly handle those.
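
As a rough illustration of the distinction (a sketch only; the function
names are hypothetical and not taken from the testsuite), the first loop
below has a single store per iteration that is not part of any
interleaving group, which is the non-grouped case now discovered as
single-lane SLP.  The second loop has two adjacent stores per iteration
that form a store group of size two.

```c
#include <stddef.h>

/* Hypothetical example: one store per iteration, not part of a group.
   This is the "non-grouped store" shape the patch is about.  */
void
scale_single (int *restrict out, const int *restrict in, size_t n)
{
  for (size_t i = 0; i < n; i++)
    out[i] = in[i] * 3;
}

/* Hypothetical example: two adjacent stores per iteration form a
   store group of size 2, i.e. a two-lane SLP candidate.  */
void
scale_pair (int *restrict out, const int *restrict in, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      out[2 * i] = in[2 * i] * 3;
      out[2 * i + 1] = in[2 * i + 1] * 5;
    }
}
```

Whether either loop is actually vectorized, and how, depends on the
target and the options used.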

For gfortran.dg/vect/vect-8.f90 we vectorize one additional loop,
not running into the "not falling back to strided accesses" bail-out.
I have not investigated in detail.

There is a set of i386 target assembler test FAILs.
gcc.target/i386/pr88531-2[bc].c in particular fail because the
target cannot identify SLP emulated gathers, see another mail from me.
Others need adjustment; I've adjusted only one with this patch.
In particular there are gcc.target/i386/cond_op_fma_*-1.c FAILs
that are because we no longer fold a VEC_COND_EXPR during the
region value-numbering we do after vectorization since we
code-generate a { 0.0, ... } constant in the VEC_COND_EXPR now
instead of having a separate statement which gets forwarded
and then triggers folding.  This leads to slightly different
code generation.  The solution is probably to use gimple_build
when building stmts or, in this case, directly emit .COND_FMA
instead of .FMA and a VEC_COND_EXPR.

gcc.dg/vect/slp-19a.c mixes contiguous 8-lane SLP with a single
lane contiguous store from one lane of the 8-lane load and we
expect to use load-lanes for this reason but the heuristic for
forcing single-lane rediscovery as implemented doesn't trigger
here as it treats both SLP instances separately.  FAILs on RISC-V.
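
The access pattern described above looks roughly like the following
sketch.  This is not the text of slp-19a.c; the function and parameter
names are made up for illustration.  The eight copies form one 8-lane
store group, while the store to ia[] is a separate single-lane
contiguous store fed by one lane of the same 8-lane load.

```c
#include <stddef.h>

/* Hypothetical sketch of an 8-lane group plus a single-lane store
   that reuses lane 2 of the grouped load.  */
void
copy8_and_pick (unsigned *restrict out, unsigned *restrict ia,
                const unsigned *restrict in, size_t n)
{
  for (size_t i = 0; i < n; i++)
    {
      /* 8-lane contiguous group: one SLP instance.  */
      out[i * 8 + 0] = in[i * 8 + 0];
      out[i * 8 + 1] = in[i * 8 + 1];
      out[i * 8 + 2] = in[i * 8 + 2];
      out[i * 8 + 3] = in[i * 8 + 3];
      out[i * 8 + 4] = in[i * 8 + 4];
      out[i * 8 + 5] = in[i * 8 + 5];
      out[i * 8 + 6] = in[i * 8 + 6];
      out[i * 8 + 7] = in[i * 8 + 7];

      /* Single-lane contiguous store: a second SLP instance.  */
      ia[i] = in[i * 8 + 2];
    }
}
```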

gcc.dg/vect/slp-19c.c shows we fail to implement an interleaving
scheme for group_size 12 (by extension using the group_size 3
scheme to reduce to 4 lanes and then continue with a pow2 scheme
would work);  we are also not considering load-lanes because of
the above reason, but aarch64 cannot do ld12.  FAILs on AARCH64
(load requires three vectors) and x86_64.

gcc.dg/vect/slp-19c.c FAILs with variable-length vectors because
of "SLP induction not supported for variable-length vectors".

gcc.target/aarch64/pr110449.c will FAIL because the (contested)
optimization in r14-2367-g224fd59b2dc8a5 was only applied to
loop-vect but not SLP vect.  I'll leave it to target maintainers
to either XFAIL (the optimization is bad) or remove the test.

	* tree-vect-slp.cc (vect_analyze_slp): Perform single-lane
	loop SLP discovery for non-grouped stores.  Move check on the root
	for re-doing SLP analysis with a single lane for load/store-lanes
	earlier and make sure we are dealing with a grouped access.
	* tree-vect-stmts.cc (vectorizable_store): Always set
	vec_num for SLP.

	* gcc.dg/vect/O3-pr39675-2.c: Adjust expected number of SLP.
	* gcc.dg/vect/fast-math-vect-call-1.c: Likewise.
	* gcc.dg/vect/no-scevccp-slp-31.c: Likewise.
	* gcc.dg/vect/slp-12b.c: Likewise.
	* gcc.dg/vect/slp-12c.c: Likewise.
	* gcc.dg/vect/slp-19a.c: Likewise.
	* gcc.dg/vect/slp-19b.c: Likewise.
	* gcc.dg/vect/slp-4-big-array.c: Likewise.
	* gcc.dg/vect/slp-4.c: Likewise.
	* gcc.dg/vect/slp-5.c: Likewise.
	* gcc.dg/vect/slp-7.c: Likewise.
	* gcc.dg/vect/slp-perm-7.c: Likewise.
	* gcc.dg/vect/slp-37.c: Likewise.
	* gcc.dg/vect/fast-math-vect-call-2.c: Likewise.
	* gcc.dg/vect/slp-26.c: RISC-V can now SLP two instances.
	* gcc.dg/vect/vect-outer-slp-3.c: Disable vectorization of
	initialization loop.
	* gcc.dg/vect/slp-reduc-5.c: Likewise.
	* gcc.dg/vect/no-scevccp-outer-12.c: Un-XFAIL.  SLP can handle
	inner loop inductions with multiple vector stmt copies.
	* gfortran.dg/vect/vect-8.f90: Adjust expected number of
	vectorized loops.
	* gcc.target/i386/vectorize1.c: Adjust what we scan for.

parent f9c5c12d


Showing with 26 additions and 23 deletions

rdubner @rdubner
mentioned in commit 62a6a53766ba46ada1112472b71d4ea21411ea39
· 2 weeks ago
