Commit f23dc726 authored 2 years ago by Tamar Christina

AArch64: Update div-bitmask to implement new optab instead of target hook [PR108583]

This replaces the custom division hook with just an implementation through
add_highpart.  For NEON we implement the add highpart (Addition + extraction of
the upper highpart of the register in the same precision) as ADD + LSR.

This representation allows us to easily optimize the sequence using existing
sequences. This gets us a pretty decent sequence using SRA:

        umull   v1.8h, v0.8b, v3.8b
        umull2  v0.8h, v0.16b, v3.16b
        add     v5.8h, v1.8h, v2.8h
        add     v4.8h, v0.8h, v2.8h
        usra    v1.8h, v5.8h, 8
        usra    v0.8h, v4.8h, 8
        uzp2    v1.16b, v1.16b, v0.16b

To get the most optimal sequence however we match (a + ((b + c) >> n)) where n
is half the precision of the mode of the operation into addhn + uaddw which is
a general good optimization on its own and gets us back to:

.L4:
        ldr     q0, [x3]
        umull   v1.8h, v0.8b, v5.8b
        umull2  v0.8h, v0.16b, v5.16b
        addhn   v3.8b, v1.8h, v4.8h
        addhn   v2.8b, v0.8h, v4.8h
        uaddw   v1.8h, v1.8h, v3.8b
        uaddw   v0.8h, v0.8h, v2.8b
        uzp2    v1.16b, v1.16b, v0.16b
        str     q1, [x3], 16
        cmp     x3, x4
        bne     .L4

For SVE2 we optimize the initial sequence to the same ADD + LSR which gets us:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        add     z1.h, z0.h, z3.h
        usra    z0.h, z1.h, #8
        lsr     z0.h, z0.h, #8
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3
.L1:
        ret

and to get the most optimal sequence I match (a + b) >> n (same constraint on n)
to addhnb which gets us to:

.L3:
        ld1b    z0.h, p0/z, [x0, x3]
        mul     z0.h, p1/m, z0.h, z2.h
        addhnb  z1.b, z0.h, z3.h
        addhnb  z0.b, z0.h, z1.h
        st1b    z0.h, p0, [x0, x3]
        inch    x3
        whilelo p0.h, w3, w2
        b.any   .L3

There are multiple RTL representations possible for these optimizations, I did
not represent them using a zero_extend because we seem very inconsistent in this
in the backend.  Since they are unspecs we won't match them from vector ops
anyway. I figured maintainers would prefer this, but my maintainer ouija board
is still out for repairs :)

There are no new test as new correctness tests were added to the mid-end and
the existing codegen tests for this already exist.

gcc/ChangeLog:

	PR target/108583
	* config/aarch64/aarch64-simd.md (@aarch64_bitmask_udiv<mode>3): Remove.
	(*bitmask_shift_plus<mode>): New.
	* config/aarch64/aarch64-sve2.md (*bitmask_shift_plus<mode>): New.
	(@aarch64_bitmask_udiv<mode>3): Remove.
	* config/aarch64/aarch64.cc
	(aarch64_vectorize_can_special_div_by_constant,
	TARGET_VECTORIZE_CAN_SPECIAL_DIV_BY_CONST): Removed.
	(TARGET_VECTORIZE_PREFERRED_DIV_AS_SHIFTS_OVER_MULT,
	aarch64_vectorize_preferred_div_as_shifts_over_mult): New.

parent 81fd62d1

No related branches found

No related tags found

No related merge requests found

Hide whitespace changes

Inline Side-by-side

Showing with 52 additions and 137 deletions

Please register or to comment