- Dec 31, 2023
-
-
Roger Sayle authored
This patch resolves the failure of pr43644-2.c in the testsuite, a code quality test I added back in July, that started failing as the code GCC generates for 128-bit values (and their parameter passing) has been in flux.

The function:

  unsigned __int128 foo(unsigned __int128 x, unsigned long long y) { return x+y; }

currently generates:

  foo:    movq    %rdx, %rcx
          movq    %rdi, %rax
          movq    %rsi, %rdx
          addq    %rcx, %rax
          adcq    $0, %rdx
          ret

and with this patch, we now generate:

  foo:    movq    %rdi, %rax
          addq    %rdx, %rax
          movq    %rsi, %rdx
          adcq    $0, %rdx

which is optimal.

2023-12-31  Uros Bizjak  <ubizjak@gmail.com>
            Roger Sayle  <roger@nextmovesoftware.com>

gcc/ChangeLog
        PR target/43644
        * config/i386/i386.md (*add<dwi>3_doubleword_concat_zext): Tweak
        order of instructions after split, to minimize number of moves.

gcc/testsuite/ChangeLog
        PR target/43644
        * gcc.target/i386/pr43644-2.c: Expect 2 movq instructions.
-
Hans-Peter Nilsson authored
Testing for mmix (a 64-bit target using Knuth's simulator). The test is largely pruned for simulators, but still needs 5m57s on my laptop from 3.5 years ago to run to successful completion. Perhaps slow hosted targets could also have problems, so increase the timeout limit, not just for simulators but for everyone, and by more than a factor of 2.

        * testsuite/20_util/hash/quality.cc: Increase timeout by a factor of 3.
-
François Dumont authored
A number of methods were still not using the small size optimization, which is to prefer an O(N) linear search to a hash computation as long as N is small.

libstdc++-v3/ChangeLog:

        * include/bits/hashtable.h: Move comment about all equivalent values
        being next to each other in the class documentation header.
        (_M_reinsert_node, _M_merge_unique): Implement small size optimization.
        (_M_find_tr, _M_count_tr, _M_equal_range_tr): Likewise.
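As an illustration (added here; this is not the libstdc++ code, and the threshold is an assumed value): for a container holding only a few elements, a plain O(N) scan with the equality predicate avoids the hash computation entirely. A hypothetical sketch in C:

  #include <stddef.h>
  #include <string.h>

  /* Hypothetical sketch of the small-size idea, not the libstdc++
     implementation: below a small threshold, a linear scan is cheaper
     than hashing the key and walking a bucket.  */
  const char *
  find_key (const char *const *keys, size_t n, const char *key)
  {
    const size_t small_size = 20;   /* assumed threshold, for illustration */
    if (n <= small_size)
      {
        for (size_t i = 0; i < n; i++)
          if (strcmp (keys[i], key) == 0)
            return keys[i];
        return NULL;
      }
    /* Larger containers would fall back to the hash-based lookup (elided).  */
    return NULL;
  }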
-
François Dumont authored
Add benches on insert with hint and before begin cache.

libstdc++-v3/ChangeLog:

        * testsuite/performance/23_containers/insert/54075.cc: Add lookup on
        unknown entries w/o copy to see potential impact of memory
        fragmentation enhancements.
        * testsuite/performance/23_containers/insert/unordered_multiset_hint.cc:
        Enhance hash functor to make it perfect, exactly 1 entry per bucket.
        Also use hash functor tagged as slow or not to bench w/o hash code
        cache.
        * testsuite/performance/23_containers/insert/unordered_set_hint.cc:
        New test case. Like previous one but using std::unordered_set.
        * testsuite/performance/23_containers/insert/unordered_set_range_insert.cc:
        New test case. Check performance of range-insertion compared to
        individual insertions.
        * testsuite/performance/23_containers/insert_erase/unordered_small_size.cc:
        Add same bench but after a copy to demonstrate impact of enhancements
        regarding memory fragmentation.
-
GCC Administrator authored
-
- Dec 30, 2023
-
-
Martin Uecker authored
This fixes the test gcc.dg/gnu23-tag-4.c introduced by commit 23fee88f, which fails for -march=... because DECL_FIELD_BIT_OFFSET is set inconsistently for types with and without a variable-sized field. This is fixed by testing for DECL_ALIGN instead. The code is further simplified by removing some unnecessary conditions, i.e. anon_field is set unconditionally and all fields are assumed to be FIELD_DECLs.

gcc/c:
        * c-typeck.cc (tagged_types_tu_compatible_p): Revise.

gcc/testsuite:
        * gcc.dg/c23-tag-9.c: New test.
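For context, an illustration added here (not the actual gcc.dg testcase): the tag-compatibility rules being tested come from C23, where two definitions of a tagged type with the same tag and identical members declare compatible types. A hedged sketch of the kind of code involved:

  /* Hedged example: under -std=c23 the two definitions of 'struct point'
     below are compatible types, so the call is valid; in earlier C
     standards the inner definition would be a distinct, incompatible
     type.  */
  struct point { int x; int y; };

  void take (struct point p);

  void
  use (void)
  {
    struct point { int x; int y; } q = { 1, 2 };
    take (q);
  }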
-
Joseph Myers authored
There will be another update in January. * MAINTAINERS: Update my email address.
-
GCC Administrator authored
-
- Dec 29, 2023
-
-
Jan Hubicka authored
This patch disables use of FMA in the matrix multiplication loop for generic (for x86-64-v3) and zen4. I tested this on zen4 and Xeon Gold 6212U.

For Intel this is neutral both on the matrix multiplication microbenchmark (attached) and spec2k17 where the difference was within noise for Core. On Core the micro-benchmark runs as follows:

With FMA:

  578,500,241 cycles:u        # 3.645 GHz            ( +- 0.12% )
  753,318,477 instructions:u  # 1.30 insn per cycle  ( +- 0.00% )
  125,417,701 branches:u      # 790.227 M/sec        ( +- 0.00% )
     0.159146 +- 0.000363 seconds time elapsed  ( +- 0.23% )

No FMA:

  577,573,960 cycles:u        # 3.514 GHz            ( +- 0.15% )
  878,318,479 instructions:u  # 1.52 insn per cycle  ( +- 0.00% )
  125,417,702 branches:u      # 763.035 M/sec        ( +- 0.00% )
     0.164734 +- 0.000321 seconds time elapsed  ( +- 0.19% )

So the cycle count is unchanged and discrete multiply+add takes the same time as FMA.

While on zen:

With FMA:

  484875179 cycles:u        # 3.599 GHz                ( +- 0.05% )  (82.11%)
  752031517 instructions:u  # 1.55 insn per cycle
  125106525 branches:u      # 928.712 M/sec            ( +- 0.03% )  (85.09%)
     128356 branch-misses:u # 0.10% of all branches    ( +- 0.06% )  (83.58%)

No FMA:

  375875209 cycles:u        # 3.592 GHz                ( +- 0.08% )  (80.74%)
  875725341 instructions:u  # 2.33 insn per cycle
  124903825 branches:u      # 1.194 G/sec              ( +- 0.04% )  (84.59%)
   0.105203 +- 0.000188 seconds time elapsed  ( +- 0.18% )

The difference is that Core understands the fact that fmadd does not need all three parameters to start computation, while Zen cores don't. Since this seems a noticeable win on Zen and not a loss on Core, it seems like a good default for generic.

  float a[SIZE][SIZE];
  float b[SIZE][SIZE];
  float c[SIZE][SIZE];

  void init(void)
  {
     int i, j, k;
     for(i=0; i<SIZE; ++i)
     {
        for(j=0; j<SIZE; ++j)
        {
           a[i][j] = (float)i + j;
           b[i][j] = (float)i - j;
           c[i][j] = 0.0f;
        }
     }
  }

  void mult(void)
  {
     int i, j, k;
     for(i=0; i<SIZE; ++i)
     {
        for(j=0; j<SIZE; ++j)
        {
           for(k=0; k<SIZE; ++k)
           {
              c[i][j] += a[i][k] * b[k][j];
           }
        }
     }
  }

  int main(void)
  {
     clock_t s, e;
     init();
     s=clock();
     mult();
     e=clock();
     printf(" mult took %10d clocks\n", (int)(e-s));
     return 0;
  }

gcc/ChangeLog:

        * config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
        X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and Core.
-
Tamar Christina authored
In gimple the operation

  short _8;
  double _9;
  _9 = (double) _8;

denotes two operations on AArch64. First we have to widen from short to long and then convert this integer to a double. Currently however we only count the widen/truncate operations:

  (double) _5 6 times vec_promote_demote costs 12 in body
  (double) _5 12 times vec_promote_demote costs 24 in body

but not the actual conversion operation, which needs an additional 12 instructions in the attached testcase. Without this the attached testcase ends up incorrectly thinking that it's beneficial to vectorize the loop at a very high VF = 8 (4x unrolled).

Because we can't change the mid-end to account for this, the costing code in the backend now keeps track of whether the previous operation was a promotion/demotion and adjusts the expected number of instructions to:

1. If it's the first FLOAT_EXPR and the precision of the lhs and rhs are different, double it, since we need to convert and promote.
2. If the previous operation was a demotion/promotion, then reduce the cost of the current operation by the amount we added extra in the last one.

With the patch we get:

  (double) _5 6 times vec_promote_demote costs 24 in body
  (double) _5 12 times vec_promote_demote costs 36 in body

which correctly accounts for 30 operations.

This fixes the 16% regression in imagick in SPECCPU 2017 reported on Neoverse N2 and using the new generic Armv9-a cost model.

gcc/ChangeLog:

        PR target/110625
        * config/aarch64/aarch64.cc (aarch64_vector_costs::add_stmt_cost):
        Adjust throughput and latency calculations for vector conversions.
        (class aarch64_vector_costs): Add m_num_last_promote_demote.

gcc/testsuite/ChangeLog:

        PR target/110625
        * gcc.target/aarch64/pr110625_4.c: New test.
        * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Add
        --param aarch64-sve-compare-costs=0.
        * gcc.target/aarch64/sve/unpack_fcvt_unsigned_1.c: Likewise.
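A short illustration (added here; the attached testcase itself is not reproduced): the pattern being costed is a loop that converts a narrow integer to double, which needs both a widening step and an int-to-float conversion per element. A hedged sketch:

  /* Hypothetical example of the shape of loop this costing change targets:
     each short element must be widened before it can be converted to
     double, so the vector body needs promote/demote *and* convert
     instructions.  */
  void
  widen_convert (double *restrict out, const short *restrict in, int n)
  {
    for (int i = 0; i < n; i++)
      out[i] = (double) in[i];
  }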
-
Xi Ruoyao authored
gcc/ChangeLog: * config/loongarch/loongarch.md (bstrins_<mode>_for_ior_mask): For the condition, remove unneeded trailing "\" and move "&&" to follow GNU coding style. NFC.
-
Xi Ruoyao authored
The problem with peephole2 is it uses a naive sliding-window algorithm and misses many cases. For example:

  float a[10000];
  float t() { return a[0] + a[8000]; }

is compiled to:

  la.local    $r13,a
  la.local    $r12,a+32768
  fld.s       $f1,$r13,0
  fld.s       $f0,$r12,-768
  fadd.s      $f0,$f1,$f0

by trunk. But as we've explained in r14-4851, the following would be better with -mexplicit-relocs=auto:

  pcalau12i   $r13,%pc_hi20(a)
  pcalau12i   $r12,%pc_hi20(a+32000)
  fld.s       $f1,$r13,%pc_lo12(a)
  fld.s       $f0,$r12,%pc_lo12(a+32000)
  fadd.s      $f0,$f1,$f0

However the sliding-window algorithm just won't detect the pcalau12i/fld pair to be optimized. Using a define_insn_and_rewrite in the combine pass works around the issue.

gcc/ChangeLog:

        * config/loongarch/predicates.md (symbolic_pcrel_offset_operand): New
        define_predicate.
        (mem_simple_ldst_operand): Likewise.
        * config/loongarch/loongarch-protos.h
        (loongarch_rewrite_mem_for_simple_ldst): Declare.
        * config/loongarch/loongarch.cc
        (loongarch_rewrite_mem_for_simple_ldst): Implement.
        * config/loongarch/loongarch.md (simple_load<mode>): New
        define_insn_and_rewrite.
        (simple_load_<su>ext<SUBDI:mode><GPR:mode>): Likewise.
        (simple_store<mode>): Likewise.
        (define_peephole2): Remove la.local/[f]ld peepholes.

gcc/testsuite/ChangeLog:

        * gcc.target/loongarch/explicit-relocs-auto-single-load-store-2.c:
        New test.
        * gcc.target/loongarch/explicit-relocs-auto-single-load-store-3.c:
        New test.
-
Uros Bizjak authored
The post-reload splitter currently allows xmm16+ registers with TARGET_EVEX512. The splitter changes SFmode of the output operand to V4SFmode, but the vector mode is currently unsupported in xmm16+ without TARGET_AVX512VL. lowpart_subreg returns NULL_RTX in this case and the compilation fails with invalid RTX. The patch removes support for x/ymm16+ registers with TARGET_EVEX512. The support should be restored once ix86_hard_regno_mode_ok is fixed to allow 16-byte modes in x/ymm16+ with TARGET_EVEX512. PR target/113133 gcc/ChangeLog: * config/i386/i386.md (TARGET_USE_VECTOR_FP_CONVERTS SF->DF float_extend splitter): Do not handle xmm16+ with TARGET_EVEX512. gcc/testsuite/ChangeLog: * gcc.target/i386/pr113133-1.c: New test. * gcc.target/i386/pr113133-2.c: New test.
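As a concrete illustration (added here; this is not one of the new pr113133 testcases): the splitter in question rewrites a scalar float-to-double extension into a packed conversion, so even a trivial conversion can reach it when the tuning enables TARGET_USE_VECTOR_FP_CONVERTS. A hedged sketch:

  /* Hypothetical example: a plain SFmode -> DFmode float_extend, the kind
     of pattern the post-reload splitter rewrites into a packed V4SF/V2DF
     conversion when that is considered profitable.  */
  double
  extend (float x)
  {
    return (double) x;
  }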
-
Andrew Pinski authored
This fixes the gcc.dg/tree-ssa/gen-vect-26.c testcase by adding `#pragma GCC novector` in front of the loop that is doing the checking of the result. We only want to test the first loop to see if it can be vectorized.

Committed as obvious after testing on x86_64-linux-gnu with -m32.

gcc/testsuite/ChangeLog:

        PR testsuite/113167
        * gcc.dg/tree-ssa/gen-vect-26.c: Mark the test/check loop as novector.

Signed-off-by: Andrew Pinski <quic_apinski@quicinc.com>
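For readers unfamiliar with the pragma, a hedged sketch of the pattern (added here; not the actual gen-vect-26.c source): the annotation keeps the result-checking loop scalar, so the dump scan only matches the loop under test.

  /* Hypothetical shape of the fix: the loop under test may be vectorized,
     while the checking loop is explicitly kept scalar with the GCC
     novector pragma.  */
  extern void abort (void);

  void
  check (const int *res, const int *expected, int n)
  {
  #pragma GCC novector
    for (int i = 0; i < n; i++)
      if (res[i] != expected[i])
        abort ();
  }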
-
Juzhe-Zhong authored
The redundant dump check is fragile, easily changed, and not necessary. Tested on both RV32/RV64 with no regression. Removed it and committed.

gcc/testsuite/ChangeLog:

        * gcc.dg/vect/costmodel/riscv/rvv/pr113112-1.c: Remove redundant checks.
-
Juzhe-Zhong authored
Notice we have the following situation:

  vsetivli     zero,4,e32,m1,ta,ma
  vlseg4e32.v  v4,(a5)
  vlseg4e32.v  v12,(a3)
  vsetvli      a5,zero,e32,m1,tu,ma  ---> This is redundant since VLMAX AVL = 4 when it is fixed-vlmax
  vfadd.vf     v3,v13,fa0
  vfadd.vf     v1,v12,fa1
  vfmul.vv     v17,v3,v5
  vfmul.vv     v16,v1,v5

The root cause is that we transform COND_LEN_xxx into VLMAX AVL when len == NUNITS blindly. However, we don't need to transform all of them, since when len is in the range [0,31] we don't need to consume scalar registers.

After this patch:

  vsetivli     zero,4,e32,m1,tu,ma
  addi         a4,a5,400
  vlseg4e32.v  v12,(a3)
  vfadd.vf     v3,v13,fa0
  vfadd.vf     v1,v12,fa1
  vlseg4e32.v  v4,(a4)
  vfadd.vf     v2,v14,fa1
  vfmul.vv     v17,v3,v5
  vfmul.vv     v16,v1,v5

Tested on both RV32 and RV64, no regression. Ok for trunk?

gcc/ChangeLog:

        * config/riscv/riscv-v.cc (is_vlmax_len_p): New function.
        (expand_load_store): Disallow transformation into VLMAX when len is
        in range of [0,31].
        (expand_cond_len_op): Ditto.
        (expand_gather_scatter): Ditto.
        (expand_lanes_load_store): Ditto.
        (expand_fold_extract_last): Ditto.

gcc/testsuite/ChangeLog:

        * gcc.target/riscv/rvv/autovec/post-ra-avl.c: Adapt test.
        * gcc.target/riscv/rvv/base/vf_avl-2.c: New test.
-
GCC Administrator authored
-
- Dec 28, 2023
-
-
Rimvydas Jasinskas authored
Separate out -fdump-* options to the new section. Sort by option name. While there, document -save-temps intermediates.

gcc/fortran/ChangeLog:

        PR fortran/81615
        * invoke.texi: Add Developer Options section. Move '-fdump-*' to it.
        Add small examples about changed '-save-temps' behavior.

Signed-off-by: Rimvydas Jasinskas <rimvydas.jas@gmail.com>
-
David Edelsohn authored
The template linkage2.C and linkage3.C testcases expect a decoration that does not match AIX assembler syntax. Expect failure.

gcc/testsuite/ChangeLog:

        * g++.dg/template/linkage2.C: XFAIL on AIX.
        * g++.dg/template/linkage3.C: Same.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
-
Uros Bizjak authored
Move ix86_expand_unary_operator from i386.cc to i386-expand.cc, re-arrange prototypes and do some cosmetic changes with the usage of TARGET_APX_NDD. No functional changes. gcc/ChangeLog: * config/i386/i386.cc (ix86_unary_operator_ok): Move from here... * config/i386/i386-expand.cc (ix86_unary_operator_ok): ... to here. * config/i386/i386-protos.h: Re-arrange ix86_{unary|binary}_operator_ok and ix86_expand_{unary|binary}_operator prototypes. * config/i386/i386.md: Cosmetic changes with the usage of TARGET_APX_NDD in ix86_expand_{unary|binary}_operator and ix86_{unary|binary}_operator_ok function calls.
-
Juzhe-Zhong authored
Notice current dynamic LMUL is not accurate for conversion codes. Refine for it, there is current case is changed from choosing LMUL = 4 into LMUL = 8. Tested no regression, committed. Before this patch (LMUL = 4): After this patch (LMUL = 8): lw a7,56(sp) lw a7,56(sp) ld t5,0(sp) ld t5,0(sp) ld t1,8(sp) ld t1,8(sp) ld t6,16(sp) ld t6,16(sp) ld t0,24(sp) ld t0,24(sp) ld t3,32(sp) ld t3,32(sp) ld t4,40(sp) ld t4,40(sp) ble a7,zero,.L5 ble a7,zero,.L5 .L3: .L3: vsetvli a4,a7,e32,m2,ta,ma vsetvli a4,a7,e32,m4,ta vle8.v v1,0(a2) vle8.v v3,0(a2) vle8.v v4,0(a1) vle8.v v16,0(t0) vsext.vf4 v8,v1 vle8.v v7,0(a1) vsext.vf4 v2,v4 vle8.v v12,0(t6) vsetvli zero,zero,e8,mf2,ta,ma vle8.v v2,0(a5) vadd.vv v4,v4,v1 vle8.v v1,0(t5) vsetvli zero,zero,e32,m2,ta,ma vsext.vf4 v20,v3 vle8.v v5,0(t0) vsext.vf4 v8,v7 vle8.v v6,0(t6) vadd.vv v8,v8,v20 vadd.vv v2,v2,v8 vadd.vv v8,v8,v8 vadd.vv v2,v2,v2 vadd.vv v8,v8,v20 vadd.vv v2,v2,v8 vsetvli zero,zero,e8,m1 vsetvli zero,zero,e8,mf2,ta,ma vadd.vv v15,v12,v16 vadd.vv v6,v6,v5 vsetvli zero,zero,e32,m4 vsetvli zero,zero,e32,m2,ta,ma vsext.vf4 v12,v15 vle8.v v8,0(t5) vadd.vv v8,v8,v12 vle8.v v9,0(a5) vsetvli zero,zero,e8,m1 vsext.vf4 v10,v4 vadd.vv v7,v7,v3 vsext.vf4 v12,v6 vsetvli zero,zero,e32,m4 vadd.vv v2,v2,v12 vsext.vf4 v4,v7 vadd.vv v2,v2,v10 vadd.vv v8,v8,v4 vsetvli zero,zero,e16,m1,ta,ma vsetvli zero,zero,e16,m2 vncvt.x.x.w v4,v2 vncvt.x.x.w v4,v8 vsetvli zero,zero,e32,m2,ta,ma vsetvli zero,zero,e8,m1 vadd.vv v6,v2,v2 vncvt.x.x.w v4,v4 vsetvli zero,zero,e8,mf2,ta,ma vadd.vv v15,v3,v4 vncvt.x.x.w v4,v4 vadd.vv v2,v2,v4 vadd.vv v5,v5,v4 vse8.v v15,0(t4) vadd.vv v9,v9,v4 vadd.vv v3,v16,v4 vadd.vv v1,v1,v4 vse8.v v2,0(a3) vadd.vv v4,v8,v4 vadd.vv v1,v1,v4 vse8.v v1,0(t4) vse8.v v1,0(a6) vse8.v v9,0(a3) vse8.v v3,0(t1) vsetvli zero,zero,e32,m2,ta,ma vsetvli zero,zero,e32,m4 vse8.v v4,0(a6) vsext.vf4 v4,v3 vsext.vf4 v8,v5 vadd.vv v4,v4,v8 vse8.v v5,0(t1) vsetvli zero,zero,e64,m8 vadd.vv v2,v8,v2 vsext.vf2 v16,v4 vsetvli zero,zero,e64,m4,ta,ma vse64.v v16,0(t3) vsext.vf2 v8,v2 vsetvli zero,zero,e32,m4 vsetvli zero,zero,e32,m2,ta,ma vadd.vv v8,v8,v8 slli t2,a4,3 vsext.vf4 v4,v15 vse64.v v8,0(t3) slli t2,a4,3 vsext.vf4 v2,v1 vadd.vv v4,v8,v4 sub a7,a7,a4 sub a7,a7,a4 vadd.vv v2,v6,v2 vsetvli zero,zero,e64,m8 vsetvli zero,zero,e64,m4,ta,ma vsext.vf2 v8,v4 vsext.vf2 v4,v2 vse64.v v8,0(a0) vse64.v v4,0(a0) add a1,a1,a4 add a2,a2,a4 add a2,a2,a4 add a1,a1,a4 add a5,a5,a4 add t6,t6,a4 add t5,t5,a4 add t0,t0,a4 add t6,t6,a4 add a5,a5,a4 add t0,t0,a4 add t5,t5,a4 add t4,t4,a4 add t4,t4,a4 add a3,a3,a4 add a3,a3,a4 add a6,a6,a4 add a6,a6,a4 add t1,t1,a4 add t1,t1,a4 add t3,t3,t2 add t3,t3,t2 add a0,a0,t2 add a0,a0,t2 bne a7,zero,.L3 bne a7,zero,.L3 .L5: .L5: ret ret gcc/ChangeLog: * config/riscv/riscv-vector-costs.cc (is_gimple_assign_or_call): Change interface. (get_live_range): New function. gcc/testsuite/ChangeLog: * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-3.c: Adapt test. * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c: Ditto.
-
GCC Administrator authored
-
- Dec 27, 2023
-
-
Xi Ruoyao authored
The GCC internal doc says:

     X might be a pseudo-register or a 'subreg' of a pseudo-register,
     which could either be in a hard register or in memory.  Use
     'true_regnum' to find out; it will return -1 if the pseudo is in
     memory and the hard register number if it is in a register.

So "MEM_P (x)" is not enough for checking if we are reloading from/to the memory. This bug has caused the reload pass to stall and finally ICE complaining with "maximum number of generated reload insns per insn achieved", since r14-6814.

Check if "true_regnum (x)" is -1 besides "MEM_P (x)" to fix the issue.

gcc/ChangeLog:

        PR target/113148
        * config/loongarch/loongarch.cc (loongarch_secondary_reload):
        Check if regno == -1 besides MEM_P (x) for reloading FCCmode
        from/to FPR to/from memory.

gcc/testsuite/ChangeLog:

        PR target/113148
        * gcc.target/loongarch/pr113148.c: New test.
-
Xi Ruoyao authored
gcc/ChangeLog:

        * config/loongarch/loongarch.md (rotl<mode>3): New define_expand.
        * config/loongarch/simd.md (vrotl<mode>3): Likewise.
        (rotl<mode>3): Likewise.

gcc/testsuite/ChangeLog:

        * gcc.target/loongarch/rotl-with-rotr.c: New test.
        * gcc.target/loongarch/rotl-with-vrotr-b.c: New test.
        * gcc.target/loongarch/rotl-with-vrotr-h.c: New test.
        * gcc.target/loongarch/rotl-with-vrotr-w.c: New test.
        * gcc.target/loongarch/rotl-with-vrotr-d.c: New test.
        * gcc.target/loongarch/rotl-with-xvrotr-b.c: New test.
        * gcc.target/loongarch/rotl-with-xvrotr-h.c: New test.
        * gcc.target/loongarch/rotl-with-xvrotr-w.c: New test.
        * gcc.target/loongarch/rotl-with-xvrotr-d.c: New test.
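A brief illustration of why the rotl expanders added above are possible on a machine that only provides rotate-right instructions (added here for context, not from the commit): rotl (x, n) is equivalent to rotr (x, width - n), so a rotate-left written portably in C can still expand to a single rotr/vrotr/xvrotr.

  /* Hedged example: a 32-bit rotate-left expressed portably; a target with
     only a rotate-right instruction can implement it as rotr (x, 32 - n).  */
  unsigned int
  rotl32 (unsigned int x, unsigned int n)
  {
    n &= 31;
    return (x << n) | (x >> ((32 - n) & 31));
  }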
-
Juzhe-Zhong authored
Consider this following case: int f[12][100]; void bad1(int v1, int v2) { for (int r = 0; r < 100; r += 4) { int i = r + 1; f[0][r] = f[1][r] * (f[2][r]) - f[1][i] * (f[2][i]); f[0][i] = f[1][r] * (f[2][i]) + f[1][i] * (f[2][r]); f[0][r+2] = f[1][r+2] * (f[2][r+2]) - f[1][i+2] * (f[2][i+2]); f[0][i+2] = f[1][r+2] * (f[2][i+2]) + f[1][i+2] * (f[2][r+2]); } } Pick up LMUL = 8 VLS blindly: lui a4,%hi(f) addi a4,a4,%lo(f) addi sp,sp,-592 addi a3,a4,800 lui a5,%hi(.LANCHOR0) vl8re32.v v24,0(a3) addi a5,a5,%lo(.LANCHOR0) addi a1,a4,400 addi a3,sp,140 vl8re32.v v16,0(a1) vl4re16.v v4,0(a5) addi a7,a5,192 vs4r.v v4,0(a3) addi t0,a5,64 addi a3,sp,336 li t2,32 addi a2,a5,128 vsetvli a5,zero,e32,m8,ta,ma vrgatherei16.vv v8,v16,v4 vmul.vv v8,v8,v24 vl8re32.v v0,0(a7) vs8r.v v8,0(a3) vmsltu.vx v8,v0,t2 addi a3,sp,12 addi t2,sp,204 vsm.v v8,0(t2) vl4re16.v v4,0(t0) vl4re16.v v0,0(a2) vs4r.v v4,0(a3) addi t0,sp,336 vrgatherei16.vv v8,v24,v4 addi a3,sp,208 vrgatherei16.vv v24,v16,v0 vs4r.v v0,0(a3) vmul.vv v8,v8,v24 vlm.v v0,0(t2) vl8re32.v v24,0(t0) addi a3,sp,208 vsub.vv v16,v24,v8 addi t6,a4,528 vadd.vv v8,v24,v8 addi t5,a4,928 vmerge.vvm v8,v8,v16,v0 addi t3,a4,128 vs8r.v v8,0(a4) addi t4,a4,1056 addi t1,a4,656 addi a0,a4,256 addi a6,a4,1184 addi a1,a4,784 addi a7,a4,384 addi a4,sp,140 vl4re16.v v0,0(a3) vl8re32.v v24,0(t6) vl4re16.v v4,0(a4) vrgatherei16.vv v16,v24,v0 addi a3,sp,12 vs8r.v v16,0(t0) vl8re32.v v8,0(t5) vrgatherei16.vv v16,v24,v4 vl4re16.v v4,0(a3) vrgatherei16.vv v24,v8,v4 vmul.vv v16,v16,v8 vl8re32.v v8,0(t0) vmul.vv v8,v8,v24 vsub.vv v24,v16,v8 vlm.v v0,0(t2) addi a3,sp,208 vadd.vv v8,v8,v16 vl8re32.v v16,0(t4) vmerge.vvm v8,v8,v24,v0 vrgatherei16.vv v24,v16,v4 vs8r.v v24,0(t0) vl4re16.v v28,0(a3) addi a3,sp,464 vs8r.v v8,0(t3) vl8re32.v v8,0(t1) vrgatherei16.vv v0,v8,v28 vs8r.v v0,0(a3) addi a3,sp,140 vl4re16.v v24,0(a3) addi a3,sp,464 vrgatherei16.vv v0,v8,v24 vl8re32.v v24,0(t0) vmv8r.v v8,v0 vl8re32.v v0,0(a3) vmul.vv v8,v8,v16 vmul.vv v24,v24,v0 vsub.vv v16,v8,v24 vadd.vv v8,v8,v24 vsetivli zero,4,e32,m8,ta,ma vle32.v v24,0(a6) vsetvli a4,zero,e32,m8,ta,ma addi a4,sp,12 vlm.v v0,0(t2) vmerge.vvm v8,v8,v16,v0 vl4re16.v v16,0(a4) vrgatherei16.vv v0,v24,v16 vsetivli zero,4,e32,m8,ta,ma vs8r.v v0,0(a4) addi a4,sp,208 vl4re16.v v0,0(a4) vs8r.v v8,0(a0) vle32.v v16,0(a1) vsetvli a5,zero,e32,m8,ta,ma vrgatherei16.vv v8,v16,v0 vs8r.v v8,0(a4) addi a4,sp,140 vl4re16.v v4,0(a4) addi a5,sp,12 vrgatherei16.vv v8,v16,v4 vl8re32.v v0,0(a5) vsetivli zero,4,e32,m8,ta,ma addi a5,sp,208 vmv8r.v v16,v8 vl8re32.v v8,0(a5) vmul.vv v24,v24,v16 vmul.vv v8,v0,v8 vsub.vv v16,v24,v8 vadd.vv v8,v8,v24 vsetvli a5,zero,e8,m2,ta,ma vlm.v v0,0(t2) vsetivli zero,4,e32,m8,ta,ma vmerge.vvm v8,v8,v16,v0 vse32.v v8,0(a7) addi sp,sp,592 jr ra This patch makes loop with known NITERS be aware of liveness estimation, after this patch, choosing LMUL = 4: lui a5,%hi(f) addi a5,a5,%lo(f) addi a3,a5,400 addi a4,a5,800 vsetivli zero,8,e32,m2,ta,ma vlseg4e32.v v16,(a3) vlseg4e32.v v8,(a4) vmul.vv v2,v8,v16 addi a3,a5,528 vmv.v.v v24,v10 vnmsub.vv v24,v18,v2 addi a4,a5,928 vmul.vv v2,v12,v22 vmul.vv v6,v8,v18 vmv.v.v v30,v2 vmacc.vv v30,v14,v20 vmv.v.v v26,v6 vmacc.vv v26,v10,v16 vmul.vv v4,v12,v20 vmv.v.v v28,v14 vnmsub.vv v28,v22,v4 vsseg4e32.v v24,(a5) vlseg4e32.v v16,(a3) vlseg4e32.v v8,(a4) vmul.vv v2,v8,v16 addi a6,a5,128 vmv.v.v v24,v10 vnmsub.vv v24,v18,v2 addi a0,a5,656 vmul.vv v2,v12,v22 addi a1,a5,1056 vmv.v.v v30,v2 vmacc.vv v30,v14,v20 vmul.vv v6,v8,v18 vmul.vv v4,v12,v20 vmv.v.v v26,v6 vmacc.vv v26,v10,v16 
vmv.v.v v28,v14 vnmsub.vv v28,v22,v4 vsseg4e32.v v24,(a6) vlseg4e32.v v16,(a0) vlseg4e32.v v8,(a1) vmul.vv v2,v8,v16 addi a2,a5,256 vmv.v.v v24,v10 vnmsub.vv v24,v18,v2 addi a3,a5,784 vmul.vv v2,v12,v22 addi a4,a5,1184 vmv.v.v v30,v2 vmacc.vv v30,v14,v20 vmul.vv v6,v8,v18 vmul.vv v4,v12,v20 vmv.v.v v26,v6 vmacc.vv v26,v10,v16 vmv.v.v v28,v14 vnmsub.vv v28,v22,v4 addi a5,a5,384 vsseg4e32.v v24,(a2) vsetivli zero,1,e32,m2,ta,ma vlseg4e32.v v16,(a3) vlseg4e32.v v8,(a4) vmul.vv v2,v16,v8 vmul.vv v6,v18,v8 vmv.v.v v24,v18 vnmsub.vv v24,v10,v2 vmul.vv v4,v20,v12 vmul.vv v2,v22,v12 vmv.v.v v26,v6 vmacc.vv v26,v16,v10 vmv.v.v v28,v22 vnmsub.vv v28,v14,v4 vmv.v.v v30,v2 vmacc.vv v30,v20,v14 vsseg4e32.v v24,(a5) ret Tested on both RV32 and RV64 no regressions. PR target/113112 gcc/ChangeLog: * config/riscv/riscv-vector-costs.cc (is_gimple_assign_or_call): New function. (get_first_lane_point): Ditto. (get_last_lane_point): Ditto. (max_number_of_live_regs): Refine live point dump. (compute_estimated_lmul): Make unknown NITERS loop be aware of liveness. (costs::better_main_loop_than_p): Ditto. * config/riscv/riscv-vector-costs.h (struct stmt_point): Add new member. gcc/testsuite/ChangeLog: * gcc.dg/vect/costmodel/riscv/rvv/pr113112-1.c: * gcc.dg/vect/costmodel/riscv/rvv/pr113112-3.c: New test.
-
Chenghui Pan authored
The following code will cause an ICE on the LoongArch target:

  #include <lsxintrin.h>

  extern void bar (__m128i, __m128i);
  __m128i a;

  void
  foo ()
  {
    bar (a, a);
  }

It is caused by a missing constraint definition in mov<mode>_lsx. This patch fixes the template and removes the unnecessary processing from the loongarch_split_move () function. This patch also cleans up the redundant definitions from loongarch_split_move () and loongarch_split_move_p ().

gcc/ChangeLog:

        * config/loongarch/lasx.md: Use loongarch_split_move and
        loongarch_split_move_p directly.
        * config/loongarch/loongarch-protos.h (loongarch_split_move): Remove
        unnecessary argument.
        (loongarch_split_move_insn_p): Delete.
        (loongarch_split_move_insn): Delete.
        * config/loongarch/loongarch.cc (loongarch_split_move_insn_p): Delete.
        (loongarch_load_store_insns): Use loongarch_split_move_p directly.
        (loongarch_split_move): Remove the unnecessary processing.
        (loongarch_split_move_insn): Delete.
        * config/loongarch/lsx.md: Use loongarch_split_move and
        loongarch_split_move_p directly.

gcc/testsuite/ChangeLog:

        * gcc.target/loongarch/vector/lsx/lsx-mov-1.c: New test.
-
Chenghui Pan authored
When investigating the failure of gcc.dg/vect/slp-reduc-sad.c, the following instruction block was found to be generated by vec_concatv32qi (which is generated by vec_initv32qiv16qi) at the entrance of the foo() function:

  vldx       $vr3,$r5,$r6
  vld        $vr2,$r5,0
  xvpermi.q  $xr2,$xr3,0x20

This reverses the high and low 128-bit parts of the vec_initv32qiv16qi operation. According to other targets' similar implementations and the LSX implementation for the following RTL representation, the current definition of "vec_concat<mode>" in lasx.md is wrong:

  (set (op0) (vec_concat (op1) (op2)))

For correct behavior, the last argument of xvpermi.q should be 0x02 instead of 0x20. This patch fixes the issue and cleans up the vec_concat template implementation.

gcc/ChangeLog:

        * config/loongarch/lasx.md (vec_concatv4di): Delete.
        (vec_concatv8si): Delete.
        (vec_concatv16hi): Delete.
        (vec_concatv32qi): Delete.
        (vec_concatv4df): Delete.
        (vec_concatv8sf): Delete.
        (vec_concat<mode>): New template with insn output fixed.
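For context, an illustration added here (not LoongArch intrinsics and not from the commit): in the RTL form above, op1 supplies the low half of the result and op2 the high half, which is the ordering the corrected xvpermi.q immediate must produce. A hedged scalar sketch of that semantics:

  /* Hedged sketch of vec_concat semantics on byte vectors: op1 (lo) fills
     elements 0..15 of the result, op2 (hi) fills elements 16..31.  */
  void
  concat_v16qi (unsigned char dst[32], const unsigned char lo[16],
                const unsigned char hi[16])
  {
    for (int i = 0; i < 16; i++)
      {
        dst[i]      = lo[i];   /* op1 -> low 128-bit half  */
        dst[i + 16] = hi[i];   /* op2 -> high 128-bit half */
      }
  }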
-
Li Wei authored
We found that using the latest compiled gcc causes a miscompare error when running the spec2006 400.perlbench test with -flto turned on. After testing, it was found that only the LoongArch architecture reports errors. The first bad commit was located through the git bisect command as r14-3773-g5b857e87201335. Through debugging, it was found that the problem is that the split condition of the *bstrins_<mode>_for_ior_mask template is empty, while it should actually be consistent with the insn condition.

gcc/ChangeLog:

        * config/loongarch/loongarch.md: Adjust.
-
Haochen Gui authored
Remove the P7 CPU test as only P7 and above can enter this function, and P7 LE is excluded by the checking of targetm.slow_unaligned_access on word_mode. Also, performance tests show the expansion of block compare is better than the library on P7 BE when the length is from 16 bytes to 64 bytes.

gcc/
        * config/rs6000/rs6000-string.cc (expand_block_compare): Assert only
        P7 and above can enter this function.  Remove the P7 CPU test and
        let P7 BE do the expand.

gcc/testsuite/
        * gcc.target/powerpc/block-cmp-4.c: New.
-
Haochen Gui authored
gcc/ * config/rs6000/rs6000.md (cmpmemsi): Fail when optimizing for size. gcc/testsuite/ * gcc.target/powerpc/block-cmp-3.c: New.
-
Haochen Gui authored
The macro TARGET_EFFICIENT_OVERLAPPING_UNALIGNED is used in rs6000-string.cc to guard the platforms which are efficient on fixed point unaligned load/store. It's originally defined by TARGET_EFFICIENT_UNALIGNED_VSX, which is enabled from P8 and can be disabled by the -mno-vsx option. So the definition is improper. This patch corrects it and calls slow_unaligned_access to judge if fixed point unaligned load/store is efficient or not.

gcc/
        * config/rs6000/rs6000.h (TARGET_EFFICIENT_OVERLAPPING_UNALIGNED):
        Remove.
        * config/rs6000/rs6000-string.cc (select_block_compare_mode): Replace
        TARGET_EFFICIENT_OVERLAPPING_UNALIGNED with
        targetm.slow_unaligned_access.
        (expand_block_compare_gpr): Likewise.
        (expand_block_compare): Likewise.
        (expand_strncmp_gpr_sequence): Likewise.

gcc/testsuite/
        * gcc.target/powerpc/block-cmp-1.c: New.
        * gcc.target/powerpc/block-cmp-2.c: New.
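As background, an illustration added here (not part of the patch): the "overlapping unaligned" trick the macro guards is the usual one in block-compare expansion, where a length between one and two register widths is covered by two loads whose ranges overlap. A hedged C sketch of the idea:

  #include <string.h>

  /* Hedged sketch: compare 9..16 bytes with two possibly-overlapping 8-byte
     unaligned loads instead of a byte loop.  Only worthwhile when unaligned
     fixed-point loads are not slow on the target.  */
  int
  equal_9_to_16 (const unsigned char *a, const unsigned char *b, size_t n)
  {
    unsigned long long a0, b0, a1, b1;
    memcpy (&a0, a, 8);
    memcpy (&b0, b, 8);
    memcpy (&a1, a + n - 8, 8);
    memcpy (&b1, b + n - 8, 8);
    return a0 == b0 && a1 == b1;
  }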
-
David Edelsohn authored
32-bit AIX supports 2-byte wchar. The wchar-multi1.C testcase assumes 4-byte wchar. Update the testcase to require 4-byte wchar.

gcc/testsuite/ChangeLog:

        * g++.dg/cpp23/wchar-multi1.C: Require 4 byte wchar_t.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
-
David Edelsohn authored
AIX sections use the csect directive to name a section. Check for csect name in attr-section testcases.

gcc/testsuite/ChangeLog:

        * g++.dg/ext/attr-section1.C: Test for csect section directive.
        * g++.dg/ext/attr-section1a.C: Same.
        * g++.dg/ext/attr-section2.C: Same.
        * g++.dg/ext/attr-section2a.C: Same.
        * g++.dg/ext/attr-section2b.C: Same.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
-
GCC Administrator authored
-
- Dec 26, 2023
-
-
David Edelsohn authored
The out-of-bounds diagram tests fail on AIX.

gcc/testsuite/ChangeLog:

        * gcc.dg/analyzer/out-of-bounds-diagram-17.c: Skip on AIX.
        * gcc.dg/analyzer/out-of-bounds-diagram-18.c: Same.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
-
David Edelsohn authored
AIX does not support split DWARF.

gcc/testsuite/ChangeLog:

        * gcc.dg/pr111409.c: Skip on AIX.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
-
David Edelsohn authored
AIX does not support stack scrubbing. Set strub as unsupported on AIX and ensure that testcases check for strub support.

gcc/testsuite/ChangeLog:

        * c-c++-common/strub-unsupported-2.c: Require strub.
        * c-c++-common/strub-unsupported-3.c: Same.
        * c-c++-common/strub-unsupported.c: Same.
        * lib/target-supports.exp (check_effective_target_strub): Return 0
        for AIX.

Signed-off-by: David Edelsohn <dje.gcc@gmail.com>
-
Juzhe-Zhong authored
gcc/testsuite/ChangeLog: * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-10.c: Fix typo.
-
Juzhe-Zhong authored
Tweak some code of the dynamic LMUL cost model to make the computation more predictable and accurate. Tested on both RV32 and RV64, no regression. Committed.

        PR target/113112

gcc/ChangeLog:

        * config/riscv/riscv-vector-costs.cc (compute_estimated_lmul): Tweak
        LMUL estimation.
        (has_unexpected_spills_p): Ditto.
        (costs::record_potential_unexpected_spills): Ditto.

gcc/testsuite/ChangeLog:

        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-1.c: Add more checks.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-2.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-3.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-4.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-5.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-6.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul1-7.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-1.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-2.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-3.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-4.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul2-5.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-1.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-2.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-3.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-5.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-6.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-7.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul4-8.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-1.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-10.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-11.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-2.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-3.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-4.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-5.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-6.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-7.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-8.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-9.c: Ditto.
        * gcc.dg/vect/costmodel/riscv/rvv/dynamic-lmul8-12.c: New test.
        * gcc.dg/vect/costmodel/riscv/rvv/pr113112-2.c: New test.
-
Di Zhao authored
The two testcases are for targets that support FMA, and pr110279-2.c assumes the reassoc_width of FMUL to be 4. This patch adds the missing options, to fix regression test failures on nvptx/GCN (where the default reassoc_width of FMUL is 1) and on x86_64 (which needs "-mfma").

gcc/testsuite/ChangeLog:

        * gcc.dg/pr110279-1.c: Add "-mcpu=generic" for aarch64; add "-mfma"
        for x86_64.
        * gcc.dg/pr110279-2.c: Replace "-march=armv8.2-a" with
        "-mcpu=generic"; limit the check to be on aarch64.
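As an illustration of the kind of code pr110279 is about (added here; the actual testcases are not reproduced): with FMA available and a reassociation width greater than 1, a sum of products can be split into several independent FMA chains instead of one serial dependency chain. A hedged sketch:

  /* Hedged example: with FMA enabled and a reassoc_width of 4 for FMUL, the
     four products below can accumulate into independent partial sums rather
     than one long fma-to-fma dependency chain.  */
  double
  dot4 (const double *a, const double *b)
  {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2] + a[3] * b[3];
  }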
-