Commits · ff505948631713d8c62523005059b10e25343617 · COBOLworx / gcc-cobol

Mar 05, 2025

PR rtl-optimization/119046: aarch64: Fix PARALLEL mode for vec_perm DUP expansion · ff505948

Kyrylo Tkachov authored 2 weeks ago


The PARALLEL created in aarch64_evpc_dup is used to hold the lane number.
It is not appropriate for it to have a vector mode.
Other such uses use VOIDmode.
Do this here as well.
This avoids the risk of generic code treating the PARALLEL as trapping when it
has floating-point mode.

Bootstrapped and tested on aarch64-none-linux-gnu.

Signed-off-by: Kyrylo Tkachov <ktkachov@nvidia.com>

	PR rtl-optimization/119046
	* config/aarch64/aarch64.cc (aarch64_evpc_dup): Use VOIDmode for
	PARALLEL.


LoongArch: Fix incorrect reorder of __lsx_vldx and __lasx_xvldx [PR119084] · 4856292f

Xi Ruoyao authored 3 weeks ago

They could be incorrectly reordered with store instructions like st.b
because the RTL expression does not have a memory_operand or a (mem)
expression.  The incorrect reorder has been observed in openh264 LTO
build.

Expand them to a (mem) expression instead of unspec to fix the issue.
Then we need to make loongarch_address_insns return 1 for
ADDRESS_REG_REG because the constraint "R" expects this behavior;
otherwise the vldx instruction will be considered invalid by the
register allocation pass and turned into add.d + vld.  Apply the
ADDRESS_REG_REG penalty in loongarch_address_cost instead;
loongarch_rtx_costs should then also call loongarch_address_cost
instead of loongarch_address_insns.

Closes: https://github.com/cisco/openh264/issues/3857

gcc/ChangeLog:

	PR target/119084
	* config/loongarch/lasx.md (UNSPEC_LASX_XVLDX): Remove.
	(lasx_xvldx): Remove.
	* config/loongarch/lsx.md (UNSPEC_LSX_VLDX): Remove.
	(lsx_vldx): Remove.
	* config/loongarch/simd.md (QIVEC): New define_mode_iterator.
	(<simd_isa>_<x>vldx): New define_expand.
	* config/loongarch/loongarch.cc (loongarch_address_insns_1): New
	static function with most logic factored out from ...
	(loongarch_address_insns): ... here.  Call
	loongarch_address_insns_1 with reg_reg_cost = 1.
	(loongarch_address_cost): Call loongarch_address_insns_1 with
	reg_reg_cost = la_addr_reg_reg_cost.

gcc/testsuite/ChangeLog:

	PR target/119084
	* gcc.target/loongarch/pr119084.c: New test.


Mar 04, 2025

Break false dependency chain on Zen5 · 8c4a00f9

Jan Hubicka authored 3 weeks ago

Some Zen5 variants have a false dependency on the tzcnt, blsi, blsr and blsmsk
instructions.  This can be tested with the following benchmark:

jh@shroud:~> cat ee.c
int
main()
{
       int a = 10;
       int b = 0;
       for (int i = 0; i < 1000000000; i++)
       {
               asm volatile ("xor %0, %0": "=r" (b));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
               asm volatile (INST " %2, %0": "=r"(b): "0"(b),"r"(a));
       }
       return 0;
}
jh@shroud:~> cat bmk.sh
gcc ee.c -DBREAK -DINST=\"$1\" -O2 ; time ./a.out ; gcc ee.c -DINST=\"$1\" -O2 ; time ./a.out
jh@shroud:~> sh bmk.sh tzcnt

real    0m0.886s
user    0m0.886s
sys     0m0.000s

real    0m0.886s
user    0m0.886s
sys     0m0.000s

jh@shroud:~> sh bmk.sh blsi

real    0m0.979s
user    0m0.979s
sys     0m0.000s

real    0m2.418s
user    0m2.418s
sys     0m0.000s

jh@shroud:~> sh bmk.sh blsr

real    0m0.986s
user    0m0.986s
sys     0m0.000s

real    0m2.422s
user    0m2.421s
sys     0m0.000s
jh@shroud:~> sh bmk.sh blsmsk

real    0m0.973s
user    0m0.973s
sys     0m0.000s

real    0m2.422s
user    0m2.422s
sys     0m0.000s

We already have a tunable that controls tzcnt together with lzcnt and popcnt.
Since it seems that only tzcnt is affected, I added a new tunable to control tzcnt
only.  I also added splitters for blsi/blsr/blsmsk implemented analogously to the
existing splitter for lzcnt.
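
For readers unfamiliar with the BMI1 instructions named above, their results
can be written as plain C bit identities.  The sketch below only illustrates
what each instruction computes; the function names are made up for this
example, and it says nothing about how the splitters in i386.md are written.

```c
#include <assert.h>
#include <stdint.h>

/* blsi: isolate the lowest set bit, i.e. x & -x.  */
uint32_t blsi32(uint32_t x) { return x & (0u - x); }

/* blsr: reset (clear) the lowest set bit, i.e. x & (x - 1).  */
uint32_t blsr32(uint32_t x) { return x & (x - 1u); }

/* blsmsk: mask up to and including the lowest set bit, i.e. x ^ (x - 1).  */
uint32_t blsmsk32(uint32_t x) { return x ^ (x - 1u); }

/* tzcnt: count trailing zero bits; yields the operand width (32) for 0.  */
unsigned tzcnt32(uint32_t x)
{
    unsigned n = 0;
    if (x == 0)
        return 32;
    while ((x & 1u) == 0) {
        x >>= 1;
        n++;
    }
    return n;
}

/* Cross-check: blsr clears exactly the bit that blsi isolates.  */
void check_bls_identities(uint32_t x)
{
    assert((blsr32(x) | blsi32(x)) == x);
    assert((blsr32(x) & blsi32(x)) == 0);
}
```

In all four cases the result depends only on the source operand, which is why
a dependency on the previous value of the destination register is a false one.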

The patch is neutral on SPEC.  We produce blsi and blsr in some internal loops, but
they usually have the same destination as source.  However, it is good to break the
dependency chain to avoid pathological cases and it is quite cheap overall, so I
think we want to enable this for generic.  I will send a follow-up patch for this.

Bootstrapped/regtested x86_64-linux, will commit it shortly.

gcc/ChangeLog:

	* config/i386/i386.h (TARGET_AVOID_FALSE_DEP_FOR_TZCNT): New macro.
	(TARGET_AVOID_FALSE_DEP_FOR_BLS): New macro.
	* config/i386/i386.md (*bmi_blsi_<mode>): Add splitter for false
	dependency.
	(*bmi_blsi_<mode>_ccno): Add splitter for false dependency.
	(*bmi_blsi_<mode>_falsedep): New pattern.
	(*bmi_blsmsk_<mode>): Add splitter for false dependency.
	(*bmi_blsmsk_<mode>_falsedep): New pattern.
	(*bmi_blsr_<mode>): Add splitter for false dependency.
	(*bmi_blsr_<mode>_cmp): Add splitter for false dependency
	(*bmi_blsr_<mode>_cmp_falsedep): New pattern.
	* config/i386/x86-tune.def (X86_TUNE_AVOID_FALSE_DEP_FOR_TZCNT): New tune.
	(X86_TUNE_AVOID_FALSE_DEP_FOR_BLS): New tune.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/blsi.c: New test.
	* gcc.target/i386/blsmsk.c: New test.
	* gcc.target/i386/blsr.c: New test.


Make ix86_macro_fusion_pair_p and ix86_fuse_mov_alu_p match current CPUs · c84be624

Jan Hubicka authored 3 weeks ago

The current implementation of fusion predicates misses some common
fusion cases on zen and more recent cores.  I added knobs for
the individual conditionals we test.

 1) I split checks for fusing ALU with conditional operands when the ALU
    has a memory operand.  This seems to be supported by zen3+ and by
    tigerlake and cooperlake (according to Agner Fog's manual).

 2) znver4 and 5 support fusion of ALU and conditional even if the ALU has
    memory and immediate operands.
    This seems to be relatively important, enabling 25% more fusions on
    gcc bootstrap.

 3) No CPU supports fusing when the ALU contains IP-relative memory
    references.  I added a separate knob so we do not forget about this if
    this gets supported later.

The patch does not solve the limitation of sched that fuse pairs must be
adjacent on input and the first operation must be single-set.  Fixing
single-set is easy (I have a separate patch for this); for non-adjacent
pairs we need bigger surgery.

To verify what the CPU really does I made a simple test script.

jh@ryzen3:~> cat fuse-test.c
        int b;
        const int z = 0;
        const int o = 1;
        int
main()
{
        int a = 1000000000;
        int b;
        int z = 0;
        int o = 1;
        asm volatile ("\n"
".L1234:\n"
        "nop\n"
        "subl   %3, %0\n"

        "movl %0, %1\n"
        "cmpl     %2, %1\n"
        "movl %0, %1\n"
        "test %1, %1\n"

        "nop\n"
        "jne    .L1234":"=a"(a),
        "=m"(b)
        "=r"(b)
        :
        "m"(z),
        "m"(o),
        "i"(0),
        "i"(1),
        "0"(a)
                );
}
jh@ryzen3:~> cat fuse-test.sh
EVENT=ex_ret_fused_instr
dotest()
{
gcc -O2  fuse-test.c $* -o fuse-cmp-imm-mem-nofuse
perf stat -e $EVENT ./fuse-cmp-imm-mem-nofuse  2>&1 | grep $EVENT
gcc -O2 fuse-test.c -DFUSE $* -o fuse-cmp-imm-mem-fuse
perf stat  -e $EVENT ./fuse-cmp-imm-mem-fuse 2>&1 | grep $EVENT
}

echo ALU with immediate
dotest
echo ALU with memory
dotest -D MEM
echo ALU with IP relative memory
dotest -D MEM -D IPRELATIVE
echo CMP with immediate
dotest -D CMP
echo CMP with memory
dotest -D CMP -D MEM
echo CMP with memory and immediate
dotest -D CMP -D MEMIMM
echo CMP with IP relative memory
dotest -D CMP -D MEM -D IPRELATIVE
echo TEST
dotest -D TEST

On zen5 I get:
ALU with immediate
            20,345      ex_ret_fused_instr:u
     1,000,020,278      ex_ret_fused_instr:u
ALU with memory
            20,367      ex_ret_fused_instr:u
     1,000,020,290      ex_ret_fused_instr:u
ALU with IP relative memory
            20,395      ex_ret_fused_instr:u
            20,403      ex_ret_fused_instr:u
CMP with immediate
            20,369      ex_ret_fused_instr:u
     1,000,020,301      ex_ret_fused_instr:u
CMP with memory
            20,314      ex_ret_fused_instr:u
     1,000,020,341      ex_ret_fused_instr:u
CMP with memory and immediate
            20,372      ex_ret_fused_instr:u
     1,000,020,266      ex_ret_fused_instr:u
CMP with IP relative memory
            20,382      ex_ret_fused_instr:u
            20,369      ex_ret_fused_instr:u
TEST
            20,346      ex_ret_fused_instr:u
     1,000,020,301      ex_ret_fused_instr:u

IP relative memory seems to not be documented.

On zen3/4 I get:

ALU with immediate
            20,263      ex_ret_fused_instr:u
     1,000,020,051      ex_ret_fused_instr:u
ALU with memory
            20,255      ex_ret_fused_instr:u
     1,000,020,056      ex_ret_fused_instr:u
ALU with IP relative memory
            20,253      ex_ret_fused_instr:u
            20,266      ex_ret_fused_instr:u
CMP with immediate
            20,264      ex_ret_fused_instr:u
     1,000,020,052      ex_ret_fused_instr:u
CMP with memory
            20,253      ex_ret_fused_instr:u
     1,000,019,794      ex_ret_fused_instr:u
CMP with memory and immediate
            20,260      ex_ret_fused_instr:u
            20,264      ex_ret_fused_instr:u
CMP with IP relative memory
            20,258      ex_ret_fused_instr:u
            20,256      ex_ret_fused_instr:u
TEST
            20,261      ex_ret_fused_instr:u
     1,000,020,048      ex_ret_fused_instr:u

zen1 and 2 get:

ALU with immediate
            21,610      ex_ret_fus_brnch_inst:u
            21,697      ex_ret_fus_brnch_inst:u
ALU with memory
            21,479      ex_ret_fus_brnch_inst:u
            21,747      ex_ret_fus_brnch_inst:u
ALU with IP relative memory
            21,623      ex_ret_fus_brnch_inst:u
            21,684      ex_ret_fus_brnch_inst:u
CMP with immediate
            21,708      ex_ret_fus_brnch_inst:u
     1,000,021,288      ex_ret_fus_brnch_inst:u
CMP with memory
            21,689      ex_ret_fus_brnch_inst:u
     1,000,004,270      ex_ret_fus_brnch_inst:u
CMP with memory and immediate
            21,604      ex_ret_fus_brnch_inst:u
            21,671      ex_ret_fus_brnch_inst:u
CMP with IP relative memory
            21,589      ex_ret_fus_brnch_inst:u
            21,602      ex_ret_fus_brnch_inst:u
TEST
            21,600      ex_ret_fus_brnch_inst:u
     1,000,021,233      ex_ret_fus_brnch_inst:u

I tested the patch on zen3 and zen5 with spec2k17 and it seems neutral; however,
the number of fusions does go up.

Bootstrapped/regtested x86_64-linux, I plan to commit it tomorrow.

Honza

gcc/ChangeLog:

	* config/i386/i386.h (TARGET_FUSE_ALU_AND_BRANCH_MEM): New macro.
	(TARGET_FUSE_ALU_AND_BRANCH_MEM_IMM): New macro.
	(TARGET_FUSE_ALU_AND_BRANCH_RIP_RELATIVE): New macro.
	* config/i386/x86-tune-sched.cc (ix86_fuse_mov_alu_p): Support
	non-single-set.
	(ix86_macro_fusion_pair_p): Allow ALU which only clobbers;
	be more careful about immediates; check TARGET_FUSE_ALU_AND_BRANCH_MEM,
	TARGET_FUSE_ALU_AND_BRANCH_MEM_IMM, TARGET_FUSE_ALU_AND_BRANCH_RIP_RELATIVE;
	verify that we never use unsigned checks with inc/dec.
	* config/i386/x86-tune.def (X86_TUNE_FUSE_ALU_AND_BRANCH): New tune.
	(X86_TUNE_FUSE_ALU_AND_BRANCH_MEM): New tune.
	(X86_TUNE_FUSE_ALU_AND_BRANCH_MEM_IMM): New tune.
	(X86_TUNE_FUSE_ALU_AND_BRANCH_RIP_RELATIVE): New tune.


aarch64: force operand to fresh register to avoid subreg issues [PR118892] · d883f323

Tamar Christina authored 3 weeks ago

When the input is already a subreg and we try to make a paradoxical
subreg out of it for copysign this can fail if it violates the subreg
relationship.

Use force_lowpart_subreg instead of lowpart_subreg to then force the
results to a register instead of ICEing.

gcc/ChangeLog:

	PR target/118892
	* config/aarch64/aarch64.md (copysign<GPF:mode>3): Use
	force_lowpart_subreg instead of lowpart_subreg.

gcc/testsuite/ChangeLog:

	PR target/118892
	* gcc.target/aarch64/copysign-pr118892.c: New test.


Fix folding of BIT_NOT_EXPR for POLY_INT_CST [PR118976] · 78380fd7

Richard Sandiford authored 3 weeks ago

There was an embarrassing typo in the folding of BIT_NOT_EXPR for
POLY_INT_CSTs: it used - rather than ~ on the poly_int.  Not sure
how that happened, but it might have been due to the way that
~x is implemented as -1 - x internally.
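
The difference between the two operators is easy to see on ordinary
two's-complement integers.  The sketch below uses plain int64_t values rather
than poly_int, and the helper names are invented for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Bitwise NOT written as the arithmetic identity ~x == -1 - x.  */
int64_t not_via_sub(int64_t x) { return -1 - x; }

/* Plain negation, which is what the typo effectively computed.  */
int64_t negate(int64_t x) { return -x; }

/* The two results differ by exactly one.  Callers must avoid INT64_MIN,
   for which -x and ~x + 1 overflow.  */
void check_not_identity(int64_t x)
{
    assert(not_via_sub(x) == ~x);
    assert(negate(x) == ~x + 1);
}
```

For example, ~5 is -6, while -5 is off by one.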

gcc/
	PR tree-optimization/118976
	* fold-const.cc (const_unop): Use ~ rather than - for BIT_NOT_EXPR.
	* config/aarch64/aarch64.cc (aarch64_test_sve_folding): New function.
	(aarch64_run_selftests): Run it.


Mar 03, 2025

aarch64: Ignore target pragmas while defining intrinsics · 71355700

Andrew Carlotti authored 1 month ago

Refactor the switcher classes into two separate classes:

- sve_alignment_switcher takes the alignment switching functionality,
  and is used only for ABI correctness when defining sve structure
  types.
- aarch64_target_switcher takes the rest of the functionality of
  aarch64_simd_switcher and sve_switcher, and gates simd/sve specific
  parts upon the specified feature flags.

Additionally, aarch64_target_switcher now adds dependencies of the
specified flags (which adds +fcma and +bf16 to some intrinsic
declarations), and unsets current_target_pragma.

This last change fixes an internal bug where we would sometimes add a
user specified target pragma (stored in current_target_pragma) on top of
an internally specified target architecture while initialising
intrinsics with `#pragma GCC aarch64 "arm_*.h"`.  As far as I can tell, this
has no visible impact at the moment.  However, the unintended target
feature combinations lead to unwanted behaviour in an under-development
patch.

This also fixes a missing Makefile dependency, which was due to
aarch64-sve-builtins.o incorrectly depending on the undefined $(REG_H).
The correct $(REGS_H) dependency is added to the switcher's new source
location.

gcc/ChangeLog:

	* common/config/aarch64/aarch64-common.cc
	(struct aarch64_extension_info): Add field.
	(aarch64_get_required_features): New.
	* config/aarch64/aarch64-builtins.cc
	(aarch64_simd_switcher::aarch64_simd_switcher): Rename to...
	(aarch64_target_switcher::aarch64_target_switcher): ...this,
	and extend to handle sve, nosimd and target pragmas.
	(aarch64_simd_switcher::~aarch64_simd_switcher): Rename to...
	(aarch64_target_switcher::~aarch64_target_switcher): ...this,
	and extend to handle sve, nosimd and target pragmas.
	(handle_arm_acle_h): Use aarch64_target_switcher.
	(handle_arm_neon_h): Rename switcher and pass explicit flags.
	(aarch64_general_init_builtins): Ditto.
	* config/aarch64/aarch64-protos.h
	(class aarch64_simd_switcher): Rename to...
	(class aarch64_target_switcher): ...this, and add new members.
	(aarch64_get_required_features): New prototype.
	* config/aarch64/aarch64-sve-builtins.cc
	(sve_switcher::sve_switcher): Delete
	(sve_switcher::~sve_switcher): Delete
	(sve_alignment_switcher::sve_alignment_switcher): New
	(sve_alignment_switcher::~sve_alignment_switcher): New
	(register_builtin_types): Use alignment switcher
	(init_builtins): Rename switcher.
	(handle_arm_neon_sve_bridge_h): Ditto.
	(handle_arm_sme_h): Ditto.
	(handle_arm_sve_h): Ditto, and use alignment switcher.
	* config/aarch64/aarch64-sve-builtins.h
	(class sve_switcher): Delete.
	(class sme_switcher): Delete.
	(class sve_alignment_switcher): New.
	* config/aarch64/t-aarch64 (aarch64-builtins.o): Add $(REGS_H).
	(aarch64-sve-builtins.o): Remove $(REG_H).


arm: remove some redundant zero_extend ops on thumb1 · 2a502f9e

Richard Earnshaw authored 3 weeks ago

The code in gcc.target/unsigned-extend-1.c really should not need any
unsigned extension operations when the optimizers are used.  For Arm
and thumb2 that is indeed the case, but for thumb1 code it gets more
complicated as there are too many instructions for combine to look at.
For thumb1 we end up with two redundant zero_extend patterns which are
not removed: the first after the subtract instruction and the second of
the final boolean result.

We can partially fix this (for the second case above) by adding a new
split pattern for LEU and GEU patterns.  This works because the two
instructions for the [LG]EU pattern plus the redundant extension
instruction are combined into a single insn, which we can then split
using the 3->2 method back into the two insns of the [LG]EU sequence.

Because we're missing the optimization for all thumb1 cases (not just
those architectures with UXTB), I've adjusted the testcase to detect all
the idioms that we might use for zero-extending a value, namely:

       UXTB
       AND ...#255 (in thumb1 this would require a register to hold 255)
       LSL ... #24; LSR ... #24

but I've also marked this test as XFAIL for thumb1 because we can't yet
eliminate the first of the two extend instructions.
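
The three idioms listed above all compute the same 8-bit zero-extension.  As a
rough C-level illustration (the function names are hypothetical and this is
not the testcase itself):

```c
#include <assert.h>
#include <stdint.h>

/* UXTB-style: truncate to 8 bits, then zero-extend.  */
uint32_t zext_uxtb(uint32_t x) { return (uint8_t)x; }

/* AND ... #255  */
uint32_t zext_and(uint32_t x) { return x & 255u; }

/* LSL ... #24; LSR ... #24  */
uint32_t zext_shifts(uint32_t x) { return (x << 24) >> 24; }

/* All three forms must agree for every input.  */
void check_zext_idioms(uint32_t x)
{
    assert(zext_uxtb(x) == zext_and(x));
    assert(zext_and(x) == zext_shifts(x));
}
```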

gcc/
	* config/arm/thumb1.md (split patterns for GEU and LEU): New.

gcc/testsuite:
	* gcc.target/arm/unsigned-extend-1.c: Expand check for any
	insn suggesting a zero-extend.  XFAIL for thumb1 code.


Mar 02, 2025

[RISC-V][PR target/118934] Fix ICE in RISC-V long branch support · 67e824c2

Jeff Law authored 3 weeks ago

I'm not sure if I goof'd this or if I merely upstreamed someone else's goof.
Either way the long branch code isn't working correctly.

We were using 'n' as the output modifier to negate the condition.  But 'n' has
a special meaning elsewhere, so when presented with a condition rather than
what was expected, boom, the compiler ICE'd.

Thankfully there's only a few places where we were using %n which I turned into
%r.

The BZ entry includes a good testcase, it just takes a long time to compile as
it's trying to create the out-of-range scenario.  I'm not including the
testcase due to how long it takes, but I did test it locally to ensure it's
working properly now.

I'm sure that with a little bit of work I could create a testcase that worked
before and fails with the trunk (by taking advantage of the fuzziness in length
computations).  So I'm going to consider this a regression.

Will push to the trunk after pre-commit testing does its thing.

	PR target/118934
gcc/
	* config/riscv/corev.md (cv_branch): Adjust output template.
	(branch): Likewise.
	* config/riscv/riscv.md (branch): Likewise.
	* config/riscv/riscv.cc (riscv_asm_output_opcode): Handle 'r' rather
	than 'n'.


avr: Fix up avr_print_operand diagnostics [PR118991] · 047b7f9a

Jakub Jelinek authored 3 weeks ago

As can be seen in gcc/po/gcc.pot:
 #: config/avr/avr.cc:2754
 #, c-format
 msgid "bad I/O address 0x"
 msgstr ""

exgettext couldn't retrieve the whole format string in this case,
because it uses a macro in the middle.  output_operand_lossage
is a c-format function though, so we can't use %wx to print HOST_WIDE_INT,
and HOST_WIDE_INT_PRINT_HEX_PURE is %lx on some hosts, %llx on others
and %I64x on yet others, so it isn't really translatable that way.

As Joseph mentioned in the PR, there is no easy way around this
but go through a temporary buffer, which the following patch does.
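
The temporary-buffer approach can be sketched in standalone C as follows.
This is only an illustration under assumptions: format_bad_io_address is a
hypothetical helper, snprintf stands in for output_operand_lossage, and
long long stands in for HOST_WIDE_INT.

```c
#include <assert.h>
#include <stdio.h>

/* Build the message in two steps so that the user-visible format string
   is one complete literal containing only "%s".  */
int format_bad_io_address(char *out, size_t out_size, long long ival)
{
    char buf[32];

    /* Host-specific number formatting happens here, outside the
       translatable message text.  */
    int n = snprintf(buf, sizeof buf, "%llx", (unsigned long long)ival);
    assert(n > 0 && (size_t)n < sizeof buf);

    /* The message itself no longer has a macro in the middle.  */
    return snprintf(out, out_size, "bad I/O address 0x%s", buf);
}
```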

2025-03-02  Jakub Jelinek  <jakub@redhat.com>

	PR translation/118991
	* config/avr/avr.cc (avr_print_operand): Print ival into
	a temporary buffer and use %s in output_operand_lossage to make
	the diagnostics translatable.


Mar 01, 2025

[PATCH] H8/300, libgcc: PR target/114222 For HImode call internal ffs() implementation instead of an external one · 898f22d1

Jan Dubiec authored 3 weeks ago

When INT_TYPE_SIZE < BITS_PER_WORD gcc emits a call to an external ffs()
implementation instead of a call to "__builtin_ffs()" – see function
init_optabs() in <SRCROOT>/gcc/optabs-libfuncs.cc. External ffs()
(which is usually the one from newlib) in turn calls __builtin_ffs(),
which causes infinite recursion and stack overflow. This patch overrides
the default gcc behaviour for H8/300H (and newer) and provides a generic
ffs() implementation for HImode.
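
A generic find-first-set routine for a 16-bit value that avoids the recursion
must not call ffs() or __builtin_ffs() itself.  The following is a minimal
sketch of that idea; ffs16 is a hypothetical name and this is not the code
in ffshi2.c.

```c
#include <assert.h>

/* Return 1 + the index of the least significant set bit of the low
   16 bits of x, or 0 if none of them is set.  Uses only shifts and
   masks, so it cannot recurse into a library ffs().  */
int ffs16(unsigned short x)
{
    int pos;

    x &= 0xFFFFu;               /* Only the low 16 bits are considered.  */
    if (x == 0)
        return 0;
    for (pos = 1; (x & 1u) == 0; pos++)
        x >>= 1;
    assert(pos >= 1 && pos <= 16);
    return pos;
}
```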

	PR target/114222
gcc/ChangeLog:

	* config/h8300/h8300.cc (h8300_init_libfuncs): For HImode override
	calls to external ffs() (from newlib) with calls to __ffshi2() from
	libgcc. The implementation of ffs() in newlib calls __builtin_ffs(),
	which causes infinite recursion and finally a stack overflow.

libgcc/ChangeLog:

	* config/h8300/t-h8300: Add __ffshi2().
	* config/h8300/ffshi2.c: New file.


[PATCH] H8/300: PR target/109189 Silence -Wformat warnings on Windows · 2fc17730

Jan Dubiec authored 3 weeks ago

This patch fixes annoying -Wformat warnings when gcc is built
on Windows/MinGW64. Instead of %ld it uses HOST_WIDE_INT_PRINT_DEC
macro, just like many other targets do.
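
HOST_WIDE_INT_PRINT_DEC is internal to GCC, but standard C has the same kind
of width-aware format macro in <inttypes.h>.  The sketch below uses PRId64 as
an analogy only; print_offset is a hypothetical helper, not code from
h8300.cc.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Format a 64-bit value without assuming that "long" is 64 bits wide.
   On hosts where long is 32 bits, "%ld" would not match the argument
   and would trigger -Wformat.  */
int print_offset(char *out, size_t out_size, int64_t value)
{
    /* The format macro is pasted into the literal at compile time.  */
    int n = snprintf(out, out_size, "#%" PRId64, value);
    assert(n > 0);
    return n;
}
```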

	PR target/109189
gcc/ChangeLog:

	* config/h8300/h8300.cc (h8300_print_operand): Replace %ld format
	strings with HOST_WIDE_INT_PRINT_DEC macro in order to silence
	-Wformat warnings when building on Windows/MinGW64.


Feb 28, 2025

x86: Move TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P to i386.cc · 075611b6

H.J. Lu authored 4 weeks ago


Move the TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P target hook from
i386.h to i386.cc.

	* config/i386/i386.h (TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P):
	Moved to ...
	* config/i386/i386.cc (TARGET_SMALL_REGISTER_CLASSES_FOR_MODE_P):
	Here.

Signed-off-by: H.J. Lu <hjl.tools@gmail.com>


Feb 27, 2025

RISC-V: Fix bug for expand_const_vector interleave [PR118931] · e7287cbb

Pan Li authored 1 month ago


This patch fixes a bug when expanding a const vector for the
interleave case.  For example, we have:

base1 = 151
step = 121

For vec_series, we generate a vector of the form v[i] = base + i * step.
The vec_series then has the result below for HImode, and we can see
that the result overflows into the high 8 bits of HImode.

v1.b = {151, 255, 7,  0, 119,  0, 231,  0, 87,  1, 199,  1, 55,   2, 167,   2}

That is, we expect v1.b to be:

v1.b = {151, 0, 7,  0, 119,  0, 231,  0, 87,  0, 199,  0, 55,   0, 167,   0}

After that it will perform the IOR with v2 for base2 (i.e. the other series).

v2.b =  {0,  17, 0, 33,   0, 49,   0, 65,  0, 81,   0, 97,  0, 113,   0, 129}

Unfortunately, base1 + i * step1 in HImode may overflow into the high
8 bits, and those high 8 bits will pollute v2 and result in an incorrect
value in the const_vector.

This patch performs the overflow-to-smode check before the
optimized interleave code generation.  On overflow or VLA, it falls
back to the default merge approach.
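
The pollution effect and the kind of range check that guards against it can
be illustrated with scalar C.  This is a sketch under assumptions, not the
logic in riscv-v.cc: the helper names are invented, 16-bit lanes stand in for
HImode, and 8-bit halves stand in for the smaller mode.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if every base + i * step (0 <= i < n) fits in the low 8 bits of
   a 16-bit lane, i.e. the series never spills into the high byte.  */
bool series_fits_low_byte(int base, int step, int n)
{
    for (int i = 0; i < n; i++) {
        int v = base + i * step;
        if (v < 0 || v > 0xFF)
            return false;
    }
    return true;
}

/* Lane i of the optimized interleave: the first series is computed in
   the full 16-bit lane, the second series is shifted into the high
   byte, and the two are IOR-ed together.  */
uint16_t interleave_lane(int base1, int step1, int base2, int step2, int i)
{
    assert(i >= 0);
    uint16_t lo = (uint16_t)(base1 + i * step1);
    uint16_t hi = (uint16_t)((uint16_t)(base2 + i * step2) << 8);
    return (uint16_t)(lo | hi);
}
```

If the first series overflows 8 bits, its carry lands in the byte that belongs
to the second series, so a caller would check series_fits_low_byte first and
otherwise use a slower but safe construction.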

The following test suites pass with this patch:
* The full rv64gcv regression test.

	PR target/118931

gcc/ChangeLog:

	* config/riscv/riscv-v.cc (expand_const_vector): Add overflow to
	smode check and clean up highest bits if overflow.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/pr118931-run-1.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>


nvptx: '#define MAX_FIXED_MODE_SIZE 128' · e333ad4e

Thomas Schwinge authored 3 weeks ago

... instead of 64 via 'gcc/defaults.h':

    MAX_FIXED_MODE_SIZE GET_MODE_BITSIZE (DImode)

This fixes ICEs:

    [-FAIL: c-c++-common/pr111309-1.c  -Wc++-compat  (internal compiler error: in expand_fn_using_insn, at internal-fn.cc:268)-]
    [-FAIL:-]{+PASS:+} c-c++-common/pr111309-1.c  -Wc++-compat  (test for excess errors)
    [-UNRESOLVED:-]{+PASS:+} c-c++-common/pr111309-1.c  -Wc++-compat  [-compilation failed to produce executable-]{+execution test+}

    [-FAIL: c-c++-common/pr111309-1.c  -std=gnu++17 (internal compiler error: in expand_fn_using_insn, at internal-fn.cc:268)-]
    [-FAIL:-]{+PASS:+} c-c++-common/pr111309-1.c  -std=gnu++17 (test for excess errors)
    [-UNRESOLVED:-]{+PASS:+} c-c++-common/pr111309-1.c  -std=gnu++17 [-compilation failed to produce executable-]{+execution test+}
    [-FAIL: c-c++-common/pr111309-1.c  -std=gnu++26 (internal compiler error: in expand_fn_using_insn, at internal-fn.cc:268)-]
    [-FAIL:-]{+PASS:+} c-c++-common/pr111309-1.c  -std=gnu++26 (test for excess errors)
    [-UNRESOLVED:-]{+PASS:+} c-c++-common/pr111309-1.c  -std=gnu++26 [-compilation failed to produce executable-]{+execution test+}
    [-FAIL: c-c++-common/pr111309-1.c  -std=gnu++98 (internal compiler error: in expand_fn_using_insn, at internal-fn.cc:268)-]
    [-FAIL:-]{+PASS:+} c-c++-common/pr111309-1.c  -std=gnu++98 (test for excess errors)
    [-UNRESOLVED:-]{+PASS:+} c-c++-common/pr111309-1.c  -std=gnu++98 [-compilation failed to produce executable-]{+execution test+}

    [-FAIL: gcc.dg/torture/pr116480-1.c   -O0  (internal compiler error: in expand_fn_using_insn, at internal-fn.cc:268)-]
    [-FAIL:-]{+PASS:+} gcc.dg/torture/pr116480-1.c   -O0  (test for excess errors)
    [-FAIL: gcc.dg/torture/pr116480-1.c   -O1  (internal compiler error: in expand_fn_using_insn, at internal-fn.cc:268)-]
    [-FAIL:-]{+PASS:+} gcc.dg/torture/pr116480-1.c   -O1  (test for excess errors)
    PASS: gcc.dg/torture/pr116480-1.c   -O2  (test for excess errors)
    PASS: gcc.dg/torture/pr116480-1.c   -O3 -g  (test for excess errors)
    PASS: gcc.dg/torture/pr116480-1.c   -Os  (test for excess errors)

..., where we ran into 'gcc_assert (icode != CODE_FOR_nothing);' in
'gcc/internal-fn.cc:expand_fn_using_insn' for '__int128' '__builtin_clzg' etc.:

    during RTL pass: expand
    [...]/c-c++-common/pr111309-1.c: In function 'clzI':
    [...]/c-c++-common/pr111309-1.c:69:10: internal compiler error: in expand_fn_using_insn, at internal-fn.cc:268
    0x120ec2cf internal_error(char const*, ...)
            [...]/gcc/diagnostic-global-context.cc:517
    0x102c7c5b fancy_abort(char const*, int, char const*)
            [...]/gcc/diagnostic.cc:1722
    0x109708eb expand_fn_using_insn
            [...]/gcc/internal-fn.cc:268
    0x1098114f expand_internal_call(internal_fn, gcall*)
            [...]/gcc/internal-fn.cc:5273
    0x1098114f expand_internal_call(gcall*)
            [...]/gcc/internal-fn.cc:5281
    0x10594fc7 expand_call_stmt
            [...]/gcc/cfgexpand.cc:3049
    [...]

Likewise, as of commit e8ad697a
"libstdc++: Use new type-generic built-ins in <bit> [PR118855]",
the libstdc++ target library build ICEd in the same way.

Additionally, this change fixes:

    [-FAIL:-]{+PASS:+} gcc.dg/pr105094.c (test for excess errors)

..., which was:

    [...]/gcc.dg/pr105094.c: In function 'foo':
    [...]/gcc.dg/pr105094.c:11:12: error: size of variable 's' is too large

And, finally, regarding 'gcc.target/nvptx/stack_frame-1.c'.  Before, in
'gcc/cfgexpand.cc': 'expand_used_vars' -> 'expand_used_vars_for_block' ->
'expand_one_var' for 'ww' -> 'gcc/function.cc:use_register_for_decl' due to
'DECL_MODE (decl) == BLKmode' did 'return false;', thus -> 'add_stack_var'
(even if 'ww' wasn't then actually living on the stack).  Now, 'ww' has
'TImode' and 'use_register_for_decl' does 'return true;', thus ->
'expand_one_register_var', and therefore no unused stack frame emitted.

	gcc/
	* config/nvptx/nvptx.h (MAX_FIXED_MODE_SIZE): '#define'.
	gcc/testsuite/
	* gcc.target/nvptx/stack_frame-1.c: Adjust.


nvptx: Support '-mfake-ptx-alloca' · 1146410c

Thomas Schwinge authored 4 weeks ago

With '-mfake-ptx-alloca' enabled, the user-visible behavior changes only
for configurations where PTX 'alloca' is not available.  Rather than a
compile-time 'sorry, unimplemented: dynamic stack allocation not supported'
in presence of dynamic stack allocation, compilation and assembly then
succeeds.  However, attempting to link in such '*.o' files then fails due
to unresolved symbol '__GCC_nvptx__PTX_alloca_not_supported'.

This is meant to be used in scenarios where large volumes of code are
compiled, a small fraction of which runs into dynamic stack allocation, but
these parts are not important for specific use cases, and we'd thus like the
build to succeed, and error out just upon actual, very rare use of the
offending '*.o' files.

	gcc/
	* config/nvptx/nvptx.opt (-mfake-ptx-alloca): New.
	* config/nvptx/nvptx-protos.h (nvptx_output_fake_ptx_alloca):
	Declare.
	* config/nvptx/nvptx.cc (nvptx_output_fake_ptx_alloca): New.
	* config/nvptx/nvptx.md (define_insn "@nvptx_alloca_<mode>")
	[!(TARGET_PTX_7_3 && TARGET_SM52)]: Use it for
	'-mfake-ptx-alloca'.
	gcc/testsuite/
	* gcc.target/nvptx/alloca-1-O0_-mfake-ptx-alloca.c: New.
	* gcc.target/nvptx/alloca-2-O0_-mfake-ptx-alloca.c: Likewise.
	* gcc.target/nvptx/alloca-4-O3_-mfake-ptx-alloca.c: Likewise.
	* gcc.target/nvptx/vla-1-O0_-mfake-ptx-alloca.c: Likewise.
	* gcc.target/nvptx/alloca-4-O3.c:
	'dg-additional-options -mfake-ptx-alloca'.


nvptx: Delay 'sorry, unimplemented: dynamic stack allocation not supported' from expansion time to code generation · 22e76700

Thomas Schwinge authored 4 weeks ago

This gives the back end a chance to clean out a few more unnecessary instances
of dynamic stack allocation.  This progresses:

    PASS: gcc.dg/pr78902.c  (test for warnings, line 7)
    PASS: gcc.dg/pr78902.c  (test for warnings, line 8)
    PASS: gcc.dg/pr78902.c  (test for warnings, line 9)
    PASS: gcc.dg/pr78902.c  (test for warnings, line 10)
    PASS: gcc.dg/pr78902.c  (test for warnings, line 11)
    PASS: gcc.dg/pr78902.c  (test for warnings, line 12)
    PASS: gcc.dg/pr78902.c  (test for warnings, line 13)
    PASS: gcc.dg/pr78902.c strndup excessive bound at line 14 (test for warnings, line 13)
    [-UNSUPPORTED: gcc.dg/pr78902.c: dynamic stack allocation not supported-]
    {+PASS: gcc.dg/pr78902.c (test for excess errors)+}

    UNSUPPORTED: gcc.dg/torture/pr71901.c   -O0 : dynamic stack allocation not supported
    [-UNSUPPORTED:-]{+PASS:+} gcc.dg/torture/pr71901.c   -O1  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}
    UNSUPPORTED: gcc.dg/torture/pr71901.c   -O2 : dynamic stack allocation not supported
    UNSUPPORTED: gcc.dg/torture/pr71901.c   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions : dynamic stack allocation not supported
    UNSUPPORTED: gcc.dg/torture/pr71901.c   -O3 -g : dynamic stack allocation not supported
    [-UNSUPPORTED:-]{+PASS:+} gcc.dg/torture/pr71901.c   -Os  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}

    UNSUPPORTED: gcc.dg/torture/pr78742.c   -O0 : dynamic stack allocation not supported
    [-UNSUPPORTED:-]{+PASS:+} gcc.dg/torture/pr78742.c   -O1  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}
    [-UNSUPPORTED:-]{+PASS:+} gcc.dg/torture/pr78742.c   -O2  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}
    [-UNSUPPORTED:-]{+PASS:+} gcc.dg/torture/pr78742.c   -O3 -fomit-frame-pointer -funroll-loops -fpeel-loops -ftracer -finline-functions  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}
    [-UNSUPPORTED:-]{+PASS:+} gcc.dg/torture/pr78742.c   -O3 -g  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}
    UNSUPPORTED: gcc.dg/torture/pr78742.c   -Os : dynamic stack allocation not supported

    [-UNSUPPORTED:-]{+PASS:+} gfortran.dg/pr101267.f90   -O  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}

    [-UNSUPPORTED:-]{+PASS:+} gfortran.dg/pr112404.f90   -O  [-: dynamic stack allocation not supported-]{+(test for excess errors)+}

	gcc/
	* config/nvptx/nvptx.md (define_expand "allocate_stack")
	[!TARGET_SOFT_STACK]: Move
	'sorry ("dynamic stack allocation not supported");'...
	(define_insn "@nvptx_alloca_<mode>"): ... here.
	gcc/testsuite/
	* gcc.target/nvptx/alloca-1-unused-O0-sm_30.c: Adjust.


i386: Treat Granite Rapids/Granite Rapids-D/Diamond Rapids similar as Sapphire Rapids in x86-tune.def · 44c4a720

Haochen Jiang authored 3 weeks ago

Since GNR, GNR-D and DMR are all P-core based, we should treat them
just like SPR for now.

gcc/ChangeLog:

	* config/i386/x86-tune.def
	(X86_TUNE_DEST_FALSE_DEP_FOR_GLC): Add GNR, GNR-D, DMR.
	(X86_TUNE_AVOID_256FMA_CHAINS): Ditto.
	(X86_TUNE_AVX512_MOVE_BY_PIECES): Ditto.
	(X86_TUNE_AVX512_STORE_BY_PIECES): Ditto.

44c4a720

Feb 26, 2025

arm: Fix up REVERSE_CONDITION macro [PR119002] · 40bf0770

Jakub Jelinek authored 3 weeks ago

The Linaro CI found that my PR119002 patch broke bootstrap on arm.
The problem seems to be that arm.h has an incorrect REVERSE_CONDITION
macro definition.
All other targets' REVERSE_CONDITION definitions and the default one
just use the macro's arguments, while the arm.h definition uses the MODE
argument but uses code instead of CODE (the first argument).
This happened to work because before my patch the only use of the
macro was in jump.cc with
  /* First see if machine description supplies us way to reverse the
     comparison.  Give it priority over everything else to allow
     machine description to do tricks.  */
  if (GET_MODE_CLASS (mode) == MODE_CC
      && REVERSIBLE_CC_MODE (mode))
    return REVERSE_CONDITION (code, mode);
but in my patch it is used with GT rather than code.
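
The failure mode can be reproduced in miniature.  The sketch below is an
illustration only: cmp_code, reverse, REVERSE_BUGGY and REVERSE_FIXED are
hypothetical stand-ins, not the arm.h or jump.cc definitions.  When a
variable named code happens to be in scope, the buggy macro silently
reverses that variable instead of its first argument; when no such variable
exists, the use does not compile at all.

```c
/* Hypothetical comparison codes; GCC's real enum rtx_code is much larger.  */
enum cmp_code { CMP_EQ, CMP_NE, CMP_GT, CMP_LE };

/* Return the comparison that is true exactly when C is false.  */
static enum cmp_code
reverse (enum cmp_code c)
{
  switch (c)
    {
    case CMP_EQ: return CMP_NE;
    case CMP_NE: return CMP_EQ;
    case CMP_GT: return CMP_LE;
    default:     return CMP_GT;
    }
}

/* Buggy shape: the body names a variable `code' rather than the macro
   parameter CODE, so the first argument is ignored.  */
#define REVERSE_BUGGY(CODE, MODE) ((void) (MODE), reverse (code))

/* Fixed shape: the body uses the parameter.  */
#define REVERSE_FIXED(CODE, MODE) ((void) (MODE), reverse (CODE))

/* The caller asks for the reverse of CMP_GT while a local `code' exists.  */
static enum cmp_code
demo_buggy (enum cmp_code code)
{
  return REVERSE_BUGGY (CMP_GT, 0);	/* Reverses `code', not CMP_GT.  */
}

static enum cmp_code
demo_fixed (enum cmp_code code)
{
  (void) code;
  return REVERSE_FIXED (CMP_GT, 0);	/* Reverses CMP_GT.  */
}
```

With code set to CMP_EQ, demo_buggy yields CMP_NE while demo_fixed yields
CMP_LE, which is the reverse of the GT that was actually requested.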

2025-02-26  Jakub Jelinek  <jakub@redhat.com>

	PR rtl-optimization/119002
	* config/arm/arm.h (REVERSE_CONDITION): Use CODE - the macro
	argument - in the macro rather than code.

40bf0770

Feb 25, 2025

pru: Fix pru_pragma_ctable_entry diagnostics [PR118991] · 0bb431d0

Jakub Jelinek authored 4 weeks ago

HOST_WIDE_INT_PRINT* macros aren't supposed to be used in
gcc-internal-format format strings; we have the w modifier for HOST_WIDE_INT
in that case.  The HOST_WIDE_INT_PRINT* macros might not work properly on
some hosts (e.g. mingw32 has HOST_LONG_LONG_FORMAT "I64" and that is
something pretty-print doesn't handle, while it handles "ll" for long long),
and the use of macros in the middle of format strings also breaks
translations (both that exgettext can't retrieve the string from there
and we get
 #: config/pru/pru-pragma.cc:61
 msgid "%<CTABLE_ENTRY%> index %"
 msgstr ""

 #: config/pru/pru-pragma.cc:64
 msgid "redefinition of %<CTABLE_ENTRY %"
 msgstr ""
in po/gcc.pot and also the macros are different on different hosts,
so even if exgettext extracted say "%<CTABLE_ENTRY%> index %lld is not valid"
it could be translated on some hosts but not e.g. mingw32).

So, the following patch just uses %wd instead.
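
The string-splitting problem is not specific to GCC diagnostics.  The sketch
below uses plain snprintf with PRId64 from <inttypes.h> as a stand-in for
HOST_WIDE_INT_PRINT; %wd itself is only understood by GCC's internal
pretty-printer and is not used here.  The function names are hypothetical.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The format is assembled from three adjacent pieces; the middle one is a
   macro whose expansion differs between hosts.  A tool that extracts string
   literals for translation sees only the first piece, "index %".  */
static int
format_with_macro (char *buf, size_t n, int64_t v)
{
  return snprintf (buf, n, "index %" PRId64 " is not valid", v);
}

/* The whole format is one literal, so it can be extracted as a unit.  */
static int
format_single_literal (char *buf, size_t n, long long v)
{
  return snprintf (buf, n, "index %lld is not valid", v);
}
```

Both functions print the same text; only the second keeps the message in a
single literal that stays identical on every host.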

Tested it before/after the patch on
 #pragma ctable_entry 12 0x48040000
 #pragma ctable_entry 1024 0x48040000
 #pragma ctable_entry 12 0x48040001
and the result is the same.

2025-02-25  Jakub Jelinek  <jakub@redhat.com>

	PR translation/118991
	* config/pru/pru-pragma.cc (pru_pragma_ctable_entry): Use %wd
	instead of %" HOST_WIDE_INT_PRINT "d to print a hwi in error.

0bb431d0

d/i386: Add CET TargetInfo key and predefined version [PR118654] · c17044e5

Iain Buclaw authored 4 weeks ago

Adds a new i386 d_target_info_spec entry to handle requests for
`__traits(getTargetInfo, "CET")', and adds the predefined target version
`GNU_CET' when the option `-fcf-protection' is used.

Both TargetInfo key and predefined version have been added to the D
front-end documentation.

In the library, `GNU_CET' replaces the existing use of the user-defined
version flag `CET' when building libphobos.

	PR d/118654

gcc/ChangeLog:

	* config/i386/i386-d.cc (ix86_d_target_versions): Predefine GNU_CET.
	(ix86_d_handle_target_cf_protection): New.
	(ix86_d_register_target_info): Add 'CET' TargetInfo key.

gcc/d/ChangeLog:

	* implement-d.texi: Document CET version and traits key.

libphobos/ChangeLog:

	* Makefile.in: Regenerate.
	* configure: Regenerate.
	* configure.ac: Remove CET_DFLAGS.
	* libdruntime/Makefile.am: Replace CET_DFLAGS with CET_FLAGS.
	* libdruntime/Makefile.in: Regenerate.
	* libdruntime/core/thread/fiber/package.d: Replace CET with GNU_CET.
	* src/Makefile.am: Replace CET_DFLAGS with CET_FLAGS.
	* src/Makefile.in: Regenerate.
	* testsuite/Makefile.in: Regenerate.
	* testsuite/testsuite_flags.in: Replace CET_DFLAGS with CET_FLAGS.

gcc/testsuite/ChangeLog:

	* gdc.dg/target/i386/i386.exp: New test.
	* gdc.dg/target/i386/targetinfo_CET.d: New test.

c17044e5

Feb 24, 2025

RISC-V: Include pattern stmts for dynamic LMUL computation [PR114516]. · 6be1b9e9

Robin Dapp authored 1 month ago

When scanning for program points, i.e. vector statements, we're missing
pattern statements.  In PR114516 this becomes obvious as we choose
LMUL=8 assuming there are only three statements but the divmod pattern
adds another three.  Those push us beyond four registers so we need to
switch to LMUL=4.

This patch adds pattern statements to the program points which helps
calculate a better register pressure estimate.
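
The arithmetic behind the LMUL=8 versus LMUL=4 choice can be sketched as
follows.  This is a deliberately simplified model, not the logic in
riscv-vector-costs.cc: it assumes 32 vector registers, assumes every counted
statement contributes one simultaneously live value, and the name pick_lmul
is hypothetical.

```c
#define NUM_VECTOR_REGS 32

/* Return the largest LMUL in {8, 4, 2, 1} at which LIVE_VALUES values,
   each occupying LMUL registers, still fit in the register file.  */
static int
pick_lmul (int live_values)
{
  for (int lmul = 8; lmul > 1; lmul /= 2)
    if (live_values * lmul <= NUM_VECTOR_REGS)
      return lmul;
  return 1;
}
```

Under these assumptions three live values fit at LMUL=8 (24 registers), but
six do not (48 registers), so the model drops to LMUL=4.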

	PR target/114516

gcc/ChangeLog:

	* config/riscv/riscv-vector-costs.cc (compute_estimated_lmul):
	Add pattern statements to program points.

gcc/testsuite/ChangeLog:

	* gcc.dg/vect/costmodel/riscv/rvv/pr114516.c: New test.

6be1b9e9

RISC-V: Fix .cfi_offset directive when push/pop in zcmp · 4dcd3c77

Lino Hsing-Yu Peng authored 1 month ago

The incorrect cfi directive info breaks stack unwind in try/catch/cxa.

Before patch:
  cm.push	{ra, s0-s2}, -16
  .cfi_offset 1, -12
  .cfi_offset 8, -8
  .cfi_offset 18, -4

After patch:
  cm.push	{ra, s0-s2}, -16
  .cfi_offset 1, -16
  .cfi_offset 8, -12
  .cfi_offset 9, -8
  .cfi_offset 18, -4
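
The "After patch" numbers follow a simple rule, sketched below under the
assumption of 4-byte slots (RV32) and of one slot per pushed register with
no extra padding.  The struct and function names are hypothetical; the
register numbers are the x-register numbers of ra, s0, s1 and s2.

```c
#include <stddef.h>

struct cfi_entry
{
  int dwarf_reg;	/* DWARF register number, e.g. 1 for ra.  */
  int offset;		/* Offset from the CFA in bytes.  */
};

/* Fill OUT with one entry per register in REGS, listed in the order of
   the push list.  The first register gets the lowest slot,
   -(N * SLOT_BYTES), and the last gets -SLOT_BYTES.  Returns the number
   of entries written.  */
static size_t
zcmp_cfi_offsets (const int *regs, size_t n, int slot_bytes,
		  struct cfi_entry *out)
{
  for (size_t i = 0; i < n; i++)
    {
      out[i].dwarf_reg = regs[i];
      out[i].offset = -(int) (n - i) * slot_bytes;
    }
  return n;
}
```

For the list {1, 8, 9, 18} this gives -16, -12, -8 and -4, with an entry
for register 9 (s1) that the "Before patch" output lacked.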

gcc/ChangeLog:

	* config/riscv/riscv.cc: Set multi push regs bits.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/zcmp_push_gpr.c: New test.

4dcd3c77

Feb 22, 2025

BPF, nvptx: Standardize on 'sorry, unimplemented: dynamic stack allocation not supported' · 2abc942f

Thomas Schwinge authored 1 month ago

... instead of BPF: 'error: BPF does not support dynamic stack allocation', and
nvptx: 'sorry, unimplemented: target cannot support alloca'.

	gcc/
	* config/bpf/bpf.md (define_expand "allocate_stack"): Emit
	'sorry, unimplemented: dynamic stack allocation not supported'.
	* config/nvptx/nvptx.md (define_expand "allocate_stack")
	[!TARGET_SOFT_STACK && !(TARGET_PTX_7_3 && TARGET_SM52)]: Likewise.
	gcc/testsuite/
	* gcc.target/bpf/diag-alloca-1.c: Adjust 'dg-message'.
	* gcc.target/bpf/diag-alloca-2.c: Likewise.
	* gcc.target/nvptx/alloca-1-sm_30.c: Likewise.
	* gcc.target/nvptx/vla-1-sm_30.c: Likewise.
	* lib/target-supports.exp (proc check_effective_target_alloca):
	Adjust comment.

2abc942f

Feb 20, 2025

aarch64: Remove old aarch64_expand_sve_vec_cmp_float code · d7ff3142

Richard Sandiford authored 1 month ago

While looking at PR118956, I noticed that we had some dead code
left over after the removal of the vcond patterns.  The can_invert_p
path is no longer used.

gcc/
	* config/aarch64/aarch64-protos.h (aarch64_expand_sve_vec_cmp_float):
	Remove can_invert_p argument and change return type to void.
	* config/aarch64/aarch64.cc (aarch64_expand_sve_vec_cmp_float):
	Likewise.
	* config/aarch64/aarch64-sve.md (vec_cmp<mode><vpred>): Update call
	accordingly.

d7ff3142

Revert "x86: Properly find the maximum stack slot alignment" · 6921c93d

H.J. Lu authored 1 month ago

This reverts commit 11902be7.

6921c93d

Revert "i386: Simplify PARALLEL RTX scan in ix86_find_all_reg_use" · 0312d11b

H.J. Lu authored 1 month ago

This reverts commit 565d4e75.

0312d11b

Feb 19, 2025

LoongArch: Use normal RTL pattern instead of UNSPEC for {x,}vsr{a,l}ri instructions · 42738604

Xi Ruoyao authored 1 month ago

Allow (t + (1ul << imm >> 1)) >> imm to be recognized as a rounding
shift operation.
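
As a scalar illustration of that expression, here is a minimal sketch for
one unsigned 64-bit element.  The function name is hypothetical, the vector
instructions apply the same operation per element, and the arithmetic-shift
variant is not modelled.

```c
#include <stdint.h>

/* Shift T right by IMM bits (0 <= IMM <= 63), rounding to nearest by first
   adding half of the weight of the last bit shifted out.  */
static uint64_t
round_shift_right (uint64_t t, unsigned imm)
{
  return (t + ((UINT64_C (1) << imm) >> 1)) >> imm;
}
```

For example, 5 >> 1 truncates to 2, while round_shift_right (5, 1) gives 3.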

gcc/ChangeLog:

	* config/loongarch/lasx.md (UNSPEC_LASX_XVSRARI): Remove.
	(UNSPEC_LASX_XVSRLRI): Remove.
	(lasx_xvsrari_<lsxfmt>): Remove.
	(lasx_xvsrlri_<lsxfmt>): Remove.
	* config/loongarch/lsx.md (UNSPEC_LSX_VSRARI): Remove.
	(UNSPEC_LSX_VSRLRI): Remove.
	(lsx_vsrari_<lsxfmt>): Remove.
	(lsx_vsrlri_<lsxfmt>): Remove.
	* config/loongarch/simd.md (simd_<optab>_imm_round_<mode>): New
	define_insn.
	(<simd_isa>_<x>v<insn>ri_<simdfmt>): New define_expand.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/vect-shift-imm-round.c: New test.

42738604

LoongArch: Implement [su]dot_prod* for LSX and LASX modes · cef5f23a

Xi Ruoyao authored 2 months ago

Although it's just a special case of "a widening product of which the
result is used for reduction," having these standard names allows the
dot product pattern to be recognized earlier, which may be beneficial to
optimization.  This also fixes some test failures in these test cases:

- gcc.dg/vect/vect-reduc-chain-2.c
- gcc.dg/vect/vect-reduc-chain-3.c
- gcc.dg/vect/vect-reduc-chain-dot-slp-3.c
- gcc.dg/vect/vect-reduc-chain-dot-slp-4.c
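
For reference, the kind of source loop that a dot product pattern describes
looks like the sketch below.  The function name is hypothetical and this is
not one of the listed test cases.

```c
#include <stddef.h>
#include <stdint.h>

/* Widening multiply of 16-bit elements, accumulated into a 32-bit sum.  */
static int32_t
dot_i16 (const int16_t *a, const int16_t *b, size_t n)
{
  int32_t acc = 0;
  for (size_t i = 0; i < n; i++)
    acc += (int32_t) a[i] * (int32_t) b[i];
  return acc;
}
```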

gcc/ChangeLog:

	* config/loongarch/simd.md (wvec_half): New define_mode_attr.
	(<su>dot_prod<wvec_half><mode>): New define_expand.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/wide-mul-reduc-2.c (dg-final): Scan
	DOT_PROD_EXPR in optimized tree.

cef5f23a

LoongArch: Implement vec_widen_mult_{even,odd}_* for LSX and LASX modes · 7c54e46b

Xi Ruoyao authored 2 months ago

Since PR116142 has been fixed, now we can add the standard names so the
compiler will generate better code if the result of a widening
product is reduced.

gcc/ChangeLog:

	* config/loongarch/simd.md (even_odd): New define_int_attr.
	(vec_widen_<su>mult_<even_odd>_<mode>): New define_expand.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/wide-mul-reduc-1.c: New test.
	* gcc.target/loongarch/wide-mul-reduc-2.c: New test.

7c54e46b

LoongArch: Simplify lsx_vpick description · 7dda6715

Xi Ruoyao authored 1 month ago

Like what we've done for {lsx_,lasx_x}v{add,sub,mul}w{ev,od}, use
special predicates instead of hard-coded const vectors.

This is not suitable for LASX where lasx_xvpick has different
semantics.

gcc/ChangeLog:

	* config/loongarch/simd.md (LVEC): New define_mode_attr.
	(simdfmt_as_i): Make it same as simdfmt for integer vector
	modes.
	(_f): New define_mode_attr.
	* config/loongarch/lsx.md (lsx_vpickev_b): Remove.
	(lsx_vpickev_h): Remove.
	(lsx_vpickev_w): Remove.
	(lsx_vpickev_w_f): Remove.
	(lsx_vpickod_b): Remove.
	(lsx_vpickod_h): Remove.
	(lsx_vpickod_w): Remove.
	(lsx_vpickod_w_f): Remove.
	(lsx_pick_evod_<mode>): New define_insn.
	(lsx_<x>vpick<ev_od>_<simdfmt_as_i><_f>): New
	define_expand.

7dda6715

LoongArch: Simplify {lsx_,lasx_x}vmaddw description · f727a4c5

Xi Ruoyao authored 7 months ago

Like what we've done for {lsx_,lasx_x}v{add,sub,mul}w{ev,od}, use
special predicates and TImode RTL instead of hard-coded const vectors
and UNSPECs.

Also reorder two operands of the outer plus in the template, so combine
will recognize {x,}vadd + {x,}vmulw{ev,od} => {x,}vmaddw{ev,od}.

gcc/ChangeLog:

	* config/loongarch/lasx.md (UNSPEC_LASX_XVMADDWEV): Remove.
	(UNSPEC_LASX_XVMADDWEV2): Remove.
	(UNSPEC_LASX_XVMADDWEV3): Remove.
	(UNSPEC_LASX_XVMADDWOD): Remove.
	(UNSPEC_LASX_XVMADDWOD2): Remove.
	(UNSPEC_LASX_XVMADDWOD3): Remove.
	(lasx_xvmaddwev_h_b<u>): Remove.
	(lasx_xvmaddwev_w_h<u>): Remove.
	(lasx_xvmaddwev_d_w<u>): Remove.
	(lasx_xvmaddwev_q_d): Remove.
	(lasx_xvmaddwod_h_b<u>): Remove.
	(lasx_xvmaddwod_w_h<u>): Remove.
	(lasx_xvmaddwod_d_w<u>): Remove.
	(lasx_xvmaddwod_q_d): Remove.
	(lasx_xvmaddwev_q_du): Remove.
	(lasx_xvmaddwod_q_du): Remove.
	(lasx_xvmaddwev_h_bu_b): Remove.
	(lasx_xvmaddwev_w_hu_h): Remove.
	(lasx_xvmaddwev_d_wu_w): Remove.
	(lasx_xvmaddwev_q_du_d): Remove.
	(lasx_xvmaddwod_h_bu_b): Remove.
	(lasx_xvmaddwod_w_hu_h): Remove.
	(lasx_xvmaddwod_d_wu_w): Remove.
	(lasx_xvmaddwod_q_du_d): Remove.
	* config/loongarch/lsx.md (UNSPEC_LSX_VMADDWEV): Remove.
	(UNSPEC_LSX_VMADDWEV2): Remove.
	(UNSPEC_LSX_VMADDWEV3): Remove.
	(UNSPEC_LSX_VMADDWOD): Remove.
	(UNSPEC_LSX_VMADDWOD2): Remove.
	(UNSPEC_LSX_VMADDWOD3): Remove.
	(lsx_vmaddwev_h_b<u>): Remove.
	(lsx_vmaddwev_w_h<u>): Remove.
	(lsx_vmaddwev_d_w<u>): Remove.
	(lsx_vmaddwev_q_d): Remove.
	(lsx_vmaddwod_h_b<u>): Remove.
	(lsx_vmaddwod_w_h<u>): Remove.
	(lsx_vmaddwod_d_w<u>): Remove.
	(lsx_vmaddwod_q_d): Remove.
	(lsx_vmaddwev_q_du): Remove.
	(lsx_vmaddwod_q_du): Remove.
	(lsx_vmaddwev_h_bu_b): Remove.
	(lsx_vmaddwev_w_hu_h): Remove.
	(lsx_vmaddwev_d_wu_w): Remove.
	(lsx_vmaddwev_q_du_d): Remove.
	(lsx_vmaddwod_h_bu_b): Remove.
	(lsx_vmaddwod_w_hu_h): Remove.
	(lsx_vmaddwod_d_wu_w): Remove.
	(lsx_vmaddwod_q_du_d): Remove.
	* config/loongarch/simd.md (simd_maddw_evod_<mode>_<su>):
	New define_insn.
	(<simd_isa>_<x>vmaddw<ev_od>_<simdfmt_w>_<simdfmt><u>): New
	define_expand.
	(simd_maddw_evod_<mode>_hetero): New define_insn.
	(<simd_isa>_<x>vmaddw<ev_od>_<simdfmt_w>_<simdfmt>u_<simdfmt>):
	New define_expand.
	(<simd_isa>_maddw<ev_od>_q_d<u>_punned): New define_expand.
	(<simd_isa>_maddw<ev_od>_q_du_d_punned): New define_expand.
	* config/loongarch/loongarch-builtins.cc
	(CODE_FOR_lsx_vmaddwev_q_d): Define as a macro to override it
	with the punned expand.
	(CODE_FOR_lsx_vmaddwev_q_du): Likewise.
	(CODE_FOR_lsx_vmaddwev_q_du_d): Likewise.
	(CODE_FOR_lsx_vmaddwod_q_d): Likewise.
	(CODE_FOR_lsx_vmaddwod_q_du): Likewise.
	(CODE_FOR_lsx_vmaddwod_q_du_d): Likewise.
	(CODE_FOR_lasx_xvmaddwev_q_d): Likewise.
	(CODE_FOR_lasx_xvmaddwev_q_du): Likewise.
	(CODE_FOR_lasx_xvmaddwev_q_du_d): Likewise.
	(CODE_FOR_lasx_xvmaddwod_q_d): Likewise.
	(CODE_FOR_lasx_xvmaddwod_q_du): Likewise.
	(CODE_FOR_lasx_xvmaddwod_q_du_d): Likewise.

f727a4c5

LoongArch: Simplify {lsx_,lasx_x}vh{add,sub}w description · 2ca759fc

Xi Ruoyao authored 7 months ago

Like what we've done for {lsx_,lasx_x}v{add,sub,mul}w{ev,od}, use
special predicates and TImode RTL instead of hard-coded const vectors
and UNSPECs.

gcc/ChangeLog:

	* config/loongarch/lasx.md (UNSPEC_LASX_XVHADDW_Q_D): Remove.
	(UNSPEC_LASX_XVHSUBW_Q_D): Remove.
	(UNSPEC_LASX_XVHADDW_QU_DU): Remove.
	(UNSPEC_LASX_XVHSUBW_QU_DU): Remove.
	(lasx_xvh<addsub:optab>w_h<u>_b<u>): Remove.
	(lasx_xvh<addsub:optab>w_w<u>_h<u>): Remove.
	(lasx_xvh<addsub:optab>w_d<u>_w<u>): Remove.
	(lasx_xvhaddw_q_d): Remove.
	(lasx_xvhsubw_q_d): Remove.
	(lasx_xvhaddw_qu_du): Remove.
	(lasx_xvhsubw_qu_du): Remove.
	(reduc_plus_scal_v4di): Call gen_lasx_haddw_q_d_punned instead
	of gen_lasx_xvhaddw_q_d.
	(reduc_plus_scal_v8si): Likewise.
	* config/loongarch/lsx.md (UNSPEC_LSX_VHADDW_Q_D): Remove.
	(UNSPEC_LSX_VHSUBW_Q_D): Remove.
	(UNSPEC_LSX_VHADDW_QU_DU): Remove.
	(UNSPEC_LSX_VHSUBW_QU_DU): Remove.
	(lsx_vh<addsub:optab>w_h<u>_b<u>): Remove.
	(lsx_vh<addsub:optab>w_w<u>_h<u>): Remove.
	(lsx_vh<addsub:optab>w_d<u>_w<u>): Remove.
	(lsx_vhaddw_q_d): Remove.
	(lsx_vhsubw_q_d): Remove.
	(lsx_vhaddw_qu_du): Remove.
	(lsx_vhsubw_qu_du): Remove.
	(reduc_plus_scal_v2di): Change the temporary register mode to
	V1TI, and pun the mode calling gen_vec_extractv2didi.
	(reduc_plus_scal_v4si): Change the temporary register mode to
	V1TI.
	* config/loongarch/simd.md (simd_h<optab>w_<mode>_<su>): New
	define_insn.
	(<simd_isa>_<x>vh<optab>w_<simdfmt_w><u>_<simdfmt><u>): New
	define_expand.
	(<simd_isa>_h<optab>w_q<u>_d<u>_punned): New define_expand.
	* config/loongarch/loongarch-builtins.cc
	(CODE_FOR_lsx_vhaddw_q_d): Define as a macro to override with
	punned expand.
	(CODE_FOR_lsx_vhaddw_qu_du): Likewise.
	(CODE_FOR_lsx_vhsubw_q_d): Likewise.
	(CODE_FOR_lsx_vhsubw_qu_du): Likewise.
	(CODE_FOR_lasx_xvhaddw_q_d): Likewise.
	(CODE_FOR_lasx_xvhaddw_qu_du): Likewise.
	(CODE_FOR_lasx_xvhsubw_q_d): Likewise.
	(CODE_FOR_lasx_xvhsubw_qu_du): Likewise.

2ca759fc

LoongArch: Simplify {lsx_,lasx_x}v{add,sub,mul}w{ev,od} description · a36c15aa

Xi Ruoyao authored 1 month ago

These pattern definitions are tediously long, involving 32 UNSPECs and
many hard-coded long const vectors.  To simplify them, first we use
the TImode vector operations instead of the UNSPECs, then we adopt an
approach from AArch64: using a special predicate to match the const
vectors for odd/even indices in the define_insn's, and generating those
vectors in the define_expand's.
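
What the even/odd index vectors select can be modelled in scalar C.  The
sketch below is an illustration under assumptions: stepped_indices and
addw_evod_h_b are hypothetical names, and the model covers only the signed
8-bit to 16-bit widening add on a 128-bit vector.

```c
#include <stdint.h>

/* Fill OUT with COUNT indices BASE, BASE + STEP, BASE + 2 * STEP, ...
   With STEP == 2, BASE 0 selects the even elements and BASE 1 the odd.  */
static void
stepped_indices (int *out, int count, int base, int step)
{
  for (int i = 0; i < count; i++)
    out[i] = base + i * step;
}

/* Scalar model of a widening add of even or odd elements: each selected
   pair of 8-bit elements is sign-extended to 16 bits before adding.  */
static void
addw_evod_h_b (int16_t dst[8], const int8_t a[16], const int8_t b[16],
	       int odd)
{
  int sel[8];
  stepped_indices (sel, 8, odd ? 1 : 0, 2);
  for (int i = 0; i < 8; i++)
    dst[i] = (int16_t) ((int16_t) a[sel[i]] + (int16_t) b[sel[i]]);
}
```

The index list is the only thing that differs between the even and odd
forms, which is why a single predicate matching either list can replace
separately written constant vectors.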

For "backward compatibility" we need to provide a "punned" version for
the operations involving TImode vectors as the intrinsics still expect
DImode vectors.

The stat is "201 insertions, 905 deletions."

gcc/ChangeLog:

	* config/loongarch/lasx.md (UNSPEC_LASX_XVADDWEV): Remove.
	(UNSPEC_LASX_XVADDWEV2): Remove.
	(UNSPEC_LASX_XVADDWEV3): Remove.
	(UNSPEC_LASX_XVSUBWEV): Remove.
	(UNSPEC_LASX_XVSUBWEV2): Remove.
	(UNSPEC_LASX_XVMULWEV): Remove.
	(UNSPEC_LASX_XVMULWEV2): Remove.
	(UNSPEC_LASX_XVMULWEV3): Remove.
	(UNSPEC_LASX_XVADDWOD): Remove.
	(UNSPEC_LASX_XVADDWOD2): Remove.
	(UNSPEC_LASX_XVADDWOD3): Remove.
	(UNSPEC_LASX_XVSUBWOD): Remove.
	(UNSPEC_LASX_XVSUBWOD2): Remove.
	(UNSPEC_LASX_XVMULWOD): Remove.
	(UNSPEC_LASX_XVMULWOD2): Remove.
	(UNSPEC_LASX_XVMULWOD3): Remove.
	(lasx_xv<addsubmul:optab>wev_h_b<u>): Remove.
	(lasx_xv<addsubmul:optab>wev_w_h<u>): Remove.
	(lasx_xv<addsubmul:optab>wev_d_w<u>): Remove.
	(lasx_xvaddwev_q_d): Remove.
	(lasx_xvsubwev_q_d): Remove.
	(lasx_xvmulwev_q_d): Remove.
	(lasx_xv<addsubmul:optab>wod_h_b<u>): Remove.
	(lasx_xv<addsubmul:optab>wod_w_h<u>): Remove.
	(lasx_xv<addsubmul:optab>wod_d_w<u>): Remove.
	(lasx_xvaddwod_q_d): Remove.
	(lasx_xvsubwod_q_d): Remove.
	(lasx_xvmulwod_q_d): Remove.
	(lasx_xvaddwev_q_du): Remove.
	(lasx_xvsubwev_q_du): Remove.
	(lasx_xvmulwev_q_du): Remove.
	(lasx_xvaddwod_q_du): Remove.
	(lasx_xvsubwod_q_du): Remove.
	(lasx_xvmulwod_q_du): Remove.
	(lasx_xv<addmul:optab>wev_h_bu_b): Remove.
	(lasx_xv<addmul:optab>wev_w_hu_h): Remove.
	(lasx_xv<addmul:optab>wev_d_wu_w): Remove.
	(lasx_xv<addmul:optab>wod_h_bu_b): Remove.
	(lasx_xv<addmul:optab>wod_w_hu_h): Remove.
	(lasx_xv<addmul:optab>wod_d_wu_w): Remove.
	(lasx_xvaddwev_q_du_d): Remove.
	(lasx_xvsubwev_q_du_d): Remove.
	(lasx_xvmulwev_q_du_d): Remove.
	(lasx_xvaddwod_q_du_d): Remove.
	(lasx_xvsubwod_q_du_d): Remove.
	* config/loongarch/lsx.md (UNSPEC_LSX_VADDWEV): Remove.
	(UNSPEC_LSX_VADDWEV2): Remove.
	(UNSPEC_LSX_VADDWEV3): Remove.
	(UNSPEC_LSX_VSUBWEV): Remove.
	(UNSPEC_LSX_VSUBWEV2): Remove.
	(UNSPEC_LSX_VMULWEV): Remove.
	(UNSPEC_LSX_VMULWEV2): Remove.
	(UNSPEC_LSX_VMULWEV3): Remove.
	(UNSPEC_LSX_VADDWOD): Remove.
	(UNSPEC_LSX_VADDWOD2): Remove.
	(UNSPEC_LSX_VADDWOD3): Remove.
	(UNSPEC_LSX_VSUBWOD): Remove.
	(UNSPEC_LSX_VSUBWOD2): Remove.
	(UNSPEC_LSX_VMULWOD): Remove.
	(UNSPEC_LSX_VMULWOD2): Remove.
	(UNSPEC_LSX_VMULWOD3): Remove.
	(lsx_v<addsubmul:optab>wev_h_b<u>): Remove.
	(lsx_v<addsubmul:optab>wev_w_h<u>): Remove.
	(lsx_v<addsubmul:optab>wev_d_w<u>): Remove.
	(lsx_vaddwev_q_d): Remove.
	(lsx_vsubwev_q_d): Remove.
	(lsx_vmulwev_q_d): Remove.
	(lsx_v<addsubmul:optab>wod_h_b<u>): Remove.
	(lsx_v<addsubmul:optab>wod_w_h<u>): Remove.
	(lsx_v<addsubmul:optab>wod_d_w<u>): Remove.
	(lsx_vaddwod_q_d): Remove.
	(lsx_vsubwod_q_d): Remove.
	(lsx_vmulwod_q_d): Remove.
	(lsx_vaddwev_q_du): Remove.
	(lsx_vsubwev_q_du): Remove.
	(lsx_vmulwev_q_du): Remove.
	(lsx_vaddwod_q_du): Remove.
	(lsx_vsubwod_q_du): Remove.
	(lsx_vmulwod_q_du): Remove.
	(lsx_v<addmul:optab>wev_h_bu_b): Remove.
	(lsx_v<addmul:optab>wev_w_hu_h): Remove.
	(lsx_v<addmul:optab>wev_d_wu_w): Remove.
	(lsx_v<addmul:optab>wod_h_bu_b): Remove.
	(lsx_v<addmul:optab>wod_w_hu_h): Remove.
	(lsx_v<addmul:optab>wod_d_wu_w): Remove.
	(lsx_vaddwev_q_du_d): Remove.
	(lsx_vsubwev_q_du_d): Remove.
	(lsx_vmulwev_q_du_d): Remove.
	(lsx_vaddwod_q_du_d): Remove.
	(lsx_vsubwod_q_du_d): Remove.
	(lsx_vmulwod_q_du_d): Remove.
	* config/loongarch/loongarch-modes.def: Add V4TI and V1DI.
	* config/loongarch/loongarch-protos.h
	(loongarch_gen_stepped_int_parallel): New function prototype.
	* config/loongarch/loongarch.cc (loongarch_print_operand):
	Accept 'O' for printing "ev" or "od".
	(loongarch_gen_stepped_int_parallel): Implement.
	* config/loongarch/predicates.md
	(vect_par_cnst_even_or_odd_half): New define_predicate.
	* config/loongarch/simd.md (WVEC_HALF): New define_mode_attr.
	(simdfmt_w): Likewise.
	(zero_one): New define_int_iterator.
	(ev_od): New define_int_attr.
	(simd_<optab>w_evod_<mode:IVEC>_<su>): New define_insn.
	(<simd_isa>_<x>v<optab>w<ev_od>_<simdfmt_w>_<simdfmt><u>): New
	define_expand.
	(simd_<optab>w_evod_<mode>_hetero): New define_insn.
	(<simd_isa>_<x>v<optab>w<ev_od>_<simdfmt_w>_<simdfmt>u_<simdfmt>):
	New define_expand.
	(DIVEC): New define_mode_iterator.
	(<simd_isa>_<optab>w<ev_od>_q_d<u>_punned): New define_expand.
	(<simd_isa>_<optab>w<ev_od>_q_du_d_punned): Likewise.
	* config/loongarch/loongarch-builtins.cc
	(CODE_FOR_lsx_vaddwev_q_d): Define as a macro to override it
	with the punned expand.
	(CODE_FOR_lsx_vaddwev_q_du): Likewise.
	(CODE_FOR_lsx_vsubwev_q_d): Likewise.
	(CODE_FOR_lsx_vsubwev_q_du): Likewise.
	(CODE_FOR_lsx_vmulwev_q_d): Likewise.
	(CODE_FOR_lsx_vmulwev_q_du): Likewise.
	(CODE_FOR_lsx_vaddwod_q_d): Likewise.
	(CODE_FOR_lsx_vaddwod_q_du): Likewise.
	(CODE_FOR_lsx_vsubwod_q_d): Likewise.
	(CODE_FOR_lsx_vsubwod_q_du): Likewise.
	(CODE_FOR_lsx_vmulwod_q_d): Likewise.
	(CODE_FOR_lsx_vmulwod_q_du): Likewise.
	(CODE_FOR_lsx_vaddwev_q_du_d): Likewise.
	(CODE_FOR_lsx_vmulwev_q_du_d): Likewise.
	(CODE_FOR_lsx_vaddwod_q_du_d): Likewise.
	(CODE_FOR_lsx_vmulwod_q_du_d): Likewise.
	(CODE_FOR_lasx_xvaddwev_q_d): Likewise.
	(CODE_FOR_lasx_xvaddwev_q_du): Likewise.
	(CODE_FOR_lasx_xvsubwev_q_d): Likewise.
	(CODE_FOR_lasx_xvsubwev_q_du): Likewise.
	(CODE_FOR_lasx_xvmulwev_q_d): Likewise.
	(CODE_FOR_lasx_xvmulwev_q_du): Likewise.
	(CODE_FOR_lasx_xvaddwod_q_d): Likewise.
	(CODE_FOR_lasx_xvaddwod_q_du): Likewise.
	(CODE_FOR_lasx_xvsubwod_q_d): Likewise.
	(CODE_FOR_lasx_xvsubwod_q_du): Likewise.
	(CODE_FOR_lasx_xvmulwod_q_d): Likewise.
	(CODE_FOR_lasx_xvmulwod_q_du): Likewise.
	(CODE_FOR_lasx_xvaddwev_q_du_d): Likewise.
	(CODE_FOR_lasx_xvmulwev_q_du_d): Likewise.
	(CODE_FOR_lasx_xvaddwod_q_du_d): Likewise.
	(CODE_FOR_lasx_xvmulwod_q_du_d): Likewise.

a36c15aa

LoongArch: Allow moving TImode vectors · ac1b0586

Xi Ruoyao authored 2 months ago

We have some vector instructions for operations on 128-bit integer, i.e.
TImode, vectors.  Previously they had been modeled with unspecs, but
it's more natural to just model them with TImode vector RTL expressions.

In preparation, allow moving V1TImode and V2TImode vectors in LSX
and LASX registers so we won't get a reload failure when we start to
save TImode vectors in these registers.

This implicitly depends on the vrepli optimization: without it we'd try
"vrepli.q" which does not really exist and trigger an ICE.

gcc/ChangeLog:

	* config/loongarch/lsx.md (mov<LSX:mode>): Remove.
	(movmisalign<LSX:mode>): Remove.
	(mov<LSX:mode>_lsx): Remove.
	* config/loongarch/lasx.md (mov<LASX:mode>): Remove.
	(movmisalign<LASX:mode>): Remove.
	(mov<LASX:mode>_lasx): Remove.
	* config/loongarch/loongarch-modes.def (V1TI): Add.
	(V2TI): Mention in the comment.
	* config/loongarch/loongarch.md (mode): Add V1TI and V2TI.
	* config/loongarch/simd.md (ALLVEC_TI): New mode iterator.
	(mov<ALLVEC_TI:mode>): New define_expand.
	(movmisalign<ALLVEC_TI:mode>): Likewise.
	(mov<ALLVEC_TI:mode>_simd): New define_insn_and_split.

ac1b0586

LoongArch: Try harder using vrepli instructions to materialize const vectors · ed979454

Xi Ruoyao authored 2 months ago

For

  a = (v4si){0xdddddddd, 0xdddddddd, 0xdddddddd, 0xdddddddd}

we just want

  vrepli.b $vr0, 0xdd

but the compiler actually produces a load:

  la.local $r14,.LC0
  vld      $vr0,$r14,0

It's because we only tried vrepli.d which wouldn't work.  Try all vrepli
instructions for const int vector materializing to fix it.
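
The idea of trying every element width can be sketched as below.  This is
not loongarch_const_vector_vrepli: the function name is hypothetical, the
input is reduced to one 64-bit chunk of the constant, and the immediate is
assumed to be a signed 10-bit value (-512 to 511).

```c
#include <stdint.h>

/* BITS holds the low 64 bits of a vector whose 64-bit chunks are all equal.
   Return the narrowest element width (8, 16, 32 or 64) at which BITS is a
   repetition of one element that fits the assumed signed 10-bit immediate,
   storing that immediate in *IMM.  Return 0 if no width works.  */
static int
vrepli_width (uint64_t bits, int64_t *imm)
{
  for (int w = 8; w <= 64; w *= 2)
    {
      uint64_t mask = w == 64 ? ~UINT64_C (0) : (UINT64_C (1) << w) - 1;
      uint64_t elt = bits & mask;
      int repeated = 1;
      for (int shift = w; shift < 64; shift += w)
	if (((bits >> shift) & mask) != elt)
	  repeated = 0;
      if (!repeated)
	continue;

      /* Sign-extend the W-bit element.  */
      uint64_t sign = UINT64_C (1) << (w - 1);
      int64_t value = (int64_t) ((elt ^ sign) - sign);
      if (value >= -512 && value <= 511)
	{
	  *imm = value;
	  return w;
	}
    }
  return 0;
}
```

For the 0xdddddddd example the 32-bit element is far out of range, but the
same bits are a repetition of the byte 0xdd (-35 when sign-extended), so
the byte form is usable.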

gcc/ChangeLog:

	* config/loongarch/loongarch-protos.h
	(loongarch_const_vector_vrepli): New function prototype.
	* config/loongarch/loongarch.cc (loongarch_const_vector_vrepli):
	Implement.
	(loongarch_const_insns): Call loongarch_const_vector_vrepli
	instead of loongarch_const_vector_same_int_p.
	(loongarch_split_vector_move_p): Likewise.
	(loongarch_output_move): Use loongarch_const_vector_vrepli to
	pun operand[1] into a better mode if it's a const int vector,
	and decide the suffix of [x]vrepli with the new mode.
	* config/loongarch/constraints.md (YI): Call
	loongarch_const_vector_vrepli instead of
	loongarch_const_vector_same_int_p.

gcc/testsuite/ChangeLog:

	* gcc.target/loongarch/vrepli.c: New test.

ed979454

LoongArch: Accept ADD, IOR or XOR when combining objects with no bits in common [PR115478] · ea3ebe48

Xi Ruoyao authored 1 month ago

Since r15-1120, multi-word shifts/rotates produce PLUS instead of IOR.
It's generally a good thing (allowing us to use our alsl instruction or
similar instructions on other architectures), but it's preventing us
from using bytepick.  For example, if we shift a __int128 by 16 bits,
the higher word can be produced via a single bytepick.d instruction with
immediate 2, but we got:

	srli.d	$r12,$r4,48
	slli.d	$r5,$r5,16
	slli.d	$r4,$r4,16
	add.d	$r5,$r12,$r5
	jr	$r1

This didn't work with GCC 14, but after r15-6490 it's supposed to work
if IOR was used instead of PLUS.

To fix this, add a code iterator to match IOR, XOR, and PLUS and use it
instead of just IOR if we know the operands have no overlapping bits.
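
Why the three codes are interchangeable here: when two values have no set
bits in common, no carries occur, so OR, addition and XOR all give the same
result.  A minimal sketch for the high word of the 128-bit shift by 16
follows; the function names are hypothetical.

```c
#include <stdint.h>

/* High word of the 128-bit value HI:LO shifted left by 16 bits.  The two
   partial results occupy disjoint bits: (hi << 16) has its low 16 bits
   clear and (lo >> 48) has only its low 16 bits possibly set.  */
static uint64_t
shl16_hi_ior (uint64_t hi, uint64_t lo)
{
  return (hi << 16) | (lo >> 48);
}

static uint64_t
shl16_hi_plus (uint64_t hi, uint64_t lo)
{
  return (hi << 16) + (lo >> 48);
}

static uint64_t
shl16_hi_xor (uint64_t hi, uint64_t lo)
{
  return (hi << 16) ^ (lo >> 48);
}
```

A shift of 16 bits is a shift of 2 bytes, which corresponds to the
immediate 2 mentioned for bytepick.d.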

gcc/ChangeLog:

	PR target/115478
	* config/loongarch/loongarch.md (any_or_plus): New
	define_code_iterator.
	(bstrins_<mode>_for_ior_mask): Use any_or_plus instead of ior.
	(bytepick_w_<bytepick_imm>): Likewise.
	(bytepick_d_<bytepick_imm>): Likewise.
	(bytepick_d_<bytepick_imm>_rev): Likewise.

gcc/testsuite/ChangeLog:

	PR target/115478
	* gcc.target/loongarch/bytepick_shift_128.c: New test.

ea3ebe48

Feb 18, 2025

RISC-V: Fix ratio in vsetvl fuse rule [PR115703]. · 44d4a108

Robin Dapp authored 1 month ago

In PR115703 we fuse two vsetvls:

    Fuse curr info since prev info compatible with it:
      prev_info: VALID (insn 438, bb 2)
        Demand fields: demand_ge_sew demand_non_zero_avl
        SEW=32, VLMUL=m1, RATIO=32, MAX_SEW=64
        TAIL_POLICY=agnostic, MASK_POLICY=agnostic
        AVL=(reg:DI 0 zero)
        VL=(reg:DI 9 s1 [312])
      curr_info: VALID (insn 92, bb 20)
        Demand fields: demand_ratio_and_ge_sew demand_avl
        SEW=64, VLMUL=m1, RATIO=64, MAX_SEW=64
        TAIL_POLICY=agnostic, MASK_POLICY=agnostic
        AVL=(const_int 4 [0x4])
        VL=(nil)
      prev_info after fused: VALID (insn 438, bb 2)
        Demand fields: demand_ratio_and_ge_sew demand_avl
        SEW=64, VLMUL=mf2, RATIO=64, MAX_SEW=64
        TAIL_POLICY=agnostic, MASK_POLICY=agnostic
        AVL=(const_int 4 [0x4])
        VL=(nil).

The result is vsetvl zero, zero, e64, mf2, ta, ma.  The previous vsetvl
set vl = 4 but here we wrongly set it to vl = 2.  As all the following
vsetvls only ever change the ratio we never recover.
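
The inconsistency in the fused state can be checked with the SEW/LMUL
arithmetic.  The sketch below is not code from riscv-vsetvl.cc: the names
are hypothetical, LMUL is stored in eighths so fractional values stay
integral, and the VLEN of 256 used in the closing note is an assumption
chosen only to reproduce the vl values quoted above.

```c
/* LMUL in eighths: mf8 = 1, mf4 = 2, mf2 = 4, m1 = 8, m2 = 16, ...  */
enum { LMUL_MF8 = 1, LMUL_MF4 = 2, LMUL_MF2 = 4, LMUL_M1 = 8, LMUL_M2 = 16 };

/* RATIO = SEW / LMUL.  */
static int
sew_lmul_ratio (int sew, int lmul8)
{
  return sew * 8 / lmul8;
}

/* The LMUL (in eighths) that keeps RATIO for a given SEW.  */
static int
lmul8_for_ratio (int sew, int ratio)
{
  return sew * 8 / ratio;
}

/* VLMAX = VLEN / RATIO.  */
static int
vlmax (int vlen, int ratio)
{
  return vlen / ratio;
}
```

With these definitions e64, m1 has ratio 64 and e64, mf2 has ratio 128, so
the fused state's "SEW=64, VLMUL=mf2, RATIO=64" does not add up; keeping
ratio 64 at SEW=64 requires m1.  For an assumed VLEN of 256, ratio 64 gives
VLMAX 4 and ratio 128 gives VLMAX 2.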

The issue is quite difficult to trigger because we can often
deduce the value of d at runtime.  Then every check for the value of
d will be optimized away.

The last known bad commit is r15-3458-g5326306e7d9d36.  With that commit
the output is wrong but -fno-schedule-insns makes it correct.  From the
next commit on the issue is latent.  I still added the PR's test as scan
and run check even if they don't trigger right now.  Not sure if the
run test will ever fail but well.  I verified that the
patch fixes the issue when applied on top of r15-3458-g5326306e7d9d36.

	PR target/115703

gcc/ChangeLog:

	* config/riscv/riscv-vsetvl.cc: Use max_sew for calculating the
	new LMUL.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/autovec/pr115703-run.c: New test.
	* gcc.target/riscv/rvv/autovec/pr115703.c: New test.

44d4a108

aarch64: Use generic_armv8_a_prefetch_tune in generic_armv8_a.h · 8606ab34

Soumya AR authored 1 month ago


generic_armv8_a.h defines generic_armv8_a_prefetch_tune but still uses
generic_prefetch_tune in generic_armv8_a_tunings.

This patch updates the pointer to generic_armv8_a_prefetch_tune.

This patch was bootstrapped and regtested on aarch64-linux-gnu, no regression.

Signed-off-by: Soumya AR <soumyaa@nvidia.com>

gcc/ChangeLog:

	* config/aarch64/tuning_models/generic_armv8_a.h: Updated prefetch
	struct pointer.

8606ab34

RISC-V: Fix ICE for target attributes has different xlen size · 17b95cfc

Pan Li authored 1 month ago


This patch avoids the ICE when the target attribute specifies an xlen
different from the command line's, e.g. compiling with rv64gc but using a
target attribute with rv32gcv_zbb.  For example:

  long foo (long a, long b)
  __attribute__((target("arch=rv32gcv_zbb")));

  long foo (long a, long b)
  {
    return a + (b * 2);
  }

When compiled with rv64gc -O3, it produces an ICE similar to the one below:

during RTL pass: fwprop1
test.c: In function ‘foo’:
test.c:10:1: internal compiler error: in add_use, at rtl-ssa/accesses.cc:1234
   10 | }
      | ^
0x44d6b9d internal_error(char const*, ...)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/diagnostic-global-context.cc:517
0x44a26a6 fancy_abort(char const*, int, char const*)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/diagnostic.cc:1722
0x408fac9 rtl_ssa::function_info::add_use(rtl_ssa::use_info*)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/rtl-ssa/accesses.cc:1234
0x40a5eea rtl_ssa::function_info::create_reg_use(rtl_ssa::function_info::build_info&, rtl_ssa::insn_info*, rtl_ssa::resource_info)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/rtl-ssa/insns.cc:496
0x4456738 rtl_ssa::function_info::add_artificial_accesses(rtl_ssa::function_info::build_info&, df_ref_flags)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/rtl-ssa/blocks.cc:900
0x4457297 rtl_ssa::function_info::start_block(rtl_ssa::function_info::build_info&, rtl_ssa::bb_info*)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/rtl-ssa/blocks.cc:1082
0x4453627 rtl_ssa::function_info::bb_walker::before_dom_children(basic_block_def*)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/rtl-ssa/blocks.cc:118
0x3e9f3fb dom_walker::walk(basic_block_def*)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/domwalk.cc:311
0x445806f rtl_ssa::function_info::process_all_blocks()
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/rtl-ssa/blocks.cc:1298
0x40a22d3 rtl_ssa::function_info::function_info(function*)
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/rtl-ssa/functions.cc:51
0x3ec3f80 fwprop_init
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/fwprop.cc:893
0x3ec420d fwprop
        /home/pli/gcc/111/riscv-gnu-toolchain/gcc/__RISC-V_BUILD__/../gcc/fwprop.cc:963
0x3ec43ad execute

Considering we are in stage 4, we just report an error for the above
scenario when we detect that the cmd xlen differs from the target
attribute's xlen in the implementation of the target hook
TARGET_OPTION_VALID_ATTRIBUTE_P.
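
The shape of such a check can be sketched as follows.  This is not the code
in riscv_target_attr_parser::parse_arch: the function names are
hypothetical, and the xlen is derived only from the "rv32"/"rv64" prefix of
the arch string.

```c
#include <string.h>

/* Return 32 or 64 for an arch string beginning with "rv32" or "rv64",
   or 0 if the prefix is not recognized.  */
static int
arch_xlen (const char *arch)
{
  if (strncmp (arch, "rv32", 4) == 0)
    return 32;
  if (strncmp (arch, "rv64", 4) == 0)
    return 64;
  return 0;
}

/* Nonzero if an attribute arch string may be used when the command line
   selected CMD_XLEN; a mismatch would be reported as an error.  */
static int
attr_arch_ok (int cmd_xlen, const char *attr_arch)
{
  int xlen = arch_xlen (attr_arch);
  return xlen != 0 && xlen == cmd_xlen;
}
```

With the example above, attr_arch_ok (64, "rv32gcv_zbb") is zero, so the
attribute would be rejected up front instead of reaching the RTL passes.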

	PR target/118540

gcc/ChangeLog:

	* config/riscv/riscv-target-attr.cc (riscv_target_attr_parser::parse_arch):
	Report error when cmd xlen differs from the target attribute's.

gcc/testsuite/ChangeLog:

	* gcc.target/riscv/rvv/base/pr118540-1.c: New test.
	* gcc.target/riscv/rvv/base/pr118540-2.c: New test.

Signed-off-by: Pan Li <pan2.li@intel.com>

17b95cfc