forwprop: Try to blend two isomorphic VEC_PERM sequences
This extends forwprop by yet another VEC_PERM optimization:
It attempts to blend two isomorphic vector sequences by using the
redundancy in the lane utilization in these sequences.
This redundancy in lane utilization comes from the way how specific
scalar statements end up vectorized: two VEC_PERMs on top, binary operations
on both of them, and a final VEC_PERM to create the result.
Here is an example of this sequence:
v_in = {e0, e1, e2, e3}
v_1 = VEC_PERM <v_in, v_in, {0, 2, 0, 2}>
// v_1 = {e0, e2, e0, e2}
v_2 = VEC_PERM <v_in, v_in, {1, 3, 1, 3}>
// v_2 = {e1, e3, e1, e3}
v_x = v_1 + v_2
// v_x = {e0+e1, e2+e3, e0+e1, e2+e3}
v_y = v_1 - v_2
// v_y = {e0-e1, e2-e3, e0-e1, e2-e3}
v_out = VEC_PERM <v_x, v_y, {0, 1, 6, 7}>
// v_out = {e0+e1, e2+e3, e0-e1, e2-e3}
To remove the redundancy, lanes 2 and 3 can be freed, which allows to
change the last statement into:
v_out' = VEC_PERM <v_x, v_y, {0, 1, 4, 5}>
// v_out' = {e0+e1, e2+e3, e0-e1, e2-e3}
The cost of eliminating the redundancy in the lane utilization is that
lowering the VEC PERM expression could get more expensive because of
tighter packing of the lanes. Therefore this optimization is not done
alone, but in only in case we identify two such sequences that can be
blended.
Once all candidate sequences have been identified, we try to blend them,
so that we can use the freed lanes for the second sequence.
On success we convert 2x (2x BINOP + 1x VEC_PERM) to
2x VEC_PERM + 2x BINOP + 2x VEC_PERM traded for 4x VEC_PERM + 2x BINOP.
The implemented transformation reuses (rewrites) the statements
of the first sequence and the last VEC_PERM of the second sequence.
The remaining four statements of the second statment are left untouched
and will be eliminated by DCE later.
This targets x264_pixel_satd_8x4, which calculates the sum of absolute
transformed differences (SATD) using Hadamard transformation.
We have seen 8% speedup on SPEC's x264 on a 5950X (x86-64) and 7%
speedup on an AArch64 machine.
Bootstrapped and reg-tested on x86-64 and AArch64 (all languages).
gcc/ChangeLog:
* tree-ssa-forwprop.cc (struct _vec_perm_simplify_seq): New data
structure to store analysis results of a vec perm simplify sequence.
(get_vect_selector_index_map): Helper to get an index map from the
provided vector permute selector.
(recognise_vec_perm_simplify_seq): Helper to recognise a
vec perm simplify sequence.
(narrow_vec_perm_simplify_seq): Helper to pack the lanes more
tight.
(can_blend_vec_perm_simplify_seqs_p): Test if two vec perm
sequences can be blended.
(calc_perm_vec_perm_simplify_seqs): Helper to calculate the new
permutation indices.
(blend_vec_perm_simplify_seqs): Helper to blend two vec perm
simplify sequences.
(process_vec_perm_simplify_seq_list): Helper to process a list
of vec perm simplify sequences.
(append_vec_perm_simplify_seq_list): Helper to add a vec perm
simplify sequence to the list.
(pass_forwprop::execute): Integrate new functionality.
gcc/testsuite/ChangeLog:
* gcc.dg/tree-ssa/satd-hadamard.c: New test.
* gcc.dg/tree-ssa/vector-10.c: New test.
* gcc.dg/tree-ssa/vector-8.c: New test.
* gcc.dg/tree-ssa/vector-9.c: New test.
* gcc.target/aarch64/sve/satd-hadamard.c: New test.
Signed-off-by:
Christoph Müllner <christoph.muellner@vrull.eu>
Showing
- gcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c 43 additions, 0 deletionsgcc/testsuite/gcc.dg/tree-ssa/satd-hadamard.c
- gcc/testsuite/gcc.dg/tree-ssa/vector-10.c 122 additions, 0 deletionsgcc/testsuite/gcc.dg/tree-ssa/vector-10.c
- gcc/testsuite/gcc.dg/tree-ssa/vector-8.c 34 additions, 0 deletionsgcc/testsuite/gcc.dg/tree-ssa/vector-8.c
- gcc/testsuite/gcc.dg/tree-ssa/vector-9.c 34 additions, 0 deletionsgcc/testsuite/gcc.dg/tree-ssa/vector-9.c
- gcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c 3 additions, 0 deletionsgcc/testsuite/gcc.target/aarch64/sve/satd-hadamard.c
- gcc/tree-ssa-forwprop.cc 583 additions, 1 deletiongcc/tree-ssa-forwprop.cc
Loading
Please register or sign in to comment