
Rebase revyos gcc10.4 thead wip #5

Open
wants to merge 28 commits into revyos-gcc10.4-thead-wip

Conversation


@pz9115 pz9115 commented Feb 4, 2024

Please check and force update the wip branch.

Nelson Chu and others added 28 commits February 4, 2024 14:00
…ject.

This is the original binutils bugzilla report,
https://sourceware.org/bugzilla/show_bug.cgi?id=28509

And this is the first version of the proposed binutils patch,
https://sourceware.org/pipermail/binutils/2021-November/118398.html

After applying the binutils patch, I get the following unexpected error
when building libgcc:

/scratch/nelsonc/riscv-gnu-toolchain/riscv-gcc/libgcc/config/riscv/div.S:42:
/scratch/nelsonc/build-upstream/rv64gc-linux/build-install/riscv64-unknown-linux-gnu/bin/ld: relocation R_RISCV_JAL against `__udivdi3' which may bind externally can not be used when making a shared object; recompile with -fPIC

Therefore, this patch adds an extra hidden alias symbol for __udivdi3, and
then uses HIDDEN_JUMPTARGET to target a non-preemptible symbol instead.
The solution is similar to glibc as follows,
https://sourceware.org/git/?p=glibc.git;a=commit;h=68389203832ab39dd0dbaabbc4059e7fff51c29b
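
As an illustration, the pattern looks roughly like this (a sketch only;
the macro bodies here are illustrative rather than the exact ones in the
patch):

  /* riscv-asm.h: declare a hidden, non-preemptible alias for a routine.  */
  #define HIDDEN_JUMPTARGET(name)  __hidden_##name
  #define HIDDEN_DEF(name)               \
    .globl HIDDEN_JUMPTARGET(name);      \
    .hidden HIDDEN_JUMPTARGET(name);     \
    .set HIDDEN_JUMPTARGET(name), name

  /* div.S: call the hidden alias, so no dynamic relocation is needed.  */
  call HIDDEN_JUMPTARGET(__udivdi3)  /* was: call __udivdi3 */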

libgcc/ChangeLog:

	* config/riscv/div.S: Add the hidden alias symbol for __udivdi3, and
	then use HIDDEN_JUMPTARGET to target it since it is non-preemptible.
	* config/riscv/riscv-asm.h: Added new macros HIDDEN_JUMPTARGET and
	HIDDEN_DEF.
This patch mainly updates the extension support for RVV 0.7 and 1.0.
It updates the RVV 1.0 extensions to the latest release version,
changes the implied-extension conditions to account for version control,
and also adjusts the multilib script set.

gcc/ChangeLog:

        * common/config/riscv/riscv-common.c (riscv_subset_list::handle_implied_ext):
        * config/riscv/multilib-generator:
        * config/riscv/riscv-thead.h (riscv_dsp_preferred_mode):
…ributes.

This new feature causes the compiler to zero a subset of all call-used
registers at function return.  This is used to increase program security
by either mitigating Return-Oriented Programming (ROP) attacks or
preventing information leakage through registers.
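
For example, the feature can be requested per function with the new
attribute, or globally with -fzero-call-used-regs= (usage sketch):

  /* On return, zero the call-used registers this function actually
     used, so no temporary values leak to the caller through them.  */
  int __attribute__ ((zero_call_used_regs ("used")))
  check_secret (int x)
  {
    return x == 42;
  }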

gcc/ChangeLog:

2020-10-30  Qing Zhao  <[email protected]>
	    H.J.Lu  <[email protected]>

	* common.opt: Add new option -fzero-call-used-regs
	* config/i386/i386.c (zero_call_used_regno_p): New function.
	(zero_call_used_regno_mode): Likewise.
	(zero_all_vector_registers): Likewise.
	(zero_all_st_registers): Likewise.
	(zero_all_mm_registers): Likewise.
	(ix86_zero_call_used_regs): Likewise.
	(TARGET_ZERO_CALL_USED_REGS): Define.
	* df-scan.c (df_epilogue_uses_p): New function.
	(df_get_exit_block_use_set): Replace EPILOGUE_USES with
	df_epilogue_uses_p.
	* df.h (df_epilogue_uses_p): Declare.
	* doc/extend.texi: Document the new zero_call_used_regs attribute.
	* doc/invoke.texi: Document the new -fzero-call-used-regs option.
	* doc/tm.texi: Regenerate.
	* doc/tm.texi.in (TARGET_ZERO_CALL_USED_REGS): New hook.
	* emit-rtl.h (struct rtl_data): New field must_be_zero_on_return.
	* flag-types.h (namespace zero_regs_flags): New namespace.
	* function.c (gen_call_used_regs_seq): New function.
	(class pass_zero_call_used_regs): New class.
	(pass_zero_call_used_regs::execute): New function.
	(make_pass_zero_call_used_regs): New function.
	* optabs.c (expand_asm_reg_clobber_mem_blockage): New function.
	* optabs.h (expand_asm_reg_clobber_mem_blockage): Declare.
	* opts.c (zero_call_used_regs_opts): New structure array
	initialization.
	(parse_zero_call_used_regs_options): New function.
	(common_handle_option): Handle -fzero-call-used-regs.
	* opts.h (zero_call_used_regs_opts): New structure array.
	* passes.def: Add new pass pass_zero_call_used_regs.
	* recog.c (valid_insn_p): New function.
	* recog.h (valid_insn_p): Declare.
	* resource.c (init_resource_info): Replace EPILOGUE_USES with
	df_epilogue_uses_p.
	* target.def (zero_call_used_regs): New hook.
	* targhooks.c (default_zero_call_used_regs): New function.
	* targhooks.h (default_zero_call_used_regs): Declare.
	* tree-pass.h (make_pass_zero_call_used_regs): Declare.

gcc/c-family/ChangeLog:

2020-10-30  Qing Zhao  <[email protected]>
	    H.J.Lu  <[email protected]>

	* c-attribs.c (c_common_attribute_table): Add new attribute
	zero_call_used_regs.
	(handle_zero_call_used_regs_attribute): New function.

gcc/testsuite/ChangeLog:

2020-10-30  Qing Zhao  <[email protected]>
	    H.J.Lu  <[email protected]>

	* c-c++-common/zero-scratch-regs-1.c: New test.
	* c-c++-common/zero-scratch-regs-10.c: New test.
	* c-c++-common/zero-scratch-regs-11.c: New test.
	* c-c++-common/zero-scratch-regs-2.c: New test.
	* c-c++-common/zero-scratch-regs-3.c: New test.
	* c-c++-common/zero-scratch-regs-4.c: New test.
	* c-c++-common/zero-scratch-regs-5.c: New test.
	* c-c++-common/zero-scratch-regs-6.c: New test.
	* c-c++-common/zero-scratch-regs-7.c: New test.
	* c-c++-common/zero-scratch-regs-8.c: New test.
	* c-c++-common/zero-scratch-regs-9.c: New test.
	* c-c++-common/zero-scratch-regs-attr-usages.c: New test.
	* gcc.target/i386/zero-scratch-regs-1.c: New test.
	* gcc.target/i386/zero-scratch-regs-10.c: New test.
	* gcc.target/i386/zero-scratch-regs-11.c: New test.
	* gcc.target/i386/zero-scratch-regs-12.c: New test.
	* gcc.target/i386/zero-scratch-regs-13.c: New test.
	* gcc.target/i386/zero-scratch-regs-14.c: New test.
	* gcc.target/i386/zero-scratch-regs-15.c: New test.
	* gcc.target/i386/zero-scratch-regs-16.c: New test.
	* gcc.target/i386/zero-scratch-regs-17.c: New test.
	* gcc.target/i386/zero-scratch-regs-18.c: New test.
	* gcc.target/i386/zero-scratch-regs-19.c: New test.
	* gcc.target/i386/zero-scratch-regs-2.c: New test.
	* gcc.target/i386/zero-scratch-regs-20.c: New test.
	* gcc.target/i386/zero-scratch-regs-21.c: New test.
	* gcc.target/i386/zero-scratch-regs-22.c: New test.
	* gcc.target/i386/zero-scratch-regs-23.c: New test.
	* gcc.target/i386/zero-scratch-regs-24.c: New test.
	* gcc.target/i386/zero-scratch-regs-25.c: New test.
	* gcc.target/i386/zero-scratch-regs-26.c: New test.
	* gcc.target/i386/zero-scratch-regs-27.c: New test.
	* gcc.target/i386/zero-scratch-regs-28.c: New test.
	* gcc.target/i386/zero-scratch-regs-29.c: New test.
	* gcc.target/i386/zero-scratch-regs-30.c: New test.
	* gcc.target/i386/zero-scratch-regs-31.c: New test.
	* gcc.target/i386/zero-scratch-regs-3.c: New test.
	* gcc.target/i386/zero-scratch-regs-4.c: New test.
	* gcc.target/i386/zero-scratch-regs-5.c: New test.
	* gcc.target/i386/zero-scratch-regs-6.c: New test.
	* gcc.target/i386/zero-scratch-regs-7.c: New test.
	* gcc.target/i386/zero-scratch-regs-8.c: New test.
	* gcc.target/i386/zero-scratch-regs-9.c: New test.
gcc/ChangeLog:

	* targhooks.c (default_zero_call_used_regs): Fix flag-name typo
	in sorry.
…egs [PR100775]

In the pass_zero_call_used_regs, when updating dataflow info after adding
the register zeroing sequence in the epilogue of the function, we should
call "df_update_exit_block_uses" to update the register use information in
the exit block to include all the registers that have been zeroed.

2022-02-10  Qing Zhao  <[email protected]>

gcc/ChangeLog:

	PR middle-end/100775
	* function.cc (gen_call_used_regs_seq): Call
	df_update_exit_block_uses when updating df.

gcc/testsuite/ChangeLog:

	PR middle-end/100775
	* gcc.target/arm/pr100775.c: New test.
When the -fzero-call-used-regs command line option is used with an
unsupported value, indicate that it's a value problem instead of an
option problem.

Without the patch, the error is:
In file included from gcc/testsuite/c-c++-common/zero-scratch-regs-8.c:5:
gcc/testsuite/c-c++-common/zero-scratch-regs-1.c: In function 'foo':
gcc/testsuite/c-c++-common/zero-scratch-regs-1.c:10:1: sorry, unimplemented: '-fzero-call-used-regs' not supported on this target
   10 | }
      | ^

With the patch, the error would be like this:
In file included from gcc/testsuite/c-c++-common/zero-scratch-regs-8.c:5:
gcc/testsuite/c-c++-common/zero-scratch-regs-1.c: In function 'foo':
gcc/testsuite/c-c++-common/zero-scratch-regs-1.c:10:1: sorry, unimplemented: argument 'all-arg' is not supported for '-fzero-call-used-regs' on this target
   10 | }
      | ^

2022-09-19  Torbjörn SVENSSON  <[email protected]>

gcc/ChangeLog:

	* targhooks.cc (default_zero_call_used_regs): Improve sorry
	message.

Signed-off-by: Torbjörn SVENSSON  <[email protected]>
gcc/c-family/ChangeLog:

	PR tree-optimization/80532
	* c.opt (-Wuse-after-free): New options.

gcc/ChangeLog:

	PR tree-optimization/80532
	* common.opt (-Wuse-after-free): New options.
	* diagnostic-spec.c (nowarn_spec_t::nowarn_spec_t): Handle
	OPT_Wreturn_local_addr and OPT_Wuse_after_free_.
	* diagnostic-spec.h (NW_DANGLING): New enumerator.
	* doc/invoke.texi (-Wuse-after-free): Document new option.
	* gimple-ssa-warn-access.cc (pass_waccess::check_call): Rename...
	(pass_waccess::check_call_access): ...to this.
	(pass_waccess::check): Rename...
	(pass_waccess::check_block): ...to this.
	(pass_waccess::check_pointer_uses): New function.
	(pass_waccess::gimple_call_return_arg): New function.
	(pass_waccess::warn_invalid_pointer): New function.
	(pass_waccess::check_builtin): Handle free and realloc.
	(gimple_use_after_inval_p): New function.
	(get_realloc_lhs): New function.
	(maybe_warn_mismatched_realloc): New function.
	(pointers_related_p): New function.
	(pass_waccess::check_call): Call check_pointer_uses.
	(pass_waccess::execute): Compute and free dominance info.

libcpp/ChangeLog:

	* files.c (_cpp_find_file): Substitute a valid pointer for
	an invalid one to avoid -Wuse-after-free.

libiberty/ChangeLog:

	* regex.c: Suppress -Wuse-after-free.

gcc/testsuite/ChangeLog:

	PR tree-optimization/80532
	* gcc.dg/Wmismatched-dealloc-2.c: Avoid -Wuse-after-free.
	* gcc.dg/Wmismatched-dealloc-3.c: Same.
	* gcc.dg/analyzer/file-1.c: Prune expected warning.
	* gcc.dg/analyzer/file-2.c: Same.
	* gcc.dg/attr-alloc_size-6.c: Disable -Wuse-after-free.
	* gcc.dg/attr-alloc_size-7.c: Same.
	* c-c++-common/Wuse-after-free-2.c: New test.
	* c-c++-common/Wuse-after-free-3.c: New test.
	* c-c++-common/Wuse-after-free-4.c: New test.
	* c-c++-common/Wuse-after-free-5.c: New test.
	* c-c++-common/Wuse-after-free-6.c: New test.
	* c-c++-common/Wuse-after-free-7.c: New test.
	* c-c++-common/Wuse-after-free.c: New test.
	* g++.dg/warn/Wmismatched-dealloc-3.C: New test.
	* g++.dg/warn/Wuse-after-free.C: New test.
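
For reference, a minimal case the new warning diagnoses (an
illustrative sketch, not one of the tests above):

  #include <stdlib.h>

  int f (void)
  {
    char *p = malloc (1);
    if (!p)
      return 0;
    *p = 1;
    free (p);
    return *p;  /* warning: pointer 'p' used after 'void free(void*)'
                   [-Wuse-after-free] */
  }
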
Avoid undefined arithmetic involving a pointer to a heap allocation that
has been freed and move a problematic calculation ahead of the following
call to `free' in `riscv_subset_list::parse_multiletter_ext', removing a
compilation error:

.../gcc/common/config/riscv/riscv-common.cc: In member function 'const char* riscv_subset_list::parse_multiletter_ext(const char*, const char*, const char*)':
.../gcc/common/config/riscv/riscv-common.cc:905:27: error: pointer 'subset' used after 'void free(void*)' [-Werror=use-after-free]
  905 |       p += end_of_version - subset;
      |            ~~~~~~~~~~~~~~~^~~~~~~~
.../gcc/common/config/riscv/riscv-common.cc:904:12: note: call to 'void free(void*)' here
  904 |       free (subset);
      |       ~~~~~^~~~~~~~
cc1plus: all warnings being treated as errors
make[2]: *** [Makefile:2428: riscv-common.o] Error 1

and a build regression from commit 671a283 ("Add -Wuse-after-free
[PR80532].").

	gcc/
	* common/config/riscv/riscv-common.cc
	(riscv_subset_list::parse_multiletter_ext): Move pointer
	arithmetic ahead of `free'.
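
In essence, the change reorders the two statements quoted in the error
above (a sketch, not the verbatim patch):

  /* Before: 'subset' is used in pointer arithmetic after being freed.  */
  free (subset);
  p += end_of_version - subset;  /* -Wuse-after-free */

  /* After: do the arithmetic while the pointer value is still valid,
     then free.  */
  p += end_of_version - subset;
  free (subset);
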
The PR66728 changes broke __int128 handling:
they emit wide_int numbers in their minimum unsigned precision
rather than in their full precision.
The problem is then that e.g. the DW_OP_implicit_value path:
          int_mode = as_a <scalar_int_mode> (mode);
          loc_result = new_loc_descr (DW_OP_implicit_value,
                                      GET_MODE_SIZE (int_mode), 0);
          loc_result->dw_loc_oprnd2.val_class = dw_val_class_wide_int;
          loc_result->dw_loc_oprnd2.v.val_wide = ggc_alloc<wide_int> ();
          *loc_result->dw_loc_oprnd2.v.val_wide = rtx_mode_t (rtl, int_mode);
emits invalid DWARF.  In particular, this patch fixes multiple
occurrences of:
        .byte   0x9e    # DW_OP_implicit_value
        .uleb128 0x10
        .quad   0xffffffffffffffff
+       .quad   0
        .quad   .LVL46  # Location list begin address (*.LLST40)
        .quad   .LFE14  # Location list end address (*.LLST40)
where we said the value has 16 byte size but then only emitted 8 byte value.
My understanding is that most of the places that use val_wide expect
the precision they chose (the one of the mode they want etc.); the only
exception is the add_const_value_attribute case, where it deals with
VOIDmode CONST_WIDE_INTs.  For that case I agree that when we don't have
a mode we need to fall back to the minimum precision (not sure if the
maximum of min_precision UNSIGNED and SIGNED wouldn't be better, as then
consumers would know whether it is signed or unsigned by looking at the
MSB), but that code already computes the precision and just decided to
create the wide_int with a much larger precision (e.g. 512 bits
on x86_64).

2021-03-22  Jakub Jelinek  <[email protected]>

	PR debug/99562
	PR debug/66728
	* dwarf2out.c (get_full_len): Use get_precision rather than
	min_precision.
	(add_const_value_attribute): Make sure add_AT_wide argument has
	precision prec rather than some very wide one.
An insn with an op like post_modify may have multiple defs.
Take the following insn as an example:
(insn 790 789 800 87 (set (reg:DI 8 s0 [orig:182 _97 ] [182])
        (sign_extend:DI (mem:SI (post_modify:DI (reg/v/f:DI 22 s6 [orig:183 _98 ] [183])
                    (plus:DI (reg/v/f:DI 22 s6 [orig:183 _98 ] [183])
                        (const_int 4 [0x4]))) [0 MEM <unsigned int> [(char * {ref-all})_98]+0 S4 A8]))) 189 {extendsidi2}
     (expr_list:REG_INC (reg/v/f:DI 22 s6 [orig:183 _98 ] [183])
        (nil)))
The insn will be marked as sign-extended because the destination reg is
sign extended, but the post_modify's reg is also a definition. If the
pass walks this reg's defs, it will find this insn, see that it is
marked as sign-extended, and mistakenly conclude that this register is
also sign extended.
[T-HEAD][APPLY] 6c8e4f4

Add built-in functions __builtin_nansd32, __builtin_nansd64 and
__builtin_nansd128 to return signaling NaNs of decimal floating-point
types, analogous to the functions already present for binary
floating-point types.
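
Usage is analogous to the existing built-ins for binary floating-point
types (a sketch; requires decimal floating-point support):

  /* Signaling NaNs of the three decimal floating-point types.  */
  _Decimal32  s32  = __builtin_nansd32 ("");
  _Decimal64  s64  = __builtin_nansd64 ("");
  _Decimal128 s128 = __builtin_nansd128 ("");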

This patch, independent of
<https://gcc.gnu.org/pipermail/gcc-patches/2020-October/557136.html>
(pending review), is in preparation for adding the <float.h> macros
for such signaling NaNs that are in C2x, analogous to the macros for
other types that are in that patch.

Bootstrapped with no regressions for x86_64-pc-linux-gnu.  Also ran
the new tests for powerpc64le-linux-gnu to confirm they do work in the
case (hardware DFP) where floating-point exceptions are supported for
DFP.

gcc/
2020-11-06  Joseph Myers  <[email protected]>

	* builtins.def (BUILT_IN_NANSD32, BUILT_IN_NANSD64)
	(BUILT_IN_NANSD128): New built-in functions.
	* fold-const-call.c (fold_const_call): Handle the new built-in
	functions.
	* doc/extend.texi (__builtin_nansd32, __builtin_nansd64)
	(__builtin_nansd128): Document.
	* doc/sourcebuild.texi (Effective-Target Keywords): Document
	fenv_exceptions_dfp.

gcc/testsuite/
2020-11-06  Joseph Myers  <[email protected]>

	* lib/target-supports.exp
	(check_effective_target_fenv_exceptions_dfp): New.
	* gcc.dg/dfp/builtin-snan-1.c, gcc.dg/dfp/builtin-snan-2.c: New
	tests.
[T-HEAD][APPLY] b04445d

C++20 isn't final quite yet, but all that remains is formalities, so let's
go ahead and change all the references.

I think for the next C++ standard we can just call it C++23 rather than
C++2b, since the committee has been consistent about time-based releases
rather than feature-based.

gcc/c-family/ChangeLog
2020-05-13  Jason Merrill  <[email protected]>

	* c.opt (std=c++20): Make c++2a the alias.
	(std=gnu++20): Likewise.
	* c-common.h (cxx_dialect): Change cxx2a to cxx20.
	* c-opts.c: Adjust.
	* c-cppbuiltin.c: Adjust.
	* c-ubsan.c: Adjust.
	* c-warn.c: Adjust.

gcc/cp/ChangeLog
2020-05-13  Jason Merrill  <[email protected]>

	* call.c, class.c, constexpr.c, constraint.cc, decl.c, init.c,
	lambda.c, lex.c, method.c, name-lookup.c, parser.c, pt.c, tree.c,
	typeck2.c: Change cxx2a to cxx20.

libcpp/ChangeLog
2020-05-13  Jason Merrill  <[email protected]>

	* include/cpplib.h (enum c_lang): Change CXX2A to CXX20.
	* init.c, lex.c: Adjust.
[T-HEAD][APPLY] 78739c2

Derived from the changes that added C++2a support in 2017.
r8-3237-g026a79f70cf33f836ea5275eda72d4870a3041e5

No C++23 features are added here.
Use of -std=c++23 sets __cplusplus to 202100L.

$ g++ -std=c++23 -dM -E -x c++ - < /dev/null | grep cplusplus
 #define __cplusplus 202100L

gcc/
	* doc/cpp.texi (__cplusplus): Document value for -std=c++23
	or -std=gnu++23.
	* doc/invoke.texi: Document -std=c++23 and -std=gnu++23.
	* dwarf2out.c (highest_c_language): Recognise C++20 and C++23.
	(gen_compile_unit_die): Recognise C++23.

gcc/c-family/
	* c-common.h (cxx_dialect): Add cxx23 as a dialect.
	* c.opt: Add options for -std=c++23, -std=c++2b, -std=gnu++23
	and -std=gnu++2b.
	* c-opts.c (set_std_cxx23): New.
	(c_common_handle_option): Set options when -std=c++23 is enabled.
	(c_common_post_options): Adjust comments.
	(set_std_cxx20): Likewise.

gcc/testsuite/
	* lib/target-supports.exp (check_effective_target_c++2a):
	Check for C++2a or C++23.
	(check_effective_target_c++20_down): New.
	(check_effective_target_c++23_only): New.
	(check_effective_target_c++23): New.
	* g++.dg/cpp23/cplusplus.C: New.

libcpp/
	* include/cpplib.h (c_lang): Add CXX23 and GNUCXX23.
	* init.c (lang_defaults): Add rows for CXX23 and GNUCXX23.
	(cpp_init_builtins): Set __cplusplus to 202100L for C++23.
…ames compiler part except for bfloat16 [PR106652]

The following patch implements the compiler part of C++23
P1467R9 - Extended floating-point types and standard names
by introducing _Float{16,32,64,128} as keywords and builtin types
as they have been implemented for C since GCC 7, with DF{16,32,64,128}_
mangling.
It also introduces _Float{32,64,128}x for C++ with the
itanium-cxx-abi/cxx-abi#147
proposed mangling of DF{32,64,128}x.
The patch doesn't add anything for bfloat16_t support, as right now
__bf16 type refuses all conversions and arithmetic operations.
The patch wants to keep backwards compatibility with how __float128 has
been handled in C++ before, both for mangling and behavior in binary
operations, overload resolution etc.  So, there are some backend changes
where for C __float128 and _Float128 are the same type (float128_type_node
and float128t_type_node are the same pointer), but for C++ they are distinct
types which mangle differently and _Float128 is treated as extended
floating-point type while __float128 is treated as non-standard floating
point type.  The various C++23 changes to how floating-point types
behave are actually implemented as written in the spec only if at least
one of the types involved is _Float{16,32,64,128,32x,64x,128x} (_FloatNx are
also treated as extended floating-point types) and kept previous behavior
otherwise.  For float/double/long double the rules are actually written that
they behave the same as before.
There is some backwards incompatibility at least on x86 regarding _Float16,
because that type was already used by that name and with the DF16_ mangling
(but only since GCC 12 and I think it isn't that widely used in the wild
yet).  E.g. config/i386/avx512fp16intrin.h shows the issues, where
in C or in GCC 12 in C++ one could pass 0.0f to a builtin taking _Float16
argument, but with the changes that is not possible anymore, one needs
to either use 0.0f16 or (_Float16) 0.0f.
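
A sketch of that incompatibility (the function here is hypothetical,
not from the patch):

  _Float16 square (_Float16 x) { return x * x; }

  void use (void)
  {
    square (0.0f16);           /* OK: _Float16 literal suffix.  */
    square ((_Float16) 0.0f);  /* OK: explicit cast.  */
    square (0.0f);             /* Accepted by GCC 12 in C++; diagnosed with
                                  this patch since float -> _Float16 goes to
                                  a smaller conversion rank.  */
  }
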
We also have a problem with glibc headers, where since glibc 2.27
math.h and complex.h aren't compilable with these changes.  One gets
errors like:
In file included from /usr/include/math.h:43,
                 from abc.c:1:
/usr/include/bits/floatn.h:86:9: error: multiple types in one declaration
   86 | typedef __float128 _Float128;
      |         ^~~~~~~~~~
/usr/include/bits/floatn.h:86:20: error: declaration does not declare anything [-fpermissive]
   86 | typedef __float128 _Float128;
      |                    ^~~~~~~~~
In file included from /usr/include/bits/floatn.h:119:
/usr/include/bits/floatn-common.h:214:9: error: multiple types in one declaration
  214 | typedef float _Float32;
      |         ^~~~~
/usr/include/bits/floatn-common.h:214:15: error: declaration does not declare anything [-fpermissive]
  214 | typedef float _Float32;
      |               ^~~~~~~~
/usr/include/bits/floatn-common.h:251:9: error: multiple types in one declaration
  251 | typedef double _Float64;
      |         ^~~~~~
/usr/include/bits/floatn-common.h:251:16: error: declaration does not declare anything [-fpermissive]
  251 | typedef double _Float64;
      |                ^~~~~~~~
This is from snippets like:
 /* The remaining of this file provides support for older compilers.  */
 # if __HAVE_FLOAT128

 /* The type _Float128 exists only since GCC 7.0.  */
 #  if !__GNUC_PREREQ (7, 0) || defined __cplusplus
 typedef __float128 _Float128;
 #  endif
where it hardcodes that C++ doesn't have _Float{16,32,64,128,32x,64x,128x} support nor
{f,F}{16,32,64,128}{,x} literal suffixes nor _Complex _Float{16,32,64,128,32x,64x,128x}.
The patch fixincludes this for now and hopefully if this is committed, then
glibc can change those.  The patch changes those
 #  if !__GNUC_PREREQ (7, 0) || defined __cplusplus
conditions to
 #  if !__GNUC_PREREQ (7, 0) || (defined __cplusplus && !__GNUC_PREREQ (13, 0))
Another thing is mangling, as said above, Itanium C++ ABI specifies
DF <number> _ as _Float{16,32,64,128} mangling, but GCC was implementing
a mangling incompatible with that starting with DF for fixed point types.
Fixed point was never supported in C++ though, I believe the reason why
the mangling has been added was that due to a bug it would leak into the
C++ FE through decltype (0.0r) etc.  But that was fixed shortly after
the mangling was added (I think in the same GCC release cycle), so we
now reject 0.0r etc. in C++.  If we ever need the fixed point mangling,
I think it can be readded but better with a different prefix so that it
doesn't conflict with the published standard manglings.  So, this patch
also kills the fixed point mangling and implements the DF <number> _
demangling.
The patch predefines __STDCPP_FLOAT{16,32,64,128}_T__ macros when
those types are available, but only for C++23, while the underlying types
are available in C++98 and later including the {f,F}{16,32,64,128} literal
suffixes (but those with a pedwarn for C++20 and earlier).  My understanding
is that it needs to be predefined by the compiler; on the other hand,
predefining even for older modes when <stdfloat> is a new C++23 header
would be weird.  One can find out if _Float{16,32,64,128,32x,64x,128x} is
supported in C++ by
__GNUC__ >= 13 && defined(__FLT{16,32,64,128,32X,64X,128X}_MANT_DIG__)
(but that doesn't work well with older G++ 13 snapshots).

As for std::bfloat16_t, three targets (aarch64, arm and x86) apparently
"support" __bf16 type which has the bfloat16 format, but isn't really
usable, e.g. {aarch64,arm,ix86}_invalid_conversion disallow any conversions
from or to type with BFmode, {aarch64,arm,ix86}_invalid_unary_op disallows
any unary operations on those except for ADDR_EXPR and
{aarch64,arm,ix86}_invalid_binary_op disallows any binary operation on
those.  So, I think we satisfy:
"If the implementation supports an extended floating-point type with the
properties, as specified by ISO/IEC/IEEE 60559, of radix (b) of 2, storage
width in bits (k) of 16, precision in bits (p) of 8, maximum exponent (emax)
of 127, and exponent field width in bits (w) of 8, then the typedef-name
std::bfloat16_t is defined in the header <stdfloat> and names such a type,
the macro __STDCPP_BFLOAT16_T__ is defined, and the floating-point literal
suffixes bf16 and BF16 are supported."
because we don't really support those right now.

2022-09-27  Jakub Jelinek  <[email protected]>

        PR c++/106652
        PR c++/85518
gcc/
        * tree-core.h (enum tree_index): Add TI_FLOAT128T_TYPE
        enumerator.
        * tree.h (float128t_type_node): Define.
        * tree.cc (build_common_tree_nodes): Initialize float128t_type_node.
        * builtins.def (DEF_FLOATN_BUILTIN): Adjust comment now that
        _Float<N> is supported in C++ too.
        * config/i386/i386.cc (ix86_mangle_type): Only mangle as "g"
        float128t_type_node.
        * config/i386/i386-builtins.cc (ix86_init_builtin_types): Use
        float128t_type_node for __float128 instead of float128_type_node
        and create it if NULL.
        * config/i386/avx512fp16intrin.h (_mm_setzero_ph, _mm256_setzero_ph,
        _mm512_setzero_ph, _mm_set_sh, _mm_load_sh): Use 0.0f16 instead of
        0.0f.
        * config/ia64/ia64.cc (ia64_init_builtins): Use
        float128t_type_node for __float128 instead of float128_type_node
        and create it if NULL.
        * config/rs6000/rs6000-c.cc (is_float128_p): Also return true
        for float128t_type_node if non-NULL.
        * config/rs6000/rs6000.cc (rs6000_mangle_type): Don't mangle
        float128_type_node as "u9__ieee128".
        * config/rs6000/rs6000-builtin.cc (rs6000_init_builtins): Use
        float128t_type_node for __float128 instead of float128_type_node
        and create it if NULL.
gcc/c-family/
        * c-common.cc (c_common_reswords): Change _Float{16,32,64,128} and
        _Float{32,64,128}x flags from D_CONLY to 0.
        (shorten_binary_op): Punt if common_type returns error_mark_node.
        (shorten_compare): Likewise.
        (c_common_nodes_and_builtins): For C++ record _Float{16,32,64,128}
        and _Float{32,64,128}x builtin types if available.  For C++
        clear float128t_type_node.
        * c-cppbuiltin.cc (c_cpp_builtins): Predefine
        __STDCPP_FLOAT{16,32,64,128}_T__ for C++23 if supported.
        * c-lex.cc (interpret_float): For q/Q suffixes prefer
        float128t_type_node over float128_type_node.  Allow
        {f,F}{16,32,64,128} suffixes for C++ if supported with pedwarn
        for C++20 and older.  Allow {f,F}{32,64,128}x suffixes for C++
        with pedwarn.  Don't call excess_precision_type for C++.
gcc/cp/
        * cp-tree.h (cp_compare_floating_point_conversion_ranks): Implement
        P1467R9 - Extended floating-point types and standard names except
        for std::bfloat16_t for now.  Declare.
        (extended_float_type_p): New inline function.
        * mangle.cc (write_builtin_type): Mangle float{16,32,64,128}_type_node
        as DF{16,32,64,128}_.  Mangle float{32,64,128}x_type_node as
        DF{32,64,128}x.  Remove FIXED_POINT_TYPE mangling that conflicts
        with that.
        * typeck2.cc (check_narrowing): If one of ftype or type is extended
        floating-point type, compare floating-point conversion ranks.
        * parser.cc (cp_keyword_starts_decl_specifier_p): Handle
        CASE_RID_FLOATN_NX.
        (cp_parser_simple_type_specifier): Likewise and diagnose missing
        _Float<N> or _Float<N>x support if not supported by target.
        * typeck.cc (cp_compare_floating_point_conversion_ranks): New function.
        (cp_common_type): If both types are REAL_TYPE and one or both are
        extended floating-point types, select common type based on comparison
        of floating-point conversion ranks and subranks.
        (cp_build_binary_op): Diagnose operation with floating point arguments
        with unordered conversion ranks.
        * call.cc (standard_conversion): For floating-point conversion, if
        either from or to are extended floating-point types, set conv->bad_p
        for implicit conversion from larger to smaller conversion rank or
        with unordered conversion ranks.
        (convert_like_internal): Emit a pedwarn on such conversions.
        (build_conditional_expr): Diagnose operation with floating point
        arguments with unordered conversion ranks.
        (convert_arg_to_ellipsis): Don't promote extended floating-point types
        narrower than double to double.
        (compare_ics): Implement P1467R9 [over.ics.rank]/4 changes.
gcc/testsuite/
        * g++.dg/cpp23/ext-floating1.C: New test.
        * g++.dg/cpp23/ext-floating2.C: New test.
        * g++.dg/cpp23/ext-floating3.C: New test.
        * g++.dg/cpp23/ext-floating4.C: New test.
        * g++.dg/cpp23/ext-floating5.C: New test.
        * g++.dg/cpp23/ext-floating6.C: New test.
        * g++.dg/cpp23/ext-floating7.C: New test.
        * g++.dg/cpp23/ext-floating8.C: New test.
        * g++.dg/cpp23/ext-floating9.C: New test.
        * g++.dg/cpp23/ext-floating10.C: New test.
        * g++.dg/cpp23/ext-floating.h: New file.
        * g++.target/i386/float16-1.C: Adjust expected diagnostics.
libcpp/
        * expr.cc (interpret_float_suffix): Allow {f,F}{16,32,64,128} and
        {f,F}{32,64,128}x suffixes for C++.
include/
        * demangle.h (enum demangle_component_type): Add
        DEMANGLE_COMPONENT_EXTENDED_BUILTIN_TYPE.
        (struct demangle_component): Add u.s_extended_builtin member.
libiberty/
        * cp-demangle.c (d_dump): Handle
        DEMANGLE_COMPONENT_EXTENDED_BUILTIN_TYPE.  Don't handle
        DEMANGLE_COMPONENT_FIXED_TYPE.
        (d_make_extended_builtin_type): New function.
        (cplus_demangle_builtin_types): Add _Float entry.
        (cplus_demangle_type): For DF demangle it as _Float<N> or
        _Float<N>x rather than fixed point which conflicts with it.
        (d_count_templates_scopes): Handle
        DEMANGLE_COMPONENT_EXTENDED_BUILTIN_TYPE.  Just break; for
        DEMANGLE_COMPONENT_FIXED_TYPE.
        (d_find_pack): Handle DEMANGLE_COMPONENT_EXTENDED_BUILTIN_TYPE.
        Don't handle DEMANGLE_COMPONENT_FIXED_TYPE.
        (d_print_comp_inner): Likewise.
        * cp-demangle.h (D_BUILTIN_TYPE_COUNT): Bump.
        * testsuite/demangle-expected: Replace _Z3xxxDFyuVb test
        with _Z3xxxDF16_DF32_DF64_DF128_CDF16_Vb.  Add
        _Z3xxxDF32xDF64xDF128xCDF32xVb test.
fixincludes/
        * inclhack.def (glibc_cxx_floatn_1, glibc_cxx_floatn_2,
        glibc_cxx_floatn_3): New fixes.
        * tests/base/bits/floatn.h: New file.
        * fixincl.x: Regenerated.
…TE_TO_FLOAT16 when backend supports _Float16.

[T-HEAD][APPLY] f19a327

gcc/ada/ChangeLog:

	* gcc-interface/misc.c (gnat_post_options): Issue an error for
	-fexcess-precision=16.

gcc/c-family/ChangeLog:

	* c-common.c (excess_precision_mode_join): Update below comments.
	(c_ts18661_flt_eval_method): Set excess_precision_type to
	EXCESS_PRECISION_TYPE_FLOAT16 when -fexcess-precision=16.
	* c-cppbuiltin.c (cpp_atomic_builtins): Update below comments.
	(c_cpp_flt_eval_method_iec_559): Set excess_precision_type to
	EXCESS_PRECISION_TYPE_FLOAT16 when -fexcess-precision=16.

gcc/ChangeLog:

	* common.opt: Support -fexcess-precision=16.
	* config/aarch64/aarch64.c (aarch64_excess_precision): Return
	FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when
	EXCESS_PRECISION_TYPE_FLOAT16.
	* config/arm/arm.c (arm_excess_precision): Ditto.
	* config/i386/i386.c (ix86_get_excess_precision): Ditto.
	* config/m68k/m68k.c (m68k_excess_precision): Issue an error
	when EXCESS_PRECISION_TYPE_FLOAT16.
	* config/s390/s390.c (s390_excess_precision): Ditto.
	* coretypes.h (enum excess_precision_type): Add
	EXCESS_PRECISION_TYPE_FLOAT16.
	* doc/tm.texi (TARGET_C_EXCESS_PRECISION): Update documents.
	* doc/tm.texi.in (TARGET_C_EXCESS_PRECISION): Ditto.
	* doc/extend.texi (Half-Precision): Document
	-fexcess-precision=16.
	* flag-types.h (enum excess_precision): Add
	EXCESS_PRECISION_FLOAT16.
	* target.def (excess_precision): Update document.
	* tree.c (excess_precision_type): Set excess_precision_type to
	EXCESS_PRECISION_FLOAT16 when -fexcess-precision=16.

gcc/fortran/ChangeLog:

	* options.c (gfc_post_options): Issue an error for
	-fexcess-precision=16.

gcc/testsuite/ChangeLog:

	* gcc.target/i386/float16-6.c: New test.
	* gcc.target/i386/float16-7.c: New test.
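
A usage sketch of the new option:

  /* Compiled with gcc -fexcess-precision=16 on a target with native
     _Float16 arithmetic, this multiplication is evaluated directly in
     _Float16 (FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16) rather than being
     promoted to float and truncated back.  */
  _Float16 mul (_Float16 a, _Float16 b)
  {
    return a * b;
  }
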
[T-HEAD][APPLY] 54f0224

Correctness and performance test programs used during development of
this project may be found in the attachment to:
https://www.mail-archive.com/[email protected]/msg254210.html

Summary of Purpose

This patch to libgcc/libgcc2.c __divdc3 provides an
opportunity to gain important improvements to the quality of answers
for the default complex divide routine (half, float, double, extended,
long double precisions) when dealing with very large or very small exponents.

The current code correctly implements Smith's method (1962) [2]
further modified by C99's requirements for dealing with NaN (not a
number) results. When working with input values where the exponents
are greater than *_MAX_EXP/2 or less than -(*_MAX_EXP)/2, results are
substantially different from the answers provided by quad precision
more than 1% of the time. This error rate may be unacceptable for many
applications that cannot a priori restrict their computations to the
safe range. The proposed method reduces the frequency of
"substantially different" answers by more than 99% for double
precision at a modest cost of performance.

Differences between current gcc methods and the new method will be
described. Then accuracy and performance differences will be discussed.

Background

This project started with an investigation related to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59714.  Study of Beebe[1]
provided an overview of past and recent practice for computing complex
divide. The current glibc implementation is based on Robert Smith's
algorithm [2] from 1962.  A Google search found the paper by Baudin
and Smith [3] (same Robert Smith) published in 2012. Elen Kalda's
proposed patch [4] is based on that paper.

I developed two sets of test data by randomly distributing values over
a restricted range and the full range of input values. The current
complex divide handled the restricted range well enough, but failed on
the full range more than 1% of the time. Baudin and Smith's primary
test for "ratio" equals zero reduced the cases with 16 or more error
bits by a factor of 5, but still left too many flawed answers. Adding
debug print out to cases with substantial errors allowed me to see the
intermediate calculations for test values that failed. I noted that
for many of the failures, "ratio" was a subnormal. Changing the
"ratio" test from check for zero to check for subnormal reduced the 16
bit error rate by another factor of 12. This single modified test
provides the greatest benefit for the least cost, but the percentage
of cases with greater than 16 bit errors (double precision data) is
still greater than 0.027% (2.7 in 10,000).

Continued examination of remaining errors and their intermediate
computations led to the various input value tests and scaling
to avoid under/overflow. The current patch does not handle some of the
rare and most extreme combinations of input values, but the random
test data is only showing 1 case in 10 million that has an error of
greater than 12 bits. That case has 18 bits of error and is due to
subtraction cancellation. These results are significantly better
than the results reported by Baudin and Smith.

Support for half, float, double, extended, and long double precision
is included as all are handled with suitable preprocessor symbols in a
single source routine. Since half precision is computed with float
precision as per current libgcc practice, the enhanced algorithm
provides no benefit for half precision and would cost performance.
Further investigation showed changing the half precision algorithm
to use the simple formula (real=a*c+b*d imag=b*c-a*d) caused no
loss of precision and modest improvement in performance.

The existing constants for each precision:
float: FLT_MAX, FLT_MIN;
double: DBL_MAX, DBL_MIN;
extended and/or long double: LDBL_MAX, LDBL_MIN
are used for avoiding the more common overflow/underflow cases.  This
use is made generic by defining appropriate __LIBGCC2_* macros in
c-cppbuiltin.c.

Tests are added for when both parts of the denominator have exponents
small enough that any subnormal values can be shifted to normal values
and all input values can be scaled up without risking overflow. That
gained a clear improvement in accuracy. Similarly, when either
numerator was subnormal and the other numerator and both denominator
values were not too large, scaling could be used to reduce risk of
computing with subnormals.  The test and scaling values used all fit
within the allowed exponent range for each precision required by the C
standard.

Float precision has more difficulty with getting correct answers than
double precision. When hardware for double precision floating point
operations is available, float precision is now handled in double
precision intermediate calculations with the simple algorithm the same
as the half-precision method of using float precision for intermediate
calculations. Using the higher precision yields exact results for all
tested input values (64-bit double, 32-bit float) with the only
performance cost being the requirement to convert the four input
values from float to double. If double precision hardware is not
available, then float complex divide will use the same improved
algorithm as the other precisions with similar change in performance.

Further Improvement

The most common remaining substantial errors are due to accuracy loss
when subtracting nearly equal values. This patch makes no attempt to
improve that situation.

NOTATION

For all of the following, the notation is:
Input complex values:
  a+bi  (a= real part, b= imaginary part)
  c+di
Output complex value:
  e+fi = (a+bi)/(c+di)

For the result tables:
current = current method (SMITH)
b1div = method proposed by Elen Kalda
b2div = alternate method considered by Elen Kalda
new = new method proposed by this patch

DESCRIPTIONS of different complex divide methods:

NAIVE COMPUTATION (-fcx-limited-range):
  e = (a*c + b*d)/(c*c + d*d)
  f = (b*c - a*d)/(c*c + d*d)

Note that c*c and d*d will overflow or underflow if either
c or d is outside the range 2^-538 to 2^512.

This method is available in gcc when the switch -fcx-limited-range is
used. That switch is also enabled by -ffast-math. Only one who has a
clear understanding of the maximum range of all intermediate values
generated by an application should consider using this switch.

SMITH's METHOD (current libgcc):
  if (fabs(c) < fabs(d)) {
    r = c/d;
    denom = (c*r) + d;
    e = (a*r + b) / denom;
    f = (b*r - a) / denom;
  } else {
    r = d/c;
    denom = c + (d*r);
    e = (a + b*r) / denom;
    f = (b - a*r) / denom;
  }

Smith's method is the current default method available with __divdc3.

Elen Kalda's METHOD

Elen Kalda proposed a patch about a year ago, also based on Baudin and
Smith, but not including tests for subnormals:
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html [4]
It is compared here for accuracy with this patch.

This method applies the most significant part of the algorithm
proposed by Baudin&Smith (2012) in the paper "A Robust Complex
Division in Scilab" [3]. Elen's method also replaces two divides by
one divide and two multiplies due to the high cost of divide on
aarch64. In the comparison sections, this method will be labeled
b1div. A variation discussed in that patch which does not replace the
two divides will be labeled b2div.

  inline void improved_internal (MTYPE a, MTYPE b, MTYPE c, MTYPE d)
  {
    r = d/c;
    t = 1.0 / (c + (d * r));
    if (r != 0) {
        x = (a + (b * r)) * t;
        y = (b - (a * r)) * t;
    }  else {
    /* Changing the order of operations avoids the underflow of r impacting
     the result. */
        x = (a + (d * (b / c))) * t;
        y = (b - (d * (a / c))) * t;
    }
  }

  if (FABS (d) < FABS (c)) {
      improved_internal (a, b, c, d);
  } else {
      improved_internal (b, a, d, c);
      y = -y;
  }

NEW METHOD (proposed by patch) to replace the current default method:

The proposed method starts with an algorithm proposed by Baudin&Smith
(2012) in the paper "A Robust Complex Division in Scilab" [3]. The
patch makes additional modifications to that method for further
reductions in the error rate. The following code shows the #define
values for double precision. See the patch for #define values used
for other precisions.

  #define RBIG ((DBL_MAX)/2.0)
  #define RMIN (DBL_MIN)
  #define RMIN2 (0x1.0p-53)
  #define RMINSCAL (0x1.0p+51)
  #define RMAX2  ((RBIG)*(RMIN2))

  if (FABS(c) < FABS(d)) {
  /* prevent overflow when arguments are near max representable */
  if ((FABS (d) > RBIG) || (FABS (a) > RBIG) || (FABS (b) > RBIG) ) {
      a = a * 0.5;
      b = b * 0.5;
      c = c * 0.5;
      d = d * 0.5;
  }
  /* minimize overflow/underflow issues when c and d are small */
  else if (FABS (d) < RMIN2) {
      a = a * RMINSCAL;
      b = b * RMINSCAL;
      c = c * RMINSCAL;
      d = d * RMINSCAL;
  }
  else {
    if(((FABS (a) < RMIN) && (FABS (b) < RMAX2) && (FABS (d) < RMAX2)) ||
       ((FABS (b) < RMIN) && (FABS (a) < RMAX2) && (FABS (d) < RMAX2))) {
        a = a * RMINSCAL;
        b = b * RMINSCAL;
        c = c * RMINSCAL;
        d = d * RMINSCAL;
    }
  }
  r = c/d; denom = (c*r) + d;
  if (r > RMIN) {
      e = (a*r + b) / denom;
      f = (b*r - a) / denom;
  } else {
      e = (c * (a/d) + b) / denom;
      f = (c * (b/d) - a) / denom;
  }
  }
[ only presenting the fabs(c) < fabs(d) case here, full code in patch. ]

Before any computation of the answer, the code checks for any input
values near maximum to allow down scaling to avoid overflow.  These
scalings almost never harm the accuracy since scaling by 0.5 is exact
in binary floating point. Values that are over RBIG are relatively
rare, but it is easy to test for them and avoid overflows.

Testing for RMIN2 reveals when both c and d are less than [FLT|DBL]_EPSILON.
By scaling all values by 1/EPSILON, the code converts subnormals to normals,
avoids loss of accuracy and underflows in intermediate computations
that otherwise might occur. If scaling a and b by 1/EPSILON causes either
to overflow, then the computation will overflow whatever method is used.

Finally, we test for either a or b being subnormal (RMIN) and if so,
for the other three values being small enough to allow scaling.  We
only need to test a single denominator value since we have already
determined which of c and d is larger.

Next, r (the ratio of c to d) is checked for being near zero. Baudin
and Smith checked r for zero. This code improves on that approach:
checking for values less than DBL_MIN (subnormal) covers roughly 12
times as many cases and substantially improves overall accuracy. If r
is too small, then when it is used in a multiplication, there is a
high chance that the result will underflow to zero, losing significant
accuracy. That underflow is avoided by reordering the computation.
When r is subnormal, the code replaces a*r (= a*(c/d)) with ((a/d)*c)
which is mathematically the same but avoids the unnecessary underflow.

TEST Data

Two sets of data are presented to test these methods. Both sets
contain 10 million pairs of complex values.  The exponents and
mantissas are generated using multiple calls to random() and then
combining the results. Only values which give results to complex
divide that are representable in the appropriate precision after
being computed in quad precision are used.

The first data set is labeled "moderate exponents".
The exponent range is limited to -DBL_MAX_EXP/2 to DBL_MAX_EXP/2
for double precision (use FLT_MAX_EXP or LDBL_MAX_EXP for the
appropriate precisions).
The second data set is labeled "full exponents".
The exponent range for these cases is the full exponent range
including subnormals for a given precision.

ACCURACY Test results:

Note: The following accuracy tests are based on IEEE-754 arithmetic.

Note: All results reported are based on use of fused multiply-add. If
fused multiply-add is not used, the error rate increases, giving more
1 and 2 bit errors for both the current and the new complex divide.
Differences greater than 2 bits between using fused multiply-add and
not using it occur in less than 1 case in a million.

The complex divide methods are evaluated by determining the percentage
of values that exceed differences in low order bits.  If a "2 bit"
test result shows 1%, that would mean that 1% of 10,000,000 values
(100,000) have either a real or imaginary part that differs from the
quad precision result by more than the last 2 bits.

Results are reported for differences greater than or equal to 1 bit, 2
bits, 8 bits, 16 bits, 24 bits, and 52 bits for double precision.  Even
when the patch avoids overflows and underflows, some input values are
expected to have errors due to the potential for catastrophic roundoff
from floating point subtraction. For example, when b*c and a*d are
nearly equal, the result of subtraction may lose several places of
accuracy. This patch does not attempt to detect or minimize this type
of error, but neither does it increase them.

I only show the results for Elen Kalda's method (with both 1 and
2 divides) and the new method for only 1 divide in the double
precision table.

In the following charts, lower values are better.

current - current complex divide in libgcc
b1div - Elen Kalda's method from Baudin & Smith with one divide
b2div - Elen Kalda's method from Baudin & Smith with two divides
new   - This patch which uses 2 divides

===================================================
Errors   Moderate Dataset
gtr eq     current    b1div      b2div        new
======    ========   ========   ========   ========
 1 bit    0.24707%   0.92986%   0.24707%   0.24707%
 2 bits   0.01762%   0.01770%   0.01762%   0.01762%
 8 bits   0.00026%   0.00026%   0.00026%   0.00026%
16 bits   0.00000%   0.00000%   0.00000%   0.00000%
24 bits         0%         0%         0%         0%
52 bits         0%         0%         0%         0%
===================================================
Table 1: Errors with Moderate Dataset (Double Precision)

Note in Table 1 that both the old and new methods give identical error
rates for data with moderate exponents. Errors exceeding 16 bits are
exceedingly rare. There are substantial increases in the 1 bit error
rates for b1div (the 1 divide/2 multiplies method) as compared to b2div
(the 2 divides method). These differences are minimal for 2 bits and
larger error measurements.

===================================================
Errors   Full Dataset
gtr eq     current    b1div      b2div        new
======    ========   ========   ========   ========
 1 bit      2.05%   1.23842%    0.67130%   0.16664%
 2 bits     1.88%   0.51615%    0.50354%   0.00900%
 8 bits     1.77%   0.42856%    0.42168%   0.00011%
16 bits     1.63%   0.33840%    0.32879%   0.00001%
24 bits     1.51%   0.25583%    0.24405%   0.00000%
52 bits     1.13%   0.01886%    0.00350%   0.00000%
===================================================
Table 2: Errors with Full Dataset (Double Precision)

Table 2 shows significant differences in error rates. First, the
difference between b1div and b2div show a significantly higher error
rate for the b1div method both for single bit errors and well
beyond. Even for 52 bits, we see the b1div method gets completely
wrong answers more than 5 times as often as b2div. To retain
comparable accuracy with current complex divide results for small
exponents and due to the increase in errors for large exponents, I
choose to use the more accurate method of two divides.

The current method has more than 1.6% of cases where the low 24 bits
of the mantissa differ from the correct answer, and more than 1.1% of
cases where the answer is completely wrong.
The new method shows less than one case in 10,000 with greater than
two bits of error and only one case in 10 million with greater than
16 bits of errors. The new patch reduces 8 bit errors by
a factor of 16,000 and virtually eliminates completely wrong
answers.

As noted above, for architectures with double precision
hardware, the new method uses that hardware for the
intermediate calculations before returning the
result in float precision. Testing of the new patch
has found zero errors, as seen in Tables 3 and 4.

Correctness for float
=============================
Errors   Moderate Dataset
gtr eq     current     new
======    ========   ========
 1 bit   28.68070%         0%
 2 bits   0.64386%         0%
 8 bits   0.00401%         0%
16 bits   0.00001%         0%
24 bits         0%         0%
=============================
Table 3: Errors with Moderate Dataset (float)

=============================
Errors   Full Dataset
gtr eq     current     new
======    ========   ========
 1 bit     19.98%         0%
 2 bits     3.20%         0%
 8 bits     1.97%         0%
16 bits     1.08%         0%
24 bits     0.55%         0%
=============================
Table 4: Errors with Full Dataset (float)

As before, the current method shows a troubling rate of extreme
errors.

There are very minor changes in accuracy for half precision, since the
code changes from Smith's method to the simple method: 5 out of 1 million
test cases show correct answers instead of 1 or 2 bit errors.
libgcc computes half-precision functions in float precision
allowing the existing methods to avoid overflow/underflow issues
for the allowed range of exponents for half-precision.

Extended precision (using x87 80-bit format on x86) and Long double
(using IEEE-754 128-bit on x86 and aarch64) both have 15-bit exponents
as compared to 11-bit exponents in double precision. We note that the
C standard also allows Long Double to be implemented in the equivalent
range of Double. The RMIN2 and RMINSCAL constants are selected to work
within the Double range as well as with extended and 128-bit ranges.
We will limit our performance and accuracy discussions to the 80-bit
and 128-bit formats as seen on x86 here.

The extended and long double precision investigations were more
limited. Aarch64 does not support extended precision but does support
the software implementation of 128-bit long double precision. For x86,
long double defaults to the 80-bit precision but using the
-mlong-double-128 flag switches to using the software implementation
of 128-bit precision. Both 80-bit and 128-bit precisions have the same
exponent range, while the 128-bit precision has a wider mantissa.
Since this change is only aimed at avoiding underflow/overflow for
extreme exponents, I studied the extended precision results on x86 for
100,000 values. The limited exponent dataset showed no differences.
For the dataset with full exponent range, the current and new values
showed major differences (greater than 32 bits) in 567 cases out of
100,000 (0.56%). In every one of these cases, the ratio of c/d or d/c
(as appropriate) was zero or subnormal, indicating the advantage of
the new method and its continued correctness where needed.

PERFORMANCE Test results

In order for a library change to be practical, it is necessary to show
the slowdown is tolerable. The slowdowns observed are much less than
would be seen by (for example) switching from hardware double precision
to a software quad precision, which on the tested machines causes a
slowdown of around 100x.

The actual slowdown depends on the machine architecture. It also
depends on the nature of the input data. If underflow/overflow is
rare, then implementations that have strong branch prediction will
only slowdown by a few cycles. If underflow/overflow is common, then
the branch predictors will be less accurate and the cost will be
higher.

Results from two machines are presented as examples of the overhead
for the new method. The one labeled x86 is a 5 year old Intel x86
processor and the one labeled aarch64 is a 3 year old arm64 processor.

In the following chart, the times are averaged over a one million
value data set. All values are scaled to set the time of the current
method to be 1.0. Lower values are better. A value of less than 1.0
would be faster than the current method and a value greater than 1.0
would be slower than the current method.

================================================
               Moderate set          full set
               x86  aarch64        x86  aarch64
========     ===============     ===============
float         0.59    0.79        0.45    0.81
double        1.04    1.24        1.38    1.56
long double   1.13    1.24        1.29    1.25
================================================
Table 5: Performance Comparisons (ratio new/current)

The above tables omit the timing for the 1 divide and 2 multiply
comparison with the 2 divide approach.

The float results show clear performance improvement due to using the
simple method with double precision for intermediate calculations.

The double results with the newer method show less overhead for the
moderate dataset than for the full dataset. That's because the moderate
dataset does not ever take the new branches which protect from
under/overflow. The better the branch predictor, the lower the cost
for these untaken branches. Both platforms are somewhat dated, with
the x86 having a better branch predictor which reduces the cost of the
additional branches in the new code. Of course, the relative slowdown
may be greater for some architectures, especially those with limited
branch prediction combined with a high cost of misprediction.

The long double results are fairly consistent in showing the moderate
additional cost of the extra branches and calculations for all cases.

The observed cost for all precisions is claimed to be tolerable on the
grounds that:

(a) the cost is worthwhile considering the accuracy improvement shown;
(b) most applications will only spend a small fraction of their time
    calculating complex divide;
(c) it is much less than the cost of extended precision; and
(d) users are not forced to use it (as described below).

Those users who find this degree of slowdown unsatisfactory may use
the gcc switch -fcx-fortran-rules which does not use the library
routine, instead inlining Smith's method without the C99 requirement
for dealing with NaN results. The proposed patch for libgcc complex
divide does not affect the code generated by -fcx-fortran-rules.

SUMMARY

When input data to complex divide has exponents whose absolute value
is less than half of *_MAX_EXP, this patch makes no changes in
accuracy and has only a modest effect on performance.  When input data
contains values outside those ranges, the patch eliminates more than
99.9% of major errors with a tolerable cost in performance.

In comparison to Elen Kalda's method, this patch introduces more
performance overhead but reduces major errors by a factor of
greater than 4000.

REFERENCES

[1] Nelson H.F. Beebe, "The Mathematical-Function Computation Handbook,"
Springer International Publishing AG, 2017.

[2] Robert L. Smith. Algorithm 116: Complex division.  Commun. ACM,
 5(8):435, 1962.

[3] Michael Baudin and Robert L. Smith. "A robust complex division in
Scilab," October 2012, available at http://arxiv.org/abs/1210.4539.

[4] Elen Kalda: Complex division improvements in libgcc
https://gcc.gnu.org/legacy-ml/gcc-patches/2019-08/msg01629.html

2020-12-08  Patrick McGehearty  <[email protected]>

gcc/c-family/
	* c-cppbuiltin.c (c_cpp_builtins): Add supporting macros for new
	complex divide.
libgcc/
	* libgcc2.c (XMTYPE, XCTYPE, RBIG, RMIN, RMIN2, RMINSCAL, RMAX2):
	Define.
	(__divsc3, __divdc3, __divxc3, __divtc3): Improve complex divide.
	* config/rs6000/_divkc3.c (RBIG, RMIN, RMIN2, RMINSCAL, RMAX2):
	Define.
	(__divkc3): Improve complex divide.
gcc/testsuite/
	* gcc.c-torture/execute/ieee/cdivchkd.c: New test.
	* gcc.c-torture/execute/ieee/cdivchkf.c: Likewise.
	* gcc.c-torture/execute/ieee/cdivchkld.c: Likewise.
…support

Here is a complete patch to add std::bfloat16_t support on
x86 (AArch64 and ARM left for later).  Almost no BFmode optabs
are added by the patch, so for binops/unops it extends to SFmode
first and then truncates back to BFmode.
For {HF,SF,DF,XF,TF}mode -> BFmode conversions, libgcc has
implementations of all those conversions so that we avoid double
rounding.  For BFmode -> {DF,XF,TF}mode conversions, to avoid growing
libgcc too much, it emits a BFmode -> SFmode conversion first and then
converts to the even wider mode; neither step should be imprecise.
For BFmode -> HFmode, it first emits a precise BFmode -> SFmode conversion
and then SFmode -> HFmode, because neither format is subset or superset
of the other, while SFmode is superset of both.
expr.cc then contains a -ffast-math optimization of the BF -> SF and
SF -> BF conversions if we don't optimize for space (and for the latter
if -frounding-math isn't enabled either).
For x86, perhaps a truncsfbf2 optab could be defined for
TARGET_AVX512BF16, but IMNSHO it should FAIL if !flag_finite_math_only
|| flag_rounding_math || !flag_unsafe_math_optimizations, because I
think the insn doesn't raise on sNaNs, hardcodes round to nearest and
flushes denormals to zero.
By default (unless x86 -fexcess-precision=16) we use float excess
precision for BFmode, so truncate only on explicit casts and assignments.
The patch introduces a single __bf16 builtin - __builtin_nansf16b,
because (__bf16) __builtin_nansf ("") will drop the sNaN into qNaN,
and uses f16b suffix instead of bf16 because there would be ambiguity on
log vs. logb - __builtin_logbf16 could be either log with bf16 suffix
or logb with f16 suffix.  In other cases libstdc++ should mostly use
__builtin_*f for std::bfloat16_t overloads (we have a problem with
std::nextafter though but that one we have also for std::float16_t).
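
As a side note, the claim that the BFmode -> SFmode step is exact can
be illustrated in a few lines of C; this sketch only demonstrates the
format relationship and is not code from the patch:

  #include <stdint.h>
  #include <string.h>

  /* bfloat16 is the upper 16 bits of an IEEE single, so widening is a
     16-bit left shift of the bit pattern; the vacated bits are zero
     and no precision is lost.  */
  static float
  bf16_bits_to_float (uint16_t bf)
  {
    uint32_t bits = (uint32_t) bf << 16;
    float f;
    memcpy (&f, &bits, sizeof f);
    return f;
  }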

2022-10-14  Jakub Jelinek  <[email protected]>

gcc/
        * tree-core.h (enum tree_index): Add TI_BFLOAT16_TYPE.
        * tree.h (bfloat16_type_node): Define.
        * tree.cc (excess_precision_type): Promote bfloat16_type_mode
        like float16_type_mode.
        (build_common_tree_nodes): Initialize bfloat16_type_node if
        BFmode is supported.
        * expmed.h (maybe_expand_shift): Declare.
        * expmed.cc (maybe_expand_shift): No longer static.
        * expr.cc (convert_mode_scalar): Don't ICE on BF -> HF or HF -> BF
        conversions.  If there is no optab, handle BF -> {DF,XF,TF,HF}
        conversions as separate BF -> SF -> {DF,XF,TF,HF} conversions, add
        -ffast-math generic implementation for BF -> SF and SF -> BF
        conversions.
        * builtin-types.def (BT_BFLOAT16, BT_FN_BFLOAT16_CONST_STRING): New.
        * builtins.def (BUILT_IN_NANSF16B): New builtin.
        * fold-const-call.cc (fold_const_call): Handle CFN_BUILT_IN_NANSF16B.
        * config/i386/i386.cc (classify_argument): Handle E_BCmode.
        (ix86_libgcc_floating_mode_supported_p): Also return true for BFmode
        for -msse2.
        (ix86_mangle_type): Mangle BFmode as DF16b.
        (ix86_invalid_conversion, ix86_invalid_unary_op,
        ix86_invalid_binary_op): Remove.
        (TARGET_INVALID_CONVERSION, TARGET_INVALID_UNARY_OP,
        TARGET_INVALID_BINARY_OP): Don't redefine.
        * config/i386/i386-builtins.cc (ix86_bf16_type_node): Remove.
        (ix86_register_bf16_builtin_type): Use bfloat16_type_node rather than
        ix86_bf16_type_node, only create it if still NULL.
        * config/i386/i386-builtin-types.def (BFLOAT16): Likewise.
        * config/i386/i386.md (cbranchbf4, cstorebf4): New expanders.
gcc/c-family/
        * c-cppbuiltin.cc (c_cpp_builtins): If bfloat16_type_node,
        predefine __BFLT16_*__ macros and for C++23 also
        __STDCPP_BFLOAT16_T__.  Predefine bfloat16_type_node related
        macros for -fbuilding-libgcc.
        * c-lex.cc (interpret_float): Handle CPP_N_BFLOAT16.
gcc/c/
        * c-typeck.cc (convert_arguments): Don't promote __bf16 to
        double.
gcc/cp/
        * cp-tree.h (extended_float_type_p): Return true for
        bfloat16_type_node.
        * typeck.cc (cp_compare_floating_point_conversion_ranks): Set
        extended{1,2} if mv{1,2} is bfloat16_type_node.  Adjust comment.
gcc/testsuite/
        * lib/target-supports.exp (check_effective_target_bfloat16,
        check_effective_target_bfloat16_runtime, add_options_for_bfloat16):
        New.
        * gcc.dg/torture/bfloat16-basic.c: New test.
        * gcc.dg/torture/bfloat16-builtin.c: New test.
        * gcc.dg/torture/bfloat16-builtin-issignaling-1.c: New test.
        * gcc.dg/torture/bfloat16-complex.c: New test.
        * gcc.dg/torture/builtin-issignaling-1.c: Allow to be includable
        from bfloat16-builtin-issignaling-1.c.
        * gcc.dg/torture/floatn-basic.h: Allow to be includable from
        bfloat16-basic.c.
        * gcc.target/i386/vect-bfloat16-typecheck_2.c: Adjust expected
        diagnostics.
        * gcc.target/i386/sse2-bfloat16-scalar-typecheck.c: Likewise.
        * gcc.target/i386/vect-bfloat16-typecheck_1.c: Likewise.
        * g++.target/i386/bfloat_cpp_typecheck.C: Likewise.
libcpp/
        * include/cpplib.h (CPP_N_BFLOAT16): Define.
        * expr.cc (interpret_float_suffix): Handle bf16 and BF16 suffixes for
        C++.
libgcc/
        * config/i386/t-softfp (softfp_extensions): Add bfsf.
        (softfp_truncations): Add tfbf xfbf dfbf sfbf hfbf.
        (CFLAGS-extendbfsf2.c, CFLAGS-truncsfbf2.c, CFLAGS-truncdfbf2.c,
        CFLAGS-truncxfbf2.c, CFLAGS-trunctfbf2.c, CFLAGS-trunchfbf2.c): Add
        -msse2.
        * config/i386/libgcc-glibc.ver (GCC_13.0.0): Export
        __extendbfsf2 and __trunc{s,d,x,t,h}fbf2.
        * config/i386/sfp-machine.h (_FP_NANSIGN_B): Define.
        * config/i386/64/sfp-machine.h (_FP_NANFRAC_B): Define.
        * config/i386/32/sfp-machine.h (_FP_NANFRAC_B): Define.
        * soft-fp/brain.h: New file.
        * soft-fp/truncsfbf2.c: New file.
        * soft-fp/truncdfbf2.c: New file.
        * soft-fp/truncxfbf2.c: New file.
        * soft-fp/trunctfbf2.c: New file.
        * soft-fp/trunchfbf2.c: New file.
        * soft-fp/truncbfhf2.c: New file.
        * soft-fp/extendbfsf2.c: New file.
libiberty/
        * cp-demangle.h (D_BUILTIN_TYPE_COUNT): Increment.
        * cp-demangle.c (cplus_demangle_builtin_types): Add std::bfloat16_t
        entry.
        (cplus_demangle_type): Demangle DF16b.
        * testsuite/demangle-expected (_Z3xxxDF16b): New test.
… on ia32 with -mno-sse2 [PR108883]

_Float16 and decltype(0.0bf16) types are supported on x86 only with
-msse2.  On x86_64 that is the default, but on ia32 it is not.
We should still emit fundamental type tinfo for those types in
libsupc++.a/libstdc++.*, regardless of whether libsupc++/libstdc++
is compiled with -msse2 or not, as user programs can be compiled
with different ISA flags from libsupc++/libstdc++ and if they
are compiled with -msse2 and use std::float16_t or std::bfloat16_t
and need RTTI for it, it should work out of the box.  Furthermore,
libstdc++ ABI on ia32 shouldn't depend on whether the library
is compiled with -mno-sse or -msse2.

Unfortunately, just hacking up the libsupc++ Makefile/configure so
that a single source is compiled with -msse2 isn't appropriate,
because that TU also emits code, and that code should be able to run
on all CPUs which libstdc++ supports.  We could add
[[gnu::target ("no-sse2")]] there, perhaps conditionally, but it all
gets quite ugly.

The following patch instead adds a target hook which allows the backend
to temporarily tweak registered types such that emit_support_tinfos
emits whatever is needed.

Additionally, it makes emit_support_tinfo_1 call emit_tinfo_decl
immediately, so that dummy types created temporarily only for
emit_support_tinfos purposes can be nullified again afterwards.  It
also removes the previous fallback_* types used for dfloat*_type_node
tinfos even when decimal types aren't supported.
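
A simplified sketch of how the i386 backend can implement the new hook
(assuming the emit_support_tinfos_callback typedef from target.h and
the ix86_*_type_node globals; the real ix86_emit_support_tinfos may
differ in details):

  /* Temporarily expose the SSE2-only types so that emit_support_tinfos
     emits their tinfos, then hide them again.  */
  static void
  ix86_emit_support_tinfos (emit_support_tinfos_callback callback)
  {
    if (!TARGET_SSE2)
      {
        float16_type_node = ix86_float16_type_node;
        bfloat16_type_node = ix86_bf16_type_node;
        callback (float16_type_node);
        callback (bfloat16_type_node);
        float16_type_node = NULL_TREE;
        bfloat16_type_node = NULL_TREE;
      }
  }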

2023-03-03  Jakub Jelinek  <[email protected]>

        PR target/108883
gcc/
        * target.h (emit_support_tinfos_callback): New typedef.
        * targhooks.h (default_emit_support_tinfos): Declare.
        * targhooks.cc (default_emit_support_tinfos): New function.
        * target.def (emit_support_tinfos): New target hook.
        * doc/tm.texi.in (emit_support_tinfos): Document it.
        * doc/tm.texi: Regenerated.
        * config/i386/i386.cc (ix86_emit_support_tinfos): New function.
        (TARGET_EMIT_SUPPORT_TINFOS): Redefine.
gcc/cp/
        * cp-tree.h (enum cp_tree_index): Remove CPTI_FALLBACK_DFLOAT*_TYPE
        enumerators.
        (fallback_dfloat32_type, fallback_dfloat64_type,
        fallback_dfloat128_type): Remove.
        * rtti.cc (emit_support_tinfo_1): If not emitted already, call
        emit_tinfo_decl and remove from unemitted_tinfo_decls right away.
        (emit_support_tinfos): Move &dfloat*_type_node from fundamentals array
        into new fundamentals_with_fallback array.  Call emit_support_tinfo_1
        on elements of that array too, with the difference that if
        the type is NULL, use a fallback REAL_TYPE for it temporarily.
        Drop the !targetm.decimal_float_supported_p () handling.  Call
        targetm.emit_support_tinfos at the end.
        * mangle.cc (write_builtin_type): Remove references to
        fallback_dfloat*_type.  Handle bfloat16_type_node mangling.
…* tweaks for -fwrapv and C++20+ [PR104711]"

This reverts commit 7a4db01.
… for -fwrapv and C++20+ [PR104711]

As mentioned in the PR, different standards have different definition
on what is an UB left shift.  They all agree on out of bounds (including
negative) shift count.
The rules used by ubsan are:
C99-C2x ((unsigned) x >> (uprecm1 - y)) != 0 then UB
C++11-C++17 x < 0 || ((unsigned) x >> (uprecm1 - y)) > 1 then UB
C++20 and later everything is well defined
Now, for C++20, I added in the P1236R1 implementation an early exit
for the -Wshift-overflow* warnings so that they never warn, but
apparently -Wshift-negative-value remained as is.  As the construct is
well defined in C++20, the following patch doesn't enable
-Wshift-negative-value from -Wextra anymore for C++20 and later; users
who want the warning for compatibility with C++17 and earlier can
still get it by using -Wshift-negative-value explicitly.
Another thing is -fwrapv, which is an extension to the standards, so
it is up to us how exactly we define that case.  Our ubsan code treats
TYPE_OVERFLOW_WRAPS (type0) and cxx_dialect >= cxx20 the same, as only
diagnosing out of bounds shift counts and nothing else, and IMHO it is
most sensible to treat -fwrapv signed left shifts the same as C++20
treats them, https://eel.is/c++draft/expr.shift#2
"The value of E1 << E2 is the unique value congruent to E1×2^E2 modulo 2^N,
where N is the width of the type of the result.
[Note 1: E1 is left-shifted E2 bit positions; vacated bits are zero-filled.
— end note]"
with no UB dependent on the E1 values.  The UB is only
"The behavior is undefined if the right operand is negative, or greater
than or equal to the width of the promoted left operand."
Under the hood (except for the FEs and the ubsan instrumentation
emitted from the FEs), the GCC middle-end doesn't consider UB in left
shifts dependent on the first operand's value, only the out of bounds
shift counts.
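
A one-line example of the dialect differences described above
(illustrative, not part of the patch):

  int
  f (void)
  {
    /* UB in C99-C2x and in C++11-C++17 (negative left operand); well
       defined in C++20, and with -fwrapv after this patch: -8.  */
    return -1 << 3;
  }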

While this change isn't a regression, I'd think it is useful for
GCC 12: it doesn't add new warnings, it just removes warnings that
aren't appropriate.

2022-03-09  Jakub Jelinek  <[email protected]>

	PR c/104711
gcc/
	* doc/invoke.texi (-Wextra): Document that -Wshift-negative-value
	is enabled by it only for C++11 to C++17 rather than for C++03 or
	later.
	(-Wshift-negative-value): Similarly (except here we stated
	that it is enabled for C++11 or later).
gcc/c-family/
	* c-opts.c (c_common_post_options): Don't enable
	-Wshift-negative-value from -Wextra for C++20 or later.
	* c-ubsan.c (ubsan_instrument_shift): Adjust comments.
	* c-warn.c (maybe_warn_shift_overflow): Use TYPE_OVERFLOW_WRAPS
	instead of TYPE_UNSIGNED.
gcc/c/
	* c-fold.c (c_fully_fold_internal): Don't emit
	-Wshift-negative-value warning if TYPE_OVERFLOW_WRAPS.
	* c-typeck.c (build_binary_op): Likewise.
gcc/cp/
	* constexpr.c (cxx_eval_check_shift_p): Use TYPE_OVERFLOW_WRAPS
	instead of TYPE_UNSIGNED.
	* typeck.c (cp_build_binary_op): Don't emit
	-Wshift-negative-value warning if TYPE_OVERFLOW_WRAPS.
gcc/testsuite/
	* c-c++-common/Wshift-negative-value-1.c: Remove
	dg-additional-options, instead in target selectors of each diagnostic
	check for exact C++ versions where it should be diagnosed.
	* c-c++-common/Wshift-negative-value-2.c: Likewise.
	* c-c++-common/Wshift-negative-value-3.c: Likewise.
	* c-c++-common/Wshift-negative-value-4.c: Likewise.
	* c-c++-common/Wshift-negative-value-7.c: New test.
	* c-c++-common/Wshift-negative-value-8.c: New test.
	* c-c++-common/Wshift-negative-value-9.c: New test.
	* c-c++-common/Wshift-negative-value-10.c: New test.
	* c-c++-common/Wshift-overflow-1.c: Remove
	dg-additional-options, instead in target selectors of each diagnostic
	check for exact C++ versions where it should be diagnosed.
	* c-c++-common/Wshift-overflow-2.c: Likewise.
	* c-c++-common/Wshift-overflow-5.c: Likewise.
	* c-c++-common/Wshift-overflow-6.c: Likewise.
	* c-c++-common/Wshift-overflow-7.c: Likewise.
	* c-c++-common/Wshift-overflow-8.c: New test.
	* c-c++-common/Wshift-overflow-9.c: New test.
	* c-c++-common/Wshift-overflow-10.c: New test.
	* c-c++-common/Wshift-overflow-11.c: New test.
	* c-c++-common/Wshift-overflow-12.c: New test.
For the conversion from _Float16 to int, if the corresponding optab
does not exist, the compiler will try the wider mode (SFmode here),
but when floatsfsi exists but FAILs, FROM will have been rewritten
already, which leads to the runtime error reported in the PR.
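
A hypothetical reduced form of the affected conversion (an assumption
for illustration, not necessarily the actual pr101282.c testcase):

  int
  to_int (_Float16 x)
  {
    /* There is no direct HFmode -> SImode fix optab, so expand_fix
       extends to SFmode first and then converts SFmode to SImode.  */
    return (int) x;
  }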

gcc/ChangeLog:

	PR middle-end/102182
	* optabs.c (expand_fix): Add from1 to avoid from being
	overwritten.

gcc/testsuite/ChangeLog:

	PR middle-end/102182
	* gcc.target/i386/pr101282.c: New test.
DI <-> BF conversions can be handled (and are) through
DI <-> XF <-> BF, and for narrower integral modes even sometimes
through DF or SF, because XFmode has a 64-bit mantissa and so all
DImode values are exactly representable in XFmode.  That is not the
case for TImode.  While e.g. the HF -> TI conversions are IMHO useless
in libgcc (HFmode has a -65504.0f16 to 65504.0f16 range, so all the
integers are already representable in SImode, or even HImode for
unsigned, and I think HF -> DI -> TI conversions are faster and
valid), BFmode has roughly the same range as SFmode, so we absolutely
need the TI -> BF conversions to avoid double rounding.

As for BF -> TI conversions, they can either also be implemented in
libgcc, or they can be implemented (as done in this commit) as
BF -> SF -> TI conversions with the same code generation used
elsewhere, just doing the 16-bit left shift of the bits.  I think we
don't need to handle sNaNs during the BF -> SF part, because
SF -> TI (which is already a libcall too) will handle that as well.

The BF -> SF -> TI path avoids wasting:
    32: 0000000000015e10   321 FUNC    GLOBAL DEFAULT   13 __fixbfti@@GCC_13.0.0
    89: 0000000000015f60   299 FUNC    GLOBAL DEFAULT   13 __fixunsbfti@@GCC_13.0.0
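
In user code the chosen path is simply (a hedged illustration, relying
on the __bf16 arithmetic type support added earlier):

  __int128
  bf_to_ti (__bf16 x)
  {
    /* Expanded as BF -> SF (16-bit left shift of the bits) followed
       by the existing SF -> TI libcall; no new libgcc routine is
       needed for this direction.  */
    return (__int128) x;
  }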

2023-03-10  Jakub Jelinek  <[email protected]>

	PR target/107703
	* optabs.c (expand_fix): For conversions from BFmode to integral,
	use shifts to convert it to SFmode first and then convert SFmode
	to integral.

	* soft-fp/floattibf.c: New file.
	* soft-fp/floatuntibf.c: New file.
	* config/i386/libgcc-glibc.ver: Export __float{,un}tibf @ GCC_13.0.0.
	* config/i386/64/t-softfp (softfp_extras): Add floattibf and
	floatuntibf.
	(CFLAGS-floattibf.c, CFLAGS-floatuntibf.c): Add -msse2.
x86_64/i686 has had working std::bfloat16_t support for a few months;
__bf16 there is no longer a storage-only type, but can be used for
arithmetic and is supported in libgcc and libstdc++.

The following patch adds similar support for AArch64.

Unlike the x86 changes, this one keeps the old __bf16 mangling of
u6__bf16 rather than DF16b (so an exception from Itanium ABI), but
otherwise __bf16 and decltype (0.0bf16) are the same type and both
in C++ act as extended floating-point type.

2023-03-13  Jakub Jelinek

gcc/
	* config/aarch64/aarch64.h (aarch64_bf16_type_node): Remove.
	(aarch64_bf16_ptr_type_node): Adjust comment.
	* config/aarch64/aarch64.cc (aarch64_gimplify_va_arg_expr): Use
	bfloat16_type_node rather than aarch64_bf16_type_node.
	(aarch64_libgcc_floating_mode_supported_p,
	aarch64_scalar_mode_supported_p): Also support BFmode.
	(aarch64_invalid_conversion, aarch64_invalid_unary_op): Remove.
	(aarch64_invalid_binary_op): Remove BFmode related rejections.
	(TARGET_INVALID_CONVERSION, TARGET_INVALID_UNARY_OP): Don't redefine.
	* config/aarch64/aarch64-builtins.cc (aarch64_bf16_type_node): Remove.
	(aarch64_int_or_fp_type): Use bfloat16_type_node rather than
	aarch64_bf16_type_node.
	(aarch64_init_simd_builtin_types): Likewise.
	(aarch64_init_bf16_types): Likewise.  Don't create bfloat16_type_node,
	which is created in tree.cc already.
	* config/aarch64/aarch64-sve-builtins.def (svbfloat16_t): Likewise.
gcc/testsuite/
	* gcc.target/aarch64/sve/acle/general-c/ternary_bfloat16_opt_n_1.c:
	Don't expect one __bf16 related error.
	* gcc.target/aarch64/bfloat16_vector_typecheck_1.c: Adjust or remove
	dg-error directives for __bf16 being an extended arithmetic type.
	* gcc.target/aarch64/bfloat16_vector_typecheck_2.c: Likewise.
	* gcc.target/aarch64/bfloat16_scalar_typecheck.c: Likewise.
	* g++.target/aarch64/bfloat_cpp_typecheck.C: Don't expect two __bf16
	related errors.
libgcc/
	* config/aarch64/t-softfp (softfp_extensions): Add bfsf.
	(softfp_truncations): Add tfbf dfbf sfbf hfbf.
	(softfp_extras): Add floatdibf floatundibf floattibf floatuntibf.
	* config/aarch64/libgcc-softfp.ver (GCC_13.0.0): Export
	__extendbfsf2 and __trunc{s,d,t,h}fbf2.
	* config/aarch64/sfp-machine.h (_FP_NANFRAC_B, _FP_NANSIGN_B): Define.
	* soft-fp/floatundibf.c: New file.
	* soft-fp/floatdibf.c: New file.
libstdc++-v3/
	* config/abi/pre/gnu.ver (CXXABI_1.3.14): Also export __bf16 tinfos
	if it isn't mangled as DF16b but u6__bf16.
fixincludes/

	PR other/91085
	* fixfixes.c (check_has_inc): New static function.
	  (machine_name_fix): Don't replace header names in
	  __has_include(...).
	* inclhack.def (machine_name): Adjust test.
	* tests/base/testing.h: Update.
Mainly includes the following changes:
- Added CPU support for c910v2, c920v2, and c908i.
- Added support for the __bf16 data type.
- Added support for bf16-related vector intrinsic interfaces (zvfbfmin, zvfbfwma).
- Added code generation support for the zfa extension and related intrinsic interfaces.
- Added support for rv32 RVV vector intrinsic interfaces.
- Added alpha support for the xtheadmatrix (v0.3) intrinsic interface.
- Added the function attribute "THead-interrupt-nesting" to facilitate the use of nested interrupts on CPUs without the xtheadint instruction set.
- Fixed the issue where interrupt handler functions did not save the floating-point state registers and the P-extension state registers.
- Optimized the multilib selection algorithm.
…o_debug_section [PR116614]

cat abc.C
  #define A(n) struct T##n {} t##n;
  #define B(n) A(n##0) A(n##1) A(n##2) A(n##3) A(n##4) A(n##5) A(n##6) A(n##7) A(n##8) A(n##9)
  #define C(n) B(n##0) B(n##1) B(n##2) B(n##3) B(n##4) B(n##5) B(n##6) B(n##7) B(n##8) B(n##9)
  #define D(n) C(n##0) C(n##1) C(n##2) C(n##3) C(n##4) C(n##5) C(n##6) C(n##7) C(n##8) C(n##9)
  #define E(n) D(n##0) D(n##1) D(n##2) D(n##3) D(n##4) D(n##5) D(n##6) D(n##7) D(n##8) D(n##9)
  E(1) E(2) E(3)
  int main () { return 0; }
./xg++ -B ./ -o abc{.o,.C} -flto -flto-partition=1to1 -O2 -g -fdebug-types-section -c
./xgcc -B ./ -o abc{,.o} -flto -flto-partition=1to1 -O2
(not included in testsuite as it takes a while to compile) FAILs with
lto-wrapper: fatal error: Too many copied sections: Operation not supported
compilation terminated.
/usr/bin/ld: error: lto-wrapper failed
collect2: error: ld returned 1 exit status

The following patch fixes that.  Most of the 64K+ section support for
reading and writing was already there years ago (and the reading side
was already used quite often), and a further bug in it was fixed by
the PR104617 fix.

Yet, the fix isn't solely about removing the
  if (new_i - 1 >= SHN_LORESERVE)
    {
      *err = ENOTSUP;
      return "Too many copied sections";
    }
5 lines; the missing part was that the function only handled reading
of the .symtab_shndx section, but not copying/updating it.
If the result has fewer than 64K-epsilon sections, that actually
wasn't needed, but e.g. with -fdebug-types-section one can exceed that
pretty easily (reported to us on a WebKitGtk build on ppc64le).
Updating the section is slightly more complicated, because it
basically needs to be done in lock step with updating the .symtab
section: if one doesn't need to use SHN_XINDEX for an entry, the
section should contain (or should be updated to contain) an SHN_UNDEF
entry, otherwise it needs to hold whatever wouldn't fit in st_shndx.
But repeating all the symtab decisions about what to discard and how
to rewrite it just for that would be ugly.

So, the patch instead emits the .symtab_shndx section (or sections)
last, prepares their content while processing the associated .symtab
sections, and then, in a second pass over just the .symtab_shndx
sections, uses the saved content.
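
For background, the standard ELF rule the patch has to respect when
rewriting symbols looks roughly like this (a sketch of the reading
side, not code from simple-object-elf.c):

  #include <elf.h>
  #include <stddef.h>

  /* st_shndx is only 16 bits; a section index at or above
     SHN_LORESERVE is stored as SHN_XINDEX plus a parallel 32-bit
     entry in the associated .symtab_shndx section (SHN_UNDEF there
     when unused).  */
  static unsigned int
  symbol_section_index (const Elf64_Sym *sym, const Elf64_Word *shndx,
                        size_t symidx)
  {
    if (sym->st_shndx == SHN_XINDEX)
      return shndx[symidx];
    return sym->st_shndx;
  }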

2024-09-07  Jakub Jelinek  <[email protected]>

	PR lto/116614
	* simple-object-elf.c (SHN_COMMON): Align comment with neighbouring
	comments.
	(SHN_HIRESERVE): Use uppercase hex digits instead of lowercase for
	consistency.
	(simple_object_elf_find_sections): Formatting fixes.
	(simple_object_elf_fetch_attributes): Likewise.
	(simple_object_elf_attributes_merge): Likewise.
	(simple_object_elf_start_write): Likewise.
	(simple_object_elf_write_ehdr): Likewise.
	(simple_object_elf_write_shdr): Likewise.
	(simple_object_elf_write_to_file): Likewise.
	(simple_object_elf_copy_lto_debug_section): Likewise.  Don't fail for
	new_i - 1 >= SHN_LORESERVE, instead arrange in that case to copy
	over .symtab_shndx sections, though emit those last and compute their
	section content when processing associated .symtab sections.  Handle
	simple_object_internal_read failure even in the .symtab_shndx reading
	case.

(cherry picked from commit bb8dd09)