Commit fa6549c1, authored 3 months ago by Jonathan Wakely, committed 2 months ago by Jonathan Wakely
libstdc++: Handle errors from strxfrm in std::collate::transform [PR85824]

std::regex builds a cache of equivalence classes by calling
std::regex_traits<char>::transform_primary(c) for every char, which then
calls std::collate<char>::transform which calls strxfrm. On several
targets strxfrm fails for non-ASCII characters. Because strxfrm has no
return value reserved to indicate an error, some implementations return
INT_MAX or SIZE_MAX. This causes std::collate::transform to try to
allocate a huge buffer, which is either very slow or throws
std::bad_alloc. We should check errno after calling strxfrm to detect
errors and then throw a more appropriate exception instead of trying to
allocate a huge buffer.
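
The errno check described above can be sketched roughly as follows. This is not the libstdc++ code: the helper name checked_strxfrm and the choice of std::runtime_error are assumptions made for illustration, and the real patch may throw a different exception type.

```cpp
#include <cerrno>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of libstdc++): transform a C string with
// strxfrm, treating a non-zero errno as failure instead of trusting the
// returned length.
std::string checked_strxfrm(const char* s)
{
    const int saved = errno;   // remember the caller's errno
    errno = 0;                 // so that any new error is visible

    // With a null destination and size 0, strxfrm only reports the length
    // that the transformed string would need.
    const std::size_t len = std::strxfrm(nullptr, s, 0);
    if (errno != 0) {
        errno = saved;
        // Assumed exception type; the point is to avoid allocating
        // a buffer of INT_MAX or SIZE_MAX bytes.
        throw std::runtime_error("strxfrm failed");
    }

    std::string out(len + 1, '\0');
    const std::size_t n = std::strxfrm(&out[0], s, out.size());
    const int err = errno;
    errno = saved;
    if (err != 0 || n >= out.size())
        throw std::runtime_error("strxfrm failed");

    out.resize(n);
    return out;
}
```

Because the length is only used after errno has been inspected, a bogus return value such as SIZE_MAX never reaches the allocation.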

Unfortunately the std::collate<C>::_M_transform function has a
non-throwing exception specifier, so we can't do the error handling
there.

As well as checking errno, this patch changes std::collate::do_transform
to use __builtin_alloca for small inputs, and to use RAII to deallocate
the buffers used for large inputs.
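
As a rough illustration of the RAII idea, the sketch below uses two hypothetical types, ErrnoGuard and ScratchBuffer. Neither name comes from the patch. A fixed inline array stands in for __builtin_alloca, and the 256-byte threshold is an arbitrary assumption.

```cpp
#include <cerrno>
#include <cstddef>
#include <memory>

// Hypothetical guard: clears errno on entry and restores the caller's
// value on scope exit, including when an exception propagates.
struct ErrnoGuard {
    int saved;
    ErrnoGuard() : saved(errno) { errno = 0; }
    ~ErrnoGuard() { errno = saved; }
    ErrnoGuard(const ErrnoGuard&) = delete;
    ErrnoGuard& operator=(const ErrnoGuard&) = delete;
};

// Hypothetical buffer: small requests use inline storage (a stand-in for
// alloca), large requests use the heap and are freed by unique_ptr.
class ScratchBuffer {
    static constexpr std::size_t inline_size = 256; // assumed threshold
    char small_[inline_size];
    std::unique_ptr<char[]> large_;
    char* ptr_;
    std::size_t size_;

public:
    explicit ScratchBuffer(std::size_t n)
    : large_(n > inline_size ? new char[n] : nullptr),
      ptr_(large_ ? large_.get() : small_),
      size_(n)
    { }

    // Copying or moving would leave ptr_ pointing at another object's
    // inline storage, so both are disabled.
    ScratchBuffer(const ScratchBuffer&) = delete;
    ScratchBuffer& operator=(const ScratchBuffer&) = delete;

    char* data() { return ptr_; }
    std::size_t size() const { return size_; }
    bool on_heap() const { return large_ != nullptr; }
};
```

With both objects declared at the top of a function, every exit path (normal return or throw) frees any heap buffer and restores errno without explicit cleanup code.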

This change isn't sufficient to fix the three std::regex bugs caused by
the lack of error handling in std::collate::do_transform; we also need
to make std::regex_traits::transform_primary handle exceptions. This
change also attempts to make transform_primary closer to the effects
described in the standard, by not even attempting to use std::collate if
the locale's std::collate facet has been replaced (see PR 118105).
Implementing the correct effects for transform_primary requires RTTI, so
that we don't use some user-defined std::collate facet with unknown
semantics. When -fno-rtti is used transform_primary just returns an
empty string, making equivalence classes unusable in std::basic_regex.
That's not ideal, but I don't have any better ideas.
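
The two behaviours described in this paragraph (skipping replaced facets, and tolerating exceptions from transform) might look something like the sketch below. The function primary_key_sketch and the facet CustomCollate are hypothetical names used only for illustration, and the sketch ignores the case-folding question entirely. It requires RTTI, as the text notes.

```cpp
#include <exception>
#include <locale>
#include <string>
#include <typeinfo>

// Hypothetical user-defined facet with unknown semantics.
struct CustomCollate : std::collate<char> { };

// Hypothetical sketch, not the libstdc++ implementation: return a sort key
// only when the locale's collate facet is one of the standard types, and
// fall back to an empty string if the transformation throws.
std::string primary_key_sketch(const std::locale& loc, const std::string& s)
{
    const std::collate<char>& fct = std::use_facet<std::collate<char>>(loc);

    // typeid on a reference to a polymorphic type yields the dynamic type,
    // so a derived user facet is detected here.
    if (typeid(fct) != typeid(std::collate<char>)
        && typeid(fct) != typeid(std::collate_byname<char>))
        return std::string();

    try {
        return fct.transform(s.data(), s.data() + s.size());
    } catch (const std::exception&) {
        // An empty key means "no usable equivalence class".
        return std::string();
    }
}
```

For example, a locale constructed as std::locale(std::locale::classic(), new CustomCollate) yields an empty key from this function, while the classic locale itself goes through std::collate<char>::transform.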

I'm unsure if std::regex_traits<C>::transform_primary is supposed to
convert the string to lower case or not.  The general regex traits
requirements ([re.req] p20) do say "when character case is not
considered" but the specification for the std::regex_traits<char> and
std::regex_traits<wchar_t> specializations ([re.traits] p7) doesn't say
anything about that.

With the r15-6317-geb339c29ee42aa change, transform_primary is not
called unless the regex actually uses an equivalence class. But using an
equivalence class would still fail (or be incredibly slow) on some
targets. With this commit, equivalence classes should be usable on all
targets, without excessive memory allocations.

Arguably, we should not even try to call transform_primary for any char
values over 127, since they're never valid in locales that use UTF-8 or
7-bit ASCII, and probably not in other charsets either. Handling 128
exceptions for every std::regex compilation is very inefficient, but at
least it now works instead of failing with std::bad_alloc, and no longer
allocates 128 x 2GB. Maybe for C++26 we could check the locale's
std::text_encoding and use that to decide whether to cache equivalence
classes for char values over 127.

libstdc++-v3/ChangeLog:

	PR libstdc++/85824
	PR libstdc++/94409
	PR libstdc++/98723
	PR libstdc++/118105
	* include/bits/locale_classes.tcc (collate::do_transform): Check
	errno after calling _M_transform. Use RAII type to manage the
	buffer and to restore errno.
	* include/bits/regex.h (regex_traits::transform_primary): Handle
	exceptions from std::collate::transform and do not try to use
	std::collate for user-defined facets.
parent 8ade3c3e
Tags releases/gcc-13.1.0
Showing with 99 additions and 41 deletions