    572f5e1b
    libcpp: Named universal character escapes and delimited escape sequence tweaks · 572f5e1b
    Jakub Jelinek authored
    On Tue, Aug 30, 2022 at 09:10:37PM +0000, Joseph Myers wrote:
    > I'm seeing build failures of glibc for powerpc64, as illustrated by the
    > following C code:
    >
    > #if 0
    > \NARG
    > #endif
    >
    > (the actual sysdeps/powerpc/powerpc64/sysdep.h code is inside #ifdef
    > __ASSEMBLER__).
    >
    > This shows some problems with this feature - and with delimited escape
    > sequences - as it affects C.  It's fine to accept it as an extension
    > inside string and character literals, because \N or \u{...} would be
    > invalid in the absence of the feature (i.e. the syntax for such literals
    > fails to match, meaning that the rule about undefined behavior for a
    > single ' or " as a pp-token applies).  But outside string and character
    > literals, the usual lexing rules apply, the \ is a pp-token on its own and
    > the code is valid at the preprocessing level, and with expansion of macros
    > appearing before or after the \ (e.g. u defined as a macro in the \u{...}
    > case) it may be valid code at the language level as well.  I don't know
    > what older C++ versions say about this, but for C this means e.g.
    >
    > #define z(x) 0
    > #define a z(
    > int x = a\NARG);
    >
    > needs to be accepted as expanding to "int x = 0;", not interpreted as
    > using the \N feature in an identifier and produce an error.
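
The quoted example can be tried as a standalone translation unit. The following is only a sketch: the macro and variable names come from the quoted mail, while the `#undef` lines are an addition here so that the one-letter macros do not leak into later code.

```c
/* Example from the quoted mail: `a` expands to `z(`, so the tokens
   that follow it (`\`, `NARG`) become the argument of `z`, which
   discards its argument and expands to 0.  */
#define z(x) 0
#define a z(

/* Lexed as the pp-tokens `a`, `\`, `NARG`, `)`, `;`.  The stray `\`
   is a pp-token on its own and never reaches the parser.  */
int x = a\NARG);

/* Added for this sketch only: drop the one-letter macros again.  */
#undef a
#undef z
```

With a preprocessor that follows the lexing rules described in the mail, the declaration is expected to expand to `int x = 0;`.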
    
    The following patch changes this, so that:
    1) outside of string/character literals, \N without a following { is never
       treated as an error or a warning; it is silently treated as a separate \
       token followed by whatever comes after it
    2) \u{123} and \N{LATIN SMALL LETTER A WITH ACUTE} are not handled as an
       extension at all outside of string/character literals in the strict
       standard modes (-std=c*), except for -std=c++{23,2b}; they are handled
       as an extension only in the -std=gnu* modes, because doing so changes
       behavior on valid sources, e.g.
       #define z(x) 0
       #define a z(
       int x = a\u{123});
       int y = a\N{LATIN SMALL LETTER A WITH ACUTE});
    3) introduces the -Wunicode warning (on by default), which warns about
       what looks like an invalid delimited escape sequence or named
       universal character escape outside of string/character literals
       and is treated as separate tokens
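
Points 1 and 2 above can be summarized as a small decision function. This is a simplified model rather than the libcpp code: the enum and function names are hypothetical, it only covers sequences outside of string/character literals, and it models neither the -Wunicode warnings nor the handling of malformed sequences.

```c
#include <stdbool.h>

/* Hypothetical names for this sketch; they are not libcpp identifiers.  */
enum std_mode
{
  MODE_STRICT,        /* -std=c* other than -std=c++23 / -std=c++2b */
  MODE_STRICT_CXX23,  /* -std=c++23 or -std=c++2b */
  MODE_GNU            /* -std=gnu* */
};

enum escape_start
{
  START_N_NO_BRACE,   /* \N not followed by { */
  START_N_BRACE,      /* \N{ */
  START_U_BRACE       /* \u{ */
};

/* Return true if a sequence outside of string/character literals is
   considered as a named or delimited escape at all; false means it is
   lexed as separate tokens.  */
bool
considered_escape_outside_literal (enum std_mode mode,
                                   enum escape_start start)
{
  /* Point 1: \N without { is always split into separate tokens.  */
  if (start == START_N_NO_BRACE)
    return false;

  /* Point 2: \u{ and \N{ are not handled in the strict modes, except
     for C++23, and are handled in the GNU modes.  */
  return mode != MODE_STRICT;
}
```

For instance, `considered_escape_outside_literal (MODE_STRICT, START_U_BRACE)` is false in this model, which matches the requirement that `int x = a\u{123});` keeps its existing meaning in the strict modes.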
    
    2022-09-07  Jakub Jelinek  <jakub@redhat.com>
    
    libcpp/
    	* include/cpplib.h (struct cpp_options): Add cpp_warn_unicode member.
    	(enum cpp_warning_reason): Add CPP_W_UNICODE.
    	* init.cc (cpp_create_reader): Initialize cpp_warn_unicode.
    	* charset.cc (_cpp_valid_ucn): In possible identifier contexts, don't
    	handle \u{ or \N{ specially in -std=c* modes except -std=c++2{3,b}.
    	In possible identifier contexts, don't emit an error and punt
    	if \N isn't followed by {, or if \N{} surrounds some lower case
    	letters or _.  In possible identifier contexts when not C++23, don't
    	emit an error but a warning about unknown character names and treat them as
    	separate tokens.  When treating \u{ or \N{ as separate tokens, emit
    	warnings.
    gcc/
    	* doc/invoke.texi (-Wno-unicode): Document.
    gcc/c-family/
    	* c.opt (Winvalid-utf8): Use ObjC instead of objC.  Remove
    	" in comments" from description.
    	(Wunicode): New option.
    gcc/testsuite/
    	* c-c++-common/cpp/delimited-escape-seq-4.c: New test.
    	* c-c++-common/cpp/delimited-escape-seq-5.c: New test.
    	* c-c++-common/cpp/delimited-escape-seq-6.c: New test.
    	* c-c++-common/cpp/delimited-escape-seq-7.c: New test.
    	* c-c++-common/cpp/named-universal-char-escape-5.c: New test.
    	* c-c++-common/cpp/named-universal-char-escape-6.c: New test.
    	* c-c++-common/cpp/named-universal-char-escape-7.c: New test.
    	* g++.dg/cpp23/named-universal-char-escape1.C: New test.
    	* g++.dg/cpp23/named-universal-char-escape2.C: New test.