    libcpp, c-family: Add (dumb) C23 N3017 #embed support [PR105863] · eba6d2aa
    Jakub Jelinek authored
    The following patch implements the C23 N3017 "#embed - a scannable,
    tooling-friendly binary resource inclusion mechanism" paper.
    
    The implementation is intentionally dumb, in that it doesn't significantly
    speed up compilation of larger initializers and doesn't make it possible
    to use huge #embeds (like several gigabytes large; that is still
    infeasible in both compile time and memory).
    There are 2 reasons for this.  One is that I think the way it is
    implemented now in the patch is how we should handle the smaller #embed
    sizes, dunno with which boundary, whether 32 bytes or 64 or something
    like that,
    certainly handling the single byte cases which is something that can appear
    anywhere in the source where constant integer literal can appear is
    desirable and I think for a few bytes it isn't worth it to come up with
    something smarter and users would like to e.g. see it in -E readably as
    well (perhaps the slow vs. fast boundary should be determined by command
    line option).  And the other one is to be able to more easily find
    regressions in behavior caused by the optimizations, so we have something
    to get back in git to compare against.
    I'm definitely willing to work on the optimizations (likely introduce a new
    CPP_* token type to refer to a range of libcpp owned memory (start + size)
    and similarly some tree which can do the same, and can be at any time e.g.
    split into 2 subparts + say INTEGER_CST in between if needed say for
    const unsigned char d[] = {
     #embed "2GB.dat" prefix (0, 0, ) suffix (, [0x40000000] = 42)
    }; still without having to copy around huge amounts of data; STRING_CST
    owns the memory it points to and can be only 2GB in size), but would
    like to do that incrementally.
    And would like to first include some extensions also not included in
    this patch, like gnu::offset (off) parameter to allow to skip certain
    constant amount of bytes at the start of the files, plus
    gnu::base64 ("base64_encoded_data") parameter to add something which can
    store more efficiently large amounts of the #embed data in preprocessed
    source.
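
A minimal sketch of what "dumb" expansion amounts to, assuming the directive
is conceptually replaced by comma-separated decimal integers.  The function
embed_expand_sketch and its parameters are hypothetical and are not libcpp
code; the real implementation works on preprocessor tokens rather than text.
Following the C23 rules, prefix and suffix are only emitted for a non-empty
resource and if_empty only for an empty one.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: render up to LIMIT bytes of DATA as comma-separated
   decimal integers wrapped in PREFIX and SUFFIX; if no bytes are selected,
   emit IF_EMPTY instead.  Returns 0 on success, -1 if OUT is too small.  */
int embed_expand_sketch(const unsigned char *data, size_t len, size_t limit,
                        const char *prefix, const char *suffix,
                        const char *if_empty, char *out, size_t outsz)
{
  size_t n = len < limit ? len : limit;
  size_t used;
  int w;

  if (n == 0) {
    w = snprintf(out, outsz, "%s", if_empty);
    return (w < 0 || (size_t) w >= outsz) ? -1 : 0;
  }

  w = snprintf(out, outsz, "%s", prefix);
  if (w < 0 || (size_t) w >= outsz)
    return -1;
  used = (size_t) w;

  for (size_t i = 0; i < n; i++) {
    /* Every byte becomes a plain integer literal, separated by commas.  */
    w = snprintf(out + used, outsz - used, i ? ",%u" : "%u",
                 (unsigned) data[i]);
    if (w < 0 || (size_t) w >= outsz - used)
      return -1;
    used += (size_t) w;
  }

  w = snprintf(out + used, outsz - used, "%s", suffix);
  return (w < 0 || (size_t) w >= outsz - used) ? -1 : 0;
}

/* Example use: four bytes, no prefix/suffix.  */
void embed_expand_sketch_check(void)
{
  const unsigned char bytes[] = { 123, 2, 35, 26 };
  char buf[64];

  assert(embed_expand_sketch(bytes, sizeof bytes, 4, "", "", "",
                             buf, sizeof buf) == 0);
  assert(strcmp(buf, "123,2,35,26") == 0);
}
```

Since every byte costs one token plus a separator, the work grows linearly
with the file size, which is why this approach does not scale to huge files.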
    
    I've been cross-checking all the tests also against the LLVM implementation
    https://github.com/llvm/llvm-project/pull/68620
    which has been for a few hours even committed to LLVM trunk but reverted
    afterwards.  LLVM now has the support committed and I admit I haven't
    rechecked whether the behavior on the spots mentioned below has been
    fixed in it already or not yet.
    
    The patch uses the --embed-dir= option that the clang pull request above
    plans to add and doesn't use other variants on the search directories
    yet, plus there are no
    default directories at least for the time being where to search for embed
    files.  So, #embed "..." works if it is found in the same directory (or
    relative to the current file's directory) and #embed "/..." or #embed </...>
    work always, but relative #embed <...> doesn't unless at least one
    --embed-dir= is specified.  There is no reason to differentiate between
    system and non-system directories, so we don't need -isystem like
    counterpart, perhaps -iquote like counterpart could be useful in the future,
    dunno what else.  It has --embed-directory=dir and --embed-directory dir
    as aliases.
    
    There are some differences beyond clang ICEs, so I'd like to point them out
    to make sure there is agreement on the choices in the patch.  They are also
    mentioned in the comments of the llvm pull request.
    
    The most important is that the GCC patch (as well as the original thephd.dev
    LLVM branch on godbolt) expands #embed (or acts as if it is expanded) into
    a mere sequence of numbers like 123,2,35,26 rather than what clang
    effectively treats as (unsigned char)123,(unsigned char)2,(unsigned
    char)35,(unsigned char)26 but only does that when using integrated
    preprocessor, not when using -save-temps where it acts as GCC.
    JeanHeyd as the original author agrees that is how it is currently worded in
    C23.
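
The difference is observable from C code.  The following sketch uses two
hypothetical macros (BYTE_AS_INT and BYTE_AS_UCHAR) as stand-ins for the two
expansions described above; it does not use #embed itself, so it compiles
with any C11 compiler.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-ins for a one-byte #embed expansion.  */
#define BYTE_AS_INT   123                    /* plain integer constant */
#define BYTE_AS_UCHAR ((unsigned char) 123)  /* cast form described above */

/* _Generic selects on the type of its operand without applying the
   integer promotions, so it can tell the two forms apart.  */
#define TYPE_NAME(x) _Generic((x),           \
    unsigned char: "unsigned char",          \
    int: "int",                              \
    default: "other")

const char *embed_plain_type(void) { return TYPE_NAME(BYTE_AS_INT); }
const char *embed_cast_type(void)  { return TYPE_NAME(BYTE_AS_UCHAR); }

void embed_type_check(void)
{
  assert(strcmp(embed_plain_type(), "int") == 0);
  assert(strcmp(embed_cast_type(), "unsigned char") == 0);
  /* sizeof sees the same difference.  */
  assert(sizeof(BYTE_AS_INT) == sizeof(int));
  assert(sizeof(BYTE_AS_UCHAR) == 1);
}
```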
    
    Another difference (not tested in the testsuite, not sure how to check for
    effective target /dev/urandom nor am I sure it is desirable to check that
    during testsuite) is how to treat character devices, named pipes etc.
    (block devices are errored on).  The original paper uses /dev/urandom
    in various examples and seems to assume that unlike regular files the
    devices aren't really cached, so
     #embed </dev/urandom> limit(1) prefix(int a = ) suffix(;)
     #embed </dev/urandom> limit(1) prefix(int b = ) suffix(;)
    usually results in a != b.  That is what the godbolt thephd.dev branch
    implements too and what this patch does as well, but clang actually seems
    to just go from st.st_size == 0, ergo it must be zero-sized resource and
    so just copies over if_empty if present.  It is really questionable
    what to do about the character devices/named pipes with __has_embed, for
    regular files the patch doesn't read anything from them, relies on
    st.st_size + limit for whether it is empty or non-empty.  But I don't know
    of a way to check if read on say a character device would read anything
    or not (the </dev/null> limit (1) vs. </dev/zero> limit (1) cases), and
    if we read something, that would be better cached for later because
     #embed later if it reads again could read no further data even when it
    first read something.  So, the patch currently for __has_embed just
    always returns 2 on the non-regular files, like the thephd.dev
    branch does as well and like the clang pull request as well.
    A question is also what to do for gnu::offset on the non-regular files
    even for #embed, those aren't seekable and do we want to just read and throw
    away the offset bytes each time we see it used?
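
A rough sketch of the __has_embed decision described above, assuming a POSIX
stat ().  The function has_embed_sketch and the enumerator names are
hypothetical; only the values 0, 1 and 2 (those of __STDC_EMBED_NOT_FOUND__,
__STDC_EMBED_FOUND__ and __STDC_EMBED_EMPTY__) and the "size plus limit for
regular files, always 2 otherwise" rule come from the text.  Search
directories, directories as paths and the block device error are ignored
here.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>

enum {
  EMBED_NOT_FOUND = 0,  /* value of __STDC_EMBED_NOT_FOUND__ */
  EMBED_FOUND = 1,      /* value of __STDC_EMBED_FOUND__ */
  EMBED_EMPTY = 2       /* value of __STDC_EMBED_EMPTY__ */
};

/* Hypothetical helper: classify PATH without reading any data from it.  */
int has_embed_sketch(const char *path, unsigned long long limit)
{
  struct stat st;

  if (stat(path, &st) != 0)
    return EMBED_NOT_FOUND;

  /* Character devices, named pipes etc.: there is no way to know whether
     a read would return data, so always answer 2, as described above.  */
  if (!S_ISREG(st.st_mode))
    return EMBED_EMPTY;

  /* Regular files: decide from the size and the limit alone.  */
  if (st.st_size == 0 || limit == 0)
    return EMBED_EMPTY;
  return EMBED_FOUND;
}

void has_embed_sketch_check(void)
{
  /* Assumes this path does not exist on the machine running the check.  */
  assert(has_embed_sketch("/no/such/dir/no-such-file", 1) == EMBED_NOT_FOUND);
}
```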
    
    clang also chokes on the
     #if __has_embed (__FILE__ __limit__ (1) __prefix__ () suffix (1 / 0) \
     __if_empty__ ((({{[0[0{0{0(0(0)1)1}1}]]}})))) != __STDC_EMBED_FOUND__
     #error "__has_embed fail"
     #endif
    in embed-1.c, but thephd.dev branch accepts it and I don't see why
    it shouldn't, (({{[0[0{0{0(0(0)1)1}1}]]}})) is a balanced token
    sequence and the file isn't empty, so it should just be parsed and
    discarded.
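
To illustrate what "balanced" means for such an argument, here is a small
character-level sketch.  The function is_balanced_sketch is hypothetical and
is not the patch's skip_balanced_token_seq; a real preprocessor works on
tokens, so brackets inside string or character literals would need separate
handling.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if every (, [ and { in S is closed by the
   matching ), ] or } in properly nested order, 0 otherwise.  */
int is_balanced_sketch(const char *s)
{
  char stack[256];   /* expected closing brackets, innermost last */
  size_t depth = 0;

  for (; *s; s++) {
    char want;

    switch (*s) {
    case '(': want = ')'; break;
    case '[': want = ']'; break;
    case '{': want = '}'; break;
    case ')': case ']': case '}':
      if (depth == 0 || stack[depth - 1] != *s)
        return 0;    /* stray or mismatched closing bracket */
      depth--;
      continue;
    default:
      continue;      /* any other character is irrelevant here */
    }
    if (depth == sizeof stack)
      return 0;      /* nesting too deep for this sketch */
    stack[depth++] = want;
  }
  return depth == 0;
}

void is_balanced_sketch_check(void)
{
  /* The __if_empty__ argument from the test above.  */
  assert(is_balanced_sketch("(({{[0[0{0{0(0(0)1)1}1}]]}}))") == 1);
  /* One extra closing parenthesis makes it unbalanced.  */
  assert(is_balanced_sketch("(({{[0[0{0{0(0(0)1)1}1}]]}})))") == 0);
}
```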
    
    clang also IMHO mishandles
     const unsigned char w[] = {
     #embed __FILE__ prefix([0] = 42, [15] =) limit(32)
     };
    but again only without -save-temps, seems like it
    treats it as
    [0] = 42, [15] = (99,111,110,115,116,32,117,110,115,105,103,110,101,100,
    32,99,104,97,114,32,119,91,93,32,61,32,123,10,35,101,109,98)
    rather than
    [0] = 42, [15] = 99,111,110,115,116,32,117,110,115,105,103,110,101,100,
    32,99,104,97,114,32,119,91,93,32,61,32,123,10,35,101,109,98
    and warns on it for -Wunused-value and just compiles it as
    [0] = 42, [15] = 98
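
The two readings can be reproduced without #embed.  In this sketch the
macros BYTES and BYTES_PAREN are hypothetical stand-ins for the expansion,
shortened to 4 bytes and index 3 instead of 32 bytes and index 15.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* First four bytes of "const unsigned char w[] = {".  */
#define BYTES 99, 111, 110, 115
/* The same bytes as one parenthesized comma expression.  The (void) casts
   only silence -Wunused-value; the value of the expression is 115.  */
#define BYTES_PAREN ((void) 99, (void) 111, (void) 110, 115)

/* Sequence reading: [3] starts a run of four consecutive elements.
   OUT must have room for at least 7 bytes.  */
size_t w_sequence(unsigned char *out)
{
  const unsigned char w[] = { [0] = 42, [3] = BYTES };
  memcpy(out, w, sizeof w);
  return sizeof w;
}

/* Parenthesized reading: [3] gets only the last byte.
   OUT must have room for at least 4 bytes.  */
size_t w_parenthesized(unsigned char *out)
{
  const unsigned char w[] = { [0] = 42, [3] = BYTES_PAREN };
  memcpy(out, w, sizeof w);
  return sizeof w;
}

void w_check(void)
{
  unsigned char buf[8];

  assert(w_sequence(buf) == 7);
  assert(buf[0] == 42 && buf[3] == 99 && buf[6] == 115);
  assert(w_parenthesized(buf) == 4);
  assert(buf[0] == 42 && buf[3] == 115);
}
```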
    
    And also
     void foo (int, int, int, int);
     void bar (void) { foo (
     #embed __FILE__ limit (4) prefix (172 + ) suffix (+ 2)
     ); }
    is treated as
    172 + (118, 111, 105, 100) + 2
    rather than
    172 + 118, 111, 105, 100 + 2
    which is how clang -save-temps or GCC treats it, so it results
    in just one argument being passed rather than 4.
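
The same effect can be shown with ordinary macros.  In this sketch EMBED4 and
EMBED4_PAREN are hypothetical stand-ins for the two expansions of the first
four bytes ("void"), and foo4/foo1 are illustrative helpers that record what
they were passed.

```c
#include <assert.h>

#define EMBED4 118, 111, 105, 100
/* One parenthesized comma expression; the (void) casts only silence
   -Wunused-value, the value is 100.  */
#define EMBED4_PAREN ((void) 118, (void) 111, (void) 105, 100)

struct call_args { int count; int v[4]; };

struct call_args foo4(int a, int b, int c, int d)
{
  struct call_args r = { 4, { a, b, c, d } };
  return r;
}

struct call_args foo1(int a)
{
  struct call_args r = { 1, { a, 0, 0, 0 } };
  return r;
}

/* Sequence reading: 172 + 118, 111, 105, 100 + 2 is four arguments.  */
struct call_args call_sequence(void)
{
  return foo4(172 + EMBED4 + 2);
}

/* Parenthesized reading: 172 + (118, 111, 105, 100) + 2 is one argument.  */
struct call_args call_parenthesized(void)
{
  return foo1(172 + EMBED4_PAREN + 2);
}

void call_check(void)
{
  struct call_args s = call_sequence();
  struct call_args p = call_parenthesized();

  assert(s.count == 4);
  assert(s.v[0] == 290 && s.v[1] == 111 && s.v[2] == 105 && s.v[3] == 102);
  assert(p.count == 1 && p.v[0] == 274);
}
```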
    
    if (!strstr ((const char *) magna_carta, "imprisonétur")) abort ();
    in the testcase fails as well, but in that case calling it in gdb succeeds:
    p ((char *(*)(char *, char *))__strstr_sse2) (magna_carta, "imprisonétur")
    $2 = 0x555555558d3c <magna_carta+11564> "imprisonétur aut disseisiátur"...
    so I guess they are just trying to constant evaluate strstr and do it
    incorrectly.
    
    They started with making the optimizations together in the initial patch
    set, so they don't have the luxury to compare if it is just because of
    the optimization they are trying to do or because that is how the
    feature works for them.  At least unless they use -save-temps for now.
    
    There is also different behavior between clang and gcc on -M or other
    dependency generating options.  Seems clang includes the __has_embed
    searched files in dependencies, while my patch doesn't.  But so does
    clang for __has_include and GCC doesn't.  Emitting a hard dependency
    on some header just because there was __has_include/__has_embed for it
    seems wrong to me, because (at least when properly written) the source
    likely doesn't mind if the file is missing, it will do something else,
    so a hard error from make because of it doesn't seem right.  Does
    make have some weaker dependencies, such that if some file can be remade
    it is but if it doesn't exist, it isn't fatal?
    
    I wonder whether #embed <non-existent-file> really needs to be fatal
    or whether we could simply after diagnosing it pretend the file exists
    and is empty.  For #include I think fatal errors make tons of sense,
    but perhaps for #embed which is more localized we'd get better error
    reporting if we didn't bail out immediately.  Note, both GCC and clang
    currently treat those as fatal errors.
    
    clang also added -dE option which with -E instead of preprocessing
    the #embed directives keeps them as is, but the preprocessed source
    then isn't self-contained.  That option looks more harmful than useful to
    me.
    
    Also, it isn't clear to me from C23 whether it is possible to have
    __has_include/__has_c_attribute/__has_embed expressions inside of
    the limit #embed/__has_embed argument.
    6.10.3.2/2 says that defined should not appear there (and the patch
    diagnoses it and testsuite tests), but for __has_include/__has_embed
    etc. 6.10.1/11 says:
    "The identifiers __has_include, __has_embed, and __has_c_attribute
    shall not appear in any context not mentioned in this subclause."
    If that subclause in that case means 6.10.1, then it presumably shouldn't
    appear in #embed in 6.10.3, but __has_embed is in 6.10.1...
    But 6.10.3.2/3 says that it should be parsed according to the 6.10.1
    rules.  Haven't included tests like
     #if __has_embed (__FILE__ limit (__has_embed (__FILE__ limit (1))))
    or
     #embed __FILE__ limit (__has_include (__FILE__))
    into the testsuite because of the doubts but I think the patch should
    handle those right now.
    
    The reason I've used Magna Carta text in some of the testcases is that
    I hope it shouldn't be copyrighted after the centuries and I'd strongly
    prefer not to have binary blobs in git after the xz backdoor lesson
    and wanted something larger which doesn't change all the time.
    
    Oh, BTW, I see in C23 draft 6.10.3.2 in Example 4
    if (f_source == NULL);
      return 1;
    (note the spurious semicolon after closing paren), has that been fixed
    already?
    
    Like the thephd.dev and clang implementations, the patch always macro
    expands the whole #embed and __has_embed directives except for the
    embed keyword.  That is most likely not what C23 says, my limited
    understanding right now is that in #embed one needs to parse the whole
    directive line with macro expansion disabled and check if it satisfies the
    grammar, if not, the whole directive is macro expanded, if yes, only
    the limit parameter argument is macro expanded and the prefix/suffix/if_empty
    arguments are maybe macro expanded when actually used (and not at all if
    unused).  And I think __has_embed macro expansion has conflicting rules.
    
    2024-09-12  Jakub Jelinek  <jakub@redhat.com>
    
    	PR c/105863
    libcpp/
    	* include/cpplib.h: Implement C23 N3017 #embed - a scannable,
    	tooling-friendly binary resource inclusion mechanism paper.
    	(struct cpp_options): Add embed member.
    	(enum cpp_builtin_type): Add BT_HAS_EMBED.
    	(cpp_set_include_chains): Add another cpp_dir * argument to
    	the declaration.
    	* internal.h (enum include_type): Add IT_EMBED.
    	(struct cpp_reader): Add embed_include member.
    	(struct cpp_embed_params_tokens): New type.
    	(struct cpp_embed_params): New type.
    	(_cpp_get_token_no_padding): Declare.
    	(enum _cpp_find_file_kind): Add _cpp_FFK_EMBED and _cpp_FFK_HAS_EMBED.
    	(_cpp_stack_embed): Declare.
    	(_cpp_parse_expr): Change return type to cpp_num_part instead of
    	bool, change second argument from bool to const char * and add third
    	argument.
    	(_cpp_parse_embed_params): Declare.
    	* directives.cc (DIRECTIVE_TABLE): Add embed entry.
    	(end_directive): Don't call skip_rest_of_line for T_EMBED directive.
    	(_cpp_handle_directive): Return 2 rather than 1 for T_EMBED in
    	directives-only mode.
    	(parse_include): Don't call check_eol for T_EMBED directive.
    	(skip_balanced_token_seq): New function.
    	(EMBED_PARAMS): Define.
    	(enum embed_param_kind): New type.
    	(embed_params): New variable.
    	(_cpp_parse_embed_params): New function.
    	(do_embed): New function.
    	(do_if): Adjust _cpp_parse_expr caller.
    	(do_elif): Likewise.
    	* expr.cc (parse_defined): Diagnose defined in #embed or __has_embed
    	parameters.
    	(_cpp_parse_expr): Change return type to cpp_num_part instead of
    	bool, change second argument from bool to const char * and add third
    	argument.  Adjust function comment.  For #embed/__has_embed parameters
    	add an artificial CPP_OPEN_PAREN.  Use the second argument DIR
    	directly instead of string literals conditional on IS_IF.
    	For #embed/__has_embed parameter, stop on reaching CPP_CLOSE_PAREN
    	matching the artificial one.  Diagnose negative or too large embed
    	parameter operands.
    	(num_binary_op): Use #embed instead of #if for diagnostics if inside
    	#embed/__has_embed parameter.
    	(num_div_op): Likewise.
    	* files.cc (struct _cpp_file): Add limit member and embed bitfield.
    	(search_cache): Add IS_EMBED argument, formatting fix.  Skip over
    	files with different file->embed from the argument.
    	(find_file_in_dir): Don't call pch_open_file if file->embed.
    	(_cpp_find_file): Handle _cpp_FFK_EMBED and _cpp_FFK_HAS_EMBED.
    	(read_file_guts): Formatting fix.
    	(has_unique_contents): Ignore file->embed files.
    	(search_path_head): Handle IT_EMBED type.
    	(_cpp_stack_embed): New function.
    	(_cpp_get_file_stat): Formatting fix.
    	(cpp_set_include_chains): Add embed argument, save it to
    	pfile->embed_include and compute lens for the chain.
    	* init.cc (struct lang_flags): Add embed member.
    	(lang_defaults): Add embed initializers.
    	(cpp_set_lang): Initialize CPP_OPTION (pfile, embed).
    	(builtin_array): Add __has_embed entry.
    	(cpp_init_builtins): Predefine __STDC_EMBED_NOT_FOUND__,
    	__STDC_EMBED_FOUND__ and __STDC_EMBED_EMPTY__.
    	* lex.cc (cpp_directive_only_process): Handle #embed.
    	* macro.cc (cpp_get_token_no_padding): Rename to ...
    	(_cpp_get_token_no_padding): ... this.  No longer static.
    	(builtin_has_include_1): New function.
    	(builtin_has_include): Use it.  Use _cpp_get_token_no_padding
    	instead of cpp_get_token_no_padding.
    	(builtin_has_embed): New function.
    	(_cpp_builtin_macro_text): Handle BT_HAS_EMBED.
    gcc/
    	* doc/cppdiropts.texi (--embed-dir=): Document.
    	* doc/cpp.texi (Binary Resource Inclusion): New chapter.
    	(__has_embed): Document.
    	* doc/invoke.texi (Directory Options): Mention --embed-dir=.
    	* gcc.cc (cpp_unique_options): Add %{-embed*}.
    	* genmatch.cc (main): Adjust cpp_set_include_chains caller.
    	* incpath.h (enum incpath_kind): Add INC_EMBED.
    	* incpath.cc (merge_include_chains): Handle INC_EMBED.
    	(register_include_chains): Adjust cpp_set_include_chains caller.
    gcc/c-family/
    	* c.opt (-embed-dir=): New option.
    	(-embed-directory): New alias.
    	(-embed-directory=): New alias.
    	* c-opts.cc (c_common_handle_option): Handle OPT__embed_dir_.
    gcc/testsuite/
    	* c-c++-common/cpp/embed-1.c: New test.
    	* c-c++-common/cpp/embed-2.c: New test.
    	* c-c++-common/cpp/embed-3.c: New test.
    	* c-c++-common/cpp/embed-4.c: New test.
    	* c-c++-common/cpp/embed-5.c: New test.
    	* c-c++-common/cpp/embed-6.c: New test.
    	* c-c++-common/cpp/embed-7.c: New test.
    	* c-c++-common/cpp/embed-8.c: New test.
    	* c-c++-common/cpp/embed-9.c: New test.
    	* c-c++-common/cpp/embed-10.c: New test.
    	* c-c++-common/cpp/embed-11.c: New test.
    	* c-c++-common/cpp/embed-12.c: New test.
    	* c-c++-common/cpp/embed-13.c: New test.
    	* c-c++-common/cpp/embed-14.c: New test.
    	* c-c++-common/cpp/embed-25.c: New test.
    	* c-c++-common/cpp/embed-26.c: New test.
    	* c-c++-common/cpp/embed-dir/embed-1.inc: New test.
    	* c-c++-common/cpp/embed-dir/embed-3.c: New test.
    	* c-c++-common/cpp/embed-dir/embed-4.c: New test.
    	* c-c++-common/cpp/embed-dir/magna-carta.txt: New test.
    	* gcc.dg/cpp/embed-1.c: New test.
    	* gcc.dg/cpp/embed-2.c: New test.
    	* gcc.dg/cpp/embed-3.c: New test.
    	* gcc.dg/cpp/embed-4.c: New test.
    	* g++.dg/cpp/embed-1.C: New test.
    	* g++.dg/cpp/embed-2.C: New test.
    	* g++.dg/cpp/embed-3.C: New test.