  1. <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
  2. <html>
  3. <!-- Copyright (C) 1987-2023 Free Software Foundation, Inc.
  4. Permission is granted to copy, distribute and/or modify this document
  5. under the terms of the GNU Free Documentation License, Version 1.3 or
  6. any later version published by the Free Software Foundation. A copy of
  7. the license is included in the
  8. section entitled "GNU Free Documentation License".
  9. This manual contains no Invariant Sections. The Front-Cover Texts are
  10. (a) (see below), and the Back-Cover Texts are (b) (see below).
  11. (a) The FSF's Front-Cover Text is:
  12. A GNU Manual
  13. (b) The FSF's Back-Cover Text is:
  14. You have freedom to copy and modify this GNU Manual, like GNU
  15. software. Copies published by the Free Software Foundation raise
  16. funds for GNU development. -->
  17. <!-- Created by GNU Texinfo 6.7, http://www.gnu.org/software/texinfo/ -->
  18. <head>
  19. <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
  20. <title>Tokenization (The C Preprocessor)</title>
  21. <meta name="description" content="Tokenization (The C Preprocessor)">
  22. <meta name="keywords" content="Tokenization (The C Preprocessor)">
  23. <meta name="resource-type" content="document">
  24. <meta name="distribution" content="global">
  25. <meta name="Generator" content="makeinfo">
  26. <link href="index.html" rel="start" title="Top">
  27. <link href="Index-of-Directives.html" rel="index" title="Index of Directives">
  28. <link href="index.html#SEC_Contents" rel="contents" title="Table of Contents">
  29. <link href="Overview.html" rel="up" title="Overview">
  30. <link href="The-preprocessing-language.html" rel="next" title="The preprocessing language">
  31. <link href="Initial-processing.html" rel="prev" title="Initial processing">
  32. <style type="text/css">
  33. <!--
  34. a.summary-letter {text-decoration: none}
  35. blockquote.indentedblock {margin-right: 0em}
  36. div.display {margin-left: 3.2em}
  37. div.example {margin-left: 3.2em}
  38. div.lisp {margin-left: 3.2em}
  39. kbd {font-style: oblique}
  40. pre.display {font-family: inherit}
  41. pre.format {font-family: inherit}
  42. pre.menu-comment {font-family: serif}
  43. pre.menu-preformatted {font-family: serif}
  44. span.nolinebreak {white-space: nowrap}
  45. span.roman {font-family: initial; font-weight: normal}
  46. span.sansserif {font-family: sans-serif; font-weight: normal}
  47. ul.no-bullet {list-style: none}
  48. -->
  49. </style>
  50. </head>
  51. <body lang="en">
  52. <span id="Tokenization"></span><div class="header">
  53. <p>
  54. Next: <a href="The-preprocessing-language.html" accesskey="n" rel="next">The preprocessing language</a>, Previous: <a href="Initial-processing.html" accesskey="p" rel="prev">Initial processing</a>, Up: <a href="Overview.html" accesskey="u" rel="up">Overview</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index-of-Directives.html" title="Index" rel="index">Index</a>]</p>
  55. </div>
  56. <hr>
  57. <span id="Tokenization-1"></span><h3 class="section">1.3 Tokenization</h3>
  58. <span id="index-tokens"></span>
  59. <span id="index-preprocessing-tokens"></span>
  60. <p>After the textual transformations are finished, the input file is
  61. converted into a sequence of <em>preprocessing tokens</em>. These mostly
  62. correspond to the syntactic tokens used by the C compiler, but there are
  63. a few differences. White space separates tokens; it is not itself a
  64. token of any kind. Tokens do not have to be separated by white space,
  65. but it is often necessary to avoid ambiguities.
  66. </p>
  67. <p>When faced with a sequence of characters that has more than one possible
  68. tokenization, the preprocessor is greedy. It always makes each token,
  69. starting from the left, as big as possible before moving on to the next
  70. token. For instance, <code>a+++++b</code> is interpreted as
  71. <code>a&nbsp;++&nbsp;++&nbsp;+&nbsp;b<!-- /@w --></code>, not as <code>a&nbsp;++&nbsp;+&nbsp;++&nbsp;b<!-- /@w --></code>, even though the
  72. latter tokenization could be part of a valid C program and the former
  73. could not.
  74. </p>
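<p>As a hedged illustration of the greedy rule, the sketch below uses the shorter sequence <code>a+++b</code>, which compiles because it tokenizes as <code>a ++ + b</code>. The names <code>munch_result</code> and <code>munch_demo</code> are hypothetical and do not come from this manual.</p>

```c
/* Hypothetical helper type: holds the sum and the value of a afterwards. */
struct munch_result {
    int sum;
    int a_after;
};

struct munch_result munch_demo(int a, int b)
{
    struct munch_result r;
    r.sum = a+++b;   /* tokens: a ++ + b, that is (a++) + b */
    r.a_after = a;   /* a was post-incremented exactly once */
    return r;
}
```

<p>With <code>a</code> equal to 3 and <code>b</code> equal to 4, the sum uses the old value of <code>a</code>, and <code>a</code> is 4 afterwards. The alternative reading <code>a + ++b</code> would leave <code>a</code> unchanged.</p>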
  75. <p>Once the input file is broken into tokens, the token boundaries never
  76. change, except when the &lsquo;<samp>##</samp>&rsquo; preprocessing operator is used to paste
  77. tokens together. See <a href="Concatenation.html">Concatenation</a>. For example,
  78. </p>
  79. <div class="example">
  80. <pre class="example">#define foo() bar
  81. foo()baz
  82. &rarr; bar baz
  83. <em>not</em>
  84. &rarr; barbaz
  85. </pre></div>
  86. <p>The compiler does not re-tokenize the preprocessor&rsquo;s output. Each
  87. preprocessing token becomes one compiler token.
  88. </p>
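<p>The fixed token boundaries can be observed from C itself. The following is a minimal sketch, assuming the hypothetical macros <code>MINUS</code> and <code>PASTE</code> and the hypothetical functions <code>boundaries_demo</code> and <code>paste_demo</code>, none of which appear in this manual.</p>

```c
#define MINUS -
#define PASTE(a, b) a##b

int boundaries_demo(void)
{
    int x = 5;
    /* Expands to the three tokens  -  -  x  (a double negation).
       The two minus signs never fuse into the single token --. */
    int y = -MINUS x;
    return y * 10 + x;   /* y is 5 and x is still 5 */
}

int paste_demo(void)
{
    /* Only ## changes token boundaries: coun and ter become counter. */
    int PASTE(coun, ter) = 7;
    return counter;
}
```

<p>If the compiler re-tokenized the expanded text, <code>-MINUS x</code> would read as <code>--x</code> and modify <code>x</code>. Because each preprocessing token becomes one compiler token, it does not.</p>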
  89. <span id="index-identifiers"></span>
  90. <p>Preprocessing tokens fall into five broad classes: identifiers,
  91. preprocessing numbers, string literals, punctuators, and other. An
  92. <em>identifier</em> is the same as an identifier in C: any sequence of
  93. letters, digits, or underscores, which begins with a letter or
  94. underscore. Keywords of C have no significance to the preprocessor;
  95. they are ordinary identifiers. You can define a macro whose name is a
  96. keyword, for instance. The only identifier which can be considered a
  97. preprocessing keyword is <code>defined</code>. See <a href="Defined.html">Defined</a>.
  98. </p>
  99. <p>This is mostly true of other languages which use the C preprocessor.
  100. However, a few of the keywords of C++ are significant even in the
  101. preprocessor. See <a href="C_002b_002b-Named-Operators.html">C++ Named Operators</a>.
  102. </p>
  103. <p>In the 1999 C standard, identifiers may contain letters which are not
  104. part of the &ldquo;basic source character set&rdquo;, at the implementation&rsquo;s
  105. discretion (such as accented Latin letters, Greek letters, or Chinese
  106. ideograms). This may be done with an extended character set, or the
  107. &lsquo;<samp>\u</samp>&rsquo; and &lsquo;<samp>\U</samp>&rsquo; escape sequences.
  108. </p>
  109. <p>As an extension, GCC treats &lsquo;<samp>$</samp>&rsquo; as a letter. This is for
  110. compatibility with some systems, such as VMS, where &lsquo;<samp>$</samp>&rsquo; is commonly
  111. used in system-defined function and object names. &lsquo;<samp>$</samp>&rsquo; is not a
  112. letter in strictly conforming mode, or if you specify the <samp>-$</samp>
  113. option. See <a href="Invocation.html">Invocation</a>.
  114. </p>
  115. <span id="index-numbers"></span>
  116. <span id="index-preprocessing-numbers"></span>
  117. <p>A <em>preprocessing number</em> has a rather bizarre definition. The
  118. category includes all the normal integer and floating point constants
  119. one expects of C, but also a number of other things one might not
  120. initially recognize as a number. Formally, preprocessing numbers begin
  121. with an optional period, a required decimal digit, and then continue
  122. with any sequence of letters, digits, underscores, periods, and
  123. exponents. Exponents are the two-character sequences &lsquo;<samp>e+</samp>&rsquo;,
  124. &lsquo;<samp>e-</samp>&rsquo;, &lsquo;<samp>E+</samp>&rsquo;, &lsquo;<samp>E-</samp>&rsquo;, &lsquo;<samp>p+</samp>&rsquo;, &lsquo;<samp>p-</samp>&rsquo;, &lsquo;<samp>P+</samp>&rsquo;, and
  125. &lsquo;<samp>P-</samp>&rsquo;. (The exponents that begin with &lsquo;<samp>p</samp>&rsquo; or &lsquo;<samp>P</samp>&rsquo; are
  126. used for hexadecimal floating-point constants.)
  127. </p>
  128. <p>The purpose of this unusual definition is to isolate the preprocessor
  129. from the full complexity of numeric constants. It does not have to
  130. distinguish between lexically valid and invalid floating-point numbers,
  131. which is complicated. The definition also permits you to split an
  132. identifier at any position and get exactly two tokens, which can then be
  133. pasted back together with the &lsquo;<samp>##</samp>&rsquo; operator.
  134. </p>
  135. <p>It&rsquo;s possible for preprocessing numbers to cause programs to be
  136. misinterpreted. For example, <code>0xE+12</code> is a preprocessing number
  137. which does not translate to any valid numeric constant, and is therefore a
  138. syntax error. It does not mean <code>0xE&nbsp;+&nbsp;12<!-- /@w --></code>, which is what you
  139. might have intended.
  140. </p>
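<p>A short sketch of the usual workaround, under the assumption that the intent was an addition: white space splits the text into separate tokens. The function names <code>spaced_sum</code> and <code>no_exponent_sum</code> are hypothetical.</p>

```c
int spaced_sum(void)
{
    /* Writing 0xE+12 here would form one preprocessing number and fail
       to compile; the spaces yield the three tokens 0xE, + and 12. */
    return 0xE + 12;
}

int no_exponent_sum(void)
{
    /* F+ is not one of the exponent sequences, so the preprocessing
       number ends after 0xF and the + is a separate punctuator. */
    return 0xF+12;
}
```

<p>Only a sign that directly follows <code>e</code>, <code>E</code>, <code>p</code>, or <code>P</code> is absorbed into the number, which is why the second function compiles without any spaces.</p>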
  141. <span id="index-string-literals"></span>
  142. <span id="index-string-constants"></span>
  143. <span id="index-character-constants"></span>
  144. <span id="index-header-file-names"></span>
  145. <p><em>String literals</em> are string constants, character constants, and
  146. header file names (the argument of &lsquo;<samp>#include</samp>&rsquo;).<a id="DOCF2" href="#FOOT2"><sup>2</sup></a> String constants and character
  147. constants are straightforward: <tt>&quot;&hellip;&quot;</tt> or <tt>'&hellip;'</tt>. In
  148. either case embedded quotes should be escaped with a backslash:
  149. <tt>'\''</tt> is the character constant for &lsquo;<samp>'</samp>&rsquo;. There is no limit on
  150. the length of a character constant, but the value of a character
  151. constant that contains more than one character is
  152. implementation-defined. See <a href="Implementation-Details.html">Implementation Details</a>.
  153. </p>
  154. <p>Header file names either look like string constants, <tt>&quot;&hellip;&quot;</tt>, or are
  155. written with angle brackets instead, <tt>&lt;&hellip;&gt;</tt>. In either case,
  156. backslash is an ordinary character. There is no way to escape the
  157. closing quote or angle bracket. The preprocessor looks for the header
  158. file in different places depending on which form you use. See <a href="Include-Operation.html">Include Operation</a>.
  159. </p>
  160. <p>No string literal may extend past the end of a line. You may use continued
  161. lines instead, or string constant concatenation.
  162. </p>
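<p>Both alternatives can be sketched as follows. The function names are hypothetical, and the second line of the continued string deliberately starts in the first column, because any leading white space there would become part of the string.</p>

```c
#include <string.h>

/* Adjacent string constants are concatenated by the compiler. */
const char *joined_by_concatenation(void)
{
    return "one long "
           "string";
}

/* Backslash-newline is deleted before tokenization, so the constant
   below is a single token on one logical line. */
const char *joined_by_continuation(void)
{
    return "one long \
string";
}
```

<p>Both functions return the same text, <code>one long string</code>. Concatenation is often easier to maintain, since it allows normal indentation.</p>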
  163. <span id="index-punctuators"></span>
  164. <span id="index-digraphs"></span>
  165. <span id="index-alternative-tokens"></span>
  166. <p><em>Punctuators</em> are all the usual bits of punctuation which are
  167. meaningful to C and C++. All but three of the punctuation characters in
  168. ASCII are C punctuators. The exceptions are &lsquo;<samp>@</samp>&rsquo;, &lsquo;<samp>$</samp>&rsquo;, and
  169. &lsquo;<samp>`</samp>&rsquo;. In addition, all the two- and three-character operators are
  170. punctuators. There are also six <em>digraphs</em> (the C++ standard
  171. calls them <em>alternative tokens</em>), which are merely alternate ways to spell
  172. other punctuators. This is a second attempt to work around missing
  173. punctuation in obsolete systems. It has no negative side effects,
  174. unlike trigraphs, but does not cover as much ground. The digraphs and
  175. their corresponding normal punctuators are:
  176. </p>
  177. <div class="example">
  178. <pre class="example">Digraph: &lt;% %&gt; &lt;: :&gt; %: %:%:
  179. Punctuator: { } [ ] # ##
  180. </pre></div>
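<p>As a hedged illustration, the small function below is written with digraphs only in place of braces, brackets, and <code>#</code>. It assumes a compiler mode in which digraphs are enabled (they are part of standard C since the 1995 amendment). The names <code>COUNT</code> and <code>digraph_sum</code> are hypothetical.</p>

```c
%:define COUNT 3   /* %: spells # */

int digraph_sum(void)
<%                                        /* <% spells { */
    int values<:COUNT:> = <% 1, 2, 3 %>;  /* <: and :> spell [ and ] */
    int total = 0;
    for (int i = 0; i < COUNT; i++)
        total += values<:i:>;
    return total;
%>                                        /* %> spells } */
```

<p>The function behaves exactly as if it had been written with the ordinary punctuators; here it adds 1, 2, and 3.</p>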
  181. <span id="index-other-tokens"></span>
  182. <p>Any other single byte is considered &ldquo;other&rdquo; and passed on to the
  183. preprocessor&rsquo;s output unchanged. The C compiler will almost certainly
  184. reject source code containing &ldquo;other&rdquo; tokens. In ASCII, the only
  185. &ldquo;other&rdquo; characters are &lsquo;<samp>@</samp>&rsquo;, &lsquo;<samp>$</samp>&rsquo;, &lsquo;<samp>`</samp>&rsquo;, and control
  186. characters other than NUL (all bits zero). (Note that &lsquo;<samp>$</samp>&rsquo; is
  187. normally considered a letter.) All bytes with the high bit set
  188. (numeric range 0x80&ndash;0xFF) that were not successfully interpreted as
  189. part of an extended character in the input encoding are also &ldquo;other&rdquo;
  190. in the present implementation.
  191. </p>
  192. <p>NUL is a special case because of the high probability that its
  193. appearance is accidental, and because it may be invisible to the user
  194. (many terminals do not display NUL at all). Within comments, NULs are
  195. silently ignored, just as any other character would be. In running
  196. text, NUL is considered white space. For example, these two directives
  197. have the same meaning.
  198. </p>
  199. <div class="example">
  200. <pre class="example">#define X^@1
  201. #define X 1
  202. </pre></div>
  203. <p>(where &lsquo;<samp>^@</samp>&rsquo; is ASCII NUL). Within string or character constants,
  204. NULs are preserved. In the latter two cases (running text and string or character constants) the preprocessor emits a
  205. warning message.
  206. </p>
  207. <div class="footnote">
  208. <hr>
  209. <h4 class="footnotes-heading">Footnotes</h4>
  210. <h5><a id="FOOT2" href="#DOCF2">(2)</a></h5>
  211. <p>The C
  212. standard uses the term <em>string literal</em> to refer only to what we are
  213. calling <em>string constants</em>.</p>
  214. </div>
  215. <hr>
  216. <div class="header">
  217. <p>
  218. Next: <a href="The-preprocessing-language.html" accesskey="n" rel="next">The preprocessing language</a>, Previous: <a href="Initial-processing.html" accesskey="p" rel="prev">Initial processing</a>, Up: <a href="Overview.html" accesskey="u" rel="up">Overview</a> &nbsp; [<a href="index.html#SEC_Contents" title="Table of contents" rel="contents">Contents</a>][<a href="Index-of-Directives.html" title="Index" rel="index">Index</a>]</p>
  219. </div>
  220. </body>
  221. </html>