rfc9839v1.txt | rfc9839.txt | |||
---|---|---|---|---|
skipping to change at line 102 ¶ | skipping to change at line 102 ¶ | |||
subset. The intended use is to serve as a convenient target for | subset. The intended use is to serve as a convenient target for | |||
cross-reference from other specifications whose authors wish to | cross-reference from other specifications whose authors wish to | |||
exclude problematic code points from the data format or protocol | exclude problematic code points from the data format or protocol | |||
being specified. | being specified. | |||
Note that this document only provides guidance on avoiding the use of | Note that this document only provides guidance on avoiding the use of | |||
code points that cannot be used for interoperable interchange of | code points that cannot be used for interoperable interchange of | |||
Unicode textual data. Dealing with strings, particularly in the | Unicode textual data. Dealing with strings, particularly in the | |||
context of user interfaces, requires addressing language, text | context of user interfaces, requires addressing language, text | |||
rendering direction, alternate representations of the same abstract | rendering direction, alternate representations of the same abstract | |||
character, and so on. These issues, among many others, led to many | character, and so on. These issues, among many others, led to | |||
efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | efforts by the Unicode Consortium, efforts by the IETF such as [IDN] | |||
and [PRECIS], and internationalization efforts by W3C such as | and [PRECIS], and internationalization efforts by W3C such as | |||
[W3C-CHAR]. The results of these efforts should be consulted by | [W3C-CHAR]. The results of these efforts should be consulted by | |||
anyone engaging in such work. | anyone engaging in such work. | |||
1.1. Notation | 1.1. Notation | |||
In this document, the numeric values assigned to Unicode characters | In this document, the numeric values assigned to Unicode characters | |||
are provided in hexadecimal. This document uses Unicode's standard | are provided in hexadecimal. This document uses Unicode's standard | |||
notation of "U+" followed by four or more hexadecimal digits. For | notation of "U+" followed by four or more hexadecimal digits. For | |||
skipping to change at line 143 ¶ | skipping to change at line 143 ¶ | |||
storage systems and to specify allowed subsets in specifications. | storage systems and to specify allowed subsets in specifications. | |||
There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0 | |||
(2024), about 155,000 have been assigned to characters. Since | (2024), about 155,000 have been assigned to characters. Since | |||
unassigned code points regularly become assigned when new characters | unassigned code points regularly become assigned when new characters | |||
are added to Unicode, it is usually not a good practice to specify | are added to Unicode, it is usually not a good practice to specify | |||
that unassigned code points should be avoided. | that unassigned code points should be avoided. | |||
2.1. Encoding Forms | 2.1. Encoding Forms | |||
Unicode describes a variety of encoding forms, ways to marshal code | Unicode describes a variety of encoding forms that can be used to | |||
points into byte sequences. A survey of these is beyond the scope of | marshal code points into byte sequences. A survey of these is beyond | |||
this document. However, it is useful to note that "UTF-16" | the scope of this document. However, it is useful to note that "UTF- | |||
represents each code point with one or two 16-bit chunks, while "UTF- | 16" represents each code point with one or two 16-bit chunks, while | |||
8" uses variable-length byte sequences [RFC3629]. | "UTF-8" uses variable-length byte sequences [RFC3629]. | |||
The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277], | |||
says "Protocols MUST be able to use the UTF-8 charset", which becomes | says "Protocols MUST be able to use the UTF-8 charset", which becomes | |||
a mandate to use UTF-8 for any protocol or data format that specifies | a mandate to use UTF-8 for any protocol or data format that specifies | |||
a single encoding form. UTF-8 is widely used for interoperable data | a single encoding form. UTF-8 is widely used for interoperable data | |||
formats such as JSON, YAML, CBOR, and XML. | formats such as JSON, YAML, CBOR, and XML. | |||
2.2. Problematic Code Points | 2.2. Problematic Code Points | |||
This section classifies all the code points that can never represent | This section classifies all the code points that can never represent | |||
skipping to change at line 235 ¶ | skipping to change at line 235 ¶ | |||
Code points are organized into 17 "planes", each containing 2^16 code | Code points are organized into 17 "planes", each containing 2^16 code | |||
points. The last two code points in each plane are noncharacters: | points. The last two code points in each plane are noncharacters: | |||
U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to | |||
U+10FFFE, U+10FFFF. | U+10FFFE, U+10FFFF. | |||
The code points in the range U+FDD0-U+FDEF are noncharacters. | The code points in the range U+FDD0-U+FDEF are noncharacters. | |||
3. Dealing with Problematic Code Points | 3. Dealing with Problematic Code Points | |||
[RFC9413], "Maintaining Robust Protocols", provides a thorough | "Maintaining Robust Protocols" [RFC9413] provides a thorough | |||
discussion of strategies for dealing with issues in input data. | discussion of strategies for dealing with issues in input data. | |||
Different types of problematic code points cause different issues. | Different types of problematic code points cause different issues. | |||
Noncharacters and legacy controls are unlikely to cause software | Noncharacters and legacy controls are unlikely to cause software | |||
failures, but they cannot usefully be displayed to humans, and they | failures, but they cannot usefully be displayed to humans, and they | |||
can be used in attacks based on attempting to display text that | can be used in attacks based on attempting to display text that | |||
includes them. | includes them. | |||
The behavior of software that encounters surrogates is unpredictable | The behavior of software that encounters surrogates is unpredictable | |||
and differs among programming-language implementations, even between | and differs among programming-language implementations, even between | |||
End of changes. 3 change blocks. | ||||
7 lines changed or deleted | 7 lines changed or added | |||
This html diff was produced by rfcdiff 1.48. |
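
The noncharacter rules quoted in the diff above (the last two code points of each of the 17 planes, plus the range U+FDD0-U+FDEF) can be sketched as a small Python check. This is a minimal illustration under stated assumptions, not part of RFC 9839 or of the diff itself; the names `is_noncharacter` and `find_noncharacters` are hypothetical and chosen only for this example.

```python
# Illustrative sketch only; function and constant names are assumptions,
# not identifiers defined by RFC 9839.

PLANE_COUNT = 17        # Unicode code points are organized into 17 planes
PLANE_SIZE = 1 << 16    # each plane contains 2^16 code points


def is_noncharacter(cp: int) -> bool:
    """Return True if the code point `cp` is a Unicode noncharacter."""
    if not 0 <= cp < PLANE_COUNT * PLANE_SIZE:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # The contiguous block U+FDD0-U+FDEF consists of noncharacters.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane (U+xFFFE and U+xFFFF).
    return (cp & 0xFFFF) >= 0xFFFE


def find_noncharacters(text: str) -> list:
    """List (index, "U+XXXX") pairs for each noncharacter found in `text`."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if is_noncharacter(ord(ch))
    ]
```

A specification or validator could call `find_noncharacters` on incoming text and then reject, replace, or report the offending positions; which of those responses is appropriate depends on the protocol and is not decided by this sketch.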