Diff: rfc9839v1.txt - rfc9839.txt

	rfc9839v1.txt	rfc9839.txt

	skipping to change at line 102	skipping to change at line 102
	subset. The intended use is to serve as a convenient target for	subset. The intended use is to serve as a convenient target for
	cross-reference from other specifications whose authors wish to	cross-reference from other specifications whose authors wish to
	exclude problematic code points from the data format or protocol	exclude problematic code points from the data format or protocol
	being specified.	being specified.

	Note that this document only provides guidance on avoiding the use of	Note that this document only provides guidance on avoiding the use of
	code points that cannot be used for interoperable interchange of	code points that cannot be used for interoperable interchange of
	Unicode textual data. Dealing with strings, particularly in the	Unicode textual data. Dealing with strings, particularly in the
	context of user interfaces, requires addressing language, text	context of user interfaces, requires addressing language, text
	rendering direction, alternate representations of the same abstract	rendering direction, alternate representations of the same abstract

	character, and so on. These issues, among many others, led to many	character, and so on. These issues, among many others, led to
	efforts by the Unicode Consortium, efforts by the IETF such as [IDN]	efforts by the Unicode Consortium, efforts by the IETF such as [IDN]
	and [PRECIS], and internationalization efforts by W3C such as	and [PRECIS], and internationalization efforts by W3C such as
	[W3C-CHAR]. The results of these efforts should be consulted by	[W3C-CHAR]. The results of these efforts should be consulted by
	anyone engaging in such work.	anyone engaging in such work.

	1.1. Notation	1.1. Notation

	In this document, the numeric values assigned to Unicode characters	In this document, the numeric values assigned to Unicode characters
	are provided in hexadecimal. This document uses Unicode's standard	are provided in hexadecimal. This document uses Unicode's standard
	notation of "U+" followed by four or more hexadecimal digits. For	notation of "U+" followed by four or more hexadecimal digits. For

	skipping to change at line 143	skipping to change at line 143
	storage systems and to specify allowed subsets in specifications.	storage systems and to specify allowed subsets in specifications.

	There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0	There are 1,114,112 (17 * 2^16) code points; as of Unicode 16.0
	(2024), about 155,000 have been assigned to characters. Since	(2024), about 155,000 have been assigned to characters. Since
	unassigned code points regularly become assigned when new characters	unassigned code points regularly become assigned when new characters
	are added to Unicode, it is usually not a good practice to specify	are added to Unicode, it is usually not a good practice to specify
	that unassigned code points should be avoided.	that unassigned code points should be avoided.

	2.1. Encoding Forms	2.1. Encoding Forms


	Unicode describes a variety of encoding forms, ways to marshal code	Unicode describes a variety of encoding forms that can be used to
	points into byte sequences. A survey of these is beyond the scope of	marshal code points into byte sequences. A survey of these is beyond
	this document. However, it is useful to note that "UTF-16"	the scope of this document. However, it is useful to note that "UTF-
	represents each code point with one or two 16-bit chunks, while "UTF-	16" represents each code point with one or two 16-bit chunks, while
	8" uses variable-length byte sequences [RFC3629].	"UTF-8" uses variable-length byte sequences [RFC3629].
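Editorial illustration (not part of either draft): the "one or two 16-bit chunks" and "variable-length byte sequences" statements above can be sketched in Python. The function names utf16_units and utf8_length are hypothetical, chosen only for this sketch; the arithmetic follows the standard UTF-16 surrogate-pair construction, and the UTF-8 length is obtained from Python's built-in encoder rather than computed by hand.

```python
def utf16_units(cp: int) -> list[int]:
    """Return the 16-bit code units UTF-16 uses for one code point.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x10000:
        # Code points in the first plane fit in a single 16-bit unit.
        return [cp]
    # Higher planes are split into a high and a low surrogate,
    # each carrying 10 bits of the offset from 0x10000.
    offset = cp - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def utf8_length(cp: int) -> int:
    """Return how many bytes UTF-8 uses for one code point (1 to 4)."""
    # Python's encoder raises UnicodeEncodeError for surrogates.
    return len(chr(cp).encode("utf-8"))


if __name__ == "__main__":
    for cp in (0x41, 0xE9, 0x20AC, 0x1F600):
        units = " ".join(f"{u:04X}" for u in utf16_units(cp))
        print(f"U+{cp:04X}: UTF-16 units [{units}], UTF-8 bytes {utf8_length(cp)}")
```

Under these assumptions, a code point below U+10000 needs one UTF-16 unit and a code point above it needs two, while the UTF-8 length grows from one to four bytes as the code point value increases.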

	The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277],	The "IETF Policy on Character Sets and Languages", BCP 18 [RFC2277],
	says "Protocols MUST be able to use the UTF-8 charset", which becomes	says "Protocols MUST be able to use the UTF-8 charset", which becomes
	a mandate to use UTF-8 for any protocol or data format that specifies	a mandate to use UTF-8 for any protocol or data format that specifies
	a single encoding form. UTF-8 is widely used for interoperable data	a single encoding form. UTF-8 is widely used for interoperable data
	formats such as JSON, YAML, CBOR, and XML.	formats such as JSON, YAML, CBOR, and XML.

	2.2. Problematic Code Points	2.2. Problematic Code Points

	This section classifies all the code points that can never represent	This section classifies all the code points that can never represent

	skipping to change at line 235	skipping to change at line 235

	Code points are organized into 17 "planes", each containing 2^16 code	Code points are organized into 17 "planes", each containing 2^16 code
	points. The last two code points in each plane are noncharacters:	points. The last two code points in each plane are noncharacters:
	U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to	U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, and so on, up to
	U+10FFFE, U+10FFFF.	U+10FFFE, U+10FFFF.

	The code points in the range U+FDD0-U+FDEF are noncharacters.	The code points in the range U+FDD0-U+FDEF are noncharacters.
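Editorial illustration (not part of either draft): the two noncharacter rules quoted above (the last two code points of each of the 17 planes, plus the range U+FDD0-U+FDEF) can be expressed as a small predicate. This is a minimal sketch; the name is_noncharacter is hypothetical and the function is not taken from the RFC or from any library.

```python
def is_noncharacter(cp: int) -> bool:
    """Report whether a code point is a Unicode noncharacter.

    Hypothetical helper for illustration only.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The contiguous block of 32 noncharacters in the first plane.
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of every plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFF) >= 0xFFFE


if __name__ == "__main__":
    # 17 planes * 2 code points + 32 in U+FDD0-U+FDEF.
    total = sum(is_noncharacter(cp) for cp in range(0x110000))
    print(total)
```

Masking with 0xFFFF isolates a code point's position within its plane, so a single comparison covers all 17 planes without listing each pair.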

	3. Dealing with Problematic Code Points	3. Dealing with Problematic Code Points


	[RFC9413], "Maintaining Robust Protocols", provides a thorough	"Maintaining Robust Protocols" [RFC9413] provides a thorough
	discussion of strategies for dealing with issues in input data.	discussion of strategies for dealing with issues in input data.

	Different types of problematic code points cause different issues.	Different types of problematic code points cause different issues.
	Noncharacters and legacy controls are unlikely to cause software	Noncharacters and legacy controls are unlikely to cause software
	failures, but they cannot usefully be displayed to humans, and they	failures, but they cannot usefully be displayed to humans, and they
	can be used in attacks based on attempting to display text that	can be used in attacks based on attempting to display text that
	includes them.	includes them.

	The behavior of software that encounters surrogates is unpredictable	The behavior of software that encounters surrogates is unpredictable
	and differs among programming-language implementations, even between	and differs among programming-language implementations, even between

End of changes. 3 change blocks.
	7 lines changed or deleted	7 lines changed or added
This html diff was produced by rfcdiff 1.48.