Application of RFC 2231 Encoding to Hypertext Transfer Protocol (HTTP) Header Fields
greenbytes GmbH, Hafenweg 16, Muenster, NW 48155, Germany
julian.reschke@greenbytes.de
http://greenbytes.de/tech/webdav/
By default, message header field parameters in Hypertext Transfer Protocol (HTTP) messages
cannot carry characters outside the ISO-8859-1 character set. RFC 2231
defines an escaping mechanism for use in Multipurpose Internet Mail Extensions
(MIME) headers. This document specifies a profile of that encoding
suitable for use in HTTP.
There are multiple HTTP header fields that already use RFC 2231 encoding
in practice (Content-Disposition) or might use it in the future
(Link). The purpose of this document is to provide a single place where
the generic aspects of RFC 2231 encoding in HTTP header fields are defined.
Distribution of this document is unlimited. Although this is not a work
item of the HTTPbis Working Group, comments should be sent to the
Hypertext Transfer Protocol (HTTP) mailing list at ietf-http-wg@w3.org,
which may be joined by sending a message with subject
"subscribe" to ietf-http-wg-request@w3.org.
Discussions of the HTTPbis Working Group are archived at
.
XML versions, latest edits and the issues list for this document
are available from .
A collection of test cases is available at .
By default, message header field parameters in HTTP messages
cannot carry characters outside the ISO-8859-1 character set. RFC 2231
defines an escaping mechanism for use in MIME headers.
This document specifies a profile of that encoding for use in HTTP.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document
are to be interpreted as described in RFC 2119.
This specification uses the ABNF (Augmented Backus-Naur Form) notation defined in
RFC 5234. The following core rules are included by
reference, as defined in RFC 5234:
ALPHA (letters), DIGIT (decimal 0-9), HEXDIG (hexadecimal 0-9/A-F/a-f) and
LWSP (linear white space).
Note that this specification uses the term "character set" for consistency
with other IETF specifications such as RFC 2277. A more accurate term would be "character
encoding" (a mapping of code points to octet sequences).
RFC 2231 defines several extensions to MIME. The sections below discuss
if and how they apply to HTTP.
In short:
- Parameter Continuations aren't needed,
- Character Set and Language Information are useful, therefore a simple subset is specified, and
- Language Specifications in Encoded Words aren't needed.
RFC 2231 defines a mechanism that
deals with the length limitations that apply to MIME headers. These
limitations do not apply to HTTP.
Thus in HTTP, senders MUST NOT use parameter continuations, and
therefore recipients do not need to support them.
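
As an illustration of the rule above, the following minimal Python sketch shows how a sender-side check might detect parameter names that use RFC 2231 continuation syntax (a name followed by "*", a decimal section number, and optionally another "*"). The helper name is_continuation_name is hypothetical and not part of this specification.

```python
import re

# RFC 2231 continuations look like "title*0", "title*1" or "title*0*".
# A plain extended parameter ("title*") has no section number.
_CONTINUATION = re.compile(r"^[^*]+\*\d+\*?$")


def is_continuation_name(name: str) -> bool:
    """Return True if a parameter name uses RFC 2231 continuation syntax."""
    return _CONTINUATION.match(name) is not None


if __name__ == "__main__":
    for name in ("filename", "filename*", "filename*0", "filename*1*"):
        print(name, is_continuation_name(name))
```

A sender could run such a check before serializing a header field and refuse to emit any parameter for which it returns True.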
RFC 2231 also specifies how to embed
language information into parameter values, and also how to encode
non-ASCII characters, dealing with restrictions both in MIME and HTTP
header parameters.
However, RFC 2231 does not specify a mandatory-to-implement character set,
making it hard for senders to decide which character set to use.
Thus, recipients implementing this specification MUST support the
character sets "ISO-8859-1" and "UTF-8".
Furthermore, RFC 2231 allows leaving out the character set information.
The profile defined by this specification does not allow that.
The syntax for parameters is defined in RFC 2616
(with RFC 2616 implied LWS translated to RFC 5234 LWSP):
This specification extends the grammar to:
Thus, a parameter is either a regular parameter (reg-parameter), as previously
defined in RFC 2616, or an extended
parameter (ext-parameter).
Extended parameters are those where the left-hand side of the assignment
ends with an asterisk character.
The value part of an extended parameter (ext-value) is a token that consists
of three parts: the REQUIRED character set name (charset), the OPTIONAL
language information (language), and a character sequence representing the
actual value (value-chars), separated by single quote
characters. Note that both character set names and
language tags are restricted to the US-ASCII character set, and are matched
case-insensitively.
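
The three-part structure described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names ExtValue and split_ext_value are hypothetical, and the sketch lower-cases the charset and language parts purely to make case-insensitive comparison easy.

```python
from typing import NamedTuple, Optional


class ExtValue(NamedTuple):
    charset: str             # lower-cased; matching is case-insensitive
    language: Optional[str]  # None when the OPTIONAL language part is empty
    value_chars: str         # left percent-encoded by this helper


def split_ext_value(ext_value: str) -> ExtValue:
    """Split charset'language'value-chars at the first two single quotes."""
    parts = ext_value.split("'", 2)
    if len(parts) != 3:
        raise ValueError("expected charset'language'value-chars")
    charset, language, value_chars = parts
    if not charset:
        raise ValueError("the charset part is REQUIRED")
    return ExtValue(charset.lower(), language.lower() or None, value_chars)


if __name__ == "__main__":
    print(split_ext_value("UTF-8'en'%C2%A3%20rates"))
    print(split_ext_value("iso-8859-1''%A3%20rates"))
```

Splitting at most twice keeps the whole remainder in value_chars; decoding that remainder is a separate step.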
Inside the value part, characters not contained in attr-char are
encoded into an octet sequence using the specified character set. That octet
sequence is then percent-encoded as specified in RFC 3986.
Producers MUST NOT use character sets other than "UTF-8"
or "ISO-8859-1".
Extension character sets (ext-charset) are reserved for future use.
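
A producer following the rules above might be sketched in Python as shown below. The function name encode_ext_value is hypothetical, and the set of symbols treated as attr-char here is an assumption made for illustration; the authoritative set is the attr-char production of the grammar. The sketch relies on urllib.parse.quote, which leaves letters, digits, and the listed safe characters untouched and percent-encodes every other octet.

```python
from urllib.parse import quote

# Assumption for this sketch: symbols (besides ALPHA and DIGIT) that may
# appear unencoded. Check the attr-char production before relying on it.
ATTR_CHAR_SYMBOLS = "!#$&+-.^_`|~"


def encode_ext_value(text: str, charset: str = "UTF-8", language: str = "") -> str:
    """Build charset'language'value-chars for an extended parameter."""
    if charset.upper() not in ("UTF-8", "ISO-8859-1"):
        raise ValueError("producers MUST NOT use other character sets")
    # errors="strict" makes characters outside the charset raise an error
    # instead of being silently replaced.
    encoded = quote(text, safe=ATTR_CHAR_SYMBOLS, encoding=charset, errors="strict")
    return f"{charset}'{language}'{encoded}"


if __name__ == "__main__":
    print(encode_ext_value("\u00a3 rates"))
    print(encode_ext_value("\u00a3 rates", "ISO-8859-1", "en"))
```

Note that the same text yields different octet sequences, and therefore different percent escapes, depending on the character set chosen.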
Note: recipients should be prepared to handle encoding
errors, such as malformed or incomplete percent escape sequences, or
non-decodable octet sequences, in a robust manner. This specification
does not mandate any specific behavior; for instance, the following
strategies are all acceptable:
- ignoring the parameter,
- stripping a non-decodable octet sequence,
- substituting a non-decodable octet sequence by a replacement character, such as the Unicode character U+FFFD (Replacement Character).
Note: the <mime-charset> ABNF defined here differs from
the one in RFC 2978 in that it does
not allow the single quote character (see also RFC Editor Errata ID 1912). In practice, no character set names
using that character have been registered at the time of this writing.
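
The error-handling strategies listed in the first note above can be sketched in Python as follows. This is one possible recipient behavior, not a mandated one. The function name decode_value_chars and the strategy labels are hypothetical, and the sketch makes the simplifying assumption that a malformed percent escape always causes the parameter to be ignored.

```python
import re
from typing import Optional
from urllib.parse import unquote_to_bytes

# A "%" that is not followed by exactly two hex digits is malformed.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")

# Map each strategy label to a Python codec error handler.
_CODEC_ERRORS = {"ignore-parameter": "strict", "strip": "ignore", "replace": "replace"}


def decode_value_chars(value_chars: str, charset: str,
                       strategy: str = "replace") -> Optional[str]:
    """Decode value-chars; return None when the parameter should be ignored."""
    if _BAD_ESCAPE.search(value_chars):
        return None  # malformed or incomplete percent escape sequence
    octets = unquote_to_bytes(value_chars)
    try:
        return octets.decode(charset, errors=_CODEC_ERRORS[strategy])
    except (UnicodeDecodeError, LookupError):
        # Non-decodable octets (strict mode) or an unknown charset name.
        return None


if __name__ == "__main__":
    for strategy in _CODEC_ERRORS:
        print(strategy, repr(decode_value_chars("%A3x", "utf-8", strategy)))
```

With the "replace" strategy, the non-decodable octet is substituted by U+FFFD; with "strip" it is dropped; with "ignore-parameter" the caller receives None and can discard the parameter.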
RFC 2231 also extends the encoding
defined in RFC 2047 to support language specification
in encoded words.
Although the HTTP/1.1 specification does refer to RFC 2047,
it's not clear to which header field exactly it applies, and whether it is
implemented in practice.
Thus, the RFC 2231 profile defined by this specification does not include
this feature.
Specifications of HTTP header fields that use the extensions defined
in this specification should clearly
state that. A simple way to achieve this is to normatively reference
this specification, and to include the ext-value
production into the ABNF for that header field.
Note to RFC Editor: in the figure above, please replace "xxxx" by the
RFC number assigned to this specification.
RFC 2277 requires that protocol
elements containing text can carry language information. Thus, the ext-value
production should always be used when the parameter value is of textual
nature.
Furthermore, the extension should also be used whenever the parameter value
needs to carry characters not present in the US-ASCII
character set (note that it would be unacceptable to define a new parameter that
would be restricted to a subset of the Unicode character set).
Header specifications that include parameters should also specify whether
same-named parameters can occur multiple times. If repetitions are not
allowed (and this is believed to be the common case), the specification
should state whether the regular or the extended syntax takes precedence.
In the latter case, producers can send both formats
without breaking recipients that do not understand the extended syntax.
Note: at the time of this writing, many implementations fail
to ignore the form they do not understand, or prioritize the ASCII form
even when the extended syntax is present.
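
The following Python sketch illustrates a recipient for a header field whose specification (hypothetically) states that the extended syntax takes precedence over the regular one. The function name effective_value and the dictionary-based input are assumptions made for illustration; parsing the header field into raw name/value pairs is out of scope here.

```python
from typing import Dict, Optional
from urllib.parse import unquote


def effective_value(params: Dict[str, str], name: str) -> Optional[str]:
    """Prefer 'name*' (extended) over 'name' (regular) when it is usable.

    params maps raw parameter names, such as 'title' and 'title*',
    to their raw values as they appeared in the header field.
    """
    ext = params.get(name + "*")
    if ext is not None:
        parts = ext.split("'", 2)
        if len(parts) == 3 and parts[0].lower() in ("utf-8", "iso-8859-1"):
            try:
                return unquote(parts[2], encoding=parts[0], errors="strict")
            except UnicodeDecodeError:
                pass  # unusable extended form: fall back to the regular one
    return params.get(name)


if __name__ == "__main__":
    params = {"title": "Euro rates", "title*": "UTF-8''%E2%82%AC%20rates"}
    print(effective_value(params, "title"))
```

A producer can then send both forms: recipients that understand the extended syntax use it, and others are expected to ignore it and read the regular parameter.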
It is expected that in many cases, internationalization of parameters in response headers
is implemented using server-driven content negotiation
based on the Accept-Language header.
However, the format described in this specification also allows the use of
multiple instances providing multiple languages in a single header.
Specifications that want to take advantage of this should clearly specify
the expected processing by the recipient.
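
As one possible example of such recipient processing, the Python sketch below selects among several extended values by language. It is an assumption-laden illustration: the function name pick_by_language is hypothetical, language tags are compared by exact case-insensitive match only (no language-range matching), and the fallback to the first usable value is a design choice of this sketch rather than a requirement.

```python
from typing import List, Optional, Tuple
from urllib.parse import unquote


def pick_by_language(ext_values: List[str], preferred: List[str]) -> Optional[str]:
    """Return the decoded value whose language matches the first preference."""
    decoded: List[Tuple[str, str]] = []
    for ext in ext_values:
        parts = ext.split("'", 2)
        if len(parts) != 3:
            continue  # not charset'language'value-chars
        charset, language, value_chars = parts
        if charset.lower() not in ("utf-8", "iso-8859-1"):
            continue  # unsupported character set
        decoded.append((language.lower(), unquote(value_chars, encoding=charset)))
    for wanted in preferred:
        for language, text in decoded:
            if language == wanted.lower():
                return text
    # No preferred language available: fall back to the first usable value.
    return decoded[0][1] if decoded else None


if __name__ == "__main__":
    values = ["UTF-8'en'Rates", "UTF-8'de'Kurse%20f%C3%BCr%20heute"]
    print(pick_by_language(values, ["de", "en"]))
```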
This document does not discuss security issues and is not believed to raise
any security issues not already endemic in HTTP.
There are no IANA Considerations related to this specification.
Thanks to Martin Duerst and Frank Ellermann for help figuring out ABNF details, and to
Roar Lauritzsen for implementer's feedback.
Key words for use in RFCs to Indicate Requirement Levels. Harvard University (sob@harvard.edu).
Hypertext Transfer Protocol -- HTTP/1.1. University of California, Irvine (fielding@ics.uci.edu); W3C (jg@w3.org); Compaq Computer Corporation (mogul@wrl.dec.com); MIT Laboratory for Computer Science (frystyk@w3.org); Xerox Corporation (masinter@parc.xerox.com); Microsoft Corporation (paulle@microsoft.com); W3C (timbl@w3.org).
UTF-8, a transformation format of ISO 10646. Alis Technologies (fyergeau@alis.com).
Tags for Identifying Languages. Lab126 (addison@inter-locale.com); Google (mark.davis@google.com).
Augmented BNF for Syntax Specifications: ABNF. Brandenburg InternetWorking (dcrocker@bbiw.net); THUS plc. (paul.overell@thus.net).
Information technology -- 8-bit single-byte coded graphic character sets -- Part 1: Latin alphabet No. 1. International Organization for Standardization.
MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text. University of Tennessee (moore@cs.utk.edu).
MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations. Innosoft International, Inc. (ned.freed@innosoft.com); University of Tennessee (moore@cs.utk.edu).
IETF Policy on Character Sets and Languages. UNINETT (Harald.T.Alvestrand@uninett.no).
Uniform Resource Identifier (URI): Generic Syntax. World Wide Web Consortium (timbl@w3.org, http://www.w3.org/People/Berners-Lee/); Day Software (fielding@gbiv.com, http://roy.gbiv.com/); Adobe Systems Incorporated (LMM@acm.org, http://larry.masinter.net/).
Coded Character Set -- 7-bit American Standard Code for Information Interchange. American National Standards Institute.
IANA Charset Registration Procedures.
Problems with the internationalization of the HTTP Content-Disposition header
field have been known for many years (see test cases at
).
During IETF 72,
the HTTPbis Working Group briefly discussed how to deal with (1) the underspecification
of Content-Disposition, and (2) its internationalization aspects.
Back then, there was rough consensus in the room to move the definition
into a separate draft.
This specification addresses problem (2), by defining a simple subset of
the encoding format defined in RFC 2231. A separate specification,
draft-reschke-rfc2183-in-http, is planned to address problem (1). Note that
this approach was chosen because Content-Disposition is just one example of
an HTTP header field using this kind of encoding. Another example
is the currently proposed Link header field (draft-nottingham-http-link-header).
This document is planned to be published on the IETF Standards Track, so that
other standards-track level documents can depend on it, such as the new
specification of Content-Disposition, or potentially future revisions of
the HTTP Link Header specification.
Also note that this document specifies a proper subset of the extensions
defined in RFC 2231, but does not normatively refer to it. Thus, RFC 2231
can be revised separately, should the email community decide to do so.
Use RFC5234-style ABNF, closer to the one used in RFC 2231.
Make RFC 2231 dependency informative, so this specification can evolve
independently.
Explain the ABNF in prose.
Remove unneeded RFC5137 notation (code point vs character).
Add and resolve issues "charset", "repeats"
and "rfc4646".
Add and resolve issue "charsetmatch".
Add and resolve issues "badseq" and
"tokenquotcharset".
Say "header field" instead of "header" in the context of HTTP.
Add an appendix discussing document history and future plans, to be removed before publication.