GFF3 character encoding

From SO Wiki

I clarified the character encoding and escaping issues as follows:

Description of the Format

GFF3 files are 9-column, tab-delimited plain text files. Literal tabs, newlines, carriage returns, percent signs, and control characters must be encoded using RFC 3986 Percent-Encoding as described below; no other characters may be encoded. The file contents may include any character in the set supported by the operating environment, although for portability with other systems, use of Latin-1 or Unicode is recommended.

The format consists of 9 columns, separated by tabs (NOT spaces). The following characters must be escaped using URL escaping conventions (%XX hex codes):

  • tab (%09)
  • newline (%0A)
  • carriage return (%0D)
  • % percent (%25)
  • control characters (%00 through %1F, %7F)
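As an illustration of the list above, here is a minimal Python sketch of an escaping routine. The function name `escape_field` is hypothetical and not part of any GFF3 library; the sketch assumes the value is already a decoded text string and encodes only the characters listed above, leaving everything else (including spaces) untouched.

```python
import re

# Characters from the list above: control characters 0x00-0x1F (which
# include tab, newline, and carriage return), 0x7F, and the percent sign.
_MUST_ESCAPE = re.compile(r"[\x00-\x1f\x7f%]")


def escape_field(value: str) -> str:
    """Percent-encode only the characters that must be escaped in any column.

    Hypothetical helper for illustration; not taken from the GFF3 spec.
    """
    return _MUST_ESCAPE.sub(lambda m: "%{:02X}".format(ord(m.group())), value)


if __name__ == "__main__":
    print(escape_field("50% identity\tsecond part"))
```

Escaping the percent sign in the same pass as the other characters avoids double-encoding the "%" that the escapes themselves introduce.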

In addition, the following characters have reserved meanings in column 9 and must be escaped when used in other contexts:

  • ; semicolon (%3B)
  • = equals (%3D)
  • & ampersand (%26)
  • , comma (%2C)
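A possible way to apply the column 9 rules when writing attributes is sketched below in Python. The names `escape_attribute` and `format_attributes` are hypothetical, and the sketch assumes the usual `tag=value1,value2;tag2=value` layout for column 9, which this excerpt does not spell out.

```python
import re

# Characters that must be escaped everywhere (controls, 0x7F, '%') plus the
# four characters with reserved meanings in column 9: ';', '=', '&', ','.
_COL9_ESCAPE = re.compile(r"[\x00-\x1f\x7f%;=&,]")


def escape_attribute(text: str) -> str:
    """Percent-encode a single tag or value destined for column 9."""
    return _COL9_ESCAPE.sub(lambda m: "%{:02X}".format(ord(m.group())), text)


def format_attributes(pairs):
    """Join (tag, [values]) pairs into one column 9 string.

    Assumed layout: values are comma-separated, pairs are
    semicolon-separated. Separators are added after escaping, so they
    stay literal while the same characters inside values are encoded.
    """
    return ";".join(
        "{}={}".format(
            escape_attribute(tag),
            ",".join(escape_attribute(v) for v in values),
        )
        for tag, values in pairs
    )


if __name__ == "__main__":
    print(format_attributes([("Name", ["a;b"]), ("Note", ["x,y", "z"])]))
```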

...

Note that unescaped spaces are allowed within fields, meaning that parsers must split on tabs, not spaces. Use of the "+" (plus) character to encode spaces is deprecated from early versions of the spec and is no longer supported.
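The parsing side of these two rules might look like the following Python sketch. The function name `parse_line` is hypothetical. The sketch uses `urllib.parse.unquote` from the standard library, which decodes %XX escapes but leaves "+" as a literal plus, unlike `unquote_plus`. Column 9 is returned undecoded on the assumption that a caller will split it on its reserved characters before decoding each piece.

```python
from typing import List
from urllib.parse import unquote


def parse_line(line: str) -> List[str]:
    """Split one GFF3 feature line on tabs and decode columns 1-8.

    Hypothetical helper for illustration. Spaces inside fields are kept
    as-is because only tabs separate columns.
    """
    columns = line.rstrip("\r\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))
    # unquote() does not turn '+' into a space, so a literal plus (for
    # example the strand column) survives unchanged.
    decoded = [unquote(col) for col in columns[:8]]
    # Column 9 is left raw; decode it only after splitting on ';', '=', ','.
    return decoded + [columns[8]]


if __name__ == "__main__":
    example = "chr1\tmy source\tgene\t1\t100\t.\t+\t.\tID=g1;Note=a%3Bb\n"
    print(parse_line(example))
```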
