28 Jul 2003 - First Revision
28 Jul 2003 - Mentioned Joe Mason's request for vertical text
30 Jul 2003 - Added gestalt_LineInput, title case
30 Jul 2003 - Combining Marks, Text Grid
 2 Aug 2003 - right-to-left and justification interaction

 * Rationale *

  People may want to write IF games in languages which contain
  characters not in Latin-1.  The Z-machine and TADS3 both contain
  some support for Unicode, but it was previously not available in
  Glk.

  There have been many ideas about how to add Unicode support to Glk
  such as code pages, UTF-8, and UCS-4.  Code pages have historically
  been a solution to the i18n problem.  But they are awkward and don't
  work well for languages with thousands of glyphs.  UTF-8 has an
  advantage of being relatively compact for text which is mostly
  ASCII.  But text in Glulx is Huffman-compressed anyway, which negates
  this advantage, and the variable-length character encoding is not
  trivial to work with.  UCS-4 stores each character as a 31-bit value
  in a 32-bit word, in the native byte ordering.  Each character is
  thus one glui32.  This is simple, and it works.
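
  To illustrate why the variable-length encoding is less convenient
  than one glui32 per character, here is a minimal sketch of decoding
  a single UTF-8 sequence.  The function name utf8_decode_one and the
  glui32 typedef are stand-ins for this example only; nothing like
  this is part of the proposed API.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Decode one UTF-8 sequence of up to four bytes from s (n bytes
   available) into *out.  Returns the number of bytes consumed, or 0
   if the input is truncated or malformed.  Overlong forms and
   surrogates are not rejected in this sketch. */
static size_t utf8_decode_one(const unsigned char *s, size_t n, glui32 *out)
{
    size_t need, i;
    glui32 ch;

    if (n == 0)
        return 0;
    if (s[0] < 0x80) {              /* one byte: plain ASCII */
        *out = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)      { need = 2; ch = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; ch = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; ch = s[0] & 0x07; }
    else
        return 0;                   /* stray continuation or bad lead byte */
    if (n < need)
        return 0;
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80)  /* every later byte must be 10xxxxxx */
            return 0;
        ch = (ch << 6) | (s[i] & 0x3F);
    }
    *out = ch;
    return need;
}
```

  With UCS-4 none of this is needed: the n-th character of a buffer
  is simply buf[n].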

  The Unicode system for Glk is designed to have several functions
  which parallel the Latin-1 support functions it already has.
  Functions which take Unicode are suffixed with _ucs4 to distinguish
  them from the Latin-1 functions.

  Disclaimer: I don't have any experience reading/writing texts in
  languages which use non-Latin alphabets.  The text here is based on
  reading about Unicode from the web.


  Gratitude goes to Andrew Plotkin and Iain Merrick for giving
  suggestions and filtering out some of my bad ideas.  Much of this
  document is highly derivative of Plotkin's Glk Specification.


 * Output Functions *

void glk_put_char_ucs4(glui32 ch);

  This prints one Unicode character to the current stream.

void glk_put_string_ucs4(glui32 *s);

  This prints a string of Unicode characters to the current stream.  A
  string ends on a glui32 whose value is 0.  Its use is deprecated,
  and it is not currently available from Glulx, as the dispatch layer
  lacks a type for "a zero-terminated array of glui32."

void glk_put_buffer_ucs4(glui32 *buf, glui32 len);

  This prints a block of Unicode characters to the current stream.
  It is exactly equivalent to:

    for (i = 0; i < len; i++)
        glk_put_char_ucs4(buf[i]);
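
  A program which holds a zero-terminated string can still use the
  buffer call by measuring the string first.  The helper below is a
  hypothetical convenience (ucs4_strlen is not part of the proposed
  API), and the glui32 typedef stands in for the one in glk.h.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Count the characters before the terminating zero glui32. */
static glui32 ucs4_strlen(const glui32 *s)
{
    glui32 n = 0;

    while (s[n] != 0)
        n++;
    return n;
}
```

  The string s could then be printed with
  glk_put_buffer_ucs4(s, ucs4_strlen(s)).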

void glk_put_char_stream_ucs4(strid_t str, glui32 ch);
void glk_put_string_stream_ucs4(strid_t str, glui32 *s);
void glk_put_buffer_stream_ucs4(strid_t str, glui32 *buf, glui32 len);

  These are the same as the above functions, but they specify a stream
  to print to.


 * Input Functions *

glsi32 glk_get_char_stream_ucs4(strid_t str);

  Reads one character from the given stream.  The result will be
  between 0 and 0x7fffffff.  If the end of the stream has been
  reached, the result will be -1.

glui32 glk_get_buffer_stream_ucs4(strid_t str, glui32 *buf, glui32 len);

  This reads len Unicode characters from the given stream, unless the end
  of the stream is reached first.  No terminal null is placed in the
  buffer.  It returns the number of Unicode characters actually read.

glui32 glk_get_line_stream_ucs4(strid_t str, glui32 *buf, glui32 len);

  This reads Unicode characters from the given stream, until either
  len-1 Unicode characters have been read or a newline has been read.
  It then puts a terminal null (a zero glui32) character on the end.
  It returns the number of Unicode characters actually read, including
  the newline (if there is one) but not including the terminal null.
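
  The counting rules for glk_get_line_stream_ucs4 can be sketched
  with a small model which reads from an array instead of a real
  stream.  The mem_stream type and model_get_line_ucs4 are
  hypothetical names for this illustration; a library's own
  implementation may look quite different.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* A toy read-only stream over an array of characters. */
typedef struct {
    const glui32 *data;
    size_t len;
    size_t pos;
} mem_stream;

/* Read up to len-1 characters or through the first newline, then
   store a terminal null.  Returns the number of characters read,
   counting the newline but not the null. */
static glui32 model_get_line_ucs4(mem_stream *str, glui32 *buf, glui32 len)
{
    glui32 count = 0;

    if (len == 0)
        return 0;                   /* no room even for the null */
    while (count < len - 1 && str->pos < str->len) {
        glui32 ch = str->data[str->pos++];
        buf[count++] = ch;
        if (ch == '\n')
            break;
    }
    buf[count] = 0;
    return count;
}
```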


 * Unicode Event Requests *

void glk_request_line_event_ucs4(winid_t win, glui32 *buf,
			         glui32 maxlen, glui32 initlen);

  This requests line input from a text buffer or text grid window,
  storing the result at buf in Unicode.  The text returned will have
  been normalized to Unicode form C.  You may not request Unicode line
  events from a window which has a pending request for any kind of
  character or line input.

  This event may be canceled by calling glk_cancel_line_event.

  The event returned from a Unicode line request has type
  evtype_LineInput and the same format as that which would have been
  returned from a Latin-1 line request.

  You can test whether an implementation allows line input of a given
  Unicode character by using glk_gestalt(gestalt_LineInput, ch).


void glk_request_char_event_ucs4(winid_t win);

  This requests Unicode character input from a text buffer or text
  grid window.  You may not request Unicode character events from a
  window which has a pending request for any kind of character or
  line input.

  This event may be canceled by calling glk_cancel_char_event.

  The event returned from a Unicode character request has type
  evtype_CharInput and the same format as that which would have been
  returned from a Latin-1 character request.  However, the value of
  a character may now range from 0 to 0x7fffffff (special key codes
  all have the top bit set, and so will not conflict).
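
  Because of this, a program can tell a special key from a Unicode
  character with a single mask.  The helper name is_special_key is
  hypothetical; the test itself follows directly from the ranges
  given above.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Nonzero if val is a special key code (top bit set) rather than a
   Unicode character in the range 0 to 0x7fffffff. */
static int is_special_key(glui32 val)
{
    return (val & 0x80000000u) != 0;
}
```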

  You can test whether an implementation allows character input of a
  given Unicode character by using glk_gestalt(gestalt_CharInput, ch).


 * Bi-directional Text *

  Some languages, such as Hebrew, write text from right-to-left,
  instead of left-to-right.  Some authors may even want to include a
  passage of Hebrew as part of an English game.  Thus, the library
  should be able to switch between left-to-right and right-to-left
  modes.  For compatibility with previous Glk versions, initially the
  library will start in left-to-right mode.

  For this purpose, a new style hint is added, stylehint_Direction.
  It is 0 if the text with this hint goes from left-to-right, and 1 if
  the text with this hint goes from right-to-left.  In particular, if
  style_Input is set to be right-to-left, then typing at a line input
  will move the cursor leftwards.

  Setting text to be right-to-left reverses the meaning of
  justification.  Thus, "left"-justified (the default) text will be
  flush against the right edge of the window, and "right"-justified
  will be flush against the left edge of the window.
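
  One way to picture this interaction is a function which maps the
  justification hint and the direction hint to a physical side of
  the window.  This is only a sketch: the enum and the name
  physical_side are hypothetical, and the stylehint_just_* values
  are copied here from glk.h so that the example compiles alone.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Justification values as defined in glk.h. */
#define stylehint_just_LeftFlush  0
#define stylehint_just_LeftRight  1
#define stylehint_just_Centered   2
#define stylehint_just_RightFlush 3

enum side { SIDE_LEFT, SIDE_RIGHT, SIDE_BOTH, SIDE_CENTER };

/* direction is the stylehint_Direction value: 0 for left-to-right,
   1 for right-to-left.  Right-to-left swaps the two flush sides. */
static enum side physical_side(glui32 just, glui32 direction)
{
    switch (just) {
    case stylehint_just_LeftFlush:
        return direction ? SIDE_RIGHT : SIDE_LEFT;
    case stylehint_just_RightFlush:
        return direction ? SIDE_LEFT : SIDE_RIGHT;
    case stylehint_just_Centered:
        return SIDE_CENTER;
    default:
        return SIDE_BOTH;   /* stylehint_just_LeftRight: full justification */
    }
}
```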

  Like justification, setting right-to-left text might only take
  effect if an entire paragraph has this hint set.

  Joe Mason suggests supporting vertical text (e.g., for some
  traditional Japanese forms).  This could be done by using a bit mask
  of stylehint_Direction_LeftToRight, stylehint_Direction_RightToLeft,
  stylehint_Direction_TopToBottom, stylehint_Direction_BottomToTop,
  stylehint_Direction_Horizontal, and stylehint_Direction_Vertical,
  which could be ORed together to generate the desired effect.  A
  library would need a more sophisticated text widget and clever
  scrolling routines; e.g., use a horizontal scroll bar or sideways
  [MORE] prompt to scroll vertical text.
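
  As a rough sketch of how such a bit mask might be laid out, the
  values below are purely an assumption made for this example; no
  values have been assigned.  They are chosen so that plain
  left-to-right stays 0 and plain right-to-left stays 1, matching
  the two values given above.  The helper direction_is_vertical is
  likewise hypothetical.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Hypothetical bit assignments: one bit per axis of choice. */
enum {
    stylehint_Direction_LeftToRight = 0x0,
    stylehint_Direction_RightToLeft = 0x1,  /* bit 0: horizontal order */
    stylehint_Direction_TopToBottom = 0x0,
    stylehint_Direction_BottomToTop = 0x2,  /* bit 1: vertical order */
    stylehint_Direction_Horizontal  = 0x0,
    stylehint_Direction_Vertical    = 0x4   /* bit 2: characters run in columns */
};

/* Nonzero if characters advance down (or up) the page rather than
   across it. */
static int direction_is_vertical(glui32 dir)
{
    return (dir & stylehint_Direction_Vertical) != 0;
}
```

  Under these assumed values, traditional Japanese layout (columns
  read top to bottom, ordered right to left) would be
  stylehint_Direction_Vertical | stylehint_Direction_TopToBottom |
  stylehint_Direction_RightToLeft.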


 * Unicode in a Text Grid *

  Some Unicode characters do not represent actual graphemes, but
  modifications to a previous character.  These modifications, called
  "combining marks," and the original letter are combined to form one
  grapheme.  Such combined graphemes will take up one cell of a Text
  Grid.  If a program wishes to write text which contains these
  combinations, it must always write the character to be modified
  first, followed by the combining marks.  That is, it is illegal to
  write combining marks immediately after calling
  glk_window_move_cursor.

  If you write Unicode characters which are double-wide to a text
  grid, the cursor will advance by two positions.  It is illegal to
  reposition the cursor so that it is in the middle of a double-wide
  character.  If you overwrite the first half of a double-wide
  character, the second half will be replaced by a space.
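
  The double-wide rules can be modelled with a single row of cells.
  This is a sketch of the behavior described above, not of any
  library's internals: grid_row, row_put, and the CELL_CONT marker
  are hypothetical, and the caller supplies the width of each
  character.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

#define GRID_COLS 8
#define CELL_CONT 0xFFFFFFFFu /* hypothetical marker: second half of a wide char */

typedef struct {
    glui32 cell[GRID_COLS];
    int cursor;
} grid_row;

static void row_init(grid_row *r)
{
    int i;

    for (i = 0; i < GRID_COLS; i++)
        r->cell[i] = ' ';
    r->cursor = 0;
}

/* Write ch, which occupies width cells (1 or 2), at the cursor and
   advance the cursor by width. */
static void row_put(grid_row *r, glui32 ch, int width)
{
    int start = r->cursor;
    int end = start + width;

    if (width < 1 || width > 2 || end > GRID_COLS)
        return;                 /* not modelled: combining marks, wrapping */
    /* If the last cell we cover was the first half of a wide
       character, its orphaned second half becomes a space. */
    if (end < GRID_COLS && r->cell[end] == CELL_CONT)
        r->cell[end] = ' ';
    r->cell[start] = ch;
    if (width == 2)
        r->cell[start + 1] = CELL_CONT;
    r->cursor = end;
}
```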


 * Upper, Lower, and Title Case *

  You can convert Unicode characters between upper, lower, and title
  case using the following functions:

glui32 glk_char_to_lower_ucs4(glui32 ch);
glui32 glk_char_to_upper_ucs4(glui32 ch);
glui32 glk_char_to_title_ucs4(glui32 ch);

  These are similar to their Latin-1 equivalents, but should work for
  all Unicode characters which the library is capable of printing.  If
  a library does not know how to print a character, the library may
  return the same character as it was given.

  Title case is used when two letters are smushed together into one,
  and only the first of these letters should be capitalized.  Note
  that not all Unicode characters have direct upper-case mappings;
  some may need to be broken into pieces before they can be
  meaningfully made upper-case.
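
  For example, Unicode encodes the digraph DZ with caron three
  times: U+01C4 (both letters capital), U+01C5 (only the first
  capital, the title-case form), and U+01C6 (both small).  The
  sketch below handles only that digraph and ASCII; the model_*
  names are hypothetical and a real library would consult the full
  Unicode case tables.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Title case: U+01C4..U+01C6 all map to U+01C5. */
static glui32 model_char_to_title(glui32 ch)
{
    if (ch >= 0x01C4 && ch <= 0x01C6)
        return 0x01C5;
    if (ch >= 'a' && ch <= 'z')
        return ch - 'a' + 'A';
    return ch;
}

/* Upper case: U+01C4..U+01C6 all map to U+01C4. */
static glui32 model_char_to_upper(glui32 ch)
{
    if (ch >= 0x01C4 && ch <= 0x01C6)
        return 0x01C4;
    if (ch >= 'a' && ch <= 'z')
        return ch - 'a' + 'A';
    /* Everything else is returned unchanged, including U+00DF
       (sharp s), which has no single-character upper-case form. */
    return ch;
}
```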


 * Mixing Unicode and 8-Bit Streams *

  It is not an inherent problem to specify that a text buffer accept
  Unicode characters.  But if a program attempts to read or write
  Unicode characters to a file or memory stream, what should be done?

  Files are used for many things such as transcripts and recorded
  input files.  Glk must be able to preserve all the characters in
  such files and the files should be readable by other native
  programs.

  Memory streams are often used as temporary buffers for text.  They
  must preserve the Unicode text and give an accurate count of how
  many characters were written to them.

  If a stream is opened with fileusage_TextMode, then output to it
  will be translated into some native encoding.  This may be UTF-8,
  UCS-4, HTML with entities, or whatever else is common on the target
  platform.  On reading from such files, an inverse transformation
  will be done.

  If a stream is opened with fileusage_BinaryMode, then output to it
  will be written in UTF-32BE (big-endian UCS-4).  Similarly, reading
  a Unicode character from a binary stream will read a UTF-32BE
  character, which is turned to native byte order and placed in a
  glui32.  For each Unicode character, readcount or writecount (as
  returned by closing a stream) will be increased by one, but the file
  position mark will advance by four.  Therefore, there is a
  difference between glk_put_char_ucs4('X') and glk_put_char('X') when
  writing to a binary stream.
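
  The byte layout of one character in a binary stream can be
  sketched as follows.  The helper names put_utf32be and get_utf32be
  are hypothetical; they only show the conversion between a glui32
  and its four big-endian bytes.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Store ch as four bytes, most significant byte first. */
static void put_utf32be(unsigned char out[4], glui32 ch)
{
    out[0] = (unsigned char)((ch >> 24) & 0xFF);
    out[1] = (unsigned char)((ch >> 16) & 0xFF);
    out[2] = (unsigned char)((ch >> 8) & 0xFF);
    out[3] = (unsigned char)(ch & 0xFF);
}

/* Rebuild a glui32 from four big-endian bytes.  The result is in
   native byte order regardless of the host's endianness. */
static glui32 get_utf32be(const unsigned char in[4])
{
    return ((glui32)in[0] << 24) | ((glui32)in[1] << 16)
         | ((glui32)in[2] << 8) | (glui32)in[3];
}
```

  Thus writing 'X' as a Unicode character produces the four bytes
  00 00 00 58, where glk_put_char('X') would produce the single
  byte 58.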

  Memory streams follow the same rules as files opened in binary mode.

  If you are attempting to calculate the length of a string for
  formatting in a text grid, none of the length measurements in this
  section are of use to you.  You need to use gestalt_CharOutput on
  each character in the string to measure how much space it will
  occupy in a grid.


 * Testing for Unicode Capabilities *

  Before calling the Unicode functions, you should use the following
  gestalt selector:

    glui32 res;
    res = glk_gestalt(gestalt_Unicode, 0);

  This returns 1 if the Unicode functions are available.  If it
  returns 0, you should not try to call them.  They may print nothing,
  print gibberish, or cause a run-time error.

  Additionally, a library which provides at least stubs for the
  Unicode functions will define GLK_MODULE_UNICODE, which you can use
  with the preprocessor to allow your C programs to work both with
  libraries which support and do not support Unicode.
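
  The two checks combine naturally: the preprocessor test decides
  whether the Unicode calls can be compiled at all, and the gestalt
  test decides whether they may be called at run time.  The sketch
  below uses stand-in stubs so that it compiles without a Glk
  library; the stubs, the placeholder value of gestalt_Unicode, and
  the name put_char_portable are all assumptions of this example.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Stand-ins: a real program would include glk.h and link against a
   Glk library instead of defining any of the following. */
#define GLK_MODULE_UNICODE
#define gestalt_Unicode 15          /* placeholder value for this sketch */
static int stub_unicode_available = 1;
static glui32 last_printed;         /* records what the stubs "printed" */

static glui32 glk_gestalt(glui32 sel, glui32 val)
{
    (void)val;
    return (sel == gestalt_Unicode) ? (glui32)stub_unicode_available : 0;
}
static void glk_put_char_ucs4(glui32 ch) { last_printed = ch; }
static void glk_put_char(unsigned char ch) { last_printed = ch; }

/* Print ch with the Unicode call when possible; otherwise fall back
   to Latin-1, substituting '?' for anything outside that range. */
static void put_char_portable(glui32 ch)
{
#ifdef GLK_MODULE_UNICODE
    if (glk_gestalt(gestalt_Unicode, 0)) {
        glk_put_char_ucs4(ch);
        return;
    }
#endif
    glk_put_char(ch < 0x100 ? (unsigned char)ch : (unsigned char)'?');
}
```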


  Most implementations will not support all possible characters in
  Unicode.  To test whether the implementation can display the Unicode
  character ch, call:

    glui32 res, len;
    res = glk_gestalt_ext(gestalt_CharOutput, ch, &len, 1);

  The results will be the same as for Latin-1 characters, except in
  the case of gestalt_CharOutput_ExactPrint, where len may differ.
  For Latin-1 characters, len is always set to 1.  It will be set to
  0 for non-spacing marks, such as combining marks, which do not
  stand for a character of their own but merely modify the preceding
  character by adding punctuation or accents.  It will be set to 2
  for Unicode characters which are double-wide (take up two cells in
  a text grid).
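
  Summing these lengths gives the number of grid cells a string
  will occupy.  In the sketch below the per-character width comes
  from a callback, so that the example can run without a Glk
  library; in a real program the callback would wrap the
  glk_gestalt_ext call shown above.  The names grid_width, width_fn,
  and demo_width are hypothetical, and demo_width is a toy table
  covering only two Unicode blocks.

```c
#include <stdint.h>

typedef uint32_t glui32; /* stand-in for the typedef in glk.h */

/* Returns the number of grid cells one character occupies. */
typedef glui32 (*width_fn)(glui32 ch);

/* Total cells occupied by the n characters at s. */
static glui32 grid_width(const glui32 *s, glui32 n, width_fn width_of)
{
    glui32 total = 0, i;

    for (i = 0; i < n; i++)
        total += width_of(s[i]);
    return total;
}

/* Toy width table for demonstration only. */
static glui32 demo_width(glui32 ch)
{
    if (ch >= 0x0300 && ch <= 0x036F)
        return 0;               /* Combining Diacritical Marks block */
    if (ch >= 0x3040 && ch <= 0x30FF)
        return 2;               /* Hiragana and Katakana blocks */
    return 1;
}
```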

  There is an additional problem: a character may be printable, and a
  combining mark may be printable, but their combination might *not*
  be printable.