HyFo Hyphenation Pattern Files

This is not an exhaustive discussion of the hyphenation input files, but a brief description of the derivation of patterns and the features you will see on examining such files.

Derivation of the patterns

HyFo's pattern files are based on those of TeX, as are those of most, if not all, open source hyphenation programs. The seminal work behind all of this activity is Frank Liang's thesis, published in 1983.

Frank Liang, Word hy-phen-a-tion by com-puter, STAN-CS-83-977, Stanford University Ph.D. thesis, 1983.

Components of patterns

The contents of Folio hyphenation files are defined in the file resources/hyphenation.dtd. What follows is a general, not an exhaustive, discussion of that definition.

The alphabet

Every set of hyphenation patterns is defined over a particular alphabet. This is described in the <classes> element. The content of the <classes> element is a series of <cl> elements. Each <cl> has a text attribute with one or more characters. The first character is the canonical representation for the class, and each other character, if any, is equivalent to the canonical character. For example, in Western alphabets the upper- and lower-case versions of a letter are equivalent for the purposes of hyphenation. The lower-case version is usually defined as the canonical one.
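As a minimal sketch of the equivalence rule described above, the following Python builds a lookup from every character to its canonical form. The class strings, and the names build_canonical_map and canonicalise, are hypothetical and chosen only for illustration; they are not taken from HyFo or from any real pattern file.

```python
def build_canonical_map(classes):
    """Map every character of every class to that class's first character."""
    canonical = {}
    for members in classes:
        if not members:
            continue  # ignore empty class strings
        head = members[0]  # the first character is the canonical one
        for ch in members:
            canonical[ch] = head
    return canonical


def canonicalise(word, canonical):
    """Replace each character with its canonical equivalent.

    Characters outside the alphabet are passed through unchanged here;
    a real hyphenator may well treat them differently.
    """
    return "".join(canonical.get(ch, ch) for ch in word)


# Hypothetical class strings: canonical lower-case form first, then equivalents.
classes = ["aA", "bB", "cC", "éÉ"]
canonical = build_canonical_map(classes)
print(canonicalise("CaB", canonical))  # cab
```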

The alphabet may not contain the Western digits, for reasons explained below.

The patterns

The <patterns> element, containing multiple instances of the <pt> element, enumerates the patterns. The #PCDATA of a <pt> element is drawn from the canonical characters of the alphabet, the digits from '0' to '9', and the full stop (also known as the period or dot): '.'. The dot may only occur at the beginning or end of a pattern, where it represents the beginning or end of a word containing the pattern. The characters from the alphabet represent themselves or their equivalents. The digits may not appear next to one another.
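One way to picture a pattern is as a string of letters plus a value for each gap before, between and after those letters. The sketch below splits a pattern that way. The function name parse_pattern and the convention of using 0 for a gap with no digit are assumptions made for this illustration, not a description of HyFo's internals.

```python
DIGITS = "0123456789"


def parse_pattern(pattern):
    """Split a pattern into (letters, values).

    values has one more entry than letters; values[i] is the digit that
    stands immediately before letters[i], or 0 if the pattern has none there.
    A leading or trailing dot is kept as an ordinary character.
    """
    letters = []
    values = [0]
    for ch in pattern:
        if ch in DIGITS:
            # Digits never appear next to one another, so this slot is still 0.
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


print(parse_pattern("hy3ph"))  # ('hyph', [0, 0, 3, 0, 0])
print(parse_pattern(".ach4"))  # ('.ach', [0, 0, 0, 0, 4])
```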

The digits

The digits carry the information about the desirability or otherwise of a hyphen occurring at that point in the pattern. In practice, the digits range between 1 and 5, representing the results of 5 passes over a dictionary containing the valid hyphenation points for its contents.

Crudely speaking, the first, broad pass looks for the most general rules governing hyphenation. It seeks the broadest coverage of valid hyphenation points, irrespective of errors, and places a 1 at break points in the pattern it finds. These patterns are used to generate hyphenations of all of the words in the dictionary, and the results are compared with the dictionary. The second pass then tries to eliminate errors by overriding the first results with a set of inhibiting patterns. It puts a 2 at points where hyphens should be inhibited. The third pass then tries to correct over-zealous inhibition in the previous pass by re-enabling hyphenation in certain patterns. These points are marked with a 3. The process repeats twice more, leading to a range of enablers - 1, 3 and 5 - and inhibitors - 2 and 4.

When a word is hyphenated, every substring of the word is checked against the alphabet characters (and the beginning and end indicators) of every pattern. For each matching pattern, the possible hyphenation points are extracted. The highest-valued hyphenation marker at any point wins. At the end of this process, inhibitors are discarded, leaving only those break points at which an enabler superseded every other enabler or inhibitor.
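The matching process described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not HyFo's implementation: the function names are hypothetical, the word is assumed to be already reduced to canonical characters, and no minimum prefix or suffix length is enforced. The small pattern set is the commonly quoted English example for the word "hyphenation" from the TeX literature.

```python
DIGITS = "0123456789"


def parse_pattern(pattern):
    """Split a pattern into its letters and the values of the gaps around them."""
    letters, values = [], [0]
    for ch in pattern:
        if ch in DIGITS:
            values[-1] = int(ch)
        else:
            letters.append(ch)
            values.append(0)
    return "".join(letters), values


def hyphenate(word, patterns):
    """Return word with '-' inserted at every gap whose winning value is odd."""
    table = dict(parse_pattern(p) for p in patterns)
    dotted = "." + word + "."  # dots mark the beginning and end of the word
    # scores[j] is the highest value seen for the gap just before dotted[j].
    scores = [0] * (len(dotted) + 1)
    for start in range(len(dotted)):
        for end in range(start + 1, len(dotted) + 1):
            values = table.get(dotted[start:end])
            if values is None:
                continue
            for offset, value in enumerate(values):
                scores[start + offset] = max(scores[start + offset], value)
    pieces, last = [], 0
    for i in range(1, len(word)):
        # scores[i + 1] belongs to the gap between word[i - 1] and word[i].
        if scores[i + 1] % 2 == 1:  # odd values enable, even values inhibit
            pieces.append(word[last:i])
            last = i
    pieces.append(word[last:])
    return "-".join(pieces)


patterns = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
print(hyphenate("hyphenation", patterns))  # hy-phen-ation
```

Note how 1na proposes a break before "na", he2n overrides it with an inhibiting 2, and hen5at then wins the neighbouring gap with a 5. Checking every substring is quadratic in the word length; a production hyphenator would normally store the patterns in a trie or similar structure.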

PATGEN

Liang's thesis describes the program he developed to generate patterns. This program, or a derivative, is still used in conjunction with a comprehensive hyphenation dictionary to generate patterns.

TeX pattern files

PATGEN's best-known application has been in the generation of the pattern files for TeX. For HyFo users, existing TeX files are a ready source of patterns. So far, all of the pattern files in HyFo have been generated from original TeX files.

No script exists yet for converting the TeX code of pattern files to the XML of HyFo, but the process of conversion, though tedious and error-prone, is not difficult.

Exceptions

Exhaustive as the process of pattern generation is, it is not perfect. Exceptions are a means of overriding the patterns for a set of words, whose complete hyphenation is defined in the <exceptions> element. All words submitted to the hyphenator are first checked against the exceptions list. The list generally contains only canonical characters, with hyphens at the valid hyphenation points. In most cases the set of exceptions is small; a notable exception to this is the pattern file for old German orthography, which has a huge exception vocabulary.
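The "exceptions first" rule can be sketched as a dictionary lookup that falls back to pattern matching only when the word is not listed. The names load_exceptions and hyphenate_with_exceptions, the fallback callable, and the sample entries are all hypothetical; they only illustrate the order of the checks.

```python
def load_exceptions(entries):
    """Index fully hyphenated entries by their unhyphenated spelling."""
    return {entry.replace("-", ""): entry for entry in entries}


def hyphenate_with_exceptions(word, exceptions, fallback):
    """Return the listed hyphenation if there is one, else call fallback(word).

    fallback stands in for the pattern-based hyphenator.
    """
    listed = exceptions.get(word)
    if listed is not None:
        return listed
    return fallback(word)


# Illustrative entries only; real lists live in the <exceptions> element.
exceptions = load_exceptions(["ta-ble", "as-so-ciate"])

# A stand-in fallback that never hyphenates, used here only for the demonstration.
print(hyphenate_with_exceptions("table", exceptions, lambda w: w))  # ta-ble
print(hyphenate_with_exceptions("chair", exceptions, lambda w: w))  # chair
```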

Adding new patterns to the build

Pattern files are located under the resources directory. Part of the existing directory structure looks like this:

    resources/
          apache/
                dk/
                      dk.xml
                      message.txt
                LICENSE
          gpl/
                cs/
                      cs.xml
                license.eng
                lic-gpl.eng
          lppl/
                de/
                      de.xml
                      dehyphtex.xml
                lppl.txt
  

The directories under resources are each devoted to a particular licence type. Any files which apply to all files in the directory tree, for example apache/LICENSE, are placed in the licence-level directory. Within the licence-level directory are all of the individual locale directories, for example apache/dk and lppl/de. The pattern file for a locale is placed in its locale directory, along with any files which are specific to that locale, for example lppl/de/dehyphtex.xml (which is imported when lppl/de/de.xml is processed) and apache/dk/message.txt.

To add a new pattern to the build, determine first which licence applies. If a corresponding licence directory is not available, create it under resources, and add any licence-specific files to that directory. If the licence directory already exists, it should already contain the applicable licence files. If there are general licence files which you believe should be in there but are not, please contact me.

Add a directory named for the locale. This must be the same name as the locale XML file, excluding the suffix. Add the XML pattern file, and any supporting files, to the newly-created directory.

Build the pattern jar

The target jars-licensed will generate and compile the code, build the jar and place it in the dist/bin directory. So will the higher-level targets jar and full-distribution. The jar will be called hyfo_<licence>_<locale>.jar.
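As a quick sanity check on the layout and naming rules above, the following sketch derives the jar name expected for a pattern file path. It assumes that the <licence> and <locale> parts of the jar name are simply the two directory names under resources; that is an inference from the description, not something confirmed by the build script. The function name expected_jar_name is hypothetical.

```python
from pathlib import PurePosixPath


def expected_jar_name(pattern_file):
    """Derive hyfo_<licence>_<locale>.jar from resources/<licence>/<locale>/<locale>.xml."""
    path = PurePosixPath(pattern_file)
    parts = path.parts
    if len(parts) != 4 or parts[0] != "resources":
        raise ValueError("expected resources/<licence>/<locale>/<locale>.xml")
    _, licence, locale, _ = parts
    # The locale directory must have the same name as the XML file, minus the suffix.
    if path.suffix != ".xml" or path.stem != locale:
        raise ValueError("pattern file must be named <locale>.xml")
    return f"hyfo_{licence}_{locale}.jar"


print(expected_jar_name("resources/lppl/de/de.xml"))  # hyfo_lppl_de.jar
```

A supporting file such as resources/lppl/de/dehyphtex.xml is rejected by this check, since it is not the locale's own pattern file.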