12. Glossary


The process of making allomorphs equal length and make them to correspond each other phoneme by phoneme. Alignment consists of adding zero symbols as needed so that the phonemes in the same position are phonologically similar. One could align, e.g. mäki and mäe by inserting one zero to the latter morph (mäØe) so that the corresponding phonemes would be mm, ää, and ie. See Alignment.

comma-separated values

A common exchange format for representing tabular data which most spreadsheet programs can import and export. In CSV table rows are usually lines where fields are separated by a comma. Instead of a comma, a semicolon is sometimes used. Values may be enclosed in double quotes if they contain e.g. commas. For more information, see https://en.wikipedia.org/wiki/Comma-separated_values.


The immediate environment where a symbol occurs. The context consists of the left context, i.e. zero or more symbols occurring before and of the right context, i.e. zero or more symbols occurring after the symbol whose context we are speaking of.


Deletion is said to occur when a phoneme in a in a morph corresponds to zero in another morph of the same morpheme. Cf. epenthesis.

encoded FST

A FST can be converted into an equivalent FSM by changing all its transition labels so that the new labels are combinations of the original input and output labels using functions fst_to_fsa. If the original FST contained a transition {aä}:a then the encoded FSA will have a transition {aä}^a:{aä}^a. An encoded FSA can be made back to a normal FST by the function fsa_to_fst. See the HFST documentation


Epenthesis is said to occur when a zero in a morph corresponds to a phoneme in another morph of the same morpheme. In the simplified two-level framework, epenthesis and deletion are equivalent.

null string

Epsilon represents the null string, i.e. a string whose length is 0. For a FSM or a FST it matching an epsilon in input means that the machine reads nothing (i.e. the input tape does not move). An epsilon in output for a FST means that nothing is written. Epsilon is not present in the two-level model. Instead, it uses a zero.

finite-state machine
finite-state automaton

A finite-state machine (automaton). An abstract device consisting of states and transitions. One state is the initial state where the FSM is when it starts. An FSM reads symbols, one at a time and moves into another state if there is a transition from the current state where the transition label is the current input symbol. If so, the FSM moves into a new state given by the transition. It continues so, until the last input symbol has been read. If the FSM is in one of its final states, the FSM is said to accept the input string. If the FSM fails to have a matching transition at any step, then the FSM rejects the input. The FSM also rejects the input, if it ends up in a state which is not one of the final states.

finite-state transducer

An abstract machine like a FSM but it operates with two tapes: input tape and output tape. Thus, the transitios are labeled with a symbol pair instead of a single symbol. A transition is applied, if the current input symbol matches the former component of the symbol pair in the transition. Then, the latter component of the symbol pair is output. Labels in FST transitions may, in general, also contain epsilons instead of symbols. In the two-level rules and examples, no epsilons are used. Two-level FSTs define, thus, same length relations, i.e. the relate pairs of strings where both strings are equally long.

generalized restriction

A method for compiling rules by formulating the rule components as regular sets. Additional boundary markers are used so that the restriction for the center part and the restrictions for the context parts can be combined. The boundary markers are removed after the operations. Using the boundary markers, the context parts get a natural expression and the context parts can be disjuncted in a natural manner. See [ylijyrä2006] for details.


A process of producing word forms of a lexeme. Inflecting verbs is often called conjugation and inflecting nouns is called declination. Conjugation can also refer to an inflection class of verbs and delination to an inflectional class of nouns.

inflection class
inflectional class

A set of lexemes which are inflected in a similar way.

initial state

The state where an automaton (e.g. FSM or FST) is when it starts.

input symbol

In general, input symbols are the symbols that a FST reads in. In two-level rules and examples, the input symbols belong to the underlying representation and they may be either phonemes or morphophonemes. The input symbols in two-level rules and examples are sometines also called lexical characters or upper characters.


Lexemes correspond roughly to dictionary entry words. In morphological analysis, a lexeme can be idientified by its base form and inflectional class. Two words represent different lexemes if they have an identical base form but are inflected in a different way. A lexeme may have several senses. A lemma is a label for all inflected forms of a lexeme. A representation of a lexeme in a lexicon might have more information than just a lemma.


A part of the surface form which is said to correspond to a morpheme, e.g. in kadulla the part kadu (street) and the part lla (on) are morphs.


A minimal unit with a meaning and which is manifeseted as morphs, e.g. we may have a morpheme KATU which has a meaning ‘street’ and is manifested as two possible morphs katu and kadu. E.g. stems of words may be morphemes as well as various affixes for inflection and derivation. Some stems combine two or more morphemes, e.g. compounds and derived lexemes.


An abstract symbol which denotes the alternation of surface characters in a position within a morpheme. E.g. {td} could denote the alternation between t and d. The names of the morphophonemes are chosen by the linguist who writes a two-level grammar. Morphophonemes are always input symbols to the two-level rules.

morphophonemic representation

An abtract representation which is a kind of summary of the concrete surface morphs of a morpheme. Two-level rules describe the relation between the lexical and the surface level. Corresponds to the sequence of input symbols of two-level rules. The morphophonemic representation is sometimes also called the lexical level or the upper level.

output symbol
surface character

In general, output symbols are the symbols that a FST writes as output. In two-level rules and examples, the output symbols are the phonemes in actual word forms (or letters in a near phonemic writing system). Output symbols are sometimes called surface characters or lower characters.

pair symbol

A human readable representation of a symbol pair. It consists either of an output symbol, e.g. a which corresponds to symbol pair ('a', 'a'), or an input symbol followed by a colon followed by an output symbol, e.g. {aä}:a which corresponds to symbol pair ('{aä}', 'a').


For the purposes of writing two-level rules and analyzers, phonemes often correspond to letters in a near-phonemic writing system. In linguistics, phonemes are units which represent similar phohes whose differences do not carry any additional information. The choice of a phone in a phoneme might be irrelevant or sometimes determined by the surrounding context of phones.

principal forms
principal parts

A subset of inflectional forms that is needed for determining the full paradigm of inflectional forms for a lemma.

raw morphophoneme

A combination of phonemes which correspond to each other as a result of alignment, e.g. if käsi, käde, käte, käs and kät are aligned, we get raw morphophonemes such as kkkk or sdtst. Raw morhpphonemes are usually renamed to morphophonemes, e.g. k or {tds}


FSMs and FSTs have states. At any moment, the machines are in some state and during the process, they move from some state to another state according to what transition matches the input symbol.

surface representation

The concrete representation of of word forms as a sequence of phonemes or letters (possibly with some zeros inserted).

surface symbol

Surface symbols are phonemes or the symbols which are used to write word forms. For two-level rules, surface symbols are output-symbols.

symbol pair

A tuple consisting of an input and an output symbol, e.g. ({aä}, a)


FSMs and FSTs have transitions which tell to which state the machine must move according to the input symbol that is currently being processed. In a FST, the transition also gives the possible output symbol.

word form

A possibly inflected form of a lexeme. A word form is a string of phonemes or letters. A word form might have several occurrences (sometimes called word tokens) in a text. In some statistical contexts, word_forms are called word types or just types.


A placeholder which indicates that in some other allomorphs there is some phoneme in this position. By inserting zeros, one makes the allomorphs same length. Zero is not a morphophoneme and it never occurs in morphophonemic representations. The zero is not an epsilon.