11.1. twol.alphabet module¶
Phonemes are defined according to what features they have. Features are divided into six groups or positions so that each feature name belongs to exactly one position. One the first three groups characterize consonants and the last three groups characterize vowels. Consonants typically have one feature in each of their first three positions and and the last three positions are empty. A consonant t
could be described e.g. as:
"t": ("Dental", "Unvoiced", "Stop", "", "", "")
In features_of_phoneme
dict the phonemes are represented in this way after the alphabet has been processed. Similarly, vowels have their first three slots empty and the three last positions filled with vocalic features, e.g. the phonemes i
and u
could be represented as:
"i": ("", "", "", "Close", "Front", "Unrounded")
"u": ("", "", "", "Close", "Back", "Rounded")
We represent semivowels or approximants by giving them features in all six positions, e.g. Finnish and Estonian j
could be:
"j": ("Palatal","Voiced","Approximant",
"SemiVowel","Front","Unrounded")
The Ø
which will be inserted in the words during the course of alignment is represented as:
"Ø": ("Zero", "Zero", "Zero", "Zero", "Zero", "Zero")
11.1.1. Representing phonemes and morphophonemes with sets¶
Another, equivalent, set representation for phonemes is useful for defining the compatibility of pairs of phonemes and sets of phonemes. Individual features are replaced by sets and the empty position by a universal set U
, e.g. for t
, ì
, u
and j
:
"t": ({"Dental"}, {"Unvoiced"}, {"Stop"}, U, U, U)
"i"; (U, U, U, {"Close"}, {"Front"}, {"Unrounded"})
"u": (U, U, U, {"Close"}, {"Back"}, {"Rounded"})
"j": ({"Palatal"}, {"Voiced"}, {"Approximant"},
{"SemiVowel"}, {"Front"}, {"Unrounded"})
We define a (raw) morphophoneme to be either a single phoneme (e.g. i
) or a combination of a (raw) morphophoneme and a single phoneme (e.g. iu
). The set representation of the combination is the six-tuple of unions of their sets, e.g.:
"iu": (U, U, U, {"Close"}, {"Front", "Back"},
{"Rounded", "Unrounded"})
"ij": (U, U, U, {"Close", "Semivowel"}, {"Front"}, {"Unrounded"})
"ti": (U, U, U, U, U, U)
The tuple with six universal sets is not valid by definition, i.e. such morphophonemes are not established. The universal set U
is defined to include not only the existing features in a position but at least some other non-existing features as well so that the union with U
always is a superset of sets of actually used feature. Thus, e.g. the morphophoneme consisting of all vowels is not confused with U
.
11.1.2. Binary representation of sets of features¶
For each of the six positions, there is a small finite set of possible features. We assume that there are never more than 16 different features (including “Zero”) for any positionThe program finds them and gives a numeric value for each. The “Zero” feature is always value 0, and the rest are given values 1, 2, 4, 8, 16, 32, … Each phoneme gets a representation of six integers which are powers of 2. The set U
would be {1, 2, 4, 8, 16, ..., 2*15}
For each of the six positions we now have an integer which corresponds to the set of features as discussed above. Union of two sets corresponds to the or (|
) of these binary integers. Combining a morphophoneme and a phoneme reduces now to the or (|
) of the six corresponding binary integers.
We have six 16 bit integers. They can be represented as a single Python integer which contains (at least) 96 bits. Hereafter we can represent any phoneme and any sets of phonemes by a single integer. Furthermore we can combine them with an elementary Python (|
) operation. The result is invalid if it consists of 96 bits one i.e. is equal to 0xffffffffffffffffffffffff and this integer value represens U
.
alphabet.py
Processes an alphabet definition file so that it can be used both by aligner.py and by multialign.py. In particular, it computes weights for phoneme pairs or morphophonemes for the purposes of weighted alignment.
Copyright 2017-2020, Kimmo Koskenniemi
This is free software according to GNU GPL 3 license.
- twol.alphabet.binint_to_mphon_set = {}¶
For a binint, it gives the set of phoneme set previously stored for a 96 bit integer.
- twol.alphabet.consonant_set = {}¶
The set of consonants including semivowels.
- twol.alphabet.cost_of_zero_c = 25¶
Additional weight for a subset of consonants if Ø belongs to it.
- twol.alphabet.cost_of_zero_v = 10¶
Additional weight for a subset of vowels if Ø belongs to it.
- twol.alphabet.exception_lst = []¶
A list of weighting exceptions to be used in metric.py
- twol.alphabet.feature_bitpos = {}¶
The bit position of the feature within the 16 bit field of the group.
- twol.alphabet.for_definitions_lst = []¶
A list of … FOR X IN … definitions to be used in metric.py
- twol.alphabet.mphon_is_valid(mphon)[source]¶
Tests whether a set of phonemes is possible.
- Parameters:
mphon (str) – A sequence of phonemes a.k.a. morphophoneme, e.g. ‘ij’.
- Returns:
True if the set consists either of vowels (and semivowels and Øs) or of consonants (and semivowels and Øs), otherwise False.
- Return type:
boolean
- twol.alphabet.mphon_to_binint(mphon)[source]¶
Converts a morphophoneme into a binary integer that represents it.
- Parameters:
mphon (str) – A string of phonemes and possibly Øs
- Returns:
A binary integer which represents the morphophoneme
- Return type:
int
- twol.alphabet.mphon_to_binint_cache = {}¶
A 6*16 bit vector which represents the feature sets of this phoneme.
- twol.alphabet.mphon_weight(mphon)[source]¶
Returns the weight of a morphophoneme
- Parameters:
mphon (str) – A sequence of phonemes a.k.a. morphophoneme, e.g. ‘ij’
- Returns:
The weight of mphon based on the phonological features of its members
- Return type:
float
- twol.alphabet.mphon_weight_cache = {}¶
A cache for morphophoneme weights.
- twol.alphabet.phoneme_set_weight_cache = {}¶
Given weights for phoneme sets represented by “”.join(sorted(set(MPHON)))
- twol.alphabet.read_alphabet(file_name)[source]¶
Reads phoneme features, feature subsets with weights, and other definitions
- Parameters:
file_name (str) – Name of the file that contains the alphabet definition
Stores the computed sets and weights for feature subsets and other info in module variables for the use of other modules.
- twol.alphabet.vowel_set = {}¶
The set of vowels including semivowels.