11.10. twol.discover module
11.10.1. Sets of positive and negative contexts
The discovery procedure processes one pair symbol at a time. If all pair symbols of a morphophoneme are to be processed, each pair symbol is processed separately. For a pair, e.g. {tds}:s, the set of relevant contexts is first extracted from the whole set of examples. The relevant contexts consist of (1) the positive contexts (positive_contexts), where the pair (e.g. {tds}:s) occurs, and (2) the negative contexts (negative_contexts), where the morphophoneme of the pair occurs with any other surface phoneme (e.g. {tds}:t or {tds}:d). Any context that occurs in both sets is removed from the negative set. Each context in these sets is a pair of two strings: the left context and the right context.
In this module, examples are treated as strings of space-separated pair symbols, e.g.:
"k ä {tds}:s {ieØeØ}:i"
"k ä {tds}:d {ieØeØ}:e n"
Thus one element in the set of positive context pairs for {tds}:s would be:
(".#. k ä", "{ieØeØ}:i .#.")
Note the symbol .#., which marks the outer end of the left or right context, i.e. the boundary of the example. The set of negative context pairs for {tds}:s could include the following pair:
(".#. k ä", "{ieØeØ}:e n .#.")
The sets of positive and negative contexts are built, and the positive set is subtracted from the negative set. The negative contexts remain constant during processing, whereas the positive ones are modified by truncations and/or substitutions.
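The construction described above can be sketched in a few lines of Python. This is an illustration only, not the module's own code: the function name `contexts_for` and its signature are hypothetical, and the sketch returns the two sets instead of storing them in module-level variables.

```python
from typing import List, Set, Tuple

Context = Tuple[str, str]  # (left context, right context)


def contexts_for(pair_symbol: str, examples: List[str]):
    """Collect positive and negative contexts for one pair symbol (hypothetical helper)."""
    mphon = pair_symbol.split(":")[0]        # input side, e.g. "{tds}"
    positive: Set[Context] = set()
    negative: Set[Context] = set()
    for example in examples:
        syms = example.split()
        for i, sym in enumerate(syms):
            if sym.split(":")[0] != mphon:
                continue                      # a different input symbol
            # Add the boundary symbol .#. at the outer end of each context.
            left = " ".join([".#."] + syms[:i])
            right = " ".join(syms[i + 1:] + [".#."])
            if sym == pair_symbol:
                positive.add((left, right))
            else:
                negative.add((left, right))
    # Contexts that also occur positively are removed from the negative set.
    return positive, negative - positive


examples = ["k ä {tds}:s {ieØeØ}:i", "k ä {tds}:d {ieØeØ}:e n"]
positive, negative = contexts_for("{tds}:s", examples)
```

With the two example strings from this section, `positive` holds the pair shown for {tds}:s and `negative` holds the pair built from the {tds}:d example.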
11.10.2. Generalising the sets of contexts
These initial sets of contexts correspond to the trivial rule which constrains the occurrence of the pair symbol within the set of examples. Such a rule is correct but does not work outside the set of examples because the contexts are too restrictive. The task of the discovery procedure is to generalise the positive and the negative sets while keeping the two sets mutually exclusive.
The program modifies the positive context sets but keeps the negative sets constant. Therefore, comparisons between positive and negative context sets are performed with functions which correctly deduce inclusion or disjointness even when the positive contexts have been truncated or reduced.
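Why plain set operations are not enough can be seen from a small sketch. It assumes that only truncation has been applied (no set names or reduced symbols), and the names `fits`, `is_disjoint_sketch` and `is_subset_sketch` are hypothetical stand-ins, not the module's functions.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def fits(pos: Context, neg: Context) -> bool:
    """True if a possibly truncated positive context fits an intact negative one."""
    pos_left, pos_right = pos[0].split(), pos[1].split()
    neg_left, neg_right = neg[0].split(), neg[1].split()
    # A left context is anchored at its right end (next to the pair symbol),
    # a right context at its left end.
    left_ok = (len(pos_left) <= len(neg_left)
               and neg_left[len(neg_left) - len(pos_left):] == pos_left)
    right_ok = neg_right[:len(pos_right)] == pos_right
    return left_ok and right_ok


def is_disjoint_sketch(pos_set: Set[Context], neg_set: Set[Context]) -> bool:
    """No positive context fits any negative context."""
    return not any(fits(p, n) for p in pos_set for n in neg_set)


def is_subset_sketch(pos_set: Set[Context], other_set: Set[Context]) -> bool:
    """Every positive context fits at least one context of the other set."""
    return all(any(fits(p, o) for o in other_set) for p in pos_set)


negative = {(".#. k ä", "{ieØeØ}:e n .#.")}
truncated_ok = {("ä", "{ieØeØ}:i")}      # still distinguishes the two examples
truncated_too_far = {("ä", "")}          # right context lost: now fits the negative
```

Both truncated sets have an empty ordinary intersection with `negative`, yet only the first one is logically disjoint from it. A truncated context stands for every longer context that ends (on the left) or begins (on the right) with it, so the comparison has to be made symbol by symbol.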
11.10.3. Python functions of the module
A module for discovering raw two-level rules from a set of carefully chosen examples.
Examples, contexts and rules are treated as strings, without any finite-state machinery or rule compilation. Examples and contexts are space-separated sequences of pair symbols.
© Kimmo Koskenniemi, 2017-2023. Free software under the GPL 3 or later.
- twol.discover.context_set_penalty(context_set)
  - Parameters:
    context_set (Set[Tuple[str, str]])
- twol.discover.context_to_output_str(pairsym_str)
  Converts a pair symbol string into its surface string.
  - Parameters:
    pairsym_str (str)
  - Return type:
    str
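As a rough illustration of what such a conversion involves, the sketch below takes the output side of each pair symbol and concatenates the results. Two points are assumptions rather than documented behaviour: that the zero symbol Ø is dropped from the surface string, and that the boundary symbol .#. is skipped. The name `surface_string_sketch` is hypothetical.

```python
def surface_string_sketch(pairsym_str: str, zero: str = "Ø") -> str:
    """Concatenate the output sides of space-separated pair symbols.

    Assumptions (not confirmed by the documentation): the zero symbol and
    the boundary symbol .#. contribute nothing to the surface string.
    """
    surface = []
    for pairsym in pairsym_str.split():
        if pairsym == ".#.":
            continue
        outsym = pairsym.split(":")[-1]   # "k" -> "k", "{tds}:s" -> "s"
        if outsym != zero:
            surface.append(outsym)
    return "".join(surface)


word = surface_string_sketch("k ä {tds}:s {ieØeØ}:i")
```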
- twol.discover.max_left_len(pos_context_set)
  - Parameters:
    pos_context_set (set)
  - Return type:
    int
- twol.discover.max_right_len(pos_context_set)
  - Parameters:
    pos_context_set (set)
  - Return type:
    int
- twol.discover.mphon_subset(pos_context_set, subset_name)
  Reduces a set of contexts by replacing e.g. {ij}:i with {ij}:
  - Parameters:
    pos_context_set (Set[Tuple[str, str]]) – A set of positive contexts which might already be truncated and reduced.
    subset_name (str) – Only pairs which are in definitions[subset_name] are reduced.
  - Return type:
    Set[Tuple[str, str]]
  - Returns:
    A modified context set where pair symbols (insym:outsym) belonging to the given subset have been reduced into (insym:).
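A minimal sketch of this kind of reduction follows. It is not the module's implementation: the hypothetical `mphon_subset_sketch` receives the members of the subset directly as a Python set, whereas the documented function looks them up by name in `definitions`.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def mphon_subset_sketch(pos_context_set: Set[Context],
                        subset: Set[str]) -> Set[Context]:
    """Replace insym:outsym with insym: for pair symbols listed in `subset`."""
    def reduce_side(side: str) -> str:
        return " ".join(
            sym.split(":")[0] + ":" if sym in subset else sym
            for sym in side.split()
        )
    # Building a new set lets contexts that become identical merge into one.
    return {(reduce_side(left), reduce_side(right))
            for left, right in pos_context_set}


reduced = mphon_subset_sketch(
    {("k {ij}:i", "{ij}:j a"), ("k {ij}:j", "{ij}:j a")},
    {"{ij}:i", "{ij}:j"},
)
```

In this example the two input contexts differ only in the surface side of {ij}, so they collapse into a single reduced context, which is what makes the resulting set more general.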
- twol.discover.overlap(set_lst, pairsym_lst)
  Tests whether a list of set names covers a list of pairsyms
  - Parameters:
    set_lst (list[str]) – List of pair symbols or names of defined sets.
    pairsym_lst (list[str]) – List of pair symbols. If shorter than set_lst, the match fails.
  - Return type:
    bool
  - Returns:
    True if each pairsym in the latter list is included in the respective set (or equals the respective pairsym) in the former list.
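The position-by-position test can be sketched as below. Several details are assumptions: the two lists are aligned from their first elements (a caller comparing left contexts would presumably reverse them first), extra symbols at the end of `pairsym_lst` are ignored, and the set definitions are passed in as a dictionary rather than read from a module-level `definitions`. The name `overlap_sketch` is hypothetical.

```python
from typing import Dict, List, Set


def overlap_sketch(set_lst: List[str], pairsym_lst: List[str],
                   definitions: Dict[str, Set[str]]) -> bool:
    """Check that each item of set_lst covers the pairsym at the same position."""
    if len(pairsym_lst) < len(set_lst):
        return False                      # not enough symbols to match against
    for name, pairsym in zip(set_lst, pairsym_lst):
        if name == pairsym:
            continue                      # a literal pair symbol covers itself
        if pairsym not in definitions.get(name, set()):
            return False                  # neither equal nor a member of the set
    return True


definitions = {"Vowel": {"a", "ä", "{ieØeØ}:i"}}
covered = overlap_sketch(["Vowel", "n"], ["ä", "n", ".#."], definitions)
```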
- twol.discover.pos_neg_is_disjoint(pos_ctx_set, other_ctx_set)
  Tests whether a positive context set is disjoint from a negative one
  - Parameters:
    pos_ctx_set (Set[Tuple[str, str]]) – A set of left and right context pairs where the contexts are represented as space-separated strings of pair symbols or set names.
    other_ctx_set (Set[Tuple[str, str]]) – A context set to which the positive context set is compared. The contexts are space-separated strings of pair symbols.
  - Return type:
    bool
  - Returns:
    True if the context sets are logically disjoint.
- twol.discover.pos_neg_is_subset(pos_ctx_set, ctx_set)
  Tests whether the first context set is logically a subset of the second
  - Parameters:
    pos_ctx_set (Set[Tuple[str, str]]) – A positive context set which has gone through reductions such as truncation or replacements.
    ctx_set (Set[Tuple[str, str]]) – An intact negative context set which has not been reduced.
  - Return type:
    bool
  - Returns:
    True if every context in the first set matches some context in the second set.
- twol.discover.print_context_set(msg, context_set)
  - Parameters:
    msg (str)
    context_set (Set[Tuple[str, str]])
  - Return type:
    None
- twol.discover.print_rule(result, operator)
  Prints one rule
  - Parameters:
    result (Dict)
    operator (str)
  - Return type:
    None
- twol.discover.process_results_into_rules(pairsym_lst, result_lst_lst)
  - Parameters:
    pairsym_lst (List[str])
    result_lst_lst (List[List[Dict]])
- twol.discover.reduce_set(set_name, pos_context_set)
  Reduce contexts by substituting pair symbols with set names
  - Parameters:
    set_name (str) – A name of a pairsym set in definitions.
    pos_context_set (Set[Tuple[str, str]]) – A set of contexts to which the reduction is to be applied.
  - Return type:
    Set[Tuple[str, str]]
  - Returns:
    A new set of contexts where every occurrence of a pairsym in definitions[set_name] has been substituted with set_name.
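The substitution can be illustrated with a short sketch. The name `reduce_set_sketch` and the explicit `definitions` argument are hypothetical; the documented function takes only the set name and the context set and evidently reads the definitions from elsewhere.

```python
from typing import Dict, Set, Tuple

Context = Tuple[str, str]


def reduce_set_sketch(set_name: str, pos_context_set: Set[Context],
                      definitions: Dict[str, Set[str]]) -> Set[Context]:
    """Replace every member of definitions[set_name] with the set name itself."""
    members = definitions[set_name]

    def reduce_side(side: str) -> str:
        return " ".join(set_name if sym in members else sym
                        for sym in side.split())

    return {(reduce_side(left), reduce_side(right))
            for left, right in pos_context_set}


definitions = {"VoicelessStop": {"k", "p", "t"}}
generalised = reduce_set_sketch(
    "VoicelessStop",
    {(".#. k ä", "{ieØeØ}:i .#."), (".#. p ä", "{ieØeØ}:i .#.")},
    definitions,
)
```

The set name `VoicelessStop` and its members are invented for the example. After the substitution the two contexts are identical, so the result contains one context that covers both.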
- twol.discover.relevant_contexts(pair_symbol)
  Select positive and negative contexts for a given pair-symbol
  - Parameters:
    pair_symbol (str) – The pairsym for which the contexts are selected.
  - Return type:
    None
  Sets a global variable positive_context_set[pair_symbol] to the set of those contexts in the examples in which the pair_symbol occurs.
  Sets a global variable negative_context_set[pair_symbol] to the set of contexts where the input symbol of the pair_symbol occurs with another output symbol, provided that no example in the example_set has the pair_symbol in such a context.
- twol.discover.search_reductions(agenda, pair_symbol, pos_context_set)
  Tests and executes context reductions according to a recipe
  - Parameters:
    agenda (Deque) – Initially a recipe. Consumed and updated during the process.
    pair_symbol (str) – The pairsym for which a rule is deduced.
    pos_context_set (Set[Tuple[str, str]]) – A set of contexts, i.e. pairs (left_context, right_context) where the contexts are space-separated strings of pairsyms. The contexts are reduced during the process.
  - Return type:
    Set[Tuple[str, str]]
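The general idea of working through an agenda of reductions can be sketched as follows. The documentation does not describe the recipe format or how the agenda is updated, so everything here is an assumption made for illustration: steps are plain callables, a step is kept only if a caller-supplied check accepts its result, and the names `run_agenda_sketch`, `drop_first_left`, `drop_right` and `acceptable` are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Set, Tuple

Context = Tuple[str, str]
ContextSet = Set[Context]
Step = Callable[[ContextSet], ContextSet]


def run_agenda_sketch(agenda: Deque[Step], pos_context_set: ContextSet,
                      is_acceptable: Callable[[ContextSet], bool]) -> ContextSet:
    """Try each reduction step in turn and keep only acceptable results."""
    while agenda:
        step = agenda.popleft()            # the agenda is consumed
        candidate = step(pos_context_set)
        if is_acceptable(candidate):       # e.g. still disjoint from the negatives
            pos_context_set = candidate
    return pos_context_set


def drop_first_left(ctxs: ContextSet) -> ContextSet:
    """Toy step: remove the outermost symbol of every left context."""
    return {(" ".join(left.split()[1:]), right) for left, right in ctxs}


def drop_right(ctxs: ContextSet) -> ContextSet:
    """Toy step: remove the right contexts entirely."""
    return {(left, "") for left, _ in ctxs}


negative = {(".#. k ä", "{ieØeØ}:e n .#.")}


def acceptable(ctxs: ContextSet) -> bool:
    # Toy string-level check; a real test would compare symbol by symbol.
    return not any(nl.endswith(left) and nr.startswith(right)
                   for left, right in ctxs for nl, nr in negative)


result = run_agenda_sketch(deque([drop_first_left, drop_right]),
                           {(".#. k ä", "{ieØeØ}:i .#.")}, acceptable)
```

Here the first step is accepted because the right context still separates the positive example from the negative one, while the second step is rejected because removing the right context would make the two indistinguishable.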
- twol.discover.surface_subset(pos_context_set, subset_name)
  Reduce the contexts by substituting insym:outsym pairs with :outsym
  - Parameters:
    pos_context_set (Set[Tuple[str, str]]) – The set of contexts to be reduced.
    subset_name (str) – A defined subset whose sympairs are considered for reduction.
  - Return type:
    Set[Tuple[str, str]]
  - Returns:
    A new context set where all occurrences of pairsyms (insym:outsym) in definitions[subset_name] are replaced with (:outsym), e.g. {tds}:s is reduced into :s.
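What a reduced symbol stands for can be made concrete with a small sketch. It assumes the usual two-level reading, in which :s covers any pair with surface s, {tds}: covers any pair with input {tds}, and a bare symbol such as k abbreviates k:k. The helper names `sides` and `covers` are hypothetical and are not part of the module.

```python
from typing import Tuple


def sides(pairsym: str) -> Tuple[str, str]:
    """Split 'in:out' into (in, out); a bare symbol 'k' is read as 'k:k'."""
    insym, colon, outsym = pairsym.partition(":")
    return (insym, outsym) if colon else (insym, insym)


def covers(pattern: str, pairsym: str) -> bool:
    """True if a possibly reduced pattern such as ':s' or '{tds}:' covers pairsym."""
    pat_in, pat_out = sides(pattern)
    sym_in, sym_out = sides(pairsym)
    # An empty side in the pattern places no restriction on that side.
    return pat_in in ("", sym_in) and pat_out in ("", sym_out)


surface_match = covers(":s", "{tds}:s")
```

Under this reading, a context reduced to :s matches both s and {tds}:s in an unreduced context, which is why the comparison functions cannot rely on string equality of symbols.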
- twol.discover.truncate_left(syms_to_remain, context_set)
  Truncate the left contexts
  - Parameters:
    syms_to_remain (int) – A minimum number of pair symbols to remain in the left context.
    context_set (Set[Tuple[str, str]]) – A set of (positive) contexts to be truncated.
  - Return type:
    Set[Tuple[str, str]]
  - Returns:
    A new context set where left contexts are truncated.
- twol.discover.truncate_right(syms_to_remain, context_set)
  Truncate the right contexts
  - Parameters:
    syms_to_remain (int) – A minimum number of pair symbols to remain in the right context.
    context_set (Set[Tuple[str, str]]) – A set of (positive) contexts to be truncated.
  - Return type:
    Set[Tuple[str, str]]
  - Returns:
    A new context set where right contexts are truncated.
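The two truncations can be sketched as below. The sketch assumes that each context is cut down to at most `syms_to_remain` symbols, counted from the side next to the pair symbol, and that shorter contexts are left unchanged; the exact semantics of the parameter in the module may differ. The `_sketch` names are hypothetical.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def truncate_left_sketch(syms_to_remain: int,
                         context_set: Set[Context]) -> Set[Context]:
    """Keep the last syms_to_remain symbols of every left context."""
    result = set()
    for left, right in context_set:
        syms = left.split()
        # Slice from an explicit start index so that 0 keeps nothing.
        start = max(len(syms) - syms_to_remain, 0)
        result.add((" ".join(syms[start:]), right))
    return result


def truncate_right_sketch(syms_to_remain: int,
                          context_set: Set[Context]) -> Set[Context]:
    """Keep the first syms_to_remain symbols of every right context."""
    return {(left, " ".join(right.split()[:syms_to_remain]))
            for left, right in context_set}


contexts = {(".#. k ä", "{ieØeØ}:i .#."), (".#. t ä", "{ieØeØ}:i .#.")}
shorter_left = truncate_left_sketch(1, contexts)
shorter_right = truncate_right_sketch(1, contexts)
```

Truncating the left contexts to one symbol merges the two contexts in this example, because they differ only in the symbol that is cut away.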