11.10. twol.discover module

11.10.1. Sets of positive and negative contexts

The discovery procedure processes one pair symbol at a time. If all pair symbols of a morphophoneme are to be processed, each one is handled separately. For a pair symbol, e.g. {tds}:s, the set of relevant contexts is first extracted from the whole set of examples. The relevant contexts consist of (1) the positive contexts (positive_contexts), where the pair symbol (e.g. {tds}:s) occurs, and (2) the negative contexts (negative_contexts), where the morphophoneme of the pair occurs with any other surface phoneme (e.g. {tds}:t or {tds}:d). Any context in the negative set that also occurs in the positive set is removed from the negative set. Each context in these sets is a pair of strings: the left context and the right context.

In this module, examples are treated as strings of space-separated pair symbols, e.g.:

"k ä {tds}:s {ieØeØ}:i"
"k ä {tds}:d {ieØeØ}:e n"

Thus one element in the set of positive context pairs for {tds}:s would be:

(".#. k ä", "{ieØeØ}:i .#.")

Note the symbol .#., which marks the outer end of the left or right context. The set of negative context pairs for {tds}:s could include the following pair:

(".#. k ä", "{ieØeØ}:e n .#.")

The sets of positive and negative contexts are built, and the positive set is then subtracted from the negative set. The negative contexts remain constant during processing, whereas the positive ones are modified by truncations and/or substitutions.
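The construction described above can be sketched as follows. This is only a minimal illustration, not the module's actual code: the function name collect_contexts and its signature are hypothetical, and it assumes that the input side of a pair symbol is the part before the colon.

```python
from typing import Iterable, Set, Tuple

Context = Tuple[str, str]  # (left context, right context)


def collect_contexts(examples: Iterable[str],
                     pair_symbol: str) -> Tuple[Set[Context], Set[Context]]:
    """Hypothetical helper: return (positive, negative) contexts of pair_symbol."""
    insym = pair_symbol.partition(":")[0]  # assumed: input side precedes ":"
    positive: Set[Context] = set()
    negative: Set[Context] = set()
    for example in examples:
        syms = [".#."] + example.split() + [".#."]
        # Visit every position except the two boundary symbols.
        for i in range(1, len(syms) - 1):
            context = (" ".join(syms[:i]), " ".join(syms[i + 1:]))
            if syms[i] == pair_symbol:
                positive.add(context)
            elif syms[i].partition(":")[0] == insym:
                # Same morphophoneme, but a different surface phoneme.
                negative.add(context)
    # A context attested as positive must not remain in the negative set.
    return positive, negative - positive


examples = ["k ä {tds}:s {ieØeØ}:i", "k ä {tds}:d {ieØeØ}:e n"]
positive, negative = collect_contexts(examples, "{tds}:s")
```

With the two example strings shown earlier, this yields exactly the positive and negative context pairs given above.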

11.10.2. Generalising the sets of contexts

These initial sets of contexts correspond to the trivial rule which constrains the occurrence of the pair symbol within the set of examples. Such a rule is correct but does not work outside the set of examples because the contexts are too restrictive. The task of the discovery procedure is to generalise the positive and the negative sets while keeping the two sets mutually exclusive.

The program modifies the positive context sets but keeps the negative sets constant. A reduced positive context therefore no longer has the same form as the intact negative contexts, so comparisons between positive and negative context sets are performed with dedicated functions that correctly deduce inclusion or disjointness.
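A small sketch shows why an ordinary set intersection is not enough here. It assumes that a truncated positive context matches a negative context whenever its left part equals the end of the negative left context and its right part equals the beginning of the negative right context. The names matches and is_disjoint are hypothetical, and set names and reduced pair symbols are left out for brevity.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def matches(pos: Context, neg: Context) -> bool:
    """True if the (possibly truncated) pos context fits the intact neg context."""
    pos_left, pos_right = pos[0].split(), pos[1].split()
    neg_left, neg_right = neg[0].split(), neg[1].split()
    if len(pos_left) > len(neg_left) or len(pos_right) > len(neg_right):
        return False
    # Left contexts are aligned at their right end, next to the pair symbol.
    left_ok = neg_left[len(neg_left) - len(pos_left):] == pos_left
    # Right contexts are aligned at their left end, next to the pair symbol.
    right_ok = neg_right[:len(pos_right)] == pos_right
    return left_ok and right_ok


def is_disjoint(pos_set: Set[Context], neg_set: Set[Context]) -> bool:
    """True if no positive context matches any negative context."""
    return not any(matches(p, n) for p in pos_set for n in neg_set)


negative = {(".#. k ä", "{ieØeØ}:e n .#.")}
full = {(".#. k ä", "{ieØeØ}:i .#.")}      # untouched positive context
no_right = {(".#. k ä", "")}               # right context truncated away
print(is_disjoint(full, negative), is_disjoint(no_right, negative))
```

The untouched positive context is disjoint from the negative one, but once the right context is truncated away the two can no longer be told apart, even though the tuples themselves are different.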

11.10.3. Python functions of the module

A module for discovering raw two-level rules from a set of carefully chosen examples.

Examples, contexts and rules are treated as strings, without any finite-state machinery or rule compilation. Examples and contexts are space-separated sequences of pair symbols.

© Kimmo Koskenniemi, 2017-2023. Free software under the GPL 3 or later.

twol.discover.context_set_penalty(context_set)[source]
Parameters:

context_set (Set[Tuple[str, str]]) –

twol.discover.context_to_output_str(pairsym_str)[source]

Converts a pair symbol string into its surface string

Parameters:

pairsym_str (str) –

Return type:

str
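As an illustration of what such a conversion might look like, here is a minimal sketch. It assumes that the surface side of a pair symbol is the part after the colon, that a symbol without a colon stands for itself, and that the surface symbols are simply concatenated. The name surface_string is hypothetical, and how the actual function treats the zero symbol Ø or the boundary .#. is not stated here.

```python
def surface_string(pairsym_str: str) -> str:
    """Hypothetical sketch: concatenate the surface sides of the pair symbols."""
    surface = []
    for pairsym in pairsym_str.split():
        insym, colon, outsym = pairsym.partition(":")
        # "k" stands for the identity pair k:k, so it keeps its own symbol.
        surface.append(outsym if colon else insym)
    return "".join(surface)


print(surface_string("k ä {tds}:s {ieØeØ}:i"))    # käsi
print(surface_string("k ä {tds}:d {ieØeØ}:e n"))  # käden
```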

twol.discover.main()[source]
twol.discover.max_left_len(pos_context_set)[source]
Parameters:

pos_context_set (set) –

Return type:

int

twol.discover.max_right_len(pos_context_set)[source]
Parameters:

pos_context_set (set) –

Return type:

int

twol.discover.mphon_subset(pos_context_set, subset_name)[source]

Reduces a set of contexts by replacing e.g. {ij}:i with {ij}:

Parameters:
  • pos_context_set (Set[Tuple[str, str]]) – A set of positive contexts which might already be truncated and reduced.

  • subset_name (str) – Only pairs which are in definitions[subset_name] are reduced.

Return type:

Set[Tuple[str, str]]

Returns:

A modified context set where pair symbols (insym:outsym) belonging to the given subset have been reduced into (insym:).
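The reduction can be pictured with a short sketch. It is not the module's code: the function name to_mphon is hypothetical, and the subset is passed in directly as a set of pair symbols rather than looked up in definitions.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def to_mphon(context_set: Set[Context], subset: Set[str]) -> Set[Context]:
    """Hypothetical sketch: replace insym:outsym with insym: for subset members."""
    def reduce_side(side: str) -> str:
        return " ".join(
            sym.partition(":")[0] + ":" if sym in subset else sym
            for sym in side.split())

    return {(reduce_side(left), reduce_side(right))
            for left, right in context_set}


subset = {"{ieØeØ}:i", "{ieØeØ}:e"}
contexts = {("ä", "{ieØeØ}:i"), ("ä", "{ieØeØ}:e")}
print(to_mphon(contexts, subset))  # {('ä', '{ieØeØ}:')}
```

Two contexts that differ only in the surface side of a reduced pair symbol collapse into one, which is how the reduction generalises the set.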

twol.discover.overlap(set_lst, pairsym_lst)[source]

Tests whether a list of set names covers a list of pairsyms

Parameters:
  • set_lst (list[str]) – List of pair symbols or names of defined sets

  • pairsym_lst (list[str]) – List of pair symbols. If shorter than set_lst, the match fails.

Return type:

bool

Returns:

True if each pairsym in the latter list is included in a respective set name (or pairsym) in the former list
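A minimal sketch of such an element-wise test is given below. The function name covers and the contents of the definitions dictionary are hypothetical, and the sketch assumes that the two lists are aligned from their first elements. Reduced forms such as {ij}: or :s are not handled.

```python
from typing import Dict, List, Set

# Hypothetical set definitions, for illustration only.
definitions: Dict[str, Set[str]] = {
    "Vowel": {"a", "ä", "{ieØeØ}:i", "{ieØeØ}:e"},
}


def covers(set_lst: List[str], pairsym_lst: List[str]) -> bool:
    """True if each item of set_lst matches the pairsym at the same position."""
    if len(pairsym_lst) < len(set_lst):
        return False  # too few pair symbols to match every item
    for item, pairsym in zip(set_lst, pairsym_lst):
        if item == pairsym:
            continue  # the item is the same pair symbol
        if pairsym in definitions.get(item, set()):
            continue  # the item is a set name that contains the pair symbol
        return False
    return True


print(covers(["Vowel", ".#."], ["{ieØeØ}:i", ".#."]))  # True
print(covers(["Vowel", "n"], ["{ieØeØ}:i"]))           # False
```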

twol.discover.pos_neg_is_disjoint(pos_ctx_set, other_ctx_set)[source]

Tests whether a pos context set is disjoint from a negative one

Parameters:
  • pos_ctx_set (Set[Tuple[str, str]]) – A set of left and right context pairs where the contexts are represented as space-separated strings of pair symbols or set names.

  • other_ctx_set (Set[Tuple[str, str]]) – A context set to which the pos context is compared. The contexts are space-separated strings of pair symbols.

Return type:

bool

Returns:

True if the context sets are logically disjoint.

twol.discover.pos_neg_is_subset(pos_ctx_set, ctx_set)[source]

Tests whether the first context set is logically a subset of the second

Parameters:
  • pos_ctx_set (Set[Tuple[str, str]]) – A positive context set which has gone through reductions such as truncation or replacements.

  • ctx_set (Set[Tuple[str, str]]) – An intact negative context set which has not been reduced.

Return type:

bool

Returns:

True if every context in the first set matches some context in the second set.

twol.discover.print_context_set(msg, context_set)[source]
Parameters:
  • msg (str) –

  • context_set (Set[Tuple[str, str]]) –

Return type:

None

twol.discover.print_rule(result, operator)[source]

Prints one rule

Parameters:
  • result (Dict) –

  • operator (str) –

Return type:

None

twol.discover.process_results_into_rules(pairsym_lst, result_lst_lst)[source]
Parameters:
  • pairsym_lst (List[str]) –

  • result_lst_lst (List[List[Dict]]) –

twol.discover.reduce_set(set_name, pos_context_set)[source]

Reduce contexts by substituting pair symbols with set names

Parameters:
  • set_name (str) – A name of a pairsym set in definitions.

  • pos_context_set (Set[Tuple[str, str]]) – A set of contexts to which the reduction is to be applied.

Return type:

Set[Tuple[str, str]]

Returns:

A new set of contexts where every occurrence of a pairsym in definitions[set_name] has been substituted with set_name.
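The substitution can be illustrated with a short sketch. The function name substitute_set_name is hypothetical, and the members of the set are passed in directly instead of being read from definitions.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def substitute_set_name(set_name: str, members: Set[str],
                        context_set: Set[Context]) -> Set[Context]:
    """Hypothetical sketch: replace every member pair symbol with set_name."""
    def substitute(side: str) -> str:
        return " ".join(set_name if sym in members else sym
                        for sym in side.split())

    return {(substitute(left), substitute(right))
            for left, right in context_set}


vowels = {"ä", "{ieØeØ}:i", "{ieØeØ}:e"}
contexts = {(".#. k ä", "{ieØeØ}:i .#.")}
print(substitute_set_name("Vowel", vowels, contexts))
# {('.#. k Vowel', 'Vowel .#.')}
```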

twol.discover.relevant_contexts(pair_symbol)[source]

Select positive and negative contexts for a given pair-symbol

Parameters:

pair_symbol (str) – The pairsym for which the contexts are selected.

Return type:

None

Sets a global variable positive_context_set[pair_symbol] to a set of those contexts in the examples in which the pair_symbol occurs

Sets a global variable negative_context_set[pair_symbol] to a set of contexts where the input-symbol of the pair_symbol occurs with another output-symbol but so that there is no example in the example_set where the pair_symbol occurs in such a context.

twol.discover.search_reductions(agenda, pair_symbol, pos_context_set)[source]

Tests and executes context reductions according to a recipe

Parameters:
  • agenda (Deque) – Initially a recipe. Consumed and updated during the process.

  • pair_symbol (str) – The pairsym for which a rule is deduced.

  • pos_context_set (Set[Tuple[str, str]]) – A set of contexts, i.e. pairs (left_context, right_context) where the contexts are space-separated strings of pairsyms. The contexts are reduced during the process.

Return type:

Set[Tuple[str, str]]
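The general idea of testing and executing reductions can be sketched as a greedy loop: each candidate reduction is applied to the positive contexts and kept only if the result stays disjoint from the negative contexts. Everything in this sketch is an assumption made for illustration: the format of the agenda (a deque of callables), the helper names keep_left, keep_right, is_disjoint and greedy_reduce, and the explicit negative set argument. It also does not model how the real agenda is updated during the process.

```python
from collections import deque
from functools import partial
from typing import Callable, Deque, Set, Tuple

Context = Tuple[str, str]
ContextSet = Set[Context]


def keep_left(n: int, ctxs: ContextSet) -> ContextSet:
    """Keep only the n left-context symbols nearest to the pair symbol."""
    result = set()
    for left, right in ctxs:
        syms = left.split()
        result.add((" ".join(syms[max(0, len(syms) - n):]), right))
    return result


def keep_right(n: int, ctxs: ContextSet) -> ContextSet:
    """Keep only the n right-context symbols nearest to the pair symbol."""
    return {(left, " ".join(right.split()[:n])) for left, right in ctxs}


def is_disjoint(pos: ContextSet, neg: ContextSet) -> bool:
    """True if no (possibly truncated) positive context fits a negative one."""
    for p_left, p_right in pos:
        pl, pr = p_left.split(), p_right.split()
        for n_left, n_right in neg:
            nl, nr = n_left.split(), n_right.split()
            if (len(pl) <= len(nl) and len(pr) <= len(nr)
                    and nl[len(nl) - len(pl):] == pl and nr[:len(pr)] == pr):
                return False
    return True


def greedy_reduce(agenda: Deque[Callable[[ContextSet], ContextSet]],
                  pos: ContextSet, neg: ContextSet) -> ContextSet:
    """Apply each step in turn; keep a result only if it stays disjoint."""
    while agenda:
        step = agenda.popleft()
        candidate = step(pos)
        if is_disjoint(candidate, neg):
            pos = candidate
    return pos


pos = {(".#. k ä", "{ieØeØ}:i .#.")}
neg = {(".#. k ä", "{ieØeØ}:e n .#.")}
agenda = deque([partial(keep_left, 0), partial(keep_right, 1),
                partial(keep_right, 0)])
print(greedy_reduce(agenda, pos, neg))  # {('', '{ieØeØ}:i')}
```

In this run the first two truncations are accepted, while the last one is rejected because an empty context on both sides would also match the negative context.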

twol.discover.surface_subset(pos_context_set, subset_name)[source]

Reduce the contexts by substituting insym:outsym pairs with :outsym

Parameters:
  • pos_context_set (Set[Tuple[str, str]]) – The set of contexts to be reduced

  • subset_name (str) – A defined subset whose pairsyms are considered for reduction.

Return type:

Set[Tuple[str, str]]

Returns:

A new context set where all occurrences of pairsyms (insym:outsym) in definitions[subset_name] are replaced with (:outsym), e.g. {tds}:s is reduced into :s.
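A short sketch of this kind of reduction follows. The function name to_surface is hypothetical, and the subset is given directly as a set of pair symbols rather than being looked up in definitions.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def to_surface(context_set: Set[Context], subset: Set[str]) -> Set[Context]:
    """Hypothetical sketch: replace insym:outsym with :outsym for subset members."""
    def reduce_side(side: str) -> str:
        return " ".join(
            ":" + sym.partition(":")[2] if sym in subset else sym
            for sym in side.split())

    return {(reduce_side(left), reduce_side(right))
            for left, right in context_set}


contexts = {(".#. k ä", "{ieØeØ}:i .#.")}
print(to_surface(contexts, {"{ieØeØ}:i"}))  # {('.#. k ä', ':i .#.')}
```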

twol.discover.truncate_left(syms_to_remain, context_set)[source]

Truncate the left contexts

Parameters:
  • syms_to_remain (int) – A minimum number of pair symbols to remain in the left context.

  • context_set (Set[Tuple[str, str]]) – A set of (positive) contexts to be truncated.

Return type:

Set[Tuple[str, str]]

Returns:

A new context set where left contexts are truncated

twol.discover.truncate_right(syms_to_remain, context_set)[source]

Truncate the right contexts

Parameters:
  • syms_to_remain (int) – A minimum number of pair symbols to remain in the right context.

  • context_set (Set[Tuple[str, str]]) – A set of (positive) contexts to be truncated.

Return type:

Set[Tuple[str, str]]

Returns:

A new context set where right contexts are truncated
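Both kinds of truncation can be pictured with the sketch below. It assumes that truncation keeps the symbols nearest to the pair symbol, i.e. the last symbols of a left context and the first symbols of a right context, and that syms_to_remain is the number of symbols kept. The names keep_left and keep_right are hypothetical.

```python
from typing import Set, Tuple

Context = Tuple[str, str]


def keep_left(syms_to_remain: int, context_set: Set[Context]) -> Set[Context]:
    """Keep at most syms_to_remain symbols at the right end of each left context."""
    result = set()
    for left, right in context_set:
        syms = left.split()
        start = max(0, len(syms) - syms_to_remain)
        result.add((" ".join(syms[start:]), right))
    return result


def keep_right(syms_to_remain: int, context_set: Set[Context]) -> Set[Context]:
    """Keep at most syms_to_remain symbols at the left end of each right context."""
    return {(left, " ".join(right.split()[:syms_to_remain]))
            for left, right in context_set}


contexts = {(".#. k ä", "{ieØeØ}:i .#."), (".#. t ä", "{ieØeØ}:i .#.")}
print(keep_left(1, contexts))  # {('ä', '{ieØeØ}:i .#.')}
```

Because the result is a set, contexts that become identical after truncation merge into one, so a truncated context set is often smaller than the original.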