===========================================================================

sphcorp API specification

sphcorp.py v1.2.0
14 April 2011

Athanassios Protopapas                                  
Institute for Language & Speech Processing              

===========================================================================

CorpProc usage via COM:

    (first run sphcorp.py to register COM objects with system server)

    import win32com.client
    cp = win32com.client.Dispatch("Python.SPFproc")
    cp.self_init()
    # then call cp functions below

---------------------------------------------------------------------------

CorpProc usage as module -- easy:

    import sphcorp
    cp = sphcorp.CorpProc()
    cp.self_init()
    # then call cp functions below

---------------------------------------------------------------------------

CorpProc usage as module -- with access to variables & corpora:

    import sphcorp
    glb=sphcorp.Globals()
    crp=sphcorp.Corpora(glb) # optional parameter, cs="CL" to specify corpus
    cp=sphcorp.CorpProc()
    cp.init(glb,crp) # not self_init()!
    corpus=cp.crp.selected_corpus
    # check out additional attributes in Globals, Corpora, CorpProc

---------------------------------------------------------------------------

CorpProc method usage without a corpus (no neighborhoods, word lookup etc.):

    cp=sphcorp.CorpProc(no_corpus=True)
    cp.self_init()
    # then call cp functions below
    # bigram & syllable probabilities are based on C corpus

===========================================================================

CorpProc methods:
    
self_init()
    initializes global variables, corpora etc, when not needed externally

set_corpus(corp)
    corp is an integer (offset of desired corpus in list of available corpora)
        or a string (denoting the desired corpus;
        available corpora include "clean","clean_ns","large_lc","large_ns")
    sets the currently selected corpus for all processing
    no result is returned

preproc(word)
    word is a phonological string
    the result is a phonological string with standard preprocessing
        simplifications, removing nasals before voiced stops, M->m, mpt->mp

isgreek(s)
    s is an orthographic string
    the result is True if every character of s is a Greek letter; False otherwise

isphone(s)
    s is a phonological string
    the result is True if every character of s is a Greek phone; False otherwise

total_syllables(corplist=_SPEL_,tokens)
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    tokens is True for syllable token counts (default) or False for syllable type counts
    the result is the total number of syllable tokens (or syllable types, if
        tokens is False) of the corplist type in the currently selected corpus
        (to be used in frequency normalization)

total_bigrams(corplist=_SPEL_,tokens)
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    tokens is True for bigram token counts (default) or False for bigram type counts
    the result is the total number of bigram tokens (or bigram types, if
        tokens is False) of the corplist type in the currently selected corpus
        (to be used in frequency normalization)

total_words()
    the result is a pair (tuple) with the total number of word types and tokens
        in the current selected corpus (to be used in frequency normalization)
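
    EXAMPLE (illustrative sketch, not sphcorp code): one common way to use
    these totals for frequency normalization is to express a raw count as
    occurrences per million tokens. The helper name and the numbers below
    are hypothetical; in practice the count would come from get_fre() and
    the total from total_words().

```python
def per_million(count, total_tokens):
    """Convert a raw occurrence count to occurrences per million tokens."""
    return count * 1_000_000 / total_tokens

# Hypothetical values: a word seen 50 times in a 2,000,000-token corpus.
print(per_million(50, 2_000_000))   # 25.0
```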

CV_types(sylset)
    sylset is a phonological string or a list of syllables
    the result is a corresponding string or list of strings containing
        only C and V corresponding to the consonants and vowels of sylset
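
    EXAMPLE (illustrative sketch, not sphcorp code): the C/V mapping
    described above could look like this. The vowel set is a hypothetical
    Latin-letter stand-in; the actual phone inventory is defined in sphcorp.

```python
# Hypothetical vowel inventory, for illustration only.
VOWELS = set("aeiou")

def cv_types_sketch(sylset):
    """Map a phonological string, or a list of syllables, to C/V patterns."""
    def pattern(s):
        return "".join("V" if ch in VOWELS else "C" for ch in s)
    if isinstance(sylset, str):
        return pattern(sylset)
    return [pattern(s) for s in sylset]

print(cv_types_sketch("strata"))         # CCCVCV
print(cv_types_sketch(["stra", "ta"]))   # ['CCCV', 'CV']
```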
    
get_pho(spe)
    spe is an orthographic string
    the result is a phonological string if spe exists in the current corpus
    the result is 0 (zero) if spe does not occur in the corpus
    the result is None in case of internal mismatch error
    
get_fre(spe)
    spe is an orthographic string
    the result is the frequency of spe in the current corpus
    the result is 0 (zero) if spe does not occur in the corpus
    the result is None in case of internal mismatch error

get_phofre(spe)
    spe is an orthographic string
    the result is a tuple with three members (orthographic string,
        phonological string, frequency) if spe exists in the current corpus
    the result is 0 (zero) if spe does not occur in the corpus
    the result is None in case of internal mismatch error

align_phospe(pho,spe)
    pho is a phonological string
    spe is an orthographic string
    the result is a list of pairs; each pair is a list of one phoneme and
        the corresponding grapheme (ph,sp); taken in sequence, the phonemes
        and graphemes in the pairs make up the input phonological and
        orthographic strings, respectively.
        there is no indication if the phonological and orthographic strings
        fail to match at the phoneme-grapheme level; use check_phospe to verify

check_phospe(pho,spe,trans)
    pho is a phonological string
    spe is an orthographic string
    trans is a list of pairs (lists) as returned by align_phospe
    the result is True if trans fully accounts for pho and spe in alignment
    the result is False otherwise
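
    EXAMPLE (illustrative sketch, not sphcorp code): "fully accounts for"
    can be read as: the ph parts concatenate to pho and the sp parts
    concatenate to spe. The function name and the Latin-script alignment
    below are hypothetical, not real sphcorp output.

```python
def check_alignment(pho, spe, trans):
    """True if the (ph, sp) pairs in trans spell out pho and spe exactly."""
    return ("".join(ph for ph, sp in trans) == pho and
            "".join(sp for ph, sp in trans) == spe)

# Hypothetical alignment of a toy word.
trans = [["f", "ph"], ["o", "o"], ["n", "n"]]
print(check_alignment("fon", "phon", trans))   # True
print(check_alignment("fon", "fon", trans))    # False
```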

index_phospe(phospe,trans)
    phospe is a (pho,spe) pair, ie a list composed of one phonological string
        and one orthographic string
    trans is a list of pairs (lists) as returned by align_phospe (normally
        called with the pho and spe in phospe)
    the result is a list of number pairs (lists), matching trans in length;
        the first number in each pair is the index of the ph part of the
        corresponding trans member within pho in phospe; and the second number
        is the index of the sp part of the corresponding trans member within spe
        in phospe; in other words this is a list of indices of the grapheme and
        phoneme onsets within the orthographic and phonological strings.
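
    EXAMPLE (illustrative sketch, not sphcorp code): the onset indices can
    be obtained by accumulating the lengths of the ph and sp parts. The
    function name and the toy alignment are hypothetical.

```python
def onset_indices(trans):
    """Return [pho_index, spe_index] onsets for each (ph, sp) pair in trans."""
    result, i, j = [], 0, 0
    for ph, sp in trans:
        result.append([i, j])
        i += len(ph)   # next phoneme starts after this one
        j += len(sp)   # next grapheme starts after this one
    return result

# Hypothetical alignment of pho "fon" with spe "phon".
trans = [["f", "ph"], ["o", "o"], ["n", "n"]]
print(onset_indices(trans))   # [[0, 0], [1, 2], [2, 3]]
```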

unstress(item,corplist)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    the result is the input string with all stressed letters/phonemes
        converted to their unstressed counterparts

count_stress(word,corplist)
    word is an orthographic or phonological (default) string
    corplist is _SPEL_ (0) or _PHON_ (1; default) (keyword argument)
    the result is the number of stressed vowels in word
    
single_stress(word,corplist)
    word is an orthographic or phonological (default) string
    corplist is _SPEL_ (0) or _PHON_ (1; default) (keyword argument)
    if word is a double-stressed word then the result is the corresponding
        single-stressed word (enclitic stress is removed); otherwise the
        result is word

stressed_syllable(word)
    word is a phonological string
    the result is the index of the stressed syllable (1=final, 2=penult,
        3=antepenult); if there are two stressed syllables the result is
        a list of indexes (normally [1,3])
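
    EXAMPLE (illustrative sketch, not sphcorp code): counting syllables
    from the end of the word. Two assumptions are made for illustration
    only: vowels are the Latin letters below, and a stressed vowel is
    written in uppercase. sphcorp uses its own phone symbols, and this
    sketch returns an empty list when no stress is found.

```python
VOWELS = "aeiou"   # hypothetical inventory; uppercase marks stress here

def stressed_syllable_sketch(word):
    """Index of the stressed syllable(s), counted from the end of the word."""
    vowels = [ch for ch in word if ch.lower() in VOWELS]
    # One syllable per vowel: 1 = final, 2 = penult, 3 = antepenult.
    hits = sorted(len(vowels) - k for k, v in enumerate(vowels) if v.isupper())
    return hits[0] if len(hits) == 1 else hits

print(stressed_syllable_sketch("kalimEra"))    # 2
print(stressed_syllable_sketch("AnthropOs"))   # [1, 3]
```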

syllable_count(word)
    word is a phonological string
    the result is the number of vowels in word
    
syllabify(line,star,debug)
    line is a phonological string that can include spaces (multi-word line)
    star is True (default) or False; if True, illegal syllables are preceded
        by an asterisk -- easy to detect but causing mismatches when join()'ed
    debug is True or False (default); when True, causes extensive information
        about the syllables to be returned (onsets, codas, misses etc.)
    the result is a list of phonological string fragments corresponding
        to the syllables of line
    star and debug are keyword arguments

syllable_alignment(syllables,phospe)
    syllables is a list of phonological strings
    phospe is a (pho,spe) pair, ie a list composed of one phonological string
        and one orthographic string
    the result is a list of (ph,sp) pairs, each pair composed of one
        phonological fragment (string) and one orthographic fragment (string),
        such that the phonological fragments match syllables and in sequence
        make up pho, while the orthographic fragments in sequence make up spe

syllable_freq(syllables,tokens)
    syllables is a list of phonological strings or a list of (ph,sp) pairs,
        each pair composed of a phonological and an orthographic string
    tokens is True for syllable token counts (default) or False for syllable type counts
    the result is a list of pairs or a list of triplets; if only phonological
        strings were passed into the function, then the result is a list of
        pairs, each pair composed of the frequency of the phonological syllable
        and the frequency of the corresponding syllable type (eg CV, CCV...);
        if (ph,sp) pairs were passed into the function, then the result is a
        list of triplets, the third member of each being the frequency of
        the corresponding orthographic syllable

syllable_prob(sylfreq)
    sylfreq is a list of triplets, composed of frequencies of
        orthographic syllable (or None), phonological syllable, and syllable type
    the result is a pair or triplet (list), respectively, with the corresponding
        log sums of the individual syllable frequencies. If syllables with zero
        frequency are encountered, they are given a 1.0/MINRATIO nominal frequency

bigram_exist(word,mode,lowcase,unstress,tokens)
    word is an orthographic (default) or phonological string
    mode is _SPEL_ (0; default) or _PHON_ (1)
    lowcase is True or False (default); applies only to orthographic strings
    unstress is True or False (default)
    tokens is True for bigram token counts (default) or False for bigram type counts
    the result is True if all bigrams in word occur in the current corpus
    the result is False if one or more bigrams in word have zero occurrences
    if either lowcase or unstress is True then the search is done on bigrams
        counted after removing stress diacritics and converting to lowercase
    mode, lowcase, and unstress are keyword arguments

bigram_prob(word,mode,lowcase,unstress,ends,tokens)
    word is an orthographic (default) or phonological string
    mode is _SPEL_ (0; default) or _PHON_ (1)
    lowcase is True or False (default); applies only to orthographic strings
    unstress is True or False (default)
    ends is True (default) or False
    tokens is True for bigram token counts (default) or False for bigram type counts
    the result is a pair (tuple) composed of the log sum of the frequencies
        (counts) of all bigrams in word (including onset/offset bigrams
        with space) and of the cumulative probability of the bigram sequence
        based on the conditional probabilities of bigrams on initial letters
    if ends is False then onset/offset bigrams with space are excluded from the
        log sum (not from the cumulative probability!)
    if either lowcase or unstress is True then the search is done on bigrams
        counted after removing stress diacritics and converting to lowercase
    mode, lowcase, unstress, and ends are keyword arguments
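
    EXAMPLE (illustrative sketch, not sphcorp code): the two measures on a
    toy lexicon. The conditional probability of a bigram is taken here as
    its count divided by the summed count of all bigrams with the same
    first character. Returning the plain product (rather than its log),
    the natural logarithm, the data, and all names are assumptions. The
    sketch assumes every bigram of the word occurs (compare bigram_exist).

```python
import math
from collections import Counter

LEXICON = {"ab": 3, "abb": 1, "ba": 2}   # toy word -> token frequency

def bigrams(word, ends=True):
    """Bigrams of word, optionally including onset/offset bigrams with space."""
    s = " " + word + " " if ends else word
    return [s[i:i + 2] for i in range(len(s) - 1)]

counts = Counter()                 # bigram token counts
for w, f in LEXICON.items():
    for bg in bigrams(w):
        counts[bg] += f

first = Counter()                  # summed counts per initial character
for bg, c in counts.items():
    first[bg[0]] += c

def bigram_scores(word, ends=True):
    log_sum = sum(math.log(counts[bg]) for bg in bigrams(word, ends))
    prob = 1.0
    for bg in bigrams(word):       # edge bigrams always enter the probability
        prob *= counts[bg] / first[bg[0]]
    return log_sum, prob

print(bigram_scores("ab"))
print(bigram_scores("ab", ends=False))
```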

unique(word,mode,unstress,nearest)
    word is an orthographic (default) or phonological string
    mode is _SPEL_ (0; default) or _PHON_ (1)
    unstress is True or False (default)
    nearest is True or False (default)
    The result is the serial letter position of uniqueness in the word, that is,
        the first letter not matching any other word in the sortlist, left-to-right.
    If unstress is True, the word is destressed and a no-stress list is used
    If nearest is True, the last-diverging word is also returned in a tuple
    ATTENTION: Python convention, first letter position is 0!
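
    EXAMPLE (illustrative sketch, not sphcorp code): the uniqueness
    position is the length of the longest prefix shared with any other
    word, using 0-based positions as noted above. The word list and names
    are hypothetical; ties for the last-diverging word go to the later
    list entry in this sketch.

```python
def common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def uniqueness_point(word, lexicon, nearest=False):
    position, last_word = 0, None
    for other in lexicon:
        if other == word:
            continue
        n = common_prefix_len(word, other)
        if n >= position:
            position, last_word = n, other
    return (position, last_word) if nearest else position

LEXICON = ["kala", "kalos", "kapa", "mera"]   # toy data
print(uniqueness_point("kalos", LEXICON))                 # 3
print(uniqueness_point("kalos", LEXICON, nearest=True))   # (3, 'kala')
```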

find(item,corplist,unstress)
    item is an orthographic (default) or phonological string; add asterisks to match
        freely at different parts of the word (e.g. 'ka*' finds words beginning with ka)
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument) 
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number. The orthographic
        or phonological string (depending on corplist) is a word that contains item.
        The third element (number) in the triplet is the number of occurrences of
        spel in the current corpus.

neighbors(item,corplist,unstress,types)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    unstress is True or False (default) (keyword argument)
    types is a string containing any of the letters R, D, I, and T; e.g. "RT" (default "R")
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number. The orthographic
        or phonological string (depending on corplist) is a neighbor of item. 
        By default, only Replacement neighbors are considered, i.e., each neighbor
        is equal in length to item and differs from it by a single character
        (letter or phoneme, respectively). The third element (number) in the triplet
        is the number of occurrences of spel in the current corpus.
        Depending on types, Replacement, Deletion, Insertion, and/or Transposition
        neighbors may be included, i.e., words with one character changed, missing, 
        added, or swapped with the following one, respectively.
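
    EXAMPLE (illustrative sketch, not sphcorp code): the four neighbor
    types expressed as a test on two strings. The word list and function
    names are hypothetical, and the sketch works on bare words rather than
    on (spel,phon,freq) entries.

```python
def is_neighbor(item, word, types="R"):
    """True if word is an R, D, I, or T neighbor of item (per types)."""
    if item == word:
        return False
    if len(item) == len(word):
        diff = [i for i in range(len(item)) if item[i] != word[i]]
        if "R" in types and len(diff) == 1:
            return True        # one character changed
        return ("T" in types and len(diff) == 2 and diff[1] == diff[0] + 1
                and item[diff[0]] == word[diff[1]]
                and item[diff[1]] == word[diff[0]])   # adjacent swap
    if "D" in types and len(word) == len(item) - 1:
        # word is item with one character missing
        return any(item[:i] + item[i + 1:] == word for i in range(len(item)))
    if "I" in types and len(word) == len(item) + 1:
        # word is item with one character added
        return any(word[:i] + word[i + 1:] == item for i in range(len(word)))
    return False

LEXICON = ["kala", "kapa", "kal", "kalas", "akla", "mera"]   # toy data

def neighbors_sketch(item, types="R"):
    return [w for w in LEXICON if is_neighbor(item, w, types)]

print(neighbors_sketch("kala"))           # ['kapa']
print(neighbors_sketch("kala", "RDIT"))   # ['kapa', 'kal', 'kalas', 'akla']
```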

levenshtein_distance(item,corplist,unstress,N,minF,_ins,_del,_sub,_tra,Nlist)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    unstress is True or False
    N is the number of nearest neighbors to be considered in the mean distance metric
        (defaults to 20, per Yarkoni et al.)
    minF is the minimum frequency (occurrence count) of lexicon items to consider
        (defaults to 1, i.e., to include the entire lexicon)
    _ins,_del,_sub,_tra are the costs associated with insertion, deletion, substitution,
        and transposition of letters (phones), respectively; default to 1, 1, 1, 2
    Nlist controls output: If False (default), the result is a single number, the mean
        distance of the N items. If True, the result is a tuple containing the list
        of N items, as 2-member tuples of (item,distance)
    NOTE: Calculates mean Levenshtein distance for N=20 nearest neighbors
        following Yarkoni's OLD20 / PLD20 indices; this function is REALLY SLOW
        and should not be used for more than a couple of items; use leven() instead,
        if possible (i.e., if you are on a Windows machine and you have leven.exe)
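
    EXAMPLE (illustrative sketch, not sphcorp code): a weighted edit
    distance with adjacent transposition (the "optimal string alignment"
    variant) and the mean over the N nearest lexicon items. Whether
    sphcorp uses this exact variant, and whether it excludes the item
    itself from the lexicon, are assumptions; all names and data are
    hypothetical.

```python
def edit_distance(a, b, ins=1, dele=1, sub=1, tra=2):
    """Weighted edit distance from a to b with adjacent transpositions."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * dele
    for j in range(1, cols):
        d[0][j] = j * ins
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,       # delete from a
                          d[i][j - 1] + ins,        # insert into a
                          d[i - 1][j - 1] + cost)   # match or substitute
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + tra)
    return d[-1][-1]

def mean_nearest_distance(item, lexicon, n=20, **costs):
    """Mean distance from item to its n nearest lexicon entries."""
    dists = sorted(edit_distance(item, w, **costs) for w in lexicon if w != item)
    nearest = dists[:n]
    return sum(nearest) / len(nearest)

LEXICON = ["kala", "kapa", "kal", "mera"]   # toy data
print(edit_distance("kala", "akla"))                   # 2
print(mean_nearest_distance("kala", LEXICON, n=2))     # 1.0
```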

leven(item,corplist,unstress,N,minF,_ins,_del,_sub,_tra,Nlist)
    item is an orthographic (default) or phonological string or a list of strings
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    unstress is True or False (default)
    N is the number of nearest neighbors to be considered in the mean distance metric
        (defaults to 20, per Yarkoni et al.)
    minF is the minimum frequency (occurrence count) of lexicon items to consider
        (defaults to 1, i.e., to include the entire lexicon)
    _ins,_del,_sub,_tra are the costs associated with insertion, deletion, substitution,
        and transposition of letters (phones), respectively; default to 1, 1, 1, 2
    the result is the mean distance or list of mean distances (depending on item);
        if Nlist is True then the result is a list of (string,float) pairs
        including the 20 nearest items and corresponding distances (no mean distance)
    NOTE: Calculates mean Levenshtein distance for N=20 nearest neighbors following
        Yarkoni's OLD20 / PLD20 indices; the same is done by levenshtein_distance()
        but this one is much faster because it runs an external highly optimized .exe
        (if you have it in the current working directory; Windows only)

syllabic_neighbors(item,unstress)
    item is a phonological string
    unstress is True or False (default) (keyword argument)
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number.
        The phonological string shares its first syllable with item and
        is a neighbor of item, i.e., it is equal in length to item and
        differs from it by a single character (phoneme). The orthographic
        string is the corresponding spelling and the third element
        (number) in the triplet is the number of occurrences of spel
        in the current corpus.

cohort(item,minlen,corplist,unstress,reverse)
    item is an orthographic or phonological (default) string
    minlen is the length of the common onset among the cohort group
    corplist is _SPEL_ (0) or _PHON_ (1; default)
    unstress is True or False (default) 
    reverse is False (default) or True; if True then matching is done at word
        endings rather than beginnings so the result is not a cohort group
        but can be a suffix or rhyme group depending on the matching sequence
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number. The orthographic
        or phonological string (depending on corplist) matches item in the first
        (if reverse is False) or last (if reverse is True) <minlen> characters
        (letters or phonemes). The third element (number) in the triplet
        is the number of occurrences of spel in the current corpus.
    minlen, corplist, unstress, and reverse are keyword arguments
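
    EXAMPLE (illustrative sketch, not sphcorp code): onset matching versus
    ending matching on a toy word list. The names and data are
    hypothetical, minlen is assumed to be at least 1, and the sketch
    returns bare words (including item itself) rather than
    (spel,phon,freq) entries.

```python
def cohort_sketch(item, lexicon, minlen, reverse=False):
    """Words sharing the first (or, if reverse, last) minlen characters."""
    if reverse:
        key = item[-minlen:]
        return [w for w in lexicon if w.endswith(key)]
    key = item[:minlen]
    return [w for w in lexicon if w.startswith(key)]

LEXICON = ["kala", "kapa", "mera", "pala"]   # toy data
print(cohort_sketch("kala", LEXICON, 2))                 # ['kala', 'kapa']
print(cohort_sketch("kala", LEXICON, 3, reverse=True))   # ['kala', 'pala']
```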
    

stress_neighbors(item,corplist)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0) or _PHON_ (1; default) (keyword argument)
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number. The
        orthographic or phonological string (depending on corplist)
        matches item from the stressed grapheme or phoneme, respectively,
        through the end. The third element (number) in the triplet
        is the number of occurrences of spel in the current corpus.

spe2pho(spel)
    spel is a single orthographic word or a list of orthographic words
    the result is the corresponding phonological word or list of phon. words

setup_GPC_rules(exclude_optrules,nonoptional)
    exclude_optrules is True or False (default); when True, rules marked as
        optional are not loaded into the ruleset, so they can never apply
    nonoptional is True or False (default); when True, rules marked as optional
        are considered obligatory, so they always apply
    the result is a ruleset that can be used by gpc
    exclude_optrules and nonoptional are keyword arguments

gpc(spel,ruleset)
    spel is an orthographic string
    ruleset is a set of GPC rules as returned by setup_GPC_rules
    the result is a phonological string (if there is a single possible outcome,
        either because all rules are obligatory or because no optional rule
        applies for this word) or a list of phonological strings (if there are
        multiple alternatives depending on the application of optional rules)