===========================================================================
sphcorp API specification
sphcorp.py v1.2.0
14 April 2011
Athanassios Protopapas
Institute for Language & Speech Processing
===========================================================================

CorpProc usage via COM:
(first run sphcorp.py to register COM objects with system server)

    import win32com.client
    cp = win32com.client.Dispatch("Python.SPFproc")
    cp.self_init()
    # then call cp functions below

---------------------------------------------------------------------------

CorpProc usage as module -- easy:

    import sphcorp
    cp = sphcorp.CorpProc()
    cp.self_init()
    # then call cp functions below

---------------------------------------------------------------------------

CorpProc usage as module -- with access to variables & corpora:

    import sphcorp
    glb=sphcorp.Globals()
    crp=sphcorp.Corpora(glb) # optional parameter, cs="CL" to specify corpus
    cp=sphcorp.CorpProc()
    cp.init(glb,crp) # not self_init()!
    corpus=cp.crp.selected_corpus
    # check out additional attributes in Globals, Corpora, CorpProc

---------------------------------------------------------------------------

CorpProc method usage without a corpus (no neighborhoods, word lookup etc.):

    cp=sphcorp.CorpProc(no_corpus=True)
    cp.self_init()
    # then call cp functions below
    # bigram & syllable probabilities are based on C corpus

===========================================================================

CorpProc methods:

self_init()
    initializes global variables, corpora etc, when not needed externally

set_corpus(corp)
    corp is an integer (offset of desired corpus in list of available
        corpora) or a string (denoting the desired corpus; available
        corpora include "clean","clean_ns","large_lc","large_ns")
    sets the currently selected corpus for all processing
    no result is returned

preproc(word)
    word is a phonological string
    the result is a phonological string with standard preprocessing
        simplifications, removing nasals before voiced stops, M->m,
        mpt->mp

isgreek(s)
    s is an orthographic string
    the result is True if every character of s is a Greek letter;
        False otherwise

isphone(s)
    s is a phonological string
    the result is True if every character of s is a Greek phone;
        False otherwise

total_syllables(corplist=_SPEL_,tokens)
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    tokens is True for syllable token counts (default) or False for
        syllable type counts
    the result is the total number of syllable tokens of the corplist type
        in the currently selected corpus (to be used in frequency
        normalization)

total_bigrams(corplist=_SPEL_,tokens)
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    tokens is True for bigram token counts (default) or False for
        bigram type counts
    the result is the total number of bigram tokens of the corplist type
        in the currently selected corpus (to be used in frequency
        normalization)

total_words()
    the result is a pair (tuple) with the total number of word types and
        tokens in the currently selected corpus (to be used in frequency
        normalization)

CV_types(sylset)
    sylset is a phonological string or a list of syllables
    the result is a corresponding string or list of strings containing
        only C and V corresponding to the consonants and vowels of sylset

get_pho(spe)
    spe is an orthographic string
    the result is a phonological string if spe exists in the current corpus
    the result is 0 (zero) if spe does not occur in the corpus
    the result is None in case of internal mismatch error

get_fre(spe)
    spe is an orthographic string
    the result is the frequency of spe in the current corpus
    the result is 0 (zero) if spe does not occur in the corpus
    the result is None in case of internal mismatch error

get_phofre(spe)
    spe is an orthographic string
    the result is a tuple with three members (orthographic string,
        phonological string, frequency) if spe exists in the current corpus
    the result is 0 (zero) if spe does not occur in the corpus
    the result is None in case of internal mismatch error

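Note on the lookup return values: 0 and None are both false in a boolean
test, so a bare "if not result" cannot tell a word that is absent from the
corpus from an internal mismatch error. The sketch below shows one way a
caller might keep the three outcomes apart. classify_lookup is a
hypothetical helper, not part of sphcorp, and the sample values are made up
for illustration rather than taken from any corpus.

```python
def classify_lookup(result):
    """Classify a value returned by get_pho, get_fre, or get_phofre.

    Hypothetical helper (not part of sphcorp); it relies only on the
    return conventions documented for those three methods.
    """
    if result is None:
        return "error"      # internal mismatch error
    if result == 0:
        return "missing"    # item does not occur in the corpus
    return "found"          # string, frequency, or 3-member tuple


# Made-up sample values standing in for real lookup results:
samples = [None, 0, "kalos", ("word", "wRd", 42)]
statuses = [classify_lookup(r) for r in samples]
print(statuses)
```

With a live CorpProc instance the call would presumably look like
classify_lookup(cp.get_pho(word)).
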
align_phospe(pho,spe)
    pho is a phonological string
    spe is an orthographic string
    the result is a list of pairs; each pair is a list of one phoneme and
        the corresponding grapheme (ph,sp); the phonemes and graphemes in
        the pairs make up the input phonological and orthographic strings
        in sequence.
    there is no indication if the phonological and orthographic strings
        fail to match at the phoneme-grapheme level; use check_phospe
        to verify

check_phospe(pho,spe,trans)
    pho is a phonological string
    spe is an orthographic string
    trans is a list of pairs (lists) as returned by align_phospe
    the result is True if trans fully accounts for pho and spe in alignment
    the result is False otherwise

index_phospe(phospe,trans)
    phospe is a (pho,spe) pair, ie a list composed of one phonological
        string and one orthographic string
    trans is a list of pairs (lists) as returned by align_phospe
        (normally called with the pho and spe in phospe)
    the result is a list of number pairs (lists), matching phospe in
        length; the first number in each pair is the index of the ph part
        of the corresponding trans member within pho in phospe; and the
        second number is the index of the sp part of the corresponding
        trans member within spe in phospe; in other words this is a list
        of indices of the grapheme and phoneme onsets within the
        orthographic and phonological strings.

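Illustration of the alignment bookkeeping described above: a minimal sketch,
assuming that the fragments in trans are contiguous and in order. The names
onset_indices and accounts_for are hypothetical and only mimic the documented
behaviour of index_phospe and check_phospe; the actual sphcorp code may work
differently. The sample strings are Latin placeholders, not sphcorp's phone
notation.

```python
def onset_indices(trans):
    """Return [ph_index, sp_index] onsets for each (ph, sp) pair in trans."""
    indices = []
    ph_pos = 0
    sp_pos = 0
    for ph, sp in trans:
        indices.append([ph_pos, sp_pos])
        ph_pos += len(ph)   # next phoneme starts after this one
        sp_pos += len(sp)   # next grapheme starts after this one
    return indices


def accounts_for(pho, spe, trans):
    """True if the fragments in trans spell out pho and spe in sequence."""
    joined_pho = "".join(ph for ph, _ in trans)
    joined_spe = "".join(sp for _, sp in trans)
    return joined_pho == pho and joined_spe == spe


# Placeholder alignment, invented for this example:
trans = [["f", "ph"], ["o", "o"], ["n", "ne"]]
onsets = onset_indices(trans)
complete = accounts_for("fon", "phone", trans)
print(onsets, complete)
```

Multi-letter graphemes are why the two index columns drift apart: each
grapheme onset advances by the length of the preceding grapheme, not by one.
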
unstress(item,corplist)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    the result is the input string with all stressed letters/phonemes
        converted to their unstressed counterparts

count_stress(word,corplist)
    word is an orthographic or phonological (default) string
    corplist is _SPEL_ (0) or _PHON_ (1; default) (keyword argument)
    the result is the number of stressed vowels in word

single_stress(word,corplist)
    word is an orthographic or phonological (default) string
    corplist is _SPEL_ (0) or _PHON_ (1; default) (keyword argument)
    if word is a double-stressed word then the result is the corresponding
        single-stressed word (enclitic stress is removed); otherwise the
        result is word

stressed_syllable(word)
    word is a phonological string
    the result is the index of the stressed syllable (1=final, 2=penult,
        3=antepenult); if there are two stressed syllables the result is
        a list of indexes (normally [1,3])

syllable_count(word)
    word is a phonological string
    the result is the number of vowels in word

syllabify(line,star,debug)
    line is a phonological string that can include spaces (multi-word line)
    star is True (default) or False; if True, illegal syllables are
        preceded by an asterisk -- easy to detect but causing mismatches
        when join()'ed
    debug is True or False (default); when True, causes extensive
        information about the syllables to be returned (onsets, codas,
        misses etc.)
    the result is a list of phonological string fragments corresponding
        to the syllables of line
    star and debug are keyword arguments

syllable_alignment(syllables,phospe)
    syllables is a list of phonological strings
    phospe is a (pho,spe) pair, ie a list composed of one phonological
        string and one orthographic string
    the result is a list of (ph,sp) pairs, each pair composed of one
        phonological fragment (string) and one orthographic fragment
        (string), such that the phonological fragments match syllables
        and in sequence make up pho, while the orthographic fragments in
        sequence make up spe

syllable_freq(syllables,tokens)
    syllables is a list of phonological strings or a list of (ph,sp)
        pairs, each pair composed of a phonological and an orthographic
        string
    tokens is True for syllable token counts (default) or False for
        syllable type counts
    the result is a list of pairs or a list of triplets; if only
        phonological strings were passed into the function, then the
        result is a list of pairs, each pair composed of the frequency of
        the phonological syllable and the frequency of the corresponding
        syllable type (eg CV, CCV...); if (ph,sp) pairs were passed into
        the function, then the result is a list of triplets, the third
        member of each being the frequency of the corresponding
        orthographic syllable

syllable_prob(sylfreq)
    sylfreq is a list of triplets, composed of frequencies of orthographic
        syllable (or None), phonological syllable, and syllable type
    the result is a pair or triplet (list), respectively, with the
        corresponding log sums of the individual syllable frequencies.
        If syllables with zero frequency are encountered, they are given
        a 1.0/MINRATIO nominal frequency

bigram_exist(word,mode,lowcase,unstress,tokens)
    word is an orthographic (default) or phonological string
    mode is _SPEL_ (0; default) or _PHON_ (1)
    lowcase is True or False (default); applies only to orthographic
        strings
    unstress is True or False (default)
    tokens is True for bigram token counts (default) or False for
        bigram type counts
    the result is True if all bigrams in word occur in the current corpus
    the result is False if one or more bigrams in word have zero
        occurrences
    if either lowcase or unstress is True then the search is done on
        bigrams counted after removing stress diacritics and converting
        to lowercase
    mode, lowcase, and unstress are keyword arguments

bigram_prob(word,mode,lowcase,unstress,ends,tokens)
    word is an orthographic (default) or phonological string
    mode is _SPEL_ (0; default) or _PHON_ (1)
    lowcase is True or False (default); applies only to orthographic
        strings
    unstress is True or False (default)
    ends is True (default) or False
    tokens is True for bigram token counts (default) or False for
        bigram type counts
    the result is a pair (tuple) composed of the log sum of the
        frequencies (counts) of all bigrams in word (including
        onset/offset bigrams with space) and of the cumulative
        probability of the bigram sequence based on the conditional
        probabilities of bigrams on initial letters
    if ends is False then onset/offset bigrams with space are excluded
        from the log sum (not from the cumulative probability!)
    if either lowcase or unstress is True then the search is done on
        bigrams counted after removing stress diacritics and converting
        to lowercase
    mode, lowcase, unstress, and ends are keyword arguments

unique(word,mode,unstress,nearest)
    word is an orthographic (default) or phonological string
    mode is _SPEL_ (0; default) or _PHON_ (1)
    unstress is True or False (default)
    nearest is True or False (default)
    The result is the serial letter position of uniqueness in the word,
        that is, the first letter not matching any other word in the
        sortlist, left-to-right.
    If unstress is True, the word is destressed and a no-stress list
        is used
    If nearest is True, the last-diverging word is also returned in a tuple
    ATTENTION: Python convention, first letter position is 0!

find(item,corplist,unstress)
    item is an orthographic (default) or phonological string; add
        asterisks to match freely at different parts of the word
        (e.g. 'ka*' finds words beginning with ka)
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number.
        The orthographic or phonological string (depending on corplist)
        is a word that contains item. The third element (number) in the
        triplet is the number of occurrences of spel in the current
        corpus.

neighbors(item,corplist,unstress,types)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    unstress is True or False (default) (keyword argument)
    types is a string containing any of the letters R, D, I, and T;
        e.g. "RT" (default "R")
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number.
        The orthographic or phonological string (depending on corplist)
        is a neighbor of item.
        By default, only Replacement neighbors are considered, i.e., each
        neighbor is equal in length to item and differs from it by a
        single character (letter or phoneme, respectively). The third
        element (number) in the triplet is the number of occurrences of
        spel in the current corpus. Depending on types, Replacement,
        Deletion, Insertion, and/or Transposition neighbors may be
        included, i.e., words with one character changed, missing, added,
        or swapped with the following one, respectively.

levenshtein_distance(item,corplist,unstress,N,minF,_ins,_del,_sub,_tra,Nlist)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    unstress is True or False
    N is the number of nearest neighbors to be considered in the mean
        distance metric (defaults to 20, per Yarkoni et al.)
    minF is the minimum frequency (occurrence count) of lexicon items to
        consider (defaults to 1, i.e., to include the entire lexicon)
    _ins,_del,_sub,_tra are the costs associated with insertion, deletion,
        substitution, and transposition of letters (phones), respectively;
        default to 1, 1, 1, 2
    Nlist controls output: If False (default), the result is a single
        number, the mean distance of the N items. If True, the result is
        a tuple containing the list of N items, as 2-member tuples of
        (item,distance)
    NOTE: Calculates mean Levenshtein distance for N=20 nearest neighbors
        following Yarkoni's OLD20 / PLD20 indices; this function is
        REALLY SLOW and should not be used for more than a couple of
        items; use leven() instead, if possible (i.e., if you are on a
        Windows machine and you have leven.exe)

leven(item,corplist,unstress,N,minF,_ins,_del,_sub,_tra,Nlist)
    item is an orthographic (default) or phonological string or a list
        of strings
    corplist is _SPEL_ (0; default) or _PHON_ (1) (keyword argument)
    unstress is True or False (default)
    N is the number of nearest neighbors to be considered in the mean
        distance metric (defaults to 20, per Yarkoni et al.)
    minF is the minimum frequency (occurrence count) of lexicon items to
        consider (defaults to 1, i.e., to include the entire lexicon)
    _ins,_del,_sub,_tra are the costs associated with insertion, deletion,
        substitution, and transposition of letters (phones), respectively;
        default to 1, 1, 1, 2
    the result is the mean distance or list of mean distances (depending
        on item); if Nlist is True then the result is a list of
        (string,float) pairs including the 20 nearest items and
        corresponding distances (no mean distance)
    NOTE: Calculates mean Levenshtein distance for N=20 nearest neighbors
        following Yarkoni's OLD20 / PLD20 indices; the same is done by
        levenshtein_distance() but this one is much faster because it
        runs an external highly optimized .exe (if you have it in the
        current working directory; Windows only)

syllabic_neighbors(item,unstress)
    item is a phonological string
    unstress is True or False (default) (keyword argument)
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number.
        The phonological string shares its first syllable with item and
        is a neighbor of item, i.e., it is equal in length to item and
        differs from it by a single character (phoneme). The orthographic
        string is the corresponding spelling and the third element
        (number) in the triplet is the number of occurrences of spel in
        the current corpus.

cohort(item,minlen,corplist,unstress,reverse)
    item is an orthographic or phonological (default) string
    minlen is the length of the common onset among the cohort group
    corplist is _SPEL_ (0) or _PHON_ (1; default)
    unstress is True or False (default)
    reverse is False (default) or True; if True then matching is done at
        word endings rather than beginnings so the result is not a cohort
        group but can be a suffix or rhyme group depending on the
        matching sequence
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number.
        The orthographic or phonological string (depending on corplist)
        matches item in the first (if reverse is False) or last (if
        reverse is True) characters (letters or phonemes). The third
        element (number) in the triplet is the number of occurrences of
        spel in the current corpus.
    minlen, corplist, unstress, and reverse are keyword arguments

stress_neighbors(item,corplist)
    item is an orthographic (default) or phonological string
    corplist is _SPEL_ (0) or _PHON_ (1; default) (keyword argument)
    the result is a list of (spel,phon,freq) lists, each list consisting
        of an orthographic and a phonological string and a number.
        The orthographic or phonological string (depending on corplist)
        matches item from the stressed grapheme or phoneme, respectively,
        through the end. The third element (number) in the triplet is the
        number of occurrences of spel in the current corpus.

spe2pho(spel)
    spel is a single orthographic word or a list of orthographic words
    the result is the corresponding phonological word or list of
        phon. words

setup_GPC_rules(exclude_optrules,nonoptional)
    exclude_optrules is True or False (default); when True, rules marked
        as optional are not loaded into the ruleset, so they can never
        apply
    nonoptional is True or False (default); when True, rules marked as
        optional are considered obligatory, so they always apply
    the result is a ruleset that can be used by gpc
    exclude_optrules and nonoptional are keyword arguments

gpc(spel,ruleset)
    spel is an orthographic string
    ruleset is a set of GPC rules as returned by setup_GPC_rules
    the result is a phonological string (if there is a single possible
        outcome, either because all rules are obligatory or because no
        optional rule applies for this word) or a list of phonological
        strings (if there are multiple alternatives depending on the
        application of optional rules)
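
Note on the gpc result: because gpc returns either a single string or a list
of strings, calling code usually has to branch on the type. A minimal sketch
of normalizing the result to a list follows. as_alternatives is a
hypothetical helper, not part of sphcorp, and the sample strings are
placeholders rather than real gpc output. The sketch assumes Python 3 text
strings; under Python 2 the type check would also need to cover unicode.

```python
def as_alternatives(result):
    """Return the gpc result as a list of phonological strings.

    Hypothetical helper (not part of sphcorp). A single string becomes a
    one-member list; a list of alternatives is copied unchanged.
    """
    if isinstance(result, str):
        return [result]
    return list(result)


# Placeholder values standing in for real gpc output:
single = as_alternatives("abc")
several = as_alternatives(["abc", "abd"])
print(single, several)
```

With a live CorpProc instance, the call would presumably look like
as_alternatives(cp.gpc(spel, ruleset)).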