Documentation

Information about the sources, tools, and results.

WARNING: Information and data provided here may contain errors. Please report problems to iplr[at]ilsp.gr.

Corpora

Counts and materials distributed from this web page are based on word lists from two electronic text corpora, referred to in the tables as “LARGE” (L) and “CLEAN” (C). We recommend using the CLEAN corpus because of its very low proportion of errors, resorting to LARGE only when special circumstances dictate.

CLEAN: A medium-size corpus derived from a previous version of the Hellenic National Corpus, a collection of journalistic, legal, and literary texts comprising more than 34 million words, collected, processed, and maintained by ILSP. The list of all space-separated tokens from this corpus was distilled into 374,075 types totaling approximately 31 million tokens. Each type was then checked against a large electronic dictionary containing 1,622,668 entries. The dictionary is part of Symfonia, the ILSP spelling- and grammar-checking software, and includes derived morphological variants for each lemma, covering the full range of morphological types found in the Greek language. Rejected out-of-dictionary strings made up 41% of the types but accounted for only 5.3% of the tokens. The resulting corpus thus includes only clearly legal and correctly spelled items, although it is impossible to ascertain the extent to which words in the original text may have been misspelled as different existing words. The 217,664 retained types accounted for a total of approximately 29.6 million tokens. This set lacks extremely rare words, as well as most proper names and foreign words, which are not found in the dictionary; but it is also free from spelling errors. It therefore approximates conservative, mainstream, well-proofread sources. We recommend using it for any work that does not require extremely large token counts or the inclusion of foreign words and names.

LARGE: A very large corpus, made up entirely of journalistic texts, based on the online content of news sources. A list of all space-separated tokens (strings) was created and the number of occurrences of each unique item (type) was determined. This list included 1,017,946 types totaling approximately 272 million tokens. After removing items containing punctuation, numbers, and Latin characters, the remaining strings were converted to their phonetic form (see grapheme-to-phoneme below). The broad phonetic transcriptions were parsed into approximate syllables (based on vowels), rejecting a variety of illegal patterns. Rejected strings made up 8.6% of the types but accounted for only 1.4% of the tokens. The remaining 930,760 types accounted for a total of approximately 268.2 million tokens in the original text corpus. This set of words is likely to include misspellings, incorrectly stressed words, foreign words, proper names, and a large proportion of very rare words. In fact, 52.8% of the types occur only once or twice in the corpus, accounting for only 0.2% of the tokens. The corpus can be considered reasonably realistic, marked with all the imperfections expected of everyday texts found in newspapers and comparable, minimally proofread sources. Because of the substantial proportion of errors (misspellings, typos, etc.), we do not recommend using this corpus for work requiring accurate estimates of low-frequency items.

Phonetic units and phonetic symbols

The level of segmental analysis in this resource is the broad phonetic level of “speech sound” categories (as they occur in the canonical pronunciation typical of major cities, such as Athens). The corresponding segmental units are sometimes termed phones, to distinguish them from the abstract segmental units termed phonemes by theoretical phonologists. There is some controversy and much confusion in the Greek literature about the two domains. In theoretical linguistics, phonological segments are postulated as abstract units that satisfy criteria of representational economy and distinctiveness in lexical composition. In speech production, by contrast, phonetic segments are identified insofar as they exhibit stable constellations of articulatory (and/or acoustic) features across speakers and contexts. Phonetic distinctiveness may concern mispronunciation, as judged by native listeners, without necessarily invoking a minimal pair at the lexical level. Both concepts are abstractions to some degree. However, the two segmental descriptions need not correspond to each other in a one-to-one fashion, and in fact they often don't. Groups of phonetic segments, termed “allophones,” are taken to correspond to the same phonological segment, to the extent that theoretical analysis permits a parsimonious description. Regardless of their dubious theoretical status in the synchronic analysis of Greek, these considerations are irrelevant to reading. The relevant notion of “phonology” in reading refers to “speech sounds,” that is, to the phonetic segments that can be (a) pronounced as such and (b) identified meta-linguistically in “phonological awareness” tasks.

In IPLR, phones are indicated with Latin characters plus five stressed vowels of the Greek alphabet. Because the tables must be processed with text tools and transferred between systems and platforms, we do not use IPA phonetic symbols, which would require special fonts. A lookup table titled “phonetic symbols” is included as a separate sheet in distributed spreadsheet files that include phonetic strings. This table can also be viewed online. This custom representation is case-sensitive, requiring extra care when matching using Excel formulas (use the EXACT function instead of the = sign). Note that our analyses concern a representation at the level of broad phonetic transcription and not any abstract phonemic level. The phonetic units we employ correspond to broadly defined segments as pronounced by a successful text-to-speech system, and have therefore been validated pragmatically. They need not conform to theoretical notions about what constitutes a phoneme in Greek. Our phonetic unit set includes segments that are usually classified as allophones by Greek phonologists, because they are phonetically distinctive at the surface level. Notably, all palatal consonants and the velar nasal are treated as distinct segments (phones). The affricates ts and dz are treated as single segments and not homorganic doublets. The labiodental nasal is treated as identical to the bilabial nasal, as its pronunciation is optional and nondistinctive.

Grapheme-to-phoneme transcription

A module for producing the phonetic representation of a word or phrase was developed in the framework of a text-to-speech project for the Greek language. Using a collection of manually transcribed lexicons as a training set and hand-seeded mapping rules from graphemes to phonemes, additional mapping rules are iteratively derived, forming revised rule sets, until the entire training corpus is fully accounted for. The resulting module is a rule-based grapheme-to-phoneme system that performs very well with out-of-vocabulary words (its success rate has been estimated to exceed 98% by a word-based criterion). A central concept in the development of this module has been to avoid reliance on the epsilon phoneme, i.e., a letter that is not assigned to a phoneme but instead serves to alter an adjacent, typically following, phoneme. In the present approach, all letters are assigned to phonemes, either individually or as part of a sequence. There are no a-priori constraints on the size of the letter sequences that are mapped as units onto corresponding phoneme sequences, thus allowing generalization of common mappings as well as complete handling of exceptional or unusual mappings. The current implementation incorporates more than 5000 unique rules for transcribing letters to sounds. All word pronunciations used in IPLR analyses were produced by this module. You can find more information about it in a publication available from the Downloads page. If you need more information about the ILSP text-to-speech synthesis applications, services, and products, contact Innoetics (an ILSP spin-off).
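To make the longest-match idea concrete, here is a minimal Python sketch, independent of the actual IPLR module and its 5000-plus rules: at each position the longest letter sequence covered by a rule is consumed, so every letter is assigned to a phone and no “epsilon” (silent) letters remain. The tiny rule set and phone symbols are invented for illustration.

```python
# Hypothetical mini rule set: letter sequences of any length map onto phones.
RULES = {
    "ου": "u",   # digraph mapped as a unit to a single phone
    "αι": "e",
    "μπ": "b",
    "π": "p", "κ": "k", "λ": "l", "σ": "s",
    "α": "a", "ο": "o", "ι": "i",
}

def transcribe(word):
    """Greedy longest-match transcription: at each position, consume the
    longest letter chunk covered by a rule; every letter is accounted for."""
    phones = []
    i = 0
    while i < len(word):
        for length in range(len(word) - i, 0, -1):  # longest chunk first
            chunk = word[i:i + length]
            if chunk in RULES:
                phones.append(RULES[chunk])
                i += length
                break
        else:
            raise ValueError(f"no rule covers {word[i]!r}")
    return "".join(phones)
```

For example, `transcribe("μπαλα")` yields `bala`, mapping the digraph μπ to the single phone /b/ rather than to two separate phones.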

Orthographic and phonological strings are used throughout this resource, referred to by the “spel” and “phon” abbreviations, respectively. Orthographic (spel) representations refer to letter strings and results of analyses or counts of letter strings, whereas phonological (phon) representations refer to phone strings and results of analyses or counts of phone strings.

In addition to the aforementioned grapheme-to-phoneme module, we have developed a set of graphophonemic transcription rules that can transcribe any Greek letter string and produce a legal pronunciation. Cases of ambiguity (such as CiV sequences) are handled by “optional” transcription rules. The python library distributed from the Downloads page includes code for applying these rules and producing phonetic transcriptions. The code allows the user to choose whether to apply optional rules or, alternatively, to obtain all legal alternatives. These rules were developed for the purpose of modeling a rule-based sublexical reading route for Greek. You can read more about them in our analysis of Greek orthographic transparency (manuscript available from Downloads).
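The idea of optional rules producing all legal alternatives can be sketched as follows. The letter-to-phone map, the simplified CiV condition, and the glide symbol /j/ are assumptions of this sketch, not the distributed python library.

```python
from itertools import product

# Hypothetical one-letter-to-one-phone map covering the example word.
MAP = {"π": "p", "ι": "i", "α": "a", "ν": "n", "ο": "o"}
GREEK_VOWELS = set("αεηιουω")

def transcribe_all(word):
    """Return every legal transcription. An 'optional' rule fires on a
    (simplified) CiV context: ι between a consonant and a vowel may
    surface as the vowel /i/ or the glide /j/."""
    options = []
    for k, ch in enumerate(word):
        if (ch == "ι" and 0 < k < len(word) - 1
                and word[k - 1] not in GREEK_VOWELS
                and word[k + 1] in GREEK_VOWELS):
            options.append(("i", "j"))      # optional rule: two outcomes
        else:
            options.append((MAP[ch],))      # deterministic rule
    return ["".join(combo) for combo in product(*options)]
```

For instance, `transcribe_all("πιανο")` yields both `piano` and `pjano`, mirroring the choice between applying the optional rule and listing all legal alternatives.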

Graphophonemic alignment

A module to align single phones to letter sequences was developed, based on a list of all possible grapheme-phoneme pairs as given in the literature. Certain simplifications were applied, in order to avoid contaminating effects from optional pronunciations, such as prenasalized stops. Specifically, all combinations of a nasal consonant followed by a homorganic stop were simplified by dropping the nasal (e.g., all occurrences of /mb/ were converted to /b/, etc.). Although, etymologically, it might be considered proper to pronounce the nasal in certain words and not pronounce it in others, in practice the choice regarding the realization or not of a nasal is a matter of idiolect and/or social circumstances; phonologically, the pronunciation of the nasal is optional. Given that all homorganic nasal-stop clusters have a phonologically ambiguous status, such sequences are eliminated. Removal of the nasal facilitates syllabification and simplifies the processing of consonant clusters. In addition, all /mpt/ clusters were converted to /mt/, because pronunciation of /p/ is optional.
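As a sketch, the simplifications described above amount to plain string replacements over the broad phonetic transcription. The phone symbols here are illustrative (with "N" standing in for the velar nasal); they are not the official IPLR symbol set.

```python
def simplify_clusters(phon):
    """Drop the optional /p/ of /mpt/ and the optional nasal of
    homorganic nasal+stop clusters, as described above."""
    phon = phon.replace("mpt", "mt")  # /mpt/ -> /mt/, before nasal dropping
    for cluster, stop in (("mb", "b"), ("nd", "d"), ("Ng", "g")):
        phon = phon.replace(cluster, stop)  # e.g. all /mb/ -> /b/
    return phon
```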

The grapheme-phoneme alignment module processes one space-delimited string at a time. It runs serially through the phoneme string, maximizing the number of letters that can be mapped onto the current phoneme, as long as this does not cause mismatches further downstream. The maximization strategy proceeds through both the letter and the phoneme string and produces an ordered list of string pairs. In each string pair, one member contains the current phoneme (or, rarely, two phonemes, such as those represented by the letters ξ and ψ) and the other member contains the letter or sequence of letters that corresponds to this phoneme. This mapping can be used for orthographic syllabification and for evaluating the transparency and consistency of Greek spelling.
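The maximization strategy with downstream checking can be sketched as a small recursive search over a hypothetical mini inventory of legal grapheme-phoneme pairs (the real module uses the full inventory from the literature):

```python
# Hypothetical mini inventory of legal grapheme -> phone pairs.
PAIRS = {
    ("ου", "u"), ("ο", "o"), ("υ", "u"),
    ("π", "p"), ("λ", "l"), ("α", "a"), ("ι", "i"), ("αι", "e"),
}

def align(letters, phones):
    """Return a list of (letter chunk, phone) pairs covering both strings,
    preferring the longest letter chunk for the current phone and
    backtracking if the remainder cannot be aligned; None on failure."""
    if not letters and not phones:
        return []
    if not phones:
        return None
    for length in range(len(letters), 0, -1):  # maximize letters per phone
        chunk = letters[:length]
        if (chunk, phones[0]) in PAIRS:
            rest = align(letters[length:], phones[1:])
            if rest is not None:  # no mismatch further downstream
                return [(chunk, phones[0])] + rest
    return None
```

For the word πουλί (ignoring stress), `align("πουλι", ["p", "u", "l", "i"])` pairs the digraph ου with the single phone /u/.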

The IPLR TEXT Tool (under the Online Tools page) provides aligned grapheme-phoneme pairs of arbitrary strings provided by the user.

Syllabification

Phoneme strings for each word token can be approximately syllabified by capitalizing on the fact that every full vowel in Modern Greek corresponds to a syllabic nucleus. (Rare exceptions concern a few diphthongs, which are ignored in these analyses.) After vowels are detected in the phoneme sequence, the principle of maximal onset is applied: consonant sequences preceding each vowel, up to the previous vowel or word beginning, are compared against a list of legal onset clusters. This list was created by including all consonant clusters found at word beginnings in the CLEAN corpus (sequences of two or more consonants up to the first vowel), and was manually amended to include a large number of clusters that can be considered legal onsets on phonological grounds (such as /ɣm/ and /vn/). It is assumed that the non-occurrence of the latter at word beginnings is accidental. The complete list of clusters considered legal syllabic onsets, and their associated frequencies of occurrence, can be found in the syllable tables under Downloads.
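The maximal-onset procedure can be sketched as follows, assuming one character per phone and a small hypothetical excerpt of the legal-onset list (the full list is in the syllable tables):

```python
VOWELS = set("aeiou")
# Hypothetical excerpt of the legal-onset list; "" allows vowel-initial syllables.
LEGAL_ONSETS = {"", "p", "t", "k", "s", "l", "m", "n", "r", "st", "tr", "str"}

def syllabify(phon):
    """Split a phone string into syllables: every vowel is a nucleus, and
    each intervocalic cluster contributes its longest legal suffix to the
    onset of the following syllable (maximal onset)."""
    nuclei = [i for i, ph in enumerate(phon) if ph in VOWELS]
    if not nuclei:
        return [phon]
    boundaries = [0]
    for prev, cur in zip(nuclei, nuclei[1:]):
        cluster = phon[prev + 1:cur]
        for split in range(len(cluster) + 1):  # smallest coda = maximal onset
            if cluster[split:] in LEGAL_ONSETS:
                boundaries.append(prev + 1 + split)
                break
    boundaries.append(len(phon))
    return [phon[a:b] for a, b in zip(boundaries, boundaries[1:])]
```

With this excerpt, `syllabify("kastro")` yields ka-stro, since /str/ is a legal onset, whereas `syllabify("arta")` yields ar-ta, since /rt/ is not.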

Automatic syllabification can only be approximate, at this stage, for a number of reasons. First, it is not universally agreed what consonants and consonant clusters constitute legal onsets and codas in Greek phonology. The historic roots of the language have resulted in relic items that could conceivably be considered phonologically illegal, yet they are undoubtedly parts of the language. Similar problems are caused by recent loans from languages with more complex syllabic structures. Second, it is not possible to account automatically for the cases of extrametricality of /s/, because other factors can influence relevant judgments, such as word position. Third, morphological considerations, such as morpheme boundaries, may play a role in (re)syllabification, but this cannot be accounted for automatically on the basis of the phonological specification only. Thus, the proposed automatic approach to syllabification is a rough approximation meant to facilitate automatic processing and to permit calculation of quantities relevant for psycholinguistic evaluation and balancing of stimuli.

The relative token frequency of various syllable types, derived in this manner, is consistent with a preponderance of CV syllables (55.9%) and generally open syllables (CV, V, CCV, and CCCV totalling 86.2%). Despite the tendency to allow a wide variety of consonant clusters as legal syllabic onsets, complex-onset syllables do not exceed 15.0%, therefore the effect of specific clusters with dubious tautosyllabic status must be very small.

This online table lists the 433 most frequent syllables (ignoring stress) with token frequency of at least 100 per million, which together account for 98.1% of all syllable tokens. Separate counts are provided for word-initial, word-medial, and word-final positions of each syllable type. The corresponding list of orthographic syllables (taking stress diacritics into account) is also available online. Complete tables of syllables and associated type and token frequencies, from both corpora, with or without taking stress into account, are available from the Downloads page.

Bigrams

Bigrams are pairs of adjacent items. In orthographic representations, bigrams refer to letter pairs; in phonological representations, to phone pairs. Bigram counts are calculated by summing all the occurrences (tokens) of each combination of two letters or two phones, accordingly. These counts are typically log-transformed to compress their range.

Total bigram frequency is related to the difficulty with which an item can be read, as it reflects the familiarity of the reader with the combinations of letters (or phones) exhibited by a given item (word). The usual method to calculate the “bigram frequency” of an item is to sum the log-transformed bigram counts of all the bigrams contained in the item, including a pseudo-bigram composed of a space (word-end marker) and the initial segment and another pseudo-bigram composed of the final segment and a space (word-end marker). The frequencies of occurrence of the additional endpoint pseudo-bigrams are equal to the count of the initial segment at word-initial position and the count of the final segment at word-final position, respectively. The resulting sum (whether linear or log sum) is a useful metric but does not stand up to theoretical scrutiny, because the summed frequencies are not independent, hence their summation (log sum, equivalent to untransformed multiplication) does not index the cumulative probability of the corresponding segment sequence. A more appropriate metric would be composed of the conditional probabilities of each segment, given the preceding segment. Therefore we have also implemented a cumulative conditional probability index, in which each bigram count is normalized by the token count of its first member. Following this normalization these resulting counts can be multiplied with each other (or their logarithms can be summed, preferably) to produce a joint (or cumulative) probability of the entire string on the basis of its constituent letter pairs.
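The difference between the two metrics can be sketched in a few lines of Python. The word-edge marker "#" and the toy corpus passed to the counting function are assumptions of this sketch:

```python
import math
from collections import Counter

def bigram_counts(words_with_freq):
    """Token bigram counts, including the endpoint pseudo-bigrams
    formed with a word-edge marker '#'."""
    counts = Counter()
    for word, freq in words_with_freq:
        padded = "#" + word + "#"
        for a, b in zip(padded, padded[1:]):
            counts[a + b] += freq
    return counts

def summed_log_bigram_frequency(word, counts):
    """The usual metric: sum of log-transformed bigram counts."""
    padded = "#" + word + "#"
    return sum(math.log(counts[a + b]) for a, b in zip(padded, padded[1:]))

def log_conditional_probability(word, counts):
    """The conditional metric: each bigram count is normalized by the
    token count of its first member before the logs are summed."""
    first_totals = Counter()
    for bg, c in counts.items():
        first_totals[bg[0]] += c
    padded = "#" + word + "#"
    return sum(math.log(counts[a + b] / first_totals[a])
               for a, b in zip(padded, padded[1:]))
```

Summing the log conditional probabilities is equivalent to multiplying the untransformed conditional probabilities, so the second metric indexes the cumulative probability of the whole segment sequence.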

Word-level cumulative bigram frequency/probability, based on either type or token bigram counts, is available from IPLR.

Neighbors

According to the usual definition, the neighbors of an item (word or nonword, letter or phone string) are items (words) of equal length that differ from the probe item by a single segment. These are termed “standard” neighbors at IPLR. Orthographic and phonological neighbors are calculated on the basis of letter and phone strings, respectively. More sophisticated indexes take into account segment Deletions, Insertions, and Transpositions, in addition to Replacements. The IPLR online tools provide “RDIT” neighbors, including all four possibilities.
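Neighbor generation under the RDIT scheme can be sketched as follows; the alphabet, the toy lexicon, and the example words are invented for illustration:

```python
def rdit_variants(item, alphabet):
    """All strings exactly one Replacement, Deletion, Insertion, or
    Transposition away from `item`."""
    variants = set()
    for i in range(len(item)):
        for c in alphabet:
            variants.add(item[:i] + c + item[i + 1:])          # Replacement
        variants.add(item[:i] + item[i + 1:])                  # Deletion
        if i < len(item) - 1:                                  # Transposition
            variants.add(item[:i] + item[i + 1] + item[i] + item[i + 2:])
    for i in range(len(item) + 1):
        for c in alphabet:
            variants.add(item[:i] + c + item[i:])              # Insertion
    variants.discard(item)
    return variants

def rdit_neighbors(item, lexicon, alphabet):
    """Lexicon entries that are RDIT neighbors of `item`."""
    return sorted(rdit_variants(item, alphabet) & set(lexicon))
```

With a toy English lexicon, the RDIT neighbors of "cat" include "cot" (replacement), "at" (deletion), "coat" (insertion), and "act" (transposition); restricting the procedure to replacements yields the “standard” neighbors.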

In addition to the neighbor groups based on single-segment differences, IPLR provides two other related groups: The “cohort” group comprises all words with the same initial syllable as the target item (word or pseudoword); a minimum of two phonemes is enforced in cases of single-segment initial syllables, in order to avoid excessively long (and useless) lists. The “stress neighbors” group comprises all words that match the target item from the stressed vowel through the end. In Greek, this corresponds to rhyming.

Recent developments in word neighborhood research have suggested that less strict measures of orthographic or phonological distance may constitute a better match to the cumulative effects of similar lexical items on word recognition. Calculation of “edit distance” or Levenshtein distance to derive mean orthographic/phonological distance of the N (typically, 20) nearest items is used to produce indices such as the OLD20 and PLD20 scores of Yarkoni, Balota, & Yap (2008).

Transparency indices

The orthographic transparency of Greek orthography has been thoroughly analyzed on the basis of a grapheme-phoneme alignment of the entire CLEAN corpus. The results are available as a manuscript on the Downloads page. To compute transparency indices at the word level, we follow Spencer (2009) and compute three kinds of quantities. Nondirectional metrics are based on the frequency of occurrence of grapheme-phoneme pairs (termed “sonographs” by Spencer). Feedforward metrics are derived by normalizing nondirectional frequency counts by grapheme, so that each grapheme-phoneme pair expresses its relative proportion in pronouncing the same grapheme (reading consistency). Feedback metrics are derived by normalizing nondirectional frequency counts by phoneme, so that each grapheme-phoneme pair expresses its relative proportion in spelling the same phoneme (spelling consistency). Each kind of quantity may be computed on type counts (that is, unique word forms) or on token counts (that is, weighting each occurrence by the frequency of the word in which it occurs); both options are available. Furthermore, two word-level metrics are provided, based on these quantities. One is the logarithmic mean of the probabilities of all grapheme-phoneme pairs in the word, indexing mean difficulty. The other is the probability of the grapheme-phoneme pair with the lowest probability within the word, indexing maximum difficulty. It remains to be investigated which of these quantities best predict reading and spelling performance as independent indices of word-level difficulty.
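The feedforward normalization and the two word-level metrics can be sketched as follows. The pair counts are invented, and the “logarithmic mean” is implemented as the geometric mean of the pair probabilities (our reading of the description above); the feedback metrics would be obtained symmetrically by normalizing per phoneme instead.

```python
import math
from collections import Counter

def feedforward_probs(pair_counts):
    """Normalize (grapheme, phoneme) pair counts by grapheme, giving
    each pair's reading-consistency probability."""
    per_grapheme = Counter()
    for (g, p), c in pair_counts.items():
        per_grapheme[g] += c
    return {(g, p): c / per_grapheme[g] for (g, p), c in pair_counts.items()}

def word_metrics(word_pairs, probs):
    """Return (mean difficulty, maximum difficulty) for a word given as a
    list of aligned (grapheme, phoneme) pairs: the geometric mean of the
    pair probabilities, and the lowest pair probability."""
    p = [probs[pair] for pair in word_pairs]
    geometric_mean = math.exp(sum(math.log(x) for x in p) / len(p))
    return geometric_mean, min(p)
```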

All of the aforementioned transparency indices are available from the IPLR Online Tools and are listed in the “Orthographic transparency” tables available under Downloads. To facilitate viewing, values are multiplied by 1000 before being displayed by the online tools; strictly speaking, they are therefore not probabilities but relative frequencies per thousand.