Tables of units and measures; Python code implementation; publications.

WARNING: Information and data provided here may contain errors. Please report problems to iplr[at]


Tables are provided as Microsoft Excel files, compressed into rar format.

Full processed corpus word list with all quantitative measures.
Including stress: [compressed xls file] [compressed txt file]
Ignoring stress: [compressed xls file] [compressed txt file]

These tables are also provided as tab-separated text (Greek encoding; CP1253/ISO-8859-7) because they contain too many lines to fit in a single Excel sheet; the xls version is therefore split across 4 sheets. If you need all items together, read and process the tab-separated text with another program. If you do not want to write your own code, try importing the text file into MS Access.
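For readers who prefer to write their own code, reading the tab-separated text can be sketched as follows. This is a minimal Python 3 illustration, not part of the IPLR library (which targets Python 2.x); the function name and the file name in the usage comment are hypothetical, and the column layout of the real tables is not assumed.

```python
import csv


def read_word_table(path, encoding="cp1253"):
    """Read a tab-separated table into a list of rows (lists of strings).

    The default encoding follows the Greek code page named on this page.
    """
    with open(path, "r", encoding=encoding, newline="") as handle:
        # QUOTE_NONE keeps any quote characters in the data as plain text.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)


# Hypothetical usage; replace the name with the file you extracted:
# rows = read_word_table("wordlist.txt")
# print(len(rows), "rows read")
```

Reading everything into one list keeps all items together, which is what the 4-sheet xls version cannot do.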

Lists of single letters and phones, with associated frequency of occurrence, are included in the bigram tables.

Tables of orthographic and phonological syllables, with associated frequency of occurrence. [compressed xls files]

The syllables are ranked in order of decreasing type or token frequency. See the syllable section of the documentation for more information about the syllabification. This file includes a list of consonant clusters that are considered to be legal syllabic onsets, and their associated frequency of occurrence as syllabic onsets word-initially and word-medially.

Tables of letter and phone bigrams, with associated frequency of occurrence. [compressed xls files]

There are tables of type and token counts for letter bigrams and phone bigrams, with or without stress diacritics. When calculated with stress diacritics, each marked letter is treated as a separate letter (e.g., a stressed vowel letter is distinct from its unstressed counterpart). When calculated without stress diacritics, the two are collapsed into the same letter category.
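The distinction between type and token counts, with and without stress, can be sketched in a few lines of Python 3. This is an illustration under stated assumptions, not the code that produced the tables: the function names are hypothetical, stress is assumed to be the acute accent (tonos) only, and a bigram that occurs twice in one word is assumed to count twice for that word.

```python
import unicodedata
from collections import Counter


def strip_stress(word):
    """Remove the acute accent (tonos) but keep other marks such as dialytika."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = "".join(ch for ch in decomposed if ch != "\u0301")
    return unicodedata.normalize("NFC", kept)


def bigram_counts(words, frequencies, with_stress=True):
    """Return (type_counts, token_counts) for letter bigrams.

    Type counts add 1 per occurrence in each distinct word; token counts
    add the word's corpus frequency per occurrence.
    """
    type_counts = Counter()
    token_counts = Counter()
    for word, freq in zip(words, frequencies):
        form = word if with_stress else strip_stress(word)
        for i in range(len(form) - 1):
            bigram = form[i:i + 2]
            type_counts[bigram] += 1
            token_counts[bigram] += freq
    return type_counts, token_counts
```

Calling `bigram_counts(words, frequencies, with_stress=False)` merges stressed and unstressed letters before counting, so the counts of the marked variants fold into the plain ones.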

Grapheme/phoneme-level analysis of orthographic transparency in both directions. [compressed xls file]

This table contains bidirectional mapping lists between letters-phones and graphemes-phones. The results of this analysis have been accepted for publication in Behavior Research Methods.

Word-type and word-token statistics for all quantitative measures. [compressed xls file]

Lists descriptive statistics (mean, minimum, maximum, skewness, kurtosis) and selected percentiles for each quantitative measure, based on the CLEAN corpus, for word types and word tokens, separately with and without taking stress into account.
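As an illustration of how type and token statistics differ, the sketch below computes some of these descriptives in Python 3. It assumes that type statistics weight every word equally and that token statistics weight each word by its frequency; it uses population moments and excess kurtosis, which may not match the formulas behind the table. The function name is hypothetical, and percentiles are left out.

```python
def weighted_descriptives(values, weights=None):
    """Mean, min, max, skewness and excess kurtosis of a measure.

    Pass weights=None for word types, or word frequencies for word tokens.
    The values must not all be equal (the variance would be zero).
    """
    if weights is None:
        weights = [1] * len(values)
    total = sum(weights)
    mean = sum(w * v for v, w in zip(values, weights)) / total

    def central_moment(order):
        # Weighted central moment of the given order.
        return sum(w * (v - mean) ** order
                   for v, w in zip(values, weights)) / total

    variance = central_moment(2)
    return {
        "mean": mean,
        "min": min(values),
        "max": max(values),
        "skewness": central_moment(3) / variance ** 1.5,
        "kurtosis": central_moment(4) / variance ** 2 - 3,
    }
```

For example, with word lengths as `values`, passing the frequency list as `weights` gives the token-based figures, and omitting it gives the type-based ones.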


Download the Python programming language (version 2.5 or later 2.x) to use these resources.

The main library of functions used to process the text corpus and to provide the online tool services on this web site. [compressed py file]

You are welcome to “borrow” parts of this code as they may be useful to you; please cite IPLR when you do. To use the library as it stands you will need at least one word-list corpus, broken down into three files, one for orthography (spelling), one for phonology (pronunciation), and one for frequency (number of occurrences). Each line in each file contains a single string, and the three files have the same number of items in the same (corresponding) order. Pronunciation uses our custom symbol list to avoid confusions with special fonts.
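The three-file layout described above can be checked before handing the files to the library. The following Python 3 sketch is not part of the library and does not use its API; the function name, the file names in the usage comment, the UTF-8 default, and the assumption that frequencies are whole numbers are all illustrative.

```python
def load_parallel_corpus(orth_path, phon_path, freq_path, encoding="utf-8"):
    """Load three line-aligned files into (spelling, pronunciation, frequency) tuples."""

    def read_lines(path):
        with open(path, "r", encoding=encoding) as handle:
            return [line.rstrip("\r\n") for line in handle]

    spellings = read_lines(orth_path)
    pronunciations = read_lines(phon_path)
    # Assumption: each frequency line holds one integer count.
    frequencies = [int(value) for value in read_lines(freq_path)]

    # The files must have the same number of items in corresponding order.
    if not (len(spellings) == len(pronunciations) == len(frequencies)):
        raise ValueError("the three corpus files differ in length")
    return list(zip(spellings, pronunciations, frequencies))


# Hypothetical usage:
# corpus = load_parallel_corpus("orth.txt", "phon.txt", "freq.txt")
```

A length mismatch raises an error early, since misaligned files would silently pair spellings with the wrong pronunciations or counts.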

The API for using the processing code in your own Python programs. [txt file]

Presents each available function, and the associated parameters. Also includes an example of setting up the necessary class instances for carrying out the processing.

Auxiliary data files necessary to use [compressed files and folder]

Uncompress this archive in the folder where you have placed See the included for using the library functions that do not require a corpus.

The code implementing the IPLR online tools offered on this web site. [compressed py file]

Examine this code to see exactly how each parameter is computed in the results you download from IPLR. Also, find out how you can use the sphcorp library in your own functions.

A Python program that counts tokens of word forms (types) in Greek text corpora. [compressed py file] [compressed package with exe file]

Runs from cmd, with your Greek text file name as a command-line argument. Alternatively, name your file text.txt, place it in the same folder, and double-click on comp_freq. You will receive a list of unique Greek words, with associated token counts, and a junk list of letter strings including anything other than Greek letters.
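The kind of output described above can be approximated with a short Python 3 sketch. This is not the comp_freq code; the function names are hypothetical, and several choices are assumptions: tokens are runs of letters, digits and underscores, case is left unchanged, and a character counts as Greek if it is alphabetic and its Unicode name starts with "GREEK".

```python
import re
import unicodedata
from collections import Counter


def is_greek_word(token):
    """True if every character is an alphabetic character with a GREEK Unicode name."""
    return all(
        ch.isalpha() and unicodedata.name(ch, "").startswith("GREEK")
        for ch in token
    )


def count_greek_words(text):
    """Return (word_counts, junk_counts) for a text.

    word_counts holds tokens made only of Greek letters; junk_counts holds
    every other token (Latin words, numbers, mixed strings).
    """
    text = unicodedata.normalize("NFC", text)
    word_counts = Counter()
    junk_counts = Counter()
    for token in re.findall(r"\w+", text):
        if is_greek_word(token):
            word_counts[token] += 1
        else:
            junk_counts[token] += 1
    return word_counts, junk_counts
```

Keeping the junk tokens in a separate counter makes it easy to inspect what was excluded from the word list.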

A C program that calculates orthographic and phonological distances. [compressed C file] [compressed package with exe file]

The Windows executable leven.exe runs from cmd; run it without any arguments to view information about required and optional parameters. The C code compiles as provided with gcc under Linux or Mac OS; see the compilation information near the top of the file. This program calculates Levenshtein distance and the mean orthographic/phonological distance of the N nearest items (such as the OLD20 and PLD20 scores of Yarkoni, Balota, & Yap, 2008). You provide a list of target items, for which the indices are calculated, and a “lexicon” (list of items) in which to search for the nearest neighbors.
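For reference, the two quantities can be sketched in Python 3 as below. This is an independent illustration, not a port of the C program: the function names are hypothetical, all edits cost 1, the target is assumed to be excluded from its own neighborhood, and when the lexicon has fewer than N other items the mean is taken over those available.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def mean_nearest_distance(target, lexicon, n=20):
    """Mean Levenshtein distance from target to its n nearest lexicon items."""
    distances = sorted(
        levenshtein(target, item) for item in lexicon if item != target
    )
    nearest = distances[:n]
    if not nearest:
        raise ValueError("lexicon contains no items other than the target")
    return sum(nearest) / len(nearest)
```

Passing spellings gives an orthographic score in the style of OLD20, and passing pronunciation strings gives the phonological counterpart. This pure-Python version compares the target with every lexicon item, so it will be slow on large lexicons.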


Chalamandaris, A., Raptis, S., & Tsiakoulis, P. (2005). Rule-based grapheme-to-phoneme method for the Greek. In INTERSPEECH-2005, pp. 2937–2940.

Protopapas, A., Tzakosta, M., Chalamandaris, A., & Tsiakoulis, P. (in press). IPLR: An online resource for Greek word-level and sublexical information. Language Resources & Evaluation. doi:10.1007/s10579-010-9130-z

Protopapas, A., & Vlahou, E. (2009). A comparative quantitative analysis of Greek orthographic transparency. Behavior Research Methods, 41(4), 991–1008.

Tzakosta, M., & Karra, A. (2007). A typological and comparative account of CL and CC clusters in Greek dialects. 3rd International Conference on Modern Greek dialects and linguistic theory. University of Cyprus, 14–16 June.

Tzakosta, M., & Vis, J. (2007). Phonological Representations of consonant sequences: the case of affricates vs. ‘true’ clusters. 8th International Conference on Greek Linguistics. University of Ioannina, 30 August–2 September.