Applying a Normalized Compression Metric to the Measurement of Dialect Distance

Authors

  • Kiril Simov
  • Petya Osenova

DOI:

https://doi.org/10.55630/sjc.2007.1.73-86

Keywords:

Kolmogorov Complexity, Compression Metric, Dialect Distance, Language Contacts

Abstract

The paper discusses the application of a similarity metric based on compression to the measurement of the distance among Bulgarian dialects. The similarity metric is defined on the basis of the notion of Kolmogorov complexity of a file (or binary string). The application of Kolmogorov complexity in practice is not possible because its calculation over a file is an undecidable problem. Thus, the actual similarity metric is based on a real-life compressor which only approximates the Kolmogorov complexity. To use the metric for distance measurement of Bulgarian dialects we first represent the dialectological data in such a way that the metric is applicable. We propose two such representations which are compared to a baseline distance between dialects. Then we conclude the paper with an outline of our future work.
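
As a rough illustration of the idea, the following minimal sketch computes the normalized compression distance in the form given by Cilibrasi and Vitanyi in "Clustering by Compression", NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed size produced by a real compressor. The abstract does not say which compressor or which data representations the authors used, so the choice of zlib, the function names, and the toy word lists below are all assumptions made only for this example.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input.

    Stands in for the (uncomputable) Kolmogorov complexity; zlib is an
    assumption here, not necessarily the compressor used in the paper.
    """
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical toy "dialect" data: invented word lists, not taken from the paper.
site_a = " ".join(["mlyako", "hlyab", "voda"] * 20).encode("utf-8")
site_b = " ".join(["mleko", "hlyab", "voda"] * 20).encode("utf-8")
site_c = " ".join(["xqzw", "kjvb", "prtn"] * 20).encode("utf-8")

if __name__ == "__main__":
    print("a vs a:", ncd(site_a, site_a))
    print("a vs b:", ncd(site_a, site_b))
    print("a vs c:", ncd(site_a, site_c))
```

On inputs this short the compressor's fixed overhead dominates, so the values are only indicative; a string compared with itself will generally not score exactly zero.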

Published

2007-03-19

Section

Articles