Statistics of Ukrainian dialects

The aim of this study is to investigate the present-day distribution of dialect words and rare words in the Ukrainian language using mathematical methods. The main objectives are:
  • Identify the regions where a given word is used.
  • Identify the borders between regions with significantly different vocabulary.
  • Estimate when a given dialect word will "die".
  • Determine the temporal and spatial scales over which people's vocabularies are correlated.
  • Compare the vocabularies of different regions, cities and localities.
  • Check whether the language today still reflects historical borders of past centuries.

Contents


1. Data
2. Maps of usage level
3. Clustering
4. Variety of names for some ordinary things


1. The data


The primary source of information is a questionnaire consisting of more than 400 selected dialect words. We asked respondents to mark their usage level for each word. The usage level is a quantitative value from 1 to 5:
  1. have never heard this word,
  2. know it but never use it,
  3. use it very rarely,
  4. use it from time to time, alongside other synonyms,
  5. actively use only this form.

Each respondent also gave their year of birth and the town in which they lived until they finished school. We assume that a person's basic vocabulary is formed during the school years. We circulated the questionnaire via social networks and forums. In total, 2399 people took part in the poll between December 2011 and February 2016. The distribution of the respondents by year of birth is shown in the figure below.
years1_eng.png

The oldest respondent was born in 1922 and the youngest in 2002; the average year of birth is 1983 and the median is 1986. The map below shows the geography of the received questionnaires. The numbers in green circles give the number of respondents from a given place. Notably, we also received a few questionnaires from outside Ukraine: from places in Poland, Belarus, Romania, Russia and Moldova with compact historical Ukrainian communities. The areas around these localities were also taken into account.

statystyka.png


The geographical distribution of the respondents is very inhomogeneous. Therefore, to smooth the data properly, we introduced a typical distance d0 between respondent localities for a given point on the map:

Code2.png,
where N is the number of localities from which we received at least one response within the circle of radius R around a given point. As a compromise between sufficient statistics and spatial accuracy, we chose R so that the circle contains more than 40 different localities. When calculating d0 we took into account edge effects at the borders of the investigated area. The map below shows the distribution of the typical distance d0 in km. Grey zones were excluded from the investigation because the usage level of Ukrainian dialects there is negligibly small.

gaus.png


There is a strong correlation between a respondent's year of birth and their vocabulary: the older the respondent, the higher the usage level of dialect words. Since we have data from respondents of different ages, it is necessary to normalize the data to a common year of birth. To do this, we examined a compact area in the south of the Ternopil region where we have a large number of respondents of different ages. We assumed that the vocabulary within this area is uniformly distributed and depends only on the respondent's year of birth. We found that the average usage level of words decreases by one unit, on average, every 58 years. On this basis we normalized the usage levels to a 1980 year of birth.
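
The normalization step can be illustrated with a minimal sketch. It assumes a linear correction of one usage-level unit per 58 years of difference in year of birth, which is one reading of the result above; the function and constant names are hypothetical, and the exact procedure used in the study may differ.

```python
# Values taken from the text; the linear form of the correction is an assumption.
YEARS_PER_UNIT = 58.0   # one usage-level unit per 58 years
REFERENCE_YEAR = 1980   # year of birth the data are normalized to


def normalize_usage(level, birth_year):
    """Shift a raw usage level to the value expected for a 1980 year of birth.

    Older respondents report higher levels, so their levels are lowered;
    younger respondents' levels are raised. The result is not clamped to 1..5.
    """
    return level + (birth_year - REFERENCE_YEAR) / YEARS_PER_UNIT


if __name__ == "__main__":
    # A respondent born in 1922 who marked level 4.
    print(normalize_usage(4, 1922))
```

Under this assumption a respondent born 58 years before the reference year has one unit subtracted from every marked level.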

2. The maps of usage level B


To visualize the usage level of each word we used a weighted average for each map pixel (x, y). As weights we chose the Gaussian function exp(-(di/d0)^2), where di is the distance in km from a given point on the map to the i-th respondent and d0 is the typical distance between respondents introduced above. In practice this means that only a few nearby respondents have a significant effect on the color of a pixel, although formally the contribution of every respondent is included. So, to build the smoothed heatmap of the usage level Bj of the j-th word at a given point (x, y), we take into account the level bi,j of each i-th respondent:
eq1.png
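
A minimal sketch of this smoothing for a single pixel is given here. It assumes the weighted average is normalized by the sum of the weights and that the distances di have already been computed; the function name and its arguments are hypothetical.

```python
import math


def smoothed_usage(distances_km, levels, d0_km):
    """Gaussian-weighted average usage level at one map pixel.

    distances_km: distance from the pixel to each respondent (di)
    levels:       usage level of the word for each respondent (bi,j)
    d0_km:        typical distance between respondent localities at this pixel
    """
    weights = [math.exp(-(d / d0_km) ** 2) for d in distances_km]
    total = sum(weights)
    if total == 0.0:
        # All weights underflowed to zero: no respondent is close enough.
        return None
    return sum(w * b for w, b in zip(weights, levels)) / total


if __name__ == "__main__":
    # Two respondents at equal distance contribute equally.
    print(smoothed_usage([10.0, 10.0], [1, 5], 20.0))
```

Because the weight falls off as exp(-(di/d0)^2), a respondent ten typical distances away contributes a factor of about exp(-100) and is effectively invisible.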

Active migration is typical of large cities, so respondents from large cities do not correctly represent the vocabulary of the surrounding areas. Therefore, for many cities, usually regional centers, we calculated the usage level separately, as the arithmetic average over the respondents from that city. These cities are marked by circles filled with a uniform color; the size of each circle corresponds to the city's official area. See below the heatmaps of the usage level distribution for different words. The words in square brackets are phonetic spellings of the dialect words.
small_dialect1.jpg
[andruty] - waffle

small_dialect3.jpg
[aja] - yes

small_dialect4.jpg
[bal', bal'vytysia] - feast

small_dialect5.jpg
[bal'on] - ball
See the maps for the rest of the words.

3. Clustering


Using clustering techniques, we divided the words into groups, taking into account only their population distributions. As the distance between the j-th and k-th words we used the Pearson distance:
pears_dist.png,
where the Pearson correlation coefficient between the Bj and Bk populations is
ro.png.
The summation runs over all data pixels of the given word maps, and σ is the standard deviation of B. The Pearson distance lies in the range [0, 2]. An example of two words with a small distance between them is [kl'osh] - fruit vase and [rombambar] - rhubarb, with D=0.08. They have very similar population distributions; see the maps below. Presumably they originated in, or came to, this area in the same way and in the same epoch:
new_dialect253.png new_dialect137.png

An example of a pair of words with a large distance, close to 2, is [trempel'] - clothes hanger and [lekvar] - jam:
new_dialect595.png new_dialect514.png
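
The Pearson distance between two word maps can be sketched as follows. The sketch assumes the standard form D = 1 - ρ, which agrees with the stated range [0, 2], and treats each map as a flat list of pixel values over the same pixels; the function name is hypothetical.

```python
import math


def pearson_distance(map_j, map_k):
    """Pearson distance 1 - rho between two usage-level maps.

    map_j, map_k: usage levels B of two words over the same pixels.
    Raises ZeroDivisionError if one of the maps is constant (sigma = 0).
    """
    n = len(map_j)
    mean_j = sum(map_j) / n
    mean_k = sum(map_k) / n
    cov = sum((a - mean_j) * (b - mean_k) for a, b in zip(map_j, map_k)) / n
    sigma_j = math.sqrt(sum((a - mean_j) ** 2 for a in map_j) / n)
    sigma_k = math.sqrt(sum((b - mean_k) ** 2 for b in map_k) / n)
    rho = cov / (sigma_j * sigma_k)
    return 1.0 - rho


if __name__ == "__main__":
    # Toy maps: the second is a scaled copy of the first, the third is reversed.
    print(pearson_distance([1, 2, 3], [2, 4, 6]))
    print(pearson_distance([1, 2, 3], [3, 2, 1]))
```

Words whose maps rise and fall together give a distance near 0, while words used in complementary areas give a distance near 2.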

Based on the Pearson distances between all pairs of words, we clustered them and built the following dendrogram:

Rplots_n.png
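
A dendrogram of this kind is the result of hierarchical (agglomerative) clustering. The sketch below shows the idea in plain Python with average linkage. The linkage rule is an assumption, since the text does not say which one was used, and the labels and distances in the example are invented for illustration only.

```python
def average_linkage(dist, labels):
    """Agglomerative clustering on a symmetric distance matrix.

    Returns the merge history as (cluster_a, cluster_b, height) tuples,
    which is the information a dendrogram displays.
    """
    index = {label: i for i, label in enumerate(labels)}
    clusters = [(label,) for label in labels]

    def cluster_distance(a, b):
        # Average linkage: mean pairwise distance between members.
        total = sum(dist[index[x]][index[y]] for x in a for y in b)
        return total / (len(a) * len(b))

    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


if __name__ == "__main__":
    # Invented Pearson distances between three hypothetical words a, b, c.
    labels = ["a", "b", "c"]
    dist = [
        [0.0, 0.1, 1.8],
        [0.1, 0.0, 1.9],
        [1.8, 1.9, 0.0],
    ]
    for left, right, height in average_linkage(dist, labels):
        print(left, right, height)
```

In this toy example the two similar words merge first at a low height, and the dissimilar word joins them only near the top of the tree.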



4. Variety of names for some ordinary things


There are a number of concepts and things that are called differently in different regions. To identify these regions we used the heatmaps of usage levels described above. For each point on the map, we chose the word with the maximum usage level among the dialect variants for a given thing or concept. Of course, this does not exclude that people in this place also use synonyms or other dialect names for it. Below are maps of the different names for bicycle, stork, potato, potato pancake, frying pan, sheet pan, rag and attic.

velosyped.png
bicycle

leleka2.png
stork

small_kartoplia.png
potato

small_deruny2.png
potato pancake
skovoroda.png
frying pan

small_deka.png
sheet pan


small_ganchirka.png
rag

small_goryshche.png
attic
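
The per-point choice of the dominant name described at the start of this section can be sketched as follows. The sketch assumes that the smoothed usage maps of all variants share the same pixel grid; the function name and the variant labels are hypothetical, and the numbers are invented.

```python
def dominant_variant_map(maps):
    """For each pixel, pick the variant with the highest usage level.

    maps: dict mapping a variant name to its 2D grid (list of rows)
          of smoothed usage levels; all grids have the same shape.
    Ties go to the variant listed first in the dict.
    """
    variants = list(maps)
    first = maps[variants[0]]
    n_rows, n_cols = len(first), len(first[0])
    return [
        [max(variants, key=lambda v: maps[v][r][c]) for c in range(n_cols)]
        for r in range(n_rows)
    ]


if __name__ == "__main__":
    # Invented 1x2 grids for two hypothetical names of the same thing.
    maps = {
        "variant_a": [[4.0, 1.5]],
        "variant_b": [[2.0, 3.0]],
    }
    print(dominant_variant_map(maps))
```

Each pixel of the resulting grid holds only the winning name, which is why such a map cannot show where several names coexist.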


Based on the data received, we updated the questionnaire on 5 March 2013.
We invite everyone who wishes, regardless of place of residence or age, to take part in the survey.
To do so, download the file, put a score from 1 to 5 next to each word according to
the usage scale given, and state your year of birth and the locality where you lived
until you finished school. Then send the completed questionnaire and this information
to dialectstat@gmail.com

Those who already completed the questionnaire before 5 March 2013 need only fill in its second part.

Author: Andrii Elyiv