Collocations Dictionary of Modern Slovene

The bibliographical data of the Collocations Dictionary of Modern Slovene can be found at the link below.

Current version

1.0

Date of last update:
16. 10. 2018

Licence

The dictionary database is available at the CLARIN.SI repository.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International licence.

About the Collocations Dictionary

Collocations are typical co-occurrences of words and represent an important part of language. By providing information on what is typical in language, collocations dictionaries are useful in language production and acquisition. The Collocations Dictionary of Modern Slovene, which contains 35,989 headwords and 7,717,561 collocations, is the first dictionary of collocations for Slovene and represents the first step towards filling the gap in the field of language resources for Slovene, particularly those aimed at facilitating language production.

The Collocations Dictionary of Modern Slovene is characterized by a phase-based entry representation, collocational data in context, and numerous filtering and sorting options. Unlike similar resources for other languages, the interface of the dictionary focuses on collocations, while other information (senses, syntactic structures, etc.) is provided as filter options.
The dictionary has been compiled using advanced computational methods for the automatic extraction of Slovene collocations, which have already been tested and evaluated and are constantly being improved. Computer-assisted data preparation is significantly cheaper and less time-consuming than manual processing. This enables regular updates and upgrades to the resource, making the dictionary a dynamic source of language information.

The Collocations Dictionary of Modern Slovene is the second responsive dictionary published in Slovenia (the first being the Thesaurus of Modern Slovene). With responsive dictionaries, the language community is given access to a large amount of relevant, albeit somewhat noisy, language information as soon as the dictionary database has been compiled. One of the main advantages of responsive dictionaries is that the data can be quickly updated to reflect both progress in the database and changes in language use.

In the Collocations Dictionary of Modern Slovene, the phases of its development are determined in advance and clearly visualized in the interface. The information on the phase of a particular entry is provided by the pyramid icon (see figure to the right). The developmental phases are the following:

  • Phase 1: The data contained is automatically extracted and contains noise.
  • Phase 2: Syntactic structures with too much noise have been removed. The same is true of collocates that mostly occur in inadequate collocations.
  • Phase 3: The collocations that have been manually identified as inadequate have been removed.
  • Phase 4: All collocations and their examples of use have been categorized into senses.
  • Phase 5: The entry has been finalized.
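
The five phases listed above can be modelled as an ordered enumeration. The following is only a minimal sketch: the class name, the member names, and the helper function are hypothetical and do not come from the dictionary's actual database or interface. It also assumes that entries move through the phases in order, so that a later phase implies all earlier steps have been completed.

```python
from enum import IntEnum


class EntryPhase(IntEnum):
    """Developmental phase of a dictionary entry (all names are hypothetical)."""

    AUTOMATIC = 1          # automatically extracted data, contains noise
    NOISE_FILTERED = 2     # overly noisy structures and collocates removed
    MANUALLY_CLEANED = 3   # manually identified inadequate collocations removed
    SENSE_CATEGORIZED = 4  # collocations and examples categorized into senses
    FINALIZED = 5          # entry finalized


def is_manually_reviewed(phase: EntryPhase) -> bool:
    """Return True once manual removal of inadequate collocations is done.

    Assumes phases are cumulative, i.e. phase 3 or later.
    """
    return phase >= EntryPhase.MANUALLY_CLEANED
```

Using an ordered enumeration makes comparisons such as "at least phase 3" straightforward, which is the kind of check a filter on entry quality would need.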

The pyramid and the timestamp, along with archived previous versions of the dictionary database, make it possible to track any changes in dictionary entries.

Although automatic data extraction and ranking are never perfectly accurate, the results are useful to dictionary users even before lexicographic post-processing. This has been confirmed by resources and tools for other languages (e.g. Merriam-Webster, the Digital Dictionary of German DWDS), which include automatically extracted data in their entries. An example of a successful automatically generated language resource for Slovene is the Thesaurus of Modern Slovene. The effectiveness of automatic collocation extraction has also been confirmed by a linguistic evaluation, during which the automatically extracted collocations in the ten most frequent syntactic structures of 333 headwords were rated as adequate or inadequate (see the table below).

Understanding collocations and their use requires context, which is why all collocations in the dictionary are accompanied by examples of use taken from real texts. The examples have been imported into the dictionary using automatic methods for identifying good (dictionary) examples. In addition, each collocation has a link to the Gigafida corpus of Slovene, which allows the user to further investigate language use in context.

The Collocations Dictionary of Modern Slovene is part of an organized effort to establish an infrastructure for Slovene that is comparable to the infrastructures of larger languages. We believe that, in terms of methodology, the construction of language resources should follow the contemporary zeitgeist and that all data prepared through publicly financed initiatives and projects should be openly accessible for the further development of language technologies, considering the actual needs of modern language users in the digital age. The process of constructing the Collocations Dictionary of Modern Slovene thus also puts considerable effort into establishing a dedicated community that not only uses the dictionary, but also contributes to its development.

    Grammatical relation                      % of good collocations
 1. adjective + noun                          88.9
 2. noun + noun (genitive)                    84.9
 3. verb + noun (accusative)                  87.0
 4. adverb + verb                             87.7
 5. adverb + adjective                        63.6
 6. noun + prep. "v" + noun (locative)        64.2
 7. verb + adverb                             59.6
 8. verb + prep. "v" + noun (locative)        86.0
 9. noun + prep. "s/z" + noun (instrumental)  74.7
10. verb + prep. "s/z" + noun (instrumental)  92.7
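
The evaluation figures in the table above can be summarized with a few lines of Python. This is an illustrative sketch only: the variable and function names are hypothetical, and the mean is a simple unweighted average over the ten structures. The source does not report how many collocations were rated per structure, so this figure should not be read as the overall share of good collocations.

```python
# Percentages of collocations rated as good, copied from the table above.
RATINGS = {
    'adjective + noun': 88.9,
    'noun + noun (genitive)': 84.9,
    'verb + noun (accusative)': 87.0,
    'adverb + verb': 87.7,
    'adverb + adjective': 63.6,
    'noun + prep. "v" + noun (locative)': 64.2,
    'verb + adverb': 59.6,
    'verb + prep. "v" + noun (locative)': 86.0,
    'noun + prep. "s/z" + noun (instrumental)': 74.7,
    'verb + prep. "s/z" + noun (instrumental)': 92.7,
}


def unweighted_mean(values):
    """Plain arithmetic mean; each structure counts equally."""
    values = list(values)
    return sum(values) / len(values)


best = max(RATINGS, key=RATINGS.get)    # structure with the highest share
worst = min(RATINGS, key=RATINGS.get)   # structure with the lowest share
mean = round(unweighted_mean(RATINGS.values()), 2)

print(f"best:  {best} ({RATINGS[best]})")
print(f"worst: {worst} ({RATINGS[worst]})")
print(f"unweighted mean: {mean}")
```

The spread between the best and worst structures (92.7 versus 59.6) illustrates why phase 2 removes whole syntactic structures that contain too much noise.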

Publications

KOSEM, Iztok, KREK, Simon, GANTAR, Polona, ARHAR HOLDT, Špela, ČIBEJ, Jaka, LASKOWSKI, Cyprian. Kolokacijski slovar sodobne slovenščine. In: FIŠER, Darja (ed.), PANČUR, Andrej (ed.). Zbornik konference Jezikovne tehnologije in digitalna humanistika / Proceedings of the conference on Language Technologies & Digital Humanities, 20-21 September 2018, Ljubljana. Ljubljana: Znanstvena založba Filozofske fakultete v Ljubljani. 2018, pp. 133-139, http://www.sdjt.si/wp/wp-content/uploads/2018/09/JTDH-2018_Kosem-et-al_Kolokacijski-slovar-sodobne-slovenscine.pdf.

KOSEM, Iztok, KREK, Simon, GANTAR, Polona, ARHAR HOLDT, Špela, ČIBEJ, Jaka, LASKOWSKI, Cyprian. Collocations dictionary of modern Slovene. In: ČIBEJ, Jaka (ed.), et al. Proceedings of the 18th EURALEX International Congress: lexicography in global contexts, 17-21 July 2018, Ljubljana. Ljubljana: Ljubljana University Press, Faculty of Arts. 2018, pp. 989-997, illus. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/118/211/3000-1.

KOSEM, Iztok, KOPPEL, Kristina, ZINGANO KUHN, Tanara, MICHELFEIT, Jan, TIBERIUS, Carole. Identification and automatic extraction of good dictionary examples: the case(s) of GDEX. International journal of lexicography, https://academic.oup.com/ijl/advance-article/doi/10.1093/ijl/ecy014/5075863.

GANTAR, Polona, GORJANC, Vojko, KOSEM, Iztok, KREK, Simon. Going semi-automatic and crowdsourced: collocation dictionary of Slovene. In: KOSEM, Iztok (ed.). Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana: Trojina, Institute for Applied Slovene Studies; Brighton: Lexical Computing. 2015, p. 37.

GORJANC, Vojko, GANTAR, Polona, KOSEM, Iztok, KREK, Simon (eds.) Slovar sodobne slovenščine: problemi in rešitve. Ljubljana: Znanstvena založba Filozofske fakultete. 2015. Partially translated in: GORJANC, Vojko, GANTAR, Polona, KOSEM, Iztok, KREK, Simon (eds.) Dictionary of modern Slovene: problems and solutions. Ljubljana: Ljubljana University Press, Faculty of Arts, 2017. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/15

GANTAR, Polona, KOSEM, Iztok, KREK, Simon. Discovering automated lexicography: the case of the Slovene lexical database. International journal of lexicography, 2016, vol. 29, issue 2, pp. 200-225. https://academic.oup.com/ijl/article/29/2/200/2413284/Discovering-Automated-Lexicography-The-Case-of-the?guestAccessKey=95f18766-f10f-4994-a6fa-448cf75ac55e

KOSEM, Iztok, GANTAR, Polona, KREK, Simon. Avtomatizacija leksikografskih postopkov. In: ERJAVEC, Tomaž (ed.), ŽGANEC GROS, Jerneja (ed.). Jezikovne tehnologije, Slovenščina 2.0, vol. 1, no. 2. Ljubljana: Trojina, zavod za uporabno slovenistiko. 2013, pp. 139-164. http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_07.pdf

ČIBEJ, Jaka, FIŠER, Darja, KOSEM, Iztok. The role of crowdsourcing in lexicography. In: KOSEM, Iztok (ed.), et al. Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana: Trojina, Institute for Applied Slovene Studies; Brighton: Lexical Computing. 2015, pp. 70-83. https://elex.link/elex2015/proceedings/eLex_2015_05_Cibej+Fiser+Kosem.pdf

ARHAR HOLDT, Špela, ČIBEJ, Jaka, ZWITTER VITEZ, Ana. Value of language-related questions and comments in digital media for lexicographical user research. International journal of lexicography, 2017, vol. 30, issue 3, pp. 285-308. http://ijl.oxfordjournals.org/content/early/2016/04/20/ijl.ecw017.full.pdf?keytype=ref&ijkey=SP5Yb4PHvfykRkk.

ARHAR HOLDT, Špela, KOSEM, Iztok, GANTAR, Polona. Dictionary user typology: the Slovenian case. In: MARGALITADZE, Tinatin (ed.), MELADZE, George (ed.). Lexicography and linguistic diversity: proceedings of the XVII EURALEX International Congress. Tbilisi: Ivane Javakhishvili Tbilisi State University. 2016, pp. 179-187. http://euralex2016.tsu.ge/publication2016.pdf

KOSEM, Iztok, GANTAR, Polona, KREK, Simon. Automation of lexicographic work: an opportunity for both lexicographers and crowd-sourcing. In: KOSEM, Iztok (ed.), et al. Electronic lexicography in the 21st century: thinking outside the paper. Ljubljana: Trojina, Institute for Applied Slovene Studies; Tallinn: Eesti Keele Instituut. 2013, pp. 32-48. http://eki.ee/elex2013/proceedings/eLex2013_03_Kosem+Gantar+Krek.pdf

KOSEM, Iztok, HUSAK, Milos, MCCARTHY, Diana. GDEX for Slovene. In: KOSEM, Iztok (ed.), KOSEM, Karmen (ed.). Electronic lexicography in the 21st century: new applications for new users. Ljubljana: Trojina, Institute for Applied Slovene Studies. 2011, pp. 150-159. http://www.trojina.si/elex2011/elex2011_proceedings.pdf

LOGAR, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela, KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES : gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko: Fakulteta za družbene vede, 2012.

KREK, Simon, GANTAR, Polona, ARHAR HOLDT, Špela, GORJANC, Vojko. Nadgradnja korpusov Gigafida, Kres, ccGigafida in ccKres. In: ERJAVEC, Tomaž (ed.), FIŠER, Darja (ed.). Zbornik konference Jezikovne tehnologije in digitalna humanistika. Ljubljana: Znanstvena založba Filozofske fakultete. 2016, pp. 200-202. http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Krek-et-al_Nadgradnja-korpusov-Gigafida-Kres-ccGigafida-ccKres.pdf

KREK, Simon, KOSEM, Iztok, GANTAR, Polona. Predlog za izdelavo Slovarja sodobnega slovenskega jezika. Edition 1.1. Ljubljana: s. n., 2013. http://www.sssj.si/datoteke/Predlog_SSSJ_v1.1.pdf

KREK, Simon, LASKOWSKI, Cyprian, ROBNIK-ŠIKONJA, Marko. From translation equivalents to synonyms: creation of a Slovene thesaurus using word co-occurrence network analysis. In: KOSEM, Iztok (ed.) et al., Proceedings of eLex 2017: Lexicography from Scratch, 19-21 September 2017, Leiden, Netherlands. https://elex.link/elex2017/wp-content/uploads/2017/09/paper05.pdf

The data for the Collocations Dictionary of Modern Slovene was prepared by an interdisciplinary group of researchers at the Centre for Language Resources and Technologies of the University of Ljubljana.

The development of the Collocations Dictionary was financed by two infrastructural programs: CJVT at the University of Ljubljana and the Centre for Applied Linguistics at the Trojina Institute. Research was funded by the ARRS J6-8255 research project (Collocations as a basis for language description: semantic and temporal perspectives) and the ARRS P6-0215 research program (Slovene language – basic, contrastive, and applied studies).

The interface was developed by Studio Kruh in collaboration with Leon Noe Jovan.
