Thesaurus of Modern Slovene

The bibliographical data of the Thesaurus of Modern Slovene can be found at the link below.

Current version

1.0

Date of last update:
26. 3. 2018

Licence

The Thesaurus database is available at the CLARIN.SI repository.

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International licence.

About the Thesaurus

In its current version, the Thesaurus of Modern Slovene contains 105,473 keywords and 368,117 synonyms, making it the largest automatically generated open-access collection of Slovene synonyms. Unlike other similar language resources, the Thesaurus is based on a range of different databases and enables users to compare different synonyms and check their use in the Gigafida reference corpus of modern Slovene.

The Thesaurus was constructed using an advanced computational approach, which is innovative even in the international lexicographical context. Computer-assisted data preparation is significantly cheaper and less time-consuming than manual processing. This enables regular updates and upgrades to the resource, making the dictionary a dynamic source of language information.

With the Thesaurus of Modern Slovene, we are introducing a new type of dictionary called the responsive dictionary. The initial database of a responsive dictionary is constructed using advanced computational methods, instantly providing the language community with a large amount of relevant, albeit still somewhat noisy language information. A responsive dictionary is characterized by two more key traits: first, its database is openly accessible, and second, it provides a number of ways for the language community to improve the database and clean up noisy elements. This means that the construction of a responsive dictionary is never truly concluded as its data constantly evolves in accordance with changes in the modern language. All changes can be tracked using timestamps in individual entries, while the different versions of the database are stored in a dedicated archive. The responsive dictionary takes its name from the fact that the approach to its construction allows the data to continuously respond to the opinions of the contributing language community and the changes in language originating from text produced by the language community. Essentially, it is “a dictionary made by the community for the community”.

The Thesaurus of Modern Slovene is based on the data contained in two principal language resources: the Oxford®-DZS Comprehensive English-Slovenian Dictionary and the Gigafida reference corpus of written Slovene. Both resources contain language material created after 1991 and as such offer a description of modern Slovene. The links identified between synonyms were additionally confirmed using the older Dictionary of Standard Slovenian Language (SSKJ). The data extraction and structure for the Thesaurus were based on the frequency and manner in which words co-occur in translation strings of the Oxford-DZS Dictionary. This information is the basis for discriminating between ‘core’ and ‘near’ synonyms, with ‘core’ synonyms exhibiting a greater degree of connection to the keyword. In the following step, an approach combining balanced co-occurrence graphs and the Personalized PageRank algorithm automatically divides the synonyms into subgroups and ranks them according to the degree of semantic relatedness to the keyword, as well as their frequency in language use. For a more detailed description of this methodology, see Krek et al. (2017).
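
The ranking step can be illustrated with a minimal sketch. It is not the implementation used for the Thesaurus: the balancing of the graph and the division into subgroups described in Krek et al. (2017) are left out, and the function names, the damping value and the toy translation strings are assumptions made for illustration only. The sketch builds a weighted co-occurrence graph from translation strings and runs Personalized PageRank, i.e. a random walk that always restarts at the keyword, so that words more strongly connected to the keyword receive higher scores.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy data: each list stands for the Slovene words that
# appear together in one translation string. Not taken from the dictionary.
TOY_STRINGS = [
    ["hiter", "uren"],
    ["hiter", "uren"],
    ["hiter", "nagel"],
    ["nagel", "nenaden"],
]


def build_cooccurrence_graph(translation_strings):
    """Count how often two words appear in the same translation string."""
    graph = defaultdict(lambda: defaultdict(float))
    for words in translation_strings:
        for a, b in combinations(sorted(set(words)), 2):
            graph[a][b] += 1.0
            graph[b][a] += 1.0
    return graph


def personalized_pagerank(graph, keyword, damping=0.85, iterations=50):
    """Power iteration in which the walk always restarts at the keyword."""
    nodes = list(graph)
    rank = {n: 1.0 if n == keyword else 0.0 for n in nodes}
    for _ in range(iterations):
        # Restart mass goes to the keyword only (the "personalized" part).
        new_rank = {n: (1.0 - damping) if n == keyword else 0.0 for n in nodes}
        for node in nodes:
            total_weight = sum(graph[node].values())
            for neighbour, weight in graph[node].items():
                new_rank[neighbour] += damping * rank[node] * weight / total_weight
        rank = new_rank
    return rank


def rank_synonyms(translation_strings, keyword):
    """Return candidate synonyms ordered by their score for the keyword."""
    graph = build_cooccurrence_graph(translation_strings)
    scores = personalized_pagerank(graph, keyword)
    candidates = [word for word in scores if word != keyword]
    return sorted(candidates, key=lambda word: scores[word], reverse=True)
```

With the toy data, rank_synonyms(TOY_STRINGS, "hiter") places "uren" before "nagel" and "nenaden", since "uren" co-occurs with the keyword twice, "nagel" once, and "nenaden" is linked to the keyword only through "nagel".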

Automatic data extraction and ranking is never perfectly accurate, which can also be observed in related projects for other languages. However, evaluations show that the method is reliable enough to provide results that are useful for the dictionary user even before lexicographic post-processing. The graph to the right shows the results of a linguistic evaluation in which the automatically extracted synonyms for a given keyword were rated as good, acceptable or poor. It should be noted, though, that the concept of synonymy is difficult to define and heavily dependent on the context and the circumstances of language use. Thus, even for humans, rating synonyms is not a one-dimensional task.

Understanding synonymy requires context, which is why the Thesaurus provides the user with a number of ways to compare synonyms using corpus data. An important novelty for Slovene language resources in this regard is the option to compare how different synonyms are actually used with the help of collocations (typical word co-occurrences). In addition, examples of use are imported into the dictionary using computational methods for the automatic recognition of good (dictionary) examples. Collocations and examples of use are included in most entries, and every entry also provides links to the Gigafida corpus, which allows for a more detailed analysis of modern language use. Furthermore, domain labels were added from the Oxford-DZS Comprehensive English-Slovenian Dictionary, which help explain the context of use for individual synonyms. The current version of the Thesaurus contains no other types of labels.
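
The idea of comparing synonyms through their collocations can be sketched as follows. This is only an illustration under stated assumptions, not the way the Thesaurus interface computes its comparison: the function name, the data layout and the collocate counts are invented for the example. The sketch separates the collocates that two synonyms share from those that are typical of only one of them.

```python
# Hypothetical collocate counts for two synonyms; the numbers are invented
# and do not come from the Gigafida corpus.
TOY_COLLOCATES = {
    "hiter": {"avto": 120, "tempo": 80, "odgovor": 60},
    "uren": {"korak": 40, "odgovor": 10},
}


def compare_collocates(collocates_a, collocates_b):
    """Split the collocates of two words into shared and distinctive ones."""
    words_a, words_b = set(collocates_a), set(collocates_b)
    return {
        "shared": sorted(words_a & words_b),
        "only_a": sorted(words_a - words_b),
        "only_b": sorted(words_b - words_a),
    }


comparison = compare_collocates(TOY_COLLOCATES["hiter"], TOY_COLLOCATES["uren"])
```

In this toy case the two words share the collocate "odgovor", while "avto" and "tempo" appear only with "hiter" and "korak" only with "uren". Distinctive collocates of this kind are what helps a user decide which synonym fits a given context.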

The Thesaurus of Modern Slovene is part of an organized effort to establish an infrastructure for Slovene that is comparable to the infrastructures of larger languages. We believe that, in terms of methodology, the construction of language resources should follow the contemporary zeitgeist and that all data prepared through publicly financed initiatives and projects should be openly accessible for the further development of language technologies, considering the actual needs of modern language users in the digital age. The process of constructing the Thesaurus of Modern Slovene thus also puts considerable effort into establishing a dedicated community that not only uses the dictionary, but also contributes to its development.

Publications

ARHAR HOLDT, Špela, ČIBEJ, Jaka, DOBROVOLJC, Kaja, GANTAR, Apolonija, GORJANC, Vojko, KLEMENC, Bojan, KOSEM, Iztok, KREK, Simon, LASKOWSKI, Cyprian, ROBNIK ŠIKONJA, Marko. Thesaurus of Modern Slovene: By the Community for the Community. In: Čibej, Jaka, Vojko Gorjanc, Iztok Kosem, Simon Krek (eds.). Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts. ISBN 978-961-06-0097-8. 1st ed. Ljubljana: Znanstvena založba Filozofske fakultete. 2018, pp. 401-410. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/118/211/3000-1

KREK, Simon, LASKOWSKI, Cyprian, ROBNIK-ŠIKONJA, Marko. From translation equivalents to synonyms: creation of a Slovene thesaurus using word co-occurrence network analysis. In: KOSEM, Iztok (ed.) et al., Proceedings of eLex 2017: Lexicography from Scratch, 19-21 September 2017, Leiden, Netherlands. https://elex.link/elex2017/wp-content/uploads/2017/09/paper05.pdf

GORJANC, Vojko, GANTAR, Polona, KOSEM, Iztok, KREK, Simon (eds.) Slovar sodobne slovenščine: problemi in rešitve. Ljubljana: Znanstvena založba Filozofske fakultete. 2015. Partially translated in: GORJANC, Vojko, GANTAR, Polona, KOSEM, Iztok, KREK, Simon (eds.) Dictionary of modern Slovene: problems and solutions. Ljubljana: Ljubljana University Press, Faculty of Arts, 2017. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/15

GANTAR, Polona, KOSEM, Iztok, KREK, Simon. Discovering automated lexicography: the case of the Slovene lexical database. International journal of lexicography, 2016, vol. 29, issue 2, pp. 200-225. https://academic.oup.com/ijl/article/29/2/200/2413284/Discovering-Automated-Lexicography-The-Case-of-the?guestAccessKey=95f18766-f10f-4994-a6fa-448cf75ac55e

KOSEM, Iztok, GANTAR, Polona, KREK, Simon. Avtomatizacija leksikografskih postopkov. In: ERJAVEC, Tomaž (ed.), ŽGANEC GROS, Jerneja (ed.). Jezikovne tehnologije, Slovenščina 2.0, vol. 1, no. 2. Ljubljana: Trojina, zavod za uporabno slovenistiko. 2013, pp. 139-164. http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_07.pdf

ČIBEJ, Jaka, FIŠER, Darja, KOSEM, Iztok. The role of crowdsourcing in lexicography. In: KOSEM, Iztok (ed.), et al. Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana: Trojina, Institute for Applied Slovene Studies; Brighton: Lexical Computing. 2015, pp. 70-83. https://elex.link/elex2015/proceedings/eLex_2015_05_Cibej+Fiser+Kosem.pdf

ARHAR HOLDT, Špela, ČIBEJ, Jaka, ZWITTER VITEZ, Ana. Value of language-related questions and comments in digital media for lexicographical user research. International journal of lexicography, 2017, vol. 30, issue 3, pp. 285-308. http://ijl.oxfordjournals.org/content/early/2016/04/20/ijl.ecw017.full.pdf?keytype=ref&ijkey=SP5Yb4PHvfykRkk.

ARHAR HOLDT, Špela, KOSEM, Iztok, GANTAR, Polona. Dictionary user typology: the Slovenian case. In: MARGALITADZE, Tinatin (ed.), MELADZE, George (ed.). Lexicography and linguistic diversity: proceedings of the XVII EURALEX International Congress. Tbilisi: Ivane Javakhishvili Tbilisi State University. 2016, pp. 179-187. http://euralex2016.tsu.ge/publication2016.pdf

GANTAR, Polona, GORJANC, Vojko, KOSEM, Iztok, KREK, Simon. Going semi-automatic and crowdsourced: collocation dictionary of Slovene. In: KOSEM, Iztok (ed.). Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana: Trojina, Institute for Applied Slovene Studies; Brighton: Lexical Computing. 2015, p. 37.

KOSEM, Iztok, GANTAR, Polona, KREK, Simon. Automation of lexicographic work: an opportunity for both lexicographers and crowd-sourcing. In: KOSEM, Iztok (ed.), et al. Electronic lexicography in the 21st century: thinking outside the paper. Ljubljana: Trojina, Institute for Applied Slovene Studies; Tallinn: Eesti Keele Instituut. 2013, pp. 32-48. http://eki.ee/elex2013/proceedings/eLex2013_03_Kosem+Gantar+Krek.pdf

KOSEM, Iztok, HUSAK, Milos, MCCARTHY, Diana. GDEX for Slovene. In: KOSEM, Iztok (ed.), KOSEM, Karmen (ed.). Electronic lexicography in the 21st century: new applications for new users. Ljubljana: Trojina, Institute for Applied Slovene Studies. 2011, pp. 150-159. http://www.trojina.si/elex2011/elex2011_proceedings.pdf

LOGAR, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela, KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES : gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko: Fakulteta za družbene vede, 2012.

KREK, Simon, GANTAR, Polona, ARHAR HOLDT, Špela, GORJANC, Vojko. Nadgradnja korpusov Gigafida, Kres, ccGigafida in ccKres. In: ERJAVEC, Tomaž (ed.), FIŠER, Darja (ed.). Zbornik konference Jezikovne tehnologije in digitalna humanistika. Ljubljana: Znanstvena založba Filozofske fakultete. 2016, pp. 200-202. http://www.sdjt.si/wp/wp-content/uploads/2016/09/JTDH-2016_Krek-et-al_Nadgradnja-korpusov-Gigafida-Kres-ccGigafida-ccKres.pdf

KREK, Simon, KOSEM, Iztok, GANTAR, Polona. Predlog za izdelavo Slovarja sodobnega slovenskega jezika. Ed. 1.1. Ljubljana: s. n., 2013. http://www.sssj.si/datoteke/Predlog_SSSJ_v1.1.pdf

The data for the Thesaurus of Modern Slovene was prepared by an interdisciplinary group of researchers at the Centre for Language Resources and Technologies of the University of Ljubljana.

The development of the Thesaurus was financed by the CJVT infrastructural program. Research was funded by the ARRS P6-0215 research program (Slovene language – basic, contrastive, and applied studies).

The interface was developed by Studio Kruh in collaboration with Leon Noe Jovan.