Thesaurus of Modern Slovene
In its current version, the Thesaurus of Modern Slovene contains 100,837 keywords and 362,828 synonyms, making it the largest automatically generated open-access collection of Slovene synonyms.
The current version is 2.0.
Date of publication: 26 March 2023
The Thesaurus of Modern Slovene database is available at the Clarin.si repository.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International licence.
Unlike other similar language resources, the Thesaurus is based on a range of different databases and enables users to compare different synonyms and check their use in the Gigafida 2.0 reference corpus of modern Slovene.
The Thesaurus was constructed using an advanced computational approach, which is innovative even in the international lexicographical context. Computer-assisted data preparation is significantly cheaper and faster than manual processing. This enables regular updates and upgrades to the resource, making the dictionary a dynamic source of language information.
With the Thesaurus of Modern Slovene, we are introducing a new type of dictionary called the responsive dictionary. The initial database of a responsive dictionary is constructed using advanced computational methods, instantly providing the language community with a large amount of relevant, albeit still somewhat noisy, language information. A responsive dictionary is characterized by two further key traits: first, its database is openly accessible, and second, it provides a number of ways for the language community to improve the database and clean up noisy elements. This means that the construction of a responsive dictionary is never truly concluded, as its data constantly evolves in accordance with changes in the modern language. All changes can be tracked using timestamps in individual entries, while the different versions of the database are stored in a dedicated archive. The responsive dictionary takes its name from the fact that the approach to its construction allows the data to continuously respond to the opinions of the contributing language community and to the changes in language reflected in the texts that the community produces. Essentially, it is “a dictionary made by the community for the community” (Arhar Holdt et al., 2018).
The Thesaurus of Modern Slovene is based on the data contained in two principal language resources: the Oxford®-DZS Comprehensive English-Slovenian Dictionary and the Gigafida reference corpus of written Slovene. Both resources contain language material created after 1991 and as such offer a description of modern Slovene. The links identified between synonyms were additionally confirmed using the older Dictionary of Standard Slovenian Language (SSKJ). The data extraction and structure for the Thesaurus were based on the frequency and manner in which words co-occur in translation strings of the Oxford-DZS Dictionary. This information is the basis for discriminating between ‘core’ and ‘near’ synonyms, with ‘core’ synonyms exhibiting a greater degree of connection to the keyword. In the following step, an approach combining balanced co-occurrence graphs and the Personalized PageRank algorithm automatically divides the synonyms into subgroups and ranks them according to the degree of semantic relatedness to the keyword, as well as their frequency in language use. For a more detailed description of this methodology, see Krek et al. (2017).
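The ranking idea described above can be illustrated with a minimal sketch of personalized PageRank on a small weighted co-occurrence graph. This is not the Thesaurus pipeline itself: the edge weights below are invented for illustration, the function name `personalized_pagerank` and its parameters are assumptions, and the sketch covers only ranking by relatedness to the keyword, not the division into subgroups or the use of corpus frequency.

```python
from collections import defaultdict


def personalized_pagerank(edges, seed, damping=0.85, iterations=100):
    """Score nodes of a weighted undirected graph by relatedness to `seed`."""
    # Build a symmetric weighted adjacency map from (word, word, weight) triples.
    neighbours = defaultdict(dict)
    for a, b, weight in edges:
        neighbours[a][b] = neighbours[a].get(b, 0.0) + weight
        neighbours[b][a] = neighbours[b].get(a, 0.0) + weight
    if seed not in neighbours:
        raise KeyError(f"seed {seed!r} does not occur in the graph")

    nodes = list(neighbours)
    # Start with all probability mass on the seed (the keyword).
    rank = {node: 1.0 if node == seed else 0.0 for node in nodes}
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in nodes}
        for node in nodes:
            total = sum(neighbours[node].values())
            for other, weight in neighbours[node].items():
                # Follow an edge with probability proportional to its weight.
                new_rank[other] += damping * rank[node] * weight / total
        # The restart always returns to the seed, which is what makes
        # this PageRank "personalized" to one keyword.
        new_rank[seed] += 1.0 - damping
        rank = new_rank
    return rank


# Invented toy data: weights stand in for how often two words share
# a translation string; they are NOT taken from the Thesaurus database.
edges = [
    ("hiša", "dom", 3),
    ("hiša", "stavba", 2),
    ("hiša", "poslopje", 1),
    ("stavba", "poslopje", 2),
    ("dom", "domovanje", 1),
]

ranks = personalized_pagerank(edges, seed="hiša")
# Candidate synonyms ordered from most to least related to the keyword.
ranked = sorted((w for w in ranks if w != "hiša"), key=ranks.get, reverse=True)
print(ranked)
```

Because the random walk restarts at the keyword, words that are connected to it directly and strongly receive higher scores than words reachable only through intermediate nodes.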
Co-occurrence graph for the word hiša (‘house’)
Automatic data extraction and ranking is never perfectly accurate, as related projects for other languages also show. Evaluations nevertheless indicate that the method is reliable enough to give dictionary users useful results even before lexicographic post-processing. The accompanying graph shows the results of a linguistic evaluation in which the automatically extracted synonyms for a given keyword were rated as good, acceptable or poor. It should be noted, however, that synonymy is difficult to define and depends heavily on the context and circumstances of language use, so rating synonyms is not a one-dimensional task even for humans.
Linguistic evaluation of synonyms for a given headword.
Understanding synonymy requires context, which is why the Thesaurus provides the user with a number of ways to compare synonyms using corpus data. An important novelty for Slovene language resources in this regard is the option to compare the use of different synonyms in real language use with the help of collocations (typical word co-occurrences). In addition, examples of use are imported into the dictionary using computational methods for the automatic recognition of good (dictionary) examples. Collocations and examples of use are included in most entries, while all of them also provide links to the Gigafida corpus, which allows for a more detailed analysis of modern language use. Furthermore, domain labels were added from the Oxford-DZS Comprehensive English-Slovenian Dictionary, which help explain the context of use for individual synonyms. In version 2.0, labels for hateful and coarse vocabulary were also added.
Thesaurus of Modern Slovene 2.0 incorporates two types of dictionary entries. The majority of entries are prepared entirely through automated processes. However, for 3,054 headwords, we have manually analyzed and divided senses, described them with short semantic indicators, and categorized synonyms under appropriate senses. Another new feature is the inclusion of manually reviewed antonyms, which are available for 3,599 headwords. The sense-divided entries and antonyms indicate the direction in which we aim to expand the dictionary in the future.
The Thesaurus of Modern Slovene is part of an organized effort to establish an infrastructure for Slovene that is comparable to the infrastructures of larger languages. We believe that, in terms of methodology, the construction of language resources should follow the contemporary zeitgeist and that all data prepared through publicly financed initiatives and projects should be openly accessible for the further development of language technologies, considering the actual needs of modern language users in the digital age. The process of constructing the Thesaurus of Modern Slovene thus also puts considerable effort into establishing a dedicated community that not only uses the dictionary, but also contributes to its development.
Users can contribute their own suggestions for synonyms and antonyms to the dictionary. Each new suggestion appears in the dictionary immediately upon submission, and the community has the option to give it positive or negative votes. Starting from version 2.0, selected user contributions will also be included in the openly accessible Thesaurus database and the Digital Dictionary Database for Slovene. When deciding on inclusion, we will consider the following criteria: (1) Does the proposed word or phrase occur in real language use? (2) Has it been added under the appropriate headword considering its meaning? (3) Does it require a dictionary label, and if so, has a label suggestion been provided? (4) How have other users responded to the suggestion? Regardless of whether the suggestion is included in the database or not, it will remain available in the dictionary interface. In exceptional cases, entries that are malicious or similarly problematic will be removed.
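As a rough illustration of how a suggestion, its votes and the four criteria above might be recorded together, consider the following sketch. It is purely hypothetical: the `Suggestion` class, its field names and the `review_checklist` function are assumptions made for this example and do not describe the actual dictionary backend. The sketch only collects the information; the inclusion decision remains an editorial one.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    """One user-proposed synonym or antonym (hypothetical record layout)."""
    headword: str
    proposed: str
    kind: str = "synonym"           # "synonym" or "antonym"
    upvotes: int = 0
    downvotes: int = 0
    attested_in_use: bool = False   # criterion 1, judged by an editor
    fits_headword: bool = False     # criterion 2, judged by an editor
    needs_label: bool = False       # criterion 3: is a dictionary label required?
    label_suggested: bool = False   # criterion 3: was a label proposed?

    def vote(self, positive: bool) -> None:
        """Register one positive or negative community vote."""
        if positive:
            self.upvotes += 1
        else:
            self.downvotes += 1


def review_checklist(s: Suggestion) -> dict:
    """Summarize the four criteria for an editor; it does not decide anything."""
    return {
        "attested_in_use": s.attested_in_use,
        "fits_headword": s.fits_headword,
        "label_ok": (not s.needs_label) or s.label_suggested,
        "vote_balance": s.upvotes - s.downvotes,  # criterion 4
    }


# Invented example: a colloquial suggestion that would need a label.
suggestion = Suggestion("hiša", "bajta", needs_label=True)
for positive in (True, True, False):
    suggestion.vote(positive)
checklist = review_checklist(suggestion)
print(checklist)
```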
ARHAR HOLDT, Špela, GANTAR, Polona, KOSEM, Iztok, PORI, Eva, ROBNIK ŠIKONJA, Marko, KREK, Simon. Thesaurus of Modern Slovene 2.0. In: MEDVEĎ, Marek (Ed.), et al. eLex 2023: electronic lexicography in the 21st century (eLex 2023): proceedings of the eLex 2023 conference: [Brno], 27-29 June 2023. Brno: Lexical Computing CZ, 2023. Pp. 366-381. https://elex.link/elex2023/wp-content/uploads/82.pdf
ARHAR HOLDT, Špela, ČIBEJ, Jaka, DOBROVOLJC, Kaja, GANTAR, Apolonija, GORJANC, Vojko, KLEMENC, Bojan, KOSEM, Iztok, KREK, Simon, LASKOWSKI, Cyprian, ROBNIK ŠIKONJA, Marko. Thesaurus of Modern Slovene: By the Community for the Community. In: ČIBEJ, Jaka, GORJANC, Vojko, KOSEM, Iztok, KREK, Simon (Eds.). Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts (ISBN 978-961-06-0097-8). 1st ed. Ljubljana: Znanstvena založba Filozofske fakultete. 2018, pp. 401-410. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/view/118/211/3000-1
KREK, Simon, LASKOWSKI, Cyprian, ROBNIK-ŠIKONJA, Marko. From translation equivalents to synonyms: creation of a Slovene thesaurus using word co-occurrence network analysis. In: KOSEM, Iztok (Ed.), et al. Proceedings of eLex 2017: Lexicography from Scratch, 19-21 September 2017, Leiden, Netherlands. https://elex.link/elex2017/wp-content/uploads/2017/09/paper05.pdf
ARHAR HOLDT, Špela. How users responded to a responsive dictionary: the case of the Thesaurus of Modern Slovene. Rasprave Instituta za hrvatski jezik i jezikoslovlje. 2020, vol. 46, no. 2, pp. 465-482. DOI: 10.31724/rihjj.46.2.1
ARHAR HOLDT, Špela, KOSEM, Iztok, PORI, Eva, GORJANC, Vojko, KREK, Simon, GANTAR, Polona. Negativno zaznamovano besedišče v Slovarju sopomenk sodobne slovenščine 2.0. Slovenščina 2.0: empirične, aplikativne in interdisciplinarne raziskave. 2023, vol. 11, no. 1, pp. 8-32. DOI: 10.4312/slo2.0.2023.1.8-32
GAPSA, Magdalena, ARHAR HOLDT, Špela. How lexicographers evaluate user contributions in the Thesaurus of Modern Slovene in comparison to dictionary users. In: MEDVEĎ, Marek (Ed.), et al. eLex 2023: electronic lexicography in the 21st century (eLex 2023): proceedings of the eLex 2023 conference: [Brno], 27-29 June 2023. Brno: Lexical Computing CZ, 2023. Pp. 178-200. https://elex.link/elex2023/wp-content/uploads/47.pdf
ARHAR HOLDT, Špela, GANTAR, Polona, KOSEM, Iztok, PORI, Eva, LOGAR, Nataša, GORJANC, Vojko, KREK, Simon. Sovražno in grobo besedišče v odzivnem Slovarju sopomenk sodobne slovenščine. In: FIŠER, Darja, ERJAVEC, Tomaž (Eds.). Jezikovne tehnologije in digitalna humanistika: zbornik konference: 15.-16. september 2022, Ljubljana, Slovenija. Inštitut za novejšo zgodovino. Pp. 10-16. https://nl.ijs.si/jtdh22/pdf/JTDH2022_Proceedings.pdf
ARHAR HOLDT, Špela, ČIBEJ, Jaka. Rezultati projekta "Slovar sopomenk sodobne slovenščine: od skupnosti za skupnost". In: FIŠER, Darja, ERJAVEC, Tomaž (Eds.). Jezikovne tehnologije in digitalna humanistika: zbornik konference: 24.-25. september 2020, Ljubljana, Slovenija. Ljubljana: Inštitut za novejšo zgodovino. 2020, pp. 3-9. http://nl.ijs.si/jtdh20/pdf/JT-DH_2020_Arhar-Holdt-et-al_Rezultati-projekta_Slovar-sopomenk-sodobne-slovenscine.pdf
GORJANC, Vojko, GANTAR, Polona, KOSEM, Iztok, KREK, Simon (Eds.). Slovar sodobne slovenščine: problemi in rešitve. Ljubljana: Znanstvena založba Filozofske fakultete. 2015. Partially translated in: GORJANC, Vojko, GANTAR, Polona, KOSEM, Iztok, KREK, Simon (Eds.). Dictionary of modern Slovene: problems and solutions. Ljubljana: Ljubljana University Press, Faculty of Arts, 2017. https://e-knjige.ff.uni-lj.si/znanstvena-zalozba/catalog/book/15
GANTAR, Polona, KOSEM, Iztok, KREK, Simon. Discovering automated lexicography: the case of the Slovene lexical database. International journal of lexicography, 2016, vol. 29, issue 2, pp. 200-225. https://academic.oup.com/ijl/article/29/2/200/2413284/Discovering-Automated-Lexicography-The-Case-of-the?guestAccessKey=95f18766-f10f-4994-a6fa-448cf75ac55e
KOSEM, Iztok, GANTAR, Polona, KREK, Simon. Avtomatizacija leksikografskih postopkov. In: ERJAVEC, Tomaž, ŽGANEC GROS, Jerneja (Eds.). Jezikovne tehnologije, Slovenščina 2.0, vol. 1, no. 2. Ljubljana: Trojina, zavod za uporabno slovenistiko. 2013, pp. 139-164. http://www.trojina.org/slovenscina2.0/arhiv/2013/2/Slo2.0_2013_2_07.pdf
ČIBEJ, Jaka, FIŠER, Darja, KOSEM, Iztok. The role of crowdsourcing in lexicography. In: KOSEM, Iztok (Ed.), et al. Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana: Trojina, Institute for Applied Slovene Studies; Brighton: Lexical Computing. 2015, pp. 70-83. https://elex.link/elex2015/proceedings/eLex_2015_05_Cibej+Fiser+Kosem.pdf
ARHAR HOLDT, Špela, LOGAR, Nataša, PORI, Eva, KOSEM, Iztok. Game of words: play the game, clean the database. In: GAVRIILIDOU, Zoe, MITITS, Lydia, KIOSSES, Spyros (Eds.). Lexicography for inclusion: EURALEX XIX: 7-9 September 2021, Vol. 2. Komotini: Democritus University of Thrace. 2021, pp. 41-49. https://euralex.org/publications/game-of-words-play-the-game-clean-the-database/
ČIBEJ, Jaka, ARHAR HOLDT, Špela. Repel the syntruders! A crowdsourcing cleanup of the thesaurus of modern Slovene. In: KOSEM, Iztok, KREK, Simon (Eds.). Electronic lexicography in the 21st century: proceedings of eLex 2019 Conference, 1-3 October 2019, Sintra, Portugal. Brno: Lexical Computing, 2019. Pp. 338-356. https://elex.link/elex2019/wp-content/uploads/2019/10/eLex-2019_Proceedings.pdf
ARHAR HOLDT, Špela, ČIBEJ, Jaka, ZWITTER VITEZ, Ana. Value of language-related questions and comments in digital media for lexicographical user research. International journal of lexicography, 2017, vol. 30, issue 3, pp. 285-308. http://ijl.oxfordjournals.org/content/early/2016/04/20/ijl.ecw017.full.pdf?keytype=ref&ijkey=SP5Yb4PHvfykRkk
ARHAR HOLDT, Špela, KOSEM, Iztok, GANTAR, Polona. Dictionary user typology: the Slovenian case. In: MARGALITADZE, Tinatin, MELADZE, George (Eds.). Lexicography and linguistic diversity: proceedings of the XVII EURALEX International Congress. Tbilisi: Ivane Javakhishvili Tbilisi State University. 2016, pp. 179-187. http://euralex2016.tsu.ge/publication2016.pdf
GANTAR, Polona, GORJANC, Vojko, KOSEM, Iztok, KREK, Simon. Going semi-automatic and crowdsourced: collocation dictionary of Slovene. In: KOSEM, Iztok (Ed.). Electronic lexicography in the 21st century: linking lexical data in the digital age. Ljubljana: Trojina, Institute for Applied Slovene Studies; Brighton: Lexical Computing. 2015, p. 37.
KOSEM, Iztok, GANTAR, Polona, KREK, Simon. Automation of lexicographic work: an opportunity for both lexicographers and crowd-sourcing. In: KOSEM, Iztok (Ed.), et al. Electronic lexicography in the 21st century: thinking outside the paper. Ljubljana: Trojina, Institute for Applied Slovene Studies; Tallinn: Eesti Keele Instituut. 2013, pp. 32-48. http://eki.ee/elex2013/proceedings/eLex2013_03_Kosem+Gantar+Krek.pdf
KOSEM, Iztok, HUSAK, Milos, MCCARTHY, Diana. GDEX for Slovene. In: KOSEM, Iztok, KOSEM, Karmen (Eds.). Electronic lexicography in the 21st century: new applications for new users. Ljubljana: Trojina, Institute for Applied Slovene Studies. 2011, pp. 150-159. http://www.trojina.si/elex2011/elex2011_proceedings.pdf
LOGAR, Nataša, GRČAR, Miha, BRAKUS, Marko, ERJAVEC, Tomaž, ARHAR HOLDT, Špela, KREK, Simon. Korpusi slovenskega jezika Gigafida, KRES, ccGigafida in ccKRES: gradnja, vsebina, uporaba. Ljubljana: Trojina, zavod za uporabno slovenistiko: Fakulteta za družbene vede, 2012. https://doi.org/10.4312/9789610603542
KREK, Simon, ARHAR HOLDT, Špela, ERJAVEC, Tomaž, ČIBEJ, Jaka, REPAR, Andraž, GANTAR, Polona, LJUBEŠIĆ, Nikola, KOSEM, Iztok, DOBROVOLJC, Kaja. Gigafida 2.0: the reference corpus of written standard Slovene. In: CALZOLARI, Nicoletta (Ed.). LREC 2020: Twelfth International Conference on Language Resources and Evaluation: May 11-16, 2020, Palais du Pharo, Marseille, France. Paris: ELRA - European Language Resources Association, 2020, pp. 3340-3345. http://www.lrec-conf.org/proceedings/lrec2020/LREC-2020.pdf
ARHAR HOLDT, Špela, KOSEM, Iztok, PORI, Eva. Jezikovni viri CJVT in njihova raba v izobraževalne namene. In: ULČNIK, Natalija, ANTLOGA, Špela (Eds.). Slovenščina na dlani 4. Maribor: Univerza v Mariboru, Univerzitetna založba, 2021. Pp. 19-36. https://press.um.si/index.php/ump/catalog/book/615
KREK, Simon, KOSEM, Iztok, GANTAR, Polona. Predlog za izdelavo Slovarja sodobnega slovenskega jezika. Edition 1.1. Ljubljana: s. n., 2013. http://www.sssj.si/datoteke/Predlog_SSSJ_v1.1.pdf
Online dictionary at viri.cjvt.si
Viri CJVT
ISSN 2591-247X
Ljubljana, 2023
This work is licensed under a Creative Commons licence:
Creative Commons Attribution-ShareAlike 4.0 International.
Edited by
Špela Arhar Holdt
Simon Krek
Cyprian Laskowski
Iztok Kosem
Polona Gantar
Marko Robnik Šikonja
Jaka Čibej
Vojko Gorjanc
Bojan Klemenc
Kaja Dobrovoljc
Interface design
Gašper Uršič
Gregor Makovec
(Studio Kruh)
Interface development
Leon Noe Jovan
Issued by
Centre for Language Resources and Technologies, University of Ljubljana
Ljubljana University Press, Faculty of Arts
For the issuer
Mojca Schlamberger Brezar, Dean of the Faculty of Arts, University of Ljubljana
Published by
Ljubljana University Press, Faculty of Arts
(until 2022) University of Ljubljana Academic Press
For the publisher
Gregor Majdič, Rector of the University of Ljubljana
Citation
Sopomenke 2.0: Thesaurus of Modern Slovene, viri.cjvt.si/sopomenke, accessed on 21 November 2024.
Thesaurus of Modern Slovene 2.0
Date of publication: 26 March 2023
Number of headwords: 100,837
Number of synonyms: 362,828
Number of collocations: 2,885,894
Number of examples: 7,364,128
Thesaurus of Modern Slovene 1.0
Date of publication: 26 March 2018
Number of headwords: 105,473
Number of synonyms: 368,117
Number of collocations: 3,353,061
Number of examples: 2,505,472