About the Thesaurus

In its current version, the Thesaurus of Modern Slovene contains 100,837 keywords and 362,828 synonyms, making it the largest automatically generated open-access collection of Slovene synonyms. Unlike other similar language resources, the Thesaurus is based on a range of different databases and enables users to compare different synonyms and check their use in the Gigafida 2.0 reference corpus of modern Slovene.

The Thesaurus was constructed using an advanced computational approach, which is innovative even in the international lexicographical context. Computer-assisted data preparation is significantly cheaper and faster than manual processing. This enables regular updates and upgrades to the resource, making the dictionary a dynamic source of language information.

Responsive dictionary

With the Thesaurus of Modern Slovene, we are introducing a new type of dictionary called the responsive dictionary. The initial database of a responsive dictionary is constructed using advanced computational methods, instantly providing the language community with a large amount of relevant, albeit still somewhat noisy, language information. A responsive dictionary is characterized by two further key traits: first, its database is openly accessible, and second, it provides a number of ways for the language community to improve the database and clean up noisy elements. This means that the construction of a responsive dictionary is never truly concluded, as its data constantly evolves in accordance with changes in the modern language. All changes can be tracked using timestamps in individual entries, while the different versions of the database are stored in a dedicated archive. The responsive dictionary takes its name from the fact that the approach to its construction allows the data to continuously respond to the opinions of the contributing language community and to the changes in language reflected in the texts that the community produces. Essentially, it is “a dictionary made by the community for the community” (Arhar Holdt et al. 2018).

Dictionary creation

The Thesaurus of Modern Slovene is based on the data contained in two principal language resources: the Oxford®-DZS Comprehensive English-Slovenian Dictionary and the Gigafida reference corpus of written Slovene. Both resources contain language material created after 1991 and as such offer a description of modern Slovene. The links identified between synonyms were additionally confirmed using the older Dictionary of Standard Slovenian Language (SSKJ). The data extraction and structure for the Thesaurus were based on the frequency and manner in which words co-occur in translation strings of the Oxford-DZS Dictionary. This information is the basis for discriminating between ‘core’ and ‘near’ synonyms, with ‘core’ synonyms exhibiting a greater degree of connection to the keyword. In the following step, an approach combining balanced co-occurrence graphs and the Personalized PageRank algorithm automatically divides the synonyms into subgroups and ranks them according to the degree of semantic relatedness to the keyword, as well as their frequency in language use. The co-occurrence graphs are thus used to organize the synonyms in the dictionary. For a more detailed description of this methodology, see Krek et al. (2017).
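
As an illustration only, the ranking step can be sketched in a few lines of Python. The sketch below runs a plain Personalized PageRank by power iteration on a toy weighted graph. It is not the pipeline used for the Thesaurus: it does not reproduce the graph balancing, the division into subgroups, or the frequency component, and the edge list, the weights, the function name and the parameter values are all hypothetical.

```python
from collections import defaultdict


def personalized_pagerank(edges, seed, damping=0.85, iterations=100):
    """Score every node of an undirected weighted graph relative to `seed`.

    `edges` is an iterable of (word_a, word_b, weight) triples with
    positive weights. The random walk restarts at `seed` with
    probability 1 - damping, so nodes close to the seed score higher.
    """
    # Build a symmetric weighted adjacency map.
    graph = defaultdict(dict)
    for a, b, weight in edges:
        graph[a][b] = graph[a].get(b, 0.0) + weight
        graph[b][a] = graph[b].get(a, 0.0) + weight
    if seed not in graph:
        raise KeyError(f"seed {seed!r} does not occur in the graph")

    # Start with all probability mass on the seed.
    scores = {node: 0.0 for node in graph}
    scores[seed] = 1.0

    for _ in range(iterations):
        updated = {node: 0.0 for node in graph}
        updated[seed] += 1.0 - damping  # restart mass returns to the seed
        for node, neighbours in graph.items():
            total = sum(neighbours.values())
            for other, weight in neighbours.items():
                # Each node passes its score on in proportion to edge weight.
                updated[other] += damping * scores[node] * weight / total
        scores = updated
    return scores


# Hypothetical co-occurrence counts around the keyword "hiša" ('house').
EDGES = [
    ("hiša", "dom", 3.0),
    ("hiša", "stavba", 2.0),
    ("hiša", "poslopje", 1.0),
    ("dom", "domovanje", 2.0),
    ("stavba", "poslopje", 2.0),
]

scores = personalized_pagerank(EDGES, "hiša")
# Candidate synonyms ordered from most to least related to the keyword.
ranking = sorted((w for w in scores if w != "hiša"), key=scores.get, reverse=True)
print(ranking)
```

Restarting the walk at the keyword, rather than at a uniformly random node, is what makes the scores specific to that keyword: words connected to it only through intermediate nodes receive less probability mass than its direct, strongly weighted neighbours.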

Co-occurrence graph for the word hiša (‘house’)

Data reliability

Automatic data extraction and ranking are never perfectly accurate, as related projects for other languages also show. However, evaluations show that the method is reliable enough to provide results that are useful for the dictionary user even before lexicographic post-processing. The accompanying graph shows the results of a linguistic evaluation in which the automatically extracted synonyms for a given keyword were rated as good, acceptable or poor. It should be noted that the concept of synonymy is difficult to define and heavily dependent on the context and the circumstances of language use. Thus, even for humans, rating synonyms is not a one-dimensional task.

Linguistic evaluation of synonyms for a given headword.

Synonyms and the context

Understanding synonymy requires context, which is why the Thesaurus provides the user with a number of ways to compare synonyms using corpus data. An important novelty for Slovene language resources in this regard is the option to compare how different synonyms are actually used with the help of collocations (typical word co-occurrences). In addition, examples of use are imported into the dictionary using computational methods for the automatic recognition of good (dictionary) examples. Collocations and examples of use are included in most entries, and all entries provide links to the Gigafida corpus, which allows for a more detailed analysis of modern language use. Furthermore, domain labels were added from the Oxford-DZS Comprehensive English-Slovenian Dictionary, which help explain the context of use for individual synonyms. In version 2.0, labels for hateful and coarse vocabulary were also added.

Updates in Thesaurus 2.0

Thesaurus of Modern Slovene 2.0 incorporates two types of dictionary entries. The majority of entries are prepared entirely through automated processes. However, for 3,054 headwords, we have manually analyzed and divided senses, described them with short semantic indicators, and categorized synonyms under appropriate senses. Another new feature is the inclusion of manually reviewed antonyms, which are available for 3,599 headwords. The sense-divided entries and antonyms indicate the direction in which we aim to expand the dictionary in the future.

From the community for the community

The Thesaurus of Modern Slovene is part of an organized effort to establish an infrastructure for Slovene that is comparable to the infrastructures of larger languages. We believe that, in terms of methodology, the construction of language resources should follow the contemporary zeitgeist and that all data prepared through publicly financed initiatives and projects should be openly accessible for the further development of language technologies, considering the actual needs of modern language users in the digital age. The process of constructing the Thesaurus of Modern Slovene thus also puts considerable effort into establishing a dedicated community that not only uses the dictionary, but also contributes to its development.

User contributions in the dictionary database

Users can contribute their own suggestions for synonyms and antonyms to the dictionary. Each new suggestion appears in the dictionary immediately upon submission, and the community has the option to give it positive or negative votes. Starting from version 2.0, selected user contributions will also be included in the openly accessible Thesaurus database and the Digital Dictionary Database for Slovene. When deciding on inclusion, we will consider the following criteria: (1) Does the proposed word or phrase occur in real language use? (2) Has it been added under the appropriate headword considering its meaning? (3) Does it require a dictionary label, and if so, has a label suggestion been provided? (4) How have other users responded to the suggestion? Regardless of whether the suggestion is included in the database or not, it will remain available in the dictionary interface. In exceptional cases, entries that are malicious or similarly problematic will be removed.


Thesaurus of Modern Slovene

Sopomenke 2.0: Thesaurus of Modern Slovene, viri.cjvt.si/sopomenke, accessed on 24. 09. 2023.


Thesaurus of Modern Slovene 2.0

Date of publication: 26. 3. 2023
Number of headwords: 100,837
Number of synonyms: 362,828
Number of collocations: 2,885,894
Number of examples: 7,364,128

Thesaurus of Modern Slovene 1.0

Date of publication: 26. 3. 2018
Number of headwords: 105,473
Number of synonyms: 368,117
Number of collocations: 3,353,061
Number of examples: 2,505,472

URL: http://viri.cjvt.si/sopomenke/arhiv/CJVT_Thesaurus-v1.0.zip