About the Resource

Gos is a reference corpus of spoken Slovene. It consists of transcriptions of around 320 hours of speech recorded in a wide range of everyday situations: radio and television shows, school lessons and lectures, private conversations between friends or family, and various work meetings, consultations, commercial conversations, etc. The speech is transcribed in two versions, a standardized one and a pronunciation-based one, and the corpus comprises more than two million words.

Gos 2.1 is the newest version of the corpus. It was created as part of the Development of Slovene in a Digital Environment project by merging the Gos 1.1 and Gos VideoLectures corpora with a part of the Artur speech database. Compared to the first version, Gos 2.1 contains more than twice the volume of recordings and transcriptions. Because three datasets were merged, it also uses a somewhat altered method of speech transcription. During the planning stage for the new version, special attention was given to how current the corpus is (the recordings span the period from 2007 to 2022) and to how balanced it is with respect to the different types of speech events it contains.

The corpus can be accessed through a web interface on this website. In addition to various options for searching through both types of transcription, the web concordancer supports playback of sound recordings and filtering of results by a wide range of metadata, such as event type, communication channel, and the speaker's gender, age, education and region of origin. For linguistic research, the corpus is also available in the NoSketch Engine and KonText concordancers, maintained by the CLARIN.SI research infrastructure.