Common Language Resources and Technology Infrastructure - Slovenia
Frequency lists of syntactic structures from the Učbeniki 1.0 corpus
The frequency lists of syntactic structures from the Slovene textbook corpus Učbeniki 1.0 were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958).
The extracted data is available at two levels: at the phrase level (see folder "besednozvezne") and at the sentence level (see folder "medstavcne").
At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts:
noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek),
numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), abbreviation (okrajšava) (no results were returned for interjection (medmet) and residual (neuvrščeno)).
These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former.
At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are:
parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD:
clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisnik), and adnominal clause modifier (prilastkov odvisnik).
These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies).
The dataset can be used for syntactic analyses in combination with comparable data (http://hdl.handle.net/11356/2009) from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), with the present data representing the expected or desired scope of reception.
For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files (a loading sketch in Python follows the list):
- "ucbeniki_*_default.tsv" - the original output, containing extracted unique syntactic structures of varying lengths, ranging from 2 to 10 tokens, arranged by frequency, followed by additional data on syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "ucbeniki_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurances of the extracted structures in every sentence).
- "ucbeniki_*_default_tree-description.tsv" - an extension of the "ucbeniki_*_default.tsv" file that includes a verbal description of syntactic structures (trees).
- "ucbeniki_*_all-examples_tree-description.tsv" - an extension of the "ucbeniki_*_all-examples.tsv" file that includes a verbal description of syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)
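As an illustration of how the frequency lists above can be explored (this sketch is not part of the dataset), the "ucbeniki_*_default.tsv" files can be loaded with pandas; the column names used below ("Absolute frequency", "logDice") are assumed to match the statistic names listed above, and the concrete file name is only a hypothetical instance of the naming pattern.

import pandas as pd

# Load one phrase-level frequency list (tab-separated, with a header row).
df = pd.read_csv("besednozvezne/ucbeniki_samostalnik_default.tsv", sep="\t")

# Keep structures attested at least 10 times and rank them by association strength.
frequent = df[df["Absolute frequency"] >= 10].sort_values("logDice", ascending=False)
print(frequent.head(20))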
The data was prepared in the following manner:
The individual files of Slovene school textbooks were merged into a single CoNLL-U file. The corpus was already linguistically annotated with the CLASSLA pipeline (https://github.com/clarinsi/classla/) at the levels of the MULTEXT-East v6 morphosyntax, JOS-SYN dependency syntax, and UD part-of-speech and syntactic relations annotations.
Furthermore, the original corpus was preprocessed to reduce the MSD tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features). This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, yet still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.)
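The following is a minimal sketch of this preprocessing step (assuming the MSD tag is stored in the standard XPOS field, i.e. the fifth CoNLL-U column; file names are illustrative, and the actual pipeline may additionally keep the full tag elsewhere so that it can still be displayed on the nodes).

# Reduce the MSD (XPOS) tag of every token to its first letter, e.g. Somei -> S.
with open("ucbeniki.conllu", encoding="utf-8") as fin, \
     open("ucbeniki_reduced.conllu", "w", encoding="utf-8") as fout:
    for line in fin:
        if line.strip() and not line.startswith("#"):
            cols = line.rstrip("\n").split("\t")
            if len(cols) == 10 and cols[4] != "_":
                cols[4] = cols[4][0]           # keep only the part-of-speech letter
            fout.write("\t".join(cols) + "\n")
        else:
            fout.write(line)                   # keep comments and sentence breaks unchanged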
Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema.
The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data.
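A simplified sketch of such a deduplication pass is given below; the column name "Tree" used to identify a structure and the concrete file name are hypothetical placeholders, and the recalculation of the association measures themselves is omitted.

import pandas as pd

rows = pd.read_csv("medstavcne/ucbeniki_priredje_all-examples.tsv", sep="\t")

# Drop rows that were emitted more than once by the phased extraction.
deduplicated = rows.drop_duplicates()

# Recompute absolute frequencies per structure from the deduplicated matches.
freqs = deduplicated.groupby("Tree").size().sort_values(ascending=False)
deduplicated.to_csv("ucbeniki_priredje_all-examples_dedup.tsv", sep="\t", index=False)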
Another step was to enhance all output files with verbal descriptions of the extracted structures.
Lastly, the extended versions of the two original output files ("ucbeniki_*_default_tree-description.tsv", "ucbeniki_*_all-examples_tree-description.tsv") were converted into Excel spreadsheets.
The package also includes a configuration file for each level: "config_ucbeniki_besednozvezne.ini" for phrase-level structures, and "config_ucbeniki_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK.
For more details, see "00README.txt".
Parallel sense-annotated corpus ELEXIS-WSD 1.3
ELEXIS-WSD is a parallel sense-annotated corpus in which content words (nouns, adjectives, verbs, and adverbs) have been assigned senses. Version 1.3 contains sentences for 10 languages: Bulgarian, Danish, English, Spanish, Estonian, Hungarian, Italian, Dutch, Portuguese, and Slovene.
The corpus was compiled by automatically extracting a set of sentences from WikiMatrix (Schwenk et al., 2019), a large open-access collection of parallel sentences derived from Wikipedia, using an automatic approach based on multilingual sentence embeddings. The sentences were manually validated according to specific formal, lexical and semantic criteria (e.g. by removing incorrect punctuation, morphological errors, notes in square brackets and etymological information typically provided in Wikipedia pages). To obtain satisfactory semantic coverage, sentences with fewer than 5 words or fewer than 2 polysemous words were filtered out. Subsequently, in order to obtain datasets in the other nine target languages, for each selected sentence in English, the corresponding WikiMatrix translation into each of the other languages was retrieved. If no translation was available, the English sentence was translated manually. The resulting corpus comprises 2,024 sentences for each language.
The sentences were tokenized, lemmatized, and tagged with UPOS tags using UDPipe v2.6 (https://lindat.mff.cuni.cz/services/udpipe/). Senses were annotated using LexTag (https://elexis.babelscape.com/): each content word (noun, verb, adjective, and adverb) was assigned a sense from among the available senses from the sense inventory selected for the language (see below) or BabelNet. Sense inventories were also updated with new senses during annotation. Dependency relations were added with UDPipe 2.15 in version 1.2.
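As a hedged illustration of the preprocessing described above (not taken from the project scripts), a sentence can be tokenized, lemmatized, and UPOS-tagged through the public UDPipe REST service; the model identifier below is a shortened, hypothetical name, and the identifiers actually available should be checked on the service's web page.

import requests

response = requests.post(
    "https://lindat.mff.cuni.cz/services/udpipe/api/process",
    data={
        "model": "slovenian-ssj",   # hypothetical/abbreviated model name
        "tokenizer": "",
        "tagger": "",
        "parser": "",
        "data": "Banka stoji ob reki.",
    },
)
print(response.json()["result"])    # CoNLL-U output with lemmas and UPOS tags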
List of sense inventories
BG: Dictionary of Bulgarian
DA: DanNet – The Danish WordNet
EN: Open English WordNet
ES: Spanish Wiktionary
ET: The EKI Combined Dictionary of Estonian
HU: The Explanatory Dictionary of the Hungarian Language
IT: PSC + Italian WordNet
NL: Open Dutch WordNet
PT: Portuguese Academy Dictionary (DACL)
SL: Digital Dictionary Database of Slovene
The corpus is available in the CoNLL-U tab-separated format. In order, the columns contain the token ID, its form, its lemma, its UPOS-tag, its XPOS-tag (if available), its morphological features (FEATS), the head of the dependency relation (HEAD), the type of dependency relation (DEPREL); the ninth column (DEPS) is empty; the final MISC column contains the following: the token's whitespace information (whether the token is followed by a whitespace or not; e.g. SpaceAfter=No), the ID of the sense assigned to the token, the index of the multiword expression (if the token is part of an annotated multiword expression), and the index and type of the named entity annotation (currently only available in elexis-wsd-sl and elexis-wsd-en).
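Since the sense annotations live in the MISC column, a small sketch such as the following (using the third-party conllu package; the file name is illustrative) can be used to list all MISC key-value pairs and locate the sense-ID key used in the release without assuming its exact name.

from conllu import parse_incr

with open("elexis-wsd-sl.conllu", encoding="utf-8") as f:
    for sentence in parse_incr(f):
        for token in sentence:
            misc = token["misc"] or {}
            # Print every MISC key/value pair (whitespace info, sense ID, MWE index, NE annotation).
            for key, value in misc.items():
                print(token["form"], key, value)
        break   # inspect the first sentence only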
Each language has a separate sense inventory containing all the senses (and their definitions) used for annotation in the corpus. Not all the senses from the sense inventory are necessarily included in the corpus annotations: for instance, all occurrences of the English noun "bank" in the corpus might be annotated with the sense of "financial institution", but the sense inventory also contains the sense "edge of a river" as well as all other possible senses to disambiguate between.
For more information, please refer to 00README.txt.
Updates in version 1.3:
- A handful of token ID issues were corrected in ELEXIS-WSD-sl. In addition, lemmas were corrected according to the version of ELEXIS-WSD-sl included in the SUK 1.1 Training Corpus of Slovene (http://hdl.handle.net/11356/1959).
- Named entity annotations and named entity core concept annotations were added to ELEXIS-WSD-en.
- For all languages, missing UPOS tags were added for non-content words.
The CLASSLA-Stanza model for named entity recognition of standard Slovenian 2.2
This model for named entity recognition of standard Slovenian was built with the CLASSLA-Stanza tool (https://github.com/clarinsi/classla) by training on the SUK training corpus (http://hdl.handle.net/11356/1959) and using the CLARIN.SI-embed.sl 2.0 word embeddings (http://hdl.handle.net/11356/1791).
The difference from the previous version is that this model was trained on the SUK training corpus and uses new embeddings.
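A brief usage sketch follows, under the assumption that the standard Slovenian models fetched by the classla package include (a version of) this NER model; which exact model version the package ships is not asserted here.

import classla

classla.download("sl")                                    # fetch standard Slovenian models
nlp = classla.Pipeline("sl", processors="tokenize,ner")
doc = nlp("Janez Novak je obiskal Ljubljano.")
for entity in doc.ents:
    print(entity.text, entity.type)                       # e.g. PER, LOC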
Linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 5.0
ParlaMint 5.0 is a set of comparable corpora containing transcriptions of parliamentary debates of 29 European countries and autonomous regions, mostly starting in 2015 and extending to mid-2022. The individual corpora comprise between 9 and 126 million words and the complete set contains over 1.2 billion words.
The transcriptions are divided by days with information on the term, session and meeting, and contain speeches marked by the speaker and their role (e.g. chair, regular speaker) as well as by their automatically assigned CAP (Comparative Agendas Project) top level topic.
The speeches also contain marked-up transcriber comments, such as gaps in the transcription, interruptions, applause, etc. The corpora have extensive metadata, most importantly on speakers (name, gender, MP and minister status, party affiliation), on their political parties and parliamentary groups (name, coalition/opposition status, Wikipedia-sourced left-to-right political orientation, and CHES variables, https://www.chesdata.eu/). Note that some corpora have further metadata, e.g. the year of birth of the speakers, links to their Wikipedia articles, their membership in various committees, etc. The transcriptions are also marked with the subcorpora they belong to ("reference", until 2020-01-30, "covid", from 2020-01-31, and "war", from 2022-02-24).
An overview of the statistics of the corpora is available on GitHub in the folder Build/Metadata, in particular for the release 5.0 at https://github.com/clarin-eric/ParlaMint/tree/v5.0/Build/Metadata.
The corpora are encoded according to the ParlaMint encoding guidelines (https://clarin-eric.github.io/ParlaMint/) and schemas (included in the distribution).
The ParlaMint.ana linguistic annotation includes tokenization; sentence segmentation; lemmatisation; Universal Dependencies part-of-speech, morphological features, and syntactic dependencies; the 4-class CoNLL-2003 named entities; and per-sentence sentiment score and class. Some corpora also have further linguistic annotations, in particular PoS tagging according to a language-specific scheme, with their corpus TEI headers giving further details on the annotation vocabularies and tools used.
This entry contains the ParlaMint.ana TEI-encoded linguistically annotated corpora; the derived CoNLL-U files along with TSV metadata of the speeches and a TSV with the per-sentence sentiment score and the 6- and 3-category sentiment classes; and the derived vertical files (with their registry file), suitable for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText.
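As a heavily hedged sketch (the file and column names below, such as "score" and "class3", are hypothetical placeholders; consult the distribution's documentation for the actual layout), the per-sentence sentiment TSV can be summarised with pandas:

import pandas as pd

sentiment = pd.read_csv("ParlaMint-SI_sentiment.tsv", sep="\t")
print(sentiment["class3"].value_counts())   # distribution of the 3-category sentiment class
print(sentiment["score"].describe())        # summary statistics of the sentiment score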
Also included is the 5.0 release of the sample data and scripts available at the GitHub repository of the ParlaMint project at https://github.com/clarin-eric/ParlaMint and the log files produced in the process of building the corpora for this release. The log files show e.g. known errors in the corpora, while more information about known problems is available in the open issues at the GitHub repository of the project.
This entry contains the linguistically marked-up version of the corpus, while the text version, i.e. without the linguistic annotation, is also available at http://hdl.handle.net/11356/2004. Another related resource, namely the ParlaMint corpora machine translated to English, ParlaMint-en.ana 5.0, can be found at http://hdl.handle.net/11356/2006.
Compared to the previous version 4.1, this version adds information on the topic of each speech and the sentence-level sentiment for all corpora, adds some previously missing speeches to the TR corpus, changes the IDs of the categories in corpus-specific taxonomies to prevent ID clashes, and corrects some other minor errors.
Word-sense disambiguation corpus SloDicWSD 1.0
SloDicWSD is a Slovene word-sense disambiguation (WSD) corpus generated from data contained in SSKJ (Slovar slovenskega knjižnega jezika, the largest dictionary of standard Slovene). The corpus is an automatically constructed WSD dataset based on the sense inventory from the SSKJ dictionary and consists of SSKJ dictionary use-case examples converted to complete sentences using GPT-3.5 Turbo (https://platform.openai.com/docs/models#gpt-3-5-turbo).
We limited the corpus to the top 758 lemmas present in the Slovene part of the Elexis-WSD dataset (http://hdl.handle.net/11356/1842). For each lemma, we extracted every usage example from the SSKJ dictionary and labeled it with the matching sense. As these usage examples are likely too short to be useful for the WSD task, we extended them using GPT-3.5. We automatically filtered out sentences containing either of the following two errors (a filtering sketch in Python follows the list):
1. The original dictionary lemma was not present in the full sentence. While we prompted GPT-3.5 to generate complete sentences by extending existing examples, GPT-3.5 sometimes omitted the original lemma.
2. The generated sentence was identical to one of the already generated sentences. The sentences generated by GPT-3.5 are not guaranteed to be unique; therefore, we discarded duplicates.
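A simplified sketch of these two filters is given below; it operates on hypothetical (lemma, generated sentence) pairs and checks lemma presence by plain substring matching, whereas the actual pipeline may rely on lemmatisation to handle inflected forms.

def filter_generated(pairs):
    seen = set()
    kept = []
    for lemma, sentence in pairs:
        if lemma.lower() not in sentence.lower():   # error 1: original lemma missing from the sentence
            continue
        if sentence in seen:                        # error 2: duplicate of an already generated sentence
            continue
        seen.add(sentence)
        kept.append((lemma, sentence))
    return kept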
French and Slovene offensive language metaphor and metonymy annotated dataset FRENK-MRW 1.0
The FRENK-MRW dataset contains French and Slovene socially unacceptable Facebook comments that are manually annotated for metaphor and metonymy based on the observed incongruity between the basic and contextual meaning. The comments were posted between 2015 and 2017 under Facebook posts produced by major news media outlets on the topics of LGBTQIA+/homophobia and migration/islamophobia. This entry includes the dataset divided into four files in CSV format, two with French comments (metadata: meta_fr, metaphor/metonymy annotations: mrw_fr) and two with Slovene comments (metadata: meta_sl, metaphor/metonymy annotations: mrw_sl). Attached are also annotation guidelines and a README file explaining the file structure, both formatted as TXT files.
The dataset uses a selection of Slovene socially unacceptable comments from FRENK 1.1 (http://hdl.handle.net/11356/1462) and French socially unacceptable comments from FRENK-fr 1.0 (http://hdl.handle.net/11356/1947). French data from FRENK-fr 1.0 was linguistically annotated with the FreeLing tagger (https://aclanthology.org/L12-1224/), while Slovene data from FRENK 1.1 was processed using the CLASSLA tagger (http://hdl.handle.net/11356/1337). Manual annotations were performed in a WebAnno deployment (webanno.github.io/webanno) hosted at CLARIN.SI.
FRENK-MRW represents a set of comments, 2,000 in total, that is based on a selection of news items (POST_CONTENT (NEWS) column) which were chosen according to two criteria: (1) for ease of annotation and interpretation, the entire thread of comments needed to be included (excluding acceptable comments from the annotation), and (2) the total number of available comments linked to these news posts had to reach 2,000, equally distributed between the two languages (French, Slovene) and the two topics (migrants, LGBT). The French part of the dataset includes posts from Le Figaro and 20 minutes, with LGBT-related news coming only from the latter. In the Slovene part, the posts on both topics (migrants and LGBT) come from Nova24TV, Siol.net and 24ur.
There are 2,000 comments in the dataset with 84,738 tokens. Not all comments contain metaphors. In the French part, 541 comments contain at least one metaphorically used token, while in the Slovene part of the dataset this number amounts to 571 comments. In total, there are 1,192 metaphorically used tokens in the French part of the dataset, and 1,270 in the Slovene part.
Monitor corpus of Slovene Trendi 2025-02
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 77 publishers. Trendi 2025-02 covers the period from January 2019 to February 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the topic, or thematic category, which has been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
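Assuming the Hugging Face checkpoint linked above exposes a standard sequence-classification head, a single article can be classified with the transformers pipeline API roughly as follows (the returned label names are assumed to correspond to the 13 categories listed above):

from transformers import pipeline

classifier = pipeline("text-classification", model="cjvt/sloberta-trendi-topics")
print(classifier("Slovenska nogometna reprezentanca je sinoči premagala Dansko."))
# e.g. [{"label": "Sports", "score": ...}]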
The corpus is currently not available as a downloadable dataset due to copyright restrictions but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]).
This version adds texts from February 2025.
Monitor corpus of Slovene Trendi 2025-03
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 56 publishers. Trendi 2025-03 covers the period from January 2019 to March 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the topic, or thematic category, which has been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]).
This version adds texts from March 2025, and provides some improvements in the list of publishers by correcting the source-to-publisher conversion from the previous months (esp. 2023-02).
Monitor corpus of Slovene Trendi 2025-04
The Trendi corpus is a monitor corpus of Slovenian. It contains news articles from 106 media websites, published by 56 publishers. Trendi 2025-04 covers the period from January 2019 to April 2025, complementing the Gigafida 2.0 reference corpus of written Slovene (http://hdl.handle.net/11356/1320).
The contents of the Trendi corpus are obtained using the Jožef Stefan Institute Newsfeed service (http://newsfeed.ijs.si/). The texts have been annotated using the CLASSLA-Stanza pipeline (https://github.com/clarinsi/classla), including syntactic parsing according to the Universal Dependencies (https://universaldependencies.org/sl/) and Named Entities (https://nl.ijs.si/janes/wp-content/uploads/2017/09/SlovenianNER-eng-v1.1.pdf).
An important addition is the topic, or thematic category, which has been automatically assigned to each text. There are 13 categories altogether: Arts and culture, Crime and accidents, Economy, Environment, Health, Leisure, Politics and Law, Science and Technology, Society, Sports, Weather, Entertainment, and Education. The text classification uses the following models: Text classification model SloBERTa-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1709), Text classification model fastText-Trendi-Topics 1.0 (http://hdl.handle.net/11356/1710), and the SloBERTa model (https://huggingface.co/cjvt/sloberta-trendi-topics).
The corpus is currently not available as a downloadable dataset due to copyright restrictions but we hope to make at least some of it available in the near future. The corpus is accessible through CLARIN.SI concordancers. If you would like to use the dataset for research purposes, please contact Iztok Kosem ([email protected]).
This version adds texts from April 2025.
Frequency lists of syntactic structures from the Šolar 3.0 corpus
The frequency lists of syntactic structures from the developmental corpus Šolar 3.0 (http://hdl.handle.net/11356/1589), specifically from the original, uncorrected student texts ("solar-orig.conllu") were extracted with the STARK v3 tool (http://hdl.handle.net/11356/1958).
The extracted data is available at two levels: at the phrase level (see folder "besednozvezne") and at the sentence level (see folder "medstavcne").
At the phrase level, the extracted syntactic structures have a headword belonging to one of the following parts of speech, as defined by the MULTEXT-East system for morphosyntactic annotation of Slovene texts:
noun (samostalnik), verb (glagol), adjective (pridevnik), adverb (prislov), pronoun (zaimek),
numeral (števnik), adposition (predlog), conjunction (veznik), particle (členek), abbreviation (okrajšava) (no results were returned for interjection (medmet) and residual (neuvrščeno)).
These structures were extracted based on the MULTEXT-East morphosyntax v6 (https://wiki.cjvt.si/books/04-multext-east-morphosyntax) and the JOS-SYN dependency syntax (https://wiki.cjvt.si/books/06-jos-syn-syntax), where the latter serves as a syntactic complement to the former.
At the sentence level, the extracted syntactic structures link two clauses. The included types of clausal syntactic relations according to Universal Dependencies (UD) are:
parataxis (soredje), coordination (priredje), and subordination (podredje), which is further divided into 4 main types according to UD:
clausal subject (osebkov odvisnik), clausal object (predmetni odvisnik), adverbial clause modifier (prislovni odvisnik), and adnominal clause modifier (prilastkov odvisnik).
These structures were extracted based on the UD part-of-speech and syntactic relations annotations (https://wiki.cjvt.si/books/07-universal-dependencies).
The dataset can be used for syntactic analyses of student writing in Slovene schools, also in combination with comparable data (http://hdl.handle.net/11356/2010) from the Slovene textbook corpus Učbeniki 1.0, which represents the expected or desired scope of reception.
For each part of speech (phrase level) or clausal relation (sentence level), there are 4 files:
- "solar-orig_*_default.tsv" - the original output, containing extracted unique syntactic structures of varying lengths, ranging from 2 to 10 tokens, arranged by frequency, followed by additional data on syntactic structures and corpus-linguistic statistics (Absolute frequency, Relative frequency, MI, MI3, Dice, logDice, t-score, simple-LL).
- "solar-orig_*_all-examples.tsv" - the original output, containing all matched structures found in the input corpus (i.e. all occurances of the extracted structures in every sentence).
- "solar-orig_*_default_tree-description.tsv" - an extension of the "solar-orig_*_default.tsv" file that includes a verbal description of syntactic structures (trees).
- "solar-orig_*_all-examples_metadata_tree-description.tsv" - an extension of the "solar-orig_*_all-examples.tsv" file that includes school text metadata and a verbal description of syntactic structures (trees).
(The asterisk (*) in file names serves as a placeholder for a part of speech or a clausal relation.)
The data was prepared in the following manner:
First, the corpus was linguistically annotated with the CLASSLA v2.1 pipeline (https://github.com/clarinsi/classla/) at the levels of UD part-of-speech and syntactic relations annotations to enable the extraction of sentence-level structures.
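A minimal sketch of such an annotation run with the classla package is shown below; the processor names follow the classla/stanza convention, and the exact settings used for Šolar 3.0 may differ.

import classla

classla.download("sl")                     # fetch standard Slovenian models
nlp = classla.Pipeline("sl", processors="tokenize,pos,lemma,depparse")
doc = nlp("Učenci so napisali spis o počitnicah.")
for sent in doc.sentences:
    for word in sent.words:
        print(word.text, word.lemma, word.upos, word.xpos, word.deprel)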
Furthermore, the original corpus containing MULTEXT-East tags (MSD tags) was preprocessed to reduce the tag to its first letter (e.g., Somei → S), which denotes the part of speech (the remaining letters represent the token's morphosyntactic features). This preprocessing step enabled extraction at the part-of-speech level, disregarding token-specific features, yet still displaying the full MSD tags as nodes in the extracted structures. (Note that STARK was originally developed for extracting data from UD-parsed corpora and was not designed for use cases like this one.)
Then, the data was extracted with the STARK v3.0 tool (http://hdl.handle.net/11356/1958), based on predefined parameters in the "config.ini" file, with phrase-level structures extracted based on the MULTEXT-East and JOS-SYN annotation systems, and sentence-level structures extracted based on the UD schema.
The sentence-level data underwent a postprocessing phase to remove duplicates that occurred due to the phased extraction of complex connectives and to recalculate corpus-linguistic statistics based on the deduplicated data.
Another step was to enhance all output files with verbal descriptions of the extracted structures and to enrich all "solar-orig_*_all-examples.tsv" files with school text metadata by assigning metadata from "solar-meta.tsv" (see "Solar.CoNLL-U.zip" in http://hdl.handle.net/11356/1589) to each structure based on matching text IDs (both with Python).
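An illustrative sketch of this metadata-joining step is given below; the shared text-ID column name ("Text ID") and the concrete file names are hypothetical placeholders, as the actual column headers in "solar-meta.tsv" and in the STARK output may differ.

import pandas as pd

examples = pd.read_csv("solar-orig_samostalnik_all-examples.tsv", sep="\t")
meta = pd.read_csv("solar-meta.tsv", sep="\t")

# Attach the school-text metadata to every matched structure via the shared text ID.
enriched = examples.merge(meta, on="Text ID", how="left")
enriched.to_csv("solar-orig_samostalnik_all-examples_metadata.tsv", sep="\t", index=False)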
Lastly, the extended versions of the two original output files ("solar-orig_*_default_tree-description.tsv", "solar-orig_*_all-examples_metadata_tree-description.tsv") were converted into Excel spreadsheets.
The package also includes a configuration file for each level: "config_solar_besednozvezne.ini" for phrase-level structures, and "config_solar_medstavcne.ini" for sentence-level structures. These files contain all the parameter values used for data extraction with STARK.
For more details, see "00README.txt".