Institute for Computational Linguistics “A. Zampolli”
ILC4CLARIN: Linguistic Data and NLP Tool
955 research outputs found
GreekSchools Public Editions
The GitHub repository archive hosting the XML documents for the open access critical edition of the 885222-GreekSchools ERC project.
GreekSchools XML Data for PHerc. 327 (Philodemus, History of the Eleatic and the Atomistic Schools, edited by Verhasselt, Gertjan), PHerc. 1691/1021, PHerc. 164 (Philodemus, Academicorum Index, edited by Fleischer, Kilian J.), PHerc. 1020 (Stoicus Scriptor Anonymus, [Opus incertum], edited by Alessandrelli, M. and Ranocchia, G.), PHerc. 1004 (Philodemus, [De rhetorica], Uncertain Book, edited by Ranocchia, G. and Vassallo, Ch.), PHerc. 1508 (Philodemus, [Index Pythagoreorum], edited by Avdoulou, E.).
XML Data available for the CoPhi Viewer Web App
Domain-Specific Languages for the GreekSchools project
The repository hosts the Context-Free Grammars for the Domain-Specific Languages developed within the GreekSchools project.
The repository includes diplomatic and literary DSLs for transcription, palaeographic and critical DSLs for the apparatuses, and a DSL for a modern translation.
The goal is to create a flexible environment for the scholarly editing of critical texts, constantly monitored by the community, by making both electronic texts and digital sources remotely accessible through a single interface with advanced capabilities.
StarwarsNER French Italian Corpus - sample
The StarwarsNER French Italian Corpus - sample is a multilingual benchmark resource for Named Entity Recognition (NER) in the wastewater and stormwater management domain.
It supports research in:
- Information extraction
- Relation extraction
- Entity linking
The corpus consists of manually annotated parallel French and Italian documents, aligned at the sentence level. Annotations follow a domain-specific schema based on the Sewer Network Ontology.
For copyright reasons, this release contains only a sample of the original corpus, namely 8 French documents from public administrations and their Italian translations.
---
## Resource Creation
1. **French corpus**
   - Collected from reports, regulations, and local media texts.
   - Manually annotated according to the STARWARS schema.
2. **Italian corpus**
   - Produced via machine translation of the French texts.
   - Reviewed and corrected by bilingual translation students and expert hydrologists.
3. **Annotation process**
   - Conducted with the **INCEpTION** annotation platform.
   - Ensured consistent alignment between French and Italian.
For details, please refer to the publication:
F.A. Cardillo, F. Debole, F. Frontini, M. Aelami, N. Chahinian, S. Conrad (2025) “Novel Benchmark for NER in the Wastewater and Stormwater Domain”, Proceedings of the 6th IEEE MNLP Conf. (CiST-MNLP’2025) 4-10 October 2025, Marrakech, Morocco.
---
## Contents of this Package
- **Texts**: Provided in plain text.
- **Annotations**: Provided in **CoNLL 2003 format, as exported from INCEpTION**.
- **Annotation guidelines**: Included in both **French** and **Italian**, as used by annotators.
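
As an illustration of how CoNLL 2003 style annotations can be read, here is a minimal Python sketch. It assumes one token per line, a blank line between sentences, the token in the first column, and an IOB-style NER tag in the last column; the exact column layout of the INCEpTION export should be checked against the released files. The sample sentence and the labels `ELEMENT` and `PLACE` are invented for this example and are not taken from the STARWARS schema.

```python
from typing import List, Tuple

Sentence = List[Tuple[str, str]]  # (token, NER tag) pairs


def read_conll(text: str) -> List[Sentence]:
    """Split CoNLL-style text into sentences of (token, tag) pairs.

    Assumption: first column is the token, last column is the NER tag.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            # A blank line (or document marker) closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split()
        current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences


def extract_entities(sentence: Sentence) -> List[Tuple[str, str]]:
    """Group IOB-tagged tokens into (entity text, label) spans."""
    entities: List[Tuple[str, str]] = []
    tokens: List[str] = []
    label = None
    for token, tag in sentence:
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != label
        )
        if starts_new:
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [token], tag[2:]
        elif tag.startswith("I-"):
            tokens.append(token)
        else:  # "O" or any other tag ends the open entity
            if tokens:
                entities.append((" ".join(tokens), label))
            tokens, label = [], None
    if tokens:
        entities.append((" ".join(tokens), label))
    return entities


# Invented sample: hypothetical labels, placeholder "_" in the middle column.
SAMPLE = """\
Le _ O
collecteur _ B-ELEMENT
principal _ I-ELEMENT
de _ O
Montpellier _ B-PLACE

Il _ O
déborde _ O
"""

sentences = read_conll(SAMPLE)
entities = extract_entities(sentences[0])
```

Because the French and Italian documents are aligned at the sentence level, the same reader could be applied to both language versions and the resulting sentence lists compared index by index.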
NomadLingo1.0 open
The NomadLingo1.0 corpus contains transcripts of extracts from naturally occurring conversations that were audio-recorded between November 2023 and April 2024 at social events organised and promoted in digital nomad communities based in Madeira and the Canary Islands. The transcribed recordings total 11 hours 38 minutes. For further information about the texts in the corpus, see Section 4.
The corpus aims to represent translingual interactions based on the fluid use of English as a lingua franca, other linguae francae such as Spanish, and strategies of transcultural communication such as intercomprehension and peer/self-translation.
Automatic Speech Recognition and Token boundary detection for Italian Speech
Open-source code implementing an Automatic Speech Recognition system for Italian speech.
It can perform automated transcription as well as speech-text alignment.
DH ATLAS: Knowledge Graph
A knowledge graph representative of Italian Digital Cultural Heritage projects.
The DH ATLAS Knowledge Graph is currently available as a set of Turtle files and gathers metadata on a list of examined research products and their related entities. This release includes Turtle (.ttl) serializations of the records created during the Datathon held as part of the ATLAS workshop on March 26, 2025.
Women’s Empowerment – Inner and Outer Communication (Pilot Corpus)
The submitted data consists of the Women’s Empowerment Pilot Corpus, a curated collection of 30 short texts and dialogue excerpts documenting the communicative journey of empowerment. The corpus is divided into two dimensions: (a) internal dialogue, capturing expressions of self-reflection, emotional recognition, and inner transformation, and (b) external expression, covering assertion, resistance, self-definition, and boundary-setting. Each utterance has been annotated with a pragmatic-functional schema, including the categories Inner Realization (IR), Resistance (R), Assertive Act (AA), and Identity Redefinition (ID).
The resource is encoded in TEI/XML and accompanied by CMDI metadata to ensure CLARIN compliance. A JSON version is also provided to facilitate integration into NLP pipelines. The corpus is designed as a proof-of-concept resource that operationalizes theoretical insights from semantics and pragmatics into computationally reusable linguistic data. It contributes to CLARIN-IT by enriching the infrastructure with gender-sensitive communication models and offering applications in education, digital humanities, and cross-cultural studies.
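A minimal Python sketch of how such a JSON version might be consumed in a pipeline. The layout of the released JSON file is not described here, so the field names `text`, `dimension`, and `label` are assumptions made for illustration, and the sample records are placeholders rather than corpus data. Only the four category codes and the two dimensions come from the description above.

```python
import json
from collections import Counter

# Category codes as listed in the corpus description.
LABELS = {
    "IR": "Inner Realization",
    "R": "Resistance",
    "AA": "Assertive Act",
    "ID": "Identity Redefinition",
}

# Placeholder records; field names are hypothetical, not the real schema.
SAMPLE_JSON = """
[
  {"text": "placeholder utterance A", "dimension": "internal", "label": "IR"},
  {"text": "placeholder utterance B", "dimension": "internal", "label": "IR"},
  {"text": "placeholder utterance C", "dimension": "external", "label": "AA"},
  {"text": "placeholder utterance D", "dimension": "external", "label": "R"}
]
"""


def count_labels(raw: str) -> Counter:
    """Count utterances per (dimension, label) pair, rejecting unknown codes."""
    counts: Counter = Counter()
    for record in json.loads(raw):
        label = record["label"]
        if label not in LABELS:
            raise ValueError(f"unknown category code: {label!r}")
        counts[(record["dimension"], label)] += 1
    return counts


counts = count_labels(SAMPLE_JSON)
```

Validating the codes on load keeps a typo in the annotation layer from silently creating a fifth category downstream.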
MariTerm v.1.2
This is an enriched version of the MariTerm maritime ontology, containing plug-ins to the corresponding synsets in IWN. The resource was created as an internship project within a collaboration between the Institute for Computational Linguistics "A. Zampolli" in Pisa and the University of Bologna.
Parlement of Foules, a digital diplomatic edition
A digital edition of the Middle English poem “Parlement of Foules” by Geoffrey Chaucer, featuring a diplomatic transcription of the text found in MS Gg.4.27(1), Cambridge University Library. The edition is encoded in XML format according to TEI Guidelines and includes manuscript description metadata, the full transcription, and links to the electronic facsimile hosted on the Cambridge University Library website. The transcription preserves original spelling, punctuation, and scribal choices, with selective expansion of abbreviations.
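A minimal Python sketch of how abbreviation expansions in a TEI transcription can be read either way. It assumes the common TEI pattern of a `<choice>` element pairing `<abbr>` with `<expan>`; whether this edition encodes its expansions that way is not stated here and should be checked against the files. The sample line is invented and is not a reading from the manuscript.

```python
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI namespace in ElementTree notation

# Invented verse line; NOT a transcription from MS Gg.4.27(1).
SAMPLE = (
    '<l xmlns="http://www.tei-c.org/ns/1.0" n="1">an invented '
    "<choice><abbr>wt</abbr><expan>with</expan></choice> line</l>"
)


def render(elem: ET.Element, prefer: str) -> str:
    """Return the text of elem, keeping only the `prefer` child
    ("abbr" or "expan") of every <choice> element."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == TEI + "choice":
            chosen = child.find(TEI + prefer)
            if chosen is not None:
                parts.append(render(chosen, prefer))
        else:
            parts.append(render(child, prefer))
        # Text following the child element belongs to the parent's content.
        parts.append(child.tail or "")
    return "".join(parts)


line = ET.fromstring(SAMPLE)
diplomatic = render(line, "abbr")   # abbreviated form as written
expanded = render(line, "expan")    # editorially expanded form
```

Keeping both readings in the markup and choosing one at render time lets a single file serve a diplomatic view and a reading view.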
CompL-it
CompL-it is a computational lexicon for Italian derived from LexicO (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-977), with the integration of the following resources:
- M-GLF (https://dspace-clarin-it.ilc.cnr.it/repository/xmlui/handle/20.500.11752/ILC-1002), a list of lemmatized forms generated by the morphological analyzer MAGIC (Battista and Pirrelli, 1999; Pirrelli and Battista, 2000);
- a set of treebanks for Italian (contained in https://lindat.cz/repository/xmlui/handle/11234/1-4611):
  - ISDT;
  - VIT;
  - ParTUT;
  - ParlaMint-it.
The resource contains a morphological layer (including lemmas, inflected forms, and morphological features) and a semantic layer (including senses and relations between them). Entries are encoded according to the OntoLex-Lemon model and made available as a semantic repository.