This article reports some of the main achievements of the EU-funded PRINCIPLE project in collecting high-quality language resources (LRs) in the legal domain for four under-resourced European languages, namely Croatian, Irish, Norwegian and Icelandic. After illustrating the significance of this work for developing translation technologies in the context of the European Union and the European Economic Area, the paper outlines the main steps of data collection, curation and sharing of the LRs gathered with the support of public and private data contributors. This is followed by the description of the development pipeline and key features of the state-of-the-art bespoke neural machine translation (MT) engines for the legal domain that were built using this data. The MT systems were evaluated with a combination of automatic and human methods to validate the quality of the LRs collected in the project; the high-quality LRs were subsequently shared with the wider community via the ELRC-SHARE repository. The main challenges that were encountered in this work are discussed, emphasising the importance and the key benefits of sharing high-quality digital LRs.

Sharing high-quality language resources in the legal domain to develop neural machine translation for under-resourced European languages / Bago, Petra; Castilho, Sheila; Celeste, Edoardo; Dunne, Jane; Gaspari, Federico; Rúnar Gíslason, Níels; Kåsen, Andre; Klubička, Filip; Kristmannsson, Gauti; Mchugh, Helen; Moran, Róisín; Ní Loinsigh, Órla; Arild Olsen, Jon; Parra Escartín, Carla; Ramesh, Akshai; Resende, Natalia; Sheridan, Páraic; Way, Andy. - In: REVISTA DE LLENGUA I DRET. - ISSN 2013-1453. - 78:78(2022), pp. 9-34. [10.2436/rld.i78.2022.3741]

Sharing high-quality language resources in the legal domain to develop neural machine translation for under-resourced European languages

Federico Gaspari
Member of the Collaboration Group
2022

Files in this item:
3741-10299-1-PB.pdf (open access)
Description: PDF
Type: Publisher's version (PDF)
Licence: Not specified
Size: 665.07 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/905220