The CorDis Corpus Mark-up and Related Issues / Venuti, Marco; Cirillo, Letizia; Marchi, A. - In: Proceedings from the Corpus Linguistics Conference Series. - ISSN 1747-9398. - Electronic. - (2007), pp. 0-0.

The CorDis Corpus Mark-up and Related Issues

VENUTI, MARCO; CIRILLO, LETIZIA
2007

Abstract

CorDis is a large, multimodal, multigenre corpus, encoded in TEI-conformant XML and POS-tagged, representing a significant portion of the political and media discourse on the 2003 Iraqi conflict. It was generated from different sub-corpora assembled by various research groups: official transcripts of Parliamentary sessions in both the US and the UK, transcripts of the Hutton Inquiry, American and British newspaper coverage of the conflict, White House press briefings, and transcriptions of American and British TV news programmes. The heterogeneity of the data, the specificity of the genres and the diverse discourse-analytical purposes of the different groups had led to a wide range of coding strategies being employed to make textual and meta-textual information retrievable. The main purpose of this paper is to show the process of harmonisation and integration whereby a loose collection of texts has become a stable architecture. The TEI proved a valid instrument for standardising the mark-up: its guidelines provide for a hierarchical organisation which gives the corpus a sound structure, favouring replicability and enhancing the reliability of research. In discussing some examples of the problems encountered in the annotation, we deal with issues such as consistency and re-usability, and examine the constraints imposed on data handling by specific research objectives. Examples include the choice to code the same speakers in different ways depending on the various (institutional) roles they may assume throughout the corpus, the distinction between quotations of spoken or written discourse and quotations read aloud in the course of a spoken text, and the segmentation of portions of news according to participants' interaction and the use of camera/voiceover.
Files in this item:
258_Paper.pdf (open access; type: Abstract; licence: Public domain; 59.86 kB, Adobe PDF)

Use this identifier to cite or link to this document: https://hdl.handle.net/11588/122076