Textual Data Analysis tools for Word Sense Disambiguation / Balbi, Simona; Stawinoga, Agnieszka Elzbieta. - 1:(2014), pp. 57-66. (Paper presented at the JADT 2014 conference, held in Paris, 3-6 June 2014).
Textual Data Analysis tools for Word Sense Disambiguation
Balbi, Simona; Stawinoga, Agnieszka Elzbieta
2014
Abstract
The ambiguity of words is a crucial issue in the automatic analysis of documentary databases. In Text Mining, Word Sense Disambiguation (WSD) is the task of assigning a particular sense to a term with different meanings, in the case of both coincidental and polysemous homographs. In the literature, the proposed solutions rely mainly on two elements: some knowledge related to the term and the context in which the term appears. Limiting the related knowledge to grammatical tagging and to an analysis of collocations, here we focus our attention on identifying the context in a data-driven approach. Our framework is based on Textual Data Analysis, and we assume that language and knowledge can be modeled as networks of words and the relations between them. The aim of this paper is to propose an extension of the strategy for building lexical resources in Balbi et al. (2012), in order to deal with ambiguous words. The methodological basis is the joint use of lexical Correspondence Analysis and Network Analysis. Our idea is to investigate the neighborhood of ambiguous terms with respect to the different latent semantic components emerging from a Correspondence Analysis of a training set of documents, in order to build rules useful for solving WSD problems in the entire corpus.
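As an illustrative sketch only (the paper's actual procedure is not reproduced here), lexical Correspondence Analysis can be computed as the SVD of the standardized residuals of a term-document contingency table; the coordinates of terms on the resulting latent semantic axes can then be compared for an ambiguous term and its neighbors. The toy data, term list, and function below are hypothetical, not taken from the paper:

```python
import numpy as np

# Hypothetical toy term-document matrix: rows = terms, columns = documents.
# "bank" is the ambiguous term (financial sense vs. river sense).
terms = ["bank", "money", "loan", "river", "water"]
N = np.array([
    [4, 3, 5, 2],   # "bank" occurs in both finance and nature documents
    [5, 4, 0, 0],   # "money": finance documents only
    [3, 5, 0, 0],   # "loan":  finance documents only
    [0, 0, 4, 3],   # "river": nature documents only
    [0, 0, 5, 4],   # "water": nature documents only
], dtype=float)

def correspondence_analysis(N):
    """Classical CA: SVD of the standardized residuals of the
    relative-frequency table, returning principal row coordinates."""
    P = N / N.sum()
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, sv, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sv) / np.sqrt(r)[:, None]  # principal coordinates of terms
    return rows, sv

rows, sv = correspondence_analysis(N)

# Inspect where each term falls on the first latent semantic axis;
# terms sharing a sense with "bank" cluster on the same side.
for t, coord in zip(terms, rows[:, 0]):
    print(f"{t:6s} {coord:+.3f}")
```

On the first axis the finance-only terms fall on one side and the nature-only terms on the other, while the ambiguous term sits between them; in the paper's strategy such axis positions, combined with the network neighborhood of the ambiguous term, would feed the disambiguation rules.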