Regularised PCA for incremental single imputation of missings / IODICE D'ENZA, Alfonso; Markos, Angelos; Palumbo, Francesco. - (2022). (Paper presented at the COMPSTAT 2022 conference, held in Bologna, Italy, 23-26 August 2022).
Regularised PCA for incremental single imputation of missings
Alfonso Iodice D'Enza; Francesco Palumbo
2022
Abstract
Principal Component Analysis (PCA) relies on the eigendecomposition of a suitably transformed data matrix, so its standard application requires the data set to be complete (no missing entries). Alternative implementations that extend PCA to incomplete data sets have been proposed in the literature. Recent comparative reviews of PCA algorithms for data with missing values found the regularised iterative PCA algorithm (RPCA) to be effective. In some applications, incomplete data are produced continuously (e.g. process sensor data) and the corresponding data flow is often analysed in chunks (subsets of observations). In this setting, RPCA could be applied to each chunk separately, with the result that the PCA solutions (and hence the imputations) of the single chunks are independent of one another. An incremental RPCA implementation is proposed in which the imputation of each new chunk is based on that chunk and on all the chunks analysed so far. The proposed procedure is compared to batch RPCA on different data sets and under different missing data mechanisms. Experimental results show that the incremental approach performs appreciably well when the data are missing not completely at random and the first analysed chunks contain sufficient information on the data structure.
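To make the chunk-wise idea in the abstract more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a simplified regularised iterative PCA in which missing cells start at column means and are repeatedly replaced by a shrunken low-rank reconstruction, with the noise variance estimated crudely from the discarded singular values. The incremental step shown here simply stacks the already imputed earlier chunks on top of the new chunk and re-runs the imputation; this naive stacking is an assumption made for illustration, and the paper's actual update rule may differ. The function names `rpca_impute` and `incremental_impute` and the parameter `n_components` are hypothetical.

```python
import numpy as np


def rpca_impute(X, n_components=2, max_iter=200, tol=1e-6):
    """Fill NaNs in X with a simplified regularised iterative PCA (sketch)."""
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)
    # Initialise missing cells with the column means of the observed entries
    # (assumes every column has at least one observed value).
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    k = n_components
    for _ in range(max_iter):
        mean = filled.mean(axis=0)
        U, d, Vt = np.linalg.svd(filled - mean, full_matrices=False)
        # Crude noise-variance estimate: mean of the discarded squared
        # singular values (an assumption of this sketch).
        sigma2 = np.mean(d[k:] ** 2) if k < len(d) else 0.0
        d_k = d[:k]
        # Shrink the retained singular values: (d^2 - sigma2) / d, floored at 0.
        shrunk = np.divide(
            np.maximum(d_k ** 2 - sigma2, 0.0),
            d_k,
            out=np.zeros_like(d_k),
            where=d_k > 0,
        )
        recon = (U[:, :k] * shrunk) @ Vt[:k] + mean
        # Only missing cells are overwritten; observed cells keep their values.
        new_filled = np.where(missing, recon, X)
        change = np.sum((new_filled - filled) ** 2)
        filled = new_filled
        if change < tol:
            break
    return filled


def incremental_impute(chunks, n_components=2):
    """Impute each chunk using it plus all previously imputed chunks (sketch)."""
    history = None
    imputed_chunks = []
    for chunk in chunks:
        chunk = np.asarray(chunk, dtype=float)
        stacked = chunk if history is None else np.vstack([history, chunk])
        filled = rpca_impute(stacked, n_components=n_components)
        # Earlier rows contain no NaNs, so they are left unchanged; only the
        # new chunk's missing cells are imputed.
        imputed_chunks.append(filled[-len(chunk):])
        history = filled
    return imputed_chunks
```

Calling `incremental_impute([chunk_1, chunk_2, ...])` on arrays with `np.nan` in the missing cells returns one imputed array per chunk. Applying `rpca_impute` to each chunk on its own corresponds to the independent per-chunk analysis described in the abstract, while applying it once to all rows stacked together corresponds to the batch setting.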