Database integrity and the identifications of unicorns and chimeras: influence on validity of QSAR/machine learning models

Budzelaar, P. H. M.; Ehm, Christian; Antinucci, Giuseppe; Vittoria, Antonio; Dall’Anese, Anna; Goryunov, Georgy P.; Kulyabin, Pavel S.; Uborsky, Dmitry V.; Voskoboynikov, Alexander Z.; Zuccaccia, Cristiano; Macchioni, Alceo; Cipullo, Roberta; Busico, Vincenzo

“No model can be better than the data base it was trained on”1 and “garbage in/garbage out”2 are universal truths in statistical modeling. Although decades of published literature provide a wealth of information, differences in experimental polymerization conditions (pressure, activation and scavenging protocols, impurity levels, etc.) render attempts to compile databases from the literature for modeling purposes nearly useless. Modern high-throughput experimentation approaches can yield highly accurate and internally consistent databases amenable to accurate statistical models. However, statistical modeling relies on generating descriptors from computed structures that are representative of the active species at the transition state. It is commonly understood that such models can only work when this condition is met. In the last 5 years we have screened >100 catalysts belonging to different catalyst classes for their performance in propene homopolymerization. In several cases the above-mentioned condition is not met; however, this is not always obvious from the polymerization results. In fact, careful analysis of the polymerization results, dedicated NMR studies and/or additional polymerization experiments are sometimes needed to identify catalysts that should not be included in the modeling because they change under polymerization conditions, so that the precursor structure is not representative of the active species (“chimera”). Using various examples, we show how such cases can be reliably identified and which substituent patterns are prone to chemical changes under polymerization conditions. Additionally, we discuss the effects of including structures that differ substantially from the average catalyst structure (“unicorn”). Finally, we demonstrate the effects that including such catalysts would have on the outcome and meaning of statistical models. This research forms part of the research programme of DPI, project #835.
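As an illustration of what "substantially different from the average catalyst structure" could mean in practice, the following is a minimal sketch of a descriptor-space screen. It is not the authors' procedure: the z-score criterion, the threshold, the function name `flag_unicorns`, the catalyst labels and all descriptor values are assumptions made up for this example. Note that such a screen can only point to candidate "unicorns"; "chimeras" cannot be found this way, since identifying them requires the experimental evidence described above.

```python
from statistics import mean, pstdev


def flag_unicorns(descriptors, threshold=2.0):
    """Return the names of entries whose largest absolute z-score,
    taken over all descriptor columns, exceeds `threshold`.

    `descriptors` maps a catalyst label to a list of descriptor values
    (same length and order for every catalyst). Hypothetical helper.
    """
    names = list(descriptors)
    n_cols = len(descriptors[names[0]])

    # Column-wise mean and population standard deviation.
    columns = [[descriptors[n][j] for n in names] for j in range(n_cols)]
    stats = [(mean(col), pstdev(col)) for col in columns]

    flagged = []
    for name in names:
        z_max = 0.0
        for value, (mu, sd) in zip(descriptors[name], stats):
            if sd > 0:  # skip constant descriptors
                z_max = max(z_max, abs(value - mu) / sd)
        if z_max > threshold:
            flagged.append(name)
    return flagged


# Made-up data: eight hypothetical catalysts, two hypothetical descriptors.
example = {
    "cat_A": [1.00, 10.0],
    "cat_B": [1.10, 11.0],
    "cat_C": [0.90, 9.0],
    "cat_D": [1.00, 10.0],
    "cat_E": [1.05, 10.5],
    "cat_F": [0.95, 9.5],
    "cat_G": [1.00, 10.0],
    "cat_H": [3.00, 10.2],  # far from the others in the first descriptor
}

print(flag_unicorns(example))
```

A flagged entry is not necessarily wrong data; the sketch only marks structures that a model trained on the remaining entries would have to extrapolate to, which is the situation where their inclusion can dominate the fitted relationship.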

Database integrity and the identifications of unicorns and chimeras: influence on validity of QSAR/machine learning models / Budzelaar, P. H. M.; Ehm, Christian; Antinucci, Giuseppe; Vittoria, Antonio; Dall’Anese, Anna; Goryunov, Georgy P.; Kulyabin, Pavel S.; Uborsky, Dmitry V.; Voskoboynikov, Alexander Z.; Zuccaccia, Cristiano; Macchioni, Alceo; Cipullo, Roberta; Busico, Vincenzo. - (2023). (Contribution presented at the BlueSky/Incorep Polyolefin Conference, held in Sorrento, Italy, 12-16 June 2023).