Effect of Sampling Dataset Size and Distribution on Machine Learning-Based Surrogate Modelling of CO2 Corrosion (C2026-00149)

Thursday, March 19, 2026

1:30 PM - 2:00 PM Central

Location: 370 AB

Earn 0.5 PDH

James Willson, Harvey Thompson, Richard Woollam, Richard Barker

Presenting Author(s)

James Willson

University of Leeds

Approaches to CO2 corrosion prediction have evolved significantly, from empirical correlations derived from operational data to advanced mechanistic models of the underlying physicochemical processes. However, mechanistic models remain largely inaccessible in everyday engineering practice because they are computationally demanding and require specialist expertise.
Surrogate modelling offers a way to bridge this gap by creating mathematical approximations of the corrosion model outputs, and Machine Learning (ML) provides a powerful framework for developing these surrogates. This study builds on a recent investigation into the use of ML methods for creating surrogate models of state-of-the-art CO2 corrosion predictions. It explores potential improvements in the accuracy and efficiency of a range of Gradient Boosted Decision Tree and Neural Network ML models.
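To make the surrogate idea concrete, the following is a minimal sketch of gradient boosting with depth-1 regression trees (stumps) fitted to samples of a slow model. It is an illustration only and not the authors' implementation: `toy_corrosion_model`, the single temperature input, its range, and all hyperparameters are hypothetical placeholders, and a real study would use a full mechanistic model and an established ML library.

```python
import math
import random


def toy_corrosion_model(temperature_c):
    """Hypothetical stand-in for an expensive mechanistic model (not a real correlation)."""
    return 0.05 * temperature_c + math.sin(temperature_c / 10.0)


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree: one split threshold, one constant per side."""
    n = len(xs)
    order = sorted(range(n), key=lambda i: xs[i])
    total = sum(residuals)
    left_sum = 0.0
    best = None
    for rank in range(1, n):
        left_sum += residuals[order[rank - 1]]
        lo, hi = xs[order[rank - 1]], xs[order[rank]]
        if lo == hi:
            continue  # cannot split between identical inputs
        left_mean = left_sum / rank
        right_mean = (total - left_sum) / (n - rank)
        # Maximising this score is equivalent to minimising squared error.
        score = rank * left_mean ** 2 + (n - rank) * right_mean ** 2
        if best is None or score > best[0]:
            best = (score, (lo + hi) / 2.0, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean


def fit_boosted_surrogate(xs, ys, n_rounds=200, learning_rate=0.1):
    """Gradient boosting for squared loss: each stump fits the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(xs)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left, right = fit_stump(xs, residuals)
        stumps.append((threshold, left, right))
        preds = [
            p + learning_rate * (left if x <= threshold else right)
            for x, p in zip(xs, preds)
        ]

    def predict(x):
        return base + learning_rate * sum(
            left if x <= threshold else right for threshold, left, right in stumps
        )

    return predict


def rmse(predict, xs, ys):
    """Root-mean-square error of a predictor on paired inputs and targets."""
    return math.sqrt(sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


rng = random.Random(0)
# Sample the slow model once, then train the fast surrogate on those samples.
train_x = [rng.uniform(0.0, 80.0) for _ in range(200)]
train_y = [toy_corrosion_model(x) for x in train_x]
surrogate = fit_boosted_surrogate(train_x, train_y)

# Compare against a constant (mean-only) baseline on unseen inputs.
test_x = [rng.uniform(0.0, 80.0) for _ in range(100)]
test_y = [toy_corrosion_model(x) for x in test_x]
train_mean = sum(train_y) / len(train_y)
baseline_rmse = rmse(lambda x: train_mean, test_x, test_y)
surrogate_rmse = rmse(surrogate, test_x, test_y)
```

Once trained, `surrogate(x)` can be evaluated without calling the slow model again, which is the practical appeal of the approach.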
The study first investigates the correlations between prediction error and key model input variables, and then considers how the choice of error metric influences the comparative performance and selection of ML models. A comprehensive investigation into the influence of sampling dataset size and distribution shows that concentrating samples near boundaries allows the sampling dataset size to be reduced by an order of magnitude without a significant loss of surrogate modelling accuracy.
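One way a boundary-concentrated sampling plan could be generated is sketched below. The abstract does not state how the authors biased their samples, so the symmetric Beta distribution used here is an assumption chosen for illustration, and the temperature range, sample counts, and 10% margin are hypothetical.

```python
import random


def sample_uniform(n, low, high, rng):
    """Draw n points spread evenly (in expectation) over [low, high]."""
    return [rng.uniform(low, high) for _ in range(n)]


def sample_boundary_biased(n, low, high, rng, shape=0.5):
    """Draw n points from a scaled Beta(shape, shape) distribution.

    With shape < 1 the density is U-shaped, so samples pile up near both ends.
    """
    return [low + (high - low) * rng.betavariate(shape, shape) for _ in range(n)]


def fraction_near_bounds(samples, low, high, margin=0.1):
    """Fraction of samples lying within `margin` of the range from either bound."""
    width = (high - low) * margin
    near = sum(1 for s in samples if s < low + width or s > high - width)
    return near / len(samples)


rng = random.Random(42)
low, high = 20.0, 80.0  # hypothetical temperature range in degrees Celsius

# A large uniform plan versus a plan ten times smaller that favours the bounds.
uniform_samples = sample_uniform(5000, low, high, rng)
biased_samples = sample_boundary_biased(500, low, high, rng)

uniform_fraction = fraction_near_bounds(uniform_samples, low, high)
biased_fraction = fraction_near_bounds(biased_samples, low, high)
```

Comparing `biased_fraction` with `uniform_fraction` shows how much more of the smaller plan sits near the edges of the range. Whether that translates into comparable surrogate accuracy depends on the model being approximated and is the question the paper addresses.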