Recent advances in modelling and control of liquid chromatography

For more than a century, chromatography has been indispensable as a separation method for both analytics and purification. Among the variety of chromatographic techniques, liquid chromatography has a special status owing to its efficiency and versatility, and its status is further enhanced by the continuous improvements of analysers, materials, methods and understanding, all supported by computational approaches. High performance liquid chromatography (HPLC) has always held a special place in pharmaceutical processing, and computational HPLC has been explored since the very early stages of computing, although without having yet reached its full potential. Herein, we provide a comprehensive and critical review of recent developments in designing and operating liquid chromatographic systems, focussing on their modelling approaches and control strategies at large scale.


Introduction
Liquid chromatography (LC), and in particular high performance liquid chromatography (HPLC), is the most common separation method in the production of pharmaceutical and biopharmaceutical products. The method is highly versatile, used for fast analysis and high yield separation at both preparative and process scale. While HPLC was initially operated only in batch mode, techniques allowing continuous operation, such as counter-current chromatography [1][2][3] and simulated moving beds [4], have recently advanced significantly.
Within the chemical industries, chromatographic processes cannot yet be designed with the same confidence as, say, distillation, and laboratory experimentation and pilot plant testing are normally necessary. As the elution behaviour is complex, the development of accurate experimental procedures is challenging, and usually the number of experiments required is limited by the availability of expensive material. Mathematical modelling is an invaluable tool to reduce the number of costly and time-consuming experiments, as well as to gain insight into separation mechanisms to support design decisions at production scale. Therefore, the use of in silico (HP)LC can accelerate analytical and preparative method development with reduced experimental effort and material, yielding improved purity and yield of the desired product while reducing solvent consumption. This is of special importance for early stage drug development and in turn for the reduction of the time-to-market of new drugs. In biopharmaceutical production, optimising preparative HPLC for chiral drugs is of particular importance as their purification is a common manufacturing bottleneck [5,6]. Once the process has been designed, appropriate monitoring and control measures are required to ensure the operation is conducted optimally and without disturbances.
Although computational methods have always accompanied LC, the continuous improvement of mathematical models including commercial software [7], more affordable computational power (e.g. via cloud computing) [8], the acceptance of simulations for Quality by Design (QbD) concepts by pharma regulatory bodies [9,10], and the exciting trend towards machine learning or artificial intelligence [11] are likely to change academic and industrial practices in the coming decade. This work reviews the broad landscape of modelling and control for LC, focussing on the current state of the art, the mathematical models available and how these have been used recently.
Commercial tools use a database either of experimental chromatograms or of physicochemical parameters (or a combination thereof). Peak tracking algorithms are also commonly included [15], as well as algorithms for optimising HPLC methods. Until now, however, 'optimising' HPLC conditions has commonly been achieved via experimental design and model response surfaces [16,17], and not via rigorous optimisation. These models, if they can be called models, do not provide any fundamental insight into the process, although they can be fairly accurate if fitted well, and are commonly used for robustness studies [18]. Still, they are not much superior to basic trial-and-error approaches owing to the high experimental effort that they require.
The simplest computer-assisted methods are based on linear solvent strength (LSS) theory [19,20]. LSS assumes a linear relation between the logarithm of the retention factor and the volume fraction of the organic phase. The parameters describing this relation can be determined from a small set of experiments. Not least due to its simplicity, LSS theory is used frequently to predict the retention factors and the related elution times for changing mobile phase compositions [21,22], including within commercial software. More advanced retention models considering both mobile and stationary phase properties are the linear solvation energy relationships (LSER), introduced in the 1980s and still used [23][24][25], again including within commercial software. LSER uses semi-empirical expressions, derived from first principles, to relate the retention time to solvent-dependent solute parameters such as polarisability, hydrogen bond acidity/basicity and molecular volume. Hence, LSER can predict elution times of new (usually small molecule) solutes for different mobile phase compositions and types of solvent [26].
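The LSS relation described above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular software package: it assumes the common form log10(k) = log10(k_w) - S·phi, where k_w, S and the column dead time t0 are symbols introduced here for illustration, and the "scouting runs" are synthetic values generated from assumed parameters rather than measured data.

```python
import numpy as np

def fit_lss(phi, k):
    """Fit log10(k) = log10(k_w) - S * phi by linear least squares."""
    slope, intercept = np.polyfit(np.asarray(phi, dtype=float),
                                  np.log10(np.asarray(k, dtype=float)), 1)
    return intercept, -slope  # (log10 k_w, S)

def retention_time(phi, log_kw, S, t0):
    """Isocratic retention time t_R = t0 * (1 + k) at organic fraction phi."""
    return t0 * (1.0 + 10.0 ** (log_kw - S * phi))

# Synthetic scouting runs generated from assumed parameters (not measured data)
phi_runs = np.array([0.30, 0.40, 0.50])
k_runs = 10.0 ** (2.2 - 4.0 * phi_runs)

log_kw, S = fit_lss(phi_runs, k_runs)
# Predicted retention time for a composition that was not "measured"
t_r = retention_time(0.35, log_kw, S, t0=1.0)
```

With real data the fit would not be exact, and curvature in log10(k) versus phi over wide composition ranges is one reason the more elaborate retention models discussed in this section exist.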
More versatile, but more complex and frequently data driven, models predicting retention for different chromatographic systems are chromatographic quantitative structure retention relationships (CQSRR) [27,28]. Although CQSRR include LSER models [29], their usage is usually specified explicitly. The concept of CQSRR is to relate variations of one or more response variables describing the retention behaviour to the variations of so-called descriptor variables. These descriptor variables should represent both the chromatographic system and the molecular entity of the solute(s). The latter is commonly accounted for by choosing suitable molecular descriptors, which is not a simple task considering the >5000 options [30]. CQSRR models commonly combine global optimisation algorithms that choose the molecular descriptors with statistical techniques such as multiple linear regression [31,32], chemometrics [33,34], machine learning strategies such as decision trees [35], random forests and support vector machines [36], and artificial neural networks [37].
Despite the success of CQSRR, the often large number of partly 'mysterious' descriptors [38] required to predict retention accurately makes the underlying retention mechanisms difficult to understand. Retention depends on the physicochemical properties of the solutes, which cannot be deduced from their atomic composition; that is, solutes with the same molecular formula can have very different retention behaviours [39]. Predictive models require training with representative (in terms of the physicochemical properties) solutes, a task that is challenging for large molecules owing to their intrinsic complexity. Therefore, CQSRR strategies for biomolecules require large data sets and/or include a model selection step based on the similarity of a solute with the samples used to train individual models [40].
The use of machine learning strategies is by no means new. Nevertheless, the limited experimental data available is a roadblock for retention time predictions for new solutes. Each laboratory uses customised HPLC instruments and unique solvent compositions, gradient profiles, flow rates, and so on. Therefore, machine learning strategies are commonly utilised for relatively small in-house built databases, that is, a small experimental design space of selected solute candidates. It has been shown, however, that for analytical reversed phase HPLC, retention factor predictions of one HPLC system can be projected onto other systems given the generally conserved compound elution order [41]. This has recently enabled machine learning strategies to train models using big databases with retention data for >80k small molecules [11,42].
A common limitation of these models is their lack of flexibility, for example in modelling complex gradients and realistic sample and mobile phase injection profiles. Also, a possible mismatch between the sample solvent and the mobile phase, which is common for preparative HPLC, cannot be accounted for; this limitation also applies to commercial software tools [43]. This is why modelling the actual transport of the solutes and the mobile phase, and considering the concentrations of both, is essential for the development of a true digital LC twin.

Transport models
Since tracking all the solute molecules in (HP)LC is neither feasible nor necessary, continuum approaches in terms of solute concentrations are used. Transport models typically use spatially or temporally averaged solute and solvent concentrations in the mobile and stationary phases. The uniform and dense packing within an HPLC column suggests model reduction to the axial (z) dimension (see Figure 2a). This can be an oversimplification for preparative or process chromatography, where the larger columns (commonly loaded manually) are prone to non-homogeneous packing causing radial velocity and temperature gradients or non-homogeneous sample injection. Hence, two-dimensional (radial and axial) transport models, although not yet the standard, are by no means the exception [44,45].
The simplest descriptions of mass transport through the stationary phase are plate models. These models depict a column of length L by a discrete number N of side-by-side and well-mixed cells/plates of width Δz (= L/N), which is commonly set to the theoretical plate height. The mobile phase transfers from one plate to the next as new mobile phase enters the first plate either continuously or discontinuously (see Figure 2b). Despite dating back to the 1940s, these models are still in use due to their simplicity, adaptability and efficient numerical computation, for example, via parallelisation [43,46].
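The discontinuous variant of the plate model can be illustrated with a short simulation. The sketch below assumes a linear isotherm, so that a constant fraction 1/(1 + k') of the solute in each plate sits in the mobile phase at equilibrium; the retention factor `k_prime`, the function name and all numerical values are hypothetical choices made for illustration.

```python
import numpy as np

def craig_plate_model(n_plates, k_prime, n_steps, injected=1.0):
    """Discontinuous plate model: equilibrate, then shift the mobile phase by one plate."""
    p_mobile = 1.0 / (1.0 + k_prime)      # equilibrium fraction in the mobile phase
    total = np.zeros(n_plates)            # total solute amount in each plate
    total[0] = injected                   # sample loaded onto the first plate
    eluted = np.zeros(n_steps)            # amount leaving the last plate at each step
    for step in range(n_steps):
        mobile = p_mobile * total
        stationary = total - mobile
        eluted[step] = mobile[-1]         # mobile phase of the last plate exits
        shifted = np.zeros(n_plates)
        shifted[1:] = mobile[:-1]         # remaining mobile phase advances one plate
        total = stationary + shifted
    return eluted

n_plates, k_prime, n_steps = 50, 2.0, 1000
eluted = craig_plate_model(n_plates, k_prime, n_steps)
mean_step = np.sum(np.arange(n_steps) * eluted) / np.sum(eluted)
```

In this sketch the mean elution step scales with N(1 + k'), and increasing the number of plates narrows the peak relative to its retention, which is the qualitative behaviour the plate picture is meant to capture.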
Several different models for closing the mass balance equations governing the evolution of chromatographic peaks have been used; these are most prominently summarised by Guiochon et al. [47]. Equilibrium dispersive models (EDM) account for dispersion due to flow through the stationary phase by considering an apparent dispersion coefficient D_a. Although the name indicates that equilibria are assumed, the effects of non-ideal mass transfer (by which we mean that equilibrium is not established instantaneously) can be lumped into D_a if solute mass transfer between stationary and mobile phases is fast compared to axial convection and dispersion. Because of their simplicity and the small number of parameters, EDMs remain a common first choice [4,48], especially if mass transfer can be considered to be fast. This assumption seems valid for small molecules but is commonly adopted without justification.
The more advanced general rate models (GRMs) account for mass transfer effects by incorporating transfer resistance, surface diffusion, adsorption-desorption kinetics, and pore diffusion [45,49]. This is achieved by two additional equations describing the radial solute concentration profiles inside the porous particles (Figure 2c) and the mass transfer between the stationary and mobile phases at the stationary particle surfaces. For certain values of the kinetic parameters, the GRM reduces to a lumped kinetic model, which can be considered as an intermediate between EDM and GRM [50]. Since computational power is no longer a bottleneck for the more demanding GRMs, and because these models provide the highest accuracy, they are now used more commonly, although their application is mostly limited by the number of parameters that have to be estimated using additional models or experiments.
Solving these transport models requires an inlet boundary condition, which depends on the sample injection profile. This transient concentration profile at the column inlet results from the sample volume, the flow rate and sample dispersion before the column. Owing to this additional complexity (or to bad habits), the incorrect assumption of a rectangular injection profile still prevails. There are exceptions, which use either experimentally determined injection profiles [46] or surrogate models derived by convolving Gaussian, square and exponential residence time profiles, which can account for variable sample volumes and flow rates after parameter estimation [51]. Additional complexity arises from solvent mismatch between the sample and the mobile phase, which is more relevant to preparative chromatography [46]. Despite the challenges described, such transport models provide the basis for true digital LC twins, in combination with either first principle models or hybrid computational/experimental retention models. Within a QbD framework, the confidence in the employed models can therefore increase, rendering the application of advanced control strategies for process operation less challenging.
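For orientation, the equilibrium dispersive model for a single solute is commonly written in the form below. The notation is an assumption made here, since the text above does not define it: c and q are the solute concentrations in the mobile and stationary phases, u is the interstitial mobile phase velocity, ε is the total column porosity, F is the phase ratio, and f is the adsorption isotherm.

```latex
\frac{\partial c}{\partial t}
  + F\,\frac{\partial q}{\partial t}
  + u\,\frac{\partial c}{\partial z}
  = D_a\,\frac{\partial^2 c}{\partial z^2},
\qquad F = \frac{1-\varepsilon}{\varepsilon},
\qquad q = f(c)
```

The only transport parameter besides u is the apparent dispersion coefficient D_a, which is why all band broadening contributions, including mass transfer resistance, have to be lumped into it.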
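A non-rectangular injection profile of the kind discussed above can be approximated by convolving the ideal square pulse with a residence time distribution. The sketch below is a deliberate simplification of the cited surrogate models: it uses only a square pulse and a single exponential (well-mixed volume) kernel and omits the Gaussian contribution; the function name, the time constant `tau` and all numerical values are hypothetical.

```python
import numpy as np

def injection_profile(t, c_feed, t_inj, tau):
    """Square injection pulse broadened by an exponential residence time distribution."""
    dt = t[1] - t[0]
    square = np.where(t < t_inj, c_feed, 0.0)   # ideal rectangular injection
    kernel = np.exp(-t / tau)                    # well-mixed volume with time constant tau
    kernel /= kernel.sum() * dt                  # normalise the discretised kernel to unit area
    # Discrete convolution, truncated to the original time window
    return np.convolve(square, kernel)[: t.size] * dt

c_feed, t_inj, tau = 1.0, 1.0, 0.3               # illustrative values only
t = np.linspace(0.0, 10.0, 2001)
dt = t[1] - t[0]
profile = injection_profile(t, c_feed, t_inj, tau)
injected_amount = profile.sum() * dt             # should stay close to c_feed * t_inj
```

The broadened profile conserves the injected amount but has a lower maximum and a tail, which is the feature a rectangular boundary condition misses. In practice `tau` would be estimated from measurements of the system without a column.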

Monitoring and control
In the previous section, we discussed how various modelling approaches can be employed to analyse, model and design chromatographic processes. However, a number of factors, such as imperfect column packing, the presence of disturbances, plant-model mismatch and so on, can hinder optimal operation of the real plant, and product quality specifications might therefore be violated. Adequate process control is often required, which can either be conventional (i.e. P, PI, PID controllers) or advanced (e.g. model predictive control). Typical controlled variables in liquid chromatography include, but are not limited to, product purity, recovery yield, production rate and pH. A number of variables can be manipulated to achieve the desired control performance, such as feed flow rate and composition, switching times for continuous operations and so on (see Figure 3). Although conventional control strategies are economically attractive and simple to implement, centralised advanced control usually outperforms them. Because of their complexity, however, advanced control strategies have not yet been implemented at large, industrial scale despite the growing interest [52].
For batch chromatographic processes, various control strategies have been developed in order to ensure robust control performance. Advanced strategies often combine online measurements and parameter estimation based on online optimisation routines, for example, using Extended (EKF) and Ensemble (EnKF) Kalman Filters. This strategy has been employed for simultaneous estimation of uncertain states and inlet concentration of nonlinear chromatographic processes based on noisy measurements at the outlet, demonstrating that the EKF is significantly faster, although less accurate, compared to the EnKF [53]. Open-loop control has also been considered in order to identify fractionation endpoints that meet purity constraints, for instance, the simultaneous maximisation of recovery yield and production rate for a ternary mixture separation problem of human insulin analogues in an HPLC process. In an effort to minimise buffer and storage tanks, a methodology for the design and control of an Integrated Column Sequence (ICS) has been proposed, which for small scale production can be implemented on a single chromatographic system, for instance controlling a four-column chromatographic system for the separation of a mixture of proteins [54].
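The idea of choosing fractionation endpoints subject to a purity constraint can be illustrated with a toy example. The sketch below is not the optimisation approach of the works discussed above: it assumes a binary mixture with Gaussian peaks and uses an exhaustive search over pooling windows to maximise recovery yield at a minimum purity. The peak parameters, function names and the 99% purity threshold are all hypothetical.

```python
import numpy as np

def gaussian_peak(t, area, t_r, sigma):
    """Gaussian elution profile with the given area, retention time and width."""
    return area / (sigma * np.sqrt(2.0 * np.pi)) * np.exp(-0.5 * ((t - t_r) / sigma) ** 2)

def best_cut(t, c_product, c_impurity, min_purity):
    """Exhaustive search for the pooling window with the highest yield at >= min_purity."""
    dt = t[1] - t[0]
    # Cumulative amounts let every candidate window be evaluated in constant time
    cum_p = np.concatenate(([0.0], np.cumsum(c_product) * dt))
    cum_i = np.concatenate(([0.0], np.cumsum(c_impurity) * dt))
    best = (0.0, None, None)              # (yield, start time, end time)
    for i in range(t.size):
        for j in range(i + 1, t.size + 1):
            m_p = cum_p[j] - cum_p[i]     # product collected in samples i..j-1
            m_i = cum_i[j] - cum_i[i]     # impurity collected in the same window
            if m_p + m_i <= 0.0:
                continue
            purity = m_p / (m_p + m_i)
            recovery = m_p / cum_p[-1]
            if purity >= min_purity and recovery > best[0]:
                best = (recovery, t[i], t[j - 1])
    return best

t = np.linspace(0.0, 8.0, 161)
c_impurity = gaussian_peak(t, area=1.0, t_r=4.0, sigma=0.4)   # earlier eluting impurity
c_product = gaussian_peak(t, area=1.0, t_r=5.5, sigma=0.4)    # target component
yield_, t_start, t_end = best_cut(t, c_product, c_impurity, min_purity=0.99)
```

Tightening the purity requirement moves the start of the window later and lowers the yield, which is the trade-off that the multi-objective formulations in the literature make explicit. A realistic application would replace the Gaussian peaks with simulated or measured chromatograms.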
Currently, the biopharmaceutical manufacturing industry, in particular, is pushing towards the transition from batch to continuous, or at least semi-continuous, operation in order to reduce manufacturing cost and processing times and to increase flexibility and product quality [55]. This transition requires the acquisition and handling of often heterogeneous data through Process Analytical Technologies (PAT) [56] to inform process monitoring and control, and global coordination of decentralised control loops to ensure successful continuous operation [57]. Model-based adaptive control strategies can also be considered, for example, for a continuous two-column capture step of monoclonal antibodies (mAb) using protein A chromatography [58]. Another example is advanced control of a Multicolumn Counter Current Solvent Gradient Purification (MCSGP) process for mAb production, implementing a SIMO multi-parametric MPC controller with an approximate model that tracks the integral of the concentrations at the outlet stream and outperforms P-only control [59,60]. Simulated Moving Bed (SMB) processes are challenging to operate and require accurate cycle-to-cycle adaptive control, generally implemented via a parameter estimator and a controller [52]. Also of interest is the use of Artificial Neural Networks, for instance working simultaneously with an offline measurement system such as a Quasi-Virtual Analyser (Q-VOA), for the separation of a bi-naphthol enantiomer mixture in an SMB process [61]. Other control strategies have also been applied to continuous chromatographic processes, for instance, based on multi-objective optimisation to find optimal open-loop control parameters for the separation of human growth hormone (hGH) from its dimer [62].
While the abovementioned strategies provide efficient control of the systems considered, simultaneous optimisation of the design (e.g. solvent type and composition based on data-driven or hybrid models) and dynamic operation (i.e. control), although computationally demanding, is expected to improve process performance further and ensure optimal operation, supporting the transition towards continuous and semi-continuous bioprocesses [63,64]. In this transition, and when considering the design and/or optimisation of an end-to-end bioprocess, various objectives can be set, such as process stability, product purity, environmental impact and so on. Even if the total number of degrees of freedom is reduced owing to the coupling of the units, the complexity of the problem increases, because all the units should operate at optimal conditions, both individually and as a whole. Designing and controlling entire processes would require detailed models of all the processing units as well as their associated control systems.
Proper control of preparative and process chromatography can only be achieved if the measurements used to determine the control action are reliable. UV-vis or other spectroscopic methods as well as automated analytical HPLC systems are used for most chromatographic processes. However, the accuracy and explanatory power of UV-vis spectroscopy are often limited, whilst the HPLC system provides infrequent (ca. every 3-10 min) measurements of the components of the mixture [65]. In addition, in order to obtain suitable feedback on process performance and product quality, lengthy experimental procedures are required [57], leading to significant delays in process operation. The detection of impurities in the mixture is still usually performed offline, for example, through size exclusion chromatography [65], although a number of online measurement methods have also been developed.
A range of such Process Analytical Technologies (PAT) has been applied to purification processes, such as online pH and conductivity sensors, and mass spectrometry [66][67][68]. Fourier-transform infrared spectroscopy (FTIR) has also been applied to protein chromatography [69]. PAT implementation in biopharmaceutical manufacturing, aiming to improve production efficiency, yield and product purity and to reduce time-consuming offline analyses, has also been considered [70]. Similarly, Partial Least Squares Regression (PLS) modelling of UV-vis absorption spectra has been applied to antibody quantification to allow real-time monitoring in protein A chromatography [71]. However, as those technologies and their associated models are characterised by high complexity, they have not yet been used at commercial scale, and more research is required to pave the way towards their industrial implementation.

Conclusion and perspectives
High performance liquid chromatography is currently the most important separation method in the pharmaceutical and biopharmaceutical industries, and is used extensively at analytical, preparative and process scale. For new products, industry can reduce time-to-market only by combining computational and experimental work, a goal it can achieve only by employing advanced mathematical models. The same models would enable the efficient design of batch or continuous units for preparative and process scale operation, as well as adequate monitoring and optimal control. The increase in computational power will allow the adoption of more complex models and the reduction of the simulation time, and will facilitate faster screening of multidimensional operation spaces for parameter estimation and optimisation. However, the immense complexity of predicting molecule-specific retention times (a complexity similar to that of solubility predictions) will hinder the use of generic first principle models for all HPLC method development. The need for (partly) data-driven or empirical surrogate models will prevail, highlighting the importance of a better understanding of retention mechanisms, benchmark studies and well-structured (open-access) retention time databases that accurately document the chromatographic system used. The widespread use of chromatographic separation, especially for bioprocesses, not only proves the success of previous computational strategies but also shows the potential for improved models. Computational HPLC is extremely important for bioprocessing, where chromatographic separation is a common bottleneck. Understanding physicochemical properties of biomolecules, quantifying their effect on retention behaviour, and identifying mobile and stationary phase characteristics are key challenges and require more work.
In the nearer future, simpler models, and model-based control strategies combining offline or online experimental data, will continue to support continuous chromatography and widen production bottlenecks, not only for bioproduction.

Conflict of interest statement
Nothing to declare.