Data integration in logic-based models of biological mechanisms

Discrete, logic-based models are increasingly used to describe biological mechanisms. Initially introduced to study gene regulation, these models have evolved to cover various molecular mechanisms, such as signalling, transcription factor cooperativity, and even metabolic processes. The abstract nature of discrete models and their amenability to robust mathematical analyses make them appropriate for addressing a wide range of complex biological problems. Recent technological breakthroughs have generated a wealth of high-throughput data, while novel, literature-based representations of biological processes and emerging algorithms offer new opportunities for model construction. Here, we review recent efforts to address challenging biological questions by incorporating omic data into logic-based models, and discuss critical difficulties in constructing and analysing integrative, large-scale, logic-based models of biological mechanisms.


Introduction
Logic-based models have made significant contributions to our understanding of a wide range of biological processes in health and disease. Initially introduced in the 1960s to describe gene regulatory circuits [1-3], logic-based models have evolved substantially over the past five decades to cover various biological processes, such as signalling cascades, ion channels, co-regulation of transcription factors and even metabolism. With the growing body of data generated by technological breakthroughs, new methods are being developed to integrate different biological scales and to expand the size and complexity of discrete models. Additionally, efforts to create formalised, large-scale representations of biological processes as network "maps" open avenues for rapidly repurposing these datasets as scaffolds for qualitative models [4].
Logic-based models use logical operators, such as AND, OR and NOT, to describe the functions that govern the regulation of biological entities. While detailed mechanistic knowledge is not a prerequisite, the type of regulation (positive or negative) between the biological entities and the directionality of these regulations are necessary to construct the regulatory graph [5]. In the logical formalism, genes, proteins and other biomolecules are assigned discrete values that correspond to activity thresholds (binary values for Boolean networks, BNs hereafter; multivalued for logical models), and logical rules define the evolution of the system in the next time step. Time is modelled implicitly through updating schemes that, together with the logical rules, define the emergent behaviour of the system [6,7]. The precise quantitative relationship between model variables and experimental observables is model dependent and needs to be considered during model building.
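To make the formalism concrete, the following minimal sketch (in Python, with three hypothetical genes A, B and C and invented rules) shows how logical operators and a synchronous updating scheme define the evolution of a system:

```python
# Toy Boolean network: three hypothetical genes whose rules combine
# AND, OR and NOT; all nodes are updated simultaneously (synchronously).
rules = {
    "A": lambda s: s["C"],                     # A is activated by C
    "B": lambda s: s["A"] and not s["C"],      # B requires A AND NOT C
    "C": lambda s: s["A"] or s["B"],           # C requires A OR B
}

def synchronous_step(state):
    """Apply every logical rule at once to produce the state at t+1."""
    return {node: rule(state) for node, rule in rules.items()}

state = {"A": True, "B": False, "C": False}
for t in range(4):
    print(t, state)
    state = synchronous_step(state)   # converges to A=C=ON, B=OFF
```

An asynchronous scheme would instead update one node at a time, which can change the reachable behaviours; this choice is exactly what the updating scheme mentioned above encodes.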
In silico simulations of logic-based discrete models give insights into the dynamics of the modelled system and allow in-depth analyses, such as the search for "attractors", the terminal states of the system, which can be steady states or cycles [8]. Simple attractors are fixed points that correspond to the system's stable states. These states can be linked to cellular decision-making processes, such as apoptosis, cell proliferation, migration and chemotaxis. Complex attractors are terminal cycles that can be linked to biological oscillations, such as those arising from the p53-MDM2 interactions [9-11]. The absence of kinetic parameters makes logic-based models suitable for large-scale biological networks where little or no kinetic information is available. Nevertheless, as their size and complexity scale up, their analysis can prove challenging.
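For networks of this size, attractors can be found by brute force. The sketch below (reusing the toy rules dictionary above; real tools use far more efficient algorithms) follows the synchronous trajectory from every possible state until a state repeats, classifying each attractor as a fixed point or a cycle:

```python
from itertools import product

def find_attractors(rules):
    """Enumerate all 2^n states; follow each synchronous trajectory
    until it revisits a state, and record the repeating part (the attractor)."""
    nodes = sorted(rules)
    attractors = set()
    for bits in product([False, True], repeat=len(nodes)):
        state, seen = dict(zip(nodes, bits)), []
        key = tuple(state[n] for n in nodes)
        while key not in seen:
            seen.append(key)
            state = {n: rules[n](state) for n in rules}
            key = tuple(state[n] for n in nodes)
        cycle = seen[seen.index(key):]
        i = cycle.index(min(cycle))          # rotate to a canonical form
        attractors.add(tuple(cycle[i:] + cycle[:i]))
    return attractors

for att in find_attractors(rules):
    kind = "fixed point" if len(att) == 1 else f"cycle of length {len(att)}"
    print(kind, att)
```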
Technological advancements, including high-throughput methods, have generated an overwhelming amount of biological data, creating a pressing need for tools and methodologies that can integrate omic data into modelling pipelines. These new approaches use omic data, in combination with small-scale experiments and prior knowledge, for i) model enrichment, pointing to new interactions and regulators; ii) model contextualisation, adding specificity in terms of data origin and type (species, body fluid, cell type, tissue, single-cell or bulk data, disease state, treatment, healthy condition, etc.); iii) model validation, showing that the model can reproduce known behaviours of the system of interest; and iv) as input to infer network structure and functions (Figure 1).

High-throughput data integration into logic-based models
Efforts to combine high-throughput data with discrete logic-based modelling depend heavily on the model's purpose and on data availability, and include model enrichment, validation and contextualisation. A typical approach consists of using omic data to expand existing models with entities of interest that can be measured and compared across conditions. Early attempts to combine high-throughput data with logic-based models consisted mainly of using the data as a guide for model enrichment, identifying key genes and biomolecules to include in the model. An example of such an approach is the construction of a logic-based model of mast cell activation in the context of allergy, combining high-throughput proteomics and prior knowledge [12]. To build the regulatory graph, besides literature mining, the authors used proteomic data pointing to novel SLP76 interactants identified for the first time in mast cells [13]. A combination of small-scale experiments, such as quantitative PCR, Western blots and EMSA, together with data from genome-wide assays, such as RNA-sequencing and ChIP-sequencing, was used to assemble a comprehensive regulatory network to study the reprogramming of pre-B cells into macrophages [14]. Iterating between model predictions and in vitro validation led to updates of the model with new knowledge and a better understanding of B cell reprogramming mechanisms. Along the same lines, researchers developed a methodology that integrates several omics datasets to identify candidate genes serving as seeds for network modelling; they analysed multi-omics data from the Consensus Molecular Subtypes study of colorectal cancer [15,16] to expand a previously built generic cell-fate decision network [17].
In many studies, omic data are used as a source of biomarker signatures that are compared against stable states to validate phenotypic outcomes. In this case, the regulatory graph of the discrete model is usually built manually through literature curation, text mining and pathway database interrogation. The logical formulae describing specific mechanisms of gene activation are derived from the results of small-scale experiments: the modeller curates the relevant literature and uses the experiments to infer causality and mechanistic details where possible. Different types of omic data are then analysed and compared against the model behaviour for validation. This step requires discretizing the measured data, using statistical thresholds such as p-values or fold changes, to facilitate the comparison with the discrete nature of the logic-based model results. Recent examples include the enrichment of a logical model of macrophage polarisation to describe cancer cell-macrophage interactions and its validation using microarray expression data from in vitro co-culture experiments [18,19]. A similar methodology was employed to build a logical model of cancer cell invasion and migration; alongside model building, the researchers proposed matching transcriptomic data to the attractors and validated the model against cell line experiments [20]. Going one step further and focusing on the role of ion channels in cancer, an executable model of osmotic regulation and membrane transport was proposed that predicts behaviour from expression data [21,22]. In addition to considering large datasets, this model expands the family of biological processes beyond expression and gene activation to include the coordinated activities of biomolecules (in this case ions) that are not under the direct control of single genes.
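The discretization step is conceptually simple; a minimal sketch is given below, assuming a hypothetical differential-expression table with log2 fold-change and adjusted p-value columns, and illustrative cut-offs (|log2FC| >= 1, padj < 0.05) that would in practice be chosen per study:

```python
import pandas as pd

def discretize(df, lfc_col="log2FC", p_col="padj", lfc_cut=1.0, p_cut=0.05):
    """Map each gene to 1 (up), 0 (down) or NaN (unchanged/not significant),
    yielding a signature comparable with the model's discrete stable states."""
    significant = df[p_col] < p_cut
    out = pd.Series(float("nan"), index=df.index)
    out[significant & (df[lfc_col] >= lfc_cut)] = 1
    out[significant & (df[lfc_col] <= -lfc_cut)] = 0
    return out

# Hypothetical results, e.g. from a differential-expression analysis
de = pd.DataFrame({"log2FC": [2.3, -1.8, 0.2], "padj": [0.001, 0.01, 0.60]},
                  index=["GeneA", "GeneB", "GeneC"])
print(discretize(de))   # GeneA -> 1, GeneB -> 0, GeneC -> NaN
```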
A recent commentary highlighted the need for personalised models and the challenges of incorporating high-throughput data into mechanistic dynamic models [23]. An example of such an effort is a framework developed to tailor logical models to a particular biological sample, which integrates mutation data, copy number alterations (CNA) and expression data (transcriptomic or proteomic) into logical models [24]. Using this framework, the researchers proposed a logical model to study the mechanisms of resistance to BRAF inhibition in melanomas and colorectal cancers. The model was built using literature mining and pathway integration and was contextualised for 100 melanoma and colorectal cell lines using available omics data, including mutations and RNA-seq data [25]. Cell-specific logic-based models have also been employed to recapitulate experimentally tested dynamic proteomic changes and phenotypic responses in diverse Acute Myeloid Leukaemia (AML) cell lines treated with a variety of kinase inhibitors [26]. To improve patient stratification, researchers assembled a network of logical relationships linking genes that are frequently mutated in AML patients and contextualised the model with genomic data to infer relevant patient-specific clinical features [27]. In each of these cases, even where the studied cancer was the same, the different models reflect not only the biology and the specific questions being addressed, but also the data used to build the model and the predictions that could be made. This underlines the importance of understanding the role data integration plays in model building.
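The contextualisation idea can be illustrated with a deliberately simplified sketch: mutation calls clamp the corresponding model variables before simulation. The profile, statuses and gene names below are invented, and published pipelines such as [24] additionally use expression and CNA data to tune transition rates and initial conditions:

```python
CLAMP = {"gain_of_function": True, "loss_of_function": False}

def personalise(rules, profile):
    """Return a copy of the logical rules in which mutated nodes are frozen
    to a constant value, regardless of their regulators."""
    personalised = dict(rules)
    for node, status in profile.items():
        if node in personalised and status in CLAMP:
            value = CLAMP[status]
            personalised[node] = (lambda v: (lambda state: v))(value)
    return personalised

# Hypothetical per-sample mutation calls and a toy rule set
sample_profile = {"BRAF": "gain_of_function", "PTEN": "loss_of_function"}
toy_rules = {"BRAF": lambda s: s["EGFR"], "PTEN": lambda s: not s["AKT"],
             "EGFR": lambda s: s["EGFR"], "AKT":  lambda s: not s["PTEN"]}
patient_rules = personalise(toy_rules, sample_profile)
```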

Data-driven discrete model inference
Whilst high-throughput datasets offer new ways to build and analyse models following bottom-up approaches, reverse-engineering methods can also be applied to infer models from experimental data. Different algorithms have been developed to reconstruct logic-based models, and specifically BNs, from high-throughput data. They fall into two broad categories: combinatorial optimisation methods, which include integer programming and answer set programming (ASP) and allow a full exploration of the search space to identify the model that best explains the experimental data, and heuristic approaches. The first category scales poorly due to combinatorial explosion, while the second tends to focus on specific conditions and stable states to ease the computational burden. In broad terms, automated inference of Boolean networks and functions from data can be a daunting task, owing to the uncertainty of the data itself and to the large number of unknowns regarding structure and functions that need to be estimated. Moreover, identifying the most suitable data type and available datasets for model validation adds to the task, as these need to be independent of the data used for inference. It should also be noted that the experimental ability to resolve biologically important expression or concentration differences will affect the results: datasets that are prone to noise, or that concern lowly expressed genes, may introduce bias by excluding important pathways.
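The combinatorial flavour of the first category can be conveyed by a toy example: with two candidate regulators, every possible Boolean function is a four-row truth table, and all sixteen of them can be scored against discretized observations (invented here). ASP- and integer-programming-based methods perform this search symbolically, under prior-knowledge constraints, at far larger scales:

```python
from itertools import product

# Observed (regulator state, next value of target) pairs, e.g. taken
# from a discretized time series; the two regulators are hypothetical.
observations = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

scored = []
# A Boolean function of 2 inputs is a truth table with 4 output bits.
for table in product([0, 1], repeat=4):
    f = dict(zip(product([0, 1], repeat=2), table))
    errors = sum(f[x] != y for x, y in observations)
    scored.append((errors, table))

errors, table = min(scored)
print(f"best truth table {table} with {errors} mismatches")  # OR-like function
```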
Recently, the caspo time series (caspo-ts) method [28,29], which learns BNs from phosphoproteomic time-series data given a Prior Knowledge Network (PKN), was applied to data from four breast cancer cell lines (BT20, BT549, MCF7, UACC812) [28]. Based on ASP and model checking, the method could handle a large PKN with 64 nodes and 170 edges [30]. Another popular software package for building logic-based models of signalling networks from prior knowledge and phosphoproteomic data is CellNOptR, which supports multiple formalisms, from BNs to differential equations, in a common framework [31,32]. GABNI (Genetic Algorithm-based Boolean Network Inference) searches for optimal Boolean regulatory functions by first exploiting mutual information-based Boolean network inference (MIBNI); if this step fails to find an optimal solution, a genetic algorithm (GA) is applied to search a broader solution space for an optimal set of regulatory genes [33]. BONITA (Boolean Omics Network Invariant-Time Analysis) is an algorithm for signal propagation, signal integration and pathway analysis capable of modelling heterogeneity in transcriptomic data; the logical rules of the model are inferred by a genetic algorithm and refined by local search. Applying BONITA pathway analysis to previously validated RNA-sequencing studies identified additional relevant pathways in in vitro human cell line experiments and in vivo infant studies [34]. Single-cell expression data have also been used to infer the underlying model of blood development from the mesoderm: the expression of 40 genes, measured by qRT-PCR in 3,934 cells, was discretized and used to infer a BN comprising 20 transcription factors, giving insight into the independent roles of Hox and Sox in Erg activation [35]. Lastly, BTR, an algorithm for training asynchronous BNs with single-cell expression data using a novel Boolean state-space scoring function, was recently proposed; BTR refines existing BNs and infers new ones by improving the match between model predictions and expression data [36].

Scalability in inference and analysis of logic-based models
Understanding complex biological processes, such as immunometabolism, the tumour microenvironment, chronic or acute inflammation, or autoimmunity, requires models that comprise not just a handful of nodes but can be scaled to incorporate hundreds of nodes and reactions. Advancements in the field reflect the tendency to scale up in size and complexity so as to create models with more realistic behaviour. Recently, the tool CaSQ bridged the gap between static and dynamic representations of disease mechanisms by inferring large-scale BNs from molecular interaction maps [37]. The automated inference of large-scale BNs creates new challenges for the analysis of these models, pushing the limits of existing tools and methodologies. Commonly used software such as GINsim [38] can handle Boolean and multivalued logic-based models; however, the attractor search can become challenging when scaling up, relying on model reduction techniques to deal with large systems.
Several platforms offer different approaches to dealing with large complex systems, each focused on different problem areas. Cell Collective [39] efficiently handles large-scale BNs for simulations but does not offer attractor search. In contrast, BoolNet, an R/Bioconductor package, offers a collection of options for the analysis of BNs, including a set of heuristics for attractor search when the size and complexity of the model are considerably large [40]. These heuristics focus on retrieving stable states instead of searching the whole state space, significantly reducing the computational burden, though the results are limited to stable states. BMA [41,42] focuses on the analysis of stable states and, more particularly, fixed points, offering several highly scalable algorithms for model analysis, including stability proofs, cycle searching and linear temporal logic [43-45]. The specialisation of tools emphasises the importance of commonly agreed standards for model storage.
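The idea behind such heuristics can be sketched in a few lines (a generic illustration, not the API of any of the tools above): sample random initial states, follow their trajectories, and keep only the fixed points reached, accepting that cycles and unexplored basins may be missed:

```python
import random

def sample_stable_states(rules, n_starts=1000, max_steps=200, seed=0):
    """Heuristic stable-state search: follow synchronous trajectories from
    random initial states and record the fixed points reached. Cycles and
    unvisited basins are missed, the price of avoiding the full state space."""
    rng = random.Random(seed)
    nodes = sorted(rules)
    fixed_points = set()
    for _ in range(n_starts):
        state = {n: rng.random() < 0.5 for n in nodes}
        for _ in range(max_steps):
            nxt = {n: rules[n](state) for n in rules}
            if nxt == state:                       # reached a fixed point
                fixed_points.add(tuple(state[n] for n in nodes))
                break
            state = nxt
    return fixed_points
```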
In parallel, progress has been made in developing hybrid and multi-scale integrative modelling frameworks that connect different formalisms and generate new insights from the emergent, combined properties. FlexFlux, an open-source Java software, combines metabolic and regulatory networks based on the identification of steady states, which are then used as constraints for metabolic flux analyses using Flux Balance Analysis (FBA) [46]. A multi-scale framework coupling the cell cycle and metabolic networks in yeast was proposed, integrating a BN of a minimal yeast cell cycle with a constraint-based model of metabolism. The models are implemented in Python using the BooleanNet and COBRApy packages and are connected using Boolean logic. The methodology allows the incorporation of interaction data and validation against omics data [47].
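A heavily simplified sketch of such a coupling is shown below, using standard COBRApy calls; the SBML file name, gene names and gene-to-reaction mapping are placeholders, and the cited frameworks implement considerably richer iteration schemes between the regulatory and metabolic layers:

```python
import cobra

# Hypothetical inputs: an SBML metabolic model and a mapping from
# regulatory-network genes to the reactions they enable.
model = cobra.io.read_sbml_model("metabolic_model.xml")   # path is illustrative
gene_to_reactions = {"GeneX": ["RXN1"], "GeneY": ["RXN2", "RXN3"]}

def constrain_by_regulation(model, gene_states):
    """Close reactions whose controlling gene is OFF in the Boolean layer,
    then run FBA on the constrained model."""
    with model:                                   # changes revert on exit
        for gene, reactions in gene_to_reactions.items():
            if not gene_states.get(gene, True):
                for rid in reactions:
                    model.reactions.get_by_id(rid).bounds = (0.0, 0.0)
        return model.optimize().objective_value

growth = constrain_by_regulation(model, {"GeneX": False, "GeneY": True})
```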

Community efforts for the reproducibility of discrete models in biology
Recent studies have raised concerns about reproducibility in various scientific fields. In computational systems biology, efforts have been made to characterise the problem and propose strategies to tackle it [48]. The Curation and Annotation of Logical Models (CALM) initiative emerged to promote the reproducibility, interoperability, accessibility and reusability of discrete biological models [49]. The initiative promotes reproducibility by linking model components to the underlying experimental papers using proper identifiers, such as BioModels.net qualifiers, and interoperability by promoting the use of the SBML-qual format, an extension of the SBML Level 3 standard compatible with the representation of qualitative models of biological networks [50]. Furthermore, the CoLoMoTo Interactive Notebook, developed by the community, relies on Docker and Jupyter technologies to provide a unified and user-friendly environment to edit, execute, share and reproduce analyses of qualitative models of biological networks, streamlining tools that do not necessarily use standard formats and circumventing compatibility issues [51].
In Table 1 we list the tools mentioned in the previous sections, with a brief description of their features, their environment and their support for annotations.

New methods for formal analysis of large-scale logic-based models
In this section we highlight recent developments in formal analysis. The methodologies presented here address problems inherent to larger and more complex models.
One issue that arises as networks become larger is the role of timing in the control of cellular function. Whilst timing effects can be accounted for in small models using synchronous or asynchronous update schemes, this approach may not scale as more genes are introduced; ignoring potential timing effects, however, may obscure important model properties. The Most Permissive Boolean Networks (MPBNs) approach is a promising formal method that addresses the fact that both the synchronous and asynchronous dynamical interpretations of BNs can miss behaviours observed in comparable quantitative systems. MPBNs formally guarantee not to miss any behaviour achievable by a quantitative model following the same logic. Moreover, MPBNs significantly reduce the complexity of dynamical analysis, allowing the modelling of genome-scale networks. One limitation of the approach is the generation of over-approximated dynamical representations, in which only small subsets of the corresponding trajectories are effectively observed [52].
The control of BNs offers the possibility to delineate interconnected pathways and to specify the conditions that determine a functional outcome, providing a way to focus on a small subset of nodes with important properties for the whole network. In recent work, researchers compute a minimal subset of nodes (Cmin) whose single-step perturbation drives a BN from any initial state in an attractor to an attractor of interest. In their method, they decompose the network into modules, compute the minimal control on the projection of the attractors onto these modules, and then compose the results to obtain the global Cmin [53].
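For intuition, the control question can be posed by brute force on a small network. The naive sketch below enumerates node subsets by increasing size and tests whether a one-step flip of the source state reaches the target; the cited method avoids exactly this exponential enumeration through modular decomposition:

```python
from itertools import combinations

def reaches(rules, state, target, max_steps=100):
    """Follow the synchronous trajectory from `state` (a dict) and test
    whether it hits `target` (a tuple over the sorted node names)."""
    nodes = sorted(rules)
    for _ in range(max_steps):
        if tuple(state[n] for n in nodes) == target:
            return True
        state = {n: rules[n](state) for n in rules}
    return False

def minimal_control(rules, source, target):
    """Smallest set of nodes whose one-step flip sends `source` onto a
    trajectory reaching `target` (brute force; exponential in network size)."""
    nodes = sorted(rules)
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            perturbed = dict(source)
            for n in subset:
                perturbed[n] = not perturbed[n]
            if reaches(rules, perturbed, target):
                return set(subset)
    return None
```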
Finally, as models become larger, the state space expands and the potential increases for rare transitions that undermine conclusions drawn from the model. Model verification, derived from the broader field of verification in software and hardware, offers a new way to tackle this complexity. Here, mathematical proofs are used instead of simulation to analyse model behaviour. These proofs can offer guarantees of model correctness that hold over the whole state space, for example, that one gene is always activated transiently, or that another gene never becomes active. Examples include the computation of attractors [54] and proofs of stability [43], where proofs of properties of the whole model are composed from proofs computed on individual components.
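A toy version of such a proof, checking a safety property by a one-step inductive argument rather than by simulating whole trajectories, is shown below; production verification tools establish such invariants symbolically, without enumerating states one by one:

```python
from itertools import product

def never_activated(rules, node):
    """One-step inductive check of a safety property: from every state in
    which `node` is OFF, the synchronous successor keeps it OFF. If this
    holds, `node` can never switch on along any trajectory starting OFF."""
    nodes = sorted(rules)
    for bits in product([False, True], repeat=len(nodes)):
        state = dict(zip(nodes, bits))
        if not state[node] and rules[node](state):
            return False      # counterexample: node switches on
    return True
```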

Conclusion
The growing availability of high-quality, whole-cell biological data has underlined the need to develop rigorous integrative methods that connect observations to fundamental mechanisms of action. Data-driven model inference combined with high-quality biocuration could lead to the construction of more accurate and robust models. At the same time, the rapid adoption of increasingly large logic-based models stress-tests the existing methods and tools used for dynamic analysis.
The key challenges of the field lie in developing efficient formalisms for data integration, together with tool implementations that can properly combine and integrate data into models, but also analyse and understand these models at a larger scale. While model inference methodologies can greatly accelerate model building and training, the parallel development of formal methods for analysis, control and verification is needed to cope with the size and complexity of such models. The coupling of logic-based models with other modelling formalisms offers possibilities to address more complex questions spanning different scales, such as signalling and metabolism. Lastly, the use of common annotation schemes and standard formats could help maximise transparency and model reusability and reproducibility.
As multi-omic data become increasingly available for a variety of biological functions in health and disease, logic-based models can be employed as versatile, powerful tools to deepen our understanding of complex biological mechanisms.