Quantitative Biology
Showing new listings for Friday, 22 November 2024
- [1] arXiv:2411.13556 [pdf, other]
Title: gggenomes: effective and versatile visualizations for comparative genomics
Authors: Thomas Hackl (1), Markus Ankenbrand (2), Bart van Adrichem (1), David Wilkins (3), Kristina Haslinger (4) ((1) Groningen Institute for Evolutionary Life Sciences, University of Groningen, The Netherlands, (2) Center for Computational and Theoretical Biology, Julius-Maximilians-Universität Würzburg, Germany, (3) Discipline of General Practice, Adelaide Medical School, The University of Adelaide, Australia, (4) Department of Chemical and Pharmaceutical Biology, Groningen Research Institute of Pharmacy, University of Groningen, The Netherlands)
Comments: TH and MA contributed equally to this work. Corresponding author: Thomas Hackl
Subjects: Genomics (q-bio.GN)
The effective visualization of genomic data is crucial for exploring and interpreting complex relationships within and across genes and genomes. Despite advances in developing dedicated bioinformatics software, common visualization tools often fail to efficiently integrate the diverse datasets produced in comparative genomics, lack intuitive interfaces to construct complex plots and are missing functionalities to inspect the underlying data iteratively and at scale. Here, we introduce gggenomes, a versatile R package designed to overcome these challenges by extending the widely used ggplot2 framework for comparative genomics. gggenomes is available from CRAN and GitHub, accompanied by detailed and user-friendly documentation (this https URL).
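gggenomes itself is an R/ggplot2 package; as a language-neutral illustration of the layered "tracks plus links" idea it implements, here is a minimal Python/matplotlib sketch with hypothetical coordinates (this is not the gggenomes API):

```python
# Conceptual sketch of a two-track comparative-genomics plot: gene arrows
# on sequence lines plus a shaded synteny ribbon. Coordinates are made up;
# gggenomes itself is an R/ggplot2 package with a richer grammar.
import matplotlib.pyplot as plt
from matplotlib.patches import FancyArrow, Polygon

tracks = {
    "genome_A": [(100, 900, +1), (1200, 1800, -1)],   # (start, end, strand)
    "genome_B": [(300, 1100, +1), (1500, 2100, +1)],
}
links = [((100, 900), (300, 1100))]                   # syntenic regions A <-> B

fig, ax = plt.subplots(figsize=(8, 2.5))
ys = {"genome_A": 1.0, "genome_B": 0.0}
for name, genes in tracks.items():
    y = ys[name]
    ax.hlines(y, 0, 2300, color="grey", lw=1)         # the sequence line
    for start, end, strand in genes:
        x0, dx = (start, end - start) if strand > 0 else (end, start - end)
        ax.add_patch(FancyArrow(x0, y, dx, 0, width=0.12,
                                length_includes_head=True, color="steelblue"))
for (a0, a1), (b0, b1) in links:                      # shaded synteny ribbon
    ax.add_patch(Polygon([(a0, 0.93), (a1, 0.93), (b1, 0.07), (b0, 0.07)],
                         alpha=0.3, color="orange"))
ax.set_yticks(list(ys.values()), list(ys.keys()))
ax.set_xlabel("position (bp)")
plt.tight_layout()
plt.show()
```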
- [2] arXiv:2411.13630 [pdf, html, other]
Title: Competitive binding of Activator-Repressor in Stochastic Gene Expression
Comments: 34 pages, 47 figures
Subjects: Molecular Networks (q-bio.MN); Biological Physics (physics.bio-ph)
Regulation of gene expression is the consequence of interactions between the promoter of the gene and transcription factors (TFs). In this paper, we explore the features of a genetic network in which the TFs (activators and repressors) bind the promoter competitively. We develop an analytical theory that provides the detailed reaction kinetics of the competitive activator-repressor system, which could serve as a powerful tool for the study and analysis of such genetic circuits in future research. Moreover, the theoretical approach helps us identify a most probable set of parameter values that has been unavailable from experiments. We study the noisy behaviour of the circuit and compare its profile with that of a network where the activator and repressor bind the promoter non-competitively. We further notice that, due to the effect of transcriptional reinitiation in the presence of activator and repressor molecules, some anomalous characteristic features appear in the mean expression and noise profiles. We find that, in the presence of reinitiation, the noise at the transcriptional level remains low, while the noise at the translational level is higher than when reinitiation is absent. In addition, with the help of some noise-reducing parameters, the noise can be pushed further below the Poissonian level in the competitive circuit than in the non-competitive one.
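To make the competitive-binding setup concrete, here is a minimal Gillespie (stochastic simulation algorithm) sketch of a promoter that binds either one activator or one repressor at a time, with transcription only from the activator-bound state. All rate constants are illustrative placeholders, not the paper's fitted values, and reinitiation is omitted:

```python
# Minimal Gillespie simulation of competitive activator/repressor binding.
# Promoter states: free (F), activator-bound (A), repressor-bound (R);
# mRNA is made only from state A. All rates are illustrative placeholders.
import numpy as np

rng = np.random.default_rng(0)
ka_on, ka_off = 1.0, 0.5   # activator binding / unbinding
kr_on, kr_off = 1.0, 0.5   # repressor binding / unbinding
k_tx, k_deg = 5.0, 1.0     # transcription (state A only), mRNA decay

def simulate(t_end=5000.0):
    state, m, t, samples = "F", 0, 0.0, []
    while t < t_end:
        if state == "F":
            events = [("bindA", ka_on), ("bindR", kr_on), ("deg", k_deg * m)]
        elif state == "A":
            events = [("unbind", ka_off), ("tx", k_tx), ("deg", k_deg * m)]
        else:  # state == "R"
            events = [("unbind", kr_off), ("deg", k_deg * m)]
        total = sum(r for _, r in events)
        t += rng.exponential(1.0 / total)        # time to next reaction
        u, acc = rng.uniform(0.0, total), 0.0
        for name, r in events:                   # pick the reaction
            acc += r
            if u <= acc:
                break
        if name == "tx":
            m += 1
        elif name == "deg":
            m -= 1
        elif name == "unbind":
            state = "F"
        else:
            state = name[-1]                     # "bindA" -> "A", "bindR" -> "R"
        samples.append(m)                        # event-sampled (time-weighting omitted)
    return np.array(samples)

m = simulate()
print(f"mean mRNA ~ {m.mean():.2f}, Fano factor ~ {m.var() / m.mean():.2f}")
```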
- [3] arXiv:2411.13680 [pdf, html, other]
Title: Long-term predictive models for mosquito borne diseases: a narrative review
Subjects: Quantitative Methods (q-bio.QM); Dynamical Systems (math.DS); Biological Physics (physics.bio-ph)
In the face of climate change and increasing urbanization, predictive models of mosquito-borne disease (MBD) transmission require constant updates. It is therefore urgent to understand the driving forces behind this non-stationary behavior, observed through spatial and incidence expansion. We observed that temperature is a critical driver in predictive models of MBD transmission; it is used consistently across the reviewed papers and has considerable predictive capacity for incidence. Rainfall, however, plays a more subtle role: moderate precipitation creates breeding sites for mosquitoes, but excessive rainfall can reduce larval populations. We highlight the frequent use of mechanistic models, particularly those that integrate temperature-dependent biological parameters of disease transmission into incidence proxies such as the vectorial capacity (VC) and the temperature-based basic reproduction number $R_0(t)$. These models establish the importance of climate variables, but socio-demographic factors are often not considered. This gap is a significant opportunity for future research: incorporating socio-demographic data into long-term predictive models would enable more comprehensive and reliable forecasts. With this survey, we outline the most promising paths for long-term MBD transmission research and highlight the principal challenges ahead. We thereby offer a valuable foundation for enhancing disease forecasting models and supporting more effective public health interventions, especially in the long term.
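For reference, the vectorial capacity mentioned above has a classical closed form (Garrett-Jones), $VC = m a^2 b p^n / (-\ln p)$. The sketch below evaluates it with a hypothetical, Brière-style temperature response for the biting rate; none of the numbers come from the reviewed papers:

```python
# Classical (Garrett-Jones) vectorial capacity:
#   VC = m * a^2 * b * p^n / (-ln p)
# m: mosquitoes per human, a: daily biting rate, b: transmission probability
# per bite, p: daily survival probability, n: extrinsic incubation (days).
# The temperature response below is a hypothetical Briere-style curve.
import numpy as np

def vectorial_capacity(m, a, b, p, n):
    return m * a**2 * b * p**n / (-np.log(p))

def biting_rate(T):
    # Unimodal between roughly 13 and 40 C; coefficients are illustrative.
    return max(0.0, 2.02e-4 * T * (T - 13.4) * np.sqrt(max(0.0, 40.0 - T)))

for T in (18, 24, 30, 36):
    vc = vectorial_capacity(m=2.0, a=biting_rate(T), b=0.5, p=0.9, n=10)
    print(f"T={T} C: a={biting_rate(T):.3f}/day, VC={vc:.4f}")
```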
- [4] arXiv:2411.13796 [pdf, html, other]
Title: Re-examining aggregation in the Tallis-Leyton model of parasite acquisition
Subjects: Populations and Evolution (q-bio.PE)
The Tallis-Leyton model is a simple model of parasite acquisition in which there is no interaction between the host and the acquired parasites. We examine the effect of model parameters on the distribution of the host's parasite burden in the sense of the Lorenz order. This fits an alternative view of parasite aggregation that has become widely used in empirical studies but is rarely applied in the analysis of mathematical models of parasite acquisition.
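The Lorenz order compares burden distributions by their Lorenz curves. A minimal sketch, using a negative-binomial burden as a conventional stand-in for an aggregated distribution (parameters illustrative, not the Tallis-Leyton model itself):

```python
# Lorenz curve and Gini coefficient for a simulated host parasite burden.
# A negative binomial is a standard stand-in for aggregated burdens;
# its parameters here are illustrative.
import numpy as np

rng = np.random.default_rng(1)
burden = rng.negative_binomial(n=0.5, p=0.1, size=10_000)  # heavily aggregated

def lorenz(x):
    x = np.sort(x)
    cum = np.cumsum(x) / x.sum()
    return np.insert(cum, 0, 0.0)       # Lorenz ordinates at i/n

def gini(x):
    L = lorenz(x)
    dx = 1.0 / (len(L) - 1)
    area = ((L[:-1] + L[1:]) / 2.0).sum() * dx  # trapezoidal rule
    return 1.0 - 2.0 * area             # Gini = 1 - 2 * area under Lorenz

print(f"Gini = {gini(burden):.3f}")     # closer to 1 => more aggregated
```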
- [5] arXiv:2411.14079 [pdf, other]
Title: Interpretable QSPR Modeling using Recursive Feature Machines and Multi-scale Fingerprints
Subjects: Biomolecules (q-bio.BM)
This study pioneers the application of Recursive Feature Machines (RFM) in QSPR modeling, introducing a tailored feature-importance analysis approach to enhance interpretability. By leveraging deep feature learning through the average gradient outer product (AGOP), RFM achieves state-of-the-art (SOTA) results in predicting molecular properties, as demonstrated through solubility prediction across nine benchmark datasets. To capture a wide array of structural information, we employ diverse molecular representations, including MACCS keys, Morgan fingerprints, and a custom multi-scale hybrid fingerprint (HF) derived from global descriptors and SMILES local-fragmentation techniques. Notably, the HF offers significant advantages over MACCS and Morgan fingerprints in revealing the structural determinants of molecular properties. The feature-importance analysis in RFM provides robust local and global explanations, effectively identifying the structural features that drive molecular behavior and offering valuable insights for drug development. Additionally, RFM demonstrates strong redundancy-filtering abilities: model performance remains stable even after redundant features within the custom fingerprints are removed. Importantly, RFM brings the deep feature learning of the AGOP matrix into ultra-fast kernel machine learning, imbuing kernel machines with interpretable deep-feature-learning capabilities. We extend this approach beyond the Laplace kernel to the Matérn, rational quadratic, and Gaussian kernels, and find that the Matérn and Laplace kernels deliver the best performance, reinforcing the flexibility and effectiveness of AGOP in RFM. Experimental results show that RFM-HF surpasses both traditional machine learning models and advanced graph neural networks.
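For readers unfamiliar with RFMs, the core loop alternates a kernel ridge fit with an AGOP update of a Mahalanobis-type metric. A simplified single-kernel sketch (Laplace-type kernel; the normalization and hyperparameters are guesses for illustration, not the authors' implementation):

```python
# Simplified one-kernel sketch of a Recursive Feature Machine (RFM):
# alternate (i) kernel ridge regression with a Mahalanobis-Laplace kernel
# and (ii) an AGOP update of the metric M. Normalization and hyperparameters
# are illustrative guesses, not the authors' implementation.
import numpy as np

def kernel(X, Z, M, L=1.0):
    d = X[:, None, :] - Z[None, :, :]                      # pairwise diffs
    sq = np.einsum("ijk,kl,ijl->ij", d, M, d)              # (x-z)^T M (x-z)
    dist = np.sqrt(np.maximum(sq, 1e-12))
    return np.exp(-dist / L), d, dist

def rfm_step(X, y, M, lam=1e-3, L=1.0):
    Kxx, d, dist = kernel(X, X, M, L)
    alpha = np.linalg.solve(Kxx + lam * np.eye(len(X)), y)  # ridge fit
    # grad f(x_i) = -(1/L) sum_j alpha_j K_ij M (x_i - x_j) / dist_ij
    w = alpha[None, :] * Kxx / np.maximum(dist, 1e-12)
    grads = -(1.0 / L) * np.einsum("ij,ijk,kl->il", w, d, M)
    G = grads.T @ grads / len(X)                            # AGOP matrix
    return G * len(G) / np.trace(G)                         # rescaled metric

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)  # only feature 0 matters
M = np.eye(10)
for _ in range(3):
    M = rfm_step(X, y, M)
print(np.round(np.diag(M), 2))  # weight should concentrate on feature 0
```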
- [6] arXiv:2411.14107 [pdf, html, other]
Title: Inward rectifier potassium channels interact with calcium channels to promote robust and physiological bistability
Subjects: Neurons and Cognition (q-bio.NC)
In the dorsal horn, projection neurons play a significant role in pain processing by transmitting sensory stimuli to supraspinal centers during nociception. Following exposure to intense noxious stimuli, a sensitization process occurs to adapt the functional state of the dorsal horn. Notably, projection neurons can display a switch in firing pattern from tonic firing to plateau potentials with sustained afterdischarges. For afterdischarges to manifest after this switch, the neuron must develop bistability: the ability to show resting or spiking at the same input current depending on the context. In numerous instances, neuronal bistability arises through voltage-gated calcium channels. However, computational studies have demonstrated a trade-off between bistability and the plausibility of its resting states when calcium channels are counterbalanced with voltage-gated potassium channels. Current knowledge leaves a gap in understanding the mechanisms by which robust bistability, plateau potentials, and sustained afterdischarges emerge in neurons via calcium channels. In this study, we used a conductance-based model to explore the mechanisms by which L-type calcium (CaL) channels can achieve bistability when combined with either M-type potassium (KM) channels or inward-rectifying potassium (Kir) channels. Unlike KM channels, Kir channels enhance bistability. Combined with CaL channels, KM and Kir channels promote different types of bistability, with distinct robustness and function. An analysis of their inward/outward properties revealed that their distinct steady-state currents explain this contrast. Altogether, the complementarity of CaL and Kir channels creates a reliable candidate pathway for central sensitization in the dorsal horn.
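A quick way to see CaL/Kir bistability is to look for multiple zeros of the steady-state membrane current. The sketch below uses generic Boltzmann activation curves and illustrative conductances, not the paper's calibrated conductance-based model:

```python
# Steady-state current balance for a reduced membrane model with leak,
# L-type calcium (CaL), and inward-rectifier potassium (Kir) currents.
# Bistability shows up as multiple zero-crossings of I_ss(V).
# All parameters are illustrative, not the paper's calibrated values.
import numpy as np

E_L, E_Ca, E_K = -60.0, 120.0, -90.0          # reversal potentials (mV)
g_L, g_CaL, g_Kir = 0.2, 0.15, 0.3            # conductances (mS/cm^2)

def boltz(V, Vh, k):
    return 1.0 / (1.0 + np.exp(-(V - Vh) / k))

def I_ss(V, I_app=0.0):
    m_ca = boltz(V, -30.0, 6.0)               # CaL activation (depolarizing)
    h_kir = boltz(V, -80.0, -12.0)            # Kir opens on hyperpolarization
    return (g_L * (V - E_L) + g_CaL * m_ca * (V - E_Ca)
            + g_Kir * h_kir * (V - E_K) - I_app)

V = np.linspace(-100.0, 40.0, 5000)
I = I_ss(V)
roots = V[:-1][np.sign(I[:-1]) != np.sign(I[1:])]   # sign changes = fixed points
# Three roots => two stable states (rest and plateau) plus one unstable.
print("steady states (mV):", np.round(roots, 1))
```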
- [7] arXiv:2411.14130 [pdf, other]
Title: Evidence of epigenetic oncogenesis: a turning point in cancer research
Subjects: Genomics (q-bio.GN)
In cancer research, the term epigenetics was used in the 1970s in its modern sense, encompassing non-genetic events that modify the chromatin state, largely in opposition to the emerging oncogene paradigm. However, after that prominent concept became established, the importance of these epigenetic phenomena in cancer rarely led to questioning the causal role of genetic alterations. Only in the last 10 years have accumulating problematic data, better experimental technologies, and some ambitious models pushed the idea that epigenetics could be at least as important as genetics in early oncogenesis. Until this year, a direct demonstration of epigenetic oncogenesis was still lacking. Now Parreno, Cavalli and colleagues, using a refined experimental model in the fruit fly Drosophila melanogaster, have triggered the initiation of tumours solely by imposing a transient loss of Polycomb repression, a purely epigenetic oncogenesis phenomenon. Despite a few caveats that we discuss, this pioneering work represents a major turning point in cancer research. It leads us to consider the theoretical and conceptual implications for oncogenesis, to search for links between this artificial experimental model and naturally occurring processes, and to revisit cancer theories previously proposed as alternatives to the oncogene-centered paradigm.
- [8] arXiv:2411.14157 [pdf, html, other]
Title: DrugGen: Advancing Drug Discovery with Large Language Models and Reinforcement Learning Feedback
Authors: Mahsa Sheikholeslami, Navid Mazrouei, Yousof Gheisari, Afshin Fasihi, Matin Irajpour, Ali Motahharynia
Comments: 20 pages, 5 figures, 3 tables, and 7 supplementary files. To use the model, see this https URL
Subjects: Quantitative Methods (q-bio.QM); Artificial Intelligence (cs.AI)
Traditional drug design faces significant challenges due to inherent chemical and biological complexities, often resulting in high failure rates in clinical trials. Deep learning advancements, particularly generative models, offer potential solutions to these challenges. One promising algorithm is DrugGPT, a transformer-based model that generates small molecules for input protein sequences. Although promising, it generates both chemically valid and invalid structures and does not incorporate the features of approved drugs, resulting in time-consuming and inefficient drug discovery. To address these issues, we introduce DrugGen, an enhanced model based on the DrugGPT structure. DrugGen is fine-tuned on approved drug-target interactions and optimized with proximal policy optimization. By incorporating reward feedback from protein-ligand binding affinity prediction using pre-trained transformers (PLAPT) and a customized invalid-structure assessor, DrugGen significantly improves performance. Evaluation across multiple targets demonstrated that DrugGen achieves 100% valid structure generation, compared to 95.5% with DrugGPT, and produces molecules with higher predicted binding affinities (7.22 [6.30-8.07]) than DrugGPT (5.81 [4.97-6.63]) while maintaining diversity and novelty. Docking simulations further validate its ability to generate molecules that effectively target binding sites. For example, in the case of fatty acid-binding protein 5 (FABP5), DrugGen generated molecules with superior docking scores (FABP5/11, -9.537 and FABP5/5, -8.399) compared to the reference molecule (palmitic acid, -6.177). Beyond lead compound generation, DrugGen also shows potential for drug repositioning and creating novel pharmacophores for existing targets. By producing high-quality small molecules, DrugGen provides a high-performance medium for advancing pharmaceutical research and drug discovery.
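The reward described above combines an affinity predictor with an invalid-structure assessor. A hedged sketch of such a composite reward, with RDKit handling validity and a placeholder standing in for PLAPT (whose real interface is not reproduced here):

```python
# Sketch of a composite RL reward in the spirit described: an affinity
# predictor score plus an invalid-structure penalty. The affinity model
# below is a placeholder; PLAPT's real interface is not reproduced.
from rdkit import Chem

def predicted_affinity(smiles: str, protein_seq: str) -> float:
    """Placeholder for a pretrained affinity model such as PLAPT."""
    return 6.0  # hypothetical pKd-like score

def reward(smiles: str, protein_seq: str,
           invalid_penalty: float = -1.0) -> float:
    mol = Chem.MolFromSmiles(smiles)       # None => chemically invalid
    if mol is None:
        return invalid_penalty             # assessor: punish invalid SMILES
    return predicted_affinity(Chem.MolToSmiles(mol), protein_seq)

# Valid ethanol vs. a malformed SMILES (RDKit logs a parse error for it).
print(reward("CCO", "MKT..."), reward("C(C(", "MKT..."))
```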
- [9] arXiv:2411.14291 [pdf, other]
Title: Adaptive flexibility of cytoskeletal structures through nonequilibrium entropy production
Subjects: Cell Behavior (q-bio.CB); Subcellular Processes (q-bio.SC)
Cellular adaptation to environmental changes relies on the dynamic remodeling of cytoskeletal structures. Sarcomeres, periodic units composed mainly of actin and myosin II filaments, are fundamental to the function of the cytoskeletal architecture. In muscle cells, sarcomeres maintain consistent lengths optimized for stable force generation, while in nonmuscle cells, they display greater structural variability and undergo adaptive remodeling in response to environmental cues. However, the relevance of this structural variability to cellular adaptability remains unclear. Here, we present a nonequilibrium physics framework to investigate the role of sarcomere variability in cytoskeletal adaptation. By deriving the probability distribution of sarcomere lengths and analyzing binding energies during cytoskeletal elongation, we show that structural variability, rather than hindering function, facilitates adaptive responses to environmental conditions. We reveal that entropy production arising from this variability drives dynamic remodeling, allowing the cytoskeletal architecture to respond effectively to external cues. This framework bridges sarcomere variability and cytoskeletal adaptability required for diverse cellular functions, providing new insights into how variability supports cellular adaptation through nonequilibrium processes.
- [10] arXiv:2411.14293 [pdf, other]
Title: Machine learning framework to predict the performance of lipid nanoparticles for nucleic acid delivery
Comments: This is a preprint of a manuscript under review
Subjects: Biomolecules (q-bio.BM)
Lipid nanoparticles (LNPs) are highly effective carriers for gene therapies, including mRNA and siRNA delivery, due to their ability to transport nucleic acids across biological membranes, low cytotoxicity, improved pharmacokinetics, and scalability. A typical approach to formulating LNPs is to establish a quantitative structure-activity relationship (QSAR) between their compositions and in vitro/in vivo activities, which allows the activity of new formulations to be predicted from molecular structure. However, developing QSARs for LNPs can be challenging due to the complexity of multi-component formulations, interactions with biological membranes, and stability in physiological environments. To address these challenges, we developed a machine learning framework to predict the activity and cell viability of LNPs for nucleic acid delivery. We curated data from 6,398 LNP formulations reported in the literature, applied nine featurization techniques to extract chemical information, and trained five machine learning models for binary and multiclass classification. Our binary models achieved over 90% accuracy, while the multiclass models reached over 95% accuracy. Our results demonstrated that molecular descriptors, particularly when used with random forest and gradient boosting models, provided the most accurate predictions. Our findings also emphasized the need for large training datasets and comprehensive LNP composition details, such as constituent structures, molar ratios, nucleic acid types, and dosages, to enhance predictive performance.
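As a minimal stand-in for the pipeline described, the sketch below featurizes the ionizable lipid with RDKit descriptors, appends formulation variables, and cross-validates a random forest. The CSV schema and column names are hypothetical, not the paper's curated dataset:

```python
# Minimal version of the described pipeline: RDKit descriptors for the
# ionizable lipid plus formulation variables, fed to a random forest.
# The CSV schema and column names are hypothetical.
import numpy as np
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def featurize(smiles: str) -> list[float]:
    mol = Chem.MolFromSmiles(smiles)
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol),
            Descriptors.TPSA(mol), Descriptors.NumRotatableBonds(mol)]

df = pd.read_csv("lnp_formulations.csv")        # hypothetical curated data
X = np.hstack([
    np.array([featurize(s) for s in df["ionizable_lipid_smiles"]]),
    df[["lipid_molar_ratio", "np_ratio", "dose_ng"]].to_numpy(),
])
# Binary target: above/below the median measured activity.
y = (df["transfection_activity"] > df["transfection_activity"].median()).astype(int)

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())  # cross-validated accuracy
```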
New submissions (showing 10 of 10 entries)
- [11] arXiv:2411.13688 (cross-list from cs.LG) [pdf, other]
Title: Investigating Graph Neural Networks and Classical Feature-Extraction Techniques in Activity-Cliff and Molecular Property Prediction
Comments: Doctoral Thesis (Mathematical Institute, University of Oxford)
Subjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM); Machine Learning (stat.ML)
Molecular featurisation refers to the transformation of molecular data into numerical feature vectors. It is one of the key research areas in molecular machine learning and computational drug discovery. Recently, message-passing graph neural networks (GNNs) have emerged as a novel method to learn differentiable features directly from molecular graphs. While such techniques hold great promise, further investigations are needed to clarify if and when they indeed manage to definitively outcompete classical molecular featurisations such as extended-connectivity fingerprints (ECFPs) and physicochemical-descriptor vectors (PDVs). We systematically explore and further develop classical and graph-based molecular featurisation methods for two important tasks: molecular property prediction, in particular, quantitative structure-activity relationship (QSAR) prediction, and the largely unexplored challenge of activity-cliff (AC) prediction. We first give a technical description and critical analysis of PDVs, ECFPs and message-passing GNNs, with a focus on graph isomorphism networks (GINs). We then conduct a rigorous computational study to compare the performance of PDVs, ECFPs and GINs for QSAR and AC-prediction. Following this, we mathematically describe and computationally evaluate a novel twin neural network model for AC-prediction. We further introduce an operation called substructure pooling for the vectorisation of structural fingerprints as a natural counterpart to graph pooling in GNN architectures. We go on to propose Sort & Slice, a simple substructure-pooling technique for ECFPs that robustly outperforms hash-based folding at molecular property prediction. Finally, we outline two ideas for future research: (i) a graph-based self-supervised learning strategy to make classical molecular featurisations trainable, and (ii) trainable substructure-pooling via differentiable self-attention.
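The Sort & Slice operation described above admits a compact sketch: rank Morgan substructure identifiers by how many training molecules contain them, then keep the top d identifiers as fingerprint slots instead of hash-folding. A minimal version under those assumptions (toy molecules; not the thesis code):

```python
# Sketch of Sort & Slice substructure pooling for ECFPs: rank Morgan
# substructure identifiers by training-set frequency and keep the top d
# as fingerprint slots, instead of hash-based folding. Toy molecules.
from collections import Counter
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def substructure_ids(smiles, radius=2):
    mol = Chem.MolFromSmiles(smiles)
    return set(AllChem.GetMorganFingerprint(mol, radius).GetNonzeroElements())

train = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"]
counts = Counter(i for smi in train for i in substructure_ids(smi))
vocab = [i for i, _ in counts.most_common(128)]   # sort, then slice top d=128
slot = {i: j for j, i in enumerate(vocab)}

def sort_and_slice_fp(smiles):
    fp = np.zeros(len(vocab), dtype=np.uint8)
    for i in substructure_ids(smiles) & slot.keys():   # ignore unseen IDs
        fp[slot[i]] = 1
    return fp

print(sort_and_slice_fp("CCO").sum())
```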
- [12] arXiv:2411.14242 (cross-list from cs.CE) [pdf, html, other]
Title: Approximate Constrained Lumping of Chemical Reaction Networks
Authors: Alexander Leguizamon-Robayo, Antonio Jiménez-Pastor, Mirco Tribastone, Max Tschaikowski, Andrea Vandin
Subjects: Computational Engineering, Finance, and Science (cs.CE); Quantitative Methods (q-bio.QM)
Gaining insights from realistic dynamical models of biochemical systems can be challenging given their large number of state variables. Model reduction techniques can mitigate this by mapping the model onto a lower-dimensional state space. Exact constrained lumping identifies reductions as linear combinations of the original state variables in systems of nonlinear ordinary differential equations, preserving specific user-defined output variables without error. However, exact reductions can be too stringent in practice, as model parameters are often uncertain or imprecise -- a particularly relevant problem for biochemical systems. We propose approximate constrained lumping, which relaxes exactness within a given tolerance parameter $\varepsilon$ while still working in polynomial time. We prove that the accuracy, i.e., the difference between the output variables in the original and reduced models, is of the order of $\varepsilon$. Furthermore, we provide a heuristic algorithm to find the smallest $\varepsilon$ for a given maximum allowable size of the lumped system. Our method is applied to several models from the literature, resulting in coarser aggregations than exact lumping while still capturing the dynamics of the original system accurately.
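A minimal numerical illustration of the exact case: in the network A → B, A → C with B and C degrading at a common rate, the lump y = B + C obeys a closed reduced ODE; approximate lumping relaxes the requirement that such rates match exactly to within the tolerance $\varepsilon$. Rates below are illustrative:

```python
# Numerical check of an exact linear lumping: for A -k1-> B, A -k1p-> C,
# with B and C degrading at the same rate k2, the sum y = B + C obeys a
# closed reduced ODE. Rates are illustrative.
import numpy as np
from scipy.integrate import solve_ivp

k1, k1p, k2 = 1.0, 0.5, 0.3

def full(t, x):
    A, B, C = x
    return [-(k1 + k1p) * A, k1 * A - k2 * B, k1p * A - k2 * C]

def reduced(t, z):
    A, y = z                      # y lumps B + C
    return [-(k1 + k1p) * A, (k1 + k1p) * A - k2 * y]

t = np.linspace(0, 10, 50)
fx = solve_ivp(full, (0, 10), [1.0, 0.0, 0.0], t_eval=t).y
rz = solve_ivp(reduced, (0, 10), [1.0, 0.0], t_eval=t).y
# Agreement up to solver tolerance: this lumping is exact.
print(np.max(np.abs(fx[1] + fx[2] - rz[1])))
```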
Cross submissions (showing 2 of 2 entries)
- [13] arXiv:2306.11031 (replaced) [pdf, html, other]
Title: Chaotic turnover of rare and abundant species in a strongly interacting model community
Comments: 15 pages, 7 figures
Journal-ref: PNAS 121(11) e2312822121, 2024
Subjects: Populations and Evolution (q-bio.PE); Statistical Mechanics (cond-mat.stat-mech)
The composition of ecological communities varies not only between different locations but also in time. Understanding the fundamental processes that drive species towards rarity or abundance is crucial to assessing ecosystem resilience and adaptation to changing environmental conditions. In plankton communities in particular, large temporal fluctuations in species abundances have been associated with chaotic dynamics. On the other hand, microbial diversity is overwhelmingly sustained by a `rare biosphere' of species with very low abundances. We consider here the possibility that interactions within a species-rich community can relate both phenomena. We use a Lotka-Volterra model with weak immigration and strong, disordered, and mostly competitive interactions between hundreds of species to bridge single-species temporal fluctuations and abundance distribution patterns. We highlight a generic chaotic regime where a few species at a time achieve dominance, but are continuously overturned by the invasion of formerly rare species. We derive a focal-species model that captures the intermittent boom-and-bust dynamics that every species undergoes. Although species cannot be treated as effectively uncorrelated in their abundances, the community's effect on a focal species can nonetheless be described by a time-correlated noise characterized by a few effective parameters that can be estimated from time series. The model predicts a non-unitary exponent of the power-law abundance decay, which varies weakly with ecological parameters, consistent with observation in marine protist communities. The chaotic turnover regime is thus poised to capture relevant ecological features of species-rich microbial communities.
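A minimal sketch of the model class described: Lotka-Volterra dynamics with disordered, mostly competitive interactions and weak immigration. The interaction statistics here are illustrative choices, not the paper's exact ensemble:

```python
# Disordered Lotka-Volterra community with weak immigration:
#   dN_i/dt = N_i (1 - N_i - sum_j a_ij N_j) + lam
# Interaction statistics are illustrative, not the paper's exact ensemble.
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(2)
S, lam = 200, 1e-8                          # species count, immigration rate
mu, sigma = 4.0 / S, 1.0 / np.sqrt(S)       # interaction mean / disorder
a = mu + sigma * rng.normal(size=(S, S))
np.fill_diagonal(a, 0.0)

def lv(t, N):
    return N * (1.0 - N - a @ N) + lam

sol = solve_ivp(lv, (0.0, 300.0), rng.uniform(0.1, 1.0, size=S),
                t_eval=np.linspace(200.0, 300.0, 101))
final = sol.y[:, -1]
# A few dominant species coexist with a long tail of rare ones.
print(f"dominant: {final.max():.3f}, rarest: {final.min():.2e}")
```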
- [14] arXiv:2306.13633 (replaced) [pdf, html, other]
Title: Optimal Vaccination Policy to Prevent Endemicity: A Stochastic Model
Comments: 51 pages, 7 figures
Subjects: Populations and Evolution (q-bio.PE); Probability (math.PR)
We examine here the effects of recurrent vaccination and waning immunity on the establishment of an endemic equilibrium in a population. An individual-based model that incorporates memory effects for the transmission rate during infection and subsequent immunity is introduced, considering stochasticity at the individual level. By letting the population size go to infinity, we derive a set of equations describing the large-scale behavior of the epidemic. The analysis of the model's equilibria reveals a criterion for the existence of an endemic equilibrium, which depends on the rate of immunity loss and the distribution of time between booster doses. The outcome of a vaccination policy in this context is influenced by the efficiency of the vaccine in blocking transmissions and the distribution pattern of booster doses within the population. Strategies with evenly spaced booster shots at the individual level prove to be more effective in preventing disease spread than irregularly spaced boosters, as longer intervals without vaccination increase susceptibility and facilitate more efficient disease transmission. We provide an expression for the critical fraction of the population that must adhere to the vaccination policy in order to eradicate the disease, which resembles a well-known threshold for preventing an outbreak with an imperfect vaccine. We also investigate the consequences of unequal vaccine access in a population and prove that, under reasonable assumptions, fair vaccine allocation is the optimal strategy to prevent endemicity.
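The "well-known threshold" alluded to is the classical critical-coverage formula for an imperfect vaccine, $p_c = (1 - 1/R_0)/\varepsilon$ for efficacy $\varepsilon$; the paper's endemicity criterion resembles but is not identical to it. A quick computation:

```python
# Classical imperfect-vaccine threshold: a fraction p_c = (1 - 1/R0) / eff
# of the population must be covered by a vaccine of efficacy eff to prevent
# an outbreak. p_c > 1 means eradication is impossible at that efficacy.
def critical_coverage(R0: float, efficacy: float) -> float:
    return (1.0 - 1.0 / R0) / efficacy

for R0, eff in [(2.0, 0.9), (5.0, 0.9), (5.0, 0.7)]:
    print(f"R0={R0}, efficacy={eff}: p_c={critical_coverage(R0, eff):.2f}")
```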
- [15] arXiv:2405.01715 (replaced) [pdf, html, other]
Title: GRAMEP: an alignment-free method based on the Maximum Entropy Principle for identifying SNPs
Authors: Matheus Henrique Pimenta-Zanon, André Yoshiaki Kashiwabara, André Luís Laforga Vanzela, Fabricio Martins Lopes
Subjects: Genomics (q-bio.GN); Information Theory (cs.IT); Applications (stat.AP)
Background: Advances in high-throughput sequencing technologies provide a huge number of genomes to be analyzed. Computational methods therefore play a crucial role in analyzing and extracting knowledge from the data generated. Investigating genomic mutations is critical because of their impact on chromosomal evolution, genetic disorders, and diseases. It is common to adopt sequence alignment for analyzing genomic variation. However, this approach can be computationally expensive and restrictive in scenarios with large datasets. Results: We present a novel method for identifying single nucleotide polymorphisms (SNPs) in DNA sequences from assembled genomes. This study proposes GRAMEP, an alignment-free approach that adopts the principle of maximum entropy to discover the most informative k-mers specific to a genome or set of sequences under investigation. The informative k-mers enable the detection of variant-specific mutations relative to a reference genome or other set of sequences. In addition, our method offers the possibility of classifying novel sequences with no need for organism-specific information. GRAMEP demonstrated high accuracy in both in silico simulations and analyses of viral genomes, including Dengue, HIV, and SARS-CoV-2. Our approach maintained accurate SARS-CoV-2 variant identification while demonstrating a lower computational cost than methods with the same purpose. Conclusions: GRAMEP is an open and user-friendly software package based on maximum entropy that provides an efficient alignment-free approach to identifying and classifying unique genomic subsequences and SNPs with high accuracy, offering advantages over comparative methods. The instructions for use, applicability, and usability of GRAMEP are open access at this https URL
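A simplified sketch of the two ingredients the method combines: Shannon entropy of a k-mer distribution (the maximum-entropy criterion used to pick informative k) and k-mers exclusive to a variant relative to a reference. This is a toy illustration, not the GRAMEP implementation:

```python
# Toy illustration of two GRAMEP ingredients: Shannon entropy of a
# k-mer distribution, and variant-exclusive k-mers versus a reference.
from collections import Counter
from math import log2

def kmer_counts(seq: str, k: int) -> Counter:
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def shannon_entropy(counts: Counter) -> float:
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

ref, var = "ACGTACGTACGT", "ACGTACCTACGT"   # toy sequences, one substitution
k = 5
exclusive = set(kmer_counts(var, k)) - set(kmer_counts(ref, k))
print(f"H_ref={shannon_entropy(kmer_counts(ref, k)):.2f} bits;"
      f" variant-exclusive k-mers: {sorted(exclusive)}")  # all overlap the SNP
```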
- [16] arXiv:2405.07245 (replaced) [pdf, html, other]
Title: Ecology, Spatial Structure, and Selection Pressure Induce Strong Signatures in Phylogenetic Structure
Subjects: Populations and Evolution (q-bio.PE); Neural and Evolutionary Computing (cs.NE)
Evolutionary dynamics are shaped by a variety of fundamental, generic drivers, including spatial structure, ecology, and selection pressure. These drivers impact the trajectory of evolution, and have been hypothesized to influence phylogenetic structure. Here, we set out to assess (1) if spatial structure, ecology, and selection pressure leave detectable signatures in phylogenetic structure, (2) the extent, in particular, to which ecology can be detected and discerned in the presence of spatial structure, and (3) the extent to which these phylogenetic signatures generalize across evolutionary systems. To this end, we analyze phylogenies generated by manipulating spatial structure, ecology, and selection pressure within three computational models of varied scope and sophistication. We find that selection pressure, spatial structure, and ecology have characteristic effects on phylogenetic metrics, although these effects are complex and not always intuitive. Signatures have some consistency across systems when using equivalent taxonomic unit definitions (e.g., individual, genotype, species). Further, we find that sufficiently strong ecology can be detected in the presence of spatial structure. We also find that, while low-resolution phylogenetic reconstructions can bias some phylogenetic metrics, high-resolution reconstructions recapitulate them faithfully. Although our results suggest potential for evolutionary inference of spatial structure, ecology, and selection pressure through phylogenetic analysis, further methods development is needed to distinguish these drivers' phylometric signatures from each other and to appropriately normalize phylogenetic metrics. With such work, phylogenetic analysis could provide a versatile toolkit to study large-scale evolving populations.
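As a concrete example of the kind of phylogenetic metric such analyses rely on, the sketch below computes the Sackin index (sum of leaf depths, a standard imbalance measure) on toy trees; the metric choice is ours for illustration, while the paper evaluates a broader suite:

```python
# Sackin index (sum of leaf depths), a standard tree-imbalance metric,
# computed on toy trees given as child lists. Illustrative example only.
def sackin(tree: dict, root: str, depth: int = 0) -> int:
    children = tree.get(root, [])
    if not children:                      # leaf: contribute its depth
        return depth
    return sum(sackin(tree, c, depth + 1) for c in children)

balanced = {"r": ["a", "b"], "a": ["t1", "t2"], "b": ["t3", "t4"]}
caterpillar = {"r": ["t1", "x"], "x": ["t2", "y"], "y": ["t3", "t4"]}
print(sackin(balanced, "r"), sackin(caterpillar, "r"))  # 8 vs 9: less balanced
```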
- [17] arXiv:2406.08521 (replaced) [pdf, html, other]
Title: Embedding-based Multimodal Learning on Pan-Squamous Cell Carcinomas for Improved Survival Outcomes
Subjects: Cell Behavior (q-bio.CB); Machine Learning (cs.LG)
Cancer clinics capture disease data at various scales, from the genetic to the organ level. Current bioinformatic methods struggle to handle the heterogeneous nature of this data, especially with missing modalities. We propose PARADIGM, a Graph Neural Network (GNN) framework that learns from multimodal, heterogeneous datasets to improve clinical outcome prediction. PARADIGM generates embeddings from multi-resolution data using foundation models, aggregates them into patient-level representations, fuses them into a unified graph, and enhances performance for tasks like survival analysis. We train GNNs on pan-Squamous Cell Carcinomas and validate our approach on Moffitt Cancer Center lung SCC data. The multimodal GNN outperforms other models in patient survival prediction. Converging individual data modalities across varying scales provides a more insightful view of the disease. Our solution aims to understand the patient's circumstances comprehensively, offering insights into heterogeneous data integration and the benefits of converging maximum data views.
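One step described above, aggregating multi-resolution embeddings into patient-level representations while tolerating missing modalities, can be sketched as a masked mean-pool. Shapes and modality names below are hypothetical:

```python
# Sketch of the patient-level aggregation step described: mean-pool the
# modality embeddings that are present for each patient, implicitly
# masking missing modalities. Shapes and modality names are hypothetical.
import numpy as np

emb_dim = 8
modalities = {                           # patient -> modality -> embedding
    0: {"wsi": np.ones(emb_dim), "rna": 2 * np.ones(emb_dim)},
    1: {"rna": 3 * np.ones(emb_dim)},    # WSI missing for this patient
    2: {"wsi": np.zeros(emb_dim)},
}

def patient_embedding(mods: dict) -> np.ndarray:
    stack = np.stack(list(mods.values()))     # only the modalities present
    return stack.mean(axis=0)                 # masked mean over modalities

X = np.stack([patient_embedding(modalities[i]) for i in sorted(modalities)])
print(X[:, 0])  # [1.5, 3.0, 0.0] -> node features of a downstream patient graph
```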
- [18] arXiv:2409.07462 (replaced) [pdf, html, other]
Title: S-MolSearch: 3D Semi-supervised Contrastive Learning for Bioactive Molecule Search
Subjects: Biomolecules (q-bio.BM); Machine Learning (cs.LG)
Virtual Screening is an essential technique in the early phases of drug discovery, aimed at identifying promising drug candidates from vast molecular libraries. Recently, ligand-based virtual screening has garnered significant attention due to its efficacy in conducting extensive database screenings without relying on specific protein-binding site information. Obtaining binding affinity data for complexes is highly expensive, resulting in a limited amount of available data that covers a relatively small chemical space. Moreover, these datasets contain a significant amount of inconsistent noise. It is challenging to identify an inductive bias that consistently maintains the integrity of molecular activity during data augmentation. To tackle these challenges, we propose S-MolSearch, the first framework to our knowledge, that leverages molecular 3D information and affinity information in semi-supervised contrastive learning for ligand-based virtual screening. Drawing on the principles of inverse optimal transport, S-MolSearch efficiently processes both labeled and unlabeled data, training molecular structural encoders while generating soft labels for the unlabeled data. This design allows S-MolSearch to adaptively utilize unlabeled data within the learning process. Empirically, S-MolSearch demonstrates superior performance on widely-used benchmarks LIT-PCBA and DUD-E. It surpasses both structure-based and ligand-based virtual screening methods for AUROC, BEDROC and EF.
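A sketch of the semi-supervised contrastive ingredient: an InfoNCE-style loss whose targets can be hard positives (labeled pairs) or soft assignments (unlabeled pairs), the latter standing in for the inverse-optimal-transport soft labels described. Toy tensors; not the S-MolSearch code:

```python
# Soft-target InfoNCE: each anchor's similarity row is trained toward a
# target distribution, either hard positives (labeled pairs) or soft
# labels (unlabeled pairs). Toy tensors; not the S-MolSearch code.
import torch
import torch.nn.functional as F

def soft_info_nce(z_a, z_b, targets, tau=0.07):
    z_a, z_b = F.normalize(z_a, dim=1), F.normalize(z_b, dim=1)
    logits = z_a @ z_b.T / tau                     # cosine similarities
    return -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

n, d = 4, 16
z_a, z_b = torch.randn(n, d), torch.randn(n, d)
hard = torch.eye(n)                                # supervised: hard positives
soft = F.softmax(torch.randn(n, n), dim=1)         # unlabeled: soft targets
print(soft_info_nce(z_a, z_b, hard).item(),
      soft_info_nce(z_a, z_b, soft).item())
```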
- [19] arXiv:2411.13280 (replaced) [pdf, html, other]
Title: Structure-Based Molecule Optimization via Gradient-Guided Bayesian Update
Authors: Keyue Qiu, Yuxuan Song, Jie Yu, Hongbo Ma, Ziyao Cao, Zhilong Zhang, Yushuai Wu, Mingyue Zheng, Hao Zhou, Wei-Ying Ma
Comments: 27 pages, 17 figures
Subjects: Biomolecules (q-bio.BM); Artificial Intelligence (cs.AI)
Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets. A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities. To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), the first gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance. We introduce a novel backward correction strategy that optimizes within a sliding window of past histories, allowing for a seamless explore-and-exploit trade-off during optimization. Our proposed MolJO achieves state-of-the-art performance on the CrossDocked2020 benchmark (Success Rate 51.3%, Vina Dock -9.05, and SA 0.78), a more than 4x improvement in Success Rate over its gradient-based counterpart and a "Me-Better" ratio twice that of 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility and potential.
- [20] arXiv:2411.03832 (replaced) [pdf, html, other]
Title: Accelerating DNA Read Mapping with Digital Processing-in-Memory
Authors: Rotem Ben-Hur, Orian Leitersdorf, Ronny Ronen, Lidor Goldshmidt, Idan Magram, Lior Kaplun, Leonid Yavitz, Shahar Kvatinsky
Subjects: Hardware Architecture (cs.AR); Distributed, Parallel, and Cluster Computing (cs.DC); Quantitative Methods (q-bio.QM)
Genome analysis has revolutionized fields such as personalized medicine and forensics. Modern sequencing machines generate vast amounts of fragmented strings of genome data called reads. The alignment of these reads into a complete DNA sequence of an organism (the read-mapping process) requires extensive data transfer between processing units and memory, leading to execution bottlenecks. Prior studies have primarily focused on accelerating specific stages of the read-mapping task. Conversely, this paper introduces a holistic framework called DART-PIM that accelerates the entire read-mapping process. DART-PIM facilitates digital processing-in-memory (PIM) for end-to-end acceleration, from indexing using a unique data organization schema to filtering and read alignment with an optimized Wagner-Fischer algorithm. A comprehensive performance evaluation with real genomic data shows that DART-PIM achieves 5.7x and 257x improvements in throughput and 92x and 27x improvements in energy efficiency over state-of-the-art GPU and PIM implementations, respectively.
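For reference, the Wagner-Fischer algorithm that DART-PIM optimizes is the classic O(mn) edit-distance dynamic program. A plain-Python baseline for comparison (the in-memory version differs, of course):

```python
# Classic Wagner-Fischer dynamic program: edit distance between a read
# and a reference window, using two rolling rows. Reference baseline only;
# DART-PIM's in-memory implementation is organized very differently.
def wagner_fischer(read: str, ref: str) -> int:
    m, n = len(read), len(ref)
    prev = list(range(n + 1))              # distances for the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (read[i - 1] != ref[j - 1])
            cur[j] = min(prev[j] + 1,      # deletion
                         cur[j - 1] + 1,   # insertion
                         sub)              # match / substitution
        prev = cur
    return prev[n]

print(wagner_fischer("ACGTTGCA", "ACGTGGCA"))  # one substitution => 1
```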