Autonomous Drug Discovery: Mining 700,000 Natural Products for Antimicrobial Candidates

How K-Dense Web autonomously processed the COCONUT database to identify 50 prioritized antimicrobial candidates using unsupervised machine learning.

Antimicrobial resistance is one of the most pressing global health crises of our time. With bacterial infections becoming increasingly difficult to treat, the search for new antibiotics has never been more urgent. Natural products, compounds produced by living organisms, have historically been our richest source of antimicrobial drugs, from penicillin to vancomycin.

In this case study, we demonstrate how K-Dense Web autonomously executed a complete computational drug discovery pipeline, processing over 700,000 natural products to identify 50 prioritized antimicrobial candidates ready for experimental screening.

The Challenge: Finding Needles in a Molecular Haystack

The COCONUT database (COlleCtion of Open Natural prodUcTs) contains 715,822 natural products: a treasure trove of chemical diversity. But manually screening this many compounds is impractical. The challenge: how do you systematically identify the most promising antimicrobial candidates from such a vast chemical space?

This is where K-Dense Web's autonomous research capabilities come into play.

The Autonomous Pipeline

With a single prompt describing the research objective, K-Dense Web designed and executed a complete five-step computational pipeline:

Autonomous workflow schematic showing the complete pipeline from data preparation through candidate selection

Step 1: Data Preparation

K-Dense Web automatically:

  • Downloaded the full COCONUT database (664 MB)
  • Filtered for bacterial-derived compounds (24,911 compounds, 3.48% of total)
  • Validated all SMILES structures using RDKit (99.996% validation rate)
  • Standardized molecular representations for downstream analysis

Result: 24,910 unique bacterial natural products with validated chemical structures (one of the 24,911 filtered compounds failed validation).
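
The validation and standardization bullets above can be sketched with RDKit, the toolkit named in this step. This is a minimal sketch rather than the code from the session: the function name `validate_and_standardize` is hypothetical, and using canonical SMILES as both the standardized form and the deduplication key is an assumption.

```python
from rdkit import Chem, RDLogger

# Silence RDKit parse warnings so invalid inputs do not flood the log.
RDLogger.DisableLog("rdApp.*")


def validate_and_standardize(smiles_list):
    """Return (unique canonical SMILES, number of unparseable inputs).

    Assumption: canonical SMILES serves as the standardized representation
    and as the key for removing duplicates.
    """
    seen = set()
    valid = []
    n_invalid = 0
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)  # returns None when parsing fails
        if mol is None:
            n_invalid += 1
            continue
        canonical = Chem.MolToSmiles(mol)
        if canonical not in seen:
            seen.add(canonical)
            valid.append(canonical)
    return valid, n_invalid


# "CCO" and "OCC" are the same molecule; "C1CC" has an unclosed ring.
valid, n_invalid = validate_and_standardize(["CCO", "OCC", "c1ccccc1", "C1CC"])
print(len(valid), "valid unique structures,", n_invalid, "invalid")
```

Counting failures separately from duplicates makes it possible to report a validation rate like the one quoted above.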

Step 2: Feature Engineering

For each compound, K-Dense Web calculated:

  • Physicochemical properties: Molecular weight, LogP, TPSA, hydrogen bond donors/acceptors
  • Structural descriptors: Ring count, aromatic rings, fraction sp3 carbons
  • Drug-likeness metrics: QED score, Lipinski's Rule of 5 compliance, PAINS filtering
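
The descriptor families listed above map onto standard RDKit calls. The following is a sketch under assumptions, not the session's code: the function name `compute_descriptors` and the dictionary keys are hypothetical, and the PAINS check uses RDKit's built-in PAINS filter catalog, which may differ from the filter set the pipeline actually applied.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, Descriptors, Lipinski, QED, rdMolDescriptors
from rdkit.Chem.FilterCatalog import FilterCatalog, FilterCatalogParams

# Build the PAINS substructure catalog once and reuse it for every molecule.
_params = FilterCatalogParams()
_params.AddCatalog(FilterCatalogParams.FilterCatalogs.PAINS)
PAINS_CATALOG = FilterCatalog(_params)


def compute_descriptors(smiles):
    """Return a descriptor dict for one SMILES string, or None if it fails to parse."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        # Physicochemical properties
        "mw": Descriptors.MolWt(mol),
        "logp": Crippen.MolLogP(mol),
        "tpsa": rdMolDescriptors.CalcTPSA(mol),
        "hbd": Lipinski.NumHDonors(mol),
        "hba": Lipinski.NumHAcceptors(mol),
        # Structural descriptors
        "rings": rdMolDescriptors.CalcNumRings(mol),
        "aromatic_rings": rdMolDescriptors.CalcNumAromaticRings(mol),
        "fsp3": rdMolDescriptors.CalcFractionCSP3(mol),
        # Drug-likeness metrics
        "qed": QED.qed(mol),
        "pains_hit": PAINS_CATALOG.HasMatch(mol),
    }


# Aspirin as a small, familiar test molecule.
aspirin = compute_descriptors("CC(=O)Oc1ccccc1C(=O)O")
print({k: round(v, 2) if isinstance(v, float) else v for k, v in aspirin.items()})
```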

Key findings from descriptor analysis:

Property            | Mean                       | Range
Molecular Weight    | 539 Da                     | 1 to 4,900 Da
LogP                | 2.1                        | -29 to 37
QED Score           | 0.36                       | 0.01 to 0.94
Lipinski Compliant  | 44.8% (share of compounds) | n/a

Only 39.7% of compounds passed both Lipinski's Rule of 5 and PAINS filters, highlighting that bacterial natural products often exist beyond traditional "drug-like" chemical space.
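
To make the combined filter concrete, here is a small sketch of how a "passes both" rate could be computed from per-compound descriptors. The thresholds are the standard Rule of 5 limits; everything else is an assumption: the dictionary keys are hypothetical, the example values are invented for illustration, and the source does not say whether the pipeline tolerated any violations, so `max_violations` is exposed as a parameter (many implementations allow one).

```python
def lipinski_violations(mw, logp, hbd, hba):
    """Count violations of the four standard Rule of 5 thresholds."""
    return sum([mw > 500, logp > 5, hbd > 5, hba > 10])


def passes_filters(desc, max_violations=0):
    """True if the compound is Lipinski compliant and has no PAINS alert."""
    n = lipinski_violations(desc["mw"], desc["logp"], desc["hbd"], desc["hba"])
    return n <= max_violations and not desc["pains_hit"]


# Illustrative values only, not compounds from COCONUT.
compounds = [
    {"mw": 322.0, "logp": 2.8, "hbd": 1, "hba": 4, "pains_hit": False},    # small and clean
    {"mw": 978.0, "logp": 1.5, "hbd": 12, "hba": 20, "pains_hit": False},  # too large and polar
    {"mw": 310.0, "logp": 3.1, "hbd": 2, "hba": 5, "pains_hit": True},     # PAINS alert
]

pass_rate = sum(passes_filters(c) for c in compounds) / len(compounds)
print(f"{pass_rate:.1%} pass both filters")
```

Because the PAINS check can only remove compounds, the combined pass rate can never exceed the Lipinski-only rate, which matches the drop from 44.8% to 39.7% reported above.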

Step 3: Chemical Space Analysis

When K-Dense Web attempted to query ChEMBL for bioactivity training data (as originally planned), the API returned errors. Rather than failing, the agent autonomously pivoted to an unsupervised learning approach, demonstrating the adaptive problem-solving that makes autonomous research powerful.

The chemical space analysis revealed a striking finding: bacterial natural products occupy two distinct chemical clusters.

PCA visualization showing two distinct chemical clusters in the bacterial natural product space

Cluster 0 (75.4% of compounds):

  • Small, drug-like molecules
  • Mean MW: 396 Da
  • Mean QED: 0.44
  • Likely represents alkaloids, terpenoids, and smaller polyketides

Cluster 1 (24.6% of compounds):

  • Large, complex molecules
  • Mean MW: 978 Da (about 2.5× the Cluster 0 mean)
  • Mean QED: 0.09
  • Likely represents glycopeptides, lipopeptides, and macrocyclic antibiotics

This bimodal distribution is chemically meaningful: it mirrors the known diversity of antimicrobial natural products, from simple alkaloids to complex glycopeptide antibiotics like vancomycin.
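
An unsupervised analysis of this kind can be sketched with scikit-learn. The source does not name the clustering algorithm, so k-means with two clusters is an assumption here, as are the standardization step and the choice of features. The data is synthetic, generated to loosely resemble the cluster means reported above, and is not drawn from COCONUT.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the descriptor table; columns are MW, LogP, QED.
small = np.column_stack([
    rng.normal(396, 60, 75),
    rng.normal(2.5, 1.0, 75),
    rng.normal(0.44, 0.05, 75),
])
large = np.column_stack([
    rng.normal(978, 120, 25),
    rng.normal(1.0, 1.0, 25),
    rng.normal(0.09, 0.03, 25),
])
X = np.vstack([small, large])

# Standardize so that MW (hundreds of Da) does not dominate QED (0 to 1).
X_scaled = StandardScaler().fit_transform(X)

# Two principal components for plotting; clustering runs on the scaled features.
coords = PCA(n_components=2).fit_transform(X_scaled)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

for k in range(2):
    members = X[labels == k]
    share = len(members) / len(X)
    print(f"cluster {k}: {share:.1%} of compounds, mean MW {members[:, 0].mean():.0f} Da")
```

Cluster indices from k-means are arbitrary, so which group is labeled 0 or 1 should be decided by inspecting the cluster means rather than assumed.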

Heatmap showing normalized chemical profiles for each cluster

Step 4: Bimodal Candidate Selection

Rather than applying a single selection criterion, K-Dense Web implemented a bimodal selection strategy, drawing 25 candidates from each cluster to balance drug development feasibility against bioactive potential:

Group A: Drug-like Leads (25 compounds)

  • Selected from Cluster 0 based on highest QED scores
  • Mean MW: 322 Da, Mean QED: 0.93
  • 100% Lipinski compliant
  • Optimized for oral bioavailability and easier development

Group B: Complex Scaffolds (25 compounds)

  • Selected from Cluster 1 based on structural complexity
  • Mean MW: 1,930 Da, Mean QED: 0.05
  • Representative of privileged antibiotic scaffolds
  • Optimized for potency and structural novelty

PCA plot showing the 50 selected candidates highlighted against the full dataset

This dual-track approach ensures the final candidate set covers both:

  1. Low-risk development paths (Group A): smaller molecules amenable to traditional medicinal chemistry optimization
  2. High-novelty potential (Group B): complex scaffolds typical of clinically successful antibiotics
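
The dual-track selection can be illustrated in a few lines of plain Python. This is a sketch under assumptions: the `Compound` fields and `select_bimodal` are hypothetical names, and `complexity` stands in for whatever structural complexity score the pipeline used, which the source does not specify.

```python
from dataclasses import dataclass


@dataclass
class Compound:
    cid: str
    cluster: int       # 0 = small drug-like cluster, 1 = large complex cluster
    qed: float
    complexity: float  # hypothetical score; the actual metric is not named in the source


def select_bimodal(compounds, n_per_group=25):
    """Pick the top-QED compounds from cluster 0 and the most complex from cluster 1."""
    drug_like = sorted(
        (c for c in compounds if c.cluster == 0), key=lambda c: c.qed, reverse=True
    )
    complex_scaffolds = sorted(
        (c for c in compounds if c.cluster == 1), key=lambda c: c.complexity, reverse=True
    )
    return drug_like[:n_per_group], complex_scaffolds[:n_per_group]


# Toy data for illustration only.
pool = [
    Compound("a1", 0, 0.93, 210.0),
    Compound("a2", 0, 0.41, 350.0),
    Compound("a3", 0, 0.88, 190.0),
    Compound("b1", 1, 0.05, 2400.0),
    Compound("b2", 1, 0.12, 900.0),
    Compound("b3", 1, 0.03, 3100.0),
]
group_a, group_b = select_bimodal(pool, n_per_group=2)
print([c.cid for c in group_a], [c.cid for c in group_b])
```

Ranking each cluster by its own criterion keeps low-QED macrocycles from being crowded out by small molecules, which a single global QED ranking would do.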

Step 5: Validation and Reporting

K-Dense Web generated comprehensive outputs including:

  • Publication-ready figures (10 visualizations)
  • A formal research manuscript with methods, results, and discussion
  • Detailed summary statistics and candidate profiles

Property distribution comparison between Group A and Group B candidates

The top candidates from each group show the striking diversity of the selection:

Chemical structures of top candidates from each group

Key Results

Metric                          | Value
Initial compounds screened      | 715,822
Bacterial compounds identified  | 24,910
Chemical clusters discovered    | 2
Prioritized candidates          | 50
Group A (drug-like)             | 25
Group B (complex)               | 25
Pipeline execution time         | ~45 minutes
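
The funnel percentages quoted in this case study can be checked with simple arithmetic. The counts below are taken from the text (24,911 is the pre-validation figure from Step 1); the variable names are only illustrative.

```python
total = 715_822      # compounds in COCONUT
bacterial = 24_911   # bacterial-derived compounds before validation
validated = 24_910   # compounds with valid structures
selected = 50        # prioritized candidates

bacterial_share = f"{bacterial / total:.2%}"
validation_rate = f"{validated / bacterial:.3%}"
selected_share = f"{selected / validated:.2%}"

print("bacterial share of database:", bacterial_share)   # 3.48%
print("validation rate:", validation_rate)               # 99.996%
print("selected share of validated set:", selected_share)  # 0.20%
```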

Why This Matters

Traditional computational drug discovery requires:

  • Domain expertise in cheminformatics
  • Familiarity with multiple software tools (RDKit, scikit-learn, matplotlib)
  • Days to weeks of manual analysis
  • Expertise to pivot when external resources fail

K-Dense Web completed this entire workflow autonomously, including:

  • Adaptive problem-solving: When the ChEMBL API failed, it pivoted to unsupervised learning
  • Scientific reasoning: The bimodal selection strategy reflects genuine understanding of drug discovery principles
  • Publication-quality outputs: Figures, statistics, and manuscript all ready for use

Next Steps

The 50 prioritized candidates are now ready for:

  1. Antimicrobial screening against bacterial panels including resistant strains (MRSA, VRE, MDR pathogens)
  2. MIC determination for active compounds
  3. Structure-activity relationship studies using the chemical space clustering
  4. Lead optimization with Group A compounds as starting points

Try It Yourself

This analysis demonstrates how K-Dense Web can shorten early-stage computational drug discovery from days or weeks to under an hour (about 45 minutes in this run). Whether you're mining chemical databases, analyzing bioactivity data, or prioritizing compounds for screening, autonomous AI research can substantially speed up your workflow.

Start your autonomous research project with $50 free credits →


This case study was generated from K-Dense Web. View the complete example session including all analysis code, data files, figures, and the publication-ready research manuscript.
