AI-Assisted Protein Analysis: From Sequence to Drug-Binding Hypothesis

In this blog, I walk through how publicly available AI tools can be used to explore a real-world biology problem—starting from a protein sequence and ending with a plausible drug-binding hypothesis.

šŸ‘‰ The complete code is available on GitHub and can be run in environments like Google Colab.

Why This Problem Matters (In Simple Terms)

At the most basic level, our bodies run on proteins. Think of a protein as a tiny machine:

  • Some cut molecules (enzymes)
  • Some send signals
  • Some build structures
  • Some help DNA replicate

Proteins are made up of amino acid sequences (a 1D string of letters). But here’s the key idea:

A protein’s function is determined not just by its sequence—but by its 3D shape.

Why?

Because proteins work by binding to other molecules, and this happens in specific regions called binding pockets.

When scientists design drugs, they are essentially asking:

ā€œWhere can I attach a molecule to influence this protein?ā€

This is why protein structure is critical—and why AI has become so powerful in this space.

Project Hypothesis : In this project, I explored the following question:

Can we use AI-predicted protein structures (via AlphaFold/ColabFold) to identify plausible drug-binding regions and generate an explainable early-stage drug discovery hypothesis?

Approach (High-Level Workflow)

We follow a simple but powerful pipeline:

Protein sequence → Structure prediction → Confidence analysis → Docking → Interpretation

Or more concretely:

Take a protein → predict its 3D structure → identify where a drug might bind → test small molecules → interpret the results

Tools Used

  • AlphaFold / ColabFold (structure prediction)
  • AutoDock Vina (molecular docking)
  • Python (NumPy, Pandas, visualization)
  • OpenAI API (AI-assisted interpretation)

1. Download Protein Sequence

Target protein: E. coli DHFR (gene: folA, UniProt ID: P0ABQ4)

This enzyme plays a role in DNA synthesis, making it a classic biological target.

We obtain the protein’s amino acid sequence in FASTA format from UniProt. This sequence serves as the input to the AI model.

2. Predict 3D Structure (AlphaFold / ColabFold)

We use AlphaFold (via ColabFold) to convert the 1D amino acid sequence into a 3D folded protein structure.

This step is the core AI component—learning structure from sequence.

3. Load Predicted Structure

The model outputs one or more predicted 3D structures.

These structures are typically shown in a rainbow color scheme, representing residue position along the sequence.

4. Evaluate Model Confidence

AlphaFold also provides confidence scores (pLDDT) for each residue:

  • šŸ”µ Blue → very high confidence
  • 🟢 Green → good
  • 🟔 Yellow → lower confidence
  • šŸ”“ Red → unreliable

In our case, the structure is predominantly blue—indicating a highly reliable model.

5. Explore the Structure (Where Could a Drug Bind?)

This is where biology intuition begins. When visualizing the structure, look for:

  • Cavities or grooves → potential binding pockets
  • Surface depressions → where molecules may fit
  • Deep internal pockets → strong candidates

Ask simple questions:

  • Where could a small molecule sit?
  • Is the surface flat or pocket-like?

6. Prepare Protein for Docking

We convert the protein structure into a docking-ready format (PDBQT)

This step adds:

  • Hydrogen atoms
  • Charge information
  • Atom types

These are required for molecular simulation tools.

7. Prepare Ligand

We select a small molecule (ligand) from a public database like PubChem.

This molecule is also converted into PDBQT format for docking.

8. Run Docking (AutoDock Vina)

Docking simulates how a molecule might bind to the protein.

Steps involved:

  1. Define a search region (box) on the protein
  2. Place the ligand inside
  3. Explore possible binding poses

In this example, we use the center of the protein as the search region (a simple starting point).

9. Analyze Docking Scores

Docking produces:

  • Multiple binding poses
  • A score for each (in kcal/mol)

Lower (more negative) scores suggest stronger binding.

In this case: Best score ā‰ˆ -5.7 kcal/mol. This indicates a moderate, computationally plausible interaction.

āš ļø Important: Docking scores are approximate and do not prove real-world binding.

10. Visualize Protein–Ligand Interaction

We then visualize:

  • Protein structure
  • Docked ligand
  • Binding region

This helps us understand:

Where exactly the molecule is interacting with the protein

11. AI-Assisted Scientific Interpretation

Finally, we use an LLM (OpenAI API) to generate a structured interpretation:

  • Objective
  • Model quality
  • Docking results
  • Interaction regions
  • Limitations
  • Next steps

This turns raw outputs into a readable research-style summary.

AI-Assisted Scientific Interpretation
Objective
The project aims to explore a mini-pipeline for AI-assisted structure-based drug discovery targeting E. coli dihydrofolate reductase (DHFR, folA). The approach involved predicting the 3D structure of E. coli DHFR using a ColabFold / AlphaFold2-style workflow, followed by molecular docking of a compound with AutoDock Vina to identify computationally plausible binding interactions.

Structural Model Quality
The predicted DHFR structure contains 159 residues with a high overall confidence indicated by a mean pLDDT score of 95.49. The majority of residues (146 out of 159) have pLDDT values above 90, representing very high confidence in local structural accuracy. Only 13 residues fall within the medium confidence range (70–90), and none below 70, suggesting a reliable model suitable for downstream docking analyses.

Docking Result Interpretation
The best docking score obtained from AutoDock Vina was āˆ’5.686 kcal/mol. While this score indicates a moderately favorable binding affinity, it should be regarded as an exploratory measure of ligand-target complementarity rather than a validated binding free energy. This interaction score supports prioritizing the compound for further computational and experimental investigation, recognizing the heuristics inherent to docking scoring functions.

Putative Interaction Zone
The docked ligand is predicted to interact with multiple residues within a 4.5 ƅ radius, including hydrophobic residues (ILE5, ILE14, ILE94), polar and charged residues (ASN18, ASP27, ARG98), and aromatic residues (PHE31, TYR100, HIS45). Glycine-rich loops (GLY95-97) and threonine side chains (THR46, THR123) also appear involved, which may contribute to binding through flexible loop regions or hydrogen bonding. This interaction cluster likely represents the putative ligand-binding site, consistent with the canonical substrate binding region of DHFR.

Scientific Limitations

The AlphaFold2-derived structure is a prediction, and despite high confidence, may not capture dynamic conformations or alternate binding site states.
Docking scores are simplified approximations and do not account for entropic effects, solvation dynamics, or induced fit conformational changes.
Experimental validation is necessary to confirm binding affinity and functional inhibition.
The ligand identity and chemical nature were not discussed, and off-target effects or bioavailability cannot be inferred.
The analysis does not consider potential allosteric sites or competitive binding with endogenous substrates/cofactors.

Key Takeaways

  • AI can predict protein structures with high confidence
  • Molecular docking can simulate potential binding
  • LLMs can help interpret results in a structured way
  • All of this can now be done on a single laptop

Limitations

This is an exploratory, computational workflow:

  • No experimental validation
  • Protein flexibility is not modeled
  • Docking is approximate
  • Binding is not confirmed

This is hypothesis generation, not proof

What’s Next

I am now working on improving this pipeline further:

  • Identifying more realistic binding pockets
  • Using known inhibitors for comparison
  • Refining docking accuracy
  • Improving biological interpretation

Want to learn more about everyday use of AI?


Discover more from Debabrata Pruseth

Subscribe to get the latest posts sent to your email.

Scroll to Top