
AI-Assisted Protein Analysis: From Sequence to Drug-Binding Hypothesis
In this blog, I walk through how publicly available AI tools can be used to explore a real-world biology problem, starting from a protein sequence and ending with a plausible drug-binding hypothesis.
The complete code is available on GitHub and can be run in environments like Google Colab.
Why This Problem Matters (In Simple Terms)
At the most basic level, our bodies run on proteins. Think of a protein as a tiny machine:
- Some cut molecules (enzymes)
- Some send signals
- Some build structures
- Some help DNA replicate
Proteins are made up of amino acid sequences (a 1D string of letters). But here's the key idea:
A protein's function is determined not just by its sequence but by its 3D shape.
Why?
Because proteins work by binding to other molecules, and this happens in specific regions called binding pockets.
When scientists design drugs, they are essentially asking:
“Where can I attach a molecule to influence this protein?”
This is why protein structure is critical, and why AI has become so powerful in this space.
Project Hypothesis: In this project, I explored the following question:
Can we use AI-predicted protein structures (via AlphaFold/ColabFold) to identify plausible drug-binding regions and generate an explainable early-stage drug discovery hypothesis?
Approach (High-Level Workflow)
We follow a simple but powerful pipeline:
Protein sequence → Structure prediction → Confidence analysis → Docking → Interpretation
Or more concretely:
Take a protein → predict its 3D structure → identify where a drug might bind → test small molecules → interpret the results
Tools Used
- AlphaFold / ColabFold (structure prediction)
- AutoDock Vina (molecular docking)
- Python (NumPy, Pandas, visualization)
- OpenAI API (AI-assisted interpretation)
1. Download Protein Sequence
Target protein: E. coli DHFR (gene: folA, UniProt ID: P0ABQ4)
This enzyme produces the tetrahydrofolate that cells need to make DNA building blocks, which makes it a classic antibacterial drug target.
We obtain the protein's amino acid sequence in FASTA format from UniProt. This sequence serves as the input to the AI model.
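As a rough illustration of this step, here is a minimal sketch that downloads and parses a single FASTA record. It assumes UniProt's REST endpoint serves FASTA at the URL pattern shown; the function names `fetch_fasta` and `parse_fasta` are my own and are not taken from the project code.

```python
from urllib.request import urlopen

# Assumption: UniProt serves one FASTA record per accession at this URL pattern.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fetch_fasta(accession: str, timeout: float = 30.0) -> str:
    """Download the raw FASTA text for one UniProt accession."""
    url = UNIPROT_FASTA_URL.format(accession=accession)
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_fasta(text: str) -> tuple[str, str]:
    """Split single-record FASTA text into (header, sequence)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("Expected a FASTA header line starting with '>'")
    header = lines[0][1:]
    # Sequence lines are wrapped in the file; join them into one string.
    sequence = "".join(lines[1:]).upper()
    return header, sequence
```

Calling `parse_fasta(fetch_fasta("P0ABQ4"))` should return the header and the sequence as one string. A quick sanity check is the sequence length: the predicted model discussed later has 159 residues.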
2. Predict 3D Structure (AlphaFold / ColabFold)
We use AlphaFold (via ColabFold) to convert the 1D amino acid sequence into a 3D folded protein structure.
This step is the core AI component: the model has learned to predict structure from sequence.
3. Load Predicted Structure
The model outputs one or more predicted 3D structures.
These structures are typically shown in a rainbow color scheme, representing residue position along the sequence.

4. Evaluate Model Confidence
AlphaFold also provides confidence scores (pLDDT) for each residue:
- 🔵 Blue: very high confidence
- 🟢 Green: good
- 🟡 Yellow: lower confidence
- 🔴 Red: unreliable
In our case, the structure is predominantly blue, indicating a highly reliable model.
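To put numbers behind the colors, the per-residue pLDDT can be read straight from the predicted PDB file, where AlphaFold-style outputs store it in the B-factor column. The sketch below is one possible way to do that. The bands (90 and above, 70 to 90, 50 to 70, below 50) are the commonly used pLDDT cutoffs and only roughly correspond to the four colors above; the function names are illustrative.

```python
from statistics import mean


def read_plddt(pdb_text: str) -> list[float]:
    """Return one pLDDT value per residue, read from C-alpha ATOM records."""
    scores = []
    for line in pdb_text.splitlines():
        # Fixed-width PDB columns: atom name in 13-16, B-factor in 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append(float(line[60:66]))
    return scores


def summarize_plddt(scores: list[float]) -> dict:
    """Mean pLDDT plus residue counts in commonly used confidence bands."""
    return {
        "n_residues": len(scores),
        "mean": round(mean(scores), 2),
        "very_high_ge_90": sum(s >= 90 for s in scores),
        "confident_70_to_90": sum(70 <= s < 90 for s in scores),
        "low_50_to_70": sum(50 <= s < 70 for s in scores),
        "very_low_lt_50": sum(s < 50 for s in scores),
    }
```

Running `summarize_plddt(read_plddt(open(path).read()))` on the predicted structure gives the kind of summary quoted in the interpretation section (residue count, mean pLDDT, residues above 90).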

5. Explore the Structure (Where Could a Drug Bind?)
This is where biology intuition begins. When visualizing the structure, look for:
- Cavities or grooves ā potential binding pockets
- Surface depressions ā where molecules may fit
- Deep internal pockets ā strong candidates
Ask simple questions:
- Where could a small molecule sit?
- Is the surface flat or pocket-like?
6. Prepare Protein for Docking
We convert the protein structure into a docking-ready format (PDBQT).
This step adds:
- Hydrogen atoms
- Charge information
- Atom types
These are required for molecular simulation tools.
7. Prepare Ligand
We select a small molecule (ligand) from a public database like PubChem.
This molecule is also converted into PDBQT format for docking.
8. Run Docking (AutoDock Vina)
Docking simulates how a molecule might bind to the protein.
Steps involved:
- Define a search region (box) on the protein
- Place the ligand inside
- Explore possible binding poses
In this example, we use the center of the protein as the search region (a simple starting point).
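A minimal sketch of how that search box could be set up: compute the unweighted geometric center of the protein atoms and write it into an AutoDock Vina config file. The 20 Å box size, the file names, and the helper names are assumptions for illustration, not values taken from the project.

```python
def protein_center(pdb_text: str) -> tuple[float, float, float]:
    """Geometric center of all ATOM records (unweighted mean of x, y, z)."""
    coords = [
        (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        for line in pdb_text.splitlines()
        if line.startswith("ATOM")
    ]
    if not coords:
        raise ValueError("No ATOM records found")
    n = len(coords)
    return tuple(sum(axis) / n for axis in zip(*coords))


def vina_config(receptor, ligand, center, size=(20.0, 20.0, 20.0),
                exhaustiveness=8, out="docked.pdbqt") -> str:
    """Build the text of a Vina config file for one receptor/ligand pair."""
    cx, cy, cz = center
    sx, sy, sz = size  # box edge lengths in Angstrom (assumed, tune per target)
    lines = [
        f"receptor = {receptor}",
        f"ligand = {ligand}",
        f"center_x = {cx:.3f}",
        f"center_y = {cy:.3f}",
        f"center_z = {cz:.3f}",
        f"size_x = {sx:.3f}",
        f"size_y = {sy:.3f}",
        f"size_z = {sz:.3f}",
        f"exhaustiveness = {exhaustiveness}",
        f"out = {out}",
    ]
    return "\n".join(lines) + "\n"
```

Saving the returned text as, say, `config.txt` lets you run `vina --config config.txt`. Centering on the whole protein is a blunt choice; a box centered on a specific pocket is usually a better search region once one has been identified.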

9. Analyze Docking Scores
Docking produces:
- Multiple binding poses
- A score for each (in kcal/mol)
Lower (more negative) scores suggest stronger binding.
In this case: Best score ≈ -5.7 kcal/mol. This indicates a moderate, computationally plausible interaction.
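The per-pose scores can be pulled out of the Vina output file, which writes one `REMARK VINA RESULT:` line per pose. Here is a small sketch of that parsing step; the function names are illustrative.

```python
def parse_vina_scores(pdbqt_text: str) -> list[float]:
    """Return the affinity (kcal/mol) of every pose in a Vina output PDBQT."""
    scores = []
    for line in pdbqt_text.splitlines():
        if line.startswith("REMARK VINA RESULT:"):
            # Fields after the tag: affinity, RMSD lower bound, RMSD upper bound.
            fields = line.split(":", 1)[1].split()
            scores.append(float(fields[0]))
    return scores


def best_score(scores: list[float]) -> float:
    """The most negative score is the best-ranked pose."""
    if not scores:
        raise ValueError("No VINA RESULT lines found")
    return min(scores)
```

Vina normally lists poses from best to worst, so the first score and `best_score(...)` should agree; taking the minimum just avoids relying on that ordering.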
⚠️ Important: Docking scores are approximate and do not prove real-world binding.
10. Visualize Protein–Ligand Interaction
We then visualize:
- Protein structure
- Docked ligand
- Binding region
This helps us understand:
Where exactly the molecule is interacting with the protein
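Alongside the picture, the "where" can be made concrete by listing the protein residues that have any atom within a distance cutoff of the ligand. The sketch below uses a 4.5 Å cutoff, matching the interpretation later in this post; it relies only on the fixed-width coordinate columns shared by PDB and PDBQT files, and the function names are my own.

```python
from math import dist


def read_atoms(text: str, record_types=("ATOM", "HETATM")):
    """Yield (residue_name, residue_number, (x, y, z)) from PDB/PDBQT records."""
    for line in text.splitlines():
        if line.startswith(record_types):
            resname = line[17:20].strip()
            resnum = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            yield resname, resnum, xyz


def contact_residues(protein_text: str, ligand_text: str, cutoff: float = 4.5) -> list[str]:
    """Residues with at least one atom within `cutoff` Angstrom of any ligand atom."""
    ligand_xyz = [xyz for _, _, xyz in read_atoms(ligand_text)]
    close = set()
    for resname, resnum, xyz in read_atoms(protein_text, record_types=("ATOM",)):
        if (resnum, resname) in close:
            continue  # residue already counted
        if any(dist(xyz, lig) <= cutoff for lig in ligand_xyz):
            close.add((resnum, resname))
    # Sort by residue number and format as e.g. "ASP27".
    return [f"{resname}{resnum}" for resnum, resname in sorted(close)]
```

One caveat: a Vina output file contains several poses, so `ligand_text` should hold only the pose you want to analyze (for example, the records of the first `MODEL`), otherwise contacts from all poses are merged.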
11. AI-Assisted Scientific Interpretation
Finally, we use an LLM (OpenAI API) to generate a structured interpretation:
- Objective
- Model quality
- Docking results
- Interaction regions
- Limitations
- Next steps
This turns raw outputs into a readable research-style summary.
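One way to set this up is to pack the computed numbers into a single prompt that asks for exactly those sections. The sketch below only builds the prompt text; the wording, the function name, and the JSON layout are assumptions for illustration, and the actual prompt used in the project may differ.

```python
import json

# Section headings requested from the model, mirroring the list above.
REPORT_SECTIONS = [
    "Objective",
    "Model quality",
    "Docking results",
    "Interaction regions",
    "Limitations",
    "Next steps",
]


def build_interpretation_prompt(target: str, plddt_summary: dict,
                                best_score_kcal: float,
                                contact_residues: list[str]) -> str:
    """Assemble one prompt string from the pipeline's computed results."""
    facts = {
        "target": target,
        "plddt_summary": plddt_summary,
        "best_docking_score_kcal_per_mol": best_score_kcal,
        "residues_near_ligand": contact_residues,
    }
    headings = "\n".join(f"- {name}" for name in REPORT_SECTIONS)
    return (
        "You are assisting with an exploratory structure-based drug discovery analysis.\n"
        "Using only the computed results below, write a short report with these sections:\n"
        f"{headings}\n\n"
        "Treat docking scores as approximate and do not claim experimental binding.\n\n"
        f"Computed results (JSON):\n{json.dumps(facts, indent=2)}"
    )
```

The returned string can be sent as the user message to whichever chat model you use through the OpenAI API. Passing only computed values, and telling the model to stay within them, helps keep the report tied to what the pipeline actually measured.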
AI-Assisted Scientific Interpretation
Objective
The project aims to explore a mini-pipeline for AI-assisted structure-based drug discovery targeting E. coli dihydrofolate reductase (DHFR, folA). The approach involved predicting the 3D structure of E. coli DHFR using a ColabFold / AlphaFold2-style workflow, followed by molecular docking of a compound with AutoDock Vina to identify computationally plausible binding interactions.
Structural Model Quality
The predicted DHFR structure contains 159 residues with a high overall confidence indicated by a mean pLDDT score of 95.49. The majority of residues (146 out of 159) have pLDDT values above 90, representing very high confidence in local structural accuracy. Only 13 residues fall within the medium confidence range (70–90), and none below 70, suggesting a reliable model suitable for downstream docking analyses.
Docking Result Interpretation
The best docking score obtained from AutoDock Vina was -5.686 kcal/mol. While this score indicates a moderately favorable binding affinity, it should be regarded as an exploratory measure of ligand-target complementarity rather than a validated binding free energy. This interaction score supports prioritizing the compound for further computational and experimental investigation, recognizing the heuristics inherent to docking scoring functions.
Putative Interaction Zone
The docked ligand is predicted to interact with multiple residues within a 4.5 Å radius, including hydrophobic residues (ILE5, ILE14, ILE94), polar and charged residues (ASN18, ASP27, ARG98), and aromatic residues (PHE31, TYR100, HIS45). Glycine-rich loops (GLY95-97) and threonine side chains (THR46, THR123) also appear involved, which may contribute to binding through flexible loop regions or hydrogen bonding. This interaction cluster likely represents the putative ligand-binding site, consistent with the canonical substrate binding region of DHFR.
Scientific Limitations
- The AlphaFold2-derived structure is a prediction, and despite high confidence, may not capture dynamic conformations or alternate binding site states.
- Docking scores are simplified approximations and do not account for entropic effects, solvation dynamics, or induced fit conformational changes.
- Experimental validation is necessary to confirm binding affinity and functional inhibition.
- The ligand identity and chemical nature were not discussed, and off-target effects or bioavailability cannot be inferred.
- The analysis does not consider potential allosteric sites or competitive binding with endogenous substrates/cofactors.
Key Takeaways
- AI can predict protein structures with high confidence
- Molecular docking can simulate potential binding
- LLMs can help interpret results in a structured way
- All of this can now be driven from a single laptop, using free tools and notebook environments like Google Colab
Limitations
This is an exploratory, computational workflow:
- No experimental validation
- Protein flexibility is not modeled
- Docking is approximate
- Binding is not confirmed
This is hypothesis generation, not proof.
What's Next
I am now working on improving this pipeline further:
- Identifying more realistic binding pockets
- Using known inhibitors for comparison
- Refining docking accuracy
- Improving biological interpretation
Discover more from Debabrata Pruseth


