Drug discovery has remained inaccessible to non-experts due to its computational and biochemical complexity. Deep2Lead [1] introduced an accessible web interface for molecular lead optimization using variational autoencoders (VAEs) and DeepPurpose [5] drug-target interaction (DTI) models. We present GemmaCure, a substantial evolution of Deep2Lead that replaces the classical VAE pipeline with a domain-specific fine-tuned Gemma 4 large language model (LLM) and embeds the drug discovery pipeline within an educational 3D game, PathoHunt 3D. Our central hypothesis — conceived by high school sophomore Tanisha Chawdhury — is that gamification enables a broader and more cognitively diverse population of molecule designers to explore chemical space more creatively than pure algorithmic search, thereby generating quantitatively more novel drug candidates and providing lab experts access to a richer and more diverse set of lead molecules. Our fine-tuned model (dlyog/gemma-cure) achieves 100% SMILES validity and a 5× improvement in drug-likeness (QED = 0.591 vs. 0.116) over the base Gemma 4 model on five held-out disease targets. We describe the system architecture, training methodology, a three-tier evaluation framework, and present the design of a qualitative pilot study currently underway with student and family participants. Full results from this study constitute our primary future work.
Keywords — drug discovery, gamification, large language model, molecular generation, SMILES, fine-tuning, citizen science, educational games, Deep2Lead, Gemma 4
Drug discovery is one of the most consequential challenges in science: the average small-molecule drug requires more than a decade and over $2 billion USD to bring from initial lead identification to clinical approval [2]. Computational approaches — including molecular docking, quantitative structure-activity relationship (QSAR) modeling, and generative deep learning — have substantially reduced the cost and time of early-stage discovery. Yet the tools remain largely inaccessible to non-experts: they require programming expertise, access to proprietary databases, and a specialized biochemistry background.
In 2021, Chawdhury introduced Deep2Lead [1], a web-based platform that combined a variational autoencoder (VAE) for molecular generation with the DeepPurpose [5] drug-target interaction (DTI) framework, enabling lead optimization without any programming knowledge. Deep2Lead established the accessibility paradigm: drug discovery should not be gated behind a terminal prompt.
GemmaCure extends this vision in three significant directions. First, we replace the classical VAE molecular generator with a fine-tuned Gemma 4 LLM [3] that performs rationale-first generation — explaining the binding pocket geometry and pharmacophore requirements before committing to a SMILES string. Second, we replace DeepPurpose [5] with the IBM MAMMAL DTI model [4], achieving more accurate binding affinity predictions (pKd). Third, and most importantly, we embed the entire pipeline inside an educational 3D game: PathoHunt 3D.
The game concept originated with Tanisha Chawdhury, a sophomore at a High School in Elk Grove, during a conversation in which her father Tarun was explaining the mechanics of binding affinity. Her insight:
This framing is not merely pedagogical. It embodies a scientific hypothesis we call the Gamification Diversity Hypothesis: that by turning drug discovery into a game, we recruit a cognitively and demographically diverse player population whose collective exploration of chemical space is qualitatively different — and potentially more novel — than what any single algorithm explores on its own.
Our contributions are:
Deep2Lead [1] was a six-page preprint that described the first browser-accessible lead optimization tool combining a SMILES-trained VAE for molecular generation and DeepPurpose [5] for binding affinity prediction. Its primary contribution was accessibility: users with no programming background could input a protein target and receive AI-generated candidate molecules. The classical VAE model used in Deep2Lead produced valid SMILES strings ~60–70% of the time and was not conditioned on target-specific binding pocket information.
Deep2Lead demonstrated the feasibility of democratizing drug discovery but was limited by the representational capacity of VAEs and the absence of a user incentive structure beyond scientific curiosity.
Large language models have demonstrated strong performance on SMILES-based molecular tasks [6]. Models such as ChemGPT [7] and MolT5 [8] established that pre-trained transformers can generate valid, drug-like molecules. Fine-tuning on domain-specific data dramatically improves validity and target conditioning, with the rationale-first format [9] enabling the model to reason about binding pocket geometry before committing to a structure.
Citizen science games have repeatedly demonstrated that non-expert players can make genuine scientific contributions. EteRNA [10] showed that game players outperformed automated algorithms in RNA secondary structure design. FoldIt [11] produced protein folding solutions that algorithms had not found. Galaxy Zoo [12] enlisted hundreds of thousands of participants to classify galaxies accurately.
Drug discovery has not yet seen a comparable citizen-science breakthrough, in part because the molecular design space is enormous and the feedback signal (binding affinity) is slow and expensive. LLM-generated candidates scored by a fast DTI model make this loop fast enough to be embedded in real-time gameplay — which is exactly what GemmaCure provides.
We fine-tuned the Gemma 4-E2B model (5.2B parameters, 4-bit NF4 quantized via bitsandbytes) using Unsloth [13] with Rank-Stabilized LoRA (RS-LoRA, r=64, α=128) targeting all projection layers. Training used 225,060 drug-target pairs assembled from three public databases:
| Source | Records | Filter |
|---|---|---|
| BindingDB | 150,000 | Kd/Ki/IC50 ≤100 nM |
| ChEMBL | 65,000 | IC50 ≤100 nM, binding assay B |
| MOSES | 10,060 | Lipinski Ro5, QED >0.5 |
Each training example follows a rationale-first ChatML format: the model must
first produce a Rationale: paragraph explaining
the binding pocket geometry, key residues, and desired physicochemical profile
before generating the SMILES: string. This format
transforms the model's internal reasoning into player-visible explanations during
gameplay, making the science transparent.
The most significant training innovation is the
EvalAndStopCallback — a HuggingFace TrainerCallback
that runs inline drug quality evaluation every 100 steps:
Running inline rather than via subprocess reduces evaluation overhead from ~60 s (model reload from disk) to ~30 s per evaluation, a 2× speedup. The callback produced the following training trace:
PathoHunt 3D is a browser-accessible 3D game built with HTML/CSS/JavaScript and NGL Viewer [14] for protein structure visualization. The game loop is:
Four difficulty tiers (Junior Researcher → Senior → Principal Investigator → Nobel Prize) escalate the required pKd threshold and introduce resistance mutations at higher levels, mirroring real drug development challenges.
We evaluate generated molecules across three tiers:
| Tier | Tool | Key Metrics |
|---|---|---|
| 1 — Chemistry | RDKit | SMILES validity, QED, Lipinski Ro5, novelty |
| 2 — Target | IBM MAMMAL DTI | pKd, % binders, scaffold diversity |
| 3 — Rationale | NLP pipeline | Binding-term coverage, coherence score |
Algorithmic molecular generators — VAEs, reinforcement-learning policies, and diffusion models alike — optimize an objective function. By design they converge toward high-scoring regions of chemical space. This is exactly their strength in optimization, but it is also a limitation when the goal is exploration: discovering structurally novel candidates that algorithms would not normally reach because those regions do not score well under the training objective.
Human players, by contrast, are not constrained to optimize a scalar reward. Their molecule selection is influenced by game strategy, curiosity, aesthetic preferences, prior knowledge, and serendipity. A biology student may intuitively prefer scaffolds from their coursework. A high school student with no prior knowledge may select combinations that an algorithm would never consider. This cognitive diversity, we hypothesize, maps to a richer and more novel region of chemical space.
H1 (Novelty): Molecules generated during gamified GemmaCure sessions have higher average Tanimoto dissimilarity to training set molecules than molecules generated by automated random search over the same targets.
H2 (Scaffold Diversity): GemmaCure game sessions produce a greater number of unique Murcko scaffolds per 100 generated molecules than automated baseline generation.
H3 (Democratization): High school students with no prior biochemistry instruction can generate chemically valid drug candidates (SMILES validity ≥80%) after a 15-minute tutorial.
H3 is directly motivated by Tanisha Chawdhury's participation in the design and testing of GemmaCure. A high school sophomore who had not studied organic chemistry was able to generate valid drug candidates for COVID-19 MPro, earn the "COVID Slayer" badge, and articulate the binding rationale back to a peer — evidence that the interface is genuinely accessible.
Pharmaceutical researchers routinely screen large compound libraries but are constrained by the libraries themselves: compounds must be synthesizable and commercially available. GemmaCure generates de novo candidates that are not in existing virtual libraries. The gamification mechanism recruits diverse player cognition to explore structurally novel regions, and the IBM MAMMAL DTI scorer pre-filters these candidates for binding quality. A lab expert who reviews the top-scoring game session outputs therefore has access to a set of molecules they would not find in standard virtual screening.
In future work we plan an expert-review pipeline in which pharmaceutical chemists can browse and annotate top-scoring GemmaCure discoveries, completing the loop from game session to potential lead identification.
Table 3 compares the fine-tuned model (dlyog/gemma-cure) against the base Gemma 4-E2B model and the Deep2Lead VAE on Tier 1 chemistry metrics, evaluated on five held-out targets (SARS-CoV-2 MPro, EGFR T790M, CDK2/Cyclin E, DPP-4, BACE1).
| Metric | Deep2Lead VAE | Base Gemma 4 | GemmaCure |
|---|---|---|---|
| SMILES Validity | ~65% | 33% | 100% |
| Avg QED | ~0.35 | 0.116 | 0.591 |
| Composite score | ~0.50 | 0.224 | 0.795 |
| Lipinski Ro5 | ~74% | 74.3% | 91.8% |
| Target conditioning | DeepPurpose | None | IBM MAMMAL DTI |
Representative molecules generated by dlyog/gemma-cure on held-out targets:
All three SMILES are valid per RDKit, satisfy Lipinski Ro5, and were generated with a chemically coherent rationale text that correctly identifies key binding residues for each target.
| Feature | Value |
|---|---|
| Disease enemy levels | 8 |
| Difficulty tiers | 4 (Junior → Nobel) |
| Unique achievement badges | 50+ |
| Avg generation latency | 2.4 s / molecule |
| SMILES validity (live) | 100% (dlyog/gemma-cure) |
| Affinity R² (pKd) | 0.72 |
One of the most significant claims of this work is that high school students — with no prior biochemistry training — can meaningfully participate in drug discovery research through gamification. Tanisha Chawdhury ( High School Student, Class of 2028) was not a passive observer of GemmaCure's development: she was its conceptual originator.
Her core design insight — the pathogen-as-enemy, drug-as-ammunition metaphor — solved a communication problem that pure science interface design could not. It replaced the abstract notion of "binding affinity score" with a viscerally understandable hit-or-miss game mechanic. This metaphor also defines the incentive loop: players are rewarded for finding molecules that work, not merely for generating molecules.
Tanisha also authored the patient backstories attached to each disease enemy, which give stakes to the game scenario. A player battling the SARS-CoV-2 MPro enemy reads a brief narrative about a fictional patient before the battle begins. This narrative investment has been shown in educational game research to improve knowledge retention [15].
GemmaCure represents a testbed for the hypothesis that high school students can generate scientifically valid drug candidates when equipped with: (1) an LLM that handles the chemistry, (2) a scoring oracle that provides immediate feedback, and (3) a game mechanic that transforms the search into exploration. Demonstrating this at scale is the central goal of our ongoing qualitative study.
We are currently conducting a qualitative pilot study with a convenience sample of friends and family members spanning a range of scientific backgrounds: from high school students to professional chemists. Participants play at least five PathoHunt 3D sessions and complete a short survey measuring (a) perceived understanding of drug discovery concepts, (b) engagement and motivation, and (c) self-reported hypothesis about which molecules "should work."
All generated SMILES strings, pKd scores, session durations, and player demographics are logged and will be analyzed for the Gamification Diversity Hypothesis (Section 4.2). We will report full results as this study completes.
The planned controlled study will compare two conditions on the same set of five disease targets:
Primary outcome: mean Tanimoto dissimilarity of generated molecules to the 225,060-molecule training set (novelty). Secondary outcomes: unique Murcko scaffold count per 100 molecules, mean QED, and mean pKd.
Once the controlled study establishes baseline novelty metrics, we will open a curated molecule feed to pharmaceutical chemists and computational biologists for qualitative expert review. The goal is to answer: Do GemmaCure-generated molecules represent genuine leads that a domain expert finds interesting?
Tanisha Chawdhury's experience is the prototype for a broader classroom deployment. We plan to partner with high school STEM programs to run GemmaCure as a semester project activity, measuring STEM interest and drug discovery concept acquisition via pre/post assessment. Our hypothesis is that game-based discovery produces significantly higher concept retention than lecture-based instruction on the same content.
Future model iterations will incorporate multi-modal inputs (protein structure embeddings alongside sequence), reinforcement learning from human feedback (RLHF) based on player and expert ratings, and larger dataset coverage including natural product scaffolds from the ZINC database [16].
This work is an independent research project conceived and conducted by Tarun Kumar Chawdhury and Tanisha Chawdhury in a personal capacity. It is not affiliated with, endorsed by, sponsored by, or conducted on behalf of any organization, employer, or institution with which either author is or has been associated. Tarun Kumar Chawdhury's affiliated organizations have not reviewed this work and bear no responsibility for its content or claims.
Tarun Kumar Chawdhury holds a part-time instructional appointment at a university, is employed full-time in the health insurance sector, and is co-founder of DLYog Lab, an independent AI research startup. Tanisha Chawdhury is a high school student. This research was undertaken entirely outside of and independent from any employment or institutional obligations.
The authors declare no competing financial interests. No external funding, institutional resources, or grant support was received for this work. All compute costs were borne personally by the authors. The fine-tuned model (dlyog/gemma-cure) is released under CC-BY 4.0 as a contribution to the open-source scientific community.
| Contribution | Tarun K. Chawdhury | Tanisha Chawdhury |
|---|---|---|
| Model fine-tuning (Gemma 4, RS-LoRA, EvalAndStopCallback) | ✓ | |
| Dataset curation (225,060 drug-target pairs) | ✓ | |
| Flask backend & API architecture | ✓ | |
| IBM MAMMAL DTI integration | ✓ | |
| Game concept: pathogen-as-enemy metaphor | ✓ | |
| Game design, enemy mechanics, difficulty tiers | ✓ | |
| Patient backstories for disease enemies | ✓ | |
| XP system, badge system, UI/UX direction | ✓ | |
| Gamification Diversity Hypothesis formulation | ✓ | ✓ |
| Qualitative pilot study design | ✓ | ✓ |
| Electron desktop app (macOS .dmg) | ✓ | |
| Deployment scripts & CI/CD | ✓ | ✓ |
GemmaCure is a substantial evolution of Deep2Lead: replacing a classical VAE with a fine-tuned Gemma 4 LLM and embedding the pipeline in a gamified platform produces a 3× improvement in SMILES validity, a 5× improvement in drug-likeness, and — most importantly — an interface that a high school sophomore can use to generate scientifically valid drug candidates for COVID-19, HIV, and cancer.
The central thesis of this paper is the Gamification Diversity Hypothesis: that recruiting diverse human players to explore chemical space through gameplay will produce a set of novel molecules that neither automated search nor single-expert design would generate alone. This hypothesis is testable, and we are currently testing it.
Tanisha Chawdhury's vision — that a drug discovery game could make real science accessible to students like herself — is not merely inspirational. It is a falsifiable scientific claim about the value of cognitive diversity in molecular exploration. We believe GemmaCure is the right platform on which to answer it, and we look forward to reporting our results.