About the results reproduction #5

MoonLBH · 2025-01-09T12:39:18Z

1.I understand that the script ‘paper_experiments/run_inference_gdb_conditional_x4.py’ directly corresponds to the conditional generation (P(x1|x4)), which corresponds to the two box plots on the upper right of Figure 3. Is this correct?
By running this script, I will generate 20 molecular analogs for 100 template molecules, retain the valid molecules after filtering, and then calculate the 3D pharmacophore similarity score according to the demonstration calculation method in shepherd-score.

Is the process I mentioned correct? If so, are the parameters in run_inference_gdb_conditional_x4.py consistent with those in your test? Do I need to change any parameters? The overall distribution and median of the boxplot I got are worse than those in your paper, and the median is about 0.1 lower. I wonder if I have set it incorrectly.

Code Supplement：

The first step is to run ‘paper_experiments/run_inference_gdb_conditional_x4.py’

The second step I will extract the molecules stored in the pickle

`for i in trange(100):
with open(f'samples/GDB_conditional/x4/samples_{i}.pickle', 'rb') as f:
molblocks_and_charges = pickle.load(f)

from rdkit.Chem import SDWriter

sdf_writer = SDWriter(f'samples/GDB_conditional/x4/sdfs/sample{i}.sdf')


for b,sample_dict in enumerate(molblocks_and_charges[7]):
    
    xyz = '' 
    
    x_ = sample_dict['x1']['atoms']
    pos_ = sample_dict['x1']['positions']
    
    xyz += f'{len(x_)}\n{b+1}\n'
    for a in range(len(x_)):
        atomic_number_ = int(x_[a])
        position_ = pos_[a]
        
        xyz+= f'{rdkit.Chem.Atom(atomic_number_).GetSymbol()} {str(position_[0].round(3))} {str(position_[1].round(3))} {str(position_[2].round(3))}\n'
    xyz+= '\n'
    
    try:
        mol_ = rdkit.Chem.MolFromXYZBlock(xyz)
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue
    
    try:
        for c in [0, 1, -1, 2, -2]:
            mol__ = deepcopy(mol_)
            try:
                rdkit.Chem.rdDetermineBonds.DetermineBonds(mol__, charge = c, embedChiral = True)
            except:
                continue
            if mol__ is not None:
                print(c)
                break 
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue
    
    mol_ = mol__
    try:
        assert sum([a.GetNumRadicalElectrons() for a in mol_.GetAtoms()]) == 0, 'has radical electrons'
        mol_.UpdatePropertyCache()
        rdkit.Chem.GetSymmSSSR(mol_)
        
    except Exception as e:
        mol_ = None
        print(f'invalid molecule: {e}')
        continue

    try:

        sdf_writer.write(mol_)
    except Exception as e:
        print(f'Failed to write molecule to SDF: {e}')
        continue`

Finally, caculate the 3D pharmacophore similarity score

`with open(f'conformers/gdb/molblock_charges_9_test100.pkl', 'rb') as f:
molblocks_and_charges = pickle.load(f)
record = {f'{i}':[] for i in name_list}
for idx in trange(70,100):
ref_mol_rdkit = rdkit.Chem.MolFromMolBlock(molblocks_and_charges[idx][0], removeHs = False)
ref_mol, _, ref_charges = optimize_conformer_with_xtb(ref_mol_rdkit)
fit_mol_rdkits = Chem.SDMolSupplier(f"/data/lbh/Code/shepherd/samples/GDB_conditional/x4/sdfs/sample{idx}.sdf", removeHs=False)
for i,fit_mol_rdkit in enumerate(fit_mol_rdkits):
# Local relaxation with xTB
# ref_mol, _, ref_charges = optimize_conformer_with_xtb(ref_mol_rdkit)
try:
fit_mol, _, fit_charges = optimize_conformer_with_xtb(fit_mol_rdkit)

        # Extract interaction profiles
        ref_molec = Molecule(ref_mol,
                            num_surf_points=200,
                            partial_charges=ref_charges,
                            pharm_multi_vector=False)
        fit_molec = Molecule(fit_mol,
                            num_surf_points=200,
                            partial_charges=fit_charges,
                            pharm_multi_vector=False)

        # Centers the two molecules' COM's to the origin
        mp = MoleculePair(ref_molec, fit_molec, num_surf_points=200, do_center=True)

        # Compute the similarity score for each interaction profile
        shape_score = mp.score_with_surf(ALPHA(mp.num_surf_points))
        esp_score = mp.score_with_esp(ALPHA(mp.num_surf_points), lam=0.3)
        pharm_score = mp.score_with_pharm()
        record[f'template{idx}'].append(pharm_score)
    except:
        print(i)
        continue`

The text was updated successfully, but these errors were encountered:

MoonLBH · 2025-01-09T12:43:01Z

The final calculation code is like this. The previous upload was a bit wrong.

with open(f'conformers/gdb/molblock_charges_9_test100.pkl', 'rb') as f:
    molblocks_and_charges = pickle.load(f) 
record = {f'template{i}':[] for i in range(100)}
for idx in trange(100):
    ref_mol_rdkit = rdkit.Chem.MolFromMolBlock(molblocks_and_charges[idx][0], removeHs = False)
    ref_mol, _, ref_charges = optimize_conformer_with_xtb(ref_mol_rdkit)
    fit_mol_rdkits = Chem.SDMolSupplier(f"samples/GDB_conditional/x4/sdfs/sample{idx}.sdf", removeHs=False)
    for i,fit_mol_rdkit in enumerate(fit_mol_rdkits):
        # Local relaxation with xTB
        # ref_mol, _, ref_charges = optimize_conformer_with_xtb(ref_mol_rdkit)
        try:
            fit_mol, _, fit_charges = optimize_conformer_with_xtb(fit_mol_rdkit)
        

            # Extract interaction profiles
            ref_molec = Molecule(ref_mol,
                                num_surf_points=200,
                                partial_charges=ref_charges,
                                pharm_multi_vector=False)
            fit_molec = Molecule(fit_mol,
                                num_surf_points=200,
                                partial_charges=fit_charges,
                                pharm_multi_vector=False)

            # Centers the two molecules' COM's to the origin
            mp = MoleculePair(ref_molec, fit_molec, num_surf_points=200, do_center=True)

            # Compute the similarity score for each interaction profile
            shape_score = mp.score_with_surf(ALPHA(mp.num_surf_points))
            esp_score = mp.score_with_esp(ALPHA(mp.num_surf_points), lam=0.3)
            pharm_score = mp.score_with_pharm()
            record[f'template{idx}'].append(pharm_score)
        except:
            print(i)
            continue

kentoabeywardane · 2025-01-09T15:36:30Z

Hi, thanks for reaching out.

I suggest using or referencing our evaluation pipeline found in the shepherd-score repo. For the task of x4 conditional evaluation, we used ConditionalEvalPipeline with the argument condition='pharm' and the results would be found in the variable ConditionalEvalPipeline.sims_pharm_target_relax_optimal

To address your question, I see a few inconsistencies with our pipeline and your implementation which would lead to lower scores:

The supplied ref_mol found in molblock_charges_9_test100.pkl is already xTB optimized, so the additional optimization is unnecessary
We use a stricter conformer extraction found in shepherd-score.evaluations.evaluate.ConfEval
For scoring...
1. We initially align the xTB-(locally)-relaxed fit molecule to the generated one via RMSD (rdkit's AlignMol), if the graphs are consistent
2. When initializing MoleculePair, ensure that do_center=False so that the generated conformers retain their original positions, and do not have their COM's aligned.
3. Then, we do a local pharmacophore alignment with mp.align_with_pharm where num_repeats=1 and trans_init=False
  - This is to ensure that we do a local alignment based on the original alignment (i.e., don't do a global alignment search)

Let me know if this helps,
Kento

MoonLBH · 2025-01-10T10:09:22Z

Thank you for your reply. I reproduced the box plots result on the upper right of Figure 3 with your guidance~

I have two more questions.
First, when I use run_inference_gdb_conditional_x4.py, the template molecule will use the molecular structure and charges information. How can I get the charges information when generating analogs of my own molecules?

Chem.rdPartialCharges.ComputeGasteigerCharges(mol)
for atom in mol.GetAtoms():
    charge = float(atom.GetProp("_GasteigerCharge"))

Are the partial charges obtained in this way consistent with those used by shepherd?

In addition, shepherd seems to use the complete pharmacophore information of the template molecule in the P(x1 | x4) scenario? Can we only use part of the pharmacophore information of the molecule to generate analogs? If so, can we calculate the matches of part of the pharmacophore when calculating the pharmacophore 3D similarity score?

Looking forward to your reply.

kentoabeywardane · 2025-01-10T16:55:53Z

How can I get the charges information when generating analogs of my own molecules?

ShEPhERD was trained on xTB-optimized conformers and xTB-computed charges, so we recommend using xTB to generate partial charges. Given a conformer, this can be done with shepherd_score.conformer_generation.charges_from_single_point_conformer_with_xtb(). Alternatively, this is also found in the shepherd_score_utils folder in this repo. Separately, we found that Gasteiger and MMFF94 charges are low-quality, and do not recommend these methods for this model.

Can we only use part of the pharmacophore information of the molecule to generate analogs

Yes, but expect the subselection of the pharmacophores to be a manual task. Assuming that you already have a set of pharmacophore features (say n of them) and want to generate molecules that contain those pharmacophores as a subset, you could set n4 ≥ n and only inpaint the n pharmacophores that you want. This would require you to adjust the mask of the inpainting such that only the specified n pharmacophores are regenerated while the remaining n4-n pharmacophores are free to diffuse. Please let me know if you need some clarity on this strategy.

Can we calculate the matches of part of the pharmacophore when calculating the pharmacophore 3D similarity score

You can use a Tversky similarity instead of Tanimoto similarity for this case. This is implemented in pharmacophore scoring and alignment functions. For example, to compute the asymmetric score: MoleculePair.score_with_pharm(similarity='tversky_ref')

--
On a separate note, while the GDB x4 model is more efficient, if you would like to generate molecules that are more "drug-like" rather than "GDB-like" I suggest trying out the MOSES model by only inpainting the x4.

MoonLBH · 2025-01-13T15:11:33Z

是的，但药效团的子选择需要手动完成。假设您已经有一组药效团特征（比如说 n 个）并且想要生成包含这些药效团作为子集的分子，您可以设置 n4 ≥ n 并仅修复您想要的 n 个药效团。这将要求您调整修复的掩码，以便仅再生指定的 n 个药效团，而其余的 n4-n 个药效团可以自由扩散。如果您需要了解此策略，请告诉我。

Hello, I am currently doing a baseline comparison of pharmacophore-based molecular generation models. I'm interested in generating molecules based on partial pharmacophore information (only several pharmacophore positions and types are provided). I tried the method you mentioned, but I am not sure if my settings are correct. I am also worried that my incorrect use will affect the performance of Shepherd and cause an unfair comparison. Could you please provide an example or script? If possible, please send it to [email protected]. I would be deeply grateful.

My usage is similar to the following, assuming that the 0th, 3rd, 5th, and 9th pharmacophores identified from mol are what I need

pharm_types, pharm_pos, pharm_direction = get_pharmacophores(
        mol,
        multi_vector = params['dataset']['x4']['multivectors'],
        check_access = params['dataset']['x4']['check_accessibility'],
    )
pharm_types, pharm_pos, pharm_direction = pharm_types[0,3,5,9], pharm_pos[0,3,5,9], pharm_direction[prepare[0,3,5,9]
batch_size = 20
    
    n_atoms_list = [N_atoms] * 2
    
    num_batches = len(n_atoms_list)
    
    generated_samples = []
    for n in range(num_batches):
        
        n_atoms = int(n_atoms_list[n])
        
        generated_samples_batch = inference_sample(
            model_pl,
            batch_size = batch_size,
            
            N_x1 = n_atoms,
            N_x4 = len(pharm_types), # must equal pharm_pos.shape[0] if inpainting
            
            unconditional = False,
            
            prior_noise_scale = 1.0,
            denoising_noise_scale = 1.0,
            
            inject_noise_at_ts = [],
            inject_noise_scales = [],    
            
            harmonize = False,
            harmonize_ts = [],
            harmonize_jumps = [],
            
            # all the below options are only relevant if unconditional is False
            
            inpaint_x2_pos = False,
            
            inpaint_x3_pos = False,
            inpaint_x3_x = False,
            
            inpaint_x4_pos = True,
            inpaint_x4_direction = True,
            inpaint_x4_type = True,
            
            stop_inpainting_at_time_x2 = 0.0,
            add_noise_to_inpainted_x2_pos = 0.0,
            
            stop_inpainting_at_time_x3 = 0.0,
            add_noise_to_inpainted_x3_pos = 0.0,
            add_noise_to_inpainted_x3_x = 0.0,
            
            stop_inpainting_at_time_x4 = 0.0,
            add_noise_to_inpainted_x4_pos = 0.0,
            add_noise_to_inpainted_x4_direction = 0.0,
            add_noise_to_inpainted_x4_type = 0.0,
            
            # these are the inpainting targets
            center_of_mass = np.zeros(3), # center of mass of x1; already centered to zero above
            surface = surface,
            electrostatics = electrostatics,
            pharm_types = pharm_types,
            pharm_pos = pharm_pos,
            pharm_direction = pharm_direction,
            
        )

kentoabeywardane · 2025-01-14T00:34:19Z

Hi, I've just pushed changes to the inference.py script in #6 to enable partial pharmacophore inpainting simply by setting N_x4 > len(pharm_types). In other words, it will inpaint the provided pharmacophores (i.e., your four pharmacophores in the example) while diffusing N_x4 - len(pharm_types) other pharmacophore features.

Here are some corrections or important notes in the code that you sent:

pharm_types, pharm_pos, pharm_direction = pharm_types[0,3,5,9], pharm_pos[0,3,5,9], pharm_direction[0,3,5,9]

generated_samples_batch = inference_sample(
            ...
            N_x4 = desired_num_pharms, # This has been updated so that it can handle n4 >= len(pharm_types)
            ...
            # these are the inpainting targets
            center_of_mass = np.zeros(3), # Assuming you have already centered your target molecule's COM
            ...
            pharm_types = pharm_types,
            pharm_pos = pharm_pos,
            pharm_direction = pharm_direction,
        )

In particular, you should scan through values for desired_num_pharms since the max number of pharmacophores must be defined (this is true for all diffusion models).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

About the results reproduction #5

About the results reproduction #5

MoonLBH commented Jan 9, 2025

MoonLBH commented Jan 9, 2025

kentoabeywardane commented Jan 9, 2025

MoonLBH commented Jan 10, 2025

kentoabeywardane commented Jan 10, 2025

MoonLBH commented Jan 13, 2025

kentoabeywardane commented Jan 14, 2025

About the results reproduction #5

About the results reproduction #5

Comments

MoonLBH commented Jan 9, 2025

MoonLBH commented Jan 9, 2025

kentoabeywardane commented Jan 9, 2025

MoonLBH commented Jan 10, 2025

kentoabeywardane commented Jan 10, 2025

MoonLBH commented Jan 13, 2025

kentoabeywardane commented Jan 14, 2025