Wrapping up updates, May 6th 2020 · Protein Features Documentation

Today is my last day formally working with the Silver Lab, so here's an update on where things stand.

Protein Modelling

We are still on track to have all ~29k I-TASSER human models completed by the end of May.

You can find the models at /n/groups/drad/all_pdb_models. There are also log files in the progress_logs subdirectory of that folder.

Right now, a cronjob runs the scripts that generate those progress logs, and then they are rsynced from /home/js741 to /n/groups/drad.

I also updated the "Human Protein Modelling Status" Excel file.

(Quick note on how I update that file: I transfer the progress logs into DropBox by running rsync -av js741@transfer.rc.hms.harvard.edu:/n/groups/drad/all_pdb_models/progress_logs . from the DropBox folder; the trailing dot is the destination. To fill out the "progress_over_time" tab, I run head -1 progress_logs/updates/* and fill the tab in manually with that info. The main tab comes verbatim from progress_logs/old/05_06_2020/total_proteins_count; just copy and paste. The all_fragments tab comes basically verbatim from progress_logs/old/05_06_2020/total_fragments_count. All numbers there should be 0 or 1, but there are always a handful of 2s (13 this time, of ~60k). I'm not sure why, but I change those to 1s.)
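The manual cleanup of the all_fragments tab could be scripted. The following is a minimal sketch, assuming each line of total_fragments_count holds a fragment name and a count separated by whitespace; the real file layout may differ, and the function name is hypothetical. It also reports which fragments had a count above 1, since the cause of those values is still unknown.

```python
def clamp_fragment_counts(lines):
    """Return (cleaned, flagged).

    cleaned: list of (name, count) with any count above 1 reduced to 1.
    flagged: names whose original count was above 1, for later investigation.
    Assumes "name count" per line; other lines are skipped.
    """
    cleaned, flagged = [], []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # blank or unexpected line
        name, raw = parts
        try:
            count = int(raw)
        except ValueError:
            continue  # e.g. a header row
        if count > 1:
            flagged.append(name)
        cleaned.append((name, min(count, 1)))
    return cleaned, flagged
```

Keeping the flagged list around makes it easier to check later whether the same fragments show up with a 2 every time.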

Also note that very few of the human fragments have been run through QUARK, but all of the scripts are already generated and ready to go in /n/groups/drad/QUARKmod/BatchJobs/Final_Push. You may want to modify some of those scripts before running though, since they use relative paths.

Predicting Carbonylation Status of Human Proteins

Unfortunately, I was not able to consolidate the findsitemetal and uniprot metal features before the end of today.

I was also not able to get through running the Chimera feature extraction on all related fragments. That pipeline goes through ~400 fragments per instance per 24 hours, and there were a little over 2 thousand structures that needed to be run.
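As a rough back-of-the-envelope check on those numbers, the sketch below estimates the wall-clock time for a batch. It uses 2,000 as a round stand-in for "a little over 2 thousand" and assumes throughput scales linearly with the number of instances, which is an idealization; the function name is hypothetical.

```python
def estimate_days(n_structures, per_instance_per_day, n_instances=1):
    """Estimated days to process n_structures, assuming linear scaling."""
    if per_instance_per_day <= 0 or n_instances <= 0:
        raise ValueError("rates and instance counts must be positive")
    return n_structures / (per_instance_per_day * n_instances)


print(estimate_days(2000, 400))     # 5.0 days on a single instance
print(estimate_days(2000, 400, 4))  # 1.25 days on four instances
```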

What follows is a quick summary of where I left this process.

Finishing Chimera runs

I started Chimera runs without SPPIDER, DISOPRED, or FINDSITE features (see below). So, they are all set up and ready to go.

You can find the working folder at /Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/Chimera_Features/Working_Space/May_05_2020. You can see in the output/individual folder that I got through running 245 of the fragments before running out of time.

Running the remaining structures is as simple as running bash "/Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/Chimera_Features/Working_Space/May_05_2020/run_ChimeraFeatureExtraction.sh" --pdbdir "/Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/Chimera_Features/Working_Space/May_05_2020/all_structures/" --outdir "/Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/Chimera_Features/Working_Space/May_05_2020/output". (Substitute your personal DropBox path, of course, or just run from the folder with local paths. The paths need quoting because "Dropbox (HMS)" contains a space and parentheses.) Personally, I would split the all_structures folder up into 4 equal parts and run 4 instances of the Chimera pipeline on the Dell computer in the lab. That method seems to process ~1,500 structures/24h, so it would get through these in a bit over a day.
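Splitting the folder into equal parts can be done by hand, but a small script keeps the parts balanced. This is a minimal sketch, not part of the existing pipeline: the function name and the part_1 ... part_N folder names are hypothetical, and it copies files rather than moving them so the original all_structures folder stays intact.

```python
import shutil
from pathlib import Path


def split_into_parts(src_dir, dest_root, n_parts=4):
    """Copy the files in src_dir into n_parts subfolders of dest_root, round-robin."""
    files = sorted(p for p in Path(src_dir).iterdir() if p.is_file())
    part_dirs = []
    for i in range(n_parts):
        part = Path(dest_root) / f"part_{i + 1}"
        part.mkdir(parents=True, exist_ok=True)
        part_dirs.append(part)
    for idx, src_file in enumerate(files):
        # Round-robin assignment keeps part sizes within one file of each other.
        shutil.copy2(src_file, part_dirs[idx % n_parts] / src_file.name)
    return part_dirs
```

Each resulting folder could then be passed as --pdbdir to its own instance of run_ChimeraFeatureExtraction.sh. Whether the instances can safely share a single --outdir is not something these notes confirm, so separate output folders are the cautious choice.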

DISOPRED

I've run DISOPRED on all of the carbonylation-related UniProt IDs. But, to make a DISOPRED input file for Chimera, you need to have DISOPRED results specific to each structure. That means we need Sequence-Structure mapping: DISOPRED raw output is in Site by Sequence space, while Chimera input needs to be in Site by Structure space.
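To illustrate what moving from Site by Sequence space to Site by Structure space means, here is a minimal sketch. It assumes the per-residue values and the mapping are already loaded as plain dictionaries; the actual DISOPRED output and mapping file formats are not described in these notes, and the function name is hypothetical.

```python
def to_structure_space(seq_values, seq_to_struct):
    """Re-index per-residue values from sequence positions to structure residue numbers.

    seq_values:    dict of sequence position -> value (e.g. a disorder score)
    seq_to_struct: dict of sequence position -> structure residue number
    Sequence positions with no counterpart in the structure are dropped.
    """
    return {
        seq_to_struct[pos]: value
        for pos, value in seq_values.items()
        if pos in seq_to_struct
    }


# Hypothetical example: positions 2 and 3 are resolved in the structure
# as residues 10 and 11; position 1 is not resolved.
scores = {1: 0.9, 2: 0.1, 3: 0.5}
mapping = {2: 10, 3: 11}
print(to_structure_space(scores, mapping))  # {10: 0.1, 11: 0.5}
```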

So:

Sequence-Structure Mapping

I started running the sequence-structure mapping on all ~2K carbonylation-related structures. My progress is located at /home/julian/Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/Sequence_Structure_Mapping/05_05_2020. You can see in output_maps/individual that I got through around half of them.

I also noticed that, as-is, my "Sequence-Structure Mapping" pipeline only outputs alignment maps after aligning everything. Rookie mistake. So, I messily pasted together /home/julian/Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/Sequence_Structure_Mapping/scripts/ssm_core_indiv.py (modifying the original ssm_core.py) so that it would spit out mappings as it went. But that's probably why the runtime is so much longer this time.
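The general pattern of writing each result as soon as it is ready, and skipping anything already written on a re-run, can be sketched as follows. This is not how ssm_core_indiv.py is implemented; the function name, the .map file extension, and the compute_map callback are all hypothetical.

```python
from pathlib import Path


def run_incrementally(structure_ids, compute_map, out_dir):
    """Write one output file per structure as it finishes; skip ones already done.

    compute_map is a callable that takes a structure ID and returns the text to save.
    Returns the list of IDs processed in this run.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    processed = []
    for sid in structure_ids:
        target = out / f"{sid}.map"
        if target.exists():
            continue  # finished in an earlier run
        text = compute_map(sid)
        tmp = out / f"{sid}.tmp"
        tmp.write_text(text)
        # Rename only after the write completes, so an interrupted run
        # never leaves a partial file that looks finished.
        tmp.replace(target)
        processed.append(sid)
    return processed
```

With this shape, restarting after an interruption only redoes the structure that was in progress, and the per-structure cost is one extra small file write.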

findsitemetal

I've run findsitemetal on the 371 fragments that I identified in the 05_04_2020 notes (and in the Human CSdata Excel spreadsheet), but not on all ~2K protein-oxidation-related structures. So, those still need to be run following the instructions in these docs.

The raw output of the ones that I have already run is located at /n/groups/drad/julian/findsite/output.

uniprot metal, and integrating findsite and uniprot

The big hiccup here is that merging findsitemetal (in Site by Structure space) and uniprot metal (in Site by Sequence space) would require the Sequence-Structure mapping, which I didn't have time to finish running.

SPPIDER

I hadn't run all ~2k fragments through SPPIDER, so I started submitting those last night. The submissions then hit a snag and I restarted them around 9AM this morning. I haven't received all of the output files via email yet. I may remember to go download them in a few weeks but, if not, feel free to email me and I can download and send them your way. Or, feel free to re-run them.

I did create a proper Chimera input file for the 371 structures that I picked out, though. That's at /home/julian/Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/SPPIDER/SPPIDER_pared_05052020. It's not perfect, though: I think the Sequence-Structure mapping for a few of the experimentally derived PDB files needs to be dealt with manually. See my notes in 05/04/2020 or in SPPIDER if/when you want to read more.

Feature formatting

After looking at the output features from the 275 structures I ran last night, I noticed that I hadn't left behind a good script to turn those features into the nice format that the machine learning script expects. So, I whipped together a quick script for that (found in this repo, or at /Dropbox (HMS)/Protein_Oxidation/Feature_Engineering/Chimera_Features/Formatting).

All done!

Okay, I think I covered everything! Thanks for reading, and please feel free to reach out.