Chimera notes from May 04, 2020 - May 06, 2020
While I hope that the prior documentation was helpful, I am sure that it isn't fully comprehensive.
So, in an effort to fill in some gaps, I'm going to type up some notes from my effort over the past two days (May 04, 2020 - May 06, 2020) to go through most steps of the feature engineering process--from modelled structures to features.
Step 1: Consolidating structures
The first step was to get all of those structures in one place.
You can see my work on that in /n/groups/drad/julian/Carbonylation_May2020
. This was a good bit of work--and, for future reference, I wrote my scripts in the ./quickscripts
directory.
First, I just copied over all modelled protein structures that had UniProt IDs corresponding to my IDs of interest--that was pretty straightforward, since I labelled the modelled structures with their UniProt IDs.
Then, I downloaded all experimentally-derived PDB structures of interested. Previously, I went through by hand to choose these structures (based on their coverage and resolution).
Then, I annotated each structure with its coverage. For most modelled proteins, that was straightforward, since those regions were included in the filename originally.
For experimentally-derived structures, those were annotated by hand.
2. Matching structures with dataset entries
Next, I wanted to know: what structure should be associated with each line of the source data?
So I navigated over to the source data--located in the Dropbox at /Protein_Oxidation/PDB_Structures/Human/Source_Tables/Human_CS_data_from_peptide_calls_Bollineni2014_and_Madian2011_js.xlsx
.
At the time, in the CSdata_simple
tab, I only had the first few rows. I wanted to create a row that contained the structure that would be used for each "Site by Ref Seq" entry (I ultimately called this row "Combined (JAS preference)".
2a. Defining the range of residues spanned by each structure
I made a new tab called CSdata_simple_helper
. First, I added all the empirically-determined PDB structures. I did that by hand--previously I picked these structures in the proteins
tab. It still required some work, since I had to seperate entries so that there was only one pdb structure per line.
Then, I did the same for model structures. That didn't require any hand curation though--I generated the data in O2, in /n/groups/drad/julian/Carbonylation_May2020/finished_models/locations_model1_best
and just copied over to excel.
For the model structures, there was one extra step: for models of the whole protein, I replaced the "start" with 1 and the "end" with the length of the protein.
This excluded one structure QUARK structure that did not have location coordinates specified (I'm not sure why that was the case, though? All files should be in the format full_protein
or pfam_start_end
. The exception was called Q13813_PF08726_EFhand_Ca_insen_QUARK_model1.pdb
).
2b. Using CSdata_simple_helper to populate CSdata_simple
Now, I needed to use CSdata_simple_helper
to populate the CSdata_simple
tab. I used Python for that.
First, I exported two files.
The first was source_pdbs
, which was just the CSdata_simple_helper
tab, here's a few lines for reference:
P12259 2224 P12259_PF07732_Cu_oxidase_3_65_197_I-TASSER_model1.pdb 65 197 Model
Q13813 2472 Q13813_PF00435_Spectrin_1092_1175_I-TASSER_model1.pdb 1092 1175 Model
Q8WZ42 34350 Q8WZ42_PF07679_I_set_7385_7474_I-TASSER_model1.pdb 7385 7474 Model
Q8WZ42 34350 Q8WZ42_PF00041_fn3_28585_28669_I-TASSER_model1.pdb 28585 28669 Model
Q8NF91 8797 Q8NF91_PF00435_Spectrin_5738_5835_I-TASSER_model1.pdb 5738 5835 Model
The second was target_pdbs
, which was the uniprot id and location of the "SiteByStructure" row I wanted to match:
P01160 123
O75446 1
P42766 48
P62318 11
P62910 64
Then, to fill the target_pdbs file, I made a Jupyter notebook (at /Protein_Oxidation/PDB_Structures/Human/Source_Tables/jupyter_playground/filling_CSdata_simple
).
I just used source_pdbs
to build a dictionary mapping between uniprot IDs-->sequence locations-->structures, keeping track of whether structures were experimental or modelled:
mapping = {}
with open("./source_pdbs") as source:
for line in source:
up, _, name, start, stop, model_type = line.rstrip().split("\t")
if up not in mapping:
mapping[up] = {}
if model_type not in mapping[up]:
mapping[up][model_type] = {}
for i in range(int(start), int(stop) + 1):
if i in mapping[up][model_type]:
mapping[up][model_type][i].append(name)
else:
mapping[up][model_type][i] = [name]
Then, I picked one structure for each location in target_pdbs
. My priorities were: Experimental Structure > Modelled Structure of a particular pfam > Modelled Structure of an entire protein.
with open("./target_pdbs_output_one", "w") as outfh:
with open("./target_pdbs") as target:
for line in target:
up, loc = line.rstrip().split("\t")
if up not in mapping:
outfh.write(f'{up}\t{loc}\t\n')
else:
if int(loc) in mapping[up].get("Empirical", {}):
structures = mapping[up]["Empirical"][int(loc)]
structure = structures[0]
outfh.write(f"{up}\t{loc}\t{structure}\n")
elif int(loc) in mapping[up].get("Model", {}):
structures = mapping[up]["Model"][int(loc)]
structures_nofull = [structure for structure in structures if "full_protein" not in structure]
if len(structures_nofull) == 0:
structure = structures[0]
else:
structure = structures_nofull[0]
outfh.write(f"{up}\t{loc}\t{structure}\n")
else:
outfh.write(f'{up}\t{loc}\t\n')
2c Finishing up
Now that I had the table of structures for each Reference Sequence, I just pasted those into the CSdata_simple
tab. Then, I combined them with the already-existing RLC structures. I decided to give the newer structures preference: so, if there was both an existing structure and a new structure, I picked the new one.
Note: I went through this process of matching structures to a particular spot in the CSdata table, but we ultimately decided to extract features from all structures (~2,100), not just those 371.
Consolidating all structures to be used
Now I just copied over the structures from the CSdata_simple
tab (Combined (JAS preference)
tab), removed N/As and duplicates, and put them in a text file on O2. There were 371 unique structures.
(For consistency, I removed any .pdb
extensions--since the PDB structures did not have those, but the models did)
For convienence, I then copied all CSdata PDB files from /home/rlc18
to /n/groups/drad/julian/Carbonylation_May2020
Then, I just used a short script (in quickscripts/copy_all_relevant_pdbs.sh
) to move all of those files into one folder.
For the PDB files, I needed to extract only the chain of interest from the whole PDB file. I used a Python script for that--found at /n/groups/drad/julian/Carbonylation_May2020/quickscripts/extract_chain.py
.
This script erred on 4 structures. For 3 of them, that was because some PDB structures actually don't have proper PDB downloads because they have so many chains. For the other one, it was just too big to be loaded by the extract_chain.py
script without requesting more than the default memory on O2:
6EK0_Lh
6EK0_Le
5IJN_H (this one was just too big for the memory in my interactive job. Ran once I upped to 4GB)
6MSB_B
Now all of the 371 unique structures are in /n/groups/drad/julian/Carbonylation_May2020/all_relevant_pdbs
.
(Note that I repeated this process for all structures, after deciding to run features on all)
Getting sppider_binding data for all structures
Now, I need to run SPPIDER on each of those 371 unique structures.
To do that, I use the sppider_mech_JAS.pl
script in the SPPIDER directory of this repo.
For some reason, I have trouble running this script in high throughput. The only way I've gotten it to work is to copy the script into the same directory where I have my pdb files, and then run for file in ./*.pdb; do perl sppider_mech_JAS.pl ${file/.\//}; done
.
Once those finish running, I can run the parseSPPIDER2results.pl
script to parse the results (after downloading results emails from Outlook), and then I'll need to do a bit more parsing to get them in chimera-ready format.
(And later repeated for all structures. When I did that, the script froze after the first thousand structures or so, so had to re-run on remaining structures.)
The remaining structures have not finished from SPPIDER, but I will go ahead and write a parser for future users.
TODO: Collect SPPIDER output files and parse into the same format as the file found in input/
.
Getting disopred data for all structures
Since DISOPRED only required uniprot IDs, I should already have all of the DISOPRED results that I need (in the DropBox at /Protein_Oxidation/Feature_Engineering/DISOPRED
)
Output files can be parsed like so:
(base) [js741@compute-a-16-161 Q9Y6U3]$ tail -n +4 Q9Y6U3.diso | tr -s " " | cut -f1,2,3,5 -d" " | head
1 M 0.83
2 A 0.69
3 R 0.48
4 E 0.32
5 L 0.31
6 Y 0.20
7 H 0.16
8 E 0.23
9 E 0.23
10 F 0.16
(base) [js741@compute-a-16-161 Q9Y6U3]$ tail -n +6 Q9Y6U3.pbdat | tr -s " " | cut -f2,3,5 -d" " | head
1 M 0.40
2 A 0.27
3 R NA
4 E NA
5 L NA
6 Y NA
7 H NA
8 E NA
9 E NA
10 F NA
Getting metal binding for all structures
LOMETS already ran.
Consolidated really simply by running for file in */init.dat; do new_file="${file/\//_}"; cp $file all_files/$new_file; done
in the LOMETS directory.
Copied to /home/js741/CarbonylationSite_Prediction/Human_Proteins/findsite/LOMETS_to_findsitemetal/LOMETSinitdatFiles
Then ran /home/js741/CarbonylationSite_Prediction/Human_Proteins/findsite/LOMETS_to_findsitemetal/LOMETS_to_FINDSITE_files_js.py
Running findsite:
Need to set environmental variables (put in .bashrc): export FINDSITELIB=/n/groups/drad/findsitemetal-1.0/dat/pdb_library export FINDSITEMAP=/n/groups/drad/findsitemetal-1.0/dat/my_mapping.cls export FINDSITEDAT=/n/groups/drad/findsitemetal-1.0/dat
Then you can run, general format: /n/groups/drad/findsitemetal-1.0/bin/findsitemetal -s /n/groups/drad/julian/Carbonylation_May2020/all_relevant_pdbs/Q9Y6U3_5A1K_A.pdb -t FINDSITE_input/Q9Y6U3.txt -o /home/js741/CarbonylationSite_Prediction/Human_Proteins/findsite/LOMETS_to_findsitemetal/test_findsite/Q9Y6U3_5A1K_A
Completes pretty quickly
Wrote /home/js741/CarbonylationSite_Prediction/Human_Proteins/findsite/run_findsite.sh
to run through each of the 317 proteins and get a findsite output. Doesn't work on everything, since it relies on a filename format that I used, but not RLC. running the following structures manually:
Still need to work on merging findsitemetal and UniProt metals And parsing findsite output?
TODO: Somehow integrate findsite and uniprotmetal inputs to make a metal input file
UniProt metals is in the format: SiteBySequence\tMetals
TODO#1: Convert findsite into format SiteByStructure\tMetals
TODO#2: Run structures through SSM to get sequence-structure mapping.
Going to go ahead and do TODO #2.
Getting all Carbonylation-related proteins in one place to run via Chimera.
After meeting with Roger today, I'm going to go ahead and run Chimera, sans metal binding features, across all carbonylation-related structures.
I only have ~24 hours to run all of these, so I'm going to have to prioritize some. I'll run: (1) the 2,017 structures in /n/groups/drad/julian/Carbonylation_May2020/finished_models/collapsed_model1_best
, (2) the 139 pdb structures in /n/groups/drad/julian/Carbonylation_May2020/pdb_structures/onechain_pdb
, and (3) the 90 structures in /n/groups/drad/julian/Carbonylation_May2020/rlc18_structures_up
(these are the structures from rlc18's home directory, but only the ones that were easy to identify an associated uniprot entry).
I would like to include DISOPRED and SPPIDER features in those, if possible, even if I can't do metal binding. That's going to require (1) knowing the Uniprot IDs of each structure, (2) knowing the sequence-structure mapping for each, and (3) making sure DISOPRED has been run for each unique Uniprot ID, and (4) making sure SPPIDER has been run for each stucture.
While that is running, it would be great if I could re-write the Chimera formatting script to make that easier. Maybe just write the column-picking script first really quickly, and add matching back if time.
Copied all structures into DropBox at
/Protein_Oxidation/PDB_Structures/Carbonylation_Push_May2020/all_structures
.Found all unique UniProt IDs with
find all_structures/ -type f | cut -f2 -d"/" | cut -f1 -d"_" | sort | uniq >all_ups
.Copied all_ups into sequence-structure mapping, followed pipeline instructions. (running overnight)
Submitting all structures to SPPIDER, followed instructions. (running overnight).
DISOPRED: Need to write script for this too, maybe I can do this tomorrow?
- For now, just going to run all of the structures without any DISOPRED/metal binding/SPPIDER, since it may take a while. --> Had to create dummy DISOPRED, SPPIDER, and metal files: I just made them blank. The pipeline should really run without them--that would be a quick fix for another day.
05/06/2020 at 13:15: Only ~200 chimera structures finished. Going to move to a seperate notes doc to wrap things up.