Using SSM
Description
Here's a collection of tools to derive a mapping between a protein structure (PDB format file) and a corresponding sequence (FASTA format file).
Overview
To derive a mapping, you need three things:
(1) Sequences (1a) The primary sequence of the PDB-format structure (1b) The primary sequence of the FASTA-format file (2) PDB Numbering (2a) Since PDB files sometimes skip numbers due to gaps/etc., we also need a text file with all the site numbers from the PDB-format structre.
After that, a Python script can do the rest. By which I mean it can output a text file in the following format:
{structure file name}_{residue number from PDB structure}\t{residue number from FASTA sequence}
Quickstart
# Original structure
julian-ThinkPad-T460:04_21_2020$ tree
.
├── sequence_fasta
│ ├── A5A619.fasta
│ ├── P21515.fasta
└── structure_pdb
├── A5A619_E10835_deltop10.pdb
├── A5A619_E10835_gap30_40.pdb
├── A5A619_E10835_mut10_mut50.pdb
├── A5A619_E10835.pdb
├── P21515_E10907_deltop10.pdb
├── P21515_E10907_gap20_30.pdb
├── P21515_E10907_mut2_mut82.pdb
├── P21515_E10907.pdb
# Will respond with each pdb structure turned into a fasta file, leaving out here for brevity.
julian-ThinkPad-T460:04_21_2020$ bash ../scripts/pdb2fasta_wrapper_ssm.sh
julian-ThinkPad-T460:04_21_2020$ bash ../scripts/extract_numbers_wrapper.sh
# Will respond with each structure and its alignment, but leaving out here for brevity
julian-ThinkPad-T460:04_21_2020$ python3 ../scripts/ssm_core.py
# Final output directory structure
julian-ThinkPad-T460:04_21_2020$ tree
.
├── output_maps
│ ├── individual
│ │ ├── A5A619_E10835.csv
│ │ ├── A5A619_E10835_deltop10.csv
│ │ ├── A5A619_E10835_gap30_40.csv
│ │ ├── A5A619_E10835_mut10_mut50.csv
│ │ ├── P21515_E10907.csv
│ │ ├── P21515_E10907_deltop10.csv
│ │ ├── P21515_E10907_gap20_30.csv
│ │ ├── P21515_E10907_mut2_mut82.csv
│ └── map_2020-04-21.csv
├── pickled_intermediates
│ ├── alignments.pickle
│ └── up_to_sequences.pickle
├── sequence_fasta
│ ├── A5A619.fasta
│ ├── P21515.fasta
│ ├── Q46798.fasta
│ ├── Q9RUW8.fasta
│ └── Q9RXJ9.fasta
├── structure_fasta
│ ├── A5A619_E10835_deltop10.fasta
│ ├── A5A619_E10835.fasta
│ ├── A5A619_E10835_gap30_40.fasta
│ ├── A5A619_E10835_mut10_mut50.fasta
│ ├── P21515_E10907_deltop10.fasta
│ ├── P21515_E10907.fasta
│ ├── P21515_E10907_gap20_30.fasta
│ ├── P21515_E10907_mut2_mut82.fasta
├── structure_numbering
│ ├── A5A619_E10835_deltop10.pdb
│ ├── A5A619_E10835_gap30_40.pdb
│ ├── A5A619_E10835_mut10_mut50.pdb
│ ├── A5A619_E10835.pdb
│ ├── P21515_E10907_deltop10.pdb
│ ├── P21515_E10907_gap20_30.pdb
│ ├── P21515_E10907_mut2_mut82.pdb
│ ├── P21515_E10907.pdb
└── structure_pdb
├── A5A619_E10835_deltop10.pdb
├── A5A619_E10835_gap30_40.pdb
├── A5A619_E10835_mut10_mut50.pdb
├── A5A619_E10835.pdb
├── P21515_E10907_deltop10.pdb
├── P21515_E10907_gap20_30.pdb
├── P21515_E10907_mut2_mut82.pdb
├── P21515_E10907.pdb
Creating prerequisites
- First, create a folder with today's date and cd into that folder. All of the scripts assume that your working directory is at the same depth as the
scripts
directory.
This assumes that you start off with (1) a PDB files and (2) a file, uniprot_ids
, containing each of the canonical uniprot IDs of your PDB files, one ID per lne.
Make a folder called
{today's date}/sequence_fasta
. Then, use the./scripts/get_uniprot.sh
to query uniprot fasta sequences. So, for example, I ranbash ../../scripts/get_uniprot.sh ../uniprot_ids
while in the04_20_2020/sequence_fasta
folder. Now, you should have fasta files corresponding to each canonical ID.Put all of your PDB files in a folder called
{today's date}/structure_pdb
.
Future scripts rely on the names of your files, so make sure to name them with the following scheme:
For canonical sequence fasta files: {uniprot id}.fasta. So, for example, I have
P00338.fasta
in the04_20_2020/sequence_fasta
folder.For PDB files: {uniprot id}_{structure name}.pdb. So, for example, I am putting a corresponding
P00338_pdb1i10.pdb
into the04_17_2020/structure_pdb
folder.
I created a helper script to help you rename your pdb files, if they are in the format pdb1i10.pdb
and you want to rename to P00338_pdb1i10.pdb
. Uniprot
ids are not generally included in PDB files, so you'll need to provide your own comma-delimited mapping between UniProt and PDB structure file.
^ Remember to back-up your original structures before running this!
Converting PDB into fasta
We can use the ./scripts/pdb2fasta_firstchain_js.sh
script to extract a sequence from the the first chain in your PDB file.
To make this a bit easier on many files, I wrote the ./scripts/pdb2fasta_wrapper_ssm.sh
script.
So, first cd
into your base directory (the one with the date--in my case, 04_17_2017
).
Then, run bash ../scripts/pdb2fasta_wrapper_ssm.sh
.
And that should convert all PDB files in structure_pdb
into fasta files, located in structure_fasta
.
Converting PDB into a numbering file.
We can use the ./scripts/extract_numbers.sh
script to extract sequence numbers from the first chain in your PDB file.
To make this a bit easier on many files, I wrote the ./scripts/extract_numbers_wrapper.sh
script.
Like the last step, you should be in your base directory. Then, run bash ../scripts/extract_numbers_wrapper.sh
.
And that should convert all PDB files in structure_pdb
into numbering files, located in structure_numbering
.
(Note that these numbering files do include all chains, even though we're just using the first chain, so numbering from other chains will be ignored)
Running the main script
Then, you just need to run python3 ../scripts/ssm_core.py
for the main computation.