Maintaining and monitoring modelling jobs
This assumes that you already have I-TASSER and/or QUARK jobs running in parallel.
Both of these modelling programs have high failure rates, and jobs finish at different times. Importantly, these jobs often fail without a proper error code: a job can be marked "COMPLETED" in the sacct output and still produce no models.
So, it is necessary to regularly monitor job status.
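Because a "COMPLETED" state is not trustworthy on its own, a status check should combine sacct with a look for model files. A minimal sketch of that check (the job ID and run directory below are placeholders):
# A sketch: a SLURM "COMPLETED" state alone is not enough; also require
# that at least one model*.pdb exists in the run directory.
jobid=12345678  # placeholder job ID
rundir=/n/groups/drad/I-TASSER4.3/BatchJobs/batch1/main/Q90928_full_protein  # placeholder
state=$(sacct -j "$jobid" --format=State --noheader | awk 'NR==1{print $1}')
if [ "$state" = "COMPLETED" ] && ls "$rundir"/model*.pdb >/dev/null 2>&1; then
    echo "Job $jobid really finished."
else
    echo "Job $jobid needs attention (state: ${state:-unknown})."
fi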
Moving model*.pdb files
The main outputs from I-TASSER and QUARK jobs are the model*.pdb files (model1.pdb, model2.pdb, model3.pdb). These are the actual protein models. There are other potentially useful outputs (such as the B-factor estimates from I-TASSER) that we haven't used in this pipeline, but that you might consider using in the future.
So, the best way to check whether a job is done is to look for a model*.pdb file.
To save space, I often archive a main subdirectory after copying its model*.pdb files to a separate location.
I-TASSER
model*.pdb files in I-TASSER jobs are generally found directly in the job's subdirectory. So, to search through all potential model*.pdb files in any batch of I-TASSER jobs, I search for /n/groups/drad/I-TASSER4.3/BatchJobs/*/main/*/model*.pdb.
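For a quick interactive count of finished I-TASSER models, the same glob can go straight to ls (a sketch; the 2>/dev/null hides the error when nothing has finished yet):
ls /n/groups/drad/I-TASSER4.3/BatchJobs/*/main/*/model*.pdb 2>/dev/null | wc -l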
After finding one of these model files, I generally rename them with an I-TASSER prefix (e.g. I-TASSER_model3.pdb) and then put them in a folder corresponding to their protein and fragment of origin. In the future, it might be better to name the file with its header (e.g. I-TASSER_Q90928_full_protein_model3.pdb).
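That header-based name could be derived from the path itself, since the fragment directory already encodes the protein and fragment. A hypothetical sketch (the batch name in the path is a placeholder):
file=/n/groups/drad/I-TASSER4.3/BatchJobs/batch1/main/Q90928_full_protein/model3.pdb  # placeholder
fragment=$(basename "$(dirname "$file")")    # Q90928_full_protein
model=$(basename "$file")                    # model3.pdb
echo "I-TASSER_${fragment}_${model}"         # I-TASSER_Q90928_full_protein_model3.pdb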
QUARK
model*.pdb files in QUARK jobs are found within the cluster subdirectory. So, to find them, I search for /n/groups/drad/QUARKmod/BatchJobs/*/main/*/cluster/model*.pdb.
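The quick count shown above for I-TASSER works the same way here, with the extra cluster/ path component (a sketch):
ls /n/groups/drad/QUARKmod/BatchJobs/*/main/*/cluster/model*.pdb 2>/dev/null | wc -l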
General Script
So, we would like to programmatically find folders that have finished models, move those models to a centralized location, and then archive the run folders.
I will move all models to /n/groups/drad/all_pdb_files. I would like to archive the jobs on the Silver Lab research file server, located at /n/files/SysBio/SILVER LAB/. However, that directory is only available on the transfer servers (which you can access by ssh'ing into transfer.rc.hms.harvard.edu with your eCommons ID).
So, I archive the run folder (after finding a model3.pdb, since I only collect 3 models) into /n/groups/drad/archive and then later move those archives into the /n/files/ directory on the transfer server. (Note: I used to use the scratch directory as an intermediate for archiving, but we recently increased the storage limit in /n/groups/drad, so that is no longer necessary.)
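The later move onto the file server then happens from a transfer node; something like the following (a sketch; the archive/ destination subfolder under the Silver Lab share is an assumption):
# Run on transfer.rc.hms.harvard.edu; --remove-source-files cleans up
# the intermediate copies in /n/groups/drad/archive
rsync -av --remove-source-files /n/groups/drad/archive/ "/n/files/SysBio/SILVER LAB/archive/"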
The script to archive and save models can be found at /n/groups/drad/julian/Modelling_Monitoring/transfer_archive_models.sh, but the format is pretty straightforward, as follows:
#!/usr/bin/env bash
set -e
# globstar enables recursive ** globs; nullglob makes each loop skip
# entirely (rather than run once on the literal pattern) when no
# model files exist yet
shopt -s globstar nullglob
# Make the output directory for this run's logs
output_dir="output/archive_run_out_$(date +%Y%m%d_%H%M%S)"
archive_name="archive_$(date +%Y%m%d)"
mkdir -p "$output_dir"
archive_dir="/n/groups/drad/archive"
model_dir="/n/groups/drad/all_pdb_files/human"
# Glob-based looping is whitespace-safe
for file in /n/groups/drad/I-TASSER4.3/BatchJobs/*/main/*/model*.pdb; do
    # Put the filename in a log file
    echo "$file" >>"$output_dir/I-TASSER_models"
    # Split the path on '/': fields 1-9 are the run folder, field 9 is
    # the fragment directory, field 10 is the model filename
    sourceFolder=$(cut -d'/' -f1-9 <<<"$file")
    modelNumber=$(cut -d'/' -f10 <<<"$file")
    fragment=$(cut -d'/' -f9 <<<"$file")
    # The protein ID is the first '_'-separated field of the fragment name
    protein=$(cut -d'_' -f1 <<<"$fragment")
    mkdir -p "${model_dir}/${protein}/${fragment}"
    # rsync flags: -a archive mode, -c use checksums to decide what to copy
    rsync -ac "$file" "${model_dir}/${protein}/${fragment}/I-TASSER_${modelNumber}"
    if [ "$modelNumber" = "model3.pdb" ]
    then
        echo "Found $file : archiving" >>"$output_dir/I-TASSER_model3"
        # Compress the run folder into the archive directory, removing the originals
        mkdir -p "${archive_dir}/I-TASSER/${protein}/${fragment}"
        tar cjPf "${archive_dir}/I-TASSER/${protein}/${fragment}/${archive_name}.tar.bz" \
            "$sourceFolder" --remove-files
    fi
done
And the same loop is then repeated for QUARK.
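A sketch of that QUARK half, assuming the same variables from the top of the script (the extra cluster/ path component shifts the model filename to field 11, while the run folder is still fields 1-9):
# QUARK variant (a sketch): models sit one level deeper, under cluster/
for file in /n/groups/drad/QUARKmod/BatchJobs/*/main/*/cluster/model*.pdb; do
    echo "$file" >>"$output_dir/QUARK_models"
    sourceFolder=$(cut -d'/' -f1-9 <<<"$file")   # .../main/<fragment>
    modelNumber=$(cut -d'/' -f11 <<<"$file")     # model*.pdb
    fragment=$(cut -d'/' -f9 <<<"$file")
    protein=$(cut -d'_' -f1 <<<"$fragment")
    mkdir -p "${model_dir}/${protein}/${fragment}"
    rsync -ac "$file" "${model_dir}/${protein}/${fragment}/QUARK_${modelNumber}"
    if [ "$modelNumber" = "model3.pdb" ]; then
        echo "Found $file : archiving" >>"$output_dir/QUARK_model3"
        mkdir -p "${archive_dir}/QUARK/${protein}/${fragment}"
        tar cjPf "${archive_dir}/QUARK/${protein}/${fragment}/${archive_name}.tar.bz" \
            "$sourceFolder" --remove-files
    fi
done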
Tracking the number of jobs that have completed
Currently (as of April 28), I have a recurring cronjob that checks the status of I-TASSER and QUARK jobs.
The cronjob calls the /n/groups/drad/all_pdb_models/progress_logs/cronjob.sh script, which in turn calls a Makefile in the same directory. That Makefile first updates all available models (by moving them from the BatchJobs directories to the all_pdb_models directories) and then calls a series of relatively simple bash commands to identify completed fragments and models. For more information, see /n/groups/drad/all_pdb_models/progress_logs/Makefile.
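The exact schedule isn't recorded here; a hypothetical crontab entry that runs the check nightly would look like:
# Hypothetical schedule (the real one may differ): run every night at 02:00
0 2 * * * /n/groups/drad/all_pdb_models/progress_logs/cronjob.sh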
To check the number of I-TASSER and QUARK jobs completed over time, I go into /n/groups/drad/all_pdb_models/progress_logs/updates and run head -2 *. The files in that directory contain a log of the counts of fragments detected.
The /n/groups/drad/all_pdb_models/progress_logs/old directory contains more detailed information about those fragments, with folders corresponding to the dates at which those fragments were found. The total_proteins_count and total_fragments_count files are the source tables that I used to update the modelling status.