Evcomplex2 #225

Open
wants to merge 106 commits into
base: develop
30a2cc7
INITIAL: create the calibration output file
aggreen Jan 2, 2018
5136422
[FIX]: bug with writing inter ec file in couplings stage
aggreen Jan 12, 2018
1f1e1ff
[MERGE] from master
aggreen Feb 19, 2018
d67bb8a
Bugfix: genome dist fails for eukaryotic seqs
aggreen Mar 20, 2018
ab63e76
[FEATURE]: remove identical structures for inter_dists comparison
aggreen Apr 17, 2018
0d4f7e7
merged from master
aggreen Apr 17, 2018
a19b621
Merge branch 'master' into feature/calibration_compare
aggreen Apr 23, 2018
e2c8c56
API: create calibration file for pairs with no structure
aggreen May 21, 2018
067ad6a
initial commit
aggreen Aug 10, 2018
dbec8e2
initial commit
aggreen Aug 12, 2018
dfe3529
removed overlap_allowed feature
aggreen Aug 15, 2018
778b028
updated from local
aggreen Aug 15, 2018
446238e
Correct call to raise in MultiSegmentCouplingsModel, addresses #177
aggreen Aug 16, 2018
366749f
Added N_eff correction to EVcomplex scoring
aggreen Aug 16, 2018
15123ce
added correct EVcomplex scoring
aggreen Aug 16, 2018
95d6a90
Initialized testing file for couplings stage
aggreen Aug 16, 2018
57c6d7e
created a function to map single columns based on segment index mapper
aggreen Aug 16, 2018
8826e2b
added function to remap the frequencies file generated after concaten…
aggreen Aug 16, 2018
40216fd
TESTING: added test cases for Coupling Score probability models
aggreen Aug 17, 2018
0853c4d
FEATURE: added renumbering of complex frequency file
aggreen Aug 17, 2018
1d303cf
FEATURE: EVcomplex score computed using Neff/L correction
aggreen Aug 17, 2018
83020fb
BUGFIX: now fitting on each quadrant separately when user does not wa…
aggreen Aug 17, 2018
4f3f90b
[API, INTERNAL]: generalized enrichment function to make segment ready
aggreen Aug 21, 2018
46a7f44
[INTERNAL, API]: calculate EC enrichment for complexes
aggreen Aug 21, 2018
5736997
TEST: expanded test cases for couplings protocol
aggreen Aug 21, 2018
d6136cb
NOP: pep8 compliance
aggreen Aug 21, 2018
ca43054
CONFIG: updated complex config to run on Uniprot as default
aggreen Aug 21, 2018
ef8025e
DOC: updated output files tutorial notebook to explain complex enrich…
aggreen Aug 21, 2018
fa74dfe
Merge branch 'feature/avoid_self' of github.com:aggreen/EVcouplings i…
aggreen Aug 21, 2018
5f41532
Merge branch 'develop' into feature/avoid_self
aggreen Aug 22, 2018
360ad7a
Update sample_config_complex.txt
aggreen Aug 22, 2018
03dbdd6
TEST: updated test case paths
aggreen Aug 22, 2018
f95c37a
merged from remote
aggreen Aug 22, 2018
4297be7
Merge branch 'develop' into feature/avoid_self
aggreen Aug 22, 2018
60af0d3
Update sample_config_complex.txt
aggreen Aug 23, 2018
25bfe1d
TEST: fixed ruamel_yaml bad call
aggreen Aug 25, 2018
2bd5222
merged from remote
aggreen Aug 25, 2018
e123ff5
Trigger
aggreen Aug 25, 2018
b57c073
Trigger
aggreen Aug 25, 2018
99bd5c5
Trigger
aggreen Aug 25, 2018
e48ff6e
[INTERNAL, FEATURE] Added feature to compute inter-protein SIFTS map …
aggreen Oct 1, 2018
e53bedb
CONFIG: updated complex config with new parameters
aggreen Oct 1, 2018
b7965e2
INTERNAL: created inter-protein SIFTSresult object, saved inter prote…
aggreen Oct 1, 2018
928aa2b
debug
aggreen Oct 1, 2018
8ab2ccd
initial commit
aggreen Oct 15, 2018
c506066
comparison to ASA and enrichment
aggreen Oct 15, 2018
94fd548
updated RSA calculation
aggreen Oct 24, 2018
b701163
Merge branch 'develop' into feature/avoid_self
sacdallago Dec 16, 2018
8c478d2
DOC: immediate fix for #194
aggreen Dec 19, 2018
0932711
INTERNAL: more robust replacing of kwargs for monomer structure ident…
aggreen Dec 19, 2018
e36e182
updated from remote
aggreen Dec 19, 2018
dbed684
added accessible surface area calculation
aggreen Dec 19, 2018
09a6843
BUGFIX: proper overwriting of kwargs in complex compare
aggreen Jan 7, 2019
0b1c764
merge from feature/avoid-self
aggreen Jan 8, 2019
f14a549
fixed couplings protocol
aggreen Jan 8, 2019
26d9655
merged from calibration_compare
aggreen Jan 8, 2019
55b3168
added calibration features
aggreen Apr 2, 2019
4f31692
merged from feature/calibration
aggreen Apr 8, 2019
688dd7a
added mean field protocol for complexes
kpgbrock Jul 22, 2019
ccb5092
Merge remote-tracking branch 'kelly/complex_dev' into calibration_dca
aggreen Jul 23, 2019
b66b136
[API, INTERNAL, NOP] made requested changes
aggreen Aug 14, 2019
9fe8f3a
[API, INTERNAL, NOP] made requested changes
aggreen Aug 14, 2019
3d1cce5
Merge branch 'develop' into feature/avoid_self
aggreen Aug 14, 2019
82a3671
updates
aggreen Aug 14, 2019
8952c9d
bugfixes
aggreen Aug 15, 2019
81b5fc2
updates to fast complex protocol
aggreen Sep 18, 2019
393e2ba
BUGFIX: correct kwargs to add_mixture_probability
aggreen Sep 18, 2019
c2f3cb8
[INTERNAL]: fixed circular input in couplings/pairs.py
aggreen Sep 19, 2019
4966de1
merged from remote
aggreen Sep 19, 2019
b76ba94
Re-introduced ENA mapping
aggreen Sep 26, 2019
629c460
documented and cleaned ASA code
aggreen Sep 26, 2019
8ede6b5
updated ASA protocol in compare/protocol.py
aggreen Sep 26, 2019
c8a5271
Merge branch 'master' into evcomplex2
aggreen Sep 26, 2019
fc39ccc
INTERNAL: removed unnecessary print statement
aggreen Sep 26, 2019
fa5538c
removed fastcomplex pipeline
aggreen Sep 26, 2019
94a74a7
config_file: added dssp binary
aggreen Sep 26, 2019
32d0cbc
removed fast best hit protocol
aggreen Sep 26, 2019
8284241
removed qos in slurm submission
aggreen Sep 26, 2019
af8c5ed
removed qos in slurm submission
aggreen Sep 26, 2019
7a875fd
updated enrichment code, removed own file
aggreen Sep 26, 2019
5326b48
Updated calibration file protocol
aggreen Oct 1, 2019
025eb8d
updated sample config
aggreen Oct 1, 2019
4d559c7
BUGFIX: typos
aggreen Oct 8, 2019
16534bd
BUGFIX
aggreen Oct 8, 2019
3fdc92a
merged feature/avoid_self
aggreen Apr 28, 2020
d2a136a
merged upstream/develop
aggreen Apr 28, 2020
af8edfa
Fixed merging issues
aggreen May 4, 2020
ed92860
Merge branch 'master' into evcomplex2
aggreen Nov 20, 2020
21554a6
updated configuration files
aggreen Nov 20, 2020
138128a
requested code changes part 1
aggreen Nov 20, 2020
7aa0e41
NOP: making requested changes
aggreen Nov 23, 2020
d048871
API: created compare/tools.py and moved run_dssp to tools.py
aggreen Nov 23, 2020
9464490
API: moved hydropathy index and AA surface area to utils/constants.py
aggreen Nov 23, 2020
2eb48ad
API: refactored segment_map_positions_single_column
aggreen Nov 23, 2020
83db643
INTERNAL: requested changes to EVcomplex2 PR
aggreen Nov 23, 2020
f920ef5
NOP: whitespace error
aggreen Nov 25, 2020
ac01558
INTERNAL: fixed code to run DSSP
aggreen Dec 22, 2020
b302058
INTERNAL: bugfix to enrichment dataframe concatenation
aggreen Dec 22, 2020
0c39855
INTERNAL
aggreen Dec 22, 2020
30a8963
Fixes #225
aggreen Dec 22, 2020
635ab30
initial commit to refactor complex mean field code
aggreen Dec 22, 2020
3953771
BUGFIX: mean field couplings protocl
aggreen Dec 23, 2020
8a2b9d5
Initial commit for complex probability model refactor
aggreen Jan 14, 2021
50c948b
additions to probability stage
aggreen Jan 15, 2021
a56fe66
removed unused imports, PEP8 compliance
aggreen Jan 15, 2021
e4365f3
created score_interaction stage
aggreen Nov 1, 2021
112 changes: 63 additions & 49 deletions config/sample_config_complex.txt
@@ -7,7 +7,7 @@
# Minimal settings required before this configuration can be executed:
# - set your environment, paths to tools and databases (at the end of this file)
# - under "global", set prefix
# - under "align_1" and "align_2", set the monomer sequence_id
# - under "align_1" and "align_2", set the monomer sequence_id
# - run it! :)

# Configuration rules:
@@ -29,12 +29,12 @@ stages:
- compare
- mutate
- fold

# Global job settings. These will override settings of the same name in each of the stages.
# These are typically the settings you want to modify for each of your jobs, together with some settings in the align stage.
global:
# mandatory output prefix of the job (e.g. output/HRAS will store outputs in folder "output", using files prefixed with "HRAS")
prefix:
prefix:

# Clustering threshold for downweighting redundant sequences (Meff computation). E.g. 0.8 will cluster sequences
# at a 80% sequence identity cutoff
@@ -47,20 +47,20 @@ global:
align_1:
# use complex protocol to properly prepare inputs for concatenation
protocol: complex

# monomer alignment creation protocol to nest within the complex alignment protocol
# choose either existing (below) to use a previously created alignment
# or standard to construct an alignment
# or standard to construct an alignment
alignment_protocol: standard


# Mandatory: specify the sequence identifier
# Region can be left blank
# Sequence file can be left blank
sequence_id:
region:
sequence_id:
region:
sequence_file:

# The following typically do not need to be set because 'global' overrides them
# prefix:
# theta:
@@ -87,7 +87,7 @@ align_1:

# sequence database (specify possible databases and paths in "databases" section below)
# note: use uniprot for genome distance based concatenation
database: uniref100
database: uniprot

# compute the redundancy-reduced number of effective sequences (M_eff) already in the alignment stage.
# To save compute time, this computation is normally carried out in the couplings stage
@@ -134,10 +134,10 @@ align_1:
# minimum_column_coverage: 70
# extract_annotation: True

# # if using existing alignment protocol, provide a path to the annotations.csv file
# # if using existing alignment protocol, provide a path to the annotations.csv file
# # from the monomer run that generated the input alignment
# # Needed to correctly find the species identifiers for best hit concatenation
# override_annotation_file:
# override_annotation_file:

# Sequence alignment generation/processing for the second monomer.
align_2:
@@ -148,10 +148,10 @@ align_2:
alignment_protocol: standard
# Mandatory: specify the sequence identifier and region
# Sequence file can be left blank
sequence_id:
sequence_id:
region:
sequence_file:

# The following typically do not need to be set because 'global' overrides them
# prefix:
# theta:
@@ -178,7 +178,7 @@ align_2:

# sequence database (specify possible databases and paths in "databases" section below)
# note: use uniprot for genome distance based concatenation
database: uniref100
database: uniprot

# compute the redundancy-reduced number of effective sequences (M_eff) already in the alignment stage.
# To save compute time, this computation is normally carried out in the couplings stage
@@ -224,10 +224,10 @@ align_2:
# minimum_sequence_coverage: 50
# minimum_column_coverage: 70
# extract_annotation: True
# # if using existing alignment protocol, provide a path to the annotations.csv file
# # if using existing alignment protocol, provide a path to the annotations.csv file
# # from the monomer run that generated the input alignment
# # Needed to correctly find the species identifiers for best hit concatenation
# override_annotation_file:
# override_annotation_file:

# Generation of concatenated sequence alignment for evolutionary couplings calculation
concatenate:
@@ -238,31 +238,35 @@ concatenate:
second_alignment_file:

# Select protocol for concatenation of sequence alignments
# Available protocols:
# Available protocols:
# genome_distance: pair sequences that are closest neighbors on the genome
# best_hit: for each genome, pair the sequences that have the highest % identity to the target sequence
# for best hit protocol, user can set use_best_reciprocal to take the best reciprocal hits only (recommended)
protocol: best_hit
use_best_reciprocal: true

# Maximum genome distance in bases allowed between pairs
# Required for genome_distance protocol only
genome_distance_threshold: 10000

# Maximum sequence identity allowed for hits to be designated
# as paralogs. Required for best_hit in best reciprocal mode only
paralog_identity_threshold: 0.95


# forbid overlapping regions of the same sequence ID from being concatenated
# for typical heteromultimeric complexes, this should be true
forbid_overlapping_concatenation: true

# Parameters for filtering of concatenated alignment

# Filter sequence alignment at this % sequence identity cutoff. Can be used to cut computation time in
# the couplings stage (e.g. set to 95 to remove any sequence that is more than 95% identical to a sequence
# already present in the alignment). If blank, no filtering. If filtering, HHfilter must be installed.
seqid_filter:

# Only keep sequences that align to at least x% of the target sequence (i.e. remove fragments)
minimum_sequence_coverage: 50

# Only include alignment columns with at least x% residues (rather than gaps) during model inference
minimum_column_coverage: 50

@@ -305,12 +309,12 @@ couplings:
# Sequence separation filter for generation of CouplingScores_longrange.csv table (i.e. to take out short-range
# ECs from table, only pairs with abs(i-j)>=min_sequence_distance will be kept.
min_sequence_distance: 6

# Parameters specific to complex pipeline scoring
# Scoring model to assess confidence in computed ECs
# available options: skewnormal, normal, evcomplex
scoring_model: skewnormal

# Specify whether to use all ECs or only inter-molecular ECs for scoring
use_all_ecs_for_scoring: False

@@ -327,7 +331,7 @@ couplings:
compare:
# Current options: standard, complex
protocol: complex

# The following parameters will usually be overridden by global settings / output of previous stage
prefix:
ec_file:
@@ -340,28 +344,30 @@ compare:
# sequence_id and SIFTS database (sequence_id must be UniProt AC/ID in this case)
first_by_alignment: True
second_by_alignment: True
# Alignment method to use to search the PDB Seqres database. Options: jackhmmer, hmmsearch
# Set to jackhmmer to search the PDB Seqres database using jackhmmer from the target sequence only (more stringent).
# Set to hmmsearch to search the PDB seqres database using an HMM built from the output monomer alignment (less stringent).
# Warning: searching by HMM may result in crystal structures from very distant homologs or even unrelated sequences.
# Alignment method to use to find sequences corresponding to PDB structures. Options: jackhmmer, hmmsearch
# Set to jackhmmer to search using jackhmmer from the target sequence only (more stringent).
# Set to hmmsearch to search using an HMM built from the output monomer alignment (less stringent).
# Warning: searching by HMM may result in crystal structures from very distant homologs or even unrelated sequences.
first_pdb_alignment_method: jackhmmer
second_pdb_alignment_method: jackhmmer

# Leave this parameter empty to use all PDB structures for given sequence_id, otherwise
# will be limited to the given IDs (single value or list). Important: note that this acts only as a filter on the
# structures found by alignment or in the SIFTS table (!)
pdb_ids:
inter_pdb_ids:
first_pdb_ids:
second_pdb_ids:

# Limit number of structures and chains for comparison
# Note - the intersection of the monomer structural hits is taken to find the
# inter-protein structures. If you limit the number of monomer structures found in this step,
# you may miss some inter-protein structures
first_max_num_structures: 100
first_max_num_hits: 100
second_max_num_structures: 100
second_max_num_hits: 100
inter_max_num_structures: 10
inter_max_num_hits: 10

# Limit the number of chains and structures to use for each monomer comparison, IN ADDITION to those found from the
# inter-protein comparison
first_max_num_structures: 10
first_max_num_hits: 10
second_max_num_structures: 10
second_max_num_hits: 10

# compare to multimer contacts (if multiple chains of the same sequence or its homologs are present in a structure)
first_compare_multimer: True
@@ -376,7 +382,7 @@ compare:
first_use_bitscores: True
first_domain_threshold: 0.5
first_sequence_threshold: 0.5

second_sequence_file:
second_first_index:
second_region:
@@ -386,10 +392,10 @@ compare:
second_sequence_threshold: 0.5

# Comparison and plotting settings

# Return an error if we fail to automatically retrieve information about a given pdb id
raise_missing: False

# Filter that defines which atoms will be used for distance calculations. If empty/None, no filter will be
# applied (resulting in the computation of minimum atom distances between all pairs of atoms). If setting to any
# particular PDB atom type, only these atoms will be used for the computation (e.g. CA will give C_alpha distances,
@@ -406,7 +412,7 @@ compare:
plot_probability_cutoffs: [0.90, 0.99]

# Plot fixed numbers of inter-protein ECs, and all intra ECs scoring at least as high
# As those inter-protein ECs.
# As those inter-protein ECs.
# Use integers only
plot_lowest_count: 5
plot_highest_count: 10
@@ -421,7 +427,7 @@ compare:

# draw secondary structure on contact map plots
draw_secondary_structure: True

# Settings for Mutation effect predictions
mutate:
# Options: standard, complex
@@ -489,7 +495,7 @@ environment:

# command that will be executed before running actual computation (can be used to set up environment)
configuration:


# Paths to databases used by evcouplings.
databases:
@@ -510,14 +516,22 @@ databases:
# Periodically delete these files so that more recent versions of SIFTS are used.
sifts_mapping_table: /n/groups/marks/databases/SIFTS/pdb_chain_uniprot_plus_current.o2.csv
sifts_sequence_db: /n/groups/marks/databases/SIFTS/pdb_chain_uniprot_plus_current.o2.fasta

# the following two databases are exclusive to EVcomplex and need to be manually downloaded and saved locally
# then add the paths to your local copies of the database
# Download urls:
# Download urls:
# ena_genome_location_table: https://marks.hms.harvard.edu/evcomplex_databases/cds_pro_2017_02.txt
# uniprot_to_embl_table: https://marks.hms.harvard.edu/evcomplex_databases/idmapping_uniprot_embl_2017_02.txt
# uniprot_to_embl_table: https://marks.hms.harvard.edu/evcomplex_databases/idmapping_uniprot_embl_2017_02.txt
uniprot_to_embl_table: /n/groups/marks/databases/complexes/idmapping/idmapping_uniprot_embl_2017_02.txt
ena_genome_location_table: /n/groups/marks/databases/complexes/ena/2017_02/cds_pro.txt
structurefree_model_file: /n/groups/marks/users/agreen/dev/EVcouplings/evcouplings/compare/aux/residue_strucfree.saved
Review comment (Contributor):
We should have these somewhere publicly accessible
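
One possible direction, as a minimal sketch only: fall back to a copy of the model file shipped inside the package when no path is configured, so the sample config does not need user-specific absolute paths. The helper name `resolve_model_path` and the `package_root` argument are hypothetical and not part of EVcouplings; the `compare/aux` layout is taken from the paths in this diff.

```python
from pathlib import Path


# Hypothetical helper, not part of EVcouplings: prefer an explicitly configured
# path, otherwise fall back to a file shipped under <package_root>/compare/aux.
def resolve_model_path(configured_path, package_root, filename):
    """Return the configured path if set, else the packaged default location."""
    if configured_path:
        return Path(configured_path)
    return Path(package_root) / "compare" / "aux" / filename


# Example: no path configured, so the packaged default is used.
default = resolve_model_path(None, "evcouplings", "residue_strucfree.saved")
print(default.as_posix())
```

With this kind of fallback, the database entries above could be left blank in the sample config and only set when a user wants to override the packaged models.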

structureaware_model_file: /n/groups/marks/users/agreen/dev/EVcouplings/evcouplings/compare/aux/residue_strucaware.saved

complex_strucfree_model_file: /n/home/ag300/EVcouplings/compare/aux/complex_strucfree.saved
complex_strucfree_scaler_file: /n/home/ag300/EVcouplings/compare/aux/complex_strucfree.scaler
complex_strucaware_model_file: /n/home/ag300/EVcouplings/compare/aux/complex_strucaware.saved
complex_strucaware_scaler_file: /n/home/ag300/EVcouplings/compare/aux/complex_strucaware.scaler


# Paths to external tools used by evcouplings. Please refer to README.md for installation instructions and which tools are required.
tools:
@@ -529,4 +543,4 @@ tools:
psipred: /n/groups/marks/software/runpsipred
cns: /n/groups/marks/pipelines/evcouplings/software/cns_solve_1.21/intel-x86_64bit-linux/bin/cns
maxcluster: /n/groups/marks/pipelines/evcouplings/software/maxcluster64bit

dssp: /n/groups/marks/software/dssp
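
The constraints stated in the comments of this sample config (uniprot for genome-distance concatenation, the thresholds each concatenation protocol requires, the allowed scoring models and PDB alignment methods) could be checked before a job is submitted. The following is a rough sketch under assumptions: `check_complex_config` is a hypothetical function, it operates on a config already loaded into a nested dict, and its rules come only from the comments above, not from EVcouplings' own validation code.

```python
# Hypothetical sanity checks for a complex config already loaded into a nested
# dict. The rules mirror the comments in the sample config; they are not taken
# from EVcouplings' own code.
SCORING_MODELS = {"skewnormal", "normal", "evcomplex"}
PDB_ALIGNMENT_METHODS = {"jackhmmer", "hmmsearch"}


def check_complex_config(config):
    """Return a list of human-readable problems (empty if none were found)."""
    problems = []
    concat = config.get("concatenate", {})
    protocol = concat.get("protocol")

    if protocol == "genome_distance":
        # The sample config notes that genome-distance pairing needs uniprot.
        for stage in ("align_1", "align_2"):
            if config.get(stage, {}).get("database") != "uniprot":
                problems.append(f"{stage}.database should be uniprot for genome_distance")
        if concat.get("genome_distance_threshold") is None:
            problems.append("concatenate.genome_distance_threshold is required")
    elif protocol == "best_hit":
        # The paralog threshold only matters in best reciprocal mode.
        if concat.get("use_best_reciprocal") and concat.get("paralog_identity_threshold") is None:
            problems.append("concatenate.paralog_identity_threshold is required")
    else:
        problems.append(f"unknown concatenate.protocol: {protocol!r}")

    scoring = config.get("couplings", {}).get("scoring_model")
    if scoring not in SCORING_MODELS:
        problems.append(f"unknown couplings.scoring_model: {scoring!r}")

    compare = config.get("compare", {})
    for key in ("first_pdb_alignment_method", "second_pdb_alignment_method"):
        if compare.get(key) not in PDB_ALIGNMENT_METHODS:
            problems.append(f"compare.{key} must be jackhmmer or hmmsearch")
    return problems


# Example: genome_distance concatenation with one monomer still on uniref100.
example = {
    "align_1": {"database": "uniref100"},
    "align_2": {"database": "uniprot"},
    "concatenate": {"protocol": "genome_distance", "genome_distance_threshold": 10000},
    "couplings": {"scoring_model": "skewnormal"},
    "compare": {
        "first_pdb_alignment_method": "jackhmmer",
        "second_pdb_alignment_method": "jackhmmer",
    },
}
for problem in check_complex_config(example):
    print(problem)
```

Returning a list of problems rather than raising on the first one lets a user fix every issue in a single pass before resubmitting.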
6 changes: 3 additions & 3 deletions config/sample_config_monomer.txt
@@ -278,9 +278,9 @@ compare:
# print information about used PDB structures on contact map plots
print_pdb_information: True

# Alignment method to use to search the PDB Seqres database. Options: jackhmmer, hmmsearch
# Set to jackhmmer to search the PDB Seqres database using jackhmmer from the target sequence only (more stringent).
# Set to hmmsearch to search the PDB seqres database using an HMM built from the output monomer alignment (less stringent).
# Alignment method to use to find sequences corresponding to PDB structures. Options: jackhmmer, hmmsearch
# Set to jackhmmer to search using jackhmmer from the target sequence only (more stringent).
# Set to hmmsearch to search using an HMM built from the output monomer alignment (less stringent).
# Warning: searching by HMM may result in crystal structures from very distant homologs or even unrelated sequences.
pdb_alignment_method: jackhmmer
