NMPFamsDB is a database that hosts novel metagenome protein clusters with no or only weak hits to Pfam or reference genomes, with the aim of significantly expanding the currently known protein family space.
With NMPFamsDB you can:
NMPFamsDB has the following pages/sections:
NMPFamsDB is publicly available through http://www.nmpfamsdb.org or https://pavlopoulos-lab.org/NMPFamsDB.
Topics in the Manual are divided in separate tabs, accessible through header buttons at the top of the page. Click on any of these header buttons to navigate to its respective section.
Long sections are divided into subsections, that can be scrolled up and down. At the end of each subsection a link exists, labeled [Back to top]. Clicking on it will return you to the top of the section.
The data contained in NMPFamsDB falls under the following five categories: Families, Sequences, Scaffolds, Datasets and Habitats.
Novel Metagenome Protein families (or NMPFs) are the main entry category of NMPFamsDB. They represent collections of protein sequences, clustered based on their sequence identity. Each NMPF is characterized by a consensus sequence, a number of Multiple Sequence Alignments (MSAs) and an associated Hidden Markov Model (HMM) profile, secondary structure and protein topology predictions, phylogenetic, ecosystem and geographical distribution annotations. In addition, a subset of NMPFs have available 3D structure models predicted using AlphaFold. NMPFs are represented by unique alphanumerical identifiers (e.g. F000001, F000018, F134500) or by their corresponding numerical indexes (e.g. 1, 18, 134500). A detailed description on the information contained for each family is given in the section “Protein Families”.
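The two identifier forms can be converted into each other programmatically. The following is a minimal sketch, assuming that a full ID is the letter F followed by the numerical index zero-padded to six digits. This matches the examples above but is not confirmed for every entry, and the function names are hypothetical.

```python
import re

# Assumption: a full ID is "F" followed by the index, zero-padded to six digits.
_FULL_ID = re.compile(r"^F(\d{6})$")


def to_full_id(index: int) -> str:
    """Convert a numerical index (e.g. 18) to a full family ID (e.g. 'F000018')."""
    if not 1 <= index <= 999_999:
        raise ValueError(f"index out of range: {index}")
    return f"F{index:06d}"


def to_index(family_id: str) -> int:
    """Convert a full family ID (e.g. 'F134500') to its numerical index."""
    match = _FULL_ID.match(family_id.strip())
    if match is None:
        raise ValueError(f"not a full family ID: {family_id!r}")
    return int(match.group(1))


print(to_full_id(18))
print(to_index("F134500"))
```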
Protein sequences are given in NMPFamsDB both as family members and as individual entities. Each sequence is characterized by a number of identifiers, namely, a unique Protein ID, corresponding to the IMG JGI Gene OID, a Dataset Taxon OID, corresponding to the JGI Taxon OID of environmental samples, and a Scaffold ID, corresponding to the equivalent JGI Scaffold OID for sequencing contig/scaffolds. A detailed description on sequences is given in the section “Sequences”.
Scaffolds represent the sequencing contigs, derived from a metagenome or metatranscriptome dataset, from which the protein sequences were derived. Each Scaffold is represented by a Taxon OID, corresponding to the equivalent IMG JGI Taxon OID for sequencing datasets. A detailed description of the information contained for each scaffold is given in the section “Scaffolds”.
Datasets represent the Metagenome (or Metatranscriptome) sequencing samples/datasets from which the associated scaffolds and protein sequences were derived. Each Dataset is represented by a Taxon OID, corresponding to the equivalent IMG JGI Taxon OID for sequencing datasets. Additional information includes ecosystem classification, geographical metadata and availability for public use. A detailed description on the information contained for each dataset is given in the section “Datasets”.
Habitats represent the documented ecosystems of the sequenced datasets from which protein sequences have been collected. They are hierarchically organized, grouped into categories based on their nature and characteristics. Habitats are directly linked to the metagenome datasets and, through them, to scaffolds, protein sequences and NMPFs. A detailed description of habitats is given in the section “Ecosystems”.
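The relationships between the data categories described above can be pictured as a chain of linked records. The following is an illustrative sketch only: the class and field names are hypothetical and do not reflect the database's actual schema. The IDs are taken from examples in this manual, but their combination here is arbitrary.

```python
from dataclasses import dataclass, field


@dataclass
class Habitat:
    # Hierarchical classification, from general to specific (illustrative values).
    path: tuple[str, ...]


@dataclass
class Dataset:
    taxon_oid: str
    habitat: Habitat


@dataclass
class Scaffold:
    scaffold_id: str
    dataset: Dataset


@dataclass
class Sequence:
    protein_id: str
    scaffold: Scaffold


@dataclass
class Family:
    family_id: str
    members: list[Sequence] = field(default_factory=list)

    def habitats(self) -> set[tuple[str, ...]]:
        """Collect habitats by walking sequence -> scaffold -> dataset -> habitat."""
        return {seq.scaffold.dataset.habitat.path for seq in self.members}


habitat = Habitat(("Environmental", "Aquatic"))
dataset = Dataset("3300004768", habitat)
scaffold = Scaffold("JGI20166J26741_10000139", dataset)
family = Family("F000015", [Sequence("GI24668J20090_102382961", scaffold)])
print(family.habitats())
```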
Users can browse NMPFamsDB’s contents through the web interface’s “Browse” sub-menu in five different ways, by Family, Sequence, Scaffold, Dataset and Habitat:
Browse Families: The database’s novel metagenome protein families (NMPFs).
Browse Sequences: All sequences in the database.
Browse Scaffolds: The sequencing scaffolds from which sequence data has been collected.
Browse Datasets: The metagenome and metatranscriptome datasets from which sequence data has been collected.
Browse Habitats: The ecosystems from which datasets were obtained and sequences were derived.
By choosing Browse → Families you will be redirected to the following page:
The top of the page contains a search form with a number of options that can help you filter data according to your needs. Browse / search results are presented in a table containing the following information:
Users can view a family’s entry page by clicking on its ID. The family’s Number of Sequences, Datasets and Scaffolds are also clickable; by selecting them users can view all sequences, datasets or scaffolds associated with a particular family.
Browse and search results are organized in pages. Users can select the number of entries per page by selecting a value in the “Show Entries” drop-down menu, located at the top left side of the entries table.
One or multiple entries can be selected by clicking their checkboxes, appearing at the start of each line. By clicking the checkbox in the table's head line, all entries per page can be selected at once. Selected entries can then be downloaded, using the options presented at the top right side of the table. Data can be downloaded in either tab (TSV) or comma-delimited (CSV) format, by choosing the appropriate option and clicking the “Export” button.
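Exported tables can be loaded with standard tools. The following is a minimal sketch using Python's csv module, assuming that the first line of the export holds the column headers. The column names and values in the sample are hypothetical and may differ from the real export.

```python
import csv
import io


def read_export(text: str, delimiter: str = "\t") -> list[dict[str, str]]:
    """Parse exported table text into row dictionaries keyed by the header line."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# Hypothetical TSV export content; real headers and values may differ.
sample = "Family ID\tSequences\nF000015\t120\nF000018\t87\n"
rows = read_export(sample)
print([row["Family ID"] for row in rows])
```

For a CSV export, pass `delimiter=","` instead.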
Family searches can be performed by using the search form at the top of the page. The form contains four panels, corresponding to four search capabilities, “Keyword”, "Sequence & Structure", “Environment” and "Phylogeny".
The following options are offered in the Keyword section of the search form:
Family ID(s): Input one or more family IDs separated by spaces. Both full IDs (e.g. F000015) and family number IDs (e.g. 15) can be used. The search will return the families of the submitted IDs.
IMG/M Taxon OID(s): Input one or more Taxon OIDs (e.g. 3300004768) separated by spaces. The search will return all families that contain sequences from the queried datasets.
IMG/M Scaffold ID(s): Input one or more Scaffold IDs (e.g. JGI20166J26741_10000139) separated by spaces. The search will return all families that contain sequences from the queried scaffolds.
IMG/M Gene/Protein ID(s): Input one or more protein IDs (e.g. GI24668J20090_102382961) separated by spaces. The search will return the families containing the submitted sequences.
… 1 (minimum) and 1195 (maximum).
… 100 (minimum) and 16856 (maximum).
The Environment panel enables you to perform searches based on the families' associations with ecosystems, based on the analyzed metagenome and metatranscriptome datasets. All habitats are shown in a hierarchical tree structure, following the GOLD ecosystem classification system. The available ecosystems belong to one of the following general categories:
Each category is further divided into sub-categories that can be displayed or hidden by clicking on the red caret icons. You can also expand or collapse the entire tree by clicking on the Expand All or Collapse All button, respectively.
To select an ecosystem for search, simply click the checkbox next to its name. You can select multiple ecosystems to perform complex queries. The names of the selected ecosystems will appear in the list box at the right of the tree.
Most families are not limited to a single ecosystem; in fact, they are associated with multiple ecosystem types, with each type appearing at a particular association level (for example, a family can be 30% Environmental and 70% Host-associated). You can further limit your query options and search for families that are associated with an ecosystem above a certain percentage cut-off, simply by clicking the "Association cut-off" checkbox and selecting a cut-off level by dragging the slider button. A small subset of the families are associated with a single ecosystem type (e.g. 100% association with plants). You can choose to retrieve only these families by activating the checkbox marked "Limit search to families containing sequences from only the selected environments". This is equivalent to setting the Association cut-off slider at 100%.
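The idea of an association level and a cut-off can be illustrated with a short calculation. This is a sketch under stated assumptions, not the database's documented method: it assumes that the association level is the percentage of a family's sequences coming from each ecosystem type, and that the levels of several selected ecosystems are added together before being compared with the cut-off. The function names are hypothetical.

```python
def association_levels(counts: dict[str, int]) -> dict[str, float]:
    """Convert per-ecosystem sequence counts into percentages of the family total."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {eco: 100.0 * n / total for eco, n in counts.items()}


def passes_cutoff(counts: dict[str, int], selected: set[str], cutoff: float) -> bool:
    """Check whether the selected ecosystems together reach the cut-off (in %)."""
    levels = association_levels(counts)
    combined = sum(levels.get(eco, 0.0) for eco in selected)
    return combined >= cutoff


# The 30% / 70% family from the example above, expressed as sequence counts.
family_counts = {"Environmental": 30, "Host-associated": 70}
print(association_levels(family_counts))
print(passes_cutoff(family_counts, {"Host-associated"}, 50.0))
print(passes_cutoff(family_counts, {"Environmental"}, 50.0))
```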
Similar to the ecosystem search, you can perform searches based on phylogenetic associations. All organisms associated with sequences in NMPFamsDB are shown in a hierarchical tree structure, following the NCBI nomenclature, classification system and ranking. As in the Ecosystem search, you can expand or collapse a category by clicking on the red caret icon or by using the Expand All or Collapse All buttons. You can select one or more taxa by clicking the checkbox next to their names. The selected taxa will appear in the list box at the right of the tree.
To perform a search query, simply set the values you wish to search in each of the respective fields. The query options of the Keyword, Sequence & Structure, Environment and Phylogeny panels can be combined. For example, you can choose to search for families coming only from Metagenomes that have an available 3D structure, are associated with Host-associated ecosystems, and contain sequences from viruses, by setting the appropriate options, as in the figure below:
When you are ready to begin, click the "Submit" button at the bottom of the form. To clear all options, click the "Reset" button. After the search is completed, a table of results will appear, in the same format as in the original page. In addition, the query options will appear in a box marked "Search Options".
Each family entry is divided in distinct sections. You can access them by navigating the family page or by using the buttons at the top of the entry. At the top right of the screen you will find the family's downloadable files. You can download a file simply by choosing it from the selection list and clicking the "Download" button.
The first section presents an overview of the family's properties. These include the family identifier and category, the number of associated sequences, datasets and scaffolds, the average sequence length and the representative consensus sequence, derived from the family's multiple sequence alignment.
The right side of the section contains an interactive viewer displaying the family's Hidden Markov Model (HMM) profile in a Sequence Logo representation. In the logo, each column corresponds to an alignment / HMM position and contains the residues found in that position, colored based on their type. Each amino acid's size in a column is proportional to its probability to be found in that particular position. You can scroll through the HMM viewer by left-clicking and dragging the Sequence Logo. Clicking on a position will open a window showing a detailed list of all residue probabilities. These can also be toggled on and off by clicking the "Toggle Column Annotation" button.
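To make the logo's per-column probabilities concrete, the sketch below computes plain residue frequencies for each column of a small alignment. This is a simplification: an actual HMM profile derives its probabilities with sequence weighting and prior information, so the numbers shown in the viewer will generally differ. The function name and the toy alignment are hypothetical.

```python
from collections import Counter


def column_probabilities(alignment: list[str]) -> list[dict[str, float]]:
    """Return, for each alignment column, the frequency of each residue (gaps ignored)."""
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c not in "-."]
        counts = Counter(residues)
        total = len(residues)
        result.append({aa: n / total for aa, n in counts.items()} if total else {})
    return result


# Toy alignment of four sequences and three columns.
probs = column_probabilities(["MKV", "MRV", "M-V", "MKL"])
for position, column in enumerate(probs, start=1):
    print(position, column)
```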
This section contains an interactive viewer for the family's Multiple Sequence Alignments (MSAs). In NMPFamsDB each family is represented by two alignments: a full MSA, containing all the family's sequences, and a "seed" MSA, containing a representative, non-redundant subset of the sequences. The seed MSA is also the alignment used to create the family's HMM profile. You can access each MSA by clicking on the relevant button at the top of the viewer.
The MSA in the viewer can be scrolled by using the mouse. At the top of the viewer, a menu presents a number of functions to visualize and manipulate the alignment:
The structure and topology section presents the structural annotation of the family.
If a family contains sequences from scaffolds that also have hits to Pfam domains, these domains are considered to be the family's gene neighborhood. The Pfam IDs and names of these domains, as well as their % frequency in the family's associated scaffolds, are given in a table.
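The % frequency of a neighboring domain can be illustrated as follows. This sketch assumes that the frequency is the share of the family's associated scaffolds on which the domain occurs; the manual does not state the exact denominator. The scaffold names are placeholders and the Pfam IDs are used only as sample values.

```python
from collections import Counter


def domain_frequencies(scaffold_domains: dict[str, set[str]]) -> dict[str, float]:
    """Return the percentage of scaffolds on which each Pfam domain occurs."""
    n_scaffolds = len(scaffold_domains)
    if n_scaffolds == 0:
        return {}
    counts = Counter(d for domains in scaffold_domains.values() for d in domains)
    return {d: 100.0 * c / n_scaffolds for d, c in counts.items()}


# Hypothetical family with four associated scaffolds.
neighborhood = {
    "scaffold_A": {"PF00005", "PF00072"},
    "scaffold_B": {"PF00005"},
    "scaffold_C": set(),
    "scaffold_D": {"PF00005"},
}
freqs = domain_frequencies(neighborhood)
print(sorted(freqs.items()))
```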
This section presents the phylogenetic annotation of the family, based on the metadata of the associated scaffolds.
This section presents the ecosystem annotation of the family, based on the metadata of the associated datasets. It is further divided into two parts, Associated Habitat Types and Associated Datasets.
The first part of the section presents the % distribution of ecosystems in the family, through an interactive table and an associated pie chart. As in the Phylogeny section, the various levels of habitat taxonomy can be accessed through the selection list at the top of the table.
The second part shows the family's associated datasets in a table, accompanied by an interactive map showing their geographical distribution. Clicking on a dataset's Taxon OID will redirect you to its page in NMPFamsDB (see the "Datasets" section for details). In the map, dataset locations are indicated by red markers. Hovering the mouse over a marker will open a pop-up window showing its coordinates, as well as the number of sampled datasets from that location.
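The per-marker dataset counts shown in the pop-up can be thought of as a grouping of datasets by their coordinates. The following is a minimal sketch, assuming datasets sampled at exactly the same latitude and longitude share one marker. Apart from the Taxon OID taken from this manual, the record names and coordinates are made up.

```python
from collections import Counter


def marker_counts(datasets):
    """Count datasets per (latitude, longitude) pair, one entry per map marker."""
    return dict(Counter((lat, lon) for _taxon_oid, lat, lon in datasets))


# Hypothetical records: (Taxon OID, latitude, longitude).
datasets = [
    ("3300004768", 45.0, -75.5),
    ("dataset_B", 45.0, -75.5),
    ("dataset_C", 10.25, 20.5),
]
markers = marker_counts(datasets)
for (lat, lon), n in markers.items():
    print(f"marker at ({lat}, {lon}): {n} dataset(s)")
```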
The final section shows the family's protein components. Each protein is represented by its identifiers (Protein ID, Taxon OID and Scaffold ID), its environmental metadata and its amino acid sequence.
Each family has a number of files available for download through the form at the top right corner of the page. These include:
All files are plain text (ASCII/UTF-8 encoded) and can be opened with text editors or manipulated using specialized software:
By choosing Browse → Sequences you will be redirected to the following page:
The top of the page contains a search form with a number of options that can help you filter data according to your needs. Browse / search results are presented in a table containing the following information:
You can see each sequence by clicking the Show/Hide button to expand and collapse it in each table row.
Browse and search results are organized in pages. Users can select the number of entries per page by selecting a value in the “Show Entries” drop-down menu, located at the top left side of the entries table.
One or multiple entries can be selected by clicking their checkboxes, appearing at the start of each line. By clicking the checkbox in the table's head line, all entries per page can be selected at once. Selected entries can then be downloaded, using the options presented at the top right side of the table. Data can be downloaded in tab (TSV) or comma-delimited (CSV) format, as well as in the FASTA sequence format, by choosing the appropriate option and clicking the “Export” button.
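A FASTA export can be read back with a few lines of code. The sketch below assumes the usual FASTA layout, in which each record starts with a header line beginning with ">" followed by one or more sequence lines. The function name is hypothetical and the sequence in the sample is a placeholder, not a real database entry.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into a mapping from header (without '>') to sequence."""
    records: dict[str, str] = {}
    header = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            if not header:
                raise ValueError("empty FASTA header")
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequence lines of one record are concatenated.
            records[header] += line
    return records


# Placeholder record; the header reuses a protein ID from this manual.
sample_fasta = ">GI24668J20090_102382961\nMKTAYIAK\nQRQISFVK\n"
records = parse_fasta(sample_fasta)
print(records)
```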
Like families, sequence searches can be performed by using the search form at the top of the page. The form contains three panels, corresponding to three search capabilities, “Keyword”, “Environment” and "Phylogeny".
The following options are offered in the Keyword section of the search form:
IMG/M Gene/Protein ID(s): Input one or more protein IDs (e.g. GI24668J20090_102382961) separated by spaces. The search will return the sequences of the submitted IDs.
Family ID(s): Input one or more family IDs separated by spaces. Both full IDs (e.g. F000015) and family number IDs (e.g. 15) can be used. The search will return the sequences belonging to the submitted IDs.
IMG/M Taxon OID(s): Input one or more Taxon OIDs (e.g. 3300004768) separated by spaces. The search will return the sequences belonging to the submitted datasets.
IMG/M Scaffold ID(s): Input one or more Scaffold IDs (e.g. JGI20166J26741_10000139) separated by spaces. The search will return the sequences belonging to the submitted scaffolds.
The Environment panel enables you to perform searches using the sequences' ecosystem annotation, based on the analyzed metagenome and metatranscriptome datasets. All habitats are shown in a hierarchical tree structure, following the GOLD ecosystem classification system. The available ecosystems belong to one of the following general categories:
Each category is further divided into sub-categories that can be displayed or hidden by clicking on the red caret icons. You can also expand or collapse the entire tree by clicking on the Expand All or Collapse All button, respectively.
To select an ecosystem for search, simply click the checkbox next to its name. You can select multiple ecosystems to perform complex queries. The names of the selected ecosystems will appear in the list box at the right of the tree.
Similar to the ecosystem search, you can perform searches based on phylogenetic associations. All organisms associated with sequences in NMPFamsDB are shown in a hierarchical tree structure, following the NCBI nomenclature, classification system and ranking. As in the Ecosystem search, you can expand or collapse a category by clicking on the red caret icon or by using the Expand All or Collapse All buttons. You can select one or more taxa by clicking the checkbox next to their names. The selected taxa will appear in the list box at the right of the tree.
To perform a search query, simply set the values you wish to search in each of the respective fields. The query options of the Keyword, Environment and Phylogeny panels can be combined. For example, you can choose to search for Metagenome sequences that are associated with Engineered ecosystems, and contain sequences from Archaea, by setting the appropriate options, as in the figure below:
By choosing Browse → Scaffolds you will be redirected to the following page:
The top of the page contains a search form with a number of options that can help you filter data according to your needs. Browse / search results are presented in a table containing the following information:
Browse and search results are organized in pages. Users can select the number of entries per page by selecting a value in the “Show Entries” drop-down menu, located at the top left side of the entries table.
One or multiple entries can be selected by clicking their checkboxes, appearing at the start of each line. By clicking the checkbox in the table's head line, all entries per page can be selected at once. Selected entries can then be downloaded, using the options presented at the top right side of the table. Data can be downloaded in tab (TSV) or comma-delimited (CSV) format, by choosing the appropriate option and clicking the “Export” button.
Scaffold searches can be performed by using the search form at the top of the page. The form contains three panels, corresponding to three search capabilities, “Keyword”, “Environment” and "Phylogeny".
The following options are offered in the Keyword section of the search form:
IMG/M Taxon OID(s): Search based on associated IMG/M datasets. Input one or more Taxon OIDs (e.g. 3300004768) separated by spaces. The search will return the scaffolds belonging to the submitted datasets.
IMG/M Scaffold ID(s): Retrieve scaffolds based on their Scaffold IDs. Input one or more Scaffold IDs (e.g. JGI20166J26741_10000139) separated by spaces. The search will return the scaffolds of the specified IDs.
IMG/M Gene/Protein ID(s): Search based on sequence components. Input one or more protein IDs (e.g. GI24668J20090_102382961) separated by spaces. The search will return the scaffolds containing the submitted sequences.
Family ID(s): Search scaffolds based on their associated families. Input one or more family IDs separated by spaces. Both full IDs (e.g. F000015) and family number IDs (e.g. 15) can be used. The search will return the scaffolds associated with the submitted IDs.
Family Category: Search scaffolds based on their associated dataset type. Available categories are:
Scaffold length: Search scaffolds based on their length. Drag the slider's two endpoints to define the minimum and maximum value.
The Environment panel enables you to perform searches using the scaffolds' ecosystem annotation, based on the analyzed metagenome and metatranscriptome datasets. All habitats are shown in a hierarchical tree structure, following the GOLD ecosystem classification system. The available ecosystems belong to one of the following general categories:
Each category is further divided into sub-categories that can be displayed or hidden by clicking on the red caret icons. You can also expand or collapse the entire tree by clicking on the Expand All or Collapse All button, respectively.
To select an ecosystem for search, simply click the checkbox next to its name. You can select multiple ecosystems to perform complex queries. The names of the selected ecosystems will appear in the list box at the right of the tree.
Similar to the ecosystem search, you can perform searches based on phylogenetic associations. All organisms associated with scaffolds in NMPFamsDB are shown in a hierarchical tree structure, following the NCBI nomenclature, classification system and ranking. As in the Ecosystem search, you can expand or collapse a category by clicking on the red caret icon or by using the Expand All or Collapse All buttons. You can select one or more taxa by clicking the checkbox next to their names. The selected taxa will appear in the list box at the right of the tree.
To perform a search query, simply set the values you wish to search in each of the respective fields. The query options of the Keyword, Environment and Phylogeny panels can be combined. For example, you can choose to search for Metagenome scaffolds that are associated with Engineered ecosystems, and contain sequences from Archaea, by setting the appropriate options, as in the figure below:
The Scaffold entry page contains the basic information on the sequencing scaffold and is divided into four sections: Overview, Ecosystem & Geography, Associated Families and Sequences.
The Overview section contains the basic properties of the scaffold. These include the scaffold's Taxon OID and Scaffold ID, and its length in base pairs (bps). Links to the associated dataset and scaffold pages in IMG/M are offered through the "Open in IMG/M" buttons. In addition, the basic metadata of the source metagenomic dataset are presented, namely, the name of the dataset, its category (Metagenome or Metatranscriptome), its data usage policy, and information on the sequencing procedure. Apart from the above, the section gives an overview of the components of the scaffold: the total number of genes, the number of novel protein genes, and the number of associated NMPFs. If the taxonomy of the scaffold is known, it is given in the box titled "Taxonomy"; otherwise, it is characterized as "Unclassified".
The Ecosystem & Geography section presents the ecosystem metadata of the scaffold, as defined from the source dataset. The scaffold's ecosystem habitat is shown in the GOLD classification system. In addition, the geographic location of the sample is presented, both in the form of coordinates (longitude, latitude and if applicable, altitude or depth), and in the form of an interactive map panel.
The Associated Families section gives a list of all families containing at least one of the scaffold's novel protein genes.
This section gives a list with all the scaffold's novel protein sequences.
By choosing Browse → Datasets you will be redirected to the following page:
The top of the page contains a search form with a number of options that can help you filter data according to your needs. Browse / search results are presented in a table containing the following information:
Browse and search results are organized in pages. Users can select the number of entries per page by selecting a value in the “Show Entries” drop-down menu, located at the top left side of the entries table.
One or multiple entries can be selected by clicking their checkboxes, appearing at the start of each line. By clicking the checkbox in the table's head line, all entries per page can be selected at once. Selected entries can then be downloaded, using the options presented at the top right side of the table. Data can be downloaded in tab (TSV) or comma-delimited (CSV) format, by choosing the appropriate option and clicking the “Export” button.
Searches can be performed by using the search form at the top of the page. The form contains three panels, corresponding to three search capabilities, “Keyword”, “Environment” and "Phylogeny".
The following options are offered in the Keyword section of the search form:
IMG/M Taxon OID(s): Input one or more Taxon OIDs (e.g. 3300004768) separated by spaces. The search will return the datasets corresponding to the Taxon OIDs.
IMG/M Scaffold ID(s): Input one or more Scaffold IDs (e.g. JGI20166J26741_10000139) separated by spaces.
IMG/M Gene/Protein ID(s): Input one or more protein IDs (e.g. GI24668J20090_102382961) separated by spaces.
Family ID(s): Input one or more family IDs separated by spaces. Both full IDs (e.g. F000015) and family number IDs (e.g. 15) can be used. The search will return the datasets with sequences contained in the submitted families.
… (e.g. Lake) to return all datasets containing it in their name.
… (e.g. Canada).
The Environment panel enables you to perform searches using the datasets' ecosystem annotation, based on the analyzed metagenome and metatranscriptome datasets. All habitats are shown in a hierarchical tree structure, following the GOLD ecosystem classification system. The available ecosystems belong to one of the following general categories:
Each category is further divided into sub-categories that can be displayed or hidden by clicking on the red caret icons. You can also expand or collapse the entire tree by clicking on the Expand All or Collapse All button, respectively.
To select an ecosystem for search, simply click the checkbox next to its name. You can select multiple ecosystems to perform complex queries. The names of the selected ecosystems will appear in the list box at the right of the tree.
Similar to the ecosystem search, you can perform searches based on phylogenetic associations. All organisms associated with scaffolds in NMPFamsDB are shown in a hierarchical tree structure, following the NCBI nomenclature, classification system and ranking. As in the Ecosystem search, you can expand or collapse a category by clicking on the red caret icon or by using the Expand All or Collapse All buttons. You can select one or more taxa by clicking the checkbox next to their names. The selected taxa will appear in the list box at the right of the tree.
To perform a search query, simply set the values you wish to search in each of the respective fields. The query options of the Keyword, Environment and Phylogeny panels can be combined. For example, you can choose to search for Metagenome datasets that are associated with Engineered ecosystems, and contain sequences from Archaea, by setting the appropriate options, as in the figure below:
The Dataset entry page contains the basic information on the dataset sample and is divided into five sections: Overview, Ecosystem & Geography, Associated Families, Associated Scaffolds and Sequences.
The Overview section contains the basic properties of the dataset. These include the Taxon OID and metadata such as the sample name, its sequencing status and its use licence. Links to the associated dataset and scaffold pages in IMG/M are offered through the "Open in IMG/M" buttons. Apart from the above, the section gives an overview of the components of the dataset: its category (metagenome or metatranscriptome), the total metagenome size, the number of associated scaffolds, the number of novel protein genes, and the number of associated NMPFs. A metagenome sample contains genomic material from multiple organisms found in the same sample. The phylogenetic evidence for each dataset component, based on the associated scaffolds, is given in the table "Dataset Phylogeny".
The Ecosystem & Geography section presents the ecosystem metadata of the dataset. The dataset's ecosystem habitat is shown in the GOLD classification system. In addition, the geographic location of the sample is presented, both in the form of coordinates (longitude, latitude and if applicable, altitude or depth), and in the form of an interactive map panel.
The Associated Families section gives a list of all families containing at least one of the dataset's novel protein genes.
The Associated Scaffolds section gives a list of all sequencing scaffolds derived from the dataset.
This section gives a list with all the dataset's novel protein sequences.
By choosing Browse → Ecosystems you will be redirected to the following page:
The top of the page contains a search form with a number of options that can help you filter data according to your needs. Browse / search results are presented in a table containing the following information:
Browse and search results are organized in pages. Users can select the number of entries per page by selecting a value in the “Show Entries” drop-down menu, located at the top left side of the entries table.
One or multiple entries can be selected by clicking their checkboxes, appearing at the start of each line. By clicking the checkbox in the table's head line, all entries per page can be selected at once. Selected entries can then be downloaded, using the options presented at the top right side of the table. Data can be downloaded in tab (TSV) or comma-delimited (CSV) format, by choosing the appropriate option and clicking the “Export” button.
Searches can be performed by using the search form at the top of the page. The form enables you to perform searches using the scaffolds' ecosystem annotation, based on the analyzed metagenome and metatranscriptome datasets. All habitats are shown in a hierarchical tree structure, following the GOLD ecosystem classification system. The available ecosystems belong to one of the following general categories:
Each category is further divided into sub-categories that can be displayed or hidden by clicking on the red caret icons. You can also expand or collapse the entire tree by clicking on the Expand All or Collapse All button, respectively.
To select an ecosystem for search, simply click the checkbox next to its name. You can select multiple ecosystems to perform complex queries. The names of the selected ecosystems will appear in the list box at the right of the tree.
NMPFamsDB offers a number of tools for performing sequence queries on the database's contents. These can be performed against the NMPF representative consensus sequences, or their Hidden Markov Models (HMMs). The following tools are currently available:
The LAST search option uses the LAST alignment (lastal) package to perform pairwise sequence alignment searches. LAST operates in a manner very similar to the BLAST and PSI-BLAST algorithms, but has been optimized to work with large datasets and perform searches with a sensitivity matching that of PSI-BLAST.
By choosing Sequence Search → LAST Search you will be redirected to the following input form:
You can perform LAST searches using the following steps:
Paste one or more sequences in FASTA format in the text box area, or click the "Choose File" button and upload a FASTA-formatted sequence file. The input can contain one or multiple sequences.
Important Note: These sequences MUST be in the FASTA format: the first line of each sequence contains the header and starts with the ">" character, while the remaining lines contain the sequence itself, in the one-letter amino acid code. Submitting plain sequences (i.e. without a header line) WILL NOT WORK.
Define the search options: in this step, you need to select a dataset to search against, as well as define the alignment parameters:
2.1 Select Dataset to search: Choose one of the NMPFamsDB available datasets. You can perform searches against the entire set of NMPFs (All) or choose a subset to limit your searches, e.g. Environmental for NMPFs with Environmental ecosystem associations, Metatranscriptome-only for NMPFs containing exclusively metatranscriptome sequences, or Bacteria for NMPFs associated with bacteria.
2.2 Substitution Matrix: Select the substitution matrix for the alignments. Available choices include BLOSUM62 (default), BLOSUM80, PAM10, PAM30 and MIQS.
2.3 Gap Penalties: Select the gap costs for creation (Open) and Extension (Extend). Although you can set the values according to your needs, the default values offered will probably cover most cases. Note that LAST utilizes different default values for different matrix categories, based on their properties:
… Open 11, Extend 2
… Open 20, Extend 3
… Open 13, Extend 3
… Open 13, Extend 2
If you want your results to appear in a new tab or window, click the "Run in New Tab" checkbox.
When you are ready to continue, click the "Submit" button. To clear your choices, click the "Reset" button.
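The effect of the Open and Extend gap penalties from step 2.3 can be illustrated with a short calculation. This sketch assumes the affine convention in which a gap of length k costs Open + k × Extend, which is our understanding of how LAST defines gap costs. Other tools may count the first gap position differently. The function name is hypothetical, and the default values are the first Open/Extend pair listed in step 2.3.

```python
def gap_cost(length: int, open_cost: int = 11, extend_cost: int = 2) -> int:
    """Return the penalty for a single gap of the given length (affine model)."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0
    # Assumption: every gap position pays the extension cost, plus one opening cost.
    return open_cost + extend_cost * length


for k in (1, 3, 10):
    print(k, gap_cost(k), gap_cost(k, open_cost=20, extend_cost=3))
```

Under this model, a higher Open value discourages starting new gaps, while a higher Extend value discourages long ones.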
The LAST results are presented in the following window:
The table at the top left contains the input parameters, i.e. the search dataset, substitution matrix and gap penalties. The table at the top right contains a list with downloadable files, including the input sequence(s) in FASTA format, a short table of results and the full LAST output in text format for parsing.
An overview of the results is given in the Results Summary table. Each hit is represented by a row in the table and includes the sequence name as found in its FASTA header (e.g. query_sequence_1), the ID of the NMPF hit (e.g. F023960), the alignment parameters (query length, hit length, number of gaps, query range and hit range), the scores (Bit-score and E-value), and a link to the alignment itself ("View"). The NMPF IDs are clickable and open the respective NMPFamsDB entry pages. Clicking the "View" link will automatically navigate you to the respective alignment.
Each alignment is presented separately in the Detailed Results section. The title for each alignment contains all the relevant information (Bit-score, E-value etc.), while the alignment itself can be shown or hidden by clicking the "Show/Hide" button.
The HMMER search option uses the HMMER 3 package to perform sequence queries, either with pairwise sequence alignments or with HMM searches.
By choosing Sequence Search → HMMER Search you will be redirected to the following input form:
For HMMER, two different search types are offered: Sequence vs Sequence and Sequence vs HMM.
These search types are available by clicking the tab buttons at the top of the form.
For Sequence vs Sequence searches perform the following steps:
Paste one or more sequences in FASTA format in the text box area, or click the "Choose File" button and upload a FASTA-formatted sequence file. The input can contain one or multiple sequences.
Define the search options: in this step, you need to select a dataset to search against, as well as define the alignment parameters:
2.1 Select Dataset to search: Choose one of the NMPFamsDB available datasets. You can perform searches against the entire set of NMPFs (All) or choose a subset to limit your searches, e.g. Environmental for NMPFs with Environmental ecosystem associations, Metatranscriptome-only for NMPFs containing exclusively metatranscriptome sequences, or Bacteria for NMPFs associated with bacteria.
2.2 Search Method: Choose the search algorithm you wish to use. Two options are offered, phmmer and jackhmmer. These are essentially equivalent to the BLAST and PSI-BLAST functions: phmmer performs pairwise alignments and returns their results, while jackhmmer first performs a pairwise alignment query, uses the results to construct an HMM profile, and then performs a sequence-HMM query using that profile (the same concept as the PSSMs used in PSI-BLAST, but this time with HMMs). As expected, phmmer is faster, but jackhmmer is more sensitive to remote homologs.
2.3 Substitution Matrix: Select the substitution matrix for the alignments. Available choices include a wide range of BLOSUM and PAM matrices, with BLOSUM62 as the default.
2.4 Gap Penalties: Select the gap costs for creation (Open) and Extension (Extend). Although you can set the values according to your needs, the default values offered will probably cover most cases.
Define significance options: in this step you define the cut-off values after which results will be deemed as significant and reported. These include the Inclusion and Report thresholds. You can choose whether to base your cut-off on the Bit-score (default) or the E-value. For each case, different default values are used, although you can adjust them to your needs. Note that different thresholds are given for the total sequence alignment (Sequence) and the local domain alignment (Domain); these will also correspond to the reported Bit-scores and E-values for the total sequence and the best domain hits, respectively.
If you want your results to appear in a new tab or window, click the "Run in New Tab" checkbox.
When you are ready to continue, click the "Submit" button. To clear your choices, click the "Reset" button.
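For readers who want to reproduce a similar Sequence vs Sequence search on their own machine, the form options above map roughly onto HMMER 3 command-line flags. The sketch below only builds the argument list; it is an illustration under assumptions, since how the NMPFamsDB server actually invokes HMMER is not described in this manual. The function name build_hmmer_command and the file names are hypothetical. Gap settings are left out on purpose, because the HMMER command line expresses them as probabilities (--popen, --pextend) rather than as costs.

```python
def build_hmmer_command(method, query_fasta, target_fasta,
                        matrix="BLOSUM62", use_evalue=False,
                        report=None, include=None):
    """Build an argument list for a local phmmer or jackhmmer run.

    method       -- "phmmer" or "jackhmmer"
    matrix       -- substitution matrix name, passed to --mx
    use_evalue   -- if True, thresholds are E-values; otherwise Bit-scores
    report       -- Report threshold for whole sequences (optional)
    include      -- Inclusion threshold for whole sequences (optional)
    """
    if method not in ("phmmer", "jackhmmer"):
        raise ValueError("method must be 'phmmer' or 'jackhmmer'")
    cmd = [method, "--mx", matrix]
    # -E / --incE take E-values, -T / --incT take Bit-scores.
    report_flag, include_flag = ("-E", "--incE") if use_evalue else ("-T", "--incT")
    if report is not None:
        cmd += [report_flag, str(report)]
    if include is not None:
        cmd += [include_flag, str(include)]
    # HMMER expects the query file first, then the target sequence database.
    cmd += [query_fasta, target_fasta]
    return cmd


print(build_hmmer_command("phmmer", "query.fasta", "nmpf_sequences.fasta"))
```

The returned list can be passed to subprocess.run on a system where HMMER 3 is installed. Domain-level thresholds would use the analogous --domE/--domT and --incdomE/--incdomT flags.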
For Sequence vs HMM searches, select the "Sequence vs HMM" option from the tab menu above the form. You will see the following window:
You can perform a search with the following steps:
Paste one or more sequences in FASTA format in the text box area, or click the "Choose File" button and upload a FASTA-formatted sequence file. The input can contain one or multiple sequences.
Define the search options: in this step, you need to select a dataset to search against, as well as define the alignment parameters:
2.1 Select Dataset to search: Choose one of the NMPFamsDB available datasets. You can perform searches against the entire set of NMPFs (All) or choose a subset to limit your searches, e.g. Environmental for NMPFs with Environmental ecosystem associations, Metatranscriptome-only for NMPFs containing exclusively metatranscriptome sequences, or Bacteria for NMPFs associated with bacteria.
Define significance options: in this step you define the cut-off values after which results will be deemed as significant and reported. These include the Inclusion and Report thresholds. You can choose whether to base your cut-off on the Bit-score (default) or the E-value. For each case, different default values are used, although you can adjust them to your needs. Note that different thresholds are given for the total sequence alignment (Sequence) and the local domain alignment (Domain); these will also correspond to the reported Bit-scores and E-values for the total sequence and the best domain hits, respectively.
If you want your results to appear in a new tab or window, click the "Run in New Tab" checkbox.
When you are ready to continue, click the "Submit" button. To clear your choices, click the "Reset" button.
The HMMER results, either for Sequence vs Sequence or for Sequence vs HMM search, are presented in a manner similar to LAST:
The Ecosystem & Phylogeny tool allows you to visualize and plot the distribution of NMPFs and their relationships, based on their associations with user-defined ecosystems or taxonomic groups. Through the tool, you can create a number of different graph types (matrix plot, Venn diagram, Circos plot, total-vs-specific bar chart and UpSet plot), customize them at will and export them as images.
The tool is available by navigating to Visualization Tools → Ecosystem & Phylogeny from the top menu. You will be redirected to the following input form:
The input has two panels, labeled "Ecosystem" and "Phylogeny" and accessible through the respective tab buttons at the top of the form.
Through "Ecosystem" you can select different ecosystem types, by expanding or collapsing the tree structure at the left of the form and clicking the checkbox next to the name of each ecosystem you want. The names of the selected ecosystems will appear in the text panel at the right of the form. In addition, you can set an association cut-off, by clicking the Apply association cut-off checkbox and dragging the slider button to the value of your choice (in the above example, we set it at 5%). Finally, to limit your run to only families with exclusive association to the selected ecosystems (i.e. 100% association cut-off), click on the relevant checkbox.
Similar to "Ecosystem", you can select different taxonomic groups by navigating to the "Phylogeny" tab and using the search form there. The taxonomic groups are organized in a hierarchical tree in the same manner as the ecosystems. An association cutoff can also be set, by clicking the Apply association cut-off checkbox and dragging the slider button to the value of your choice.
When you are ready to begin, click the "Submit" button.
After the tool has finished, you will see the results in the following manner:
The results are presented in an interactive viewer at the top of the page. These include the run type, association cut-off and a table with all selected categories (ecosystems or taxonomy groups), their taxonomy tree, and number of associated families. Each family count is a link; clicking it opens the Browse Families page in a new tab and retrieves the corresponding families.
The interactive viewer is vertically split into two parts. To the left, there is a control panel. At the top of the control panel you will find your selected category datasets. You can select or deselect them by clicking on the checkboxes or using the Select / Deselect all checkbox. You can rename or remove them by clicking on the Rename or Remove buttons, respectively. You can change the order of a set in the list by selecting only that set and clicking the Move Up or Move Down buttons at the right of the list. Finally, you can apply your changes by clicking the Create Plots button.
Note: The above actions will re-calculate and re-create all visualizations and plots. This can be time-consuming, especially if the number of families in the datasets is large.
At the bottom of the control panel, you will see a Color Palette panel. Each item in the palette corresponds to a dataset. By clicking on it, a colorpicker window appears, enabling you to change the color. To use the updated colors in your plots, click the Update Plots button. To reset the palette to its original values, click the Reset Palette button. Note that clicking the Update Plots button will only change the color of the plots, not re-calculate them. You can export your color palette to a tab-delimited format using the Export color profile button. You can also import your own color palette by using the Import Color Profile upload button. The palette should be in a tab-delimited format. The first column should contain the dataset names, while the second should contain color values in the hexadecimal format. The first line should contain the column names, Dataset and Color, respectively. An example of a color palette file is shown below:
Dataset Color
Terrestrial #996600
Air #00CC00
Aquatic #00FFFF
Human #0066FF
Mammals #FF0000
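A palette file in the format described above can also be prepared or checked with a short script. The following Python sketch reads and writes the two-column, tab-delimited layout with the Dataset and Color header line. The function names are hypothetical, and the restriction to six-digit hexadecimal colors (e.g. #996600) is an assumption based on the example file, not a documented requirement.

```python
import csv
import io
import re

# Assumption: colors are six-digit hex values such as #996600.
HEX_COLOR = re.compile(r"^#[0-9A-Fa-f]{6}$")


def read_palette(text):
    """Parse tab-delimited palette text into a {dataset: color} dict."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader, None)
    if header != ["Dataset", "Color"]:
        raise ValueError("first line must be 'Dataset<TAB>Color'")
    palette = {}
    for row in reader:
        if not row:
            continue  # skip empty lines
        if len(row) != 2:
            raise ValueError(f"expected two columns, got {row!r}")
        name, color = row
        if not HEX_COLOR.match(color):
            raise ValueError(f"not a hexadecimal color: {color!r}")
        palette[name] = color.upper()
    return palette


def write_palette(palette):
    """Serialize a {dataset: color} dict back to tab-delimited text."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Dataset", "Color"])
    for name, color in palette.items():
        writer.writerow([name, color])
    return out.getvalue()


text = "Dataset\tColor\nTerrestrial\t#996600\nAir\t#00CC00\n"
print(read_palette(text))
```

Writing the result of write_palette to a file gives something that should be suitable for the Import Color Profile upload button, provided the dataset names match those in your run.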
The plots of the results are shown to the right of the interactive viewer. Each plot is given in its own subtab, accessible by clicking the tab buttons at the top of the plots panel. At the bottom of each plot, an Export Image button appears, enabling you to download each graph in the PNG image format. The different plot types are shown below:
As the UpSet plot is the most complex of all plot types offered, a number of additional options are given in its panel to allow further customization:
To update the upset plot click the Update Upset button. Note that this recalculates the values of the upset, so it can be slow, depending on the set sizes.
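To see why recalculating an UpSet plot can be slow, consider what it has to count: for every combination of the selected datasets, the number of families shared by that combination. The Python sketch below illustrates the idea using exclusive intersections (families found in exactly the listed sets and in no others). This is a generic illustration; which intersection mode the NMPFamsDB viewer uses is not stated in this manual, and the function name and toy family IDs are hypothetical.

```python
from itertools import combinations


def exclusive_intersections(sets):
    """Count members belonging to exactly each combination of named sets.

    sets -- dict mapping a dataset name to a set of family IDs.
    Returns {tuple_of_names: count}, keeping only non-empty combinations.
    """
    names = list(sets)
    result = {}
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # Families present in every set of the combination...
            members = set.intersection(*(set(sets[n]) for n in combo))
            # ...minus families that also occur in any other set.
            for other in names:
                if other not in combo:
                    members -= set(sets[other])
            if members:
                result[combo] = len(members)
    return result


toy = {
    "Aquatic": {"F000001", "F000002", "F000003"},
    "Terrestrial": {"F000002", "F000003", "F000004"},
    "Air": {"F000003", "F000005"},
}
print(exclusive_intersections(toy))
```

The number of combinations grows as 2^n - 1 for n datasets, which is one reason the computation becomes expensive when many large sets are selected.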
The Geographical Distribution tool allows you to visualize the distribution of one or more NMPFs on the world map, based on the geolocation metadata of their associated metagenome/metatranscriptome datasets, and the maximum distance among them. The tool operates in the following manner:
The tool is accessible by navigating to Visualization Tools → Geographical Distribution from the top menu. You will be redirected to the following input form:
The input form has the following two options:
When you are ready, click the Submit button. To clear the input, click Reset. To run the tool in a separate tab, click the Run in new tab check box prior to submitting.
The tool will take some time to finish, during which a loading screen is displayed. When calculations are complete, you will be redirected to the following results page:
The results are divided into two tabs, accessible from the buttons at the top of the panel. The Viewer tab contains the world map, showing the geographical locations of the datasets associated with the families. The map is interactive and can be navigated using the mouse, by left-clicking and dragging in the map window. Zooming in and out can be performed by using the mouse scroll wheel, or by operating the zoom controls at the bottom. In the map, each data point corresponds to a metagenomic dataset; hovering the mouse over it will display a pop-up, showing its associated Taxon OID, longitude and latitude, and the number of associated families. The size of each point is proportional to the number of associated families. Each point is colored based on the classification of its associated ecosystem, with the color index shown at the bottom of the map. A detailed list of the results is given in the Map Points tab. The results are presented in an interactive table, including the IDs of the identified families, the Taxon OIDs of the associated datasets, and their assigned metadata (Longitude, Latitude and ecosystem type).
The data can be downloaded by clicking the Map Data button. A list of the families can be downloaded by clicking the Family List button. Finally, the bottom of the page shows the input options, including the family search parameters and the maximum distance cutoff.
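Once the map data have been downloaded, the longitude and latitude values can be used to compute distances between datasets yourself. The Python sketch below computes the largest great-circle distance among a set of points using the haversine formula. This is an illustration only: the manual does not state which distance formula the tool uses for its maximum distance cutoff, and the function names and the mean Earth radius of 6371 km are assumptions.

```python
from itertools import combinations
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against floating-point rounding before asin.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, a)))


def max_pairwise_distance_km(points):
    """Largest distance among all pairs of (latitude, longitude) points.

    Returns 0.0 when fewer than two points are given.
    """
    return max(
        (haversine_km(*p, *q) for p, q in combinations(points, 2)),
        default=0.0,
    )


# Three arbitrary example coordinates (latitude, longitude).
sites = [(37.98, 23.73), (40.71, -74.01), (-33.87, 151.21)]
print(round(max_pairwise_distance_km(sites)))
```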
NMPFamsDB offers two ways of programmatic access to its components:
The first method is easier to use and more immediate, but it is limited to the sequence and structure data of the NMPF families. The second method (the API) is more complex to use; however, it provides access to the full metadata of each NMPF Family entry, as well as the information of the associated metagenome datasets and sequencing scaffolds.
In NMPFamsDB, each family entry is represented by a number of files in the following data types:
Each data type can be accessed through a URL formatted as below:
https://pavlopoulos-lab.org/NMPFamsDB/data/datatype/ID.extension
where:
datatype is the type of the data file: fasta, aln, seed, hmm, hhm, logo, pdb or pivot
ID is the NMPFamsDB family identifier
extension is the file extension: fasta, hmm, hhm, json or pdb
For example, to automatically retrieve the components of entry F000872, you can use the following links:
https://pavlopoulos-lab.org/NMPFamsDB/data/fasta/F000872.fasta
https://pavlopoulos-lab.org/NMPFamsDB/data/aln/F000872.fasta
https://pavlopoulos-lab.org/NMPFamsDB/data/seed/F000872.fasta
https://pavlopoulos-lab.org/NMPFamsDB/data/hmm/F000872.hmm
https://pavlopoulos-lab.org/NMPFamsDB/data/hhm/F000872.hhm
https://pavlopoulos-lab.org/NMPFamsDB/data/logo/F000872.json
https://pavlopoulos-lab.org/NMPFamsDB/data/pdb/F000872.pdb
https://pavlopoulos-lab.org/NMPFamsDB/data/pivot/F000872.fasta
The above can be typed as addresses in a browser, or automatically downloaded with utilities such as wget or cURL.
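The same URL scheme can be scripted. The Python sketch below builds the link for a given family and data type, using the datatype-to-extension pairs that appear in the F000872 example links, and downloads it with the standard library. The function names and the output file naming are hypothetical; the six-digit zero padding follows the family identifiers shown in this manual (e.g. index 18 corresponds to F000018).

```python
import urllib.request

BASE_URL = "https://pavlopoulos-lab.org/NMPFamsDB/data"

# File extension used by each data type, as seen in the example links.
EXTENSIONS = {
    "fasta": "fasta",
    "aln": "fasta",
    "seed": "fasta",
    "hmm": "hmm",
    "hhm": "hhm",
    "logo": "json",
    "pdb": "pdb",
    "pivot": "fasta",
}


def family_id(index):
    """Convert a numerical index (e.g. 18) to a family ID (e.g. F000018)."""
    return f"F{int(index):06d}"


def data_url(family, datatype):
    """Return the download URL for one data type of one family."""
    if datatype not in EXTENSIONS:
        raise ValueError(f"unknown data type: {datatype!r}")
    return f"{BASE_URL}/{datatype}/{family}.{EXTENSIONS[datatype]}"


def download(family, datatype):
    """Download one file into the current directory and return its name."""
    filename = f"{family}_{datatype}.{EXTENSIONS[datatype]}"
    urllib.request.urlretrieve(data_url(family, datatype), filename)
    return filename


print(data_url(family_id(872), "hmm"))
# To fetch the file itself: download("F000872", "hmm")
```

Note that 3D models exist only for a subset of families, so a request for the pdb data type may fail for families without a predicted structure.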
NMPFamsDB offers an Application Programming Interface (API), enabling users to retrieve database components without utilizing the web interface. Through the API, you can programmatically access subsets of information and retrieve their components, access the database with scripts written in various languages like Perl, Python, R etc., or incorporate connections with NMPFamsDB to your own applications or web pages.
The API currently serves results in the JSON format. All API services can be accessed by using both GET and POST requests.
A detailed description can be found in the dedicated API reference page.
The Downloads page contains all NMPF information for download in the following formats:
Sequences: The novel protein sequences in the FASTA format. The sequences of each NMPF are given in a distinct FASTA file (e.g. F000001.fasta).
Full MSAs: The full MSAs of the families in the FASTA format. Each MSA is represented by a distinct file (e.g. F000001_msa.fasta).
Seed MSAs: The non-redundant, seed MSAs of the families in the FASTA format. Each MSA is represented by a distinct file (e.g. F000001_seed.fasta).
Consensus Sequences: The consensus sequences of the NMPFs, in a single FASTA file (consensus.fasta).
HMM profiles: The Hidden Markov Models in the HMMER format. Each NMPF is contained in a distinct file (e.g. F105879.hmm).
3D Models: The NMPFs with available 3D structures in the PDB format.
The download options are given for the entire NMPFamsDB contents (All), as well as its following subsets:
Clustered based on habitat distribution:
Environmental: A collection of NMPFs that are associated with Environmental ecosystems.
Host-associated: A collection of NMPFs that are associated with Host-associated ecosystems.
Engineered: A collection of NMPFs that are associated with Engineered ecosystems.
Clustered based on phylogeny:
Bacteria: A collection of NMPFs that are associated with Bacteria.
Archaea: A collection of NMPFs that are associated with Archaea.
Eukaryota: A collection of NMPFs that are associated with Eukaryota.
Viruses: A collection of NMPFs that are associated with viruses.
Unclassified: A collection of NMPFs that have unclassified sequences.
Clustered based on sample type:
Metagenome: Metagenome-only NMPFs
Metatranscriptome: Metatranscriptome-only NMPFs
Metagenome/Metatranscriptome: Mixed Metagenome/Metatranscriptome NMPFs
All of the aforementioned downloads are given as gzip-compressed tar archives. They can be opened with graphical applications such as WinRAR, 7-Zip, Engrampa etc., or with the tar command-line utility (tar xvf package.tar.gz).
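The archives can also be unpacked from a script. The following Python sketch uses the standard tarfile module; the function name and the paths in the usage comment are placeholders, not actual NMPFamsDB file names.

```python
import os
import tarfile


def extract_archive(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions offer a safer extraction filter
            # that rejects members with unsafe paths.
            archive.extractall(dest_dir, filter="data")
        else:
            archive.extractall(dest_dir)
    return names


# Usage (placeholder file names):
# extract_archive("package.tar.gz", "nmpfams_data")
```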
In addition, the original sequence data, as retrieved from IMG/M and clustered, are given for download in tab-delimited format. For each sequence, the first column contains the header, consisting of the Taxon OID, Scaffold ID, Gene ID and Family separated by pipes ("|"), and the second column contains the sequence. Tab-delimited downloads are given for the following two datasets:
Metagenome Families: The metagenomic sequences that were clustered into NMPFs
Reference Genome Families: A sequence dataset derived from IMG's reference genomes, and clustered in the same manner as the metagenomes, for comparison. See Pavlopoulos et al., 2022 for more details.
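The tab-delimited files described above are straightforward to parse line by line. The Python sketch below splits each line into its header fields and sequence. It assumes that the four header fields appear in the order listed (Taxon OID, Scaffold ID, Gene ID, Family) and that the files have no column-name line; the record type, function name and the identifiers in the example line are hypothetical.

```python
from typing import NamedTuple


class SequenceRecord(NamedTuple):
    taxon_oid: str
    scaffold_id: str
    gene_id: str
    family: str
    sequence: str


def parse_line(line):
    """Parse one 'header<TAB>sequence' line into a SequenceRecord.

    The header is expected to hold four pipe-separated fields.
    """
    header, sequence = line.rstrip("\n").split("\t", 1)
    fields = header.split("|")
    if len(fields) != 4:
        raise ValueError(f"expected 4 pipe-separated fields, got {len(fields)}")
    return SequenceRecord(*fields, sequence)


# Made-up example line, for illustration only.
record = parse_line("3300000001|scaffold_0001|gene_0001|F000001\tMKTAYIAKQR\n")
print(record.family, len(record.sequence))
```

For large files, iterate over the open file object and call parse_line on each line rather than reading the whole file into memory.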
The administrators of NMPFamsDB, in accordance with Regulation (EU) 2016/679 and the relevant national legislation on the protection of natural persons with regard to the processing of personal data, provide the following privacy notice to explain what personal data is collected, for what purposes, how it is processed and how we keep it secure.
All data are collected and processed by the administrators of the NMPFamsDB database. You can contact us using the information provided in the Contact page.
Data are collected to help monitor website functionality, resolve issues, improve the allocated resources and provide services to you adequately.
The personal data collected by the website’s services are as follows:
The data administrators use the aforementioned data for the following processes:
Any collected personal data is solely accessed and controlled by the website’s administrators (see question 1). No other person has access to the data.
Any personal data directly collected by NMPFamsDB's services are handled exclusively by the administrators of NMPFamsDB. These data are not transferred to any other organisation.
Please note that NMPFamsDB utilizes a number of third-party resources to provide you with the best possible experience. These include jQuery, FontAwesome, Google Fonts and OpenStreetMap/OpenLayers. Some of these resources store cookies and may record some data to function. The administrators of NMPFamsDB are not responsible for the treatment of any data by these services. You are advised to consult the Privacy Policies of each of these services through their respective web pages.
Any personal data directly obtained from you will be retained for the minimum amount of time possible to ensure legal compliance and to facilitate internal and external audits if they arise.
NMPFamsDB uses cookies to achieve functionality and provide you with the best possible experience. Specifically, we use cookies for the following purposes:
Most browsers allow you to refuse to accept cookies and to delete cookies. The methods for doing so vary from browser to browser, and from version to version. You can however obtain up-to-date information about blocking and deleting cookies via these links:
Blocking all cookies will have a negative impact upon the usability of NMPFamsDB.
You have the right to:
It must be clarified that rights 4 and 5 are only available whenever the processing of your personal data is not necessary to:
Any requests and objections can be sent to us through the Contact page.
Bioinformatics & Integrated Biology Lab, Institute for Fundamental Biomedical Research, Biomedical Sciences Research Center "Alexander Fleming"