Useful Databases
GENCODE
GENCODE is a project that aims to identify and annotate all protein-coding genes in the human and mouse genomes, using a combination of computational analysis, targeted experimental approaches and manual curation. GENCODE also provides gene annotation files, as well as amino acid and nucleotide sequence files for known proteins and transcripts. The project releases updates to its database multiple times per year, and all data are available for download.
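The gene annotation files are distributed in GTF (and GFF3) format, so a first step after downloading is usually to parse records. The following is a minimal sketch, assuming the standard nine-column GTF layout; the function name `parse_gtf_line` and the sample record are illustrative and not part of any GENCODE tooling.

```python
import re

# Matches one GTF attribute, quoted (gene_id "X";) or bare (level 2;).
_ATTRIBUTE = re.compile(r'(\S+)\s+(?:"([^"]*)"|([^";\s]+))\s*;')


def parse_gtf_line(line):
    """Parse one GTF record into a dict with integer coordinates and attributes."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqname, source, feature, start, end, score, strand, frame, attr_text = cols
    attributes = {}
    for key, quoted, bare in _ATTRIBUTE.findall(attr_text):
        # Keys that repeat (such as "tag") keep only their last value in this sketch.
        attributes[key] = quoted or bare
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # GTF coordinates are 1-based and inclusive
        "end": int(end),
        "strand": strand,
        "attributes": attributes,
    }


# Illustrative record only; not guaranteed to match any specific release.
SAMPLE = "\t".join([
    "chr1", "HAVANA", "gene", "11869", "14409", ".", "+", ".",
    'gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2;',
])

record = parse_gtf_line(SAMPLE)
print(record["feature"], record["start"], record["end"],
      record["attributes"]["gene_name"])
```

For whole files, skip comment lines that start with `#` before calling the parser. A dedicated GTF library is a better choice for anything beyond quick inspection.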
GTEx
GTEx is the Adult Genotype-Tissue Expression (GTEx) project, which aims to study human gene expression and regulation and their relationship to genetic variation across multiple diverse tissues and individuals. More specifically, GTEx includes RNA-seq, snRNA-seq, long-read RNA-seq, QTL (single- and multi-tissue), histology, protein expression, methylation QTL and variant-call data across various tissue types and individuals. These data can be downloaded locally or visualized in a browser. The online browsers can be used to visualize bulk RNA-seq expression, snRNA-seq expression, and tissue and histology data. There is also dGTEx, which aims to study development-specific genetic effects on gene expression, as well as NHP dGTEx, which studies the same effects in non-human primates. Both dGTEx and NHP dGTEx are still underway, and no data have been released yet.
AlphaFold Database
Google DeepMind and EMBL’s European Bioinformatics Institute (EMBL-EBI) have partnered to create AlphaFold DB, which makes AlphaFold protein structure predictions freely available to the scientific community. The latest database release contains over 200 million entries, providing broad coverage of UniProt (the standard repository of protein sequences and annotations). Individual downloads are provided for the human proteome and for the proteomes of 47 other key organisms important in research and global health.
ESM Atlas
ESMAtlas
InterPro
InterPro is a database that provides functional analysis of proteins by classifying them into families and predicting domains and important sites. InterPro integrates different member databases into a larger InterPro consortium, so a single search is aggregated across all of them. Users can search by sequence or protein name, or look for specific domain architectures of interest. Additionally, data from entire proteomes or member databases can be exported locally.
STRING
STRING is a database of known and predicted protein-protein interactions. The interactions include direct (physical) and indirect (functional) associations; they stem from computational prediction, from knowledge transfer between organisms, and from interactions aggregated from other (primary) databases.
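Interactions can also be retrieved programmatically through STRING's REST API. The sketch below only builds a request URL for the `network` method; the base URL and the `identifiers`, `species` and `required_score` parameters are written as recalled from the STRING API documentation and should be checked against the current version. The helper name `string_network_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL of the STRING REST API; verify against the current docs.
STRING_API = "https://string-db.org/api"


def string_network_url(identifiers, species, output_format="tsv", required_score=None):
    """Build a URL for the STRING 'network' method.

    identifiers    -- protein or gene names, e.g. ["TP53", "MDM2"]
    species        -- NCBI taxonomy identifier, e.g. 9606 for human
    required_score -- optional minimum combined score (0 to 1000)
    """
    # STRING expects multiple identifiers separated by carriage returns,
    # which urlencode percent-encodes as %0D.
    params = {"identifiers": "\r".join(identifiers), "species": species}
    if required_score is not None:
        params["required_score"] = required_score
    return f"{STRING_API}/{output_format}/network?{urlencode(params)}"


print(string_network_url(["TP53", "MDM2"], 9606))
```

The resulting URL can be fetched with any HTTP client. For long identifier lists, sending the same parameters in a POST request avoids URL length limits.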
PhosphoSitePlus
PhosphoSitePlus is an online database for the study of protein post-translational modifications such as phosphorylation, acetylation, methylation, ubiquitination, and O-glycosylation. The database includes data from various diseases, tissue types and cell lines, as well as built-in motif analysis, sequence logo analysis and kinase predictions. Proteins can be searched individually or compared by sequence or by specific modification sites, and datasets can also be downloaded locally. PhosphoSitePlus draws on publicly available data from numerous published journal articles.
DisProt
DisProt is the major manually curated repository of Intrinsically Disordered Proteins, both for structural and functional aspects. Expert curators are involved in collecting experimentally confirmed biological data, valuable for the scientific community, and for updating and maintaining DisProt over time.
PhaSePred
Phase separation (PS) mediates the compartmentalization of proteins and nucleic acids in cells. This process is driven by multivalent weak interactions mediated by intrinsically disordered regions (IDRs) or multiple modular domains. A difference between these two kinds of interaction is that a single molecular species can undergo IDR-mediated phase separation, while phase separation mediated by multiple interacting domains often involves two or more different molecular species. Proteins that can self-assemble to form condensates are termed self-assembling phase-separating (PS-Self) proteins, and proteins whose phase separation behavior is regulated by partner components (proteins or nucleic acids) are termed partner-dependent phase-separating (PS-Part) proteins.
PhaSePred is a centralized resource that provides self-assembling and partner-dependent phase-separating protein prediction and integrates scores from several PS-related predicting tools.
VastDB
VastDB is a database of Alternative Splicing (AS) profiles across multiple tissue and cell types. VastDB contains AS events (including cassette exons, microexons, alternative 5′ and 3′ splice sites and retained introns) from various species. AS event identification and sequence inclusion level quantification in RNA-seq samples have been performed with VAST-TOOLS. In addition to AS inclusion levels, VastDB provides general information about the AS events, including genomic and sequence context, impact on the reading frame, overlap with protein domains and disordered regions, mapping to protein structures, evolutionary conservation and primers for AS event validation through RT-PCR. Moreover, it also provides measures of Gene Expression, using the cRPKM metric.
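To make the two quantities concrete, here is a simplified sketch of how an inclusion level (percent spliced in, PSI) and an RPKM-style expression value are computed from read counts. It assumes the textbook definitions. VAST-TOOLS applies its own read corrections and quality filters, so its reported values will not match this arithmetic exactly. The function names are hypothetical, and treating the cRPKM length term as the number of uniquely mappable positions is stated here as an assumption.

```python
def percent_spliced_in(inclusion_reads, exclusion_reads):
    """Return PSI in percent, or None when no reads cover the event."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        return None
    return 100.0 * inclusion_reads / total


def crpkm(gene_reads, mappable_positions, total_mapped_reads):
    """Reads per kilobase of mappable positions per million mapped reads.

    Assumption: the 'corrected' part of cRPKM means the gene length is
    replaced by the count of uniquely mappable positions.
    """
    if mappable_positions <= 0 or total_mapped_reads <= 0:
        raise ValueError("mappable_positions and total_mapped_reads must be positive")
    return gene_reads * 1e9 / (mappable_positions * total_mapped_reads)


# 30 reads support inclusion of an exon and 10 support skipping it.
print(percent_spliced_in(30, 10))
# 500 reads on a gene with 1,000 mappable positions, 10 million mapped reads.
print(crpkm(500, 1000, 10_000_000))
```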
BioGRID
The Biological General Repository for Interaction Datasets (BioGRID) is a public database that archives and disseminates genetic and protein interaction data from model organisms and humans (thebiogrid.org). BioGRID currently holds over 1,740,000 interactions curated from both high-throughput datasets and individual focused studies, as derived from over 70,000 publications in the primary literature. Complete coverage of the entire literature is maintained for budding yeast (S. cerevisiae), fission yeast (S. pombe) and thale cress (A. thaliana), and efforts to expand curation across multiple metazoan species are underway.
IntAct
IntAct provides a free, open source database system and analysis tools for molecular interaction data. All interactions are derived from literature curation or direct user submissions.
Complex Portal
The Complex Portal is an encyclopaedic resource of macromolecular complexes from a number of key model organisms. In addition to the expert manually curated complexes, the portal now holds high-confidence machine-learning predicted human complexes from hu.MAP3.0 and MuSIC. All data is freely available for search and download.
Human Protein Atlas
The Human Protein Atlas is a Swedish-based program initiated in 2003 with the aim to map all the human proteins in cells, tissues, and organs using an integration of various omics technologies, including antibody-based imaging, mass spectrometry-based proteomics, transcriptomics, and systems biology. All the data in the knowledge resource is open access to allow scientists both in academia and industry to freely access the data for exploration of the human proteome.
Allen Brain Map
The Allen Brain Atlas is a free, online resource that maps gene expression, connectivity, and neuroanatomical information for the brains of mice, humans, and non-human primates. Data modalities such as gene expression and neural connectivity are deposited to the atlas by researchers. All data are available for download, both as raw data and as processed and annotated data.
Ensembl
Ensembl is a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation. Ensembl annotates genes, computes multiple alignments, predicts regulatory function and collects disease data. Ensembl tools include BLAST, BLAT, BioMart and the Variant Effect Predictor (VEP) for all supported species.
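Besides the browser and the tools listed above, Ensembl annotations can be queried through its REST service. The following is a hedged sketch using only the Python standard library. It assumes the `lookup/symbol` endpoint at `rest.ensembl.org` as described in the Ensembl REST documentation; the helper names are hypothetical, and the network call is defined but not executed here.

```python
import json
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumed base URL of the Ensembl REST service.
ENSEMBL_REST = "https://rest.ensembl.org"


def lookup_symbol_url(species, symbol):
    """Build the URL that looks up a gene by species and symbol."""
    return f"{ENSEMBL_REST}/lookup/symbol/{quote(species)}/{quote(symbol)}"


def lookup_symbol(species, symbol, timeout=30):
    """Fetch the gene record as a dict (requires network access)."""
    request = Request(
        lookup_symbol_url(species, symbol),
        headers={"Content-Type": "application/json"},  # ask for JSON output
    )
    with urlopen(request, timeout=timeout) as response:
        return json.load(response)


print(lookup_symbol_url("homo_sapiens", "BRCA2"))
```

Calling `lookup_symbol("homo_sapiens", "BRCA2")` should return a record with the gene's stable identifier and coordinates, subject to the service's rate limits.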
Human Cell Atlas
The Human Cell Atlas is a global consortium that is mapping every cell type in the human body, creating a 3-dimensional Atlas of human cells to transform our understanding of biology and disease. The Atlas is likely to lead to major advances in the way illnesses are diagnosed and treated. Data from many different tissues, labs and donors are freely available.
Genomic Data Commons
The NCI’s Genomic Data Commons (GDC) provides the cancer research community with a repository and computational platform for researchers who need to understand cancer, its clinical progression, and response to therapy. The GDC supports several cancer genome programs at the NCI Office of Cancer Genomics (OCG), including The Cancer Genome Atlas (TCGA) and Therapeutically Applicable Research to Generate Effective Treatments (TARGET). Data from more than 40,000 cases are available for download. While all data is freely available, an application through dbGaP might be necessary to a
Pediatric cancer platform
The PeCan platform presents curated pediatric cancer genomics data including variants, mutational signatures, and gene expression data in addition to histological slide images from ~9000 hematological, CNS, and non-CNS solid tumor patient samples. Data can be explored via a series of data facets containing both retrospective and prospective study cohorts from St. Jude Children’s Research Hospital and other trusted institutions and research centers around the world.
MaveDB
MaveDB is a public repository for datasets from Multiplexed Assays of Variant Effect (MAVEs), such as those generated by deep mutational scanning (DMS) or massively parallel reporter assay (MPRA) experiments. MaveDB is open-source, released under the AGPLv3 license.