Datasets
Unity hosts a variety of commonly used public datasets so that Unity workloads can access them easily. You can find all of Unity’s hosted datasets in the /datasets directory from any Unity node.
You can also view /datasets in the Open OnDemand file browser by navigating to the “/datasets” entry in the “Files” dropdown. To get information about each dataset, see the menu below.
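As a quick way to see what is hosted, the following minimal Python sketch lists the top-level directories under /datasets. The function name `list_datasets` is illustrative and not part of any Unity tooling. The sketch assumes it runs on a Unity node and returns an empty list anywhere the directory does not exist.

```python
from pathlib import Path


def list_datasets(root="/datasets"):
    """Return the sorted names of the subdirectories found directly under root."""
    root_path = Path(root)
    if not root_path.is_dir():
        # Not on a Unity node, or the directory is not mounted here.
        return []
    return sorted(p.name for p in root_path.iterdir() if p.is_dir())


if __name__ == "__main__":
    for name in list_datasets():
        print(name)
```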
AI and ML
Code Llama
ImageNet
ImageNet 1K
LAION
Llama2
Mixtral
Bioinformatics
BFD/MGnify
BFD/MGnify is a database built for ColabFold by combining the Big Fantastic Database (BFD) with the MGnify database.
/datasets/bio/colabfold/bfd_mgy_colabfold
Big Fantastic Database
The Big Fantastic Database (BFD) is a protein sequence database. BFD was created by clustering 2.5 billion protein sequences from UniProt/TrEMBL+Swiss-Prot, Metaclust, the Soil Reference Catalog, and the Marine Eukaryotic Reference Catalog. It consists of over 65M protein families represented as multiple sequence alignments and hidden Markov models. BFD was built using the Uniclust pipeline and is one of the protein sequence databases used with AlphaFold.
/datasets/bio/alphafold/bfd
ColabFoldDB
ColabFoldDB is a protein database built for ColabFold by extending BFD/MGnify with additional metagenomic protein catalogs containing eukaryotic proteins, phage catalogs, and an updated version of Metaclust.
/datasets/bio/colabfold/colabfold_envdb_202108
Dfam
Dfam is a database of transposable element DNA sequence alignments, hidden Markov models (HMMs), consensus sequences, and genome annotations.
/datasets/bio/dfam/
EggNOG
EggNOG is a database of orthology relationships, functional annotation, and gene evolutionary histories. The EggNOG database is used with EggNOG-mapper, a tool for functional annotation of novel sequences.
/datasets/bio/eggnog-data/
/datasets/bio/eggnog6-data/
Kraken2
A database for Kraken2, a tool that assigns taxonomic labels to DNA sequences. The database was built from the complete archaeal, bacterial, and viral genomes downloaded from the NCBI Reference Sequence Database on July 22, 2024.
/datasets/bio/kraken2
MGnify
MGnify is a database of non-redundant protein sequences predicted from metagenomic assemblies. MGnify is one of the protein sequence databases that can be used with AlphaFold.
/datasets/bio/alphafold/mgnify
NCBI BLAST databases
National Center for Biotechnology Information (NCBI) databases, provided in the format required for running the Basic Local Alignment Search Tool (BLAST) as well as the sequence aligner DIAMOND. The collection contains the nucleotide database, the non-redundant Reference Sequence protein database for archaeal and bacterial genomes, the Reference Sequence Prokaryotic Representative Genome Database, and the Reference Sequence Eukaryotic Representative Genome Database. NCBI’s BLAST databases are downloaded weekly. See the full details for more information.
/datasets/bio/ncbi-db/
NCBI RefSeq database
Complete archaeal, bacterial, and viral genomes retrieved from the National Center for Biotechnology Information (NCBI) Reference Sequence Database.
/datasets/bio/ncbi-refseq/
PDB70
PDB70 is a protein database that contains profile hidden Markov models for a representative set of protein sequences from the Protein Data Bank, filtered to a maximum pairwise sequence identity of 70%. PDB70 can be used with AlphaFold.
/datasets/bio/alphafold/pdb70
PDB70 for ColabFold
The PDB70 database (see the PDB70 entry) built in MMseqs2 format for use with ColabFold.
/datasets/bio/colabfold
Protein Data Bank
Protein sequences from the Protein Data Bank in CIF format.
/datasets/bio/colabfold/pdb
Protein Data Bank database in mmCIF format
Protein sequences from the Protein Data Bank in mmCIF format.
/datasets/bio/alphafold/pdb_mmcif
Protein Data Bank database in SEQRES records
Protein sequences from the Protein Data Bank in SEQRES records. SEQRES records contain the amino acid sequence of residues in each chain of the proteins.
/datasets/bio/alphafold/pdb_seqres
Tara Oceans 18S amplicon
18S amplicon sequencing data from Tara Oceans expedition (2009-2013) DNA samples corresponding to the size fractions for protists. The sequence files were downloaded from the European Nucleotide Archive under project number PRJEB6610.
/datasets/bio/tara-oceans/18S-amplicon
Tara Oceans MATOU gene catalog
A reference collection of expressed eukaryotic genes, the Marine Atlas of Tara Oceans Unigenes (MATOU), obtained from Tara Oceans expedition (2009-2013) samples.
/datasets/bio/tara-oceans/MATOU-gene-catalog
Tara Oceans MGT transcriptomes
A collection of metagenomics-based transcriptomes (MGTs) of eukaryotic marine plankton communities, obtained from Tara Oceans expedition (2009-2013) samples.
/datasets/bio/tara-oceans/MGT-transcriptomes
Uniclust30
Uniclust30 is a database of annotated protein sequences and alignments. It is built by clustering the sequences in the UniProt Knowledgebase (UniProtKB) at the level of 30% pairwise sequence identity. Uniclust30 can be used with AlphaFold.
/datasets/bio/alphafold/uniclust30
UniProtKB
The UniProt Knowledgebase (UniProtKB) is a database of protein sequences consisting of two sections, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL. UniProtKB/Swiss-Prot contains manually annotated, non-redundant protein sequence records, while UniProtKB/TrEMBL contains computationally analyzed, unreviewed protein sequence records.
/datasets/bio/alphafold/uniprot
UniRef30
UniRef30 is a database of protein sequences built for ColabFold by clustering UniRef100 sequences at 30% sequence identity.
/datasets/bio/colabfold/uniref30_2103
UniRef90
UniRef90 is a database of protein sequences from UniProtKB and selected UniParc records. UniRef90 is built by clustering UniRef100 sequences such that each cluster is composed of sequences that have at least 90% sequence identity to, and 80% overlap with, the longest sequence in the cluster.
/datasets/bio/alphafold/uniref90
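For AlphaFold workflows, the database paths listed on this page can be collected in one place and checked before a job starts. The following is a minimal sketch: the dictionary keys and the `missing_databases` helper are hypothetical names chosen for illustration, and only the paths themselves come from this page.

```python
from pathlib import Path

# Paths copied from the entries on this page; the keys are illustrative labels.
ALPHAFOLD_DATABASES = {
    "bfd": "/datasets/bio/alphafold/bfd",
    "mgnify": "/datasets/bio/alphafold/mgnify",
    "pdb70": "/datasets/bio/alphafold/pdb70",
    "pdb_mmcif": "/datasets/bio/alphafold/pdb_mmcif",
    "pdb_seqres": "/datasets/bio/alphafold/pdb_seqres",
    "uniclust30": "/datasets/bio/alphafold/uniclust30",
    "uniprot": "/datasets/bio/alphafold/uniprot",
    "uniref90": "/datasets/bio/alphafold/uniref90",
}


def missing_databases(paths):
    """Return the labels whose path is not an existing directory."""
    return [name for name, path in paths.items() if not Path(path).is_dir()]


if __name__ == "__main__":
    missing = missing_databases(ALPHAFOLD_DATABASES)
    if missing:
        print("Not found on this node:", ", ".join(missing))
    else:
        print("All listed AlphaFold database directories are present.")
```

Running the check at the start of a job script gives an early, readable failure if the script is launched somewhere /datasets is not available.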