AI and ML
Datasets commonly used in Artificial Intelligence and Machine Learning work.
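Each entry below lists a name, a short description, and the dataset's path on the shared filesystem. A minimal Python sketch of resolving an entry name to its on-disk path (the dict and helper are illustrative, not part of the repository; the paths are copied from the entries on this page):

```python
import os

# Name -> path mapping for a few of the entries on this page.
DATASET_PATHS = {
    "alpaca-farm": "/datasets/ai/alpaca-farm",
    "coco": "/datasets/ai/coco",
    "llama3": "/datasets/ai/llama3",
    "wildchat": "/datasets/ai/wildchat",
}

def dataset_path(name: str) -> str:
    """Look up a dataset's directory, failing loudly on unknown names."""
    try:
        return DATASET_PATHS[name]
    except KeyError:
        raise KeyError(f"unknown dataset {name!r}; known: {sorted(DATASET_PATHS)}")

def dataset_available(name: str) -> bool:
    """True if the dataset directory actually exists on this machine."""
    return os.path.isdir(dataset_path(name))
```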
AlpacaFarm
AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback.
/datasets/ai/alpaca-farm
biomed_clip
BiomedCLIP is a biomedical vision-language foundation model that is pretrained on PMC-15M, a dataset of 15 million figure-caption pairs extracted from biomedical research articles in PubMed Central, using contrastive learning.
/datasets/ai/biomed-clip
blip_2
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models.
/datasets/ai/blip2
coco
COCO is a large-scale object detection, segmentation, and captioning dataset.
/datasets/ai/coco
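If the local copy follows the standard COCO layout (an assumption; check the directory before relying on it), it can be loaded with torchvision's ready-made wrapper. A sketch:

```python
import os

COCO_ROOT = "/datasets/ai/coco"

def coco_paths(split="val2017", root=COCO_ROOT):
    """Build image-dir and annotation-file paths, assuming the standard COCO
    layout: images under <root>/<split>, annotations under
    <root>/annotations/instances_<split>.json."""
    img_dir = os.path.join(root, split)
    ann_file = os.path.join(root, "annotations", f"instances_{split}.json")
    return img_dir, ann_file

if os.path.isdir(COCO_ROOT):
    # Only attempt the load when the dataset directory is actually present.
    from torchvision.datasets import CocoDetection
    img_dir, ann_file = coco_paths("val2017")
    dataset = CocoDetection(root=img_dir, annFile=ann_file)
```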
Code Llama
DeepAccident
DeepAccident is the first V2X (vehicle-to-everything) simulated autonomous driving dataset, containing diverse collision accidents that commonly occur in real-world driving scenarios.
/datasets/ai/deep-accident
DeepSeek
DeepSeek-R1-Zero, trained via large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step, demonstrates remarkable reasoning capabilities; through RL, it naturally develops numerous powerful and intriguing reasoning behaviors.
/datasets/ai/deepseek
DINO v2
DINOv2 is a self-supervised method for learning visual representations.
/datasets/ai/dinov2
epic-kitchens
EPIC-KITCHENS-100 is a large-scale first-person (egocentric) vision dataset: multi-faceted, audio-visual, non-scripted recordings in kitchen environments.
/datasets/ai/epic-kitchens
gemma
Gemma is a family of lightweight, state-of-the-art open models built from the research and technology used to create the Gemini models.
/datasets/ai/gemma
gte-Qwen2
gte-Qwen2-7B-instruct is the latest model in the gte (General Text Embedding) family; it ranks No. 1 in both English and Chinese evaluations on the Massive Text Embedding Benchmark (MTEB) as of June 16, 2024.
/datasets/ai/alibaba
ibm-granite
Granite 3.0 is a set of lightweight, state-of-the-art open foundation models ranging in scale from 400 million to 8 billion active parameters.
/datasets/ai/ibm-granite
Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs.
/datasets/ai/idefics2
Imagenet 1K
inaturalist
The iNaturalist 2017 dataset (iNat) contains 675,170 training and validation images from 5,089 natural fine-grained categories.
/datasets/ai/inaturalist
infly
INF-Retriever-v1 is an LLM-based dense retrieval model developed by INF TECH. It is built upon the gte-Qwen2-7B-instruct model and specifically fine-tuned to excel in retrieval tasks, particularly for Chinese and English data.
/datasets/ai/infly
instruct-blip
InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning.
/datasets/ai/instruct-blip
intfloat
A novel and simple method for obtaining high-quality text embeddings using only synthetic data and fewer than 1k training steps.
/datasets/ai/intfloat
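The text-embedding models listed here (gte-Qwen2, intfloat, linq) are typically compared by the cosine similarity between their output vectors. A self-contained sketch of that scoring step (pure Python, no particular model assumed):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors:
    dot(u, v) / (||u|| * ||v||), ranging over [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In practice the vectors would come from the model's encoder; many embedding models normalize outputs to unit length, in which case the dot product alone gives the same ranking.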
LAION
linq
Linq-Embed-Mistral was developed by building upon the foundations of the E5-mistral-7b-instruct and Mistral-7B-v0.1 models.
/datasets/ai/intfloat
llama
LLaMA is a collection of foundation language models ranging from 7B to 65B parameters.
/datasets/ai/llama
Llama2
llama3
Llama 3 is a herd of language models that natively support multilinguality, coding, reasoning, and tool usage.
/datasets/ai/llama3
Llava_OneVision
mixtral
msmarco
The MS MARCO dataset is a large-scale information retrieval benchmark that uses real-world questions from Bing’s search queries to evaluate the performance of machine learning models in generating answers.
/datasets/ai/msmarco
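MS MARCO passage ranking is conventionally scored with MRR@10: the reciprocal rank of the first relevant passage within a system's top 10 results, averaged over all queries. A minimal sketch of the per-query computation:

```python
def mrr_at_10(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant passage in the top 10,
    or 0.0 if none appears; average over queries to get MRR@10."""
    for rank, pid in enumerate(ranked_ids[:10], start=1):
        if pid in relevant_ids:
            return 1.0 / rank
    return 0.0
```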
natural-questions
The Natural Questions corpus is a question-answering dataset whose questions consist of real, anonymized, aggregated queries issued to the Google search engine.
/datasets/ai/natural-questions
objaverse
Objaverse is a massive dataset of 800K+ annotated 3D objects.
/datasets/ai/objaverse
qwen
Qwen2 is the new series of Qwen large language models, released as base and instruction-tuned models ranging from 0.5 to 72 billion parameters, including a Mixture-of-Experts model.
/datasets/ai/qwen
R1-1776
rag-sequence-nq
A RAG model in which the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever.
/datasets/ai/rag-sequence-nq
red-pajama-v2
RedPajama-V2 is a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata.
/datasets/ai/red-pajama-v2
s1-32B
s1 is a reasoning model fine-tuned from Qwen2.5-32B-Instruct on just 1,000 examples. It matches o1-preview and exhibits test-time scaling via budget forcing.
/datasets/ai/simplescaling
satlas_pretrain
SatlasPretrain is a remote sensing dataset that is large in both breadth and scale, combining Sentinel-2 and NAIP images with 302M labels under 137 categories and seven label types.
/datasets/ai/allenai
SlimPajama
SlimPajama is a rigorously deduplicated, multi-source dataset, refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together.
/datasets/ai/slim-pajama
t5
The T5 model, short for Text-to-Text Transfer Transformer, is a machine learning model developed by Google.
/datasets/ai/t5
V2X
V2X-Sim is a comprehensive simulated multi-agent perception dataset for V2X-aided autonomous driving.
/datasets/ai/v2x
vit
The Vision Transformer (ViT) model uses the transformer architecture to process image patches for tasks like image classification.
/datasets/ai/vit
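The description above refers to image patches: a ViT tiles the input image into fixed-size patches, each embedded as one transformer token. The arithmetic, using the common ViT-Base defaults of a 224×224 image and 16×16 patches (an assumption about this particular checkpoint):

```python
def num_patches(image_size=224, patch_size=16):
    """A ViT splits an image into (image_size // patch_size)**2 patches;
    each patch becomes one input token (plus a learned [CLS] token)."""
    per_side = image_size // patch_size
    return per_side * per_side
```

With the defaults this gives 14 × 14 = 196 tokens per image before the [CLS] token is prepended.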
wildchat
WildChat is a corpus of 1 million user-ChatGPT conversations consisting of over 2.5 million interaction turns.
/datasets/ai/wildchat