Common Crawl: Free, open web-crawl corpus for large-scale text/data mining. 0620 Datasets & Labeling# Common Crawl# corpus# open
UCI Machine Learning Repository: Classic collection of ML datasets used in research and education. 0500 Datasets & Labeling# benchmark# education# ML datasets
doccano (OSS): Open-source text annotation tool for classification, NER, and seq2seq tasks. 0470 Datasets & Labeling# doccano# NER# open source
Scale Nucleus: Dataset management and curation for debugging models, fixing labels, and improving data quality. 0460 Datasets & Labeling# curation# dataset management# Nucleus
Roboflow Universe: Large community hub of computer-vision datasets and pre-trained models. 0460 Datasets & Labeling# community# models# Roboflow
Ollama Library (for local label assist & synthetic data): Run open-weight LLMs locally to assist labeling or generate synthetic datasets. 0450 Datasets & Labeling# label assist# local LLM# offline
Labelbox: End-to-end data engine with AI-assisted labeling, curation, and evaluations. 0450 Datasets & Labeling# curation# evaluation# labeling
Encord Annotate: AI-assisted and human-in-the-loop (HITL) labeling with customizable workflows, analytics, and ontology management. 0440 Datasets & Labeling# AI-assisted# Encord# HITL
Open Images Dataset: Large-scale annotated image dataset with boxes, masks, relationships, and more. 0430 Datasets & Labeling# annotations# detection# Open Images
Google Cloud Public Datasets: Curated, analysis-ready public datasets hosted on Google Cloud/BigQuery. 0420 Datasets & Labeling# analytics# BigQuery# Google Cloud
Kaggle Datasets: Community-driven repository of datasets with notebooks, discussions, and trending signals. 0420 Datasets & Labeling# community# datasets# Kaggle
Label Studio (OSS): Open-source, highly configurable labeling tool for text, images, audio, video, and chat. 0410 Datasets & Labeling# active learning# labeling# multimodal
Prodigy: Developer-centric annotation tool with strong active learning for NLP, computer vision, and audio/video. 0400 Datasets & Labeling# active learning# annotation# CV
V7 Darwin: Professional computer-vision labeling with model-in-the-loop annotation and medical/video tooling. 0400 Datasets & Labeling# annotation# auto-annotate# Darwin
Registry of Open Data on AWS: Discover public datasets available via AWS resources, with examples and tutorials. 0400 Datasets & Labeling# AWS# open data# public datasets