Wednesday, 27 May 2026 | Daily update | Artificial intelligence in the service of builders

15 Best Free Datasets for Machine Learning Projects (2026)

You can’t learn machine learning by reading — you learn it by building, and building needs data. The good news: there is an enormous amount of high-quality, free data available in 2026. The challenge is knowing where to look. This guide rounds up the 15 best free datasets and dataset sources, organized by type, with advice on choosing the right one.

Key takeaways

  • Best starting point: Kaggle and the UCI Machine Learning Repository.
  • For beginners: classic small datasets like Iris, MNIST, and Titanic.
  • For search: Google Dataset Search and Hugging Face Datasets index millions of options.
  • Match the dataset to your goal: small and clean to learn the concepts, large and messy to practice real-world work.

Dataset hubs and search engines

These platforms host or index huge numbers of datasets across every domain — the best place to start.

1. Kaggle Datasets — The largest community dataset platform. Tens of thousands of datasets on every topic imaginable, most with example notebooks showing how others used them. The single best resource for practice and project ideas.

2. UCI Machine Learning Repository — The long-standing academic collection. Hundreds of well-documented, clean datasets that are perfect for learning specific algorithms. Many famous beginner datasets originate here.

3. Google Dataset Search — A search engine for datasets across the entire web. If you have a specific topic in mind, search it here to find datasets you’d never otherwise discover.

4. Hugging Face Datasets — The hub for modern AI, with a massive library of datasets — especially for text, language, and multimodal work — that load directly into code with a single command.

5. Awesome Public Datasets — A large, curated, community-maintained list on GitHub, organized by topic. A great way to browse quality sources by domain.

Government and open data

Public institutions publish vast amounts of free, reliable data — ideal for realistic projects.

6. Data.gov — The US government’s open data portal: hundreds of thousands of datasets covering economics, health, climate, transportation, and more.

7. World Bank Open Data — Global development data across countries and decades — economics, population, education, environment. Excellent for analysis and forecasting projects.

8. Our World in Data — Clean, well-documented datasets on global topics like health, energy, and population, paired with clear explanations.

Image and computer vision datasets

For computer vision projects:

9. ImageNet — The huge labeled image dataset that helped launch the deep learning era. Millions of images across thousands of categories — the standard benchmark for image classification.

10. COCO (Common Objects in Context) — The go-to dataset for object detection and segmentation, with images labeled for the objects they contain and where those objects are.

11. MNIST and Fashion-MNIST — Small, clean datasets of handwritten digits (and clothing images). The classic “hello world” of image classification — perfect for a first vision model.
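
As a rough sketch of what a first vision model can look like, the example below trains a k-nearest-neighbours classifier on handwritten digits. Two things here are assumptions of the sketch rather than part of the guide: it swaps in scikit-learn's bundled `load_digits` set (1,797 images of 8×8 pixels) as a small offline stand-in for MNIST, which has to be downloaded separately, and the classifier choice and the `train_digit_classifier` name are purely illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def train_digit_classifier(test_size=0.25, seed=0):
    """Train k-NN on scikit-learn's bundled digits; return (model, accuracy)."""
    digits = load_digits()
    # digits.images holds the 8x8 pictures; digits.data is the same pixels
    # flattened to one 64-value row per image, which is what the model needs.
    X_train, X_test, y_train, y_test = train_test_split(
        digits.data,
        digits.target,
        test_size=test_size,
        random_state=seed,
        stratify=digits.target,  # keep all ten digits balanced in both splits
    )
    model = KNeighborsClassifier(n_neighbors=3)
    model.fit(X_train, y_train)
    # Accuracy on images the model never saw during fitting.
    return model, model.score(X_test, y_test)


digit_images_shape = load_digits().images.shape
model, accuracy = train_digit_classifier()
print(f"Image array shape: {digit_images_shape}")
print(f"Held-out accuracy: {accuracy:.3f}")
```

The same load, split, fit, and score steps carry over to MNIST or Fashion-MNIST once the images are downloaded; only the loading line and the image size change.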

Text and language datasets

For natural language projects:

12. Common Crawl — An enormous, free archive of web page data — the kind of raw text used to train large language models. Big and unwieldy, but unmatched in scale.

13. Wikipedia dumps — The full text of Wikipedia, free to download. A clean, high-quality text corpus widely used for language tasks.

14. Sentiment and review datasets — Collections of product and movie reviews with sentiment labels (widely available on Kaggle and Hugging Face) are ideal for learning text classification.

Beginner-friendly classics

15. Iris, Titanic, and California Housing — The classic teaching datasets. Iris (flower classification) ships with scikit-learn, and California Housing (price prediction) is available through scikit-learn's dataset fetcher, which downloads it on first use; Titanic (survival prediction) is Kaggle’s famous starter competition. All three are small, clean, and well-documented, which makes them the right choice for your first model.
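
To make the "first model" idea concrete, here is a minimal sketch using Iris, which ships with scikit-learn and so loads without a download. The choice of logistic regression, the 75/25 split, and the variable names are assumptions made for illustration, not a recommended recipe.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# as_frame=True returns pandas objects, which are easier to inspect.
iris = load_iris(as_frame=True)
frame = iris.frame  # four measurement columns plus a "target" column

X = frame.drop(columns="target")
y = frame["target"]

# Hold back a quarter of the flowers to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
iris_accuracy = clf.score(X_test, y_test)

print(f"Rows and columns: {frame.shape}")
print(f"Held-out accuracy: {iris_accuracy:.3f}")
```

With only 150 rows, the whole loop runs in well under a second, which is what makes a dataset like this useful for learning the workflow before moving on to larger data.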

How to choose the right dataset

The best dataset depends on what you’re trying to do:

Your goal | Choose…
Learning the basics | Small, clean classics: Iris, MNIST, Titanic
Practicing real-world skills | Larger, messier Kaggle datasets
A specific topic | Google Dataset Search
Computer vision | MNIST → COCO → ImageNet
Natural language | Hugging Face Datasets
A portfolio project | A dataset on a topic you genuinely care about

A few practical tips:

  • Start small and clean. When learning, a tidy dataset lets you focus on the ML concepts. Save messy data for when you’re practicing data cleaning deliberately.
  • Check the licence. Most datasets here are free to use, but if your project is public or commercial, confirm the terms.
  • Pick something you care about. Motivation matters. A dataset about a topic you find genuinely interesting will keep you going when the project gets hard.
  • Mind data quality and bias. Real datasets contain errors and can carry bias. Inspect your data before trusting a model built on it.
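
The last tip, inspecting data before trusting a model, can be sketched with a few pandas checks. The `quality_report` helper and the tiny table below are hypothetical, made up for illustration; a real check would run on the dataset you actually downloaded and would likely go further.

```python
import pandas as pd


def quality_report(df, label_col):
    """Summarise common data problems before any model is trained."""
    return {
        "rows": len(df),
        # Count of missing values in each column.
        "missing_per_column": df.isna().sum().to_dict(),
        # Rows that exactly repeat an earlier row.
        "duplicate_rows": int(df.duplicated().sum()),
        # Share of each label; a heavy skew can mislead accuracy scores.
        "label_share": df[label_col].value_counts(normalize=True).to_dict(),
    }


# Tiny made-up table standing in for a real download.
toy = pd.DataFrame(
    {
        "age": [22, 38, None, 35, 35],
        "fare": [7.25, 71.28, 7.92, 53.10, 53.10],
        "survived": [0, 1, 1, 1, 1],
    }
)

report = quality_report(toy, "survived")
for key, value in report.items():
    print(f"{key}: {value}")
```

On this toy table the report flags one missing age, one duplicated row, and a label split of 80% to 20%, each of which would be worth resolving or at least noting before training.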

FAQ

Where can I find free datasets for machine learning?

The best starting points are Kaggle Datasets and the UCI Machine Learning Repository. For broader searches, use Google Dataset Search and Hugging Face Datasets. Government portals like Data.gov and the World Bank also offer huge amounts of free, reliable data.

What is the best dataset for machine learning beginners?

Classic small, clean datasets: Iris (flower classification) and California Housing (price prediction), both available through scikit-learn (Iris is bundled; California Housing is downloaded on first use), and the Titanic dataset on Kaggle. They are well-documented and let you focus on learning the machine learning workflow itself.

Is Kaggle free to use?

Yes. Kaggle is free — you can download tens of thousands of datasets, run code in free cloud notebooks, study other people’s solutions, and enter competitions, all at no cost. It’s one of the best free resources for learning machine learning.

What dataset should I use for a computer vision project?

Start with MNIST or Fashion-MNIST — small, clean image datasets ideal for a first vision model. Move up to COCO for object detection and segmentation, and ImageNet for large-scale image classification as your skills grow.

Can I use these datasets for commercial projects?

Many are freely licensed for any use, but licences vary by dataset. Always check the specific licence and terms before using a dataset in a commercial or publicly released project — don’t assume “free to download” means “free for any purpose.”

Bottom line

There has never been more free, high-quality data for machine learning than there is in 2026. For practice and projects, start with Kaggle and the UCI repository; to find something specific, use Google Dataset Search and Hugging Face. If you’re just beginning, the classic small datasets (Iris, MNIST, Titanic) remain the best place to learn the workflow.

The real advice is simple: stop collecting datasets and start using one. Pick a topic you care about, grab the data, and build a model. Hands-on practice with real data is what turns machine learning theory into skill.
