The best way to understand machine learning is to build a model yourself. It’s far less intimidating than it sounds — with Python and the right library, your first working model is about 20 lines of code. This tutorial walks through every step, explaining not just what to type but why.
Key takeaways
- You’ll use Python and scikit-learn — the standard beginner-friendly ML library.
- The workflow: load data → split it → train a model → evaluate → predict.
- The golden rule: always test on data the model never saw during training.
- No advanced math needed — scikit-learn handles the hard parts.
What you’ll build
You’ll build a classifier — a model that sorts things into categories. We’ll use the classic beginner dataset, the Iris dataset: measurements of iris flowers (petal and sepal length and width), where the task is to predict the flower’s species. It’s small, clean, and built into scikit-learn, so it’s perfect for a first model.
The same five-step workflow you learn here (load, split, train, evaluate, predict) applies to almost every machine learning project, no matter how large.
Step 1: Set up your tools
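You need Python and two libraries. scikit-learn is the workhorse: it provides datasets, algorithms, and evaluation tools in a consistent, beginner-friendly interface. pandas is a data-handling library; the code in this tutorial doesn't call it directly, but it is the usual tool for loading your own datasets later.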
Install them from your terminal:
pip install scikit-learn pandas
You can write the code in a plain .py file, but a Jupyter notebook (or a free cloud notebook like Google Colab) is ideal for learning — you run code in small pieces and see each result immediately.
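Before moving on, it can help to confirm that the install worked. The snippet below is a minimal sketch of such a check; the helper name `installed_version` is made up for this illustration. Note that scikit-learn is installed as `scikit-learn` but imported as `sklearn`.

```python
import importlib


def installed_version(module_name):
    """Return the module's version string, or None if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    # Most scientific libraries expose __version__; fall back if one does not.
    return getattr(module, "__version__", "unknown")


# "sklearn" is the import name of the scikit-learn package.
for name in ("sklearn", "pandas"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'NOT INSTALLED'}")
```

If either line reports NOT INSTALLED, rerun the pip command in the same environment your notebook or script uses.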
Step 2: Load the data
Every ML project starts with data. Here we load the built-in Iris dataset:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data # the measurements (the inputs / features)
y = iris.target # the species (the labels / answers)
print("Shape of X:", X.shape) # (150, 4) — 150 flowers, 4 measurements each
print("Classes:", iris.target_names)
Two variables matter here, and the naming is a universal convention:
- X holds the features: the inputs the model learns from (the four measurements).
- y holds the labels: the correct answers (the species).
Because we have the answers, this is supervised learning.
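To make X and y less abstract, it is worth looking at what they contain. The sketch below reloads the dataset so it runs on its own, then prints the feature names, one example row with its label, and how many flowers belong to each species.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# The four columns of X, in order.
print("Feature names:", iris.feature_names)

# One row of X is one flower; the matching entry of y is its species index.
print("First flower:", X[0], "->", iris.target_names[y[0]])

# Count how many flowers carry each label (0, 1, 2).
counts = np.bincount(y)
for name, count in zip(iris.target_names, counts):
    print(f"{name}: {count} flowers")
```

The labels in y are integers; `iris.target_names` maps each integer back to a species name.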
Step 3: Split the data
This is the most important step for getting an honest result. You must split your data into two parts:
- A training set the model learns from.
- A test set the model never sees during training — used only to evaluate it.
If you tested on the same data you trained on, you’d just be measuring memorization, not real learning. (This is how you catch overfitting.)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
test_size=0.2 keeps 20% of the data for testing and trains on the other 80%. random_state=42 just makes the random split reproducible, so you get the same result every run.
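A quick way to check the split is to look at the resulting shapes and to confirm that the same seed gives the same split. The sketch below also shows the optional `stratify=y` argument, which this tutorial does not otherwise use; it keeps the species proportions equal in both sets.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training set:", X_train.shape)  # (120, 4)
print("Test set:", X_test.shape)       # (30, 4)

# Same seed, same split: repeating the call reproduces the training set.
X_train_again, _, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Identical on second call:", np.array_equal(X_train, X_train_again))

# Optional extension (not part of the tutorial's code): a stratified split
# puts the same number of each species into the test set.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print("Test flowers per species (plain):     ", np.bincount(y_test, minlength=3))
print("Test flowers per species (stratified):", np.bincount(y_test_strat, minlength=3))
```

On a small dataset like this one, stratifying is a cheap way to make sure no species is underrepresented in the test set.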
Step 4: Choose and train a model
Now the machine learning itself. We’ll use a Random Forest — an accurate, reliable, beginner-friendly algorithm (see our algorithms guide).
In scikit-learn, training a model takes two lines after the import:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
That .fit() call is the training. The model studies the training features and their labels and learns the patterns that connect measurements to species. scikit-learn handles all the math behind that single line.
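After `.fit()`, the model object carries what it learned in attributes whose names end with an underscore. The sketch below repeats the earlier steps so it runs on its own, then inspects a few of those attributes on the fitted Random Forest.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# A random forest is a collection of decision trees built during fit().
print("Number of trees:", len(model.estimators_))

# The label values the model saw during training.
print("Classes:", model.classes_)

# How much each measurement contributed to the trees' decisions.
# The importances are normalized, so they sum to 1.
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```

The exact importance values depend on the random seed and the scikit-learn version, so treat them as a rough ranking rather than precise figures.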
Step 5: Evaluate the model
Now check how well it learned — using the test set it has never seen:
from sklearn.metrics import accuracy_score
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
.predict() asks the model to classify the test flowers; accuracy_score compares its guesses to the true answers. On the Iris dataset you’ll typically see accuracy around 95–100% — your model correctly identifies almost every flower it never saw before.
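Accuracy is simply the fraction of predictions that match the true labels, which the sketch below computes by hand next to `accuracy_score`. It also scores the model on its own training data to illustrate why that number is not a fair test; the variable names are chosen for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Accuracy by hand: the share of positions where guess and answer agree.
manual_accuracy = np.mean(predictions == y_test)
library_accuracy = accuracy_score(y_test, predictions)
print(f"Manual accuracy:  {manual_accuracy:.2%}")
print(f"accuracy_score:   {library_accuracy:.2%}")

# Scoring on the training data measures how well the model remembers
# examples it has already seen, so this number tends to be flattering.
train_accuracy = accuracy_score(y_train, model.predict(X_train))
print(f"Training accuracy: {train_accuracy:.2%}")
```

On Iris the two scores end up close because the task is easy. On harder datasets, a training score far above the test score is the usual sign of overfitting.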
Step 6: Make a prediction on new data
The real payoff: using the model on brand-new input. Give it a set of measurements and it predicts the species:
new_flower = [[5.1, 3.5, 1.4, 0.2]] # sepal length, sepal width, petal length, petal width (cm)
prediction = model.predict(new_flower)
print("Predicted species:", iris.target_names[prediction[0]])
That’s a complete machine learning model: trained, tested, and making predictions on data it has never encountered.
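Beyond a single label, the classifier can also report how confident it is. The sketch below uses `predict_proba` on the same example flower; for a Random Forest these numbers reflect how the trees voted, so read them as a rough confidence rather than a calibrated probability.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Same order as the training columns:
# sepal length, sepal width, petal length, petal width (cm).
new_flower = [[5.1, 3.5, 1.4, 0.2]]

# predict_proba returns one row per input and one column per class.
proba = model.predict_proba(new_flower)[0]
for name, p in zip(iris.target_names, proba):
    print(f"{name}: {p:.2f}")

print("Most likely species:", iris.target_names[proba.argmax()])
```

The input must always be a 2D structure (a list of rows), even for a single flower, which is why `new_flower` uses double brackets.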
The complete workflow
Setup aside, the five steps that follow it are not just an exercise. They are the skeleton of essentially every supervised ML project:
| Step | What it does |
|---|---|
| 1. Load data | Get features (X) and labels (y) |
| 2. Split data | Separate training and test sets |
| 3. Train | model.fit() learns the pattern |
| 4. Evaluate | Measure accuracy on unseen test data |
| 5. Predict | model.predict() on new inputs |
Bigger projects add data cleaning, feature preparation, and model tuning — but this core loop stays the same.
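For reference, here is the whole loop from the table assembled into one script. It only combines the snippets from the steps above; nothing new is introduced.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load data: features (X) and labels (y).
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data: hold back 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train: fit the model on the training set only.
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate: measure accuracy on the unseen test set.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2%}")

# 5. Predict: classify a new flower.
new_flower = [[5.1, 3.5, 1.4, 0.2]]
species = iris.target_names[model.predict(new_flower)[0]]
print("Predicted species:", species)
```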
Where to go next
To keep building:
- Try other algorithms: swap RandomForestClassifier for LogisticRegression or SVC and compare. scikit-learn’s consistent interface makes this trivial.
- Try other datasets: practice on free datasets that interest you.
- Learn data preparation: real data is messy; cleaning and preparing it is most of the job.
- Explore evaluation: accuracy is just one metric; learn precision, recall, and cross-validation.
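As a starting point for the first and last suggestions, the sketch below trains three classifiers through the same interface and compares them with both test accuracy and 5-fold cross-validation, then prints precision and recall for the Random Forest. The `max_iter=1000` setting for LogisticRegression is a choice made here to avoid convergence warnings, and the `candidates` and `results` names are just for this example.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every scikit-learn classifier exposes the same fit / predict / score methods.
candidates = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "SVC": SVC(),
}

results = {}
for name, candidate in candidates.items():
    candidate.fit(X_train, y_train)
    test_accuracy = candidate.score(X_test, y_test)
    # cross_val_score trains fresh copies on 5 different splits of the data,
    # which gives a steadier estimate than a single train/test split.
    cv_mean = cross_val_score(candidate, X, y, cv=5).mean()
    results[name] = (test_accuracy, cv_mean)
    print(f"{name}: test accuracy {test_accuracy:.2%}, 5-fold CV mean {cv_mean:.2%}")

# Precision and recall per species for one of the fitted models.
forest_predictions = candidates["RandomForestClassifier"].predict(X_test)
print(classification_report(y_test, forest_predictions, target_names=iris.target_names))
```

With only 30 test flowers, a single misclassification shifts test accuracy by more than three percentage points, which is one reason the cross-validated score is the more trustworthy comparison here.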
FAQ
How do I build a machine learning model in Python?
Use the scikit-learn library. The workflow is: load your data into features (X) and labels (y), split it into training and test sets, create a model and call .fit() to train it, evaluate it on the test set, and use .predict() for new data. A first model is about 20 lines of code.
What library should beginners use for machine learning?
scikit-learn. It offers a wide range of algorithms, built-in datasets, and evaluation tools through one simple, consistent interface, and it handles the underlying math for you. It’s the standard starting point before moving to deep learning frameworks.
Do I need to be good at math to build an ML model?
No. To build models with scikit-learn you need only basic Python and an understanding of the workflow. The library handles the math. Deeper math becomes useful later if you want to tune models expertly or do research.
Why do I need to split data into training and test sets?
So you can measure real performance. If you test a model on the same data it trained on, you only measure memorization. A separate test set the model never saw shows whether it genuinely learned the pattern and can generalize to new data.
What does model.fit() do?
.fit() is the training step. It feeds the training features and labels to the algorithm, which adjusts its internal parameters to learn the patterns connecting inputs to correct answers. After .fit(), the model is trained and ready to make predictions.
Bottom line
Building your first machine learning model is genuinely a short, achievable project: install scikit-learn, then load, split, train, evaluate, and predict. Those five steps are the foundation of nearly every supervised ML project you’ll ever build.
Don’t just read this — open a notebook and run the code. Change the algorithm, try a different dataset, break things and fix them. The concepts in machine learning click far faster once you’ve trained a model with your own hands. When you’re ready for more, grab a free dataset and build something of your own.
