The best way to understand machine learning is to build a model yourself. It’s far less intimidating than it sounds: with Python and the right library, your first working model is about 20 lines of code. This tutorial walks through every step, explaining not just what to type but why.
Key takeaways
- You’ll use Python and scikit-learn — the standard beginner-friendly ML library.
- The workflow: load data → split it → train a model → evaluate → predict.
- The golden rule: always test on data the model never saw during training.
- No advanced math needed — scikit-learn handles the hard parts.
Contents
- What you’ll build
- Step 1: Set up your tools
- Step 2: Load the data
- Step 3: Split the data
- Step 4: Choose and train a model
- Step 5: Evaluate the model
- Step 6: Make a prediction on new data
- The complete workflow
- Where to go next
- Common mistakes that quietly break your first model
- Frequently asked questions
- Conclusion
What you’ll build
You’ll build a classifier — a model that sorts things into categories. We’ll use the classic beginner dataset, the Iris dataset: measurements of iris flowers (petal and sepal length and width), where the task is to predict the flower’s species. It’s small, clean, and built into scikit-learn, so it’s perfect for a first model.
The same five steps you learn here (load, split, train, evaluate, predict) apply to almost every machine learning project, no matter how large.
Step 1: Set up your tools
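You need Python and two libraries: scikit-learn and pandas. scikit-learn is the workhorse; it provides datasets, algorithms, and evaluation tools in a consistent, beginner-friendly interface. pandas is not needed for the Iris example itself, but you will use it as soon as you load your own data from a file.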
Install them from your terminal:
pip install scikit-learn pandas
You can write the code in a plain .py file, but a Jupyter notebook (or a free cloud notebook like Google Colab) is ideal for learning — you run code in small pieces and see each result immediately.
Step 2: Load the data
Every ML project starts with data. Here we load the built-in Iris dataset:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data # the measurements (the inputs / features)
y = iris.target # the species (the labels / answers)
print("Shape of X:", X.shape) # (150, 4) — 150 flowers, 4 measurements each
print("Classes:", iris.target_names)
Two variables matter here, and the naming is a universal convention:
- X holds the features: the inputs the model learns from (the four measurements).
- y holds the labels: the correct answers (the species).
Because we have the answers, this is supervised learning.
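To see what X and y actually contain, it can help to look at the same data as a table. The sketch below assumes pandas is installed (it was part of the pip command above) and uses load_iris(as_frame=True), which returns pandas objects instead of arrays. The variable names frame and species_counts are only illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True asks scikit-learn for pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
frame = iris.frame  # the four feature columns plus a "target" column

print(frame.head())  # the first five flowers

# Count how many flowers belong to each species (target is 0, 1, or 2)
species_counts = frame["target"].value_counts().sort_index()
print(species_counts)
```

Each species appears equally often, which is one reason plain accuracy is a reasonable metric for this particular dataset.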
Step 3: Split the data
This is the most important step for getting an honest result. You must split your data into two parts:
- A training set the model learns from.
- A test set the model never sees during training — used only to evaluate it.
If you tested on the same data you trained on, you’d just be measuring memorization, not real learning. (This is how you catch overfitting.)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
test_size=0.2 keeps 20% of the data for testing and trains on the other 80%. random_state=42 just makes the random split reproducible, so you get the same result every run.
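If you want to confirm what the split produced, printing the shapes is a quick check. The sketch below also adds stratify=y, an optional train_test_split argument that the tutorial's own split does not use; it keeps the species proportions the same in both parts, so treat it as a variation rather than the main recipe.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# stratify=y is an optional extra: it keeps class proportions equal in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training set:", X_train.shape)  # 80% of the 150 flowers
print("Test set:", X_test.shape)       # the remaining 20%

# Number of test flowers per species (classes 0, 1, 2)
print("Test flowers per species:", np.bincount(y_test))
```

Stratifying tends to matter more on small or lopsided datasets, where a purely random split can leave a class underrepresented in the test set.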
Step 4: Choose and train a model
Now the machine learning itself. We’ll use a Random Forest — an accurate, reliable, beginner-friendly algorithm (see our algorithms guide).
In scikit-learn, training a model takes two lines once the class is imported:
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)
That .fit() call is the training. The model studies the training features and their labels and learns the patterns that connect measurements to species. scikit-learn handles all the math behind that single line.
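One way to peek at what the model learned is a random forest's feature_importances_ attribute, which is only available after .fit(). The sketch below repeats the earlier steps so it runs on its own. The importances are a rough guide to which measurements the forest relied on, not a precise explanation.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# One importance value per feature; the values sum to 1
for name, importance in zip(iris.feature_names, model.feature_importances_):
    print(f"{name}: {importance:.3f}")
```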
Step 5: Evaluate the model
Now check how well it learned — using the test set it has never seen:
from sklearn.metrics import accuracy_score
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print(f"Accuracy: {accuracy:.2%}")
.predict() asks the model to classify the test flowers; accuracy_score compares its guesses to the true answers. On the Iris dataset you’ll typically see accuracy around 95–100% — your model correctly identifies almost every flower it never saw before.
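A single accuracy number hides which species get confused with which. As a possible next look, the sketch below prints a confusion matrix and a per-class report using scikit-learn's confusion_matrix and classification_report. It rebuilds the model from the earlier steps so it can run on its own.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
predictions = model.predict(X_test)

# Rows are the true species, columns are the predicted species
matrix = confusion_matrix(y_test, predictions, labels=[0, 1, 2])
print(matrix)

# Precision, recall, and F1-score for each species
print(classification_report(
    y_test, predictions, labels=[0, 1, 2], target_names=iris.target_names
))
```

Numbers off the diagonal of the matrix are the mistakes: a flower of one species predicted as another.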
Step 6: Make a prediction on new data
The real payoff: using the model on brand-new input. Give it a set of measurements and it predicts the species:
new_flower = [[5.1, 3.5, 1.4, 0.2]] # sepal & petal measurements
prediction = model.predict(new_flower)
print("Predicted species:", iris.target_names[prediction[0]])
That’s a complete machine learning model: trained, tested, and making predictions on data it has never encountered.
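If you also want to know how confident the model is, most scikit-learn classifiers, including RandomForestClassifier, offer predict_proba. The sketch below is self-contained and reuses the same example flower; the variable name probabilities is just illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

new_flower = [[5.1, 3.5, 1.4, 0.2]]  # sepal length, sepal width, petal length, petal width

# predict_proba returns one row per input, with one probability per species
probabilities = model.predict_proba(new_flower)[0]
for species, probability in zip(iris.target_names, probabilities):
    print(f"{species}: {probability:.0%}")
```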
The complete workflow
Setup aside, those five steps (load, split, train, evaluate, predict) are not just an exercise; they’re the skeleton of essentially every supervised ML project:
| Step | What it does |
|---|---|
| 1. Load data | Get features (X) and labels (y) |
| 2. Split data | Separate training and test sets |
| 3. Train | model.fit() learns the pattern |
| 4. Evaluate | Measure accuracy on unseen test data |
| 5. Predict | model.predict() on new inputs |
Bigger projects add data cleaning, feature preparation, and model tuning — but this core loop stays the same.
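Put together, the snippets from the steps above form one short script. This is the same code as before, collected in one place with the five stages marked in comments:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load data: features (X) and labels (y)
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data: hold out 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate on the unseen test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2%}")

# 5. Predict on new input
new_flower = [[5.1, 3.5, 1.4, 0.2]]
species = iris.target_names[model.predict(new_flower)[0]]
print("Predicted species:", species)
```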
Where to go next
To keep building:
- Try other algorithms: swap RandomForestClassifier for LogisticRegression or SVC and compare. scikit-learn’s consistent interface makes this trivial.
- Try other datasets: practice on free datasets that interest you.
- Learn data preparation: real data is messy; cleaning and preparing it is most of the job.
- Explore evaluation: accuracy is just one metric; learn precision, recall, and cross-validation.
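Swapping algorithms can look like the sketch below, which trains three classifiers on the same split and compares their test accuracy. The candidates dictionary and the max_iter=1000 setting are choices made for this illustration, not part of the tutorial's original code.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Every scikit-learn classifier shares the same fit/predict interface
candidates = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000),  # extra iterations so the solver converges
    "SVC": SVC(),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.2%}")
```

On a dataset as easy as Iris the three will likely score about the same; differences become more informative on harder data.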
Common mistakes that quietly break your first model
Your model trained and printed an accuracy score — but a number that looks good is not the same as a model that works. These are the traps beginners fall into most often, and all of them are easy to avoid once you know they exist.
- Judging the model on data it already saw. If you measure accuracy on the training data, you are grading the model on the answers it memorized. A score of 100% there means nothing. Always evaluate on the held-out test set you created when you split the data — that is the only number that estimates real-world performance.
- Data leakage: letting test data influence training. This is the most damaging and least obvious mistake. If you scale, normalize, or fill in missing values before splitting, statistics from the test set (like a column’s mean) leak into training and inflate your score. The fix is strict ordering: split first, then fit any transformer on the training set only and merely apply it to the test set. scikit-learn’s documentation flags this as one of the most common pitfalls in machine learning.
- Forgetting to scale when the algorithm needs it. Distance- and gradient-based models (k-nearest neighbors, SVMs, logistic regression) are thrown off when one feature ranges 0–1 and another ranges 0–100,000. Tree-based models like random forests do not care. Know which camp your algorithm is in.
- Trusting accuracy on imbalanced data. If 95% of your examples are one class, a model that always guesses that class scores 95% while being useless. When classes are lopsided, read the precision, recall, and F1-score from classification_report instead of accuracy alone.
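The imbalanced-data trap is easy to reproduce. The sketch below uses made-up labels (95 of one class, 5 of another) and scikit-learn's DummyClassifier, a baseline that always guesses the most frequent class. It skips the train/test split on purpose, because the point here is the metric rather than the model.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, made-up labels: 95 examples of class 0 and 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)
X_fake = np.zeros((100, 1))  # placeholder features; this baseline ignores them

# A "model" that always predicts the majority class
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_fake, y_true)
guesses = baseline.predict(X_fake)

accuracy = accuracy_score(y_true, guesses)
minority_recall = recall_score(y_true, guesses, pos_label=1)

print(f"Accuracy: {accuracy:.0%}")                         # Accuracy: 95%
print(f"Recall on the rare class: {minority_recall:.0%}")  # Recall on the rare class: 0%
```

The baseline never finds a single rare example, yet its accuracy looks excellent. Comparing your real model against a baseline like this is a cheap sanity check.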
The cleanest defence against leakage is a Pipeline. Chaining your preprocessing and model into one object means every transformation is automatically fit on the right slice of data, every time — including during cross-validation:
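from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
model = make_pipeline(StandardScaler(), LogisticRegression())
Then call model.fit(X_train, y_train) exactly as before.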
One last habit worth building early: a single train/test split is a noisy estimate. Re-running with a different random split can swing your score by several points. Once you are comfortable, replace the single split with cross_val_score, which trains and tests across several folds and reports the average — a far more honest read on whether your model actually learned something.
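A minimal sketch of that habit, assuming a scaling-plus-logistic-regression pipeline and five folds (both are illustrative choices): cross_val_score refits the whole pipeline on each fold, so the scaler only ever sees that fold's training portion.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Preprocessing and model travel together, so scaling is refit inside every fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=5 trains and tests five times, each time holding out a different fifth
scores = cross_val_score(model, X, y, cv=5)

print("Score per fold:", scores)
print(f"Mean accuracy: {scores.mean():.2%} (std {scores.std():.2%})")
```

The spread between folds shows how much a single split could have misled you.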
Frequently asked questions
How do I build a machine learning model in Python?
Use the scikit-learn library. The workflow is: load your data into features (X) and labels (y), split it into training and test sets, create a model and call .fit() to train it, evaluate it on the test set, and use .predict() for new data. A first model is about 20 lines of code.
What library should beginners use for machine learning?
scikit-learn. It offers a wide range of algorithms, built-in datasets, and evaluation tools through one simple, consistent interface, and it handles the underlying math for you. It’s the standard starting point before moving to deep learning frameworks.
Do I need to be good at math to build an ML model?
No. To build models with scikit-learn you need only basic Python and an understanding of the workflow. The library handles the math. Deeper math becomes useful later if you want to tune models expertly or do research.
Why do I need to split data into training and test sets?
So you can measure real performance. If you test a model on the same data it trained on, you only measure memorization. A separate test set the model never saw shows whether it genuinely learned the pattern and can generalize to new data.
What does model.fit() do?
.fit() is the training step. It feeds the training features and labels to the algorithm, which adjusts its internal parameters to learn the patterns connecting inputs to correct answers. After .fit(), the model is trained and ready to make predictions.
My model got high accuracy — does that mean it’s good?
Not necessarily. High accuracy is only meaningful if it was measured on the held-out test set, not the training data, and if your classes are reasonably balanced. On a dataset where one class dominates, a high score can come from the model simply guessing the majority class every time. Check precision, recall, and F1-score with classification_report, and confirm the number came from data the model never trained on.
How do I save my trained model and use it later?
Use the joblib library, which is installed alongside scikit-learn as one of its dependencies. Call joblib.dump(model, "model.joblib") to write the trained model to disk, and joblib.load("model.joblib") to load it back in another script, with no retraining required. Save the entire Pipeline, not just the final estimator, so your scaling and preprocessing travel with the model and new inputs are handled identically.
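A round trip might look like the sketch below. It writes to a temporary directory so it leaves nothing behind; in your own project you would pick a real path. The pipeline and the file name model.joblib are just examples.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Train a full pipeline so the preprocessing is saved with the model
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(model, path)       # write the trained pipeline to disk
    restored = joblib.load(path)   # read it back, no retraining

# The restored pipeline should give the same predictions as the original
same = bool((restored.predict(X) == model.predict(X)).all())
print("Predictions match after reload:", same)
```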
How do I move from a built-in dataset to my own data?
Load your data with pandas — pandas.read_csv("yourfile.csv") — then separate your input columns (features, usually called X) from the column you want to predict (the target, y). From there the workflow is identical: split, train, evaluate. The new work is mostly cleaning: handling missing values, encoding text categories into numbers, and choosing which columns are actually useful. That data-preparation step is where most real-world ML time is spent.
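As a rough sketch of those first steps, the example below reads a tiny made-up CSV from memory in place of pandas.read_csv("yourfile.csv"). The column names height_cm, weight_kg, and size are hypothetical; substitute the columns from your own file.

```python
import io

import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for a real file: a small, invented CSV held in memory
csv_text = """height_cm,weight_kg,size
150,48,small
155,52,small
160,55,small
158,50,small
180,82,large
185,90,large
178,80,large
190,95,large
"""
data = pd.read_csv(io.StringIO(csv_text))

# Features are every column except the one you want to predict
X = data.drop(columns=["size"])
y = data["size"]

# From here the workflow is the same as with Iris
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print("Features:", list(X.columns))
print("Training rows:", len(X_train), "Test rows:", len(X_test))
```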
Conclusion
Building your first machine learning model is genuinely a short, achievable project: install scikit-learn, then load, split, train, evaluate, and predict. Those five steps are the foundation of nearly every supervised ML project you’ll ever build.
Don’t just read this — open a notebook and run the code. Change the algorithm, try a different dataset, break things and fix them. The concepts in machine learning click far faster once you’ve trained a model with your own hands. When you’re ready for more, grab a free dataset and build something of your own.
