Monday, 22 June 2026 | Updating Daily AI insight, written for builders

How to Build Your First Machine Learning Model in Python (2026)

Updated · Originally published May 18, 2026

The best way to understand machine learning is to build a model yourself. It’s far less intimidating than it sounds: with Python and the right library, your first working model is about 20 lines of code. This tutorial walks through every step, explaining not just what to type but why.

Key takeaways

  • You’ll use Python and scikit-learn — the standard beginner-friendly ML library.
  • The workflow: load data → split it → train a model → evaluate → predict.
  • The golden rule: always test on data the model never saw during training.
  • No advanced math needed — scikit-learn handles the hard parts.

What you’ll build

You’ll build a classifier — a model that sorts things into categories. We’ll use the classic beginner dataset, the Iris dataset: measurements of iris flowers (petal and sepal length and width), where the task is to predict the flower’s species. It’s small, clean, and built into scikit-learn, so it’s perfect for a first model.

The five-step workflow you learn here (load, split, train, evaluate, predict) applies to almost every machine learning project, no matter how large.

Step 1: Set up your tools

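You need Python and two libraries: scikit-learn and pandas. scikit-learn is the workhorse: it provides datasets, algorithms, and evaluation tools in a consistent, beginner-friendly interface. pandas is not required for the Iris example itself, but you will want it as soon as you load your own data.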

Install them from your terminal:

pip install scikit-learn pandas

You can write the code in a plain .py file, but a Jupyter notebook (or a free cloud notebook like Google Colab) is ideal for learning — you run code in small pieces and see each result immediately.

Step 2: Load the data

Every ML project starts with data. Here we load the built-in Iris dataset:

from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data      # the measurements (the inputs / features)
y = iris.target    # the species (the labels / answers)

print("Shape of X:", X.shape)   # (150, 4) — 150 flowers, 4 measurements each
print("Classes:", iris.target_names)

Two variables matter here, and the naming is a universal convention:

  • X holds the features — the inputs the model learns from (the four measurements).
  • y holds the labels — the correct answers (the species).

Because we have the answers, this is supervised learning.
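If you would like to look at the data before modelling it, one option is to load it as a pandas DataFrame. The sketch below assumes scikit-learn 0.23 or later, where load_iris accepts as_frame=True; the names df and species are only illustrative.

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)
df = iris.frame                      # 4 feature columns plus a "target" column

print(df.head())                     # the first five flowers
print(df.shape)                      # (150, 5)

# Map the numeric labels (0, 1, 2) back to species names and count them
species = df["target"].map(dict(enumerate(iris.target_names)))
print(species.value_counts())        # 50 flowers of each species
```

The rest of the tutorial keeps using the plain NumPy arrays X and y, so this step is optional.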

Step 3: Split the data

This is the most important step for getting an honest result. You must split your data into two parts:

  • A training set the model learns from.
  • A test set the model never sees during training — used only to evaluate it.

If you tested on the same data you trained on, you’d just be measuring memorization, not real learning. (This is how you catch overfitting.)

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

test_size=0.2 keeps 20% of the data for testing and trains on the other 80%. random_state=42 just makes the random split reproducible, so you get the same result every run.
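To see both parameters at work, a small check like the following can help. It prints the size of each part and confirms that repeating the split with the same random_state selects the same test rows; X_test_again is just an illustrative name.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Training rows:", X_train.shape[0])   # 120
print("Test rows:", X_test.shape[0])        # 30

# Same random_state -> the same rows land in the test set again
_, X_test_again, _, _ = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print("Identical split:", np.array_equal(X_test, X_test_again))  # True
```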

Step 4: Choose and train a model

Now the machine learning itself. We’ll use a Random Forest — an accurate, reliable, beginner-friendly algorithm (see our algorithms guide).

In scikit-learn, training a model is two lines:

from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

That .fit() call is the training. The model studies the training features and their labels and learns the patterns that connect measurements to species. scikit-learn handles all the math behind that single line.
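One way to convince yourself that .fit() really changed the model is to inspect what it stored. The sketch below assumes a reasonably recent scikit-learn (0.24 or later for n_features_in_); attributes whose names end in an underscore are created during fitting.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Attributes ending in "_" only exist after .fit() has run
print("Classes seen:", model.classes_)           # [0 1 2]
print("Features seen:", model.n_features_in_)    # 4
print("Trees built:", len(model.estimators_))    # 100 with the default settings
```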

Step 5: Evaluate the model

Now check how well it learned — using the test set it has never seen:

from sklearn.metrics import accuracy_score

predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)

print(f"Accuracy: {accuracy:.2%}")

.predict() asks the model to classify the test flowers; accuracy_score compares its guesses to the true answers. On the Iris dataset you’ll typically see accuracy around 95–100% — your model correctly identifies almost every flower it never saw before.
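To see why the test set matters, it can be instructive to score the same model on both parts of the data. This is a sketch rather than a required step; train_accuracy and test_accuracy are illustrative names, and the exact figures depend on your scikit-learn version.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# Score on rows the model learned from, then on rows it never saw
train_accuracy = accuracy_score(y_train, model.predict(X_train))
test_accuracy = accuracy_score(y_test, model.predict(X_test))

print(f"Training accuracy: {train_accuracy:.2%}")   # flattering by construction
print(f"Test accuracy:     {test_accuracy:.2%}")    # the figure to report
```

On a dataset as easy as Iris the two numbers tend to be close; on harder data a large gap between them is a typical sign of overfitting.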

Step 6: Make a prediction on new data

The real payoff: using the model on brand-new input. Give it a set of measurements and it predicts the species:

new_flower = [[5.1, 3.5, 1.4, 0.2]]   # sepal & petal measurements
prediction = model.predict(new_flower)

print("Predicted species:", iris.target_names[prediction[0]])

That’s a complete machine learning model: trained, tested, and making predictions on data it has never encountered.
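The doubled brackets in new_flower are there because .predict() expects a 2-D input: one row per flower, one column per measurement. That also means you can classify several flowers in one call, as in this sketch. The second set of measurements is made up for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)

# One inner list per flower: sepal length, sepal width, petal length, petal width
new_flowers = [
    [5.1, 3.5, 1.4, 0.2],
    [6.7, 3.0, 5.2, 2.3],   # illustrative values
]
predictions = model.predict(new_flowers)       # one label per row
names = iris.target_names[predictions]         # numeric labels -> species names

for measurements, name in zip(new_flowers, names):
    print(measurements, "->", name)
```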

The complete workflow

The five steps from loading data to predicting are not just an exercise; they’re the skeleton of essentially every supervised ML project:

Step            | What it does
1. Load data    | Get features (X) and labels (y)
2. Split data   | Separate training and test sets
3. Train        | model.fit() learns the pattern
4. Evaluate     | Measure accuracy on unseen test data
5. Predict      | model.predict() on new inputs

Bigger projects add data cleaning, feature preparation, and model tuning — but this core loop stays the same.
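For reference, here is the whole workflow gathered into one script. It only combines the snippets from the steps above, so nothing in it is new.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# 1. Load data
iris = load_iris()
X, y = iris.data, iris.target

# 2. Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 3. Train
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# 4. Evaluate on the held-out test set
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Accuracy: {accuracy:.2%}")

# 5. Predict on a new flower
new_flower = [[5.1, 3.5, 1.4, 0.2]]
prediction = model.predict(new_flower)
print("Predicted species:", iris.target_names[prediction[0]])
```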

Where to go next

To keep building:

  • Try other algorithms — swap RandomForestClassifier for LogisticRegression or SVC and compare. scikit-learn’s consistent interface makes this trivial.
  • Try other datasets — practice on free datasets that interest you.
  • Learn data preparation — real data is messy; cleaning and preparing it is most of the job.
  • Explore evaluation — accuracy is just one metric; learn precision, recall, and cross-validation.
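As a starting point for the first suggestion, the sketch below trains three algorithms on the same split and prints their test accuracy. The max_iter=1000 setting for LogisticRegression is an assumption added here to give the solver room to converge; the dictionary names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

candidates = {
    "Random forest": RandomForestClassifier(random_state=42),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Support vector machine": SVC(),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)                   # same call for every algorithm
    scores[name] = model.score(X_test, y_test)    # accuracy on the test set
    print(f"{name}: {scores[name]:.2%}")
```

Because every estimator exposes the same fit, predict, and score methods, only the dictionary entries change when you add another algorithm.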

Common mistakes that quietly break your first model

Your model trained and printed an accuracy score — but a number that looks good is not the same as a model that works. These are the traps beginners fall into most often, and all of them are easy to avoid once you know they exist.

  • Judging the model on data it already saw. If you measure accuracy on the training data, you are grading the model on the answers it memorized. A score of 100% there means nothing. Always evaluate on the held-out test set you created when you split the data — that is the only number that estimates real-world performance.
  • Data leakage: letting test data influence training. This is the most damaging and least obvious mistake. If you scale, normalize, or fill in missing values before splitting, statistics from the test set (like a column’s mean) leak into training and inflate your score. The fix is strict ordering: split first, then fit any transformer on the training set only and merely apply it to the test set. scikit-learn’s documentation flags this as one of the most common pitfalls in machine learning.
  • Forgetting to scale when the algorithm needs it. Distance- and gradient-based models (k-nearest neighbors, SVMs, logistic regression) are thrown off when one feature ranges 0–1 and another ranges 0–100,000. Tree-based models like random forests do not care. Know which camp your algorithm is in.
  • Trusting accuracy on imbalanced data. If 95% of your examples are one class, a model that always guesses that class scores 95% while being useless. When classes are lopsided, read the precision, recall, and F1-score from classification_report instead of accuracy alone.
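The "split first, then fit on the training set only" ordering from the leakage point can be sketched as follows. StandardScaler stands in here for any preprocessing step; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# 1. Split first, before any statistics are computed
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# 2. Learn the mean and standard deviation from the training rows only
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Reuse those training statistics on the test rows (transform, never fit)
X_test_scaled = scaler.transform(X_test)

# The scaler's stored means come from the training rows alone
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))   # True
```

Calling fit_transform on the full X before splitting would be the leaky version: the stored means would then include the test rows.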

The cleanest defence against leakage is a Pipeline. Chaining your preprocessing and model into one object means every transformation is automatically fit on the right slice of data, every time — including during cross-validation:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)   # exactly as before

One last habit worth building early: a single train/test split is a noisy estimate. Re-running with a different random split can swing your score by several points. Once you are comfortable, replace the single split with cross_val_score, which trains and tests across several folds and reports the average — a far more honest read on whether your model actually learned something.
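A minimal sketch of that habit, combining a pipeline with cross_val_score, might look like this. The choice of five folds and max_iter=1000 are assumptions made for the example, not recommendations from the steps above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is re-fit inside every fold, so no fold leaks into another
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=5: train on four fifths of the data, test on the remaining fifth, five times
scores = cross_val_score(model, X, y, cv=5)

print("Score per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.2%} (std {scores.std():.2%})")
```

The spread between folds gives a rough sense of how much a single train/test split could have misled you.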

Frequently asked questions

How do I build a machine learning model in Python?

Use the scikit-learn library. The workflow is: load your data into features (X) and labels (y), split it into training and test sets, create a model and call .fit() to train it, evaluate it on the test set, and use .predict() for new data. A first model is about 20 lines of code.

What library should beginners use for machine learning?

scikit-learn. It offers a wide range of algorithms, built-in datasets, and evaluation tools through one simple, consistent interface, and it handles the underlying math for you. It’s the standard starting point before moving to deep learning frameworks.

Do I need to be good at math to build an ML model?

No. To build models with scikit-learn you need only basic Python and an understanding of the workflow. The library handles the math. Deeper math becomes useful later if you want to tune models expertly or do research.

Why do I need to split data into training and test sets?

So you can measure real performance. If you test a model on the same data it trained on, you only measure memorization. A separate test set the model never saw shows whether it genuinely learned the pattern and can generalize to new data.

What does model.fit() do?

.fit() is the training step. It feeds the training features and labels to the algorithm, which adjusts its internal parameters to learn the patterns connecting inputs to correct answers. After .fit(), the model is trained and ready to make predictions.

My model got high accuracy — does that mean it’s good?

Not necessarily. High accuracy is only meaningful if it was measured on the held-out test set, not the training data, and if your classes are reasonably balanced. On a dataset where one class dominates, a high score can come from the model simply guessing the majority class every time. Check precision, recall, and F1-score with classification_report, and confirm the number came from data the model never trained on.
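The majority-class trap is easy to reproduce with made-up labels. In the sketch below, y_true and y_pred are synthetic: 95 examples of class 0, 5 of class 1, and a "model" that always answers 0.

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report, recall_score

# Synthetic, imbalanced labels: 95 of class 0 and 5 of class 1
y_true = np.array([0] * 95 + [1] * 5)

# A useless "model" that always predicts the majority class
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred)   # recall for class 1

print("Accuracy:", accuracy)                     # 0.95
print("Recall for class 1:", minority_recall)    # 0.0

# zero_division=0 reports undefined precision as 0 instead of warning
print(classification_report(y_true, y_pred, zero_division=0))
```

The report makes the problem visible at a glance: class 1 is never found, even though the headline accuracy looks strong.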

How do I save my trained model and use it later?

Use the joblib library, which is installed automatically as a dependency of scikit-learn. After import joblib, call joblib.dump(model, "model.joblib") to write the trained model to disk, and joblib.load("model.joblib") to load it back in another script, with no retraining required. Save the entire Pipeline, not just the final estimator, so your scaling and preprocessing travel with the model and new inputs are handled identically.
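A round trip might look like the sketch below. It writes to a temporary directory so it leaves nothing behind; in a real project you would pick your own path. The names pipeline and restored are illustrative.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the full pipeline so the scaler is saved together with the model
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(pipeline, path)        # write the trained pipeline to disk
    restored = joblib.load(path)       # read it back, no retraining

same = np.array_equal(pipeline.predict(X_test), restored.predict(X_test))
print("Same predictions after reload:", same)   # True
```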

How do I move from a built-in dataset to my own data?

Load your data with pandas — pandas.read_csv("yourfile.csv") — then separate your input columns (features, usually called X) from the column you want to predict (the target, y). From there the workflow is identical: split, train, evaluate. The new work is mostly cleaning: handling missing values, encoding text categories into numbers, and choosing which columns are actually useful. That data-preparation step is where most real-world ML time is spent.
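The sketch below walks through that hand-off. So that it runs anywhere, it first writes the Iris data to a temporary CSV and then treats that file as if it were your own; the file name flowers.csv and the column name target are stand-ins for whatever your data uses.

```python
import os
import tempfile

import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

with tempfile.TemporaryDirectory() as tmp_dir:
    # Stand-in for your own file: dump Iris to CSV, then read it back
    csv_path = os.path.join(tmp_dir, "flowers.csv")   # hypothetical file name
    load_iris(as_frame=True).frame.to_csv(csv_path, index=False)
    df = pd.read_csv(csv_path)

# A first cleaning check: count missing values per column
print(df.isna().sum())

# Separate the inputs from the column you want to predict
X = df.drop(columns=["target"])
y = df["target"]

# From here the workflow is the same as before
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = RandomForestClassifier(random_state=42).fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2%}")
```

With real data, the missing-value check is usually where the extra work begins.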

Conclusion

Building your first machine learning model is genuinely a short, achievable project: install scikit-learn, then load, split, train, evaluate, and predict. Those five steps are the foundation of nearly every supervised ML project you’ll ever build.

Don’t just read this — open a notebook and run the code. Change the algorithm, try a different dataset, break things and fix them. The concepts in machine learning click far faster once you’ve trained a model with your own hands. When you’re ready for more, grab a free dataset and build something of your own.
