June 28, 2026 · 6 min read

Returning to Machine Learning Fundamentals with Kaggle Titanic

I used Kaggle Titanic to rebuild my machine learning fundamentals from the ground up: build a clean baseline, keep the experiment structure versioned, compare simple models, and create a submission path I can improve over time.

Tags: Machine Learning, Kaggle, scikit-learn, Python, Portfolio

I wanted to get back into machine learning fundamentals in a way that felt practical, not performative. Kaggle Titanic was the right place to start because it is familiar, constrained, and still useful. It forces the basic questions quickly: what kind of problem is this, what does the data need before a model can use it, and what does a reasonable first submission look like?

My goal for V0 was not to squeeze every possible point out of the leaderboard. I wanted a baseline I could trust. That meant minimal feature engineering, no model ensembling, and a small group of standard classification algorithms. It also meant enough structure that future experiments can improve the score without muddying what the first version actually proved.

Key takeaways

A useful ML baseline is an engineering artifact, not just a model score.

Versioned experiment folders make it easier to improve performance without losing the original comparison point.

Reusable preprocessing matters because the Kaggle competition file has no target column, yet it still has to produce the same feature columns the model was trained on.

In V0, Logistic Regression produced the best local validation score at 0.8603.

Why Start With the Basics

Starting with Titanic might sound basic, but that was the point. I did not want the first step back into ML fundamentals to be a clever architecture or a complicated feature pipeline. I wanted to work through the plain version of the loop: inspect the data, decide the problem type, prepare the columns, train multiple models, compare scores, and generate a valid submission file.

The target is survival, which makes this a binary classification problem. That gave me a clean set of first-pass model choices: Logistic Regression, K-Nearest Neighbors, Gaussian Naive Bayes, Decision Tree, and Random Forest. Those are not exotic choices, but they are exactly the kind of tools I wanted to revisit with my hands on the keyboard instead of only remembering them conceptually.

The repo structure was part of the experiment

Before getting too deep into modeling, I set up the repo so it could grow. That mattered because I wanted the project to support future iterations without burying the baseline. A single notebook can be fine for a quick Kaggle attempt, but it becomes messy when every new idea overwrites the last one.

The Titanic repo is organized around raw data, processed data, and versioned experiments. The raw Kaggle files live in data/raw. Processed outputs have versioned homes under data/processed/V0, data/processed/V1, and so on. The actual experiment work lives under experiments/V0, experiments/V1, and future folders as needed. Each experiment can have its own notebook, notes, exported PDF, and submission file.

That means V0 can stay intact as the baseline. V1 can try a different feature set or modeling approach without rewriting history. The config.yaml file also keeps the common paths visible, so the notebook does not have to depend on a bunch of hidden assumptions about where the data lives.

data/raw holds the untouched Kaggle train and competition files.

data/processed/V* gives each future data-prep iteration a separate landing zone.

experiments/V* keeps notebooks, notes, exports, and submissions isolated by experiment version.

config.yaml records the project name and shared raw, processed, and experiment roots.
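The layout above can be sketched as a small path helper. This is only an illustration: the key names in CONFIG and the function version_paths are assumptions made for the example, and the real config.yaml may be shaped differently. A plain dict stands in for the parsed YAML so the sketch runs without extra dependencies.

```python
from pathlib import Path

# Hypothetical stand-in for the values config.yaml might hold.
# These key names are assumptions, not the repo's actual schema.
CONFIG = {
    "project_name": "titanic",
    "raw_dir": "data/raw",
    "processed_root": "data/processed",
    "experiments_root": "experiments",
}


def version_paths(version: str, config: dict = CONFIG) -> dict:
    """Return the raw, processed, and experiment folders for one version."""
    return {
        "raw": Path(config["raw_dir"]),
        "processed": Path(config["processed_root"]) / version,
        "experiment": Path(config["experiments_root"]) / version,
    }


# Raw data is shared; processed data and experiments are isolated per version.
paths_v0 = version_paths("V0")
paths_v1 = version_paths("V1")
```

The point of the design is that V1 gets new folders by changing one argument, while V0's folders are never touched.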

What V0 deliberately did not do

V0 stayed intentionally plain. I filled the missing Age values with the median, one-hot encoded Sex and Embarked, and dropped the columns I did not want to engineer yet: Name, Fare, Ticket, and Cabin. That obviously leaves signal on the table, especially in names and titles, cabins, family structure, and fare bands. But leaving that signal alone was the point of the baseline.

The baseline answers a cleaner question: what happens if I use the available columns with minimal cleanup and no ensemble strategy? From there, any improvement in later versions can be compared against a real starting point instead of a vague memory of the first notebook.
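The cleanup steps described above can be sketched with pandas. This is a minimal sketch under stated assumptions, not the notebook's actual code: the function name preprocess, the choice to pass the median in as an argument, and the made-up rows in toy are all mine.

```python
import pandas as pd


def preprocess(df: pd.DataFrame, age_median: float) -> pd.DataFrame:
    """V0-style cleanup: fill Age, one-hot encode Sex and Embarked, drop unused columns."""
    out = df.copy()
    # Fill missing ages with a median supplied by the caller (an assumption of this sketch).
    out["Age"] = out["Age"].fillna(age_median)
    # One-hot encode the two categorical columns kept in the baseline.
    out = pd.get_dummies(out, columns=["Sex", "Embarked"], dtype=int)
    # Drop the columns the baseline does not engineer yet.
    return out.drop(columns=["Name", "Fare", "Ticket", "Cabin"], errors="ignore")


# Made-up rows that only mimic the Titanic column layout.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
    "Name": ["A", "B", "C"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, None, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["t1", "t2", "t3"],
    "Fare": [7.25, 71.28, 7.92],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

clean = preprocess(toy, age_median=toy["Age"].median())
```

Passing the median in explicitly is one way to make sure the competition file is filled with the same value the training data used.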

The preprocessing function became the contract

One detail I liked was moving the prep work into a reusable preprocessing function. That kept the training data and competition data aligned. In Kaggle terms, the competition file is not really test data in the normal local-validation sense because it does not include the Survived target. It is the file used to generate predictions for Kaggle to score against hidden outcomes.

That makes consistent preprocessing important. The model sees one version of the features during training, then the submission flow has to produce the same feature shape before prediction. Encapsulating the work made the notebook less fragile and made the submission path feel like part of the pipeline instead of a final copy-paste step.
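One way the "same feature shape" requirement can break is when one-hot encoding produces different columns for the two files. The sketch below shows a possible guard using pandas reindex. It is an assumption added for illustration, not something the V0 notebook is known to do; the helper name align_to_training and the tiny frames are hypothetical.

```python
import pandas as pd

# Hypothetical training features after one-hot encoding.
train_features = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex_female": [0, 1, 1],
    "Sex_male": [1, 0, 0],
    "Embarked_C": [0, 1, 0],
    "Embarked_Q": [0, 0, 1],
    "Embarked_S": [1, 0, 0],
})

# Hypothetical competition features: no passenger embarked at Q,
# so encoding never created Embarked_Q, and the column order differs.
competition_features = pd.DataFrame({
    "Pclass": [3, 2],
    "Sex_male": [1, 0],
    "Sex_female": [0, 1],
    "Embarked_S": [1, 0],
    "Embarked_C": [0, 1],
})


def align_to_training(features: pd.DataFrame, training_columns) -> pd.DataFrame:
    """Reorder columns to match training, adding any missing ones as zeros."""
    return features.reindex(columns=training_columns, fill_value=0)


aligned = align_to_training(competition_features, train_features.columns)
```

After alignment the competition frame has the same columns in the same order as the training frame, which is what a fitted scikit-learn model expects at prediction time.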

View the V0 notebook export

I exported the V0 notebook so the code, intermediate outputs, model scores, and submission generation are visible without anyone needing to clone the repo, create a Python environment, or run the notebook locally.

Titanic Kaggle Competition V0

Notebook export showing the baseline preprocessing, model training, score comparison, best-model selection, and Kaggle submission file generation.

Comparing simple models first

For the model comparison, I trained each classifier independently and kept the model, predictions, and score in their own variables. That is not the most abstract version of the code, but it made the first pass easy to inspect. I could see each model in the notebook, see the score directly under it, and then collect the scores into one dictionary for comparison.

The local validation split used train_test_split, and V0 selected the highest-scoring model automatically before generating the competition predictions. In this run, Logistic Regression came out on top with a local validation score of 0.8603. Gaussian Naive Bayes and Random Forest tied for second at 0.8380, Decision Tree landed at 0.7877, and K-Nearest Neighbors was much weaker at 0.5196.
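The compare-then-select step can be sketched as follows. This is a condensed illustration, not the notebook's code: it uses a loop where the notebook kept each model in its own variables, it runs on synthetic data from make_classification rather than the Titanic files, and the test_size, random_state, and max_iter values are assumptions. Its scores will not match the ones reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed training features and target.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Hold out part of the labeled data for local validation (split size is an assumption).
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "K-Nearest Neighbors": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Fit each model and record its accuracy on the validation split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_valid, y_valid)

# Select the highest-scoring model automatically.
best_name = max(scores, key=scores.get)
best_model = models[best_name]
```

For classifiers, score returns mean accuracy, so the dictionary holds one directly comparable number per model.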

That result gave me a useful baseline and a useful reminder. The simplest model can be competitive when the feature space is small and the preprocessing is stable. It also gives later experiments a clear target: beat the baseline because the features or validation strategy improved, not because the notebook became harder to reason about.

Making the submission path flow

After selecting the best model, the notebook reads the competition file, runs it through the same preprocessing function, generates predictions, and writes the Kaggle submission CSV. The final file has the expected PassengerId and Survived columns with 418 prediction rows.
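The last step of that flow, turning predictions into the two-column file, can be sketched like this. The helper name build_submission and the three example rows are hypothetical, and the sketch writes to an in-memory buffer instead of a real file path so it has no side effects.

```python
import io

import numpy as np
import pandas as pd


def build_submission(passenger_ids, predictions) -> pd.DataFrame:
    """Pair each PassengerId with an integer 0/1 Survived prediction."""
    ids = list(passenger_ids)
    preds = np.asarray(predictions).astype(int)
    if len(ids) != len(preds):
        raise ValueError("Need exactly one prediction per passenger")
    return pd.DataFrame({"PassengerId": ids, "Survived": preds})


# Made-up ids and predictions; a real run would use the competition file
# and the selected model's predict() output.
submission = build_submission([892, 893, 894], [0, 1, 0])

# index=False keeps the pandas row index out of the file,
# leaving only the two expected columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```

In a real run, replacing the buffer with a path under the experiment folder would produce the file to upload, and checking len(submission) against the competition row count is a cheap final sanity check.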

That part matters more than it looks. A notebook that ends at a validation score is an analysis notebook. A notebook that produces a valid competition file is closer to a complete workflow. For this project, I wanted the whole thing to flow from baseline idea to submission artifact.

What I would improve next

The next iterations should be more careful, but still controlled. I would rather improve one layer at a time than throw every Titanic trick into V1 and lose the ability to explain what changed.

Good next steps would be extracting titles from names, preserving and transforming Fare, creating family-size features, handling Cabin more thoughtfully, scaling numeric columns where model choice benefits from it, and moving toward cross-validation instead of relying on one train-test split. Ensembling can come later, after the individual feature and model improvements are easier to defend.
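Two of those next steps, scaling and cross-validation, fit together naturally in scikit-learn. The sketch below is one possible shape for a later version, not something V0 does: it uses synthetic data, and the fold count and the choice of K-Nearest Neighbors as the example model are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for preprocessed features and target.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Putting the scaler inside the pipeline means it is re-fit on each
# training fold, so validation folds never leak into the scaling.
pipeline = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Five-fold cross-validation returns one accuracy score per fold.
fold_scores = cross_val_score(pipeline, X, y, cv=5)
mean_score = fold_scores.mean()
```

Averaging over folds gives a steadier comparison than a single train-test split, which makes it easier to tell whether a new feature actually helped.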

The real win of V0

The useful part of V0 is not that it solves Titanic forever. It does not. The useful part is that it gives the project a stable floor. I can come back to the repo, understand what the first version did, and make the next version better without guessing what changed.

That is the version of machine learning fundamentals I wanted to return to: not just remembering algorithms, but rebuilding the habit of making experiments reproducible, comparable, and honest.

If you are also revisiting machine learning through practical competitions, I would be interested in comparing baseline-first approaches.