Titanic Machine Learning Prediction

Colab Notebook

Overview

The Titanic Machine Learning project predicts whether passengers aboard the RMS Titanic survived. Using the Kaggle Titanic dataset, we built a classification model on features such as age, sex, passenger class, fare, and port of embarkation. The project involved data preprocessing, feature engineering, and a comparison of several machine learning algorithms to find the most accurate model.

Implementation Details

Dataset Overview
- Training Set (train.csv): Includes labeled data with survival outcomes.
- Test Set (test.csv): Includes unlabeled data for testing the trained model.
- Features include demographic data and ticket class. Survival status is the prediction target and is present only in the training set.
Data Preprocessing
- Missing Data: Handled missing values in features like Age, Cabin, and Embarked.
- Feature Engineering: Created new features such as ‘Age_Class’ and ‘Fare_Per_Person’, and converted categorical features like ‘Sex’ and ‘Embarked’ into numeric values.
- Feature Scaling: Normalized numerical features so that scale-sensitive models such as SGD, KNN, and SVM are not dominated by features with large value ranges.
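
The preprocessing steps above can be sketched roughly as follows. This is an illustration, not the notebook's actual code: the imputation strategy (median for Age, most frequent value for Embarked, a has-cabin flag for Cabin), the definitions of ‘Age_Class’ (Age times Pclass) and ‘Fare_Per_Person’ (Fare divided by family size), the `preprocess` helper, the `Has_Cabin` column, and the toy rows are all assumptions made for the example. Column names follow the Kaggle train.csv schema.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute, engineer features, encode categoricals."""
    out = df.copy()

    # Missing data (assumed strategy): median Age, most frequent Embarked.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])

    # Cabin is mostly empty, so keep only a presence flag (assumption).
    out["Has_Cabin"] = out["Cabin"].notna().astype(int)
    out = out.drop(columns=["Cabin"])

    # Engineered features (assumed definitions).
    out["Age_Class"] = out["Age"] * out["Pclass"]
    out["Fare_Per_Person"] = out["Fare"] / (out["SibSp"] + out["Parch"] + 1)

    # Convert categorical features to numeric codes.
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    out["Embarked"] = out["Embarked"].map({"S": 0, "C": 1, "Q": 2})
    return out


# Made-up rows in the shape of train.csv, for illustration only.
toy = pd.DataFrame({
    "Pclass": [3, 1, 1, 2],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, None, 38.0, 30.0],
    "SibSp": [1, 1, 0, 0],
    "Parch": [0, 0, 0, 0],
    "Fare": [7.25, 71.28, 53.10, 13.00],
    "Cabin": [None, "C85", "C123", None],
    "Embarked": ["S", "C", None, "S"],
})

features = preprocess(toy)

# Feature scaling: zero mean and unit variance for the numeric columns.
numeric_cols = ["Age", "Fare", "Age_Class", "Fare_Per_Person"]
scaled = features.copy()
scaled[numeric_cols] = StandardScaler().fit_transform(features[numeric_cols])
print(scaled)
```

In a real run the scaler would be fitted on the training set only and then applied to the test set, so that no test-set statistics leak into training.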
Model Selection
- Tried several machine learning algorithms, including:
  - Stochastic Gradient Descent (SGD)
  - Random Forest
  - Logistic Regression
  - K Nearest Neighbor (KNN)
  - Gaussian Naive Bayes
  - Support Vector Machine (SVM)
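
As a rough sketch of how these six algorithms might be compared side by side with scikit-learn, the example below scores each one with 5-fold cross-validation. The synthetic data from `make_classification` is only a stand-in for the preprocessed Titanic features, all hyperparameters are library defaults (an assumption, not the notebook's settings), and the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed feature matrix and labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

models = {
    "SGD": SGDClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "Gaussian Naive Bayes": GaussianNB(),
    "SVM": SVC(),
}

cv_accuracy = {}
for name, model in models.items():
    # Scaling lives inside the pipeline so each fold is scaled
    # using only its own training portion.
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
    cv_accuracy[name] = scores.mean()

# Print models from best to worst mean accuracy.
for name, acc in sorted(cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<22} {acc:.3f}")
```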
Model Evaluation
- Evaluated the models with cross-validation, reporting accuracy, precision, recall, and F1 score.
- The SVM model performed the best, achieving an accuracy of 77%.
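
To make the metrics above concrete, here is a small worked example using `sklearn.metrics`. The labels and predictions are made up for illustration and are not results from this project. With 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, all four metrics come out to 0.75.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 6 / 8
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
# prints: accuracy=0.75 precision=0.75 recall=0.75 f1=0.75
```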

Technologies Used

Python: For machine learning implementation and data analysis.
Pandas: For data manipulation and analysis.
NumPy: For numerical operations.
Scikit-learn: For building machine learning models and evaluations.
Seaborn & Matplotlib: For data visualization.
Google Colab: For cloud-based development and model training.

Results & Findings

The SVM algorithm achieved an accuracy of 77%, the best result among the models evaluated.
Feature engineering improved model performance by creating relevant features like ‘Age_Class’ and ‘Fare_Per_Person’.
The project showed that machine learning can provide valuable insights into complex, historical datasets.

Future Improvements

Revisit Random Forest with tuned hyperparameters, or try XGBoost, for potentially better performance.
Implement more advanced feature selection and hyperparameter tuning.
Apply model explainability techniques (e.g., SHAP) to improve transparency.
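
As one possible starting point for the hyperparameter tuning mentioned above, the sketch below runs a small grid search over an SVM with scikit-learn's `GridSearchCV`. The parameter grid, the pipeline step names (`scale`, `svm`), and the synthetic data are assumptions chosen for illustration. A real search would use the preprocessed Titanic features and likely a wider grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed feature matrix and labels.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=4, random_state=0
)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])

# Illustrative grid; keys use the "<step name>__<parameter>" convention.
param_grid = {
    "svm__C": [0.1, 1, 10],
    "svm__gamma": ["scale", 0.1],
    "svm__kernel": ["rbf", "linear"],
}

# Every combination is scored with 5-fold cross-validated accuracy.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```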

Contributors

Imad-Eddine NACIRI
Achraf Berriane
Errouji Oussama