Blog Time

Artificial Intelligence & Machine Learning Project

Date Created: 11 November 2022 Author: Saurabh M

1] FRESHERBELL.COM – TOP 5 SIMILAR QUIZ RECOMMENDER ON SELF-CREATED DATASET
-    Built a quiz dataset of shape (2183, 18) through web scraping and manual entry.
-    Ran EDA to check null values, column datatypes, etc., then preprocessed the text: selected the important features, lowercased, removed HTML tags with regex, applied stemming, and removed special characters, stopwords, etc.
-    Used CountVectorizer to turn each quiz into a count vector and cosine similarity to measure how similar two vectors are.
-    Created an API using FastAPI with Uvicorn and Pydantic, pickled the data, and deployed it on Heroku for public use.
-    Integrated the API into fresherbell.com to recommend the top 5 quizzes most similar to the current quiz.
https://fresherbell.com/quizdiscuss/java/what-will-be-the-output-of-the-following-java-prog4 
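
The recommendation step can be sketched roughly as follows. This is a minimal standard-library illustration, not the project's code: the project used scikit-learn's CountVectorizer and cosine similarity, while here the count vectors are plain `Counter` objects. The stopword list, the sample quizzes, and the function names `preprocess` and `top_similar` are assumptions made for the example, and stemming is left out to keep it short.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (assumption); a real list would be much longer.
STOPWORDS = {"the", "of", "a", "an", "is", "in", "what", "will", "be", "following"}


def preprocess(text):
    """Lowercase, strip HTML tags and special characters, drop stopwords."""
    text = re.sub(r"<[^>]+>", " ", text.lower())  # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text)      # remove special characters
    return [tok for tok in text.split() if tok not in STOPWORDS]


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_similar(quizzes, current_index, k=5):
    """Return the indices of the k quizzes most similar to the current one."""
    vectors = [Counter(preprocess(q)) for q in quizzes]
    current = vectors[current_index]
    scores = [
        (cosine_similarity(current, vec), i)
        for i, vec in enumerate(vectors)
        if i != current_index
    ]
    # Highest similarity first; ties broken by original position.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scores[:k]]


# Made-up quiz texts, for illustration only.
quizzes = [
    "<p>What will be the output of the following Java program?</p>",
    "What is the output of this Java loop program?",
    "Which keyword declares a constant in Java?",
    "What will be the output of the following Python program?",
    "Explain normalization in SQL databases.",
]
print(top_similar(quizzes, current_index=0, k=2))
```

With the full dataset, `k=5` would give the five recommendations shown on the quiz page.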

API LINK - https://fresherbell-quiz-api.herokuapp.com/fresherbell_quiz_api

DATASET - Self Created

 

2] KAGGLE – UCI MACHINE LEARNING DATASET - SPAM EMAIL / SMS CLASSIFIER
-    Classifies whether a message is spam or not using a Naïve Bayes classifier and an NLP approach.
-    Includes basic text preprocessing such as tokenization, stemming, and removal of special characters, stopwords, etc.
-    Used a label encoder on the target feature, Matplotlib and Seaborn for EDA, and the NLTK library for tokenization (word and sentence) and PorterStemmer.
-    Used a bag-of-words representation via CountVectorizer to convert text into numbers as part of data modelling.
-    Accuracy score: 93.23%
-    Several algorithms were tested; of these, Gaussian NB gave the best result.
-    Created a frontend application using the Streamlit framework and deployed it on Heroku.
https://email-sms-spam-classifier-sam.herokuapp.com 
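
As a rough sketch of the modelling step, the snippet below chains a label encoder, CountVectorizer, and Gaussian NB using scikit-learn. The four messages are invented for illustration and are not rows from the Kaggle dataset; the helper name `classify` is also an assumption. Tokenization and stemming with NLTK are omitted here.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import LabelEncoder

# Toy messages, invented for illustration only.
messages = [
    "win free prize now",
    "free cash prize claim now",
    "are we meeting for lunch today",
    "see you at lunch today",
]
labels = ["spam", "spam", "ham", "ham"]

# LabelEncoder sorts classes alphabetically: ham -> 0, spam -> 1.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# Bag-of-words counts; GaussianNB needs a dense array, hence toarray().
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(messages).toarray()

model = GaussianNB()
model.fit(X, y)


def classify(text):
    """Return 'spam' or 'ham' for a single message."""
    features = vectorizer.transform([text]).toarray()
    return encoder.inverse_transform(model.predict(features))[0]


print(classify("claim your free prize now"))
```

On a real dataset the messages would first be split into training and test sets so that the reported accuracy reflects unseen data.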

DATASET - https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

 

3] PIMA INDIANS - DIABETES DETECTION
-    Predicts whether a patient is diabetic or not. The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases and includes features such as Pregnancies, Glucose Level, Blood Pressure Level, Skin Thickness, etc.
-    Used basic EDA to check for null values, duplicate rows, and class imbalance, then removed the nulls and duplicates.
-    Used StandardScaler to bring the features onto the same scale and trained the model with a Support Vector Machine classifier.
-    Accuracy score: 79.23% on training data and 77.46% on test data.
-    Created an API using FastAPI with Uvicorn and Pydantic, pickled the data, and deployed it on Heroku for public use.
-    Also created a frontend application using the Streamlit framework and deployed it on Heroku.
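
The scaling and training step might look something like the sketch below. It runs on synthetic stand-in data (two generic features on very different scales), not on the Pima dataset, and the linear kernel is an assumption since the kernel used in the project is not stated.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (NOT the Pima dataset): two well-separated classes
# whose features live on very different scales.
rng = np.random.default_rng(0)
n = 200
small_feature = np.concatenate([rng.normal(2, 1, n), rng.normal(5, 1, n)])
large_feature = np.concatenate([rng.normal(100, 10, n), rng.normal(160, 10, n)])
X = np.column_stack([small_feature, large_feature])
y = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# The pipeline fits the scaler on the training split only, then trains the SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(X_train, y_train)

train_accuracy = clf.score(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print(f"train accuracy: {train_accuracy:.3f}, test accuracy: {test_accuracy:.3f}")
```

Wrapping the scaler and classifier in one pipeline keeps test data out of the scaling statistics, which is why training and test accuracy can be compared fairly.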

GOOGLE COLAB - https://colab.research.google.com/drive/1c3gH21gZYL-wWeXZfQ4LVXijiTVculHe

APP LINK - https://diabetes-detection-ssaurabh.herokuapp.com/

API LINK - https://api-diabetes-ml.herokuapp.com/diabetes_prediction


4] AMAZON DELIVERY TIME PREDICTION – HACKEREARTH
-    Predicts delivery time from multiple features including weather, road traffic density, vehicle type, etc.
-    Used the glob library to read 54k text files containing the training data and convert them into CSV format.
-    Performed basic EDA and handled missing data, duplicates, and outliers.
-    Applied an ordinal encoder and a standard scaler to the relevant features.
-    Used the Geopy library to calculate the distance between the restaurant and the delivery location.
-    Compared several regression algorithms; Random Forest gave the highest accuracy, 82.34%.
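
To illustrate the distance feature, here is a standard-library sketch using the haversine formula. It is a swapped-in approximation, not the project's code: the project used Geopy, whose `geopy.distance.geodesic` or `great_circle` functions would take the place of `haversine_km`. The coordinates and the function name are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical coordinates for a restaurant and a delivery location.
restaurant = (12.9716, 77.5946)
delivery = (13.0358, 77.5970)

distance_km = haversine_km(*restaurant, *delivery)
print(f"{distance_km:.2f} km")
```

The resulting distance becomes one more numeric column alongside weather, traffic density, and vehicle type before the regressors are trained.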

 

5] HOUSE PRICE PREDICTION - KAGGLE DATASET
-    Most of us have had to look for a new house to buy at some point, and the search usually involves the risk of fraud, negotiating deals, researching local areas, and so on.
-    Performed basic EDA and handled missing data, duplicates, and outliers.
-    Applied Ridge Regression to predict house prices.
https://github.com/Sssaurabh425/End-To-End-Machine-Learning-Project/tree/main/House%20Price%20Prediction
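
For readers unfamiliar with Ridge Regression, the sketch below fits scikit-learn's `Ridge` on synthetic data with known coefficients. The data, the coefficients, and `alpha=1.0` are assumptions for illustration and have nothing to do with the Kaggle house price dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic data with a known linear relationship plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
true_coef = np.array([3.0, -2.0])
y = X @ true_coef + 5.0 + rng.normal(scale=0.1, size=200)

# alpha controls the strength of the L2 penalty on the coefficients.
model = Ridge(alpha=1.0)
model.fit(X, y)

print("coefficients:", model.coef_)
print("intercept:", model.intercept_)
```

The fitted coefficients land close to the true ones but are pulled slightly toward zero; that shrinkage is what helps Ridge cope with correlated or noisy features.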

 

6] QUORA DUPLICATE QUESTION PAIRS - KAGGLE DATASET
-    Quora launched a Kaggle competition to predict, and so help prevent, duplicate questions on the Quora website.
-    Includes basic text preprocessing such as tokenization, stemming, and removal of special characters, stopwords, etc.
-    Used a label encoder on the target feature, Matplotlib and Seaborn for EDA, and the NLTK library for tokenization (word and sentence) and PorterStemmer.
-    Used a bag-of-words representation via CountVectorizer to convert text into numbers as part of data modeling.
-    Using XGBClassifier, the model reaches an accuracy of 79.3%.
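
One plausible way to turn question pairs into bag-of-words features is sketched below: a shared CountVectorizer vocabulary, with the counts for both questions placed side by side. The toy questions are invented, and the side-by-side layout is an assumption about how the pairs were encoded, not a confirmed detail of the project.

```python
from scipy.sparse import hstack
from sklearn.feature_extraction.text import CountVectorizer

# Toy question pairs, invented for illustration only.
question1 = ["how do i learn python", "what is the capital of france"]
question2 = ["what is the best way to learn python", "how old is the earth"]

# Fit one shared vocabulary so both questions map to the same columns.
vectorizer = CountVectorizer()
vectorizer.fit(question1 + question2)

q1_counts = vectorizer.transform(question1)
q2_counts = vectorizer.transform(question2)

# One row per pair: counts for question 1 followed by counts for question 2.
pair_features = hstack([q1_counts, q2_counts]).tocsr()
print(pair_features.shape)
```

A matrix like `pair_features`, together with the 0/1 duplicate labels, is the kind of input that can then be passed to `XGBClassifier.fit`.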

https://github.com/Sssaurabh425/End-To-End-Machine-Learning-Project/tree/main/Quora%20Question%20Pairs