HEART ATTACK ANALYSIS AND PREDICTION USING DECISION TREE ALGORITHM

Introduction

In this post, you will learn to develop a program that generates a decision tree for a heart attack dataset using Python and MongoDB, along with user authentication. The program is intended to estimate a patient's chance of having a heart attack based on their medical characteristics. The following subjects are covered in detail.

Overview

What is the decision tree algorithm?
How to build a decision tree for the heart attack dataset using Python and MongoDB.
An explanation of how to use/operate the program.

What is the Decision Tree Algorithm?

The decision tree is one of the most popular classification algorithms and among the easiest to understand and interpret. It is a supervised machine learning technique used for solving regression and classification problems. As the name implies, a decision tree is a tree-based algorithm that divides the data into parts and generates decision rules that help in coming up with a prediction or a label.

Important Terms related to Decision Trees

Root Node: It represents the entire sample, which is then divided into two or more homogeneous sets.
Decision Node: A sub-node that splits into further sub-nodes is called a decision node.
Leaf / Terminal Node: A sub-node that does not split any further; it represents a possible outcome.
Branch / Sub-Tree: A branch or sub-tree is a subsection of the overall tree.
Parent and Child Node: A node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.
Splitting: The process of dividing a node into two or more sub-nodes.

The most popular algorithms for splitting are listed below:

Gini Impurity

Gini says that if we select two items at random from a population, they should belong to the same class; the probability of this is 1 if the population is pure. The higher this probability, the more homogeneous the node.
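The purity idea above is usually turned into a score called Gini impurity: 1 minus the probability that two items drawn at random (with replacement) share a class. The sketch below is a minimal, illustrative implementation; the function name `gini_impurity` and the toy labels are assumptions, not part of the original program.

```python
from collections import Counter

def gini_impurity(labels):
    """Return 1 - sum(p_k^2), where p_k is the proportion of class k in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((n / total) ** 2 for n in counts.values())

pure = gini_impurity([1, 1, 1, 1])   # every item has the same class
mixed = gini_impurity([0, 0, 1, 1])  # two classes, evenly mixed
print(pure, mixed)
```

A pure node scores 0, and an evenly mixed two-class node scores 0.5, so lower values mean more homogeneous nodes.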

Chi-Square

This criterion measures the statistical significance of the differences between the sub-nodes and the parent node.
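One common way to compute this is to compare, for every sub-node and class, the actual class count with the count expected if the sub-node had the same class mix as its parent. The following is a rough sketch under that assumption; `chi_square_of_split` is a hypothetical helper name and the labels are toy data.

```python
from collections import Counter

def chi_square_of_split(parent, children):
    """Sum (actual - expected)^2 / expected over every child node and class."""
    parent_counts = Counter(parent)
    n_parent = len(parent)
    statistic = 0.0
    for child in children:
        child_counts = Counter(child)
        for cls, parent_count in parent_counts.items():
            # Expected count if the child kept the parent's class proportions.
            expected = len(child) * parent_count / n_parent
            actual = child_counts.get(cls, 0)
            if expected > 0:
                statistic += (actual - expected) ** 2 / expected
    return statistic

parent = [0, 0, 1, 1]
separating = chi_square_of_split(parent, [[0, 0], [1, 1]])
uninformative = chi_square_of_split(parent, [[0, 1], [0, 1]])
print(separating, uninformative)
```

A larger statistic means the sub-nodes differ more from the parent, so the split is more significant.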

Information Gain

When the target variable is categorical, Information Gain is used to split the nodes. It is based on entropy, which measures a node's impurity: the purity of a node increases as its entropy decreases, and a homogeneous node has zero entropy. Information Gain is the entropy of the parent node minus the weighted average entropy of its sub-nodes, so splits that produce purer sub-nodes have a higher gain. For a two-class target, entropy is at most 1, so the maximum Information Gain is 1.
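To make the entropy calculation concrete, here is a small sketch using base-2 logarithms. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of the class distribution in `labels`."""
    total = len(labels)
    if total == 0:
        return 0.0
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, children):
    """Parent entropy minus the size-weighted entropy of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

parent = [0, 0, 1, 1]
perfect_split = information_gain(parent, [[0, 0], [1, 1]])
useless_split = information_gain(parent, [[0, 1], [0, 1]])
print(perfect_split, useless_split)
```

The split that separates the two classes completely gains the full 1 bit, while the split that leaves both sub-nodes as mixed as the parent gains nothing.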

Reduction in Variance

This criterion is used when the target variable is continuous (regression problems): the split that most reduces the variance of the target within the sub-nodes is chosen.
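As a hedged illustration, variance reduction can be computed as the parent's variance minus the size-weighted variance of the sub-nodes. The function name `variance_reduction` and the numbers are made up for the example.

```python
from statistics import pvariance

def variance_reduction(parent, children):
    """Parent variance minus the size-weighted variance of the child nodes."""
    total = len(parent)
    weighted = sum(len(child) / total * pvariance(child) for child in children)
    return pvariance(parent) - weighted

parent = [1.0, 1.0, 5.0, 5.0]
good_split = variance_reduction(parent, [[1.0, 1.0], [5.0, 5.0]])
poor_split = variance_reduction(parent, [[1.0, 5.0], [1.0, 5.0]])
print(good_split, poor_split)
```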

How do Decision Trees work?

Decision trees use one of the criteria described above to decide how to split a node into two or more sub-nodes. Creating sub-nodes increases the homogeneity of the resulting sub-nodes; in other words, the purity of each node with respect to the target variable increases. The decision tree evaluates splits on all available variables and then chooses the split that produces the most homogeneous sub-nodes.

Steps in algorithm:

The algorithm starts with the original set as the root node.
In each iteration, the method calculates the entropy (H) and information gain (IG) of every unused attribute of the set.
The attribute with the lowest entropy or greatest information gain is then selected.
The set is then split by the selected attribute to produce subsets of the data.
The algorithm recurses on each subset, considering only attributes that have not been selected before.
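The steps above can be sketched as a short recursive function in the style of ID3. This is a toy illustration for categorical attributes only; the function names, the attribute names `chest_pain` and `smoker`, and the four rows are all invented for the example and are not taken from the heart attack dataset.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return sum(-(n / total) * math.log2(n / total) for n in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    """Pick the attribute whose split gives the highest information gain."""
    def gain(attr):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attr], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def id3(rows, labels, attributes):
    # Stop when the node is pure or no unused attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    remaining = [a for a in attributes if a != attr]  # never reuse an attribute
    tree = {attr: {}}
    for value in sorted({row[attr] for row in rows}):
        subset = [(r, l) for r, l in zip(rows, labels) if r[attr] == value]
        sub_rows, sub_labels = zip(*subset)
        tree[attr][value] = id3(list(sub_rows), list(sub_labels), remaining)
    return tree

# Toy rows with made-up categorical features.
rows = [
    {"chest_pain": "yes", "smoker": "no"},
    {"chest_pain": "yes", "smoker": "yes"},
    {"chest_pain": "no", "smoker": "yes"},
    {"chest_pain": "no", "smoker": "no"},
]
labels = [1, 1, 0, 0]
tree = id3(rows, labels, ["chest_pain", "smoker"])
print(tree)
```

In this toy data only `chest_pain` separates the labels, so it is chosen at the root and both branches become leaves immediately.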

How to build a decision tree for the heart attack dataset using Python and MongoDB

Let's now walk through the steps involved in implementing a decision tree using the heart attack analysis dataset.

Here I used MongoDB for storing and managing the data and Spyder as the Python development environment.

You should first create a MongoDB account using your email address. After that, follow the guide below to set up a cluster with a username and password.

https://www.mongodb.com/docs/v6.0/installation/

1. Problem Definition

Given clinical parameters about a patient, can we predict whether the patient has a chance of heart attack or not? We aim to reach a model accuracy of more than 80% using the decision tree algorithm.

Dataset: the heart attack dataset from Kaggle is used. Details of the dataset are as follows:

https://www.kaggle.com/datasets/rashikrahmanpritom/heart-attack-analysis-prediction-dataset

displaying all columns of heart attack dataset

2. Understanding Features

1. age: displays the age of the individual.
2. sex:

displays the gender in the following format:

• 1 = male

• 0 = female

3. cp (Chest-Pain Type):

displays the type of chest pain in the individual as follows:
           • 0 = typical angina
           • 1 = atypical angina
           • 2 = non-anginal pain
           • 3 = asymptomatic
4. trestbps (Resting Blood Pressure):

displays the resting blood pressure value in mmHg.
5. chol (Serum Cholesterol):

displays the serum cholesterol in mg/dl.
6. fbs (Fasting Blood Sugar):

compares an individual's fasting blood sugar value with 120 mg/dl.
• If fasting blood sugar > 120 mg/dl, the value is 1 (true); otherwise it is 0 (false).
7. restecg (Resting ECG):

displays resting electrocardiographic results

             • 0 = normal
             • 1 = having ST-T wave abnormality
             • 2 = left ventricular hypertrophy
8. thalach (Max Heart Rate Achieved):

displays the maximum heart rate achieved.
9. exang (Exercise-induced angina):
• 1 = yes
• 0 = no
10. oldpeak (ST depression induced by exercise relative to rest):

displays the value as an integer or float.
11. slope (Peak exercise ST segment):
           • 0 = upsloping
           • 1 = flat
           • 2 = downsloping
12. ca (Number of major vessels (0–3) colored by fluoroscopy):

displays the value as an integer or float.

13. thal:

displays the thalassemia status (thalassemia is an inherited blood disorder that causes the body to have less hemoglobin than normal):
           • 0 = normal
           • 1 = fixed defect
           • 2 = reversible defect
14. output (Diagnosis of heart disease):

displays whether the individual has a chance of heart attack or not:
• 0 = absence

• 1 = present

Classes and Functions: Explanation

1. class FirstScreen_GUI - Used for the main Tkinter screen

Calls another class to perform the data analysis for the heart attack dataset.
Opens the GUI with buttons for the analysis graph, the decision tree and the prediction.
show_analysis() function - opens a new GUI window and displays a chart of heart attack occurrence in males and females with respect to the output variable in the dataset.
generate_decisiontree() function - creates the decision tree, stores the image in a local folder and shows its path in a message box.
Predicting_Heartattack() function - opens a new GUI window, asks the user to enter values for the attributes manually and calculates the chance of a heart attack from the input data.

2. class DataAnalysis - Used to import and process the dataset

importdata() function - reads the CSV file, converts it to JSON format, and inserts the data into and retrieves it from MongoDB.
feature_analysis() function - performs exploratory data analysis and feature selection.
splitdataset() function - splits the dataset into training and testing data and trains the model with entropy and the Gini index. The prediction and cal_accuracy functions are added to compute the predictions and the test accuracy, respectively.

Detailing the program flow

Step 1 : Import required packages and the tools for the implementation.

Step 2 : Create a Main Screen using Tkinter

This step creates the main screen using Tkinter. This class calls another class to perform the data analysis operations.

The following code is used to create the GUI for the main screen.

Step 3 : Import the Dataset from CSV to MongoDB

A MongoDB connection is created with the pymongo.MongoClient method, which connects to the database using a user name and password.

In the Data_Analysis class, the importdata() function reads the heart attack dataset from a CSV file, analyzes it, converts it to JSON format and inserts it into MongoDB.

code for reading csv file and creating database on MongoDB

Performing Data Analysis

This step prepares the dataset by checking for null values and cleaning the data if required.

In the heart attack dataset, there are no missing values or string values. Hence, the dataset can be used as-is for the implementation.

Note: I imported the dataset into MongoDB only after checking whether it required cleaning or had missing values.

Inserting dataset into MongoDB

The following code inserts the analyzed heart attack dataset into MongoDB after converting it to JSON format.

After inserting the dataset into MongoDB, you can view the data collection in Compass.
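A minimal sketch of this import step, assuming pandas for reading the CSV and pymongo for the connection. The helper names, the database name `heart_db`, the collection name `heart_attack` and the two-row sample are hypothetical; the connection URI must be replaced with your own cluster's connection string.

```python
import io
import pandas as pd

def csv_to_documents(csv_source):
    """Read a CSV file (path or file-like object) into a list of dict documents."""
    frame = pd.read_csv(csv_source)
    return frame.to_dict(orient="records")

def insert_documents(collection, documents):
    """Insert the documents and return how many were written."""
    result = collection.insert_many(documents)
    return len(result.inserted_ids)

def connect(uri, db_name="heart_db", collection_name="heart_attack"):
    """Return a collection handle; the default names are placeholders."""
    from pymongo import MongoClient  # imported lazily so the helpers above run without a server
    return MongoClient(uri)[db_name][collection_name]

# Two made-up rows standing in for the real CSV file.
sample_csv = io.StringIO("age,sex,output\n63,1,1\n41,0,1\n")
documents = csv_to_documents(sample_csv)
print(documents)

# With a real cluster (URI is an assumption):
# collection = connect("mongodb+srv://<username>:<password>@<cluster-host>/")
# insert_documents(collection, documents)
```

Passing the collection in as an argument keeps the conversion logic independent of the database connection, which makes it easy to try out without a running cluster.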

Step 4 : Retrieving data from MongoDB and performing all functions available in the Tkinter window (main screen)

To retrieve data from MongoDB, we use the find() function; our retrieval function returns a DataFrame with the current contents of the collection.

code for retrieving data
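A sketch of what such a retrieval helper might look like, assuming pandas and a pymongo collection object. The function name `collection_to_dataframe` is hypothetical.

```python
import pandas as pd

def collection_to_dataframe(collection):
    """Load every document of a MongoDB collection into a pandas DataFrame."""
    # An empty filter matches every document; the projection drops MongoDB's _id field.
    cursor = collection.find({}, {"_id": 0})
    return pd.DataFrame(list(cursor))
```

Dropping `_id` keeps MongoDB's internal identifier from being treated as a feature column later on.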

Next, we generate a chart that displays the relationship between the sex (gender) parameter and the output field of the heart attack dataset.

Here we use the Python matplotlib library to draw the graphs.

In the graph, the output variable is on the x-axis and the count of individuals is on the y-axis, grouped by sex. After creating a figure of size 5x4 (matplotlib figure sizes are given in inches), we used FigureCanvasTkAgg to draw the graph in the GUI.

The chart gives an overall idea of whether females or males have a higher chance of heart disease.
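The counting behind such a chart can be sketched as follows, assuming pandas and matplotlib. The six rows are made up, the function name `sex_output_figure` is hypothetical, and the off-screen Agg backend is used here instead of FigureCanvasTkAgg so the sketch runs without a GUI.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

def sex_output_figure(frame):
    """Bar chart of how many males and females fall into each output class."""
    counts = pd.crosstab(frame["output"], frame["sex"])
    counts = counts.rename(columns={0: "female", 1: "male"})
    figure, axes = plt.subplots(figsize=(5, 4))  # width and height in inches
    counts.plot(kind="bar", ax=axes)
    axes.set_xlabel("output")
    axes.set_ylabel("count")
    return figure, counts

# Made-up rows, not the real dataset.
frame = pd.DataFrame({"sex": [1, 1, 0, 1, 0, 0], "output": [1, 0, 1, 1, 1, 0]})
figure, counts = sex_output_figure(frame)
print(counts)
plt.close(figure)
```

In a Tkinter application, the returned figure would be handed to FigureCanvasTkAgg instead of being closed.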

Step 5 : Splitting the Data set to Training and Testing set

We will now divide the dataset using Python code. We intend to divide it as follows:

Training Data - 80%

Testing Data - 20%
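An 80/20 split like this is commonly done with scikit-learn's `train_test_split`. The sketch below assumes scikit-learn and uses synthetic stand-in data from `make_classification` rather than the heart attack dataset; the `random_state` values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 300 rows with 13 features and a binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)
```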

Step 6: Feature Selection

This step describes the importance of each feature with respect to the output attribute. Each feature gets a score; the higher the score, the more important or relevant the feature is. The graph shows which attributes the output attribute depends on most closely, using the ExtraTreesClassifier() model.

Here I'm treating all attributes except the output attribute as "X" (independent variables) and the output column as "Y" (dependent variable), which indicates whether there is a chance of heart attack or not.
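As an illustration of how such scores can be obtained, the sketch below fits scikit-learn's `ExtraTreesClassifier` and reads its `feature_importances_` attribute. It uses synthetic stand-in data, and the parameter values are assumptions rather than the settings used for the graph above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in for X (features) and y (output column).
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# One score per feature; the scores sum to 1 and higher means more relevant.
importances = model.feature_importances_
ranking = sorted(range(X.shape[1]), key=lambda i: importances[i], reverse=True)
for index in ranking[:5]:
    print(f"feature {index}: {importances[index]:.3f}")
```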

Step 7: Building the Decision Tree Model

After the feature selection step, we use the training data to create a decision tree model. First, we train the model using either the Gini index or the entropy criterion.

(Note: here I used both the Gini index and entropy to show the difference in results.)

Here I'm predicting the results using both the Gini and entropy approaches. The following code describes a function that takes the respective model and X_test as input and returns the predicted values for each approach.
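A sketch of this training and prediction step, assuming scikit-learn's `DecisionTreeClassifier`. The helper name `train_tree`, the `random_state` values and the synthetic stand-in data are assumptions; `prediction` mirrors the function name mentioned earlier in this post.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def train_tree(X_train, y_train, criterion):
    """Fit a decision tree with the given split criterion ('gini' or 'entropy')."""
    model = DecisionTreeClassifier(criterion=criterion, random_state=100)
    return model.fit(X_train, y_train)

def prediction(model, X_test):
    """Return the predicted class for every row of X_test."""
    return model.predict(X_test)

# Synthetic stand-in data, not the heart attack dataset.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pred_gini = prediction(train_tree(X_train, y_train, "gini"), X_test)
pred_entropy = prediction(train_tree(X_train, y_train, "entropy"), X_test)
print(pred_gini[:10], pred_entropy[:10])
```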

Step 8 : Evaluating the Model

After the model is constructed, it needs to be evaluated. We do this by printing the model's accuracy on the test set. The following code evaluates the model using accuracy. Both methods (Gini index and entropy) return the same accuracy for this heart attack analysis.

The test accuracy is 73%, which is below our 80% target.
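Accuracy is the share of test rows whose predicted label matches the true label. A minimal sketch, assuming scikit-learn's `accuracy_score`; the four labels are made up, and `cal_accuracy` mirrors the function name mentioned earlier in this post.

```python
from sklearn.metrics import accuracy_score

def cal_accuracy(y_test, y_pred):
    """Return the percentage of predictions that match the true labels."""
    return accuracy_score(y_test, y_pred) * 100

# Three of the four made-up predictions are correct.
accuracy = cal_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
print(f"Accuracy: {accuracy:.0f}%")
```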
Step 9 : Improving Accuracy by restricting the depth of the Tree

By limiting the depth of the tree, we prevent it from becoming overly complex, which also makes it easier to visualise. To do this, we set up the model again with a maximum depth of 4 levels (max_depth). After that, we fit the model to the training data once more and re-evaluate it by printing the accuracy.

This time, we achieve 82% accuracy, which is fairly good.
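The depth limit can be sketched with scikit-learn's `max_depth` parameter. The example below runs on synthetic stand-in data, so its accuracies will differ from the figures reported in this post and a shallower tree is not guaranteed to score higher on every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data, not the heart attack dataset.
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

full_tree = DecisionTreeClassifier(random_state=100).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=100).fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("max_depth=4", shallow_tree)]:
    accuracy = accuracy_score(y_test, model.predict(X_test)) * 100
    print(f"{name}: depth={model.get_depth()}, test accuracy={accuracy:.1f}%")
```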
Step 10 : Plot the Decision Tree

Here I'm using the code below to draw the decision tree and save it as a PNG image (named DecisionTree.png) in a local folder. A message box then shows the details of the saved tree image.

The following image (DecisionTree.png) shows the decision tree for the heart attack analysis.
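One way to produce such an image is scikit-learn's `plot_tree` together with matplotlib's `savefig`; this is an assumption, since other tools can render a tree as well. The sketch uses synthetic stand-in data, placeholder feature names and the system temporary folder as the output location.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data with placeholder feature names.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]
model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

figure, axes = plt.subplots(figsize=(12, 8))
plot_tree(model, feature_names=feature_names, class_names=["0", "1"],
          filled=True, ax=axes)

# Output folder is an assumption; any writable path works.
image_path = os.path.join(tempfile.gettempdir(), "DecisionTree.png")
figure.savefig(image_path)
plt.close(figure)
print("Decision tree saved to", image_path)
```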

In addition, the program predicts the chance of a heart attack both for values taken from the dataset and for values entered by the user. The predict() function is used to predict the chance of a heart attack from the attribute values passed to it.
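A sketch of predicting for a single record, assuming a scikit-learn decision tree. The model here is trained on synthetic stand-in data, and the "user input" is simply the first synthetic row; in the real program the values would come from the GUI fields in the same column order used for training.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data with 13 feature columns.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
model = DecisionTreeClassifier(max_depth=4, random_state=100).fit(X, y)

# predict() expects a 2D input: one row per record, one column per attribute.
user_values = [list(X[0])]
predicted = model.predict(user_values)[0]
probabilities = model.predict_proba(user_values)[0]

label = "chance of heart attack" if predicted == 1 else "no chance of heart attack"
print(label, probabilities)
```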