HEART ATTACK ANALYSIS AND PREDICTION USING DECISION TREE ALGORITHM



Introduction

In this post, you will learn how to develop a program that generates a decision tree for a heart attack dataset using Python and MongoDB, along with user authentication. The program is intended to predict a patient's chance of having a heart attack based on their medical characteristics. The following subjects are covered in detail.

Overview

  • What is the decision tree algorithm?
  • How to build a decision tree for the heart attack dataset using Python and MongoDB.
  • An explanation of how to use/operate the program.

What is the Decision Tree Algorithm?

The decision tree is one of the most popular classification algorithms and among the easiest to understand and interpret. It is a supervised machine learning technique used for solving both regression and classification problems. As the name implies, a decision tree is a tree-based algorithm that divides the data into parts and generates decision rules that lead to a prediction or a label.


Important Terms related to Decision Trees

  1. Root Node: Represents the entire sample, which is then divided into two or more homogeneous sets.
  2. Decision Node: A sub-node that splits into further sub-nodes.
  3. Leaf / Terminal Node: A node that does not split any further; it represents a possible outcome.
  4. Branch / Sub-Tree: A subsection of the overall tree.
  5. Parent and Child Node: A node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.
  6. Splitting: The process of dividing a node into two or more sub-nodes.


The most popular algorithms for splitting are listed below:

  • Gini Impurity
Gini is based on the idea that if two items are chosen at random from a pure population, they belong to the same class with probability 1. Gini impurity is one minus the sum of the squared class proportions in a node, so a pure node has an impurity of 0.
  • Chi-Square
This method measures the statistical significance of the differences between the sub-nodes and the parent node.
  • Information Gain
When the target variable is categorical, Information Gain is used to split the nodes. It is based on entropy, which measures a node's impurity: the lower the entropy, the purer the node, and a homogeneous node has an entropy of 0. Information Gain is the entropy of the parent node minus the weighted average entropy of its sub-nodes, so purer sub-nodes give a higher gain. For a two-class target, entropy is at most 1, so the gain is also at most 1.
  • Reduction in Variance
This method is used for continuous target variables (regression problems).
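
To make the two most common criteria concrete, here is a minimal sketch that computes Gini impurity and entropy for a list of class labels. The function names and the toy label lists are illustrative and not taken from the program described in this post.

```python
import math
from collections import Counter


def gini_impurity(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of p * log2(1 / p) over the classes."""
    total = len(labels)
    if total == 0:
        return 0.0
    proportions = [count / total for count in Counter(labels).values()]
    return sum(p * math.log2(1 / p) for p in proportions)


pure_node = [1, 1, 1, 1]    # toy labels: every sample has the same class
mixed_node = [0, 0, 1, 1]   # toy labels: an even two-class mix

print(gini_impurity(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini_impurity(mixed_node), entropy(mixed_node))  # 0.5 1.0
```

A pure node scores 0 on both measures, while an even two-class mix gives the two-class maximum of each (0.5 for Gini, 1 for entropy).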

How do Decision Trees work?


Decision trees use one of the criteria described above to decide how to split a node into two or more sub-nodes. Each split is meant to increase the homogeneity of the resulting sub-nodes; in other words, the purity of each node with respect to the target variable increases. The decision tree evaluates candidate splits on all of the available features and then chooses the split that produces the most homogeneous sub-nodes.

Steps in algorithm:

  • The algorithm starts with the original set as the root node.
  • In each iteration, the method calculates the entropy (H) and information gain (IG) of every unused attribute of the set.
  • The attribute with the lowest entropy after the split (equivalently, the greatest information gain) is then selected.
  • The set is then split by the selected attribute to produce subsets of the data.
  • The algorithm repeats on each subset, taking into account only attributes that have never been chosen before.
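
The attribute-selection step above can be sketched as follows. This is a simplified illustration for categorical attributes only; the helper names and the four toy rows are made up for the example and are not part of the heart attack dataset.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy in bits of a list of class labels."""
    total = len(labels)
    return sum((c / total) * math.log2(total / c) for c in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Parent entropy minus the weighted average entropy of the subsets."""
    parent_entropy = entropy([row[target] for row in rows])
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attribute]].append(row[target])
    weighted = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return parent_entropy - weighted


def best_attribute(rows, attributes, target):
    """Pick the attribute with the greatest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Toy rows (invented): "exang" separates the classes perfectly, "fbs" does not.
rows = [
    {"exang": 1, "fbs": 0, "output": 0},
    {"exang": 1, "fbs": 1, "output": 0},
    {"exang": 0, "fbs": 0, "output": 1},
    {"exang": 0, "fbs": 1, "output": 1},
]
print(best_attribute(rows, ["exang", "fbs"], "output"))  # exang
```

A full tree builder would call best_attribute on each subset in turn, removing the chosen attribute from the candidate list each time.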


How to build a decision tree for the heart attack dataset using Python and MongoDB

Let's now examine the ideas involved in a decision tree implementation, using a heart attack analysis dataset.

Here I used MongoDB for storing and managing the data and Spyder as the Python development environment.
You should download MongoDB and create a MongoDB account using your email address. After that, follow this guide to create a cluster with a username and password.



1. Problem Definition

Given clinical parameters about a patient, can we predict whether the patient has a chance of a heart attack or not? We aim to reach a model accuracy of more than 80% using the decision tree algorithm.

Data Set: The Heart Attack dataset from Kaggle is used. The details of the dataset are as follows:


displaying all columns of heart attack dataset


2. Understanding Features

1. age: the age of the individual.
2. sex: the gender of the individual, in the following format:
            • 1 = male
            • 0 = female
3. cp (Chest-Pain Type): the type of chest pain experienced by the individual:
            • 0 = typical angina
            • 1 = atypical angina
            • 2 = non-anginal pain
            • 3 = asymptomatic
4. trestbps (Resting Blood Pressure): the resting blood pressure in mmHg.
5. chol (Serum Cholesterol): the serum cholesterol in mg/dl.
6. fbs (Fasting Blood Sugar): compares the individual's fasting blood sugar with 120 mg/dl:
            • 1 (true) if fasting blood sugar > 120 mg/dl
            • 0 (false) otherwise
7. restecg (Resting ECG): the resting electrocardiographic results:
            • 0 = normal
            • 1 = having ST-T wave abnormality
            • 2 = left ventricular hypertrophy
8. thalach (Max Heart Rate Achieved): the maximum heart rate achieved.
9. exang (Exercise-Induced Angina):
            • 1 = yes
            • 0 = no
10. oldpeak (ST depression induced by exercise relative to rest): an integer or float value.
11. slope (Peak exercise ST segment):
            • 0 = upsloping
            • 1 = flat
            • 2 = downsloping
12. ca (Number of major vessels (0–3) colored by fluoroscopy): an integer or float value.
13. thal: thalassemia, an inherited blood disorder that causes the body to have less hemoglobin than normal:
            • 0 = normal
            • 1 = fixed defect
            • 2 = reversible defect
14. output (Diagnosis of heart disease): whether the individual has a chance of a heart attack or not:
            • 0 = absence
            • 1 = present

Classes and Functions - Explanation



1. class FirstScreen_GUI - Used for the main Tkinter screen

  • Calls another class to do the data analysis for the heart attack dataset.

  • Opens the GUI with buttons for the analysis graph, the decision tree and the prediction.

  • show_analysis() function - opens a new GUI window and displays a chart of heart attack cases in males and females with respect to the output variable in the dataset.

  • generate_decisiontree() function - creates the decision tree, stores the image in a local folder and shows its path in a message box.

  • Predicting_Heartattack() function - opens a new GUI window, asks the user to enter values for the attributes manually and calculates the chance of a heart attack based on the input data.

2. class DataAnalysis - Used to import and analyze the dataset

  • importdata() function - reads the CSV file, converts it to JSON format, and inserts the data into and retrieves it from MongoDB.

  • feature_analysis() function - performs exploratory data analysis and feature selection.

  • splitdataset() function - splits the dataset into training and testing data and performs training with entropy and the Gini index. The prediction and cal_accuracy functions calculate the predictions and the accuracy on the test set, respectively.


Detailing the program flow



Step 1 : Import required packages and the tools for the implementation.
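
Before importing everything, it can help to confirm that the required packages are installed. The sketch below assumes the module list from the tools named in this post (Tkinter, pymongo, pandas DataFrames, matplotlib, and scikit-learn for ExtraTreesClassifier); the exact import list of the original program is not shown here, so treat the list as an assumption.

```python
from importlib.util import find_spec

# Assumed module list, derived from the tools mentioned in this post.
REQUIRED_MODULES = ["tkinter", "pymongo", "pandas", "matplotlib", "sklearn"]


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names of modules that cannot be found in this environment."""
    return [name for name in modules if find_spec(name) is None]


print("Missing modules:", missing_modules() or "none")
```

Any name that is reported as missing can usually be installed with pip (scikit-learn provides the sklearn module).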


Step 2 : Create a Main Screen using Tkinter

The main screen is created using Tkinter. Inside this class, another class is called to perform the data analysis operations.

The following code is used to create the GUI for the main screen
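
As a rough idea of what such a screen involves, here is a minimal sketch, not the FirstScreen_GUI class itself. The button labels come from the user guide in this post; the function name, the window title and the handlers mapping are assumptions made for the example.

```python
# Button labels as listed in the user guide of this post.
BUTTON_LABELS = ["Show Analysis", "Generate Decision Tree", "Prediction", "Exit"]


def build_main_screen(handlers):
    """Create the main window with one button per label.

    handlers maps a button label to a zero-argument callable (hypothetical
    interface); labels without a handler do nothing, and "Exit" closes the window.
    """
    import tkinter as tk  # imported here so the module loads without a display

    root = tk.Tk()
    root.title("Heart Attack Analysis")  # assumed title
    for label in BUTTON_LABELS:
        command = root.destroy if label == "Exit" else handlers.get(label, lambda: None)
        tk.Button(root, text=label, width=30, command=command).pack(padx=20, pady=5)
    return root
```

To try it, call build_main_screen({"Show Analysis": some_function}).mainloop() from a desktop session, where some_function is whatever should run when the button is clicked.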




Step 3 : Import the Dataset from CSV to MongoDB

A MongoDB connection is created with the pymongo.MongoClient method, which connects to the database using a user name and password.

In the Data_Analysis class, the importdata() function reads the heart attack dataset in CSV format, analyzes it and inserts it into MongoDB after converting it to JSON format.


code for reading csv file and creating database on MongoDB
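
A minimal sketch of these two pieces is shown below: turning CSV text into JSON-style records, and opening a collection with pymongo.MongoClient. The function names, the database and collection names, and the sample rows are illustrative assumptions. The number conversion relies on the dataset containing only numeric values, as stated in this post.

```python
import csv
import io
import json


def read_csv_records(text_stream):
    """Parse CSV text into a list of dicts, converting each value to int or float."""
    def to_number(value):
        try:
            return int(value)
        except ValueError:
            return float(value)  # assumes every column is numeric

    return [{key: to_number(value) for key, value in row.items()}
            for row in csv.DictReader(text_stream)]


def get_collection(uri, db_name="heart_db", collection_name="heart"):
    """Open a MongoDB collection; db_name and collection_name are placeholder names."""
    from pymongo import MongoClient  # third-party package, imported only when needed
    return MongoClient(uri)[db_name][collection_name]


# Illustrative rows with a subset of the columns.
sample = io.StringIO("age,sex,oldpeak,output\n63,1,2.3,1\n37,1,3.5,1\n")
records = read_csv_records(sample)
print(json.dumps(records[0]))  # {"age": 63, "sex": 1, "oldpeak": 2.3, "output": 1}
```

With a real cluster, the uri argument would be the connection string of your cluster, which embeds the user name and password created earlier. For a file on disk, pass open("heart.csv", newline="") instead of the StringIO object (the file name is an assumption).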

Performing Data Analysis

This step prepares the dataset by checking for null values and cleaning the data if required.


In the heart attack dataset, there are no missing values or string values. Hence, the dataset can be used for the implementation as it is.


Note: I imported the dataset into MongoDB only after checking whether it has missing values or requires cleaning.
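
One way to run this check with pandas is sketched below. The function name and the small example frame are made up; the example deliberately contains a missing value and a text column so that the check has something to report.

```python
import pandas as pd


def summarize_quality(df):
    """Count missing values and list the columns that are not numeric."""
    missing = int(df.isnull().sum().sum())
    non_numeric = [col for col in df.columns
                   if not pd.api.types.is_numeric_dtype(df[col])]
    return {"missing_values": missing, "non_numeric_columns": non_numeric}


# Invented frame: one missing age and one text column.
example = pd.DataFrame({"age": [63, None], "sex": [1, 0], "note": ["a", "b"]})
print(summarize_quality(example))
```

For a clean, fully numeric dataset such as the one described here, the result would show zero missing values and an empty list of non-numeric columns.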


Inserting dataset into MongoDB


The following code shows the insertion of the analyzed heart attack dataset into MongoDB after converting it to JSON format.
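
A minimal sketch of the insert step, assuming the records are already a list of dicts. insert_many and inserted_ids are part of pymongo's collection API; the helper name and the in-memory stand-in class are hypothetical and exist only so the example runs without a MongoDB server.

```python
from types import SimpleNamespace


def insert_records(collection, records):
    """Insert the records with insert_many and return how many were stored."""
    if not records:  # pymongo's insert_many rejects an empty list
        return 0
    result = collection.insert_many(records)
    return len(result.inserted_ids)


class InMemoryCollection:
    """Stand-in that mimics the one pymongo method used above."""

    def __init__(self):
        self.documents = []

    def insert_many(self, documents):
        self.documents.extend(documents)
        return SimpleNamespace(inserted_ids=list(range(len(documents))))


collection = InMemoryCollection()
print(insert_records(collection, [{"age": 63, "output": 1}, {"age": 37, "output": 1}]))  # 2
```

With a real pymongo collection object, the same insert_records call works unchanged.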



After inserting the dataset into MongoDB, you can see the data collection in MongoDB Compass.




Step 4 : Retrieving data from MongoDB and performing all functions available in the Tkinter window (main screen)



To retrieve data from MongoDB, we use the find() function. find() itself returns a cursor; here the results are loaded into a DataFrame that holds the current contents of the collection.


code for retrieving data
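
The retrieval can be sketched as follows. find() with a filter and a projection is standard pymongo; the helper name and the stand-in collection are assumptions so that the example runs without a database.

```python
import pandas as pd


def collection_to_dataframe(collection):
    """Load every document of the collection into a DataFrame."""
    # The projection {"_id": 0} leaves out MongoDB's generated _id field.
    return pd.DataFrame(list(collection.find({}, {"_id": 0})))


class StaticCollection:
    """Stand-in that returns fixed documents from find()."""

    def __init__(self, documents):
        self._documents = documents

    def find(self, query=None, projection=None):
        return iter(self._documents)


stored = StaticCollection([{"age": 63, "sex": 1, "output": 1},
                           {"age": 57, "sex": 0, "output": 0}])
df = collection_to_dataframe(stored)
print(df.shape)  # (2, 3)
```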

Generate a chart that displays the relationship between sex (gender) and the output field of the heart attack dataset.



Here we use the Python matplotlib library to draw the graphs.


In the graph, the output variable is on the x axis and the count is on the y axis, showing the number of heart disease cases in males and females. After creating a figure of size 5x4 (matplotlib's figsize is given in inches, not pixels), we used FigureCanvasTkAgg to draw the graph in the GUI.
The chart gives an overall idea of whether females or males have a higher chance of heart disease.
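
A possible shape for this chart code is sketched below: one function builds the figure from a DataFrame and another embeds it in a Tkinter window with FigureCanvasTkAgg. The function names and the exact chart styling are assumptions; the column names "sex" and "output" follow the feature list in this post.

```python
import pandas as pd
from matplotlib.figure import Figure

SEX_LABELS = {0: "female", 1: "male"}  # coding taken from the feature list


def build_sex_output_figure(df):
    """Return a grouped bar chart of record counts per output value and sex."""
    counts = pd.crosstab(df["output"], df["sex"])  # rows: output, columns: sex
    fig = Figure(figsize=(5, 4))  # figsize is in inches
    ax = fig.add_subplot(111)
    width = 0.4
    positions = range(len(counts.index))
    for offset, sex_value in enumerate(counts.columns):
        ax.bar([p + offset * width for p in positions], counts[sex_value],
               width=width, label=SEX_LABELS.get(sex_value, str(sex_value)))
    ax.set_xticks([p + width / 2 for p in positions])
    ax.set_xticklabels([str(value) for value in counts.index])
    ax.set_xlabel("output")
    ax.set_ylabel("count")
    ax.legend(title="sex")
    return fig


def embed_in_tk(fig, parent):
    """Draw the figure inside a Tkinter widget (needs a running Tk window)."""
    from matplotlib.backends.backend_tkagg import FigureCanvasTkAgg
    canvas = FigureCanvasTkAgg(fig, master=parent)
    canvas.draw()
    canvas.get_tk_widget().pack()
    return canvas
```

Keeping the figure construction separate from the Tkinter embedding makes the chart easy to check without opening a window.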

Step 5 : Splitting the Dataset into Training and Testing Sets


We will now split the dataset using Python code, as follows:


Training Data - 80%

Testing Data - 20%
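
An 80/20 split is commonly done with scikit-learn's train_test_split, as in this sketch. The data here is a synthetic stand-in generated with make_classification (300 rows, 13 features), and the random_state value is an arbitrary choice; neither comes from the original program.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 13 feature columns (X) and the output column (y).
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# test_size=0.2 keeps 20% of the rows for testing and 80% for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)
print(X_train.shape, X_test.shape)  # (240, 13) (60, 13)
```

Fixing random_state makes the split reproducible, so accuracy figures can be compared between runs.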



Step 6: Feature Selection

This step describes the importance of each feature with respect to the output attribute. Each feature of the data gets a score; the higher the score, the more important or relevant the feature is. The graph shows which attributes the output depends on most closely, using the ExtraTreesClassifier() model.


Here I'm considering all attributes except the output attribute as "X" (the independent variables) and the output column as "Y" (the dependent variable), which indicates whether there is a chance of a heart attack or not.
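
A minimal sketch of scoring features with ExtraTreesClassifier is given below. It uses synthetic stand-in data with made-up column names, so the scores it prints say nothing about the real heart attack features; n_estimators and random_state are arbitrary choices.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in: 5 features, of which 2 are informative.
features, target = make_classification(
    n_samples=300, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X = pd.DataFrame(features, columns=[f"feature_{i}" for i in range(5)])

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, target)

# feature_importances_ holds one score per column; the scores sum to 1.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

On the real data, X would be the DataFrame without the output column and target would be the output column.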



Step 7: Building the Decision Tree Model

After the feature selection step, we use the training data to create a Decision Tree model. First of all, we train the model using either the Gini index or entropy as the splitting criterion.

(Note: here I used both the Gini index and entropy methods to show the difference in results.)



Here I'm predicting the results using both the Gini and entropy approaches. The following code defines a function that takes the respective model and X_test as input and returns the predicted values for each approach.
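
A sketch of training with both criteria and predicting on the test set is shown below. The prediction name matches the function mentioned earlier in this post, but its signature, the train_tree helper, the synthetic data and the random_state values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def train_tree(X_train, y_train, criterion):
    """Fit a decision tree using "gini" or "entropy" as the splitting criterion."""
    model = DecisionTreeClassifier(criterion=criterion, random_state=100)
    return model.fit(X_train, y_train)


def prediction(X_test, model):
    """Return the predicted class for every row of X_test."""
    return model.predict(X_test)


# Synthetic stand-in data with 13 features, split 80/20.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

pred_gini = prediction(X_test, train_tree(X_train, y_train, "gini"))
pred_entropy = prediction(X_test, train_tree(X_train, y_train, "entropy"))
print(pred_gini[:10])
print(pred_entropy[:10])
```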



Step 8 : Evaluating the Model

After the model is constructed, it needs to be evaluated. We do this by printing the model's accuracy on the test set. The following code evaluates the model using accuracy. Both methods (Gini index and entropy) return the same accuracy for this heart attack dataset.

The accuracy is 73%, which means the model classifies 73% of the test samples correctly.
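
Accuracy is simply the share of test samples whose predicted label matches the true label. The sketch below uses scikit-learn's accuracy_score; the cal_accuracy name matches the function mentioned earlier in this post, but its signature is assumed and the label lists are invented.

```python
from sklearn.metrics import accuracy_score


def cal_accuracy(y_test, y_pred):
    """Return the accuracy as a percentage."""
    return accuracy_score(y_test, y_pred) * 100


# Invented labels: 6 of the 8 predictions match the true values.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
print(cal_accuracy(y_true, y_pred))  # 75.0
```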

Step 9 : Improving Accuracy by restricting the depth of the Tree
By limiting the depth of the tree, we prevent it from becoming overly complex, which reduces overfitting and also makes the tree easier to visualise. To do this, we set up the model again with a maximum number of levels (max_depth) of 4. After that, we fit the model once more to the training data and re-evaluate it by printing the accuracy.

This time, we achieve an accuracy of 82%, which is fairly good.
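
The comparison can be sketched like this. The data is a noisy synthetic stand-in, so the printed accuracies will differ from the 73% and 82% reported for the heart attack dataset, and a shallower tree is not guaranteed to score higher on every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 10% label noise (flip_y) so a deep tree can overfit.
X, y = make_classification(n_samples=300, n_features=13, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

full_tree = DecisionTreeClassifier(criterion="gini", random_state=100)
full_tree.fit(X_train, y_train)

# max_depth=4 stops the tree from growing beyond four levels of splits.
shallow_tree = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=100)
shallow_tree.fit(X_train, y_train)

for name, model in [("unrestricted", full_tree), ("max_depth=4", shallow_tree)]:
    accuracy = accuracy_score(y_test, model.predict(X_test)) * 100
    print(f"{name}: depth={model.get_depth()}, accuracy={accuracy:.1f}%")
```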

Step 10 : Plot the Decision Tree
Here I'm using the code below to plot the decision tree and save it as a PNG image (named DecisionTree.png) in a local path. A message box is then displayed, showing the details of the saved tree image.
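
One way to produce such an image is scikit-learn's plot_tree together with matplotlib, as sketched below. This is an assumption about the approach; the original program may have drawn the tree differently. The helper name, the generic feature names, the figure size and the temporary output folder are illustrative, and the message box is left out.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree


def save_tree_image(model, feature_names, path):
    """Draw the fitted tree, save it as an image and return the absolute path."""
    fig, ax = plt.subplots(figsize=(16, 8))
    plot_tree(model, feature_names=feature_names, class_names=["0", "1"],
              filled=True, ax=ax)
    fig.savefig(path)
    plt.close(fig)
    return os.path.abspath(path)


# Synthetic stand-in data and a depth-limited tree.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
model = DecisionTreeClassifier(max_depth=4, random_state=100).fit(X, y)
names = [f"feature_{i}" for i in range(13)]

saved_path = save_tree_image(
    model, names, os.path.join(tempfile.gettempdir(), "DecisionTree.png")
)
print("Decision tree saved to", saved_path)
```

Inside the Tkinter program, the returned path is what a message box would show to the user. The Agg backend is chosen only so this standalone sketch runs without a display.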


The following image (DecisionTree.png) shows the decision tree for the heart attack analysis.



Here I'm additionally predicting the chance of a heart attack by passing values from the dataset as well as values entered by the user. The predict() function predicts the chance of a heart attack from the attribute values passed to it.
showing the prediction using values from the dataset

Using Tkinter, a new GUI window is created to show the prediction from user input
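
The step from user input to a prediction can be sketched as follows. The feature names follow the list in this post, but the helper name, the result messages, the example entries and the model trained on synthetic data are assumptions; values are converted with float() because Tkinter entry fields return strings.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
            "thalach", "exang", "oldpeak", "slope", "ca", "thal"]


def predict_patient(model, entries):
    """Turn a {feature name: text value} mapping into one row and classify it."""
    row = [[float(entries[name]) for name in FEATURES]]
    label = int(model.predict(row)[0])
    # Hypothetical result messages for output = 1 and output = 0.
    return "Chance of heart attack" if label == 1 else "No chance of heart attack"


# Stand-in model trained on synthetic data with the same number of features.
X, y = make_classification(n_samples=300, n_features=len(FEATURES), random_state=0)
model = DecisionTreeClassifier(max_depth=4, random_state=100).fit(X, y)

entries = {name: "0" for name in FEATURES}  # made-up form input, as strings
entries.update({"age": "54", "sex": "1"})
print(predict_patient(model, entries))
```

A missing or non-numeric entry raises an exception here, so a GUI would normally validate the fields before calling the function.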


HOW TO USE THE PROGRAM : USER GUIDE
The decision tree model is built into a Tkinter GUI using Python. When the user runs the program in Spyder, the main screen appears as follows:
Screen 1 : Main screen

The window shows the following buttons to perform some task based on heart attack analysis:

1. Show Analysis - When you click on this button, it shows a visualization of the output variable vs gender in another window.

2. Generate Decision Tree - When you click on this button, it saves the decision tree to a local path on your system.

3. Prediction - When you click on this button, a new window appears and asks for manual input from the user to predict whether the user has a chance of a heart attack or not.

4. Exit - When you click on this button, the main window closes.

Screen 2 : A window with Visualization of output variable vs gender

When a user clicks the Show Analysis button, a visualization similar to the one in the following image is displayed.


Screen 3 : Generate Decision tree - Downloading decision tree

When the user clicks the Generate Decision Tree button, the optimized decision tree is saved, and a pop-up window shows the path where the tree was saved.


Following the path shown in the pop-up window, we can find our decision tree under the name DecisionTree.

Finally, the saved decision tree looks like the following:


Screen 4  : A window predicting the chance of a heart attack


When the user clicks the Prediction button, a window appears asking the user to enter values for the specified columns in order to predict the user's chance of a heart attack.

The window looks as below, with 3 buttons:



Description of 3 buttons:

  1. Quit - When you click on this button, it closes the prediction window immediately and goes back to the main window.
  2. Show - After entering valid values in each field, click on this button to show the result in a pop-up window.
  3. Reset - When you click on this button, it clears all the entries in each field.






References:


https://datasetsearch.research.google.com/search?src=2&query=Heart%20Attack%20Analysis%20%26%20Prediction%20Dataset&docid=L2cvMTFycWduNjcwbg%3D%3D

"Release Notes for MongoDB 6.0". August 19, 2022. Retrieved August 23, 2022

https://www.w3schools.com/python/python_ml_decision_tree.asp

https://en.wikipedia.org/wiki/Decision_tree

https://www.w3schools.in/python/gui-programming

https://www.geeksforgeeks.org/how-to-embed-matplotlib-charts-in-tkinter-gui/


























