HEART ATTACK ANALYSIS AND PREDICTION USING DECISION TREE ALGORITHM
Introduction
In this post, you will learn to develop a program that generates a decision tree for a heart attack data set using Python and MongoDB, along with user authentication. The program is intended to forecast a person's chance of having a heart attack based on their medical characteristics. The following subjects are covered in detail.
Overview
- What is the decision tree algorithm?
- How to build a decision tree for the heart attack dataset using Python and MongoDB.
- An explanation of how to use/operate the program.
Important Terms related to Decision Trees
- Root Node: Represents the entire sample, which is then divided into two or more homogeneous sets.
- Decision Node: A sub-node that splits into further sub-nodes.
- Leaf / Terminal Node: A node that does not split any further; it represents a possible outcome.
- Branch / Sub-Tree: A subsection of the overall tree.
- Parent and Child Node: A node that is divided into sub-nodes is the parent of those sub-nodes, and the sub-nodes are its children.
- Splitting: Breaking a node down into several smaller nodes.
Common splitting criteria:
- Gini Impurity
- Chi-Square
- Information Gain
- Reduction in Variance
How do Decision Trees work?
Steps in the algorithm:
- The algorithm starts with the original set as the root node.
- In each iteration, it calculates the entropy (H) and information gain (IG) of every unused attribute of the set.
- It then selects the attribute with the lowest entropy after the split, which is the one with the greatest information gain.
- The set is split by the selected attribute to produce subsets of the data.
- The algorithm recurses on each subset, considering only attributes that have not been chosen before.
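The attribute-selection step above can be sketched in plain Python. This is a minimal illustration, not the program's actual code: the function names and the toy rows are assumptions made up for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the parent set minus the weighted entropy of its subsets."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


def best_attribute(rows, labels, attributes):
    """Pick the attribute with the greatest information gain."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy rows made up for illustration; they are not taken from the dataset.
rows = [
    {"exang": 1, "fbs": 0},
    {"exang": 1, "fbs": 1},
    {"exang": 0, "fbs": 0},
    {"exang": 0, "fbs": 1},
]
labels = [0, 0, 1, 1]

print(best_attribute(rows, labels, ["exang", "fbs"]))  # exang
```

In this toy set, splitting on exang separates the two classes perfectly (gain of 1 bit), while splitting on fbs leaves both subsets evenly mixed (gain of 0), so exang is chosen.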
2. sex:
3. cp (Chest Pain Type):
• 0 = typical angina
• 1 = atypical angina
• 2 = non-anginal pain
• 3 = asymptomatic
4. trestbps (Resting Blood Pressure):
5. chol (Serum Cholesterol):
6. fbs (Fasting Blood Sugar):
• 1 (true) if fasting blood sugar > 120 mg/dl, otherwise 0 (false)
7. restecg (Resting ECG):
• 1 = having ST-T wave abnormality
• 2 = left ventricular hypertrophy
8. thalach (Max Heart Rate Achieved):
9. exang (Exercise-induced angina):
• 1 = yes
• 0 = no
10. oldpeak (ST depression induced by exercise relative to rest):
11. slope (Slope of the peak exercise ST segment):
• 0 = upsloping
• 1 = flat
• 2 = downsloping
12. ca (Number of major vessels (0–3) colored by fluoroscopy):
13. thal:
• 0 = normal
• 1 = fixed defect
• 2 = reversible defect
14. output (Diagnosis of heart disease):
• 0 = absence
• 1 = presence
Classes and Functions - Explanation
1. class FirstScreen_GUI - Creates the main Tkinter screen and calls another class to do the data analysis for the heart attack dataset.
The GUI it opens has buttons for the analysis graph, the decision tree, and the prediction.
show_analysis() function - Opens a new GUI window and displays a chart of heart attack cases in males and females with respect to the output variable in the dataset.
generate_decisiontree() function - Creates the decision tree, stores the image in a local folder, and shows its path in a message box.
Predicting_Heartattack() function - Opens a new GUI window, asks the user to enter values for the attributes manually, and calculates the chance of a heart attack from that input.
2. class DataAnalysis - Imports the dataset and performs the analysis.
importdata() function - Reads the CSV file, converts it to JSON format, and inserts the data into and retrieves it from MongoDB.
feature_analysis() function - Performs exploratory data analysis and feature selection.
splitdataset() function - Splits the dataset into training and testing data and trains with entropy and the Gini index. The prediction and cal_accuracy functions then calculate the predictions and the accuracy on the test data with the Gini index, respectively.
The main screen is created with Tkinter. Inside this class, another class is called to perform the data analysis operations.
The following code is used to create the GUI for the main screen.
A MongoDB connection is created with the pymongo.MongoClient method, which connects to the database with a user name and password.
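A connection along these lines might look as follows. This is a hedged sketch: the helper names, host, port, and timeout are assumptions, not values from the original program. Credentials are percent-encoded because characters such as "@" or "/" in a password would otherwise break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(username, password, host="localhost", port=27017):
    """Return a MongoDB connection URI with the credentials percent-encoded."""
    return f"mongodb://{quote_plus(username)}:{quote_plus(password)}@{host}:{port}/"


def connect(username, password, host="localhost", port=27017):
    """Create a client and ping the server so bad credentials fail early."""
    # Third-party dependency (pip install pymongo), imported here so the
    # URI helper above can be used without it.
    from pymongo import MongoClient

    client = MongoClient(
        build_mongo_uri(username, password, host, port),
        serverSelectionTimeoutMS=5000,  # assumed timeout, in milliseconds
    )
    client.admin.command("ping")  # MongoClient connects lazily; force a round trip
    return client


# Hypothetical credentials, for illustration only.
print(build_mongo_uri("heart_user", "p@ss/word"))
```

Calling connect() requires a running MongoDB server; the URI helper works on its own.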
In the Data_Analysis class, the importdata() function reads the heart attack dataset from a CSV file, analyzes it, and inserts it into MongoDB after converting it to JSON format.
The heart attack dataset has no missing values or string values, so it can be used for the implementation as it is.
Note: I imported the dataset into MongoDB only after checking whether it requires data cleaning or has missing values.
Inserting dataset into MongoDB
The following code inserts the analyzed heart attack dataset into MongoDB after converting it to JSON format.
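As an illustration of the CSV-to-JSON step, the sketch below uses only the standard library; the original program may well use a different approach (for example pandas). The helper names and the two sample rows are assumptions for the example, and insert_records() expects a pymongo collection object.

```python
import csv
import io
import json


def csv_to_records(csv_text):
    """Parse CSV text into JSON-style dicts, converting every cell to a number."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        records.append(
            {key: float(value) if "." in value else int(value)
             for key, value in row.items()}
        )
    return records


def insert_records(collection, records):
    """Insert the records into a pymongo collection; return how many were stored."""
    result = collection.insert_many(records)
    return len(result.inserted_ids)


# Illustrative rows with only four of the columns.
sample = "age,sex,oldpeak,output\n63,1,2.3,1\n37,1,3.5,1\n"
records = csv_to_records(sample)
print(json.dumps(records[0]))
```

The numeric conversion relies on the dataset containing no missing or string values, as noted above.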
Step 4 : Retrieving data from MongoDB and performing all functions available in the tkinter window(main screen)
To retrieve data from MongoDB, we use the find() function. Here the retrieval function returns a DataFrame with the current contents of the collection.
| code for retrieving data |
Here we use the Python matplotlib library to draw the graphs.
The graph has the output variable on the x axis and, on the y axis, the number of heart disease cases in males and females. After creating a figure of size 5x4 inches, we use FigureCanvasTkAgg to plot the graph in the GUI.
We will now divide the dataset using Python code. The split is as follows:
Training Data - 80%
Testing Data - 20%
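An 80/20 split can be sketched with scikit-learn's train_test_split. The synthetic data below is a stand-in with 13 numeric features and a binary target, not the heart attack dataset, and the random_state values are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): 300 rows, 13 features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# test_size=0.2 keeps 80% of the rows for training and 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

print(X_train.shape, X_test.shape)  # (240, 13) (60, 13)
```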
Step 6: Feature Selection
This step describes the importance of each feature with respect to the output attribute. Each feature gets a score, and the higher the score, the more important or relevant the feature is. The features can be regarded as the independent variables and the output as the dependent variable. The graph shows which attributes the output depends on most closely, using the ExtraTreesClassifier() model.
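A feature ranking of this kind might be computed as follows. This is a sketch on synthetic stand-in data; the feature names, n_estimators, and random_state are assumptions, so the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier

# Synthetic stand-in data (assumption), with placeholder feature names.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ holds one score per feature; the scores sum to 1.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked[:5]:
    print(f"{name}: {score:.3f}")
```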
Step 7: Building the Decision Tree Model
After the feature selection step, we use the training data to create a model with a decision tree. First, we train the model using either the Gini index or entropy as the splitting criterion, so that it can predict the result.
(Note: here I used both the Gini index and entropy to show the difference in results.)
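Training one tree per criterion could look like the sketch below, which uses scikit-learn's DecisionTreeClassifier on synthetic stand-in data. The random_state values are assumptions, and the printed depths will differ for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

# Fit one tree per splitting criterion so the results can be compared.
models = {}
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, random_state=100)
    models[criterion] = clf.fit(X_train, y_train)

for criterion, clf in models.items():
    print(criterion, "depth:", clf.get_depth(), "leaves:", clf.get_n_leaves())
```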
Step 8 : Evaluating the Model
After the model is constructed, it needs to be evaluated, which we do by printing its accuracy on the test data. The following code performs this evaluation. Both methods (Gini index and entropy) return the same accuracy for this heart attack analysis.
The accuracy is 73%, meaning the model classifies 73% of the test samples correctly.
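Accuracy here is the share of test samples whose predicted class matches the true class. The post names a cal_accuracy function but its body is not shown, so the version below is only an assumed, dependency-free sketch of that calculation.

```python
def cal_accuracy(y_true, y_pred):
    """Return the percentage of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Made-up labels: three of the four predictions are correct.
print(cal_accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 75.0
```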
Step 9 : Improving Accuracy by restricting the depth of the Tree
By limiting the depth of the tree, we prevent it from becoming overly complex, which should also make it easier to visualise. To do this, we set up the model again and set the maximum number of levels the tree may have (max_depth) to 4. After that, we fit the model once more to the training data and re-evaluate it by printing the accuracy.
This time, we achieve an 82% accuracy, which is fairly good.
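The effect of max_depth can be checked with a side-by-side comparison such as the sketch below. It runs on synthetic stand-in data with assumed random_state values, so its accuracies will not match the 73% and 82% reported for the heart attack dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=100
)

full_tree = DecisionTreeClassifier(random_state=100).fit(X_train, y_train)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=100).fit(
    X_train, y_train
)

results = {}
for name, clf in (("unrestricted", full_tree), ("max_depth=4", shallow_tree)):
    results[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: depth={clf.get_depth()}, test accuracy={results[name]:.2f}")
```

A shallower tree is not guaranteed to score higher; it simply has less room to memorise noise in the training data, which often helps on unseen samples.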
Step 10 : Plot the Decision Tree
In this step, the code prints the decision tree and saves it as a PNG image named DecisionTree.png in a local path. A message box is then displayed, showing the details of the saved tree image.
In addition, I predict the chance of a heart attack both from values taken from the dataset and from values entered by the user. The predict() function predicts the chance of a heart attack from the attribute values passed to it.
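One way to save a tree as DecisionTree.png is scikit-learn's plot_tree together with matplotlib, sketched below. The synthetic data, figure size, and the choice of the system temp directory as the output folder are assumptions; the message box from the original program is left out.

```python
import os
import tempfile

import matplotlib

matplotlib.use("Agg")  # render off-screen, so no display is needed

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Synthetic stand-in data (assumption): 13 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
clf = DecisionTreeClassifier(criterion="gini", max_depth=4, random_state=100)
clf.fit(X, y)

fig, ax = plt.subplots(figsize=(16, 8))
plot_tree(clf, filled=True, class_names=["0", "1"], ax=ax)

# Assumed output location; the original program uses its own local path.
image_path = os.path.join(tempfile.gettempdir(), "DecisionTree.png")
fig.savefig(image_path, dpi=100)
plt.close(fig)

print("Decision tree saved to", image_path)
```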
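Predicting for a single set of attribute values might look like this sketch. It trains on synthetic stand-in data, so the row passed to predict() is not a real patient record, and predict_proba is shown only as an optional extra for reading the result as a chance.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 13 numeric features, binary target.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
clf = DecisionTreeClassifier(max_depth=4, random_state=100).fit(X, y)

# predict() expects a 2-D input, so a single row is wrapped in a list.
sample = X[0]
prediction = clf.predict([sample])[0]
probabilities = clf.predict_proba([sample])[0]

print("predicted class:", prediction)
print("class probabilities:", probabilities)
```

Values typed in by a user would be passed the same way, in the same column order that was used for training.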
| showing the prediction using values from dataset |
Using Tkinter, a new GUI window is created to show the prediction from user input.









