A. The Project
This project has 2 parts:
The 1st part is for everyone and counts as your Weekly Assignment.
The 2nd part is optional, for those who want to make up their midterm score. You can get up to 20 points added to your exam score. To make it easier on you, I will allow you to submit it as a group. Groups can have up to 4 people, but you can submit it alone too. Just let me know your groups in advance. You can use Decision Trees, which we cover, or other more sophisticated methods like Random Forests or Logistic Regression (these two algorithms might generate better results).
Part 1: You will develop a Classification/Prediction model that predicts whether a patient has a disease or is healthy.
Part 2: You will do the same for individual diseases, predicting whether or not a patient has that particular disease.

B. Data
1. You are provided with an Excel file that has 11 Worksheets.
2. The first one (Hormones_Diseases) has the Hormones vs Diseases table. This table lists all Hormonal Measurement values of babies (patients) and their relationship (correlation level) with each disease. These values come from doctors (Endocrinologists) who are experts on these diseases, but they are also the ones who need help to diagnose patients better. So use these correlation or relationship values as starting points, or as weights for your attributes (hormones), but do not completely ignore the other hormones that show no relationship. After all, the data might suggest that some of the blank ones are good predictors of whether or not a genetic disease exists.
3. The Up and Down arrows show the direction and strength of the relationship (correlation) between diseases and hormones according to the doctors. For example, several Down arrows mean the hormone is strongly but negatively correlated with the disease, whereas a single Up arrow means they are mildly but positively correlated. Blank cells are not considered correlated by the doctors, but the doctors still think those hormones are important. We already removed 3 hormones that they deemed of no importance at all. Again, please do not ignore the hormones that are blank. Make sure you give more importance to the ones with more arrows (using weights is one option: a positive value for Up and a negative value for Down).
4. The other 10 Worksheets contain the measured levels of each hormone for the patients who have that disease. For example, all the patients in the 2) 21OHD-C Disease worksheet have the 21OHD-C Disease, and these are their measurements. All other worksheets work the same way.

C. Data Preparation
1. For Part 1: You will need to create a Dataset in tabular format, with one patient per row, their information (attributes) as columns, and 1 final attribute which is the class. I created a last worksheet titled Dataset Format to help you get started, but you can choose your own format. (You can create this dataset either by copy/paste with Transpose or by writing a small Macro if you know VBScript.) (If you are not going to do Part 2, you do not need a column called Disease.)
2. For Part 2: If you are going to do Part 2, you can either:
   a. Create a Classification Model that predicts the Disease (Disease is the class attribute, and this is a multi-class problem).
   b. Develop a Classification Model just like in Part 1, but this time train and test it only on the patients who have this disease plus some Healthy patients. This would be simpler.
   c. Or keep the patients who have the disease as yes in the Sick? attribute and change everyone else to no, including patients who have other diseases (meaning that they do not have that particular genetic disease).

D. Project & Submission
1. Part 1: Develop a Classification Model that predicts whether or not a patient has a disease, improve your model to achieve better prediction accuracy, and submit the following:
   a) A report that describes the Classification Algorithm you chose and its Parameter settings, the Accuracy, and the Confusion Matrix.
   b) The final code, either as a Jupyter Notebook (preferably) or as .py file(s). (Make sure you explain any dependencies, since I need to run your code and see the same results.)
   c) The dataset you created from the Excel file and used in your model training and testing.
2. Part 2: Submit the same items, only this time develop a Model that predicts whether the patient has a certain disease, as explained in Section A (Project) above.
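
To make option C.2(c) and the report items in D.1(a) more concrete, here is a minimal sketch in plain Python. It is only an illustration, not a required approach. The column names Disease and Sick? follow the text above, but the Patient column, the "Healthy" and "OtherDisease" values, the toy rows, and the "predicted" list are all made up for the example. In your project the predictions would come from your own trained model.

```python
from collections import Counter


def relabel_one_vs_rest(rows, target_disease):
    """Copy the rows, setting 'Sick?' to 'yes' only for the target disease.

    Everyone else, including patients with other diseases, becomes 'no'.
    """
    relabeled = []
    for row in rows:
        new_row = dict(row)  # copy so the original dataset stays unchanged
        new_row["Sick?"] = "yes" if row["Disease"] == target_disease else "no"
        relabeled.append(new_row)
    return relabeled


def confusion_matrix(actual, predicted, labels=("yes", "no")):
    """Rows are actual classes, columns are predicted classes, in `labels` order."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


def accuracy(actual, predicted):
    """Fraction of patients whose predicted class matches the actual class."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Toy rows (assumption: invented for illustration, not taken from the Excel file).
patients = [
    {"Patient": "P1", "Disease": "21OHD-C", "Sick?": "yes"},
    {"Patient": "P2", "Disease": "OtherDisease", "Sick?": "yes"},
    {"Patient": "P3", "Disease": "Healthy", "Sick?": "no"},
    {"Patient": "P4", "Disease": "21OHD-C", "Sick?": "yes"},
]

relabeled = relabel_one_vs_rest(patients, "21OHD-C")
actual = [row["Sick?"] for row in relabeled]

# Pretend model output (assumption: hard-coded stand-in for real predictions).
predicted = ["yes", "no", "yes", "yes"]

matrix = confusion_matrix(actual, predicted)
acc = accuracy(actual, predicted)
print(matrix)  # [[2, 0], [1, 1]]
print(acc)     # 0.75
```

Here patient P2 has a different disease, so after relabeling it counts as no for 21OHD-C. The matrix is laid out with actual classes as rows and predicted classes as columns; if you use a library routine instead, check its row/column convention before copying numbers into your report.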