Problem 4: Statistical Description of Multivariate Data for a Real-World Dataset [40 points]
To complete this task, use the crx.data file, which contains data collected from credit card applications. All attribute names and values have been changed to meaningless symbols to protect the confidentiality of the data. The dataset was downloaded from the UCI Machine Learning Repository.
This dataset is interesting because it has a good mix of attributes: continuous, nominal with a small number of values, and nominal with a larger number of values. There are also a few missing values. Read the data into R using the following command.
data <- read.table("path/crx.data", sep = ",")
Here, replace path with the location of the file crx.data on your computer. After loading the data in R, you can access each column using data[ , 1], data[ , 2], ..., data[ , 15]. You can view the data using the View(data) command. Any column that contains a non-numeric entry, such as a missing-value marker, is loaded in character format, so you will have to convert such numeric columns from character to numeric using the as.numeric() function as follows.
attribute1 <- as.numeric(data[ , 2])
Entries that cannot be parsed as numbers, such as missing-value markers, become NA, and R warns that NAs were introduced by coercion.
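The conversion step can be tried on a small stand-in before touching the real file. The sketch below is only an illustration: the three rows are made-up values, and the use of "?" as the missing-value marker is an assumption about crx.data that you should confirm by viewing the data.

```r
# Toy stand-in for crx.data (hypothetical values, not taken from the real file).
# Assumption: missing entries are written as "?".
toy_text <- "b,30.83,0,u
a,?,4.46,u
a,24.50,0.5,y"
toy <- read.table(text = toy_text, sep = ",", stringsAsFactors = FALSE)

# Column 2 is character because of the "?" entry; column 3 is already numeric.
str(toy)

# Convert column 2. The "?" entry cannot be parsed, so it becomes NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
attribute1 <- suppressWarnings(as.numeric(toy[, 2]))

# Number of missing values in the converted column.
n_missing <- sum(is.na(attribute1))
print(n_missing)
```

If the marker really is "?", passing na.strings = "?" to read.table() is another option, since it turns those entries into NA while the file is being read.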
There are 16 columns in the data. The first 15 columns are the attributes, and the 16th column is the label. You only have to analyze the attributes.
- Identify which attributes are nominal and which are continuous.
- Identify the attribute or attributes with missing values (NA). Drop the attributes with missing values from the data.
- Calculate the central tendency of the remaining attributes. Remember that for nominal attributes you can only calculate the mode.
- Calculate the five-number summary of the numeric attributes.
- Show box plots for the numeric attributes and identify the attributes having outliers.
- Show pairwise scatter plots of the numeric attributes. Inspect the scatter plots and state whether each pair of attributes is negatively correlated, positively correlated, or uncorrelated.
*Do not forget to label the axes of the plots.
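The base R functions that map onto the tasks above can be sketched on a toy data frame. Everything in this example is illustrative: the data frame toy, its columns x, y, z and g, and the helper stat_mode are hypothetical names, and the values are made up rather than taken from crx.data.

```r
# Hypothetical toy data: three numeric attributes (one with an NA) and one nominal.
toy <- data.frame(
  x = c(1, 2, 2, 3, 50),
  y = c(10, 8, 9, 6, 1),
  z = c(1, NA, 3, 4, 5),
  g = c("a", "b", "a", "a", "b"),
  stringsAsFactors = FALSE
)

# Drop every attribute (column) that contains at least one NA.
complete <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

# Separate continuous (numeric) from nominal (non-numeric) attributes.
is_num <- sapply(complete, is.numeric)

# Helper (not a base R function): most frequent value of a vector.
stat_mode <- function(v) {
  counts <- table(v)
  names(counts)[which.max(counts)]
}

# Central tendency: mode for nominal attributes, mean and median for numeric ones.
nominal_modes <- sapply(complete[!is_num], stat_mode)
numeric_means <- sapply(complete[is_num], mean)
numeric_medians <- sapply(complete[is_num], median)

# Five-number summary: minimum, lower hinge, median, upper hinge, maximum.
five_num <- sapply(complete[is_num], fivenum)

# Box plots with labelled axes; boxplot.stats()$out lists the points
# that fall beyond the whiskers.
boxplot(complete[is_num], xlab = "Attribute", ylab = "Value")
outliers <- lapply(complete[is_num], function(v) boxplot.stats(v)$out)

# Pairwise scatter plots, plus the correlation matrix as a numeric cross-check.
pairs(complete[is_num], main = "Pairwise scatter plots")
cor_matrix <- cor(complete[is_num])
```

The correlation matrix is not requested by the tasks, but its signs are a quick way to check what you read off the scatter plots.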