Multiple Choice Questions: There may be more than one correct answer; choose all that apply.
Descriptive Questions: Please describe briefly, in no more than 500 words.
- Data Mining is:
- Most applicable in large datasets
- Discovering patterns and hidden trends in the data
- Retrospective analyses of data
- For providing accurate models and correct predictions
ALL OF THE ABOVE
- (T/F) Data Mining requires a good understanding of statistics and computer science
TRUE
- Data Mining relies on:
- Cleaned and Curated data
- Unstructured data
- Computational efficiency of the algorithm
- Training data
- Non-experimental (Observational) data
- The model selection process depends on several criteria including:
- Hypothesis to be proved or disproved
- Type of data available
- Underlying methods such as association, etc.
- All of the above
- (T/F) Association mining typically requires you to identify strong rules that meet minimum support and confidence thresholds.
- Interestingness of patterns in a dataset can be determined by these methods:
- Correlation
- Association Rules
- Classification
- Lift & Chi Square Test
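As a study aid for the association-rule measures named in the questions above, here is a minimal sketch of support, confidence, and lift. The transactions and the function names are hypothetical, made up for illustration; they do not come from the course material.

```python
from typing import FrozenSet, List

# Hypothetical toy transactions, invented for illustration only.
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(antecedent: FrozenSet[str], consequent: FrozenSet[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Estimate of P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)


def lift(antecedent: FrozenSet[str], consequent: FrozenSet[str],
         transactions: List[FrozenSet[str]]) -> float:
    """Confidence divided by the consequent's baseline support.

    A value near 1 suggests the two sides are roughly independent.
    """
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


bread = frozenset({"bread"})
milk = frozenset({"milk"})
rule_support = support(bread | milk, TRANSACTIONS)
rule_confidence = confidence(bread, milk, TRANSACTIONS)
rule_lift = lift(bread, milk, TRANSACTIONS)
```

A rule would count as "strong" only if `rule_support` and `rule_confidence` both clear thresholds chosen by the analyst; the thresholds themselves are a judgment call and are not fixed by the method.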
- (T/F) R² is a measure of the explanatory power of the independent variables
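For reference, the coefficient of determination can be sketched as one minus the ratio of residual to total sum of squares. The function name and the sample values below are assumptions chosen for illustration.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    # Variation left unexplained by the model's predictions.
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    # Total variation of the observed values around their mean.
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


perfect_fit = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
mean_only_fit = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

A model that always predicts the mean explains none of the variation, which is why `mean_only_fit` comes out at zero in this sketch.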
- (T/F) Model fit refers to how well the variables correlate with one another in a model
- Sensitivity and Specificity are two values useful in:
- Receiver Operating Characteristic curve
- Sigmoid curve
- Logit curve
- Sinusoidal curve
- None of the above
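To make sensitivity and specificity concrete, the sketch below computes both from a binary classifier's scores at a given threshold and collects the (1 - specificity, sensitivity) pairs across thresholds. The labels, scores, and function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], scores: Sequence[float],
                            threshold: float) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when scores >= threshold are called positive."""
    tp = fp = tn = fn = 0
    for label, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity, specificity


def curve_points(y_true: Sequence[int], scores: Sequence[float],
                 thresholds: Sequence[float]) -> List[Tuple[float, float]]:
    """One (false positive rate, true positive rate) point per threshold."""
    points = []
    for t in thresholds:
        sens, spec = sensitivity_specificity(y_true, scores, t)
        points.append((1.0 - spec, sens))
    return points


# Hypothetical labels and classifier scores, for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]
at_half = sensitivity_specificity(labels, scores, 0.5)
at_low = sensitivity_specificity(labels, scores, 0.3)
points = curve_points(labels, scores, [0.3, 0.5])
```

Lowering the threshold raises sensitivity here without improving specificity, which is the trade-off such a curve is meant to display.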
- (T/F) It's best to compare and contrast models using information criteria such as AIC/BIC for individual and hybrid models.
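As a reminder of what the two criteria measure, here is a minimal sketch using the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of estimated parameters, n the number of observations, and ln L the maximized log-likelihood. The numeric inputs are hypothetical.

```python
import math


def aic(log_likelihood: float, k: int) -> float:
    """Akaike information criterion: 2k - 2 ln L."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood: float, k: int, n: int) -> float:
    """Bayesian information criterion: k ln n - 2 ln L."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fits of two candidate models to the same 100 observations.
simple_model = {"log_likelihood": -10.0, "k": 2}
complex_model = {"log_likelihood": -9.5, "k": 6}
n_obs = 100

aic_simple = aic(simple_model["log_likelihood"], simple_model["k"])
aic_complex = aic(complex_model["log_likelihood"], complex_model["k"])
bic_simple = bic(simple_model["log_likelihood"], simple_model["k"], n_obs)
bic_complex = bic(complex_model["log_likelihood"], complex_model["k"], n_obs)
```

Lower values are preferred under both criteria, and the comparison is only meaningful between models fitted to the same data.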
- Statistical inference refers to:
- Predicting the outcome of a model run
- Probability of an event occurrence
- Measuring dependent variable and any error terms to arrive at a solution
- None of the above
- (T/F) Sample and Population in Statistics refer to how clean the dataset is before data modeling
- The following technique is useful for a single descriptive measure of income by age:
- Variance
- Central Tendency
- Outliers
- All of the above
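To illustrate summarizing income by age with a single number per group, the sketch below computes the mean and median for each age. The records are hypothetical and the function name is an assumption.

```python
from statistics import mean, median
from typing import Dict, List, Tuple

# Hypothetical (age, income) records, invented for illustration only.
RECORDS: List[Tuple[int, float]] = [
    (25, 30000.0),
    (25, 34000.0),
    (40, 52000.0),
    (40, 60000.0),
    (40, 250000.0),  # one unusually high income in this group
]


def summarize_by_age(records: List[Tuple[int, float]]) -> Dict[int, Dict[str, float]]:
    """Group incomes by age and report the mean and median of each group."""
    groups: Dict[int, List[float]] = {}
    for age, income in records:
        groups.setdefault(age, []).append(income)
    return {
        age: {"mean": mean(incomes), "median": median(incomes)}
        for age, incomes in groups.items()
    }


summary = summarize_by_age(RECORDS)
```

In the age-40 group the single high income pulls the mean well above the median, so the choice between the two measures matters when a group contains extreme values.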
- (T/F) Probability theory is useful in statistics for improving upon a ‘random guess’ about whether events will occur
- Probability of joint occurrence refers to:
- Two independent events
- Co-occurring events
- Conditionally independent events
- Multiplying the probabilities of individual events
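As a worked illustration of joint probability, the sketch below enumerates two fair dice and checks that, for two independent events, the joint probability equals the product of the individual probabilities. The dice setup and event definitions are assumptions chosen for illustration.

```python
from fractions import Fraction
from itertools import product
from typing import Callable, Tuple

Outcome = Tuple[int, int]

# Sample space: every ordered pair of faces from two fair six-sided dice.
OUTCOMES = list(product(range(1, 7), repeat=2))


def prob(event: Callable[[Outcome], bool]) -> Fraction:
    """Exact probability of an event under equally likely outcomes."""
    favorable = sum(1 for outcome in OUTCOMES if event(outcome))
    return Fraction(favorable, len(OUTCOMES))


def first_is_even(outcome: Outcome) -> bool:
    return outcome[0] % 2 == 0


def second_is_six(outcome: Outcome) -> bool:
    return outcome[1] == 6


p_a = prob(first_is_even)
p_b = prob(second_is_six)
p_joint = prob(lambda o: first_is_even(o) and second_is_six(o))
```

The product rule holds here because the two dice do not influence each other; for dependent events the joint probability would instead be P(A) times P(B given A).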
- In the article "Advanced Scout: Data Mining and Knowledge Discovery in NBA Data":
Describe the purpose of creating the data mining software (application), i.e., what added value does it bring?
- In the article "Advanced Scout: Data Mining and Knowledge Discovery in NBA Data":
Describe the four general steps the application uses as part of data mining, including a possible data structure from which the application reads the data.
- A few applications of Text Mining & NLP (Natural Language Processing) are:
- Web reviews and ratings
- Medical Records
- Grading Exams
- Social Media
19 & 20) Describe any Data Mining Application, and write a hypothesis statement for the problem.
Focus on how to build features that are predictive.