DATA MINING - Genius Papers

Multiple Choice Questions: There could be more than one correct answer; choose all that apply

Describe Questions: Please describe briefly – no more than 500 words

Data Mining is:

Most applicable in large datasets
Discovering patterns and hidden trends in the data
Retrospective analyses of data
For providing accurate models and correct predictions

ALL OF THE ABOVE

(T/F) Data Mining requires a good understanding of statistics and computer science

TRUE

Data Mining relies on:

Cleaned and Curated data
Unstructured data
Computational efficiency of the algorithm
Training data
Non-experimental (Observational) data

The model selection process depends on several criteria including:

Hypothesis to be proved or disproved
Type of data available
Underlying methods such as association, etc.
All of the above

(T/F) Association mining typically requires you to identify strong rules that satisfy minimum support and minimum confidence thresholds.

Interestingness of patterns in a dataset can be determined by these methods:

Correlation
Association Rules
Classification
Lift & Chi Square Test
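
As a minimal sketch of how rule-based interestingness measures are computed, the Python below derives support, confidence, and lift for a single rule. The transactions, the item names, and the helper `rule_metrics` are assumptions made up for illustration; they do not come from any dataset referenced in this paper.

```python
# Hypothetical transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes both itemsets occur at least once, so no division by zero.
    """
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    cons = sum(1 for t in transactions if consequent <= t)
    support = both / n                # P(antecedent and consequent)
    confidence = both / ante          # P(consequent | antecedent)
    lift = confidence / (cons / n)    # confidence relative to the consequent's base rate
    return support, confidence, lift


support, confidence, lift = rule_metrics({"bread"}, {"milk"}, transactions)
print(support, confidence, lift)
```

A lift above 1 suggests the two itemsets co-occur more often than independence would predict; a lift below 1 suggests the opposite.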

(T/F) R² is a measure of the explanatory power of the independent variables
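
For reference, the coefficient of determination is commonly computed as one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below assumes that definition; the observed and predicted values and the function name `r_squared` are hypothetical.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total variation
    return 1 - ss_res / ss_tot


# Hypothetical observed and predicted values, for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
r2 = r_squared(y_true, y_pred)
print(r2)
```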

(T/F) Model fit refers to how well the variables correlate with one another in a model

Sensitivity and Specificity are two values useful in:

Receiver Operating Characteristic curve
Sigmoid curve
Logit curve
Sinusoidal curve
None of the above
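
As a minimal sketch of the two quantities named in this question, the Python below computes sensitivity (true positive rate) and specificity (true negative rate) from binary labels. The labels and the function name are assumptions for illustration; computing the pair at many decision thresholds gives the points of an ROC curve, which plots sensitivity against 1 - specificity.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded as 0/1.

    Assumes at least one positive and one negative in y_true.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical actual and predicted labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
sensitivity, specificity = sensitivity_specificity(y_true, y_pred)
print(sensitivity, specificity)
```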

(T/F) It's best to compare and contrast models using information criteria (AIC/BIC) for individual and hybrid models.
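
As a sketch of how the two criteria are usually computed, the Python below uses the standard definitions AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of parameters, n the number of observations, and L the maximized likelihood. The log-likelihoods, parameter counts, and model names are hypothetical values chosen only to show the comparison.

```python
import math


def aic(log_likelihood, k):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical fitted models: (maximized log-likelihood, number of parameters).
models = {"individual": (-120.0, 3), "hybrid": (-118.5, 6)}
n_obs = 100  # assumed number of observations

scores = {name: (aic(ll, k), bic(ll, k, n_obs)) for name, (ll, k) in models.items()}
best_by_aic = min(scores, key=lambda name: scores[name][0])
print(scores, best_by_aic)
```

Lower values are preferred under both criteria; BIC penalizes extra parameters more heavily than AIC once n exceeds about 7.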

Statistical inference refers to:

Predicting the outcome of a model run
Probability of an event occurrence
Measuring dependent variable and any error terms to arrive at a solution
None of the above

(T/F) Sample and Population in Statistics refer to how clean the dataset is before data modeling

The following technique is useful for a single descriptive measure of income by age:

Variance
Central Tendency
Outliers
All of the above
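
To make the idea of a single descriptive measure per group concrete, the sketch below summarizes income within each age group using the mean and median from Python's standard `statistics` module. The records and age-group labels are hypothetical.

```python
from statistics import mean, median

# Hypothetical (age group, income) records, for illustration only.
records = [
    ("20-29", 30000),
    ("20-29", 34000),
    ("20-29", 80000),
    ("30-39", 50000),
    ("30-39", 54000),
]

# Group incomes by age group.
by_age = {}
for age_group, income in records:
    by_age.setdefault(age_group, []).append(income)

# One (mean, median) pair per age group.
summary = {age: (mean(values), median(values)) for age, values in by_age.items()}
print(summary)
```

In the first group the single high income pulls the mean well above the median, which is why the choice of summary measure matters for skewed data such as income.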

(T/F) Probability theory is useful in statistics for improving upon a ‘random guess’ about events occurring

Probability of joint occurrence refers to:

Two independent events
Co-occurring events
Conditionally independent events
Multiplying the probabilities of individual events
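
As a small worked check of joint probability, the sketch below enumerates all outcomes of two fair dice and confirms that, for these two independent events, the joint probability equals the product of the individual probabilities. The dice setup and the helper `prob` are assumptions for illustration; for dependent events the general rule is P(A and B) = P(A) × P(B | A).

```python
from fractions import Fraction
from itertools import product

# Sample space: every ordered outcome of rolling two fair six-sided dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Probability of an event as the fraction of equally likely outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


p_a = prob(lambda o: o[0] == 6)                    # first die shows 6
p_b = prob(lambda o: o[1] == 6)                    # second die shows 6
p_joint = prob(lambda o: o[0] == 6 and o[1] == 6)  # both show 6

print(p_a, p_b, p_joint)
```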

In the article: Advanced Scout – Data Mining and Knowledge Discovery in NBA Data

Describe the purpose of creating the data mining software (application), i.e., what value does it add?

In the article: Advanced Scout – Data Mining and Knowledge Discovery in NBA Data

Describe the 4 general steps used in the application as part of data mining, including a possible data structure from which the application could read the data.

A few applications of Text Mining & NLP (Natural Language Processing) are:

Web reviews and ratings
Medical Records
Grading Exams
Social Media

19 & 20) Describe any Data Mining Application, and write a hypothesis statement for the problem.

Focus on how to build features that are predictive