DATA MINING

Multiple Choice Questions: There may be more than one correct answer; choose all that apply.

Describe Questions: Please describe briefly, in no more than 500 words.

  • Data Mining is:
  1. Most applicable in large datasets
  2. Discovering patterns and hidden trends in the data
  3. Retrospective analyses of data
  4. For providing accurate models and correct predictions

ALL OF THE ABOVE

  • (T/F) Data Mining requires a good understanding of statistics and computer science

TRUE

  • Data Mining relies on:
  1. Cleaned and Curated data
  2. Unstructured data
  3. Computational efficiency of the algorithm
  4. Training data
  5. Non-experimental (Observational) data

 

 

  • The model selection process depends on several criteria including:
  1. Hypothesis to be proved or disproved
  2. Type of data available
  3. Underlying methods such as association, etc.
  4. All of the above

 

  • (T/F) Association mining typically requires you to identify strong rules that satisfy minimum support and minimum confidence thresholds.

 

 

  • Interestingness of patterns in a dataset can be determined by these methods:
  1. Correlation
  2. Association Rules
  3. Classification
  4. Lift & Chi Square Test
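
For reference, support, confidence, and lift are standard measures used when evaluating association rules. The following is a minimal sketch of how they are computed, assuming a small made-up set of market-basket transactions; the data and the function names are illustrative only and do not come from the course material.

```python
# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent) for the rule antecedent -> consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence divided by the baseline support of the consequent.

    A value near 1 suggests the two sides occur together about as often
    as they would if they were independent.
    """
    return confidence(antecedent, consequent, transactions) / support(
        consequent, transactions
    )


rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
rule_lift = lift({"bread"}, {"milk"}, transactions)
print(rule_support, rule_confidence, rule_lift)
```

In this toy data the rule bread -> milk has a lift below 1, meaning the two items appear together slightly less often than independence would predict.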

 

  • (T/F) R² is a measure of the explanatory power of the independent variables

 

 

  • (T/F) Model fit refers to how well the variables correlate with one another in a model

 

 

  • Sensitivity and Specificity are two values useful in:
  1. Receiver Operating Characteristic curve
  2. Sigmoid curve
  3. Logit curve
  4. Sinusoidal curve
  5. None of the above
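
For reference, sensitivity is the fraction of actual positives that a classifier flags as positive, and specificity is the fraction of actual negatives that it flags as negative. The sketch below computes both at several decision thresholds, assuming made-up labels and scores; the data and the function name are illustrative only.

```python
def sensitivity_specificity(y_true, y_score, threshold):
    """Return (sensitivity, specificity) when scores >= threshold count as positive."""
    tp = fn = tn = fp = 0
    for label, score in zip(y_true, y_score):
        predicted_positive = score >= threshold
        if label == 1 and predicted_positive:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted_positive:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical true labels and classifier scores (illustrative data only).
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.7, 0.4, 0.6, 0.3, 0.1]

results = {}
for threshold in (0.2, 0.5, 0.8):
    results[threshold] = sensitivity_specificity(y_true, y_score, threshold)
    sens, spec = results[threshold]
    print(f"threshold={threshold}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```

Raising the threshold trades sensitivity for specificity, which is why the two values are usually reported together across a range of thresholds.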

 

  • (T/F) It's best to compare and contrast models using information criteria such as AIC/BIC, for both individual and hybrid models.

 

  • Statistical inference refers to:
  1. Predicting the outcome of a model run
  2. Probability of an event occurrence
  3. Measuring dependent variable and any error terms to arrive at a solution
  4. None of the above

 

  • (T/F) Sample and Population in Statistics refer to how clean the dataset is before data modeling

 

  • The following technique is useful for a single descriptive measure of income by age:
  1. Variance
  2. Central Tendency
  3. Outliers
  4. All of the above

 

  • (T/F) Probability theory is useful in statistics for improving upon a ‘random guess’ about whether events will occur

 

  • Probability of joint occurrence refers to:
  1. Two independent events
  2. Co-occurring events
  3. Conditionally independent events
  4. Multiplying the probabilities of individual events
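
For reference, a joint probability P(A and B) can be found by direct enumeration and compared with the product of the individual probabilities. The sketch below does this for two fair dice; the events chosen are assumptions made purely for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))


def prob(event):
    """Exact probability of an event, given as a predicate over outcomes."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))


def first_even(o):
    return o[0] % 2 == 0


def second_high(o):
    return o[1] > 4  # second die shows 5 or 6


def sum_high(o):
    return o[0] + o[1] >= 10


# Events on different dice: the joint probability equals the product.
p_joint_ab = prob(lambda o: first_even(o) and second_high(o))
p_product_ab = prob(first_even) * prob(second_high)

# Events that share a die: the joint probability differs from the product,
# but it still equals P(A) * P(C | A).
p_joint_ac = prob(lambda o: first_even(o) and sum_high(o))
p_product_ac = prob(first_even) * prob(sum_high)
p_c_given_a = p_joint_ac / prob(first_even)

print(p_joint_ab, p_product_ab)
print(p_joint_ac, p_product_ac, p_c_given_a)
```

Multiplying the individual probabilities gives the joint probability only in the first case; the second case needs the conditional probability.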

 

  • In the article: Advanced Scout – Data Mining and Knowledge Discovery in NBA Data

Describe the purpose of creating the data mining software (application), i.e., what value does it add?

 

 

 

 

 

  • In the article: Advanced Scout – Data Mining and Knowledge Discovery in NBA Data

Describe the four general steps the application uses as part of data mining, including a possible data structure from which the application could read its data.

 

 

 

  • A few applications of Text Mining & NLP (Natural Language Processing) are:
  1. Web reviews and ratings
  2. Medical Records
  3. Grading Exams
  4. Social Media

 

19 & 20) Describe any data mining application, and write a hypothesis statement for the problem.

Focus on how to build features that are predictive.