Apache Spark Distributed Application using PySpark in Google Colab

  Develop an Apache Spark application per the provided specifications, using PySpark in Google Colab.

Details

Use the following as a reference:

  • Create a new notebook in Google Colab
  • Download the Crunchbase Orgs dataset and upload it to the “Files” section of your Colab notebook (the upload may take a few minutes)
  • Read the Crunchbase Orgs dataset into a Spark DataFrame (see the setup sketch below)
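
The setup below is a minimal sketch: the filename organizations.csv is an assumption (use whatever name your uploaded Crunchbase Orgs file actually has), and pyspark is installed via pip if the Colab runtime does not already provide it.

    # Install PySpark in the Colab runtime if it is not already available:
    # !pip install pyspark
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CrunchbaseOrgs").getOrCreate()

    # Read the uploaded CSV into a Spark DataFrame, inferring column types.
    # "organizations.csv" is an assumed filename; adjust to your upload.
    df = spark.read.csv("organizations.csv", header=True, inferSchema=True)
    df.printSchema()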

Implement PySpark code using DataFrames, RDDs, or Spark UDF functions; an illustrative sketch for each task follows the list:

  1. Find all entities whose name starts with the letter “F” (e.g., Facebook):
    • print the count and show() the resulting Spark DataFrame
  2. Find all entities located in New York City:
    • print the count and show() the resulting Spark DataFrame
  3. Add a “Blog” column to the DataFrame with the row entries set to 1 if the “domain” field contains “blogspot.com”, and 0 otherwise.
    • show() only the records with the “Blog” field marked as 1
  4. Find all entities whose names are palindromes (a name that reads the same forwards and backwards, e.g. madam):
    • print the count and show() the resulting Spark DataFrame
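
Task 1, a sketch reusing df from the setup above and assuming the organization name is stored in a column called "name":

    from pyspark.sql import functions as F

    # Keep only rows whose name starts with a capital "F".
    f_orgs = df.filter(F.col("name").startswith("F"))
    print(f_orgs.count())
    f_orgs.show()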
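
Task 2, assuming a "city" column; Crunchbase dumps often record New York City as "New York", so both spellings are matched here as a precaution:

    from pyspark.sql import functions as F

    nyc_orgs = df.filter(F.col("city").isin("New York", "New York City"))
    print(nyc_orgs.count())
    nyc_orgs.show()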
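
Task 3, a sketch using when/otherwise; a NULL "domain" fails the contains() test and falls through to the 0 branch:

    from pyspark.sql import functions as F

    # Flag rows whose domain contains "blogspot.com" with Blog = 1.
    with_blog = df.withColumn(
        "Blog",
        F.when(F.col("domain").contains("blogspot.com"), 1).otherwise(0),
    )
    # Show only the records marked as blogs.
    with_blog.filter(F.col("Blog") == 1).show()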
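
Task 4, a sketch using a Spark UDF; lowercasing before the comparison is an assumption here (the spec's "madam" example suggests a plain character-level check):

    from pyspark.sql import functions as F
    from pyspark.sql.types import BooleanType

    @F.udf(returnType=BooleanType())
    def is_palindrome(name):
        # NULL-safe: rows without a name are not palindromes.
        if name is None:
            return False
        s = name.lower()
        return s == s[::-1]

    palindromes = df.filter(is_palindrome(F.col("name")))
    print(palindromes.count())
    palindromes.show()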