Download the following datasets and preprocess each one as required by its assigned task.

  • Bank Marketing Dataset (Classification, target = “bank term deposit”)
  • U.S. Pollution Dataset (Regression, target = “CO AQI”)
  • Perform EDA on both datasets
  • Build a Random Forest model for each
  • Report appropriate evaluation metrics for each
  • Compare your test-set results with your training-set results and with a baseline model
  • Optional Task:
    • Create a Random Forest using Grid Search to optimize your hyperparameters
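
The modeling steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: synthetic data from scikit-learn stands in for both datasets (loading, EDA, and preprocessing of the real files are not shown), the baselines are scikit-learn's dummy estimators, and F1, R² and MAE are example metric choices, not ones the assignment prescribes.

```python
from sklearn.datasets import make_classification, make_regression
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import f1_score, mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # assumption: fixed seed so runs are repeatable


def evaluate_classifier(X, y):
    """Fit a baseline and a Random Forest; report train and test F1."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE
    )
    # Baseline: always predict the most frequent class.
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    forest = RandomForestClassifier(
        n_estimators=100, random_state=RANDOM_STATE
    ).fit(X_train, y_train)
    return {
        "baseline_test_f1": f1_score(
            y_test, baseline.predict(X_test), zero_division=0
        ),
        "forest_train_f1": f1_score(y_train, forest.predict(X_train)),
        "forest_test_f1": f1_score(y_test, forest.predict(X_test)),
    }


def evaluate_regressor(X, y):
    """Fit a baseline and a Random Forest; report train and test R^2 and MAE."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=RANDOM_STATE
    )
    # Baseline: always predict the mean of the training target.
    baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
    forest = RandomForestRegressor(
        n_estimators=100, random_state=RANDOM_STATE
    ).fit(X_train, y_train)
    return {
        "baseline_test_r2": r2_score(y_test, baseline.predict(X_test)),
        "forest_train_r2": r2_score(y_train, forest.predict(X_train)),
        "forest_test_r2": r2_score(y_test, forest.predict(X_test)),
        "forest_test_mae": mean_absolute_error(y_test, forest.predict(X_test)),
    }


# Synthetic stand-ins for the two datasets (assumption: the real data would
# be loaded and preprocessed before this point).
X_clf, y_clf = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], class_sep=1.5, random_state=RANDOM_STATE,
)
X_reg, y_reg = make_regression(
    n_samples=600, n_features=10, n_informative=5,
    noise=10.0, random_state=RANDOM_STATE,
)

clf_scores = evaluate_classifier(X_clf, y_clf)
reg_scores = evaluate_regressor(X_reg, y_reg)

for name, value in {**clf_scores, **reg_scores}.items():
    print(f"{name}: {value:.3f}")
```

A large gap between the train and test scores of the forest usually points to overfitting, and a test score close to the baseline suggests the model has learned little beyond the target's overall distribution.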


Try an XGBoost model and then explore tuning its parameters with Grid Search. Which model gets the best score: “vanilla” Random Forest, vanilla XGBoost, Grid Search tuned Random Forest, or Grid Search tuned XGBoost? What is your hypothesis as to why?
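
One possible way to set up this four-way comparison is sketched below for the classification case. It rests on several assumptions: synthetic data again replaces the real dataset, the parameter grids are small illustrative choices rather than recommended values, F1 is used as the score, and the two XGBoost candidates are skipped if the separate `xgboost` package is not installed.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split

try:
    # xgboost is a separate package (pip install xgboost)
    from xgboost import XGBClassifier
except ImportError:
    XGBClassifier = None

RANDOM_STATE = 42  # assumption: fixed seed so runs are repeatable

# Synthetic stand-in for the preprocessed classification data.
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], class_sep=1.5, random_state=RANDOM_STATE,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=RANDOM_STATE
)


def tuned(estimator, param_grid):
    """Wrap an estimator in a 3-fold grid search scored by F1."""
    return GridSearchCV(estimator, param_grid, scoring="f1", cv=3)


# Illustrative grids only; real grids depend on the data and time budget.
rf_grid = {
    "n_estimators": [50, 100],
    "max_depth": [None, 5],
    "min_samples_leaf": [1, 5],
}
xgb_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 6],
    "learning_rate": [0.05, 0.3],
}

candidates = {
    "vanilla_rf": RandomForestClassifier(random_state=RANDOM_STATE),
    "tuned_rf": tuned(RandomForestClassifier(random_state=RANDOM_STATE), rf_grid),
}
if XGBClassifier is not None:
    candidates["vanilla_xgb"] = XGBClassifier(random_state=RANDOM_STATE)
    candidates["tuned_xgb"] = tuned(
        XGBClassifier(random_state=RANDOM_STATE), xgb_grid
    )

# Fit every candidate on the training split and score it on the test split.
test_scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    test_scores[name] = f1_score(y_test, model.predict(X_test))

best_name = max(test_scores, key=test_scores.get)
for name, score in sorted(test_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: test F1 = {score:.3f}")
print("best on the test split:", best_name)
print("tuned RF parameters:", candidates["tuned_rf"].best_params_)
```

Grid Search selects parameters by cross-validated score on the training split, so a tuned model is not guaranteed to beat its vanilla counterpart on the held-out test split. Any such difference is worth addressing when forming a hypothesis about the ranking.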

