Assignment 2: Exploratory Data Analysis

In this assignment, you will identify a dataset of interest and perform an exploratory analysis to better understand the shape & structure of the data, investigate initial questions, and develop preliminary insights & hypotheses. Your final submission will take the form of a report consisting of captioned visualizations that convey key insights gained during your analysis.

Step 1: Data Selection

First, you will pick a topic area of interest to you and find a dataset that can provide insights into that topic. To streamline the assignment, we’ve pre-selected a number of datasets for you to choose from.

However, if you would like to investigate a different topic and dataset, you are free to do so. If you do decide to work with a self-selected dataset, please check with the course staff to ensure it is appropriate for this assignment. Be advised that data collection and preparation (also known as data wrangling) can be a very tedious and time-consuming process. Be sure to leave yourself sufficient time to conduct exploratory analysis after preparing the data.

After selecting a topic and dataset — but prior to analysis — you should write down an initial set of at least three different questions you’d like to investigate.

Step 2: Exploratory Visual Analysis

Next, you will perform an exploratory analysis of your dataset using a visualization tool such as Tableau. You should consider two different phases of exploration.

In the first phase, you should seek to gain an overview of the shape & structure of your dataset. What variables does the dataset contain? How are they distributed? Are there any notable data quality issues? Are there any surprising relationships among the variables? Be sure to also perform “sanity checks” for patterns you expect to see!
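
If you prefer to run a quick profiling pass in code before (or alongside) a visualization tool, the sketch below shows one possible way to answer the first-phase questions: which columns exist, how many values are missing, and whether numeric ranges look plausible. It uses only the Python standard library. The sample data, column names, and the `profile` helper are made up for illustration and are not part of any provided dataset.

```python
import csv
import io

# Made-up sample standing in for a real CSV export.
# In practice, replace io.StringIO(SAMPLE_CSV) with open("your_file.csv", newline="").
SAMPLE_CSV = """station,date,temp_max,precip
S1,2017-01-01,7.2,0.0
S1,2017-01-02,,1.3
S2,2017-01-01,2.1,0.0
S2,2017-01-02,250.0,0.5
"""


def to_float(value):
    """Return value as a float, or None if it is not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def profile(rows):
    """Summarize each column: missing count, distinct count, numeric range."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v.strip() != ""]
        numbers = [to_float(v) for v in present]
        # Treat a column as numeric only if every non-missing value parses.
        is_numeric = bool(present) and all(n is not None for n in numbers)
        summary[column] = {
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "min": min(numbers) if is_numeric else None,
            "max": max(numbers) if is_numeric else None,
        }
    return summary


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile(rows)
for column, stats in report.items():
    print(column, stats)
```

In this made-up sample, the summary would surface one missing `temp_max` value and a maximum of 250.0, which is the kind of implausible reading a sanity check is meant to catch.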

In the second phase, you should investigate your initial questions, as well as any new questions that arise during your exploration. For each question, start by creating a visualization that might provide a useful answer. Then refine the visualization (e.g., by adding additional variables, changing sorting or axis scales, transforming your data by filtering or subsetting it, etc.) to develop better perspectives, explore unexpected observations, or sanity check your assumptions. You should repeat this process for each of your questions, but feel free to revise your questions or branch off to explore new questions if the data warrants.
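
The refine-and-repeat loop described above can also be sketched in code. The example below is only an illustration under assumed data: it starts with a first-pass aggregate, then refines it by filtering to a subset and sorting the result. The records, field names, and the `mean_by` helper are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; in practice these would come from your dataset.
records = [
    {"genre": "Drama", "year": 1995, "rating": 7.1},
    {"genre": "Drama", "year": 2010, "rating": 6.4},
    {"genre": "Comedy", "year": 2008, "rating": 5.9},
    {"genre": "Comedy", "year": 2012, "rating": 6.5},
    {"genre": "Action", "year": 2011, "rating": 6.0},
]


def mean_by(rows, key, value):
    """Group rows by `key` and return the mean of `value` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {k: round(mean(v), 2) for k, v in groups.items()}


# First pass: average rating per genre over all records.
first_pass = mean_by(records, "genre", "rating")

# Refinement: restrict to a subset (here, year >= 2005) and recompute.
recent = [r for r in records if r["year"] >= 2005]
refined = mean_by(recent, "genre", "rating")

# Sorting the refined result makes the ranking easier to read in a chart.
ranked = sorted(refined.items(), key=lambda item: item[1], reverse=True)
print(first_pass)
print(ranked)
```

Comparing `first_pass` with `refined` shows how a single filter can change the answer, which is often worth its own captioned view.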

Final Deliverable

Your final submission should take the form of a PDF report — similar to a slide show or comic book — that consists of 10 or more captioned visualizations detailing your most important insights. Your “insights” can include important surprises or issues (such as data quality problems affecting your analysis) as well as responses to your analysis questions. To help you gauge the scope of this assignment, see this example report analyzing data about motion pictures. We’ve annotated and graded this example to help you calibrate for the breadth and depth of exploration we’re looking for.

Each visualization image should be a screenshot exported from a visualization tool, accompanied by a title and a descriptive caption (2-4 sentences long) describing the insight(s) learned from that view. Provide sufficient detail in each caption so that anyone could read through your report and understand what you’ve learned. You are free, but not required, to annotate your images to draw attention to specific features of the data. You may perform highlighting within the visualization tool itself, or draw annotations on the exported image. To easily export images from Tableau, use the Worksheet > Export > Image… menu item.

The end of your report should include a brief summary of main lessons learned.

To get up and running quickly with this assignment, we recommend exploring one of the following provided datasets:

  • World Bank Indicators, 1960–2017. The World Bank has tracked global human development by indicators such as climate change, economy, education, environment, gender equality, health, and science and technology since 1960. The linked repository contains indicators that have been formatted to facilitate use with Tableau and other data visualization tools. However, you’re also welcome to browse and use the original data by indicator or by country. Click on an indicator category or country to download the CSV file.

  • Chicago Crimes, 2001–present (click Export to download a CSV file). This dataset reflects reported incidents of crime (with the exception of murders where data exists for each victim) that occurred in the City of Chicago from 2001 to present, minus the most recent seven days. Data is extracted from the Chicago Police Department’s CLEAR (Citizen Law Enforcement Analysis and Reporting) system.

  • Daily Weather in the U.S., 2017. This dataset contains daily U.S. weather measurements in 2017, provided by the NOAA Daily Global Historical Climatology Network. This data has been transformed: some weather stations with only sparse measurements have been filtered out. See the accompanying weather.txt for descriptions of each column.

  • Social mobility in the U.S. Raj Chetty’s group at Harvard studies the factors that contribute to (or hinder) upward mobility in the United States (i.e., will our children earn more than we will). Their work has been extensively featured in The New York Times. This page lists data from all of their papers, broken down by geographic level or by topic. We recommend downloading data in the CSV/Excel format, and encourage you to consider joining multiple datasets from the same paper (under the same heading on the page) for a sufficiently rich exploratory process.

  • The Yelp Open Dataset provides information about businesses, user reviews, and more from Yelp’s database. The data is split into separate files (business, checkin, photos, review, tip, and user), and is available in either JSON or SQL format. You might use this to investigate the distributions of scores on Yelp, look at how many reviews users typically leave, or look for regional trends about restaurants. Note that this is a large, structured dataset and you don’t need to look at all of the data to answer interesting questions. In order to download the data you will need to enter your email and agree to Yelp’s Dataset License.
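
For a large dataset such as the Yelp data, it can help to work with a small sample first. The sketch below assumes a file that stores one JSON object per line; check the layout of the file you actually download, since this is an assumption. The field names (`business_id`, `stars`), the inline sample, and the `read_sample` helper are hypothetical and used only for illustration.

```python
import io
import json
from collections import Counter
from itertools import islice

# Hypothetical stand-in for a large file with one JSON object per line.
# In practice, replace SAMPLE with open("your_file.json", encoding="utf-8").
SAMPLE = io.StringIO(
    '{"business_id": "a1", "stars": 4.5}\n'
    '{"business_id": "b2", "stars": 3.0}\n'
    '{"business_id": "c3", "stars": 4.5}\n'
    '{"business_id": "d4", "stars": 2.0}\n'
)


def read_sample(lines, limit):
    """Parse at most `limit` non-empty lines without loading the whole file."""
    non_empty = (line for line in lines if line.strip())
    return [json.loads(line) for line in islice(non_empty, limit)]


sample = read_sample(SAMPLE, limit=3)

# Count how often each star value appears in the sample.
star_counts = Counter(record["stars"] for record in sample)
print(star_counts)
```

Note that taking the first lines of a file is not a random sample. If the file is ordered (for example, by date or region), the sample may be biased, so treat any pattern found this way as a hypothesis to re-check on more of the data.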

Additional Data Sources

If you want to investigate datasets other than those recommended above, here are some possible sources to consider. You are also free to use data from a source different from those included here. If you have any questions on whether your dataset is appropriate, please ask the course staff ASAP!

Visualization Tools

You are free to use one or more visualization tools in this assignment. However, in the interest of time and for a friendlier learning curve, we strongly encourage you to use Tableau. Tableau provides a graphical interface focused on the task of visual data exploration. You will (with rare exceptions) be able to complete an initial data exploration more quickly and comprehensively than with a programming-based tool.

Data Wrangling Tools

The data you choose may require reformatting, transformation or cleaning prior to visualization. Here are tools you can use for data preparation. We recommend first trying to import and process your data in the same tool you intend to use for visualization. If that fails, pick the most appropriate option among the tools below. Contact the course staff if you are unsure what might be the best option for your data!
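
One common reformatting step is reshaping a "wide" table (one column per year) into a "long" table (one row per measurement), which many visualization tools handle more easily. The sketch below shows one possible way to do this with the Python standard library. The sample data, column names, and the `wide_to_long` helper are made up for illustration; adapt them to the actual layout of your file.

```python
import csv
import io

# Made-up wide-format extract: one column per year, with one missing value.
WIDE_CSV = """country,indicator,2015,2016,2017
Country A,example_indicator,65.2,65.8,66.3
Country B,example_indicator,75.4,,76.0
"""


def wide_to_long(rows, id_columns):
    """Turn each year column into its own row, skipping missing values."""
    long_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_columns or value == "":
                continue  # skip identifier columns and missing measurements
            record = {name: row[name] for name in id_columns}
            record["year"] = int(column)
            record["value"] = float(value)
            long_rows.append(record)
    return long_rows


wide_rows = list(csv.DictReader(io.StringIO(WIDE_CSV)))
long_rows = wide_to_long(wide_rows, id_columns=["country", "indicator"])
for record in long_rows:
    print(record)
```

This sketch silently drops missing measurements. Depending on your analysis, you may prefer to keep them as explicit empty values so that gaps remain visible in your charts.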

Graphical Tools

  • Tableau Prep - Tableau itself provides basic facilities for data import, transformation & blending. Tableau Prep is a more sophisticated data preparation tool.
  • Trifacta Wrangler - Interactive tool for data transformation & visual profiling.
  • OpenRefine - A free, open source tool for working with messy data.

Programming Tools

Grading

The assignment score is out of a maximum of 10 points. Submissions that squarely meet the requirements (i.e., the “Satisfactory” column in the rubric below) will receive a score of 8. We will determine scores by judging the breadth and depth of your analysis, whether visualizations meet the expressiveness and effectiveness principles, and how well-written and synthesized your insights are.

We will use the following rubric to grade your assignment. Note that rubric cells may not map exactly to specific point scores.

Component | Excellent | Satisfactory | Poor
Breadth of Exploration | More than 3 questions were initially asked, and they target substantially different portions/aspects of the data. | At least 3 questions were initially asked of the data, but there is some overlap between questions. | Fewer than 3 initial questions were posed of the data.
Depth of Exploration | A sufficient number of follow-up questions were asked to yield insights that helped to more deeply explore the initial questions. | Some follow-up questions were asked, but they did not take the analysis much deeper than the initial questions. | No follow-up questions were asked after answering the initial questions.
Data Quality | Data quality was thoroughly assessed with extensive profiling of fields and records. | Simple checks were conducted on only a handful of fields or records. | Little or no evidence that data quality was assessed.
Visualizations | More than 10 visualizations were produced, and a variety of marks and encodings were explored. All design decisions were both expressive and effective. | At least 10 visualizations were produced. The visual encodings chosen were largely effective and expressive, but some errors remain. | Several ineffective or inexpressive design choices were made. Fewer than 10 visualizations were produced.
Data Transformation | More advanced transformations were used to extend the dataset in interesting or useful ways. | Simple transforms (e.g., sorting, filtering) were primarily used. | The raw dataset was used directly, with little to no additional transformation.
Captions | Captions richly describe the visualizations and contextualize the insight within the analysis. | Captions do a good job describing the visualizations, but could better connect prior or subsequent steps of the analysis. | Captions are missing, overly brief, or shallow in their analysis of visualizations.
Creativity & Originality | You exceeded the parameters of the assignment, with original insights or a particularly engaging design. | You met all the parameters of the assignment. | You met most of the parameters of the assignment.

Submission Details

This is an individual assignment. You may not work in groups.

Your completed exploratory analysis report is due by Tuesday 3/9, 11:59 pm EST. Submit your PDF report on Canvas.