The first steps of the EDA process are to summarize the data and use storytelling to connect the business opportunity to the data. Data visualization is perhaps the most powerful tool at our disposal to help tell that story. In this module, we will use Jupyter Notebooks to showcase several best practices surrounding data visualization. This presentation is in fact a Jupyter Notebook. This lesson is organized into three sections. We will mainly focus on best practices, as these contents assume that you are already familiar with Pandas, Matplotlib, and Jupyter. We will quickly touch on some of the essential tools, but this is by no means a comprehensive survey of the data visualization landscape. Keep in mind that data visualization must be carried out in a reproducible way, and the end products are generally packaged into a deliverable that will be communicated to business stakeholders.

There are other powerful tools like Zeppelin and RStudio, but Jupyter has become an industry standard in the Python ecosystem. Currently, there are more than 5 million notebooks saved on GitHub. Jupyter Notebooks are portable and can be run in numerous environments. They support dozens of languages, and they're integrated with both Matplotlib and Pandas, making them an ideal tool for EDA. There are numerous other frameworks out there, and it is reasonable to use other languages like R to carry out data visualization. With respect to the Python ecosystem, though, Matplotlib is the most common tool once you have accounted for direct and indirect usage. This visualization helps provide some perspective. Below the graphic, we highlight some of the tools that are commonly used when simple plots are just not adequate. For these materials, we will focus on the use of simple plots, and for this, the libraries Matplotlib and Seaborn are the most widely used.

We're going to use the world happiness dataset for this lesson. This is a commonly used data set to practice EDA. There is a very specific target variable that does not require domain knowledge, namely happiness. Each observation is a country in a given year. The code shown here loads the data from a CSV file into a Pandas data frame. Once loaded, the columns are cleaned using regular expressions and the convenient rename method. We truncate both the names and the data frame itself for visualization purposes, and the outputs show the exact changes. The source website notes that this is a landmark survey of the state of global happiness that ranks over 150 countries by how happy their citizens perceive themselves to be. The data from the 2015 to 2017 reports are in the CSV file.

The goal in a business scenario is to tell the story of the data. Often, for a data science project, there will be a business metric that we're trying to improve. If we think of happiness in this case as revenue or profit, and the countries as different products, then this data set starts to look like something that is pretty common in business. If the target or business metric is happiness, then it makes sense to look at the sorted data. These data were first sorted on year in ascending order and then on happiness in descending order. It is reasonable to expect that the features shown would play a role in explaining a country's perceived happiness. This is a good point to note that sorting and, more generally, any manipulation of spreadsheet-like data should exist as code to ensure reproducibility.
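As a rough sketch of the loading and cleaning step described above (the file name, column names, and regular expressions here are illustrative assumptions, not the exact ones used in the lesson):

    import re
    import pandas as pd

    def clean_column(name):
        """Lowercase a column name, drop parenthetical notes, and join words with underscores."""
        name = re.sub(r"\(.*?\)", "", name.lower())
        return re.sub(r"[^a-z0-9]+", "_", name).strip("_")

    # load the world happiness data (file name is an assumption)
    df = pd.read_csv("world-happiness.csv")

    # clean the column names using regular expressions and the rename method
    df = df.rename(columns=clean_column)
    print(df.head())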
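The sort described above can likewise live in code rather than in a spreadsheet; a minimal sketch, assuming cleaned column names year and happiness_score:

    # sort on year (ascending), then on happiness score (descending),
    # so the happiest countries appear first within each year
    df_sorted = df.sort_values(["year", "happiness_score"], ascending=[True, False])
    print(df_sorted.head(10))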
There are several high-profile cases of published academic work being retracted because these types of manipulations were done using a mouse from within a spreadsheet tool. Pandas is an incredible tool for carrying out programmatic manipulation of data. The Pandas documentation is also quite good compared to other packages, and there is a lot of built-in functionality. Pivot tables and groupbys are methods that perform aggregations over a Pandas data frame. There are some differences between pivot tables and groupbys, but either can be used to create aggregate summary tables. See the Pandas tutorial on reshaping and pivot tables to learn more. Also note that you can have more than one index. In this slide, we show another aggregation, except now it is done over both region and year. The table is truncated after four regions because it does not fit on one slide. Longer summary tables can be useful in reports and dashboards, but it is often the case that a simple plot will do a better job telling the story of the data.

There are a few other best practices in EDA to keep in mind. Remember to save as much code as possible within files, even when you're using Jupyter. Version control is a key component of effective collaboration and reproducible research. Examples of plots that you frequently create should be saved in a repository or folder as a resource; it will save you time. The final headline here is to make an educated guess before you see the plot. This habit is surprisingly useful for quality assurance of both data and code.

Matplotlib has a functional interface, similar to MATLAB, that works via the pyplot module for simple interactive use, as well as an object-oriented interface that is more Pythonic and better for plots that require some level of customization or modification. The latter is called the artist interface. There is also built-in functionality within Pandas for rapid access to plotting capabilities. We will shortly see an example of the Seaborn library, which is essentially an extension of Matplotlib. We are only going to touch on a few key data visualization techniques, and one of them specifically deals with turning your summary tables into simple plots. There are many resources available to help you get better at plotting; the official Matplotlib tutorials and gallery are a good place to start.

We see here the Pandas interface to Matplotlib in the .plot method. There are some interface limitations when it comes to using this for plotting, but it serves as an efficient first pass. You also see that we are encapsulating the plotting code as a function so that it may be hidden if this notebook were used as a formal presentation. Exposed plot generation and other code can limit effective communication. In keeping with the best practices of storing code in text files for version control, as well as the cataloging of code blocks, the next version of this plot will make explicit use of a Python script. It will also showcase some of the additional functionality available through the Matplotlib artist interface. This plot was created from a script that is meant to be run directly from the command line, which means it can also be called from Jupyter. The script is conceptually organized through the use of functions: the create plot function makes use of a create subplot function, and the data ingestion exists as its own function. The result is a more refined and very customizable plot of the summary pivot table. This pair plot was produced directly from the data frame with minimal code using Seaborn.
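A sketch of the two aggregation approaches mentioned above, here grouping on both region and year (the column names are assumptions carried over from the earlier sketch, and df is the cleaned data frame from that sketch):

    import pandas as pd

    # groupby: mean happiness score per region and year
    summary_gb = df.groupby(["region", "year"])["happiness_score"].mean()

    # pivot_table: the same aggregation expressed with more than one index
    summary_pt = pd.pivot_table(df, index=["region", "year"],
                                values="happiness_score", aggfunc="mean")
    print(summary_pt.head(8))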
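The first-pass .plot call mentioned above, encapsulated in a function so the code can be hidden when the notebook is used as a presentation; a minimal sketch under the same column-name assumptions:

    import matplotlib.pyplot as plt

    def plot_happiness_by_year(df):
        """First-pass plot using the Pandas .plot interface to Matplotlib."""
        summary = df.groupby("year")["happiness_score"].mean()
        ax = summary.plot(kind="bar", rot=0, title="Mean happiness score by year")
        ax.set_ylabel("happiness score")
        return ax

    plot_happiness_by_year(df)
    plt.show()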
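The standalone script is only described at a high level in the lesson, so the following is a sketch of how such a file might be organized (all names here, including create_plot, create_subplot, and the column names, are assumptions). Saved as, say, make_plot.py, it could be run from the command line or from a Jupyter cell with !python make_plot.py:

    #!/usr/bin/env python
    """Sketch of a plotting script meant to be run from the command line (or called from Jupyter)."""
    import pandas as pd
    import matplotlib.pyplot as plt

    def load_data(path="world-happiness.csv"):
        """Data ingestion lives in its own function."""
        return pd.read_csv(path)

    def create_subplot(ax, df, year):
        """Draw one year's regional summary on a provided Axes, using the artist interface."""
        summary = df[df["year"] == year].groupby("region")["happiness_score"].mean()
        summary.plot(kind="barh", ax=ax)
        ax.set_title(str(year))
        ax.set_xlabel("mean happiness score")

    def create_plot(df, years=(2015, 2016, 2017)):
        """Build the figure from subplots and save it in a portable format."""
        fig, axes = plt.subplots(1, len(years), figsize=(14, 5), sharey=True)
        for ax, year in zip(axes, years):
            create_subplot(ax, df, year)
        fig.tight_layout()
        fig.savefig("happiness-summary.png")

    if __name__ == "__main__":
        create_plot(load_data())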
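For the pair plot, Seaborn really does reduce the work to a couple of lines; a sketch, with the feature selection being an assumption:

    import seaborn as sns
    import matplotlib.pyplot as plt

    # pairwise relationships among a few (assumed) features of the cleaned data frame
    features = ["happiness_score", "economy_gdp_per_capita", "freedom", "generosity"]
    sns.pairplot(df[features].dropna())
    plt.show()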
We also modify the CSS to match the presentation theme. Pair plots are a powerful tool to summarize the relationships among features. This is another example of a simple plot. Simple plots are quick to produce, quick to modify, and can be saved in multiple formats. When we refer to a plot as a simple plot, it does not necessarily mean that it lacks complexity; for lack of a better term, it implies that it can be produced quickly and saved in a portable format. Dashboards, interactive plots, and really any environment where a plot is no longer portable are where the term simple plot no longer applies.