How to quickly create a powerful exploratory data analysis visualization

Once you’ve got yourself a nice cleaned dataset, the next step is Exploratory Data Analysis (EDA). EDA is the process of figuring out what the data can tell us, and we use it to discover patterns, relationships, or anomalies that inform our subsequent analysis. While there is an almost overwhelming number of methods to use in EDA, one of the most effective starting tools is the pairs plot (also called a scatterplot matrix). A pairs plot allows us to see both the distribution of single variables and the relationships between two variables. Pairs plots are a great method to identify trends for follow-up analysis and, fortunately, are easily implemented in Python!

In this article we will walk through getting up and running with pairs plots in Python using the seaborn visualization library. We will see how to create a default pairs plot for a rapid examination of our data and how to customize the visualization for deeper insights. The code for this project is available as a Jupyter Notebook on GitHub. We will explore a real-world dataset composed of country-level socioeconomic data collected by GapMinder.

Pairs Plots in Seaborn

To get started we need to know what data we have. We can load in the socioeconomic data as a pandas dataframe and look at the columns:
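A minimal sketch of that loading step is shown here. The rows are illustrative placeholders laid out like the dataset described in this article (the country names and values are made up, not taken from GapMinder), and the data is read from an in-memory string because the file name is not given here. In practice the call would be `pd.read_csv` on your own file path.

```python
import io

import pandas as pd

# Placeholder rows in the same layout as the article's data.
# Country names and values are invented for illustration only.
csv_text = """country,continent,year,life_exp,pop,gdp_per_cap
Country A,Asia,2002,61.5,25000000,1200.0
Country A,Asia,2007,63.0,27000000,1500.0
Country B,Europe,2002,77.2,9000000,21000.0
Country B,Europe,2007,78.4,9100000,25000.0
"""

# With a real file this would be pd.read_csv('<your file>.csv')
df = pd.read_csv(io.StringIO(csv_text))

# Look at the first rows, the column names, and the column types
print(df.head())
print(df.columns.tolist())
print(df.dtypes)
```

Checking `df.dtypes` is a quick way to see which columns are categorical (text) and which are numerical before plotting.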

Each row of the data represents an observation for one country in one year, and the columns hold the variables (data in this format is known as tidy data). There are 2 categorical columns (country and continent) and 4 numerical columns. The columns are fairly self-explanatory: life_exp is life expectancy at birth in years, pop is population, and gdp_per_cap is gross domestic product per person in units of international dollars.

The default pairs plot in seaborn only plots numerical columns, although later we will use the categorical variables for coloring. Creating the default pairs plot is simple: we load in the seaborn library and call the pairplot function, passing it our dataframe:

        # Seaborn visualization library
        import seaborn as sns
        # Create the default pairplot
        sns.pairplot(df)

I’m still amazed that one simple line of code gives us this entire plot! The pairs plot builds on two basic figures, the histogram and the scatter plot. The histograms on the diagonal allow us to see the distribution of a single variable, while the scatter plots on the upper and lower triangles show the relationship (or lack thereof) between two variables. For example, the left-most plot in the second row shows the scatter plot of life_exp versus year.
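To make that structure concrete, here is a rough sketch of the same layout built by hand with matplotlib: histograms on the diagonal and scatter plots everywhere else. It is not how seaborn implements pairplot internally, and the three columns a, b, and c are synthetic stand-ins rather than the socioeconomic variables.

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for three numerical columns (hypothetical data)
data = {
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
    "c": rng.normal(size=200),
}
names = list(data)
n = len(names)

fig, axes = plt.subplots(n, n, figsize=(3 * n, 3 * n))
for i, row_var in enumerate(names):
    for j, col_var in enumerate(names):
        ax = axes[i, j]
        if i == j:
            # Diagonal: distribution of a single variable
            ax.hist(data[row_var], bins=20)
        else:
            # Off-diagonal: column variable on x, row variable on y
            ax.scatter(data[col_var], data[row_var], s=10, alpha=0.6)
        if i == n - 1:
            ax.set_xlabel(col_var)
        if j == 0:
            ax.set_ylabel(row_var)
fig.tight_layout()
```

Each panel in row i, column j puts the row variable on the y-axis and the column variable on the x-axis, which is the convention used when reading the seaborn figure as well.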

The default pairs plot by itself often gives us valuable insights. We see that life expectancy and GDP per capita are positively correlated, showing that people in higher-income countries tend to live longer (although this of course does not prove that one causes the other). It also appears that (thankfully) life expectancies worldwide are on the rise over time. From the histograms, we learn that the population and GDP variables are heavily right-skewed. To better show these variables in future plots, we can transform these columns by taking the logarithm of the values:

        import numpy as np
        # Take the log of population and gdp_per_cap
        df['log_pop'] = np.log10(df['pop'])
        df['log_gdp_per_cap'] = np.log10(df['gdp_per_cap'])
        # Drop the non-transformed columns
        df = df.drop(columns = ['pop', 'gdp_per_cap'])
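One way to check that a log transform really reduces right skew is to compare the sample skewness before and after. The sketch below uses a synthetic log-normal series as a stand-in for a column such as population, so the numbers it prints say nothing about the actual dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed stand-in for a column such as population
values = pd.Series(rng.lognormal(mean=10, sigma=1.5, size=1000))

# Sample skewness: near 0 for a symmetric distribution,
# large and positive for a right-skewed one
skew_raw = values.skew()
skew_log = np.log10(values).skew()

print(f"skewness before log10: {skew_raw:.2f}")
print(f"skewness after log10:  {skew_log:.2f}")
```

The same two `.skew()` calls can be run on the real pop and gdp_per_cap columns before they are dropped.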

While this plot alone can be useful in an analysis, we can make it more valuable by coloring the figures based on a categorical variable such as continent. This is also extremely simple in seaborn! All we need to do is use the hue keyword in the sns.pairplot function call:

        sns.pairplot(df, hue = 'continent')

Now we see that Oceania and Europe tend to have the highest life expectancies and Asia has the largest population. Notice that our log transformation of the population and GDP made these variables roughly normally distributed, which gives a more thorough representation of the values.

This graph is more informative, but there are still some issues: I tend not to find stacked histograms, as on the diagonals, to be very interpretable. A better method for showing univariate (single variable) distributions from multiple categories is the density plot. We can exchange the histogram for a density plot in the function call. While we are at it, we will pass in some keywords to the scatter plots to change the transparency, size, and edgecolor of the points.

        # Create a pair plot colored by continent with a density plot on the
        # diagonal and format the scatter plots.
        # Note: `height` replaces the `size` keyword of older seaborn releases.
        sns.pairplot(df, hue = 'continent', diag_kind = 'kde',
                     plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
                     height = 4)

The density plots on the diagonal make it easier to compare distributions between the continents than stacked bars. Changing the transparency of the scatter plots increases readability because there is considerable overlap (known as overplotting) on these figures.

As a final example of the default pairplot, let’s reduce the clutter by plotting only the years after 2000. We will still color by continent, but now we won’t plot the year column. To limit the columns plotted, we pass in a list of vars to the function. To clarify the plot, we can also add a title.

        import matplotlib.pyplot as plt
        # Plot colored by continent for years 2000-2007
        sns.pairplot(df[df['year'] >= 2000],
                     vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'],
                     hue = 'continent', diag_kind = 'kde',
                     plot_kws = {'alpha': 0.6, 's': 80, 'edgecolor': 'k'},
                     height = 4);
        # Title
        plt.suptitle('Pair Plot of Socioeconomic Data for 2000-2007',
                     size = 28);

This is starting to look pretty nice! If we were going to do modeling, we could use information from these plots to inform our choices. For example, we know that log_gdp_per_cap is positively correlated with life_exp, so we could create a linear model to quantify this relationship. For this post we’ll stick to plotting, and, if we want to explore our data even more, we can customize the pairplots using the PairGrid class.
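Although modeling is outside the scope of this post, a quick sketch shows what quantifying such a relationship could look like. It fits a straight line with NumPy's least-squares polyfit on synthetic stand-ins for log_gdp_per_cap and life_exp. The slope and noise level used to generate the data are arbitrary assumptions, not estimates from the real dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for log_gdp_per_cap and life_exp
# (the underlying relationship here is invented for illustration)
log_gdp = rng.uniform(2.5, 4.5, size=150)
life_exp = 20 + 12 * log_gdp + rng.normal(scale=4, size=150)

# Fit life_exp = slope * log_gdp + intercept by least squares
slope, intercept = np.polyfit(log_gdp, life_exp, deg=1)

# Pearson correlation between the two variables
r = np.corrcoef(log_gdp, life_exp)[0, 1]

print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

Because the predictor is on a log10 scale, the slope reads as the change in life expectancy associated with a tenfold increase in GDP per capita.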

Customization with PairGrid

In contrast to the sns.pairplot function, sns.PairGrid is a class, which means that it does not automatically fill in the plots for us. Instead, we create a class instance and then map specific functions to the different sections of the grid. To create a PairGrid instance with our data, we use the following code, which also limits the variables we will show:

        # Create an instance of the PairGrid class.
        grid = sns.PairGrid(data = df[df['year'] == 2007],
                            vars = ['life_exp', 'log_pop', 'log_gdp_per_cap'],
                            height = 4)

If we were to display this, we would get a blank graph because we have not mapped any functions to the grid sections. There are three grid sections to fill in for a PairGrid: the upper triangle, the lower triangle, and the diagonal. To map plots to these sections, we use the grid.map method on the section. For example, to map a scatter plot to the upper triangle we use:

        # Map a scatter plot to the upper triangle
        grid = grid.map_upper(plt.scatter, color = 'darkred')

The map_upper method takes in any function that accepts two arrays of variables (such as plt.scatter) and associated keywords (such as color). The map_lower method is exactly the same but fills in the lower triangle of the grid. The map_diag method is slightly different because it takes in a function that accepts a single array (recall that the diagonal shows only one variable). An example is plt.hist, which we use to fill in the diagonal section below:

        # Map a histogram to the diagonal
        grid = grid.map_diag(plt.hist, bins = 10, color = 'darkred',
                             edgecolor = 'k')
        # Map a density plot to the lower triangle
        grid = grid.map_lower(sns.kdeplot, cmap = 'Reds')

In this case, we are using a kernel density estimate in 2-D (a density plot) on the lower triangle. Put together, this code gives us the following plot:

The real benefits of using the PairGrid class come when we want to create custom functions to map different information onto the plot. For instance, I might want to add the Pearson Correlation Coefficient between two variables onto the scatterplot. To do so, I would write a function that takes in two arrays, calculates the statistic, and then draws it on the graph. The following code shows how this is done (credit to this Stack Overflow answer):

Our new function is mapped to the upper triangle because we need two arrays to calculate a correlation coefficient (notice also that we can map multiple functions to grid sections). This produces the following plot:

The correlation coefficient now appears above the scatterplot. This is a relatively straightforward example, but we can use PairGrid to map any function we want onto the plot. We can add as much information as needed, provided we can figure out how to write the function! As a final example, here is a plot that shows the summary statistics on the diagonal instead of a plot.
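One way such a diagonal function could be written is sketched below. The function name summary, the choice of statistics, and the text placement are assumptions for this illustration, and the data is again synthetic.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

def summary(x, **kwargs):
    """Fill a diagonal panel with a text block of summary statistics."""
    stats = pd.Series(x).describe()[['mean', 'std', 'min', '50%', 'max']]
    stats = stats.round(2)
    ax = plt.gca()
    ax.set_axis_off()
    ax.annotate(stats.to_string(), xy=(0.1, 0.2), size=14,
                xycoords=ax.transAxes, family='monospace')

# Synthetic stand-in for the article's dataframe (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    'life_exp': rng.normal(67, 12, size=100),
    'log_pop': rng.normal(7, 0.7, size=100),
    'log_gdp_per_cap': rng.uniform(2.5, 4.5, size=100),
})

grid = sns.PairGrid(data=demo,
                    vars=['life_exp', 'log_pop', 'log_gdp_per_cap'],
                    height=4)
grid = grid.map_upper(plt.scatter, color='darkred')
grid = grid.map_lower(plt.scatter, color='darkred')
# map_diag calls summary once per variable with a single array
grid = grid.map_diag(summary)
```

Seaborn draws the diagonal on separate overlay axes (available as `grid.diag_axes`), so the frame of the underlying panel stays visible behind the text.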

This needs a little cleaning up, but it shows the general idea: in addition to using any existing function in a library such as matplotlib to map data onto the figure, we can write our own function to show custom information.

Conclusion

Pairs plots are a powerful tool to quickly explore distributions and relationships in a dataset. Seaborn provides a simple default method for making pairs plots that can be customized and extended through the PairGrid class. In a data analysis project, a major portion of the value often comes not from the flashy machine learning, but from the straightforward visualization of data. A pairs plot provides us with a comprehensive first look at our data and is a great starting point in data analysis projects.

I welcome feedback and constructive criticism and can be reached on Twitter @koehrsen_will.

Source: https://towardsdatascience.com/visualizing-data-with-pair-plots-in-python-f228cf529166