Creating violin plots using SAS and Python
- Date August 30, 2019
- Written by Pavel Rogatch
- Category Python
This guide shows different ways of creating violin plots using SAS9API and Python.
Introduction
Violin plots are similar to box plots, except that they also show the probability density of the data (usually smoothed by a kernel density estimator). Violin plots can be symmetric or asymmetric. They are particularly useful for comparing the shape of a distribution across groups.
There are several ways to create a violin plot using SAS and Python. Here we demonstrate how to create violin plots using the cars and bweight datasets from the SASHELP library. The five examples below show how to create violin plots using SAS and three Python libraries: matplotlib, seaborn and plotly.
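To make the idea of a kernel density estimate more concrete, here is a minimal sketch in plain Python. It is not how matplotlib, seaborn or plotly compute their violins internally; the function name gaussian_kde and the hand-picked bandwidth are assumptions made for illustration only. The sample values are the vehicle weights from the first rows of the cars table shown in Step 4.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density of `samples` at a point x."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        # Sum one Gaussian bump per observation, centred on that observation
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Vehicle weights from the first five rows of the 'cars' table
weights = [4451.0, 2778.0, 3230.0, 3575.0, 3880.0]
density = gaussian_kde(weights, bandwidth=400.0)  # bandwidth chosen by hand (assumption)

# Evaluate the density on a grid from 2000 to 6000;
# a violin is this curve mirrored around a central axis
grid = [2000.0 + 50.0 * i for i in range(81)]
curve = [density(x) for x in grid]
```

A smaller bandwidth gives a bumpier curve and a larger one a smoother curve, which is why two libraries can draw slightly different violins from the same data.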
Prerequisites
To follow the examples presented here you need to have the following:
- access to SAS9API proxy,
- Python 3 installed,
- sas9api, pandas, matplotlib, seaborn and plotly Python libraries.
Step 1 – Getting the necessary libraries
The Python sas9api library is available at https://github.com/analytium/python-sas9api. We need to download the sas9api.py file to our local computer. This can be done in either of two ways:
- if you have Git installed, running git clone https://github.com/analytium/python-sas9api to download the repository to a designated folder on the local computer;
- going to https://github.com/analytium/python-sas9api/blob/master/sas9api.py, right-clicking on the Raw button in the top right and then choosing “Save target as” to save the sas9api.py file to the local computer.
We need to place the sas9api.py file in the same folder as our Python code, or provide a path to it, for the import to succeed.
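If sas9api.py lives in a different folder, one common way to "provide a path" is to add that folder to sys.path before importing. The sketch below assumes a folder named python-sas9api next to the current working directory; both the folder name and the helper add_module_dir are hypothetical and should be adapted to your setup.

```python
import sys
from pathlib import Path

def add_module_dir(directory):
    """Put `directory` at the front of sys.path if it is not there yet."""
    path = str(Path(directory).resolve())
    if path not in sys.path:
        sys.path.insert(0, path)
    return path

# Hypothetical location of the cloned repository; adjust to your setup
module_dir = add_module_dir("python-sas9api")

# After this, `import sas9api as sas` should find sas9api.py in that folder
```

Setting the PYTHONPATH environment variable to the same folder is an alternative that avoids changing the code.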
If you do not have the following Python libraries installed, please follow the respective links for the installation instructions:
- pandas – https://pandas.pydata.org/pandas-docs/stable/install.html,
- matplotlib – https://matplotlib.org/users/installing.html,
- seaborn – http://seaborn.pydata.org/installing.html,
- plotly – https://plot.ly/python/getting-started/.
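Before moving on, it may be convenient to check which of these libraries Python can already find. The following is a small sketch using the standard importlib module; the helper name missing_modules is our own and not part of any of the libraries above.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]

required = ["sas9api", "pandas", "matplotlib", "seaborn", "plotly"]
missing = missing_modules(required)
if missing:
    print("Not found:", ", ".join(missing))
else:
    print("All required libraries were found.")
```

Note that sas9api is only found if sas9api.py is in the current folder or on the import path, as described above.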
Now everything is ready to write the code.
Step 2 – Importing the necessary libraries
We start by importing the necessary libraries.
import sas9api as sas  # Enables you to connect to a SAS server
import pandas as pd  # Used for a more convenient way to present data
import matplotlib.pyplot as plt  # Python plotting library
Step 3 – Specifying SAS9API URL and port
We assign our SAS server URL and port to a variable that will be used to access data in SAS libraries:
url = "your_SAS9API_url:port"
Step 4 – Retrieving data from a SAS server
Now let us retrieve data from two SAS datasets using the retrieve_data function from the sas9api Python library.
We pass the following parameters to this function:
- url – your SAS server URL defined earlier,
- library_name – SAS library name,
- dataset_name – SAS dataset name,
- limit – maximum number of records to retrieve,
- only_payload – a flag set to True to get data as a list of dictionaries containing dataset records only without the response header.
Then we convert data to a Pandas DataFrame and display its first rows.
# Data from the 'cars' dataset
dat = sas.retrieve_data(url, library_name="sashelp", dataset_name="cars",
                        limit=10000, only_payload=True)
cars_df = pd.DataFrame(dat)
cars_df.head()
  | Cylinders | DriveTrain | EngineSize | Horsepower | Invoice | Length | MPG_City | MPG_Highway | MSRP | Make | Model | Origin | Type | Weight | Wheelbase |
0 | 6.0 | All | 3.5 | 265.0 | 33337 | 189.0 | 17.0 | 23.0 | 36945 | Acura | MDX | Asia | SUV | 4451.0 | 106.0 |
1 | 4.0 | Front | 2.0 | 200.0 | 21761 | 172.0 | 24.0 | 31.0 | 23820 | Acura | RSX Type S 2dr | Asia | Sedan | 2778.0 | 101.0 |
2 | 4.0 | Front | 2.4 | 200.0 | 24647 | 183.0 | 22.0 | 29.0 | 26990 | Acura | TSX 4dr | Asia | Sedan | 3230.0 | 105.0 |
3 | 6.0 | Front | 3.2 | 270.0 | 30299 | 186.0 | 20.0 | 28.0 | 33195 | Acura | TL 4dr | Asia | Sedan | 3575.0 | 108.0 |
4 | 6.0 | Front | 3.5 | 225.0 | 39014 | 197.0 | 18.0 | 24.0 | 43755 | Acura | 3.5 RL 4dr | Asia | Sedan | 3880.0 | 115.0 |
# Data from the 'bweight' dataset
bweight_dat = sas.retrieve_data(url, library_name="sashelp", dataset_name="bweight",
                                server_name="SASApp", limit=10000, only_payload=True)
bweight_df = pd.DataFrame(bweight_dat)
bweight_df.head()
  | Black | Boy | CigsPerDay | Married | MomAge | MomEdLevel | MomSmoke | MomWtGain | Visit | Weight |
0 | 0.0 | 1.0 | 0.0 | 1.0 | -3.0 | 0.0 | 0.0 | -16.0 | 1.0 | 4111.0 |
1 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 2.0 | 0.0 | 2.0 | 3.0 | 3997.0 |
2 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | -3.0 | 3.0 | 3572.0 |
3 | 0.0 | 1.0 | 0.0 | 1.0 | -1.0 | 2.0 | 0.0 | -5.0 | 3.0 | 1956.0 |
4 | 0.0 | 1.0 | 0.0 | 1.0 | -6.0 | 0.0 | 0.0 | -20.0 | 3.0 | 3515.0 |
Now let us create violin plots in several different ways.
Step 5 – Creating violin plots
Example 1 (using the matplotlib library): violin plots of Weight for each car Type.
# Prepare a list of weight distributions, one per car type
labels = cars_df["Type"].unique()
car_weights = []
for car in labels:
    car_weights.append(list(cars_df[cars_df["Type"] == car]["Weight"]))
# Create a figure instance
fig = plt.figure()
# Create an axes instance
ax = fig.add_axes([0, 0, 1, 1])
# Create the violin plot
ax.violinplot(car_weights, showmeans=True)
# Customize axis: violins are drawn at positions 1..n,
# so place one tick per violin before labelling the ticks
ax.set_xticks(range(1, len(labels) + 1))
ax.set_xticklabels(labels)
ax.set_xlabel('Vehicle Type')
ax.set_ylabel('Weight')
ax.set_title("Weight of Vehicles")
plt.show()
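The grouping step in Example 1 can also be sketched without pandas, working directly on a list of dictionaries such as the one returned when only_payload is set to True. The helper group_values is hypothetical, and the three records below are a hand-written subset of the cars table shown in Step 4, not an actual server response.

```python
from collections import defaultdict

def group_values(records, key, value):
    """Collect record[value] into lists keyed by record[key], in first-seen order."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return dict(groups)

# A few records in list-of-dictionaries form, with values copied from the 'cars' table
payload = [
    {"Model": "MDX", "Type": "SUV", "Weight": 4451.0},
    {"Model": "RSX Type S 2dr", "Type": "Sedan", "Weight": 2778.0},
    {"Model": "TSX 4dr", "Type": "Sedan", "Weight": 3230.0},
]

weights_by_type = group_values(payload, key="Type", value="Weight")
labels = list(weights_by_type)                # one label per violin
car_weights = list(weights_by_type.values())  # one list of weights per violin
```

The resulting car_weights list has the same shape as the one passed to ax.violinplot in Example 1, with labels in matching order.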
Example 2 (using the seaborn library): violin plots of MPG_Highway for each car Type.
# Import 'seaborn' library
import seaborn as sns
# Create violin plot using 'cars' dataset
sns.set(style="whitegrid")
ax = sns.violinplot(x="Type", y="MPG_Highway", data=cars_df)
ax.set_title("Miles per Gallon for Different Types of Vehicles");
Example 3 (using seaborn library): violin plots of weights of newborn babies depending on their sex and their mothers’ smoking habits.
Here we use hue nesting with the variable MomSmoke, which takes two levels, and set split to True to draw half of a violin for each level. This makes it easier to directly compare the distributions for babies with smoking and non-smoking mothers.
# Create violin plot using 'bweight' dataset
ax = sns.violinplot(x="Boy", y="Weight", hue="MomSmoke",
                    data=bweight_df, palette="muted", split=True)
ax.set_xticklabels(["Girl", "Boy"]);
ax.set_title("Weights of newborn babies");
Example 4 (using the plotly library): violin plots of invoice prices for cars depending on their origin.
Hovering the mouse over the plots shows additional information about the data points.
# Import 'plotly' library
import plotly.express as px
# Create violin plot using 'cars' dataset
fig = px.violin(cars_df, y="Invoice", x="Origin", box=True,
                hover_data=cars_df.columns,
                title="Invoice prices for cars depending on their origin")
fig.show()
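The box=True option draws a small box plot inside each violin. As a rough illustration of what such a box summarizes, the sketch below computes the median and quartiles of the five Invoice values from the cars table in Step 4, using Python's standard statistics module. This is an assumption-laden illustration: plotly may use a different quartile method, so its hover values need not match these exactly.

```python
import statistics

# Invoice prices from the first five rows of the 'cars' table
invoices = [33337, 21761, 24647, 30299, 39014]

# Median: the line drawn inside the box
median = statistics.median(invoices)

# Quartiles: the lower edge, middle and upper edge of the box
# (statistics.quantiles uses the 'exclusive' method by default)
q1, q2, q3 = statistics.quantiles(invoices, n=4)

print("median:", median)
print("quartiles:", q1, q2, q3)
```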
Example 5 (using plotly library): violin plots of weights of newborn babies depending on their mothers’ smoking habits.
Here we draw violins on top of each other to compare distributions for babies with smoking and non-smoking mothers.
# Import 'plotly' library
import plotly.express as px
# Create violin plot using 'bweight' dataset
fig = px.violin(bweight_df, y="Weight", color="MomSmoke", box=True,
                violinmode='overlay',  # draw violins on top of each other
                hover_data=bweight_df.columns,
                title="Weight of newborn babies")
fig.show()
Conclusion
We have shown several ways to create violin plots from data stored on a SAS server. SAS9API enables you to access your SAS data and use it with different programming languages as needed, which gives you more flexibility in your analytical work. Here we have used Python, but if you want to learn about using SAS with R through SAS9API, please refer to https://sas9api.io/examples/r-violin-plot/.