DEV Community

Vlythr

Creating Statistical Graphics with Python - A Beginner-Friendly Guide

Welcome to a beginner-friendly tutorial on creating statistical graphics for your research using Python, Excel, Seaborn, and Pandas. Whether you're a scientist, researcher, or student, visualizing data is a crucial step in understanding and communicating your findings. In this tutorial, we'll walk you through the process in an easy and straightforward manner, even if you have no prior experience in programming.

We'll enter and organize our data in Excel, then load it into Python, a versatile programming language. Pandas, a powerful data analysis library, will read and manipulate the data, and Seaborn, a plotting library built on top of Matplotlib, will help us visualize it with beautiful and insightful plots.

Note:
If you are an absolute beginner, you might be wondering what "plots" are. In the context of this tutorial, a plot is a visual representation of data. We'll be creating simple and informative charts, like scatter plots, which help us see patterns and relationships in our data. Don't worry if these terms are new; we'll guide you through each step.

By the end of this tutorial, you should be able to create impactful statistical graphics. So here's one way you can do it:

Getting Started

Step 1: Set Up Your Environment

Before diving into creating statistical graphics, let's make sure you have the necessary tools set up.

1. Code Editor:
A code editor is where you'll write and run your Python code. If you don't have one installed, you can choose from various options such as Visual Studio Code, PyCharm, or Jupyter Notebooks.

  • Install Visual Studio Code:
    Visit the Visual Studio Code website and follow the installation instructions for your operating system.

  • Install PyCharm:
    Visit the PyCharm website and download the community version for free.

  • Install Jupyter Notebooks:
    If you prefer a notebook-based environment, install Jupyter Notebooks using:

     pip install notebook
    

2. Python:
Python is the programming language we'll use. If you haven't installed Python yet, follow these steps:

  • Download Python: Visit the Python downloads page and download the latest version. On Windows, make sure to check the installer box that says "Add Python to PATH" during installation.

3. Seaborn and Pandas:
Seaborn and Pandas are Python libraries that will help us with data manipulation and visualization.

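  • Install Seaborn and Pandas:
    Open your command prompt or terminal and run the following commands. Installing Seaborn also installs Matplotlib, and openpyxl is the package Pandas uses to read .xlsx files:

     pip install seaborn
     pip install pandas
     pip install openpyxl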
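
If you want to confirm that the installation worked before moving on, a small check like the one below can help. It is only a sketch: the `REQUIRED` list and the `installed_versions` helper are names made up for this example, and openpyxl is included on the assumption that you will read .xlsx files later in the tutorial.

```python
from importlib import metadata

# Packages this tutorial relies on; openpyxl is what pandas uses to read .xlsx files.
REQUIRED = ["pandas", "seaborn", "matplotlib", "openpyxl"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


report = installed_versions(REQUIRED)
for name, version in report.items():
    status = version if version is not None else f"NOT INSTALLED (run: pip install {name})"
    print(f"{name}: {status}")
```

Any package reported as not installed can be added with the pip command shown next to it.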
    

Now that you've set up your environment, let's move on to working with actual data!

Step 2: Create and Import Sample Data

Create a Sample Spreadsheet

Now that your environment is set up, let's create a sample spreadsheet that we can later import using Pandas. We'll make a simple Excel file with four columns (A to D): a header row that holds the column names, followed by four rows of data. For simplicity, we'll name the columns "Column A," "Column B," and so on.

  1. Sample Spreadsheet:
    • Open Excel or your preferred spreadsheet software.
    • In row 1, type the headers "Column A," "Column B," "Column C," and "Column D" into cells A1 to D1.
    • In rows 2 to 5, enter your values under each header. The example below uses 1234 as a placeholder.

Example Spreadsheet:

          A          B          C          D
   1  Column A  Column B  Column C  Column D
   2    1234      1234      1234      1234
   3    1234      1234      1234      1234
   4    1234      1234      1234      1234
   5    1234      1234      1234      1234

  2. Save the Spreadsheet:
    • Save this Excel file in a location where you can easily access it. You might want to create a new folder for your project, and within it, save the file as "sample_data.xlsx."

Note for Visual Studio Code Users:

  • Save the Excel file in the same directory as your Python script. If you've created a new folder for your project, save it there.
  3. Code to Import and Display Data: Now, let's write the Python code to import this data using Pandas and display all the data points.
   import pandas as pd
   import seaborn as sns
   import matplotlib.pyplot as plt

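   # Load data from Excel (the first row becomes the column names)
   df = pd.read_excel('sample_data.xlsx')

   # Display all data points
   print("Sample Data:")
   print(df)

   # To check that plotting works, draw Seaborn's built-in fmri example dataset
   # (load_dataset downloads it on first use, so this needs an internet connection)
   sns.lineplot(data=sns.load_dataset('fmri'), x='timepoint', y='signal')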
   plt.title('Seaborn fmri Dataset Example')
   plt.xlabel('Timepoint')
   plt.ylabel('Signal')
   plt.show()

With "sample_data.xlsx" in the correct location, running this code prints the contents of your spreadsheet and then draws a line plot of Seaborn's built-in fmri example dataset. In the next step, we'll delve into customizing and visualizing this data in more detail.
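
If you would rather not build the spreadsheet by hand, the file can also be generated from Python. The sketch below assumes openpyxl is installed (Pandas needs it to write .xlsx files) and reuses the tutorial's 1234 placeholder values; the variable names are only for illustration.

```python
import pandas as pd

# Same layout as the example spreadsheet: four named columns, four rows of
# placeholder values (1234 is the tutorial's placeholder, not real data).
columns = ["Column A", "Column B", "Column C", "Column D"]
sample = pd.DataFrame({name: [1234] * 4 for name in columns})

try:
    # index=False keeps pandas' row numbers out of the file,
    # so cell A1 holds "Column A" just like the hand-made sheet
    sample.to_excel("sample_data.xlsx", index=False)
    print("Wrote sample_data.xlsx")
except ImportError:
    # pandas raises ImportError when the openpyxl package is not available
    print("openpyxl is missing; install it with: pip install openpyxl")

print(sample)
```

Run it from your project folder so the file lands next to your script.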

Step 3: Customizing Your Data Visualization

Now that you've successfully imported data, let's explore how to customize your spreadsheet and the corresponding Python code for a more tailored data visualization.

1. Customize the Spreadsheet:

  • Replace Sample Data:
    Replace the numeric data in your "sample_data.xlsx" file with your own dataset. Simply overwrite the numbers while keeping the same structure.

  • Add Hues and Style:
    Seaborn's hue and style options group data by the values stored in a column, so your categories need to be typed into cells as text. Cell colors and other Excel formatting are not read by Pandas. In the example below, Column A holds a category label for each row.

Example Spreadsheet:

          A          B          C          D
   1  Column A  Column B  Column C  Column D
   2     Red       1234      1234      1234
   3     Blue      1234      1234      1234
   4    Green      1234      1234      1234
   5    Yellow     1234      1234      1234
  • Save the Modified Spreadsheet: Save your modified Excel file, ensuring it's still named "sample_data.xlsx."
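
Once the file is loaded, it can be useful to check which columns Pandas treats as text and which as numbers, since that decides what you can use for hue/style versus x/y. The sketch below uses an in-memory table that mirrors the example spreadsheet instead of reading the file, so it runs on its own; the variable names are assumptions made for this example.

```python
import pandas as pd

# In-memory stand-in for the example spreadsheet above; in practice this would
# come from pd.read_excel('sample_data.xlsx').
df = pd.DataFrame({
    "Column A": ["Red", "Blue", "Green", "Yellow"],
    "Column B": [1234, 1234, 1234, 1234],
    "Column C": [1234, 1234, 1234, 1234],
    "Column D": [1234, 1234, 1234, 1234],
})

# Text columns are candidates for hue/style; numeric columns for x/y.
label_columns = df.select_dtypes(exclude="number").columns.tolist()
numeric_columns = df.select_dtypes(include="number").columns.tolist()

print("Category columns:", label_columns)
print("Numeric columns:", numeric_columns)
```

If a column you expected to be numeric shows up among the category columns, look for stray text or units typed into its cells.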

2. Code Modifications:

Update the Python code to reflect the changes you made in the spreadsheet. Below is an example of how you can modify the code:

   import pandas as pd
   import seaborn as sns
   import matplotlib.pyplot as plt

   # Load your modified data from Excel
   df = pd.read_excel('sample_data.xlsx')

   # Display all data points
   print("Modified Data:")
   print(df)

   # Demonstrate hue, style, markers, and palette on Seaborn's fmri dataset
   sns.lineplot(data=sns.load_dataset('fmri'), x='timepoint', y='signal', hue='region', style='event', markers=True, palette='Set1')
   plt.title('Customized Seaborn fmri Dataset')
   plt.xlabel('Timepoint')
   plt.ylabel('Signal')
   plt.legend(title='Legend')
   plt.show()
  • Explanation:
    • hue: Use this parameter to distinguish data points based on a category (e.g., color by 'region').
    • style: Style data points based on another category (e.g., differentiate by 'event').
    • markers: Display markers for each data point.
    • palette: Choose a color palette (e.g., 'Set1').

Customize these parameters based on your spreadsheet structure and preferences. This flexibility allows you to visualize your data in a way that best communicates your findings. In the next section, we'll delve into more advanced customization options.
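
To see the same parameters applied to spreadsheet-style data rather than the fmri dataset, here is a minimal sketch. The values are made up purely for illustration, and it assumes a layout like the example spreadsheet, with category labels in 'Column A' and numbers in the other columns.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Made-up illustrative values; replace with pd.read_excel('sample_data.xlsx').
df = pd.DataFrame({
    "Column A": ["Red", "Red", "Blue", "Blue", "Green", "Green"],
    "Column B": [1.0, 2.0, 1.5, 2.5, 3.0, 3.5],
    "Column C": [10, 14, 9, 12, 15, 18],
})

# Color and mark each point by its category label in 'Column A'
ax = sns.scatterplot(
    data=df, x="Column B", y="Column C",
    hue="Column A", style="Column A", palette="Set1",
)
ax.set_title("Points grouped by Column A")
```

Add plt.show() as the last line to display the figure, as in the other examples.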

Step 4: Exploring Seaborn's Graphic Types and Advanced Possibilities

Now that you have a foundational understanding, let's explore how to switch between different graphic types in Seaborn and delve into more advanced customization options.

1. Switching Between Graphic Types:

Seaborn offers various plot types to suit different data visualization needs. Let's modify our code to switch between a few common plot types: scatter plot, bar plot, and box plot.

   import pandas as pd
   import seaborn as sns
   import matplotlib.pyplot as plt

   # Load your modified data from Excel
   # (no index_col, so 'Column A' stays a regular column the plots can use)
   df = pd.read_excel('sample_data.xlsx')

   # Display all data points
   print("Modified Data:")
   print(df)

   # Scatter Plot
   sns.scatterplot(data=df, x='Column A', y='Column B', hue='Column C', style='Column D', markers=True, palette='viridis')
   plt.title('Scatter Plot')
   plt.xlabel('Column A')
   plt.ylabel('Column B')
   plt.legend(title='Column C')
   plt.show()

   # Bar Plot
   sns.barplot(data=df, x='Column A', y='Column B', hue='Column C', palette='muted')
   plt.title('Bar Plot')
   plt.xlabel('Column A')
   plt.ylabel('Column B')
   plt.legend(title='Column C')
   plt.show()

   # Box Plot
   sns.boxplot(data=df, x='Column A', y='Column B', hue='Column C', palette='pastel')
   plt.title('Box Plot')
   plt.xlabel('Column A')
   plt.ylabel('Column B')
   plt.legend(title='Column C')
   plt.show()
  • Explanation:
    • Replace 'Column A', 'Column B', 'Column C', and 'Column D' with your actual column names. hue and style work best with columns that hold a small number of categories.
    • Experiment with different parameters to customize each plot.
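
One way to switch between plot types without copying the whole block is to look the plotting function up by name. This is just a sketch: `PLOT_FUNCTIONS` and `draw` are hypothetical names for this example, and the data values are made up.

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Map a short name to the Seaborn function that draws that kind of plot
PLOT_FUNCTIONS = {
    "scatter": sns.scatterplot,
    "bar": sns.barplot,
    "box": sns.boxplot,
}


def draw(kind, data, x, y, **options):
    """Draw the requested kind of plot on a new figure and return its Axes."""
    if kind not in PLOT_FUNCTIONS:
        raise ValueError(f"Unknown plot type {kind!r}; choose from {sorted(PLOT_FUNCTIONS)}")
    fig, ax = plt.subplots()
    PLOT_FUNCTIONS[kind](data=data, x=x, y=y, ax=ax, **options)
    ax.set_title(f"{kind.capitalize()} Plot")
    return ax


# Made-up illustrative values standing in for your spreadsheet
df = pd.DataFrame({
    "Column A": ["Red", "Red", "Blue", "Blue", "Green", "Green"],
    "Column B": [1.0, 2.0, 1.5, 2.5, 3.0, 3.5],
})

# Draw one figure per plot type
axes = {kind: draw(kind, df, x="Column A", y="Column B") for kind in PLOT_FUNCTIONS}
```

Extra keyword arguments such as hue or palette are passed straight through to the Seaborn function, and plt.show() at the end displays all three figures.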

2. More Advanced Possibilities:

  • FacetGrid:
    You can use FacetGrid to create a grid of subplots based on the values of one or more variables. This is especially useful when you have additional categorical variables to explore.

     g = sns.FacetGrid(df, col='Column C', hue='Column D', palette='Set1', height=4)
     g.map(sns.scatterplot, 'Column A', 'Column B', markers=True)
     g.add_legend(title='Column D')
     plt.suptitle('FacetGrid Scatter Plot')
     plt.show()
    
  • Pair Plot:
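    Visualize pairwise relationships between numerical variables in your dataset. A list of markers must contain one entry per category in the hue column.

     # Three markers assume 'Column C' holds exactly three categories
     sns.pairplot(df, hue='Column C', palette='Set2', markers=["o", "s", "D"])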
     plt.suptitle('Pair Plot')
     plt.show()
    

These advanced possibilities provide more insights into your data by visualizing relationships and distributions in different ways. Experiment with these examples and adapt them to your specific dataset and research questions. In the final section, we'll discuss how to save your visualizations for presentations or publications.

Step 5: Saving Your Visualizations

Now that you've created compelling visualizations, it's time to save them for presentations or publications. Because Seaborn draws its plots with Matplotlib, you can export them in various formats with Matplotlib's savefig function.

1. Save Plots in Seaborn:

After creating a plot, you can use plt.savefig() to save it in different formats, such as PNG, PDF, or SVG. Call it before plt.show(), because the figure may be empty once the plot window has been closed.

   import pandas as pd
   import seaborn as sns
   import matplotlib.pyplot as plt

   # Load your modified data from Excel
   # (no index_col, so 'Column A' stays a regular column the plots can use)
   df = pd.read_excel('sample_data.xlsx')

   # Display all data points
   print("Modified Data:")
   print(df)

   # Example: Scatter Plot
   sns.scatterplot(data=df, x='Column A', y='Column B', hue='Column C', style='Column D', markers=True, palette='viridis')
   plt.title('Scatter Plot')
   plt.xlabel('Column A')
   plt.ylabel('Column B')
   plt.legend(title='Column C')

   # Save the plot
   plt.savefig('scatter_plot.png')
   plt.show()
  • Replace 'Column A', 'Column B', 'Column C', and 'Column D' with your actual column names.
  • Adjust the file name and format in plt.savefig('scatter_plot.png') as needed.
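
For publications you will often want a higher resolution and more than one file format. The sketch below shows one possible approach; `save_figure` is a hypothetical helper written for this example, and the simple line plot stands in for any Seaborn figure.

```python
from pathlib import Path
import matplotlib.pyplot as plt


def save_figure(fig, stem, formats=("png", "pdf", "svg"), dpi=300):
    """Save the figure once per format and return the written paths."""
    paths = []
    for fmt in formats:
        path = Path(f"{stem}.{fmt}")
        # dpi mainly matters for raster formats such as PNG;
        # bbox_inches='tight' trims extra whitespace around the plot
        fig.savefig(path, dpi=dpi, bbox_inches="tight")
        paths.append(path)
    return paths


fig, ax = plt.subplots()
ax.plot([1, 2, 3], [2, 4, 3], marker="o")  # stand-in for a Seaborn plot
ax.set_title("Example figure")

saved = save_figure(fig, "scatter_plot")
print([str(p) for p in saved])
```

Matplotlib picks the output format from the file extension, so the same call works for all three files.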

2. Full Code Block and Data Spreadsheet:

Here's the complete code block for the sample, assuming you have the 'sample_data.xlsx' spreadsheet with your modified data:

   import pandas as pd
   import seaborn as sns
   import matplotlib.pyplot as plt

   # Load your modified data from Excel
   # (no index_col, so 'Column A' stays a regular column the plots can use)
   df = pd.read_excel('sample_data.xlsx')

   # Display all data points
   print("Modified Data:")
   print(df)

   # Scatter Plot
   sns.scatterplot(data=df, x='Column A', y='Column B', hue='Column C', style='Column D', markers=True, palette='viridis')
   plt.title('Scatter Plot')
   plt.xlabel('Column A')
   plt.ylabel('Column B')
   plt.legend(title='Column C')

   # Save the plot
   plt.savefig('scatter_plot.png')
   plt.show()
  • Ensure 'sample_data.xlsx' is in the same directory as your Python script.
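
A relative file name like 'sample_data.xlsx' is looked up in the current working directory, which is usually, but not always, the folder that holds your script. If you run into a "file not found" error, a check along these lines can make the problem easier to spot. `find_data_file` is a hypothetical helper for this example.

```python
from pathlib import Path


def find_data_file(filename="sample_data.xlsx", folder=None):
    """Return the full path to the file, or raise a readable error."""
    # Default to the current working directory, where relative paths are resolved
    folder = Path(folder) if folder is not None else Path.cwd()
    path = folder / filename
    if not path.is_file():
        raise FileNotFoundError(
            f"Could not find {filename} in {folder}. "
            "Move the file there or pass the folder that contains it."
        )
    return path


try:
    print("Using", find_data_file())
except FileNotFoundError as error:
    print(error)
```

The returned path can be passed directly to pd.read_excel().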

That's it! By following these steps, you can create, customize, and save your visualizations with ease. Feel free to experiment with different Seaborn functions and parameters to discover new ways to showcase your data.
