Data analysis has become the cornerstone of decision-making across industries, driving organizations to explore innovative approaches that uncover actionable insights from complex datasets. In this pursuit, Python’s renowned library, Pandas, has been a steadfast companion for data professionals.
Pandas AI is a groundbreaking Python library designed to revolutionize the landscape of data analysis by seamlessly integrating generative artificial intelligence capabilities into the trusted Pandas framework. With Pandas AI, data enthusiasts, analysts, and scientists can supercharge their analytical workflows, harnessing the prowess of AI algorithms to unlock hidden patterns, automate repetitive tasks, and extract invaluable insights from massive datasets.
In this article, we embark on an exhilarating journey into the world of Pandas AI. Let’s dive into the realm of artificial intelligence-driven data analysis and discover how Pandas AI is reshaping the way we derive insights and make informed decisions in a data-driven world.
Getting Started
- What is Pandas AI
- Why Pandas AI
- How to use Pandas AI
- Chat with Olympic dataset using Pandas AI
What is Pandas AI
PandasAI is an extension of the Pandas library in Python, enhancing its functionality by integrating generative artificial intelligence capabilities. It serves as a complementary tool to Pandas, rather than a replacement. By leveraging PandasAI, users can interact with Pandas data frames in a more intuitive and human-like manner, enabling effective data summarization. This library allows users to pose questions about their data and receive answers in the form of Pandas DataFrames, making data analysis more conversational and accessible.
Why Pandas AI
Pandas AI plays a crucial role in improving the efficiency and effectiveness of data analysis workflows. It enables analysts to handle large datasets effortlessly, automate repetitive tasks, and extract valuable insights from complex data. By leveraging the capabilities of Pandas AI, data professionals can save time, enhance data quality, and make more accurate predictions, ultimately leading to improved decision-making and business outcomes.
Pandas AI is important for several reasons:
- Efficient Data Exploration: Pandas AI automates data profiling and visualization tasks, enabling quick comprehension of variable characteristics and distributions.
- Enhanced Data Cleaning: Pandas AI intelligently handles null values, outliers, and inconsistent formats, ensuring data integrity and improving analysis results.
- Automated Feature Engineering: Pandas AI automates the generation of new variables capturing complex relationships and patterns, saving time and improving the accuracy of data-driven solutions.
- Seamless Machine Learning Integration: Pandas AI seamlessly integrates with popular machine learning frameworks, simplifying feature selection and enabling data visualization and insights extraction.
- Improved Decision Making: Pandas AI empowers data professionals to make informed decisions by enabling efficient exploration, automated cleaning, and accurate predictive modeling.
How to use Pandas AI
In the following section we will try to use Pandas AI for data analysis.
Installing dependencies
- Create and activate a virtual environment by running the following commands.
python -m venv venv
source venv/bin/activate # Linux/macOS
venv\Scripts\activate # Windows
- Install the pandasai library using pip. Note that pandasai requires Python 3.9 or later.
pip install pandasai
Setting up environment variables
An OpenAI API key is required to use Pandas AI. Follow these steps to create one.
- Open platform.openai.com.
- Click your name or icon in the top-right corner of the page and select “API Keys”, or open the “Account API Keys” page of the OpenAI API site directly.
- Click the “Create new secret key” button to generate a new key.
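Once created, the key can be kept out of the source file. The following is a minimal sketch, assuming the key has been exported as an environment variable named OPENAI_API_KEY. Both that variable name and the helper load_openai_key are chosen here for illustration and are not part of pandasai. The returned string could then be passed as api_token wherever this article writes 'YOUR_API_TOKEN'.

```python
import os


def load_openai_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch; use whatever
    name you exported the key under.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the script."
        )
    return key


if __name__ == "__main__":
    # Never print the key itself; only confirm whether it was found.
    try:
        load_openai_key()
        print("API key found.")
    except RuntimeError as err:
        print(err)
```

Reading the key at runtime means app.py can be shared or committed without exposing the secret.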
Importing the libraries
Create a file named app.py and add the following code to it.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
Initializing Pandas AI
To initialize an instance of Pandas AI backed by the OpenAI LLM, use the following code.
llm = OpenAI(api_token='YOUR_API_TOKEN')
pandas_ai = PandasAI(llm)
Running Pandas AI
Pandas AI takes a data frame and a prompt as arguments. In this section, we use a small dataset with three columns (country, gdp and happiness_index) and ask Pandas AI to find the five happiest countries.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_API_TOKEN")
pandas_ai = PandasAI(llm)
# Sample DataFrame
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
result = pandas_ai(df, prompt='Which are the 5 happiest countries?')
print(result)
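Because the answer comes from a language model, it is worth checking it against plain pandas. The sketch below uses the same sample data and no Pandas AI calls; nlargest is one of several ways to get the top rows, and the variable name top5 is just illustrative.

```python
import pandas as pd

# Same sample DataFrame as in the Pandas AI example
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Five rows with the largest happiness_index, highest first
top5 = df.nlargest(5, "happiness_index")[["country", "happiness_index"]]
print(top5)
```

For this data the five countries are Canada, Australia, United Kingdom, Germany and United States, which is what the Pandas AI answer should agree with.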
Data Visualization using Pandas AI
Pandas AI can also create data visualizations. Describe the chart you want in the prompt, and it supports a variety of graph types for visualizing the data.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_API_TOKEN")
pandas_ai = PandasAI(llm)
# Sample DataFrame
df = pd.DataFrame({
"country": ["United States", "United Kingdom", "France", "Germany", "Italy", "Spain", "Canada", "Australia", "Japan", "China"],
"gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416, 1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064],
"happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12]
})
result = pandas_ai(df, prompt='Plot the histogram of countries showing for each the gdp, using different colors for each bar')
print(result)
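For comparison, here is roughly what the prompt asks for when written by hand. This is a sketch using matplotlib directly, not the code Pandas AI generates. Although the prompt says "histogram", what it describes is one bar per country, so a bar chart is drawn. The output file name gdp_by_country.png and the choice of the tab10 palette are assumptions made for this example.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Same sample DataFrame as in the Pandas AI example
df = pd.DataFrame({
    "country": ["United States", "United Kingdom", "France", "Germany", "Italy",
                "Spain", "Canada", "Australia", "Japan", "China"],
    "gdp": [19294482071552, 2891615567872, 2411255037952, 3435817336832,
            1745433788416, 1181205135360, 1607402389504, 1490967855104,
            4380756541440, 14631844184064],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# One distinct color per bar, taken from a 10-color qualitative palette
colors = plt.get_cmap("tab10").colors[: len(df)]

fig, ax = plt.subplots(figsize=(10, 5))
ax.bar(df["country"], df["gdp"] / 1e12, color=colors)  # scale values to trillions
ax.set_ylabel("GDP (trillions)")
ax.set_title("GDP by country")
ax.tick_params(axis="x", rotation=45)
fig.tight_layout()
fig.savefig("gdp_by_country.png")
```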
Chat with Olympic dataset using Pandas AI
In this section, we will be using the Olympics dataset that contains 120 years of athlete information. The dataset has information about the athlete, their nationality, gender, age, year they participated, the sports played, if they won a medal, and the Olympics event they participated in. You can download the dataset from Kaggle using the following link.
120 years of Olympic history: athletes and results | www.kaggle.com
Importing libraries
Import the necessary Pandas AI libraries using the following code.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
Initialize Pandas AI instance with OpenAI environment
Pass your API token to the OpenAI LLM and initialize an instance of Pandas AI using the following code.
llm = OpenAI(api_token="YOUR_API_TOKEN")
pandas_ai = PandasAI(llm)
Loading the dataset
Load the dataset using pandas and read the first few records using the head() method.
df = pd.read_csv("athlete_events.csv")
print(df.head())
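Before asking questions, it helps to know which columns contain missing values. The sketch below reads three made-up rows, not real results, from an in-memory string so that it runs without the download. The column layout is assumed to match athlete_events.csv, where an athlete who won no medal has NA in the Medal column.

```python
import io

import pandas as pd

# Made-up rows in the assumed column layout of athlete_events.csv (not real data)
SAMPLE_CSV = """ID,Name,Sex,Age,Height,Weight,Team,NOC,Games,Year,Season,City,Sport,Event,Medal
1,Athlete A,M,24,180,80,India,IND,2012 Summer,2012,Summer,London,Hockey,Hockey Men's Hockey,NA
2,Athlete B,F,27,165,58,India,IND,2016 Summer,2016,Summer,Rio de Janeiro,Badminton,Badminton Women's Singles,Silver
3,Athlete C,M,30,NA,NA,Canada,CAN,2014 Winter,2014,Winter,Sochi,Curling,Curling Men's Curling,Gold
"""

sample = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Column types as inferred by pandas
print(sample.dtypes)

# Number of missing values per column, showing only columns that have any
missing = sample.isna().sum()
print(missing[missing > 0])
```

pandas reads NA as a missing value, so methods such as count() skip those rows. That is why counting the Medal column gives the number of medals rather than the number of entries.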
Working with Pandas AI
In this section, we will interact with the dataset in a natural way using pandas AI.
- To find out which athlete appeared in the most Olympic years using Pandas AI, and to verify the result with data frame group and sort operations, use the code below.
import pandas as pd
from pandasai import PandasAI
from pandasai.llm.openai import OpenAI
llm = OpenAI(api_token="YOUR_API_TOKEN")
pandas_ai = PandasAI(llm)
df = pd.read_csv("athlete_events.csv")
# print(df.head())
print(pandas_ai.run(df, prompt='Which athlete appeared in the most olympics years and how many'))
print(df.groupby(by='Name')['Year'].nunique().sort_values(ascending=False).head(1))
- To find out which athlete participated in the largest number of events over their career using Pandas AI, use the code below.
print(pandas_ai.run(df, prompt='Which athlete has participated in the most number of events and how many'))
- To find out which country has won the most Olympic medals using Pandas AI, and to verify the result with group and sort operations, use the code below.
print(pandas_ai.run(df, prompt='Which country has won the highest number of medals and how many'))
print(df.groupby(by='NOC')['Medal'].count().sort_values(ascending=False).reset_index().head(1))
- To find the total number of medals won by India in the Olympic Games over the years and visualize it as a bar plot, use the code below.
print(pandas_ai.run(df, prompt='Plot a barplot with the total number of medals won by participants from IND'))
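The events and India prompts above have no pandas cross-check, so here is one possible sketch. It runs on a handful of hypothetical rows, not real Olympic results, that use the same Name, NOC, Event and Medal column names. To apply it to the real data, replace sample with the df loaded from athlete_events.csv.

```python
import pandas as pd

# Hypothetical rows for illustration only, NOT real Olympic results
sample = pd.DataFrame({
    "Name":  ["Athlete A", "Athlete A", "Athlete A", "Athlete B", "Athlete B", "Athlete C"],
    "NOC":   ["IND", "IND", "IND", "USA", "USA", "IND"],
    "Event": ["Event 1", "Event 2", "Event 3", "Event 1", "Event 1", "Event 2"],
    "Medal": ["Gold", None, "Bronze", "Silver", None, "Gold"],
})

# Athlete with the most distinct events (use .size() instead to count every entry)
events_per_athlete = (
    sample.groupby("Name")["Event"].nunique().sort_values(ascending=False)
)
print(events_per_athlete.head(1))

# Medals won by participants from IND, broken down by medal type;
# value_counts() ignores rows where Medal is missing
india_medals = sample.loc[sample["NOC"] == "IND", "Medal"].value_counts()
print(india_medals)
print("Total medals for IND:", int(india_medals.sum()))
```

Whether "most events" means distinct events or total entries is a choice the prompt leaves open, so the Pandas AI answer may differ from one of these counts while still being reasonable.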
Pandas AI brings a conversational interface to data analysis. It changes the way users interact with data, making tasks simpler and data analysis more accessible.
Thanks for reading this article.
Thanks Gowri M Bhatt for reviewing the content.
The full source code for this tutorial can be found here,
GitHub - codemaker2015/pandasai-demos: PandasAI is a Python
The article is also available on Medium.