PandasAI

Introduction:

Currently, we spend a lot of time editing, cleaning, and analyzing data using various methodologies.

Pandas is a popular Python library for data manipulation.

Pandas keeps data in a structured format known as a DataFrame. DataFrames let us alter, clean up, and analyze data, for example by generating a bar graph, adding a new row or column, or replacing missing values.
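To make these operations concrete, here is a minimal sketch in plain pandas. The sales table and its column names are made up for illustration and are not part of the original article.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing value.
sales = pd.DataFrame({
    "product": ["A", "B", "C"],
    "units": [10, None, 7],
    "price": [2.5, 4.0, 3.0],
})

# Replace the missing value with 0.
sales["units"] = sales["units"].fillna(0)

# Add a new column computed from existing ones.
sales["revenue"] = sales["units"] * sales["price"]

print(sales)
```

A bar graph could then be drawn with sales.plot.bar(x="product", y="revenue"), which requires matplotlib to be installed.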

Performing these operations by hand can take a lot of time.

To reduce this overhead, we can use PandasAI, an extension of the pandas library that aims to make data analysis and manipulation more efficient.

Advantages of PandasAI:

  • It is a useful addition for users of the pandas library.
  • It can run natural-language prompts that resemble SQL queries and produce visualizations directly from a DataFrame.
  • It increases productivity by automating several processes.
  • Even though PandasAI is a capable library, we still need pandas itself, because some operations, such as filling in missing data in a DataFrame, rely on pandas capabilities.
  • PandasAI enhances the functionality of pandas and makes working with data in Python more efficient and convenient.

What is PandasAI?

  • It is an extension of the pandas library that uses OpenAI's generative AI models.
  • It lets users generate insights from a DataFrame using just a text prompt.
  • It relies on text-to-query generation by a generative AI model developed by OpenAI.
  • Data scientists and data analysts spend a lot of time preparing data for analysis, which is a major drawback. They can now draw on many strategies and procedures to reduce the time required for data preparation.
  • Because PandasAI is an addition to pandas, we can pose queries about a dataset to PandasAI and it responds with pandas DataFrames. This saves the time of manually browsing the dataset to answer those queries.
  • By using the OpenAI API through PandasAI, we can converse with the machine to get the desired results rather than programming the task ourselves.

How does PandasAI work?

It uses a generative AI model to understand and interpret natural-language queries and translate them into Python code and SQL queries. It then runs that code against the data and returns the results to the user.
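The generate-then-execute idea can be illustrated with a toy sketch. This is not PandasAI's actual implementation: fake_llm and answer are hypothetical names, and the "model" simply returns a hard-coded line of pandas code.

```python
import pandas as pd


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: returns Python code as text."""
    # A real model would generate this from the prompt; here it is hard-coded.
    return "result = df.nlargest(2, 'gdp')['country'].tolist()"


def answer(df: pd.DataFrame, prompt: str):
    code = fake_llm(prompt)        # 1. natural language -> code
    namespace = {"df": df}
    exec(code, namespace)          # 2. run the generated code on the data
    return namespace["result"]     # 3. hand the result back to the user


df = pd.DataFrame({
    "country": ["France", "Germany", "Italy"],
    "gdp": [2411255037952, 3435817336832, 1745433788416],
})
print(answer(df, "Which two countries have the highest GDP?"))
```

Because generated code is executed directly, a real system of this kind needs safeguards around what that code is allowed to do; the sketch above has none.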

Who can use PandasAI?

PandasAI is designed for data scientists, data analysts, and data engineers who want to interact with their data in a more natural, conversational way.

It is most useful for those who are not familiar with SQL or Python or who want to save time and effort while working with data.

It is also useful for those who are familiar with Python and SQL as it allows them to ask questions about their data without having to write any complex code.

Getting started with PandasAI:

Step 1: Install PandasAI

pip install pandasai

Step 2: After installing PandasAI, we can start using it by importing the SmartDataframe class and instantiating it with the data.

from pandasai import SmartDataframe

Step 3: Load the dataset into a DataFrame using a dictionary.

import pandas as pd

df = pd.DataFrame({
    "country": [
        "United States", "United Kingdom", "France", "Germany", "Italy",
        "Spain", "Canada", "Australia", "Japan", "China",
    ],
    "gdp": [
        19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416,
        1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064,
    ],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})
df.head()
df.shape

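Step 4: Initialize an OpenAI large language model (LLM)

Since PandasAI works with an OpenAI LLM, we need an OpenAI API key. Store the key in an environment variable (for example OPENAI_API_KEY) rather than hard-coding it in the script, and pass it to the LLM as follows:

import os
from pandasai.llm import OpenAI

# Read the key from the environment; never commit a real key to source control.
llm = OpenAI(api_token=os.environ["OPENAI_API_KEY"])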

Step 5: Provide a text prompt and a DataFrame to PandasAI.

We can then use the chat() method to ask the question in natural language.

Wondering what a SmartDataframe is? Read this 👇🏻

A SmartDataframe is a pandas (or polars) DataFrame that inherits all the properties and methods of pd.DataFrame and adds conversational features to it.

Now that we have instantiated the LLM, we can finally instantiate the SmartDataframe.

sdf = SmartDataframe(df, config={"llm": llm})
sdf.chat("Return the top 5 countries by GDP")
sdf.chat("What's the sum of the gdp of the 2 unhappiest countries?")
print(sdf.last_code_generated)
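For comparison, here is one plausible hand-written pandas equivalent of the two prompts above. The code that PandasAI actually generates may differ; it can be inspected through sdf.last_code_generated.

```python
import pandas as pd

df = pd.DataFrame({
    "country": [
        "United States", "United Kingdom", "France", "Germany", "Italy",
        "Spain", "Canada", "Australia", "Japan", "China",
    ],
    "gdp": [
        19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416,
        1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064,
    ],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# "Return the top 5 countries by GDP"
top5 = df.nlargest(5, "gdp")
print(top5["country"].tolist())

# "What's the sum of the gdp of the 2 unhappiest countries?"
unhappiest_gdp_sum = df.nsmallest(2, "happiness_index")["gdp"].sum()
print(unhappiest_gdp_sum)
```

Writing the query by hand is a useful way to sanity-check an answer returned by the LLM.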

You can also use PandasAI to easily plot a chart.

sdf.chat("Plot a chart of the gdp by country")
sdf.chat("Plot a histogram of the gdp by country, using a different color for each bar")
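As a point of reference, a chart like the ones requested above could be drawn by hand roughly as follows. This is a sketch that assumes matplotlib is installed; the output file name is arbitrary, and PandasAI may produce a differently styled plot. With one value per country, the "histogram" in the prompt amounts to a bar chart.

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "country": [
        "United States", "United Kingdom", "France", "Germany", "Italy",
        "Spain", "Canada", "Australia", "Japan", "China",
    ],
    "gdp": [
        19294482071552, 2891615567872, 2411255037952, 3435817336832, 1745433788416,
        1181205135360, 1607402389504, 1490967855104, 4380756541440, 14631844184064,
    ],
})

fig, ax = plt.subplots(figsize=(10, 5))

# "C0" .. "C9" are matplotlib's default colour-cycle entries, one per bar.
colors = [f"C{i}" for i in range(len(df))]
bars = ax.bar(df["country"], df["gdp"], color=colors)

ax.set_xlabel("Country")
ax.set_ylabel("GDP")
ax.set_title("GDP by country")
ax.tick_params(axis="x", labelrotation=45)
fig.tight_layout()
fig.savefig("gdp_by_country.png")  # arbitrary file name
```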

What if we want to work with multiple DataFrames at a time?

If we want to work with multiple DataFrames at a time, we can use SmartDatalake instead of SmartDataframe.

Let us understand what a SmartDatalake is.

The concept is very similar to the SmartDataframe, but instead of accepting only one DataFrame as input, it can accept several.

Syntax 👇🏻

from pandasai import SmartDatalake

Below we combine two different DataFrames into a SmartDatalake.

employees_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Name": ["John", "Emma", "Liam", "Olivia", "William"],
        "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
    }
)
salaries_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Salary": [5000, 6000, 4500, 7000, 5500],
    }
)
lake = SmartDatalake(
    [employees_df, salaries_df],
    config={"llm": llm}
)
lake.chat("Who gets paid the most?")

Examples:
import os

import pandas as pd
from pandasai import SmartDataframe
from pandasai.llm import OpenAI

df = pd.DataFrame({
    "country": [
        "United States",
        "United Kingdom",
        "France",
        "Germany",
        "Italy",
        "Spain",
        "Canada",
        "Australia",
        "Japan",
        "China",
    ],
    "gdp": [
        19294482071552,
        2891615567872,
        2411255037952,
        3435817336832,
        1745433788416,
        1181205135360,
        1607402389504,
        1490967855104,
        4380756541440,
        14631844184064,
    ],
    "happiness_index": [6.94, 7.16, 6.66, 7.07, 6.38, 6.4, 7.23, 7.22, 5.87, 5.12],
})

# Read the API key from the environment instead of hard-coding it.
llm = OpenAI(api_token=os.environ["OPENAI_API_KEY"])
sdf = SmartDataframe(df, config={"llm": llm})

sdf.chat("Return the top 5 countries by GDP")

[Figure: Top 5 countries sorted according to GDP]

sdf.chat("What's the sum of the gdp of the 2 unhappiest countries?")

[Figure: The sum of the GDP of the 2 unhappiest countries]

You can also use PandasAI to easily plot a chart:

sdf.chat("Plot a chart of the gdp by country")

[Figure: A chart of the GDP by country]

sdf.chat("Plot a histogram of the gdp by country, using a different color for each bar")

[Figure: A histogram of the GDP by country, using a different color for each bar]

Sometimes you might want to work with multiple DataFrames at a time, letting the LLM decide which one(s) to use to answer your queries. In such cases, use a SmartDatalake instead of a SmartDataframe.

The concept is very similar to the SmartDataframe, but instead of accepting only one DataFrame as input, it can accept several.

from pandasai import SmartDatalake

In this example, we have two DataFrames. The first one reports an employee ID, a name, and a department for each employee. The second one provides the employee ID and the salary for each employee.

When asked, PandasAI joins the two DataFrames by employee ID and works out the name of the employee who is paid the most.

employees_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Name": ["John", "Emma", "Liam", "Olivia", "William"],
        "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
    }
)
salaries_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Salary": [5000, 6000, 4500, 7000, 5500],
    }
)
lake = SmartDatalake(
    [employees_df, salaries_df],
    config={"llm": llm}
)

lake.chat("Who gets paid the most?")

The employee who gets paid the most is Olivia.
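The join that PandasAI is described as performing can be written by hand as a check. This is a sketch in plain pandas; the variable names merged and top_earner are illustrative, and the code PandasAI generates may look different.

```python
import pandas as pd

employees_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Name": ["John", "Emma", "Liam", "Olivia", "William"],
        "Department": ["HR", "Sales", "IT", "Marketing", "Finance"],
    }
)
salaries_df = pd.DataFrame(
    {
        "EmployeeID": [1, 2, 3, 4, 5],
        "Salary": [5000, 6000, 4500, 7000, 5500],
    }
)

# Join the two tables on the shared key, then pick the row with the top salary.
merged = employees_df.merge(salaries_df, on="EmployeeID")
top_earner = merged.loc[merged["Salary"].idxmax(), "Name"]
print(top_earner)
```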

users_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["John", "Emma", "Liam", "Olivia", "William"]
    }
)

users = SmartDataframe(users_df, name="users")

photos_df = pd.DataFrame(
    {
        "id": [31, 32, 33, 34, 35],
        "user_id": [1, 1, 2, 4, 5]
    }
)

photos = SmartDataframe(photos_df, name="photos")

lake = SmartDatalake([users, photos], config={"llm": llm})
lake.chat("How many photos has been uploaded by John?")
>> 2
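The same count can be reproduced in plain pandas. This is a sketch for checking the answer, not the code PandasAI generates; joined and photo_count are illustrative names.

```python
import pandas as pd

users_df = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["John", "Emma", "Liam", "Olivia", "William"],
    }
)
photos_df = pd.DataFrame(
    {
        "id": [31, 32, 33, 34, 35],
        "user_id": [1, 1, 2, 4, 5],
    }
)

# Match each photo to its uploader; both tables have an "id" column,
# so suffixes keep the two apart after the merge.
joined = photos_df.merge(
    users_df, left_on="user_id", right_on="id", suffixes=("_photo", "_user")
)
photo_count = int((joined["name"] == "John").sum())
print(photo_count)
```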

After reading this, you may want to explore PandasAI in more depth. PandasAI is a relatively new Python library, and a data science course is one place to start learning it.
