Basics of pandas in Python
Learn the main pandas data structures and basic operations in Python for college/university exams!
1. Data Structures:
Series:
A Series is a one-dimensional labeled array that can hold any data type. It's like a column in a spreadsheet. In this example, we create a Series with numeric values, including a NaN (Not a Number) value.
Code:
import numpy as np  # needed for np.nan
import pandas as pd

# Creating a Series
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Displaying the Series
print(s)
Output:
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
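Because a Series carries an index, values can also be looked up by label, and missing entries can be detected with `isna()`. The following is a minimal sketch; the labels 'a', 'b', 'c' and the variable names are made up for illustration and are not part of the example above.

```python
import numpy as np
import pandas as pd

# Hypothetical labels, chosen only for this illustration
s_labeled = pd.Series([1.0, np.nan, 5.0], index=['a', 'b', 'c'])

# Look up a single value by its index label
first_value = s_labeled['a']

# Boolean mask that is True where a value is missing
missing_mask = s_labeled.isna()

print(first_value)
print(missing_mask)
```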
DataFrame:
A DataFrame is a two-dimensional table with labeled rows and columns. In this example, we create a DataFrame with various data types and structures, including a timestamp, categorical data, and a constant value.
Code:
import pandas as pd
import numpy as np

# Creating a DataFrame
df = pd.DataFrame({
    'A': 1.0,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})

# Displaying the DataFrame
print(df)
Output:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
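To check which data type each column ended up with, the `dtypes` and `shape` attributes can be inspected. This is a small sketch that rebuilds the same DataFrame so it runs on its own; the exact dtype printed for the timestamp and string columns may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': 1.0,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo',
})

# One dtype per column
print(df.dtypes)

# (number of rows, number of columns)
print(df.shape)
```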
2. Basic Operations:
Head and Tail:
The head() method displays the first n rows of a DataFrame, and tail() does the same for the last n rows. In this case, we print the first 2 rows.
Code:
# Displaying the first 2 rows
print(df.head(2))
Output:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
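The heading also mentions the tail: `tail(n)` is the counterpart of `head(n)` and returns the last n rows. Here is a minimal sketch on a small made-up frame (`df_demo` and its column 'x' are illustrative names) so that the rows are easy to tell apart.

```python
import pandas as pd

# Small hypothetical frame with distinguishable rows
df_demo = pd.DataFrame({'x': [10, 20, 30, 40, 50]})

# Last 2 rows; the original index labels (3 and 4) are kept
last_rows = df_demo.tail(2)
print(last_rows)
```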
Descriptive Statistics:
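The describe() method provides summary statistics for numeric columns: count, mean, std (standard deviation), min, the 25%, 50%, and 75% quartiles, and max. Depending on the pandas version, datetime columns such as 'B' may be summarized as well.
Code: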
# Displaying descriptive statistics
print(df.describe())
Output:
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
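Since every column in the example holds a constant, the statistics above are not very telling. The sketch below uses made-up exam scores (the `scores` frame is purely illustrative) to show how `describe()` behaves on varying data and how single statistics can be read from its result.

```python
import pandas as pd

# Hypothetical exam scores, used only for illustration
scores = pd.DataFrame({'score': [50, 60, 70, 80]})

# describe() returns a DataFrame indexed by statistic name
stats = scores.describe()
print(stats)

# Individual statistics can be read from the result by label
mean_score = stats.loc['mean', 'score']
median_score = stats.loc['50%', 'score']
print(mean_score, median_score)
```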
3. Data Manipulation:
Selection:
Columns can be selected using square bracket indexing (df['A']), and rows can be selected using loc with specific index labels (df.loc[[0, 2]]).
Code:
# Selecting column 'A'
print(df['A'])

# Selecting rows 0 and 2
print(df.loc[[0, 2]])
Output:
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
     A          B    C  D     E    F
0  1.0 2013-01-02  1.0  3  test  foo
2  1.0 2013-01-02  1.0  3  test  foo
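Both forms of selection extend naturally: a list of names selects several columns, and `loc` accepts row labels and column labels together. This is a minimal sketch on a made-up frame (`df_demo` and its values are illustrative only).

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    'A': [1.0, 2.0, 3.0],
    'E': ['test', 'train', 'test'],
    'F': ['foo', 'bar', 'baz'],
})

# Several columns at once: pass a list of column names
subset_cols = df_demo[['A', 'E']]
print(subset_cols)

# Rows labeled 0 and 2, restricted to columns 'A' and 'F'
subset = df_demo.loc[[0, 2], ['A', 'F']]
print(subset)
```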
Filtering:
Data can be filtered based on conditions. In this example, we select rows where the 'B' column is greater than a specified date.
Code:
# Filtering rows where 'B' is greater than a certain date
print(df[df['B'] > '2013-01-01'])
Output:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
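In the example every row passes the filter, so it helps to see a case where some rows are dropped. The sketch below, on made-up data (`df_demo` is an illustrative name), shows that the condition is just a boolean Series and that several conditions can be combined with `&`.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    'B': pd.to_datetime(['2012-12-31', '2013-01-02', '2013-01-05']),
    'D': [1, 3, 5],
})

# The comparison yields a boolean Series, one entry per row
mask = df_demo['B'] > pd.Timestamp('2013-01-01')
print(mask)

# Conditions are combined with & (and) or | (or);
# each comparison needs its own parentheses
filtered = df_demo[mask & (df_demo['D'] < 5)]
print(filtered)
```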
Grouping:
The groupby() method groups data based on a column ('E' in this case). The mean() function is then applied to each group.
Code:
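# Grouping by column 'E' and calculating the mean of each group
# (numeric_only=True skips non-numeric columns such as 'F', which cannot be averaged)
print(df.groupby('E').mean(numeric_only=True))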
Output:
         A    C    D
E
test   1.0  1.0  3.0
train  1.0  1.0  3.0
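With constant columns both groups have identical means, which hides what grouping does. The following minimal sketch uses made-up scores (`df_demo` and 'score' are illustrative names) so that the groups differ.

```python
import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({
    'E': ['test', 'train', 'test', 'train'],
    'score': [1.0, 2.0, 3.0, 6.0],
})

# Mean of 'score' within each group of 'E'
group_means = df_demo.groupby('E')['score'].mean()
print(group_means)
```

Selecting the column before calling `mean()` keeps the result to a single Series indexed by group name.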
4. Data Cleaning:
Handling Missing Data:
The dropna() method removes rows that contain any NaN values, which is one way to handle missing or incomplete data. Since df has no NaN values, the output here matches the original DataFrame.
Code:
# Dropping rows with any NaN values
print(df.dropna())
Output:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
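To see `dropna()` actually remove something, here is a minimal sketch on a made-up frame that does contain missing values (`df_demo` and its columns are illustrative only).

```python
import numpy as np
import pandas as pd

# Hypothetical data with one NaN in each of two different rows
df_demo = pd.DataFrame({
    'x': [1.0, np.nan, 3.0],
    'y': [4.0, 5.0, np.nan],
})

# Rows 1 and 2 each contain a NaN, so only row 0 survives
df_clean = df_demo.dropna()
print(df_clean)

# dropna() returns a new DataFrame; df_demo itself is unchanged
print(len(df_demo))
```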
Filling Missing Data:
The fillna() method fills NaN values with a specified value. In this example, we fill NaN values with 0.
Code:
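# Filling NaN values with 0 in the numeric columns
# (0 is not a category of 'E', so a blanket df.fillna(0) may raise an error)
print(df.fillna({'A': 0, 'C': 0, 'D': 0}))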
Output:
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
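The DataFrame above has nothing to fill, so the effect is easier to see on the Series from the first section, which does contain a NaN. This minimal sketch recreates that Series so it runs on its own.

```python
import numpy as np
import pandas as pd

# Same values as the Series example, including one NaN at index 3
s = pd.Series([1, 3, 5, np.nan, 6, 8])

# Replace the NaN with 0; a new Series is returned and s is unchanged
s_filled = s.fillna(0)
print(s_filled)
```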
5. File I/O:
Reading and Writing Data:
to_csv() writes the DataFrame to a CSV file, and read_csv() reads data from a CSV file into a new DataFrame. Similar functions exist for other file formats.
Code:
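# Writing the DataFrame to a CSV file
df.to_csv('output.csv')
# Reading the CSV file into a new DataFrame (the first column holds the index)
df_new = pd.read_csv('output.csv', index_col=0)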
Output (output.csv):
,A,B,C,D,E,F
0,1.0,2013-01-02,1.0,3,test,foo
1,1.0,2013-01-02,1.0,3,train,foo
2,1.0,2013-01-02,1.0,3,test,foo
3,1.0,2013-01-02,1.0,3,train,foo
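The write-then-read round trip can also be tried without touching the disk. The sketch below assumes an in-memory `io.StringIO` buffer in place of a real file, and `df_demo` is a made-up frame for illustration.

```python
import io

import pandas as pd

# Hypothetical data for illustration
df_demo = pd.DataFrame({'A': [1.0, 2.0], 'F': ['foo', 'bar']})

# Write to an in-memory text buffer instead of a file on disk;
# index=False leaves the row labels out of the CSV text
buffer = io.StringIO()
df_demo.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read the same text back into a new DataFrame
df_back = pd.read_csv(io.StringIO(csv_text))
print(df_back)
```

Passing `index=False` avoids the extra unnamed index column that would otherwise appear when the text is read back.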