Basics about pandas in Python

1. Data Structures:

Series:

A Series is a one-dimensional labeled array that can hold any data type. It's like a single column in a spreadsheet. In this example, we create a Series of numeric values that includes a NaN (Not a Number) value. Because NaN is a floating-point value, the whole Series is stored as float64.

Code:

  import numpy as np
  import pandas as pd

  # Creating a Series
  s = pd.Series([1, 3, 5, np.nan, 6, 8])

  # Displaying the Series
  print(s)

Output:

  0    1.0
  1    3.0
  2    5.0
  3    NaN
  4    6.0
  5    8.0
  dtype: float64
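
A Series can also carry its own labels instead of the default 0, 1, 2, ... index. The following is a minimal sketch, assuming the same values as above; the letter labels 'a' to 'f' are made up for illustration. It also shows how to count missing values with isna().

```python
import numpy as np
import pandas as pd

# Same values as above, with hypothetical string labels as the index
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=list("abcdef"))

# isna() returns a boolean Series that is True where the value is NaN
n_missing = int(s.isna().sum())

# sum() skips NaN values by default
total = s.sum()

print(s["c"])      # label-based access -> 5.0
print(n_missing)   # -> 1
print(total)       # -> 23.0
```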

DataFrame:

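A DataFrame is a two-dimensional table with labeled rows and columns, where each column can have its own data type. In this example, we create a DataFrame from a dictionary whose values have different types: a float, a timestamp, a Series, a NumPy array, categorical data, and a string. Scalar values such as 1.0 and 'foo' are repeated for every row.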

Code:

  import pandas as pd
  import numpy as np

  # Creating a DataFrame
  df = pd.DataFrame({
      'A': 1.0,
      'B': pd.Timestamp('20130102'),
      'C': pd.Series(1, index=list(range(4)), dtype='float32'),
      'D': np.array([3] * 4, dtype='int32'),
      'E': pd.Categorical(["test", "train", "test", "train"]),
      'F': 'foo'
  })

  # Displaying the DataFrame
  print(df)

Output:

       A          B    C  D      E    F
  0  1.0 2013-01-02  1.0  3   test  foo
  1  1.0 2013-01-02  1.0  3  train  foo
  2  1.0 2013-01-02  1.0  3   test  foo
  3  1.0 2013-01-02  1.0  3  train  foo
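
To check which data type each column ended up with, the dtypes and shape attributes can be inspected. This is a small sketch that rebuilds the same DataFrame; the exact names printed for the timestamp and string columns may differ between pandas versions, so no output is shown for them.

```python
import numpy as np
import pandas as pd

# Same DataFrame as in the example above
df = pd.DataFrame({
    'A': 1.0,
    'B': pd.Timestamp('20130102'),
    'C': pd.Series(1, index=list(range(4)), dtype='float32'),
    'D': np.array([3] * 4, dtype='int32'),
    'E': pd.Categorical(["test", "train", "test", "train"]),
    'F': 'foo'
})

# One data type per column
print(df.dtypes)

# (number of rows, number of columns)
print(df.shape)   # -> (4, 6)
```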

2. Basic Operations:

Head and Tail:

The head() method displays the first n rows of a DataFrame, and tail() displays the last n rows. Both default to 5 rows when n is omitted. In this case, we print the first 2 rows.

Code:

  # Displaying the first 2 rows
  print(df.head(2))

Output:

       A          B    C  D      E    F
  0  1.0 2013-01-02  1.0  3   test  foo
  1  1.0 2013-01-02  1.0  3  train  foo
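
The tail() method works the same way from the other end. Here is a minimal sketch, assuming a hypothetical ten-row DataFrame with a single column 'x' so that the difference between head() and tail() is visible.

```python
import pandas as pd

# Hypothetical DataFrame with 10 rows, used only for illustration
df = pd.DataFrame({"x": list(range(10))})

# Last 2 rows (index labels 8 and 9)
last_two = df.tail(2)
print(last_two)

# Without an argument, head() returns the first 5 rows
first_rows = df.head()
print(len(first_rows))   # -> 5
```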

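Descriptive Statistics:

The describe() method provides summary statistics for numeric columns: count, mean, std (standard deviation), min, the quartiles (25%, 50%, 75%), and max. Depending on the pandas version, the datetime column 'B' may also be summarized.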

Code:

  # Displaying descriptive statistics
  print(df.describe())

Output:

         A    C    D
  count  4.0  4.0  4.0
  mean   1.0  1.0  3.0
  std    0.0  0.0  0.0
  min    1.0  1.0  3.0
  25%    1.0  1.0  3.0
  50%    1.0  1.0  3.0
  75%    1.0  1.0  3.0
  max    1.0  1.0  3.0
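
Since describe() returns a DataFrame, individual statistics can be read from it with loc. The sketch below uses a hypothetical 'scores' table (not part of the example above) so that the statistics are not all identical.

```python
import pandas as pd

# Hypothetical exam scores, used only for illustration
scores = pd.DataFrame({
    "math": [70, 80, 90, 100],
    "physics": [60, 65, 70, 85],
})

summary = scores.describe()

# Read a single statistic: row label first, then column label
print(summary.loc["mean", "math"])   # -> 85.0

# The same statistics are also available as separate methods
print(scores["physics"].median())    # -> 67.5
```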

3. Data Manipulation:

Selection:

Columns can be selected using square bracket indexing (df['A']), and rows can be selected using loc with specific indices (df.loc[[0, 2]]).

Code:

  # Selecting column 'A'
  print(df['A'])

  # Selecting rows 0 and 2
  print(df.loc[[0, 2]])

Output:

  0    1.0
  1    1.0
  2    1.0
  3    1.0
  Name: A, dtype: float64

     A          B    C  D     E    F
  0  1.0 2013-01-02  1.0  3  test  foo
  2  1.0 2013-01-02  1.0  3  test  foo
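
Note that loc selects by index label, while iloc selects by integer position. The two only coincide when the index is the default 0, 1, 2, .... The following is a minimal sketch, assuming a hypothetical DataFrame whose index labels are 100, 200 and 300.

```python
import pandas as pd

# Hypothetical DataFrame with non-default index labels
df = pd.DataFrame(
    {"A": [10, 20, 30], "B": ["x", "y", "z"]},
    index=[100, 200, 300],
)

by_label = df.loc[200]       # row whose index label is 200
by_position = df.iloc[1]     # second row, counted from 0
cell = df.loc[300, "A"]      # single value: row label 300, column 'A'

print(by_label["A"], by_position["A"])   # -> 20 20
print(cell)                              # -> 30
```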

Filtering:

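Data can be filtered with a boolean condition placed inside the square brackets. In this example, we select rows where the date in column 'B' is later than a specified date. Every row has the date 2013-01-02, so all four rows are returned.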

Code:

  # Filtering rows where 'B' is later than a certain date
  print(df[df['B'] > '2013-01-01'])

Output:

       A          B    C  D      E    F
  0  1.0 2013-01-02  1.0  3   test  foo
  1  1.0 2013-01-02  1.0  3  train  foo
  2  1.0 2013-01-02  1.0  3   test  foo
  3  1.0 2013-01-02  1.0  3  train  foo
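
Several conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. This is a small sketch under assumed data: the column names 'date', 'split' and 'value' are hypothetical and chosen so that the filter actually removes rows.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "date": pd.to_datetime(["2013-01-01", "2013-01-02", "2013-01-03", "2013-01-04"]),
    "split": ["test", "train", "test", "train"],
    "value": [1, 2, 3, 4],
})

# Rows dated after 2013-01-01 AND belonging to the 'test' split
mask = (df["date"] > pd.Timestamp("2013-01-01")) & (df["split"] == "test")
filtered = df[mask]
print(filtered)

# isin() keeps rows whose value appears in a given list
selected = df[df["value"].isin([1, 4])]
print(selected)
```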

Grouping:

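The groupby() method groups rows that share the same value in a column ('E' in this case). The mean() function is then computed for the numeric columns of each group.

Code:

  # Grouping by column 'E' and calculating the mean of each group.
  # numeric_only=True skips non-numeric columns such as 'F';
  # recent pandas versions raise an error without it
  print(df.groupby('E').mean(numeric_only=True))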

Output:

           A    C    D
  E
  test   1.0  1.0  3.0
  train  1.0  1.0  3.0
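
More than one statistic per group can be computed with agg(). The following is a minimal sketch, assuming a hypothetical 'score' column with different values in each group so that the results differ.

```python
import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "E": ["test", "train", "test", "train"],
    "score": [1.0, 2.0, 3.0, 4.0],
})

# One row per group, one column per requested statistic
stats = df.groupby("E")["score"].agg(["mean", "max", "count"])
print(stats)

print(stats.loc["test", "mean"])    # -> 2.0
print(stats.loc["train", "max"])    # -> 4.0
```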

4. Data Cleaning:

Handling Missing Data:

The dropna() method removes every row that contains at least one NaN value and returns a new DataFrame. Our df has no missing values, so the output below is identical to the original DataFrame.

Code:

  # Dropping rows with any NaN values
  print(df.dropna())

Output:

       A          B    C  D      E    F
  0  1.0 2013-01-02  1.0  3   test  foo
  1  1.0 2013-01-02  1.0  3  train  foo
  2  1.0 2013-01-02  1.0  3   test  foo
  3  1.0 2013-01-02  1.0  3  train  foo
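
To see dropna() actually remove rows, the data needs to contain NaN values. Here is a minimal sketch, assuming a hypothetical three-row DataFrame with some missing entries; it also shows the how and subset parameters.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values, used only for illustration
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

# Default: drop rows with at least one NaN (only row 2 is complete)
complete_rows = df.dropna()

# how="all": drop a row only if every value in it is NaN (none here)
not_empty = df.dropna(how="all")

# subset: only look at column 'A' when deciding what to drop
has_a = df.dropna(subset=["A"])

print(list(complete_rows.index))   # -> [2]
print(list(not_empty.index))       # -> [0, 1, 2]
print(list(has_a.index))           # -> [0, 2]
```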

Filling Missing Data:

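The fillna() method replaces NaN values with a specified value and returns a new DataFrame. In this example, we fill NaN values in the numeric columns with 0. Our df has no missing values, so the output below is identical to the original DataFrame.

Code:

  # Filling NaN values with 0 in the numeric columns only;
  # 0 is not a category of 'E', so df.fillna(0) can raise an error
  print(df.fillna({'A': 0, 'C': 0, 'D': 0}))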

Output:

       A          B    C  D      E    F
  0  1.0 2013-01-02  1.0  3   test  foo
  1  1.0 2013-01-02  1.0  3  train  foo
  2  1.0 2013-01-02  1.0  3   test  foo
  3  1.0 2013-01-02  1.0  3  train  foo
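
A common alternative to a constant is to fill each column with its own mean. The sketch below assumes a hypothetical DataFrame with some missing entries, since the df above has none.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values, used only for illustration
df = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, 5.0, 6.0],
})

# Replace every NaN with the constant 0
zeros = df.fillna(0)

# Replace every NaN with the mean of its own column
# (df.mean() skips NaN: A -> 2.0, B -> 5.5)
means = df.fillna(df.mean())

print(zeros)
print(means)
```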

5. File I/O:

Reading and Writing Data:

to_csv() writes the DataFrame to a CSV file, and read_csv() reads data from a CSV file into a new DataFrame. Similar functions exist for other file formats.

Code:

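  # Writing the DataFrame to a CSV file
  df.to_csv('output.csv')

  # Reading the CSV file back into a new DataFrame
  df_new = pd.read_csv('output.csv', index_col=0)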

Output (output.csv):

  ,A,B,C,D,E,F
  0,1.0,2013-01-02,1.0,3,test,foo
  1,1.0,2013-01-02,1.0,3,train,foo
  2,1.0,2013-01-02,1.0,3,test,foo
  3,1.0,2013-01-02,1.0,3,train,foo
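
CSV files store everything as text, so dates come back as plain strings unless read_csv() is told to parse them. The following is a minimal sketch of a write/read round trip; it uses an in-memory buffer (io.StringIO) instead of a real file and a small hypothetical DataFrame, both chosen only to keep the example self-contained.

```python
import io

import pandas as pd

# Hypothetical data, used only for illustration
df = pd.DataFrame({
    "A": [1.0, 2.0],
    "B": pd.to_datetime(["2013-01-02", "2013-01-03"]),
    "F": ["foo", "bar"],
})

# Write CSV text into an in-memory buffer; index=False omits the row labels
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Rewind the buffer and read it back, parsing column 'B' as dates
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["B"])

print(restored)
```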

Learn the main data structures and basic operations of pandas in Python for college/university exams!