A Beginners Guide to Pandas in Python
Learn the essentials of Python's pandas library in this beginner's guide. Discover how to effortlessly manipulate and analyze data for your projects.

Introduction to Pandas: A Beginner's Guide
What is Pandas?
- Pandas is a powerful Python library for data manipulation and analysis. It provides easy-to-use data structures, high-performance data analysis tools, and various data cleaning and transformation capabilities.
Installation
- Before you start using Pandas, you need to install it. Open your terminal or command prompt and run:
pip install pandasImporting Pandas
- Once installed, you can import Pandas in your Python script or Jupyter notebook:
import pandas as pdData Structures in Pandas
Series:
- Series is like a single column of a spreadsheet
- A one-dimensional labeled array capable of holding any data type.
- It can be created from lists, arrays, or dictionaries.
DataFrame:
- DataFrame is the entire spreadsheet
- A two-dimensional table with labeled axes (rows and columns).
- It is the most commonly used Pandas object, resembling a spreadsheet or SQL table.
- DataFrames can be created from various data sources, including dictionaries, lists, NumPy arrays, and external files like CSV or Excel.
# Series
s = pd.Series([1, 3, 5, 'a', 6, 8])
print(s)
0 1
1 3
2 5
3 a
4 6
5 8
dtype: object
# to check datatype
type(s)pandas.core.series.Series
# DataFrame:
data = {'Name': ['John', 'Alice', 'Bob'],
'Age': [25, 28, 22],
'City': ['New York', 'San Francisco', 'Los Angeles']}
df = pd.DataFrame(data)
df| | Name | Age | City |
|---|-------|-----|----------------|
| 0 | John | 25 | New York |
| 1 | Alice | 28 | San Francisco |
| 2 | Bob | 22 | Los Angeles |
#to check datatype
type(df)pandas.core.frame.DataFrame
Using Numpy to create data for the dataframe
import numpy as np
df1 = pd.DataFrame(np.arange(0,15).reshape(3,5), index= ['R1', 'R2', 'R3'], columns= ['C1', 'C2', 'C3', 'C4', 'C5'])
df1| | C1 | C2 | C3 | C4 | C5 |
|----|----|----|----|----|----|
| R1 | 0 | 1 | 2 | 3 | 4 |
| R2 | 5 | 6 | 7 | 8 | 9 |
| R3 | 10 | 11 | 12 | 13 | 14 |
Column Selection:
- Accessing a single column: df['Column_Name']
- Accessing multiple columns: df[['Column1', 'Column2']]
#Accessing the elements: Indexing and slicing
# Accessing a single column by label
print(df1['C1'])
R1 0
R2 5
R3 10
Name: C1, dtype: int64
We can't access rows in the similar way as above, For accessing row we need to use 'loc' discused below
# Here even the df1 is a dataframe data type, a single row or a colomn will be of data type series only,
type(df1['C1'])pandas.core.series.Series
# Slicing multiple rows by label
print(df1['R1':'R2'])
C1 C2 C3 C4 C5
R1 0 1 2 3 4
R2 5 6 7 8 9
# Slicing multiple rows by position
df1[1:3]| | C1 | C2 | C3 | C4 | C5 |
| --- | -- | -- | -- | -- | -- |
| R2 | 5 | 6 | 7 | 8 | 9 |
| R3 | 10 | 11 | 12 | 13 | 14 |
Filtering Rows:
- Using conditions to filter rows based on a specific criteria.
# Example
df1[df1['C1'] > 2]
| | C1 | C2 | C3 | C4 | C5 |
|------|----|----|----|----|----|
| **R2** | 5 | 6 | 7 | 8 | 9 |
| **R3** | 10 | 11 | 12 | 13 | 14 |
Indexing with loc and iloc
- In Pandas, effective data manipulation often involves precise indexing. The loc and iloc attributes provide powerful tools for label-based and integer-location-based indexing in DataFrames, enabling you to access, slice, and manipulate data with ease.
loc - Label-based Indexing:
- loc is used for accessing a group of rows and columns by labels or a boolean array.
# Accessing a single row by label
df1.loc['R1']C1 0
C2 1
C3 2
C4 3
C5 4
Name: R1, dtype: int64
# Accessing a specific element by label and column
df1.loc['R2', 'C3']7
# Slicing by labels
print(df1.loc['R1':'R2']) C1 C2 C3 C4 C5
R1 0 1 2 3 4
R2 5 6 7 8 9
iloc - Integer-location based Indexing:
- iloc is used for accessing a group of rows and columns by integer position.
# Accessing a single row by position
df1.iloc[1]C1 5
C2 6
C3 7
C4 8
C5 9
Name: R2, dtype: int64
# Accessing a specific element by position
print(df1.iloc[1, 0])5
# Slicing by position
print(df1.iloc[1:3]) C1 C2 C3 C4 C5
R2 5 6 7 8 9
R3 10 11 12 13 14
Reading and Writing Data
Reading Data
- Pandas supports reading data from various file formats such as CSV, Excel, SQL, and more.
Writing Data
- You can also write data back to these formats:
# Reading from CSV
df = pd.read_csv('jobs_in_data.csv') #Here we are using a sample .csv file, the file is provided at the bottom.
# Reading from Excel
# df = pd.read_excel('example.xlsx')
# Reading from SQL
# df = pd.read_sql('SELECT * FROM table', connection)
#Writing:
# Writing to CSV
#df.to_csv('output.csv', index=False)
# Writing to Excel
#df.to_excel('output.xlsx', index=False)
# Writing to SQL
#df.to_sql('table', connection, index=False, if_exists='replace')
df
| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|-------------------------|---------------------------------|------------------|---------|---------------|---------------------|------------------|------------------|--------------|-------------------|--------------|
| 0 | 2023 | Data DevOps Engineer | Data Engineering | EUR | 88000 | 95012 | Germany | Mid-level | Full-time | Hybrid | Germany | L |
| 1 | 2023 | Data Architect | Data Architecture and Modeling | USD | 186000 | 186000 | United States | Senior | Full-time | In-person | United States | M |
| 2 | 2023 | Data Architect | Data Architecture and Modeling | USD | 81800 | 81800 | United States | Senior | Full-time | In-person | United States | M |
| 3 | 2023 | Data Scientist | Data Science and Research | USD | 212000 | 212000 | United States | Senior | Full-time | In-person | United States | M |
| 4 | 2023 | Data Scientist | Data Science and Research | USD | 93300 | 93300 | United States | Senior | Full-time | In-person | United States | M |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9350 | 2021 | Data Specialist | Data Management and Strategy | USD | 165000 | 165000 | United States | Senior | Full-time | Remote | United States | L |
| 9351 | 2020 | Data Scientist | Data Science and Research | USD | 412000 | 412000 | United States | Senior | Full-time | Remote | United States | L |
| 9352 | 2021 | Principal Data Scientist | Data Science and Research | USD | 151000 | 151000 | United States | Mid-level | Full-time | Remote | United States | L |
| 9353 | 2020 | Data Scientist | Data Science and Research | USD | 105000 | 105000 | United States | Entry-level | Full-time | Remote | United States | S |
| 9354 | 2020 | Business Data Analyst | Data Analysis | USD | 100000 | 100000 | United States | Entry-level | Contract | Remote | United States | L |
*Note: 9355 rows × 12 columns*
Data Exploration in pandas
- Data exploration is a crucial step in the data analysis process, and Pandas provides powerful tools to help you understand and analyze your dataset.
Viewing Data:
- head(): Displays the first few rows of the DataFrame.
- tail(): Displays the last few rows of the DataFrame.
- sample(n): Displays a random sample of n rows from the DataFrame.
df.head() #Displays the first few rows of the DataFrame.| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|----|-----------|-------------------------|---------------------------------|------------------|--------|---------------|---------------------|------------------|------------------|--------------|------------------|--------------|
| 0 | 2023 | Data DevOps Engineer | Data Engineering | EUR | 88000 | 95012 | Germany | Mid-level | Full-time | Hybrid | Germany | L |
| 1 | 2023 | Data Architect | Data Architecture and Modeling | USD | 186000 | 186000 | United States | Senior | Full-time | In-person | United States | M |
| 2 | 2023 | Data Architect | Data Architecture and Modeling | USD | 81800 | 81800 | United States | Senior | Full-time | In-person | United States | M |
| 3 | 2023 | Data Scientist | Data Science and Research | USD | 212000 | 212000 | United States | Senior | Full-time | In-person | United States | M |
| 4 | 2023 | Data Scientist | Data Science and Research | USD | 93300 | 93300 | United States | Senior | Full-time | In-person | United States | M |
df.tail() #Displays the last few rows of the DataFrame.| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|--------|-----------|------------------------------|--------------------------------|------------------|--------|--------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 9350 | 2021 | Data Specialist | Data Management and Strategy | USD | 165000 | 165000 | United States | Senior | Full-time | Remote | United States | L |
| 9351 | 2020 | Data Scientist | Data Science and Research | USD | 412000 | 412000 | United States | Senior | Full-time | Remote | United States | L |
| 9352 | 2021 | Principal Data Scientist | Data Science and Research | USD | 151000 | 151000 | United States | Mid-level | Full-time | Remote | United States | L |
| 9353 | 2020 | Data Scientist | Data Science and Research | USD | 105000 | 105000 | United States | Entry-level | Full-time | Remote | United States | S |
| 9354 | 2020 | Business Data Analyst | Data Analysis | USD | 100000 | 100000 | United States | Entry-level | Contract | Remote | United States | L |
df.sample(2) #Displays a random sample of 5 rows from the DataFrame.| | work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|---|-----------|---------------------|----------------------------|-----------------|--------|---------------|---------------------|-------------------|------------------|--------------|-------------------|--------------|
| 1 | 2023 | Applied Scientist | Data Science and Research | USD | 172040 | 172040 | United States | Senior | Full-time | Remote | United States | M |
| 2 | 2023 | Data Analyst | Data Analysis | USD | 126000 | 126000 | United States | Senior | Full-time | In-person | United States | M |
Summary Statistics:
- describe(): Generates descriptive statistics for numerical columns.
# Summary Statistics:
df.describe() #Generates descriptive statistics for numerical columns.| | work_year | salary | salary_in_usd |
|--------------|--------------|--------------|---------------|
| count | 9355.000000 | 9355.000000 | 9355.000000 |
| mean | 2022.760449 | 149927.981293| 150299.495564 |
| std | 0.519470 | 63608.835387 | 63177.372024 |
| min | 2020.000000 | 14000.000000 | 15000.000000 |
| 25% | 2023.000000 | 105200.000000| 105700.000000 |
| 50% | 2023.000000 | 143860.000000| 143000.000000 |
| 75% | 2023.000000 | 187000.000000| 186723.000000 |
| max | 2023.000000 | 450000.000000| 450000.000000 |
Data Types:
- info(): Provides a concise summary of the DataFrame, including data types and non-null values.
# Datatype:
df.info() #Provides a concise summary of the DataFrame, including data types and non-null values.<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9355 entries, 0 to 9354
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 work_year 9355 non-null int64
1 job_title 9355 non-null object
2 job_category 9355 non-null object
3 salary_currency 9355 non-null object
4 salary 9355 non-null int64
5 salary_in_usd 9355 non-null int64
6 employee_residence 9355 non-null object
7 experience_level 9355 non-null object
8 employment_type 9355 non-null object
9 work_setting 9355 non-null object
10 company_location 9355 non-null object
11 company_size 9355 non-null object
dtypes: int64(3), object(9)
memory usage: 877.2+ KB
Handling Missing Data:
- Handling missing data is an essential part of data cleaning and preprocessing. Pandas provides several methods to deal with missing values in a DataFrame.
Checking for Missing Data:
- Use isnull() or isna() to check for missing values in the DataFrame.
# Detecting Missing Values
df.isnull() #or df.isna()| work_year | job_title | job_category | salary_currency | salary | salary_in_usd | employee_residence | experience_level | employment_type | work_setting | company_location | company_size |
|-----------|-----------|--------------|-----------------|--------|---------------|--------------------|-------------------|------------------|--------------|------------------|--------------|
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
| False | False | False | False | False | False | False | False | False | False | False | False |
9355 rows × 12 columns
#Counting Missing Values:
df.isnull().sum()work_year 0
job_title 0
job_category 0
salary_currency 0
salary 0
salary_in_usd 0
employee_residence 0
experience_level 0
employment_type 0
work_setting 0
company_location 0
company_size 0
dtype: int64
Handling Missing Values
# Dropping Missing Values:
# Use dropna() to remove rows or columns with missing values.
# Dropping rows with any missing values
df.dropna(inplace=True)
# Dropping columns with any missing values
df.dropna(axis=1, inplace=True)# Filling Missing Values:
# Use fillna(value) to replace missing values with a specific value.
# Filling missing values in salalry field with a specific value (e.g., mean)
df.fillna(value=df['salary'].mean(), inplace=True)# Forward Fill (ffill) and Backward Fill (bfill):
# Use ffill() to fill missing values with the previous value and bfill() to fill with the next value.
# Forward fill missing values
df.ffill(inplace=True)
# Backward fill missing values
df.bfill(inplace=True)
Grouping and Aggregation:
- Grouping and aggregation are powerful operations in Pandas that allow you to group your data based on certain criteria and perform calculations on those groups.
Grouping Data:
- Use groupby() to group data by a single column.
# Grouping by 'Category' column
grouped_data = df.groupby('job_title')
# Calculating mean for each group
mean_values = grouped_data.mean() # After grouping, apply aggregation functions (e.g., mean, sum, count) to get summary statistics for each group.
print(mean_values) work_year salary salary_in_usd
job_title
AI Architect 2023.000000 249000.000000 250328.000000
AI Developer 2022.944444 140500.000000 141140.888889
AI Engineer 2023.000000 169208.416667 171663.972222
AI Programmer 2022.800000 74000.000000 68817.400000
AI Research Engineer 2022.750000 67275.000000 73271.500000
... ... ... ...
Sales Data Analyst 2020.000000 60000.000000 60000.000000
Software Data Engineer 2023.000000 120000.000000 111627.666667
Staff Data Analyst 2021.000000 84999.333333 79917.000000
Staff Data Scientist 2020.500000 134500.000000 134500.000000
Staff Machine Learning Engineer 2021.000000 185000.000000 185000.000000
[125 rows x 3 columns]
On this page
- What is Pandas?
- Installation
- Importing Pandas
- Data Structures in Pandas
- Series:
- DataFrame:
- Column Selection:
- Filtering Rows:
- Indexing with loc and iloc
- loc - Label-based Indexing:
- iloc - Integer-location based Indexing:
- Reading and Writing Data
- Reading Data
- Writing Data
- Data Exploration in pandas
- Viewing Data:
- Handling Missing Data:
- Checking for Missing Data:
- Handling Missing Values
- Grouping and Aggregation:
- Grouping Data:
Keep exploring
matched by tag + title overlap
Read next
A Beginners Guide to Numpy in Python
Discover NumPy basics in Python! Learn to manipulate data effortlessly with arrays and boost your coding skills. Perfect for beginners diving into numerical computing.
#python#data-analyst#data-scienceLists/Sets/Dictionaries/Tuples in Python
Explore all the key concepts of Lists/Sets/Dictionaries/Tuples in Python related to Data Science
#python#data-analyst#data-sciencePL-300: Microsoft Power BI Data Analyst Certification Roadmap
Your go-to guide for conquering the PL-300 Power BI Data Analyst exam, packed with practical tips and resources for success.
#data-analystApache Spark for data ingestion into Microsoft Fabric Lakehouse using Notebooks
Explore the power of Apache Spark and Python for seamless data ingestion into Microsoft Fabric Lakehouse. Dive into Fabric notebooks, a scalable and systematic solution that empowers you to ingest external data, configure authentication…
#data-analyst