Pandas Comprehensive Learning Guide


🐼 Mastering Pandas: A Web Developer’s Fast Track to Data Analysis in Python

Welcome, fellow web developer! Are you ready to level up your Python skills and dive into the exciting world of data analysis? If you’ve been wrangling data in JavaScript or perhaps manipulating JSON objects in your Angular apps, you’re in for a treat. Pandas, a cornerstone library in the Python data science ecosystem, is about to become your new best friend for handling tabular data with unparalleled ease and power.

This guide is tailor-made for you: an Angular developer with a strong grasp of Python fundamentals, but perhaps limited exposure to the specific nuances of data manipulation libraries like Pandas. We’re going to bridge that gap, drawing parallels to concepts you already know, and equipping you with the skills to confidently load, clean, transform, and analyze data like a pro.

🚀 What is Pandas? Why Should You Care?

Imagine you have a massive spreadsheet with thousands of rows and columns. How would you quickly find all customers who bought a specific product last month, calculate the average sales per region, or identify trends in your website’s traffic data? Doing this manually, or with raw Python lists of dictionaries, can quickly become a nightmare. This is where Pandas shines!

Pandas is an open-source Python library built on top of NumPy (another fantastic library for numerical operations). It provides high-performance, easy-to-use data structures and data analysis tools for working with structured data. Think of it as supercharged spreadsheet software, but with the full programmatic power of Python at your fingertips.

Why is it Important/Useful?

  • Data Cleaning & Preparation: Real-world data is messy. Pandas helps you handle missing values, inconsistent formats, and outliers with elegant solutions.
  • Data Exploration & Analysis: Quickly calculate statistics (mean, median, standard deviation), filter data based on conditions, group data, and pivot tables.
  • Data Loading & Saving: Read and write data from various formats like CSV, Excel, SQL databases, JSON, and more.
  • Integration: Seamlessly integrates with other Python libraries like Matplotlib (for visualization) and Scikit-learn (for machine learning).
  • Performance: Optimized for speed, especially when dealing with large datasets, thanks to its C-optimized underpinnings (via NumPy).

Who is This Guide For?

This guide is specifically designed for:

  • Web Developers: Especially those familiar with Angular, JavaScript, and strong in Python basics.
  • Aspiring Data Scientists/Analysts: If you’re looking to get your foot in the door of data-driven roles.
  • Anyone Tired of Manual Data Wrangling: If you’re constantly struggling with spreadsheets or custom Python scripts for data tasks.

What Will You Learn?

By the end of this guide, you will be able to:

  • Set up your development environment for Pandas.
  • Understand and utilize the two core Pandas data structures: Series and DataFrame.
  • Load and save data from common file formats.
  • Perform basic data selection, filtering, and manipulation.
  • Handle missing data and perform simple data cleaning.
  • Conduct basic statistical analysis and aggregation.
  • Gain an initial understanding of more advanced concepts like merging and plotting.
  • Apply your knowledge in a hands-on mini-project.

Prerequisites

  • Python 3: A solid understanding of Python syntax, data types (lists, dictionaries, tuples), loops, functions, and basic object-oriented programming.
  • Command Line Basics: Familiarity with navigating directories and running commands in your terminal.
  • Basic Data Concepts: A general understanding of tables, rows, columns, and data types (numbers, strings, booleans).

🛠️ Setup: Getting Pandas Ready to Roll

Getting Pandas installed and ready is straightforward, especially if you’re already accustomed to managing Python environments. We’ll use pip for installation, and it’s highly recommended to use a virtual environment to keep your project dependencies isolated.

Step 1: Create and Activate a Virtual Environment

Just like you might use nvm for Node.js versions or npm workspaces for projects, Python virtual environments (venv or conda) keep your project dependencies clean. Open your terminal or command prompt and run:

# Navigate to your project directory or create a new one
mkdir pandas_guide
cd pandas_guide

# Create a virtual environment named 'venv'
python -m venv venv

# Activate the virtual environment
# On macOS/Linux:
source venv/bin/activate

# On Windows:
.\venv\Scripts\activate

You’ll notice (venv) prepended to your terminal prompt, indicating that your virtual environment is active.

Step 2: Install Pandas

Once your virtual environment is active, install Pandas using pip:

pip install pandas
pip install jupyterlab # Optional, but highly recommended for interactive data analysis

jupyterlab provides an interactive environment (Jupyter notebooks) that is incredibly popular for data analysis, allowing you to run code cells incrementally, visualize results, and document your process. We’ll use a standard Python script for the core examples, but consider exploring Jupyter later.

Step 3: Verify Installation

To ensure everything is set up correctly, open a Python interpreter or create a new Python file (e.g., test_pandas.py) and try importing Pandas:

# test_pandas.py
import pandas as pd

print("Pandas version:", pd.__version__)

# Deactivate the virtual environment when you're done by running this shell command:
#   deactivate

Run this file: python test_pandas.py. If you see a version number printed, you’re good to go!

🧩 Core Concepts: The Building Blocks of Data Analysis

Pandas revolves around two fundamental data structures: Series and DataFrame. Think of them as the JavaScript array and object, but purpose-built for efficient data manipulation.

1. The Series: Your Smart Array

A Pandas Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It’s like a single column in a spreadsheet or a labeled list in Python.

Analogy for Web Devs:

  • Think of it as a JavaScript array, but where each element also has an associated “index” or “label” (which can be custom, not just 0, 1, 2…).
  • It’s similar to an ordered dictionary where keys are your labels/index and values are the data.

Creating a Series

import pandas as pd

# From a Python list
# Default integer index (0, 1, 2...)
temperatures = pd.Series([22, 25, 19, 28, 23])
print("Temperatures Series:\n", temperatures)
# Output:
# Temperatures Series:
#  0    22
#  1    25
#  2    19
#  3    28
#  4    23
# dtype: int64

# From a Python dictionary (keys become the index)
city_temps = pd.Series({'London': 18, 'Paris': 22, 'Berlin': 19, 'Rome': 25})
print("\nCity Temperatures Series:\n", city_temps)
# Output:
# City Temperatures Series:
#  London    18
#  Paris     22
#  Berlin    19
#  Rome      25
# dtype: int64

# With a custom index
fruits = ["Apple", "Banana", "Cherry"]
counts = [10, 15, 8]
fruit_counts = pd.Series(counts, index=fruits)
print("\nFruit Counts Series:\n", fruit_counts)
# Output:
# Fruit Counts Series:
#  Apple     10
#  Banana    15
#  Cherry     8
# dtype: int64

Accessing Elements in a Series

You can access elements using their integer position (like a list) or their label (if a custom index is set).

import pandas as pd

city_temps = pd.Series({'London': 18, 'Paris': 22, 'Berlin': 19, 'Rome': 25})

# Access by label (similar to dictionary key access)
print("Temperature in Paris:", city_temps['Paris']) # Output: 22

# Access by integer position (use .iloc for positional access)
print("Temperature at index 0:", city_temps.iloc[0]) # Output: 18

# Access multiple elements by labels
print("Temperatures in London and Rome:\n", city_temps[['London', 'Rome']])
# Output:
# Temperatures in London and Rome:
#  London    18
#  Rome      25
# dtype: int64

# Conditional selection (powerful!)
print("Cities with temperature > 20:\n", city_temps[city_temps > 20])
# Output:
# Cities with temperature > 20:
#  Paris    22
#  Rome     25
# dtype: int64

2. The DataFrame: Your Smart Spreadsheet

A Pandas DataFrame is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). It’s essentially a collection of Series objects that share the same index.

Analogy for Web Devs:

  • Think of it as a table in a relational database (with rows and columns).
  • Or, imagine an array of JavaScript objects, where each object represents a row, and the object keys are the column names.
  • It’s like a highly optimized spreadsheet, where you can programmatically interact with every cell, row, and column.

Creating a DataFrame

DataFrames can be created from various sources, including dictionaries of lists, lists of dictionaries, NumPy arrays, and external files.

From a dictionary of lists (most common for small data): Here, keys become column names, and values are lists representing the data for that column.

import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['New York', 'London', 'Paris', 'New York', 'London'],
    'Score': [85, 92, 78, 95, 88]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:\n", df)
# Output:
# DataFrame from dictionary:
#       Name  Age      City  Score
# 0    Alice   25  New York     85
# 1      Bob   30    London     92
# 2  Charlie   35     Paris     78
# 3    David   28  New York     95
# 4      Eve   32    London     88

From a list of dictionaries: Each dictionary represents a row, and keys are column names. This is very intuitive for web developers.

import pandas as pd

data_list = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York', 'Score': 85},
    {'Name': 'Bob', 'Age': 30, 'City': 'London', 'Score': 92},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Paris', 'Score': 78}
]

df_list = pd.DataFrame(data_list)
print("\nDataFrame from list of dictionaries:\n", df_list)
# Output:
# DataFrame from list of dictionaries:
#       Name  Age      City  Score
# 0    Alice   25  New York     85
# 1      Bob   30    London     92
# 2  Charlie   35     Paris     78

Basic DataFrame Operations

  • df.head() and df.tail(): View the first/last N rows (default 5). Essential for quick checks.
  • df.info(): Get a concise summary of the DataFrame, including data types and non-null values.
  • df.describe(): Generate descriptive statistics (count, mean, std, min, max, quartiles) for numerical columns.
  • df.shape: Get the number of rows and columns as a tuple (rows, columns).
  • df.columns: Get a list of column names.
  • df.index: Get the row index.
print("\nFirst 3 rows:\n", df.head(3))
print("\nDataFrame Info:")
df.info()
print("\nDescriptive Statistics:\n", df.describe())
print("\nDataFrame Shape:", df.shape) # (5, 4)
print("\nColumn Names:", df.columns.tolist())

Selecting Data (Columns and Rows)

This is one of the most frequent operations.

  • Selecting Columns: Use dictionary-like syntax or dot notation.
    • df['ColumnName'] returns a Series.
    • df[['Column1', 'Column2']] returns a DataFrame with the selected columns.
  • Selecting Rows:
    • loc (label-based): Used for selecting by label/index name. df.loc[row_label, column_label]
    • iloc (integer-location based): Used for selecting by integer position. df.iloc[row_position, column_position]

# Select a single column (returns a Series)
ages = df['Age']
print("\nAges Series:\n", ages)

# Select multiple columns (returns a DataFrame)
name_city = df[['Name', 'City']]
print("\nName and City DataFrame:\n", name_city)

# Select rows using .loc (by index label)
# Select row with index 0
print("\nRow at index 0 (using .loc):\n", df.loc[0])
# Select rows with index 0 and 2
print("\nRows at index 0 and 2 (using .loc):\n", df.loc[[0, 2]])
# Select specific cells using .loc
print("\nCity of Alice (loc):", df.loc[0, 'City'])
print("\nName of row with index 1:", df.loc[1, 'Name'])

# Select rows using .iloc (by integer position)
# Select row at position 0
print("\nRow at position 0 (using .iloc):\n", df.iloc[0])
# Select rows at positions 0 and 2
print("\nRows at position 0 and 2 (using .iloc):\n", df.iloc[[0, 2]])
# Select specific cells using .iloc
print("\nCity of Alice (iloc):", df.iloc[0, 2]) # 0th row, 2nd column
print("\nName of row at position 1:", df.iloc[1, 0]) # 1st row, 0th column

Understanding loc vs. iloc (Crucial!):

  • loc is for label-based indexing. If your DataFrame index is ['A', 'B', 'C'], you’d use df.loc['A']. If it’s the default integer index, df.loc[0] refers to the row with the label 0.
  • iloc is for integer-position based indexing. Always uses 0-based integer positions, regardless of the index labels. df.iloc[0] always refers to the first row.

When your index is the default integer range, loc and iloc on rows might seem to behave similarly, but their underlying philosophy is different. It becomes critical when you have non-integer or non-sequential indices.
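
To see the difference concretely, here is a minimal sketch using a hypothetical DataFrame indexed by user ID strings instead of the default 0, 1, 2… (the data is made up purely for illustration):

import pandas as pd

# A small DataFrame whose index consists of string labels, not integer positions
scores = pd.DataFrame(
    {'Score': [85, 92, 78]},
    index=['user_7', 'user_3', 'user_9']
)

# Label-based: select the row whose index label is 'user_3'
print(scores.loc['user_3'])   # Score    92

# Position-based: select the second row, whatever its label happens to be
print(scores.iloc[1])         # Score    92

# scores.loc[1] would raise a KeyError here, because no row has the label 1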

Conditional Selection (Filtering Data)

This is incredibly powerful and common. You pass a boolean Series (a Series of True/False values) inside the square brackets. Only rows where the corresponding boolean is True are selected.

Analogy for Web Devs:

  • Like using Array.prototype.filter() in JavaScript, but on an entire dataset at once.
  • Or, a WHERE clause in SQL.

# Select rows where Age is greater than 30
filtered_df = df[df['Age'] > 30]
print("\nPeople older than 30:\n", filtered_df)
# Output:
#    Name  Age    City  Score
# 2  Charlie   35   Paris     78
# 4      Eve   32  London     88

# Select rows where City is 'New York' AND Score is greater than 90
ny_high_score_df = df[(df['City'] == 'New York') & (df['Score'] > 90)]
print("\nNew York residents with high scores:\n", ny_high_score_df)
# Output:
#     Name  Age      City  Score
# 3  David   28  New York     95

# Select rows where City is 'London' OR 'Paris'
london_paris_df = df[(df['City'] == 'London') | (df['City'] == 'Paris')]
print("\nResidents of London or Paris:\n", london_paris_df)
# Output:
#       Name  Age    City  Score
# 1      Bob   30  London     92
# 2  Charlie   35   Paris     78
# 4      Eve   32  London     88

# Using .isin() for multiple values (cleaner than multiple ORs)
cities_of_interest = ['London', 'Paris']
london_paris_df_isin = df[df['City'].isin(cities_of_interest)]
print("\nResidents of London or Paris (using .isin()):\n", london_paris_df_isin)

Adding, Deleting, and Modifying Columns

  • Adding a new column: Assign a Series or a scalar value.
  • Modifying an existing column: Assign new values to it.
  • Deleting a column: Use del df['ColumnName'] or df.drop('ColumnName', axis=1). axis=1 means “columns”. axis=0 (default) means “rows”.

# Add a new column 'Status' with a default value
df['Status'] = 'Active'
print("\nDataFrame with new 'Status' column:\n", df)
# Output:
#       Name  Age      City  Score  Status
# 0    Alice   25  New York     85  Active
# 1      Bob   30    London     92  Active
# 2  Charlie   35     Paris     78  Active
# 3    David   28  New York     95  Active
# 4      Eve   32    London     88  Active

# Add a new column based on existing columns (e.g., Score / 100)
df['Percentage_Score'] = df['Score'] / 100
print("\nDataFrame with 'Percentage_Score':\n", df)

# Modify an existing column
df['Age'] = df['Age'] + 1 # Increment all ages by 1
print("\nDataFrame with incremented Age:\n", df)

# Delete a column
# df_after_drop = df.drop('Status', axis=1) # Creates a new DataFrame without 'Status'
# print("\nDataFrame after dropping 'Status':\n", df_after_drop)

# Using 'inplace=True' to modify the DataFrame directly (use with caution!)
df.drop('Status', axis=1, inplace=True)
print("\nDataFrame after dropping 'Status' (inplace):\n", df)

Handling Missing Data (NaN)

Missing data is represented by NaN (Not a Number), which comes from NumPy. Pandas provides powerful tools to deal with it.

  • df.isnull() / df.isna(): Returns a boolean DataFrame indicating where values are missing.
  • df.isnull().sum(): Counts missing values per column.
  • df.dropna(): Remove rows or columns with missing values.
  • df.fillna(): Fill missing values with a specified value (e.g., mean, median, 0, or a constant).

import pandas as pd
import numpy as np

data_with_nan = {
    'Product': ['A', 'B', 'C', 'D', 'E'],
    'Price': [10.5, 12.0, np.nan, 8.0, 15.0],
    'Quantity': [100, np.nan, 150, 200, 120],
    'Category': ['Electronics', 'Books', 'Electronics', 'Books', np.nan]
}
df_nan = pd.DataFrame(data_with_nan)
print("\nDataFrame with missing values:\n", df_nan)

print("\nMissing values per column:\n", df_nan.isnull().sum())
# Output:
# Price       1
# Quantity    1
# Category    1
# dtype: int64

# Drop rows with any missing values
df_cleaned_dropna_rows = df_nan.dropna()
print("\nDataFrame after dropping rows with NaNs:\n", df_cleaned_dropna_rows)

# Fill missing 'Price' with the mean price
mean_price = df_nan['Price'].mean()
df_filled_price = df_nan.copy() # Work on a copy to preserve original
df_filled_price['Price'] = df_filled_price['Price'].fillna(mean_price)
print("\nDataFrame after filling Price NaNs with mean:\n", df_filled_price)

# Fill missing 'Category' with 'Unknown'
df_filled_category = df_nan.copy()
df_filled_category['Category'] = df_filled_category['Category'].fillna('Unknown')
print("\nDataFrame after filling Category NaNs with 'Unknown':\n", df_filled_category)

Grouping and Aggregation (groupby())

This is the equivalent of SQL’s GROUP BY clause, or of grouping records by key with reduce in JavaScript. It’s incredibly powerful for summarizing data, and it follows the split-apply-combine pattern:

  1. Split: Divide the data into groups based on some criteria (e.g., ‘City’).
  2. Apply: Perform a function to each group independently (e.g., calculate the mean ‘Score’).
  3. Combine: Combine the results into a data structure.

# Group by 'City' and calculate the mean 'Score' for each city
avg_score_by_city = df.groupby('City')['Score'].mean()
print("\nAverage Score by City:\n", avg_score_by_city)
# Output:
# Average Score by City:
#  City
#  London      90.0
#  New York    90.0
#  Paris       78.0
# Name: Score, dtype: float64

# Group by 'City' and get the count of people and sum of scores
city_stats = df.groupby('City').agg(
    total_people=('Name', 'count'), # named aggregation: (source column, function)
    average_score=('Score', 'mean'),
    max_score=('Score', 'max')
)
print("\nCity Statistics:\n", city_stats)
# Output:
# City Statistics:
#             total_people  average_score  max_score
# City
# London                 2           90.0         92
# New York               2           90.0         95
# Paris                  1           78.0         78

🚀 Loading and Saving Data

One of the primary uses of Pandas is to interact with data stored in various formats.

  • CSV (Comma Separated Values): The most common format.
    • pd.read_csv('file.csv')
    • df.to_csv('output.csv', index=False) (index=False prevents writing the DataFrame index as a column)
  • Excel (requires the openpyxl package for .xlsx files):
    • pd.read_excel('file.xlsx', sheet_name='Sheet1')
    • df.to_excel('output.xlsx', sheet_name='Sheet1', index=False)
  • JSON:
    • pd.read_json('file.json')
    • df.to_json('output.json', orient='records') (orient='records' outputs a list of dicts, familiar to web developers)
  • SQL Databases:
    • pd.read_sql('SELECT * FROM my_table', connection) (requires a database connector like sqlalchemy or psycopg2)
    • df.to_sql('new_table', connection, if_exists='replace', index=False)

Example: Reading/Writing CSV

Let’s create a dummy CSV file first:

# Create a dummy CSV file programmatically
csv_content = """Name,Age,City,Score
Alice,25,New York,85
Bob,30,London,92
Charlie,35,Paris,78
David,28,New York,95
Eve,32,London,88
"""
with open("sample_data.csv", "w") as f:
    f.write(csv_content)

print("\n--- Reading from sample_data.csv ---")
df_from_csv = pd.read_csv("sample_data.csv")
print(df_from_csv)

print("\n--- Saving to output_data.csv ---")
# Add a new column to demonstrate saving modified data
df_from_csv['Grade'] = ['A', 'A', 'B', 'A', 'A']
df_from_csv.to_csv("output_data.csv", index=False)
print("Data saved to output_data.csv (check your directory)")

# You can now open output_data.csv with a text editor to see the result
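
Since JSON is probably the format you handle most as a web developer, here is a quick round-trip sketch in the same spirit (the file name output_data.json is just a placeholder for this example):

# Save the DataFrame as a list of records, the shape a typical REST API returns
df_from_csv.to_json("output_data.json", orient="records")

# Read it back into a DataFrame
df_from_json = pd.read_json("output_data.json")
print("\nDataFrame read back from JSON:\n", df_from_json.head())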

📈 Advanced Concepts (Briefly)

This guide is for beginners, but it’s good to know what else Pandas can do. These topics typically come after you’re comfortable with the basics.

  • Merging/Joining DataFrames: Combining DataFrames based on common columns, similar to SQL JOINs (pd.merge()). Essential when data is spread across multiple tables; a quick sketch follows this list.
  • Reshaping Data: Pivoting, melting, stacking, and unstacking data (pivot_table, melt). Useful for changing the layout of your data for analysis or visualization.
  • Time Series Analysis: Pandas has excellent support for date and time data, including resampling, rolling calculations, and handling time zones.
  • Categorical Data: Efficiently handling columns with a limited number of unique values (e.g., ‘Gender’, ‘Region’).
  • Apply and Map: Applying custom Python functions row-wise or element-wise across Series or DataFrames.
  • Plotting: Pandas has built-in plotting capabilities (using Matplotlib as a backend), allowing for quick data visualizations directly from DataFrames.
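
As a small preview of merging, here is a minimal sketch; the two tables below are made up purely for illustration:

import pandas as pd

# Hypothetical example: users and their orders share a 'user_id' column
users = pd.DataFrame({'user_id': [1, 2, 3], 'name': ['Alice', 'Bob', 'Charlie']})
orders = pd.DataFrame({'user_id': [1, 1, 3], 'amount': [25.0, 40.0, 15.5]})

# Inner join on 'user_id', much like SQL's INNER JOIN
merged = pd.merge(users, orders, on='user_id', how='inner')
print(merged)
# Output:
#    user_id     name  amount
# 0        1    Alice    25.0
# 1        1    Alice    40.0
# 2        3  Charlie    15.5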

🤝 Hands-on Project: Analyzing User Engagement Data

Let’s put what we’ve learned into practice! We’ll simulate analyzing a simple web application’s user engagement data.

Scenario: You work for a company that has just launched a new feature. You’ve collected some mock data about user sessions, feature usage, and demographics. Your task is to perform some basic analysis to understand user behavior.

Data Description:

  • user_id: Unique identifier for the user.
  • session_duration_min: How long the user was active in minutes.
  • feature_X_used: Boolean, whether the user used “Feature X”.
  • platform: ‘Web’, ‘Mobile_iOS’, or ‘Mobile_Android’.
  • user_segment: ‘New User’, ‘Returning User’, ‘VIP’.
  • country: Country of the user.

Project Steps:

  1. Generate Dummy Data: Create a Pandas DataFrame to simulate our user data.
  2. Initial Data Exploration: Get a feel for the data.
  3. Data Cleaning: Introduce and handle some missing values.
  4. Feature Engineering: Create a new column based on existing ones.
  5. Basic Analysis: Answer some questions about the data using grouping and filtering.

import pandas as pd
import numpy as np

# --- Step 1: Generate Dummy Data ---
print("--- Step 1: Generating Dummy Data ---")
np.random.seed(42) # For reproducibility

num_users = 100

data = {
    'user_id': range(1001, 1001 + num_users),
    'session_duration_min': np.random.randint(5, 120, num_users),
    'feature_X_used': np.random.choice([True, False], num_users, p=[0.7, 0.3]), # 70% used feature X
    'platform': np.random.choice(['Web', 'Mobile_iOS', 'Mobile_Android'], num_users, p=[0.5, 0.25, 0.25]),
    'user_segment': np.random.choice(['New User', 'Returning User', 'VIP'], num_users, p=[0.4, 0.5, 0.1]),
    'country': np.random.choice(['USA', 'Canada', 'UK', 'Germany', 'Australia'], num_users, p=[0.4, 0.2, 0.15, 0.15, 0.1])
}

df_users = pd.DataFrame(data)

# Introduce some missing values to simulate real-world data issues
# Let's say some session durations or platforms are missing
# Cast the duration column to float first, since NaN cannot be stored in an integer column
df_users['session_duration_min'] = df_users['session_duration_min'].astype(float)
missing_indices_duration = np.random.choice(df_users.index, 5, replace=False)
df_users.loc[missing_indices_duration, 'session_duration_min'] = np.nan

missing_indices_platform = np.random.choice(df_users.index, 3, replace=False)
df_users.loc[missing_indices_platform, 'platform'] = np.nan

print("Initial DataFrame Head:\n", df_users.head())
print("\nDataFrame Info:")
df_users.info()

# --- Step 2: Initial Data Exploration ---
print("\n--- Step 2: Initial Data Exploration ---")
print("Shape of the DataFrame:", df_users.shape)
print("\nDescriptive Statistics for numerical columns:\n", df_users.describe())
print("\nValue counts for 'platform':\n", df_users['platform'].value_counts())
print("\nValue counts for 'user_segment':\n", df_users['user_segment'].value_counts())
print("\nMissing values before cleaning:\n", df_users.isnull().sum())


# --- Step 3: Data Cleaning ---
print("\n--- Step 3: Data Cleaning ---")
# For session_duration_min, let's fill missing values with the median duration
median_duration = df_users['session_duration_min'].median()
df_users['session_duration_min'] = df_users['session_duration_min'].fillna(median_duration)
print(f"Filled missing 'session_duration_min' with median: {median_duration:.2f} minutes")

# For 'platform', let's fill missing values with the mode (most frequent platform)
# Use .mode()[0] because .mode() can return multiple modes if there's a tie
mode_platform = df_users['platform'].mode()[0]
df_users['platform'] = df_users['platform'].fillna(mode_platform)
print(f"Filled missing 'platform' with mode: {mode_platform}")

print("\nMissing values after cleaning:\n", df_users.isnull().sum())
print("\nDataFrame Info after cleaning:")
df_users.info()


# --- Step 4: Feature Engineering ---
print("\n--- Step 4: Feature Engineering ---")
# Create a new column 'is_long_session' if session duration is >= 60 minutes
df_users['is_long_session'] = df_users['session_duration_min'] >= 60
print("Added 'is_long_session' column. First 5 rows:\n", df_users[['session_duration_min', 'is_long_session']].head())

# Create a new column 'engagement_score' based on a simple formula
# Longer session and feature X usage contributes more
df_users['engagement_score'] = (df_users['session_duration_min'] * 0.1) + (df_users['feature_X_used'].astype(int) * 20)
print("\nAdded 'engagement_score' column. First 5 rows:\n", df_users[['session_duration_min', 'feature_X_used', 'engagement_score']].head())


# --- Step 5: Basic Analysis ---
print("\n--- Step 5: Basic Analysis ---")

# Question 1: What is the average session duration for each platform?
avg_duration_by_platform = df_users.groupby('platform')['session_duration_min'].mean()
print("\nAverage Session Duration by Platform:\n", avg_duration_by_platform)

# Question 2: How many VIP users used Feature X?
vip_feature_X_users = df_users[(df_users['user_segment'] == 'VIP') & (df_users['feature_X_used'] == True)]
print(f"\nNumber of VIP users who used Feature X: {len(vip_feature_X_users)}")

# Question 3: What's the distribution of user segments for users with long sessions?
long_session_segments = df_users[df_users['is_long_session']]['user_segment'].value_counts()
print("\nUser Segment Distribution for Long Sessions:\n", long_session_segments)

# Question 4: Find the top 5 countries by average engagement score.
top_countries_engagement = df_users.groupby('country')['engagement_score'].mean().sort_values(ascending=False).head(5)
print("\nTop 5 Countries by Average Engagement Score:\n", top_countries_engagement)

# Question 5: What percentage of users used Feature X?
percentage_feature_X_used = (df_users['feature_X_used'].sum() / len(df_users)) * 100
print(f"\nPercentage of users who used Feature X: {percentage_feature_X_used:.2f}%")

# Save the cleaned and analyzed data to a new CSV
df_users.to_csv("analyzed_user_engagement.csv", index=False)
print("\nAnalyzed data saved to 'analyzed_user_engagement.csv'")

# Clean up dummy CSV files (optional)
# import os
# os.remove("sample_data.csv")
# os.remove("output_data.csv")

This mini-project demonstrates the typical flow: loading data (here, simulated), exploring it, handling imperfections, creating new insights (feature engineering), and finally answering specific questions through aggregation and filtering.

⏭️ Next Steps & Resources

Congratulations! You’ve taken your first significant steps into the world of Pandas. This guide has laid a solid foundation, but the journey of data analysis is vast and exciting.

Logical Next Steps:

  1. Practice, Practice, Practice: The best way to learn is by doing. Find small datasets online (Kaggle is a great resource) and try to replicate common analysis tasks.
  2. Deep Dive into groupby: The groupby function is incredibly versatile. Explore its various aggregation functions (min, max, median, std, etc.) and how to apply multiple aggregations at once.
  3. Visualization with Matplotlib/Seaborn: Pandas integrates beautifully with plotting libraries. Learn to create histograms, scatter plots, bar charts directly from your DataFrames to visualize your findings.
  4. Time Series Analysis: If you work with web analytics, stock data, or IoT, understanding Pandas’ time series capabilities is crucial.
  5. Merging and Joining DataFrames: Real-world data often comes from multiple sources. Mastering pd.merge() is essential for combining disparate datasets.
  6. Advanced apply, map, applymap: For more complex custom transformations (note that DataFrame.applymap has been deprecated in recent pandas versions in favor of DataFrame.map).
  7. Performance Considerations: As datasets grow, understanding how to write efficient Pandas code (avoiding explicit loops where possible) becomes important.
Recommended Resources:

  • Official Pandas Documentation:
    • Pandas User Guide: Comprehensive and well-organized. Start with “10 Minutes to Pandas.”
    • API Reference: Detailed documentation for every function and method.
  • Wes McKinney’s “Python for Data Analysis”: The definitive book by the creator of Pandas. It’s often referred to as the “Pandas Bible.”
  • Kaggle: A fantastic platform for data science competitions, but also a treasure trove of datasets and community notebooks (.ipynb files) where you can see Pandas in action.
  • Towards Data Science (Medium): Many excellent articles and tutorials on Pandas and data analysis.
  • Stack Overflow: Your go-to place for specific Pandas questions and troubleshooting.
  • YouTube Channels:
    • Corey Schafer: Excellent Pandas tutorial series.
    • Data School: Concise and practical Pandas tutorials.

Keep exploring, keep building, and soon you’ll be harnessing the full power of Pandas to turn raw data into valuable insights, just like you turn raw code into amazing web applications! Happy data wrangling!