Mastering Data Manipulation and Analysis: NumPy, Pandas, and Visualization for AI
Introduction
In the ever-evolving landscape of artificial intelligence and machine learning, the ability to effectively manipulate, analyze, and visualize data is not just a skill but a cornerstone for success. From the foundational steps of cleaning raw datasets to the sophisticated preparation required for training large language models (LLMs) or understanding agent performance, a deep understanding of data tools is paramount.
This textbook-style document is designed to be a comprehensive guide for anyone venturing into or expanding their expertise in data science, machine learning, and AI. Whether you are an absolute beginner taking your first steps into data manipulation or an experienced professional looking to optimize your data workflows for advanced AI applications, this guide will provide you with the necessary knowledge and practical skills.
We will embark on a journey through the powerful trio of Python libraries: NumPy for numerical operations, Pandas for robust data structuring and analysis, and Matplotlib/Seaborn for compelling data visualization. Each section builds upon the previous, starting with the fundamental concepts and progressively moving towards advanced techniques, performance considerations, and real-world applications, with a special emphasis on preparing data for machine learning models, including the unique challenges and opportunities presented by Large Language Models.
Our goal is to equip you with the expertise to not only handle diverse datasets with confidence but also to understand the nuances of data that drive intelligent systems. By the end of this guide, you will be proficient in transforming raw data into actionable insights and well-structured inputs for your most ambitious AI projects.
Chapter 1: Setting the Stage – Why Data Manipulation Matters
1.1 The Ubiquity of Data in the AI Era
In the age of AI, data is the new oil. Every interaction, every sensor reading, every piece of text, every image—all generate vast amounts of data. This data, however, is rarely in a pristine, immediately usable format. It’s often messy, incomplete, inconsistent, and requires significant effort to transform into a valuable resource for training intelligent systems.
1.2 The Role of Data in Machine Learning and Deep Learning
For any machine learning or deep learning model to perform effectively, it requires high-quality, relevant data.
- Training Data: Models learn patterns and relationships from the data they are trained on. The quality and representativeness of this data directly impact the model’s accuracy and generalization capabilities.
- Validation and Test Data: Separate datasets are crucial for evaluating model performance, tuning hyperparameters, and ensuring the model can perform well on unseen data.
- Feature Engineering: Raw data often needs to be transformed into meaningful features that models can understand and learn from. This process is a critical part of data manipulation.
1.3 Data Manipulation for Large Language Models (LLMs)
LLMs, while incredibly powerful, are no exception to the rule of needing well-prepared data. In fact, due to their scale and complexity, the quality and structure of their training and fine-tuning data are even more critical.
- Pre-training Data: Vast text corpora are required for initial pre-training. Data manipulation involves cleaning text, tokenization, handling special characters, and ensuring consistent formatting (a minimal cleaning sketch follows this list).
- Fine-tuning Data: For domain-specific tasks or instruction following, LLMs are fine-tuned on smaller, highly curated datasets. This involves:
- Instruction-Response Pairs: Creating structured datasets where each entry contains an instruction and the desired response from the LLM.
- Reinforcement Learning from Human Feedback (RLHF): Data for RLHF often involves human preferences, requiring careful aggregation and structuring of rankings or comparisons.
- Prompt Engineering Data: Even when using pre-trained LLMs, the input prompts themselves can be manipulated and optimized for better output.
- Analyzing LLM Outputs and Agent Performance: Understanding how LLMs behave, identifying biases, and evaluating their responses (especially in multi-turn conversations or agentic workflows) requires sophisticated data analysis techniques. This includes metrics extraction, sentiment analysis, and pattern recognition in generated text.
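Although this chapter is mostly conceptual, the cleaning step mentioned under pre-training data can be made concrete with a small, hedged sketch. The column name text and the specific cleaning rules below are illustrative assumptions, not a prescribed pipeline; real corpora require far more careful processing.
import pandas as pd
# A tiny, illustrative corpus; real pre-training data would be vastly larger.
corpus = pd.DataFrame({"text": ["  Hello <b>World</b>!!  ", "A second   line\twith odd spacing", None]})
cleaned = (
    corpus["text"]
    .dropna()                                 # drop empty documents
    .str.replace(r"<[^>]+>", "", regex=True)  # strip simple HTML tags
    .str.replace(r"\s+", " ", regex=True)     # collapse runs of whitespace
    .str.strip()                              # trim leading/trailing spaces
)
print(cleaned.tolist())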
1.4 Introducing Our Toolkit: NumPy, Pandas, Matplotlib, and Seaborn
Together, these libraries form the backbone of data manipulation, analysis, and visualization in Python:
- NumPy (Numerical Python): The fundamental package for numerical computation in Python. It provides support for large, multi-dimensional arrays and matrices, along with a collection of high-level mathematical functions to operate on these arrays. It’s the underlying engine for many other data science libraries, including Pandas.
- Pandas (Python Data Analysis Library): Built on top of NumPy, Pandas introduces two fundamental data structures: Series (1D labeled arrays) and DataFrame (2D labeled tables). It provides powerful, flexible, and easy-to-use data structures and data analysis tools for handling tabular data.
- Matplotlib and Seaborn: These libraries are essential for data visualization.
- Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python. It offers fine-grained control over every aspect of a plot.
- Seaborn: Built on Matplotlib, Seaborn provides a high-level interface for drawing attractive and informative statistical graphics. It simplifies the creation of complex visualizations and often produces aesthetically pleasing plots with less code.
This chapter sets the foundation for understanding the critical role of data manipulation and analysis in the current AI landscape, particularly highlighting its importance for LLMs. The subsequent chapters will dive deep into each of these powerful tools, equipping you with the practical skills to excel in your data-driven endeavors.
Chapter 2: NumPy – The Foundation of Numerical Computing
2.1 What is NumPy?
NumPy, short for Numerical Python, is the fundamental package for scientific computing in Python. It provides a high-performance multidimensional array object, and tools for working with these arrays. It’s the building block upon which many other data science libraries, including Pandas, are constructed. The core advantage of NumPy arrays over standard Python lists is their efficiency: they are stored more compactly, accessed faster, and are more convenient and efficient for large-scale numerical computations.
2.2 NumPy Arrays: The ndarray Object
The central object in NumPy is the ndarray (n-dimensional array), which is a grid of values, all of the same type, and is indexed by a tuple of non-negative integers.
2.2.1 Creating NumPy Arrays
From Python Lists
The simplest way to create an ndarray is by converting a Python list or list of lists.
import numpy as np
# 1-dimensional array
arr1d = np.array([1, 2, 3, 4, 5])
print("1D Array:", arr1d)
print("Type:", type(arr1d))
print("Shape:", arr1d.shape)
print("Dimension (ndim):", arr1d.ndim)
print("Data Type (dtype):", arr1d.dtype)
# 2-dimensional array (matrix)
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print("\n2D Array:\n", arr2d)
print("Shape:", arr2d.shape)
print("Dimension (ndim):", arr2d.ndim)
Using Built-in NumPy Functions
NumPy provides various functions to create arrays with initial placeholder content.
- np.zeros(): Creates an array filled with zeros.
- np.ones(): Creates an array filled with ones.
- np.full(): Creates an array filled with a specified value.
- np.empty(): Creates an array without initializing its entries (it can contain garbage values, but is faster).
- np.arange(): Similar to Python's range(), creates an array with evenly spaced values within a given interval.
- np.linspace(): Creates an array with a specified number of evenly spaced values over a specified interval.
- np.eye() / np.identity(): Creates an identity matrix.
# Array of zeros
zeros_arr = np.zeros((2, 3))
print("\nZeros Array:\n", zeros_arr)
# Array of ones
ones_arr = np.ones((3, 2))
print("\nOnes Array:\n", ones_arr)
# Array filled with a specific value
full_arr = np.full((2, 2), 7)
print("\nFull Array (with 7s):\n", full_arr)
# Array with a range of values
range_arr = np.arange(0, 10, 2) # start, stop (exclusive), step
print("\nArange Array:", range_arr)
# Array with linearly spaced values
linspace_arr = np.linspace(0, 1, 5) # start, stop (inclusive), num_elements
print("\nLinspace Array:", linspace_arr)
# Identity matrix
identity_arr = np.eye(3)
print("\nIdentity Array:\n", identity_arr)
2.2.2 Array Attributes
Key attributes of ndarray objects include:
- ndim: The number of dimensions (axes) of the array.
- shape: A tuple indicating the size of the array in each dimension.
- size: The total number of elements in the array.
- dtype: The data type of the elements in the array.
- itemsize: The size in bytes of each element of the array.
- data: The buffer containing the actual elements of the array.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("\nArray Attributes:")
print("Number of dimensions (ndim):", arr.ndim)
print("Shape (rows, columns):", arr.shape)
print("Total elements (size):", arr.size)
print("Data type of elements (dtype):", arr.dtype)
print("Size of each element in bytes (itemsize):", arr.itemsize)
2.2.3 Data Types (dtype)
NumPy supports a much wider range of numerical data types than Python’s built-in types. This is crucial for optimizing memory usage and performance. Common dtypes include:
- int8, int16, int32, int64: Signed integers of various sizes.
- uint8, uint16, uint32, uint64: Unsigned integers.
- float16, float32, float64: Floating-point numbers. float64 is also known as double.
- bool: Boolean values (True/False).
- complex64, complex128: Complex numbers.
You can specify the dtype when creating an array:
int_arr = np.array([1, 2, 3], dtype=np.int32)
print("\nInteger Array (int32):", int_arr, int_arr.dtype)
float_arr = np.array([1, 2, 3], dtype=np.float64)
print("Float Array (float64):", float_arr, float_arr.dtype)
2.3 Array Indexing and Slicing
Accessing elements or subsets of arrays is fundamental. NumPy offers powerful and flexible indexing methods.
2.3.1 Basic Indexing
- 1D Arrays: Similar to Python lists.
- 2D Arrays: Use arr[row_index, column_index] or arr[row_index][column_index].
arr = np.array([10, 20, 30, 40, 50])
print("\nOriginal 1D array:", arr)
print("Element at index 0:", arr[0])
print("Element at index -1 (last):", arr[-1])
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nOriginal 2D array:\n", arr2d)
print("Element at (0, 0):", arr2d[0, 0])
print("Element at (1, 2):", arr2d[1, 2])
2.3.2 Slicing
Slicing allows you to extract subarrays. The syntax is start:stop:step.
arr = np.array([10, 20, 30, 40, 50, 60, 70])
print("\nOriginal 1D array for slicing:", arr)
print("Elements from index 1 to 4 (exclusive):", arr[1:5])
print("Elements from start to index 3 (exclusive):", arr[:4])
print("Elements from index 2 to end:", arr[2:])
print("Every other element:", arr[::2])
print("Reversed array:", arr[::-1])
arr2d = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
print("\nOriginal 2D array for slicing:\n", arr2d)
# All rows, columns 1 to 3 (exclusive)
print("All rows, columns 1 and 2:\n", arr2d[:, 1:3])
# Rows 0 to 1 (exclusive), all columns
print("Row 0, all columns:\n", arr2d[0:1, :]) # Note: [0:1] keeps it 2D
print("Row 0 (as 1D array):\n", arr2d[0, :])
# Subarray from row 1, col 0 to row 2, col 1
print("Subarray from [1,0] to [2,1]:\n", arr2d[1:3, 0:2])
2.3.3 Integer Array Indexing (Fancy Indexing)
Fancy indexing allows you to select arbitrary rows and columns using arrays of indices.
arr = np.array([100, 200, 300, 400, 500])
indices = np.array([0, 2, 4])
print("\nOriginal 1D array:", arr)
print("Elements at specific indices (0, 2, 4):", arr[indices])
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("\nOriginal 2D array:\n", arr2d)
# Select rows 0 and 2
print("Rows 0 and 2:\n", arr2d[[0, 2]])
# Select specific elements at (0,0), (1,1), (2,0)
print("Specific elements using fancy indexing:", arr2d[[0, 1, 2], [0, 1, 0]])
2.3.4 Boolean Array Indexing
This is extremely powerful for filtering data based on conditions.
arr = np.array([1, 5, 2, 8, 3, 9, 4])
print("\nOriginal 1D array:", arr)
# Select elements greater than 4
bool_mask = (arr > 4)
print("Boolean mask (arr > 4):", bool_mask)
print("Elements greater than 4:", arr[bool_mask])
arr2d = np.array([[10, 20, 30],
[40, 50, 60],
[70, 80, 90]])
print("\nOriginal 2D array:\n", arr2d)
# Select elements greater than 50
print("Elements greater than 50:\n", arr2d[arr2d > 50])
2.4 Array Manipulation
NumPy offers a rich set of functions for manipulating array shapes and elements.
2.4.1 Reshaping Arrays
Changing the shape of an array without changing its data.
- reshape(): Gives a new shape to an array without changing its data.
- flatten() / ravel(): Returns a flattened 1D array. ravel() returns a view when possible, while flatten() always returns a copy.
arr = np.arange(1, 10)
print("\nOriginal array:", arr)
reshaped_arr = arr.reshape((3, 3))
print("Reshaped to 3x3:\n", reshaped_arr)
# Using -1 to automatically infer a dimension
reshaped_arr_auto = arr.reshape((3, -1))
print("Reshaped to 3x? (auto infer):\n", reshaped_arr_auto)
flattened_arr = reshaped_arr.flatten()
print("Flattened array:", flattened_arr)
2.4.2 Transposing Arrays
Reversing or permuting the axes of an array.
transpose() or the .T attribute: Permutes the dimensions of the array.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("\nOriginal 2D array:\n", arr)
transposed_arr = arr.T
print("Transposed array:\n", transposed_arr)
2.4.3 Stacking and Splitting Arrays
- np.concatenate(): Joins a sequence of arrays along an existing axis.
- np.vstack() (vertical stack): Stacks arrays in sequence vertically (row-wise).
- np.hstack() (horizontal stack): Stacks arrays in sequence horizontally (column-wise).
- np.split(): Splits an array into multiple sub-arrays.
- np.vsplit() / np.hsplit(): Convenience functions for splitting vertically or horizontally.
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print("\nArray 1:", arr1)
print("Array 2:", arr2)
# Concatenate 1D arrays
concatenated_1d = np.concatenate((arr1, arr2))
print("Concatenated 1D:", concatenated_1d)
mat1 = np.array([[1, 2], [3, 4]])
mat2 = np.array([[5, 6], [7, 8]])
print("\nMatrix 1:\n", mat1)
print("Matrix 2:\n", mat2)
# Vertical Stack
vstacked = np.vstack((mat1, mat2))
print("Vertically stacked:\n", vstacked)
# Horizontal Stack
hstacked = np.hstack((mat1, mat2))
print("Horizontally stacked:\n", hstacked)
# Splitting arrays
arr_to_split = np.arange(12).reshape(3, 4)
print("\nArray to split:\n", arr_to_split)
h_split = np.hsplit(arr_to_split, 2) # Split into 2 equal parts along the column axis
print("Horizontal split (2 parts):\n", h_split)
v_split = np.vsplit(arr_to_split, 3) # Split into 3 equal parts along the row axis
print("Vertical split (3 parts):\n", v_split)
2.4.4 Adding/Removing Dimensions
- np.newaxis: Increases the dimension of an existing array by one, at a specified position.
- np.squeeze(): Removes single-dimensional entries from the shape of an array.
arr = np.array([1, 2, 3])
print("\nOriginal 1D array:", arr, "Shape:", arr.shape)
# Add a new axis (make it a row vector)
row_vec = arr[np.newaxis, :]
print("As row vector:", row_vec, "Shape:", row_vec.shape)
# Add a new axis (make it a column vector)
col_vec = arr[:, np.newaxis]
print("As column vector:\n", col_vec, "Shape:", col_vec.shape)
squeezed_arr = np.squeeze(row_vec)
print("Squeezed row vector:", squeezed_arr, "Shape:", squeezed_arr.shape)
2.5 Universal Functions (ufuncs)
NumPy’s power comes from its “vectorized” operations, performed by universal functions (ufuncs). These functions operate element-wise on arrays, making operations significantly faster than traditional Python loops.
2.5.1 Element-wise Operations
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
print("\nElement-wise addition:", arr1 + arr2)
print("Element-wise multiplication:", arr1 * arr2)
print("Element-wise division:", arr1 / arr2)
print("Element-wise exponentiation:", arr1 ** 2)
print("Square root of arr2:", np.sqrt(arr2))
print("Sine of arr1:", np.sin(arr1))
2.5.2 Broadcasting
Broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes.
arr = np.array([[1, 2, 3], [4, 5, 6]])
scalar = 10
print("\nOriginal array:\n", arr)
print("Array + scalar:\n", arr + scalar) # Scalar is broadcast across the array
vec = np.array([100, 200, 300]) # shape (3,)
print("\nOriginal array:\n", arr)
print("Vector for broadcasting:", vec)
# vec (100, 200, 300) is broadcast across each row of arr
print("Array + vector:\n", arr + vec)
# Example with explicit newaxis for broadcasting clarity
col_vec = np.array([[10], [20]]) # shape (2,1)
print("\nOriginal array:\n", arr)
print("Column vector for broadcasting:\n", col_vec)
# col_vec (2,1) is broadcast across columns of arr (2,3)
print("Array + column vector:\n", arr + col_vec)
Broadcasting Rules:
- If the arrays do not have the same number of dimensions, the shape of the smaller array is padded with ones on its left side.
- If the shape of the two arrays does not match in any dimension, the array with shape equal to 1 in that dimension is stretched to match the other shape.
- If in any dimension, the sizes disagree and neither is 1, an error is raised.
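These rules can be checked programmatically. As a brief sketch, np.broadcast_shapes (available in recent NumPy versions) reports the resulting shape, or raises an error when shapes are incompatible:
print(np.broadcast_shapes((2, 3), (3,)))    # (2, 3): (3,) is padded to (1, 3), then stretched
print(np.broadcast_shapes((2, 1), (2, 3)))  # (2, 3): the size-1 dimension is stretched
try:
    np.broadcast_shapes((2, 3), (2, 2))     # incompatible: 3 vs 2, and neither is 1
except ValueError as e:
    print("Incompatible shapes:", e)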
2.6 Linear Algebra Operations
NumPy is excellent for linear algebra.
- np.dot() / the @ operator: Dot product of two arrays. For 2D arrays, this performs matrix multiplication.
- np.linalg.inv(): Inverse of a matrix.
- np.linalg.det(): Determinant of a matrix.
- np.linalg.eig(): Eigenvalues and eigenvectors.
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
print("\nMatrix A:\n", A)
print("Matrix B:\n", B)
# Matrix multiplication
print("Matrix multiplication (A @ B):\n", A @ B)
print("Matrix multiplication (np.dot(A, B)):\n", np.dot(A, B))
# Inverse of A
try:
A_inv = np.linalg.inv(A)
print("Inverse of A:\n", A_inv)
print("A dot A_inv:\n", A @ A_inv) # Should be identity matrix
except np.linalg.LinAlgError as e:
print(f"Error calculating inverse: {e}")
# Determinant of A
print("Determinant of A:", np.linalg.det(A))
2.7 Aggregations
NumPy provides functions to compute statistical aggregations on arrays.
sum(), min(), max(), mean(), std(), var(): These can be applied to the entire array or along specific axes.
arr = np.array([[1, 2, 3], [4, 5, 6]])
print("\nOriginal array:\n", arr)
print("Sum of all elements:", np.sum(arr))
print("Minimum of all elements:", np.min(arr))
print("Mean of all elements:", np.mean(arr))
print("\nSum along axis 0 (columns sum down):\n", np.sum(arr, axis=0))
print("Mean along axis 1 (rows mean across):\n", np.mean(arr, axis=1))
print("Standard deviation of all elements:", np.std(arr))
2.8 Random Number Generation
NumPy’s random module is essential for simulations, data generation, and machine learning.
- np.random.rand(): Samples from a uniform distribution over [0, 1).
- np.random.randn(): Samples from the standard normal (Gaussian) distribution.
- np.random.randint(): Returns random integers from low (inclusive) to high (exclusive).
- np.random.normal(): Samples from a normal (Gaussian) distribution with a specified mean and standard deviation.
- np.random.uniform(): Samples from a uniform distribution over a specified interval.
- np.random.seed(): Ensures reproducibility of random numbers.
np.random.seed(42) # For reproducibility
print("\nRandom numbers (uniform in [0,1)):", np.random.rand(3))
print("Random numbers (standard normal):", np.random.randn(2, 2))
print("Random integers (0-9, size 5):", np.random.randint(0, 10, size=5))
print("Random numbers (normal, mean=5, std=1, size=3):", np.random.normal(5, 1, size=3))
2.9 Performance Considerations: Why NumPy is Fast
NumPy achieves its performance through several key mechanisms:
- C/Fortran Implementation: The core of NumPy is implemented in C and Fortran, highly optimized languages. This allows operations to be executed much faster than equivalent pure Python code.
- Contiguous Memory Allocation: NumPy arrays store elements of the same data type in contiguous blocks of memory. This allows for efficient caching and vectorized operations.
- Vectorization (Ufuncs): As discussed, ufuncs perform operations on entire arrays without explicit Python loops, leveraging optimized C code for significant speedups.
- Broadcasting: While conceptually simplifying code, broadcasting also allows operations on arrays of different shapes without creating unnecessary copies of data, saving memory and time.
Illustrative Example: Performance Comparison
import time
# Pure Python list sum
list_data = list(range(10_000_000))
start_time = time.time()
list_sum = sum(list_data)
end_time = time.time()
print(f"\nPython list sum took: {end_time - start_time:.6f} seconds")
# NumPy array sum
numpy_data = np.arange(10_000_000)
start_time = time.time()
numpy_sum = np.sum(numpy_data)
end_time = time.time()
print(f"NumPy array sum took: {end_time - start_time:.6f} seconds")
This simple example clearly demonstrates the substantial performance benefits of using NumPy for large datasets.
2.10 Use Cases in Data Science and ML
NumPy is indispensable across various stages of data science and machine learning workflows:
- Data Preprocessing: Scaling features, handling missing values, and encoding categorical data can all be done efficiently with NumPy (see the sketch after this list).
- Feature Engineering: Creating new features from existing ones often involves mathematical operations on arrays.
- Machine Learning Algorithms: Many ML algorithms (e.g., linear regression, logistic regression, k-means, neural networks) internally rely heavily on NumPy for matrix operations, gradient calculations, and numerical optimization.
- Image Processing: Images are represented as multi-dimensional NumPy arrays, making NumPy ideal for operations like resizing, filtering, and color adjustments.
- Scientific Simulations: Numerical simulations in physics, engineering, and other fields extensively use NumPy for array-based computations.
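As a minimal sketch of the preprocessing point above, standardizing a feature matrix (zero mean, unit variance per column) takes only a couple of vectorized operations; the synthetic data here is purely illustrative.
X = np.random.rand(100, 3) * [10, 100, 1000]  # synthetic features on very different scales
X_standardized = (X - X.mean(axis=0)) / X.std(axis=0)  # broadcasting handles the per-column stats
print("Column means after scaling (≈0):", X_standardized.mean(axis=0).round(3))
print("Column stds after scaling (≈1):", X_standardized.std(axis=0).round(3))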
Chapter 3: Pandas – The Data Wrangler’s Workbench
3.1 What is Pandas?
Pandas is an open-source Python library built on top of NumPy, designed for data manipulation and analysis. It introduces two primary data structures, Series and DataFrame, which provide powerful, flexible, and easy-to-use tools for working with structured (tabular) data, much like spreadsheets or SQL tables. Pandas excels at handling missing data, alignment of data by labels, joining and merging datasets, and much more, making it a cornerstone for almost any data science project.
3.2 Pandas Data Structures
3.2.1 Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It’s essentially a column in a spreadsheet or a single vector with an index.
Creating a Series
import pandas as pd
import numpy as np
# From a list
s1 = pd.Series([1, 2, 3, 4, 5])
print("Series from list:\n", s1)
print("Index:", s1.index)
print("Values:", s1.values)
# From a list with a custom index
s2 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print("\nSeries with custom index:\n", s2)
# From a dictionary
data = {'apple': 100, 'banana': 150, 'cherry': 200}
s3 = pd.Series(data)
print("\nSeries from dictionary:\n", s3)
# From a scalar value (index must be provided)
s4 = pd.Series(5, index=['x', 'y', 'z'])
print("\nSeries from scalar:\n", s4)
3.2.2 DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects. It is the most commonly used Pandas object.
Creating a DataFrame
# From a dictionary of lists/arrays
data = {
'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'City': ['New York', 'London', 'Paris', 'Tokyo']
}
df1 = pd.DataFrame(data)
print("DataFrame from dictionary of lists:\n", df1)
print("Columns:", df1.columns)
print("Index:", df1.index)
print("Shape:", df1.shape)
print("Data Types:\n", df1.dtypes)
# From a list of dictionaries
data_list_dict = [
{'Name': 'Eve', 'Age': 22, 'City': 'Rome'},
{'Name': 'Frank', 'Age': 28, 'City': 'Berlin'}
]
df2 = pd.DataFrame(data_list_dict)
print("\nDataFrame from list of dictionaries:\n", df2)
# From a NumPy array (with optional columns and index)
numpy_data = np.random.rand(4, 3)
df3 = pd.DataFrame(numpy_data, columns=['ColA', 'ColB', 'ColC'], index=['r1', 'r2', 'r3', 'r4'])
print("\nDataFrame from NumPy array:\n", df3)
3.3 Data Loading and Inspection
Before manipulation, data needs to be loaded and understood.
3.3.1 Reading Data from Files
Pandas provides functions to read various file formats:
- pd.read_csv(): For comma-separated values (CSV) files.
- pd.read_excel(): For Excel files.
- pd.read_sql(): For SQL queries or database tables.
- pd.read_json(): For JSON files.
- pd.read_html(): For HTML tables.
# Example: Creating a dummy CSV file to demonstrate reading
csv_data = """id,name,age,city,salary
1,Alice,25,New York,60000
2,Bob,30,London,75000
3,Charlie,NaN,Paris,90000
4,David,40,Tokyo,NaN
5,Eve,28,Berlin,65000
"""
with open("sample_data.csv", "w") as f:
f.write(csv_data)
# Reading a CSV file
df_csv = pd.read_csv("sample_data.csv")
print("\nDataFrame loaded from CSV:\n", df_csv)
# To clean up the dummy file
import os
os.remove("sample_data.csv")
3.3.2 Basic Inspection
- df.head(n): Returns the first n rows.
- df.tail(n): Returns the last n rows.
- df.info(): Provides a concise summary of the DataFrame, including data types, non-null counts, and memory usage.
- df.describe(): Generates descriptive statistics of numerical columns (count, mean, std, min, max, quartiles).
- df.shape: Returns a tuple representing the dimensionality of the DataFrame (rows, columns).
- df.columns: Returns the column labels of the DataFrame.
- df.index: Returns the index (row labels) of the DataFrame.
- df.values: Returns a NumPy representation of the DataFrame.
print("\n--- DataFrame Inspection (df_csv) ---")
print("Head (first 3 rows):\n", df_csv.head(3))
print("\nTail (last 2 rows):\n", df_csv.tail(2))
print("\nDataFrame Info:")
df_csv.info()
print("\nDescriptive Statistics:\n", df_csv.describe())
print("\nShape:", df_csv.shape)
print("Columns:", df_csv.columns)
print("Index:", df_csv.index)
3.4 Selecting and Filtering Data
Accessing specific subsets of your DataFrame is a core task.
3.4.1 Column Selection
# Select a single column (returns a Series)
ages = df_csv['age']
print("\n'age' column (Series):\n", ages)
print("Type of 'age' column:", type(ages))
# Select multiple columns (returns a DataFrame)
name_city = df_csv[['name', 'city']]
print("\n'name' and 'city' columns (DataFrame):\n", name_city)
print("Type of ['name', 'city'] columns:", type(name_city))
3.4.2 Row Selection (Indexing)
Pandas offers two primary ways to select rows by index:
- .loc[]: Label-based indexing. Selects by label (the name of the index/column).
- .iloc[]: Integer-location based indexing. Selects by position (0-based integer).
# Selecting rows by label using .loc
df_loc_indexed = df_csv.set_index('name') # Set 'name' as index for this example
print("\nDataFrame with 'name' as index:\n", df_loc_indexed)
print("\nRow 'Alice' using .loc:\n", df_loc_indexed.loc['Alice'])
print("\nRows 'Alice' and 'David' using .loc:\n", df_loc_indexed.loc[['Alice', 'David']])
print("\nRows 'Alice' to 'Charlie' (inclusive) using .loc:\n", df_loc_indexed.loc['Alice':'Charlie'])
# Selecting rows by integer position using .iloc
print("\nRow at integer position 0 using .iloc:\n", df_csv.iloc[0])
print("\nRows at integer positions 1 and 3 using .iloc:\n", df_csv.iloc[[1, 3]])
print("\nRows from integer position 0 to 2 (exclusive) using .iloc:\n", df_csv.iloc[0:2])
3.4.3 Combined Row and Column Selection
# Using .loc with both row and column labels
# Select 'age' and 'city' for 'Alice'
print("\n'age' and 'city' for 'Alice' using .loc:", df_loc_indexed.loc['Alice', ['age', 'city']])
# Select all columns for 'Bob' to 'Eve'
print("\nAll columns for 'Bob' to 'Eve' using .loc:\n", df_loc_indexed.loc['Bob':'Eve', :])
# Using .iloc with both row and column integer positions
# Select row 0, column 1 (name)
print("\nRow 0, column 1 (name) using .iloc:", df_csv.iloc[0, 1])
# Select rows 0 and 1, columns 2 and 3 (age, city)
print("\nRows 0 and 1, columns 2 and 3 (age, city) using .iloc:\n", df_csv.iloc[[0, 1], [2, 3]])
3.4.4 Boolean Indexing (Filtering)
Filtering rows based on conditions is a common and powerful operation.
print("\nOriginal DataFrame:\n", df_csv)
# Select people older than 30
df_older_than_30 = df_csv[df_csv['age'] > 30]
print("\nPeople older than 30:\n", df_older_than_30)
# Select people from 'London' OR 'Berlin'
df_london_berlin = df_csv[(df_csv['city'] == 'London') | (df_csv['city'] == 'Berlin')]
print("\nPeople from London or Berlin:\n", df_london_berlin)
# Select people with age > 25 AND city is 'New York'
df_filtered_complex = df_csv[(df_csv['age'] > 25) & (df_csv['city'] == 'New York')]
print("\nPeople > 25 from New York:\n", df_filtered_complex)
# Using .isin() for multiple values
df_specific_cities = df_csv[df_csv['city'].isin(['New York', 'Tokyo'])]
print("\nPeople from New York or Tokyo using .isin():\n", df_specific_cities)
3.5 Handling Missing Data
Missing values (often represented as NaN - Not a Number in Pandas) are a common challenge. Pandas provides robust tools to deal with them.
3.5.1 Identifying Missing Data
- df.isnull() / df.isna(): Returns a boolean DataFrame indicating where values are missing.
- df.notnull() / df.notna(): The inverse of isnull().
- df.isnull().sum(): Counts missing values per column.
- df.isnull().sum().sum(): Counts the total missing values in the DataFrame.
print("\nDataFrame with NaNs:\n", df_csv)
print("\nIs Null (Boolean DataFrame):\n", df_csv.isnull())
print("\nMissing values per column:\n", df_csv.isnull().sum())
print("\nTotal missing values:", df_csv.isnull().sum().sum())
3.5.2 Dropping Missing Data
df.dropna(): Removes rows or columns containing missing values.
df_dropna_rows = df_csv.dropna() # Drops rows with ANY NaN
print("\nDataFrame after dropping rows with ANY NaN:\n", df_dropna_rows)
df_dropna_all = df_csv.dropna(how='all') # Drops rows where ALL values are NaN
print("\nDataFrame after dropping rows with ALL NaN (no change here):\n", df_dropna_all)
df_dropna_columns = df_csv.dropna(axis=1) # Drops columns with ANY NaN
print("\nDataFrame after dropping columns with ANY NaN:\n", df_dropna_columns)
df_dropna_threshold = df_csv.dropna(thresh=4) # Keep rows with at least 4 non-NaN values
print("\nDataFrame after dropping rows with less than 4 non-NaNs:\n", df_dropna_threshold)
3.5.3 Filling Missing Data
df.fillna(): Fills missing values with a specified value or method.
# Fill with a specific value
df_filled_value = df_csv.fillna(0)
print("\nDataFrame after filling NaNs with 0:\n", df_filled_value)
# Fill 'age' with its mean, 'salary' with its median
df_filled_mean_median = df_csv.copy() # Work on a copy
mean_age = df_filled_mean_median['age'].mean()
median_salary = df_filled_mean_median['salary'].median()
df_filled_mean_median['age'] = df_filled_mean_median['age'].fillna(mean_age)
df_filled_mean_median['salary'] = df_filled_mean_median['salary'].fillna(median_salary)
print(f"\nMean age: {mean_age:.2f}, Median salary: {median_salary:.2f}")
print("DataFrame after filling 'age' with mean, 'salary' with median:\n", df_filled_mean_median)
# Forward fill (ffill) - propagates the last valid observation forward
# Note: fillna(method='ffill') is deprecated in recent pandas; use .ffill() / .bfill() instead
df_ffill = df_csv.ffill()
print("\nDataFrame after forward fill:\n", df_ffill)
# Backward fill (bfill) - propagates the next valid observation backward to fill earlier gaps
df_bfill = df_csv.bfill()
print("\nDataFrame after backward fill:\n", df_bfill)
3.6 Data Cleaning and Transformation
3.6.1 Renaming Columns
df.rename(): Changes column or index labels.
df_renamed = df_csv.rename(columns={'id': 'employee_id', 'city': 'location'})
print("\nDataFrame after renaming columns:\n", df_renamed)
3.6.2 Changing Data Types
df.astype(): Casts a Pandas object to a specified dtype.
df_dtypes = pd.DataFrame({'A': [1, 2, 3], 'B': ['4', '5', '6']})
print("\nOriginal dtypes:\n", df_dtypes.dtypes)
df_dtypes['B'] = df_dtypes['B'].astype(int)
print("\nNew dtypes after converting 'B' to int:\n", df_dtypes.dtypes)
# Converting to datetime
df_dates = pd.DataFrame({'date_str': ['2023-01-01', '2023-01-02', '2023-01-03']})
df_dates['date'] = pd.to_datetime(df_dates['date_str'])
print("\nDataFrame with datetime column:\n", df_dates)
print("Datetime column type:", df_dates['date'].dtype)
3.6.3 Applying Functions (.apply(), .map(), .applymap())
These methods allow applying custom functions for more complex transformations.
- Series.apply(): Apply a function to each element of a Series.
- DataFrame.apply(): Apply a function along an axis of the DataFrame (row-wise or column-wise).
- Series.map(): Map values of a Series according to an input dict or Series.
- DataFrame.applymap(): Apply a function to each element of a DataFrame (element-wise). Note: applymap() is deprecated in recent pandas versions in favor of the element-wise DataFrame.map(); a short sketch of the replacement follows the example below.
# Series.apply() - to create a new column 'salary_usd' from 'salary'
df_csv['salary_usd'] = df_csv['salary'].apply(lambda x: x / 1.2 if not pd.isna(x) else x)
print("\nDataFrame with 'salary_usd' column (using Series.apply):\n", df_csv)
# DataFrame.apply() - column-wise
# Normalize numerical columns
df_numerical = df_csv[['age', 'salary_usd']].copy()
min_max_scaled = df_numerical.apply(lambda x: (x - x.min()) / (x.max() - x.min()))
print("\nMin-Max Scaled age and salary_usd (using DataFrame.apply):\n", min_max_scaled)
# Series.map() - to categorize cities
city_mapping = {'New York': 'North America', 'London': 'Europe', 'Paris': 'Europe', 'Tokyo': 'Asia', 'Berlin': 'Europe'}
df_csv['continent'] = df_csv['city'].map(city_mapping)
print("\nDataFrame with 'continent' column (using Series.map):\n", df_csv)
3.6.4 Creating New Columns
# Simple arithmetic
df_csv['bonus'] = df_csv['salary'] * 0.10
print("\nDataFrame with 'bonus' column:\n", df_csv)
# Conditional column creation using np.where
df_csv['seniority'] = np.where(df_csv['age'] >= 35, 'Senior', 'Junior')
print("\nDataFrame with 'seniority' column (using np.where):\n", df_csv)
3.7 Grouping and Aggregation
Grouping data and applying aggregation functions is crucial for summarizing data.
- df.groupby(): Groups a DataFrame using a mapper or by a Series of columns.
- Aggregation functions: sum(), mean(), median(), min(), max(), count(), std(), var(), first(), last(), size(), nunique().
print("\nOriginal DataFrame for grouping:\n", df_csv)
# Group by 'city' and calculate mean age and salary
grouped_by_city = df_csv.groupby('city').agg({'age': 'mean', 'salary': 'mean'})
print("\nMean age and salary grouped by city:\n", grouped_by_city)
# Group by 'continent' and 'seniority', count employees
grouped_multi = df_csv.groupby(['continent', 'seniority']).size().reset_index(name='count')
print("\nEmployee count by continent and seniority:\n", grouped_multi)
# Using multiple aggregation functions
grouped_stats = df_csv.groupby('city')['salary'].agg(['min', 'max', 'mean', 'std'])
print("\nSalary min, max, mean, std by city:\n", grouped_stats)
3.8 Merging, Joining, and Concatenating DataFrames
Combining multiple DataFrames is a very common task.
3.8.1 Concatenating
pd.concat(): Stacks DataFrames either vertically (axis=0, default) or horizontally (axis=1).
df_part1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df_part2 = pd.DataFrame({'id': [3, 4], 'value': [30, 40]})
print("\nDataFrame Part 1:\n", df_part1)
print("DataFrame Part 2:\n", df_part2)
concatenated_df_rows = pd.concat([df_part1, df_part2], ignore_index=True) # Reset index
print("\nConcatenated (rows):\n", concatenated_df_rows)
df_left = pd.DataFrame({'id': [1, 2], 'A': ['a1', 'a2']})
df_right = pd.DataFrame({'B': ['b1', 'b2'], 'C': ['c1', 'c2']})
print("\nDataFrame Left:\n", df_left)
print("DataFrame Right:\n", df_right)
concatenated_df_cols = pd.concat([df_left, df_right], axis=1)
print("\nConcatenated (columns):\n", concatenated_df_cols)
3.8.2 Merging
- pd.merge(): Combines DataFrames based on common columns or indices, similar to SQL JOINs.
- how='inner' (default): Only include keys found in both DataFrames.
- how='left': Include all keys from the left DataFrame.
- how='right': Include all keys from the right DataFrame.
- how='outer': Include all keys from both DataFrames.
df_employees = pd.DataFrame({
'employee_id': [1, 2, 3, 4],
'name': ['Alice', 'Bob', 'Charlie', 'David'],
'department_id': [101, 102, 101, 103]
})
df_departments = pd.DataFrame({
'department_id': [101, 102, 104],
'department_name': ['HR', 'Engineering', 'Marketing']
})
print("\nEmployees DataFrame:\n", df_employees)
print("\nDepartments DataFrame:\n", df_departments)
# Inner merge (default)
merged_inner = pd.merge(df_employees, df_departments, on='department_id', how='inner')
print("\nInner Merge:\n", merged_inner)
# Left merge
merged_left = pd.merge(df_employees, df_departments, on='department_id', how='left')
print("\nLeft Merge:\n", merged_left)
# Right merge
merged_right = pd.merge(df_employees, df_departments, on='department_id', how='right')
print("\nRight Merge:\n", merged_right)
# Outer merge
merged_outer = pd.merge(df_employees, df_departments, on='department_id', how='outer')
print("\nOuter Merge:\n", merged_outer)
3.8.3 Joining (Index-based merge)
df.join(): Merges DataFrames based on their indexes (or a column from one DataFrame to the index of another).
df1_join = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=['K0', 'K1'])
df2_join = pd.DataFrame({'C': ['C0', 'C1'], 'D': ['D0', 'D1']}, index=['K0', 'K2'])
print("\nDataFrame 1 for Join:\n", df1_join)
print("DataFrame 2 for Join:\n", df2_join)
# Left join (default)
joined_df = df1_join.join(df2_join)
print("\nLeft Join (df1.join(df2)):\n", joined_df)
# Outer join
joined_outer_df = df1_join.join(df2_join, how='outer')
print("\nOuter Join:\n", joined_outer_df)
3.9 Time Series Functionality
Pandas has excellent support for time series data.
- pd.to_datetime(): Converts arguments to datetime objects.
- pd.date_range(): Creates a fixed-frequency DatetimeIndex.
- Resampling, rolling windows, and shifting for time-based transformations.
# Create a DataFrame with a DatetimeIndex
dates = pd.date_range(start='2023-01-01', periods=10, freq='D')
ts_data = pd.Series(np.random.randn(10), index=dates)
print("\nTime Series Data:\n", ts_data)
# Resampling (e.g., weekly mean)
weekly_mean = ts_data.resample('W').mean()
print("\nWeekly Mean:\n", weekly_mean)
# Rolling window (e.g., 3-day rolling mean)
rolling_mean = ts_data.rolling(window=3).mean()
print("\n3-day Rolling Mean:\n", rolling_mean)
# Shifting values
shifted_ts = ts_data.shift(1)
print("\nTime Series shifted by 1 day:\n", shifted_ts)
3.10 Advanced Pandas Techniques for AI/ML
3.10.1 Feature Engineering with Pandas
Pandas operations are fundamental for creating new features that can improve model performance.
df_fe = pd.DataFrame({
'user_id': [1, 1, 2, 2, 2, 3],
'product_id': ['A', 'B', 'A', 'C', 'B', 'A'],
'price': [10, 20, 10, 30, 25, 10],
'quantity': [1, 2, 1, 1, 2, 3],
'timestamp': pd.to_datetime(['2023-01-01 10:00', '2023-01-01 11:00',
'2023-01-02 09:00', '2023-01-02 10:00',
'2023-01-02 12:00', '2023-01-03 14:00'])
})
print("\nOriginal DataFrame for Feature Engineering:\n", df_fe)
# Feature: Total price per transaction
df_fe['total_price'] = df_fe['price'] * df_fe['quantity']
print("\nTotal Price per transaction:\n", df_fe[['user_id', 'product_id', 'total_price']])
# Feature: Number of unique products bought by each user
user_unique_products = df_fe.groupby('user_id')['product_id'].nunique().reset_index(name='num_unique_products')
df_fe = pd.merge(df_fe, user_unique_products, on='user_id', how='left')
print("\nNumber of unique products bought by each user:\n", df_fe[['user_id', 'num_unique_products']].drop_duplicates())
# Feature: Time since last purchase for each user (requires sorting)
df_fe = df_fe.sort_values(by=['user_id', 'timestamp'])
df_fe['time_since_last_purchase'] = df_fe.groupby('user_id')['timestamp'].diff().dt.total_seconds().fillna(0)
print("\nTime since last purchase (seconds):\n", df_fe[['user_id', 'timestamp', 'time_since_last_purchase']])
3.10.2 Preparing Data for LLM Fine-tuning
For LLM fine-tuning, data often needs to be in a specific structured format, such as instruction-response pairs, or a sequence of turns in a conversation. Pandas is excellent for preparing these structures.
Example: Converting conversational data to instruction-response format
Imagine a DataFrame of chat logs:
chat_logs = pd.DataFrame({
'conversation_id': [1, 1, 1, 2, 2],
'turn_id': [1, 2, 3, 1, 2],
'speaker': ['user', 'assistant', 'user', 'user', 'assistant'],
'text': [
"What is the capital of France?",
"The capital of France is Paris.",
"And Germany?",
"Tell me a joke.",
"Why don't scientists trust atoms? Because they make up everything!"
]
})
print("\nOriginal Chat Logs:\n", chat_logs)
# Goal: Transform into (instruction, response) pairs for LLM fine-tuning
# A simple approach for alternating user/assistant turns
def create_instruction_response(group):
instructions = []
responses = []
current_instruction = []
for _, row in group.iterrows():
if row['speaker'] == 'user':
current_instruction.append(row['text'])
elif row['speaker'] == 'assistant' and current_instruction:
instructions.append("\n".join(current_instruction))
responses.append(row['text'])
current_instruction = [] # Reset for next instruction
return pd.DataFrame({'instruction': instructions, 'response': responses})
llm_finetune_df = chat_logs.groupby('conversation_id').apply(create_instruction_response).reset_index(drop=True)
print("\nData for LLM Fine-tuning (Instruction-Response Pairs):\n", llm_finetune_df)
# Further processing: Adding a 'system' prompt
llm_finetune_df['full_prompt'] = "### Instruction:\n" + llm_finetune_df['instruction'] + \
"\n\n### Response:\n" + llm_finetune_df['response']
print("\nFull Prompt for LLM Fine-tuning:\n", llm_finetune_df['full_prompt'].iloc[0])
3.10.3 Analyzing Agent Performance Data
When dealing with AI agents, performance analysis often involves structured data that Pandas can efficiently process. This might include logs of agent actions, rewards, states, or conversational turns.
agent_performance_data = pd.DataFrame({
'episode_id': [1, 1, 2, 2, 2, 3],
'step': [1, 2, 1, 2, 3, 1],
'action': ['move_left', 'collect_item', 'move_right', 'attack', 'move_left', 'idle'],
'reward': [0, 10, 0, -5, 0, 0],
'status': ['ongoing', 'completed', 'ongoing', 'ongoing', 'failed', 'completed']
})
print("\nAgent Performance Data:\n", agent_performance_data)
# Calculate total reward per episode
episode_rewards = agent_performance_data.groupby('episode_id')['reward'].sum().reset_index(name='total_reward')
print("\nTotal Reward per Episode:\n", episode_rewards)
# Find episodes that ended in 'failed' status
failed_episodes = agent_performance_data[agent_performance_data['status'] == 'failed']['episode_id'].unique()
print("\nFailed Episodes:", failed_episodes)
# Analyze action distribution
action_distribution = agent_performance_data['action'].value_counts(normalize=True).reset_index(name='proportion')
print("\nAction Distribution:\n", action_distribution)
3.11 Performance Tips with Pandas
While Pandas is powerful, inefficient operations can slow down your workflow, especially with large datasets.
- Vectorization (avoiding loops): Whenever possible, use built-in Pandas/NumPy vectorized operations instead of explicit Python for loops. These operations are implemented in C and are much faster.
  - Good: df['col_new'] = df['col1'] + df['col2']
  - Bad: df['col_new'] = [row['col1'] + row['col2'] for index, row in df.iterrows()]
- .apply() vs. vectorized operations: .apply() is better than explicit loops but still slower than pure vectorized operations. Use apply() when a vectorized alternative isn't obvious.
- np.where() for conditional logic: More efficient than apply() with if/else for simple conditions.
- Chaining methods: Pandas allows method chaining, which can improve readability and often performance by reducing intermediate DataFrame creation.
- Data types (dtypes): Use appropriate data types (e.g., int32 instead of int64 if values fit, category for categorical data) to save memory and speed up operations.
- Reading large files in chunks: For extremely large CSVs, use the chunksize parameter in read_csv to process data in smaller blocks.
- Parquet/Feather for storage: For repeated reads, saving DataFrames to binary formats like Parquet or Feather is much faster than CSV.
# Illustrating vectorization vs. loop
large_df = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, 2)), columns=['A', 'B'])
start_time = time.time()
large_df['C_vectorized'] = large_df['A'] + large_df['B']
print(f"\nVectorized operation took: {time.time() - start_time:.6f} seconds")
start_time = time.time()
large_df['C_apply'] = large_df.apply(lambda row: row['A'] + row['B'], axis=1)
print(f"Apply operation took: {time.time() - start_time:.6f} seconds")
# Don't even try iterrows for 1M rows if you can avoid it!
# start_time = time.time()
# results = []
# for index, row in large_df.iterrows():
# results.append(row['A'] + row['B'])
# large_df['C_loop'] = results
# print(f"Loop operation took: {time.time() - start_time:.6f} seconds")
This chapter provides a comprehensive overview of Pandas, from its basic data structures and operations to advanced techniques for feature engineering and preparing data for modern AI applications. The ability to efficiently manipulate and clean data using Pandas is an essential skill for any data professional.
Chapter 4: Matplotlib and Seaborn – Illuminating Data with Visualizations
4.1 What are Matplotlib and Seaborn?
Data visualization is a crucial step in any data science workflow, allowing us to understand data patterns, identify anomalies, communicate insights, and validate assumptions.
- Matplotlib: The foundational plotting library for Python. It provides a highly flexible and comprehensive API for creating static, animated, and interactive visualizations. Think of it as the canvas and brush, offering fine-grained control over every aspect of a plot.
- Seaborn: A higher-level library built on top of Matplotlib. It provides a more convenient interface for drawing attractive and informative statistical graphics. Seaborn excels at visualizing relationships between multiple variables, handling common statistical plot types, and often produces aesthetically pleasing plots with less code. It’s like having a set of pre-designed templates and smart color palettes for your data.
While Matplotlib offers ultimate control, Seaborn simplifies the creation of complex visualizations, often making it the first choice for exploratory data analysis (EDA). We will explore both, showing how they complement each other.
4.2 Matplotlib Fundamentals
4.2.1 Basic Plotting
The primary way to use Matplotlib is through its pyplot module.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns # Import seaborn as well, we'll use it for data later
# Set a style for better aesthetics (optional)
plt.style.use('ggplot')
# Simple line plot
x = np.linspace(0, 10, 100)
y = np.sin(x)
plt.figure(figsize=(8, 4)) # Create a figure and specify its size
plt.plot(x, y)
plt.title("Simple Sine Wave")
plt.xlabel("X-axis Label")
plt.ylabel("Y-axis Label")
plt.grid(True)
plt.show() # Display the plot
4.2.2 Figure and Axes
Matplotlib plots are organized into Figures and Axes.
- Figure: The entire window or page on which everything is drawn. It’s the top-level container.
- Axes: The area where the data is actually plotted with x and y ticks and labels. A Figure can contain multiple Axes.
# Create a figure and a set of subplots
fig, ax = plt.subplots(figsize=(10, 5)) # fig is the Figure, ax is the Axes object
ax.plot(x, np.cos(x), label='Cosine')
ax.plot(x, np.sin(x), label='Sine')
ax.set_title("Sine and Cosine Waves")
ax.set_xlabel("Angle (radians)")
ax.set_ylabel("Amplitude")
ax.legend() # Show the legend
ax.grid(True)
plt.show()
4.2.3 Multiple Subplots
plt.subplots() is very useful for creating multiple plots within a single figure.
fig, axes = plt.subplots(2, 2, figsize=(12, 8)) # 2 rows, 2 columns of subplots
# Plot 1: Top-left
axes[0, 0].plot(x, y, color='red')
axes[0, 0].set_title("Sine")
# Plot 2: Top-right
axes[0, 1].plot(x, y**2, color='blue')
axes[0, 1].set_title("Sine Squared")
# Plot 3: Bottom-left
axes[1, 0].scatter(x, y, color='green', marker='o', s=10) # s is marker size
axes[1, 0].set_title("Sine Scatter")
# Plot 4: Bottom-right
axes[1, 1].hist(np.random.randn(1000), bins=30, color='purple', alpha=0.7)
axes[1, 1].set_title("Histogram")
plt.tight_layout() # Adjust subplot parameters for a tight layout
plt.show()
4.2.4 Customizing Plots
- Colors: color='red', hex codes such as color='#FF5733', or RGB tuples.
- Linestyles: linestyle='--', '-.', ':'.
- Markers: marker='o', '^', 's'.
- Titles, Labels, Legends: set_title(), set_xlabel(), set_ylabel(), legend().
- Text and Annotations: ax.text(), ax.annotate().
- Saving Plots: plt.savefig('my_plot.png').
plt.figure(figsize=(8, 5))
plt.plot(x, y, color='steelblue', linestyle='--', linewidth=2, marker='o', markersize=5, label='Sine Wave')
plt.title("Customized Sine Wave", fontsize=16)
plt.xlabel("X-axis", fontsize=12)
plt.ylabel("Y-axis", fontsize=12)
plt.legend(loc='upper right')
plt.ylim(-1.5, 1.5) # Set y-axis limits
plt.text(7, 1.2, "Peak Value!", fontsize=10, color='darkgreen')
plt.annotate('Local Max', xy=(np.pi/2, 1), xytext=(np.pi/2 + 1, 1.3),
arrowprops=dict(facecolor='black', shrink=0.05),
fontsize=10)
plt.grid(True, linestyle=':', alpha=0.6)
plt.savefig('custom_sine_wave.png', dpi=300) # Save plot with high resolution
plt.show()
4.3 Common Matplotlib Plot Types
- Line Plots: plt.plot()
- Scatter Plots: plt.scatter()
- Histograms: plt.hist()
- Bar Charts: plt.bar()
- Pie Charts: plt.pie()
- Box Plots: plt.boxplot()
# Generate some dummy data
categories = ['A', 'B', 'C', 'D', 'E']
values = np.random.randint(10, 100, size=len(categories))
data_for_box = [np.random.normal(0, 1, 100), np.random.normal(1, 1.5, 100), np.random.normal(-1, 0.5, 100)]
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Bar Chart
axes[0].bar(categories, values, color=['skyblue', 'lightcoral', 'lightgreen', 'gold', 'plum'])
axes[0].set_title("Bar Chart of Categories")
axes[0].set_xlabel("Category")
axes[0].set_ylabel("Value")
# Box Plot
axes[1].boxplot(data_for_box, labels=['Group 1', 'Group 2', 'Group 3'])
axes[1].set_title("Box Plot of Distributions")
axes[1].set_ylabel("Value")
plt.tight_layout()
plt.show()
4.4 Seaborn for Statistical Graphics
Seaborn integrates beautifully with Pandas DataFrames and is designed for easy creation of complex statistical plots.
4.4.1 Setting Up Seaborn
Typically imported as sns. It can set a default aesthetic for Matplotlib plots, making them more visually appealing.
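That default aesthetic is usually enabled once per session. A minimal sketch (note that this overrides the earlier plt.style.use('ggplot') call for subsequent plots):
sns.set_theme(style="whitegrid", palette="deep")  # apply Seaborn's defaults to all Matplotlib figures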
# Load a built-in Seaborn dataset (often used for examples)
tips_df = sns.load_dataset('tips')
print("Tips DataFrame (first 5 rows):\n", tips_df.head())
plt.figure(figsize=(8, 6))
sns.histplot(data=tips_df, x='total_bill', kde=True) # kde=True adds a Kernel Density Estimate
plt.title("Distribution of Total Bill")
plt.xlabel("Total Bill Amount ($)")
plt.ylabel("Count")
plt.show()
4.4.2 Common Seaborn Plot Types
- Distributions: histplot(), kdeplot(), displot(), rugplot().
- Relational Plots: scatterplot(), lineplot(), relplot().
- Categorical Plots: boxplot(), violinplot(), stripplot(), swarmplot(), barplot(), countplot(), catplot().
- Regression Plots: regplot(), lmplot().
- Matrix Plots: heatmap(), clustermap().
- Multi-plot Grids: FacetGrid, PairGrid.
Univariate Distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Histogram with KDE
sns.histplot(tips_df['total_bill'], kde=True, ax=axes[0])
axes[0].set_title("Total Bill Distribution (Histplot)")
# KDE Plot
sns.kdeplot(tips_df['total_bill'], fill=True, ax=axes[1])
axes[1].set_title("Total Bill Distribution (KDE Plot)")
plt.tight_layout()
plt.show()
Bivariate Distributions
# Scatter plot with regression line
plt.figure(figsize=(8, 6))
sns.regplot(data=tips_df, x='total_bill', y='tip')
plt.title("Total Bill vs. Tip with Regression Line")
plt.show()
# Pair Plot (shows relationships between all numerical variables)
# Often used for initial EDA
# sns.pairplot(tips_df, hue='sex') # Hue adds color based on a categorical variable
# plt.suptitle("Pair Plot of Tips Dataset", y=1.02)
# plt.show()
Categorical Data Plots
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Box plot
sns.boxplot(data=tips_df, x='day', y='total_bill', ax=axes[0])
axes[0].set_title("Total Bill by Day (Boxplot)")
# Violin plot (combines box plot with kernel density estimate)
sns.violinplot(data=tips_df, x='day', y='total_bill', hue='sex', split=True, ax=axes[1])
axes[1].set_title("Total Bill by Day and Sex (Violinplot)")
axes[1].legend(title='Sex')
plt.tight_layout()
plt.show()
Heatmaps for Correlation
# Calculate correlation matrix
correlation_matrix = tips_df[['total_bill', 'tip', 'size']].corr()
plt.figure(figsize=(6, 5))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=.5)
plt.title("Correlation Matrix of Numerical Features")
plt.show()
4.4.3 Customizing Seaborn Plots
Seaborn plots often return Matplotlib Axes objects, so you can still use Matplotlib’s functions for fine-tuning.
plt.figure(figsize=(10, 6))
ax = sns.scatterplot(data=tips_df, x='total_bill', y='tip', hue='smoker', size='size',
sizes=(20, 400), # Min and max size for markers
palette='viridis', alpha=0.7)
ax.set_title("Total Bill vs. Tip by Smoker Status and Party Size", fontsize=16)
ax.set_xlabel("Total Bill ($)", fontsize=12)
ax.set_ylabel("Tip ($)", fontsize=12)
ax.tick_params(axis='both', which='major', labelsize=10)
ax.legend(title='Smoker', bbox_to_anchor=(1.05, 1), loc='upper left') # Move legend outside
plt.grid(True, linestyle=':', alpha=0.6)
plt.tight_layout()
plt.show()
4.5 Advanced Visualization for AI/ML and LLMs
Beyond basic plots, visualization plays a key role in understanding model behavior, performance, and data characteristics relevant to AI.
4.5.1 Visualizing Feature Distributions
Understanding feature distributions is critical for data preprocessing, identifying outliers, and feature engineering.
# Let's use a synthetic dataset for a classification problem
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
n_clusters_per_class=1, random_state=42)
df_ml = pd.DataFrame(X, columns=[f'feature_{i}' for i in range(X.shape[1])])
df_ml['target'] = y
plt.figure(figsize=(12, 4))
for i, feature in enumerate(['feature_0', 'feature_1']):
plt.subplot(1, 2, i + 1)
sns.histplot(data=df_ml, x=feature, hue='target', kde=True, palette='coolwarm')
plt.title(f'Distribution of {feature} by Target')
plt.tight_layout()
plt.show()
# Boxplots for comparing feature values across target classes
plt.figure(figsize=(10, 5))
sns.boxplot(data=df_ml, x='target', y='feature_0', palette='pastel')
plt.title('Feature_0 Distribution Across Target Classes')
plt.show()
4.5.2 Visualizing Relationships for Feature Selection
Scatter plots and correlation heatmaps are invaluable for understanding feature relationships and potential multicollinearity.
# Using the tips_df again for correlation visualization
# pairplot creates its own figure grid, so an explicit plt.figure() call is unnecessary
sns.pairplot(tips_df, hue='smoker', vars=['total_bill', 'tip', 'size'])
plt.suptitle("Pair Plot of Tips Dataset (by Smoker Status)", y=1.02)
plt.show()
4.5.3 Visualizing Model Performance
Visualizations are essential for interpreting model performance metrics.
- Confusion Matrix: Shows the number of correct and incorrect predictions for each class.
- ROC Curve (Receiver Operating Characteristic): Visualizes the trade-off between the true positive rate and false positive rate at various threshold settings.
- Feature Importance Plots: Helps understand which features contribute most to model predictions.
from sklearn.metrics import confusion_matrix, roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
import matplotlib.patches as mpatches
# Simple classification example
X_train, X_test, y_train, y_test = train_test_split(df_ml[['feature_0', 'feature_1']], df_ml['target'], test_size=0.3, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probability of positive class
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Confusion Matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Predicted 0', 'Predicted 1'], yticklabels=['Actual 0', 'Actual 1'])
axes[0].set_title("Confusion Matrix")
axes[0].set_xlabel("Predicted Label")
axes[0].set_ylabel("True Label")
# ROC Curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
axes[1].plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
axes[1].plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('Receiver Operating Characteristic (ROC) Curve')
axes[1].legend(loc="lower right")
axes[1].grid(True)
plt.tight_layout()
plt.show()
# Feature Importance (for models that provide it, e.g., tree-based models or linear model coefficients)
# For Logistic Regression, coefficient magnitudes indicate importance; this is most meaningful when features are on comparable scales
feature_importance = pd.Series(model.coef_[0], index=X_train.columns).sort_values(ascending=False)
plt.figure(figsize=(7, 4))
sns.barplot(x=feature_importance.values, y=feature_importance.index, palette='viridis')
plt.title("Feature Importance (Logistic Regression Coefficients)")
plt.xlabel("Coefficient Value")
plt.ylabel("Feature")
plt.show()
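For the tree-based models mentioned in the comment above, scikit-learn exposes a feature_importances_ attribute after fitting. The snippet below is an illustrative sketch on the same synthetic train split, using a RandomForestClassifier as the example model.
# Illustrative sketch: feature importances from a tree-based model on the same split
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
rf_importance = pd.Series(rf.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plt.figure(figsize=(7, 4))
sns.barplot(x=rf_importance.values, y=rf_importance.index, color='seagreen')
plt.title("Feature Importance (Random Forest)")
plt.xlabel("Importance")
plt.ylabel("Feature")
plt.show()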
4.5.4 Visualizing LLM Fine-tuning Data Characteristics
For LLMs, visualizations can help in understanding the length distribution of prompts/responses, the diversity of topics, or the impact of data cleaning.
# Using the previously created llm_finetune_df
if 'llm_finetune_df' in locals():
    llm_finetune_df['instruction_length'] = llm_finetune_df['instruction'].apply(len)
    llm_finetune_df['response_length'] = llm_finetune_df['response'].apply(len)
    fig, axes = plt.subplots(1, 2, figsize=(14, 5))
    sns.histplot(llm_finetune_df['instruction_length'], kde=True, ax=axes[0], color='skyblue')
    axes[0].set_title("Distribution of Instruction Lengths")
    axes[0].set_xlabel("Length of Instruction (characters)")
    axes[0].set_ylabel("Count")
    sns.histplot(llm_finetune_df['response_length'], kde=True, ax=axes[1], color='lightcoral')
    axes[1].set_title("Distribution of Response Lengths")
    axes[1].set_xlabel("Length of Response (characters)")
    axes[1].set_ylabel("Count")
    plt.tight_layout()
    plt.show()
# Visualizing distribution of 'speaker' in raw chat_logs (if available)
if 'chat_logs' in locals():
speaker_counts = chat_logs['speaker'].value_counts()
plt.figure(figsize=(6, 6))
plt.pie(speaker_counts, labels=speaker_counts.index, autopct='%1.1f%%', startangle=90, colors=sns.color_palette('pastel'))
plt.title("Distribution of Speakers in Chat Logs")
plt.axis('equal') # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()
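Topic diversity is harder to visualize directly, but a word-frequency plot over the instructions gives a rough first impression. The following sketch is a crude proxy only, and assumes the llm_finetune_df created earlier is available.
# Rough topic-diversity proxy (illustrative sketch): most frequent words across instructions
from collections import Counter
if 'llm_finetune_df' in locals():
    word_counts = Counter()
    for text in llm_finetune_df['instruction'].str.lower():
        word_counts.update(text.split())
    top_words = pd.Series(dict(word_counts.most_common(15))).sort_values()
    plt.figure(figsize=(8, 5))
    top_words.plot(kind='barh', color='steelblue')
    plt.title("Most Frequent Words in Fine-tuning Instructions")
    plt.xlabel("Count")
    plt.tight_layout()
    plt.show()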
4.5.5 Visualizing Agent Trajectories or State Spaces
For more advanced AI applications, visualizing agent behavior can be complex but immensely insightful. This might involve plotting trajectories in a 2D or 3D state space, or visualizing decision trees/graphs.
A single generic example cannot cover arbitrary agent state spaces, but the principle is to leverage Matplotlib’s scatter and line plots, adding animation or interactive elements for dynamic environments.
# Conceptual example: Visualizing a 2D agent path
# This is illustrative, assuming an agent moves in a 2D grid
agent_path_x = [0, 1, 1, 2, 2, 3]
agent_path_y = [0, 0, 1, 1, 2, 2]
rewards_at_steps = [0, 1, 0, -1, 5, 0]
plt.figure(figsize=(7, 7))
plt.plot(agent_path_x, agent_path_y, marker='o', linestyle='-', color='blue', label='Agent Path')
plt.scatter(agent_path_x, agent_path_y, c=rewards_at_steps, cmap='RdYlGn', s=100, zorder=5, label='Rewards') # Color by reward
plt.colorbar(label='Reward at Step')
plt.title("Agent Trajectory with Rewards in 2D Space")
plt.xlabel("X Coordinate")
plt.ylabel("Y Coordinate")
plt.xlim(-0.5, 3.5)
plt.ylim(-0.5, 2.5)
plt.grid(True)
plt.legend()
plt.show()
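For dynamic environments, the same trajectory can be replayed frame by frame. The snippet below is a minimal animation sketch using matplotlib.animation.FuncAnimation and the path variables defined above; how the animation is displayed or saved depends on your environment (e.g., Jupyter vs. a script).
# Minimal animation sketch: replay the agent path step by step
from matplotlib.animation import FuncAnimation
fig, ax = plt.subplots(figsize=(6, 6))
ax.set_xlim(-0.5, 3.5)
ax.set_ylim(-0.5, 2.5)
ax.set_title("Animated Agent Trajectory")
ax.grid(True)
line, = ax.plot([], [], marker='o', linestyle='-', color='blue')
def update(frame):
    # Draw the path up to the current step
    line.set_data(agent_path_x[:frame + 1], agent_path_y[:frame + 1])
    return line,
anim = FuncAnimation(fig, update, frames=len(agent_path_x), interval=500, blit=True)
plt.show()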
This chapter has covered the essentials of Matplotlib and Seaborn, from basic plotting to advanced visualizations tailored for AI/ML and LLM data. Effective visualization is not just about making pretty graphs; it’s about gaining deeper insights into your data and models, enabling better decision-making and more robust AI systems.
Conclusion
This comprehensive guide has traversed the landscape of data manipulation and analysis, from the foundational numerical operations of NumPy to the robust data structuring capabilities of Pandas, and finally, to the illustrative power of Matplotlib and Seaborn for visualization. We began by establishing the critical importance of data in the AI era, particularly for advanced applications like Large Language Models and AI agents.
Throughout the chapters, we have equipped you with the skills to:
- Master NumPy arrays: Understand their creation, indexing, manipulation, and the performance benefits they offer for numerical computations.
- Become proficient in Pandas: Leverage Series and DataFrame for efficient data loading, cleaning, transformation, aggregation, and combining datasets. We specifically explored techniques for feature engineering and preparing data for LLM fine-tuning and agent performance analysis.
- Create compelling visualizations with Matplotlib and Seaborn: From basic plots to advanced statistical graphics, you now have the tools to explore data distributions, relationships, and effectively communicate insights from complex AI/ML scenarios.
The journey through these libraries is more than just learning syntax; it’s about developing a mindset for robust data handling. In the world of AI, where models are only as good as the data they consume, your ability to efficiently prepare, process, and understand data will be your most valuable asset. The techniques learned here form the bedrock for advanced machine learning workflows, enabling you to build, fine-tune, and analyze intelligent systems with confidence and precision.
Continue to practice, experiment with real-world datasets, and explore the extensive documentation of these libraries. The field of data science is dynamic, and continuous learning is key to staying at the forefront. Armed with NumPy, Pandas, Matplotlib, and Seaborn, you are now well-prepared to tackle diverse data challenges and contribute meaningfully to the exciting domain of artificial intelligence.
Appendix A: Further Learning and Resources
- Official Documentation:
- NumPy: https://numpy.org/doc/stable/
- Pandas: https://pandas.pydata.org/docs/
- Matplotlib: https://matplotlib.org/stable/contents.html
- Seaborn: https://seaborn.pydata.org/
- Interactive Learning Platforms:
- Kaggle Learn: Offers free micro-courses on Pandas, Matplotlib, and other data science topics.
- DataCamp, Coursera, Udemy: Provide structured courses from beginner to advanced levels.
- Books:
- “Python for Data Analysis” by Wes McKinney (creator of Pandas)
- “Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow” by Aurélien Géron (excellent for practical ML, covers data prep)
- Community Forums & Q&A:
- Stack Overflow (for specific coding questions)
- Reddit communities like r/datascience, r/machinelearning
- Best Practices for Production-Ready Code:
- Learn about PySpark or Dask for distributed computing with large datasets.
- Explore data validation libraries (e.g., Pandera, Great Expectations).
- Understand software engineering principles for maintainable data pipelines.
Appendix B: Glossary of Terms
- ndarray (NumPy): N-dimensional array, the fundamental data structure in NumPy.
- Series (Pandas): A one-dimensional labeled array capable of holding any data type.
- DataFrame (Pandas): A two-dimensional labeled data structure with columns of potentially different types; akin to a spreadsheet or SQL table.
- Broadcasting (NumPy): The mechanism NumPy uses to perform arithmetic operations on arrays of different shapes.
- Ufuncs (Universal Functions): NumPy functions that operate element-wise on arrays, enabling vectorized computations.
- Missing Data (NaN): Not a Number; a common representation for missing or undefined values in numerical data.
- Feature Engineering: The process of creating new features from existing raw data to improve model performance.
- LLM Fine-tuning: Adapting a pre-trained Large Language Model to a specific task or dataset with a smaller, specialized dataset.
- Categorical Data: Data that can be divided into groups (e.g., ‘city’, ‘sex’).
- Numerical Data: Data representing measurable quantities (e.g., ‘age’, ‘salary’).
- Histogram: A graphical representation of the distribution of numerical data.
- Box Plot: A graphical display summarizing the distribution of a set of data, indicating median, quartiles, and potential outliers.
- Scatter Plot: A graph in which the values of two variables are plotted along two axes, the pattern of the resulting points revealing any correlation present.
- Heatmap: A graphical representation of data where values in a matrix are represented as colors. Often used for correlation matrices.
- KDE Plot (Kernel Density Estimate): A non-parametric way to estimate the probability density function of a random variable.
- Vectorization: Performing operations on entire arrays at once, rather than element by element, often leading to significant performance gains.