Exploratory Analysis
One of the main tasks you perform with Pandas is exploratory analysis. Looking at the data and finding what is useful, or what is potentially wrong and needs cleaning up, is a core practice for data scientists and data engineers.
Create a Pandas DataFrame
Load a CSV to start working with the data and performing exploratory analysis.
import pandas as pd
csv_url = "https://raw.githubusercontent.com/paiml/wine-ratings/main/wine-ratings.csv"
df = pd.read_csv(csv_url, index_col=0)
# The most common first step is previewing rows with .head()
df.head(15)
# Now let's get summary statistics for the numeric columns
df.describe()
# You can also get metadata about the dataset with .info()
df.info()
# Sort by a column; here, highest rating first
df.sort_values(by="rating", ascending=False).head()
Data Cleaning: Removing Newlines and Carriage Returns
Raw datasets often contain hidden whitespace characters such as newlines (\n) and carriage returns (\r) embedded in text fields, especially when the data comes from web scraping or copy-paste operations. Calling .replace() with regex=True matches these characters inside string values across the entire DataFrame, so each call cleans every text column at once. Here carriage returns are removed and newlines become spaces, so the words on either side of a line break do not run together. Do this before any string matching or text analysis.
df = df.replace({"\r": ""}, regex=True)
df = df.replace({"\n": " "}, regex=True)
df.head(10)
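To see what those two replacements do without downloading the full dataset, here is a minimal sketch on a small made-up DataFrame. The variable names `sample` and `cleaned` and all of the values are illustrative assumptions, not rows from the wine ratings CSV.

```python
import pandas as pd

# Hypothetical sample data (not from the wine-ratings CSV) with carriage
# returns and newlines embedded in the text column.
sample = pd.DataFrame(
    {
        "name": ["Red\r\nBlend", "White\nWine"],
        "rating": [90, 88],
    }
)

# Remove carriage returns first, then turn newlines into spaces.
# With regex=True the pattern is matched inside string values;
# the numeric column is left unchanged.
cleaned = sample.replace({"\r": ""}, regex=True).replace({"\n": " "}, regex=True)

print(cleaned["name"].tolist())
```

Removing \r before replacing \n means a Windows-style \r\n pair collapses to a single space rather than two.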
# The grape column is not very useful, so let's remove it and describe the data again
df.drop(['grape'], axis=1, inplace=True)
df.describe()
# Group rows and apply a specific aggregation method, such as .mean()
df.groupby("region").mean(numeric_only=True)
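The grouping step can be sketched on a small made-up DataFrame so the result is easy to check by hand. The region names, ratings, and the variable names `sample`, `mean_by_region`, and `summary` are assumptions for illustration only. The `.agg()` call is one possible way to get several summaries per group and is not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample data; names, regions, and ratings are made up.
sample = pd.DataFrame(
    {
        "name": ["A", "B", "C", "D"],
        "region": ["Napa", "Napa", "Rioja", "Rioja"],
        "rating": [90.0, 92.0, 88.0, 86.0],
    }
)

# numeric_only=True restricts the mean to numeric columns,
# so the text column "name" is excluded from the result.
mean_by_region = sample.groupby("region").mean(numeric_only=True)

# .agg() computes several summaries per group in one call.
summary = sample.groupby("region")["rating"].agg(["mean", "count"])

print(mean_by_region)
print(summary)
```

Including the count next to the mean helps spot regions whose average rests on only a handful of rows.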