Effortless Data Extraction from Websites Using Python
Chapter 1: Understanding Web Scraping
Web scraping is the process of programmatically gathering information from websites. In Python, two widely used libraries for this are BeautifulSoup and Selenium. Selenium is particularly useful for dynamic sites that render their content with JavaScript, but for simplicity's sake we will focus on a static website.
In this project, we will utilize three libraries:
- Pandas: For organizing the extracted data into a structured format.
- Requests: To handle requests for accessing web pages.
- BeautifulSoup: To parse the HTML and retrieve the desired data.
Important: Before scraping, always check the site's robots.txt file and terms of service to verify that the website permits web scraping.
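One way to run such a check programmatically is to read the site's robots.txt rules with Python's standard-library `urllib.robotparser`. The sketch below parses an inline sample file, so the rules, the `is_allowed` helper, and the example.com URLs are all made up for illustration. Keep in mind that robots.txt is only one signal; a site's terms of service may add further restrictions.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, url, user_agent="*"):
    """Hypothetical helper: check a URL against robots.txt rules given as text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # Parse rules from memory, no network needed
    return parser.can_fetch(user_agent, url)


# Invented rules for illustration; a real site publishes its own at /robots.txt
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "https://example.com/contacts"))      # → True
print(is_allowed(sample_rules, "https://example.com/private/list"))  # → False
```

For a live site, `RobotFileParser` also offers `set_url()` and `read()` to download the real file before calling `can_fetch()`.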
The Goal: Extract Email Addresses and Save to CSV
I needed to compile a list of email addresses across various subjects and quickly realized that doing this manually would be tedious. Thus, I turned to my Python skills for assistance.
The data is presented in a table on the website, with one row per subject and four columns, the first two of which hold the subject name and the email address.
The aim of this project is to convert this tabular data into a CSV file.
Step 1: Import Necessary Libraries
First, we will import the required Python libraries:
import pandas as pd
import requests
from bs4 import BeautifulSoup
Step 2: Fetching the HTML Content
#### 2.1 Understanding How Web Scraping Works
To extract data from the web, we rely on HTML (HyperText Markup Language), the standard markup language that structures web pages. By right-clicking on any webpage and selecting "Inspect," we can view the underlying HTML code.
Python, along with its libraries, enables us to "read" this HTML and extract the necessary information.
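To make "reading" HTML concrete, here is a minimal sketch that parses a small, invented HTML snippet with BeautifulSoup and pulls out two pieces of text. The snippet, the tag names, and the `note` class are assumptions chosen for illustration, not taken from the site used in this guide.

```python
from bs4 import BeautifulSoup

# Invented HTML, similar to what the browser's "Inspect" panel shows
sample_html = """
<html>
  <body>
    <h1>Contacts</h1>
    <p class="note">Updated weekly</p>
  </body>
</html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# find() returns the first matching tag; get_text() returns its text content
heading = soup.find("h1").get_text(strip=True)
note = soup.find("p", class_="note").get_text(strip=True)

print(heading)  # → Contacts
print(note)     # → Updated weekly
```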
#### 2.2 Creating a Function to Retrieve HTML
We will define a function, get_html, that takes a URL as its parameter:
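def get_html(url):
    try:
        response = requests.get(url, timeout=10)  # Send a request; give up after 10 seconds
        response.raise_for_status()  # Raise an error for 4xx/5xx responses
        return response.text
    except requests.RequestException as e:  # Handle network and HTTP errors
        print(f"Error retrieving the webpage: {e}")
        return ""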
We also initialize a set to store unique email addresses:
emails = set() # To prevent duplicates
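As a quick illustration of why a set helps here, the sketch below adds a few invented addresses to a set after normalizing them. The sample values are hypothetical, and lowercasing assumes that addresses are treated as case-insensitive, which holds for most mail providers in practice.

```python
emails = set()  # A set keeps only one copy of each value

# Hypothetical raw values, as they might come out of the table cells
raw_values = ["Math@example.com", "math@example.com ", "physics@example.com"]

for value in raw_values:
    # Strip whitespace and lowercase so near-identical entries collapse into one
    emails.add(value.strip().lower())

print(sorted(emails))  # → ['math@example.com', 'physics@example.com']
```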
Step 3: Extracting the Data
Next, we will create a function to extract the required information from the HTML.
def extract_data(html):
    soup = BeautifulSoup(html, 'html.parser')  # Initialize BeautifulSoup
    table = soup.find('table')  # Locate the first table on the page
    data = []  # List to store the extracted data
    if table:
        rows = table.find_all('tr')  # Retrieve all rows from the table
        for row in rows[1:]:  # Skip the header row
            cols = row.find_all('td')  # Get all table cells
            if len(cols) == 4:  # Assumes every data row has exactly 4 columns
                subject_name = cols[0].text.strip()  # Subject name
                email = cols[1].text.strip()  # Email address
                if email and email not in emails:  # Skip blanks and duplicates
                    emails.add(email)
                    data.append({'subject': subject_name, 'email': email})
    return data
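To see the extraction logic in action without touching a real website, the sketch below runs a condensed version of it on an inline sample table. The table contents, the extra "Room" and "Phone" columns, and the `extract_rows` name are invented for illustration; the real page may be laid out differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: the column names and values are invented
sample_html = """
<table>
  <tr><th>Subject</th><th>Email</th><th>Room</th><th>Phone</th></tr>
  <tr><td>Math</td><td>math@example.com</td><td>101</td><td>555-0100</td></tr>
  <tr><td>Physics</td><td>physics@example.com</td><td>102</td><td>555-0101</td></tr>
  <tr><td>Incomplete row</td><td>only-two-cells@example.com</td></tr>
</table>
"""


def extract_rows(html):
    """Condensed version of the extraction logic, for demonstration only."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return []  # No table on the page
    rows = []
    for row in table.find_all("tr")[1:]:  # Skip the header row
        cols = row.find_all("td")
        if len(cols) == 4:  # Rows with a different cell count are ignored
            rows.append({
                "subject": cols[0].get_text(strip=True),
                "email": cols[1].get_text(strip=True),
            })
    return rows


rows = extract_rows(sample_html)
for item in rows:
    print(item)
```

Note that the third data row is silently dropped because it has only two cells. If the real table sometimes has merged or missing cells, the four-column check is the first place to look when rows go missing.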
Step 4: Running the Functions and Saving the Data
After setting everything up, we will call our functions to extract the data:
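url = "https://example.com/contacts"  # Placeholder: replace with the page you want to scrape
html = get_html(url)  # Retrieve HTML content
data = extract_data(html)  # Extract data from HTML
if data:
    df = pd.DataFrame(data)  # Convert data into a DataFrame
    df.to_csv('mail_info.csv', index=False)  # Save DataFrame to a CSV file
    print("Data has been successfully extracted and saved to mail_info.csv")
else:
    print("No data was extracted, so no CSV file was written")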
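If you want to check what the CSV will look like before writing a file, one option is to send the output to an in-memory buffer and read it back. This is a minimal sketch with invented records in the same shape that extract_data returns; the `records`, `buffer`, and `check` names are hypothetical.

```python
import io

import pandas as pd

# Hypothetical records in the same shape that extract_data returns
records = [
    {"subject": "Math", "email": "math@example.com"},
    {"subject": "Physics", "email": "physics@example.com"},
]

df = pd.DataFrame(records)

buffer = io.StringIO()
df.to_csv(buffer, index=False)  # Same to_csv call, but writing to memory instead of disk
csv_text = buffer.getvalue()
print(csv_text)

# Reading the text back confirms that the columns and row count survived
check = pd.read_csv(io.StringIO(csv_text))
print(check.shape)  # → (2, 2)
```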
And that's all it takes! This guide demonstrates how to scrape data from a website's table using Python, effectively automating a tedious task.
In the next installment, I plan to show how to extract data from grocery stores for your data analysis needs.
Additional Resources:
To further enhance your skills, consider subscribing to my free newsletter, The Super Learning Lab, where I share insights and tips.
Thank you for reading! Stay curious,
Axel