hansontechsolutions.com

Effortless Data Extraction from Websites Using Python

Chapter 1: Understanding Web Scraping

Web scraping is the process of automatically collecting information from websites. In Python, two popular libraries for this are BeautifulSoup and Selenium. Selenium is particularly useful for dynamic sites that render their content with JavaScript; to keep things simple, we will scrape a static website, for which BeautifulSoup is enough.

In this project, we will utilize three libraries:

  • Pandas: For organizing the extracted data into a structured format.
  • Requests: To handle requests for accessing web pages.
  • BeautifulSoup: To parse the HTML and retrieve the desired data.

Important: Always verify that the website permits web scraping, for example by reading its terms of service and its robots.txt file.
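
One way to automate part of that check is the standard library's urllib.robotparser. The sketch below is only an illustration: the function name is_scraping_allowed, the sample rules, and the example.com URLs are all made up, and a robots.txt file does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser


def is_scraping_allowed(robots_txt, page_url, user_agent="*"):
    """Return True if the given robots.txt text lets user_agent fetch page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # Parse rules supplied as text
    return parser.can_fetch(user_agent, page_url)


# Hypothetical rules, standing in for a real site's robots.txt
sample_rules = """User-agent: *
Disallow: /private/
"""

print(is_scraping_allowed(sample_rules, "https://example.com/emails"))
print(is_scraping_allowed(sample_rules, "https://example.com/private/list"))
```

For a live site, the same parser can load the file itself with set_url() followed by read(), instead of receiving the rules as text.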

The Goal: Extract Email Addresses and Save to CSV

I needed to compile a list of email addresses across various subjects and quickly realized that doing this manually would be tedious. Thus, I turned to my Python skills for assistance.

The data is presented in a table format on the website, as illustrated below:

[Figure: Email list table structure]

The aim of this project is to convert this tabular data into a CSV file.

Step 1: Import Necessary Libraries

First, we will import the required Python libraries:

    import pandas as pd
    import requests
    from bs4 import BeautifulSoup

Step 2: Fetching the HTML Content

#### 2.1 Understanding How Web Scraping Works

To extract data from the web, we rely on HTML (HyperText Markup Language), the standard language for displaying web pages. By right-clicking on any webpage and selecting "Inspect," we can view the underlying HTML code.

Python, along with its libraries, enables us to "read" this HTML and extract the necessary information.
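
To make "reading" HTML concrete, here is a minimal sketch that uses html.parser from the standard library, the parser that BeautifulSoup's 'html.parser' option builds on. The class name CellCollector and the sample markup are invented for illustration; BeautifulSoup wraps this kind of low-level work in a much friendlier interface.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collects the text found inside every <td> element."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":  # A table cell opens: start a new text buffer
            self.in_cell = True
            self.cells.append("")

    def handle_endtag(self, tag):
        if tag == "td":  # The cell closes: stop collecting text
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:  # Only keep text that sits inside a cell
            self.cells[-1] += data


# Made-up markup resembling one row of an email table
sample_html = "<table><tr><td>Physics</td><td>physics@example.com</td></tr></table>"

collector = CellCollector()
collector.feed(sample_html)
collector.close()
print(collector.cells)
```

The parser walks through the markup tag by tag, which is essentially what happens behind the scenes when BeautifulSoup builds its searchable tree.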

#### 2.2 Creating a Function to Retrieve HTML

We will define a function, get_html, that takes a URL as its parameter:

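    def get_html(url):
        try:
            response = requests.get(url, timeout=10)  # Send a request to the website
            response.raise_for_status()  # Treat 4xx/5xx responses as errors
            return response.text
        except requests.RequestException as e:  # Handle network and HTTP errors
            print(f"Error retrieving the webpage: {e}")
            return ""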

We also initialize a set to store unique email addresses:

    emails = set()  # To prevent duplicates
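
As a rough illustration of how such a set can filter duplicates, the sketch below normalizes each address before checking it. The helper name remember_email and the sample addresses are hypothetical, and lowercasing the whole address is an assumption that suits most real-world mailboxes but is not guaranteed by the email standards.

```python
emails = set()  # Addresses seen so far


def remember_email(raw_email):
    """Return True if the address is new, False if it is empty or already seen."""
    normalized = raw_email.strip().lower()
    if not normalized or normalized in emails:
        return False
    emails.add(normalized)
    return True


# Made-up addresses; the second differs from the first only in case and spacing
for candidate in ["info@example.com", " Info@Example.com ", "help@example.com"]:
    print(candidate.strip(), remember_email(candidate))
```

Because membership checks on a set take roughly constant time, this stays fast even for long lists.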

Step 3: Extracting the Data

Next, we will create a function to extract the required information from the HTML.

    def extract_data(html):
        soup = BeautifulSoup(html, 'html.parser')  # Initialize BeautifulSoup
        table = soup.find('table')  # Locate the first table on the page
        data = []  # List to store the extracted data
        if table:
            rows = table.find_all('tr')  # Retrieve all rows from the table
            for row in rows[1:]:  # Skip the header row
                cols = row.find_all('td')  # Get all table cells
                if len(cols) == 4:  # Skip rows that do not have the expected 4 columns
                    subject_name = cols[0].text.strip()  # Subject name
                    email = cols[1].text.strip()  # Email address
                    if email not in emails:  # Skip addresses we have already seen
                        emails.add(email)
                        data.append({'subject': subject_name, 'email': email})
        return data
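
A quick way to check this kind of extraction logic without touching a live site is to run it on a small inline table. The sketch below is self-contained and uses a simplified variant named extract_rows (a hypothetical name); the sample table, including its "Office" and "Phone" columns, is invented, since the article does not say what the third and fourth columns hold.

```python
from bs4 import BeautifulSoup

# Made-up table: a header row, two data rows, and one malformed footer row
sample_html = """
<table>
  <tr><th>Subject</th><th>Email</th><th>Office</th><th>Phone</th></tr>
  <tr><td>Physics</td><td>physics@example.com</td><td>A1</td><td>555-0100</td></tr>
  <tr><td>History</td><td>history@example.com</td><td>B2</td><td>555-0101</td></tr>
  <tr><td colspan="4">Updated weekly</td></tr>
</table>
"""


def extract_rows(html):
    """Return subject/email pairs from the first table, skipping odd rows."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    rows = []
    if table is None:  # No table on the page: nothing to extract
        return rows
    for row in table.find_all("tr")[1:]:  # Skip the header row
        cols = row.find_all("td")
        if len(cols) == 4:  # Ignore rows without exactly 4 cells
            rows.append({
                "subject": cols[0].get_text(strip=True),
                "email": cols[1].get_text(strip=True),
            })
    return rows


print(extract_rows(sample_html))
print(extract_rows("<p>No table here</p>"))
```

The footer row with a single merged cell is dropped by the column-count check, and a page without any table simply yields an empty list.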

Step 4: Running the Functions and Saving the Data

After setting everything up, we will call our functions to extract the data:

    url = "https://example.com/email-list"  # Placeholder: replace with the page you want to scrape
    html = get_html(url)  # Retrieve HTML content
    data = extract_data(html)  # Extract data from HTML
    df = pd.DataFrame(data)  # Convert data into a DataFrame
    df.to_csv('mail_info.csv', index=False)  # Save DataFrame to a CSV file
    print(f"Saved {len(df)} rows to mail_info.csv")
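
It can be worth reading the file back to confirm that it contains what you expect. The sketch below does this with the standard library's csv module rather than pandas; the helper name read_back and the sample rows are hypothetical, and the file is written to a temporary folder so the example runs on its own.

```python
import csv
import os
import tempfile


def read_back(csv_path):
    """Load a CSV file into a list of dictionaries, one per row."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


# Stand-in for the scraped data (made-up values)
sample_rows = [
    {"subject": "Physics", "email": "physics@example.com"},
    {"subject": "History", "email": "history@example.com"},
]

# Write a file with the same layout as mail_info.csv: a header, then one row per entry
csv_path = os.path.join(tempfile.mkdtemp(), "mail_info.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=["subject", "email"])
    writer.writeheader()
    writer.writerows(sample_rows)

loaded = read_back(csv_path)
print(len(loaded), "rows; columns:", list(loaded[0].keys()))
```

If the row count is zero or a column is missing, that usually points to a changed table layout or a failed request.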

And that's all it takes! This guide demonstrates how to scrape data from a website's table using Python, effectively automating a tedious task.

In the next installment, I plan to show how to extract data from grocery stores for your data analysis needs.

Additional Resources:

To further enhance your skills, consider subscribing to my free newsletter, The Super Learning Lab, where I share insights and tips.

Thank you for reading! Stay curious,

Axel
