Web Scraping from Amazon with Python

Web scraping is a technique for extracting data from websites by sending requests to the server, retrieving the web pages, and parsing the HTML content to pull out the information you need. If you are learning Data Science, you should know how to collect data from APIs or through web scraping. So, in this article, I’ll take you through a step-by-step tutorial on data collection from Amazon using web scraping with Python.

Web Scraping from Amazon with Python

In this tutorial, I will walk through the process of web scraping from Amazon’s Best Sellers page in the Teaching & Education category to collect data about the top 50 authors and their ratings. Before we start, ensure you have the following Python libraries installed:

  • requests: to send HTTP requests and retrieve the web pages.
  • BeautifulSoup: to parse and extract information from the HTML content.
  • pandas: to organize and save the extracted data in a tabular format.

Pandas and requests come pre-installed if you are using Google Colab or a standard Jupyter Notebook environment. Use the command below in your Colab or Jupyter notebook to install BeautifulSoup:

  • !pip install beautifulsoup4

Now, let’s get started with web scraping from Amazon.

Step 1: Understanding the Target URL and Pagination

We are targeting the Amazon Best Sellers page in the Teaching & Education category. Amazon’s pagination allows us to navigate through multiple pages of results. The base URL for the first page looks like this:

https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_1?ie=UTF8&pg=1

Notice the pagination parameters “pg” and “zg_bs_pg” in the URL. We will increment these values to navigate through the pages.
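To make this concrete, here is a minimal sketch of how the two pagination placeholders can be filled in with Python string formatting (the category ID 4149461031 comes from the URL above):

```python
# template with two placeholders: one for zg_bs_pg_{} and one for pg={}
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"

# build the URLs for the first three pages of results
page_urls = [base_url.format(page, page) for page in range(1, 4)]

for url in page_urls:
    print(url)
```

Each page gets the same number in both positions, which is exactly what the loop in Step 3 does.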

Step 2: Set Up the HTTP Request

To scrape content from Amazon, we first need to send a request to the server and retrieve the HTML of the page. We also need to mimic a real browser to reduce the chance of being blocked by Amazon, which is why we include a User-Agent header in the request. Here’s how to set up the HTTP request:

import requests
from bs4 import BeautifulSoup
import pandas as pd

# base url of the best sellers page for teaching & education books
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"

# http headers to mimic a browser visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}

Step 3: Iterate Over Pages to Collect Data

Now, we will loop through the first three pages to collect data for the top 50 books (assuming each page displays around 20 items). On each page, we will extract the author’s name and rating:

# initialize a list to store book data
book_list = []

# iterate over the first 3 pages to get top 50 books (assuming each page has about 20 items)
for page in range(1, 4):
    # construct the URL for the current page
    url = base_url.format(page, page)
    
    # send a GET request to the url
    response = requests.get(url, headers=headers)
    
    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "lxml")
    
    # find all the book elements
    books = soup.find_all("div", {"class": "zg-grid-general-faceout"})
    
    # iterate over each book element to extract data
    for book in books:
        if len(book_list) < 50:  # stop once we've collected 50 books
            author_tag = book.find("a", class_="a-size-small a-link-child")
            author = author_tag.get_text(strip=True) if author_tag else "N/A"
            rating_tag = book.find("span", class_="a-icon-alt")
            rating = rating_tag.get_text(strip=True) if rating_tag else "N/A"
            
            # append the extracted data to the book_list
            book_list.append({
                "Author": author,
                "Rating": rating
            })
        else:
            break

Here, we loop through the first three pages of Amazon’s Best Sellers list in the Teaching & Education category using a for loop. For each page, the script sends a GET request to retrieve the HTML content, parses it with BeautifulSoup to find the relevant book elements, and extracts the author and rating for each book, appending the data to a list. The condition if len(book_list) < 50: ensures that we stop once we’ve collected data for 50 books.

The loop breaks once 50 books have been processed, which ensures that only the top 50 authors and their ratings are captured.
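The extraction pattern (find a tag, fall back to “N/A” when it is missing) can be tested offline on a small hand-written HTML snippet. The markup below is an illustrative stand-in mimicking the structure the scraper looks for, not Amazon’s real page:

```python
from bs4 import BeautifulSoup

# hypothetical markup: two "book" cards, the second has no author link
html = """
<div class="zg-grid-general-faceout">
    <a class="a-size-small a-link-child">Jane Doe</a>
    <span class="a-icon-alt">4.5 out of 5 stars</span>
</div>
<div class="zg-grid-general-faceout">
    <span class="a-icon-alt">4.1 out of 5 stars</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
books = []
for book in soup.find_all("div", class_="zg-grid-general-faceout"):
    author_tag = book.find("a", class_="a-size-small a-link-child")
    rating_tag = book.find("span", class_="a-icon-alt")
    books.append({
        "Author": author_tag.get_text(strip=True) if author_tag else "N/A",
        "Rating": rating_tag.get_text(strip=True) if rating_tag else "N/A",
    })

print(books)  # second card falls back to "N/A" for the missing author
```

This is why some rows in the final dataset show “N/A”: not every card on the real page contains an author link with those classes.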

Step 4: Store and Save the Data

After collecting the data, we will store it in a Pandas DataFrame and save it to a CSV file:

# convert the list of dictionaries into a DataFrame
df = pd.DataFrame(book_list)

print(df.head())

# save the DataFrame to a CSV file
df.to_csv("amazon_top_50_books_authors_ratings.csv", index=False)
                    Author              Rating
0  Samapti Sinha Mahapatra  4.6 out of 5 stars
1                 Kautilya  4.5 out of 5 stars
2           एम लक्ष्मीकांत  4.4 out of 5 stars
3                 PR Yadav  4.4 out of 5 stars
4            Lori Gottlieb  4.6 out of 5 stars

Let’s have a look at some sample rows as well:

print(df.sample(10))
                          Author              Rating
9                     R.K. Gupta  4.5 out of 5 stars
49                           N/A  4.6 out of 5 stars
28       EduGorilla Prep Experts  4.2 out of 5 stars
12             Ishinna B. Sadana  4.9 out of 5 stars
33           RPH Editorial Board  4.2 out of 5 stars
18  Sujeet Yadav Janmejay Sahani  4.3 out of 5 stars
48                  Sanjay Kumar  3.9 out of 5 stars
42                           N/A  4.3 out of 5 stars
25            Wonder House Books  4.7 out of 5 stars
13  Professional Book Publishers  4.6 out of 5 stars

This method can be adapted for different categories or more extensive data collection by adjusting the page range or the conditions within the loop.
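For example, if you later want to analyze the ratings numerically, the “X out of 5 stars” strings can be converted to floats with pandas. This is a sketch using made-up rows in the same shape as the scraped data:

```python
import pandas as pd

# made-up rows in the same format as the scraped CSV
df = pd.DataFrame({
    "Author": ["Lori Gottlieb", "Sanjay Kumar", "N/A"],
    "Rating": ["4.6 out of 5 stars", "3.9 out of 5 stars", "N/A"],
})

# pull the leading number out of each rating string; "N/A" becomes NaN
df["RatingValue"] = pd.to_numeric(
    df["Rating"].str.extract(r"^(\d+\.?\d*)")[0], errors="coerce"
)

print(df["RatingValue"].mean())  # mean over the valid ratings only
```

With a numeric column, summary statistics like the mean rating per category become a one-liner.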

Summary

So, web scraping is a technique used to extract data from websites by sending requests to the server, retrieving the web pages, and parsing the HTML content to extract the necessary information. I hope you liked this article on data collection using Web Scraping from Amazon with Python. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.

Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
