Web scraping is a technique used to extract data from websites by sending requests to the server, retrieving the web pages, and parsing the HTML content to extract the necessary information. If you are learning Data Science, you should know how to collect data from APIs or through web scraping. So, in this article, I’ll take you through a step-by-step tutorial on data collection from Amazon using web scraping with Python.
Web Scraping from Amazon with Python
In this tutorial, I will walk through the process of web scraping from Amazon’s Best Sellers page in the Teaching & Education category to collect data about the top 50 authors and their ratings. Before we start, ensure you have the following Python libraries installed:
- requests: to send HTTP requests and retrieve the web pages.
- BeautifulSoup: to parse and extract information from the HTML content.
- pandas: to organize and save the extracted data in a tabular format.
pandas and requests are already available in your Python environment if you are using Google Colab or a Jupyter notebook. Use the command below in your notebook to install BeautifulSoup (the code later also uses the lxml parser, which you can install the same way with !pip install lxml if it is not already available):
- !pip install beautifulsoup4
Now, let’s get started with web scraping from Amazon.
Step 1: Understanding the Target URL and Pagination
We are targeting the Amazon Best Sellers page in the Teaching & Education category. Amazon’s pagination allows us to navigate through multiple pages of results. The base URL for the first page looks like this:
https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_1?ie=UTF8&pg=1
Notice that the page number appears twice in the URL: in the zg_bs_pg_1 path segment and in the pg=1 query parameter. We will increment both values to navigate through the pages.
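Since the page number appears in two places, a URL template with two placeholders keeps them in sync. Here is a minimal sketch of the URL construction used in the steps below (the node ID 4149461031 comes from the tutorial's target URL):

```python
# base URL with two placeholders for the page number:
# one in the path segment, one in the query parameter
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"

# fill both placeholders with the same page number for pages 1-3
urls = [base_url.format(page, page) for page in range(1, 4)]
for url in urls:
    print(url)
```

Printing the list confirms that each URL carries a consistent page number in both positions.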
Step 2: Set Up the HTTP Request
To scrape content from Amazon, we first need to send a request to the server and retrieve the HTML content of the page. We also need to mimic a real browser to avoid being blocked, which is why we include a User-Agent header in the request. Here’s how to set up the HTTP request:
import requests
from bs4 import BeautifulSoup
import pandas as pd
# base url of the best sellers page for teaching & education books
base_url = "https://www.amazon.in/gp/bestsellers/books/4149461031/ref=zg_bs_pg_{}?ie=UTF8&pg={}"
# http headers to mimic a browser visit
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3"
}
Step 3: Iterate Over Pages to Collect Data
Now, we will loop through the first three pages to collect data for the top 50 books (assuming each page displays around 20 items). On each page, we will extract the author’s name and rating:
# initialize a list to store book data
book_list = []
# iterate over the first 3 pages to get top 50 books (assuming each page has about 20 items)
for page in range(1, 4):
    # construct the URL for the current page
    url = base_url.format(page, page)
    # send a GET request to the url
    response = requests.get(url, headers=headers)
    # parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(response.content, "lxml")
    # find all the book elements
    books = soup.find_all("div", {"class": "zg-grid-general-faceout"})
    # iterate over each book element to extract data
    for book in books:
        if len(book_list) < 50:  # stop once we've collected 50 books
            author_tag = book.find("a", class_="a-size-small a-link-child")
            rating_tag = book.find("span", class_="a-icon-alt")
            author = author_tag.get_text(strip=True) if author_tag else "N/A"
            rating = rating_tag.get_text(strip=True) if rating_tag else "N/A"
            # append the extracted data to the book_list
            book_list.append({
                "Author": author,
                "Rating": rating
            })
        else:
            break
Here, the code iterates through the first three pages of Amazon’s Best Sellers list in the Teaching & Education category. For each page, it sends a GET request to retrieve the HTML content, then parses this content using BeautifulSoup to find the relevant book elements. It extracts the author and rating for each book, falling back to "N/A" when an element is missing, and appends the data to a list. The condition if len(book_list) < 50: ensures that we stop once we’ve collected data for 50 books.
The loop breaks once 50 books have been processed, which ensures that only the top 50 authors and their ratings are captured.
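The extraction logic above can also be factored into a small helper that works on raw HTML, which makes it easy to test without sending requests to Amazon. This is an illustrative sketch, not part of the original code: the parse_books function and the sample markup are assumptions, but the class names are the ones used in Step 3.

```python
from bs4 import BeautifulSoup

def parse_books(html):
    # "html.parser" is Python's built-in parser; the tutorial's "lxml" works too
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for book in soup.find_all("div", {"class": "zg-grid-general-faceout"}):
        author_tag = book.find("a", class_="a-size-small a-link-child")
        rating_tag = book.find("span", class_="a-icon-alt")
        results.append({
            "Author": author_tag.get_text(strip=True) if author_tag else "N/A",
            "Rating": rating_tag.get_text(strip=True) if rating_tag else "N/A",
        })
    return results

# minimal HTML snippet mimicking the class names on the best sellers page
sample = """
<div class="zg-grid-general-faceout">
  <a class="a-size-small a-link-child">Jane Doe</a>
  <span class="a-icon-alt">4.5 out of 5 stars</span>
</div>
"""
print(parse_books(sample))
# → [{'Author': 'Jane Doe', 'Rating': '4.5 out of 5 stars'}]
```

Separating parsing from fetching this way also means a change in Amazon's markup only needs to be fixed in one place.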
Step 4: Store and Save the Data
After collecting the data, we will store it in a Pandas DataFrame and save it to a CSV file:
# convert the list of dictionaries into a DataFrame
df = pd.DataFrame(book_list)
print(df.head())
# save the DataFrame to a CSV file
df.to_csv("amazon_top_50_books_authors_ratings.csv", index=False)
Author Rating
0 Samapti Sinha Mahapatra 4.6 out of 5 stars
1 Kautilya 4.5 out of 5 stars
2 एम लक्ष्मीकांत 4.4 out of 5 stars
3 PR Yadav 4.4 out of 5 stars
4 Lori Gottlieb 4.6 out of 5 stars
Let’s have a look at some sample rows as well:
print(df.sample(10))
Author Rating
9 R.K. Gupta 4.5 out of 5 stars
49 N/A 4.6 out of 5 stars
28 EduGorilla Prep Experts 4.2 out of 5 stars
12 Ishinna B. Sadana 4.9 out of 5 stars
33 RPH Editorial Board 4.2 out of 5 stars
18 Sujeet Yadav Janmejay Sahani 4.3 out of 5 stars
48 Sanjay Kumar 3.9 out of 5 stars
42 N/A 4.3 out of 5 stars
25 Wonder House Books 4.7 out of 5 stars
13 Professional Book Publishers 4.6 out of 5 stars
This method can be adapted for different categories or more extensive data collection by adjusting the page range or the conditions within the loop.
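One way to make that adaptation convenient is to wrap the scraper in a function that takes the category node ID and an item limit. This is a hedged sketch, not code from the tutorial: scrape_bestsellers and build_url are hypothetical names, and node IDs for categories other than Teaching & Education would need to be looked up on Amazon.

```python
import requests
from bs4 import BeautifulSoup

def build_url(node_id, page):
    # the page number appears twice: in the path segment and the query string
    return (f"https://www.amazon.in/gp/bestsellers/books/{node_id}"
            f"/ref=zg_bs_pg_{page}?ie=UTF8&pg={page}")

def scrape_bestsellers(node_id, max_items=50, pages=3):
    # same User-Agent header as in Step 2, to mimic a browser visit
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/58.0.3029.110 Safari/537.3"
    }
    books = []
    for page in range(1, pages + 1):
        response = requests.get(build_url(node_id, page), headers=headers)
        soup = BeautifulSoup(response.content, "lxml")
        for book in soup.find_all("div", {"class": "zg-grid-general-faceout"}):
            if len(books) >= max_items:
                return books  # stop as soon as the limit is reached
            author_tag = book.find("a", class_="a-size-small a-link-child")
            rating_tag = book.find("span", class_="a-icon-alt")
            books.append({
                "Author": author_tag.get_text(strip=True) if author_tag else "N/A",
                "Rating": rating_tag.get_text(strip=True) if rating_tag else "N/A",
            })
    return books

# 4149461031 is the Teaching & Education node ID from this tutorial;
# pass another category's node ID to scrape a different best sellers list:
# books = scrape_bestsellers(4149461031, max_items=50)
```

Returning from inside the inner loop also avoids fetching extra pages once the limit is reached, which the original break statement did not prevent.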
Summary
So, web scraping is a technique used to extract data from websites by sending requests to the server, retrieving the web pages, and parsing the HTML content to extract the necessary information. I hope you liked this article on data collection using web scraping from Amazon with Python. Feel free to ask your valuable questions in the comments section below. You can follow me on Instagram for many more resources.