Scraping Live Market Data for ML Pipelines

If you’re working on machine learning models for finance, you’ll quickly run into a common problem: finding clean, reliable, live market data without having to pay for pricey APIs. Scraping live market data is often the best solution, but most tutorials make it seem easier than it is or skip over the real challenges you’ll face in production.

In this article, I’ll show you how to do this with modern Python tools, making sure the process is clean, repeatable, and ready for use in a real ML pipeline.

Scraping Live Market Data: Getting Started

Here, we’re doing more than just scraping a webpage. We’re setting up the data ingestion part of an ML pipeline.


Here’s the flow we will be using:

  1. Fetch live market data from a website.
  2. Parse structured information (tables).
  3. Convert it into a DataFrame.
  4. Store it for downstream ML tasks.

We’ll use the Most Active Stocks page on Yahoo Finance as our data source. It’s open to everyone and has structured tables we can work with.

You only need a few lightweight libraries:

pip install requests beautifulsoup4 pandas lxml

Step 1: Fetch Live Market Data

We will start by requesting the webpage:

import requests

url = "https://finance.yahoo.com/most-active"

headers = {
    "User-Agent": "Mozilla/5.0"
}

# A timeout keeps the request from hanging indefinitely
response = requests.get(url, headers=headers, timeout=10)

# Raise requests.HTTPError for any 4xx/5xx response
response.raise_for_status()
html_content = response.text

Many websites reject requests that don’t look like they come from a real browser. Sending a browser-like User-Agent header makes the request less likely to be blocked, although it is not a guarantee.
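
Live pages also fail intermittently (timeouts, rate limits, temporary blocks), so a production fetch step usually retries before giving up. The sketch below is one minimal way to do that using only the standard library. The helper name `fetch_with_retry`, its parameters, and the doubling backoff schedule are illustrative assumptions, not part of requests.

```python
import time


def fetch_with_retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    fetch is any zero-argument callable that returns the page HTML or raises
    an exception on failure. The wait doubles after each failed attempt
    (base_delay, 2 * base_delay, ...). All names here are hypothetical.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Re-raise the last error once every attempt has been used
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Hypothetical usage with the request from Step 1:
# def fetch_page():
#     response = requests.get(url, headers=headers, timeout=10)
#     response.raise_for_status()
#     return response.text
#
# html_content = fetch_with_retry(fetch_page)
```

Passing the fetch step in as a callable keeps the retry logic independent of the HTTP library, which also makes it easy to test without touching the network.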

Step 2: Parse the HTML Table

Now we will extract the table containing stock data:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")

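# find() returns None when the page has no <table> (e.g. a consent or error page)
table = soup.find("table")
if table is None:
    raise ValueError("No table found in the page HTML")

rows = table.find_all("tr")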

By now, you’ve found the structure you need, but it’s still just raw HTML.

Step 3: Convert HTML to Structured Data

Now it’s time to turn messy HTML into something you can actually use:

import pandas as pd

data = []

for row in rows[1:]:  # skip header
    cols = row.find_all("td")
    cols = [col.text.strip() for col in cols]

    if cols:
        data.append(cols)

columns = [th.text.strip() for th in rows[0].find_all("th")]

df = pd.DataFrame(data, columns=columns)

print(df.head())
  Symbol                     Name                  Price Change Change %  \
0   INTC        Intel Corporation   99.62 +5.14 (+5.44%)  +5.14   +5.44%
1    NOK                Nokia Oyj   13.30 +0.39 (+3.02%)  +0.39   +3.02%
2   NVDA       NVIDIA Corporation  198.45 -1.12 (-0.56%)  -1.12   -0.56%
3   GRAB    Grab Holdings Limited    3.67 -0.15 (-3.93%)  -0.15   -3.93%
4   SOFI  SoFi Technologies, Inc.   16.43 +0.33 (+2.05%)  +0.33   +2.05%

     Volume Avg Vol (3M) Market Cap P/E Ratio (TTM) 52 Wk Change %  \
0  146.979M     105.404M    500.69B              --       +391.47%
1  137.133M      69.883M    74.248B           76.62       +165.47%
2  110.589M     175.236M     4.822T           40.73        +74.35%
3   95.244M       48.88M    15.049B           63.67        -24.49%
4   73.634M      67.399M    21.054B           36.28        +27.76%

     52 Wk Range
0   18.97 100.45
1     4.00 13.89
2  110.82 216.83
3      3.48 6.62
4    12.43 32.73

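At this stage, you have a proper table, which is the shape your ML pipeline needs. Keep in mind that every value is still a string (for example, "146.979M" or "+5.44%"), so the columns have to be converted to numeric types before they can be used as model features.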
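
The scraped cells are display strings such as "146.979M", "+5.44%", or "--", so they need converting before a model can use them. Here is a minimal sketch of that conversion in plain Python. `parse_market_number` is a hypothetical helper, and the suffix set and the "--" placeholder are assumptions based on the sample output above.

```python
# Multipliers for the magnitude suffixes seen in the sample output (assumed set)
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_market_number(text):
    """Convert a display string like '146.979M' or '+5.44%' to a float.

    Returns None for empty cells or the '--' placeholder. Percentages are
    returned in percent units (e.g. '+5.44%' -> 5.44), not as fractions.
    """
    value = text.strip().replace(",", "")
    if value in ("", "--"):
        return None
    if value.endswith("%"):
        return float(value[:-1])
    multiplier = SUFFIXES.get(value[-1].upper())
    if multiplier is not None:
        return float(value[:-1]) * multiplier
    return float(value)


# Hypothetical usage on the DataFrame from Step 3:
# df["Volume"] = df["Volume"].map(parse_market_number)
```

Cells that combine several values, like the Price column in the sample output, would need to be split first (for example on whitespace) before each part is passed to the helper.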

Step 4: Save the Data

Next, we’ll save the raw data. This is a key part of how real pipelines are built:

from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"market_data_{timestamp}.csv"

df.to_csv(filename, index=False)

print(f"Data saved to {filename}")
Data saved to market_data_20260502_125119.csv

This might seem like a small detail, but it’s an important design choice:

  1. You preserve original data for debugging.
  2. You can reprocess it later with better logic.
  3. You avoid re-scraping if something breaks.
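
These benefits depend on snapshots being easy to find again. One possible layout, sketched here with only the standard library, writes each scrape to its own timestamped file in a raw-data folder and reloads the newest one on demand. The folder name `raw_data` and the helpers `save_snapshot` and `load_latest_snapshot` are hypothetical.

```python
import csv
from datetime import datetime
from pathlib import Path


def save_snapshot(columns, rows, directory="raw_data"):
    """Write one scrape to a new timestamped CSV and return its path."""
    folder = Path(directory)
    folder.mkdir(parents=True, exist_ok=True)
    # Microseconds keep names unique when scrapes run close together
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    path = folder / f"market_data_{timestamp}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)
    return path


def load_latest_snapshot(directory="raw_data"):
    """Return (columns, rows) from the newest snapshot, or None if none exist."""
    # Timestamped names sort chronologically, so the last one is the newest
    paths = sorted(Path(directory).glob("market_data_*.csv"))
    if not paths:
        return None
    with paths[-1].open(newline="", encoding="utf-8") as handle:
        records = list(csv.reader(handle))
    return records[0], records[1:]
```

With the variables from Step 3, this could be called as `save_snapshot(columns, data)`. If a later processing step fails, `load_latest_snapshot()` returns the stored rows so you can rerun that step without scraping again.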

Closing Thoughts

Scraping live market data for ML pipelines isn’t only about collecting data. It’s also about creating a reliable way for your system to get that data.

If you set up this layer well, even with a simple script, you’re already moving beyond notebooks and getting closer to how real ML systems work.

I hope you found this article on scraping live market data for ML pipelines helpful.


Aman Kharwal

AI/ML Engineer | Published Author. My aim is to decode data science for the real world in the most simple words.
