Scraping Live Market Data for ML Pipelines

If you’re working on machine learning models for finance, you’ll quickly run into a common problem: finding clean, reliable, live market data without paying for pricey APIs. Scraping is often the most practical way to get started, but most tutorials make it look easier than it is or skip over the challenges you’ll face in production.

In this article, I’ll show you how to do this with modern Python tools, making sure the process is clean, repeatable, and ready for use in a real ML pipeline.

Scraping Live Market Data: Getting Started

Here, we’re doing more than just scraping a webpage. We’re setting up the data ingestion part of an ML pipeline.

If you want to go beyond individual pipelines and learn how to build complete AI systems, I’ve covered it step-by-step in my book: Hands-On GenAI, LLMs & AI Agents.

Here’s the flow we will be using:

1. Fetch live market data from a website.
2. Parse structured information (tables).
3. Convert it into a DataFrame.
4. Store it for downstream ML tasks.

We’ll use the Most Active Stocks page on Yahoo Finance as our data source. It’s open to everyone and has structured tables we can work with.

You only need a few lightweight libraries:

pip install requests beautifulsoup4 pandas lxml

Step 1: Fetch Live Market Data

We will start by requesting the webpage:

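import requests

url = "https://finance.yahoo.com/most-active"

# Many sites reject requests that do not look like they come from a browser
headers = {
    "User-Agent": "Mozilla/5.0"
}

# A timeout keeps the pipeline from hanging on a stalled connection
response = requests.get(url, headers=headers, timeout=10)

if response.status_code == 200:
    html_content = response.text
else:
    raise RuntimeError(f"Failed to fetch data (HTTP {response.status_code})")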

A lot of websites block requests if they don’t seem to come from a real browser. Adding a User-Agent header helps you get around this problem.
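Even with a browser-like header, a live page will sometimes time out or return a temporary error. Here is a minimal sketch of retrying with exponential backoff. The helper name fetch_with_retries and its parameters are my own assumptions, not part of requests; it accepts any zero-argument callable, so it is not tied to a particular HTTP library.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def fetch_with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch() until it succeeds or the attempts run out (hypothetical helper)."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            # Re-raise the original error once the last attempt has failed.
            if attempt == attempts - 1:
                raise
            # Wait 1x, 2x, 4x ... the base delay before trying again.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


# Possible usage with the request from Step 1 (needs network access):
# html_content = fetch_with_retries(
#     lambda: requests.get(url, headers=headers, timeout=10).text
# )
```

Passing sleep as a parameter is a small design choice that lets you test the retry logic without actually waiting.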

Step 2: Parse the HTML Table

Now we will extract the table containing stock data:

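from bs4 import BeautifulSoup

soup = BeautifulSoup(html_content, "lxml")

# find() returns the first <table> on the page, or None if there is none
table = soup.find("table")

if table is None:
    raise ValueError("No table found in the page")

rows = table.find_all("tr")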

By now, you’ve found the structure you need, but it’s still just raw HTML.
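One thing to keep in mind is that soup.find("table") simply returns the first table on the page, which may stop being the right one if the layout changes. Here is a small sketch that picks a table by the column names it contains. The function name find_table_by_headers and the sample HTML are made up for illustration, and I use the built-in "html.parser" so the sample runs without lxml.

```python
from bs4 import BeautifulSoup


def find_table_by_headers(html, required_headers):
    """Return the first <table> whose header cells include every required name."""
    soup = BeautifulSoup(html, "html.parser")
    for table in soup.find_all("table"):
        headers = {th.get_text(strip=True) for th in table.find_all("th")}
        if set(required_headers) <= headers:
            return table
    return None


# Made-up sample page with an unrelated table before the one we want.
sample_html = """
<table><tr><th>Ad</th></tr><tr><td>ignore me</td></tr></table>
<table>
  <tr><th>Symbol</th><th>Name</th><th>Price</th></tr>
  <tr><td>AAA</td><td>Example Corp</td><td>10.00</td></tr>
</table>
"""

table = find_table_by_headers(sample_html, ["Symbol", "Price"])
first_row = [td.get_text(strip=True) for td in table.find_all("tr")[1].find_all("td")]
print(first_row)
```

Returning None when nothing matches lets the caller decide whether a missing table should stop the pipeline or just be logged.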

Step 3: Convert HTML to Structured Data

Now it’s time to turn messy HTML into something you can actually use:

import pandas as pd

# Column names come from the header row
columns = [th.text.strip() for th in rows[0].find_all("th")]

data = []

for row in rows[1:]:  # skip header
    cols = [col.text.strip() for col in row.find_all("td")]

    # Keep only rows whose cell count matches the header
    if len(cols) == len(columns):
        data.append(cols)

df = pd.DataFrame(data, columns=columns)

print(df.head())

  Symbol                     Name                     Price Change Change %  \
0   INTC        Intel Corporation     99.62  +5.14 (+5.44%)  +5.14   +5.44%   
1    NOK                Nokia Oyj     13.30  +0.39 (+3.02%)  +0.39   +3.02%   
2   NVDA       NVIDIA Corporation    198.45  -1.12 (-0.56%)  -1.12   -0.56%   
3   GRAB    Grab Holdings Limited      3.67  -0.15 (-3.93%)  -0.15   -3.93%   
4   SOFI  SoFi Technologies, Inc.     16.43  +0.33 (+2.05%)  +0.33   +2.05%   

     Volume Avg Vol (3M) Market Cap P/E Ratio (TTM) 52 Wk Change %  \
0  146.979M     105.404M    500.69B              --       +391.47%   
1  137.133M      69.883M    74.248B           76.62       +165.47%   
2  110.589M     175.236M     4.822T           40.73        +74.35%   
3   95.244M       48.88M    15.049B           63.67        -24.49%   
4   73.634M      67.399M    21.054B           36.28        +27.76%   

     52 Wk Range  
0   18.97 100.45  
1     4.00 13.89  
2  110.82 216.83  
3      3.48 6.62  
4    12.43 32.73

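At this stage, you have a proper table, which is the shape your ML pipeline needs. Keep in mind that every value is still a string (for example, 146.979M or +5.44%), so the numeric columns need converting before you use them as features.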
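Since the scraped cells are text, here is a minimal sketch of turning values like the ones in the output above into numbers. The function name parse_market_number is my own, and I am assuming the K/M/B/T suffixes mean thousand, million, billion, and trillion, and that "--" marks a missing value. Percentages are kept in percentage points rather than converted to fractions.

```python
from typing import Optional

# Assumed meaning of the abbreviations used in the scraped table.
SUFFIXES = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}


def parse_market_number(text: str) -> Optional[float]:
    """Convert strings such as '146.979M', '+5.44%' or '--' to a float or None."""
    cleaned = text.strip().replace(",", "").replace("%", "")
    if cleaned in ("", "--"):
        return None
    multiplier = 1.0
    suffix = cleaned[-1].upper()
    if suffix in SUFFIXES:
        multiplier = SUFFIXES[suffix]
        cleaned = cleaned[:-1]
    return float(cleaned) * multiplier


for raw in ["146.979M", "4.822T", "+5.44%", "-0.56%", "--"]:
    print(raw, "->", parse_market_number(raw))

# Possible usage on the DataFrame from Step 3:
# df["Volume"] = df["Volume"].map(parse_market_number)
```

Keeping this as a separate step, applied after the raw table is built, means the original strings stay untouched if the parsing rules need to change later.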

Step 4: Save the Data

Next, we’ll save the raw data. This is a key part of how real pipelines are built:

from datetime import datetime

timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"market_data_{timestamp}.csv"

df.to_csv(filename, index=False)

print(f"Data saved to {filename}")

Data saved to market_data_20260502_125119.csv

This might seem like a small detail, but it’s an important design choice:

You preserve original data for debugging.
You can reprocess it later with better logic.
You avoid re-scraping if something breaks.
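To make those benefits concrete, here is one possible way to keep every scrape as its own file and find the newest one later. The function names and the data/raw directory are assumptions for illustration. I use the standard csv module so the sketch runs on its own; in the pipeline above, df.to_csv(path, index=False) would take the place of the writer calls.

```python
import csv
from datetime import datetime, timezone
from pathlib import Path


def save_snapshot(rows, columns, raw_dir="data/raw"):
    """Write one scrape to its own timestamped CSV and return the path."""
    directory = Path(raw_dir)
    directory.mkdir(parents=True, exist_ok=True)
    # UTC keeps file names sortable and unambiguous across machines.
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    path = directory / f"market_data_{timestamp}.csv"
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(columns)
        writer.writerows(rows)
    return path


def latest_snapshot(raw_dir="data/raw"):
    """Return the newest snapshot path, or None if nothing has been saved."""
    files = sorted(Path(raw_dir).glob("market_data_*.csv"))
    return files[-1] if files else None
```

Because the timestamp format sorts alphabetically in time order, a plain sorted() is enough to find the latest file for reprocessing.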

Closing Thoughts

Scraping live market data for ML pipelines isn’t only about collecting data. It’s also about creating a reliable way for your system to get that data.

If you set up this layer well, even with a simple script, you’re already moving beyond notebooks and getting closer to how real ML systems work.

I hope you found this article on scraping live market data for ML pipelines helpful.

For more AI and machine learning tips, follow me on Instagram. My book, Hands-On GenAI, LLMs & AI Agents, can also help you grow your AI career.

Aman Kharwal
