If you’re working on machine learning models for finance, you’ll quickly run into a common problem: finding clean, reliable, live market data without having to pay for pricey APIs. Scraping live market data is often the best solution, but most tutorials make it seem easier than it is or skip over the real challenges you’ll face in production.
In this article, I’ll show you how to do this with modern Python tools, making sure the process is clean, repeatable, and ready for use in a real ML pipeline.
Scraping Live Market Data: Getting Started
Here, we’re doing more than just scraping a webpage. We’re setting up the data ingestion part of an ML pipeline.
If you want to go beyond individual pipelines and learn how to build complete AI systems, I’ve covered it step-by-step in my book: Hands-On GenAI, LLMs & AI Agents.
Here’s the flow we will be using:
- Fetch live market data from a website.
- Parse structured information (tables).
- Convert it into a DataFrame.
- Store it for downstream ML tasks.
We’ll use the Most Active Stocks page on Yahoo Finance as our data source. It’s open to everyone and has structured tables we can work with.
You only need a few lightweight libraries:
pip install requests beautifulsoup4 pandas lxml
Step 1: Fetch Live Market Data
We will start by requesting the webpage:
import requests
url = "https://finance.yahoo.com/most-active"
headers = {
    "User-Agent": "Mozilla/5.0"
}
response = requests.get(url, headers=headers)
if response.status_code == 200:
    html_content = response.text
else:
    raise Exception("Failed to fetch data")

A lot of websites block requests if they don’t seem to come from a real browser. Adding a User-Agent header helps you get around this problem.
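The snippet above fails permanently on the first hiccup and can hang indefinitely if the server stalls, which matters once the scrape runs unattended. A minimal, more defensive sketch with a timeout and simple retries; `fetch_html`, `retries`, and `backoff` are illustrative names of my own, not part of the original script:

```python
import time

import requests


def fetch_html(url, retries=3, backoff=2.0):
    """Fetch a page, retrying transient failures with a simple linear backoff."""
    headers = {"User-Agent": "Mozilla/5.0"}
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # raise on 4xx/5xx instead of returning bad HTML
            return response.text
        except requests.RequestException:
            if attempt == retries - 1:
                raise  # out of attempts: surface the real error
            time.sleep(backoff * (attempt + 1))
```

The `timeout=10` keeps a stalled connection from blocking the whole pipeline, and `raise_for_status()` turns HTTP errors into exceptions you can log and retry.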
Step 2: Parse the HTML Table
Now we will extract the table containing stock data:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, "lxml")
table = soup.find("table")
rows = table.find_all("tr")

By now, you’ve found the structure you need, but it’s still just raw HTML.
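One thing worth knowing: `soup.find("table")` quietly returns `None` if the page layout changes, which then surfaces as a confusing `AttributeError` one line later. Here is a self-contained sketch of a guard; the tiny sample HTML is invented for illustration so the snippet runs without hitting Yahoo Finance:

```python
from bs4 import BeautifulSoup

# A stand-in page so this snippet runs offline; the real script parses html_content.
sample_html = """
<table>
  <tr><th>Symbol</th><th>Price</th></tr>
  <tr><td>INTC</td><td>99.62</td></tr>
</table>
"""

soup = BeautifulSoup(sample_html, "html.parser")  # "lxml" works the same way
table = soup.find("table")
if table is None:
    # Site layouts change without warning; failing loudly beats parsing the wrong element.
    raise ValueError("No <table> found - the page layout may have changed")

rows = table.find_all("tr")
cells = [td.text.strip() for td in rows[1].find_all("td")]
print(cells)  # ['INTC', '99.62']
```

In production, that `ValueError` is your early-warning signal that the scraper needs updating.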
Step 3: Convert HTML to Structured Data
Now it’s time to turn messy HTML into something you can actually use:
import pandas as pd
data = []
for row in rows[1:]:  # skip header
    cols = row.find_all("td")
    cols = [col.text.strip() for col in cols]
    if cols:
        data.append(cols)
columns = [th.text.strip() for th in rows[0].find_all("th")]
df = pd.DataFrame(data, columns=columns)
print(df.head())

Symbol Name Price Change Change % \
0 INTC Intel Corporation 99.62 +5.14 (+5.44%) +5.14 +5.44%
1 NOK Nokia Oyj 13.30 +0.39 (+3.02%) +0.39 +3.02%
2 NVDA NVIDIA Corporation 198.45 -1.12 (-0.56%) -1.12 -0.56%
3 GRAB Grab Holdings Limited 3.67 -0.15 (-3.93%) -0.15 -3.93%
4 SOFI SoFi Technologies, Inc. 16.43 +0.33 (+2.05%) +0.33 +2.05%
Volume Avg Vol (3M) Market Cap P/E Ratio (TTM) 52 Wk Change % \
0 146.979M 105.404M 500.69B -- +391.47%
1 137.133M 69.883M 74.248B 76.62 +165.47%
2 110.589M 175.236M 4.822T 40.73 +74.35%
3 95.244M 48.88M 15.049B 63.67 -24.49%
4 73.634M 67.399M 21.054B 36.28 +27.76%
52 Wk Range
0 18.97 100.45
1 4.00 13.89
2 110.82 216.83
3 3.48 6.62
4 12.43 32.73
At this stage, you have a proper table. This is the format your ML pipeline needs.
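Note that every column in the scraped frame is still a string, so values like "146.979M" or "--" need converting before a model can use them. A rough sketch of a converter for Yahoo's formatting conventions (suffix multipliers, percent signs, "--" for missing values); `parse_number` is an illustrative helper I'm adding, not part of the original script:

```python
def parse_number(text):
    """Convert Yahoo-style strings like '146.979M', '+5.44%', or '--' to floats."""
    text = text.strip().replace(",", "")
    if text in ("--", ""):
        return None  # Yahoo uses '--' for missing values such as P/E Ratio
    if text.endswith("%"):
        return float(text.rstrip("%"))
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9, "T": 1e12}
    if text[-1] in multipliers:
        return float(text[:-1]) * multipliers[text[-1]]
    return float(text)


print(parse_number("+5.44%"))  # 5.44
print(parse_number("--"))      # None
```

You can then apply it column by column, e.g. `df["Volume"].map(parse_number)`, keeping the raw strings around in case the parsing logic needs revisiting.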
Step 4: Save the Data
Next, we’ll save the raw data. This is a key part of how real pipelines are built:
from datetime import datetime
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
filename = f"market_data_{timestamp}.csv"
df.to_csv(filename, index=False)
print(f"Data saved to {filename}")

Data saved to market_data_20260502_125119.csv
This might seem like a small detail, but it’s an important design choice:
- You preserve original data for debugging.
- You can reprocess it later with better logic.
- You avoid re-scraping if something breaks.
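A nice side effect of the `%Y%m%d_%H%M%S` naming scheme above is that filenames sort lexicographically in timestamp order, so downstream steps can find the newest snapshot without parsing dates. A small sketch with made-up filenames; in a real pipeline you would feed it from `glob.glob("market_data_*.csv")`:

```python
def latest_snapshot(paths):
    """Return the newest market_data_YYYYMMDD_HHMMSS.csv from a list of paths."""
    # The timestamp format sorts lexicographically, so max() on the name suffices.
    return max(paths)


files = [
    "market_data_20260501_093000.csv",
    "market_data_20260502_125119.csv",
    "market_data_20260430_160500.csv",
]
print(latest_snapshot(files))  # market_data_20260502_125119.csv
```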
Closing Thoughts
Scraping live market data for ML pipelines isn’t only about collecting data. It’s also about creating a reliable way for your system to get that data.
If you set up this layer well, even with a simple script, you’re already moving beyond notebooks and getting closer to how real ML systems work.
I hope you found this article on scraping live market data for ML pipelines helpful.
For more AI and machine learning tips, follow me on Instagram. My book, Hands-On GenAI, LLMs & AI Agents, can also help you grow your AI career.