Document analysis refers to extracting, interpreting, and understanding the information contained within a document. Traditionally, this involved manual review or simple keyword-based techniques. With the rise of Large Language Models (LLMs) like GPT and BERT, these models have become the preferred approach to document analysis because they can comprehend context, generate summaries, answer questions, and identify key insights efficiently. So, if you want to learn how to analyze documents using LLMs, this article is for you. In this article, I’ll take you through the task of document analysis using LLMs with Python.
Document Analysis using LLMs with Python
For the task of document analysis using LLMs, I’ll be using a document that contains the Terms of Service for the services offered by Google. You can download this document from here.
Step 1: Extract Text from the PDF
The first step in document analysis is extracting the content from a PDF file. We can use libraries like pdfplumber to open and read the text from each page of the PDF and save it into a .txt file for further analysis. You can install pdfplumber in your Python environment using the command: pip install pdfplumber. Here’s how to extract text from the PDF:
import pdfplumber
pdf_path = "/content/google_terms_of_service_en_in.pdf"
output_text_file = "extracted_text.txt"
with pdfplumber.open(pdf_path) as pdf:
    extracted_text = ""
    for page in pdf.pages:
        # extract_text() can return None for pages with no text, so guard against it
        page_text = page.extract_text() or ""
        extracted_text += page_text + "\n"

with open(output_text_file, "w") as text_file:
    text_file.write(extracted_text)

print(f"Text extracted and saved to {output_text_file}")

Text extracted and saved to extracted_text.txt
The extracted text is stored in the variable extracted_text, which is then saved to a file for later use.
Step 2: Preview the Extracted Text
After extracting the text, it’s essential to preview the content to ensure everything is correctly captured. This allows you to check for any formatting issues or missing content:
# reading the extracted text
with open("/content/extracted_text.txt", "r") as file:
    document_text = file.read()

# preview the document content
print(document_text[:500])  # preview the first 500 characters

GOOGLE TERMS OF SERVICE
Effective May 22, 2024 | Archived versions
What’s covered in these terms
We know it’s tempting to skip these Terms of
Service, but it’s important to establish what you
can expect from us as you use Google services,
and what we expect from you.
These Terms of Service reect the way Google’s business works, the laws that apply to
our company, and certain things we’ve always believed to be true. As a result, these Terms
of Service help dene Google’s relationship with you as
Step 3: Summarize the Document
To get a high-level overview of the document, you can use a pre-trained summarization model like t5-small. This allows you to condense large pieces of text into shorter summaries, which helps you grasp the most important information. Here’s how to summarize the document:
from transformers import pipeline
# load the summarization pipeline
summarizer = pipeline("summarization", model="t5-small")
# summarize the document text (you can summarize parts if the document is too large)
summary = summarizer(document_text[:1000], max_length=150, min_length=30, do_sample=False)
print("Summary:", summary[0]['summary_text'])Summary: these Terms of Service reect the way Google’s business works, the laws that apply to our company, and certain things we’ve always believed to be true . these terms include: what you can expect from us, which describes how we provide and develop our services What we expect from you, which establishes certain rules for using our services Content in Google services .
The pipeline("summarization", model="t5-small") call sets up the summarization model using T5-small, a pre-trained transformer model designed for text summarization. document_text[:1000] specifies the portion of the text to summarize (the first 1,000 characters), while max_length=150 and min_length=30 control the maximum and minimum length of the summary in tokens. The do_sample=False parameter ensures deterministic output, meaning the model will not randomly sample from possible summaries but will return the same result every time.
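The t5-small model can only attend to a limited number of input tokens (roughly 512), which is why the example above summarizes only the first 1,000 characters. If you want a summary of the entire document, one option is to summarize it chunk by chunk and join the partial summaries. Below is a minimal sketch of that idea, assuming the document_text and summarizer objects defined above; the 1,000-character chunk size is an arbitrary choice you may need to tune:

# summarize the full document in chunks and join the partial summaries
chunk_size = 1000  # characters per chunk (arbitrary choice; tune as needed)
chunks = [document_text[i:i + chunk_size] for i in range(0, len(document_text), chunk_size)]

partial_summaries = []
for chunk in chunks:
    result = summarizer(chunk, max_length=100, min_length=20, do_sample=False)
    partial_summaries.append(result[0]['summary_text'])

full_summary = " ".join(partial_summaries)
print("Full document summary:", full_summary[:500])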
Step 4: Split the Document into Sentences and Passages
For more detailed analysis, like question generation, it’s important to split the document into smaller chunks. This step tokenizes the document into sentences and combines them into manageable passages for subsequent steps. Here’s how to split the document into sentences and passages:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
# split text into sentences
sentences = sent_tokenize(document_text)
# combine sentences into passages
passages = []
current_passage = ""
for sentence in sentences:
    # start a new passage once adding the next sentence would exceed the word limit
    if len(current_passage.split()) + len(sentence.split()) < 200:  # adjust the word limit as needed
        current_passage += " " + sentence
    else:
        passages.append(current_passage.strip())
        current_passage = sentence

# add the final passage if it is not empty
if current_passage:
    passages.append(current_passage.strip())

In this part of the code, we use the NLTK library to split the extracted document text into individual sentences with the sent_tokenize() function. Then, we combine these sentences into passages with a limit of roughly 200 words each. This helps ensure that each passage is of a suitable length for further processing by language models, which often have token limits. Whenever adding the next sentence would push the current passage over the word limit, the current passage is appended to the passages list and a new one is started with that sentence; this continues until all sentences are grouped into passages.
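Before moving on to question generation, it can help to check how the document was chunked. Here is a small sanity check, assuming the passages list built above:

# quick sanity check on the chunking
print(f"Number of passages: {len(passages)}")
for i, p in enumerate(passages[:3]):
    print(f"Passage {i+1}: {len(p.split())} words")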
Step 5: Generate Questions from the Passages Using LLMs
The next step is to generate questions based on the document’s content. This helps in understanding key information points and can be used to check the comprehension of the document. Here’s how to generate questions from passages using LLMs:
# load the question generation pipeline
qg_pipeline = pipeline("text2text-generation", model="valhalla/t5-base-qg-hl")
# function to generate questions using the pipeline
def generate_questions_pipeline(passage, min_questions=3):
    input_text = f"generate questions: {passage}"
    results = qg_pipeline(input_text)
    questions = results[0]['generated_text'].split('<sep>')
    # keep only non-empty questions
    questions = [q.strip() for q in questions if q.strip()]
    # if fewer than min_questions, try to regenerate from smaller parts of the passage
    if len(questions) < min_questions:
        passage_sentences = passage.split('. ')
        for i in range(len(passage_sentences)):
            if len(questions) >= min_questions:
                break
            additional_input = ' '.join(passage_sentences[i:i+2])
            additional_results = qg_pipeline(f"generate questions: {additional_input}")
            additional_questions = additional_results[0]['generated_text'].split('<sep>')
            questions.extend([q.strip() for q in additional_questions if q.strip()])
    return questions[:min_questions]  # return only the first min_questions questions

# generate questions from passages
for idx, passage in enumerate(passages):
    questions = generate_questions_pipeline(passage)
    print(f"Passage {idx+1}:\n{passage}\n")
    print("Generated Questions:")
    for q in questions:
        print(f"- {q}")
    print(f"\n{'-'*50}\n")

In this part of the code, we use a question generation model (the T5-based valhalla/t5-base-qg-hl) from the Hugging Face transformers library to automatically generate questions from text passages. The function generate_questions_pipeline() takes a text passage as input and produces a list of questions. We aim to generate at least three questions per passage; if the first pass produces fewer, we split the passage into smaller parts and generate additional questions from those. This approach ensures comprehensive question generation for each passage, and we print the questions along with the corresponding passage for review. Below is the output for passage 1:
Passage 1:
GOOGLE TERMS OF SERVICE
Effective May 22, 2024 | Archived versions
What’s covered in these terms
We know it’s tempting to skip these Terms of
Service, but it’s important to establish what you
can expect from us as you use Google services,
and what we expect from you. These Terms of Service reect the way Google’s business works, the laws that apply to
our company, and certain things we’ve always believed to be true. As a result, these Terms
of Service help dene Google’s relationship with you as you interact with our services. For
example, these terms include the following topic headings:
What you can expect from us, which describes how we provide and develop our
services
What we expect from you, which establishes certain rules for using our services
Content in Google services, which describes the intellectual property rights to the
content you nd in our services — whether that content belongs to you, Google, or
others
In case of problems or disagreements, which describes other legal rights you have,
and what to expect in case someone violates these terms
Understanding these terms is important because, by accessing or using our services,
you’re agreeing to these terms.
Generated Questions:
- What is the meaning of the Terms of Service?
- What are the terms of service that govern how Google operates?
- What do these Terms of Service help define?
--------------------------------------------------
Step 6: Answer the Generated Questions Using a QA Model
After generating the questions, we can use a pre-trained question-answering (QA) model to find the answers within the text. The deepset/roberta-base-squad2 model extracts answers based on the context of the passage. Here’s how to answer the generated questions:
# load the QA pipeline
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")
# function to track and answer only unique questions
def answer_unique_questions(passages, qa_pipeline):
    answered_questions = set()  # to store unique questions
    for idx, passage in enumerate(passages):
        questions = generate_questions_pipeline(passage)
        for question in questions:
            if question not in answered_questions:  # check if the question has already been answered
                answer = qa_pipeline({'question': question, 'context': passage})
                print(f"Q: {question}")
                print(f"A: {answer['answer']}\n")
                answered_questions.add(question)  # add the question to the set to avoid repetition
        print(f"{'='*50}\n")

answer_unique_questions(passages, qa_pipeline)

In this part of the code, we use a question-answering (QA) pipeline with the deepset/roberta-base-squad2 model to answer the questions generated from the document passages. The function answer_unique_questions() tracks the questions it has already seen in a set so that each question is answered only once. As the code processes each passage, it checks whether a question has already been answered; if not, it extracts an answer based on the passage’s context. This avoids answering duplicate questions and keeps the processing of all relevant queries efficient. Below is the output for passage 1:
Q: What is the meaning of the Terms of Service?
A: certain things we’ve always believed to be true
Q: What are the terms of service that govern how Google operates?
A: reect the way Google’s business works, the laws
Q: What do these Terms of Service help define?
A: Google’s relationship with you as you interact with our services
==================================================
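The question-answering pipeline also returns a confidence score alongside each answer, which you can use to filter out weak extractions or to save the question-answer pairs for later review. Here is a small sketch of that idea, assuming the passages, generate_questions_pipeline(), and qa_pipeline objects defined earlier; the 0.3 score threshold is an arbitrary value you may want to adjust:

import json

qa_results = []
for passage in passages:
    for question in generate_questions_pipeline(passage):
        result = qa_pipeline({'question': question, 'context': passage})
        # keep only answers the model is reasonably confident about
        if result['score'] >= 0.3:  # arbitrary threshold; adjust as needed
            qa_results.append({
                'question': question,
                'answer': result['answer'],
                'score': round(result['score'], 3)
            })

# save the filtered question-answer pairs for later review
with open("qa_results.json", "w") as f:
    json.dump(qa_results, f, indent=2)

print(f"Saved {len(qa_results)} question-answer pairs to qa_results.json")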
Summary
So, this is how we can analyze documents using LLMs step by step. LLMs excel at understanding natural language, which makes them ideal for handling complex documents and extracting meaningful insights with high accuracy and minimal human intervention. I hope you liked this article on document analysis using LLMs with Python. Feel free to ask valuable questions in the comments section below. You can follow me on Instagram for many more resources.