A website for extracting, summarizing, and querying PDFs with AI.
The goal of this project is to create an AI-powered document processing tool that extracts, summarizes, and answers questions based on PDF content using large language models. It is designed for professionals, researchers, and students who need to quickly analyze and extract key insights from large documents.
Install llm-text-analysis with pip
pip install pdfplumber
pip install --upgrade transformers
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip install ipywidgets
pip install nltk
pip install tiktoken
pip install protobuf
pip install blobfile
import pdfplumber
pdf_path = "./content/alan_turing_ wikipedia.pdf" # edit
output_text_file = "extracted_text.txt"
with pdfplumber.open(pdf_path) as pdf:
extracted_text = ""
for page in pdf.pages:
extracted_text += page.extract_text()
with open(output_text_file, "w", encoding="utf-8") as text_file:
text_file.write(extracted_text)
print(f"Text extracted and saved to {output_text_file}")
Text extracted and saved to extracted_text.txt
with open("./extracted_text.txt", "r") as file:
document_text = file.read()
print(document_text[:500])
2/20/25, 10:46 AM Alan Turing - Wikipedia
Alan Turing
Alan Mathison Turing (/ˈtjʊərɪŋ/; 23 June 1912 – 7
Alan Turing
June 1954) was an English mathematician, computer
OBE FRS
scientist, logician, cryptanalyst, philosopher and
theoretical biologist.[5] He was highly influential in the
development of theoretical computer science, providing
a formalisation of the concepts of algorithm and
computation with the Turing machine, which can be
considered a model of a general-purpose
computer.[6][7
from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
summary = summarizer(document_text[:2000], max_length=512, min_length=100, do_sample=False)
print("Summary:", summary[0]['summary_text'])
Device set to use cuda:0
Your max_length is set to 512, but your input_length is only 506. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=253)
Summary: Alan Turing (23 June 1912 - 7 June 1954) was an English mathematician, computer, logician, cryptanalyst, philosopher and theoretical biologist. During World War II, Turing worked for the Government Code and Cypher School at Bletchley Park, Britain's codebreaking centre. He led Hut 8, the section responsible for Turing in 1951 for German naval cryptanalysis. Turing devised techniques for speeding the breaking of German ciphers, including 23 June 1912improvements to the pre-war Polish bomba method.
import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize
sentences = sent_tokenize(document_text)
passages = []
current_passage = ""
for sentence in sentences:
if len(current_passage.split()) + len(sentence.split()) < 400: # adjust word limit as needed
current_passage += " " + sentence
else:
passages.append(current_passage.strip())
current_passage = sentence
if current_passage:
passages.append(current_passage.strip())
[nltk_data] Downloading package punkt_tab to
[nltk_data] C:\Users\Kenny\AppData\Roaming\nltk_data...
[nltk_data] Package punkt_tab is already up-to-date!
qg_pipeline = pipeline("text2text-generation", model="google/flan-t5-base")
def generate_questions_pipeline(passage, min_questions=3):
input_text = f"Generate questions: {passage}"
results = qg_pipeline(input_text)
questions = results[0]['generated_text'].split('\n')
questions = [q.strip() for q in questions if q.strip()]
if len(questions) < min_questions:
passage_sentences = passage.split('. ')
for i in range(len(passage_sentences)):
if len(questions) >= min_questions:
break
additional_input = ' '.join(passage_sentences[i:i+2])
additional_results = qg_pipeline(f"Generate questions: {additional_input}")
additional_questions = additional_results[0]['generated_text'].split('\n')
questions.extend([q.strip() for q in additional_questions if q.strip()])
return questions[:min_questions]
for idx, passage in enumerate(passages):
questions = generate_questions_pipeline(passage)
print(f"Passage {idx+1}:\n{passage}\n")
print("Generated Questions:")
for q in questions:
print(f"- {q}")
print(f"\n{'-'*50}\n")
Device set to use cuda:0
Token indices sequence length is longer than the specified maximum sequence length for this model (642 > 512). Running this sequence through the model will result in indexing errors
Passage 1:
2/20/25, 10:46 AM Alan Turing - Wikipedia
Alan Turing
Alan Mathison Turing (/ˈtjʊərɪŋ/; 23 June 1912 – 7
Alan Turing
June 1954) was an English mathematician, computer
OBE FRS
scientist, logician, cryptanalyst, philosopher and
theoretical biologist. [5] He was highly influential in the
development of theoretical computer science, providing
a formalisation of the concepts of algorithm and
computation with the Turing machine, which can be
considered a model of a general-purpose
computer. [6][7][8] Turing is widely considered to be the
father of theoretical computer science. [9]
Born in London, Turing was raised in southern
England. He graduated from King's College, Cambridge,
and in 1938, earned a doctorate degree from Princeton
University. During World War II, Turing worked for the
Government Code and Cypher School at Bletchley Park,
Britain's codebreaking centre that produced Ultra
intelligence. He led Hut 8, the section responsible for Turing in 1951
German naval cryptanalysis. Turing devised techniques Born Alan Mathison Turing
for speeding the breaking of German ciphers, including 23 June 1912
improvements to the pre-war Polish bomba method, an Maida Vale, London, England
electromechanical machine that could find settings for
Died 7 June 1954 (aged 41)
the Enigma machine. He played a crucial role in
Wilmslow, Cheshire, England
cracking intercepted messages that enabled the Allies to
Cause of Cyanide poisoning as an act of
defeat the Axis powers in many engagements, including
death suicide[note 1]
the Battle of the Atlantic. [10][11]
Alma mater University of Cambridge (BA, MA)
After the war, Turing worked at the National Physical
Princeton University (PhD)
Laboratory, where he designed the Automatic
Known for Cryptanalysis of the Enigma ·
Computing Engine, one of the first designs for a stored-
Turing's proof · Turing machine ·
program computer. In 1948, Turing joined Max
Turing test · unorganised
Newman's Computing Machine Laboratory at the
machine · Turing pattern · Turing
Victoria University of Manchester, where he helped
reduction · "The Chemical Basis
develop the Manchester computers[12] and became
of Morphogenesis" · Turing
interested in mathematical biology. Turing wrote on the
paradox
chemical basis of morphogenesis[13][1] and predicted
Awards Smith's Prize (1936)
oscillating chemical reactions such as the Belousov–
Scientific career
https://en.wikipedia.org/wiki/Alan_Turing 1/382/20/25, 10:46 AM Alan Turing - Wikipedia
Zhabotinsky reaction, first observed in the 1960s.
Generated Questions:
- What was the name of the chemical reaction that Turing predicted?
- What was Alan Turing's profession?
- What was the name of the computer that Turing developed?
qa_pipeline = pipeline("question-answering", model="deepset/roberta-base-squad2")
def answer_unique_questions(passages, qa_pipeline):
answered_questions = set()
for idx, passage in enumerate(passages):
questions = generate_questions_pipeline(passage)
for question in questions:
if question not in answered_questions:
answer = qa_pipeline({'question': question, 'context': passage})
print(f"Q: {question}")
print(f"A: {answer['answer']}\n")
answered_questions.add(question)
print(f"{'='*50}\n")
answer_unique_questions(passages, qa_pipeline)
Device set to use cuda:0
C:\Users\Kenny\AppData\Local\Programs\Python\Python312\Lib\site-packages\transformers\pipelines\question_answering.py:391: FutureWarning: Passing a list of SQuAD examples to the pipeline is deprecated and will be removed in v5. Inputs should be passed using the `question` and `context` keyword arguments instead.
warnings.warn(
Q: What was the name of the chemical reaction that Turing predicted?
A: Belousovâ
Q: What was Alan Turing's profession?
A: mathematician
Q: What was the name of the computer that Turing developed?
A: Manchester
==================================================
pdfplumber
to extract text from PDF files and save it as a .txt file.facebook/bart-large-cnn
model to generate concise summaries of extracted text.nltk
for better processing.google/flan-t5-base
to generate questions based on segmented passages.deepset/roberta-base-squad2
to generate answers to the auto-generated questions from the text.Jupyter Notebook, Python, pdfplumber, nltk, Hugging Face Transformers, facebook/bart-large-cnn, google/flan-t5-base, deepset/roberta-base-squad2, PyTorch