Feed NotebookLM with Scientific Papers, using an AI Agent
An AI agent that automatically collects scientific papers and feeds them into NotebookLM, so you can “chat with” fresh research without manual uploading.
I’ll break it into:
What the agent should do (capabilities)
High-level architecture
Detailed pipeline (step-by-step)
Extra smart features you can add
1. What the Agent Should Do
Your AI agent’s job, end-to-end:
Watch for new research
From sources like arXiv, PubMed, journal RSS feeds, conference pages, or a list of DOIs.
Decide what’s relevant
Based on topics, keywords, authors, or venues you care about.
Download the papers
Grab PDFs or full text where possible.
Normalize & organize them
Clean filenames and metadata
Group papers into logical “notebooks” (e.g., “LLMs”, “Protein Folding”, “Climate Models”).
Add them as sources for NotebookLM
Store PDFs or Google Docs in a connected Google Drive structure that you then add (or periodically refresh) as NotebookLM sources.
Notify you and accept corrections
Send a short digest (Telegram, Slack, email)
Let you say “yes/no” to include or exclude certain papers.
2. High-Level Architecture
Think of the agent as a pipeline with four main components:
Collector
Fetches candidate papers from APIs/RSS/DOI lists.
Filter & Ranker (AI part)
Uses an LLM/embedding model to decide relevance and importance.
Can cluster by topic.
Normalizer & Uploader
Downloads PDFs
Extracts metadata (title, authors, abstract, year)
Saves to Google Drive with a consistent folder scheme like:
/Research/NotebookLM/Topic/Year/Title.pdf
NotebookLM Integration Layer
You (or a script, if/when APIs allow) add/update those Drive folders as sources in NotebookLM.
The agent doesn’t talk to NotebookLM directly today (no public API), but it controls the documents NotebookLM sees.
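The four components above can be wired together in a single run loop. A minimal sketch (all class and function names here are hypothetical, standing in for your own implementations):

```python
from dataclasses import dataclass

@dataclass
class Paper:
    # Candidate paper metadata collected from an API/RSS feed
    id: str
    title: str
    abstract: str
    pdf_url: str
    profile: str

def run_pipeline(collector, ranker, uploader):
    """One agent run: collect -> filter/rank -> normalize/upload."""
    candidates = collector.fetch()           # Collector
    accepted = ranker.select(candidates)     # Filter & Ranker (AI part)
    for paper in accepted:
        uploader.store(paper)                # Normalizer & Uploader
    # NotebookLM integration stays manual: it reads the Drive folders
    return accepted
```

Each component can then be developed and tested in isolation before the scheduler runs the whole thing.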
3. Detailed Pipeline
Assume you build this in Python, using cron / a scheduler (or something like Airflow) plus Google Drive APIs.
Step 1 – Configure “Research Profiles”
Define what “relevant” means in a config file, e.g. profiles.yaml:
profiles:
  llm_research:
    keywords:
      - "large language model"
      - "instruction tuning"
      - "RLHF"
      - "tool use"
    sources:
      - "arxiv:cs.CL"
      - "arxiv:cs.LG"
      - "NeurIPS"
      - "ICLR"
  protein_folding:
    keywords:
      - "protein folding"
      - "structure prediction"
      - "AlphaFold"
    sources:
      - "arxiv:q-bio.BM"
      - "Nature"
      - "Science"
Each profile maps to a NotebookLM notebook you’ll maintain.
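Once parsed (e.g. with yaml.safe_load from PyYAML), the config can be turned into concrete search queries for the Collector. A small helper sketch, with hypothetical names:

```python
def build_queries(profiles: dict) -> dict:
    """Map each profile to the search queries its Collector will run.

    `profiles` is the parsed content of profiles.yaml, e.g. from
    yaml.safe_load(open("profiles.yaml")).
    """
    queries = {}
    for name, cfg in profiles["profiles"].items():
        # One query per (source, keyword) pair; the Collector decides
        # how to translate each one for the target API/RSS feed.
        queries[name] = [
            {"source": src, "keyword": kw}
            for src in cfg["sources"]
            for kw in cfg["keywords"]
        ]
    return queries
```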
Step 2 – Collect Papers
For each profile, the Collector:
Queries APIs / RSS
arXiv API with search terms
PubMed / CrossRef / journal RSS feeds
Optional: direct scraping of conference “Accepted Papers” pages.
Extracts candidate paper info
Title
Abstract
Authors
Link to PDF (or DOI)
Published date / last updated
Stores them in a local DB
A simple SQLite table papers with columns: id, title, abstract, pdf_url, doi, source, profile, status, added_on.
Mark new candidates as status = "candidate".
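The table and the insert logic need only Python's built-in sqlite3. A sketch using the column names above (INSERT OR IGNORE keeps reruns from duplicating papers):

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    id TEXT PRIMARY KEY,
    title TEXT,
    abstract TEXT,
    pdf_url TEXT,
    doi TEXT,
    source TEXT,
    profile TEXT,
    status TEXT DEFAULT 'candidate',
    added_on TEXT
)
"""

def add_candidate(conn, paper: dict):
    """Insert a newly collected paper; silently skip ones we've seen before."""
    conn.execute(
        "INSERT OR IGNORE INTO papers "
        "(id, title, abstract, pdf_url, doi, source, profile, status, added_on) "
        "VALUES (:id, :title, :abstract, :pdf_url, :doi, :source, :profile, "
        "'candidate', datetime('now'))",
        paper,
    )
    conn.commit()
```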
Step 3 – AI-based Relevance & Ranking
Use an LLM or embedding model to decide which candidates are worth keeping.
Possible workflow:
Keyword pre-filter
Quick filter: title/abstract must contain certain words.
This cuts obvious noise before hitting the LLM.
Scoring with an LLM
For each candidate abstract, prompt something like: “You are a research assistant. Score from 1–5 how relevant this paper is to the topic: ‘<profile description>’. Return JSON {score: x, short_reason: '…'}.”
Thresholding
Keep only papers with score >= 4. Store ai_score and reason in the DB.
Ranking
Sort by recency and score (e.g. score DESC, date DESC). Limit how many you ingest per run (e.g. the top 20 per day).
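The cheap pre-filter and the thresholding/ranking steps are plain functions; only the scoring step needs an LLM call (assumed here to have already filled in ai_score). A sketch with hypothetical field names:

```python
def keyword_prefilter(papers, keywords):
    """Cheap pass: keep papers whose title/abstract mention any keyword."""
    kws = [k.lower() for k in keywords]
    return [
        p for p in papers
        if any(k in (p["title"] + " " + p["abstract"]).lower() for k in kws)
    ]

def rank_accepted(papers, threshold=4, limit=20):
    """Keep LLM-scored papers at or above threshold, best and newest first."""
    kept = [p for p in papers if p["ai_score"] >= threshold]
    kept.sort(key=lambda p: (p["ai_score"], p["date"]), reverse=True)
    return kept[:limit]
```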
Step 4 – Download & Normalize
For each accepted paper:
Download the PDF
Use the pdf_url or DOI resolution to retrieve the PDF.
Extract metadata (and fix it)
Use a PDF-parsing tool (e.g. GROBID, Science Parse, or a simple title-detection script) to extract:
Title
Authors
Venue
Year
If metadata is messy, you can run a secondary LLM prompt:
“Extract title/authors/venue/year from this text snippet…”.
Standardize filename
For example:
2025 - Doe et al - Instruction Tuning for LLMs.pdf
Create an “index note” (optional but powerful)
Auto-generate a 1–2 paragraph summary with an LLM.
Save it as a .md file or Google Doc alongside the PDF, e.g.
2025 - Doe et al - Instruction Tuning for LLMs (summary).docx.
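The filename convention in this step can be sketched as a small helper that also strips characters that are unsafe in filenames:

```python
import re

def standard_filename(year, authors, title, ext="pdf"):
    """Build '<Year> - <FirstAuthor> et al - <Title>.<ext>', filesystem-safe."""
    first = authors[0].split()[-1]                 # surname of first author
    byline = first if len(authors) == 1 else f"{first} et al"
    clean_title = re.sub(r'[\\/:*?"<>|]', "", title).strip()
    return f"{year} - {byline} - {clean_title}.{ext}"
```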
Step 5 – Upload to Google Drive in a Notebook-Friendly Structure
Use Google Drive API to upload:
One folder per profile (mapped to NotebookLM notebooks), e.g.:
/Research/NotebookLM/LLM Research/
/Research/NotebookLM/Protein Folding/
Inside each:
Year subfolders: 2023/, 2024/, 2025/
Each paper with its summary file (if you generate one).
This structure makes it easy to attach whole folders as NotebookLM sources.
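Deriving the target path is just string assembly; the actual upload would go through the Google Drive API (files.create in google-api-python-client), which is omitted here. A sketch with hypothetical argument names:

```python
from pathlib import PurePosixPath

def drive_path(profile_folder, year, filename, root="/Research/NotebookLM"):
    """Compute the Drive target path <root>/<Profile>/<Year>/<file>.

    Only derives the folder layout; the upload itself would use the
    Google Drive API with the folder IDs resolved from these names.
    """
    return str(PurePosixPath(root) / profile_folder / str(year) / filename)
```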
Step 6 – Connect to NotebookLM
Because NotebookLM currently works via Drive-linked sources (Docs, Slides, PDFs, etc.), the integration step is:
In NotebookLM, create a notebook for each profile:
“LLM Research – Auto-Feed”
“Protein Folding – Auto-Feed”
Add sources:
Choose the respective Google Drive folders as sources.
Example: “Add sources → Google Drive → /Research/NotebookLM/LLM Research/2025/”
Periodically (e.g. once a week), you can:
Add new subfolders as additional sources
Or, if/when NotebookLM supports it, refresh/auto-discover new docs in those folders
The net result: NotebookLM always has a growing library of the newest curated papers.
Step 7 – Feedback Loop & Notifications
A good AI agent should let you override its choices.
Options:
Daily digest message (email, Slack, Telegram, etc.) showing:
Papers it just added
Their relevance score + short AI explanation
Quick links to:
PDF
Summary doc
NotebookLM notebook
Feedback commands
You send something like reject <paper_id> in your chat channel. The agent:
Removes that file from the “active” folder
Moves it to an archive/ folder
Marks it as rejected in the DB so it doesn’t re-add it.
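The command handling is straightforward: parse the chat message, then flip the paper's status in the DB (the file move is left out here). A sketch, assuming the papers table from earlier and a hypothetical keep/reject command syntax:

```python
import re
import sqlite3

def parse_feedback(message):
    """Parse chat commands like 'reject 2501.00001' or 'keep 2501.00001'."""
    m = re.match(r"^(reject|keep)\s+(\S+)$", message.strip())
    if not m:
        return None
    return {"action": m.group(1), "paper_id": m.group(2)}

def apply_rejection(conn, paper_id):
    """Mark a paper rejected so later runs skip it (file move not shown)."""
    conn.execute("UPDATE papers SET status = 'rejected' WHERE id = ?", (paper_id,))
    conn.commit()
```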
Continuous training of preferences
Track what you tend to reject / keep.
Use that to adjust prompts, thresholds, or even fine-tune an embedding space.
4. Extra Smart Features You Can Add
Once the basics work, you can make your agent pretty sophisticated:
Topic clustering & Notebook auto-splitting
Use embeddings to cluster new papers into topics.
Automatically create sub-notebooks in NotebookLM (“Evaluation Methods”, “Alignment”, “Architectures”) and route papers accordingly.
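Assuming you already have embedding vectors for each paper (from whatever embedding model you use), routing a paper to the nearest topic centroid is just cosine similarity. A minimal sketch with hypothetical topic names:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def route_to_topic(paper_vec, topic_centroids):
    """Assign a paper embedding to the most similar topic centroid.

    topic_centroids: e.g. {"Alignment": [...], "Evaluation Methods": [...]}
    """
    return max(topic_centroids, key=lambda t: cosine(paper_vec, topic_centroids[t]))
```

The centroids themselves could come from clustering your existing library (e.g. k-means over the embeddings) or from embedding a short description of each sub-notebook.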
Citation graph exploration
For each new paper:
Pull references
Cross-match with your existing library
Let the agent suggest “You might also want to ingest X and Y since they are heavily cited.”
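The cross-matching step reduces to set arithmetic over DOIs: count how often each reference shows up across the newly ingested papers and surface the ones you don't already have. A sketch with hypothetical inputs:

```python
def suggest_related(new_paper_refs, library_dois, min_citations=2):
    """Suggest references cited by several new papers but missing from the library.

    new_paper_refs: list of reference-DOI lists, one per newly ingested paper.
    library_dois: set of DOIs already in your library.
    """
    counts = {}
    for refs in new_paper_refs:
        for doi in set(refs):               # count each paper's citation once
            counts[doi] = counts.get(doi, 0) + 1
    return sorted(
        doi for doi, n in counts.items()
        if n >= min_citations and doi not in library_dois
    )
```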
“Ask Me What To Read” feature
The agent summarizes the top 3–5 new papers each week
Gives you a prioritized reading list with 2–3 bullet reasons each.
Conference mode
During big conferences (NeurIPS, ICLR, ICML, etc.), point the agent at the “Accepted Papers” list.
It filters by your topics and slurps in the papers when they go live.
Auto-generated comparison docs
When it adds a few related papers, the agent creates a “comparison summary” doc:
“How do these three papers differ?”
“What is the trend over time?”
Stored in the same Drive folders, so NotebookLM can use them as high-level overviews.