Feed NotebookLM with Scientific Papers Using an AI Agent

An AI agent that automatically collects scientific papers and feeds them into NotebookLM, so you can “chat with” fresh research without manual uploading.

I’ll break it into:

  1. What the agent should do (capabilities)

  2. High-level architecture

  3. Detailed pipeline (step-by-step)

  4. Extra smart features you can add

1. What the Agent Should Do

Your AI agent’s job, end-to-end:

  1. Watch for new research

    • From sources like arXiv, PubMed, journal RSS feeds, conference pages, or a list of DOIs.

  2. Decide what’s relevant

    • Based on topics, keywords, authors, or venues you care about.

  3. Download the papers

    • Grab PDFs or full text where possible.

  4. Normalize & organize them

    • Clean filenames and metadata

    • Group papers into logical “notebooks” (e.g., “LLMs”, “Protein Folding”, “Climate Models”).

  5. Add them as sources for NotebookLM

    • Store PDFs or Google Docs in a connected Google Drive structure that you then add (or periodically refresh) as NotebookLM sources.

  6. Notify you and accept corrections

    • Send a short digest (Telegram, Slack, email)

    • Let you say “yes/no” to include or exclude certain papers.

2. High-Level Architecture

Think of the agent as a pipeline with four main components:

  1. Collector

    • Fetches candidate papers from APIs/RSS/DOI lists.

  2. Filter & Ranker (AI part)

    • Uses an LLM/embedding model to decide relevance and importance.

    • Can cluster by topic.

  3. Normalizer & Uploader

    • Downloads PDFs

    • Extracts metadata (title, authors, abstract, year)

    • Saves to Google Drive with a consistent folder scheme like:

      • /Research/NotebookLM/Topic/Year/Title.pdf

  4. NotebookLM Integration Layer

    • You (or a script, if/when APIs allow) add/update those Drive folders as sources in NotebookLM.

    • The agent doesn’t talk to NotebookLM directly today (no public API), but it controls the documents NotebookLM sees.

3. Detailed Pipeline

Assume you build this in Python, using cron / a scheduler (or something like Airflow) plus the Google Drive API.

Step 1 – Configure “Research Profiles”

Define what “relevant” means in a config file, e.g. profiles.yaml:

profiles:
  llm_research:
    keywords:
      - "large language model"
      - "instruction tuning"
      - "RLHF"
      - "tool use"
    sources:
      - "arxiv:cs.CL"
      - "arxiv:cs.LG"
      - "NeurIPS"
      - "ICLR"
  protein_folding:
    keywords:
      - "protein folding"
      - "structure prediction"
      - "AlphaFold"
    sources:
      - "arxiv:q-bio.BM"
      - "Nature"
      - "Science"

Each profile maps to a NotebookLM notebook you’ll maintain.
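
A tiny sketch of loading this config in Python, assuming PyYAML is installed and profiles.yaml sits next to the script:

# load_profiles.py - read profiles.yaml into a dict keyed by profile name
import yaml  # pip install pyyaml

def load_profiles(path="profiles.yaml"):
    with open(path, "r", encoding="utf-8") as f:
        config = yaml.safe_load(f)
    return config["profiles"]  # {"llm_research": {...}, "protein_folding": {...}}

if __name__ == "__main__":
    for name, profile in load_profiles().items():
        print(name, "->", profile["keywords"])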

Step 2 – Collect Papers

For each profile, the Collector:

  1. Queries APIs / RSS

    • arXiv API with search terms

    • PubMed / CrossRef / journal RSS feeds

    • Optional: direct scraping of conference “Accepted Papers” pages.

  2. Extracts candidate paper info

    • Title

    • Abstract

    • Authors

    • Link to PDF (or DOI)

    • Published date / last updated

  3. Stores them in a local DB

    • A simple SQLite table papers with:

      • id, title, abstract, pdf_url, doi, source, profile, status, added_on.

Mark new candidates as status = "candidate".
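
As a concrete sketch, here's what the arXiv part of the Collector could look like, writing into the SQLite table above. The arxiv package and the exact field mapping are assumptions; PubMed/CrossRef/RSS sources would get their own fetchers feeding the same table:

# collector.py - fetch new arXiv candidates for one profile and mark them "candidate"
import sqlite3
import arxiv  # pip install arxiv (a thin wrapper around the arXiv API)

def init_db(path="papers.db"):
    conn = sqlite3.connect(path)
    conn.execute("""CREATE TABLE IF NOT EXISTS papers (
        id TEXT PRIMARY KEY, title TEXT, abstract TEXT, pdf_url TEXT,
        doi TEXT, source TEXT, profile TEXT, status TEXT, added_on TEXT)""")
    return conn

def collect_arxiv(conn, profile_name, keywords, max_results=50):
    # One query per profile: OR the keywords together and take the newest submissions
    query = " OR ".join(f'"{kw}"' for kw in keywords)
    search = arxiv.Search(query=query, max_results=max_results,
                          sort_by=arxiv.SortCriterion.SubmittedDate)
    for result in arxiv.Client().results(search):
        conn.execute(
            "INSERT OR IGNORE INTO papers VALUES (?,?,?,?,?,?,?,?,date('now'))",
            (result.entry_id, result.title, result.summary, result.pdf_url,
             result.doi, "arxiv", profile_name, "candidate"))
    conn.commit()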

Step 3 – AI-based Relevance & Ranking

Use an LLM or embedding model to decide which candidates are worth keeping.

Possible workflow:

  1. Keyword pre-filter

    • Quick filter: title/abstract must contain certain words.

    • This cuts obvious noise before hitting the LLM.

  2. Scoring with an LLM
    For each candidate abstract, prompt something like:

    “You are a research assistant. Score from 1–5 how relevant this paper is to the topic: ‘<profile description>’. Return JSON {score: x, short_reason: '…'}.”

  3. Thresholding

    • Keep only papers with score >= 4.

    • Store ai_score, reason in the DB.

  4. Ranking

    • Sort by recency and score (e.g. score DESC, date DESC).

    • Limit how many you ingest per run (e.g. top 20 per day).
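
A minimal sketch of the scoring call, assuming an OpenAI-style chat client; the model name is a placeholder and any LLM that can return JSON will do:

# scorer.py - score one candidate abstract against a profile description
import json
from openai import OpenAI  # pip install openai (or any chat-completions-compatible client)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def score_paper(title, abstract, profile_description, model="gpt-4o-mini"):
    prompt = (
        "You are a research assistant. Score from 1-5 how relevant this paper "
        f"is to the topic: '{profile_description}'.\n"
        f"Title: {title}\nAbstract: {abstract}\n"
        'Return JSON exactly like {"score": 3, "short_reason": "..."}.'
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},  # ask for machine-readable output
    )
    result = json.loads(response.choices[0].message.content)
    return result["score"], result["short_reason"]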

Step 4 – Download & Normalize

For each accepted paper:

  1. Download the PDF

    • Use the pdf_url or DOI resolution to retrieve the PDF.

  2. Extract metadata (and fix it)

    • Use a PDF-parsing tool (e.g. GROBID, Science Parse, or a simple title-detection script) to extract:

      • Title

      • Authors

      • Venue

      • Year

    • If metadata is messy, you can run a secondary LLM prompt:
      “Extract title/authors/venue/year from this text snippet…”.

  3. Standardize filename

    • For example:
      2025 - Doe et al - Instruction Tuning for LLMs.pdf

  4. Create an “index note” (optional but powerful)

    • Auto-generate a 1–2 paragraph summary with an LLM.

    • Save it as a .md file or Google Doc alongside the PDF, e.g.
      2025 - Doe et al - Instruction Tuning for LLMs (summary).md.
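
A small sketch of the download-and-rename part (requests is the only dependency; the year/author/title values come from whatever metadata extraction you use above):

# normalize.py - download a PDF and give it a consistent, readable filename
import re
from pathlib import Path
import requests  # pip install requests

def standard_filename(year, first_author_lastname, title, max_len=80):
    # Strip characters that are illegal or awkward in filenames, then truncate
    clean_title = re.sub(r'[\\/:*?"<>|]', "", title).strip()[:max_len]
    return f"{year} - {first_author_lastname} et al - {clean_title}.pdf"

def download_pdf(pdf_url, dest_dir, filename):
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    resp = requests.get(pdf_url, timeout=60)
    resp.raise_for_status()
    path = dest / filename
    path.write_bytes(resp.content)
    return path

# e.g. download_pdf(url, "staging/llm_research",
#                   standard_filename(2025, "Doe", "Instruction Tuning for LLMs"))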

Step 5 – Upload to Google Drive in a Notebook-Friendly Structure

Use the Google Drive API to upload:

  • One folder per profile (mapped to NotebookLM notebooks), e.g.:

    • /Research/NotebookLM/LLM Research/

    • /Research/NotebookLM/Protein Folding/

Inside each:

  • /Year/ subfolders: 2023/, 2024/, 2025/

  • Each paper with its summary file (if you generate one).

This structure makes it easy to attach whole folders as NotebookLM sources.
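
A sketch of the upload step with the official google-api-python-client, assuming you already have OAuth or service-account credentials (creds) and have looked up or created the target folder's ID elsewhere:

# drive_upload.py - push one local PDF into the profile's Drive folder
from pathlib import Path
from googleapiclient.discovery import build  # pip install google-api-python-client
from googleapiclient.http import MediaFileUpload

def upload_pdf(creds, local_path, drive_folder_id):
    service = build("drive", "v3", credentials=creds)
    file_metadata = {"name": Path(local_path).name, "parents": [drive_folder_id]}
    media = MediaFileUpload(str(local_path), mimetype="application/pdf", resumable=True)
    created = service.files().create(body=file_metadata, media_body=media,
                                     fields="id, webViewLink").execute()
    return created["id"], created["webViewLink"]  # keep the link for the daily digest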

Step 6 – Connect to NotebookLM

Because NotebookLM currently works via Drive-linked sources (Docs, Slides, PDFs, etc.), the integration step is:

  1. In NotebookLM, create a notebook for each profile:

    • “LLM Research – Auto-Feed”

    • “Protein Folding – Auto-Feed”

  2. Add sources:

    • Choose the respective Google Drive folders as sources.

    • Example: “Add sources → Google Drive → /Research/NotebookLM/LLM Research/2025/”

  3. Periodically (e.g. once a week), you can:

    • Add new subfolders as additional sources

    • Or, if/when NotebookLM supports it, refresh/auto-discover new docs in those folders

The net result: NotebookLM always has a growing library of the newest curated papers.

Step 7 – Feedback Loop & Notifications

A good AI agent should let you override its choices.

Options:

  1. Daily digest message (email, Slack, Telegram, etc.) showing:

    • Papers it just added

    • Their relevance score + short AI explanation

    • Quick links to:

      • PDF

      • Summary doc

      • NotebookLM notebook

  2. Feedback commands

    • You send something like: reject <paper_id> in your chat channel.

    • The agent:

      • Removes that file from the “active” folder

      • Moves it to an archive/ folder

      • Marks it as rejected in the DB so it doesn’t re-add it (a minimal handler sketch follows this list).

  3. Continuous training of preferences

    • Track what you tend to reject / keep.

    • Use that to adjust prompts, thresholds, or even fine-tune an embedding space.
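
A minimal sketch of the reject handler from point 2, assuming the SQLite schema from step 2 and a local mirror of the active folder; in practice you'd also store each paper's file path or Drive file ID in the table instead of matching on the title:

# feedback.py - handle a "reject <paper_id>" command from the chat channel
import shutil
import sqlite3
from pathlib import Path

def reject_paper(db_path, paper_id, active_dir, archive_dir):
    conn = sqlite3.connect(db_path)
    row = conn.execute("SELECT title FROM papers WHERE id = ?", (paper_id,)).fetchone()
    if row is None:
        return f"No paper with id {paper_id}"
    # 1. Mark as rejected so the collector never re-adds it
    conn.execute("UPDATE papers SET status = 'rejected' WHERE id = ?", (paper_id,))
    conn.commit()
    # 2. Move the local copy out of the "active" folder into archive/
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    for pdf in Path(active_dir).glob("*.pdf"):
        if row[0][:40].lower() in pdf.name.lower():  # crude title match; see note above
            shutil.move(str(pdf), str(Path(archive_dir) / pdf.name))
    return f"Rejected and archived: {row[0]}"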

4. Extra Smart Features You Can Add

Once the basics work, you can make your agent pretty sophisticated:

  1. Topic clustering & Notebook auto-splitting

    • Use embeddings to cluster new papers into topics.

    • Automatically create sub-notebooks in NotebookLM (“Evaluation Methods”, “Alignment”, “Architectures”) and route papers accordingly (see the clustering sketch at the end of this section).

  2. Citation graph exploration

    • For each new paper:

      • Pull references

      • Cross-match with your existing library

    • Let the agent suggest “You might also want to ingest X and Y since they are heavily cited.”

  3. “Ask Me What To Read” feature

    • The agent summarizes the top 3–5 new papers each week

    • Gives you a prioritized reading list with 2–3 bullet reasons each.

  4. Conference mode

    • During big conferences (NeurIPS, ICLR, ICML, etc.), point the agent at the “Accepted Papers” list.

    • It filters by your topics and slurps in the papers when they go live.

  5. Auto-generated comparison docs

    • When it adds a few related papers, the agent creates a “comparison summary” doc:

      • “How do these three papers differ?”

      • “What is the trend over time?”

    • Stored in the same Drive folders, so NotebookLM can use them as high-level overviews.
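
For feature 1, a hedged sketch of the clustering step using sentence-transformers embeddings and k-means; the model name and cluster count are placeholders to tune for your library:

# cluster_topics.py - group newly accepted papers into topics via embeddings
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers
from sklearn.cluster import KMeans                     # pip install scikit-learn

def cluster_papers(abstracts, n_clusters=3, model_name="all-MiniLM-L6-v2"):
    # Embed each abstract, then group them; each cluster maps to a sub-notebook folder
    model = SentenceTransformer(model_name)
    embeddings = model.encode(abstracts)
    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(embeddings)
    return labels  # e.g. route label 0 into .../LLM Research/Evaluation Methods/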