Fetch research papers on a user-specified topic from arXiv on a daily schedule, process and structure the data, create or update entries in a Notion database, and deliver the results by email and IM.
- Paper Topic: single query keyword
- Update Frequency: Daily updates, with fewer than 20 entries expected per day
- Tools:
- Platform: n8n, for end-to-end workflow configuration
- AI Model: Gemini-2.5-Flash, for daily paper summarization and data processing
- Database: Notion, with two tables — Daily Paper Summary and Paper Details
- Message: Feishu (IM bot notifications), Gmail (email notifications)
1. Data Retrieval
arXiv API
arXiv provides a public API for querying research papers by topic or by predefined category.
arXiv API User Manual
Key Notes:
- Response Format: The API returns results as an Atom XML feed.
- Timezone & Update Frequency:
- The arXiv submission process operates on a 24-hour cycle.
- Newly submitted articles become available in the API only at midnight after they have been processed.
- Feeds are updated daily at midnight Eastern Standard Time (EST).
- Therefore, a single request per day is sufficient.
- Request Limits:
- The maximum number of results returned from a single call (max_results) is limited to 30,000.
- Results must be retrieved in slices of at most 2,000 at a time, using the max_results and start query parameters.
- Time Format:
- The expected format is [YYYYMMDDTTTT+TO+YYYYMMDDTTTT], where TTTT is given in 24-hour time to the minute, in GMT.
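The request limits and time format above can be sketched as a small URL builder. This is a minimal illustration, not the workflow's actual node configuration: the function name buildArxivUrls, the all: field prefix, and the total argument are assumptions, and the keyword is assumed to be a single term, as stated under Paper Topic.

```javascript
// Public arXiv query endpoint.
const ARXIV_API = "http://export.arxiv.org/api/query";
const MAX_SLICE = 2000;   // at most 2,000 results per slice
const MAX_TOTAL = 30000;  // at most 30,000 results overall

// Hypothetical helper: returns one request URL per slice.
// dateRange is expected in the form [YYYYMMDDTTTT+TO+YYYYMMDDTTTT] (GMT).
function buildArxivUrls(keyword, dateRange, total) {
  const capped = Math.min(total, MAX_TOTAL);
  const query = `all:${encodeURIComponent(keyword)}+AND+submittedDate:${dateRange}`;
  const urls = [];
  for (let start = 0; start < capped; start += MAX_SLICE) {
    const size = Math.min(MAX_SLICE, capped - start);
    urls.push(`${ARXIV_API}?search_query=${query}&start=${start}&max_results=${size}`);
  }
  return urls;
}
```

With fewer than 20 entries expected per day, a call such as buildArxivUrls("llm", "[202409090000+TO+202409092359]", 20) yields a single URL, so the slicing only matters if the topic is much broader.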
Scheduled Task
- Execution Frequency: Daily
- Execution Time: 6:00 AM
- Time Parameter Handling (JS):
According to arXiv’s update rules, the scheduled task should query the previous day’s (T-1) submittedDate data.
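One way to compute the T-1 window is sketched below. The function name previousDayRange is hypothetical, and the sketch assumes the window should cover the whole previous calendar day in GMT (00:00 to 23:59); the source does not state the time zone of the 6:00 AM run or the exact window boundaries.

```javascript
// Hypothetical helper: builds the submittedDate range for the previous day (T-1), in GMT.
function previousDayRange(now = new Date()) {
  // Date.UTC normalizes day 0 to the last day of the previous month,
  // so month and year boundaries are handled automatically.
  const previous = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate() - 1)
  );
  // "2024-09-09T00:00:00.000Z" -> "20240909"
  const ymd = previous.toISOString().slice(0, 10).replace(/-/g, "");
  return `[${ymd}0000+TO+${ymd}2359]`;
}

// Example: a run on 2024-09-10 queries 2024-09-09.
console.log(previousDayRange(new Date("2024-09-10T06:00:00Z")));
// prints "[202409090000+TO+202409092359]"
```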
2. Data Extraction
Data Cleaning Rules (Convert to Standard JSON)
- Remove Header
- Keep only the 【entry】【/entry】 blocks representing paper items.
- Single Item
- Each 【entry】【/entry】 represents a single item.
- Field Processing Rules
- 【id】【/id】 ➡️ id: Extract content. Example: 【id】http://arxiv.org/abs/2409.06062v1【/id】 → http://arxiv.org/abs/2409.06062v1
- 【updated】【/updated】 ➡️ updated: Convert timestamp to yyyy-mm-dd hh:mm:ss
- 【published】【/published】 ➡️ published: Convert timestamp to yyyy-mm-dd hh:mm:ss
- 【title】【/title】 ➡️ title: Extract text content
- 【summary】【/summary】 ➡️ summary: Keep text, remove line breaks
- 【author】【/author】 ➡️ author: Combine all authors into an array. Example: [ "Ernest Pusateri", "Anmol Walia" ] (for Notion multi-select field)
- 【arxiv:comment】【/arxiv:comment】 ➡️ Ignore / discard
- 【link type="text/html"】 ➡️ html_url: Extract URL
- 【link type="application/pdf"】 ➡️ pdf_url: Extract URL
- 【arxiv:primary_category term="cs.CL"】 ➡️ primary_category: Extract the term value
- 【category】 ➡️ category: Merge all 【category】 values into an array. Example: [ "eess.AS", "cs.SD" ] (for Notion multi-select field)
- Add Empty Fields
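The cleaning rules above can be sketched in plain JavaScript as follows. This is an illustration under assumptions, not the workflow's actual Code node: the helper names (tagText, attr, parseEntry, parseFeed) are hypothetical, the raw feed is assumed to use ordinary angle-bracket tags where this document writes 【 and 】, and regular expressions stand in for a real XML parser, so XML entities such as &amp; are not decoded. The empty fields are not listed in the source and are therefore not added here.

```javascript
// Text inside the first <tag>...</tag> pair, or "" when the tag is absent.
function tagText(xml, tag) {
  const m = xml.match(new RegExp(`<${tag}(?:\\s[^>]*)?>([\\s\\S]*?)</${tag}>`));
  return m ? m[1].trim() : "";
}

// Value of an attribute inside a tag's attribute string, or "".
function attr(attrs, name) {
  const m = attrs.match(new RegExp(`\\b${name}="([^"]*)"`));
  return m ? m[1] : "";
}

// "2024-09-09T20:46:39Z" -> "2024-09-09 20:46:39" (still UTC).
function formatTimestamp(iso) {
  return iso.slice(0, 19).replace("T", " ");
}

// Apply the field processing rules to one <entry> block.
function parseEntry(entry) {
  const links = [...entry.matchAll(/<link\b([^>]*)>/g)].map((m) => m[1]);
  const linkOfType = (type) => {
    const found = links.find((l) => attr(l, "type") === type);
    return found ? attr(found, "href") : "";
  };
  const primary = entry.match(/<arxiv:primary_category\b([^>]*)>/);
  return {
    id: tagText(entry, "id"),
    updated: formatTimestamp(tagText(entry, "updated")),
    published: formatTimestamp(tagText(entry, "published")),
    title: tagText(entry, "title"),
    summary: tagText(entry, "summary").replace(/\s*\n\s*/g, " "),
    author: [...entry.matchAll(/<author>[\s\S]*?<name>([\s\S]*?)<\/name>/g)]
      .map((m) => m[1].trim()),
    html_url: linkOfType("text/html"),
    pdf_url: linkOfType("application/pdf"),
    primary_category: primary ? attr(primary[1], "term") : "",
    category: [...entry.matchAll(/<category\b([^>]*)>/g)]
      .map((m) => attr(m[1], "term")),
    // <arxiv:comment> is intentionally not extracted.
  };
}

// Drop the feed header and keep only the <entry> blocks.
function parseFeed(xml) {
  return [...xml.matchAll(/<entry>([\s\S]*?)<\/entry>/g)].map((m) => parseEntry(m[1]));
}
```

Calling parseFeed on the raw response body returns one object per paper, with author and category already shaped as arrays for the Notion multi-select fields.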
3. Data Processing
Analyze and summarize paper data using AI, then standardize output as JSON.
- Single Paper Basic Information Analysis and Enhancement
- Daily Paper Summary and Multilingual Translation
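Model replies are not guaranteed to be clean JSON, so the standardization step usually needs a defensive parse. The sketch below is one possible approach, not the workflow's actual node: the function name parseModelJson and the result shape ({ ok, data } or { ok, error, raw }) are assumptions, and it only handles the common case of a reply wrapped in a Markdown code fence.

```javascript
// Three backticks, built at runtime so this example does not contain a literal fence.
const FENCE = "`".repeat(3);

// Hypothetical helper: parse a model reply that should contain a single JSON value.
function parseModelJson(text) {
  let body = text.trim();
  if (body.startsWith(FENCE)) {
    // Drop the opening fence line (for example a fence followed by "json").
    body = body.slice(body.indexOf("\n") + 1);
    // Drop the closing fence and anything after it.
    const end = body.lastIndexOf(FENCE);
    if (end !== -1) body = body.slice(0, end);
  }
  try {
    return { ok: true, data: JSON.parse(body) };
  } catch (err) {
    // Keep the raw reply so a failed item can be inspected or retried.
    return { ok: false, error: err.message, raw: text };
  }
}
```

Returning a result object instead of throwing lets a failed paper be skipped or retried without stopping the rest of the daily batch.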
4. Data Storage: Notion Database
- Create a corresponding database in Notion with the same predefined field names.
- In Notion, create an integration under Integrations and grant access to the database. Obtain the corresponding Secret Key.
- Use the Notion "Create a database page" node to configure the field mapping and store the data.
Notes
- "Create a database page" only adds new entries; existing entries are not updated.
- The updated and published timestamps of arXiv papers are in UTC.
- Notion single-select and multi-select fields only accept arrays. They do not automatically parse comma-separated strings. You need to format them as proper arrays.
- Notion does not accept null values; sending one causes a 400 error.
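The array and null notes above can be handled in a small normalization step before the Notion node. This is a sketch under assumptions: the function name toNotionSafe and the MULTI_SELECT_FIELDS list are hypothetical, and replacing null with an empty string or empty array is only one possible choice. Whether an empty placeholder or leaving the property unmapped is appropriate depends on the property type, which the source does not specify.

```javascript
// Assumed names of the fields mapped to Notion multi-select properties.
const MULTI_SELECT_FIELDS = ["author", "category"];

// Hypothetical helper: make a record safe to map into Notion properties.
function toNotionSafe(record) {
  const out = {};
  for (const [key, value] of Object.entries(record)) {
    const isMulti = MULTI_SELECT_FIELDS.includes(key);
    if (value === null || value === undefined) {
      // Never pass null through: use an empty placeholder instead.
      out[key] = isMulti ? [] : "";
    } else if (isMulti && typeof value === "string") {
      // Comma-separated strings are not parsed by Notion, so split them here.
      out[key] = value.split(",").map((s) => s.trim()).filter(Boolean);
    } else {
      out[key] = value;
    }
  }
  return out;
}
```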
5. Data Delivery
Set up two channels for message delivery: EMAIL and IM, and define the message format and content.
Email: Gmail
GMAIL OAuth 2.0 – Official Documentation
Configure your OAuth consent screen
Steps:
- Enable Gmail API
- Create OAuth consent screen
- Create OAuth client credentials
- Audience: Add Test users under Testing status
Message format: HTML
(Model: OpenAI GPT — used to design an HTML email template)
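A minimal sketch of how an HTML email body might be assembled from the stored paper fields. It is not the template designed for this workflow: the function names escapeHtml and renderEmailHtml, the heading text, and the list layout are assumptions, and only the title and html_url fields from the extraction step are used.

```javascript
// Escape characters that have special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: render a heading and one linked list item per paper.
function renderEmailHtml(date, papers) {
  const items = papers
    .map((p) => `<li><a href="${escapeHtml(p.html_url)}">${escapeHtml(p.title)}</a></li>`)
    .join("\n");
  return `<h2>arXiv papers for ${escapeHtml(date)}</h2>\n<ul>\n${items}\n</ul>`;
}
```

Escaping matters here because paper titles and summaries can contain characters such as < and &, which would otherwise break the markup.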
IM: Feishu (LARK)
Bots in groups
Use bots in groups
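For the IM channel, a text message body for a group bot could be built as sketched below. This assumes a Feishu custom bot that accepts a JSON body with msg_type and content.text on its webhook URL; the function name buildFeishuTextMessage and the message wording are hypothetical, and the webhook URL itself is not shown because the source does not give one.

```javascript
// Hypothetical helper: build the JSON body for a plain-text bot message.
function buildFeishuTextMessage(date, papers) {
  const header = `arXiv papers for ${date} (${papers.length})`;
  const lines = papers.map((p, i) => `${i + 1}. ${p.title}\n${p.html_url}`);
  return {
    msg_type: "text",
    content: { text: [header, ...lines].join("\n") },
  };
}
```

The returned object would be sent as the JSON body of an HTTP POST from an n8n HTTP Request node.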