Paper Reader: Listening to Science, One Word at a Time
I read a lot of papers. And I mean a lot. So I built something to help with that — Paper Reader, a web app that reads scientific papers aloud, sentence by sentence, with live word-by-word highlighting.
The idea
The problem is simple: reading papers is exhausting, especially on a screen. Listening helps, but most text-to-speech tools mangle scientific content. They stumble over math notation, handle references awkwardly, and sound robotic. I wanted something that:
- Reads any paper (arXiv, HTML, or PDF)
- Handles LaTeX math intelligently — converting it to spoken English
- Shows you exactly where it is — word-level highlighting, not just sentence-level
- Uses natural neural voices, not the old robot ones
How it works
The architecture is a three-tier system:
    ┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
    │ Frontend (nginx) │────▶│ Backend (Flask)  │────▶│ TTS (Flask)      │
    │ static SPA +     │     │ parse · serve ·  │     │ Kokoro · say     │
    │ KaTeX            │     │ word-timing      │     │ · espeak         │
    └──────────────────┘     └──────────────────┘     └──────────────────┘
Parsing
The backend resolves arXiv links (auto-fetching the HTML version, with ar5iv and PDF fallbacks), parses HTML articles, and does layout-aware PDF extraction with pymupdf4llm — real headings from font sizes, multi-column text, images at their true positions, and automatic OCR of scanned pages.
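To make the fallback order concrete, here is a minimal sketch of how an arXiv link could be turned into a list of URLs to try. It is not the backend's actual code: the function name `candidate_urls`, the regex, and the exact URL patterns (in particular the ar5iv host) are my assumptions, and old-style identifiers such as `hep-th/9901001` are not handled.

```python
import re

# New-style arXiv identifiers look like 2412.06787, optionally with a version (v2).
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf|html)/(\d{4}\.\d{4,5}(?:v\d+)?)")


def candidate_urls(link):
    """Return URLs to try in order: arXiv HTML, then ar5iv, then the PDF."""
    match = ARXIV_ID.search(link)
    if match is None:
        return [link]  # not an arXiv link: fetch it as given
    paper_id = match.group(1)
    return [
        f"https://arxiv.org/html/{paper_id}",            # native HTML version
        f"https://ar5iv.labs.arxiv.org/html/{paper_id}",  # assumed ar5iv pattern
        f"https://arxiv.org/pdf/{paper_id}",             # last resort: PDF
    ]
```

A caller would walk the list and stop at the first URL that returns usable content.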
Math → speech
This is the cool part. I wrote mathtex2text.py, which walks the LaTeX AST using pylatexenc and emits spoken English:
| LaTeX | Spoken |
|---|---|
| \sum | "sum" |
| \int | "integral" |
| \sqrt{x} | "square root of x" |
| \frac{a}{b} | "a over b" |
| \alpha | "alpha" |
On screen, $...$ is rendered with KaTeX. In the audio, the length of the spoken form (not the raw LaTeX) feeds each word's duration weight, so word-highlight timing stays roughly aligned.
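The recursive idea behind the conversion can be shown with a toy, standard-library-only sketch. This is a hand-rolled stand-in, not mathtex2text.py and not pylatexenc: the names `SPOKEN`, `_read_arg`, and `speak` are hypothetical, only the macros from the table are covered, and things like `\sqrt[n]{x}`, superscripts, and subscripts are left out.

```python
import re

# Spoken names for argument-less macros (a tiny, illustrative subset).
SPOKEN = {"sum": "sum", "int": "integral", "alpha": "alpha"}


def _read_arg(s, i):
    """Read one macro argument at s[i:]: a {...} group or a single character."""
    while i < len(s) and s[i].isspace():
        i += 1
    if i >= len(s):
        raise ValueError("missing macro argument")
    if s[i] != "{":
        return s[i], i + 1
    depth = 0
    for j in range(i, len(s)):
        if s[j] == "{":
            depth += 1
        elif s[j] == "}":
            depth -= 1
            if depth == 0:
                return s[i + 1:j], j + 1
    raise ValueError("unbalanced braces")


def speak(latex):
    """Convert a small subset of LaTeX math to spoken English."""
    out, i = [], 0
    while i < len(latex):
        ch = latex[i]
        if ch == "\\":
            match = re.match(r"[A-Za-z]+", latex[i + 1:])
            if match is None:  # escaped symbol such as \{ : skip it
                i += 2
                continue
            name = match.group(0)
            i += 1 + len(name)
            if name == "frac":
                num, i = _read_arg(latex, i)
                den, i = _read_arg(latex, i)
                out.append(f"{speak(num)} over {speak(den)}")
            elif name == "sqrt":
                arg, i = _read_arg(latex, i)
                out.append(f"square root of {speak(arg)}")
            else:
                out.append(SPOKEN.get(name, name))  # unknown macro: say its name
        elif ch == "{":
            inner, i = _read_arg(latex, i)
            out.append(speak(inner))
        elif ch.isspace():
            i += 1
        else:
            out.append(ch)  # plain symbol or letter, spoken one at a time
            i += 1
    return " ".join(out)
```

Because arguments are converted by calling `speak` again, nesting such as `\sqrt{\frac{\alpha}{b}}` falls out naturally. A real AST walk handles far more of LaTeX than this character scanner does.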
Neural TTS
I ship the open Kokoro-82M model with 12 bundled voice packs. On Apple Silicon, running natively (not in Docker), it achieves a real-time factor of ~0.21 — about 4.7× realtime, so prefetch always stays ahead of playback.
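The RTF math is straightforward:
\[ \mathrm{RTF} = \frac{t_{\text{synthesis}}}{t_{\text{audio}}} \]
where \(t_{\text{synthesis}}\) is the time spent generating the audio, \(t_{\text{audio}}\) is the duration of that audio, and RTF \(< 1\) means faster than realtime. The key insight: don't run neural TTS inside a Linux VM on macOS (which is how Docker runs there); the overhead kills performance.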
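As a quick worked example, the real-time factor is synthesis time divided by the duration of the audio produced, and its reciprocal is the speedup over realtime. The timings below are hypothetical, chosen only to land near the figures quoted above; `real_time_factor` is an illustrative name.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Hypothetical measurement: 2.13 s of compute for a 10 s sentence.
rtf = real_time_factor(2.13, 10.0)
speedup = 1.0 / rtf
print(f"RTF {rtf:.2f}, {speedup:.1f}x realtime")

# Prefetch can only stay ahead of playback while RTF is below 1.
keeps_up = rtf < 1.0
```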
    # Simplified TTS request flow. `kokoro`, `encode`, and `distribute_duration`
    # are assumed to be defined elsewhere in the TTS service.
    def synthesize(text, voice="af_sarah"):
        audio, sr = kokoro.generate(text, voice=voice)  # samples and sample rate
        duration = len(audio) / sr  # length of the clip in seconds
        # distribute duration over words proportionally to their weights
        timings = distribute_duration(text, duration)
        return {"audio_b64": encode(audio), "duration": duration, "timings": timings}
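One way `distribute_duration` could work is sketched below. This is an assumption, not the project's implementation: it weights each word by its character count by default (the real weights also account for spoken math), and the output shape (a list of `word`/`start`/`end` dicts) is hypothetical.

```python
def distribute_duration(text, duration, weight=len):
    """Split `duration` seconds across the words of `text` by relative weight."""
    words = text.split()
    weights = [max(1, weight(w)) for w in words]  # never give a word zero time
    total = sum(weights)
    timings, start = [], 0.0
    for word, w in zip(words, weights):
        end = start + duration * w / total
        timings.append({"word": word, "start": start, "end": end})
        start = end
    if timings:
        timings[-1]["end"] = duration  # absorb floating-point drift
    return timings
```

For example, `distribute_duration("a bb c", 4.0)` gives the middle word twice the time of its neighbours. Passing a different `weight` callable is where a spoken-length estimate for formulas could plug in.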
Word-level highlighting
The frontend syncs audio.currentTime against the word timings in a requestAnimationFrame loop:
- Fetch audio for sentence \(n\) from /api/doc/<id>/audio/<n>
- Play it
- In each animation frame, compute which word the playback position corresponds to
- Highlight that word (dark yellow) within the sentence (light yellow)
- Prefetch audio for sentence \(n+1, n+2, \ldots\) in the background
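The per-frame lookup in the third step boils down to finding the last word whose start time is at or before the playback position. The real frontend runs in the browser, so the Python below is only a sketch of the logic; `word_index_at` and the precomputed `starts` list are hypothetical.

```python
from bisect import bisect_right


def word_index_at(starts, t):
    """Index of the word being spoken at playback time t (seconds).

    `starts` is the sorted list of word start times for one sentence.
    """
    # bisect_right counts how many words have started by time t.
    return max(bisect_right(starts, t) - 1, 0)


starts = [0.0, 1.0, 3.0]  # e.g. taken from the word timings of one sentence
current = word_index_at(starts, 1.4)  # playback is inside the second word
```

Binary search keeps the lookup cheap enough to run on every animation frame, even for long sentences.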
Deployment
It's containerised with Docker and orchestrated with Kubernetes (minikube or k3s), with nginx rate-limiting and TLS via Traefik + cert-manager (Let's Encrypt). The live instance is at reader.dzim.site.
Security matters when you're fetching arbitrary URLs:
- SSRF protection: validates every redirect hop, refuses loopback/private addresses (including cloud metadata 169.254.169.254)
- Path-traversal guards: document IDs strictly validated
- Size caps: 64 MB upload, 60 MB download
- Least privilege: containers run as non-root with read-only rootfs
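The first two guards can be sketched with the standard library alone. This is an illustration under my own assumptions, not the project's code: `is_forbidden_ip`, `is_safe_url`, `is_valid_doc_id`, and the document ID pattern are all hypothetical. To cover redirects, a check like `is_safe_url` would have to run again on every hop with automatic redirect following turned off.

```python
import ipaddress
import re
import socket
from urllib.parse import urlsplit

# Hypothetical ID format: short, and with no dots or slashes at all.
DOC_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def is_forbidden_ip(ip_text):
    """True for loopback, private, link-local, and other non-public addresses."""
    ip = ipaddress.ip_address(ip_text.split("%")[0])  # drop any IPv6 scope id
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:  # e.g. ::ffff:127.0.0.1
        ip = mapped
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_multicast or ip.is_reserved or ip.is_unspecified)


def is_safe_url(url):
    """Allow only http(s) URLs whose host resolves to public addresses."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return False
    try:
        # getaddrinfo also accepts IP literals, so one path covers both cases.
        infos = socket.getaddrinfo(parts.hostname, None)
    except socket.gaierror:
        return False
    return all(not is_forbidden_ip(info[4][0]) for info in infos)


def is_valid_doc_id(doc_id):
    """Reject anything that could escape the storage directory."""
    return DOC_ID.fullmatch(doc_id) is not None
```

Note that 169.254.169.254 is caught by the link-local check. Resolving the host once here and again when fetching leaves a DNS-rebinding gap, so a stricter version would connect to the address it validated.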
Try it
Go to reader.dzim.site, paste an arXiv link (e.g. https://arxiv.org/abs/2412.06787), and listen. The code is open source under the MIT license.