Paper Reader: Listening to Science, One Word at a Time

I read a lot of papers. And I mean a lot. So I built something to help with that — Paper Reader, a web app that reads scientific papers aloud, sentence by sentence, with live word-by-word highlighting.

The idea

The problem is simple: reading papers is exhausting, especially on a screen. But most text-to-speech tools mangle scientific content: they stumble over math notation, handle references erratically, and sound robotic. I wanted something that:

Reads any paper (arXiv, HTML, or PDF)
Handles LaTeX math intelligently — converting it to spoken English
Shows you exactly where it is — word-level highlighting, not just sentence-level
Uses natural neural voices, not the old robot ones

How it works

The architecture is a three-tier system:

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Frontend (nginx)│────▶│  Backend (Flask) │────▶│  TTS (Flask)     │
│  static SPA +    │     │  parse · serve · │     │  Kokoro · say    │
│  KaTeX           │     │  word-timing     │     │  · espeak        │
└──────────────────┘     └──────────────────┘     └──────────────────┘

Parsing

The backend resolves arXiv links (auto-fetching the HTML version, with ar5iv and PDF fallbacks), parses HTML articles, and does layout-aware PDF extraction with pymupdf4llm — real headings from font sizes, multi-column text, images at their true positions, and automatic OCR of scanned pages.
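
The fallback order can be sketched as a small pure function. The post does not show the exact URL patterns or code the backend uses, so `candidate_urls` is a hypothetical name and the three patterns below are assumptions based on the public arXiv and ar5iv URL schemes. The actual fetching, and moving on to the next candidate on failure, is left out.

```python
import re

# New-style arXiv identifiers such as 2412.06787 or 2412.06787v2.
ARXIV_ID = re.compile(r"(\d{4}\.\d{4,5})(v\d+)?")


def candidate_urls(link: str) -> list[str]:
    """Return the URLs to try, in order, for an arXiv link or bare ID."""
    match = ARXIV_ID.search(link)
    if match is None:
        raise ValueError(f"no arXiv identifier found in {link!r}")
    paper_id = match.group(0)  # includes the version suffix, if any
    base_id = match.group(1)   # identifier without the version suffix
    return [
        f"https://arxiv.org/html/{paper_id}",           # native HTML version
        f"https://ar5iv.labs.arxiv.org/html/{base_id}",  # ar5iv rendering
        f"https://arxiv.org/pdf/{paper_id}",            # PDF as a last resort
    ]
```

A caller would walk this list and stop at the first URL that returns a usable document.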

Math → speech

This is the cool part. I wrote mathtex2text.py, which walks the LaTeX AST using pylatexenc and emits spoken English:

LaTeX	Spoken
`\sum`	"sum"
`\int`	"integral"
`\sqrt{x}`	"square root of x"
`\frac{a}{b}`	"a over b"
`\alpha`	"alpha"

On screen, $...$ is rendered with KaTeX. In the audio, the length of an expression's spoken form sets its duration weight, so word-highlight timing stays roughly aligned even when a short formula takes several words to say.
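
To make the table concrete, here is a toy converter that covers only those five rows. It is not mathtex2text.py and it deliberately does not use pylatexenc: it is a hand-rolled recursive reader, it assumes every macro argument is wrapped in braces, and the names `speak`, `_sequence`, and `_argument` are made up for this sketch.

```python
import re

# Macros spoken as a single word (the zero-argument rows of the table).
SIMPLE = {"sum": "sum", "int": "integral", "alpha": "alpha"}

# A token is a macro name, a brace, or a run of other non-space characters.
TOKEN = re.compile(r"\\([A-Za-z]+)|([{}])|([^\\{}\s]+)")


def speak(latex: str) -> str:
    """Convert a tiny LaTeX subset to spoken English."""
    tokens = TOKEN.findall(latex)  # list of (macro, brace, text) tuples
    words, _ = _sequence(tokens, 0)
    return " ".join(words)


def _sequence(tokens, pos):
    """Read tokens until a closing brace or the end; return (words, next_pos)."""
    words = []
    while pos < len(tokens):
        macro, brace, text = tokens[pos]
        if brace == "}":
            break
        if brace == "{":
            inner, pos = _sequence(tokens, pos + 1)
            words += inner
            pos += 1  # step over the closing brace
        elif macro == "frac":
            numerator, pos = _argument(tokens, pos + 1)
            denominator, pos = _argument(tokens, pos)
            words += numerator + ["over"] + denominator
        elif macro == "sqrt":
            argument, pos = _argument(tokens, pos + 1)
            words += ["square", "root", "of"] + argument
        elif macro:
            # Unknown macros fall back to their bare name.
            words.append(SIMPLE.get(macro, macro))
            pos += 1
        else:
            words.append(text)
            pos += 1
    return words, pos


def _argument(tokens, pos):
    """Read one braced argument whose opening brace is at pos."""
    inner, pos = _sequence(tokens, pos + 1)  # step over the opening brace
    return inner, pos + 1                    # step over the closing brace
```

With this sketch, `len(speak(expr).split())` would be one plausible way to get a spoken length for an expression; how the real app computes its weight is not shown in this post.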

Neural TTS

I ship the open Kokoro-82M model with 12 bundled voice packs. On Apple Silicon, running natively (not in Docker), it achieves a real-time factor of ~0.21, which means audio is generated about 4.8× faster than it plays back, so prefetch comfortably stays ahead of playback.

The RTF math is straightforward:

\[ \text{RTF} = \frac{t_{\text{generation}}}{t_{\text{audio}}} \]

where RTF $< 1$ means audio is generated faster than it plays back. The key insight: don't run neural TTS inside a Linux VM on macOS, which is where Docker containers run on a Mac; the overhead kills performance.
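
As a worked example, take the measured RTF of 0.21 and a sentence whose audio lasts 10 seconds (the 10-second length is just an illustrative number):

\[ \frac{1}{\text{RTF}} = \frac{1}{0.21} \approx 4.76, \qquad t_{\text{generation}} = \text{RTF} \times t_{\text{audio}} = 0.21 \times 10\,\text{s} = 2.1\,\text{s} \]

So that sentence would take roughly 2.1 seconds to synthesize, leaving close to 8 seconds of playback time in which to generate the following sentences.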

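# Simplified TTS request flow.
# `kokoro`, `distribute_duration`, and `encode` are placeholders for helpers not shown in this post.
def synthesize(text, voice="af_sarah"):
    # Generate the raw samples and the sample rate for this sentence
    audio, sr = kokoro.generate(text, voice=voice)
    # Audio length in seconds = number of samples / samples per second
    duration = len(audio) / sr
    # Distribute duration over words proportionally to their weights
    timings = distribute_duration(text, duration)
    return {"audio_b64": encode(audio), "duration": duration, "timings": timings}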
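
The proportional split is the interesting step, so here is a minimal sketch of what a function like `distribute_duration` could look like. The weighting rule is an assumption: `word_weight` is a hypothetical helper that gives one unit per character, whereas the real app also feeds in the spoken length of math. The shape of the returned dictionaries is likewise made up for illustration.

```python
def word_weight(word: str) -> float:
    """Hypothetical weight: one unit per character, so longer words last longer."""
    return float(len(word))


def distribute_duration(text, duration, weight=word_weight):
    """Split `duration` seconds across the words of `text` by relative weight.

    Returns one dict per word with its start and end time in seconds.
    """
    words = text.split()
    weights = [weight(w) for w in words]
    total = sum(weights)
    if total == 0:
        return []  # no words, or every word has zero weight

    timings = []
    start = 0.0
    cumulative = 0.0
    for word, w in zip(words, weights):
        cumulative += w
        # Scaling the running sum keeps the last word ending exactly at `duration`.
        end = duration * cumulative / total
        timings.append({"word": word, "start": start, "end": end})
        start = end
    return timings
```

Computing each end time from the cumulative weight, rather than adding up per-word durations, avoids rounding drift over a long sentence.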

Word-level highlighting

The frontend syncs audio.currentTime against the word timings in a requestAnimationFrame loop:

Fetch audio for sentence $n$ from /api/doc/<id>/audio/<n>
Play it
In each animation frame, compute which word the playback position corresponds to
Highlight that word (dark yellow) within the sentence (light yellow)
Prefetch audio for sentence $n+1, n+2, \ldots$ in the background
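
The per-frame lookup in the third step comes down to a search over word start times. The real frontend does this in browser JavaScript; the sketch below shows the same logic in Python, like the backend snippet earlier in the post. `current_word_index` is a hypothetical name, and it assumes the start times for a sentence arrive as a sorted list of seconds.

```python
from bisect import bisect_right


def current_word_index(starts, position):
    """Return the index of the word being spoken at `position` seconds.

    `starts` is the sorted list of word start times for one sentence.
    Returns None when the sentence has no words.
    """
    if not starts:
        return None
    # bisect_right counts how many words have started at or before `position`.
    index = bisect_right(starts, position) - 1
    # Clamp so that a position before the first start maps to the first word.
    return max(index, 0)
```

A binary search keeps each frame's work logarithmic in the sentence length, although for sentences of a few dozen words a linear scan would do just as well.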

Deployment

It's containerised with Docker and orchestrated with Kubernetes (minikube or k3s), with nginx rate-limiting and TLS via Traefik + cert-manager (Let's Encrypt). The live instance is at reader.dzim.site.

Security matters when you're fetching arbitrary URLs:

SSRF protection — validates every redirect hop, refuses loopback/private addresses (including cloud metadata 169.254.169.254)
Path-traversal guards — document IDs strictly validated
Size caps — 64 MB upload, 60 MB download
Least privilege — containers run as non-root with read-only rootfs
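
The first two guards can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the app's actual code: `is_forbidden_address` and `is_valid_doc_id` are hypothetical names, the document ID format (lowercase hex, 16 to 64 characters) is invented for the example, and the DNS lookup plus the loop that re-checks every redirect hop are omitted.

```python
import ipaddress
import re

# Assumed ID format for illustration only: lowercase hex, 16 to 64 characters.
DOC_ID = re.compile(r"[0-9a-f]{16,64}")


def is_forbidden_address(address: str) -> bool:
    """Return True if a resolved IP address must not be fetched."""
    ip = ipaddress.ip_address(address)
    return (
        ip.is_loopback        # 127.0.0.0/8, ::1
        or ip.is_private      # 10.0.0.0/8, 192.168.0.0/16, ...
        or ip.is_link_local   # 169.254.0.0/16, includes 169.254.169.254
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified  # 0.0.0.0, ::
    )


def is_valid_doc_id(doc_id: str) -> bool:
    """Accept only IDs matching the whole pattern, so '/' and '..' never pass."""
    return DOC_ID.fullmatch(doc_id) is not None
```

The address check has to run on the IP a hostname actually resolves to, and again for each redirect target, since a harmless-looking URL can redirect to an internal address.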

Try it

Go to reader.dzim.site, paste an arXiv link (e.g. https://arxiv.org/abs/2412.06787), and listen. The code is open source under the MIT license.