Paper Reader: Listening to Science, One Word at a Time

I read a lot of papers. And I mean a lot. So I built something to help with that — Paper Reader, a web app that reads scientific papers aloud, sentence by sentence, with live word-by-word highlighting.

The idea

The problem is simple: reading papers is exhausting, especially on a screen. But most text-to-speech tools mangle scientific content — they stumble over math notation, skip references weirdly, and sound robotic. I wanted something that:

  1. Reads any paper (arXiv, HTML, or PDF)
  2. Handles LaTeX math intelligently — converting it to spoken English
  3. Shows you exactly where it is — word-level highlighting, not just sentence-level
  4. Uses natural neural voices, not the old robot ones

How it works

The architecture is a three-tier system:

┌──────────────────┐     ┌──────────────────┐     ┌──────────────────┐
│  Frontend (nginx)│────▶│  Backend (Flask) │────▶│  TTS (Flask)     │
│  static SPA +    │     │  parse · serve · │     │  Kokoro · say    │
│  KaTeX           │     │  word-timing     │     │  · espeak        │
└──────────────────┘     └──────────────────┘     └──────────────────┘

Parsing

The backend resolves arXiv links (auto-fetching the HTML version, with ar5iv and PDF fallbacks), parses HTML articles, and does layout-aware PDF extraction with pymupdf4llm — real headings from font sizes, multi-column text, images at their true positions, and automatic OCR of scanned pages.
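The fallback order for arXiv links can be sketched as a small pure function. This is an illustration only: the function name, the regular expression, and the exact URL patterns (including the ar5iv host) are assumptions, not taken from the Paper Reader source.

```python
import re

# Matches new-style arXiv identifiers such as 2412.06787 or 2412.06787v2
# inside abs/, pdf/ or html/ links. (Hypothetical pattern for this sketch.)
ARXIV_ID = re.compile(r"arxiv\.org/(?:abs|pdf|html)/(\d{4}\.\d{4,5}(?:v\d+)?)")


def arxiv_candidates(url):
    """Return the URLs to try, in order, for an arXiv link (empty list otherwise)."""
    match = ARXIV_ID.search(url)
    if match is None:
        return []
    paper_id = match.group(1)
    return [
        f"https://arxiv.org/html/{paper_id}",             # native HTML version first
        f"https://ar5iv.labs.arxiv.org/html/{paper_id}",  # ar5iv rendering as fallback
        f"https://arxiv.org/pdf/{paper_id}",              # PDF as the last resort
    ]
```

A caller would walk the list and stop at the first URL that returns a usable document, for example `arxiv_candidates("https://arxiv.org/abs/2412.06787")`.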

Math → speech

This is the cool part. I wrote mathtex2text.py, which walks the LaTeX AST using pylatexenc and emits spoken English:

LaTeX          Spoken
\sum           "sum"
\int           "integral"
\sqrt{x}       "square root of x"
\frac{a}{b}    "a over b"
\alpha         "alpha"

On screen, $...$ is rendered with KaTeX. In the audio, the length of each formula's spoken form feeds that word's duration weight, so word-highlight timing stays roughly aligned.
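To make the mapping above concrete, here is a minimal standard-library sketch of the same idea. It swaps pylatexenc for a tiny hand-rolled tokenizer and recursive walk, so it is not how mathtex2text.py is written; the `SPOKEN` table and all function names are hypothetical, and only the five macros from the table are covered.

```python
import re

# Hypothetical lookup table for argument-less macros; a real tool needs many more.
SPOKEN = {"sum": "sum", "int": "integral", "alpha": "alpha"}

# A token is a macro name, a brace, or a single non-space character.
TOKEN = re.compile(r"\\([A-Za-z]+)|([{}])|([^\\{}\s])")


def tokenize(latex):
    """Split LaTeX into ('macro', name), ('brace', ch) and ('char', ch) tokens."""
    tokens = []
    for m in TOKEN.finditer(latex):
        if m.group(1):
            tokens.append(("macro", m.group(1)))
        elif m.group(2):
            tokens.append(("brace", m.group(2)))
        else:
            tokens.append(("char", m.group(3)))
    return tokens


def parse_atom(tokens, pos):
    """Translate one atom (char, macro with its arguments, or {group}) at pos."""
    kind, value = tokens[pos]
    if kind == "brace" and value == "{":
        words, pos = parse_sequence(tokens, pos + 1)
        return words, pos + 1  # step over the closing brace
    if kind == "macro":
        if value == "frac":  # two arguments: numerator, denominator
            num, pos = parse_atom(tokens, pos + 1)
            den, pos = parse_atom(tokens, pos)
            return num + ["over"] + den, pos
        if value == "sqrt":  # one argument
            arg, pos = parse_atom(tokens, pos + 1)
            return ["square", "root", "of"] + arg, pos
        return [SPOKEN.get(value, value)], pos + 1  # unknown macros: say the name
    return [value], pos + 1


def parse_sequence(tokens, pos=0):
    """Translate atoms until the end of input or an unmatched closing brace."""
    words = []
    while pos < len(tokens) and tokens[pos] != ("brace", "}"):
        atom, pos = parse_atom(tokens, pos)
        words += atom
    return words, pos


def latex_to_speech(latex):
    words, _ = parse_sequence(tokenize(latex))
    return " ".join(words)
```

With this sketch, `latex_to_speech(r"\frac{a}{b}")` returns `"a over b"`, and the number of words in the result is one plausible input for the duration weight. Superscripts, subscripts, optional arguments, and malformed input are deliberately left out.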

Neural TTS

I ship the open Kokoro-82M model with 12 bundled voice packs. On Apple Silicon, running natively (not in Docker), it achieves a real-time factor of ~0.21 — about 4.7× realtime, so prefetch always stays ahead of playback.

The RTF math is straightforward:

\[ \text{RTF} = \frac{t_{\text{generation}}}{t_{\text{audio}}} \]

where RTF \(< 1\) means faster than realtime. The key insight: don't run neural TTS inside a Linux VM on macOS — the overhead kills performance.
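The RTF definition translates directly into code. The helper names below are hypothetical, and the timing values in the usage note are made-up inputs for illustration, not measurements.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def speedup(rtf):
    """How many seconds of audio are produced per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf
```

For example, if generating 10 seconds of audio took 2.1 seconds, `real_time_factor(2.1, 10.0)` gives an RTF of about 0.21, and `speedup` of that value is 1 / 0.21, roughly 4.76.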

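# Simplified TTS request flow.
# `kokoro`, `encode`, and `distribute_duration` are assumed to be defined elsewhere.
def synthesize(text, voice="af_sarah"):
    # generate raw samples and the sample rate for the chosen voice
    audio, sr = kokoro.generate(text, voice=voice)
    # audio length in seconds = number of samples / sample rate
    duration = len(audio) / sr
    # distribute duration over words proportionally to their weights
    timings = distribute_duration(text, duration)
    return {"audio_b64": encode(audio), "duration": duration, "timings": timings}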
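The `distribute_duration` helper is not shown in the post. One way it could work, assuming each word's weight defaults to its character count and that timings are dictionaries with `word`, `start`, and `end` keys (both assumptions for this sketch):

```python
def distribute_duration(text, duration, weight=len):
    """Split `duration` seconds across the words of `text`, proportionally to weight(word)."""
    words = text.split()
    weights = [max(1, weight(w)) for w in words]  # every word gets at least weight 1
    total = sum(weights)
    timings, start = [], 0.0
    for word, w in zip(words, weights):
        end = start + duration * w / total
        timings.append({"word": word, "start": start, "end": end})
        start = end
    if timings:
        timings[-1]["end"] = duration  # absorb floating-point drift on the last word
    return timings
```

For a math token, `weight` could instead return the length of its spoken form, so that a formula read as "a over b" gets more time than its few source characters would suggest.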

Word-level highlighting

The frontend syncs audio.currentTime against the word timings in a requestAnimationFrame loop:

  1. Fetch audio for sentence \(n\) from /api/doc/<id>/audio/<n>
  2. Play it
  3. In each animation frame, compute which word the playback position corresponds to
  4. Highlight that word (dark yellow) within the sentence (light yellow)
  5. Prefetch audio for sentences \(n+1, n+2, \ldots\) in the background
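Step 3 of the loop above is a lookup of the playback position in a sorted list of word start times. The frontend does this in JavaScript; the sketch below shows the same lookup in Python with a binary search. The function name and the timing format (dictionaries with a `start` key, in seconds) are assumptions.

```python
from bisect import bisect_right


def word_index_at(timings, t):
    """Return the index of the word being spoken at time t, or None if there are no words."""
    if not timings:
        return None
    starts = [w["start"] for w in timings]
    # The last word whose start time is <= t is the one currently playing.
    i = bisect_right(starts, t) - 1
    # Clamp so positions before the first word or after the last still map to a word.
    return min(max(i, 0), len(timings) - 1)
```

Because the start times are sorted, the lookup is logarithmic in the sentence length, which is cheap enough to repeat on every animation frame.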

Deployment

It's containerised with Docker and orchestrated with Kubernetes (minikube or k3s), with nginx rate-limiting and TLS via Traefik + cert-manager (Let's Encrypt). The live instance is at reader.dzim.site.

Security matters when you're fetching arbitrary URLs:

  • SSRF protection — validates every redirect hop, refuses loopback/private addresses (including cloud metadata 169.254.169.254)
  • Path-traversal guards — document IDs strictly validated
  • Size caps — 64 MB upload, 60 MB download
  • Least privilege — containers run as non-root with read-only rootfs
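The first two guards in the list above can be sketched with the standard library. This is a simplified illustration, not the project's code: the function names and the 32-hex-character document ID format are assumptions, and a real SSRF check also has to resolve the hostname and re-validate the address on every redirect hop.

```python
import ipaddress
import re

# Hypothetical ID format: exactly 32 lowercase hex characters, so no slashes or dots.
DOC_ID = re.compile(r"[0-9a-f]{32}")


def is_public_ip(address):
    """Return True only for addresses that are safe to fetch from."""
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False  # not an IP literal at all
    # Unwrap IPv4-mapped IPv6 (::ffff:a.b.c.d) so it cannot bypass the checks.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local  # covers the cloud metadata address 169.254.169.254
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def is_valid_doc_id(doc_id):
    """Accept only IDs matching the strict pattern, rejecting any path components."""
    return DOC_ID.fullmatch(doc_id) is not None
```

Validating the ID against an allow-list pattern, rather than stripping dangerous characters, means a value like `../etc/passwd` is rejected outright instead of being partially cleaned.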

Try it

Go to reader.dzim.site, paste an arXiv link (e.g. https://arxiv.org/abs/2412.06787), and listen. The code is open source under the MIT license.