Stack: Building an Offline Wikipedia Reader

Made an offline Wikipedia reader because existing solutions assume too much about hardware and connectivity. Needed something that works in 256MB of RAM with minimal storage—library stack density, not library catalog sprawl.

The Constraint

The gap isn’t “offline Wikipedia readers don’t exist”—they do. The constraint is portability under resource limits. Disaster relief scenarios, expeditions, educational deployments in low-connectivity areas, systems where you can’t afford browser overhead. Places where having extensive knowledge bases is mission-critical but heavyweight solutions aren’t viable.

This isn’t about convenience. It’s about having access to facts, procedures, and equations when you’re stuck somewhere and need them.

Working with Real Data

Started with Simple English Wikipedia because it’s small enough to iterate quickly but real enough to reveal constraints. Downloaded the XML dump: 387K articles, 944MB uncompressed.

Wrote a streaming XML parser with quick-xml, so we never hold more than one article in memory at a time. Parsed the entire dump to understand what we’re actually working with:

Distribution (raw MediaWiki markup):

- Median: 1092 bytes
- Average: 2557 bytes
- P95: 8616 bytes
- Max: 551KB

This told us what to optimize for: lots of small articles, not a few huge ones. Chunked compression will be more efficient than per-article compression (overhead dominates at 1KB).
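The streaming idea is simple enough to sketch with the standard library alone (the real parser uses quick-xml; the tag-scanning below is a stand-in, and the sample dump string is made up for illustration):

```rust
use std::io::{BufRead, Cursor};

/// Stream articles out of a MediaWiki-style dump one <page> at a time,
/// so memory use is bounded by the largest single article, not the dump.
/// Std-only sketch; the real implementation uses quick-xml events.
fn for_each_page<R: BufRead, F: FnMut(&str)>(reader: R, mut on_page: F) {
    let mut page = String::new();
    let mut in_page = false;
    for line in reader.lines().flatten() {
        if line.contains("<page>") {
            in_page = true;
            page.clear();
        }
        if in_page {
            page.push_str(&line);
            page.push('\n');
        }
        if line.contains("</page>") {
            in_page = false;
            on_page(&page); // handle one article, then drop it
        }
    }
}

fn main() {
    let dump = "<mediawiki>\n<page>\n<title>Rust</title>\n</page>\n\
                <page>\n<title>Go</title>\n</page>\n</mediawiki>\n";
    let mut titles = Vec::new();
    for_each_page(Cursor::new(dump), |p| {
        if let (Some(s), Some(e)) = (p.find("<title>"), p.find("</title>")) {
            titles.push(p[s + 7..e].to_string());
        }
    });
    assert_eq!(titles, vec!["Rust", "Go"]);
}
```

Only the buffer for the current `<page>` grows, which is what keeps peak memory near the 551KB worst-case article rather than the 944MB dump.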

Stripping MediaWiki Markup

Wikipedia text is full of markup: `'''bold'''`, `[[links]]`, `{{templates}}`, `<ref>` tags, HTML comments, nested formatting. Built a regex-based stripper that handles the common cases:

- Extract `[[Article]]` and `[[Article|Display]]` links, keeping the display text
- Remove `{{templates}}`, including nested braces
- Strip formatting (`'''`, `''`)
- Convert `==Section==` headings to plain text with spacing
- Remove references, files, and HTML tags
- Collapse excessive whitespace

This isn’t a proper MediaWiki parser—it’s pragmatic. Handles 95% of articles cleanly. The 5% edge cases (complex tables, math notation) get mangled, but for plain-text reading that’s acceptable. Also filtered out stubs under 100 bytes—likely redirects or empty articles. This dropped 117K articles (30% of the total) but saved little storage, since they were tiny anyway.
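Two of the common cases can be sketched with plain string scanning (the real stripper is regex-based and covers more; `strip_wiki` and its exact behavior here are illustrative only):

```rust
/// Strip two common MediaWiki constructs: bold/italic quote markers and
/// [[link]] / [[link|display]] syntax. Std-only sketch of the idea; the
/// real stripper uses regexes and handles templates, refs, headings, etc.
fn strip_wiki(src: &str) -> String {
    let mut out = String::with_capacity(src.len());
    let mut rest = src;
    while let Some(open) = rest.find("[[") {
        out.push_str(&rest[..open]);
        match rest[open + 2..].find("]]") {
            Some(close) => {
                let inner = &rest[open + 2..open + 2 + close];
                // keep display text for [[Article|Display]], target otherwise
                out.push_str(inner.rsplit('|').next().unwrap_or(inner));
                rest = &rest[open + 2 + close + 2..];
            }
            None => {
                out.push_str(&rest[open..]);
                rest = "";
            }
        }
    }
    out.push_str(rest);
    // bold before italic, so ''' is not half-eaten by the '' pass
    out.replace("'''", "").replace("''", "")
}

fn main() {
    let s = strip_wiki("'''Rust''' is a [[systems programming|systems]] language from [[Mozilla]].");
    assert_eq!(s, "Rust is a systems language from Mozilla.");
}
```

The ordering detail in the last line is the kind of thing that bites: stripping `''` first would leave a stray `'` behind every bold marker.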

Distribution after stripping:

- Median: 648 bytes (40% smaller)
- Average: 1548 bytes
- P95: 5056 bytes
- Max: 323KB
- Total: 399MB (down from 944MB raw)
- Extracted: 7.2M links

The stripped text is clean, readable, and compresses well. Estimated compressed size: ~114MB.

The Format That Emerged

Built a custom binary format—not because existing formats don’t work, but because the constraint was specific: minimal overhead, fast random access, simple enough to reimplement anywhere.

`.stack` file layout:

```
┌─────────────────────────────────────┐
│ Header (64 bytes)                   │
│ - Magic: "STACK\0\0\0"              │
│ - Version: u16                      │
│ - Article count: u32                │
│ - Index offset: u64                 │
│ - Reserved: 38 bytes                │
├─────────────────────────────────────┤
│ Article Data (sequential)           │
│ - title_len(u16) + title            │
│ - text_len(u32) + text              │
│ - (repeat for all articles)         │
├─────────────────────────────────────┤
│ Title Index (at end)                │
│ - count: u32                        │
│ - sorted (title, offset) pairs      │
│ - Binary searchable                 │
└─────────────────────────────────────┘
```

Why this structure:

The index comes at the end so we can stream-write articles during build, then append the index once we know all offsets. The reader loads the index into memory (sorted titles → file offsets), then binary searches for lookups and seeks directly to the article data.

A fixed-length header and length-prefixed strings mean no parsing complexity. Rust’s BTreeMap gives us sorted in-memory lookups essentially for free. Format overhead is ~3% (13MB for 270K articles).
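The record layout and the seek-based lookup fit in a few lines. A std-only sketch using an in-memory cursor in place of a file, with the header and on-disk index omitted; little-endian length prefixes are an assumption of the sketch, since the post doesn’t specify endianness:

```rust
use std::collections::BTreeMap;
use std::io::{Cursor, Read, Seek, SeekFrom, Write};

/// Append one title_len(u16) + title + text_len(u32) + text record,
/// returning the offset it was written at.
fn write_record(file: &mut Cursor<Vec<u8>>, title: &str, text: &str) -> u64 {
    let offset = file.position();
    file.write_all(&(title.len() as u16).to_le_bytes()).unwrap();
    file.write_all(title.as_bytes()).unwrap();
    file.write_all(&(text.len() as u32).to_le_bytes()).unwrap();
    file.write_all(text.as_bytes()).unwrap();
    offset
}

/// Seek straight to a record and read the article text back.
fn read_record(file: &mut Cursor<Vec<u8>>, offset: u64) -> String {
    file.seek(SeekFrom::Start(offset)).unwrap();
    let mut l2 = [0u8; 2];
    file.read_exact(&mut l2).unwrap();
    let mut title = vec![0u8; u16::from_le_bytes(l2) as usize];
    file.read_exact(&mut title).unwrap(); // skip past the title
    let mut l4 = [0u8; 4];
    file.read_exact(&mut l4).unwrap();
    let mut text = vec![0u8; u32::from_le_bytes(l4) as usize];
    file.read_exact(&mut text).unwrap();
    String::from_utf8(text).unwrap()
}

fn main() {
    let mut file = Cursor::new(Vec::new());
    let mut index = BTreeMap::new(); // sorted titles -> offsets
    for (title, text) in [("Rust", "A systems language."), ("Zig", "Another one.")] {
        let off = write_record(&mut file, title, text);
        index.insert(title.to_string(), off);
    }
    // Lookup: O(log n) in the in-memory index, one seek, one read.
    assert_eq!(read_record(&mut file, index["Rust"]), "A systems language.");
}
```

Swap the cursor for a `File` and the same code stream-writes during build and seeks during reads, which is all the format asks for.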

Why not SQLite? More overhead, more dependencies, more complexity. We don’t need SQL—we need key-value lookup with sorted keys.

Why not tar + gzip? Can’t do random access without decompressing everything. Our format lets you seek directly to one article.

Building and Reading

CLI with four commands:

```bash
# Build archive from Wikipedia dump
stack build simplewiki.xml -o simplewiki.stack

# Read article by exact title
stack read simplewiki.stack "Rust"

# Search article titles (substring match)
stack search simplewiki.stack "programming"

# Show archive metadata
stack info simplewiki.stack
```

Built Simple English Wikipedia in 12 seconds on a laptop: 270K articles, 412MB uncompressed. Article lookup is instant—binary search through title index, seek to offset, read article. No decompression needed yet.

Search is naive (substring matching on titles) but fast enough for 270K articles. Could add fuzzy matching or ranking later, but exact + substring covers most use cases.

Results

Simple English Wikipedia:

- Input: 387K articles, 944MB XML
- After filtering: 270K articles, 399MB stripped text
- Archive: 412MB (3% overhead)
- Build time: 12 seconds
- Lookup time: <1ms

The format works. Text is readable. Size is reasonable even uncompressed. Mission accomplished.

What’s Next

This is the MVP—proved the round-trip works and the format handles real data.

Future work:

Compression: Add zstd to get that 412MB down to ~120MB. Compress in 1MB chunks (≈650 articles) for good ratio while keeping random access fast.
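The bookkeeping for chunked compression can be sketched without the compressor itself: assign each article a (chunk, offset-within-chunk) slot so a lookup decompresses one ~1MB chunk instead of the whole archive. The 1MiB constant and the never-split-an-article rule are assumptions of this sketch, not settled design:

```rust
/// Target chunk size before compression (assumed 1 MiB).
const CHUNK_SIZE: usize = 1 << 20;

/// Map each article (by size, in write order) to (chunk_id, offset
/// within the uncompressed chunk). Articles never span two chunks,
/// so one lookup means decompressing exactly one chunk with zstd.
fn assign_chunks(article_sizes: &[usize]) -> Vec<(u32, usize)> {
    let (mut chunk, mut used) = (0u32, 0usize);
    let mut slots = Vec::with_capacity(article_sizes.len());
    for &size in article_sizes {
        if used + size > CHUNK_SIZE && used > 0 {
            chunk += 1; // start a new chunk rather than splitting
            used = 0;
        }
        slots.push((chunk, used));
        used += size;
    }
    slots
}

fn main() {
    // three articles: the first two share chunk 0, the third spills over
    let slots = assign_chunks(&[600_000, 400_000, 600_000]);
    assert_eq!(slots, vec![(0, 0), (0, 600_000), (1, 0)]);
}
```

The title index would then store (chunk_id, offset) pairs instead of raw file offsets, at the cost of one chunk decompression per cold lookup.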

Link navigation: We extracted 7.2M links during parsing. Store them in a separate link index. Add a stack links <article> command to show outbound links, maybe backlinks too.

Full English Wikipedia: Scale test with 6.5M articles. Will the index fit in memory? Does binary search stay fast? Format should handle it—just needs more time to build.
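A back-of-envelope answer to the memory question, assuming ~25-byte average titles and a u64 offset per entry (both assumptions, not measurements from the dump):

```rust
// Rough estimate of in-memory index size for full English Wikipedia.
// Assumptions: 6.5M articles, ~25 bytes per title, 8-byte u64 offset.
// BTreeMap node overhead would add more on top of this.
fn main() {
    let articles: u64 = 6_500_000;
    let per_entry: u64 = 25 + 8; // title bytes + offset
    let bytes = articles * per_entry;
    assert_eq!(bytes / 1_000_000, 214);
    println!("~{} MB of raw index data", bytes / 1_000_000);
}
```

So roughly 200–300MB once map overhead is counted: tight but plausible on desktop hardware, and an argument for a disk-resident index on the 256MB targets.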

TUI reader: Terminal interface with vim-style navigation. Article view, link following, search, history. Keep it minimal—this is about reading, not browsing.

Universal format: Wikipedia was the test case, but the format works for any knowledge base. Add tools to convert markdown/HTML/plaintext into .stack archives. Make it easy to package documentation, books, technical references.

The Real Why

The anti-brand answer: made this because the alternatives assume network access, modern hardware, and browser environments. Sometimes you’re on a system with 256MB of RAM, no internet, and you need to look up a procedure. Or you’re deploying educational resources where connectivity is unreliable and storage is expensive. Or you’re building disaster relief hardware that needs to work when nothing else does.

This isn’t a side project for side project’s sake. It fills a specific gap: portable, efficient, terminal-native knowledge access under constraint. Built it by working with real data to understand what matters—not by theorizing about perfect architectures.

The format is simple enough to reimplement in any language. The reader is a few hundred lines of Rust. The archive is just bytes on disk. That’s the point.