Sat
Nov 12

Cloning Pocket, part 1 of about 1000

I read a lot of Web pages but it’s not always convenient to read them when I find them. Apps like Instapaper and Pocket let one save a version of the page for later offline viewing, such as when one is on an underground train. These apps also default to providing a “readable” version of the page, rather than the original — this is a version where only the useful parts of the page are presented, sans headers, sidebars, ads and so on. At least, that’s the theory — it’s not easy to automatically figure out which parts of the page are useful.

Anyway, I’m reasonably happy with these apps but of course it’d be better if I didn’t need to give my reading history to a third party, so I started wondering how difficult it would be to write one of these things myself. Turns out it’s quite easy. There are two obvious options:

Option 1. Download the page and attempt to parse the HTML manually. This seems like it would initially be easy but fixing annoyances would be hard — for example, what if JavaScript is required to generate the page? This led me to choose:

Option 2. Remote-control a browser and extract the useful portions of the page in browser context.
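To make the weakness of Option 1 concrete, here is a minimal sketch of the "parse it manually" approach using only Python's standard html.parser module. The SKIP_TAGS set and the names NaiveTextExtractor and extract_text are mine, made up for illustration; this is not what the demo does, and a real extractor would need far smarter heuristics.

```python
from html.parser import HTMLParser

# Assumption: anything inside these tags is page furniture, not article text.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class NaiveTextExtractor(HTMLParser):
    """Collect text that is not inside any of SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # number of skipped tags we are currently inside
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-furniture text of an HTML string, one chunk per line."""
    parser = NaiveTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

Feed this a page whose body is an empty div that a script fills in later and you get an empty string back, which is exactly the JavaScript annoyance mentioned above.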

Obviously, you don’t actually need a complete browser (in particular, you don’t need a user interface), but you do need most of it: JavaScript engine, cookies, everything related to the DOM, and so on. Enter the Chromium Embedded Framework (CEF), and in particular Chromium Embedded Framework for Python (cefpython).

The Chromium Embedded Framework is effectively a complete Chromium browser without, uh, the chrome. You get complete control over what it loads and what it runs, and have access to the DOM via JavaScript. This is great from a Pocket-cloning perspective because several JavaScript projects exist to provide a “readable Web page” experience, such as the original Readability.js, which was the inspiration behind the “Reader View” in Firefox and Safari.

The process, then, is:

  • Retrieve the page using CEF
  • Inject readability.js and wait for it to complete
  • Siphon out the HTML created by readability.js
  • Store this in a database
  • User accounts and Web UI for stored data
  • App to sync the stored data and view it offline
  • ???
  • Profit (Warning: neither Instapaper nor Pocket has figured out what “???” is)

… but so far I only have the first three steps of the process to demonstrate to you.
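The fourth step is not in the demo, but it is probably the least scary one. As a rough sketch of what "store this in a database" might look like, assuming SQLite via Python's standard sqlite3 module and a single table keyed by URL (the table layout and the function names open_store, save_page and load_page are invented here, not taken from the repository):

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the page store; ':memory:' keeps it in RAM for testing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        " url TEXT PRIMARY KEY,"
        " html TEXT NOT NULL,"
        " saved_at REAL NOT NULL)"
    )
    return conn


def save_page(conn, url, html):
    """Insert the readable HTML for url, replacing any earlier copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, html, saved_at) VALUES (?, ?, ?)",
            (url, html, time.time()),
        )


def load_page(conn, url):
    """Return the stored HTML for url, or None if it was never saved."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```

Saving the same URL twice keeps only the newest copy. That seems like a sensible default for a read-later list, though a real service might prefer to keep the history.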

git clone https://github.com/nfd/cefpythondemo
cd cefpythondemo
git clone https://github.com/nfd/readability-js
pip install cefpython3
python readabilitydemo.py <url to download>

This will store a file named readable.html in the current working directory. Unfortunately Python 2 is required — this is a limitation of cefpython. Tested on a Mac — some changes may be required for other systems.

Future work: readability.js is a bit old now and doesn’t work for some pages. Also, it’s useful to store images, sometimes, which readability.js doesn’t do. Since we’re running a full browser it’s also possible to take screenshots (of the hidden display), which would be useful for various things. Lots of finessing would be required around timeouts, resource limits, and clean-up before I’d put this thing on a server. And of course it needs an app to be actually useful.
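On the timeout point, the bluntest option is to run each fetch as a separate process and kill it if it takes too long. Here is a minimal sketch of that idea using subprocess.run from the standard library. It needs Python 3, so it would be a separate wrapper around the Python 2 demo script, and the name run_with_timeout is made up for illustration.

```python
import subprocess


def run_with_timeout(cmd, seconds):
    """Run cmd; return its exit code, or None if it was killed for taking too long."""
    try:
        # On timeout, subprocess.run kills the child before raising TimeoutExpired.
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        return None


# Hypothetical usage; "python2" as the interpreter name is an assumption:
#   run_with_timeout(["python2", "readabilitydemo.py", "https://example.com/"], 60)
```

This only kills the process it started directly. A browser engine may spawn helper processes of its own, so real clean-up would likely need more than this.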