Cloning Pocket, part 1 of about 1000
I read a lot of Web pages but it’s not always convenient to read them when I find them. Apps like Instapaper and Pocket let one save a version of the page for later offline viewing, such as when one is on an underground train. These apps also default to providing a “readable” version of the page, rather than the original — this is a version where only the useful parts of the page are presented, sans headers, sidebars, ads and so on. At least, that’s the theory — it’s not easy to automatically figure out which parts of the page are useful.
Anyway, I’m reasonably happy with these apps but of course it’d be better if I didn’t need to give my reading history to a third party, so I started wondering how difficult it would be to write one of these things myself. Turns out it’s quite easy. There are two obvious options:
Option 2. Remote-control a browser and extract the useful portions of the page in browser context.
The process, then, is:
- Retrieve the page using CEF (the Chromium Embedded Framework).
- Inject readability.js and wait for it to complete.
- Siphon out the HTML created by readability.js.
- Store this HTML in a database.
- Add user accounts and a Web UI for the stored data.
- Write an app to sync the stored data and view it offline.
- Profit (Warning: neither Instapaper nor Pocket has figured out what “???” is)
… but so far I only have the first three steps of the process to demonstrate to you.
git clone https://github.com/nfd/cefpythondemo
git clone https://github.com/nfd/readability-js
pip install cefpython3
python readabilitydemo.py <url to download>
This will store a file named readable.html in the current working directory. Unfortunately Python 2 is required — this is a limitation of cefpython. Tested on a Mac — some changes may be required for other systems.
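The next step in the list above, storing the HTML in a database, is not part of the demo. A minimal sketch of how it might look, using SQLite from the Python standard library, follows. The table layout and the names open_store, save_article and load_article are all made up for illustration; nothing here comes from the demo repository.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) a hypothetical article store; in-memory by default."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles ("
        " url TEXT PRIMARY KEY,"
        " html TEXT NOT NULL,"
        " saved_at REAL NOT NULL)"
    )
    return conn


def save_article(conn, url, html):
    """Store the readable HTML for a URL, replacing any earlier copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO articles (url, html, saved_at)"
            " VALUES (?, ?, ?)",
            (url, html, time.time()),
        )


def load_article(conn, url):
    """Return the stored HTML for a URL, or None if it was never saved."""
    row = conn.execute(
        "SELECT html FROM articles WHERE url = ?", (url,)
    ).fetchone()
    return row[0] if row else None
```

In practice one would read readable.html from disk and pass its contents to save_article along with the original URL. Keying on the URL means saving the same page twice keeps only the latest copy, which may or may not be what you want.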
Future work: readability.js is a bit old now and doesn’t work for some pages. It is also sometimes useful to store images, which readability.js doesn’t do. Since we’re running a full browser, it’s also possible to take screenshots (of the hidden display), which would be useful for various things. A lot of finessing around timeouts, resource limits, and clean-up would be required before I’d put this thing on a server. And of course it needs an app to be actually useful.
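As a rough idea of what the timeout handling could look like, here is a sketch that runs the download as a separate process and kills it if it takes too long. This is an assumption about how one might structure it, not something the demo does. The helper name run_with_timeout is hypothetical, and the wrapper itself needs Python 3.3 or later (for the timeout argument), even though the demo script it would launch needs Python 2.

```python
import subprocess
import sys


def run_with_timeout(argv, timeout_seconds):
    """Run a command in a child process.

    Returns the exit code, or None if the process had to be killed
    because it ran longer than timeout_seconds.
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()   # stop the runaway browser process
        proc.wait()   # reap it so no zombie is left behind
        return None
```

A call might look like run_with_timeout(["python2", "readabilitydemo.py", url], 60), where the python2 executable name and the 60-second limit are guesses that will depend on the system. Running each download in its own process also means a crashed or hung browser cannot take the server down with it, although it does nothing about memory or CPU limits.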