Scraping Apple News+
I’ve been an Apple News+ subscriber for the past year or so, just to be able to read paywalled articles from WSJ, Business Insider, WIRED, The Atlantic, and similar publications. It’s not my primary news source (I use Feedly via Reeder), but when I encounter links to these sites on socials or in RSS feeds, I open them in News.app. A keyboard shortcut for Choosy helps a lot with this.
That ended today when I learned that Apple had struck a deal with Taboola, a company known for serving low-quality ads next to web content. Taboola + Apple News? No thanks – On my Om
The reading experience on News+ was already pretty terrible because of the number of ads, even before this deal with Taboola, which will only make it worse. The thing is, I don’t read articles in the News app. Instead, I extract the plaintext version of each article from News+ and read it later, outside the app.
Let me explain how to automate it.
When you open an article in the News app on the Mac, you can copy the link to the article with a keyboard shortcut: Cmd-Alt-C.
And you can extract the full body of the article by selecting everything with “Select All,” i.e. Cmd-A, then copying it with Cmd-C. Fortunately, this won’t include any text from the ads on the page.
The title of the article is already included in the body when you select all, but alternatively, it is available via the URL we retrieved with Copy Link above. That URL points to a redirect page that opens the News app on the Mac, but the page has a <title> tag containing the title of the article itself.
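The <title> lookup described above can also be approximated with standard tools only. This is a minimal sketch, assuming the page has a simple <title> element with no nested markup; `extract_title` is a hypothetical helper name, and HTML entities such as `&amp;` are left undecoded.

```shell
#!/bin/bash
# extract_title is a hypothetical helper, not part of any tool.
# It reads HTML on stdin and prints the text of the first <title> element.
extract_title() {
  # Flatten the input to one line so a title split across lines still matches,
  # keep only the first <title>...</title> match, then strip the tags and
  # any leading or trailing spaces.
  tr '\n' ' ' \
    | grep -o '<title[^>]*>[^<]*</title>' \
    | head -n 1 \
    | sed -e 's/<[^>]*>//g' -e 's/^ *//' -e 's/ *$//'
}

# Usage (assumes $url holds the link copied from the News app):
#   title=$(curl -s "$url" | extract_title)
```

A regular expression is not a real HTML parser, so this only suits pages as simple as a redirect stub.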
Given all that, you can write a script to run on a Mac to get:
- Link (via Copy Link)
- Title (via HTML of the link above)
- Plaintext Content without ads (via Select All, then copy)
#!/bin/bash
# Requires: brew install cliclick html-xml-utils

# Double-click at the current mouse position so the article gets focus
cliclick dc:.
sleep 0.2

# Select All (Cmd-A), then copy (Cmd-C) to grab the article body
osascript -e 'tell application "System Events" to keystroke "a" using command down'
osascript -e 'tell application "System Events" to keystroke "c" using command down'
sleep 0.2  # give the clipboard a moment to update before reading it
body=$(pbpaste)

# Send Cmd-Alt-C to copy the article URL
osascript -e 'tell application "System Events" to keystroke "c" using {command down, option down}'
sleep 0.2
url=$(pbpaste)

# Fetch $url and read its <title> element to get the article title
title=$(curl -s "$url" | hxselect -c title)

# do something with $body, $title and $url
The script uses cliclick to force a double-click, to make sure Cmd-A can actually grab the text from the article. For some reason, Cmd-A doesn’t select the article text until you click the article page to give it focus.
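Because the copy happens through simulated keystrokes, reading the clipboard immediately afterwards may return the previous contents. One way to guard against that is to poll until the clipboard changes. The following is a minimal sketch under that assumption; `wait_for_clipboard_change` and the `PASTE_CMD` override are hypothetical names introduced for illustration.

```shell
#!/bin/bash
# PASTE_CMD is a hypothetical override hook; it defaults to macOS's pbpaste.
PASTE_CMD=${PASTE_CMD:-pbpaste}

# Poll the clipboard until its content differs from $1.
# Gives up after $2 attempts (default 20), spaced 0.1 s apart.
# Prints the new content and returns 0 on change, returns 1 on timeout.
wait_for_clipboard_change() {
  local previous=$1 tries=${2:-20} current
  while [ "$tries" -gt 0 ]; do
    current=$($PASTE_CMD)
    if [ "$current" != "$previous" ]; then
      printf '%s\n' "$current"
      return 0
    fi
    sleep 0.1
    tries=$((tries - 1))
  done
  return 1
}

# Usage:
#   before=$(pbpaste)
#   osascript -e 'tell application "System Events" to keystroke "c" using command down'
#   body=$(wait_for_clipboard_change "$before")
```

Note that this times out if the newly copied text happens to be identical to what was already on the clipboard.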
I added this script to Alfred so that I can run it with a keyboard shortcut (Shift-Alt-E), and hooked it up with BetterTouchTool so that a four-finger tap invokes it in the News app.
Once you have the link, title and body of an article, you can do whatever you want with them. In my case, I have another script that saves the article to Instapaper via its API, so I can read it later.
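As an illustration of that last step, here is a minimal sketch that posts a link to Instapaper's Simple API `add` endpoint. It is not the script mentioned above and may differ from it: `save_to_instapaper`, the `CURL` override, and the `INSTAPAPER_USER` / `INSTAPAPER_PASS` variables are hypothetical names, and the Simple API's `selection` field is a short description shown under the title, not a full-text field.

```shell
#!/bin/bash
# CURL is a hypothetical override hook; it defaults to the real curl.
CURL=${CURL:-curl}

# save_to_instapaper URL TITLE SELECTION
# Posts to Instapaper's Simple API and prints the HTTP status code
# (the API documents 201 as "added").
# INSTAPAPER_USER and INSTAPAPER_PASS are assumed environment variables.
save_to_instapaper() {
  local url=$1 title=$2 selection=$3
  "$CURL" -s -o /dev/null -w '%{http_code}' \
    --data-urlencode "username=$INSTAPAPER_USER" \
    --data-urlencode "password=$INSTAPAPER_PASS" \
    --data-urlencode "url=$url" \
    --data-urlencode "title=$title" \
    --data-urlencode "selection=$selection" \
    https://www.instapaper.com/api/add
}

# Usage (with $url, $title and $body from the script above):
#   status=$(save_to_instapaper "$url" "$title" "$body")
#   [ "$status" = "201" ] && echo "Saved"
```

Passing credentials as command-line arguments exposes them to other local processes, so a real setup might read them from a curl config file instead.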