Download RSS Offline
Wednesday 12 October 2022

ruby

A couple of weeks ago I changed my workflow for reading and sending emails. To have more control over my emails I started using offlineimap, which downloads your emails to a directory on your filesystem. I then used Mu, which indexes this directory so you can search your emails offline with queries like “show me unread emails from all inboxes”, or search for a word across all your emails in all inboxes. On top of that I use Mu4e, an email client inside Emacs (my default editor). As I’m using Spacemacs, I added a binding that opens Mu4e with SPC M, and boom: I can see all my emails, I can search with s, and I have a bookmark that shows all unread emails with bi.

Now I want the same for RSS.

Here is the problem: I did some research and couldn’t find similar tools that do the same for RSS, even though it should be easier, since there is no authentication required like with IMAP/SMTP servers. So I spent an hour or so writing a small script that does the same as offlineimap. On my machine this script is called offlinerss; it’s the first piece of the puzzle, and it looks like this:

#!/usr/bin/env ruby
# frozen_string_literal: true

require 'bundler/inline'
require 'open-uri'
require 'fileutils'
require 'digest'
require 'yaml'

gemfile do
  source 'https://rubygems.org'
  gem 'rss'
end

# Create a directory if it doesn't exist yet and return its path
def mkdir(*paths)
  path = File.join(*paths)
  FileUtils.mkdir(path) unless Dir.exist?(path)
  path
end

destination = mkdir(File.expand_path('~/rss/'))
inbox = mkdir(destination, 'INBOX')
meta_dir = mkdir(destination, '.meta')

# Feed URLs come from ~/rss/config.yml
config_file = File.join(destination, 'config.yml')
config = YAML.load_file(config_file)
urls = config['urls']

urls.each do |url|
  url_digest = Digest::SHA1.hexdigest(url)

  URI.open(url) do |rss|
    content = rss.read
    feed = RSS::Parser.parse(content)

    feed.items.each do |item|
      # Atom entries have an id, RSS items have a guid
      id = item.respond_to?(:id) ? item.id : item.guid
      id_digest = Digest::SHA1.hexdigest(id.content)
      file_basename = url_digest + '-' + id_digest + '.xml'

      # Skip items already saved anywhere under ~/rss, not just in INBOX
      next unless Dir.glob(File.join(destination, '**', file_basename)).empty?

      filename = File.join(inbox, file_basename)
      File.write(filename, item.to_s)
    end

    # Strip the entries/items from the feed and keep the rest as metadata
    [{ start_tag: '<entry>', end_tag: '</entry>' }, { start_tag: '<item>', end_tag: '</item>' }].each do |tag|
      next unless content.include?(tag[:start_tag])

      content[content.index(tag[:start_tag])...(content.rindex(tag[:end_tag]) + tag[:end_tag].length)] = ''
    end

    metafile = File.join(meta_dir, url_digest + '.xml')
    File.write(metafile, content)
  end
end

I have a small config file in ~/rss/config.yml which holds all the URLs I care about, so far just the main Ruby/Rails/Go blogs, to be alerted of the latest versions.

urls:
  - https://server.tld/feed.rss
  - https://server.tld/feed.atom

The script just reads the URLs and saves each entry to a file in ~/rss/INBOX, if that file doesn’t already exist in any subdirectory of ~/rss. Then it removes all entries/items from the feed and saves the rest to ~/rss/.meta.

The file name of each RSS item is sha1(url)-sha1(item.id).xml and the meta file name is sha1(url).xml. Very simple.
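
To make the naming scheme concrete, here is a tiny sketch using a made-up feed URL and item id (both hypothetical) that prints the two file names the script would produce:

require 'digest'

# Hypothetical values, only to illustrate the naming scheme above
url     = 'https://server.tld/feed.atom'
item_id = 'https://server.tld/posts/42'

url_digest = Digest::SHA1.hexdigest(url)
id_digest  = Digest::SHA1.hexdigest(item_id)

puts "#{url_digest}-#{id_digest}.xml" # item file, saved under ~/rss/INBOX
puts "#{url_digest}.xml"              # meta file, saved under ~/rss/.meta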

So now I need to write a client that reads the files in ~/rss/, renders the XML, and offers actions to create directories under ~/rss/ and to move a file to another directory once it’s read, or when I want to move it to read-later or something, just like email directories.
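
I haven’t written that client yet, but a minimal sketch could look like the following, assuming the same ~/rss layout; the move_item helper and the 'read-later' folder name are placeholders I made up for illustration:

#!/usr/bin/env ruby
# frozen_string_literal: true

# Minimal sketch of a reader client over the ~/rss layout described above.
require 'fileutils'

RSS_DIR = File.expand_path('~/rss')
INBOX   = File.join(RSS_DIR, 'INBOX')

# List unread items: every file still sitting in INBOX.
# Crude regex title extraction is good enough for a sketch.
Dir.glob(File.join(INBOX, '*.xml')).each do |path|
  title = File.read(path)[%r{<title[^>]*>(.*?)</title>}m, 1]
  puts "#{File.basename(path)}  #{title}"
end

# Move an item to another folder once it's read, like moving a mail to a folder.
# 'read-later' is only an example folder name.
def move_item(basename, folder = 'read-later')
  dest = File.join(RSS_DIR, folder)
  FileUtils.mkdir_p(dest)
  FileUtils.mv(File.join(INBOX, basename), dest)
end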

Another piece of the puzzle is an indexer, like what Mu does for emails.
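
As a stopgap until a real indexer exists, even a naive full scan of the saved files goes a long way. This sketch is my own and nothing like Mu’s actual index or query language; it just greps the item files for a word:

#!/usr/bin/env ruby
# frozen_string_literal: true

# Naive stand-in for the indexer: scan every saved item under ~/rss and print
# the files whose text contains the query. No real index, no query language.
abort 'usage: rsssearch QUERY' if ARGV.empty?
query = ARGV.first.downcase

# Dir.glob skips dot-directories like ~/rss/.meta by default, which is what we want.
Dir.glob(File.expand_path('~/rss/**/*.xml')).each do |path|
  text = File.read(path).gsub(/<[^>]+>/, ' ') # crude tag stripping
  puts path if text.downcase.include?(query)
end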

I was surprised that it was easier to just sit down and write the thing myself than to spend days searching for an existing solution.
