|[] |
| __________ |
| | RSS- | |
| | saver | |
| |________| |
| ________ |
| [ [ ] ] |
\___[_[_]__]___|
A simple Clojure (Babashka) script to save world@hey.com blog posts locally.
I use Hey's blog feature to write blog posts. I do it in the hopes that each post helps me clarify and improve my own thinking about life, design, and programming. It's a cool feature, but I worry that I may not be able to retrieve my posts in the event of service shutdown, or if I move on to another email provider in the future.
The script version of this project exists at https://github.com/adam-james-v/rss-saver and is a much more realistic way to use this code.
About This Post
This post is a showcase of the functionality of the script. It is not meant to be a real interface to the script. It's just a fun way to show off some code!As you read through the post, you'll encounter src blocks with editable Clojurescript code. Every edit causes the src block to re-evaluate, so I suggest keeping the code small and simple for any given block.
In the script version of this code, I don't run any of the functions until the
-main
function is called, but that's not necessarily the most instructive way to show code in post form. So, I generally follow this pattern:
- explanation of a need and implementation idea(s)
- implementing the necessary functions in a src block
- a 'playground' src block with commented code that the reader can uncomment to observe function behaviour.
Using this pattern, I think, will make for an informative post. I hope you like it.
deps
For reference, these are the dependencies compiled into this page. Dependencies for the script are all built into the v0.6.0 Babashka binary.{:deps {org.clojure/data.xml {:mvn/version "0.2.0-alpha6"} cljs-ajax/cljs-ajax {:mvn/version "0.8.4"}}}
Design
The full design doc can be found here.The high-level view of what this code does is very simple:
- Grab the feed XML from the given URL
- Parse the URL and get nodes that correspond to posts (entries)
- Transform each entry node into a Clojure map with a Hiccup data structure
- Save and/or Render the document
Code
I love Clojure and Clojurescript. If you haven't already noticed it, I'm using Clojure for the script version of this project, which works almost perfectly in this Clojurescript context, too. The only code that is Clojurescript specific are related to the http GET request and the rendering function that uses Reagent to render things into this document's DOM directly. A CLI interface is omitted in the browser context as well, for obvious reasons.Anyway, let's get started by setting up a namespace.
ns
As part of the design criteria, I want this to work without pulling any new libraries from outside of the babashka tool. This means sticking with clojure.data.xml even though other libraries might be a little more straight forward.For this web example, I'm also using
cljs-ajax
to perform a GET to grab the example
feed.atom
XML.
(ns rss-saver.main (:require [clojure.string :as str] [clojure.data.xml :as xml] [clojure.zip :as zip] [ajax.core :refer [GET]]))
Getting the Feed
Usecljs-ajax
to get the feed XML and store in in an atom. Defonce is used to prevent the feed atom from being overwritten if/when the reader edits the src block.
I spent some time trying to get fetch to work with cross-origin URLs, so that the reader could paste different hey.com URLs into the code, but failed. I am honestly very confused by what I need to do. Everything I look up talking about CORS headers indicates something happening server side, but this is a client only page trying to make a GET request to an external URL.
If anyone knows how to set this up to work with cross-origin URLs, please create an issue or discussion on https://github.com/adam-james-v/rss-saver or Tweet me @RustyVermeer.
Instead, I have a copy of my feed.atom uploaded with the rest of this page which I will use as if it were the real URL.
(defonce feed (atom "not fetched.")) (defn fetch [] (GET "feed.atom" {:handler #(reset! feed %)})) (fetch)
Uncomment the code below to confirm that the feed is stored in the atom as an XML string. The
@
in front of the
feed
symbol is syntax for dereferencing an atom. It is equivalent to
(deref feed)
and is required to access the value stored inside the atom.
;;@feed
Zippers and Node Editors
I'll admit up front that I don't fully grasp the idea of zippers yet. I have to study them a bit further and fiddle around with my own implementations, but I know enough to still make use of them. I suggest searching around and reading a few tutorials and documentation pages to start understanding. I found this article by Ivan Grishaev to be quite thorough and informative.A zipper is a data structure that you can use to traverse, search, and modify tree data structures. They are valuable in Clojure because they offer an immutable way to work with trees. To build a zipper, you do need to specify what a branch and a child look like. Luckily, I don't have to do that manually for XML as
clojure.zip
provides the function
xml-zip
to do the heavy lifting.
I want to become more confident with zippers, but for now, I can use the examples provided by
https://ravi.pckl.me/short/functional-xml-editing-using-zippers-in-clojure/ to create some tree editing tools:
edit-nodes
,
remove-nodes
, and
get-nodes
.
Each of these functions takes a zipper and a matcher function, which is a predicate function checking each node for some criteria. The
edit-nodes
function also takes an
editor
function which takes in the node and returns a node modified in some way. With the XML zipper, each node is a Clojure
record
, technically, but those can be treated just like a map, which means they have keys and vals.
In my case, I'll be editing nodes that have keys:
:tag
,
:attrs
, and
:content
, where the tag value is a keyword, attrs is a map of attrs (can be empty), and content is a list that can contain strings and other nodes, or be empty.
(defn edit-nodes "Edit nodes from `zipper` that return `true` from the `matcher` predicate fn with the `editor` fn. Returns the root of the provided zipper, *not* a zipper. The `matcher` fn expects a zipper location, `loc`, and returns `true` (or some value) or `false` (or nil). The `editor` fn expects a `node` and returns a potentially modified `node`." [zipper matcher editor] (loop [loc zipper] (if (zip/end? loc) (zip/root loc) (if-let [matcher-result (matcher loc)] (let [new-loc (zip/edit loc editor)] (if (not (= (zip/node new-loc) (zip/node loc))) (recur (zip/next new-loc)) (recur (zip/next loc)))) (recur (zip/next loc)))))) (defn remove-nodes "Remove nodes from `zipper` that return `true` from the `matcher` predicate fn. Returns the root of the provided zipper, *not* a zipper. The `matcher` fn expects a zipper location, `loc`, and returns `true` (or some value) or `false` (or nil)." [zipper matcher] (loop [loc zipper] (if (zip/end? loc) (zip/root loc) (if-let [matcher-result (matcher loc)] (let [new-loc (zip/remove loc)] (recur (zip/next new-loc))) (recur (zip/next loc))))))
The
get-nodes
function doesn't return the tree like the above functions. Instead it returns a sequence of nodes that are
true
given the matcher function. This is useful for situations where you want to keep certain nodes but don't need to save the entire tree. I'll use this function to grab each post from the feed.
(defn get-nodes "Returns a list of nodes from `zipper` that return `true` from the `matcher` predicate fn. The `matcher` fn expects a zipper location, `loc`, and returns `true` (or some value) or `false` (or nil)." [zipper matcher] (loop [loc zipper acc []] (if (zip/end? loc) acc (if (matcher loc) (recur (zip/next loc) (conj acc (zip/node loc))) (recur (zip/next loc) acc)))))
In this code, I am only ever matching based on the key found in a node's
:tag
entry, so I can make a little helper function to construct key matchers.
NOTE:
The
clojure.data.xml
v0.2.0alpha-6
is needed for cljs compatibility. Unfortunately, behaviour is not exactly the same between the two implementations. In particular, the
{:namespace-aware false}
option has
no effect
in clojurescript. Therefore, the match-tag implementation presented here is different than in my script. Here, I use
xml/qname
to strip the namespace away from the tag so that the equality actually works as expected.
(defn match-tag "Returns a `matcher` fn that matches any node containing the specified `key` as its `:tag` value." [key] (fn [loc] (let [node (zip/node loc) {:keys [tag]} node] (= (xml/qname tag) key))))
Grabbing Posts, aka Entry Nodes
When I was first building this project, I pulled the XML string into the REPL, parsed it, and started poking around the tree to see what nodes were available to me. I discovered that each post corresponded to a node that had an:entry
tag. So, To grab each post, all I need to do is get every node that matches with an
:entry
tag.
(defn feed-str->entries "Returns a sequence of parsed article entry nodes from an XML feed string." [s] (-> s xml/parse-str zip/xml-zip (get-nodes (match-tag :entry))))
Have a look at what an entry node is below by uncommenting the last line in the src block.
(def entries (feed-str->entries @feed)) ;;(into {} (nth entries 6))
Now I've handled grabbing the XML string, parsing it, and pulling out a list of each post. Let's move on to manipulating this data into something more suitable to my purposes.
Entry Node Transforms
Now that I've got a list containing the data for each post, I can transform them into a structure that's more suitable for my needs.Normalize
Each entry can be flattened down, so I have a normalize function to help with that.Content within any node is a sequence that can contain either strings or other nodes, or be empty. At this stage, all strings within the entry's content are empty or newline characters. The newline characters arguably contain data regarding the document's structure, but I have some logic later on for grouping strings and
tags that cover this issue, so I can comfortably filter out these strings.
There are two special elements:
links
and the
author
. Links have empty
:content
tags but need the
:href
from the attributes instead, so a cond is built to handle this. The author map is built separately, using the same map function as with the rest of the content. Then, the content and author maps are merged to form the flat, normalized map, which can be processed further.
The
:content
of each normalized node ends up being an HTML string, which is great as we can use the existing XML node editing machinery to further parse and transform each entry into a Hiccup data strucure.
(defn strip-namespaces [node] (update node :tag xml/qname)) (defn normalize-entry "Normalizes the entry node by flattening content into a map." [entry] (let [content (map strip-namespaces (filter map? (:content entry))) f (fn [{:keys [tag content] :as node}] (let [val (cond (= tag :link) (get-in node [:attrs :href]) :else (first content))] {(xml/qname tag) val})) author-map (->> content (filter #(= (:tag %) :author)) first :content (filter map?) (map f) (apply merge))] (apply merge (conj (map f (remove #(= (:tag %) :author) content)) author-map))))
Let's check that the normalization works properly. Run it on one of the entries and see if its structure is flattened.
(def normalized-entry-example (normalize-entry (nth entries 6))) (def normalized-entry-keys #{:name :email :id :published :updated :link :title :content}) (= normalized-entry-keys (into #{} (keys normalized-entry-example))) ;;normalized-entry-example
Clean Html
Since no external libraries are used, I am manipulating HTML strings slightly to keep the XML parser from complaining about html tags that don't have terminating tags, like
and
.
I also unwrap image tags from figures, which is how
world.hey.com
wraps images in entries. I think it does that to display nicely inside email clients, but I'm not certain. Whatever the reason, I want to modify each
:figure
node by removing them and replacing each with their child
:img
node.
This string cleaning method is as bit of a hack, but works fine and is meant to allow
clojure.data.xml
to continue being used for further parsing/transforming steps later on in the script.
The
clean-html
function is run on every entry's content string after normalization.
(defn unwrap-img-from-figure "Returns the simplified `:img` node from its parent node." [node] (let [img-node (-> node zip/xml-zip (get-nodes (match-tag :img)) first) new-attrs (-> img-node :attrs (dissoc :srcset :decoding :loading))] (-> img-node (assoc :attrs new-attrs) (assoc :content nil)))) (defn clean-html "Cleans up the html string `s`. The string is well-formed html, but is coerced into XML conforming form by closing `br` and `img` tags. The emitted XML string has the <\\?xml...> tag stripped. This cleaning is done so that clojure.data.xml can continue to be used for parsing in later stages." [s] (let [s (-> s (str/replace "<br>" "<br></br>") (str/replace #"<img[\w\W]+?>" #(str %1 "</img>")))] (-> s (xml/parse-str {:namespace-aware false}) zip/xml-zip (edit-nodes (match-tag :figure) unwrap-img-from-figure) xml/emit-str (str/replace #"<\?xml[\w\W]+?>" ""))))
Now let's try clean one of the entry's html strings.
(def cleaned-html-example (-> (nth entries 4) normalize-entry :content clean-html)) ;;cleaned-html-example
NOTE: There's a design flaw in my approach here. It's minor as this is a simple script that can run over the posts quickly and easily, and extreme optimization is unnecessary for this simple script, but it's important to acknowledge limitations and flaws in design.
The
clean-html
function takes an html string, parses it, transforms it, and emits a string again. My immediate next steps will once again take the string and parse it, causing extra emit/parse steps. This was here because I originally took the html string emitted from
clean-html
as the final html output. I later decided to create a cleaner Hiccup structure and use that as the export basis. I've left this design flaw in here as an example of the reality of software development: sometimes the flaws don't show up as errors at all. The flaws are sometimes in the design, not the implementation.
Final Transformation of Nodes
My script's design calls for a .edn export, which is a map containing all of the data normalized and in a useful format. The .edn file output will have a Hiccup data structure as its:post
key value.
I need to build a set of functions that transform normalized entry nodes (defrecords, which can be treated just as Clojure maps) into Hiccup-style vectors (eg.
[:p {:display "inline-block"} "This is the content of a p tag.]
).
I'll use a multimethod to help produce the correct output based on the html tag of the nodes. Every multimethod needs a function that specifies the dispatch behaviour. To understand multimethods, I suggest reading this article to start.
Dispatch
I like to build in a simple check in the dispatch function for lists of nodes. This way, I can handle recursive use ofnode->hiccup
by building the
:list
method appropriately.
(defmulti node->hiccup (fn [node] (cond (map? node) (:tag node) (and (seqable? node) (not (string? node))) :list :else :string)))
Default and Simple Cases
I don't need much special behaviour, so the:default
dispatch acts as a 'catch-all' method and does most of the work. A simple string case and
:div
case are also given.
(defmethod node->hiccup :string [node] (when-not (= (str/trim node) "") node)) (defmethod node->hiccup :br [_] [:br]) (defmethod node->hiccup :div [node] (node->hiccup (:content node))) (defmethod node->hiccup :default [{:keys [tag attrs content]}] [tag attrs (node->hiccup content)]) (defmethod node->hiccup :img [{:keys [tag attrs]}] [tag (assoc attrs :style {:max-width 500})])
List Case
This case has a bit of machinery to it. Every time the list method is used, it means that a sequence of nodes have to be handled. To clean up the structure, I am building a flattening function that runs on each list. This flatten function will flatten everything down completely, except for hiccup vectors.I cannot simply
mapcat
everything because it would destry the hiccup-style structure, as vectors can be flattened down to their elements.
The result of
selective-flatten
is a flat list of strings and/or hiccup elements.
I also take this opportunity to de-duplicate the list. This has the effect of removing extra newlines and linebreaks.
NOTE:
One weakness that I recognize yet am willing to accept is that
de-dupe
may eliminate duplicates that a writer intended for stylistic reasons. Sometimes repetition is a nice way to emphasize a point, and if the repetition is the same paragraph several times, the duplicates will be removed.
(defn de-dupe "Remove only consecutive duplicate entries from the `list`." [list] (->> list (partition-by identity) (map first))) (defn selective-flatten ([l] (selective-flatten [] l)) ([acc l] (if (seq l) (let [item (first l) xacc (if (or (string? item) (and (vector? item) (keyword? (first item)))) (conj acc item) (into [] (concat acc (selective-flatten item))))] (recur xacc (rest l))) (apply list acc)))) (defmethod node->hiccup :list [node] (->> node (map node->hiccup) (remove nil?) de-dupe selective-flatten))
re-grouping
The flattened list of hiccup elements can then be processed and re-grouped on the basis of inline elements and string-br pairs. The html from hey.com blog posts has a lot oftags and plain strings. I think that comes from the fact that it's html formatted to be viewed by email readers.
For re-hosting to my own site, I want to use proper html structure, which means I need to group plain strings and
tags into
tags. I also need to make sure
ul
,
ol
,
li
,
em
, and
strong
tags are handled appropriately. This all means I have some grouping to do.
I also de-dupe the list which can be helpful in eliminating extra newlines. There is a slight risk of this eliminating a deliberately duplicated sentence, but I'll just accept that as a potential weakness to this solution. I don't think I'll use that writing style at all anyway.
(defn inline-elem? [item] (when (#{:em :strong :a} (first item)) true)) (defn inline? [item] (or (string? item) (inline-elem? item))) (defn group-inline "Groups the `list` of strings and Hiccup elements using the `inline?` predicate and wraps them intags. Once all groups are wrapped, the list is flattened again and any remaining
tags are removed." [list] (let [groups (partition-by inline? list) f (fn [l] (if (not= (first (first l)) :br) (into [:p] l) l))] (->> groups (map f) selective-flatten (remove #(= :br (first %))))))
Turn Entry Nodes into edn
Finally I can put all of the node transforms and list manipulations together to build a function that turns the html string into a hiccup structure. Then I can also create theentry->edn
transform needed to produce the correct output for my script.
(defn html-str->hiccup "Parses and converts an html string `s` into a Hiccup data structure." [s] (-> s (xml/parse-str {:namespace-aware false}) node->hiccup group-inline de-dupe)) (defn entry->edn "Converts a parsed XML entry node into a Hiccup data structure." [entry] (let [entry (normalize-entry entry)] {:id (:id entry) :file-contents (assoc entry :post (->> entry :content clean-html html-str->hiccup))}))
Exporting html
Since I have the parsing machinery, it's trivial to build an html page export function now. I simply have to make a document structure with Hiccup and place the content from the entry inside.
NOTE:
This is a modified function to render in the browser. See
the script's html section for the original implementation using
hiccup.core/html
to compile the Hiccup structure into an html document string.
(defn readable-date "Format the date string `s` into a nicer form for display." [s] (as-> s s (str/split s #"[a-zA-Z]") (str/join " " s))) (defn entry->html "Converts a parsed XML entry node into an html document." [entry] (let [entry (normalize-entry entry) info-span (fn [label s] [:span {:style {:display "block" :margin-bottom "2px"}} [:strong label] s]) post (->> entry :content clean-html html-str->hiccup)] [:div [:div {:class "post-info"} (info-span "Author: " (:name entry)) (info-span "Email: " (:email entry)) (info-span "Published: " (readable-date (:published entry))) (info-span "Updated: " (readable-date (:updated entry)))] [:a {:href (:link entry)} [:h1 (:title entry)]] post]))
Try it out and see the result! Make an edit to the src below to force a re-evaluation of the code.
@feed (def entries (feed-str->entries @feed)) (-> (nth entries 4) entry->html)