Monday, 24 October 2011

File parsing with Clojure

A while back I felt like writing a simple state-based file parser. After some pondering, I decided to write it in Clojure. The Clojure way of doing this was something new to me and turned out to be a lot more fun to code than yet another quick Python hack.

The input data in this case is a GnuCash accounts file, which looks like this:

<gnc:transaction version="2.0.0">
  ...
  <trn:splits>
    <trn:split>
      ...
      <split:value>4270/100</split:value>
      <split:account type="guid">b5e2367d505a1fdabd9bf53b1b307a6a</split:account>
    </trn:split>
    <trn:split>
      <split:value>-4270/100</split:value>
      <split:account type="guid">9ded5f95a5c0aca5b019beec91282bcf</split:account>
    </trn:split>
  </trn:splits>
</gnc:transaction>

Note that even though the file is XML, in this case there's no nested structure or other potentially difficult stuff - GnuCash stores the transactions flatly and sequentially. Therefore we don't really need to use a proper XML parser for this one.

The idea then is to scan through the file and, for a given account, pair up the positive and negative transactions for that account. For instance, given the transaction above, we'd expect to eventually find another transaction with the value -4270/100 for the account b5etc (and a transaction with the value 4270/100 for the account 9detc). If the counteracting transaction isn't found, the account is out of balance. It's easy to make a parser that can tell us which transactions are the guilty ones, instead of just summing everything up like GnuCash does: put the values in a map and you'll see exactly which ones cancel each other out.

Here's a simple but sufficient parser in pseudocode:

TARGET = 'b5e2367d505a1fdabd9bf53b1b307a6a'

in_trans = False
curr_val = -123

for line in open("input"):
    if line.startswith('<gnc:transaction'):
        in_trans = True
    elif line.startswith('</gnc:transaction'):
        in_trans = False
    elif in_trans:
        if line.startswith('<split:value>'):
            curr_val = line.replace('<split:value>', '').replace('</split:value>', '')
        elif line.startswith('<split:accounttype="guid">'):
            if line.find(TARGET) >= 0:
                process_value(curr_val)

def process_value(val):
    # add it to a hash map, etc

Works, good enough.

The Clojure version is rather different. The following is based on the elevator example from The Joy of Clojure. I added simple line parsing and some state variables.

(def target-guid-pattern (re-pattern ".*b5e2367d505a1fdabd9bf53b1b307a6a.*"))

(defn parse [lines]
  (letfn
      [(not-in-trans [[line & rest] result]
         #(cond
           (empty? rest) result
           (re-matches #"^[ \t]*<gnc.transaction.*" line)
             (in-trans rest result)
           :else (not-in-trans rest result)))
       (in-trans [[line & rest] result]
         #(cond
           (re-matches #"^[ \t]*</gnc:transaction.*" line)
             (not-in-trans rest result)
           (empty? rest) result ;; actually an error (premature end of input)
           :else (cond
                  (re-matches #"^[ \t]*<split:value>.*" line)
                    (let [val (-> line
                                  (.replaceAll "<split:value>" "")
                                  (.replaceAll "</split:value>" "")
                                  (.replaceAll " " "")
                                  (read-string))] ;; note: probably unsafe
                      (in-trans rest (assoc result :last-value val)))
                  (and (re-matches #"^[ \t]*<split:account type.*" line)
                       (re-matches target-guid-pattern line))
                    (let [val (get result :last-value)
                          vals (get result :values)
                          new-vals (update-values vals val)]
                      (in-trans rest (assoc result :values new-vals)))
                  :else (in-trans rest result))))]
    (trampoline not-in-trans lines {:values {}, :last-value -1})))

(Note that in real code you don't want to use read-string for parsing random input. Since this is a quick hack, I'm using it here to easily turn the values into Clojure fractions, which are nice and exact.)

The idea is that the different states are represented by mutually recursive functions. Since the functions return ready-made functions that represent the next state, Clojure's trampoline can ensure that the recursive function calls happen in constant space. The resulting code turns out to be nicer than the simple if-else hack; the function representation of states is more natural. The performance, though obviously linear in this case, might be somewhat worse than the if-else parser's, but I haven't tested this.

A benefit of the functional representation is that it can be relatively easily extended into a more proper parser in case you do have arbitrary recursive structures to deal with. Of course using an existing parser where applicable will probably be a better solution, though a lot more boring.

To finish up, here's a simple update-values function that remembers the values not yet balanced:

(defn update-values [val-map val]
  (let [val-sign (if (neg? val) -1 1)
        val-abs (* val val-sign)
        curr-count (get val-map val-abs 0)
        new-count (+ curr-count val-sign)
        new-map (if (= 0 new-count)
                  (dissoc val-map val-abs)
                  (assoc val-map val-abs new-count))]
    new-map))

And to parse an input file, you can use slurp to read it into memory:

(def input (map #(.replaceAll % "\n" "")
                (re-seq #".*\n" (slurp "input"))))
(parse input)

Or use a reader such as the one in duck-streams to read line by line:

(with-open [rdr (clojure.contrib.duck-streams/reader "input")]
  (parse (line-seq rdr)))

That's all for now - thanks for reading.