Saturday 24 December 2011

AI Challenge: Ants - afterthoughts

Update: here are some links to interesting write-ups etc:

Memetix, 10th place
FlagCapper, 8th place
Xathis, contest winner
MBCook, another Clojure bot, rather different from mine
Official strategy forum for all the rest

---

So the AI challenge is over. Sadly I didn't really have enough time to properly participate; I started in earnest with less than a week remaining, so I guess mine was a last-minute entry. It turned out suprisingly well except for one large detail: completely unoptimised, the code runs fast enough on my machine with the official settings, but times out on the official game servers almost every game. This was due to a bug in the timeout code that I didn't notice until after the deadline. Oh well.

The code is immature enough that it's not worth publishing, but the general ideas are pretty interesting, so I'll say a few words about those.

The best writeup I've seen so far is the one by Memetix, the UK champion. My ideas, most of which were scavenged from the excellent forums, were similar to some extent, except of course for the whole UK champion bit.

As I see it, the two major parts of a successful ant bot are the general path finding and target assignment algorithm, and the combat engine. These are basically the macro and the micro.

Macro: path-finding routines

Starting with path finding, the one that appealed to me the most was the rubber surface attraction thing, some version or variation of which seemed to be used by everyone and their cat on this competition.

The attraction algorithm is basically just breadth-first search, with some optional tweaks. One important idea is to begin the search at each source, such as food or an unexplored tile, instead of from each of the player's ants; this way the algorithm scales fairly well when the number of visible things grows, because the time requirement doesn't depend on how many ants you have. (It might still depend on the number of food, the number of enemy ants etc, depending on exactly how you tweak it, but initialising the search from the sources is a good start.)

Collaborative diffusion, which I guess can be described as a further variation of attraction, has also been used a lot in this competition. The scents used in diffusion are explicitly averaged for each square based on the neighbouring scents, instead of just taking the maximum attractiveness like in basic attraction. Of course, you can use this idea with attraction too. An important consequence of this averaging is that an area with, say, a lot of food gives off a stronger combined scent than an area with just one food. I'm not certain however how much of a benefit this gives (see below).

In the collaborative part of diffusion, friendly ants block some or most of the scent from proceeding further, depending on the type of scent. E.g. a food scent would be mostly blocked by an ant, since one or two ants are usually enough to harvest all the food in an area; but an enemy hill's scent would go through ants more or less intact, since we want as many ants as possible to help in razing enemy hills. Such collaboration seems like a pretty useful idea. Notably though some very highly ranked bots such as Memetix's didn't use it at all. I didn't have time myself to implement anything like this, so I can't really say.

Another thing with diffusion, that you can also do with attraction, is to a non-linear distance function. E.g. exponential falloff is often used with diffusion, and Memetix used weight over distance squared (as in gravity). Sadly, I didn't have time to properly experiment with this either. It seems to me that neither the collaboration part nor the non-linear distance function is truly necessary for a very well performing algorithm, but this is mere speculation - hopefully some of the other top contenders will do a writeup soon so we can learn how they dealt with this.

One final note about diffusion: for some reason, some people implemented it by doing multiple sequential passes over the entire map and hoping that they can do enough passes to propagate the scents to every square - which takes one pass for each square traveled. Instead, the standard BFS method is of course to use a queue for the squares yet to be processed and a hash set or similar for the squares already visited. In this case one can easily replace the hash set with a double array of booleans since the maximum map size is only 200 by 200. The standard queue-and-visiteds version should perform much better than the multiple-pass version, since it skips over a zillion uninteresting squares, so it's not entirely clear to me why anyone would opt for the latter.

Tweaking it

Once you have your attraction or diffusion or whatever, the endless and endlessly fun tweaking can begin. As described above, there are many of variations of the basic algorithm. But there are also lots of parameters to tweak: which weight should I give to a food compared to an unseen tile? What about tiles that I've seen at some point, but not recently (and what is "recently")? Should I assign some of my ants to be food gatherers and some to be soldiers, or should I just rely on the different weights? Should I change the weights, or the assignments, as the game progresses, or as I gain or lose area or ants? Countless questions - and after all this, there's still combat to worry about!

It turns out that early exploration and food gathering is extremely important, which is why my bot was doing surprisingly well against my coworkers' bots on our private TCP server. I implemented a good food finding routine first, then a good exploration routine; then I was nearly out of time, so my combat code is hilariously primitive, and of course everything times out when ran on a machine slower than my own.

To assign food gatherers, I do a simple BFS (linear distance function, no blocking of "scent" by friendly ants; just the basic vanilla thing) separately from each visible food until I either find a friendly ant or I hit a distance cutoff. The BFS doesn't consider non-visible squares. I set the cutoff at twice the sight radius (which might be too small though); the idea is that each food is obviously within the sight radius of a friendly ant, but a food is only worth going after if there are not too many obstacles on the way. After finding the closest ant for each food, I then find the closest food for each ant. This way when there's a cluster of food only one ant will get assigned to it, which seems good. A few obvious improvements to this routine would be remembering where food was last seen (at least for a few turns after seeing it) and ignoring food when an enemy ant is closer to it than we are.

For the ants that weren't assigned as food gatherers, I do another BFS, starting at each interesting non-food square simultaneously. Such interesting squares are enemy hills, enemy ants, and the squares just off the edge of visible space. I assign a very large weight (that is, a very small "distance") to the enemy hills, a medium weight to enemy ants and a small weight proportional to last-seen time to the non-visible squares.

The exploration part here seems to work quite nicely. When there are no enemies around, my ants will explore in all directions, and will try to bring and keep the whole map under their control. At first every non-visible edge square has the same score, so the deciding factor is the distance: after the first ant goes, say, north, the new northern edge is now farther away from the hill than the other edges, which causes the ants to spread out naturally. As the game progresses, ants that don't have anything better to do than explore will tend to run around in their own small area, chasing the edge that they've least recently seen, which maximises the visibility and map control. And of course, when new food spawns, gathering it takes priority for the ant nearest to it.

For combat targeting this doesn't work quite as well (all the combat stuff was a last-minute addition). Even though the attraction of enemy hills is very large (and my code remembers the location of currently unseen hills, and forgets hills once they're razed), the BFS doesn't have an option of going through unseen squares, so my ants are only drawn towards enemy hills in those turns when the hill is seen by at least one of my ants. In addition, when searching a path to enemy hills the search cutoff is the same as for exploration, which is probably suboptimal. Well, still better than nothing.

It seems to me that on the exploration and macro side, a few relatively simple BFS routines, when properly tweaked, would be enough for even a top ten bot. I refer again to Memetix, whose code is remarkably simple yet performs very well. A shame that I didn't have more time to polish my own routines.

Micro: combat routines

I had no time to really get into the intricacies of combat, which are many and detailed. Apparently however there are simply too many possibilities to evaluate them all: five possible moves per ant for each of n ants leads to 5^n different moves on every turn. Even if you can cull most of these, you can't always cull enough.

An interesting approach then is a1k0n's random sampling, which attempts to iteratively figure out a plausible probability distribution corresponding to the best moves. Of simpler heuristic approaches, I refer again to Memetix. His experiences in general seem to show that non-complicated approaches, both to path-finding and to combat, can be sufficient to do very well on this challenge. Update: Xathis uses a traditional min-max algorithm with heavy heuristic pruning, highly polished, which obviously works very well.

My combat code is ran after I've figured out the general moves with the path finding routine. The combat routine then checks these moves and, if necessary, replaces them with hopefully safer moves. For each ant I add up the number of enemy and friendly ants within potential combat range (well, actually within the square root of (combat range squared + 12), which in some cases might be almost the same). If there are as many enemy ants as friendly ants, or more, I arbitrarily choose the "first" enemy, figure out where it is, and tell my ant to go in the opposite direction. This works surprisingly well: if there's just one enemy ant, my ant will run away from it; if there are several enemies and they're surrounding my ant, it doesn't really matter where my ant will move. (Well, it does, but figuring out the best move would have taken me too long.)

Other stuff

In addition to the micro management of individual combat moves, there's also base defence (see e.g. Memetix - apparently static defence is a bad idea, and dynamic defence can be good enough), zone control and blockading enemy ants (see contest winner Xathis for a good example), symmetry detection etc etc. I don't have enough time to go through all of these in detail, so I recommend checking out the strategy forum for lots of good discussion on this stuff.

In summary, if only I had had more time to polish and optimise my code and implement better routines, this would have been the most awesome programming project ever. Really looking forward to the next AI challenge - hopefully I'll do better then!

Thursday 24 November 2011

AI Challenge: Ants

A bunch of coworkers are participating in the AI Challenge programming competition. Against my better judgement I decided to have a go as well - might mean less sleep in the near future. Using Clojure of course. Let's see how this goes!

Speaking of AI, here's a good article on a Starcraft-playing AI by Berkeley Uni students. Many of the same techniques could be used for the ant colony bot as well.

Aiming to have something worth blogging about once the AI contest finishes.

Thursday 3 November 2011

A chess problem solver

Some time ago when I was looking for a job, basically every company I applied to had some kind of preliminary programming exercise that one needs to pass before getting to a live interview. There was quite a bit of variance in these exercises. The most interesting (and the most challenging) by far was a chess problem solver.

In this problem, you're given a chess board of size n by m and a bunch of pieces excluding pawns. It is assumed that all the pieces are colourless anarchists that ceaselessly threaten each other when given the chance. The task: print out all the ways the pieces can be placed on the board so that they don't threaten each other.

I decided to do this one in Clojure, which is currently my favourite language. So let's see how that went.

(Notice that the following code might contain severe performance-related suboptimalities. While the code isn't very optimised, it worked well enough to get me the interview. More on the performance later.)

Before we begin, let's declare the namespace and include some stuff we'll need later:

(ns chess.core
  (:require [clojure.string :as str]
            [clojure.set :as set]
            [clojure.contrib.combinatorics :as combi]))

The first thing to decide is the representation. For some reason, I didn't represent the board explicitly; instead I went with a hash map of [x y] -> piece for the piece locations and a separate vector [bx by] for the board size. A piece is just its name - a keyword: :Q for queen, :K for king, :B for bishop, :N for knight, :R for rook.

The first thing to code then was something that takes a board - that is, a board-size vector and a location-to-piece hash map - and prints it in a more human-readable format:

(defn get-board [[bx by] loc-pieces]
  (partition bx
             (for [j (range by)
                   i (range bx)] ;; x first
               (when-let [a (get loc-pieces [i j])]
                 (name a))))) ;; when :X => "X", otherwise nil

(defn print-board [board]
  (str/join "\n" (for [row board]
                   (str/join "" (for [cell row]
                                  (if cell cell "."))))))

(defn pb [bs locs]
  (-> (get-board bs locs) (print-board) (println)))

For instance:

chess.core> (pb [3 5] {[0 0] :R, [1 1] :N, [2 2] :R, [1 3] :N, [1 4] :B})
R..
.N.
..R
.N.
.B.
nil
chess.core>

Now the algorithm itself is just a variation of basic depth-first search. The idea is simple: for each piece, we maintain a set of the free squares on which that piece can be placed. When placing a new piece, this set is updated by removing those squares that are now impossible. If the set becomes empty for any piece, the situation is impossible, so we backtrack.

To make this more precise, here's the algorithm in pseudocode:

def solve(boardSize, locPieceMap, freeSquares, remPieces):
    if remPieces is empty:
        return locPieceMap
    moves = getMovesForBestPiece(boardSize, locPieceMap, freeSquares, remPieces)
    if moves is empty:
        return None
    solutions = []
    for piece, locMapNow, freesNow in moves:
        recResult = solve(boardSize, locMapNow, freesNow, remPieces-piece)
        if recResult:
            solutions.append(recResult)
    return solutions

Here getMovesForBestPiece is a function that figures out which piece we should play next, using some heuristic. The function returns a list of valid moves for the best piece. It also checks for impossible situations such as some piece not having any suitable squares; in this case the function returns empty.

What's a good heuristic? I tried three different ones:
  1. Choose the piece of which we have the fewest. E.g. if we have 1 queen and 2 knights left to place, choose the queen.
  2. Choose the piece whose free-squares count is the lowest. E.g. if we have 1 bishop which has 11 squares it can potentially be placed on, and 2 kings which each have 8 potential squares, choose the kings.
  3. Try each piece on each square it can go on, and choose the piece which results in the fewest options after that.
Not surprisingly, option 3 was hilariously slow and thus abandoned. The first two are much faster. In my test runs they seem to be within 2% of each other; I'm not sure why this is, but hey.

Before looking at the details of getMoves, let's see how to deal with duplicate avoidance. When you have multiple copies of the same piece, a naive depth-first search implementation will give you duplicates. Consider the situation where there are, say, two free squares A and B and two pieces K1 and K2. The pieces are both kings and thus identical, but the algorithm will consider the cases "K1 on A; K2 on B" and "K2 on A; K1 on B" as separate, unless instructed otherwise.

To avoid duplicates, I decided to place all identical pieces on the board at the same time. (Another option would have been to maintain a list of already used squares for the current piece type, which is essentially the same thing, except more complicated.) I simply create each combination of (free-squares choose number-of-pieces-to-place) and iterate through those. The combinations are generated with the combinations function from clojure.contrib.combinatorics; this is lazy and so probably efficient enough - it is necessary to iterate through every combination, but we don't want to pre-calculate too many.

So let's look at what getMoves needs to do, in pseudocode.

def getMovesForBestPiece(boardSize, locPieceMap, freeSquares, remPieces):
    piece = guessBestPiece(boardSize, locPieceMap, freeSquares, remPieces)
    frees = freeSquares[piece]
    count = remPieces[piece]
    moves = []
    for squares in combinations(frees, count):
        locPieceMapNow = locPieceMap+{squares:piece}
        newThreateneds = squaresThreatenedBy(locPieceMapNow)
        # cannot place pieces so that they threaten any other piece
        if intersection(newThreateneds, locPieceMapNow) is not empty:
            continue
        nonFreeNow = union(locPieceMapNow, newThreateneds)
        freeNow = updateFreeSquares(boardSize, locPieceMapNow, freeSquares, nonFreeNow, remPieces)
        # require valid squares for each piece
        if any in freeNow is empty:
            continue
        moves.append([piece, locPieceMapNow, freeNow])
    return moves

(Note that in case of heuristic 3 above, the code would be slightly different: instead of guessing the piece, we'd run this routine for every remaining piece and then choose.)

Fairly straightforward, I guess. The set intersection and union are found in clojure.set, so all we need now are squaresThreatenedBy and updateFreeSquares. For these I'll go straight to the implementation.

(defn remove-nonexistent-squares [squares [bx by]]
  (for [[x y] squares :when (and (>= x 0) (>= y 0) (< x bx) (< y by))]
    [x y]))

(defn squares-threatened-by [[location piece] board-size]
  (let [[x y] location
        [bx by] board-size]
    (condp = piece
      :K (let [neighbours (remove (partial = [0 0]) (for [i [-1 0 1] j [-1 0 1]] [i j]))
               king-squares (for [[nx ny] neighbours]
                              [(+ x nx) (+ y ny)])]
           (remove-nonexistent-squares king-squares board-size))
      :R (let [lines (into (for [i (range bx)] [i y]) (for [j (range by)] [x j]))
               rook-squares (remove (partial = location) lines)]
           rook-squares) ; construction doesn't generate nonexistents
      :N (let [xs [1 1 2 2 -1 -1 -2 -2]
               ys [2 -2 1 -1 2 -2 1 -1]
               coord-adds (map list xs ys)
               knight-squares (for [[xa ya] coord-adds]
                                [(+ x xa) (+ y ya)])]
           (remove-nonexistent-squares knight-squares board-size))
      :B (let [coord-adds (for [i (range 1 (max bx by))] [i i])
               bishop-squares (apply concat
                                     (for [[xa ya] coord-adds]
                                       (map #(list (%1 x xa) (%2 y ya)) [+ + - -] [+ - + -])))]
           (remove-nonexistent-squares bishop-squares board-size))
      :Q (let [rooks (squares-threatened-by [location :R] board-size)
               bishs (squares-threatened-by [location :B] board-size)]
           (into rooks bishs)) ;; rooks, bishs are distinct by construction
      ;; else:
      (str "Error: unknown piece " piece))))

That was fun. Let's try it out:

chess.core> (squares-threatened-by [[2 2] :Q] [3 4])
([0 0] [1 1] [1 3] [2 3] [2 1] [2 0] [0 2] [1 2])
chess.core> (squares-threatened-by [[0 0] :K] [3 3])
([0 1] [1 0] [1 1])
chess.core> (squares-threatened-by [[2 1] :R] [3 3])
([2 2] [2 0] [0 1] [1 1])
chess.core> (squares-threatened-by [[1 2] :N] [4 4])
([2 0] [3 3] [3 1] [0 0])
chess.core> (squares-threatened-by [[1 2] :B] [4 5])
([2 3] [2 1] [0 3] [0 1] [3 4] [3 0])
chess.core>

Seems to work, as you can verify.

For the free squares, the data structure is a map where each piece is associated with a set of the squares that are yet free for it. To illustrate, the initial setup is one where each piece is associated with all the squares:

(defn get-initial-free-squares [[bx by] piece-types]
  (let [all-squares (set (for [i (range bx), j (range by)] [i j]))]
    (into {} (for [piece piece-types] {piece all-squares}))))

chess.core> (get-initial-free-squares [2 2] [:Q :R])
{:Q #{[1 0] [0 0] [1 1] [0 1]}, :R #{[1 0] [0 0] [1 1] [0 1]}}
chess.core>

With this structure, the code for update-free-squares is then:

(defn update-free-squares [board-size loc-pieces free-squares non-frees piece-types]
  (into {}
        (for [[piece frees] free-squares
              :let [rem-frees (set/difference frees non-frees)]]
          {piece rem-frees})))

Now, armed with squares-threatened-by and update-free-squares, we can finally implement the real version of get-moves (here with heuristic #1):

(defn get-piece-square-combinations [piece squares piece-count]
  (let [pieces (repeat piece-count piece)
        square-combs (combi/combinations squares piece-count)]
    (for [comb square-combs]
      (zipmap comb pieces))))

(defn get-moves-for-piece [board-size loc-pieces free-squares piece-counts piece]
  (let [frees (get free-squares piece)
        piece-count (get piece-counts piece)
        piece-square-combos (get-piece-square-combinations piece frees piece-count)
        rem-piece-counts (dissoc piece-counts piece)
        rem-free-squares (dissoc free-squares piece)]
    (for [attempt piece-square-combos
          :let [squares (keys attempt)
                locs-now (into loc-pieces attempt)
                loc-set (set (keys locs-now))
                threatened-now (set (apply concat (for [loc-piece attempt]
                                                    (squares-threatened-by
                                                     loc-piece board-size))))]
          ;; cannot place pieces so they threaten each other/existing pieces
          :when (empty? (set/intersection loc-set threatened-now))
          :let [non-free-now (set/union loc-set threatened-now)
                free-now (update-free-squares board-size locs-now rem-free-squares
                                              non-free-now (keys rem-piece-counts))]
          ;; require valid squares for all remaining pieces
          :when (every? pos? (map count (vals free-now)))]
      [piece locs-now free-now])))

(defn get-moves [board-size loc-pieces free-squares piece-counts]
  (if (some empty? (vals free-squares))
    ()
    (let [piece (first (apply min-key second piece-counts))]
      (get-moves-for-piece board-size loc-pieces free-squares piece-counts piece))))

All that's left now is solve. But before looking at that, there's an optimisation which turns out to be beneficial.

In algorithms like this, there's a balance of how much to compute on each step. If you do too much work on each round, as in heuristic #3, the rounds take too long and the algorithm is slow; but if you do too little, you end up recursing into impossible paths too much, which is also slow. So to detect impossible paths earlier, it can be a good idea to lift some of the calculations for the next round into this round, as long as you don't overdo it and end up with too-heavy rounds.

In our case, it turns out we can optimise get-moves a bit to make it detect impossible cases earlier. For example, suppose we're trying to fit a rook, a queen and a knight on a 3x3 board. The situation might be that we have placed the rook on [2, 1], leaving us with 4 free squares for each remaining piece, and our heuristic randomly recommends that we place the knight next. We then run get-moves-for-piece for the knight. Here's the situation:

chess.core> (pb [3 3] {[2 1] :R})
...
..R
...
nil
chess.core> (def frees {:Q #{[1 0] [0 0] [1 2] [0 2]}, :N #{[1 0] [0 0] [1 2] [0 2]}})
#'chess.core/frees
chess.core> (get-moves-for-piece [3 3] {[2 1] :R} frees {:Q 1, :N 1} :N)
;;; a list of: [piece placed; board after placing; free squares after placing]
([:N {[2 1] :R, [1 0] :N} {:Q #{[0 0] [1 2]}}]
 [:N {[2 1] :R, [1 2] :N} {:Q #{[1 0] [0 2]}}])
chess.core>

Looking at the boards resulting from these moves, you can see that in fact the queen cannot be placed on either of its potentially free squares:

chess.core> (pb [3 3] {[2 1] :R, [1 0] :N})
.N.
..R
...
nil
chess.core> (pb [3 3] {[2 1] :R, [1 2] :N})
...
..R
.N.
nil
chess.core>

Now these free squares are just fine for, say, a bishop, but they're not compatible with a queen. On the next round, the algorithm will notice this incompatibility and backtrack. However, we can lift the computation that notices this into this round, and this happens to be beneficial in this case, resulting in a much faster algorithm.

Here's the uplifted version of the code. In update-free-squares, just the last line is changed to use the new pre-calculation function.

(defn remove-threatening-squares [board-size loc-pieces free-square-set piece]
  (set
   (let [locs (set (keys loc-pieces))]
     (for [square free-square-set
           :let [threateneds (set (squares-threatened-by [square piece] board-size))]
           :when (empty? (set/intersection locs threateneds))]
       square))))

(defn update-free-squares [board-size loc-pieces free-squares non-frees piece-types]
  (into {}
        (for [[piece frees] free-squares
              :let [rem-frees (set/difference frees non-frees)]]
          {piece (remove-threatening-squares board-size loc-pieces rem-frees piece)})))

The remove-threatening-squares function goes through each preliminarily free square for the given piece. When that piece placed on the square results in one of the existing pieces being threatened, that square is filtered out. Like so:

chess.core> (get-moves-for-piece [3 3] {[2 1] :R} frees {:Q 1, :N 1} :N)
()
chess.core>

That's better.

To wrap up, here's the solve function, plus related auxiliary stuff.

(defn handle-solution [board-size loc-pieces]
  (pb board-size loc-pieces)
  (println)
  [1]) ;; could also return the solution

(defn solve-rec [board-size loc-pieces free-squares piece-counts]
  (if (empty? piece-counts)
    (handle-solution board-size loc-pieces)
    (let [moves (get-moves board-size loc-pieces free-squares piece-counts)]
      (if (empty? moves)
        ()
        (apply concat
               (for [[piece locs-now frees-now] moves
                     :let [rem-counts (dissoc piece-counts piece)
                           rec-result (solve-rec board-size locs-now frees-now rem-counts)]
                     :when (not (empty? rec-result))]
                 rec-result))))))

(defn solve [board-size piece-counts]
  (let [initial-frees (get-initial-free-squares board-size (keys piece-counts))
        result (solve-rec board-size {} initial-frees piece-counts)
        solution-count (count result)]
    (println (str "\n" "total number of solutions: " solution-count "\n"))
    solution-count))

Phew. A test run:

chess.core> (solve [3 3] {:K 1, :R 2})
.R.
R..
..K

K..
..R
.R.

.R.
..R
K..

..K
R..
.R.


total number of solutions: 4

4
chess.core>

And that seems to be that!

Update: the code is now available on GitHub. The version there includes some extra code for debugging and all 3 heuristics. Also available are the unit tests, and a Leiningen project to run all that stuff a bit more easily.

Finally, a note on performance. The specification for this exercise gave a rather large example and asked for the exact number of solutions for it; in addition, the code had to be capable of enumerating these solutions within one hour. The code just presented in fact took longer than that (but not much longer). I haven't profiled my code or, indeed, given its performance much thought, since it turned out to be close enough. Proper optimisation of this is a task for another day.

If you want to test the performance yourself, here's an example of similar size; the goal is to get this to run in less than an hour:

;;; from a REPL:
(solve [7 8] {:K 3, :Q 1, :B 2, :R 2, :N 3})

# or from the command line, with the script included in the package:
./run '[7 8]' '{:K 3, :Q 1, :B 2, :R 2, :N 3}'

How fast would this algorithm run when properly optimised? Or when implemented in, say, Python or Java or C? Or even better, what would be a more efficient algorithm?

Monday 24 October 2011

File parsing with Clojure

A while back I felt like writing a simple state-based file parser. After some pondering, I decided to write it in Clojure. The Clojure way of doing this was something new to me and turned out to be a lot more fun to code than yet another quick Python hack.

The input data in this case is a GnuCash accounts file, which looks like this:

<gnc:transaction version="2.0.0">
  ...
  <trn:splits>
    <trn:split>
      ...
      <split:value>4270/100</split:value>
      <split:account type="guid">b5e2367d505a1fdabd9bf53b1b307a6a</split:account>
    </trn:split>
    <trn:split>
      <split:value>-4270/100</split:value>
      <split:account type="guid">9ded5f95a5c0aca5b019beec91282bcf</split:account>
    </trn:split>
  </trn:splits>
</gnc:transaction>

Note that even though the file is XML, in this case there's no nested structure or other potentially difficult stuff - GnuCash stores the transactions flatly and sequentially. Therefore we don't really need to use a proper XML parser for this one.

The idea then is to scan through the file and, for a given account, pair up the positive and negative transactions for that account. For instance, given the transaction above, we'd expect to eventually find another transaction with the value -4270/100 for the account b5etc (and a transaction with the value 4270/100 for the account 9detc). If the counteracting transaction isn't found, the account is out of balance. It's easy to make a parser that can tell us which transactions are the guilty ones, instead of just summing everything up like GnuCash does: put the values in a map and you'll see exactly which ones cancel each other out.

Here's a simple but sufficient parser in pseudocode:

TARGET = 'b5e2367d505a1fdabd9bf53b1b307a6a'

in_trans = False
curr_val = -123

for line in open("input"):
    if line.startswith('<gnc:transaction'):
        in_trans = True
    elif line.startswith('</gnc:transaction'):
        in_trans = False
    elif in_trans:
        if line.startswith('<split:value>'):
            curr_val = line.replace('<split:value>', '').replace('</split:value>', '')
        elif line.startswith('<split:accounttype="guid">'):
            if line.find(TARGET) >= 0:
                process_value(curr_val)

def process_value(val):
    # add it to a hash map, etc

Works, good enough.

The Clojure version is rather different. The following is based on the elevator example from The Joy of Clojure. I added simple line parsing and some state variables.

(def target-guid-pattern (re-pattern ".*b5e2367d505a1fdabd9bf53b1b307a6a.*"))

(defn parse [lines]
  (letfn
      [(not-in-trans [[line & rest] result]
         #(cond
           (empty? rest) result
           (re-matches #"^[ \t]*<gnc.transaction.*" line)
             (in-trans rest result)
           :else (not-in-trans rest result)))
       (in-trans [[line & rest] result]
         #(cond
           (re-matches #"^[ \t]*</gnc:transaction.*" line)
             (not-in-trans rest result)
           (empty? rest) result ;; actually an error (premature end of input)
           :else (cond
                  (re-matches #"^[ \t]*<split:value>.*" line)
                    (let [val (-> line
                                  (.replaceAll "<split:value>" "")
                                  (.replaceAll "</split:value>" "")
                                  (.replaceAll " " "")
                                  (read-string))] ;; note: probably unsafe
                      (in-trans rest (assoc result :last-value val)))
                  (and (re-matches #"^[ \t]*<split:account type.*" line)
                       (re-matches target-guid-pattern line))
                    (let [val (get result :last-value)
                          vals (get result :values)
                          new-vals (update-values vals val)]
                      (in-trans rest (assoc result :values new-vals)))
                  :else (in-trans rest result))))]
    (trampoline not-in-trans lines {:values {}, :last-value -1})))

(Note that in real code you don't want to use read-string for parsing random input. Since this is a quick hack, I'm using it here to easily turn the values into Clojure fractions, which are nice and exact.)

The idea is that the different states are represented by mutually recursive functions. Since the functions return ready-made functions that represent the next state, Clojure's trampoline can ensure that the recursive function calls happen in constant space. The resulting code turns out to be nicer than the simple if-else hack; the function representation of states is more natural. The performance, though obviously linear in this case, might be somewhat worse than the if-else parser's, but I haven't tested this.

A benefit of the functional representation is that it can be relatively easily extended into a more proper parser in case you do have arbitrary recursive structures to deal with. Of course using an existing parser where applicable will probably be a better solution, though a lot more boring.

To finish up, here's a simple update-values function that remembers the values not yet balanced:

(defn update-values [val-map val]
  (let [val-sign (if (neg? val) -1 1)
        val-abs (* val val-sign)
        curr-count (get val-map val-abs 0)
        new-count (+ curr-count val-sign)
        new-map (if (= 0 new-count)
                  (dissoc val-map val-abs)
                  (assoc val-map val-abs new-count))]
    new-map))

And to parse an input file, you can use slurp to read it into memory:

(def input (map #(.replaceAll % "\n" "")
                (re-seq #".*\n" (slurp "input"))))
(parse input)

Or use a reader such as the one in duck-streams to read line by line:

(with-open [rdr (clojure.contrib.duck-streams/reader "input")]
  (parse (line-seq rdr)))

That's all for now - thanks for reading.