Friday, 27 April 2012

Carmack: Functional Programming in C++

Here's John Carmack with a solid blog post on the benefits of a (more) functional style.

Functional programming seems to be trending up lately. Interesting stuff.

The further reading links on the bottom are also worth checking out.

Tuesday, 10 April 2012

On objects and state


"State is inherently complex, since it complects value and time". - Rich Hickey

At work one part of our multi-tier architecture, written in Java, is a system sitting between the data source (usually a database) and the front end. An important task of this service tier is to unify the data from different data sources into a single common format for display to the end user, and also to enable easy caching of data. The connection between the service tier and the front end is REST and the connection between the service tier and the data source is either JDBC or REST.

One main thing the service tier does then is receive data from a data source, validate it, and digest it into a common, easily serialisable format for the front end. For this purpose we have a bunch of different Model classes, one for each different type of chart we support (line chart, bar chart etc).

The life cycle of a Model object is straightforward: the Model is created; data is given to it for validation; and if the data was valid, eventually the toDto() method is called to produce the digested version of the data. (DTO stands for Data Transfer Object, which is just a pretentious way of saying "struct". That's design patterns for you, heh.)

Our system is pretty sensible and works well in practice. However, when I was recently tasked with creating a new Model for our system, I spent some time thinking whether this way of doing things is really the best possible. Specifically, I wanted to figure out how best to manage the life cycle of my Model object, that is, to manage its state over time.

The question I eventually came across was simple: why would I want my Model to have state?

In fact, since the whole point of an object is that it encapsulates state, why would I want to use an object at all?

This may seem like a silly question; and in this case the obvious answer "of course you need to use objects and state - it's Java, duh" happens to be correct. But it is worthwhile to think of just how essential object-orientedness and state is in general. So bear with me here.

In my currently favourite language, Clojure, it seems that the recommended way to do this kind of thing is to use a multimethod. A multimethod consist of (at least) two things: the method name and a dispatch function. When you call the method, the arguments you give are first passed to the dispatch function, whose return value is then used to determine which actual implementation function is called. This sounds a bit weird, but is rather cool in practice. Let's look at a hello-world example of a multimethod:

(defmulti validate :chart-type)
(defmethod validate :line-chart [{:keys [data] :as data-map}]
  (merge {:valid-data data} data-map))
(defmethod validate :default [d]
  nil)

Here the first line defines a multimethod called validate and defines its dispatch function to be the keyword :chart-type. In Clojure, using a keyword as a function just returns the corresponding value from the given map:

multim.core> (def data {:chart-type :line-chart, :data [[1 2] [3 4]]})
#'multim.core/data
multi.core> (:chart-type data)
:line-chart

The second and third lines of the multimethod bit define the function that is run whenever the dispatch function returns the value :line-chart. This function simply constructs a map with the key :valid-data associated with the given map's :data, merges that with the given map and returns the result. Next the default-case method is defined to just return nil for any dispatch function value not covered by other methods. (The default default is to throw an exception, which you might like better.)

Let's test this multimethod:

multim.core> (validate data)
{:data [[1 2] [3 4]], :chart-type :line-chart, :valid-data [[1 2] [3 4]]}
multim.core> (def data2 {:chart-type :bar-chart, :data [[5 6] [7 8]]})
#'multim.core/data2
multim.core> (validate data2)
nil

The multimethod recognises the line chart and calls the correct method for it. Other :chart-type values are not bound to any method, so the default method is called for those, as expected.

Here's another simple multimethod to munch valid data into a DTO:

(defn munch [data-matrix]
  (for [col (range (count (first data-matrix)))]
    (map #(nth % col) data-matrix)))

(defmulti make-dto :chart-type)
(defmethod make-dto :line-chart [{:keys [valid-data] :as data-map}]
  (merge {:dto (munch valid-data)} data-map))
(defmethod make-dto :default [d]
  nil)

So with this, we can turn valid data (but not unvalidated data) into a DTO:

multim.core> (make-dto data)
{:data [[1 2] [3 4]], :chart-type :line-chart, :dto ()}
multim.core> (make-dto (validate data))
{:valid-data [[1 2] [3 4]], :chart-type :line-chart, :data [[1 2] [3 4]], :dto ((1 3) (2 4))}

And data that cannot be validated cannot be DTOified either:

multim.core> (make-dto data2)
nil
multim.core> (make-dto (validate data2))
nil

Now this might all seem pretty trivial. But that's sort of the point. The system I just described is very simple, consisting of two things: functions and data. Even though the functions are called multimethods in Clojure terminology, they're not methods in the Java sense of the word - they stand alone, not contained in any object; and they also happen to be pure functions, free of side effects such as state.

The question I asked previously now becomes: Why would I want to add state to this system?

The management of state can get much more complicated than it would be in this example. Let's look at one of the old Model implementations in our system to illustrate this. We actually have three different toDto()-like methods: one to construct the actual DTO; one to convert that into comma-separated values (CSV); and one to convert it to another simple representation we use internally. For instance, here's the pseudocode for toCsv() in one of the old implementations:

public String toCsv() {
    if(this.data == null)
        throw new IllegalStateException("data is not set");
    if(!this.validateData())
        throw new IllegalStateException("data is invalid");
    if(this.dto == null)
        this.toDto();

    // actual computation, using this.dto
}

So in every method one has to check what the current state of the object is, and if the state is wrong, to either throw an exception or to try to fix the situation. It's easy to forget to check one of the many conditions in one of the many methods, which can lead to subtle bugs when (not if) someone at some point calls the methods "out of order".

The problem here is that it's impossible to enforce the correct calling order. The data is set into the object, then validated, and then the DTO computed from it, in completely separate operations: the programmer who calls these methods may do so in any order s/he wishes. We then have to manually check the situation in each method to figure out what the calling order was this time.

The solution seems simple: don't store any state in the object. Instead, just require that each relevant datum is passed to each method as a parameter. This way, it's physically impossible for the caller to call the methods out of order; in Java, you wouldn't get past the compiler.

One problem that may still remain is that one can pass invalid data to a method. One way to get around that is to make two classes, say Data and ValidData, to represent raw data and already-validated data, and to have methods only accept one or the other. Or, if using a language such as Clojure, just add a field into the data to indicate it's been validated like in the multimethod example above.

Given all that, if I were designing this data-to-DTO system from the ground up (and if I had to use Java), I would emulate the multimethod solution with something like this:

public interface Munchable {
    public ValidData validate(Data data);
    public DTO makeDto(ValidData validData);
    public String makeCsv(ValidData validData);
    public InternalThing makeInternalThing(ValidData validData);
}

And then simply leave out any class variables in the implementing classes. This way, the objects become just dummy things which can be constructed whenever and discarded, if you wish, after each method call; you no longer need to worry about managing their internal state, since they don't have any.

(One could also have the makeCsv() and makeInternalThing() methods take a DTO object as a parameter instead, or add DTO-taking versions of them; but this doesn't affect the point.)

This brings me to the following criticism of Java. Cleanly implementing something like the multimethod in Java is impossible, since Java doesn't have functions - everything has to be wrapped in an object, which then provides methods. There are static methods, but those cannot be overridden in a subclass. One could write a separate static method for each different type to be handled, but there's no good place to put all these static methods. This means that in Java, you pretty much have to always use objects, even if you're not really doing anything at all with them - in the Munchable example, all the objects do is hold functions.

Java, of course, was designed to very strongly encourage you to always use objects; it is quite literally the only paradigm the language offers. Because of this it is, often implicitly, assumed that when you use Java, you'll write all of your code in the objects-and-state manner, such as the old toCsv() example above. Since you have all these object-oriented constructs and it's so easy to use them to "manage" state, why not do that?

I must point out that the examples given here are fairly benign, and it might be difficult to see from them why this would be a big deal. In fact our system, even the old code, works quite well; it is very much possible to take all the possible states into account and to write proper unit tests that verify all the different state transitions. But the point is that when you can do the exact same thing in a much simpler way, why wouldn't you? Given two systems that do the same thing, isn't the simpler one preferable?

The criticism of using state when you don't have to becomes more relevant as the system and therefore the number of its possible states grows. (Obviously the number of state transitions grows exponentially with the number of states.) For instance, one issue I haven't mentioned at all, but which will be both more important and more difficult to solve in larger stateful systems, is concurrency. I won't go into that in more detail here; perhaps later.

I began to ponder all this more seriously recently, after witnessing Rich Hickey's presentation Simple Made Easy. This presentation is the best "how to code" lecture I've ever seen, and I encourage everyone to watch it. The issue of state vs. no state is but one of the things Hickey discusses in his speech. For a more Java-specific critique, see Steve Yegge's awesome rant.

Somebody once quipped that it's still possible to write good code in Java, but that code will be very non-idiomatic. Indeed. But I am glad that other idioms exist; even if in practice I end up using Java, it's good to expand one's mind by considering the alternatives.

Monday, 30 January 2012

Online courses

Last autumn Stanford university offered three programming courses online: machine learning, artificial intelligence and database basics. I took the machine learning one. The course recently ended, with I and over 12000 fellow students receiving a nice statement of accomplishment.

The lectures by Andrew Ng - which you can watch online for free, or download if you register (for free) - were awesome. Having previously taken a few courses on the subject (introduction to machine learning; basic statistics courses; independent component analysis) I knew most of the stuff already, but watching the lectures was still absolutely worth it as a rehearsal, for the excellent clarity of the presentation. The programming exercises were somewhat milder, being sometimes too brief, but still good.

I haven't yet checked out the other courses' lecture videos, but if they're at the same level as the machine learning stuff, this is going to be world-changing stuff. The Stanfordians aren't done with just three courses, and it's not just the Stanfordians. One of the Stanford lecturers is starting his own online education company. Other Stanforders are offering 12 new online courses starting soon - see e.g. the course on probabilistic graphical models, which has links at the bottom to the other ones. MIT has had their Open CourseWare thing for a while now, and they're just launching MITx, a competitor to the Stanford online stuff. And of course Khan Academy has also been around for a while, although it's a different level of education. Probably I'm forgetting some as well.

Based on all this, I predict that online education is going to be pretty big pretty soon. We live in interesting times.