Saturday, 1 July 2017

Spark union & column order issue

Edit: the demonstration code is also on GitHub.

I quite like Spark, though it has some peculiar gotchas, perhaps more than most other big data tools I've used. For example a long time ago I came across some code which had a list of ordinary-looking transformations on datasets, but each of them ended with a .map(identity). What on earth was the point of that?

Well, it turns out that the union() method of Spark Datasets matches columns by position, not by name. This is because a Dataset is backed by a DataFrame, which doesn't store case classes but rather columns in a specific order. When you read from a Dataset (collect it, map over it, and so on), Spark turns each Row into the appropriate case class using the column names, regardless of the column order in the underlying DataFrame. union(), however, simply lines the columns up by position.

An example to illustrate. Say we have a case class with some counter value:

case class Thing(id: String, count: Long, name: String)
val things1: Dataset[Thing] = sc.parallelize(Seq(
  Thing("thing1", 123, "some_thing"),
  Thing("thing2", 101, "another_thing"),
  Thing("thing2", 100, "another_thing")
)).toDS
val things2: Dataset[Thing] = sc.parallelize(Seq(
  Thing("foo", 5, "different_thing"),
  Thing("foo", 15, "different_thing"),
  Thing("bar", 6, "whatever_thing")
)).toDS

things1.union(things2).show // works as expected

So far, so good. But say we want to add up the counter values. Depending on how we do the aggregation, the columns might end up in a different order - even though the Dataset has the same type:

val agg1: Dataset[Thing] = things1.groupBy($"id", $"name").agg(sum("count").as("count")).as[Thing]

scala> agg1.show
+------+-------------+-----+
|    id|         name|count|
+------+-------------+-----+
|thing2|another_thing|  201|
|thing1|   some_thing|  123|
+------+-------------+-----+

Now trying to union the aggregated things with the original things will fail, even though both are of type Dataset[Thing]. The reason is the different column order in the DataFrames (the error message is not the clearest):

scala> agg1.union(things2).show
org.apache.spark.sql.AnalysisException: Cannot up cast `count` from string to bigint as it may truncate
The type path of the target object is:
- field (class: "scala.Long", name: "count")
- root class: "Thing"

The easiest workaround is to add a .map(identity) to the end of each such aggregation. After this everything works as expected:

scala> agg1.map(identity).union(things2).show
+------+-----+---------------+
|    id|count|           name|
+------+-----+---------------+
|thing2|  201|  another_thing|
|thing1|  123|     some_thing|
|   foo|    5|different_thing|
|   foo|   15|different_thing|
|   bar|    6| whatever_thing|
+------+-----+---------------+

Note that this is a known issue, SPARK-21109. A method to do a union by name will be added in the future as detailed in SPARK-21043.
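(As of Spark 2.3, SPARK-21043 has been implemented: Dataset has a unionByName method that matches columns by name instead of by position, so on a new enough Spark version the .map(identity) workaround is no longer needed:)

agg1.unionByName(things2).show // matches columns by name, works without .map(identity)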

Saturday, 13 August 2016

Normalised Discounted Cumulative Gain (NDCG) for Spark DataFrames, using a UserDefinedAggregateFunction

(Edit: The code is available on GitHub.)

Recently I was messing around with a small free-time project to improve search results with machine learning. As part of this I needed a way of evaluating the quality of a given set of search results. An important goal for any search engine is to display the most relevant search results first; and as always, one needs a metric to measure how well one is doing in attaining the goal.

Suitable evaluation metrics for search results would be, for example, Normalised Discounted Cumulative Gain and Mean Average Precision. For this project I decided to go with the former (NDCG), since it intuitively felt more suitable due to the sparsity of accurate relevance scores in the data set I was working with. I used Spark with DataFrames. Now of course Spark already has an NDCG implementation built in, but it doesn't work directly with DataFrames, and I felt like educating myself in DataFrame usage and extensibility.
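(For reference, the built-in implementation is RankingMetrics in MLlib; it works on RDDs of predicted/ground-truth arrays rather than on DataFrames, and treats relevance as binary. A rough sketch of what using it looks like, where predictionsAndLabels is a hypothetical RDD, not something built from the DataFrame below:)

import org.apache.spark.mllib.evaluation.RankingMetrics

// predictionsAndLabels: RDD[(Array[String], Array[String])], i.e. (results in ranked order, relevant results) per query
val metrics = new RankingMetrics(predictionsAndLabels)
metrics.ndcgAt(10) // NDCG at cutoff 10, with binary relevance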

What is NDCG then? Well, suppose you're given a bunch of search results data, something like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schema = new StructType(Array(
  StructField("searchId", LongType),
  StructField("timestamp", LongType),
  StructField("resultUrl", StringType),
  StructField("position", IntegerType),
  StructField("clicked", IntegerType),
  StructField("converted", IntegerType),
  StructField("relevanceScore", DoubleType)))
val data = sc.parallelize(Seq(
  Row(123L, 1471097840569L, "https://some.site/",        1, 1, 0, 1.28),
  Row(123L, 1471097840569L, "https://another.site/",     2, 0, 0, 2.3001),
  Row(123L, 1471097840569L, "https://yet.another.site/", 3, 0, 0, 0.792),
  Row(123L, 1471097840569L, "https://a.relevant.site/",  4, 1, 1, 1.51),
  Row(456L, 1471102902205L, "https://another.search/",   1, 0, 0, 0.07),
  Row(456L, 1471102902205L, "https://another.result/",   2, 0, 0, 0.04),
  Row(456L, 1471102902205L, "https://another.site/",     3, 1, 0, 0.02)
))
val df = sqlContext.createDataFrame(data, schema)

Now the non-normalised Discounted Cumulative Gain is easy to calculate directly:

df.groupBy($"searchId").agg(sum($"relevanceScore"/log(2.0, $"position"+1)).as("DCG")).show

// +--------+------------------+
// |searchId|               DCG|
// +--------+------------------+
// |     456|0.1052371901428583|
// |     123|3.7775231288805324|
// +--------+------------------+

The problem, of course, is that because the DCG is not normalised, it's hard to use for comparing different sets of search results. To solve this, we can normalise the DCG by calculating the ideal (i.e. best possible) score for each set of search results, then dividing the DCG by that. This gives the NDCG (normalised DCG).
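To make the normalisation concrete, here's a quick hand check in plain Scala for searchId 123, using the same relevance scores as in the DataFrame above:

// Relevance scores for searchId 123, in result order (positions 1..4)
val rels = Seq(1.28, 2.3001, 0.792, 1.51)
def log2(x: Double) = math.log(x) / math.log(2.0)
val dcg  = rels.zipWithIndex.map { case (r, i) => r / log2(i + 2) }.sum                // ~3.7775, as above
val idcg = rels.sortWith(_ > _).zipWithIndex.map { case (r, i) => r / log2(i + 2) }.sum // ideal ordering, ~4.2339
val ndcg = dcg / idcg                                                                   // ~0.8922

The DCG matches the value shown above, and the resulting NDCG of about 0.892 matches the output of the aggregate function further down.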

Unlike DCG, NDCG is difficult to calculate directly with SQL or the SQL-like operations supported by Spark DataFrames. Luckily, defining your own aggregate function for DataFrames is easy:

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

object NDCG extends UserDefinedAggregateFunction {
  def inputSchema = new StructType()
    .add("position", DoubleType)
    .add("relevance", DoubleType)
  def bufferSchema = new StructType()
    .add("positions", ArrayType(DoubleType, false))
    .add("relevances", ArrayType(DoubleType, false))
  def dataType = DoubleType
  def deterministic = true
  def initialize(buffer: MutableAggregationBuffer) = {
    buffer(0) = IndexedSeq[Double]()
    buffer(1) = IndexedSeq[Double]()
  }
  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if(!input.isNullAt(0) && !input.isNullAt(1)) {
      val (position, relevance) = (input.getDouble(0), input.getDouble(1))
      buffer(0) = buffer.getAs[IndexedSeq[Double]](0) :+ position
      buffer(1) = buffer.getAs[IndexedSeq[Double]](1) :+ relevance
    }
  }
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    if(!buffer2.isNullAt(0) && !buffer2.isNullAt(1)) {
      buffer1(0) = buffer1.getAs[IndexedSeq[Double]](0) ++
                   buffer2.getAs[IndexedSeq[Double]](0)
      buffer1(1) = buffer1.getAs[IndexedSeq[Double]](1) ++ 
                   buffer2.getAs[IndexedSeq[Double]](1)
    }
  }
  // Computes sum(relevance / log2(rank + 1)) over the given (position, relevance) pairs,
  // which are assumed to be sorted by position; fa._1 tracks the running 1-based rank.
  private def totalGain(scores: Seq[(Double, Double)]): Double = {
    val (_, gain) = scores.foldLeft((1, 0.0))(
      (fa, tuple) => tuple match { case (_, score) =>
        if(score <= 0.0) (fa._1+1, fa._2)
        else if(fa._1 == 1) (fa._1+1, fa._2+score)
        else (fa._1+1, fa._2+score/(Math.log(fa._1+1)/Math.log(2.0)))
      })
    gain
  }
  def evaluate(buffer: Row) = {
    val (positions, relevances) = (buffer.getAs[IndexedSeq[Double]](0), buffer.getAs[IndexedSeq[Double]](1))
    val scores = (positions, relevances).zipped.toList.sorted
    val ideal = scores.map(_._2).filter(_>0).sortWith(_>_).zipWithIndex.map { case (s,i0) => (i0+1.0,s) }
    val (thisScore, idealScore) = (totalGain(scores), totalGain(ideal))
//    println(s"scores $scores -> $thisScore\nideal $ideal -> $idealScore")
    if(idealScore == 0.0) 0.0 else thisScore / idealScore
  }
}

And using it is easy:

df.groupBy($"searchId").agg(NDCG($"position", $"relevanceScore").as("NDCG")).show

// +--------+------------------+
// |searchId|              NDCG|
// +--------+------------------+
// |     456|               1.0|
// |     123|0.8922089188046599|
// +--------+------------------+

The code may not be production-quality, but it works as expected. The idea here is simple: given a set of rows, each containing at least a "position" and a "relevance", the custom aggregate function simply saves these in arrays, and then after the last row is read, calculates both the ideal score and the actual score from the arrays and returns their quotient. A typical search engine will return tens or maybe hundreds of search results for each query, so the temporary arrays do not grow to an unmanageable size and performance is good.

And of course it is easy to use any other arbitrary formula for the relevance scores as well, provided you have the data:

df.groupBy($"searchId").agg(NDCG($"position", $"clicked"+$"converted".cast(DoubleType)*3.0).as("NDCG")).show

The code is also available on GitHub.

Wednesday, 16 March 2016

Spark & streaming: first impressions

I recently participated in a 24-hour company hackathon with two colleagues. We used Spark Streaming to do near real-time processing of production data, with plain old MySQL as a real-time "session" data store. We even managed to bolt on a machine learning algorithm using Spark MLlib, courtesy of yours truly.

Spark turned out to be amazingly easy to use, and it performed really well for our use case. We did ten-second microbatches, which gave us near real-time data processing, as well as near real-time metrics and statistics via statsd, graphite and d3.js.

For the processing, our division of labour was fairly standard, with three separate Spark workflows:

1) The main processor continuously reads production data from a suitable data source, in real time, and processes it in ten-second chunks. Data "sessions" are created and updated, using a single MySQL table for storage. As soon as we have a minimum amount of data available, we also calculate a prediction for the future "outcome" of each data bunch, based on a previously learned model (see below). Real-time stats and metrics are sent to statsd as data comes in. (A rough sketch of what this looks like in code follows after the list.)

2) The history processor runs periodically every X seconds. It connects to MySQL, selects all recently completed data bunches from the main MySQL table and processes them. A completed bunch is simply one whose data has not been updated for some specific amount of time, say 15 minutes.

For each completed bunch the final stats (sums, counts etc) are calculated and sent on to statsd and the bunch is then moved into the history table. This way we can keep the live table at a manageable size. In addition, for each finished bunch we also check how well our previous prediction held up, i.e. whether our prediction was the same as the actual final outcome.

3) The model builder runs periodically every Z minutes. It connects to MySQL, selects a random sample of recent historical data bunches from the history table, and trains a machine learning model on them. We used a random forest for our predictions. Basically, I wrote a bunch of code to turn our historical samples into vectors of doubles (taking categorical variables into account as well), configured the MLlib random forest learner with proper parameters, and that was that. After the model is trained it is simply saved to disk and then used at step 1) by the main processor. With very little work I was able to get a prediction accuracy that was many times better than random guessing and clearly worthwhile.
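To give an idea of the shape of this, here are rough sketches of the main processor and the model builder. This is not our actual hackathon code: the socket source, the table/statsd plumbing, the parameter values and the model path are all stand-ins.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("main-processor")
val ssc = new StreamingContext(conf, Seconds(10)) // ten-second microbatches

val events = ssc.socketTextStream("localhost", 9999) // stand-in for the real production source

events.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // open a MySQL connection per partition, create/update "session" rows,
    // compute predictions with the latest saved model, push metrics to statsd
  }
}

ssc.start()
ssc.awaitTermination()

And the model builder, in spirit:

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

// trainingData: RDD[LabeledPoint] built from a random sample of the history table (assumed to exist)
val model = RandomForest.trainClassifier(trainingData, numClasses = 2,
  categoricalFeaturesInfo = Map[Int, Int](), numTrees = 50,
  featureSubsetStrategy = "auto", impurity = "gini", maxDepth = 8, maxBins = 32)
model.save(sc, "/tmp/latest-model") // picked up by the main processor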

All in all, initial impressions of Spark were very positive. It made building a relatively non-trivial pipeline like the above super easy. For our processing we don't need to do any joins or other complicated things, which does make life a bit easier; nevertheless, with Spark you get scaling, redundancy and failover out of the box, which will help a lot with future-proofing. The ease of use of MLlib, and overall the amount of attention paid to making all the tools and libraries work and scale properly, is really nice. Spark Streaming works very nicely as well, and according to several smart people it seems to be a very good solution for streaming in general.

MySQL, in contrast, works really well on just one beefy machine (if you have enough RAM and a proper SSD), in our case handling up to 1000 read-write requests per second with a live table of around 300-600k items at any given time. But high availability and failover are a bit harder to do, as is scaling. Since this was just a random hackathon we "solved" those problems by simply ignoring them. The value of MySQL was that we could have multiple indices on a table and thus do lookups based both on the key and on the recency of the latest data, which is more annoying to do with something like Cassandra or indeed with Spark itself. So MySQL worked well for us in this limited use case, but future-proofing it would be noticeably harder than with Spark.

Need to do more things with Spark in the future.

Monday, 9 November 2015

Packer, Vagrant, CentOS, VirtualBox, Docker and so on

At work we're using Docker to easily package our applications into predictable, repeatable bunches, the recipes for which can also easily be pushed into Git, complete with diffs, code reviews, pull requests etc. This is pretty cool and Docker is pretty cool.

Looking deeper into how Docker works I noticed this post called Docker image insecurity. I don't know if the situation is still the same as described there, but it was (and maybe is) pretty bad. However, it seems that in a big company setting, one would not want to rely on public Docker images anyway; creating your own private Docker registry, containing only images properly vetted and verified by your company's security team along with each of their dockerfiles, should still be alright.

Out of interest I put on my techops hat (for the first time in a long time) and started looking at how one might arrive at such a secure Docker image, starting from scratch. As it happens I didn't quite get to the Docker part. Rather, I figured out how to use Packer to download and verify a Linux ISO image, which can then be automatically installed into a virtual machine and used as a base image. This answers a different but potentially related question: given such a known-good complete VM image, smaller Docker images could then be partitioned off on a "list of files" basis, for example using one of the many Docker image creation scripts, or just rolling your own. This would enable the entire chain of software to be specified in the config files stored in your internal repository, easily verified by your security team and easily improved and tweaked by developers and techops.

I think the main reason to use tools like Packer and Docker is that they enable easy automation of otherwise tedious and error-prone installation and creation of base systems, and that they also make this process easy enough to verify and secure, given proper support for checksums etc (which hopefully exists in Docker by now). This should make all our lives easier.

Tuesday, 8 September 2015

Recurrent neural networks

Just got back from one of the best meetups I've been to, featuring Andrej Karpathy going through his Recurrent Neural Networks tutorial in detail. Basically, recurrent neural networks are powerful enough that even with a very low-level model, such as a character-based one, an RNN can still learn words, spelling, upper- and lowercase letters, punctuation, line lengths etc stupendously well - and even LaTeX or C code.

Another good presentation was on Semantic Image Segmentation, a different application of recurrent neural networks using a more complicated model.

An interesting takeaway is that when specifying neural network models, one desirable property is differentiability of each transform, which enables the use of straightforward stochastic gradient descent for model fitting. And apparently, even though SGD is a fairly simple and thus somewhat crude technique, there exist equally simple improvements such as AdaGrad that make it much better. It seems that RNNs in general are capable of being very expressive while also being relatively straightforward to train using e.g. AdaGrad. Good stuff all around.
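The AdaGrad idea itself is tiny: keep a running sum of squared gradients per parameter and scale the learning rate down by its square root, so frequently-updated parameters take smaller steps. A toy sketch for a single scalar parameter (illustrative only, not from the talk):

// AdaGrad update for one scalar parameter w: the effective learning rate
// shrinks as squared gradients accumulate in `cache`.
var cache = 0.0
def adagradStep(w: Double, grad: Double, lr: Double = 0.01, eps: Double = 1e-8): Double = {
  cache += grad * grad
  w - lr * grad / (math.sqrt(cache) + eps)
}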

Edited to add: Karpathy's slides.
Edited to add: Romera's slides.

Saturday, 15 August 2015

Jaynes: Probability Theory & Gödel's incompleteness theorem

I've recently been dabbling in statistics and probability, and it was only a matter of time before my attention became drawn to the book Probability Theory: The Logic of Science by E. T. Jaynes. In it, Jaynes proposes to start from the smallest possible set of common-sense axioms and proceed to derive more or less the entire theory of probability, demonstrating how desirable properties such as consistency and paradox-free reasoning can thus be achieved for the whole system.

I decided to buy this book, and at 50 pages in (of about 700 pages total) I can already say it will be worth the full price. Here's a quote to show what I mean (chapter 2.6.2, pp. 45-46):

To understand [Gödel's incompleteness theorem], the essential point is the principle of elementary logic that a contradiction A and not-A implies all propositions, true and false. -- Then let A = {A1, A2,..., An} be the system of axioms underlying a mathematical theory and T any proposition, or theorem, deducible from them:

A => T.

Now, whatever T may assert, the fact that T can be deduced from the axioms cannot prove that there is no contradiction in them, since, if there were a contradiction, T could certainly be deduced from them!

This is the essence of the Gödel theorem, as it pertains to our problems. As noted by Fisher (1956), it shows us the intuitive reason why Gödel’s result is true. --

Recommended.

Monday, 13 July 2015

Silly Scala tricks, part 1

I've recently been working (very slowly, but still) on a hobby project that I'm writing with Scala. In the first steps there's some basic object-oriented modeling to be done, which has served as a good introduction to/reminder of how Scala works in that respect and how things are best arranged in it. It's also been interesting to compare and contrast this to Java.

The project is a simple game. There are two pets fighting each other. A pet has six base skills, in three slots of two skills each. Before the game, each player chooses a skill for each slot to use in that game. So you'll choose between skills 1A and 1B for the first slot, skills 2A and 2B for the second slot etc. Most skills will simply deal damage to the other pet; some have cooldowns, damage-over-time effects, healing effects and so on.

I decided to model the game structure as a tree, where a Game has two Pets who each have three Skills, with both the Pets and their Skills linking to each other for convenience. (Not sure if this is the best way to do it, but it'll be good enough.)

So let's see how to model this stuff in Scala. For brevity I'll list just the class definitions, with methods omitted. Start with skills, where the most common type of skill is one that damages the other pet:

abstract class Skill(val pet: Pet, val cooldown: Int = 0)
abstract class DamageOther(val family: Family, val baseDamage: Int, pet: Pet, cooldown: Int = 0) extends Skill(pet, cooldown)
case class Zap(p: Pet) extends DamageOther(Mechanical, 20, p)

Pretty simple and straightforward. Basically, constructor parameters can be turned into fields right there in the "headline" (a val parameter is roughly a final field in Java), and default parameter values are supported to reduce hassle.

Now let's see how the same would look in Java:

public abstract class Skill {

        public final Pet pet;
        public final int cooldown;

        public Skill(Pet pet) {
                this(pet, 0);
        }

        public Skill(Pet pet, int cooldown) {
                this.pet = pet;
                this.cooldown = cooldown;
        }
}
public abstract class DamageOther extends Skill {

        public final Family family;
        public final int baseDamage;

        public DamageOther(Family family, int baseDamage, Pet pet) {
                super(pet);
                this.family = family;
                this.baseDamage = baseDamage;
        }

        public DamageOther(Family family, int baseDamage, Pet pet, int cooldown) {
                super(pet, cooldown);
                this.family = family;
                this.baseDamage = baseDamage;
        }
}
public class Zap extends DamageOther {

        public Zap(Pet pet) {
                super(Family.MECHANICAL, 20, pet);
        }
}

Ugh, right? The code is not horrible as such, but it's pretty clunky. The worst offender to me is Zap; with Java, there's just no way to compactly define actual individual things like a Skill in a way that would make you want to list 20 of them in the same data file. This kind of easy "in-program data definition" is just inelegant in Java.

How about the pets themselves? Here we want to do two things: define individual pets which have certain base skills and attributes; and then for a game, pick one of these and select just the this-time skills for it. Let's see this in Java first:

public abstract class Pet {

        public final String name;
        public final Family family;
        public final int baseHealth;
        public final int baseAttack;
        public final int baseSpeed;
        public final List<Skill> baseSkills;
        public final SkillChoice sc1;
        public final SkillChoice sc2;
        public final SkillChoice sc3;
        public final List<Skill> skills;

        /**
         * @param baseSkills In the order S1A, S2A, S3A, S1B, S2B, S3B
         */
        public Pet(String name, Family family, int baseHealth, int baseAttack, int baseSpeed, List<Skill> baseSkills, SkillChoice sc1, SkillChoice sc2, SkillChoice sc3) {
                if(baseSkills == null || baseSkills.size() != 6) {
                        throw new IllegalArgumentException("baseSkills must be non-null and contain exactly 6 things");
                }
                this.name = name;
                this.family = family;
                this.baseHealth = baseHealth;
                this.baseAttack = baseAttack;
                this.baseSpeed = baseSpeed;
                this.baseSkills = Collections.unmodifiableList(baseSkills);
                this.sc1 = sc1;
                this.sc2 = sc2;
                this.sc3 = sc3;
                final List<Skill> s = new ArrayList<>();
                s.add(this.baseSkills.get(this.sc1 == SkillChoice.SC1 ? 0 : 3));
                s.add(this.baseSkills.get(this.sc2 == SkillChoice.SC1 ? 1 : 4));
                s.add(this.baseSkills.get(this.sc3 == SkillChoice.SC1 ? 2 : 5));
                this.skills = Collections.unmodifiableList(s);
        }

    @Override
    public int hashCode() {
        final int prime = 31;
        int result = 1;
        result = prime * result + baseAttack;
        result = prime * result + baseHealth;
        result = prime * result
                + ((baseSkills == null) ? 0 : baseSkills.hashCode());
        result = prime * result + baseSpeed;
        result = prime * result + ((family == null) ? 0 : family.hashCode());
        result = prime * result + ((name == null) ? 0 : name.hashCode());
        result = prime * result + ((sc1 == null) ? 0 : sc1.hashCode());
        result = prime * result + ((sc2 == null) ? 0 : sc2.hashCode());
        result = prime * result + ((sc3 == null) ? 0 : sc3.hashCode());
        result = prime * result + ((skills == null) ? 0 : skills.hashCode());
        return result;
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (obj == null)
            return false;
        if (getClass() != obj.getClass())
            return false;
        Pet other = (Pet) obj;
        if (baseAttack != other.baseAttack)
            return false;
        if (baseHealth != other.baseHealth)
            return false;
        if (baseSkills == null) {
            if (other.baseSkills != null)
                return false;
        } else if (!baseSkills.equals(other.baseSkills))
            return false;
        if (baseSpeed != other.baseSpeed)
            return false;
        if (family != other.family)
            return false;
        if (name == null) {
            if (other.name != null)
                return false;
        } else if (!name.equals(other.name))
            return false;
        if (sc1 != other.sc1)
            return false;
        if (sc2 != other.sc2)
            return false;
        if (sc3 != other.sc3)
            return false;
        if (skills == null) {
            if (other.skills != null)
                return false;
        } else if (!skills.equals(other.skills))
            return false;
        return true;
    }
}

Plenty of boilerplate, as always, but it's understandable enough.

Now in case you haven't noticed, I like things being immutable when they don't need to be mutable - for instance, the base skills and chosen skills for the pets just don't need to change over the course of the game. So for the actual pets, what I'd really like to do is something like the following:

public class LilXT extends Pet {

        public LilXT(SkillChoice c1, SkillChoice c2, SkillChoice c3) {
                super("Lil' XT", Family.MECHANICAL, 1546, 322, 228,
                        listOf(new Zap(this) // error: Cannot refer to 'this' nor 'super' while explicitly invoking a constructor 
                                // , other skills...
                                ),
                                c1, c2, c3);
        }

        private static List<Skill> listOf(final Skill... skills) {
            final List<Skill> l = new ArrayList<>();
            for(Skill s: skills) {
                l.add(s);
            }
            return l;
        }
}

But of course that cannot work, since we can't both refer to this and also be constructing it at the same time. So we're forced to do a two-part construction instead, where we first set everything else, then create the skills and link them up with this, then set this.skills. Meh. This, again, is not the end of the world - it works, but it is a bit clunky. (What happens if someone calls setSkills() a second time? You'll have to remember to check for that, which adds more boilerplate.)

Can we do better? Actually, with Scala, we kinda can. I'm not sure if the following is the best or most sane way of doing things, but I found it pretty cool.

In Scala you can override not just methods, but values. And I love that. So I figured I could declare the abstract Pet's base skills as an abstract val (which is effectively null until the subclass initialises it), and override it in the actual implementing subclasses. This way each actual pet is very clean to construct:

abstract class Pet(val name: String, val family: Family, val baseHealth: Int, val baseAttack: Int, val baseSpeed: Int, val sc1: SkillChoice, val sc2: SkillChoice, val sc3: SkillChoice) {
  val baseSkills: List[Skill]
  lazy val skills: List[Skill] = {
    val s = baseSkills
    List(
      (s(0), s(3), sc1),
      (s(1), s(4), sc2),
      (s(2), s(5), sc3)
    ) map { case (a,b,c) => if(c == C1) a else b }
  }
}
case class LilXT(s1: SkillChoice, s2: SkillChoice, s3: SkillChoice) extends Pet("Lil' XT", Mechanical, 1546, 322, 228, s1, s2, s3) {
  override val baseSkills = List(Zap(this), Repair(this), XE321Boombot(this),
    Thrash(this), Heartbroken(this), TympanicTantrum(this))
}

So what's going on here? To clarify, let's follow what happens when a new LilXT is constructed:
  1. I've decided I want to use a Lil' XT as my pet. So, to construct a Lil' XT instance, I decide which of the two skills I want for each slot and pass those choices to the constructor, as in LilXT(C1, C2, C1).

  2. The constructor for LilXT calls the Pet superclass constructor, with the hardcoded arguments "Lil' XT" (name), Mechanical (family), and the appropriate stats; and with the three SkillChoices I just gave to LilXT. The Pet is constructed with those arguments.

  3. Now baseSkills is still null at this point (the subclass hasn't initialised its override yet), so trying to work out the chosen skills directly at construction time would cause a null pointer exception. This is the same chicken-and-egg dependency in the constructor as before.

    So here in Scala, what I did was make skills a lazy val; this means it's not resolved immediately, but only when first needed. So the fancy map computation isn't actually executed yet - it's just "remembered".

  4. The rest of the LilXT class body is run. This overrides the baseSkills value with the default base skills that a LilXT has, which are constructed at this point, each with a reference back to this.

  5. The skills value of the LilXT object is now ready to be accessed; the first time it is accessed, the defining code runs, skills is populated based on baseSkills and the skill choices, and everything works.

This is pretty alright. Everything is a val, all the lists are immutable, and stuff works. The definitions of the actual concrete things are concise and clear, and overall there's much less pointless busywork code than in Java.
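To round it off, using it looks roughly like this (assuming SkillChoice is a sealed trait with case objects C1 and C2, and that the other skill case classes are defined along the same lines as Zap):

val pet = LilXT(C1, C2, C1)
pet.baseSkills // all six base skills, each holding a reference back to pet
pet.skills     // lazily resolved on first access: List(Zap(pet), Heartbroken(pet), XE321Boombot(pet))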

I like Scala.