Nowadays it's not all that popular to be a testing enthusiast or fanatic.

After all, real developers ship and TDD is dead, so developers ought to spend less time getting caught up in testing-related hogwash and instead focus on immediately churning out whatever is at hand. Be 10x, move fast and break things, pew pew pew, and so on, and so on.

Free time from shipping our way to software unicorn land is often spent perusing blog posts from high-culture engineering forums. In doing so you may have stumbled upon this strange blog, and, much to your dismay, we'll be getting into plenty of 1x testing nonsense. Sarcasm aside, here we're going to explore and demonstrate things like mutation testing, code coverage and subsumption.

The hope is to convince you that code coverage is (in general) useful and, moreover, that mutation testing is both awesome and practical.

Buckle up, buckaroos.

Back to Basics, Yo

Remember, you might want to begin rolling your eyes when you hear the terms “complete testing”, “exhaustive testing” and “full coverage”. They are each poorly defined due to a theoretical limitation of software: the number of potential inputs for most programs is so large that they are effectively infinite.

Consider javac. As a software artifact its potential inputs include not only every Java program, but all strings; its only limitation is the size of file the parser can read. The number of inputs is therefore effectively infinite and cannot be explicitly enumerated, so javac can never have “full coverage”.

Enter formal coverage criteria. As it is not feasible to generate all test inputs, coverage criteria can be used to decide which inputs to use. The supposition here is that coverage criteria will reveal faults more effectively than n arbitrary inputs would. Albeit expensive, the most obvious way to use criteria is to directly generate test case values to satisfy a given criterion C.

However, in practice we generally create an arbitrary number of pseudo-random inputs manually, and then measure the tests against the criterion in terms of their coverage. On the surface this is understandable. Generating tests to directly satisfy a criterion is hard, really hard. This is particularly true when attempting to generate test paths for graph coverage criteria.

Not all is lost though. More flexible criteria do exist which, more importantly, can actually be used sustainably against real programs.

Kanye test Grammar?

The most common coverage criteria generate tests to satisfy their criterion based on either graphs, logical expressions, or partitions of input space. Graph and logical models are built from software artifacts (such as source code, design descriptions and specifications). Similarly, a model of the inputs can be built based on some description of an input space. Test criteria can then be applied to these models.

Another major method of deriving coverage criteria is through the syntactic descriptions of software artifacts. Under syntax-based testing, the syntax of the software artifact is used as the criterion model and tests are created from the syntax (i.e. covering or violating the syntax in some way). Put simply, the grammar describes what a valid test input is, and by extension what it is not. We say that an input is valid if it is in the language specified by the grammar, and invalid otherwise.

Thus, it is often useful to produce invalid strings from a grammar. It is also helpful in testing to use strings that are valid but that follow a different derivation from a pre-existing string. This general use of grammars in testing is known as mutation analysis, and it can be applied to most types of software artifacts.
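As a tiny illustration (the inputs here are made up) against the Java grammar that javac accepts:

int x = 42;    // valid: derivable from the grammar
int x = 42 +;  // invalid: no derivation allows a dangling operator
int x = (42);  // valid, but follows a different derivation from the first line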

Although this has all felt quite theoretical thus far, mutation analysis has paved the way for what is currently argued to be the gold standard in software testing: program-based mutation testing.

Mutant Ninja Testing

Although mutation testing sounds somewhat highbrow, it's conceptually quite simple. Faults, or mutations, are injected into one's testing target, and when the tests are run, total and per-case mutation adequacy scores are derived. Under most systems, if the tests fail then the mutation is killed, and if they pass then the mutation has survived. In turn, the total percentage of mutations killed acts as a testing criterion for test case quality.

So, unlike traditional test coverage criteria that only measure which code is executed by your tests, mutation testing actually measures the ability of your tests to reveal faults. This premise is intuitively appealing: we can discover input data that reveals real faults. Nevertheless, as highlighted earlier, it is not possible for any system to generate every possible permutation of mutant for a candidate program.

Consequently, mutation testing systems only offer a finite set of injectable mutators. So, hopeful that this finite set of mutators will reveal faults, mutation testing can be said to rely on two assumptions: the Competent Programmer Hypothesis (CPH) and the Coupling Effect.

The CPH asserts that a program produced by a competent programmer is close to the final correct version of that program. Thus, it is assumed that the source faults introduced by a competent programmer are relatively simple and can be corrected by small syntactical changes. Accordingly, mutation testing generally uses only simple syntactical mutations. The Coupling Effect, in turn, suggests that tests capable of catching first-order mutations will also detect higher-order mutations that contain those first-order mutations.
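As a quick made-up sketch of that first-order/higher-order distinction:

return a + b;       // original
return a - b;       // first-order mutant: one syntactic change
return (a - b) * 2; // higher-order mutant containing that first-order change

Per the Coupling Effect, a test that kills the first-order mutant will, in most cases, also kill the higher-order one.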

Strong and Weak Mutations

Implementations of mutation testing systems are generally characterised in terms of their adopted mutation generation techniques. Another distinction can be made with respect to how a given tool analyses whether mutants are killed during the execution process. Such techniques can be classified as employing strong or weak mutation. Under strong mutation, given a program p, a mutant m of p is said to be killed only if m gives a different output from the original program p.

In contrast, weak mutation requires only that the value produced at the point we introduce m differs from the corresponding value in p for m to be killed. This cheaper approach is adopted by tools such as PIT; instead of analysing each mutant after the execution of an entire run, mutants are checked immediately after their respective execution point.
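To make the distinction concrete, here is a small made-up sketch (the method and values are purely illustrative):

static int clamped(int a, int b) {
  int sum = a + b;         // a math mutant changes this to: a - b
  return Math.max(0, sum); // negative sums are clamped to zero
}

// With a = 1, b = 2: the original computes sum = 3, the mutant sum = -1.
// Weak mutation kills the mutant here, since the intermediate values differ.
// Strong mutation kills it only if a test observes a differing output,
// e.g. assertEquals(3, clamped(1, 2)). With a = -5, b = 1 both versions
// return 0, so that input kills the mutant weakly but not strongly.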

Killing Mutants with PIT: (╯°□°)–︻╦╤─ - - -

PIT is an open-source mutation testing framework for Java and the JVM. Formerly named Parallel Isolated Test, the project's initial function was to isolate static state by running JUnit tests in parallel using separate classloaders. The author later found that this was a much less interesting problem than mutation testing, which initially needed a lot of the same plumbing.

Most importantly, as far as mutation testing projects go, PIT is commercially popular and actively contributed to. As you would expect, it has first-class support for the likes of JUnit 4, TestNG, Ant and Maven. Third-party Gradle, Eclipse and IntelliJ integrations also exist.

In terms of design, mutation testing systems generally derive adequacy scores in four phases: mutant generation, test selection, mutant insertion and mutant detection. PIT parallelises these phases, making a mutation coverage run feel almost like a traditional statement coverage run. Most testing frameworks in this space have adopted varying strategies for implementing each of these phases. One reason for this disparity could be that many JVM-based mutation testing systems were written to meet the needs of academic research.

PIT is currently positioned as the most performant mutation testing tool available today, primarily because it adopts an effective byte-code-based approach in the mutant detection and generation phases. Systems generally adopt either a source code or a byte code based approach at these stages. The mutant detection phase is quite computationally expensive, though this can be offset by an effective test selection phase.

Additionally, systems using source code at the generation stage can incur a large computational cost if a naive approach is followed. Tools like MuJava, Javalanche and Jumble also use byte code generation; however, PIT performs various further optimisations. Rather than blindly running all cases against one mutation, PIT will first determine overall line coverage and subsequently run only those cases that can actually reach the mutation.

The Super Complicated Test Candidate

Most tech blogs love using real-world code examples, so let's not be an exception to that rule. Here we define a program that allows one to classify a triangle shape based on three supplied side lengths. A classification can be either equilateral, isosceles, scalene or invalid. Accordingly, we implement a TriangleType enum, and following that we define our Triangle class with one static method for classification.

public enum TriangleType {
  INVALID, SCALENE, EQUILATERAL, ISOSCELES
}
public class Triangle {
  public static TriangleType classify(final int a, final int b, final int c) {
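    // trian encodes which pairs of sides are equal: +1 if a == b, +2 if a == c, +3 if b == c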
    int trian;
    if ((a <= 0) || (b <= 0) || (c <= 0)) {
      return TriangleType.INVALID;
    }
    trian = 0;
    if (a == b) {
      trian = trian + 1;
    }
    if (a == c) {
      trian = trian + 2;
    }
    if (b == c) {
      trian = trian + 3;
    }
    if (trian == 0) {
      if (((a + b) < c) || ((a + c) < b) || ((b + c) < a)) {
        return TriangleType.INVALID;
      } else {
        return TriangleType.SCALENE;
      }
    }
    if (trian > 3) {
      return TriangleType.EQUILATERAL;
    }
    if ((trian == 1) && ((a + b) > c)) {
      return TriangleType.ISOSCELES;
    } else if ((trian == 2) && ((a + c) > b)) {
      return TriangleType.ISOSCELES;
    } else if ((trian == 3) && ((b + c) > a)) {
      return TriangleType.ISOSCELES;
    }
    return TriangleType.INVALID;
  }
}

We also implement a TriangleTest case and, because we're good testers, we make sure to achieve total line coverage:

public class TriangleTest {
  @Test
  public void test1() {
    final TriangleType type = Triangle.classify(1, 2, 3);
    assertEquals(SCALENE, type);
  }
  @Test
  public void testInvalid1() {
    final TriangleType type = Triangle.classify(1, 2, 4);
    assertEquals(INVALID, type);
  }
  @Test
  public void testInvalid2() {
    final TriangleType type = Triangle.classify(1, 4, 2);
    assertEquals(INVALID, type);
  }
  @Test
  public void testInvalid3() {
    final TriangleType type = Triangle.classify(4, 1, 2);
    assertEquals(INVALID, type);
  }
  @Test
  public void testInvalidNeg1() {
    final TriangleType type = Triangle.classify(-1, 1, 1);
    assertEquals(INVALID, type);
  }
  @Test
  public void testInvalidNeg2() {
    final TriangleType type = Triangle.classify(1, -1, 1);
    assertEquals(INVALID, type);
  }
  @Test
  public void testInvalidNeg3() {
    final TriangleType type = Triangle.classify(1, 1, -1);
    assertEquals(INVALID, type);
  }
  @Test
  public void testEquilateral() {
    final TriangleType type = Triangle.classify(1, 1, 1);
    assertEquals(EQUILATERAL, type);
  }
  @Test
  public void testIsosceles1() {
    final TriangleType type = Triangle.classify(2, 2, 3);
    assertEquals(ISOSCELES, type);
  }
  @Test
  public void testIsosceles2() {
    final TriangleType type = Triangle.classify(2, 3, 2);
    assertEquals(ISOSCELES, type);
  }
  @Test
  public void testIsosceles3() {
    final TriangleType type = Triangle.classify(3, 2, 2);
    assertEquals(ISOSCELES, type);
  }
  @Test
  public void testInvalid() {
    final TriangleType type = Triangle.classify(3, 1, 1);
    assertEquals(INVALID, type);
  }
}

So now let's get webscale and get our candidate unit tested and mutation tested under Maven. The pitest artifact and its related dependencies are hosted by Sonatype, so we can add the typical oss-sonatype repository and pluginRepository entries to our pom.xml. We can then configure PIT under JUnit 4 as follows:

...
<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.11</version>
        <scope>test</scope>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-compiler-plugin</artifactId>
            <version>2.4</version>
        </plugin>
        <plugin>
            <groupId>org.pitest</groupId>
            <artifactId>pitest-maven</artifactId>
            <version>1.1.10-SNAPSHOT</version>
            <executions>
                <execution>
                    <id>verify</id>
                    <phase>verify</phase>
                    <goals>
                        <goal>mutationCoverage</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
  ...

Unless we configure PIT with a whitelist of mutation operators (see the snippet after this list), it will use a default set of four. Of a possible sixteen operators, these four are perhaps the most useful and least expensive:

  1. Negate Conditionals Mutator negates all conditionals, replacing them with their logical complement.
    • boolean a = 1 == 2; becomes mutated to boolean a = 1 != 2;
  2. Conditionals Boundary Mutator replaces relational operators with their boundary counterpart.
    • boolean a = 1 > 2; becomes mutated to boolean a = 1 >= 2;
  3. Return Values Mutator mutates the return values of method calls.
    • return new Object(); becomes mutated to new Object(); return null;
    • return new Long(123); becomes mutated to return new Long(123)+1;
  4. Math Mutator replaces binary arithmetic operations for either integer or floating-point arithmetic with another operation.
    • int a = 1 + 1; becomes mutated to int a = 1 - 1;
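For reference, such a whitelist is configured through the plugin's mutators parameter. A minimal sketch, trimmed to the plugin element and using PIT's documented mutator names:

<plugin>
    <groupId>org.pitest</groupId>
    <artifactId>pitest-maven</artifactId>
    <configuration>
        <mutators>
            <mutator>CONDITIONALS_BOUNDARY</mutator>
            <mutator>NEGATE_CONDITIONALS</mutator>
            <mutator>RETURN_VALS</mutator>
            <mutator>MATH</mutator>
        </mutators>
    </configuration>
</plugin>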

Testing Tests While Testing

Given our configuration, PIT will now perform mutation testing at the end of the Maven verify lifecycle phase (after test). This involves running both line coverage and mutation coverage analysis:

> $ mvn verify
            ....
[INFO] Adding org.pitest:pitest to SUT classpath
[INFO] Mutating from /usr/src/app/maven-pit-review/target/classes
[INFO] Defaulting to group id (com.review.app*)
PIT >> INFO : Sending 1 test classes to minion
PIT >> INFO : Sent tests to minion
PIT >> INFO : MINION : PIT >> INFO : Checking environment
PIT >> INFO : MINION : PIT >> INFO : Found  12 tests
PIT >> INFO : MINION : PIT >> INFO : Dependency analysis reduced number of potential tests by 0
PIT >> INFO : MINION : PIT >> INFO : 12 tests received
PIT >> INFO : Calculated coverage in 0 seconds.
PIT >> INFO : Created  1 mutation test units
PIT >> INFO : Completed in 3 seconds
              ....
================================================================================
- Statistics
================================================================================
>> Generated 44 mutations Killed 36 (82%)
>> Ran 115 tests (2.61 tests per mutation)
================================================================================
- Mutators
================================================================================
> org.pitest.mutationtest.engine.gregor.mutators.ConditionalsBoundaryMutator
>> Generated 10 Killed 2 (20%)
> KILLED 2 SURVIVED 8 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
--------------------------------------------------------------------------------
> org.pitest.mutationtest.engine.gregor.mutators.ReturnValsMutator
>> Generated 8 Killed 8 (100%)
> KILLED 8 SURVIVED 0 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
--------------------------------------------------------------------------------
> org.pitest.mutationtest.engine.gregor.mutators.MathMutator
>> Generated 9 Killed 9 (100%)
> KILLED 9 SURVIVED 0 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
--------------------------------------------------------------------------------
> org.pitest.mutationtest.engine.gregor.mutators.NegateConditionalsMutator
>> Generated 17 Killed 17 (100%)
> KILLED 17 SURVIVED 0 TIMED_OUT 0 NON_VIABLE 0
> MEMORY_ERROR 0 NOT_STARTED 0 STARTED 0 RUN_ERROR 0
> NO_COVERAGE 0
--------------------------------------------------------------------------------

So, who would have guessed: our “totally covered” candidate app yields 44 mutations, of which eight survive, resulting in a total mutation adequacy score of 82%. The default mutation operators are generated across our 12 test cases, resulting in the execution of 115 mutated test cases. So what is this telling us about our tests? Every mutant can be reached by the tests run against each mutated version of the program (because we have full statement coverage), yet in eight of these runs the respective fault goes unnoticed as the tests still pass.
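The statistics above point the finger squarely at the Conditionals Boundary Mutator, where eight of ten mutants survived. As a hedged sketch (the exact surviving mutants may vary by PIT version), boundary-value tests like the following should kill mutants that our off-boundary inputs miss:

@Test
public void testInvalidZeroSide() {
  // kills the boundary mutant (a <= 0) -> (a < 0): the mutant lets a
  // zero-length side through and classifies (0, 1, 1) as ISOSCELES
  assertEquals(INVALID, Triangle.classify(0, 1, 1));
}

@Test
public void testInvalidDegenerate() {
  // kills the boundary mutant ((a + b) > c) -> ((a + b) >= c): the mutant
  // classifies the degenerate triangle (1, 1, 2) as ISOSCELES
  assertEquals(INVALID, Triangle.classify(1, 1, 2));
}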

Matrix Multiplication Feels Bad

In our last post we talked about the implications of both sequential and parallel naive matrix-matrix multiplication (MMM). Sadly, we found that the classic triple for-loop approach suffers from poor locality.

OMG, who would have known? Our algorithms actually sometimes need to be cache-aware.

This time around we will look at the tall cache assumption along with loop-tiling applied to parallel and sequential MMM. As you would expect there are also some pretty benchmarks involved.

Cache-Awareness, So Confuse

In easy terms, transfers between cache and disk involve blocks of data: the disk is partitioned into blocks of B elements each, and accessing one element on disk copies its entire block to cache. When assuming a tall cache, the cache can store up to M/B blocks, for a total size of M elements, where M ≥ B².
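To put purely illustrative numbers on this (an assumption for exposition, not a measurement): with blocks of B = 64 elements and a cache of M = 4096 elements, the cache holds M/B = 64 blocks, and the tall-cache assumption holds (with equality) since 64² = 4096.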

This means that when the cache is already full and a block is grabbed from disk, another block must be evicted from cache. Before exploring something more cache-aware, let us demonstrate this occurring in our naive (and cache-oblivious) solution. Assuming square matrices which do not fit into a tall cache, then

n = rowAmountOfA = colAmountOfA = rowAmountOfB

n > M/B

Below we see the row-major layout of matrices A and B, with the red rectangles in matrix B depicting our blocks of cache. For each row of matrix A we traverse every column of B, and each element accessed during a traversal of B is cached. Given that n = colAmountOfA and n > M/B, the first (and possibly several subsequent) cached blocks of B will be evicted before the traversal of a column of B completes. So on the next and all remaining traversals of B we incur cache misses on each block.

[Figure: row-major layout of matrices A and B, with cached blocks of B shown in red]

From this we can be quite clever and infer that the naive algorithm with n > M/B incurs the following number of cache misses

Q(n) = Θ(n³)

This implies that the naive algorithm with n ≤ M/B will incur only

Q(n) = Θ(n³/B)

since the second matrix can now exploit spatial locality.

Loop-Tiling, Feels Good

Let's take a look at a standard tiling solution to resolve our cache capacity misses.

Pro tip: loop-tiling is a type of loop transformation on which many cache blocking algorithms are built.

It involves a combination of strip-mining and interchange in multiply-nested loops: several of the nested loops are strip-mined, and interchanges are performed to bring the strip-loops inward.

The following tiling solution ensures greater spatial locality for accesses across matrix B. Simple but effective, our implementation partitions the matrices' iteration space through such strip-mining and loop interchange

s = (int) Math.ceil(rowAmount1 * 1.0 / 4);
s2 = (int) Math.ceil(colAmount1 * 1.0 / 4);
s3 = (int) Math.ceil(rowAmount2 * 1.0 / 4);
// the outer loops step tile-by-tile over the iteration space
for (i = 0; i < rowAmount1; i += s){
  for (j = 0; j < colAmount1; j += s2){
    for (k = 0; k < rowAmount2; k += s3){
      // the inner loops multiply one submatrix (tile) pair
      for (i2 = i; i2 < (i + s) && i2 < rowAmount1; i2++){
        for (j2 = j; j2 < (j + s2) && j2 < colAmount1; j2++){
          for (k2 = k; k2 < (k + s3) && k2 < rowAmount2; k2++){
            C[i2][j2] += (A[i2][k2] * B[k2][j2]);
                          ....

Although not implemented recursively, tiling shares some similarities with the divide-and-conquer approach in that it aims to refine the problem size by partitioning the problems (matrices) into subproblems (submatrices). The design of the tiling solution quite naturally lends itself to parallelisation, which is implemented as follows

s = (int) Math.ceil(rowAmount1 * 1.0 / 4);
s2 = (int) Math.ceil(colAmount1 * 1.0 / 4);
s3 = (int) Math.ceil(rowAmount2 * 1.0 / 4);
executor = Executors.newFixedThreadPool(10);
for (interval = 10, end = rowAmount1,
    size = (int) Math.ceil(rowAmount1 * 1.0 / 10);
    interval > 0 && end > 0; interval--, end -= size){
  to = end;
  from = Math.max(0, end - size);
  Runnable runnable = new Runnable(){
    @Override
    public void run() {
      // each task tiles only its own horizontal slice [from, to) of C
      for (i = from; i < to; i += s) {
        for (j = 0; j < colAmount1; j += s2){
          for (k = 0; k < rowAmount2; k += s3){
            for (i2 = i; i2 < (i + s) && i2 < to; i2++){
              for (j2 = j; j2 < (j + s2) && j2 < colAmount1; j2++){
                for (k2 = k; k2 < (k + s3) && k2 < rowAmount2; k2++){
                  C[i2][j2] += (A[i2][k2] * B[k2][j2]);
                               ...

It is important to note the s, s2 and s3 parameters to our tiling algorithm. Above we see that they are each derived from a matrix dimension divided by 4. They are later used to determine the dimensions (s2 x s3) of the submatrix-submatrix multiply tasks to be run. In the case of a square MMM the submatrices' dimensions would be s x s.

We divide up these multiply tasks through from/to bounds, requiring no synchronisation as no mutable data is shared across threads.

Determining values for s, s2 and s3 is not straightforward: they ultimately determine the tile size, which in turn demands an accurate estimate of the cache size on your target machine. Assuming square matrices which do not fit into a tall cache, then

n = rowAmountOfA = colAmountOfA = rowAmountOfB

n > M/B

Once we determine s ≤ M/B we can easily decompose submatrices (tiles) from matrix B that respect the target cache size
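Continuing the illustrative numbers from earlier (M = 4096, B = 64): s = M½ = 64 also satisfies s ≤ M/B = 64, and gives 64 x 64 tiles of exactly M elements, each loadable with only Θ(s²/B) = 64 cold misses.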

[Figure: matrix B decomposed into s x s submatrices (tiles) sized to fit the cache]

As you would expect, the tiling optimisation still results in Θ(n³) computations. With an optimal s value, Θ(M½), we see an improvement in the number of cache misses incurred relative to our naive approach. The tall-cache assumption implies that each submatrix and its data will fit in cache, meaning only cold cache misses should occur per submatrix (Θ(s²/B)).

W(n) = Θ((n/s)³ · s³) = Θ(n³)

Q(n) = Θ((n/s)³ · (s²/B)) = Θ(n³/(B·M½))

From the above, as you might have guessed, we can see that s is actually a “tuning” parameter for cache-awareness. In our solution the tiling divisor is set to 4 which, from some rudimentary testing, is the optimal value on recent i5 and i7 processors (OS X 10.10).

Dem Benchmarks: What was measured?

Our benchmarking goal was to analyse the trends displayed across the executions of naive MMM and cache-aware MMM in conjunction with parallelisation. Thus the benchmarking process undertaken had the following features:

  1. Varying sets of dimensions for A and B were used on both algorithms to identify trends. The dimensions (200x200), (200x300), (1000x1000) and (1000x1200) were used.
  2. To maintain measurement accuracy, the elements held by A and B needed to be the same for each measurement. Every element of A is always 2, while every element of B is always 3.
  3. Each benchmark was conducted on two machines (4 cores and 8 cores) under the same conditions (no other user processes running, machines connected to power outlets, etc.)


Dem Results

Reflecting on the mean (μ) execution times and corresponding standard deviations (σ) from our benchmark runs, the collated results are as follows

[Table: mean execution times (μ) and standard deviations (σ) per algorithm, matrix dimension and machine]

Our data poses several interesting avenues of investigation. However, our main concern is exploring the benefits of cache-aware algorithms, so let's just explore the performance differences between the two algorithms across runs. The bar chart below illustrates an obvious difference between our algorithms: at a glance one can see a large deviation between the naive and tiled time results.

[Chart: naive vs tiled MMM execution times across matrix dimensions and core counts]

This is particularly true across the benchmarks where matrix dimensions are in the thousands, which is unsurprising given that the naive algorithm is cache-oblivious. Looking at the bars for the tiled results, we can see that varying matrix dimensions do not greatly affect performance. Interestingly, an increase in cores only greatly benefits the naive algorithm: it sees around a 55% improvement in execution time with the assistance of 4 extra cores across all dimension sizes.

Why, Why, Why?

Before undertaking my postgraduate studies I led a quite joyous and sheltered life, living the high life in the worlds of Ruby, JavaScript, Python, Java and all of those sorts. My life got flipped-turned upside down when I began studying subjects like Secure Programming, Concurrent Programming and Formal Methods. For me, the biggest takeaways from these studies were parallel algorithm design and the principle of locality.

Most dense linear algebraic computations rely on the consideration of both spatial and temporal locality. If you are unfamiliar with locality: the former describes the tendency of applications to reference memory addresses near other recently accessed addresses, while the latter describes the reuse of the same memory locations over a short period.

One such linear algebra problem subject to this is dense matrix-matrix multiplication (MMM). Coincidentally, MMM is also highly parallelisable, so it is a perfect candidate for demonstrating parallel algorithm design and the principle of locality.

Matrix Multiply, Ain’t Nobody Got Time For That

Matrix multiplication is important; it is central to many scientific computations. For instance, solving a linear system or computing least-squares fits of coefficients to data relies on MMM implementations for good performance.

As you would expect, the performance of an MMM implementation is intimately related to the memory layout of the arrays. On modern shared-memory multiprocessors with multi-level memory hierarchies, naive access patterns in the memory hierarchy can incur cache capacity misses.

Matrix Multiply, How Tho?

MMM for square matrices has the following mathematical definition, which is independent of any implementation. Given two n x n matrices A and B, we compute the sum and product

C = C + A · B

that is, C[i][j] += A[i][k] · B[k][j] summed over k. For non-square matrices MMM differs only in that the number of columns in matrix A must equal the number of rows in B.

Naive Matrix-Matrix Multiply

The most naive algorithm for MMM comes straight from the above definition and mimics computing the product by hand. Assuming the arrays are stored such that rows of the arrays are in consecutive memory locations, i.e. row-major order, then we derive C as follows

for (int i = 0; i < rowAmount1; i++)
    for (int j = 0; j < colAmount1; j++)
        for (int k = 0; k < rowAmount2; k++)
            C[i][j] += (A[i][k] * B[k][j]);

This algorithm is said to be naive because it suffers from poor locality. Elements of B are accessed column-wise, and therefore not in sequential order in memory. While each iteration over j reuses row i of A, that row may have been evicted from the cache by the time the innermost loop completes. As a result, the algorithm is bandwidth-limited and displays poor performance and low efficiency.

As we highlighted before, matrix multiplication offers abundant parallelism. While maintaining the less favourable design of this naive approach, let us introduce a parallelised implementation

executor = Executors.newFixedThreadPool(10);
for (interval = numTasks, end = rowAmount1,
        size = (int) Math.ceil(rowAmount1 * 1.0 / 10);
        interval > 0 && end > 0; interval--, end -= size){
    final int to = end;
    final int from = Math.max(0, end - size);
    Runnable runnable = new Runnable(){
        @Override
        public void run(){
            // loop counters are local to each task, so no mutable state is shared
            for (int i = from; i < to; i++)
                for (int j = 0; j < colAmount1; j++)
                    for (int k = 0; k < rowAmount2; k++)
                        C[i][j] += (A[i][k] * B[k][j]);
        }
    };
    executor.execute(runnable);
}

We must manually divide the work into tasks and provide each task with its from/to bounds. Each task runs the entire outer loop, but only for specific non-overlapping values. Thus there is no synchronisation in this implementation, as no mutable data is shared across threads: the input arrays A and B are not mutated, and while the contents of the output array C are altered, each thread works on a different slice.

What’s The Actual Damage Tho?

When working with most matrix sizes this approach sees significant improvements in execution time over its sequential counterpart. However, as you may expect, these improvements are not linear across different matrix sizes.

The basic flaw of the naive triple for-loop implementation is that the size of its per-loop working set prevents bandwidth amplification from the closest levels of the memory hierarchy; i.e. even with this parallelised horizontal-slicing approach, the algorithm still suffers from poor locality.

Until Next Time

Later on we shall investigate some fun stuff like cache-awareness and the tall-cache assumption. Of course, we'll then develop a cache-aware MMM algorithm using a “tiling” approach. Finally, we will see some pretty benchmark graphs that demonstrate how our algorithms fared across different matrix sizes and over different numbers of cores.

I’m Totally Sorry

Oops, it seems I haven't actually followed through on my commitment to write a blog post every month. Shame on me!

Down to Bizness

Let's start off by creating something difficult: an application that is incredibly useful, original, cutting-edge and technically challenging to develop. What else could it be, of course, but a TODO list.

From scratch we are going to develop a TODO list API with Sinatra. If you are unfamiliar with Sinatra, it is simply a small web framework for Ruby, somewhat like Ruby on Rails. As Sinatra is built on top of Rack, we can easily deploy our completed app to Heroku.

This tutorial expects that you have some grounding in Ruby and understand basic RESTful API design. We will be using Ruby 2.1.2 but any Ruby version post-2.0 should be fine.

Setting up your dependencies

Even in a small project nobody likes managing their own dependencies; it really can be terrible. So let's reach for Bundler, which essentially frees you from dependency hell by making sure your application runs the same code on every machine.

First we need to install the Bundler gem.

$ gem install bundler

We must then define our Gemfile accordingly, a Ruby file placed at the root of your project directory for describing your gem (package) dependencies.

source 'https://rubygems.org'
# ruby '2.1.2' # tested with 2.1.2 but other versions should be fine

gem 'sinatra', '1.4.2' # web framework
gem 'sinatra-contrib', '1.4.2' # web framework extensions

gem 'data_mapper' # lightweight ORM

group :development do
  gem 'dm-sqlite-adapter' # local DB ORM adapter
  gem 'sqlite3' # local DB
end

group :production do
  gem 'pg' # prod DB
  gem 'dm-postgres-adapter' # prod DB ORM adapter
  gem 'unicorn' # prod application server
end

Finally we must install our required gems with Bundler.

$ bundle install --without production

Say Harro World!

Great, we have our dependencies in place, so we had better verify that everything is working as expected. Any Ruby file that requires or loads the Sinatra library, with at least one endpoint defined, can be conveniently executed as a WEBrick server.

Let's simply define one route, or endpoint, and test it out.

require 'sinatra'

get '/' do
  "LOLOLOL! You rock!"
end

Like any other Ruby script, we can execute it with the ruby command.

$ ruby proof.rb

As you would expect, proof.rb begins to run on the default port (4567) and we see the desired response at http://localhost:4567/. Among other things, we can easily change this port by passing a flag argument to the script.

A list of the available flags can be output by passing -h as an argument.
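For instance, Sinatra's -p flag overrides the port (8080 being an arbitrary example):

$ ruby proof.rb -p 8080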

App Environment Cruft

Let's now create our cleverly named main application file, app.rb. First, we need to require our library modules.

# Require the bundler gem and then call Bundler.require to load in all gems
# listed in Gemfile.
require 'bundler'
Bundler.require(:default, :development) if defined?(Bundler)
# Reload app.rb when changed
require "sinatra/reloader" if development?

Using Bundler.require we tell Bundler to load the modules in the gem groups named development and default, where the latter covers all gems not contained in the development or production groups.

Remember our dependency sinatra-contrib? It contains lots of nice extensions for Sinatra, one of which no one should live without: Sinatra::Reloader. It automatically reloads all files required by your application whenever they change, which of course we want during development.

Models, Models Everywhere

We are using DataMapper as our ORM. In terms of our requirements, its main appeal is the ever-useful auto_migrate! feature, which creates your datastore schema based on your defined DataMapper::Resource models, removing the need to write and manage migrations.

Now we must create our local SQLite DB and establish a connection.

# DB init
DataMapper.setup(:default, "sqlite3://#{Dir.pwd}/development.db") if development?

With that in place we can now define our naive Task model, which models a TODO list item. As with ActiveRecord, the convention in DataMapper is to use the singular version of the name for models; though unlike ActiveRecord, DataMapper does not enforce this.

# Models
class Task
  include DataMapper::Resource

  property :id, Serial, key: true
  property :name, String, required: true
  property :completed_at, DateTime
  property :created_at, DateTime
end
# Finalize and migrate
DataMapper.finalize.auto_upgrade!

Before using our model we must first call finalize. This checks the validity of all of our DataMapper model definitions and initializes all properties associated with relationships. It conveniently returns its caller, DataMapper, so we can chain a call to auto_upgrade!, which issues the CREATE statement necessary for our task table if it does not already exist.

Now let's verify that our model is in fact persistent and working as expected. We can simply execute the script and our DB will be put in place. Let's query the current number of task rows.

$ ruby -r ./app.rb -e "puts Task.all.size"

Gimme Them Endpoints

In Sinatra we can define endpoints as a combination of an HTTP verb and the corresponding request path, which as you can see results in succinct and pretty code ♥

# Api
get '/task/' do
  content_type :json

  @tasks = Task.all(order: :created_at.desc)
  @tasks.to_json
end

Upon a defined endpoint being accessed, a before code block is run if one is defined. This is a simple before-request hook; we do not avail of it for our endpoints. Upon its successful termination, the user-defined block on that endpoint is executed.

Above we simply return a JSON list of all the task rows. We can verify this endpoint by executing curl http://localhost:4567/task/. As expected, the body of our response is an empty array.

Now we can define our remaining endpoints for POST, GET, PUT and DELETE.

post '/task/' do
  content_type :json

  @task = Task.new(permit_params)
  @task.save ? @task.to_json : halt(500)
end

put '/task/:id' do
  content_type :json

  @task = Task.get(params[:id].to_i)
  halt(404) unless @task

  @task.update(permit_params) ? @task.to_json : halt(500)
end

get '/task/:id' do
  content_type :json

  @task = Task.get(params[:id].to_i)
  @task ? @task.to_json : halt(404)
end

delete '/task/:id' do
  content_type :json

  @task = Task.get(params[:id].to_i)
  @task && @task.destroy ? { success: "ok" }.to_json : halt(400)
end

def permit_params
  params.select { |k, v| ["name", "completed_at"].include? k }
end
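As a quick smoke test (the task name and id below are illustrative), we can create and then delete a task via curl:

$ curl -X POST -d "name=Buy milk" http://localhost:4567/task/
$ curl -X DELETE http://localhost:4567/task/1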

Data, Not Enough Data

Before we deploy to Heroku it would be nice to have a prepopulated API in place, so let's just seed two tasks.

# Seed
if Task.count == 0
  Task.create(name: "Test Task One")
  Task.create(name: "Test Task Two")
end

Again, we can verify this by executing curl http://localhost:4567/task/ and seeing our two listed tasks.

Ship First, Test Later

Let's put this thing of beauty out there! The usual tweaks are required for our Heroku deploy: we need to parameterise which gems are required and which DataMapper connection to use per environment.

Bundler.require(:default, ENV['RACK_ENV'] || :development) if defined?(Bundler)

if development?
  DataMapper.setup(:default, ENV['DATABASE_URL'] || "sqlite3://#{Dir.pwd}/development.db")
elsif production?
  DataMapper.setup(:default, ENV['DATABASE_URL'] || 'postgres://localhost/sinatra-todos')
end

We then need to create a rackup file named config.ru to instruct Heroku how to run our Rack app.

require './app'
run Sinatra::Application

Finally, we must set up a repository, add a Heroku remote, and push our work to deploy the app.

git init && git add --all && git commit -m "initial"
heroku create
git push heroku master

Ah yes, after much procrastination I have finally gotten around to my first blog post. I now faithfully vow to post at least one blog entry per month ♥