Java 8: A quick introduction to Parallelism and the Spliterator

java-8-parellelism-spliterator

With the release of Java 8 a number of new language features were introduced [1]. These included lambda functionsstreamsand completable futures. Colleagues of mine have already reviewed these features in previous articles on this blog, which I recommend reading as part of this topic [2][3]. In this article I will touch on an aspect of the Java 8 release that relates to the push towards exploiting parallelism, in the context of the existing Collections Framework – specifically the new Spliterator interface.

Due to the size and complexity of these topics, it will only be possible to provide a high-level introduction, but it is hoped that this will trigger interest in the reader, to go off and delve further into an area of growing importance and interest.

Parallelism

First thing first – what is meant by the term ‘parallelism’? For the purpose of this article, at a high level, it is a term that refers to programming in a way to exploit multi-core CPU units. That is to say, to take a piece of work (a task) and break it out into separate units (sub-tasks) that can be processed in parallel. Then aggregating the results of all the processed units to complete the original work [4]. This is sufficient for now, although the reader might want to investigate further on differences between parallelism, concurrency and sequential programming.

Figure 1

For this to be successful, the processing on a unit should not have a dependency on any other unit – this processing should be stateless and also should not depend on the result of the processing of any other unit. This is important when it comes to the data that is involved – as parallel computing lends itself to situations where vast amounts of data is to be handled.

This also raises the question of when to go for parallelism – there is an overhead in the added complexity of identifying and breaking out the units of work and then coordinating the parallel processing. Then there is also the issue of latency in reading the data – the savings found in using parallel cores for processing is to try to keep each CPU core busy all the time – which requires reading data as and when is needed without delay. Before embarking on a parallel processing architecture, some cost-benefit analysis is required to be sure that this is the right approach.

With the introduction of Java 8 Oracle was aiming to make it more attractive and simpler for developers to use Java in parallel programming. The inclusion of features such as lambda functions and the Streams API was to be seen as part of this push [5]. How successful that becomes, time will tell.

This concept of parallelism is behind the existing fork/join framework found in Java, which will be covered in a future article, but for now we will confine our overview to the new Spliteratorinterface, found in the java.util library.

Collections Framework

The Collections Framework is also found in the java.util library, and provides a range of classes that act as containers to hold objects (technically, references to the objects) and allow useful behaviour such as adding new objects, searching the container for a particular object, and the sorting of such objects – the behaviour available is depending on the type of container used [6].

Within the library there are two different concepts of containers, captured in the core interfaces, that are defined within the framework [7]. One concept is that of a Collection that is a sequence of individual elements, which is further refined by extending the Collectioninterface with interfaces for ListSetQueue or Deque.

Listing 1 – Simple Iterator on an ArrayList

// Create a new Collection type; in this case an ArrayList Collection<Person> people = new ArrayList<Person>(); // Add some people using Person objects people.add(new Person("Jane", "Doe", "Ireland")); people.add(new Person("Joe", "Doe", "England")); people.add(new Person("John", "Doe", "Scotland")); people.add(new Person("Julie", "Doe", "Wales")); people.add(new Person("Jerry", "Doe", "France")); people.add(new Person("Jim", "Doe", "Italy")); // simple iterator example Iterator<Person> peopleIterator = people.iterator(); while(peopleIterator.hasNext()) { Person person = peopleIterator.next(); System.out.println("Hello " + person.getFirst_name() + " " + person.getLast_name() + " from " + person.getCountry()); }

The output is:

Hello Jane Doe from Ireland Hello Joe Doe from England Hello John Doe from Scotland Hello Julie Doe from Wales Hello Jerry Doe from France Hello Jim Doe from Italy

Separately, there is the concept of a container, defined in the Map interface, holding a group of key-value object pairs, which allow the use of a key to find a value – in this way an object (value) can be found using another object (key) that has been mapped to it.

Both the interfaces of Collection and Map are found in the Collections Framework, in java.util, along with a class called Collections. Note that the Collections class consists exclusively of static methods, which include polymorphic algorithms, that operate on or return collections.

Paralellism and the Collections Framework

A problem with using the containers of the Collections Framework in some form of parallel programming is that these containers are not thread-safe. Wrappers are provided for adding automatic synchronization (thread safety) but the drawback is that this introduces thread contention, where two or more threads are trying to access the same resource simultaneously and therefore cause the runtime to either suspend their execution or execute them more slowly.

The Spliterator

A new interface added to java.util is the Spliterator, which as the name implies, is a new special kind of Iterator that can traverse a Collection[8]. For Java 8, the Collection interface has been updated to include a new spliterator() method, that when called returns a Spliterator. This is not the case for the separate Map interface, although it is considered part of the Collections Framework as explained above.

The Spliterator can ‘split’ the Collection, partitioning off some of its elements as another Spliterator. This does allow parallel processing of different parts of a Collection but note that the Spliterator itself does not provide the parallel processing behaviour. Instead, the Spliteratoris there to support parallel traversal of the suitably partitioned parts of a Collection. This solves the problem of dividing the data, as held in a Collection such as an ArrayList, into suitably sized sub-units that can be processed in parallel.

The fork/join framework, which is found in Java 7 libraries, can be used with the Spliteratorand is designed for parallelizable work that can be broken into smaller pieces recursively to be processed independently, and then aggregating the results of the sub-units to produce a final result. Again, however note that using the Spliterator is not dependant on the fork/join framework, this framework is just one way of implementing parallel processing [9].

Listing 2 – Simple Spliterator on an ArrayList

The following code gives us an ArrayList, called people that is to hold objects of the Persontype – a user defined type:

Collection<Person> people = new ArrayList<Person>();

The ‘people’ collection then has a number of objects added, and we can now call on the spliterator() of the ArrayList:

Spliterator<Person> peopleSpliterator = people.spliterator();

This returns a Spliterator we are calling ‘peopleSpliterator’.

System.out.println(" " + peopleSpliterator.characteristics()); System.out.println(" " + peopleSpliterator.estimateSize());

The output from the above code is :

16464 
6

The value 16464 represents the characteristics of the Spliterator partition. This is important as it is these predefined characteristics that result in what ends up in each new partition and how it is structured. The int value returned from the characteristics() call is the result of the OR’ed values of the individual characteristics for an ArrayList, which is ORDEREDSORTED and SUBSIZED[10].

ORDERED indicates that the elements have a defined order, which is expected from a List, when traversing and partitioning them. SORTED means that the elements follow a predefined sort order while SUBSIZED indicates that this and any further resulting Spliterator are SIZEDSIZED means the Spliterator has been created from a source with a known size. The details of the characteristics for a particular Collection type can be found in the online API documentation. Where you wish to have a collection with non-standard characteristics, that means going down the route of implementing your own version of the Spliterator interface.

Listing 3 – Split the Spliterator

Now that we have a Spliterator, lets ‘split’ it:

Spliterator<Person> newPartition = peopleSpliterator.trySplit(); System.out.println(" " + newPartition.estimateSize()); System.out.println(" " + peopleSpliterator.estimateSize());

The output from this code is:

3 
3

As you can see we have a new partition with 3 elements, leaving the existing partition with 3. The important method of trySplit() has partitioned the elements of the original Spliterator – in this case split it evenly, leaving the existing Spliterator with a reduced number of elements in its own partition. Clearly this is very different from the behaviour of the pre-existing Iterator.

If this Spliterator can be split (partitioned), the method will try to divide the elements in half for a balanced parallel computation. This will not always be the case, and where a Spliteratorcannot be split, a null is returned. The splitting, or partitioning, can continue recursively until a Spliterator returns null. If a different behaviour is required in how the trySplit() method works, then again a custom implementation of the Spliterator interface is required.

When used in a parallel programming solution, bear in mind that the individual Spliterator is not thread safe; instead the parallel processing implementation must insure that each individual Spliterator is handled by one thread at a time. A thread calling the trySplit() of a Spliterator, may hand over the returned new Spliterator to a separate thread for processing. This follows the idea of decomposition of a larger task into smaller sub-tasks that can be processed in parallel individually of each other.

When working with a Spliterator, especially in a parallel programming environment, changes to the data structure that is undergoing processing can lead to arbitrary and unknown behaviour and that is not good. Ideally the original data source should not be interfered with (no elements added, replaced or removed). Collection types are not immutable and cannot manage concurrent modifications. To help with this problem, for the ArrayList, the trySplit() returns a ‘late-binding’ and ‘fast-fail’ Spliterator. This late-binding binds the elements to the Spliteratoron first split, first traversal or first query for estimated size rather then at the time of the creation of the Spliterator. After this binding occurs any interference with the elements is then detected, throwing a ConcurrentModificationException. Here, it fails-fast, rather then risking arbitrary and non-deterministic behaviour at a future time.

Of course with the above trivial example of only 6 elements in the list, this is hardly a work load that would justify parallel programming – but as mentioned before, Spliterator itself does not provide the parallel programming behaviour – the reader might be interested in experimenting with a Collection type holding a vastly larger number of objects and then look to implement a parallel processing solution to exploit the use of the Spliterator. A suggestion is that such a solution may incorporate the fork/join framework and/or the Stream API.

Future articles will delve further into this interesting and increasingly important area.

Overview of Method References

Method references are a feature of Java 8. They are effectively a subset of lambda expressions, because if a lambda expression can be used, then it might be possible to use a method reference, but not always. They can only be used to call a singular method, which obviously reduces the possible places they can be used, unless your code is written to cater for them.

It would be a good idea if you knew the notation for a method reference. In fact, you have probably already seen it assuming you read the title. If not then just look below.

Person::getName 

The example above is the equivalent of writing person.getName(), where person is an instance of Person. Let me tell you a bit more about when you can use method references and show some examples as it makes a lot more sense with them.

Types of Method References

Type Syntax Method Reference Lambda expression
Reference to a static method Class::staticMethod String::valueOf  s -> String.valueOf(s)
Reference to an instance method
of a particular object
instance::instanceMethod s::toString  () -> “string”.toString()
Reference to an instance method
of an arbitrary object of a particular type
Class:instanceMethod String::toString  s -> s.toString()
Reference to a constructor Class::new String::new  () -> new String()

Reference to a Static Method

public class StaticMethodReference{
    public static void main(String args[]) {
        List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Method reference
        list.forEach(StaticMethodReference::print);
        // Lambda expression
        list.forEach(number -> StaticMethodReference.print(number));
        // normal
        for(int number : list) {
            StaticMethodReference.print(number);
        }
    }
    public static void print(final int number) {
        System.out.println("I am printing: " + number);
    }
}

Here, it calls the static method StaticMethodReference.print. This example is pretty simple. There is a static method, and for each element in the list, it calls this method using the element as the input.

Reference to an Instance Method of a Particular Object

public class ParticularInstanceMethodReference {
    public static void main(String args[]) {
        final List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        final MyComparator myComparator = new MyComparator();
        // Method reference
        Collections.sort(list, myComparator::compare);
        // Lambda expression
        Collections.sort(list, (a,b) -> myComparator.compare(a,b));
    }

    private static class MyComparator {
        public int compare(final Integer a, final Integer b) {
            return a.compareTo(b);
        }
    }
}

Here, it calls the instance method myComparator.compare, where myComparator is a particular instance of MyComparator.

Reference to an Instance Method of an Arbitrary Object of a Particular Type

public class ArbitraryInstanceMethodReference {
    public static void main(String args[]) {
        final List<Person> people = Arrays.asList(new Person("dan"), new Person("laura"));
        // Method reference
        people.forEach(Person::printName);
        // Lambda expression
        people.forEach(person -> person.printName());
        // normal
        for (final Person person : people) {
            person.printName();
        }
    }
 private static class Person {
        private String name;
        public Person(final String name) {
            this.name = name;
        }
        public void printName() {
            System.out.println(name);
        }
    }
}

This calls the method Person.getName for each Person object in the list. Person is the particular type, and the arbitrary object is the instance of Person that is used during each loop. This looks very similar to a reference to a static method, but the difference is how the object is passed to the method reference. Remember, a static reference passes the current object into the method, whereas an arbitrary method reference invokes a method onto the current object.

Reference to a Constructor

public class ConstructorMethodReference {
    public static void main(String args[]) {
        final List<Integer> list = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        // Method Reference
        copyElements(null, ArrayList<Integer>::new);
        // Lambda expression
        copyElements(list, () -> new ArrayList<Integer>());
    }
    private static void copyElements(final List<Integer> list, final Supplier<Collection<Integer>> targetCollection) {
        // Method reference to a particular instance
        list.forEach(targetCollection.get()::add);
    }
}

This is the example I had the most trouble trying to make, as no matter how hard I thought, I couldn’t think of a way this could be used in something complicated. I am sure my opinion would change if I used Java 8 while at work, but for now, I do not see why this type of method reference is particularly useful. The example uses the Supplier functional interface to pass Integer::new into the copyElements method.

Conclusion

In conclusion, method references can be used to make your code even more concise, but they have some restrictions on when they can be used for and what they can do. If you simplify your code by using a lambda expression, then you might be able to make it even shorter by using a method reference. Eventually, your code will be so short your bosses will wonder what you have even been doing as you have only written a few lines of code!