Posts Tagged ‘Java’

QCon London 2013 - Simplicity, complexity and doodles

March 21st, 2013 by
(http://blog.trifork.com/2013/03/21/qcon-london-2013-simplicity-complexity-and-doodles/)

Westminster Abbey - View from the Queen Elizabeth II conference center

...and now back home

On my desk lies a stack of notepads from the QCon sponsors. I pick up one of them and turn a few pages, trying to decipher my own handwriting. As I read my notes I reflect back on the conference. QCon had a great lineup and awesome keynote speakers: Turing Award winner Barbara Liskov, Ward Cunningham, inventor of the wiki, and of course Damian Conway, who gave two highly entertaining keynotes. My colleague Sven Johann and I were at QCon for three days. We attended a few talks together but also went our own way from time to time. Below I discuss the talks I attended that Sven didn't cover in his QCon blog from last week.

Ideas not art: drawing out solutions - Heather Willems

The first talk I cover is not about software technology but about communication. Heather Willems showed us the value of communicating ideas visually. She started the talk with an entertaining discussion of the benefits of drawing in knowledge work. Diagrams and visuals help us retain information and help group discussion. The short of it: it's OK to doodle. In fact it is encouraged!

The second part of the talk was a mini-workshop where we learned how to create our own icons and draw faces expressing basic emotions. These icons can form the building blocks of bigger diagrams. Earlier in the day Heather made a graphic recording of Barbara Liskov's keynote. In real-time: Heather was drawing on-the-spot based on what Barbara was talking about!

Graphic recording of Barbara Liskov's keynote 'The power of abstraction'

You are not a software developer! - Russel Miles

Thought-provoking talk by Russel Miles about simplicity in problem solving. His main message: in the last decade we learned to deliver software quite well, and now we face a different problem: overproduction. Problems can often be solved much more easily, or without writing software at all. Russel argues that software developers find requirements boring, yet they have the drive to code, hence they sometimes create complex, over-engineered solutions.

He also warns against oversimplifying: a solution so simple that the value we seek is lost. His concluding remark relates to a key tenet of Agile development: delivering valuable software frequently. He proposes to focus instead on 'delivering valuable change frequently'. Work on the change you want to accomplish rather than cranking out new features. These ideas are related to the concepts of impact mapping, which, as he revealed at the end, he used to structure the presentation itself :-)

Want to see Russel live? He will be giving an updated version of this presentation at a GOTO night in Amsterdam on May 14 and he'll be speaking at GOTO Amsterdam in June too.

The inevitability of failure - Dave Cliff

In this talk professor Dave Cliff of the Large Scale Complex IT Systems group at the University of Bristol warns us about the ever-growing complexity of large-scale software systems, especially automated traders in financial markets. Dave mentions recent stock market crashes as failures. These failures did not make big waves in the news, but could have had catastrophic effects if the market had not recovered properly. He discusses an interesting concept: normalization of deviance.

Every time a safety margin is crossed without problems, it becomes more likely that the safety margin will be ignored in the future. He argues that we were quite lucky with the temporary market crashes. Because of 'normalization of deviance' it's only a matter of time before a serious failure occurs. Unfortunately I missed an overview of ways to prevent these kinds of problems, if they can be prevented at all. A principle from cybernetics, Ashby's law of requisite variety, says that a system can only be controlled if the controller has enough variety in its actions to compensate for any behaviour of the system to be controlled. In a financial market, with many interacting traders, human or not, this isn't the case.

Performance testing Java applications - Martin Thompson

Informative talk about performance testing Java applications. It starts with fundamental definitions and covers tools and approaches for all sorts of performance testing. Martin proposes a red-green-debug-profile-refactor cycle so that you really know what is happening with your code and how it performs. Another takeaway is the difference between performance testing and optimization. Yes, defer optimization until you need it, but that is no reason not to know the boundaries of your system. When load testing, use a framework that spends little time on parsing requests and responses. All good points, and I'll have to read his slides again later for all the links to the tools he suggests for performance testing.

Insanely Better Presentations - Damian Conway

Great talk on how to give presentations. Damian shows examples of bad slides and refactors them during his talk. He discusses fear of public speaking and how to properly prepare a talk, and gives a lot of great tips! I won't do the talk justice by describing it in text. Many of Conway's ideas have to be seen live to make sense. Nevertheless there is a method to the madness:

  • Dump everything you know on the subject
  • Decide on 5 main points and create a storyline that flows between them
  • Toss out everything that does not fit the storyline
  • Simplicity - show less content, on more slides
  • Use highlighting for code walkthroughs
  • Use animations to show code refactorings
  • Get rid of distractions
  • The most important part of a presentation is person-to-person communication!
  • Practice in front of an audience at least 3 times. Even if it is just your cat.

Visualization with HTML 5 - Dio Synodinos

In this tour of technologies for visualizing data, Dio showed everything from CSS3 to SVG, Processing and D3.js. For each of these he gave a good overview of the pros and cons and made specific animations and demos. He also mentioned pure CSS3 iOS icons. Lots of eye candy, and from reading the #QconLondon Twitter stream it seems a few people would like to try out all these frameworks and technologies.

Coffee breaks

Thankfully, there were plenty of coffee breaks at the conference. During breaks I often bumped into Sejal and Daphne, as well as other Triforkers from both our Zurich & Aarhus offices. Besides attending talks we went to a nice conference party and went out to dinner a few times. Between talks Sven and I met up and had a chat about what we saw, whilst we grabbed some delicious cookies here and there. Unfortunately the chocolate chip ones were gone most of the time!

Souvenir

At one point I took the elevator to the top floor. On my right is a large table covered with techy books. Conference goers try to walk by, but look over and can't help but gravitate to this mountain of tech information. Of course I couldn't resist either so I browsed a bit and finally bought 'Team Geek - A software developer's guide to working well with others'. Later on I visit the web development open space. I listen in on a few conversations and end up chatting with James and Kathy, the camera operators, while they are packing their stuff. They have been filming all the talks for the last three days and we talk a bit about the conference until the place closes down.

All in all QCon London 2013 was a great conference!

Introducing the elasticshell

March 6th, 2013 by
(http://blog.trifork.com/2013/03/06/introducing-the-elasticshell/)

A few days ago I released the first beta version of the elasticshell, a shell for elasticsearch. The idea I had was to create a command line tool that allows you to easily interact with elasticsearch.

Isn't elasticsearch easy enough already?
I really do think elasticsearch is already great and really easy to use. On the other hand, there are quite a lot of APIs available and quite some json involved too. Also, interacting with REST APIs requires a tool other than the browser in order to use the proper http methods and so on. There are different solutions available: some of them are generic, like curl or browser plugins, while others are elasticsearch plugins like head or sense, which you can use to send json requests and see the result, still in json format. What was missing is a command line tool, something that plays the role of the mongo shell in the elasticsearch world. That's ambitious, isn't it?

In the meantime the es2unix tool has been released by Drew, a member of the elasticsearch team. The interesting approach taken there is to hide all the json and show only text in a nice tabular format, providing an executable command that makes it possible to pipe its output to other unix commands like grep, sort and awk. That's a great idea, and an even greater result I must say.

A json friendly environment
I decided to take another approach: provide an environment that makes it easier to play around with all that json. That's why I started writing a javascript shell, where json is native and it's relatively easy to provide auto-suggestions directly within json objects. I also wanted to use the elasticsearch Java API, which is complete, performant, and powerful, even allowing you to start a new node if needed.
Read the rest of this entry »

Prepare for the Storm and be saved by Puppet!

February 28th, 2013 by
(http://blog.trifork.com/2013/02/28/prepare-for-the-storm-and-be-saved-by-puppet/)

After a highly successful edition of the GOTO Night in December with Timan Rebel and Erik Meijer, we are happy to announce the next GOTO Night that will take place on March 7 2013.

This time at the Trifork office in Amsterdam on March 7 we have two great technical presentations lined up:

  1. Within the eye of the Storm (Introduction to Storm framework) // Sjoerd Mulder from Persuasion API
  2. Using Puppet, Foreman and Git to Develop and Operate a Large Scale Internet Service (in this case eBuddy) // by Joost van de Wijgerd from eBuddy

Here is just a little taster of what they said their presentations will cover:

Sjoerd: "Curious about the Storm Framework? Storm is a distributed real-time computation system. It's new and exciting and you might have heard about it and have some questions about it. How does it work? What are its use-cases? How do I get started? What are the differences with Hadoop? How can I run it in production? How do I connect it with product XYZ? This is what I will cover. I will show you how you can get started with Storm, including some live coding, and I'll even cover how you can use Storm in production." Read more about Sjoerd & his session in detail and sign up now.

Joost: At eBuddy we are implementing DevOps and in this talk I would like to introduce our current setup. We use Foreman in combination with Puppet and a Custom Git based Configuration Management solution to manage our Infrastructure and the Services running on top of it. My talk will be centered around Foreman and Puppet and I will show how we use these tools to do deployments, scale out our clusters and configure new machines on the fly. Read more about Joost & his session in detail and sign up now.

Just before we dive into the beers, Dan Roden, our special guest from the Program Committee for GOTO Amsterdam, will present a short teaser for his sessions & track Emerging Interfaces at the GOTO Amsterdam event.

So sign up and join us on March 7 at 18.00 for great talks, free beers & pizza at Trifork.

WANT TO SPONSOR GOTO AMSTERDAM 2013?

We are already proud to have some great sponsors onboard, including 42, Appdynamics, Basho, Hippo, Neo4J & Zilverline, to name a few. If you are interested, contact me, Daphne Keislair, or visit the event website.

Summer time...

September 4th, 2012 by
(http://blog.trifork.com/2012/09/04/summer-time/)

For those of you who may have missed our newsletter last week, I'd like to take this opportunity to give you a quick lowdown of what we've been up to. The summer months have been far from quiet and I'm pretty excited to share lots of news on projects, products & upcoming events in this month’s edition.

Hippo & Orange11

The countdown has begun for the launch of The University of Amsterdam online platform. Built by Orange11 using Hippo CMS, the website was developed with multi-platform functionality in mind and is a masterpiece of technology all woven together. We’ll keep you posted about the tips & tricks we implemented.

If you can’t wait until then and want more information, contact Jan-Willem van Roekel.

Mobile Apps; just part of the service!

We mentioned in our last newsletter the launch of Learn to write with Tracy. Well, since then we’ve been working on apps for many customers, including for example The New Motion, a company dedicated to the use of Electric Vehicles. Orange11 has developed an iPhone app that allows users to view or search charging locations (in list or map form) and even check their real-time availability.

Another example is the app for GeriMedica: Ysis Mobiel, a mobile addition to their existing Electronic Health Record database used largely in geriatric care (also an Orange11 project). The mobile app supports specific work processes and allows registered users to document (in line with the strict regulations) all patient-related interaction through a simple 3-step logging process. A registration overview screen also shows the latest activities registered, which prevents co-workers from accidentally registering the same activity twice.

Visit our website for more on our mobile expertise.

Orange11 & MongoDB 

We’ve got tons of exciting things going on with MongoDB as trusted implementation partner so here are a few highlights:

Brown bag sessions

Since the launch of our brown bag sessions we’re excited that so many companies are interested in finding out more about this innovative open source document database. What we offer is a 60 minute slot with an Orange11 & MongoDB expert, who can educate & demonstrate MongoDB best practices & cover how it can be used in practice. It’s our sneak preview to you of the host of opportunities there are with MongoDB.

Sign up now!

Tech meeting / User Group Meeting

As a partner we’re also proud to host the next user group session on Thursday 6th September, where Chief Technical Director Alvin Richards will be here to cover all the product ins & outs and share some use cases.

Don’t miss out & join us. As always it’s free, with pizza and cold beer on the house!

Coffee, Cookies, Conversation & Customers

Last week we invited some of our customers to a brainstorm session around the new Cookie Law in the Netherlands. Together with Eric Verhelst, a lawyer specialized in the IT industry, Intellectual Property, Internet and Privacy we provided our customers with legal insight and discussed what their concerns & ideas were around solutions. If you have any questions around the new cookie law and are looking for advice, answers & solutions, contact Peter Meijer.

ElasticSearch has just got bigger

Congratulations to our former CEO, Steven Schuurman, who announced his new venture: ElasticSearch, the company. The company's product, "elasticsearch", is an innovative and advanced open source distributed search engine. Steven, co-founder of SpringSource (the company behind the popular Spring Framework, also close to our heart at Orange11), is joining forces with elasticsearch founder & originator Shay Banon, so it’s bound to be a great success. The company offers users and potential users of elasticsearch a definitive source for support, education and guidance with respect to developing, deploying and running elasticsearch in production environments. As Search remains a key focus area for Orange11, with our experience in both Solr and elasticsearch, our customers are guaranteed the best search solution available. For more info contact Bram Smeets.

Our team is getting bigger & better

We’re happy to welcome Michel Vermeulen to the team this month. Michel is an experienced Project Manager and will further professionalize our agile development organization. We also have new talent starting next month, BUT there is room for more.

So if you’re a developer and wanna work on great projects with a fun team (left: snapshot from our company beach event) then call Bram Smeets now.

That's all for now folks....

Apache Whirr includes Mahout support

December 22nd, 2011 by
(http://blog.trifork.com/2011/12/22/apache-whirr-includes-mahout-support/)

In a previous blog I showed you how to use Apache Whirr to launch a Hadoop cluster in order to run Mahout jobs. This blog shows you how to use the Mahout service from the brand new Whirr 0.7.0 release to automatically install Hadoop and the Mahout binary distribution on a cloud provider such as Amazon.

Introduction

If you are new to Apache Whirr, check out my previous blog, which covers Whirr 0.4.0. A lot has changed since then. After several new services, bug fixes and improvements, Whirr became a top-level Apache project, with its new version 0.7.0 released yesterday! During the last weeks I worked on an Apache Mahout service for Whirr, which is included in the latest release. (Thanks to the Whirr community and Andrei Savu in particular for reviewing the code and helping out to ship this cool feature!)

How to use the Mahout service

The Mahout service in Whirr defines the mahout-client role. This role will install the binary Mahout distribution on a given node. To use this feature, check out the sources from https://svn.apache.org/repos/asf/whirr/trunk or http://svn.apache.org/repos/asf/whirr/tags/release-0.7.0/ or clone the project with Git at http://git.apache.org/whirr.git and build it with mvn clean install. Let me walk you through an example of how to use this on Amazon AWS.

Step 1 Create a node template

Create a file called mahout-cluster.properties and add the following:

whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+mahout-client,2 hadoop-datanode+hadoop-tasktracker

whirr.provider=aws-ec2
whirr.identity=TOP_SECRET
whirr.credential=TOP_SECRET

This setup configures two Hadoop datanode / tasktrackers and one Hadoop namenode / jobtracker / mahout-client node. For the mahout-client role, Whirr will:

* Download the binary distribution from Apache and install it under /usr/local/mahout

* Set MAHOUT_HOME to /usr/local/mahout

* Add $MAHOUT_HOME/bin to the PATH
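To make the instance template format from Step 1 easier to read, here is a minimal plain-Java sketch that splits the whirr.instance-templates value into node counts and roles. It is purely illustrative: the class and method names (InstanceTemplateSketch, Template, parse) are made up for this example and this is not how Whirr itself parses the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration only, not part of Whirr.
public class InstanceTemplateSketch {

    /** One entry of whirr.instance-templates: a node count plus the roles on each node. */
    static final class Template {
        final int count;
        final List<String> roles;

        Template(int count, List<String> roles) {
            this.count = count;
            this.roles = roles;
        }
    }

    /** Splits "1 roleA+roleB,2 roleC" into templates; entries are comma separated. */
    static List<Template> parse(String value) {
        List<Template> templates = new ArrayList<>();
        for (String entry : value.split(",")) {
            // each entry looks like "<count> <role>+<role>+..."
            String[] parts = entry.trim().split("\\s+", 2);
            int count = Integer.parseInt(parts[0]);
            templates.add(new Template(count, Arrays.asList(parts[1].split("\\+"))));
        }
        return templates;
    }

    public static void main(String[] args) throws IOException {
        // same template line as in mahout-cluster.properties above
        String config =
            "whirr.instance-templates=1 hadoop-jobtracker+hadoop-namenode+mahout-client,"
            + "2 hadoop-datanode+hadoop-tasktracker\n"
            + "whirr.provider=aws-ec2\n";
        Properties props = new Properties();
        props.load(new StringReader(config));
        for (Template t : parse(props.getProperty("whirr.instance-templates"))) {
            System.out.println(t.count + " node(s) with roles " + t.roles);
        }
    }
}
```

Reading the template this way shows that mahout-client is just one more role placed on the single master node, next to the jobtracker and namenode.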

(Optional) Configure the Mahout version and / or distribution url

By default, Whirr will download the Mahout distribution from
http://archive.apache.org/dist/mahout/0.5/mahout-distribution-0.5.tar.gz
You can override the version by adding
whirr.mahout.version=VERSION

Also, you can change the download url entirely; useful if you want to test your own version of Mahout. To do so, first create a Mahout binary distribution by entering the mahout distribution folder in your checked out Mahout source tree and run

$ mvn clean install -Dskip.mahout.distribution=false

Now put the tarball on a server that will be accessible by the cluster and add the following line to your mahout-cluster.properties

whirr.mahout.tarball.url=MAHOUT_TARBALL_URL

Step 2 Launch the cluster

You can now launch the cluster the regular way by running:

$ whirr launch-cluster --config mahout-cluster.properties

Step 3 Login & run

When the cluster is set up, run the Hadoop proxy, upload some data, SSH into the node and voilà, you can run Mahout jobs by invoking the command line script like you would do normally, such as:

$ mahout seqdirectory --input input --output output

Enjoy!

Apache Lucene FlexibleScoring with IndexDocValues

November 16th, 2011 by
(http://blog.trifork.com/2011/11/16/apache-lucene-flexiblescoring-with-indexdocvalues/)

During Google Summer of Code 2011 David Nemeskey, PhD student, proposed to improve Lucene’s scoring architecture and implement some state-of-the-art ranking models with the new framework. Prior to this, and in all Lucene versions released so far, the Vector-Space Model was tightly bound into Lucene. If you found yourself in a situation where another scoring model worked better for your use case, you basically had two choices: you either overrode all existing Scorers in Queries and implemented your own model, provided you had all the statistics available, or you switched to some other search engine providing alternative models or extension points.

With Lucene 4.0 this is history! David Nemeskey and Robert Muir added an extensible API as well as index based statistics like Sum of Total Term Frequency or Sum of Document Frequency per Field to provide multiple scoring models. Lucene 4.0 comes with:

Lucene's central scoring class Similarity has been extended to return dedicated Scorers like ExactDocScorer and SloppyDocScorer to calculate the actual score. This refactoring basically moved the actual score calculation out of the QueryScorer into a Similarity to allow implementing alternative scoring within a single method. Lucene 4.0 also comes with a new SimilarityProvider which lets you define a Similarity per field. Each field could use a slightly different similarity or incorporate additional scoring factors like IndexDocValues.

Boosting Similarity with IndexDocValues

Now that we have a selection of scoring models and the freedom to extend them, we can tailor the scoring function exactly to our needs. Let's look at a specific use case: custom boosting. Imagine you indexed websites and calculated a PageRank, but Lucene's index-time boosting mechanism is not flexible enough for you. In that case you could use IndexDocValues to store the PageRank. First of all you need to get your data into Lucene, i.e. store the PageRank in an IndexDocValues field. Figure 1 shows an example.


IndexWriter writer = ...;
float pageRank = ...;
Document doc = new Document();
// add a standalone IndexDocValues field
IndexDocValuesField valuesField = new IndexDocValuesField("pageRank");
valuesField.setFloat(pageRank);
doc.add(valuesField);
doc.add(...); // add your title etc.
writer.addDocument(doc);
writer.commit();
Figure 1. Adding custom boost / score values as IndexDocValues

Once we have indexed our documents we can proceed to implement our custom Similarity to incorporate the PageRank into the document score. However, most of us won't be in a situation where we can or want to come up with an entirely new scoring model, so we are likely to use one of the scoring models already available in Lucene. But even if we are not entirely sure which one we are eventually going to use, we can already implement the PageRankSimilarity (see Figure 2).

public class PageRankSimilarity extends Similarity {

private final Similarity sim;

  public PageRankSimilarity(Similarity sim) {
    this.sim = sim; // wrap another similarity
  }

  @Override
  public ExactDocScorer exactDocScorer(Stats stats, String fieldName,
      AtomicReaderContext context) throws IOException {
    final ExactDocScorer sub = sim.exactDocScorer(stats, fieldName, context);
    // simply pull a IndexDocValues Source for the pageRank field
    final Source values = context.reader.docValues("pageRank").getSource();

    return new ExactDocScorer() {
      @Override
      public float score(int doc, int freq) {
        // multiply the pagerank into your score
        return (float) values.getFloat(doc) * sub.score(doc, freq);
      }
      @Override
      public Explanation explain(int doc, Explanation freq) {
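        // placeholder: delegate to the wrapped scorer; a complete version
        // would also describe the pageRank factor
        return sub.explain(doc, freq);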
      }
    };
  }
  @Override
  public byte computeNorm(FieldInvertState state) {
    return sim.computeNorm(state);
  }

  @Override
  public Stats computeStats(CollectionStatistics collectionStats,
                float queryBoost,TermStatistics... termStats) {
    return sim.computeStats(collectionStats, queryBoost, termStats);
  }
}
Figure 2. Custom Similarity delegate using IndexDocValues

With most calls delegated to some other Similarity of your choice, boosting documents by PageRank is as simple as it gets. All you need to do is pull a Source from the IndexReader passed in via AtomicReaderContext (atomic in this context means it is a leaf reader in the Lucene IndexReader hierarchy, also referred to as a SegmentReader). The IndexDocValues#getSource() method will load the values for this field atomically on the first request and buffer them in memory until the reader goes out of scope (or until you manually unload them; I might cover that in a different post). Make sure you don't use IndexDocValues#load(), which will pull in the values on each invocation.

Can I use this in Apache Solr?

Apache Solr already lets you define custom similarities in its schema.xml file. Inside the <type> section you can define a custom similarity per <fieldType> as shown in Figure 3 below.


<fieldType name="text" class="solr.TextField">
  <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
  <similarity class="solr.BM25SimilarityFactory">
    <float name="k1">1.2</float>
    <float name="b">0.76</float>
  </similarity>
</fieldType>
Figure 3. Using BM25 Scoring Model in Solr

Unfortunately, IndexDocValues are not yet exposed in Solr. There is an open issue aiming to add support for them, but it has seen no progress yet. If you feel like you can benefit from IndexDocValues and all its features and you want to get involved in Apache Lucene & Solr, feel free to comment on the issue. I'd be delighted to help you work towards IndexDocValues support in Solr!

What is next?

I haven't decided yet what is next in this series of posts, but it's likely to be yet another use case for IndexDocValues, like Grouping and Sorting, or a closer look at how IndexDocValues are integrated into Lucene's Flexible Indexing.

Introducing Lucene Index Doc Values

October 27th, 2011 by
(http://blog.trifork.com/2011/10/27/introducing-lucene-index-doc-values/)

From day one Apache Lucene provided a solid inverted index data structure and the ability to store text and binary chunks in stored fields. In a typical use case the inverted index is used to retrieve & score documents matching one or more terms. Once the matching documents have been scored, stored fields are loaded for the top N documents for display purposes. So far so good! However, the retrieval process is essentially limited to the information available in the inverted index, like term & document frequency, boosts and normalization factors. So what if you need custom information to score or filter documents? Stored fields are designed for bulk reads, meaning they perform best if you load all their data, while during document retrieval we need more fine-grained data.

Lucene provides a RAM-resident FieldCache that is built from the inverted index the first time the FieldCache for a specific field is requested, or during index reopen. Internally we call this process un-inverting the field, since the inverted index is a value-to-document mapping and FieldCache is a document-to-value data structure. For simplicity, think of an array indexed by Lucene's internal document IDs. When the FieldCache is loaded, Lucene iterates over all terms in a field, parses the terms' values and fills the array's slots based on the document IDs associated with each term. Figure 1 illustrates the process.

Figure 1. Uninverting a field to FieldCache
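The un-inverting step described above can be sketched in a few lines of plain Java. This is not Lucene code: the postings map, the uninvert method and the sample data are assumptions made up for illustration, and the sketch only handles a numeric field with one value per document.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative sketch only; names and data layout are hypothetical, not Lucene's.
public class UninvertSketch {

    /**
     * Turns a term-to-documents mapping (the inverted view) into a
     * document-to-value array (the un-inverted view).
     *
     * @param postings term text mapped to the IDs of the documents containing it
     * @param maxDoc   number of documents, i.e. the size of the resulting array
     */
    static long[] uninvert(SortedMap<String, int[]> postings, int maxDoc) {
        long[] values = new long[maxDoc]; // one slot per document ID
        for (Map.Entry<String, int[]> entry : postings.entrySet()) {
            long parsed = Long.parseLong(entry.getKey()); // parse the term once
            for (int docId : entry.getValue()) {
                values[docId] = parsed; // fill the slot of every matching document
            }
        }
        return values;
    }

    public static void main(String[] args) {
        SortedMap<String, int[]> postings = new TreeMap<>();
        postings.put("10", new int[] {0, 3});
        postings.put("25", new int[] {1});
        postings.put("42", new int[] {2, 4});

        long[] values = uninvert(postings, 5);
        for (int docId = 0; docId < values.length; docId++) {
            System.out.println("doc " + docId + " -> " + values[docId]);
        }
    }
}
```

The sketch makes the cost visible: every term has to be visited and parsed before a single per-document lookup can be served from the array.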

FieldCache serves its purpose very well, since accessing a value is basically a constant-time array lookup. However, there are special cases where other data structures are used in FieldCache, but those are out of scope for this post.

So if you need to score based on custom scoring factors or you need to access per-document values, FieldCache provides very efficient access to them. Yet, there is no such thing as a free lunch. Uninverting the field is very time consuming, since we first need to walk a data structure that is basically the inverse of what we need and then parse each value, which is typically a String (until Lucene 4) or a UTF-8 encoded byte array. If you are in a frequently changing environment this might turn into a serious bottleneck.

On top of that, with FieldCache you get all or nothing. You either have enough RAM to keep all your data within the Java heap space or you can't use FieldCache at all.

IndexDocValues - move FieldCache to the index

A reasonably new feature in Lucene trunk (4.0) tries to overcome the limitations of FieldCache by providing a document to value mapping built at index time. IndexDocValues allows you to do all the work during document indexing with a lot more control over your data. Each ordinary Lucene Field accepts a typed value (long, double or byte array) which is stored in a column based fashion.

This doesn't sound like a lot more control yet, right? Besides "what" is stored, IndexDocValues also exposes "how" those values are stored in the index. The main reason for exposing the internal data structure is that users usually know way more about their data, so why hide it? Lucene is a low-level library, after all. I will only scratch the surface of all the variants, so see the ValueType javadocs for details.

For integer types IndexDocValues provides byte-aligned variants for 8, 16, 32 and 64 bits as well as compressed PackedInts. For floating point values we currently only provide float 32 and float 64. For byte array values, however, IndexDocValues offers a lot of flexibility. You can specify whether your values have fixed or variable length, and whether they should be stored straight or in a dereferenced fashion, which gives good compression when the number of distinct values is low.
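One way to picture the difference between straight and dereferenced storage is the plain-Java sketch below. It is only an analogy under my own assumptions (the DerefSketch class and its fields are hypothetical and say nothing about the actual on-disk format): each distinct value is kept once, and every document only holds a small pointer to it.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; not Lucene's implementation or file format.
public class DerefSketch {
    final List<byte[]> distinct = new ArrayList<>(); // each distinct value stored once
    final int[] ordByDoc;                              // per-document pointer into 'distinct'

    DerefSketch(String[] valueByDoc) {
        Map<String, Integer> seen = new HashMap<>();
        ordByDoc = new int[valueByDoc.length];
        for (int docId = 0; docId < valueByDoc.length; docId++) {
            String value = valueByDoc[docId];
            Integer ord = seen.get(value);
            if (ord == null) {
                // first time we see this value: store its bytes and remember the slot
                ord = distinct.size();
                seen.put(value, ord);
                distinct.add(value.getBytes(StandardCharsets.UTF_8));
            }
            ordByDoc[docId] = ord;
        }
    }

    /** Looks up the value of a document by following its pointer. */
    byte[] get(int docId) {
        return distinct.get(ordByDoc[docId]);
    }

    public static void main(String[] args) {
        String[] countryByDoc = {"nl", "de", "nl", "nl", "de", "dk"};
        DerefSketch deref = new DerefSketch(countryByDoc);
        System.out.println("documents: " + countryByDoc.length
            + ", distinct values stored: " + deref.distinct.size());
        System.out.println("doc 3 -> " + new String(deref.get(3), StandardCharsets.UTF_8));
    }
}
```

With straight storage every document would carry its own copy of the bytes; the pointer indirection pays off when many documents share few distinct values.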

Cool, the first limitation of FieldCache is fixed! There is no need to either un-invert or parse the values from the inverted index. So loading should be pretty fast, and it actually is. I ran several benchmarks loading IndexDocValues for float 32 variants vs. loading FieldCache from the same data, and the results are compelling: IndexDocValues loads 80 to 100 times faster than building a FieldCache.

So let's look at the second limitation: all or nothing. FieldCache is entirely RAM resident, which might not be possible or desired in certain scenarios, but since we need to un-invert there is not much of a choice.

IndexDocValues provides the best of both worlds, whatever the user's choice is at runtime. The API provides a very simple interface called Source, which can either be entirely RAM resident (a single shared instance per field and segment) or disk resident (an instance per thread). With both the RAM-resident and the on-disk variant, the same random-access interface is used to retrieve a single value per field for a given document ID.

The performance-conscious among you might ask whether the lookup performance is comparable to FieldCache, since we now need an additional method call per value lookup vs. a single array lookup. The answer is: "it depends"! For instance, if you choose to use PackedInts to compress your integers you certainly pay a price, but if you choose a 32-bit aligned variant you can actually access the underlying array via Source#getArray(), just like with Java NIO ByteBuffers.
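To give a feeling for the price you pay with packed integers, here is a small self-contained bit-packing sketch. It is not Lucene's PackedInts implementation; the PackedSketch class, its 1 to 32 bit limit and the sample values are assumptions chosen to keep the example short. Each lookup needs shifts and masks, and sometimes a second array read, where an aligned variant needs a single array access.

```java
// Illustrative sketch only; not Lucene's PackedInts.
public class PackedSketch {
    private final long[] blocks;     // values are packed back to back into 64-bit blocks
    private final int bitsPerValue;
    private final long mask;

    PackedSketch(int valueCount, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 32) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..32");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        long totalBits = (long) valueCount * bitsPerValue;
        this.blocks = new long[(int) ((totalBits + 63) / 64)];
    }

    void set(int index, long value) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);   // which 64-bit block the value starts in
        int shift = (int) (bitPos & 63);    // bit offset inside that block
        blocks[block] = (blocks[block] & ~(mask << shift)) | ((value & mask) << shift);
        int spill = shift + bitsPerValue - 64; // bits that do not fit in this block
        if (spill > 0) {
            int lowBits = bitsPerValue - spill;
            long highMask = (1L << spill) - 1;
            blocks[block + 1] = (blocks[block + 1] & ~highMask) | ((value & mask) >>> lowBits);
        }
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        int spill = shift + bitsPerValue - 64;
        if (spill > 0) {
            // the value straddles two blocks: fetch the remaining high bits
            value |= blocks[block + 1] << (bitsPerValue - spill);
        }
        return value & mask;
    }

    int blockCount() {
        return blocks.length;
    }

    public static void main(String[] args) {
        PackedSketch packed = new PackedSketch(10, 12); // 10 values, 12 bits each
        for (int i = 0; i < 10; i++) {
            packed.set(i, i * 400L + 7);
        }
        for (int i = 0; i < 10; i++) {
            System.out.println(i + " -> " + packed.get(i));
        }
        System.out.println("packed bytes: " + packed.blockCount() * 8
            + ", plain int[] bytes: " + 10 * 4);
    }
}
```

The trade-off is the usual one: the packed form saves memory when values need far fewer bits than their aligned type, at the cost of extra work per lookup.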

You want it sorted?

By default Lucene returns search results sorted by the individual document score. However, one of the most commonly used features (especially in the Enterprise sector) is sorting by individual fields. This is yet another use case where FieldCache is used for performance reasons. FieldCache can load the already sorted values from the terms dictionary, providing lookup by key & ordinal. IndexDocValues provides the same functionality for fixed and variable length byte[] variants through a SortedSource interface. Figure 2 illustrates obtaining a SortedSource instance.

PerDocValues perDocValues = reader.perDocValues();
IndexDocValues docValues = perDocValues.docValues("sortField");
// Source source = docValues.getDirectSource() for disk-resident version
Source source = docValues.getSource(); 
SortedSource sortedSource = source.asSortedSource();
BytesRef spare = new BytesRef();
sortedSource.getByOrd(2, spare);
int compare = spare.compareTo(new BytesRef("fooBar"));
Figure 2. Obtaining a SortedSource from a loaded Source instance

Recent benchmarks using SortedSource have shown equal performance to FieldCache, and even with on-disk versions the performance hit is between 30% and 50%. These properties can be extremely helpful, especially for users with a large number of fields they need to sort on.

Wrapping up...

This introduction is the first post in a series of posts on IndexDocValues. In the upcoming weeks we are going to publish more detailed posts on how to use IndexDocValues for Sorting and Result Grouping. Since Lucene 4 also provides Flexible Scoring, using IndexDocValues with Lucene's new Similarity deserves yet another post.

For the folks more interested in the technical background who want to extend IndexDocValues, understand how the internal type promotion works or even write their own low-level implementation, I'm planning to publish a low-level codec post too. So stay tuned.

Axon Framework 1.0, first release candidate available

February 16th, 2011 by
(http://blog.trifork.com/2011/02/16/axon-framework-1-0-first-release-candidate-available/)

The Axon Framework 1.0 release is closing in. After over a year of development, all features planned for the 1.0 version are included. With the latest added features, Axon has become a powerful framework that helps developers implement applications based on a CQRS architecture.

Although the 1.0-rc1 version doesn’t add a lot of new features to the previous release (0.7), it does represent a major milestone in Axon’s lifecycle. If Axon continues to prove it works as expected in production environments, the final 1.0 release can be expected before summer. Meanwhile, development will start on the remote messaging components required for scalability in larger systems.

Read the rest of this entry »

Spatial Solr Plugin 2.0-RC4 Released

February 11th, 2011 by
(http://blog.trifork.com/2011/02/11/spatial-solr-plugin-2-0-rc4-released/)

Having worked with a number of SSP users over the past few weeks, we are pleased to announce that SSP 2.0-RC4 has been released. This bug fix release addresses a number of problems identified by our users:

  • GeoDistanceComponent opening IndexReaders and not closing them
  • Solr returning all fields of documents rather than just those requested
  • Invalid bounding box range queries being created when the boxes crossed a meridian

Note, as part of the improvements to the GeoDistanceComponent, it is now necessary to configure the component as well. Please see the plugin's documentation for more information.

We recommend that all SSP 2.0 users upgrade to this new release.

Again, I'd like to thank our users for helping us identify and resolve these issues.

Mahout at FOSDEM 2011 DataDevRoom

February 10th, 2011 by
(http://blog.trifork.com/2011/02/10/mahout-at-fosdem-2011-datadevroom/)

Last Saturday, February 5th, FOSDEM 2011 hosted the DataDevRoom, where talks were given on topics surrounding data analysis with free and open source software. I was there and gave an introductory talk on clustering with Apache Mahout. In case you missed the conference, read on to learn about some of the talks or check out the slides or demo code from my Mahout talk.

Read the rest of this entry »