Over the past few years, organizations such as the Dutch Royal Library and the Dutch Institute for War Documentation (NIOD) have digitized a great deal of information about World War 2 into archives. How do you make this information useful and accessible for research? The Dutch newspaper NRC posed this challenge in an article on September 10, 2011 (http://www.nrc.nl/nieuws/2011/09/10/nederlandse-digitale-archieven-blijken-nauwelijks-bruikbaar/), and it is the challenge we faced here at Dutchworks while developing the website www.oorlogsbronnen.nl for NIOD. The solution we built uses a standard metadata format, standards for data harvesting, and an amazingly fast search engine with a powerful query language.
A standard metadata format
The data that is searchable on www.oorlogsbronnen.nl is called metadata. It's the data that describes the source objects: a title, a description, temporal and spatial indications (When was this picture taken? Where did this story take place?), labels, and so on.
Combining metadata from different archives into a single metadata repository such as the Netwerk Oorlogsbronnen requires a standard metadata representation; otherwise searching through this data is like comparing apples and oranges: how can I search for a document with "resistance" in the title unless the title is the same field in each archive my documents come from?
The most widely used standard for storing this kind of metadata is Dublin Core, which defines standard fields such as title, subject, and description and dictates what they should contain. Luckily, all of the digital information sources used for Netwerk Oorlogsbronnen already store their metadata as Dublin Core, so this was the easy part for us.
A document's metadata stored as Dublin Core might look like this when represented as XML (example from Beeldbank WO2):
<record
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:identifier>
http://www.bbwo2.nl/detail_no.jsp?action=detail&amp;imid=85893
</dc:identifier>
<dc:source>NIOD</dc:source>
<dc:description>
Al vanaf 1941 werden wapens ook gebruikt voor de liquidatie van
voor het verzet gevaarlijke collaborateurs. Sommigen van hen
waren erin geslaagd in illegale organisaties te infiltreren met
het doel zoveel mogelijk verzetsdeelnemers in handen van de
bezetter te spelen. In de eerste helft van 1943 leidde de
liquidatie van enkele collaborateurs door gewapende verzetsmensen
tot onenigheid binnen het verzet. Aanslag door het verzet op de
commandant der Ornungspolizei Jac. Chr. Tetenburg op 31 maart
1945 om 11.15 uur voor het politiebureau aan de Hoflaan te
Rotterdam.
</dc:description>
<dc:date>31-03-1945 (Opname)</dc:date>
<dc:subject>
Aanslagen - Zie ook: Moordaanslagen, Represailles
</dc:subject>
<dc:subject>Georganiseerd verzet</dc:subject>
<dc:subject>Lijken</dc:subject>
<dc:subject>
Verzet - Zie ook: Widerstand, Georganiseerd verzet, Illegaliteit
</dc:subject>
<dc:subject>Tetenburg, Maj.</dc:subject>
<dc:coverage>Nederland</dc:coverage>
<dc:coverage>Rotterdam</dc:coverage>
<dc:type>IMAGE</dc:type>
<dc:relation>http://.../85893-thumb.jpg?frskey=85893</dc:relation>
<dcterm:provenance xmlns:dcterm="http://purl.org/dc/terms/">
BBWO2
</dcterm:provenance>
</record>
For non-Dutch speakers: excuse the Dutch sample text above. You didn't miss any content!
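To make the format concrete, a record like the one above can be read into a simple field map with Python's standard library. This is only an illustrative sketch, not the code behind Netwerk Oorlogsbronnen: the function name `parse_record` and the shortened `SAMPLE` record are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

DC_NS = "http://purl.org/dc/elements/1.1/"
DCTERMS_NS = "http://purl.org/dc/terms/"


def parse_record(xml_text):
    """Collect Dublin Core fields into a dict of lists (fields may repeat)."""
    root = ET.fromstring(xml_text)
    fields = defaultdict(list)
    for child in root:
        # ElementTree spells qualified names as "{namespace}localname".
        if not child.tag.startswith("{"):
            continue
        namespace, name = child.tag[1:].split("}", 1)
        if namespace not in (DC_NS, DCTERMS_NS):
            continue
        # Collapse the line breaks and indentation left by pretty-printing.
        value = " ".join((child.text or "").split())
        if value:
            fields[name].append(value)
    return dict(fields)


# Shortened, hypothetical variant of the record shown above.
SAMPLE = """<record xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:source>NIOD</dc:source>
  <dc:subject>Georganiseerd verzet</dc:subject>
  <dc:subject>Lijken</dc:subject>
  <dc:coverage>
    Rotterdam
  </dc:coverage>
  <dcterm:provenance xmlns:dcterm="http://purl.org/dc/terms/">
    BBWO2
  </dcterm:provenance>
</record>"""

record = parse_record(SAMPLE)
print(record["subject"])
```

Because every archive uses the same field names, the same few lines work for any provider's records, which is exactly what makes a combined search possible.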
Let's continue with the difficult part: how do we get the data out of the various archives into our Netwerk Oorlogsbronnen metadata repository?
Standards for data harvesting
Retrieving metadata from a metadata repository is called "harvesting" in metadata terminology. The Open Archives Initiative (OAI) has defined a standard protocol for metadata harvesting over HTTP, very originally named PMH ("Protocol for Metadata Harvesting"). Dutchworks has built an OAI-PMH harvester that retrieves documents from an archive's metadata repository and copies them into our own.
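For readers unfamiliar with OAI-PMH, the core of any harvester is a paging loop: request `ListRecords`, then keep following the `resumptionToken` the server returns until it comes back empty. The sketch below shows that loop in Python. It is not the Dutchworks harvester; the endpoint `http://archive.example/oai`, the function names, and the canned `PAGES` responses are assumptions made so the example runs without a network connection.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "http://www.openarchives.org/OAI/2.0/"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for the first or a follow-up page."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it replaces metadataPrefix.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumption token of a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{{{OAI_NS}}}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()


def harvest(base_url, fetch):
    """Yield each response page; fetch is any callable mapping URL to XML text."""
    token = None
    while True:
        page = fetch(list_records_url(base_url, token))
        yield page
        token = next_token(page)
        if token is None:
            break


# Canned two-page response standing in for a real (hypothetical) archive.
PAGES = {
    "http://archive.example/oai?verb=ListRecords&metadataPrefix=oai_dc":
        f'<OAI-PMH xmlns="{OAI_NS}"><ListRecords>'
        "<resumptionToken>page2</resumptionToken></ListRecords></OAI-PMH>",
    "http://archive.example/oai?verb=ListRecords&resumptionToken=page2":
        f'<OAI-PMH xmlns="{OAI_NS}"><ListRecords>'
        "<resumptionToken/></ListRecords></OAI-PMH>",
}

pages = list(harvest("http://archive.example/oai", PAGES.__getitem__))
print(len(pages))
```

Passing `fetch` in as a parameter keeps the paging logic separate from the HTTP client, so the loop can be tested against canned responses like these.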
However, some digital archive systems are slow to adopt OAI-PMH, so we had to offer other options as well. In fact, most of our harvesting currently goes through the search interfaces of the various digital archives: we run a search that returns every single document in the archive and collect the metadata from the results. Luckily, the standard REST API for metadata searching (SRU, "Search/Retrieval via URL", developed by the United States Library of Congress) is much more widely adopted than OAI-PMH among the Dutch archives we use, and with these two options we can harvest the majority of archives.
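Harvesting through a search interface boils down to paging through one very broad query. The following Python sketch generates the SRU `searchRetrieve` URLs for such a crawl. Several things here are assumptions rather than facts about any particular archive: the endpoint `http://archive.example/sru`, the match-everything CQL query `cql.allRecords=1` (not every server supports it), the `dc` record schema name, and the page size.

```python
from urllib.parse import urlencode


def sru_page_urls(base_url, total_records, page_size=100,
                  query="cql.allRecords=1"):
    """Yield one searchRetrieve URL per page of results."""
    # SRU record positions are 1-based, so the first page starts at 1.
    for start in range(1, total_records + 1, page_size):
        params = {
            "operation": "searchRetrieve",
            "version": "1.2",
            "query": query,
            "startRecord": start,
            "maximumRecords": page_size,
            "recordSchema": "dc",  # assumed short name for Dublin Core
        }
        yield base_url + "?" + urlencode(params)


# 250 records at 100 per page gives three requests.
urls = list(sru_page_urls("http://archive.example/sru", total_records=250))
for url in urls:
    print(url)
```

In practice the total comes from the `numberOfRecords` element of the first response, so a real harvester would fetch page one before it knows how many more pages to request.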
For the remaining archives that support neither OAI-PMH nor SRU, we build custom file import solutions. Hopefully at some point these archives will adopt the standards and we can replace these custom solutions with our OAI-PMH or SRU harvester.
Search engine
Now that we've got all these documents in one place, we want to be able to search through them to find the information we are looking for. Apache Solr, an open source search engine, provides this functionality, and at an amazing speed.
Here's how we do it: we store each document in a Solr index. On top of this we have a REST-based search interface that supports keyword search with operators like OR, AND, and NOT, as well as term proximity. Besides keyword search, we leverage Solr's faceting feature to break down results by data provider, collection, type (image, video, or text), and date. Because each facet shows only values that actually occur in the current results, the user can easily narrow down their search without ever running into a "No results found" dead end.
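A faceted Solr query of this kind is just a set of request parameters on the `/select` handler. The Python sketch below builds such a request. The parameters (`q`, `fq`, `facet`, `facet.field`, `facet.mincount`, `rows`, `wt`) are standard Solr parameters, but the core URL and the field names `provider`, `collection`, and `type` are hypothetical, and this is not the actual oorlogsbronnen.nl search API.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical index fields to break results down by.
FACET_FIELDS = ["provider", "collection", "type"]


def solr_select_url(solr_core_url, keywords, filters=None, rows=10):
    """Build a Solr /select URL with keyword query, filters, and facets."""
    params = [
        ("q", keywords),
        ("rows", rows),
        ("wt", "json"),
        ("facet", "true"),
        # Hide facet values with zero hits so every link leads to results.
        ("facet.mincount", 1),
    ]
    for field in FACET_FIELDS:
        params.append(("facet.field", field))
    for field, value in (filters or {}).items():
        # Filter queries narrow the result set; values containing quotes
        # would need escaping, which this sketch leaves out.
        params.append(("fq", f'{field}:"{value}"'))
    return solr_core_url + "/select?" + urlencode(params)


# Boolean operator plus a proximity search: both words within 10 positions.
url = solr_select_url(
    "http://localhost:8983/solr/records",
    'verzet AND "aanslag rotterdam"~10',
    filters={"type": "IMAGE"},
)
decoded = parse_qs(url.split("?", 1)[1])
print(decoded["facet.field"])
print(decoded["fq"])
```

When the user clicks a facet value, the front-end only has to add one more `fq` filter and repeat the request.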
Zeezeilen & Overboord, our partners in the design and development of the PHP front-end of oorlogsbronnen.nl, use this REST API to search and retrieve documents from the repository and display them in a user-friendly layout. This gives us a clear separation of search logic and view logic, which makes for a very maintainable system.

[Figure: High-level architecture]
Where do we go from here?
The current Netwerk Oorlogsbronnen website is a solid foundation on which to build new features, and NIOD already has many requests on its wish list: adding more archives, making digital documents easier to find, highlighting search results, searching within a time period, and much more. Dutchworks is excited to be involved in the next phase of this project and to continue making our Dutch heritage digitally accessible to the public.
For now, check it out for yourself and find out more about World War 2: www.oorlogsbronnen.nl