Index and Search structured XML documents using Apache Solr

Apache Solr is a highly scalable search engine with many built-in features. In this guide we will learn how structured XML data can be indexed and searched effectively.

We will learn the following concepts:

  1. Starting up Apache Solr
  2. Importing structured XML document for indexing in Apache Solr

Tools and libraries used in this project:

  1. Apache Solr 5.3.0
  2. Java 8
  3. Mac OS X

Downloading & Starting Apache Solr

Download Apache Solr Binary Distribution

We can download the latest version of Apache Solr from the official website. Clicking the main or a mirror download link opens a page like this:

Apache Solr download page: package to choose

Tip: Apache Solr downloadable package size is around 130 MB. Make sure you have this much bandwidth left on your internet connection.

Unpack Apache Solr Binary Download Zip

When we unpack Apache Solr Binary Download Zip we see the following files and folders inside the main folder:

Apache Solr 5.3 binary folder structure

Starting, Stopping, and Restarting Apache Solr

Starting Apache Solr Server

$ cd /Volumes/Drive2/App/solr-5.3.0/

Start Solr Server

$ bin/solr start

Apache Solr is now running at http://localhost:8983/solr (a default installation serves plain HTTP, not HTTPS).

Stopping Apache Solr

$ cd /Volumes/Drive2/App/solr-5.3.0/

Stop Solr

$ bin/solr stop -p 8983

Restarting Apache Solr

$ cd /Volumes/Drive2/App/solr-5.3.0/

Restart Solr

$ bin/solr restart -p 8983

Note: Replace the Solr folder path with your installation path

Let's create a core (or Collection) "xmlhub"

$ bin/solr create -c xmlhub

Setup new core instance directory: /Volumes/Drive2/App/solr-5.3.0/server/solr/xmlhub

Creating new core 'xmlhub' using the command: http://localhost:8983/solr/admin/cores?action=CREATE&name=xmlhub&instanceDir=xmlhub

{
  "responseHeader":{
    "status":0,
    "QTime":874},
  "core":"xmlhub"}

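
To confirm that the core was created, one option is to ask Solr's CoreAdmin API for its status. The following is a minimal sketch, assuming Solr is reachable over plain HTTP at the default `localhost:8983`; the helper names (`core_status_url`, `core_exists`, `fetch_core_status`) are made up for this illustration and are not part of Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: default host and port of a local Solr installation.
SOLR_URL = "http://localhost:8983/solr"


def core_status_url(core, base_url=SOLR_URL):
    """Build the CoreAdmin STATUS URL for a single core."""
    query = urlencode({"action": "STATUS", "core": core, "wt": "json"})
    return f"{base_url}/admin/cores?{query}"


def core_exists(status_response, core):
    """Return True when the STATUS response has a non-empty entry for the core."""
    return bool(status_response.get("status", {}).get(core))


def fetch_core_status(core, base_url=SOLR_URL):
    """Call Solr and parse the JSON reply (requires a running server)."""
    with urlopen(core_status_url(core, base_url)) as response:
        return json.load(response)
```

With Solr running, `core_exists(fetch_core_status("xmlhub"), "xmlhub")` should be true once the `bin/solr create` command above has succeeded.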
## Indexing XML files

### Sample XML File

We will be indexing XML files kept in a folder (in our setup it is at _<solr\_installation\_root\_dir>/example-data_). An example of the content of one XML file:

**File: example1.xml**
```xml
<?xml version="1.0" encoding="UTF-8"?>
<ele xmlns:dc="https://purl.org/dc/elements/1.1/">
<attr1>
  Attr1 Value 1
</attr1>
<attr2>
  Attr2 Value 1
</attr2>
<meta property="meta1">
  Meta 1 Val 1
</meta>
<meta property="meta2">
  Meta 2 Val 1
</meta>
<meta name="name1">
  Name 1 value 1
</meta>
<meta name="name2">
  Name 2 value 1
</meta>
</ele>
```

Uploading XML structured data for Indexing using Data Import Handler

Step 1: Configure solrconfig.xml

We will find the solrconfig.xml file in <solr_installation_root_dir>/server/solr/<core_name>/conf (for our core, that is server/solr/xmlhub/conf).

 

Location of solrconfig.xml in Solr 5.3 installation

File: solrconfig.xml

....
<lib dir="${solr.install.dir:../../../..}/dist/" regex="solr-dataimporthandler-.*\.jar" />
....
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
      <str name="config">xmlhubconfig.xml</str>
    </lst>
</requestHandler>
...

We can place this code in solrconfig.xml

Step 2: Create Data Import configuration

We could put the data import configuration directly in solrconfig.xml, but we chose to keep it in an external file, xmlhubconfig.xml, placed in the same conf folder.

File: xmlhubconfig.xml

<dataConfig>
  <dataSource type="FileDataSource"/>
  <document>
    <!-- this outer processor generates a list of files satisfying the conditions specified in the attributes -->
    <entity name="f" processor="FileListEntityProcessor" fileName=".*\.xml$" recursive="true" rootEntity="false" dataSource="null" baseDir="/Volumes/Drive2/App/solr-5.3.0/example-data">

      <!-- this processor extracts content using XPath from each file found -->

      <entity name="nested" processor="XPathEntityProcessor" forEach="/ele | /metadata" url="${f.fileAbsolutePath}" >
              <field column="attr1_s" xpath="/ele/attr1"/>
              <field column="attr2_s" xpath="/ele/attr2"/>
              <field column="meta1_s" xpath="/ele/meta[@property='meta1']"/>
              <field column="meta2_s" xpath="/ele/meta[@property='meta2']"/>
              <field column="name1_s" xpath="/ele/meta[@name='name1']"/>
              <field column="name2_s" xpath="/ele/meta[@name='name2']"/>
      </entity>
    </entity>
  </document>
</dataConfig>

This configuration is specific to the XML file structure. Pay attention to how XPath is used: each field element maps one XPath expression to one Solr field. You should also replace baseDir with your own path.
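
Before running an import, it can help to check the XPath expressions against a sample file. The sketch below is only an approximation: it uses Python's `xml.etree.ElementTree`, whose XPath subset is not the same implementation as Solr's XPathEntityProcessor, and it writes the paths relative to the `ele` root. The sample is a condensed copy of example1.xml, the helper name `preview_fields` is invented for this illustration, and values are stripped of surrounding whitespace for readability.

```python
import xml.etree.ElementTree as ET

# Condensed copy of example1.xml (XML declaration and namespace omitted).
SAMPLE_XML = """
<ele>
<attr1>
  Attr1 Value 1
</attr1>
<attr2>
  Attr2 Value 1
</attr2>
<meta property="meta1">
  Meta 1 Val 1
</meta>
<meta property="meta2">
  Meta 2 Val 1
</meta>
<meta name="name1">
  Name 1 value 1
</meta>
<meta name="name2">
  Name 2 value 1
</meta>
</ele>
"""

# Solr field name -> path relative to the <ele> root element.
FIELD_XPATHS = {
    "attr1_s": "attr1",
    "attr2_s": "attr2",
    "meta1_s": "meta[@property='meta1']",
    "meta2_s": "meta[@property='meta2']",
    "name1_s": "meta[@name='name1']",
    "name2_s": "meta[@name='name2']",
}


def preview_fields(xml_text):
    """Return the field values the mapping would pick from one document."""
    root = ET.fromstring(xml_text.strip())
    doc = {}
    for field, path in FIELD_XPATHS.items():
        node = root.find(path)
        if node is not None and node.text:
            doc[field] = node.text.strip()
    return doc
```

Calling `preview_fields(SAMPLE_XML)` returns a dictionary with one entry per configured field, which makes a mistyped attribute name or path easy to spot because its field is simply missing.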

Step 3: Configure to generate unique id automatically

In solrconfig.xml we use an updateRequestProcessorChain to set up UUIDUpdateProcessorFactory, which generates a UUID for the id field of every document that does not already have one.

File: solrconfig.xml

...
<updateRequestProcessorChain>
      <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">id</str>
      </processor>
      <processor class="solr.LogUpdateProcessorFactory" />
      <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
...
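
As a rough mental model of what the UUID processor does for each incoming document, consider the following sketch. It is an analogy in Python, not Solr code, and the function name `add_missing_id` is hypothetical: a document that already carries an id is left alone, and any other document gets a freshly generated random UUID.

```python
import uuid


def add_missing_id(doc, field_name="id"):
    """Return a copy of doc with a random UUID in field_name if it had none."""
    if doc.get(field_name):
        return dict(doc)
    return {**doc, field_name: str(uuid.uuid4())}
```

Because our XML files contain no id element, every imported document ends up with a generated id. Re-importing the same file therefore creates a document with a new id rather than overwriting the old one, unless the index is cleared first.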

Index File

We should restart Apache Solr.

Go to http://localhost:8983/solr/#/xmlhub/dataimport//dataimport:

Execute dataimport handler to index xml

It will index the XML files and create documents. You can browse the documents at http://localhost:8983/solr/xmlhub/browse.
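
The import can also be started without the admin UI by calling the /dataimport request handler over HTTP. The following is a minimal sketch, assuming the handler name and core from this guide and a Solr server on plain HTTP at the default port; `dataimport_url` and `run_dataimport` are illustrative helper names, not part of Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: default host and port of a local Solr installation.
SOLR_URL = "http://localhost:8983/solr"


def dataimport_url(core, command, base_url=SOLR_URL, **params):
    """Build a Data Import Handler URL, e.g. for full-import or status."""
    query = urlencode({"command": command, "wt": "json", **params})
    return f"{base_url}/{core}/dataimport?{query}"


def run_dataimport(core, command, **params):
    """Send the command to Solr and parse the JSON reply (needs a running server)."""
    with urlopen(dataimport_url(core, command, **params)) as response:
        return json.load(response)
```

For example, `run_dataimport("xmlhub", "full-import", clean="true", commit="true")` starts an import, and `run_dataimport("xmlhub", "status")` reports its progress.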

Browse indexed documents in built-in Solr collection browser

Using the built-in collection browser we can search the indexed documents. Learn more about the Solr query syntax in the official documentation. Apache Solr also provides an HTTP API that exposes the same search features to client applications.
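
As a small illustration of that API, the sketch below builds a request for the standard /select handler of the xmlhub core and pulls the documents out of a JSON response. It assumes Solr on plain HTTP at the default port; the helper names (`select_url`, `docs_from_response`, `search`) are invented for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: default host and port of a local Solr installation.
SOLR_URL = "http://localhost:8983/solr"


def select_url(core, query, rows=10, base_url=SOLR_URL):
    """Build a /select URL that asks for a JSON response."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def docs_from_response(payload):
    """Extract the list of matching documents from a parsed /select reply."""
    return payload.get("response", {}).get("docs", [])


def search(core, query, rows=10):
    """Run the query against Solr (needs a running server)."""
    with urlopen(select_url(core, query, rows)) as response:
        return docs_from_response(json.load(response))
```

For instance, `search("xmlhub", "*:*")` would return up to ten of the imported documents, each as a dictionary containing the generated id and the `_s` fields defined in xmlhubconfig.xml.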

References

  1. Learn about the Apache Solr Query Syntax
  2. Apache Solr Data Import Handler Documentation
  3. Apache Solr Site
