One of our recently concluded projects was about building a generic platform for managing multiple B2B online marketplaces catering to different customers within the biotechnology space. An interesting problem we faced when building the “Search”  functionality for this platform was how we would deal with the massive volumes of products and product specifications that one had to “intelligently”  sift through, when rendering a search result to the user.

The platform supported semantically different content types like Products and Articles, and moreover, each product could have a large number of specifications tagged to it. The search functionality required was feature-rich, where the results had to be prioritized based on weights attached to different product specification types (color, weight etc), product localization and other configurable site-specific parameters. This resulted in us needing significantly complex SQL queries to generate a search result. In addition, we were dealing with large volumes of data; as much as 2.8 million products in a single online marketplace, mapped to over 40 million specification records in the MS SQL Server database. An initial proof of concept using purely SQL queries proved futile, where it took us around 20 seconds to render a properly weighted search result, in spite of the ample allocation of hardware resources within the hosting environment. We had to go back to the drawing-board and rethink our Search architecture in order to improve its performance.

We then did some research on ElasticSearch, a schema free, document oriented search solution that could be hosted on a cloud environment for scalability. This got us thinking along the lines of a hybrid architecture, where we would distribute the processing of the vast data volume of the search on a cloud deployment of ElasticSearch, whilst running the complex search queries for weighting the search results within the usual dedicated MS SQL Server environment. The final solution was a two-piece affair as depicted in the (simplified) conceptual diagram below.

High-performance search

When a user types a search string on a given product marketplace website and hits the “Search”  button, a two step process is invoked. There is a summarized database of products and essential specifications that is deployed on a cluster of ElasticSearch nodes on the Amazon EC2 cloud, which is updated on a daily basis (see “Daily Update Task” ). This full text search server is initially queried for the search string via the ElasticSearch web service API, and a “shortlist”  of product records are sent to the DAL of the marketplace app as a JSON string, in double-quick time. The details of the search results –  i.e. all the product attributes and the application of specification-based weighting rules –  are generated via regular SQL queries. These SQL queries are run on the MS SQL Server database for the shortlisted search results recordset provided by ElasticSearch.

The actual implementation itself was a learning experience for us. For example, we initially wired the Web App, which was hosted in a data center in San Francisco, with an ElasticSearch solution hosted on Amazon’s cloud servers in the East Coast of the United States. This introduced significant network latency, which was greatly reduced by moving our ElasticSearch solution to the West Coast. We also realized that we could break up the summary recordset returned by ElasticSearch based on pagination criteria, and query the details from the local MS SQL Server to render page-wise search results. This reduced database processing time significantly. Several other minor tweaks were done to the detailed design, such as transferring the complexity of certain bitwise operations to determine localization based priority for products, to the server side SQL queries. All these improvements were based on a thorough performance testing of the Search implementation.

The final SQL-ElasticSearch hybrid solution immensely reduced the waiting time for a given search query, from as much as 20 seconds duration in the pure SQL solution, to under two seconds duration in the SQL-ElasticSearch hybrid solution. This performance improvement was immensely appreciated by our client and the end users of their online B2B marketplaces.