An Organized and Indexed Data Lake

The data lake is a relatively new system design pattern.  By serving as a cost-effective landing zone and bulk processing engine for all kinds of data, a data lake should significantly improve how efficiently and effectively an organization can put its data to use.  In many deployments, that impact has not been realized.  Frequently the reason is that data gets “lost” in the lake.  “Throw your data in the lake and watch it sink to the bottom” is a common critique.  When we started Koverse, this issue was painfully obvious to us.  A data lake without proper organization and indexing is not very useful.

Data lakes without a full data organization and indexing scheme suffer from several maladies that limit their usefulness:

  1. They can only serve data to a handful of users at once.
  2. Data is hard to find and extremely hard to search.
  3. Applications that want to leverage data in the lake frequently have to cache the data needed outside of the lake to achieve the required application performance.

At Koverse, we built a novel indexing and data organization approach into our data lake.  It makes data very easy to find and search and can serve thousands of users across dozens of applications, and it has the added benefit of providing a compelling alternative to traditional enterprise search and data warehousing.

At the heart of the Koverse indexing and organization approach is our Universal Indexing Engine.  It allows a single Koverse data lake to ingest data from hundreds of different sources—each with a different structure—and automatically generates a massively scalable index that can serve thousands of searches per second across all of the datasets.
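To make the idea of one index over differently structured sources concrete, here is a minimal sketch of a schema-agnostic inverted index. It is not the Universal Indexing Engine or any Koverse API; the function names (`build_index`, `search`) and the sample datasets are assumptions made up for illustration, and a real system would distribute the index across many machines.

```python
from collections import defaultdict

def tokenize(value):
    """Lowercase a value and split it into whitespace-separated terms."""
    return str(value).lower().split()

def build_index(datasets):
    """Map each term to the (dataset, record number, field) places it occurs."""
    index = defaultdict(set)
    for dataset_name, records in datasets.items():
        for record_no, record in enumerate(records):
            # No schema is assumed: every field of every record is indexed.
            for field, value in record.items():
                for term in tokenize(value):
                    index[term].add((dataset_name, record_no, field))
    return index

def search(index, query):
    """Return the places where every term of the query occurs in one field."""
    postings = [index.get(term, set()) for term in tokenize(query)]
    return set.intersection(*postings) if postings else set()

# Hypothetical sample data: two datasets with unrelated structures.
datasets = {
    "it_incidents": [
        {"ticket": "INC-1", "manager": "Lillian Henson", "summary": "VPN outage"},
        {"ticket": "INC-2", "manager": "Omar Reyes", "summary": "Disk full"},
    ],
    "employees": [
        {"name": "Lillian Henson", "office": "Seattle"},
    ],
}

index = build_index(datasets)
hits = search(index, "Lillian Henson")
print(sorted(hits))
```

Because the index is keyed by term rather than by dataset or column, the caller does not need to know which dataset or field holds the value before searching.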

All of this is handled through a single Koverse interface or API call. The value of this in a data-driven business environment is enormous.

For the first time, anyone with proper permissions can quickly get to the data they need without having to know which dataset to search or having specific prior knowledge of the structure of data they are searching.

The Koverse search process is simple:

  1. Easily search all data for a topic, name, string, ID, or any other point of interest. You can see all the datasets that contain relevant information.
  2. Home in on a specific discovered dataset.
  3. Narrow the search to a specific field or set of fields.
  4. Extract key data for use.

See the embedded video where we walk through an example of how this search works using data loaded in

First, you can search for a term or concept in Koverse and immediately see all of the specific datasets and records associated with that term.  In this example, you can see a search for “Lillian Henson” across all data in

Results for “Lillian Henson” on

Within a few seconds, you have examples of where “Lillian Henson” occurs across multiple datasets. You can now select the dataset you want to drill into.  For this example, let’s assume that we want to find only IT incidents where Lillian Henson is involved.

To narrow the results further, we can pull only the IT incidents where Lillian Henson was involved as a manager.

With the data quickly identified, it can be extracted from the lake for any follow-on use, or directly fed into analytic tools like Tableau.
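The walkthrough above can be sketched as a sequence of progressively narrower filters. This is only an illustration of the workflow, not Koverse's interface: the `find` helper, the record layout, and all sample values other than the name “Lillian Henson” are hypothetical, and the sketch scans records linearly where a real deployment would consult an index.

```python
from collections import Counter

# Hypothetical records, each tagged with the dataset it came from.
RECORDS = [
    {"dataset": "it_incidents", "fields": {"reporter": "Lillian Henson", "manager": "Omar Reyes"}},
    {"dataset": "it_incidents", "fields": {"reporter": "Ana Silva", "manager": "Lillian Henson"}},
    {"dataset": "employees", "fields": {"name": "Lillian Henson", "office": "Seattle"}},
    {"dataset": "badge_logs", "fields": {"holder": "Ana Silva", "door": "Lab 2"}},
]

def find(records, text, dataset=None, field=None):
    """Return records containing text, optionally limited to a dataset and field."""
    needle = text.lower()
    matches = []
    for record in records:
        if dataset is not None and record["dataset"] != dataset:
            continue
        values = record["fields"].items()
        # With field=None every field is checked; otherwise only the named one.
        if any(needle in str(v).lower() for k, v in values if field in (None, k)):
            matches.append(record)
    return matches

# Step 1: search everything and see which datasets contain the term.
everywhere = find(RECORDS, "Lillian Henson")
print(Counter(r["dataset"] for r in everywhere))

# Step 2: home in on one dataset.
incidents = find(RECORDS, "Lillian Henson", dataset="it_incidents")

# Step 3: narrow to a single field.
as_manager = find(RECORDS, "Lillian Henson", dataset="it_incidents", field="manager")

# Step 4: extract the matching rows for follow-on use.
print([r["fields"] for r in as_manager])
```

Each step reuses the same query and only adds a constraint, which mirrors how the search narrows from all data, to one dataset, to one field.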

The Koverse search syntax also supports a range of more complex queries, including Boolean, time, geospatial, and complex wildcard searches across pre-specified fields, so that complex interrogation workflows can be supported at scale directly from Koverse.  Koverse data interrogation enables fast and easy identification and extraction of the data you need from a vast data lake.
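As a rough illustration of how Boolean and wildcard queries can be answered from a term index, the sketch below maps them onto set operations. It does not reproduce the Koverse search syntax, which is not shown here; the `INDEX` contents and the `term` and `wildcard` helpers are hypothetical, and time and geospatial queries are left out.

```python
from fnmatch import fnmatchcase

# Hypothetical term index: term -> set of record IDs containing it.
INDEX = {
    "henson": {1, 2, 5},
    "hendricks": {3},
    "outage": {2, 3, 4},
    "vpn": {2},
}

def term(t):
    """Records containing an exact term."""
    return set(INDEX.get(t, set()))

def wildcard(pattern):
    """Records containing any term that matches a shell-style pattern."""
    hits = set()
    for t, ids in INDEX.items():
        if fnmatchcase(t, pattern):
            hits |= ids
    return hits

# Boolean AND is set intersection, OR is union, NOT is difference.
both = term("henson") & term("outage")           # henson AND outage
either = term("vpn") | term("hendricks")         # vpn OR hendricks
excluded = wildcard("hen*") - term("outage")     # hen* AND NOT outage

print(both, either, excluded)
```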

Try a live online version of the Koverse platform for yourself at


About Paul Brown, Chief Product Officer and Co-Founder of Koverse, Inc.

Paul Brown has overseen and advised dozens of big data efforts in intelligence, defense and finance, some of which are among the largest and most sophisticated big data implementations in the world.  Paul led the development effort that resulted in the Apache Accumulo project and the National Security Agency’s first scalable data system, revolutionizing the way the NSA handles data.  Paul also led big data implementations for Booz Allen.  He holds a Bachelor of Science degree from the University of Washington and a Master of Science degree in Electrical and Computer Engineering, Information Theory and Electronic Communications from The Johns Hopkins University.

The Koverse UI displaying datasets.