How Koverse Makes Analytics Production Ready

Koverse is an intelligent solutions platform for digital business.

Developers and data scientists can readily prototype analytics, from the simple to the complex, including those involving machine learning and AI, on their own machines using powerful, popular, and open analytical frameworks like Apache Spark. But getting analytics into production remains a challenge for most organizations, because running analytics in production requires solving several problems:

  1. How live data will be delivered, and in what formats
  2. Where results will be stored
  3. How results will be accessed by potentially many decision makers
  4. How to schedule, monitor, and secure the entire process
  5. How to make it repeatable

Koverse delivers analytics-ready data.

Koverse supports analytics written using the popular Apache Spark framework. Analytics written for Koverse can use all of the APIs and functions of Spark, including Spark SQL and MLlib for machine learning and AI workflows. What makes a Koverse transform different from a “vanilla” analytic is that:

  • Developers can avoid writing code to read the original format of data. Koverse provides a common format for all data whether structured, semi- or unstructured.
  • Developers do not have to specify the schema of the data. Koverse learns the schema and provides it to the underlying frameworks, which makes using structured APIs like Spark SQL or DataFrames much easier.
  • Developers can identify which parts of the analytic are configurable. For example, a Sentiment Analysis algorithm can be configured to process the field containing text at run-time. This way it can be run on the ‘text’ field of Twitter data in one workflow, and run on the ‘body’ field of some email data in another workflow. This makes analytics reusable for more than one type of data.
  • Developers don’t have to worry about where the output goes. Results can be written back to Koverse, automatically secured and indexed. From there, these results can be queried directly, accessed by applications, fed into additional analytics or exported to external file systems or databases.
  • Non-developers can use these analytics simply by configuring them using the Koverse UI. No coding knowledge or command-line access is needed.
  • Koverse adds the following capabilities to make every analytic production-ready:
    • Triggering jobs to run on a scheduled or automatic basis
    • Configurable input data windows, e.g. process data from the last 30 days
    • Maintaining relationships and lineage between data sets
    • Auditing
    • Job monitoring and reporting
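
Taken together, these capabilities mean a Koverse analytic is a single Java class that declares its metadata and configuration and implements its logic. As a rough sketch only (the base class comes from the Koverse SDK, and the real methods appear in the worked example below):

// Rough shape of a Koverse analytic; see the full example below.
public class MyAnalytic /* extends a Koverse SDK transform base class */ {
  // Metadata: getName(), getTypeId(), getVersion(), getDescription()
  // Configuration: getParameters() declares user-settable options
  // Logic: an execute-style method receives input RDDs of SimpleRecords
  //        and returns an RDD of SimpleRecords as the output data set
}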

Technical example.

Here is a “toy” example analytic written in Java 8 for the Apache Spark framework:

JavaRDD<String> textFile = sc.textFile("hdfs://...");
JavaPairRDD<String, Integer> counts = textFile
    .flatMap(s -> Arrays.asList(s.split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);
counts.saveAsTextFile("hdfs://...");

That code is fairly concise, thanks to Java 8’s support for functional programming and the powerful Spark API, but the ellipses are a clue as to where the Spark framework stops and where it expects developers to start solving the problems of where to obtain data and how to deliver results.
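
For instance, even this toy example presumes a Spark context, sc, that the developer must construct, configure, and deploy somewhere, along with everything the ellipses hide:

// Standard Spark setup (org.apache.spark.SparkConf and
// org.apache.spark.api.java.JavaSparkContext) that the developer
// supplies and maintains outside Koverse.
SparkConf conf = new SparkConf().setAppName("WordCount");
JavaSparkContext sc = new JavaSparkContext(conf);
// ...plus choices about input paths, output paths, scheduling,
// monitoring, and access control that Spark itself does not make for you.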

As a developer, you may be tempted to hard-code paths to text files in HDFS for reading data and writing output, which leads to a brittle, hard-to-maintain, and insecure process. It also leaves unanswered questions: What happens when the data location changes? Who has access to those directories in HDFS? What if we get data in a different format? How often will this process run, and can we control that at all? These are the things that can keep your project from ever seeing the light of production.

Writing an analytic in Koverse is easy: leverage the Koverse SDK to describe the analytic and to read input data and write output data in a common format; the rest is coding up the analytic logic as usual. Here is an example word count using Koverse. First, we’ll give our analytic a name, a unique type ID, a version number, and a description. This helps everyone understand what your analytic does and how it differs from other analytics in the system:

@Override
public String getName() {
  return "Word Count";
}

@Override
public String getTypeId() {
  return "word-count";
}

@Override
public Version getVersion() {
  return new Version(1, 0, 0);
}

@Override
public String getDescription() {
  return "Everyone's favorite data example.";
}

Now people using the Koverse UI will be able to find your Word Count analytic and use it on data they can access.

Next, instead of hard-coding our analytic to work with one particular format of data conforming to one schema, we’ll declare which fields in the data our analytic can work with. In this case, that is any field containing one or more words that we want to count:

private static final String TEXT_FIELD_PARAM = "textField";

@Override
public Iterable<Parameter> getParameters() {
  // assumes a static import such as Guava's Lists.newArrayList
  return newArrayList(Parameter.newBuilder()
      .displayName("Text field")
      .parameterName(TEXT_FIELD_PARAM)
      .type(Parameter.TYPE_COLLECTION_FIELD)
      .build());
}

Now Koverse knows to prompt users for the name of a field that contains text, so your analytic works equally well on email, Word documents, social media streams, and anything else containing text!

This is how the analytic configuration screen will look to Koverse users. Note that our parameter allows users to select the text field our analytic requires, and that we have many options for specifying how our analytic will run: automatically or on a schedule, on all data or only on newly arriving data, whether to replace results each time it runs, and the name of the data set that will hold the results.

Finally, we specify what to do with the data. Instead of making us hard-code HDFS paths into our analytic, Koverse gives us an RDD of SimpleRecord objects, which behave much like a Java Map of Strings to Objects.
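
As a quick standalone illustration (this sketch uses only the put and get methods that appear in this example; see the Koverse SDK for the full interface):

// Hypothetical standalone usage of SimpleRecord, shown for illustration.
SimpleRecord record = new SimpleRecord();
record.put("text", "hello world");            // store a field by name
String text = record.get("text").toString();  // fields come back as Objects

Our transform receives these records in bulk, as an RDD: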

JavaRDD<SimpleRecord> inputRdd = jstc
    .getInputCollectionRdds().values().iterator().next();

Next, we grab the name of the text field the user has specified and proceed with the word count:

String textField = jstc.getParameters().get(TEXT_FIELD_PARAM);
JavaPairRDD<String, Integer> counts = inputRdd
    .flatMap(record -> Arrays.asList(
        record.get(textField).toString().split(" ")).iterator())
    .mapToPair(word -> new Tuple2<>(word, 1))
    .reduceByKey((a, b) -> a + b);

Instead of writing out data to a text file in HDFS, we’ll save our results as a new data set in Koverse that can be searched and to which we can control access:

return counts.map(t -> {
  SimpleRecord r = new SimpleRecord();
  r.put("word", t._1);
  r.put("count", t._2);
  return r;
});
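
For reference, here is how the pieces above fit together into a single class. The exact base class and execute signature come from the Koverse SDK and may vary by version; this skeleton assumes a JavaSparkTransform-style base class and is a sketch, not the definitive SDK contract:

// Sketch: assembling the fragments above into one transform class.
// The base class name and execute signature are assumptions; the bodies
// are exactly the fragments shown earlier in this post.
public class WordCountTransform extends JavaSparkTransform {

  private static final String TEXT_FIELD_PARAM = "textField";

  // getName(), getTypeId(), getVersion(), getDescription(), and
  // getParameters() are implemented exactly as shown above.

  @Override
  protected JavaRDD<SimpleRecord> execute(JavaSparkTransformContext jstc) {
    JavaRDD<SimpleRecord> inputRdd = jstc
        .getInputCollectionRdds().values().iterator().next();
    String textField = jstc.getParameters().get(TEXT_FIELD_PARAM);

    JavaPairRDD<String, Integer> counts = inputRdd
        .flatMap(record -> Arrays.asList(
            record.get(textField).toString().split(" ")).iterator())
        .mapToPair(word -> new Tuple2<>(word, 1))
        .reduceByKey((a, b) -> a + b);

    // Each output record becomes a searchable, access-controlled result.
    return counts.map(t -> {
      SimpleRecord r = new SimpleRecord();
      r.put("word", t._1);
      r.put("count", t._2);
      return r;
    });
  }
}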

We can grant permission to query and download these results to precisely the groups of people that need them, who can then access the results by searching in the Koverse UI or through specialized web apps built on the Koverse API.

To use a live Koverse system yourself, you can start up an instance in the Amazon AWS Marketplace.