Comparison Testing

In the context of the Data Validation Library, a comparison test compares two catalogs. These tests fall under two key use cases:

• comparing two versions of the same catalog
• comparing versions of two distinct catalogs whose contents can be extracted into shared data structures

The validation library's comparison API is exposed in three packages with different levels of abstraction:

• validation.core.comparison - provides an abstract recipe to implement comparisons for any kind of object using any kind of key. You need to work directly with Apache Spark's RDD objects.
• validation.core.comparison.metadiff - provides an abstract implementation of the base recipe using a combination of layer names and partition identifiers as keys. This package checks for differences by comparing the metadata checksums. For grouping and actually creating the output data, you are still required to work with RDDs.
• validation.core.comparison.metadiff.grouped - provides two callbacks to work with pairs of differing catalog partition metadata, either layer-wise or partition-wise. Typically, this abstraction covers the majority of use cases.

The recommended process is to consider comparison.metadiff.grouped first, to see if it meets your requirements. Next, evaluate comparison.metadiff. If neither of these packages meets your requirements, then consider the base level, comparison.

In turn, you can use comparison.metadiff as an example of how to use comparison, and comparison.metadiff.grouped as an example of how to use comparison.metadiff.

To run a grouped comparison pipeline, refer to the quickstart-example in the SDK package.

In the sections below:

• reference refers to the baseline catalog version
• candidate refers to the catalog version being tested

The Comparison Package

The Comparator is the main class. It implements the inherited compile() function, which is called with the complete data of the candidate.

You must provide access to the reference's data by implementing queryReference(). As shown in the code snippet below, the compile method joins the candidate and reference data for comparison via the Joiner.join method and performs the actual comparison in the Comparison.compare method. You implement both of these methods.

scala
abstract class Comparator[K, C](joiner: Joiner[K, C], comparison: Comparison[K, C])
    extends NonIncrementalCompiler {

  // Implement this to supply the reference data.
  def queryReference(): InData

  final override def compile(candidateData: InData, parallelism: Int)(
      implicit logContext: LogContext): ToPublish = {
    val referenceData = queryReference()
    // Argument order follows the Joiner.join signature: reference first, then candidate.
    val joinedData: JoinedData[K, C] = joiner.join(referenceData, candidateData)
    val results: ToPublish = comparison.compare(joinedData)
    results.partitionBy(outPartitioner(parallelism))
  }
}

Since the following RDD declaration appears often, there is a type definition for convenience:

scala
type JoinedData[K, C] = RDD[(K, (Option[C], Option[C]))]

The Joiner trait joins the reference and candidate data. Your implementation is responsible for doing this in a way that suits the corresponding comparison.

scala
trait Joiner[K, C] {
  def join(referenceData: InData, candidateData: InData): JoinedData[K, C]
}
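
To make the shape of JoinedData concrete, the following sketch performs a full outer join on plain Scala maps instead of Spark RDDs. The fullOuterJoin helper, the string keys and values, and the assumption that the reference value comes first in each pair are illustrative only and are not part of the library.

```scala
object JoinSketch {
  // Local stand-in for JoinedData[K, C]; the library type is an RDD, not a Seq.
  type LocalJoined[K, C] = Seq[(K, (Option[C], Option[C]))]

  // Full outer join: every key from either side appears exactly once.
  // Assumption: the first element of the pair is the reference value,
  // the second is the candidate value.
  def fullOuterJoin[K: Ordering, C](reference: Map[K, C],
                                    candidate: Map[K, C]): LocalJoined[K, C] =
    (reference.keySet ++ candidate.keySet).toSeq.sorted
      .map(key => (key, (reference.get(key), candidate.get(key))))

  val reference = Map("tile-1" -> "a", "tile-2" -> "b")
  val candidate = Map("tile-2" -> "B", "tile-3" -> "c")

  // tile-1 exists only in the reference, tile-3 only in the candidate.
  val joined = fullOuterJoin(reference, candidate)
}
```

On pair RDDs, Spark's own fullOuterJoin yields the same (K, (Option[V], Option[W])) shape, so a real Joiner may be able to key both inputs and delegate to it.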

The Comparison trait performs the actual comparison of the previously joined data and returns output data appropriate to the output layer configuration.

scala
trait Comparison[K, C] {
  def compare(data: JoinedData[K, C]): ToPublish
}
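
As a rough illustration of what a compare implementation might do with joined pairs, the sketch below classifies each key as added, removed, or modified. It works on a local Seq rather than an RDD, and the Change type, the classify helper, and the (reference, candidate) ordering of each pair are assumptions made for this example, not library API.

```scala
object CompareSketch {
  sealed trait Change
  case object Added extends Change
  case object Removed extends Change
  case object Modified extends Change

  // Assumption: each pair is ordered (reference, candidate).
  def classify[C](pair: (Option[C], Option[C])): Option[Change] = pair match {
    case (None, Some(_))              => Some(Added)
    case (Some(_), None)              => Some(Removed)
    case (Some(r), Some(c)) if r != c => Some(Modified)
    case _                            => None // unchanged, or absent on both sides
  }

  // Keep only the keys whose values changed between the two sides.
  def compare[K, C](joined: Seq[(K, (Option[C], Option[C]))]): Seq[(K, Change)] =
    joined.flatMap { case (key, pair) => classify(pair).map(change => (key, change)) }

  val joined: Seq[(String, (Option[String], Option[String]))] = Seq(
    ("tile-1", (Some("a"), None)),
    ("tile-2", (Some("b"), Some("B"))),
    ("tile-3", (None, Some("c"))),
    ("tile-4", (Some("d"), Some("d")))
  )

  val report = compare(joined)
}
```

A real implementation would additionally map each result onto the output layers and partitions expected in ToPublish.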

The ContextHelper class queries the reference data and also provides Retrievers for the reference and the candidate catalog. These Retrievers are needed if you want to access the actual partition's content by retrieving the Payload for the given partition's metadata.

The metadiff package implements comparison by using a LayerKey as the key to join the partitions' metadata.

scala
case class LayerKey(layer: Layer.Id, partition: Partition.Name)

The MetadataComparison class defines a retrieveResults() callback that you need to implement to handle the metadata pairs that differ in their partitions' payload checksums. Since this data remains in an RDD, you can still group it according to your output needs.

scala
def retrieveResults(different: JoinedData[LayerKey, InMeta]): ToPublish
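
The checksum-based filtering and a possible grouping step can be sketched with local collections as follows. Key and Meta are hypothetical stand-ins for LayerKey and InMeta, and treating a partition that exists on only one side as different is an assumption of this example. The library's actual behavior may differ.

```scala
object MetadiffSketch {
  // Hypothetical stand-ins: the library's LayerKey is built from Layer.Id and
  // Partition.Name, and InMeta is assumed here to expose only a checksum.
  final case class Key(layer: String, partition: String)
  final case class Meta(checksum: String)

  type Joined = Seq[(Key, (Option[Meta], Option[Meta]))]

  // Assumption: a partition present on only one side also counts as different.
  def differs(pair: (Option[Meta], Option[Meta])): Boolean = pair match {
    case (Some(left), Some(right)) => left.checksum != right.checksum
    case (None, None)              => false
    case _                         => true
  }

  // Group the differing partition names by layer, as one possible output grouping.
  def differingPartitionsPerLayer(joined: Joined): Map[String, Seq[String]] =
    joined
      .filter { case (_, pair) => differs(pair) }
      .groupBy { case (key, _) => key.layer }
      .map { case (layer, entries) => (layer, entries.map(_._1.partition).sorted) }

  val joined: Joined = Seq(
    (Key("roads", "p1"), (Some(Meta("aa")), Some(Meta("aa")))),
    (Key("roads", "p2"), (Some(Meta("bb")), Some(Meta("cc")))),
    (Key("places", "p7"), (None, Some(Meta("dd"))))
  )

  val result = differingPartitionsPerLayer(joined)
}
```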

The Grouped Package

This package provides a quick and easy way for you to get a diff for a defined set of layers. As mentioned, this package offers two callbacks that you can use to work with pairs of differing catalog partition metadata: either layer-wise or partition-wise.

For layer-wise pairs:

scala
def handleDiff(layer: Layer.Id,
               partitioned: Iterable[(Partition.Name, Option[InMeta], Option[InMeta])])
  : Iterable[(OutKey, Option[Payload])]
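
A layer-wise handler could, for instance, emit one textual report per layer. The sketch below mirrors the callback's shape with simplified, hypothetical types: String replaces Layer.Id, Partition.Name, and Payload, while Meta and OutKey are local case classes. It also assumes that the first Option holds the reference metadata and the second the candidate metadata, and it uses a made-up output layer named "diff-report".

```scala
object HandleDiffSketch {
  // Hypothetical stand-ins for the library's InMeta and OutKey types.
  final case class Meta(checksum: String)
  final case class OutKey(layer: String, partition: String)

  // Produces a single report partition per input layer, named after that layer.
  // Assumption: tuples are ordered (partition name, reference, candidate).
  def handleDiff(layer: String,
                 partitioned: Iterable[(String, Option[Meta], Option[Meta])])
    : Iterable[(OutKey, Option[String])] = {
    val lines = partitioned.toSeq.sortBy(_._1).map {
      case (name, None, Some(_)) => s"$name: added"
      case (name, Some(_), None) => s"$name: removed"
      case (name, _, _)          => s"$name: changed"
    }
    if (lines.isEmpty) Seq.empty
    else Seq((OutKey("diff-report", layer), Some(lines.mkString("\n"))))
  }

  val sample: Seq[(String, Option[Meta], Option[Meta])] = Seq(
    ("p2", Some(Meta("x")), Some(Meta("y"))),
    ("p1", None, Some(Meta("z")))
  )

  val output = handleDiff("roads", sample).toSeq
}
```

Returning an empty collection when nothing differs keeps the handler from publishing empty report partitions. Whether that is the right choice depends on your output layer configuration.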