Comparison Testing
In the context of the Data Validation Library, a comparison test compares two catalogs. These tests fall under two key use cases:
- comparing two versions of the same catalog
- comparing versions of two distinct catalogs whose contents can be extracted into shared data structures
The validation library's comparison API is exposed in three packages with different levels of abstraction:
- validation.core.comparison provides an abstract recipe to implement comparisons for any kind of object using any kind of keys. You need to work directly with Apache Spark's RDD objects.
- validation.core.comparison.metadiff provides an abstract implementation of the base recipe using a combination of layer names and partition identifiers as keys. This package checks for differences by comparing the metadata checksums. For grouping and actually creating the output data, you are still required to work with RDDs.
- validation.core.comparison.metadiff.grouped provides two callbacks to work with pairs of differing catalog partition metadata, either layer-wise or partition-wise. Typically, this abstraction fulfills the majority of use cases.
The recommended process is to consider comparison.metadiff.grouped first, to see if it meets your requirements. Next, evaluate comparison.metadiff. If neither of these packages meets your requirements, then consider the base level, comparison.
In turn, you can use comparison.metadiff as an example of how to use comparison, and comparison.metadiff.grouped as an example of how to use comparison.metadiff.
To run a grouped comparison pipeline, refer to the quickstart-example in the SDK package.
In the sections below:
- reference refers to the baseline catalog version
- candidate refers to the catalog version being tested
The Comparison Package
The Comparator is the main class. It implements the inherited compile() function, which is called with the complete data of the candidate. You must provide access to the reference data by implementing queryReference(). As shown in the code snippet below, the compile method joins the candidate and reference data via the Joiner.join method, which you implement, and performs the actual comparison in the Comparison.compare method, which you also implement.
abstract class Comparator[K, C](joiner: Joiner[K, C], comparison: Comparison[K, C])
    extends NonIncrementalCompiler {

  def queryReference(): InData

  final override def compile(candidateData: InData, parallelism: Int)(
      implicit logContext: LogContext): ToPublish = {
    val referenceData = queryReference()
    val joinedData: JoinedData[K, C] = joiner.join(candidateData, referenceData)
    val results: ToPublish = comparison.compare(joinedData)
    results.partitionBy(outPartitioner(parallelism))
  }
}
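The recipe above fixes the order of the steps and leaves the individual steps to you. The following is a minimal, Spark-free sketch of that idea, not the library's implementation: PipelineSketch, run, and the toy data types are hypothetical names chosen for illustration, and plain Scala sets stand in for InData and ToPublish.

```scala
// Hypothetical stand-in for the Comparator recipe: the sequence of steps is
// fixed in a final method, while the steps themselves are supplied by the user.
abstract class PipelineSketch[In, Joined, Out](
    join: (In, In) => Joined,
    compare: Joined => Out) {

  // Subclasses decide where the reference data comes from.
  def queryReference(): In

  // Fixed order: fetch reference, join with candidate, compare.
  final def run(candidate: In): Out = {
    val reference = queryReference()
    compare(join(candidate, reference))
  }
}

object PipelineSketchDemo {
  // Toy instance: the "data" is a set of partition names, the join simply
  // pairs both sets, and the comparison lists names missing from the candidate.
  val pipeline =
    new PipelineSketch[Set[String], (Set[String], Set[String]), Set[String]](
      (candidate, reference) => (candidate, reference),
      { case (candidate, reference) => reference -- candidate }
    ) {
      def queryReference(): Set[String] = Set("p1", "p2", "p3")
    }
}
```

Calling PipelineSketchDemo.pipeline.run(Set("p1")) runs all three steps in order and returns the names found only in the reference set.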
Since the following RDD declaration appears often, there is a type definition for convenience:
type JoinedData[K, C] = RDD[(K, (Option[C], Option[C]))]
The Joiner trait joins both the reference and candidate data. It is your implementation's responsibility to do this in a way that is suitable for the corresponding comparison.
trait Joiner[K, C] {
  def join(referenceData: InData, candidateData: InData): JoinedData[K, C]
}
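The pair of Option values in JoinedData suggests a full outer join: every key from either side appears once, with None marking the side on which it is missing. The sketch below illustrates that shape with plain Scala maps in place of RDDs so that it runs without Spark. JoinSketch and fullOuterJoin are hypothetical names, and which side you treat as reference or candidate is up to your Joiner. On Spark pair RDDs, fullOuterJoin produces a result of the same shape.

```scala
object JoinSketch {
  // In-memory stand-in for the RDD-based JoinedData alias.
  type Joined[K, C] = Seq[(K, (Option[C], Option[C]))]

  // Full outer join: each key from either map appears exactly once;
  // a missing side is represented as None.
  def fullOuterJoin[K, C](first: Map[K, C], second: Map[K, C]): Joined[K, C] =
    (first.keySet ++ second.keySet).toSeq.map { key =>
      (key, (first.get(key), second.get(key)))
    }
}
```

Because no key is dropped, the comparison step can distinguish added, removed, and changed entries.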
The Comparison trait does the actual comparison of the previously joined data and returns output data appropriate to the output layer configuration.
trait Comparison[K, C] {
  def compare(data: JoinedData[K, C]): ToPublish
}
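As a rough illustration of what a compare step might do, the sketch below pattern-matches on the joined pairs and emits one description per key whose two sides disagree. It is an assumption-laden stand-in: CompareSketch and describeDifferences are hypothetical names, a Seq replaces the RDD, and plain strings replace the ToPublish output.

```scala
object CompareSketch {
  // In-memory stand-in for the RDD-based JoinedData alias.
  type Joined[K, C] = Seq[(K, (Option[C], Option[C]))]

  // Emits one line per key whose two sides disagree.
  // Pairs with equal values on both sides produce no output.
  def describeDifferences[K, C](joined: Joined[K, C]): Seq[String] =
    joined.collect {
      case (key, (Some(a), Some(b))) if a != b => s"$key: differs ($a vs $b)"
      case (key, (Some(a), None))              => s"$key: only on the first side ($a)"
      case (key, (None, Some(b)))              => s"$key: only on the second side ($b)"
    }
}
```

A real implementation would build payloads for the configured output layers instead of strings.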
The ContextHelper class queries the reference data and also provides Retrievers for the reference and the candidate catalog. These Retrievers are needed if you want to access the actual partition's content by retrieving the Payload for the given partition's metadata.
The comparison.metadiff package implements comparison by using a LayerKey as the key to join the partitions' metadata.
case class LayerKey(layer: Layer.Id, partition: Partition.Name)
The MetadataComparison class defines a retrieveResults() callback that you must implement to handle the metadata pairs that differ in their partitions' payload checksums. Since this data remains in an RDD, you can still group it according to your output needs.
def retrieveResults(different: JoinedData[LayerKey, InMeta]): ToPublish
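To make the checksum-based selection concrete, the following sketch filters joined metadata down to the pairs that differ. It is not the library's code: Meta is a hypothetical stand-in for InMeta, LayerKey is simplified to string fields, a Seq replaces the RDD, and treating a partition that exists on only one side as "different" is an assumption.

```scala
object ChecksumDiffSketch {
  // Simplified stand-ins: the library uses Layer.Id, Partition.Name and InMeta.
  final case class LayerKey(layer: String, partition: String)
  final case class Meta(checksum: String)

  type Joined = Seq[(LayerKey, (Option[Meta], Option[Meta]))]

  // Keeps pairs whose checksums differ. Assumption: a partition present on
  // only one side also counts as different.
  def differing(joined: Joined): Joined =
    joined.filter {
      case (_, (Some(a), Some(b))) => a.checksum != b.checksum
      case (_, (None, None))       => false
      case _                       => true
    }
}
```

The filtered result corresponds to the kind of input that a retrieveResults() implementation would then group and convert into output data.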
The Grouped Package
This package provides a quick and easy way for you to get a diff for a defined set of layers. As mentioned, this package offers two callbacks that you can use to work with pairs of differing catalog partition metadata: either layer-wise or partition-wise.
For layer-wise pairs:
def handleDiff(layer: Layer.Id,
               partitioned: Iterable[(Partition.Name, Option[InMeta], Option[InMeta])])
    : Iterable[(OutKey, Option[Payload])]
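The layer-wise callback receives all differing partitions of one layer at once. The sketch below shows how such a grouping could be arranged and what a simple handler might return. Everything in it is assumed for illustration: strings stand in for Layer.Id, Partition.Name, OutKey, and Payload, Meta stands in for InMeta, and groupByLayer is a hypothetical helper rather than the library's dispatch logic.

```scala
object GroupedSketch {
  // Simplified stand-ins for the library's key and metadata types.
  final case class LayerKey(layer: String, partition: String)
  final case class Meta(checksum: String)

  // Example handler: one output record per layer, listing the names of the
  // partitions that differ, sorted for a stable result.
  def handleDiff(
      layer: String,
      partitioned: Iterable[(String, Option[Meta], Option[Meta])])
      : Iterable[(String, Option[String])] = {
    val names = partitioned.map(_._1).toSeq.sorted
    Seq((layer, Some(names.mkString(","))))
  }

  // Groups differing pairs by layer and calls the handler once per layer.
  def groupByLayer(
      different: Seq[(LayerKey, (Option[Meta], Option[Meta]))])
      : Seq[(String, Option[String])] =
    different
      .groupBy { case (key, _) => key.layer }
      .toSeq
      .sortBy(_._1)
      .flatMap { case (layer, entries) =>
        handleDiff(layer, entries.map { case (key, (a, b)) => (key.partition, a, b) })
      }
}
```

Grouping by layer suits outputs that summarize a whole layer, such as a per-layer report of changed partitions.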