Table-Based Validation (TBV) Details
Introduction -
"Table-Based Validation" (TBV) is a quality assurance method to help verify that Vitalnet results are correct.
TBV verifies that both (1) Vitalnet software and (2) Vitalnet data files are correct.
TBV automatically compares large tables known to be correct from a previous release with the same tables currently produced by the software.
The tables are designed to exercise the capabilities of the software and to expose any errors.
In the course of the process, many thousands of results are compared.
This page gives more detailed information concerning TBV,
beyond that given in the
Main TBV Page.
You can use and understand Table-Based Validation without reading this page.
The information on this page is for those who want to understand the background and technical details.
History of Vitalnet Validation -
Historically, Vitalnet has been accurate and correct.
There has always been an emphasis on ensuring correct results.
Validation methods used have included (1) "Cross-Checking",
(2) "Reality Check Tables", and (3) "Worksheet Validation".
These validation methods have helped ensure that data are correctly imported,
that results are correct,
and that something working last year still works correctly.
Method #1: "Cross-Checking" -
This is the fundamental method.
It internally and externally "cross-checks" cells in different Vitalnet tables.
For example, switching rows and columns should give the same results in each cell,
just with the axes reversed.
Also, a shared cell between two different tables should always agree.
For example, there should be 38 deaths for the category:
[white females, year 2000, C50 - breast cancer, Texas Anderson County, age 40-49],
regardless of what table the cell is in.
Also, results should agree with any published results,
or with results produced by other methods (eg, bash, SPSS, SAS).
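As a minimal sketch of such a cross-check, assuming two hypothetical CSV exports (table_by_race.csv and table_by_age.csv) that both contain the Anderson County cell in a known column, the shared value could be compared with a short shell script:

    #!/bin/sh
    # Sketch of a shared-cell cross-check. File names and column
    # positions are hypothetical, not actual Vitalnet output.
    a=$(grep 'Anderson County' table_by_race.csv | cut -d, -f5)
    b=$(grep 'Anderson County' table_by_age.csv | cut -d, -f7)
    if [ "$a" = "$b" ]; then
        echo "cross-check OK: $a"
    else
        echo "cross-check MISMATCH: $a vs $b"
    fi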
Method #2: "Reality Check Tables" -
This method uses a series of "Reality Check Tables",
such as
this example and
another example.
A "Reality Check Tables" may be compared with one produced in a previous year,
to check that the results match,
may be compared with results produced by generic stats software (eg, SPSS),
and may be scanned to look for abrupt changes in historical trends that would indicate a likely glitch.
This method is useful to help verify new data.
However, the method relies on error-prone visual scanning,
does a relatively poor job detecting small errors,
and does not check the overall validity of the data warehouse.
Method #3: "Worksheet Validation" -
In 2015, Jeff Swanson, at Texas DSHS,
devised a "Worksheet Validation" system.
The system is implemented within an Excel spreadsheet.
Each worksheet tab contains three areas:
1) the left area contains "Reality Check Tables" from the previous release,
2) the middle area contains "Reality Check Tables" from the current release, and
3) the right area does the comparisons.
Each cell in the right area corresponds to a data cell:
it is blank if the left and middle numbers match,
and 'X' if there is a discrepancy.
New Method: "Table-Based Validation" (TBV) -
This method, initiated in 2016, tests Vitalnet results for validity.
It builds on ideas from "Reality Check Tables" and "Worksheet Validation".
It directly uses Vitalnet output without any manual editing,
assembles and tabulates all differences,
and produces detailed reports.
Using Unix shell scripts, TBV compares two large sets of results:
(1) "Reality Check Tables" from a previously installed VitalPro module, and
(2) the same "Reality Check Tables" produced by the current program.
Based on the comparison, further TBV shell scripts produce report web pages,
such as this example.
TBV retains historical comparisons, such as 2010-11-08 and 2011-03-13,
providing an "audit trail" to better verify correctness.
"Reality Check Table" Attributes -
For the purposes of Table-Based Validation,
each "Reality Check Table" (RCT) always has both rows and columns,
to provide more numbers for comparison.
An RCT usually uses multi-tables (a table series),
also providing more numbers to compare.
If the RCT includes rates,
the population numbers for the years compared must be the same,
or any difference must be taken into account.
For examples of "Reality Check Tables", refer to the
example results page.
Directory Structure -
"Table-Based Validation" is file-based,
directly uses Vitalnet output text files,
and resides on a Unix computer.
For each area / data / test_date combination (eg, Texas birth data released 2017-05-22), there is a "CHECK_DIR",
accessible via the web.
For example,
the CHECK_DIR for Texas population updated 2011-03-13 is:
/vn/tbvdir2/txu1/tpl/pt1/2011-03-13/
In each CHECK_DIR, there are two INPUT subdirectories (subdirs) for "Reality Check Table" results:
subdir | Contents | Explanation |
tbv-1-old | "Old RCT" | Verified results from previous release |
tbv-2-new | "New RCT" | Results Vitalnet currently makes |
In each CHECK_DIR, there are two OUTPUT subdirectories (subdirs):
subdir | Contents | Explanation |
tbv-3-dif | "Difference" | Compares old vs new results |
tbv-4-adj | "Adjusted" | Removes false positives |
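As a sketch (not the actual setup script), the four subdirs for the example CHECK_DIR could be created like this:

    #!/bin/sh
    # Sketch: lay out one CHECK_DIR with its two input and two output subdirs.
    CHECK_DIR=/vn/tbvdir2/txu1/tpl/pt1/2011-03-13
    mkdir -p "$CHECK_DIR"/tbv-1-old "$CHECK_DIR"/tbv-2-new \
             "$CHECK_DIR"/tbv-3-dif "$CHECK_DIR"/tbv-4-adj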
Old (Previous) Results -
The
tbv-1-old (old) subdir holds a series of "Reality Check Tables", accepted as correct.
For each RCT, there are three files: text (txt), CSV (csv), and HTML (htm).
The files have the same logical content (such as counts and rates), but in different formats (eg, text vs HTML).
All three files are exactly as produced by Vitalnet.
The text file is the "Reality Check Table" (RCT) used to make comparisons.
The CSV and HTML files may be imported into a spreadsheet, for convenient viewing.
The HTML file may also be viewed in a browser.
If "current release" is 2011-03-13, "previous release" would be 2010-11-08:
New (Current) Results -
The
tbv-2-new (new) subdir holds a series of "Reality Check Table" files made by the current Vitalnet,
using the same analysis parameters as the old "Reality Check Table" (RCT) files.
The files are named as in the
tbv-1-old subdir. For the example above, these would be the 2011-03-13 "current release" results.
PAR_INFO -
For each "Reality Check Table" result,
a shell script holds the input parameters (PAR_INFO), such as "Year Rows".
The parameters are the same for both old and new results (so they can be compared),
and are listed in the TBV report.
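The PAR_INFO scripts themselves are not reproduced on this page; as a hypothetical sketch, one might hold parameters along these lines:

    #!/bin/sh
    # Hypothetical PAR_INFO sketch: analysis parameters for one RCT.
    # The same values are used for the old and new runs, so the
    # resulting tables are directly comparable.
    RCT_NAME="deaths_by_year_and_county"   # hypothetical name
    ROW_VAR="Year"                         # "Year Rows", as mentioned above
    COL_VAR="County"
    MEASURE="Counts and rates"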
FIX_INFO -
Sometimes, differences that remain after adjustment can be accounted for,
for example when the NCHS leading causes were modified to include C. difficile as a leading cause.
For each such "Reality Check Table" result,
a shell script holds FIX_INFO,
explaining why the current result is correct.
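As with PAR_INFO, the actual FIX_INFO scripts are not shown here; a hypothetical sketch:

    #!/bin/sh
    # Hypothetical FIX_INFO sketch: explains why a remaining difference
    # in this RCT is correct, for inclusion in the TBV report.
    FIX_INFO="NCHS leading causes now include C. difficile, so leading-cause results legitimately differ from the previous release."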
Differences -
The
tbv-3-dif (difference) output subdir holds the difference between "previous" and "current".
The Unix diff command compares "previous" with "current" to produce the "difference".
For each "Reality Check Table",
there is a single file showing the differences,
as in
this example.
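For a single RCT, the comparison amounts to the following sketch (the RCT file name and the .dif extension are hypothetical):

    #!/bin/sh
    # Sketch: write the raw diff for one "Reality Check Table".
    cd /vn/tbvdir2/txu1/tpl/pt1/2011-03-13 || exit 1
    diff tbv-1-old/deaths_by_year.txt tbv-2-new/deaths_by_year.txt \
        > tbv-3-dif/deaths_by_year.dif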
Adjusted Differences -
For each "Reality Check Table",
the
tbv-4-adj (adjusted) output subdir holds a single file with an "adjusted" form of the
tbv-3-dif file.
"False-positive" lines are removed from the adjusted file.
For example, the unique ID will be different,
so that is not a discrepancy, and hence needs to be adjusted for.
Or sometimes there are wording changes, such as "Age" vs "Age (Years)", and these are adjusted for.
If there are no lines in the adjusted file, there are no discrepancies,
as in
this example.
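As a sketch of the adjustment step (the file names and the exact false-positive patterns are illustrative; the real list is maintained in the TBV scripts):

    #!/bin/sh
    # Sketch: remove known false-positive lines (eg, unique ID, run date,
    # wording changes) from the diff. An empty adjusted file means no
    # real discrepancies remain.
    cd /vn/tbvdir2/txu1/tpl/pt1/2011-03-13 || exit 1
    grep -v -e 'ID' -e 'Date' -e 'Age (Years)' \
        tbv-3-dif/deaths_by_year.dif > tbv-4-adj/deaths_by_year.adj
    [ -s tbv-4-adj/deaths_by_year.adj ] || echo "no discrepancies"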
Generating Differences -
For each "Reality Check Table" result,
a shell script compares the
tbv-1-old text file with the
tbv-2-new text file,
writing
diff output to a file in the
tbv-3-dif subdir.
Next, a shell script processes the tbv-3-dif file, removing false positives such as "ID" and "Date" lines,
and writes the "adjusted differences" to the
tbv-4-adj subdir.
Finally, with input from PAR_INFO and FIX_INFO,
a shell script creates the report page,
such as
this example or
this other example.
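Putting the steps together, the per-RCT pipeline can be sketched as a single loop (file names, extensions, and the report step are illustrative; the actual TBV scripts are not reproduced here):

    #!/bin/sh
    # Sketch of the per-RCT pipeline: raw diff, then adjusted diff.
    CHECK_DIR=/vn/tbvdir2/txu1/tpl/pt1/2011-03-13
    for old in "$CHECK_DIR"/tbv-1-old/*.txt; do
        name=$(basename "$old" .txt)
        diff "$old" "$CHECK_DIR/tbv-2-new/$name.txt" \
            > "$CHECK_DIR/tbv-3-dif/$name.dif"
        grep -v -e 'ID' -e 'Date' "$CHECK_DIR/tbv-3-dif/$name.dif" \
            > "$CHECK_DIR/tbv-4-adj/$name.adj"
        # A further script combines the adjusted diff with PAR_INFO and
        # FIX_INFO to build the report web page (not shown here).
    done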
Please refer to the
Main Table-Based Validation Page for how to interpret the report page.