Testing that Scales

Efficiently building large-scale, WL-based computational systems requires a testing framework that scales. This means confronting an unavoidable tension: controlling the growth of test suites inevitably requires their decomposition into manageable chunks, but this then necessitates those chunks' subsequent composition as a burgeoning system is tested. CodeAssurance manages this composing/decomposing tension by using a "TestFiles" paclet extension to facilitate test suites' decomposition, while the functions TestSummary, TestFile and TestFiles facilitate their subsequent composition. In particular, TestSummary transparently integrates a system's testing with the public functions it delivers, while its lighter footprint promotes seamless testing throughout the system's evolution.
The Motivation
The motivation for TestSummary comes from first observing the non-scaling behaviour of TestReport. Imagine two reports, each containing 100K tests of a complexity similar to, say, that of finding an integer's parity.
First define the parameters of an arbitrary test suite.
Generate two test suites, which should take ~30 sec:
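For concreteness, the construction might look something like the following sketch. The parity-style inputs and the use of VerificationTest are assumptions made for illustration; the suites actually used here may have been built differently.

nTests = 100000;  (* assumed suite size, per the surrounding text *)
(* With injects the iterator value into the otherwise held test arguments *)
makeSuite[n_Integer] := Table[With[{k = i}, VerificationTest[Mod[k, 2], Boole[OddQ[k]]]], {i, n}];
suite1 = makeSuite[nTests];
suite2 = makeSuite[nTests];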
Out[13]=13
In the wild, test suites of this order might take years to assemble, but with these now generated we can begin to get a feel for the complexity of producing test reports in such a regime:
Generate a report for each test suite:
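A minimal sketch, assuming the suites from above and that TestReport accepts the list of evaluated tests directly:

report1 = TestReport[suite1];
report2 = TestReport[suite2];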
Out[14]=14
Already the memory usage exceeds that which is ordinarily expected in notebooks. Compare this with test summaries.
Load CodeAssurance`
Generate a test summary for each of the test suites.
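The corresponding calls might look as follows, assuming TestSummary (from CodeAssurance) mirrors the list-of-tests syntax of TestReport:

Needs["CodeAssurance`"]  (* as loaded above *)
summary1 = TestSummary[suite1];
summary2 = TestSummary[suite2];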
Out[16]=16
Managing multiple reports/summaries often involves their merging.
Merge the two test reports.
Out[17]=17
Again the notebook memory limitation is flagged, and indeed it is now exacerbated by a doubling of memory. Compare this merging with test summaries.
Merge the two summaries.
Out[18]=18
Again, the same memory constraints do not arise when merging test summaries.
Both test reports and test summaries indicate a successful testing run.
Out[20]=20
The resources needed to ascertain this core outcome, however, are vastly different. While the memory disparity is the most alarming (a non-negligible percentage of total memory), there is also a significant time penalty when merging reports compared with summaries. The following quantifies the respective differences in this scenario.
Out[34]=34
Incremental Statistics
A test report contains certain numerical properties, like AbsoluteTimeUsed, CPUTimeUsed and MemoryUsed, that are essentially aggregated values from the entire collection of individual tests. Often this is useful information in its own right, but sometimes other insights related to a dataset's distribution are just as important. For example, knowing the AbsoluteTimeUsed in a test report can immediately assist planning by allowing an estimate of how long testing runs are going to take. Knowing the spread of the AbsoluteTimeUsed, however, might prove useful for identifying testing bottlenecks or even as a feature through which effective test suites can be machine-learned.
Descriptive statistics of the type just described are, of course, readily available in the WL. For example, we can compute the standard deviation of the 100K values of AbsoluteTimeUsed for each test in the suite.
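Working directly from the report, the computation might look as follows (assuming report1 from above; "TestResults" and "AbsoluteTimeUsed" are standard report and test properties):

(* traverse all 100K individual results and aggregate their timings *)
times = #["AbsoluteTimeUsed"] & /@ Values[report1["TestResults"]];
StandardDeviation[times]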
Out[22]=22
Basic statistics are similarly available using test summaries, but these can be computed with order-of-magnitude improvements in efficiency (~0.3 milliseconds vs ~2.0 seconds).
Out[23]=23
Similarly, while MemoryUsed measures the total memory used in a test report, its sequential evaluation includes recycled memory, suggesting maximum memory as an alternative, more realistic measure.
Determine the maximum memory used by any individual test by traversing the test report's 100K tests.
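A sketch of the report-based computation, under the same assumptions as above:

Max[#["MemoryUsed"] & /@ Values[report1["TestResults"]]]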
Out[24]=24
This descriptive statistic is, however, also available in the corresponding summary and is more efficiently extractable (~0.3 milliseconds vs ~2.0 seconds).
Out[25]=25
While merged summaries readily maintain incremental statistics like the minimum, maximum and mean, perhaps more surprising is that summaries can do likewise for more distributive measures like variance, skewness and kurtosis. By maintaining "MomentSums" (sums of the form Σ x_i^r) and then aggregating these across multiple summaries, more efficient merging becomes available. In particular, updating basic statistics when merging summaries can be performed in virtually constant time, independent of the number of tests. In contrast, merging test reports depends explicitly on the number of tests and hence becomes susceptible to performance degradation as a system scales.
The InputForm of a summary shows its stored moment information; despite its comparatively light memory footprint, useful properties can nonetheless still be extracted from it.
By combining these moments, basic statistics can be precisely reconstructed, with the traditional reservations about floating-point arithmetic errors overcome by WL's arbitrary precision.
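The mechanics can be illustrated with a small, self-contained sketch in which the list {n, Σx, Σx^2} stands in for the stored moment sums; the actual internal representation used by CodeAssurance may differ:

(* hypothetical moment representation: {count, sum of x, sum of x^2} *)
moments[data_List] := {Length[data], Total[data], Total[data^2]};
mergeMoments[{n1_, s1_, q1_}, {n2_, s2_, q2_}] := {n1 + n2, s1 + s2, q1 + q2}; (* constant-time merge *)
meanFromMoments[{n_, s_, _}] := s/n;
varianceFromMoments[{n_, s_, q_}] := q/n - (s/n)^2;  (* population variance *)

(* merging two sets of moments exactly reproduces the statistics of the pooled (exact) data *)
d1 = RandomInteger[100, 10]; d2 = RandomInteger[100, 10]; pooled = Join[d1, d2];
varianceFromMoments[mergeMoments[moments[d1], moments[d2]]] === Mean[(pooled - Mean[pooled])^2]
(* True *)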
Naturally, more detailed analyses will always require the original tests traditionally stored in test reports. This reinforces an idiom of best practice: use TestSummary as a complement to TestReport rather than as a replacement. Nonetheless, the performance advantage of working with more lightweight summaries improves with the number of tests, helping solidify a WL-based system as it grows; the following section outlines the scale of this potential advantage.
Scaling Up
We have just observed, then, that summaries provide more efficient extraction of statistical measures, in addition to the lighter time and memory footprints needed for merging that were observed earlier. Putting these together collates a measure of the overall advantage per 100K tests.
Out[27]=27
While the order-of-magnitude differences are clear for this manufactured sample of 200K toy tests, 1-2 s computations occupying ~2% of available memory might seem manageable. There are, however, emerging trends that point towards these resources increasing substantially over time, perhaps even by orders of magnitude for WL-based systems with serious growth ambitions. These trends relate to three factors that influence the resources needed to run and manage test suites, namely: the number of tests, the nature of tests and the frequency of test runs.

Number of tests

While a number of tests in the order of 200K is well beyond what is needed for the types of packages historically implemented in the WL, this calculus changes as fully-fledged WL-based computational systems become more commonplace. Unit tests remain the gold standard for ensuring system correctness, and with the infrastructure for third-party developers continuing to grow (e.g. the paclet repository), the volume of tests is set to rapidly accelerate, along with the need for their efficient management. To provide some forward-leaning context, the number of unit tests for Mathematica is, by now, in the millions, which no doubt requires considerable management while also incurring significant computational costs.

Nature of tests

The nature of tests also impacts the resources needed in their management. While the very notion of a unit test speaks to a unit of functionality, in practice it is usually more efficient to pack several pieces of functionality into a single test. Further, TestCreate allows tests to be stored (without necessarily being immediately evaluated), rendering the formed TestReportObject a type of test repository and thereby further incentivizing its routine summarization. Finally, as the number of packages within a system grows, the interrelatedness between tests inexorably increases, portending the arrival of non-trivial memory footprints well before the arbitrary 200K benchmark just simulated.
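For instance (the two-argument form of TestCreate is assumed here to mirror VerificationTest, and TestReport is assumed to evaluate such stored tests on demand):

storedTest = TestCreate[StringReverse["scale"], "elacs"];  (* stored, not yet run *)
TestReport[{storedTest}]  (* evaluation deferred until a report (or summary) is requested *)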

Frequency of testing

The frequency of testing forms part of a modern trend in software development and deployment: continuous integration/continuous deployment (CI/CD). The extra flexibility in these activities requires a similar flexibility in an accompanying testing program; in short, successful CI/CD rests on having versatile and scalable test suites. For large systems, versatile testing often involves decisions about allocating debugging resources, nightly builds, merging multiple test suites, staged releases etc., with each process invariably carrying its own correctness threshold as reflected in corresponding test suites. Being able to organize such bespoke test suites in terms of summaries, then (especially when expressed in responsive dashboards), is poised to play a critical role in modern software creation.
Testing as a first-class citizen
Test suites are traditionally not shipped with software products, instead being restricted to the quality control undertaken prior to official release. We think this is a mistake and worth changing. Routinely including such tests imposes a discipline and accountability on developers that can but improve the robustness of their published software. Further, users benefit from a new standard of transparency born from being able, if so inclined, to peek under the development hood to gain insights into a function's scope, robustness and historical evolution.
A function's documentation is one important marker of functionality but, in our view, equally important is the sum total of a function's unit tests. This is because it is via the thoroughness of an explicit test suite that a package's underlying correctness can be established in a way that is arguably impossible to achieve through any other methodology. A transparent testing regime bundled with any release then represents a form of certificate that can ultimately vouch for the quality and integrity of a WL-based system.
This spirit of transparency and accountability motivates a convention that promotes a healthy, robust WL ecosystem, namely one whereby paclets routinely include a TestFiles directory containing multiple test files that effectively certify a package's functionality. Naturally, CodeAssurance follows this convention through its own TestFiles directory and the ability of end-users to run the same tests as were used to establish its correctness.
View the directory structure of CodeAssurance following installation:
The TestFiles directory of CodeAssurance is made available through a "TestFiles" extension in its corresponding PacletInfo.wl file:
PacletObject[<|
  "Name" -> "CodeAssurance",
  "Version" -> "1.0.0",
  "Description" -> "Assuring the robustness of your paclet-underpinned computational system",
  "Extensions" -> {
    {"Kernel", "Root" -> "Kernel", "Context" -> "CodeAssurance`"},
    {"Documentation", Language -> "English"},
    {"TestFiles", "Root" -> "TestFiles"},
    {"PreparedPaclets", "Root" -> "PreparedPaclets"},
    {"Asset", "Root" -> "Assets", "Assets" -> {
      {"Classic", "PacletLayout/Classic"},
      {"ClassicNotebook", "PacletLayout/ClassicNotebook"},
      {"FileSystemModify.wlt", "TestFiles/FileSystemModify.wlt"},
      {"HalfCorrect", "TestFiles/HalfCorrect.wlt"},
      {"ThirdCorrect", "TestFiles/ThirdCorrect.wlt"},
      {"AllCorrect", "TestFiles/AllCorrect.wlt"},
      {"TestSummaryHandlerNotebook", "TestFiles/Core.nb"}
    }}
  }
|>]
The contents of the TestFiles directory and its subdirectories contain the test files, which typically include tests for all of a package's public functions. For CodeAssurance, it appears as follows:
In the developer paclet of CodeAssurance, for example, the tests associated with the function TestFiles appear as follows:
The deployed paclet of CodeAssurance, on the other hand, is what is finally delivered to end-users. Since it is the .wlt test files that are most commonly used, these are typically the only ones included in the deployed paclet, while DeveloperOnly test files are typically omitted altogether as they include testing for functionality earmarked for future versions.
Hence, with these in place, tests of any type according to the file-based categorization become immediately and conveniently available.
Run and summarize the core tests associated with the function TestFiles.
Out[31]=31
Run and summarize the exception-handling tests associated with the function TestFiles.
Out[34]=34
Bespoke testing can therefore proceed by managing sets of test files as required.
Out[36]=36
Wholesale testing is also available.
Perform all the tests associated with TestFiles.
Out[39]=39
The advantage of using test summaries when broadening to more system-like testing is that footprints do not commensurately grow.
Out[41]=41
Measure the size in kB of the generated test summary object.
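A sketch, assuming the summary generated above is bound to a symbol such as summary; the same measurement applies to the test report further below:

N[ByteCount[summary]/1000]  (* in-memory size in kB *)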
Out[42]=42
Generate a test report of all the tests associated with all the functions of CodeAssurance. Note that for this built-in, an intermediary use of TestFiles is needed to specify the paclet's tests.
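One possibility, sketched here purely for illustration, assumes TestFiles["CodeAssurance"] returns the paclet's test-file paths, which can then be handed to TestReport file by file and the resulting reports merged as in the earlier section:

files = TestFiles["CodeAssurance"];  (* hypothetical call returning the paclet's .wlt paths *)
reports = TestReport /@ files;       (* one TestReportObject per test file *)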
Out[46]=46
Measure the size in kB of the generated test report object.
Out[47]=47
In the case of a test report, then, CodeAssurance's entire test suite has effectively been deposited into the previous object and hence the notebook (note that the ExceptionHandling.wlt file quoted in the TestReportObject header is misleading here since it simply refers to the last test file tested).
A Testing Threshold
The scalability advantages of using TestSummary on the tests of CodeAssurance are already starting to emerge. Earlier, in the section Scaling Up, back-of-the-envelope calculations suggested that certainly by 100K tests, exclusively relying on TestReport is likely to seriously degrade testing usability. But even with CodeAssurance, a relatively modest application with ~100 tests, routinely generating notebook footprints of 0.5 MB in test reports begins to feel unwieldy in comparison to the ~10 kB footprints that test summaries routinely generate. It seems reasonable to conclude, therefore, that significant advantages from integrating TestSummary are likely to kick in well short of the ~100K test threshold mooted during the previous simulations.
Another important scaling measure in organizing test files is the use of .nb test files to store tests before pivoting to their .wlt counterparts when it comes to running or shipping tests. The .nb test files contain useful "front-end" functionality for adding, deleting or editing tests, while their corresponding .wlt files are more efficient to store and run.
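For a single test file the difference in footprint and run time can be gauged directly; the file names here are hypothetical:

FileByteCount /@ {"Core.nb", "Core.wlt"}       (* on-disk footprint of each format *)
First@AbsoluteTiming[TestReport["Core.wlt"];]  (* runtime of the lighter .wlt form *)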
The following table illustrates the extent of the timing and memory differences between .nb and .wlt files for all the test files in CodeAssurance. Note that this table was generated on CodeAssurance's developer paclet, which retains the original .nb test files, unlike the deployed paclet that you may be currently using (hence re-running the generating code will likely not reproduce the results; it is included here for completeness).
Out[61]=61
The time efficiency allows end-users to more readily integrate their application's tests, while the smaller file size ensures a lighter installation footprint.

TestSummary and TestReport work in concert.

While summaries provide a fast, lightweight method for measuring the current correctness of a WL-based system, sometimes a granularity at the level of individual tests is required, and this is available from a test report with only a marginal adjustment of syntax. Suppose, for example, that using test summaries there was interest in the maximum test size across all tests used in CodeAssurance.
Find the size of the largest test in CodeAssurance.
Out[62]=62
Retrieving this maximally-sized test itself, however, is not possible using test summaries since the tests themselves are discarded during the aggregation process. But, by instead generating a test report, the corresponding test can still be readily retrieved.
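A sketch of the report-based retrieval, assuming "size" is taken as the ByteCount of each stored test and that report holds the full test report generated above:

MaximalBy[Values[report["TestResults"]], ByteCount]  (* the largest individual test(s) *)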
Out[63]=63
Hence, wherever the threshold lies at which TestSummary becomes indispensable, continuing to have TestReport available remains important for being able to arbitrarily manipulate and analyze the original tests themselves.
The Future of Testing

An ecosystem of correctness

The by-product of generating "certificates of assurance" in the form of readily accessible test suites is clearly advantageous for software development. It is easy to overlook, however, its potential to give your code wider expression. In the modern era, end-users are likely to consist not only of computational explorers, as important as these are, but also other aspiring developers ... or both. And if your application comes with ready-made test suites, it also comes with an ability to integrate with the test suites of these other systems, which translates into an inherent ability to seamlessly integrate with other systems' functionality, all across time, versions and domains.

Types of Testing Categories & Developing Best Practice.

Decomposing a system's tests into test files within paclet folders provides a basic framework for managing testing complexity, which developers can exploit further by standardizing the types of test files constructed and using these as metadata for developing best testing practice. This tutorial has suggested that a solid starting point for such standardization includes dedicated files for testing a function's 1) core functionality, 2) exception-handling and 3) documentation. There are, however, further types that may naturally extend this categorization. These might include, for instance, dedicated files for error-handling, corner cases, file-system-touching tests, computationally-intensive tests and known bugs, to name a few, along with surely other domain-specific possibilities.
There are also other best-practice questions worth monitoring as test suites evolve in building large-scale WL systems: What is a good upper threshold for the number of tests per notebook? What is the relationship between the "complexity" of a code base and the number of unit tests required to exceed correctness thresholds? What forms of unit tests consistently prove optimal? We anticipate and hope that future WL-based systems constructed in a healthy package ecosystem can start to reveal insights into exactly such questions.