Machine Learning Testing: Survey, Landscapes and Horizons (The Cliff Notes)

I really enjoyed Hee-Meng Foo’s talk at QASIG on 6/10/20. I am thankful he so freely shares his knowledge and insights.  I was also glad to learn about the meetup group he is part of along with Jason Arbon, AI in Testing & Testing AI.  I’ve been following Jason’s AI for Software Testing Association.

I’m slightly amused that his cliff notes is 46 slides for a 37 page paper.  I took an ML course from Andrew Ng which has served me well in grounding me on basic AI terms and acronyms.  However, if you need an intro, you might try Hee-Meng’s other slide deck:
An Introduction to AI in Test Engineering

Talks like this challenge me to recognize how far behind I get in current technology. I appreciate he has a lot of links in the slides for followup.
I’d forgotten about Metamorphic Testing and hadn’t realized its applicability to testing AI based systems.
I knew A/B testing, but hadn’t heard of Multi-Armed Bandit (MAB) Testing until now.
AV systems? (oh – autonomous vehicles with disengagements).

It was also a great review of the testing paradigm as applied to AI: input generation, Testing Properties, Oracles, and Adequacy, with some new additions like Functional model relevance, Non-Functional fairness, and Surprise Adequacy.
Interesting to see bug analysis of AI along with new coverage measures.

Now I just need to find the time to read the paper and its 292 references.

Posted in software testing | Leave a comment

Write fewer automated checks/tests

I attended Alan Page’s QA SIG talk on Adventures in Modern Testing and heard 2 interesting quotes:

“All the testing getting in the way of quality” and
“Help team write fewer tests”.

In talking with Alan after the talk, I discovered I misunderstood which “team” should write fewer tests. Alan was thinking the quality / test team should write fewer or no tests – as those are picked up by developers on feature teams.
Brian Gaudrea at the meeting also mentioned having the test / quality team align better with continuous improvement initiatives that save companies money, to transition from a cost center to a cost reducer.

My own thought is:

How can we help teams write fewer automated checks or tests?

(1) My historical bias from model based testing is: don’t write tests (or automated checks as they are now sometimes called), but generate them!
Academia has been working on automated test generation and automated test suite augmentation for decades; see for example: An Orchestrated Survey on Automated Software Test Case Generation.
We need to encourage all of development to discover ways to generate more effective tests, while writing fewer hand-scripted automated checks.  Besides model based testing, there are static analysis tools, symbolic execution tools, and many other tools that auto-generate tests.
We need to understand how to generalize a test issue so tests aren’t written individually.  A simple example, from John Lambert, was his realization that many devs weren’t verifying simple things like null object values as possible inputs.
Did he write a bunch of tests with nulls?  NO
He wrote a simple scanner that determined when a method could accept an object, then generated and ran the null object test.  He found lots of issues.
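
The idea behind such a scanner can be sketched in a few lines. This is only an illustration, not John Lambert's actual tool: the function names `shout` and `safe_shout` are hypothetical, and the policy of treating `ValueError` as deliberate input validation is an assumption made for this example.

```python
import inspect

def shout(text):
    # No null check: passing None raises AttributeError.
    return text.upper()

def safe_shout(text):
    # Deliberate validation: a clear error for a null input.
    if text is None:
        raise ValueError("text must not be None")
    return text.upper()

def scan_for_null_issues(functions):
    """Call each function with None for every parameter and record surprises.

    Assumption: all parameters can be passed positionally, and a ValueError
    means the function validated its input on purpose.
    """
    issues = []
    for func in functions:
        parameter_count = len(inspect.signature(func).parameters)
        try:
            func(*([None] * parameter_count))
        except ValueError:
            pass  # treated as intentional validation, not an issue
        except Exception as exc:
            issues.append((func.__name__, type(exc).__name__))
    return issues

issues = scan_for_null_issues([shout, safe_shout])
print(issues)
```

One generated check per method replaces a pile of hand-written null tests, which is the point of the story.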
More recently, I learned about the difficulty of counting characters, glyphs, or Unicode code units (UTF-*) for emoji.
Should you write a bunch of emoji tests?   NO
You should find a way to add emoji to your current string character tests.
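
To see why the counts disagree, here is a small sketch that measures the same string three ways. The helper name `count_units` is made up for this example; the two sample strings are a single thumbs-up emoji and a family emoji built from three emoji joined by zero-width joiners.

```python
def count_units(text):
    """Return the length of text in code points, UTF-16 units, and UTF-8 bytes."""
    return {
        "code_points": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }

thumbs_up = "\U0001F44D"
# Man + ZWJ + woman + ZWJ + girl: usually rendered as one glyph.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(count_units(thumbs_up))  # 1 code point, 2 UTF-16 units, 4 UTF-8 bytes
print(count_units(family))     # 5 code points, 8 UTF-16 units, 18 UTF-8 bytes
```

Adding strings like these to an existing pool of string test data exercises every length check you already have, without writing emoji-specific tests.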

(2) I think the future is humans writing fewer automated checks.  Automated check writing will instead be done via artificial intelligence.
With the AI tool sets and testing data sets we have today, it will not be long before most automated checks are created by AI, not by humans.
Alan Page also mentioned the GTAC 2007 talk by Ed Keyes – Sufficiently Advanced Monitoring is Indistinguishable from Testing.   Perhaps that can be combined with an automated adversary, similar to generative adversarial networks, or GANs.   The AI generates input based on previous logs and monitoring results, and the monitoring tries to determine if the system is being fooled (increases the error rate).   This is a more advanced form of simple fuzz testing (generated inputs looking merely for simple failures).

I think all of the above can help with Alan’s motto:
“Accelerate the achievement of shippable quality”


Review of Introduction to Combinatorial Testing

I received a copy of this book from ACM SIGsoft Software Engineering Notes (SEN) in exchange for reviewing it.

Introduction to Combinatorial Testing
 Written by D. Richard Kuhn, Raghu N. Kacker, and Yu Lei
and published by CRC Press, ©2013, (paperback), 978-1466552296, 319 pp., $71.95.

The book is the most comprehensive introduction to the technique of combinatorial testing I’ve seen.  It’s an interesting amalgamation of academic and very practical test guidance for using combinatorial testing.  The book is actually a compendium from several authors. The examples can require a bit of study to actually glean what the authors are expressing.  Some chapters could easily be skipped by practitioners focused just on using the technique.

Most chapters have figures, tables of inputs, outputs, or tests, and sometimes pseudocode.  Each chapter ends with review questions and answers which help the reader judge their understanding of the material.

Even if the literature is inconsistent, it would have been nice if the authors had chosen a single nomenclature across the book, instead of one for each chapter.  For example, Chapter 1 defines “covering array CA(N, n, s, t)” while Chapter 14 defines “a fixed-value covering array denoted by CA(N, v^k, t)”.  Similarly, Chapter 14 states “Methods for constructing CAs can be put into three categories: (1) algebraic methods … (2) metaheuristic methods … (3) greedy search methods”, while Chapter 15 tells us “covering array construction can be classified as either computational or algebraic”, and later breaks algebraic into “computing a mathematical function” and “recursive construction”.

The first four chapters introduce the concept and illustrate combinatorial testing. Chapter 1 introduces a lot of foundation quickly which may be overwhelming for those not used to formalism and mathematics.  Appendix A also provides some of the mathematical background.  Chapter 2 gives examples and rationale, along with Appendix B’s empirical data on software failures.  Chapters 3 and 4 illustrate configuration testing and input testing with the free ACTS tool from NIST. ACTS is also further described in Appendix D.
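
For readers new to the concept, a covering array of strength t is a set of tests in which every combination of values for any t parameters appears at least once. The sketch below is not from the book or from ACTS; the function name `uncovered_combinations` and the example data are assumptions made for illustration. It checks a candidate test set and reports what is missing.

```python
from itertools import combinations, product

def uncovered_combinations(tests, domains, strength=2):
    """List every strength-way value combination that no test covers."""
    missing = []
    for columns in combinations(range(len(domains)), strength):
        seen = {tuple(test[c] for c in columns) for test in tests}
        for values in product(*(domains[c] for c in columns)):
            if values not in seen:
                missing.append((columns, values))
    return missing

# Three binary parameters: 4 tests cover all pairs, versus 8 for all triples.
domains = [(0, 1), (0, 1), (0, 1)]
pairwise_tests = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]

print(uncovered_combinations(pairwise_tests, domains, strength=2))
print(len(uncovered_combinations(pairwise_tests, domains, strength=3)))
```

The same four tests leave half of the 3-way combinations uncovered, which is the trade-off the book's coverage discussion revolves around.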

Chapter 5, Test Parameter Analysis, uses classification tree method (CTM) for this critical step before being able to do combinatorial testing.  It also addresses the number of tests versus known error patterns. It points to the practical need to identify “missing, infeasible, and ineffectual combinations”, and points out “There are few bad tests – but a lot of unproductive ones”.

Chapter 6, Managing System State, introduces yet another notation, direct product block (DPB) which is more understandable to computers than to humans.  This input mechanism for the commercial TestCover tool helps illustrate multiple models for testing UML state machines.

Chapter 7 on measuring combinatorial coverage seems more theoretical (proposed measures), than practical (which measure helps most in empirical testing).  It also makes reference to the Best Choice Strategy papers that discounted the effects of fault masking as described elsewhere in this book.

Chapter 8, Test Suite Prioritization, presents the method and empirical results from ordering tests based on combinatorial covering.  The Combinatorial-based Prioritization for User-session-based Testing (CPUT) tool is described for creating test suites from user logs of real behavior.

Chapter 9 gives practical guidance about when to choose random values versus covering arrays.  Chapter 10 describes covering subsets for the factorial combination of sequences, but without evidence of the error finding effectiveness of this approach.

Chapter 11, Assertion-based Testing, and Chapter 12, Model-based Testing, attempt to address the oracle problem and are not specific to combinatorial testing.  If you go to the trouble of using a symbolic model verifier (SMV), you generally can do advanced test generation, rather than just predicting output from combinatorial input as described here.

Chapter 13 “introduces the fault localization problem, describes the basic framework for fault localization using adaptive testing, and discusses how some of these approaches might be implemented.”  Unfortunately, no empirical data comparing these various approaches, or for delta debugging, are given to guide the practitioner in which to choose.

Chapter 14 gives a nice bit of history and helps distinguish the sometimes-confused orthogonal arrays from covering arrays.

Chapter 15 is background for those who want to understand the algorithms used in the tools.

While Appendix C gives pointers to several tools, I would like to have seen more pointers to practical materials, such as lists of available tools, and the Domain Testing Workbook by Cem Kaner, which complements the multiple, short introductions given to equivalence classes, boundary value analysis, etc.  The examples are sometimes small and can thus be misleading, e.g., Table 6.2 “valid calendar dates” does not appropriately account for century years such as 1900 and 2100, which are not leap years.
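
The century rule that the calendar example misses is short enough to state as code. This is the standard Gregorian rule, shown here as a sketch with a made-up function name and cross-checked against Python's `calendar.isleap`.

```python
import calendar

def is_leap_year(year):
    """Gregorian rule: divisible by 4, except centuries not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

for year in (1900, 2000, 2100, 2024):
    print(year, is_leap_year(year), calendar.isleap(year))
```

A test model for valid dates needs 1900 or 2100 as well as 2000 among its year values, or the century branch is never exercised.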

The authors have done little to reduce the burden on the reader of having different nomenclatures in different chapters, but overall I liked the book, and it provided me greater depth and breadth of understanding combinatorial testing.


Reliable Backup is hard to do

I believe in triple copies after working with Cosmos at Bing and other similar services and reading recommendations and descriptions of backup schemes.

So at home, we have 2 disk copies and a cloud copy of every file.
So I thought I was covered.   Not quite.
We have a lot of old static files (photos and videos and old docs) that we keep on a read only disc and a backup on a disk kept in our basement.
For active files, we used a mirrored drive (two 3TB disks) from Buffalo.

Well, the Buffalo HD-WL6TU3R1 device failed.  It wouldn’t boot.  When turned on, both disk access lights blink red for a few seconds and then shut off.   The manual doesn’t describe this diagnostic code and contacting Buffalo was useless. They just told us we are out of warranty.

My recommendations:

  • Don’t buy Buffalo
  • If you do buy Buffalo, toss it immediately after the warranty period, because it is useless.

My suspicion is that the controller in the device, a Single Point of Failure, failed.

No problem, I thought. I removed the individual drives and, using my Coolmax multifunction converter, temporarily hooked them up individually to the PC via USB.  They were readable, but it turns out that about a third of the data, when being copied, produced
“ERROR 87 (0x00000057) Copying File . . . The parameter is incorrect”
Tried running chkdsk to repair the disk, but it failed with several errors.

So now 2 of our 3 copies are incomplete.   No problem, we use Backblaze.

In general, I love Backblaze and have recommended it to all of my family and friends because it is truly unlimited backup, reasonably quick, doesn’t seem to slow down our systems, and works smoothly, quietly, automatically behind the scenes.   The annual price is extremely reasonable for backing up our current 6TB of data.

I have previously restored from Backblaze a few small files while I was travelling and realized the file I needed was on a computer back home.
I also previously restored my 200GB music library with a series of restores over just a couple of days using downloaded zip files.

We replaced our dead Buffalo 3TB mirror with an 8TB mirror (two 8TB disks), a WD MyBook Duo.   Now we have room to grow as GoPro 4K videos dump 27GB/hour (instead of 9GB/hour for HD).

But now we had to restore the 2TB of data we had mirrored.
Backblaze provides a very reasonable alternative: for a refundable $189, they FedEx us a disk.  We ship the disk back within 30 days and they refund the $189.  We just pay shipping back!

Only 2 problems:

  • It took over a week for Backblaze to “gather” the data and then to “build” the disk, before shipping it. Backblaze recommends doing online .zip restores for files you need during the week.   We had to do a few.
  • Backblaze keeps your “data”, but not the meta-data, specifically the timestamps on the files. So all of your files lose their date and become created, modified, and accessed at exactly the same time – when the disk is built to send you.

I find the lack of timestamps screws up a lot of things for us.  We can’t tell when a picture or video was taken unless it happens to have it inside the metadata of the file (and most of our oldest pictures do not).    Many of our documents for business have different versions and to know when a tax document, corporate motion, or other such file was created or modified is useful.

So Backblaze gives us our data, but not our timestamps.   Beware.

Ultimately we chose a multi-step solution:

  • Restore those files I had made a 4th local offline copy of 8 months ago.
  • Restore those files we could read off the broken Buffalo raw disks.
  • Restore those files we really cared about using Backblaze ZIP files instead
    (750GB downloaded over 3 days in multiple 100GB downloads)
  • Restore any missing files (about 15% of all of them) using the Backblaze data, with wrong timestamps.

Now, I think I need to create a periodic job that dumps into a file all of the timestamp data, so Backblaze will back that up.   Then I can use a program to reset the timestamps to those that I have logged after I’ve done a restore.   Backblaze can be a hassle.
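
A minimal sketch of such a job is below, assuming a JSON manifest file and the hypothetical function names `dump_timestamps` and `restore_timestamps`. It records only access and modification times, because `os.utime` cannot set a file's creation time; restoring creation times would need a platform-specific tool.

```python
import json
import os

def dump_timestamps(root, manifest_path):
    """Record access and modification times for every file under root."""
    records = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            info = os.stat(path)
            records[os.path.relpath(path, root)] = [info.st_atime, info.st_mtime]
    with open(manifest_path, "w", encoding="utf-8") as handle:
        json.dump(records, handle)
    return len(records)

def restore_timestamps(root, manifest_path):
    """Reapply the recorded times to files that still exist under root."""
    with open(manifest_path, encoding="utf-8") as handle:
        records = json.load(handle)
    restored = 0
    for relative_path, (atime, mtime) in records.items():
        path = os.path.join(root, relative_path)
        if os.path.exists(path):
            os.utime(path, (atime, mtime))
            restored += 1
    return restored
```

Run `dump_timestamps` on a schedule so the manifest itself gets backed up, then run `restore_timestamps` against the restored folder. The manifest stores relative paths, so the restore target does not have to be the original drive letter or mount point.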



Software Testing: A Research Travelogue by Orso and Rothermel

An interesting survey on updates to software testing was given as an ACM SIGSOFT webinar:
“Software Testing: A Research Travelogue,” Alessandro Orso, Gregg Rothermel, Willem Visser.
Based on their paper “Software Testing: A Research Travelogue (2000-2014)” by A. Orso and G. Rothermel in Proceedings of the 36th IEEE and ACM SIGSOFT International Conference on Software Engineering (ICSE 2014) — FOSE Track (invited).

This 10 page paper is followed by 210 references! Most encouraging to me was the slide on Empirical Studies & Infrastructure.   I fully agree that Testing is heuristic and thus must be empirically evaluated:

“• State of the art in 2005: study on 224 papers on testing (1994–2003)
None 52%, Case studies 27%, Experiments 17%, Examples 4%

Things have changed dramatically since then 

  • Empirical evaluations are almost required
  • Artifact evaluations at various conferences”

In their conclusion they also stated something I strongly believe in:

“Stop chasing full automation

  • Use the different players for what they are best at doing
    • Human: creativity
    • Computer: computation-intensive, repetitive, error-prone, etc. tasks”

I hope all professional testers are aware of the many topics they touched on:

  • Automated Test Input Generation using Dynamic Symbolic Execution,
    Search-based Testing, Random Testing, Combined Techniques.
  • Combinatorial Testing, Model-Based Testing, and Mining/Learning from Field Data
  • Regression Testing – selection, minimization, prioritization
  • Frameworks for Test Execution and Continuous Integration

Thoughts on Rimby’s How Agile Teams Can Use Test Points

At the February 3 SeaSpin meeting, Eric Rimby provided a discussion-provoking thought experiment around what he termed “Test Points” (analogous to Story Points). I didn’t quite have time to snapshot his slide, but I think he finally defined them something like:

“The number of functional test cases adequate to achieve complete coverage of boundary values and logical branches strongly correlates with effort to develop.”

He counts functional test cases specified as part of backlog grooming for a user story as “test points”.

While I’ve always had issues with counting test cases (e.g. “small” versus “large” tests and Response to How many test cases by James Christie), he at least restricted the context in which the test cases he was counting were created.  He presumed a team trained in a particular test methodology for doing boundary values and logical branches (I suggested Kaner’s Domain Testing Workbook), and that they compared notes over time. Another audience member afterwards indicated that Pex (or other tools) could probably auto-generate many of these cases. Similar to how a scrum team should get more uniform in ascribing story points over time, Eric expects the number of functional test cases estimated by various team members for a story would become sufficiently uniform over time.

While I disagree with many of the suppositions he made during his talk, I agree that tracking the number of functional test cases estimated for a story might be a useful thing to track. Whether it correlates to anything remains to be measured. However, I think just getting teams to be better about their upfront Acceptance Test Driven Development (ATDD) as part of story definition can only help.

Abstract from How Agile Teams Can Use Test Points:
Test points are similar to story points. They can be used to estimate story size, track sprint progress, normalize velocity across teams, among other things. Test points have some advantages that story points do not. They could be used instead of, or alongside, story points.


What are Synthetics?

I attended PNSQC Birds of a feather session “Why do we need synthetics” by Matt Griscom. It was advertised as “The big trend in software quality is towards analytics, and its corollary, synthetics. The question is: why and how much do we need synthetics, and how does it replace the need for more traditional automation?”
I spoke with Matt briefly to understand what he meant by synthetics, because I thought it was a rare, relatively unused term.

I learned a lot from other participants at the session.   First, New Relic is trying to stake out the term!   (They describe their test monitors as a Selenium-driven platform that sends data using turnkey test scripts.)

Second, I attended a great talk, which I highly recommend, by a former colleague from Bing:
Automated Synthetic Exploratory Monitoring of Dynamic Web Sites Using Selenium by Marcelo De Barros, Microsoft.

So synthetics are mainstream.   So what are synthetics?   What are not synthetics?   I had a hard time parsing synthetics as a corollary of analytics. Still do.
For me synthetics are tests that run in production (or production-like environments) and use production monitoring for verification.
Analytics are just a method, in this context, for doing monitoring.   To me synthetics are artificial (synthetic) data introduced to the system.

I thought synthetics were almost always automated and A/B testing would be a type of testing where synthetics wouldn’t apply.   I was proven wrong on both with a single example: using Amazon’s Mechanical Turk to pay people to choose whether they like A or B!   This is manual testing and synthetic, as it is not being done by the real user base.

Maybe the problem with “synthetics” is the same problem I have with “automation”.   Even “test automation” isn’t very specific, and means many things.   I’m not sure if “synthetics” is supposed to mean synthetic data (Wikipedia since 2009), synthetic monitoring (Wikipedia since 2006 – the description also uses “synthetic testing”), or something else.


How Data Science is Changing Software Testing – Robert Musson talk

I enjoyed Robert Musson’s recent presentation How Data Science is Changing Software Testing and recommend you watch it or at least read Robert’s Presentation Slides which don’t do it full justice, but should tease you.
As the abstract stated: It will describe the new skills required of test organizations and the ways individuals can begin to make the transition.

I worked with Bob a few times while at Microsoft, and he truly was one of the original Data Science testers for the past decade doing Data Analytics.   He says (37:50 into video) the tide has turned recently and he has “seen more progress in last 6 months than seen in past 10 years.”

So now I need to learn

  • Statistics, e.g., r-value, p-value, Poisson and Gamma distributions
    Homogeneous (non-changing) or non-homogeneous (changing) Poisson for reliability measurements to get me used to time analysis.
  • R language (open source version of S).
    Object oriented with many packages to do exploratory data analysis and quick linear models.
  • Python for easier data manipulation including building dictionaries and packages for linear algebra

So I can prepare for the mindset change.
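
As a small taste of the statistics item above, here is a sketch of the homogeneous Poisson model applied to reliability. It is not from Bob's talk; the function names and the failure rate are assumptions for illustration. With a constant failure rate, the chance of seeing k failures in a time window follows the Poisson formula, and reliability is the chance of seeing none.

```python
import math

def poisson_pmf(k, rate, duration=1.0):
    """Probability of exactly k failures in the window, at a constant rate."""
    expected = rate * duration
    return math.exp(-expected) * expected ** k / math.factorial(k)

def reliability(rate, duration):
    """Probability of zero failures over the window: exp(-rate * duration)."""
    return poisson_pmf(0, rate, duration)

# Assumed example: 0.1 failures per hour, observed over a 10 hour run.
print(reliability(0.1, 10))
print(sum(poisson_pmf(k, 0.1, 10) for k in range(3)))  # at most 2 failures
```

In the non-homogeneous case the rate changes over time, so `rate * duration` would be replaced by the integral of the rate over the window.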

Mindset change is to one of information discovery vs. bug discovery

An audience member asked how to learn, and Bob recommended for many courses, including statistics courses.  He specifically called out
Model Thinking – Scott Page – U. of Michigan.
I love models, but I might also start with


Testing Magazines

Last week I listened to the webinar State of Software Testing. Tea Time With Testers approached testing experts Jerry Weinberg and Fiona Charles to review the results of their 2013 ‘State of Software Testing’ survey.   Mostly the experts indicated the questions were poor and thus the results irrelevant.   But one side comment caught my attention since they both agreed:

every tester should read at least one test magazine a month.

While I do that, I asked a colleague recently, and he said no and wasn’t even aware of the many free online test magazines available.  Thus, this post to list several test magazines to choose from.   Note, I consider this an addition to the poll I’ve often heard, which I also agree with: have you read at least 1 book on software testing and, better, have you read at least 1 book on software testing every year.

So now my belief is that professional testers, as part of their continuing education in their profession, should read at least 1 book a year and 1 magazine a month about software testing.  Maybe follow some blogs as well; many of the magazine sites have associated blogs.

Since coming to, I discovered two of these magazines due to posts by Reena Mathews.     Not in a priority order, just a count:

  1. Tea Time With Testers
  2. Automated Software Testing
  3. Better Software
  4. Professional Tester
  5. Testing Trapeze
  6. Software Test & Quality Assurance
  7. Testing Experience
  8. Testing Circus
  9. Testing Magazine

Non Free magazines:

  10. ASQ Software Quality Professional

Blog examples:

There are many sites that list lots of software testing blogs.


Putting Lean Principles to Work for Your Agile Teams – Sam McAfee’s talk

Interesting talk, Putting Lean Principles to Work for Your Agile Teams, by Sam McAfee at the Bay Area Agile Leadership Network.

While as one commenter put it, nothing new, it was interesting to me to see it strung together and the one-year transformation.
Basically starting with a team that followed many Agile practices he described changes using Lean that transformed or even eliminated the need for some of the Agile practices.

Initial agile practices: Collocated teams with daily stand ups.  Pair Programming 95% of the time amongst the 16-17 engineers.  2-week sprints with shipping at the end.  Test Driven Development (TDD) & Continuous Integration (CI).  Engineers estimate in story points.

With all this, they still had stories stuck or blocked for long periods of time.
So apply the Theory of Constraints (book: The Goal by Eli Goldratt, or the more recent IT-oriented version: The Phoenix Project: A Novel about IT, DevOps, and Helping Your Business Win by Gene Kim, Kevin Behr, and George Spafford).
Also need more visual progress, so use Kanban board.
Kanban — use buffers to throttle upstream demand and reduce cycle time.   Reduce Work In Progress (WIP):   Ready, Doing, Done
Bottleneck was sometimes deployment.  They had a manual deployment process.
So use Continuous Deployment  (CD)— automate deployment — Reduce cycle time of deploying value to customers as close to zero as possible.

dev -> continuous integration —> test in cloud —>  Deploy & Monitor (system health, . . )

Note: Theory of Constraints assumes a single, stable bottleneck.

With Knowledge work, the bottlenecks bounce around.
Kanban more systematically lays down constraints.

Typically delay in handoff between roles or back & forth between roles.

==> use tighter feedback loops to reduce stuck stories.

CD allows a change to the fixed-length sprint structure. Sprint planning meetings continue, but fixed sprints become superfluous because of continuous deployment.

Not all features released produced the Key Performance Indicators (KPIs) they wanted.  So use Innovation Accounting from book The Lean Startup.
Use the build-measure-learn loop to reduce the amount of uncertainty in new product development or process innovation.
Build smallest KPI change you can ship and run experiments.

Small lightweight experiments could be costly using full TDD and pair-programmed development.  For short-lived prototyping code, may not need TDD.   Also consider pairing a Designer with an Engineer to create experiments.   Use experiments to validate business needs.

A change from Agile to Lean was dropping story points in favor of Statistical Process Control (SPC) à la Shewhart and Deming.
Assume most stories flow normally, but analyze why outliers fall outside the range.  What makes those work items special?   Estimation time focused on risky (outlier) areas.
Team used electronic Kanban board (not Sam’s first choice of way to do it) that collected data, which was fortuitously used for SPC control chart.
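
The talk notes do not say which kind of control chart the team used. As one common choice, an individuals (XmR) chart puts the limits at the mean plus or minus 2.66 times the average moving range. The sketch below applies that to made-up cycle times; the function names and data are assumptions for illustration.

```python
def control_limits(values):
    """Individuals (XmR) chart limits: mean +/- 2.66 * average moving range."""
    mean = sum(values) / len(values)
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    average_range = sum(moving_ranges) / len(moving_ranges)
    upper = mean + 2.66 * average_range
    lower = max(0.0, mean - 2.66 * average_range)  # cycle time cannot be negative
    return lower, upper

def find_outliers(values):
    """Return (index, value) for every point outside the control limits."""
    lower, upper = control_limits(values)
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical cycle times in days for ten finished stories.
cycle_days = [3, 4, 3, 5, 4, 3, 4, 15, 4, 3]

print(control_limits(cycle_days))
print(find_outliers(cycle_days))
```

Only the 15-day story falls outside the limits, so that is the one worth a conversation about what made it special.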

Not all was rosy.  CEO would make urgent requests to reprioritize.   Which slowed original work down.   How to do the tradeoff?    Measure Cost of Delay.
Compare cost of delay for what you are doing now vs. what CEO wants now.

Cost of Delay: Quantifying the impact of delay on total life-cycle profits for each of your projects.  Delay typically shifts when start recognizing revenue, without shifting end of life.
How to get costs?  Spreadsheet of conversion rates, traffic, etc. from Finance.
Assumes you know cost of production:  Engineering time spent and cost of payroll.
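
A back-of-the-envelope version of that comparison can be sketched as follows. The model is deliberately simple and the numbers are invented: a feature earns a flat weekly profit from launch until a fixed end of life, so every week of delay is a week of profit lost. The function names are hypothetical.

```python
def lifecycle_profit(launch_week, end_of_life_week, weekly_profit):
    """Total profit when earning a flat amount from launch to end of life."""
    selling_weeks = max(0, end_of_life_week - launch_week)
    return selling_weeks * weekly_profit

def cost_of_delay(launch_week, delay_weeks, end_of_life_week, weekly_profit):
    """Profit lost by launching later while end of life stays fixed."""
    on_time = lifecycle_profit(launch_week, end_of_life_week, weekly_profit)
    delayed = lifecycle_profit(launch_week + delay_weeks, end_of_life_week, weekly_profit)
    return on_time - delayed

# Hypothetical figures: what a 2 week slip costs each piece of work.
current_work = cost_of_delay(10, 2, 110, 5_000)
ceo_request = cost_of_delay(10, 2, 60, 12_000)

print(current_work, ceo_request)
```

In this made-up case, delaying the CEO's request costs more than delaying the current work, so the reprioritization would be justified. The real inputs would come from the Finance spreadsheet.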

Quantify Risk using data — not intuition — to model, and validate, risk factors.
Books by Hubbard : How to Measure Anything  & Failure of Risk Management.
Quantifying Risk of : Traffic -> convert user —> paid user retention
“all other risk” (without data) is just hand waving.
Use Monte Carlo simulations — which parts are most sensitive to Risk.
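
A minimal sketch of that kind of simulation is below. It is not Sam's model: the funnel (visitors, conversion, retention), the uniform ranges, and the price are all invented for illustration. Each factor is sampled on its own while the others stay at their midpoints, and the spread of the resulting revenue shows which factor the outcome is most sensitive to.

```python
import random

# Assumed uncertainty ranges for each funnel step (low, high).
RANGES = {
    "visitors": (8_000, 12_000),
    "conversion": (0.01, 0.05),
    "retention": (0.5, 0.9),
}
PRICE = 20.0  # assumed revenue per retained paying user

def simulate(vary, trials=5_000, seed=1):
    """Sample the factors named in vary; hold the rest at their midpoints."""
    rng = random.Random(seed)
    results = []
    for _ in range(trials):
        values = {
            name: rng.uniform(low, high) if name in vary else (low + high) / 2
            for name, (low, high) in RANGES.items()
        }
        results.append(values["visitors"] * values["conversion"] * values["retention"] * PRICE)
    results.sort()
    return results

def spread(sorted_results):
    """Width of the middle 90% of outcomes (95th minus 5th percentile)."""
    count = len(sorted_results)
    return sorted_results[int(0.95 * count)] - sorted_results[int(0.05 * count)]

sensitivity = {name: spread(simulate({name})) for name in RANGES}
most_sensitive = max(sensitivity, key=sensitivity.get)
print(sensitivity, most_sensitive)
```

With these assumed ranges the conversion rate dominates, so that is where better data would reduce the most risk.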

Sam’s summary:

Continuous Deployment
Optional pair programming (E/E or D/E pairs)
Optional TDD & Continuous Integration
Use experiments to validate business needs
Use historical data to provide estimates and assess risks.

Change daily stand ups  from what I did, doing, and am blocked on
to talk about flow of work.

Moving KPIs in right direction.
How to make many small bets.

But don’t believe what I wrote!   Watch the video with the graphics for more detailed descriptions.
Video of Sam McAfee’s Putting Lean Principles to Work for Your Agile Teams talk.
To visit Bay Area Agile Leadership Network, go here:
