Introduction
Model-based software engineering (MBSE) involves using high-level abstractions to guide discussions, code generation, and validation [1]. In MBSE, models serve as blueprints for developers and automated tools to enhance productivity. For an example of model-based reasoning in software engineering, see our next section.
When faced with large and complex problems, humans use heuristic cues to lead them to the most important parts of a model [2]. Such cues are essential if humans are to reason about large problems. But they can introduce their own errors:
...people (including experts) are susceptible to “automation bias” (involving) omission errors—failing to take action because the automated system did not provide an alert—and commission errors (where they incorrectly read model output) [3].
This is troubling since there is an increasing legal demand for humans to certify that models are performing correctly [3]. When humans are required to audit a model that they cannot understand, this leads to “rubber stamping” (model approval without proper consideration) and to “legitimizing the use of faulty and controversial (models) without addressing (their fundamental issues)” [3].
Accordingly, we explore interactive tools where human knowledge and AI interact to find the factors that most change model conclusions. Our iSNEAK tool asks humans the fewest questions needed to find the most desirable solutions. Working with iSNEAK, we have found that a small set of easily computed “hints” can generate an approximate partial ordering of the candidate solutions. Using the orderings generated from those hints, humans can guide a simple binary chop procedure over nested bi-clusters built in a space created through dimensionality reduction (see details below). In this way, after offering only a handful of opinions, humans can steer the search toward solutions they find acceptable.
Two concerns with this procedure are:
AI inference can confuse humans; i.e. tools like iSNEAK could confuse people either by overwhelming them with too many questions or, at the end of the inference, by proposing a solution that makes no sense to them.
Humans’ advice can confuse AI tools like iSNEAK; i.e. the guidance offered by humans could lead our AI algorithms to sub-optimal conclusions.
The rest of this paper explores these concerns over a set of 16 models of increasing size and complexity (with up to 1,000 variables). These models address a range of problems including software avionics; flight guidance; management decisions for agile software projects; and various software feature models, including one electronic billing system.
Using these case studies, our experimental section explores the two issues raised above:
RQ1: Does iSNEAK-style AI confuse humans? For our models, iSNEAK-style AI does not confuse humans since it only asks a handful of questions and returns human-acceptable solutions;
RQ2: Does human advice confuse iSNEAK-style AI? For the models studied here, advice from humans does not confuse AI since iSNEAK’s solutions, obtained via human-in-the-loop reasoning, out-perform the prior state-of-the-art in the optimization of software models.
We justify the use of partial orderings and tools like iSNEAK to solve the information overload problem. This problem is particularly acute in scenarios where humans are being asked their preferences for one solution over another. In that situation, information overload means humans can make mistakes (as well as needing more time to accurately comment on examples).
Overall, we say that the contribution of iSNEAK is that it lets humans explore complex problem spaces in far less time and with far less effort, thereby avoiding information overload and its associated problems. To help with the reproduction and improvement of these results, all our data and scripts are on-line (see the Data Availability Statement at the end of this paper).
The rest of this paper is structured based on the advice of Wohlin et al. [4] on empirical software engineering. We present our methods in Section III and the comparison baselines in Section II-F. Our case studies are defined and explained in Section IV, and our results appear in Section V. See also Section VI-A for a discussion of the threats to the validity of our conclusions.
A. Current Limitations
Before going on, we list some limitations of iSNEAK. One limitation of the current implementation is that it assumes binary attributes (n-valued attributes are handled with one-hot encoding based on equal-width binning). It would be useful to extend this code base to arbitrary-precision numbers. Another, more fundamental, limitation is that in its current form, iSNEAK is only defined for optimization problems, not regression, classification or generative tasks. We think more research could remove that limitation for regression tasks (e.g. a regression problem can be viewed as an optimization problem where an optimizer explores different models to minimize the difference between predicted and actual values). But classification and generative tasks remain an open issue.
Background
A. Model-Based Software Engineering Is Widely Used
We study model-based methods since they are widely used in software engineering. For example, in the cyberphysical realm, it is standard practice [5] to deliver hardware along with a simulation system (often written in Simulink or C). Developers use that simulation during verification.
For another example, in requirements engineering, software product line researchers explore a space of inter-connected features in a software design [6], [7], [8], [9], [10], [11], [12], [13], [14]. Such feature models can be reverse engineered from code, then used to debate proposed changes to a system [15].
Also, proponents of domain-specific languages argue for a systems-development life-cycle that begins with (a) defining a domain language [16], then (b) modeling & executing that system via that language. Low-code proponents then add some visual editor to allow for rapid changes to the model [17].
Further, software safety researchers generate two models: one for the system under study and another for the safety properties. Automatic tools then check for violations of safety in the systems model [18], [19].
Furthermore, models can be used for the generation of runtime systems. This is especially useful for automotive and embedded systems that make extensive use of models (e.g. generating executables from Simulink [20]).
Lastly, researchers in search-based software engineering (SBSE) explore models with sets of goals that may be competing. We have applied SBSE to models of cloud-based architectures [21] (looking for ways to reduce the cost of the overall design); NASA spaceships [22] (looking for cheaper satellites that can handle more error conditions); video encoders [8], [23] (looking for better and faster compression control parameters); software process models (looking for more functionality, at less cost) [24].
B. Interactive Model-Based Software Engineering
Formally, then, we can define the problem we are trying to solve as such:
Given a model M that contains n variables, c constraints and g goals, how do we find a valid candidate solution that optimizes for the specific goals of M while also maximizing user satisfaction (i.e., human input on which options they find important is respected)?
A subgoal is to minimize the amount of human interaction (i.e., to ask stakeholders as few questions as possible).
A valid candidate solution is defined as a configuration of the given model that respects that model’s constraints. When dealing with very large models with many constraints, the space of candidate solutions is taken to be all possible configurations of the model, irrespective of whether they satisfy all constraints. This distinction is necessary because many of the methods studied in this arena use evolutionary computation algorithms, which are notoriously inefficient when dealing with highly constrained search spaces. As we will show in our results section, those algorithms struggle to generate valid candidate solutions, instead spending the large majority of their compute time reasoning over invalid space.
The reason for that is that many of these models contain what are called cross-tree constraints: non-trivial constraints where conditions on one configuration choice depend on different configuration choices elsewhere in the model (e.g., in the SCRUM model, if you choose for teams to have no Project Manager, you cannot then choose for the Project Manager to lead the Daily meetings). For a small worked illustration, see the sketch below.
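To make that example concrete, the cross-tree constraint above can be written as an implication and, for a SAT solver, as a single clause. A minimal sketch (the variable names are ours, not part of the published model):

# "no Project Manager" implies "the Project Manager cannot lead the Daily meeting":
#     no_pm -> not pm_leads_daily,  i.e. the CNF clause (not no_pm) OR (not pm_leads_daily)
def satisfies_cross_tree(no_pm: bool, pm_leads_daily: bool) -> bool:
    return (not no_pm) or (not pm_leads_daily)

assert satisfies_cross_tree(no_pm=True, pm_leads_daily=False)
assert not satisfies_cross_tree(no_pm=True, pm_leads_daily=True)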
As with any optimization problem, a key concern is to report feasible solutions (where infeasible solutions violate domain constraints). A naive random model generation process is particularly prone to generating infeasible solutions. To solve that, we use a theorem prover to apply domain constraints and find the feasible within the infeasible (see our use of PicoSAT in Section II-C). After that, all of the processing described in this paper occurs within the feasible space.
C. Models Can Confuse People
Having stated our goal above, this section lists some problems with achieving that goal. Specifically, we argue that (a) complex models can confuse and overwhelm stakeholders; and (b) standard AI tools are not enough to address that confusion.
Figure 1 is a visual representation of the Mendonca et al. [6]’s software product line model of SCRUM (agile project development).
Figure 1. SCRUM Feature Model. Such feature models define features and their dependencies, typically in the form of a feature diagram plus left-over (a.k.a. cross-tree) constraints. Numbers on leaves show the numbers of constraints in that part of the model. Not shown here, for space reasons, are the “cross-tree constraints” that connect choices in different subtrees.
The feature model of Figure 1 contains 128 project management options and 250+ constraints (e.g., if sprints last two weeks, then each individual task must take less than 10 days of programming).
We say that a candidate is one setting to all the variables in Figure 1. We also say that a valid candidate is one such setting that respects all constraints present in the model.
We note that this is a multi-objective problem, meaning that all of the models in this study contain at least 4 goals. Each node in Figure 1 comprises (a) an estimated development effort; (b) a purchasing cost (which is non-zero for features from third-party libraries); (c) some number of defects seen in past developments; (d) a “success” number that counts usage in prior successful projects. A project manager reviewing all the candidate solutions must reflect on how best to trade off between these competing concerns; e.g. how to minimize cost and effort without increasing defects while maximizing the features delivered.
How can we explore all the options within Figure 1? Empirically, it is known that if we select variable settings at random, then less than 2% of those candidates will satisfy the constraints of that figure [7]. Modern AI tools like the PicoSAT solver [25] can navigate those constraints to find valid solutions. But now there is a new problem: too many valid solutions for any human to review them all.
Q: How to help humans choose from all the candidates?
D. Partial Orderings
This paper argues that an efficient way to choose between all the candidates is partial orderings. A partial order is an ordering of a set such that, for certain pairs of elements, one precedes the other.
We use heuristic partial orderings as a way to quickly divide and conquer a problem space. For example, suppose there is a heuristic that Chinese manufacturing is better than Australian manufacturing. From this rule, we could define a partial ordering of the form:
Partially order anything from SE Asia from “good” to “bad” by mapping its manufacturing location (e.g. in the Philippines) to its closest point on a line connecting Sydney, Australia to Shanghai, China.
By this partial ordering, we might prefer to buy goods from Vietnam over goods from New Guinea.
Like any heuristic, partial orderings are not a complete inference (since there may be pairs for which neither element precedes the other, according to a partial ordering). However, as shown below, they can be remarkably effective.
A repeated result in software engineering (and other domains) [27], [28], [29], [30] is that when optimizing for some goals, it is possible to be guided by some easy-to-compute partial heuristic. For example, automatic configuration assessment tools can be built from regression trees with just a few examples. Even if those trees become very poor predictors of performance (in an absolute sense), they can still be useful to rank (in a relative sense) different configurations. E.g., Nair, et al. used those trees to find the top 1% of configurations, even though those models had error rates as high as 90% [30]. Also, some researchers [28] use generative models to improve upon supervised learning. While the results from the generative model may not be of the highest quality, all these outputs provide hints on how to better direct another algorithm (e.g., a machine learner). Further, researchers explored test case selection via crowd-sourcing since this is a fast way to collect many opinions. While the value of one opinion is questionable, they can be useful in the aggregate [29].
To say all this another way, SE researchers often guide the search for “good” solutions via some heuristic that can pre-order the space of possible solutions. To demonstrate the power of this approach, we turn to Hamlet’s probable correctness theory [31]. That theory says that the number of trials n needed to observe, with confidence c, an event occurring at probability p is \begin{equation*} n(c,p)=\frac {\log (1-c)}{\log (1-p)} \tag {1}\end{equation*}
Now consider what happens if we can compute some partial ordering over all those solutions. For solutions sorted in that manner, Hoare’s quickselect algorithm needs only explore:\begin{equation*} \mathit {evaluations}= 2\log _{2}(n) \tag {2}\end{equation*}
One way to summarize this equation is to say that optimization problems can be easier than other tasks like classification or regression. Those other tasks require us to label examples across the entire space while optimization means looking for a small region (and pruning the rest). And that small region can be found via partial ordering.
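To illustrate the gap between Equation 1 and Equation 2, the following minimal Python sketch compares the two for a pool of 10,000 candidates (the confidence and probability values are illustrative assumptions, not figures from our experiments):

from math import log, log2

def trials_needed(c, p):
    # Equation 1: random trials needed to see, with confidence c,
    # an event that occurs with probability p
    return log(1 - c) / log(1 - p)

def quickselect_evaluations(n):
    # Equation 2: evaluations needed by a quickselect-style binary chop
    # over n candidates that have already been partially ordered
    return 2 * log2(n)

print(trials_needed(0.95, 0.01))        # ~299 random evaluations to hit the best 1%
print(quickselect_evaluations(10_000))  # ~27 evaluations after partial ordering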
When discussing this work, we are often asked how it compares to other SE research on what is called “partial orderings”. For example, researchers trying to reduce the state-space explosion of model checking [33], or simplifying the number of refactoring steps for source code [34], apply a technique they call partial order reduction that prunes similar paths leading to the same result. While that work certainly shares some of the intuitions of this work, we note that (a) our partial orderings sort the entire space of solutions; after which we (b) explore those N solutions using only a logarithmic number of evaluations (Equation 2).
E. Why Not Explore Models Using Traditional Optimization Methods?
This paper is about optimization. Optimization problems, like those discussed above, have been well-studied for decades in the operations research literature [35]. There are many reasons to be cautious about using traditional optimizers for model-based SE. Those optimizers often assume continuous functions with numeric attributes that can be differentiated at all points along a function (differentiability is a key requirement for any classical gradient descent algorithm). Models like Figure 1, on the other hand, have many non-numeric attributes and are not expressed in the equational form needed by traditional optimizers.
Also, traditional optimizers may make unrealistic assumptions about the nature of model goals. Researchers like Harman [36] note that SE often requires trading off between many competing goals and, in that context, there may not be a single optimal solution but rather a (small) space of most acceptable solutions which must be debated by stakeholders.
Further, if used “off-the-shelf”, traditional optimizers can waste much time exploring issues that many humans find irrelevant or incomprehensible. While models may contain many variables, any one subject matter expert may only have experience with a small number of them. One result from the user study of this paper is that when stakeholders are asked to comment on all model attributes (e.g., the 128 attributes of Figure 1), they only ever comment on around 20 attributes. This means that, in our experiments on Figure 1, the space where their expertise can be applied is just a small fraction of the model (around 20 of its 128 attributes).
F. Reasoning Over Many Goals
Many rival technologies have been applied to many-goal model-based reasoning in SE:
Genetic algorithms, used for interactive search-based SE;
Sequential model optimizers, used for SE configuration.
This section describes those two approaches.
1) SBSE and Interactive Search-Based SE
Search-based SE methods explore design trade-offs using evolutionary programs. These algorithms run over multiple generations:
1) Generation G_0 is the initial population.
2) Each new generation G_{i+1} is built by:
a) randomly mutating the G_i solutions;
b) then selecting the best individuals (which requires y-evaluating those individuals);
c) then mixing together (also known as crossover) parts of two of the best individuals to create a new example for G_{i+1}.
This kind of reasoning must assess and compare solutions (in step 2b of the above). In single-objective optimization problems, a simple sorting function can rank candidates on their one goal. However, when dealing with many-objective reasoning, candidates must be ranked across many goals. As the number of goals increases, simple schemes such as boolean domination find it harder to distinguish different candidates. Hence, we use the Zitzler predicate described in Equation 3.
Interactive SBSE (iSBSE) is a variant of SBSE that tries to include humans in the reasoning process. One drawback with these iSBSE tools is cognitive fatigue. Typical control policies for genetic algorithms mutate (say) 100 individuals over 100 generations; if humans must assess each new individual, this quickly exhausts their patience.
To reduce cognitive fatigue, iSBSE researchers augment genetic algorithms with tools that enable pruning the search space, without having to ask the stakeholders too many questions. For example, Palma et al. [40] use a constraint solver (the MAX-SMT algorithm) to evaluate pair-wise comparisons of partial model solutions to decrease their search space. We do not compare iSNEAK against this method since their technique has scaling problems:
Their biggest model had 50 variable constraints.
iSNEAK, on the other hand, has been successfully applied to a 1000 variable model: see V-B.
In other work, Lin et al.’s [41] iSBSE tool generates plans of what to change within a system. Then, stakeholders interactively examine the recommended steps to accept, reject, or ignore them. These interactions are then used as feedback to calculate the next recommendation. But like Palma et al., Lin et al. did not demonstrate that their methods can scale to models of the size we process later in this paper.
In yet other work, Araújo et al. [42] combine an interactive genetic algorithm with a machine learner. Initially, humans are asked to evaluate examples, but once there are enough examples to train a surrogate (i.e., a model learned via a machine learner that can comment on solutions), this surrogate starts answering queries without having to trouble the stakeholders.
In the future work section of Araújo et al., they recommended exploring software product lines such as Figure 1. Since they have endorsed that kind of study, our study compares iSNEAK against Araújo et al. (as well as SMBO, described below).
2) SMBO = Sequential Model-Based Optimization
Recall that Araújo et al. reduced the number of questions to the stakeholders by building a surrogate. Once that surrogate is available then when new solutions are generated, these are evaluated by the surrogate (without having to ask the stakeholders any questions).
The use of surrogates is an important technique in the automatic configuration literature. Traditional configuration methods require the evaluation of many candidates. For example, in 2013, Guo et al. proposed the use of the CART regression tree learner to generate a model from historical examples that could be used to assess configuration options for future configuration problems [45]. That said, literature reviews in this field, such as [46], lament the narrow range of algorithms used in this area. Firstly, many of these tools still require a large number of pre-evaluated candidates. Secondly, the configuration tools reported by Ochoa et al. (which reportedly support human interaction) often use a human ranking of attributes that is fixed for all candidates [46]. Hence those reportedly interactive tools can overlook nuances related to specific examples.
Nair et al. use surrogates in a different way. In their tool, new examples are assessed on a case-by-case basis by sequential model-based optimization [47], [48] (SMBO). Table 1 shows their acquisition function loop that uses the surrogate to guess which candidate should be reviewed next.
Acquisition functions are the core of tools that use SMBO. While working on automatic software configuration, Nair et al. explored several acquisition functions [8]. For SE models, they found that standard SMBO methods (using Gaussian Process Models) did not scale very well. Instead, they found that regression tree models based on CART [43] scaled up to the kinds of large models seen in SE. Following their recommendations, this paper will use their FLASH sequential model optimizer, described in Table 2. Note step 4d, where one new example is evaluated. This is the point where some oracle would be asked for their opinion on a candidate (e.g., some human could be asked a question or some model could be executed to obtain y-values).
Our Proposal: The iSNEAK Algorithm
iSNEAK uses “hints” from the independent variables to quickly generate a partial ordering of candidates. This space is then explored via a recursive binary chop algorithm that spreads the candidates over a PCA-like space.
A. Recursive Bi-Clustering
iSNEAK uses a recursive binary chop inspired by Chen et al. [7]. This approach finds two very distant points, east and west, then recurses on the half associated with the “best” of east or west (where “best” is calculated using the methods described below). Formally, this process is a recursive FASTMAP algorithm [49], which is a Nyström [50] approximation to PCA. We provide pseudocode for FASTMAP in Alg. 1. The recursive version of this procedure simply consists of calling FASTMAP again on each of the (east and west) clusters generated in the latest step. This terminates on a given stopping criterion.
iSNEAK’s recursion terminates once only a small number of candidates survive (the stopping criteria are discussed below, in III-E).
Algorithm 1. FASTMAP
procedure FASTMAP(rows)
pivot = random(rows)
east = mostDistant(pivot)
west = mostDistant(east)
projectedRows = []
for row in rows do
projectedRow = project row onto the line between east and west
projectedRows.append(projectedRow)
end for
split projectedRows in half
first half is EastRows
second half is WestRows
Return EastRows, WestRows
end procedure
B. Discretization
This algorithm uses discretized attributes. Following the advice of Kerber [52], numbers are initially divided into 16 equal-width bins. This division (into many small bins) is then carefully evaluated within our system and (sometimes) repaired. Specifically, our system uses statistical measures of disorder to check whether two adjacent bins i and j can be combined into a single bin k. To implement this, we find the sizes n and standard deviations σ of the bins, then merge whenever the combined bin is no more disordered than the size-weighted sum of its parts: \begin{equation*} \sigma _{k} \leq \frac {n_{i}}{n_{k}}\sigma _{i} + \frac {n_{j}}{n_{k}}\sigma _{j}\end{equation*}
In early experimental work on this paper, we experimented with different values for the initial number of bins, but the results were similar or worse. As such we kept the initial division into 16 ranges recommended by Kerber [52].
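A minimal Python sketch of this discretization, under our reading of the merge rule above (the exact merge order and tie-breaking inside iSNEAK may differ):

import statistics

def equal_width_bins(values, k=16):
    # Kerber-style initialization: k equal-width bins over the observed range
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0
    bins = [[] for _ in range(k)]
    for v in values:
        bins[min(int((v - lo) / width), k - 1)].append(v)
    return [b for b in bins if b]              # drop empty bins

def merge_adjacent(bins):
    # merge neighbours when the combined bin is no more "disordered"
    # (standard deviation) than the size-weighted sum of its parts
    sd = lambda xs: statistics.pstdev(xs) if len(xs) > 1 else 0.0
    i = 0
    while i < len(bins) - 1:
        a, b = bins[i], bins[i + 1]
        merged = a + b
        n_i, n_j, n_k = len(a), len(b), len(merged)
        if sd(merged) <= (n_i / n_k) * sd(a) + (n_j / n_k) * sd(b):
            bins[i:i + 2] = [merged]           # combine, then re-check from here
        else:
            i += 1
    return bins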
C. Distance, Projection, and Division
For the discretized data, represented as a one-hot encoding, Chen et al.’s boolean distance metric [7] is used to measure the distance between two candidates. With that distance metric, the two remote points east and west can be found in linear time, using the Faloutsos and Lin [49] heuristic and the cosine rule:
Let E (east) and W (west) be separated by distance c.
Let a, b be the distances of some example to E and W, respectively.
Every example then falls at distance x=(a^{2}+c^{2}-b^{2}) / (2c) along a line drawn from E to W.
To halve the data, iSNEAK splits on the median x value.
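Putting Algorithm 1 and the projection above together, here is a minimal Python sketch of one level of the binary chop (the Hamming-style distance and the unguarded random pivot are simplifying assumptions):

import random

def dist(a, b):
    # boolean distance over one-hot encoded candidates, normalized to 0..1
    return sum(x != y for x, y in zip(a, b)) / len(a)

def fastmap_split(rows):
    far = lambda point: max(rows, key=lambda r: dist(point, r))
    pivot = random.choice(rows)
    east = far(pivot)                          # a point far from a random pivot
    west = far(east)                           # a point far from east
    c = dist(east, west)
    def x(row):                                # cosine-rule projection onto the east-west line
        a, b = dist(row, east), dist(row, west)
        return (a * a + c * c - b * b) / (2 * c + 1e-12)
    ordered = sorted(rows, key=x)
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]        # split at the median projection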
D. Ranking Best and Rest
At each level of the recursion, the division process of the last section generates two halves, which iSNEAK can rank in one of two ways: (a) automatically; or (b) with a human in the loop. Once ranked, the lower-ranked half is pruned and iSNEAK recurses into the surviving candidates.
1) Automatic Ranking
Following the advice of Sayyad et al. [37], iSNEAK uses Zitzler’s continuous domination predicate [53] to label our two halves best and rest. This predicate judges E to be worse than W if we “lose” more jumping from E to W than jumping from W to E:\begin{align*} \textit {worse}(E,W) & =\textit {loss}(E,W) \gt \textit {loss}(W,E) \\ \textit {loss}(x,y) & =\sum \nolimits _{j=1}^{n} -e^{\Delta (j,x,y,n)} \\ \Delta (j,x,y,n) & =w_{j}(o(x)_{j} - o(y)_{j}) / n \tag {3}\end{align*}
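A minimal Python sketch of Equation 3 (we assume the objective values o(x)_j are normalized to 0..1 and that the weights w_j are +1 for goals to maximize and -1 for goals to minimize):

import math

def loss(x, y, weights):
    # Equation 3: the "loss" incurred by jumping from objectives x to objectives y
    n = len(x)
    return sum(-math.exp(w * (xj - yj) / n) for w, xj, yj in zip(weights, x, y))

def worse(east, west, weights):
    # east is worse than west if jumping away from east loses more
    return loss(east, west, weights) > loss(west, east, weights)

# e.g. with two goals, both to be minimized:
print(worse([0.9, 0.8], [0.2, 0.3], weights=[-1, -1]))   # True: the first candidate is worse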
2) Human-in-the-Loop Ranking
iSNEAK can rank the two halves using human input to label them best and rest. Starting with two items, one picked from each half, iSNEAK looks at the six most informative attributes (where “most informative” is assessed via the INFOGAIN feature ranking predicate [54]) that have different values in item1 and item2. These two sets (of up to six values) are presented to the user. The user then selects either item1’s values or item2’s values. Each half can then be scored by the percentage P of candidates in that half containing any of the selected values. This frequency then becomes an additional term in a separate version of Equation 3:\begin{align*} \textit {worse}(E,W) & =\textit {loss}(E,W) \gt \textit {loss}(W,E) \\ \textit {loss}(x,y) & =\left ({{\sum \nolimits _{j=1}^{n} -e^{\Delta (j,x,y,n+1)}}}\right ) -e^{\Delta '(x,y,n+1)} \\ \Delta (j,x,y,n) & =w_{j}(o(x)_{j} - o(y)_{j}) / n \\ \Delta '(x,y,n) & =(P(x) - P(y)) / n \tag {4}\end{align*}
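Continuing the sketch above, Equation 4 only adds one human-preference term to the loss; the candidate encoding below (dictionaries of attribute values) is a stand-in of ours, not iSNEAK’s internal representation:

import math

def preference_frequency(half, chosen_values):
    # P: fraction of candidates in one half containing any value the human preferred
    hits = sum(any(candidate.get(attr) == val for attr, val in chosen_values)
               for candidate in half)
    return hits / len(half)

def loss_with_human(x, y, weights, p_x, p_y):
    # Equation 4: Equation 3's loss plus one extra preference term (divisor n+1)
    n1 = len(x) + 1
    automated = sum(-math.exp(w * (xj - yj) / n1) for w, xj, yj in zip(weights, x, y))
    return automated - math.exp((p_x - p_y) / n1)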
The reasoning behind showing six attributes to the user comes from experimental data. In the field of interactive search-based software engineering, a major concern is avoiding information overload. Using the example model from Figure 1: even if humans felt they could comment on all 128 options at once, should we trust their assessment? Shackleford [55] warns us that human information overload leads to errors in human decisions about which variables are most influential in a large comparison scenario. And Takagi [56], [57], [58] notes that we can decrease this effect by:
Reducing I, the number of interactions (where at each interaction, we show the user S attributes);
Reducing S, the number of questions per interaction.
We note that:
Every run of iSNEAK could use a single interaction if an expert commented on all the n variables in our models (i.e., I=1 and S=n).
Similarly, iSNEAK has (at most) n interactions if every interaction only commented on one variable (i.e., I=n and S=1).
Figure 2 explores the trade-off between asking for everything at once (I=1, S=n) and asking about each variable separately (I=n, S=1).
E. Two Passes
Our iSNEAK system is a two-pass system. Both passes apply recursive bi-clustering with top-down pruning. In the second pass, Chen et al.’s algorithm [7] prunes subtrees via the automatic ranking method of III-D1 and Equation 3.
Before that, iSNEAK’s pass1 uses Equation 4 to guide the search via human preferences. Pass1 recursively bi-clusters all the candidates (with no pruning). Next, we generate a list of subtrees in the cluster tree, sorted by how much each subtree reduces the variability in the candidates. Here, variability is measured via entropy. If e_0 is the entropy of a subtree of size S at depth d, and e_1, e_2 are the entropies of its two child clusters, then subtrees are sorted by the score \begin{equation*} S{\times }(e_{0}-e_{1}){\times }(e_{0}-e_{2}){\times }\frac {\mathit {open}}{d} \tag {5}\end{equation*}
The open term in Equation 5 constrains the dialog with users such that they rarely have to offer an opinion about the same attribute twice. Once III-D2 determines what attributes we could ask about for this subtree, open is the percentage of those attributes that we have not yet presented to users (so the subtrees we explore are the ones concerned with attributes we have yet to ask about).
Pass1 continues until the cluster tree contains fewer candidates than a preset stopping threshold; the surviving candidates are then handed to pass2 (SWAY [7]).
Algorithm 2. iSNEAK
procedure iSNEAK(rows)
tree = recursiveFASTMAP(rows)
good = calculateGood(tree)
if good and size(tree) > stopping threshold then
Ask question on best
prune worst half
recalculate good
else
survivors = recoverRows(tree)
selected = SWAY(survivors) [7]
return selected
end if
end procedure
Methods
A. Case Studies
Table 3 shows the models used in our study. For all the models, the experiments of this paper (see next section) apply the methods of III and II-F to find inputs that lead to the best values in the outputs.
We use these models since they are standard models used in other SBSE papers [7], [59], [60]. Note that these are all multi-objective problems with 4 to 5 goals. In that table, the constraint ratio is the number of constraints per clause. The models POM3A, POM3B, POM3C, OSP2, FLIGHT and GROUND are small and have no constraints (so all solutions generated for these models are valid). All the other models are much larger and have so many constraints that only 1% to 3% of the solution space is valid.
The rest of this section describes these case studies.
1) XOMO
OSP2, GROUND and FLIGHT are variants of XOMO [61], a general framework for Monte Carlo simulations that combines four COCOMO-like software process models (from Boehm’s group at the University of Southern California) in order to calculate metrics for a project’s success.
The problem solved by XOMO+iSNEAK is to generate good management recommendations that satisfy the constraints of a project, while minimizing development time (as judged by the COCOMO model) and avoiding factors that increase the risk of cost overrun (as judged by other Boehm models).
Containing between 6 and 12 variables in each of its variants, the XOMO model represents real situations in a software project. Under this model, a user can obtain four objective scores: (1) project risk; (2) development effort; (3) predicted defects; (4) total time for development. The model was developed with data collected from hundreds of commercial and Defense Department projects [62]. Its risk model is a rule-based algorithm over variables associated with risk (e.g., demanding more reliability whilst decreasing analyst capability); XOMO then measures risk as the percentage of triggered rules.
OSP2 (short for “orbital space plane, version 2”) describes the software context of a second-generation NASA space plane. Expressed in the terminology of Boehm’s COCOMO system [63], OSP2 lists the range of acceptable software reliability (it must be very high), database size (which is variable), developer experience (which can be changed by management assigning different developers to the project), and so on.
GROUND and FLIGHT are similar to OSP2, but represent two specialized classes of NASA software (flight software and ground software). The problems solved by these two models are the same as for OSP2 (but for those two specific kinds of software).
2) POM3
OSP2, GROUND and FLIGHT assume waterfall-style development. POM3, on the other hand, explores factors related to agile development, such as those of Figure 3. These are organizational factors identified by Boehm and Turner in their book “Balancing Agility and Discipline: A Guide for the Perplexed” [64]. According to that book, agile works best in the center of Figure 3. However, they also note that, with management support, organizations can adjust themselves.
Figure 3. According to Boehm and Turner [64], agile is best suited for projects in the middle of this figure.
POM3+iSNEAK is an advice tool used during SCRUM meetings to answer the question “what tasks should be completed in the next sprint?” (where a sprint is a short period of intense coding aimed at producing specific deliverables). To answer that question, POM3 conducts trade-off studies within Figure 3.
This paper explores three POM variants:
POM3a representing a very broad space of projects; and
POM3b representing safety critical small projects; and
POM3c representing highly dynamic large projects.
For all these POM models, the goal is to maximize the number of requirements met while minimizing the time to completion and idle time (time spent waiting for some other team to complete some required module).
3) Assorted Product Lines
The rest of our models come from software product line research. Here, a large space of potential software products is modeled as a tree of features (where super-features are met via some conjunction/disjunction of their subtrees). Across the tree, there may be “cross-tree constraints”; i.e. things that must be absent or present in one branch to enable a feature in another branch. From one product line, many software designs can be generated, provided that each satisfies the cross-tree constraints. For an example software product line, see Figure 1.
When combined with some optimizing technology like iSNEAK, the problem solved by these models is “what product to build?”. To answer that question, an optimizer must trade off between competing goals such as minimizing development time and maximizing the number of features delivered. An example software product line is the SCRUM model shown in Figure 1. Another software product line model used in this paper represents options within a BILLING system. Apart from these, this paper also optimizes eight other artificially generated SPL models of increasing size and/or frequency of cross-tree constraints.
All these software product line models were taken from the SPLOT research model repository. That website has a tool for artificially generating models. This tool is useful for stress testing an algorithm by, e.g., generating models of increasing complexity. We used eight such models:
For 125FEAT, 250FEAT, 500FEAT and 1000FEAT, we increased the number of features while keeping the same ratio of constraints to features.
In 0.25 C.D, 0.75 C.D, 0.50 C.D and 1.00 C.D, the number of features was kept constant (at 500) while the ratio of constraints to features was increased.
These last ten models were taken in the SXFM format and converted into the DIMACS format using FeatureIDE [65]. From the DIMACS format, we used PicoSAT to generate a database of valid solutions for each model.
B. Generating Candidates
This section discusses how we generate candidates from the models described above.
Chen et al. [7] found that if they generated very large initial populations (e.g., 10,000 candidates), then their recursive binary chop could very quickly find solutions as good as, or better than, genetic algorithms that take an initial population of (say) 100 candidates and then mutate them over 100 generations. Following their advice, iSNEAK starts from a pool of 10,000 candidates, generated by PicoSAT (for the constrained feature models) or by the model simulators themselves (for XOMO and POM3); see the sketch below.
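For the constrained feature models, one way to build such a pool is to enumerate satisfying assignments with a SAT solver and then down-sample. A minimal sketch using the pycosat bindings to PicoSAT (the clause data, pool size, and sampling strategy are illustrative assumptions, not our exact pipeline):

import itertools
import random
import pycosat   # Python bindings to the PicoSAT solver

def valid_candidates(cnf_clauses, want=10_000, seed=1):
    # enumerate satisfying assignments of a DIMACS-style clause list,
    # then down-sample a pool of valid candidates
    # (itersolve yields solutions in solver order, not at random)
    pool = list(itertools.islice(pycosat.itersolve(cnf_clauses), want * 5))
    random.Random(seed).shuffle(pool)
    return pool[:want]

# e.g. clauses are lists of signed variable ids: [[1, -3], [2, 3, -1], ...]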
C. Comparison Algorithms
Arcuri and Briand [66] recommend that algorithms be compared against some simple baseline. For that purpose, we use a Non-Interactive Genetic Algorithm (NGA), which is the same genetic algorithm used by Araújo et al. but without human interactions. This approach serves as a baseline for comparison towards answering our second research question, where we only perform y-evaluations to optimize our models. For the population control parameters of NGA, we used the advice of Holland [39]; i.e., 100 valid candidates run for 100 generations, generating 100 new candidates each time. As to the cross-over and mutation method, we retained the Araújo et al. settings; i.e., 90% crossover and their mutation rate.
iSNEAK is also compared to two state-of-the-art methods; specifically, FLASH [8] and the Araújo et al. iSBSE algorithm [42]. We chose these algorithms for their recency and superior performance (compared to other methods in their field [48]). Also, the [42] algorithm is engineered to accept a wide range of models as input.
The code for FLASH, written in Python, comes from the Nair et al. repository. Araújo et al. did not offer a reproduction package for their work, so we reimplemented their code from their descriptions.
D. Manual and Automatic Oracles
We conducted two studies:
Human-in-the-loop, to see if humans trust or accept our solutions.
Fully automated, to test dozens of models.
In the former, humans were the oracle. In the latter, each attribute was given a randomly assigned priority (and approved answers were those using attributes that maximized the sum of those priorities).
E. Evaluation Metrics (Distance to Heaven)
To evaluate the optimality of solutions found by different methods, we must see where our solutions fall within the space of all solutions. To that end we:
Ran one method (NGA, iSBSE, FLASH, etc.) to collect a small number of recommended candidates;
Evaluated all candidates and ranked them all;
Found the ranks of the recommended candidates within that overall ranking.
For ranking all candidates, using the Zitzler predicate, we can obtain a vector Z containing all candidates sorted from best to worst. The “distance to heaven” of a specific recommended candidate x, denoted by its index i_x within Z, is then \begin{equation*} \mathit {d2h}_{x} = i_{x} / |Z| \tag {6}\end{equation*}
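A minimal sketch of Equation 6, assuming the whole population has already been sorted best-to-worst (e.g., with the Zitzler predicate of Equation 3):

def distance_to_heaven(candidate, Z):
    # Equation 6: the candidate's normalized rank within the fully sorted
    # population Z (best first); lower values are better
    return Z.index(candidate) / len(Z)

# e.g. a recommendation ranked 150th out of 10,000 candidates scores 0.015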
Results
The results of this paper comment on the research questions offered in the introduction.
A. RQ1: Does iSNEAK-Style AI Confuse Humans?
RQ1 addresses the core issue that motivated this paper; i.e., can we make AI deliver solutions that humans agree with? Stakeholders may have preferences for a tiny fraction of the total space. Hence, it is possible that automated tools like iSNEAK will differ from what a stakeholder considers acceptable. RQ1 tests for the presence of that problem.
The RQ1 experiment was performed in accordance with the Institutional Review Board policies of North Carolina State University. Note that, in the following experiments, human input was used in all the comparison algorithms studied here (i.e., both iSNEAK and Araújo et al.). Within the experiment, we only asked humans to compare iSNEAK with the iSBSE method proposed by Araújo et al.; neither FLASH nor NGA was used here. We made this choice for two reasons:
Every added component to a human-in-the-loop study increases the effort associated with our human subjects. Hence we were keen to minimize their cognitive load.
FLASH and the NGA behaved so poorly in our automatic trials that there was no pressing need to test them.
To obtain human subjects, we reached out to our contacts in the Brazilian I.T. community, where one manager granted us access to 20 of her developers, on the condition that we took no more than two hours of their time (for the initial briefing and running the experiments). Our choice of subjects led to certain decisions in our experimental design. We had to use a model with attributes that our subjects understood. In consultation with the manager, we reviewed several models (before speaking to subjects) and decided that POM3a was the most approachable for our subjects.
For this experiment, subjects were selected by their manager. All subjects had at least 4 years of experience in the field and 3 years of experience in an agile team. Subjects were made aware that the manager endorsed their participation in this study. No added incentives were offered to subjects except a commitment that if in the future they wanted to use the tool, we would make it freely available and support their use. Subjects and their specific results were guaranteed anonymity from their manager. Thus, the experiment did not collect logins, names, or IP addresses.
Prior to collecting data, we ran a short (under 20 minutes) in-person meeting with participants (including the manager of the group, who also participated as a subject). Subjects were introduced to the goals of the work and briefed on the models. After that, subjects had 1.5 hours to complete the experiment, during which time they were asked not to talk to each other about their experiences. During the experiment, subjects were silently observed in a company office by one of the authors. Subjects installed and ran our software locally on their own machines. At the end of the experiment, subjects rated two configurations (using a score of 0 to 5, 0 being the worst).
In order to control for experimental-subject bias, after running both the Araújo et al. tool and the iSNEAK tool, solutions were presented in a randomized order (so that the iSNEAK solutions did not always appear in the same place on the screen). This meant that our subjects never knew which tool produced each solution they were asked to evaluate.
As seen in Fig. 4, our subjects strongly preferred iSNEAK. The median score for iSNEAK’s solutions was 4 while the median score for Araújo et al.’s solutions was 2. Hence for RQ1, we say:
Figure 4. RQ1 results. Distribution of ratings given by human experts to the presented solutions. The x-axis denotes human ratings and the y-axis shows the frequency with which humans offered those ratings.
For the models we could show to our subjects in their available time, iSNEAK’s solutions are acceptable.
We also recorded the time it took each participant to obtain a solution with each tool. As seen in Fig. 5, iSNEAK requires orders of magnitude less human time to obtain preferable solutions.
B. RQ2: Does Human Advice Confuse iSNEAK-Style AI?
RQ2 addresses the following question: are human opinions detrimental to optimization, or if we generate solutions using the methods of RQ1, will we achieve sub-optimal results?
To answer this question, we explore numerous models beyond the POM3a model used for RQ1. This process required an oracle that could comment on 320 results; i.e., 20 repeats over 16 models (many of which are unfamiliar to specific subject matter experts). Hence we needed to build an automatic oracle.
To do this, before any interaction in each run, our oracle randomly assigns priority values to all of the model’s variables. Using those priority values, it can then respond consistently to any oracle questions posed by our algorithms. These priority values are set according to a random seed (a minimal sketch appears below).
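A minimal sketch of such an oracle; the attribute names and the pairwise question format are placeholders of ours, not the exact interface used in the experiments:

import random

class AutomaticOracle:
    def __init__(self, attributes, seed=0):
        # the seed fixes this oracle's "taste" so it answers consistently
        rand = random.Random(seed)
        self.priority = {a: rand.random() for a in attributes}

    def prefer(self, option_a, option_b):
        # approve whichever option's attributes have the larger priority sum
        score = lambda attrs: sum(self.priority[a] for a in attrs)
        return option_a if score(option_a) >= score(option_b) else option_b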
All results come from 20 repeated runs (with different random number seeds) of an automatic oracle exploration of our models.
Fig. 6 comments on the effectiveness of iSNEAK’s questions. Usually, iSNEAK can find results within 2% to 3% of the best solution ever seen for a model. More specifically, in Fig. 6:
For all models, the simple GA usually performs worst.
FLASH always ranks third or fourth best.
Araújo et al.’s tools have medians better than iSNEAK for unconstrained models (the first six models). But for models with constraints (i.e., from BILLING downwards), iSNEAK out-performs Araújo et al. Sometimes, those wins are very significant: e.g., see BILLING, where iSNEAK’s median d2h scores are an order of magnitude better than the other algorithms’.
When iSNEAK lost to Araújo et al., the difference is usually very small (the actual d2h performance deltas from the best results are {1, 2, 2, 3, 3, 3}%).
Figure 6. RQ2 results. Median d2h values seen over 20 runs (log scale on the y-axis). For a definition of d2h, see Equation 6 in IV-E. Note that lower values are better.
Fig. 7a shows that when Araújo et al. y-evaluate their 10,000 candidates, they pause around 50 times to ask stakeholders questions. As shown in Fig. 7b, these questions cover all the attributes in the model, which means the size of the interaction for Araújo et al. scales linearly with the number of variables in the model. On the other hand, iSNEAK pauses around 20 times and, when it does, it asks far fewer questions; as seen in Fig. 7b, the size of iSNEAK’s interactions has no direct relationship with the size of the model.
Figure 7. RQ2 results. Any iSBSE tool will, I times, ask a stakeholder about S attributes. Shown here are the I and S values observed for the systems studied here.
Just to clarify, the reason Araújo et al. ask more questions than iSNEAK is that when Araújo et al. ask the stakeholders questions, those questions mention every attribute in the model (e.g., all 128 attributes of Figure 1). On the other hand, before iSNEAK asks a question, it applies the attribute selection methods of III in order to isolate the most informative attributes.
We answer RQ2 as follows:
For large and complex models with many constraints, iSNEAK is the clear winner (of the systems studied here).
To be clear, sometimes iSNEAK is defeated by other methods, but only by a very small amount and only on simple models (that are both very small and have no constraints). More importantly, measured by the human cost to find a solution, iSNEAK is clearly the preferred method. We say cost is the number of interactions times the number of questions per interaction. Using Fig. 6, we compare that cost for our different systems: iSNEAK’s oracle cost was 1% and 4% of Araújo et al.’s for the constrained and unconstrained models, respectively.
C. Other Issues
We have other reasons to recommend iSNEAK over the methods of Araújo et al.: 100% of the solutions explored by iSNEAK are valid. The reason for this is simple: we let PicoSAT, or the model’s generative tool, generate valid solutions, then we down-sample from that space. Genetic algorithms, on the other hand, such as those used by Araújo et al., mutate examples without consideration of the logical constraints of a domain.
This is a significant and important aspect of iSNEAK. Fig. 8 shows what happens when we take the solutions generated by Araújo et al.’s genetic algorithm [42] for the SPL models, then apply the model constraints to those solutions. As seen in that figure, most of the solutions generated by their methods are not valid. This also applies to NGA and FLASH, which waste most of their time evaluating invalid solutions.
Figure 8. Valid solutions over 20 runs. (Technical aside: to handle these invalid solutions, we added a post-processor to the Araújo et al. algorithm that rejected invalid solutions.)
Discussion
A. Threats to Validity
As with any empirical study, biases can affect the final results. Therefore, any conclusions made from this work must be considered with the following issues in mind.
1) Evaluation Bias
In our RQ1 experiments, we studied 20 software engineers using iSNEAK to select a good POM3A [62], [64], [67] solution (a model for exploring the management process of agile development). Although more professionals would have provided a more reliable evaluation of the approach, we mitigated this by selecting only professionals with at least 4 years of experience in the field and 3 years of experience in an agile team.
2) Parameter Bias
iSNEAK contains many modular subprocesses, each with its own hyperparameters. While this study shows that the current configuration of iSNEAK is capable of producing results comparable to state-of-the-art algorithms, different configurations of the iSNEAK system might prove valuable. That said, in defense of our current setup, we were able to find solutions that were significantly better than those found by the compared baseline techniques.
3) Sampling Bias
This threat applies to any empirical study using datasets; i.e., what works here may not work everywhere. For example, we have applied iSNEAK to two real-life models of different software product lines, six XOMO and POM3 models, and eight artificially generated feature models of varying characteristics. But the behavior of iSNEAK on significantly larger models (i.e., hundreds of thousands of attributes) still needs to be evaluated.
Another concern is that if the models studied here are “trivial” in some sense, then there may be very little value added by iSNEAK. We believe these models are not trivial, for several reasons. For all practical purposes, these models have a search space so large that it cannot be enumerated. Consider, for example, the SCRUM model with its 128 binary options. From Figure 8, we see that (usually) 98% of the randomly generated solutions violate domain constraints. That still leaves a space of roughly 10^36 valid candidates, far too many to ever enumerate.
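That estimate is simple arithmetic over the figures already quoted (128 binary options, roughly 2% of settings valid):

total_candidates = 2 ** 128                    # every setting of 128 binary options
valid_candidates = 0.02 * total_candidates     # only ~2% survive the domain constraints
print(f"{valid_candidates:.1e}")               # ~6.8e+36 valid candidates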
B. Relationship to Prior Work
As to our own prior work with iSNEAK [70]: there, we applied an earlier version of this algorithm to the problem of learning from very small data sets. This paper differs from that prior work in several significant ways. That article lacked any facility for human-in-the-loop interaction. But here, as discussed in IV-D, we exploit that facility in two important ways:
In one of our studies, we ask humans for their opinions during the optimization process (and those opinions are used to guide the results).
In another of our studies, that explores many models, we guide the inference with an “artificial human” with their own built-in bias (that we create as part of this study, see IV-D).
Also, that prior article was a hyperparameter optimization study where a SNEAK-like algorithm was used to control a second algorithm (a Random Forest predictor). Here, there is no second algorithm (so the output of iSNEAK is the final output).
Further, that prior article was focused on a highly specific “small data” problem: tuning learners to explore tiny data sets (60 rows or less). This article, on the other hand, is a “big data” study that explores much bigger problems:
That article explored data with just five independent variables while here, we explore data sets with up to 1000 independent variables.
That article explored data sets with 60 rows (or less) while this article explores samples of model output of size 10,000.
C. Future Work
Going forward, these results need to be applied to more models. Also, we need more results from more human-in-the-loop studies. Further, here we found that the iSNEAK pre-processor helped one particular optimizer (SWAY); it could be insightful to check if this pre-processor is helpful for other optimizers.
Furthermore, [46] proposes a set of challenges that need to be addressed for semi-automatic configuration tools, including (a) tuning all the control parameters of the tools inside iSNEAK; (b) supporting multi-stakeholder environments; (c) reasoning about qualitative requirements (such as the non-functional requirements explored by [71]); (d) real-time perpetual re-evaluation of solutions in highly dynamic environments (e.g., drones providing assistance within disaster sites); and (e) exploring different algorithms.
As to this last point (exploring different algorithms), we note that this work used a particular clustering algorithm (FASTMAP) and a particular attribute selector (INFOGAIN). Clearly, future work should explore better clustering and attribute selection algorithms.
More generally, there is an as-yet unexplored connection between our work here and semi-supervised learning (SSL). Given a few evaluated examples, SSL tries to spread those labels across related areas of the data set [72], [73]. A key assumption of SSL is that higher-dimensional data sets can be approximated by a lower-dimensional manifold with little to no loss of signal [73]. When such manifolds exist, then the number of queries required to understand the data is just the number of queries needed to understand the manifold. A common way to find that manifold is to apply some dimensionality reduction method such as PCA. We saw in III that iSNEAK uses an analog of PCA. Hence, in some sense, it could be said that iSNEAK is a semi-supervised learner. That said, we have yet to find an algorithm from the SSL literature that improves on the results of V-B. Nevertheless, this could be a fruitful direction for future study; i.e.:
For all the SSL algorithms, see if any of them do better than iSNEAK.
Conclusion
When human knowledge covers only a small fraction of the total space, it is vanishingly unlikely that a fully automated algorithm will select solutions that are comprehensible and acceptable. However, as our RQ1 results show, with human-in-the-loop tools like iSNEAK it is possible to select solutions that make sense to human subject matter experts.
Also, when humans guide the reasoning but can only comment on a tiny fraction of the total problem, this limited knowledge might lead to sub-optimal results. However, as shown by our RQ2 results, it is possible for AI tools to respect human preferences while still delivering highly optimal solutions.
The key to this process are quickly computed “hints” that produce approximate partial orderings of the data. In our first round of “hinting”, iSNEAK partially orders, then prunes, recursive partitions of the data (using questions to some oracle about the independent attributes). Next, in a second round of “hinting”, we use Equation 3 to order, then prune, the data (using queries to the goal attributes).
This paper tests this process on numerous software models. For the models studied here:
iSNEAK-style AI does not confuse humans, since it only asks a handful of questions and returns human-acceptable solutions;
advice from humans does not confuse AI, since iSNEAK’s solutions, obtained via human-in-the-loop reasoning, out-perform the prior state-of-the-art in the optimization of software models.
Conflict of Interest Statement
The authors declared that they have no conflict of interest.
Data Availability Statement
All our data and scripts are on-line at https://github.com/zxcv123456qwe/iSNEAK. Permission is granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the conditions of the MIT License.