Saturday 16 September 2017

The Problem of Productivity in Software Engineering Research

Software engineering research has a productivity problem. Many researchers across the world are engaged in it, but the path from idea to publication is often a fraught one. As a consequence, there is a danger that many important ideas and results are not receiving the attention they deserve within academia, or finding their way to the practitioners whom the research is ultimately intended to benefit.

One of the biggest barriers faced by software engineering researchers is (perhaps ironically) the need to produce software. Research is overwhelmingly concerned with the development of automated techniques to support activities such as testing, remodularisation and comprehension. It is rightly expected that, in order to publish such a technique at a respectable venue, the proposed approach has to be accompanied by some empirical data, generated with the help of a proof-of-concept tool.

Developing such a tool requires a lot of time and effort. This effort can be roughly spread across two dimensions:
(1) the ‘scientific’ challenge of identifying and applying suitable algorithms and data-types to fit the problem, and running experiments to gather data, and
(2) the ‘engineering’ challenge of ensuring that the software is portable and usable, and can be applied in an ‘industrial’ setting: scaling to arbitrarily large systems and usable by a broad range of users.

Whereas the first dimension can often be accomplished within a relatively short time-frame (a couple of person months perhaps), the second dimension — taking an academic tool and scaling it up — can rapidly become enormously time-consuming. In practice, doing so will often only realistically be possible in a well-resourced and funded lab, where the researcher is accompanied by one or more long-term post-doctoral research assistants.

This is problematic because the second dimension is often what matters when it comes to publication. An academic tool that is widely applicable can be used to generate larger volumes of empirical data, from a broader range of subject systems. Even if the underlying technique is not particularly novel or risky, the fact that it is accompanied by a large volume of empirical data renders it immediately more publishable than a technique that, whilst more novel and interesting, does not have a tool that is as broadly applicable or scalable, and thus cannot produce the same volume of empirical data. I previously discussed this specific problem in the context of software testing research.

Indeed, the out-of-the-box performance of the software tool (as accomplished by dimension 2) is often used to assess, at face value, the performance of the technique it seeks to implement (regardless of whether or not the tool was merely intended as a proof of concept). One of the many examples of this mindset shines through in the ASE 2015 paper on AndroTest, where a selection of academic tools (often underpinned by non-trivial heuristics and algorithms) were compared against the industrial, conceptually much simpler MonkeyTest random testing tool. Perhaps embarrassingly for the conceptually more advanced academic tools, MonkeyTest was shown to be the hands-down winner in terms of performance across the board. I am personally uneasy about this sort of comparison, because it is difficult to determine to what extent the (under-)performance of the academic tools was simply due to a lack of investment in the ‘engineering’ dimension. Had they been more usable and portable, with less dependence upon manual selection of parameters and so on, would the outcome have been different?

This emphasis on the engineering dimension is perhaps one of the factors that contributes to what Moshe Vardi recently called the “divination by program committee”. He argues that papers are often treated as “guilty until proven innocent”, and the maturity and industrial applicability of an associated tool can, for many reviewers, become a factor in deciding whether a paper (and its tool) should make the cut.

In my view, this is the cause of a huge productivity problem in software engineering. The capacity to generate genuinely widely usable tools that can produce large volumes of empirical data is rare. Efforts to publish novel techniques based on proof-of-concept implementations geared towards smaller-scale, specific case studies often fail to reach the top venues, and fail to make the impact they perhaps should.


In his blog, Moshe Vardi suggests that reviewers and PC members should perhaps adopt a shift in attitude towards one of “innocent until proven guilty”. In my view, this more lenient approach should include a shift away from the overarching emphasis on empirical data and generalisability (which implies the need for highly engineered tools).

Friday 16 June 2017

On The Tension between Utility and Innovation in Software Engineering


For a piece of software engineering research to be published, it must above all provide some evidence that it is of value (or at least potentially of value) in a practical, industrial context. Software Engineering publications and grant proposals live or die by their perceived impact upon, and value to, the software industry.

To be published at a high-impact venue, a piece of research must demonstrate this value with a convincing empirical study, with extra credit given to projects that involve large numbers of industrial developers and projects. For a grant proposal to be accepted, it should ideally involve significant commitments from industrial partners.

Of course this makes sense. Funding councils should rightly expect some form of return on investment; funding Software Engineering researchers should result in some form of impact upon the industry. The motivation of any research should always ultimately be to improve the state of the art in some respect. Extensive involvement of industrial partners can potentially bridge the “valley of death” in the technology readiness levels between conceptual research and industrial application.

However, there are downsides to framing the value of a research area in such starkly utilitarian terms. There is a risk that research effort becomes overly concentrated on activities such as tool development, developer studies and data collection. Evaluation shifts its focus from novelty and innovation to issues such as the ease with which the tool can be deployed and the wealth of data supporting its efficacy. This is fine if an idea is easy to implement as a tool and the data is easy to collect. Unfortunately, this only tends to be the case for technology that is already well established (for which there are already plenty of APIs around, for example), and where the idea lends itself to easy data collection, or the data already exists and merely has to be re-analysed.

There is, however, no incentive (in fact, there is a disincentive) to embark upon a line of research for which tools and empirical studies are harder to construct in the short term, or for which data cannot readily be harvested from software repositories. Truly visionary ideas might require a long time (5-10 years) to refine, and will potentially require cultural changes that put them (at least in the initial years of a project) beyond the remit of empirical studies. But it is surely within this space that the genuinely game-changing innovations lie.

The convention is that early-stage research should be published in workshops and “new idea” papers, and can only graduate to full conference or journal papers once it is “mature” enough. This is problematic because a truly risky, long-term project of the sort mentioned above would not produce the level of publications that are necessary to sustain an academic career.

This state of affairs is by no means a necessity. For example, the few Formal Methods conferences that I’ve been to, and proceedings that I’ve read, have always struck me as being more welcoming of risky ideas with sketchier evaluations (despite the fact that these same conferences and researchers also have formidable links to industry).

It is not obvious what the solution might be. However, I do believe that it probably has to involve a loosening of the empiricist straitjacket.



* For fear of this being misread: it is not my opinion that papers should, in general, be excused for not having a rigorous empirical study. It’s just that some should be.

Friday 20 January 2017

Automated cars may prevent accidents. But at what cost?

The advent of driverless car technology has been accompanied by an understandable degree of apprehension from some quarters. These cars are after all entirely controlled by software, much of which is difficult to validate and verify (especially given the fact that this software tends to involve a lot of behaviour that is the result of Machine Learning). These concerns have been exacerbated by a range of well-publicised crashes of autonomous cars. Perhaps the most widely reported one was the May 2016 crash of a Tesla Model S, which “auto piloted” into the side of a tractor trailer that was crossing a highway, killing the driver in the process.

As a counter-argument, proponents of driverless technology need only point to the data. In the US Department of Transportation report on the aforementioned Tesla accident, it was observed that the activation of Tesla’s Autopilot software had resulted in a 40% decrease in crashes that resulted in airbag deployment. Tesla’s Elon Musk regularly tweets links to articles that reinforce this message, such as one stating that “Insurance premiums expected to decline by 80% due to driverless cars”.

Why on earth not embrace this technology? Surely it is a no-brainer?

The counter-argument is that driverless cars will probably themselves cause accidents (possibly very infrequently) that wouldn’t have occurred without driverless technology. I have tried to summarise this argument previously: the enormous complexity and heavy reliance upon Machine Learning could make these cars prone to unexpected behaviour (cf. articles on driverless cars running red lights and causing havoc near bicycle lanes in San Francisco).

If driverless cars can in and of themselves pose a risk to their passengers, pedestrians and cyclists (and this seems to be apparent), then an interesting dilemma emerges. On the one hand, driverless cars might lead to a net reduction in accidents. On the other hand, they might cause a few accidents that wouldn’t have happened under the control of a human. If both are true, then the argument for driverless cars is in essence a utilitarian one: they will benefit the majority, and the question of whether or not they harm a minority is treated as moot.

At this point, we step from a technical discussion to a philosophical one. I don’t think that the advent of this new technology has really been adequately discussed at this level.


Should we accept a technology that, though it brings net benefits, can also cause accidents in its own right? This is anything but a no-brainer in my opinion.

Friday 16 December 2016

Virtuous Loops or Tyrannical Cycles?

Feedback loops - the ability to continually refine and improve something in response to feedback - are innocuous, pervasive phenomena. For this blog post I’m particularly interested in instances where these loops have been deliberately built into a particular technology or solution to solve a specific problem. I’m writing this post because I want to try to put forward the following argument:

The most successful, `disruptive’ changes in society and technology have been (and are being) primarily brought about by innovations that are able to harness the power of this simple loop. 

Potted History

Throughout the 20th century, this simple loop has dominated technological progress. We will proceed to look at some examples. It is worth highlighting that these are treated in roughly chronological order. However (and I don’t believe this is an accident), they are also ordered according to the rate at which feedback is provided; one of the reasons I am writing this blog entry is that this rate appears to be constantly increasing.

In the early 1900s, Bell Labs were experiencing a lot of variability in the quality of their telephone wiring. Walter Shewhart, a statistician, was charged with devising a means by which to eliminate this variability. In order to refine the cabling process, he proposed a cycle, later popularised by Edwards Deming as the Plan-Do-Check-Act cycle. In essence, any procedure is refined as follows: you “Plan” your procedure, you “Do” it, you “Check” the quality of the outcome, identifying any problems, and you “Act” by refining the plan, and then you repeat the cycle.

The approach was one of the major innovations that formed the basis for the American manufacturing boom in the early 20th century. Shewhart, Deming and their colleagues travelled to Japan in the aftermath of the Second World War and went on to spread this ethos of continuous feedback there. This laid the foundations for the Toyota Production System and similar manufacturing procedures that underpinned the Japanese “economic miracle”, and it has since inspired various modern equivalents, such as “lean manufacturing” and Continuous Process Improvement.

These principles eventually fed into the most enduring approaches within software engineering. Iterative and Incremental Development and its successors among Agile techniques all centre on feedback: regular cycles enable continual feedback from the client about the product, daily team meetings enable feedback from developers, and technologies such as GitHub increasingly enable feedback on individual code changes.

Within web applications, continuous feedback from users to providers (or to each other) has become a core means by which to maintain and enforce service standards; one can think of the likes of Uber, Airbnb, and eBay.

At an algorithmic level, most Machine Learning algorithms play upon some form of internal feedback loop. Two obvious examples are Genetic Algorithms and “Deep Learning” Recurrent Neural Networks. At their heart is a capability to continually modulate an inferred model by adapting to feedback. The last couple of years have seen enormous advances in Machine Learning technology. Computers can now beat humans at Go, they can recognise details in enormously complex patterns, and they can drive cars. I do not think that this is because of enormous advances in terms of the algorithms themselves. It is because there has been a sudden surge in the availability of data, and specifically in the availability of feedback.
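To make the loop itself concrete, here is a minimal sketch (in Java, purely illustrative; the toy problem and all names are my own invention, and it uses mutation only, with no crossover) of the generate-evaluate-refine cycle at the heart of a genetic-style algorithm. Each candidate is scored, and that score is the feedback that shapes the next generation.

    import java.util.Arrays;
    import java.util.Random;

    // A stripped-down, mutation-only genetic-style algorithm for the toy "OneMax"
    // problem (maximise the number of 1-bits). The fitness score of each candidate
    // is the feedback that shapes the next generation.
    public class FeedbackLoopSketch {

        static final int LENGTH = 32;      // bits per candidate
        static final int POPULATION = 20;  // candidates per generation
        static final Random RNG = new Random(42);

        static int fitness(boolean[] candidate) {
            int score = 0;
            for (boolean bit : candidate) if (bit) score++;
            return score;                  // the feedback signal
        }

        public static void main(String[] args) {
            boolean[][] population = new boolean[POPULATION][LENGTH];
            for (boolean[] c : population)
                for (int i = 0; i < LENGTH; i++) c[i] = RNG.nextBoolean();

            for (int generation = 0; generation < 100; generation++) {
                // "Check": evaluate and rank every candidate by fitness.
                population = Arrays.stream(population)
                        .sorted((a, b) -> fitness(b) - fitness(a))
                        .toArray(boolean[][]::new);
                // "Act": keep the best half; refill the rest with mutated copies of it.
                for (int i = POPULATION / 2; i < POPULATION; i++) {
                    boolean[] child = population[i - POPULATION / 2].clone();
                    child[RNG.nextInt(LENGTH)] ^= true;   // single-bit mutation
                    population[i] = child;
                }
            }
            System.out.println("Best fitness after 100 generations: " + fitness(population[0]));
        }
    }

Everything interesting here happens because the fitness score flows back into selection and mutation; starve the loop of that feedback and the “learning” stops.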

What next?

What links the emergence of powerful Machine Learning algorithms with the sudden rise of apps such as Uber? 

I believe that both have become as powerful as they are because it has been easier to collect data that can be channeled into feedback. This is as true for Machine Learning as it is for apps, where there are suddenly millions of users, all of whom have smartphones with virtually uninterrupted internet access.

The potential that could be gained from exploiting this loop first became apparent to me when I read a 2004 Nature paper, “The Robot Scientist”; this nicely embodied the feedback loop, replacing a human scientist with a Machine Learner tasked with carrying out experiments (automatically, via robotic equipment) to infer a model of a particular genetic pathway. In their loop, the hypothesis model was refined after each experiment and tested in each subsequent cycle. At the time this appeared to me to be impossibly futuristic. We are now almost 13 years down the line, and the process of fully automated drug discovery has been realised.
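As a rough illustration of the shape of that closed loop (and only the shape: the interfaces and method names below are invented stand-ins, whereas the real system inferred logical models of gene function), the cycle amounts to choosing the experiment expected to discriminate best between the surviving hypotheses, running it, and discarding whatever the outcome contradicts.

    import java.util.ArrayList;
    import java.util.List;

    // Shape of a "Robot Scientist"-style closed loop (sketch only): Hypothesis,
    // Experiment and Lab stand in for the real machinery of model inference and
    // robotic experimentation.
    interface Experiment { }
    interface Hypothesis { boolean predicts(Experiment e); }
    interface Lab {
        boolean run(Experiment e);                                  // perform it for real
        Experiment chooseMostInformative(List<Hypothesis> rivals);  // experiment design
    }

    class RobotScientistSketch {
        static List<Hypothesis> investigate(List<Hypothesis> candidates, Lab lab, int budget) {
            List<Hypothesis> surviving = new ArrayList<>(candidates);
            for (int cycle = 0; cycle < budget && surviving.size() > 1; cycle++) {
                Experiment next = lab.chooseMostInformative(surviving);
                boolean outcome = lab.run(next);                    // feedback from the world
                // Refine: keep only the hypotheses consistent with the observed outcome.
                surviving.removeIf(h -> h.predicts(next) != outcome);
            }
            return surviving;
        }
    }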

To me, this renders the coming years as exciting as they are terrifying. For every noble cause, such as automated drug discovery, there is an equally disconcerting one, especially when humans become a direct part of this loop. An unrelenting, unwavering mechanised channel of feedback is necessarily reductive. If the feedback has direct implications for someone’s livelihood, the consequences can be brutal (cf. the consequences for Uber drivers who receive consistently low ratings).

As organisations are seduced by the apparent “direct democracy” and self-regulating properties of these mechanisms, it is easy to see how, for its workers, this can turn into tyranny at the hands of a capricious, remote customer.


Even in my field of university teaching, similar changes are becoming increasingly tangible. Students are now customers, and their feedback on teaching (however subjective that may be) plays at least as much of a role as research in ranking departments and universities against each other, a trend that will no doubt become more pronounced with the emergence of the Teaching Excellence Framework. I noticed this year that the Panopto lecture-recording software, mandated across a growing list of UK universities, even has a feature that gives students the option (in a Netflix-esque way) to rate individual lectures out of five stars…

Monday 10 October 2016

Autonomous Vehicles in the UK: Beware the Robotised Drunk Driver



[Disclaimer: I have had no involvement with the development of autonomous vehicle software. Any concerns raised here are based on my generic software development knowledge, and might (hopefully) be completely unfounded with respect to autonomous vehicles in the UK.]

Much has been written about the interesting ethical questions that arise with the control systems in autonomous vehicles. Most of this has been a rephrasing of the various age-old gedanken experiments that could confront an autonomous AI. For example: the car is driving at speed, and a pedestrian steps out in front of it. The AI can either take evasive action into a lane of oncoming traffic (posing a risk to the driver and oncoming vehicles), or it can plough on into the pedestrian and kill them. What action should / can it take, and who is at fault when someone dies?

Fundamentally, this elicits either a technical response (the car would employ some technology to prevent such a situation from ever arising) or a legalistic one (the driver should always have their hands hovering over the wheel, so this is ultimately their responsibility).

However, in my view there is another ethical question that cannot be shrugged off so lightly. Previous thought exercises have assumed that the software is behaving correctly according to the rules set out by its developers. The software could however easily contain bugs. One can imagine a pathological case, where a bug leads to a car veering off course, yet also prevents any interference from the driver. 

If this is possible, or even probable, is it ethical to expose drivers and the wider UK public to such vehicles?

Software systems within cars are enormously complex. As an illustrative example, well before the era of autonomous vehicles, the software system that controlled just the brake and the throttle of a 2005 Toyota Camry amounted to 295 thousand non-commented lines of C code. Move on a decade, and the software that controls a modern autonomous car is orders of magnitude more complex.

Highly complex software systems become difficult to manage and understand. They become especially difficult to test rigorously, and impossible to verify. This is especially true if they include a lot of concurrency, non-determinism, Machine Learning systems, and rely upon complex sensor inputs (as is invariably the case with autonomous vehicles). 

The “pathological” example of a software bug causing a car to veer off course, beyond the control of a driver, is perhaps not as pathological as it seems. This is what happened with the Toyota Camry mentioned above (along with a range of Toyota and Lexus models up to 2010). Even though the software was merely in charge of brake and throttle control, it led to circumstances where the brakes became unresponsive and the driver was unable to slow down, and it has been linked to “at least 89” deaths in the US. Subsequent inspection of the software showed it to be incomprehensible, untestable, and a probable hotbed of bugs. Since then, we have witnessed most modern car manufacturers regularly recalling hundreds of thousands of cars due to software defects.

There is also no sign that this trend is about to abate in the case of autonomous vehicles. An autonomous vehicle with software bugs is akin to a drunk driver in robot form. We have already witnessed collisions, and even a fatality, caused by bugs in autonomous vehicle software. Google's autonomous car manoeuvred itself into a bus. There have been several reports of Tesla crashes - e.g. a Tesla that "autopiloted" into an SUV, killing its passengers; an auto-piloted Tesla that crashed into a bus full of tourists; and another that auto-piloted into a truck, again killing its driver.

This is not necessarily due to poor practice by Google or Tesla. It is simply a brutal reflection of the fact that software is inevitably defect-prone. This has been shown to be especially the case with Machine-Learning-oriented autonomous car software.

This leads to an ethical conundrum, not just for car manufacturers, but for governments who choose to offer themselves up as a testbed for this technology. By providing “permissive regulations” for these vehicles to be trialled in cities across the UK, the government is unwittingly exposing the British public to robots that are not under the control of their drivers, and that are controlled by software that almost inevitably contains bugs.

It is important to emphasise this: it is not just drivers themselves, but pedestrians, cyclists, and families with children crossing roads, who are being exposed. In academia, if this were an experiment and the general public were the subjects, it would not come close to passing an ethics committee.

My personal view is that our processes for quality assurance are not yet mature enough to provide adequate confidence in the correctness of such software systems.

I am completely unfamiliar with the QA processes that are mandated by the UK government in this instance. But, if the genie is to be released from the bottle, one has to hope that they have at the very least:

  1. Established quality and trust models that specifically factor in the various properties of autonomous software that make it so particularly difficult to reason about.
  2. Mandated that the software artefacts in autonomous vehicles are inspected and verified by an independent third party, cognisant of the specific UK driving conditions that might not have been factored into QA activities abroad, and that the reports from these verification activities are made openly available to the public.
  3. Begun compiling a large number of test scenarios and associated data that are to be applied to all autonomous vehicles to be driven in the UK.
  4. Maintained a database of all autonomous vehicles in the UK, so that if (when) dangerous software defects are detected, these vehicles can be mandatorily recalled to prevent them from causing harm.




Friday 2 September 2016

Blurring Boundaries in Computer Science


Research groups in Computer Science departments are awkward. I’m not talking about group members themselves, but the very notion of a group. 

Whereas traditional subjects such as Medicine have lots of nice, well-defined groupings (oncology, diabetes, etc.), this is not as clear for Computer Science, where new disciplines are continuously arising, and the boundaries between established areas seem to be gradually fusing. 

Let us take a set of typical department groups:

Formal Methods

Algorithms

Software Engineering

Machine Learning

Natural Language Processing

These areas all have their own journals and conference venues, their own superstars and core problems. 

But it is striking how much overlap there can be. Let us start with Software Engineering. There are now countless Software Engineering papers that apply and build upon Machine Learning techniques. Natural Language Processing has a second home in Software Engineering, with the extensive use of Topic Modelling, LSI, etc. (Topic Modelling, by the way, is itself a technique that straddles Machine Learning and Natural Language Processing.) Formal Methods and Software Engineering essentially share the same goals. Theorem proving, a core Formal Method, increasingly uses Machine Learning techniques; indeed many of the automated reasoning techniques that underpin theorem provers fundamentally share the same goals as Machine Learning algorithms. And all of these areas (FM, SE, ML, and NLP) revolve around Algorithms.

Crucially, the expertise of individuals rarely sits squarely in one of these areas, but usually cuts across two or more of these areas. The traditional groups are simply no longer appropriate for pigeon-holing many researchers or research problems.

This is what makes Computer Science groups (and the discipline as a whole) awkward. The community (and departments) are very much split along these traditional, entrenched lines. Formal methods people operate in their own groups and publish in their own conferences and journals. As do Algorithms people, Software Engineers, etc. 

This post however argues that these boundaries are unnecessary and largely artificial. This always hits home when I get the opportunity to attend a conference that is slightly outside of my area. It is always striking how much scope there is for cross-disciplinary collaboration, how similar the fundamental problems are. One can’t help but wonder how much work is duplicated in different fields, and how much further we would be as a discipline if these boundaries simply didn’t exist.


Friday 13 May 2016

In defence of the Siemens Suite: There is nothing wrong with evaluating testing techniques on "small" programs.


Software testing research has an (in my opinion somewhat perverse) obsession with "scalability". When producing empirical results, techniques that have been applied to larger systems are favoured over techniques that have "only" been applied at the unit level. I know this because many reviewers of my own papers (where empirical results tend to be produced at the unit level) have said as much.

Take this example from a review of one of my papers from last year (a paper that was happily accepted despite the comment):

"The case studies are small (and include the infamous and deprecated Seimans suite... the authors might want to work on **MUCH** bigger programs for future versions of this work."

The reviewer is correct in one respect; the Siemens suite (of seven small-ish C programs) is widely derided. I remember it being contemptuously referred to as "the seven dwarves".

I believe, however, that this derision of small programs is irrational. To support this I put forward three arguments. These are elaborated below.

First, let us loosely pin down what is meant by "size". A "small" program can be an executable program with <200 LOC and a simple interface, contained within a single module (e.g. TCAS). A "big" program can be a large executable consisting of >10,000 LOC spread across multiple modules, with a complex interface (e.g. Mozilla or JEdit).

For the sake of terminology, we'll take a "unit" to mean a single executable piece of functionality (e.g. a public method in a class).

Argument 1: To be useful in the industry, a testing technique or tool does not necessarily have to scale up to "large" units or systems.

Complex software systems are commonly decomposed into relatively small units of functionality. Even if some units do happen to be too large for a given technique to test, it might still be capable of handling the majority of smaller units. If this is the case, then it would save time and resources, and would surely be useful.

The argument against the Siemens suite is not necessarily just about the size of the units, but also that they are old -- much of the code originates from the mid-nineties. But then again, so is the code in many industrial systems, which commonly contain legacy components that are much older (and smaller) than those in the Siemens suite.

Argument 2: Real bugs are often contained within small units of code that can be run in a self-contained manner.

If one looks at some of the "bugs" that hit the headlines in recent years, one can often surmise that these could have been localised to a specific unit of code.

Look at the Apple Goto-Fail bug and Heartbleed, etc.:

https://nakedsecurity.sophos.com/2014/02/24/anatomy-of-a-goto-fail-apples-ssl-bug-explained-plus-an-unofficial-patch/

http://martinfowler.com/articles/testing-culture.html#heartbleed

Both are contained within functions of comparable complexity to TCAS. In fact, Martin Fowler (in the above link) makes this case: they could both have been detected by unit-testing.
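As a hedged illustration of that point (the real goto-fail bug was in Apple's C code; the Java below is an invented stand-in showing the same class of flaw, and assumes JUnit 4), here is a small, self-contained verification routine in which one early exit silently skips a later check, together with a unit test that fails against it - which is exactly how the bug would have been caught.

    import org.junit.Test;
    import static org.junit.Assert.assertFalse;

    // Illustrative stand-in: a tiny verification routine with a goto-fail-style
    // slip, in which an early exit silently skips a later check.
    public class HandshakeCheckTest {

        // Should accept the handshake only if BOTH checks pass.
        static boolean verifyHandshake(boolean hashMatches, boolean signatureValid) {
            if (!hashMatches) {
                return false;
            }
            if (hashMatches) {
                return true;     // BUG: redundant early accept, so the signature
            }                    // check below is never reached
            return signatureValid;
        }

        @Test
        public void rejectsHandshakeWithInvalidSignature() {
            // Fails against the buggy routine above -- which is precisely how
            // unit testing would have caught the flaw.
            assertFalse(verifyHandshake(true, false));
        }
    }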

In other words, automated tools and techniques that scale only to "small" programs can readily have a huge impact. This brings to mind the formal-methods techniques that have found bugs in the Java implementation of the TimSort algorithm - a reasonably small piece of code, with a bug that ended up affecting billions of Android devices and PCs.

http://www.envisage-project.eu/proving-android-java-and-python-sorting-algorithm-is-broken-and-how-to-fix-it/

So, if "much bigger" programs are required, how much bigger should they be? What criterion would they have to meet? And why?

Argument 3: A "small" system can be enormously complex

The critical piece of functionality might be contained within 5-10 lines of code. But one of these lines might call a library function. Consider, as an example, a Java method that uses collections, calls Math utilities, IO libraries, etc. (a sketch follows below). A relatively small program can therefore be functionally very complex, because a substantial amount of its functionality is delegated.
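For instance (a purely illustrative sketch; the method and its behaviour are invented), the following Java method is only a handful of lines long, yet almost everything it does is delegated to the file-IO, streams and Math libraries:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.List;

    // Illustrative only: a handful of lines of "our" code, with most of the real
    // behaviour delegated to the file-IO, streams and Math libraries.
    public class SummaryStats {

        // Reads one number per line and returns the root mean square of the values.
        static double rootMeanSquare(Path input) throws IOException {
            List<String> lines = Files.readAllLines(input);   // IO library
            double sumOfSquares = lines.stream()              // collections/streams
                    .mapToDouble(Double::parseDouble)         // parsing
                    .map(x -> Math.pow(x, 2))                 // Math utilities
                    .sum();
            return Math.sqrt(sumOfSquares / lines.size());    // NaN on an empty file
        }
    }

Its line count says very little about its input space or its failure modes (an empty file, a malformed number), which is exactly the kind of complexity at issue here.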

Even if libraries do not come into play, it is still possible to have compact programs that, within a couple of hundred lines of code, include complex data manipulations and control structures. I refer again to the above example of the TimSort algorithm - implemented again and again, widely studied, relatively small, but still buggy.

As soon as a piece of code becomes difficult to rapidly verify by manual inspection (which is very soon), an automated tool or technique becomes very useful. Hundreds of lines of code are not necessary for this, and I would argue that the threshold for a useful tool is much lower.

Conclusions


This emphasis on scale has a pernicious effect on research. It focusses efforts on techniques that can feasibly be applied to large programs, at the expense of (potentially much more powerful) techniques that can only be applied to smaller programs. As a consequence, many of the really fundamental testing problems remain untouched, having not even been addressed with respect to "small" programs.

This attitude also drives an (in my view artificial) wedge between the formal methods and software engineering communities. Important techniques that have proven enormously valuable in practice, such as model checking and theorem proving, are often dismissed because they "don't scale", and only get consideration in a Software Engineering forum if they are framed in such a way that they might be adapted to "real life", "big" programs.

However, this attitude is again misplaced, for exactly the reasons listed above. These techniques are useful, and are accepted as such in practice. Formal methods events such as FM have an industrial presence that can probably match that of most large software engineering events. I can think of many "unscalable" formal methods techniques that have been successfully spun off or absorbed into the working practices of large organisations such as Facebook and Microsoft.

For many tasks, there are no convincing state-of-the-art techniques that even work on small programs. So getting a technique to work on a small program is important, and is in and of itself potentially useful to the industry.

Let's not dismiss the Siemens suite.