Applying AI know-how to the giant pool of data gathered from the world’s leading and most powerful scientific instruments could accelerate the process of scientific discovery. Powerful machine-learning approaches offer new ways to extract scientific meaning from the raw experimental data, which ultimately could help funders to unlock more value from their investment in research.
Large-scale experimental facilities such as neutron and synchrotron sources have become an essential element of modern scientific research, allowing visiting researchers to probe the structure and properties of many different types of materials. They also generate huge amounts of experimental data, which can make it difficult for visiting scientists without specialist knowledge of the experiment to extract meaningful information from the raw datasets. As a result, some of the data collected during their valuable beamtime is never properly analysed.
The good news is that this situation has improved dramatically over the last 10 years, with a consortium of leading neutron facilities working together to streamline and standardize the software used to analyse data from neutron scattering and muon spectroscopy experiments. The framework – called Mantid – supports a common data structure and shared algorithms that enable visiting scientists to easily process and visualize their experimental results.
“This common framework helps visiting scientists to get to grips with instruments at different facilities,” comments Nick Draper, one of Tessella’s senior project managers. “But it also helps researchers to make use of a different instrument at the same facility.”
Next big challenge
According to Draper, who has long been involved in supporting big science projects, the next major challenge is to make it easier for researchers from different scientific backgrounds to analyse and interpret the complex experimental output that can be produced. “Often there’s not just one model that you could fit to your data, there could be 20 or 30 options, and sometimes it’s not absolutely clear which model you should be picking,” Draper explains. “At the moment, it takes expert opinion from instrument scientists who really understand the experiments to lead and guide on which approaches to take.”
But with larger and larger volumes of data to get through, this can create a bottleneck that delays results. One option for speeding up the process is to exploit artificial intelligence (AI) to help with model selection. It’s a concept that some researchers might feel uneasy about, but Draper’s colleague Matt Jones – an analyst at Tessella who keeps a watchful eye on the latest industry trends – has some words of reassurance. “AI is there to help the human, it’s not there to govern and provide the answers – it’s there to augment,” he states.
Jones has followed the rise of AI from early monolithic offerings to today’s cloud-based solutions, and notes its success in aiding pharmaceutical development – for example, using AI-augmented analysis to scale up drug-discovery processes, which in turn frees experts to work on higher-value tasks. He also advocates taking a tailored approach to maximize the benefits. “The most accurate and best solutions are built for solving the immediate problem at hand,” he comments.
The deep-learning revolution
Today, the buzz surrounding artificial intelligence is hard to ignore. We’ve been wowed by computers that can beat grandmasters at chess and Go, and are served by increasingly powerful speech-recognition and machine-translation tools. To that list of highlights you can add breakthroughs in image recognition, together with progress in driverless vehicles. But why is it all happening now? After all, many machine-learning algorithms have been around for decades.
The crucial factor is the impact of scale, specifically the parallel growth of data and available computing power. And this has transformed the capabilities of one technique in particular – deep learning – which benefits greatly from the availability of large datasets.
While other methods plateau when you feed them more data, the performance of deep learning’s artificial neural networks keeps climbing. And the larger (or deeper) the neural network, the greater its capacity to extract value from its inputs and deliver meaningful outputs.
Combining big data with large amounts of compute makes it possible to create artificial neural networks with many so-called hidden layers. These deep-learning systems are giant mathematical functions comprising multiple layers of nodes, each with weights and biases that are adjusted during training, all sandwiched between a series of inputs and outputs.
The rich combination of data and compute – together with a greater understanding of how to train these powerful multi-layered networks using techniques such as backpropagation – is now taking the performance of machine-learning techniques to new heights.
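To make that picture concrete, here is a minimal sketch in Python (with NumPy) of the kind of layered function described above. The layer sizes, names and initialization are purely illustrative, not drawn from any system discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    # A common non-linearity applied between layers.
    return np.maximum(0.0, x)

# Illustrative layer sizes: 10 inputs, two hidden layers of 32 nodes, 3 outputs.
sizes = [10, 32, 32, 3]

# Each layer is just a weight matrix and a bias vector; training
# (typically via backpropagation) adjusts these parameters.
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """The 'giant mathematical function': alternating affine maps and
    non-linearities, sandwiched between the inputs and the outputs."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return x @ weights[-1] + biases[-1]   # final layer left linear

print(forward(rng.normal(size=10)))
```

Stacking more entries into `sizes` deepens the network, which is exactly what gives it the extra capacity described above.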
Engaging the benefits
The flip side is that research groups need access to large amounts of data and compute to realize the full benefits of deep learning, and they need support from teams who can get these systems up and running.
It’s an issue that Tony Hey, Chief Data Scientist at the Science and Technology Facilities Council (STFC), and his team are well aware of. To help researchers extract more science, more efficiently, from their experiments, Hey is assembling a Scientific Machine Learning group, working closely with the Alan Turing Institute – the UK’s national institute for data science and artificial intelligence.
Hey is also linked to STFC’s Ada Lovelace Centre, which is being established as an integrated, cross-disciplinary, data-intensive science hub that has the potential to transform research at big science facilities through a multidisciplinary approach to data processing, computer simulation and data analytics.
Objectives for Hey include applying AI and advanced machine-learning technologies to the experimental data generated by STFC-supported facilities at the Harwell Campus: the Diamond synchrotron source; the ISIS neutron and muon source; the UK’s Central Laser Facility; and the NERC Centre for Environmental Data Analysis with its JASMIN super data cluster.
“The analysis of huge datasets requires automation and machine help as the volume goes beyond what used to be possible by hand,” Hey comments. “However, there are lots of opportunities to try to help automate the data flow in the pipeline in getting data from a machine to the point where you can do science with the results.”
Building this pipeline requires helping researchers to understand more about the machine-learning algorithms. “You need transparency and understandability as to how various methods will get you to an answer, not black boxes,” he points out.
Hey is keen to develop what he describes as machine-learning benchmarks. He also wants to leverage existing expertise in communities such as particle physics and astronomy, which have been dealing with petabyte-scale big-data challenges for some time. The goal is to create a broader support structure for machine learning and AI that other disciplines can tap into. That means being able to strip out the jargon and make processes such as building data-classification models understandable outside a given field.
Teaching labs
One way of lowering the barrier to entry is to provide what John Watkins of the Centre for Ecology & Hydrology (CEH) calls “teaching labs” – for example, C++ routines that have been packaged into an R library, married with a dataset, and then wrapped in a web-based R Shiny app for convenient access. “They let people look at various algorithms and play with them to learn their particular characteristics and discover how methods may or may not be useful in their work,” he says.
For Watkins and his environmental science colleagues, one size rarely fits all. Researchers in the field commonly need to understand a variety of data from different sources – for example, output from sensors on land and in the atmosphere, as well as oceanographic measurements.
“Ideally you want access to a range of tools to hit a block of data with and compare the results to identify the most efficient method,” he advises. “You don’t want to be in the position where you can only attack it with one method, because that’s the only capacity that you have.”
There are other considerations too, beyond stripping out the jargon and providing accessible, benchmarked tools. It’s also important to support the optimal workflow for a given task, which might mean running models on a high-performance computing (HPC) system, storing the results on a large-scale data cluster, and then switching to a smaller-scale operation once the important portion of the data has been identified.
Clearly, it’s a job for multi-skilled teams who can navigate not just the technology, but also the science that the AI is being targeted at. Returning to our earlier example, Draper is encouraged by a pilot analysis of small-angle neutron scattering data, in which AI is used to steer users towards either a spherical or a cylindrical model to fit the data. Early results are promising, but the next question is whether the approach remains effective when the choice jumps to as many as 40 different models.
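To give a flavour of how such model selection can be posed for a machine, the sketch below frames it as a classification problem over one-dimensional scattering curves. The curve generators here are crude stand-ins invented for this example; they are not the actual SANS form factors or code used in the pilot.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
q = np.linspace(0.01, 0.5, 100)   # scattering vector, illustrative units

def toy_sphere(radius):
    # Crude stand-in for a sphere-model intensity curve.
    x = q * radius
    return (3 * (np.sin(x) - x * np.cos(x)) / x ** 3) ** 2

def toy_cylinder(length):
    # Crude stand-in for a cylinder-like decay (not the real form factor).
    return np.exp(-((q * length) ** 2) / 3) / (1 + q * length)

# Build a labelled set of noisy curves: label 0 = sphere, 1 = cylinder.
X, y = [], []
for _ in range(500):
    X.append(toy_sphere(rng.uniform(20, 60)) * rng.lognormal(0, 0.05, q.size))
    y.append(0)
    X.append(toy_cylinder(rng.uniform(20, 60)) * rng.lognormal(0, 0.05, q.size))
    y.append(1)
X = np.log(np.asarray(X) + 1e-12)   # log-intensity makes the shapes comparable
y = np.asarray(y)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
# For a new curve, clf.predict_proba would suggest which model to fit first.
```

In the real setting the training curves would be simulated from the candidate scattering models themselves, and the classifier’s class probabilities, rather than hard labels, would guide the user towards which model to try fitting first.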
Just the beginning
Draper and his Tessella colleague Matt Jones believe this is just the beginning of a trend that could revolutionize the analysis of scientific data, with interest growing among the research community in the possible benefits of AI. “We are just starting to prick the edges of this future now,” says Jones. He anticipates more conversational interfaces, as well as visual approaches such as virtual reality, that lend themselves to presenting highly detailed scientific structures and complex data.
“AI is a really interesting place for the future,” adds Draper, who is also well aware of the hurdles. “You need lots of training data,” he points out, “and that data has to be properly tagged.”
But what happens if training data doesn’t exist, or is only available in limited quantities? One idea is to back-generate images that indicate what a particular model would look like. “If you do that lots of times with different parameters, mixing in static and distorting the images to make them as realistic as you can, then you can create training data,” says Draper. “The challenge is to ensure that you are not simply overtraining your dataset to recognize the things that you have created as opposed to actual experimental results.”
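As a rough illustration of this back-generation idea, the sketch below produces synthetic training curves from the analytic form factor of a sphere, varying the model parameters and mixing in noise and instrument-like distortions. All of the parameter ranges and distortion levels are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
q = np.linspace(0.01, 0.5, 100)   # scattering vector, illustrative units

# Gaussian resolution-smearing matrix, a stand-in for instrument effects.
smear = np.exp(-0.5 * ((q[:, None] - q[None, :]) / 0.01) ** 2)
smear /= smear.sum(axis=1, keepdims=True)

def sphere_intensity(radius):
    """Analytic form factor of a homogeneous sphere of the given radius."""
    x = q * radius
    return (3 * (np.sin(x) - x * np.cos(x)) / x ** 3) ** 2

def synthetic_example():
    """One back-generated curve: random parameters, then 'static' and
    distortions to make it look as realistic as possible."""
    curve = sphere_intensity(rng.uniform(10, 80))   # vary the model parameters
    curve *= rng.uniform(0.5, 2.0)                  # unknown overall scale
    curve += rng.uniform(1e-4, 1e-3)                # flat background level
    curve = smear @ curve                           # resolution smearing
    curve *= rng.lognormal(0.0, 0.1, q.size)        # counting-noise 'static'
    return curve

training_set = np.stack([synthetic_example() for _ in range(10_000)])
print(training_set.shape)   # (10000, 100): ready for training a classifier
```

Holding out genuinely measured curves for validation is one way to check for the overtraining trap that Draper warns about.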
Synthetic data that sums a number of signals has proven useful in enhancing speech recognition – for example, by training systems to overcome background sounds such as in-car noise – so again, it’s possible that knowledge developed in one sector can be transferred across different domains.
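The signal-summing trick is straightforward to express in code. The sketch below, a generic illustration rather than anything from a particular speech system, mixes a clean signal with background noise at a chosen signal-to-noise ratio (SNR).

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Add background noise to a clean signal at a target SNR in decibels."""
    noise = noise[: len(clean)]   # assumes the noise clip is at least as long
    p_signal = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10 * log10(p_signal / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Toy usage: synthetic waveforms standing in for speech and in-car noise.
t = np.linspace(0.0, 1.0, 16_000)
speech = np.sin(2 * np.pi * 220 * t)                  # stand-in for a speech clip
car_noise = np.random.default_rng(3).normal(size=t.size)
noisy = mix_at_snr(speech, car_noise, snr_db=10.0)    # training example at 10 dB
```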
Predictive power
Success in deploying AI requires teams with talent across multiple areas: an understanding of the data, knowledge of machine-learning algorithms and statistical methods, and expertise in high-performance or cluster computing. But the potential rewards make the challenges worth conquering, and they extend beyond the analysis of experimental results.
Google, for example, has reportedly used deep learning to cut the energy needed to cool its data centres. Algorithms can alert operators when machinery is close to failure and should be replaced, which minimizes downtime. The output can also inform optimal servicing frequencies to keep equipment in reliable working order for as long as possible.
This predictive power can be applied at big science facilities too, notes Tessella’s Kevin Woods – a senior project manager involved in the update of instrument control systems. “By looking at the long-term patterns [in the signals] you can actually spot imminent failures,” he says. One example could be a gradual increase in motor operating temperature, which may indicate that an actuation unit is on its way to overheating.
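A minimal sketch of that kind of drift detection, with the window size and threshold invented purely for illustration, might fit a straight line to the most recent window of temperature readings and flag a sustained upward slope:

```python
import numpy as np

def drifting_upwards(readings, window=200, slope_limit=0.01):
    """Flag a sustained rise: least-squares slope over the latest window
    of readings, compared against an illustrative threshold."""
    recent = np.asarray(readings[-window:], dtype=float)
    t = np.arange(recent.size)
    slope = np.polyfit(t, recent, deg=1)[0]   # degrees per sample
    return slope > slope_limit

# Toy usage: a stable motor temperature versus one that starts to creep up.
rng = np.random.default_rng(4)
stable = 40.0 + rng.normal(0.0, 0.2, 800)
creeping = stable + np.concatenate([np.zeros(600), np.linspace(0.0, 4.0, 200)])
print(drifting_upwards(stable))     # False: just noise, no trend
print(drifting_upwards(creeping))   # True: gradual rise in the latest window
```

A production system would of course use more robust statistics and calibrated thresholds, but the principle of watching long-term patterns rather than instantaneous values is the same.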
The results so far suggest that investing in AI puts multiple rewards within reach. Machine learning has the potential to dramatically speed up the analysis of big data across different domains, allowing research teams to make faster progress in understanding increasingly complex phenomena. To succeed, researchers need easy access to extensive datasets, large amounts of compute, and the ability to experiment with algorithms and understand which are best matched to the task.
Read more in “Artificial intelligence and cloud computing: the future for scientific research”, available for download from the Tessella website.