# Neural networks extract information from sparse datasets

18 Nov 2019 Margaret Harris
This article first appeared in the 2019 Physics World Focus on Computing under the headline "Learning from incomplete data"

Condensed-matter theorist Gareth Conduit developed an algorithm that can “learn” from incomplete data. His next challenge: turning it into a business

### How did you get the idea for your company?

I was chatting to a materials science PhD student in a pub a few years ago, and he started telling me about some mathematical problems his group was facing. They were trying to use neural networks to predict the properties of new materials as a function of their composition, and I showed them how to use a tool called a covariance matrix to calculate the overall probability that a new material will satisfy various requirements – strength, cost, density and so on – at once. By doing that, we were able to design several new metal alloys, which are now being tested by Rolls-Royce.
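As a rough illustration of the covariance-matrix idea, the sketch below estimates the probability that a single candidate alloy meets several targets at once. It assumes the predicted properties follow a multivariate normal distribution. All numbers, property names and the function `joint_success_probability` are hypothetical and are not taken from the Rolls-Royce work.

```python
import numpy as np

def joint_success_probability(mean, cov, lower, upper, n_samples=200_000, seed=0):
    """Monte Carlo estimate of P(lower <= x <= upper) for x ~ N(mean, cov)."""
    rng = np.random.default_rng(seed)
    samples = rng.multivariate_normal(mean, cov, size=n_samples)
    inside = np.all((samples >= lower) & (samples <= upper), axis=1)
    return inside.mean()

# Hypothetical predictions for one alloy: strength (MPa), cost (per kg), density (g/cm^3).
mean = np.array([1000.0, 20.0, 8.0])
std = np.array([50.0, 2.0, 0.2])

# Assumed correlations between the prediction errors of the three properties.
corr = np.array([[1.0, 0.6, 0.3],
                 [0.6, 1.0, 0.2],
                 [0.3, 0.2, 1.0]])
cov = corr * np.outer(std, std)  # covariance matrix built from std and correlation

# Targets: strength of at least 950, cost of at most 21, density of at most 8.2.
lower = np.array([950.0, -np.inf, -np.inf])
upper = np.array([np.inf, 21.0, 8.2])

p_joint = joint_success_probability(mean, cov, lower, upper)
# Same calculation with the off-diagonal terms ignored, for comparison.
p_independent = joint_success_probability(mean, np.diag(std**2), lower, upper)

print(f"with correlations: {p_joint:.3f}, treating properties as independent: {p_independent:.3f}")
```

In this made-up case, strength and cost are positively correlated, so demanding high strength and low cost together is harder than the independent estimate suggests. That gap is what the off-diagonal terms of the covariance matrix capture.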

At that point, I began to investigate ways of getting even deeper insights into material properties. Certain physical laws, like the fact that electrical conductivity is proportional to thermal conductivity, or that the tensile strength of a material is proportional to three times its hardness, are very powerful for predicting how a material will behave. However, because we set up our neural networks to always extrapolate from composition to property, we weren’t exploiting property–property correlations. So I changed the algorithm so that the neural network could capture that additional information, and we used it to design materials that can be used in a 3D printing process called direct metal deposition. We only had 10 experimental data points for how well materials could be 3D printed, but we were able to take that small amount of data and merge it with the huge database of how weldable different alloys are, which is an analogous property. The resulting extrapolations guided our design of new materials.
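The next sketch shows one simple way a property–property correlation can stretch a handful of measurements. It is not the algorithm described in the interview: a two-step linear fit stands in for the neural network, and the data are synthetic. The three-element compositions, the weights and the linear link between weldability and printability are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (not real alloy measurements).
n_alloys = 500
composition = rng.random((n_alloys, 3))          # fractions of three hypothetical elements
true_weights = np.array([1.0, -2.0, 0.5])        # assumed composition -> weldability link
weldability = composition @ true_weights + 0.05 * rng.normal(size=n_alloys)
printability = 2.0 * weldability + 1.0 + 0.05 * rng.normal(size=n_alloys)

sparse_rows = np.arange(10)                      # pretend only 10 alloys have a printability value

# Step 1: composition -> weldability, fitted on the full database.
design = np.column_stack([composition, np.ones(n_alloys)])
weld_coef, *_ = np.linalg.lstsq(design, weldability, rcond=None)

# Step 2: weldability -> printability, fitted on the 10 sparse points only.
slope, intercept = np.polyfit(weldability[sparse_rows], printability[sparse_rows], 1)

def predict_printability(new_composition):
    """Chain the two fits: composition -> weldability -> printability."""
    new_design = np.column_stack([new_composition, np.ones(len(new_composition))])
    return slope * (new_design @ weld_coef) + intercept

# Check the chained prediction against the noise-free synthetic ground truth.
new_alloys = rng.random((200, 3))
expected = 2.0 * (new_alloys @ true_weights) + 1.0
rmse = np.sqrt(np.mean((predict_printability(new_alloys) - expected) ** 2))
print(f"RMSE on unseen synthetic alloys: {rmse:.3f}")
```

The 10 sparse points only have to pin down two numbers (a slope and an intercept), because the large weldability database has already captured the dependence on composition.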

### What happened next?

The direct metal deposition project exposed me to the idea that there might be new opportunities in merging sparse databases (like the one for 3D printability) with full ones (like the one for weldability), so the next step was to develop a much more comprehensive method for doing that. The mathematical inspiration for this method comes from many-body quantum mechanics, where something called the Dyson formula is used to calculate the Green’s function for an interacting particle in terms of the Green’s function for a non-interacting particle and a self-energy term that captures the effect of one particle interacting with another. We’re able to make an analogy in which the Green’s function of an interacting particle is like a prediction of a full material property, while the Green’s function of a non-interacting particle is like an “empty” data cell, for which we just make a naïve guess about what the value might be. Then our neural networks use the quantity we know to guide the extrapolation of the quantity we don’t. This enables us to merge experimental datasets, which are sparse, with some first-principles computer simulations and molecular dynamics simulations, which are complete.
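For readers unfamiliar with it, the Dyson equation from many-body theory can be written in its standard compact operator form as below, with $G$ the interacting Green's function, $G_0$ the non-interacting one and $\Sigma$ the self-energy. The second line is the usual iterative expansion that starts from $G_0$. How these symbols map onto the neural-network method is only loosely described in the interview, so the reading given after the equations is an editorial assumption.

```latex
\begin{align}
  G &= G_0 + G_0 \,\Sigma\, G
  \quad\Longleftrightarrow\quad
  G = \left(G_0^{-1} - \Sigma\right)^{-1}, \\
  G^{(n+1)} &= G_0 + G_0 \,\Sigma\, G^{(n)}, \qquad G^{(0)} = G_0 .
\end{align}
```

In the analogy, $G_0$ plays the role of the naïve guess placed in an empty data cell and $G$ the refined prediction of the full property. On that reading, $\Sigma$ would stand for the correction supplied by correlations with the quantities that are known.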

We also noticed that there is often a lot of information hidden in the “noise” within data. Again, we know this from many-body physics, from the physics of critical phenomena that occur in low-temperature solid-state systems, and from renormalization group theory, where the large-scale fluctuations in one physical quantity can be related to the mean expectation value of a different physical quantity. Physicists have developed a lot of maths to capture that knowledge, and if I port that across to our neural network, we can use the uncertainty in one quantity to tell us the mean value of another. That’s been helpful for interpreting microstructures and phase behaviour in materials.

These techniques have many possible uses, and although I’ve worked on a few of them in my capacity as a researcher at the University of Cambridge – collaborating first with Rolls-Royce, and later with Samsung to design new battery materials and BP to design new lubricants – I eventually decided that I needed to form a spin-out company to really drive them forward.

### What was the spin-out process like?

Initially, I was put in touch with Cambridge Enterprise, which is the university’s commercialization arm. They introduced me to several local business angels. I took each of them out to dinner, worked out what they thought the opportunities were and tried to understand what they’d be like to work with, and eventually selected an angel called Graham Snudden. Working with Graham helped me to understand our business plan, and he also introduced me to a former employee of his, Ben Pellegrini, who became my co-founder and the CEO of our spin-out, Intellegens. Ben had experience of working at smaller companies, and he had worked in software, which is a complementary area to my own skillset and absolutely core to our business strategy.

Ben Pellegrini: When I first met Gareth, he was running the algorithm through a terminal prompt on the university computer centre. He was always very enthusiastic and very bright, and I could see that there was real interest and value in what he was doing, but it was hard – I had to meet him a few times before I understood that when he was moving data around, he was generating interesting results. The big question was how to transform this tool from something that a specialized user can engage with at the command line into something your average engineer or scientist in a clinical lab or materials company can use. That’s a challenge I enjoy.

### How did you get funding?

BP: For the first six months, I was based in my kitchen and Gareth was doing work for Intellegens in the evenings. Then we got some money from Innovate UK to get us going with a proof-of-concept project, plus a little bit of money from Cambridge Enterprise and from Graham, who (as Gareth mentioned) is a local angel investor. We’ve also been quite lucky in that we can run consultancy-style projects to generate income as we’re going along.

### What are some of the projects you’ve worked on?

GC: We’ve been pushing hard on the problem of designing new drugs. The basic question is, if you inject a drug into a patient, which proteins will react to it? Does the drug activate them or inhibit them? There are about 10,000 proteins in the body and about 10 million drugs that you can test, so if you imagine a huge matrix where each column is a different protein and each row is a different drug, the dataset is only about 0.05% complete, because it’s impossible to conduct experimental tests on that many drug-protein combinations. It’s the ultimate sparse dataset.
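The scale of that matrix is easy to check with a few lines of arithmetic. The sketch below uses only the round figures quoted above; the variable names are illustrative.

```python
n_proteins = 10_000        # columns, figure quoted in the interview
n_drugs = 10_000_000       # rows, figure quoted in the interview

total_cells = n_proteins * n_drugs
measured_cells = total_cells * 5 // 10_000   # 0.05% means 5 cells in every 10,000
missing_cells = total_cells - measured_cells
measured_per_drug = measured_cells / n_drugs

print(f"total cells:      {total_cells:,}")
print(f"measured cells:   {measured_cells:,}")
print(f"missing cells:    {missing_cells:,}")
print(f"average proteins measured per drug: {measured_per_drug:g}")
```

On these figures the matrix has 100 billion cells, of which about 50 million are filled, so a typical drug has been tested against only around five of the 10,000 proteins.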

However, we do have information about the chemical structure of every drug and every protein. That’s a complete dataset. Our goal is to marry the complete dataset of chemical knowledge to the sparse dataset of protein activity and use it to predict the activities of proteins. We can do this by taking advantage of protein-to-protein correlations and protein-to-drug-chemical-structure correlations. It’s very similar to what we were doing with materials for 3D printing, where weldability is a complete dataset and 3D printability is a sparse dataset.
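A minimal sketch of this kind of imputation is given below, under stated assumptions. It is not Intellegens's algorithm: a linear least-squares fit stands in for the neural network, and the data are synthetic. Two fully known "descriptor" columns are combined with two mostly empty "activity" columns. The function `impute_iteratively` and its `damping` parameter are hypothetical names chosen for this example.

```python
import numpy as np

def impute_iteratively(data, n_iter=20, damping=0.5):
    """Fill NaN cells by repeatedly predicting each sparse column from all other columns."""
    missing = np.isnan(data)
    filled = np.where(missing, np.nanmean(data, axis=0), data)  # naive first guess: column mean
    n_rows, n_cols = data.shape
    for _ in range(n_iter):
        updated = filled.copy()
        for j in range(n_cols):
            gaps = missing[:, j]
            if not gaps.any():
                continue  # complete column, nothing to impute
            others = np.delete(filled, j, axis=1)
            design = np.column_stack([others, np.ones(n_rows)])
            # Fit only on rows where column j was actually measured.
            coef, *_ = np.linalg.lstsq(design[~gaps], data[~gaps, j], rcond=None)
            prediction = design @ coef
            # Damped update: blend the previous guess with the new prediction.
            updated[gaps, j] = damping * filled[gaps, j] + (1 - damping) * prediction[gaps]
        filled = updated
    return filled

# Synthetic stand-in data (not real drug or protein measurements).
rng = np.random.default_rng(0)
n_drugs = 200
descriptors = rng.normal(size=(n_drugs, 2))          # complete "chemical structure" columns
noise = 0.1 * rng.normal(size=(n_drugs, 2))
activity_a = descriptors[:, 0] + 0.5 * descriptors[:, 1] + noise[:, 0]
activity_b = 2.0 * descriptors[:, 0] - descriptors[:, 1] + noise[:, 1]
truth = np.column_stack([descriptors, activity_a, activity_b])

data = truth.copy()
hidden = rng.random((n_drugs, 2)) < 0.8              # hide roughly 80% of the activity cells
data[:, 2:] = np.where(hidden, np.nan, truth[:, 2:])

missing = np.isnan(data)
filled = impute_iteratively(data)
mean_filled = np.where(missing, np.nanmean(data, axis=0), data)

rmse_imputed = np.sqrt(np.mean((filled[missing] - truth[missing]) ** 2))
rmse_mean = np.sqrt(np.mean((mean_filled[missing] - truth[missing]) ** 2))
print(f"RMSE with column-mean guess: {rmse_mean:.3f}, after iterative imputation: {rmse_imputed:.3f}")
```

The design choice is to start every empty cell at a naive guess and then let the complete columns, together with whatever sparse values exist, pull that guess towards a better estimate over successive passes.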

The business has now moved to the stage of licensing machine learning as a product. For drug discovery, our Alchemite software is marketed through Optibrium, and there has already been enthusiastic take-up by Big Pharma. For materials discovery, Intellegens is licensing a full-stack solution directly to customers, with the first sales now complete.

BP: We’re also talking to people who work on infrastructure, trying to understand gaps in maintaining things like bridges or equipment. In a transport network, for example, you may or may not have data on relevant factors such as weather, geography, topology, road composition and pedestrian use at specific points in the network, so you end up with very big, sparse datasets. We’re working on patient analytics as well, trying to predict optimum treatment profiles from sparse sets of historical patient data. Again, we may or may not have the same data available for all patients, but we have a combination of data points, and trying to learn from all the data points we have seems to give us an edge in suggesting possible routes of treatment.

I would like to point out, though, that there’s a lot of hype around artificial intelligence (AI) and deep learning at the moment, and that’s a double-edged sword for us. It’s getting us a lot of interest, but we have a special – maybe even unique – academically driven toolset that solves problems in a new way, and that can sometimes get lost in the noise about AI-based voice recognition or image recognition.

### How is your technology different?

The main differentiator is our ability to train models from incomplete data. The usual methods for training an AI or a neural network require lots of good-quality training data to make good models for future predictions. In contrast, the driver for our algorithm is that we don’t have enough data for an AI to learn the correlations and build a model on its own. I think that’s our unique selling point. Everyone talks about “big data”, and you sometimes hear people complain about it – “Oh, I’ve got big data, I’ve got too much data to deal with.” But when you home in on a specific use case and look at it in a certain way, you realize that in fact, their problem is that they don’t have enough data, and they never will. At that point, we can say, well, given that you haven’t got enough data, we can use our technology to learn from the data you have, and use that information to help you make the best decisions.

### What’s the most surprising thing you’ve learned from starting Intellegens?

BP: This is the first time I’ve worked closely with academics, which has been interesting (in a good way). I’d worked in software start-ups before, so I was used to dealing with experienced software people who are familiar with the tools and processes of commercial software. Academic software sometimes needs a bit more finessing to get it into a commercially stable product, in terms of source control, release management and documentation. It might sound like quite boring stuff, but if you’re going to be selling a product and supporting it, it becomes critical.

GC: I was surprised to learn that the process of getting contracts depends so much on word of mouth. I give talks at conferences, potential customers come up to me afterward, and then one customer introduces us to the next one, like stepping stones.

I also didn’t fully understand the different reasons why people might want to engage with a business like ours. Some people really want to bring in the latest technology to give their company a competitive advantage. Others want to be associated with using a technique that’s right at the bleeding edge. And some are interested in working with entrepreneurs because they personally want to buy in to the adventure and the excitement of a smaller company.

• Gareth Conduit is a Royal Society University Research Fellow at the University of Cambridge, UK, and the chief technology officer at Intellegens (e-mail gjc29@cam.ac.uk). Ben Pellegrini is the chief executive officer at Intellegens.