Serverless Chats
Episode #94: Serverless for Scientific Research with Denis Bauer
About Denis Bauer
Dr. Denis Bauer is an internationally recognized expert in artificial intelligence, who is passionate about improving health by understanding the secrets in our genome using cloud-computing technology. She is CSIRO’s Principal Research Scientist in transformational bioinformatics and adjunct associate professor at Macquarie University. She keynotes international IT, LifeScience, and Medical conferences and is an AWS Data Hero, determined to bridge the gap between academe and industry. To date, she has attracted more than $31M to further health research and digital applications. Her achievements include developing open-source bioinformatics software to detect new disease genes and developing computational tools to track, monitor, and diagnose emerging diseases, such as COVID-19.
- Twitter: https://twitter.com/allPowerde
- LinkedIn: https://www.linkedin.com/in/denisbauer/
- Webpage: https://bioinformatics.csiro.au/
This episode is sponsored by New Relic.
Transcript:
Jeremy: Hi everyone. I'm Jeremy Daly, and this is Serverless Chats. Today, I'm chatting with Denis Bauer. Hey, Denis, thanks for joining me.
Denis: Thanks for having me. Great to be on your show.
Jeremy: So you are a Group Lead at CSIRO and an Honorary Associate Professor at Macquarie University in Sydney, Australia. So I would love it if you could explain and tell the listeners a little bit about your background and what CSIRO does.
Denis: Yeah. CSIRO is Australia's government research agency, and Macquarie University is one of Australia's Ivy League universities. They've been working together on really translating research into products that people can use in their everyday lives. Specifically, they worked together to invent WiFi, which is now used in 5 billion devices worldwide. CSIRO has also collaborated with other universities and, for example, has developed the first treatment for influenza. And on a lighter note, it has developed a recipe book, the Total Wellbeing Diet book, which is now on the bestseller list alongside Harry Potter and The Da Vinci Code. From that perspective, CSIRO really has this nice balance between products that people need and products that people enjoy.
Jeremy: Right. And what's your background?
Denis: So my background is in bioinformatics, which means that in my undergraduate I studied together with the students that did IT courses, math, stats, as well as medicine and molecular biology, and then in the last year of the degree all of this was brought together in sort of a specialized way of really focusing on what bioinformatics is. Which is using computers, back in the day it was high-performance compute, in order to analyze massive amounts of life science data. Today, this is of course cloud computing, for me at least.
Jeremy: Right. Well, that's pretty amazing. Today's episode ... I've seen you talk a number of times, all remotely, unfortunately. I hope one day that I'll be able to see you speak in-person when we can start traveling again. I've seen you speaking a lot about the scientific research that's being done and the work that CSIRO is doing and, more specifically, how you're doing it with serverless and how serverless is sort of enabling you to do some of these things in a way that probably was only possible for really large institutions in the past. I want to focus this episode really on this idea of serverless for scientific research. We're going to talk about COVID later, we can talk about a couple of other things, but really it's a much broader thing. I had a conversation with Lynn Langit before, where we were talking about Big Data and the role that plays in genomics and some of these other things and how the cloud accelerates people's ability to do that. Maybe before we get into the serverless part of this, we could just take a step back and you could give me a little bit more context on the type of research that you and your organization have been doing.
Denis: Yeah. So my group is the Transformational Bioinformatics Team. So again, it's translating research into something that affects the real world. In our case that usually is medical practice, because we want to research human health and improve disease treatment and management going forward, and for that, data is really critical. It's sort of the one thing that separates a hunch from something that you can actually point to and say, "Okay, this is evidence moving forward," and from there you can incrementally improve and you know that you're going in the right direction rather than just exploring the space.
Jeremy: Right. And you mentioned data again. Data is one of those things — and I know this is something you mentioned in your talks — where the importance of data, or the amount of data and what you can do with it, is becoming almost as important, if not just as important, as the actual clinicians on the frontline treating disease. So can you expand upon that a little bit? What role does data play? And maybe you could give us an example of where data helped make better decisions.
Denis: Yeah. So a very recent example is of course with COVID, where no one knew anything really at the beginning. I mean, coronaviruses were studied, but not to that extent. So the information that we had at the beginning of the pandemic was very basic. From that perspective, when you know nothing about a disease, the first thing you need to do is collect information. Back then, we did not have that information and actions were needed. So some of the decisions that had to be made back then were based on those hunches and those previous assumptions that were made about other diseases. So for example, in the UK they defined their strategy based on how influenza behaved and how it spread, and we now know that how influenza spreads and how coronavirus spreads are vastly different. So therefore, in the course of the action, more research was done and based on that they adjusted — probably the whole world adjusted — how they managed or interfered with the disease. We now know that whatever we did at the beginning was not as good as what we're doing now, so therefore data is absolutely critical.
Jeremy: Right. And the problem with medical data, I would assume, is one, that it's massive, right? There's just so much of it out there. When we're going to start talking about genomics and gene sequencing and things like that, I can imagine there's a lot of data in every sample there. And so, you've got this massive amount of data that you need to deal with. I do want to get into that a little bit. Maybe we can start getting into this idea of sort of genome editing and things like that and where serverless fits in there.
Denis: Yeah, absolutely. So my group researches two different areas. One is genome analysis, where we try to understand disease genes and predict risk, for example, of developing heart disease or diabetes in the future. But the other element is around doing something — treating actual patients with newer technology — and this is where genome editing or genomic surgery comes in, where the aim is to cure diseases that were previously thought to be incurable genetic diseases. The aim of genome engineering is to go into a living cell and make a change in the genome, at a specific location, at a specific time, without any risk of accidentally editing other genes. And this is a massively complicated task on a molecular level, but also on a guidance level, on a computational level, which is where serverless comes in.
Jeremy: Right. Now, this is that CRISPR thing, right?
Denis: Exactly. So CRISPR is the genome engineering or genome editing machinery. It's basically a nano-machinery that goes into your cells, finds the right location in the genome, and makes the edit at that spot.
Jeremy: Right. So then how do you find the spot that you're supposed to edit?
Denis: Mm-hmm. So CRISPR is programmable, so as IT people we can easily relate to that, in that it basically is a string search. It goes through the genome, which is 3 billion letters, and it finds a specific string that you program it with. Therefore, this particular string needs to provide the landing pad for this machinery to actually interact with the DNA, because you can't interact at just any location.
Jeremy: Right.
Denis: From that perspective, it's like finding the right grain of sand on a beach. It has to be the right shape, the right size, and the right color for this machinery to actually be able to interact with the genome, which of course is very complicated. But it doesn't stop there, because we want it to edit only the specific gene and not accidentally edit another, correct gene. Therefore, this particular landing pad, the string, needs to be unique enough within the 3 billion letters of the genome so that it doesn't accidentally draw the machinery to the wrong place. This particular string needs to be compared to all the other potential binding sites in the genome to make sure that it's unique enough to faithfully attract this machinery. This particular string is actually very short, therefore, when you think of the combinatorics, it's a hugely complicated problem that requires a lot of computational methods in order to get us there.
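To make that uniqueness check concrete for IT readers, here is a minimal Python sketch of the kind of scan being described: counting how many places in a genome string a candidate "landing pad" sequence matches within a few mismatches. This is an illustration only, not CSIRO's actual pipeline; the sequences and mismatch threshold are made up.

```python
# Minimal illustration (not CSIRO's actual pipeline) of checking how unique
# a candidate CRISPR "landing pad" string is within a genome sequence.
def count_near_matches(genome: str, guide: str, max_mismatches: int = 2) -> int:
    """Count genome positions that match the guide with at most `max_mismatches`."""
    k = len(guide)
    hits = 0
    for i in range(len(genome) - k + 1):
        window = genome[i:i + k]
        mismatches = sum(1 for a, b in zip(window, guide) if a != b)
        if mismatches <= max_mismatches:
            hits += 1
    return hits

if __name__ == "__main__":
    genome = "ACGT" * 1000               # stand-in for the 3-billion-letter genome
    guide = "ACGTACGTACGTACGTACGT"       # a 20-letter candidate target site
    # A guide is only usable if it is (close to) unique across the whole genome.
    print(count_near_matches(genome, guide))
```

Run at real scale, a naive scan like this is exactly what becomes prohibitively expensive, which is the motivation for the parallelized architecture discussed next.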
Jeremy: Yeah, I can imagine. So before CRISPR can go in and even identify that spot, I'm assuming there's more research that goes into understanding even where that spot is, right? Like how you would even find that spot within the sequence genome.
Denis: Yeah, of course. The first thing you need to find out is what kind of gene you actually want to edit, where the problem is, and this is the first part of my group's research: finding the disease genes. And even within the gene, because it has a complicated structure, you need to find the location that is actually most beneficial for the machinery to interact with, and this is where we developed the search engine for the genome. It's a webpage where researchers can type in the gene that they want to edit, and the computation then goes in and finds the right binding site — the right spot, shape, color, and size — but also makes sure that it's unique enough compared to all the other sites.
Jeremy: Right. And so, this search engine, how exactly does it work? Explain this. Like, what's the architecture of it?
Denis: Yeah. So in order to build the search engine for the genome, we wanted to have something that is always online, so that researchers can go in at any time of the day and kick off this massive compute. In order to do that in the cloud, you would have the option of having massive EC2 instances running 24/7, which of course would have broken the bank. Or, we could have used an autoscaling group where it would eventually scale out to the massive amount of compute needed to serve that task. Researchers tend to not have a lot of patience when it comes to online tools and online analysis, so it needed to be something that could be done within seconds. Therefore, an autoscaling group wasn't an option either, so the only thing that we could do was use serverless. This search engine for the genome is built on a serverless architecture and back then — we built it like four years ago — it was one of the first real-world architectures that did something more complicated than serve an Alexa skill.
Jeremy: Right. You obviously can't fit 4 billion letters into a single Lambda function, so how do you actually use something like Lambda, which is stateless, to basically load all that data to be able to search it?
Denis: Yeah, exactly. That was the first problem that we actually ran into, and back then we weren't really aware of this problem. Back then the resource limits were even lower. It wasn't only the memory issue, but also the timeout issue. We figured, "Okay, well, how about rather than processing this one task in one go, we break it up into smaller chunks and parallelize it."
Jeremy: Right.
Denis: And this is exactly what we've done with the serverless architecture, in that we used an SNS topic to send the payload of which region in the genome a specific Lambda function should analyze. And then from there, the result of that Lambda function was put into a DynamoDB table — sort of an asynchronous way of collecting all the information — and after all of this was done, the summary was sent back to the user.
Jeremy: So like a fan-in fan-out pattern?
Denis: That's exactly right.
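As a rough picture of that fan-out/fan-in pattern, here is a hedged Python/boto3 sketch. The topic ARN, table name, payload shape, and analysis function are placeholders invented for illustration; this is the general shape of the pattern, not the actual GT-Scan implementation.

```python
# Hypothetical fan-out/fan-in sketch: a coordinator publishes one SNS message
# per genome region; each Lambda worker analyzes its region and writes the
# result to DynamoDB, where a collector later summarizes all rows for the job.
import json
import boto3

sns = boto3.client("sns")
dynamodb = boto3.resource("dynamodb")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:genome-regions"  # placeholder
TABLE = dynamodb.Table("genome-results")                         # placeholder

def fan_out(job_id: str, num_regions: int) -> None:
    """Publish one message per genome region (fan-out)."""
    for region in range(num_regions):
        sns.publish(
            TopicArn=TOPIC_ARN,
            Message=json.dumps({"job_id": job_id, "region": region}),
        )

def worker_handler(event, context):
    """Lambda subscribed to the topic: analyze one region, store the result."""
    for record in event["Records"]:
        msg = json.loads(record["Sns"]["Message"])
        score = analyze_region(msg["region"])   # domain-specific analysis
        TABLE.put_item(Item={"job_id": msg["job_id"],
                             "region": msg["region"],
                             "score": score})

def analyze_region(region: int) -> int:
    return region  # stand-in for the real per-region computation
```

A separate collector (or the last worker to finish) would then query DynamoDB by `job_id` and assemble the summary sent back to the user.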
Jeremy: Right. Cool. So then, where were you storing the genome data, was that in like S3?
Denis: Exactly. This particular one is in S3. We did experiment with other options, like having a database or having Athena work with it, but the problem was that the interaction wasn't quite as seamless as S3. Because in bioinformatics we have a lot of tricks around the indexing of large flat files, any other solution that tried to shortcut this wasn't as efficient as these purpose-built indexing approaches. So, therefore, having the files just sit on S3 and querying from there was the most efficient way of doing things.
Jeremy: Right. And so, are you just searching through like one sequence, or are there like thousands of sequences that you're searching through as part of this? And then how were they stored? Were you storing like 4 billion letters in one flat file, or are they multiple files, how does that work?
Denis: Yeah, so it is — 3 billion letters in one flat file.
Jeremy: Did I say 4 billion? Sorry, 3 billion.
Denis: 3 billion letters in one flat file, with indexing so that we don't have to start from the beginning but can jump straight to where that letter is. It depends on the application case as well. If you're searching one reference genome, which is basically what it's called when you search a specific genome for a specific species, for example human, it typically is one genome. But if you search bacterial data or viral data, there can be multiple organisms in one file, so it really depends on the application case.
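To illustrate the "indexed flat file on S3" idea, here is a minimal sketch that uses an S3 byte-range GET to jump straight to an offset in a large genome file instead of downloading all 3 billion letters. The bucket, key, and offset index are placeholders, and the byte-range approach is my assumption of one way such an index could be used, not CSIRO's exact layout.

```python
# Hypothetical sketch: jump into a large flat genome file on S3 using a
# byte-range request, guided by a precomputed index of chromosome offsets.
import boto3

s3 = boto3.client("s3")

BUCKET = "my-genome-bucket"          # placeholder
KEY = "reference/genome.txt"         # placeholder: one long string of A/C/G/T
CHROM_OFFSETS = {"chr1": 0, "chr2": 248_956_422}   # illustrative index only

def fetch_window(chrom: str, start: int, length: int) -> str:
    """Read `length` letters starting at `start` within a chromosome."""
    begin = CHROM_OFFSETS[chrom] + start
    end = begin + length - 1
    resp = s3.get_object(Bucket=BUCKET, Key=KEY, Range=f"bytes={begin}-{end}")
    return resp["Body"].read().decode("ascii")

# e.g. fetch_window("chr1", 1_000_000, 23) pulls just 23 bytes of the file,
# which is why each Lambda worker can stay within its memory limits.
```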
Jeremy: Awesome. Yeah. I'm just curious how that actually works, because I can see this being a solution for other big data problems as well, like being able to search through a massive amount of text in parallel by breaking it up, so that's pretty cool. Basically, what you're doing is using Lambda here sort of as that parallelized supercomputer, right? Sort of as high-performance compute. From a cost standpoint, you mentioned having this running all the time would be sort of insane. How do you see the cost differ? I mean, is this something that is dramatically different, where anybody can use this, or is it something where it's still somewhat cost-prohibitive?
Denis: Anyone can use it for sure. Not for this application, but for another application, we've made a side-by-side comparison of running it the standard way in the cloud with EC2 instances and databases and things like that. The task that we looked at was around $3,000 a month, and this was for hosting human data for rare disease research, whereas using serverless, we can bring that down to $15 a month ...
Jeremy: Wow.
Denis: ... which is like less than a cup of coffee to advance human research. So to me that's absolutely a no-brainer to go into this area.
Jeremy: I would say. What other tricks have you been using to speed up some of this processing? Like in terms of loading the data and things like that, was there anything where you could use serverless to power that?
Denis: Well, we looked at Parquet indexing as one of the solutions, and that worked for the super-massive files in the human space really well. But again, it comes down to indexing S3, and there was nothing really special around the serverless access. In saying that, one of the big benefits of serverless, again, is being able to parallelize it, which means the data doesn't have to be in one account. It can be spread over multiple accounts, and you just point the Lambda functions to the multiple accounts and then collect back the results. And this is something that we've done, for example, for the COVID research, where we did the parallelization in a different way. In genome research there's always the problem of having to deal with large data. Serverless is our first choice — we always go serverless first — and therefore we came up against this problem of running out of resources in the Lambda function very frequently.
Jeremy: Right.
Denis: Therefore, we came up with this whole range of different parallelization patterns. They cover anything from completely asynchronous approaches — like with GT-Scan, where you deposit the data back in DynamoDB and don't actually have to collect it back yourself — to completely synchronous approaches, where you basically have to monitor everything that you do and make sure that everything is running in sync in order to collect the data back together.
Jeremy: Right. Let's get into the COVID response here because I know there was quite a bit of work that your organization did around that. Before we get into the pattern differences, what exactly was the involvement of CSIRO in the Australian government's response to coronavirus?
Denis: Yeah, so we were fortunate in working together with CEPI, which is the international consortium sponsored by the Gates Foundation that, way back when, was preparing for disease X, pandemic X, to come — and curiously, only a year later COVID hit. So all of this pre-work, set up for a hypothetical future disease, was actually needed only a year later. So, therefore, CSIRO and CEPI had already put everything in place to have a rapid response should a pandemic hit: being able to test the vaccines in development, so their efficacy in animal models — that was the part that CSIRO was tasked to do. But in order to do that, we had to account for the fact that pathogens — RNA viruses in this particular case — mutate, which means the genome changes slightly with every replication cycle. We've also heard about the England strain or the South African strain being slightly different.
So with every mutation, there is a risk that the vaccine might not be working anymore, might not be effective anymore. Therefore, the first thing we needed to find out was: where is this whole global pandemic heading? Is it mutating away in a certain direction? And is that direction something that we should focus the future disease research on, rather than focusing on the current strains that are available? So, therefore, we did the first study around this particular question of how the virus is mutating and whether the future of vaccine development is actually jeopardized by that. The good news was that the coronavirus is mutating relatively slowly, and therefore the changes that we observed back then — and likely nowadays as well — are probably not going to affect vaccine efficacy dramatically.
Jeremy: Right. You had mentioned in another talk that you gave something about being able to look at those different variants and trying to identify the points that were close to one another, so you could determine how far apart each individual variant was, or something like that. Again, I know nothing about this stuff, but I thought that was kind of fascinating — where you could look at the different strains and figure out if different markers had something to do with whether or not it was more dangerous or easier to spread, things like that. I found that to be really interesting.
Denis: Yeah. There are different properties, again, with those mutations. We don't know what actually could come out of this, because again, coronaviruses are not studied to the extent that we can really be confident in saying a change here would definitely cause this kind of effect.
Jeremy: Right.
Denis: Therefore, coming back to a purely data-driven approach, and that's what we've done. So we converted each virus, with the 20,000 letters in its sequence, into a k-mer profile. K-mers being little strings, little words, and we collected how often each specific word appeared in that sequence. So basically, serializing it, or [inaudible] coding, if you want to. And with that kind of information, we ran a principal component analysis in order to put it on a 2D map. And then from there, the distance between a dot, which represents a particular virus strain, and the next dot represents the evolutionary distance between those two entities. And from there, we can then overlay the time component to see if it's moving away from its origin. And we do know that this is happening, because with every mutation it gets passed on to the next generation of viruses and mutates again and so on, so it does slightly drift away from the first instance that we recorded.
And this is what we've done with machine learning: identify and create this 2D map for researchers to really have an understanding and a way of monitoring how fast it's actually moving and whether that pace is accelerating or not. Currently, there are 500,000 instances of the virus collected from around the world. So 500,000 times 20,000 — the length of the genome — that is 10 billion data points that we need to analyze in order to really monitor where this whole pandemic is going.
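For readers who want to see the shape of that analysis, here is a toy sketch (my own illustration, not CSIRO's code) that converts sequences into k-mer count profiles and projects them onto a 2D map with PCA, so that nearby points correspond to similar strains.

```python
# Illustrative only: turn each viral genome into a k-mer count profile and
# project the profiles to 2D with PCA, so nearby points are similar strains.
from itertools import product
import numpy as np
from sklearn.decomposition import PCA

def kmer_profile(seq: str, k: int = 3) -> np.ndarray:
    """Vector of counts for every possible k-mer (4**k entries)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in index:
            counts[index[word]] += 1
    return counts

sequences = ["ACGTACGTGG", "ACGTACGTGA", "TTTTGGGGCC"]   # toy stand-ins
profiles = np.vstack([kmer_profile(s) for s in sequences])
coords = PCA(n_components=2).fit_transform(profiles)
print(coords)   # each row is one strain's position on the 2D map
```

With real data, overlaying the collection date on those coordinates gives the drift-over-time picture Denis describes.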
Jeremy: Right. And so are you using a similar infrastructure to do that, or is that different?
Denis: We are. Although in this particular case we had to actually give up on serverless, in that the actual compute that we're doing is not done on serverless. We're using an EC2 instance, but the EC2 instance is triggered by serverless, and the rest of this whole thing is handled and managed by serverless. Eventually, we're planning on making it fully serverless, but it requires some re-implementation of the traditional approaches, which we just didn't have time for at the moment.
Jeremy: Right. Is that because of the machine learning aspect?
Denis: It's not necessarily the machine learning aspect, it's more the traditional methods of generating these distances, if you want. There's another element to it, which is around creating phylogenetic trees, which is basically a similar way of recording the genetic distances between two strains. So you can think of this like the tree of life, where you have humans and the apes, and so on. A phylogenetic tree is basically that, except for only the coronavirus space. And in order to create that we needed to use traditional approaches, which use massive amounts of memory, and there was no way for us to parallelize it in one of those clever ways to bring it down into the memory constraints of a Lambda function yet.
Jeremy: But you say yet, so you think that it is possible though that you could definitely build this in a serverless way?
Denis: Yeah, absolutely. I mean, it's just a matter of parallelizing it with one of our clever parallelization methods that we've developed now. For another COVID approach, for example, which we implemented from scratch, we're using serverless parallelization in a different way. Here we're using recursion in order to break down these tasks in a more dynamic way, which would basically be required in the tracking approach as well. With this one, the approach is around being able to trace the origin of infection. So imagine someone comes into a pathology lab and it is not quite clear where they got the infection from; therefore the social tracing is happening — interviews about where they've been, who they got in contact with, and so on. But molecular tracing can also happen, where you can look at the specific mutation profile that that individual has and compare it to all the 500,000 virus strains that are known from around the world, and the ones closest to it are probably close to the origin of where someone got it from.
And therefore, being able to quickly compare this profile with the 10 billion data points that are out there was the task, and for doing that, serverless is what we developed. It's called the Path Beacon approach, because Beacon is a protocol in the human health space that we adopted, and it's completely serverless. What it does is break down the task of comparing against all those 10 billion elements out there. It breaks it down into dynamic chunks, because we don't necessarily know how many mutations are in each part of the genome — sometimes there might be two or three mutations and sometimes there might be thousands of them.
Jeremy: Right.
Denis: Therefore, we first parallelize it in larger chunks and then, if necessary — if a Lambda function would be running out of time — we can split off new Lambda functions that handle sub-tasks, and so on. So we can proceed down the recursion, spinning up more and more Lambda functions that all individually deposit their data. So it's another asynchronous approach, because we don't have to go back up the recursion tree in order to resolve the whole chain; each Lambda function itself has the capability of recording, handling, and shutting down its part of the analysis.
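A hedged sketch of that recursive splitting pattern might look like the following. The function name, payload shape, and time threshold are assumptions for illustration, not the Path Beacon code itself.

```python
# Hypothetical recursive fan-out: each Lambda processes its chunk of positions;
# if time runs short, it splits the remaining range and hands each half to a
# fresh asynchronous invocation of itself, then exits.
import json
import boto3

lambda_client = boto3.client("lambda")
FUNCTION_NAME = "variant-chunk-worker"   # placeholder name

def handler(event, context):
    start, end = event["start"], event["end"]
    for pos in range(start, end):
        if context.get_remaining_time_in_millis() < 10_000:
            # Not enough time left: split what's left and recurse.
            mid = (pos + end) // 2
            for lo, hi in ((pos, mid), (mid, end)):
                lambda_client.invoke(
                    FunctionName=FUNCTION_NAME,
                    InvocationType="Event",            # asynchronous invocation
                    Payload=json.dumps({"start": lo, "end": hi}),
                )
            return
        process_position(pos)   # deposit this position's result independently

def process_position(pos: int) -> None:
    pass   # stand-in for comparing/recording mutations at this position
```

Because every worker deposits its own results, nothing has to walk back up the recursion tree to assemble an answer, which matches the asynchronous flavor described above.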
Jeremy: Let's say that I'm an independent lab somewhere, I'm a lab in the United States or whatever, and I run the test and then I get that sequence. Is this something I can just put into this service and then that service will run that calculation for me and come back and say, "This strain is most popular or occurs most likely in XYZ?"
Denis: That's exactly right. That's exactly the idea. And this is so valuable because the pathology labs might have their own data from their local environment, like from their local country, which they aren't necessarily in a position to share with the world yet. And therefore, being able to merge these two things — the international data with the local data — because serverless allows you to have different data sources in different accounts, I think is going to be crucial going forward. Especially around vaccination status and things like that, where we do want to know if the virus managed to escape the vaccine, should that happen. All of this is really crucial information for monitoring the progression going forward.
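One way to picture "querying the data where it lives" across accounts is the standard AWS cross-account role pattern sketched below. The role ARN and bucket are placeholders, and this illustrates the general mechanism rather than the Path Beacon implementation.

```python
# Hypothetical sketch of "bringing the compute to the data": a query Lambda in
# one account assumes a read-only role in a data holder's account and reads
# from their bucket in place, so the genomic data itself never has to move.
import boto3

def query_partner_account(role_arn: str, bucket: str, key: str) -> bytes:
    """Assume a cross-account role, then read one object from the partner bucket."""
    sts = boto3.client("sts")
    creds = sts.assume_role(RoleArn=role_arn,
                            RoleSessionName="beacon-query")["Credentials"]
    s3 = boto3.client(
        "s3",
        aws_access_key_id=creds["AccessKeyId"],
        aws_secret_access_key=creds["SecretAccessKey"],
        aws_session_token=creds["SessionToken"],
    )
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()

# Results from several labs' accounts can then be merged by the caller,
# while each lab keeps its raw data inside its own security boundary.
```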
Jeremy: Right. Now, you get some of the data from, was it GISAID, or something like that? And I remember you mentioning something along the lines of, you were trying to look at different characteristics, like maybe different symptoms that people were having, or things like that, but the reporting was wildly inaccurate, or it was very variable. It varied greatly. I think one of the examples you gave was the loss of smell, for example — it was described multiple ways in free text, so that's the kind of thing. So what were you doing with that, what was the purpose of trying to collect that data?
Denis: Yeah. GISAID is the largest database for genomic COVID virus data around the world. It originally came from influenza data collection and then very quickly moved towards COVID and provided this fantastic resource for the world, and for the pathology labs of the world to deposit their data. In that effort, in order to collect the crucial data — the genomic data for tracing and tracking — they made that available. But they didn't necessarily implement the medical data collection part in a way that enables the analysis that we would want to do. Partly because of the technical aspects, but mainly because it requires a lot more ethical, data responsibility, and security consideration in order to get access to that kind of data. All they had was a free-text field with every sample so that, if the pathology lab had that information, they could quickly annotate how the patient was doing.
This clearly was a crude proxy for what we actually would have needed: the exact definition of the diseases, ideally annotated in an interoperable way using standard terminologies, and this is basically what we've developed. So we're using FHIR, which is the most accepted terminology approach around the world, which allows you to catalog certain responses. Instead of saying anosmia, which is the loss of the sense of smell, it has a specific code attached to it. This code is universal, and it's relatively straightforward to just type in the free text and have the tool that we've developed automatically convert that into the right code, and this should be the information that is recorded. Similarly, in the future, what kind of vaccines a person has received and so on. And then from there we can run the analysis of saying: of the 20,000 letters in the SARS-CoV-2 genome, so the COVID virus genome, is any one of those mutations associated with how virulent or how infectious a certain strain is, or whether it has a different disease progression, or whether it's resistant to a certain vaccine?
All of this is really critical, but because there are 20,000 letters, these associations can be very spurious. In order to get to a statistically significant level, we do need to have a lot of data, and currently this data is just not available. Like, we went through and we looked for the annotations where we had good quality data on how the patient was going. I think we ended up with 500 instances, out of the 200,000 that were submitted back then, that were annotated well enough to do this association analysis of saying that a mutation is associated with an outcome. And while we found some associations, specifically in the spike protein, that would affect how virulent this particular strain is or what kind of disease it could cause, it definitely was not statistically significant. So we definitely need to repeat that once we have more data and better-annotated data.
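As a toy illustration of the free-text-to-code idea Denis describes, a mapping step might look like the sketch below. The synonym table is invented and the codes shown are examples only, not an authoritative terminology binding or the actual tool.

```python
# Illustrative only: map free-text symptom descriptions onto a standard code,
# in the style of coding that FHIR terminologies enable. The synonym table and
# the codes shown are examples, not a complete or authoritative mapping.
SYMPTOM_CODES = {
    "loss of smell": ("44169009", "Anosmia"),
    "anosmia": ("44169009", "Anosmia"),
    "no sense of smell": ("44169009", "Anosmia"),
    "fever": ("386661006", "Fever"),
}

def code_free_text(note: str):
    """Return (code, label) pairs found in a free-text annotation."""
    found = set()
    lowered = note.lower()
    for phrase, coded in SYMPTOM_CODES.items():
        if phrase in lowered:
            found.add(coded)
    return sorted(found)

print(code_free_text("Patient reports fever and loss of smell since day 2"))
# [('386661006', 'Fever'), ('44169009', 'Anosmia')]
```

Once symptoms are coded consistently like this, association tests between mutations and outcomes become possible across labs, which is exactly what the sparse free-text annotations prevented.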
Jeremy: Yeah. But that's pretty amazing if you could say someone's loss of smell for example, is associated with particular variants of the disease or that certain ones are more deadly or more contagious or whatever. And then if you were able to track that around the world, you'd be able to make decisions about whether or not you might need a lockdown because there was a very contagious strain or something like that. Or, maybe target where vaccines go in certain areas based off of, I guess, the deadliness of that strain or whatever it was. That's pretty cool stuff.
Denis: Yeah, exactly. So rather than shutting down completely, based on any strain, it could be more targeted in the future and probably will be more targeted in the future.
Jeremy: All right. Now, is this something where everything you've built, all of this information you've learned, that when the next pandemic comes because that's another thing I hear quite a bit. It's like, the next pandemic is probably right around the corner, which is not comforting news, but unfortunately probably true. Is this the kind of thing though where with all this stuff you're putting into place that the next round of data is just going to be so much better and we're going to be so much better prepared?
Denis: Absolutely. That is definitely the aim. I mean, you do have to learn from the past, and having this instance happen firmly moves it from the theoretical space, where everyone was talking about it before, to "Oh, yes. It is actually happening." There was a paper published in Nature last month — sorry, last year. It was around how much money we have lost through this particular pandemic. I mean, the lives lost, obviously, are invaluable. But looking at the pure economics of it: how much money have we lost and how much will this damage carry on into the future? Therefore, they did a cost-benefit analysis of saying, "How much are we willing to invest in order to prevent anything like this from happening in the future?"
Jeremy: Right.
Denis: The figures that they came up with — and this was way back, when we didn't really even know what the complete effect was, and we still don't know — even back then the figures were astronomical. So I think there's going to be a huge shift in order to see the value of being prepared, the value of the data, the value of collecting all this information, the value of making science-based decisions. I think it's going to ...
Jeremy: It will be nice. A change of pace at least here in the United States.
Denis: ... I'm very optimistic that going forward we're much more prepared than we ever were in the past.
Jeremy: That's awesome. All right. So you are part of this Transformational Bioinformatics Group, and so you have the capability to work on some of these serverless things and build some other products or solutions to help you do this research. But I can imagine there are a lot of small labs or small research groups that don't necessarily have the money to pay for all this compute power, but also maybe don't have the expertise to build these really cool things that you've built that are obviously incredibly helpful. What have you done in terms of making sure that the technical side of the work you've done is accessible to other researchers?
Denis: Yeah, absolutely. My group, the Transformational Bioinformatics Group, is very privileged in that we do have a lot of support from CSIRO to build the latest tools with the latest compute. As you said, other researchers around the world are not as privileged, and therefore we want to make the tools that we develop as broadly applicable and as broadly accessible as possible, so that other people can build on those achievements that we had. If COVID has taught us anything, it's that working together to really move in the right direction is not only rewarding, but also necessary in order to keep up with the threats that are all around us. So with that, digital marketplaces, from my perspective, are the way to do this. Typically, with digital marketplaces you think of an EC2 instance that is spun up with a Windows machine or something like that, where you subscribe to a specific service that is set up for a fixed consumption.
But from my perspective, because a marketplace allows you to spin up a specific environment with a specific workflow in there, which you have access to because it's in your account, you can build upon it. Therefore, this is the perfect reproducible research and collaborative research approach, where someone like us can put in the initial offering and other people can build on top of that. This is what we've done with VariantSpark, which is our genome analysis technology for finding associations between disease genes and certain diseases. This is a hugely complicated workflow, because you first have to normalize stuff, you have to quality control things, you then have to actually run VariantSpark and then visualize the outcomes.
So typically, being able to describe all of that for other people to set up in their account from scratch, without us helping them, is complicated. And this is basically the bane of the existence of biomedical research, in that the workflows are so complicated that reproducing them is typically impossible. Whereas now, we can just make a Terraform or CloudFormation or ARM template or whatnot, put it into the marketplace for other people to subscribe to, to spin it up in the way that we intended and optimized, and then from there they have this perfectly reproducible base to build upon. Unfortunately, there's a catch with this whole thing ... VariantSpark is an Elastic MapReduce offering, and the marketplaces are currently only looking at EC2 instances as sort of their basis — the virtual machines as their basis.
Jeremy: Right.
Denis: What we definitely need is a serverless marketplace.
Jeremy: Right. I totally agree with that. So you mentioned something about the data. Could your organization run this in your AWS account, for example, and then have other people just send their data to you?
Denis: That certainly would be an option. The problem typically with medical data is that there's a security and a privacy concern around it. Genomic data is the most identifiable data you can think of — you only have one genome, and it encodes basically your future disease risks and everything that's ... Basically, it is a blueprint of your body. From that perspective, keeping that data as secure as possible is the aim of the game. Never mind that it's so large you don't want to be shifting it around anyway — but I think the security element is what really sells me on the idea of bringing the compute to the data: bringing the compute and the infrastructure to the securely protected data source of the researchers or the research organization that holds the data and is responsible for it. It also allows dynamic consent, for example, where people who consent for their data to be used for research can revoke that, so it's a dynamic process. Being able to have the data in one place and handle the data in one place directly allows this to be executed faithfully, robustly, and swiftly, which I think is absolutely crucial in order to build the trust so that people can donate or lend their data to genomic research.
Jeremy: Yeah. That is certainly something from a privacy standpoint where you think about ... You're right. Everything about you is encoded in your DNA, right? So there's a lot of information there. But now, I'm curious: if somebody else was running this in their environment, after they do the processing on this — and again, I'm just completely ignorant as to what happens on the other end of this thing — but the data that they get out of the other end, is that something that can be shared and can be used for collaboration?
Denis: Yeah. Typically, the process is you run the analysis, you get the result out, you publish that, and it sort of ends there. I think in order for genomic data to be truly used in clinical practice and to inform anything from disease risk to what kind of treatments someone should receive or what kind of adverse drug reactions they are at risk of, it really needs to be a bit more integrated. So, therefore, the results that come out of it should somehow feed back into a self-learning environment. That's one avenue. The other avenue is that the results that are coming out really need to be validated and processed. Therefore, typically there are wet labs that verify that this theoretical analysis is correct in order to move forward.
Jeremy: Interesting. Yeah. I'm just thinking, I know I've seen these companies that supposedly analyze your DNA and try to come up with, like, are you more susceptible to carbohydrates, those sorts of things. Now, while that may be a lofty endeavor for some, I'm thinking more like people who are allergic to things, or environmental exposures that may trigger certain things. Tying all that information together and knowing that — I mean, I'm assuming that has to be encoded in your DNA somewhere, like, I guess, your allergies; I keep using that example. So how does that information get shared? Is that just something that is way out of scope, because you've got people testing just their own group of samples and doing specific analysis on it, but then not sharing that back to a larger pool where everybody can look at it?
Denis: Definitely, that is the aim going forward. The Global Alliance for Genomics and Health is putting things in place in order to enable this data sharing on a global scale. The serverless Beacon that we've developed is moving along that line as well, to make it more efficient for individual research labs to light their own beacon in order to share their results with the rest of the world — like the $15 per month to share data with the world. I don't think we're quite there yet in terms of the trust, in terms of the processes, to make this actually a reality within the next, I don't know, five years.
Ultimately, it definitely is the aim, and ultimately this is the need. An element to that is also that the human genome is incredibly complex, and therefore there is no real one-to-one relationship between a mutation and an outcome. We do know that, for example, for cystic fibrosis, it's one mutation that causes this deadly, devastating disease, but typically it's a whole range of different exacerbation factors and resilience factors that work together, and it's very personal what kind of risk that generates. In order to quantify this risk, we need to have massive amounts of data, massive amounts of examples of which kind of combination is causing what kind of outcome.
Jeremy: Right.
Denis: In order to do that, putting all the data in the same place is probably never going to happen. Therefore, sharing the models that were created on individual sub-parts and refining the models on a global level — sharing machine learning models — I think is probably going to be the future. And this is a really interesting and exciting space, and a new space as well, where it's sort of a combination of secret sharing and distributed machine learning in order to build models that truly capture the complexity of the human genome.
Jeremy: Yeah. Well, it's certainly amazing and fascinating stuff, and I am glad we have people like you working on it, because it is really exciting in terms of where we're going — I mean, not only tracking and tracing diseases and creating vaccines, but getting to the point where we can start curing other diseases that are plaguing us as well. I think that's just amazing. I think it's really cool that serverless is playing a part.
Denis: Absolutely. So my goal is really to bring the world together and see the value of scientific research and bring that scientific research into industry practices.
Jeremy: Awesome. All right. Well, Denis, thank you so much for sharing all this knowledge with me. I don't think I understood half of what you said, but again, like I said, I'm glad we have people like you working on this stuff. If people want to reach out to you or find out more about CSIRO and some of the other research and things that you're doing, or they want to use some of your tools, how do they do that?
Denis: Yeah. The easiest is to go to our web page, which is bioinformatics.csiro.au, or find me on Twitter, which is allPowerde, and start the conversation from there.
Jeremy: All right. That sounds great. I will get all that information in the show notes. Thanks again, Denis.
Denis: Fantastic to be here.