A Small Startup Fights Rare Diseases With Big Data

by Steven Cherry
from IEEE Spectrum

Hi, this is Steven Cherry for IEEE Spectrum's podcast, Fixing the Future.

Rare diseases are, well, rare, in two not unrelated ways. By definition, they're diseases that afflict fewer than 200,000 people in the United States. But in the world of big business, and big pharma in particular, that isn't enough to bother with, that is, it isn't profitable enough to bother with, so rare diseases are rarely worked on, to say nothing of cured.

For example, hypertryptophanemia is a rare condition that likely occurs due to abnormalities in the body's ability to process the amino acid, tryptophan. How rare? I don't know. A Google search didn't yield an answer to that question. In fact, it's rare enough that Google didn't autocomplete the word even with 15 of its 19 letters typed in.

Paradoxically, big data has the potential to change that. Because 200,000 is, after all, a lot of data points. But it presents problems of its own. There isn't one giant pool of 200,000 data points. So the first challenge is to aggregate all the potential data that's out there. And the big challenge there is that a lot of the data isn't contained in beautifully homogeneous, joinable, relational databases; it's buried deep in documents like PubMed articles and patent filings.

Deep learning can help researchers pull that data out of those documents. At least, that's the strategy of a startup called Vyasa. Here to explain it is Vyasa's CEO and founder, Christopher Bouton.

Chris, welcome to the podcast.

Chris Bouton: Thank you so much. It's really great to be here.

Steven Cherry: Chris, if I understand this correctly, using Vyasa, data scientists or other researchers can construct a sort of traditional rows and columns database in a still-somewhat-manual process that's greatly sped up with your software. Is that right?

Chris Bouton: That's correct. One of the big challenges that we have in science today is that so much of the information that we're generating nowadays was originally designed for humans to read one by one. And yet now we're generating tens of thousands or hundreds of thousands of these types of data elements every day, all day long. And of course, I'm referring to things like scientific papers, PDF documents. All those things were originally designed for humans to read them. But now it's basically impossible to read all the scientific literature that's being published all the time. And so we need better tools to do that. And that's where deep learning comes in. Deep learning is really good at analyzing that kind of information and pulling information out of it that you can then use in something like a more structured form, like a database.
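[To make the idea concrete: Vyasa's own software isn't shown in this interview, but the general pattern Bouton describes, pulling entities out of free text and collecting them into rows and columns, can be sketched with off-the-shelf tools. The following minimal Python example uses a public, general-purpose named-entity-recognition model from the Hugging Face transformers library; the model and the sample sentence are illustrative assumptions, not Vyasa's pipeline.]

```python
# Minimal sketch: turn unstructured text into table-like rows via NER.
# "dslim/bert-base-NER" is a general-purpose public model used as a stand-in;
# a real biomedical pipeline would use domain-specific models and entity types.
from transformers import pipeline
import pandas as pd

ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

passage = ("Researchers at Johns Hopkins and Pfizer have studied how the body "
           "processes the amino acid tryptophan.")

rows = [{"entity": e["word"], "type": e["entity_group"],
         "score": round(float(e["score"]), 3)}
        for e in ner(passage)]

print(pd.DataFrame(rows))   # a small rows-and-columns view of what was free text
```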

Steven Cherry: So how would that work specifically for hypertryptophanemia, which I used in my intro because it's an example on your website?

Chris Bouton: Well, if you think about it, you have this, let's call it a haystack of data, and then you want to find that needle, which in this case has to do with that particular rare disease. The way that we do this is we train a deep learning algorithm on language itself, and we can train these deep learning algorithms on many different languages, so we can find information about this particular disease in French, German, English, Chinese, and Japanese all at the same time. As those algorithms learn how to literally read the language in all of the documents that we're giving them access to, they're also able to identify these specific terms. As they do that, we're then able to ask these algorithms natural language questions like "What are the effects of this particular disease? What's the prevalence of this particular disease? What are the good treatments for this type of disease?" And the algorithm is able to go find those answers without us telling it how to find them. So the combination of being able to find the information in the first place and then find answers to questions about it turns out to be a really powerful way to conduct science and extract insights from this kind of information, which was previously very difficult for machines to analyze.
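[What Bouton calls asking the algorithms natural language questions corresponds to what the NLP literature calls extractive question answering. The sketch below is a generic illustration of that technique, not Vyasa's Layar API; the public model checkpoint and the passage are assumptions for the example.]

```python
# Minimal sketch of extractive question answering over a passage of text.
# The model is a public SQuAD-trained checkpoint, used purely as an example.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

context = ("Rare diseases are defined as conditions that affect fewer than "
           "200,000 people. Data about any one of them is scattered across "
           "journal articles, patent filings, and registries.")

for question in ["How many people can a rare disease affect?",
                 "Where is data about rare diseases found?"]:
    result = qa(question=question, context=context)
    print(f"{question} -> {result['answer']} (score {result['score']:.2f})")
```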

Steven Cherry: There's a broader aspect to this than just medicine. In fact, Vyasa was developed mainly to pull out what you call dark data. You say that data scientists and researchers are spending too much time finding the data they need. What is dark data?

Chris Bouton: Dark data and siloed data are two ways of referring to the fact that most organizations know that when they're attempting to make business decisions or research decisions, they're making them on a very small percentage of all the information they should have access to. This is a combination of all the external content that's being published and put out there every single day, all day long, as well as all of the internal data that each organization has access to. The combination of those two forms of data, and the fact that organizations aren't using all of it effectively, is basically the definition of dark data. In addition, we know that the vast majority of that dark data is unstructured.

Steven Cherry: You worked at Pfizer before your first startup, and while there you developed something called the Pfizerpedia. That looks like an early attempt at finding dark data at the corporate level.

Chris Bouton: Yeah, that's going back a ways now. Pfizerpedia was a really fun project to work on. You know, Pfizer, like many other organizations, does have this dark data challenge. I downloaded a MediaWiki instance, which is the same type of software that runs Wikipedia, installed it on a Linux computer under my desk, and turned it on, and within a year we went from about zero to 20,000 users of the system. It was still on a computer under my desk, and I'd accidentally kick the power strip every once in a while and the whole system would die out, which made nobody happy.

But yeah, Pfizerpedia was a really great early example of how organizations are really excited to make better use of the data and information within their organizations. It was a collaborative project. It was a project that allowed people to share information at scale in a secure fashion within the organization. And all of those were really valuable learnings for me about how organizations want to do a better job of using their data internally.

Steven Cherry: One of your company's slogans, at least on Twitter and you alluded to this before, is "Build the haystack, find the needle."

Chris Bouton: So that tagline came from the fact that when we started the company, we went out and we were telling people primarily about the deep learning algorithms themselves, the things that can help find the needles. Time and time again, what we heard was, "that's great, but we still can't find our data." In other words, they were referring to the dark data problem.

And so what we realized was that along the way, we had also built a completely novel type of architecture for integrating data and that's referred to as a data fabric. Layar, our solution for data fabrics, was being built the entire time that we were telling people about deep learning because we needed a better foundation for running the algorithms. What we realized was that Layar, and that data fabric architecture, was just as important as the algorithms themselves. And so that's why, in that tagline, we're saying, you know, build the haystack, i.e., use the data fabric to bring all your data together, and then you can find the needle, i.e., use the deep learning algorithms to drive the pulling of these types of insights from that haystack.

Steven Cherry: There's a phase in legal cases, especially lawsuits, that involves an incredibly tedious process of finding and extracting information from sometimes enormous masses of data that, by law, the other side has to provide. It involves such things as looking for one incriminating statement in three years of all the company's emails. This "discovery," as it's called, is yet another of your use cases. Is this hypothetical, or are there clients already doing this?

Chris Bouton: No, this is absolutely a real use case, with clients already doing this type of work. This is yet another great example of where, as you noted, there are just far too many documents to read in a reasonable timeframe nowadays. In fact, in many of those cases, hundreds of people are brought into rooms and given a lot of coffee to read all those documents. And that kind of activity is happening all over the place. Often hundreds of thousands of PDF documents are sent to teams of people just to read them and extract information, across many different verticals and many different types of activities.

Deep learning algorithms give us a tool to do a much better job of extracting insights from those large document datasets. For example, we have a client who was running these kinds of manual extraction exercises and having them take months. That same exercise for them now takes milliseconds. That's not only a huge efficiency gain; it has also allowed them to completely rethink their business model, how they're doing their business, and what they're doing with their data. So yes, these are very much real-world use cases. And at Vyasa, we're excited about the applications of these technologies in the life sciences and health care space, but also in other verticals like legal, fintech, and manufacturing.

Steven Cherry: A new book, Work Without the Workers, notes that microwork-the kind of work that started with Amazon's Mechanical Turk, but now that's not even the largest microwork aggregator-microwork often involves tedious work cleaning data, labeling images and videos, for example. We'll have a show with the author of that book in a few weeks, but in the meantime, Chris, is it fair to say that Vyasa also would automate some of that microwork?

Chris Bouton: There are cases where building things like training sets for deep learning algorithms does involve microwork, and that's a valuable place where microwork applies to deep learning use cases. I think, though, that at the same time, there are places where people assume you need far more data than you actually need in order to run deep learning algorithms. Language models are a great example of that. Because these language models have literally all of language to train against, they have plenty of data, and that does two things. One, it means that a system like Layar is ready, out of the box and within just a couple of hours, to perform the kinds of tasks that I've described. And two, it means that the deep learning models running in Layar can perform the types of microtasks that you're speaking about.

I think it's also important to note that these are just tools. They're new tools. They're really cool tools in the toolkit, but they're still tools that are used by humans. So, for example, we've built applications on top of Layar that allow human curators to go in, make sure that what the algorithms are finding is correct, and update what the model is finding. And then the model actively learns from that type of curation. So there really is a very interesting, novel set of technologies at play here that allow humans to increase the value of their work and do higher-level, more strategic work, while using these new tools to do a lot more of the mundane work that was previously only possible with humans.
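[One way to picture the curation loop Bouton describes, purely as a toy sketch rather than Vyasa's implementation, is a classifier that gets re-fit whenever a human corrects one of its predictions. Everything in the example below, the labels, the sample texts, and the scikit-learn model choice, is an illustrative assumption.]

```python
# Toy human-in-the-loop loop: curator corrections are folded back into the
# training data and the model is re-fit. Illustrative only; not Vyasa's system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts  = ["tryptophan metabolism disorder", "quarterly revenue grew",
          "rare genetic condition", "patent filing deadline"]
labels = ["disease", "business", "disease", "business"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

def curate(text: str, corrected_label: str) -> None:
    """A human overrides a wrong prediction; the correction joins the training set."""
    texts.append(text)
    labels.append(corrected_label)
    model.fit(texts, labels)          # the model "learns" from the curation

sample = "amino acid processing abnormality"
if model.predict([sample])[0] != "disease":   # curator reviews the prediction
    curate(sample, "disease")
print(model.predict([sample])[0])
```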

Steven Cherry: We're speaking with data scientist Christopher Bouton. When we come back, we'll talk about some data analytics tools and discoveries he made-milestones on a journey that started for him as a teenager.

Fixing the Future is supported by COMSOL, the makers of COMSOL Multiphysics simulation software. Companies like the Manufacturing Technology Centre are revolutionizing the designs of additive manufactured parts by first building simulation apps from COMSOL models, allowing them to share their analyses with different teams and explore new manufacturing opportunities with their own customers. Learn more about simulation apps and find this and other case studies at comsol.com/blog/apps.

We're back with my guest Christopher Bouton, founder and CEO of Vyasa Analytics, a provider of A.I. data tools and applications.

Chris, I mentioned you had an earlier start-up after your stint at Pfizer. Tell us a bit about Entagen.

Chris Bouton: Yeah, Entagen was a company that I founded in 2008. We ran that company for five years, and it was ultimately acquired by Thomson Reuters in 2013. Entagen was a first pass at attempting to build data integration infrastructures for organizations. So really, in a lot of ways, it was the same sorts of ideas that I had been working on throughout my career; I had also worked on data integration in graduate school at Johns Hopkins, where I built a system called DRAGON that integrated data for something called microarray data analysis. So I've been thinking about this for quite a long time. I'm not sure why, but it's interesting to me. And Entagen was also involved in the development of technologies for data integration, also primarily for the life sciences and health care space.

At Entagen, what we were doing was using a specific kind of data format called RDF in order to do that data integration. The upside of that approach is that there are a number of standards and ways of structuring that type of integration capability. The downside is that it's a bit brittle with respect to all of the richness of information that we have in things like documents today, so you have a difficult time converting all of that rich information in the documents themselves into something that's usable as RDF. And so with Vyasa, what we tried to do was rethink how we could do the integration without having to use a data format like RDF in the middle.
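[For readers unfamiliar with RDF, the point is that every fact must be flattened into subject-predicate-object triples before it can be integrated and queried. The rdflib sketch below illustrates that, with URIs and properties invented for the example; they are not Entagen's or Vyasa's schema.]

```python
# Minimal RDF sketch: facts become subject-predicate-object triples,
# queryable with SPARQL. All names here are made up for illustration.
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

g.add((EX.hypertryptophanemia, RDF.type, EX.RareDisease))
g.add((EX.hypertryptophanemia, EX.involvesMolecule, EX.tryptophan))
g.add((EX.hypertryptophanemia, EX.prevalence, Literal("unknown")))

# Ask the graph which subjects are rare diseases.
for (subject,) in g.query(
        "SELECT ?s WHERE { ?s a <http://example.org/RareDisease> }"):
    print(subject)
```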

Steven Cherry: I'm glad you mentioned DRAGON, which is one of those contrived acronyms, for Database Referencing of Array Genes ONline. And I understand that people are still using DRAGON. Your Ph.D. from Johns Hopkins involved using data to study the mechanisms, at the neural level, of lead poisoning.

Chris Bouton: So lead is known to mimic calcium in the body. There's sort of an interesting backstory there that has to do with the fact that, you know, as mammalian systems were evolving, lead didn't exist in the environment, right? So mammalian systems-proteins, for example-didn't need to evolve the ability to differentiate between calcium and lead because lead wasn't in the environment. Then all of a sudden humans start digging lead out of the ground and we have a problem, right?

In particular, the proteins in our body, many of them have what are called calcium-binding domains, and those binding domains know how to bind to calcium and as a result, do important things in the body like, for example, control synaptic vesicle release in the brain, which is really how our brains operate. Lead can get into the brain, mimic calcium in these calcium-binding domains and cause aberrant protein activity as a result. And so I was doing that type of research both at the level of the proteins themselves, but then we were also studying the expression of the genes associated with calcium-binding proteins. And that's where DRAGON became useful.

Steven Cherry: Chris, your work has always had a data angle, but always tilted in the biomedical direction. It looks like it started all the way back in high school with your Westinghouse Science Talent Search submission.

Chris Bouton: The Westinghouse changed my life. It was a wonderful opportunity to conduct biomedical research and then go through that whole process with that award. Prior to the Westinghouse, actually, my first love in life was sharks. I've always loved sharks, have always been fascinated by them, and have recently become more involved again in shark conservation and marine ecosystem conservation. That's also an area that's near and dear to my heart. So you're right. Science has always been a love of my life and certainly a thread in my career.

Steven Cherry: The name Vyasa comes from Hindu mythology, specifically, the Mahabharata, which is an enormous epic poem 20 or 30 times longer than the Iliad or the Odyssey, and only a bit younger than them-from the third or fourth century B.C. What's the connection between Hindu mythology and contemporary data analytics?

Chris Bouton: Oh yeah, this is a great question. I lived in India for four years as a boy, so I actually grew up reading the Mahabharata as a comic book. When I was trying to come up with a name for the company, I thought, "Wow, 'Oracle' is a cool name. Ah, that one's taken." So I was looking around for the idea of gurus and knowledge compilers and came across the name Vyasa and just loved it, because of my connection with India and because Vyasa was a guru who compiled knowledge, brought knowledge together. I loved the idea of that activity of knowledge compilation being part of what Vyasa was going to do with deep learning algorithms. So there's a personal reference there, but then also a reference to the activity of knowledge compilation.

Steven Cherry: Well, Chris, to me automation seems more closely related to the Hindu god Shiva. The name Shiva means "the auspicious one," but he is commonly thought of as the destroyer. It's incumbent on those using deep learning to develop tools that are auspicious, and you seem to have been doing that for your entire career. Thank you for all these innovations; may they always be auspicious. And thank you for joining us today.

Chris Bouton: Thank you so much. Thank you to you and thank you so much for putting the podcast together, and it's been wonderful to participate.

Steven Cherry: You're quite welcome.

We've been speaking with Christopher Bouton, founder and head of Vyasa, a maker of deep learning tools to relieve the burden and tedium of data acquisition.

Fixing the Future is sponsored by COMSOL, makers of mathematical modeling software and a longtime supporter of IEEE Spectrum as a way to connect and communicate with engineers.

IEEE Spectrum is the member magazine of the Institute of Electrical and Electronics Engineers, a professional organization dedicated to advancing technology for the benefit of humanity.

This interview was recorded October 12, 2021, on Adobe Audition via Zoom, and edited in Audacity. Our theme music is by Chad Crouch. I'd like to thank Nick Brown for suggesting the topic.

You can subscribe to Fixing the Future wherever you get your podcasts, or listen on the Spectrum website, where you'll also find transcripts of all our episodes. We welcome your feedback on the web or in social media, and your ratings on your favorite podcast app.

For Fixing the Future, I'm Steven Cherry.
