Open data, the future of scientific publication, and the nature of code

Here’s Part I of our interview with Joanne Kamens, Executive Director at Addgene. We discuss open data, scientific publication, and the relationship between code and biology.

Taking Zika virus as an object lesson, there’s a tension between urgently publishing data and completing peer review. What’s the right balance?

Biology will ultimately have to adopt either a more rapid publishing model, or some form of pre-print data deposit. No one should be deprived of Zika data right now. In fact I’m sure Zika scientists are sharing their data as speedily as they can.

I realize that pre-print servers are contentious for the paid journals. But having to submit a paper eight times to get it accepted is an untenable pain point. If you’re a postdoc and your paper takes three years to get accepted, how can you move on? It’s messing up people’s lives. It’s messing up science.

That said, I’m sympathetic to the journals. They’ve done us a service. Peer review teaches scientists a lot by requiring a certain level of rigor. The journals are an integral part of our science infrastructure. We hear high profile stories about published papers that are inaccurate, but inaccuracies are pretty rare considering the sheer number of publications.

One risk in releasing data prior to journal publication is idea theft. Is there an upside to early sharing that’s more compelling than the risk of being copied?

The risk of being scooped is definitely an issue. It’s why Addgene embargoes plasmids prior to publication. We’ve asked scientists why they don’t want their plasmids to go live before the paper. It’s because they don’t want people to know what they’re working on. In this respect biology is different from other sciences.

In physics and math, you think with your head and devise an idea that becomes a paper. You can write that idea out and get credit for it. But in biology you have to demonstrate something first. You could perhaps write an editorial about an idea, but until you have some data no one is going to believe it on faith. So there is naturally going to be a time lag. I think potentially one way to handle that is to reduce the complexity of individual publications. Since my day in publishing basic research, the expectations for peer-reviewed research have elevated: the number of figures you need, the depth of information, the completeness of the paper, nailing down every possible variant experiment control—it’s weighty. And the papers get sent back until you’ve completed all these extra experiments. That seems like an inefficient process.

We need smaller papers and different types of journals. Those are starting to appear. For instance, journals that publish validation of data. You did it, then someone did it again. That’s a good thing in science. Why do we resist publishing results just because someone else already published them? That’s a problem. We want reproduced results to be published.

I favor the open, online review model. The paper goes online right away and then reviewers are solicited to review it publicly, with their names. People are more civil when they have to identify themselves. They’re more professional. They take more time and care with their wording than they do in anonymous reviews.

I think that open review prevents the typical slowdown of publication. We don’t need paper journals for that. We can do that online.

If the open review process is attributed, can we stop worrying about being scooped?

I don’t think anything solves the pressure to publish and get funding now. There’s a whole infrastructure problem in science, at least in this country, that’s making it difficult for people to succeed, difficult for people to commit to a career in science, difficult to commit to doing a PhD. I’m not one of those people who thinks we have too many PhDs. I think we need more PhDs. We just need them doing a wider variety of things. We need more people thinking about science politics, science communication, and science policy in government and in business. I think we need PhDs everywhere. I don’t want to quench the training system. But the way that it’s structured now, the barriers are becoming far too high for people to commit to a PhD. That’s a loss to progress.

The data that makes it into a published paper seems neat and manicured. Hiding behind it are heaps of data intermediates that nobody ever gets to see, “data dark matter,” so to speak.

There’s dark matter in plasmids too. If you took 10 mini-preps and you used one of them, the other 9 often lie in your freezer for years. Then one day you ask, “What are those things?” If you made variants and you captured 3 of the 20 variants, the other variants may be in your freezer, unstudied. There’s a lot of dark matter in molecular biology, at the bench and in the data. I don’t think that every experiment should go live. A lot of what we do is learning and developing protocols.

I have no problem with a lab publishing their best paper. I do think that it’s important to do things more than once and with sufficiently large N. A Cell paper where you did an experiment on 3 mice—that’s probably not enough. It’s expensive. I understand why you only did 3 mice, but it’s probably not enough.

Ultimately, I think there has to be data dark matter. It’s part of the discovery process. I don’t have any problem with that.

Addgene compiles detailed data on plasmid transactions. Is there undiscovered gold in that transaction data?

Oh yeah. We know there’s undiscovered gold in there. There are quite a few people out there playing with our data now. We finally reached the number of data points where people come to us and ask, “Can I look at your data?” We have ways of sanitizing the data so that it’s useful but doesn’t compromise privacy.

There are about three groups working with our data right now, one of which we have a closer relationship with because it’s local. I’m really excited and hope they’re going to write a paper about it. They’re looking at things like distribution patterns of plasmids and the influence of depositing on citation and uptake. Another group is looking at the influence of open science on patents. There are actually a host of dimensions that people are looking at.

Inside Addgene we look at trends around the world, for example which countries are showing an increase or decrease in requests. We’re scientists so we like to try an experiment in a country and then see if it has an effect. For example we’ll present at a conference in China, generate some Chinese language support pages on our website and then watch to see if there are more requests from China. We troubleshoot plasmids by geographical region, asking “why are scientists in this region having trouble? Oh, we can’t deliver there. We need a distributor.” Then we interact with a distributor and measure these results. We do these types of tests with countries around the world.

Right now we’re looking to see what we can do in Latin America and Africa. These developing countries have science infrastructure, but not a lot of funds. If we can help them accelerate their science, we will. Typically, growing countries have issues with shipping, language, international payment. We look for custom-tailored solutions in each country. That’s how we leverage our own plasmid data. We ask, “How can we help? How can we make plasmid delivery smoother? Where are the hold ups?”

There’s a balkanization of data portals and consortia in genomics. It seems like they’re all doing the same data infrastructure work over and over again.

As a centralized resource, we have a bit of that frustration too. Addgene’s efficiency comes from economy of scale. Being a central distributor, creating industry-wide standards—that’s how we do what we do. It’s a shame when there’s a little repository with a few useful plasmids that insists on distributing them themselves. That said, we support all repositories. We’re all in favor. But it seems a bit wasteful. A few big repositories would probably be sufficient.

We’re like Quilt. We see the value in bringing the data together in one place. What we’ve found is that asking scientists to give us their data is not trivial. You have to make flexible portals where people can easily deposit data. Many of our users can deposit plasmid data. In the past we focused on trying to figure out how to get scientists to give us more of the important data. Then we finally said, “Oh, we’ll just do it.” We hire people who specialize in data entry. They’re faster than most scientists because they know which data is important. In short, making data deposits smooth is a large portion of what we do.

One of the things we discovered is that the only time researchers really know where their data is, is at the moment of publication.

That’s why we strongly encourage depositing before publication and during the preparation of the manuscript. We hold back all the information until the depositor tells us the plasmid and data should be released to the public (such as on publication). Especially for very hot papers, where the authors want to avoid a million phone calls the next day. Furthermore, if you pre-deposit with Addgene you don’t have to describe your process in the paper. People can just find the plasmid with a catalog number, then its map and all of its information. That simplifies writing the paper. The CRISPR community has been excellent about depositing before publication. It’s why their field has moved so quickly.

What is the role of software engineering in biology? In the past some bench scientists have been reluctant to code.

When I talk to people about science careers I say, “Learn to code in grad school. Learn a little Python. Make that part of your curriculum.” We see it here. Many Addgenie scientists like to do some coding. Addgene is a Python/Django shop. We have something called Jupyter (IPython notebooks) that allows anyone to query our vast and interesting database. Biologists frequently want to dive deeper into the data. Now they can. Because biological data is no longer just at the bench. It’s in databases.

That’s where the interesting stuff is happening here at Addgene. Many of our biologists have learned to code a little bit. They create tools for themselves to perform analyses with code. We see code and biology as a marriage.

On the other side of the aisle, we have five software engineers at Addgene. They all take an interest in biology, though they’re not formally trained in it. They understand a lot about plasmids, even though they’re not biologists.

There’s nothing more valuable for a career right now than to learn to code. I think MIT has a joint CS and Biology program for undergraduates. People are starting to say, “Wow you can do both of these things together.” It’s critical.

So it’s productive to have biology and coding skills in a single person?

I couldn’t agree more. It’s hugely valuable. We realized this with next gen sequencing. It turns out that the next gen algorithms are not suited for short segments of DNA. For short segments, the raw data from the sequencer isn’t enough. The algorithm to deconvolute the data is what matters. There’s only one place we send our plasmids to be sequenced right now. Because their algorithm is advanced. They’ve figured out how to use next gen sequencing to deconvolute 100 plasmids from the same mix. You have to understand quite a bit about plasmids, how they’re structured. They often have confusingly similar backbone features. There had to be a lot of understanding of biology and code at MGH for someone to create the algorithm, tweak it, and get it to produce desirable outcomes. This kind of multi-disciplinary talent is going to be useful in many areas of science.

If all of Addgene’s plasmids were sequenced, would it make for a different kind of repository?

To some extent, and we are working on providing much more sequence data with a variety of approaches. A lot of the plasmids may not have a full sequence but they do have NCBI Entrez gene insert information. The data are there. Addgene didn’t focus on which data would be useful in 10 years. Instead we focused on what was ultra-cheap and ultra-high-quality. We asked, “How can we make this really inexpensive for scientists and never raise the price?” And we have never raised our prices.

We focus on providing access and community curation. If a plasmid is bad we find out about it fast. Somebody calls us back and tells us. Then we take it offline. We can fix it. I’m not sure how that will go when we have more sequence data.

