The Nutritious Rice for the World (Rice) project, a World Community Grid BOINC project, ended a few weeks ago. BOINC (Berkeley Open Infrastructure for Network Computing) is a non-commercial program and infrastructure that allows volunteers to donate their computers’ spare computing resources to very interesting, computationally intensive scientific projects. Many people around the world contributed their CPU resources to help figure out the structure of the proteins of the most common strains of rice. In the end, about 25,761 years of CPU time were contributed to the project. IBM contributed heavily to this project through its World Community Grid (WCG) program, offering Rice a massive user base and community.
Rice is one of the most common foods in various parts of the world. It’s in the interest of us all to find varieties and breeds of rice that are the most nutritious or the most resistant to pests; the project’s goal is to find out which varieties of rice should be interbred with others to give the best results, so that we’ll get new strains of rice which are harder, better, faster, stronger.
A lot of BOINC-users who contributed to the project (like myself) are now asking themselves a lot of questions. Who are the people behind the scenes? How much work is necessary to get a project like this into operation? What was IBM’s role? What will happen with the contributed results? And after all, who will benefit from the project?
I think no one can give better answers than Ram Samudrala, PhD and Principal Investigator of a computational genomics research group at the University of Washington. Rocker, scientist and Emacs admirer – he was kind enough to answer some of my questions about the project.
Tell us a little about yourself and how you got involved in the Rice-project.
Ram: I’m a professor researching computational biology at the University of Washington, Seattle. My overarching interest has been to understand and model how the genome of an organism (genotype) specifies its behaviour and characteristics (phenotype). We develop computational algorithms to this end that are applied to whole genomes and we work on many organisms. Rice was specifically chosen since our collaborators at the Beijing Genomics Institute had just finished sequencing it (and we annotated the refined version) and I also got a $1.9 million grant from the US National Science Foundation (NSF) to predict the structure and functions of all proteins encoded by the rice genome. We developed algorithms to do this and we applied them to all rice proteins. Then IBM came along and offered us the means to redo some of our calculations on the most difficult proteins using the WCG and then we ported our code over to work on the Grid.
When was the first time you considered using voluntary distributed computing for your project?
Ram: Since the days of SETI@home, and since we built our own local clusters to do structural computational biology, but porting our code to BOINC was always an inertial challenge.
Did you consider using DC infrastructures other than BOINC, like distributed.net? If so, why did you decide to use BOINC?
Ram: No, we used BOINC since it was what was supported by IBM WCG.
Have you considered asking the NCSA for computing resources?
Ram: Yep, but it’s a cumbersome process, like applying for a grant, and again, porting software to work on different architectures. The barrier is that we get grant money to do research and not develop software. I have used NIST supercomputing resources in the past.
You said you would need 200 years of computing time using your available resources. Besides voluntary distributed computing and the University of Washington, were there other universities or institutes directly contributing computing resources to your project?
Ram: Not for this project, no.
You were using algorithms from the Protinfo website. Which one did you actually use, and how much effort did you put into customizing it for use with BOINC? Can you tell us whether those algorithms and their implementation are released under a free license?
Ram: It’s the Protinfo AB algorithm, which is our ab initio or de novo simulation protocol. IBM spent a fair amount of time porting the code to work with BOINC. The original algorithms/software are all freely available without any claim of copyright (i.e., in the public domain).
Could you explain “de novo” and “ab initio” for non-scientists, please?
Ram: “De novo” and “ab initio” generally are translated to mean “from first principles”. In the old days, this used to mean using pure physics energy potentials for protein folding. These days, to us, it means any set of general principles that is not biased to a particular protein or organism.
If the algorithms you used are under a free license, have you already managed to publish the modifications, if there are any?
Ram: The modifications involving the porting are with IBM and they are unpublished.
(Ed. note: Since the software was released into the public domain, there’s no requirement to publish the modifications.)
IBM helped you out in customizing the protein-prediction algorithms for various platforms. Can you tell us how much they contributed?
Ram: All the customisation was done by IBM engineers. We just gave them the original software and ran sanity checks on the output. I’m a strong free software and anti-IP proponent, to the degree that I encourage commercial use without restrictions on the software (people can always use the public domain versions if they want to).
How much time did you save by using the World Community Grid’s infrastructure, compared with setting it all up on your own as other projects do?
Ram: IBM took about six months or so to port our software, so I presume it would’ve required that kind of an investment. Keep in mind that they had a lot of prior experience with BOINC. IBM now maintains the code and does the PR and runs the predictions for us. I’d say this would be a full time programmer/sysadm type of person and if I had that extra money, I’d rather spend it on someone doing the basic research.
If there are flaws about BOINC, which would you like to be addressed first?
Ram: I can’t think of any in the way we did it with IBM, but without IBM, the PR machine has to be powerful to get people on board. It’s more than just recruiting people, but also motivating them as IBM does with badges and giving them a sense of community and providing a support infrastructure. This is hard for a research lab to do on their own (it can be done, but is it really the best use of our talents is the question).
Programming and debugging is an iterative process. Looking at your source code repository, how many releases of the software were necessary until you got the cow flying?
Ram: For this case, internally we probably had about 10 or so iterations in total, but the basic science part of the software is something that has evolved over 18 years.
How did you do the beta testing? Did you use the publicly available beta projects at WCG, or were you actually just doing it in your lab?
Ram: It was mostly in our group. We just submitted sequences for which we knew the answers and we did a dry run initially with the same sequences.
I’m curious there – were these structures predicted by other algorithms or was that done the hard way, using X-ray crystallography?
Ram: These were done the hard way, at the bench. These are our gold standard for when we know we’re right or wrong, so we benchmark our methods against all this. When we did the rice project, we did sequences with known answers to see how well things would work and that there was no chance of anything going wrong.
What was it like getting in touch with the community? Was the feedback helpful? How many people from your team were actually dealing with the community?
Ram: At its peak, we had 3 people dealing with the community, our sysadm and project lead Michal Guerquin, our programmer and scientist Ling-Hong Hung, and myself. Opening our software to the Grid and the community definitely presented some challenges, which I believe will be the focus of our first paper. An interesting tangent of that is that we’ve had to port some of our analysis software to work on GPUs so we could handle all this data. So some good technological developments here that we’ll be writing about shortly.
A lot of people are concerned about “Frankenfood”. Your project’s website explicitly states that this is not about genetic engineering, but about finding the most nutritious rice-strains for interbreeding with other rice-crops. Is there anything you’d like to explain to people who are still concerned?
Ram: We’re simply extending what farmers have been doing for millennia in a more rational way, and also what has been going on in nature for billions of years. The problem to us is scientific and all knowledge that is produced (which from our end will be completely free and transparent) can be used in various ways according to the will of the people. But we have governments and politicians to handle the deeper societal implications. What I mean by this is that people should petition their representatives, as they are doing successfully in many parts of the world, to decide where to go with genetically modified organisms, which I see as ultimately having a socioeconomic/political solution.
Your project is one of the very few with a fixed end; almost all other projects keep handing out work units for new phases. How come you’re finished now? Has everything in the rice genome now been analyzed from a computational point of view, with nothing else left to do?
Ram: Not at all. We obtained a huge amount of data and we’re now pressed to analyse it. I can honestly say that we were overwhelmed with this data. My goal as a scientist though is not just to develop technical tools and produce large tables and graphs but try to come up with something tangible that is prioritised and can be tested at the bench that really changes the makeup of rice in a desired manner. The computations and the Grid are the means by which we arrived at this step, but our job now is to figure out where the best low-hanging fruit is in collaboration with rice researchers (which we are doing with researchers around the world including IRRI, Philippines). [Ed. note: IRRI, International Rice Research Institute]
Focussing on the data: Now that you know what those proteins really look like, where do you draw a line and say “this protein is more nutritious than others”? My basic understanding is that the nutritious parts of rice are actually carbohydrates (starch), proteins and some fat. How should I picture this analysis?
Ram: So the proteins we’re talking about are gene products that carry out almost all the functions in rice (or any other organism). So we use the word “protein” to refer to a molecule that does this, rather than the nutritional use of the word “protein”, which refers to these biological molecules broken down and aggregated (see “Protein” and “Protein (nutrient)” in Wikipedia).
By nutrition we mean anything that leads to higher range of bioavailable substances like dietary minerals and vitamins. In rice, examples include elements like iron or organics like vitamin A. Incidentally the “golden rice” GMO is a product of Monsanto that has higher beta-carotene, a precursor to vitamin A (“Golden Rice” at Wikipedia). We’d like to get to something like that by crossbreeding without the use of genetic engineering, working on both micro and macronutrients.
So in the end, we need to be able to create a rice strain that does have enriched nutrients and is perhaps better than current strains in terms of yield and/or hardiness. Before we go off and start crossing rice, there are a number of molecular biology bench experiments that can be done to say whether predictions we make about the activity of certain proteins will be correct, so we’d do them first.
Do you plan to publish all your results in an Open Access Journal?
Ram: Yep, that would be the ideal. Publishing in Open Access Journals also sometimes costs money. I’m not a big fan of the “pay to publish” model: it’s not a lot of money and some scientists have grants to do this, but it’s not a good principle.
Thank you very much for this interview!
Ram: Thanks; I enjoyed the questions!
Dr. Ram Samudrala is a tenured Professor at the University of Washington, Seattle. He’s head of the Nutritious Rice for the World project and one of the inventors of protein prediction algorithms. He’s a prolific contributor of scientific papers and generally a very nice guy I’d like to buy a drink.
This work or content is licensed under a Creative Commons license.
The rice picture is copyrighted by Flickr user kadaoor and licensed under CC-BY-SA.
The rice-paddy picture is copyrighted by Flickr user ~MVI~ and licensed under CC-BY.
The pictures of the team members were used by permission of the Rice team.
The BOINC splashscreen is copyrighted by IBM and the World Community Grid and was used with permission.