A field guide to cultivating computational biology
Anne Carpenter, Casey Greene, Piero Carninci, Benilton Carvalho, Michiel de Hoon, Stacey Finley, Kim-Anh Le Cao, Jerry Lee, Luigi Marchionni, Suzanne Sindi, Fabian Theis, Gregory Way, Jean Yang, Elana Fertig
Read the preprint
Computational biology is now a mature discipline at the heart of biomedical sciences. The field is poised to make tremendous advances, empowered by the wealth of new large scale data streams and analysis methodologies. Likewise, new high-throughput technologies make computationalists critical members of biological research teams.
Computational biology is a transdisciplinary field, spanning expertise in mathematics, computer science, informatics, statistics, data science, and engineering. Investigators also employ numerous research models and are most commonly engaged in team science. Despite the major advances and healthy culture of computational biology, academic incentive structures often cause interdisciplinary investigators to languish in career development. In Carpenter et al., we outline solutions to the structural barriers to attracting and cultivating computational biologists, summarized below.
1. Respect between collaborators:
Computational biologists aren’t code monkeys and biologists aren’t pipette robots. All parties must recognize the unique expertise of different disciplines, noting that different data modalities and analysis tasks require disparate computational expertise. Computational faculty may not be able to take on a particular project if the data modality is not in their wheelhouse or if it requires a routine analysis that may not lead to effective authorship for career advancement. Institutions can subsidize robust core facilities, providing protected time that enables computational biologists to focus on innovative methods or analyses that will promote their research portfolios. Likewise, collaborators should equally value the productivity requirements of different disciplines. For example, computational and biological disciplines weigh the value of journal and conference publications differently, often resulting in different priorities and timelines for publication.
2. Seek input throughout a project:
The entire investigative team should be involved throughout the life cycle of a project. Good study designs, built on careful collaboration between biologists and computational biologists, can minimize the opportunities for garbage in, garbage out. Likewise, biologists’ insights can guide analysis approaches and interpretation. This can be promoted by including computational biologists on traditional biology study sections and biologists on computational study sections, with interdisciplinary models for peer review of manuscripts.
3. Preserve budgets for computationalists:
Although invisible, personnel and compute pose real costs for computational labs. Often, computational biologists commit significant up-front work to a grant for study design and to generate preliminary data, but are then cut out of awarded grant budgets. This non-collegial behavior threatens the financial health of computational biology labs. Funding agencies can address this by promoting even distribution of budget cuts between investigators as default policy, requiring a special request by the contact principal investigator to override.
4. Changing authorship conventions in publications:
The team science nature of computational biology results in significant contributions being represented as middle authorship (or, at best, shared lead authorship) on publications. Traditional academic review criteria deeply undervalue these collaborative research efforts. Journals can address this through interfaces that randomly swap the order of co-first and co-senior authorships to normalize collaborative leadership and allow for additional leadership roles (e.g., lead computational biologist, lead statistician, etc.) to be credited on publications.
5. Value software as an academic product:
Open-source software implementing new methodologies that unlock emerging measurement technologies is one of the critical contributions of computational biologists. Tracking software usage (e.g., through download statistics, active user groups, etc.) may represent a stronger metric of a computational biologist’s impact than traditional metrics such as the h-index. Moreover, these software projects require significant maintenance that does not rise to the level of novelty needed for new publications or grants. Additional software infrastructure grants, as well as institutional shared resource facilities employing software developers for maintenance, are critical to ensure the longevity of these products.
6. Develop academic structures to reward team science and applied research:
Whereas traditional academic promotion and tenure models, as well as grant review panels, value the impact of the lone investigator, computational biologists often have research portfolios with significant team science contributions. Although these collaborative contributions provide critical institutional service and unlock new data modalities, they can cause investigators to languish in career advancement and in obtaining funding. We can change this as a community by placing value on annotated contributions to publications, software, and grants, rather than traditional lead authorship on publications and PI roles on grants.
Computing introduces real costs for labs. Even over and above the cost of the compute infrastructure itself, labs require experienced personnel for hardware maintenance, cloud compute access, and software installation. These needs are beyond the scope of an individual lab and should be funded as institutional infrastructure. Software developed in-house requires significant maintenance that goes unrewarded, which can be addressed through software maintenance grants as well as centralized resources for software developers at institutions.
8. Facilitate computationally-driven experimentation:
Robust input and validation data are the lifeblood of many computational biology research projects. Even in a collaborative research model, biology labs may be unable to prioritize data generation solely in support of computational theories or methodologic validation for pure dry labs. Individual investigators can consider models that allow co-mentored trainees to lead experimental data generation, or institutions can support shared resources with designated wet lab space and centralized lab managers to allow for custom data generation.
9. Provide incentives for data sharing, while addressing the ethical challenges of access and biases in clinical data:
Large scale, public domain databases and atlas projects are often the lifeblood of computational biology research. In addition to providing a basis for methodologic advancement, access to well-curated, large-scale databases is necessary to reach robust conclusions without overfitting to small training datasets. Institutions and societies can empower this through database structures that allow for open data sharing, enforced through mandatory data deposition by funders supporting the data generation studies. Biomedical data is particularly valuable. In cases of data from patients, care must be taken in these data sharing efforts to protect patient privacy and ensure representative populations to mitigate bias and disparities in medically-driven computational biology research.
10. Promote cross-disciplinary training:
As biological sciences become increasingly intertwined with computational methodologies, we anticipate a next generation of scientists who are facile in both domains. Interdisciplinary training programs with mentorship spanning disciplines are an important complement to traditional computational and biological training programs to build the next generation of scientific researchers.
*icons made by DinosoftLabs from www.flaticon.com
**cover image by Arturo Araujo