Disclaimer: This article is based on a multitude of conversations, mainly with bioinformaticians who have many years of experience working in and with academia, biotech, CROs and pharma companies. This post sums up these shared experiences.
The bioinformatics industry is relatively young. Well, at least immature, so it seems. It has been around for a while, and there was certainly a lot to complain about 8 years ago. Now, 8 years later, it keeps growing, but has it really changed?
Let’s quickly recap how I got here: Last year we started evaluating potential business cases. The one that hooked me the most was the idea of offering DNA- and RNA-sequencing analysis as SaaS. We’d not only offer variant calling and gene expression analysis but also a machine-learning-optimised approach for alternative splicing, which is much more error-prone these days. So I dived into bioinformatics by talking to potential clients, experts, investors and workhorses, and used every second of my spare time to learn about genomics and transcriptomics.
Eventually the team split into a research-oriented and a business-oriented group, and while we continued researching the market in search of a use case for a SaaS opportunity, our research-oriented counterparts started a consulting business. Their projects cover a wide range of cases, from single-celled organisms to humans, from basic research to clinical work. But as is often the case for consultancy, their work is not scalable, and their feedback clearly shows some challenges that still exist in the bioinformatics industry. In the meantime, we kept having lengthy conversations with bioinformaticians (big shout out to the Independent Data Lab) and potential clients over the last year. Here is my quick observation of what is (still) going on.
Poor Programming Paradigms
Despite a fast-growing bioinformatics ecosystem, the software quality remains surprisingly low. One key reason for this is that the tools are often developed and maintained by a single person. For example, Nextflow, which is one of the most popular workflow management systems in bioinformatics, has about 90% of its code contributed by a single developer. The STAR aligner, which is one of the most popular aligners for RNA-seq data, has likewise been developed primarily by one contributor. I am not saying that Nextflow or the STAR aligner suffer from bad code quality, but the fact that a tool is mainly maintained by a single person is a risk, and as a consequence there is often no quality assurance involved in the process.
The lack of community-driven software development causes new problems. A single developer without much experience in software engineering, and especially without teamwork experience, often neglects established code management practices. Most of the tools lack continuous integration workflows and code conventions. Different programming languages and styles (say C++ and R) get mixed up, so that maintaining the code becomes almost infeasible. These problems are even more pronounced when it comes to the development of customized pipelines by each institute, bioinformatics core facility or lab, where code management is not only neglected but sometimes avoided altogether. No version control via git, no code sharing via GitHub or GitLab, no branching system like git flow (which is debatable) or versioning system à la semver.
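To make the versioning point a bit more concrete, here is a minimal sketch of what a semver-style scheme buys a pipeline. The function names (`parse_version`, `is_compatible`) are hypothetical and made up for illustration, and the check is deliberately simplified: it only handles plain MAJOR.MINOR.PATCH strings and ignores pre-release tags as well as the special rules for 0.x versions.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> Version:
    """Parse a plain 'MAJOR.MINOR.PATCH' string (no pre-release or build tags)."""
    parts = text.strip().split(".")
    if len(parts) != 3 or not all(p.isascii() and p.isdigit() for p in parts):
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return Version(*(int(p) for p in parts))


def is_compatible(installed: str, required: str) -> bool:
    """Simplified semver check: same major version, and at least as new.

    A change in the major number signals a breaking change, so a pipeline
    written against 2.x should not silently run with a 3.x tool.
    """
    have = parse_version(installed)
    need = parse_version(required)
    return have.major == need.major and have >= need


if __name__ == "__main__":
    # Hypothetical version numbers, not those of any real tool.
    print(is_compatible("2.7.10", "2.7.3"))  # newer patch, same major
    print(is_compatible("3.0.0", "2.7.3"))   # major bump: possibly breaking
```

With a convention like this in place, a pipeline can state which tool versions it was tested against and refuse to run against anything else, instead of quietly producing different results after an update.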
Academic Mindset Of Bioinformatics
One of the primary reasons why there is usually one developer per tool is the academic domination of bioinformatics tool development. Academic labs are not paid to maintain a tool, but rather to develop and publish a new one - maintaining software will not get you grant money. As a result, a plethora of new bioinformatics tools comes out every day, and the burden of their maintenance usually falls on the shoulders of the student or post-doc who developed the tool as part of their degree or a project. Academic jobs are short-lived, with contracts for just several years, after which the researcher needs to move on, often leaving the developed software behind. As a result, the state of the art in bioinformatics changes rapidly, and so do the “standards”.
To be fair, we did observe some convergence in best practices within the field of DNA sequencing due to the wide adoption of whole genome sequencing (WGS) and whole exome sequencing (WES) in clinical settings, forcing this branch of bioinformatics into a more regulated, and hence more reliable, state.
Reinvent All The Wheels!
The path of the individual programmer leads to another inefficiency: developers are building the same bioinformatics analysis pipelines over and over and over again (while no one is looking over their shoulders). Lacking collaborative development experience, many bioinformaticians prefer to re-develop their pipelines rather than try to understand someone else's code. This is a pity. All that time and money could be spent elsewhere.
Code needs to be developed in teams, shared and distributed; algorithms have to be improved; software should be written in faster languages and optimised for memory, CPU and GPU usage. Money should be spent on services instead of overwhelming research facilities with the same tasks over and over again. Speaking of services.
Neglecting Specialised Service Providers
Bad news for Software-as-a-Service (SaaS) or Bioinformatics-as-a-Service (BaaS). There are more and more services, but they barely make money. The lack of trust in those services is astonishing. Services are avoided not only by clients but also by bioinformaticians themselves. It seems like BaaS is perceived as a threat by bioinformaticians, even though they themselves could save a lot of time using it. While in other industries new businesses are assembled within days, with a strong trend towards no-code platforms, bioinformaticians want to do it all by themselves, not recognising that most of these services offer outstanding human support, proven and flexible/customisable pipelines and affordable prices. This needs to change and it will change. And the sooner the better, as it would create more room for innovation and quality.
The well-known management-IT gap is also widely present in the field of bioinformatics. Most clients do not realise how easily and quickly results can be delivered today, so they simply accept long turnaround times, even though 90% of the time results should be available within 24 hours (max). Most of the problems have standardized solutions in place. There’s no need to wait 4 weeks for results and, in the meantime, get distracted by having to start working on the next project already.
The main reason, yet again, lies in the academic domination of the bioinformatics world. Academics often prioritize money over time: they would rather spend less and wait weeks for the results than pay more and get the results tomorrow. Furthermore, industry is often demonized by academics as evil, caring only about making money and not about state-of-the-art methods and quality, while bioinformatics core facilities offer sequencing services with “free” bioinformatics support. But of course, as we know, only the cheese in the mousetrap is free. The analyses at core facilities are often performed by young PhD students and post-docs in addition to their thesis projects, with pipelines that they developed and maintain themselves, resulting in delivery times of several weeks and poor communication (often xls tables with very little explanation or follow-up).
In contrast, most BaaS providers deliver results within 24 hours and provide interactive platforms for visualization and results browsing, historic data management, and extensive follow-up on the results. And yet, the myth of the demonic industry persists, and researchers in need of bioinformatics turn to academic collaborations rather than seeking industrial partners.
Clinical Decisions Are Made Based On Unproven Analysis Pipelines
To emphasize the need to rethink and invest in the bioinformatics ecosystem, we need to talk about why it matters. Next Generation Sequencing (NGS) is widely used not only in basic research, but is now also entering clinical settings: rare disease diagnostics, personalized medicine in oncology and immuno-oncology, and prenatal diagnostics, to name a few. We're talking about research and diagnoses that affect human life. They have an immediate impact on individuals in the short run, as well as an impact on millions in the long run. That means all the aforementioned problems are alarming, and this needs to change.
The industry has to professionalize. There has to be ownership, there have to be legal entities that are held responsible, there has to be quality assurance and a community that inherently cares. Bioinformatics is a typical case of a public service that would, to a certain extent, benefit heavily from privatization. Competition would ensure a race for the best quality, performance and user experience (especially improving interpretability). As bioinformatics gets closer and closer to actual healthcare and dealing with patients, we’d all benefit from it.
Update, 20 July, 19:07: Made clear that Nextflow and the STAR aligner are not cited as examples of bad quality, but as examples of software maintained by a single person.