Machine learning’s growing importance in researching cells

Posted: 3 May 2024 | Ian Shoemaker (Beckman Coulter Life Sciences)

Life sciences is fundamentally governed by large, complicated, and chaotic datasets with difficult-to-model interactions. Those in life sciences have relied for decades on statistical modelling, predictive algorithms, and empirically derived data to build on the insight of earlier generations of scientists and to refine techniques. This differs somewhat from physics, which more classically derives its predictions from theory and maps those to some sort of probability; life sciences has for many years leaned on imperfect approximations and existing large datasets to generate testable predictions.

This is especially true in domains such as protein structure, predictive binding kinetics, and even in larger systemic investigations like cell migration models or disease progression. It can be argued that much of the life sciences automation we know and use today grew out of the necessity for large datasets to capture the inherent variability of even model organisms.

As we move towards more generalised AI models, neural networks and natural language interfaces, we’re starting to see machine learning take the place of higher order reasoning and data analysis “sense making.” Traditional scientific inquiry has typically been about asking specific questions of a specific model system under specific conditions. We’re starting to open the door to more generalised questions that yield testable, meaningful conclusions without asking specific questions of our data.

One obvious example of this is image analysis. Machine learning can reduce an image to data patterns and descriptive mathematical paths, and even elucidate features that might not be perceptible even to the most well-trained scientist, precisely because you may not know what you’re looking for. The human capacity for analysis of something like a confocal image stack can only be so robust for a given amount of time invested. We as humans look “for” things based on contextual knowledge of the experiment and report back what we see. Inherently there’s some bias there, no matter how talented the microscopist.

Algorithms, however, can be trained to simply look “at” images as agnostic data and report back in a less biased fashion. Another good example is really any process optimisation or screening domain, whether it’s clone screening, media formulation, or drug screens. These are laborious solution spaces to search, and they often involve best-guess statistical models and factor analyses to determine the most cost-effective screen that could be run to obtain a set of conditions sufficiently optimised to move forward with. Machine learning has enabled us to create feedback loops where these processes come close to training themselves to find the “best” solution in fewer iterations, as in the sketch below. Of course, properly defining and measuring “best” in the context of a machine learning algorithm is always the trick.
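
To make that loop concrete, here is a minimal sketch using Bayesian optimisation (a common choice for this kind of screen, not a method named in this article): a Gaussian process surrogate proposes the next condition to test based on expected improvement. The `measure_growth` function and the two-factor media formulation are hypothetical stand-ins for a real assay readout.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical stand-in for a wet-lab readout: growth response of a
# two-factor media formulation with an optimum near (0.6, 0.3).
def measure_growth(x):
    return -((x[0] - 0.6) ** 2 + (x[1] - 0.3) ** 2) + 0.01 * np.random.randn()

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(5, 2))      # initial screen: 5 random formulations
y = np.array([measure_growth(x) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

for _ in range(15):                      # feedback loop: refit, propose, test
    gp.fit(X, y)
    candidates = rng.uniform(0, 1, size=(500, 2))
    mu, sigma = gp.predict(candidates, return_std=True)
    improvement = mu - y.max()           # expected-improvement acquisition
    z = improvement / np.maximum(sigma, 1e-9)
    ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]   # most promising untested condition
    X = np.vstack([X, x_next])
    y = np.append(y, measure_growth(x_next))

print("Best formulation found:", X[np.argmax(y)], "response:", y.max())
```

In practice the response would come back from the instrument rather than a toy function, but the structure of the loop, and the reduction in iterations compared with a brute-force grid, is the point.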

One of the more niche applications of generative AI such as ChatGPT and other large language models is “plain language” troubleshooting and early experimental design. Often you learn so much more from the mistakes and challenges of others when perfecting a specific technique or chasing down an investigative possibility. Large language models are exceptional at collecting vast amounts of disparate information from esoteric websites, forums, book chapters, review articles, and even open-access journals, and then cramming the sum total of that information into a plain-language summary that does a fair job of approximating human knowledge on the subject.

For example, try asking ChatGPT: “What are the most common challenges and failures when performing [insert experimental technique here]?” The accuracy of the answer may surprise you, because ChatGPT doesn’t particularly care about presenting the technique as foolproof, as a manufacturer’s literature might be incentivised to do, and it doesn’t need to be frustratingly concise, as a manuscript might. Large language models are even starting to replace the classic “seminal paper library” as a means of digesting, amalgamating, and communicating the general breadth of knowledge on a subject to bring new investigators up to speed quickly.
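
The same question can also be asked programmatically. Below is a minimal sketch assuming the `openai` Python package (v1.x) and an `OPENAI_API_KEY` in the environment; the model name and the example technique are illustrative choices of mine, not prescriptions from this article.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

technique = "CRISPR knock-in via homology-directed repair"  # hypothetical example
response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model name
    messages=[{
        "role": "user",
        "content": (
            "What are the most common challenges and failures "
            f"when performing {technique}?"
        ),
    }],
)
print(response.choices[0].message.content)
```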

I believe democratisation of any technology is generally a good thing, provided we recognise and implement the proper guardrails. We’ve already seen some cases of bad actors using AI-generated images of particularly well-endowed rats in manuscripts that are entirely nonsensical and inaccurate. Perhaps that says more about peer review than it does about AI, but it’s a genuine concern that AI will facilitate the proliferation of “bad science” and create a basal level of noise that makes truth difficult to tease out.

Doom-saying aside, the jury is still out on just how transformative generative AI will be to life sciences research: whether it will be just another enabling technology that clears bandwidth for more meaningful pursuits, or whether it will fundamentally alter how we approach problem solving by forcing us to adopt complementary modes of research that are more “AI-friendly”. It’s too early to tell, but I’m optimistic.

Investigating the origins of a particular pathology is always an arms race with complexity, and with reducing that complexity to questions you can actually answer in a lifetime. Machine learning working in concert with automation has a huge role to play here. The more you make complexity mundane, the closer you get to meaningful answers.

We can see this philosophy in practice in the recent explosion of 3D culture methods, organoids, and on-chip devices that mimic the biological context of disease with much higher fidelity than conventional 2D culture. Liquid handling automation shines in this domain because culture workflows are long and laborious, and to be relevant they often must be planned around culture states rather than convenient work-day cycles. Robots simply don’t care if it’s 2 am on a Sunday morning when they’re passaging cells.

More generally speaking, liquid handling automation is quickly earning the trust of scientists to do the “dirty work” of even highly complex workflows and free up human capital to focus on abstracted problems. This dovetails nicely with machine learning, because larger and larger datasets can be generated under more well-characterised, if not controlled, conditions to train feedback algorithms. The resulting multivariate datasets inform which organoid models are yielding actionable information under which conditions, as the sketch below illustrates. Humans can then focus on the higher-order “why” questions as opposed to factor-level concerns of “which”, “when” and “how much”.
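
As one illustration (my construction, not a workflow described here), a tree-based model trained on such a dataset can rank which factor-level settings actually drive an organoid assay readout; the factor names and random data below are hypothetical stand-ins.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical: each row is one automated organoid culture run. Columns
# are factor-level settings ("which" matrix, "when" passaged, "how much"
# growth factor); y is an assay readout. Random data stands in here.
rng = np.random.default_rng(1)
X = rng.uniform(size=(200, 3))
y = 2.0 * X[:, 0] + 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
for factor, importance in zip(["matrix", "passage_day", "growth_factor"],
                              model.feature_importances_):
    print(f"{factor}: {importance:.2f}")  # which factors matter, and how much
```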

We’re still very much in the “hype” phase of ChatGPT-like services, and in the near term, with existing open-source tools, I’d expect to see domain-specific specialisation of these large language models for use in life sciences. I think that will be the first application, even if not the most exciting one. Beyond that, I’m looking for two things: first, a trend-shift towards multi-omics and otherwise massively parallel experiments, and secondly, the closer marriage of human intuition and in-silico predictive power to inform upstream experimental design and streamline analysis.

To the first point, imagine a situation where every experiment included phenotypic, genomic, transcriptomic, and proteomic data. I think we’re trending toward a world where that approach actually makes sense, and the classic research question shifts from a focused “Does X impact Y?” to a more generalised, system-wide “What’s happening here?”, where machine learning enables that deep analysis by pointing human researchers at what’s relevant in the stream of information. When asking those open-ended questions we really want to have as much data along as many axes as possible, to give us the best chance of hitting on something crucial to prophylactic care, diagnosis, or therapeutic design.

To the second point, regarding the marriage of human intuition with in-silico prediction: we all have limited time, resources, money, and expertise bandwidth. I expect the next year of AI and machine learning innovations in life sciences to improve the deployment of those resources down avenues that may not have been immediately obvious. Giving scientists the ability to ask “what if…” type questions using predictive software, and to know the confidence of those predictions (see the sketch below), has the potential to drastically accelerate the search for drug targets, proteins of interest, biomarkers, and more. Already we see these technologies deployed to make more sophisticated high-throughput screens, but I imagine we’ll begin to see similar percolation down towards basic research, informing experimental designs before a scientist even steps up to the wet bench.
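
A minimal sketch of that “what if, and how confident are we” pattern, assuming a generic classifier; the descriptor matrix, labels, and candidate are random stand-ins for real compound or target features.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical "what if" query: score an untested candidate against a
# trained activity model and report the model's confidence.
rng = np.random.default_rng(2)
X_train = rng.normal(size=(500, 8))                        # stand-in descriptors
y_train = (X_train[:, 0] + X_train[:, 3] > 0).astype(int)  # toy activity labels

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_train, y_train)

candidate = rng.normal(size=(1, 8))        # the "what if" condition
p_active = clf.predict_proba(candidate)[0, 1]
print(f"Predicted probability of activity: {p_active:.2f}")
```

The point is less the model choice than the workflow: a prediction plus a stated confidence lets a scientist decide whether an avenue is worth bench time before committing resources.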

Multi-omics is all about relational analysis in large datasets, something humans alone are terrible at. We don’t know what we don’t know, so we miss a lot. Machine learning doesn’t have this problem; it can agnostically search for and characterise patterns or relations across any investigational axes, and even contrive combinatorial factors, as in Principal Component Analysis! Many of these techniques are entirely “unsupervised”, in the sense that they are broadly applicable with minimal guidance from a human operator. While these kinds of “big data” analyses have historically been the realm of computational biologists, many research groups simply don’t have access to that skill set, or, if they do, that person or group doesn’t have bandwidth for high-risk exploratory work. AI tools, along with machine learning, are beginning to democratise access to these kinds of analyses.
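
For instance, here is a minimal unsupervised sketch of the analysis named above: Principal Component Analysis contriving combinatorial factors from a feature matrix. The random data stands in for a real multi-omics matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical: rows are samples, columns mix transcript counts and
# protein abundances from a multi-omics experiment.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 500))   # 60 samples x 500 features

# Standardise so no single assay dominates the variance, then let PCA
# contrive combinatorial factors (principal components) unsupervised.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=10)
scores = pca.fit_transform(X_scaled)

print("Variance explained per component:", pca.explained_variance_ratio_.round(3))
# Loadings reveal which original features drive each contrived factor.
top_features = np.argsort(np.abs(pca.components_[0]))[::-1][:5]
print("Features weighing most on PC1:", top_features)
```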

Biological context is everything when it comes to understanding disease, and the ugly truth is that experimentation often necessarily strips elements of context in the pursuit of factor control. Omics approaches represent a general departure from conventional control schemes by allowing more biological context and variability to remain in place, because omics experiments are themselves large-scale characterisations of the factors one would otherwise have needed to control. Of course, that’s a somewhat reductive explanation, but it’s close enough to the general case to be useful.

Now, multi-omics takes this a step further and has the potential to characterise systems either vertically, as a directly correlative stack in the case of genomics > transcriptomics > proteomics, or as complementary technologies, as in the case of genomics and metagenomics informing the speciation, diversity, and taxonomy of gut microbiota for microbiome study and profiling.

Ultimately, multi-omics represents an opportunity to generate high-fidelity, richly informative datasets that are often internally orthogonal. These datasets can then be used to make surgically precise predictions regarding promising pathways, targets, or therapeutic modalities. So why doesn’t everyone do it? A cursory power analysis (see the sketch below) often reveals that staggeringly large numbers of biological replicates, wells, or conditions are needed to make statistically relevant inferences, and while liquid handling automation can certainly address many of those challenges, you are still left with massive amounts of data that are difficult to resolve directly. However, as machine learning techniques mature and grow alongside our ability to generate large datasets, we’re becoming more adept at untangling the riddles and arriving at answers to questions we didn’t even think to ask, and that’s the real promise of integrated multi-omics.
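
A back-of-envelope version of that power analysis, assuming the `statsmodels` package; the effect size and alpha are illustrative, with alpha tightened to mimic multiple-testing correction across tens of thousands of omics features.

```python
from statsmodels.stats.power import TTestIndPower

# Hypothetical: replicates per group for a two-sample t-test to detect a
# small effect (Cohen's d = 0.2) at 80% power, with alpha = 1e-6
# (roughly Bonferroni: 0.05 / 50,000 features).
n_per_group = TTestIndPower().solve_power(effect_size=0.2, alpha=1e-6, power=0.8)
print(f"Biological replicates needed per group: {n_per_group:.0f}")
```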

Ian Shoemaker has nearly 15 years of translational lab automation and instrumentation experience in personalised medicine and clinical molecular diagnostics. At Beckman Coulter Life Sciences, he supports the applications development team in NGS, cell-based assays, and proteomics workflows.
