2022 Annual Meeting Report: Artificial Intelligence and Machine Learning as a Decision-Making Tool: Principles and Limitations

By Olushola Awoyemi posted 04-01-2022 17:16


On Monday, March 28, 2022, at the SOT 61st Annual Meeting and ToxExpo in San Diego, California, I attended the SOT/EUROTOX Debate, which addressed the proposition “Is There a Role for Artificial Intelligence (AI) and Machine Learning (ML) in Risk Decisions?” By way of brief introduction, the debate is a yearly event at the SOT Annual Meeting that began in the 1990s, in which leaders in the field of toxicology debate significant issues of toxicological importance. This year, the principles of AI and ML as a decision-making tool, as well as their limitations, were discussed. The session addressed these questions:

  • Can recent developments in AI and ML be applied in toxicology as they are being applied in precision medicine and other fields?
  • Are current algorithms inherent to AI and ML sufficiently refined for the design of safer products?
  • Are these applications sufficiently robust to identify toxic “signatures,” providing information for safety and risk assessment?
  • Are they sufficiently reliable to predict toxicity, and can they account for genetic and other toxicodynamic variability?
  • Are they able to predict risk associated with exposures to mixtures?
  • Can they predict the potential toxicity of new compounds or relate chemical structure or activity to risk?
  • How can we tell if the results of AI and ML are accurate and defensible?

The SOT debater was Craig Rowlands of UL LLC (advocating for), and the EUROTOX debater was Thomas Hartung of Johns Hopkins University Bloomberg School of Public Health (advocating against).

During the first part of the session, each debater took 10–15 minutes to make his case for or against the proposition, followed by five minutes of rebuttal each. The debaters then took questions from the audience (in person and online). Several points were raised throughout the session.

In Support

  • The Gartner Hype Cycle describes how a technology matures: it begins with a technology trigger, rises to a peak of inflated expectations, slopes down to a trough of disillusionment, climbs a second upward slope of enlightenment, and finally plateaus at productivity.
  • AI/ML offers multiple algorithms for each type of task, including classification (e.g., support vector machines, neural networks), regression (linear regression, decision trees), and clustering (hierarchical, Gaussian mixture). Expert systems are good for explanations but not so good for accuracy, whereas neural nets are good for accuracy but not so good for explanations. So, how can the best of both be achieved?
  • The OECD guidelines on the validation of quantitative structure-activity relationship (QSAR) models call for the following: (1) a defined endpoint; (2) an unambiguous algorithm; (3) a defined domain of applicability; (4) appropriate measures of goodness-of-fit, robustness, and predictivity; and (5) a mechanistic interpretation, if possible.
  • AI/ML Risk Decisions with Explainable Artificial Intelligence (EAI): NIST combinatorial testing can provide explainability, verification, and validation for assured autonomy and AI. Combinatorial methods can provide explainable AI, and there are prototypes that have applied the approach. The method can be applied to black-box functions and presents explanations in the preferred form of rules.
  • AI/ML Risk Decisions in Product Development and Compliance: AI/ML can predict the Globally Harmonized System (GHS) classification and category, important information that goes into safety data sheets (SDS), finished product labels, etc.
  • AI/ML Risk Decisions in Regulations for Pharmaceuticals: the ICH M7(R) Guideline for mutagenicity assessment of impurities in drug manufacturing is the first guideline to allow replacement of an in vitro test with an in silico prediction that applies two orthogonal approaches, an expert tool and a machine learning model. For medical devices, ISO 10993-1 covers biological evaluation using in silico assessments (e.g., ICH M7[R]). A consortium is also developing guidance for in silico predictions, including genotoxicity and sensitization.
  • AI/ML Risk Decisions in Regulations for Medical Devices and Health Care: Software as a Medical Device (SaMD) is software intended to be used for one or more medical purposes without being part of the medical device hardware. It is applicable in radiology, cardiovascular medicine, and neurology, as well as other clinical areas, including drug efficacy, medication choices, and improvement of clinical practices.
  • The debater noted that the answer to the proposition depends on the context and the question. The most important factors are relevance (fitness for purpose) and the quality of the training dataset, while the biggest challenges are the domain of applicability (DoA) and black boxes with EAI.
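The "defined domain of applicability" in the OECD principles, which the debater also named as one of the biggest challenges, can be illustrated with a minimal sketch. The range-based check below is only one simple way to define such a domain and was not presented in the session; the descriptor names (`logp`, `mol_weight`) and all values are hypothetical.

```python
from typing import Dict, List, Tuple

Descriptors = Dict[str, float]
Domain = Dict[str, Tuple[float, float]]


def fit_domain(training_set: List[Descriptors]) -> Domain:
    """Record the min/max of each descriptor seen in the training set."""
    names = training_set[0].keys()
    return {
        name: (
            min(compound[name] for compound in training_set),
            max(compound[name] for compound in training_set),
        )
        for name in names
    }


def in_domain(compound: Descriptors, domain: Domain) -> bool:
    """Return True only if every descriptor lies inside the training range."""
    return all(low <= compound[name] <= high for name, (low, high) in domain.items())


# Hypothetical training compounds (illustrative values only).
training = [
    {"logp": 1.2, "mol_weight": 180.0},
    {"logp": 3.5, "mol_weight": 320.0},
    {"logp": 2.1, "mol_weight": 250.0},
]
domain = fit_domain(training)

# A query compound within the training ranges, and one with logp far outside them.
inside = in_domain({"logp": 2.0, "mol_weight": 200.0}, domain)
outside = in_domain({"logp": 6.0, "mol_weight": 200.0}, domain)
print(inside, outside)
```

A prediction for the second compound would be flagged as outside the model's domain, so it would deserve less trust than a prediction for the first.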

The SOT debater concluded by stating that yes, AI and ML have a role in risk decisions since they have already been extensively used in product development R&D decisions, product compliance, regulatory risk decisions for drugs using mutagenicity predictions, medical devices (SaMD), augmented surgery, therapy plans, pathology, and radiology.

In Opposition

  • The debater offered a disclaimer: he actually thinks that AI is the best thing for toxicology since “the dose makes the poison” (i.e., hazard is not risk).
  • AI-Based Toxicology: an example is a human-Siri interaction. “Hey, Siri, is this ‘chemical compound’ toxic?” Siri replies, “Here is what I found,” with a great deal of information. “But Siri, this is aspirin! Which aspirin?” “Here is what I found.” Siri provides a great many different chemical names. “Hey, Siri! Can’t I learn from similar substances?” “Here is what I found on the internet: Felix Hoffmann synthesized aspirin in 1897 by acetylating salicylic acid, and two weeks later, he used the same approach to create heroin (as a sedative for cough), which has led to overdoses and many deaths of Americans.”
  • We are still waiting for software that can distinguish between enantiomers (e.g., R-thalidomide, which is sleep-inducing, and S-thalidomide, which is teratogenic). This is a case of garbage in, garbage out. AI makes big sense of big data, but all the biases in the data are translated into biases in the results. There is currently no acceptable big (quality) toxicological data for regulatory use. All that is available is animal data of questionable quality and relevance, meaning that we can predict only animal outcomes.
  • How can AI do quantitative risk assessment (QRA) if human oral bioavailability differs dramatically from bioavailability in animals (including rodents, dogs, and primates)?
  • With big data, the limitations of animal tests can be shown. Currently, the REACH database has ~10,000 chemicals and ~800,000 studies. Toxicology should be better placed than other areas, given its standardized tests (e.g., OECD), good laboratory practice (GLP), maximum tolerable doses (MTD), etc.
  • The reproducibility of certain animal tests is questionable (e.g., the Draize eye test in rabbits). Agreement between different accepted animal tests is also questionable (e.g., skin sensitization results from the guinea pig maximization test [GPMT] compared with the local lymph node assay [LLNA], since the results of one test do not always translate to the other).
  • Furthermore, the six most frequent toxicology tests (OECD TG for acute dermal, acute oral, eye irritation, mutagenicity, sensitization, and skin irritation) consume 57% of animals, involve about 350–750 chemicals with repeat tests, and have about 80% reproducibility overall and 69% reproducibility for toxic chemicals. These figures do not indicate the extent to which the tests correctly predict human hazard, but that predictivity cannot be any better than their reproducibility.
  • There are no data to suggest that reproducibility is any better for chronic and systemic hazards. For instance, carcinogenicity studies have been shown to have 57% inter-species reproducibility, and 52% of all substances test positive, varying with sample size. In reproductive toxicology studies, there is 60% species correlation, and of 1,223 definitive, probable, and possible animal teratogens, fewer than 2.3% were linked to human birth defects.
  • Concordance of noncarcinogenic endpoints in rodent chemical bioassays: for instance, a repeat-dose toxicity analysis of 37 NTP cancer studies for noncancer endpoints shows no correlation between species, between sexes, or with historic results. Mouse-to-rat organ prediction is 25%–55%, and species concordance between mouse and rat is 68%.
  • Bioassays conducted following test guidelines are not devoid of contradictory or discordant results. The lack of standardized publication hinders data extraction, and this is compounded by publication bias toward toxic substances. The low prevalence of toxicities increases the likelihood of false positives. Nor is this applicable to mixtures and UVCBs (substances of unknown or variable composition, complex reaction products, or biological materials).
  • Validating AI is difficult: AI is a black box, the testing is not explainable, additional new data mean a new model, and freezing a model in time makes no sense. Which will regulators trust, a new method or an alternative method, given the evidence being provided? Trust is a factor in model validation; it requires reliability, predictivity, and relevance in predicting between reference points (e.g., reality [human safety] and a gold standard [e.g., animal data]), and it will play a role in regulatory acceptance of the model.
  • Using an AI tool to predict animal toxicology will end a $4 billion industry and replace the jobs of many toxicologists. Do we really want that?
  • For safety, a human in the loop is needed in case the AI fails (e.g., IBM Watson for cancer treatment, which flopped after overpromising and underdelivering).
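The point above about low prevalence and false positives follows from Bayes' theorem. As a minimal sketch, assume a hypothetical test or model with 80% sensitivity and 80% specificity (illustrative values chosen here, not figures from the debate) and compute the positive predictive value (PPV), the share of positive calls that are truly toxic, at different prevalences:

```python
def positive_predictive_value(
    sensitivity: float, specificity: float, prevalence: float
) -> float:
    """Fraction of positive results that are true positives (Bayes' theorem)."""
    true_positives = sensitivity * prevalence
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    return true_positives / (true_positives + false_positives)


# Assumed, illustrative performance: 80% sensitivity and 80% specificity.
for prevalence in (0.5, 0.2, 0.05):
    ppv = positive_predictive_value(0.8, 0.8, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.0%}")
```

Under these assumptions, the PPV falls from 80% when half of the chemicals are toxic to roughly 17% when only 5% are, so most positive calls would then be false positives.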

To conclude, AI has its dark side: it is a black box that will always give a result, whether the information is in the data or not. To put it to use, one first needs good big data, and even then it is not without challenges (e.g., explainable AI, causality, validation, and the fact that bias in data equals bias in results).


The audience then voted, both in person and online. An overwhelming majority voted in support of the proposition that “AI/ML has a role in risk decisions,” with the online poll showing 78% voting yes (in support) vs. 22% voting no (against).

What’s Next?

This debate will take place again (with the debaters taking the reverse positions) in Maastricht, Netherlands, during the XVIth International Congress of Toxicology, September 18–22, 2022.

This blog was prepared by an SOT Reporter and represents the views of the author. SOT Reporters are SOT members who volunteer to write about sessions and events in which they participate during the SOT Annual Meeting and ToxExpo. SOT does not propose or endorse any position by posting this article. If you are interested in participating in the SOT Reporter program in the future, please email Giuliana Macaluso.

On-demand recordings of all Featured and Scientific Sessions delivered during the 2022 SOT Annual Meeting and ToxExpo will be available to meeting registrants in the SOT Event App and Online Planner after their conclusion, through July 31, 2022.
