Testing Generative Large Language Model: Mission Impossible or Where Lies the Path?
OpenAI’s ChatGPT, a generative language model, has attracted widespread attention from industry, academia, and the public for its impressive natural language processing capabilities. Although we know how to train such generative language models, we do not know how these models manage to solve such a diverse range of open-ended tasks. Every time we “prompt program” a large language model to complete a task, we create a customized version of the model that exhibits different abilities and outputs from other customized versions. Some believe that the emergent capabilities of large language models are turning AI from engineering into natural science, as it is hard to think of these models as being designed for a specific purpose in the traditional sense. As our focus shifts from ensuring the correctness of design and construction to exploring and understanding un-designed AI products and behaviors, we need to consider the methodological challenges this transformation poses. For example, will differential testing, metamorphic testing, and adversarial testing, which are effective for testing discriminative models on specific tasks, no longer save us when testing large language models on open-ended tasks? How can we test and correct ethical issues and hallucinations in generative AI? Given that the emergent capabilities of large language models are customized through in-context learning, will we face a problem similar to Schrödinger’s cat in quantum physics: if observation and measurement fundamentally affect the observed object, can we still test the essence of a large language model, or only the appearance of one specific customized version? Large language models are changing the way humans interact with AI; what adjustments do we need to make to our existing data- and algorithm-centric MLOps? There may be many more unknown problems. In this talk, I will share my thoughts (and even confusion) on these questions, along with some tentative directions for action (which may well be wrong), hoping to inspire the community to explore the feasibility and methodology of testing generative large language models.
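To make the question about classical testing techniques concrete, the sketch below applies one simple metamorphic relation (answer consistency under paraphrase) to a generative model. It is a minimal illustration under stated assumptions: `query_llm` is a hypothetical placeholder for whatever API the model exposes, and the exact-match oracle is an illustrative stand-in, not a method prescribed by the talk.

```python
# A minimal sketch of metamorphic testing applied to a generative model.
# Everything here is illustrative: `query_llm` is a hypothetical stand-in for
# a real model API, and the exact-match oracle is deliberately naive.

def query_llm(prompt: str) -> str:
    """Hypothetical model call; swap in a real API client here.
    The placeholder returns a canned answer so the sketch runs end to end."""
    return "1969"


def normalise(answer: str) -> str:
    """Crude normalisation so superficially different phrasings can be compared."""
    return answer.strip().lower()


def paraphrase_relation_holds(question: str, paraphrase: str) -> bool:
    """Metamorphic relation: semantically equivalent prompts should yield
    consistent answers. A violation does not pinpoint a fault, but flags
    behaviour worth inspecting."""
    return normalise(query_llm(question)) == normalise(query_llm(paraphrase))


if __name__ == "__main__":
    ok = paraphrase_relation_holds(
        "What year did Apollo 11 land on the Moon?",
        "In which year did Apollo 11 touch down on the lunar surface?",
    )
    print("relation holds" if ok else "relation violated")
```

The exact-match check is precisely where such relations strain on open-ended generation: two correct answers can be phrased very differently, so the oracle itself becomes the hard part of the test.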
Mon 15 May (displayed time zone: Hobart)

15:45 - 17:15

| Time  | Duration | Session     | Title                                                                                              | Speaker |
|-------|----------|-------------|----------------------------------------------------------------------------------------------------|---------|
| 15:45 | 50m      | Keynote     | Testing Generative Large Language Model: Mission Impossible or Where Lies the Path? (DeepTest)      | Zhenchang Xing, CSIRO’s Data61; Australian National University |
| 16:35 | 30m      | Panel       | Panel (DeepTest)                                                                                     | |
| 17:05 | 10m      | Day closing | Closing (DeepTest)                                                                                   | |