Presented at GlueCon 2023
Joe Shockman & Allen Romano
As we (hopefully?) all know, LLMs are simply statistical models for generating reasonable-sounding text. This, of course, presents some problems.
Chief among them is hallucination.
Just like a person who knows nothing, an LLM with insufficient context will generate reasonable-sounding bullshit … and for much the same reason: it wants to be helpful, but sometimes doesn’t have the context or facts it needs to give you the right answer. And, just like a human making errors from lack of knowledge, it is often “confidently wrong.”
The good news is that, unlike a person, the LLM probably does contain the information you’re looking for. It just needs to be set up for success in order to retrieve it for you.
Keep in mind that it’s not lying to you. It literally doesn’t know what any of this means.
From the point of view of the LLM, what we call “the failure mode” is identical to success.
It is a generative model. It generates. Therefore the success and failure check must happen outside the LLM.
[And maybe it is a bad metaphor because it’s not exactly that…]
Is there nothing to be done?
Sundar Pichai: “No one in the field has yet solved the hallucination problems”
On the contrary, it is not a problem that will be “solved” any time soon, because it is baked into the technology. It is a failure risk that must be mitigated at the application level.
Ye, Ou, et al. (May 18, 2023): minor changes in a prompt can shift results up to 27% of the time, and outputs for identical prompts fluctuate, on average, more than 3 times per hundred runs.
Manipulating the prompt is not enough. There is simple hallucination and complex hallucination, and we need to account for both. Prompt engineering best practices (clarity, chain-of-thought, context quotation, specifying an explicit N/A state) help mitigate some simple hallucination and some complex hallucination, but they are not dependable. Small changes to the prompt or to components of the context can create unforeseen changes in behavior. There are also numerous gotchas, for example token stuffing and neologisms.
Google chatbot demo example:
A factual error about the James Webb Space Telescope wiped $100B off Google’s stock price.
Yes, yes, stock goes up, stock goes down.
This particular error spooked Wall Street because it definitively answered the question on everyone’s mind: “Is Google ready to compete in the age of AI?” It answered that question with a resounding “NO.”
Business opportunity alert: if you use this talk to launch a hallucination-prevention business, you will be building something that would have saved Google $100B. If you do, I ask only that you come find me and buy me a drink.
If you’re still not convinced
Ask ChatGPT about yourself. It’ll get some things right, but it’ll probably make up some honors and awards. It really wants to be helpful and make you sound impressive, and it has no concept that making things up isn’t allowed.
We need to shout from the rooftops about the possibility of hallucination. That’s what we’re doing here today. My goal for this talk is to be clear and compelling enough that each of you can take up the torch and keep talking about it.
The biggest danger is insufficient data, context and guidance.
Identify bootstrap modes where the LLM has insufficient knowledge or context (fail with an error before asking)
E.g., at Logoi and Grounded we ingest knowledge data. We know the LLM will hallucinate if queried before sufficient information is present (or if you ask a question outside the bounds of the provided data). So we set the service into lame-duck mode until we have sufficient context to generate useful answers.
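Here is a minimal sketch of that readiness gate, assuming a hypothetical KnowledgeStore with an arbitrary document-count threshold; the names and relevance check below are illustrative, not our production ingestion code.

```python
# Sketch of a "lame duck" readiness gate: refuse to query the LLM until
# enough knowledge has been ingested to ground an answer.
# KnowledgeStore, MIN_DOCUMENTS, and ServiceNotReady are illustrative names.

MIN_DOCUMENTS = 50  # arbitrary threshold for this sketch


class ServiceNotReady(Exception):
    """Raised instead of letting an under-grounded LLM improvise."""


class KnowledgeStore:
    def __init__(self):
        self.documents = []

    def ingest(self, doc: str) -> None:
        self.documents.append(doc)

    def is_ready(self) -> bool:
        return len(self.documents) >= MIN_DOCUMENTS

    def covers(self, question: str) -> bool:
        # Toy relevance check: does any ingested document share a keyword
        # with the question? A real system would use retrieval scores.
        words = set(question.lower().split())
        return any(words & set(doc.lower().split()) for doc in self.documents)


def answer(question: str, store: KnowledgeStore, llm_call) -> str:
    if not store.is_ready():
        raise ServiceNotReady("Not enough context ingested yet; staying in lame-duck mode.")
    if not store.covers(question):
        raise ServiceNotReady("Question is out of bounds of the provided data.")
    return llm_call(question, context=store.documents)
```

The point is simply that the refusal happens before the model is ever asked, so there is nothing for it to hallucinate about.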
Beware high-stakes uses. When the stakes are high you cannot rely solely on AI, nor can a human rely on information generated solely by AI (GPT-4 Technical Report, OpenAI 2023).
Split complex tasks into a cascade of steps.
E.g., parse the prompt and attempt to identify context, intent, setting, and optimal prompt characteristics. If you fail to identify intent, that’s a good opportunity to exit with failure! (See the sketch after this list.)
This model also offers the added bonus of protecting against prompt injection attacks and other unexpected input.
This approach is also likely to be the structure for getting better performance on complex tasks (e.g. SmartGPT)
Use the parsed context and intent to context-tune the prompt. This improves quality by allowing automatic, context-aware prompt tuning, and it also allows for validation at each step.
LLMs are better than people in that they probably DO know everything, so your task is to wring the correct data out of them. By splitting the task in step 2, we are able to benefit from richer context.
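A sketch of that cascade is below. Each stage can bail out before anything reaches the model. The intent labels, the classify_intent helper, and the prompt template are all illustrative assumptions, not any particular product’s pipeline.

```python
# Sketch of a cascaded pipeline: parse -> identify intent -> context-tune
# the prompt -> query. Each stage can fail fast instead of letting the
# model guess. All helper names here are illustrative.

ALLOWED_INTENTS = {"lookup", "summarize", "compare"}


class PipelineError(Exception):
    pass


def classify_intent(user_prompt: str) -> str | None:
    # Stand-in for a cheap classifier (rules, a small model, or a narrowly
    # scoped LLM call). Returns None when intent cannot be determined.
    text = user_prompt.lower().strip()
    if "summarize" in text:
        return "summarize"
    if "compare" in text or " vs " in text:
        return "compare"
    if text.endswith("?"):
        return "lookup"
    return None


def build_tuned_prompt(user_prompt: str, intent: str, context: list[str]) -> str:
    # Context-aware prompt tuning: quote the relevant context back to the
    # model and spell out an explicit "not available" escape hatch.
    quoted = "\n".join(f'"{c}"' for c in context)
    return (
        f"Task type: {intent}\n"
        "Use ONLY the quoted context below. If the answer is not in the "
        "context, reply exactly: NOT AVAILABLE.\n"
        f"Context:\n{quoted}\n\n"
        f"Question: {user_prompt}"
    )


def run_pipeline(user_prompt: str, context: list[str], llm_call) -> str:
    intent = classify_intent(user_prompt)
    if intent is None or intent not in ALLOWED_INTENTS:
        # Failing here also shields the model from unexpected or
        # injection-style input.
        raise PipelineError("Could not identify a supported intent; refusing to answer.")
    if not context:
        raise PipelineError("No context available for this request.")
    prompt = build_tuned_prompt(user_prompt, intent, context)
    return llm_call(prompt)
```

Because every stage returns something inspectable, you can validate, log, or abort between steps instead of only judging the final answer.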
Prompt engineering recipes: ask the model to “quote” from the context for specific information; use chain-of-thought (CoT) prompting;
Set the LLM temperature low so it is less creative: “just the facts, ma’am.”
Expert mode. Once you identify the proper context, bake a request for an expert response into the prompt.
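Here is one way these recipes can be baked into a single request. The prompt wording is illustrative, and llm_call is a generic stand-in for whatever client you use; most chat APIs expose a temperature parameter with similar semantics.

```python
# Sketch combining several recipes in one request: expert framing,
# quote-from-context, chain-of-thought, an explicit N/A state, and a low
# temperature. `llm_call` is a stand-in for your actual client.

def build_messages(question: str, context: str) -> list[dict]:
    system = (
        "You are a careful domain expert. Answer ONLY from the supplied "
        "context. Quote the exact sentence you relied on. If the context "
        "does not contain the answer, say 'NOT AVAILABLE'."
    )
    user = (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Think step by step, then give the final answer on its own line."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


def ask(question: str, context: str, llm_call) -> str:
    # Temperature 0 keeps sampling as deterministic and literal as the
    # model allows: "just the facts."
    return llm_call(messages=build_messages(question, context), temperature=0.0)
```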
Add error handling on the response to catch hallucinations before they reach the end user.
Create and call a Fact Check Module™ (not actually trademarked, so feel free). Run responses through a more fixed verification system. In the case of our chat knowledge engine, we found that the LLM could hallucinate usernames. That’s reasonably easy to guard against at the response level, before the hallucinations can reach the user.
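A minimal sketch of that kind of response-level guard for the username case: any name the model mentions must already exist in the system. The @mention pattern and KNOWN_USERS set are illustrative.

```python
import re

# Sketch of a response-level "fact check" for usernames: reject any
# @mention that does not exist in the system before the response ships.
# The pattern and user set below are illustrative.

KNOWN_USERS = {"alice", "bob", "carol"}
MENTION_RE = re.compile(r"@([A-Za-z0-9_]+)")


class HallucinationDetected(Exception):
    pass


def check_usernames(response: str) -> str:
    for handle in MENTION_RE.findall(response):
        if handle.lower() not in KNOWN_USERS:
            raise HallucinationDetected(f"Response references unknown user @{handle}")
    return response


# Usage: raise (or retry the generation) instead of shipping a made-up user.
# safe_response = check_usernames(llm_response)
```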
When running your own model you have an opportunity for model tuning (deliberately overfit the model to the current use case so it blows up noticeably if it goes out of bounds)
Again, pre-parsing prompts also prevents prompt injection attacks
A note for the abuse departments out there: in cases where malicious behavior or an attack is suspected or confirmed, insert randomness into responses to obscure the nature of the detection state machine. A little randomness wreaks havoc on black-box testing.
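One possible sketch of that trick, with illustrative refusal strings and jitter bounds; how you flag a request as suspected abuse is out of scope here.

```python
import random
import time

# Sketch: when abuse is suspected, vary the refusal wording and add timing
# jitter so black-box probing can't cleanly map the detection state machine.

REFUSALS = [
    "Sorry, I can't help with that.",
    "That request can't be completed.",
    "I'm not able to answer that right now.",
]


def respond(prompt: str, suspected_abuse: bool, llm_call) -> str:
    if suspected_abuse:
        time.sleep(random.uniform(0.1, 1.5))  # jitter the response timing
        return random.choice(REFUSALS)        # vary the visible behavior
    return llm_call(prompt)
```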
—
Mitigations we’ve implemented now:
Future mitigations planned: