Introspective Contextual Augmentation (ICA)

Introspective Contextual Augmentation is a powerful feature of Ditana Assistant that significantly enhances the quality and accuracy of AI responses. This innovative approach creates a synergy between various knowledge sources, including the AI model’s own introspective capabilities and Wolfram|Alpha (when enabled).

Note: This feature is turned off by default. Users can enable it through the UI or by using the -a command line switch when using the terminal tool.

Key Aspects of ICA

  1. Dynamic Information Gathering:

    • The assistant automatically generates and processes contextual queries to supplement user inputs.
    • Adapts to each specific request, creating a more tailored response.
  2. Dual-Source Augmentation:

    • With Wolfram|Alpha: Incorporates up-to-date information such as current weather, statistics, and more.
    • Without Wolfram|Alpha: Engages in self-dialogue, using the underlying LLM to answer contextual queries, enhancing response quality through introspection.
  3. Adaptive Contextual Queries:

    • Dynamically generates relevant questions based on the user’s input.
    • These questions are answered either by Wolfram|Alpha or the LLM itself, creating a form of “inner monologue”.
  4. Guided Introspective Reasoning:

    • Incorporates contextual queries and their answers into the message history before addressing the main user prompt.
    • Guides the LLM through a structured, introspective approach to problem-solving.
    • Allows the model to break down complex problems and approach them more systematically, even when engaging in self-dialogue.
    • The main user prompt is kept unchanged, but with additional context due to the messages automatically inserted into the dialogue.
  5. Enhanced Problem-Solving:

    • Leverages additional context and guided introspective reasoning.
    • Provides more comprehensive and accurate solutions to complex problems.
  6. Improved Accuracy:

    • Reduces errors and inconsistencies in AI responses.
    • Provides additional relevant information and a structured thought process before addressing the main query.
  7. LLM Capability Maximization:

    • Even without external sources like Wolfram|Alpha, helps the LLM leverage its own knowledge more effectively.
    • Utilizes a guided, multi-step introspective reasoning approach.

Example of ICA in Action

To illustrate how ICA works, here’s an example from the MMLU multitask test (without using Wolfram|Alpha):

Question: A victim and a defendant both worked as longshoremen at a shipyard. After the victim
was shot to death, the defendant was tried for murder and acquitted. Following the acquittal,
the victim's estate sued the defendant in a wrongful death action. During the civil trial, the
victim's estate called a witness to testify. The witness, who worked with both men at the
shipyard, testified that two weeks before the shooting, the victim came to work with a broken
nose and said that the defendant had caused it. The attorney for the victim's estate then
asked the witness the following question, "Was the defendant present during your conversation
with the victim, and if so, did he say anything about the victim's broken nose?" The witness
replied, "Yes, the defendant was present, and after the victim told me that the defendant
broke his nose, the defendant said, And that's only the beginning." Upon objection by the
defendant's attorney, the witness's testimony is

Choices:
B. admissible, because it reports a declaration against interest.
C. admissible, because it reports the defendant's adoptive admission of the victim's assertion.
D. inadmissible, because of the principle of collateral estoppel.
E. inadmissible, because it is hearsay not within any recognized exception.

Correct answer: C

Model’s answer (without ICA feature): E

                            systematic contextual query: "What legal principle governs the admissibility of the witness's testimony regarding the defendant..."
                             answer to systematic query: "In the scenario, a victim and a defendant, both longshoremen at a shipyard, were involved in a ca..."
   Are you sure? Please answer only with "yes" or "no".: "no."
                                      critical question: "What specific hearsay exception could potentially apply to the defendant's statement, and how mig..."
                            answer to critical question: "The specific hearsay exception that could potentially apply to the defendant's statement is the "..."
Model’s answer (with ICA feature): C

This example demonstrates how ICA guides the model through a series of contextual queries, helping it arrive at the correct answer.

Please note that the additional questions and answers you see in the above log merely reflect a subset of the internal processes, and in particular, do not provide the context in which they occur. Furthermore, the generation of the questions is not logged. The objective of this log is to provide an overview of the internal processes. While additional log outputs can be enabled, they clutter the output.

Technical Implementation

  1. Query Generation:

    • Utilizes meta-questions to the LLM to generate contextual queries.
    • Employs meta-meta-questions to generate the questions themselves.
    • Developed through extensive trial and error to determine effective query strategies.
  2. Dialogue Structure:

    • Creates independent dialogues separate from the main conversation.
    • The main dialogue is then constructed based on the context derived from these sub-dialogues.
  3. Prompt Engineering:

    • Varies the prompt structure based on the current step in the procedure.
    • Converts LLM-generated questions (initially Assistant messages) into User messages for efficient processing.
    • Sometimes repeats dialogues within a single message instead of using separate messages, depending on efficiency requirements.
  4. Experimental Features:

    • Explored recursive calls with limitations to ensure prompts become progressively shorter.
    • Implemented in the generate_sub_prompts function in the text_processors_ai module.
    • Socratic method simulation (socratic_method function) attempted but not yet yielding statistically significant improvements.

Statistical Evaluation and Optimization

  1. Benchmark Tests:

  2. Performance Improvements:

    • MMLU test: Corrected 188 (5.6%) of 3,348 initially incorrect answers using the OpenAI model gpt-4o-mini.
    • ARC-Challenge test: Fixed 26 out of 94 initially incorrect answers.
  3. Statistical Significance:

    • Employed McNemar’s statistical test to verify improvements.
    • Achieved over 99% probability of significant differences in all cases.
    • Despite optimization on ARC-Challenge potentially affecting statistical significance, MMLU results confirmed the effectiveness.
  4. Methodology:

    • Conducted tests without Wolfram|Alpha to evaluate the ICA process capability independently.
    • Ran the ARC-Challenge test multiple times (4) due to potential variations in LLM API responses.
  5. Documentation and Transparency:

    • Complete test logs available for download here.
    • Logged subset of internal processes to provide an overview without cluttering output.

Evaluation Methodology for Multiple-Choice Questions

The evaluation of Large Language Models (LLMs) on multiple-choice questions presents unique challenges, particularly when using pre-existing datasets. This section outlines the methodology employed in this project, discussing its rationale and comparing it to other potential approaches.

Dataset Preparation and Prompt Engineering

It’s crucial to note that many multiple-choice datasets, including those from HuggingFace like ai2_arc and cais_mmlu, provide only the question text and a list of answer choices. They do not include a pre-formatted prompt suitable for direct input to an LLM. Consequently, researchers must design an appropriate prompt structure.

In this project, the process_question function was developed to transform raw dataset entries into suitable prompts. This function constructs a prompt that clearly delineates the question and answer choices, concluding with an explicit instruction for the model to provide the letter of the correct answer.

Challenges with Standard Evaluation Metrics

Initial attempts at evaluation utilized the HuggingFace evaluate library, specifically the bertscore and Google Research rouge metrics. These metrics are widely used for various natural language processing tasks. However, manual verification of individual question evaluations revealed significant discrepancies in accurately identifying correct and incorrect responses in the multiple-choice context.

Development of a Specialized Evaluation Method

In response to these challenges, a specialized method for evaluating multiple-choice responses was developed. This method capitalizes on the structured nature of multiple-choice answers and the linguistic properties of English. Key aspects of this approach include:

  1. Answer Choice Labeling: The method ensures that answer choices begin with “B” rather than “A”. This is crucial because “A” frequently appears as an article in English text, whereas “B”, “C”, “D”, etc., are less likely to appear as standalone words.

  2. Response Parsing: The LLM’s response is analyzed for the presence of valid answer choice labels (e.g., “B”, “C”, “D”) as whole words. This approach significantly reduces false positives compared to more complex semantic analysis methods.

  3. Validation Criteria:

    • A response is considered valid if exactly one answer choice label is present.
    • If multiple valid labels are detected, the response is deemed incorrect.
    • If no valid label is found, the response is also considered incorrect.
    • The response is correct only if the detected label matches the dataset’s provided answer.

This method has shown high reliability in manual verification, outperforming the initially tested bertscore and Google Research rouge metrics for this specific task.

Considerations and Future Work

While this specialized method has proven effective, it’s important to acknowledge that the HuggingFace evaluate library may contain other evaluation methods specifically designed for multiple-choice questions that were not explored in this project. Future work could involve a comprehensive comparison of this method against other potential evaluation techniques for multiple-choice responses.

Moreover, the necessity of prompt engineering in working with these datasets highlights an important consideration in LLM evaluation: the impact of prompt design on model performance. This aspect warrants further investigation and standardization efforts in the field of LLM evaluation.

Experimentation and Further Development

  1. Modular Design:

    • Modifications can be implemented under the condition Configuration.get()['ENABLE_EXPERIMENTAL_FEATURES'].
    • Allows for easy testing and benchmarking of new features.
  2. Benchmarking Process:

    • Use the -e option when running benchmarks to evaluate experimental modifications.
    • Statistical significance is calculated continuously after each question using McNemar’s statistical test, allowing for early detection of significant improvements or deteriorations.
    • Typically requires 2,000-3,000 questions to achieve robust statistical significance.
    • This approach quickly determines if changes are beneficial or detrimental, enabling efficient iteration and refinement of experimental features.
  3. Ongoing Development of Experimental Features:

    • Building upon the experimental features mentioned in the Technical Implementation section, several areas are being actively explored and refined:
      • Further development of recursive techniques with result summarization, expanding on the generate_sub_prompts function.
      • Continued optimization of the Socratic method simulation, aiming to achieve statistically significant improvements.
      • Enhancement of query generation and dialogue construction methods to improve context understanding and response relevance.
    • These developments aim to push the boundaries of the ICA feature’s capabilities while maintaining efficient API usage and overall performance.
  4. Open for Contributions:

    • The Python codebase allows for easy integration of new ideas and techniques.
    • Encourages experimentation within the augment_context_introspectively method in the conversation_manager module.

By leveraging these techniques and continual refinement, the ICA feature ensures that Ditana Assistant can deliver high-quality responses across a wide range of topics, maximizing the potential of the underlying AI model through structured, introspective reasoning, with or without external knowledge sources.