RAG Hybrid Keyword and Context Retrieval
RAG (Retrieval-Augmented Generation) is a developing technique that uses embedding models to search documents for context relevant to an embedding generated from the user query. Current methods suffer from known problems, including loss of contextual information and failure to retrieve relevant passages. To address these difficulties, a hybrid method combining keyword search and contextual (embedding-based) search is considered here.
Queries must be separated into distinct types, or alternatively a single query must generate two candidate responses. The first type (or response) would be an exact keyword search; the second would be contextual in nature and take advantage of embedding models and vector database search.
For the first type of retrieval, exact matching is what matters: for example, a search for a specifically labeled piece of information, such as an error code or part number. Once a keyword has retrieved a relevant chunk of document data (which should include the entire text relating to that subject), the LLM can operate over that information.
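The two-type separation could be sketched as a simple router: check the query for exact-match patterns first, and fall back to contextual retrieval otherwise. This is a minimal sketch, assuming illustrative formats for error codes and part numbers; the pattern shapes and function name are hypothetical, not part of the design above.

```python
import re

# Assumed, illustrative formats: error codes like "E1234",
# part numbers like "AB-12345". Real corpora would need their own patterns.
EXACT_PATTERNS = [
    re.compile(r"\bE\d{3,4}\b"),          # hypothetical error-code format
    re.compile(r"\b[A-Z]{2}-\d{4,6}\b"),  # hypothetical part-number format
]

def route_query(query: str) -> str:
    """Classify a query as 'keyword' (exact match) or 'contextual'."""
    for pattern in EXACT_PATTERNS:
        if pattern.search(query):
            return "keyword"
    return "contextual"
```

A query like "What does error E1234 mean?" would route to the keyword path, while "Summarize the warranty policy" would route to embedding search.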
So basically, this is a writing exercise to organize the design of a RAG system. The core component that needs careful design is the backend retrieval and prompt-handling system. There are some design flaws in the simple first-pass approach I used:
- Sometimes I simply want the application to read a specific chapter or section out loud, verbatim. To that end, I will implement a ‘read’ keyword: if the first word of the user query is ‘read’, the backend should interpret that as a request for an exact-match keyword search for a section in the document corpus, the result of which it will forward to the client-side application.
- Sometimes I want to use a hybrid approach, where certain portions of the document corpus are included verbatim for reference purposes and to ensure accuracy in the LLM’s reasoning. At the same time, I want the LLM to operate over this data and perform various simple reasoning tasks, or simply format the response into coherent grammatical prose. For example: “What is the organization’s policy on such-and-such issue?” For this to be meaningful, references need to be included so the user can more easily determine whether the method worked as intended, or whether retrieval contamination or LLM hallucination occurred.
- Sometimes I want to scan disparate parts of the corpus and come to a conclusion about something not explicitly stated, but implied. This requires careful contextual matching that excludes portions of the corpus which appear lexically relevant but lack actual semantic relevance. This is a work in progress; notes to follow.
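The first two modes above could be sketched as a single backend dispatcher. This is a minimal sketch under stated assumptions: `retrieve_section`, `embed_search`, and `call_llm` are stand-in stubs for the real retrieval and LLM calls, which these notes do not yet specify.

```python
# The three helpers below are placeholder stubs, not a real implementation.

def retrieve_section(title: str) -> str:
    """Stub: exact-match lookup of a section by title in the corpus."""
    return f"[verbatim text of section: {title}]"

def embed_search(query: str, top_k: int = 5) -> list[dict]:
    """Stub: embedding-based vector search returning chunks with metadata."""
    return [{"text": "...", "metadata": {"doc": "handbook", "section": "policy"}}]

def call_llm(query: str, context: list[dict]) -> str:
    """Stub: LLM call over the retrieved context."""
    return "formatted answer"

def handle_query(query: str) -> dict:
    """Dispatch to verbatim 'read' mode or hybrid retrieval plus reasoning."""
    words = query.split()
    if words and words[0].lower() == "read":
        # Verbatim mode: exact-match section lookup, forwarded to the
        # client application unmodified.
        return {"mode": "verbatim", "text": retrieve_section(" ".join(words[1:]))}
    # Hybrid mode: embedding retrieval, LLM reasoning, and references
    # returned so the user can check for contamination or hallucination.
    chunks = embed_search(query)
    return {
        "mode": "hybrid",
        "answer": call_llm(query, chunks),
        "references": [c["metadata"] for c in chunks],
    }
```

Returning the references alongside the answer, rather than only inline in the prose, is one way to make the verification step in the second bullet mechanical rather than trust-based.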
Key Concepts
- Variable chunk size, related to sections and subsections. Sufficient server memory should make this practical.
- Metadata which includes all relevant document structure: document title, chapter title, section title, and subsection title. It should also include date information (for prioritizing more recent information or allowing date-sensitive searches) and page-number ranges.
- Identification of images in documents, which should be stored in a relational database under a keyword or other identifier, so that diagrams and graphs can be displayed client-side in a manner similar to the way Anthropic displays such information in ‘artifacts’.
- Table information stored as multi-dimensional arrays in order to preserve row and column labels relative to the element data.
- Possible reconstitution of tabular information into semantic contextual information upon retrieval.
- Possible image or graph semantic description to facilitate LLM operations.
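The metadata and table concepts above could be sketched as follows. This is a hedged sketch: the field names, the `ChunkMetadata` type, and the sentence template in `table_to_sentences` are all assumptions for illustration, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass
class ChunkMetadata:
    """Assumed per-chunk metadata schema; field names are illustrative."""
    document_title: str
    chapter_title: str
    section_title: str
    subsection_title: str = ""
    date: str = ""                      # ISO date, for recency weighting
    page_range: tuple[int, int] = (0, 0)

def table_to_sentences(row_labels: list[str], col_labels: list[str],
                       cells: list[list[str]]) -> list[str]:
    """Reconstitute a labeled table into sentences on retrieval, keeping
    each cell tied to its row and column labels so the LLM receives
    semantic context rather than a flattened grid."""
    return [
        f"For {row}, {col} is {cells[i][j]}."
        for i, row in enumerate(row_labels)
        for j, col in enumerate(col_labels)
    ]
```

For example, a one-row parts table with columns "weight" and "price" would become the sentences "For Model A, weight is 2 kg." and "For Model A, price is $10.", which preserves the label-to-cell relationships the multi-dimensional array stores.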