...

  • Identity Chunker Chain - Keeps the entire content text as a single large block of text without breaking it apart

  • Identity Matching Text - Uses the entire content text as the matching text without transforming it

  • Identity Qualifying Text - Uses the entire content text as the qualifying text without transforming it

  • Identity Query Transformer - Passes through the user's query verbatim without transforming it

  • Identity Reranker - Assigns the highest rerank score of 1.0 to every knowledge chunk regardless of its contents

  • Identity Filterer - Does not filter anything; it simply passes all of the knowledge chunks through to its output
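Conceptually, the identity chains behave like the trivial pass-through functions sketched below. This is only for intuition; the function names and signatures are illustrative, not the actual smart-chain API.

```python
# Conceptual sketch of the identity chains as pass-through functions.
# Names and signatures are illustrative, not the real smart-chain API.

def identity_chunker(content_text: str) -> list[str]:
    """Keep the entire content text as a single chunk."""
    return [content_text]

def identity_matching_text(content_text: str) -> str:
    """Use the content text verbatim as the matching text."""
    return content_text

def identity_qualifying_text(content_text: str) -> str:
    """Use the content text verbatim as the qualifying text."""
    return content_text

def identity_query_transformer(user_query: str) -> str:
    """Pass the user's query through unchanged."""
    return user_query

def identity_reranker(chunks: list[str]) -> list[tuple[str, float]]:
    """Assign every chunk the maximum rerank score of 1.0."""
    return [(chunk, 1.0) for chunk in chunks]

def identity_filterer(chunks: list[str]) -> list[str]:
    """Pass all chunks through without filtering anything."""
    return list(chunks)
```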

Note

Do not modify the identity chains: various parts of the knowledge base system assume that they behave exactly as described here. Change them only if you know precisely what you are doing.

Premade Document Processing Chains

...

The most important parts of that customization are the Matching Text and the Query Transformer. Let's look at some examples where we might need to customize these.

Example 1: E-Commerce Product Search Engine

The default knowledge base is designed for doing Q&A based on documents and data that were imported into it. But what if you aren't so much asking a question as searching for something specific?

...

  • Chunker - Our goal is for the knowledge base to look up the complete information available about a product, so the right chunker depends on the format of the data we are uploading. If we are importing documents or pages with a strict one page == one product relationship, then we can use the identity chunker. But suppose we are uploading product information sheets in the form of PDF documents, where each document contains several products. Then we might want to create a custom chunker that breaks each document apart into sections, one per product

  • Matching Text - We want to search through the knowledge base based on characteristics of the product. So the matching text we use should be a one or two sentence description containing the characteristics of the product. It might be most effective to generate multiple different candidate one-sentence descriptions and embed all of them, similar to how the default Q&A matching text chain generates multiple questions

  • Query Transformer - Our query transformer now needs to take the ambiguous query provided by the user and turn it into a hypothesized one- or two-sentence product description. Just as the original query transformer took ambiguous text and cleaned it up into a proper question, our new Product Search query transformer takes ambiguous text and cleans it up into a standardized product description, matching a specific length and format

  • Reranker - The default reranker is prompted to look at how well the searched knowledge chunk matches the query provided by the user. We would now need a new reranker that is prompted to determine how well the product we found matches the description given by the user of what they wanted
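A hypothetical prompt template for such a Product Search query transformer might look like the sketch below. The prompt wording and the surrounding function are assumptions for illustration; in production the filled template would be sent to an LLM for completion.

```python
# Illustrative prompt template for a product-search query transformer.
# The wording and format are assumptions, not the default chain's prompt.

PRODUCT_QUERY_PROMPT = """\
Rewrite the user's search text as a one- or two-sentence product
description listing the key characteristics they are looking for.
Use the format: "A <product type> with <characteristics>, suited
for <use case>."

User search text: {query}
Product description:"""

def transform_product_query(query: str) -> str:
    """Fill the template; an LLM would complete it in production."""
    return PRODUCT_QUERY_PROMPT.format(query=query)
```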

Example 2: Best Practices Engine for Pitch Deck

Many agents are based on the premise that they can analyze some data provided by the user against a knowledge base containing rules or best practices. For example, an agent might analyze a pitch deck the user provided against a knowledge base containing general best practices for writing pitch decks. Since our standard Q&A-style knowledge base was built only for answering questions, we need to customize the knowledge base to suit our unique needs.

Let’s look at the situation of a pitch-deck best-practice knowledge base. The key part of customizing our knowledge base is deciding what information is relevant for matching queries to knowledge, and what information can be discarded. This is a matter of design and there is no one right answer. Included bits of information might make results better on some queries but worse on others. The best practice would be to measure your results on a statistical basis. But we can still get great results without ground-truth data simply by applying a bit of design principle to our matching text.

For our pitch-deck use case, let's say that we want to group our best practices by type of slide, stage of company, and industry or vertical. Therefore we want to construct a matching text that contains each of these three elements. For example, our matching text might look like this:

Code Block
A go-to-market slide for a seed-stage startup in fintech.

And you could imagine different rules having different texts but in a similar format:

Code Block
A problem slide for a series b startup in health care.
A solution slide for a pre-seed startup in e-commerce tech.

If we make both the Matching Text Smart Chain and the Query Transformer Smart Chain produce outputs that look like the above bits of text, then we will be able to match the specific sections of our pitch deck with specific best practices that have been loaded into the knowledge base.
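The standardized format above can be captured as a simple template. The helper below is a hypothetical sketch of how both chains might compose their final output from the three extracted elements; in practice an LLM prompt would extract those elements first.

```python
# Hypothetical helper composing the standardized matching-text sentence
# from the three elements (slide type, company stage, industry/vertical).

def build_matching_text(slide_type: str, stage: str, industry: str) -> str:
    """Compose a matching text in the standard three-element format."""
    return f"A {slide_type} slide for a {stage} startup in {industry}."
```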

So the setup for our system is as follows:

  • We are digesting blog articles, presentations, and other documents, and extracting the knowledge out of them as best practices

  • These best practices are going to be formatted as a single paragraph of text that contains some specific pithy piece of wisdom or advice

  • The user is going to upload their Pitch Deck in the form of a PDF

  • We run a custom smart chain that:

    • Breaks apart the PDF into pages

    • Sends each page's text verbatim to the knowledge base as a query (relying on the knowledge base query transformer to transform it into the appropriate format)

    • Takes the resulting knowledge chunks, and applies a custom prompt to each one to “contextualize” it with the specific information contained in the deck, producing a recommendation that can be displayed to the user
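The custom smart chain described above could be sketched as the function below. This is a structural sketch only: it assumes the PDF has already been split into page texts, and the `query_knowledge_base` and `contextualize` callables are hypothetical stand-ins for the knowledge base query and the custom contextualization prompt.

```python
# Structural sketch of the pitch-deck analysis smart chain.
# query_knowledge_base and contextualize are hypothetical stand-ins.

def analyze_pitch_deck(page_texts, query_knowledge_base, contextualize):
    """For each slide page, look up best practices and contextualize them.

    page_texts: list of raw text strings, one per PDF page.
    query_knowledge_base: callable(page_text) -> list of knowledge chunks;
        the KB's query transformer handles format conversion internally.
    contextualize: callable(chunk, page_text) -> recommendation string.
    """
    recommendations = []
    for page_text in page_texts:
        for chunk in query_knowledge_base(page_text):
            recommendations.append(contextualize(chunk, page_text))
    return recommendations
```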

To accomplish this, we must apply all of the following customizations:

  • Chunker - Our document chunker smart chain will need to take in the original documents and transform them into small, bite-sized chunks of text, each containing a single best practice or idea extracted from the document. This isn’t really breaking apart the original document, but rather wholly digesting it and transforming it into something new - one-paragraph descriptions of best practices

  • Matching Text - Our matching text smart chain needs to take each of the best-practice paragraphs generated by the chunker, and transform them into our standard matching text format described above, e.g. A problem slide for a series b startup in health care. We may design the prompt to generate a bunch of potential matching texts for each best-practice paragraph that it processes

  • Qualifying Text - We can stick with the default qualifying text, which would just be a summary of the original document that the best-practice was derived from, in case that information is relevant

  • Query Transformer - Our use scenario for our knowledge base involves uploading the raw text of each presentation slide. We would need to make a custom query transformer that can take that raw text and convert it into our standard matching format described above, e.g. A problem slide for a series b startup in health care.

  • Reranker - Our matching text is relatively simple and close ended, so we may want to disable the reranker by switching to the identity reranker and rely on the matching text alone. This reduces the cost of knowledge base queries and improves the response speed.

  • Filterer - Given the very closed-ended nature of our matching text, we may find that by default all of the match scores end up very high, e.g. > 0.9, even between matching texts that are supposed to be different and not match. Therefore, we may have to adjust the default filtering smart chain to calibrate it for our specific use case, setting a cutoff that may be as high as 0.95 or 0.97
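The calibrated filterer amounts to a simple score threshold. The sketch below illustrates the idea, assuming the reranker (or embedding match) produces (chunk, score) pairs; the cutoff value is the parameter you would tune for your own data.

```python
# Minimal sketch of a calibrated score-cutoff filterer. The 0.95 default
# is the example cutoff from the text, not a universally correct value.

def filter_chunks(scored_chunks, cutoff=0.95):
    """Keep only (chunk, score) pairs whose score clears the cutoff."""
    return [(chunk, score) for chunk, score in scored_chunks
            if score >= cutoff]
```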