[Retrieval] Retrievers

LYShin 2023. 9. 11. 19:30

- Source: https://python.langchain.com/docs/modules/data_connection/

- This post is a translation based on the LangChain API documentation, with a small amount of content added along the way.

- The Retrievers documentation describes several ways LangChain returns documents for a given unstructured query.

- This post covers retrievers in general.

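Retrievers

A retriever is an interface that returns documents given an unstructured query. It is more general than a vector store: a retriever does not need to store documents, only to return (retrieve) them. Vector stores can serve as the backbone of a retriever, but other kinds of retrievers exist as well.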

1. Retriever Base

First, let's look at the API of LangChain's `BaseRetriever` class.

from abc import ABC, abstractmethod
from typing import Any, List
from langchain.schema import Document
from langchain.callbacks.manager import Callbacks

class BaseRetriever(ABC):
    ...
    def get_relevant_documents(
        self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Retrieve documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...

    async def aget_relevant_documents(
        self, query: str, *, callbacks: Callbacks = None, **kwargs: Any
    ) -> List[Document]:
        """Asynchronously get documents relevant to a query.
        Args:
            query: string to find relevant documents for
            callbacks: Callback manager or list of callbacks
        Returns:
            List of relevant documents
        """
        ...

As you can see, the interface is simple. You can search for documents relevant to a query with `get_relevant_documents` or its async counterpart `aget_relevant_documents`. What counts as "relevant" is defined by the specific retriever object being called.

LangChain offers many retrievers. The one it focuses on most is the vector store retriever. By default, LangChain uses `Chroma` as the vector store for indexing and searching embeddings. Let's look at a vector store retriever built on `Chroma`.

This example is a simple question answering task over documents. We use it as the most basic example because it shows how several different components are combined.

Question answering follows these four steps.
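To make the idea that each retriever defines its own notion of relevance concrete, here is a minimal, dependency-free sketch. The `Document` dataclass and `KeywordRetriever` class are hypothetical stand-ins written for this post, not LangChain classes. The sketch only borrows the method name `get_relevant_documents`, and it assumes relevance means "number of shared lowercase words".

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    """Toy stand-in for a document: text plus optional metadata."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


class KeywordRetriever:
    """Hypothetical retriever: relevance = count of words shared with the query."""

    def __init__(self, docs: List[Document], k: int = 2) -> None:
        self.docs = docs
        self.k = k

    def get_relevant_documents(self, query: str) -> List[Document]:
        query_words = set(query.lower().split())
        scored = []
        for doc in self.docs:
            overlap = len(query_words & set(doc.page_content.lower().split()))
            if overlap > 0:
                scored.append((overlap, doc))
        # Highest overlap first; the sort is stable, so ties keep input order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [doc for _, doc in scored[: self.k]]


docs = [
    Document("the president nominated a judge"),
    Document("the economy grew last year"),
    Document("a judge was confirmed by the senate"),
]
retriever = KeywordRetriever(docs, k=2)
results = retriever.get_relevant_documents("which judge was nominated")
for doc in results:
    print(doc.page_content)
```

Swapping the scoring rule (for example, for embedding distance) changes what "relevant" means without changing how the caller uses the retriever.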

1. Create an index.

2. Create a retriever from the index.

3. Create a question answering chain.

4. Ask a question.

First, set up a loader for the documents to search over.

from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.document_loaders import TextLoader
loader = TextLoader('../state_of_the_union.txt', encoding='utf8')

Next, create an index with `VectorstoreIndexCreator` and ask a question.

from langchain.indexes import VectorstoreIndexCreator
index = VectorstoreIndexCreator().from_loaders([loader])
query = "What did the president say about Ketanji Brown Jackson"
index.query_with_sources(query)

{'question': 'What did the president say about Ketanji Brown Jackson',
 'answer': " The president said that he nominated Circuit Court of Appeals Judge Ketanji Brown Jackson, one of the nation's top legal minds, to continue Justice Breyer's legacy of excellence, and that she has received a broad range of support from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.\n",
 'sources': '../state_of_the_union.txt'}

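This simple example gave us a first look at retrievers. But how does the code above actually work? How is the index created?

A lot is hidden inside `VectorstoreIndexCreator`. This section looks at it in more detail.

After the documents are loaded, the following four steps take place.

1. Split the documents into chunks

2. Embed each chunk

3. Store the chunks and their embeddings in a vector store

4. Create the index

Let's implement these steps.

First, split the text into chunks with a splitter of your choice.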

from langchain.text_splitter import CharacterTextSplitter
# Load the documents with the TextLoader created earlier.
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
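To illustrate what `chunk_size` and `chunk_overlap` control, here is a simplified, dependency-free sketch. The function `split_text` is hypothetical and is not how `CharacterTextSplitter` is implemented: LangChain's splitter also splits on a separator, which this sketch ignores. It cuts fixed-size character windows, each starting `chunk_size - chunk_overlap` characters after the previous one.

```python
from typing import List


def split_text(text: str, chunk_size: int, chunk_overlap: int = 0) -> List[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


no_overlap = split_text("abcdefghij", chunk_size=4, chunk_overlap=0)
with_overlap = split_text("abcdefghij", chunk_size=4, chunk_overlap=1)
print(no_overlap)    # ['abcd', 'efgh', 'ij']
print(with_overlap)  # ['abcd', 'defg', 'ghij']
```

Overlap repeats a little text at chunk boundaries so that a sentence cut in half still appears whole in at least one chunk, at the cost of storing more characters.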

Next, embed the split documents. Many embedding models can be used; this example uses OpenAI's embedding model. The embedded documents are stored in the vector store in the same step.

from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
embeddings = OpenAIEmbeddings()
db = Chroma.from_documents(texts, embeddings)
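The embed-and-store step can be sketched without any external service. Everything below is an assumption made for illustration: `embed` is a toy word-count embedding (not OpenAI's model), and `InMemoryVectorStore` is a hypothetical class that is not related to Chroma's implementation. It only shows the general pattern of storing text with its vector and ranking by cosine similarity.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def embed(text: str) -> Vector:
    """Toy embedding (an assumption, not a real model): lowercase word counts."""
    counts = Counter(text.lower().split())
    return {word: float(count) for word, count in counts.items()}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


class InMemoryVectorStore:
    """Hypothetical store that keeps (text, vector) pairs in a list."""

    def __init__(self) -> None:
        self._rows: List[Tuple[str, Vector]] = []

    @classmethod
    def from_texts(cls, texts: List[str]) -> "InMemoryVectorStore":
        store = cls()
        for text in texts:
            # Embed once at insert time and keep the vector next to the text.
            store._rows.append((text, embed(text)))
        return store

    def similarity_search(self, query: str, k: int = 1) -> List[str]:
        query_vector = embed(query)
        ranked = sorted(
            self._rows,
            key=lambda row: cosine(query_vector, row[1]),
            reverse=True,
        )
        return [text for text, _ in ranked[:k]]


store = InMemoryVectorStore.from_texts(
    ["the judge was nominated", "the economy grew", "taxes were cut"]
)
top = store.similarity_search("who nominated the judge", k=1)
print(top)
```

A real vector store replaces the linear scan with an index built for nearest-neighbor search, but the contract is the same: text in, the closest stored texts out.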

Finally, we create the index (exposed here as a retriever) and ask the same question as in the example above.

retriever = db.as_retriever()
qa = RetrievalQA.from_chain_type(llm=OpenAI(), chain_type="stuff", retriever=retriever)
query = "What did the president say about Ketanji Brown Jackson"
qa.run(query)

" The President said that Judge Ketanji Brown Jackson is one of the nation's top legal minds, a former top litigator in private practice, a former federal public defender, and from a family of public school educators and police officers. He said she is a consensus builder and has received a broad range of support from organizations such as the Fraternal Order of Police and former judges appointed by Democrats and Republicans."

`VectorstoreIndexCreator` is just a wrapper around this logic. It collapses the steps above into a single call, and it is configurable, so you can plug in the splitter, vector store, and embedding model you want.

2. MultiQueryRetriever

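Distance-based vector database retrieval embeds the query in a high-dimensional vector space and returns the documents that are closest to it. However, retrieval can produce quite different results when the wording of the query changes slightly or when the embeddings fail to capture the semantics of the data. Prompt engineering and tuning are sometimes done by hand to work around this, but that is tedious.

`MultiQueryRetriever` automates this prompt tuning by using an LLM to generate multiple queries from different perspectives for a given input query. Each query retrieves its own set of relevant documents, and the retriever takes the unique union across all queries to obtain a larger set of potentially relevant documents. By generating multiple perspectives on the same question, `MultiQueryRetriever` may overcome some of the limitations of distance-based retrieval and return a richer set of results.

This section walks through an example that loads a blog post hosted on GitHub and asks questions about it.

First, load the blog post, split it into chunks, and store them in a vector store.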
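The generate-retrieve-merge flow described above can be sketched in plain Python. This is an illustration under stated assumptions, not LangChain's implementation: `fake_generate_queries` is a hard-coded stand-in for the LLM, `keyword_retrieve` is a toy word-overlap retriever, and `multi_query_retrieve` is a hypothetical helper that merges results and drops duplicates.

```python
from typing import Callable, List


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
) -> List[str]:
    """Run every generated query and return the unique union of the results."""
    seen = set()
    merged = []
    for query in generate_queries(question):
        for doc in retrieve(query):
            if doc not in seen:  # keep only the first occurrence of each doc
                seen.add(doc)
                merged.append(doc)
    return merged


corpus = [
    "task decomposition by prompting",
    "subgoals make large work manageable",
    "memory stores past observations",
]


def keyword_retrieve(query: str) -> List[str]:
    """Toy retriever: return the single doc sharing the most words, if any."""
    words = set(query.lower().split())
    best = max(corpus, key=lambda doc: len(words & set(doc.split())))
    return [best] if words & set(best.split()) else []


def fake_generate_queries(question: str) -> List[str]:
    """Stand-in for the LLM: fixed rephrasings, ignoring the input question."""
    return [
        "how can task decomposition be approached",
        "methods for splitting a task",
        "how to break work into subgoals",
    ]


question = "approaches to task decomposition"
single = keyword_retrieve(question)
combined = multi_query_retrieve(question, fake_generate_queries, keyword_retrieve)
print(single)
print(combined)
```

In this toy corpus the original wording reaches only the first document, while the third rephrasing also reaches the document about subgoals, and the document matched by two queries appears only once in the merged list.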

from langchain.vectorstores import Chroma
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter

loader = WebBaseLoader("https://lilianweng.github.io/posts/2023-06-23-agent/")
data = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=0)
splits = text_splitter.split_documents(data)

embedding = OpenAIEmbeddings()
vectordb = Chroma.from_documents(documents=splits, embedding=embedding)

Next, define the LLM used for query generation and the retriever object.

from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever

question = "What are the approaches to Task Decomposition?"
llm = ChatOpenAI(temperature=0)
retriever_from_llm = MultiQueryRetriever.from_llm(
    retriever=vectordb.as_retriever(), llm=llm
)

Use these objects to generate queries similar to the question and retrieve the related documents.

# Enable logging to see the generated queries.
import logging

logging.basicConfig()
logging.getLogger("langchain.retrievers.multi_query").setLevel(logging.INFO)

unique_docs = retriever_from_llm.get_relevant_documents(query=question)
for i in unique_docs:
    print('Next docs \n', i.page_content, '\n\n')

INFO:langchain.retrievers.multi_query:Generated queries: ['1. How can Task Decomposition be approached?', '2. What are the different methods for Task Decomposition?', '3. What are the various approaches to decomposing tasks?']

Next docs 
 Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs. 

Next docs 
 Fig. 1. Overview of a LLM-powered autonomous agent system.
Component One: Planning#
A complicated task usually involves many steps. An agent needs to know what they are and plan ahead.
Task Decomposition# 

Next docs 
 Challenges in long-term planning and task decomposition: Planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, making them less robust compared to humans who learn from trial and error. 

Next docs 
 Planning

Subgoal and decomposition: The agent breaks down large tasks into smaller, manageable subgoals, enabling efficient handling of complex tasks.
Reflection and refinement: The agent can do self-criticism and self-reflection over past actions, learn from mistakes and refine them for future steps, thereby improving the quality of final results.


Memory 

Next docs 
 (3) Task execution: Expert models execute on the specific tasks and log results.
Instruction:

Finally, you can influence query generation by supplying your own prompt. Let's generate queries with a custom prompt and an output parser, then retrieve the related documents.
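The output parser's job is small: turn the LLM's newline-separated text into a list of queries. Here is a dependency-free sketch of that step. The function `parse_lines` and the sample `raw_output` string are hypothetical. Unlike a plain `split("\n")`, this sketch also drops blank lines, which is a design choice of the sketch and not something the source requires.

```python
from typing import List


def parse_lines(text: str) -> List[str]:
    """Split raw LLM output into one query per non-empty line."""
    return [line.strip() for line in text.strip().split("\n") if line.strip()]


# Made-up example of what an LLM might return for the prompt.
raw_output = "1. How can tasks be decomposed?\n\n2. What methods split a task?\n"
queries = parse_lines(raw_output)
print(queries)
```

Each element of the resulting list is then sent to the underlying retriever as its own query.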

from typing import List
from langchain import LLMChain
from pydantic import BaseModel, Field
from langchain.prompts import PromptTemplate
from langchain.output_parsers import PydanticOutputParser

# The output parser splits the queries generated by the LLM into a list.
class LineList(BaseModel):
    lines: List[str] = Field(description="Lines of text")

class LineListOutputParser(PydanticOutputParser):
    def __init__(self) -> None:
        super().__init__(pydantic_object=LineList)

    def parse(self, text: str) -> LineList:
        lines = text.strip().split("\n")
        return LineList(lines=lines)


output_parser = LineListOutputParser()

QUERY_PROMPT = PromptTemplate(
    input_variables=["question"],
    template="""You are an AI language model assistant. Your task is to generate five 
    different versions of the given user question to retrieve relevant documents from a vector 
    database. By generating multiple perspectives on the user question, your goal is to help
    the user overcome some of the limitations of the distance-based similarity search. 
    Provide these alternative questions separated by newlines.
    Original question: {question}""",
)
llm = ChatOpenAI(temperature=0)

llm_chain = LLMChain(llm=llm, prompt=QUERY_PROMPT, output_parser=output_parser)

question = "What are the approaches to Task Decomposition?"

retriever = MultiQueryRetriever(
    retriever=vectordb.as_retriever(), llm_chain=llm_chain, parser_key="lines"
)  # "lines" is the key (attribute name) of the parsed output

unique_docs = retriever.get_relevant_documents(
    query="What does the course say about regression?"
)
for i in unique_docs:
    print('Next docs \n', i.page_content, '\n\n')

INFO:langchain.retrievers.multi_query:Generated queries: ["1. What is the course's perspective on regression?", '2. Can you provide information on regression as discussed in the course?', '3. How does the course cover the topic of regression?', "4. What are the course's teachings on regression?", '5. In relation to the course, what is mentioned about regression?']
Next docs 
 }
]
Challenges#
After going through key ideas and demos of building LLM-centered agents, I start to see a couple common limitations: 

Next docs 
 to start a new trial depending on the self-reflection results. 

Next docs 
 learning history and feeds that into the model. Hence we should expect the next predicted action to lead to better performance than previous trials. The goal is to learn the process of RL instead of training a task-specific policy itself. 

Next docs 
 Self-reflection is created by showing two-shot examples to LLM and each example is a pair of (failed trajectory, ideal reflection for guiding future changes in the plan). Then reflections are added into the agent’s working memory, up to three, to be used as context for querying LLM. 

Next docs 
 (2) Model selection: LLM distributes the tasks to expert models, where the request is framed as a multiple-choice question. LLM is presented with a list of models to choose from. Due to the limited context length, task type based filtration is needed.
Instruction: 

Next docs 
 Task decomposition can be done (1) by LLM with simple prompting like "Steps for XYZ.\n1.", "What are the subgoals for achieving XYZ?", (2) by using task-specific instructions; e.g. "Write a story outline." for writing a novel, or (3) with human inputs. 

Next docs 
 They did an experiment on fine-tuning LLM to call a calculator, using arithmetic as a test case. Their experiments showed that it was harder to solve verbal math problems than explicitly stated math problems because LLMs (7B Jurassic1-large model) failed to extract the right arguments for the basic arithmetic reliably. The results highlight when the external symbolic tools can work reliably, knowing when to and how to use the tools are crucial, determined by the LLM capability. 

Next docs 
 (3) Task execution: Expert models execute on the specific tasks and log results.
Instruction: 

Next docs 
 Fig. 5. After fine-tuning with CoH, the model can follow instructions to produce outputs with incremental improvement in a sequence. (Image source: Liu et al. 2023) 

Next docs 
 \dots \geq r_1$ The process is supervised fine-tuning where the data is a sequence in the form of $\tau_h = (x, z_i, y_i, z_j, y_j, \dots, z_n, y_n)$, where $\leq i \leq j \leq n$. The model is finetuned to only predict $y_n$ where conditioned on the sequence prefix, such that the model can self-reflect to produce better output based on the feedback sequence. The model can optionally receive multiple rounds of instructions with human annotators at test time. 

Next docs 
 ... (Repeated many times)