[Retrieval] Document transformers

LangChain

by LYShin 2023. 8. 24. 20:00

- Source: https://python.langchain.com/docs/modules/data_connection/

- This post is a translation based on the LangChain API documentation, with a few additions made along the way.

- Document transformers covers the various ways LangChain can transform documents.

- This post covers text splitters and post-retrieval processing.

Document transformers

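Once you have loaded documents, you often need to transform them to suit your application. The simplest example is splitting a long document into several shorter ones that fit your model. LangChain provides many built-in document transformers that make it easy to split, combine, and filter documents.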

1. Text splitters

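When you work with long documents, splitting them into smaller chunks is essential. This sounds simple, but there are potential difficulties. Ideally, semantically related pieces of text should stay together. What "semantically related" means depends on the type of document. This section covers several ways to do that.

At a high level, text splitters follow this workflow:

1. Split the text into small, semantically meaningful chunks (often sentences).

2. Combine these small chunks into a larger chunk until it reaches a certain size.

3. Once a chunk reaches that size, start a new chunk that includes some overlap with the previous one.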
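The three steps above can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: the function name `split_with_overlap`, the sentence regex, and the greedy merge rule are all assumptions made for this example, and a single sentence longer than `chunk_size` is simply kept as-is.

```python
import re


def split_with_overlap(text, chunk_size=100, chunk_overlap=20):
    """Greedy split/merge with overlap (hypothetical helper, illustration only)."""
    # Step 1: split into small pieces (here: sentences ending in . ! or ?).
    pieces = [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]

    chunks, current = [], []
    for piece in pieces:
        candidate = " ".join(current + [piece])
        # Step 2: keep merging pieces while the chunk stays within chunk_size.
        if current and len(candidate) > chunk_size:
            chunks.append(" ".join(current))
            # Step 3: carry trailing pieces over as overlap for the next chunk,
            # as long as the overlap and the next chunk both stay within limits.
            carried = []
            for prev in reversed(current):
                overlap_text = " ".join([prev] + carried)
                next_text = " ".join([prev] + carried + [piece])
                if len(overlap_text) > chunk_overlap or len(next_text) > chunk_size:
                    break
                carried.insert(0, prev)
            current = carried
        current.append(piece)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_with_overlap("One. Two. Three. Four.", chunk_size=12, chunk_overlap=5))
# ['One. Two.', 'Two. Three.', 'Four.']
```

In the sample output, "Two." appears in both the first and second chunk: that repeated sentence is the overlap.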

1.1 Text Splitter Basic

The recommended default text splitter is `RecursiveCharacterTextSplitter`. It takes a list of separator characters and first tries to split the text on the first one. If a resulting chunk is still too large, it moves on to the next separator, and so on. The default separator list is ["\n\n", "\n", " ", ""].
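The fallback from one separator to the next can be illustrated with a short plain-Python sketch. It is not the library's code: `recursive_split` is a hypothetical helper written for this post, and it ignores overlap entirely to keep the recursion easy to follow.

```python
def recursive_split(text, separators=("\n\n", "\n", " ", ""), chunk_size=100):
    """Split on the first separator; recurse with the next one for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on, keep the oversized piece
    sep, rest = separators[0], separators[1:]
    # An empty separator means "split into individual characters".
    pieces = list(text) if sep == "" else text.split(sep)

    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Too big even on its own: flush what we have, then try the next separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, rest, chunk_size))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # still fits, keep merging
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


print(recursive_split("aaa bbb\n\nccc ddd eee", chunk_size=7))
# ['aaa bbb', 'ccc ddd', 'eee']
```

The first paragraph fits after splitting on "\n\n", while the second is too long and is only broken up once the sketch falls back to splitting on spaces.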

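When splitting documents this way, you can adjust the following settings:

1. length_function
The function used to compute the length of a chunk. By default it counts characters, but it is quite common to pass a token counter instead.

2. chunk_size
The maximum size of a chunk.

3. chunk_overlap
The maximum overlap between chunks. Some overlap helps maintain continuity between chunks.

4. add_start_index
Whether to include each chunk's starting position within the original document in its metadata.

Let's use the `RecursiveCharacterTextSplitter` class to split a document into several small chunks.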

with open('../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 100,
    chunk_overlap  = 20,
    length_function = len,
    add_start_index = True,
)
texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])
print(texts[1])

page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and' metadata={'start_index': 0}
page_content='of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.' metadata={'start_index': 82}
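To see how `chunk_overlap` and `start_index` relate in the output above, the two printed chunks can be compared directly. The helper `shared_overlap` below is a hypothetical function written for this check; it only inspects the two strings and does not call LangChain.

```python
chunk_0 = "Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and"
chunk_1 = "of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans."


def shared_overlap(left, right):
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


overlap = shared_overlap(chunk_0, chunk_1)
print(repr(overlap), len(overlap))   # 'of Congress and' 15
print(len(chunk_0) - len(overlap))   # 82
```

The overlap is 15 characters, which stays under the configured maximum of 20, and the second chunk starts at character 82 of the original text, matching the `start_index` in its metadata.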

1.2 Split by character

This is the simplest method. It splits the document on a given separator character and measures chunk length by the number of characters.

with open('../../../state_of_the_union.txt') as f:
    state_of_the_union = f.read()
    
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter(        
    separator = "\n\n",
    chunk_size = 1000,
    chunk_overlap  = 200,
    length_function = len,
    is_separator_regex = False,
)

texts = text_splitter.create_documents([state_of_the_union])
print(texts[0])

page_content='Madam Speaker, Madam Vice President, ...(omitted)... , inspires the world.' lookup_str='' metadata={} lookup_index=0

1.3 Split code

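`CodeTextSplitter` supports multiple programming languages and helps split code in a way that suits each language. This post looks at an example for Python code, which goes through `RecursiveCharacterTextSplitter.from_language`. For splitters that handle other kinds of input, such as Markdown and HTML, see the official LangChain documentation.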

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    Language,
)

PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

# Call the function
hello_world()
"""
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON, chunk_size=50, chunk_overlap=0
)
python_docs = python_splitter.create_documents([PYTHON_CODE])
python_docs

[Document(page_content='def hello_world():\n    print("Hello, World!")', metadata={}),
 Document(page_content='# Call the function\nhello_world()', metadata={})]

1.4 MarkdownHeaderTextSplitter

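Many chatbot and Q+A applications split documents into chunks before embedding them and storing them in a vector store. Chunking generally aims to keep text that shares a common context together. With this in mind, it can be useful to pay special attention to the structure of the document itself.

For example, a Markdown file is organized by headers. Creating chunks within specific header groups is an intuitive idea. To do this, we can use `MarkdownHeaderTextSplitter`, which splits a Markdown file based on a specified set of headers.

For example, suppose we want to split the following Markdown: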

md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'

We can specify the headers to split on:

[("#", "Header 1"),("##", "Header 2")]

The content is then grouped and split by those headers:

{'content': 'Hi this is Jim  \nHi this is Joe', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Bar'}}
{'content': 'Hi this is Molly', 'metadata': {'Header 1': 'Foo', 'Header 2': 'Baz'}}
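The grouping idea can be sketched in plain Python before turning to the library class. The function `group_by_headers` is a hypothetical helper written for this post; it strips whitespace from every line, so its whitespace handling is simpler than the library's and the content strings may differ slightly from real output.

```python
def group_by_headers(markdown, headers_to_split_on):
    """Group body lines under the most recent headers (simplified sketch)."""
    # Check longer markers first so "##" is matched before "#".
    markers = sorted(headers_to_split_on, key=lambda h: len(h[0]), reverse=True)
    active = {}  # header level (number of '#') -> (metadata key, header title)
    groups = []  # list of {"content": ..., "metadata": ...}
    body = []    # body lines collected since the last header

    def flush():
        if body:
            metadata = {name: title for _, (name, title) in sorted(active.items())}
            groups.append({"content": "\n".join(body), "metadata": metadata})
            body.clear()

    for raw in markdown.split("\n"):
        line = raw.strip()
        for marker, name in markers:
            if line.startswith(marker + " "):
                flush()
                level = len(marker)
                # A new header closes any header at the same or a deeper level.
                for deeper in [lvl for lvl in active if lvl >= level]:
                    del active[deeper]
                active[level] = (name, line[len(marker):].strip())
                break
        else:
            if line:
                body.append(line)
    flush()
    return groups


md = '# Foo\n\n ## Bar\n\nHi this is Jim  \nHi this is Joe\n\n ## Baz\n\n Hi this is Molly'
groups = group_by_headers(md, [("#", "Header 1"), ("##", "Header 2")])
for group in groups:
    print(group)
```

Each group carries the titles of all headers it sits under, which is what lets a later retrieval step filter or label chunks by section.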

Now let's split Markdown using the `MarkdownHeaderTextSplitter` class.

from langchain.text_splitter import MarkdownHeaderTextSplitter
markdown_document = "# Foo\n\n    ## Bar\n\nHi this is Jim\n\nHi this is Joe\n\n ### Boo \n\n Hi this is Lance \n\n ## Baz\n\n Hi this is Molly"

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)
print(md_header_splits)

[Document(page_content='Hi this is Jim  \nHi this is Joe', metadata={'Header 1': 'Foo', 'Header 2': 'Bar'}),
 Document(page_content='Hi this is Lance', metadata={'Header 1': 'Foo', 'Header 2': 'Bar', 'Header 3': 'Boo'}),
 Document(page_content='Hi this is Molly', metadata={'Header 1': 'Foo', 'Header 2': 'Baz'})]

Within each Markdown group, we can then apply any other text splitter to split the chunks further.

markdown_document = "# Intro \n\n    ## History \n\n Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9] \n\n Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files. \n\n ## Rise and divergence \n\n As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for \n\n additional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks. \n\n #### Standardization \n\n From 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort. \n\n ## Implementations \n\n Implementations of Markdown are available for over a dozen programming languages."

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
]

# MD splits
markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
md_header_splits = markdown_splitter.split_text(markdown_document)

# Char-level splits
from langchain.text_splitter import RecursiveCharacterTextSplitter

chunk_size = 250
chunk_overlap = 30
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(md_header_splits)
splits

[Document(page_content='Markdown[9] is a lightweight markup language for creating formatted text using a plain-text editor. John Gruber created Markdown in 2004 as a markup language that is appealing to human readers in its source code form.[9]', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
 Document(page_content='Markdown is widely used in blogging, instant messaging, online forums, collaborative software, documentation pages, and readme files.', metadata={'Header 1': 'Intro', 'Header 2': 'History'}),
 Document(page_content='As Markdown popularity grew rapidly, many Markdown implementations appeared, driven mostly by the need for  \nadditional features such as tables, footnotes, definition lists,[note 1] and Markdown inside HTML blocks.  \n#### Standardization', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
 Document(page_content='#### Standardization  \nFrom 2012, a group of people, including Jeff Atwood and John MacFarlane, launched what Atwood characterised as a standardisation effort.', metadata={'Header 1': 'Intro', 'Header 2': 'Rise and divergence'}),
 Document(page_content='Implementations of Markdown are available for over a dozen programming languages.', metadata={'Header 1': 'Intro', 'Header 2': 'Implementations'})]

1.5 Split by tokens

Language models have a token limit, and you cannot pass in text that exceeds it. When splitting a document into chunks, it is therefore a good idea to measure length in tokens. Many tokenizers exist, so split the document with the same tokenizer that your target model uses.
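As a rough illustration of measuring chunk size in tokens rather than characters, the sketch below uses a stand-in tokenizer. Both `whitespace_tokenize` and `split_by_tokens` are hypothetical names invented for this example; a real subword tokenizer produces different token counts, and joining tokens with spaces only makes sense for this whitespace stand-in.

```python
def whitespace_tokenize(text):
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""
    return text.split()


def split_by_tokens(text, chunk_size, tokenize=whitespace_tokenize):
    """Cut `text` into chunks of at most `chunk_size` tokens."""
    tokens = tokenize(text)
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
    return chunks


sample = "Last year COVID-19 kept us apart. This year we are finally together again."
chunks = split_by_tokens(sample, chunk_size=6)
for chunk in chunks:
    # Token count is bounded by chunk_size; character count varies per chunk.
    print(len(whitespace_tokenize(chunk)), "tokens,", len(chunk), "chars:", chunk)
```

The point of the sketch is that chunks with the same token budget can have quite different character lengths, which is why a character-based `chunk_size` is only an approximation of a model's token limit.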

This post shows how to split a document using tiktoken, the tokenizer used by OpenAI models.

# This is a long document we can split up.
with open("../../../state_of_the_union.txt") as f:
    state_of_the_union = f.read()
    
from langchain.text_splitter import CharacterTextSplitter
text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=100, chunk_overlap=0
)

texts = text_splitter.split_text(state_of_the_union)
print(texts[0])

Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans.  

Last year COVID-19 kept us apart. This year we are finally together again. 

Tonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. 

With a duty to one another to the American people to the Constitution.

2. Lost in the middle : The problem with long contexts

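Regardless of which model you use, performance can degrade when you include 10 or more retrieved documents. Research has shown that when a model must access relevant information in the middle of a long context, it tends to ignore the provided documents.

To avoid this degradation, we can reorder the documents after retrieval.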

import os
import chromadb
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_transformers import (
    LongContextReorder,
)
from langchain.chains import StuffDocumentsChain, LLMChain
from langchain.prompts import PromptTemplate
from langchain.llms import OpenAI

# Get embeddings.
embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

texts = [
    "Basquetball is a great sport.",
    "Fly me to the moon is one of my favourite songs.",
    "The Celtics are my favourite team.",
    "This is a document about the Boston Celtics",
    "I simply love going to the movies",
    "The Boston Celtics won the game by 20 points",
    "This is just a random text.",
    "Elden Ring is one of the best games in the last 15 years.",
    "L. Kornet is one of the best Celtics players.",
    "Larry Bird was an iconic NBA player.",
]

# Create a retriever
retriever = Chroma.from_texts(texts, embedding=embeddings).as_retriever(
    search_kwargs={"k": 10}
)
query = "What can you tell me about the Celtics?"

# Get relevant documents ordered by relevance score
docs = retriever.get_relevant_documents(query)
docs

[Document(page_content='This is a document about the Boston Celtics', metadata={}),
 Document(page_content='The Celtics are my favourite team.', metadata={}),
 ...(omitted)...
 Document(page_content='Fly me to the moon is one of my favourite songs.', metadata={}),
 Document(page_content='This is just a random text.', metadata={})]

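After reordering, less relevant documents sit in the middle of the list and more relevant documents sit at the beginning and end. As shown below, the four most relevant documents end up at the very front and back of the list.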

reordering = LongContextReorder()
reordered_docs = reordering.transform_documents(docs)

print(reordered_docs)

[Document(page_content='The Celtics are my favourite team.', metadata={}),
 Document(page_content='The Boston Celtics won the game by 20 points', metadata={}),
 Document(page_content='Elden Ring is one of the best games in the last 15 years.', metadata={}),
 ...(omitted)...
 Document(page_content='Larry Bird was an iconic NBA player.', metadata={}),
 Document(page_content='L. Kornet is one of the best Celtics players.', metadata={}),
 Document(page_content='This is a document about the Boston Celtics', metadata={})]
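One simple way to get this kind of ordering is to walk the ranked list from least to most relevant and alternate between the front and the back of a new list. The sketch below is consistent with the output shown above, but it is an illustration under that assumption rather than a statement of how `LongContextReorder` is implemented; `reorder_for_long_context` is a hypothetical name.

```python
def reorder_for_long_context(docs_by_relevance):
    """Put the most relevant items at both ends and the least relevant in the middle."""
    reordered = []
    # Walk from least to most relevant, alternating between the front and the back.
    for i, doc in enumerate(reversed(docs_by_relevance)):
        if i % 2 == 1:
            reordered.append(doc)
        else:
            reordered.insert(0, doc)
    return reordered


ranks = list(range(1, 11))  # 1 = most relevant, 10 = least relevant
print(reorder_for_long_context(ranks))
# [2, 4, 6, 8, 10, 9, 7, 5, 3, 1]
```

With ten ranked items, ranks 1 to 4 land in the first two and last two positions, while ranks 9 and 10 end up in the middle.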
