[Retrieval] Document loaders


LYShin 2023. 8. 24. 19:30

- Source: https://python.langchain.com/docs/modules/data_connection/

- This post is a translation based on the LangChain API documentation, with a few additions made along the way.

- The Document loaders section describes the various ways to load documents into LangChain.

- This post covers how to load several document types, including CSV, HTML, and PDF.

Document loaders

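Document loaders load documents from a variety of sources. A `Document` is a piece of text together with its associated metadata. For example, there is a document loader for a simple `.txt` file and another for the contents of a web page.

Every document loader exposes a `load` method, which reads data from the configured source and returns it as documents.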
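To make the idea concrete, here is a minimal sketch of a document and a loader using only the Python standard library. The `Document` dataclass and the `SimpleTextLoader` class are hypothetical stand-ins written for this post, not LangChain's own classes; they only mirror the `page_content` / `metadata` / `load()` shape described above.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List
import tempfile


@dataclass
class Document:
    """Simplified stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SimpleTextLoader:
    """Hypothetical loader that turns one text file into a list of documents."""

    def __init__(self, file_path: str, encoding: str = "utf-8"):
        self.file_path = file_path
        self.encoding = encoding

    def load(self) -> List[Document]:
        # Read the whole file and record where it came from.
        text = Path(self.file_path).read_text(encoding=self.encoding)
        return [Document(page_content=text, metadata={"source": self.file_path})]


# Usage: write a small file, then load it back as a Document.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "note.txt"
    sample.write_text("hello loaders", encoding="utf-8")
    docs = SimpleTextLoader(str(sample)).load()

print(docs[0].page_content)
print(docs[0].metadata)
```

The real loaders below follow the same pattern: configure the source in the constructor, then call `load()`.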

1. CSV

`CSVLoader` reads a CSV file and represents each row as one Document.

from langchain.document_loaders.csv_loader import CSVLoader

loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv')
data = loader.load()
print(data)

[Document(page_content='Team: Nationals\n"Payroll (millions)": 81.34\n"Wins": 98', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 0}, lookup_index=0), ...]

The `csv_args` argument customizes how the file is parsed. The supported arguments are described in the documentation of Python's csv module.

loader = CSVLoader(file_path='./example_data/mlb_teams_2012.csv', csv_args={
    'delimiter': ',',
    'quotechar': '"',
    'fieldnames': ['MLB Team', 'Payroll in millions', 'Wins']
})

data = loader.load()
print(data)

[Document(page_content='MLB Team: Team\nPayroll in millions: "Payroll (millions)"\nWins: "Wins"', lookup_str='', metadata={'source': './example_data/mlb_teams_2012.csv', 'row': 0}, lookup_index=0), ...]

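Use the `source_column` argument to specify the source of the document created from each row. If it is not set, `file_path` is used as the source for every document. This is useful when documents loaded from a CSV file are passed to chains that answer questions using sources.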
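The row-to-document mapping, the effect of `fieldnames`, and the role of `source_column` can be sketched with the standard `csv` module alone. This is an illustration of the behaviour described above, not `CSVLoader`'s actual implementation: the helper `rows_to_documents` is hypothetical, and the second data row is made-up sample data.

```python
import csv
import io

# First data row comes from the example above; the second row is made up.
CSV_TEXT = 'Team,"Payroll (millions)",Wins\nNationals,81.34,98\nSample Team,10.0,50\n'
FILE_PATH = "./example_data/mlb_teams_2012.csv"


def rows_to_documents(csv_text, file_path, source_column=None, csv_args=None):
    """Turn each CSV row into a dict with page_content and metadata."""
    reader = csv.DictReader(io.StringIO(csv_text), **(csv_args or {}))
    documents = []
    for row_index, row in enumerate(reader):
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        # Without source_column, every document points back at the file.
        source = row[source_column] if source_column else file_path
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_index}}
        )
    return documents


default_docs = rows_to_documents(CSV_TEXT, FILE_PATH)
team_docs = rows_to_documents(CSV_TEXT, FILE_PATH, source_column="Team")
# Passing fieldnames makes the csv module treat the header line as data (row 0).
renamed_docs = rows_to_documents(
    CSV_TEXT,
    FILE_PATH,
    csv_args={"fieldnames": ["MLB Team", "Payroll in millions", "Wins"]},
)

print(default_docs[0]["metadata"])
print(team_docs[0]["metadata"])
print(renamed_docs[0]["page_content"])
```

Note that supplying `fieldnames` explains the second output above: the original header line is no longer consumed as a header, so it shows up as the first document.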

2. File Directory

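This section covers loading every file of a given type from a directory in one call. Basic usage is as follows.

The example below loads every file with the `.md` extension under the parent directory, including subdirectories.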

from langchain.document_loaders import DirectoryLoader
loader = DirectoryLoader('../', glob="**/*.md")
docs = loader.load()

By default, `DirectoryLoader` uses the `UnstructuredFileLoader` class, but you can change the loader type as needed with `loader_cls`. If the files are plain text, you can use `TextLoader` instead of the Unstructured loader.

from langchain.document_loaders import TextLoader
loader = DirectoryLoader('../', glob="**/*.md", loader_cls=TextLoader)
docs = loader.load()

If the files to load are Python source files (`.py`), you can use `PythonLoader`.

from langchain.document_loaders import PythonLoader
loader = DirectoryLoader('../../../../../', glob="**/*.py", loader_cls=PythonLoader)
docs = loader.load()

Finally, `TextLoader` can detect a file's encoding automatically, which helps when the directory contains text files in mixed or unknown encodings. Without this, if any file uses an encoding other than the expected one, `load()` fails with an error message. By default, a single failing file makes the whole loading process fail, and no documents are loaded at all. Two arguments help in this situation.

The first is `silent_errors`. When a file in the directory fails to load, the loader skips it and moves on to the next file.

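# Directory holding the sample files; inferred from the paths in the output below.
path = '../../../../../tests/integration_tests/examples'
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, silent_errors=True)
docs = loader.load()
doc_sources = [doc.metadata['source'] for doc in docs]
print(doc_sources)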

Error loading ../../../../../tests/integration_tests/examples/example-non-utf8.txt

['../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
 '../../../../../tests/integration_tests/examples/example-utf8.txt']

The second is encoding auto-detection. By passing `autodetect_encoding` to `TextLoader`, you can ask it to detect the file encoding before giving up on a file.

# Ask TextLoader to detect the encoding before failing on a file.
text_loader_kwargs = {'autodetect_encoding': True}
loader = DirectoryLoader(path, glob="**/*.txt", loader_cls=TextLoader, loader_kwargs=text_loader_kwargs)
docs = loader.load()
doc_sources = [doc.metadata['source'] for doc in docs]
print(doc_sources)

    ['../../../../../tests/integration_tests/examples/example-non-utf8.txt',
     '../../../../../tests/integration_tests/examples/whatsapp_chat.txt',
     '../../../../../tests/integration_tests/examples/example-utf8.txt']
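The difference between the default behaviour, `silent_errors`, and encoding detection can be reproduced with a small standard-library sketch. The helpers `read_text_file` and `load_directory` are hypothetical, and trying a fixed list of fallback encodings is a simplification chosen for this sketch; it is not how `TextLoader` detects encodings.

```python
from pathlib import Path
import tempfile


def read_text_file(path, fallback_encodings=()):
    """Read a file as UTF-8, then try each fallback encoding in order."""
    for encoding in ("utf-8", *fallback_encodings):
        try:
            return path.read_text(encoding=encoding)
        except UnicodeDecodeError:
            continue
    raise RuntimeError(f"Error loading {path}")


def load_directory(root, pattern, silent_errors=False, fallback_encodings=()):
    """Return one {'source', 'page_content'} dict per file matching pattern."""
    documents = []
    for path in sorted(Path(root).glob(pattern)):
        if not path.is_file():
            continue
        try:
            text = read_text_file(path, fallback_encodings)
        except RuntimeError as error:
            if not silent_errors:
                raise  # default: one unreadable file aborts the whole load
            print(error)  # silent_errors=True: report the file and skip it
            continue
        documents.append({"source": path.name, "page_content": text})
    return documents


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "utf8.txt").write_text("plain text", encoding="utf-8")
    (Path(tmp) / "legacy.txt").write_bytes("café".encode("latin-1"))  # invalid UTF-8

    skipped = load_directory(tmp, "**/*.txt", silent_errors=True)
    recovered = load_directory(tmp, "**/*.txt", fallback_encodings=("latin-1",))
    try:
        load_directory(tmp, "**/*.txt")
        strict_failed = False
    except RuntimeError:
        strict_failed = True

print([doc["source"] for doc in skipped])
print([doc["source"] for doc in recovered])
print(strict_failed)
```

Skipping loses the unreadable file entirely, while falling back to another encoding keeps it, which matches the two outputs shown above.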

3. HTML

This section covers loading HTML documents.

from langchain.document_loaders import UnstructuredHTMLLoader
loader = UnstructuredHTMLLoader("example_data/fake-content.html")
data = loader.load()
print(data)

[Document(page_content='My First Heading\n\nMy first paragraph.', lookup_str='', metadata={'source': 'example_data/fake-content.html'}, lookup_index=0)]

You can also load HTML documents with BeautifulSoup4 by using `BSHTMLLoader`. It extracts the text of the HTML into `page_content` and the page title into `metadata` under the `title` key.

from langchain.document_loaders import BSHTMLLoader
loader = BSHTMLLoader("example_data/fake-content.html")
data = loader.load()
print(data)

[Document(page_content='\n\nTest Title\n\n\nMy First Heading\nMy first paragraph.\n\n\n', metadata={'source': 'example_data/fake-content.html', 'title': 'Test Title'})]
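As a rough illustration of what "text into `page_content`, title into `metadata`" means, the sketch below does the same split with the standard `html.parser` module. It deliberately does not use BeautifulSoup4, so whitespace handling differs from the output above. The parser class, the helper, and the sample HTML (reconstructed from the printed output) are assumptions made for this example.

```python
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        if data.strip():
            self.chunks.append(data.strip())


def html_to_document(html, source):
    """Return a dict shaped like a loaded document."""
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()
    return {
        "page_content": "\n".join(parser.chunks),
        "metadata": {"source": source, "title": parser.title.strip()},
    }


# Sample page reconstructed from the output above (an assumption).
HTML = (
    "<html><head><title>Test Title</title></head>"
    "<body><h1>My First Heading</h1><p>My first paragraph.</p></body></html>"
)
doc = html_to_document(HTML, "example_data/fake-content.html")
print(doc)
```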

4. JSON

This section covers loading documents from JSON files. `JSONLoader` parses a JSON file using a specified jq schema, which requires the `jq` Python package.

Suppose we have a JSON file with the contents shown below. Parsing it with `json.loads()` yields an ordinary Python dictionary.

from langchain.document_loaders import JSONLoader
import json
from pathlib import Path
from pprint import pprint


file_path='./example_data/facebook_chat.json'
data = json.loads(Path(file_path).read_text())
pprint(data)

    {'image': {'creation_timestamp': 1675549016, 'uri': 'image_of_the_chat.jpg'},
     'is_still_participant': True,
     'joinable_mode': {'link': '', 'mode': 1},
     'magic_words': [],
     'messages': [{'content': 'Bye!',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675597571851},
                  {'content': 'Oh no worries! Bye',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675597435669},
                  {'content': 'No Im sorry it was my mistake, the blue one is not '
                              'for sale',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675596277579},
                  {'content': 'I thought you were selling the blue one!',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675595140251},
                  {'content': 'Im not interested in this bag. Im interested in the '
                              'blue one!',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675595109305},
                  {'content': 'Here is $129',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675595068468},
                  {'photos': [{'creation_timestamp': 1675595059,
                               'uri': 'url_of_some_picture.jpg'}],
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675595060730},
                  {'content': 'Online is at least $100',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675595045152},
                  {'content': 'How much do you want?',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675594799696},
                  {'content': 'Goodmorning! $50 is too low.',
                   'sender_name': 'User 2',
                   'timestamp_ms': 1675577876645},
                  {'content': 'Hi! Im interested in your bag. Im offering $50. Let '
                              'me know if you are interested. Thanks!',
                   'sender_name': 'User 1',
                   'timestamp_ms': 1675549022673}],
     'participants': [{'name': 'User 1'}, {'name': 'User 2'}],
     'thread_path': 'inbox/User 1 and User 2 chat',
     'title': 'User 1 and User 2 chat'}

We are interested in the `content` field of each entry under the `messages` key. `JSONLoader` makes this easy to extract.

loader = JSONLoader(
    file_path='./example_data/facebook_chat.json',
    jq_schema='.messages[].content')

data = loader.load()
pprint(data)

[Document(page_content='Bye!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 1}),
 Document(page_content='Oh no worries! Bye', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 2}),
 ...
 Document(page_content='Goodmorning! $50 is too low.', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 10}),
 Document(page_content='Hi! Im interested in your bag. Im offering $50. Let me know if you are interested. Thanks!', metadata={'source': '/Users/avsolatorio/WBG/langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat.json', 'seq_num': 11})]
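For readers unfamiliar with jq, the expression `.messages[].content` simply means "for every item in `messages`, take its `content`". A plain-Python equivalent is sketched below on a shortened copy of the chat data. The helper `extract_contents` is hypothetical, and mapping a missing `content` key to an empty string is a choice made for this sketch; the real loader's handling may differ by version.

```python
import json

# Shortened copy of the chat file shown above.
RAW = json.dumps({
    "title": "User 1 and User 2 chat",
    "messages": [
        {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851},
        {"content": "Oh no worries! Bye", "sender_name": "User 1", "timestamp_ms": 1675597435669},
    ],
})


def extract_contents(raw_json, source):
    """Plain-Python counterpart of the jq expression '.messages[].content'."""
    data = json.loads(raw_json)
    documents = []
    for seq_num, message in enumerate(data["messages"], start=1):
        documents.append({
            # Sketch-only choice: a message without 'content' becomes ''.
            "page_content": message.get("content") or "",
            "metadata": {"source": source, "seq_num": seq_num},
        })
    return documents


chat_docs = extract_contents(RAW, "./example_data/facebook_chat.json")
for doc in chat_docs:
    print(doc["metadata"]["seq_num"], doc["page_content"])
```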

To load a JSON Lines file instead of a JSON file, set `json_lines` to True. The `jq_schema` is then applied to the JSON object on each line, as with `.content` below.

loader = JSONLoader(
    file_path='./example_data/facebook_chat_messages.jsonl',
    jq_schema='.content',
    json_lines=True)

data = loader.load()
pprint(data)

[Document(page_content='Bye!', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 1}),
 Document(page_content='Oh no worries! Bye', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 2}),
 Document(page_content='No Im sorry it was my mistake, the blue one is not for sale', metadata={'source': 'langchain/docs/modules/indexes/document_loaders/examples/example_data/facebook_chat_messages.jsonl', 'seq_num': 3})]
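A JSON Lines file holds one JSON object per line, so each line is parsed on its own. The following standard-library sketch shows that per-line handling; the helper `load_json_lines` is hypothetical and the two sample lines are copied from the chat data above.

```python
import json

JSONL = "\n".join([
    json.dumps({"sender_name": "User 2", "timestamp_ms": 1675597571851, "content": "Bye!"}),
    json.dumps({"sender_name": "User 1", "timestamp_ms": 1675597435669, "content": "Oh no worries! Bye"}),
])


def load_json_lines(text, source, key="content"):
    """Parse each non-empty line separately and pull out one field."""
    documents = []
    seq_num = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines between records
        seq_num += 1
        record = json.loads(line)
        documents.append({
            "page_content": record[key],
            "metadata": {"source": source, "seq_num": seq_num},
        })
    return documents


line_docs = load_json_lines(JSONL, "./example_data/facebook_chat_messages.jsonl")
print(line_docs)
```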

In addition, you can set `content_key` to load a different field, such as the sender name, as the document content, and you can use `metadata_func` to customize the metadata.
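The sketch below shows how a content key and a metadata function could work together on records like the chat messages above. The `metadata_func(record, metadata)` shape follows the callback style used with `JSONLoader`, while `load_records` is a hypothetical stand-in for the loader itself so that the example runs without LangChain or `jq`.

```python
def metadata_func(record: dict, metadata: dict) -> dict:
    """Copy extra fields from the raw record into the document metadata."""
    metadata["sender_name"] = record.get("sender_name")
    metadata["timestamp_ms"] = record.get("timestamp_ms")
    return metadata


def load_records(records, source, content_key, metadata_func=None):
    """Hypothetical loader core: pick one field as content, enrich metadata."""
    documents = []
    for seq_num, record in enumerate(records, start=1):
        metadata = {"source": source, "seq_num": seq_num}
        if metadata_func is not None:
            metadata = metadata_func(record, metadata)
        documents.append(
            {"page_content": record.get(content_key, ""), "metadata": metadata}
        )
    return documents


messages = [
    {"content": "Bye!", "sender_name": "User 2", "timestamp_ms": 1675597571851},
    {"content": "Oh no worries! Bye", "sender_name": "User 1", "timestamp_ms": 1675597435669},
]

with_metadata = load_records(messages, "facebook_chat.json", "content", metadata_func)
sender_docs = load_records(messages, "facebook_chat.json", "sender_name")

print(with_metadata[0])
print([doc["page_content"] for doc in sender_docs])
```

Keeping the sender and timestamp in the metadata means they remain available later, for example when filtering or citing retrieved messages.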

5. Markdown

This section covers loading `Markdown` documents. With `UnstructuredMarkdownLoader`, a `Markdown` document can be loaded in a few lines.

from langchain.document_loaders import UnstructuredMarkdownLoader
markdown_path = "../../../../../README.md"
loader = UnstructuredMarkdownLoader(markdown_path)
data = loader.load()
pprint(data)

[Document(page_content="ð\x9f¦\x9cï¸\x8fð\x9f”\x97 LangChain\n\nâ\x9a¡ ... \n\nFor detailed information on how to contribute, see here.", metadata={'source': '../../../../../README.md'})]

Under the hood, Unstructured creates a separate `element` for each chunk of text. By default, all elements are combined into a single document, but you can keep them separate by setting `mode` to `elements`.

loader = UnstructuredMarkdownLoader(markdown_path, mode="elements")
data = loader.load()
print(data[0])

Document(page_content='ð\x9f¦\x9cï¸\x8fð\x9f”\x97 LangChain', metadata={'source': '../../../../../README.md', 'page_number': 1, 'category': 'Title'})
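The difference between the two modes can be pictured with a toy splitter. This is only a rough sketch under strong assumptions: the real Unstructured library partitions documents far more carefully, and the two category names used here (`Title`, `NarrativeText`) are just the ones this example assigns. Both helper functions and the sample text are hypothetical.

```python
def markdown_to_elements(text, source):
    """Split on blank lines and label each block with a coarse category."""
    elements = []
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            category = "Title"
            block = block.lstrip("#").strip()
        else:
            category = "NarrativeText"
        elements.append(
            {"page_content": block, "metadata": {"source": source, "category": category}}
        )
    return elements


def load_markdown(text, source, mode="single"):
    """mode='elements' keeps the pieces; the default joins them into one document."""
    elements = markdown_to_elements(text, source)
    if mode == "elements":
        return elements
    combined = "\n\n".join(element["page_content"] for element in elements)
    return [{"page_content": combined, "metadata": {"source": source}}]


SAMPLE = "# LangChain\n\nBuilding applications with LLMs.\n\nSee the contributing guide."

single = load_markdown(SAMPLE, "README.md")
elements = load_markdown(SAMPLE, "README.md", mode="elements")

print(len(single), len(elements))
print(elements[0])
```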

6. PDF

PDF is one of the most widely used standard document formats. This section covers loading PDF files. There are many ways to load a PDF; this post covers only `PyPDFLoader`, so please refer to the official LangChain documentation for the others.

`PyPDFLoader` loads a PDF file using the pypdf package. Each resulting document holds the text of the page along with its page number in the metadata.

from langchain.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data/layout-parser-paper.pdf")
pages = loader.load_and_split()
print(pages[0])

Document(page_content='LayoutParser : A Uni\x0ced Toolkit for ... i\x0ccation [ 11,arXiv:2103.15348v2  [cs.CV]  21 Jun 2021', metadata={'source': 'example_data/layout-parser-paper.pdf', 'page': 0})

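An advantage of this approach is that search results come back together with their page numbers. Let's see how to search the documents using `OpenAIEmbeddings`, which requires an OpenAI API key.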

from langchain.vectorstores import FAISS
from langchain.embeddings.openai import OpenAIEmbeddings

faiss_index = FAISS.from_documents(pages, OpenAIEmbeddings())
docs = faiss_index.similarity_search("How will the community be engaged?", k=2)
for doc in docs:
    print(str(doc.metadata["page"]) + ":", doc.page_content[:300])

9: 10 Z. Shen et al.
Fig. 4: Illustration of (a) the original historical Japanese document with layout
detection results and (b) a recreated version of the document image that achieves
much better character recognition recall. The reorganization algorithm rearranges
the tokens based on the their detect
3: 4 Z. Shen et al.
Efficient Data AnnotationC u s t o m i z e d  M o d e l  T r a i n i n gModel Cust omizationDI A Model HubDI A Pipeline SharingCommunity PlatformLa y out Detection ModelsDocument Images 
T h e  C o r e  L a y o u t P a r s e r  L i b r a r yOCR ModuleSt or age & VisualizationLa y ou
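The same "search, then report the page number" flow can be tried without an API key by swapping in a much cruder similarity measure. The sketch below replaces OpenAI embeddings and FAISS with bag-of-words vectors and cosine similarity, purely for illustration; the page texts are made up, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy 'embedding': lowercase word counts (not a real embedding model)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_search(pages, query, k=2):
    """Return the k pages most similar to the query, best first."""
    query_vector = embed(query)
    ranked = sorted(
        pages,
        key=lambda page: cosine(embed(page["page_content"]), query_vector),
        reverse=True,
    )
    return ranked[:k]


# Made-up page texts; only the page numbers echo the example above.
pages = [
    {"page_content": "Introduction to the toolkit and its goals", "metadata": {"page": 0}},
    {"page_content": "The community platform lets users share models and pipelines", "metadata": {"page": 3}},
    {"page_content": "Evaluation of character recognition results", "metadata": {"page": 9}},
]

results = similarity_search(pages, "How will the community be engaged?", k=2)
for doc in results:
    print(str(doc["metadata"]["page"]) + ":", doc["page_content"][:300])
```

Because the page number travels in each document's metadata, any retrieval method, simple or sophisticated, can report where in the PDF an answer came from.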