Daddy Makers: May 2024

Wednesday, May 22, 2024

Building a Data Query Service on the Neo4j Graph Database

This post briefly explains how to build a data query service on the Neo4j graph database. Neo4j is a fully transactional graph database that supports all of the ACID properties (Atomicity, Consistency, Isolation, Durability).

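Neo4j was developed by Neo Technology (now Neo4j, Inc.) and first released in 2007. Unlike a relational database, a graph database stores data as a graph made up of nodes and edges (relationships). This approach is well suited to modeling the complex relationships found in the real world and is useful in many areas, such as networks, social networks, maps, and recommendation systems. Neo4j is one of the most popular graph databases and is characterized by high performance, scalability, and query language support.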

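Neo4j is mainly used to manage graph-shaped data, for example in SNS analysis, map and location-based services, and IoT.

A Neo4j installation is organized into multiple projects, each of which contains its own databases.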

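Basic features

Getting started

Download the program from the following link and install it.

Neo4j Free Graph Database Download

After installation, create a new project and choose Add - Local DBMS. After the DB is created, click Open to see the browser window. Be sure to remember the password you entered when creating it.

Enter the following commands at the neo4j$ prompt to create some records. Run them together as a single statement, because the last two lines refer to the variables (john, joe, and so on) defined in the first five.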

CREATE (john:Person {name: 'John'})
CREATE (joe:Person {name: 'Joe'})
CREATE (steve:Person {name: 'Steve'})
CREATE (sara:Person {name: 'Sara'})
CREATE (maria:Person {name: 'Maria'})
CREATE (john)-[:FRIEND]->(joe)-[:FRIEND]->(steve)
CREATE (john)-[:FRIEND]->(sara)-[:FRIEND]->(maria)

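Enter and run the following command.

MATCH (n) RETURN n LIMIT 5

The result displays the queried graph. The previous command is shown beneath it.

Open http://127.0.0.1:7474/browser in a web browser. The initial ID/password is neo4j / neo4j. From here, data can be managed in the same way as in the desktop program.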

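Graph DB and query language

A graph DB has the following components.

Node: a graph data record
Relationship: a connection between Nodes
Property: an attribute (key-value pair) of a Node or Relationship
Label: a unit for grouping Nodes

Cypher is a graph query language. Its main clauses are as follows.

CREATE: creates nodes and relationships
MATCH: searches for existing nodes and relationships; the matched items are passed on with RETURN or WITH
WHERE: specifies conditions
MERGE: combines MATCH and CREATE; it matches the pattern if it exists and creates it otherwise
SET: updates the labels and properties of a node
DELETE: deletes nodes and relationships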

A relationship is written as follows.

(node)-[relationship]->(node)

For example, (A)-[:KNOWS {since: 2020}]->(B) means that A and B are connected by a KNOWS relationship whose since property has the value 2020.
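To make the (node)-[relationship]->(node) notation more concrete, the following Python sketch stores the FRIEND data from the earlier CREATE example as plain (start, type, end) triples and follows two FRIEND hops in memory. It only illustrates what a pattern such as (a)-[:FRIEND]->()-[:FRIEND]->(c) matches; the names neighbors and friends_of_friends are made up for this sketch, and it says nothing about how Neo4j evaluates patterns internally.

```python
# Relationships from the CREATE example, stored as (start, type, end) triples.
RELATIONSHIPS = [
    ("John", "FRIEND", "Joe"),
    ("Joe", "FRIEND", "Steve"),
    ("John", "FRIEND", "Sara"),
    ("Sara", "FRIEND", "Maria"),
]


def neighbors(name, rel_type):
    """Return the end nodes of outgoing relationships of the given type."""
    return [end for start, rtype, end in RELATIONSHIPS
            if start == name and rtype == rel_type]


def friends_of_friends(name):
    """Mimic MATCH (a {name: $name})-[:FRIEND]->()-[:FRIEND]->(c) RETURN c.name."""
    result = []
    for friend in neighbors(name, "FRIEND"):
        result.extend(neighbors(friend, "FRIEND"))
    return result


print(friends_of_friends("John"))  # ['Steve', 'Maria']
```

Because the relationships are directed, starting from Maria returns nothing: she has no outgoing FRIEND relationship.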

The following shows the basic forms of the Cypher language.

Nodes: ()
Relationships: -[:DIRECTED]->
Patterns: ()-[]-()
          ()-[]->()
          ()<-[]-()

The following is an example query that uses them.

MATCH (p:Person)-[:ACTED_IN]->(m:Movie),
      (d:Person)-[:DIRECTED]->(m:Movie)
WHERE p.name = 'Tom Hanks' AND p.born = 1956
  AND d.name = 'Robert Zemeckis' AND d.born = 1951
RETURN m.title, m.released

Managing graph data with Python

This example manages graph data with Python. Install the following package.

pip install neo4j

Set up the driver in Python.

from neo4j import GraphDatabase

uri = "bolt://localhost:7687"
username = "neo4j"
password = "neo4jneo4j"  # the password entered when the DBMS was created

driver = GraphDatabase.driver(uri, auth=(username, password))
session = driver.session()

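Run the queries.

# Run a read query through the session and print each returned record.
q = 'MATCH (n) RETURN n LIMIT 5'
nodes = session.run(q)
for node in nodes:
    print(node)

# MERGE creates the node only if it does not already exist.
query = '''
MERGE (n:Person {name: 'Joe'})
RETURN n
'''
results, summary, keys = driver.execute_query(query, database_='neo4j')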
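When the name comes from user input rather than a literal, it is usually safer to pass it as a query parameter instead of building it into the Cypher text. The sketch below separates the statement from its parameters; merge_person and run_merge are hypothetical helper names, and run_merge assumes that driver is the neo4j driver object created above and that the installed driver version accepts the parameters_ argument of execute_query.

```python
def merge_person(name):
    """Return a Cypher statement and its parameters for merging one Person."""
    query = "MERGE (n:Person {name: $name}) RETURN n"
    return query, {"name": name}


def run_merge(driver, name, database="neo4j"):
    """Run the statement; `driver` is assumed to be a neo4j driver instance."""
    query, params = merge_person(name)
    records, summary, keys = driver.execute_query(
        query, parameters_=params, database_=database
    )
    return records


# The statement text stays constant; only the parameter dictionary changes.
query, params = merge_person("Joe")
print(query)
print(params)
```

With a running server, run_merge(driver, "Joe") would have the same effect as the MERGE statement above.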

Graph DB RAG example

The following is a simple example of RAG over a graph DB. The code converts text into a graph using an experimental LangChain feature, stores it in Neo4j, and then queries it. To run it, Neo4j must already be installed on the PC and running as a server.

text = """
Marie Curie, born in 1867, was a Polish and naturalised-French physicist and chemist who conducted pioneering research on radioactivity.
She was the first woman to win a Nobel Prize, the first person to win a Nobel Prize twice, and the only person to win a Nobel Prize in two scientific fields.
Her husband, Pierre Curie, was a co-winner of her first Nobel Prize, making them the first-ever married couple to win the Nobel Prize and launching the Curie family legacy of five Nobel Prizes.
She was, in 1906, the first woman to become a professor at the University of Paris.
"""

# Import the libraries.
import os

from dotenv import load_dotenv
from langchain.chains import GraphCypherQAChain
from langchain_community.graphs import Neo4jGraph
from langchain_core.documents import Document
from langchain_experimental.graph_transformers import LLMGraphTransformer
from langchain_openai import ChatOpenAI

load_dotenv()

# Create the LLM.
llm = ChatOpenAI(temperature=0, model_name="gpt-4o", openai_api_key=os.getenv("OPENAI_API_KEY"))  # LLM settings; output quality can vary by model
llm_transformer = LLMGraphTransformer(llm=llm)  # Experimental module that uses the LLM to convert text data into graph data

# Connect to Neo4j, build a document from the text, and convert the document into graph form.
def build_graph():
    graph = Neo4jGraph(url='bolt://localhost:7687', username='neo4j', password='neo4jneo4j')  # connect to the Neo4j graph database
    documents = [Document(page_content=text)]
    graph_documents = llm_transformer.convert_to_graph_documents(documents)
    graph.add_graph_documents(graph_documents)
    return graph

# Query the graph DB.
def query_graph(graph, query):
    chain = GraphCypherQAChain.from_llm(graph=graph, llm=llm, verbose=True, validate_cypher=True)
    response = chain.invoke({"query": query})
    return response

graph = build_graph()
response = query_graph(graph, "In what university Marie Curie was professor and when she did it?")
print(response)

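If Neo4j is not installed, start it with the following Docker command. Note that the password set through NEO4J_AUTH must match the one used in the code.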

docker run --publish=7474:7474 --publish=7687:7687 --volume=neo_data:/data --env=NEO4J_AUTH=neo4j/12345678oo --env NEO4J_PLUGINS='["apoc"]' -d neo4j

Converting an engineering IFC file to a graph

The following Python code converts an IFC file into a graph database.

import ifcopenshell, sys, time
from py2neo import Graph, Node

def typeDict(key):
    # Return the attribute names of the given IFC class.
    f = ifcopenshell.file()
    value = f.create_entity(key).wrapped_data.get_attribute_names()
    return value

ifc_path = "input.ifc"  # input file
start = time.time()  # used below to report the elapsed time
nodes = []
edges = []

# Collect nodes
f = ifcopenshell.open(ifc_path)
for el in f:
    if el.is_a() == "IfcOwnerHistory":
        continue
    tid = el.id()
    cls = el.is_a()
    pairs = []
    keys = []
    try:
        keys = [x for x in el.get_info() if x not in ["type", "id", "OwnerHistory"]]
    except RuntimeError:
        pass
    for key in keys:
        val = el.get_info()[key]
        if any(hasattr(val, "is_a") and val.is_a(thisTyp)
               for thisTyp in ["IfcBoolean", "IfcLabel", "IfcText", "IfcReal"]):
            val = val.wrappedValue
        if val and type(val) is tuple and type(val[0]) in (str, bool, float, int):
            val = ",".join(str(x) for x in val)
        if type(val) not in (str, bool, float, int):
            continue
        pairs.append((key, val))
    nodes.append((tid, cls, pairs))

    # Collect edges: attributes that reference other entities
    for i in range(len(el)):
        try:
            el[i]
        except RuntimeError as e:
            if str(e) != "Entity not found":
                print("ID", tid, e, file=sys.stderr)
            continue
        if isinstance(el[i], ifcopenshell.entity_instance):
            if el[i].is_a() == "IfcOwnerHistory":
                continue
            if el[i].id() != 0:
                edges.append((tid, el[i].id(), typeDict(cls)[i]))
                continue
        try:
            iter(el[i])
        except TypeError:
            continue
        destinations = [x.id() for x in el[i] if isinstance(x, ifcopenshell.entity_instance)]
        for connectedTo in destinations:
            edges.append((tid, connectedTo, typeDict(cls)[i]))

if len(nodes) == 0:
    print("no nodes in file", file=sys.stderr)
    sys.exit(1)

# Create the graph database
graph = Graph(auth=('neo4j', 'neo4jneo4j'))  # http://localhost:7474
graph.delete_all()

for node in nodes:
    nId, cls, pairs = node
    one_node = Node("IfcNode", ClassName=cls, nid=nId)
    for k, v in pairs:
        one_node[k] = v
    graph.create(one_node)

# graph.run("CREATE INDEX ON :IfcNode(nid)")
print("Node creation process done. Took", time.time() - start)
print(time.strftime("%Y/%m/%d %H:%M", time.strptime(time.ctime())))

# Create one relationship per collected edge.
query_rel = """
MATCH (a:IfcNode)
WHERE a.nid = {:d}
MATCH (b:IfcNode)
WHERE b.nid = {:d}
CREATE (a)-[:{:s}]->(b)
"""

for (nId1, nId2, relType) in edges:
    graph.run(query_rel.format(nId1, nId2, relType))
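The loop above sends one query per edge, which can be slow for large IFC files. As a possible alternative, the edges can be grouped by relationship type and sent in batches with UNWIND. The sketch below only builds the grouped rows and the query text; group_edges and batch_query are hypothetical helpers, the sample edge tuples are invented, and the commented graph.run call assumes py2neo passes keyword arguments as query parameters.

```python
from collections import defaultdict


def group_edges(edges):
    """Group (start_id, end_id, rel_type) tuples by relationship type."""
    grouped = defaultdict(list)
    for start_id, end_id, rel_type in edges:
        grouped[rel_type].append({"a": start_id, "b": end_id})
    return dict(grouped)


def batch_query(rel_type):
    """Build one UNWIND statement for a relationship type.

    The type name is interpolated into the text here, so only simple
    identifiers are accepted.
    """
    if not rel_type.isidentifier():
        raise ValueError(f"unexpected relationship type: {rel_type!r}")
    return (
        "UNWIND $rows AS row "
        "MATCH (a:IfcNode {nid: row.a}) "
        "MATCH (b:IfcNode {nid: row.b}) "
        f"CREATE (a)-[:{rel_type}]->(b)"
    )


# Invented sample edges in the same (id, id, attribute name) shape as above.
sample_edges = [(1, 2, "RelatingObject"), (1, 3, "RelatedObjects"), (4, 2, "RelatingObject")]
for rel_type, rows in group_edges(sample_edges).items():
    print(batch_query(rel_type), rows)
    # With a live database this could be run as:
    # graph.run(batch_query(rel_type), rows=rows)
```

Creating an index on :IfcNode(nid) before running the batches, as hinted by the commented-out line in the original code, would likely help the MATCH steps.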

The result can be checked with the following query.

MATCH (n1)-[r]->(n2)
WHERE n1.ClassName = "IfcBuildingStorey"
RETURN r, n1, n2;

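Conclusion

The Neo4j graph database stores data as a graph made up of nodes and edges (relationships). This approach is well suited to modeling the complex relationships found in the real world and is useful in many areas, such as networks, social networks, maps, and recommendation systems. This post walked through how to develop a service that queries graph data and generates information from it.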


Tuesday, May 21, 2024

Implementing LLM-based Graph RAG

This post summarizes how to implement Graph RAG.

The exercise is based on the Graph RAG approach described by Microsoft.

Graph RAG

Setting up the development environment

Run the following commands to install the packages.

pip install langchain

pip install openai

pip install neo4j

Development

Write the code as follows.

import os
import json
import pandas as pd

from dotenv import load_dotenv
load_dotenv()

Set the OpenAI API key.

# os.environ["OPENAI_API_KEY"] = "<your_api_key>"
# print(os.environ["OPENAI_API_KEY"])

Load the dataset used in this exercise.

# Loading a json dataset from a file
file_path = 'data/amazon_product_kg.json'

with open(file_path, 'r') as file:
    jsonData = json.load(file)

df = pd.read_json(file_path)
df.head()

Connect to the database and define a helper for cleaning up the parsed text.

url = "bolt://localhost:7687"
username = "neo4j"
password = "<your_password_here>"

from langchain.graphs import Neo4jGraph

graph = Neo4jGraph(
    url=url,
    username=username,
    password=password
)

def sanitize(text):
    # Remove quotes and braces that would break the generated Cypher text.
    text = str(text).replace("'","").replace('"','').replace('{','').replace('}', '')
    return text

Add each object in the JSON file to the database.

i = 1
for obj in jsonData:
    print(f"{i}. {obj['product_id']} -{obj['relationship']}-> {obj['entity_value']}")
    i += 1
    query = f'''
        MERGE (product:Product {{id: {obj['product_id']}}})
        ON CREATE SET product.name = "{sanitize(obj['product'])}",
        product.title = "{sanitize(obj['TITLE'])}",
        product.bullet_points = "{sanitize(obj['BULLET_POINTS'])}",
        product.size = {sanitize(obj['PRODUCT_LENGTH'])}

        MERGE (entity:{obj['entity_type']} {{value: "{sanitize(obj['entity_value'])}"}})

        MERGE (product)-[:{obj['relationship']}]->(entity)
        '''
    graph.query(query)
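The loop above removes quotes and braces so that the values can be pasted into the Cypher text. A different approach, sketched below, is to keep the values intact and pass them as query parameters, interpolating only the label and relationship names after checking that they are plain identifiers. build_merge is a hypothetical helper, the sample record is invented (it only reuses the dataset's key names), and the sketch sets fewer properties than the original loop to stay short.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_merge(obj):
    """Return (query, params) for one record of the product dataset."""
    # Labels and relationship types are interpolated, so validate them first.
    for field in ("entity_type", "relationship"):
        if not _IDENTIFIER.match(str(obj[field])):
            raise ValueError(f"unsafe {field}: {obj[field]!r}")
    query = (
        "MERGE (product:Product {id: $product_id}) "
        "ON CREATE SET product.name = $name, product.title = $title "
        f"MERGE (entity:{obj['entity_type']} {{value: $entity_value}}) "
        f"MERGE (product)-[:{obj['relationship']}]->(entity)"
    )
    params = {
        "product_id": obj["product_id"],
        "name": obj["product"],
        "title": obj["TITLE"],
        "entity_value": obj["entity_value"],
    }
    return query, params


record = {  # invented record with the same keys as the dataset
    "product_id": 1, "product": "Curtain", "TITLE": 'Blackout "Curtain"',
    "entity_type": "category", "relationship": "hasCategory",
    "entity_value": "home decoration",
}
query, params = build_merge(record)
print(query)
print(params)
# With the graph object above: graph.query(query, params=params)
```

This way a title that contains quotes is stored unchanged instead of being stripped.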

To handle queries, create a vector index over the properties of each node type. This is done with an embedding function.

from langchain.vectorstores.neo4j_vector import Neo4jVector
from langchain.embeddings.openai import OpenAIEmbeddings

embeddings_model = "text-embedding-3-small"

# Index the Product nodes on their name and title properties.
vector_index = Neo4jVector.from_existing_graph(
    OpenAIEmbeddings(model=embeddings_model),
    url=url,
    username=username,
    password=password,
    index_name='products',
    node_label="Product",
    text_node_properties=['name', 'title'],
    embedding_node_property='embedding',
)

# Index every entity type on its value property.
def embed_entities(entity_type):
    vector_index = Neo4jVector.from_existing_graph(
        OpenAIEmbeddings(model=embeddings_model),
        url=url,
        username=username,
        password=password,
        index_name=entity_type,
        node_label=entity_type,
        text_node_properties=['value'],
        embedding_node_property='embedding',
    )

entities_list = df['entity_type'].unique()
for t in entities_list:
    embed_entities(t)

Query the database.

from langchain.chains import GraphCypherQAChain
from langchain.chat_models import ChatOpenAI

chain = GraphCypherQAChain.from_llm(
    ChatOpenAI(temperature=0), graph=graph, verbose=True,
)

chain.run("""
Help me find curtains
""")

Use a prompt template so that more detailed results can be obtained.

entity_types = {
    "product": "Item detailed type, for example 'high waist pants', 'outdoor plant pot', 'chef kitchen knife'",
    "category": "Item category, for example 'home decoration', 'women clothing', 'office supply'",
    "characteristic": "if present, item characteristics, for example 'waterproof', 'adhesive', 'easy to use'",
    "measurement": "if present, dimensions of the item",
    "brand": "if present, brand of the item",
    "color": "if present, color of the item",
    "age_group": "target age group for the product, one of 'babies', 'children', 'teenagers', 'adults'. If suitable for multiple age groups, pick the oldest (latter in the list)."
}

relation_types = {
    "hasCategory": "item is of this category",
    "hasCharacteristic": "item has this characteristic",
    "hasMeasurement": "item is of this measurement",
    "hasBrand": "item is of this brand",
    "hasColor": "item is of this color",
    "isFor": "item is for this age_group"
}

entity_relationship_match = {
    "category": "hasCategory",
    "characteristic": "hasCharacteristic",
    "measurement": "hasMeasurement",
    "brand": "hasBrand",
    "color": "hasColor",
    "age_group": "isFor"
}

system_prompt = f'''
    You are a helpful agent designed to fetch information from a graph database.

    The graph database links products to the following entity types:
    {json.dumps(entity_types)}

    Each link has one of the following relationships:
    {json.dumps(relation_types)}

    Depending on the user prompt, determine if it possible to answer with the graph database.

    The graph database can match products with multiple relationships to several entities.

    Example user input:
    "Which blue clothing items are suitable for adults?"

    There are three relationships to analyse:
    1. The mention of the blue color means we will search for a color similar to "blue"
    2. The mention of the clothing items means we will search for a category similar to "clothing"
    3. The mention of adults means we will search for an age_group similar to "adults"

    Return a json object following the following rules:
    For each relationship to analyse, add a key value pair with the key being an exact match for one of the entity types provided, and the value being the value relevant to the user query.

    For the example provided, the expected output would be:
    {{
        "color": "blue",
        "category": "clothing",
        "age_group": "adults"
    }}

    If there are no relevant entities in the user prompt, return an empty json object.
'''

print(system_prompt)

from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

Define a query function and try asking for the desired product items. The extracted entities are later turned into a Cypher query.

def define_query(prompt, model="gpt-4-1106-preview"):
    # Ask the model to extract entities from the prompt as a JSON object.
    completion = client.chat.completions.create(
        model=model,
        temperature=0,
        response_format={
            "type": "json_object"
        },
        messages=[
            {
                "role": "system",
                "content": system_prompt
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
    )
    return completion.choices[0].message.content

example_queries = [
    "Which pink items are suitable for children?",
    "Help me find gardening gear that is waterproof",
    "I'm looking for a bench with dimensions 100x50 for my living room"
]

for q in example_queries:
    print(f"Q: '{q}'\n{define_query(q)}\n")
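The model is asked to use only the listed entity types as keys, but nothing enforces this, and an unexpected key would later fail the entity_relationship_match lookup. A small guard such as the following could be applied to the reply first. parse_entities and ALLOWED_KEYS are names made up for this sketch; the key set is copied from entity_types above, and the sample replies are invented.

```python
import json

# Same keys as the entity_types dictionary defined earlier.
ALLOWED_KEYS = {"product", "category", "characteristic", "measurement",
                "brand", "color", "age_group"}


def parse_entities(raw):
    """Parse the model's JSON reply, keeping only known keys with non-empty string values."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {}
    if not isinstance(data, dict):
        return {}
    return {k: v for k, v in data.items()
            if k in ALLOWED_KEYS and isinstance(v, str) and v.strip()}


print(parse_entities('{"color": "pink", "age_group": "children", "mood": "happy"}'))
# {'color': 'pink', 'age_group': 'children'}
print(parse_entities("not json"))  # {}
```

An empty result can then be treated as "the question cannot be answered from the graph", matching the last rule in the system prompt.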

Because the extracted entities may not match the loaded data exactly, the GDS cosine similarity function is used so that the queried products can still be returned.

def create_embedding(text):
    result = client.embeddings.create(model=embeddings_model, input=text)
    return result.data[0].embedding

# The threshold defines how closely related words should be. Adjust the threshold to return more or fewer results
def create_query(text, threshold=0.81):
    query_data = json.loads(text)
    # Creating embeddings
    embeddings_data = []
    for key, val in query_data.items():
        if key != 'product':
            embeddings_data.append(f"${key}Embedding AS {key}Embedding")
    query = "WITH " + ",\n".join(e for e in embeddings_data)
    # Matching products to each entity
    query += "\nMATCH (p:Product)\nMATCH "
    match_data = []
    for key, val in query_data.items():
        if key != 'product':
            relationship = entity_relationship_match[key]
            match_data.append(f"(p)-[:{relationship}]->({key}Var:{key})")
    query += ",\n".join(e for e in match_data)
    similarity_data = []
    for key, val in query_data.items():
        if key != 'product':
            similarity_data.append(f"gds.similarity.cosine({key}Var.embedding, ${key}Embedding) > {threshold}")
    query += "\nWHERE "
    query += " AND ".join(e for e in similarity_data)
    query += "\nRETURN p"
    return query

def query_graph(response):
    embeddingsParams = {}
    query = create_query(response)
    query_data = json.loads(response)
    for key, val in query_data.items():
        embeddingsParams[f"{key}Embedding"] = create_embedding(val)
    result = graph.query(query, params=embeddingsParams)
    return result

example_response = '''{
    "category": "clothes",
    "color": "blue",
    "age_group": "adults"
}'''

result = query_graph(example_response)

# Result
print(f"Found {len(result)} matching product(s):\n")
for r in result:
    print(f"{r['p']['name']} ({r['p']['id']})")
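The WHERE clause built above keeps an entity only when the cosine similarity between its stored embedding and the query embedding exceeds the threshold. The following standalone sketch shows that comparison in plain Python, using toy 3-dimensional vectors in place of real embeddings. The vectors and the passes helper are invented for illustration; in the actual query the calculation is done inside the database by gds.similarity.cosine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes(a, b, threshold=0.81):
    """Apply the same '> threshold' test as the generated WHERE clause."""
    return cosine_similarity(a, b) > threshold


# Toy vectors standing in for real embeddings.
blue = [1.0, 0.0, 0.2]
navy = [0.9, 0.1, 0.3]
red = [0.0, 1.0, 0.1]
print(passes(blue, navy))  # True
print(passes(blue, red))   # False
```

Raising the threshold makes the match stricter and returns fewer products; lowering it returns more, at the cost of looser matches.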

In this way, the entities that the LLM returns in JSON form are used to search the Neo4j database, and the matching products are obtained as the result.


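Saturday, May 18, 2024

RAG over and generation of structured JSON data with LLMs

This post briefly introduces how to generate data in a structured format with an LLM, a topic of recent interest among LLM developers. To that end, it develops a RAG approach that can take JSON as input and produce JSON as output.

Structured LLM output

Instead of commercial models such as OpenAI ChatGPT, whose API requires a paid subscription, this post uses open-source models such as Llama and Mistral.

Several things have to be considered when doing RAG over JSON files. RAG is usually built with libraries such as LangChain and LlamaIndex, but their dependencies change frequently and they are evolving rapidly, so installation and builds often fail with errors. RAG based on a vector database also has limits, and there are cases where the needed information is not retrieved properly. RAG has to be implemented with these issues in mind.

Preparing the development environment

Install the development environment as follows. Also install the ollama tool.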

pip install llama-cpp-python

pip install 'crewai[tools]'

pip install langchain

TEXT TO JSON

Load a Llama model and generate JSON output by constraining it with a GBNF grammar definition of JSON. Run the following code.

from llama_cpp.llama import Llama, LlamaGrammar
import httpx
import json

# Download the GBNF grammar that restricts the output to a JSON array.
grammar_text = httpx.get("https://raw.githubusercontent.com/ggerganov/llama.cpp/master/grammars/json_arr.gbnf").text
grammar = LlamaGrammar.from_string(grammar_text)

llm = Llama("llama-2-13b.Q8_0.gguf")

response = llm(
    "JSON list of name strings of attractions in SF:",
    grammar=grammar, max_tokens=-1
)

print(json.dumps(json.loads(response['choices'][0]['text']), indent=4))

The output lists attractions in San Francisco, as follows.

[
    {
        "address": {
            "country": "US",
            "locality": "San Francisco",
            "postal_code": 94103,
            "region": "CA",
            "route": "Museum Way",
            "street_number": 151
        },
        "geocode": {
            "latitude": 37.782569,
            "longitude": -122.406605
        },
        "name": "SFMOMA",
        "phone": "(415) 357-4000",
        "website": "http://www.sfmoma.org/"
    }
]

In this way, LLM output can be generated in a structure that is easy for a computer to process.

For reference, the JSON grammar used here is defined in a formal rule language, as follows.

json.gbnf
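Note that the grammar constrains the syntax of the output, not its schema: the prompt above asks for a list of name strings, yet the sample output is a list of objects. Downstream code may therefore want to accept both shapes. The following is a minimal sketch under that assumption; extract_names is a hypothetical helper, and the sample string is invented rather than real model output.

```python
import json


def extract_names(raw_text):
    """Return names from a JSON array of strings or of objects with a "name" key."""
    items = json.loads(raw_text)
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    names = []
    for item in items:
        if isinstance(item, str):
            names.append(item)
        elif isinstance(item, dict) and isinstance(item.get("name"), str):
            names.append(item["name"])
    return names


# Invented sample mixing both shapes.
sample = '[{"name": "SFMOMA", "phone": "(415) 357-4000"}, "Alcatraz Island"]'
print(extract_names(sample))  # ['SFMOMA', 'Alcatraz Island']
```

Items that match neither shape are silently skipped here; raising an error instead would be a reasonable choice for stricter pipelines.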

TEXT TO XML

The following shows how to have the model generate output marked up with XML tags.

from langchain.output_parsers import XMLOutputParser
from langchain_community.chat_models import ChatAnthropic
from langchain_core.prompts import PromptTemplate

model = ChatAnthropic(model="claude-2", max_tokens_to_sample=512, temperature=0.1)

actor_query = "Generate the shortened filmography for Tom Hanks."
output = model.invoke(
    f"""{actor_query}
Please enclose the movies in <movie></movie> tags"""
)
print(output.content)
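The code above prints the tagged text but does not extract the movies from it. One simple way to do so, sketched here with the standard library only, is a regular expression over the <movie> tags. extract_movies is a hypothetical helper and sample_reply is an invented reply used only to exercise it; the XMLOutputParser imported above could be used for this step instead.

```python
import re


def extract_movies(text):
    """Return the stripped contents of every <movie>...</movie> pair."""
    matches = re.findall(r"<movie>(.*?)</movie>", text, flags=re.DOTALL)
    return [m.strip() for m in matches]


# Invented model reply, used only to exercise the function.
sample_reply = """Here is a shortened filmography:
<movie>Splash</movie>
<movie>Big</movie>
<movie> Forrest Gump </movie>"""
print(extract_movies(sample_reply))  # ['Splash', 'Big', 'Forrest Gump']
```

A regular expression is enough for flat tags like these; nested or attribute-bearing XML would call for a real parser.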

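JSON RAG

This example performs RAG over the content of specific websites and a JSON file containing product information (example download). Create an input_json folder and copy the example file into it.

Now process the JSON file for RAG with LangChain as follows.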

from langchain_community.vectorstores import Chroma
from langchain_community.chat_models import ChatOllama
from langchain_community.embeddings.fastembed import FastEmbedEmbeddings
from langchain_community.document_loaders import WebBaseLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain.chains import create_retrieval_chain
from langchain_core.documents import Document
import sys, os, json

class ChatWebDoc:
    vector_store = None
    retriever = None
    chain = None

    def __init__(self):
        self.model = ChatOllama(model="mistral:instruct")
        # Loading embedding
        self.embedding = FastEmbedEmbeddings()
        self.text_splitter = CharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
        self.prompt = ChatPromptTemplate.from_messages(
            [
                ("system",
                 """You are an assistant for question-answering tasks. Use only the following
context to answer the question. If you don't know the answer, just say that you don't know.

CONTEXT:
{context}
"""),
                ("human", "{input}"),
            ]
        )

    def ingest(self, url_list):
        # Load web pages
        docs = WebBaseLoader(url_list).load()
        chunks = self.text_splitter.split_documents(docs)
        # Create vector store
        vector_store = Chroma.from_documents(documents=chunks,
            embedding=self.embedding, persist_directory="./chroma_db")

    def ingest_json(self, input_folder):
        # Turn every record of every JSON file in the folder into one document.
        all_chunks = []
        for filename in os.listdir(input_folder):
            if not filename.endswith('.json'):
                continue
            file_path = os.path.join(input_folder, filename)
            with open(file_path, 'r', encoding='utf-8') as file:
                dataset = json.load(file)
            for data in dataset:
                text = json.dumps(data) if isinstance(data, dict) else str(data)
                document = Document(page_content=text, metadata={"source": "local"})
                all_chunks.append(document)
        vector_store = Chroma.from_documents(documents=all_chunks,
            embedding=self.embedding,
            persist_directory="./chroma_db")

    def load(self):
        vector_store = Chroma(persist_directory="./chroma_db",
            embedding_function=self.embedding)
        self.retriever = vector_store.as_retriever(
            search_type="similarity_score_threshold",
            search_kwargs={
                "k": 3,
                "score_threshold": 0.5,
            },
        )
        document_chain = create_stuff_documents_chain(self.model, self.prompt)
        self.chain = create_retrieval_chain(self.retriever, document_chain)

    def ask(self, query: str):
        if not self.chain:
            self.load()
        result = self.chain.invoke({"input": query})
        print(result["answer"])
        for doc in result["context"]:
            print("Source: ", doc.metadata["source"])

def build():
    w = ChatWebDoc()
    w.ingest([
        "https://www.webagesolutions.com/courses/WA3446-comprehensive-angular-programming",
        "https://www.webagesolutions.com/courses/AZ-1005-configuring-azure-virtual-desktop-for-the-enterprise",
        "https://www.webagesolutions.com/courses/AZ-305T00-designing-microsoft-azure-infrastructure-solutions",
    ])
    w.ingest_json("./input_json")

def chat():
    w = ChatWebDoc()
    w.load()
    build()
    while True:
        query = input(">>> ")
        if len(query) == 0:
            continue
        if query == "/exit":
            break
        w.ask(query)

if len(sys.argv) < 2:
    chat()
elif sys.argv[1] == "--ingest":
    build()
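In ingest_json above, each record is embedded as a raw JSON string. One option that may be worth trying, although it is not something the code above does, is to flatten each record into "key: value" lines before embedding so that the text carries less punctuation. The sketch below shows such a conversion; flatten and record_to_text are hypothetical helpers, the sample record is invented, and whether this actually improves retrieval would need to be measured.

```python
def flatten(record, prefix=""):
    """Yield (dotted_key, value) pairs for a nested dict/list structure."""
    if isinstance(record, dict):
        for key, value in record.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(record, list):
        for index, value in enumerate(record):
            yield from flatten(value, f"{prefix}{index}.")
    else:
        yield prefix.rstrip("."), record


def record_to_text(record):
    """Render one JSON record as 'key: value' lines for embedding."""
    return "\n".join(f"{key}: {value}" for key, value in flatten(record))


# Invented product record.
sample = {"name": "Desk", "price": 120, "tags": ["office", "wood"],
          "size": {"w": 100, "d": 50}}
print(record_to_text(sample))
```

The returned string could be used as page_content in place of json.dumps(data).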

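The execution result is as follows.

Answers are generally produced well, but depending on the question, wrong answers are sometimes generated. For this reason, a wider range of LLM RAG options and techniques needs to be applied to improve answer accuracy.

Conclusion

Various approaches along the solution tech tree were explored and tried to reach the goal, but the results are not yet perfect, and publicly available materials also produced many errors. The examples above describe only the parts of that tech tree that succeeded.

In practice, because of resource constraints, RAG is chosen as an alternative to LLM tuning in Korea, where most infrastructure environments are limited, but its limitations are clear. Realistically, almost no domestic IT company has the capability to fine-tune or fully train an LLM. For reference, training the Llama 3 8B model on a single H100 GPU (80GB) would take 1,388.9 months (for the calculation method, see Transformer Math 101).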

Llama-3-8B-Instruct | NVIDIA NGC (How large an LLM can I train from scratch on a single A100 GPU with 80Gb memory?)

Training the Llama 2 7B model, which has fewer parameters, takes 114.5 months (about 10 years) on a single A100 GPU.

The Carbon Impact of Large Language Models: AI's Growing Environmental Cost
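The month figures above can be reproduced by converting GPU-hours into wall-clock time on a single GPU. At 720 hours per month, 1,388.9 months corresponds to about 1,000,000 GPU-hours and 114.5 months to 82,440 GPU-hours; these GPU-hour inputs are inferred from the figures in the text, not quoted from the linked sources. The sketch also includes the common rule of thumb that training compute is roughly 6 FLOPs per parameter per token. The token count, peak throughput, and utilization passed to it in the last line are illustrative assumptions only.

```python
HOURS_PER_MONTH = 24 * 30  # simplifying assumption: 30-day months


def gpu_hours_to_months(gpu_hours, num_gpus=1):
    """Wall-clock months needed to spend `gpu_hours` on `num_gpus` GPUs."""
    return gpu_hours / num_gpus / HOURS_PER_MONTH


def training_flops(params, tokens):
    """Rule-of-thumb training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def months_from_flops(flops, flops_per_second, utilization):
    """Months on one device with the given peak throughput and utilization."""
    seconds = flops / (flops_per_second * utilization)
    return seconds / 3600 / HOURS_PER_MONTH


print(round(gpu_hours_to_months(1_000_000), 1))  # 1388.9
print(round(gpu_hours_to_months(82_440), 1))     # 114.5

# Illustrative assumptions: 8e9 parameters, 15e12 tokens,
# 1e15 FLOP/s peak throughput, 40% utilization.
print(months_from_flops(training_flops(8e9, 15e12), 1e15, 0.4))
```

Spreading the same work over many GPUs divides the wall-clock time accordingly, for example gpu_hours_to_months(1_000_000, num_gpus=1000).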

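The alternatives, such as fine-tuning, quantization, and LoRA, require a large amount of training data (a well-curated dataset of at least tens of thousands of examples), and their results are also limited (papers and leaderboard results that report only computed logit probabilities cannot be fully trusted, and the resulting models are often hard to use in practice). For these reasons RAG is used, but it too has many issues that need further study.

For more details, see the links below.

References

RAG approaches specialized for JSON are as follows.

Appendix: Developing a RAG-based SQL coding agent

Appendix: Developing structured-output RAG and fine-tuned models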

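Sunday, May 12, 2024

Image-to-text with LlamaIndex and an LLM

This post briefly introduces how to develop image-to-text with LlamaIndex and an LLM.

Installing the development environment

Install the following packages with pip install.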

llama-index

llama-index-llms-huggingface

llama-index-embeddings-fastembed

fastembed

Unstructured[md]

chromadb

llama-index-vector-stores-chroma

llama-index-llms-groq

einops

accelerate

sentence-transformers

llama-index-llms-mistralai

llama-index-llms-openai

Coding

Generate text that describes the following image.

Write and run the following code.

import os, torch
from os import path
from PIL import Image
from llama_index.core import VectorStoreIndex
from llama_index.core import SimpleDirectoryReader
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings
from llama_index.llms.huggingface import (
    HuggingFaceInferenceAPI,
    HuggingFaceLLM,
)
from transformers import AutoModelForCausalLM, AutoTokenizer

print(f'GPU={torch.cuda.is_available()}')

# Load the pretrained Moondream2 model.
model_id = "vikhyatk/moondream2"
revision = "2024-04-02"
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)
tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)  # tokenizer

# Load the image, pass it to the model, and generate the text.
image = Image.open('./cat.jpg')
enc_image = model.encode_image(image)
image_caption = model.answer_question(enc_image, "Describe this image in detail with transparency.", tokenizer)
print(image_caption)

Running

If the result looks like the following, it worked.

In a verdant field, a black and white dog lies on its side, its head resting on the lush green grass. A brown and white cat sits upright on its hind legs, its gaze fixed on the dog. The dog's tail is wagging, and the cat's tail is curled up in a playful manner. The background is a serene landscape of trees and bushes, with the sun casting dappled shadows on the grass.


Saturday, May 11, 2024

Building a data ETL pipeline with Google Cloud Data Fusion

This post briefly describes how to build a data ETL pipeline with Google Cloud Data Fusion.

Preparation

Select or create a project in the Google Cloud Console.

Creating and managing projects | Resource Manager Documentation | Google Cloud

In the Google Cloud Console, open the Instances page and click View instance to open the Cloud Data Fusion web interface.

Usage

Go to the Studio page, select Source, and check the Cloud Storage node.

Select Properties on that node.

Double-click it and enter the required properties.

Select Wrangler from the Transform drop-down menu. Then drag and drop each node.

Select Cloud Storage from the Sink drop-down menu and enter the appropriate properties.

In this way, data ETL (Extract, Transform, Load) can be processed.
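For readers who want to see the three stages outside the graphical Studio, the following standard-library Python sketch mirrors the Source, Wrangler, and Sink nodes on an in-memory CSV string. It is only an analogy and does not use any Cloud Data Fusion API; the column names, the cleaning rule, and the sample data are all invented.

```python
import csv
import io


def extract(csv_text):
    """Extract: read rows from CSV text (stands in for the Cloud Storage source)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def transform(rows):
    """Transform: trim whitespace and drop rows without a name (a Wrangler-like step)."""
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if row["name"]:
            cleaned.append(row)
    return cleaned


def load(rows):
    """Load: write rows back out as CSV text (stands in for the Cloud Storage sink)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["name", "city"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Invented input with stray spaces and one row that has no name.
raw = "name,city\n Kim ,Seoul\n,Busan\nLee, Incheon \n"
print(load(transform(extract(raw))))
```

Each function corresponds to one node in the pipeline, so the data flows through them in the same order as in the Studio canvas.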

References
Design and create a reusable pipeline | Cloud Data Fusion Documentation | Google Cloud
Cloud Data Fusion을 사용하여 데이터 파이프라인 만들기 | Cloud Data Fusion 문서 | Google Cloud