
Overview

https://medium.com/malware-buddy/applying-llms-to-threat-intelligence-f3b8ba4463a4

Applying LLMs to Threat Intelligence

A Practical Guide with Code Examples

blog.securitybreak.io

This post is a straightforward translation of the article above, made for personal study.

LLMs, or Large Language Models, are an exciting technology designed to leverage natural languages with various technologies. Specifically in Cybersecurity, and more so in Threat Intelligence, there are challenges that can be partially addressed with LLMs and generative AI.
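- LLMs are an exciting technology designed to put natural language to work alongside other technologies.
- In cybersecurity, and in threat intelligence especially, several challenges can be partly addressed with LLMs and generative AI.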
While much of the focus is on prompt engineering skills, there’s more to consider than just choosing the right word to interact with a model.
- Most of the attention goes to prompt engineering skills, but there is more to consider than word choice.
In this blog, I will discuss the potential of LLMs for threat intelligence applications. I will first introduce some common challenges, then define what prompt engineering is and how it can be applied to practical use cases. Next, I will discuss some techniques such as few-shot learning, RAG, and agents. Everything will be illustrated with code examples. Stay with me, as we’re about to dive deep and acquire real skills, rather than just skimming the surface.
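- I will discuss the potential of LLMs for threat intelligence.
- First I introduce some common challenges, then define prompt engineering and show how it applies to practical use cases.
- Next come techniques such as few-shot learning, RAG, and agents.
- Everything is illustrated with code examples.
- The goal is to acquire real skills rather than skim the surface.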

Threat Intelligence Challenges

In Threat Intel, there are several challenges to deal with. First, the sheer volume of information produced today can be overwhelming, and no one has the time to read it all. Second, investigating a threat can be time-consuming, and junior analysts might lack the necessary background to conduct the investigation effectively. Additionally, the dynamic nature of threats means that analysts often have to keep up with rapidly changing tactics, techniques, and procedures, which can be daunting even for seasoned professionals.
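- Threat intelligence faces several challenges.
- First, the volume of information produced today is overwhelming; nobody has time to read it all.
- Second, investigating a threat takes time, and junior analysts may lack the background to do it effectively.
- In addition, because threats are dynamic, analysts must keep up with rapidly changing tactics, techniques, and procedures.
- That can be daunting even for seasoned professionals.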
With these challenges in mind, let’s explore how LLMs can be utilized to enhance analysts’ capabilities.
- With these challenges in mind, let's see how LLMs can strengthen analysts' capabilities.

What is Prompt Engineering?

We cannot discuss LLMs without first defining what prompt engineering is.
- LLMs cannot be discussed without a definition of prompt engineering.
Prompt Engineering is the discipline and science of crafting effective prompts to guide AI models, particularly LLMs, toward desired outputs. Much as a potter, wood carver, or "tailleur de pierre" (stone cutter) depends on the right tools, prompt engineering is the essential tool for working with LLMs.
- Prompt engineering is the discipline and science of writing effective prompts that steer AI models (LLMs in particular) toward the desired output.
- Like the tools of a potter, wood carver, or stone cutter, prompt engineering is an essential tool.

To craft the ideal prompt, there are several basics to follow:
- Here are a few principles for crafting the ideal prompt.
  - Clarity: Define the task you want the model to perform clearly.
    - Clarity: state exactly what the model should do.
  - Specificity: Provide as much detail as necessary to eliminate ambiguity.
    - Specificity: give as much detail as needed to remove ambiguity.
  - Iteration: Continuously refine prompts based on feedback from the AI.
    - Iteration: keep improving the prompt based on the AI's feedback.
However, there are also common pitfalls to be wary of:
- There are, however, common pitfalls as well.
  - Over-complexity: Refrain from making prompts excessively detailed.
    - Over-complexity: avoid excessively detailed prompts.
  - Ambiguity: Avoid vague prompts as they can lead to generic answers.
    - Ambiguity: avoid vague prompts, since they lead to generic answers.
  - Blind Trust in the Model: Relying too much on the model's capabilities without adequate verification.
    - Blind trust in the model: leaning on the model's abilities without adequate verification.
  - No Examples: Omitting example inputs and outputs.
    - No examples: leaving out example inputs and outputs.
  - Misplaced Belief in Model's Understanding: Assuming the model grasps your intent without clarity.
    - Misplaced belief in the model's understanding: assuming the model grasps your intent when it was never made clear.
  - Ignoring Obsolescence: Neglecting to refresh prompts in tandem with model updates or changes in relevant data.
    - Ignoring obsolescence: failing to refresh prompts when the model is updated or the relevant data changes.
The following example demonstrates an ideally crafted prompt:
- The following example shows an ideally crafted prompt.

Anatomy of an Ideal Prompt (extract from my BSides Melbourne conference talk)
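As a rough complement to the figure, the principles above (clarity, specificity, examples) can be sketched as a small prompt builder. This is an illustration only, not code from the original article: the function name `build_prompt`, the section labels, and the sample task are all assumptions.

```python
def build_prompt(role, task, constraints, examples, user_input):
    """Assemble a prompt from clearly separated sections."""
    lines = [f"You are {role}.", f"Task: {task}"]
    if constraints:
        lines.append("Constraints:")
        lines.extend(f"- {c}" for c in constraints)
    # Each example pairs an input with the output we expect for it.
    for example_in, example_out in examples:
        lines.append(f"Example input: {example_in}")
        lines.append(f"Example output: {example_out}")
    lines.append(f"Input: {user_input}")
    return "\n".join(lines)


# Hypothetical task used purely for illustration.
prompt = build_prompt(
    role="a threat intelligence analyst",
    task="Extract every CVE identifier mentioned in the input.",
    constraints=[
        "Answer with a comma-separated list",
        "Answer 'none' if no CVE is present",
    ],
    examples=[("Exploitation of CVE-2022-47966 was observed.", "CVE-2022-47966")],
    user_input="The actor chained CVE-2021-44228 with a local exploit.",
)
print(prompt)
```

Keeping the sections separate makes it easier to iterate on one part (for instance the constraints) without touching the rest.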

But while many individuals are focusing on crafting the perfect prompt, they are essentially overlooking the true potential of LLMs and their vast capabilities.
- Yet by focusing on the perfect prompt, many people overlook the real potential and breadth of what LLMs can do.
Now, let’s talk about the genuine strength of LLMs and explore how we can pragmatically create our own applications with it.
- Now let's talk about the real strength of LLMs and how to build our own applications with them in practice.

🤓 Practical Application of LLMs

There are multiple techniques that can be used in conjunction with a model. In this section, I will explore some of them to provide you with the keys to delve into this technology independently and achieve a better understanding of it.
- Several techniques can be used together with a model. This section covers a few of them so that you can explore the technology on your own and understand it better.

Few-Shot Prompting

Few-shot prompting is an interesting technique that can be employed to instruct an LLM using a very limited amount of data.
- Few-shot prompting is an interesting technique for instructing an LLM with a very small amount of data.
The idea is to supply your model with some examples of what you expect so it can replicate them directly. For instance, in the code below, I 'teach' the model a desired output (in this case, a Mermaid mindmap) so that it can produce similar mindmaps in the future.
- The idea is to give the model a few examples that it can imitate directly.
- In the code below, for example, the model is shown a Mermaid mindmap (Mermaid.js diagram code) so that it can produce similar mindmaps later.

import openai  # legacy (pre-1.0) SDK, which provides openai.ChatCompletion

# Generate a Mermaid mindmap using the few-shot technique.
# NB: the more shots (examples) you add, the better the results tend to be.
def run_models(input_text):
  response = openai.ChatCompletion.create(
      model="gpt-4",
      messages= [
          {
              "role": "system",
              "content":"You are tasked with creating an in-depth mindmap designed specifically for a threat analyst. This mindmap aims to visually organize key findings and crucial highlights from the text. Please adhere to the following guidelines: \n1. Avoid using hyphens in the text, as they cause errors in the Mermaid.js code \n2. Limit the number of primary nodes branching from the main node to four. These primary nodes should encapsulate the top four main themes. Add detailed sub-nodes to elaborate on these themes \n3. Incorporate icons where suitable to enhance readability and comprehension\n 4. Use single parentheses around each node to give them a rounded shape."
          },
          {
              "role": "user",
              "content": "Title: \ud83e\udda0 Lazarus Group's Infrastructure Reuse Leads to Discovery of New Malware\n\nThe Lazarus Group, a North Korean state-sponsored actor famous for its relentless cyber offensive actions, continues to adjust its tactics and expand its arsenal. Recently, the revealed an exploitation of the ManageEngine ServiceDesk vulnerability (CVE-2022-47966) in another campaign. This exposure led to deploying multiple threats, with a new one identified as CollectionRAT, alongside an already used threat named QuiteRAT. \n\nThe advanced malware CollectionRAT has standard remote access trojan (RAT) capabilities, being able to run arbitrary commands on an infected system. Our intense analysis linked CollectionRAT to Jupiter/EarlyRAT, a malware family somewhat known to be associated with Andariel, a subgroup under the Lazarus Group umbrella. Interestingly, the group is gradually increasing its reliance on open-source tools and frameworks in the initial access phase of its attacks, as shown by Lazarus' use of the DeimosC2 framework. \n\nThe Lazarus Group's unchanging use of certain infrastructures, despite them being well-documented by security researchers, is another noteworthy observation. Their modus operandi, captured in the repeated use of the same tactics, techniques, and procedures (TTPs), shows audacious assurance in their operations. But this approach also offers intelligence opportunities for security analysts on tracking these reusable infrastructure components. \n \nDespite their blatant actions, the Lazarus Group's dynamism is evident by their increasingly heavy reliance on using dual-use utilities for activities like reverse tunneling. Some of the reused infrastructure components hosted the new CollectionRAT malware, and an altered copy of PuTTY's Plink utility was downloaded onto compromised endpoints, further demonstrating Lazarus Group's rapid evolution. \n\nIn conclusion, the Lazarus Group continues to reuse and recycle its well-worn tactics while pushing ahead with new threats and evolving TTPs. This information highlights why there's a need for organized cooperation among threat researchers and the critical importance of staying up-to-date with the latest threat intelligence.",
          },
          {
              "role": "assistant",
              "content": "mindmap\nroot(Lazarus Group Threat Analysis)\n    (Infrastructure Reuse)\n      ::icon(fa fa-sync-alt)\n      (Used in latest campaign)\n      (Includes Plink, an open-source tool)\n      (Indicates confidence in operations)\n    (New Malware - CollectionRAT)\n      ::icon(fa fa-bug)\n      (Remote Access Trojan)\n      (Allows running arbitrary commands on infected system)\n      (Similarities to EarlyRAT malware)\n    (Shifting Tactics)\n      ::icon(fa fa-exchange-alt)\n      (Increased use of open-source tools and frameworks)\n      (Including DeimosC2 framework)\n    (Protection Measures)\n      ::icon(fa fa-shield-alt)\n      (Detection and blocking with Cisco security products)\n      (Indicators of Compromise available on GitHub)\n    (Threat Report Significance)\n      ::icon(fa fa-exclamation-circle)\n      (Highlighted continued activity of Lazarus Group)\n      (Essential for enhancing detection and response capabilities)",
          },
          {"role": "user", "content": input_text},
      ],
  )
  return response.choices[0].message['content']

In the code above, I provide some examples to clarify what I’m expecting for the output. The information breaks down as follows:
- In the code above I provide examples to make the expected output clear. The pieces break down as follows.
- System: I assign the role of "system" to my tool and detail what I anticipate from this system. For this example, I'm aiming for a mindmap.
  - System: the "system" role tells the tool what is expected of it, here a mindmap.
- User: The second message designates the role of "user." It presents an example of user input.
  - User: the second message carries the "user" role and shows a sample input.
- Assistant: With the "assistant" role (representing the model), I provide an illustration of the expected output, in this instance the Mermaid mindmap code.
  - Assistant: the "assistant" role supplies an example of the expected output, here Mermaid mindmap code.
- Finally, I capture the user input, allowing the assistant to generate the subsequent mindmap based on that input.
  - Finally, the real user input is appended so that the assistant generates the next mindmap from it.
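The message layout described above can be sketched as a small helper that turns a list of example pairs into a chat message list. This is a minimal illustration, not the article's code: `build_few_shot_messages` and the placeholder shots are assumptions, and no API call is made.

```python
def build_few_shot_messages(system_prompt, shots, user_input):
    """Build a chat message list: system, then example pairs, then the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in shots:
        # Each shot is a user message followed by the assistant answer we want imitated.
        messages.append({"role": "user", "content": example_input})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": user_input})
    return messages


# Placeholder shots; real ones would be full reports and full mindmaps.
shots = [
    ("Report about actor A ...", "mindmap\nroot(Actor A)"),
    ("Report about actor B ...", "mindmap\nroot(Actor B)"),
]
messages = build_few_shot_messages(
    "You create Mermaid mindmaps for threat analysts.", shots, "New report text"
)
print([m["role"] for m in messages])
```

A list built this way could be passed as the `messages` argument in the function shown earlier, which makes adding more shots a matter of extending `shots`.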
An example of the resulting mindmap can be seen below:

Retrieval Augmented Generation (RAG)

The models we use are trained on a specific set of data up to a particular date. This implies that more recent data might not be recognized by the model, and most importantly, your personal/private data isn’t known to it either.
- The models we use were trained on a specific dataset up to a particular date.
- More recent data may therefore be unknown to the model.
- Above all, your personal or private data is unknown to it too.
RAG presents an interesting approach that enables you to supplement the model with your own data, thereby expanding its capabilities. RAG is a technique that melds retrieval-based and generative models.
- RAG is an interesting approach that lets you supplement the model with your own data.
- This extends what the model can do.
- RAG combines retrieval-based and generative models.
Two Phases: Retrieval & Generation
- Retrieval: This phase searches the database built from the data you provided.
  - It looks up the relevant entries in your own data.
- Generation: This phase produces a context-relevant response based on the information retrieved from your database.
  - It produces an answer grounded in what was retrieved.
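The two phases can be sketched without any external service. The toy example below is an assumption-laden illustration: it uses simple word overlap instead of embeddings, the documents are made-up placeholders, and the "generation" phase only builds the prompt that would be sent to a model.

```python
import re
from collections import Counter

# Hypothetical knowledge base; a real one would hold your own documents.
DOCUMENTS = [
    "Group Alpha uses spearphishing attachments for initial access.",
    "Group Beta relies on exposed remote desktop services.",
    "Group Gamma deploys ransomware after stealing credentials.",
]


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def score(query, document):
    """Count how many query words also appear in the document."""
    q, d = Counter(tokenize(query)), Counter(tokenize(document))
    return sum(min(q[w], d[w]) for w in q)


def retrieve(query, documents, k=1):
    """Retrieval phase: return the k documents that best match the query."""
    return sorted(documents, key=lambda doc: score(query, doc), reverse=True)[:k]


def build_generation_prompt(query, context_docs):
    """Generation phase input: the question plus the retrieved context."""
    context = "\n".join(f"- {doc}" for doc in context_docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "Which group uses spearphishing?"
top_docs = retrieve(query, DOCUMENTS, k=1)
print(build_generation_prompt(query, top_docs))
```

A production pipeline replaces the word-overlap score with vector similarity, as the following sections describe.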
The primary objective here is to enhance a model using your data. But how does it work under the hood?
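- The main goal is to enhance a model with your own data.
- But how does it work under the hood? ("Under the hood" refers to a car's hood: you can drive without looking there, but doing it properly means understanding the mechanics.)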
RAG operates in multiple stages. The subsequent diagram offers a streamlined visualization of the process.
- RAG operates in several stages.
- The following diagram is a simplified view of the process.

My friend, Roberto Rodriguez, conducted in-depth research on this topic using the MITRE ATT&CK Groups as a data source.
For the sake of this blog, I’ve adapted his code to be compatible with Jupyter Notebook and create an interface using pywidget. I’ll walk you through each step to construct your own RAG. In this example, we used LangChain, which is an open-source library designed for interacting with an LLM.
- For this post I adapted his code to run in Jupyter Notebook and built an interface with pywidget.
- I will walk through each step so you can build your own RAG.
- We use LangChain, an open-source library designed for interacting with LLMs.

Prepare Your Data (No, Really!)

You might have heard that when working with machine learning, deep learning, or AI models, it’s essential to clean your dataset. This step is crucial for obtaining the most accurate results.
- You have probably heard that cleaning the dataset matters when working with machine learning, deep learning, or AI models.
- This step is crucial for getting the most accurate results.
Ensuring that your entire dataset is well-formatted and consists of clean data is of utmost importance. Once your data is prepped, you can begin crafting your RAG.
- It is essential to make sure the whole dataset is well formatted and made up of clean data.
- Once the data is ready, you can start building your own RAG.
In this example, we used data exported from the Mitre ATT&CK groups. After downloading the data to your local system, you can begin loading it using Langchain.
- This example uses data exported from the MITRE ATT&CK groups. After downloading it locally, you can start loading it with LangChain.
Note: For this example, the data is stored in Markdown format, but you can use any type of data.
- In this example the data is stored as Markdown, but any type of data can be used.

import glob
import os

from langchain.document_loaders import UnstructuredMarkdownLoader

# Assumption: knowledge_directory is the local folder holding the exported
# ATT&CK group Markdown files; the path below is a placeholder.
knowledge_directory = "knowledge"

# Find every Markdown (*.md) file in knowledge_directory
group_files = glob.glob(os.path.join(knowledge_directory, "*.md"))

# Collect the loaded documents here
md_docs = []

for group in group_files:
    # Load the current Markdown file and add its documents to md_docs
    loader = UnstructuredMarkdownLoader(group)
    md_docs.extend(loader.load())

Here we load the ATT&CK group knowledge files into our RAG.
- The group knowledge files are what gets loaded into the RAG here.

Tokenization

Tokenization is the process of converting a sequence of text into individual units, known as "tokens." These tokens can range from being as small as characters to as long as words, depending on the specific needs of the task and the language in question. Tokenization is an essential pre-processing step in Natural Language Processing (NLP) and text analytics models. Tokenization can be done using the tiktoken library.
- Tokenization converts a text sequence into individual units called tokens.
- Tokens can be as small as characters or as long as words, depending on the task and the language.
- Tokenization is an essential preprocessing step in NLP, and the tiktoken library can perform it.
In our context, tokenization isn’t strictly required. However, it proves beneficial if you aim to manage the amount of data sent and for optimization and cost-control purposes.
- Tokenization is not strictly required here, but it helps when you want to manage how much data is sent, for optimization and cost control.
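To make the cost-control point concrete, here is a minimal sketch that keeps documents only while they fit a token budget. The whitespace-based counter is a deliberately rough stand-in (an assumption, not a real tokenizer), and `fit_to_budget` is a hypothetical helper name.

```python
def approx_token_count(text):
    """Very rough stand-in for a real tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_budget(documents, max_tokens, count_tokens=approx_token_count):
    """Keep documents in order until adding the next one would exceed the budget."""
    selected, used = [], 0
    for doc in documents:
        cost = count_tokens(doc)
        if used + cost > max_tokens:
            break
        selected.append(doc)
        used += cost
    return selected, used


docs = ["one two three", "four five", "six seven eight nine"]
selected, used = fit_to_budget(docs, max_tokens=6)
print(selected, used)
```

With tiktoken installed, `count_tokens` could instead be a function that returns the length of the encoded text, which matches how the model actually counts tokens.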

Splitting into Smaller Chunks

Dividing your imported data into smaller chunks is a strategy designed to make it easier for the model to access the imported data.
- Splitting the imported data into smaller chunks is a strategy that makes the data easier for the model to access.
In this instance, we're using the `RecursiveCharacterTextSplitter` from LangChain. This method tries to split the text on an ordered list of separators until the resulting chunks reach the desired size. By default, the separators are ["\n\n", "\n", " ", ""]. The method strives to keep paragraphs, then sentences, then words intact, since those units are typically semantically connected. By default the size of each chunk is measured in characters; a custom length function can measure it in tokens instead.
- This example uses LangChain's `RecursiveCharacterTextSplitter`.
- It keeps splitting until the resulting chunks reach the desired size.
- The default separators are "\n\n", "\n", " ", and "".
- Because paragraphs, sentences, and words are semantically connected, the method tries to keep them intact.
- Chunk size is measured by character count unless a different length function is supplied.
The following code demonstrates how to employ this method with our MITRE ATT&CK Groups data.
- The following code shows how to apply this method to the MITRE ATT&CK Groups data.

import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Assumption: tiktoken_len is not defined in the original snippet; this helper
# counts tokens with the tokenizer tiktoken associates with gpt-4.
tokenizer = tiktoken.encoding_for_model("gpt-4")

def tiktoken_len(text):
    return len(tokenizer.encode(text, disallowed_special=()))

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,  # Maximum number of tokens in each chunk
    chunk_overlap=50,  # Number of tokens shared by adjacent chunks
    length_function=tiktoken_len,  # Measure length in tokens, not characters
    separators=['\n\n', '\n', ' ', '']  # Tried in order, coarsest first
)

# Split the loaded Markdown documents; the embedding step uses `chunks`
chunks = text_splitter.split_documents(md_docs)
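To show the recursive idea itself, here is a simplified pure-Python sketch. It is not LangChain's implementation: it measures size in characters, ignores chunk overlap, and the function name `recursive_split` is an assumption made for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text into pieces of at most chunk_size characters, coarsest separator first."""
    if len(text) <= chunk_size:
        return [text] if text else []
    # No separator left (or the empty separator): fall back to hard slicing.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, rest)
    chunks, current = [], ""
    for part in parts:
        candidate = part if not current else current + sep + part
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(part) <= chunk_size:
            current = part
        else:
            # The piece alone is too large: split it with finer separators.
            chunks.extend(recursive_split(part, chunk_size, rest))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "alpha beta\n\ngamma delta epsilon\n\nzeta"
pieces = recursive_split(sample, chunk_size=12)
print(pieces)
```

In the sample, the paragraph that exceeds the limit is split on spaces, while the short paragraphs stay whole, which is the behavior the text above describes.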

Embeddings

Embeddings provide a means to convert words or phrases into numerical representations, or vectors, so they can be easily processed by computers. Why is this useful? By transforming text into numerical form, it becomes simpler to gauge the similarity between words or sentences, facilitating tasks such as search and classification.
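- Embeddings convert words or phrases into numbers (vectors) so that computers can process them easily.
- Why is this useful?
- Once text is in numerical form, measuring similarity between words or sentences becomes easier, which helps with tasks such as search and classification.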
A vector, in essence, is a list of numbers. In embeddings, each number in this list captures some aspect or feature of the text. Such vectors allow computers to grasp and compare concepts. For instance, the vector for “apple” might bear more similarity to the one for “fruit” than to that of “car.” This helps a computer discern that apples are more akin to fruits than to vehicles.
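- A vector is essentially a list of numbers. In an embedding, each number captures some aspect or feature of the text.
- These vectors let computers grasp and compare concepts.
- For example, the vector for "apple" may be closer to the vector for "fruit" than to the one for "car."
- This helps a computer recognize that apples are more like fruits than vehicles.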
In simpler terms, embeddings convert text into vectors. As you might glean, these vectors provide a convenient means to store data for our RAG and model.
- Put simply, embeddings turn text into vectors.
- These vectors are a convenient way to store data for the RAG and the model.
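The apple, fruit, and car comparison can be sketched with cosine similarity. The three-dimensional vectors below are hand-made for illustration (real embeddings come from a model and have hundreds or thousands of dimensions), and `most_similar` is a hypothetical brute-force helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors, not the output of any embedding model.
vectors = {
    "apple": [0.9, 0.8, 0.1],
    "fruit": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}


def most_similar(word, vectors, k=1):
    """Brute-force nearest neighbours by cosine similarity."""
    query = vectors[word]
    scored = [
        (cosine_similarity(query, vec), name)
        for name, vec in vectors.items()
        if name != word
    ]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


print(most_similar("apple", vectors))
```

A vector index performs the same kind of nearest-neighbour lookup, only with data structures that stay fast on very large collections.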
In the example below, we use FAISS. Developed by Facebook, FAISS aids in swiftly identifying items that resemble a particular item based on their numerical (vector) representation. To illustrate, imagine a vast library of books, and you wish to pinpoint the ones most similar to a specific title. FAISS expedites this task, even with an extensive collection.
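- The example below uses FAISS. Developed by Facebook, FAISS quickly finds items similar to a given item based on their numerical (vector) representation.
- Picture a vast library of books in which you want to find the titles most similar to a specific one.
- FAISS does this quickly even on a very large collection.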

from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import FAISS


embeddings = OpenAIEmbeddings()

# Send text chunks to OpenAI Embeddings API
db = FAISS.from_documents(chunks, embeddings)

retriever = db.as_retriever(search_kwargs={"k":5})
query = "What are some phishing techniques used by threat actors?"
print("[+] Getting relevant documents for query..")
relevant_docs = retriever.get_relevant_documents(query)

Alright, our retriever is now up and running. The next step is to integrate this retriever with our LLM.
- Our retriever is now running. The next step is to connect it to the LLM.

Retriever and LLM

Once we can interact with our data, we can then employ our LLM to formulate the expected answer. The screenshot below shows the Jupyter notebook with the code discussed. 👇
- Once we can interact with the data, the LLM can be used to formulate the expected answer.
- The screenshot below shows the Jupyter notebook containing the code discussed.

We now have our RAG operational. But one thing that’s bothersome is that our model doesn’t remember what we’ve discussed previously…
- Our RAG is now operational.
- One annoyance remains: the model does not remember what was discussed earlier.

RAG + Memory

Being able to interact with your own data is quite powerful; you can essentially feed any type of data and let your LLM work with your personalized or internal data.
- Being able to interact with your own data is very powerful.
- You can feed in essentially any type of data and let the LLM work with your personalized or internal data.
However, as seen in our previous example, the model doesn’t retain the memory of prior interactions, which can be somewhat frustrating when trying to gather multiple pieces of information about the same threat actor.
- However, as the previous example showed, the model does not retain earlier interactions, which is frustrating when gathering several pieces of information about the same threat actor.
By configuring memory in your RAG tools, you can maintain a record of previous interactions, ensuring a continuous flow of information without needing to pose the same questions repeatedly.
- Configuring memory in your RAG tools keeps a record of earlier interactions, so information flows continuously without repeating the same questions.
This can be seamlessly achieved using Langchain.
- LangChain makes this straightforward.

from langchain.chains import ConversationalRetrievalChain
from langchain.memory import ConversationBufferMemory
from langchain.chat_models import ChatOpenAI
from langchain.prompts.prompt import PromptTemplate

# Initialize the chat model
model = ChatOpenAI(model_name="gpt-4", temperature=0.3)

# Build a retriever from the FAISS vector store `db` created earlier
retriever = db.as_retriever(search_kwargs={"k": 8})

# Define your custom template
custom_template = """You are an AI assistant specialized in MITRE ATT&CK and you interact with a threat analyst, answer the follow up question. If you do not know the answer reply with 'I am sorry'.
Chat History:
{chat_history}
Follow Up Input: {question}
Answer: """
CUSTOM_QUESTION_PROMPT = PromptTemplate.from_template(custom_template)

# Initialize memory for chat history
memory = ConversationBufferMemory(memory_key="chat_history", return_messages=True)

# Initialize the ConversationalRetrievalChain
qa_chain = ConversationalRetrievalChain.from_llm(model, retriever, condense_question_prompt=CUSTOM_QUESTION_PROMPT, memory=memory)
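Conceptually, a buffer memory is just a list of past turns rendered into the `{chat_history}` slot of the template. The sketch below illustrates that idea in plain Python; it is not how LangChain implements it, and the class and function names are assumptions.

```python
class BufferMemory:
    """Keep every question and answer so later prompts can include them."""

    def __init__(self):
        self.turns = []

    def save(self, question, answer):
        self.turns.append((question, answer))

    def render(self):
        return "\n".join(f"Human: {q}\nAI: {a}" for q, a in self.turns)


TEMPLATE = "Chat History:\n{chat_history}\nFollow Up Input: {question}\nAnswer: "


def build_follow_up_prompt(memory, question):
    """Fill the template with the stored history and the new question."""
    return TEMPLATE.format(chat_history=memory.render(), question=question)


memory_sketch = BufferMemory()
# Placeholder turn; a real answer would come from the model.
memory_sketch.save("Tell me about Group Alpha.", "Group Alpha is a placeholder actor.")
follow_up = build_follow_up_prompt(memory_sketch, "Which techniques do they use?")
print(follow_up)
```

Because the earlier turn is in the prompt, the model can resolve "they" in the follow-up question without the analyst repeating the group name.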

Now, we can initiate a dialogue with our RAG:
- We can now start a conversation with our RAG.

To make things easier, I’ve put together a comprehensive Jupyter notebook available on my website for you to tailor the code to your specific needs.

ReAct and Agents

The next concept I’d like to discuss is the ReAct framework and the agent features offered by LangChain.
- The next concept is the ReAct framework and the agent features that LangChain provides.
ReAct is a logical framework designed for crafting intelligent agents. Its chief purpose is to endow agents with the ability to carry out complex tasks through a series of actions. Central to ReAct are two core components: ‘Reason’ and ‘Act’. The ‘Reason’ facet reflects the agent’s cognitive process, where it ponders and decides the subsequent action. In contrast, ‘Act’ symbolizes the tangible action the agent executes based on its prior reasoning.
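- ReAct is a logical framework for building intelligent agents.
- Its main purpose is to let agents carry out complex tasks through a series of actions.
- Its two core components are "Reason" and "Act."
- "Reason" reflects the agent's cognitive process, in which it deliberates and decides the next action.
- "Act" is the concrete action the agent executes based on that reasoning.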
You can think of ReAct’s operational flow as an “Action → Observation → Thought Cycle”. Initially, the agent performs an action. It then observes and evaluates the results of that action. After observing, the agent ponders or reasons about its next step. This iterative process ensures the agent continually adapts and responds to the dynamic conditions of its surroundings.
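- ReAct's operational flow can be thought of as Action -> Observation -> Thought.
- First, the agent performs an action.
- It then observes and evaluates the result of that action.
- After observing, the agent reasons about its next step.
- This iterative process lets the agent keep adapting and responding to changing conditions.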
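The cycle can be sketched as a small loop. Everything here is a stand-in chosen for illustration: `scripted_reasoner` replaces the LLM with fixed rules, `lookup_ip` returns canned text instead of querying a service, and the IP belongs to a documentation-only range.

```python
def lookup_ip(ip):
    """Hypothetical tool: returns canned data instead of calling a real service."""
    return f"{ip} is linked to 2 samples"


TOOLS = {"lookup_ip": lookup_ip}


def scripted_reasoner(question, observations):
    """Stand-in for the LLM: picks the next step from what has been observed so far."""
    if not observations:
        return {"thought": "I need data about the IP.", "action": "lookup_ip", "input": question}
    return {"thought": "I have enough information.", "final": observations[-1]}


def react_loop(question, reasoner, tools, max_steps=5):
    """Alternate reasoning, acting, and observing until a final answer appears."""
    observations, trace = [], []
    for _ in range(max_steps):
        step = reasoner(question, observations)  # thought
        trace.append(step["thought"])
        if "final" in step:
            return step["final"], trace
        observation = tools[step["action"]](step["input"])  # action
        observations.append(observation)  # observation feeds the next thought
    return None, trace


answer, trace = react_loop("192.0.2.1", scripted_reasoner, TOOLS)
print(answer)
print(trace)
```

In a real agent the reasoner is the model itself, which reads the tool descriptions and the accumulated observations before choosing what to do next.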

This notion is incredibly powerful and can be seamlessly integrated with various tools. Remember, in LangChain, an agent can represent anything, allowing you to essentially craft your own applications atop this foundation.
- This concept is very powerful and combines well with a variety of tools.
- In LangChain an agent can represent anything, so you can build your own applications on this foundation.
In the example that follows, I’ve employed the agent functionality of LangChain in synergy with MSTICpy, constructing an agent that leverages MSTICpy’s features.
- The following example uses LangChain's agent functionality together with MSTICpy.
- The result is an agent that leverages MSTICpy's features.
NB: MSTICpy is a Python library dedicated to threat intelligence investigation.

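from msticpy.sectools.tilookup import TILookup
from msticpy.sectools.vtlookupv3 import VTLookupV3
from langchain.chat_models import ChatOpenAI
from langchain.agents import Tool
from langchain.agents import initialize_agent
from langchain.agents import AgentType

llm = ChatOpenAI(model_name="gpt-4", temperature=0.3)

# Assumption: the original snippet uses `vt_lookup` without defining it. It is
# taken here to be a MSTICpy VTLookupV3 client reading the VirusTotal key from
# the MSTICpy configuration; adjust the import path to your MSTICpy version.
vt_lookup = VTLookupV3()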

class TIVTLookup:
    def __init__(self):
        self.ti_lookup = TILookup()

    def ip_info(self, ip_address: str) -> str:
        result = self.ti_lookup.lookup_ioc(observable=ip_address, ioc_type="ipv4", providers=["VirusTotal"])
        details = result.at[0, 'RawResult']
        sliced_details = str(details)[:3500]
        return sliced_details

    def communicating_samples(self, ip_address: str) -> str:
        domain_relation = vt_lookup.lookup_ioc_relationships(observable = ip_address, vt_type = 'ip_address', relationship = 'communicating_files', limit = "10")
        return domain_relation

    def samples_identification(self, hash: str) -> str:
        hash_details = vt_lookup.get_object(hash, "file")
        return hash_details

ti_tool = TIVTLookup()

tools = [
    Tool(
        name="Retrieve_IP_Info",
        func=ti_tool.ip_info,
        description="Useful when you need to look up threat intelligence information for an IP address.",
    ),
    Tool(
        name="Retrieve_Communicating_Samples",
        func=ti_tool.communicating_samples,
        description="Useful when you need to get communicating samples from an ip or domain.",
    ),
    Tool(
        name="Retrieve_Sample_information",
        func=ti_tool.samples_identification,
        description="Useful when you need to obtain more details about a sample.",
    ),
]

This example demonstrates how to craft agents. In this scenario, my agents utilize three functions from MSTICpy:
- This example shows how to build agents.
- In this scenario the agent uses three MSTICpy functions.
- Retrieve_IP_Info: This function queries VirusTotal for a specific IP address and relays the obtained information back to the model.
  - It looks up an IP address on VirusTotal and passes the result back to the model.
- Retrieve_Communicating_Samples: This function fetches from VirusTotal the samples that communicate with a particular IP, as provided by the user.
  - It fetches from VirusTotal the samples that communicate with the IP the user supplied.
- Retrieve_Sample_information: Here, we obtain details about a specific sample.
  - It returns details about a specific sample.
It's worth noting that numerous other functions can be integrated into our code. However, for the purpose of this demonstration, we'll maintain simplicity.
- Many other functions could be integrated into this code.
- For this demonstration, though, we keep things simple.

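# `memory` and `agent_kwargs` are not defined in this snippet; they are assumed
# to be set up earlier in the notebook (for example, the chat-history memory
# from the RAG section).
agent = initialize_agent(
    tools, llm=llm, agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION, verbose=False, agent_kwargs=agent_kwargs, memory=memory
)
agent.run("Can you give me more details about this ip: 77.246.107.91? How many samples are related to this ip? If you found samples related, can you give me more info about the first one?")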

In the provided example, I seek information regarding a specific IP. What's remarkable about pairing agents with LLMs is the model's ability to determine which tool to invoke based solely on the given description. The description you provide is crucial, as it's interpreted as a prompt and serves as directives for the model.
- The example asks for information about a specific IP.
- What stands out when pairing agents with LLMs is that the model decides what to invoke from the description alone.
- The description matters because it is interpreted as a prompt and acts as instructions for the model.
Upon executing this example, my code activates the MSTICpy agents to get the details. These details are then fed to the model to generate the final response, as illustrated below.
- Running this example activates the MSTICpy agents to fetch the details.
- Those details are then fed to the model to generate the final response, as shown below.

In detail, the code will run MSTICpy automatically as shown below.
- In detail, MSTICpy runs automatically as shown below.

🦾Conclusion

In this blog, I explored some interesting LLM features that allow you to build your own application. I created some proof-of-concept implementations that can be easily adapted for your own use case.
- This post explored some interesting LLM features for building your own application.
- I built a few proof-of-concept implementations that are easy to adapt to your own use case.
I started with a deep dive into prompt engineering concepts and few-shot learning, and then looked at how to build a RAG with your own data. Lastly, I discussed Agents and how they can be used in conjunction with your existing tools.
- I started with a deep dive into prompt engineering and few-shot learning, then looked at building a RAG with your own data.
- Finally, I discussed agents and how they can be used alongside your existing tools.
I hope you enjoyed the journey. If you want to explore more about these concepts, check out the resources below. 👇
