🔗 RAG Intro

This is a basic introduction to Retrieval-Augmented Generation (RAG). For a detailed explanation of the RAG pipeline, see: 🔗 Details of RAG R4 Link

What is RAG (Retrieval-Augmented Generation)

Imagine you need to write a report on “Mars exploration” but are not familiar with the topic. You would probably first search for relevant information online and then write the report based on what you found. RAG (Retrieval-Augmented Generation) follows the same process, except that it is carried out by a machine.

Simply put, RAG does two things:

  1. Find information: Just as you would search online for information about “Mars exploration”, RAG first searches a large knowledge base for information related to the question. This ensures that it has enough material to answer the question.
  2. Write the answer: With this information, RAG then uses a large language model to summarize an answer or generate relevant text. The model tries to express the retrieved information in a fluent and coherent way.

In short, RAG is an automated process that helps machines answer questions or generate text better by first looking up information. The generated content is therefore not simply made up but grounded in retrieved information, which makes it more accurate and richer.

RAG is now widely used in scenarios such as AI customer service, enterprise knowledge bases, and AI search engines.

Why RAG is needed

RAG (Retrieval-Augmented Generation) is important because it addresses several problems that arise when large language models are used on their own, and improves the quality and accuracy of the generated content:

  1. Make the generated content more timely: What a large language model knows depends on its training data, which is not updated once training is complete, so the model’s information is often outdated. With a RAG pipeline added, the model can obtain up-to-date information and generate content that reflects the current state of knowledge. This is particularly important for applications that depend on recent data.
  2. Handle knowledge-intensive tasks: The knowledge built into a large language model is often not enough for tasks that require extensive background knowledge, such as writing detailed reports on specific topics. RAG can enrich the answers by integrating different knowledge bases and retrieving relevant information, making the generated content more accurate and informative.
  3. Improve the relevance and quality of the generated text: Without direct access to specific data, generative models (such as the GPT series) may produce answers that are irrelevant or partially incorrect (hallucinations). RAG improves the quality and relevance of the output by first retrieving relevant information and then generating text based on it.
  4. Alleviate the problem of data scarcity: For some specialized fields or niche topics, there may not be enough training data to train an effective generative model. RAG can compensate by drawing on external knowledge bases, providing useful output even where data is scarce.

In summary, RAG provides an effective way to solve complex tasks. By combining retrieval and generation methods, it enhances the machine’s understanding and text generation capabilities, making the generated content not only more accurate, but also richer and more diverse.

An example: How RAG works

The implementation of RAG requires a series of steps, which can generally be divided into two parts: data preparation and actual application.

The data preparation part can generally be divided into 3 steps:

  1. Data acquisition: Prepare a dataset for retrieval from the Internet or other data sources, usually a collection of documents.
  2. Data stripping: Also called data chunking. The data obtained in the first step is divided into paragraphs according to meaning, or knowledge chunks that answer specific questions are extracted.
  3. Data processing: Entities and time information are extracted from the knowledge chunks produced in the second step, making it easier to filter for better-matching content during recall.

The actual application part can generally be divided into 4 steps:

  1. Question rewriting (Rewriter): Expand the user’s question into several related questions and draft answers to them, in order to obtain more keywords than the original question contains.
  2. Keyword recall (Retriever): Using the questions and answers provided by the Rewriter, recall content from the database with vector matching, keyword matching, search engines, and other techniques.
  3. Data reranking (Reranker): Based on the user’s original question, select the n most relevant items from the content returned by the Retriever.
  4. Summarize answer (Reader): Understand the user’s original question, read the n most relevant items, summarize them into an answer, and cite the sources.
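
The four application steps can be read as a simple function composition. The following is a minimal sketch under stated assumptions, not an implementation taken from this article: `run_rag` and its four callable parameters are hypothetical names, and each stage is passed in as a plain function so that any rewriter, retriever, reranker, or reader could be plugged in.

```python
from typing import Callable, List


def run_rag(
    question: str,
    rewrite: Callable[[str], List[str]],
    retrieve: Callable[[str], List[str]],
    rerank: Callable[[str, List[str]], List[str]],
    read: Callable[[str, List[str]], str],
    n: int = 3,
) -> str:
    """Chain Rewriter -> Retriever -> Reranker -> Reader (hypothetical callables)."""
    # 1. Rewriter: expand the original question into several queries.
    queries = rewrite(question)

    # 2. Retriever: recall candidates for every query, dropping duplicates
    #    while keeping first-seen order.
    candidates: List[str] = []
    for query in queries:
        for doc in retrieve(query):
            if doc not in candidates:
                candidates.append(doc)

    # 3. Reranker: order candidates against the ORIGINAL question, keep the top n.
    top_n = rerank(question, candidates)[:n]

    # 4. Reader: summarize the top n items into an answer.
    return read(question, top_n)
```

Note that the Reranker and Reader receive the user’s original question rather than the rewritten queries, which matches the step descriptions above.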

Next, I will share a practical example to illustrate how the RAG pipeline works:

Suppose we are building a “Travel Assistant” that can provide users with recommendations for attractions around their destination.

Data acquisition

  • Goal: Collect travel guides, attraction recommendations, etc. from the Internet and store them as the original dataset
  • Dependent capabilities: web crawling, OpenAPI, or cleaning of in-house data
  • Output: An original dataset containing travel guides, attraction recommendations, etc.

Case

The original dataset may consist of multiple pieces of the following content:

"Title": "Introduction to attractions around Hangzhou"
"Release time": "2023/12/23 12:00"
"Author": "Traveler"
"Original link": "https://xxxx.xxx"
"Text": "Hangzhou, a city where history and modernity blend, not only has its own very attractive landscapes, but also is scattered with countless attractive tourist destinations around it. West Lake is the most well-known to people, symbolizing the 'paradise on earth', the lake and mountain scenery and historical relics shine on each other, making every visitor who steps foot here linger. Leifeng Tower and Broken Bridge are not only the iconic buildings of West Lake, but also the places where many classical love stories took place, attracting countless literature and history enthusiasts. Gu Shan and Su Ti Chun Xiao are known for their more fresh and natural scenery, and are excellent places for strolling and sightseeing. The surrounding ancient towns are also not to be missed. Wuzhen and Xitang, as two of the six major ancient towns in the south of the Yangtze River, are famous for their well-preserved ancient water town layout and rich cultural heritage. The water alleys and ancient bridges of Wuzhen retain hundreds of years of history and are the ideal place to explore traditional southern culture. The night scene of Xitang, where the lights reflect the water, has a unique flavor. Tourists can not only visit ancient relics, taste authentic southern snacks, but also enjoy the fun of making traditional handicrafts. These ancient towns are not just sightseeing spots, they are living cultural heritage. For those who love nature and adventure, the natural landscapes around Hangzhou are also rich and colorful. Tianmu Mountain is an ideal place for adventure and hiking. It has not only dense forest coverage, but also spectacular waterfalls and rich biodiversity. The autumn Tianmu Mountain is dyed with colorful layers of forest, it is a paradise for photography enthusiasts. In addition, the Fuchun River and Anji's bamboo sea are also worth a visit. The Fuchun River is famous for its literati and ink guests and magnificent landscape paintings in history, while the bamboo sea in Anji is refreshing with its endless bamboo forest and fresh air. These places not only provide a secluded place away from the city, but also are excellent choices for experiencing natural beauty and peaceful life. In general, Hangzhou and its surrounding areas have become popular destinations for domestic and foreign tourists with their unique culture and natural landscapes. Whether it's immersing in the atmosphere of classical culture, or seeking solace in the arms of nature, there is always a landscape here that can meet your needs."

Data stripping

  • Goal: Divide the collected data into paragraphs according to semantics, or extract answer content for specific questions, to form knowledge blocks
  • Dependent capabilities: Longformer, large model (segmentation or extraction)
  • Output: A series of content segments or a series of question and answer pairs
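
The paragraph-based variant of this step can be sketched as follows. This is only an illustration: it assumes paragraphs in the `Text` field are separated by blank lines, whereas the step described above relies on a model for semantic segmentation. The function name `split_into_segments` and the `Segment` / `Content` field names are hypothetical.

```python
from typing import Dict, List


def split_into_segments(document: Dict[str, str]) -> List[Dict[str, str]]:
    """Split document["Text"] on blank lines and copy the other fields into each chunk."""
    # Everything except the body text is treated as metadata shared by all chunks.
    metadata = {key: value for key, value in document.items() if key != "Text"}
    paragraphs = [p.strip() for p in document["Text"].split("\n\n") if p.strip()]
    return [
        {**metadata, "Segment": str(number), "Content": paragraph}
        for number, paragraph in enumerate(paragraphs, start=1)
    ]


if __name__ == "__main__":
    doc = {
        "Title": "Introduction to attractions around Hangzhou",
        "Release time": "2023/12/23 12:00",
        "Text": "West Lake is the most well-known.\n\n"
                "The surrounding ancient towns are also not to be missed.",
    }
    for segment in split_into_segments(doc):
        print(segment["Segment"], segment["Content"])
```

Copying the title, release time, and link into every chunk keeps each knowledge block traceable to its source, which the Reader step later relies on when citing.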

Case

After the original text is divided into knowledge blocks by paragraph, the original data may be divided into the following knowledge blocks:

"Title": "Introduction to attractions around Hangzhou"
"Release time": "2023/12/23 12:00"
"Author": "Traveler"
"Original link": "https://xxxx.xxx"
"Segment 1": "Hangzhou, a city where history and modernity blend, not only has its own very attractive landscapes, but also is scattered with countless attractive tourist destinations around it. West Lake is the most well-known to people, symbolizing the 'paradise on earth', the lake and mountain scenery and historical relics shine on each other, making every visitor who steps foot here linger. Leifeng Tower and Broken Bridge are not only the iconic buildings of West Lake, but also the places where many classical love stories took place, attracting countless literature and history enthusiasts. Gu Shan and Su Ti Chun Xiao are known for their more fresh and natural scenery, and are excellent places for strolling and sightseeing."

"Title": "Introduction to attractions around Hangzhou"
"Release time": "2023/12/23 12:00"
"Author": "Traveler"
"Original link": "https://xxxx.xxx"
"Segment 2": "The surrounding ancient towns are also not to be missed. Wuzhen and Xitang, as two of the six major ancient towns in the south of the Yangtze River, are famous for their well-preserved ancient water town layout and rich cultural heritage. The water alleys and ancient bridges of Wuzhen retain hundreds of years of history and are the ideal place to explore traditional southern culture. The night scene of Xitang, where the lights reflect the water, has a unique flavor. Tourists can not only visit ancient relics, taste authentic southern snacks, but also enjoy the fun of making traditional handicrafts. These ancient towns are not just sightseeing spots, they are living cultural heritage."

After the original text is divided into knowledge blocks by question and answer, the original data may be divided into the following knowledge blocks:

"Title": "Introduction to attractions around Hangzhou"
"Release time": "2023/12/23 12:00"
"Author": "Traveler"
"Original link": "https://xxxx.xxx"
"Question 1": "What are the natural landscapes in Hangzhou?"
"Answer 1": "Hangzhou, a city where history and modernity blend, not only has its own very attractive landscapes, but also is scattered with countless attractive tourist destinations around it. West Lake is the most well-known to people, symbolizing the 'paradise on earth', the lake and mountain scenery and historical relics shine on each other, making every visitor who steps foot here linger. Leifeng Tower and Broken Bridge are not only the iconic buildings of West Lake, but also the places where many classical love stories took place, attracting countless literature and history enthusiasts. Gu Shan and Su Ti Chun Xiao are known for their more fresh and natural scenery, and are excellent places for strolling and sightseeing. Tianmu Mountain is an ideal place for adventure and hiking. It has not only dense forest coverage, but also spectacular waterfalls and rich biodiversity. The autumn Tianmu Mountain is dyed with colorful layers of forest, it is a paradise for photography enthusiasts. In addition, the Fuchun River and Anji's bamboo sea are also worth a visit. The Fuchun River is famous for its literati and ink guests and magnificent landscape paintings in history, while the bamboo sea in Anji is refreshing with its endless bamboo forest and fresh air. These places not only provide a secluded place away from the city, but also are excellent choices for experiencing natural beauty and peaceful life."

"Title": "Introduction to attractions around Hangzhou"
"Release time": "2023/12/23 12:00"
"Author": "Traveler"
"Original link": "https://xxxx.xxx"
"Question 2": "What are the ancient towns around Hangzhou?"
"Answer 2": "The surrounding ancient towns are also not to be missed. Wuzhen and Xitang, as two of the six major ancient towns in the south of the Yangtze River, are famous for their well-preserved ancient water town layout and rich cultural heritage. The water alleys and ancient bridges of Wuzhen retain hundreds of years of history and are the ideal place to explore traditional southern culture. The night scene of Xitang, where the lights reflect the water, has a unique flavor. Tourists can not only visit ancient relics, taste authentic southern snacks, but also enjoy the fun of making traditional handicrafts. These ancient towns are not just sightseeing spots, they are living cultural heritage."

Data processing

  • Goal: Extract metadata from the knowledge blocks so that content can be filtered during the recall phase, improving accuracy
  • Dependent capabilities: Pattern matching, large model, knowledge graph, etc.
  • Output: A wide table containing the stripped content and its metadata, or a graph

Case

| Number | Knowledge Block Content | Relevant Entities | Time |
| --- | --- | --- | --- |
| 1 | Hangzhou, a city where history and modernity blend, not only has its own charming landscape, but also has countless attractive tourist destinations around it. West Lake is the most well-known, hailed as a symbol of “paradise on earth”, where the lake and mountains and historical sites complement each other, making every visitor who sets foot here linger. Leifeng Tower and Broken Bridge are not only the landmark buildings of West Lake, but also the places where many classical love stories took place, attracting countless literature and history lovers. And Gushan and Sudi Chunxiao are known for their fresher, natural scenery, which are excellent places for walking and sightseeing | Zhejiang Province, Hangzhou City, West Lake, Sudi, Leifeng Tower, Broken Bridge | 2023/12/23 12:00 |
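
A row of such a wide table could be produced with plain pattern matching, one of the capabilities listed above. The sketch below is an assumption-laden illustration: `KNOWN_ENTITIES` is a hypothetical hand-written entity list, `build_row` is a hypothetical helper, and the timestamp pattern only covers the `YYYY/MM/DD HH:MM` format used in this example. Entities that are inferred rather than literally mentioned (such as "Zhejiang Province" in the table) would need a large model or a knowledge graph instead.

```python
import re
from typing import Dict, List

# Hypothetical entity list; a real system might use a large model or a knowledge graph.
KNOWN_ENTITIES: List[str] = ["West Lake", "Leifeng Tower", "Broken Bridge", "Wuzhen", "Xitang"]

# Matches timestamps such as "2023/12/23 12:00".
TIMESTAMP = re.compile(r"\d{4}/\d{1,2}/\d{1,2} \d{1,2}:\d{2}")


def build_row(number: int, content: str, release_time: str) -> Dict[str, object]:
    """Build one wide-table row: the chunk plus the metadata extracted from it."""
    # Keep only entities that literally appear in the chunk text.
    entities = [name for name in KNOWN_ENTITIES if name in content]
    # Accept the release time only if it has the expected format.
    match = TIMESTAMP.fullmatch(release_time.strip())
    return {
        "Number": number,
        "Knowledge Block Content": content,
        "Relevant Entities": entities,
        "Time": match.group(0) if match else None,
    }
```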

After these steps, the data preparation work of the “Travel Assistant” is completed. Next, we will show how the assistant obtains content and responds to users during actual use.

We assume that the user asks the “Travel Assistant”: “What are the fun places around Hangzhou?”

Question Rewriting (Rewriter)

  • Goal: Expand the user’s question into several related questions and draft answers to them, to obtain more keywords than the original question contains
  • Dependent capabilities: Prompt engineering + large model
  • Input: User’s original question
  • Output: Multiple question and answer pairs rewritten from the original question

Case

  • Assume the user’s original input question is: “What are the fun places around Hangzhou?”
  • After question rewriting, the following question and answer pairs will be generated:
    1. What historical and cultural attractions are there around Hangzhou?
       There are rich historical and cultural resources around Hangzhou. West Lake is one of the most famous attractions, and its "mountain and water garden" has been listed as a World Cultural Heritage. Song City is also a not-to-be-missed historical and cultural experience, it is a large-scale historical and cultural theme park. There is also Lingyin Temple, a long-established Buddhist temple, known to the world.
    2. What places are suitable for parent-child tours around Hangzhou?
       There are many places suitable for parent-child tours around Hangzhou. For example, Hangzhou Wildlife World is a large wild animal park where family visitors can get close to and understand various animals. In addition, Hangzhou Paradise and Songcheng's Songcheng Thousand Ancient Love Scenic Area are also good choices for parent-child tours, providing a variety of entertainment facilities and performances.
    3. What natural scenic spots are there around Hangzhou?
       There are various natural landscapes around Hangzhou. Among them, Huangshan is one of the most well-known natural scenic spots, known for its strange pines, strange stones, sea of clouds, and hot springs. Tianmu Mountain is also worth a visit, with rich biodiversity and fresh air. In addition, Qiandao Lake has become a popular tourist destination for its clear water quality and numerous islands.
  • As a result, when the Retriever recalls relevant content in the next step, the usable keywords expand from the original (Hangzhou, surroundings, fun places) to: Hangzhou, surroundings, historical and cultural attractions, West Lake, mountain and water garden, Song City, historical and cultural theme park, and so on, which greatly increases the amount of content that can be recalled
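
Since this step depends on prompt engineering plus a large model, one way to picture it is a prompt builder and a parser for the model’s reply. The sketch below is hypothetical throughout: the prompt wording, the `Q:` / `A:` reply format, and the names `build_rewrite_prompt` and `parse_pairs` are assumptions for illustration, and the model call itself is deliberately left out.

```python
from typing import List, Optional, Tuple


def build_rewrite_prompt(question: str, num_pairs: int = 3) -> str:
    """Build a hypothetical prompt asking a large model for diverging Q&A pairs."""
    return (
        f"Original question: {question}\n"
        f"Write {num_pairs} more specific questions that explore different "
        "angles of the original question, and answer each one briefly.\n"
        "Format every pair on two lines:\n"
        "Q: <question>\n"
        "A: <answer>"
    )


def parse_pairs(model_output: str) -> List[Tuple[str, str]]:
    """Parse 'Q:' / 'A:' lines from the model reply into (question, answer) pairs."""
    pairs: List[Tuple[str, str]] = []
    pending_question: Optional[str] = None
    for raw_line in model_output.splitlines():
        line = raw_line.strip()
        if line.startswith("Q:"):
            pending_question = line[2:].strip()
        elif line.startswith("A:") and pending_question is not None:
            # Only keep an answer when it follows a question.
            pairs.append((pending_question, line[2:].strip()))
            pending_question = None
    return pairs
```

Asking for answers as well as questions follows the reasoning above: the drafted answers mention concrete names (West Lake, Song City, and so on) that the original question does not contain.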

Keyword Recall (Retriever)

  • Goal: Using the question and answer pairs provided by the Rewriter, recall content from the database with vector matching, keyword matching, search engines, and other techniques
  • Dependent capabilities: Vector search, data graph, search engine
  • Input: Question and answer pairs provided by the Rewriter
  • Output: The top 50 text items recalled for each question and answer pair

Case

Using the question and answer pairs provided by the Rewriter in step 1 and the keywords they contain (Hangzhou, surroundings, historical and cultural attractions, West Lake, mountain and water garden, Song City, etc.), recall 50 related items from the database for each question and answer pair.
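
Of the recall techniques mentioned, keyword matching is the simplest to sketch. The example below is a toy illustration, not the retriever used in this article: `recall_by_keywords` is a hypothetical function that scores each document by the number of query keywords it contains, with an in-memory list standing in for the database. Vector search and search engines are not shown.

```python
from typing import List, Tuple


def recall_by_keywords(keywords: List[str], documents: List[str], top_k: int = 50) -> List[str]:
    """Rank documents by how many query keywords they contain and keep the top_k."""
    scored: List[Tuple[int, int, str]] = []
    for index, text in enumerate(documents):
        lowered = text.lower()
        hits = sum(1 for keyword in keywords if keyword.lower() in lowered)
        if hits > 0:
            scored.append((hits, index, text))
    # More keyword hits first; ties keep the original database order.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [text for _, _, text in scored[:top_k]]
```

The default `top_k=50` mirrors the "top 50 per question and answer pair" output described above; recall is intentionally generous because the Reranker narrows the list afterwards.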

Data Reranking (Reranker)

  • Goal: Based on the user’s original question, select the n most relevant items from the text content provided by the Retriever
  • Dependent capabilities: bg_rank small model
  • Input: User’s original question, text content provided by the Retriever
  • Output: The n most relevant text items

Case

Based on the user’s original question “What are the fun places around Hangzhou?”, select the top n items from the text content provided by the Retriever in step 2.
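
The reranking step boils down to scoring every recalled passage against the original question and keeping the best n. In the sketch below the scoring function is a deliberately simple stand-in (word-set overlap), not the small ranking model named above; `token_overlap` and `rerank` are hypothetical names, and a model-based scorer could be passed in through the `score` parameter instead.

```python
from typing import Callable, List


def token_overlap(question: str, passage: str) -> float:
    """Stand-in relevance score: Jaccard overlap of lowercase, whitespace-split words."""
    q_tokens = set(question.lower().split())
    p_tokens = set(passage.lower().split())
    if not q_tokens or not p_tokens:
        return 0.0
    return len(q_tokens & p_tokens) / len(q_tokens | p_tokens)


def rerank(
    question: str,
    passages: List[str],
    n: int = 3,
    score: Callable[[str, str], float] = token_overlap,
) -> List[str]:
    """Return the n passages with the highest score against the original question."""
    # sorted() is stable, so passages with equal scores keep their recall order.
    ranked = sorted(passages, key=lambda passage: score(question, passage), reverse=True)
    return ranked[:n]
```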

Summary Answer (Reader)

  • Goal: Understand the user’s original question, read the n most relevant items output by the Reranker, summarize them into an answer, and cite the sources
  • Dependent capabilities: Prompt engineering + large model
  • Input: User’s original question, the n most relevant items output by the Reranker
  • Output: An answer with traceable sources

Case

Based on the user’s original question “What are the fun places around Hangzhou?” and the n most relevant items (including their links) provided by the Reranker in step 3, the output answer is:

There are many fun places worth exploring around Hangzhou, suitable for different types of tourists:

1. West Lake: As a symbol of Hangzhou, West Lake is not only a must-visit attraction in Hangzhou, but also the starting point for surrounding tours. You can take a boat tour of the lake, stroll along the Sudi Chunxiao, or experience the poetic and picturesque feelings of the ancients in the broken bridge and snow.
2. Lingyin Temple: Located under the Feilai Peak in the West Lake District of Hangzhou, it is a long-established famous Buddhist temple. The mountains, water, and woods here are quiet, and the temple is simple and solemn, suitable for tourists seeking spiritual tranquility.
3. Song City: Located in the West Lake District of Hangzhou, it is a large-scale theme park that integrates Song Dynasty culture, history, entertainment, and performing arts. It is particularly recommended to watch the "Songcheng Thousand Ancient Love" performance and experience the shocking audio-visual feast.

Sources: [1], [2], [3]
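
Source attribution of this kind is usually arranged in the prompt: each passage is numbered before it is handed to the model, so the answer can refer to it as [1], [2], and so on. The sketch below shows one possible prompt builder; the wording of the instructions, the function name `build_reader_prompt`, and the `text` / `link` keys are assumptions for illustration, and the model call is omitted.

```python
from typing import Dict, List


def build_reader_prompt(question: str, passages: List[Dict[str, str]]) -> str:
    """Number each passage so the model can cite it as [1], [2], ... in its answer."""
    lines = [
        "Answer the question using only the numbered passages below.",
        "Cite the passages you use as [1], [2], ... and list them under 'Sources:'.",
        "",
    ]
    for number, passage in enumerate(passages, start=1):
        # Each passage carries its text and the link of the original document.
        lines.append(f"[{number}] {passage['text']} (link: {passage['link']})")
    lines.append("")
    lines.append(f"Question: {question}")
    return "\n".join(lines)
```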

Conclusion

RAG (Retrieval Augmented Generation) is a content generation solution that combines retrieval and generation capabilities.

Starting from the user’s question, RAG first generates several related question and answer pairs (including answers at this stage helps surface a wider range of keywords). It then uses the keywords in those pairs to retrieve relevant information from a large knowledge base and, after filtering and sorting, generates an answer or text based on that information, improving the quality and relevance of the model’s output.

The RAG method combines the efficiency of retrieval systems and the flexibility of generation models, and can provide more accurate and rich output when dealing with complex queries or tasks.

Extended Reading: For a detailed explanation of the RAG pipeline, see: 🔗 Details of RAG R4 Link
