
Development Practice of an English Continuation Writing Model Based on RAG Technology

An in-depth introduction to the entire process of developing a vertical-domain model for English reading continuation writing

The idea of developing an English reading continuation writing model first emerged in late November last year. At the time, I happened to come across a public article introducing the principles of RAG (Retrieval-Augmented Generation). After reading it, I had a sudden inspiration: a lightweight model equipped with a knowledge base might be able to compete with top-tier large language models in specific domains. One evening, while discussing AI with my roommates, it struck me that RAG could be applied to the scenario of English reading continuation writing. When I proposed this idea, everyone praised it, saying the project was practical and full of potential. Due to the pressures of year-end studies and exam preparations, the project was temporarily shelved. It wasn’t until this January, after my final exams, that I finally had the time to work on it independently.

RAG Overview

In simple terms, RAG technology works as follows: when a user inputs a question or request, the system first retrieves the most relevant text segments from a vector database and then generates a new response based on the retrieved information. This ensures the generated content aligns more closely with real-world contexts, effectively mitigating the “hallucination” phenomenon common in AI models.

RAG (Retrieval-Augmented Generation) is a hybrid model that combines information retrieval and text generation, primarily consisting of two modules:


  1. Retrieval Module: This module searches the pre-built knowledge base for the text segments most relevant to the current input (Query). This step typically employs text embedding models (such as Alibaba’s text-embedding-v3 or BGE-M3) to convert text into vectors and then identifies the best matches through similarity calculations (e.g., cosine similarity, which measures the cosine of the angle between two vectors). Mathematically, it can be expressed as:

$$ \text{retrieve}(q) = \{ d_1, d_2, \ldots, d_k \} \quad \text{such that} \quad S(q, d_i) = \cos\big(f(q), f(d_i)\big) \text{ is maximized} $$

Here, for a query $q$, the system returns the $k$ most relevant documents $\{d_1, d_2, \ldots, d_k\}$, ensuring the similarity score $S(q, d_i)$ between the query and each document is maximized.

The similarity score is calculated as:
$$ S(q, d_i) = \cos\big(f(q), f(d_i)\big) = \frac{f(q) \cdot f(d_i)}{\|f(q)\| \, \|f(d_i)\|} $$

Here, $f(q)$ and $f(d_i)$ represent the high-dimensional vectors transformed from the query and documents, respectively. Cosine similarity measures the angular similarity between them—the higher the value, the greater the similarity.


  2. Generation Module: After the retrieval results are obtained, they are fed into the generative model along with the original query to produce a more contextually relevant and informative output. The formula is as follows:

$$ y = \arg\max_{y'} P\big(y' \mid q, \{d_1, d_2, \ldots, d_k\}; \theta\big) $$

This formula indicates that among all candidate output texts $y'$, the one that maximizes the probability $P(y' \mid q, \{d_1, d_2, \ldots, d_k\}; \theta)$ is selected as the final generated result.

Here, $y$ represents the model’s output, $y'$ denotes a candidate output text, $\theta$ stands for the model’s parameters, and $P(y' \mid q, \{d_1, d_2, \ldots, d_k\}; \theta)$ is the probability that the generative model produces $y'$ given the query $q$ and the retrieved set of documents.

In simpler terms, RAG first uses a retrieval model to find documents $\{d_1, d_2, \ldots, d_k\}$ relevant to $q$. Then, based on these documents, the generative model calculates the probability of each candidate answer and selects the one with the highest probability as the optimal answer $y$.
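To make the retrieval step concrete, here is a minimal sketch of top-$k$ retrieval by cosine similarity. The query and document vectors are assumed to come from an embedding model such as text-embedding-v3 or BGE-M3; this is an illustration, not the implementation used later.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # S(q, d) = f(q) . f(d) / (||f(q)|| * ||f(d)||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, doc_vecs: list, k: int = 3) -> list:
    # Score every document vector against the query and return the indices of the k best matches
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```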

Model Implementation

Data Collection

A good dataset is half the battle in model success.

— As I always say

The success of a model often hinges on whether the dataset is of high quality. Without high-quality data, even the most sophisticated algorithms struggle to reach their full potential. Teachers typically use PowerPoint (PPT) documents for lectures, so I extracted English text from these high-quality teaching materials as the raw data. The entire process consists of two main steps:

1. File Renaming

To facilitate subsequent text extraction using Python, I first renamed all .pptx files in the current folder in the order of their modification times. Below is an example of VBScript code that accomplishes this task. Its primary function is to iterate through all .pptx files, sort them by modification time, and rename them in the format number + ".pptx".

Example code:
' Create a file system object  
Set fso = CreateObject("Scripting.FileSystemObject")  

' Get the current folder path  
currentFolder = fso.GetAbsolutePathName(".")  

' Retrieve all files in the current folder
Set folder = fso.GetFolder(currentFolder)
Set files = folder.Files

' Store all .pptx files
Dim pptxFiles()
i = 0

' Iterate through files and filter out all .pptx files
For Each file In files
    If LCase(fso.GetExtensionName(file.Name)) = "pptx" Then
        ReDim Preserve pptxFiles(i)
        Set pptxFiles(i) = file  ' Use the Set keyword to assign objects
        i = i + 1
    End If
Next

If i = 0 Then
    WScript.Echo "No .pptx files found"
    WScript.Quit
End If

' Sort files by modification time
For i = 0 To UBound(pptxFiles) - 1
    For j = i + 1 To UBound(pptxFiles)
        If pptxFiles(i).DateLastModified > pptxFiles(j).DateLastModified Then
            ' Swap file objects
            Set temp = pptxFiles(i)
            Set pptxFiles(i) = pptxFiles(j)
            Set pptxFiles(j) = temp
        End If
    Next
Next

' Rename files
For i = 0 To UBound(pptxFiles)
    newName = (i + 1) & ".pptx"
    fso.MoveFile pptxFiles(i).Path, currentFolder & "\" & newName
Next

WScript.Echo "File renaming completed!"
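If you would rather stay in Python for this step as well, roughly the same renaming logic can be sketched as follows (a minimal sketch; it assumes none of the target names such as 1.pptx already exist in the folder):

```python
from pathlib import Path

# Collect all .pptx files in the current folder, sorted by modification time
pptx_files = sorted(Path('.').glob('*.pptx'), key=lambda p: p.stat().st_mtime)

if not pptx_files:
    print('No .pptx files found')
else:
    # Rename to 1.pptx, 2.pptx, ... in modification-time order
    for index, path in enumerate(pptx_files, start=1):
        path.rename(path.with_name(f'{index}.pptx'))
    print('File renaming completed!')
```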

2. Text Extraction

Next, use the python-pptx library to extract all English text from each renamed PPT file in the current directory. First, collect the renamed .pptx files in the current directory and sort them in numerical order; then read all the text from the slides, keep only English characters and punctuation using a regular expression, and finally save the extracted results to the extract.txt file.

Sample code is as follows:
import os
import re
from pptx import Presentation

pptx_folder = './'
output_txt = 'extract.txt'

# Match English characters and punctuation
english_text_pattern = re.compile(r'[A-Za-z0-9.,!?()";: -]+')

all_english_text = []

# Get all files in the current directory that match the format
pptx_files = [f for f in os.listdir(pptx_folder) if f.endswith('.pptx') and f[:-5].isdigit()]

# Sort filenames in numerical order
pptx_files.sort(key=lambda x: int(x[:-5]))

# Iterate through each PPT file
for pptx_filename in pptx_files:
    pptx_path = os.path.join(pptx_folder, pptx_filename)

    presentation = Presentation(pptx_path)

    # Iterate through each shape in every slide
    for slide in presentation.slides:
        for shape in slide.shapes:
            # Ensure the shape contains a text box
            if hasattr(shape, 'text'):
                # Get the text content
                text = shape.text.strip()

                # If the text is not empty, extract only the English text
                if text:
                    # Use regex to extract the English parts of the text
                    extracted_english_text = ' '.join(english_text_pattern.findall(text))
                    
                    if extracted_english_text:  # If there is English text
                        all_english_text.append(extracted_english_text)

                # If the text is empty, add an empty line
                else:
                    all_english_text.append('')

# Write the extracted English text to the extract.txt file
with open(output_txt, 'w', encoding='utf-8') as f:
    for line in all_english_text:
        f.write(line + '\n')

print(f"Extraction completed. All English texts have been saved to {output_txt}")

Data Preprocessing

The extracted raw data contains noise such as spelling errors, mixed punctuation, irregular spacing, and capitalization issues. I chose to use kimi.ai to clean, deduplicate, and tokenize the collected data for subsequent model training.

Example Prompt
This is an English continuation writing corpus dataset I created. Kimi, I need you to perform data cleaning tasks. Please follow the steps below to check and correct errors in the data:  

### Types of errors to check and correct include but are not limited to:  
1. Spelling errors: Check for misspellings and correct them according to standard English spelling.  
2. Mixed Chinese and English punctuation: ensure fully English content uses English punctuation and fully Chinese content uses Chinese punctuation.  
3. Spacing issues: Missing spaces (e.g., between words or after punctuation); extra spaces (e.g., consecutive spaces or unnecessary leading/trailing spaces).  
4. Capitalization errors: The first word of a sentence should be capitalized, and proper nouns should follow correct capitalization rules.  
5. Punctuation errors: Check for correct usage of commas, periods, quotation marks, brackets, etc., and whether there are repeated, missing, or misused punctuation marks.  
6. Non-standard text formatting: Are there extra spaces at the beginning/end of lines? Are there abnormal line breaks (e.g., a single sentence split across multiple lines)?  
7. Special characters and garbled text: Check for any special characters that do not conform to the corpus format, and remove or correct garbled text.  

### Pre-Task Confirmation  
Before officially executing the task, please answer the following questions:  
1. Do you fully understand the requirements and operational procedures of this task?  
2. Do you have any questions or need further clarification?  

After Kimi completes the data cleaning, additional manual review and annotation is still required to ensure the accuracy of the dataset.
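For reference, I ran this prompt interactively on kimi.ai, but the same cleaning step could in principle be automated through Moonshot’s OpenAI-compatible API. The sketch below is an assumption rather than what was actually used; the base URL, model name, and MOONSHOT_API_KEY environment variable should be checked against the current Moonshot documentation.

```python
import os
from openai import OpenAI

# Assumption: Moonshot (Kimi) exposes an OpenAI-compatible chat endpoint
client = OpenAI(api_key=os.environ['MOONSHOT_API_KEY'],
                base_url='https://api.moonshot.cn/v1')

cleaning_prompt = open('cleaning_prompt.txt', encoding='utf-8').read()  # the prompt shown above
raw_corpus = open('extract.txt', encoding='utf-8').read()

response = client.chat.completions.create(
    model='moonshot-v1-32k',  # assumed model name
    temperature=0.3,          # keep cleaning conservative rather than creative
    messages=[
        {'role': 'system', 'content': cleaning_prompt},
        {'role': 'user', 'content': raw_corpus},
    ],
)

with open('cleaned.txt', 'w', encoding='utf-8') as f:
    f.write(response.choices[0].message.content)
```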

Model Training

Using RAG (Retrieval-Augmented Generation) technology, the preprocessed data is used as input to train an English continuation writing model.

I employed an open-source RAG framework from GitHub, FastGPT.

Building the Knowledge Base

First, upload the preprocessed dataset to the knowledge base and use an indexing model (such as Alibaba’s text-embedding-v3) to convert the text into vector representations, enabling semantic retrieval. Mathematically, we aim to obtain a mapping function $f: \text{Text} \rightarrow \mathbb{R}^d$ such that semantically similar texts are closer in the vector space. The similarity is often calculated using cosine similarity, defined as:

$$ \text{sim}(a, b) = \frac{f(a) \cdot f(b)}{\|f(a)\| \, \|f(b)\|} $$

Here, $a$ and $b$ represent two pieces of text.
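FastGPT handles this indexing internally once the dataset is uploaded, but the step can also be sketched directly. The example below is an assumption (not FastGPT’s internals): it embeds each corpus segment with text-embedding-v3 through Alibaba Cloud’s OpenAI-compatible DashScope endpoint; the endpoint, model name, batch limit, and DASHSCOPE_API_KEY variable should be verified against the current documentation.

```python
import os
import numpy as np
from openai import OpenAI

# Assumption: DashScope's OpenAI-compatible endpoint serves text-embedding-v3
client = OpenAI(api_key=os.environ['DASHSCOPE_API_KEY'],
                base_url='https://dashscope.aliyuncs.com/compatible-mode/v1')

def embed(texts, batch_size=10):
    # Map each text segment to its vector f(text); small batches to respect API limits
    vectors = []
    for i in range(0, len(texts), batch_size):
        resp = client.embeddings.create(model='text-embedding-v3',
                                        input=texts[i:i + batch_size])
        vectors.extend(item.embedding for item in resp.data)
    return np.array(vectors)

# One knowledge-base row per non-empty line of the cleaned corpus
segments = [line.strip() for line in open('cleaned.txt', encoding='utf-8') if line.strip()]
np.save('kb_vectors.npy', embed(segments))
```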

Creating a Workflow

Next, create a workflow to implement the following process:

  1. Knowledge Base Retrieval: Upon receiving user input or test cases, retrieve the most relevant text fragments from the knowledge base.
  2. Text Generation: Pass the retrieved results as context, along with the user input, to a text generation model (e.g., DeepSeek V3), and generate writing content using pre-written prompts.

The objective of the generation model is to maximize the conditional probability of the generated text $y$:

$$ y = \arg\max_{y'} P\big(y' \mid q, \{d_1, d_2, \ldots, d_k\}; \theta\big) $$

The parameter explanations are as mentioned earlier, and the formula’s meaning will not be reiterated here.
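FastGPT wires these two nodes together visually; conceptually, the workflow reduces to something like the sketch below, which retrieves the top-$k$ segments by cosine similarity and passes them as context to DeepSeek V3 through its OpenAI-compatible API. The endpoint, model name, and prompt wording here are my own assumptions, not FastGPT’s internals.

```python
import os
import numpy as np
from openai import OpenAI

kb_vectors = np.load('kb_vectors.npy')  # vectors built when the knowledge base was indexed
segments = [l.strip() for l in open('cleaned.txt', encoding='utf-8') if l.strip()]

# Assumption: DeepSeek V3 is served as "deepseek-chat" behind an OpenAI-compatible endpoint
deepseek = OpenAI(api_key=os.environ['DEEPSEEK_API_KEY'], base_url='https://api.deepseek.com')

def continue_story(task: str, query_vec: np.ndarray, k: int = 3) -> str:
    # query_vec must come from the same embedding model used to index the knowledge base

    # 1. Knowledge base retrieval: top-k segments by cosine similarity to the query vector
    sims = kb_vectors @ query_vec / (np.linalg.norm(kb_vectors, axis=1) * np.linalg.norm(query_vec))
    context = '\n'.join(segments[i] for i in np.argsort(sims)[::-1][:k])

    # 2. Text generation: retrieved context plus the user input go to the generator
    resp = deepseek.chat.completions.create(
        model='deepseek-chat',
        messages=[
            {'role': 'system', 'content': 'You are an English continuation-writing assistant. '
                                          'Use the reference passages as stylistic guidance.'},
            {'role': 'user', 'content': f'Reference passages:\n{context}\n\nTask:\n{task}'},
        ],
    )
    return resp.choices[0].message.content
```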

With the FastGPT framework, the deployment process and model training were completed according to its documentation. The main steps include:

  • Uploading data and constructing vector indices
  • Defining workflows and integrating the retrieval and generation modules
  • Adjusting model parameters (e.g., the temperature parameter of the generative model)

Regarding the temperature parameter, here’s a brief explanation: temperature controls the randomness of the output from generative models. A higher temperature results in more random generated text, while a lower temperature makes the output more deterministic. In this project, to prevent overfitting in the RAG-based model, it is recommended to set the temperature to 50%–70% of its maximum value.
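Concretely, temperature rescales the model’s logits before sampling. The small sketch below shows how a higher temperature flattens the token distribution (more randomness) while a lower one sharpens it (more deterministic output):

```python
import numpy as np

def softmax_with_temperature(logits: np.ndarray, temperature: float) -> np.ndarray:
    # Divide the logits by the temperature, then apply a numerically stable softmax
    scaled = logits / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    return probs / probs.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax_with_temperature(logits, 0.5))  # sharply peaked: near-deterministic choice
print(softmax_with_temperature(logits, 1.5))  # much flatter: more diverse sampling
```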

Model Optimization

After training the model, its performance is evaluated using a custom test set. Key evaluation metrics include the coherence of the generated text, semantic relevance, and creativity. If the results are unsatisfactory, improvements can be made by optimizing the prompt or adjusting certain model parameters. This process can also be flexibly implemented in the FastGPT console.
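Of the three criteria, semantic relevance is the easiest to approximate automatically. One hedged option (my own sketch, not part of FastGPT) is to embed each source passage and its generated continuation with the same embedding model used for the knowledge base and compare them by cosine similarity:

```python
import numpy as np

def relevance_score(source_vec: np.ndarray, continuation_vec: np.ndarray) -> float:
    # Cosine similarity between the source passage and the generated continuation;
    # higher values suggest the continuation stays closer to the original story
    return float(np.dot(source_vec, continuation_vec) /
                 (np.linalg.norm(source_vec) * np.linalg.norm(continuation_vec)))

# Example usage, assuming an embed() helper like the one in the indexing sketch:
# scores = [relevance_score(embed([src])[0], embed([cont])[0]) for src, cont in test_pairs]
# print(f"mean relevance: {sum(scores) / len(scores):.3f}")
```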

Model Limitations

After the public release of this model, it received widespread acclaim. Even English teachers inquired about its principles, calling it a new approach to teaching. However, to be frank, the model still has significant limitations:

  1. Dataset Limitations: The sample size is very small. Only 16 PPT documents from teachers’ course materials were selected for model training. Both the volume and quality of the data need improvement, and the model’s generalization capability requires enhancement.

  2. Overfitting Risk: With limited training data, the RAG model is prone to overfitting. A friend previously mentioned that during LoRA fine-tuning, datasets with thousands of samples often lead to overfitting after just 5 epochs. My strategy to mitigate this issue involves increasing the generation temperature, but more diverse and abundant data are still needed to improve the model’s robustness.

GitHub Repository: https://github.com/Crosssense-Lab/Kelly-s1
(The dataset is not open-sourced due to copyright reasons.)

Licensed under CC BY-NC-SA 4.0