The idea of developing an English reading continuation writing model first emerged in late November last year. At the time, I happened to come across a public article introducing the principles of RAG (Retrieval-Augmented Generation). After reading it, I had a sudden inspiration: a lightweight model equipped with a knowledge base might be able to compete with top-tier large language models in specific domains. One evening, while discussing AI with my roommates, it struck me that RAG could be applied to the scenario of English reading continuation writing. When I proposed this idea, everyone praised it, saying the project was practical and full of potential. Due to the pressures of year-end studies and exam preparations, the project was temporarily shelved. It wasn’t until this January, after my final exams, that I finally had the time to work on it independently.
RAG Overview
In simple terms, RAG technology works as follows: when a user inputs a question or request, the system first retrieves the most relevant text segments from a vector database and then generates a new response based on the retrieved information. This ensures the generated content aligns more closely with real-world contexts, effectively mitigating the “hallucination” phenomenon common in AI models.
RAG (Retrieval-Augmented Generation) is a hybrid model that combines information retrieval and text generation, primarily consisting of two modules:
- Retrieval Module: This module searches the pre-built knowledge base for the text segments most relevant to the current input (Query). This step typically employs text embedding models (such as Alibaba’s text-embedding-v3 or BAAI’s BGE-M3) to convert text into vectors, then identifies the best matches through similarity calculations (e.g., cosine similarity, which measures the cosine of the angle between two vectors). Mathematically, retrieval can be expressed as:
$$ \text{retrieve}(q) = \{ d_1, d_2, \ldots, d_k \} \quad \text{such that} \quad S(q, d_i) = \cos\big(f(q), f(d_i)\big) \ \text{is maximized} $$
Here, for a query $q$, the system returns the $k$ most relevant documents $\{d_1, d_2, \ldots, d_k\}$, i.e., those maximizing the similarity score $S(q, d_i)$ between the query and each document.
The similarity score is calculated as:
$$
S(q, d_i) = \cos\big(f(q), f(d_i)\big) = \frac{f(q) \cdot f(d_i)}{\|f(q)\|\, \|f(d_i)\|}
$$
Here, $f(q)$ and $f(d_i)$ represent the high-dimensional vectors transformed from the query and documents, respectively. Cosine similarity measures the angular similarity between them—the higher the value, the greater the similarity.
- Generation Module: After the retrieval results are obtained, they are fed into the generative model along with the original query to produce a more contextually relevant and informative output. The formula is as follows:
$$ y = \arg\max_{y'} P\big(y' \mid q, \{d_1, d_2, \ldots, d_k\}; \theta\big) $$
This formula indicates that among all candidate output texts $y'$, the one that maximizes the probability $P(y' \mid q, \{d_1, d_2, \dots, d_k\}; \theta)$ is selected as the final generated result.
Here, $y$ represents the model’s final output, $y'$ ranges over candidate output texts, $\theta$ stands for the model’s parameters, and $P(y' \mid q, \{d_1, d_2, \dots, d_k\}; \theta)$ is the probability that the generative model produces $y'$ given the query $q$ and the retrieved set of documents.
In simpler terms, RAG first uses a retrieval model to find documents $\{d_1, d_2, \dots, d_k\}$ relevant to $q$. Then, conditioned on these documents, the generative model scores the candidate answers and selects the one with the highest probability as the final answer $y$.
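To make the retrieval module concrete, here is a minimal Python sketch of top-$k$ retrieval by cosine similarity. It assumes the document vectors were already produced by some embedding model; `query_vec` and `doc_vecs` are illustrative placeholders, not part of the original project.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # S(q, d_i) = f(q) . f(d_i) / (||f(q)|| * ||f(d_i)||)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray, doc_vecs: list, k: int = 3) -> list:
    # Score every document vector against the query and return the
    # indices of the k best matches, highest similarity first.
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
```

In production the vectors live in a vector database rather than a Python list; this sketch only illustrates the scoring step.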
Model Implementation
Data Collection
A good dataset is half the battle in model success.
— As I always say
The success of a model often hinges on whether the dataset is of high quality. Without high-quality data, even the most sophisticated algorithms struggle to reach their full potential. Teachers typically use PowerPoint (PPT) documents for lectures, so I extracted English text from these high-quality teaching materials as the raw data. The entire process consists of two main steps:
1. File Renaming
To facilitate subsequent text extraction in Python, I first renamed all .pptx files in the current folder in order of their modification times. I did this with a short VBScript: it iterates through all .pptx files, sorts them by modification time, and renames each one in the format number + ".pptx".
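The original VBScript is not reproduced here; as a stand-in, the following Python sketch implements the same logic. It assumes all the .pptx files sit in the current working directory.

```python
from pathlib import Path

# Sort all .pptx files in the current directory by modification time (oldest first).
files = sorted(Path(".").glob("*.pptx"), key=lambda p: p.stat().st_mtime)

# Two passes: a temporary prefix first, so a source file that already has a
# numeric name (e.g. an old "2.pptx") can't be clobbered mid-run.
tmp = [p.rename(p.with_name(f"__tmp_{i}.pptx")) for i, p in enumerate(files, start=1)]
for i, p in enumerate(tmp, start=1):
    p.rename(p.with_name(f"{i}.pptx"))
```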
2. Text Extraction
Next, use the python-pptx library to extract all English text from each renamed PPT file in the current directory: iterate through the renamed .pptx files in numerical order, read all the text from the slides, keep only pure English characters and punctuation using a regular expression, and save the extracted results to an extract.txt file.
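A sketch of this step, assuming the python-pptx package and the numeric filenames produced by the renaming step (the exact regular expression is my own guess at "pure English characters and punctuation"):

```python
import re
from pathlib import Path
from pptx import Presentation  # pip install python-pptx

# Keep only lines made of English letters, digits, spaces, and common punctuation.
ENGLISH_LINE = re.compile(r"^[A-Za-z0-9\s.,;:'\"!?()\-]+$")

def extract_english(path: Path) -> list:
    lines = []
    for slide in Presentation(str(path)).slides:
        for shape in slide.shapes:
            if not shape.has_text_frame:
                continue
            for para in shape.text_frame.paragraphs:
                text = "".join(run.text for run in para.runs).strip()
                if text and ENGLISH_LINE.match(text):
                    lines.append(text)
    return lines

# Process the renamed files in numerical order: 1.pptx, 2.pptx, ...
# (assumes every .pptx in the directory has already been given a numeric name).
files = sorted(Path(".").glob("*.pptx"), key=lambda p: int(p.stem))
with open("extract.txt", "w", encoding="utf-8") as out:
    for f in files:
        out.write("\n".join(extract_english(f)) + "\n")
```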
Data Preprocessing
The extracted raw data contains noise such as spelling errors, mixed punctuation, irregular spacing, and capitalization issues. I chose to use kimi.ai to clean, deduplicate, and tokenize the collected data for subsequent model training.
Example Prompt
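The exact prompt I used is not preserved here; a hedged reconstruction along these lines would cover the three operations mentioned above:

```text
You are a data-cleaning assistant. For the English text below:
1. Fix obvious spelling errors and normalize punctuation, spacing, and capitalization.
2. Remove duplicate or near-duplicate sentences.
3. Split the cleaned text into one sentence per line for later training.
Output only the cleaned text, with no explanations.

[raw text pasted here]
```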
After Kimi completes the data cleaning, a round of manual annotation and review is still required to ensure the accuracy of the dataset.
Model Training
Using RAG (Retrieval-Augmented Generation) technology, the preprocessed data is used as input to train an English continuation writing model.
I employed an open-source RAG framework from GitHub, FastGPT.
Building the Knowledge Base
First, upload the preprocessed dataset to the knowledge base and use an indexing model (such as Alibaba’s text-embedding-v3) to convert the text into vector representations, enabling semantic retrieval. Mathematically, we aim for a mapping function $f: \text{Text} \rightarrow \mathbb{R}^d$ under which semantically similar texts are closer in the vector space. The similarity is typically measured with cosine similarity, defined as:
$$ \text{sim}(a, b) = \frac{f(a) \cdot f(b)}{\|f(a)\|\, \|f(b)\|} $$
Here, $a$ and $b$ represent two pieces of text.
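FastGPT performs this indexing for you on upload, but as an illustration, here is a minimal sketch of building such an index. `embed` is a hypothetical stand-in for whatever embedding model is used (e.g., text-embedding-v3):

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: call your embedding model (e.g. text-embedding-v3)
    # and return an L2-normalized vector f(text) in R^d.
    raise NotImplementedError

# Build the knowledge base: one vector per cleaned text chunk.
chunks = open("extract.txt", encoding="utf-8").read().splitlines()
index = np.stack([embed(c) for c in chunks])  # shape: (num_chunks, d)

# With normalized vectors, cosine similarity reduces to a dot product, so
# scoring a query against the whole base is one matrix-vector product.
def search(query: str, k: int = 3) -> list:
    scores = index @ embed(query)
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]
```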
Creating a Workflow
Next, create a workflow to implement the following process:
- Knowledge Base Retrieval: Upon receiving user input or test cases, retrieve the most relevant text fragments from the knowledge base.
- Text Generation: Pass the retrieved results as context, along with the user input, to a text generation model (e.g., DeepSeek V3), and generate writing content using pre-written prompts.
The objective of the generation model is to maximize the conditional probability of the generated text $y$:
$$ y = \arg\max_{y'} P\big(y' \mid q, \{d_1, d_2, \ldots, d_k\}; \theta\big) $$
The parameter explanations are as mentioned earlier, and the formula’s meaning will not be reiterated here.
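FastGPT wires this workflow up in its visual editor, but the generation node boils down to something like the following sketch, assuming an OpenAI-compatible endpoint serving DeepSeek V3 (the endpoint URL, model name, and system prompt here are placeholders, not the project's actual configuration):

```python
from openai import OpenAI  # pip install openai

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")  # placeholder endpoint

def continue_story(query: str, retrieved: list) -> str:
    # Pack the retrieved knowledge-base fragments into the context,
    # then ask the model to write the continuation.
    context = "\n".join(retrieved)
    response = client.chat.completions.create(
        model="deepseek-chat",  # placeholder model name
        temperature=0.8,
        messages=[
            {"role": "system",
             "content": "You write English reading-continuation essays. "
                        "Imitate the style of the reference passages below.\n" + context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```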
With the FastGPT framework, the deployment process and model training were completed according to its documentation. The main steps include:
- Uploading data and constructing vector indices
- Defining workflows and integrating the retrieval and generation modules
- Adjusting model parameters (e.g., the temperature parameter of the generative model)
Regarding the temperature parameter, here’s a brief explanation: temperature controls the randomness of the output from generative models. A higher temperature results in more random generated text, while a lower temperature makes the output more deterministic. In this project, to prevent overfitting in the RAG-based model, it is recommended to set the temperature to 50%–70% of its maximum value.
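For reference, temperature is the standard rescaling of the logits $z_i$ before the softmax: with temperature $T$, the probability of sampling token $w_i$ is

$$ P(w_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)} $$

As $T \to 0$ the distribution collapses onto the highest-scoring token (near-deterministic output), while larger $T$ flattens the distribution and makes sampling more random.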
Model Optimization
After training the model, its performance is evaluated using a custom test set. Key evaluation metrics include the coherence of the generated text, semantic relevance, and creativity. If the results are unsatisfactory, improvements can be made by optimizing the prompt or adjusting certain model parameters. This process can also be flexibly implemented in the FastGPT console.
Model Limitations
After the public release of this model, it received widespread acclaim. Even English teachers inquired about its principles, calling it a new approach to teaching. However, to be frank, the model still has significant limitations:
- Dataset Limitations: The sample size is very small. Only 16 PPT documents from teachers’ course materials were selected for model training. Both the volume and quality of the data need improvement, and the model’s generalization capability requires enhancement.
- Overfitting Risk: With limited training data, the RAG model is prone to overfitting. A friend previously mentioned that during LoRA fine-tuning, datasets with thousands of samples often lead to overfitting after just 5 epochs. My strategy to mitigate this issue involves increasing the generation temperature, but more diverse and abundant data are still needed to improve the model’s robustness.
GitHub Repository: https://github.com/Crosssense-Lab/Kelly-s1
(The dataset is not open-sourced due to copyright reasons.)