Pretraining Dataset Design for Domain-Specific Small Language Models: A Pedagogical Framework and the AI-PEAP Case Study
Abstract
Pretraining dataset design remains a critical bottleneck for building capable domain-specific language models. This paper introduces a pedagogical framework for constructing high-quality pretraining datasets that enable
Domain-Specific Small Language Models (DSSLMs) to achieve professional-level competency. The methodology, comprising five phases (domain competency mapping, curriculum design, source strategy, data creation, and expert validation), emphasizes learning efficiency over
token volume and professional judgment over general benchmarks. We validate the framework through the AI-PEAP (Professional Enterprise
Architecture Practitioner) case study, demonstrating that a 3B-parameter SLM pretrained on only 8B pedagogically designed tokens (§2.7)
achieves architecture review competency scores of 4.3/5.0 while using 70% fewer tokens than Chinchilla-optimal scaling would suggest. The framework
reduces pretraining costs by 49% while improving professional alignment. We further analyze its domain-agnostic applicability to medicine,
law, and education, establishing "pedagogical data design" as a critical discipline for building expert-grade, efficient SLMs.
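As a rough sanity check on the token-budget claim, the sketch below back-solves the Chinchilla-optimal baseline implied by the abstract's own figures (3B parameters, 8B tokens, 70% reduction). All constants are assumptions taken from this abstract, not from the paper's accounting; the 20-tokens-per-parameter rule of thumb comes from Hoffmann et al. [9], and the paper's own baseline may rest on a different constant.

# Back-of-the-envelope check of the abstract's token-budget figures.
# All constants are assumptions taken from the abstract, not the paper.
MODEL_PARAMS = 3e9       # 3B-parameter SLM
TOKENS_USED = 8e9        # 8B pedagogically designed tokens
TOKEN_REDUCTION = 0.70   # "70% fewer tokens" than the Chinchilla baseline

# Chinchilla-optimal baseline implied by the stated 70% reduction.
implied_baseline = TOKENS_USED / (1 - TOKEN_REDUCTION)
print(f"Implied Chinchilla-optimal budget: {implied_baseline / 1e9:.1f}B tokens")

# The widely cited ~20-tokens-per-parameter heuristic [9] for comparison;
# the paper's baseline may use a different constant.
heuristic_baseline = 20 * MODEL_PARAMS
print(f"20-tokens-per-parameter heuristic: {heuristic_baseline / 1e9:.0f}B tokens")

This prints an implied baseline of about 26.7B tokens, consistent with the 70% figure stated above.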
References
[1] Brown T, Mann B, Ryder N, et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 2020,
33: 1877-1901.
[2] Touvron H, Lavril T, Izacard G, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
[3] Patterson D, Gonzalez J, Le Q, et al. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. Computer, 2022, 55(7): 18-28.
[4] Yu Y. Research on XXX. 8th International Conference on Natural Language Processing (ICNLP 2026), Xi'an, China, 2026.
[5] Abdin M, Jacobs S A, Awan A A, et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219, 2024.
[6] Jiang A Q, Sablayrolles A, Mensch A, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[7] Ji Z, Lee N, Frieske R, et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 2023, 55(12): 1-38.
[8] Yu Y. AI-PEAP SLM Product Development Plan (Version 1.3). 2026.
[9] Hoffmann J, Borgeaud S, Mensch A, et al. Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556, 2022.
[10] Kaplan J, McCandlish S, Henighan T, et al. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020.
[11] Sharma U, Kaplan J. Scaling Laws from the Data Manifold Dimension. Journal of Machine Learning Research, 2022, 23: 1-34.
[12] Zhang P, Zeng G, Wang T, et al. TinyLlama: An Open-Source Small Language Model. arXiv preprint arXiv:2401.02385, 2024.
[13] Wu S, Irsoy O, Lu S, et al. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564, 2023.
[14] Pan S J, Yang Q. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(10): 1345-1359.
[15] Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021.
[16] Shortliffe E H. Computer-Based Medical Consultations: MYCIN. New York: Elsevier, 1976.
[17] Esteva A, Robicquet A, Ramsundar B, et al. A Guide to Deep Learning in Healthcare. Nature Medicine, 2019, 25(1): 24-29.
[18] Aletras N, Tsarapatsanis D, Preoţiuc-Pietro D, et al. Predicting Judicial Decisions of the European Court of Human Rights: A Natural Language Processing Perspective. PeerJ Computer Science, 2016, 2: e93.
[19] Koedinger K R, Anderson J R, Hadley W H, et al. Intelligent Tutoring Goes to School in the Big City. International Journal of Artificial
Intelligence in Education, 1997, 8: 30-43.
[20] The Open Group. TOGAF Standard, Version 9.2. Van Haren Publishing, 2018.
DOI: http://dx.doi.org/10.70711/aitr.v3i9.9017