Pretraining Dataset Design for Domain-Specific Small Language Models: A Pedagogical Framework and the AI-PEAP Case Study
Abstract
Pretraining dataset design remains a critical bottleneck for building capable domain-specific language models. This paper introduces a pedagogical framework for constructing high-quality pretraining datasets that enable
Domain-Specific Small Language Models (DSSLMs) to achieve professional-level competency. The methodology, comprising five phases (domain competency mapping, curriculum design, source strategy, data creation, and expert validation), emphasizes learning efficiency over
token volume and professional judgment over general benchmarks. We validate the framework through the AI-PEAP (Professional Enterprise
Architecture Practitioner) case study, demonstrating that a 3B-parameter SLM pretrained on only 8B pedagogically designed tokens (§2.7)
achieves architecture review competency scores of 4.3/5.0 while using 70% fewer tokens than Chinchilla-optimal scaling would suggest. The framework
reduces pretraining costs by 49% while improving professional alignment. We further analyze its domain-agnostic applicability to medicine,
law, and education, establishing "pedagogical data design" as a critical discipline for building expert-grade, efficient SLMs.
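As a rough sanity check on the token-budget claim, the sketch below back-solves the Chinchilla-optimal baseline implied by the abstract's own figures (3B parameters, 8B tokens, 70% reduction). All constants are assumptions taken from this abstract, not from the paper's accounting; the 20-tokens-per-parameter rule of thumb comes from Hoffmann et al. [9], and the paper's own baseline may rest on a different constant.

# Back-of-the-envelope check of the abstract's token-budget figures.
# All constants are assumptions taken from the abstract, not the paper.
MODEL_PARAMS = 3e9       # 3B-parameter SLM
TOKENS_USED = 8e9        # 8B pedagogically designed tokens
TOKEN_REDUCTION = 0.70   # "70% fewer tokens" than the Chinchilla baseline

# Chinchilla-optimal baseline implied by the stated 70% reduction.
implied_baseline = TOKENS_USED / (1 - TOKEN_REDUCTION)
print(f"Implied Chinchilla-optimal budget: {implied_baseline / 1e9:.1f}B tokens")

# The widely cited ~20-tokens-per-parameter heuristic [9] for comparison;
# the paper's baseline may use a different constant.
heuristic_baseline = 20 * MODEL_PARAMS
print(f"20-tokens-per-parameter heuristic: {heuristic_baseline / 1e9:.0f}B tokens")

This prints an implied baseline of about 26.7B tokens, consistent with the 70% figure stated above.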
References
[1] Brown T, Mann B, Ryder N, et al. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems, 2020,
33: 1877-1901.
[2] Touvron H, Lavril T, Izacard G, et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint arXiv:2307.09288, 2023.
[3] Patterson D, Gonzalez J, Le Q, et al. The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink. Computer, 2022, 55(7): 18-28.
[4] Yu Y. Research on XXX. 8th International Conference on Natural Language Processing (ICNLP 2026), Xi'an, China, 2026.
[5] Abdin M, Jacobs S A, Awan A A, et al. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv preprint arXiv:2404.14219, 2024.
[6] Jiang A Q, Sablayrolles A, Mensch A, et al. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
[7] Ji Z, Lee N, Frieske R, et al. Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 2023, 55(12): 1-38.
[8] Yu Y. AI-PEAP SLM Product Development Plan (Version 1.3). 2026.
[9] Hoffmann J, Borgeaud S, Mensch A, et al. Training Compute-Optimal Large Language Models. arXiv preprint arXiv:2203.15556, 2022.
[10] Kaplan J, McCandlish S, Henighan T, et al. Scaling Laws for Neural Language Models. arXiv preprint arXiv:2001.08361, 2020.
[11] Sharma U, Kaplan J. Scaling Laws from the Data Manifold Dimension. Journal of Machine Learning Research, 2022, 23: 1-34.
[12] Zhang P, Zeng G, Wang T, et al. TinyLlama: An Open-Source Small Language Model. arXiv preprint arXiv:2401.02385, 2024.
[13] Wu S, Irsoy O, Lu S, et al. BloombergGPT: A Large Language Model for Finance. arXiv preprint arXiv:2303.17564, 2023.
[14] Pan S J, Yang Q. A Survey on Transfer Learning. IEEE Transactions on Knowledge and Data Engineering, 2010, 22(10): 1345-1359.
[15] Hu E J, Shen Y, Wallis P, et al. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685, 2021.
[16] Shortliffe E H. Computer-Based Medical Consultations: MYCIN. New York: Elsevier, 1976.
[17] Esteva A, Robicquet A, Ramsundar B, et al. A Guide to Deep Learning in Healthcare. Nature Medicine, 2019, 25(1): 24-29.
[18] Aletras N, Tsarapatsanis D, Preoţiuc-Pietro D, et al. Predicting Judicial Decisions of the European Court of Human Rights: A Natural Language Processing Perspective. PeerJ Computer Science, 2016, 2: e93.
[19] Koedinger K R, Anderson J R, Hadley W H, et al. Intelligent Tutoring Goes to School in the Big City. International Journal of Artificial
Intelligence in Education, 1997, 8: 30-43.
[20] The Open Group. TOGAF Standard, Version 9.2. Van Haren Publishing, 2018.
DOI: http://dx.doi.org/10.70711/aitr.v3i9.9017