This summer, I spent three months diving deep into AI model development thanks to the Dill Fund. The goal was straightforward: understand how modern AI models actually work by building and training them myself.

I primarily worked with open-source models, including Llama 3.1, Mistral 7B, Phi-2, and Qwen2.5. Each model taught me something different. Llama showed me how transformer architectures scale, Mistral demonstrated efficiency tricks, and Phi-2 proved that smaller models can punch above their weight. The real learning came from simply getting these models to run, which meant wrestling with hardware optimization and environment setup.
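
To give a flavor of what "getting a model to run" looked like day to day, here is a minimal sketch using the Hugging Face transformers library; the model id, dtype, and prompt are illustrative choices, not a record of my exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small open model in half precision and let accelerate place layers
# on whatever GPU/CPU memory is available (device_map="auto" needs `accelerate`).
model_id = "microsoft/phi-2"  # example model; any causal LM id works similarly
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # halves memory versus float32
    device_map="auto",
)

prompt = "Explain what a transformer attention head does in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```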

The dataset work was intense. I processed chunks of Common Crawl, worked with RefinedWeb for cleaner text data, and experimented with LAION-5B for multimodal tasks. Data preprocessing isn’t glamorous—it’s writing scripts to clean text, dealing with encoding errors, and figuring out why your data loader crashes after processing 10 million samples. But it’s also where I learned that data quality matters more than model size. A well-curated dataset of 100K examples often outperformed millions of noisy samples. I also tried to build my own smaller dataset on a very specific topic, and quickly realized how hard that is: getting all the data into a common structure and generating question-answer pairs takes far more effort than it looks.
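
The cleaning scripts themselves were nothing fancy. The sketch below shows the general shape of one, using the datasets library in streaming mode so a web-scale dump never has to fit in memory; the dataset id, the content field name, and the filter thresholds are assumptions for illustration rather than the exact pipeline I ran.

```python
import re
import unicodedata
from datasets import load_dataset

def clean_text(example):
    # Normalize odd encodings and collapse whitespace.
    text = unicodedata.normalize("NFKC", example["content"])
    example["content"] = re.sub(r"\s+", " ", text).strip()
    return example

def keep(example):
    text = example["content"]
    # Drop documents that are too short, too long, or link-heavy (thresholds are examples).
    return 200 < len(text) < 20_000 and text.count("http") < 10

# Streaming mode: documents are processed one at a time, never fully downloaded.
raw = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
cleaned = raw.map(clean_text).filter(keep)

for i, doc in enumerate(cleaned):
    if i >= 3:
        break
    print(doc["content"][:120])
```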

QLoRA became my go-to technique for fine-tuning. The math is elegant—you’re essentially learning small low-rank updates to frozen model weights—but the implementation is tricky. Getting the rank parameters right, balancing learning rates, and preventing catastrophic forgetting required constant experimentation. My previous work on an Educational Chatbot gave me a starting point, but this summer I pushed much further into optimization territory.
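
For anyone curious what that setup looks like in code, here is a rough sketch using the peft and bitsandbytes integrations in transformers; the base model, rank, and target modules are example values, not the exact configuration I settled on.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Quantize the frozen base model to 4-bit NF4 (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",  # example base model
    quantization_config=bnb_config,
    device_map="auto",
)

# Attach small trainable low-rank adapters (the "LoRA" part).
lora_config = LoraConfig(
    r=16,                                  # rank of the update matrices
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # which projections get adapters
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total weights
```

Only the adapter weights receive gradients, which is why the memory footprint stays small enough to fine-tune a 7B model on a single GPU.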

I also explored code generation with models like CodeLlama, tested multilingual capabilities with mT5, and built simple vision-language applications. Each project had its own challenges. For example, code generation models need different evaluation metrics, since text overlap says little about whether the generated code actually runs. Multilingual models consume memory like crazy. Vision-language models require careful preprocessing of both modalities. Every project meant new bugs to fix and new papers to read.
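
On the evaluation point: the usual metric for code generation is pass@k, estimated with the standard unbiased formula rather than counted naively. A minimal version looks like this, with made-up sample counts for the example.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k samples
    drawn from n generated programs (of which c passed the tests) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 5 of which passed the unit tests.
print(round(pass_at_k(n=20, c=5, k=1), 3))   # 0.25
print(round(pass_at_k(n=20, c=5, k=10), 3))  # close to 1.0
```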

I developed a range of technical skills that will directly help me with graduate research. I can now set up distributed training across multiple GPUs, implement custom training loops, and debug CUDA errors without panicking. More importantly, I understand the research process—how to form hypotheses about model behavior, design experiments to test them, and iterate based on results. Working with tools like Weights & Biases for experiment tracking and the Hugging Face ecosystem for model deployment became second nature.
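
A custom training loop with experiment tracking is less mysterious than it sounds. The sketch below shows the skeleton I mean, with a placeholder W&B project name and hyperparameters; it assumes a Hugging Face-style model whose forward pass returns a loss.

```python
import torch
import wandb

def train(model, loader, epochs=3, lr=2e-5):
    """Minimal custom training loop with Weights & Biases logging."""
    run = wandb.init(project="summer-finetuning", config={"lr": lr, "epochs": epochs})
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    step = 0
    for epoch in range(epochs):
        for batch in loader:
            # Move the batch to the model's device and run a forward pass;
            # HF causal LM models return .loss when labels are provided.
            batch = {k: v.to(model.device) for k, v in batch.items()}
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step % 50 == 0:
                wandb.log({"loss": loss.item(), "epoch": epoch}, step=step)
    run.finish()
```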

AWS became both my best friend and biggest challenge. Managing p3.2xlarge instances meant being extremely careful about costs. I learned to checkpoint frequently, use spot instances when possible, and squeeze every bit of performance from the hardware. Those late nights debugging why a model wouldn’t converge or why GPU memory kept overflowing taught me more about practical ML than any course could.
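
Frequent checkpointing was the only way to make spot instances safe to use. The sketch below shows the basic save-and-resume pattern, with a placeholder path standing in for a persistent volume.

```python
import os
import torch

CKPT_PATH = "/mnt/checkpoints/latest.pt"  # placeholder: a volume that survives spot reclaim

def save_checkpoint(model, optimizer, epoch, step):
    """Write everything needed to resume training exactly where it stopped."""
    torch.save(
        {"model": model.state_dict(),
         "optimizer": optimizer.state_dict(),
         "epoch": epoch,
         "step": step},
        CKPT_PATH,
    )

def load_checkpoint(model, optimizer):
    """Resume from the last checkpoint if one exists (e.g. after a spot interruption)."""
    if not os.path.exists(CKPT_PATH):
        return 0, 0
    state = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["epoch"], state["step"]
```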

I’m grateful to the Dill Fund donors for making this possible. The funding covered compute costs, dataset access, and gave me the freedom to experiment without worrying about breaking things. Professors Brown and Stout provided valuable guidance, especially when I got stuck on particularly thorny optimization problems.

For other Wabash students considering AI research: it’s mostly about persistence. You’ll spend more time debugging than coding. Models will refuse to converge for mysterious reasons. Your perfectly crafted training script will crash at epoch 4 of 5. But when you finally get a model working—when it generates coherent text or correctly classifies images—the satisfaction is worth every frustrating hour.

This summer confirmed that AI research is where I want to focus my career. Not because it’s trendy or well-funded, but because the problems are genuinely fascinating. How do we make models more efficient? How do we reduce hallucinations? How do we build systems that can truly understand and reason? These questions drive me, and this summer gave me the tools to start finding answers.

This summer solidified my technical foundation in AI research. I gained practical experience with production-level model training, learned to optimize for limited compute resources, and developed debugging skills that only come from fixing real training failures. The hands-on work with datasets, model architectures, and distributed computing gave me the confidence to tackle more complex research problems. Most importantly, I now have a clear understanding of the gap between academic papers and actual implementation—and the skills to bridge it.