AI / ML · 10 min read

Edge AI: Running Machine Learning Models on IoT Devices

By Osman Kuzucu · Published on 2025-06-28

The traditional machine learning deployment model is straightforward: train a model in the cloud, serve it behind an API, and have edge devices send data to the cloud for inference. This works until it does not — when network latency makes real-time decisions impossible, when bandwidth costs for streaming sensor data become prohibitive, when privacy regulations prevent sending raw data off-device, or when connectivity simply is not available. Edge AI flips this model by running inference directly on the device where data is generated. A security camera that classifies objects locally, a factory sensor that detects anomalies without phoning home, a medical wearable that identifies arrhythmias on the wrist — these are not hypothetical scenarios but production deployments happening today across industries.

Model Compression and Quantization

A state-of-the-art image classification model might be 100MB with 25 million parameters — far too large for a microcontroller with 256KB of RAM. The bridge from cloud-scale models to edge-deployable ones involves several compression techniques applied in combination. Pruning removes weights that contribute little to model accuracy, typically reducing model size by 50-90% with minimal accuracy loss. Knowledge distillation trains a smaller "student" model to mimic the outputs of a larger "teacher" model, transferring learned representations into a compact architecture. Quantization converts 32-bit floating-point weights to 8-bit integers or even lower precision, cutting model size by 4x while often maintaining 95-99% of the original accuracy. Post-training quantization is the simplest approach — convert weights after training is complete. Quantization-aware training inserts simulated quantization operations during training, producing models that are inherently more robust to reduced precision.
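To make that concrete, here is a minimal post-training quantization sketch using the TensorFlow Lite converter. The names model (a trained Keras model) and calibration_images (a small sample of representative inputs) are hypothetical placeholders; the representative dataset gives the converter example activations so it can calibrate ranges for a full int8 conversion.

```python
import numpy as np
import tensorflow as tf

# Hypothetical placeholders: `model` is a trained Keras model and
# `calibration_images` is a small, representative sample of training inputs.
def representative_dataset():
    for image in calibration_images[:100]:
        # The converter expects a list of input tensors per calibration sample.
        yield [np.expand_dims(image, axis=0).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
# Restrict ops to int8 kernels so the model can run on integer-only hardware.
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("model_int8.tflite", "wb") as f:
    f.write(tflite_model)
```

Quantization-aware training reuses the same conversion step; the difference is that the model is first trained with simulated quantization inserted (for example via the tensorflow_model_optimization toolkit), so the converted int8 model typically loses less accuracy.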

Runtime Options: TensorFlow Lite vs ONNX Runtime

Two runtimes dominate edge ML deployment. TensorFlow Lite (TFLite) is the most mature option for microcontrollers and mobile devices, with excellent support for ARM-based hardware and a well-documented conversion pipeline from TensorFlow models. Its Micro variant runs on bare-metal devices with as little as 16KB of memory. ONNX Runtime, backed by Microsoft, offers broader framework compatibility — you can export models from PyTorch, TensorFlow, scikit-learn, and other frameworks to the ONNX intermediate format and run them through a single runtime. ONNX Runtime also provides hardware-specific execution providers that automatically leverage NPUs, GPUs, or DSPs when available. For teams using PyTorch as their primary training framework, ONNX Runtime often provides a more natural deployment path than converting to TFLite. In practice, benchmark both runtimes on your target hardware — inference speed, memory footprint, and accuracy after quantization can vary significantly between them depending on the model architecture.
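For teams on the PyTorch path, the export-and-run flow looks roughly like the sketch below. The model name net, the 224x224 input shape, and the opset version are illustrative assumptions; ONNX Runtime tries the execution providers in order and falls back to CPU if an accelerator is unavailable.

```python
import numpy as np
import torch
import onnxruntime as ort

# Hypothetical placeholder: `net` is a trained torch.nn.Module that
# expects a 1x3x224x224 float input. Adjust shapes to your own model.
net.eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    net,
    dummy_input,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=17,
)

# Use the first available execution provider; CPU is the fallback.
session = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
outputs = session.run(None, {"input": np.random.rand(1, 3, 224, 224).astype(np.float32)})
print(outputs[0].shape)
```

The same ONNX file can then be benchmarked against a TFLite conversion of the same model to see which runtime actually wins on your target hardware.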

Hardware Considerations for Edge Deployment

Choosing the right hardware platform depends on your inference requirements, power budget, and cost constraints; a quick on-device measurement, sketched after the list below, is the fastest way to validate a candidate:

  • Microcontrollers (ARM Cortex-M): Ideal for always-on keyword detection, vibration analysis, and simple anomaly detection. Power consumption under 1mW makes battery operation viable for years. Limited to models under 1MB.
  • Edge SoCs (NVIDIA Jetson, Google Coral): Deliver GPU or TPU acceleration for real-time computer vision and NLP at the edge. Can run full neural networks with hundreds of millions of parameters at 15-30 FPS. Power draw ranges from 5W to 30W.
  • FPGAs and custom ASICs: For high-volume deployments where cost-per-unit and power efficiency are critical, custom silicon provides the best performance-per-watt. The trade-off is long development cycles and high upfront non-recurring engineering (NRE) costs.
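Whichever tier you target, it is worth measuring model size and per-inference latency on something close to the real device before committing. The sketch below assumes a quantized file named model_int8.tflite (a hypothetical path) and uses the TFLite Python interpreter; on a Cortex-M part you would use the TFLite Micro C++ API instead, but the same warm-up-then-measure pattern applies.

```python
import os
import time
import numpy as np
import tensorflow as tf

MODEL_PATH = "model_int8.tflite"  # hypothetical path to your quantized model

interpreter = tf.lite.Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()[0]

# Build a random input matching the model's expected shape and dtype.
if input_details["dtype"] == np.int8:
    dummy = np.random.randint(-128, 128, size=input_details["shape"], dtype=np.int8)
else:
    dummy = np.random.rand(*input_details["shape"]).astype(input_details["dtype"])

# Warm up so one-time allocation costs do not skew the numbers.
for _ in range(10):
    interpreter.set_tensor(input_details["index"], dummy)
    interpreter.invoke()

runs = 100
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(input_details["index"], dummy)
    interpreter.invoke()
elapsed = time.perf_counter() - start

print(f"Model size: {os.path.getsize(MODEL_PATH) / 1024:.1f} KB")
print(f"Mean latency: {1000 * elapsed / runs:.2f} ms per inference")
```

Comparing these numbers against the constraints in the list above (flash and RAM limits on microcontrollers, FPS targets on edge SoCs) usually settles the platform choice faster than spec sheets do.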

Edge AI is not a replacement for cloud-based ML but a powerful complement to it. The most effective architectures use a hybrid approach: edge devices handle real-time inference and local decision-making while periodically syncing with the cloud for model updates, aggregated analytics, and retraining on fresh data. As hardware accelerators become cheaper and more capable, and as compression techniques continue to improve, the range of models deployable at the edge will only grow. At OKINT Digital, we help teams navigate the full edge AI pipeline — from model optimization and hardware selection to deployment orchestration and OTA update strategies that keep edge models current without downtime.

edge ai · iot · machine learning · model optimization · embedded systems

Want to discuss these topics in depth?

Our engineering team is available for architecture reviews, technical assessments, and strategy sessions.

Schedule a consultation