Skip to content

Commit

Permalink
Remove duplicate section from data_engineering.qmd
Browse files Browse the repository at this point in the history
  • Loading branch information
Sara-Khosravi authored Jul 4, 2024
1 parent 3e0381b commit d74f0c0
Showing 1 changed file with 0 additions and 1 deletion.
1 change: 0 additions & 1 deletion contents/data_engineering/data_engineering.qmd
Original file line number Diff line number Diff line change
Expand Up @@ -534,7 +534,6 @@ Throughout this pipeline, transparency through documentation and provenance trac
By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems. This includes applications in embedded systems and TinyML, where resource constraints demand particularly efficient and effective data-handling practices. In the context of TinyML, data engineering practices take on a unique character. Resource-constrained devices often necessitate smaller datasets with high signal-to-noise ratios. Data collection may be limited to on-device sensors or specific environmental conditions. Crowdsourcing and synthetic data generation have become precious tools for generating specialized datasets with limited memory and processing power. Careful optimization techniques for data cleansing, feature selection, and model compression are essential for TinyML applications. By understanding these nuances, data engineers can empower the development of efficient and effective AI solutions at the edge.
## Resources {#sec-data-engineering-resource .unnumbered}

Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing, and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means, including existing datasets, web scraping, crowdsourcing, and synthetic data generation. Each approach involves tradeoffs between cost, speed, privacy, and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems, including embedded and TinyML applications.

## Resources {#sec-data-engineering-resource}

Expand Down

0 comments on commit d74f0c0

Please sign in to comment.