
# Data Engineering

::: {.callout-note collapse="true"}
## Learning Objectives

* coming soon.

:::

## Introduction
Data is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.

We begin with data collection: where do we source data, and how do we gather it? Options include scraping the web, accessing APIs, reading sensors and IoT devices, and conducting surveys or gathering user input. All of these methods are used in practice. Next, we turn to data labeling, including considerations for human involvement. We discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. We then address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training.

Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases lack extensive data repositories readily available for curation. Synthetic data generation is a viable alternative, though it comes with its own advantages and disadvantages. We also touch on dataset versioning, emphasizing the importance of tracking data modifications over time. Because data is ever-evolving, it is imperative to devise strategies for managing and storing large datasets. By the end of this section, you will have a comprehensive understanding of the entire data pipeline, from collection to storage, which is essential for operationalizing AI systems. Let's embark on this journey!


## Problem Definition
In many domains of machine learning, while sophisticated algorithms take center stage, the fundamental importance of data quality is often overlooked. This neglect gives rise to [“Data Cascades”](https://research.google/pubs/pub49953/) — events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities.

![Cascades](images/data_engineering_cascades.png)

Although many ML professionals recognize the importance of data, many practitioners still report facing these cascades. This underscores a systemic issue: while the allure of developing advanced models persists, data work often remains underappreciated.

Take, for example, keyword spotting (KWS). KWS is a prime example of TinyML in action and a critical technology behind voice-enabled interfaces on endpoint devices such as smartphones. Typically operating as lightweight wake-word engines, these systems are always on, listening for a specific phrase to trigger further actions. When we say a phrase such as 'Ok Google' or 'Alexa', this initiates a process on a microcontroller embedded within the device. These microcontrollers, despite their limited resources, play a pivotal role in enabling seamless voice interactions, often in environments with high ambient noise. The uniqueness of the wake word helps minimize false positives, ensuring that the system is not triggered inadvertently.

![Virtual assistants](images/data_engineering_kws.png)

Building a reliable KWS model is not a straightforward task. It demands a deep understanding of the deployment scenario, encompassing where and how these devices will operate. A KWS model's effectiveness is not just about recognizing a word; it is about discerning that word across various accents and against background sound, whether the cacophony of a bustling cafe or a blaring television in the living room or kitchen where these devices are commonly found. It is about ensuring that a whispered 'Alexa' in the dead of night and a shouted 'Ok Google' in a noisy marketplace are both recognized with equal precision.

Moreover, many current KWS voice assistants support only a limited number of languages, leaving a substantial portion of the world’s linguistic diversity unrepresented. This limitation is partly due to the difficulty of gathering and monetizing data for languages spoken by smaller populations. The long-tail distribution of languages implies that many languages have limited data available, which makes developing supporting technologies challenging.


Generally, in ML, problem definition has a few key steps:

1. Identifying the problem clearly

2. Setting clear objectives

3. Establishing success benchmarks

4. Understanding end-user engagement/use

5. Understanding the constraints and limitations of deployment

6. Performing data collection

7. Iterating on feedback and refining the system


Laying a solid foundation for a project is essential for its trajectory and eventual success. Central to this foundation is first identifying a clear problem, such as ensuring that voice commands in voice assistance systems are recognized consistently across varying environments. Clear objectives, like creating representative datasets for diverse scenarios, provide a unified direction, while benchmarks, such as system accuracy in keyword detection, offer measurable outcomes to gauge progress. Engaging with stakeholders, from end-users to investors, provides invaluable insights and ensures alignment with market needs. Additionally, in areas like voice assistance, understanding platform constraints is pivotal. Embedded systems, like microcontrollers, come with inherent limitations in processing power, memory, and energy efficiency. Recognizing these limitations ensures that functionalities, like keyword detection, are tailored to operate optimally, balancing performance with resource conservation.

In this context, using KWS as an example, we can break each of the steps out as follows:

1. **Identifying the Problem:**
   At its core, KWS aims to detect specific keywords amidst a sea of ambient sounds and other spoken words. The primary problem is to design a system that can recognize these keywords with high accuracy, low latency, and minimal false positives or negatives, especially when deployed on devices with limited computational resources.

2. **Setting Clear Objectives:**
   The objectives for a KWS system might include:
   - Achieving a specific accuracy rate (e.g., 98% accuracy in keyword detection).
   - Ensuring low latency (e.g., keyword detection and response within 200 milliseconds).
   - Minimizing power consumption to extend battery life on embedded devices.
   - Ensuring the model's size is optimized for the available memory on the device.

3. **Benchmarks for Success:**
   Establish clear metrics to measure the success of the KWS system. This could include:
   - True Positive Rate: The percentage of keyword utterances that are correctly detected.
   - False Positive Rate: The percentage of non-keyword inputs that are incorrectly identified as keywords.
   - Response Time: The time taken from keyword utterance to system response.
   - Power Consumption: Average power used during keyword detection.

4. **Stakeholder Engagement and Understanding:**
   Engage with stakeholders, which might include device manufacturers, hardware and software developers, and end-users. Understand their needs, capabilities and constraints. For instance:
   - Device manufacturers might prioritize low power consumption.
   - Software developers might emphasize ease of integration.
   - End-users would prioritize accuracy and responsiveness.

5. **Understanding the Constraints and Limitations of Embedded Systems:**
   Embedded devices come with their own set of challenges:
   - Memory Limitations: KWS models need to be lightweight to fit within the memory constraints of embedded devices. Typically, KWS models might need to be as small as 16KB to fit in the always-on island of the SoC. Moreover, this is just the model. The model itself might need to be wrapped in other application code that handles pre-processing.
   - Processing Power: The computational capabilities of embedded devices are limited (a few hundred MHz of clock speed), so the KWS model must be optimized for efficiency.
   - Power Consumption: Since many embedded devices are battery-powered, the KWS system must be power-efficient.
   - Environmental Challenges: Devices might be deployed in various environments, from quiet bedrooms to noisy industrial settings. The KWS system must be robust enough to function effectively across these scenarios.

6. **Data Collection and Analysis:**
   For a KWS system, the quality and diversity of data are paramount. Considerations might include:
   - Variety of Accents: Collect data from speakers with various accents to ensure wide-ranging recognition.
   - Background Noises: Include data samples with different ambient noises to train the model for real-world scenarios.
   - Keyword Variations: People might pronounce keywords differently, or there might be slight variations in the wake word itself. Ensure the dataset captures these nuances.

7. **Iterative Feedback and Refinement:**
   Once a prototype KWS system is developed, it's crucial to test it in real-world scenarios, gather feedback, and iteratively refine the model. This ensures that the system remains aligned with the defined problem and objectives. It also matters because deployment scenarios change over time as users, environments, and devices evolve.
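
The objectives and benchmarks listed above can be turned into an automated check. The following is a minimal sketch rather than a definitive evaluation harness: the `KwsTargets` class, the function names, the evaluation data, and the false positive rate threshold are all hypothetical, while the 98%, 200 ms, and 16KB figures are borrowed from the examples in the text (with the true positive rate standing in for the accuracy target).

```python
from dataclasses import dataclass


@dataclass
class KwsTargets:
    """Illustrative targets; most numbers mirror the examples in the text."""
    min_true_positive_rate: float = 0.98   # stand-in for the 98% accuracy example
    max_false_positive_rate: float = 0.01  # assumed value, not from the text
    max_latency_ms: float = 200.0          # detection and response within 200 ms
    max_model_bytes: int = 16 * 1024       # model as small as 16KB


def detection_rates(labels, predictions):
    """Return (true_positive_rate, false_positive_rate) for binary labels."""
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fn = sum(1 for y, p in zip(labels, predictions) if y and not p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    tn = sum(1 for y, p in zip(labels, predictions) if not y and not p)
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr


def meets_targets(targets, tpr, fpr, latency_ms, model_bytes):
    """Report which targets pass, keyed by name."""
    return {
        "true_positive_rate": tpr >= targets.min_true_positive_rate,
        "false_positive_rate": fpr <= targets.max_false_positive_rate,
        "latency": latency_ms <= targets.max_latency_ms,
        "model_size": model_bytes <= targets.max_model_bytes,
    }


# Hypothetical evaluation data: 1 = keyword present / detected, 0 = absent.
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tpr, fpr = detection_rates(labels, predictions)
report = meets_targets(KwsTargets(), tpr, fpr, latency_ms=150.0, model_bytes=14_000)
print(f"TPR={tpr:.2f} FPR={fpr:.2f}")
print(report)
```

Keeping the targets in one place makes it easier to rerun the same check after each round of feedback and refinement.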

Moreover, it is important to appreciate that these keyword spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as the breaking of glass. This evolution is geared towards creating intelligent devices capable of understanding and responding to a myriad of vocal commands, heralding a future where even household appliances can be controlled through voice interactions.


## Data Sourcing
The quality and type of data gathered can significantly affect the performance of both the trained model and its downstream applications. This is particularly true in embedded systems where computational resources may be limited, necessitating models that are both accurate and efficient. Despite its critical importance, the issue of sourcing and data quality is often overlooked in favor of focusing on model architecture and performance metrics.

Generally, data can be sourced from various places depending on the project's objectives. Some common data sources include pre-existing datasets, web scraping, sensors, APIs, and crowdsourcing.

While pre-existing datasets, accessible from platforms like [Kaggle](https://www.kaggle.com/datasets) and [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/index.php), serve as a convenient starting point, they come with their challenges. Often, these datasets are well-curated, documented, and ready for consumption. However, the question remains as to how these datasets were created, labeled, and validated. The methodologies employed in their curation and their inter-annotator reliability metrics are rarely scrutinized, leaving questions of bias, validity, and reproducibility unanswered.

Crowdsourcing is an effective method, particularly suited for tasks requiring human judgment. Platforms like [Amazon's Mechanical Turk](https://www.mturk.com/) enable the distribution of micro-tasks to a broad audience who can assist in labeling or annotating data. However, it's crucial to note the importance of transparency in how data is sourced and validated. Detailed information about annotation criteria and protocols should be readily available to ensure both accountability and interpretability.

This method is widely used for various tasks such as sentiment analysis and image recognition, which may require human interpretation for nuanced understanding. For example, the [ImageNet project](http://www.image-net.org/) sourced candidate images online and then used crowdsourcing to categorize them into thousands of object classes. While this approach leverages massive parallelism for time-efficiency, challenges persist in ensuring consistent quality across annotations.

Another example is [Mozilla's Common Voice project](https://commonvoice.mozilla.org/en), which has successfully gathered a publicly accessible dataset of diverse voice recordings. Volunteers contribute by recording phrases in different languages and accents and also validate the submissions of others.

Data can also be directly sourced from the field through sensors or devices, especially in the context of embedded systems or IoT devices. For instance, a weather station might collect real-time data on temperature, humidity, and wind speed. Such data is often raw and must be pre-processed before it can be utilized in a machine learning model. It's crucial to maintain a high standard of data quality for reliable machine learning outcomes. Documentation of the origin and methods used for data collection can enhance transparency and accountability.

APIs offer another channel for data collection. Various services like [Twitter](https://developer.twitter.com/en/docs/twitter-api), [Google](https://developers.google.com/products), or financial platforms provide APIs through which data can be collected programmatically. This allows for real-time data sourcing and can be customized to collect only the information that is relevant to the project.


## Data Processing
Once data is collected, it must be processed into a usable format. For instance, the Multilingual Spoken Words Corpus (MSWC) applied forced alignment to the crowdsourced, sentence-level recordings of the Common Voice project, extracting individual word recordings that can be used to train keyword spotting models.

The MSWC serves as an example of data pipelines—systematic and automated workflows for data transformation, storage, and processing. By streamlining the data flow, from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models.
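
To make the word-extraction step concrete, here is a minimal sketch of slicing a sentence-level recording into per-word clips once word timings are known. It is not the MSWC pipeline itself: the alignment tuples, the 16 kHz sample rate, and the function name `extract_word_clips` are assumptions for illustration, and the timings would in practice come from a forced aligner.

```python
SAMPLE_RATE_HZ = 16_000  # assumed sample rate for this sketch


def extract_word_clips(samples, alignments, sample_rate=SAMPLE_RATE_HZ, pad_s=0.0):
    """Slice one sentence-level recording into per-word clips.

    samples: sequence of audio samples for the whole sentence.
    alignments: list of (word, start_s, end_s) tuples, as a forced aligner
        might produce them (the timings used below are hypothetical).
    Returns a list of (word, clip) pairs.
    """
    clips = []
    for word, start_s, end_s in alignments:
        if end_s <= start_s:
            continue  # skip degenerate alignments
        # Convert seconds to sample indices, clamped to the recording bounds.
        start = max(0, int(round((start_s - pad_s) * sample_rate)))
        end = min(len(samples), int(round((end_s + pad_s) * sample_rate)))
        clips.append((word, samples[start:end]))
    return clips


# A 2-second silent "recording" standing in for real audio.
sentence = [0.0] * (2 * SAMPLE_RATE_HZ)
alignments = [("turn", 0.10, 0.35), ("on", 0.35, 0.50), ("lights", 0.60, 1.10)]

clips = extract_word_clips(sentence, alignments)
for word, clip in clips:
    print(word, len(clip) / SAMPLE_RATE_HZ)
```

The optional `pad_s` argument adds a margin around each word, which may help when alignment boundaries are imprecise.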

Data often comes from diverse sources and can be unstructured or semi-structured. Thus, it's essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include:

- Normalizing numerical variables
- Encoding categorical variables
- Using techniques like dimensionality reduction
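
As a rough illustration of the first two transformations, the sketch below standardizes a numerical variable and one-hot encodes a categorical one using only the Python standard library. The variable names and values are hypothetical, and dimensionality reduction is left out to keep the example short.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale values to zero mean and unit variance (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no information
    return [(v - mu) / sigma for v in values]


def one_hot(categories):
    """Map each category to a 0/1 vector, with columns in sorted order."""
    vocabulary = sorted(set(categories))
    index = {c: i for i, c in enumerate(vocabulary)}
    encoded = [
        [1 if index[c] == i else 0 for i in range(len(vocabulary))]
        for c in categories
    ]
    return vocabulary, encoded


durations_s = [0.5, 1.0, 1.5, 2.0]   # hypothetical clip lengths in seconds
accents = ["us", "in", "uk", "in"]   # hypothetical accent labels

z = standardize(durations_s)
vocab, encoded = one_hot(accents)
print(z)
print(vocab, encoded)
```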


Data cleaning involves refining the dataset to remove inconsistencies, duplications, and inaccuracies. For instance, in the MSWC data, crowd-sourced recordings often feature background noises, such as static and wind. Depending on the model's requirements, these noises can be removed or intentionally retained.
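
A small sketch of what removing inconsistencies and duplications might look like for word recordings follows. The record layout (a dict with `word` and `audio` keys) and the function name are assumptions made for this example; noise removal is not shown.

```python
import hashlib


def clean_records(records):
    """Normalize labels and drop empty or exactly duplicated recordings.

    Each record is a dict with hypothetical keys: "word" (str) and "audio" (bytes).
    """
    seen = set()
    cleaned = []
    for record in records:
        word = record["word"].strip().lower()  # fix inconsistent casing/spacing
        if not word or not record["audio"]:
            continue  # drop records with a missing label or empty clip
        digest = hashlib.sha256(record["audio"]).hexdigest()
        key = (word, digest)
        if key in seen:
            continue  # drop byte-identical duplicate submissions
        seen.add(key)
        cleaned.append({"word": word, "audio": record["audio"]})
    return cleaned


raw = [
    {"word": "Alexa ", "audio": b"\x01\x02"},
    {"word": "alexa", "audio": b"\x01\x02"},  # duplicate after normalization
    {"word": "", "audio": b"\x03"},           # missing label
    {"word": "google", "audio": b"\x04\x05"},
]
cleaned = clean_records(raw)
print(cleaned)
```

Hashing the audio bytes only catches exact duplicates; near-duplicates, such as re-encoded copies of the same clip, would need a different comparison.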

Data validation serves a broader role than just ensuring adherence to certain standards, such as preventing temperature values from falling below absolute zero. It is imperative to catch data errors early, before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
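
The checks mentioned above can be sketched for a list of temperature readings. This is one possible arrangement, not a prescribed method: outliers are flagged with a modified z-score based on the median absolute deviation (a choice made for this example, with 3.5 as an assumed threshold), and missing or physically impossible values are filled with the mean of the remaining inliers.

```python
from statistics import mean, median

ABSOLUTE_ZERO_C = -273.15


def validate_temperatures(readings_c, threshold=3.5):
    """Return (imputed_readings, outlier_indices).

    None and values below absolute zero are treated as missing and replaced
    with the mean of the inliers. Outliers are flagged but left in place so
    that they can be reviewed rather than silently altered.
    """
    valid = [r for r in readings_c if r is not None and r >= ABSOLUTE_ZERO_C]
    if not valid:
        raise ValueError("no valid readings to validate against")

    med = median(valid)
    mad = median([abs(v - med) for v in valid])  # median absolute deviation

    def is_outlier(value):
        return mad > 0 and 0.6745 * abs(value - med) / mad > threshold

    inliers = [v for v in valid if not is_outlier(v)]
    fill = mean(inliers)

    imputed, outliers = [], []
    for i, r in enumerate(readings_c):
        if r is None or r < ABSOLUTE_ZERO_C:
            imputed.append(fill)  # mean imputation
            continue
        imputed.append(r)
        if is_outlier(r):
            outliers.append(i)
    return imputed, outliers


# Hypothetical sensor readings in degrees Celsius.
readings = [21.0, 21.5, None, 22.0, -300.0, 21.2, 95.0, 21.8]
imputed, outliers = validate_temperatures(readings)
print(imputed, outliers)
```

Excluding outliers before computing the fill value keeps a single bad reading from skewing every imputed entry.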

One often-overlooked aspect is the importance of benchmarks designed to evaluate data quality. While traditional benchmarks may focus on model performance, the need for data-centric benchmarks is increasingly evident.

Maintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.

Keeping track of data provenance—essentially the origins and the journey of each data point through the data pipeline—is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies.

For instance, if an ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue lies with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency not only helps in debugging the system but also plays a crucial role in enhancing overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model's performance and its acceptability among end-users.
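
One lightweight way to keep such a trail is to attach a small record to every sample and update it at each pipeline step. The sketch below is an assumption-laden illustration: the `ProvenanceRecord` class, its fields, the sample identifier, and the source name are all hypothetical, and real systems often rely on dedicated tooling instead.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Hypothetical per-sample record of origin and processing history."""
    sample_id: str
    source: str            # where the sample came from
    content_sha256: str    # fingerprint of the current bytes
    steps: list = field(default_factory=list)

    def apply(self, step_name, transform, data):
        """Run a transform, then log the step and the new fingerprint."""
        result = transform(data)
        self.content_sha256 = hashlib.sha256(result).hexdigest()
        self.steps.append(step_name)
        return result


def new_record(sample_id, source, data):
    """Create a record whose fingerprint matches the raw bytes."""
    return ProvenanceRecord(sample_id, source, hashlib.sha256(data).hexdigest())


raw = b"  hello world  "
record = new_record("clip-0001", "hypothetical-volunteer-upload", raw)
data = record.apply("strip_padding", bytes.strip, raw)
data = record.apply("uppercase", bytes.upper, data)
print(record)
```

With the ordered list of steps and the content fingerprint, a problematic sample can be traced back to its source and to the transformation that last changed it.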


## Feature Engineering

Explanation: Feature engineering involves selecting and transforming variables to improve the performance of AI models. It's vital in embedded AI systems where computational resources are limited, and optimized feature sets can significantly improve performance.

- Importance of Feature Engineering
- Techniques of Feature Selection
- Feature Transformation for Embedded Systems
- Embeddings
- Real-time Feature Engineering in Embedded Systems

## Data Version Control

Explanation: Version control is critical for managing changes and tracking versions of datasets during the development of AI models, facilitating reproducibility and collaboration.

- Version Control Systems
- Metadata

## Optimizing Data for Embedded AI

Explanation: This section concentrates on optimization techniques specifically suited for embedded systems, focusing on strategies to reduce data volume and enhance storage and retrieval efficiency, crucial for resource-constrained embedded environments.

- Low-Resource Data Challenges
- Data Reduction Techniques
- Optimizing Data Storage and Retrieval

## Challenges in Data Engineering

Explanation: Understanding potential challenges can help in devising strategies to mitigate them. This section discusses common challenges encountered in data engineering, particularly focusing on embedded systems.

- Scalability
- Data Security and Privacy
- Data Bias and Representativity

## Promoting Transparency

Explanation: We explain that as we increasingly use these systems built on the foundation of data, we need to have more transparency in the ecosystem.

- Definition and Importance of Transparency in Data Engineering
- Transparency in Data Collection and Sourcing
- Transparency in Data Processing and Analysis
- Transparency in Model Building and Deployment
- Transparency in Data Sharing and Usage
- Tools and Techniques for Ensuring Transparency

## Licensing

Explanation: This section emphasizes why one must understand data licensing issues before they start using the data to train the models.

- Metadata
- Data Nutrition Project
- Understanding Licensing

## Conclusion

Explanation: Close up the chapter with a summary of the key topics that we have covered in this section.

- The Future of Data Engineering in Embedded AI
- Key Takeaways

## Helpful References
1. [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)
2. [Data Engineering for Everyone](https://arxiv.org/abs/2102.11447)
3. [DataPerf: Benchmarks for Data-Centric AI Development](https://arxiv.org/abs/2207.10062)
4. [Deep Spoken Keyword Spotting: An Overview](https://arxiv.org/abs/2111.10592)
5. [“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)
6. [Multilingual Spoken Words Corpus](https://openreview.net/pdf?id=c20jiJ5K2H)
7. [Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)
8. [Small-footprint keyword spotting using deep neural networks](https://ieeexplore.ieee.org/abstract/document/6854370)