A comprehensive strategy for computer vision: combining data-centric and model-based approaches with high-quality synthetic datasets
By SKY ENGINE AI 31 January 2022
Keywords: Computer Vision, Synthetic Data, Data-centric AI, Model-based AI, Data-driven AI, MLOps, DataOps
In this article, you’ll discover how to think about your machine learning models from a data-centric standpoint, stressing the relevance and value of data in the AI model creation process. Data-centric AI focuses on methodically iterating on the data to enhance its quality, or on providing high-quality data (from the neural network’s perspective) from the start, in order to increase performance. Whether you iterate or generate, this is a continuous process that you undertake not just at the beginning but even after deployment into production. We add another critical component, synthetic data generation, to make the data-driven AI strategy more inclusive and to further boost the dataset’s quality as measured by the AI model’s accuracy.
An old paradigm in Machine Learning
The Data Science community has a long history of creating and deploying datasets for AI systems. However, this undertaking is frequently painful and costly. The community needs high-productivity, efficient, open data engineering tools that make creating, managing, and analysing datasets easier, cheaper, and more repeatable. The primary goal, then, is to democratize data engineering and assessment in order to speed up dataset development and iteration while also making the data easier to use. If data preparation accounts for 80% of machine learning labor, then ensuring data quality is the most critical task of a data science team. Human-labeled data still fuels AI-based systems and applications, even though most innovative initiatives have concentrated on AI models and code improvements. Because annotators are the source of data and ground truth, the growing emphasis on the volume, speed, and cost of developing and enhancing datasets has affected quality, which remains ambiguously and often circularly defined. The development of methods for repeatable, systematic tuning and balancing of datasets has also lagged. While dataset quality remains everyone’s top priority, the ways in which it is assessed in practice are poorly understood and, in many cases, incorrect. A decade on, there is cause for concern: fairness and bias issues in labeled datasets, quality issues, benchmark limitations, reproducibility issues in ML research, lack of documentation and replication of data, and unrealistic performance metrics.
Beyond a certain point, the present model-centric approach to data science projects tends to hit a brick wall. When your data does not let you go further, for example because of its quantity or quality, there is only so much you can do by trying many different AI model architectures or tweaking the current one. Experimenting with various models and determining which is best suited to the given data, machine learning problem, and business case does not solve the issue in the long term. If your best model does not satisfy the metric that the business requires to approve or roll out the project, it is probably wise to look into the data and dig deeper to determine how data quality is limiting what the training set can achieve.
In addition, when you train an AI model on data that is statistically different from the real data, this inconsistency makes it fail to generalize to the real use case. The mismatch can appear in unexpectedly subtle ways, and it is frequently neglected when gathering truly representative data is considered too difficult. The problem is especially serious because the validation dataset used to evaluate the AI model will most likely carry the same issues as the training data, so you may believe you are doing well when this is not the reality.
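For intuition, here is a minimal Python sketch (an illustration added for this discussion, not the article’s own tooling) of how such a train-versus-production mismatch might be surfaced: it compares a simple per-image statistic between the two sets with a two-sample Kolmogorov–Smirnov test. The mean-brightness feature and the significance threshold are assumptions; real pipelines would typically compare richer, task-relevant features or embeddings.

```python
# Minimal sketch: flag a possible train/production distribution mismatch
# using a simple per-image statistic (mean brightness). The feature choice
# is illustrative only.
import numpy as np
from scipy.stats import ks_2samp

def mean_brightness(images):
    """images: iterable of HxWxC uint8 arrays -> per-image mean intensity."""
    return np.array([img.mean() for img in images])

def check_shift(train_images, production_images, alpha=0.01):
    train_stat = mean_brightness(train_images)
    prod_stat = mean_brightness(production_images)
    # Two-sample Kolmogorov-Smirnov test: a small p-value suggests the two
    # samples were not drawn from the same distribution.
    statistic, p_value = ks_2samp(train_stat, prod_stat)
    if p_value < alpha:
        print(f"Possible distribution shift (KS={statistic:.3f}, p={p_value:.4f})")
    else:
        print(f"No strong evidence of shift (KS={statistic:.3f}, p={p_value:.4f})")
    return statistic, p_value
```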
Even when all of the training data is consistent with the real data, some of it may still be missing from the training dataset. In reality, not all occurrences are equally likely. Some are rare and others are frequent, so during data collection a lot of data is acquired for typical scenarios and relatively little for unusual ones. This directly leads to poor AI model performance in those rare circumstances. In practice, once sufficient data for typical scenarios has been acquired, additional data for those scenarios no longer improves performance. In many AI-driven computer vision applications, these rare occurrences have the largest influence on the metrics. For instance, another thousand hours of video recorded from cars driving in safe and unchallenging conditions is unlikely to improve an autonomous vehicle’s self-driving performance, because it already performs relatively well there; however, two minutes of footage recorded immediately before crashes could.
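One common data-centric response, sketched below in Python as an illustration (not a prescription from the article), is to oversample the rare scenarios so the model sees them more often during training. The scenario tags and file names are hypothetical.

```python
# Minimal sketch: re-balance a dataset so rare scenarios are drawn more often
# during training. Scenario tags such as "near_crash" are hypothetical; real
# projects would derive them from metadata or manual curation.
import random
from collections import Counter

def inverse_frequency_weights(scenario_tags):
    """Give each sample a weight proportional to 1 / frequency of its scenario."""
    counts = Counter(scenario_tags)
    return [1.0 / counts[tag] for tag in scenario_tags]

def resample_epoch(samples, scenario_tags, epoch_size, seed=0):
    """Draw one training epoch with rare scenarios oversampled."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(scenario_tags)
    return rng.choices(samples, weights=weights, k=epoch_size)

# Usage: 3 rare near-crash clips get roughly the same exposure as 997 routine clips.
samples = [f"clip_{i}.mp4" for i in range(1000)]
tags = ["near_crash"] * 3 + ["routine_driving"] * 997
epoch = resample_epoch(samples, tags, epoch_size=2000)
```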
The data-centric strategy adds tremendous value to industry use cases
As AI spreads throughout sectors, a data-centric strategy is especially effective in use cases with a small amount of available representative and labelled data. Healthcare, manufacturing, and agriculture are examples of industries that frequently deal with relatively limited datasets or massive unlabeled datasets with few domain specialists. A data-driven approach is especially important when dealing with unstructured data (images and video data in computer vision applications), which is widespread in the sectors described above.
We are likely to see more synergy: model-based techniques that reduce reliance on labelled data, combined with technologies that improve the quality of the limited, unbalanced labelled dataset you have, and growing use of tools for synthetic data generation to close the accuracy gap while accelerating AI model production. For example, a typical manufacturing company may have hundreds of possible defect detection use cases, each of which, if successfully resolved, might save millions of dollars. However, many businesses lack the data science resources to curate each and every dataset to assure good quality. Methods such as learning on synthetic data can be used within a data-centric framework to reach the required performance, as the sketch below illustrates.
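The following minimal Python sketch shows one way to top up a small set of real defect images with synthetic ones at a chosen ratio. It is an assumption-laden illustration, not SKY ENGINE AI’s pipeline; the directory names and the 0.5 ratio are hypothetical.

```python
# Minimal sketch: build a training set that tops up scarce real defect images
# with synthetic ones. Paths and the target ratio are illustrative assumptions.
import random
from pathlib import Path

def build_training_set(real_dir, synthetic_dir, synthetic_ratio=0.5, seed=0):
    """Return a shuffled list of image paths with at most `synthetic_ratio` synthetic."""
    assert 0.0 <= synthetic_ratio < 1.0
    rng = random.Random(seed)
    real = sorted(Path(real_dir).glob("*.png"))
    synthetic = sorted(Path(synthetic_dir).glob("*.png"))
    # Cap synthetic samples so they make up the requested share of the final set.
    max_synth = int(len(real) * synthetic_ratio / (1.0 - synthetic_ratio))
    chosen_synth = rng.sample(synthetic, k=min(max_synth, len(synthetic)))
    dataset = list(real) + chosen_synth
    rng.shuffle(dataset)
    return dataset

# Usage: e.g. 200 real defect crops plus up to 200 synthetic ones (ratio 0.5).
train_paths = build_training_set("data/real_defects", "data/synthetic_defects")
```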
A new order in AI model training
Over the last few years, nearly all academic research in deep learning has focused on AI model architecture and design and related advancements. Pre-processing, cleaning, and annotating data are also regarded as tedious and dull by many data scientists. The adoption of a data-centric strategy within the deep learning community is intended to spur greater innovation in academia and industry in areas such as data collection and generation, labeling, augmentation, data quality evaluation, data debt, and data management.
When you have some leeway in the data-gathering process, you can make considerable improvements by paying attention to how you collect the data. In general, acquiring high-quality data that accurately reflects your use case is far more important than simply getting as much as possible from a more convenient source. The ideal situation would be to produce the data with the same process that the final, deployed model will see. The SKY ENGINE AI platform serves this purpose, boosting the accuracy of AI models through synthetic data generation in multiple modalities, including ground truths, and through advanced domain adaptation algorithms.
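To give a feel for the workflow, here is a minimal Python sketch of a synthetic data generation loop that pairs each rendered image with its ground truth. The `SceneRenderer` class and its methods are hypothetical placeholders standing in for any renderer that supports domain randomization; they are not the SKY ENGINE AI platform API.

```python
# Minimal sketch of a synthetic data generation loop with paired ground truth.
# `SceneRenderer` and its methods are hypothetical placeholders, not the
# SKY ENGINE AI platform API.
import json
import random
from pathlib import Path

class SceneRenderer:
    """Placeholder renderer: replace with a real synthetic-data engine."""
    def randomize(self, rng):
        # Randomize lighting, camera pose, object placement, materials, ...
        self.params = {"sun_elevation_deg": rng.uniform(5, 85),
                       "camera_distance_m": rng.uniform(2.0, 10.0)}
    def render(self):
        # A real engine would return image bytes plus matching annotations.
        return b"", {"boxes": [], "masks": [], "params": self.params}

def generate_dataset(out_dir, num_samples=1000, seed=0):
    rng = random.Random(seed)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    renderer = SceneRenderer()
    for i in range(num_samples):
        renderer.randomize(rng)                  # domain randomization step
        image, ground_truth = renderer.render()  # image + matching labels
        (out / f"{i:06d}.png").write_bytes(image)
        (out / f"{i:06d}.json").write_text(json.dumps(ground_truth))
```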
Learn more about SKY ENGINE AI Synthetic data-centric AI
- A talk on Ray tracing for Deep Learning to generate synthetic data at NVIDIA GTC 2020 by Jakub Pietrzak, CTO SKY ENGINE AI (Free registration required)
- Example synthetic data videos generated in SKY ENGINE AI platform to accelerate AI models training
- Presentation on using synthetic data for 5G network performance optimization platform
- Working example with code for synthetic data generation and AI models training for team-based sport analytics in SKY ENGINE AI platform
- Explore warehousing and inventorying solutions with AI training in virtual environments and synthetically generated data — more examples