
Developing a Data Quality Framework for Gen AI


The HBR article "Is Your Company's Data Ready for Generative AI?" by Davenport and Tiwari was very insightful. I had a chance to collaborate with Tom Davenport and his team when I worked at the General Electric Company and Cigna, and his dedication and commitment to research in the field of data and analytics, and now artificial intelligence (AI), has been instrumental to the entire data and analytics practice.

Several recent studies by MIT and various data partners, including AWS, Snowflake, and Databricks, all concur that data is emerging as the key ingredient for developing a successful AI application. The MIT and Databricks study found that the CIOs surveyed plan major spending growth to bolster AI's data foundations, a point especially emphasized by participants from companies considered to be AI leaders. Moreover, marketing applications of generative AI consistently top the list of areas where gen AI applications can have the greatest impact in firms.

In the HBR article, Davenport and Tiwari pointed to the fact that their survey of over 300 chief data officers indicated that, despite all the hype and the effort firms are making to become more AI-driven, most of these same firms have yet to create new data strategies or to begin managing their data in the ways necessary to make generative AI work for them.

For example, while 93 percent of the firms in the study agreed that data strategy is critical for getting value from gen AI, 57 percent said they had made no changes thus far in their organizations. Moreover, among the 37 percent that did make changes to their data strategy, only 11 percent agreed very strongly that their organizations have the right data foundation for gen AI.

Furthermore, they point out that the data generative AI uses needs to be of high quality if the generative AI models employing it are to be highly useful. Like the old saying: garbage in, garbage out. This phrase can be modernized for the age of AI by stating, "Poor-quality internal data will yield poor-quality responses from gen AI models."

Given the need for high-quality data to power gen AI applications, it is now worthwhile to dig deeper and create a comprehensive data quality framework to identify and address the data quality issues related to the implementation of AI applications.

First, a general word of caution to all managers of gen AI projects. Even if your data management team is telling you the data in your firm is of high quality, the definition of high quality for one application does not necessarily carry over to another; data quality is unique to every application. Therefore, the quality of your data will have to be reevaluated for your new gen AI application.

Now how do we find the data quality issues themselves? The first step to understanding the data quality in your firm is to ground your investigation in a solid data quality framework. The seven key dimensions of data quality that comprise this framework are accuracy, completeness, consistency, timeliness, relevance, uniqueness, and validity. Below is a short definition of each:

  • Accuracy: Refers to the degree to which the data values are correct, and not some made-up mumbo jumbo.
  • Completeness: Refers to the degree to which the data values are complete and not missing.
  • Consistency: Refers to the degree to which the data values are free from contradiction and conform to a set of established rules or standards.
  • Timeliness: Refers to the degree to which the data values are up to date, i.e. the timeliness or recency of the data.
  • Relevance: Refers to the degree to which data is pertinent and useful to a particular situation or decision-making process.
  • Uniqueness: Refers to the degree to which data items in a dataset are distinct from one another and each represents a unique piece of information.
  • Validity: Refers to the extent to which data conforms to the predefined format and fits within the constraints of the system.

Each one of the seven dimensions plays a critical role in ensuring the overall quality of data. Examining your data for gen AI against this framework gives data managers a structured way to assess and improve the quality of data for gen AI.
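
To make this concrete, here is a minimal sketch in Python of how a few of these dimensions could be scored on a tabular dataset using pandas. The column names (customer_id, email, signup_date), the email format rule, and the one-year timeliness window are illustrative assumptions, not part of the framework itself; accuracy, consistency, and relevance typically require reference data or business rules, so they are only noted in comments.

```python
import pandas as pd

def profile_quality(df: pd.DataFrame) -> dict:
    """Score a few of the seven dimensions on a 0-1 scale."""
    # Validity: share of emails matching a basic format rule (assumed rule).
    email_ok = df["email"].astype(str).str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
    # Timeliness: age of each record in days, relative to today.
    age_days = (pd.Timestamp.now() - pd.to_datetime(df["signup_date"])).dt.days
    return {
        # Completeness: share of all cells that are not missing.
        "completeness": float(1 - df.isna().mean().mean()),
        # Uniqueness: share of rows carrying a distinct customer_id.
        "uniqueness": df["customer_id"].nunique() / len(df),
        "validity": float(email_ok.mean()),
        # Timeliness: share of records newer than one year (assumed window).
        "timeliness": float((age_days <= 365).mean()),
        # Accuracy, consistency, and relevance need reference data or
        # cross-field business rules, so they are omitted from this sketch.
    }

if __name__ == "__main__":
    sample = pd.DataFrame({
        "customer_id": [1, 2, 2, 4],
        "email": ["a@x.com", "not-an-email", None, "d@y.org"],
        "signup_date": ["2025-06-01", "2020-01-15", "2025-03-10", "2024-11-30"],
    })
    print(profile_quality(sample))
```

Even a rough scorecard like this turns the framework from a checklist into numbers that can be tracked over time.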

Now the next step is how one goes about examining their data quality fitness for gen AI in the context of the above framework. For this evaluation I highly recommend some of the tools from Lean Six Sigma, especially DMAIC, as I have found this paradigm to be excellent for quantifying, measuring, and getting to the root causes of data quality issues. I spent 20 years at the General Electric Company as a certified Master Black Belt in Lean Six Sigma and used this quality paradigm many times for addressing data quality. (For more on this process, check out the American Society for Quality.) The important thing to remember is that one does not have to be a Black Belt or Master Black Belt in Lean Six Sigma (the highest trained members in the discipline) to put some of these principles in place. Start small, one stage at a time, and build your data quality program.

Now for some practical examples of how to employ the Lean Six Sigma DMAIC quality paradigm to manage data quality issues. DMAIC is an acronym for define, measure, analyze, improve, and control. It is a phased approach that can be sequential but is often more parallel in practice. The first step is to define the quality goals for gen AI during the Define Phase: Who will participate in the project? What are the limits of the project?
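
As a sketch of what Define Phase outputs might look like when captured alongside the data pipeline, the snippet below records per-dimension quality targets and project scope in code. The thresholds, dataset name, owner, and exclusions are hypothetical placeholders, not recommended standards.

```python
# Define Phase outputs as code: illustrative quality targets per dimension.
QUALITY_TARGETS = {
    "accuracy":     0.98,  # verified against a reference source
    "completeness": 0.95,  # at most 5% missing values in required fields
    "consistency":  0.97,  # records passing cross-field business rules
    "timeliness":   0.90,  # records refreshed within the agreed window
    "relevance":    0.85,  # records in scope for the gen AI use case
    "uniqueness":   0.99,  # de-duplicated on the natural key
    "validity":     0.98,  # values matching the declared formats
}

# Project boundaries agreed with stakeholders (hypothetical names).
PROJECT_SCOPE = {
    "dataset": "customer_profiles",          # table under review
    "owner": "marketing_data_team",          # accountable data owner
    "exclusions": ["legacy_pre_2015_rows"],  # out of scope this cycle
}
```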

The second phase, known as the Measure Phase, is where the actual extent of the data quality issues is measured: What metrics will we use to measure data quality under the framework? In this phase, one would also benchmark the data quality against certain standards. Then, during the Analyze Phase, one finds the root causes of the situations creating the data quality issues. Tools such as Fishbone Diagrams and Pareto Analyses can be very helpful in this context.
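
By way of illustration, a Pareto analysis takes only a few lines of Python once each failing record has been tagged with a root-cause category. The cause names and counts below are made-up examples; in practice they would come from your Measure Phase results.

```python
from collections import Counter

# Illustrative defect tags, one per failing record.
defects = (["missing_email"] * 120 + ["stale_address"] * 45 +
           ["duplicate_id"] * 20 + ["bad_date_format"] * 10 + ["other"] * 5)

counts = Counter(defects)
total = sum(counts.values())
cumulative = 0.0
print(f"{'root cause':<18}{'count':>6}{'cum %':>8}")
for cause, n in counts.most_common():      # sorted from largest to smallest
    cumulative += n / total
    print(f"{cause:<18}{n:>6}{cumulative:>8.0%}")
# The 'vital few' causes covering roughly the first 80% of cumulative
# share are the natural candidates for the Improve Phase.
```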

Next, the Improve Phase is where improvement plans are implemented and tested. There are myriad commercial tools available to fix data quality issues, and this is the phase where they would all be evaluated. And, finally, during the Control Phase, tools like the FMEA (failure mode and effects analysis) are used to make sure that the quality improvements are constantly monitored and don't slip back to the original level of defects, and that the data owners know what to measure and what to do if there is too much drift back toward lower quality. Control Charts are a really useful tool during this phase. The Control Phase shouldn't necessarily signal the end of the DMAIC cycle. Leaders can immediately propose another DMAIC cycle to drive continuous improvement and focus on some of the items on the Pareto chart which were deprioritized in the first round.
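
As one concrete example of Control Phase monitoring, the sketch below computes a simple p-chart over the daily share of invalid records and flags any day that falls outside the three-sigma control limits. The daily counts are simulated for illustration; in practice they would come from your automated data quality checks.

```python
import math

# Simulated daily totals: records checked and records found invalid.
daily_checked = [1000] * 10
daily_invalid = [22, 18, 25, 19, 21, 24, 52, 20, 23, 17]  # day 7 drifts

# Centre line: overall proportion of invalid records.
p_bar = sum(daily_invalid) / sum(daily_checked)
for day, (n, bad) in enumerate(zip(daily_checked, daily_invalid), 1):
    p = bad / n
    sigma = math.sqrt(p_bar * (1 - p_bar) / n)
    ucl = p_bar + 3 * sigma               # upper control limit
    lcl = max(0.0, p_bar - 3 * sigma)     # lower control limit
    flag = "OUT OF CONTROL" if not (lcl <= p <= ucl) else ""
    print(f"day {day:>2}: p={p:.3f} (LCL={lcl:.3f}, UCL={ucl:.3f}) {flag}")
```

A day breaching the upper limit is exactly the signal the data owner needs to trigger the agreed corrective action before defect levels drift back to where they started.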

In conclusion, if firms are planning to implement gen AI applications for marketing and other functions, they should start with their data quality. I hope this article helps firms start thinking about a framework and process for managing the data quality of their gen AI applications for marketing.

References:

Databricks and MIT (2023). CIO vision 2025: Bridging the gap between BI and AI. Databricks. 

Davenport, T. H., & Tiwari, P. (2024). Is Your Company's Data Ready for Generative AI? Harvard Business Review. 

Davenport, T., Bean, R., & Wang, R. (2024). CDO Agenda 2024: Navigating Data and Generative AI Frontiers. CDO Agenda. 


The views and opinions expressed are solely those of the contributor and do not necessarily reflect the official position of the ANA or imply endorsement from the ANA.



David Fogarty, PhD, MBA is the SVP of Data Excellence and Privacy at the Association of National Advertisers (ANA). Prior to the ANA, he was a seasoned Fortune 100 chief data and analytics executive and an adjunct professor at Columbia University, Cornell University and New York University. David is also a bestselling author with over 50 published research papers and books.
