Garbage In, Garbage Out: Why Data Relevance Is Make-or-Break for AI

As AI consumes more and more time, attention, and money, organizations are learning that a very old adage still applies to very new technology: “garbage in/garbage out.” Your AI model’s output is only as good as the data used to train it and the data it can access to fulfill its function.

AscentAI’s Lead Regulatory Advisor, Jilaine Bauer, sees this firsthand as she works with clients. She notes, “It’s important for AI solutions to be trained on data sets that include industry-specific data to achieve greater accuracy, relevance, and insights. For example, when working with an insurance company, it is important for an AI solution to be trained on data sets that include terminology and concepts relevant to the insurance sector and related subsectors. For FinTech firms delivering traditional financial services in new, engaging ways, such as digital banking, data sets may need to include new or different terms and concepts. And for firms operating in more than one country, taxonomies and ontologies can help structure and categorize data to help ensure it is applied in a consistent manner. Finally, perhaps the most important step we take at AscentAI to ensure the data is fit for the AI application is to develop use cases specific to each client and then scope the data we think applies for client review and approval.”

As companies grapple with the practical implications of AI, the attitude toward it has grown more discerning and, yes, skeptical. Large language models trained on the wrong or flawed training data are prone to failure. The Institute of Electrical and Electronics Engineers (IEEE), after discussing the unreliability of LLMs like ChatGPT, wrote, “A common assumption is that scaling up the models driving these applications will improve their reliability—for instance, by increasing the amount of data they are trained on, or the number of parameters they use to process information. However, more recent and larger versions of these language models have actually become more unreliable, not less, according to a new study.”

There is growing recognition that LLMs often fail because of flawed datasets so massive that developers don’t fully understand them, and one solution is improving data quality and relevance through better data governance. “It’s a really hard problem to solve,” says Bauer, “but the success of AI applications depends on it. I think it’s a key determinant of whether you succeed or fail in leveraging the power of AI.”

AI needs timely and usable data sets for training and analysis. That means organizations must evolve their data governance policies to meet the needs of AI. Again, “garbage in/garbage out.” As the romance with LLMs trained on huge, more generalized data sets wanes and organizations rely more on industry-specific and internal data, ensuring that data is accurate, accessible, and up to date is critical. 

For Dataversity, a producer of education resources for IT professionals, Michelle Knight highlights the fact that today’s data governance programs “enforce roles, procedures and tools for some structured data throughout the company. Yet AI models learn from and use very large data sets, containing structured and unstructured data. All this data needs to be of good quality too, so that the AI model can respond accurately, completely, consistently, and relevantly.”

Knight goes on to place data quality fundamentals and clean-up as higher priorities than immediate AI implementation, likening AI to an iceberg:

“The CEO and senior management see only the tip, visible with all of AI’s promise and reward. However, beneath the surface lies the vast majority of the iceberg, e.g., all the data that no one has bothered to understand or its lineage.

In the meantime, without adequate data quality as governed by data governance, the C-suite steers the corporation into the iceberg, which is an expensive accident. Consequently, evaluating the existing data governance program and applying data quality best practices may give organizations the best chance for AI readiness.”

AscentAI sees a clear relationship between data and AI governance as the basis for transparent and trustworthy AI. We rigorously monitor the AI/ML and modeling components of our solutions, and employ a comprehensive approach that includes 10 layers of redundancy for data quality, incorporating both automated and human validations at every stage, to ensure our data is accurate. In addition, our sole data sources for AI are materials published by national, state, and local regulatory governing authorities. There is no flotsam in that data set. In compliance, as in every other AI application, clean, trusted, and accessible data are the keys to successful AI implementation.
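To make the idea of automated data-quality validation concrete, here is a minimal illustrative sketch, not AscentAI’s actual pipeline: a quality gate that checks each incoming regulatory document for an approved source, a non-empty body, and freshness before it is allowed into a training or retrieval corpus. The domain whitelist, field names, and freshness window are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical whitelist of regulatory authorities; a real system would
# maintain this list per jurisdiction and keep it under governance review.
APPROVED_SOURCES = {"sec.gov", "finra.org", "occ.treas.gov"}

@dataclass
class RegDocument:
    doc_id: str
    source_domain: str
    published: date
    text: str

def validate(doc: RegDocument, max_age_days: int = 365) -> list[str]:
    """Return a list of data-quality issues; an empty list means the doc passes."""
    issues = []
    if doc.source_domain not in APPROVED_SOURCES:
        issues.append("source not an approved regulatory authority")
    if not doc.text.strip():
        issues.append("empty document body")
    if date.today() - doc.published > timedelta(days=max_age_days):
        issues.append("document older than freshness window")
    return issues

# A document failing any automated check would be routed to human review
# rather than silently ingested -- pairing automated and human validation.
doc = RegDocument("r-001", "sec.gov", date.today(), "Rule 10b-5 guidance ...")
print(validate(doc))  # → []
```

Checks like these are one automated layer; in practice they would sit alongside human review and deeper semantic validation.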

Learn more on our website about how AscentAI’s purpose-built AI is revolutionizing the way customers monitor regulatory developments and fully automate regulatory change management powered by obligations.

Have a question about horizon scanning or regulatory change management? We’re here to help!