Building a trustworthy data foundation via machine learning
Challenge to overcome
For most portfolio managers and asset owners, data reliability and quality remain a central challenge in sustainable investing. Human error in capturing reported data, combined with corporate disclosures that are often incomplete, inconsistent, or difficult to compare across peers and jurisdictions, undermines confidence in the data. In addition, differing reporting boundaries and definitions further complicate consistency.
These issues reinforce the well-known “garbage in, garbage out” problem: unreliable inputs lead to unreliable outputs, limiting investors’ ability to make sound decisions, manage risks effectively, or meet regulatory and client expectations.
AI application / use case
Clarity AI designed an AI-native data reliability system to address these core issues. Reliability checks are embedded across the entire data lifecycle, starting at ingestion, rather than treating data quality as a control applied at the end of the process.
This design choice is critical: errors in sustainability data tend to propagate downstream and become harder and more costly to detect and correct. By embedding AI early in the collection process, the risk of error is reduced before flawed data can cascade into models, metrics, and user outputs. This enables rapid feedback loops that identify inconsistencies, anomalies, or gaps far more efficiently than traditional human-intensive workflows or general-purpose AI models.
At ingestion, AI selectively automates the collection of high-reliability data, such as qualitative disclosures (e.g. policies). For text-based metrics, Large Language Models (LLMs) achieve high accuracy, enabling the process to be largely automated with minimal human oversight. Lower-reliability data (more complex metrics, primarily quantitative data) are routed into targeted human-in-the-loop workflows, where natural language processing (NLP) tools guide subject matter experts to relevant sections of reports and flag anomalies. This allows analysts to focus on evaluating context, credibility, and relevance, rather than searching through entire documents.
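As a schematic illustration of this ingestion-time routing (the metric names, confidence field, and threshold below are hypothetical, not Clarity AI's actual rules):

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    kind: str              # "qualitative" (text-based) or "quantitative"
    llm_confidence: float  # hypothetical extraction-confidence signal

def route(metric: Metric, threshold: float = 0.95) -> str:
    """Send high-reliability text metrics to automated LLM extraction;
    route complex quantitative metrics to human-in-the-loop review."""
    if metric.kind == "qualitative" and metric.llm_confidence >= threshold:
        return "automated"          # LLM extraction, spot-checked only
    return "human_in_the_loop"      # NLP pre-highlights relevant passages

print(route(Metric("anti_corruption_policy", "qualitative", 0.98)))  # automated
print(route(Metric("scope_3_emissions", "quantitative", 0.90)))      # human_in_the_loop
```

The point of the split is economic as much as technical: automated extraction scales with document volume, while expert time is reserved for the metrics where judgment matters.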
Reliability is further reinforced through layered verification. Machine learning models assess the likelihood that a data point is accurate using a broad set of signals, including historical trends, industry benchmarks, expected ranges, units, the nature of the source document, and the collector's expertise and past performance. Additional human review is triggered when the estimated probability of inaccuracy is high.
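A minimal sketch of this kind of layered check, assuming a simple weighted combination of signals (the real system uses trained decision-tree models; the signal names, weights, and threshold here are illustrative only):

```python
def review_needed(signals: dict[str, float], threshold: float = 0.8) -> bool:
    """Combine reliability signals into one score and flag low-scoring
    data points for additional human review.
    Each signal is normalized to [0, 1], where 1 = fully consistent."""
    weights = {                          # illustrative, not production values
        "trend_consistency": 0.30,       # fits the company's historical trend
        "peer_benchmark": 0.25,          # plausible vs. industry peers
        "unit_check": 0.25,              # units and magnitudes as expected
        "collector_track_record": 0.20,  # past accuracy of the collector
    }
    score = sum(weights[k] * signals.get(k, 0.0) for k in weights)
    return score < threshold             # True => route to human review

print(review_needed({"trend_consistency": 1.0, "peer_benchmark": 0.9,
                     "unit_check": 1.0, "collector_track_record": 0.8}))  # False
```

A trained model replaces the fixed weights with learned splits, but the contract is the same: many weak signals in, one review decision out.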
Company self-reported data is not treated as correct by default. For example, values such as Scope 3 emissions are systematically challenged against AI-built expectations and contextual signals. Where deviations indicate missing inputs or non-representative disclosures, further review is initiated. If the reported information cannot support a reliable value, e.g. due to insufficient disclosure, we generate AI-driven estimates or rely on inherited data from higher levels of the corporate structure. In all cases, estimates are clearly marked as such, with transparency about the underlying data and methodology.
Figure 1: Clarity AI’s Data Collection & Curation Workflows
Once an initial dataset is established, the system manages changes over time. Companies may restate data due to methodological updates, boundary changes, or revised assumptions. Rather than assuming that previously reported historical values remain unchanged, detection models identify when and why data reported for prior years differs from values disclosed in earlier reporting cycles, preserving this context. This enables us to explain not only what has changed, but why: a critical requirement for maintaining confidence and interpretability.
Data is then curated or aggregated according to the user's needs. For example:
- Some users may choose to include estimates, while others rely solely on reported data.
- Some may view reported values even if they fall outside methodological criteria, while others restrict outputs to data that meet predefined standards.
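Curation of this kind reduces to filtering on per-record provenance flags. A minimal sketch, with illustrative field names:

```python
def curate(records: list[dict], include_estimates: bool,
           require_standard: bool) -> list[dict]:
    """Filter a dataset according to user preferences: whether to keep
    AI-driven estimates, and whether to drop values that fall outside
    predefined methodological criteria."""
    out = []
    for r in records:
        if r["source"] == "estimated" and not include_estimates:
            continue                     # user wants reported data only
        if require_standard and not r["meets_standard"]:
            continue                     # user wants standard-compliant data only
        out.append(r)
    return out

data = [
    {"value": 100, "source": "reported",  "meets_standard": True},
    {"value": 230, "source": "estimated", "meets_standard": True},
    {"value": 95,  "source": "reported",  "meets_standard": False},
]
print(len(curate(data, include_estimates=False, require_standard=True)))  # 1
```

Because every record carries its provenance, the same underlying dataset can serve users with different methodological preferences without re-collection.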
Throughout the process, transparency and traceability are foundational. Every data point is traceable to its source, with a record of how it was collected, verified, curated, and, where applicable, estimated.
Together, this approach delivers:
- Early-stage error reduction through AI embedded at ingestion
- Scalability without linear increases in manual effort
- Targeted use of human expertise where judgment adds the most value
- Robustness through layered AI and human validation
- Trust through full transparency and explainability
The result is a scalable, explainable data foundation that offers reliability levels exceeding 99% and supports confident, informed investment decision-making.
Use case key beneficiaries
☒ Relationship Managers
☒ Portfolio Managers
☒ Research teams, macroeconomists
☒ Control functions
☐ Support functions (HR, CFO, …)
☒ Others: Chief Data Officers
Benefits of AI use case for financial services sector
By systematically identifying, assessing, and filtering unreliable inputs across the data lifecycle, this AI-powered approach enables financial institutions to rely on more accurate, consistent, and transparent sustainability data. This reduces the risk of errors across investment decision-making, portfolio construction, risk monitoring, and regulatory reporting, while strengthening confidence in underlying analyses.
Beyond sustainability use cases, curated datasets produced through this process also provide a robust foundation for other AI models and analytical applications. High-quality, well-governed data is a prerequisite for achieving reliable and usable results, and the combination of AI-driven controls with expert oversight helps ensure that this prerequisite is met.
Supporting technology
To deliver this level of performance, we leverage a specialized technology stack designed to address the complexity and variability of sustainability data.
Document ingestion and language processing
We use workflows combining LLM-powered agents, Optical Character Recognition (OCR), and automated translation to ingest unstructured corporate disclosures across multiple formats and languages. This process is driven by our proprietary platforms: Igloo, which manages the data ingestion lifecycle, and Forge, which enables the development, validation, and deployment of agentic LLM data extractors at scale. Together, these technologies transform fragmented, non-standardized reports into high-quality, machine-readable data.
Generative AI and Large Language Models (LLMs)
We apply generative AI, including LLMs, to support tasks such as document understanding, information extraction, and contextual interpretation of dense corporate disclosures. The system is deliberately model-agnostic: different models exhibit different strengths across tasks, and performance evolves rapidly as new models emerge. This architecture allows us to continuously evaluate and deploy the models that are most effective for each specific use case, rather than relying on a single fixed solution.
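One common way to achieve this kind of model-agnosticism is a task-keyed registry of interchangeable extractor callables, so the best-performing model per task can be swapped in without changing downstream code. The sketch below is a generic pattern, not Forge's actual API; the task and function names are hypothetical:

```python
from typing import Callable

EXTRACTORS: dict[str, Callable[[str], str]] = {}

def register(task: str):
    """Decorator that binds an extractor function to a task name."""
    def wrap(fn: Callable[[str], str]):
        EXTRACTORS[task] = fn
        return fn
    return wrap

@register("policy_extraction")
def extract_policy(document: str) -> str:
    # Stand-in for an LLM call; a real extractor would invoke a model here.
    return ("anti-corruption policy found"
            if "anti-corruption" in document else "not found")

print(EXTRACTORS["policy_extraction"]("...anti-corruption policy, p. 12..."))
```

Swapping models then means re-registering a task's entry after evaluation, leaving every caller of `EXTRACTORS[task]` untouched.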
Machine learning (ML) for reliability and quality control
We use supervised machine learning models based on decision-tree architectures to assess data reliability, detect inconsistencies, and identify anomalous patterns that would be difficult to uncover through manual review alone. These models incorporate multiple parameters and signals drawn from the data collection and verification process. Model quality is ensured through robust training, validation, and ongoing performance monitoring procedures. These same models also support the generation and validation of ML-driven estimates when reported data is unavailable or does not meet defined reliability criteria.
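To make the decision-tree idea concrete, here is a toy, hand-rolled tree for anomaly flagging; production systems train tree ensembles over many more signals, and these split points are invented for illustration:

```python
def flag_anomaly(pct_change_yoy: float, outside_peer_range: bool,
                 unit_mismatch: bool) -> bool:
    """Return True when a data point should be flagged as anomalous.
    Each `if` corresponds to one split node of a small decision tree."""
    if unit_mismatch:                  # node 1: wrong units => always flag
        return True
    if outside_peer_range:             # node 2: outside peer benchmarks
        return pct_change_yoy > 0.25   # flag only if the YoY change is large too
    return pct_change_yoy > 1.0        # node 3: extreme year-on-year swing

print(flag_anomaly(0.30, True, False))   # True
print(flag_anomaly(0.05, False, False))  # False
```

A learned model discovers such splits from labelled verification outcomes instead of hand-coding them, which is what lets the checks scale across thousands of metrics.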