Where does CHAOS get its data from and how does it ensure data quality?

This article explains how CHAOS chooses its data sources and how it ensures data quality.

Overview

CHAOS offers customers state-of-the-art market and business insights and forecasts. We combine data from multiple sources and gather datasets across a range of categories, from crowd movement to points of interest to market prices.

We carefully select data providers and use only trusted sources. We take on the tedious tasks of searching for, sourcing and cleaning data to provide you with easily accessible, high-quality, up-to-date data and insights.

FAQ

  1. What are CHAOS data sources?
  2. How does CHAOS select data sources?
  3. Why does CHAOS use national statistics departments?
  4. How do you ensure data quality?
  5. Why is data never 100% accurate?

What are CHAOS data sources?

CHAOS utilises a wide variety of data obtained from multiple sources. These sources include:

  • Municipalities and various governmental institutions (e.g. City of Helsinki, HSL, MML);
  • State-funded entities (e.g. national statistical institutions like Tilastokeskus in Finland); 
  • Private companies (e.g. Google, RPT Byggfakta);
  • Large collaborative projects relying on crowdsourced data (e.g. OpenStreetMap);
  • Partnerships.
CHAOS is a reliable partner with a strong portfolio. We are open to partnerships, whether that means providing data in exchange for our service or collaborating on new technology. Please feel free to contact us if you are interested in becoming our partner.
  • CHAOS's own crowd engagement tools.
CHAOS uses data collected through its own engagement tools, which lets us provide customers with rich citizen sentiment data such as feedback, ideas and reported problems.

Read more on CHAOS Engagement tools.

How does CHAOS select data sources?

Before any data is used in CHAOS dashboards and insights, our team considers a few factors and performs at least one screening process.

CHAOS carefully selects data providers, and in cases where multiple providers offer data sets that are suitable for our specifications, we compare sources according to the following criteria:

  • Geographical coverage and spatial resolution.
CHAOS selects sources that provide data from multiple countries and at the level of smaller area units. For example, data aggregated over an entire municipality is too coarse for us and does not meet our need to provide more detailed information.
  • Temporal resolution and timeliness.
CHAOS prioritises sources that are updated more frequently. We also consider how up to date the information is when selecting our sources.
  • Renewable data.
We avoid sourcing from publicly posted 'one-time' datasets that are not renewed consistently (e.g. datasets published for a Kaggle competition).
  • Market share.
When obtaining data from a provider, CHAOS considers the provider's market share (e.g. a telecom company with 20% of the population as its subscribers would be preferred over one with 3% population coverage).
  • Responsible source.
CHAOS takes matters of data privacy extremely seriously. We make sure that data collection complies with privacy laws, including the GDPR.
  • Sources that are widely accepted in the industry.
When multiple alternatives are available, CHAOS selects data providers that are widely used throughout the industry.

Why does CHAOS use national statistics departments?

At CHAOS, we generally consider national statistics departments to be reliable sources, especially for demographic information. The main reason for this choice is the consistency of the data. In EU countries, national statistics departments (including Finland's Tilastokeskus) follow the European Statistics Code of Practice and the Quality Assurance Framework of the European Statistical System. A common framework for compiling statistical information brings consistent data structures and formats, as well as consistent descriptions and other necessary metadata, across countries.

How do you ensure data quality?

CHAOS has well-defined data specifications in place before our team pushes a new product or feature into the development pipeline.

We screen for data completeness first, meaning that we check that the data meets our specifications and contains all the information needed to properly interpret the records in each field. Before they become accessible to customers, all datasets go through a common data check process: we look for missing values, outliers and other types of errors, and we check for consistency.
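The checks described above can be illustrated with a minimal Python sketch. It is not CHAOS's actual pipeline: the function names, the field names (`area_id`, `date`, `visitors`) and the choice of the interquartile range (IQR) rule for outliers are all assumptions made for illustration.

```python
from statistics import quantiles

def find_missing(records, required_fields):
    """Return indexes of records where a required field is absent, None or empty."""
    return [
        i for i, rec in enumerate(records)
        if any(rec.get(f) in (None, "") for f in required_fields)
    ]

def find_outliers(records, field, k=1.5):
    """Return indexes of records whose value lies outside the IQR fences (assumed rule)."""
    values = [rec[field] for rec in records if isinstance(rec.get(field), (int, float))]
    if len(values) < 4:
        return []  # too few points to estimate quartiles sensibly
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [
        i for i, rec in enumerate(records)
        if isinstance(rec.get(field), (int, float)) and not low <= rec[field] <= high
    ]

def find_duplicates(records, key_fields):
    """Return indexes of records that repeat an earlier record's key (a consistency check)."""
    seen, dupes = set(), []
    for i, rec in enumerate(records):
        key = tuple(rec.get(f) for f in key_fields)
        if key in seen:
            dupes.append(i)
        seen.add(key)
    return dupes

# Hypothetical sample data, invented for this example.
records = [
    {"area_id": "A1", "date": "2021-03-01", "visitors": 120},
    {"area_id": "A2", "date": "2021-03-01", "visitors": 135},
    {"area_id": "A3", "date": "2021-03-01", "visitors": None},  # missing value
    {"area_id": "A4", "date": "2021-03-01", "visitors": 128},
    {"area_id": "A5", "date": "2021-03-01", "visitors": 9000},  # suspicious spike
    {"area_id": "A1", "date": "2021-03-01", "visitors": 120},   # repeated key
    {"area_id": "A6", "date": "2021-03-01", "visitors": 131},
    {"area_id": "A7", "date": "2021-03-01", "visitors": 125},
    {"area_id": "A8", "date": "2021-03-01", "visitors": 140},
]

missing = find_missing(records, ["area_id", "date", "visitors"])
outliers = find_outliers(records, "visitors")
duplicates = find_duplicates(records, ["area_id", "date"])
```

Flagged records are only candidates for review. A spike can be a genuine event rather than an error, so a sketch like this would typically feed a manual check rather than delete anything automatically.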

Checking the accuracy of a single record or point of interest is not always possible. In some cases, it might take some time for the data sources to update their registers, so individual data points that have recently changed may not yet be updated. On our side, we cross-check the correctness of the information across different datasets whenever possible.
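As a rough illustration of cross-checking between datasets, the Python sketch below compares values for the same key in two sources and reports those that disagree by more than a relative tolerance. The function name, the 10% tolerance and the sample figures are assumptions for this example, not CHAOS's actual procedure.

```python
def cross_check(primary, secondary, tolerance=0.1):
    """Return keys found in both datasets whose values differ by more than
    `tolerance`, measured relative to the larger of the two values."""
    mismatches = {}
    for key in primary.keys() & secondary.keys():
        a, b = primary[key], secondary[key]
        reference = max(abs(a), abs(b))
        if reference and abs(a - b) / reference > tolerance:
            mismatches[key] = (a, b)
    return mismatches

# Hypothetical population counts per area from two independent sources.
source_a = {"A1": 1000, "A2": 520, "A3": 75}
source_b = {"A1": 1040, "A2": 300, "A4": 10}

mismatches = cross_check(source_a, source_b)
```

Keys that appear in only one source (here `A3` and `A4`) are skipped, because there is nothing to compare them against. A mismatch does not say which source is wrong; it only marks a record worth a closer look.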

I found an error in your data. Why is your data not 100% accurate?

We at CHAOS believe that honesty and transparency are the foundation of long-term strategic relationships with our customers and, therefore, we do not promise or market CHAOS as a company with 100% accurate data. Our data scientist Valeri explains why 100% accuracy in data is almost impossible and not always needed:

Data is very rarely 100% accurate, especially crowdsourced or privacy-sensitive data. However, accuracy as it is commonly understood is a misleading metric, because errors in the data can be of different types and not all errors are harmful. People frequently overlook the fact that a specific use case or decision does not necessarily need very high-accuracy data. Let's consider an example.

Let's say a telecom provider with a 50% market share shows that, on average, 15K people pass through the central railway station daily. If the data has been collected over a longer period of time, it is safe to assume that the numbers we see are systematically around 50% below the real figure. Despite this fairly low accuracy, the data is completely valid for all use cases where we want to understand relative activity in different areas. Even if we need an absolute estimate of the number of people passing the location, knowing the provider's market share and the fact that the estimates deviate systematically from the actual value, we can still arrive at a good estimate of the real number of people.
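The arithmetic behind this example can be sketched in a few lines of Python. The sketch assumes that the provider's subscribers are representative of the whole population and that its market share is the same in every area; the function names and the second area (`city_park`) are invented for illustration.

```python
def estimate_total(observed_count, market_share):
    """Scale a provider's observed count up to a whole-population estimate."""
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in (0, 1]")
    return observed_count / market_share

def relative_activity(counts):
    """Share of total activity per area; a uniform market share cancels out."""
    total = sum(counts.values())
    return {area: n / total for area, n in counts.items()}

market_share = 0.5
observed = {"central_station": 15_000, "city_park": 5_000}  # hypothetical daily counts

# Absolute estimate: 15,000 observed at a 50% share suggests about 30,000 people.
station_estimate = estimate_total(observed["central_station"], market_share)

# Relative activity is the same before and after scaling by market share.
raw_shares = relative_activity(observed)
scaled_shares = relative_activity(
    {area: estimate_total(n, market_share) for area, n in observed.items()}
)
```

Because every area is divided by the same market share, the ratios between areas do not change. This is why a systematic error of this kind leaves comparisons of relative activity intact.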

Therefore, data accuracy is not simply the number of errors or how far the measurements are from the true population figure. Rather, what matters in terms of data accuracy is the types of errors and whether the sample is large enough to draw statistically sound conclusions.

Our vision, company policy and process for selecting data providers ensure that CHAOS customers receive data from qualified, trusted sources.