This article explains how CHAOS chooses its data sources and how we ensure data quality.
Overview
CHAOS offers customers state-of-the-art market and business insights and forecasts. We combine data from multiple sources and gather a range of datasets across categories, from crowd movement to points of interest to market prices.
We carefully select data providers and use only trusted sources. We take on the tedious tasks of searching for, sourcing and cleaning data to provide you with easily accessible, high-quality, up-to-date data and insights.
FAQ
- What are CHAOS data sources?
- How does CHAOS select data sources?
- Why does CHAOS use national statistics departments?
- How do you ensure data quality?
What are CHAOS data sources?
CHAOS utilises a wide variety of data obtained from multiple sources. These sources include:
- Municipalities and various governmental institutions (e.g. City of Helsinki, HSL, MML);
- State-funded entities (e.g. national statistical institutions like Tilastokeskus in Finland);
- Private companies (e.g. Google, RPT Byggfakta);
- Large collaborative projects relying on crowdsourced data (e.g. OpenStreetMap);
- Partnerships;
- CHAOS's own crowd engagement tools.
CHAOS uses data collected from its own engagement tools, which allows us to provide customers with rich citizen sentiment data, such as feedback, ideas and reported problems.
Read more on CHAOS Engagement tools.
How does CHAOS select data sources?
Before any data is used in CHAOS dashboards and insights, our team considers a few factors and performs at least one screening process.
CHAOS carefully selects data providers. Where multiple providers offer datasets that meet our specifications, we compare the sources according to the following criteria:
- Geographical coverage, spatial resolution.
- Temporal resolution and timeliness.
- Renewable data.
- Market share.
- Responsible source.
- Sources that are widely accepted in the industry.
When multiple alternatives are available, CHAOS selects the data providers that are most widely used throughout the industry.
Why does CHAOS use national statistics departments?
At CHAOS, we generally consider national statistics departments to be reliable sources, especially for demographic information. The main reason for this choice is the consistency of the data. In EU countries, national statistics departments (including Finland's Tilastokeskus) follow the European Statistics Code of Practice and the Quality Assurance Framework of the European Statistical System. A common framework for compiling statistical information keeps the data structure, format, descriptions and any necessary metadata consistent across countries.
How do you ensure data quality?
CHAOS defines clear data specifications before our team pushes a new product or feature into the development pipeline.
We screen for data completeness first: we check that the data meets our specifications and contains all the information needed to interpret the records in each field properly. Before becoming accessible to customers, every dataset goes through a common data check in which we look for missing values, outliers and other types of errors, and verify consistency.
Checking the accuracy of a single record or point of interest is not always possible. Data sources can take some time to update their registers, so an individual data point that has recently changed may not yet be updated. On our side, we cross-check the correctness of the information across different datasets whenever possible.
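The screening steps described above can be illustrated with a minimal sketch. It is not CHAOS's actual pipeline: the field names, the sample records and the outlier rule (a modified z-score based on the median absolute deviation, with the commonly used 3.5 cut-off) are all assumptions chosen for illustration.

```python
from statistics import median

# Hypothetical specification: fields that every record must contain.
REQUIRED_FIELDS = ("area_id", "date", "visitors")


def find_incomplete(records, required=REQUIRED_FIELDS):
    """Return indices of records with a missing or empty required field."""
    return [
        i for i, rec in enumerate(records)
        if any(rec.get(field) in (None, "") for field in required)
    ]


def find_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    The score uses the median absolute deviation (MAD), which stays
    stable even when the sample is small and contains extreme values.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread at all: nothing can be flagged reliably
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - med) / mad) > threshold
    ]


# Made-up sample data for illustration only.
records = [
    {"area_id": "A1", "date": "2021-03-01", "visitors": 120},
    {"area_id": "A1", "date": "2021-03-02", "visitors": 130},
    {"area_id": "A1", "date": "2021-03-03", "visitors": None},  # missing value
    {"area_id": "A1", "date": "2021-03-04", "visitors": 125},
    {"area_id": "A1", "date": "2021-03-05", "visitors": 5000},  # suspicious spike
    {"area_id": "A1", "date": "2021-03-06", "visitors": 118},
]

incomplete = find_incomplete(records)
complete = [rec for i, rec in enumerate(records) if i not in incomplete]
outlier_records = [
    complete[i] for i in find_outliers([rec["visitors"] for rec in complete])
]
```

Here the record for 2021-03-03 is reported as incomplete and the record for 2021-03-05 is flagged as an outlier. A flagged record is a candidate for review rather than an automatic deletion, since an unusual value can also be genuine.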
I found an error in your data. Why is your data not 100% accurate?
We at CHAOS believe that honesty and transparency are the foundation for long-term strategic relationships with our customers and, therefore, we do not promise or market CHAOS as a company with 100% accurate data. Our data scientist Valeri explains why 100% accuracy in data is almost impossible and not always needed:
Data is very rarely 100% accurate, especially crowdsourced or privacy-sensitive data. However, accuracy as it is commonly understood is a misleading metric, because errors in the data can be of different types and not all errors are harmful. People frequently overlook the fact that a specific use case or decision does not necessarily need highly accurate data. Let's consider an example.
Let's say a telecom provider with a 50% market share shows that on average 15K people pass through the central railway station daily. If the data has been collected over a longer period of time, it is safe to assume that the numbers we see are systematically around 50% below the real figure. Despite this fairly low accuracy, the data is completely valid for all use cases where we want to understand relative activity in different areas. Even if we need an absolute estimate of the number of people passing the location, knowing the provider's market share and the fact that the estimates deviate systematically from the actual value, we can still arrive at a good estimate of the real number of people.
Therefore, data accuracy is not simply the number of errors or how far the measurements are from the true population figure. Rather, what matters for data accuracy is the type of errors and whether the sample is large enough to draw statistically sound conclusions.
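The railway station example can be worked through in a few lines. This is only a sketch under one stated assumption: the provider's subscribers are representative of everyone passing the location, so the bias is a constant factor equal to the market share. The function name and the second location's count are hypothetical.

```python
def estimate_total(observed_count, market_share):
    """Scale a provider's observed count up to an estimate for everyone.

    Assumes the provider's subscribers are representative of all people
    passing the location, so the bias is a constant factor.
    """
    if not 0 < market_share <= 1:
        raise ValueError("market_share must be in the interval (0, 1]")
    return observed_count / market_share


# Figures from the example: 15K observed daily, 50% market share.
station_total = estimate_total(15_000, 0.5)

# Hypothetical second location with 5K observed daily, same provider.
other_total = estimate_total(5_000, 0.5)

# A constant bias cancels out when comparing locations.
observed_ratio = 15_000 / 5_000
scaled_ratio = station_total / other_total
```

The station estimate comes to 30,000 people per day, and the station is three times as busy as the second location whether the raw or the scaled counts are compared. This is why systematically biased data remains useful for relative comparisons.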
Our vision, company policy and process for selecting data providers ensure that CHAOS customers receive data from qualified, trusted sources.