Connecting the dots with data science

by Ariel Henemann (Ph.D), Intent IQ’s head of Data Science

Identity data on the internet is like a gold mine. The gold is there but you have to work hard to find it, clean it, and purify it in order to make it valuable.

Our job in the data science team at Intent IQ is to make sense of the high-throughput data that arrives as input every second.

The process starts with organizing the data. It comes from multiple sources, each with its own schema and conventions. Our pipeline unifies all of these sources into a single schema.
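As a rough illustration of the unification step, the sketch below maps records from two sources onto one schema. The source names, field names, and the unify helper are hypothetical and chosen only for the example; they are not Intent IQ's actual schema or pipeline.

```python
from typing import Any, Dict

# Hypothetical per-source mappings: raw field name -> unified field name.
FIELD_MAPS: Dict[str, Dict[str, str]] = {
    "source_a": {"uid": "user_id", "ts": "timestamp", "ip_addr": "ip"},
    "source_b": {"userId": "user_id", "eventTime": "timestamp", "ip": "ip"},
}


def unify(record: Dict[str, Any], source: str) -> Dict[str, Any]:
    """Rename a raw record's fields to the unified schema.

    Fields missing from the raw record come out as None, and fields
    that the mapping does not know about are dropped.
    """
    mapping = FIELD_MAPS[source]
    return {unified: record.get(raw) for raw, unified in mapping.items()}


if __name__ == "__main__":
    print(unify({"uid": "u1", "ts": 10, "ip_addr": "1.2.3.4"}, "source_a"))
    print(unify({"userId": "u2", "eventTime": 11, "ip": "5.6.7.8"}, "source_b"))
```

Keeping the mappings in a table rather than in code makes it cheap to add a new source: only a new entry is needed.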

The raw data is then cleaned of extreme outliers and other artifacts. Now it’s ready for the fun part.
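One common way to drop extreme outliers is an interquartile-range (IQR) rule. The sketch below is a generic example of that rule, not necessarily the cleaning method used here; the function name and the multiplier k = 3.0 are assumptions made for illustration.

```python
import statistics
from typing import List


def remove_extreme_outliers(values: List[float], k: float = 3.0) -> List[float]:
    """Keep values within [Q1 - k*IQR, Q3 + k*IQR].

    k = 3.0 is a conventional threshold for "extreme" outliers;
    the right value depends on the data.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


if __name__ == "__main__":
    # Hypothetical per-device event counts with one implausible value.
    counts = [10, 12, 11, 13, 12, 11, 10, 1000]
    print(remove_extreme_outliers(counts))
```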

From deterministic to probabilistic with supervised learning

The available identity data usually contains a deterministic subset: the facts that we know for certain. While this part of the data is highly accurate, the unknown part must also be put to use in order to deliver value with greater impact. This is where supervised learning comes in. Supervised learning is a set of models and algorithms that learn patterns from a subset of known data and generalize this knowledge to the entire data set. Applied correctly, it lets us obtain a very large set of probabilistic data with an accuracy very close to that of our deterministic data.
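To make the idea concrete, here is a minimal sketch of learning from labeled (deterministic) records and assigning probabilistic labels to unlabeled ones. It uses a plain k-nearest-neighbors vote as a stand-in technique; the feature vectors, labels, and the knn_predict helper are invented for the example and do not describe the models actually used.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Labeled = Tuple[Sequence[float], str]


def knn_predict(labeled: List[Labeled], point: Sequence[float], k: int = 3) -> Tuple[str, float]:
    """Return the majority label among the k nearest labeled records,
    plus the share of votes it received as a crude confidence score."""
    nearest = sorted(labeled, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / len(nearest)


if __name__ == "__main__":
    # Hypothetical deterministic records: (features, known label).
    deterministic = [
        ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"),
        ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
    ]
    # Records with no known label receive a probabilistic one.
    for unknown in [(0.2, 0.3), (5.5, 5.5)]:
        print(unknown, knn_predict(deterministic, unknown))
```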

Supervised learning is a very strong tool when the data is partially known. However, to harness its full strength, one must tailor the approach to the specific problem. An out-of-the-box solution can only get you so far. These challenges require attention to many aspects of the data, such as the distributions of its various segments and the constraints of the specific domain. One of our main advantages is our deep familiarity with our domain, a strength that can only be acquired with time and experience.

An important factor in obtaining probabilistic data with very high accuracy is the ratio of deterministic data to probabilistic data. For this reason we invest much of our resources in maintaining a very large set of deterministic data. Our deterministic data comes from hundreds of millions of events every day, while our probabilistic data comes from billions. Because accurate data matters so much, we maintain this ratio as our data grows.

Train/Test validation - measuring the accuracy of our probabilistic data

We want our accuracy to be as high as possible. Validating data accuracy is important for continuous improvement as well as for our customers to know what they are getting.

After verifying that the deterministic part of the data is distributed similarly to the data as a whole, we can use the train/test method to validate our learning. In this method we learn from only a subset of the deterministic data (the training set) and hold out the rest for validation (the test set). After applying what we learned to the held-out portion, we compare our predictions with the known values and evaluate our accuracy.
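The train/test procedure described above can be sketched as follows. The 25% hold-out fraction, the toy records, and the nearest-neighbor predictor are assumptions made for the example; any model could take the predictor's place.

```python
import math
import random
from typing import List, Sequence, Tuple

Labeled = Tuple[Sequence[float], str]


def train_test_split(records: List[Labeled], test_fraction: float = 0.25,
                     seed: int = 0) -> Tuple[List[Labeled], List[Labeled]]:
    """Shuffle the deterministic records and hold out a fraction for testing."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def nearest_label(train: List[Labeled], point: Sequence[float]) -> str:
    """Stand-in model: predict the label of the closest training record."""
    return min(train, key=lambda item: math.dist(item[0], point))[1]


def holdout_accuracy(records: List[Labeled], test_fraction: float = 0.25,
                     seed: int = 0) -> float:
    """Learn from the training part only, then score against the held-out labels."""
    train, test = train_test_split(records, test_fraction, seed)
    correct = sum(nearest_label(train, x) == y for x, y in test)
    return correct / len(test)


if __name__ == "__main__":
    # Hypothetical deterministic records forming two well-separated groups.
    records = [((i * 0.1, 0.0), "A") for i in range(10)]
    records += [((10 + i * 0.1, 0.0), "B") for i in range(10)]
    print("hold-out accuracy:", holdout_accuracy(records))
```

The held-out records never influence the model, so the accuracy measured on them estimates how well the learning generalizes to data without known answers, provided the deterministic data is representative of the whole.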

An extra mile with unsupervised learning - organizing the unknown

In some domains, we can add value for our customers without having a known subset of the data. Unsupervised learning discovers similarities and differences in the data without knowing the “correct answers” for any part of it. It gives us the ability to cluster and organize objects based on these similarities and differences. Applying these techniques helps our customers make the right choices when deciding how to allocate their resources.
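As an illustration of clustering without labels, the sketch below implements a basic k-means loop. K-means is one standard clustering algorithm, used here only as an example; the points, the choice of k = 2, and the fixed iteration count are assumptions.

```python
import math
import random
from typing import List, Tuple

Point = Tuple[float, ...]


def kmeans(points: List[Point], k: int, iterations: int = 20,
           seed: int = 0) -> Tuple[List[int], List[Point]]:
    """Group points into k clusters by alternating two steps:
    assign each point to its nearest centroid, then move each
    centroid to the mean of the points assigned to it."""
    centroids = random.Random(seed).sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical unlabeled objects described by two numeric features.
    points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0)]
    labels, centroids = kmeans(points, k=2)
    print(labels)
    print(centroids)
```

No known answers are used anywhere: the grouping comes entirely from the distances between the objects.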

Learning in an ever-changing environment

The field of identity is constantly changing, with the development of new technologies on one hand and the introduction of new regulations on the other. As a team that specializes in learning (supervised or not), we believe that in such a dynamic field we must stay agile and learn new tricks every day, while always relying on our profound knowledge of the identity world.

Ariel Henneman is the head of Intent IQ’s Data Science department and holds a Ph.D. in computer science.