Hello. I'm standing inside the atrium of
the Siebel Center, which houses
the Computer Science department here at the University of Illinois.
The open atrium, with its grand staircase and glass walls, makes
a strong visual impression, and the convenient coffee shop on the first floor
and ample meeting spaces encourage discussion and collaboration.
In some ways, this space makes the Siebel Center an outlier among buildings on campus,
especially those that are much older and
that were designed and constructed in a different era.
In this module, we introduce the concept of
anomaly detection, which generally means finding outliers.
While the common perception is that outliers are always bad,
that is not always the case.
An outlier is generally defined as an instance in a
data set that is outside the normal ranges spanned by the data.
If you were interested in identifying your best customers,
say those that buy the most from your stores,
or visit your shops the most often,
you likely would want to find the outliers on the high side.
On the other hand, you might wish to find those who visit your store often but spend very
little to see if you can engage more productively with them to drive new revenue.
Thus, outliers are often dichotomous in that they can be
bad, such as fraud, or good, such as your highest-revenue customers.
In either case, finding them can be both challenging and rewarding.
The first lesson in this module includes several readings on
fraud detection, including from the perspective of an accountant.
Given the importance of identifying fraud,
most of the work in anomaly detection has focused on finding even potential fraud.
The second lesson introduces statistical techniques for identifying outliers.
In many cases, this task simplifies to assuming a distribution for the data of
interest and using descriptive statistics to identify the data that are in the tails.
When dealing with time series data or streaming data,
these techniques can become more complicated,
but the basic concepts remain the same.
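
To make the statistical idea concrete, here is a minimal sketch that assumes the data are roughly normally distributed and flags values in the tails using a z-score. The function name, the threshold of three standard deviations, and the sample amounts are all illustrative assumptions, not something taken from the lesson itself.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return (index, value, z) for values more than `threshold`
    standard deviations away from the mean (assumed cutoff)."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []  # all values identical, so nothing lies in the tails
    flagged = []
    for i, v in enumerate(values):
        z = (v - mean) / stdev
        if abs(z) > threshold:
            flagged.append((i, v, z))
    return flagged

# Hypothetical transaction amounts; the last one is deliberately extreme.
amounts = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54,
           50, 49, 51, 48, 52, 50, 47, 53, 49, 250]
outliers = zscore_outliers(amounts)
for index, value, z in outliers:
    print(f"index {index}: value {value}, z-score {z:.2f}")
```

Note that a single extreme value also inflates the mean and standard deviation it is measured against, so on small samples a lower threshold or a more robust statistic, such as the median, may be a better choice.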
The final lesson introduces machine learning techniques for identifying outliers.
This can be done in several ways.
First, we can use clustering to find clusters of data and then identify data that
are near the edge of a cluster or that are outside of identified clusters in the data.
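
As one possible illustration of the clustering approach, the sketch below uses a small density-based clustering routine in the style of DBSCAN, written from scratch. The lesson does not name a specific algorithm, so this choice, along with the `eps` and `min_pts` values and the sample points, is an assumption made for illustration. Points that do not belong to any dense group are labeled -1 and treated as outliers.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point joins the cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels

# Hypothetical 2-D data: two tight groups and one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (0.8, 0.9),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1), (4.8, 4.8),
          (9.0, 1.0)]
labels = dbscan(points, eps=0.6, min_pts=3)
outlier_points = [p for p, label in zip(points, labels) if label == -1]
print(outlier_points)
```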
Second, we can use classification to
identify data that are somehow different from the rest of the data.
This is known as a one-versus-many classification approach.
Finally, we can use regression to model the data and then use
descriptive statistics on this new model to
identify the data that are considerably different from the rest of the data.
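
The regression approach can be sketched in the same spirit. The example below fits a straight line by ordinary least squares and then flags points whose residual, the gap between the observed and the predicted value, is unusually large relative to the spread of all residuals. The linear model, the threshold of 2.5, and the data are assumptions chosen for illustration only.

```python
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def residual_outliers(xs, ys, threshold=2.5):
    """Return indices whose residual is more than `threshold`
    standard deviations of the residuals away from zero."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []  # the line fits every point exactly
    return [i for i, r in enumerate(residuals) if abs(r) / spread > threshold]

# Hypothetical monthly figures that grow by about 10 per month;
# the value at index 6 is deliberately out of line with the trend.
months = list(range(1, 13))
sales = [10, 21, 29, 41, 50, 61, 95, 79, 91, 100, 109, 121]
flagged = residual_outliers(months, sales)
print(flagged)
```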
Fraud detection has become big business,
from credit card companies providing real-time alerts,
to banks working to prevent money laundering,
to stores trying to minimize theft.
New techniques are constantly being developed and deployed.
Increasingly, these techniques are also finding a home in
the audit process, and as data rates increase with real-time reporting,
the importance of algorithmic fraud detection will only increase. Good luck.