Microsoft 365 includes five pretrained classifiers which, as mentioned earlier, are created and maintained by Microsoft. They are as follows:
- Source code: Detects items written in any of the top 25 programming languages on GitHub, including the following:
– ActionScript
– C
– C#
– C++
– Clojure
– CoffeeScript
– Go
– Haskell
– Java
– JavaScript
– Lua
– MATLAB
– Objective-C
– Perl
– PHP
– Python
– R
– Ruby
– Scala
– Shell
– Swift
– TeX
– Vimscript
- Threat: Detects offensive language with threats of committing acts of violence, physical harm, or damage to property or a person.
- Harassment: Detects offensive language directed at one or several individuals based on traits such as ethnicity, national origin, gender, religion, age, sexual orientation, or disability.
- Profanity: Detects offensive language containing profane expressions that most people would find embarrassing.
- Resumes: Detects resume content in Microsoft 365, such as an applicant's personal or professional qualifications, work experience, or other personally identifiable information (PII).
These classifiers can be found in the Microsoft 365 compliance center, https://compliance.microsoft.com, under Data classification | Trainable classifiers, as illustrated in the following screenshot:
Figure 4.1 – Screenshot showing the location of Trainable classifiers in the Microsoft 365 compliance center
As you can see in the preceding screenshot, a sixth classifier called Offensive Language also appears. Microsoft has deprecated this classifier because it is prone to generating many false positives, so we will not cover it here. Instead, we will focus on the classifiers listed earlier, which cover offensive language as well.
Custom classifiers
Custom classifiers are not present in the Microsoft 365 compliance center by default; as stated earlier, a global administrator must consent to the use of classifiers before any custom ones can be created. The consent popup is shown in the following screenshot:
Figure 4.2 – The opt-in popup visualized in the Microsoft 365 compliance center
To create your own custom classifiers, click the Start scanning process button highlighted in Figure 4.2. This initiates a scan of the data in your Microsoft 365 tenant to build a better understanding of what data it contains. The scan takes 7 to 14 days, and it must finish before you can create a custom classifier.
Custom classifiers differ from the pretrained classifiers in the previous section in that they are untrained: you need to feed them data in order to train them. The process flow for creating a custom classifier is shown in the following diagram:
Figure 4.3 – The process flow of creating a custom classifier with each step described
This concludes this section of the chapter. Up next, we will delve deeper into the topic of trainable classifiers and try to distinguish when they are a good choice for classifying data.
Identifying when to use trainable classifiers
When do you use trainable classifiers? Well, to answer that, we will need to revisit the different options we have at our disposal to classify data in our environment:
- Manually: The responsibility of classifying the data falls on the end user or the administrator, using the sensitivity labels created in the previous chapter, Chapter 3, Creating and Managing Sensitive Information Types.
- Automated pattern matching: Finds content by using the following:
a. Metadata values or keywords
b. Sensitive information types to identify data
c. Document fingerprinting to recognize data
d. Exact data match by finding the presence of exact strings
- Trainable classifiers: The classification method of trainable classifiers is very well suited for content that is not easily identified by automated pattern matching or the manual approach.
As the list shows, trainable classifiers come into play when our data is not easily detected by the unified labeling client for automatic labeling, or when the manual approach cannot quite get us all the way there.
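To make the limits of pattern matching concrete, the following is a minimal Python sketch of the keyword and pattern approach. It is an illustration only, not how Microsoft 365 implements sensitive information types: the keyword list, the card-number pattern, and the function names are all assumptions made for this example.

```python
import re
from typing import List

# Hypothetical keyword list and pattern, chosen only for illustration.
KEYWORDS = {"confidential", "internal only"}
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(number: str) -> bool:
    """Return True if the digits in the string pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def pattern_match(text: str) -> List[str]:
    """Return the names of the illustrative rules that fire on the text."""
    hits = []
    lowered = text.lower()
    if any(keyword in lowered for keyword in KEYWORDS):
        hits.append("keyword")
    if any(luhn_valid(m.group()) for m in CARD_PATTERN.finditer(text)):
        hits.append("card_number")
    return hits


print(pattern_match("This memo is CONFIDENTIAL"))
print(pattern_match("Card: 4111 1111 1111 1111 on file"))
print(pattern_match("It would be a shame if something happened to your car."))
```

The first two calls fire a rule because the content has a fixed keyword or a well-defined structure. The third sentence reads as a veiled threat, yet no rule fires, because nothing about it can be captured by a keyword or a pattern. That is the kind of content trainable classifiers are meant for.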
A classifier learns how data is constructed by looking at hundreds or even thousands of examples of the content of interest.
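The idea of learning from examples can be sketched with a tiny naive Bayes text classifier in Python. This is an assumption-laden illustration of the general technique, not the model Microsoft 365 uses; the class name, the labels, and the six made-up training sentences are hypothetical, and a real classifier would need far more samples.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesSketch:
    """Tiny multinomial naive Bayes with add-one smoothing (illustrative)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocabulary = set()

    def train(self, examples):
        """Count words per label from (text, label) pairs."""
        for text, label in examples:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def predict(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.doc_counts.items():
            score = math.log(doc_count / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                score += math.log(
                    (count + 1) / (total_words + len(self.vocabulary))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up seed data; real training sets are far larger.
examples = [
    ("Work experience: five years as a software engineer at Contoso", "resume"),
    ("Education: bachelor degree in computer science, skills include Python", "resume"),
    ("Professional qualifications and references available on request", "resume"),
    ("Meeting moved to Thursday, please update the agenda", "other"),
    ("Quarterly sales numbers are attached for review", "other"),
    ("Lunch order for the team offsite on Friday", "other"),
]

classifier = NaiveBayesSketch()
classifier.train(examples)
print(classifier.predict("Skills and work experience as an engineer"))
print(classifier.predict("Please update the sales agenda for Thursday"))
```

Nothing in this sketch spells out what a resume looks like. The classifier infers it from word frequencies in the labeled examples, which is why both the quantity and the quality of the samples matter.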
The following table visualizes when there is a need for using trainable classifiers:
Figure 4.4 – This table shows us the use cases for trainable classifiers rather than manual or automatic (pattern match) classification
So, identifying when to use trainable classifiers boils down to the description of the classifiers themselves: they are very well suited for content that is not easily identified by automated pattern matching or the manual approach.
If it is hard to identify data as sensitive, profane, harassing, or threatening, there is no other option but to use trainable classifiers to learn what such content looks like and label it accordingly.
Now that we have covered the introduction to trainable classifiers and when to use them, it is time to start trying this out ourselves and create some classifiers in our tenant.