filters.Rmd
library(bupaR)
## Loading required package: edeaR
## Loading required package: eventdataR
## Loading required package: processmapR
## Loading required package: xesreadR
##
## Attaching package: 'bupaR'
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:utils':
##
## timestamp
library(edeaR)
library(eventdataR)
The filters for subsetting event data can mostly be divided into two types: event filters and case filters. Event filters subset parts of cases based on criteria applied to the events (e.g. the resource which performed them), while case filters subset complete cases based on criteria applied to the cases (e.g. the trace length).
Each filter has a reverse argument, which makes it easy to invert the selection. Furthermore, each filter has an interface alternative, which can be called by adding an i before the function name (e.g. ifilter_activity).
The filter_activity function can be used to filter activities by name. It has three arguments: the event log, a vector of activity names, and the reverse argument.
patients %>%
filter_activity(c("X-Ray", "Blood test")) %>%
summary
## Number of events: 996
## Number of cases: 498
## Number of traces: 2
## Number of distinct activities: 2
## Average trace length: 2
##
## Start eventlog: 2017-01-05 08:59:04
## End eventlog: 2018-05-05 01:34:30
## handling patient employee
## Blood test :474 Length:996 r1: 0
## Check-out : 0 Class :character r2: 0
## Discuss Results : 0 Mode :character r3:474
## MRI SCAN : 0 r4: 0
## Registration : 0 r5:522
## Triage and Assessment: 0 r6: 0
## X-Ray :522 r7: 0
## handling_id registration_type time
## Length:996 complete:498 Min. :2017-01-05 08:59:04
## Class :character start :498 1st Qu.:2017-05-06 12:31:43
## Mode :character Median :2017-09-08 00:10:11
## Mean :2017-09-03 07:11:55
## 3rd Qu.:2017-12-23 02:06:20
## Max. :2018-05-05 01:34:30
##
## .order
## Min. : 1.0
## 1st Qu.:249.8
## Median :498.5
## Mean :498.5
## 3rd Qu.:747.2
## Max. :996.0
##
As one can see, there are only 2 distinct activities left in the event log.
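As a small illustration of the reverse argument described earlier, the same call can be inverted so that the two listed activities are dropped rather than kept. This is only a sketch on the patients log; the variable name without_tests is chosen here for illustration.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Drop X-Ray and Blood test events instead of keeping them
without_tests <- patients %>%
  filter_activity(c("X-Ray", "Blood test"), reverse = TRUE)

# List the activities that remain in the filtered log
without_tests %>%
  activities()
```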
It is also possible to filter on activity frequency. This filter uses a percentile cut-off and selects the most frequent activities until the required percentage of events has been reached. Thus, a percentile cut-off of 80% selects the activities needed to represent 80% of the events. In the example below, the reverse argument is set to true, so the selection is inverted: the most frequent activities that together cover 50% of the events are dropped, and the remaining, less frequent activities are kept. As the warning in the output indicates, the percentile_cut_off argument is deprecated in favour of percentage.
patients %>%
filter_activity_frequency(percentile_cut_off = 0.5, reverse = TRUE) %>%
activity_frequency("activity")
## Warning in deprecated_perc(percentage, ...): Argument percentile_cut_off is
## deprecated. Use percentage instead.
## # A tibble: 4 x 3
## handling absolute relative
## <fct> <int> <dbl>
## 1 Blood test 237 0.193
## 2 Check-out 492 0.401
## 3 MRI SCAN 236 0.192
## 4 X-Ray 261 0.213
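The deprecation warning above suggests using percentage instead of percentile_cut_off. A minimal sketch of the same selection with the newer argument name is given here, assuming an edeaR version that already accepts percentage (as the warning implies); the variable name rare_activities is illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Same filter as above, written with the non-deprecated argument name
rare_activities <- patients %>%
  filter_activity_frequency(percentage = 0.5, reverse = TRUE)

# Inspect which activities are left after the reversed filter
rare_activities %>%
  activities()
```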
The filter_attributes function is a very generic function and can be supplied with conditions on the data set, in the same way as the dplyr::filter function. As such, it allows you to filter on event or case attributes. Multiple conditions can be listed, separated by a comma, in which case the comma is treated as “and”. You can use the |-symbol to state “or”. Since the patients dataset does not have many additional attributes, the example below uses the resource and the activity. With both conditions required, this filter would be the same as the combination of filter_activity and filter_resource. However, it has the advantage that the two conditions can also be combined with “or”.
patients %>%
filter_attributes(employee == "r1" | handling == "X-Ray")
## Event log consisting of:
## 1522 events
## 2 traces
## 500 cases
## 2 activities
## 761 activity instances
##
## # A tibble: 1,522 x 7
## handling patient employee handling_id registration_type
## <fct> <chr> <fct> <chr> <fct>
## 1 Registration 1 r1 1 start
## 2 Registration 2 r1 2 start
## 3 Registration 3 r1 3 start
## 4 Registration 4 r1 4 start
## 5 Registration 5 r1 5 start
## 6 Registration 6 r1 6 start
## 7 Registration 7 r1 7 start
## 8 Registration 8 r1 8 start
## 9 Registration 9 r1 9 start
## 10 Registration 10 r1 10 start
## # ... with 1,512 more rows, and 2 more variables: time <dttm>,
## # .order <int>
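For comparison with the “or” example above, the comma-separated form can be sketched as follows. It assumes the same edeaR version used in this document, in which filter_attributes is available; the variable name r1_registrations is illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Conditions separated by a comma are combined with "and":
# keep only Registration events performed by resource r1
r1_registrations <- patients %>%
  filter_attributes(employee == "r1", handling == "Registration")

# Check which activities are left
r1_registrations %>%
  activities()
```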
Similar to the activity filter, the resource filter can be used to filter events by listing one or more resources.
patients %>%
filter_resource(c("r1","r4")) %>%
resource_frequency("resource")
## # A tibble: 2 x 3
## employee absolute relative
## <fct> <int> <dbl>
## 1 r1 500 0.679
## 2 r4 236 0.321
The trim filter is a special event filter, as it also takes the notion of cases into account. It trims cases such that they start with a certain activity and end with a certain activity. It requires two lists: one of possible start activities and one of possible end activities. Each case is trimmed to the span from the first appearance of a start activity to the last appearance of an end activity. When reversed, these slices of the event log are removed instead of preserved.
patients %>%
filter_trim(start_activities = "Registration", end_activities = c("MRI SCAN","X-Ray")) %>%
traces()
## # A tibble: 2 x 3
## trace absolute_frequen~ relative_frequen~
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment,~ 236 0.475
## 2 Registration,Triage and Assessment,~ 261 0.525
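The reversed variant mentioned above can be sketched in the same way. Here the slice between the start and end activities is removed, so only the events outside it should remain; the variable name outside_slice is illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Remove the Registration ... MRI SCAN / X-Ray slice from each case
outside_slice <- patients %>%
  filter_trim(start_activities = "Registration",
              end_activities = c("MRI SCAN", "X-Ray"),
              reverse = TRUE)

# Inspect the activities that fall outside the trimmed slice
outside_slice %>%
  activities()
```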
The filter_activity_presence function filters cases that contain certain activities. It requires as input a vector containing one or more activity labels, and it has a method argument. The latter can have the values all, none or one_of. When set to all, all the specified activity labels must be present for a case to be selected; none means that none of them may be present; and one_of means that at least one of them must be present.
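A short sketch of the method argument on the patients log follows, using filter_activity_presence. The one_of and none selections should be complementary, since a case either contains at least one of the listed activities or none of them. Variable names are illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

imaging <- c("X-Ray", "MRI SCAN")

# Cases that contain at least one of the imaging activities
with_imaging <- patients %>%
  filter_activity_presence(imaging, method = "one_of")

# Cases that contain none of the imaging activities
without_imaging <- patients %>%
  filter_activity_presence(imaging, method = "none")

# The two selections together should account for every case
n_cases(with_imaging)
n_cases(without_imaging)
```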
The case filter selects cases by their case identifiers. As its only argument it requires a vector of case IDs. The selection can also be negated using reverse = TRUE.
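As a minimal sketch, the case filter (filter_case in edeaR) can be applied to the patients log as follows. The case identifiers "1", "2" and "3" are taken from the patient column shown in the earlier output, where they are stored as character values.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

selected_ids <- c("1", "2", "3")

# Keep only the three listed cases
first_three <- patients %>%
  filter_case(cases = selected_ids)

# Negate the selection: keep every case except the listed ones
all_but_three <- patients %>%
  filter_case(cases = selected_ids, reverse = TRUE)

n_cases(first_three)
n_cases(all_but_three)
```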
The filter_endpoints method filters cases based on their first and last activity label. It can be used in two ways: by specifying vectors with allowed start activities and/or allowed end activities, or by specifying a percentile. In the latter case, the percentile value is used as a cut-off. For example, when set to 0.9, it selects the most common endpoint pairs which together cover at least 90% of the cases, and filters the event log accordingly. This filter can also be reversed.
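Both ways of using filter_endpoints can be sketched as follows. The argument name percentage for the percentile cut-off is an assumption based on the deprecation warning shown earlier in this document; older versions may expect percentile_cut_off instead. Variable names are illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Variant 1: explicit start and end activities
regular_cases <- patients %>%
  filter_endpoints(start_activities = "Registration",
                   end_activities = "Check-out")

# Variant 2: most common endpoint pairs covering at least 90% of cases
# (argument name `percentage` assumed, see lead-in)
common_endpoints <- patients %>%
  filter_endpoints(percentage = 0.9)

n_cases(regular_cases)
n_cases(common_endpoints)
```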
In order to extract a subset of an event log which conforms to a set of precedence rules, one can use the filter_precedence method. There are two types of precedence relations which can be tested: activities that should directly follow each other, or activities that should eventually follow each other. The type can be set with the precedence_type argument. Further, the filter requires a vector of one or more antecedents (containing activity labels), and one or more consequents. Finally, a filter_method argument can also be set. This argument is relevant when there is more than one antecedent or consequent. In such a case, you can specify that all possible precedence combinations must be present (all), or at least one of them (one_of).
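A sketch of filter_precedence on the patients log is given below. The value directly_follows for precedence_type is taken from the edeaR documentation rather than from the text above, so treat it as an assumption; variable names are illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Cases where Triage and Assessment is directly followed by
# at least one of the two consequents
either_test <- patients %>%
  filter_precedence(antecedents = "Triage and Assessment",
                    consequents = c("Blood test", "X-Ray"),
                    precedence_type = "directly_follows",
                    filter_method = "one_of")

# Cases where every antecedent-consequent combination is present
both_tests <- patients %>%
  filter_precedence(antecedents = "Triage and Assessment",
                    consequents = c("Blood test", "X-Ray"),
                    precedence_type = "directly_follows",
                    filter_method = "all")

n_cases(either_test)
n_cases(both_tests)
```

Requiring all combinations is the stricter condition, so it can never select more cases than one_of.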
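There are three different filters which take into account the length of a case: one based on the trace length, one based on the processing time, and one based on the throughput time.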
Each of these filters can work in two ways, similar to the endpoints filter: either by using an interval or by using a percentile cut-off. The percentile cut-off always starts with the shortest cases and stops including cases when the specified percentile is reached. The processing and throughput time filters also have a units argument to specify the time unit used when defining an interval. All the methods can be reversed by setting reverse = TRUE.
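The two modes can be sketched as follows. The function names filter_throughput_time, filter_processing_time and filter_trace_length, and the argument name percentage, are taken from the edeaR documentation and are assumptions with respect to the version used in this document. The interval bounds are arbitrary illustrative values.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Interval mode: cases with a throughput time of at most 5 days
short_stays <- patients %>%
  filter_throughput_time(interval = c(0, 5), units = "days")

# Percentile mode: the 50% of cases with the shortest processing time
quickest_half <- patients %>%
  filter_processing_time(percentage = 0.5)

# Interval mode on trace length: cases with 5 to 6 activity instances
mid_length <- patients %>%
  filter_trace_length(interval = c(5, 6))

n_cases(short_stays)
n_cases(quickest_half)
n_cases(mid_length)
```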
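Cases can also be filtered by supplying a time window to the method filter_time_period. There are four different filter methods, of which one can be used as argument: contained (cases that fall entirely within the time window), start (cases that started in the time window), complete (cases that completed in the time window), and intersecting (cases with at least one event in the time window).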
The selection can also be reversed. Note that there is a fifth filter method, trim, but this is actually an event filter, since it removes the events outside the time window rather than selecting complete cases.
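A sketch of filter_time_period with two of its filter methods is shown below. The method names start and contained and the interval argument follow the edeaR documentation; passing the window as a Date vector of length two is an assumption about the accepted input type, and the dates themselves are arbitrary.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Time window: first quarter of 2017 (illustrative choice)
time_window <- as.Date(c("2017-01-01", "2017-03-31"))

# Cases that started inside the window
started_in_window <- patients %>%
  filter_time_period(interval = time_window, filter_method = "start")

# Cases that lie completely inside the window
contained_in_window <- patients %>%
  filter_time_period(interval = time_window, filter_method = "contained")

n_cases(started_in_window)
n_cases(contained_in_window)
```

A case that is fully contained in the window also starts in it, so the contained selection should never be larger than the start selection.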
The last case filter can be used to filter cases based on the frequency of the corresponding trace. A trace is a sequence of activity labels. There are again two ways to select cases based on trace frequency: by interval or by percentile cut-off. The percentile cut-off starts with the most frequent traces. This filter also has the reverse argument.
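The percentile variant of the trace frequency filter can be sketched as follows, using filter_trace_frequency. The argument name percentage is assumed from the deprecation warning shown earlier; variable names are illustrative.

```r
library(bupaR)
library(edeaR)
library(eventdataR)

# Cases belonging to the most frequent traces, covering 80% of cases
frequent_behaviour <- patients %>%
  filter_trace_frequency(percentage = 0.8)

# Reversed: the cases that follow the remaining, rarer traces
rare_behaviour <- patients %>%
  filter_trace_frequency(percentage = 0.8, reverse = TRUE)

n_cases(frequent_behaviour)
n_cases(rare_behaviour)
```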