The Boa Programming Guide - Output Aggregators
To generate output, Boa provides a special kind of type called an output type. Logically, an output variable can be thought of as a process running in the system that collects and aggregates data. While processing a single project, Boa code emits values, which are sent to output variables. After all projects have been processed (in parallel), the output variables aggregate the data and produce the final output of the query.
Output types specify an aggregator function to use. Users select one of the predefined aggregators shown below. When defining aggregators, users can specify one or more indices. An aggregator with indices performs a grouping operation: all values sent to the output variable are grouped by their indices, and each group is then aggregated separately.
Some aggregators also specify parameters. These allow controlling the aggregation, for example by setting limits on the output it generates.
Aggregator | Type | Description |
---|---|---|
bottom | output bottom [param] [indices] of T [weight T2] | A statistical sampling that records the bottom [param] number of elements, of type T. |
collection | output collection [indices] of T | A collection of data points. No aggregation is performed and every value appears in the output. |
maximum | output maximum [param] [indices] of T [weight T2] | A precise sample of the [param] highest weighted elements, of type T. |
mean | output mean [indices] of T | An arithmetic mean of the data, of type T. Types supported currently are: int, float. |
minimum | output minimum [param] [indices] of T [weight T2] | A precise sample of the [param] lowest weighted elements, of type T. |
set | output set [param] [indices] of T | A set of the data, of type T. If [param] is specified, the set can have at most that many elements. |
sum | output sum [indices] of T | An arithmetic sum of the data, of type T. Types supported currently are: int, float. |
top | output top [param] [indices] of T [weight T2] | A statistical sampling that records the top [param] number of elements, of type T. |
Parameters
Some aggregators are parameterized (look for [param] above). A parameter is written in parentheses, with the parameter value inside. For example, the top aggregator requires a parameter to indicate how many values it should keep: the type output top(10) of string weight int computes the top-10 values of type string, using integers as weights (weights are described in more detail below).
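As a complete declaration, this could be written as in the sketch below. The variable name var is an assumption, chosen to match the sample output later in this guide; any identifier would do.

```boa
# Keep the 10 strings with the highest total weight; weights are ints.
var: output top(10) of string weight int;
```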
Indices
Aggregators optionally take one or more indices. Indices group the data sent to the aggregator, so that aggregation is performed separately on each group. For example, if you wanted to compute the sum of something for each project, you could add an index holding the project ID:
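A sketch of such a declaration, assuming the variable is named var (as in the sample output later in this guide) and that the project ID is a string:

```boa
# One string index (the project ID); each group of ints is summed separately.
var: output sum[string] of int;
```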
The aggregator will group all values with the same index, and then perform aggregation (in this case sum) for each grouping. To use such an aggregator, you must specify the concrete value for each index when emitting values to the output variable:
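A sketch of such an emit, assuming the variable name var and using the id attribute of the input project as the index value:

```boa
# Output variable indexed by project ID.
var: output sum[string] of int;

# Send the value 1 to the group for the current project.
var[input.id] << 1;
```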
When the value 1 is emitted with the project ID as its index, the emitted values are grouped by project ID and a separate aggregation is performed for each project ID. The final output contains one result per unique index.
Weights
Some aggregators support item weights. The values sent to the output variable are first grouped by unique value, and the weights within each group are summed. Aggregation is then performed on those total weights. For example, the type output top(10) of string weight int takes values of type string with associated integer weights. It groups by each unique string, sums the weights, and then keeps the 10 strings with the highest total weight.
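The summing of weights can be seen in a small sketch. The variable name var and the literal strings and weights are illustrative assumptions only.

```boa
var: output top(10) of string weight int;

# For each input project, "java" contributes 2 + 3 = 5 to its total weight
# and "c++" contributes 4.
var << "java" weight 2;
var << "java" weight 3;
var << "c++" weight 4;
```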
Example 1 - Counting Features
Now we will provide a few examples, showing how to define and use output variables for a few sample tasks.
First consider a simple task: let's count how often a particular feature appears. For this we would want to use the sum aggregator to sum integer values:
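A declaration along these lines would do it (the name var is assumed from the sample output below):

```boa
# Sum of ints, with no indices.
var: output sum of int;
```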
We can use this variable by emitting integer values to it:
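A minimal sketch of such an emit, which here unconditionally sends 1 once per project; in a real query the emit would sit wherever the feature of interest is detected:

```boa
var: output sum of int;

# Each emitted int is added to the single running total.
var << 1;
```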
The final result in the output will be a single integer value that adds all the integer values emitted to the variable, for all projects analyzed. Thus the output looks something like:
var[] = 3983920
Note the name of the variable appears in the output. Also note the empty square brackets, indicating there were no indices used.
Example 2 - Grouping by Project/Revision
Now instead let's assume we wanted to sum all values, but group them by each project analyzed. For this we need to use an index (of type string):
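Sketched as a declaration, again assuming the name var:

```boa
# The string index holds the project ID; one sum is kept per distinct ID.
var: output sum[string] of int;
```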
To use this output variable, we again emit integers to it. However we also indicate what the index value is:
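A sketch of the emit, using the id attribute of the input project as the index value:

```boa
var: output sum[string] of int;

# The index value in brackets selects the group; 1 is the value being summed.
var[input.id] << 1;
```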
In this case we used the project ID to group values. The final output will include one sum per unique project ID (shown as the index):
..
var[1000] = 3983920
var[1001] = 84832
var[1002] = 947174
var[1003] = 4859
..
We could go further and group not just by project, but by unique revision in each project:
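A sketch with two indices follows. The int type of the second index is inferred from the sample output, and rev_number is a hypothetical placeholder for the number of the revision being processed; a real query would obtain it while walking the project's revisions.

```boa
# Two indices: project ID (string), then revision number (int, assumed).
var: output sum[string][int] of int;

# Hypothetical placeholder for the current revision's number.
rev_number: int = 1;

# One index value is given per declared index, in the same order.
var[input.id][rev_number] << 1;
```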
..
var[1000][1] = 3983920
var[1000][2] = 84832
var[1001][1] = 947174
var[1001][2] = 4859
..
This allows fine-grained grouping with as many indices as you need.
Example 3 - Top-N
Finally, consider the case where you may want to find the top-N weighted elements, perhaps the top-10. You can define an output variable:
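A declaration of this kind might read as follows (the name var is assumed from the sample output below):

```boa
# Top 10 strings, ranked by the sum of their int weights.
var: output top(10) of string weight int;
```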
In this case, we are defining a variable that will produce the top-10 string values, based on the total weight seen for each unique value. To use the variable, we need to emit both a value and a weight:
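One plausible way to feed this variable is sketched below: it emits each programming language recorded for a project with weight 1, which would produce per-language totals like those in the sample output. Using the project's programming_languages attribute is our assumption; the text above does not say where the emitted strings come from.

```boa
p: Project = input;
var: output top(10) of string weight int;

# Emit every language listed for this project, each with weight 1.
foreach (i: int; def(p.programming_languages[i]))
    var << p.programming_languages[i] weight 1;
```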
The aggregator groups by unique string value, sums the weights, sorts by total weight, and keeps the 10 values with the highest total weight. The output contains 10 entries, each listing a value and its total weight:
var[] = java, 240943.0
var[] = c++, 200384.0
var[] = scheme, 103910.0
..