The Boa Programming Guide - Output Aggregators

To generate output, Boa provides a special kind of type called an output type. Logically, a variable of an output type (an output variable) can be thought of as a process running in the system that collects and aggregates data. While processing a single project, Boa code emits values, which are sent to output variables. After all projects have been processed (in parallel), the output variables aggregate the data and produce the final output of the query.

Each output type names the aggregator function to use, chosen from the predefined aggregators listed below. An output type can also declare one or more indices, which turns the aggregation into a grouping operation: all values sent to the output variable are grouped by their index values, and each group is then aggregated separately.

Some aggregators also specify parameters. These allow controlling the aggregation, for example by setting limits on the output it generates.

Aggregator    Type

bottom        output bottom [param] [indices] of T [weight T2]
    A statistical sampling that records the bottom [param] number of elements, of type T.

collection    output collection [indices] of T
    A collection of data points. No aggregation is performed and every value appears in the output.

maximum       output maximum [param] [indices] of T [weight T2]
    A precise sample of the [param] highest weighted elements, of type T.

mean          output mean [indices] of T
    An arithmetic mean of the data, of type T. Types supported currently are: int, float.

minimum       output minimum [param] [indices] of T [weight T2]
    A precise sample of the [param] lowest weighted elements, of type T.

set           output set [param] [indices] of T
    A set of the data, of type T. If [param] is specified, the set can have at most that many elements.

sum           output sum [indices] of T
    An arithmetic sum of the data, of type T. Types supported currently are: int, float.

top           output top [param] [indices] of T [weight T2]
    A statistical sampling that records the top [param] number of elements, of type T.
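
As a rough sketch of how a few of these declarations might be used together, the query below declares a collection, a set, and an indexed mean. It assumes the input Project exposes project_url and programming_languages, and that a Revision exposes a files array, as in Boa's domain-specific types. The variable names are made up for illustration.

```boa
p: Project = input;

# every emitted value appears in the output, unaggregated
urls: output collection of string;
# distinct language names across all projects
langs: output set of string;
# mean number of files changed per revision, grouped by project
churn: output mean[string] of int;

urls << p.project_url;

foreach (i: int; def(p.programming_languages[i]))
    langs << p.programming_languages[i];

visit(p, visitor {
    before node: Revision -> churn[p.id] << len(node.files);
});
```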

Parameters

Some aggregators are parameterized (look for [param] above). A parameter is written in parentheses immediately after the aggregator name. For example, the top aggregator requires a parameter to indicate how many values it should keep.

var: output top(10) of string weight int;

computes the top-10 values of type string, using integers as weights (see below for more description of weights).
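
The same parenthesized syntax applies to the other parameterized aggregators. A minimal sketch using maximum, assuming the input Project exposes project_url and programming_languages (the variable name most_langs is hypothetical):

```boa
p: Project = input;

# keep the 3 projects that list the most programming languages
most_langs: output maximum(3) of string weight int;

# the weight is the number of languages this project lists
most_langs << p.project_url weight len(p.programming_languages);
```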

Indices

Aggregators optionally take 1 or more indices. This allows grouping data sent to the aggregator and performing aggregation on each group. For example, if you wanted to compute the sum of something for each Project, you could add an index of project id:

var: output sum[string] of int;

The aggregator will group all values with the same index, and then perform aggregation (in this case sum) for each grouping. To use such an aggregator, you must specify the concrete value for each index when emitting values to the output variable:

var[input.id] << 1;

In this case, the value 1 is sent to the output variable each time the statement runs. The values are grouped by project ID, and a separate aggregation is performed for each project ID. The final output contains one result per unique index value.
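
An index does not have to be a project ID; any value of the index type works. As a sketch, assuming the input Project exposes a programming_languages array, the following query counts projects per language (the name counts is chosen for illustration):

```boa
p: Project = input;

# one sum per distinct language name
counts: output sum[string] of int;

# emit 1 under each language this project lists
foreach (i: int; def(p.programming_languages[i]))
    counts[p.programming_languages[i]] << 1;
```

The output would then contain one counts[...] line per language name seen.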

Weights

Some aggregators support item weights. The values sent to the output variable are first grouped by unique value, and the weights in each group are summed. The aggregation is then performed on those total weights. For example,

var: output top(10) of string weight int;

will take values of type string with associated integer weights. It will group by each unique string, sum the weights, and then take the 10 with the highest total weight.
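
To make the weight arithmetic concrete, here is a small trace with literal values. It is only an illustration: the variable name and the values are made up, and the totals in the comments assume the statements run for a single project (a Boa program runs once per input project).

```boa
best: output top(2) of string weight int;

best << "java" weight 2;
best << "c" weight 1;
best << "java" weight 3;
best << "scheme" weight 4;

# total weight per unique value: java = 2 + 3 = 5, scheme = 4, c = 1
# top(2) keeps the two highest totals: java, then scheme
```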

Example 1 - Counting Features

Now we will provide a few examples, showing how to define and use output variables for a few sample tasks.

First consider a simple task: let's count how often a particular feature appears. For this we would want to use the sum aggregator to sum integer values:

var: output sum of int;

We can use this variable by emitting integer values to it:

var << 1;

The final result in the output will be a single integer value that adds all the integer values emitted to the variable, for all projects analyzed. Thus the output looks something like:

var[] = 3983920

Note the name of the variable appears in the output. Also note the empty square brackets, indicating there were no indices used.
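
Put together as a complete query, this might look like the sketch below, which treats "the project lists Java as a language" as the feature being counted. It assumes the input Project exposes a programming_languages array; the variable name java_projects is hypothetical.

```boa
p: Project = input;
java_projects: output sum of int;

# emit 1 if any of the project's languages is Java
exists (i: int; match(`^java$`, lowercase(p.programming_languages[i])))
    java_projects << 1;
```

Because no index is used, the output is a single java_projects[] line.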

Example 2 - Grouping by Project/Revision

Now instead let's assume we wanted to sum all values, but group them by each project analyzed. For this we need to use an index (of type string):

var: output sum[string] of int;

To use this output variable, we again emit integers to it. However we also indicate what the index value is:

var[input.id] << 1;

In this case we used the project ID to group values. The final output here will include one sum per unique project ID (shown as the index):

..
var[1000] = 3983920
var[1001] = 84832
var[1002] = 947174
var[1003] = 4859
..

We could go further and group not just by project, but by unique revision in each project:

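var: output sum[string][string] of int;

Each emit then supplies both index values (here rev stands for a Revision variable in scope):

var[input.id][rev.id] << 1;

The output contains one sum per project and revision:

..
var[1000][1] = 3983920
var[1000][2] = 84832
var[1001][1] = 947174
var[1001][2] = 4859
..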

This allows fine-grained grouping with as many indices as you need.
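
A sketch of where such a revision value could come from: the query below visits each Revision of the input project and sums the number of changed files per project and revision. It assumes a Revision exposes id and a files array; the names changes and rev are chosen for illustration.

```boa
p: Project = input;

# first index: project ID, second index: revision ID
changes: output sum[string][string] of int;

visit(p, visitor {
    # runs once for every revision reached by the visitor
    before rev: Revision -> changes[p.id][rev.id] << len(rev.files);
});
```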

Example 3 - Top-N

Finally, consider the case where you may want to find the top-N weighted elements, perhaps the top-10. You can define an output variable:

var: output top(10) of string weight int;

In this case, we are defining a variable that will produce the top-10 string values, based on the total weight seen for each unique value. To use the variable, we need to emit both a value and a weight:

var << "foo" weight 1;

The aggregator will group by unique string value, sum the weights in each group, sort by total weight, and keep the 10 values with the highest totals. The output will contain at most 10 entries, each listing a value and its total weight:

var[] = java, 240943.0
var[] = c++, 200384.0
var[] = scheme, 103910.0
..
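
One way to obtain a ranking of this kind is to emit each project's languages with a weight of 1, so that a language's total weight is the number of projects listing it. This is only a sketch, not necessarily the query that produced the output above. It assumes the input Project exposes a programming_languages array, and the name langs is hypothetical.

```boa
p: Project = input;
langs: output top(10) of string weight int;

# each project contributes weight 1 to every language it lists
foreach (i: int; def(p.programming_languages[i]))
    langs << p.programming_languages[i] weight 1;
```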