# Mon 05 May 2008 04:45:39 PM EDT
# Uday Bondhugula

OVERVIEW

PLUTO is an automatic parallelization tool that is based on the 
polyhedral model. The Polyhedral model is a geometrical representation 
for programs that utilizes machinery from Linear Algebra and Linear 
Programming for analysis and high-level transformations. Pluto 
transforms C programs from source to source for coarse-grained 
parallelism and locality simultaneously. The core transformation 
framework mainly works by finding affine transformations for efficient 
tiling and fusion, but not limited to it. OpenMP parallel code for 
multicores can be automatically generated from sequential C program 
sections. Pluto currently lacks intra-tile interchange (supposed to be 
done post-transformation) for spatial locality, prefetching, etc.. This 
will be implemented in future.

This is the chain of the entire source-to-source system that polycc will  
run. 

C code --> Polyhedral extraction -->  Dependence analysis
           (clan or PET)                (ISL or candl)

-->    Pluto transformer
      (core Pluto algorithm + post transformation)

 --> CLooG                     -->       C (with OpenMP, ivdep pragmas)
 (cloog + clast processing 
 to mark loops parallel, ivdep)


USING PLUTO

- Use '#pragma scop' and '#pragma endscop' around the section of code 
  you want to parallelize/optimize.

- Then, run 
    
    ./polycc <C source file> --parallel --tile

    The output file will be named <original prefix>.pluto.c unless '-o 
    <filename>" is supplied. When --debug is used, the .cloog used to 
    generate code is not deleted and is named similarly.

Please refer to the documentation of Clan or PET for information on the 
kind of code around which one can put '#pragma scop' and '#pragma 
endscop'.  Most of the time, although your program may not satisfy the 
constraints, it may be possible to work around them.


COMMAND-LINE OPTIONS
    -o output
    Output to file 'output'. Without -o, name of the output file is 
    determined as described earlier (under 'Using PLUTO')

    --help 
    List all available options with one-line summary

    --tile [--l2tile]
    Tile code; in addition, --l2tile will tile once more for the L2 
    cache. By default, both of them are disabled. Tile sizes can be 
    forced if needed from a file 'tile.sizes' (see below), otherwise, 
    tile sizes are set automatically using a rough heuristic. Tiling 
    also allows extraction of coarse-grained pipelined parallelism with 
    the Pluto model.

    --intratileopt  [enabled by default]
    Optimize a tile's execution order for locality (spatial and temporal
    reuse); the right loop permutation for a tile will be chosen, in particular,
    the right innermost loop. Spatial locality is not otherwise captured 
    by Pluto's cost function.

    --parallel
    Parallelize code with OpenMP (usually useful when used with --tile) 

    --parallelize
    Same as --parallel

    --innnerpar
    Prever inner parallelism over pipelined/wavefront parallelism obtained
    via skewing whenever both exist

    --multipipe 
    Will enable extraction of multiple degrees of parallelism (from all 
    parallel loops).  Disabled by default. By default, only one degree of 
    outer parallelism or coarse-grained pipelined parallelism is extracted 
    if it exists.

    --smartfuse [default]
    This is the default fusion heuristic. Will try to fuse between SCCs 
    of the same dimensionality. 

    --nofuse
    Separate all strongly-connected components (SCCs) in the dependence
    graphs to start with, i.e., no fusion across SCCs, and at any level
    inside.

    --maxfuse 
    This is geared towards maximal fusion, but maximal fusion is not
    guaranteed. Fusion is done across SCCs.

    --[no]unroll
    Automatically identify and unroll-jam up to two loops. Not enabled 
    by default.

    --ufactor=<n>
    Unroll or unroll-jam factor (default is 8). Note that if two loops
    are unroll-jammed, you will get an nxn body

    --[no]prevector
    Perform post-transformations to make the code amenable to 
    vectorization. Enabled by default.

    --rar
    Consider RAR dependences for optimization (increases running time by 
    a little). Disabled by default

    --debug
    Verbose information to give some insights into the algorithm.  
    Intermediate files are not deleted (like the program-readable 
    statement domains, dependences, pretty-printed dependences, the 
    .cloog file, etc.). For the format of these files, refer doc/DOC.txt 
    for a pointer.

    --verbose
    Higher level of output. ILP formulation constraints are 
    pretty-printed out dependence-wise, along with solution hyperplanes 
    at each level. 

    --indent
    Indent generated code

    --islsolve
    Use ISL for solving ILPs

    --isldep
    See usage message (run polycc with no arguments)

    --lastwriter
    See usage message (run polycc with no arguments)

    --readscoplib
    Read input from a scoplib file

    --context=<value>
    An assertion from the user that all program parameters are greater than 
    or equal to <value>

    --version
    Print version number and exit

	-q | --silent
	UNIX-style silence: no output as long as everything goes fine

    Besides these, 'tile.sizes' and '.fst' files allow the user to force 
    certain things.

    Other options will only make sense to power users. See comments in 
    src/pluto.h for details.


SPECIFYING CUSTOM TILE SIZES THROUGH 'tile.sizes'

A 'tile.sizes' file in the current working directory can be used to manually 
specify tile sizes. Specify one tile size on each line and as many tile 
sizes are there are hyperplanes in the outermost non-trivial permutable 
band. When specifying tile sizes for multiple levels (with --l2tile), first 
specify L1 tile sizes, then L2:L1 tile size ratios.  See 
examples/matmul/tile.sizes as an example.  If 8x128x8 is the L1 tile size, 
and 128x256x128 L2, the tile.sizes file will be

# L1 tile size 8x128x8
8
128
8
# L2 is 16*8 x 2*128 x 16*8
16
2
16

The default tile size in the absence of a tile.sizes file is 32 (along all 
dimensions), and the default L2/L1 ratio is 8 (when using --l2tile). The 
sizes specified correspond to transformed loops in that order.  For eg., for 
heat-3d, you'll see this output when you run Pluto

$ ../../polycc 3d7pt.c

[...]

[pluto] Affine transformations [<iter coeff's> <param> <const>]

T(S1): (t, t+i, t+j, t+k)
loop types (loop, loop, loop, loop)

[...]

Hence, the tile sizes specified correspond to t, t+i, t+j, and t+k.


SETTING GOOD TILE SIZES

One would like to maximize locality while making sure there are enough tiles 
to keep all cores busy. Rough thumb rules on selecting tile sizes 
empirically are below.

(1) data set should fit in L1 roughly (or definitely within L2), 
(2) the innermost dimension should have enough iterations, especially if 
it's being vectorized so that (a) cleanup code if any doesn't hurt and (b) 
prefetching provides benefits,
(3) for non-innermost dimensions from which temporal reuse is typically being
exploited, increasing tile size beyond 32 often provides diminishing returns 
(additional reuse); so one may not want to experiment much beyond 32. Each 
time tile.sizes is changed, code has to be regenerated (rm -f *.tiled.c; 
make tiled; ./tiled).

If one wishes to carefully tune tile sizes, it may be good to first run 
Pluto without tiling (without --tile/--partlbtile/--lbtile) and check the 
number of iterations in each of the loops in the tilable band of the 
generated code, and then determine tile sizes.

TILE SIZES FOR DIAMOND TILING OF STENCILS 

For 'lbpar' (generated with --partlbtile --parallel), which is code 
diamond-tiled along the outermost two dimensions, the same tile size should 
be chosen along the first two dimensions; otherwise, the tile wavefront will 
no longer be parallel to the concurrent start face and tile-wise concurrent 
start will be lost.



SPECIFYING A CUSTOM FUSION STRUCTURE THROUGH '.fst' or '.precut' file

One can force a particular fusion structure in two ways: either using 
the .fst file or the .precut file. In either case, the file has to be in 
your `present working directory'.  If both files exist in the directory 
polycc is run from, '.fst' takes precedence. If --nofuse is specified, 
distribution will be done in spite of a .fst/.precut.

Alternative 1 (.fst file)

A component is a subset of statements. Components are disjoint, and the
union of all components is the set of all statements. So the set of all
statements is partitioned into components. Here's the format of the .fst 
file:

<number of components>
# description of 1st component
<num of statements in component>
<ids of statements in first component - an id is b/w 0 and num_stmts-1>
<whether to tile this component -- 0 for no, 1 or more for yes>
# description of 2nd component
...

See examples/gemver/.fst as an example; here is the meaning of a sample 
.fst for examples/gemver/gemver.c:

3 # number of components
1 # number of statements in first component
0 # statement id's of statements in 1st component (id is from 0 to   num_stmts-1): so we have S0 here
0 # don't tile this component {S0}
2 # number of statements in second component
1 2  # stmts in this component are S1 and S2
256 # tile this component {S1,S2}
1 # number of statements in the third component
3 # stmt id 3, i.e., S3 
0 # don't tile this {S3}

So, the above asks Pluto to distribute the statements at the outer
level as: {S0}, {S1,S2}, {S3}. It'll try to fuse as much as possible 
within each component if you use --maxfuse along with the above .fst, 
but will always respect the distribution specified.

Alternative 2

Using the '.precut' file

The '.precut' can be used to specify any partial transformation that Pluto 
can then complete. This is a useful way to test or manually specify a 
particular transformation, or to check if a particular transformation is a 
valid one. The format of .precut is as below.

<number of statements>
<number of levels to tile>
# first statement
<number of rows, and number of columns for the partial transformation>
<rows of the partial transformation in Cloog-like scattering function 
format. For eg.,  if the partial transformation is c1 = i+j+1 for a 2-d 
statement with one program parameter, the row would be 0 1 1 0 1, the first 
number is always a zero (for an equality), then 1 for i, 1 for j, 
0 for the parameter, and 1 for the constant part>
<a number that's not currently used, specify anything>
<whether or not to tile each of the levels, 0 for `don't tile', any 
other number for tiling; as many lines as the number of tiling levels
specified earlier>
# second statement
...

Parsing of any other text (like comments) in .fst or .precut is not 
handled.


MORE ON POST-PROCESSING

--prevector will cause bounds of the loop to be vectorized replaced by 
scalars and insert the ivdep and 'vector always' pragmas to allow 
vectorization with ICC. GCC does not support these, and its 
vectorization capalibility is currently very limited.


LOOKING AT THE TRANSFORMATION

Transformations are pretty-printed. The transformation is a 
multi-dimensional affine function on a per-statement basis.


LOOKING AT THE GENERATED CODE

The best options for taking a good look at the generated code are:

<pluto_dir>/polycc source.c --noprevector --tile --parallel

or 

<pluto_dir>/polycc source.c --noprevector 

to just look at the transformation without the actual tiling and 
parallelization performed; the transformation is just applied, and loop 
fusion can be seen.

USING PLUTO AS A LIBRARY

See src/test_plutolib.c

To install libpluto on a system, run 'make install' from Pluto's top-level 
directory.  libpluto.{so,a} can be found in src/.libs/

