Your Algebra Homework Can Now Be Easier Than Ever!

Scientific Computing

Dense Linear Algebra

• Scaling and sums
• Transpose
• Rank-one updates
• Rotations
• Matrix vector products
• Matrix Matrix products

Designing Numerical Software

• “ Premature optimization is the root of all evil”
• First make the code produce the expected results
 – Validation
 – Then worry about performance
• Test individual components of the software separately
• Report and handle errors
• Validate input data

Mathematical Libraries

• BLAS – Basic Linear Algebra Subprograms
 – Three levels:
  • Level 1: Vector operations
  • Level 2: Matrix-Vector operations
  • Level 3: Matrix-Matrix operations
 – Originally written in F77, the standard has evolved to contain
F77,F95,C,C++ libraries, each with their own version of the calling
program. The dcopy routine is used to copy two vectors and is given
by: f_dcopy for fortran users and c_dcopy for C users.
 – BLAS 1 was originally designed for vector processors. BLAS 2 and 3
were designed for cache-based computers
• LINPACK (Linear Algebra Package )
 – Linear equation and linear least squares solvers
 – Based on level 1 routines in BLAS

Mathematical Libraries, cont’d

Mathematical Libraries, cont’d
• EISPACK (Eigenvalue Software Package)
 – Computes eigenvalues and eigenvectors
• LAPACK (Linear Algebra Package)
 – Originally designed to extend LINPACK and EISPACK on shared memory vector and parallel processors
 – Also implements highly- tuned system -dependend BLAS libraries, using BLAS levels 2 and 3 wherever possible
• SCALAPACK (Scalable Linear Algebra Package)
 – Designed to extend LAPACK routines to work on distributed memory parallel machines
 – Uses BLACS (Basic Linear Algebra Communications Subprograms) for interprocessor communication
• PLAPACK (Parallel Linear Algebra Package)
 – Parallel linear algebra routines using high levels of abstraction
 – Allows implementation of advanced linear algebra routines
• FFTPACK (FFT Package)
 – Complex and real fft routines (available from
• VSIPL (Vector/ Signal /Image Processing Library)
 – Attempt to create an fft standard

Self-tuning Libraries

• PHIPAC (Portable High-performance ANSI C Linear Algebra
 – Run a performance tuner which stores information about particular machine
to optimize BLAS routines
 – Performance tuner can take days because it creates system-specific
versions of all BLAS routines
• ATLAS (Automatically Tuned Linear Algebra Software)
 – System-specific information is placed into the design of one specific
subroutine which can be called by all BLAS routines.
• FFTW (Fastest Fourier Transform in the West)
 – System-specific information created at runtime using a "planner", which
stores information used by "executor" functions during program execution

Commercial Libraries

• Hardware vendors sell highly-tuned versions of BLAS and LAPACK
for their hardware
 – Intel Math Kernel Library (MKL)
 – HP MLIB (Mathematical Software Library)
 – IBM ESSL (Engineering and Scientific Subroutine Library)
 – SUN Performance Library

• Other libraries developed independently of BLAS, specifically for
relevant hardware.
 – Visual Numerics' IMSL (Mathematical and Statistical Library)
 – Numerical Algorithms Groups' Numerical Libraries

BLAS Structure

Names are cryptic, follow a pattern

Prefix Matrix Type Name

• Example: DGEMM: double precision
matrix multiplication of a general full matrix
• First versions written in Fortran 77
 – 1-based indexing
 – Call by reference
 – No dynamic memory allocation

BLAS, cont’d

Prefix Fortran type
S Real
D Double precision
C Complex
Z Double precision complex
Type code Matrix type
GE General full
SY Symmetric
HE Hermitian
TR Triangular
GB General banded
SB Symmetric banded

Simple Vector Operations (BLAS level 1)

• Setting to zero or a specified constant
  • bzero = set elements to zero
  • memset = set elements to a specified constant
• Vector copy
  • c_dcopy=Double precision vector copy y=x

Scalar-Vector accumulation

• daxpy=Double precision y=a*x+y
• daxpby=Double precision
• Compiler will unroll these loops

Dot Product: BLAS ddot

No unrolling:

Two-way loop unrolling:
for(i=0;i<nend;i+=2) {

• Effect of loop unrolling with N=1024 (8 kb), in MFlops

gcc –O0:
0: 235.90
2: 538.41 (2.28x)
4: 941.68 (3.99x)
6: 1024.97 (4.34x)
8: 1071.04 (4.54x)
10: 1053.29 (4.46x)
12: 1063.94 (4.51x)
14: 1061.28 (4.50x)

gcc –O2:
0: 223.60
2: 553.34 (2.47x)
4: 1092.23 (4.88x)
6: 1449.38 (6.48x)
8: 1548.78 (6.93x)
10: 1594.80 (7.13x)
12: 1572.16 (7.03x)
14: 1529.57 (6.84x)

• Loop unrolling reduces
overhead and allows for
• Too much unrolling leads to
register spilling

Some BLAS Level 1 Routines

Routine Operation Memory
ops per
Floating point
ops per
F: M ratio
dcopy yi=xi 2 0 0
daxpy yi=yi+a*xi 3 2 0.67
daxpby yi=β*yi+a*xi 3 3 1.00
ddot ddot=ddot+xi*yi 2 2 1.00

BLAS Level 2 routines

• Level 2 routines consist of 2 loops and include matrix
copy, matrix transpose, and matrix-vector routines.
• Matrix copy: dge_copy
Performs well because of unit stride on inner-most loop:

The transpose case is not as straightforward:

gcc –O2 (1024 X 1024 int)
Copy: 707.31 Mb/s
Transpose: 54.59 Mb/s

Transpose with 1-d blocking

1-d blocks (a and b ints):
int en = bsize*(n/bsize);

• After i=0 inner loop (assuming 32-byte cache blocks):
– 8*bsize elements of a in L1 cache (a[0:bsize-1][0:7])

Transpose with 2-d blocking

2-d blocks (a and b ints):
int en = bsize*(n/bsize);

Effect of Blocking on Transpose
Performance (1024 x 1024 int)

Peak throughput occurs when bsize=45
(when two 45x45 matrices fit into L1 cache!)

In-place Transpose: dge_trans

• Previous examples involved a copy and transpose
– This is termed out -of-place matrix transpose
• Consider the in-place matrix transpose (method 1):

for(j=i+1;j<n;j++) {
temp = a[i][j];
a[i][j] = a[j][i];
a[j][i] = temp;

Move along a column and swap elements.
suffers from possible cache thrashing when n is a power of 2 since a[i*n+j]
and a[j*n+i] can map to same cache location

In-place Transpose: Method 2

for(j=i;j<n;j++) {
temp = a[j-i][j];
a[j-i][j] = a[j][j-i];
a[j][j-i] = temp;

Move along diagonals and swap elements.

• This method is less likely to incur cache thrashing
when n is a power of 2 since stride is n+1 in j-loop:
• a[j-i][j]=a[(j-i)*n+j]=a[(n+1)*j – i*n]
• The BLAS routines employ blocking for in-place

Summary of BLAS routines

• BLAS level 1: single-loop
– Set to zero (bzero), set to a constant (memset), copy vectors
(dcopy), inner-product (ddot), ax+y (daxpy), ax+by (daxpby)

• BLAS level 2: double-loop
– matrix copy (dge_copy), matrix transpose (deg_trans), outerproduct
(dger), matrix-vector product (dgemv)

• BLAS level 3: triple-loop
– matrix-matrix product (dgemm): optimized in-cache matrix solver .
For use with dge_copy to create a blocked matrix-matrix


• Builds on BLAS for more linear algebra
• Linear Systems
– LU, Cholesky
• Least Squares
– QR
• Eigenproblems
• Singular Value Decomposition (SVD)

Dense Linear Algebra

• Matrix Decompositions
 – Linear systems
  • Symmetric,Nonsymmetric
 – Least squares
  • QR, SVD
• Eigenvalues
 – Rayleigh iteration
 – QR algorithm
• Standard libraries
 – Linpack, Lapack, Atlas, PetSC, MKL

Prev Next

Start solving your Algebra Problems in next 5 minutes!

Algebra Helper
Download (and optional CD)

Only $39.99

Click to Buy Now:

OR is an authorized reseller
of goods provided by Sofmath

Attention: We are currently running a special promotional offer for visitors -- if you order Algebra Helper by midnight of July 6th you will pay only $39.99 instead of our regular price of $74.99 -- this is $35 in savings ! In order to take advantage of this offer, you need to order by clicking on one of the buttons on the left, not through our regular order page.

If you order now you will also receive 30 minute live session from for a 1$!

You Will Learn Algebra Better - Guaranteed!

Just take a look how incredibly simple Algebra Helper is:

Step 1 : Enter your homework problem in an easy WYSIWYG (What you see is what you get) algebra editor:

Step 2 : Let Algebra Helper solve it:

Step 3 : Ask for an explanation for the steps you don't understand:

Algebra Helper can solve problems in all the following areas:

  • simplification of algebraic expressions (operations with polynomials (simplifying, degree, synthetic division...), exponential expressions, fractions and roots (radicals), absolute values)
  • factoring and expanding expressions
  • finding LCM and GCF
  • (simplifying, rationalizing complex denominators...)
  • solving linear, quadratic and many other equations and inequalities (including basic logarithmic and exponential equations)
  • solving a system of two and three linear equations (including Cramer's rule)
  • graphing curves (lines, parabolas, hyperbolas, circles, ellipses, equation and inequality solutions)
  • graphing general functions
  • operations with functions (composition, inverse, range, domain...)
  • simplifying logarithms
  • basic geometry and trigonometry (similarity, calculating trig functions, right triangle...)
  • arithmetic and other pre-algebra topics (ratios, proportions, measurements...)


Algebra Helper
Download (and optional CD)

Only $39.99

Click to Buy Now:

OR is an authorized reseller
of goods provided by Sofmath
Check out our demo!
"It really helped me with my homework.  I was stuck on some problems and your software walked me step by step through the process..."
C. Sievert, KY
19179 Blanco #105-234
San Antonio, TX 78258
Phone: (512) 788-5675
Fax: (512) 519-1805

Home   : :   Features   : :   Demo   : :   FAQ   : :   Order

Copyright © 2004-2022, Algebra-Answer.Com.  All rights reserved.