Archive

Posts Tagged ‘Google’

Relative spacing of operands affects perception of operator precedence

January 22nd, 2012 1 comment

What I found most intriguing about Google Code Search (shutdown Nov 2011) was how quickly searches involving regular expressions returned matches. A few days ago Russ Cox, the implementor of Code Search not only explained how it worked but also released the source and some precompiled binaries. Google’s database of source code did not include the source of R, so I decided to install CodeSearch on my local machine and run some of my previous searches against the latest (v2.14.1) R source.

In 2007 I ran an experiment that showed developers made use of variable names when making binary operator precedence decisions. At about the same time two cognitive psychologists, David Landy and Robert Goldstone, were investigating the impact of spacing on operator precedence decisions (they found that readers showed a tendency to pair together the operands that were visibly closer to each other, e.g., a with b in a+b * c rather than b with c).

As somebody very interested in finding faults in code the psychologists research findings on spacing immediately suggested to me the possibility that ‘incorrectly’ spaced expressions were a sign of failure to write code that had the intended behavior. Feeding some rather complicated regular expressions into Google’s CodeSearch threw up a number of ‘incorrectly’ spaced expressions. However, this finding went no further than an interesting email exchange with Landy and Goldstone.

Time to find out whether there are any ‘incorrectly’ spaced expressions in the R source. cindex (the tool that builds the database used by csearch) took 3 seconds on a not very fast machine to process all of the R source (56M byte) and build the search database (10M byte; the Linux database is a factor of 5.5 smaller than the sources).

The search:

csearch "\w(\+|\-)\w +(\*|\/) +\w"

returned a few interesting matches:

...
modules/internet/nanohttp.c:       used += tv_save.tv_sec + 1e-6 * tv_save.tv_usec;
modules/lapack/dlapack0.f:     $          ( T*( ONE+SQRT( ONE+S / T ) ) ) )
modules/lapack/dlapack2.f:               S = Z( 3 )*( Z( 2 ) / ( T*( ONE+SQRT( ONE+S / T ) ) ) )
modules/lapack/dlapack4.f:     $          ( T*( ONE+SQRT( ONE+S / T ) ) ) )

There were around 15 matches of code like 1e-6 * var (because the pattern \w is for alphanumeric sequences and that is not a superset of the syntax of floating-point literals).

The subexpression ONE+S / T is just the sort of thing I was looking for. The three instances all involved code that processed tridiagonal matrices in various special cases. Google search combined with my knowledge of numerical analysis was not up to the task of figuring out whether the intended usage was (ONE+S)/T or ONE+(S/T).

Searches based on various other combination of operator pairs failed to match anything that looked suspicious.

There was an order of magnitude performance difference for csearch vs. grep -R -e (real 0m0.167s vs. real 0m2.208s). A very worthwhile improvement when searching much larger code bases with more complicated patterns.

Share

Program analysis via information leakage

March 31st, 2010 No comments

The use of software in high value transactions has created an interesting new field of software research that investigates the leakage of information from programs. The kind of information leaked, so called side-band information, can take various forms, including:

  • The amount of time taken to perform some operation. Many developers instinctively do their best to ensure that code does not take any longer to execute than it has to. In the case of one commonly used authentication system the time taken to fail to authenticate an encryption key provided useful information on how close one trial encryption-key was compared to another (the closer the trial key to the actual key the longer the authentication took to fail). The obvious implementation technique to foil this kind of attack is to add random delays into the authentication process.

    It has even proved possible to perform timing attacks against a remote machine over the Internet to remote

  • Use of some part of the value of secure information, by a system library function, to create the value passed back to the caller, e.g.,
    if (secret_value & 0xf000)  // Tell the caller that the top 'secret' four bits are set
       return 1;
    else
       return 0;

    Researchers have been able to analyse the information flow of input values through some very large C programs.

  • Analyse of network traffic routing information to work out who is talking to who. Various kinds of anonymizers have been created in attempt to make various forms of Internet traffic untraceable.

Any Internet program is accessible to information flow analysis. Using these techniques to analyse the search algorithm used by Google might be overly ambitious. A Google algorithm that might be within reach of is the one used by Adwords; the behavior of this algorithm is of interest to a growing number of people.

Information leakage techniques are becoming more widely known and developers working on programs containing a security component now need to consider how they can prevent information being leaked to attackers who sample program behavior looking for exploitable weaknesses.

Share
www.wenn.com
www.tinynibbles.com