Archive for September, 2010

So what if it is undecidable or NP!

September 19th, 2010 3 comments

An all too frequent refrain I hear when discussing how to solve a problem is “… but it is undecidable” and this is often followed by the statement that it is not worth attempting to solve the problem. Sometimes NP-completeness gets mentioned in the mix of mumbo-jumbo that is talked about this such problems.

Just because a problem is undecidable does not mean that developers should not attempt to write code to solve it. When a problem is undecidable there exists at least one set of inputs for which the algorithm you use cannot be relied on to give a correct answer and however much you rework the algorithm at least one such set of input will always exist (this set of input may change from algorithm to algorithm).

Undecidability theorems do not say anything about what percentage of all possible input sets have this ‘not known if answer is correct’ property. If there are other theorems addressing this question I don’t know about them (and I don’t claim to be an expert or up to date on this topic).

Mathematicians (well at least those working on Zermelo-Fraenkel set theory) spend lots of time trying to find problems that are undecidable and would be overjoyed to find an instance of a problem/dataset where undecidability actually occurred.

Undecidability is an interesting property; however I think educationalists should stop overselling it and give their students some practical advice such as “Don’t worry, these situations are extremely rare and perhaps many orders of magnitude less likely to generate a fault than the traditional fault generation methods.”

NP’ness deals with how execution time grows as the problem input size grows. It is easy to jump to the conclusion that problems known to be in P are ‘good’ and those believed to be in NP are ‘bad'; and that NP problems take so long to solve they are not worth consideration. For small input sets the constant multipliers in the non-polynomial time equation may mean that the actual execution time is acceptable; this is something that requires a bit of experimentation to find out.

Sometimes a NP problem can be solved in an acceptable amount of time.

The reverse situation is also true. While polynomial execution time does not grow as fast as non-polynomial (which could be exponential or worse) execution time, for small input sets the constant multipliers in the polynomial time equation may result in it starting off with a huge overhead, with problem size only becoming a significant factor for very large inputs.

Sometimes P problems cannot be solved in an acceptable amount of time.

Thinking in R: vectors

September 4th, 2010 No comments

While I have been using the R language/environment for over five years or so, whenever I needed to do some statistical calculations, until recently I was only a casual user. A few months ago I started to use R in more depth and made a conscious decision to try and do things the ‘R way’.

While it is claimed that R is a functional language with object oriented features these are sufficiently well hidden that the casual user would be forgiven for thinking R was your usual kind of domain specific imperative language cooked up by a bunch of people who had not done this sort of thing before.

The feature I most wanted to start using in an R-like way was vectors. A literal value in R is not a scalar, it is a vector of length 1 and the built-in operators take vector operands and return a vector result, e.g., the result of the multiplication (c is the concatenation operator and in the following case returns a vector of containing two elements)

> c(1, 2) * c(3, 4)
[1] 3  8

The language does contain a for-loop and an if-statement, however the R-way is to use the various built-in functions operating on vectors rather than explicit loops.

A problem that has got me scratching my head for a better solution than the one I have is the following. I have a vector of numbers (e.g., 1 7 1 2 9 3 2) and I want to find out how many of each value are contained in the vector (i.e., 1 2, 2 2, 3 1, 7 1, 9 1).

The unique function returns a vector containing one instance of each value contained in the argument passed to it, so that removes duplicates. Now how do I get a count of each of these values?

When two vectors of unequal length are compared the shorter operand is extended (or rather the temporary holding its value is extended) to contain the same number of elements as the longer operand by ‘reusing’ values from the shorter operand (i.e., when indexing reaches the end of a vector the index is reset to the beginning and then moves forward again). This means that in the equality test X == 2 the right operand is extended to be a vector having the number of elements as X, all with the value 2. Some of these comparisons will be true and others false, but importantly the resulting vector can be used to index X, i.e., X[X == 2] returns a vector of 2’s, one for each occurrence of 2 in X. We can use length to obtain the number of elements in a vector.

The following function returns the number of instances of n in the vector X (now you will see why thinking the ‘R way’ needs practice):

num_occurrences = function(n)
   length(X[X == n])  # X is a global variable

My best solution to the counting problem, so far, uses the function lapply, which applies its second argument to every element of its first argument, as follows:

unique_X = unique(X)  # I know, real staticians use <- for assignment
X_counts = lapply(unique_X, num_occurrences)

Using lapply feels like a bit of a cheat to me. Ok, the loop is in compiled C code rather than interpreted R code, but an R function is being called for every unique value.

I have been rereading Matter Computational which has gotten me looking for a solution like those in the first few chapters of this book (and which will probably be equally obscure).

Any R experts out there with suggestions for a non-lapply solution?