Yue Zhang - Techlog

Friday, September 26, 2008

Java note: toarray

Because Java generics are implemented by type erasure, a container cannot instantiate its element type, nor create an array of it. An inconvenient consequence is toArray. There are two ways to call it:

1.
Object arr[] = list.toArray();

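But Object[] cannot simply be cast to Type[]: the cast compiles, but it typically fails at run time with a ClassCastException, because the array really is an Object[].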

2.
Type[] arr = new Type[list.size()];
list.toArray(arr);

But it would be much better not to have to allocate arr separately.

The next version of Java should really reconsider the implementation of generics.

Wednesday, September 17, 2008

Java note: generics

In Java 1.5 (and 1.6), generics are used by containers to avoid explicit run-time type casts. Generics are not implemented in the same way as C++ templates, and are much less flexible. A generic class is compiled into only one class file, instead of one for each template instance. A big disadvantage is that a generic container cannot create new instances of its element type.

Java note: assertion

Assert statements in Java are compiled into the class file but are disabled at run time by default. To enable them, pass -ea (or -enableassertions) to the java launcher when running the program, not to javac.

java -ea Example

Thursday, July 10, 2008

Python note: swapping objects

The best way to swap two objects is:

a, b = b, a

It swaps the names of the two objects by way of a tuple, without altering the objects themselves.

In many languages, swapping objects involves copying. That approach can be tricky in Python, mainly because assignment in Python does not copy objects; it simply binds a variable name to the existing object. Suppose

class X(object):
    def __init__(self, x):
        self.x = x

a = X([1, 2, 3])
b = X([])

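Then suppose we want b to take over a's contents and a to start again from an empty list:

b = a
a.x = []

The above will not work, because after b = a both names refer to the same object, so b.x becomes [] too. To get the intended result without copying, use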

a, b = b, a
a.x = []

On another occasion, if a is to be kept while b becomes a copy of a's value, define a copying function or use the copy module.

In Python everything is an object, so the rule for assignment applies to lists, dicts etc. as well. To make a (shallow) copy of a list, use l2 = l1[:].
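The difference between binding a second name, making a shallow copy and making a deep copy can be sketched as follows. This is only an illustration that reuses the class X from above; the variable names c and d are added for the example.

```python
import copy


class X(object):
    def __init__(self, x):
        self.x = x


a = X([1, 2, 3])
b = a                  # b is a second name for the same object
c = copy.copy(a)       # shallow copy: a new X, but c.x is the same list as a.x
d = copy.deepcopy(a)   # deep copy: a new X with its own list

a.x.append(4)          # mutate the shared list in place

print(b is a)          # True
print(b.x)             # [1, 2, 3, 4]
print(c.x)             # [1, 2, 3, 4]
print(d.x)             # [1, 2, 3]
```

Only the deep copy is unaffected by later changes to the list held by a.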

Monday, July 07, 2008

Python note: functional programming saves code when dealing with lists

Another note on using functional programming. It makes code with lists shorter and clearer sometimes.

For example, there is a list l. We want to add one to each element:

l = list(map(lambda x: x+1, l))  # list() is needed in Python 3, where map returns an iterator
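For comparison, the same operation can be written as a list comprehension, which many Python programmers find easier to read. A small sketch, with variable names chosen only for illustration:

```python
l = [1, 2, 3]

# map with a lambda; list() makes the result a list on Python 3 as well
mapped = list(map(lambda x: x + 1, l))

# the equivalent list comprehension
comprehended = [x + 1 for x in l]

print(mapped)         # [2, 3, 4]
print(comprehended)   # [2, 3, 4]
```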

LIBSVM note: problems

1. Why does svm-scale run forever, while it keeps writing to the output file (.scale)?

It might be because the original training file contains features in the [0, 1] range, while svm-scale by default scales to the [-1, 1] range, so that every zero feature of a sparse file presumably has to be written out as a nonzero value. Add the option -l 0 to set the lower bound to 0.

2. Why does svm-train run forever?

It might be because of the epsilon value, which is set with -e and defaults to 0.001. The smaller the value, the more accurate the trained model, but the more iterations are needed. Consider setting epsilon to 1 to see whether training finishes.

If there are a lot of features, consider trying LIBLINEAR instead. It does not use a kernel, but runs faster than LIBSVM for a linear model.

Another parameter, -m, sets the size of the memory cache in MB. Make it as large as the available RAM allows.

Tuesday, May 20, 2008

Python tool: Chinese Treebank

I have put some scripts to process the Penn Chinese Treebank on Google Code. These files include a parser to turn bracketed annotations into Python objects, a converter to translate POS tags into the Stanford tagger format, and a set of head-finding rules to translate CTB into dependency trees.

I haven't made any releases for download, but have been updating the source code. The files are available by browsing the trunk from the SVN repository.
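To give an idea of what turning bracketed annotations into Python objects involves, here is a minimal sketch. It is not the code from the repository: the Node class, the parse function and the sample sentence are all made up for illustration, and it assumes well-formed input with one tree per string.

```python
class Node(object):
    """One constituent: a label plus either child nodes or a word."""

    def __init__(self, label):
        self.label = label
        self.children = []
        self.word = None      # set only on POS-tag nodes that hold a word

    def leaves(self):
        """Return the (word, tag) pairs of the tree from left to right."""
        if self.word is not None:
            return [(self.word, self.label)]
        result = []
        for child in self.children:
            result.extend(child.leaves())
        return result


def _read(tokens, pos):
    # tokens[pos] is "(" and tokens[pos + 1] is the node label
    node = Node(tokens[pos + 1])
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = _read(tokens, pos)
            node.children.append(child)
        else:
            node.word = tokens[pos]
            pos += 1
    return node, pos + 1      # skip the closing ")"


def parse(text):
    """Parse one bracketed tree into Node objects."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    node, _ = _read(tokens, 0)
    return node


tree = parse("(IP (NP (NN 经济)) (VP (VV 发展)))")
print(tree.label)      # IP
print(tree.leaves())   # [('经济', 'NN'), ('发展', 'VV')]
```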

Tuesday, April 01, 2008

Python note: reversing lists

There are two ways to reverse a list. One way is to modify the original list:

l = [1,2,3]
l.reverse() # l becomes [3,2,1]

The other way is not to modify the original list, but make a copy:

l = [1,2,3]
l[::-1] # returns [3,2,1]; l remains [1,2,3]

Notes: l[start:end:step] makes a new sequence from l by slicing. For example, l[1:2] makes a new list [2] and l[0:3] makes [1,2,3]. l[:] is often used to make a copy of the whole list l. This is useful when we want to pass the value (not a reference) of l to a new list.
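The two ways of reversing, and the difference between an alias and a sliced copy, can be put side by side in a short sketch. The names rev, alias and snapshot are chosen only for this example.

```python
l = [1, 2, 3]

rev = l[::-1]        # new list, reversed
alias = l            # second name for the same list object
snapshot = l[:]      # shallow copy of the whole list

l.reverse()          # reverses in place; the alias sees the change

print(l)                          # [3, 2, 1]
print(rev)                        # [3, 2, 1]
print(alias)                      # [3, 2, 1]
print(snapshot)                   # [1, 2, 3]
print(list(reversed(snapshot)))   # [3, 2, 1]
```

The built-in reversed() returns an iterator over the list without modifying it, which is handy when the reversed sequence is only looped over once.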

Sunday, March 30, 2008

Python note: the difference between getattr and getattribute

I tried to overload __setattr__ today, so that when a value is assigned to item.type, I can check whether it is in the set of possible choices. However, I made a mistake by defining a __setattribute__ function instead of __setattr__.

There is no special method named __setattribute__. The methods that intercept attribute access in Python are __getattr__, __getattribute__, __setattr__ and __delattr__. The difference between __getattr__ and __getattribute__ is that __getattr__ is called only when normal lookup fails to find the attribute (in the instance dictionary or the class), while __getattribute__ is called whenever any attribute is accessed. Therefore, overriding __getattribute__ slows down every attribute access. __setattr__ is like __getattribute__ in its triggering mechanism: it intercepts every assignment, whether or not the attribute already exists.

Another note about __getattr__ is that the overriding method must raise AttributeError itself, or the program may produce unexpected results. For example, here is an overridden __getattr__ method:

class Foo(object):
    def __getattr__(self, attr):
        if attr == "bar":
            return "bar"

foo = Foo()

Now when we try to read foo.barrrr, which doesn't exist, we get None instead of an AttributeError. The code should be corrected to:

class Foo(object):
    def __getattr__(self, attr):
        if attr == "bar":
            return "bar"
        else:
            raise AttributeError(attr)
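Coming back to the original goal of checking values assigned to item.type, a minimal sketch of a validating __setattr__ might look like this. The class name Item and the set ALLOWED_TYPES are hypothetical and stand in for whatever choices the real program uses.

```python
class Item(object):
    ALLOWED_TYPES = {"noun", "verb"}   # hypothetical set of valid choices

    def __setattr__(self, name, value):
        # Called for every assignment, whether or not the attribute exists.
        if name == "type" and value not in self.ALLOWED_TYPES:
            raise ValueError("invalid type: %r" % (value,))
        # Delegate to the default implementation. Writing
        # "self.name = value" here would call __setattr__ again
        # and recurse without end.
        object.__setattr__(self, name, value)

    def __getattr__(self, name):
        # Called only when normal lookup fails.
        raise AttributeError(name)


item = Item()
item.type = "noun"
print(item.type)                   # noun

try:
    item.type = "adjective"
except ValueError as err:
    print(err)                     # invalid type: 'adjective'

print(hasattr(item, "missing"))    # False
```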

Thursday, March 20, 2008

Python tool: traditional to simplified Chinese converter

I just wrote this script to convert traditional Chinese text to simplified Chinese. Since the relationship between traditional and simplified characters is many to one, I haven't decided whether to write the reverse conversion script.

It has been tested with my own files and can be downloaded here. Please report any bugs or suggestions.

The package contains two files, simplify.py and utftable.txt. The python script is the converter and utftable.txt is the character table. The two files must be put into the same directory.

Usage:
python simplify.py input.txt >output.txt

Both the input and the output text files must be in UTF-8.

Note that you can replace the character relationship table file with your own file (the new file must be in the same format as the original file), just in case there are more comprehensive tables than this one.
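The idea of a table-driven converter, and the reason the reverse direction is harder, can be sketched as follows. This is not the code of simplify.py: the three-entry TABLE is a hypothetical in-memory stand-in for the character table, whose real file format is not described here.

```python
# Hypothetical miniature table: traditional character -> simplified character.
TABLE = {
    "發": "发",
    "髮": "发",
    "頭": "头",
}


def simplify(text, table=TABLE):
    # Characters that are not in the table pass through unchanged.
    return "".join(table.get(ch, ch) for ch in text)


print(simplify("頭髮"))   # 头发

# Inverting the table shows the many-to-one problem:
# one simplified character maps back to several candidates.
reverse = {}
for trad, simp in TABLE.items():
    reverse.setdefault(simp, []).append(trad)

print(sorted(reverse["发"]))   # the two traditional candidates
```

A reverse converter would have to pick between such candidates from context, which a plain character table cannot do.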

Tuesday, March 18, 2008

sqlite note: using the command line tool

The command line tool sqlite3 can be used to view the content of a database. One way of using it is typing in "sqlite3 FILE" and the database contained in FILE is opened for query commands.

The command line tool sqlite3 can also be used to perform a query directly. For example, typing "sqlite3 FILE 'select * from Table1'" will print out all contents in table Table1. This is handy for showing large tables, because we can pipe the output into a reader tool. "sqlite3 FILE 'select * from Table1' | more".
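The same query can also be run from Python through the standard sqlite3 module. The sketch below uses an in-memory database as a stand-in for FILE, and the columns and rows of Table1 are made up for illustration.

```python
import sqlite3

# ":memory:" stands in for the database file; pass the FILE path instead
# to open an existing database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO Table1 VALUES (?, ?)", [(1, "a"), (2, "b")])

rows = conn.execute("SELECT * FROM Table1 ORDER BY id").fetchall()
for row in rows:
    # Mimic the command line tool's default output, columns joined by '|'.
    print("|".join(str(col) for col in row))

conn.close()
```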

Monday, March 10, 2008

C++ note: static_cast from a reference to a value

By default, the result of a static_cast is an rvalue. It can't be used as an lvalue, and thus can't be assigned a new value or bound to a non-const reference.

Such misuse leads to compile errors, but the compiler's report can be misleading or quite hard to understand. For example, suppose we have a base class Base and a class Derived derived from Base. We want to overload the istream >> operator for Derived. The following way of overloading the operator does not work:

istream & operator >> (istream &is, Derived &derived) {
   // special processing
   is >> static_cast<Base>(derived);   // error: the cast creates a temporary Base
   return is;
}

This is because static_cast<Base>(derived) creates a temporary Base object, and a temporary cannot be bound to the non-const Base& parameter of the operator >> for Base. The error reported by the compiler does not mention the cast, however; it simply says that there is no match for operator >> from ...

Note that even when the cast is between pointer types, the resulting pointer is still an rvalue.

To avoid the above problem, use static_cast<Base&>.

Monday, January 28, 2008

Python note: the main entry of a python file

In a python module we could write

if __name__ == "__main__":
    # the main program entry
    ...

and this is often taken as the main entry of the module. The fact may be misleading for C++ programmers, because python doesn't actually need a main entry when running a module.

Unlike C++, each python module is treated as a sequence of executable statements. Whether it is imported or run as a program, python always executes the file from the first line to the end. This is also the reason why python is called a scripting language. Generally, there are three types of statements in python:

1. import statements: when python meets one, it checks sys.modules to see whether the module has already been loaded. If it has, python reuses the loaded module without running it again. Otherwise, python goes into the imported module and runs every statement in it.

2. definition statements: these include "class" and "def"; when python meets a "def" it does not run the function body, but creates the function object and records it under its name in the corresponding namespace (normally the __dict__ of the current module). A "class" statement likewise binds the class name there.

3. execution statements: all the other statements, including assignments, conditionals and loops, function calls etc.; python runs them according to their semantics.

When a module is loaded as the main program, python sets the module variable __name__ to the special value "__main__". The check __name__ == "__main__" makes sure that the statements under it are executed only when the module is run as the main program, not when it is imported. There can be as many such checks as desired, and they can occur anywhere in a python module. There is no particular function or statement that python takes as the main entry of a module.
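A small module can make this top-to-bottom execution visible. The sketch below records what happens in a list named trace (a name invented for the example); which entries appear depends on whether the file is run directly or imported.

```python
trace = []

trace.append("statement 1")          # runs on import and on direct execution


def helper():
    # The def statement only binds the name "helper";
    # this body runs when helper() is called.
    trace.append("helper called")


trace.append("helper" in globals())  # True: def has already bound the name

if __name__ == "__main__":
    trace.append("main block A")     # only when run as the main program

trace.append("statement 2")

if __name__ == "__main__":           # a second check is perfectly legal
    helper()

print(trace)
```

When the file is imported rather than run, neither "main block A" nor "helper called" is recorded, while the other entries still are.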

Friday, January 25, 2008

C++ Makefile case study: incorrect dependencies cause unexpected errors

Incorrect dependencies in Makefiles can cause not only timestamp confusion, but also unexpected errors and segmentation faults after compiling.

Suppose that there are three files: header.h, module1.cpp and module2.cpp. Both cpp files include the h file for the definition of common structures. module1 generates module1.o and module2 generates module2.o, which will be linked together.

Now in the Makefile, suppose header.h is missing from the dependencies of module1.o and module2.o. An immediate problem is that "make" won't recompile the program if only header.h has been modified since the last build. But this is not the worst problem. Suppose both header.h and module1.cpp are modified, and the common structure is changed. Now when "make" is executed, only module1.o is rebuilt; compiling and linking may succeed without any error, yet various strange bugs or unexpected results appear later at runtime. The reason is that module2.o was compiled against the outdated version of the data structure. Such a case can be confirmed by running "make clean" and then building again to see whether the unexpected errors go away.

Monday, January 14, 2008

C++ question - linker failed

I defined a constant boolean in an external header file:
const bool CONDITION = false ;

Then I wrote in a function the following line:
void func(...) {
   ...
   if (cond) {
      if (CONDITION) {
         ...
      }
      else
         ...
   }
   ...
}

Then g++ couldn't compile the code, reporting that the linker couldn't find the function.

I haven't figured out why it happened, but a workaround is to replace the if statement and the constant bool with an #ifdef directive and a macro.

Thursday, November 22, 2007

latex note: minipage with borders

One way to add borders to a minipage is embedding it into an fbox. For example:

\fbox{
\begin{minipage}[t]{0.5\textwidth}
...
\end{minipage}
}

Friday, November 09, 2007

C++ note: initializers in multiple inheritance

Here is a tutorial on C++ multiple inheritance.

A C++ class can inherit from multiple base classes. If these base classes have base classes of their own, those are initialized separately by default. (For each parent class, the subclass contains the inherited structure.) For example, suppose B and C derive from A, and D derives from B and C; then D will have two copies of A, one inherited from B and the other from C.

On some occasions only one copy of A is needed in D. This type of inheritance is enabled by letting B and C derive virtually from A. With virtual inheritance, B and C no longer embed the structure of A at a fixed place in their own structure, but typically locate it through information in their vtables. When D derives from B and C, the compiler generates only one A structure, which is shared by B and C.

Here comes a question: in the constructor of D, what structures do we need to initialize?

class A {
public:
   int a;
   A ( int i ) : a (i) {}
};

class B : virtual public A {
public:
   B ( int i ) : A(i) {}
};

class C : virtual public A {
public:
   C ( int i ) : A(i) {}
};

class D : public B, public C {
public:
   D ( int i ) : A(i), B(i), C(i) {} // <-------------
};

In the above example, D has to initialize A, B and C in its constructor. The initialization of A is mandatory even though both B and C have initializers for A in their constructors: a virtual base is always initialized by the most derived class, so when a D is constructed, the A(i) initializers in B and C are ignored. Because A here has no default constructor, removing A(i) from D is a compile error. If A did have a default constructor, it would be used instead, which can be observed by removing the initialization of A in D and printing the a field of a D object.

Thursday, November 08, 2007

C++ note: avoid allocating large structures in the stack

An unexpected segmentation fault when calling a function can be caused by a big memory allocation on the stack. The following piece of code will fail.

#include <iostream>
using namespace std;

class Big {
   int data[10000000];   // about 40 MB, far above a typical stack limit
};

void test() {
   cout << "entering test";
   Big big;
}

int main(int argc, char** argv) {
   test();
   return 0;
}

The reason is this: when test() is called, the variables declared in the function are put on the stack (a special area of memory for function calls, commonly supported by a stack pointer in the CPU architecture). The size of the stack is limited, and putting too much data on it can cause unexpected errors without warning. (Note that compilers may choose to reserve the stack space before executing any statement of the function, and the output is buffered, so the "entering test" hint might not be printed.) This kind of error is hard to trace during debugging. The stack limit is set by the operating system and can often be raised, but this trap is still commonly seen nowadays.

To avoid such segmentation faults, we need to put the big structures on the heap. Two methods can be used. One is allocating memory inside the class:

class Big {
   int *data;
public:
   Big() { data = new int[10000000]; }
   virtual ~Big() { delete[] data; }   // arrays need delete[], not delete
};

The other is allocating memory inside the function instead:

void test() {
   cout << "entering test";
   Big *big = new Big;
   delete big;
}

Wednesday, November 07, 2007

C++ note: pointer and reference

The * operator dereferences a pointer to get the object it points to. The result is an lvalue, so when *p is passed to a reference parameter, the reference is bound to the very object the pointer points to. This is illustrated with the following example.

void reference( int &i ) { i = 4; }

void testReference() {
   int i=5;
   int *p = &i;
   reference( *p );
   cout << i << endl;
}

As can be seen from the output (4), the i in the testReference function is actually modified. This shows the semantics of the * operator when its result is bound to a reference.

Friday, November 02, 2007

C++ note: predefined macros

All standard-conforming C++ compilers support the following macros:

__DATE__ - a string literal containing the date when the source file was compiled, in the form of "Mmm dd yyyy".

__FILE__ - a string literal containing the name of the source file.

__LINE__ - an integer representing the current source line number.

__STDC__ - defined if the C compiler conforms to the ANSI standard. This macro is undefined if the language level is set to anything other than ANSI.

__TIME__ - a string literal containing the time when the source file was compiled, in the form of "hh:mm:ss".

__cplusplus - defined when the source file is compiled as C++.

These macros can be used anywhere in the source code, just as if they were defined by the #define directive. Their values change according to the specific file and line in the source code. Therefore, we can use them to implement error reporting that includes file and line numbers in C++ source code.

#define REPORT(x) do { cerr << endl << "In " << __FILE__ << ", line " << __LINE__ << ": " << endl << x << endl; } while (0)