Yue Zhang - Techlog

Sunday, March 30, 2008

Python note: the difference between getattr and getattribute

I tried to override __setattr__ today so that, when a value is assigned to item.type, I can check whether it is in the set of allowed choices. However, I made a mistake by defining a method named __setattribute__ instead of __setattr__.
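The check described above could look roughly like the following. This is only a sketch: the class name Item, the TYPE_CHOICES set and the choice of ValueError are assumptions made for illustration, not the original code.

```python
class Item(object):
    # Hypothetical set of allowed values for item.type.
    TYPE_CHOICES = frozenset(["book", "article", "thesis"])

    def __setattr__(self, name, value):
        # Runs on every assignment, so only the "type" attribute is validated.
        if name == "type" and value not in self.TYPE_CHOICES:
            raise ValueError("invalid type: %r" % (value,))
        # Delegate to the default behaviour to actually store the value.
        object.__setattr__(self, name, value)


item = Item()
item.type = "book"        # accepted
try:
    item.type = "movie"   # rejected, item.type keeps its old value
except ValueError:
    pass
```

Delegating to object.__setattr__ avoids the infinite recursion that a plain self.name = value inside __setattr__ would cause.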

There is no special method named __setattribute__. The methods that intercept attribute access in Python are __getattr__, __getattribute__, __setattr__ and __delattr__. The difference between __getattr__ and __getattribute__ is that __getattr__ is called only when normal lookup fails (the attribute is found neither in the object's dictionary nor on its class), while __getattribute__ is called on every attribute access. Therefore, defining __getattribute__ slows down all attribute access on the object. __setattr__ is triggered in the same way as __getattribute__: it intercepts every assignment, whether or not the attribute being modified already exists.

Another note about __getattr__ is that the overriding method must raise AttributeError itself for names it does not handle; otherwise the program may produce unexpected results. For example, consider this __getattr__ method:

class Foo(object):
   def __getattr__(self, attr):
       if attr == "bar":
          return "bar"

foo = Foo()

Now when we access foo.barrrr, which doesn't exist, we get None instead of an AttributeError, because the method falls off the end without returning anything. The code should be corrected to:

class Foo(object):
    def __getattr__(self, attr):
        if attr == "bar":
            return "bar"
        else:
            # any name we do not handle must raise AttributeError explicitly
            raise AttributeError(attr)

Thursday, March 20, 2008

Python tool: traditional to simplified Chinese converter

I just wrote this script to convert traditional Chinese text to simplified Chinese. Since the relationship between traditional and simplified characters is many to one, I haven't decided whether to write the reverse conversion script.

It has been tested with my files and can be downloaded here. Please report any bugs or suggestions.

The package contains two files, simplify.py and utftable.txt. The python script is the converter and utftable.txt is the character table. The two files must be put into the same directory.

Usage:
python simplify.py input.txt >output.txt

Both the input and the output text files must be in UTF-8.

Note that you can replace the character relationship table file with your own file (the new file must be in the same format as the original file), just in case there are more comprehensive tables than this one.
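The idea behind a table-driven converter can be sketched as follows. This is not the code of simplify.py: the in-memory TABLE with four entries stands in for utftable.txt (whose file format is not shown here), and the function names are made up for the illustration.

```python
# A tiny stand-in for the character table (traditional -> simplified).
TABLE = {
    "學": "学",
    "國": "国",
    "發": "发",
    "髮": "发",  # a second traditional character with the same simplified form
}


def simplify(text, table=TABLE):
    """Replace each character found in the table; leave the rest untouched."""
    return "".join(table.get(ch, ch) for ch in text)


def reverse_candidates(table=TABLE):
    """Group traditional characters by their simplified form."""
    candidates = {}
    for traditional, simplified in table.items():
        candidates.setdefault(simplified, []).append(traditional)
    return candidates


print(simplify("學國 abc"))
print(sorted(reverse_candidates()["发"]))
```

The second function shows why the reverse direction is harder: one simplified character can map back to several traditional ones, so a plain lookup table is not enough.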

Tuesday, March 18, 2008

sqlite note: using the command line tool

The command line tool sqlite3 can be used to view the content of a database. One way of using it is typing in "sqlite3 FILE" and the database contained in FILE is opened for query commands.

The command line tool sqlite3 can also be used to perform a query directly. For example, typing "sqlite3 FILE 'select * from Table1'" will print out all contents in table Table1. This is handy for showing large tables, because we can pipe the output into a reader tool. "sqlite3 FILE 'select * from Table1' | more".

Monday, March 10, 2008

C++ note: static_cast from a reference to a value

By default, the result of a static_cast to a non-reference type is an r-value. It can't be used as an l-value, and thus can't be assigned to or bound to a non-const reference.

Such misuse will lead to compile errors, but the report from the compiler can be misleading or quite hard to understand. For example, suppose we have a base class Base and a class Derived derived from Base. We want to overload the istream >> operator for Derived. The following way of overloading the operator does not work:

istream & operator >> (istream &is, Derived &derived) {
   // special processing
   is >> static_cast<Base>(derived);   // error: the cast creates a temporary Base
   return is;
}

This is because the result of the cast is a temporary, which cannot be passed to a function parameter of non-const reference type. The error reported by the compiler has nothing to do with the cast, however; it simply says that there is no match for operator >> from ...

Note that when the cast is applied to pointers, the resulting pointer is still an r-value.

To avoid the above problem, use static_cast<Base&>.

Monday, January 28, 2008

Python note: the main entry of a python file

In a Python module we could write

if __name__ == "__main__":
    # the main program entry
    ...

and this is often taken as the main entry point of the module. This may be misleading for C++ programmers, because Python doesn't actually need a main entry point to run a module.

Unlike C++, each Python module is treated as a sequence of executable statements. Whether it is imported or run as a program, Python always executes the file from the first line to the end. This is also the reason why Python is called a scripting language. Generally, there are three types of statements in Python:

1. import statements: when Python meets one, it checks sys.modules to see whether the imported module has already been loaded. If it has, Python does not run the module again and only binds its name. Otherwise, Python goes into the imported module and runs every statement in it.

2. definition statements: these include "class" and "def"; when Python meets a "def" it does not run the function body, but creates the function object and stores it under its name (normally in the __dict__ of the current module). A "class" statement likewise creates a class object and binds it to its name.

3. execution statements: all the other statements, including assignments, conditions and branching statements, function calls etc; Python runs them according to their semantics.

When a module is run as the main program, Python sets the module's __name__ variable to the special value "__main__". The check __name__ == "__main__" ensures that the statements under it are executed only when the module is run as the main program, not when it is imported. There can be as many such checks as needed, and they can occur anywhere in a module. There is no particular function or statement that Python takes as the main entry point of a module.
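The top-to-bottom execution and the role of __name__ can be reproduced without creating files, by executing the same source text under two different module names. This is a sketch: SOURCE, run_as and the events list are made up for the illustration, and exec is used here only to imitate what Python does when it loads a module.

```python
import types

# Source of a small module; "events" is a list we inject to record the order
# in which the statements run.
SOURCE = '''
events.append("top of module")

def helper():
    events.append("helper called")

events.append("after def")

if __name__ == "__main__":
    events.append("running as main")
    helper()

events.append("end of module")
'''


def run_as(name):
    """Execute SOURCE as a module whose __name__ is `name`."""
    module = types.ModuleType(name)   # sets module.__name__ to `name`
    module.events = []
    exec(compile(SOURCE, "<demo>", "exec"), module.__dict__)
    return module.events


print(run_as("__main__"))
print(run_as("imported_demo"))
```

In both runs the def statement only creates helper; its body runs only where it is called, which here is inside the __name__ check.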

Friday, January 25, 2008

C++ Makefile case study: incorrect dependencies cause unexpected errors

Incorrect dependencies in Makefiles can cause not only timestamp confusion, but also unexpected errors and segmentation faults after compiling.

Suppose that there are three files: header.h, module1.cpp and module2.cpp. Both cpp files include the h file for the definition of common structures. module1 generates module1.o and module2 generates module2.o, which will be linked together.

Now suppose that, in the Makefile, header.h is missing from the dependencies of module1.o and module2.o. An immediate problem is that "make" won't recompile the program if only header.h has been modified since the last build. But this is not the worst problem. Suppose both header.h and module1.cpp are modified, and the common structure is changed. Now when "make" is executed, only module1.o is rebuilt, and the build may finish without any complaint, yet various strange bugs or unexpected results appear later at runtime. The reason is that module2.o was compiled against the outdated version of the data structure. Such a case can be confirmed by running "make clean" and then compiling again to see whether the unexpected errors go away.

Monday, January 14, 2008

C++ question - linker failed

I defined a constant boolean in an external header file:
const bool CONDITION = false ;

Then I wrote the following lines in a function:
void func(...) {
   ...
   if (cond) {
      if (CONDITION) {
         ...
      }
      else
         ...
   }
   ...
}

Then g++ couldn't compile the code, reporting that the linker couldn't find the function.

I haven't figured out why it happened, but a solution is to replace the if statement and the constant bool value with an #ifdef directive and a macro.

Thursday, November 22, 2007

latex note: minipage with borders

One way to add borders to a minipage is to wrap it in an \fbox. For example (the % signs keep the line breaks from adding extra spaces inside the box):

\fbox{%
\begin{minipage}[t]{0.5\textwidth}
...
\end{minipage}%
}

Friday, November 09, 2007

C++ note: initializers in multiple inheritance

Here is a tutorial on C++ multiple inheritance.

A C++ class can inherit from multiple base classes. If these base classes have base classes of their own, those are initialized separately by default. (For each parent class, the subclass contains the inherited structures.) For example, suppose B and C derive from A, and D derives from B and C. Then D will have two copies of A, one inherited from B and the other from C.

On some occasions only one copy of A is needed in D. This type of inheritance is enabled by letting B and C derive virtually from A. With virtual inheritance, B and C no longer embed the structure of A at a fixed place in their own structure, but locate it indirectly (typically through information stored in their vtables). When D derives from B and C, the compiler generates only one A structure, which is shared by B and C.

Here comes a question: in the constructor of D, what structures do we need to initialize?

class A {
public:
   int a;
   A ( int i ) : a (i) {}
};

class B : virtual public A {
public:
   B ( int i ) : A(i) {}
};

class C : virtual public A {
public:
   C ( int i ) : A(i) {}
};

class D : public B, public C {
public:
   D ( int i ) : A(i), B(i), C(i) {} // <-------------
};

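In the above example, D has to initialize A, B and C in its constructor. The initialization of A is mandatory even though both B and C have initializers for A in their constructors. When D is constructed, the most derived class (D) is responsible for constructing the virtual base, and the A(i) initializers in B and C are ignored. If A(i) is removed from D's initializer list, the compiler tries to use A's default constructor instead; since A has none here, the code fails to compile. If A is given a default constructor, the effect can be observed by printing the a field of a D object: it will not hold the value passed to B and C.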

Thursday, November 08, 2007

C++ note: avoid allocating large structures in the stack

An unexpected segmentation fault when calling a function can be caused by a large allocation on the stack. The following piece of code is likely to fail.

#include <iostream>
using namespace std;

class Big {
   int data[10000000];   // about 40 MB, far more than a typical stack
};

void test() {
   cout << "entering test";
   Big big;
}

int main(int argc, char** argv) {
   test();
   return 0;
}

The reason is this: when test() is called, the variables declared in this function are put on the stack (a special area of memory for function calls, commonly supported by a stack pointer in the CPU architecture). The size of the stack is limited, and putting too much data on it can cause unexpected errors without warning. (Note that compilers may reserve the stack space on function entry, before any statement is executed, and therefore the "entering test" hint might not be printed.) This kind of error is hard to trace during debugging. The stack limit is set by the operating system and the linker settings rather than by the CPU, and although it can be raised, this trap is still commonly seen nowadays.

To avoid such segmentation faults, we need to put the big structures on the heap. Two methods can be used. One is allocating memory inside the class:

class Big {
   int *data;
public:
   Big() { data = new int[10000000]; }
   virtual ~Big() { delete[] data; }   // arrays must be released with delete[]
};

The other is allocating memory inside the function instead:

void test() {
   cout << "entering test";
   Big *big = new Big;
   delete big;
}

Wednesday, November 07, 2007

C++ note: pointer and reference

The * operator dereferences a pointer. The result, *p, is not a copy of the value but an l-value that refers to the object p points to, so when *p is passed to a reference parameter, the reference is bound to that object. This is illustrated with the following example.

void reference( int &i ) { i = 4; }

void testReference() {
   int i=5;
   int *p = &i;
   reference( *p );
   cout << i << endl;
}

As can be seen from the output (4), the i in the testReference function is actually modified. This shows that *p refers to the original object rather than to a copy when it is passed as a reference.

Friday, November 02, 2007

C++ note: predefined macros

All standard-conforming C++ compilers support the following predefined macros:

__DATE__ - a string literal containing the date when the source file was compiled, in the form of "Mmm dd yyyy".

__FILE__ - a string literal containing the name of the source file.

__LINE__ - an integer representing the current source line number.

__STDC__ - defined if the C compiler conforms to the ANSI standard. This macro is undefined if the language level is set to anything other than ANSI.

__TIME__ - a string literal containing the time when the source file was compiled, in the form of "hh:mm:ss".

__cplusplus - defined when the source file is compiled as C++; its value identifies the version of the C++ standard in use.

These macros can be used anywhere in the source code, just as if they were defined by the #define directive. Their values change according to the specific file and line in the source code. Therefore, we can use them to implement error reporting that includes file names and line numbers in C++ source code.

#define REPORT(x) do { cerr << endl << "In " << __FILE__ << ", line " << __LINE__ << ": " << endl << x << endl; cerr.flush(); } while (0)

Monday, October 29, 2007

C++ note: operator precedence

Another common trap is the relative precedence of == and the bitwise operators &, | and ^.

The expression a & b == 0 is interpreted as a & (b == 0) instead of (a & b) == 0, quite different from a + b == 0, which is interpreted as (a + b) == 0.

Conclusion: always add parentheses around expressions that use the bitwise operators &, | and ^.

Monday, October 01, 2007

Compile log: SRILM

SRILM is a language modelling package, used by the Moses translation system. The compilation is quite straightforward, except for two things:

1. The current path in the Makefile needs to be changed.
2. In common/Makefile.machine.xxx, there are paths to gcc and g++ that need to be checked.
GCC_FLAGS in the first line might need editing as well (for example, -mtune=pentium3).

Sunday, September 16, 2007

vim note: grep

The handy tool grep from Linux can also be used in Vim, by typing ":grep PATTERN FILES" in command-line mode. It searches all the designated files for text matching the pattern. Note that FILES needs to be given as an absolute path. Wildcards, such as "c:\\files\\*.txt", can be used. In version 7.0, "\\" is required as the path separator under Windows; "/" does not seem to work. Regular expressions can be used for PATTERN.

By default, only the first search result is shown. Use ":cn" to navigate to the next search result, and ":cp" to navigate to the previous one. The navigation can jump from one file to another, of course. When multiple buffers (or files) are open, use ":bn" and ":bp" to jump from one file to another. Use ":bd" to remove a file from the buffer list.

Saturday, September 15, 2007

mod_python note: sessions

mod_python has its own modules for Session and Cookies. The mod_python documentation contains concise explanations of their usage. Cookies are used for session maintenance. An introduction can be found here.
Python also supports cookies through a built-in module, which was probably developed for CGI programming before being adopted into the Python core. mod_python can use this module for cookies as well.

Python note: using the windows clipboard

The Windows clipboard allows HTML content to be processed. A specification of the format can be found here.

Python support for the Windows clipboard is provided by this site package.

A recipe for using the clipboard, written by Phillip Piper, can be found on this web page. I have put a simplified version for placing content on the clipboard here.

Sunday, September 09, 2007

Apache note: mod_python global objects

The term "global objects" in this article means objects that exist from the start of the web server until its shutdown. An example of such a global object is a database proxy, which is initialized at server start and handles database calls throughout the serving session.

At first glance, there appeared to be no place for defining global objects, because mod_python runs on a per-request basis, with each request being mapped to a call.

However, mod_python has the advantage over CGI that one Python interpreter is created per virtual server to handle all requests to that server. The interpreter starts when the server starts, and lasts until the server is shut down. Here is a reference for the multiple interpreter mechanism of the mod_python module.

To take advantage of this, global variables can be placed in the global namespace of the Python interpreter. For example, a global object could be initialized in a module's namespace.
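The module-namespace approach can be sketched in plain Python, leaving the mod_python API itself out. DatabaseProxy and handle_request are hypothetical names; in a real mod_python handler the per-request function would receive a request object rather than a path string.

```python
class DatabaseProxy(object):
    """Stand-in for an expensive resource that should be created only once."""
    instances_created = 0

    def __init__(self):
        DatabaseProxy.instances_created += 1
        self.calls = 0

    def query(self, text):
        self.calls += 1
        return "result of %r (call %d)" % (text, self.calls)


# Module-level name: created when the module is first imported and kept
# alive for as long as the interpreter that imported it.
db = DatabaseProxy()


def handle_request(path):
    """Per-request entry point; every call reuses the same db object."""
    return db.query(path)


print(handle_request("/a"))
print(handle_request("/b"))
```

Because the module is imported only once per interpreter, db is constructed once and then shared by every later call.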

Apache note: set up a mod_python server

Though building mod_python can be a little daunting under UNIX, it's quite easy to start a mod_python server with Apache on the Windows platform.

Suppose that the machine already has Python installed.

First, download the Apache HTTP server from the Apache web site (you need to find the download site by following the links from the main site). The current version is 2.2.

Second, download the mod_python binaries for Windows from the mod_python web site. The current version is 3.3.

Third, install Apache following the instructions.

Fourth, install mod_python following the instructions. The information on the last page of the installer is important: the LoadModule command must be added to Apache's httpd.conf file so that the HTTP server recognizes mod_python.

The installation test can be found in the documentation. Note that any URL of the form http://127.0.0.1/test/*.py will work, because the request URL is handed to mptest.py as a parameter.

Latex note: how to shrink tables

To reduce the size of a table, two general methods can be used.

1. Shrink the font size.
For example, use {\small content} instead of content in the table cell.

2. Squeeze space between columns.
For example, a table could be written in the following way (the length is set before the table starts, so that it takes effect for this table):
\setlength{\tabcolsep}{1pt}
\begin{tabular}
...
\end{tabular}