Problems encountered in testing the module

2012-08-11T20:09:00.000-07:00

For the last two weeks I have been trying to test my new functions in the module. Unlike the normalize function, which has a test script, they are harder to test. I found that they are used in the bib merge function, so I tried to extract some real use cases from it. With those I can build test cases for vandelay.add_field and vandelay.strip_field, which are called by vandelay.merge_record_xml.

So I planned to batch import some MARC records and extract real cases where a match requires the records to be merged. But I failed to batch import the MARC records (oca_unicode.mrc and Open Access Titles UOP). I tried many approaches, but the import process always crashes after a long wait, and no error is recorded in the logs. So I have to turn to the community for help.

New functions in extension c_functions

2012-07-29T20:01:00.000-07:00

Last week I implemented two functions, vandelay.add_field and vandelay.strip_field. They are the first functions in the c_functions extension that handle MARC in C.
These functions rely on libxml2, libxslt and ICU4C.
The extension now also has some general C functions that do in a simple way what certain Perl functions do, which should speed up translating Perl scripts into C programs.
I pushed the branch to evergreen/working. You can find the branch on http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dsy/extension_in_c.
I pushed the earlier oils_xslt_process function, too.
None of the functions has been tested thoroughly yet. I'm going to test them in the coming week.

Progress on project

2012-07-21T07:39:00.000-07:00

In the previous week, I implemented the oils_xslt_process function in C, using libxml2 and libxslt. But I didn't find a test script or anything else to test it with. Could anyone give me a clue about where it is used or where I could test it? I will push it to git after I test it.

I'm beginning to implement some of the vandelay functions in C. After reading through the existing PL/PerlU code, I think I can implement them with ICU4C and libxml2. I will also supplement them with some general processing functions in C, and build up a series of C functions to handle MARC records, such as parsing a MARC record in XML or adding a MARC field in XML. I think it will be a lot of work.

Summary on previous work

2012-07-16T21:00:00.000-07:00

It's been one and a half months since I started my GSoC project on Evergreen. I have been working on optimizing the performance of the Evergreen system. Sorry to everyone in the community for posting this so late.

In the first period of my project, I implemented the normalize functions as stored procedures in C. They were previously implemented in PL/PerlU. I used the ICU4C library to implement them. This is the first PostgreSQL extension module in Evergreen, so I created a folder named extension in /Open-ILS/src/sql/Pg/. All the extension-related files are stored in this path.
After thorough testing with the script naco_normalize.t, I pushed my branch to our git. You can find it at
http://git.evergreen-ils.org/?p=working/Evergreen.git;a=shortlog;h=refs/heads/user/dsy/normalize_in_c.

The normalize functions have now been tested on Debian and Windows (with slightly different code). You are very welcome to download them and test them in your environment.
After testing the functionality of the normalize function, I inserted some timing tests into the naco_normalize.t script (I didn't push them). The results show that the performance difference between the C version and the Perl version is slight. So I started to optimize my extension.

The first optimization I found is that there were a lot of palloc calls in my function, and some of them were unnecessary. So I deleted them.

The second optimization came up when I traced into the ICU4C library code. In the previous version, I loaded an instance of the normalizer in every normalize function call. But loading is time-consuming because it reads a large file from disk. So I added a _PG_init function to the extension, which is called when the extension library is loaded into memory. This way the load procedure runs only once, when PostgreSQL loads the extension library.
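The load-once idea can be shown without PostgreSQL or ICU. The following is only a minimal standalone sketch, not the code from my branch: normalizer_t, load_normalizer, module_init and normalize_call are hypothetical names, and module_init merely plays the role that the library init function plays in the real extension.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an expensive-to-load resource such as an
 * ICU normalizer instance; the real type and loader are not shown. */
typedef struct {
    int ready;
} normalizer_t;

static normalizer_t g_instance;           /* storage for the single instance */
static const normalizer_t *g_normalizer;  /* NULL until module_init() runs */
static int g_load_count;                  /* how many expensive loads happened */

/* Simulates the costly load (in the real case: reading data from disk). */
static const normalizer_t *load_normalizer(void)
{
    g_load_count++;
    g_instance.ready = 1;
    return &g_instance;
}

/* Plays the role of the init hook that runs when the library is loaded.
 * Calling it again does not repeat the load. */
void module_init(void)
{
    if (g_normalizer == NULL)
        g_normalizer = load_normalizer();
}

/* Per-call function: reuses the cached instance instead of loading it. */
int normalize_call(void)
{
    assert(g_normalizer != NULL);
    return g_normalizer->ready;
}

/* Exposes the load counter so the effect can be checked. */
int load_count(void)
{
    return g_load_count;
}
```

After one module_init(), any number of normalize_call() invocations leave load_count() at 1, which is the whole point of moving the load out of the per-call path.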

My tests show that these optimizations give some performance improvement. But the improvement of the C version over the Perl version is not as significant as we expected. After some analysis and discussion with my mentor, I think the reason is that the input to the normalize function is rather small, so the improvement is swamped by other system overhead. I guess the decisive bottleneck lies somewhere else: because the normalization function handles only very small inputs, the execution time difference between the C version and the Perl version will be in milliseconds. Compared to that, searching millions of MARC records, index searching, result ranking or other procedures that handle millions of rows will cost far more than normalization.
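The reasoning above, that speeding up a small part of a request barely changes the total, can be put into numbers with Amdahl's law. The sketch below is only an illustration: overall_speedup is a hypothetical helper name, and the 1% share and 10x speedup used in the note after it are made-up figures, not measurements from Evergreen.

```c
#include <assert.h>

/* Overall speedup of a request when a fraction p of its time
 * (0 <= p <= 1) becomes s times faster (s > 0) and the rest is unchanged.
 * This is Amdahl's law: 1 / ((1 - p) + p / s). */
double overall_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0);
    assert(s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With the made-up figures, overall_speedup(0.01, 10.0) is about 1.009: the whole request gets less than 1% faster even though the normalization step itself is ten times faster.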

In the following weeks, I began to analyze the system logs of OpenSRF and PostgreSQL, trying to locate a real bottleneck in our system. The example I used is a simple search. At the beginning, I had only a very small data set in my database, as the only data set I could find in my virtual machine image was concerto.sql. A week ago, I got the Content Alliance records data set and the OpenSRF log of a running system with the help of Dan Scott and Rogan Hamby. Thanks for their help.

In the OpenSRF log file osrfsys.log, I found the following entries.

--------------------------osrfsys.log---------------------------------

[2012-07-10 22:17:27] open-ils.storage [INFO:2272:Pg.pm:67:] Attempting to connect to evergreen at localhost

[2012-07-10 22:17:28] open-ils.storage [INFO:2272:Pg.pm:97:] Connected to MASTER db evergreen at localhost

[2012-07-10 22:17:28] open-ils.storage [INFO:2243:Transport.pm:163:] Message processing duration: 3.141

[2012-07-10 22:17:28] open-ils.search [INFO:2233:Biblio.pm:1281:] staged search: DB call took 3.1780788898468 seconds and returned 0 rows, including summary

[2012-07-10 22:17:28] open-ils.search [INFO:2233:Biblio.pm:1285:] search returned 0 results: duration=3.1780788898468: params={"estimation_strategy":"inclusion","skip_check":0,"limit":"1000","check_limit":"1000","core_limit":10000,"offset":0,"searches":{"keyword":{"term":"test depth(0)"}}}

[2012-07-10 22:17:28] open-ils.search [INFO:2233:Biblio.pm:1541:] staged search: cached with key=open-ils.search_d16605fbc7a72a177177f4386d751d15, superpage=0, estimated=, visible=0

[2012-07-10 22:17:28] open-ils.search [INFO:2233:Biblio.pm:910:] compiled search is {"estimation_strategy":"inclusion","skip_check":0,"limit":10,"check_limit":"1000","core_limit":10000,"offset":0,"searches":{"keyword":{"term":"test depth(0)"}}}

[2012-07-10 22:17:28] open-ils.search [INFO:2233:Transport.pm:163:] Message processing duration: 3.286

-----------------------------------------------------------------------

At first, I misunderstood the log info. I thought the Message processing duration was only the time spent on message passing and handling, so I concluded that message processing took as much time as the DB execution. But after I added some timers to Transport.pm and some related files, I found that the message processing duration is the total time used to handle a request, which includes the DB execution time. So in a search, most of the time is still spent in the DB. Then I shifted my focus to the DB logs. I changed the log level and the info recorded in the log, so that it shows the details of DB execution when responding to a search request. But so far the info has not shown a bottleneck in any particular procedure. I'm still working on it.
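Since the message processing duration already contains the DB time, the part of a request that is not spent in the DB is just the difference of the two numbers in the log. Here is a small sketch of that arithmetic in C. The helper names duration_after and non_db_seconds are hypothetical, and the marker strings are simply copied from the log lines quoted above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Finds marker in line and parses the floating-point number that
 * follows it. Returns 1 on success, 0 if the marker or number is missing. */
int duration_after(const char *line, const char *marker, double *out)
{
    assert(line != NULL && marker != NULL && out != NULL);
    const char *p = strstr(line, marker);
    if (p == NULL)
        return 0;
    return sscanf(p + strlen(marker), "%lf", out) == 1;
}

/* Time of the request that is not accounted for by the DB call. */
double non_db_seconds(double total, double db)
{
    return total - db;
}
```

For the open-ils.search request above, the total is 3.286 seconds and the DB call took 3.1780788898468 seconds, so only about 0.108 seconds were spent outside the DB.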

At the same time, I think optimization in the DB is still meaningful, as the DB consumes most of the time in a single request such as a search. So I am sticking to my plan to implement other plperlu functions in C. I think it will improve the performance of some functions in the system to some degree.

In the previous work I also got a lot of help from another student, Joseph. Thanks also to my mentor Mike, to Galen and to the whole community.
