Friday, December 26, 2008

Saving on memory usage in Java #1 - the Byte.valueOf method

Say you wanna keep in memory a list of martial arts experts and their respective shoe sizes. One way to implement this is to populate a Map with the following key/value pairs:

Map<String, Byte> map = ...
map.put("Jean-Claude Van Damme", new Byte((byte) 45));
map.put("Jet Li", new Byte((byte) 45));
map.put("Chuck Norris", new Byte((byte) 112));
...
map.put("person number million", new Byte((byte) 45));

If your JVM runs on a Lego mechanical computer with a very limited amount of memory, you would probably want to save memory wherever possible.

[caption id="attachment_65" align="alignnone" width="300" caption="A state of the art 3Hz Lego computer"]A Lego computer[/caption]

Autoboxing anybody?


Keep in mind that an object instance weighs much more than just the primitive it holds, as it carries additional "plumbing" data (a monitor, etc.). Even a bare Object instance weighs 8 bytes while holding no application information at all. So what about keeping only primitives as the map values?
Autoboxing, introduced in Java 5, lets you pass a byte primitive argument where a Byte object instance is expected, in the following manner (note the cast: an int literal would otherwise box to an Integer):

map.put("Bruce Lee", 42);

Does this help us avoid the costly Byte objects? Not by itself. Autoboxing, as the name hints, is compile-time syntactic sugar: the compiler statically replaces the primitive with a boxing conversion that produces a Byte instance. Whether that conversion creates a new object or reuses an existing one is exactly the question, so on the face of it we're back where we started.

[caption id="attachment_66" align="alignnone" width="300" caption="AutoBoxing"]AutoBoxing[/caption]

How about a plain old cache?


Examining the code above, you'll notice that it creates one million unique Byte objects to hold the fighters' shoe sizes, even though a byte can only take 256 distinct values. Is this an avenue for saving?
Considering that Byte objects are immutable, why not keep just a single Byte object for each distinct byte value (we'd need only 256 instances to cover all values)? This way we pass the same Byte instance for every person with a size-45 shoe, Jean-Claude and Jet Li in our case, reducing the number of Byte instances from a million to at most 256. Sounds super!

[caption id="attachment_67" align="alignnone" width="201" caption="Too much objects in memory"]Memorizing too many objects is hard[/caption]

How do you implement this? You might rush into initializing an array of shared Byte objects during application start-up, giving birth to something of this sort:

// init the shared instances array once, during application start-up;
// shoe sizes are positive, so the 0..127 range covers them (we don't care about negatives)
static final int RANGE_OF_VALUES = Byte.MAX_VALUE + 1; // 128 distinct values
static final Byte[] constShoeSizes = new Byte[RANGE_OF_VALUES];

static {
    for (int i = 0; i < RANGE_OF_VALUES; i++) { // int counter: a byte counter would wrap and never terminate
        constShoeSizes[i] = new Byte((byte) i); // one shared instance per value
    }
}
map.put("Jean-Claude Van Damme", constShoeSizes[45]);
map.put("Jet Li", constShoeSizes[45]);
map.put("Chuck Norris", constShoeSizes[112]);

Enter the valueOf() method


WHOA! Hold your horses! Doesn't this use case seem just too common and trivial?! Haven't the Java language designers and implementers come across the same problem? Surely some of the JRE classes themselves hold Byte data members. In an effort to reduce the JRE memory footprint, wouldn't the JRE programmers cache instances using something very much like the static Byte array we just implemented ourselves?
The short answer, of course, is YES! Java 5 introduces a new overloaded Byte.valueOf(byte b) method. This method returns a reference to a Byte instance taken from a shared cache. This trivial caching strategy saves both memory and CPU, as there's no need to construct new objects and later garbage-collect them.
Here's the relevant Byte.valueOf method source code, taken from Byte.java:

private static class ByteCache {
    private ByteCache() {}

    static final Byte cache[] = new Byte[-(-128) + 127 + 1];

    static {
        for (int i = 0; i < cache.length; i++)
            cache[i] = new Byte((byte)(i - 128));
    }
}
...
public static Byte valueOf(byte b) {
    final int offset = 128;
    return ByteCache.cache[(int)b + offset];
}


Using the valueOf method, here's how the final version of our code looks:

map.put("Jean-Claude Van Damme", Byte.valueOf(45));
map.put("Jet Li", Byte.valueOf(45));
map.put("Chuck Norris", Byte.valueOf(112));

Wrapping up quickly:



  1. From Java 5 onwards, use the valueOf method for Number extenders like Byte, Short, and Integer. Notice that since the Integer type has 2^32 different values, only the -128 to 127 range is cached, meaning that the expression (Integer.valueOf(129) == Integer.valueOf(129)) will always be false, as a new Integer object is returned on every call (see the sketch after this list).
    The valueOf methods of the other wrapper types (Double, Float, etc.) do not implement a cache at all. If your value range is limited in nature, you might choose to create a caching scheme of your own.

  2. Always be on the lookout: inspect repetitive instance creation closely, and see if you can avoid it by referencing a shared immutable object, or by borrowing an instance from an object pool.

  3. Strings can yield even larger space and time performance gains than the number objects, though at the same time they are inherently harder to reuse. You might want to take the time to learn about String instance reuse strategies; start with the String.intern() method.
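
To see the cache in action, here's a minimal sketch (the class name and sample strings are mine; the printed results follow from the caching rules described above):

public class ValueOfCacheDemo {
    public static void main(String[] args) {
        // every byte value is cached, so the same instance is returned each time
        System.out.println(Byte.valueOf((byte) 45) == Byte.valueOf((byte) 45)); // true
        // Integer caches only the -128 to 127 range
        System.out.println(Integer.valueOf(127) == Integer.valueOf(127)); // true
        System.out.println(Integer.valueOf(129) == Integer.valueOf(129)); // false
        // the String counterpart from item 3: interned strings are shared
        System.out.println(new String("bruce").intern() == "bruce"); // true
    }
}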

Tuesday, November 18, 2008

Java Podcasts - tune in to the Java Posse

I haven't written anything new for a long time now, partly due to spending most of October vacationing in Thailand, and partly due to the massive workload that landed on my doorstep when I returned from the Thai islands.

Until I'm done clearing my desk and can get back to writing here, I thought I'd leave you with some quality Java development babbling to help drive away the boredom of the daily commute.
Now, these four dudes must have a lot of free time on their hands, delivering an hour-long quality podcast every week. Check out the Java Posse here.

Thursday, September 4, 2008

Miniature scale pro-active IT - Creating Windows service dependencies

Last night we had a scheduled power shutdown in the dev lab. This morning, the lab manager rose early to get all the servers running before the armies of developers arrived at the office. My work includes interacting with a Lotus Sametime server (IBM's Instant Messaging (IM) server), so I run my own private IM server. Starting the day's work, I quickly noticed that my IM client failed to log in to the IM server. In fact, none of the developers could log in to their own private IM servers.
Remembering that during the client log-in process the IM server validates the client-supplied credentials against a central LDAP server, the LDAP server became an immediate suspect.

The LDAP server we're using is an old IBM ITDS LDAP server running on Win2000. It's comprised of two processes: the ITDS process, which parses and executes LDAP queries, and a DB process (DB2) that takes care of data persistence. Both processes are registered as Windows services.

The investigation commenced! Maybe the LDAP server was down? I went ahead and checked the ITDS and DB2 service statuses; both were running. Hmm... I moved on to inspect the ITDS log, and saw that during its start-up stage it had failed to create a connection to DB2, and had therefore resorted to starting in a "crippled", configuration-only mode. That means it just sits there, wasting random CPU cycles, giving the illusion that it's there to provide service, while not actually answering any queries.
To remedy the situation I simply restarted the ITDS service. It started up normally and began servicing incoming LDAP queries from the IM servers.
At this point you might be tempted to announce worldwide: "I fixed it!", but before you do that, stop and think about it for a minute: what is it exactly that you fixed? Indeed, ITDS began servicing queries and the clients can log in, meaning the current manifestation of the problem was eliminated, but did you fix the problem itself? Part of being pro-active means solving future problems before they actually occur. What stops the problem from re-occurring the next time someone decides to restart the server? To solve it for good, you first need to understand what caused the problem.

So, what happens during a server start-up? The ITDS and DB2 services' start-up type is set to Automatic, so they start when the OS starts. The DB connection error message fits a scenario in which the ITDS service started and tried to connect to DB2 before the DB2 service had finished starting up.
We would like to instruct the ITDS service to be less hasty, and to wait for the DB2 service to finish starting before stepping into its own start-up process. This can be achieved by defining a service dependency, stating that the ITDS service depends on the DB2 service.

Implementing it: dependencies can't be created through the Windows MMC "Computer Management" snap-in GUI, so you'll have to get your hands dirty with registry mud, using the procedure sketched below.
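
For reference, and assuming the standard Windows service registry layout: the dependency lives in a REG_MULTI_SZ value named DependOnService under HKLM\SYSTEM\CurrentControlSet\Services\<ITDS service name>, and adding the DB2 service's name to that value does the trick (the bracketed service names are placeholders for whatever names your installation registered). The same edit can be made from the command line with the built-in sc tool: sc config <ITDSServiceName> depend= <DB2ServiceName>
With the dependency in place, the Service Control Manager won't start ITDS until DB2 reports that it is running.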

Problem uprooted! You've won your pro-activity badge.

Friday, July 25, 2008

Google I/O 2008 - Josh Bloch talk - Effective Java 2nd edition

Nowadays, technology-eager and innovation-craving programmers can find abundant learning material online, served in an easy-to-swallow-and-digest form such as video casts. One of the most prolific sources of technical info is Google, which has now posted the video sessions from the Google I/O dev gathering of May '08.
In this video, Josh Bloch (formerly at Sun) gives an hour-long session about the new second edition of his Effective Java book. I found the session only somewhat interesting (enum sets are not my main point of interest), plus the video quality is not ideal for reading through source code.
The book under discussion is a well-assembled compilation of 78 Java best practices (although, to be honest, I've only read about 20% of it). Another great book of his, which I've read and plan to post about here, is Java Puzzlers.

Listed below are other sessions I watched, or plan to watch (sadly, most sessions are about web programming and the client side - not my cup of tea).

Google I/O 2008 - Underneath the Covers at Google - GFS, Bigtable, and the MapReduce parallelism library.
I wonder what similar parallelism constructs IBM has up its sleeve...

Best Practices - Building a Production Quality Application on Google App Engine (production stuff - I like news from the front)

Dalvik VM Internals (that's Google's implementation of Java, built to avoid paying Sun royalties for JME)

How Open Source Projects Survive Poisonous People (programmer intrigues are always interesting)

Painless Python for Proficient Programmers (I'm starting to work my way through Python these days)

Sunday, June 29, 2008

Book of the month - Linux Server Hacks

I've just read through most of O'Reilly's Linux Server Hacks book.
I expected another dull Linux how-to book that goes over the man/info pages of the most obvious commands, but instead I found an interesting, original, advanced, hardcore book, full of Linux goodies to brag about in front of my colleagues.

[Image: toilet fun]

Some noteworthy items:

  • A thoughtful, effective usage of SSH, especially as a secure channel for moving bits around the network between a pair of processes, each running on its own host (see the example after this list).

  • How to reset your root password, without a rescue disk, using the LILO boot loader.

  • I didn't know about ext2/3's chattr and lsattr before reading the book...

  • Periodic rsync runs could save a lot of wasted scp time.

  • (#44) burning a CD over the network using a pipe - cool

  • (#50) setting up a VPN using IPIP tunneling :-)

  • (#57) lsof - hey, I've been using it for years.

  • (#63) Loved learning that the send_arp utility can help me update the IP->MAC mappings cached by all of the subnet's machines (and the router?). Handy when setting up a two-bit IP fail-over system.

  • (#68) ssh-agent - now I know what it is. Very useful in the hands of an almighty admin ruling over hundreds of minion machines.

  • (#73) Loved the one-liner Perl scriptlets.

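For flavor, the classic incarnation of the SSH-as-a-pipe trick from the first bullet looks something like this (the hostname and paths are made up): tar cf - /some/dir | ssh backuphost "tar xf - -C /backups"
The local tar writes an archive to stdout, SSH carries the bytes over an encrypted channel, and the remote tar reads them from stdin: two processes, two hosts, one secure pipe.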

To conclude: a must-have for your bathroom library.

Saturday, June 28, 2008

VMware: converting a hosted VM to a hypervisor VM - Linux troubleshooting

When using the VMware Converter utility to convert a VMware Player/Workstation/Server VM image to an ESX image, you might run into boot problems (a "kernel panic" message) if the VM you are converting runs Linux, due to SCSI driver issues.



I found a couple of resources about the problem, but none fully worked for me, so here is my special recipe.
The configuration I used: a RHEL 5.1 VM and an ESX 3.x server.


  1. Use the converter to load the image to the ESX

  2. If you start the converted image on the ESX, you will see a kernel panic message

  3. Go to the VMware Infrastructure Client -> ESX server -> VM properties -> Hardware -> SCSI Controller -> change from BusLogic to LSI Logic

  4. Load the VM's CD-ROM drive with the RHEL5 install disk (it also serves as a rescue disk)

  5. Boot the VM from the CD -> when prompted, enter: linux rescue

  6. The rescue disk should identify the Linux partition and mount it on /mnt/sysimage

  7. After getting a prompt enter: chroot /mnt/sysimage

  8. Back up, and then edit, /etc/modules.conf; add this line: alias scsi_hostadapter BusLogic

  9. Back up the current ramdisk file: cp /boot/initrd-[version].img /boot/initrd-[version].img.bak

  10. Rebuild the ramdisk with the new module and overwrite the existing one: mkinitrd -f -v /boot/initrd-[version].img [version]

  11. Reboot the OS.

  12. Boot from the hard drive - the system should start normally




It's weird that VMware doesn't bother providing proper official documentation for this.
Kudos to the VMware user community!

Saturday, June 21, 2008

I'm changing the hostname. Deal with it!

Lately, I've been crossing paths with too many enterprise-level server products that, once installed, can't tolerate any change to the local machine's hostname.
Don't get me wrong, I'm not so spoiled as to wish that a hostname change be handled at run-time, without a restart. I'm not even suggesting that the change be automatically detected and processed on the next product restart. I'm much more modest than that: just having a documented, working procedure for doing it offline would make me a happy man. The current, glum, state of affairs is that some products have to be completely re-installed if the hostname changes.





Some of the reasons for changing a machine's hostname might be:
(1) You want to clone new servers from a best-practice, already-installed server template, where each cloned copy should get a unique computer name (very useful in test environments, and especially handy when making a VM duplicate of a template virtual machine).
(2) You have an existing server whose business role has changed: you plan to install a new application module (EAR), but want to keep the existing middleware infrastructure (JEE AS).
(3) You no longer want the server to be reachable by its original name (without resorting to DNS administration and alias tricks).
(4) You want to implement a new server naming convention in your production environment.

Now, programmers, how hard can it be to live in peace with a dynamic hostname?
(1) If you are sure that the target network resource is the local machine, just use the localhost loopback interface instead of a hostname when addressing it.
(2) Query the OS when retrieving the machine's hostname, instead of relying on static, stale, sometimes binary configuration files (see the sketch after this list).
(3) Keep all application network resources in a centralized configuration repository, and provide an offline API for the admin to access it.
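
To illustrate point (2), here is a minimal Java sketch (the class name is mine) that asks the OS for the hostname at run-time instead of reading it from a stale configuration file:

import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostnameLookup {
    public static void main(String[] args) throws UnknownHostException {
        // resolved at run-time, so a renamed machine is picked up on the
        // next restart, with no re-installation required
        String hostname = InetAddress.getLocalHost().getHostName();
        System.out.println("Running on: " + hostname);
    }
}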

As a side note:
IBM WAS ND 6.x now has a long-awaited offline API for updating a machine's hostname.
If you know of and care about other products that do or don't support hostname updates, please leave a comment.

Saturday, May 24, 2008

Cycling through the Integer range - A Fermi problem

After graduating in economics in the summer of 2005, I went interviewing for a business analyst position at a couple of business consulting firms (e.g. McKinsey & Company).

Since real-life business dilemmas require estimating and making decisions under uncertainty (not all of the required information is available, nor is it accurate), a major part of the interviews at these types of firms is confronting you with "How many pay phones are there on the island of Manhattan?" type problems, also known as Fermi problems.
Although at first these problems seem quite puzzling, provided you remain focused and methodical and leverage a modest amount of common sense, it all gets pretty easy. The "trick" is to combine basic facts you already know with some fourth-grade algebra; doing this brings you to good-enough estimates.


[Image: a midtown Manhattan city street]



Allow me to introduce a quick CS Fermi problem that someone threw around at the office. The problem might also be presented in an interview with fresh-graduate candidates. Here goes:

Running on your average home computer (a single 2GHz core), how long would it take for this Java program to complete its operation?

long startTime = System.currentTimeMillis();
for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
}
System.out.println(System.currentTimeMillis() - startTime);



How long? Two nanoseconds? Three seconds? Four hours? Five years? Six centuries? Seven millennia? What's important here is the order of magnitude, not the exact answer. You might find this question trivial, but you would be surprised how many people have no clue how to start answering it. Take thirty seconds and try to come up with your own estimate before reading through mine:

Let's compute a ballpark figure:
Since an int is a 32-bit creature, the loop will cycle 2^32 times, which is about 4.3 billion times (remember that a billion is 10^9). The average 2006 home computer CPU runs at about 2GHz, which means the CPU can perform about two billion simple instructions per second (complex instructions consume several CPU cycles).

The loop does three obvious operations on each cycle: (1) i is incremented, (2) the value of i is compared against the max-Integer constant, and (3) we jump back to the beginning of the loop.
All are fairly simple instructions (you don't have to be an assembly programmer to know that), so it's safe to assume each executes within a single CPU cycle.
BTW: instructions 2 and 3 can be combined into a single instruction (jump if less than).
Had the loop been coded in assembly language, my guess is that it would take about 4 seconds to complete: (2 instructions) * (4*10^9 loop cycles) / (2*10^9 instructions/sec) = 4 seconds. Thus, we have just found a lower limit for our answer: the Java code couldn't execute in under 4 seconds.

My guesstimate would probably be somewhere between 4 and 40 seconds.

Other possible influencing factors:
(*) We know that Java is not as efficient as machine language and adds some overhead to our code. Depending on the JVM implementation in use, the method might be compiled to native machine code (JIT) instead of executing in interpreted mode, which would of course improve performance.

(*) I had recalled that the JLS obliges the JVM to check for overflow while incrementing integers; in fact, the JLS defines int arithmetic to wrap around silently on overflow, so no extra per-cycle operations are required on that account.

(*) Since our loop counter isn't volatile (a local variable can't be volatile anyway), its value will probably be cached in one of the CPU's registers throughout the loop's execution. Had it been a volatile field, the JVM would have been forced to read and write the value to the machine's main memory on each operation involving the variable. Since memory (CAS) latency is measured in nanos as well, this should, theoretically, add a fixed cost to each loop cycle (~10-100 nanos), possibly increasing the estimate's order of magnitude by one (see the sketch after these bullets).

(*) Running on a multi-core chip should have no direct positive effect, as this is a single-threaded program.
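
If you want to feel the volatile cost yourself, here's a rough sketch (the class and field names are mine, and actual numbers will vary by JVM and hardware):

public class VolatileLoopCost {
    static volatile int v; // can't be register-cached; every access goes through the memory system

    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        }
        System.out.println("local int: " + (System.currentTimeMillis() - start) + " ms");

        start = System.currentTimeMillis();
        for (v = Integer.MIN_VALUE; v < Integer.MAX_VALUE; v++) {
        }
        System.out.println("volatile field: " + (System.currentTimeMillis() - start) + " ms");
    }
}

Expect the volatile version to be slower by a hefty factor.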

Here is the relevant disassembled bytecode (javap -c output):
4: ldc #3; //int -2147483648
6: istore_3
7: iload_3
8: ldc #4; //int 2147483647
10: if_icmpge 19
13: iinc 3, 1
16: goto 7

Actual results:
(1) On my IBM T41 ThinkPad it took 80 seconds to complete.
(2) On my workstation at home, equipped with an Intel Core 2 6300 1.8GHz CPU, it took only 9 seconds to complete.

Since I can't explain such a discrepancy, I'll have to check further and update with new information. Try it yourselves!

Saturday, March 29, 2008

How does hardware evolution affect programming language design?

I've recently watched the interesting webcast Programming Language Design and Analysis Motivated by Hardware Evolution by Professor Alan Mycroft (the webcast's link is accessible only from within the IBM intranet). Here are a few of the key notes I kept.

Not everything scales linearly
As chip designers continue to scale down chips and transistors, they begin hitting design walls. Some of these walls are related to the fact that as the transistors' physical size is scaled down, other properties of the chip do not scale linearly with it. The simplest example of this is dimensional: consider length vs. surface area. Reducing a square's side to 50% of its original length reduces the square's area by 75%, not a linear change. Similarly, different electrical characteristics may change at rates different from the rate at which length is changed.


Where is my 12GHz CPU?
Moore's law, which predicts the doubling of the transistor count on a chip every ~18 months, is still in effect; sadly, this no longer translates into clock speed. Although miniaturizing transistors also reduces distances within the chip, which should mean an increase in speed, heat dissipation problems (not all properties shrink at the same rate, and generated heat is one of them, remember?) force chip designers to reduce the voltage at which the chip's components operate. Therefore, no clock speed gain.
It does, however, let us squeeze more cores onto that optimal one-cm^2 silicon pad. Hence the multi-core technological path the industry has resorted to over the last couple of years.

[Image: Quad core]

There's always a trade-off
As the voltage at which the chip operates drops, chip designers are starting to face computational inaccuracy problems. How could we live in peace with this imprecision, the professor ponders; must we insist on absolute accuracy? Consider the task of rendering video: do we really care about the correctness of every pixel in every frame? Probably not. Just remember those old analog VCRs and audio cassettes; they were highly inaccurate and still delivered the goods. We might decide to compromise on accuracy, some of the time, in order to gain speed: just another type of trade-off. Programming language designers should assist chip designers by developing programming languages able to operate in a world without absolute certainty.
Also think about the built-in error correction mechanisms baked into network protocol stacks.

Better in one world, worse in the other
A major problem with multi-core chip processing is that although inter-core communication enjoys high bandwidth (2.5GB/s), it suffers from high latency (~75 clock cycles).
Another problem is that programs are written against a shared memory model, in which all cores must coordinate when accessing the shared main memory, and each core's caches must be refreshed quite often. While this doesn't seem like a major problem for dual or quad cores, think about how heavily it limits performance on a, not so futuristic anymore, 128-core chip.
Trying to refrain from shared main memory access might turn the tables on some of the disciplines we have come to regard as obvious. For example, when you code a parameterized function, you declare how parameters are passed: either by reference or by value. Declaring this at coding time (rather than deciding it at runtime) can be regarded as "early binding". From a performance perspective, everybody knows that passing by reference is almost always faster than passing by value (assuming you don't intend to change the passed value). This preferred way of action might not hold true on a multi-core system, which would incur an expensive overhead when accessing the data the reference points to in the shared main memory; no such price has to be paid if the parameter is passed by value. One way future programming languages might deal with this is to allow late binding of the parameter passing method: when running on a chip with only a few cores, a pass by reference will occur, whereas on a core-rich chip a pass by value will be selected. This works whenever the choice makes no difference to the program logic (no changes to the parameter's data are visible to the method caller, and the data isn't accessed concurrently), so the two can be used interchangeably.
Future languages will need to support this "late binding" feature and others like it.

Summing up
It will be interesting to keep following these trends of mutual influence between hardware and software.

Wednesday, February 27, 2008

Getting, finally, started...

28/Feb/2008 Bootstrap_Blogger Thread#0: Hello! We've got ourselves an original site logo (credit goes to Eliran Nachum), and I've got several half-completed draft articles that I'm planning to publish soon enough. So, keep your eyes open ;-)