Sunday, October 18, 2009

Utility Frenzy #1 – The log summarizer

Here's a post I wrote (in Hebrew) that tells the story of the log summarizer utility I wrote. This story is the first in a series of "utilities stories" I'm planning to write.
My apologies to those of you who won't be able to read it. Posts on this site do appear in English.

Monday, October 5, 2009

Google is pregnant again - Noop

After the zillion new dynamic languages that have flooded the earth (Groovy, Ruby, ...), Google is concocting Noop, a new type-safe language to join Java, Scala, and the rest.

The new language sets out to excel in testability, dependency injection, and readable code (see the proposed features). More interesting than whether Noop will gain a crowd of enthusiasts is the language's dynamic and lucid development process, made available through Google Code (Sun's JSRs are also transparent, but here it's a whole new language).

What do you think?

Sunday, August 23, 2009

Why is Thread.sleep() inherently inaccurate

Avi Ribchinsky, a friend and colleague of mine, is transitioning from C++ to the Java world. He had been playing with Thread.sleep() when he noticed that the sleep method may sleep longer than requested and, moreover, may also undersleep (see Fig 1). Coming from the C++ world, that surely caught him by surprise ;)

Fig 1.

[caption id="attachment_173" align="alignleft" width="584" caption="Thread.sleep() under sleeping"]Thread.sleep() under sleeping[/caption]

How is sleep implemented in Java anyway?


Avi came asking me if I knew anything about it. I was wondering myself how such a common and important method could be cheating in the way shown above. Is it the OS? A bug in the specific JRE version used? Maybe the API doesn't guarantee millisecond precision to begin with?
Thinking about all of these factors, we realized that we didn't really know how the JVM implements the sleep method. My best guess would have been that the process registers itself with the OS for a wake-up call, and the OS wakes the process via a software interrupt. OK, time to search the web.

The following article gives a very detailed answer, explaining that sleep is implemented by a thread giving its OS scheduling quantum back to the scheduler. On the next execution quantum the thread gets, it has the chance to wake up and continue processing, or to go on sleeping.
Therefore, the resolution of sleep depends directly on the scheduling resolution of the operating system in use. Since the Windows XP scheduling resolution is roughly 10 ms, the sleep mechanism in Avi's example might have preferred to undersleep "a little" rather than oversleep "a lot", by waking up in the current scheduling quantum rather than in the next one.

The article also mentions that the inaccuracies get worse when a process with a higher scheduling priority than the sleeping process is in a runnable state.

I assume that running on a hypervisor with coarse-grained process scheduling would also produce greater inaccuracies.


Conclusion


You can't rely on the millisecond accuracy of the sleep method. Take a time measurement before and after the call to find the actual time spent sleeping, in order to avoid ever-increasing inaccuracies.
Sleep tight :)
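
The before-and-after measurement can be sketched roughly as follows. This is only an illustration, not code from Avi's experiment: the class and method names (MeasuredSleep, sleepAndMeasure, runPeriodic) are made up for the example. The second method shows one possible way to keep the error from accumulating in a periodic loop, by sleeping towards absolute deadlines instead of sleeping a fixed amount each round.

```java
public class MeasuredSleep {

    /** Sleeps for roughly requestedMillis and returns the time actually spent sleeping, in nanoseconds. */
    static long sleepAndMeasure(long requestedMillis) {
        long before = System.nanoTime();
        try {
            Thread.sleep(requestedMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
        }
        return System.nanoTime() - before;
    }

    /**
     * Runs `ticks` rounds, one every periodMillis, and returns the total elapsed nanoseconds.
     * Each round sleeps only for the time remaining until its absolute deadline,
     * so an oversleep in one round shortens the next sleep instead of piling up.
     */
    static long runPeriodic(int ticks, long periodMillis) {
        long periodNanos = periodMillis * 1_000_000L;
        long start = System.nanoTime();
        for (int i = 1; i <= ticks; i++) {
            long deadline = start + i * periodNanos;
            long remaining = deadline - System.nanoTime();
            if (remaining > 0) {
                try {
                    Thread.sleep(remaining / 1_000_000L, (int) (remaining % 1_000_000L));
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            // the periodic work would go here
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long actual = sleepAndMeasure(10);
        System.out.println("Requested 10 ms, actually slept " + (actual / 1_000_000.0) + " ms");
        long total = runPeriodic(5, 10);
        System.out.println("5 rounds of 10 ms took " + (total / 1_000_000.0) + " ms");
    }
}
```

The printed numbers will differ between machines, operating systems and JVM versions, which is the whole point.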

Monday, July 27, 2009

ESX Server tuning – quick tour

[caption id="attachment_164" align="alignleft" width="150" caption="esx"]esx[/caption]

Our VMware ESX server does a great job for us.
Running on IBM x3650 hardware with 24GB RAM and 2x4 cores, it can simultaneously run up to 25 virtual machines, each configured with ~1.5 GB of RAM.

After reaching the mark of 25 running VMs, we started noticing increasing sluggishness when additional VMs were turned on.

Of course, we did the trivial stuff: making sure that all screen savers are disabled, that antivirus agents are not scheduled to run at the same time, and that all of the VMs are running the latest VMware Tools agent.
It was time to dig deeper and find out where the bottleneck was.

Someone told me that the reliability of the performance indicators shown by the graphical VI console is questionable, and that using the terminal utilities is recommended. So, I SSHed to the service console VM and ran the top utility. Immediately, I understood that what I was actually doing was surveying the service console VM's processes, rather than the overall ESX hypervisor activity. A quick dig made me realize that the hypervisor is visible through the esxtop command, which is also executed from within the service console VM.

Even for those of you who know your way around the output of top and Linux's sysstat package, the data shown by esxtop is rather cryptic.
This great esxtop tutorial did me a great service in understanding the esxtop output.

I started more than 30 machines to reproduce the problem, and quickly went through the list of usual suspects: CPU, memory and IO:

  • CPU
    I verified that it's not a CPU problem, since the "CPU load average" was around 0.2 and PCPU was much the same.

  • Memory
    Then I switched to the memory display and verified that it's not a physical memory issue. I saw the "high state" marker, which was a good sign, and there were almost 17GB ursvd (unreserved memory) in the VMKMEM/MB line.
    SWAP (~3GB) seemed OK.
    VMware's ballooning and memory sharing do miracles in broad daylight.

  • I/O
    I didn't see any queues forming. Read/write rates seemed pretty low.


So, the 25 VMs performance limit will remain a mystery until I have proper time to analyze it more thoroughly or, even better, until I find someone from IT to do that for me.

Monday, July 13, 2009

Extending your troubleshooting facilities - Always on verbose GC

Getting it right the first time


What happens when customers are experiencing problems with your application in production? The customer would send you the various log artifacts and, ideally, you should be able to diagnose the problem and provide a resolution. If you find yourself sending the customer back and forth in an effort to gather additional types of log artifacts and system information, then you are, most likely, doing something wrong.

Who should be helping you


If you deploy your application on top of an application server platform, like WebSphere Application Server (WAS) in my case, the platform should be assisting with automatic log generation and collection. Our development team has been increasingly relying on such services provided by WAS, like FFDC, WAS Collector, and hung threads detection, all of which have honorably earned their production stripes and badges.

One new serviceability artifact that I have long wanted to have in production is verbose GC. This feature records the JVM garbage collection activity over time, providing insight for resolving issues such as stop-the-world performance freezes, memory leaks, native heap corruption, etc.

Until today, I was reluctant to enable verbose GC in production, since I believed that there was no way to direct the verbose GC output from the native stderr (the default) to a dedicated rotating file. Without rotation, the output might grow into files larger than 2GB (a problem on some file systems), or cause the system to run out of disk space. I was assuming that the performance implications would be negligible, but still, you have to be extra prudent when it comes to live customer environments.

A trigger for action


Last week I had an issue with a WAS component. After opening a ticket with WebSphere support, I was asked to reproduce the scenario in order to generate verbose GC output. I decided that enough is enough! I'm gonna look into the GC output file rollover issue again and see what solutions exist, what the community has to say about it, or whether there might be some other custom solution (with the Apache web server, for example, the file rolling is handled by an external process into which the log output is redirected; that process then manages the rolling files itself).

Following a quick search, I was happy to find that the IBM JVM offers a rolling verbose GC log. I quickly found additional hands-on reports: Chris Bailey published verbose GC performance impact results that confirmed my gut feeling that the performance impact is a non-issue.

Here's the syntax (quoting the IBM Java 6 diagnostics guide):

-Xverbosegclog[:<file>[,<X>,<Y>]]
Causes -verbose:gc output to be written to the specified file. If the file cannot be found, -verbose:gc tries to create the file, and then continues as normal if it is successful. If it cannot create the file (for example, if an invalid filename is passed into the command), it redirects the output to stderr.
If you specify <X> and <Y>, the -verbose:gc output is redirected to X files, each containing Y GC cycles.


Final thoughts



  1. I don't like having to specify the entire path for the files. The default path should have been the server's logs directory, or the CWD (the CWD is the profile's directory, I believe).

  2. Rollover threshold parameter - I would rather specify it as a maximum size in MB than as a number of GC cycle entries. I've empirically found that 1MB of verbose GC log translates to ~700 GC cycle entries (YMMV).

  3. Good enough. I'll start doing the preparations to put this into production.
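
To work around the cycles-vs-MB annoyance, the conversion can be done once with a tiny helper. This is a rough sketch: the class name, the log path and the sizing (5 files of about 10 MB each) are made up for the example, and the 700 cycles per MB ratio is only my empirical figure from above, so measure your own before trusting it.

```java
public class VerboseGcRollover {

    // Empirical figure from this post: ~700 GC cycle entries per 1 MB of log (YMMV).
    static final int CYCLES_PER_MB = 700;

    /** Converts a desired per-file size in MB into the <Y> (GC cycles per file) argument. */
    static int cyclesForMegabytes(int maxMegabytesPerFile) {
        return maxMegabytesPerFile * CYCLES_PER_MB;
    }

    /** Builds the JVM argument in the documented form -Xverbosegclog:<file>,<X>,<Y>. */
    static String buildOption(String file, int fileCount, int maxMegabytesPerFile) {
        return "-Xverbosegclog:" + file + "," + fileCount + "," + cyclesForMegabytes(maxMegabytesPerFile);
    }

    public static void main(String[] args) {
        // Hypothetical path and sizing, for illustration only.
        System.out.println(buildOption("/tmp/verbosegc.log", 5, 10));
        // prints: -Xverbosegclog:/tmp/verbosegc.log,5,7000
    }
}
```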

Friday, June 12, 2009

A hand made freeware windows firewall

I have two windows servers that shouldn't talk to each other. How do I make sure they don't?
Right, why not use some firewall? Well, because I can't just install any software on these servers (company regulations), and Windows' built-in firewall sucks big time (inbound only, and you have to configure ALL exceptions).
On Linux this takes a couple of trivial iptables commands. Run the following on server#1:
iptables -I INPUT -s server#2 -j DROP
iptables -I OUTPUT -d server#2 -j DROP

Unfortunately, there's nothing like iptables built into Windows.
Drawing inspiration from the iptables concept of routing packets to the trashcan ("-j DROP"), I realized that much the same could be implemented on Windows by tweaking the OS routing table, causing it to deliver packets destined for server#2 to nowhere.
Here's my hand tailored, freeware, no software required, windows firewall that sends packets to a vacation in /dev/null:
route ADD 1.1.1.2 MASK 255.255.255.255 1.1.1.0

Where:
Server#1 IP is 1.1.1.1
Server#2 IP is 1.1.1.2
1.1.1.0 isn't assigned to anyone - our /dev/null for the occasion.

Additional blabber:
If you add the route instruction only to server#1, but not to server#2, then server#2 can still send IP packets to server#1. This breaks TCP completely (the replies never make it back), but server#2 could still send UDP datagrams to server#1.
Make sure the servers are configured with static IPs, otherwise the solution would break over time. In order to make the route persistent across server reboots, add the -p flag.
[caption id="attachment_132" align="alignnone" width="514" caption="wrong way! Packet! turn back now!"]wrong way! Packet! turn back now![/caption]

My first question at Stackoverflow.com

Could stackoverflow.com, or any other programming Q&A service, be an alternative to a serious thinking process, one where you just put in your question and are immediately granted the perfect answer? Hopefully it is.


To test that, I've submitted the following "how to regulate the amount of logging printouts" question. Let's wait, pray, and see if I get any smart/unexpected answer from any of the 6 billion inhabitants of planet Earth.[caption id="attachment_121" align="alignnone" width="376" caption="question-mark"]question-mark[/caption]

Friday, February 27, 2009

Why catch Throwable is evil - A real life story

Disclaimer: I know that this is an old idiom; I'm just presenting my own real life incident, taken straight from the bloody Java trenches.

Exceptions can be thread assassins
When running on top of a WebSphere thread pool, any runtime exception that isn't caught by the application code will bubble up the stack and end up killing the specific thread. WAS helps here by automatically creating a new thread to take the place of the murdered one, but still, killing a thread and immediately creating a new one is anything but the rationale behind a thread pool.

Hiring a thread bodyguard
A simple way to avoid thread death is wrapping the first application layer (e.g., the run() method) with a try block that catches and swallows any Exception that's thrown from anywhere in the application code.
Our project's code also used this concept, but instead of catch (Exception e), it had a catch (Throwable t). When I noticed that, I didn't rush to fix it, just in case someone before me had done funky stuff with dynamic class loading that might throw NoClassDefFoundError (although this should be caught at a very localized resolution), or maybe it was there for some other historical reason that, not being one of the code's forefathers, I'm just not aware of. In any case, I did promise myself that I'd revisit this piece of code in the future.

Getting some bulls to do correct things
Today I finally got the excuse I needed in order to change the catch Throwable into a catch Exception:
We were running stress tests when the server had an OOME (out of memory error). Since the catch Throwable caught and swallowed the OOME (as OOME is a subclass of Error, which is a subclass of Throwable), the thread that generated the OOME kept on living instead of dying right there, and so the JVM continued running, crippled and limping, instead of turning to an honorable solution like hara-kiri. Choosing the quick death route would have been rewarded with a quick resurrection, provided by the gracious NodeAgent and its watchdog mechanism, and the end result would have been a newly born healthy server ready to get back in business. A retreat in order to attack, you might put it.
Instead, the server had to limp for long minutes, suffering from a series of consecutive strokes (OOMEs), until the OOME was so bad that the JVM just had to exit.

Conclusions
The catch Throwable was causing downtime by preventing a prompt restart of the JVM following an OOME.

Open Questions

  1. I know that an uncaught exception kills only the specific thread. Does the JVM treat an error differently? In other words, if the OOME is not caught, will the entire JVM die or only the specific thread? I assume that the answer is the entire JVM; maybe this is implemented by the JVM itself, or maybe it's implemented somewhere in the WAS bedrock. If for some reason that's not the case, one could catch an Error and then execute System.exit(1); in order to hasten the process's imminent death.
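
The difference between the two bodyguards can be sketched in a few lines. This is a minimal illustration and not our project's code: GuardedTask is a made-up name, and the OutOfMemoryError is thrown by hand just to simulate the stress test. The wrapper swallows any Exception so the pooled thread survives, but deliberately lets any Error propagate to whoever sits above it.

```java
public class GuardedTask implements Runnable {

    private final Runnable task;

    GuardedTask(Runnable task) {
        this.task = task;
    }

    @Override
    public void run() {
        try {
            task.run();
        } catch (Exception e) {
            // Application failures are logged and swallowed, so the pooled thread lives on.
            System.err.println("Task failed, thread survives: " + e);
        }
        // No catch (Throwable t) here: an Error such as OutOfMemoryError is NOT swallowed.
    }

    public static void main(String[] args) {
        // A RuntimeException is swallowed by the guard.
        new GuardedTask(() -> { throw new IllegalStateException("boom"); }).run();

        // A simulated OOME escapes the guard.
        try {
            new GuardedTask(() -> { throw new OutOfMemoryError("simulated"); }).run();
        } catch (OutOfMemoryError e) {
            System.out.println("Error propagated past the guard: " + e.getMessage());
        }
    }
}
```

What happens to the Error after it escapes depends on the container. In a plain JVM, without any special handler, an uncaught Error terminates only the thread that threw it, so the question above is really about what WAS does on top of that.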


Friday, January 23, 2009

My attempts with IP Spoofing

Why did I want to spoof source IP addresses, and why did I fail? Here's the story:

------------
UPDATE Sep/2010: Dear Filipe (see comments below) has proven to me that spoofing over the Internet is indeed possible. Read all about it in the follow-up post: My attempts with IP Spoofing – Revisited. Now back to the original story:
------------

When customers install our product, they often forget to set up firewall rules to accept incoming connections from public IM (instant messaging) providers. Without the firewall rules in place the product does not function properly, of course, and the customer rushes to open a support trouble ticket. Troubleshooting to pinpoint the problem to a missing firewall rule isn't trivial. When we try to validate whether the customer defined the required firewall rule, we need the external entity (which we have no control over) to open a connection to the customer's IP, but the external entity will only do so following the successful completion of a handshake sequence that must be initiated by the customer (consider, for example, the XMPP Dial-Back mechanism). Since this handshake is itself prone to failures, you can see how reproducing the problem is a cumbersome process.


I started looking for a simple, independent, and reliable troubleshooting procedure that would give a clear-cut answer as to whether or not the customer configured the firewall correctly.
Here's what I've concocted:




  1. Assume that the customer IP is 1.1.1.1 and they were supposed to configure their firewall to allow incoming connections from 2.2.2.2.

  2. I'll send a single TCP SYN packet (the first of the standard three-message TCP handshake) from my computer (say its IP is 9.9.9.9), but I'll spoof the IP datagram's source address field to be 2.2.2.2 instead of my actual machine address (9.9.9.9).

  3. I'll ask the customer to run a network sniffer on the IM Gateway machine and wait for the single packet to arrive at the destination socket.

  4. If the sniffer records the incoming IP message, then the firewall is set up correctly and the problem is elsewhere.
    But if the sniffer doesn't record any incoming SYN packet, then we shall blame the firewall guys.


Pretty simple, eh? Now, in order to spoof the TCP SYN packet I needed something that could generate and send raw IP packets, since you can't just fiddle with the source IP address if you choose to ride on the good ol' TCP/IP stack. I found this IP spoofing Perl script on the net, and it does the job.

[caption id="attachment_103" align="alignnone" width="300" caption="Visualization of the various routes through a portion of the Internet. Took it from Wikipedia."]Visualization of the various routes through a portion of the Internet. Took it from Wikipedia.[/caption]

I did my first test on the office LAN. I sent a message from my machine (IP 9.9.9.9) to machine 1.1.1.1, claiming the message source was 2.2.2.2, and it worked! Machine 1.1.1.1 registered an incoming packet from 2.2.2.2.
It seems that the office router went along with the scam. Perhaps it thought that the machine had switched its IP, or that the DHCP server went crazy, or that its ARP cache was just stale.

In the next test I tried sending the packet over the Internet: from the office to my home computer, with a source IP of some foreign entity. To my dismay, it never got to my home computer. Other IP variations didn't work either.
My guess is that some router along the way noticed that it was getting a packet with a source IP address that the part of the network it is looking at can't possibly generate (imagine CIDR-based ACLs), and that caused it to immediately drop the packet. This failure caused me to give up on the whole spoofing troubleshooting procedure idea.

Some thoughts about what I've seen:

  1. Evidently, it's quite trivial to spoof IP addresses on a LAN.

  2. Spoofing IP addresses over the Internet doesn't seem to be trivial.

  3. A side note: if the customer has a reverse proxy, or any form of entity that delegates TCP handshakes, deployed in front of the actual IM Gateway machine, then the procedure is not applicable, as the first TCP SYN message will never reach the IM Gateway machine.

  4. I would assume that the closer you inject the packet into the Internet backbone bloodstream, the better the chances that the spoofed packet won't be rejected. The backbone routers communicate with many different parts of the network, and might have no basis for judging where certain packets should or shouldn't be coming from.
    IP packets tend to travel along different routes, making it harder to judge which IP CIDR is legit coming from each fellow router.

  5. I'm guessing that the biggest problem for spoofing is the first or the second router (the ISP's), since the ISP knows exactly what your assigned address is, and therefore knows that the packet is spoofed.

  6. If anyone knows a better method of spoofing a source IP, please step forward and share your secret :)

Tuesday, January 20, 2009

Increasing the site's posting rate - new paradigm

I've been promising myself to post much more frequently than I currently do.

Apart from reserving more time for actually authoring posts, I'm also counting on changing the nature of the posts as a primary means of increasing the posting rate to once a week or more. From now on, I'll publish stories from my day-to-day work as a software developer, interesting technical things I come across, questions that I don't always have an answer for, general discussion, etc. The posts will be less educational, less article-like, less accurate, with fewer checked facts, but on the other hand much more real-life related, much more up to date, and presented in a format that invites open discussion.

So, I'll be writing to you soon as promised :)