An American Girl Doll House

So my daughters are huge American Girl doll fans, and can’t seem to get enough (it’s a pretty consistent theme to be asked about the next trip to Chicago). Recently, my oldest was not able to participate in a local school activity (Destination Imagination) due to a lack of available groups (there were only three teams available). Around the same time, I stumbled upon a thread where a guy built an American Girl dollhouse for his daughter. I discussed it with Joelle and we’re going to build one, and approach it as a project similar to the activity she wasn’t able to do. I’ll post more updates as we progress.

Posted in with the kids

Screenshots with OS X

I’ve used Grab for several years now, but always saving in TIFF is problematic. I keep having to look up how to use the built-in features of OS X, so I figured maybe it was time to document this for myself ;-).

* Command-Shift-3: Take a screenshot of the screen, and save it as a file on the desktop
* Command-Shift-4, then select an area: Take a screenshot of an area and save it as a file on the desktop
* Command-Shift-4, then space, then click a window: Take a screenshot of a window and save it as a file on the desktop
* Command-Control-Shift-3: Take a screenshot of the screen, and save it to the clipboard
* Command-Control-Shift-4, then select an area: Take a screenshot of an area and save it to the clipboard
* Command-Control-Shift-4, then space, then click a window: Take a screenshot of a window and save it to the clipboard

In Leopard and later, the following keys can be held down while selecting an area (via Command-Shift-4 or Command-Control-Shift-4):

* Space, to lock the size of the selected region and instead move it when the mouse moves
* Shift, to resize only one edge of the selected region
* Option, to resize the selected region with its center as the anchor point
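
The same captures can also be scripted through the screencapture command-line tool that ships with OS X, which sidesteps the TIFF-only problem with Grab. Below is a minimal Python sketch that just builds the argument list; the helper names (screencapture_args, take_screenshot) are my own, and only the -i, -c and -t flags of screencapture are assumed.

```python
import subprocess


def screencapture_args(path=None, fmt="png", interactive=False, clipboard=False):
    """Build an argument list for the OS X screencapture command."""
    args = ["screencapture"]
    if interactive:
        # -i: interactive mode (drag an area, or press space to pick a window)
        args.append("-i")
    if clipboard:
        # -c: send the capture to the clipboard instead of a file
        args.append("-c")
        return args
    if not path:
        raise ValueError("a file path is required unless clipboard=True")
    # -t: image format of the file that gets written
    return args + ["-t", fmt, path]


def take_screenshot(**kwargs):
    """Run screencapture with the built arguments (only works on OS X)."""
    subprocess.check_call(screencapture_args(**kwargs))
```

For example, take_screenshot(path="window.png", interactive=True) is roughly the scripted equivalent of Command-Shift-4.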

Posted in Mac OS X

Apache logs with load balancers / monitors

I have run into log pollution quite a bit when using load balancers with HTTP checks. Typically, these checks call an HTTP-based service, request a default response and, as a general rule, look for an HTTP response code of “200 OK”. All of those extra entries make debugging / tracing irritating. I figured out this fix, so figured I’d post it to save someone else the extra effort.

First, you’ll want to set an environment variable based on the remote address of the load balancer(s).

SetEnvIf Remote_Addr "^10\.160\.252\.[23]$" nolog

Once you have the nolog environment variable set, you’ll want to add that as a clause to your CustomLog declaration.

CustomLog "|/opt/apache/current/bin/rotatelogs /path_to_logs_dir/access_%Y%m%d.log 86400 -300" combined env=!nolog
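
Before trusting an address pattern in SetEnvIf, it can help to try it against a few sample addresses. The sketch below does that in Python; Python’s re module is not the engine Apache uses, but simple patterns like this one behave the same way. The function names and sample addresses are hypothetical. Note that a character class such as [23] matches a single 2 or 3, whereas [2|3] would also match a literal pipe character.

```python
import re

# Assumed health-check sources; substitute your load balancers' addresses.
LB_PATTERN = re.compile(r"^10\.160\.252\.[23]$")


def is_health_check(remote_addr):
    """True if the address belongs to one of the load balancers."""
    return LB_PATTERN.match(remote_addr) is not None


def filter_log_lines(lines):
    """Drop lines whose first field (the client address in the
    'combined' log format) is a load balancer."""
    return [line for line in lines
            if not is_health_check(line.split(" ", 1)[0])]
```

The second helper is handy for cleaning up logs that were already polluted before the nolog rule went in.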

Posted in Apache, Systems

Apache maintenance option

Ever want to have your apache instance running, but not allow application access? Try this method for various implementations, including built in apache+php, or even when using back-end java services (with http_proxy, weblogic location handlers, etc).

First, the rewrites:

# enable global maintenance page, modified to check for environment variable - see start script
RewriteCond %{ENV:MAINTENANCE} ^TRUE$
RewriteCond %{REQUEST_URI} !^/siteMaintenance\.html$ [NC]
RewriteRule .* /siteMaintenance.html${pages:%1:NULL}? [R=302,L]

Next, we need to get apache (and mod_rewrite) to see our environment variable (ENV:MAINTENANCE). There are probably various ways this can be achieved, but I’ve personally had good luck injecting the environment variable into apache as I launch the httpd process. Below is a sample line that can be added to the start section of your /etc/init.d/apache script:

 MAINTENANCE=TRUE ${APACHE_ROOT}/bin/apachectl -d ${APACHE_ROOT} -f ${APACHE_CONF} -k start

What we have essentially done above is to force inject the MAINTENANCE variable into apache, in this specific case denoting that the value contained in the variable is TRUE. In the Rewrite Condition, we are checking that this variable is indeed TRUE, and if so, we trap anything and send it to the siteMaintenance.html page. To avoid a looping issue, which would happen when a request comes in, gets a 302 redirect to /siteMaintenance.html and then gets trapped again, we put in an exclusion statement that says that if the REQUEST_URI matches /siteMaintenance.html, then we do not wish to redirect it.
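
The decision described above can be modelled in a few lines of Python, which makes the loop-avoidance rule easy to test in isolation. This is only a sketch of the logic, not something Apache runs; the function name maintenance_redirect is made up for the illustration.

```python
import os

MAINTENANCE_PAGE = "/siteMaintenance.html"


def maintenance_redirect(request_uri, environ=None):
    """Return the redirect target for a request, or None if the
    request should pass through untouched."""
    environ = os.environ if environ is None else environ
    # Only trap requests when the variable was injected as TRUE.
    if environ.get("MAINTENANCE") != "TRUE":
        return None
    # Exclusion: never redirect the maintenance page itself, otherwise
    # the 302 target would be trapped again and loop. The comparison is
    # case-insensitive, like the [NC] flag.
    if request_uri.lower() == MAINTENANCE_PAGE.lower():
        return None
    return MAINTENANCE_PAGE
```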

Posted in Apache, Systems

OpenSSH, sftp, groups and a jail

Recently, my boss came to me and wanted to update a very old Red Hat 9 SFTP server. There was one little snag though: the old server was using a jailed approach. Seeing as this server was quite old, and had a patched SSH binary installation to support the jailed configuration, I figured I was in for a hefty amount of work. I couldn’t have been more mistaken. The good folks over at OpenSSH have already implemented the ability to run sshd in a jailed configuration. The other gotcha was that we also wished to use shared keys, so one more thing, right?

The relevant sshd_config directives are:


These three are the only ones that really pertain directly to the jail scenario I was building towards. YMMV. If all else fails, the OpenSSH guys (and OpenBSD related projects in general) tend to have very good man pages. You can find the man pages for OpenSSH at

So, we start out by unpacking and compiling SSH. This is really only if you want to maintain the installation separate from the default system. In our case, Suse was lagging substantially behind the formal project releases (SLES 10.x was/is using OpenSSH 4.x).

tar zxvf openssh-5.4p1.tar.gz

cd openssh-5.4p1

./configure --prefix=/opt/[your_dest_dir]

make

sudo make install

Once I completed the above steps, I had my new installation of OpenSSH ready. Now to configure everything and set up our jail. For this, I needed to build out the jail structure. I opted to go with /www/jail. As part of this, I also needed to have the ability for shared keys, so decided on /www/jail/.ssh as the directory for the keys to exist.

mkdir /www

mkdir /www/jail

mkdir  /www/jail/.ssh

Next, I started out with a base sshd_config. This is per defaults that I have found to work over the years. Note that you should change [install_dir] to the directory structure you specified as the argument to --prefix. I modified the port so as not to conflict with the default installation residing at port 22 and also tuned the number of sessions allowed. Some other things that I modified included X-tunneling, permit user environment, DNS resolving of remote clients, etc.

Port 2022
Protocol 2
HostKey /[install_dir]/etc/ssh_host_rsa_key
HostKey /[install_dir]/etc/ssh_host_dsa_key
KeyRegenerationInterval 1h
ServerKeyBits 1024
SyslogFacility AUTH
LogLevel INFO
LoginGraceTime 2m
PermitRootLogin no
StrictModes yes
MaxAuthTries 3
MaxSessions 20
MaxStartups 20
PubkeyAuthentication yes
PasswordAuthentication yes
ChallengeResponseAuthentication no
X11Forwarding no
PrintMotd yes
PrintLastLog yes
TCPKeepAlive yes
UseLogin no
UsePrivilegeSeparation yes
PermitUserEnvironment no
UseDNS no
PidFile /[install_dir]/var/run/[pidname].pid

The next thing that is needed in the sshd_config file is to setup some specifics about our sftp server.

Banner /www/jail/sftp.banner
Subsystem    sftp internal-sftp
AuthorizedKeysFile /www/jail/.ssh/%u

Note above that we have set the Subsystem to use ‘sftp internal-sftp’, versus ‘sftp /[install_dir]/libexec/sftp-server’ as is default in the sshd_config file. This is where the magic really happens. As per the man page:

             Alternately the name ``internal-sftp'' implements an in-process
             ``sftp'' server.  This may simplify configurations using
             ChrootDirectory to force a different filesystem root on clients.

Ok, so what does this mean? It means that instead of having to replicate all of our libraries, etc into a jail as has been historically necessary; we can specify internal-sftp and do not need to have the libraries/binaries inside the jail as the in-process sftp server will inherit the libraries already loaded/accessed by the parent sshd process. Simply stated, this makes it very easy to implement, and minimizes the amount of churn we’d typically have to go through.

Next, we save our file, and start our server up to test. Note we are issuing ‘-d’ to put the server into debug mode.

/opt/[install_dir]/sbin/sshd -d -f /opt/[install_dir]/etc/sshd_config

You should see some debug output ending with something like:

debug1: Bind to port 2022 on
Server listening on port 2022.

Hit CTRL+C to exit the server. Next we need to create a couple of groups and some users to place in the groups.

groupadd jailedgroup1

useradd -g jailedgroup1 -c "Jailed user #1" -d / -s `which false` username1

useradd -g jailedgroup1 -c "Jailed user #2" -d / -s `which false` username2

groupadd jailedgroup2

useradd -g jailedgroup2 -c "Jailed User #1 Group 2" -d / -s `which false` username1g2

passwd username1

passwd username2

passwd username1g2

So let’s review what we’ve done. We have a working ssh installation, and have created two groups (jailedgroup1 and jailedgroup2), added two users to the first group and one user to the second group. For each of the users, we set their home directory to / (root). Normally, this may seem odd, but there is a method to the madness.

When a user logs into a jailed service, we want them to go to the root of that directory, instead of treeing things out as in [jaildir/home/username]. This is of course open to implementation, so it may work differently depending on the requirements. For me, I wanted my users to share the same ftp area, so this made the most sense. At this point, we know that we are going to jail users/groups under /www/jail, but have not decided on the exact directories for each group. For the purposes of this document, we’ll pick something fairly straightforward. Let’s create two directories, named jailedgroup1 and jailedgroup2.

mkdir /www/jail/jailedgroup1

mkdir /www/jail/jailedgroup2

The next piece of this puzzle is to set up our group definitions in our sshd_config file. Let’s open the file back up and add the following sections.

Match Group jailedgroup1
X11Forwarding no
AllowTcpForwarding no
ForceCommand internal-sftp
ChrootDirectory /www/jail/jailedgroup1

Match Group jailedgroup2
X11Forwarding no
AllowTcpForwarding no
ForceCommand internal-sftp
ChrootDirectory /www/jail/jailedgroup2
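
Since every group gets the same stanza with only the group name changing, the Match sections lend themselves to being generated. Here is a small Python sketch that renders them; the function names are hypothetical, and the directives are exactly the ones used above with the jail rooted at /www/jail.

```python
def match_block(group, jail_root="/www/jail"):
    """Render the sshd_config stanza that jails one group."""
    lines = [
        "Match Group %s" % group,
        "X11Forwarding no",
        "AllowTcpForwarding no",
        "ForceCommand internal-sftp",
        "ChrootDirectory %s/%s" % (jail_root, group),
    ]
    return "\n".join(lines)


def render_match_blocks(groups, jail_root="/www/jail"):
    """Render one stanza per group, separated by blank lines."""
    return "\n\n".join(match_block(g, jail_root) for g in groups) + "\n"


if __name__ == "__main__":
    # Paste the output at the end of sshd_config.
    print(render_match_blocks(["jailedgroup1", "jailedgroup2"]))
```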

Save and exit the sshd_config file. We now want to perform a simple test. Let’s start sshd as we did earlier.

/opt/[install_dir]/sbin/sshd -d -f /opt/[install_dir]/etc/sshd_config

We should once again have a working sshd server listening on port 2022. From another window on that server, execute the following.

sftp -o port=2022 username1@localhost

You should be prompted to accept the keys, choose yes here and enter the password for username1. Once logged in, execute a `pwd` to verify you are in the / (root) area of the “remote” server. You can then do a ls -l / to verify no files exist in the root.

In a third window, you can log in and create a directory structure, then re-run the ls to verify it shows up (as in mkdir /www/jail/jailedgroup1/somedir). The nice part here is that if you set group permissions on a directory inside the jail, then all of the users in jailedgroup1 will be able to update files in that same directory (chgrp jailedgroup1 /www/jail/jailedgroup1/somedir && chmod 775 /www/jail/jailedgroup1/somedir). Leave the jail directory itself owned by root and not group-writable, as sshd refuses to chroot into a directory that anyone other than root can write to.
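
The sshd_config man page states that every component of the ChrootDirectory path must be a root-owned directory that is not writable by any other user or group. A quick way to check this before testing logins is sketched below in Python; the function names are my own and the check only mirrors that documented rule.

```python
import os
import stat


def chroot_mode_ok(st_uid, st_mode):
    """True for a root-owned directory that neither group nor others
    can write to."""
    return (st_uid == 0
            and stat.S_ISDIR(st_mode)
            and not st_mode & (stat.S_IWGRP | stat.S_IWOTH))


def check_chroot_path(path):
    """Return every component of path, up to /, that fails the check."""
    bad = []
    current = os.path.abspath(path)
    while True:
        st = os.stat(current)
        if not chroot_mode_ok(st.st_uid, st.st_mode):
            bad.append(current)
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent
    return bad
```

Running check_chroot_path("/www/jail/jailedgroup1") should come back with an empty list when the jail is set up the way sshd expects.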

Test with other users from the group to observe the behavior. Login as username1g2 to verify you show up in the other jailed directory. Note that each time you log out, your debug sshd service listening on port 2022 will exit. This is normal behavior. If you want to login with multiple users and verify the binary spawns out a new child process, execute the sshd command without the -d argument.

At this point, it’s just about wrapping everything up with a couple of scripts, which I will leave as an exercise for the reader (hint: /etc/init.d/sshd already exists, why not use that as a base?).

The next phase for me is to implement shared keys. This is largely open depending on what the client is using to connect to the sftp server. In our case, I am putting the keys in /www/jail/.ssh/[username]. The only gotcha here is to make sure that the shared key file for that user is owned by that user. Testing in debug mode will go a long way in helping to determine any issues (key based logins included).
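
To see which file sshd will look for, the AuthorizedKeysFile tokens can be expanded by hand: %u is the user name, %h the home directory and %% a literal percent sign. The Python sketch below does the expansion and checks the ownership gotcha mentioned above; both function names are hypothetical.

```python
import os


def expand_authorized_keys_path(template, user, home):
    """Expand the %u, %h and %% tokens of an AuthorizedKeysFile value."""
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template):
            token = template[i + 1]
            if token == "u":
                out.append(user)
            elif token == "h":
                out.append(home)
            elif token == "%":
                out.append("%")
            else:
                raise ValueError("unsupported token %%%s" % token)
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def key_file_owned_by(path, user):
    """True if the key file at path is owned by user (Unix only)."""
    import pwd  # imported here because it only exists on Unix
    return os.stat(path).st_uid == pwd.getpwnam(user).pw_uid
```

With the configuration above, expand_authorized_keys_path("/www/jail/.ssh/%u", "username1", "/") gives the path to hand to key_file_owned_by.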

Hope someone else finds this useful.

Posted in Systems

Dynamic Fabric with Boto

Recently, I built a small utility that would query AWS and find patterns using regex (see article for more info).

For another project, I asked a colleague to modify that query so that we could get the information for groups of servers. He ended up with a really slick little piece of code that we could use to query by security group. Around the same time, I was introduced to Fabric, and the sun has now risen :-). In Fabric, you need to load the env.hosts list when targeting groups of machines. Most of the examples I’ve come across have set these manually in code. As I’m not fond of those types of approaches, I took the code my colleague had written and integrated it into my Fabric scripts. I’ve found this to be very useful at times.

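from boto.ec2.connection import EC2Connection
from boto.ec2.securitygroup import SecurityGroup

def get_running_instances(access_key=None, secret_key=None, security_group=None):
    """Get all running instances. Only within a security group if specified."""
    conn = EC2Connection(aws_access_key_id=access_key, aws_secret_access_key=secret_key)

    if security_group:
        sg = SecurityGroup(connection=conn, name=security_group)
        instances = [i for i in sg.instances() if i.state == 'running']
        return instances
    else:
        # get_all_instances() returns reservations, so flatten them into instances
        instances = [i for r in conn.get_all_instances() for i in r.instances if i.state == 'running']
        return instances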

While I quickly used it to grab groups of machines, I found other very cool ways to limit those groups even further, by using EC2 tags. It has become a very powerful tool in my kit :-).

For example, say I wanted to get all the running systems in a given security group, using a function (I have lots of thoughts about how to streamline this even more, so stay tuned).

def get_app_servers():
    instances = get_running_instances(aws_key,aws_sec,"your-app-security-group")
    for i in instances:
        print i

What if I wanted to get just a subset of that, by keying off the tag information? Not to fear, the regex library will fit the need quite well.

import re

def get_some_group_within_sg():
    p = re.compile("app-[0-9][dsp]")
    instances = get_running_instances(aws_key,aws_sec,"your-app-security-group")
    for i in instances:
        name = i.tags.get('Name','')
        # only act on instances whose Name tag matches the pattern
        if p.match(name):
            print name

Here is a version of the get_running_instances() def that will return all instances (so a little more generic, for cases where the system may not be running, e.g. to start the instance…).

def get_instances(access_key=None, secret_key=None, security_group=None):
    """Get all instances, only within a security group if specified. Their state (running/stopped/etc) doesn't matter."""
    conn = EC2Connection(aws_access_key_id=access_key, aws_secret_access_key=secret_key)

    if security_group:
        sg = SecurityGroup(connection=conn, name=security_group)
        instances = sg.instances()
        return instances
    else:
        # get_all_instances() returns reservations, so flatten them into instances
        instances = [i for r in conn.get_all_instances() for i in r.instances]
        return instances
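
To close the loop with Fabric, the instances still have to be turned into the env.hosts list. A minimal sketch of that step follows, assuming each instance exposes tags and public_dns_name the way boto’s instance objects do. The helper name hosts_from_instances and the FakeInstance stand-in are made up for the example, so that it runs without AWS credentials.

```python
import re
from collections import namedtuple

# Stand-in for a boto instance, used only for demonstration.
FakeInstance = namedtuple("FakeInstance", ["tags", "public_dns_name"])


def hosts_from_instances(instances, name_pattern=None):
    """Return the public DNS names of the instances whose Name tag
    matches name_pattern (all instances if no pattern is given)."""
    regex = re.compile(name_pattern) if name_pattern else None
    hosts = []
    for inst in instances:
        name = inst.tags.get("Name", "")
        if regex is None or regex.match(name):
            if inst.public_dns_name:  # skip instances without a public name
                hosts.append(inst.public_dns_name)
    return hosts


demo = [
    FakeInstance({"Name": "app-1d"}, "app1.example.com"),
    FakeInstance({"Name": "db-1p"}, "db1.example.com"),
]
app_hosts = hosts_from_instances(demo, r"app-[0-9][dsp]")
```

In a fabfile, the result would be assigned to env.hosts, for instance env.hosts = hosts_from_instances(get_running_instances(aws_key, aws_sec, "your-app-security-group"), r"app-[0-9][dsp]").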


Posted in Python

Get info from AWS using Boto

I recently had a need to be able to search and retrieve information from various AWS accounts and output in json format. As json is becoming so nice to work with from a supported standpoint, and it sucks a lot less than XML, figured I’d roll up my sleeves and build a Python script to do the heavy lifting for me. This tool will search against AWS EC2 instances (and all properties of) for specific strings. Once it compiles a list, it outputs the data in json format. Boto makes it way easy!

To use, you pass in your key and secret (via the -k and -s parameters, and optionally the -p [regex] parameter), and the script will do the rest. In essence, it loops through all of the instances within the reservation (see the boto docs and AWS API for more info), and as it’s iterating over the instance, it will do a sub-iteration over the properties of the instance, so you are able to search anything having to do with any instance. Hope someone finds this useful! Pointers or tips on how I could do this better would also be useful. One thought I’ve had would be to read the environment variables for the key and secret, similar to Amazon’s toolkit.
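
Two of the ideas above can be sketched on their own: the sub-iteration over an instance’s properties, and reading the key and secret from the environment. The Python below is a simplified illustration and not the script itself; the function names are hypothetical, and AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are the variable names boto looks for.

```python
import os
import re


def matching_properties(properties, pattern):
    """Return the (sorted) names of the properties whose value, as a
    string, matches the regular expression from its start."""
    regex = re.compile(pattern)
    return sorted(key for key, value in properties.items()
                  if regex.match(str(value)))


def credentials_from_env(environ=None):
    """Read the AWS key and secret from the environment, giving
    (None, None) when they are not set."""
    environ = os.environ if environ is None else environ
    return (environ.get("AWS_ACCESS_KEY_ID"),
            environ.get("AWS_SECRET_ACCESS_KEY"))
```

An instance would be included in the output whenever matching_properties(vars(instance), pattern) is non-empty.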


Python (Note that 2.6 is the version tested against)



Usage: --key [aws_key] --secret [aws_secret] --pattern [regex]

  -h, --help            show this help message and exit
  -k KEY, --key=KEY     AWS Key to use
  -s SECRET, --secret=SECRET
                        AWS Secret to use
  -p PATTERN, --pattern=PATTERN
                        Regular Expression pattern (pcre compliant) to search
                        for, examples include '.*', '^ubi.*', '.*dev.*', etc


Adding a link to the actual file, as the pre/code tags I use will not allow the code below to line wrap, and Python is very particular about whitespace.

#!/usr/bin/env python26
import json
import re
import sys
from optparse import OptionParser
from boto.ec2.connection import EC2Connection

# build the per instance dict that will be stored in config[name]
def instance_data(i):
    return dict(
        public_ip = i.ip_address,
        public_dns = i.public_dns_name,
        private_ip = i.private_ip_address,
        private_dns = i.private_dns_name,
        instance_type = i.instance_type,
        region = i.placement,
        root_device_type = i.root_device_type,
        root_device_name = i.root_device_name,
        image_id = i.image_id,
        security_groups = i.groups,
        security_key = i.key_name,
        instance_id = i.id,
        hypervisor_type = i.hypervisor,
        arch_type = i.architecture
    )

# main function
def main(argv):
    config = {}    # hold our configuration for json to dump

    # parse the options/args out
    (options, args) = parser.parse_args(argv)    # load options into an object and arguments into a list

    # act on our arguments
    if options.key and options.secret:
        p = None
        if options.pattern:    # check if the user specified a pattern
            # try to compile the provided regex pattern, and catch it if it throws an error
            try:
                p = re.compile(options.pattern)
            except re.error:
                print "error compiling the regular expression specified: '" + options.pattern + "'"
                print " see the Python 're' module documentation for supported patterns"
                print "\nEx:"
                print "  .*         = greedy match anything"
                print "  .*uat.*    = will match any name containing 'uat'"
                print "  .*dev.*    = will match any name containing 'dev'"
                print "  ^ubi.*     = will match any name beginning with 'ubi'\n"
                sys.exit(1)

        conn = EC2Connection(options.key, options.secret)    # get connection to EC2

        for r in conn.get_all_instances():    # loop through all reservations
            groups = [g.id for g in r.groups]    # get a list of groups for this reservation

            for i in r.instances:    # loop through all instances within reservation
                i.groups = ','.join(groups)    # join the groups into a comma separated list for consumption below
                name = i.tags.get('Name', '')    # get instance name from the 'Name' tag

                if p:
                    # count how many properties of the instance match the pattern
                    match_counter = 0
                    for value in dict(i.__dict__).values():
                        if p.match(str(value)):
                            match_counter += 1

                    if p.match(name) or match_counter > 0:
                        config[name] = instance_data(i)
                else:
                    # no pattern given, so include every instance
                    config[name] = instance_data(i)

        print json.dumps(config, sort_keys=True, indent=4)

if __name__ == "__main__":

    # usage statement
    usage = "usage: %prog --key [aws_key] --secret [aws_secret] --pattern [regex]"

    # assign usage display to option parser
    parser = OptionParser(usage=usage)

    # add key option
    parser.add_option("-k", "--key", dest="key", help="AWS Key to use\n")

    # add secret option
    parser.add_option("-s", "--secret", dest="secret", help="AWS Secret to use\n")

    # add regex pattern
    parser.add_option("-p", "--pattern", dest="pattern", help="Regular Expression pattern (pcre compliant) to search for, examples include '.*', '^ubi.*', '.*dev.*', etc")

    # check the length of our args and
    # start the ball rolling....
    if len(sys.argv) > 1:
        main(sys.argv[1:])
    else:
        parser.print_help()

Sample run

[jess@ip-x-x-x-x bin]$ ./ -k [key] -s [secret] -p '.*\.micro'
{
    "[hostname]": {
        "arch_type": "x86_64",
        "hypervisor_type": "xen",
        "image_id": "[image_id]",
        "instance_id": "[instance_id]",
        "instance_type": "t1.micro",
        "private_dns": "[private_ip]",
        "private_ip": "[private_ip]",
        "public_dns": "[public_ip]",
        "public_ip": "[public_ip]",
        "region": "us-east-1d",
        "root_device_name": "/dev/sda1",
        "root_device_type": "ebs",
        "security_groups": "[security_group]",
        "security_key": "[some_key]"
    }
}
[jess@ip-10-x-x-x bin]$ ./ -k [key] -s [secret] -p '10\.x\.x\.x'
{
    "[hostname]": {
        "arch_type": "x86_64",
        "hypervisor_type": "xen",
        "image_id": "[image_id]",
        "instance_id": "[instance_id]",
        "instance_type": "m2.2xlarge",
        "private_dns": "",
        "private_ip": "[private_ip]",
        "public_dns": "",
        "public_ip": "[public_ip]",
        "region": "us-east-1d",
        "root_device_name": "/dev/sda1",
        "root_device_type": "ebs",
        "security_groups": "",
        "security_key": "[some_key]"
    }
}
[jess@ip-10-x-x-x bin]$

Posted in Python

Java metric analysis with SNMP

Recently at Ohio Linux fest, I was discussing JVM analysis of heap, threads, etc with another engineer in regards to Oracle’s JVM diagnostic tool, which runs inside the jvm and provides diagnostic abilities.

While I’ve used similar tools in the past, I like to approach statistical collection externally by using Perl and SNMP, and stashing the values in some sort of round-robin archive (I’ve used rrdtool quite a bit for this type of work, but have recently moved to storing my data in capped collections within MongoDB). I’ve seen some discussion about this on the web, but was thinking maybe I should note some of my own experience getting at the data for others to review. When I went down this road, complete examples were sparse and generally simplistic, without any deep dive type commentary.

First, you need to modify your JVM start parameters. This will be largely dependent on what type of container you are using (Tomcat, WebLogic, WebSphere, straight java, etc), but you’ll need to get the following options added to allow the JVM to permit access to this information via SNMP.

Note on the above that the acl value is set to ‘false’. Do this only if you are sure that your JVM is only accessed through/via a firewall (i.e., not connected directly to the net, but NAT’d or protected in some manner). This will set the default SNMPv2 community to ‘public’. Consult the documentation for Java for more information and/or options.

Once you have modified your parameters (don’t forget to change out ‘your_jvm_host_ip’ and ‘your_jvm_host_port’), start up your JVM, perform a ‘ps -ef’ and verify the arguments are there and check the log files for any issues. For the next verification step, perform a ‘netstat -anp | grep ‘ and verify that the java process is listening to the UDP port you defined.

The next step towards verification involves using the snmpwalk command. The OID that contains the bulk pull for Java is ‘’.

Here is an example command via snmpwalk.

snmpwalk -v 2c -c public host_ip:host_port ''

Now that we’ve gotten this far, we’re ready to throw some perl in the mix. You’ll need “Net::SNMP”, and I’ll leave that largely up to the reader to install and get working (typically, need a few dev packages and install via cpan). For the purposes of this demo, we’ll replicate what we just did with snmpwalk, and grab the entire bulk table and print it back out.

use strict;
use warnings;
use Net::SNMP qw(snmp_dispatcher oid_lex_sort);
use Data::Dumper;

use constant DEBUG_MODE => 1;
my $bulk_oid = '';

# change host_ip and host_port to match the settings you chose
#  for your jvm
my %results = makeRequest('host_ip', 'host_port');

sub makeRequest {
  my $hostname = shift;  # The fully qualified hostname or ip
  my $port = shift;  # The snmp port
  my %resultList = ();
  my $result;
  # Create the SNMP session
  my ($session, $error) = Net::SNMP->session(
      -hostname  => $hostname,
      -community => 'public',
      -port      => $port,
      -version   => 'snmpv2c',
      -translate => ['-octetstring' => 0, '-unsigned' => 0], # '-timeticks' => 0
  );

  # Was the session created?
  if (!defined($session)) {
      printf("ERROR: %s.\n", $error);
      exit 1;
  }

  $result = $session->get_table(-baseoid => $bulk_oid);

  # Did the table request succeed?
  if (!defined($result)) {
      printf("ERROR: %s.\n", $session->error());
      $session->close();
      exit 1;
  }

  # loop through the results and add them to our resultList hash
  for my $res_key (keys %{$result}) {
      $resultList{$res_key} = $result->{$res_key};
      if(DEBUG_MODE) {
        print "res:" . $result->{$res_key} . ":\n";
      }
  }
  $session->close();

#  print Dumper(%resultList);
  return %resultList;
}

So with that, we have just grabbed our bulk tree and spit it back out to the user (in this case, ourselves). While this may seem a bit fuzzy at first, it can provide a basis for you to build up a custom collection / trending tool for your needs. My own uses have ranged from integrating JVMs into Simple Network Manager (SNM), Cacti, MRTG and others, to rolling up my sleeves and building my own trending analysis tools customized to the business’ needs. There are lots of possibilities with Perl and SNMP, and even more when you bind together mongo, php and javascript.
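
As a taste of what such a collection tool might start from, here is a small Python sketch that parses snmpwalk output lines of the form "OID = TYPE: value" into a dictionary and checks one value against a threshold. The OIDs in the sample are made-up placeholders rather than real JVM OIDs, and the function names are hypothetical.

```python
import re

# One snmpwalk result per line: "<oid> = <type>: <value>"
LINE_RE = re.compile(r"^(?P<oid>\S+) = (?P<type>[\w-]+): (?P<value>.*)$")


def parse_snmpwalk(text):
    """Map each OID to a (type, value) tuple, skipping lines that do
    not look like a result."""
    results = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if m:
            results[m.group("oid")] = (m.group("type"), m.group("value"))
    return results


def over_threshold(results, oid, limit):
    """True if the numeric value stored for oid exceeds limit.
    Only meaningful for plain numeric types such as Gauge32."""
    _type, value = results[oid]
    return int(value) > limit


# Placeholder output; a real walk of the JVM would show other OIDs.
sample = '.1.2.3.1.0 = Gauge32: 42\n.1.2.3.2.0 = STRING: "demo"\n'
parsed = parse_snmpwalk(sample)
```

From here the parsed values could be written to rrdtool, mongo or whatever store the trending tool uses.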

Some of the advantages I can see with these types of approaches (versus the idea of running a diagnostic tool inside the jvm), would be:

External monitoring capabilities (think thresholds for heap or thread consumption)

Minimal impact to JVM other than the initial start up costs associated with the SNMP stack layered over JMX (to my knowledge, anyone out there know different and could perhaps shed some light?)

Minimal impact to heap, whereas internal diagnostic tools have to store their info somewhere. Is this the right line of thought? (I’ve not delved into the code base for most of those tools, so hopefully, someday, someone will read this and can maybe provide some details on how the magic works)

Known standard with SNMPv2 (some folks seem to dislike SNMP, but I’ve always had decent luck with it, minus the various versions/permutations in use….Note I don’t use it for traps, but for data analysis so have a somewhat skewed view of the landscape)

Customization abilities via most languages that have support for Net-SNMP (quite a few as I understand it)

Most trending applications support custom OID inclusion (such as Cacti, Zenoss, etc)

Some of the disadvantages in my mind would include:

Limited access to JVM internals, JMX hooks that developers have created in their applications are not automagically exposed via SNMP (to my knowledge and from reading the docs)

Useful for debugging only in the sense of data analysis (ie, no debug hooks from this angle)

In the end, I guess it boils down to what you’re after. If you want remote access to the data for storage elsewhere (mongo, rrd, etc) and have knowledge of SNMP, then maybe this method will help you out. If you want debugging, custom hooks (as I understand it, not too bad for most Java developers via JMX), and custom metrics, then perhaps the internal methodology might be the best approach. It always comes down to what makes sense for the folks involved :-P.

Useful resources:

Java MIB file

iReasoning MIB Browser

A more complicated example (lots of things to power this are needed; mongodb, libs, etc, but feel free to review)

Does anyone out there have any thoughts? I’d love to hear more information (and perhaps some schooling on writing some JMX queries could be presented).

Posted in Java Systems, Perl, Systems

Rockin’ the MarkLogic 4.2

Cengage was nice enough to get a group of developers trained in MarkLogic, and allowed some extra folks to sit in. As I’ve been setting up F5, Tomcat and Apache front-ends for MarkLogic for quite a while, I figured it would be useful to actually get the initial introduction and xquery training.

First thing to do, grab the installer files:

For my own personal interest, I wanted to find out how to access MarkLogic via PHP and/or Perl (of course, right :-) ).

So, download the installer, and get it installed. Make sure to write down your username and password. As I’m on a Mac, my install put a MarkLogic entry in the “Other” group under the System Preferences area. Open that baby up and start the server.

Next, click administer or open a browser and go to http://localhost:8001. Log in using the account information you provided during the install.

Another useful link for Mac folks regarding using webdav with the ML server:

Some quick jot downs

Description                        Operator(s)   Examples
Remark parts of xquery statement   (: :)         (: /biography//title :)
Assign a value to a variable       :=            let $somevalue := 'value'
Evaluate a value                   = or eq       //bibcit.fielded[//first eq "Paul"]
                                                 if ($name = 'somevalue')
Evaluate as an "or"                |             //bibcit.fielded[//(first|last) eq "Paul"]

Sample queries

Return all data in the title elements with-in the Biography DTD content in the database


Return all biographies that were published by “St. James Press”

Example #1:
for $bibcitf in doc()/biography/bio.head/source/bibcitation/bibcit.fielded
let $publisher := $bibcitf/citation/print/publisher
where $publisher = "St. James Press"
order by $bibcitf/name/last ascending
return {$bibcitf/title}{$bibcitf/name}

Example #2:
for $bibcitf in doc()/biography/bio.head/source/bibcitation/bibcit.fielded
let $publisher := $bibcitf/citation/print/publisher
where $publisher = "St. James Press"
order by $bibcitf/name/last ascending
return fn:concat($bibcitf/title,"",$bibcitf/name)

Does “The Adventures of Huckleberry Finn” novel contain the word “slave”?

cts:contains(/novel[title eq "The Adventures of Huckleberry Finn"], "slave") 

Does “The Adventures of Huckleberry Finn” novel contain the word Sawyer?

cts:contains(/novel[title eq "The Adventures of Huckleberry Finn"], "Sawyer")

Write XQuery to report the total number of para elements of type “Updates” within each Biography

for $bio in fn:doc()/biography
let $count := fn:count($bio//para[@type="Updates"])
let $value := $bio//title/string()
order by $count descending
return {$count}:{$value}

Which biography has the most para elements of type “Updates”?

(for $bio in fn:doc()/biography
let $count := fn:count($bio//para[@type="Updates"])
let $value := $bio//title/string()
order by $count descending
return {$count}:{$value})[1]

Misc queries/statements

//emphasis [@n eq 1]
/biography/bio.body/works/workgroup/bibcitation/bibcit.composed/emphasis[@n eq 1]
/biography/bio.head/bioname[//last = "Adams"]
/biography/bio.head/bioname/(mainname|variantname)[last eq "Clemens"]
fn:count(//bibcit.fielded[//first eq "Paul"])
//bibcit.fielded[//first eq "Paul" and //editor = "Paul"]
//bibcit.fielded[name/first eq "Paul" and @role="editor"]
//bibcit.fielded[name/@role="editor"][/first eq "Paul"]
//bibcit.fielded[name[@role eq "editor"]/first eq "Paul"]

for $doc in fn:collection()/*
return fn:base-uri($doc)

cts:contains(/biography/bio.head/bioname,"Mark Twain")
cts:search(/(biography|essay), "Mark Twain")

cts:word-query("Saturday Night Live"),
cts:word-query("acted in"))))

cts:search(/biography/bio.head/bioname/mainname, cts:directory-query(("/bios/"),"1"))

for $bio in doc()/biography
return cts:string-concat($bio[//title])

cts:contains(/novels[//title], "slave")

(: for $bio in doc()/biography :)

(: fn:doc()/biography//title/string() :)

(: fn:data(fn:doc()/biography//title) :)



So, as you can see, lots to learn, lots to understand. At face value, XQuery seems like a hodgepodge (no offense intended) of Perl, PL/SQL and some XPath ;-). I’m looking to start checking services with Perl, maybe even hook in some PHP to see what I can really do. We’ll see how it goes, but definitely something interesting and even a little challenging. At times, it seems like XQuery almost operates backwards from what you would normally think about in terms of how the data is processed, but that could just be an initial impression.

So, other things I’m looking at:

  • Perl (as mentioned, querying whether data is being returned correctly, etc). There’s a CPAN module available as ‘Net::MarkLogic::XDBC’, which I’m just now investigating. More to come there.
  • PHP, which I’m still struggling to compile the libxcc module for (and to be fair, it’s on my MacBook so is a little odd when thinking about most installations of Apache/PHP).
  • Building out an HTTP server in MarkLogic, and then building apps around xqy files

More to come on MarkLogic.

Posted in Development, Java programming, Java Systems, Systems

What’s in those jars?

Ever need to know what classes/files are included in the jars in your classpath? I ran into a need to know at one point, and banged out a quick and dirty perl script to check all jar files in a lib directory and output all of the files in each jar found. To use, copy and paste the code below into a file and then make it executable (chmod 755 scriptname) and then execute it via ‘./ /path/to/lib/dir’.

#!/usr/bin/perl
use strict;
use warnings;

die "lib directory not passed\n" if(!$ARGV[0]);

my $TARGET_DIR  = $ARGV[0];
my $JAR_EXEC    = '/usr/bin/jar';
my @jars = glob("$TARGET_DIR/*.jar");
my %listing;

foreach my $jarfile (@jars) {
   print "\njarfile:$jarfile:\n";
   # 'my' gives each jar its own array; a shared array would leave
   # every entry pointing at the listing of the last jar
   my @result = `$JAR_EXEC tvf $jarfile 2>/dev/null`;
   $listing{$jarfile} = \@result;
}

foreach my $key (sort keys %listing) {
  print "$key contains:\n";
  foreach (@{$listing{$key}}) {
   print "\t$_";
  }
}

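
Since a jar is just a zip archive, the same listing can be produced without shelling out to the jar command. Below is a minimal Python sketch of that idea using the standard zipfile module; the function names list_jars and find_entry are my own.

```python
import glob
import os
import zipfile


def list_jars(lib_dir):
    """Map every .jar file in lib_dir to the list of entries it contains."""
    listing = {}
    for jar in sorted(glob.glob(os.path.join(lib_dir, "*.jar"))):
        try:
            with zipfile.ZipFile(jar) as zf:
                listing[jar] = zf.namelist()
        except zipfile.BadZipFile:
            # not a readable archive; record it with no entries
            listing[jar] = []
    return listing


def find_entry(listing, entry):
    """Return the jars that contain the given entry, e.g. a class file path."""
    return [jar for jar, names in sorted(listing.items()) if entry in names]
```

For example, find_entry(list_jars("/path/to/lib/dir"), "com/example/Demo.class") answers the question of which jar a class is coming from.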
Posted in Development, Java programming, Java Systems, Perl, Systems