Wednesday, July 23, 2008

Amounts and Reversals and Voids. Oh, my.

With our upcoming installation (fingers crossed) of Flexible Spending Account (‘FSA’) functionality at one of our OLS.Switch client sites, we’ve implemented a new tranlog column called ‘originalAmount’.  FSA is joined at the hip to partial authorization and real-time credit reversals.  We’ve implemented partial auth before for our Stored Value Systems (‘SVS’) interfaces, but we were more ad-hoc about that.  Now, with the advent of FSA and the overall partial auth focus in the payment switch industry, we’re taking a more permanent approach on this thing.

I summarized the key points to our flagship client like this:

  1. The originalAmount column will always get populated with the incoming <amount> field from the incoming store request – this is true for all financial applications (Debit, EBT, all Credit, all SVC and Check)
  2. For all financial requests except those authorized by Stored Value Systems (‘SVS’) and FSA transaction originals (i.e., not reversals) authorized through FDR North, the amount column will contain the amount authorized by the endpoint.  [NOTE:  Though we support SVS partial auths today, we currently retain no evidence of the original request amount on the tranlog.]
  3. For FSA voids, the incoming <amount> on the store’s request reflects the authorized amount of the corresponding original.  To be safe, we pluck the amount to reverse from the amount column of the corresponding original and use that to populate the reversal.  Unlike ‘basic’ credit, FSA terminal voids and reversals must be sent to FDR North via Store and Forward (‘SAF’) processing.
  4. For FSA timeout reversals (at the terminal), the device is not aware of any partial auth that might have occurred, so the incoming reversal request has the amount originally requested.  That’s what ends up in the originalAmount column of the reversal.  The amount column of the reversal will contain the amount we pluck from the same column in the corresponding original.
  5. The FDR North switch interface does not support Host timeout reversals of Credit (which includes FSA).
  6. SVS voids and terminal timeout reversals are not implemented by the store system.
  7. The SVS switch interface does support Host timeout reversals. 
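
Rules 1, 3 and 4 above can be sketched in a few lines of Java. This is only an illustration, not OLS.Switch code: the `TranlogAmounts` class, its method names, and the use of whole cents are assumptions made for the example; only the two column names come from the list.

```java
/** Hypothetical stand-in for the two amount columns of a tranlog row. */
final class TranlogAmounts {
    final long originalAmount; // the <amount> the store sent, in cents
    final long amount;         // the value recorded in the 'amount' column, in cents

    TranlogAmounts(long originalAmount, long amount) {
        this.originalAmount = originalAmount;
        this.amount = amount;
    }

    /** Original request: keep what the store asked for alongside the recorded amount. */
    static TranlogAmounts forOriginal(long requested, long recordedAmount) {
        return new TranlogAmounts(requested, recordedAmount);
    }

    /**
     * Void or terminal timeout reversal: originalAmount is whatever the store sent,
     * while amount is always plucked from the 'amount' column of the original.
     */
    static TranlogAmounts forReversal(long incomingRequestAmount, TranlogAmounts original) {
        return new TranlogAmounts(incomingRequestAmount, original.amount);
    }
}
```

For example, if a sale was requested for 5000 cents and recorded at 3000 after a partial auth, a timeout reversal arriving with 5000 ends up with originalAmount 5000 and amount 3000.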

Saturday, July 12, 2008

The ‘P’ in ‘TPK’

The advent of PCI-compliant Card Track/PAN encryption schemes at the point-of-sale and the payment switches that support them has brought with it no small amount of confusion, especially with Online Debit and EBT, where two types of encryption are now in flight on all transactions.  There’s one scheme for the PIN, and a second fundamentally different scheme for the Card Track/PAN.  As a result, we get exchanges like this (and these are smart people on all sides, trust me):

Our client – looking to on-board a new POS hardware vendor – sends a communication to the vendor rep and says:

We will be sending your key custodians their respective components for one Terminal Master Key (TMK) and two Base Derivation Keys (BDK-DUKPT).

They get a response back from the vendor that says:

I am filling in for the Key Manager that is currently out on vacation.  I informed him of your intentions to send a TMK along with two BDKs.  He informed me that we do not usually accept TMKs from our customers.

Okay, this is a reasonable misinterpretation.  The guy sees ‘TMK’ and throws up a red flag, thinking we must be using Master Session to encrypt our PINs.  The prevailing standard for PINs is Triple DES DUKPT.  Our client's security guy clarifies with a nice summary of operations and how everything fits:

We use a Thales TRSM to decrypt/encrypt the debit PINs from our stores.  The store PEDs encrypt the PINs with the BDKs we have injected into them.  For credit card processing, we encrypt the credit card data at the PED using our own TPKs, decrypt on our host, and send to [the authorizer].  Our Thales is used to produce the TMK and BDKs that are loaded into the PEDs. These keys are all encrypted using our own Local Master Key (LMK).  The TMK is used to encrypt our TPKs that are sent to the PED’s from our host.

He asked me to follow-up with any further clarification I could add.  I said this:

[You are] absolutely 100% correct.  The TMKs and TPKs have nothing to do with Debit/EBT PIN encryption.  They are used in exactly the manner you describe.  They are thinking of the TMK in its traditional usage of Master Session encryption for PINs.  [You] have re-purposed the TMK/TPK infrastructure to play a part in a PCI-compliant track/PAN encryption scheme.  This brings some additional confusion and concern because the ‘P’ in ‘TPK’ stands for ‘PIN.’  That’s some unfortunate nomenclature.  They are not used for PINs.  Period.  You use a master session scheme for the PAN/track info, and Triple DES BDK-based DUKPT for PINs.

No Circuit Diversity = SPOF

The death blow of any mission-critical payment switch is a SPOF.  There are the obvious ones – like relying on one application server with no architected high availability or fault tolerance built into the design.  There are also some unobvious factors, like a lack of circuit diversity.  I’ll pass along some lessons learned over the past week. 

We urge our OLS.Switch clients to take a number of steps to maximize the up-time of their payment switch implementations.  These include:

  • Replicated application nodes with connections to all endpoints from each node (establish this need early with your authorizers)
  • Content Service Switch (‘CSS’) – aka a “load balancer” - fronting the nodes (and taking this to its logical conclusion, you want two of these)
  • Virtual DB clusters
  • OLS.Switch DB schema on a SAN
  • jPOS QMUX configuration with two or more channels in the MUX definition connecting to physically separate lines
  • The two lines provisioned by separate carriers – this practice is called ‘circuit diversity’…no sensitivity training required! Hey, it even has its own research initiative
  • HSRP built into the authorizer connections

Furthermore, we appreciate authorizer/endpoints that offer geographical diversity in their data centers, like in AMEX’s nice configuration where one connection goes to Phoenix (their ‘IPC’) and one to Greensboro, NC (their ‘NROC’).  You have little control over this from your side, but I like to put this on the table early in planning meetings.  If the authorizer doesn’t do it, we go on the record as comparing them to their peers and noting their shortcomings vs. best practices.

You can do all that and still get bitten by an unforeseen SPOF.  Earlier this week, one of our clients got it, big-time.  That ‘circuit diversity’ initiative referenced above?  It states in part that “Manual assessment and periodic manual assurance are required to ensure that circuits are diverse and remain diverse over time.”  Man, no truer words were ever written.  One authorizer had what it thought was a dual-carrier approach, only to find out that both lines traced through the same CO.  When the CO tanked, so did 100% of the point-of-sale authorizations serviced by that endpoint…to the tune of > $1M USD in lost sales.  “Ouch” doesn’t do that justice.  Now, our client’s excellent network team is working aggressively with this endpoint to engineer the SPOF out of the path.

I write here to prevent you from having similar problems.  Question your authorizers very carefully about their circuit diversity.  Don’t take the words for proof – ask them to demonstrate via manual assessment that the circuits are indeed diverse.

 

Thursday, July 10, 2008

Adding timeout, keep-alive properties to your channel

I’ve written in the past about some of the operational improvements we’ve made in our jPOS QMUX configurations over time.  As a prime example, we’ve had two situations where an authorizer’s tight disconnect model has forced us to go with a MUX Pool approach.  Now here recently, we’ve been presented with another ‘opportunity’ for improvement.  Specifically, we’ve wanted to address a situation seen in production at a client site where our connections to Stored Value Systems (‘SVS’) channels disconnected, then got into a hung state after reconnect.  The status lights stayed green and transactions repeatedly timed-out over the impacted channel.  [NOTE:  SVS is a really good service provider; the issue here was on our end, so it makes for a good, real-life example.]

Alejandro advised us to implement a property value that will set a socket-level timeout.  The ‘receive’ function in the multiplexer’s ‘Channel Adapter’ will fail (with log event ‘<io_timeout>’) if nothing is received within the specified timeout period.  The channel will disconnect and then attempt a reconnection. 

While you can apply these property values to any QMUX channel, we had to take some special care for this one because of the MUX Pool considerations.  The specified value of the <timeout> property should be greater than the related <echo-interval> specified in the related logon manager.  SVS features a tight (3 minutes 30 seconds) disconnect model, leading us to have to implement the ‘MUX pool’ approach with aggressive echo intervals (180000 ms, or three minutes).  The timeout property value specified needs to provide breathing space to allow the echoes to operate in their intended fashion.  An appropriate value is 300000 (five minutes).  Since the MUX pool forces an echo on each channel in the SVS MUX every three minutes, not getting ‘receive’ activity in five minutes raises the possibility of a ‘hung’ line.  Accordingly, we put the channel through a disconnect/reconnect cycle as a proactive measure.
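
The timing relationships in that paragraph are easy to get wrong when someone edits a config later, so here is a minimal sanity check one could run over the three numbers. The `ChannelTimingCheck` class is hypothetical and not part of jPOS; it only encodes the two rules stated above (echo before the peer's idle disconnect, channel timeout after the echo interval).

```java
/** Hypothetical sanity check for the timing values discussed above (all in milliseconds). */
final class ChannelTimingCheck {
    /**
     * Returns null when the values are consistent,
     * otherwise a short description of the first problem found.
     */
    static String check(long peerDisconnectMs, long echoIntervalMs, long channelTimeoutMs) {
        if (echoIntervalMs >= peerDisconnectMs) {
            return "echo-interval must be shorter than the peer's idle disconnect";
        }
        if (channelTimeoutMs <= echoIntervalMs) {
            return "channel timeout must be greater than echo-interval";
        }
        return null;
    }
}
```

With the SVS numbers, `check(210000, 180000, 300000)` returns null; dropping the timeout to two minutes would be flagged because the channel would recycle before the next echo had a chance to arrive.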

The final result:  our channel configuration now looks like this – the new lines are the ‘timeout’ and ‘keep-alive’ properties (NOTE: I’ve obfuscated the ‘host’ value here):

<channel-adaptor name='svs'
    class="org.jpos.q2.iso.ChannelAdaptor" logger="Q2">
<channel class="org.jpos.iso.channel.NACChannel" logger="Q2"
       realm="svs-channel"
       packager="org.jpos.iso.packager.GenericPackager">
  <property name="packager-config" value="cfg/svs.xml" />
  <property name="host" value="299.999.999.999" />
  <property name="port" value="24110" />
  <property name="timeout" value="300000" />
  <property name="keep-alive" value="true" /> 
</channel>
<in>svs-send</in>
<out>svs-receive</out>
<reconnect-delay>10000</reconnect-delay>
</channel-adaptor>

For the record, the corresponding Logon Manager configuration – which remains unchanged in this exercise – looks like this:

<svs-logon-mgr class="org.jpos.svs.LogonManager" logger="Q2">
<property name="persistent-space" value="jdbm:svslogon:log/svslogon" />
<property name="mux"            value="svs-mux-0" />
<property name="channel-ready"  value="svs.ready" />
<property name="timeout"        value="900000" />
<property name="echo-interval"  value="180000" />
<property name="logon-interval" value="43200000" />
</svs-logon-mgr>

Flexible Spending Accounts (New Initiatives, Part 3)

We’re going through the steps right now to certify our OLS.Switch payment switch (on the acquirer side) regarding its ability to support Flexible Spending Accounts (‘FSA’).  This post is Part 3 in my ‘New Initiatives’ series.  The goal of these posts is to show how OLS.Switch’s make-up (most notably its jPOS-based transaction processing framework) facilitates the consideration and on-boarding of new payment initiatives.  As usual, the Good Switch, Bad Switch caveat is in place: jPOS is not a panacea, as even its esteemed creator (btw, he’s in the lower, right-hand corner here, a screen capture from his CNN appearance) will acknowledge…with the wrong team, a craptastic outcome is within reach.  With the right team, you can make magic happen.

FSA installation is, however, the classic ‘slippery slope’ situation – the dominoes start toppling here pretty quickly: your core FSA work leads you to partial authorization.  You’re obligated to handle those in the FSA world.  And, in turn, partial auth means you’ve got to start handling credit reversals online, i.e., you’ve got to SAF the 0400 or 0420 reversal to the endpoint.  In most implementations I encounter, Debit/EBT reversals are SAF-able (btw, I never used ‘SAF’ as a verb until I met The Gladiators…they’ve turned me), but credit reversals “stay local” and – as I am wont to say – everything washes out in the settlement.  Moreover, the SAF-ed credit reversals in our situation look and behave differently than the Debit/EBT reversals.  That’s a subject for an upcoming post, which I’ll definitely entitle “All Reversals Are Not Created Equally.”  But, for now, here’s a summary of the work we did to prep for FSA certification…

[NOTE: Remote auth here is performed on the FDR North platform.  Some details here are masked.]

Core FSA

  1. Add base support for the store system’s new FSA transaction set (x.xx through x.yy comprises sales, returns, plus associated voids and reversals; x.zz – Balance Inquiry – not implemented at POS or tested by OLS).
  2. Add support for x.xx through x.yy to main TransactionManager.
  3. Parse ‘flex-info’ and place in new column ‘flexInfo’.
  4. Make FSA cards card type “FS” (card brands remain unaffected).
  5. Adjust in-place model for FDR North Field 63, Table 14 (now populated differently for FSA vs. non-FSA).
  6. Add new routine to populate FDR North Field 63, Table 68 for FSA cards.  Of special note:  Parser to pluck the Total Rx amount out of <flex-info> and use it to populate the ‘4S’ amount on the outgoing request.
  7. Build FSA response to device using Gift Card model (approved amount and balances); special attention paid to duplicate approvals; new ‘balance map’ to extract balance (if present) from Field 63, Table 68 of response.
  8. ALWAYS lock out manager overrides on all FSA transaction responses.
  9. Modify populating of FDR’s ‘XD05’ record (PTS) for approved FSA transactions that are Visa branded.
  10. Add new routine to build FDR PTS Special Condition (‘S’) record for approved FSA transactions that are MC branded.
  11. Add support for card type ‘FS’ to internal reports – backbone report handlers now have to treat ‘CR’ & ‘FS’ report handlers in the same vein.  On the “real” extract side, these new transactions get extracted - without change - by FDRExtractHandler.  On the internal report side, you can drive the decision as to whether these get extracted as part of the Current XXXExtractDSMCVI or broken out into a separate report.
  12. Related UI changes – add recognition of FSA transaction types; add support for visibility of originalAmount and flexInfo columns.

Partial Auth

  1. Put ‘amount’ from request in ‘originalAmount’ column (new).
  2. Put approved amount from FDR gateway in ‘amount’ column (right now, do this for FSA transactions only).
  3. For Online Credit Reversal (see next), need to grab ‘amount’ from tranlog of original (in Timeout Reversal scenario, the device does not know a partial auth was consummated) and use it to populate Field 4 of outbound ISO request (which, in turn, will end up as the ‘amount’ on the tranlog – see previous bullet).
  4. In new FDR North Field 63, Table 68, make note that we can accept partial approvals on these FSA transactions.
  5. Recognize and treat FDR North Remote Response Code ‘10’ as Approval.
  6. Inform store system of Approved Amount on all FSA responses.

Online Credit Reversal

  1. If original is found, then TransactionManager needs to do real-time Credit reversal.  Make TransactionManager changes to support that.  [NOTE: Sale only; Return is handled locally, like it is today.]
    Add SAF support in FSA Void and FSA timeout reversal.  [NOTE: Sale only; Return is handled locally, like it is today.]
  2. Need to populate Fields 38 (Approval Number) and 39 (Remote Response Code) from tranlog of original.  [Differs from FDR North’s Debit/EBT reversal model.]
  3. Add additional logic to not SAF sale reversal or void if original was a Manager Override (supposed to be locked out of doing this, but this is prevent defense).
  4. Adjust in-place model for FDR North Field 63, Table 14 (now populated differently for original vs. reversal – needs amounts, BankNet or PS/2000 info from the original).

Friday, July 04, 2008

jPOS Runs Your Peaks

July 3rd brought another surge in pre-holiday buying at our flagship OLS.Switch client location.  We hit 1,087,254 transactions, supporting US residents from coast to coast in efforts to stock up on goods for their Independence Day BBQ.  Granted, these US festivities pale in comparison to the Uruguayan BBQ, the world’s league table leader.  But we get by.  And OLS.Switch is there to serve, fueled by the underpinnings of the super-efficient jPOS payment processing framework.  I think this was the 2nd highest day.  December 24, 2007 still holds the trophy, and probably will until the same day later this year. 

Some guy asked us the other day about our “scaling strategy.”  My straight answer was: 

First, two moderate-sized servers (one application, one DB) will easily support millions of transactions a day, or > 300 TPS in our tests. [NOTE: These results may vary depending upon speed and capacity of external authorizers.] However, in our configuration we use additional machines for achieving high availability levels. We also use load balancer technology to allow us to replicate and extend application server capacity; on the database side, we use clustering. We also have the ability to split processing and provide multiple instances of the environment.

My flip answer is that we do more than a million transactions a day on $28,000 worth of core server technology…purchased over two years ago, mind you, so less than $14,000 for the same firepower today.  We never average more than 25% CPU on the application servers, and less than 5% on the primary DB (in the virtual cluster) machine.  This is one of the country’s biggest retailers.  So, for the majority of acquirers out there, scalability concerns are simply not on the table.    Find other things to worry about.  My blog shows there’s plenty else to weigh on your mind.
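
A quick back-of-the-envelope check puts those figures side by side. The class below is just arithmetic on numbers quoted in this post (the July 3rd total and the 300 TPS test figure); the `ThroughputMath` name is arbitrary.

```java
final class ThroughputMath {
    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average transactions per second for a daily total. */
    static double averageTps(long transactionsPerDay) {
        return (double) transactionsPerDay / SECONDS_PER_DAY;
    }

    /** Transactions per day if a given rate were sustained around the clock. */
    static long dailyCapacity(long tps) {
        return tps * SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        System.out.printf("July 3rd average: %.1f TPS%n", averageTps(1_087_254L));
        System.out.println("300 TPS sustained for a day: " + dailyCapacity(300));
    }
}
```

The July 3rd volume averages out to roughly 12.6 transactions per second, while 300 TPS sustained for 24 hours would be 25,920,000 transactions. Retail traffic is bursty, so the busiest minute runs well above the daily average, but the gap between the two numbers is the point.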

My one caveat is the Good Switch, Bad Switch warning:  You do have it in your power as a manager to assemble a crappy, ill-suited team.   In which case, jPOS is no magic elixir.  We know of software vendors who implemented solutions where the client found out later – much to their shock – that it was incapable of doing more than one transaction at a time.  We know because we took the panicky phone call from the manager looking for a bailout.

Wednesday, July 02, 2008

08D7B4

Those are the Check Digits for the notoriously weak double-length key that everyone uses for their test ZMK:

0123456789ABCDEF FEDCBA9876543210

I did a Google search to see if anyone else had posted this number.  Nope.  I own the Weak ZMK Space as of this moment. 

Frankly, the weak key serves a good purpose:  it’s my touchstone to ensure that I’ve not accidentally crossed my test and prod versions of the keys file.  That would be most unfortunate.  As Egon Spengler has duly noted: don’t cross the streams
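
If you want to compute the check digits yourself rather than trust a blog post, here is a small sketch using the standard Java crypto API. It assumes the common convention for a key check value: Triple DES encrypt eight zero bytes under the key and keep the first three bytes of the result. The `KeyCheckValue` class name and helper methods are made up for this example.

```java
import java.security.GeneralSecurityException;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;

final class KeyCheckValue {
    /** Decodes an even-length hex string into bytes. */
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Check value of a double-length (16-byte) key, as six upper-case hex digits. */
    static String kcv(String doubleLengthKeyHex) {
        byte[] k = fromHex(doubleLengthKeyHex);
        if (k.length != 16) {
            throw new IllegalArgumentException("expected a 16-byte key");
        }
        byte[] k24 = new byte[24];            // the JCE wants K1 | K2 | K1
        System.arraycopy(k, 0, k24, 0, 16);
        System.arraycopy(k, 0, k24, 16, 8);
        try {
            Cipher cipher = Cipher.getInstance("DESede/ECB/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(k24, "DESede"));
            byte[] enc = cipher.doFinal(new byte[8]); // encrypt eight zero bytes
            return String.format("%02X%02X%02X", enc[0] & 0xFF, enc[1] & 0xFF, enc[2] & 0xFF);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Triple DES not available", e);
        }
    }
}
```

Calling `KeyCheckValue.kcv("0123456789ABCDEFFEDCBA9876543210")` is the check to run against the value in this post's title. Needless to say, this is for test keys only; production key components belong in the HSM.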

Tuesday, July 01, 2008

Even more on QMUX configurations

The jPOS QMUX feature forms the backbone of OLS.Switch’s remote authorization infrastructure.  I talk about QMUX configuration models in my On-Boarding Guide.  I’ve also blogged about how tight disconnect models can lead you to consider a sub-species of the QMUX model called the MUX Pool.  We make use of that in our connectivity to Stored Value Systems (‘SVS’), a good provider of branded gift card services and a very reliable authorizer.  They use a wicked tight disconnect model:  3 mins 30 secs or so of no traffic raises a peer disconnect on their end.  It’s a good, proactive approach.  What I liked about the conversation with SVS is that they could clearly articulate their approach.  By comparison, we’ve had some frustrations with organizations who can’t describe or only hazily describe what the connection model will be like in production…especially with our replicated application node strategy in play at our client locations.

QMUX has proven to be extraordinarily resilient and efficient in the face of large authorization transaction volumes.  Lines go up, down, up, down…QMUX does the channel management with great skill.  However, we did see a recent situation where an SVS line got into a hung state for a number of hours.  We had the line marked as connected.  QMUX kept the channel in the mix (only one of the two active connections was affected).  Yet, transaction after transaction timed out because neither we nor SVS saw the line as being disconnected or in any type of situation requiring some type of programmatic reset. 

I reviewed the scenario with Alejandro.  He suggested that what we need to do is to add ‘timeout’ and ‘keep-alive’ properties to SVS channel definitions. The timeout value will set a socket-level timeout.  The ‘receive’ function in the multiplexer’s ‘Channel Adapter’ will fail (with log event ‘<io_timeout>’) if nothing is received within the specified timeout period.  The channel will disconnect and then attempt a reconnection.
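
To make the socket-level timeout concrete, here is a small standalone Java sketch. It does not use jPOS at all. It assumes that the ‘timeout’ and ‘keep-alive’ properties correspond to the standard `Socket.setSoTimeout` and `Socket.setKeepAlive` calls, which is an assumption on my part rather than something shown in this post. The `SocketTimeoutDemo` class connects to a local listener that never sends anything, which is a fair stand-in for a hung line.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

final class SocketTimeoutDemo {
    /**
     * Connects to a silent local peer and attempts a blocking read.
     * Returns true if the read gave up with a SocketTimeoutException.
     */
    static boolean readTimesOut(int timeoutMs) {
        try (ServerSocket silentPeer = new ServerSocket(0);
             Socket client = new Socket("127.0.0.1", silentPeer.getLocalPort())) {
            client.setSoTimeout(timeoutMs); // blocking reads give up after timeoutMs
            client.setKeepAlive(true);      // ask the OS to probe an idle connection
            try {
                client.getInputStream().read();
                return false;               // data or EOF arrived, so no timeout
            } catch (SocketTimeoutException e) {
                return true;                // nothing received in time: the hung-line case
            }
        } catch (IOException e) {
            throw new IllegalStateException("local socket setup failed", e);
        }
    }
}
```

The connection itself looks perfectly healthy the whole time, just like the green status lights in production; only the read timeout reveals that nothing is coming back.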

The specified value of the <timeout> property should be greater than the related <echo-interval> specified in the related logon manager.  SVS features a tight (3 minutes 30 seconds) disconnect model, leading us to have to implement the ‘MUX pool’ approach with aggressive echo intervals (180000 ms, or three minutes).  The timeout property value specified needs to provide breathing space to allow the echoes to operate in their intended fashion.  An appropriate value is 300000 (five minutes).  Since the MUX pool forces an echo on each channel in the SVS MUX every three minutes, not getting ‘receive’ activity in five minutes raises the possibility of a ‘hung’ line.  Accordingly, we put the channel through a disconnect/reconnect cycle as a proactive measure.

What we end up with is a Logon Manager (one per defined channel in the Mux) that looks like this…

<svs-logon-mgr class="org.jpos.svs.LogonManager" logger="Q2">
<property name="persistent-space" value="jdbm:svslogon:log/svslogon" />
<property name="mux"            value="svs-mux-0" />
<property name="channel-ready"  value="svs.ready" />
<property name="timeout"        value="900000" />
<property name="echo-interval"  value="180000" />
<property name="logon-interval" value="43200000" />
</svs-logon-mgr>

…and a Channel Manager (one per defined channel in the Mux) that looks like this (“host” and “port” values are examples only):

<channel-adaptor name='svs'
    class="org.jpos.q2.iso.ChannelAdaptor" logger="Q2">
<channel class="org.jpos.iso.channel.NACChannel" logger="Q2"
       realm="svs-channel"
       packager="org.jpos.iso.packager.GenericPackager">
  <property name="packager-config" value="cfg/svs.xml" />
  <property name="host" value="127.0.0.1" />
  <property name="port" value="36000" />
  <property name="timeout" value="300000" />
  <property name="keep-alive" value="true" /> 
</channel>
<in>svs-send</in>
<out>svs-receive</out>
<reconnect-delay>10000</reconnect-delay>
</channel-adaptor>

Monday, June 30, 2008

Defending Your SAF

We implemented a Verizon Stored Value interface in our OLS.Switch solution using jPOS’ FSDISOMsg.  Using this facility, our retail clients can offer their customers phone card activation and refresh (aka ‘top-up’) at the point of sale.  Now come some ‘prevent defense’ requirements on our part.  Because Verizon is a non-standard interface, we need to take some extra protection to ensure that we don't get blown up by a crazy incoming dollar amount.  We saw on Wednesday, May 28th, that 10-position amounts were coming in from the POS.  Our client’s Store System has a bug (re-created in the lab) causing ID numbers to come in as the amount.  This situation doesn't blow up the other interfaces because ISO Field 4 is 12 positions in length. But on Verizon, we get an exception because the defined length in the outgoing request is only six (6) positions.  Our q2.log shows this exception:

org.jpos.iso.ISOException: invalid len 10/6

The problem is further exacerbated because we put the item into SAF queue.  The exception repeats until the item is removed.  Thank god Alejandro’s very nice SAF facility bails us out by removing the crap record from the top of the queue after it expires.  For the record, our 20_verizon_saf.xml participant looks like this:

<saf name='verizon-saf' logger='Q2' realm='verizon-saf' class='org.jpos.saf.SAF'>
<property name='space' value='jdbm:saf-verizon' />
<property name='mux' value='verizon-mux' />
<property name='flag-retransmissions' value='no'>
  if MTI is in list, messages would be retransmitted as xxx1
</property>
<property name='initial-delay' value='60000' />
<property name='inter-message-delay' value='1000' />
<property name='wait-for-response' value='60000' />
<property name='max-retransmissions' value='1000' />
<property name='expire-after' value='1200'>
  in seconds
</property>
<property name='valid-response-codes' value='*' />
<property name='retry-response-codes' value='ZZ' />
</saf>

The ‘expire-after’ value of 1200 seconds is the failsafe here: it trumps the fact that the record is malformed.  After 20 minutes of retransmission misfires (the exception gets re-raised with every re-try), ‘expire-after’ cleans up the mess.
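
The behavior is easy to picture with a toy queue. The sketch below is not the jPOS SAF implementation; `ExpiringQueueSketch` and its method names are invented for illustration, and time is passed in explicitly so the example stays deterministic. The head of the queue is retried until it either goes through or has been queued longer than the expiry window, at which point it is dropped regardless of why it kept failing.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.Predicate;

final class ExpiringQueueSketch {
    private static final class Entry {
        final String payload;
        final long queuedAtSec;
        Entry(String payload, long queuedAtSec) {
            this.payload = payload;
            this.queuedAtSec = queuedAtSec;
        }
    }

    private final Deque<Entry> queue = new ArrayDeque<>();
    private final long expireAfterSec;

    ExpiringQueueSketch(long expireAfterSec) {
        this.expireAfterSec = expireAfterSec;
    }

    void add(String payload, long nowSec) {
        queue.addLast(new Entry(payload, nowSec));
    }

    int size() {
        return queue.size();
    }

    /** One pass over the head entry. Returns "empty", "expired", "sent" or "retry". */
    String processHead(long nowSec, Predicate<String> send) {
        Entry head = queue.peekFirst();
        if (head == null) {
            return "empty";
        }
        if (nowSec - head.queuedAtSec >= expireAfterSec) {
            queue.removeFirst();          // too old: drop it, malformed or not
            return "expired";
        }
        try {
            if (send.test(head.payload)) {
                queue.removeFirst();
                return "sent";
            }
        } catch (RuntimeException e) {
            // a malformed record raises the same exception on every attempt
        }
        return "retry";                   // stays at the head and blocks what is behind it
    }
}
```

With an expiry of 1200 seconds, a record queued at time 0 that always throws comes back as "retry" on every pass until the clock reaches 1200, then "expired", and the queue moves on.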

Now, we do the prevent defense thing as we’ve done in the past.  We need to protect and defend the integrity of the SAF.  In this case, we ought to flat-out reject the Verizon transaction locally if the incoming amount string is > 6 positions.

So, we made this change in our CreateVerizonrequest.java program, leveraging one of OLS.Switch’s internal result codes:

- fsd.set ("purchase-price", msg.get ("amount"));

+ // If the amount length is greater than six, an
+ // exception occurs. The field length in VZN is 6;
+ // the store can do larger amounts. The problem is
+ // further exacerbated because we put the item into
+ // the SAF queue.  The exception repeats until the
+ // item is removed.

+ String reqamt = msg.get ("amount");
+ assertTrue (reqamt.length() < 7, APPLERR_INVAMOUNT,
+     "Amount too big for Verizon outbound message format");
+ if (reqamt.length() < 7) {
+     fsd.set ("purchase-price", msg.get ("amount"));
+ }
+       

As further prevention, Dave has suggested a change not to place an entry in the SAF if an exception is raised when processing the original.  That’s something on deck.

Saturday, June 28, 2008

Real Systems Do Extracts, Part 6

We practice continuous improvement of our extract processes.  Too many payment systems teams get locked into the ‘coolness’ factor on the OLTP side and leave the entire extract side of the equation until very late in the project game.  That’s not the way our OLS.Switch clients see things.  The extract is of equal importance.  After all, it’s how they get paid.  You can do all the transactions you want in a whiz-bang way, but if you can’t reliably produce and ship a related extract file one or more times a day, your sponsoring bank isn’t going to fork over a dime.  You’ll be hearing from your client’s accountant (hey, maybe even the CFO) very early in the morning.  In fact, in this business, if you get a 3 AM phone call on the red phone like Hillary, I can guarantee you it’s not because your most excellent jPOS OLTP engine blew up.  It’s because your extract PoC flopped.  

At one client site, we just formulated and are prepared to roll out an improved process for troubleshooting exceptions in the OLS.Switch nightly extract.  The approach also aims to contain the operational fallout that occurs when an exception is encountered.  Our goal:  Provide our clients' on-call engineers with the tools required to attain self-sufficiency in troubleshooting and resolution.  While we don’t mind the phone calls, the new approach greatly increases the odds that they can sort it out themselves.

Specifically, when the nightly extract encounters an exception condition, we face the following challenges:

  1. The exception is buried in the q2.log.  It can be difficult to find.
  2. The exception relates to one or more specific records on the tranlog, but the ID is not referenced.  It takes a fair degree of intuition and esoteric knowledge to figure out which row prompted the error.
  3. The exception leaves the files in a “half-baked” state, but to the automated scripts that follow they appear complete.  As a result, [our client] sends invalid files (they don’t have trailer records) to processors.
  4. The exception is especially dangerous with respect to [one specific] extract.  Because this file format does not contain a trailer record, the half-baked file is processed correctly by [that endpoint].  This outcome heightens the possibility of duplicate billing of [that provider’s] transactions when the extract is re-run.

To address these concerns, OLS has re-engineered the extract exception process as follows:

  1. The exception text in the q2 log now references the ID of the offending row.
  2. To prevent the VB script from executing at job’s end, when an exception is found all extract files are renamed with an “*.bad” suffix. 
  3. For each extract and report file, the OLS.Switch extract process writes a complementary “*.log” file when an exception occurs.  The exception message will appear inside the log related to the specific file in which we encountered an error while trying to create it.
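
A minimal sketch of the rename-and-log idea, using only the standard `java.nio.file` API. Everything here is hypothetical (the `ExtractFailureHandler` class, the file names, the message wording), and it simplifies the real process by writing a log only for the file that failed. The point is that downstream scripts looking for the normal file names find nothing to ship.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class ExtractFailureHandler {
    /**
     * Writes "<failed>.log" naming the offending tranlog row, then renames
     * every extract file to "<name>.bad" so nothing half-baked gets sent.
     */
    static void quarantine(List<Path> extractFiles, Path failedFile, long tranlogId, Exception cause) {
        try {
            String message = "Extract failed on tranlog id " + tranlogId + ": " + cause;
            Path log = failedFile.resolveSibling(failedFile.getFileName() + ".log");
            Files.write(log, message.getBytes(StandardCharsets.UTF_8));
            for (Path file : extractFiles) {
                Files.move(file, file.resolveSibling(file.getFileName() + ".bad"));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Usage example on throwaway files in a temp directory; returns true if all went as expected. */
    static boolean demo() {
        try {
            Path dir = Files.createTempDirectory("extract-demo");
            Path first = Files.createFile(dir.resolve("first.dat"));
            Path second = Files.createFile(dir.resolve("second.dat"));
            quarantine(Arrays.asList(first, second), second, 12345L,
                       new IllegalStateException("bad row"));
            return Files.exists(dir.resolve("first.dat.bad"))
                && Files.exists(dir.resolve("second.dat.bad"))
                && Files.exists(dir.resolve("second.dat.log"))
                && !Files.exists(first)
                && !Files.exists(second);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```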

Thursday, June 26, 2008

More on the mechanics of jPOS’ FSDMsg facility

A jPOS Users list member asks:

“Does FSDMsg only supports fixed-length fields?  Can FSDMsg supports variable-length fields? Let’s say 2 bytes of length.”

You can do variable-length fields.  ‘FSD’ means “field separator delimited.”  Delimiters define the end of fields that can be variable in length.  Take a look at this schema definition from one of our jPOS implementations:

<schema>
<field id="magnetic-strip-info"      type="AFS" length="80" />
<field id="expiration-date"          type="NFS" length="4" />
<field id="pin-data-fs"              type="A"   length="1" />
<field id="flex-info"                type="AFS" length="108" />
<field id="amount"                   type="AFS" length="14" />
<field id="additional-amount"        type="AFS" length="14" />
<field id="register-number"          type="A"   length="2" />
<field id="tran-id"                  type="A"   length="7" />
<field id="tender-attempt-indicator" type="A"   length="1" />
<field id="tender-number"            type="N"   length="4" />
<field id="tender-attempt"           type="A"   length="4" />
</schema>

That's something we use to parse a portion of an incoming message, where:

  • A = fixed-length alphanumeric field
  • N = fixed-length numeric field
  • AFS = variable-length alphanumeric field, terminated by a field separator ('1C)
  • NFS = variable-length numeric field, terminated by a field separator ('1C)

Here's another example:

<schema id='S'> <!-- Special Condition Record-->
<field id='quasi-cash-indicator'  type='K' length='1' >N</field>
<field id='special-condition-indicator' type='A' length='1'  />
<field id='clearing-sequence'           type='N' length='2'  />
<field id='clearing-count'              type='N' length='2'  />
<field id='cust-svc-phone-flag'   type='K' length='1' >N</field>
<field id='filler-1'                    type='A' length='34' />
<field id='seqno'                       type='N' length='6'  />
<field id='special-use-fields'          type='A' length='19' />
<field id='merchant-trans-indicator'    type='A' length='1'  />
<field id='cert-for-mc-advice-code'     type='A' length='1'  />
<field id='mc-trans-category-indicator' type='A' length='2'  />
<field id='filler-2'                    type='A' length='9'  />
</schema>

Where:

  • K = Constant (you can see how I've specified the constant values).

Some other FSD behaviors to note...

If you're building a message via FSD and you *don't* specify a value for a particular field in your code, then what you get as a default is:

  • For 'A': Filled with spaces
  • For 'N': Filled with zeroes
  • For 'AFS' and 'NFS': You'll get just the field separator

Also:

For 'A': If you populate a field with a value shorter than the length of the field, FSD will left-justify and space-fill to the right.  For example, if I were to populate this field...

<field id="tran-id"                  type="A"   length="7" />

...with "4433", I'd get:

"4433   "

For 'N': If you populate a field with a value shorter than the length of the field, FSD will right-justify and zero-fill to the left.  For example, if I were to populate this field...

<field id='seqno'                       type='N' length='6'  />

...with 4433, I'd get:

004433

If, however, 'seqno' had been defined like so...

<field id='seqno'                       type='NFS' length='6'  />

...and you placed '4433' in there, then FSD would build this in the construction of the message:

4433'1C

(where '1C indicates the presence of a hex field separator)
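The default and padding rules above can be condensed into a few lines of plain Java. This is only a sketch of the behavior as described in this post; it is not jPOS' FSDMsg code, and the class name `FsdPackSketch` and its `pack` method are made up for illustration. Rejecting an over-long fixed-length value is a choice made for this sketch, not a statement about what FSDMsg does.

```java
// Hypothetical stand-in for the padding rules described above; NOT jPOS' FSDMsg implementation.
final class FsdPackSketch {
    static final char FS = 0x1C; // hex 1C field separator

    /** Packs one value according to the FSD type codes A, N, AFS and NFS. */
    static String pack(String type, int length, String value) {
        String v = (value == null) ? "" : value; // an unset field packs as its default
        if (type.equals("AFS") || type.equals("NFS")) {
            // Variable length: the value as-is, then the separator (just the separator when unset).
            return v + FS;
        }
        if (v.length() > length) {
            // Assumption of this sketch: fail loudly rather than truncate.
            throw new IllegalArgumentException("value longer than field: " + v);
        }
        StringBuilder sb = new StringBuilder(length);
        if (type.equals("N")) {
            for (int i = v.length(); i < length; i++) sb.append('0'); // zero-fill to the left
            sb.append(v);
        } else if (type.equals("A")) {
            sb.append(v);
            for (int i = v.length(); i < length; i++) sb.append(' '); // space-fill to the right
        } else {
            throw new IllegalArgumentException("unsupported type: " + type);
        }
        return sb.toString();
    }
}
```

Calling `FsdPackSketch.pack("A", 7, "4433")` and `FsdPackSketch.pack("N", 6, "4433")` reproduces the two padded examples shown above.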

Additionally, when using or defining NFS or AFS you MUST use a length at least as large as the longest value you ever expect to receive.  You see above:

<field id="magnetic-strip-info"      type="AFS" length="80" />

If I were to get, say, 85 characters in from the origination point prior to a field separator, then I'm screwed and my FSD parse breaks down because I've blown past where my field separator ought to be.  We worked with the store system team in this case to determine the absolute maximum we would see coming in...then we set that value a few characters bigger for safety.
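To make the length ceiling concrete, here is a minimal parsing sketch in plain Java. It is not how FSDMsg is implemented; `FsdParseSketch`, its `parse` method and the `{id, type, length}` row layout are assumptions for illustration. The sketch fails fast with an exception when a separator-terminated field runs past its declared maximum, which is one way to surface the problem instead of letting the rest of the parse drift.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical parser for the four type codes discussed in this post; NOT jPOS code.
final class FsdParseSketch {
    static final char FS = 0x1C; // hex 1C field separator

    /** Each schema row is {id, type, length}, mirroring the <field> attributes. */
    static Map<String, String> parse(String[][] schema, String msg) {
        Map<String, String> out = new LinkedHashMap<String, String>();
        int pos = 0;
        for (String[] f : schema) {
            String id = f[0];
            String type = f[1];
            int len = Integer.parseInt(f[2]);
            if (type.endsWith("FS")) {
                // Variable length: read up to the next separator.
                int end = msg.indexOf(FS, pos);
                if (end < 0) {
                    throw new IllegalArgumentException(id + ": missing field separator");
                }
                if (end - pos > len) {
                    throw new IllegalArgumentException(
                        id + ": " + (end - pos) + " chars exceeds declared maximum " + len);
                }
                out.put(id, msg.substring(pos, end));
                pos = end + 1; // skip the separator itself
            } else {
                // Fixed length: take exactly len characters.
                if (pos + len > msg.length()) {
                    throw new IllegalArgumentException(id + ": message too short");
                }
                out.put(id, msg.substring(pos, pos + len));
                pos += len;
            }
        }
        return out;
    }
}
```

With a schema row of `{"magnetic-strip-info", "AFS", "80"}`, an 85-character value followed by a separator is rejected by this sketch rather than silently mis-parsed.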

Another behavior to note:

Suppose in your program you use the wrong FSD field name.  For example, when populating this field...

<field id='seqno'                       type='N' length='6'  />

...let's say that in your code, you did this:

        m.set ("seq-no", Integer.toString(seqno));

(note incorrect field name used in set).

What would happen would be:

  1. Your code would compile without error.
  2. Your program would execute without error.
  3. The records would be created with 'seqno' populated with 000000.

Double-check all your field references in your program to make sure you've got the names exactly right!
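One way to catch this class of typo at run time is to funnel every set through a small guard that knows the schema's field ids. The sketch below is plain Java and entirely hypothetical: `StrictFieldMap` is not a jPOS class, and in a real program you would still have to copy the accepted values into your FSD message afterwards.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical guard: rejects field ids that are not in the schema instead of ignoring them.
final class StrictFieldMap {
    private final Set<String> known;
    private final Map<String, String> values = new HashMap<String, String>();

    StrictFieldMap(String... schemaFieldIds) {
        this.known = new HashSet<String>(Arrays.asList(schemaFieldIds));
    }

    /** Stores the value, or throws if the id is not one of the schema's field ids. */
    void set(String id, String value) {
        if (!known.contains(id)) {
            throw new IllegalArgumentException("unknown field id: " + id);
        }
        values.put(id, value);
    }

    String get(String id) {
        return values.get(id);
    }
}
```

With the guard built from `"seqno"` and the other ids in the schema, `set("seq-no", ...)` throws on the first test run instead of quietly producing records with 000000 in them.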

See my previous posts related to different FSDMsg usages here and here.

Monday, June 16, 2008

Conversion(s) Complete

For payment systems managers wondering if the jPOS project is a good choice for your enterprise, we’ve got an update from our flagship OLS.Switch client as a consideration in your decision-making process. 

At that client, we reached a gratifying dual milestone last week.  First, they’ve just finished on-boarding the last of about 1,800 acquired stores, bringing the total to around 5,020 locations across the four US continental time zones.  Second, we’ve just converted the last of the payment acceptance types off of the legacy platform.  That's a Check Authorization application.  As a result, as of Friday the legacy application is processing a big fat zero transactions per day, while the jPOS-fueled OLS.Switch clocked in a week in which the daily average was just a gnat's hair shy of 1 million (991,588 to be exact). 

I had said here recently that the 'New Normal' after we reached these milestones was going to be one where we reached a million customer interactions a day the 'natural' way, i.e., without a holiday-fueled consumer buying bulge.  We're not quite there yet.  Thursday, Friday and Saturday all clocked in comfortably over the one million threshold but were no doubt fueled by a sizable Father's Day run-up.

We do all this on $28,000 of core server hardware - replicated application nodes + a two-server MS SQL Server virtual cluster.  Refer to my 'Can jPOS Handle It?' series for further details.

Thursday, June 12, 2008

Now Blogging With Windows Live Writer

I’m taking part in Six Apart’s new TypePad Blog Tune-Up Service.  One of the things they noticed in their review is my bad habit of occasionally copying things in from Word.  This is causing some problems because, as the tune-up report notes “Copying content from Word will cause formatting problems, error messages, and feed validation problems. The issue with Microsoft Word formatting code is not unique to TypePad's Rich Text Editor. Any editor that copies over formatting will cause issues such as this -because the code that Microsoft Word inserts is not web-compatible.”

So, I’m taking the Tune-up team’s suggestion by blogging with Windows Live Writer.  So far, I’m impressed.  The download was easy and the integration into IE is seamless.  I just started writing immediately without any feelings of disorientation.

Payment Systems Podcast, Episode 1

My OLS colleagues Dave Bergert and Chuck Wilke were in Big D earlier this week for a planning meeting.  While here, Dave had the idea of recording a first Podcast for his Payment Systems Blog.  He invested in a pretty impressive-looking microphone and recorded and edited the session with Garage Band on his MacBook. 

Dave sounds great here.  His background as a QSA makes him a really good questioner.  I need to get closer to the microphone.  Blame the echo on the "minimalist style" that my wife and I have throughout our house.

Tuesday, May 27, 2008

Bitten by a JRE bug

We have an OLS.Switch implementation where we're forced to do a JNI call to our client's proprietary decryption DLL.  This approach is known for its instability (the Wikipedia entry is rife with warnings).  Alejandro raised the red flag on it since Day One.  His predictions came true: that DLL suffers from some stack management and other issues.  Every so often when volume is hot and heavy on a Friday afternoon, the JVM crashes.  The logs don't do us any good because what happens is outside OLS.Switch's and jPOS' span of control.  All we see in our logs is a five- to 10-second gap.

Fortuitously, our redundant application node strategy minimizes the impact.  The content switch fronting the nodes immediately swings traffic to the other node.  On the impacted server, we lose transactions in flight at time of the crash, but we close that hole with good follow-up because the origination points generate timeout reversals.  There are additional checks and balances on the extract and settlement side to clean up any impact.  [NOTE: The focus and concern on these events is always Debit/EBT impact because the 0200 Purchase request is what I call the 'letter of record.'  There's no problem on Credit.  The affected transactions simply never get placed on the resulting extract files.] 

This particular client runs on Windows.  We have the OLS.Switch application running as a service.  These are set to auto-start.  So, when the JVM crashes, the thing pops back up almost immediately.  Frankly, this resilience is both good and bad - good because no manual intervention is required and we run on a single node for only a short time; bad because the resilience and 'pop back-ability' of the thing causes people to take their eyes off of a fix of the root cause (do better stack management in the DLL).

Now, the story gets a twist.  With Sun EOL-ing JVM 1.4.2 in the fall of 2008, we made a move to upgrade to JRE 1.5.0.  Imagine our surprise when we put this into production and after our first DLL-induced crash as described above, OLS.Switch encountered this error when jPOS' Q2 tried to redeploy the TransactionManager upon service restart:

java.lang.NullPointerException
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1820)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
        at jdbm.recman.TransactionManager.recover(TransactionManager.java:224)
        at jdbm.recman.TransactionManager.<init>(TransactionManager.java:105)
        at jdbm.recman.RecordFile.<init>(RecordFile.java:349)
        at jdbm.recman.BaseRecordManager.<init>(BaseRecordManager.java:198)
        at jdbm.recman.Provider.createRecordManager(Provider.java:108)
        at jdbm.RecordManagerFactory.createRecordManager(RecordManagerFactory.java:114)
        at org.jpos.space.JDBMSpace.<init>(JDBMSpace.java:67)
        at org.jpos.space.JDBMSpace.getSpace(JDBMSpace.java:107)
        at org.jpos.space.SpaceFactory.createSpace(SpaceFactory.java:129)
        at org.jpos.space.SpaceFactory.getSpace(SpaceFactory.java:113)
        at org.jpos.space.SpaceFactory.getSpace(SpaceFactory.java:101)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:585)
        at bsh.Reflect.invokeMethod(Unknown Source)
        at bsh.Reflect.invokeStaticMethod(Unknown Source)
        at bsh.Name.invokeMethod(Unknown Source)
        at bsh.BSHMethodInvocation.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHPrimaryExpression.eval(Unknown Source)
        at bsh.BSHVariableDeclarator.eval(Unknown Source)
        at bsh.BSHTypedVariableDeclaration.eval(Unknown Source)
        at bsh.Interpreter.run(Unknown Source)
        at bsh.Interpreter.main(Unknown Source)

Ouch.  This error got the service into an ugly "half-baked" state, which I described in some breakdown analysis.  Here's what I said:

  1. The JVM running the ‘ols-switch’ service on APP02 crashed at 15:19:56 on Friday.  Cause of the crash was Decrypt.dll.  [See the five-second gap in attached snippet on the q2.log.  Other than that gap, we have no crash visibility as Decrypt.dll is outside the span of our control.]
  2. The service is set for auto-restart.  See deploy sequence in q2.log starting at “01_capture_date.xml”
  3. In the log, note the “NullPointerException” when attempting to deploy “30_ev_txnmgr.”  This is the program that accepts transaction requests from the server which listens to port 33000 (see next point).  It also handles EV, Discount Coupons, Check, Reward Cards + all transactions routed to FDR, AMEX and JCP.  [NOTE:  All SV-class transactions are forwarded by 30_ev_txnmgr to smaller, separate, independent transaction managers.]  As a result of the NPE, this participant did not initiate.
  4. The deploy sequence continues after the error, including most notably, “50_ev_server.xml”.  This is the participant that listens to requests from port 33000 and queues them for sending to 30_ev_txnmgr.
  5. The deploy sequence completes without a loaded version of 30_ev_txnmgr.
  6. Transactions begin to be accepted by APP02 (because 50_ev_server is listening to the port).
  7. Because 30_ev_txnmgr isn’t available to accept messages, 50_ev_server simply deletes any request from its queue older than 60 seconds.
  8. After 150 seconds (two-and-a-half minutes) of inactivity, OLS.Switch’s Status Manager placed the “OLS.Switch TransManager 02” into a WARN state (yellow light).  This warning notes that the transaction manager is not processing transactions.
  9. As triage, the solution was to stop the ‘ols-switch’ service on APP02.  This stopped the ‘listen’ on port 33000, meaning that the load balancer shifted all traffic to APP01.
  10. APP01 ran the full chain load for the next 45+ minutes.
  11. We examined the code and realized that one of the JDBM files was corrupted.  Specifically, it was one of the STAN files (we use JDBM spaces for SAF and Logon Managers, too).  Whenever the JVM is ended unnaturally (as happened here), there is the risk of this corruption happening.
  12. 30_ev_txnmgr has three JDBM files: amex-stan; fdr-stan; and jcp-stan.  These are spaces that keep a counter to assign the next System Trace Audit Number (‘STAN’).  Physically each JDBM implementation is two files, a *.lg and *.db file, so there were actually six files in total.  We instructed the SysAdmin to rename those six files (he prefixed them all with “100.”).  [NOTE:  At this point, the SysAdmin also changed the ‘OLS_JAVA’ environment variable on ITCOLSAPP02 to point to 1.4.2_09.]
  13. The service was restarted.  The application recreated the missing files and transactions began flowing again.
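The rename in step 12 was done by hand, but it is easy to script. Here is a minimal sketch using only the JDK's `java.nio.file` API; the class name `StanFileQuarantine` and the idea of passing in the directory are assumptions for illustration, and this is not the pre-flight check discussed later in this post.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: sets suspect JDBM space files aside so the application recreates them.
final class StanFileQuarantine {
    /** Renames <name>.lg and <name>.db in dir by prefixing "100."; returns the new paths. */
    static List<Path> quarantine(Path dir, String... spaceNames) throws IOException {
        List<Path> moved = new ArrayList<Path>();
        for (String name : spaceNames) {
            for (String ext : new String[] {".lg", ".db"}) {
                Path src = dir.resolve(name + ext);
                if (Files.exists(src)) {
                    Path dst = dir.resolve("100." + name + ext);
                    // Files.move throws if the target already exists, so nothing is overwritten.
                    moved.add(Files.move(src, dst));
                }
            }
        }
        return moved;
    }
}
```

For the case above, the call would be `StanFileQuarantine.quarantine(dir, "amex-stan", "fdr-stan", "jcp-stan")`, where `dir` is whatever directory holds the space files on that node.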

Note the comment that 'one of the JDBM files was corrupted.'  It certainly manifested itself that way in the trace and in subsequent testing.  As Alejandro noted:  "If there's a JVM corruption, things can be quite unpredictable."  Until this point, we figured we'd just dodged a bullet.  We didn't tie the JDBM behavior to the JRE 1.4.2 --> 1.5 upgrade.

We were wrong.  Now, the story takes another turn. 
OLS colleague Dave Bergert tells the story from here:

This step is a follow-up to the "corrupted JDBM" files issue, OLS’ proposed “pre-flight check,” and the installation of Java JRE 1.5.  The files were never corrupted; it appeared to be a JVM bug.

OLS and [our client] tested the pre-flight check in production on May 9th using the crash files from April 11th (when we had JRE 1.5.0_15 installed and had a JVM crash).  The expected result was for the pre-flight check to attempt to open the JDBM files, discover that they were corrupt and shut down the system.  Instead, the result of the exercise was that the switch came up fine with the "corrupted files."

This result led us to discover that under JRE 1.4.2_09 the "corrupted files" could be opened and used fine, while under JRE 1.5.0_15 the files could not be opened and the crash check marked them as problem files to be renamed.

We performed further testing to see why the "corrupted files" could be opened under JRE 1.4.2_09 and not under JRE 1.5.0_15.  OLS tested the following Java JREs in its MS Windows test environment to duplicate the issue locally for testing and resolution.

  • JRE 1.4.2_09
  • JRE 1.5.0_15
  • JRE 1.6.0_06

The results were that under JRE 1.4.2_09 and JRE 1.6.0_06 - the "corrupted files" were able to be opened and the switch would work; under JRE 1.5.0_15 the crash check would mark these files as corrupted.  It was odd that JRE 1.5.0_15 could not open the files but the other two could.

We did further investigation and discovered a Sun bug report that affected Java JRE versions 1.5.0_08 through 1.6.0_03.

We installed JRE 1.5.0_07 (the latest 1.5.0 update release not affected by this bug).  The "corrupted files" could be opened and the switch worked, suggesting this bug was the core issue.

The "corrupted files" were never actually corrupted -- it was a Sun Java JRE bug resulting in a behavior in which the OLS.Switch application could not open/read these files under certain circumstances.  The bug existed for versions 1.5.0_08 through 1.6.0_03 of the Java JRE.
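Given that affected range, a start-up guard that refuses (or at least warns) when running on one of those JREs is a cheap safety net. The following is a sketch only: `AffectedJreCheck` is a made-up name, it is not part of OLS.Switch or jPOS, and it only understands plain version strings of the form `1.5.0_15`.

```java
// Hypothetical check against the affected range quoted above (1.5.0_08 through 1.6.0_03).
final class AffectedJreCheck {
    /** Parses "1.5.0_15" into {1, 5, 0, 15}; a missing update suffix counts as 0. */
    static int[] parse(String version) {
        String[] mainAndUpdate = version.split("_");
        String[] parts = mainAndUpdate[0].split("\\.");
        int[] out = new int[4];
        for (int i = 0; i < 3 && i < parts.length; i++) {
            out[i] = Integer.parseInt(parts[i]);
        }
        out[3] = mainAndUpdate.length > 1 ? Integer.parseInt(mainAndUpdate[1]) : 0;
        return out;
    }

    /** Compares two parsed versions component by component. */
    static int compare(int[] a, int[] b) {
        for (int i = 0; i < 4; i++) {
            if (a[i] != b[i]) {
                return a[i] < b[i] ? -1 : 1;
            }
        }
        return 0;
    }

    /** True when the version falls inside the inclusive affected range. */
    static boolean isAffected(String version) {
        int[] v = parse(version);
        return compare(v, parse("1.5.0_08")) >= 0 && compare(v, parse("1.6.0_03")) <= 0;
    }
}
```

Of the JREs named in this post, the sketch flags 1.5.0_15 and clears 1.4.2_09, 1.5.0_07 and 1.6.0_06, which lines up with the test results described above.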

As part of the JRE investigation (as described above), we took the opportunity to upgrade the incorporated JDBM jar from v0.2 to v1.0.  We updated from the project’s CVS source and made an additional fix described here.

Monday, May 26, 2008

The Field Behavior of Transactions

Every transaction in a payment switch environment has a series of behaviors that go along with it.  In OLS.Switch, these behaviors are the focus of implementation of jPOS' TransactionManager and its participants.  A lot of factors come into play here:

  • Are we switching out the transaction for remote auth or handling things locally?
  • Can the originating device reverse the transaction?
  • Can the originating device void the transaction?
  • Is a duplicate transaction possible?
  • Is a Manager Override version of the transaction possible?
  • If the request for remote auth times out, are we obligated to send a reversal?

The answer to these questions can vary by card type, card brand and endpoint. 

For one customer, I synthesized all these behaviors into a one-page summary.  Take note that the originating point in this example uses a Visa Gen 2 message set, so the G.93 designations (the Debit Sale - listed here simply as an example) are fairly esoteric.  I believe it still gets the point across in a clear manner. 

Having this one-pager in hand allows anyone to summarize the operational model quickly.  For example, you can see that all 'Merchandise Return' transactions are processed locally, except the JC Penney ('JCP') version (which is switched out for auth) and the EBT Food version (where a timed-out external auth request is reversed via an entry placed in that endpoint's SAF queue).
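For readers who think better in code than in spreadsheets, the one-pager boils down to a lookup table with per-variant overrides. The sketch below is an illustration in plain Java, not OLS.Switch code; the class, enum and method names are invented, and it models only the two behaviors the Merchandise Return example actually states. The post does not say what happens when the JCP version times out, so the `false` used there is a placeholder, and treating EBT Food as switched out is my reading of "external auth request."

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical behavior table: one default per transaction, with overrides per card variant.
final class BehaviorTableSketch {
    enum Auth { LOCAL, SWITCHED_OUT }

    static final class Behavior {
        final Auth auth;
        final boolean safReversalOnTimeout;

        Behavior(Auth auth, boolean safReversalOnTimeout) {
            this.auth = auth;
            this.safReversalOnTimeout = safReversalOnTimeout;
        }
    }

    private final Map<String, Behavior> defaults = new HashMap<>();
    private final Map<String, Behavior> overrides = new HashMap<>();

    void setDefault(String transaction, Behavior b) {
        defaults.put(transaction, b);
    }

    void setOverride(String transaction, String variant, Behavior b) {
        overrides.put(transaction + "/" + variant, b);
    }

    /** Returns the variant-specific behavior if one exists, else the transaction default. */
    Behavior lookup(String transaction, String variant) {
        Behavior b = overrides.get(transaction + "/" + variant);
        return b != null ? b : defaults.get(transaction);
    }

    /** Encodes only what the Merchandise Return example states. */
    static BehaviorTableSketch merchandiseReturnExample() {
        BehaviorTableSketch t = new BehaviorTableSketch();
        t.setDefault("Merchandise Return", new Behavior(Auth.LOCAL, false));
        // JCP is switched out for auth; its timeout handling is not stated, so false is a placeholder.
        t.setOverride("Merchandise Return", "JCP", new Behavior(Auth.SWITCHED_OUT, false));
        // EBT Food: a timed-out external auth request is reversed via the endpoint's SAF queue.
        t.setOverride("Merchandise Return", "EBT Food", new Behavior(Auth.SWITCHED_OUT, true));
        return t;
    }
}
```

The remaining questions in the list (voids, duplicates, Manager Override) would become additional fields on `Behavior`, filled in from the real one-pager rather than guessed.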

0100 or 0200?

Ahmed El-Malatawy, a reader of this blog, has asked a series of good questions related to decisions he needs to make concerning his jPOS implementation.  Here's our latest exchange...

Ahmed asks:

I need to implement a purchase functionality (in general debit money from a specific account) from my application that uses JPOS as an ISO 8583 protocol implementation. The question is what is the message class that I can use: Authorization 1xx or Financial 2xx.  In other words, what is the suitable message class for the operations of debits, credits and money transfer between accounts - Authorization, Financial or Both?  I have a problem in differentiating between the 2 classes Authorization & Financial.

It's really a nice question because the answer is not obvious.

Here's my answer...

----------

Ahmed –

There’s no definitive right and wrong, but in practice the difference is typically drawn like this:

  • If the transaction is to have an actual effect on the customer’s account, then you use an 0200.*
  • If the transaction is instead to put a *hold* on funds in the person’s account in anticipation of a settlement record to come later, then you use an 0100.

As a real-life example, we ‘talk’ to FDR North.  If we do a credit card or offline debit transaction, we send an 0100 authorization.  As long as that transaction is not subsequently reversed, then we’re obligated to send a corresponding settlement record later that night.  It’s that piece of information – not the preceding authorization – that has the real effect on the customer’s account.  Conversely, an online Debit is what I call the ‘letter of record.’  It’s that transaction that affects the account, so it's implemented as an 0200.

Note that some gateways like FDR North may ask for Online Debit transactions to appear in the settlement file that night, but it’s more for institution-to-institution accounting purposes, rather than to affect individual cardholder balances.

For some better definitions of my usage here of the terms ‘credit,’ ‘offline debit’ and ‘online debit,’ take a look at my “Credit vs. Debit – Part 2” post.

----------

Just to give you an idea of how loosely this standard is applied, here's a quick 0100/0200 tally of various implementations we've got running:

  • FDR North (Credit) - 0100
  • FDR North (Offline Debit) - 0100
  • FDR North (Online Debit) - 0200*
  • FDR Omaha (Credit) - 0100
  • American Express (Credit) - 1100 (AMEX uses the 1993 version of ISO 8583)
  • Discover (Credit) - 0100
  • SVS (Pre-paid Balance Inquiry) - 0100
  • SVS (Pre-paid Refresh, Purchase, Return, Deactivate) - 0100
  • SVS (Pre-paid Activation) - 0300
  • GreenDot (Pre-paid Activation, Deactivation) - 0100
  • Incomm (Pre-paid Activation, Refresh, Deactivation) - 0200

*The simplifying caveat here is that I'm not delving into the whole Pre-auth/Completion scenario in this post.
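If it helps to see the rule of thumb and the tally side by side, here is a small sketch in plain Java. It does not use jPOS; `MtiSketch` and its methods are invented for illustration. The decoding relies on the standard ISO 8583 layout of the message type indicator, where the first digit is the version of the standard and the second digit is the message class.

```java
// Hypothetical helper for reading the 0100/0200 tally above; NOT part of jPOS.
final class MtiSketch {
    /** First MTI digit: which version of ISO 8583 the message follows. */
    static int isoVersionYear(String mti) {
        switch (mti.charAt(0)) {
            case '0': return 1987;
            case '1': return 1993;
            case '2': return 2003;
            default: throw new IllegalArgumentException("unrecognized version digit in " + mti);
        }
    }

    /** Second MTI digit: the message class. */
    static String messageClass(String mti) {
        switch (mti.charAt(1)) {
            case '1': return "authorization";
            case '2': return "financial";
            case '3': return "file action";
            case '4': return "reversal";
            default: return "other";
        }
    }

    /** Rule of thumb from the answer above: a hold on funds is an 0100, a direct effect is an 0200. */
    static String requestMtiFor(boolean affectsAccountDirectly) {
        return affectsAccountDirectly ? "0200" : "0100";
    }
}
```

Run against the tally, AMEX's 1100 decodes as an authorization under the 1993 version, and the SVS activation's 0300 lands in the file action class, which is a good illustration of how loosely the classes get applied in practice.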

Saturday, May 24, 2008

Replicating your issuer-side jPOS implementation in Country #2

I blog here mostly about all the development, operational and new initiative activity going on on the acquirer-side of OLS.Switch (our jPOS-based payment switch), but we've got an issuer side solution we created as well.  You don't hear about that much here because, by nature, there are not as many new initiatives, volume is lower (just a quirk of our client set mix) and there's only a single payment gateway in play in the solution environment.  Compare that last point with a typical acquirer-side solution.  At our flagship client, for example, we've got 32 endpoints in play. 

The reason I bring this up is that an issuer client of ours has approached us about replicating our in-place solution for an operation they want to bring online in another country.  I was asked to put together a list of items that would need analysis and probable change.  I made the assumption (confirmed, in this case) that the payment gateway we're using  here in the US instance won't be a viable option for Country #2 and beyond.  With that fact in hand, here's the list of touchpoints I had tallied:

  1. We need copies of all related specifications describing the interface that will perform the role that the current gateway provider now plays in conjunction with the US implementation.
  2. In particular, we need four pieces of information: the authorization message set and field descriptions; the telecommunications guide; the encryption guide; and the certification script.  Optimally, these documents should be provided to us in English.
  3. After we have time to review the documents, we need a formal walkthrough in the form of a teleconference with the gateway provider.
  4. We need to know details about the reversal model; specifically, how does the reversal get matched-up to the original?  (each gateway provider has a distinct match-up methodology).
  5. Similarly, we need to know details about the pre-auth/completion model; specifically, how does the completion get matched to the pre-auth?  (Again, each gateway provider has a distinct match-up methodology.)
  6. We need to understand the testing facilities that will be made available to us, how to coordinate testing and what will be required for certification.
  7. We need to know if we need to have any specific language capabilities on our side to perform testing and certification.
  8. We also need to know whether any currency conversion issues will be in play.

Items 4 and 5 - how to match the reversal to the original; how to match the completion to the pre-auth - are worthy of a closer look.  I mention above that "each gateway provider has a distinct match-up methodology."  I have a write-up in my files that delves into the way that one such gateway provider does it.  What we, in turn, have to do is described here.  I think it's a pretty good one-page summary of things.

Friday, May 23, 2008

Dave's Productivity

Now I know why my OLS colleague Dave Bergert is so productive: the quad-core processor in his brain is hovering over all those machines dividing up his work cycles so that he's coding transaction simulators on one device, fixing production problems on another, chatting with me on Skype on a third, setting up PABP Audit environments on a fourth, doing jPOS commits on a fifth...

jPOS Created Here

That's the desktop where the world's best payment systems infrastructure gets created.  I'm grateful to play a role in the world Alejandro has created.

In other news, it looks like we'll top 5,000 store locations today - or, by latest, next Tuesday - at our flagship OLS.Switch installation.  It's been a long time getting those 1,800 new stores on-boarded, but our client got some serious momentum over the last couple of months.  The last store gets converted on Friday, May 30th.

Sunday, May 18, 2008

Mother's Day Follow-up

A quick follow-up to my earlier Mother's Day post...our jPOS-enabled payment switch (acquirer side) continued to see big volume increases over Mother's Day weekend.  We topped a million on Saturday on OLS.Switch and, more notably, almost 800,000 on the big day itself.  If you compare that number to the previous Sunday, you can make a reasonable assumption that about 150,000 panicky sons and daughters across this fine country piled into stores desperately seeking that last minute token of gratitude to stay in Mom's good graces for one more year.

That 914,077 average for the week is an all-time high.

Saturday, May 10, 2008

Implementing MethCheck (New Initiatives, Part 2)

One of the gratifying things about the maturation of the OLS.Switch payment switch solution is that our clients have the confidence to come to us with challenging new point-of-sale payment initiatives.  Recently, we had one come our way that was quite unique.  It's called MethCheck, a new service from Appriss, Inc.  As you can tell from the service's name, this new initiative involves tracking the sale of Pseudoephedrine.  This particular client is a large pharmacy chain, and they're being mandated by the state of Kentucky (surely only the first state of many to insist on this type of service) to implement it.  Here are some of Appriss' key talking points about the service:

  • A single point of contact for managing compliance, ensuring pharmacies are submitting all required data to law enforcement.
  • Tracks PSE purchase limits, any aggregate limits required by the law, box limits, pill counts, and acceptable forms of identification.
  • Multi-state Compliance Manager (MSCM) keeps up to date with new PSE legislation.
  • Communicates with state electronic PSE repositories allowing pharmacies to stay compliant without maintaining multiple interfaces.

Here's a nice diagram (see pop-up at left) from the Appriss site describing their solution.  OLS.Switch in this setting is just a piece of the orange box (as marked-up by me) in the upper-right corner.

My colleague Dave Bergert has an excellent post on his blog describing the nitty-gritty details of how we got this very non-standard message set flowing through our application.  Most notably, he relates how we used jPOS' FSDMsg class to get the job done in less than a week (see here for a post I did about Extracts - part of my 'Real Systems Do Extracts' series that ends up spotlighting FSD).

As a payment systems manager, the key takeaway here is that by implementing a jPOS-based solution, you become the go-to-guy (or gal) in your organization for anyone envisioning innovative uses of your payment systems infrastructure.  MethCheck is far afield from Debit/EBT, for example, but we achieved co-existence without jeopardizing what's already in place.  This is good news for you because we know the frustrations of those of you managing legacy payment systems.  You tell us it always seems to be "six months and major bucks" from your vendor for any initiative of this scale.  And for those of you outsourcing, well, best of luck getting some attention.

350,000,000 and counting

Most days, we've got heads down, doing the continuous improvement thing on our jPOS implementations.  Today, I lifted my head long enough to notice that we just hit row ID 350,000,000 on our tranlog at a large OLS.Switch acquirer-side payment switch implementation.  That's not to say we've got that many rows loaded in there as we're pretty aggressive about our tranlog culling practices, but it is a good "from Day One" counting proxy.

As a complement to that feat, I mentioned to Alejandro earlier this week that the two Subversion repositories that feed this particular implementation are sitting at a combined r3300 right now.  For anyone thinking the work tails off at some point, I can confidently state that it actually increases.  In fact, with our new payment initiatives, the speed of SVN check-ins right now is as rapid as it has ever been.  We take this pace of activity as a vote of confidence from our client.

Add Mother's Day to the pantheon of buying bulges

So, I was talking to my wife and mentioning that with our flagship OLS.Switch payment switch implementation closing in on 5,000 store locations (4,865 as of yesterday), we would soon see our first 'natural' 1,000,000 transaction day and that yesterday may come close.  To which she responded:  "Umm, Mother's Day anyone?"

Oh, yeah.  That.

We serviced 964,582 separate customer interactions yesterday, as consumers across the four US continental time zones streamed in to buy cards and other testimonials to their Moms.  This buying bulge matches previous waves we've seen pre-Easter, Valentine's Day and most notably on December 24th.

It's steady as she goes on the jPOS performance front - our two replicated application nodes sit at about a 50 MB Memory footprint, at peak we're at about 20% CPU on $7,000 Quad Core servers, and our optimized MS SQL Server virtual cluster barely tickles the CPU charts (about 2% - 5% or so).

Here's Friday, May 9th in depth.  We took a new application (internal Check Authorization) live this week, so followers of these charts will note some new rows as we start to roll out this new service.  It's deployed in one pilot location.

Saturday, May 03, 2008

The New Normal (continued)

Store conversions continue at a rapid clip at our flagship OLS.Switch payment system acquirer-side implementation.  As noted in a previous post, we're approaching a situation where the 'new normal' will be 1,000,000 transactions a day from 5,000+ store locations.  Two weeks ago we were at 4,674 locations.  As of yesterday, we're at 4,801.  And this "typical" Friday now sees us handling 920,522 separate customer interactions including:

  • Credit card (American Express, MasterCard, Discover, Visa, JC Penney, JCB, Diners Club)
  • Offline (signature) debit card
  • Online ("PIN-ed") debit card (go here for my explanation of Credit vs. Debit)
  • Electronic Benefits Transactions ('EBT'), both the 'cash' and 'food stamp' transaction varieties
  • Our client's own branded Stored Value (aka, "Gift") Card, including brand differentiation (i.e., their brand as well as the brands of companies they've acquired and assimilated)
  • Phone cards
  • Third-party Stored Value cards (those ubiquitous "card malls" you've seen sprouting up in retailers all over, thanks to these guys)
  • Private label card
  • Employee Verification (if we validate the employee, the Store System applies a discount on the subsequent transaction)
  • Discount Coupon validation
  • Customer reward card "reverse lookup" by phone number

Coming soon:

  • Check authorization - Developed and certified; this is getting implemented into production this upcoming week...we expect to see another 50,000 or more transactions a day when this transaction type gets rolled out completely.
  • Online "MethCheck" Pseudoephedrine Inquiry - Developed and now in test and acceptance by our client (more on this new initiative in a subsequent post)
  • Healthcare Flexible Spending Accounts ('FSA') - Now under development (more on this new initiative in a subsequent post)

The full working week looked like this...presented in Spanish in homage to jPOS Project Lead Alejandro Revilla (this one, not this one) and his many fans in Latin America:

Lunes - 821,258
Martes - 847,890
Miércoles - 848,590
Jueves - 876,766
Viernes - 920,522

¿Impresionante, verdad?

One other observation is that the average response time for all externally authorized transactions was 928 milliseconds, an important metric because payment systems implementers always strive for the goal of "sub-second response time."  [NOTE:  I'm only measuring approvals; denials are skewed by issuer problems beyond our span of control.]  Two factors bring this number closer to 1 second than we've seen in the past:

  • The client has rolled out "PIN prompting by BIN" at its stores, so we're weighted more heavily towards online Debit in the transaction mix.  Those transactions go through a gateway (and most times a regional network) before hitting the issuer, meaning it requires three to four hardware PIN translations along the way.
  • One of the Stored Value authorizers had an average response time yesterday about 2x their norm.