Question Everything I know about Indexes and Querys are wrong.

JamesBowen · Sep 5, 2013

OE 11.3 WebSpeed
OS Linux CentOS 6.4 64bit

I've been a Progress Programmer for 15-16 years quite happily writing code the and learning from other Progress developers along the way. However for the last 6-7 years I been coding solo and I came across a complexing situation with a new piece of code I'm writing.

I've created simple database which stores Geo IP Address ranges relating IP address to Geo Location i.e. City & Country. I know there are API Web Services which does for me, I just wanted to create my own Database.

The confusion I am having is between FOR FIRST vs. FOR LAST for a unique record find, when there are about 1.7 million records.

This query takes 0-1 milliseconds to process.

Code:

DEFINE VARIABLE inIPAddress AS INT64  NO-UNDO.

inIPAddress = convIPaddressToDec(INPUT  "114.45.67.89").
FOR LAST GeoCityBlock NO-LOCK
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
    AND  GeoCityBlock.IPAddressToDec  GE inIPAddress:

END.

This code takes about 4752 milliseconds to process.???

Code:

DEFINE VARIABLE inIPAddress AS INT64  NO-UNDO.

inIPAddress = convIPaddressToDec(INPUT  "114.45.67.89").

FOR FIRST GeoCityBlock NO-LOCK
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
    AND  GeoCityBlock.IPAddressToDec  GE inIPAddress:
END.

Why would FOR FIRST Take longer to process? I must confess that I hardly (almost never) use a FIND LAST or FOR LAST statement because I've never needed to. Teach an Old dog new trick.

The convIPaddressToDec function just convert IPv4 IP address into a integer (int64).

DataDefinitions for the GeoCityBlock Table:

Code:

ADD TABLE "GeoCityBlock"
  AREA "Schema Area"
  DUMP-NAME "geocity"

ADD FIELD "IPAddressFromDec" OF "GeoCityBlock" AS int64
  FORMAT ">>>>>>>>>9"
  INITIAL "0"
  LABEL "IP Address From Dec"
  POSITION 2
  MAX-WIDTH 8
  COLUMN-LABEL "IP AddressFrom Dec"
  ORDER 30

ADD FIELD "IPAddressToDec" OF "GeoCityBlock" AS int64
  FORMAT ">>>>>>>>>9"
  INITIAL "0"
  LABEL "IP Address To Dec"
  POSITION 3
  MAX-WIDTH 8
  COLUMN-LABEL "IP Address To Dec"
  ORDER 40

ADD FIELD "LocationID" OF "GeoCityBlock" AS integer
  FORMAT ">>>>>>9"
  INITIAL "0"
  LABEL "LocationID"
  POSITION 4
  MAX-WIDTH 4
  COLUMN-LABEL "LocationID"
  ORDER 50

ADD INDEX "idxIPAddressRange" ON "GeoCityBlock"
  AREA "Index Area"
  PRIMARY
  INDEX-FIELD "IPAddressFromDec" ASCENDING
  INDEX-FIELD "IPAddressToDec" ASCENDING

.
PSC
cpstream=UTF-8
.
0000000840

JamesBowen · Sep 5, 2013

After reading this thread about Find First, VS. For First... I'll try using a FIND Statement () :

Averaged about 2275 milliseconds.

Code:

DEFINE VARIABLE inIPAddress AS INT64 NO-UNDO. 
inIPAddress = convIPaddressToDec(INPUT "114.45.67.89").
FIND GeoCityBlock NO-LOCK                               
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
    AND  GeoCityBlock.IPAddressToDec  GE inIPAddress
    NO-ERROR.

Averaged about 3224 milliseconds.

Code:

DEFINE VARIABLE inIPAddress AS INT64 NO-UNDO. 
inIPAddress = convIPaddressToDec(INPUT "114.45.67.89").
FIND FIRST GeoCityBlock NO-LOCK                               
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
    AND  GeoCityBlock.IPAddressToDec  GE inIPAddress
    NO-ERROR.

Averaged about 0 milliseconds (that's ZERO milliseconds).

Code:

DEFINE VARIABLE inIPAddress AS INT64 NO-UNDO. 
inIPAddress = convIPaddressToDec(INPUT "114.45.67.89").
FIND LAST GeoCityBlock NO-LOCK                               
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
    AND  GeoCityBlock.IPAddressToDec  GE inIPAddress
    NO-ERROR.

NOTE: I'm just using a the IP Address "114.45.67.89" as an example. It does not matter was IP address I use the result it almost the same. FOR LAST or FIND LAST is always the fastest for this query. But why?

JamesBowen · Sep 5, 2013

This one it is for TOM, using a FOR EACH statement and Applying a LEAVE on the first record found.

Average Time Taken 3042 milliseconds :

Code:

GeoCityBlock:
FOR EACH GeoCityBlock NO-LOCK
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
    AND  GeoCityBlock.IPAddressToDec  GE inIPAddress:


    LEAVE GeoCityBlock.
END.

TomBascom · Sep 6, 2013

A few initial comments:

0) I told you so. FIRST/LAST does not behave as expected and always leads to trouble

1) Your index has 2 components -- "FROM" and "TO". The where clause is range matches. So only the first component (which is "FROM") can participate in record selection. Since the index is ascending the case where LAST is specified is easily resolved, the query immediately gets the proper FROM record and the next record in the FROM order is going to be out of range for the TO and you're done. FIRST will return a very large set of results that match the FROM criteria before it finds one that also matches the TO criteria. If you add DISPLAY statements to the plain old FOR EACH you should be able to get a feel for that -- but it's going to be painful to watch

2) This may be one of those rare situations where 2 single component indexes are beneficial. Try adding discrete single-field indexes for FROM and TO. I haven't finished my coffee yet but I think you might want to make the TO index "descending". That should help the FOR EACH but will not work with the FIND (FIND can only use one index, the same may be true of FOR FIRST/LAST -- I forget).

JamesBowen · Sep 8, 2013

Thank Tom for your insight. I split the original index and now spans over two. indexes (one per field). From the FOR EACH statement I had to remove the second condition of the WHERE record-phrase (GeoCityBlock.IPAddressFromDec GE inIPAddress) and Include a sort BY option into the Query. This is also now very fast, but there is a still something not quite right. The performance is now dictated on the value of the variable inIPAddress. If it a low value which is representing an IP address like "11.45.1.2" the time is zero milliseconds, compared with a higher value like "225.45.7.20"

0 (zero) milliseconds.

Code:

etime(TRUE)
GeoCityBlock:
FOR EACH GeoCityBlock NO-LOCK
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
    BY GeoCityBlock.IPAddressFromDec:

/** etime goes here. **/

LEAVE GeoCityBlock.
END.

4375 milliseconds.

Code:

etime(TRUE)
GeoCityBlock:
FOR EACH GeoCityBlock NO-LOCK
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
        AND GeoCityBlock.IPAddressToDec GE inIPAddress
    BY GeoCityBlock.IPAddressFromDec:
/** etime goes here. **/
LEAVE GeoCityBlock.
END.

I've done a compile with XREF to find out what index is being used and it using the correct one. More testing is needed.

JamesBowen · Sep 8, 2013

I've Done more testing and I concluded that for this particular query "FOR LAST" is still the fastest all statement regardless of what IP Address i use. Also I made a little mistake that I do need a second record-phase option on the WHERE clause. The Query was just returning the first record in the database.

Code:

etime(TRUE)
GeoCityBlock:
FOR LAST GeoCityBlock NO-LOCK
    WHERE GeoCityBlock.IPAddressFromDec LE inIPAddress
        AND GeoCityBlock.IPAddressToDec GE inIPAddress
    BY GeoCityBlock.IPAddressFromDec:
/** etime goes here. **/
LEAVE GeoCityBlock.
END.

TomBascom · Sep 8, 2013

Cecil said:
I've done a compile with XREF to find out what index is being used and it using the correct one. More testing is needed.

Only 1? I was hoping it would use both

Did you try ascending vs descending on the TO address?

TomBascom · Sep 8, 2013

About the data...

You have a FROM and TO.

The FROM and TO values cannot be reused nor can they overlap right? (If that isn't true then none of the above makes much sense to me anymore -- but maybe I need more coffee. Or beer.)

Are there gaps? Even if there are... does it actually matter? I'm thinking that this problem really only needs one comparison:

Code:

for each GeoCityBlock no-lock where inIPAddress >= GeoCityBlock.IPAddressFromDec by GeoCityBlock.IPAddressFromDec:
  leave.
end.

Should be sufficient. If the query above finds a match it is going to be the proper "block". I guess you could then procedurally test the result vs. the TO value just to make sure. But that would only ever fail if there are "rogue" IP addresses that are not in a block in your table. And if that is possible you probably want some such test anyway.

TomBascom · Sep 8, 2013

The performance is now dictated on the value of the variable inIPAddress. If it a low value which is representing an IP address like "11.45.1.2" the time is zero milliseconds, compared with a higher value like "225.45.7.20"

That means that some part of the query is still doing a table scan.

TomBascom · Sep 8, 2013

One more thing about the data... if the blocks all have the same "size" (range) then you could truncate the FROM as well as the target and then just do a unique find. No range matches needed.

tamhas · Sep 8, 2013

I suspect that the blocks are not necessarily contiguous, nor are they the same size, but I think you are close, Tom, in a strategy. What one needs to find is the FROM which is the highest without being higher than the target, then test the TO to see if it is included.

JamesBowen · Sep 8, 2013

TomBascom said:
Only 1? I was hoping it would use both Did you try ascending vs descending on the TO address?

Here is a snip-it of the XREF listing for my procedure of my prototype code FetchGeoResults.p

Code:

.\src\srv\FetchGeoResults.p .\src\srv\FetchGeoResults.p 145 ACCESS geoip.GeoCityBlock IPAddressFromDec
.\src\srv\FetchGeoResults.p .\src\srv\FetchGeoResults.p 145 ACCESS geoip.GeoCityBlock IPAddressToDec
.\src\srv\FetchGeoResults.p .\src\srv\FetchGeoResults.p 145 ACCESS geoip.GeoCityBlock IPAddressFromDec
.\src\srv\FetchGeoResults.p .\src\srv\FetchGeoResults.p 145 SEARCH geoip.GeoCityBlock idxIPAddressFrom

The code segment of the FOR EACH block which currently takes about 3463 milliseconds vs. zero if I change the statement to a FOR LAST (and the data returned is the same):

Code:

etime(TRUE).
GeoCityBlock:
FOR EACH GeoCityBlock NO-LOCK
  WHERE inIPAddress GE GeoCityBlock.IPAddressFromDec
  AND  inIPAddress LE GeoCityBlock.IPAddressToDec
  BY GeoCityBlock.IPAddressFromDec:
  ASSIGN
  ttGeoCountry.GeoCityBlockLap = etime.
  etime(TRUE).
  FIND GeoCityLocation NO-LOCK
  WHERE GeoCityLocation.LocationID EQ  GeoCityBlock.LocationID
  NO-ERROR.
  IF AVAILABLE GeoCityLocation THEN
  ASSIGN
  ttGeoCountry.Longitude  = GeoCityLocation.Longitude
  ttGeoCountry.Latitude  = GeoCityLocation.Latitude
  ttGeoCountry.City  = GeoCityLocation.City
  ttGeoCountry.Region  = GeoCityLocation.Region
  ttGeoCountry.GeoCityLocationLap = etime.
  ELSE
  ASSIGN
  ttGeoCountry.Longitude  = ?
  ttGeoCountry.Latitude  = ?
  ttGeoCountry.City  = STRING(GeoCityBlock.LocationID)
  ttGeoCountry.Region  = 'UNKNOWN'
  ttGeoCountry.GeoCityLocationLap = etime.
  LEAVE GeoCityBlock.
END.

And Finally a snip-it of Data Definitions.

Code:

ADD INDEX "idxIPAddressFrom" ON "GeoCityBlock"
  AREA "Index Area"
  INDEX-FIELD "IPAddressFromDec" ASCENDING
ADD INDEX "idxIPAddressTo" ON "GeoCityBlock"
  AREA "Index Area"
  INDEX-FIELD "IPAddressToDec" DESCENDING

If you are really interested you can download the CSV file from here: http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip
as well as other IPGeo related data.

The next thing I going to try and do is rebuild the database from scratch. Side thought: It shouldn't matter that I'm working with 64bit integers(int64) vs. the traditional integer data types does it??

http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/GeoLiteCity-latest.zip

Question Everything I know about Indexes and Querys are wrong.

JamesBowen

19+ years progress programming and still learning.

JamesBowen

19+ years progress programming and still learning.

JamesBowen

19+ years progress programming and still learning.

TomBascom

Curmudgeon

JamesBowen

19+ years progress programming and still learning.

JamesBowen

19+ years progress programming and still learning.

TomBascom

Curmudgeon

TomBascom

Curmudgeon

TomBascom

Curmudgeon

TomBascom

Curmudgeon

tamhas

ProgressTalk.com Sponsor

JamesBowen

19+ years progress programming and still learning.