sql - Parse Dataframe string column into Street, City, State & Zip code -


i trying break following fixed string several columns street ,city, state & zip code. possible in sqldf via instr & subtr method?

sample address string. difficult part nv , zip code parsing.

727 wright brothers ln, las vegas, nv 89119, usa  

i able parse city/street information using sqldf/instr unable parse final 2 values state/zip code

parsed_tweetaddressdf <- sqldf("select lon, lat, result, substr(result,0,instr(result,',')) street, substr(result,instr(result,',')+1,instr(result,',')-1) city tweetaddressdf") 

here alternatives. use instr , substr required question although third writes out data , reads in (in addition using instr , substr). notes @ end point out easy in plain r or using read.pattern in gsubfn.

1) assume state, zip , country fields fixed width 1 sample record impossible know general case if assume every record ends in ss zzzzz, usa ss 2 letter state abbreviation , zzzzz 5 digit zip works:

df <- data.frame(v = "727 wright brothers ln, las vegas, nv 89119, usa")  library(sqldf) sqldf("select    substr(v, 0, instr(v, ',')) street,   substr(v, instr(v, ',') + 2, length(v) - 16 - instr(v, ',')) city,   substr(v, -13, 2) state,   substr(v, -10, 5) zip  df") 

giving:

                  street      city state   zip 1 727 wright brothers ln las vegas    nv 89119 

2) separate strictly based on commas (except state/zip) approach avoids assumptions in (1) @ expense of additional complication. takes first 2 comma separated fields, 2 character state , after next comma zip.

it uses triple nested select. innermost select denoted a parses input string into: street , a.rest. next 1 proceeding outward denoted b returns street parsed a, , parses a.rest city , b.rest. outermost 1 returns street , city parsed plus returns 2 state characters in b.rest , beyond them in b.rest next comma zip.

library(sqldf)  sqldf("   select     street,     city,     substr(b.rest, 1, 2) state,     substr(b.rest, 4, instr(b.rest, ',') - 4) zip   (     select        street,        substr(a.rest, 0, instr(a.rest, ',')) city,       substr(a.rest, instr(a.rest, ',') + 2) rest      (select              substr(v, 0, instr(v, ',')) street,             substr(v, instr(v, ',') + 2) rest     df) a) b ") 

giving:

                  street      city state   zip 1 727 wright brothers ln las vegas    nv 89119 

3) read.csv.sql if it's ok write out , read in can use read.csv.sql, wrapper around sqldf. although question did not ask it, 1 parses out country:

write.table(df, "addresses.csv", row.names = false, col.names = false,              sep = ",", quote = false)  read.csv.sql("addresses.csv", header = false, sql =         "select v1 street,                 v2 city,                 substr(v3, 2, 2) state,                 substr(v3, 4) zip,                 v4 country          file") 

giving:

                  street       city state    zip country 1 727 wright brothers ln  las vegas    nv  89119     usa 

note 1: easy in plain r.

dd <- read.table(text = as.character(df$v), sep = ",",  col.names = c("street", "city", "state_zip", "country"))  transform(dd,                state = substring(state_zip, 2, 3),                zip = substring(state_zip, 4))[c(1, 2, 5, 6, 4)] 

giving:

                  street       city state    zip country 1 727 wright brothers ln  las vegas    nv  89119     usa 

note 2: easier using read.pattern gsubfn:

library(gsubfn)  pat <- "(.*), (.*), (..) (.*), (.*)" read.pattern(text = as.character(df$v), pattern = pat,     col.names = c("street", "city", "state", "zip", "country")) 

giving:

                  street      city state   zip country 1 727 wright brothers ln las vegas    nv 89119     usa 

Comments

Popular posts from this blog

sql - VB.NET Operand type clash: date is incompatible with int error -

SVG stroke-linecap doesn't work for circles in Firefox? -

python - TypeError: Scalar value for argument 'color' is not numeric in openCV -