sql - Parse Dataframe string column into Street, City, State & Zip code -
i trying break following fixed string several columns street ,city, state & zip code. possible in sqldf via instr & subtr method?
sample address string. difficult part nv , zip code parsing.
727 wright brothers ln, las vegas, nv 89119, usa
i able parse city/street information using sqldf/instr unable parse final 2 values state/zip code
parsed_tweetaddressdf <- sqldf("select lon, lat, result, substr(result,0,instr(result,',')) street, substr(result,instr(result,',')+1,instr(result,',')-1) city tweetaddressdf")
here alternatives. use instr
, substr
required question although third writes out data , reads in (in addition using instr
, substr
). notes @ end point out easy in plain r or using read.pattern
in gsubfn.
1) assume state, zip , country fields fixed width 1 sample record impossible know general case if assume every record ends in ss zzzzz, usa
ss 2 letter state abbreviation , zzzzz 5 digit zip works:
df <- data.frame(v = "727 wright brothers ln, las vegas, nv 89119, usa") library(sqldf) sqldf("select substr(v, 0, instr(v, ',')) street, substr(v, instr(v, ',') + 2, length(v) - 16 - instr(v, ',')) city, substr(v, -13, 2) state, substr(v, -10, 5) zip df")
giving:
street city state zip 1 727 wright brothers ln las vegas nv 89119
2) separate strictly based on commas (except state/zip) approach avoids assumptions in (1) @ expense of additional complication. takes first 2 comma separated fields, 2 character state , after next comma zip.
it uses triple nested select. innermost select denoted a
parses input string into: street
, a.rest
. next 1 proceeding outward denoted b
returns street
parsed a
, , parses a.rest
city
, b.rest
. outermost 1 returns street
, city
parsed plus returns 2 state characters in b.rest
, beyond them in b.rest
next comma zip
.
library(sqldf) sqldf(" select street, city, substr(b.rest, 1, 2) state, substr(b.rest, 4, instr(b.rest, ',') - 4) zip ( select street, substr(a.rest, 0, instr(a.rest, ',')) city, substr(a.rest, instr(a.rest, ',') + 2) rest (select substr(v, 0, instr(v, ',')) street, substr(v, instr(v, ',') + 2) rest df) a) b ")
giving:
street city state zip 1 727 wright brothers ln las vegas nv 89119
3) read.csv.sql if it's ok write out , read in can use read.csv.sql
, wrapper around sqldf
. although question did not ask it, 1 parses out country:
write.table(df, "addresses.csv", row.names = false, col.names = false, sep = ",", quote = false) read.csv.sql("addresses.csv", header = false, sql = "select v1 street, v2 city, substr(v3, 2, 2) state, substr(v3, 4) zip, v4 country file")
giving:
street city state zip country 1 727 wright brothers ln las vegas nv 89119 usa
note 1: easy in plain r.
dd <- read.table(text = as.character(df$v), sep = ",", col.names = c("street", "city", "state_zip", "country")) transform(dd, state = substring(state_zip, 2, 3), zip = substring(state_zip, 4))[c(1, 2, 5, 6, 4)]
giving:
street city state zip country 1 727 wright brothers ln las vegas nv 89119 usa
note 2: easier using read.pattern
gsubfn:
library(gsubfn) pat <- "(.*), (.*), (..) (.*), (.*)" read.pattern(text = as.character(df$v), pattern = pat, col.names = c("street", "city", "state", "zip", "country"))
giving:
street city state zip country 1 727 wright brothers ln las vegas nv 89119 usa
Comments
Post a Comment