Delivery charge for if postcode starts with ?? Javascript or PHP - javascript

I have a list of UK postcode prefixes, each with a region ID next to it. Delivery costs more or less depending on the region a user lives in.
For example, if a user lives in Birmingham and has a postcode that starts with B, they get free delivery because that postcode region has no charge.
Likewise, if a user has a postcode starting with IM, they pay more for delivery because that postcode region has a higher charge.
Sample postcode list:
Postcode | Region
AL | A
BA | A
BB | A
BD | A
B | B
BH | B
LN | D
LS | D
IV1 | E
IV23 | F
From the example above if a user wants to get a delivery and their postcode starts with BA then I want to apply the delivery charge rate of region A.
I'm actually a bit confused as to how I can programmatically do this. At first I thought I would simply do something similar to:
$postcodes = [
'AL'=>'A',
'BA'=>'A',
//And so on ....
];
//get the first 2 letters
$user_input = substr( $user_postcode, 0, 2 );
if(array_key_exists($user_input,$postcodes)){
//Get the region code
$region = $postcodes[$user_input];
// Charge the user with the delivery rate specific to that user, then carry on
}
But the problem is that postcodes with similar prefixes can be in different regions; for example, IV1 is region E and IV23 is region F, as seen above.
That means I have to match a user's postcode on its first 1, 2, 3 or 4 characters. That probably doesn't make sense, so to elaborate, see below:
//From Birmingham and is in region B
$user1_input = 'B';
//From Bradford and is in region A
$user2_input = 'BD';
//From Inverness and is in region E
$user3_input = 'IV1';
So if the user is from Birmingham and their postcode starts with B, how can I tell that apart from a postcode that also starts with B but is followed by other letters, which makes it a different postcode area?
I'm trying my best to explain; hopefully this makes sense. If not, please ask for more info.
Can anyone please help me with the logic for achieving this? Either JavaScript or PHP is fine, because I can convert the logic afterwards.

If you have what looks like a valid UK postcode, then remove the spaces and just search the array till you find a match:
$lookup = [
'' => 'X', // in case no match is found
'AL'=>'A',
'BA'=>'A',
//And so on ....
];
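function get_delivery_for($postcode)
{
global $lookup;
// $postcode is expected with its spaces already removed
$result = false;
// try the longest prefix first; $x goes down to 0 so the '' fallback entry can match
for ($x=5; $x>=0 && !$result; $x--) {
$result=$lookup[substr($postcode, 0, $x)];
}
return ($result);
}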
Note that the code above is intended for illustration; I would recommend something more elaborate inside the loop to avoid warnings about missing keys, for example:
$result=isset($lookup[substr($postcode, 0, $x)])
? $lookup[substr($postcode, 0, $x)]
: false;

One option would be to order your postcode/region array by the descending length of the postcode key. This way, the longer (more specific) keys are checked first. Taking your list above, it would become something like this...
$postcodes = array(
"IV23" => "F",
"IV1" => "E",
"LS" => "D",
"LN" => "D",
"BH" => "B",
"BD" => "A",
"BB" => "A",
"BA" => "A",
"AL" => "A",
"B" => "B",
);
After you have that, it's as simple as looping through the array, checking for a match against the provided postcode (starting from the left), and stopping when you find a match.
// stays null when no prefix matches
$shippingRegion = null;
foreach($postcodes as $code => $region)
{
if($code == substr($user_postcode, 0, strlen($code)))
{
$shippingRegion = $region;
break;
}
}
echo $shippingRegion;
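Since the question allows JavaScript as well, here is a minimal sketch of the same longest-prefix idea. The names `regions`, `prefixes` and `regionFor` are made up for illustration, and the table only holds the sample rows from the question. Instead of keeping the table in length order by hand, the sketch sorts the keys once.

```javascript
// Prefix-to-region table from the question (sample rows only).
const regions = {
  AL: 'A', BA: 'A', BB: 'A', BD: 'A',
  B: 'B', BH: 'B',
  LN: 'D', LS: 'D',
  IV1: 'E', IV23: 'F',
};

// Prefixes sorted longest first, so "BD" is tried before "B".
const prefixes = Object.keys(regions).sort((a, b) => b.length - a.length);

// Returns the region for a postcode, or null when no prefix matches.
function regionFor(postcode) {
  const cleaned = postcode.replace(/\s+/g, '').toUpperCase();
  const hit = prefixes.find((prefix) => cleaned.startsWith(prefix));
  return hit === undefined ? null : regions[hit];
}

console.log(regionFor('B1 1AA'));   // B
console.log(regionFor('bd1 1aa'));  // A
console.log(regionFor('IV23 2AB')); // F
console.log(regionFor('ZZ1 1ZZ'));  // null
```

One caveat: a plain prefix test also matches IV1 against a postcode such as IV12 3AB. Whether that is wanted depends on how the list is meant to be read; if IV1 should only cover the IV1 district, compare against the outward part of the postcode instead.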

Related

adding indicators into a string according to different case

I will receive an array of strings like the ones below.
Each string may contain any of three signs: $, % and *.
For example,
“I would $rather %be $happy, %if working in a chocolate factory”
“It is ok to play tennis”
“Tennis $is a good sport”
“AO is really *good sport”
However, a string may contain no signs at all, or only one kind of sign.
There are only five cases:
1. no sign at all,
2. having both $ and %,
3. having only $,
4. having only %,
5. having only *.
If there is no sign, I don’t need to process the string.
Otherwise, I need to process it and add an indicator to the left of the first sign that occurs in the sentence.
For example:
“I would ---dollarAndPerSign---$rather %be $happy, %if working in a chocolate factory”
“Tennis ---dollarSign---$is a good sport”
This is the code I have in mind.
So, I need to decide if the string contains any sign. If there is no sign, I don’t need to process it.
texts.map((text) => {
if (text.includes("$") || text.includes("%") || text.includes("*")) {
//to get the index of signs
let indexOfdollar, indexOfper, indexOfStar;
indexOfdollar = text.indexOf("$");
indexOfper = text.indexOf("%");
indexOfStar = text.indexOf("*");
//return a completed process text
}
});
Question:
How do I know which index is the smallest, in order to locate the first sign occurring in the text? Simply taking the smallest value may not be correct, because indexOf() returns -1 for a sign that is absent.
I focussed only on the "get the smallest index" part of your question... Since you will be able to do what you want with it after.
You can have the indexOf() in an array, filter it to remove the -1 and then use Math.min() to get the smallest one.
Edited to output an object instead, which includes the first index and some booleans for the presence of each char.
const texts = [
"I would $rather %be $happy, %if working in a chocolate factory",
"It is ok to play tennis",
"Tennis $is a good sport",
"AO is really *good sport"
]
const minIndexes = texts.map((text,i) => {
//to get the signs
const hasDollard = text.indexOf("$") >= 0
const hasPercent = text.indexOf("%") >= 0
const hasStar = text.indexOf("*") >= 0
//to get the first index
const indexes = [text.indexOf("$"), text.indexOf("%"), text.indexOf("*")].filter((index) => index >= 0)
if(!indexes.length){
return null
}
return {
index: Math.min( ...indexes),
hasDollard,
hasPercent,
hasStar
}
});
console.log(minIndexes)
const texts = [
"I would $rather %be $happy, %if working in a chocolate factory",
"It is ok to play tennis",
"Tennis $is a good sport",
"AO is really *good sport"
]
texts.forEach(text => {
let signs = ["%","$","*"];
let chr = text.split('').find(t => signs.find(s => s==t));
if (!chr)
return;
text = text.replace(chr, "---some text---" + chr);
console.log(text);
})
const data = ['I would $rather %be $happy, %if working in chocolate factory', 'It is ok to play tennis', 'Tennis $is a good sport', 'AO is really *good sport']
const replace = s => {
const signs = { $: 'dollar', '%': 'per', '*': 'star' }
// keep only the sign characters, in order of appearance
const characters = Array.from(s).filter(c => '$%*'.includes(c))
// unique sign names joined with '|', e.g. "dollar|per"
const headText = [...new Set(characters)].map(c => signs[c]).join('|')
// put the indicator in front of the first sign; strings without signs come back unchanged
return s.replace(/[$%*]/, `--${headText}--$&`)
}
const result = data.map(replace)
console.log(result)

Finding n-gram frequencies in a large set of sentences

I have a set of text messages. Let's call them m1, m2, .... The number of messages is below 1,000,000. Each message is below 1024 characters in length, and all are in lowercase. Let's also pick an n-gram s1.
I need to find the frequency of all possible substrings across all of these messages. For example, let's say we have only two messages:
m1 = a cat in a cage
m2 = a bird in a cage
The frequencies of some n-grams in these two messages:
'a' = 4
'in a cage' = 2
'a bird' = 1
'a cat' = 1
...
Note that since in = 2, in a = 2 and a cage = 2 are sub-sequences of in a cage = 2 and have the same frequency, they should not be listed. Only take the longest n-gram with the highest frequency, subject to this condition: the longest n-gram should consist of at most 8 words, with a total character count below 30. If an n-gram exceeds this limit, it can be broken into two or more n-grams that are listed separately.
I need to find such n-grams for all of these text messages and sort them by their number of occurrences in descending order.
How do I approach this problem? I need a solution in JavaScript.
PS: I need help, but do not know to where to ask this. If the question
is not for this site, then where should I post it? please guide this
newbie here.
Maybe you can approach it as follows. I will edit to add an explanation as soon as I have some time.
var subSentences = (w,...ws) => ws.length ? ws.reduce((r,s) => (r.push(r[r.length-1] + ` ${s}`), r),[w])
.concat(subSentences(...ws))
: [w],
frequencyMap = sss => sss.reduce((map,ss) => subSentences(...ss.split(/\s+/)).reduce((m,s) => m.set(s, m.get(s) + 1 || 1), map), new Map());
frequencies = frequencyMap(["this is a test string",
"this is another one",
"yet another one is here"]);
console.log(...frequencies.entries()); // logging map object seems not possible hence entries
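The snippet above counts every run of consecutive words but does not apply the limits from the question. Below is a minimal sketch of one possible reading of those requirements: count word n-grams of at most 8 words and fewer than 30 characters, drop an n-gram when a longer one with the same count contains it, and sort by count. The names `countNgrams` and `longestNgrams` are made up for illustration, and the pruning step is quadratic, so it is only meant for small inputs.

```javascript
// Limits taken from the question: at most 8 words, fewer than 30 characters.
const MAX_WORDS = 8;
const MAX_CHARS = 30;

// Count every word n-gram (within the limits) across all messages.
function countNgrams(messages) {
  const counts = new Map();
  for (const message of messages) {
    const words = message.split(/\s+/).filter(Boolean);
    for (let start = 0; start < words.length; start++) {
      let gram = '';
      for (let end = start; end < words.length && end - start < MAX_WORDS; end++) {
        gram = gram ? `${gram} ${words[end]}` : words[end];
        if (gram.length >= MAX_CHARS) break;
        counts.set(gram, (counts.get(gram) || 0) + 1);
      }
    }
  }
  return counts;
}

// Drop an n-gram when a longer n-gram with the same count contains it,
// then sort by count, descending. Quadratic in the number of n-grams.
function longestNgrams(counts) {
  const entries = [...counts.entries()];
  return entries
    .filter(([gram, n]) => !entries.some(([other, m]) =>
      m === n && other.length > gram.length && ` ${other} `.includes(` ${gram} `)))
    .sort((a, b) => b[1] - a[1]);
}

const ranked = longestNgrams(countNgrams(['a cat in a cage', 'a bird in a cage']));
console.log(ranked);
```

For a million messages the counting step is still a single pass, but the pruning would need a smarter structure (for example grouping n-grams by count first).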

How to count occurrence of multiple sub-string in a long string with JavaScript

I am new to JavaScript. I have tried a lot, but could not find how to count the occurrences of multiple sub-strings in a long string in one pass.
Further information: I need the number of occurrences of each sub-string, and if a sub-string occurs too often I need to replace it, so I need all the counts at once.
Here is an example:
The long string Text as below,
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
The sub-string is a question, but what I need is to count the occurrences in the text of each word in this question at once, for example the words "name", "NFL", "championship", "game", "is" and "the" in this string:
What is the name of NFL championship game?
One problem is that some of the words are not in the text at all, and some appear many times (those I might replace).
The code I have tried is below. It is wrong; I have tried many different ways but got no good results.
$(".showMoreFeatures").click(function(){
var text= $(".article p").text(); // This is to get the text.
var textCount = new Array();
// Because I use match, the first word "what" returns null, so this is to
// avoid that null. I also planned to get the count and, if it is more than 7,
// replace the words.
var qus = item2.question; //This is to get the sub-string
var checkQus = qus.split(" "); // I split the question to words
var newCheckQus = new Array();
// This is the array where I planned to put the sub-strings whose count is less than 7, i.e. the words I really need.
var count = new Array();
// Because the sub-string is a question with many words, I planned to get their counts and put them in an array.
for(var k =0; k < checkQus.length; k++){
textCount = text.match(checkQus[k],"g")
if(textCount == null){
continue;
}
for(var j =0; j<checkQus.length;j++){
count[j] = textCount.length;
}
//count++;
}
I tried many different ways and searched a lot, but got no good results. The code above is only meant to show what I have tried and my thinking (which might be totally wrong); it does not actually work. If you know how to implement this and solve my problem, please just tell me; there is no need to correct my code.
Thanks very much.
If I have understood the question correctly then it seems you need to count the number of times the words in the question (que) appear in the text (txt)...
var txt = "Super Bowl 50 was an American ...etc... Arabic numerals 50.";
var que = "What is the name of NFL championship game?";
I'll go through this in vanilla JavaScript and you can adapt it for jQuery as required.
First of all, to focus on the text we can make things a little simpler by changing the strings to lowercase and removing some of the punctuation.
// both strings to lowercase
txt = txt.toLowerCase();
que = que.toLowerCase();
// remove punctuation
// using double \\ for proper regular expression syntax
var puncArray = ["\\,", "\\.", "\\(", "\\)", "\\!", "\\?"];
puncArray.forEach(function(P) {
// create a regular expression from each punctuation 'P'
var rEx = new RegExp( P, "g");
// replace every 'P' with empty string (nothing)
txt = txt.replace(rEx, '');
que = que.replace(rEx, '');
});
Now we can create a cleaner array from txt and que as well as a hash table from que like so...
// Arrays: split at every space
var txtArray = txt.split(" ");
var queArray = que.split(" ");
// Object, for storing 'que' counts
var queObject = {};
queArray.forEach(function(S) {
// create 'queObject' keys from 'queArray'
// and set value to zero (0)
queObject[S] = 0;
});
queObject will be used to hold the words counted. If you were to console.debug(queObject) at this point it would look something like this...
console.debug(queObject);
/* =>
queObject = {
what: 0,
is: 0,
the: 0,
name: 0,
of: 0,
nfl: 0,
championship: 0,
game: 0
}
*/
Now we want to test each element in txtArray to see if it contains any of the elements in queArray. If the test is true we'll add +1 to the equivalent queObject property, like this...
// go through each element in 'queArray'
queArray.forEach(function(A) {
// create regular expression for testing
var rEx = new RegExp( A );
// test 'rEx' against elements in 'txtArray'
txtArray.forEach(function(B) {
// is 'A' in 'B'?
if (rEx.test(B)) {
// increase 'queObject' property 'A' by 1.
queObject[A]++;
}
});
});
We use RegExp test method here rather than String match method because we just want to know if "is A in B == true". If it is true then we increase the corresponding queObject property by 1. This method will also find words inside words, such as 'is' in 'San Francisco' etc.
All being well, logging queObject to the console will show you how many times each word in the question appeared in the text.
console.debug(queObject);
/* =>
queObject = {
what: 0
is: 2
the: 17
name: 0
of: 2
nfl: 1
championship: 0
game: 4
}
*/
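Because the approach above also counts words inside other words, here is a minimal alternative sketch that counts whole words only. It builds one frequency table for the text and then looks up each question word. The function name `countQuestionWords` and the short sample text are made up for illustration; the tokenizer treats anything other than a-z, 0-9 and the apostrophe as a separator, which is an assumption that suits plain English text.

```javascript
// Count how often each word of `question` occurs as a whole word in `text`.
function countQuestionWords(text, question) {
  // Lowercase and split on anything that is not a letter, digit or apostrophe.
  const tokenize = (s) => s.toLowerCase().split(/[^a-z0-9']+/).filter(Boolean);

  // One pass over the text builds a frequency table of every word.
  const frequency = new Map();
  for (const word of tokenize(text)) {
    frequency.set(word, (frequency.get(word) || 0) + 1);
  }

  // Look up each question word; words that never occur get 0.
  const counts = {};
  for (const word of tokenize(question)) {
    counts[word] = frequency.get(word) || 0;
  }
  return counts;
}

const sample = 'The game was played at the stadium. The NFL game is the Super Bowl.';
console.log(countQuestionWords(sample, 'What is the name of NFL championship game?'));
// { what: 0, is: 1, the: 4, name: 0, of: 0, nfl: 1, championship: 0, game: 2 }
```

With the counts in one object, deciding which words occur too often (for example more than 7 times) is a simple filter over its keys.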
Hope that helps. :)
See MDN for more information on:
Array.forEach()
Object.keys()
RegExp.test()

Is there a way to measure string similarity in Google BigQuery

I'm wondering if anyone knows of a way to measure string similarity in BigQuery.
It seems like it would be a neat function to have.
In my case I need to compare the similarity of two URLs, as I want to be fairly sure they refer to the same article.
I can find examples using JavaScript, so maybe a UDF is the way to go, but I've not used UDFs at all (or JavaScript for that matter :) )
I'm just wondering if there may be a way using the existing regex functions, or if anyone might be able to get me started with porting the JavaScript example into a UDF.
Any help much appreciated, thanks.
EDIT: Adding some example code
So if i have a UDF defined as:
// distance function
function levenshteinDistance (row, emit) {
//if (row.inputA.length <= 0 ) {var myresult = row.inputB.length};
if (typeof row.inputA === 'undefined') {var myresult = 1};
if (typeof row.inputB === 'undefined') {var myresult = 1};
//if (row.inputB.length <= 0 ) {var myresult = row.inputA.length};
var myresult = Math.min(
levenshteinDistance(row.inputA.substr(1), row.inputB) + 1,
levenshteinDistance(row.inputB.substr(1), row.inputA) + 1,
levenshteinDistance(row.inputA.substr(1), row.inputB.substr(1)) + (row.inputA[0] !== row.inputB[0] ? 1 : 0)
) + 1;
emit({outputA: myresult})
}
bigquery.defineFunction(
'levenshteinDistance', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
levenshteinDistance // Reference to JavaScript UDF
);
// make a test function to test individual parts
function test(row, emit) {
if (row.inputA.length <= 0) { var x = row.inputB.length} else { var x = row.inputA.length};
emit({outputA: x});
}
bigquery.defineFunction(
'test', // Name of the function exported to SQL
['inputA', 'inputB'], // Names of input columns
[{'name': 'outputA', 'type': 'integer'}], // Output schema
test // Reference to JavaScript UDF
);
And when I try to test it with a query such as:
SELECT outputA FROM (levenshteinDistance(SELECT "abc" AS inputA, "abd" AS inputB))
I get error:
Error: TypeError: Cannot read property 'substr' of undefined at line 11, columns 38-39
Error Location: User-defined function
It seems like maybe row.inputA is not a string, or for some reason string functions are not able to work on it. I'm not sure if this is a type issue or something odd about which utilities the UDF can use by default.
Again any help much appreciated, thanks.
Ready to use shared UDFs - Levenshtein distance:
SELECT fhoffa.x.levenshtein('felipe', 'hoffa')
, fhoffa.x.levenshtein('googgle', 'goggles')
, fhoffa.x.levenshtein('is this the', 'Is This The')
6 2 0
Soundex:
SELECT fhoffa.x.soundex('felipe')
, fhoffa.x.soundex('googgle')
, fhoffa.x.soundex('guugle')
F410 G240 G240
Fuzzy choose one:
SELECT fhoffa.x.fuzzy_extract_one('jony'
, (SELECT ARRAY_AGG(name)
FROM `fh-bigquery.popular_names.gender_probabilities`)
#, ['john', 'johnny', 'jonathan', 'jonas']
)
johnny
How-to:
https://medium.com/@hoffa/new-in-bigquery-persistent-udfs-c9ea4100fd83
If you're familiar with Python, you can use the functions defined by fuzzywuzzy in BigQuery using external libraries loaded from GCS.
Steps:
1. Download the JavaScript version of fuzzywuzzy (fuzzball).
2. Take the compiled file of the library, dist/fuzzball.umd.min.js, and rename it to something clearer (like fuzzball).
3. Upload it to a Google Cloud Storage bucket.
4. Create a temp function to use the lib in your query (set the path in OPTIONS to the relevant path).
CREATE TEMP FUNCTION token_set_ratio(a STRING, b STRING)
RETURNS FLOAT64
LANGUAGE js AS """
return fuzzball.token_set_ratio(a, b);
"""
OPTIONS (
library="gs://my-bucket/fuzzball.js");
with data as (select "my_test_string" as a, "my_other_string" as b)
SELECT a, b, token_set_ratio(a, b) from data
Levenshtein via JS would be the way to go. You can use the algorithm to get the absolute string distance, or convert it to a percentage similarity by calculating (strlen - distance) / strlen, where strlen is the length of the longer string.
The easiest way to implement this would be to define a Levenshtein UDF that takes two inputs, a and b, and calculates the distance between them. The function could return a, b, and the distance.
To invoke it, you'd then pass in the two URLs as columns aliased to 'a' and 'b':
SELECT a, b, distance
FROM
Levenshtein(
SELECT
some_url AS a, other_url AS b
FROM
your_table
)
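For reference, a minimal sketch of the distance calculation itself in plain JavaScript, written iteratively rather than recursively. The function names `levenshtein` and `similarity` are made up for illustration, and the normalisation by the longer string is one common convention, not the only one. The body of `levenshtein` takes two plain strings, so it should be adaptable to a JavaScript UDF that receives `a` and `b` as arguments.

```javascript
// Levenshtein distance using two rolling rows of the dynamic-programming table.
function levenshtein(a, b) {
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const current = [i];
    for (let j = 1; j <= b.length; j++) {
      const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1);
      current[j] = Math.min(previous[j] + 1, current[j - 1] + 1, substitution);
    }
    previous = current;
  }
  return previous[b.length];
}

// Similarity in [0, 1], normalised by the longer string;
// two empty strings are treated as identical.
function similarity(a, b) {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 1 : (longest - levenshtein(a, b)) / longest;
}

console.log(levenshtein('abc', 'abd'));        // 1
console.log(levenshtein('kitten', 'sitting')); // 3
console.log(similarity('abc', 'abd'));         // 0.6666666666666666
```

The table-based version avoids the exponential blow-up of the naive recursive definition, which matters for strings as long as URLs.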
Below is a somewhat simpler version for Hamming distance, using WITH OFFSET instead of ROW_NUMBER() OVER():
#standardSQL
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT 'abcdef' AS target, strings,
(SELECT COUNT(1)
FROM UNNEST(SPLIT('abcdef', '')) a WITH OFFSET x
JOIN UNNEST(SPLIT(strings, '')) b WITH OFFSET y
ON x = y AND a != b) hamming_distance
FROM Input
I did it like this:
CREATE TEMP FUNCTION trigram_similarity(a STRING, b STRING) AS (
(
WITH a_trigrams AS (
SELECT
DISTINCT tri_a
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(a), ''), [3,3])) AS tri_a
),
b_trigrams AS (
SELECT
DISTINCT tri_b
FROM
unnest(ML.NGRAMS(SPLIT(LOWER(b), ''), [3,3])) AS tri_b
)
SELECT
COUNTIF(tri_b IS NOT NULL) / COUNT(*)
FROM
a_trigrams
LEFT JOIN b_trigrams ON tri_a = tri_b
)
);
Here is a comparison to Postgres's pg_trgm:
select trigram_similarity('saemus', 'seamus');
-- 0.25 vs. pg_trgm 0.272727
select trigram_similarity('shamus', 'seamus');
-- 0.5 vs. pg_trgm 0.4
I gave the same answer on How to perform trigram operations in Google BigQuery?
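To make the measure easier to follow outside SQL, here is a minimal JavaScript sketch of the same calculation: the share of the first string's distinct trigrams that also occur in the second. The function names are made up for illustration, and returning 0 for strings shorter than three characters is an assumption of this sketch.

```javascript
// Distinct character trigrams of a lowercased string.
function trigrams(s) {
  const lower = s.toLowerCase();
  const grams = new Set();
  for (let i = 0; i + 3 <= lower.length; i++) {
    grams.add(lower.slice(i, i + 3));
  }
  return grams;
}

// Share of a's distinct trigrams that also occur in b (not symmetric).
function trigramSimilarity(a, b) {
  const ta = trigrams(a);
  const tb = trigrams(b);
  if (ta.size === 0) return 0; // assumption: strings shorter than 3 characters score 0
  let shared = 0;
  for (const gram of ta) {
    if (tb.has(gram)) shared++;
  }
  return shared / ta.size;
}

console.log(trigramSimilarity('saemus', 'seamus')); // 0.25
console.log(trigramSimilarity('shamus', 'seamus')); // 0.5
```

Note that the result depends on argument order, because the denominator is the trigram count of the first string only.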
I couldn't find a direct answer to this, so I propose this solution, in standard SQL
#standardSQL
CREATE TEMP FUNCTION HammingDistance(a STRING, b STRING) AS (
(
SELECT
SUM(counter) AS diff
FROM (
SELECT
CASE
WHEN X.value != Y.value THEN 1
ELSE 0
END AS counter
FROM (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(a, "")) AS value ) X
JOIN (
SELECT
value,
ROW_NUMBER() OVER() AS row
FROM
UNNEST(SPLIT(b, "")) AS value ) Y
ON
X.row = Y.row )
)
);
WITH Input AS (
SELECT 'abcdef' AS strings UNION ALL
SELECT 'defdef' UNION ALL
SELECT '1bcdef' UNION ALL
SELECT '1bcde4' UNION ALL
SELECT '123de4' UNION ALL
SELECT 'abc123'
)
SELECT strings, 'abcdef' as target, HammingDistance('abcdef', strings) as hamming_distance
FROM Input;
Compared to other solutions (like this one), it takes two strings (of the same length, following the definition for hamming distance) and outputs the expected distance.
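For comparison, a minimal JavaScript sketch of the same definition. The function name `hammingDistance` is made up for illustration, and throwing on unequal lengths is a choice of this sketch; it compares UTF-16 code units, which is fine for the ASCII samples used here.

```javascript
// Number of positions at which two equal-length strings differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) {
    throw new RangeError('Hamming distance needs strings of equal length');
  }
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

console.log(hammingDistance('abcdef', '1bcde4')); // 2
console.log(hammingDistance('abcdef', 'abc123')); // 3
```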
While I was looking for an answer like Felipe's above, I worked on my own query and ended up with two versions, one which I called string approximation and another which I called string resemblance.
The first looks at the shortest distance between letters of the source string and the test string and returns a score between 0 and 1, where 1 is a complete match. It always scores based on the longer of the two strings. It turns out to return results similar to the Levenshtein distance.
#standardSQL
CREATE OR REPLACE FUNCTION `myproject.func.stringApproximation`(sourceString STRING, testString STRING) AS (
(select avg(best_result) from (
select if(length(testString)<length(sourceString), sourceoffset, testoffset) as ref,
case
when min(result) is null then 0
else 1 / (min(result) + 1)
end as best_result,
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)),
greatest(length(testString),length(sourceString))) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
)
);
The second is a variation of the first, where it will look at sequences of matching distances, so that a character matching at equal distance from the character preceding or following it will count as one point. This works quite well, better than string approximation but not quite as well as I would like to (see example output below).
#standardSQL
CREATE OR REPLACE FUNCTION `myproject.func.stringResemblance`(sourceString STRING, testString STRING) AS (
(
select avg(sequence)
from (
select ref,
if(array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.before))) > 0
or array_length(array(select * from comparison.collection intersect distinct
(select * from comparison.after))) > 0
, 1, 0) as sequence
from (
select ref,
collection,
lag(collection) over (order by ref) as before,
lead(collection) over (order by ref) as after
from (
select if(length(testString) < length(sourceString), sourceoffset, testoffset) as ref,
array_agg(result ignore nulls) as collection
from (
select *,
if(source = test, abs(sourceoffset - (testoffset)), null) as result
from unnest(split(lower(sourceString),'')) as source with offset sourceoffset
cross join
(select *
from unnest(split(lower(testString),'')) as test with offset as testoffset)
) as results
group by ref
)
) as comparison
)
)
);
Now here is a sample of result:
#standardSQL
with test_subjects as (
select 'benji' as name union all
select 'benjamin' union all
select 'benjamin alan artis' union all
select 'ben artis' union all
select 'artis benjamin'
)
select name, quick.stringApproximation('benjamin artis', name) as approximation, quick.stringResemblance('benjamin artis', name) as resemblance
from test_subjects
order by resemblance desc
This returns
+---------------------+---------------------+---------------------+
| name                | approximation       | resemblance         |
+---------------------+---------------------+---------------------+
| artis benjamin      | 0.2653061224489796  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| benjamin alan artis | 0.6078947368421053  | 0.8947368421052629  |
+---------------------+---------------------+---------------------+
| ben artis           | 0.4142857142857142  | 0.7142857142857143  |
+---------------------+---------------------+---------------------+
| benjamin            | 0.6125850340136053  | 0.5714285714285714  |
+---------------------+---------------------+---------------------+
| benji               | 0.36269841269841263 | 0.28571428571428575 |
+---------------------+---------------------+---------------------+
Edited: updated the resemblance algorithm to improve results.
Try Flookup for Google Sheets... it's definitely faster than Levenshtein distance and it calculates percentage similarities right out of the box.
One Flookup function you might find useful is this:
FUZZYMATCH (string1, string2)
Parameter Details
string1: compares to string2.
string2: compares to string1.
The percentage similarity is then calculated based on these comparisons. Both parameters can be ranges.
I'm currently trying to optimise it for large data sets, so your feedback would be very welcome.
Edit: I'm the creator of Flookup.

Javascript regex with varying input

I want to extract the following information from a long piece of text, which I copy
and paste into a text field and then want to process into a table:
Name
Address
Status
Example snippet (names, addresses etc. are more or less randomized):
Thuisprikindeling voor: Vrijdag 15 Mei 2015 DE SMART BON 22 afspraken
Pagina 1/4
Persoonlijke mededeling:
Algemene mededeling:
Prikpostgegevens: REEK-Eeklo extern, (-)
Telefoonnummer Fax Mobiel 0499/9999999 Email dummy.dummy@gmail.com
DUMMY FOO V Stationstreet 2 8000 New York F N - Sober BSN: 1655
THUIS Analyses: Werknr: PIN: 000000002038905
Opdrachtgever: Laboratorium Arts:
Mededeling: Some comments // VERY DIFFICULT
FO DUMMY FOO V Butterstreet 6 8740 Melbourne F N - Sober BSN: 15898
THUIS Analyses: Werknr: AFD 3 PIN: 000000002035900
Opdrachtgever: Laboratorium Arts:
Mededeling: ZH BLA / BLA BLA - AFD 3 - SOCIAL BEER
JOHN FOOO V Waterstreet 1 9990 Rome F N - Sober BSN: 17878
THUIS / Analyses: Werknr: K111 PIN: 000000002037888
Opdrachtgever: Laboratorium Arts:
Mededeling: TRYOUT/FOO
FO SMOOTH M.FOO M Queen Elisabethstreet 19 9990 Paris F NN - Not Sober BSN: 14877
What I want to get out of it is this:
DUMMY FOO Stationstreet 2 8000 New York Sober
FO DUMMY FOO Butterstreet 6 8740 Melbourne Sober
JOHN FOOO Waterstreet 1 9990 Rome Sober
FO SMOOTH M.FOO Queen Elisabethstreet 19 9990 Paris Not sober
My strategy for the moment is using the following:
Filter all the lines that have at least two words in capitals at the beginning of the line AND a 4-digit postal code.
Then discard all the other lines, as I only need the lines with the names and addresses.
Then I strip out all the information needed for that line
Strip the name / address / status
I use the following code:
//Regular expressions
//Filter all lines which start with at least two UPPERCASE words following a space
pattern = /^(([A-Z'.* ]{2,} ){2,}[A-Z]{1,})(?=.*BSN)/;
postcode = /\d{4}/;
searchSober= /(N - Sober)+/;
searchNotSober= /(NN - Not sober)+/i;
adres = inputText.split('\n');
for (var i = 0; i < adres.length; i++) {
// If in one line And a postcode and which starts with at least
// two UPPERCASE words following a space
temp = adres[i]
if ( pattern.test(temp) && postcode.test(temp)) {
//Remove BSN in order to be able to use digits to sort out the postal code
temp = temp.replace( /BSN.*/g, "");
// Example: DUMMY FOO V Stationstreet 2 8000 New York F N - Sober
//Selection of the name, always take first part of the array
// DUMMY FOO
var name = temp.match(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/)[0];
//remove the name from the string
temp = temp.replace(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/, "");
// V Stationstreet 2 8000 New York F N - Sober
//filter out gender
//Using jquery trim for whitespace trimming
// V
var gender = $.trim(temp.match(/^( [A-Z'*.]{1} )/)[0]);
//remove gender
temp = temp.replace(/^( [A-Z'*.]{1} )/, "");
// Stationstreet 2 8000 New York F N - Sober
//looking for status
var status = "unknown";
if ( searchNotSober.test(temp) ) {
status = "Not sober";
}
else if ( searchSober.test(temp) ) {
status = "Sober";
}
else {
status = "unknown";
}
//Selection of the address /^.*[0-9]{4}.[\w-]{2,40}/
//Stationstreet 2 8000 New York
var address = $.trim(temp.match(/^.*[0-9]{4}.[\w-]{2,40}/gm));
//assemble into person object.
var person={name: name + "", address: address + "", gender: gender +"", status:status + "", location:[] , marker:[]};
result.push(person);
}
}
The problem I have now is that:
Sometimes the names are not written in CAPITALS
Sometimes the postal code is not added so my code just stops working.
Sometimes they put a * in front of the name
A broader question is: what strategy can you take to tackle these types of messy input problems?
Should I make cases for every mistake I see in these snippets I get? I feel like
I don't really know exactly what I will get out of this piece of code every time I run
it with different input.
Here is a general way of handling it:
1. Find all lines that are most likely matches. Match on "Sober" or whatever makes it unlikely to miss a match, even if it gives you false positives.
2. Filter out false positives; this you have to update and tweak as you go. Make sure you only filter out what isn't relevant at all.
3. Apply strict filtering to the input: what doesn't match gets logged/reported for manual handling, and what does match now conforms to a known strict pattern.
4. Normalizing and extracting the data should now be much easier, since the possible input is limited at this stage.
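The steps above can be sketched roughly as follows. Everything here is an assumption made for illustration: the helper names, the rule that comment lines start with "Mededeling:", and the strict pattern (optional leading *, a name, a one-letter M/V gender, an address, then "F N - Sober" or "F NN - Not Sober"). Real input will need its own rules.

```javascript
// Stage 1: loose candidate test. Anything mentioning "sober" might be a person line.
const looksLikeCandidate = (line) => /sober/i.test(line);

// Stage 2: known false positives (hypothetical rule: comment lines start with "Mededeling:").
const isFalsePositive = (line) => /^\s*Mededeling:/i.test(line);

// Stage 3: strict pattern: name, one-letter gender, address, then the status.
const strict = /^\*?\s*(.+?)\s+([MV])\s+(.+?)\s+F\s+N{1,2}\s+-\s+(Not Sober|Sober)\b/i;

function parseLines(inputText) {
  const people = [];
  const rejected = []; // candidate lines that need manual handling
  for (const line of inputText.split('\n')) {
    if (!looksLikeCandidate(line) || isFalsePositive(line)) continue;
    const match = strict.exec(line);
    if (!match) {
      rejected.push(line);
      continue;
    }
    const [, name, gender, address, status] = match;
    // Stage 4: normalise the extracted fields.
    people.push({
      name: name.trim(),
      gender: gender.toUpperCase(),
      address: address.trim(),
      status: /not/i.test(status) ? 'Not sober' : 'Sober',
    });
  }
  return { people, rejected };
}

const sampleInput = [
  'DUMMY FOO V Stationstreet 2 8000 New York F N - Sober BSN: 1655',
  'Mededeling: patient was not sober last time',
  'FO SMOOTH M.FOO M Queen Elisabethstreet 19 9990 Paris F NN - Not Sober BSN: 14877',
  'JOHN FOOO Waterstreet 1 Sober',
].join('\n');
console.log(parseLines(sampleInput));
```

The last sample line lacks the gender column, so it lands in `rejected` instead of silently producing a wrong record; reviewing that list is what tells you which rule to add next.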
