I want to extract the following information from a long piece of text, which I copy
and paste into a text field and then want to process into a table with these columns:
Name
Address
Status
Example snippet (names, addresses, etc. have been randomized):
Thuisprikindeling voor: Vrijdag 15 Mei 2015 DE SMART BON 22 afspraken
Pagina 1/4
Persoonlijke mededeling:
Algemene mededeling:
Prikpostgegevens: REEK-Eeklo extern, (-)
Telefoonnummer Fax Mobiel 0499/9999999 Email dummy.dummy#gmail.com
DUMMY FOO V Stationstreet 2 8000 New York F N - Sober BSN: 1655
THUIS Analyses: Werknr: PIN: 000000002038905
Opdrachtgever: Laboratorium Arts:
Mededeling: Some comments // VERY DIFFICULT
FO DUMMY FOO V Butterstreet 6 8740 Melbourne F N - Sober BSN: 15898
THUIS Analyses: Werknr: AFD 3 PIN: 000000002035900
Opdrachtgever: Laboratorium Arts:
Mededeling: ZH BLA / BLA BLA - AFD 3 - SOCIAL BEER
JOHN FOOO V Waterstreet 1 9990 Rome F N - Sober BSN: 17878
THUIS / Analyses: Werknr: K111 PIN: 000000002037888
Opdrachtgever: Laboratorium Arts:
Mededeling: TRYOUT/FOO
FO SMOOTH M.FOO M Queen Elisabethstreet 19 9990 Paris F NN - Not Sober BSN: 14877
What I want to get out of it is this:
DUMMY FOO Stationstreet 2 8000 New York Sober
FO DUMMY FOO Butterstreet 6 8740 Melbourne Sober
JOHN FOOO Waterstreet 1 9990 Rome Sober
FO SMOOTH M.FOO Queen Elisabethstreet 19 9990 Paris Not sober
My strategy for the moment is using the following:
Filter all the lines with at least two words in capitals at the beginning of the line AND a 4-digit postal code.
Then discard all the other lines, as I only need the lines with the names and addresses.
Then I strip out all the information needed for that line.
Strip the name / address / status
I use the following code:
//Regular expressions
//Filter all lines which start with at least two UPPERCASE words following a space
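var pattern = /^(([A-Z'.* ]{2,} ){2,}[A-Z]{1,})(?=.*BSN)/;
var postcode = /\d{4}/;
// The i flag lets the status match both "Not sober" and "Not Sober"
var searchSober = /(N - Sober)+/i;
var searchNotSober = /(NN - Not sober)+/i;
// Collects one person object per matching line
var result = [];
var adres = inputText.split('\n');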
for (var i = 0; i < adres.length; i++) {
// If in one line And a postcode and which starts with at least
// two UPPERCASE words following a space
temp = adres[i]
if ( pattern.test(temp) && postcode.test(temp)) {
//Remove BSN in order to be able to use digits to sort out the postal code
temp = temp.replace( /BSN.*/g, "");
// Example: DUMMY FOO V Stationstreet 2 8000 New York F N - Sober
//Selection of the name, always take first part of the array
// DUMMY FOO
var name = temp.match(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/)[0];
//remove the name from the string
temp = temp.replace(/^([-A-Z'*.]{2,} ){1,}[-A-Z.]{2,}/, "");
// V Stationstreet 2 8000 New York F N - Sober
//filter out gender
//Using jquery trim for whitespace trimming
// V
var gender = $.trim(temp.match(/^( [A-Z'*.]{1} )/)[0]);
//remove gender
temp = temp.replace(/^( [A-Z'*.]{1} )/, "");
// Stationstreet 2 8000 New York F N - Sober
//looking for status
var status = "unknown";
// searchNotSober must be spelled exactly as it was declared above
if ( searchNotSober.test(temp) ) {
status = "Not sober";
}
else if ( searchSober.test(temp) ) {
status = "Sober";
}
else {
status = "unknown";
}
//Selection of the address /^.*[0-9]{4}.[\w-]{2,40}/
//Stationstreet 2 8000 New York
var address = $.trim(temp.match(/^.*[0-9]{4}.[\w-]{2,40}/gm));
//assemble into person object.
var person={name: name + "", address: address + "", gender: gender +"", status:status + "", location:[] , marker:[]};
result.push(person);
}
}
The problem I have now is that:
Sometimes the names are not written in CAPITALS
Sometimes the postal code is not added so my code just stops working.
Sometimes they put a * in front of the name
A broader question: what strategy can you take to tackle this type of messy input problem?
Should I make cases for every mistake I see in these snippets I get? I feel like
I don't really know exactly what I will get out of this piece of code every time I run
it with different input.
Here is a general way of handling it:
Find all lines that are likely matches. Match on "Sober" or whatever else makes it unlikely that you miss a real match, even if it gives you false positives.
Filter out the false positives. You will have to update and tweak this step as you go; make sure you only filter out what isn't relevant at all.
Filter the remaining input strictly: whatever doesn't match gets logged/reported for manual handling, and whatever does match now conforms to a known strict pattern.
Normalizing and extracting the data should now be much easier, since the possible input is limited at this stage.
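The steps above can be sketched roughly as follows. This is only a minimal illustration, assuming the line layout of the sample input (name, a one-letter gender M or V, address, then "F N - Sober" or "F NN - Not Sober"); `extractPeople` and the `rejected` list are hypothetical names, not part of the original code.

```javascript
// Sketch only: the strict pattern encodes assumptions taken from the sample
// lines (gender is a single M or V, status is preceded by "F N -" or "F NN -").
function extractPeople(inputText) {
  const people = [];
  const rejected = []; // candidate lines that need manual handling
  const strict =
    /^\*?\s*(.+?)\s+([MV])\s+(.+?)\s+F\s+(N{1,2})\s+-\s+(Not\s+[Ss]ober|[Ss]ober)\b/;

  for (const rawLine of inputText.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Step 1: loose candidate filter, false positives are acceptable here
    if (!/sober/i.test(line)) continue;
    // Steps 2 and 3: strict match, everything else is reported, not guessed
    const m = strict.exec(line);
    if (!m) {
      rejected.push(line);
      continue;
    }
    // Step 4: normalise and extract from the now-known layout
    people.push({
      name: m[1],
      gender: m[2],
      address: m[3],
      status: /^not/i.test(m[5]) ? "Not sober" : "Sober"
    });
  }
  return { people, rejected };
}
```

Because the name and address are captured by position rather than by capital letters or a postal code, a lowercase name, a leading `*`, or a missing postal code no longer stops the extraction; lines that still do not fit end up in `rejected` for review.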
I have a set of text messages. Let's call them m1, m2, ... The maximum number of messages is below 1,000,000. Each message is below 1024 characters in length, and all are in lowercase. Let's also pick an n-gram s1.
I need to find the frequency of all possible substrings in all of these messages. For example, let's say we have only two messages:
m1 = a cat in a cage
m2 = a bird in a cage
The frequencies of some n-grams in these two messages:
'a' = 4
'in a cage' = 2
'a bird' = 1
'a cat' = 1
...
Note that since 'in' = 2, 'in a' = 2 and 'a cage' = 2 are subsets of 'in a cage' = 2 and have the same frequency, they should not be listed. Only take the longest n-gram with the highest frequency, subject to this condition: the longest n-gram should consist of at most 8 words, with a total character count below 30. If an n-gram exceeds this limit, it can be broken into two or more n-grams and listed separately.
I need to find such n-grams for all of these text messages and sort them by their number of occurrences in descending order.
How do I approach this problem? I need a solution in JavaScript.
PS: I need help, but do not know where to ask this. If the question
is not for this site, then where should I post it? Please guide this
newbie.
Maybe you can approach it as follows. I will edit to add an explanation as soon as I have some time.
// All contiguous word sequences (n-grams) of a sentence passed in as separate words
var subSentences = (w, ...ws) => ws.length
    // every n-gram starting at w, followed by the n-grams of the remaining words
    ? ws.reduce((r, s) => (r.push(r[r.length - 1] + ` ${s}`), r), [w])
        .concat(subSentences(...ws))
    : [w];
// Count how often each n-gram occurs across all messages
var frequencyMap = sss => sss.reduce(
    (map, ss) => subSentences(...ss.split(/\s+/))
        .reduce((m, s) => m.set(s, m.get(s) + 1 || 1), map),
    new Map());
var frequencies = frequencyMap(["this is a test string",
                                "this is another one",
                                "yet another one is here"]);
console.log(...frequencies.entries()); // logging map object seems not possible hence entries
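The snippet above counts every n-gram but does not yet apply the limits from the question (at most 8 words, fewer than 30 characters), drop n-grams contained in a longer n-gram with the same count, or sort by frequency. A minimal sketch of those steps follows; `countNgrams` and `maximalNgrams` are hypothetical names, and the containment filter is quadratic, so it is only meant for small inputs and would need a smarter data structure for a million messages.

```javascript
// Count every n-gram of up to maxWords words and fewer than maxChars characters.
function countNgrams(messages, maxWords = 8, maxChars = 30) {
  const counts = new Map();
  for (const message of messages) {
    const words = message.trim().split(/\s+/).filter(Boolean);
    for (let start = 0; start < words.length; start++) {
      let gram = "";
      for (let len = 1; len <= maxWords && start + len <= words.length; len++) {
        gram = len === 1 ? words[start] : gram + " " + words[start + len - 1];
        if (gram.length >= maxChars) break; // longer n-grams only get longer
        counts.set(gram, (counts.get(gram) || 0) + 1);
      }
    }
  }
  return counts;
}

// Keep an n-gram only if no longer n-gram with the same count contains it,
// then sort by count in descending order.
function maximalNgrams(counts) {
  const entries = [...counts.entries()];
  const kept = entries.filter(([gram, n]) =>
    !entries.some(([other, m]) =>
      m === n &&
      other.length > gram.length &&
      (" " + other + " ").includes(" " + gram + " ")));
  return kept.sort((a, b) => b[1] - a[1]);
}
```

For the two messages in the question, `maximalNgrams(countNgrams(["a cat in a cage", "a bird in a cage"]))` starts with `["a", 4]` and `["in a cage", 2]`, followed by the two full messages with a count of 1.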
How can I split a list of strings based on quantity and name?
For example if I have string str that looks like the following:
5 apples
7x pine apples
10 oranges
14x corn on the cob
apple pie
I could do,
var list = str.split(/\r?\n/);
So now I have each line in an array list but now I still need to get the quantity and name from each element in the list.
For list[0] which is '5 apples' I could do,
var breakdown = list[0].split(' ');
For list[1] I'd have to remove the 'x' from '7x', and the line would incorrectly be split into 3 parts rather than just the quantity and name, etc.
For 'apple pie' the quantity should be 1.
The expected result is always,
breakdown[0]: quantity
breakdown[1]: name
How can I get the quantity and name regardless of how it's entered?
A regex on each line would do it. It is followed by a second .map() that converts the numeric (or empty) string to a number.
var data = `5 apples
7x pine apples
10 oranges
14x corn on the cob
apple pie`;
var result = data.split(/\s*?(?:\r?\n)+\s*/g).map(s =>
/^(?:(\d+)x?\s+)?(.+)$/.exec(s).slice(1)
).map(([q, d]) => [+q || 1, d]);
console.log(result);
It could actually be done with just a regex too, if you include the m modifier.
var data = `5 apples
7x pine apples
10 oranges
14x corn on the cob
apple pie`;
var re = /^(?:(\d+)x?\s+)?(.+)$/gm;
var m;
var result = [];
while((m = re.exec(data))) {
result.push([+m[1] || 1, m[2]]);
}
console.log(result);
I am new to JavaScript. I have tried a lot, but could not find an answer or any information on how to count the occurrences of multiple sub-strings in a long string in one pass.
Further information: I need the number of occurrences of these sub-strings, and if a sub-string occurs too often I need to replace it, so I need to get all the counts in one pass.
Here is an example:
The long string Text as below,
Super Bowl 50 was an American football game to determine the champion of the National Football League (NFL) for the 2015 season. The American Football Conference (AFC) champion Denver Broncos defeated the National Football Conference (NFC) champion Carolina Panthers 24–10 to earn their third Super Bowl title. The game was played on February 7, 2016, at Levi's Stadium in the San Francisco Bay Area at Santa Clara, California. As this was the 50th Super Bowl, the league emphasized the "golden anniversary" with various gold-themed initiatives, as well as temporarily suspending the tradition of naming each Super Bowl game with Roman numerals (under which the game would have been known as "Super Bowl L"), so that the logo could prominently feature the Arabic numerals 50.
The sub-string is a question, and what I need is to count the occurrences in the text of every word of this question in one pass, for example the words "name", "NFL", "championship", "game", "is" and "the" in the following string:
What is the name of NFL championship game?
One of the problems is that some sub-strings are not in the text at all, while others appear many times (those are the ones I might replace).
The code I have tried is below. It is wrong; I have tried many different ways but got no good results.
$(".showMoreFeatures").click(function(){
var text= $(".article p").text(); // This is to get the text.
var textCount = new Array();
// Because I use match, the first word "what" returns null, so
// this is to avoid that null. I planned to get the count, and if it is
// more than 7 or even more, I will replace the words.
var qus = item2.question; //This is to get the sub-string
var checkQus = qus.split(" "); // I split the question to words
var newCheckQus = new Array();
// This is the array where I planned to put the words whose count is less than 7, the words I really need.
var count = new Array();
// Because the sub-string is a question with many words, I planned to get their counts and put them in an array.
for(var k =0; k < checkQus.length; k++){
textCount = text.match(checkQus[k],"g")
if(textCount == null){
continue;
}
for(var j =0; j<checkQus.length;j++){
count[j] = textCount.length;
}
//count++;
}
});
I have tried many different ways and searched a lot, but with no good results. The code above just shows what I have tried and my thinking (which might be totally wrong). It is not working; if you know how to implement this and solve my problem, please just tell me, there is no need to correct my code.
Thanks very much.
If I have understood the question correctly then it seems you need to count the number of times the words in the question (que) appear in the text (txt)...
var txt = "Super Bowl 50 was an American ...etc... Arabic numerals 50.";
var que = "What is the name of NFL championship game?";
I'll go through this in vanilla JavaScript and you can adapt it for jQuery as required.
First of all, to focus on the text we can make things a little simpler by changing the strings to lowercase and removing some of the punctuation.
// both strings to lowercase
txt = txt.toLowerCase();
que = que.toLowerCase();
// remove punctuation
// using double \\ for proper regular expression syntax
var puncArray = ["\\,", "\\.", "\\(", "\\)", "\\!", "\\?"];
puncArray.forEach(function(P) {
// create a regular expresion from each punctuation 'P'
var rEx = new RegExp( P, "g");
// replace every 'P' with empty string (nothing)
txt = txt.replace(rEx, '');
que = que.replace(rEx, '');
});
Now we can create a cleaner array from txt and que as well as a hash table from que like so...
// Arrays: split at every space
var txtArray = txt.split(" ");
var queArray = que.split(" ");
// Object, for storing 'que' counts
var queObject = {};
queArray.forEach(function(S) {
// create 'queObject' keys from 'queArray'
// and set value to zero (0)
queObject[S] = 0;
});
queObject will be used to hold the words counted. If you were to console.debug(queObject) at this point it would look something like this...
console.debug(queObject);
/* =>
queObject = {
what: 0,
is: 0,
the: 0,
name: 0,
of: 0,
nfl: 0,
championship: 0,
game: 0
}
*/
Now we want to test each element in txtArray to see if it contains any of the elements in queArray. If the test is true we'll add +1 to the equivalent queObject property, like this...
// go through each element in 'queArray'
queArray.forEach(function(A) {
// create regular expression for testing
var rEx = new RegExp( A );
// test 'rEx' against elements in 'txtArray'
txtArray.forEach(function(B) {
// is 'A' in 'B'?
if (rEx.test(B)) {
// increase 'queObject' property 'A' by 1.
queObject[A]++;
}
});
});
We use the RegExp test method here rather than the String match method because we only want to know whether A occurs in B. If it does, we increase the corresponding queObject property by 1. Because this is a substring test, it will also find words inside words, such as 'is' in 'san francisco'.
All being well, logging queObject to the console will show you how many times each word in the question appeared in the text.
console.debug(queObject);
/* =>
queObject = {
what: 0,
is: 2,
the: 17,
name: 0,
of: 2,
nfl: 1,
championship: 0,
game: 4
}
*/
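If hits inside other words (such as 'is' in 'this') are not wanted, a whole-word variant is one option. The sketch below is an assumption about what "count each word" should mean rather than part of the approach above; `countWholeWords` is a hypothetical helper name, and it treats any run of letters, digits or apostrophes as one word.

```javascript
// Count how often each word of the question appears in the text as a whole word.
function countWholeWords(text, question) {
  // Lowercase and split into word tokens; punctuation is dropped by the pattern.
  const tokenize = s => s.toLowerCase().match(/[a-z0-9']+/g) || [];

  const counts = {};
  for (const word of tokenize(question)) counts[word] = 0;

  // A single pass over the text: only exact token matches are counted.
  for (const token of tokenize(text)) {
    if (Object.prototype.hasOwnProperty.call(counts, token)) counts[token]++;
  }
  return counts;
}
```

Because the text is tokenized once and each token is looked up directly, this also avoids building one regular expression per question word.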
Hope that helps. :)
See MDN for more information on:
Array.forEach()
Object.keys()
RegExp.test()
I have a list of postcodes in the UK with a region id next to it. Now for delivering products it costs more depending on the region a user lives in.
For example, if a user lives in Birmingham and has a postcode that starts with B, he will get free delivery because that postcode region doesn't have any charge.
Likewise, if a user has a postcode starting with IM, they have to pay more for delivery, as that postcode region costs more.
Sample postcode list:
Postcode | Region
AL | A
BA | A
BB | A
BD | A
B | B
BH | B
LN | D
LS | D
IV1 | E
IV23 | F
From the example above if a user wants to get a delivery and their postcode starts with BA then I want to apply the delivery charge rate of region A.
I'm actually a bit confused as to how I can programmatically do this. At first I thought I would simply do something similar to:
$postcodes = [
'AL'=>'A',
'BA'=>'A',
//And so on ....
];
//get the first 2 letters
$user_input = substr( $user_postcode, 0, 2 );
if(array_key_exists($user_input,$postcodes)){
//Get the region code
$region = $postcodes[$user_input];
// Charge the user with the delivery rate specific to that user, then carry on
}
But the problem is that some similar postcodes can be in different regions; for example, IV1 is in region E and IV23 is in region F, as seen above.
That means I have to match a user's postcode on its first 1, 2, 3 or 4 characters. That probably doesn't make sense, so to elaborate, see below:
//From Birmingham and is in region B
$user1_input = 'B';
//From Bradford and is in region A
$user2_input = 'BD';
//From Inverness and is in region E
$user3_input = 'IV1';
So if the user is from Birmingham and the input starts with B, how can I tell that apart from a postcode that also starts with B but is followed by other letters, which makes it a different postcode?
I'm trying my best to explain, hopefully, this does make sense. If not please ask for more info.
Can anyone please help me with the logic for how I could achieve this? Either in JavaScript or PHP, because I can convert the logic afterwards.
If you have what looks like a valid UK postcode, then remove the spaces and just search the array till you find a match:
$lookup = [
'' => 'X', // in case no match is found
'AL'=>'A',
'BA'=>'A',
//And so on ....
];
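function get_delivery_for($postcode)
{
    global $lookup;
    $result = false;
    // Try the longest prefix first; at $x == 0 the empty prefix
    // matches the '' fallback entry in $lookup
    for ($x = 5; $x >= 0 && !$result; $x--) {
        $result = $lookup[substr($postcode, 0, $x)];
    }
    return $result;
}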
Note that the code above is intended for illustration. I would recommend using something more elaborate to avoid it throwing warnings, such as:
$result=isset($lookup[substr($postcode, 0, $x)])
? $lookup[substr($postcode, 0, $x)]
: false;
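Since the question also allows JavaScript, here is a minimal sketch of the same longest-prefix-first lookup. The names `regionByPrefix` and `regionFor` are hypothetical, the table holds only a few entries from the sample list, and the maximum prefix length of 4 is an assumption based on that list.

```javascript
// A few entries from the sample list; the real table would hold all prefixes.
const regionByPrefix = new Map([
  ["AL", "A"], ["BA", "A"], ["BD", "A"],
  ["B", "B"], ["BH", "B"],
  ["IV1", "E"], ["IV23", "F"]
]);

// Try the longest prefix first and shorten it until an entry is found.
function regionFor(postcode, fallback = null) {
  const compact = postcode.replace(/\s+/g, "").toUpperCase();
  for (let len = Math.min(4, compact.length); len > 0; len--) {
    const region = regionByPrefix.get(compact.slice(0, len));
    if (region !== undefined) return region;
  }
  return fallback; // no prefix matched
}
```

With this table, `regionFor("B1 1AA")` falls through to the single-letter entry `B`, while `regionFor("bd1 1aa")` stops at `BD`. A prefix such as `IV1` would also match `IV12`, so districts that belong to a different region need their own, longer entry.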
One option would be to order your postcode/region array by the descending length of the postcode key. This way, the longer (more specific) keys are checked first. Taking your list above, it would become something like this...
$postcodes = array(
"IV23" => "F",
"IV1" => "E",
"LS" => "D",
"LN" => "D",
"BH" => "B",
"BD" => "A",
"BB" => "A",
"BA" => "A",
"AL" => "A",
"B" => "B",
);
After you have that, it's as simple as looping through the array, checking for a match against the provided postcode (starting from the left), and stopping when you find a match.
foreach($postcodes as $code => $region)
{
if($code == substr($user_postcode, 0, strlen($code)))
{
$shippingRegion = $region;
break;
}
}
echo $shippingRegion;
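For comparison, the sorted-array approach could look like this in JavaScript. It is only a sketch: `postcodeRegions`, `prefixesLongestFirst` and `shippingRegion` are hypothetical names, and the keys are sorted in code instead of being ordered by hand.

```javascript
// Sample list from the question, in no particular order.
const postcodeRegions = {
  AL: "A", BA: "A", BB: "A", BD: "A", B: "B",
  BH: "B", LN: "D", LS: "D", IV1: "E", IV23: "F"
};

// Sort once so that longer (more specific) prefixes are checked first.
const prefixesLongestFirst = Object.keys(postcodeRegions)
  .sort((a, b) => b.length - a.length);

function shippingRegion(postcode) {
  const normalized = postcode.trim().toUpperCase();
  const prefix = prefixesLongestFirst.find(p => normalized.startsWith(p));
  // null signals that no region is known for this postcode
  return prefix === undefined ? null : postcodeRegions[prefix];
}
```

Sorting in code means new prefixes can be added to the table anywhere without breaking the "most specific first" rule.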
I have been handed a project at work where I need to find duplicate pairings from multiple rows within a dataset. While the data set is much larger, the main portion revolves around the date of a training, the location of a training, and the names of the trainers. So every row of data has a date, a location, and then a comma separated list of names:
Date Location Names
1/13/2014 Seattle A, B, D
1/16/2014 Dallas C, D, E
1/20/2014 New York A, D
1/23/2014 Dallas C, E
1/27/2014 Seattle B, D
1/30/2014 Houston C, A, F
2/3/2014 Washington DC D, A, F
2/6/2014 Phoenix B, E
2/10/2014 Seattle C, B
2/13/2014 Miami A, B, E
2/17/2014 Miami C, D
2/20/2014 New York B, E, F
2/24/2014 Houston A, B, F
My goal is to be able to find rows with similar pairings of names. One example would be to know that A & B were paired in Seattle on 1/13, Miami on 2/13, and Houston on 2/24, even though the third name is different in each occurrence. So instead of just simply finding duplicates among the entire string of names, I would also like to find pairings among partial segments of the “Names” column.
Is this possible to do within Excel or would I need to use a programming language to accomplish the task?
While I can manually do this, it represents a lot of time that could be used towards other things. If there was a way that I could automate this it would make this portion of my task a lot simpler.
Thank you in advance for any assistance or advice on a way forward.
You can do it with VBA. The solution below assumes
Your data is on the active sheet in columns A:C
Your results will be output in columns E:G
The output will be a list sorted by pairs, and then by dates, so you can easily see where pairs repeated.
The routine assumes no more than three trainers at a time, but could be modified to add more possible combinations.
Cities with just a single trainer will be ignored.
The routine uses a Class module to gather the information, and two Collections to process the data. It also makes use of the feature that collections will not allow addition of two items with the same key.
Class Module
Rename the Class Module: cPairs
Option Explicit
Private pTrainer1 As String
Private pTrainer2 As String
Private pCity As String
Private pDT As Date
Public Property Get Trainer1() As String
Trainer1 = pTrainer1
End Property
Public Property Let Trainer1(Value As String)
pTrainer1 = Value
End Property
Public Property Get Trainer2() As String
Trainer2 = pTrainer2
End Property
Public Property Let Trainer2(Value As String)
pTrainer2 = Value
End Property
Public Property Get City() As String
City = pCity
End Property
Public Property Let City(Value As String)
pCity = Value
End Property
Public Property Get DT() As Date
DT = pDT
End Property
Public Property Let DT(Value As Date)
pDT = Value
End Property
Regular Module
Option Explicit
Option Compare Text
Public cP As cPairs, colP As Collection
Public colCityPairs As Collection
Public vSrc As Variant
Public vRes() As Variant
Public rRes As Range
Public I As Long, J As Long
Public V As Variant
Public sKey As String
Sub FindPairs()
vSrc = Range("A1", Cells(Rows.Count, "C").End(xlUp))
Set colP = New Collection
Set colCityPairs = New Collection
'Collect Pairs
For I = 2 To UBound(vSrc)
V = Split(Replace(vSrc(I, 3), " ", ""), ",")
If UBound(V) >= 1 Then
'sort the pairs
SingleBubbleSort V
Select Case UBound(V)
Case 1
AddPairs V(0), V(1)
Case 2
AddPairs V(0), V(1)
AddPairs V(0), V(2)
AddPairs V(1), V(2)
End Select
End If
Next I
ReDim vRes(0 To colCityPairs.Count, 1 To 3)
vRes(0, 1) = "Date"
vRes(0, 2) = "Location"
vRes(0, 3) = "Pairs"
For I = 1 To colCityPairs.Count
With colCityPairs(I)
vRes(I, 1) = .DT
vRes(I, 2) = .City
vRes(I, 3) = .Trainer1 & ", " & .Trainer2
End With
Next I
Set rRes = Range("E1").Resize(UBound(vRes, 1) + 1, UBound(vRes, 2))
With rRes
.EntireColumn.Clear
.Value = vRes
With .Rows(1)
.HorizontalAlignment = xlCenter
.Font.Bold = True
End With
.Sort key1:=.Columns(3), order1:=xlAscending, key2:=.Columns(1), order2:=xlAscending, _
Header:=xlYes
.EntireColumn.AutoFit
V = VBA.Array(vbYellow, vbGreen)
J = 0
For I = 2 To rRes.Rows.Count
If rRes(I, 3) = rRes(I - 1, 3) Then
.Rows(I).Interior.Color = .Rows(I - 1).Interior.Color
Else
J = J + 1
.Rows(I).Interior.Color = V(J Mod 2)
End If
Next I
End With
End Sub
Sub AddPairs(T1, T2)
Set cP = New cPairs
With cP
.Trainer1 = T1
.Trainer2 = T2
.City = vSrc(I, 2)
.DT = vSrc(I, 1)
sKey = .Trainer1 & "|" & .Trainer2
On Error Resume Next
colP.Add cP, sKey
If Err.Number = 457 Then
Err.Clear
colCityPairs.Add colP(sKey), sKey & "|" & colP(sKey).DT & "|" & colP(sKey).City
colCityPairs.Add cP, sKey & "|" & .DT & "|" & .City
Else
If Err.Number <> 0 Then Stop
End If
On Error GoTo 0
End With
End Sub
Sub SingleBubbleSort(TempArray As Variant)
'copied directly from support.microsoft.com
Dim Temp As Variant
Dim I As Integer
Dim NoExchanges As Integer
' Loop until no more "exchanges" are made.
Do
NoExchanges = True
' Loop through each element in the array.
For I = LBound(TempArray) To UBound(TempArray) - 1
' If the element is greater than the element
' following it, exchange the two elements.
If TempArray(I) > TempArray(I + 1) Then
NoExchanges = False
Temp = TempArray(I)
TempArray(I) = TempArray(I + 1)
TempArray(I + 1) = Temp
End If
Next I
Loop While Not (NoExchanges)
End Sub
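The core idea of the routine (sort the names in a row, build every pair, and group the rows by pair) is independent of VBA. As a rough illustration outside Excel, here is a JavaScript sketch; `findRepeatedPairs` and the row object shape `{ date, location, names }` are assumptions made for this example, and unlike the macro it is not limited to three trainers per row.

```javascript
// Group training sessions by trainer pair and keep only pairs seen more than once.
function findRepeatedPairs(rows) {
  const byPair = new Map();
  for (const { date, location, names } of rows) {
    // Sorting makes "B, A" and "A, B" produce the same key.
    const trainers = names.split(",").map(n => n.trim()).filter(Boolean).sort();
    for (let i = 0; i < trainers.length; i++) {
      for (let j = i + 1; j < trainers.length; j++) {
        const key = trainers[i] + ", " + trainers[j];
        if (!byPair.has(key)) byPair.set(key, []);
        byPair.get(key).push({ date, location });
      }
    }
  }
  // Drop pairs that only worked together once.
  return new Map([...byPair].filter(([, sessions]) => sessions.length > 1));
}
```

Each key of the returned map is a pair such as `"A, B"`, and its value lists the dates and locations where that pair appeared together.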
OK. I got bored and did this whole thing in Python. I assume you are familiar with the language; either way, you should be able to get the following code to work on any computer with Python installed.
I have made a few assumptions. For instance, I have treated your example input as the definitive input format.
A few things which will mess up the program:
Entering names with the wrong case. The input is case sensitive, so beware of capital letters etc.
Having an input file that contains the header row "Date Location Names". Just remove it and keep only the data rows in the file. I got lazy and did not bother handling this.
A ton of other small stuff. Just do what the program asks you to do and don't enter funky input.
About the program:
It revolves around a dictionary with person names as keys. Each value in the dictionary is a set of tuples containing the places that person has been and on what date. By comparing these sets and taking their intersection, we can find the answer.
It is kind of messy since I took this as Python practice. I have not coded in Python for a while, and I got a thrill out of doing it all without using objects. Just follow the "instructions" and keep the input file, which stores all the information, in the same folder as the code you are running.
As a side note, you might want to check that the program yields correct output.
If you have any questions, feel free to contact me.
# Assumes space-separated columns, single-word locations and
# one-letter names written as "A, B, D".

def readWord(line, stringIndex):
    # Collect characters up to the next space.
    word = ""
    while(line[stringIndex] != " "):
        word += line[stringIndex]
        stringIndex += 1
    return word, stringIndex

def removeSpacing(line, stringIndex):
    # Skip over consecutive spaces.
    while(line[stringIndex] == " "):
        stringIndex += 1
    return stringIndex

def readPeople(line, stringIndex):
    # Names are one character long and separated by ", ", hence the step of 3.
    lineSize = len(line)
    people = []
    while(stringIndex < lineSize):
        people.append(line[stringIndex])
        stringIndex += 3
    return people

def readLine(travels, line):
    # Parse "date location names" and record (date, location) for every person.
    stringIndex = 0
    date, stringIndex = readWord(line, stringIndex)
    stringIndex = removeSpacing(line, stringIndex)
    location, stringIndex = readWord(line, stringIndex)
    stringIndex = removeSpacing(line, stringIndex)
    people = readPeople(line, stringIndex)
    for person in people:
        if(person not in travels.keys()):
            travels[person] = set()
        travels[person].add((date, location))
    return travels

def main():
    filename = input("Enter filename (must be in same folder as this program code. For instance, name could be: testDocument.txt\n\n")
    travels = dict()
    with open(filename) as f:
        for line in f:
            travels = readLine(travels, line)
    print("\n\n\n\n PROGRAM RUNNING \n \n")
    while(True):
        persons = []
        userInput = "empty"
        # An empty answer (just pressing Enter) ends the list of names.
        while(userInput):
            userInput = input("Enter person name (Type Enter to finish typing names): ")
            if(userInput):
                persons.append(userInput)
        # Sessions shared by everyone = intersection of their (date, location) sets.
        output = travels[persons[0]]
        for person in persons[1:]:
            output = output.intersection(travels[person])
        print("")
        for hit in output:
            print(hit)
        print("\nFINISHED WITH ONE RUN. STARTING NEW ONE\n")

if __name__ == "__main__":
    main()
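The dictionary-of-sets idea described above can also be sketched in JavaScript with `Map` and `Set`. This is an illustration only: `buildTravels` and `sharedSessions` are hypothetical names, and the rows are assumed to be already split into `date`, `location` and `names` fields, which sidesteps the problem of locations that contain spaces (such as "New York").

```javascript
// Map each person to the set of sessions ("date | location") they attended.
function buildTravels(rows) {
  const travels = new Map();
  for (const { date, location, names } of rows) {
    for (const name of names.split(",").map(n => n.trim()).filter(Boolean)) {
      if (!travels.has(name)) travels.set(name, new Set());
      travels.get(name).add(date + " | " + location);
    }
  }
  return travels;
}

// Sessions attended by every person in the list (set intersection).
function sharedSessions(travels, people) {
  if (people.length === 0) return [];
  // Unknown names are treated as having attended nothing.
  const [first, ...rest] = people.map(p => travels.get(p) || new Set());
  return [...first].filter(session => rest.every(set => set.has(session)));
}
```

Asking for `["A", "B"]` returns the sessions both attended, and adding a third name narrows the result further, which mirrors the repeated `intersection` calls in the Python program.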