Splitting Text Into Words in SQL Revisited
Has it really been over a week since last I blogged? Oh well ... I've been busy both at work and at home.
Something I've been working on of late is a better routine to split a string into individual words in T-SQL. You might remember my previous effort, which works really well. This one works even better (although it may be a little slower), because it allows for any number of "delimiter" characters (in the @delims variable).
I know a few of my readers are absolute SQL guns, so if there's something in here that doesn't look right, or could work better, feel free to call me on it in the comments section.
Update! It turns out that I need to move the hyphen character in @delims to the end of the string - otherwise it's treating it as a range operator, and matching everything between the characters on either side of it!
CREATE FUNCTION SplitWords(@text varchar(8000))RETURNS @words TABLE (pos smallint,value varchar(8000))ASBEGINSET @text = RTRIM(@text)DECLARE @delims varchar(10)SET @delims = ' ,:/-'DECLARE @pos smallint, @i smallint, @s varchar(8000)SET @pos = 1WHILE @pos < LEN(@text)AND CHARINDEX(SUBSTRING(@text, @pos, 1), @delims) > 0SET @pos = @pos + 1WHILE @pos <= LEN(@text)BEGINSET @i = PATINDEX('%[' + @delims + ']%',SUBSTRING(@text, @pos, len(@text) - @pos + 1))IF @i > 0BEGINSET @i = @i + @pos - 1IF @i > @posBEGIN-- @i now holds the earliest delimiter in the stringSET @s = SUBSTRING(@text, @pos, @i - @pos)INSERT INTO @wordsVALUES (@pos, @s)ENDSET @pos = @i + 1WHILE @pos < LEN(@text)AND CHARINDEX(SUBSTRING(@text, @pos, 1), @delims) > 0SET @pos = @pos + 1ENDELSEBEGINSET @s = SUBSTRING(@text, @pos, LEN(@text) - @pos + 1)INSERT INTO @wordsVALUES (@pos, @s) SET @pos = LEN(@text) + 1ENDEND
-- remove common words that we don't want to search forDELETE FROM @wordsWHERE value IN ('an', 'the', 'of', 'and', '&')RETURNEND
Comments
# Rob Farley
3/10/2006 3:35 PM
How about this:
create function [dbo].[wordsplit](@str nvarchar(max),@delims nvarchar(max) = ' ,-:/')
returns @ResTable table
(word nvarchar(max))
AS
begin
insert into @ResTable
select val from (select
left(s,patindex(N'%['+@delims+']%',s+N',')-1) as val
from (
select num, substring(@str,num,len(@str)) as s
from dbo.nums
where num <= len(@str)+1
and charindex(substring(N','+@str,num,1),@delims) > 0
) splits) v
where val not in ('an', 'the', 'of', 'and', '&')
;
return
end
;
--then try select * from dbo.wordsplit('this is the test',default)
--NB: this requires an auxiliary table of numbers, check out http://robfarley.blogspot.com/2006/09/auxiliary-table-of-numbers.html for details of that. You will want to make sure that it's topped up to longer than your longest string before calling wordsplit.
# mabster
3/10/2006 3:51 PM
You and your table of numbers, Rob :)
I do like the use of varchar(max) there - I should change mine to do the same, since I'm running on Sql 2005 too.
Since you already have that table of numbers handy, any chance you can run some profiling on the two functions to see which is quicker? I'm guessing yours, but I'd be interested in seeing the figures.
# mabster
3/10/2006 3:54 PM
Ah - yours seems to not like multiple delimiters in a row, Rob.
Try it with:
'Testing: wordsplit, by Rob'
I'm sure it's only a minor fix.
# mabster
3/10/2006 4:01 PM
Hmm. Just tried a quick run through, printing the number of ms it takes to do 1000 iterations of both routines.
Yours: 10526
Mine: 560
So it looks like my dodgy one wins this round.
# Rob Farley
4/10/2006 11:12 AM
To fix it for multiple characters, just add the empty string to the ones you remove at the end. :)
And yes, yours is faster when used as a function like this. But can you convert yours into a set-based approach?
For example, try running this:
declare @delims nvarchar(max);
set @delims = N' ,:/';
declare @start datetime;
set @start = current_timestamp;
select w.value
from AdventureWorks.Production.ProductDescription d
cross apply
dbo.splitwords(d.description) w
;
select 'Matt' as who, datediff(ms,@start, current_timestamp);
set @start = current_timestamp;
select w.word
from AdventureWorks.Production.ProductDescription d
cross apply
dbo.wordsplit(d.description,@delims) w
;
select 'Rob-UDF' as who, datediff(ms,@start, current_timestamp);
set @start = current_timestamp;
select left(s,patindex(N'%['+@delims+']%',s+N',')-1) as val
from (select num, substring(d.Description,num,len(d.Description)) as s
from AdventureWorks.Production.ProductDescription d
join
dbo.nums
on num <= len(d.Description)+1
and charindex(substring(N','+d.Description,num,1),@delims) > 0
) splits;
select 'Rob-set' as who, datediff(ms,@start, current_timestamp);
--You may find that the set-based approach works a little faster.
# mabster
4/10/2006 11:34 AM
Yeah, true - in fact, when I first looked at your implementation I was impressed that it was a single SELECT query rather than a series of loops.
Isn't CROSS APPLY a wonderful thing? How did we live without it for so long?
Anyway, you may notice that I've had to slightly change mine - the hyphen character is a special case when you do a LIKE using square brackets.
And yeah - removing empty strings at the end makes total sense.
# Sree
5/08/2008 5:03 PM
Great !!!!!!!!!
# chapter844
27/11/2008 4:11 PM
Hello Matt.
I want to split text into words from SQL for my asp page then I need SELECT Query.
How do I doing?
# DeveloperDan
3/08/2009 11:48 PM
Thanks Matt! Here is my wordy commented version of your routine to help myself better understand it:
CREATE FUNCTION SplitWords(@text varchar(max))
RETURNS @words TABLE (
position smallint,
word varchar(max)
)
AS
BEGIN
-- Use like this:
-- SELECT * FROM dbo.SplitWords('This is a test')
-- Or this:
-- SELECT *
-- FROM dbo.SplitWords(@userinput) w
-- JOIN table1 t ON dbo.SplitWords(t.column1) = w.word
-- Or this:
-- SELECT *
-- FROM MyTable t
-- CROSS APPLY dbo.SplitWords(t.MyColumn) w ON w.word = @word
-- WHERE dbo
-- (Where @word is the word you're searching for.)
-- from: www.madprops.org/.../Splitting-Text-
-- see also: www.madprops.org/.../T-SQL-to-Split-
SET @text = RTRIM(@text)
DECLARE @delims varchar(10)
SET @delims = ' ,:/-'
DECLARE @pos smallint, @thisDelimPos smallint, @thisWord varchar(max)
SET @pos = 1
-- Get the position of the first non-delimiter character.
-- Push the current position forward by one,
-- as long as there is a delimiter beyond the current position.
WHILE @pos < LEN(@text) AND CHARINDEX(SUBSTRING(@text, @pos, 1), @delims) > 0
SET @pos = @pos + 1
WHILE @pos <= LEN(@text)
BEGIN
-- find the position of the next delimeter
SET @thisDelimPos = PATINDEX('%[' + @delims + ']%', SUBSTRING(@text, @pos, len(@text) - @pos + 1))
IF @thisDelimPos > 0
BEGIN
SET @thisDelimPos = @thisDelimPos + @pos - 1
IF @thisDelimPos > @pos
BEGIN
-- @thisDelimPos now holds the position of the earliest delimiter in the string
SET @thisWord = SUBSTRING(@text, @pos, @thisDelimPos - @pos)
INSERT INTO @words VALUES (@pos, @thisWord)
END
SET @pos = @thisDelimPos + 1
-- Again, push the current position forward by one,
-- as long as there is a delimiter beyond the current position.
WHILE @pos < LEN(@text) AND CHARINDEX(SUBSTRING(@text, @pos, 1), @delims) > 0
SET @pos = @pos + 1
END
ELSE
BEGIN
SET @thisWord = SUBSTRING(@text, @pos, LEN(@text) - @pos + 1)
INSERT INTO @words VALUES (@pos, @thisWord)
SET @pos = LEN(@text) + 1
END
END
-- remove common words that we don't want to search for
DELETE FROM @words
WHERE word IN ('an', 'the', 'of', 'and', '&')
RETURN
END
# Ignas
10/07/2010 2:56 AM
if you need something fast and don't want to use temporary tables >>
declare @str varchar(max)
declare @delims nvarchar(max)
set @str ='Testing: wordsplit, by Rob'
set @delims = N' ,:;/';
with wordlst (word, leftwords, wordNb) as
(
select
left(@str,patindex(N'%['+@delims+']%',@str+N',')-1) as word,
case when len(@str)-patindex(N'%['+@delims+']%',@str+N',') > 0 then
right(@str,len(@str)-patindex(N'%['+@delims+']%',@str+N','))
else '' end as leftwords,
1 as wordNb
where len(@str) - patindex(N'%['+@delims+']%',@str+N',') > 0
union all
select
left(wordlst.leftwords,patindex(N'%['+@delims+']%',wordlst.leftwords+N',')-1) as word,
case when len(wordlst.leftwords) - patindex(N'%['+@delims+']%',wordlst.leftwords+N',') > 0 then
right(wordlst.leftwords,len(wordlst.leftwords) - patindex(N'%['+@delims+']%',wordlst.leftwords+N','))
else '' end as leftwords,
wordNb + 1
from wordlst where len(wordlst.leftwords) > 0 and wordNb < 101
)
select word from wordlst where word not in ('', 'an', 'the', 'of', 'and', '&') order by wordNb
# Njeru
18/12/2010 12:53 AM
Hi matt.I used the last example and its commendable.How can I use this to compare between the values produced and table values i.e see if a string contains any of these words.
# Aeman
21/02/2011 4:02 AM
Hi, I have just started working on SQL, so I dint understand squat what u guys wrote above. I am still writing very basic select queries. so I wanted to ask you that there are some cities and postcodes mixed up in my address column which I needed to split, could u guys help me
# a dude
15/01/2012 10:26 AM
removing punctuation? other non-printable word delimiters (tab, newline, etc)? Did you test this on more than just "mary had a little lamb"? weak...
# mabster
15/01/2012 12:29 PM
Thanks for the (brave, anonymous) feedback, "a dude".
You'll notice that there's a "@delims" variable declared in the code that lets you customize the word delimiters. That takes care of most of the punctuation - all you need to do is add any other characters you might be interested in.
# Abu Dina
3/08/2012 11:19 PM
Great work! Thanks.
# hamed
18/10/2012 6:35 PM
tnx a lot
# C.Young
29/08/2013 10:20 AM
Thanks for that. It works well.
# Tim-Nape
3/09/2013 12:11 AM
I used #Ignas solution, returns nothing if I pass it one value i.e. '1' instead of '1,2,...'
Tried to correct - but am not SQL master
# PoP
13/09/2013 2:04 AM
Great work! Thanks.
I've been doing this in much more complicated way. :) I was really astonished. :) But I have some problems. I have added new delimiters which are not working well.
I have added '-', because I have some calculations inside the text. If minus is declared as standalone delimiter it works, but if it is in combination with others it doesn't.
For example if I have: 'A-B-C', I get:
A-B-C
I really don't find the reason for this, because if I declare just minus as delims it works, I'm getting 3 rows.
Almost same thing was with '*' delimiter, I had to move it from the last to the first position in when declaring. I tried with minus too, no luck.
Did you have such problems?
# mabster
13/09/2013 8:17 AM
Hi PoP,
My code above has "-" as a delimiter already, and I know it works.
Be aware that it has to be the last thing in the text. Otherwise it gets treated like a regular expression (eg "A-Z" means "match anything from A to Z").
# PoP
13/09/2013 4:11 PM
Hi Matt,
Thank you. Now it works. :) I saw you have '-', but I have added some delimiters after it, so it didn't work. It was confusing when it was working with a space before '-' (a - b) and without not (a-b). :)
Thanks again.
# Olaf Doschke
25/10/2013 8:29 PM
Just what I needed, though I want to split Texts into sentences. That just needs one minor modification:
SET @delims = '.!?'+CHAR(13)+CHAR(10)
The only other thing to care for is abbreviations. Replace them before you split, eg replace 'U.S.' with 'U{dot]S{dot}' and after splitting replace each '{dot}' back to a '.' again. This is more of a pain in german, because more abbreviations include dots.
# Lai
14/04/2014 4:29 PM
Gud day! I have tried to split word and saved it in a temporary table. What I want to do is to make the first letter of each Item in a column Capitalized and combine the words again. But i'm having problem in combining the values.
Ex. table @container
-----------------------|
col1 | col2 |
-----------------------|
1 | De |
2 | Guzman |
-----------------------|
I want to have an output like:
'De Guzman'
I have tried to use this:
declare @output varchar(100) = ''
set @output = @output + ' ' + (Select col2 from @container where col1 = @counter)
Hope you can help me. Thanks in advance.
# mabster
15/04/2014 10:50 AM
Check this thread out on Stack Overflow, Lai:
stackoverflow.com/.../1565013
My answer details how to combine string fields from rows in a table into one delimited string. I'll leave the conversion of the first letter to upper case to you. :)