The goal
Generate some random data for an application. The value we’re focusing on here is a short “about me” paragraph.
What we have
- Around 40 pre-populated rows.
- Free rein to do as we want.
First solution
Create a collection of unique words from the existing records by splitting the existing “about me” strings on spaces, removing special characters, and using an associative array to dedupe the words.
    FUNCTION build_word_list RETURN nt_type IS
      TYPE aa_type IS TABLE OF VARCHAR2(500) INDEX BY VARCHAR2(500);
      words_aa        aa_type;
      word_list       nt_type;
      l_idx           VARCHAR2(200);
      c_special_chars CONSTANT VARCHAR2(50) :=
        ';:*~!#$%^()_-+=&#1234567890}]{{/?.,><' || CHR(9) || CHR(10) || CHR(13);
    BEGIN
      FOR about_list IN (SELECT dbms_lob.substr(about_yourself) about
                           FROM dd_members
                          WHERE member_id < 41)
      LOOP
        FOR word_rec IN (SELECT regexp_substr(about_list.about, '[^ ]+', 1, level) word
                           FROM dual
                        CONNECT BY regexp_substr(about_list.about, '[^ ]+', 1, level) IS NOT NULL)
        LOOP
          -- remove non-alphabet characters except '
          words_aa(NVL(TRANSLATE(word_rec.word, c_special_chars, 'I'), 'I')) := 1;
        END LOOP;
      END LOOP;

      l_idx := words_aa.first;
      WHILE (l_idx IS NOT NULL) LOOP
        word_list(word_list.count + 1) := l_idx;
        l_idx := words_aa.next(l_idx);
      END LOOP;

      RETURN word_list;
    END build_word_list;
This runs once when the package initializes. After that, the word_list collection is retained in memory as package state.
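Roughly, the package is laid out like the skeleton below. This is only a sketch: the package name dd_random_data and the exact definition of nt_type aren’t shown here, so both are assumptions (an INDEX BY table matches how build_word_list fills the list).

    CREATE OR REPLACE PACKAGE dd_random_data AS
      -- Assumption: the word-list type. An INDEX BY table matches how
      -- build_word_list fills it (direct assignment to count + 1).
      TYPE nt_type IS TABLE OF VARCHAR2(500) INDEX BY PLS_INTEGER;

      FUNCTION generate_about RETURN CLOB;
    END dd_random_data;
    /

    CREATE OR REPLACE PACKAGE BODY dd_random_data AS
      word_list nt_type;  -- package-level state, kept in memory for the session

      FUNCTION build_word_list RETURN nt_type IS
      BEGIN
        RETURN word_list;  -- placeholder; the real body is shown above
      END build_word_list;

      FUNCTION generate_about RETURN CLOB IS
      BEGIN
        RETURN NULL;       -- placeholder; the real body is shown below
      END generate_about;

    BEGIN
      -- Package initialization section: runs once, the first time the
      -- package is referenced in a session, so the word list is built once.
      word_list := build_word_list;
    END dd_random_data;
    /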
As each new record is created, we generate the about value as a random number of random words from the word list.
    FUNCTION generate_about RETURN CLOB IS
      new_about CLOB;
      num_words NUMBER;
    BEGIN
      -- About will have between 10 and 100 words
      num_words := floor(dbms_random.value(10, 101));

      FOR indx IN 1..num_words LOOP
        new_about := new_about || word_list(dbms_random.value(1, word_list.count)) || ' ';
      END LOOP;

      RETURN rtrim(new_about);
    END generate_about;
The problem
Even though the words are real, the values are gibberish.
We decided to create a table of “generic” sentences and build the new “about me” value by selecting sentences from it at random.
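Something like the following, where dd_sentences and sentence_text come from the code below, but the column sizes, the identity key, and the definition of about_sentences_type are assumptions:

    CREATE TABLE dd_sentences (
      sentence_id   NUMBER GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
      sentence_text VARCHAR2(500) NOT NULL
    );

    -- A SQL-level collection type so it can also be used with TABLE()
    -- and CAST(MULTISET(...)) later on.
    CREATE OR REPLACE TYPE about_sentences_type AS TABLE OF VARCHAR2(500);
    /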
With this change, we can drop the build_word_list function and just do a simple bulk collect from the sentences table. Then we can change generate_about to use the sentences list instead of the word list.
    SELECT sentence_text
      BULK COLLECT INTO about_sentences_list
      FROM dd_sentences;
    FUNCTION generate_about RETURN CLOB IS
      new_about     CLOB;
      num_sentences NUMBER;
    BEGIN
      -- About will have between 3 and 10 sentences
      num_sentences := floor(dbms_random.value(3, 11));

      FOR indx IN 1..num_sentences LOOP
        new_about := new_about ||
                     about_sentences_list(dbms_random.value(1, about_sentences_list.count)) || ' ';
      END LOOP;

      RETURN rtrim(new_about);
    END generate_about;
This works fine, except that with a much shorter list to choose from, we wind up with a lot of duplicate sentences. We’d like each value to use only unique sentences.
Ideas
Associative Array
- Create an associative array to keep track of the used indexes.
- Generate the index number of the sentence to use separately.
- If the new index number is already used, regenerate it.
The risk here is a long-running process if we repeatedly hit the same random number. It’s a low risk, but it’s there.
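Here’s a sketch of that idea as an alternate generate_about body. It assumes the package-level about_sentences_list loaded above; the local names are illustrative.

    FUNCTION generate_about RETURN CLOB IS
      TYPE used_aa_type IS TABLE OF BOOLEAN INDEX BY PLS_INTEGER;
      used_indexes   used_aa_type;
      sentence_index PLS_INTEGER;
      num_sentences  PLS_INTEGER;
      new_about      CLOB;
    BEGIN
      -- About will have between 3 and 10 sentences
      num_sentences := floor(dbms_random.value(3, 11));

      FOR indx IN 1..num_sentences LOOP
        -- Re-roll until we land on an index we haven't used for this value.
        -- This inner loop is the "long-running process" risk mentioned above.
        LOOP
          sentence_index := floor(dbms_random.value(1, about_sentences_list.count + 1));
          EXIT WHEN NOT used_indexes.EXISTS(sentence_index);
        END LOOP;

        used_indexes(sentence_index) := TRUE;
        new_about := new_about || about_sentences_list(sentence_index) || ' ';
      END LOOP;

      RETURN rtrim(new_about);
    END generate_about;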
Copy the Collection and delete used items
Make a copy of the master collection and call collection.delete(x) on each sentence as it’s used.
Of course, the problem is that deleting an item doesn’t collapse the collection, and we’ll get an exception if we reference a deleted index. There are several ways to work around this, but now our code is getting long.
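For example, a version along these lines has to skip over the gaps that DELETE leaves behind (about_sentences_list is the package-level collection; the local names are illustrative):

    FUNCTION generate_about RETURN CLOB IS
      -- Work on a copy so the master list stays intact between calls.
      remaining_list about_sentences_type := about_sentences_list;
      sentence_index PLS_INTEGER;
      num_sentences  PLS_INTEGER;
      new_about      CLOB;
    BEGIN
      -- About will have between 3 and 10 sentences
      num_sentences := floor(dbms_random.value(3, 11));

      FOR indx IN 1..num_sentences LOOP
        sentence_index := floor(dbms_random.value(1, about_sentences_list.count + 1));

        -- DELETE leaves a gap rather than collapsing the collection, so walk
        -- forward (wrapping around) until we find an index that still exists.
        WHILE NOT remaining_list.EXISTS(sentence_index) LOOP
          sentence_index := sentence_index + 1;
          IF sentence_index > about_sentences_list.count THEN
            sentence_index := 1;
          END IF;
        END LOOP;

        new_about := new_about || remaining_list(sentence_index) || ' ';
        remaining_list.DELETE(sentence_index);
      END LOOP;

      RETURN rtrim(new_about);
    END generate_about;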
Copy the Collection using the TABLE() function
If we make a copy of the master collection by selecting from it with the TABLE() function and ordering by DBMS_RANDOM.VALUE, we get a randomized copy. Also, if we use FETCH NEXT we can pull back only the number of rows we need.
    FUNCTION generate_about RETURN CLOB IS
      new_about             CLOB;
      num_sentences         NUMBER;
      unused_sentences_list about_sentences_type;
      l_idx                 NUMBER;
    BEGIN
      /* Will have between 3 and 10 sentences */
      num_sentences := floor(DBMS_RANDOM.VALUE(3, 11));

      /* copy the chosen number of rows at random from the sentences list */
      SELECT CAST(
               MULTISET(
                 SELECT *
                   FROM TABLE(about_sentences_list)
                  ORDER BY DBMS_RANDOM.VALUE
                  FETCH NEXT num_sentences ROWS ONLY
               ) AS about_sentences_type)
        INTO unused_sentences_list
        FROM dual;

      l_idx := unused_sentences_list.FIRST;
      WHILE (l_idx IS NOT NULL) LOOP
        new_about := new_about || unused_sentences_list(l_idx) || ' ';
        l_idx := unused_sentences_list.NEXT(l_idx);
      END LOOP;

      RETURN RTRIM(new_about);
    END generate_about;
Cool, but…
We can do better
Let’s get rid of the CAST
    SELECT *
      BULK COLLECT INTO unused_sentences_list
      FROM TABLE(about_sentences_list)
     ORDER BY DBMS_RANDOM.VALUE
     FETCH NEXT num_sentences ROWS ONLY;
Now, let’s aggregate with LISTAGG and get rid of the new collection and the loop that goes with it.
    FUNCTION generate_about RETURN CLOB IS
      new_about     CLOB;
      num_sentences NUMBER;
    BEGIN
      /* Will have between 3 and 10 sentences */
      num_sentences := floor(DBMS_RANDOM.VALUE(3, 11));

      SELECT LISTAGG(column_value, ' ') WITHIN GROUP (ORDER BY NULL)
        INTO new_about
        FROM (SELECT column_value
                FROM TABLE(about_sentences_list)
               ORDER BY DBMS_RANDOM.VALUE
               FETCH NEXT num_sentences ROWS ONLY);

      RETURN new_about;
    END generate_about;
Note that we used a nested SELECT statement. This lets us keep using FETCH NEXT to limit the number of sentences we send to LISTAGG.
We could have selected directly from the sentences table, but that would mean an extra trip back to the database for every row, which can add up when generating 10,000 rows.
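For comparison, that direct version would look something like this inside generate_about; it works, but every call queries dd_sentences instead of the collection already sitting in memory:

    -- Hypothetical alternative: aggregate straight from the table on every call.
    SELECT LISTAGG(sentence_text, ' ') WITHIN GROUP (ORDER BY NULL)
      INTO new_about
      FROM (SELECT sentence_text
              FROM dd_sentences
             ORDER BY DBMS_RANDOM.VALUE
             FETCH NEXT num_sentences ROWS ONLY);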
In the end
We added a small table of sentences, removed an extra function, and generated semi-coherent-sounding profiles.
And we look cool doing it.