A solution for the known <math>S</math> size: while the size of <math>S'</math> is less than <math>k</math>, look at the first element of <math>S</math>. On each iteration, admit that element into <math>S'</math> with probability <math>\frac{k - \text{size}(S')}{\text{size}(S)}</math>. Then remove the first element from <math>S</math>, so the ratio updates accordingly.

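The step above can be sketched in C++; the function and variable names here (e.g. <code>sample_known_size</code>) are illustrative, not from the original. When the remaining size equals the number still needed, the admission probability is 1, so the result always has exactly <math>k</math> elements.

```cpp
#include <random>
#include <vector>

// Selection sampling over a set S of known size: admit each element with
// probability (k - size(S')) / (remaining size of S). Names are illustrative.
std::vector<int> sample_known_size(const std::vector<int>& S, int k,
                                   std::mt19937& gen) {
    std::vector<int> picked;                        // plays the role of S'
    int remaining = static_cast<int>(S.size());
    for (int x : S) {
        int needed = k - static_cast<int>(picked.size());
        if (needed == 0) break;                     // S' already has k elements
        // admit x with probability needed / remaining
        std::uniform_int_distribution<int> d(0, remaining - 1);
        if (d(gen) < needed) picked.push_back(x);
        --remaining;                                // x is "removed" from S
    }
    return picked;
}
```

Because elements are examined in order, the sample comes back in the same order they appear in <math>S</math>.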
Whether we know <math>n</math> or not, we can sample <math>k</math> values uniformly by assigning a random value to each element, sorting, and taking the top <math>k</math> values. However, this is not very efficient, so instead we can use a <i>priority queue</i> of size <math>k</math> with random priorities.

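A minimal sketch of that priority-queue variant (one uniform random priority per element, keeping the <math>k</math> smallest; names are illustrative):

```cpp
#include <queue>
#include <random>
#include <utility>
#include <vector>

// Keep the k elements with the smallest random priorities; this works even
// when n is not known in advance. Names are illustrative.
std::vector<int> sample_with_heap(const std::vector<int>& S, int k,
                                  std::mt19937& gen) {
    std::uniform_real_distribution<double> pri(0.0, 1.0);
    // max-heap on priority: pop() evicts the largest priority seen so far,
    // so the k entries that survive have the k smallest priorities
    std::priority_queue<std::pair<double, int>> pq;
    for (int x : S) {
        pq.push({pri(gen), x});
        if (static_cast<int>(pq.size()) > k) pq.pop();
    }
    std::vector<int> out;
    while (!pq.empty()) {
        out.push_back(pq.top().second);
        pq.pop();
    }
    return out;
}
```

Each element costs <math>O(\lg k)</math> heap work, versus <math>O(n \lg n)</math> for sorting all <math>n</math> random values.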
Alternatively, we can iterate over the elements of <math>S</math> from <math>i = 0</math> until <math>S</math> is exhausted. While <math>i < k</math>, set <math>S'[i] = S[i]</math>. After that, for each <math>i</math>, replace a uniformly random element of <math>S'</math> with <math>S[i]</math> with probability <math>\frac{k}{i + 1}</math>, and otherwise skip to the next element.

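A sketch of that replacement scheme, i.e. standard reservoir sampling (names are illustrative):

```cpp
#include <random>
#include <vector>

// Reservoir sampling: keep the first k elements, then replace a random slot
// with element i (0-indexed) with probability k / (i + 1). Drawing j
// uniformly from [0, i] and overwriting slot j whenever j < k realizes
// exactly that probability. Names are illustrative.
std::vector<int> reservoir_sample(const std::vector<int>& S, int k,
                                  std::mt19937& gen) {
    std::vector<int> res;                       // the reservoir S'
    for (int i = 0; i < static_cast<int>(S.size()); ++i) {
        if (i < k) {
            res.push_back(S[i]);                // fill the reservoir first
            continue;
        }
        std::uniform_int_distribution<int> d(0, i);
        int j = d(gen);
        if (j < k) res[j] = S[i];               // replace a random slot
    }
    return res;
}
```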
_______________________________________________________________________________________________________________________

The solution above for the known <math>n</math> is incorrect. There is a chance that the final set will not have <math>k</math> elements, and even if you change it to guarantee <math>k</math> elements, the probabilities are not uniform. My solution is based on permutations and combinations. There are <math>{n \choose k}</math> subsets of <math>k</math> numbers from a set of <math>n</math>. The idea is to pick a random integer between 1 and <math>{n \choose k}</math> and generate the corresponding subset. Once we have the subset, we go through our set of <math>n</math> elements, pick the corresponding elements, and return them. Below is the implementation of the subset-generation algorithm (in C++), since all the other steps are trivial.

The subsets of <math>1..n</math> are ordered by their inclusion of 1, then 2, then 3, and so on. So the first <math>{n-1 \choose k-1}</math> subsets include 1 and the rest don't. Among those that include 1, the first <math>{n-2 \choose k-2}</math> include 2 and the rest don't, while among those that don't include 1, the first <math>{n-2 \choose k-1}</math> include 2 and the rest don't. Here's an example ordering for <math>n = 5</math>, <math>k = 3</math>: {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}.

OK, now that you get the idea, let me explain the code below. The code finds the "number"-th subset. In each iteration of the while loop we determine whether number <math>i</math> belongs in the subset, following the logic above. Say we have already picked some numbers less than <math>i</math> for our subset, so we now have to pick <math>m</math> more numbers from <math>i</math> to <math>n</math>. If the subset number is greater than <math>{n-i \choose m-1}</math>, we skip <math>i</math> and subtract <math>{n-i \choose m-1}</math> from number, since that is how many subsets we skipped over by skipping <math>i</math>. Otherwise, we include <math>i</math> and decrement the remaining count (the variable <code>k</code> in the code), since we now need one fewer element in our subset.

 long long n_choose_k (int n, int k);             // declared here, implemented elsewhere
 
 std::vector<int> generate_permutation_without_table (int n, int k, long long number) {
     std::vector<int> res;
     if (number < 1 || number > n_choose_k(n, k)) {
         return res;                              // rank out of range: return empty
     }
     int i = 1;                                   // candidate element, counting up from 1
     while (k > 0) {
         long long clc = n_choose_k(n - 1, k - 1); // remaining subsets that include i
         if (number > clc) {
             --n;                                 // skip i ...
             ++i;
             number -= clc;                       // ... discarding the clc subsets containing it
         } else {
             --k;                                 // include i: one fewer element to choose
             --n;
             res.push_back(i++);
         }
     }
     return res;
 }

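The listing above leaves <code>n_choose_k</code> as a prototype. One minimal implementation (an assumed sketch, not part of the original) uses the product form, whose intermediate divisions are always exact:

```cpp
// Computes C(n, k) as the product of (n-k+i) / i for i = 1..k; after step i
// the running value equals C(n-k+i, i), so every division is exact.
// Overflows long long for large inputs; this sketch does not guard that.
long long n_choose_k(int n, int k) {
    if (k < 0 || k > n) return 0;
    if (k > n - k) k = n - k;        // use the smaller of k and n - k
    long long res = 1;
    for (int i = 1; i <= k; ++i) {
        res = res * (n - k + i) / i;
    }
    return res;
}
```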
This chapter is about time complexity, so let's get to that. The worst-case time complexity of the priority-queue method given by the previous author is <math>O(n \lg k)</math>, because each insertion into a priority queue of <math>k</math> elements takes <math>O(\lg k)</math> time. My method has inferior time complexity: in general, calculating n_choose_k takes linear time in <math>n</math>, and the while loop runs at most <math>n</math> times, so the whole method is <math>O(n^2)</math>. But there is a linear way of doing this! (You might ask why I didn't start with that; it's because I like subsets, permutations, and orderings.)

The idea is to generate <math>k</math> random numbers, each indexing into what remains of the set, and for each number take the element at that index. But what if all <math>k</math> of your numbers are 1? You have to remove each selected element from the set before picking the next random number. Removal is a linear operation in arrays, which makes the whole thing quadratic, but using an int vector makes it linear.

Take an array of <math>n</math> integers, all initialized to 0. Generate <math>k</math> random numbers between 1 and <math>n</math> (the <math>i</math>-th number is drawn between 1 and <math>n - i + 1</math>), and for each number increment that index of the array. Then walk the array from beginning to end maintaining a sum variable: at each element, add the element's value to the sum, select the index if the sum is positive, then decrement the sum (but don't go below 0). Here's an example for <math>n = 6</math>, <math>k = 4</math>:

 set = 1, 2, 3, 4, 5, 6
 random numbers picked = 5, 5, 1, 1
 initial int vector = 0, 0, 0, 0, 0, 0
 int vector after applying random numbers = 2, 0, 0, 0, 2, 0
 sum variable tracked over the int vector = 2, 1, 0, 0, 2, 1
 selected numbers (subset) = 1, 2, 5, 6

C++ below.

 #include <random>
 #include <iostream>
 #include <vector>
 
 using namespace std;
 
 typedef std::uniform_int_distribution<int>::param_type dist_param;
 
 vector<int> pick_subset(int k, int n)
 {
     vector<int> result;
     result.reserve(k);                 // result holds the k selected elements
     random_device rd;                  // used to obtain a seed for the random number engine
     mt19937 gen(rd());                 // standard mersenne_twister_engine seeded with rd()
     uniform_int_distribution<int> dis(1, n);
     vector<int> int_field(n, 0);       // int vector initialized with n zeros
     for (int i = 0; i < k; ++i) {
         int picked = dis(gen, dist_param(1, n - i)); // pick a number between 1 and n - i
         int_field[picked - 1]++;       // C++ indexes start at 0 while the algorithm's start at 1
     }
     int sum = 0;
     for (int i = 0; i < n; ++i) {
         sum += int_field[i];
         if (sum > 0) {
             result.push_back(i + 1);   // again the 0 vs 1 offset
         }
         sum = sum == 0 ? 0 : sum - 1;
     }
     return result;
 }

Both of the algorithms presented here are provably correct, but I'm not going to write a proof here. Clearly the last algorithm is the fastest.

And I'm tpourjalali@gmail.com (sorry, I feel like I should take credit :)).

____________________________________________________________________________________________________________________________________________________

Return to [[Algo-analysis-TADM2E]] ...